By capturing and documenting useful, accurate and relevant facts about your data, you make it more useful to your future self and to others. Documentation adds valuable context and depth to your data and acts as a reminder of what you did. Later on, if you choose to publish your data, good documentation will improve the visibility and value of your research to others. All of this contributes significantly to the Findability and Reusability of your data.
When considering what to document, there is no universal approach - each research project will examine something new; each discipline will have things that are considered useful or critical. But despite differences from field to field, there is one general thing to remember about documenting and describing your data:
The following is a non-exhaustive list of elements to consider describing and documenting as you collect your data. You should think through each section and consider if that element is relevant to your research data, or if it could be useful to future researchers.
A name of the dataset or the name of the project.
Names and contact details of the organisations or people who created the data and their unique identifiers (ORCID, ResearcherID, etc.).
A unique number used to identify the data (DOI, Handle, IGSN).
Any key dates associated with the data. This may include the project start and end dates, the time period covered, or any other important dates associated with the data.
Information on how the data has been transformed, altered or processed (e.g. normalisation).
Citations to any data used in the research obtained or derived from other sources. At minimum it should include the creator, the year, the title of the dataset and some form of access information.
Keywords or phrases describing the subject or content of the data. This may also include Field of Research (FOR) codes and Socio-economic objective (SEO) codes.
Descriptions of relevant geographic information. This could be city names, region names, countries or more precise geographic coordinates.
A list of all variables in the data files, where applicable. This could also be captured in a codebook.
Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. ‘999 indicates a missing value in the data’).
A list of all the files that make up the dataset, including extensions (e.g., ‘photo1023.jpeg’, ‘participant12.pdf’).
File formats of the data (e.g., SPSS, HTML, PDF, GeoTIFF or JPEG).
Organisation of the data file(s) and layout of the variables, where applicable.
Information on the different versions of the dataset that exist, if relevant.
Names and version numbers of any special-purpose software packages required to use, create, view, or analyse the data.
Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data.
Where and how the data can be accessed.
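One way to keep track of these elements while you collect your data is to record them in a small structured form and check which ones are still empty. The sketch below is only an illustration: the class name `DatasetDescription`, its field names and the example values are all assumptions made for this example, and the fields cover only a handful of the elements listed above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetDescription:
    """A hypothetical record covering a few of the elements listed above."""
    title: str = ""
    creators: list = field(default_factory=list)
    identifier: str = ""
    dates: str = ""
    keywords: list = field(default_factory=list)
    file_formats: list = field(default_factory=list)
    rights: str = ""
    access: str = ""


def missing_elements(description):
    """Return the names of the elements that have not been filled in yet."""
    return [f.name for f in fields(description) if not getattr(description, f.name)]


# Example values are invented for illustration.
draft = DatasetDescription(title="Example coastal survey", keywords=["erosion"])
print(missing_elements(draft))
```

Running a check like this at intervals during a project is a simple reminder of what still needs to be documented. Not every element will be relevant to every dataset, so an empty field is a prompt to think, not necessarily an error.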
One way that data documentation commonly appears is in structured descriptions of the data - this is known as metadata. Metadata is often defined literally as “data about data” and refers to the information used to describe the attributes of a resource in a standardised format. Metadata is often collected into one central location, so researchers can search in one place to find datasets that will help their research. By including useful and accurate information in your metadata and data description, other researchers will be more likely to find and reuse your dataset.
Many disciplines have their own specific ways of structuring metadata - these specific structures are called schemas. A schema lists what information you’ll need to include about your data and how that information should be structured. Below are some examples of schemas.
Dublin Core (DC)
Metadata Object Description Schema (MODS)
Metadata Encoding and Transmission Standard (METS)
Categories for the Description of Works of Art (CDWA)
Visual Resources Association (VRA Core)
Astronomy Visualization Metadata (AVM)
Ecological Metadata Language (EML)
Content Standard for Digital Geospatial Metadata (CSDGM)
Data Documentation Initiative (DDI)
An introductory guide to the various aims, objectives and concepts of metadata.
A basic introduction to metadata.
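To give a rough feel for what "structured" means here, the sketch below writes a few descriptive facts as Dublin Core elements in XML using Python's standard library. The element names (`title`, `creator`, `subject`, `date`, `format`) and the namespace come from the Dublin Core element set; the `metadata` wrapper element, the function name `to_dublin_core` and the example values are assumptions for illustration only. A real repository will usually specify its own wrapper and required fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(record):
    """Serialise a dict of {element name: [values]} as simple Dublin Core XML."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Example values are invented for illustration.
record = {
    "title": ["Example coastal survey"],
    "creator": ["Researcher, A."],
    "subject": ["erosion", "coastline"],
    "date": ["2020"],
    "format": ["text/csv"],
}
xml_text = to_dublin_core(record)
print(xml_text)
```

Because every record uses the same element names, a central catalogue can search across many datasets at once, which is what makes schema-based metadata useful for discovery.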
A readme.txt file is a collection of very simple metadata provided alongside a dataset when researchers publish their data. It describes key details of the dataset for end users who somehow access the dataset without seeing or finding the metadata beforehand. Files like these help make published data more robust and improve the long-term usability of the data.
If you publish your data through Curtin, the library will automatically create a readme.txt file for you from the submitted information. If you are publishing in a subject or discipline specific repository and would like help creating a useful readme.txt, please email us at ResearchData@curtin.edu.au.
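If you are preparing a readme.txt yourself, it can be as simple as a list of labelled lines. The sketch below is a minimal illustration of that idea; it is not the process the library uses, and the function name `build_readme`, the labels and the example values are all assumptions chosen for this example.

```python
def build_readme(details):
    """Turn a dict of {label: value} into plain readme text, one item per line."""
    lines = []
    for label, value in details.items():
        # Lists (for example, file names) are joined onto a single line.
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Example values are invented for illustration.
details = {
    "Title": "Example coastal survey",
    "Creator": "Researcher, A.",
    "Files": ["survey.csv", "codebook.pdf"],
    "Missing value code": "999",
    "Licence": "CC BY 4.0",
}
readme_text = build_readme(details)
print(readme_text)
```

The resulting text can be saved as readme.txt and kept in the same folder as the data files, so the description travels with the dataset wherever it is copied.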
Codebooks are documents that help interpret any abbreviations, standard signifiers or codes used when entering data into a dataset. They explain what variable names refer to, what values you should expect in each field and what those values correspond to. They are incredibly useful for anyone reusing your dataset later on - your first-hand experience will give you an excellent understanding of what your codes mean, but anyone else seeking to reuse your data might be completely confused. In fact, if you come back to your own dataset years later, the entries might even confuse you. Your codebook will clarify what the fields and entries mean and will allow others and your future self to use the data with confidence.
Guide to Codebooks [PDF, 3.45MB]
This guide from the ICPSR gives a number of descriptions, explanations and examples of codebooks.
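A codebook can also be kept in a machine-readable form so that coded values can be translated automatically. The sketch below assumes a hypothetical dataset with two variables, `sex` and `age`, and uses 999 as the missing-value code, as in the example earlier on this page. The variable names, labels, value codes and the function name `decode_row` are all invented for illustration.

```python
# A hypothetical codebook: one entry per variable in the data file.
CODEBOOK = {
    "sex": {
        "label": "Sex of participant",
        "values": {1: "female", 2: "male"},
        "missing": 999,
    },
    "age": {
        "label": "Age in years",
        "values": None,  # free numeric value, no coded categories
        "missing": 999,
    },
}


def decode_row(row, codebook):
    """Replace coded values with their meanings and missing codes with None."""
    decoded = {}
    for name, raw in row.items():
        entry = codebook[name]
        if raw == entry["missing"]:
            decoded[name] = None
        elif entry["values"]:
            # Unknown codes are passed through unchanged rather than guessed.
            decoded[name] = entry["values"].get(raw, raw)
        else:
            decoded[name] = raw
    return decoded


row = {"sex": 2, "age": 999}
print(decode_row(row, CODEBOOK))  # {'sex': 'male', 'age': None}
```

Without the codebook, a reader could easily mistake the 999 in the `age` column for a real measurement, which is exactly the kind of confusion a codebook is meant to prevent.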
Data ownership refers to the intellectual property rights over the data created through research, and may also define ongoing roles around data management and use. Ownership of research is a complex issue that may involve the principal investigator, the sponsoring institution, the funding agency, and any participating human subjects. Clarifying data ownership and intellectual property rights is an important part of data management as this will ultimately decide who has control and rights over the data and can influence how the research data is managed, how it can be reused in the future and who has responsibility for these issues.
Due to complications around research funding agreements, collaborative projects, ethical guidelines, shared datasets and institutional policies, data ownership can be confusing. If there are no formal agreements or guidelines, you should clarify the ownership of the data and the implications as soon as possible and keep this information in writing, the same way you would with an authorship agreement. These discussions could include parties such as: