Research data management

Documenting research

Capturing and documenting useful, accurate and relevant facts about your data makes it more useful to your future self and to others. Documentation adds valuable context and depth to your data and acts as a reminder of what you did. If you later choose to publish your data, good documentation will improve the visibility and value of your research to others. All of this contributes significantly to the Findability and Reusability of your data.

When considering what to document, there is no universal approach - each research project will examine something new; each discipline will have things that are considered useful or critical. But despite differences from field to field, there is one general thing to remember about documenting and describing your data:

Maximise documentation - Your own research problem will guide what data you collect. However, by capturing other information at the same time, your data can become far more useful to future researchers. It may be that the context in which you are making your observations will never exist again! It may also be that something you capture unexpectedly becomes critical to your own research later on. You’ll need to balance your available time and resources against the possibilities, but if you can easily capture something that could inform future research, you should do so.

The following is a non-exhaustive list of elements to consider describing and documenting as you collect your data. You should think through each section and consider if that element is relevant to your research data, or if it could be useful to future researchers.

Title - A name of the dataset or the name of the project.

Creator/s - Names and contact details of the organisations or people who created the data and their unique identifiers (ORCID, ResearcherID, etc).

Identifier - A unique number or code used to identify the data (DOI, Handle, IGSN).

Date - Any key dates associated with the data. This may include project start and end dates, time period or any other important dates associated with the data.

Method - Information on how the data was generated, such as specific equipment or software used (including model and version numbers), formulae, algorithms or methodologies.
You might include an electronic lab notebook to aid this element.

Processing - Information on how the data has been transformed, altered or processed (e.g. normalisation).

Source - Citations to any data used in the research that was obtained or derived from other sources. At minimum, each citation should include the creator, the year, the title of the dataset and some form of access information.

Subjects - Keywords or phrases describing the subject or content of the data. This may also include Field of Research (FOR) codes and Socio-economic objective (SEO) codes.

Location - Descriptions of relevant geographic information. This could be city names, region names, countries or more precise geographic coordinates.

Variable list - A list of all variables in the data files, where applicable. This could also be captured in a codebook.

Code list - Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. ‘999 indicates a missing value in the data’).

File inventory - A list of all the files that make up the dataset, including extensions (e.g. ‘photo1023.jpeg’, ‘participant12.pdf’).

File Formats - File formats of the data (e.g. SPSS, HTML, PDF, GeoTIFF or JPEG).

File structure - Organisation of the data file(s) and layout of the variables, where applicable.

Version - Information on the different versions of the dataset that exist, if relevant.

Software - Names and version numbers of any special-purpose software packages required to use, create, view, or analyse the data.

Rights - Any known intellectual property rights, statutory rights, licences, or restrictions on use of the data.

Access information - Where and how the data can be accessed.
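One way to keep track of these elements while you collect data is to hold them in a small structured record. The following is only a minimal sketch in Python, not a prescribed format: the class name DatasetDescription, its field names and the example values are all hypothetical, and it covers just a subset of the elements listed above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetDescription:
    """Hypothetical record for a subset of the documentation elements."""
    title: str
    creators: list[str]
    identifier: str = ""                      # e.g. a DOI, once assigned
    dates: dict[str, str] = field(default_factory=dict)
    method: str = ""
    processing: str = ""
    keywords: list[str] = field(default_factory=list)
    file_inventory: list[str] = field(default_factory=list)
    rights: str = ""
    access: str = ""

    def missing_elements(self) -> list[str]:
        """Return the names of elements that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values for illustration only.
record = DatasetDescription(
    title="Example coastal survey",
    creators=["A. Researcher"],
    dates={"start": "2023-01-01", "end": "2023-06-30"},
    file_inventory=["photo1023.jpeg", "participant12.pdf"],
)

# Save or print the record as JSON, and list what is still undocumented.
print(json.dumps(asdict(record), indent=2))
print("Still to document:", record.missing_elements())
```

A record like this can be saved alongside the data files and updated as the project progresses; the list of empty elements acts as a simple reminder of what still needs describing.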

Metadata

Data documentation commonly takes the form of structured descriptions of the data - this is known as metadata. Metadata is often defined literally as “data about data” and refers to the information used to describe the attributes of a resource in a standardised format. Metadata is often collected into one central location, so researchers can search in one place to find datasets that will help their research. By including useful and accurate information in your metadata and data description, you make it more likely that other researchers will find and reuse your dataset.

Many disciplines have their own specific ways of structuring metadata - these specific structures are called schemas. A schema lists what information you’ll need to include about your data and how that information should be structured. Below are some examples of schemas.

Discipline	Metadata standards
General	Dublin Core (DC); Metadata Object Description Schema (MODS); Metadata Encoding and Transmission Standard (METS)
Arts	Categories for the Description of Works of Art (CDWA); Visual Resources Association (VRA Core)
Astronomy	Astronomy Visualization Metadata (AVM)
Biology	Darwin Core
Ecology	Ecological Metadata Language (EML)
Geographic	Content Standard for Digital Geospatial Metadata (CSDGM)
Social sciences	Data Documentation Initiative (DDI)
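To give a sense of what schema-based metadata can look like, the sketch below writes a handful of Dublin Core elements as XML using Python's standard library. It is an illustration only: the wrapper element name "metadata" and all of the example values are assumptions, and a real repository will have its own requirements for which elements are needed and how they are packaged.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields):
    """Build a wrapper element with one dc:* child per (name, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Placeholder values for illustration only.
fields = [
    ("title", "Example coastal survey"),
    ("creator", "A. Researcher"),
    ("date", "2023-06-30"),
    ("subject", "coastal erosion"),
    ("format", "image/jpeg"),
    ("rights", "CC BY 4.0"),
]

xml_text = ET.tostring(to_dublin_core(fields), encoding="unicode")
print(xml_text)
```

Because each element name comes from a shared schema, a catalogue that understands Dublin Core can index the title, creator and subject of the dataset without knowing anything else about the project.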

Metadata (Jisc)
An introductory guide to the various aims, objectives and concepts of metadata.

Metadata
A basic introduction to metadata.

Readme.txt files and codebooks

A readme.txt file is a collection of very simple metadata provided alongside a dataset when researchers publish their data. It describes key details of the dataset for end users who access the dataset without having seen the metadata first. Files like these make published data more robust and improve its long-term usability.

If you publish your data through Curtin, the library will automatically create a readme.txt file for you from the submitted information. If you are publishing in a subject or discipline specific repository and would like help creating a useful readme.txt, please email us at ResearchData@curtin.edu.au.

Example readme.txt file [TXT, 3kB]
This text file is an example of a readme.txt file used in data publication.
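If you are preparing a readme.txt yourself, it can be generated from the details you have already recorded. The following is a rough sketch rather than a required layout: the function name build_readme, the chosen labels and the example values are hypothetical, and it is not the format the library produces.

```python
def build_readme(details: dict[str, str]) -> str:
    """Format dataset details as aligned 'Label : value' lines."""
    width = max(len(label) for label in details)
    lines = [f"{label.ljust(width)} : {value}" for label, value in details.items()]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
details = {
    "Title": "Example coastal survey",
    "Creator": "A. Researcher",
    "Files": "photo1023.jpeg, participant12.pdf",
    "Missing values": "999 indicates a missing value in the data",
}

readme_text = build_readme(details)
print(readme_text)
```

The resulting text can then be saved as readme.txt in the top-level folder of the dataset, so that it travels with the data files wherever they are copied.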

Codebooks are documents that help interpret any abbreviations, standard signifiers or codes used when entering data into a dataset. They explain what variable names refer to, what values you should expect in each field and what those values correspond to. They are extremely useful for anyone reusing your dataset later on - your first-hand experience will give you an excellent understanding of what your codes mean, but anyone else seeking to reuse your data might be completely confused. In fact, if you come back to your own dataset years later, the entries might even confuse you. Your codebook will clarify what the fields and entries mean and will allow others and your future self to use the data with confidence.

When constructing a codebook, make sure you include:

A description of all the variables (this may include units of measurement, date formats or transcriptions of interview questions).
Full explanations of what all the entered coded values mean.
A description for any null values entered (if they exist).
Identification and explanation of any missing data.
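The elements above can also be kept in a machine-readable form, so the codebook can be used to decode the data directly. This is a minimal sketch under stated assumptions: the variable names, coded values and the decode function are hypothetical, and the missing-value code 999 simply follows the example given earlier in this guide.

```python
# Hypothetical codebook: one entry per variable in the dataset.
codebook = {
    "age": {
        "description": "Age of participant at interview",
        "units": "years",
        "missing_codes": {999},
    },
    "consent": {
        "description": "Consent to follow-up contact",
        "values": {1: "yes", 2: "no"},
        "missing_codes": {999},
    },
}


def decode(variable, raw):
    """Translate a raw entry using the codebook; missing codes become None."""
    entry = codebook[variable]
    if raw in entry.get("missing_codes", set()):
        return None
    # Coded variables map to labels; uncoded values pass through unchanged.
    return entry.get("values", {}).get(raw, raw)


# Placeholder rows for illustration only.
rows = [{"age": 34, "consent": 1}, {"age": 999, "consent": 2}]
decoded = [
    {name: decode(name, value) for name, value in row.items()}
    for row in rows
]
print(decoded)
```

Keeping the descriptions, coded values and missing-value codes in one place means the same codebook serves both as documentation for readers and as a lookup table for analysis scripts.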

Guide to Codebooks [PDF, 3.45MB]
This guide from the ICPSR gives a number of descriptions, explanations and examples of codebooks.

Ownership

Data ownership refers to the intellectual property rights over the data created through research, and may also define ongoing roles around data management and use. Ownership of research is a complex issue that may involve the principal investigator, the sponsoring institution, the funding agency, and any participating human subjects. Clarifying data ownership and intellectual property rights is an important part of data management as this will ultimately decide who has control and rights over the data and can influence how the research data is managed, how it can be reused in the future and who has responsibility for these issues.

Due to complications around research funding agreements, collaborative projects, ethical guidelines, shared datasets and institutional policies, data ownership can be confusing. If there are no formal agreements or guidelines, you should clarify the ownership of the data and the implications as soon as possible and keep this information in writing, the same way you would with an authorship agreement. These discussions could include parties such as:

Funding bodies
HDR Supervisors
Principal Investigators
Co-authors
Project members
Other collaborating institutions/research bodies

In general, Curtin students retain ownership of their data, as outlined in the Intellectual Property Policy. Curtin staff should refer to the Intellectual Property Policy, the Intellectual Property Procedures and the Research Data and Primary Materials Policy. Any staff who are considering making their code openly licensed should consult with the Curtin Commercialisation team and the Curtin Intellectual Property Policy and Procedures.

Any researcher conducting research in collaboration with an organisation external to Curtin should obtain a written agreement outlining the ownership of the research data. This agreement may also include details around particular storage and access requirements and who is responsible for meeting those requirements.

Curtin Commercialisation team

The Commercialisation team works closely with Curtin researchers who develop novel concepts or inventions, helping to assess their commercial viability and the best method of bringing them to market.
