Collecting and organizing data

Research data can be gathered through observation, manual or automatic measurements in the laboratory or in the field, with remote sensing techniques, by interviews, by modelling and simulation, etc. Data can also be stored in many formats.

Considering what data will be collected, generated, and/or reused, and how you will organize them, is an important part of Research Data Management.

File formats

What is a file format?

A file format describes how information is stored within a digital file. Although each file format is unique, different file formats exist for similar types of information (e.g. text can be stored in a plain text file as well as in a Word file).

On most computer systems, the format of a file is indicated by the ‘extension’ in the filename (e.g. .txt, .csv). The extension provides an immediate clue about the type of data within a file. For example, we expect that a file with a .jpg extension is an image, whereas a .docx should contain formatted text.
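As a small illustration of how software can read this clue, the sketch below extracts the extension from a file name using Python's standard library. The helper name file_extension and the example file names are hypothetical. Keep in mind that the extension is only a hint: it does not guarantee what the file actually contains.

```python
from pathlib import Path


def file_extension(name: str) -> str:
    """Return the lower-cased extension of a file name, or '' if there is none."""
    return Path(name).suffix.lower()


print(file_extension("results.csv"))  # .csv
print(file_extension("Photo.JPG"))    # .jpg
print(file_extension("README"))       # (empty string: no extension)
```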

The difference between file formats is situated at the following levels:

  • Simple vs complex formats: e.g. the .txt format is a very simple way of storing text, while a .docx file has more complex properties.
  • Open vs closed file formats: closed (or proprietary) file formats cannot be freely used or implemented. They are often owned by companies or protected by patents. Open formats can be used and implemented by anyone.

Which file formats to use?

The choice of file formats to use for research data depends on:

  • Discipline-specific standards and customs
  • Planned data analyses
  • Software availability/cost
  • Hardware used – e.g. audio capture, fMRI scanner

Risks

Using a specific format can hold risks. For instance, formats that can only be opened with specific software make the data vulnerable to obsolescence of that software. This can lead to situations where you are locked out of your own data.

Also, converting data from one format to another can lead to problems of losing metadata or formatting. Therefore, it is good practice to plan your choice of formats with long-term access in the back of your mind.

Best practices

To offer the best long-term guarantees in terms of usability, accessibility and sustainability, file formats should have the following characteristics:

  • Non-proprietary (not protected by trademark, patent or copyright)
  • Open, documented standard
  • Common usage by research community
  • Standard representation (ASCII, Unicode)
  • Lossless compression (as opposed to lossy compression)
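To make the last characteristic concrete: lossless compression means that the original bytes can be restored exactly. The sketch below demonstrates this with zlib from Python's standard library. The sample data is made up for illustration, and zlib is just one example of a lossless method, not a format recommendation.

```python
import zlib

# Hypothetical sample data: a small, repetitive CSV-like table.
original = b"sample_id,value\nPOLY03,0.42\nPOLY08,0.57\n" * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(compressed) < len(original))  # True: the data takes less space
print(restored == original)             # True: nothing was lost
```

With lossy compression (common for some image, audio and video formats), the second check would fail by design: part of the information is discarded to save space and cannot be recovered.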

Recommended formats

Examples of recommended file formats for different types of data can be found via:

File naming

A file name is the principal identifier of a file. Therefore, good file names should:

  • Provide useful cues to content, status and version
  • Uniquely identify a file
  • Help to classify and sort files

As such, file names that reflect the content of the file will facilitate searching, discovering and understanding of the data.

File name elements

File names can be constructed using the following elements:

  • Project acronym
  • Content description
  • Date
  • Location
  • Creator name/initials
  • Status information (e.g. draft or final)
  • ….

Best practices

When creating a file name try to:

  • Give a unique name.
  • Use elements essential to identify the file.
  • Avoid long names, remove unnecessary elements.
  • For dates, use the ISO 8601 standard (e.g. YYYYMMDD). This will keep the files sorted chronologically.
  • For versioning via filename, use ascending, decimal version numbers.
  • Try to only use alphanumerical characters (A-Z, a-z, 0-9).
  • Do not use special characters like \ / : * ? " < > | ! % & - ; = () + , .
  • Do not use spaces. Use an underscore ("_") for separation.
  • Do not alter or remove the extension of a file (e.g. .txt, .sav, .mp4, .docx).
  • Be consistent in how you build up names and make sure there is consensus among all team/project members.
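Several of these tips can be applied automatically. The following is a minimal sketch, assuming a naming scheme of project, content description, date and version; the function name build_file_name, the order of the elements and the example values are all hypothetical and should be adapted to the convention agreed within your team.

```python
import re
from datetime import date


def build_file_name(project: str, content: str, day: date, version: int, extension: str) -> str:
    """Join name elements with underscores, using a YYYYMMDD date and a zero-padded version."""
    parts = [project, content, day.strftime("%Y%m%d"), f"v{version:02d}"]
    # Keep only A-Z, a-z and 0-9 inside each element (drops spaces and special characters).
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every element must contain at least one alphanumeric character")
    # The extension is appended unchanged, apart from a leading dot being normalised.
    return "_".join(cleaned) + "." + extension.lstrip(".")


name = build_file_name("DEMO", "Survey results", date(2021, 5, 4), 1, ".csv")
print(name)  # DEMO_Surveyresults_20210504_v01.csv
```

Generating names with one shared helper like this is one way to keep them consistent across all project members.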

Examples

Some examples of good file names are:

CONS_INT1_12-03-2019.rtf

Result from Interview 1 of the Consumers research, dated 12-03-2019

GC-MS1_20180912_POLY03.ms

Polymer 3 measured on GC-MS machine 1 on September 12, 2018

GC-MS2_20180914_POLY08.ms

Polymer 8 measured on GC-MS machine 2 on September 14, 2018

GC-MS2_20180914_POLY08.pptx

Chromatogram of the polymer 8 measurement, shown in a PowerPoint presentation with all relevant peaks labeled
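A benefit of consistent names is that they can be read by a script as well as by people. The sketch below splits one of the GC-MS names above back into its elements. It assumes the pattern INSTRUMENT_YYYYMMDD_SAMPLE.ext; the function name parse_file_name and the dictionary keys are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_file_name(name: str) -> dict:
    """Split an INSTRUMENT_YYYYMMDD_SAMPLE.ext file name into its elements."""
    path = Path(name)
    # Raises ValueError if the name does not have exactly three underscore-separated parts.
    instrument, day, sample = path.stem.split("_")
    return {
        "instrument": instrument,
        "date": datetime.strptime(day, "%Y%m%d").date(),
        "sample": sample,
        "extension": path.suffix,
    }


info = parse_file_name("GC-MS2_20180914_POLY08.ms")
print(info["instrument"])        # GC-MS2
print(info["date"].isoformat())  # 2018-09-14
print(info["sample"])            # POLY08
```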

Folder/data organization

To store data in such a way that results can be reproduced and data can be re-used, one of the important challenges is working in a well-organized folder structure. Using a standard way of organizing research files has clear advantages, both for your daily work and when sharing data with colleagues or others.

Almost all research domains require specific ways of organizing and structuring the stored research data. Therefore, it is difficult to provide general guidelines. To demonstrate how the general principle of structuring research data can be implemented, we provide some examples. All examples are based on real studies.

Examples

Version control

When you work with different versions of a file, it can be a challenge to locate the 'correct' version or to know how versions differ from each other. If not done well, it can even be difficult to know which file preceded the other.

The matter is complicated even further when files are kept in multiple locations and multiple users edit them. To avoid confusion and safeguard against accidental loss, a versioning system can be put in place.

Different approaches can be taken to provide version information about a file: manual or automatic version control methods.

Manual methods

File names

File names are a simple way to manually give information about the version/status of a file. This can be done by:

  • Including a date in the file name, e.g.: HealthTest-2008-04-06.docx
  • Including a version number in the file name, e.g.: HealthTest-00-02.docx or HealthTest-v02.docx
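If you number versions in the file name, the next name can be derived from the names already in use. The following is a minimal sketch assuming the HealthTest-v02.docx style shown above; the function name next_version_name and the two-digit padding are assumptions made for illustration.

```python
import re


def next_version_name(existing: list[str], base: str, extension: str) -> str:
    """Return the next 'base-vNN.ext' name, given the file names already in use."""
    pattern = re.compile(rf"^{re.escape(base)}-v(\d+){re.escape(extension)}$")
    # Collect the version numbers of all names that follow the convention.
    versions = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    highest = max(versions, default=0)
    return f"{base}-v{highest + 1:02d}{extension}"


files = ["HealthTest-v01.docx", "HealthTest-v02.docx", "notes.txt"]
print(next_version_name(files, "HealthTest", ".docx"))  # HealthTest-v03.docx
```

Zero-padding the number (v02 rather than v2) keeps versions in the right order when files are sorted by name.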

Version history table

A version history table is a table kept either within the file itself or in a separate file (e.g. a file history, version control table or notes file). It is used to record versions, dates, authors and details of changes to files.

Example:

Version  What was changed?                         By whom                             When?
1        Initial draft                             Godfried Bomans                     12/05/2019
2        Revised Intro                             Godfried Bomans                     14/05/2019
3        Added Methodology                         Louis Paul Boon                     18/05/2019
4        Reviewed by promotor                      Matthaeus, Marcus, Lucas, Johannes  21/06/2019
5        Accepted changes V4; added final figures  Godfried Bomans                     26/06/2019
6        Final version for submission              Godfried Bomans                     03/07/2019
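When the version history is kept in a separate file, a plain CSV file is one simple option. The sketch below is an illustration only: the column names, the helper functions add_history_row and to_csv, and the choice to write dates in ISO 8601 form (YYYY-MM-DD) are assumptions, not a prescribed layout.

```python
import csv
import io

FIELDS = ["version", "change", "author", "date"]


def add_history_row(rows: list[dict], change: str, author: str, date: str) -> list[dict]:
    """Return a new history with one extra row; the version number is assigned automatically."""
    next_version = len(rows) + 1
    return rows + [{"version": next_version, "change": change, "author": author, "date": date}]


def to_csv(rows: list[dict]) -> str:
    """Render the history as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


history = add_history_row([], "Initial draft", "Godfried Bomans", "2019-05-12")
history = add_history_row(history, "Revised Intro", "Godfried Bomans", "2019-05-14")
print(to_csv(history))
```

To store the table on disk, the text returned by to_csv can be written to a file kept next to the document it describes.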

Best practices

When working with different versions of files try to take into account these tips:

  • Identify milestone versions of files to keep. Avoid clutter.
  • Use a systematic naming convention.
  • Record version and status of a datafile, e.g. raw, cleaned.
  • Document what changes are made to a file when a new version is created.
  • Document relationships between files where needed.
  • Track the location of files. When collaborating, use a common ‘workspace’ (e.g. a shared folder or network share) to avoid different versions of files lingering in different locations.
  • Keep file-sharing out of e-mail.

Automatic methods

Built-in versioning

Some software and platforms provide built-in version control. For instance, Microsoft Office files stored on SharePoint or OneDrive have an automatic version history.

Versioning software

Specific software exists to systematically manage version information about files. Two of the most widely used examples of versioning software are Git and Subversion.

In addition, cloud platforms exist that combine version control with online collaboration. Examples are GitHub and GitLab.