FAIR data

What is FAIR and why is it important?

The FAIR Data Principles were developed to guide you in making your data findable (your data can be discovered by others), accessible (your data can be made available to others), interoperable (your data can be integrated with other data) and reusable (your data can be reused by others).

The goal of applying the FAIR Data Principles is to enable and enhance the reuse of data (and other digital objects), by both humans and machines.

Image by Patrick Hochstenbach, CC0 1.0

The FAIR Data Principles originated in the life sciences, but are now gaining traction well beyond them. Within research communities, the generation of FAIR data is encouraged to maximize the value of the data beyond a specific research question or project and to enable research at larger scale and scope. As a consequence, your research impact and recognition as a researcher are enhanced.

This is why funders, publishers and policy makers also encourage the generation of FAIR data. The FAIR Data Principles are central to the Horizon 2020 and Horizon Europe guidelines on RDM. The European Code of Conduct for Research Integrity (2017), to which Ghent University subscribes, also expects that access to research data is in line with the FAIR principles where appropriate.

Are you new to FAIR? Check out the FAIR concepts section to get introduced to the main concepts and watch our 'What are the FAIR data principles' knowledge clip.

Want to assess how FAIR aware you are? Check out the FAIR-Aware tool.

How to make your data FAIR?

Just like there are various degrees of data sharing, FAIR is also a spectrum. In other words, data can be FAIR to a greater or lesser extent.

How to make your data FAIR in practice varies between scientific domains. The FAIR Data Principles were developed as domain-independent, high-level guidelines for increasing the reusability of data (Wilkinson et al. 2016). A data management plan can help you to learn about, think about and decide upon practices to make your data FAIR. For instance: how will you document your data? Which metadata will you add? Which file formats will you choose? How will you give access to the data? How will you license the data? How will you add a persistent identifier? To find out what this all means, have a look at the FAIR concepts section.

Want to assess how FAIR your data are? Check out S. Jones & M. Grootveld (2017), How FAIR are your data? or ARDC, FAIR self-assessment tool.

Findable - "your data can be discovered by others"

To ensure that data are findable or discoverable, they need to be uniquely and persistently identifiable through the assignment of a persistent identifier (PID). Moreover, describing the data with rich metadata (i.e. structured and machine-readable documentation) makes it possible to find the data by searching or filtering on the metadata (e.g. all data of the year “2020”, all data from the country “Belgium”), most often through online search engines or data repositories which index, harvest or manage the metadata.

Assigning a persistent identifier as well as adding metadata to data can be achieved by depositing the data in a data repository.
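To illustrate what filtering on metadata means, here is a minimal Python sketch. The records, identifiers and field names are hypothetical and only serve as an example; a real data repository will use its own metadata schema and search interface.

```python
# Hypothetical metadata records; identifiers and field names are illustrative only.
records = [
    {"identifier": "example-pid-1", "title": "Soil samples", "year": 2020, "country": "Belgium"},
    {"identifier": "example-pid-2", "title": "River survey", "year": 2019, "country": "Belgium"},
    {"identifier": "example-pid-3", "title": "Forest plots", "year": 2020, "country": "France"},
]


def filter_records(records, **criteria):
    """Return the records whose metadata match every given field value."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# All data of the year 2020 from the country Belgium.
matches = filter_records(records, year=2020, country="Belgium")
print([record["identifier"] for record in matches])
```

The filter only works because every record uses the same field names and the same spelling of the values, which is exactly what rich, structured metadata provide.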

Accessible - "your data can be made available to others"

(Meta)data should be retrievable via their persistent identifier (PID) using a standard communications protocol such as https. When necessary, this protocol allows for an authentication and authorization procedure to identify the researcher who accesses the data and to check whether this researcher should be granted access or not. Also, access restrictions and conditions should be clearly specified. Metadata should be available even if the data themselves are not or no longer available.

These rather technical approaches to ensure that data are accessible can be achieved by depositing the data in a data repository. The appropriate access level of the data is chosen by considering possible legal or ethical reasons for restricting data access.

Interoperable - "your data can be integrated with other data"

To ensure that different datasets can be linked, aggregated and/or understood properly, recognized standards (formats, languages, vocabularies, ontologies, metadata standards…) should be applied to data and metadata.

Metadata standards establish a common way of describing data by defining which metadata is included, how the metadata fields or elements are named and which metadata values are accepted. Rather than using a custom metadata schema, metadata standards are preferred as they enable the exchange of metadata across data repositories thereby facilitating data integration. Examples of metadata standards can be found on the Documentation page.

Controlled vocabularies, taxonomies or ontologies unambiguously define the terminology that is used to describe a certain variable, a metadata value or any other entity within the dataset. Resources such as FAIRsharing.org, the Ontology Lookup Service and Linked Open Vocabularies can be consulted to identify suitable standards. Examples are the standards for dates and times (ISO 8601) and countries (ISO 3166), and domain-specific standards such as the NCBI taxonomy for organisms or the Getty vocabularies. When no controlled vocabularies exist, it is advised to use a code book or data dictionary that explains the terminology that you have used.
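As an illustration of applying such standards, the Python sketch below converts a locally formatted date to ISO 8601 and a country name to its ISO 3166-1 alpha-2 code. The input format (day/month/year) and the small lookup table are assumptions made for this example; in practice you would rely on a complete, maintained code list.

```python
from datetime import datetime

# Small hand-written excerpt of ISO 3166-1 alpha-2 codes, for illustration only.
COUNTRY_CODES = {"belgium": "BE", "france": "FR", "netherlands": "NL"}


def to_iso_date(text, source_format="%d/%m/%Y"):
    """Convert a date string in a known local format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, source_format).date().isoformat()


def to_country_code(name):
    """Map a country name to its ISO 3166-1 alpha-2 code, or raise if unknown."""
    key = name.strip().lower()
    if key not in COUNTRY_CODES:
        raise ValueError(f"No code known for country name: {name!r}")
    return COUNTRY_CODES[key]


print(to_iso_date("14/03/2020"))
print(to_country_code(" Belgium"))
```

Raising an error for unknown values, rather than silently keeping them, makes it easier to spot entries that still need to be standardized.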

Standard data formats further facilitate the interoperability of data. They describe how the data need to be formatted, thereby simplifying the use of data, e.g. through standard file formats that can be read by common software. Examples are the text-based file format for sharing sequencing data (FASTQ), netCDF (Network Common Data Form) for storing and using data in arrays (e.g. for geospatial data) as well as generic data interchange formats such as JSON.

Finally, linking different research outputs to each other by using references to related data, publications or software will also increase data interoperability. This can be achieved through citation, but also by providing the persistent identifiers of the related research outputs as metadata.
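The sketch below shows, in Python, what providing the persistent identifiers of related outputs as metadata could look like, using JSON as the interchange format. All identifiers are hypothetical placeholders, and the field names and relation labels are assumptions for this example; metadata standards define their own names for such links.

```python
import json

# Hypothetical metadata record; every identifier below is a placeholder.
dataset_metadata = {
    "identifier": "https://doi.org/10.1234/example-dataset",
    "relatedIdentifiers": [
        {"relation": "IsSupplementTo", "identifier": "https://doi.org/10.1234/example-article"},
        {"relation": "IsDerivedFrom", "identifier": "https://doi.org/10.1234/example-raw-data"},
    ],
}

# Serialize to JSON so that other systems can exchange and parse the record.
serialized = json.dumps(dataset_metadata, indent=2)

# Any JSON-aware tool can read the links back without custom parsing.
restored = json.loads(serialized)
related = [item["identifier"] for item in restored["relatedIdentifiers"]]
print(related)
```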

When using domain-specific data repositories, the extent to which standards for data and metadata are applied, and even required, increases. It is therefore advisable to deposit data in these domain-specific and/or trusted data repositories which allow you to describe and standardize the data as much as possible.

A data repository will typically offer a metadata input form that is based on a metadata standard and/or that controls the metadata values that you can provide. In addition, a data repository can have restrictions on the data formats that you are allowed to deposit and can conduct a manual or automated curation process to ensure the quality of the provided (meta)data.

Reusable - "your data can be reused by others"

Data should be sufficiently described and documented in accordance with community standards. Documentation and metadata inform a potential user of the data on how, why, by whom and when the data were created (i.e. the provenance of the data) and thereby allow that user to judge whether the data are relevant for the intended reuse.

Moreover, a license needs to be added to the data to specify the reuse conditions and permissions such that researchers know what they can do with your data. Using standard licenses (e.g. Creative Commons licenses) will enhance the reuse of the data as these can be read and understood by machines.

A data repository can do much of the work in making your research data reusable, as it typically lets you provide documentation and assign a license to your data.

FAIR concepts

Machine-readable or machine-actionable

Machine-readability or actionability enables machines (e.g. scripts, software, algorithms) to read, understand and process the data and aggregate data from different sources, types and disciplines. As such, it can allow research at a much larger scope, scale and speed, often needed in contemporary science.

For instance, if the (meta)data are machine-readable, machines will be able to locate a digital object, identify the type of digital object (is it a dataset or a publication? does it contain experimental data or simulation data?) and determine whether it is usable with respect to accessibility, license, data format or other use constraints.
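A minimal Python sketch of such a machine-made decision is given below. The metadata fields and the accepted licenses and formats are assumptions chosen for this example, not a prescribed schema or policy.

```python
# Assumed policy values for illustration; adapt them to your own context.
ACCEPTED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}
READABLE_FORMATS = {"text/csv", "application/json"}


def is_usable(metadata):
    """Decide from the metadata alone whether a digital object can be reused."""
    return (
        metadata.get("type") == "dataset"
        and metadata.get("access") == "open"
        and metadata.get("license") in ACCEPTED_LICENSES
        and metadata.get("format") in READABLE_FORMATS
    )


# Hypothetical metadata record for a digital object.
candidate = {
    "type": "dataset",
    "access": "open",
    "license": "CC-BY-4.0",
    "format": "text/csv",
}
print(is_usable(candidate))
```

A script can only make this decision when type, access level, license and format are recorded as structured values rather than as free text.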

Persistent identifiers (PIDs) and globally unique identifiers (guid, uuid)

What is a PID?

A PID, such as a DOI, PURL or Handle, is a long-lasting reference to a digital object. PIDs help to avoid broken links and make it easier to locate, for example, the dataset underlying a journal article. A PID uniquely identifies the digital object and ensures that it can always be located, even if its web address (URL) changes. A PID can be used for data citation.

A central registry ensures that following the PID will point you to the digital object’s current location even if the URL changes.

You can typically get a PID for your datasets by depositing them in a data repository. In addition to being persistent, this identifier will also be globally (guid) or universally unique (uuid), i.e. ensuring that there are no two identical identifiers that point to different digital objects.
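To give an idea of what a universally unique identifier looks like, the Python sketch below generates two random UUIDs with the standard library. Note that a UUID generated this way is unique but not persistent or resolvable by itself: it only becomes a PID when a registry or repository commits to keeping it pointing at the digital object.

```python
import uuid

# Generate two random (version 4) UUIDs.
first = uuid.uuid4()
second = uuid.uuid4()

print(first)            # 36 characters: 32 hexadecimal digits and 4 hyphens
print(first.version)    # the UUID version number
print(first != second)  # a collision between two random UUIDs is extremely unlikely
```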

Digital Object Identifier (DOI)

The Digital Object Identifier or DOI is a commonly used identifier for research datasets. It is generated by the central registry DataCite. A DOI always comprises:

  • A prefix: ‘10.’ followed by four or more digits: identifies the organisation that registered the DOI at DataCite.
  • A suffix: identifies the dataset.

Appending a DOI to the address of the resolver system, https://doi.org/, takes you to the location or landing page of the digital object in question. An example for a dataset held at the Dryad repository is: https://doi.org/10.5061/dryad.4h16331.
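The structure described above can be sketched in Python as follows. The pattern follows the simplified description given here (a prefix of ‘10.’ plus four or more digits, a slash, and a suffix) and is an assumption for this example rather than a complete DOI validator.

```python
import re

# Simplified pattern: '10.' + four or more digits, a slash, then the suffix.
DOI_PATTERN = re.compile(r"^(10\.\d{4,})/(\S+)$")
RESOLVER = "https://doi.org/"


def split_doi(doi):
    """Split a DOI into its prefix (registrant) and suffix (object)."""
    match = DOI_PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"Not a DOI in the expected form: {doi!r}")
    return match.group(1), match.group(2)


def resolver_url(doi):
    """Build the resolver address that leads to the landing page."""
    prefix, suffix = split_doi(doi)
    return f"{RESOLVER}{prefix}/{suffix}"


print(split_doi("10.5061/dryad.4h16331"))
print(resolver_url("10.5061/dryad.4h16331"))
```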

Want to know more? Check out this video on PIDs and data citation.

 

Note that some community-wide accepted databases or repositories do not generate DOIs for the datasets that they manage, but rather make use of alternative PIDs or accession numbers.

In addition to PIDs for datasets, PIDs also exist for e.g. identification of researchers (ORCID) or organizations (ROR).

Metadata

Metadata are data about data. They are a structured and machine-readable form of documentation and are key to making data FAIR. To learn more about the different types of metadata, how metadata can be generated and metadata standards, check out our metadata page.

Metadata are managed by data repositories to enable you to search and filter the data. Moreover, online search engines can harvest (i.e. automatically collect) and index (i.e. restructure to speed up searches) metadata to enable searches across data repositories e.g. through Google or through data portals.
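The difference between harvesting and indexing can be sketched in Python as below. The repositories, records and keywords are hypothetical, and real harvesters collect metadata over the web rather than from an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical metadata held by two repositories.
REPOSITORIES = {
    "repo-a": [{"id": "a1", "keywords": ["Soil", "Belgium"]}],
    "repo-b": [
        {"id": "b1", "keywords": ["soil", "France"]},
        {"id": "b2", "keywords": ["river"]},
    ],
}


def harvest(repositories):
    """Collect the metadata records of every repository into one list."""
    return [
        dict(record, source=name)
        for name, records in repositories.items()
        for record in records
    ]


def build_index(records):
    """Restructure the records into a keyword -> identifiers lookup table."""
    index = defaultdict(set)
    for record in records:
        for keyword in record["keywords"]:
            index[keyword.lower()].add(record["id"])
    return index


harvested = harvest(REPOSITORIES)
index = build_index(harvested)
print(sorted(index["soil"]))  # one lookup finds matching data across both repositories
```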

Controlled vocabulary, taxonomy, ontology

There are many different ways in which you can describe your data. Terminology might be ambiguous (e.g. the word “root” has a different meaning in biology and maths). Moreover, terminology might be highly domain-specific and therefore difficult to understand.

A controlled vocabulary can help to restrict the terminology that you are using to describe your data to previously defined terms. In taxonomies and ontologies, relations and/or semantics are added to the terms to increase the structure and expressiveness of the controlled vocabulary. For instance, GeoNames can be used for geospatial semantic information, where the country name “France” will be connected to information such as the continent it is part of, the ISO abbreviation for the country, the languages used, etc.

Using controlled vocabularies will improve the discovery (e.g. because different spelling is avoided), linking, understanding and reuse (e.g. because data can be aggregated more easily) of the data.
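A simple way to apply a controlled vocabulary is to check the values in your data against the list of allowed terms. The Python sketch below does this for a hypothetical vocabulary of sample types; the terms and values are invented for this example.

```python
# Hypothetical controlled vocabulary for a "sample type" variable.
ALLOWED_SAMPLE_TYPES = {"root", "leaf", "stem"}


def find_invalid_terms(values, vocabulary):
    """Return the distinct values that are not part of the vocabulary."""
    return sorted({value for value in values if value not in vocabulary})


observed = ["root", "Root", "leaf", "roots", "stem"]
print(find_invalid_terms(observed, ALLOWED_SAMPLE_TYPES))
```

The spelling variants that such a check reports are the ones that would otherwise hamper discovery and aggregation of the data.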

Authentication and authorization

During authentication, the identity of the user will be verified. During authorization, it will be verified whether the user has access to specific data, applications or files. These processes are necessary to determine whether access can be given to the user according to the access level of your data.
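The two steps can be told apart in the following Python sketch. The tokens, user names and access list are hypothetical, and real repositories rely on established login systems rather than a hard-coded table.

```python
# Hypothetical login tokens and the users they belong to.
TOKENS = {"token-a": "alice", "token-b": "bob"}

# Hypothetical access list: which users may access which restricted dataset.
ACCESS_LIST = {"dataset-1": {"alice"}}


def authenticate(token):
    """Authentication: verify who the user is. Returns None if unknown."""
    return TOKENS.get(token)


def authorize(user, dataset_id):
    """Authorization: verify whether this user may access this dataset."""
    return user in ACCESS_LIST.get(dataset_id, set())


def can_access(token, dataset_id):
    """Grant access only if both steps succeed."""
    user = authenticate(token)
    return user is not None and authorize(user, dataset_id)


print(can_access("token-a", "dataset-1"))  # known user who is on the access list
print(can_access("token-b", "dataset-1"))  # known user who is not on the access list
```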

FAIR vs. Open research data

Open and FAIR are both about making data available for reuse. However, they are not synonyms!

Research data can in principle be managed without a view to data sharing, in which case they are neither open nor FAIR. Nevertheless, there are increasing expectations to share research data.

When shared, data can be open or FAIR, or both:

  • FAIR does not mean that data have to be open, in the sense of data that can be 'freely used, modified and shared by anyone for any purpose' (opendefinition.org).
  • Rather, the 'A' in FAIR means that it is clear and transparent how data can be accessed, and - if applicable - under which conditions. In other words, data shared under restrictions can still be FAIR (also see degrees of data sharing).
  • Open data have an open access level and an open license, and ideally, are in an open file format.
  • Open data are not necessarily FAIR (or even managed) data.
  • Ideally, the aim is to increase the amount of data that are open as well as FAIR.

 

FAIR vs open data.  Image adapted from 'Open data, FAIR data and RDM: the ugly duckling' by S. Jones, licensed under CC BY.

More information