Research Data Management from A to Z
- Anonymization
It is considered good research practice to anonymize personal data in science. Anonymization refers to any measure that alters personal data in such a way that "the individual details of personal or factual circumstances can no longer be attributed to a specific or identifiable natural person, or can only be attributed to a specific or identifiable natural person with a disproportionate expenditure of time, cost and labor" (According to the Federal Data Protection Act, Section 3, Paragraph 6). A distinction is made between anonymization and pseudonymization.
- Archive
An archive is a system that allows the organized storage and retrieval of historical data, documents, or objects. For long-term archiving, special archiving systems are required, which ensure the readability of file formats.
For many research areas, discipline-specific repositories are available that also function as archives.
RWTH Aachen University offers an archive for research data using the Coscine RDM platform.
- Archiving
Archiving means the unchangeable, long-term storage of your data. It is therefore very important that you choose a suitable medium for storing your research data. Saving your files on an external data carrier and keeping it in your drawer is not an option here!
You can archive your data using the Coscine RDM platform.
Documenting your data correctly is very important for archiving, and you should ensure that all essential information, such as metadata, is included.
The RWTH IT Center's archive service offers the assignment of ePIC persistent identifiers as an archiving option.
Please refer to our Instructions on Archiving Data for a Publication, which are in German.
- Best Practice
German: Best Practice
The application of tested and proven methods to a work process is called Best Practice. This is "a technique or methodology that has proven to be reliable – by experience and/or through research – in achieving a desired result". A commitment to best practice requires the use of all available knowledge and technologies that guarantee successful implementation. In research data management, this term is used to describe the standards by which high quality records can be created.
- Copyright
You can find information on copyright in the Fact sheet on the Copyright Protection of Research Data.
- Creative Commons Licenses
In order to maximize the reusability of scientific research data, which may, in principle, be subject to copyright, it may be useful to grant additional rights of use, for example by licensing the data accordingly. The use of liberal licensing models, in particular the globally recognized Creative Commons Licences, or CC for short, offers a way to comprehensibly define the conditions for the reuse of published research data.
- Data Center
Data centers are among the central repositories. They are responsible for storing, managing and distributing data and information for a specific knowledge institution. Data centers for research data have mainly emerged from independent scientific initiatives.
- Data Curation
The term data curation describes the management activities required to maintain research data in the long term so that they are preserved and available for reuse. Curation in its broadest sense means a set of activities and processes performed to create, manage, maintain and validate an entity. With regard to data, it is thus the active and ongoing management of data throughout the data life cycle. Data curation makes it possible to search for, find, and retrieve data as well as to secure their quality, added value, and long-term reusability.
- Data Curation Profiles
A data curation profile describes the 'history' of a data set or data collection, that is, the origin and life cycle of a data set within a research project. Developed by Purdue University Libraries, the profile and its associated toolkit are both a tool and a collection of data sets. The tool consists of an interview instrument that guides the user through a very thorough 'data exploration' and becomes a 'profile' as it is filled in. The collection can be searched for completed data curation profiles, for example to inform research data management services about the data curation practices of a specific discipline or research method.
- Data Journal
Data journals aim to secure the reusability of research data and to establish research data as an academic achievement in their own right. Moreover, they attempt to improve the transparency of academic methods and research results, support good data management practices, and provide long-term access to data sets. They can be seen as publications which make data sets available.
- Data Life Cycle
As a model, the data life cycle illustrates all the stages which research data can pass through, from collection to subsequent use. The different stages of the data life cycle can vary from discipline to discipline. In general, these stages are as follows:
- Planning (application and preparation)
- Access and reuse
- Data Management Plan
A data management plan, DMP for short, means the systematic and targeted documentation of your research data. A DMP takes into account the handling, storage and archiving, access and use of your data and metadata. Creating a DMP means that much thought has gone into the quality of your data, your resources, and your intellectual property right from the start of your project.
The following online tools can help you create a DMP:
- RWTH Aachen University's own DMP template
The Research Data Management Organizer, RDMO for short, is the product of a project funded by the DFG (German Research Foundation) and has been developed further by the community since 2020. The RDMO instance of RWTH contains the DMP template of RWTH and can be extended using DMP templates specifically adapted to your requirements.
DMPonline is a tool developed by the British Digital Curation Centre, DCC for short, and hosted at the University of Edinburgh. It provides different templates from funding organizations as well as a generic template that is suitable for every research project. DMPonline helps you create a DMP according to EU guidelines.
The DMPTool is offered by the California Digital Library. It contains instructions for certain funding organizations that have already made DMP a requirement today. Integrated resources and services from certain partner institutions make it easier to complete a DMP in some cases. The tool also offers a generic DMP template and is freely accessible to everyone. In addition, the website offers some examples of DMPs.
- Data Mapping
Data mapping is the process of transferring data (elements) from one data model to another. This is the first step towards integrating foreign information into your own information system. Data mapping includes the transformation of data during electronic data exchange, which typically uses the XML markup language or the JSON data format.
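As a minimal sketch of such a mapping, the following Python snippet renames the fields of one record from a source model to a target model and serializes the result as JSON. The field names in `FIELD_MAP` and the sample record are hypothetical and serve only as an illustration; real mappings usually also convert values and units.

```python
import json

# Hypothetical mapping: field name in the source model -> field name in the target model.
FIELD_MAP = {
    "titel": "title",
    "autor": "creator",
    "jahr": "date",
}


def map_record(source: dict, field_map: dict) -> dict:
    """Copy every mapped source field into the target model under its new name.

    Source fields without an entry in the mapping are dropped.
    """
    return {target: source[src] for src, target in field_map.items() if src in source}


# Illustrative source record; "intern" has no counterpart in the target model.
source_record = {"titel": "Messreihe 7", "autor": "A. Example", "jahr": 2021, "intern": "x"}
target_record = map_record(source_record, FIELD_MAP)

# Serialize the mapped record for exchange with another system.
print(json.dumps(target_record, indent=2))
```

Keeping the mapping in a plain dictionary, separate from the function that applies it, makes it easy to review and extend the mapping without touching the code.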
- Data Protection
The term data protection refers to technical and organizational measures to prevent the misuse of personal data. Misuse is defined as gathering, processing or using such data in an unauthorised way. Data protection is regulated by the EU General Data Protection Regulation (GDPR), by the German Federal Data Protection Act as well as the corresponding laws at the federal state level.
Personal data are gathered and used especially in medical and social science studies. It is mandatory to store them in an especially secure location. Pseudonymization and anonymization can make a publication of these kinds of data possible.
- Database Rights
Sui generis database rights protect a database against unauthorized use and reproduction for a period of 15 years, provided that a sufficient degree of "intellectual creation" has been reached in its development, that is, a “quantitatively or qualitatively substantial investment” of money, time, labor, and so on. German database rights are based on the EU Database Directive (Directive 96/9/EC). They do not apply to the contents of a database, which may themselves be subject to copyright, but to the process of creating a database by systematic or methodical compilation.
- Digital Artefact
A digital artefact is the end result of the process of digitization, in which an analog object (a text, an image, or a recording, for example) is transformed into digital values so that it can be stored electronically. As opposed to an analog object, a digital artefact can be distributed in the form of digital research data and processed by machines. Another advantage of working with digital artefacts is that further alteration or damage to sensitive analog objects can be avoided.
- Digital Object Identifier (DOI)
German: Digital Object Identifier (DOI)
A Digital Object Identifier (DOI) is one of the most common systems for persistent identification of digital documents. A DOI is unique and remains unchanged over the entire lifetime of a designated object. The DOI system is managed by the International DOI Foundation. Another well-known system for persistent identification is the Uniform Resource Name (URN).
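To illustrate how a DOI is used in practice, the sketch below turns a DOI into a link via the public resolver at https://doi.org/. The regular expression is a deliberately simplified plausibility check (prefix `10.`, a numeric registrant code, a slash, and a suffix), and the DOI `10.1234/example-dataset` is a made-up example, not a registered identifier.

```python
import re

# Simplified plausibility check, not a complete DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI or raise ValueError if it looks malformed."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return f"https://doi.org/{doi}"


# Hypothetical DOI used purely for illustration.
print(doi_to_url("10.1234/example-dataset"))
```

The resolver URL stays the same even if the landing page of the data set moves, which is what makes the DOI suitable for citation.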
- DINI Certificate
The DINI certificate of the Deutsche Initiative für Netzwerkinformation (German Initiative for Network Information) is a widely recognized quality seal for repositories. It guarantees a high-quality repository service for authors, users, funders, and the management of the certified institution. The certificate indicates that open access standards, guidelines, and recommendations (best practices) have been implemented.
- Domain Model
Different domains, as well as working environments, can be identified within a research project. The domains differ in the type of data exchange, the circle of exchange partners, and the type of use.
- The private domain indicates each researcher’s working environment.
- The group domain denotes the research group’s common working environment.
- The permanent domain means the working environment for long-term archiving.
- Access and reuse is the cross-project interdisciplinary working environment of all researchers around the world.
Every research project involves at least the first three domains over its duration.
The critical points are the transitions between the domains. Extensive planning, for example via a data management plan, is therefore required for this to be a smooth process.
- It is important to lay the foundations in the researcher’s private domain using an overall concept that will also be applied for the later transitions.
- In order to transition to the group domain, basic specifications for the common use and creation of research data are necessary.
- If permanent storage is required and publication is planned, information for cross-disciplinary understanding and reuse should be elaborated.
- You should bear in mind that data are often not only relevant to one sole research context. There are frequently overlaps to other fields and the data from one discipline today form the basis for research in another tomorrow. In order to create these new opportunities, it is important to create access to research data.
- File Format (Data type, file type)
German: Dateiformat (Datenart, Dateityp)
The file format specifies the syntax and semantics of data within a file. A computer or computer application needs to know the file format in order to interpret the data stored in a file. The format is indicated in coded form by the file extension. Most file formats are designed for a specific use and can be grouped together:
- Executable files
- System files
- Library files
- User files: Image files (vector graphics [SVG, ...], raster graphics [JPG, PNG, ...]), text files, video files, etc.
In addition, a distinction can be made between proprietary and open file formats.
Proprietary formats are mostly provided by software manufacturers or platforms, are typically copyrighted, and require manufacturer-specific knowledge for implementation.
Open formats have publicly documented specifications and can therefore be implemented and adapted by users without restriction.
The spectrum of data types and formats of research data is very diverse.
Examples of data types are:
- Models: statistical, 3D modeling
- Multimedia data: JPEG, TIFF, MPEG
- Numerical data: Excel, SPSS, CSV
- Software: Java, C++
- Text documents: Word, PDF, XML
These file types are not equally suitable for long-term archiving. The compatibility of different file formats and the conversion to other formats must be taken into account. A good overview of this is provided by forschungsdaten.info.
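The role of the file extension can be sketched with a small lookup table. The table below is an illustrative assumption, not an official classification: it covers only a handful of extensions, and the labels "open" and "proprietary" are simplified.

```python
from pathlib import Path

# Illustrative lookup only: extension -> (data type, kind of format).
FORMAT_INFO = {
    ".csv": ("numerical data", "open"),
    ".sav": ("numerical data", "proprietary"),  # SPSS data file
    ".png": ("raster graphic", "open"),
    ".svg": ("vector graphic", "open"),
    ".xml": ("text document", "open"),
}


def describe(filename: str) -> tuple:
    """Look up data type and format kind from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    return FORMAT_INFO.get(suffix, ("unknown", "unknown"))


for name in ["Results.CSV", "survey.sav", "notes.xyz"]:
    print(name, describe(name))
```

Note that an extension is only a hint: a renamed file keeps its actual format, so archives typically identify formats from the file content as well.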
- Good Academic Practice
German: Gute wissenschaftliche Praxis
Good academic practice entails storing research data for at least ten years.
- Guidelines, Regulations, and Policies
German: Richtlinien, Regeln, Policies
Policies and guidelines have been developed to ensure that all employees of an institution are aware of the procedures and best practice in research data management. In Germany, there are almost no binding research data policies (data guidelines) with detailed specifications, but rather only basic self-imposed obligations, such as commitment to the principles of Open Access.
- Harvesting
Harvesting is the automatic "collection" of data or metadata from archives and repositories by so-called service providers such as BASE, OAIster, or Scientific Commons.
So-called harvesting protocols are used to automatically retrieve the data. One of the most commonly used harvesting protocols is the XML-based Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH for short. Since many very different metadata standards coexist, OAI-PMH uses the Dublin Core model as the lowest common denominator for metadata representation.
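The following sketch shows what a harvesting request and its evaluation might look like. The request parameters `verb=ListRecords` and `metadataPrefix=oai_dc` are defined by OAI-PMH. The endpoint `https://repository.example.org/oai` and the sample response are made up for illustration, and the code does not contact any server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Shortened, made-up response containing one Dublin Core record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data set</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def extract_titles(xml_text: str) -> list:
    """Collect the text of all dc:title elements in a response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{DC}title")]


print(list_records_url(BASE_URL))
print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would additionally fetch the URL, follow resumption tokens for large result sets, and handle protocol errors.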
- Institutional Policy
Institutional policies can provide you and your staff with certainty and guidance. In the RWTH Institutional Policy template, you will find proposals that are not binding and can be individually adapted to your working group, institute, et cetera. Institutional policies cover the handling of data management plans, usage rights, copyright, and the storage and archiving of research data.
- JSON
JSON (JavaScript Object Notation) is a compact, software-independent data format in an easy-to-read text form for exchanging data between applications. It is used for the transmission and storage of structured data, especially in web applications.
Although JSON is not as versatile as XML, it usually requires significantly less storage space for the same information.
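The difference in size can be illustrated by serializing the same record in both formats. This is a sketch for one made-up record with hypothetical field names; the exact ratio depends on the data and on how the XML is structured.

```python
import json
import xml.etree.ElementTree as ET

# Made-up record used for both serializations.
record = {"title": "Example data set", "creator": "Example, A.", "year": "2021"}

# Compact JSON without extra whitespace.
json_text = json.dumps(record, separators=(",", ":"))

# Equivalent XML: one child element per field.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

XML repeats each field name in an opening and a closing tag, which accounts for most of the additional length in this example.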
- Long-Term Archiving
Long-term archiving generally means ensuring data availability for a period of over ten years. Besides preserving the data content at the bit level, you should also bear in mind the following requirements for the future interpretability of data:
- Is the data format suitable for long-term archiving?
- Is special software required for interpretation?
- Are the metadata complete?
Technical and descriptive metadata are particularly important to ensure it will be possible to use the data in technical infrastructures of the future.
You can find more detailed information on the long-term archiving of research data in the NESTOR manuals: Long-Term Archiving of Research Data or Digital Curation of Research Data, or via the nestor wiki.
One option for long-term archiving, which the RDM team has tested, is Ex Libris’ Rosetta software.
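Preservation at the bit level, as mentioned above, is commonly checked by comparing checksums. The following minimal sketch computes a SHA-256 checksum with Python's standard library and shows that even a single changed character is detected. The file name and contents are invented for the demonstration, and this is not a description of how any particular archive system works.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "measurements.csv"  # hypothetical data file
    data_file.write_bytes(b"t,value\n0,1.5\n1,1.7\n")

    # Checksum recorded at the time of archiving.
    stored = sha256_of(data_file)

    # Later check: an untouched file yields the same checksum.
    unchanged = sha256_of(data_file) == stored

    # Simulate silent corruption of a single character.
    data_file.write_bytes(b"t,value\n0,1.5\n1,1.8\n")
    changed = sha256_of(data_file) != stored

print(unchanged, changed)
```

A matching checksum only shows that the bits are intact. Whether the data can still be interpreted depends on the format, software, and metadata questions listed above.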
- Meta Data Standard
Metadata standards are standardized schemes to ensure interoperability, that is, the linking and common processing, of metadata. They are used for a structured and uniform description of similar data. A metadata standard can often be transformed into another metadata standard by a process called mapping.
- Metadata
The term "metadata" denotes further information about your research data. They describe your data in more detail and make them interpretable at any time. Metadata are particularly important for the documentation, management, and classification of digital research data, since they are essential in answering the following:
- Where do the data come from?
- Who created the data, when, and how?
To ensure the exchange and reusability of metadata via digital information systems, you should use standardized metadata schemata as consistently as possible.
Jisc infoKit provides an introduction to metadata. This guide informs you about the most important metadata goals and concepts and is suitable for those who do not have any prior knowledge in the area.
A very short introduction to documentation and metadata can be found in the presentation Explain It.
The interactive Mantra course offers training on documentation and metadata. You will quickly understand why it is important to document your own research – both for you and for others. In addition, you are taught when and why to use metadata.
- Metadata Schema
A metadata schema is a compilation of permitted data elements used to uniquely describe a resource. Which metadata schema is suitable for you depends on a number of factors, such as the data type or the context in which the data were created and used.
There are a variety of metadata schemas for data from different disciplines. The first step to take when designing your research data descriptions is to check whether a suitable schema already exists for your discipline. You can find an ever-growing list on FAIRsharing.org, for example. Dublin Core and RADAR are two of the best-known standardized metadata schemas.
The MetadataManager lets you fill in metadata according to a schema created for your institution. The schema not only specifies which metadata fields, for example author and subject, must or may be recorded, but also allows you to use controlled vocabularies. Selecting or creating a suitable metadata schema is by no means trivial and the RDM project team will be happy to assist you here.
Once you have decided on a metadata schema, you must define the content of the data fields. To ensure the greatest possible chances of reusability and to optimally support research, we recommend you use controlled vocabularies, thesauri, and classifications. You can also find a large number of both interdisciplinary and discipline-specific solutions for this.
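How a schema with required fields, optional fields, and a controlled vocabulary constrains a metadata record can be sketched as follows. The schema, its field names, and the vocabulary terms are hypothetical and do not reproduce the MetadataManager or any published standard.

```python
# Hypothetical schema: required and optional fields plus one controlled vocabulary.
SCHEMA = {
    "required": {"author", "title", "subject"},
    "optional": {"description"},
    "vocabularies": {"subject": {"chemistry", "physics", "linguistics"}},
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    present = set(record)
    allowed = schema["required"] | schema["optional"]

    for field in sorted(schema["required"] - present):
        problems.append(f"missing required field: {field}")
    for field in sorted(present - allowed):
        problems.append(f"field not permitted by schema: {field}")
    for field, vocabulary in schema["vocabularies"].items():
        if field in record and record[field] not in vocabulary:
            problems.append(
                f"value not in controlled vocabulary for {field}: {record[field]}"
            )
    return problems


good = {"author": "Example, A.", "title": "Example data set", "subject": "physics"}
bad = {"title": "Untitled", "subject": "biology", "colour": "red"}

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

Restricting a field such as the subject to a fixed list of terms is what allows records from different researchers to be searched and combined reliably.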
- National Research Data Infrastructure (NFDI)
German: Nationale Forschungsdateninfrastruktur (NFDI)
The National Research Data Infrastructure, or NFDI for short, which is currently being set up, seeks to "systematically develop, secure in the long term, and make accessible the data stocks of science and research and to connect them (inter)nationally". The NFDI will be made up of a number of so-called consortia. These are alliances of universities, non-university research institutions, departmental research institutes, academies and other publicly funded information infrastructure facilities or other relevant entities. They will then develop and provide a portfolio of services for research data management for their respective sub-areas.
The NFDI is an initiative launched by the Joint Science Conference, GWK for short, and financed by the Federal Government and the federal states. It is intended to fund up to 30 consortia in total. The science-led process for reviewing and evaluating the consortium applications is carried out by the German Research Foundation (DFG). The GWK makes funding decisions on the basis of the results of this process. The first funding decision was made in June 2020. Two further selection rounds will follow in 2020 and 2021.
- Persistent Identifier
German: Persistenter Identifikator
An identifier signifies the unique denotation of a resource, usually digital. A classic example of an identifier in printed resources is the International Standard Book Number, ISBN for short. The Uniform Resource Locator, URL for short, is often used for digital resources. URLs have a half-life of around 100 days. Due to this short lifespan, URLs are not a suitable identifier for the permanent and distinct citation of research data.
This is where persistent identifiers, PIDs for short, come into play. PIDs form a middle layer between the reference and the object, so that the object is decoupled from its "electronic" location. This reduces the number of broken links and “Error 404: Page not found” messages. In other words, references remain stable even if the data’s storage location changes.
PIDs give research data a permanent and unchangeable identifier, called a Uniform Resource Identifier, or URI for short, which is assigned throughout their lifecycle and beyond.
The best-known example of a PID is the Digital Object Identifier, DOI for short.
RWTH offers its members ePIC PID assignments.
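The idea of a middle layer between reference and object can be sketched with a small lookup table. The class `PidResolver`, the identifier `pid:example/0001`, and the URLs are hypothetical and do not describe how ePIC or the DOI system is implemented. The sketch only shows why a citation stays valid when the data move.

```python
class PidResolver:
    """Toy resolver: maps a persistent identifier to the current storage location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid: str, url: str) -> None:
        """Assign a PID to an object at its initial location."""
        self._locations[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        """Record a new location; the PID itself never changes."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location for a PID."""
        return self._locations[pid]


resolver = PidResolver()
resolver.register("pid:example/0001", "https://old-server.example.org/data/7")
citation = "pid:example/0001"  # what appears in a publication

# The data are migrated to a new server; only the resolver entry is updated.
resolver.move("pid:example/0001", "https://new-server.example.org/datasets/7")
print(resolver.resolve(citation))
```

The published citation is never edited. Only the entry in the resolver changes, which is why the reference does not break.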
- Personal Data
German: Personenbezogene Daten
The Federal Data Protection Act, BDSG for short, defines personal data as “individual information about personal or factual circumstances of an identified or identifiable natural person (data subject).” Data are considered personal if they can be attributed to a particular natural person. Typical examples are name, profession, height, or nationality of an individual. Data on ethnic origin, political opinion, religious or philosophical beliefs, trade union membership, health, and sexuality are particularly sensitive data under the BDSG and are therefore subject to stricter protection requirements.
- Personal Data Management
German: Persönliches Datenmanagement
To implement data management according to your plan, you must carefully organize your general research activities on a daily basis. You should consider matters such as documentation, labelling samples, and organizing the data structure. You should therefore specify the following as early as possible:
- Data organization, storage structures, versioning
- Documentation, metadata
- Data, backup during the project period
- Responsibilities, access rights, collaboration rules
- Archiving or publishing after the end of the project
The Data Management Plan or Institute Policy tools are suitable for supporting your personal data management. The Research Data team also advises you in individual or group consultations, where you can develop solution strategies tailored to your subject and the conditions at your institute using RWTH’s technical services.
- Primary Research Data
Primary research data are unprocessed and uncommented raw data which have not yet been complemented by any metadata. They form the foundation of all scientific activity relating to an object of inquiry. The distinction between research data and primary research data is of a theoretical nature only, as raw data are hardly ever published without any explanatory metadata. Digital artefacts are generally not published by their proprietors, such as scientific libraries or collections, without background information such as provenance.
- Pseudonymization
In contrast to anonymization, pseudonymization merely replaces certain identification features, such as the name, with a pseudonym (a letter and/or numerical code). In this way, data and identification characteristics are separated and can only be assigned to each other with the help of a key.
This is intended to prevent or significantly impede the identification of the involved individuals (BDSG § 3, para. 6a). During the course of a scientific study, such pseudonymization is often unavoidable. Personal data and corresponding codes are kept in a reference list and the research data in a separate database.
Data can be anonymized, for example, by deleting the reference list after completion of the study, so that no link can be established between individual test subjects and the study results.
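The separation of research data and reference list described above can be sketched as follows. The function `pseudonymize`, the field names, and the participant records are made up for illustration. Note that deleting the key is not always enough: the remaining fields can still allow re-identification in some cases.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Split records into research data (with pseudonyms) and a reference list."""
    codes = {}      # identity -> pseudonym, so one person always gets the same code
    research = []
    for record in records:
        identity = record[id_field]
        if identity not in codes:
            code = f"P-{secrets.token_hex(4)}"
            while code in codes.values():  # regenerate on the unlikely collision
                code = f"P-{secrets.token_hex(4)}"
            codes[identity] = code
        row = {key: value for key, value in record.items() if key != id_field}
        row["pseudonym"] = codes[identity]
        research.append(row)
    # The reference list is the key and must be stored separately.
    reference_list = {code: identity for identity, code in codes.items()}
    return research, reference_list


# Invented study participants.
participants = [
    {"name": "Alex Example", "score": 12},
    {"name": "Sam Sample", "score": 17},
    {"name": "Alex Example", "score": 14},
]

research_data, reference_list = pseudonymize(participants)
print(research_data)

# Anonymization as described above: destroy the key once the study is complete.
reference_list.clear()
```

While the reference list exists, the pseudonyms can be resolved back to individuals, so the research data still count as personal data and must be protected accordingly.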
There are both discipline-specific and institutional repositories. You can find a good overview of research data repositories in the Registry of Research Data Repositories, re3data for short, which is funded by the German Research Foundation, DFG for short, and offered as a service by DataCite. Furthermore, you can also use the institutional repository RWTH Publications.
Many research data repositories, including RWTH Publications, can assign a Digital Object Identifier, DOI for short, to your data. The University Library of RWTH Aachen University is already registered with the Leibniz Information Centre for Science and Technology and University Library, TIB for short, as a data center that can assign DOIs to digital objects.
- Repository
A repository is a document server on which academic-scientific material can be stored, archived and/or made available. At a more general level it refers to an administrated storage space for digital objects which are generally publicly accessible or, at least, accessible to a specific group of users.
The institutional repository of RWTH is RWTH Publications.
For an overview of discipline-specific repositories, please visit re3data.org.
- Research Data
Research data are data that have been created in a research process (for example through measurements, surveys, source work), which form the basis for scientific activity (for example digital artefacts), or which document the results of research activities.
- Research Data Management
The term research data management refers to the systematic handling of research data throughout the entire data life cycle. The overall aim is to make these data available, reusable, and verifiable in the long term, independently from the researcher(s) who generated the data.
- Rights to Data
German: Rechte an Daten
It is possible to define rights to data from two perspectives. For researchers, decision rights on data result from their generation. For users, the term rather refers to the rights that must be taken into account when re-using data. Rights can be defined and communicated in a legally binding manner in the form of licences and associated licence agreements.
As a matter of course, the rules of good research practice apply to the re-use of data. Most importantly, this means that the authors are to be correctly cited (copyright). This rule is reflected in the CC-BY Creative Commons licence. Data protection, patent, and personal rights may impose restrictions on the re-use of data.
- Threshold of Originality
Threshold of originality, or level of invested creativity, is a concept in copyright law that is used to assess whether a particular work can be copyrighted. It refers to the degree of originality of and creativity invested in an intellectual creation. Whether a work reaches this threshold of originality is a decisive criterion for its protectability by German copyright law. An important aspect of the threshold of originality is that the work is a result of its author's creativity and personality rather than an outcome of external circumstances (functionality, objectivity, etc.). This is why research data very rarely fall under German copyright law.
- Virtual Research Environments (VRE)
German: Virtuelle Forschungsumgebungen (VFU)
Virtual Research Environments, or VRE for short, are working platforms that enable simultaneous, location-independent collaboration without restrictions between scientists. A VRE is primarily an application-oriented service provided by an infrastructure facility such as a computing center or a library for use by a specific research association or research community. Virtual research platforms integrate discipline-specific tools, tool collections, and working environments.
- XML (Extensible Markup Language)
German: XML (Extensible Markup Language)
XML is a markup language for storing hierarchically structured information in the format of a text file. This file is both machine-readable and human-readable. It is mainly used for platform- and implementation-independent data exchange between computer systems.
If further content rules have been defined in an external file in addition to the general structural rules, it is possible to check whether the content of an XML document is valid. This makes it possible to describe the form and content of the encoded information very precisely. With the help of XSL (Extensible Stylesheet Language) it is possible to interpret the stored information and convert it into other file formats for visualization purposes.
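The hierarchical structure and the general structural rules can be illustrated with Python's standard library. The sample document and its element names are made up. The standard-library parser used here only checks that a document is well-formed; checking validity against content rules in an external file requires additional tooling that is not shown.

```python
import xml.etree.ElementTree as ET

# Made-up document with nested (hierarchical) elements and one attribute.
DOCUMENT = """<dataset id="ds-1">
  <title>Example data set</title>
  <creators>
    <creator>Example, A.</creator>
    <creator>Sample, B.</creator>
  </creators>
</dataset>"""


def is_well_formed(text: str) -> bool:
    """Return True if the text obeys XML's general structural rules."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(DOCUMENT)
creators = [creator.text for creator in root.findall("./creators/creator")]

print(root.get("id"), root.find("title").text, creators)
print(is_well_formed(DOCUMENT))
print(is_well_formed("<dataset><title>Broken</dataset>"))  # mismatched tags
```

The same text file can be read by a person in an editor and navigated by a program along its element hierarchy, which is what makes XML suitable for platform-independent exchange.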