Research data is essential for science, but many datasets are hidden on websites and in small repositories or are difficult to find because of inadequate metadata. Only a fraction of researchers proactively make dataset metadata available in public portals, and curating such metadata is costly.
The “Unknown Data” project is creating an infrastructure that facilitates the reanalysis of research data and the replication of research results. It will also make the provenance of data more traceable and make datasets that are not yet listed in public collections accessible.
These goals are pursued through several complementary approaches:
- Using citations in scientific articles and on websites to find metadata about datasets
- Discovering datasets and their context by crawling relevant web pages (see the sketch after this list)
- Consolidating metadata by linking it to information from domain-specific databases
- Ensuring metadata quality by establishing a discipline-specific curation process
- Ensuring the long-term availability of original sources by archiving the relevant web pages
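As a minimal illustration of the crawling step above, the sketch below fetches a web page and extracts any dataset metadata embedded as schema.org Dataset records in JSON-LD, one common way such metadata appears on web pages. The URL is a placeholder, and the project's actual crawler is naturally more elaborate.

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_dataset_metadata(url: str) -> list[dict]:
    """Return all schema.org Dataset records embedded in a page as JSON-LD."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        items = data if isinstance(data, list) else [data]  # JSON-LD may hold one object or a list
        records.extend(d for d in items if isinstance(d, dict) and d.get("@type") == "Dataset")
    return records


# Placeholder URL for illustration only:
# for record in extract_dataset_metadata("https://example.org/some-dataset-page"):
#     print(record.get("name"), "->", record.get("url"))
```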
Mining metadata about research data from web pages and publications is a novel approach that increases the visibility of “long-tail” datasets while providing crucial insights into the actual use and impact of already known research data. “Long-tail” datasets are those that can only be found using very specific search terms.
Two disciplines, computer science and the social sciences, will benefit from the project's outputs through use-case pilots. The DBLP bibliography and the GESIS portals are among the most respected and widely used metadata collections in their respective disciplines, and both feed many other search services such as Google Dataset Search and CESSDA. Unknown Data will greatly improve the effectiveness and efficiency of researchers' data searches: it will create, for the first time in computer science, a centralized and comprehensive collection of research-data metadata, and it will fundamentally improve the quality and quantity of dataset metadata in the social sciences.
Dataset citations extracted from web pages and publications will also allow the impact of datasets to be estimated, a critical feature for assessing their usefulness and reuse.
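As a hypothetical sketch of this idea: once (publication, dataset) citation pairs have been extracted, counting the distinct citing publications per dataset already yields a rough impact signal. All identifiers below are made up for illustration.

```python
from collections import Counter

# Hypothetical (publication, dataset) pairs, as produced by citation extraction
citation_pairs = [
    ("paper-001", "doi:10.5555/dataset-a"),
    ("paper-002", "doi:10.5555/dataset-a"),
    ("paper-002", "doi:10.5555/dataset-b"),
    ("paper-003", "doi:10.5555/dataset-a"),
]

# Count distinct citing publications per dataset as a simple impact measure
impact = Counter(dataset for _, dataset in set(citation_pairs))
for dataset, n_citations in impact.most_common():
    print(f"{dataset}: cited by {n_citations} publication(s)")
```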
All collected metadata will be made permanently and publicly available as Linked Open Data and via REST APIs, making research data findable, accessible, interoperable, and reusable for both researchers and machines, in accordance with the FAIR Data Principles. All software will be released as open source, and the methods developed can be adapted to other disciplines.
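To make the machine-access idea concrete, here is a hedged sketch of how a client might query such a REST API for dataset metadata. The endpoint, query parameter, and response shape are assumptions for illustration, not the project's published interface.

```python
import requests

# Hypothetical endpoint and response shape; the project's real API may differ
API_URL = "https://api.example.org/datasets"

response = requests.get(
    API_URL,
    params={"q": "social media panel"},          # free-text search (assumed parameter)
    headers={"Accept": "application/ld+json"},   # request a Linked Open Data serialization
    timeout=10,
)
response.raise_for_status()

for record in response.json().get("results", []):
    # schema.org-style keys are an assumption about the metadata model
    print(record.get("name"), "->", record.get("@id"))
```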
The project is funded by the German Research Foundation (DFG) and is being developed in cooperation with the Internet Archive and the Consortium of European Social Science Data Archives (CESSDA).
Contact
Prof. Dr. Stefan Dietze
Computer Science
Prof. Dr. Stefan Dietze is Professor of Data & Knowledge Engineering at HHU-Düsseldorf and Scientific Director of the Knowledge Technologies for the Social Sciences department at GESIS (Leibniz Institute for the Social Sciences) in Cologne. His research focuses on harnessing large amounts of data from the web using methods from Natural Language Processing (NLP), information retrieval, and machine learning. In his role at GESIS, a particular focus is the use of (social) web data for interdisciplinary research questions in the social sciences.
At DIID, he investigates online discourse using NLP-based methods, for example for recognizing and classifying statements and sources, or for understanding information diffusion in social networks.