Making Dark Data FAIR
Modern high-performance computing (HPC) facilities generate a colossal amount of data. Recent studies have shown that a significant percentage (up to 3.41% of total storage capacity) of the data these facilities produce might never be used again, for reasons as mundane as improper labelling. Given that large amounts of public money often go into producing that data, this is hugely problematic.
The reasons for the proliferation of this so-called dark data (i.e., data that cannot be reused) are varied [1]. Missing or incomplete metadata, non-standardised storage methods, and researchers simply forgetting that the data exists are just some of the ways in which data becomes non-reusable.
Recently, the FAIR (Findable, Accessible, Interoperable, Reusable) principles [2] for data management were introduced in an attempt to reduce the amount of non-reusable data. The principles outline a number of pragmatic measures that institutions might enact to that end.
Our project, entitled “Making Dark Data FAIR” [3], has three aims: to analyse exactly why dark data goes dark in the first place; to develop concrete strategies for enacting the FAIR principles so as to reduce dark data; and to interrogate the FAIR principles themselves, to ensure they fulfil their intended role of reducing dark data. While our primary focus is the generation of dark data at HPC facilities, the results of this research are widely applicable: the need for the FAIR principles is recognised by a wide variety of parties, not only those interested in high-performance computing.
As part of the project, we are running a series of workshops in late 2020 and early 2021, which will serve as a platform to discuss and disseminate the project's results. Workshops will be held both online and in person, including one at TU Delft. They will bring together a wide variety of stakeholders, including researchers, policy makers, data stewards, and HPC facility personnel.
The project is financed by the European Open Science Cloud Secretariat. The lead coordinator of the project is Juan M. Durán (TU Delft – j.m.duran@tudelft.nl). Jack Casey (TU Delft) is the postdoc and contact person (j.j.casey@tudelft.nl).
The project is a collaboration between TU Delft (The Netherlands), the University of Exeter (UK), the University of Stuttgart (Germany), the CNR-IOM Center (Italy), and the ERC Consortium SoBigData++.
References
[1] Schembera, B., & Durán, J. M. (2019). Dark Data as the New Challenge for Big Data Science and the Introduction of the Scientific Data Officer. Philosophy & Technology. https://doi.org/10.1007/s13347-019-00346-x
[2] Wilkinson, M., Dumontier, M., Aalbersberg, I., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
[3] https://www.eoscsecretariat.eu/funding-opportunities/list-approved-co-creation-activities
Written by: Jack Casey
Revised by: Francesca Pratesi, Marco Braghieri, Luca Pappalardo