FAIRly big: A framework for computationally reproducible processing of large-scale data

Adina S Wagner; Laura K Waite; Małgorzata Wierzba; Felix Hoffstaedter; Alexander Q Waite; Benjamin Poldrack; Simon B Eickhoff; Michael Hanke

doi:10.1038/s41597-022-01163-2

Sci Data. 2022 Mar 11;9(1):80. doi: 10.1038/s41597-022-01163-2.

Authors

Adina S Wagner^#¹, Laura K Waite^#², Małgorzata Wierzba^#^{2,3}, Felix Hoffstaedter², Alexander Q Waite², Benjamin Poldrack², Simon B Eickhoff^{2,4}, Michael Hanke^{2,4}

Affiliations

¹ Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich, Jülich, Germany.
² Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich, Jülich, Germany.
³ Laboratory of Brain Imaging, Nencki Institute of Experimental Biology, Polish Academy of Sciences, Warsaw, Poland.
⁴ Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.

^# Contributed equally.

Abstract

Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, and that can be re-executed independently of the original computing infrastructure. We demonstrate the framework's performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
