Recording provenance of workflow runs with RO-Crate

PLoS One. 2024 Sep 10;19(9):e0309210. doi: 10.1371/journal.pone.0309210. eCollection 2024.

Abstract

Recording the provenance of scientific computation results is key to the support of traceability, reproducibility and quality assessment of data products. Several data models have been explored to address this need, providing representations of workflow plans and their executions as well as means of packaging the resulting information for archiving and sharing. However, existing approaches tend to lack interoperable adoption across workflow management systems. In this work we present Workflow Run RO-Crate, an extension of RO-Crate (Research Object Crate) and Schema.org to capture the provenance of the execution of computational workflows at different levels of granularity and bundle together all their associated objects (inputs, outputs, code, etc.). The model is supported by a diverse, open community that runs regular meetings, discussing development, maintenance and adoption aspects. Workflow Run RO-Crate is already implemented by several workflow management systems, allowing interoperable comparisons between workflow runs from heterogeneous systems. We describe the model, its alignment to standards such as W3C PROV, and its implementation in six workflow systems. Finally, we illustrate the application of Workflow Run RO-Crate in two use cases of machine learning in the digital image analysis domain.
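As a rough illustration of the idea described in the abstract, the following Python sketch assembles a minimal RO-Crate metadata document in which a Schema.org `CreateAction` links a workflow to the inputs it consumed and the outputs it produced. It is a simplified sketch under stated assumptions, not the normative Workflow Run RO-Crate profile: the file names (`workflow.cwl`, `input.csv`, `output.png`), the run identifier `#run-1`, and the helper `build_run_crate` are hypothetical, and a real crate would carry further required properties (names, licences, profile declarations).

```python
import json


def build_run_crate(workflow_path, input_paths, output_paths):
    """Return a minimal RO-Crate style JSON-LD dict describing one workflow run.

    All identifiers passed in are hypothetical; this helper is illustrative only.
    """
    run_id = "#run-1"  # assumed local identifier for the run
    data_paths = [*input_paths, *output_paths]

    graph = [
        # Metadata descriptor: declares which RO-Crate version the file follows.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: bundles the workflow, its inputs and its outputs.
        {
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": p} for p in [workflow_path, *data_paths]],
            "mainEntity": {"@id": workflow_path},
            "mentions": [{"@id": run_id}],
        },
        # The workflow definition itself.
        {
            "@id": workflow_path,
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        },
        # The run: which tool (instrument) turned which inputs (object)
        # into which outputs (result).
        {
            "@id": run_id,
            "@type": "CreateAction",
            "instrument": {"@id": workflow_path},
            "object": [{"@id": p} for p in input_paths],
            "result": [{"@id": p} for p in output_paths],
        },
    ]
    # One File entity per input and output.
    graph += [{"@id": p, "@type": "File"} for p in data_paths]

    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = build_run_crate("workflow.cwl", ["input.csv"], ["output.png"])
print(json.dumps(crate, indent=2))
```

Because every workflow system would emit the same kind of `CreateAction` entity, a consumer can compare runs from different systems by reading `instrument`, `object` and `result` without knowing which engine produced the crate.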

MeSH terms

  • Machine Learning
  • Reproducibility of Results
  • Software
  • Workflow*

Grants and funding

The authors acknowledge funding from: Sardinian Regional Government through the XData Project (S.L., L.P.); Spanish Government (contract PID2019-107255GB) (R.S.); MCIN/AEI/10.13039/501100011033 (CEX2021-001148-S) (R.S.); Generalitat de Catalunya (contract 2021-SGR-00412) (R.S.); European High-Performance Computing Joint Undertaking (JU) (No 955558) (R.S.); EU Horizon research and innovation programme under Grant agreement No 101058129 (DT-GEO) (R.S.); ELIXIR Platform Task 2022-2023 funding for Task “Container Orchestration” (A.K.); Research Foundation - Flanders (FWO) for ELIXIR Belgium (I000323N and I002819N) (P.D.G.); Multiannual Agreement with Universidad Politecnica de Madrid in the line Support for R&D projects for Beatriz Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and Technological Innovation) (D.G.); Comunidad de Madrid through the call Research Grants for Young Investigators from Universidad Politecnica de Madrid (D.G.); ICSC - Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by European Union - NextGenerationEU (I.C.); ACROSS project, HPC Big Data Artificial Intelligence Cross Stack Platform Towards Exascale, funded by the European High-Performance Computing Joint Undertaking (JU) under G.A. n. 955648 (I.C.); EUPEX project, European Pilot for Exascale, funded by the European High-Performance Computing Joint Undertaking (JU) under G.A. n. 101033975 (I.C.); Life Science Database Integration Project, NBDC (National Bioscience Database Center) of Japan Science and Technology Agency (T.O.); European Commission Horizon 2020 825575 (European Joint Programme on Rare Diseases; SC1-BHC-04-2018 Rare Disease European Joint Programme Cofund) (L.R.N., J.M.F., S.C.G.), 955558 (eFlows4HPC) (R.S.), 823830 (BioExcel-2) (S.S.R.), 824087 (EOSC-Life) (S.L., L.R.N., P.D.G., R.W., L.P., J.M.F., S.C.G., S.S.R.); Horizon Europe 101046203 (BY-COVID) (S.L., L.R.N., P.D.G., R.W., L.P., J.M.F., S.C.G., S.S.R.), 101057388 (EuroScienceGateway) (P.D.G., J.M.F., S.C.G., S.S.R.), 101057344 (FAIR-IMPACT) (D.G., S.S.R.); UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee 10038963 (EuroScienceGateway), 10038992 (FAIR-IMPACT) (S.S.R.). H.S. is founder and CEO of the software company Sator Inc., Tokyo, which did not fund the present work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.