A Non-Intimidating Approach to Workflow Reproducibility in Bioinformatics: Adding Metadata to Research Objects through the Design and Evaluation of Use-Focused Extensions to CWLProv

Presenter: Renske de Wit (VU Amsterdam)


  • Session 2 (EMEA-APAC): Wednesday, March 1st, 07:00 - 11:00 UTC / 16:00 - 20:00 Japan Standard Time

In the era of big data and big science, workflows have been proposed as a means to achieve computational reproducibility. CWLProv, a serialization of the Research Object (RO) model, is a machine-accessible format for sharing the results of a workflow execution. In addition to the CWL workflow and the input and output data for all steps, CWLProv RO Bundles contain a record of the execution (the provenance of the results), encoded in RDF. Here, we assess whether the provenance contained in CWLProv is sufficient to address real-life provenance questions, based on a detailed examination of a single bioinformatics workflow as a use case. Distinguishing five use cases for ROs associated with the workflow, we define a taxonomy of the provenance metadata required to address these scenarios. Subsequently, we assess how well the CWLProv community standard represents each component of the taxonomy. Based on the results of this analysis, we propose a standard for the annotation of input data, as well as an extension of the provenance graph to enable richer annotations. We are confident that the methodology applied here can be generalized to other workflows and use cases to identify additional provenance requirements, which, together with the original provenance taxonomy, can inform current and future RO specifications.

Slides: deWit_20230301_CWL_Conference.pdf (Google Drive)

