Using schema.org attributes to define compatible workflow engines

Hello,

I am writing a suite of workflows in our cwl-ica repository.

These workflows were written for ICAv1 (which is being deprecated). We are moving to ICAv2 and testing other CWL-compatible workflow engines, such as Amazon Omics.

Some workflows will be compatible only with ICAv2, and I assume that in future some will be compatible only with Omics or another engine.

Rather than create a separate repository for each workflow engine, I would like to know whether there is an official schema for specifying which workflow engines are appropriate for a given workflow.

# Extensions
$namespaces:
  s: https://schema.org/
$schemas:
  - https://schema.org/version/latest/schemaorg-current-http.rdf

# Metadata
s:compatibleWorkflowEngines:
  - cwltool.local
  - https://ica.illumina.com/ica/rest
  - omics
  - toil

For DRAGEN-specific workflows, https://ica.illumina.com/ica/rest may be the only compatible workflow engine.

Our catalogue (see example) would then be able to denote which engines are compatible by scraping the workflow.
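
As a rough illustration of the scraping step, here is a minimal Python sketch. It assumes each workflow has already been parsed into a dict (for example with a YAML loader) and that the metadata key is the one proposed above; the workflow names and the helper functions are hypothetical, not part of any existing tool.

```python
# Hypothetical sketch: each value in `workflows` stands in for a CWL file
# that has already been parsed into a dict (e.g. by a YAML loader).
ENGINES_KEY = "s:compatibleWorkflowEngines"  # key proposed in this post


def compatible_engines(doc):
    """Return the engines listed in the workflow metadata, or [] if absent."""
    value = doc.get(ENGINES_KEY, [])
    # Accept a single string as well as a list of strings.
    if isinstance(value, str):
        return [value]
    return list(value)


def catalogue_rows(workflows):
    """Map each workflow name to the sorted list of engines it declares."""
    return {name: sorted(compatible_engines(doc)) for name, doc in workflows.items()}


# Hypothetical workflow names, trimmed to the fields that matter here.
workflows = {
    "dragen-germline": {
        "class": "Workflow",
        ENGINES_KEY: ["https://ica.illumina.com/ica/rest"],
    },
    "fastqc": {
        "class": "Workflow",
        ENGINES_KEY: ["cwltool.local", "toil", "omics"],
    },
    "untagged": {"class": "Workflow"},
}

rows = catalogue_rows(workflows)
for name, engines in rows.items():
    print(name, engines or "no engines declared")
```

A workflow without the key simply ends up with an empty list, so the catalogue can flag it as undeclared rather than guessing.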

I would not want anyone to come across our workflows on the likes of GitHub or Dockstore only to realise after testing that the workflow is not compatible with their setup.

Is there a best practice for documenting this?

Would be keen to hear any other solutions / workarounds people have come across as well.

A few thoughts:

  • schema.org is a specific ontology, so you shouldn’t just make up terms. But you can make up your own prefix like this:
$namespaces:
  umccr: https://mdhs.unimelb.edu.au/cwl/
  • Instead of “compatible workflow engines”, I would recommend calling it something like “tested workflow engines”. I’d also recommend using full URLs that are meaningful to other people:
umccr:testedWorkflowEngines:
  - https://github.com/common-workflow-language/cwltool
  - https://ica.illumina.com/ica/rest
  - https://github.com/DataBiosphere/toil
  - omics  # IDK exactly which service this refers to, which is why full URLs are good

While it makes practical sense to list which workflow engines have been tested, CWL workflows ought to be portable unless (a) the engine doesn’t support a standard CWL feature used by the workflow or (b) the workflow has special non-standard requirements (e.g. FPGA support).

In the second case, you should be able to infer from the requirements section which workflows need that special support and you can compare that with a list of which engines support that special requirement.

This isn’t always possible if the requirements are defined at the tool level.
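
To make the comparison concrete, here is a minimal Python sketch under several assumptions: the workflow is already parsed into a dict, any requirement class with a namespace prefix (a colon in its name) is treated as non-standard, and both `example:FPGARequirement` and the engine support table are made up for illustration. It descends into tools embedded inline in a step, which covers part of the tool-level case; steps that reference an external file would need to be resolved first (packing the workflow into a single document is one way).

```python
def _classes_in(node):
    """Requirement and hint class names declared directly on one node."""
    found = set()
    for section in ("requirements", "hints"):
        entries = node.get(section, [])
        if isinstance(entries, dict):
            # Map form: class name -> body
            found.update(entries.keys())
        else:
            # List form: each entry carries a "class" field
            found.update(entry["class"] for entry in entries)
    return found


def requirement_classes(doc):
    """Collect classes from a process, its steps, and any inline step tools."""
    found = _classes_in(doc)
    steps = doc.get("steps", [])
    if isinstance(steps, dict):
        steps = list(steps.values())
    for step in steps:
        found |= _classes_in(step)
        run = step.get("run")
        if isinstance(run, dict):
            found |= requirement_classes(run)
        # A string `run` points at another file and is not inspected here.
    return found


# Hypothetical support table: non-standard requirements each engine handles.
ENGINE_EXTENSIONS = {
    "engine-a": {"example:FPGARequirement"},
    "engine-b": set(),
}

# Trimmed, hypothetical workflow with one inline tool and one external tool.
workflow = {
    "class": "Workflow",
    "requirements": [{"class": "ScatterFeatureRequirement"}],
    "steps": {
        "align": {
            "run": {
                "class": "CommandLineTool",
                "requirements": {"DockerRequirement": {"dockerPull": "example/image"}},
                "hints": [{"class": "example:FPGARequirement"}],
            },
        },
        "report": {"run": "report.cwl"},
    },
}

classes = requirement_classes(workflow)
# Assumption: a namespace prefix marks a non-standard requirement.
special = {name for name in classes if ":" in name}
candidates = sorted(
    engine for engine, supported in ENGINE_EXTENSIONS.items() if special <= supported
)
print(candidates)
```

Whatever `report.cwl` requires stays invisible to this check, which is exactly the limitation noted above.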

The new namespace makes sense, though. Does https://mdhs.unimelb.edu.au/cwl/ have to be a valid site? If so, could it point to a file in GitHub?

Omics refers to Amazon Omics; see “Amazon Omics now supports Common Workflow Language”.

Stable identifiers for some of the engines mentioned:

1. cwltool is RRID:SCR_015528 a.k.a. https://identifiers.org/RRID/RRID:SCR_015528
2. Toil is RRID:SCR_024391 a.k.a. https://identifiers.org/RRID/RRID:SCR_024391
3. Arvados is RRID:SCR_002223 a.k.a. https://identifiers.org/RRID/RRID:SCR_002223
4. CWL-Airflow is RRID:SCR_017196 a.k.a. https://identifiers.org/RRID/RRID:SCR_017196
5. Galaxy is RRID:SCR_006281 a.k.a. https://identifiers.org/RRID/RRID:SCR_006281

No, it doesn’t have to be real. However, it is nice if it does load a page that describes the item being identified.

  • Do we link to these anywhere on commonwl.org? It feels like something that could be mentioned on the “Implementations” page of the Common Workflow Language (CWL) site.
  • Regrettably this has a similar problem to the EDAM ontology (and many ontologies) where the numeric identifiers are extremely user-unfriendly. Maybe we should have a way to define URI aliases in CWL?

It’s possible something like this already works, though I haven’t tried it:

$namespaces:
  umccr: https://mdhs.unimelb.edu.au/cwl/
  cwltool_engine: https://identifiers.org/RRID/RRID:SCR_015528

umccr:testedWorkflowEngines:
  - "cwltool_engine:"
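
For reference, this is what plain prefix substitution would do with those values, as a small Python sketch. The `expand` helper is hypothetical and only mimics the general idea of namespace prefixes; whether a CWL loader actually applies this expansion to values inside a custom metadata list is the untested part.

```python
def expand(name, namespaces):
    """Expand 'prefix:rest' to a full URI by simple prefix substitution."""
    prefix, sep, rest = name.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + rest
    # Already a full URI, or a prefix that was never declared.
    return name


namespaces = {
    "umccr": "https://mdhs.unimelb.edu.au/cwl/",
    "cwltool_engine": "https://identifiers.org/RRID/RRID:SCR_015528",
}

# The alias with an empty local part expands to the bare identifier.
alias = expand("cwltool_engine:", namespaces)
# The metadata key itself expands under the umccr prefix.
key = expand("umccr:testedWorkflowEngines", namespaces)
# A full URL has no declared prefix and passes through unchanged.
plain = expand("https://github.com/DataBiosphere/toil", namespaces)

print(alias)
print(key)
print(plain)
```

If the loader does behave this way, each alias costs one line in $namespaces and keeps the list itself readable.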