Syntactic Sugar for CWL with the POLUS Workflow Inference Compiler

jfennick · February 26, 2023, 8:40am

Presenter: Jake Fennick, National Center for Advancing Translational Sciences, Axle Research and Technologies

Session 1: (Americas-EMEA) Monday, February 27th 09:00 - 13:00 US EST / 14:00 - 18:00 UTC
Session 3: (APAC-Americas) Thursday, March 2nd, 19:00 - 23:00 US EST
Friday, March 3rd, 00:00 - 04:00 UTC / 09:00 - 13:00 Japanese Standard Time

The Common Workflow Language is a powerful standard for specifying command line tool-based workflows. However, like most workflow specifications, it requires the dependencies between all steps forming the directed acyclic graph (DAG) to be explicitly defined. This is not a problem for workflows that consist of a small number of monolithic steps, but for more complex workflows that contain many small steps, the verbosity can become prohibitive. The POLUS Workflow Inference Compiler is a YAML-based domain-specific language that compiles to CWL and aims to address this shortcoming. It can automatically infer non-ambiguous edges using type and file format matching. Inference is not limited to linear pipelines and can produce an arbitrary DAG. Subworkflows are fully supported, and the inference is guaranteed to work across subworkflow boundaries. Since inference is not always unique, explicit edges are supported with a lightweight syntax. Publication-quality GraphViz DAGs are automatically generated. For long-running workflow steps (e.g. simulations), realtime analysis is supported by iteratively and speculatively executing an arbitrary subworkflow in a separate CWL runner. Automatically inserting missing workflow steps (e.g. file format conversions) is supported by speculatively inserting arbitrary subworkflows, drawn from a whitelist, between steps at compile time. VSCode IntelliSense code completion is fully supported, and a KNIME-style graphical user interface is in development.
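
To make the idea of inferring edges by type and file format matching concrete, here is a minimal Python sketch. It is not the compiler's actual algorithm or API: the `Port` and `Step` classes, the `infer_edges` function, the "search earlier steps, most recent first" rule, and the step and format names (`pdb`, `gro`, `trr`) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Port:
    """Hypothetical input or output of a step."""
    name: str
    type: str
    format: Optional[str] = None


@dataclass
class Step:
    """Hypothetical workflow step with typed inputs and outputs."""
    name: str
    inputs: List[Port] = field(default_factory=list)
    outputs: List[Port] = field(default_factory=list)


def infer_edges(steps: List[Step]) -> List[Tuple[str, str]]:
    """Connect each input to a matching output of an earlier step.

    Assumed rule: walk backwards from the previous step and take the
    first step with exactly one output of the same type and format.
    More than one match in the same step is treated as ambiguous.
    """
    edges = []
    for i, step in enumerate(steps):
        for inp in step.inputs:
            for prev in reversed(steps[:i]):
                matches = [o for o in prev.outputs
                           if o.type == inp.type and o.format == inp.format]
                if len(matches) > 1:
                    raise ValueError(
                        f"ambiguous match for {step.name}/{inp.name}; "
                        "an explicit edge is needed")
                if len(matches) == 1:
                    edges.append((f"{prev.name}/{matches[0].name}",
                                  f"{step.name}/{inp.name}"))
                    break
            # Inputs with no match are simply left unconnected here.
    return edges


# Illustrative four-step workflow; no edges are written by hand.
steps = [
    Step("download", outputs=[Port("structure", "File", "pdb")]),
    Step("convert",
         inputs=[Port("structure", "File", "pdb")],
         outputs=[Port("coords", "File", "gro")]),
    Step("minimize",
         inputs=[Port("coords", "File", "gro")],
         outputs=[Port("trajectory", "File", "trr")]),
    Step("analyze",
         inputs=[Port("coords", "File", "gro"),
                 Port("trajectory", "File", "trr")]),
]

edges = infer_edges(steps)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

In this toy example, `analyze` receives its coordinates from `convert` and its trajectory from `minimize`, so the inferred graph is a DAG rather than a linear pipeline. The `ValueError` branch corresponds to the cases where inference is not unique and an explicit edge would be required.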

Please leave your questions for the presenter below!

Slides: Syntactic Sugar for CWL with the_POLUS Workflow Inference Compiler.pptx - Google Slides

As an alternative to YouTube, this presentation is also available on ConfTube

brunokinoshita · February 27, 2023, 2:37pm

Hi, great talk! I saw a “backend” in the presentation, in the demo. It was switched to another backend. Is that like a Workflow Registry, or more like a Batch Scheduler (like Slurm)?

Related question: is your group using any workflow registry to store & retrieve workflows? Anything like WorkflowHub, perhaps?

Thanks!

jfennick · February 27, 2023, 3:25pm

https://workflow-inference-compiler.readthedocs.io/en/latest/userguide.html#backend-independence

In this context, backends refer to subworkflows that are in some sense the “same”. We have a syntax for switching between these subworkflows. If the inputs/outputs of the subworkflows are not in fact identical, we need to patch up the difference. In my demo, I’m using two curated file format conversion steps and inference to achieve that.

There are many other cool features I did not have time to present, so feel free to explore the documentation.

We have been thinking about how to share our workflows and we are still exploring options. If we want to share workflows directly in our YAML format, we would need https://workflowhub.eu/ to add support, etc.

Alternatively, we could share the compiled CWL. Since we include the exact version of the compiler that was used to generate each CWL file, this could possibly allow decompiling the CWL into our YAML format, and thus might allow using existing CWL workflow hubs.

jfennick · February 27, 2023, 4:07pm

@tetron You mentioned process generator: cwltool/processgen.rst at 5947fd2f07e02bf90dd9799ceeaef6fde76e90a5 · common-workflow-language/cwltool · GitHub

This has an example for generating a CommandLineTool, but is there an example of generating a non-trivial Workflow?

jfennick · March 3, 2023, 5:27pm

One point I forgot to stress in the demo is that, without inference, file format conversions would require users to manually delete and re-create numerous edges. Now consider how much manual work it would be if we wanted to switch between all possible choices of backends for all workflow steps. There could be hundreds or thousands of edges! Inference opens up brand new possibilities, such as doing protocol-level regression testing in the CI. Inference is not actually about edges; it's about composability.