Improving the out of the box CWL experience

kaushik-work · November 10, 2020, 5:28pm

A colleague was trying to import and use RNA-seq CWL pipelines from GitHub as an exercise in understanding non-SB created CWL content. She found issues in almost every pipeline she imported, some during import and some at run time. For example, in one case she discovered a subprocess with a required input that was not exposed. In another case she found a pipeline from a reputable center that was simply malformed (the YAML was wrong).

On one hand, CWL is code like any other, so we can’t be responsible for the quality of all code out there, just as there can be non-functional Python code on GitHub.

On the other hand, it is important for users to have a good out of the box experience with CWL.

To this end we should think about:

Encouraging developers to adopt good development practices, such as CI, to raise the quality of their pipelines. We could offer them badges for, say, passing cwltool --validate or other tests.
Offering users an online service to check the validity of a workflow on GitHub, perhaps by running cwltool --validate on CWL passed via URL.
Raising awareness with research centers of the importance of the research code, including CWL, that they release publicly.

Just wanted to restart this discussion.

Thanks!

mrc · November 10, 2020, 5:30pm

Agreed! How about some templates for popular CI services to cwltool --validate all *.cwl files? (using the latest cwltool version, not some pinned ancient version).
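As a starting point, here is a rough sketch of what such a CI step could look like in Python. It only assumes that cwltool is installed and on the PATH; the function names (`find_cwl_files`, `validate_all`, `main`) are made up for illustration and are not part of cwltool. The `run` parameter is there only so the logic can be exercised without cwltool installed.

```python
import subprocess
import sys
from pathlib import Path


def find_cwl_files(root):
    """Return every *.cwl file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.cwl"))


def validate_all(root, run=subprocess.run):
    """Run `cwltool --validate` on each file and return the paths that failed."""
    failed = []
    for path in find_cwl_files(root):
        result = run(["cwltool", "--validate", str(path)])
        if result.returncode != 0:
            failed.append(path)
    return failed


def main(argv):
    """Validate everything under argv[0] (default: current directory)."""
    root = argv[0] if argv else "."
    failures = validate_all(root)
    for path in failures:
        print(f"FAILED: {path}", file=sys.stderr)
    # A non-zero exit code is what makes the CI job fail.
    return 1 if failures else 0
```

A CI job would then install the latest cwltool and end the script with `sys.exit(main(sys.argv[1:]))`. Note that this only checks that the documents are valid CWL, not that they run correctly.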

steve · February 8, 2021, 6:59pm

I am kinda confused by the idea of groups putting CWL pipelines up on GitHub that simply do not work, or are malformed or otherwise broken. Are they not running them themselves?

Is cwltool --validate really enough to test the validity of a pipeline? I think that just tells you if a file is valid CWL, not necessarily that it “runs” and that it “runs correctly”, right?

For my group, I have relied a lot on Python’s unittest module to set up unit tests and integration tests for our CWL files and workflows: https://github.com/mskcc/pluto-cwl/tree/master/tests
It’s been really helpful for making sure the pipelines don’t suddenly start breaking with new changes & updates, but it also required writing a ton of boilerplate code that is probably outside the scope of the CWL organization. Or maybe not? Maybe some kind of Python library that makes it easier to write custom integration tests for your CWL files would be useful to others?

Regarding the “out of the box experience”, the first big difficulty I had (and still have) with CWL is this:

CWL is actually not code; it’s data. This makes writing it a lot less flexible, and it makes introspection difficult and confusing. Beyond the obvious ones, I can never really tell what fields are available in which places in my CWL documents. I can’t tell which variables exist in which environments, or what attributes or methods they have. I was reading this page today and found out that I can do calls like valueFrom: $(self.basename); I did not even know that self existed, or where it exists, and had no clue there was a basename method available or what other methods might also exist. If I was using Python, I would be using a lot of dir(x), type(x), globals(), and various API reference docs. I am not sure what the equivalents of these are in CWL, so it feels like I am poking around in the dark a lot of the time.

The other difficulty I have is with understanding the staging and execution of the commands I am wrapping up in CWL. For example, in Nextflow, if my pipeline process breaks, I can go into the work directory for that job and see everything in one place: all the input files, the commands used to stage them, the commands that set up the execution environment, the fully interpolated shell commands that are going to be run, all the stdout and stderr logs, etc. When I run a CWL workflow and something breaks or doesn’t work right, getting all the information I need to fully understand what is happening and debug it is a lot more difficult. This page helps but is still just a high-level overview. I am also not sure if there is a way to re-run only the broken pipeline step, in isolation, without restarting the whole workflow.

I think the last thing that stands out to me is the reliance on 3rd party execution programs for advanced features like HPC job submission. It would be really great if the official CWL runner could just offer all the features itself. I understand the “why” behind this decision, but it does not seem like something that is going to facilitate a good experience for people who are searching for a workflow framework that will “just work” for them, find CWL, like it, and then discover that they can’t actually run it on their HPC unless they get some other program.

mrc · February 9, 2021, 5:52am

Mostly, see also @kaushik-work’s report at Writing Portable CWL & Rabix Benten update

Are you using Benten? That can provide code completion for CWL in many text editors.

Yeah, the CWL user guide is not yet an exhaustive reference. For that you’ll need Common Workflow Language (CWL) Command Line Tool Description, v1.0.2 & Common Workflow Language (CWL) Workflow Description, v1.2 ; that being said, enhancements to the CWL user guide are very welcome!
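As a concrete example of what the spec gives you: `basename` is a field of the CWL `File` object rather than a method, and it sits alongside `dirname`, `nameroot` and `nameext`. In a `valueFrom` expression, `self` is the value of the parameter being bound, and `inputs` and `runtime` are also available in expressions. The Python sketch below only mimics how the name-related fields relate to a path, as I read the spec; `cwl_file_fields` is a made-up helper and not something cwltool provides.

```python
import posixpath


def cwl_file_fields(path):
    """Mimic the name-related fields CWL derives for a File object."""
    dirname, basename = posixpath.split(path)
    # Only the final extension is split off, so nameroot + nameext == basename.
    nameroot, nameext = posixpath.splitext(basename)
    return {
        "class": "File",
        "path": path,
        "dirname": dirname,
        "basename": basename,
        "nameroot": nameroot,
        "nameext": nameext,
    }


fields = cwl_file_fields("/data/sample1.fastq.gz")
print(fields["basename"])  # sample1.fastq.gz
print(fields["nameroot"])  # sample1.fastq
print(fields["nameext"])   # .gz
```

So `$(self.basename)` on that input would give `sample1.fastq.gz`, and `$(self.nameroot)` would give `sample1.fastq`.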

Those details are often workflow engine/platform specific. Here is what the CWL standards themselves have to say: Common Workflow Language (CWL) Workflow Description, v1.2

For example, the CWL reference runner, cwltool, has a variety of options to help with this: --debug, --cachedir some/path and then inspecting the results.
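If you drive cwltool from Python, a small wrapper along these lines keeps the debugging flags in one place. This is only a sketch: `debug_command`, the file names, and the `cwl-cache` directory are placeholders I picked for illustration, and the only flags used are the two mentioned above.

```python
import shlex


def debug_command(workflow, job, cachedir="cwl-cache"):
    """Build a cwltool invocation with verbose logging and a cache directory."""
    return ["cwltool", "--debug", "--cachedir", cachedir, workflow, job]


cmd = debug_command("workflow.cwl", "job.yml")
print(shlex.join(cmd))
# cwltool --debug --cachedir cwl-cache workflow.cwl job.yml
```

After a failed run you can look inside the cache directory for the intermediate outputs of each step. As far as I understand it, re-running with the same `--cachedir` lets cwltool reuse the results of steps that already completed instead of recomputing them.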

Check out GitHub - common-workflow-language/cwltool: Common Workflow Language reference implementation

@kaushik-work and others have proposed that the toil-cwl-runner become the CWL runner that we recommend by default to new users instead of cwltool. I’m not against this!

[thanks for sharing, I appreciate it]