I am kinda confused by the idea of groups putting CWL pipelines up on GitHub that simply do not work, are malformed, or are otherwise broken. Are they not running them themselves?
Is `cwltool --validate` really enough to test the validity of a pipeline? I think that just tells you if a file is valid CWL, not necessarily that it “runs” and that it “runs correctly”, right?
For my group, I have relied a lot on Python’s `unittest` module to set up unit tests and integration tests for our CWL files and their workflows: pluto-cwl/tests at master · mskcc/pluto-cwl · GitHub
It's been really helpful for making sure the pipelines don’t suddenly start breaking with new changes and updates, but it also required writing a ton of boilerplate code that is probably outside the scope of the CWL organization. Or maybe not? Maybe some kind of Python library that makes it easier to write custom integration tests for your CWL files would be useful to others?
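To give an idea of what that boilerplate looks like, here is a minimal sketch of a `unittest` integration test that actually runs a tool instead of only validating it. It assumes `cwltool` is on the `PATH` and relies on it printing the output object as JSON on stdout; `run_cwl`, `ECHO_TOOL`, and `TestEchoTool` are names made up for this example and are not taken from pluto-cwl.

```python
import json
import shutil
import subprocess
import tempfile
import unittest
from pathlib import Path

# A tiny, hypothetical CommandLineTool used only for this illustration.
ECHO_TOOL = """\
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
stdout: output.txt
outputs:
  out:
    type: stdout
"""


def run_cwl(cwl_file, job, workdir, runner=("cwltool",)):
    """Run a CWL document with a job object.

    Returns (returncode, outputs), where outputs is the parsed JSON output
    object, or None if the run failed. Output files land in <workdir>/out.
    """
    workdir = Path(workdir)
    job_path = workdir / "job.json"
    job_path.write_text(json.dumps(job))
    out_dir = workdir / "out"
    out_dir.mkdir(exist_ok=True)
    cmd = [*runner, "--outdir", str(out_dir), str(cwl_file), str(job_path)]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    outputs = json.loads(proc.stdout) if proc.returncode == 0 else None
    return proc.returncode, outputs


@unittest.skipUnless(shutil.which("cwltool"), "cwltool is not installed")
class TestEchoTool(unittest.TestCase):
    def test_echo_writes_message(self):
        with tempfile.TemporaryDirectory() as tmp:
            tool = Path(tmp) / "echo.cwl"
            tool.write_text(ECHO_TOOL)
            code, outputs = run_cwl(tool, {"message": "hello"}, tmp)
            # "It runs": the runner exited cleanly.
            self.assertEqual(code, 0)
            # "It runs correctly": the output file has the expected content.
            out_file = Path(tmp) / "out" / outputs["out"]["basename"]
            self.assertEqual(out_file.read_text().strip(), "hello")
```

Saved as a test module, this can be run with `python -m unittest`. The `runner` argument is there so the same helper could be pointed at a different CWL runner, or at a stub when testing the helper itself.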
With regard to the “out of the box experience”, the first big difficulty I had (and still have) with CWL is this:
CWL is actually not code; it's data. This makes writing it a lot less flexible, and it makes introspection difficult and confusing. Beyond the obvious ones, I can never really tell what fields are available in which places in my CWL documents. I can't tell which variables exist in which environments, or what attributes or methods they have. I was reading this page today and found out that I can do calls like `valueFrom: $(self.basename)`; I did not even know that `self` existed, or where it exists, and had no clue there was a `basename` method available or what other methods might also exist. If I were using Python, I would be using a lot of `globals()`, and various API reference docs; I am not sure what the equivalents to these kinds of things are in CWL, so it feels like I am poking around in the dark a lot of the time.
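For what it's worth, my understanding of the CWL spec is that `basename` is a field on the `File` object rather than a method, and that in a `valueFrom` on an input binding `self` is the value of that input. A `File` carries fields such as `class`, `location`, `path`, `basename`, `dirname`, `nameroot`, `nameext`, `size`, and `checksum`. The sketch below only approximates the path-derived ones in plain Python so you can see what an expression like `$(self.basename)` would resolve to; `cwl_file_fields` and the example path are made up for illustration, and a real runner populates these fields itself.

```python
import posixpath


def cwl_file_fields(path):
    """Approximate the path-derived fields of a CWL File object."""
    basename = posixpath.basename(path)
    # nameroot + nameext == basename; nameext is empty or a single ".ext"
    nameroot, nameext = posixpath.splitext(basename)
    return {
        "class": "File",
        "path": path,
        "basename": basename,
        "dirname": posixpath.dirname(path),
        "nameroot": nameroot,
        "nameext": nameext,
    }


fields = cwl_file_fields("/data/sample1.fastq.gz")
for name, value in fields.items():
    # Each key corresponds to what $(self.<name>) would give for a File input
    print(f"self.{name} = {value!r}")
```

This is not a substitute for a real introspection tool, but it at least makes the shape of the object concrete.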
The other difficulty I have is with understanding the staging and execution of the commands I am wrapping up in CWL. For example, in Nextflow, if my pipeline process breaks, I can go into the `work` directory for that job and see everything in one place: all the input files, the commands used to stage them, the commands that set up the execution environment, the fully interpolated shell commands that are going to be run, all the stdout and stderr logs, etc. When I run a CWL workflow and something breaks or doesn't work right, getting all the information I need to fully understand what is happening and debug it is a lot more difficult. This page helps, but it is still just a high-level overview. I am also not sure if there is a way to re-run only the broken pipeline step, in isolation, without restarting the whole workflow.
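The closest thing I am aware of with `cwltool` is a combination of flags: `--debug` for verbose logging, `--leave-tmpdir` so the temporary working directories are not deleted, and `--cachedir` so that intermediate step outputs are cached and a re-run does not have to recompute them. The sketch below just assembles such a command line; `debug_command`, the file names, and the cache directory name are placeholders I made up, and this is not a full equivalent of the Nextflow `work` directory.

```python
import shlex


def debug_command(cwl_file, job_file, cache_dir="cwl-cache"):
    """Build a cwltool command line geared towards debugging a failed run."""
    return [
        "cwltool",
        "--debug",                # verbose logging of staging and execution
        "--leave-tmpdir",         # keep temporary working directories around
        "--cachedir", cache_dir,  # cache intermediate step outputs between runs
        cwl_file,
        job_file,
    ]


cmd = debug_command("workflow.cwl", "job.json")
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

With a cache directory in place, re-running the same command after fixing the broken step should, as far as I understand it, reuse the cached results of the steps that already succeeded.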
I think the last thing that stands out to me is the reliance on third-party execution programs for advanced features like HPC job submission. It would be really great if the official CWL runner could just offer all of these features itself. I understand the “why” behind this decision, but it does not seem like something that is going to give a good experience to people who are searching for a workflow framework that will “just work” for them, find CWL, like it, and then discover that they can't actually run it on their HPC unless they get some other program.