I’m fairly new to CWL. I’ve been able to get my head around some of the ways that CWL works and have read the examples, etc. I’m trying to build a pipeline in which all steps are executed within Docker containers. I have several steps working. However, I am stumped on something super simple. Maybe this seems silly, but I’m trying to remove a file. Below is my workflow file.
- class: DockerRequirement
- class: InitialWorkDirRequirement
- entry: $(inputs.the_file)
Below is my job file:
It says it was successful, but the file is still there. What am I missing? Is there a simpler/better way?
Hello @srp33, and welcome!
Can you share with us why you want to delete a file?
If some step produces a file you don’t need, simply leave it out of the final Workflow outputs; most CWL implementations will then delete that intermediate file.
Since CWL is concerned with reproducible analysis, what happens in each step is supposed to only produce new data/files, not change or delete existing data or files. So when you run your example, a copy of the input file the_file is made (because it is marked writable: true), and it is that copy that gets deleted.
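To make the copy behaviour concrete, here is a minimal sketch of a tool that stages its input with writable: true and then runs rm on it. This is not the original poster's file; the input name the_file comes from the post, everything else (including the empty outputs and the omitted DockerRequirement) is an assumption for illustration.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: rm
requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.the_file)
        # writable: true asks the runner to stage a private, modifiable
        # copy of the input in this step's working directory.
        writable: true
inputs:
  the_file:
    type: File
    inputBinding:
      position: 1   # rm receives the path of the staged copy
outputs: []
```

Run this way, rm should succeed, but only the staged copy inside the step's working directory is removed; the file named in the job file is left untouched.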
I am gunzipping a file a certain way that keeps the original file. Then I want to remove the original file. After you said this, I realized that I could tweak the way I am gunzipping it so that I don’t need to remove it. But this feels restrictive to me. I wish I could just remove a file rather than having to rework my logic.
Suppose I have a workflow that includes various steps for trimming and aligning FASTQ files and then calling variants. I understand that CWL can store the intermediate files (for example, trimmed but unaligned FASTQ files) in temp folders within the Docker container and then delete those automatically. But what if I wanted to deliberately store those intermediate files (in case something fails) outside the container and later go back and delete those intermediate files as the last step in my workflow? Am I just missing the vision of what CWL is all about?
A few notes.
In your example, the “writable: true” gives you a copy of the file, so you’ve only deleted the copy.
The vision of CWL is to enable writing portable, reproducible computations. So you do not want to modify the inputs, because then you could get different results the next time. Also, if you are running on a platform where the inputs are read-only, or copied from a cloud bucket, or otherwise not a conventional file system, modifying inputs with “rm” would not do what you want.
For the same reason, you should let the CWL runner manage intermediate files for you; that way it can use them to cache intermediate results and restart a failed computation halfway through. If you want to look at the intermediate files, you should make them actual workflow outputs.
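As a rough sketch of what "make them actual workflow outputs" could look like for the trimming and alignment case mentioned earlier: the tool files trim.cwl and align.cwl and all port names below are hypothetical, chosen only to show the wiring.

```yaml
cwlVersion: v1.2
class: Workflow
inputs:
  reads: File
outputs:
  # Intermediate result deliberately exposed, so the runner keeps it.
  trimmed_reads:
    type: File
    outputSource: trim/trimmed
  # Final result of the workflow.
  aligned_reads:
    type: File
    outputSource: align/aligned
steps:
  trim:
    run: trim.cwl        # hypothetical tool description
    in:
      reads: reads
    out: [trimmed]
  align:
    run: align.cwl       # hypothetical tool description
    in:
      reads: trim/trimmed
    out: [aligned]
```

Dropping the trimmed_reads entry from outputs would turn that file back into an ordinary intermediate that the runner is free to clean up.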
You may find that the file wrangling you are used to doing in your scripts isn’t actually necessary in CWL.
You may also find that for a short linear sequence of commands, you can just use a small shell script as a CWL step. A workflow step that only invokes “gunzip” or “rm” might be too low level.
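One possible way to wrap a small shell script as a single step is to write the script into the working directory with InitialWorkDirRequirement and invoke it with sh. The sketch below is an assumption-laden illustration: the script name unpack.sh and the port names archive and unpacked are made up, and it assumes the input has a .gz suffix so that nameroot yields the decompressed file name.

```yaml
cwlVersion: v1.2
class: CommandLineTool
requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entryname: unpack.sh      # hypothetical script name
        entry: |
          set -e
          # Decompress to a new file; the input itself is never modified.
          gunzip -c "$1" > "$2"
          # Further commands of the short sequence could follow here.
baseCommand: [sh, unpack.sh]
inputs:
  archive:
    type: File
    inputBinding:
      position: 1                 # becomes "$1" in the script
arguments:
  - position: 2                   # becomes "$2" in the script
    valueFrom: $(inputs.archive.nameroot)
outputs:
  unpacked:
    type: File
    outputBinding:
      glob: $(inputs.archive.nameroot)
```

Because the script only writes a new file and declares it as an output, there is nothing left to delete afterwards: the compressed input stays where it was, and the runner handles the step's working directory.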
That makes sense. I think I need to adjust my thinking a bit to the CWL way of doing things.