How well suited is CWL to batch analysis?


I'm just discovering CWL.
So far I like a lot of things about it, but I have to say that I can't yet evaluate how well suited it is to my requirements.

I want to run a batch (potentially tens) of analyses on hundreds (possibly thousands) of input datasets.

The way I see it, the workflow tool should be able to:

  1. create and use an output directory for each dataset,
  2. create a subdirectory for each analysis, in which all output files will be placed,
  3. restart after a crash without re-running the steps that have already been computed,
  4. ideally, remove output files on request (e.g. a “clean-rmsd” target would remove the files produced by the “rmsd” task).

Right now, I don’t see how CWL + cwltool can meet those requirements.
For example, I don’t see how it can manage the output directory constraints.

How would you, experts, handle this kind of task?

CWL is absolutely designed for large batch analysis. However, you should be aware that there are multiple platforms that can run CWL, and cwltool is intentionally a small-scale, single-node runner. So limitations of cwltool are not necessarily limitations of CWL in general – if you need to scale to dozens, hundreds, or thousands of compute nodes, you may want to use another CWL platform, such as Toil or Arvados.

I don’t quite understand how those first two requirements differ, but what you probably want is to have your main workflow produce a directory for each dataset, so that the final output is an array of directories.
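A minimal sketch of that shape, assuming a hypothetical `analyse.cwl` CommandLineTool that emits a `Directory` of results for one dataset (all names here are illustrative):

```yaml
cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  datasets:
    type: File[]            # one entry per input dataset

steps:
  analyse:
    run: analyse.cwl        # hypothetical tool producing a Directory output
    scatter: dataset        # run once per dataset
    in:
      dataset: datasets
    out: [results]

outputs:
  all_results:
    type: Directory[]       # one output directory per input dataset
    outputSource: analyse/results
```

Because the step is scattered, the single `Directory` output of `analyse.cwl` is collected into a `Directory[]` at the workflow level, which is the “array of directories” layout described above.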

This might be helpful, specifically Lessons 5 and 6, which discuss running on multiple data sets and collecting the results into directories.

cwltool has a --cachedir option which records intermediate results, so a rerun after a crash will skip steps that have already completed.
To clean up when you’re done, you can just delete the cache directory. Other runners manage intermediate results in their own ways.
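For concreteness, a usage sketch (the workflow and job file names are placeholders; `--cachedir` is a real cwltool flag):

```shell
# Run with caching enabled; re-running the same command after a crash
# reuses cached results instead of recomputing finished steps.
cwltool --cachedir ./cwl-cache workflow.cwl job.yml

# "Cleaning up" is simply deleting the cache directory
# (coarser-grained than a per-task clean like "clean-rmsd"):
rm -rf ./cwl-cache
```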

Thanks a lot.
I’ll have a deeper look at those areas.