How well suited is CWL to batch analysis?

Hi,

I'm just discovering CWL.
So far I like a lot of things about it, but I have to say that I am not yet able to evaluate how well suited it is to my requirements.

I want to run a batch (potentially tens) of analyses on hundreds (possibly thousands) of input datasets.

The way I see it, the workflow tool I use should be able to

  1. create and use an output directory for each dataset,
  2. create a subdirectory for each analysis, in which all output files will be placed,
  3. restart after a crash without re-running the steps that have already been computed,
  4. probably also remove output files on request (e.g. “clean-rmsd” would remove the files relative to the “rmsd” task).

Right now, I don’t see how CWL + cwltool can meet those requirements.
For example, I don’t see how it can manage the output directory constraints.

How would you experts handle this kind of task?

CWL is absolutely designed for large batch analyses. However, you should be aware that there are multiple platforms that can run CWL, and cwltool is intentionally a small-scale, single-node runner. So limitations of cwltool are not necessarily limitations of CWL in general; if you need to scale to dozens, hundreds, or thousands of compute nodes, you may want to use another CWL platform.

I don’t quite understand how your requirements 1 and 2 differ, but what you probably want to do is have your main workflow produce a directory as an output for each dataset, so that the final output is an array of directories.
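
For what it’s worth, here is a minimal sketch of that pattern. It assumes a hypothetical per-dataset (sub)workflow `analyse-dataset.cwl` that emits a single `Directory` output named `result_dir`; those names are placeholders, not anything standard:

```
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  # only needed if analyse-dataset.cwl is itself a workflow
  SubworkflowFeatureRequirement: {}

inputs:
  datasets:
    type: Directory[]      # one input directory per dataset

outputs:
  results:
    type: Directory[]      # one output directory per dataset
    outputSource: analyse/result_dir

steps:
  analyse:
    run: analyse-dataset.cwl   # hypothetical per-dataset analysis
    scatter: dataset           # run once per entry in `datasets`
    in:
      dataset: datasets
    out: [result_dir]
```

Because the step is scattered, `analyse/result_dir` is collected into an array of directories automatically.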

This might be helpful, specifically Lessons 5 and 6, which discuss running on multiple data sets and collecting the results into directories.
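
To give a taste of the directory-collection part (your requirement 2), one common pattern is an ExpressionTool that wraps the files an analysis produced in a Directory literal named after it. This is just a sketch with made-up names (`analysis_name`, `files`, `dir`):

```
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: ExpressionTool

requirements:
  InlineJavascriptRequirement: {}

inputs:
  analysis_name:
    type: string     # e.g. "rmsd"; becomes the subdirectory name
  files:
    type: File[]     # the output files of that analysis

outputs:
  dir:
    type: Directory

# build a Directory literal whose name is the analysis name and
# whose listing is the analysis output files
expression: |
  ${
    return {"dir": {"class": "Directory",
                    "basename": inputs.analysis_name,
                    "listing": inputs.files}};
  }
```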

cwltool has a --cachedir option which records intermediate results; if you re-run after a crash, steps whose cached results are already present will not be re-computed.
To clean up when you’re done, you can just delete the cache directory. Other runners manage intermediate results in their own ways.
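
With cwltool itself, usage looks something like this (the cache path and file names here are placeholders):

```
# first run: step results are stored under ./cwl-cache
cwltool --cachedir ./cwl-cache batch-workflow.cwl inputs.yml

# after a crash, the same command reuses whatever is already in the cache
cwltool --cachedir ./cwl-cache batch-workflow.cwl inputs.yml

# when you are done, removing the cache removes the intermediate results
rm -r ./cwl-cache
```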

Thanks a lot.
I’ll take a deeper look at those areas.

Cheers