Specifying custom hints for specific workflow engines

alexiswl · October 3, 2023, 3:02am

On a similar theme to "Using schema.org attributes to define compatible workflow engines": ICAv2 has three options for its scratch storage space, small, medium and large (1.2 TB, 2.4 TB and 7.2 TB respectively). The scratch space is specified when creating the workflow definition but can be overridden at runtime.

Is there a way to specify in the hints of the workflow definition the appropriate scratch storage size needed?

I don’t expect this to be picked up by ICAv2 when uploading a workflow, but there would be a couple of use cases for this:

A new user would be able to use this information when manually generating a CWL pipeline in ICAv2.
An automation system would be able to scrape that information from a workflow when creating a CWL ICAv2 pipeline.
An automation system would be able to tag the workflow with a specific ICAv2 scratch space size for documentation purposes.
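
To illustrate the "scrape" use case, here is a minimal sketch of how an automation system might read such a hint from a workflow that has already been loaded into a plain dict. The hint class `ica:ScratchStorage`, its `size` field and the namespace URL are all hypothetical names invented for this example; neither CWL nor ICAv2 defines them. The only thing relied on from CWL itself is that hints can be written as a list or a map and that engines ignore hints they do not recognise.

```python
from typing import Optional

# Hypothetical hint class, made up for illustration only.
HINT_CLASS = "ica:ScratchStorage"


def read_scratch_hint(workflow: dict) -> Optional[str]:
    """Return the 'size' field of the hypothetical scratch hint, or None."""
    hints = workflow.get("hints") or []
    if isinstance(hints, dict):
        # Map form: {"ica:ScratchStorage": {"size": "medium"}}
        hints = [{"class": name, **(body or {})} for name, body in hints.items()]
    for hint in hints:
        if isinstance(hint, dict) and hint.get("class") == HINT_CLASS:
            return hint.get("size")
    return None


# A toy workflow document as it would look after YAML/JSON parsing.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "$namespaces": {"ica": "https://example.org/ica#"},  # placeholder URL
    "hints": [{"class": "ica:ScratchStorage", "size": "medium"}],
    "inputs": {},
    "outputs": {},
    "steps": {},
}

print(read_scratch_hint(workflow))
```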

tetron · December 6, 2023, 11:05pm

Is there a reason you can’t use ResourceRequirement tmpdirMin and/or outdirMin to choose the size of scratch space?

alexiswl · December 8, 2023, 6:01am

This would be for the entire workflow though, not just a tool. Do you think this would be appropriate at the workflow level?

tetron · December 8, 2023, 10:44pm

Is the idea that you need to know the greatest resource requirements of any step, so you can allocate a machine type that is big enough to run the whole thing?

I think what you want is some code that will load the entire document (following all the dependencies) and reach down to all the individual CommandLineTools to get the ResourceRequirement sections. If you’re working in Python that should be pretty easy to write using cwl-utils.
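
A rough sketch of that idea follows. It does not use cwl-utils; to stay dependency-free it assumes the workflow has already been packed into a single document (for example with `cwltool --pack`) and parsed into plain dicts. The function names are made up for this example. It also ignores ResourceRequirement overrides set on workflow steps and does not evaluate expressions, so treat it as a starting point rather than a complete solution.

```python
from typing import Any, Iterator, Optional


def iter_tools(node: Any) -> Iterator[dict]:
    """Yield every CommandLineTool dict reachable from a packed CWL document."""
    if isinstance(node, dict):
        if node.get("class") == "CommandLineTool":
            yield node
            return
        for value in node.values():
            yield from iter_tools(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tools(item)


def find_requirement(process: dict, class_name: str) -> Optional[dict]:
    """Look in 'requirements' first, then 'hints', for the given class."""
    for key in ("requirements", "hints"):
        entries = process.get(key) or []
        if isinstance(entries, dict):
            # Map form: {"ResourceRequirement": {...}}
            entries = [{"class": n, **(b or {})} for n, b in entries.items()]
        for entry in entries:
            if isinstance(entry, dict) and entry.get("class") == class_name:
                return entry
    return None


def collect_resource_requirements(doc: dict) -> dict:
    """Map each tool id to its ResourceRequirement fields (without 'class')."""
    result = {}
    for index, tool in enumerate(iter_tools(doc)):
        req = find_requirement(tool, "ResourceRequirement")
        if req is not None:
            tool_id = tool.get("id", f"tool_{index}")
            result[tool_id] = {k: v for k, v in req.items() if k != "class"}
    return result


# Toy packed document; tmpdirMin and outdirMin are in mebibytes per the CWL spec.
packed = {
    "cwlVersion": "v1.2",
    "$graph": [
        {"class": "Workflow", "id": "#main", "steps": [
            {"id": "#main/align", "run": "#align.cwl"},
            {"id": "#main/sort", "run": "#sort.cwl"},
        ]},
        {"class": "CommandLineTool", "id": "#align.cwl", "requirements": [
            {"class": "ResourceRequirement", "outdirMin": 200000, "tmpdirMin": 50000},
        ]},
        {"class": "CommandLineTool", "id": "#sort.cwl", "hints": {
            "ResourceRequirement": {"outdirMin": 100000},
        }},
    ],
}

requirements = collect_resource_requirements(packed)
print(requirements)
```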

alexiswl · December 9, 2023, 1:21am

Hmm the cumulation of steps might work as an idea?

But this is more the overall scratch space for the workflow. ICAv2 works by downloading all the inputs to a local disk and then running cwltool locally (replacing docker with kubernetes). The user must specify how large that local disk is when launching the workflow.

So it’s not so much about what a given tool might use in scratch space, but more correlated to the sum of the workflow inputs plus the output files of all workflow steps. I think the tool /tmp blowouts can be ignored in this case, as each tool is run through kubernetes and each task will have its own isolated scratch space.

Does this make sense?

tetron · December 13, 2023, 5:50pm

So ideally, at the time the workflow is launched, your code can look at the size of the inputs and sum that up along with the outdirMin requirements of the individual steps in order to predict the disk size it’ll need.
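
As a sketch of that calculation: add the input file sizes (in bytes) to the outdirMin values of the steps (in mebibytes, per the CWL spec), then pick the smallest scratch option that fits. The tier capacities come from the sizes quoted at the top of this thread, and treating them as decimal terabytes is an assumption. The function names are made up for this example, and scattered steps, which produce one output directory per job, are not accounted for.

```python
from typing import Iterable

MIB = 2**20  # outdirMin is expressed in mebibytes

# Capacities from this thread, assumed here to be decimal terabytes.
SCRATCH_TIERS = [
    ("small", 1_200_000_000_000),
    ("medium", 2_400_000_000_000),
    ("large", 7_200_000_000_000),
]


def estimate_scratch_bytes(input_sizes_bytes: Iterable[int],
                           outdir_min_mib: Iterable[int]) -> int:
    """Sum of input file sizes plus the outdirMin reservation of every step."""
    return sum(input_sizes_bytes) + sum(outdir_min_mib) * MIB


def pick_scratch_tier(required_bytes: int) -> str:
    """Return the name of the smallest tier that can hold required_bytes."""
    for name, capacity in SCRATCH_TIERS:
        if required_bytes <= capacity:
            return name
    raise ValueError(f"{required_bytes} bytes exceeds the largest scratch tier")


# Two inputs of 500 GB and 300 GB; two steps reserving 400000 and 200000 MiB.
needed = estimate_scratch_bytes(
    input_sizes_bytes=[500 * 10**9, 300 * 10**9],
    outdir_min_mib=[400_000, 200_000],
)
tier = pick_scratch_tier(needed)
print(needed, tier)
```

In practice the input sizes would come from the storage system at launch time, and it may be worth adding a safety margin before choosing the tier.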