Dealing with huge inputs in CWL for an AI use case

Hello,

We’re trying to use cwltool to run AI workflows in a digital pathology use case, and to evaluate options for generating compliant provenance information with CWL.

cwltool creates a temporary folder and copies the inputs of a workflow into it before the workflow is executed. The thing is that the size of our input can be hundreds of terabytes (and could potentially grow to petabytes), so copying it to a temporary folder is not really an option for us :slight_smile: Is there a way in cwltool to bypass this “temporary folders” mechanism?

Thanks very much for any answers in advance :slight_smile:
Miro Bezak


Hi @miro-bezak,

cwltool generally uses symlinks and does not copy files, unless you are using features such as InitialWorkDirRequirement with “writable” files. You should verify whether copies are actually being made. It would also help if you could provide more information or a link to your workflow.
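For reference, here is a minimal sketch of the kind of staging that would trigger a copy; the input name `slide_image` is just a placeholder:

```yaml
# Hypothetical fragment: an entry staged with writable: true must be
# copied into the working directory instead of being symlinked.
requirements:
  InitialWorkDirRequirement:
    listing:
      - entry: $(inputs.slide_image)  # placeholder input name
        writable: true                # true => copy; false/absent => a symlink is allowed
```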

Thanks,
Peter

If you do require writable inputs, then a feature we developed for the radio astronomy community will be useful: InplaceUpdateRequirement.
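A minimal sketch of how the two requirements combine, assuming CWL v1.2; the input name and the `touch` command are placeholders for your actual data and tool:

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
  InitialWorkDirRequirement:
    listing:
      - entry: $(inputs.index)  # placeholder: a large file to be updated
        writable: true
  InplaceUpdateRequirement:
    inplaceUpdate: true         # permit modifying the original file; no copy is made
inputs:
  index: File
baseCommand: [touch]            # placeholder for a real in-place update command
arguments: [$(inputs.index.basename)]
outputs:
  updated_index:
    type: File
    outputBinding:
      glob: $(inputs.index.basename)
```

With `inplaceUpdate: true` the runner may modify the original input directly, so the large copy never happens; the trade-off is that an interrupted run can leave the input partially modified, so such workflows cannot be safely restarted.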

Yes, you are right. For the files which are specified with InitialWorkDirRequirement, there is a symlink left in my tmp directory. However, I was not able to inspect the other files in the temporary folder, since they are deleted after execution, so I’m not sure how outputs work in CWL exactly.

For example, if my script just creates one file in its working directory and that file gets globbed into an output, was the file created in the temporary directory and then copied into the designated output directory? During execution it looks something like this: `[job create_map.cwl] /tmp/t343qphu$ python3 \`

If you could point me to a good article about this I would be very grateful, since I am struggling to find information about it.
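For concreteness, here is a stripped-down sketch of the case I mean (the filename `map.json` and the inline Python are made-up placeholders):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [python3, -c]
arguments:
  - "open('map.json', 'w').write('{}')"  # creates a file in the job's working directory
inputs: []
outputs:
  map_file:
    type: File
    outputBinding:
      glob: map.json  # captured as the tool's output after the run
```

Does `map.json` here get created in the `/tmp/...` directory shown in the log, and then moved into the output directory afterwards?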

@miro-bezak If you can provide a specific example we can troubleshoot this with you. Can you share your current CWL description?