If I have a workflow with a step that makes a directory, and that directory is then passed to another step which scatters over an array of UUIDs to be downloaded into it… but the workflow then returns an array of directories at the end, because any scattered step returns an array… how do I actually do something like this? I have to do something like this because the number of files is so large that any attempt to mount to downstream tools would create a command line argument that is way larger than allowed.
This is the proof of concept wf I was trying to build for a colleague, but it’s obviously not what I really want to be happening.
Since it will return an array of directories, I’m not sure how to do this right. The subworkflow that it calls, I have to do this ugly passing of the directory like this:
Something to try is to take an array of files (or array of Directories) and return a single Directory object with everything in “listing”. The runner will stage those files to the new Directory. Here is a partial example:
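A minimal sketch of that idea, not necessarily the exact example from the original reply: an ExpressionTool that accepts an array of files and returns one Directory whose "listing" holds all of them. The input names `files` and `dirname` and the output name `gathered` are hypothetical, chosen only for illustration.

```yaml
cwlVersion: v1.0
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
inputs:
  files:
    type: File[]      # e.g. the gathered output of a scattered download step
  dirname:
    type: string      # name the new directory should get when it is staged
outputs:
  gathered:
    type: Directory
expression: |
  ${
    return {
      "gathered": {
        "class": "Directory",
        "basename": inputs.dirname,
        "listing": inputs.files
      }
    };
  }
```

Nothing is copied when this runs. The returned object only describes the directory, and the runner stages the files under it when a later step needs them.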
I have to do something like this because the number of files is so large that any attempt to mount to downstream tools would create a command line argument that is way larger than allowed.
In general, when this happens to me, I go the manifest file route. I use a JS expression to generate a manifest file from the list of files in the input and have my tool read the input list from the manifest.
@kmhernan can clarify but I read this as the problem being that the docker command line was getting too long, because of all the -v options for each individual file. That’s a somewhat implementation-specific problem (some runners talk directly to the Docker API and are not subject to command line length limits).
Using InitialWorkDirRequirement to construct a manifest file is usually the best solution to passing a large list of filenames, and I’m currently working on making it a little bit easier in CWL v1.2.
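As a rough sketch of the manifest approach, assuming a hypothetical tool that reads one path per line from `manifest.txt` (here `wc -l` stands in for the real tool, and the names `files`, `count` and `line_count.txt` are made up for the example):

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - entryname: manifest.txt
        # "|-" drops the trailing newline so the whole value is one expression
        entry: |-
          ${
            var paths = inputs.files.map(function (f) { return f.path; });
            return paths.join("\n") + "\n";
          }
baseCommand: [wc, -l]
arguments: [manifest.txt]   # the tool only ever sees this one argument
inputs:
  files:
    type: File[]            # no inputBinding, so nothing is added to the command line
outputs:
  count:
    type: stdout
stdout: line_count.txt
```

This keeps the tool's own command line short no matter how many files there are. It does not by itself change how the runner mounts the input files into the container.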
Correct, that is my problem. I actually use manifest files when tools accept them which was the reason for my other post about making manifest files that needed tabs and new lines haha.
This leads me to a more general concern though: How can I really understand what I can and can’t do within an ExpressionTool? I didn’t think I could create directories out of thin air. Can I use it to rename a file? Seems like I have to use InitialWorkDirRequirement for that. Are these limitations defined anywhere?
ExpressionTools manipulate CWL objects. So you can rename a file or directory; construct new files and directories (perhaps from existing files and directories) and manipulate/create arrays and other complex CWL types.
It is also possible to do all of that in a CommandLineTool, but on most CWL runners ExpressionTools schedule and run faster.
The principle is that you tell CWL what you want and it is the responsibility of the runner to do it for you. That’s the superpower that makes it agnostic to weird storage systems, splitting up steps to run on multiple nodes, etc.
The main things are File literals, Directory literals, and setting basename.
File literals have contents and basename set but no location. They get created on the fly when you need to run a CommandLineTool.
Directory literals have listing but no location. They also get created on the fly when you need to run a CommandLineTool.
The runner uses basename to name a file when it is staged or created on the fly, so you can logically rename a file in an expression by returning a File object with the same location but a different basename. This does not change the name of the file in the underlying storage system.
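To make those three things concrete, here is a small sketch of an ExpressionTool that does a logical rename, builds a File literal, and wraps both in a Directory literal. All input and output names (`infile`, `newname`, `renamed`, `note`, `bundle`) are hypothetical.

```yaml
cwlVersion: v1.0
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
inputs:
  infile:
    type: File
  newname:
    type: string
outputs:
  renamed:
    type: File
  note:
    type: File
  bundle:
    type: Directory
expression: |
  ${
    // Logical rename: same location, different basename.
    var renamed = {
      "class": "File",
      "location": inputs.infile.location,
      "basename": inputs.newname
    };
    // File literal: contents and basename, no location.
    var note = {
      "class": "File",
      "basename": "note.txt",
      "contents": "renamed from " + inputs.infile.basename + "\n"
    };
    // Directory literal: listing and basename, no location.
    var bundle = {
      "class": "Directory",
      "basename": "bundle",
      "listing": [renamed, note]
    };
    return {"renamed": renamed, "note": note, "bundle": bundle};
  }
```

Building a fresh File object for the rename, rather than editing the input in place, avoids carrying over derived fields such as nameroot and nameext that would no longer match the new basename. It also drops any secondaryFiles, so copy those across explicitly if you need them.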
This is all described in the specification but it is pretty dense. We are getting a grant to improve the documentation so we’re hoping we’ll be able to expand the user guide to cover more topics like this.
@tetron or @mrc It seems like the expression tool works and all, but the next step that takes the directory the expression tool makes seems to still be trying to mount every individual file, which would defeat the purpose of this (too many files to individually mount and the command line would be too large). I need to be able to get these hundreds of files into a directory and then just mount the directory, but this actually doesn’t seem possible. Thoughts?
I see. I think the best solution would be for cwltool to behave in a more scalable way for large numbers of inputs. Either there’s a way to pass the list of volume mounts to Docker via a file, or it could materialize the input staging by copying or hardlinking files and then it would only have to mount a single directory into the container.
Exceeding the command line length is somewhat specific to cwltool or other runners that invoke docker using the command line instead of the API or run it some other way. For example I have a CWL pipeline I run on Arvados that accepts an array of 7000 file inputs and it doesn’t have this problem, but it probably would if I ran it with cwltool.
The ugly workaround I can think of is to divide your list of Directories into smaller subsets, and have a step that simply copies input to output and produces a single directory as output, so then your downstream step has fewer directory inputs (where each of those directories holds a subset of the original array). Does that make sense?
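The "divide into smaller subsets" part of that workaround could be sketched as an ExpressionTool like the one below. The names `dirs`, `batch_size` and `batches` are hypothetical, and the copy step that would be scattered over `batches` to produce one real directory per batch is not shown.

```yaml
cwlVersion: v1.0
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
inputs:
  dirs:
    type: Directory[]
  batch_size:
    type: int          # must be a positive number
outputs:
  batches:
    type:
      type: array
      items:
        type: array
        items: Directory
expression: |
  ${
    var size = Math.max(1, inputs.batch_size);
    var batches = [];
    for (var i = 0; i < inputs.dirs.length; i += size) {
      batches.push(inputs.dirs.slice(i, i + size));
    }
    return {"batches": batches};
  }
```

Each inner array stays small enough to mount in one go, and the downstream step then receives one copied directory per batch instead of the full original array.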