Best Practices for Avoiding "Argument List Too Long" in Large-Scale Workflows

Hi everyone,

I’m looking for some clarity on issues I frequently encounter when working with large-scale processing workflows. Here’s the usual scenario:

I have a tool that processes a single file at a time, and I scatter it across an array of thousands of input files. A second tool then performs a joint analysis, taking the complete array of output files from the scatter step as input.
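For concreteness, here is roughly what the pattern looks like (tool_name, joint_tool, the -o flags, and the paths are all placeholders, not my actual commands):

    # Scatter: run the per-file tool once per input
    mkdir -p scattered_outputs
    for f in inputs/*.txt; do
        tool_name "$f" -o "scattered_outputs/$(basename "$f" .txt).out"
    done

    # Gather: the joint-analysis tool needs every output in one invocation
    joint_tool scattered_outputs/*.out -o joint_result.txt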

I’m running into two main problems:

  1. When the tool doesn’t accept a manifest file:
    How can I avoid the “argument list too long” error when passing all the file paths directly on the command line (e.g., tool_name file1 file2 ... with thousands of files)? See the first sketch below.
  2. When the tool does accept a manifest file:
    Even then, I sometimes hit the “argument list too long” error at the Docker level, because the docker run command ends up containing thousands of -v bind-mount flags. See the second sketch below.
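To illustrate problem 1, here is a minimal sketch of the failing shape (joint_tool is a placeholder). The shell builds one argv entry per matched file, and once the combined size of the arguments plus the environment exceeds the kernel’s ARG_MAX, the exec fails with E2BIG (“Argument list too long”):

    # Inspect the kernel limit on the host
    getconf ARG_MAX

    # Failing shape: the glob can expand to tens of thousands of paths.
    # Batching with xargs is not an option here, because the joint
    # analysis needs all files in a single invocation.
    joint_tool scattered_outputs/*.out -o joint_result.txt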
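And for problem 2, the Docker-level failure looks roughly like this (my_image, the manifest, and the mount layout are assumptions about a setup like mine, not an exact reproduction):

    # One -v flag per input file; with thousands of files the assembled
    # docker command line itself exceeds ARG_MAX before the container
    # is ever created.
    mounts=()
    for f in scattered_outputs/*.out; do
        mounts+=( -v "$PWD/$f:/inputs/$(basename "$f"):ro" )
    done
    mounts+=( -v "$PWD/manifest.txt:/inputs/manifest.txt:ro" )

    docker run --rm "${mounts[@]}" my_image \
        joint_tool --manifest /inputs/manifest.txt

The manifest keeps the tool’s own argument list tiny, but every path still has to appear once as a -v flag, so the docker invocation hits the same kernel limit.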

Are there recommended strategies or workarounds for these scenarios? Any advice or best practices would be much appreciated.

Thanks in advance!