Best Practices for Avoiding "Argument List Too Long" in Large-Scale Workflows

Hi everyone,

I’m looking for some clarity on issues I frequently encounter when working with large-scale processing workflows. Here’s the usual scenario:

I have a tool that processes a single file at a time, and I scatter it across an array of thousands of input files. A second tool then performs a joint analysis, taking the complete array of output files from the scatter step as input.
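For concreteness, here is roughly what the pattern looks like (tool_name, joint_tool, the -o flags, and the paths are all placeholders, not my actual commands):

    # Scatter: run the per-file tool once per input
    mkdir -p scattered_outputs
    for f in inputs/*.txt; do
        tool_name "$f" -o "scattered_outputs/$(basename "$f" .txt).out"
    done

    # Gather: the joint-analysis tool needs every output in one invocation
    joint_tool scattered_outputs/*.out -o joint_result.txt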

I’m running into two main problems:

  1. When the tool doesn’t accept a manifest file:
    How can I avoid the “argument list too long” error when passing all the file paths directly on the command line (e.g., tool_name file1 file2 ... with thousands of files)? See the first sketch below.
  2. When the tool does accept a manifest file:
    Even then, I sometimes hit the “argument list too long” error at the Docker level, because the docker run command ends up containing thousands of -v bind-mount flags. See the second sketch below.
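To illustrate problem 1, here is a minimal sketch of the failing shape (joint_tool is a placeholder). The shell builds one argv entry per matched file, and once the combined size of the arguments plus the environment exceeds the kernel’s ARG_MAX, the exec fails with E2BIG (“Argument list too long”):

    # Inspect the kernel limit on the host
    getconf ARG_MAX

    # Failing shape: the glob can expand to tens of thousands of paths.
    # Batching with xargs is not an option here, because the joint
    # analysis needs all files in a single invocation.
    joint_tool scattered_outputs/*.out -o joint_result.txt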
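And for problem 2, the Docker-level failure looks roughly like this (my_image, the manifest, and the mount layout are assumptions about a setup like mine, not an exact reproduction):

    # One -v flag per input file; with thousands of files the assembled
    # docker command line itself exceeds ARG_MAX before the container
    # is ever created.
    mounts=()
    for f in scattered_outputs/*.out; do
        mounts+=( -v "$PWD/$f:/inputs/$(basename "$f"):ro" )
    done
    mounts+=( -v "$PWD/manifest.txt:/inputs/manifest.txt:ro" )

    docker run --rm "${mounts[@]}" my_image \
        joint_tool --manifest /inputs/manifest.txt

The manifest keeps the tool’s own argument list tiny, but every path still has to appear once as a -v flag, so the docker invocation hits the same kernel limit.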

Are there recommended strategies or workarounds for these scenarios? Any advice or best practices would be much appreciated.

Thanks in advance!