Too many arguments on the command line

Hi all, I am using CWL version 1.0 to describe both the steps and the workflow.
When a step takes a directory array (Directory[]) as input and I run the workflow with Singularity,
I get a "Too many arguments on the command line" error if the file list is too long. I am currently running the workflow with Toil.
Is there a way around this problem?
Thank you,
Mattia

Welcome @matmanc!

Can you share the CWL tool description you are using and what happens when you execute it using the CWL reference runner, cwltool? (Or an excerpt from your toil-cwl-runner logs.)

Sure, here is the tool description:

class: CommandLineTool
cwlVersion: v1.0
id: check_ateam_separation
baseCommand:
  - python3
  - /usr/local/bin/check_Ateam_separation.py
inputs:
  - id: ms
    type:
      - Directory
      - type: array
        items: Directory
    inputBinding:
      position: 0
    doc: Input measurement set
  - default: Ateam_separation.png
    id: output_image_name
    type: string?
    inputBinding:
      position: 2
      prefix: '--outputimage'
  - id: min_separation
    type: int
    inputBinding:
      position: 1
      prefix: '--min_separation'
outputs:
  - id: output_imag
    doc: Output image
    type: File?
    outputBinding:
      glob: $(inputs.output_image_name)
  - id: logfile
    type: File?
    outputBinding:
      glob: Ateam_separation.log
label: check_Ateam_separation
hints:
  - class: DockerRequirement
    dockerPull: lofareosc/prefactor:HBAcalibrator
  - class: InlineJavascriptRequirement
stdout: Ateam_separation.log

Unfortunately I didn't save the log, so I will paste it here soon.

The problem is that Singularity binds every single file in every single directory, resulting in a very long command line that triggers an OS error, since the size of the argument list is limited by the kernel.
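
For reference, the kernel limit can be checked from a shell; the exact value varies by system:

# ARG_MAX is the kernel's combined size limit for argv plus the environment
getconf ARG_MAX
# often prints 2097152 (2 MiB) on Linux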

Here it is

[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'file:///project/astroneosc/Software/prefactor3-cwl/lofar-cwl/steps
/check_ateam_separation.cwl#check_ateam_separation' python3 /usr/local/bin/check_Ateam_separation.py kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_se
paration.cwl_check_ateam_separation/instance-r0brc7sq
[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] Log from job kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separ
ation/instance-r0brc7sq follows:
=========>
        /table.dat:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpsvnitjnw.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f4_TSM0:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpfw7whow7.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.info:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp1o7zp770.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.f0:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp8clmv0ww.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.dat:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpj09nien5.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f0:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpnmj0wmwj.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.info:ro \
            --bind \
            /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp3xtw2kh7.tmp
[...]
--pwd \
            /vWWYEQ \
            /project/astroneosc/Software/prefactor3.simg \
            python3 \
            /usr/local/bin/check_Ateam_separation.py \
            /var/lib/cwl/stg7dbf09a8-5fa9-48ed-b4c2-fe2eb29f266a/L755125_SB000_uv.MS \
            /var/lib/cwl/stgf721eb4f-99dd-4b87-8fe6-9e250a7317ce/L755125_SB002_uv.MS \
            /var/lib/cwl/stg5b01d833-66e4-4dcd-9877-d33c7b7cd5b9/L755125_SB005_uv.MS \
            /var/lib/cwl/stg55dc76f5-ce12-462c-bffd-1f6d2e4d66bb/L755125_SB004_uv.MS \
            /var/lib/cwl/stgf8d851aa-9737-4120-af95-241e53ef984b/L755125_SB006_uv.MS \
            /var/lib/cwl/stg2d1ebe28-a26f-4442-a4f7-e9fb177653fe/L755125_SB007_uv.MS \
            /var/lib/cwl/stgf4f9f1b2-855c-433f-87fc-f9b57e69f060/L755125_SB013_uv.MS \
            /var/lib/cwl/stgfbef4e7b-b90e-49eb-bb24-bc9074dec3ef/L755125_SB010_uv.MS \
[...]
 --min_separation \
            30 \
            --outputimage \
            Ateam_separation.png > /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6
df42315/tu3w97mbq/tmp-outcd3vx2cf/Ateam_separation.log
        [2020-12-02T11:21:43+0100] [MainThread] [E] [cwltool] Exception while running job
        Traceback (most recent call last):
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 394, in _execute
            default_stderr=runtimeContext.default_stderr,
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 955, in _job_popen
            universal_newlines=True,
          File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
            restore_signals, start_new_session)
          File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
            raise child_exception_type(errno_num, err_msg, err_filename)
        OSError: [Errno 7] Argument list too long: 'singularity'
        [2020-12-02T11:21:43+0100] [MainThread] [W] [cwltool] [job check_ateam_separation] completed permanentFail
        [2020-12-02T11:21:45+0100] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid th
e chance of failure due to incorrectly requested resources. Job files/for-job/kind-CWLWorkflow/instance-rcfqyxlv/cleanup/file-5wk8511s/stream used 2725.25% (81.8 GB [87786401792B] used
, 3.0 GB [3221225472B] requested) at the end of its run.
        Traceback (most recent call last):
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/worker.py", line 368, in workerScript
            job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1424, in _runner
            returnValues = self._run(jobGraph, fileStore)
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1361, in _run
            return self.run(fileStore)
          File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 988, in run
            raise cwltool.errors.WorkflowException(status)
        cwltool.errors.WorkflowException: permanentFail
        [2020-12-02T11:21:45+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host wn-db-02.novalocal

I redacted the boring parts.

Looks like the loop in https://github.com/common-workflow-language/cwltool/blob/78fe9d41ee5a44f8725dfbd7028e4a5ee42949cf/cwltool/job.py#L685 is responsible, as it doesn't check whether a File or Directory is part of a Directory that has already been added.

A possible workaround might be to change cwlVersion: v1.0 to cwlVersion: v1.1, as in v1.1 the default is to not enumerate the contents of Directory objects unless requested: https://www.commonwl.org/v1.1/CommandLineTool.html#Changelog
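
In v1.1 you can also pin this behaviour explicitly rather than relying on the default; a minimal sketch using LoadListingRequirement from the v1.1 spec:

class: CommandLineTool
cwlVersion: v1.1
requirements:
  - class: LoadListingRequirement
    loadListing: no_listing  # do not enumerate the contents of Directory inputs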

Should I change all the cwl files including the workflow or only this specific step?

Just that step. What version of Toil are you using?

So the version I am using is 4.2.0.

That version of Toil shouldn’t have a problem with cwlVersion: v1.1 as it is the latest version: https://pypi.org/project/toil/

I hope it works!

In case it doesn't, is there a way to work around the issue?
Maybe an InitialWorkDirRequirement?

So, it seems that the problem persists even though I changed it to version 1.1.
Could it be caused by some gather step?

Here is the log

    01+0100] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'file:///project/astroneosc/Software/prefactor3-cwl/lofar-cwl/steps
    /check_ateam_separation.cwl#check_ateam_separation' python3 /usr/local/bin/check_Ateam_separation.py kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_se
    paration.cwl_check_ateam_separation/instance-twjhltat
    [2020-12-02T12:23:01+0100] [MainThread] [W] [toil.leader] Log from job kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separ
    ation/instance-twjhltat follows:
    =========>
            /table.dat:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpo4m7rlpd.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/table.f4_TSM0:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpq7kwy461.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.info:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmppc2l734s.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.f0:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpfq7oakbk.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.dat:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpz184dsvg.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/table.f0:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmp5qv654ri.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.info:ro \
                --bind \
                /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpbctexdj6.tmp
    :/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.f0:ro \
    [...]
           --pwd \
                /DxeKMf \
                /project/astroneosc/Software/prefactor3.simg \
                python3 \
                /usr/local/bin/check_Ateam_separation.py \
                /var/lib/cwl/stg15b2528c-1b4a-4f9f-a7a3-bf8e43d389b9/L755125_SB000_uv.MS \
                /var/lib/cwl/stg409537df-ebbe-4ae3-bb80-0c13caba3302/L755125_SB002_uv.MS \
                /var/lib/cwl/stgbb834c53-5f1e-4197-8714-fd207a152788/L755125_SB005_uv.MS \
                /var/lib/cwl/stgadad000d-0856-4b13-b956-d14fa22404e1/L755125_SB004_uv.MS \
                /var/lib/cwl/stg842b4160-f25d-4ba3-942d-cd77fb7a71e4/L755125_SB006_uv.MS \
                /var/lib/cwl/stg3bc51532-d697-4acc-8350-a04e03d22bfb/L755125_SB007_uv.MS \
                /var/lib/cwl/stg5668b9bb-177b-4dcc-a626-d7ea33b4f22e/L755125_SB013_uv.MS \
                /var/lib/cwl/stg46adea18-2eaa-4a67-8ad7-26b0adcda346/L755125_SB010_uv.MS \
                /var/lib/cwl/stg25502d25-4aff-4a21-976d-e8abc18dc865/L755125_SB011_uv.MS \
                /var/lib/cwl/stg67986c8f-e1b3-4c5c-9d0b-9384fcd6d609/L755125_SB008_uv.MS \
    [...]
                --min_separation \
                30 \
                --outputimage \
                Ateam_separation.png > /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a340
    86a7c90/tb1hlymy6/tmp-outt7xt8yfi/Ateam_separation.log
            [2020-12-02T12:22:57+0100] [MainThread] [E] [cwltool] Exception while running job
            Traceback (most recent call last):
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 394, in _execute
                default_stderr=runtimeContext.default_stderr,
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 955, in _job_popen
                universal_newlines=True,
              File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
                restore_signals, start_new_session)
              File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
                raise child_exception_type(errno_num, err_msg, err_filename)
            OSError: [Errno 7] Argument list too long: 'singularity'
            [2020-12-02T12:22:57+0100] [MainThread] [W] [cwltool] [job check_ateam_separation] completed permanentFail
            [2020-12-02T12:22:57+0100] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid th
    e chance of failure due to incorrectly requested resources. Job files/for-job/kind-CWLWorkflow/instance-zznwcosd/cleanup/file-cauu9w_p/stream used 2725.25% (81.8 GB [87786401792B] used
    , 3.0 GB [3221225472B] requested) at the end of its run.
            Traceback (most recent call last):
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/worker.py", line 368, in workerScript
                job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1424, in _runner
                returnValues = self._run(jobGraph, fileStore)
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1361, in _run
                return self.run(fileStore)
              File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 988, in run
                raise cwltool.errors.WorkflowException(status)
            cwltool.errors.WorkflowException: permanentFail
            [2020-12-02T12:22:57+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host wn-db-03.novalocal

That’s cool, I didn’t know about SINGULARITY_BIND; yeah, that would probably work

I tried to change cwltool, but it fails anyway with the same error. I am really confused :confused:
I am now trying to create a CommandLineTool that copies everything into one directory and then just passes that single directory along. But I don't know if that will work either. After that I am out of ideas.
Do you have any?

Try changing everything (workflow and tools) to cwlVersion: v1.1 and then try it with cwltool again.

I will try to make a small failing example tomorrow and then post it here. I already tried converting everything to v1.1, but it didn't work.

Does toil have an explicit notion of directories in its file store abstraction, or is it still up to toil-cwl-runner to enumerate directory contents?

Some method of passing a list of binds to singularity that isn’t constrained by command line limits would be a good solution.

I fear that is the case. In fact, the job store is created as a bunch of single files in a nested directory structure.

There is the SINGULARITY_BIND environment variable that you can set to do that. But when I tried to change the singularity.py module in cwltool, I still got the error. I have a feeling that the environment variable and the command end up on the same command line. Do you know which file I should be looking at?
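
For reference, the idea is to move the bind specs off the command line and into the environment, which Singularity reads as a comma-separated list (paths here are made up):

# Bind specs in the environment instead of per-file --bind flags on argv
export SINGULARITY_BIND="/host/data/a.MS:/var/lib/cwl/a.MS:ro,/host/data/b.MS:/var/lib/cwl/b.MS:ro"
singularity exec image.sif ls /var/lib/cwl

Note that on Linux the environment is counted against the same kernel limit as the arguments, so a very long bind list may still hit a ceiling.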

Do you know if using InitialWorkDirRequirement would solve the issue?

create_files.cwl

class: CommandLineTool
cwlVersion: v1.1
$namespaces:
  sbg: 'https://www.sevenbridges.com/'
id: create_files_cwl
baseCommand:
  - bash
inputs: []
outputs:
  - id: output
    type: 'Directory[]'
    outputBinding:
      glob: out/*
label: create_files.cwl
arguments:
  - prefix: ''
    shellQuote: false
    position: 0
    valueFrom: script.sh
requirements:
  - class: ShellCommandRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entryname: script.sh
        entry: |-
          #!/bin/bash


          for i in {1..2024}
          do
              for k in {1..20}
              do
                  mkdir -p out/$i/$k/
                  for s in {1..30}
                  do
                      touch out/$i/$k/$s
                  done
              done
          done
        writable: false

find.cwl

class: CommandLineTool
cwlVersion: v1.1
id: find
baseCommand:
  - ls
inputs:
  - id: input
    type: 'Directory[]'
    inputBinding:
      shellQuote: false
      position: 0
outputs:
  - id: output
    type: File?
    outputBinding:
      glob: stout
label: find
requirements:
  - class: ShellCommandRequirement
  - class: DockerRequirement
    dockerPull: 'ubuntu:latest'
stdout: stout

test_too_many_files_workflow.cwl

class: Workflow
cwlVersion: v1.0
id: test_too_many_arguments
label: test_too_many_arguments
inputs: []
outputs:
  - id: output
    outputSource:
      - find/output
    type: File?
steps:
  - id: find
    in:
      - id: input
        source:
          - create_files_cwl/output
    out:
      - id: output
    run: ./find.cwl
    label: find
  - id: create_files_cwl
    in: []
    out:
      - id: output
    run: ./create_files.cwl
    label: create_files.cwl
requirements:
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement

So I think the problem is that the Toil file store only handles single files, not directories. The CWL layer makes up for this by enumerating the files itself and reconstructing the directory structure when it is time to run a tool. It reconstructs the directory either by making a bunch of symlinks (when not running inside a container) or by setting up bind mounts (when using containers).

Unfortunately it is with that second approach that you are running into scaling issues.

The InitialWorkDirRequirement uses the same strategies so I don’t think it will work.

Ideas for solutions (need to fix in cwltool and then update toil):

  • Some way of invoking singularity with an arbitrary number of binds
  • Construct the directory structure outside the container by copying the files, then bind mount only the directory
  • Construct the directory structure outside the container by hardlinking the files, then bind mount only the directory (see the sketch after this list)
  • Construct the directory structure outside the container with symlinks, and bind mount the file store itself into the container so that the symlinks remain valid
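
For the hardlink variant, a rough sketch of what a runner could do (all paths hypothetical; hardlinks require the file store and the staging area to be on the same filesystem):

# Rebuild the staged input tree with hardlinks instead of copies
cp -al /path/to/filestore/L755125_SB000_uv.MS /staging/inputs/L755125_SB000_uv.MS
# A single bind then covers the whole tree, however many files it contains
singularity exec --bind /staging/inputs:/var/lib/cwl image.sif \
    python3 /usr/local/bin/check_Ateam_separation.py /var/lib/cwl/L755125_SB000_uv.MS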

Unfortunately I don't have a good workaround on the user side, except for trying a different runner. For example, cwltool handles Directories differently than Toil, in a way that might scale better.