Hi all, I am using CWL version 1.0 to describe both the steps and the workflow.
If a step takes a directory array (Directory[]) as input and I run the workflow with Singularity, then when the file list is too long I get a "Too many arguments" error on the command line. I am currently running the workflow with Toil.
Is there a way around this problem?
Thank you,
Mattia
Welcome @matmanc!
Can you share the CWL tool description you are using and what happens when you execute it using the CWL reference runner, cwltool? (Or an excerpt from your toil-cwl-runner logs.)
Sure, here is the tool description:
class: CommandLineTool
cwlVersion: v1.0
id: check_ateam_separation
baseCommand:
  - python3
  - /usr/local/bin/check_Ateam_separation.py
inputs:
  - id: ms
    type:
      - Directory
      - type: array
        items: Directory
    inputBinding:
      position: 0
    doc: Input measurement set
  - default: Ateam_separation.png
    id: output_image_name
    type: string?
    inputBinding:
      position: 2
      prefix: '--outputimage'
  - id: min_separation
    type: int
    inputBinding:
      position: 1
      prefix: '--min_separation'
outputs:
  - id: output_imag
    doc: Output image
    type: File?
    outputBinding:
      glob: $(inputs.output_image_name)
  - id: logfile
    type: File?
    outputBinding:
      glob: Ateam_separation.log
label: check_Ateam_separation
hints:
  - class: DockerRequirement
    dockerPull: lofareosc/prefactor:HBAcalibrator
  - class: InlineJavascriptRequirement
stdout: Ateam_separation.log
Unfortunately I didn't save the log, so I will paste it here soon.
The problem is that Singularity binds every single file in every single directory, producing a very long command line that triggers an OS error, since the size of the argument list is limited by the kernel.
Here it is
[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'file:///project/astroneosc/Software/prefactor3-cwl/lofar-cwl/steps
/check_ateam_separation.cwl#check_ateam_separation' python3 /usr/local/bin/check_Ateam_separation.py kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_se
paration.cwl_check_ateam_separation/instance-r0brc7sq
[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] Log from job kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separ
ation/instance-r0brc7sq follows:
=========>
/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpsvnitjnw.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f4_TSM0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpfw7whow7.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp1o7zp770.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp8clmv0ww.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpj09nien5.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpnmj0wmwj.tmp
:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp3xtw2kh7.tmp
[...]
--pwd \
/vWWYEQ \
/project/astroneosc/Software/prefactor3.simg \
python3 \
/usr/local/bin/check_Ateam_separation.py \
/var/lib/cwl/stg7dbf09a8-5fa9-48ed-b4c2-fe2eb29f266a/L755125_SB000_uv.MS \
/var/lib/cwl/stgf721eb4f-99dd-4b87-8fe6-9e250a7317ce/L755125_SB002_uv.MS \
/var/lib/cwl/stg5b01d833-66e4-4dcd-9877-d33c7b7cd5b9/L755125_SB005_uv.MS \
/var/lib/cwl/stg55dc76f5-ce12-462c-bffd-1f6d2e4d66bb/L755125_SB004_uv.MS \
/var/lib/cwl/stgf8d851aa-9737-4120-af95-241e53ef984b/L755125_SB006_uv.MS \
/var/lib/cwl/stg2d1ebe28-a26f-4442-a4f7-e9fb177653fe/L755125_SB007_uv.MS \
/var/lib/cwl/stgf4f9f1b2-855c-433f-87fc-f9b57e69f060/L755125_SB013_uv.MS \
/var/lib/cwl/stgfbef4e7b-b90e-49eb-bb24-bc9074dec3ef/L755125_SB010_uv.MS \
[...]
--min_separation \
30 \
--outputimage \
Ateam_separation.png > /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6
df42315/tu3w97mbq/tmp-outcd3vx2cf/Ateam_separation.log
[2020-12-02T11:21:43+0100] [MainThread] [E] [cwltool] Exception while running job
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 394, in _execute
default_stderr=runtimeContext.default_stderr,
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 955, in _job_popen
universal_newlines=True,
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'singularity'
[2020-12-02T11:21:43+0100] [MainThread] [W] [cwltool] [job check_ateam_separation] completed permanentFail
[2020-12-02T11:21:45+0100] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid th
e chance of failure due to incorrectly requested resources. Job files/for-job/kind-CWLWorkflow/instance-rcfqyxlv/cleanup/file-5wk8511s/stream used 2725.25% (81.8 GB [87786401792B] used
, 3.0 GB [3221225472B] requested) at the end of its run.
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/worker.py", line 368, in workerScript
job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1424, in _runner
returnValues = self._run(jobGraph, fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1361, in _run
return self.run(fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 988, in run
raise cwltool.errors.WorkflowException(status)
cwltool.errors.WorkflowException: permanentFail
[2020-12-02T11:21:45+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host wn-db-02.novalocal
I redacted the boring stuff.
Looks like the loop in https://github.com/common-workflow-language/cwltool/blob/78fe9d41ee5a44f8725dfbd7028e4a5ee42949cf/cwltool/job.py#L685 is responsible, as it doesn't check whether a File or Directory is part of a Directory that has already been added.
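To illustrate the missing check, here is a rough sketch (a hypothetical helper, not cwltool's actual code) of pruning bind specs whose container target is already covered by a bound parent Directory; it assumes the host tree mirrors the container layout, as it does when a whole Directory has been staged:

# Hypothetical sketch: drop binds whose container target already lies
# inside a Directory that is itself bound.
from typing import List, Tuple

def prune_binds(binds: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """binds is a list of (host_path, container_path) pairs."""
    kept: List[Tuple[str, str]] = []
    # Visit shorter container paths first so parents precede children.
    for host, target in sorted(binds, key=lambda b: len(b[1])):
        covered = any(
            target == t or target.startswith(t.rstrip("/") + "/")
            for _, t in kept
        )
        if not covered:  # keep only binds not covered by a parent bind
            kept.append((host, target))
    return kept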
A possible workaround might be to change cwlVersion: v1.0 to cwlVersion: v1.1, as in v1.1 we default to not enumerating the contents of Directory objects unless requested: https://www.commonwl.org/v1.1/CommandLineTool.html#Changelog
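For reference, a v1.1 tool can also state that default explicitly; this minimal sketch uses LoadListingRequirement from the v1.1 spec (loadListing accepts no_listing, shallow_listing, or deep_listing):

# Minimal v1.1 sketch: ask the runner not to enumerate Directory contents.
class: CommandLineTool
cwlVersion: v1.1
requirements:
  - class: LoadListingRequirement
    loadListing: no_listing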
Should I change all the CWL files, including the workflow, or only this specific step?
Just that step. What version of Toil are you using?
The version I am using is 4.2.0.
That version of Toil shouldn't have a problem with cwlVersion: v1.1, as it is the latest version: https://pypi.org/project/toil/
I hope it works!
In case it doesn't, is there a way to work around the issue? Maybe an InitialWorkDirRequirement?
So, it seems that the problem persists even though I changed it to version 1.1. Could it be caused by some gather step?
Here is the log
01+0100] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'file:///project/astroneosc/Software/prefactor3-cwl/lofar-cwl/steps
/check_ateam_separation.cwl#check_ateam_separation' python3 /usr/local/bin/check_Ateam_separation.py kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_se
paration.cwl_check_ateam_separation/instance-twjhltat
[2020-12-02T12:23:01+0100] [MainThread] [W] [toil.leader] Log from job kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separ
ation/instance-twjhltat follows:
=========>
/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpo4m7rlpd.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/table.f4_TSM0:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpq7kwy461.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmppc2l734s.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpfq7oakbk.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpz184dsvg.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmp5qv654ri.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a34086a7c90/tmpbctexdj6.tmp
:/var/lib/cwl/stg4615cab6-2798-45b4-9f0c-ab3bb9f64241/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.f0:ro \
[...]
--pwd \
/DxeKMf \
/project/astroneosc/Software/prefactor3.simg \
python3 \
/usr/local/bin/check_Ateam_separation.py \
/var/lib/cwl/stg15b2528c-1b4a-4f9f-a7a3-bf8e43d389b9/L755125_SB000_uv.MS \
/var/lib/cwl/stg409537df-ebbe-4ae3-bb80-0c13caba3302/L755125_SB002_uv.MS \
/var/lib/cwl/stgbb834c53-5f1e-4197-8714-fd207a152788/L755125_SB005_uv.MS \
/var/lib/cwl/stgadad000d-0856-4b13-b956-d14fa22404e1/L755125_SB004_uv.MS \
/var/lib/cwl/stg842b4160-f25d-4ba3-942d-cd77fb7a71e4/L755125_SB006_uv.MS \
/var/lib/cwl/stg3bc51532-d697-4acc-8350-a04e03d22bfb/L755125_SB007_uv.MS \
/var/lib/cwl/stg5668b9bb-177b-4dcc-a626-d7ea33b4f22e/L755125_SB013_uv.MS \
/var/lib/cwl/stg46adea18-2eaa-4a67-8ad7-26b0adcda346/L755125_SB010_uv.MS \
/var/lib/cwl/stg25502d25-4aff-4a21-976d-e8abc18dc865/L755125_SB011_uv.MS \
/var/lib/cwl/stg67986c8f-e1b3-4c5c-9d0b-9384fcd6d609/L755125_SB008_uv.MS \
[...]
--min_separation \
30 \
--outputimage \
Ateam_separation.png > /project/astroneosc/Data/tmp/node-dd6b59bf-01ca-44f4-8c4f-f1c40c45ab7e-b7562762-34ca-4a24-bcdd-d6a811d0201b/tmp6udp2o3w/5dc3e6bd-9858-40e2-9f6d-4a340
86a7c90/tb1hlymy6/tmp-outt7xt8yfi/Ateam_separation.log
[2020-12-02T12:22:57+0100] [MainThread] [E] [cwltool] Exception while running job
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 394, in _execute
default_stderr=runtimeContext.default_stderr,
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 955, in _job_popen
universal_newlines=True,
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'singularity'
[2020-12-02T12:22:57+0100] [MainThread] [W] [cwltool] [job check_ateam_separation] completed permanentFail
[2020-12-02T12:22:57+0100] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid th
e chance of failure due to incorrectly requested resources. Job files/for-job/kind-CWLWorkflow/instance-zznwcosd/cleanup/file-cauu9w_p/stream used 2725.25% (81.8 GB [87786401792B] used
, 3.0 GB [3221225472B] requested) at the end of its run.
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/worker.py", line 368, in workerScript
job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1424, in _runner
returnValues = self._run(jobGraph, fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1361, in _run
return self.run(fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 988, in run
raise cwltool.errors.WorkflowException(status)
cwltool.errors.WorkflowException: permanentFail
[2020-12-02T12:22:57+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a
failed job on host wn-db-03.novalocal
That's cool, I didn't know about SINGULARITY_BIND; yeah, that would probably work.
I tried to change cwltool, but it fails anyway with the same error. I am really confused.
I am now trying to create a CommandLineTool that copies everything into one directory and then just passes that single directory along. But I don't know if that would work either. After that I am out of ideas.
Do you have any?
Try changing everything (workflow and tools) to cwlVersion: v1.1 and then try it with cwltool again.
I will try to make a small example that fails and post it here tomorrow. I already tried converting everything to v1.1, but it didn't work.
Does toil have an explicit notion of directories in its file store abstraction, or is it still up to toil-cwl-runner to enumerate directory contents?
Some method of passing a list of binds to singularity that isn’t constrained by command line limits would be a good solution.
I fear that is the case. In fact, the job store is created as a bunch of single files in a nested directory structure.
There is the SINGULARITY_BIND environment variable that you can set to do that. But when I tried to change the singularity.py module in cwltool, I still got the error; I have a feeling the environment variable and the command end up on the same command line. Do you know which file I should be looking at?
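For what it's worth, the idea would look roughly like this sketch (illustrative paths, not cwltool's actual code). One caveat: on Linux, argv and the environment draw from the same kernel ARG_MAX budget, so a very long SINGULARITY_BIND value can still hit a limit:

import os
import subprocess

# Sketch: pass bind specs via SINGULARITY_BIND (a comma-separated list)
# instead of repeated --bind arguments on the command line.
binds = [
    "/host/tmp/tmpabc.tmp:/var/lib/cwl/stg1/table.dat:ro",
    "/host/tmp/tmpdef.tmp:/var/lib/cwl/stg1/table.f0:ro",
]
env = dict(os.environ)
env["SINGULARITY_BIND"] = ",".join(binds)
subprocess.run(
    ["singularity", "exec", "image.sif", "python3", "script.py"],
    env=env,
    check=True,
)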
Do you know if using the InitialWorkDirRequirement would solve the issue?
create_files.cwl
class: CommandLineTool
cwlVersion: v1.1
$namespaces:
  sbg: 'https://www.sevenbridges.com/'
id: create_files_cwl
baseCommand:
  - bash
inputs: []
outputs:
  - id: output
    type: 'Directory[]'
    outputBinding:
      glob: out/*
label: create_files.cwl
arguments:
  - prefix: ''
    shellQuote: false
    position: 0
    valueFrom: script.sh
requirements:
  - class: ShellCommandRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entryname: script.sh
        entry: |-
          #!/bin/bash
          for i in {1..2024}
          do
            for k in {1..20}
            do
              mkdir -p out/$i/$k/
              for s in {1..30}
              do
                touch out/$i/$k/$s
              done
            done
          done
        writable: false
find.cwl
class: CommandLineTool
cwlVersion: v1.1
id: find
baseCommand:
  - ls
inputs:
  - id: input
    type: 'Directory[]'
    inputBinding:
      shellQuote: false
      position: 0
outputs:
  - id: output
    type: File?
    outputBinding:
      glob: stout
label: find
requirements:
  - class: ShellCommandRequirement
  - class: DockerRequirement
    dockerPull: 'ubuntu:latest'
stdout: stout
test_too_many_files_workflow.cwl
class: Workflow
cwlVersion: v1.0
id: test_too_many_arguments
label: test_too_many_arguments
inputs: []
outputs:
  - id: output
    outputSource:
      - find/output
    type: File?
steps:
  - id: find
    in:
      - id: input
        source:
          - create_files_cwl/output
    out:
      - id: output
    run: ./find.cwl
    label: find
  - id: create_files_cwl
    in: []
    out:
      - id: output
    run: ./create_files.cwl
    label: create_files.cwl
requirements:
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement
So I think the problem is that the toil file store only handles single files, not directories. The CWL layer makes up for this by enumerating the files itself and reconstructing the directory structure when it is time to run a tool. It reconstructs the directory either by making a bunch of symlinks (when not running inside a container) or by setting up bind mounts (when using containers).
Unfortunately, it is the second approach that is giving you scaling issues.
The InitialWorkDirRequirement uses the same strategies, so I don't think it will help.
Ideas for solutions (need to fix in cwltool and then update toil):
- Some way of invoking singularity with an arbitrary number of binds
- Construct the directory structure outside the container by copying the files, then bind mount only the directory
- Construct the directory structure outside the container by hardlinking the files, then bind mount only the directory (a sketch of these two options follows this list)
- Construct the directory structure outside the container by bind mounting the file store into the container so that symlinks are valid
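A rough sketch of what the copy/hardlink options could look like (a hypothetical helper; staged_files stands in for the runner's staging plan):

import os
import shutil

def materialize(staged_files, root):
    """Build one real tree under root from (host_path, relative_target)
    pairs, so Singularity needs a single --bind for root instead of one
    bind per file."""
    for host_path, rel_target in staged_files:
        dest = os.path.join(root, rel_target)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        try:
            os.link(host_path, dest)       # hardlink when on the same filesystem
        except OSError:
            shutil.copy2(host_path, dest)  # otherwise fall back to copying
    return root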
Unfortunately I don't have a good workaround on the user side, except for trying a different runner; for example, cwltool handles Directories differently than toil, in a way that might scale better.