Output files not next to each other in next step

Hi,

I have a workflow where one step outputs files and the next step processes them.
The files refer to each other with relative links like "href": "./openEO_2023-06-01Z.tif.json".

However, those links are broken in the next step, because each file gets put in a separate directory.
Is it possible to keep files that were output to the same directory together?

My logs:

cwltool  --leave-tmpdir  scatter-gather-stac.cwl 
INFO /home/venv_python3_8/bin/cwltool 3.1.20240708091337
INFO Resolved 'scatter-gather-stac.cwl' to 'file://***/scatter-gather-stac.cwl'
INFO [workflow ] start
INFO [workflow ] starting step gatherer_node_step1
INFO [step gatherer_node_step1] start
INFO [workflow gatherer_node_step1] start
INFO [workflow gatherer_node_step1] starting step scatter_node_step
INFO [step scatter_node_step] start
INFO [job scatter_node_step] /tmp/upy6d4xg$ /***/example_stac_catalog/sub_collection_maker.py \
    2023-06-01
Copied to /tmp/upy6d4xg/openEO_2023-06-01Z.tif
Copied to /tmp/upy6d4xg/openEO_2023-06-01Z.tif.json
INFO [job scatter_node_step] completed success
INFO [step scatter_node_step] start
INFO [job scatter_node_step_2] /tmp/hu879tgd$ /***/example_stac_catalog/sub_collection_maker.py \
    2023-06-04
Copied to /tmp/hu879tgd/openEO_2023-06-04Z.tif.json
Copied to /tmp/hu879tgd/openEO_2023-06-04Z.tif
INFO [job scatter_node_step_2] completed success
INFO [step scatter_node_step] completed success
INFO [workflow gatherer_node_step1] completed success
INFO [step gatherer_node_step1] completed success
INFO [workflow ] starting step gatherer_node_step2
INFO [step gatherer_node_step2] start
INFO [job gatherer_node_step2] /tmp/qihr86wj$ /***/example_stac_catalog/simple_stac_merge.py \
    /tmp/qjxkupzd/stg2a2b1801-c29b-489a-b7dd-cd4a094318ee/collection.json \
    /tmp/qjxkupzd/stg2767b9e4-922a-4e30-b2cf-860ed5a55adf/openEO_2023-06-01Z.tif.json \
    /tmp/qjxkupzd/stgf2aaf01a-08fb-42ca-a1c5-2df1075bad98/openEO_2023-06-01Z.tif \
    /tmp/qjxkupzd/stga5ef7d51-cbdf-4d6a-b80b-3578c4981196/collection.json \
    /tmp/qjxkupzd/stgfd65269c-ac43-4412-b59b-9557ec72e465/openEO_2023-06-04Z.tif.json \
    /tmp/qjxkupzd/stg6b7d8342-55e8-4cdf-8a5e-c4956f2d5d82/openEO_2023-06-04Z.tif
catalog_path=PosixPath('/tmp/qjxkupzd/stg2a2b1801-c29b-489a-b7dd-cd4a094318ee/collection.json')
catalog_path=PosixPath('/tmp/qjxkupzd/stg2a2b1801-c29b-489a-b7dd-cd4a094318ee/openEO_2023-06-01Z.tif.json')
Traceback (most recent call last):
  File "/***/example_stac_catalog/simple_stac_merge.py", line 106, in <module>
    main(sys.argv)
  File "/***/example_stac_catalog/simple_stac_merge.py", line 79, in main
    files = get_files_from_stac_catalog(collection_path)
  File "/***/example_stac_catalog/simple_stac_merge.py", line 36, in get_files_from_stac_catalog
    all_files.extend(get_files_from_stac_catalog(href))
  File "/***/example_stac_catalog/simple_stac_merge.py", line 18, in get_files_from_stac_catalog
    assert catalog_path.exists()
AssertionError
WARNING [job gatherer_node_step2] exited with status: 1
WARNING [job gatherer_node_step2] completed permanentFail
WARNING [step gatherer_node_step2] completed permanentFail
INFO [workflow ] completed permanentFail
{
    "gatherer_node_out": []
}
WARNING Final process status is permanentFail

Hello @EmileSonneveld; in CWL you have to refer to files using type: File (or similar) for this reason. If you use type: string, the workflow engine doesn’t know to make them available to your tool. This is critical for supporting distributed and cloud-based execution, where there might not be a shared filesystem.

I am using type: File, but the files still get staged in different locations, which breaks the relative paths.
I made a minimal example here: EmileSonneveld/cwl-example (GitHub)

In this case, adding an InitialWorkDirRequirement to the validate_files step should solve the problem, as the files will then be staged into the working directory.

requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.validate_files_in1)
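
For context, here is a minimal sketch of how that requirement might sit inside a complete tool description. Only the input name validate_files_in1 and the script name validate_files.py come from this thread; the baseCommand, the stdout output, and the assumption that the input is an array of files are illustrative guesses, not a copy of the example repository.

```yaml
cwlVersion: v1.2
class: CommandLineTool

# Assumption: the script is executable and on PATH; adjust to your setup.
baseCommand: validate_files.py

requirements:
  InitialWorkDirRequirement:
    listing:
      # Stage every input file directly into the job's working directory,
      # so relative links such as "./openEO_2023-06-01Z.tif.json" resolve again.
      - $(inputs.validate_files_in1)

inputs:
  validate_files_in1:
    type: File[]          # assumed type; the thread only confirms "File" is used
    inputBinding:
      position: 1

outputs:
  # Hypothetical output: capture whatever the script prints.
  validation_log:
    type: stdout

stdout: validation_log.txt
```

Because everything lands in one directory, this only works when the staged files have distinct basenames. The log above lists two files named collection.json, so inputs like those would need renaming or separate handling first.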

Thank you for the minimal example, that helps a lot!

Yes, @JensKrumsieck has one solution that works well with your validate_files.py. For your original issue I wonder if using secondaryFiles might be a better solution.

This CWL feature is useful when a primary file always needs to be accompanied by additional files that are co-located in the same directory. In many of the bioinformatics tools for which we made this feature, the primary and secondary files share the same filename prefix, though CWL does not require that.

If that sounds like the data pattern you have, then please see Pull Request #1 on EmileSonneveld/cwl-example ("example of using `secondaryFiles`", by mr-c) for an example implementation.
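
As a rough illustration of the idea (a sketch under assumptions, not a copy of that pull request), an input could declare its sidecar file as shown here. The tool name process_raster.py, the input name raster, and the output are hypothetical; the .tif / .tif.json pairing is taken from the file names in the log above.

```yaml
cwlVersion: v1.2
class: CommandLineTool

# Hypothetical consumer script; the name is an assumption.
baseCommand: process_raster.py

inputs:
  raster:
    type: File
    # ".json" is appended to the primary file's basename, so
    # openEO_2023-06-01Z.tif is accompanied by openEO_2023-06-01Z.tif.json,
    # and both are staged in the same directory.
    secondaryFiles:
      - pattern: .json
        required: true
    inputBinding:
      position: 1

outputs:
  # Hypothetical output: capture whatever the script prints.
  report:
    type: stdout

stdout: report.txt
```

The step that produces the files would declare the same secondaryFiles pattern on its output, so that the sidecar is collected together with the primary file and travels with it through the workflow.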

Hey, thanks for the extended responses.
InitialWorkDirRequirement seems a good solution.
In practice, I won’t know what files would be generated in previous steps, nor what their internal links would be, so secondaryFiles could be more difficult to use.
I updated the GitHub example.
