Scattering with a directory to store all outputs

kmhernan · June 30, 2020, 10:49am

I have a workflow with a step that makes a directory. That directory is passed to another step, which scatters over an array of UUIDs whose files will be downloaded into it. But because any scattered step returns an array, the workflow ends up returning an array of directories. How do I actually do something like this? I have to do something like this because the number of files is so large that any attempt to mount to downstream tools would create a command line argument that is way larger than allowed.

This is the proof of concept wf I was trying to build for a colleague, but it’s obviously not what I really want to be happening.

cwlVersion: v1.0
class: Workflow
requirements:
  - class: SubworkflowFeatureRequirement
  - class: ScatterFeatureRequirement
  - class: SchemaDefRequirement
    types:
      - $import: readgroup.cwl

inputs:
  bioclient_config: File
  input_fastq_list:
    type:
      type: array
      items: readgroup.cwl#readgroup_fastq_uuid

outputs:
  data_directory:
    type: Directory[]
    outputSource: stage_fastqs/output

steps:
  get_directory:
    run: mkdir.cwl
    in:
      dir_name:
        default: "fastq_dir"
    out: [ output ]

  stage_fastqs:
    run: extract_fastq_workflow.cwl
    scatter: fastq_record
    in:
      bioclient_config: bioclient_config
      download_dir: get_directory/output
      fastq_record: input_fastq_list
    out: [ output ]

Since it will return an array of directories, I’m not sure how to do this right. In the subworkflow that it calls, I have to do this ugly passing of the directory from one step to the next:

cwlVersion: v1.0
class: Workflow
id: extract_readgroup_fastq_wf
requirements:
  - class: InlineJavascriptRequirement
  - class: MultipleInputFeatureRequirement
  - class: StepInputExpressionRequirement
  - class: SchemaDefRequirement
    types:
      - $import: readgroup.cwl

inputs:
  fastq_record: readgroup.cwl#readgroup_fastq_uuid
  download_dir: Directory
  bioclient_config: File

outputs:
  output:
    type: Directory
    outputSource: extract_reverse_fastq/output

steps:
  extract_forward_fastq:
    run: bioclient_download_to_directory.cwl
    in:
      config-file: bioclient_config
      download_handle:
        source: fastq_record
        valueFrom: $(self.forward_fastq_uuid)
      download_directory: download_dir
      file_name:
        source: fastq_record
        valueFrom: $(self.readgroup_id + "." + self.forward_fastq_basename)
    out: [ output ]

  extract_reverse_fastq:
    run: bioclient_download_to_directory.cwl
    in:
      config-file: bioclient_config
      download_handle:
        source: fastq_record
        valueFrom: $(self.reverse_fastq_uuid)
      download_directory: extract_forward_fastq/output
      file_name:
        source: fastq_record
        valueFrom: $(self.readgroup_id + "." + self.reverse_fastq_basename)
    out: [ output ]

Ultimately it just calls a CommandLineTool that has to use InitialWorkDirRequirement to mount the directory and make it writable.

cwlVersion: v1.0
class: CommandLineTool
id: bioclient_download_to_directory
requirements:
  - class: DockerRequirement
    dockerPull: quay.io/ncigdc/bio-client:latest
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.download_directory)
        writable: true
  - class: ResourceRequirement
    coresMin: 1
    coresMax: 1
    ramMin: 2000
    ramMax: 2000
    tmpdirMin: $(Math.ceil (inputs.file_size / 1048576))
    tmpdirMax: $(Math.ceil (inputs.file_size / 1048576))
    outdirMin: $(Math.ceil (inputs.file_size / 1048576))
    outdirMax: $(Math.ceil (inputs.file_size / 1048576))

inputs:
  config-file:
    type: File
    inputBinding:
      prefix: -c
      position: 0

  download:
    type: string
    default: download
    inputBinding:
      position: 1

  download_directory:
    type: Directory

  file_name:
    type: string

  download_handle:
    type: string
    inputBinding:
      position: 3

  file_size:
    type: long?
    default: 1

outputs:
  output:
    type: Directory
    outputBinding:
      glob: $(inputs.download_directory.basename)

baseCommand: [/usr/local/bin/bio_client.py]

arguments:
  - valueFrom: $(inputs.download_directory.basename + "/" + inputs.file_name)
    prefix: --file_path
    position: 2

tetron · July 1, 2020, 2:16pm

Something to try is to take an array of files (or array of Directories) and return a single Directory object with everything in “listing”. The runner will stage those files to the new Directory. Here is a partial example:

https://github.com/common-workflow-language/cwl-v1.1/blob/a22b7580c6b50e77c0a181ca59d3828dd5c69143/tests/dir7.cwl

class: ExpressionTool
cwlVersion: v1.1
requirements:
  InlineJavascriptRequirement: {}
inputs:
  files: File[]
outputs:
  dir: Directory
expression: |
  ${
  return {"dir": {"class": "Directory", "basename": "a_directory", "listing": inputs.files}};
  }

Does that help?

tetron · July 1, 2020, 2:17pm

(side note: github previews embed the actual file, that is awesome)

kmhernan · July 1, 2020, 3:35pm

I was never sure if I could do this kind of manipulation within an expression tool, I’ll try.

kmhernan · July 1, 2020, 4:21pm

This seems to have worked #themoreyouknow

kaushik-work · July 2, 2020, 11:51am

I have to do something like this because the number of files is so large that any attempt to mount to downstream tools would create a command line argument that is way larger than allowed.

In general, when this happens to me, I go the manifest file route: I use a JS expression to generate a manifest file from the list of input files and have my tool read its inputs from the manifest.

tetron · July 2, 2020, 1:50pm

@kmhernan can clarify, but I read the problem as the docker command line getting too long because of all the -v options for each individual file. That’s a somewhat implementation-specific problem (some runners talk directly to the Docker API and are not subject to command line length limits).

Using InitialWorkDir to construct a manifest file is usually the best solution to passing a large list of filenames, and I’m currently working on making it a little bit easier in CWL v1.2.
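
A minimal sketch of the manifest approach described above, assuming a tool that can read its input paths from a text file. The input name `files`, the file name `manifest.txt`, and the use of `cat` as a stand-in for the real tool are all hypothetical:

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      # Write one staged input path per line into manifest.txt
      - entryname: manifest.txt
        entry: |
          ${
            return inputs.files.map(function (f) { return f.path; }).join("\n") + "\n";
          }
inputs:
  files: File[]          # hypothetical input name
# "cat" stands in for a real tool that accepts a manifest file
baseCommand: [cat, manifest.txt]
outputs:
  manifest_contents:
    type: stdout
```

Note that this only shortens the tool's own argument list. It does not by itself change how a runner mounts the input files into the container.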

kmhernan · July 2, 2020, 2:12pm

Correct, that is my problem. I actually use manifest files when tools accept them which was the reason for my other post about making manifest files that needed tabs and new lines haha.

kmhernan · July 2, 2020, 2:16pm

This leads me to a more general concern though: how can I really understand what I can and can’t do within an ExpressionTool? I didn’t think I could create directories out of thin air. Can I use it to rename a file? It seems like I have to use InitialWorkDir for that. Are these limitations defined anywhere?

mrc · July 2, 2020, 2:27pm

ExpressionTools manipulate CWL objects. So you can rename a file or directory; construct new files and directories (perhaps from existing files and directories) and manipulate/create arrays and other complex CWL types.

It is also possible to do all of that in a CommandLineTool, but on most CWL runners ExpressionTools schedule and run faster.

kmhernan · July 2, 2020, 2:29pm

Do you have an example of constructing new files or renaming within an ExpressionTool anywhere?

mrc · July 2, 2020, 2:32pm

Here’s the craziest ExpressionTool I’ve ever written https://github.com/EBI-Metagenomics/ebi-metagenomics-cwl/blob/25129f55226dee595ef941edc24d3c44414e0523/workflows/convert-to-v3-layout.cwl

tetron · July 2, 2020, 2:41pm

The principle is that you tell CWL what you want and it is the responsibility of the runner to do it for you. That’s the superpower that makes it agnostic to weird storage systems, splitting up steps to run on multiple nodes, etc.

The main things are File literals, Directory literals, and setting basename.

File literals have contents and basename set but no location. They get created on the fly when you need to run a CommandLineTool.
Directory literals have listing but no location. They also get created on the fly when you need to run a CommandLineTool.
The runner uses basename to name a file when it is staged or created on the fly, so you can logically rename a file in an expression by returning a File object with the same location but a different basename. This does not change the name of the file in the underlying storage system.
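
The three ideas above can be sketched in a single ExpressionTool. This is an illustration only: the input and output names (`infile`, `new_name`, `renamed`, `note`, `bundle`) and the literal file name `README.txt` are hypothetical.

```yaml
cwlVersion: v1.0
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
inputs:
  infile: File
  new_name: string
outputs:
  renamed: File
  note: File
  bundle: Directory
expression: |
  ${
    // Logical rename: same location, different basename.
    var renamed = {
      "class": "File",
      "location": inputs.infile.location,
      "basename": inputs.new_name
    };
    // File literal: contents and basename, but no location.
    var note = {
      "class": "File",
      "basename": "README.txt",
      "contents": "renamed from " + inputs.infile.basename + "\n"
    };
    // Directory literal: listing and basename, but no location.
    var bundle = {
      "class": "Directory",
      "basename": "bundle",
      "listing": [renamed, note]
    };
    return {"renamed": renamed, "note": note, "bundle": bundle};
  }
```

Nothing is written to disk by the expression itself. The literals only become real files and directories when the runner stages them for a later CommandLineTool step.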

This is all described in the specification but it is pretty dense. We are getting a grant to improve the documentation so we’re hoping we’ll be able to expand the user guide to cover more topics like this.

kmhernan · July 2, 2020, 2:43pm

That is extremely helpful and clear @tetron thank you

kmhernan · July 10, 2020, 7:07pm

@tetron or @mrc: the expression tool works, but the next step, which takes the directory the expression tool makes, still seems to be trying to mount every individual file. That defeats the purpose of this (there are too many files to mount individually, and the command line would be too large). I need to be able to get these hundreds of files into a directory and then mount just the directory, but this doesn’t actually seem possible. Thoughts?

Relevant bit from logs:

        run \
        -i \
        --volume=/mnt/tmp/tmptgIMVT:/var/spool/cwl:rw \
        --volume=/mnt/tmp/tmpOKvB5x:/tmp:rw \
        --volume=/mnt/tmp/tmp4nAfeE/B.fake_B_R1.fastq.gz:/var/lib/cwl/stg7f23fdfe-5bec-454b-b69b-d2fc3769295e/fastq_dir/B.fake_B_R1.fastq.gz:ro \
        --volume=/mnt/tmp/tmpsLphqy/B.fake_B_R2.fastq.gz:/var/lib/cwl/stg7f23fdfe-5bec-454b-b69b-d2fc3769295e/fastq_dir/B.fake_B_R2.fastq.gz:ro \
        --volume=/mnt/tmp/tmpCI55it/A.fake_A_R2.fastq.gz:/var/lib/cwl/stg7f23fdfe-5bec-454b-b69b-d2fc3769295e/fastq_dir/A.fake_A_R2.fastq.gz:ro \
        --volume=/mnt/tmp/tmpkdUy0F/A.fake_A_R1.fastq.gz:/var/lib/cwl/stg7f23fdfe-5bec-454b-b69b-d2fc3769295e/fastq_dir/A.fake_A_R1.fastq.gz:ro \

It’s mounting each individual file still

tetron · July 10, 2020, 7:36pm

I see. I think the best solution would be for cwltool to behave in a more scalable way for large numbers of inputs. Either there’s a way to pass the list of volume mounts to Docker via a file, or it could materialize the input staging by copying or hardlinking files, and then it would only have to mount a single directory into the container.

Exceeding the command line length is somewhat specific to cwltool and other runners that invoke docker using the command line instead of using the API or running it some other way. For example, I have a CWL pipeline I run on Arvados that accepts an array of 7000 file inputs and it doesn’t have this problem, but it probably would if I ran it with cwltool.

The ugly workaround I can think of is to divide your list of Directories into smaller subsets, and have a step that simply copies input to output and produces a single directory as output, so that your downstream step has fewer directory inputs (where each of those directories holds a subset of the original array). Does that make sense?
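
As a rough sketch of the first half of that workaround, an ExpressionTool could split a flat array of files into Directory literals of bounded size. This assumes the inputs arrive as a File[] rather than as Directories, and the names `files`, `chunk_size`, `dirs`, and the `chunk_N` basenames are hypothetical:

```yaml
cwlVersion: v1.0
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
inputs:
  files: File[]
  chunk_size:
    type: int
    default: 50        # hypothetical value; tune to the runner's limits
outputs:
  dirs: Directory[]
expression: |
  ${
    var dirs = [];
    // Walk the array in steps of chunk_size; each slice becomes one
    // Directory literal named chunk_0, chunk_1, ...
    for (var i = 0; i < inputs.files.length; i += inputs.chunk_size) {
      dirs.push({
        "class": "Directory",
        "basename": "chunk_" + dirs.length,
        "listing": inputs.files.slice(i, i + inputs.chunk_size)
      });
    }
    return {"dirs": dirs};
  }
```

A workflow would then scatter a copy step over `dirs`, so that each copy job stages at most `chunk_size` files and emits one real directory. Whether the downstream step then mounts each copied directory as a single volume depends on the runner.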