CWL workflow: find files in subdirectories of an input directory

I have output files from a previous step, organized as files inside subdirectories of an output directory, such as:

/home/enrico/Dropbox/NY/app/GATK_CNV_germline/scatterIntervals/scattered_hs37d5.preprocessed.300bp.primary_contigs.noBL.filtered.F10
├── temp_0001_of_37
│   └── scattered.interval_list
├── temp_0002_of_37
│   └── scattered.interval_list
├── temp_0003_of_37
│   └── scattered.interval_list

I need to pass the top-level directory as input to a following step of the workflow, and that step needs to be able to find all files inside its subdirectories. I’ve tried this:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
inputs:
  my_dir: Directory

outputs: []

requirements:
  StepInputExpressionRequirement: {}
  ScatterFeatureRequirement: {}

steps:
  zeroth_step:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"inner_directories": inputs.dir.listing};}'
      outputs:
        inner_directories: Directory[]
    in:
      dir: my_dir
    out: [inner_directories]

  first_step:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        my_dir:
          type: Directory
      expression: '${return {"files": inputs.my_dir.listing};}'
      outputs:
        files: File[]
    in:
      my_dir: zeroth_step/inner_directories
    scatter: my_dir
    out: [files]

  second_step:
    run:
      class: CommandLineTool
      inputs:
        my_file:
          type: File[]
      baseCommand: echo
      outputs: []
    in:
      my_file: first_step/files
    out: []

But I got this error:

Tool definition failed validation:
GermlineCNVCaller-scattered-workflow-0.cwl:41:11: Source 'files' of type {"items": {"type": "array", "items": "File"}, "type": "array"} is incompatible
GermlineCNVCaller-scattered-workflow-0.cwl:52:7: with sink 'my_file' of type {"type": "array", "items": "File"}

That is because the scattered output is recognized as a nested array. How can I work around it?
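
One possible workaround is to flatten the nested array with a small JavaScript expression, for example in an extra ExpressionTool step placed between first_step and second_step. The following is only a sketch of the flattening logic, assuming the value is an array of arrays of File objects. The function name `flattenOneLevel` and the sample objects are hypothetical, and the code sticks to ES5 so that it could be pasted into an expression body.

```javascript
// Hypothetical helper: turn File[][] into File[] by flattening one level.
function flattenOneLevel(nested) {
  var flat = [];
  for (var i = 0; i < nested.length; i++) {
    // concat appends every element of the inner array to the result
    flat = flat.concat(nested[i]);
  }
  return flat;
}

// Made-up data shaped like the scattered output of first_step.
var scattered = [
  [{ "class": "File", "basename": "a.interval_list" }],
  [
    { "class": "File", "basename": "b.interval_list" },
    { "class": "File", "basename": "c.interval_list" }
  ]
];
var files = flattenOneLevel(scattered);
console.log(files.length); // prints 3
```

Inside an ExpressionTool, the same loop would be returned as the value of the output field, with the input declared as an array of File arrays and the output as File[].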

Thank you so much in advance for any help!

Hi! I see you have a list of directories as output of your zeroth step, and as input a single directory for your first step. That is a type mismatch.

The error message text looks odd to me, and might be a cwltool bug. But this is on top of the port mismatch.

Dear CWL community,
can someone kindly pick up on this?

I’m trying to achieve something very similar, but have so far failed to find a solution, likely due to my lack of knowledge of CWL and JavaScript.
Just to rephrase: I’d like to take a directory as input (step 0), pass the subdirectories of that directory to step 1, which passes the files of the subdirectories to step 2, which runs a CommandLineTool on an array of files.

I’ve tried upgrading the above example to CWL v1.2 and adding LoadListingRequirement to zeroth_step and first_step:

        InlineJavascriptRequirement: {}
        LoadListingRequirement:
            loadListing: shallow_listing
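
For reference, `shallow_listing` only fills in the top-level `listing`, while `deep_listing` also loads the listings of the subdirectories. Assuming a deep listing, the two listing steps could in principle be collapsed into one expression that walks both levels. The sketch below shows only that walk; the function name `filesInSubdirectories` and the sample data are hypothetical.

```javascript
// Hypothetical helper: collect the files found directly inside each
// subdirectory of `dir`. Assumes `dir.listing` was loaded deeply, so that
// every subdirectory entry carries its own `listing`.
function filesInSubdirectories(dir) {
  var found = [];
  var top = dir.listing || [];
  for (var i = 0; i < top.length; i++) {
    // Skip files that sit directly in the parent directory.
    if (top[i]["class"] !== "Directory") { continue; }
    var inner = top[i].listing || [];
    for (var j = 0; j < inner.length; j++) {
      if (inner[j]["class"] === "File") { found.push(inner[j]); }
    }
  }
  return found;
}

// Made-up Directory object resembling the layout from the first post.
var parentDir = {
  "class": "Directory",
  "basename": "scatterIntervals",
  "listing": [
    { "class": "Directory", "basename": "temp_0001_of_37",
      "listing": [{ "class": "File", "basename": "scattered.interval_list" }] },
    { "class": "Directory", "basename": "temp_0002_of_37",
      "listing": [{ "class": "File", "basename": "scattered.interval_list" }] },
    { "class": "File", "basename": "notes.txt" }
  ]
};
var collected = filesInSubdirectories(parentDir);
console.log(collected.length); // prints 2
```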

I’ve tried different combinations (with and without zeroth_step and adapting the downstream steps accordingly).

Unfortunately, I keep running into either

  Expected class 'org.w3id.cwl.cwl.Directory' but this is 'File'

or the opposite:

    Expected class 'org.w3id.cwl.cwl.File' but this is 'Directory'

Is there a way to specify what (files / dirs / both) a *.listing would output?
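
As far as I can tell, a `listing` can contain both File and Directory objects, each tagged with a `class` field, so the selection has to happen in the expression itself. A minimal sketch, assuming the listing is already loaded; the function name `entriesOfClass` and the sample entries are hypothetical.

```javascript
// Hypothetical helper: keep only the listing entries of one CWL class
// ("File" or "Directory").
function entriesOfClass(listing, wantedClass) {
  var kept = [];
  for (var i = 0; i < listing.length; i++) {
    if (listing[i]["class"] === wantedClass) {
      kept.push(listing[i]);
    }
  }
  return kept;
}

// Made-up mixed listing.
var mixedListing = [
  { "class": "Directory", "basename": "sample_1" },
  { "class": "File", "basename": "README.txt" },
  { "class": "Directory", "basename": "sample_2" }
];
var onlyDirs = entriesOfClass(mixedListing, "Directory");
var onlyFiles = entriesOfClass(mixedListing, "File");
console.log(onlyDirs.length, onlyFiles.length); // prints 2 1
```

Returning `onlyDirs` from a step typed Directory[] (or `onlyFiles` from one typed File[]) should avoid the "Expected class" mismatch, since each output then contains a single class.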

I have also tried to stage the directories via InitialWorkDirRequirement or

Any help or link to tutorials and examples is highly appreciated!

It’s not the most sophisticated solution, but I’ve managed to get the following working. At least it goes in the right direction.

workflow.cwl:

cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  parentDir: Directory

steps:
  listDirs:
    run: list-dirs-cmd.cwl
    in:
      parentDir: parentDir    
    out: [dirList]
  listFiles:
    run: list-files-cmd.cwl
    in:
      inDir: listDirs/dirList
    scatter: inDir   
    out: [fileList, theBaseNameOfTheDirectory]
  countWords:
    run: wc.cwl
    scatter:
      - inFiles
    scatterMethod: dotproduct
    in:
      inFiles: listFiles/fileList
    out: [wcout]

outputs:
  outFiles:
    type: File[]
    outputSource: countWords/wcout

with the three steps using:

list-dirs-cmd.cwl

cwlVersion: v1.2
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.parentDir)

inputs:
  parentDir: 
    type: Directory

baseCommand: echo

outputs:
  dirList:
    type: Directory[]
    outputBinding:
      # Collect every immediate subdirectory of the staged parent directory.
      glob: $(runtime.outdir)/$(inputs.parentDir.basename)/*/

list-files-cmd.cwl

cwlVersion: v1.2
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.inDir)

inputs:
  inDir: 
    type: Directory

baseCommand: [echo]

arguments: 
  - valueFrom: $(inputs.inDir.basename)

outputs: 
  theBaseNameOfTheDirectory:
    type: string
    outputBinding:
      # No glob here: the value is computed from the input, so nothing needs loading.
      outputEval: $(inputs.inDir.basename)
  fileList:
    type: File[]
    outputBinding:
      glob: $(runtime.outdir)/$(inputs.inDir.basename)/*.*

wc.cwl

cwlVersion: v1.2
class: CommandLineTool

inputs:
  inFiles: 
    type: File[]
    inputBinding:
      position: 0

baseCommand: [wc, -l]

outputs:
  wcout:
    type: stdout

stdout: wc-output.txt

There must be a better solution.

I’ve got two questions:

  1. Would it be okay to combine step 0 and step 1, or do you need the output of step 1 for other parts of the workflow?
  2. Does the CommandLineTool step need to operate on all the files in one go, or could it take one or a few at a time for better parallelism?

Hi, thanks for your response!

  1. Combined (simplified) would be preferred.
  2. A few at a time would probably also be better. Since you’re familiar with it, the actual tool I’m trying to run is Kallisto (bio-cwl-tools/Kallisto/Kallisto-Quant.cwl at commit 66f620da5b0a11e934a6da83272205a2516bcd91 in common-workflow-library/bio-cwl-tools on GitHub).

To add some complexity, I’d also need some option to filter which files (e.g. *.fastq, *.fastq.gz) would be passed to the tool (and which ignored).
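
Such a filter could be written as a JavaScript expression over the `basename` of each listed entry. This is a sketch only, not tied to the Kallisto tool; the function name `filterBySuffix`, the suffix list, and the sample entries are hypothetical.

```javascript
// Hypothetical helper: keep File entries whose basename ends with one of
// the given suffixes; everything else (including directories) is ignored.
function filterBySuffix(entries, suffixes) {
  var kept = [];
  for (var i = 0; i < entries.length; i++) {
    var entry = entries[i];
    if (entry["class"] !== "File") { continue; }
    var name = entry.basename;
    for (var j = 0; j < suffixes.length; j++) {
      var s = suffixes[j];
      // ES5-safe equivalent of name.endsWith(s)
      if (name.length >= s.length && name.slice(name.length - s.length) === s) {
        kept.push(entry);
        break; // avoid adding the same file twice
      }
    }
  }
  return kept;
}

// Made-up directory content.
var sampleEntries = [
  { "class": "File", "basename": "reads_1.fastq" },
  { "class": "File", "basename": "reads_2.fastq.gz" },
  { "class": "File", "basename": "notes.txt" },
  { "class": "Directory", "basename": "logs.fastq" }
];
var fastqFiles = filterBySuffix(sampleEntries, [".fastq", ".fastq.gz"]);
console.log(fastqFiles.length); // prints 2
```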

That is very helpful, @Brilator, thank you for the specifics.

Do you have to start with a directory? CWL was designed for specific inputs and, as you can see, not really for teasing apart complex directories (though it is possible).

I recommend making your workflow use specific inputs and adding the directory parsing later (if you still need that).

For example, make the workflow for a single sample. Then add a scatter for multiple samples, switching some or all of the inputs to arrays. Then add the Directory parsing “step 0” if you really need it.

Ok, I see. Well, I don’t have to start with the parent directory. I was just hoping to be able to cover the full analysis via CWL in its context together with the input data as-is.

The alternative would be to write bash for loops to go through the nested directories (which again I could wrap as a CWL workflow step).

Thanks for the input!