CWL workflow: find files in subdirectories of an input directory

cccnrc · August 19, 2020, 1:21pm

I have output files from a previous step organized as files inside subdirectories of an output directory. Such as:

/home/enrico/Dropbox/NY/app/GATK_CNV_germline/scatterIntervals/scattered_hs37d5.preprocessed.300bp.primary_contigs.noBL.filtered.F10
├── temp_0001_of_37
│   └── scattered.interval_list
├── temp_0002_of_37
│   └── scattered.interval_list
├── temp_0003_of_37
│   └── scattered.interval_list

I need to pass the top-level directory as input to a later step of the workflow, and that step must be able to find all the files inside its subdirectories. I’ve tried this:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
inputs:
  my_dir: Directory

outputs: []

requirements:
  StepInputExpressionRequirement: {}
  ScatterFeatureRequirement: {}

steps:
  zeroth_step:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"inner_directories": inputs.dir.listing};}'
      outputs:
        inner_directories: Directory[]
    in:
      dir: my_dir
    out: [inner_directories]

  first_step:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        my_dir:
          type: Directory
      expression: '${return {"files": inputs.my_dir.listing};}'
      outputs:
        files: File[]
    in:
      my_dir: zeroth_step/inner_directories
    scatter: my_dir
    out: [files]

  second_step:
    run:
      class: CommandLineTool
      inputs:
        my_file:
          type: File[]
      baseCommand: echo
      outputs: []
    in:
      my_file: first_step/files
    out: []

But I got this error:

Tool definition failed validation:
GermlineCNVCaller-scattered-workflow-0.cwl:41:11: Source 'files' of type {"items": {"type": "array", "items": "File"}, "type": "array"} is incompatible
GermlineCNVCaller-scattered-workflow-0.cwl:52:7: with sink 'my_file' of type {"type": "array", "items": "File"}

Because it recognizes it as a nested array. How can I work around it?

Thank you so much in advance for any help!

kaushik-work · August 24, 2020, 5:29pm

Hi! I see you have a list of directories as output of your zeroth step, and as input a single directory for your first step. That is a type mismatch.

The error message text looks odd to me, and might be a cwltool bug. But this is on top of the port mismatch.
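One way to read the error: because first_step is scattered over my_dir, each scatter job returns a File[] and the gathered result is an array of those arrays (File[][]), while second_step declares a flat File[]. A minimal sketch of flattening one level in JavaScript follows. It assumes the function body would be pasted into the expression of an extra ExpressionTool whose input is declared as an array of File arrays and whose output is File[]; `flattenOnce` and the sample data are hypothetical names used only for illustration. The code sticks to ES5 syntax, which is what CWL expressions are specified against.

```javascript
// Flatten one level of nesting: File[][] -> File[].
function flattenOnce(nested) {
  var flat = [];
  for (var i = 0; i < nested.length; i++) {
    // concat() appends the elements of each inner array
    flat = flat.concat(nested[i]);
  }
  return flat;
}

// Hypothetical stand-in for the gathered output of the scattered first_step.
var gathered = [
  [{ "class": "File", "basename": "a.interval_list" }],
  [
    { "class": "File", "basename": "b.interval_list" },
    { "class": "File", "basename": "c.interval_list" }
  ]
];

var flat = flattenOnce(gathered);
console.log(flat.length); // 3
```

Inside an ExpressionTool, the equivalent call would be along the lines of `${ return {"files": flattenOnce(inputs.nested)}; }`, with `nested` being whatever name the input port is given.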

Brilator · June 25, 2024, 2:07pm

Dear CWL community,
can someone kindly pick up on this?

I’m trying to achieve something very similar, but so far I have failed to find a solution – likely due to my lack of knowledge of CWL and JavaScript.
Just to rephrase: I’d like to take a directory as input (step 0), pass the subdirectories of that directory to step 1, which passes the files of the subdirectories to step 2, which runs a CommandLineTool on an array of files.

I’ve tried to upgrade the above example to CWL v1.2, by also adding the LoadListingRequirement to zeroth_step and first_step.

        InlineJavascriptRequirement: {}
        LoadListingRequirement:
            loadListing: shallow_listing

I’ve tried different combinations (with and without zeroth_step and adapting the downstream steps accordingly).

Unfortunately, I keep running into either

  Expected class 'org.w3id.cwl.cwl.Directory' but this is 'File'

or the opposite:

    Expected class 'org.w3id.cwl.cwl.File' but this is 'Directory'

Is there a way to specify what a *.listing returns (files, directories, or both)?
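A Directory's listing is a mixed array of File and Directory objects, so the two "Expected class" errors above are consistent with an unfiltered listing being returned on a port typed Directory[] or File[]. A minimal sketch of filtering by the `class` field inside the expression follows. `entriesOfClass` and the sample listing are hypothetical, and the sketch assumes the listing has actually been loaded (for example via the LoadListingRequirement already mentioned).

```javascript
// Keep only the listing entries whose "class" matches ("File" or "Directory").
function entriesOfClass(listing, cls) {
  return (listing || []).filter(function (entry) {
    return entry["class"] === cls;
  });
}

// Hypothetical shallow listing of a parent directory.
var listing = [
  { "class": "Directory", "basename": "temp_0001_of_37" },
  { "class": "File", "basename": "README.txt" }, // stray file, made up for the example
  { "class": "Directory", "basename": "temp_0002_of_37" }
];

var dirs = entriesOfClass(listing, "Directory");
var files = entriesOfClass(listing, "File");
console.log(dirs.length, files.length);
```

In an ExpressionTool this would correspond to something like `${ return {"inner_directories": entriesOfClass(inputs.dir.listing, "Directory")}; }`, so that only Directory objects reach a Directory[] output.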

I have also tried to stage the directories via InitialWorkDirRequirement or

Any help or link to tutorials and examples is highly appreciated!

Brilator · June 28, 2024, 6:03pm

Not the most sophisticated, but I’ve managed the following. At least it goes in the right direction.

workflow.cwl:

cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  parentDir: Directory

steps:
  listDirs:
    run: list-dirs-cmd.cwl
    in:
      parentDir: parentDir    
    out: [dirList]
  listFiles:
    run: list-files-cmd.cwl
    in:
      inDir: listDirs/dirList
    scatter: inDir   
    out: [fileList, theBaseNameOfTheDirectory]
  countWords:
    run: wc.cwl
    scatter:
      - inFiles
    scatterMethod: dotproduct
    in:
      inFiles: listFiles/fileList
    out: [wcout]

outputs:
  outFiles:
    type: File[]
    outputSource: countWords/wcout

with the three steps using:

list-dirs-cmd.cwl

cwlVersion: v1.2
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.parentDir)

inputs:
  parentDir: 
    type: Directory

baseCommand: echo

outputs:
  dirList:
    type: Directory[]
    outputBinding:
      loadContents: true
      glob: $(runtime.outdir)/$(inputs.parentDir.basename)/*/

list-files-cmd.cwl

cwlVersion: v1.2
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.inDir)

inputs:
  inDir: 
    type: Directory

baseCommand: [echo]

arguments: 
  - valueFrom: $(inputs.inDir.basename)

outputs: 
  theBaseNameOfTheDirectory:
    type: string
    outputBinding:
      loadContents: true
      outputEval: $(inputs.inDir.basename)
  fileList:
    type: File[]
    outputBinding:
      glob: $(runtime.outdir)/$(inputs.inDir.basename)/*.*

wc.cwl

cwlVersion: v1.2
class: CommandLineTool

inputs:
  inFiles: 
    type: File[]
    inputBinding:
      position: 0

baseCommand: [wc, -l]

outputs:
  wcout:
    type: stdout

stdout: wc-output.txt

There must be a better solution.
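One possibly simpler route, sketched here under assumptions rather than tested against this workflow, is to replace listDirs and listFiles with a single ExpressionTool that walks the directory listing. The sketch assumes the tool sets `loadListing: deep_listing`, so that each subdirectory object carries its own `listing`. `filesPerSubdir` and the sample data are hypothetical. The result is one array of files per subdirectory (File[][]), which is the shape countWords already scatters over.

```javascript
// True for listing entries that are files.
function isFile(entry) {
  return entry["class"] === "File";
}

// For each immediate subdirectory, collect the files directly inside it.
// Returns an array of arrays: one inner array per subdirectory.
function filesPerSubdir(dir) {
  var groups = [];
  var entries = dir.listing || [];
  for (var i = 0; i < entries.length; i++) {
    var sub = entries[i];
    if (sub["class"] !== "Directory") {
      continue; // skip files sitting next to the subdirectories
    }
    groups.push((sub.listing || []).filter(isFile));
  }
  return groups;
}

// Hypothetical parent directory as it would look with a deep listing.
var parentDir = {
  "class": "Directory",
  "basename": "parent",
  "listing": [
    { "class": "Directory", "basename": "s1", "listing": [
      { "class": "File", "basename": "a.txt" },
      { "class": "File", "basename": "b.txt" }
    ] },
    { "class": "File", "basename": "notes.md" },
    { "class": "Directory", "basename": "s2", "listing": [
      { "class": "File", "basename": "c.txt" }
    ] }
  ]
};

var groups = filesPerSubdir(parentDir);
console.log(groups.length);
```

The expression itself would then be roughly `${ return {"fileList": filesPerSubdir(inputs.parentDir)}; }`, with the output declared as an array of File arrays.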

mrc · July 2, 2024, 3:13pm

I’ve got two questions:

Would it be okay to combine step 0 and step 1, or do you need the output of step 1 for other parts of the workflow?
Does the CommandLineTool step need to operate on all the files in one go, or could it take one or a few at a time for better parallelism?

Brilator · July 2, 2024, 3:20pm

Hi, thanks for your response!

Combined (simplified) would be preferred.
A few at a time would probably also be better. Since you’re familiar with it, the actual tool I’m trying to run is Kallisto (Kallisto/Kallisto-Quant.cwl at commit 66f620da5b0a11e934a6da83272205a2516bcd91 in common-workflow-library/bio-cwl-tools on GitHub).

To add some complexity, I’d also need some option to filter which files (e.g. *.fastq, *.fastq.gz) would be passed to the tool (and which would be ignored).
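The filename filter could be handled in the same kind of JavaScript expression. A minimal sketch, assuming the files are already available as an array of File objects with a `basename` field; `hasAnySuffix`, `keepBySuffix`, and the sample names are hypothetical. It avoids `String.prototype.endsWith`, which is not part of ES5.

```javascript
// True if name ends with at least one of the given suffixes.
function hasAnySuffix(name, suffixes) {
  for (var i = 0; i < suffixes.length; i++) {
    var s = suffixes[i];
    if (name.length >= s.length &&
        name.slice(name.length - s.length) === s) {
      return true;
    }
  }
  return false;
}

// Keep only File entries whose basename matches one of the suffixes.
function keepBySuffix(files, suffixes) {
  return files.filter(function (f) {
    return f["class"] === "File" && hasAnySuffix(f.basename, suffixes);
  });
}

// Hypothetical contents of one sample directory.
var candidates = [
  { "class": "File", "basename": "sample1_R1.fastq" },
  { "class": "File", "basename": "sample1_R2.fastq.gz" },
  { "class": "File", "basename": "notes.txt" },
  { "class": "File", "basename": "old.fastq.bak" }
];

var reads = keepBySuffix(candidates, [".fastq", ".fastq.gz"]);
console.log(reads.map(function (f) { return f.basename; }).join(", "));
```

When the files come from a CommandLineTool output instead, the `glob` field also accepts an array of patterns, which may be enough for a plain extension filter.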

mrc · July 2, 2024, 3:34pm

That is very helpful @Brilator, thank you for the specifics.

Do you have to start with a directory? CWL was designed for specific inputs and, as you can see, not really for teasing apart complex directories (though it is possible).

I recommend making your workflow use specific inputs and adding the directory parsing later (if you still need that).

For example, make the workflow for a single sample. Then add a scatter for multiple samples, switching some or all of the inputs to arrays. Then add the Directory parsing “step 0” if you really need it.

Brilator · July 3, 2024, 7:12am

Ok, I see. Well, I don’t have to start with the parent directory. I was just hoping to be able to cover the full analysis via CWL in its context together with the input data as-is.

The alternative would be to write bash for loops to go through the nested directories (which again I could wrap as a CWL workflow step).

Thanks for the input!