CWL - specifying Directory outputs as an input

Hello,

I am working on a tool definition that requires specifying an input parameter for the outputs location.
I have managed to write the following tool definition, that works as expected when testing with cwltool.

cwlVersion: v1.0
class: CommandLineTool

hints:
  DockerRequirement:
    dockerPull: umccr/alpine_pandas:1.0.1

requirements:
  InitialWorkDirRequirement:
    listing:
      - entry: $(inputs.outputDir)
        writable: true

baseCommand: []

inputs:
  script:
    type: File
    inputBinding:
      position: 0
  samplesheet:
    type: File
    inputBinding:
      position: 1
      prefix: -s
  metadata:
    type: File
    inputBinding:
      position: 2
      prefix: -t
  outputDir:
    type: Directory
    inputBinding:
      position: 4
      prefix: -o

outputs:
  splitSheets:
    type:
      type: array
      items: [File, Directory]
    outputBinding:
      glob: "*"

I run into a problem when porting this to an environment that does not allow writing to input files/directories, so I am looking for a workaround. I'm trying an alternate approach where outputDir is passed as a string, i.e.

.
.
  outputDir:
    type: string
    inputBinding:
      position: 4
      prefix: -o

outputs:
  splitSheets:
    type:
      type: array
      items: [File, Directory]
    outputBinding:
      glob: "$(inputs.outputDir)"

But that produces the following error:

("Error collecting output for parameter 'splitSheets':\nsamplesheetPrep2.cwl:41:7: glob patterns must not start with '/'", {})

I have also tried specifying inputs.outputDir as a writable entry under InitialWorkDirRequirement and then specifying glob as

outputs:
  splitSheets:
    type:
      type: array
      items: [File, Directory]
    outputBinding:
      glob: "*"

This produces the following error:

OSError: [Errno 30] Read-only file system: '/Users'

Wondering if there could be any other solution to this issue?
Any help will be appreciated.

Cheers,
Sehrish

Hello,

If I understood the interface of your tool correctly, you just have to pass the directory name as a string input and then glob that directory in the outputs. It will look like this:

cwlVersion: v1.0
class: CommandLineTool

hints:
  DockerRequirement:
    dockerPull: umccr/alpine_pandas:1.0.1


# Pass the working directory name here to the command line
baseCommand: []

inputs:
  script:
    type: File
    inputBinding:
      position: 0
  samplesheet:
    type: File
    inputBinding:
      position: 1
      prefix: -s
  metadata:
    type: File
    inputBinding:
      position: 2
      prefix: -t
  outputDir:
    type: string
    inputBinding:
      position: 4
      prefix: -o

outputs:
  splitSheets:
    type: Directory
    outputBinding:
      glob: "$(inputs.outputDir)"

Hi Kaushik,

Thanks for the response. I tried capturing output explicitly as a directory.

It still throws the glob error ("Error collecting output for parameter 'splitSheets':\nsamplesheetPrep2.cwl:43:7: glob patterns must not start with '/'", {})

I think it does not like outputDir being given as a string path that starts with /Users/....
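The error message suggests that the glob is evaluated relative to the job's output directory, which would explain why an absolute path is rejected. A minimal Python sketch of that rule follows. Note that `check_glob_pattern` is a hypothetical helper written for illustration only; it is not part of cwltool and does not reproduce its implementation.

```python
def check_glob_pattern(pattern: str) -> str:
    """Return the pattern unchanged if it is relative.

    Hypothetical helper: mirrors the rule implied by the error message,
    i.e. a glob pattern must not start with '/'.
    """
    if pattern.startswith("/"):
        raise ValueError("glob patterns must not start with '/'")
    return pattern


if __name__ == "__main__":
    # A plain folder name passes the check.
    print(check_glob_pattern("output_dir"))
```

Under this assumption, passing a bare folder name such as output_dir for outputDir keeps the glob relative.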

Can the program take a relative path? I typically enter a relative path, something like output_dir.

I was still getting the following error when specifying a relative path for outputDir.

("Error collecting output for parameter 'splitSheets':\nsamplesheetPrep2.cwl:43:7: Did not find output file with glob pattern: '['cwl']'", {})

I had to update the requirements as follows to make it work (thanks to Michael F. for the hints):

InitialWorkDirRequirement:
    listing:
    - '$({class: "Directory", basename: inputs.outputdir, listing: []})'

Hope this makes sense to others as well.
Thank you.

Sorry, I missed this bit - what did you mean by adding the working dir name before the base command?

That was a reminder to pass the string to the command line, and that is done via the inputBinding.

@skanwal Glad your problem was solved.

For completeness, here is an example where the command-line program creates a directory and that directory is globbed as an output:

outdir.cwl

cwlVersion: v1.0
class: CommandLineTool

hints:
  DockerRequirement:
    dockerPull: python:alpine3.11


baseCommand: [python]
arguments: [$(inputs.script.path), $(inputs.outdir)]

inputs:
  script: File
  outdir: string

outputs:
  mydir:
    type: Directory
    outputBinding:
      glob: "$(inputs.outdir)"

Where the input script is script.py

import pathlib
import sys

out = sys.argv[1]
pathlib.Path(out).mkdir()

This can be run as

cwltool outdir.cwl --script script.py --outdir hello
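If the script should also tolerate a directory that already exists (for example one that was pre-created during staging), a slightly more forgiving variant of script.py could look like the sketch below. The `make_outdir` function and the `example.txt` file name are assumptions added for illustration; they are not part of the original script.

```python
import pathlib
import sys


def make_outdir(name: str) -> pathlib.Path:
    """Create the output directory (if needed) and write one file into it."""
    out = pathlib.Path(name)
    # exist_ok=True avoids FileExistsError when the directory is already there;
    # a bare mkdir() would raise in that case.
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical payload so the collected directory is not empty.
    (out / "example.txt").write_text("hello\n")
    return out


if __name__ == "__main__" and len(sys.argv) == 2:
    make_outdir(sys.argv[1])
```

It is invoked the same way as the original script, with the directory name as the only argument.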

Hi Kaushik,

Thanks for the example, it makes sense.

So, to capture the contents of the output directory, is it necessary to pass an InitialWorkDirRequirement with an appropriate listing array?

I'll give some context to the solution. Inside the InitialWorkDirRequirement you provide a listing value, which can be an array<File | Directory | Dirent | string | Expression> | string | Expression:

May be an expression. If so, the expression return value must validate as {type: array, items: [File, Directory]}

Hence through the expression we can return a valid Directory object with attributes:

  • class: “Directory” (duh),
  • basename: Name of the folder (no leading slashes or anything)
  • listing: empty list ([]), but you could return anything in here if you wanted the directory to have something in here.

NB: wrap the expression in a string for it to be valid YAML.

This gives the tool:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: ls

requirements:
  InitialWorkDirRequirement:
    listing:
      - '$({class: "Directory", basename: inputs.outputdir, listing: []})'

inputs:
  outputdir: string

outputs:
  outdir:
    type: Directory
    outputBinding:
      glob: $(inputs.outputdir)
  outls: stdout
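To make the shape of the returned object concrete, here is the same Directory literal built in Python. This is only an illustration of the three attributes listed above; `directory_literal` is a hypothetical helper, and the slash check simply encodes the "no leading slashes or anything" remark.

```python
import json


def directory_literal(basename: str) -> dict:
    """Build the object that the listing expression returns.

    Hypothetical helper for illustration; CWL itself evaluates the
    JavaScript expression shown in the tool above.
    """
    if not basename or "/" in basename:
        raise ValueError("basename must be a plain folder name without slashes")
    return {"class": "Directory", "basename": basename, "listing": []}


if __name__ == "__main__":
    # Show the literal as JSON for a folder called output_dir.
    print(json.dumps(directory_literal("output_dir")))
```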

Hi Michael,

Thanks for adding to the discussion.

listing : empty list ( [] ), but you could return anything in here if you wanted the directory to have something in here.

Just another point, I was able to return files in the directory by using an empty list i.e. listing: [].

Hi @skanwal, the InitialWorkDirRequirement is to stage or create files/directories during the job setup phase. It is separate from the output gathering stage. The example I provided does not have an InitialWorkDirRequirement because I don’t want to stage any inputs, merely gather the output directory, which the tool is creating.
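The distinction between staging and output gathering can be pictured with a toy model in Python. This is a sketch under simplifying assumptions and not how cwltool is implemented: `run_job`, `tool_creates_dir`, and `tool_does_nothing` are hypothetical names, and the "tool" is just a Python function.

```python
import pathlib
import tempfile
from typing import Callable, Iterable, List


def run_job(tool: Callable[[pathlib.Path], None],
            pattern: str,
            stage_dirs: Iterable[str] = ()) -> List[str]:
    """Toy job: stage, run the tool, then gather names matching `pattern`."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp)
        # 1. Staging (the role InitialWorkDirRequirement plays); optional.
        for name in stage_dirs:
            (workdir / name).mkdir()
        # 2. The tool runs inside the working directory.
        tool(workdir)
        # 3. Output gathering: the glob is relative to the working directory.
        return sorted(p.name for p in workdir.glob(pattern))


def tool_creates_dir(workdir: pathlib.Path) -> None:
    """A tool that creates its own output directory."""
    (workdir / "hello").mkdir()


def tool_does_nothing(workdir: pathlib.Path) -> None:
    """A tool that creates nothing."""
```

In this model the glob finds hello either because the tool created it (no staging needed) or because staging created it beforehand; if neither happens, the glob comes back empty.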

Thanks a lot Kaushik. That makes absolute sense.

I am also facing the same error.

INFO [job varscan4.cwl] Max memory used: 230MiB
ERROR [job varscan4.cwl] Job error:
("Error collecting output for parameter 'vcf': varscan4.cwl:39:7: Did not find output file with glob pattern: ['output.vcf'].", {})
WARNING [job varscan4.cwl] completed permanentFail
{}
WARNING Final process status is permanentFail

Here is the CWL script:

cwlVersion: v1.0
class: CommandLineTool
label: "VarScan Variant Calling"

requirements:
  - class: DockerRequirement
    dockerImageId: "kboltonlab/varscan2:1.1"

baseCommand: ["java", "-jar", "/opt/varscan/VarScan.v2.4.2.jar"]

arguments:
  - "pileup2cns"
  - "$(inputs.bam.path)"
  - "$(inputs.reference.path)"
  - "--output-vcf"
  - "1"

inputs:
  bam:
    type: File
    inputBinding:
      position: 3
  reference:
    type: File
    inputBinding:
      position: 2
    secondaryFiles: [.fai]

outputs:
  vcf:
    type: File
    outputBinding:
      glob: "output.vcf"

And this is the JSON file with the input parameters:

{
  "bam": {
    "class": "File",
    "path": "/home/ec2-user/healthomics_SI/varscan/SRRoutput.pileup",
    "secondaryFiles": [
      {
        "class": "File",
        "path": "/home/ec2-user/healthomics_SI/varscan/ref_with_svs.bai"
      }
    ]
  },
  "reference": {
    "class": "File",
    "path": "/ngs/reference/Homo_sapiens_assembly38.fasta",
    "secondaryFiles": [
        {"class": "File", "path": "/ngs/reference/Homo_sapiens_assembly38.fasta.fai"}
    ]
  },
  "strand_filter": 0,
  "min_coverage": 8,
  "min_var_freq": 0.1,
  "min_reads": 2,
  "p_value": 0.99,
  "sample_name": "save.vcf"
}
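When debugging a job like this, it can help to first confirm that the input file parses as strict JSON, since strict parsers reject details such as a trailing comma after the last element of an array. A minimal sketch using Python's standard json module follows; `load_job` is a hypothetical helper name, not a cwltool function.

```python
import json


def load_job(text: str) -> dict:
    """Parse job parameters, reporting where strict JSON parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"job file is not strict JSON: line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err


if __name__ == "__main__":
    # A well-formed fragment parses into a plain dict.
    print(load_job('{"min_coverage": 8, "p_value": 0.99}'))
```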