Create/Move output of a tool to a specific location

Hello,

I am writing a tool file and would like to copy the outputs to a location other than the current working directory. I have tried to follow a few online examples, such as http://www.commonwl.org/user_guide/15-staging/ and the GitHub issue "advanced file staging: writable inputs" (Issue #36 in common-workflow-language/user_guide), but couldn't sort it out.

Following is the tool definition:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

hints:
 DockerRequirement:
   dockerImageId: umccr/pipeline-cwl

requirements:
  EnvVarRequirement:
    envDef:
      DEPLOY_ENV: $(inputs.denv)
  InitialWorkDirRequirement:
    listing:
      - $(inputs.samplesheet)
      - entry: $(inputs.outdir)
        writable: true

inputs:
  outdir:
    type: Directory

  denv: string

  samplesheet:
    type: File
    inputBinding:
      position: 1

  config:
    type: Directory
    inputBinding:
      position: 2

outputs:
  log_out:
    type: stdout

  split_samplesheets:
    type:
      type: array
      items: [File, Directory]
    outputBinding:
      glob: "*"

stdout: samplesheet-check.log

baseCommand: [python, /scripts/samplesheet-check.py]

This runs successfully, but it writes the output to the current working directory instead of outdir.
The complete log is:

$ cwltool ./tools/sampleSheetCheck.cwl ./jobs/sampleSheetCheck-job.yml
INFO /home/ssm-user/miniconda2/envs/cwl/bin/cwltool 1.0.20190915164430
INFO Resolved './tools/sampleSheetCheck.cwl' to 'file:///data/git/bcl2fastq/workflow_cwl/tools/sampleSheetCheck.cwl'
INFO [job sampleSheetCheck.cwl] /tmp/8o0ysoqu$ docker
    run
    -i
    --volume=/tmp/8o0ysoqu:/RmgdNv:rw
    --volume=/tmp/wo83trck:/tmp:rw
    --volume=/home/ssm-user/.config/gspread_pandas:/var/lib/cwl/stg0f93fa7a-c6f0-49fc-b543-e19b72c97a6d/gspread_pandas:ro
    --volume=/data/bcl/SampleSheet.csv:/RmgdNv/SampleSheet.csv:ro
    --workdir=/RmgdNv
    --read-only=true
    --log-driver=none
    --user=1001:1001
    --rm
    --env=TMPDIR=/tmp
    --env=HOME=/RmgdNv
    --cidfile=/tmp/2lk7fn52/20191028003947-746374.cid
    --env=DEPLOY_ENV=prod
    umccr/pipeline-cwl
    python
    /scripts/samplesheet-check.py
    /RmgdNv/SampleSheet.csv
    /var/lib/cwl/stg0f93fa7a-c6f0-49fc-b543-e19b72c97a6d/gspread_pandas > /tmp/8o0ysoqu/samplesheet-check.log
INFO [job sampleSheetCheck.cwl] Max memory used: 0MiB
INFO [job sampleSheetCheck.cwl] completed success
{
    "log_out": {
        "location": "file:///data/git/bcl2fastq/workflow_cwl/samplesheet-check.log",
        "basename": "samplesheet-check.log",
        "class": "File",
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "size": 0,
        "path": "/data/git/bcl2fastq/workflow_cwl/samplesheet-check.log"
    },
    "split_samplesheets": [
        {
            "location": "file:///data/git/bcl2fastq/workflow_cwl/SampleSheet.csv",
            "basename": "SampleSheet.csv",
            "class": "File",
            "checksum": "sha1$1f99bbd938d8d1be37b25cf441b9ef8a56ac4320",
            "size": 4984,
            "path": "/data/git/bcl2fastq/workflow_cwl/SampleSheet.csv"
        },
        {
            "location": "file:///data/git/bcl2fastq/workflow_cwl/SampleSheet.csv.custom.1.10X",
            "basename": "SampleSheet.csv.custom.1.10X",
            "class": "File",
            "checksum": "sha1$29e4f13cd3bf26e470c8db025db8d6b976ed4b1b",
            "size": 4073,
            "path": "/data/git/bcl2fastq/workflow_cwl/SampleSheet.csv.custom.1.10X"
        },
        {
            "location": "file:///data/git/bcl2fastq/workflow_cwl/SampleSheet.csv.custom.2.truseq",
            "basename": "SampleSheet.csv.custom.2.truseq",
            "class": "File",
            "checksum": "sha1$8c7d616050ef8b081940bb50ee6feec4fbb54403",
            "size": 882,
            "path": "/data/git/bcl2fastq/workflow_cwl/SampleSheet.csv.custom.2.truseq"
        },
        {
            "location": "file:///data/git/bcl2fastq/workflow_cwl/output",
            "basename": "output",
            "class": "Directory",
            "listing": [],
            "path": "/data/git/bcl2fastq/workflow_cwl/output"
        },
        {
            "location": "file:///data/git/bcl2fastq/workflow_cwl/samplesheet-check.log",
            "basename": "samplesheet-check.log",
            "class": "File",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "size": 0,
            "path": "/data/git/bcl2fastq/workflow_cwl/samplesheet-check.log"
        }
    ]
}
INFO Final process status is success

Am I specifying the InitialWorkDirRequirement correctly?


Sehrish K.

Hi Sehrish!

I assume you provided a directory named output as input to this command line tool? If so, then InitialWorkDirRequirement behaved as expected: an empty directory named output was created in the working directory and later collected as tool output in the split_samplesheets array. That being said, I'm not sure you need InitialWorkDirRequirement at all for this wrapper.

The samplesheet-check.py script writes all of its output to the current directory and ignores the output subdirectory you staged. Can you pass this script a command line argument that specifies the output directory?

If the script does not accept a parameter for the output directory, then a CWL-based solution might look like:

  split_samplesheets:
    type:
      type: Directory
    outputBinding:
      glob: .
      outputEval: |
        ${
          self[0].basename = "my-directory-name";
          return self[0]
        }

Hi Sehrish,

Could you explain a little more what you are trying to do?

Hi Peter,

Thanks for the response. I am writing a tool definition for a script that writes its outputs (multiple files) to the directory containing the input file.

So far, my tool definition is:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

hints:
 DockerRequirement:
   dockerImageId: umccr/pipeline-cwl

requirements:
  EnvVarRequirement:
    envDef:
      DEPLOY_ENV: $(inputs.denv)
  InitialWorkDirRequirement:
    listing:
      - $(inputs.samplesheet)

inputs:

  denv: string

  samplesheet:
    type: File
    inputBinding:
      position: 1

  config:
    type: Directory
    inputBinding:
      position: 2

outputs:
  log_out:
    type: stdout

  split_samplesheets:
    type:
      type: array
      items: File
    outputBinding:
      glob: "*[!.csv]"

stdout: samplesheet-check.log

baseCommand: [python, /scripts/samplesheet-check.py]

And the job definition is:

outdir:
  class: Directory
  location: /data/bcl/output

denv: 'prod'

samplesheet:
  class: File
  path: /data/bcl/SampleSheet.csv

config:
  class: Directory
  location: /home/ssm-user/.config/gspread_pandas

This definition produces two split_samplesheets in the current working directory.
Is there a way to write the output to the directory of the input file samplesheet instead, i.e. /data/bcl/ in this case?

Hello,

Thanks for the response.
You are right that the script does not accept a parameter for the output directory.

I tried your suggested solution as follows:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

hints:
 DockerRequirement:
   dockerImageId: umccr/pipeline-cwl

requirements:
  EnvVarRequirement:
    envDef:
      DEPLOY_ENV: $(inputs.denv)
  InitialWorkDirRequirement:
    listing:
      - $(inputs.samplesheet)

inputs:

  denv: string

  samplesheet:
    type: File
    inputBinding:
      position: 1

  config:
    type: Directory
    inputBinding:
      position: 2

outputs:
  log_out:
    type: stdout

  split_samplesheets:
    type:
      type: Directory
      items: File
    outputBinding:
      glob: "*[!.csv]"
    outputEval: |
      ${
        self[0].basename = "/data/bcl/output";
        return self[0]
      }

stdout: samplesheet-check.log

baseCommand: [python, /scripts/samplesheet-check.py]

It produces the following error:

ERROR Tool definition failed validation:
tools/sampleSheetCheck.cwl:3:1:  Object `tools/sampleSheetCheck.cwl` is not valid because
                                   tried `CommandLineTool` but
tools/sampleSheetCheck.cwl:32:1:     the `outputs` field is not valid because
tools/sampleSheetCheck.cwl:36:3:       item is invalid because
tools/sampleSheetCheck.cwl:37:5:         * the `type` field is not valid because
                                             - tried CommandOutputRecordSchema but
tools/sampleSheetCheck.cwl:38:7:                 * the `type` field is not valid
                                                 because
                                                     the value 'Directory' is not a valid
                                                     Record_symbol, expected 'record'
tools/sampleSheetCheck.cwl:39:7:                 * invalid field `items`, expected
                                                 one of: 'fields', 'type', 'label'
tools/sampleSheetCheck.cwl:37:5:             - tried CommandOutputEnumSchema but
                                                 * missing required field `symbols`
tools/sampleSheetCheck.cwl:38:7:                 * the `type` field is not valid
                                                 because
                                                     the value 'Directory' is not a valid
                                                     Enum_symbol, expected 'enum'
tools/sampleSheetCheck.cwl:39:7:                 * invalid field `items`, expected
                                                 one of: 'symbols', 'type', 'label', 'outputBinding'
tools/sampleSheetCheck.cwl:37:5:             - tried CommandOutputArraySchema but
tools/sampleSheetCheck.cwl:38:7:                 the `type` field is not valid
                                                 because
                                                   the value 'Directory' is not a valid
                                                   Array_symbol, expected 'array'
tools/sampleSheetCheck.cwl:42:5:         * invalid field `outputEval`, expected one of:
                                         'label', 'secondaryFiles', 'streamable', 'doc', 'id',
                                         'outputBinding', 'format', 'type'

Also, please see the response to Peter for an update on the tool definition.

Hi Sehrish,

I'm sorry, there was one "type" key too many in my initial suggestion. Please try the following tool wrapper:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

hints:
 DockerRequirement:
   dockerImageId: umccr/pipeline-cwl

requirements:
  InlineJavascriptRequirement: {}
  EnvVarRequirement:
    envDef:
      DEPLOY_ENV: $(inputs.denv)

inputs:
  denv: string
  samplesheet:
    type: File
    inputBinding:
      position: 1
  config:
    type: Directory
    inputBinding:
      position: 2

outputs:
  log_out:
    type: stdout
  split_samplesheets:
    type: Directory
    outputBinding:
      glob: .
      outputEval: |
        ${
          self[0].basename = "output";
          return self[0]
        }

stdout: samplesheet-check.log

baseCommand: [python, /scripts/samplesheet-check.py]

Regarding the outputBinding for split_samplesheets: glob: . instructs the CWL runner to collect the current working directory as output. This directory should contain all the data that the Python script has written. The working directory is created by your cwl-runner at runtime and has a random string of characters as its name. This is why we use outputEval to change the basename property of the directory to a name of our choosing (in this case "output"). Because the outputEval body is a JavaScript expression, the tool also declares InlineJavascriptRequirement.

What you cannot do is instruct the CommandLineTool to create output at a specific location; it is constrained to the temporary directory used at runtime. To control where the final outputs end up, use the corresponding option of your cwl-runner. For cwltool this is --outdir, for example: cwltool --outdir myDirectory ./tools/sampleSheetCheck.cwl ./jobs/sampleSheetCheck-job.yml

I hope this helps!

Cheers,
Tom

Thank you so much Tom for the explanation.
It definitely makes better sense now.

Just one question: if the script produces files and directories (containing more files) as output, and I want to capture everything from the tool output, would glob: . suffice in that scenario as well? Or does glob: "*" make more sense in this case? I am not entirely sure of the difference between the two.

The difference between glob: "*" and glob: "." is that the first returns an array with all the File and/or Directory objects from the output directory, while the second produces a single Directory object with all of those File and/or Directory objects in its listing.

Another way of thinking of it: the contents of the "listing" of the Directory from glob: "." are the same as the result of glob: "*".
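
To make the contrast concrete, here is a minimal sketch of the two output declarations side by side. The output names all_as_array and all_as_directory are made up for illustration; the types and glob patterns are the ones already used earlier in this thread.

```yaml
outputs:
  # glob: "*" matches each top-level entry of the output directory,
  # so the result is an array of File and/or Directory objects.
  all_as_array:            # hypothetical output name
    type:
      type: array
      items: [File, Directory]
    outputBinding:
      glob: "*"

  # glob: . matches the output directory itself,
  # so the result is a single Directory whose listing holds those entries.
  all_as_directory:        # hypothetical output name
    type: Directory
    outputBinding:
      glob: .
```

With the array form, each entry is reported as its own object. With the directory form there is a single Directory, which can then be renamed with outputEval as in Tom's example.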

Thanks a lot for clarifying, Peter.
