Avoiding repetitive expressions for output binding in CommandLineTool

ba1 · August 23, 2022, 8:20am

Hello, I have a question about reusing expressions. I’d like to write a CommandLineTool for a tool that produces several output files sharing a certain prefix in their paths. The tool derives that prefix from the input files’ base names, like so:

SRR1659960_minimal_R1.fastq.gz
SRR1659960_minimal_R2.fastq.gz >> SRR1659960_minimal
SRR1659960_maximal_R1.fastq.gz
SRR1659960_maximal_R2.fastq.gz >> SRR1659960_maximal

At the moment, I bind these output files with a glob, using a JS expression that mimics the tool’s prefix generation and then appends a suffix.

This means the prefix generation code is repeated in every output binding, as I couldn’t find any other way.

cwlVersion: 'v1.2'

class: 'CommandLineTool'
label: 'Tool'

$namespaces:
  edam: 'http://edamontology.org/'

$schemas:
  - 'EDAM_1.25.owl'

requirements:
  - class: 'InlineJavascriptRequirement'
  - class: 'DockerRequirement'
    dockerPull: 'path_to_image'

baseCommand: ['tool']

inputs:
  - id: "input_reads_r1"
    type: 'File'
    format: 'edam:format_1930'  # FASTQ
    inputBinding:
      prefix: '--input-reads-r1'
      position: 100
    label: 'Input reads file R1 (forward)'
    doc: >-
      Path to R1 (i.e. forward/left) reads file in gzipped fastq format.
  - id: "input_reads_r2"
    type: 'File'
    format: 'edam:format_1930'  # FASTQ
    inputBinding:
      prefix: '--input-reads-r2'
      position: 101
    label: 'Input reads file R2 (reverse)'
    doc: >-
      Path to R2 (i.e. reverse/right) reads file in gzipped fastq format.

####################################################################################################

arguments:
  - prefix: "--output-dir"
    position: 600
    separate: true
    valueFrom: |
      ${
        // obtain the common prefix of the input files for output directory naming
        var r1=inputs.input_reads_r1.nameroot;
        var r2=inputs.input_reads_r2.nameroot;
        var i = 0;
        while (r1[i] === r2[i]) {
            i++;
        }
        // remove trailing _R or _
        if (r1.slice(0,i).endsWith('_R')) {
          i = i - 2;
        } else if (r1.slice(0,i).endsWith('_')) {
          i--;
        }
        return r1.slice(0,i) + "_tool_output";
      }

outputs:
  - id: "zipped_output_directory"
    type: 'File?'
    format: 'edam:format_3989'  # GZIP
    outputBinding:
      glob: |
        ${
          // obtain the common prefix of the input files for output directory naming
          var r1=inputs.input_reads_r1.nameroot;
          var r2=inputs.input_reads_r2.nameroot;
          var i = 0;
          while (r1[i] === r2[i]) {
              i++;
          }
          // remove trailing _R or _
          if (r1.slice(0,i).endsWith('_R')) {
            i = i - 2;
          } else if (r1.slice(0,i).endsWith('_')) {
            i--;
          }
          return r1.slice(0,i) + "_tool_output.tar.gz";
        }
    streamable: true
    doc: 'All final and intermediary outputs of the tool as a gzipped tar archive.'
  - id: "final_prediction_csv"
    type: 'File?'
    format: 'edam:format_3751'  # DSV
    outputBinding:
      glob: |
        ${
          // obtain the common prefix of the input files for output directory naming
          var r1=inputs.input_reads_r1.nameroot;
          var r2=inputs.input_reads_r2.nameroot;
          var i = 0;
          while (r1[i] === r2[i]) {
              i++;
          }
          // remove trailing _R or _
          if (r1.slice(0,i).endsWith('_R')) {
            i = i -2;
          } else if (r1.slice(0,i).endsWith('_')) {
            i--;
          }
          var final_pred_dir = r1.slice(0,i) + '_tool_output/Summary/'
          return final_pred_dir + r1.slice(0,i) + '_something.csv';
        }
  - id: "standard_error"
    type: 'stderr'
    outputBinding:
      glob: |
        ${
          // obtain the common prefix of the input files for output directory naming
          var r1=inputs.input_reads_r1.nameroot;
          var r2=inputs.input_reads_r2.nameroot;
          var i = 0;
          while (r1[i] === r2[i]) {
              i++;
          }
          // remove trailing _R or _
          if (r1.slice(0,i).endsWith('_R')) {
            i = i -2;
          } else if (r1.slice(0,i).endsWith('_')) {
            i--;
          }
          return r1.slice(0,i) + '_interesting.stderr';
        }
    streamable: true
  - id: "standard_output"
    type: 'stdout'
    outputBinding:
      glob: |
        ${
          // obtain the common prefix of the input files for output directory naming
          var r1=inputs.input_reads_r1.nameroot;
          var r2=inputs.input_reads_r2.nameroot;
          var i = 0;
          while (r1[i] === r2[i]) {
              i++;
          }
          // remove trailing _R or _
          if (r1.slice(0,i).endsWith('_R')) {
            i = i -2;
          } else if (r1.slice(0,i).endsWith('_')) {
            i--;
          }
          return r1.slice(0,i) + '_interesting.stdout';
        }
    streamable: true

Is there a way to circumvent this?

brunokinoshita · August 25, 2022, 4:03am

Hi @ba1

You already have an InlineJavascriptRequirement in your command-line tool. Maybe its expressionLib field will work for you? See the Common Workflow Language (CWL) Command Line Tool Description, v1.2.

I simplified your example tool and moved the common JS code into a function in an expressionLib. I got some errors from the stdout & stderr outputs, but I think that’s for another thread, so I commented those out.

# File: /tmp/test.cwl
cwlVersion: 'v1.2'

class: 'CommandLineTool'
label: 'Tool'

requirements:
  - class: 'InlineJavascriptRequirement'
    expressionLib:
      - |
          var getCommonPrefix = function (inputs) {
            // obtain the common prefix of the input files for output directory naming
            var r1=inputs.input_reads_r1.nameroot;
            var r2=inputs.input_reads_r2.nameroot;
            var i = 0;
            while (i < r1.length && r1[i] === r2[i]) {
                i++;
            }
            // remove trailing _R or _
            if (r1.slice(0,i).endsWith('_R')) {
              i = i -2;
            } else if (r1.slice(0,i).endsWith('_')) {
              i--;
            }
            return r1.slice(0,i)
          }
      # or load an external file, or other libraries to use in your workflow
      # - { $include: utils.js }
baseCommand: ['true']

inputs:
  - id: "input_reads_r1"
    type: 'File'
    inputBinding:
      prefix: '--input-reads-r1'
      position: 100
  - id: "input_reads_r2"
    type: 'File'
    inputBinding:
      prefix: '--input-reads-r2'
      position: 101

arguments:
  - prefix: "--output-dir"
    position: 600
    separate: true
    valueFrom: |
      ${
        var r1 = getCommonPrefix(inputs)
        return r1 + "_tool_output";
      }

outputs:
  - id: "zipped_output_directory"
    type: 'File?'
    outputBinding:
      glob: |
        ${
          var r1 = getCommonPrefix(inputs)
          return r1 + "_tool_output.tar.gz";
        }
# ERROR Tool definition failed validation:
# ../../../../../../tmp/test.cwl:52:1: Not allowed to specify outputBinding when using stderr
#                                      shortcut.

  # - id: "final_prediction_csv"
  #   type: 'File?'
  #   outputBinding:
  #     glob: |
  #       ${
  #         var prefix = getCommonPrefix(inputs)
  #         var final_pred_dir = prefix + '_tool_output/Summary/'
  #         return final_pred_dir + prefix + '_something.csv';
  #       }
  # - id: "standard_error"
  #   type: 'stderr'
  #   outputBinding:
  #     glob: |
  #       ${
  #         var r1 = getCommonPrefix(inputs)
  #         return r1 + '_interesting.stderr';
  #       }
  #   streamable: true
  # - id: "standard_output"
  #   type: 'stdout'
  #   outputBinding:
  #     glob: |
  #       ${
  #         var r1 = getCommonPrefix(inputs)
  #         return r1 + '_interesting.stdout';
  #       }
  #   streamable: true

The good thing about true is that it accepts any arguments, which makes it an excellent way to test scripts. Here’s what I got running this example with cwltool:

(venv) kinow@ranma:~/Development/python/workspace/cwltool$ touch /tmp/colombia.txt /tmp/colombina.txt
(venv) kinow@ranma:~/Development/python/workspace/cwltool$ cwltool /tmp/test.cwl --input_reads_r1 /tmp/colombia.txt --input_reads_r2 /tmp/colombina.txt
INFO /home/kinow/Development/python/workspace/cwltool/venv/bin/cwltool 3.1.20220821233356
INFO Resolved '/tmp/test.cwl' to 'file:///tmp/test.cwl'
INFO [job test.cwl] /tmp/o2zn3i41$ true \
    --input-reads-r1 \
    /tmp/wnbdqn59/stg89819a8e-c829-49e4-b890-ca0957429330/colombia.txt \
    --input-reads-r2 \
    /tmp/wnbdqn59/stgadba0f7d-eab0-4d8c-9b1c-2150c5b8c177/colombina.txt \
    --output-dir \
    colombi_tool_output
INFO [job test.cwl] completed success
{
    "zipped_output_directory": null
}
INFO Final process status is success

Note the --output-dir value, built from the common part of the two input file names. You can reuse the same function in any other expression in your tool.
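If you prefer the external-file route hinted at by the commented-out $include line in the tool above, a minimal sketch of what utils.js could contain is below. The file name comes from that comment; the mockInputs object is purely hypothetical and only stands in for the inputs object a CWL runner would pass, so that the function can be tried with plain node before wiring it into the tool. The loop also stops at the end of r1, an added guard so that two identical names cannot loop forever.

```javascript
// utils.js (file name assumed from the commented-out $include line)
// Returns the shared prefix of the two read files' nameroot values,
// with a trailing "_R" or "_" removed.
var getCommonPrefix = function (inputs) {
  var r1 = inputs.input_reads_r1.nameroot;
  var r2 = inputs.input_reads_r2.nameroot;
  var i = 0;
  // stop at the end of r1 so identical names cannot loop forever
  while (i < r1.length && r1[i] === r2[i]) {
    i++;
  }
  var prefix = r1.slice(0, i);
  if (prefix.slice(-2) === '_R') {
    prefix = prefix.slice(0, -2);
  } else if (prefix.slice(-1) === '_') {
    prefix = prefix.slice(0, -1);
  }
  return prefix;
};

// Hypothetical stand-in for the CWL `inputs` object; only `nameroot` is mocked.
// nameroot drops just the last extension, so ".fastq" is still present here.
var mockInputs = {
  input_reads_r1: { nameroot: 'SRR1659960_minimal_R1.fastq' },
  input_reads_r2: { nameroot: 'SRR1659960_minimal_R2.fastq' }
};

console.log(getCommonPrefix(mockInputs) + '_tool_output');
// prints "SRR1659960_minimal_tool_output"
```

When the file is used through $include, you would probably want to drop the mockInputs part and keep only the function definition, since everything in expressionLib is evaluated before each expression.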

Hope that helps,
-Bruno

ba1 · August 29, 2022, 7:41am

This is great! I didn’t know about expressionLib and probably would have never found it in the specs. Thank you very much!