CWL - glob an array of output files based on a single input string

dennis_kennetz · March 19, 2020, 2:45pm

Hi all, I am writing a CWL CommandLineTool for fastp, which can take a single fastq or a fastq pair and split each output fastq by number of lines. This is useful for an embarrassingly parallel workflow. I want the tool to require fastq1 as input and optionally accept fastq2 (because you can trim in single-end or paired-end mode). This means each output can be either a single file or an array of files. I have handled the optional input fastqs, but I am not sure how to handle the outputs.

For example, say fastq1 is called myfastq_R1_.fastq.gz and has 20,000,000 lines, and I name the output fastq myfastq_R1_.trimmed.fastq.gz. If I split it by 2,000,000 lines per file, the outputs would be named 001.myfastq_R1_.trimmed.fastq.gz, 002.myfastq_R1_.trimmed.fastq.gz ... up to 010.myfastq_R1_.trimmed.fastq.gz. Optionally, I could add a second fastq and do the same thing.

The glob is the confusing part, because the output could be the specified string (myfastq_R1_.trimmed.fastq.gz) or that filename with a number prefix. Here is my tool currently:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

baseCommand: [fastp]

label: fastp adapter trimmer
doc: |
  fastp -i <fastq> -o <fastq_out> -I <fastq2?> -O <fastq2_out?> args.

inputs:

  ##################
  # Required input #
  ##################

  fastq:
    type: File
    inputBinding:
      prefix: -i
    doc: -i FILE    read1 input file name

  fastq_out:
    type: string
    inputBinding:
      prefix: -o
    doc: -o STRING  read1 output file name

  fastq2:
    type: File?
    inputBinding:
      prefix: -I
    doc: -I FILE    read2 input file name

  fastq2_out:
    type: string?
    inputBinding:
      prefix: -O
    doc: -O STRING  read2 output file name

  split_by_lines:
    type: int?
    inputBinding:
      prefix: -S
    doc: -S INT     split output by limiting total lines of each file. output will be named 001.fqname.fastq, 002.fqname.fastq...

outputs:
  trimmed_fastq:
    type:
      - File
      - type: array
        items: File
    outputBinding:
      glob: # could be $(inputs.fastq_out), or 001.$(inputs.fastq_out), 002.$(inputs.fastq_out) ... n.$(inputs.fastq_out)

  trimmed_fastq2:
    type:
      - "null"
      - File
      - type: array
        items: File
    outputBinding:
      glob: # might not exist, or could meet the same conditions as fastq1

dennis_kennetz · March 19, 2020, 7:14pm

I have solved this, and the solution is a bit easier than I thought!

outputs:
  trimmed_fastq:
    type:
      - type: array
        items: File
    outputBinding:
      glob: ["$(inputs.fastq_out)", "*.$(inputs.fastq_out)"]
  trimmed_fastq2:
    type:
      - type: array
        items: File
    outputBinding:
      glob: ["$(inputs.fastq2_out)", "*.$(inputs.fastq2_out)"]
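
For anyone testing this, a minimal job file for the single-end, split case might look like the sketch below. The file names job.yml and fastp.cwl are assumptions for illustration; the input values come from the example in the question. With these inputs, the two glob patterns for trimmed_fastq resolve to myfastq_R1_.trimmed.fastq.gz and *.myfastq_R1_.trimmed.fastq.gz, so the numbered chunks should be picked up by the second pattern.

```yaml
# job.yml (hypothetical file name): inputs for the fastp tool above
fastq:
  class: File
  path: myfastq_R1_.fastq.gz               # read1 input from the example
fastq_out: myfastq_R1_.trimmed.fastq.gz    # base name used by both glob patterns
split_by_lines: 2000000                    # omit this line to get a single, unsplit output
# fastq2 and fastq2_out are optional and left out for single-end mode
```

It could then be run with something like cwltool fastp.cwl job.yml. Without split_by_lines, only the first pattern should match, so trimmed_fastq would come back as a one-element array.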