Workflow step with FIle array for input and output

roadrider · January 20, 2023, 7:51pm

Newbie question here. I’m just experimenting with CWL and I’m a software guy not a bioinformatician.

I’m trying to put together a simple workflow using some CWL tool definitions I cribbed from github: gunzip.cwl, bwa-index.cwl and bwa-mem.cwl. The idea is to unzip two fastq.gz files, index a FASTA reference and then align the uncompressed reads files to the reference using bwa mem. The first two steps work find but I’m having difficulty with the alignment step. Here is the workflow code:

The error I get is:

(python3.9) pdagosto@rckgxf0013 cwl]$ cwltool demo.cwl demo.yaml
INFO /work/miniconda3/envs/python3.9/bin/cwltool 3.1.20221201130942
INFO Resolved 'demo.cwl' to 'file:///work/snakemake_cwl_demo/cwl/demo.cwl'
ERROR Tool definition failed validation:

demo.cwl:31:5: Source 'output_filename_array' of type {"type": "array", "items": "File"} is
               incompatible
demo.cwl:68:7:   with sink 'output_filename' of type "string"
demo.cwl:54:9: Source 'output' of type {"type": "array", "items": "File"} is incompatible
demo.cwl:67:7:   with sink 'reads' of type {"type": "array", "items": {"type": "array",
                 "items": "File"}}
type or paste code here

The bwa-mem.cwl file defines the outputs as a glob of filenames:

outputs:
  output:
    type: File
    outputBinding:
      glob: $(inputs.output_filename)

I’m not sure how to represent that in the Workflow.

The second error has to do with the input to the bwa mem step which should be an array of files but my code seems to be requesting an array of an array of files. I’m not sure how to resolve this problem.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.0

requirements:
  InitialWorkDirRequirement:
    listing: [ $(inputs.compressed_file) ]
  ScatterFeatureRequirement: {}

inputs:
  fastq_gz_file_array:
    type: File[]

  reads_array:
    type: File[]

  reference:
    type: File
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa

  decompress_to_stdout:
    type: boolean

  output_filename_array:
    type: File[]

outputs:
  unzip_output:
    type: File[]
    outputSource: unzip/output

  bwaindex_output:
    type: File
    outputSource: bwa_index/output

  bwamem_output:
    type: File[]
    outputSource: bwa_mem/output

steps:
  unzip:
    run: gunzip.cwl
    scatter: compressed_file
    in:
      compressed_file: fastq_gz_file_array
      decompress_to_stdout: decompress_to_stdout
    out:
      [ output ]
  bwa_index:
    run: bwa_index.cwl
    in:
      sequences: reference
    out:
      [ output ]

  bwa_mem:
    run: bwa_mem.cwl
    scatter: reads
    in:
      reference: reference
      reads: unzip/output
      output_filename: output_filename_array
    out:
      [ output ]

The bwa_mem.cwl file is:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
#- $import: bwa-docker.yml
- class: InlineJavascriptRequirement

inputs:
  minimum_seed_length:
    type: int?
    inputBinding:
      position: 1
      prefix: -k
    doc: -k INT        minimum seed length [19]
  reference:
    type: File
    secondaryFiles:
        - .amb
        - .ann
        - .bwt
        - .pac
        - .sa
    inputBinding:
      position: 2

  output_filename: string
  reads:
    type:
      type: array
      items: File
    inputBinding:
      position: 3

  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: -t
    doc: -t INT        number of threads [1]
  min_std_max_min:
    type: int[]?
    inputBinding:
      position: 1
      prefix: -I
      itemSeparator: ','
stdout: $(inputs.output_filename)

outputs:
  output:
    type: File
    outputBinding:
      glob: $(inputs.output_filename)

baseCommand:
- bwa
- mem

doc: |
  Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

  Algorithm options:
         -w INT        band width for banded alignment [100]
         -d INT        off-diagonal X-dropoff [100]
         -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
         -y INT        seed occurrence for the 3rd round seeding [20]
         -c INT        skip seeds with more than INT occurrences [500]
         -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
         -W INT        discard a chain if seeded bases shorter than INT [0]
         -m INT        perform at most INT rounds of mate rescues for each read [50]
         -S            skip mate rescue
         -P            skip pairing; mate rescue performed unless -S also in use
         -e            discard full-length exact matches

  Scoring options:

         -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
         -B INT        penalty for a mismatch [4]
         -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
         -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
         -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
         -U INT        penalty for an unpaired read pair [17]

         -x STR        read type. Setting -x changes multiple parameters unless overriden [null]
                       pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0  (PacBio reads to ref)
                       ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0  (Oxford Nanopore 2D-reads to ref)
                       intractg: -B9 -O16 -L5  (intra-species contigs to ref)

  Input/output options:

         -p            smart pairing (ignoring in2.fq)
         -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
         -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
         -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)

         -v INT        verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
         -T INT        minimum score to output [30]
         -h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
         -a            output all alignments for SE or unpaired PE
         -C            append FASTA/FASTQ comment to SAM output
         -V            output the reference FASTA header in the XR tag
         -Y            use soft clipping for supplementary alignments
         -M            mark shorter split hits as secondary

         -I FLOAT[,FLOAT[,INT[,INT]]]
                       specify the mean, standard deviation (10% of the mean if absent), max
                       (4 sigma from the mean if absent) and min of the insert size distribution.
                       FR orientation only. [inferred]

  Note: Please read the man page for detailed description of the command line and options.
type or paste code here

roadrider · January 25, 2023, 5:58pm

I changed the workflow so that instead of trying to run bwa mem on each unzipped fastq file it uses the -p (smart pairing ) flag and will take both unzipped ffastq files as input. Here is the revised workflow file:

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

requirements:
  InitialWorkDirRequirement:
    listing: [ $(inputs.compressed_file) ]
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}

inputs:
  fastq_gz_file_array:
    type: File[]

  reference:
    type: File
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa

  decompress_to_stdout:
    type: boolean

  smart_pairing:
    type: boolean

  output_filename:
    type: string

outputs:
  indexed_sequences:
    type: File
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa
    outputSource: bwa_index/indexed_sequences

  bwamem_output:
    type: File
    outputSource: bwa_mem/aligned_reads

steps:
  unzip:
    run: gunzip.cwl
    scatter: compressed_file
    in:
      compressed_file: fastq_gz_file_array
      decompress_to_stdout: decompress_to_stdout
    out:
      [ output ]
  bwa_index:
    run: bwa_index.cwl
    in:
      sequences: reference
    out:
      [ indexed_sequences ]
  bwa_mem:
    run: bwa_mem.cwl
    in:
      reference: reference
      reads: unzip/output
      output_filename: output_filename
      smart_pairing: smart_pairing
    out:
      [ aligned_reads ]

The tool wrappers have not changed except for adding the -p flag as an input for bwa_mem.

Here is the error I’m getting:

(python3.9) pdagosto@rckgxf0013 cwl]$ cwltool demo.cwl demo.yaml
INFO /work/miniconda3/envs/python3.9/bin/cwltool 3.1.20221201130942
INFO Resolved 'demo.cwl' to 'file:///work/snakemake_cwl_demo/cwl/demo.cwl'
WARNING Workflow checker warning:
demo.cwl:17:5: Source 'reference' of type "File" may be incompatible
demo.cwl:67:7:   with sink 'reference' of type "File"
ERROR Workflow error, try again with --debug for more information:
demo.cwl:18:5: Missing required secondary file 'mmv_NC_001510.fasta.amb' from file object: {
                   "class": "File",
                   "location": "file:///work/snakemake_cwl_demo/cwl/mmv_NC_001510.fasta",
                   "size": 5291,
                   "basename": "mmv_NC_001510.fasta",
                   "nameroot": "mmv_NC_001510",
                   "nameext": ".fasta",
                   "secondaryFiles": []
               }

I don’t understand this error because the index files have not yet been created. They are the output from the bwa_index step.

I tried adding the secondaryFiles item to the bwa_mem steps input but that just gave a different error.