What is the best way to ingest a directory of FASTQ paired-end files?

pvanheus · March 3, 2021, 8:50pm

Processing pairs of FASTQ sequence files is a common task in bioinformatics. What is the best way to do this in CWL? I created this toy workflow to illustrate one approach:

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements:
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  reads_dir:
    type: Directory

outputs:
  results:
    type: File[]
    outputSource: process_read_pairs/result
steps:
  make_read_pairs:
    in:
      reads_dir: reads_dir
    out:
      - read_pairs
    run:
      class: ExpressionTool
      inputs:
        reads_dir:
          type: Directory
      outputs:
        read_pairs:
          type: 
            type: array
            items:
              type: array
              items: File
      expression: |
        ${ 
          var read_pairs = Array();
          inputs.reads_dir.listing.forEach( function (entry) {
            if (entry.class == "File" && (entry.location.endsWith("_2.fastq.gz") || entry.location.endsWith("_2.fastq"))) {
              var read1 = Object.assign({}, entry); // shallow copy
              read1.location = entry.location.replace('_2.fastq', '_1.fastq');
              read1.basename = entry.basename.replace('_2.fastq', '_1.fastq');
              read_pairs.push(Array(read1, entry));
            }
          });
          console.log("pairs: " + read_pairs);
          return {"read_pairs": read_pairs};
        }
  process_read_pairs:
    in:
      read_pairs: make_read_pairs/read_pairs
    out:
      - result
    scatter: read_pairs
    run:
      class: CommandLineTool
      inputs:
        read_pairs:
          type: File[]
          inputBinding:
            position: 10
      outputs:
        result:
          type: stdout
      baseCommand: [ echo ]

This uses an ExpressionTool to convert a Directory into pairs of files (making some assumptions about file naming along the way). How do others do this? Is there a “canonical” way?
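To make the naming assumptions explicit, here is a minimal sketch of the same pairing logic as a plain JavaScript function that can be tried outside a CWL runner. The name `pairReads` and the `unpaired` report are hypothetical additions, not part of the workflow above. It assumes mates are named `<sample>_1.fastq[.gz]` and `<sample>_2.fastq[.gz]`, and it sticks to ES5 features, since CWL expressions are specified against ECMAScript 5.1.

```javascript
// Sketch only: pair read-1/read-2 File entries from a Directory listing.
// `pairReads` is a hypothetical helper name, not a CWL built-in.
function pairReads(listing) {
  // Index File entries by basename so each mate lookup is a property check.
  var byName = {};
  listing.forEach(function (entry) {
    if (entry["class"] === "File") {
      byName[entry.basename] = entry;
    }
  });

  // Assumed naming convention: <sample>_2.fastq or <sample>_2.fastq.gz
  var read2 = /_2(\.fastq(?:\.gz)?)$/;
  var pairs = [];
  var unpaired = [];

  Object.keys(byName).sort().forEach(function (name) {
    if (!read2.test(name)) {
      return; // not a read-2 file
    }
    var mateName = name.replace(read2, "_1$1");
    if (Object.prototype.hasOwnProperty.call(byName, mateName)) {
      // Use the real read-1 entry from the listing rather than a copy of read 2.
      pairs.push([byName[mateName], byName[name]]);
    } else {
      unpaired.push(name); // read 2 present but read 1 missing
    }
  });

  return { pairs: pairs, unpaired: unpaired };
}

// Example with a made-up listing.
var example = pairReads([
  { "class": "File", basename: "s1_1.fastq.gz", location: "file:///reads/s1_1.fastq.gz" },
  { "class": "File", basename: "s1_2.fastq.gz", location: "file:///reads/s1_2.fastq.gz" },
  { "class": "File", basename: "s2_2.fastq.gz", location: "file:///reads/s2_2.fastq.gz" },
  { "class": "Directory", basename: "logs", location: "file:///reads/logs" }
]);
console.log(JSON.stringify(example, null, 2));
```

Looking up the read-1 entry in the listing, instead of shallow-copying the read-2 entry and renaming it, avoids carrying over any read-2 fields that a runner may have filled in, and it lets the expression notice when a mate is missing. The function body could be dropped into the `expression` block with `inputs.reads_dir.listing` as its argument.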

Thanks, Peter

pvanheus · March 5, 2021, 7:27am

Right now I’m experimenting with a workflow similar to this one, running it with Toil. Unfortunately, a lot of copying happens.

One option is to operate outside of CWL with a tool such as Looper, but that seems somewhat contrary to the point of having a workflow language.
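As a rough illustration of what "outside of CWL" could look like (this is not Looper, just a small Node.js sketch), the pairs can be computed up front and written into a job file, so the workflow receives the pairs directly instead of a Directory. The helper name `buildJob`, the input name `read_pairs` and the example paths are all assumptions; the workflow above would need a matching input of type array-of-File-arrays for this to apply.

```javascript
var path = require("path");

// Hypothetical helper: build a CWL job object from a directory path and the
// file names found in it. Assumes <sample>_1.fastq[.gz] / <sample>_2.fastq[.gz]
// naming and a workflow input called `read_pairs` (an assumption).
function buildJob(dir, names) {
  var present = {};
  names.forEach(function (n) { present[n] = true; });

  var read2 = /_2(\.fastq(?:\.gz)?)$/;
  var pairs = [];
  names.slice().sort().forEach(function (name) {
    if (!read2.test(name)) { return; }       // only start from read-2 files
    var mate = name.replace(read2, "_1$1");
    if (present[mate] !== true) { return; }  // skip read-2 files without a mate
    pairs.push([
      { "class": "File", path: path.posix.join(dir, mate) },
      { "class": "File", path: path.posix.join(dir, name) }
    ]);
  });
  return { read_pairs: pairs };
}

// Example with made-up file names; in practice `names` might come from fs.readdirSync(dir).
var job = buildJob("/data/reads", ["s1_1.fastq.gz", "s1_2.fastq.gz", "s2_1.fastq", "s2_2.fastq"]);
console.log(JSON.stringify(job, null, 2));
```

The printed JSON could be saved and passed to the runner as the job file. Whether this reduces the copying seen with Toil is not something this sketch can establish.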