Processing pairs of FASTQ sequence files is a common task in bioinformatics. What is the best way to do this in CWL? I created this toy workflow to illustrate one approach:
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements:
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  reads_dir:
    type: Directory

outputs:
  results:
    type: File[]
    outputSource: process_read_pairs/result

steps:
  # Step 1: turn the directory listing into an array of [read1, read2] pairs.
  make_read_pairs:
    in:
      reads_dir: reads_dir
    out:
      - read_pairs
    run:
      class: ExpressionTool
      inputs:
        reads_dir:
          type: Directory
      outputs:
        read_pairs:
          type:
            type: array
            items:
              type: array
              items: File
      expression: |
        ${
          var read_pairs = [];
          inputs.reads_dir.listing.forEach(function (entry) {
            // Read 2 files are assumed to end in _2.fastq or _2.fastq.gz.
            if (entry.class == "File" && (entry.location.endsWith("_2.fastq.gz") || entry.location.endsWith("_2.fastq"))) {
              // Derive read 1 from read 2 by name; this does not check that the file exists.
              var read1 = Object.assign({}, entry); // shallow copy
              read1.location = entry.location.replace('_2.fastq', '_1.fastq');
              read1.basename = entry.basename.replace('_2.fastq', '_1.fastq');
              read_pairs.push([read1, entry]);
            }
          });
          console.log("pairs: " + JSON.stringify(read_pairs));
          return {"read_pairs": read_pairs};
        }

  # Step 2: run the tool once per pair (scatter over the outer array).
  process_read_pairs:
    in:
      read_pairs: make_read_pairs/read_pairs
    out:
      - result
    scatter: read_pairs
    run:
      class: CommandLineTool
      inputs:
        read_pairs:
          type: File[]
          inputBinding:
            position: 10
      outputs:
        result:
          type: stdout
      baseCommand: [ echo ]
This uses an ExpressionTool to convert a Directory into an array of read pairs, assuming along the way that read 1 and read 2 files differ only in their _1.fastq / _2.fastq suffix (optionally followed by .gz). How do others do this? Is there a “canonical” way?
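To make the naming assumptions easier to check outside a CWL runner, here is a minimal sketch of the pairing logic as a standalone JavaScript function. The name `makeReadPairs` is hypothetical and not part of the workflow above. It is also a variation on the expression rather than a copy of it: instead of synthesizing the read 1 File object from read 2, it looks read 1 up in the listing and fails loudly when the mate is missing. It assumes each entry carries `class` and `basename`, as CWL File objects do.

```javascript
// Hypothetical helper: pair *_1/*_2 FASTQ entries from a CWL-style directory listing.
// Assumes file names end in _1.fastq / _2.fastq, optionally followed by .gz.
function makeReadPairs(listing) {
  // Index the files by basename so each mate can be looked up directly.
  var byName = {};
  listing.forEach(function (entry) {
    if (entry.class === "File") {
      byName[entry.basename] = entry;
    }
  });

  var pairs = [];
  listing.forEach(function (entry) {
    if (entry.class !== "File") {
      return;
    }
    // Match only read 2 files; capture the sample prefix and the extension.
    var m = /^(.*)_2(\.fastq(?:\.gz)?)$/.exec(entry.basename);
    if (!m) {
      return;
    }
    var mateName = m[1] + "_1" + m[2];
    var mate = byName[mateName];
    if (!mate) {
      throw new Error("No read 1 file found for " + entry.basename);
    }
    pairs.push([mate, entry]);
  });
  return pairs;
}

// Usage example with a made-up listing.
var examplePairs = makeReadPairs([
  { class: "File", basename: "sampleA_1.fastq.gz" },
  { class: "File", basename: "sampleA_2.fastq.gz" }
]);
console.log(examplePairs.length + " pair(s) found");
```

The function sticks to ECMAScript 5.1 features, so it should be usable inside a CWL expression, for instance by placing it in the `expressionLib` field of `InlineJavascriptRequirement` and calling it with `inputs.reads_dir.listing`; I have not tested that wiring here.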
Thanks, Peter