How to implement a "Reduce" method in CWL?

steve · February 9, 2021, 3:25pm

I am familiar with the scatter/gather method available in CWL, but I am trying to figure out whether a “reduce” method is available somewhere or could be implemented somehow. Here is an example of how “reduce” works in languages such as R:

> a  <- c(1,2,3,4,5)
> Reduce(function(x, y){print(sprintf("x=%s, y=%s", x, y)); return(x+y)}, a)
[1] "x=1, y=2"
[1] "x=3, y=3"
[1] "x=6, y=4"
[1] "x=10, y=5"
[1] 15

A real-life example: I have, say, 1,000 .bed files and I want to run something like bedtools intersect on all of them, but intersecting 2 files at a time (or any other number of files), then taking the output of each intersect and using it as one of the inputs to the next iteration.

The closest I have gotten so far is to just do a wrapper around GNU parallel inside an InitialWorkDirRequirement, but this still requires inputting all 1,000+ files into a single CWL step. It would be a lot nicer if I could implement it somehow in the CWL itself.
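For reference, the reduction pattern described above (combine a few inputs, then feed the result into the next round) can be modelled in plain Python. This is only a rough sketch of the control flow, not CWL and not bedtools: `reduce_in_chunks` and `intersect_sets` are hypothetical names, and Python sets stand in for .bed files.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reduce_in_chunks(
    items: Sequence[T],
    combine: Callable[[List[T]], T],
    chunk_size: int = 2,
) -> T:
    """Fold `items` left to right, combining `chunk_size` values per round.

    Each round passes the running result plus the next `chunk_size - 1`
    items to `combine`, so chunk_size=2 behaves like R's Reduce().
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    if not items:
        raise ValueError("items must not be empty")

    acc = items[0]
    pos = 1
    while pos < len(items):
        # The previous result is always the first input of the next round.
        batch = [acc] + list(items[pos:pos + chunk_size - 1])
        acc = combine(batch)
        pos += chunk_size - 1
    return acc


def intersect_sets(batch: List[frozenset]) -> frozenset:
    """Stand-in for `bedtools intersect`: intersect plain Python sets."""
    result = set(batch[0])
    for other in batch[1:]:
        result &= set(other)
    return frozenset(result)
```

With `chunk_size=2` and `sum` as the combiner, `reduce_in_chunks([1, 2, 3, 4, 5], sum)` goes through the same four rounds as the R example and returns 15.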

tetron · February 9, 2021, 6:35pm

Yes, that’s pretty much what you have to do right now. We have been discussing an iteration construct based on the idea of being able to feed back the output of the previous iteration to the next iteration, which sounds like it could do what you want, but that doesn’t exist yet.

mrc · February 10, 2021, 9:36am

Yeah, the only other option at this time would be to write some code to generate the workflow steps you need. The GitHub issue “docs needed for CWL generation programmatically” (Issue #57 in common-workflow-language/cwl-utils) has a sketch of how to do programmatic CWL generation from Python. I suggest writing the workflow steps by hand for 3 iterations to get a feel for the desired result, and then writing some code using cwl-utils to generate the needed steps.
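As a rough illustration of that generation approach, here is a minimal sketch that builds the chained steps as plain Python dictionaries rather than through cwl-utils (so it does not show the cwl-utils API). The tool file `intersect.cwl`, its input names `a` and `b`, and its output name `result` are assumptions made up for the example; substitute whatever your own CommandLineTool defines.

```python
import json
from typing import Any, Dict


def build_reduce_workflow(n_files: int, tool_path: str = "intersect.cwl") -> Dict[str, Any]:
    """Build a CWL Workflow document that chains n_files - 1 intersect steps.

    Assumes a hypothetical tool at `tool_path` with File inputs `a` and `b`
    and a File output `result`.
    """
    if n_files < 2:
        raise ValueError("need at least two files to reduce")

    # One workflow-level File input per .bed file: bed_0, bed_1, ...
    inputs = {f"bed_{i}": "File" for i in range(n_files)}

    steps: Dict[str, Any] = {}
    previous = "bed_0"  # source of the running result
    for i in range(1, n_files):
        step_id = f"intersect_{i}"
        steps[step_id] = {
            "run": tool_path,
            # Feed the previous result and the next file into this step.
            "in": {"a": previous, "b": f"bed_{i}"},
            "out": ["result"],
        }
        previous = f"{step_id}/result"

    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        "inputs": inputs,
        "outputs": {"final": {"type": "File", "outputSource": previous}},
        "steps": steps,
    }


if __name__ == "__main__":
    # JSON is used here only to avoid a YAML dependency.
    print(json.dumps(build_reduce_workflow(4), indent=2))
```

For 4 files this produces three steps, where `intersect_2` takes `intersect_1/result` as its `a` input and the workflow output comes from `intersect_3/result`. The generated document grows linearly with the number of files, and the steps necessarily run one after another because each depends on the previous result.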