Hi all,
I ran my workflow with Toil and wanted to look at the stats to see where I could improve its performance.
So I ran the job with toil-cwl-runner, saving the stats in the jobstore with the --stats option, and then inspected them. I found that most of the time is spent in the job:
ResolveIndirect
which I assume is related to Toil's execution of the CWL workflow itself.
What does it actually mean?
How can I change my workflow to reduce the time spent in (ResolveIndirect)?
You might want to try pinging the Toil developers on gitter https://gitter.im/bd2k-genomics-toil/Lobby
Hi Mattia_Mancini,
That portion of the code evaluates the outputs of each job and resolves toil’s “promise” objects (basically placeholders for the real return values that resolve upon the job’s completion): https://github.com/DataBiosphere/toil/blob/6b69a22b00b0f27619613ccf16f3f12d5f4363b7/src/toil/cwl/cwltoil.py#L194
I’m not sure why your code is spending so much time in this phase, though I imagine if there are many outputs it might be traversing a fairly large dictionary.
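To make the idea concrete, here is a minimal sketch of what resolving placeholders in a nested output structure looks like. This is not Toil's actual code: the `Placeholder` class, the `resolve_nested` function, and the file names are all hypothetical and only illustrate why the cost grows with the size of the structure being traversed.

```python
from typing import Any, Dict


class Placeholder:
    """Hypothetical stand-in for a promise: names a result known only later."""

    def __init__(self, key: str) -> None:
        self.key = key


def resolve_nested(value: Any, results: Dict[str, Any]) -> Any:
    """Recursively replace every Placeholder in a nested dict/list structure."""
    if isinstance(value, Placeholder):
        return results[value.key]
    if isinstance(value, dict):
        return {k: resolve_nested(v, results) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_nested(v, results) for v in value]
    # Plain values (strings, numbers, ...) are returned unchanged.
    return value


# Made-up outputs: two placeholders and one plain value.
outputs = {
    "image": Placeholder("step1"),
    "logs": [Placeholder("step2"), "static.txt"],
}
resolved = resolve_nested(outputs, {"step1": "image.fits", "step2": "run.log"})
print(resolved)
```

Every dict, list, and leaf is visited once, so the work is roughly proportional to the total number of entries in the outputs.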
That part of the code might need to be optimized. I’m not familiar with your workflow however. If you haven’t done so already, it might be worthwhile to experiment with different inputs (nearly empty files vs. large files) to see if they make a noticeable change in where the workflow spends its time.
@tetron Thanks for linking the toil gitter. That is definitely the place for questions like this.
Hi, could it be related to the fact that we have directories and arrays of directories as inputs and outputs of our workflow steps?
If there are a large number of them, then possibly.
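A rough way to gauge "a large number" is to count how many entries a full traversal would have to visit. The sketch below builds CWL-style Directory objects (the `class`, `location`, and `listing` fields come from the CWL spec) and counts nodes. The helper names and paths are made up, and whether Toil actually walks fully expanded listings at this stage is an assumption on my part.

```python
from typing import Any


def count_nodes(value: Any) -> int:
    """Count every dict, list, and leaf value visited in a full traversal."""
    if isinstance(value, dict):
        return 1 + sum(count_nodes(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_nodes(v) for v in value)
    return 1


def make_directory(name: str, n_files: int) -> dict:
    """Build a hypothetical CWL-style Directory object with n_files entries."""
    return {
        "class": "Directory",
        "location": name,
        "listing": [
            {"class": "File", "location": f"{name}/f{i}"} for i in range(n_files)
        ],
    }


small = [make_directory("d0", 2)]
large = [make_directory(f"d{i}", 100) for i in range(50)]

print("small:", count_nodes(small))
print("large:", count_nodes(large))
```

Running something like this on the real input and output objects of a step would show whether the structure is big enough to matter.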
How long are we talking here? I’m used to that step being very quick. Unless the other steps are also quite quick, I wouldn’t expect it to be the slowest.
Indeed, I tried again on the same cluster and on a different cluster and I am not able to reproduce the problem anymore. Perhaps I misread the printout.
Sorry for the trouble.