Scatter workflow: Calling CWL actors sequentially rather than when sufficient input is present

Levi_Craft · September 4, 2021, 1:56pm

In the code below, once the “protonate” step is complete, both “depositProtonated” and “predictChemShift” have the input they need to run their CWL actors. I am passing this scatter workflow several thousand inputs, and instead of running “depositProtonated” and “predictChemShift” side by side, it completes all several thousand “depositProtonated” jobs before calling “predictChemShift”.

How do I set up this workflow so that it will run both “depositProtonated” & “predictChemShift” simultaneously?

cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement
  - class: MultipleInputFeatureRequirement

inputs:
  database_file: File
  cfg_file: File
  genome_id: string[]
  pH_value: float?
  temp: float?
  cs_predictor: string
  task_insertCS: string
  outputFormat: string
  shift_predict_script: File
  pdb_file: File[]
  depo_task: string
  protonate_method: File
  reduce_dict: File

outputs: []

steps:
  protonate:
    run: addHydrogen_reduce.cwl
    scatter: [pdbFile]
    scatterMethod: dotproduct
    in:
      pdbFile: pdb_file
      protonateMethod: protonate_method
      reduceDict: reduce_dict
    out:
      - pdbH_File
  depositProtonated:
    run: depoAFH.cwl
    scatter: [AFH_file, genomeID]
    scatterMethod: dotproduct
    in:
      AFH_file: protonate/pdbH_File
      databaseFile: database_file
      depoTask: depo_task
      cfgFile: cfg_file
      genomeID: genome_id
    out: []
  predictChemShift:
    run: shiftx2Predict.cwl
    scatter: [protonated_pdb]
    scatterMethod: dotproduct
    in:
      protonated_pdb: protonate/pdbH_File
      pH: pH_value
      temperature: temp
      format: outputFormat
      shiftPredict_script: shift_predict_script
    out:
      - CSFile
  populateDatabase:
    run: populate_afDatabase.cwl
    scatter: [AFH_file, genomeID, chemicalShifts]
    scatterMethod: dotproduct
    in:
      AFH_file: protonate/pdbH_File
      pyInsertCS: database_file
      pH: pH_value
      temperature: temp
      csPredictor: cs_predictor
      chemicalShifts: predictChemShift/CSFile
      genomeID: genome_id
      cfgFile: cfg_file
      taskInsertCS: task_insertCS
    out: []

mrc · September 6, 2021, 8:22am

Hello @Levi_Craft ,

You didn’t say which CWL runner you are using, so I’ll give you some generic advice. Most CWL runners will not start a scatter that depends on the result of another scattered step until that first scattered step has finished entirely. Your depositProtonated and predictChemShift steps are both scattered steps, but they are siblings (neither depends, directly or indirectly, on the results of the other), so they shouldn’t block each other.

This looks like a scheduling bug in your CWL runner, so please report the issue directly to its developers.

In the meantime, I have a suggestion for a workaround that should also shorten the time it takes to process a single pdb_file: refactor your workflow to be centered on a single pdb_file, then wrap it in another workflow that scatters over pdb_file, instead of scattering over pdb_file (or its derivative pdbH_File) separately in each step.
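Here is a rough sketch of that refactoring, not a tested implementation. It assumes the four tool descriptions (addHydrogen_reduce.cwl, depoAFH.cwl, shiftx2Predict.cwl, populate_afDatabase.cwl) stay exactly as they are, and that pdb_file and genome_id have the same length and order, as your existing dotproduct scatters already require. The filenames per_pdb.cwl and scatter_all.cwl and the step name perPdb are made up for illustration. The inner workflow handles one structure and contains no scatter at all:

```yaml
# per_pdb.cwl (hypothetical filename): processes exactly one PDB file
cwlVersion: v1.0
class: Workflow

inputs:
  pdb_file: File          # a single file, not File[]
  genome_id: string       # a single ID, not string[]
  database_file: File
  cfg_file: File
  pH_value: float?
  temp: float?
  cs_predictor: string
  task_insertCS: string
  outputFormat: string
  shift_predict_script: File
  depo_task: string
  protonate_method: File
  reduce_dict: File

outputs: []

steps:
  protonate:
    run: addHydrogen_reduce.cwl
    in:
      pdbFile: pdb_file
      protonateMethod: protonate_method
      reduceDict: reduce_dict
    out: [pdbH_File]
  # The next two steps depend only on protonate, so they are independent siblings
  depositProtonated:
    run: depoAFH.cwl
    in:
      AFH_file: protonate/pdbH_File
      databaseFile: database_file
      depoTask: depo_task
      cfgFile: cfg_file
      genomeID: genome_id
    out: []
  predictChemShift:
    run: shiftx2Predict.cwl
    in:
      protonated_pdb: protonate/pdbH_File
      pH: pH_value
      temperature: temp
      format: outputFormat
      shiftPredict_script: shift_predict_script
    out: [CSFile]
  populateDatabase:
    run: populate_afDatabase.cwl
    in:
      AFH_file: protonate/pdbH_File
      pyInsertCS: database_file
      pH: pH_value
      temperature: temp
      csPredictor: cs_predictor
      chemicalShifts: predictChemShift/CSFile
      genomeID: genome_id
      cfgFile: cfg_file
      taskInsertCS: task_insertCS
    out: []
```

The outer workflow keeps your original input signature and holds the only scatter, over pdb_file and genome_id together:

```yaml
# scatter_all.cwl (hypothetical filename): one scatter around the whole per-file workflow
cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement
  - class: SubworkflowFeatureRequirement   # needed because the step runs a Workflow

inputs:
  database_file: File
  cfg_file: File
  genome_id: string[]
  pH_value: float?
  temp: float?
  cs_predictor: string
  task_insertCS: string
  outputFormat: string
  shift_predict_script: File
  pdb_file: File[]
  depo_task: string
  protonate_method: File
  reduce_dict: File

outputs: []

steps:
  perPdb:
    run: per_pdb.cwl
    scatter: [pdb_file, genome_id]
    scatterMethod: dotproduct
    in:
      pdb_file: pdb_file
      genome_id: genome_id
      database_file: database_file
      cfg_file: cfg_file
      pH_value: pH_value
      temp: temp
      cs_predictor: cs_predictor
      task_insertCS: task_insertCS
      outputFormat: outputFormat
      shift_predict_script: shift_predict_script
      depo_task: depo_task
      protonate_method: protonate_method
      reduce_dict: reduce_dict
    out: []
```

With this layout each pdb_file moves through all four steps as its own unit, so one file's results no longer wait for every other file to clear a given step. How many of those units run at the same time is still up to the runner.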

Let us know if that helps!

Cheers,