How to fail-fast during parallel scatter


I have a workflow that I execute with cwltool --parallel. The workflow has a scatter step that is both time and memory intensive. Sometimes during early runs, before users have configured their environment properly, the OS's resource management (e.g. the OOM killer) kills some of these memory-intensive processes. The workflow then finishes with a permanentFail status, but cwltool doesn't fail fast as I would expect. Instead it waits (sometimes for a long time) for the sibling scatter jobs to complete before exiting. That waiting doesn't help in my case, because the outputs I actually care about come from downstream steps that consume the scatter outputs, and those steps can never run once part of the scatter has failed.

Is this expected behavior, and is there anything I can do to achieve a fail-fast scatter? A demonstration workflow is below; make sure to run it with the --parallel flag.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

requirements:
  - class: ScatterFeatureRequirement

inputs:
  sleeptime:
    type: int[]
    default: [ 11, 11, 11 ]

outputs: { }

steps:
  # Step names below are reconstructed; the original snippet was garbled.
  scatter_sleep:
    in:
      sleeptime: sleeptime
    out: [ ]
    scatter: sleeptime
    run:
      class: CommandLineTool
      baseCommand: sleep
      inputs:
        sleeptime:
          type: int
          inputBinding: { position: 1 }
      outputs: { }

  kill_random_sleep:
    in: { }
    out: [ ]
    run:
      class: CommandLineTool
      baseCommand: ['bash', '-c']
      arguments:
        - |
          # Wait 1 second for scatter to spin up and select a random sleep process to kill
          sleep 1
          ps -ef | grep 'sleep 11' | grep -v grep | awk '{print $2}' | shuf | head -n 1 | xargs kill -9
      inputs: { }
      outputs: { }

Good question. This is not yet built into cwltool, but the pieces are there:

So we’d need to add a call to main._terminate_processes() after task failure. Though this is a bit brutal! And the error messages produced may be a bit confusing.

Probably we’d want a new command-line option to enable this, maybe by enhancing --on-error to add abort or similar as one of the choices.
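To make the idea concrete, here is a minimal, self-contained sketch of what "terminate the siblings after the first task failure" means, outside of cwltool. The function name and structure are purely illustrative (this is not cwltool's actual executor code); the loop over still-running processes plays the role that a call to main._terminate_processes() would play.

```python
# Standalone illustration of fail-fast parallel execution: when one job
# fails, kill the remaining sibling subprocesses instead of waiting.
# This is a hypothetical sketch, not actual cwltool code.
import subprocess
import threading

def run_jobs_fail_fast(commands):
    """Run commands in parallel; kill the siblings as soon as one fails."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    failed = threading.Event()

    def watch(proc):
        if proc.wait() != 0:
            failed.set()
            # Analogue of main._terminate_processes(): kill the siblings.
            for p in procs:
                if p.poll() is None:
                    p.kill()

    threads = [threading.Thread(target=watch, args=(p,)) for p in procs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return "permanentFail" if failed.is_set() else "success"
```

With this, run_jobs_fail_fast([["sleep", "30"], ["sh", "-c", "exit 1"]]) returns quickly instead of waiting out the full sleep, which is the behavior the question is after.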


@mrc thank you for your response. I have a minimal draft implementation of what you suggested. When using MultithreadedJobExecutor, all subprocesses are eventually killed, but _terminate_processes() does this one at a time, with a 10-second timeout per process. Terminal output also shows a SIGKILL warning for each of these processes (i.e. the potentially confusing error messages you mentioned).

I was wondering if I could ask your thoughts on some rough ideas I had for an alternative approach:

  1. Introduce a new exception Abort that would be raised from JobBase._execute() when processStatus != "success" and runtimeContext.on_error == "abort".
    • The exception would bubble up through workflow_job, workflow, task_queue, and executors for any necessary cleanup tasks. This would also allow for greater control over the error messages produced.
  2. It would be useful to be able to interrupt CommandLineTool jobs that are currently running. Add a threading.Event kill switch to RuntimeContext. Then in JobBase.process_monitor() add another timer daemon to poll the kill switch value. If it’s set, then the target subprocess.Popen is already in scope to call .kill().
    • When a job’s subprocess indicates failure, it calls runtimeContext.kill_switch.set() to notify the other jobs before raising Abort.
    • The SIGKILL warning can be suppressed for subsequent kill-switched jobs because the switch’s value is known by all workers when their subprocess returns.
    • This eliminates the need to call main._terminate_processes() because the jobs take care of themselves.
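The ideas above can be sketched in a few dozen lines. Everything here is hypothetical (the class and function names mirror the proposal, not cwltool's actual API): a shared threading.Event on the context, a monitor thread per job that polls the switch and kills its subprocess, and an Abort raised only by the job that trips the switch first, so kill-switched siblings exit quietly.

```python
# Rough sketch of the kill-switch proposal. Names (Abort, kill_switch,
# RuntimeContextSketch) mirror the ideas above but are hypothetical,
# not cwltool's real API.
import subprocess
import threading

class Abort(Exception):
    """Raised when a job fails and on_error == "abort"."""

class RuntimeContextSketch:
    def __init__(self):
        self.on_error = "abort"
        self.kill_switch = threading.Event()

def execute_job(ctx, cmd, poll_interval=0.25):
    proc = subprocess.Popen(cmd)

    def monitor():
        # Analogue of the extra timer daemon in process_monitor():
        # poll the kill switch; if another job tripped it, kill ours.
        while proc.poll() is None:
            if ctx.kill_switch.wait(timeout=poll_interval):
                proc.kill()
                return

    threading.Thread(target=monitor, daemon=True).start()
    rc = proc.wait()
    if rc != 0 and ctx.on_error == "abort" and not ctx.kill_switch.is_set():
        ctx.kill_switch.set()  # notify sibling jobs before raising
        raise Abort(f"job {cmd!r} failed with code {rc}")
    return rc
```

Note that a job killed via the switch returns its (nonzero) exit code without raising, which is the "suppress the SIGKILL warning for subsequent kill-switched jobs" behavior described above.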

I’m not sure if this contradicts the intention of RuntimeContext or if you would consider this bad practice. I’m open to any and all feedback.

Edit: I wasn’t able to reproduce the note I made about SingleJobExecutor so I suspect I was accidentally running with --on-error continue. I’ve removed the note.


@tate I’d have to see the pull request, but your alternative approach sounds pretty reasonable. This is an appropriate use of RuntimeContext: its purpose is to provide a place to store information that needs to be tracked across an entire workflow execution, without introducing global variables. When cwltool alters some aspect of RuntimeContext for a branch of the workflow, it makes a shallow copy, so I expect the copy would continue to point to the same “kill switch” object.
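The shallow-copy point is easy to check in isolation. The Ctx class below is just a stand-in for RuntimeContext: after copy.copy(), rebinding a plain field stays local to one copy, while the shared Event object is visible through both.

```python
# Quick check of the shallow-copy behavior described above.
# Ctx is a hypothetical stand-in for RuntimeContext.
import copy
import threading

class Ctx:
    def __init__(self):
        self.on_error = "abort"
        self.kill_switch = threading.Event()

    def copy(self):
        # Mimics a per-branch shallow copy of the context.
        return copy.copy(self)

parent = Ctx()
branch = parent.copy()
branch.on_error = "continue"  # rebinding a field is local to the copy...
parent.kill_switch.set()      # ...but the shared Event is seen by both
```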


@tetron thank you for your feedback on this. I’ve opened a PR as requested:

Regarding shallow copies of RuntimeContext, I did find that threading.Event can’t be pickled (which makes sense), so as long as there are no plans to make cwltool multiprocess, we should be good.
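For the record, the pickling limitation comes from the lock inside threading.Event; pickling it raises TypeError. The helper below is just for demonstration. If cross-process support were ever needed, multiprocessing.Event would be the usual substitute, though passing it around has its own constraints.

```python
# threading.Event wraps a lock, and _thread.lock objects can't be pickled,
# so a kill switch stored on a context would not survive a pickling step.
import pickle
import threading

def picklable(obj):
    """Return True if obj survives pickle.dumps, False on TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```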