How to fail-fast during parallel scatter

tate · January 29, 2024, 4:03pm

Hello,

I have a workflow that I execute using cwltool --parallel. The workflow has a scatter step that is both time and memory intensive. Sometimes during early runs of the workflow, before users have their environment configured properly, OS resource management will kill some of these memory-intensive processes. This means the workflow completes with PermanentFail, but cwltool doesn’t fail fast as I would expect. Instead, it waits (sometimes for a very long time) for the sibling scatter processes to complete before exiting. Letting the siblings finish doesn’t help in my case, because what I’m interested in are the downstream outputs that depend on the complete set of scatter outputs.

Is this expected behavior and is there anything I can do to achieve a fail-fast scatter? Demonstration workflow below. Make sure to run with the --parallel flag.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

inputs:
  sleeptime:
    type: int[]
    default: [ 11, 11, 11 ]
outputs: { }
requirements:
  - class: ScatterFeatureRequirement

steps:
  scatterstep:
    in:
      sleeptime: sleeptime
    out: [ ]
    scatter: sleeptime
    run:
      class: CommandLineTool
      inputs:
        sleeptime:
          type: int
          inputBinding: { position: 1 }
      outputs: { }
      baseCommand: sleep
  kill:
    in: { }
    out: [ ]
    run:
      class: CommandLineTool
      baseCommand: ['bash', '-c']
      arguments:
        - |
          # Wait 1 second for scatter to spin up and select a random sleep process to kill
          sleep 1
          ps -ef | grep 'sleep 11' | grep -v grep | awk '{print $2}' | shuf | head -n 1 | xargs kill -9
      inputs: { }
      outputs: { }

mrc · January 29, 2024, 4:48pm

Good question. This is not yet built into cwltool, but the pieces are there:

- There is a list of Python subprocesses to kill
- All CommandLineTool subprocesses are added to that list

So we’d need to add a call to main._terminate_processes() after task failure. Though this is a bit brutal! And the error messages produced may be a bit confusing.

Probably we’d want a new command-line option to enable this, maybe by enhancing --on-error to add abort or similar as one of the choices.

tate · February 1, 2024, 7:46pm

@mrc thank you for your response. I have a minimal draft implementation of what you suggested. When using MultithreadedJobExecutor, all subprocesses are eventually killed, but _terminate_processes() does this one at a time with a 10-second timeout per process. Terminal output also shows SIGKILL for each of these processes (i.e. the potentially confusing error messages you mentioned).

I was wondering if I could ask your thoughts on some rough ideas I had for an alternative approach:

1. Introduce a new exception Abort that would be raised from JobBase._execute() when processStatus != "success" and runtimeContext.on_error == "abort".
   - The exception would bubble up through workflow_job, workflow, task_queue, and executors for any necessary cleanup tasks. This would also allow for greater control over the error messages produced.
2. It would be useful to be able to interrupt CommandLineTool jobs that are currently running. Add a threading.Event kill switch to RuntimeContext. Then in JobBase.process_monitor() add another timer daemon to poll the kill switch value. If it’s set, then the target subprocess.Popen is already in scope to call .kill().
   - When a job’s subprocess indicates failure, it calls runtimeContext.kill_switch.set() to notify the other jobs before raising Abort.
   - The SIGKILL warning can be suppressed for subsequent kill-switched jobs because the switch’s value is known by all workers when their subprocess returns.
   - This eliminates the need to call main._terminate_processes() because the jobs take care of themselves.
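A rough, standalone sketch of the kill-switch idea is shown below. It does not use cwltool at all: `run_job`, `run_scatter`, and this `Abort` class are hypothetical stand-ins for the pieces named above, and a plain watcher thread takes the place of the proposed timer daemon. The check for "first failure" is deliberately simple and is not atomic, so two jobs failing at the same instant could both raise.

```python
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor


class Abort(Exception):
    """Hypothetical exception signalling that the whole run should stop."""


def run_job(cmd: list[str], kill_switch: threading.Event, poll: float = 0.1) -> int:
    """Run cmd, killing it early if another job has set kill_switch."""
    proc = subprocess.Popen(cmd)
    done = threading.Event()

    def monitor() -> None:
        # Poll until this job finishes or some other job trips the switch.
        while not done.is_set():
            if kill_switch.wait(timeout=poll):
                if proc.poll() is None:
                    proc.kill()
                return

    watcher = threading.Thread(target=monitor, daemon=True)
    watcher.start()
    returncode = proc.wait()
    done.set()
    watcher.join()

    if returncode != 0:
        # If the switch is already set, this job was most likely killed on
        # purpose, so it stays quiet instead of reporting a second error.
        first_failure = not kill_switch.is_set()
        kill_switch.set()
        if first_failure:
            raise Abort(f"job failed with exit code {returncode}: {cmd}")
    return returncode


def run_scatter(cmds: list[list[str]]) -> list:
    """Run all commands in parallel, sharing one kill switch between them."""
    kill_switch = threading.Event()
    results = []
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        futures = [pool.submit(run_job, cmd, kill_switch) for cmd in cmds]
        for future in futures:
            try:
                results.append(future.result())
            except Abort as exc:
                results.append(exc)
    return results
```

In this sketch only the job that fails first raises `Abort`; its siblings are killed by their own watchers and return a non-zero exit code silently, which corresponds to suppressing the SIGKILL warning for kill-switched jobs.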

I’m not sure if this contradicts the intention of RuntimeContext or if you would consider this bad practice. I’m open to any and all feedback.

Edit: I wasn’t able to reproduce the note I made about SingleJobExecutor so I suspect I was accidentally running with --on-error continue. I’ve removed the note.

tetron · February 2, 2024, 2:53pm

@tate I’d have to see the pull request, but your alternative approach sounds pretty reasonable. This is an appropriate use of RuntimeContext. Its purpose is to provide a place to store information that needs to be tracked across an entire workflow execution, without introducing global variables. When some aspect of RuntimeContext is altered for a branch of the workflow, cwltool makes a shallow copy, so I expect the copy would continue to point to the same “kill switch” object.
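The shallow-copy behaviour can be checked in isolation with a small sketch. `Context` here is a hypothetical stand-in with two made-up fields, not the real RuntimeContext, and `copy.copy` is assumed to approximate how the copy is made.

```python
import copy
import threading


class Context:
    """Hypothetical stand-in for RuntimeContext with only two fields."""

    def __init__(self) -> None:
        self.on_error = "stop"
        self.kill_switch = threading.Event()


parent = Context()
child = copy.copy(parent)   # shallow copy: attribute values are shared references
child.on_error = "kill"     # rebinding an attribute changes only the copy

child.kill_switch.set()     # mutating the shared Event is visible through both
```

After this runs, `parent.kill_switch is child.kill_switch` holds and `parent.kill_switch.is_set()` is true, while `parent.on_error` is still `"stop"`.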

tate · February 4, 2024, 12:49am

@tetron thank you for your feedback on this. I’ve opened a PR as requested:

github.com/common-workflow-language/cwltool

Adding new choice to --on-error

common-workflow-language:main ← AlexTate:on-error-abort

opened 12:41AM - 04 Feb 24 UTC

AlexTate

+100 -15

### Summary
This pull request introduces the new choice `kill` to the `--on-error` parameter.

### Motivation
There currently isn't a way to have cwltool immediately stop parallel jobs when one of them fails. One might expect `--on-error stop` to accomplish this, but the help string is specific and accurate: "do not submit any more steps". Since scatter and subworkflow are treated as single "steps" within the parent workflow, this means cwltool is not wrong to wait for the rest of the scatter jobs to finish with `--on-error stop`. However, sometimes individual scatter jobs take a long time to complete, so if one of them fails early on, cwltool might wait a very long time for the other scatter jobs to complete before terminating the workflow. With `--on-error kill`, all running jobs are quickly notified and self-terminate upon one job's failure.

### Forum Post
https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868

Regarding shallow copies of RuntimeContext, I did find that threading.Event can’t be pickled (which makes sense), so as long as there aren’t any plans to make cwltool multiprocess, we should be good.
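The pickling limitation is easy to confirm with the standard library alone. In this small check, `can_pickle` is a hypothetical helper written for the illustration.

```python
import pickle
import threading


def can_pickle(obj: object) -> bool:
    """Return True if obj survives pickle.dumps, False if pickling is refused."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


event_ok = can_pickle(threading.Event())      # an Event wraps a lock, which cannot be pickled
plain_ok = can_pickle({"on_error": "kill"})   # plain data pickles without trouble
```

Here `event_ok` comes out false and `plain_ok` true, so a context object carrying a `threading.Event` could not be sent to a worker process through pickle as-is.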