Collect files/folders even if some steps of workflow fail

hcrat · November 10, 2020, 3:19pm

Hi all,
I am in the process of creating a pipeline that runs several steps. For each step the pipeline collects a log file. At the end of the pipeline I have a step (named merge_logs) that collects all the logs and puts the files in a single folder.

My problem is that if some of the steps fail the merge_logs step is not run. Even if I mark the inputs as being potentially null the step is never executed.

Here is a simplified example that illustrates the problem:

cwlVersion: v1.1
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  InlineJavascriptRequirement: {}
  StepInputExpressionRequirement: {}
  MultipleInputFeatureRequirement: {}

inputs: []

outputs:
  mmm:
    type: Directory
    outputSource: merge_output/output

steps:
  step1:
    in: []
    out: [output]
    run:
      class: CommandLineTool
      requirements:
        - class: ShellCommandRequirement
      arguments:
        - shellQuote: false
          valueFrom: |
            echo "step1" > step1.txt
            false
      inputs: []
      outputs:
        output:
          type: File?
          outputBinding:
            glob: "step1.txt"
  step2:
    in: []
    out: [output]
    run:
      class: CommandLineTool
      baseCommand: [echo]
      stdout: "step2.txt"
      inputs: []
      outputs:
        output:
          type: stdout
      arguments:
        - valueFrom: "step2"
  merge_output:
    in:
      # in1: step1/output
      # in2: step2/output
      files:
        source:
          - step1/output
          - step2/output
        linkMerge: merge_flattened
    out: [output]
    run:
      class: ExpressionTool
      inputs:
        # in1: File?
        # in2: File?
        files:
          type:
            - type: array
              items:
                - "null"
                - "File"
            - "null"
      outputs:
        output:
          type: Directory
      expression: |
        ${
          return {
            output: {
              class: "Directory",
              basename: "files",
              listing: inputs.files
           // listing: [inputs.in1, inputs.in2]
            }
          };
        }

I’ve tried both the uncommented and the commented versions, but they behave the same.

I’ve also tried another route: creating a common folder in which to place the logs. Here is the workflow:

cwlVersion: v1.1
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  InlineJavascriptRequirement: {}
  StepInputExpressionRequirement: {}
  MultipleInputFeatureRequirement: {}
  InitialWorkDirRequirement:
    listing: |
      ${
        return [
          {
            class: "Directory",
            basename: "logs",
            writable: true,
            listing: []
          }
        ];
      }

inputs: []

outputs:
  mmm:
    type: Directory
    outputSource: merge_output/output

steps:
  step1:
    in: []
    out: [output]
    run:
      class: CommandLineTool
      requirements:
        - class: ShellCommandRequirement
      arguments:
        - shellQuote: false
          valueFrom: |
            echo "step1" > logs/step1.txt
            false
      inputs: []
      outputs:
        output:
          type: File?
          outputBinding:
            glob: "step1.txt"
  step2:
    in: []
    out: [output]
    run:
      class: CommandLineTool
      baseCommand: [echo]
      stdout: "step2.txt"
      inputs: []
      outputs:
        output:
          type: stdout
      arguments:
        - valueFrom: "step2"
  merge_output:
    in: []
    out: [output]
    run:
      class: ExpressionTool
      inputs: []
      outputs:
        output:
          type: Directory
      expression: |
        ${
          return {
            output: {
              class: "Directory",
              basename: "files",
              listing: ???
            }
          };
        }

In this case, however, I don’t know how to access the logs folder, either in the merge_output step (what should I put in place of the ??? placeholder?) or in a workflow output.

I like the former approach much better because it doesn’t require adapting the steps or making them aware of the final folder structure.

Do you have any suggestions?

Thank you very much!

mrc · November 10, 2020, 5:26pm

Hello @hcrat. Which workflow system are you using? Some do allow access to results even if a workflow fails. If you are using cwltool for testing and development purposes, then I recommend using --cachedir path/to/a/directory, where the results of every step will be saved (and reused, if you pass the same --cachedir path the next time).

hcrat · November 11, 2020, 9:05am

Hi @mrc,
thanks for your suggestion. For now I’m using cwltool, and I was already aware of the cachedir option, which is indeed quite useful.

What I am looking for, however, is a way that works in every implementation.

That said, I understand from your suggestion that what I’m asking for is not possible in the standard as it stands. I’ve even tried the --on-error continue option, but any step that references a previously failed step is still not executed. Of course in most cases this makes total sense, but in some instances (such as mine) running the step anyway would be useful. Can you confirm that it is indeed impossible?

One idea for adding that capability would be to change the type system. What I mean is adding something akin to the Haskell Either type. I’m not at all a Haskell expert, but I think this type could fit well here.

With such a type, a step could signal that it wants to be executed even if the step it references has failed.

If, for example, I specify the type Either<string, File> (using a C++-ish syntax), this means that I want the step to run regardless of the exit state of the referenced step. Moreover, I can express that in case of failure I expect to receive a string and in case of success a File. Maybe this could be achieved by using the already present Records in some clever way.
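
To make the idea concrete, here is a minimal sketch of how a consumer of such a value could look using the union types CWL already has. It is only an illustration: the names result and ok are hypothetical, and nothing in the current standard would actually deliver the string when the upstream step fails. That part is exactly what would have to be added.

```yaml
cwlVersion: v1.1
class: ExpressionTool

requirements:
  InlineJavascriptRequirement: {}

inputs:
  # Hypothetical "Either<string, File>": an error message or a log file.
  result:
    type:
      - string
      - File

outputs:
  ok:
    type: boolean

expression: |
  ${
    // A File value arrives as an object with class "File";
    // the failure case is assumed to arrive as a plain string.
    var ok = (typeof inputs.result === "object") && inputs.result.class === "File";
    return { ok: ok };
  }
```

Passing either a string or a File for result should be accepted by this tool today; what is missing is a way to have the runner fill in the string on failure.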

Thank you!

mrc · November 12, 2020, 8:36am

That is correct: the CWL specification has no construct to say “this step is allowed to fail for any reason”.
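
A narrower option does exist when the failing exit codes are known in advance: a CommandLineTool can list them in successCodes, so the runner treats them as success and still collects the outputs. The sketch below reuses step1 from the example above and assumes that exit code 1 is the only failure worth tolerating; it does not cover crashes, signals, or any other unexpected failure.

```yaml
cwlVersion: v1.1
class: CommandLineTool

requirements:
  ShellCommandRequirement: {}

# Assumption: exit code 1 is the only failure we want to tolerate.
successCodes: [0, 1]

arguments:
  - shellQuote: false
    valueFrom: |
      echo "step1" > step1.txt
      false

inputs: []

outputs:
  output:
    type: File?
    outputBinding:
      glob: "step1.txt"
```

The trade-off is that downstream steps can no longer tell from the step status that something went wrong, so the log itself has to carry that information.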

As for your ultimate goal of getting diagnostic information from the execution of a CWL Workflow, the standard does not require anything. However, there is the proposed (but unofficial) CWLProv, which can include all the information you’ve asked for: partial executions, intermediate results, system logs, resources used, and more: https://w3id.org/cwl/prov/0.6.0

Currently it is only implemented in cwltool, but we hope to finish adding it to toil-cwl-runner some day (I don’t know of anyone who has explicitly volunteered or is tasked to do that, so I can’t promise when).

I think CWLProv, or a future version of it, will eventually be widely implemented, especially as Workflow RO-Crates and WorkflowHub.eu become more popular.

In the meantime, I believe every CWL runner has its own way to access the diagnostic information you want. Toil has a utility to inspect the job and file stores, Arvados stores extensive information about failed runs, and I remember seeing something similar for the Seven Bridges platforms. If there is another platform that you’d like to get this information for, then I would suggest making a new topic here about that or contacting them directly.

I wish CWLProv were already widely implemented; I want this too!

As for some sort of marker or other way to explicitly allow a step to fail, you could make a proposal at github.com/common-workflow-language/common-workflow-language/issues; pickValue from CWL v1.2 could then be used to handle the variation.
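
For reference, this is roughly how pickValue handles a missing value today. In CWL v1.2 it filters the null produced by a step that was skipped through a when condition; it does not by itself make a failed step tolerable. The sketch below is a minimal example under that assumption, and the run_step1 input and the logs output are names invented for illustration.

```yaml
cwlVersion: v1.2
class: Workflow

requirements:
  MultipleInputFeatureRequirement: {}

inputs:
  # Hypothetical switch; with the default, step1 is skipped.
  run_step1:
    type: boolean
    default: false

outputs:
  logs:
    type: File[]
    outputSource:
      - step1/output
      - step2/output
    # Drop the null contributed by any skipped step.
    pickValue: all_non_null

steps:
  step1:
    when: $(inputs.run_step1)
    in:
      run_step1: run_step1
    out: [output]
    run:
      class: CommandLineTool
      baseCommand: [echo, step1]
      stdout: step1.txt
      inputs:
        run_step1: boolean
      outputs:
        output:
          type: stdout
  step2:
    in: []
    out: [output]
    run:
      class: CommandLineTool
      baseCommand: [echo, step2]
      stdout: step2.txt
      inputs: []
      outputs:
        output:
          type: stdout
```

With the default input, step1 is skipped and logs should contain only step2.txt. A proposal for allowed-to-fail steps could reuse the same mechanism by having a failed step contribute null instead of aborting the workflow.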