Loop requirement implementation

GlassOfWhiskey · June 9, 2022, 12:26pm

Hi channel,
I am implementing the Loop CWL extension proposal in StreamFlow. I have a couple of questions regarding its semantics. The first is: how should a step behave when both `when` and `loop_when` are present? This is directly related to a second question: should `loop_when` be evaluated BEFORE or AFTER the first iteration (giving a while-style or a do-while-style loop, respectively)? Here are the options I am considering:

First solution: keep both `when` and `loop_when`, evaluate `loop_when` at the end of the loop.

In this case we can have

steps:
  loop:
    run:
      ...
    in:
      i1: i1
    out: [o1]
    when: $(inputs.i1 > 0)
    requirements:
      cwltool:Loop:
        loop_when: $(outputs.o1 < 5)
        loop:
          i1: o1
        outputMethod: all
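
To make the evaluation order of this first option concrete, here is a minimal Python sketch of how an engine might run such a step. It is not taken from StreamFlow or cwltool: the function `run_first_solution`, the `increment` stand-in for the step's `run`, and the choice to return an empty list when `when` is false are all assumptions made for illustration, as is evaluating `when` only once, before the first iteration.

```python
from typing import Callable, Dict, List

Values = Dict[str, int]


def run_first_solution(
    inputs: Values,
    step: Callable[[Values], Values],
    when: Callable[[Values], bool],
    loop_when: Callable[[Values, Values], bool],
    loop: Dict[str, str],
    max_iterations: int = 100,
) -> List[Values]:
    """Guard the step with `when`, then iterate do-while style.

    Collects every iteration's outputs, mimicking `outputMethod: all`.
    """
    # Assumption: `when` is checked once, before the first iteration.
    if not when(inputs):
        return []  # assumption: a skipped step yields no results
    results: List[Values] = []
    for _ in range(max_iterations):  # safety bound, not part of the proposal
        outputs = step(inputs)
        results.append(outputs)
        # `loop_when` is evaluated at the END of the iteration,
        # so it can see both the inputs and the fresh outputs.
        if not loop_when(inputs, outputs):
            break
        # `loop` maps output names onto the next iteration's inputs.
        inputs = {**inputs, **{dst: outputs[src] for dst, src in loop.items()}}
    return results


def increment(inputs: Values) -> Values:
    """Hypothetical stand-in for the step's `run`: o1 = i1 + 1."""
    return {"o1": inputs["i1"] + 1}


all_outputs = run_first_solution(
    {"i1": 1},
    increment,
    when=lambda i: i["i1"] > 0,
    loop_when=lambda i, o: o["o1"] < 5,
    loop={"i1": "o1"},
)
print([o["o1"] for o in all_outputs])  # prints [2, 3, 4, 5]
```

With `i1: 0` the `when` guard fails and, under the assumptions above, the step body never runs.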

Second solution: disable `when`, evaluate `loop_when` at the beginning of the loop.

In this case we can write the following (note that null outputs have to be handled explicitly during the first iteration):

steps:
  loop:
    run:
      ...
    in:
      i1: i1
    out: [o1]
    requirements:
      cwltool:Loop:
        loop_when: ${return inputs.i1 > 0 && (outputs.o1 === null || outputs.o1 < 5)}
        loop:
          i1: o1
        outputMethod: all

Third solution: disable `when`, evaluate `loop_when` at the end of the loop.

This treats loops as do-while style by default; skipping the first iteration requires a separate wrapping step. So:

steps:
  selection:
    run:
      loop:
        run:
          ...
        in:
          i1: i1
        out: [o1]
        requirements:
          cwltool:Loop:
            loop_when: $(outputs.o1 < 5)
            loop:
              i1: o1
            outputMethod: all
    in:
      i1: i1
    when: $(inputs.i1 > 0)
    out: [o1]

Fourth solution (the one I think would be better): disable `when`, remove `outputs` from `loop_when`

This changes my proposed implementation by disallowing the outputs field in the loop_when condition. This also means that the condition must be evaluated AFTER the application of the output-to-input transformations expressed in the loop field. However, it is always possible to create a special input field to be evaluated in the condition by means of the valueFrom field. Also, this does not require changes in the existing when clause. In this case we have:

steps:
  loop:
    run:
      ...
    in:
      i1: i1
    out: [o1]
    requirements:
      cwltool:Loop:
        loop_when: ${return inputs.i1 > 0 && inputs.i1 < 5}
        loop:
          i1: o1
        outputMethod: all

GlassOfWhiskey · June 9, 2022, 12:49pm

Note that in the last case, an iterative solver loop such as

optimization:
  in:
    a: a
    threshold: threshold
  when: ${return (outputs.a - inputs.a) > inputs.threshold}
  run: optimize.cwl
  out: [a]
  loop:
    a: a
  outputMethod: last

should be written as (for example)

optimization:
  in:
    a: a
    prev_a:
      valueFrom: ${return inputs.a - (2 * inputs.threshold)}
    threshold: threshold
  when: ${return (inputs.a - inputs.prev_a) > inputs.threshold}
  run: optimize.cwl
  out: [a]
  loop:
    a: a
    prev_a:
      valueFrom: $(inputs.a)
  outputMethod: last
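
The rewritten solver can be traced with a small Python simulation. This is only a sketch under assumptions: `optimize` is a hypothetical stand-in for `optimize.cwl` (it moves `a` halfway towards 10), `run_optimization` is an illustrative name, and the condition is assumed to be checked before every iteration using inputs only, as in the fourth solution.

```python
from typing import Optional


def optimize(a: float) -> float:
    """Hypothetical stand-in for optimize.cwl: move halfway towards 10."""
    return a + (10.0 - a) / 2.0


def run_optimization(
    a: float, threshold: float, max_iterations: int = 100
) -> Optional[float]:
    """Simulate the loop with an inputs-only condition (outputMethod: last)."""
    # `in` block: prev_a is initialised through valueFrom so that
    # a - prev_a == 2 * threshold, which passes the first check
    # whenever threshold is positive.
    inputs = {"a": a, "prev_a": a - 2 * threshold, "threshold": threshold}
    last: Optional[float] = None
    for _ in range(max_iterations):  # safety bound, not part of the proposal
        # The condition only sees `inputs`; there is no `outputs` namespace.
        if not (inputs["a"] - inputs["prev_a"]) > inputs["threshold"]:
            break
        outputs = {"a": optimize(inputs["a"])}
        last = outputs["a"]
        # `loop` block: the new `a` comes from the output, while prev_a
        # is taken from the current iteration's input `a`.
        inputs = {
            "a": outputs["a"],
            "prev_a": inputs["a"],
            "threshold": inputs["threshold"],
        }
    return last


print(run_optimization(0.0, 1.0))  # prints 9.375
```

Starting from `a = 0` with `threshold = 1`, the outputs are 5, 7.5, 8.75 and 9.375; the next check compares 9.375 with 8.75, finds a difference of 0.625, and stops.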

brunokinoshita · June 14, 2022, 12:24am

Hi @GlassOfWhiskey

Fourth solution (the one I think would be better): disable when, remove outputs from loop_when

No objection to this solution. I finished reading the GitHub issue, interesting comments there too.

I think either scatter before the loops, or disabling should be OK, initially. Having a subworkflow for cases where you need both should be fine, but it may be a bit annoying for users too (e.g. migrating an existing cyclic workflow and having to split it into multiple subworkflows could be a little hard).

But IMHO we should go with one that works for a practical use case and have more iterations on this extension before it’s part of the standard.

Q1) Couldn’t we name the cwltool:Loop loop_when as when instead? Since it’s an attribute of Loop, I think it should be fine to call it just when?

Q2) Do you have an existing workflow that you are using for this proposal? If you have one that you can share here, maybe we can get another workflow and compare if the chosen solution works for both (or more cases).

I am aware of the following workflow engines that either support cycles in workflows:

And I am not 100% sure these tools support it, but I think they should manage cyclic workflows somehow as these are used for running climate/weather models (which are normally cyclic):

We can get some of the test workflows from Cylc or StackStorm+Orquesta and see if they would work with any of the proposed solutions, and what they would look like when converted to CWL.

Cheers
-Bruno

GlassOfWhiskey · June 18, 2022, 12:23pm

Hi @brunokinoshita,
for Q1, if I understood correctly, there is a limitation in Schema Salad such that no two fields can have the same name if they belong to different scopes, or something like that. But I think that in the final version there will be only a single when in each workflow step.
For Q2, I started putting some toy examples in the cwltool PR here.
For the other engines, thank you for pointing them out. They will be useful both for comparison and as related work for a future article. Plus, I’d like to add that all low-level task-based libraries (e.g. COMPSs, Parsl, Ray, etc.) support loops, but that is a slightly different scenario, since there the DAG is built incrementally and, as a consequence, loops are automatically unrolled during computation.

GlassOfWhiskey · June 18, 2022, 8:22pm

Hi all,
one last proposal from my side. I just realized that we can entirely avoid adding the outputs namespace to expression evaluation by enforcing these rules in the valueFrom clause:

The inputs field always refers to the inputs of the current loop iteration, i.e. to either the initial inputs or the outputs of the previous iteration;
The loop_source field refers instead to the outputs of the current loop iteration, and the self parameter contains the value of the loop_source field (after applying pickValue and linkMerge directives).

In this way, we can build the input value of the next iteration by using any possible combination of inputs and outputs by relying only on the existing shape of the JavaScript context.

In my opinion, the advantages of not including the outputs context are better consistency with the existing standard and an easier implementation of the JS expressions (which are already complex enough). The main drawback is a different interpretation of the self field, which in valueFrom normally points to the inputs['my-name'] object. However, the standard usage of self is not very useful in my opinion, as it is simply a shortcut to one of the inputs, whereas with this proposal it becomes genuinely useful.

One example of usage could be

loop:
  i1: o1
  delta:
    loop_source: o1
    valueFrom: $(self - inputs.i1)
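
As a rough illustration of these rules, the following Python sketch computes the inputs of the next iteration from a `loop` mapping. The names `next_inputs` and `value_from` are made up for this example, the lambda stands in for a JavaScript expression, and pickValue and linkMerge handling is deliberately left out.

```python
from typing import Any, Dict


def next_inputs(
    inputs: Dict[str, Any], outputs: Dict[str, Any], loop: Dict[str, Any]
) -> Dict[str, Any]:
    """Build the next iteration's inputs from a `loop` mapping.

    `inputs` are the inputs of the current iteration and `outputs`
    are the outputs it produced.
    """
    new_inputs = dict(inputs)  # inputs not mentioned in `loop` are kept
    for name, spec in loop.items():
        if isinstance(spec, str):
            # Shorthand form `i1: o1`: copy the named output.
            new_inputs[name] = outputs[spec]
            continue
        # `self` is the value selected by loop_source, taken from the
        # outputs of the current iteration (None if no source is given).
        source = spec.get("loop_source")
        self_value = outputs[source] if source is not None else None
        # valueFrom sees `self` plus the CURRENT iteration's inputs.
        new_inputs[name] = spec["value_from"](self_value, inputs)
    return new_inputs


loop = {
    "i1": "o1",
    "delta": {
        "loop_source": "o1",
        # Python counterpart of valueFrom: $(self - inputs.i1)
        "value_from": lambda self_value, inputs: self_value - inputs["i1"],
    },
}

print(next_inputs({"i1": 3, "delta": 0}, {"o1": 7}, loop))
# prints {'i1': 7, 'delta': 4}
```

Note that `delta` is computed against the old `i1` (3), not the value just assigned to it (7), because every expression reads the current iteration's inputs.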

tetron · June 20, 2022, 8:37pm

Minor point: CWL fields are camel cased, so it should be loopWhen.

I had previously suggested that having both when and loopWhen should be an error, as we seem to agree that we’d like to unify when for both conditionals and loops in CWL 1.3.

I think scatter and loop should be mutually exclusive in a single step. You can still use a subworkflow. Conceivably we could also improve the syntax for subworkflows (allow them to be declared inline instead of as a standalone process with separate input/output).

I think I agree with the 4th proposal, as that makes the context of when more consistent.

I was a little confused by this since it doesn’t show up in the previous code block, but I think this is just an example of how you could initialize this variable so that the 1st iteration when is always true?

I like your suggestion to use loop_source (should be loopSource) and valueFrom. Ideally, this could be defined almost exactly the same as the in block, which would also be more consistent.

brunokinoshita · July 1, 2022, 3:08am

AiiDA has an example in their tutorials that could be useful for testing the loop requirement implementation.

For example, we have prepared a simple workflow (using work chains, work functions and calculation functions) to optimize the lattice parameter of silicon efficiently using Newton’s algorithm on the energy derivative, i.e. the pressure p=−dE/dV. You can download this code here. The outline looks like this:

https://aiida-tutorials.readthedocs.io/en/latest/pages/2020_Intro_Week/appendices/workflow_logic.html

Cylc also contains several workflows with loops. But Cylc supports ISO8601 and Integer cycle points (cycle points are similar to loops). We can ignore the ISO8601 and choose some of the integer cycle points. There are some interesting examples there, like this one: https://github.com/cylc/cylc-flow/blob/34748a8b394e03c341fe9bfc75fdff031f55c219/tests/functional/cyclers/r5_initial-integer/flow.cylc

It contains 7 cycles, but with a conditional so that some tasks only run if integer-cycle-point <= 5. Here’s the output of cylc graph r5_initial-integer/run1 1 7. There is a task (akin to a step in CWL, I think), xyzzy, that triggers another task, bar, except for the last two cycle points. I think we should be able to model that with the Loop requirement too?
