I’m working on a project that uses Workflows with ResourceRequirement entries. I’ve been trying to detect any conflicts between ramMin/ramMax and coresMin/coresMax values in our CWL files.
At my coworker’s request, I tested whether cwltool can detect the following before running the job:
whether a step/run requirement exceeds the global (workflow-level) requirement
whether a resource minimum exceeds the corresponding maximum in any requirement
It seems, however, that cwltool does not check for conflicts between the minimum and maximum resource values(?)
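The second check can be done outside cwltool on the parsed document. Below is a minimal sketch, assuming the ResourceRequirement has already been loaded into a plain dict; `find_min_max_conflicts` is a hypothetical helper, not a cwltool function, and values given as CWL expressions are skipped rather than evaluated.

```python
# Hypothetical helper (not part of cwltool): checks one parsed ResourceRequirement.
# The field names are the min/max pairs defined for ResourceRequirement in CWL.
PAIRS = [
    ("coresMin", "coresMax"),
    ("ramMin", "ramMax"),
    ("tmpdirMin", "tmpdirMax"),
    ("outdirMin", "outdirMax"),
]


def find_min_max_conflicts(req):
    """Return one message per pair whose minimum exceeds its maximum."""
    conflicts = []
    for lo_key, hi_key in PAIRS:
        lo, hi = req.get(lo_key), req.get(hi_key)
        # Assumption: only literal numbers are compared. CWL also allows
        # expressions (strings) here, which this sketch ignores.
        if isinstance(lo, (int, float)) and isinstance(hi, (int, float)) and lo > hi:
            conflicts.append(f"{lo_key} ({lo}) > {hi_key} ({hi})")
    return conflicts


example = {"coresMin": 4, "coresMax": 2, "ramMin": 1024, "ramMax": 2048}
print(find_min_max_conflicts(example))  # ['coresMin (4) > coresMax (2)']
```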
For example, even with the following configurations, the job still runs:
If the same process requirement appears at different levels of the workflow, the most specific instance of the requirement is used, that is, an entry in requirements on a process implementation such as CommandLineTool will take precedence over an entry in requirements specified in a workflow step, and an entry in requirements on a workflow step takes precedence over the workflow. Entries in hints are resolved the same way.
But what you are trying to do is reasonable, and there have been previous discussions about blending requirements, or reformulating them for CWL v2 to make that easier.
Back to your example, only the innermost ResourceRequirement will be evaluated by any CWL-compliant engine. If the ResourceRequirement in the CommandLineTool were under hints instead of requirements, it would have been overridden by the ResourceRequirement entry under the workflow step’s requirements.
If a ResourceRequirement were specified only in the example workflow, and neither at the step level nor in the CommandLineTool, it would have been applied to the CommandLineTool, though I don’t personally recommend that.
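The precedence described above can be sketched as a lookup that tries requirements from the innermost process outward and only then falls back to hints. This is an illustration on plain dicts, not how any particular engine implements it; `get_entry` and `resolve` are hypothetical names.

```python
def get_entry(process, section, class_name):
    """Find one requirement class in `requirements` or `hints`.

    Handles the map form ({ClassName: {...}}) and the list form
    ([{"class": ClassName, ...}]).
    """
    entries = process.get(section) or {}
    if isinstance(entries, dict):
        return entries.get(class_name)
    for entry in entries:
        if entry.get("class") == class_name:
            return entry
    return None


def resolve(levels, class_name="ResourceRequirement"):
    """Return (section, entry) for the entry that wins, or None.

    `levels` is ordered innermost first, e.g. [tool, step, workflow].
    """
    for section in ("requirements", "hints"):  # requirements override hints
        for process in levels:  # most specific level first
            found = get_entry(process, section, class_name)
            if found is not None:
                return section, found
    return None


# The case discussed above: the tool only has a hint, the step has a requirement.
tool = {"class": "CommandLineTool", "hints": {"ResourceRequirement": {"coresMin": 3}}}
step = {"requirements": {"ResourceRequirement": {"coresMin": 1}}}
workflow = {"class": "Workflow", "requirements": {"ResourceRequirement": {"coresMin": 2}}}
print(resolve([tool, step, workflow]))  # ('requirements', {'coresMin': 1})
```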
I remember seeing a CWL file in the cwltool tests called count-lines1-wf.cwl, which contained a Workflow with a ResourceRequirement; it was used to test sequential workflows. Is it allowed to include a ResourceRequirement in a Workflow only for this purpose?
So the priority in requirements is like this: CommandLineTool > WorkflowStep > Workflow?
And if there’s hints, it’s requirements first so priority is: Req[CommandLineTool > WorkflowStep > Workflow] > Hint[CommandLineTool > WorkflowStep > Workflow]?
So, if I understand correctly, we need to specify the ResourceRequirement for each individual step and CommandLineTool that requires it?
If two steps (or CLTs) need the same resource requirements, should I define the requirement in both steps separately (even if the resource values are the same), rather than placing it in the Workflow to apply it to all steps?
For example, instead of doing this:
Yes, it is allowed by the CWL standards to use a CLT-only requirement in a Workflow for the purposes of flowing down / overriding into the steps. Personally I don’t really recommend it.
Correct. Likewise if there was a sub-workflow in there.
From the perspective of portability and re-use, it is nice if the CLTs are sufficient on their own. So if you know a particular tool always needs some particular minimum resources (especially if you can dynamically determine more accurate numbers from the inputs to the CLT), then I recommend putting that ResourceRequirement as an entry in the CLT’s hints.
I would like to provide additional context if it can help you understand the use case.
We’re working on transitioning our distributed workflow management system (DIRAC) from a custom workflow implementation to CWL. Our system has the following objects:
Jobs: Can be either a Workflow or a CLT, executed on a specific single worker node.
For a single CLT, using one set of requirements for scheduling and execution works well.
However, for Workflows with varying requirements at different levels, we face a challenge: we need one set of requirements to schedule the job on a resource and obtain an allocation, and we must ensure the job respects these limits during execution (using cgroups). If CLT-level requirements exceed the “scheduling” requirements, the job can be killed.
Transformations: Act as templates to create jobs with similar Workflows/CLTs and requirements, but different inputs.
Productions: Large workflows where each step represents a transformation.
Based on your suggestion, I understand we should internally use the maximum requirement values across all CLTs to schedule a given job.
In the example below, we would use the requirements from the first CLT to schedule the job:
cwlVersion: v1.2
class: Workflow
inputs: []
outputs: []
# This is not recommended and should be avoided, right?
requirements:
  ResourceRequirement:
    coresMin: 2
    coresMax: 4
steps:
  good_step_1:
    run:
      class: CommandLineTool
      requirements:
        ResourceRequirement:
          coresMin: 3
          coresMax: 6
      baseCommand: ["echo", "Hello World"]
      inputs: []
      outputs: []
    out: []
    in: []
  good_step_2:
    run:
      class: CommandLineTool
      # No requirements: bad practice! But will inherit from the ones defined at the workflow level IIUC.
      baseCommand: ["echo", "Hello World"]
      inputs: []
      outputs: []
    out: []
    in: []
Note: This approach requires iterating over all step requirements to find the maximum values, which does not seem particularly convenient (but would probably have to be done in any case to validate the Workflow).
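The iteration mentioned in the note could look roughly like the sketch below. It assumes the workflow is already parsed into dicts, that `steps` is in map form, and that every `run` is embedded (a `run` given as a file path is skipped); it only looks at `requirements`, not `hints`, and ignores values given as expressions. The function names are hypothetical.

```python
def resource_req(process):
    """Return the ResourceRequirement under `requirements`, or None."""
    reqs = process.get("requirements") or {}
    if isinstance(reqs, dict):
        return reqs.get("ResourceRequirement")
    return next((r for r in reqs if r.get("class") == "ResourceRequirement"), None)


def effective_reqs(process, inherited=None):
    """Yield the effective ResourceRequirement of every CommandLineTool.

    The innermost definition wins: tool > step > enclosing workflow.
    """
    own = resource_req(process) or inherited
    if process.get("class") == "Workflow":
        for step in (process.get("steps") or {}).values():
            run = step.get("run")
            if not isinstance(run, dict):
                continue  # assumption: external files are not loaded here
            yield from effective_reqs(run, resource_req(step) or own)
    else:
        yield own or {}


def scheduling_envelope(workflow, keys=("coresMin", "coresMax", "ramMin", "ramMax")):
    """Take the per-key maximum over all tools in the workflow."""
    envelope = {}
    for req in effective_reqs(workflow):
        for key in keys:
            value = req.get(key)
            if isinstance(value, (int, float)):
                envelope[key] = max(envelope.get(key, value), value)
    return envelope


# The example workflow above, reduced to the fields this sketch reads.
wf = {
    "class": "Workflow",
    "requirements": {"ResourceRequirement": {"coresMin": 2, "coresMax": 4}},
    "steps": {
        "good_step_1": {"run": {
            "class": "CommandLineTool",
            "requirements": {"ResourceRequirement": {"coresMin": 3, "coresMax": 6}},
        }},
        "good_step_2": {"run": {"class": "CommandLineTool"}},
    },
}
print(scheduling_envelope(wf))  # {'coresMin': 3, 'coresMax': 6}
```

Here good_step_2 falls back to the workflow-level values (2 and 4), so the first CLT determines the result, as described above.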