Build full command with CWL parsing

peterg1t · December 15, 2023, 6:40pm

Hi all,
This is more of a developer question than a user question. A full command consists of a base command (baseCommand), a series of inputs whose positions are defined in their inputBinding, and arguments that also carry positions. I have been using cwl_utils to extract the baseCommand from a Workflow step, but I found that I need to write my own logic to incorporate the inputs and arguments into the full command.
To avoid duplicating work, is there a package or function in cwl_utils that does this for a CommandLineTool process?
Thanks in advance,
Pedro

mrc · December 15, 2023, 6:47pm

Hey Pedro, I’m not aware of any code to do that using the cwl_utils objects. But it would be really useful to have! I would even help review and improve any PR on this topic.

peterg1t · December 15, 2023, 6:51pm

Thanks Michael. I’ll get a proof of concept working first and get back to you on this.
Regards,
Pedro

alexiswl · January 4, 2024, 7:56am

Hi Peter,

Interested to know how you got on with this?

Does this imply you have an input json as well? I’m curious to know how you’d handle optional inputs?

peterg1t · January 4, 2024, 7:42pm

Hi Alexis and Michael,
So here is more or less how I’m trying to work this out. For the time being I’m ignoring expressions. For every step in my workflow I check if I have a workflow or a task (CommandLineTool) and act accordingly. The magic happens in the _process_task method.

for step in cwl_content.steps:
    step_element = cite_extract.get_process_from_step(step)
    if isinstance(step_element, CommandLineTool):
        _process_task(step, wf_graph)
    elif isinstance(step_element, Workflow):
        _process_workflow(step_element, wf_graph)
return wf_graph

I extract the baseCommand

def _process_task(step: WorkflowStep, wf_graph: nx.DiGraph) -> nx.DiGraph:
    command_line_tool = cite_extract.get_process_from_step(step)
    # Map each tool input name to the name of the source that feeds it.
    inputs = {
        os.path.basename(step_inp.id.split("#")[1]): step_inp.source.split("#")[1]
        for step_inp in step.in_
    }

    # baseCommand may be a single string or a list of strings.
    if isinstance(command_line_tool.baseCommand, str):
        base_command = str(command_line_tool.baseCommand)
    elif isinstance(command_line_tool.baseCommand, list):
        base_command = " ".join(command_line_tool.baseCommand)
    else:
        # baseCommand is optional in CWL; fall back to an empty command.
        base_command = ""

and add the arguments that I extract with the following method

def _build_command_arguments(command_line_tool, inputs: dict) -> list:
    command_arguments = []
    for inp in command_line_tool.inputs:
        tool_input = os.path.basename(inp.id.split("#")[1])
        # Only inputs with an inputBinding appear on the command line;
        # skip tool inputs that the step does not wire up (e.g. optional ones).
        if isinstance(inp.inputBinding, CommandLineBinding) and tool_input in inputs:
            command_arguments.append(
                _build_params(inputs[tool_input], inp.inputBinding)
            )
    if command_line_tool.arguments:
        for argument in command_line_tool.arguments:
            command_arguments.append(_build_params("", argument))

    return command_arguments
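
The function above appends tokens in declaration order and relies on a _build_params helper that is not shown in the post. As a rough, self-contained sketch of what that step could look like when the position field is honoured: the Binding class, build_params, and build_command below are hypothetical names for illustration, not cwl_utils or cwltool APIs. Expressions are ignored, as in the post, and breaking ties by putting arguments before inputs is an assumption about the ordering rule rather than a verified reading of the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Binding:
    """Hypothetical stand-in for a CWL CommandLineBinding (not the cwl_utils class)."""
    position: int = 0
    prefix: Optional[str] = None
    separate: bool = True
    value_from: Optional[str] = None  # literal text only; expressions are ignored


def build_params(value: str, binding: Binding) -> List[str]:
    """Render one binding as a list of command-line tokens."""
    text = binding.value_from if binding.value_from is not None else value
    if binding.prefix is None:
        return [text]
    if binding.separate:
        return [binding.prefix, text]
    return [binding.prefix + text]


def build_command(
    base_command,
    input_bindings: Dict[str, Tuple[str, Binding]],
    arguments: List[Binding],
) -> str:
    """Join baseCommand with all bindings, ordered by position."""
    if isinstance(base_command, str):
        base = [base_command]
    else:
        base = list(base_command or [])
    entries = []
    # Assumption: at equal positions, arguments sort before inputs.
    for index, arg in enumerate(arguments):
        entries.append(((arg.position, 0, index, ""), build_params("", arg)))
    for name, (value, binding) in input_bindings.items():
        entries.append(((binding.position, 1, 0, name), build_params(value, binding)))
    entries.sort(key=lambda entry: entry[0])
    tokens = base + [tok for _, toks in entries for tok in toks]
    return " ".join(tokens)


print(build_command(
    ["sh", "process.sh"],
    {
        "ifile2": ("input-file2", Binding(position=2)),
        "ifile": ("input-file1", Binding(position=1)),
    },
    [],
))
# prints: sh process.sh input-file1 input-file2
```

Sorting on an explicit key keeps the result independent of the order in which inputs are declared in the tool.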

This is probably not the optimal solution, but I can build a full command with handles for the input and output files (I’m not really interested in the real inputs) to build an abstract graph. I looked into cwltool to see if there was something I could invoke directly to get the full command, but to be honest I got a bit lost in it. This way I also don’t need the YAML file for the inputs, so I thought it was easier. So for the following example:

cwlVersion: v1.2
class: Workflow

label: An example tool demonstrating metadata.
doc: Note that this is an example and the metadata is not necessarily consistent.

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  ScatterFeatureRequirement: {}
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    coresMin: 4
    ramMin: 3000

inputs:
  input-file1: File
  input-file2: File

steps:
  process_file:
    run:
      label: "A test label"
      class: CommandLineTool
      baseCommand: [sh, process.sh]


      requirements:
        InlineJavascriptRequirement: {}
        InitialWorkDirRequirement:
          listing:
            - entry: ""
              entryname: "output.metadata"
              writable: true

      stdout: output.txt

      inputs:
        ifile:
          type: File
          inputBinding:
            position: 1

        ifile2:
          type: File
          inputBinding:
            position: 2

      outputs:
        ofile:
          type: File
          format: edam:format_1964
          label: A text file that contains a line count
          outputBinding:
            glob: output.txt
          secondaryFiles:
            - pattern: ^.metadata
              required: true


    in:
      ifile: input-file1
      ifile2: input-file2
    out: [ofile]


  process_file_2:
    run:
      class: CommandLineTool
      baseCommand: cat

      requirements:
        InlineJavascriptRequirement: {}
        ResourceRequirement:
          coresMin: 2
          ramMin: 6000
        InitialWorkDirRequirement:
          listing:
            - entry: ""
              entryname: "output_2.metadata"
              writable: true

      stdout: output_2.txt

      inputs:
        ifile:
          type: File
          inputBinding:
            position: 1

      outputs:
        ofile:
          type: File
          format: edam:format_1964
          label: A text file that contains a line count
          outputBinding:
            glob: output_2.txt
          secondaryFiles:
            - pattern: ^.metadata
              required: true



    in:
      ifile: process_file/ofile
    out: [ofile]

outputs:
  output-file:
    type: File
    outputSource: process_file_2/ofile


$namespaces:
  s: https://schema.org/
  edam: http://edamontology.org/

I would obtain the following (networkx) graph:

[
    (
        "process_file",
        {
            "description": "process_file",
            "command": "sh process.sh input-file1 input-file2",
            "inputs": ["input-file1", "input-file2"],
            "outputs": ["process_file/ofile"],
        },
    ),
    (
        "process_file_2",
        {
            "description": "process_file_2",
            "command": "cat process_file/ofile",
            "inputs": ["process_file/ofile"],
            "outputs": ["process_file_2/ofile"],
        },
    ),
],
[("process_file", "process_file_2")]
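
The edge list in that output follows from the step sources alone: a source of the form step/output points back at the step that produces it. A minimal sketch of that derivation, using plain dictionaries instead of cwl_utils objects (derive_edges and the steps mapping are hypothetical names for illustration):

```python
from typing import Dict, List, Tuple


def derive_edges(step_inputs: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Derive (producer, consumer) edges from step source references.

    step_inputs maps a step name to its source references,
    e.g. "process_file/ofile" or a workflow-level input such as "input-file1".
    """
    edges = []
    for step_name, sources in step_inputs.items():
        for source in sources:
            producer, sep, _output = source.partition("/")
            # A source without "/" is a workflow-level input, not a step output.
            if sep and producer in step_inputs and (producer, step_name) not in edges:
                edges.append((producer, step_name))
    return edges


steps = {
    "process_file": ["input-file1", "input-file2"],
    "process_file_2": ["process_file/ofile"],
}
print(derive_edges(steps))
# prints: [('process_file', 'process_file_2')]
```

The resulting tuples could then be passed to a networkx DiGraph via add_edges_from, with the node attributes attached separately.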

Please let me know your thoughts, and whether there is something I missed in cwltool that would achieve a similar result. Thanks!

Have a great day!