Using CWL for Python-based workflows

Hey everyone,

I am new to the CWL world and while attempting to replicate my manual workflow in CWL I stumbled upon a problem. I’m from a materials science background and what I often need to do is process .cif files (containing crystallographic information for a certain element and structure) using the same python script over a range of different numbers of atoms. On the shell, this would look something like

python3 some_script.py -e some_element -s some_crystal_structure -n some_number_of_atoms

some_script.py will parse some_element and some_crystal_structure, load the appropriate .cif file and process it. The only two options I have found so far to replicate this in CWL are

  1. I either copy my python script to the work directory via
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
    - entryname: some_script.py
      entry: |
        # source code for the script

which is inconvenient because then I can never access the script as a regular Python file, or
2. I explicitly state the name of my Python script in the input file

which I personally find kind of redundant, since this workflow is specific to this one Python script.
Is there a third option I am missing here?
This applies to the *.cif files as well. Ideally I would like the input file to look like this:

element: Be
structure: hcp
natoms: 256

and the workflow would stage the appropriate .cif file itself. But I have found no way to stage a file to the work directory using values derived from the input values…

Any help would be appreciated!

Kind regards
Lenz

Hey Lenz,

I think you’re on the right track with your evaluation of solution 1). I believe this feature is meant more for Python “snippets” that can be succinctly expressed to perform simple tasks. If some_script.py is more than, say, 10 lines then this is probably not the move.

Solution 2) is the way to go if you haven’t packaged and installed your Python script. The reason is that CWL runs everything in an isolated temporary directory: if your Python script isn’t an input to your Workflow or CommandLineTool, it isn’t copied into that temporary directory, and so when the Workflow executes Python with “some_script.py”, the script file isn’t found in the current working directory (the temp directory).

Re redundancy: you need to list the Python script as an input so that it is staged for use. With CommandLineTools you don’t always define an inputBinding, and when you don’t, that input isn’t actually passed as a parameter to the underlying baseCommand. This is useful in combination with InitialWorkDirRequirement because it ensures that the file is included in the temporary directory without actually being passed to the command.
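
As a rough sketch of that combination (untested, with placeholder names for the script and its options):

cwlVersion: v1.2
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - $(inputs.script)            # stage the script into the working directory

baseCommand: python3

arguments:
  - $(inputs.script.basename)       # call the staged copy by its file name

inputs:
  script:
    type: File                      # no inputBinding: staged, but not added to the command line by itself
  element:
    type: string
    inputBinding:
      prefix: -e
      position: 1
  structure:
    type: string
    inputBinding:
      prefix: -s
      position: 2
  natoms:
    type: int
    inputBinding:
      prefix: -n
      position: 3

outputs: []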

Another workaround is to make your script runnable from any directory, not just when it’s in your current working directory. Installing your script as a Python package essentially places it in a directory listed in your $PATH variable. You could also add the script’s directory (as an absolute path) to PATH yourself, but this gets messy fast and I’d recommend against it.
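
If you do install the script so that it is on $PATH, the wrapper gets simpler still, since there is no script file to stage at all; roughly (untested, assuming the installed command is called some_script):

cwlVersion: v1.2
class: CommandLineTool

# assumes the package install put a console script named `some_script` on $PATH
baseCommand: some_script

inputs:
  element:
    type: string
    inputBinding:
      prefix: -e
  structure:
    type: string
    inputBinding:
      prefix: -s
  natoms:
    type: int
    inputBinding:
      prefix: -n

outputs: []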

Regarding your .cif files, if you have a consistent naming scheme for these files that can be derived from element,structure,natoms, then you can use these string type inputs to craft a File type input using an expression so that they can be staged with InitialWorkDirRequirement. Can you provide more details about the cif file(s) you would need for a given run, and whether these required files change depending on your inputs for element,structure,natoms?


Hello tate,

Thank you for your insight!
You’re right, I could package the scripts or make them globally available, but since they are small processing scripts, that doesn’t feel optimal either. I will go with including the script in the input file for now. I still find this solution somewhat suboptimal, since the other input parameters (e.g. -e, -s, -n) are specific to the script. That means there will be one input that can only ever take one value, or else the CWL script will return an error. This stage of my workflow contains three Python scripts with similar names, so I could imagine that this might happen.
Is this a situation where exclusive inputs should be used (Advanced Inputs – Common Workflow Language User Guide)? As in, could I make one CWL script for all the Python processing scripts in this stage of my workflow and then make it so that, depending on the chosen script, only the appropriate input parameters are allowed? Would that work?
“if you have a consistent naming scheme for these files that can be derived from element,structure,natoms, then you can use these string type inputs to craft a File type input using an expression so that they can be staged with InitialWorkDirRequirement” - Yes, I do have that! Given element and structure, the cif files will take the form $element_$structure.cif. I was thinking I could do this via an expression, but I never got it to work. So I guess this is more down to my unfamiliarity with CWL/JS expressions; I will try again and maybe post a new question to this forum if I cannot figure it out.

Kind regards
Lenz

Hmm, that sounds like something else might be going on. I’m going to take a step back on what I said about including the Python script as an input without inputBinding. Instead I’d recommend something like the following (which is just a snippet and hasn’t been tested):

requirements:
 - class: InlineJavascriptRequirement
 - class: InitialWorkDirRequirement
   listing: |
     ${var cif_dir = "/absolute/path/to/cifs";
       return [{
          "class": "File",
          "location": "file://" + cif_dir + "/" + inputs.element + "_" + inputs.structure + ".cif"
       }];
     }

baseCommand: python

inputs:
  script:
    type: File
    inputBinding:
      position: 0
  element:
    type: string
    inputBinding:
      position: 1
      prefix: -e
  structure:
    type: string
    inputBinding:
      prefix: -s
      position: 2
  natoms:
    type: int
    inputBinding:
      prefix: -n
      position: 3

outputs:
   ...

This of course is assuming that your Python script normally looks in the current working directory for your cif file.

For wrapping self-contained Python scripts, here’s the pattern I use:

cwlVersion: v1.2
class: CommandLineTool
inputs:
  script:
    type: File
    inputBinding:
      position: 1
    default:
      class: File
      location: myscript.py
outputs: []
baseCommand: python3

This will take a directory as input, and then select a single file from the directory with the appropriate filename:

cwlVersion: v1.2
class: Workflow
requirements:
  InlineJavascriptRequirement: {}
  StepInputExpressionRequirement: {}
inputs:
  cifs:
    type: Directory
    loadListing: shallow_listing
  element: string
  structure: string
  natoms: int
outputs: []
steps:
  runscript:
    in:
      cifs: cifs
      element: element
      structure: structure
      natoms: natoms
      ciffile:
        valueFrom: |
          ${
            for (var i = 0; i < inputs.cifs.listing.length; i++) {
              if (inputs.cifs.listing[i].basename === (inputs.element + "_" + inputs.structure + ".cif")) {
                return inputs.cifs.listing[i];
              }
            }
            return null;
          }
    out: []
    run: myscript.cwl
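
A job file for that workflow could then stay close to the input file you sketched, something like (the directory path is just an example):

cifs:
  class: Directory
  location: cifs          # example path: a directory containing Be_hcp.cif, etc.
element: Be
structure: hcp
natoms: 256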

I haven’t tested the above scripts, so there may be typos, but hopefully they will give you some ideas.

edit: corrected a bunch of typos :slight_smile:


@tetron I like that. Rather than shoehorning a super minimal File object from a string, just start with a Directory listing with complete File records and choose the one(s) you want. Filing this away for later use, thank you!

Hi @tate and @tetron ,

Thank you so much for your input! I like @tetron’s approach for the Python scripts as well; it looks like exactly what I am looking for. I will implement it and test the code for the cif file handling as well!