Install Python modules when running CWL

Hello everyone,

I am new to CWL and am trying to implement my very first workflow, which calls some Python scripts.

So far I’ve done some tests with basic Python scripts and got them working, but when I use my real Python scripts I run into problems importing certain Python modules, i.e. “ModuleNotFoundError”. The modules are installed on my system and I can import them outside the CWL environment, so I guess I get the error because the modules do not exist in the CWL container? How could I solve this issue?

Is there a way to install all the necessary modules at the beginning of the workflow? The Python modules that I need to import are used by several Python scripts that will be called at different steps of the workflow.

Here are my CWL files: the main one that calls the others for each step (so far I have just one step), and the one that calls the Python script.

Thanks in advance!

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow


inputs:
  zipFile: string
  
outputs:
  b1:
    type: Directory
    outputSource: unzip_files/b1
  t1:
    type: Directory
    outputSource: unzip_files/t1
  t2_am:
    type: Directory
    outputSource: unzip_files/t2_am
  t2_ph:
    type: Directory
    outputSource: unzip_files/t2_ph
  segmentation:
    type: Directory
    outputSource: unzip_files/segmentation
    
    
steps:    
  unzip_files:
    run: unzip.cwl
    in:
      zipFileIn: zipFile
    out: [b1, t1, t2_am, t2_ph, segmentation]
     

cwlVersion: v1.0
class: CommandLineTool

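baseCommand: python

# $(runtime.outdir) is only evaluated in expression fields such as valueFrom;
# inside an input `default` it would be passed as a literal string.
arguments:
  - position: 3
    valueFrom: $(runtime.outdir)

inputs: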
  zipFileIn: 
    type: string
    inputBinding:
      position: 2

  script:
    type: File
    inputBinding:
      position: 1
    default:
      class: File
      location: ./QSM_pipeline/01_unzip.py
  

outputs:
  b1:
    type: Directory
    outputBinding:
      glob: "01-b1"
  t1:
    type: Directory
    outputBinding:
      glob: "02-t1"
  t2_am:
    type: Directory
    outputBinding:
      glob: "03-t2"
  t2_ph:
    type: Directory
    outputBinding:
      glob: "04-t2"
  segmentation:
    type: Directory
    outputBinding:
      glob: "05-segmentation"

You should look into a containerization method, such as Docker or Singularity. Prepare a container with your scripts and their Python modules installed into it, then add a DockerRequirement to your CWL that points at it.

Designing workflows to depend on scripts and modules installed on your host system is generally a bad idea.
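
As a rough sketch of what that could look like (not taken from the project linked below), a DockerRequirement can either pull a pre-built image with dockerPull or carry an inline Dockerfile with dockerFile. The image ID and the package names here are placeholders, since the thread does not say which modules are missing. Support for building from dockerFile may vary between runners, so a pre-built image referenced with dockerPull is likely the more portable choice.

```yaml
# Fragment to add to a CommandLineTool, or to the Workflow itself,
# in which case the steps inherit it.
hints:
  DockerRequirement:
    dockerImageId: qsm-pipeline-python   # hypothetical local image name
    dockerFile: |
      FROM python:3.9-slim
      # Placeholder package list: replace with the modules your scripts import.
      RUN pip install --no-cache-dir numpy nibabel
```

Putting the hint on the Workflow rather than on each tool means every step runs in the same image, which covers the case where several scripts share the same modules.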

An example from our project:

We have custom Python scripts here: helix_filters_01/bin at master · mskcc/helix_filters_01 · GitHub

They get installed into a container and pushed to Docker Hub here: helix_filters_01/Dockerfile at b33bb08f926c062c712897a8df12a59b1fb56d9a · mskcc/helix_filters_01 · GitHub

They then get used inside the CWL, via the container, here: pluto-cwl/add_af.cwl at a55ae2d7799dab8608bf9ace995a94e06a4bbe8a · mskcc/pluto-cwl · GitHub


Looking at the CWL you posted, it also appears that you are passing your script as an input item. This is not a good method to use; your custom script should be invoked via baseCommand instead. Preferably it should be accessible via your system PATH so that it can be called by name.
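
A minimal sketch of the baseCommand approach, under some assumptions: 01_unzip.py is executable, starts with a Python shebang line, and sits on the PATH inside the image; the image name is hypothetical. Only one of the five outputs is shown.

```yaml
cwlVersion: v1.0
class: CommandLineTool

# The script is called by name instead of being passed in as a File input.
baseCommand: 01_unzip.py

hints:
  DockerRequirement:
    dockerPull: example/qsm-pipeline:latest   # hypothetical image name

arguments:
  - position: 2
    valueFrom: $(runtime.outdir)

inputs:
  zipFileIn:
    # A File rather than a string path, so the runner can make it
    # available inside the container.
    type: File
    inputBinding:
      position: 1

outputs:
  b1:
    type: Directory
    outputBinding:
      glob: "01-b1"
```

Note that switching zipFileIn to File means the workflow-level zipFile input would need to become a File as well.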

During development, I don’t think it is terrible to pass one’s scripts as inputs. That won’t be very portable, or reproducible, but it is very convenient when there is a lot of code churn.

The Python import is probably failing because environment variables are tightly controlled, especially when using cwltool and related CWL implementations. Again, during periods of heavy development, I do recommend the following trick, though eventually you should make or find a container instead:

  --preserve-environment ENVVAR
                        Preserve specific environment variable when running CommandLineTools. May be provided multiple times. By default PATH is preserved when not running in a container.
  --preserve-entire-environment
                        Preserve all environment variables when running CommandLineTools without a software container.
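
A related option, not mentioned above and offered only as a development-time sketch, is to set the variable in the tool description itself with CWL's EnvVarRequirement. This assumes the missing modules live in a directory that Python can be pointed at through PYTHONPATH; the path below is hypothetical and specific to one machine, so this is no more portable than the command-line flags.

```yaml
requirements:
  EnvVarRequirement:
    envDef:
      # Hypothetical host path: replace with wherever the modules are installed.
      PYTHONPATH: /home/user/.local/lib/python3.9/site-packages
```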