How to use libraries installed in a Docker image in a CWL workflow?

Hello,

I am creating a CWL workflow and writing out its steps. The first step runs a Docker container that installs some Python libraries. I then want the other CWL files to use the libraries installed inside that container, but I am not sure how to do that.
This is what install_docker_test.cwl looks like:

cwlVersion: v1.2
class: CommandLineTool
id: install_requirements
stderr: output_test
requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: test-docker-image
inputs:
  scripts_folder:
    type: Directory
outputs:
  output:
    type: File
    outputBinding:
      glob: output_test

And this is the workflow description:

cwlVersion: v1.2
class: Workflow
id: workflow
requirements:
  EnvVarRequirement:
    envDef:
      TEST_STR: "this_is_my_test"
inputs:
  input_folder:
    type: Directory
  my_script:
    type: string
outputs:
  output:
    type: File
    outputSource: download_files/download_files_output
steps:
  install_requirements:
    run: install_docker_test.cwl
    in:
      scripts_folder: input_folder
    out: [output]
  download_files:
    run: download_files.cwl
    in:
      my_script: my_script
      input_folder: input_folder
    out: [download_files_output]

However, the issue is that the script run inside download_files.cwl does not use the Docker container, even though the install_requirements step is listed before it. How can I fix this and make sure that the Python script uses the Docker container instead of the generic Python installed in WSL?

The install_requirements and download_files steps are isolated from one another, so installing into the container like that won’t work – the download_files step starts in a fresh container environment so any changes made by install_requirements are wiped away.

Perhaps you could explain a little more what install_requirements needs to do? I suspect what you want to do is put that in a Dockerfile to build your image.

The install_requirements step triggers install_docker_test.cwl to run, which has the requirement for the Docker image to exist. I have a Dockerfile where I install the Python libraries that I need in the other steps (here: download_files). Here is the Dockerfile:

FROM python:3.8
WORKDIR /base
COPY example/ /base/example/
RUN pip install /base/example/custom_lib
COPY requirements.txt /base/example/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
    python -m pip install -r /base/example/requirements.txt
RUN pip install torch
ENV PYTHONUNBUFFERED=1
COPY . .

The reason I’ve separated the library installation into Docker is that installing them as a workflow step (a separate CWL file whose command installs the Python libraries) would use too much memory: Max memory used: 213MiB

Does download_files also have this?

requirements:
  DockerRequirement:
    dockerPull: test-docker-image

But I suspect you have the wrong mental model of the workflow. You can’t install files in step 1 and then find them there in step 2, unless you specify the files as an output of step 1 and connect them as an input to step 2.

The reason is that step 1 and step 2 are running in different containers, and could be running on entirely different machines.
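For example, the only way data moves between steps is through explicit output-to-input connections in the workflow. A minimal sketch (step and port names here are illustrative, not from your files):

```yaml
steps:
  step1:
    run: step1.cwl
    in: {}
    out: [produced_file]      # step1 must declare this File as an output
  step2:
    run: step2.cwl
    in:
      needed_file: step1/produced_file   # step2 receives step1's output
    out: [result]
```

Anything step1 writes inside its container but does not declare as an output is discarded when the step finishes.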

You want to build your Docker image with all of its software dependencies at the command line (or in a separate shell script) and then launch the workflow. CWL has limited support for specifying how to build Docker images.

Does this help?