Input and output paths not explicitly provided in CWL for an AI use case

miro-bezak · February 10, 2021, 3:08pm

Hello,

We’re trying to use the CWL tool to run AI workflows within a digital pathology use case and verify options to generate compliant provenance information with CWL.

Our scripts do not take absolute paths to input/output datasets. Instead, they consume several parameters and build the paths internally. Is there a way to indicate this in CWL? We do not know the exact paths beforehand, but we know the logic of how they are created.

Thanks very much for any answers in advance
Miro Bezak

tetron · February 10, 2021, 9:53pm

CWL has several ways to approach this: you can use Directory inputs, secondaryFiles, and/or InitialWorkDirRequirement. Perhaps you could explain more precisely how these programs work, and I could give some more detailed suggestions.

steve · February 10, 2021, 10:29pm

I have a feeling that the best solution to this will be something like “change your scripts to use the provided input path”

miro-bezak · February 11, 2021, 9:18pm

One of our scripts can be executed like this:
python -m src.create_dataset -n Prostata -l 1 -r 96 -c 32 -t 96 -g all -x 1
It assumes that previously created data frames are already present on disk. It then finds all that are relevant for these input arguments, which are located in a folder like this (effectively an array of file inputs that is not provided on the command line):
/mnt/data/crc_ml/data/processed/Prostata/level1/r96px/c32px/t96px/all/coord_maps
Finally, it merges them into a single dataset in a similar folder.

We do not want to require these complex paths as inputs. We would rather specify multiple arguments that each have their own semantics, which also frees the user from having to know our exact file structure.

If I haven’t explained it clearly enough, feel free to ask. I hope we can fit this somehow into CWL

tetron · February 12, 2021, 3:31am

Is /mnt/data/crc_ml/data/processed/ the current working directory in your example, or is it hardcoded somehow?

Do you expect to pass in a whole directory of files (with the expected structure, possibly including files you don’t need) or do you want to be able to pass in a single/small number of files and then place them in the correctly nested directory?

Am I correct in thinking that -l 1 turns into level1, -r 96 turns into r96px and so forth?

You can construct nested directories using InitialWorkDirRequirement like this:

cwlVersion: v1.0
class: CommandLineTool
inputs:
  name: string
  level: int
  r: int
  file: File
outputs: []
requirements:
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      # Build <name>/level<level>/r<r>px/ in the working directory and stage the input file inside it.
      - entry: |-
          ${
          return {
            class: "Directory",
            basename: inputs.name,
            listing: [{
              class: "Directory",
              basename: "level"+inputs.level,
              listing: [{
                class: "Directory",
                basename: "r"+inputs.r+"px",
                listing: [
                  inputs.file
                ]
              }]
            }]
          };
          }
baseCommand: find  # with no arguments, find lists the staged directory tree

miro-bezak · February 14, 2021, 12:49pm

These paths are configured in a separate file and then used in several of our scripts. Also, these directories don’t even have to exist prior to the script’s execution if it is the first time the script is run.

After putting a lot of thought into it and consulting with my team, I decided that we will just use these command line arguments and not specify any input directories. This also removes the struggle of possibly copying them into a temporary directory. Since we are primarily interested in generating provenance, these inputs completely specify which input files were used by the script.

However, we are still interested in capturing the output. Is there a way to create output outside of the working directory and still mark it as an output in CWL? It would be located on the same kind of long path as I mentioned in my earlier post.

tetron · February 15, 2021, 3:42pm

Taking a single Directory as input, with all the files organized into subdirectories, is probably easier for your case than taking individual files and reconstructing their proper location.
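
A minimal sketch of the single-Directory approach, assuming a hypothetical input named processed that holds the whole tree (for example, the contents of the processed folder from the earlier post). InitialWorkDirRequirement stages it into the working directory under its own basename, and find only lists what was staged:

```yaml
cwlVersion: v1.0
class: CommandLineTool
inputs:
  # Hypothetical input name; pass the directory that contains
  # Prostata/level1/r96px/... as a single Directory object.
  processed: Directory
outputs: []
requirements:
  InitialWorkDirRequirement:
    listing:
      # Stage the whole directory, with its nested structure intact,
      # into the tool's working directory.
      - $(inputs.processed)
baseCommand: find
```

Replacing find with the real script would let it resolve its nested paths relative to the staged directory, provided its path configuration points there.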

Outputs must appear somewhere inside the output directory. They can be nested in subdirectories. If you are using a Docker container, you can specify the exact location of the output directory.
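
As a rough sketch of that last point: DockerRequirement has a dockerOutputDirectory field that sets where the output directory is mounted inside the container. The example below rests on several assumptions that are not confirmed in this thread: the image name is hypothetical, the image is assumed to have the src package importable, the script is assumed to write its merged dataset under /mnt/data/crc_ml/data/processed/<name>/..., and the input names simply mirror the command line flags.

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: example/crc_ml:latest        # hypothetical image name
    # Mount the CWL output directory at the path the script writes to
    # (assumption: this is the root the script uses for its results).
    dockerOutputDirectory: /mnt/data/crc_ml/data/processed
baseCommand: [python, -m, src.create_dataset]
inputs:
  # Input names are illustrative; the thread does not state what each flag means.
  name:
    type: string
    inputBinding: {prefix: -n}
  level:
    type: int
    inputBinding: {prefix: -l}
  r:
    type: int
    inputBinding: {prefix: -r}
  c:
    type: int
    inputBinding: {prefix: -c}
  t:
    type: int
    inputBinding: {prefix: -t}
  g:
    type: string
    inputBinding: {prefix: -g}
  x:
    type: int
    inputBinding: {prefix: -x}
outputs:
  dataset:
    type: Directory
    outputBinding:
      # Capture the top-level directory named after the -n argument,
      # including everything nested below it.
      glob: $(inputs.name)
```

With this layout the long nested path stays internal to the script, while the provenance record still lists the parameters as inputs and the captured directory as the output. Note that the output directory starts out empty, so any existing data frames the script reads would have to be made available some other way.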