Potential defect in ontology format checking

kannon92 · January 31, 2022, 2:52pm

Hello,

We are interested in the ontology functionality of CWL. We work in the image processing domain and we would like to have CWL validate the ontologies before executing a workflow.

I am under the impression that this exists in CWL. We don’t have existing ontologies in EDAM yet so we were hoping that it could perform an exact match on the format field for now.

This is a pretty simple use case. You can execute cwltool --validate to perform a check.

I expect that validate will fail because the format fields of the input and output directories do not match.

thresholding.cwl

cwlVersion: v1.0
id: "Thresholding plugin"
class: CommandLineTool
requirements: 
  DockerRequirement:
    dockerPull: wipp/wipp-thresh-plugin:1.1.1
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      -  entry: $(inputs.output)
         writable: true

baseCommand: [""]
inputs: 
  input: 
    type: Directory
    label: Input collection of ome.tiff files for the thresholding plugin.
    format: "string"
    inputBinding:
      prefix: --input
  output:
    type: string
    inputBinding:
      prefix: --outDir
outputs:
  thresholdOut: 
    type: Directory
    label: Output collection of ome.tiff files for the thresholding plugin.
    format: "FAIL"
    outputBinding: 
      glob: "$(inputs.output.basename)"

workflow:

cwlVersion: v1.0
class: Workflow
inputs: 
  thresholdInput: Directory
  thresholdOutDir: string
  
outputs:
  threshold_out:
    type: Directory
    outputSource: thresholding/thresholdOut

steps:
  thresholding: 
    run: bad-thresholding.cwl
    in: 
      input: thresholdInput
      output: thresholdOutDir
    out: [thresholdOut]
  thresholding_bad: 
    run: bad-thresholding.cwl
    in: 
      input: thresholding/thresholdOut
      output: thresholdOutDir
    out: [thresholdOut]

brunokinoshita · February 11, 2022, 2:16am

Hi @kannon92

The format field can be used only for the File type (or an array of File), as per the docs here: Common Workflow Language (CWL) Command Line Tool Description, v1.2
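As a minimal sketch of where format is accepted, the tool below declares it on a File array instead of a Directory. The `ex` namespace and the `ome-tiff` term are hypothetical placeholders rather than real ontology entries, and `echo` stands in for the actual plugin command.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo   # stand-in command for illustration only
$namespaces:
  ex: https://example.org/formats#   # hypothetical namespace, not a real ontology
inputs:
  images:
    type: File[]          # format is valid on File or an array of File
    format: ex:ome-tiff   # hypothetical format IRI
    inputBinding:
      prefix: --input
outputs: []
```

As far as I understand, the format value is treated as an IRI, so each File object passed in would need to carry a matching format value for the comparison to succeed.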

I think a good improvement would be to have a warning when format is used with an input that’s not a File: "cwltool validate must have a warning when format is used with an invalid type (e.g. Directory)", Issue #1616 in common-workflow-language/cwltool on GitHub

Cheers
Bruno

kannon92 · February 11, 2022, 1:16pm

Nice find! What exactly is the difference between a File array and a Directory?

I hope we can enable this for Directories. It would be very useful!

brunokinoshita · February 11, 2022, 7:07pm

My understanding is that the ontology is used to validate that the file format matches what was specified in the CWL file, but it is not obvious how to apply that to a directory or its contents. For a File it is probably easy enough to verify that it is a fasta or a tiff file, but for a Directory I’m not sure whether it would check if it’s a symlink, or the permissions of the directory, or its contents, or how that would work hierarchically…

Would you be able to elaborate more on your use case for format + Directory here or in an issue on GitHub? Maybe it could be useful if others have a similar use case, or if a CWL dev has an idea on how to implement this.

mrc · February 14, 2022, 10:25am

A CWL Directory itself has a name and it can contain other directories in addition to files; those directories can also contain files and other directories and so on.

A File array doesn’t have its own name, and cannot contain a Directory.

steve · March 1, 2022, 2:32pm

Functionally, I have found that the biggest difference is that once you put your Files into a Directory, you lose the ability to refer to them by reference downstream. If there is a way to do this, it would be great to know. As such, it's been a lot easier to use either File arrays for collections of files (where order does not matter), or arrays of record types where each record can have some kind of label field plus a File field in order to identify the individual files. If you want your files in a specific directory, then I think it's best to implement that as the very last step in your workflow, unless you know that you will never need those files as an input again later in the workflow.
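For anyone wanting to try the record approach, here is a rough sketch of what such an input declaration could look like. It is only the inputs fragment of a tool or workflow, and the names `labeled_images`, `LabeledImage`, `label`, and `image` are made up for illustration.

```yaml
inputs:
  labeled_images:
    type:
      type: array
      items:
        type: record
        name: LabeledImage   # hypothetical record name
        fields:
          label:
            type: string     # identifies the file downstream
          image:
            type: File       # the file itself
```

Each entry in the input object would then be a mapping with a `label` string and an `image` File object, so later steps can pick out files by label.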

mrc · March 1, 2022, 3:56pm

One can access the listing property of a CWL Directory object to get an array of File and Directory objects.
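A minimal sketch of this, assuming CWL v1.2 and an input named `dir` (a hypothetical name): the ExpressionTool below returns the top-level File entries of a Directory as a File array. In v1.1 and later the listing is not loaded by default, hence the LoadListingRequirement.

```yaml
cwlVersion: v1.2
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
  LoadListingRequirement:
    loadListing: shallow_listing   # load only the top level of the directory
inputs:
  dir:
    type: Directory
outputs:
  files:
    type: File[]
expression: |
  ${
    // keep only File entries; nested Directory entries are skipped
    var files = inputs.dir.listing.filter(function (entry) {
      return entry.class === "File";
    });
    return {"files": files};
  }
```

The resulting File array can be wired into later workflow steps like any other File[] output. Using `deep_listing` instead would also populate the listings of nested directories, though the filter above would still need to recurse to collect files from them.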