Input wrapped in double quotes getting wrapped in single quotes

Hello, new CWL user here.

I’m trying to implement a tool’s pattern-matching flag. From its documentation: "Use pattern matching (*) to specify multiple input files. Enclose the pattern in double quotes."
My issue is that any input I pass for this flag in my tool CWL contains double quotes, and the whole value then gets wrapped in single quotes, like so: -P \ '../data/input/"*.fastq.gz"' \ which the tool cannot handle.
I have tried escaping the double quotes, passing the input through the arguments block, and setting shellQuote: false, among other things, but to no avail.
For reference:

inputs:
  input_type:
    type: 
      - type: enum
        symbols: [ fa, fq, f5, f5s, seqtxt, bam, rrms ]
    label: input file type
    doc: |
      Acceptable input types:
        fa      FASTA file input
        fq      FASTQ file input
        f5      FAST5 file input
        f5s     FAST5 file input with signal statistics output
        seqtxt  sequencing_summary.txt input
        bam     BAM file input
        rrms    RRMS BAM file input
      Could be streamable depending on input type.
    inputBinding:
      position: -1
  pattern_input:
    type: string?
    label: pattern input
    doc: Use pattern matching (*) to specify multiple input files. Enclose the pattern in double quotes.
    inputBinding:
      prefix: -P

and

cwl:tool: longreadsum.cwl

input_type: fq

pattern_input: ../data/input/"*.fastq.gz"

I’m wondering if there is an easy solution I haven’t been able to find? Any help would be appreciated!

Hi @martijn,

Where are the files in this case? Usually the inputs would include a Directory or a list of File objects.

I think we can get around your issue by using a shell script with ‘InitialWorkDirRequirement’, but you’ll need to update your question to include your full CommandLineTool.

Hello @alexiswl, thank you for your response.

I would rather refrain from including the entire CommandLineTool so as not to have a giant post (I have been attempting to fully implement each parameter so the file is rather long).
Let us say a fairly minimal version which can also mount the input files would include:

requirements:
  InlineJavascriptRequirement: {}
 
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/longreadsum:1.3.1--py310h65e1ce4_3
  SoftwareRequirement:
    packages:
      LongReadSum:
        version: ["1.3.1"]
        specs: ["identifiers.org/RRID:SCR_026408"]

inputs:
  input_type:
    type: 
      - type: enum
        symbols: [ fa, fq, f5, f5s, seqtxt, bam, rrms ]
    label: input file type
    doc: |
      Acceptable input types:
        fa      FASTA file input
        fq      FASTQ file input
        f5      FAST5 file input
        f5s     FAST5 file input with signal statistics output
        seqtxt  sequencing_summary.txt input
        bam     BAM file input
        rrms    RRMS BAM file input
      Could be streamable depending on input type.
    inputBinding:
      position: -1
  pattern_folder:
    type: Directory?
    label: pattern directory
    doc: Directory containing input 
  pattern_input:
    type: string?
    label: pattern input
    doc: Use pattern matching (*) to specify multiple input files. Enclose the pattern in double quotes.
    inputBinding:
      prefix: -P 
  outputfolder:
    type: string
    label: output folder
    doc: Sets the output directory, defaults to "longreadsum_output".
    default: longreadsum_output
    inputBinding:
      prefix: '--outputfolder'

outputs:
  longreadsum_outdir:
    type: Directory
    label: output directory
    doc: LongReadSum output directory.
    outputBinding:
      glob: "$(inputs.outputfolder)"

I think I understand what you are driving at. I could stage the directory containing the input files:

  InitialWorkDirRequirement:
    listing:
      - entry: "$(inputs.pattern_folder)"

And then use shell scripting in an arguments block to construct the entire command?

Ah, this is much more useful. For a minimal reproducible example, the reproducible part is more important than the minimal part (although that matters too).

Part 1 - Shell Expansion

For a bit of background, an asterisk on the command line triggers a shell expansion. Having a look at WGLab/LongReadSum, this is why it asks for the pattern in quotes. Without quotes, the shell ‘expands’ the arguments before they are passed to the tool itself. If the files foo.fastq.gz and bar.fastq.gz were inside the directory ../data/input/, then the command line would be expanded (in sorted order) to longreadsum -P ../data/input/bar.fastq.gz ../data/input/foo.fastq.gz, and the arguments provided to the tool would be:

[
  "-P",
  "../data/input/bar.fastq.gz",
  "../data/input/foo.fastq.gz"
]
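To make the expansion step concrete, here is a rough Python sketch that mimics what the shell does with an unquoted pattern. The helper name expand_like_shell and the temporary directory are my own, purely for illustration; a real shell has more rules than this.

```python
import glob
import os
import tempfile


def expand_like_shell(pattern):
    """Return sorted matches, or the pattern itself if nothing matches
    (the default behaviour of a POSIX shell)."""
    matches = sorted(glob.glob(pattern))
    return matches if matches else [pattern]


with tempfile.TemporaryDirectory() as tmp:
    # Create two matching files and one that should be ignored.
    for name in ("foo.fastq.gz", "bar.fastq.gz", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()

    # The tool never sees the asterisk, only the expanded file names.
    argv = ["-P"] + expand_like_shell(os.path.join(tmp, "*.fastq.gz"))
    names = [os.path.basename(a) for a in argv[1:]]
    print(names)  # ['bar.fastq.gz', 'foo.fastq.gz']
```

So with an unquoted pattern the tool receives one argument per matching file, and the pattern itself is gone.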

Instead, with double quotes, the tool receives ../data/input/*.fastq.gz as a single argument and, I believe, does its own pattern expansion internally, so the arguments provided to the tool would be:

```json
[
  "-P",
  "../data/input/*.fastq.gz"
]
```

Therefore:

  • Don’t use shellQuote: false. You don’t actually want the shell to expand anything here; you just want the pattern to reach the tool unexpanded.
  • Don’t put quotes in the input, i.e. pattern_input: ../data/input/"*.fastq.gz" should just be pattern_input: ../data/input/*.fastq.gz. Even though CWL will wrap the argument in single quotes, what is passed to the tool is the same as it would be with double quotes. However, if the input value itself contains double quotes, the literal " characters are passed through, since they sit inside the single quotes, which is not what the tool wants. The tool readme only tells you to enclose the pattern in quotes so that the shell doesn’t expand it before running the tool. The readme presumably uses double quotes rather than single quotes because double quotes still let the shell substitute variables before the arguments are passed to the tool.
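If it helps, the quoting behaviour can be checked with Python’s shlex module, which splits a command line roughly the way a POSIX shell would (without doing any glob expansion). This is only an analogy for what the CWL runner and the shell do, and the helper name argv_seen_by_tool is mine.

```python
import shlex


def argv_seen_by_tool(command_line):
    """Split a command line into the argument list a program would receive."""
    return shlex.split(command_line)


# Double quotes and single quotes both deliver the bare pattern.
double = argv_seen_by_tool('longreadsum -P "../data/input/*.fastq.gz"')
single = argv_seen_by_tool("longreadsum -P '../data/input/*.fastq.gz'")

# Double quotes nested inside single quotes survive as literal characters.
nested = argv_seen_by_tool("longreadsum -P '../data/input/\"*.fastq.gz\"'")

print(double[2])  # ../data/input/*.fastq.gz
print(single[2])  # ../data/input/*.fastq.gz
print(nested[2])  # ../data/input/"*.fastq.gz"

# Quoting a value that contains an asterisk yields the single-quoted form.
print(shlex.quote("../data/input/*.fastq.gz"))  # '../data/input/*.fastq.gz'
```

The first two calls give identical arguments, which is why the single quotes added by CWL are harmless, while the nested case shows the stray " characters the tool was choking on.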

Part 2 - Containerisation

This is the more pressing bit.

The pattern argument is doing double duty here: it is a plain string, but it also stands in for a File[] or Directory input.

Given you’re running this in Docker, neither of the following inputs will mount ../data/input into the container, so longreadsum won’t be able to see the files in ../data/input anyway.

In summary, the inputs below will not mount a directory:

input_type: fq
pattern_input: ../data/input/"*.fastq.gz"

Adding a directory will not help either:

input_type: fq
pattern_input: ../data/input/"*.fastq.gz"
pattern_folder:
    class: Directory
    location: ../data/input/

This will likely mount the files in ../data/input under something like /data/cwl/mounts/, not ../data/input, so the pattern_input will not point to where they are.
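As a rough illustration of the mismatch, the sketch below resolves the relative pattern against a container working directory and compares it with where the staged files end up. Both paths are hypothetical and picked only for the example; the actual locations depend on the CWL runner.

```python
import posixpath


def resolve_in_container(workdir, pattern):
    """Resolve a relative pattern against the container's working directory."""
    return posixpath.normpath(posixpath.join(workdir, pattern))


# Hypothetical paths, chosen only for illustration.
container_workdir = "/var/spool/cwl"
staged_directory = "/var/lib/cwl/stg1234/input"

where_tool_looks = resolve_in_container(container_workdir, "../data/input/*.fastq.gz")
where_files_are = posixpath.join(staged_directory, "*.fastq.gz")

print(where_tool_looks)  # /var/spool/data/input/*.fastq.gz
print(where_files_are)   # /var/lib/cwl/stg1234/input/*.fastq.gz
```

The string input knows nothing about the staged location, so the pattern has to be built from the Directory object’s path inside the container.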

You have two options:

  1. Scrap pattern_input and pattern_folder and instead go with the --inputs parameter, using a list of files. You can use the itemSeparator key in inputBinding to separate the inputs with commas, as per the longreadsum readme. This means specifying each file, which might be cumbersome, but it also means you can specify files that aren’t necessarily in the same folder. You can use the [] syntax to specify an array:
inputs:
  ...
  input_file_list:
    label: List of input files
    type: File[]
    inputBinding:
      prefix: "--inputs"
      itemSeparator: ","  # join all paths into one comma-separated argument
  2. Use the arguments section to combine the pattern_folder and pattern_input parameters. This changes the behaviour of the tool a little, as pattern_input would then hold only the file-name pattern (e.g. *.fastq.gz), and the expression stitches the two together.

i.e.

arguments:
  - prefix: -P
    valueFrom: |
      ${
        // Join the staged directory path and the pattern; emit nothing if either is missing.
        if (inputs.pattern_folder != null && inputs.pattern_input != null) {
          return inputs.pattern_folder.path + "/" + inputs.pattern_input;
        } else {
          return null;
        }
      }

And then remove the inputBinding from the pattern_folder and pattern_input parameters.

The inputs section would then look something like this:

inputs:
  ...
  pattern_folder:
    type: Directory?
    label: pattern directory
    doc: Directory containing the input files.
  pattern_input:
    type: string?
    label: pattern input
    doc: File-name pattern (e.g. *.fastq.gz) matched inside pattern_folder. Do not add quotes to the value.

Personally I think the first one is cleaner and more flexible.
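To compare the two options, here is a small Python sketch of the argument lists I would expect each to produce. It assumes the prefix and its value are emitted as separate arguments, and the function names and the staged path are made up for the example.

```python
def inputs_argument(paths, prefix="--inputs", item_separator=","):
    """Option 1: one prefix followed by all paths joined by the separator."""
    return [prefix, item_separator.join(paths)]


def pattern_argument(directory_path, pattern, prefix="-P"):
    """Option 2: join directory path and pattern; emit nothing if either is missing."""
    if directory_path is None or pattern is None:
        return []
    return [prefix, directory_path.rstrip("/") + "/" + pattern]


# Hypothetical staged location inside the container.
staged = "/var/lib/cwl/stg1234/input"

print(inputs_argument([staged + "/foo.fastq.gz", staged + "/bar.fastq.gz"]))
print(pattern_argument(staged, "*.fastq.gz"))
print(pattern_argument(None, "*.fastq.gz"))  # []
```

Option 1 names every file explicitly, so the files can live anywhere. Option 2 relies on the tool expanding the pattern itself inside the mounted directory.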

Thank you for the detailed and informative response! You have put me on the right track to finding a solution.
Full disclosure: this is the first tool CWL I have written, and I wanted to see if I could implement all of the tool’s features for my own understanding (and to make it as modular as possible). So even though I agree the pattern flag is not particularly flexible, and I had already implemented a working flag for multiple inputs (which I don’t even need, since the workflow I’ve built runs on a per-sample basis), I still wanted to try to implement it.
It turns out that indeed no extra literal quotes are required for the tool’s internal pattern expansion.
The solution I found was to return an array in the arguments block, since passing the value as a string still caused the same double wrapping of the input:

arguments:
  - prefix: -P
    valueFrom: |
      ${
        if (inputs.pattern_folder != null && inputs.pattern_input != null) {
          return [inputs.pattern_folder.path + '/' + inputs.pattern_input];
        } else {
          return null
        }
      }

Thank you again for your responses. As a side note, I was wondering whether there are open repositories of existing CWL tools to facilitate reuse (I am aware of bio-cwl-tools).
