CWL GATK GenotypeGVCFs error on Linux due to quotes in filenames

Hi, I noticed that every time I try to run .cwl scripts that include GATK GenotypeGVCFs, the runner encounters an error related to how the previous step names the directories in the GenomicsDB workspace (created by GATK GenomicsDBImport, also called through .cwl):

Invalid filename: ‘8$1$146364022’ contains illegal characters

Looking inside the GenomicsDB directory (the GenomicsDBImport output), each chromosome directory name appears wrapped in quotes, which seems to trigger the error above:

enrico@godzilla:/media/kong/enrico/MCD/cwl-run-DIR$ ls -thal MCD_n15/
total 136K
drwxrwsr-x  3 enrico lab 4.0K Dec  5 10:47  ..
drwx------  4 enrico lab 4.0K Dec  5 10:35 'X$1$155270560'
drwx------ 25 enrico lab 4.0K Dec  5 10:29  .
drwx------  4 enrico lab 4.0K Dec  5 10:29 '22$1$51304566'
drwx------  4 enrico lab 4.0K Dec  5 10:25 '21$1$48129895'
drwx------  4 enrico lab 4.0K Dec  5 10:22 '20$1$63025520'
drwx------  4 enrico lab 4.0K Dec  5 10:17 '19$1$59128983'
drwx------  4 enrico lab 4.0K Dec  5 10:10 '18$1$78077248'
drwx------  4 enrico lab 4.0K Dec  5 10:04 '17$1$81195210'
drwx------  4 enrico lab 4.0K Dec  5 09:56 '16$1$90354753'
drwx------  4 enrico lab 4.0K Dec  5 09:49 '15$1$102531392'
drwx------  4 enrico lab 4.0K Dec  5 09:42 '14$1$107349540'
drwx------  4 enrico lab 4.0K Dec  5 09:34 '13$1$115169878'
drwx------  4 enrico lab 4.0K Dec  5 09:28 '12$1$133851895'
drwx------  4 enrico lab 4.0K Dec  5 09:17 '11$1$135006516'
drwx------  4 enrico lab 4.0K Dec  5 09:06 '10$1$135534747'
drwx------  4 enrico lab 4.0K Dec  5 08:55 '9$1$141213431'
drwx------  4 enrico lab 4.0K Dec  5 08:45 '8$1$146364022'
drwx------  4 enrico lab 4.0K Dec  5 08:35 '7$1$159138663'
drwx------  4 enrico lab 4.0K Dec  5 08:22 '6$1$171115067'
drwx------  4 enrico lab 4.0K Dec  5 08:09 '5$1$180915260'
drwx------  4 enrico lab 4.0K Dec  5 07:56 '4$1$191154276'
drwx------  4 enrico lab 4.0K Dec  5 07:43 '3$1$198022430'
drwx------  4 enrico lab 4.0K Dec  5 07:28 '2$1$243199373'
drwx------  4 enrico lab 4.0K Dec  5 07:09 '1$1$249250621'
-rwx------  1 enrico lab 8.4K Dec  5 06:49  vidmap.json
-rwx------  1 enrico lab  18K Dec  5 06:49  vcfheader.vcf
-rwx------  1 enrico lab 1.4K Dec  5 06:49  callset.json
-rwx------  1 enrico lab    0 Dec  5 06:49  __tiledb_workspace.tdb

This happens every single time I produce a GenomicsDBImport output on my Ubuntu 18.04.5 Linux machine.

Has anybody worked around this? I know I can call GATK outside .cwl, but for pipeline purposes I’d like to be able to pass this DB through .cwl too.

Thank you very much in advance for any help! Below is my cwl script:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
label: gatk GenomicsDBImport on GATK docker images

hints:
  DockerRequirement:
    dockerPull: broadinstitute/gatk:latest
  ResourceRequirement:
    coresMin: $(inputs.GenomicsDBImport_coresMin)
    ramMin: $(inputs.GenomicsDBImport_ramMin)

requirements:
  InlineJavascriptRequirement: {}

baseCommand: gatk
arguments: [ "GenomicsDBImport" ]

inputs:
  - id: interval_list
    type: File
    inputBinding:
      position: 1
      prefix: '-L'
  - id: cohort_name
    type: string
    inputBinding:
      position: 2
      prefix: '--genomicsdb-workspace-path'
  - id: gvcf_files
    type:
      - type: array
        items: File
        inputBinding:
          position: 0
          prefix: '-V'
          separate: true
    secondaryFiles:
      - .tbi

outputs:
  GenomicsDBImport_directory:
    type: Directory
    outputBinding:
      glob: $(inputs.cohort_name)

Hi @cccnrc,

Assuming you are using cwltool, you should be able to suppress the “Invalid filename” error with --relax-path-checks.

In fact, there are no quotes in the file names. I think it is actually the $ character that is causing you problems. What you are seeing when you run ls is a display behavior: filenames with shell-sensitive characters are shown surrounded by quotes, but the quotes are not part of the names on disk. Try ls --quoting-style=literal and the quotes should go away.
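
If it helps, here is a small shell sketch of that check. It assumes GNU coreutils ls (version 8.25 or newer, where the shell-escape style exists) and works in a throwaway directory from mktemp rather than your real workspace; the variable names are only for illustration.

```shell
# Scratch directory holding one folder named like a GenomicsDB contig directory.
workdir=$(mktemp -d)
name='8$1$146364022'
mkdir "$workdir/$name"

# Literal style prints the name exactly as stored on disk.
literal_name=$(ls --quoting-style=literal "$workdir")

# Shell-escape style wraps names containing characters such as '$' in quotes,
# which is what ls shows by default on a terminal in recent coreutils.
quoted_name=$(ls --quoting-style=shell-escape "$workdir")

printf 'literal: %s\n' "$literal_name"
printf 'quoted:  %s\n' "$quoted_name"

# Clean up the scratch directory.
rm -r "$workdir"
```

The literal line shows the name with no quotes, so the quotes in your listing come from ls, not from GenomicsDBImport.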
