Ask CWL to rename a secondary file

Essentially, I want to hide the fact that some tools only accept ^.bai (looking at GATK) by making them appear as if they accept .bam.bai (with the corresponding .bai secondary file pattern), which we want to be the standard for our pipelines.

I tried to use an outputEval, but the order of operations makes this impossible.

I’m happy to use a CWL expression specifically for the secondaryFiles field to do this, but I don’t quite understand what I should return if:

  • The primary file (correctly globbed) is called myfile.bam
  • A file called myfile.bai is present in the output directory
  • The returned secondary file should be named myfile.bam.bai

This would ensure it’s correctly localised in future steps. But maybe I don’t quite understand how secondary files are passed around in CWL, especially in CWLTool.

This topic stems from the questions I started here.

I’ll write out my thought process.

Picking up outputs

I can get the secondary files to be picked up with an expression on the secondaryFiles field on the output.

  out:
    type: File
    outputBinding:
      glob: $(inputs.bam.basename)
    secondaryFiles: |
      ${
          // Apply a secondaryFiles pattern: each leading "^" strips one extension from base.
          function resolveSecondary(base, secPattern) {
            if (secPattern[0] == "^") {
              var spl = base.split(".");
              var endIndex = spl.length > 1 ? spl.length - 1 : 1;
              return resolveSecondary(spl.slice(0, endIndex).join("."), secPattern.slice(1));
            }
            return base + secPattern;
          }

          return [{
            path: resolveSecondary(self.path, "^.bai"),
            basename: resolveSecondary(self.basename, ".bai"),
          }]
      }
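
To make the pattern handling easier to check outside of a CWL run, here is a rough Python port of the resolveSecondary helper above. The name resolve_secondary is my own; it only mirrors the JavaScript in this post and is not cwltool's implementation.

```python
def resolve_secondary(base, pattern):
    """Mirror of the JavaScript helper: each leading "^" strips one extension."""
    while pattern.startswith("^"):
        parts = base.split(".")
        # Keep at least one component so a name without a dot is left unchanged.
        end = len(parts) - 1 if len(parts) > 1 else 1
        base = ".".join(parts[:end])
        pattern = pattern[1:]
    return base + pattern


# The two names from this post: the file on disk vs. the name we want to expose.
on_disk = resolve_secondary("myfile.bam", "^.bai")   # "myfile.bai"
exposed = resolve_secondary("myfile.bam", ".bai")    # "myfile.bam.bai"
```

Like the JavaScript version, this splits on every dot, so it only makes sense for a basename or for a path whose directories contain no dots.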

But this doesn’t actually rename the file (except for a completed workflow whose outputs get exported). It just references it, e.g.:

  • TaskA generates the files “file.bam” + “file.bai”.
  • TaskB comes along wanting the file “file.bam.bai”.
  • It fails with the error: "Missing required secondary file '$name' from file object", even though the file seems to exist in the secondaryFiles array (it depends what it matches on, I guess).

Error:

cwltool.errors.WorkflowException: Missing required secondary file 'generated-7999e776-212c-11ea-a264-acde48001122.bam.bai' from file object: {
    "location": "file:///private/tmp/docker_tmpv8y7n43v/generated-7999e776-212c-11ea-a264-acde48001122.bam",
    "basename": "generated-7999e776-212c-11ea-a264-acde48001122.bam",
    "nameroot": "generated-7999e776-212c-11ea-a264-acde48001122",
    "nameext": ".bam",
    "class": "File",
    "checksum": "sha1$a1a12417d413ab4847188d6eee7f6e675450656d",
    "size": 2998029,
    "secondaryFiles": [
        {
            "basename": "generated-7999e776-212c-11ea-a264-acde48001122.bam.bai",
            "location": "file:///private/tmp/docker_tmpv8y7n43v/generated-7999e776-212c-11ea-a264-acde48001122.bai",
            "class": "File",
            "nameroot": "generated-7999e776-212c-11ea-a264-acde48001122.bam",
            "nameext": ".bai",
            "checksum": "sha1$e5e276cfbd0a1cfe7828ec744b02de3d8ee78f88",
            "size": 1472592,
            "http://commonwl.org/cwltool#generation": 0
        }
    ],
    "http://commonwl.org/cwltool#generation": 0
}
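
As a diagnostic aid, here is a small Python sketch that lists the secondary files in a File object whose basename differs from the file name in their location, which is the "referenced but not renamed" situation shown in the error above. The function name and the shortened example object are my own, and this says nothing about how cwltool itself does the matching.

```python
import posixpath
from urllib.parse import urlparse


def mismatched_secondaries(file_obj):
    """Return (basename, name_in_location) pairs that disagree."""
    mismatches = []
    for sf in file_obj.get("secondaryFiles", []):
        # Last path component of the location URI, i.e. the name on disk.
        name_in_location = posixpath.basename(urlparse(sf["location"]).path)
        if sf.get("basename") != name_in_location:
            mismatches.append((sf.get("basename"), name_in_location))
    return mismatches


# Shortened, hypothetical version of the File object from the error message.
example = {
    "class": "File",
    "location": "file:///private/tmp/work/generated.bam",
    "basename": "generated.bam",
    "secondaryFiles": [
        {
            "class": "File",
            "basename": "generated.bam.bai",
            "location": "file:///private/tmp/work/generated.bai",
        }
    ],
}

found = mismatched_secondaries(example)
```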

Picking up inputs

I’m also having trouble doing something similar on the CommandInput when I place a very similar expression block in secondaryFiles on the input (using location instead of path):

    type: File
    secondaryFiles: |
      ${
          // Apply a secondaryFiles pattern: each leading "^" strips one extension from base.
          function resolveSecondary(base, secPattern) {
            if (secPattern[0] == "^") {
              var spl = base.split(".");
              var endIndex = spl.length > 1 ? spl.length - 1 : 1;
              return resolveSecondary(spl.slice(0, endIndex).join("."), secPattern.slice(1));
            }
            return base + secPattern;
          }
          // return resolveSecondary(self.basename, "^.bai")
          return [{
            location: resolveSecondary(self.location, ".bai"),
            basename: resolveSecondary(self.basename, "^.bai"),
          }]
      }

But, I get a CWLTool error:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.7/site-packages/cwltool/executors.py", line 169, in run_jobs
    for job in jobiter:
  File "/anaconda3/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 430, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/anaconda3/lib/python3.7/site-packages/cwltool/process.py", line 747, in _init_job
    discover_secondaryFiles=getdefault(runtime_context.toplevel, False)))
  File "/anaconda3/lib/python3.7/site-packages/cwltool/builder.py", line 276, in bind_input
    bindings.extend(self.bind_input(f, datum[f["name"]], lead_pos=lead_pos, tail_pos=f["name"], discover_secondaryFiles=discover_secondaryFiles))
  File "/anaconda3/lib/python3.7/site-packages/cwltool/builder.py", line 332, in bind_input
    sf_location = datum["location"][0:datum["location"].rindex("/")+1]+sfname
TypeError: can only concatenate str (not "dict") to str

This seems to go against the spec which says:

The expression must return:

  • a filename string relative to the path to the primary File,
  • a File or Directory object with either path or location and basename fields set,
  • or an array consisting of strings or File or Directory objects.
    It is legal to reference an unchanged File or Directory object taken from input as a secondaryFile.

Potential solution

I made a small modification to CWLTool to pull the location and basename from this File object, i.e. in
builder.py:

  • Remove L332,
  • Insert the following snippet at L325 to get the correct basename (instead of assuming it’s a string).
if isinstance(sfname, string_types):
    sf_location = datum["location"][0:datum["location"].rindex("/")+1]+sfname
else:
    sf_location = sfname["location"]
    sfname = sfname["basename"]
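
For anyone who wants to try the logic in isolation, here is a minimal standalone sketch of the branch above, wrapped in a function. The name secondary_location_and_name is hypothetical and not part of cwltool, and I’ve used the built-in str check here in place of string_types so it runs without extra imports.

```python
def secondary_location_and_name(datum, sfname):
    """Return (sf_location, sfname) for a string or a File-object result."""
    if isinstance(sfname, str):
        # String result: resolve it relative to the primary file's directory.
        directory = datum["location"][0:datum["location"].rindex("/") + 1]
        return directory + sfname, sfname
    # File-object result: take the location and basename it already carries.
    return sfname["location"], sfname["basename"]


primary = {"class": "File", "location": "file:///tmp/work/file.bam"}

# Expression returned a plain string.
from_string = secondary_location_and_name(primary, "file.bam.bai")

# Expression returned a File object that renames file.bam.bai to file.bai.
from_object = secondary_location_and_name(
    primary,
    {"location": "file:///tmp/work/file.bam.bai", "basename": "file.bai"},
)
```

The string case keeps the existing behaviour, while the File-object case lets the basename differ from the file name in the location.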

And this seems to solve the problems:

  • Returning a File object in the secondary expression
  • Matching the correct basename for follow up steps

I’ll page @mr-c to see if this change might go against the spec.

Did you try renaming the .bam.bai to ^.bai using the InitialWorkDirRequirement?

Hey @mr-c, it took me a while to test this, but I can confirm that the InitialWorkDirRequirement can be used to rename the secondary file. However, it also localises the file into the current execution directory (which is not something I always want to have happen).

I’d be happy for that to be a recommendation (and happy to write one for the Misc page: https://www.commonwl.org/user_guide/misc/). I still think that both my proposal and the corresponding change in CWLTool are valid within the spec.

Edit: