Ask CWL to rename a secondary file

Essentially, I want to hide the fact that some tools only accept ^.bai (looking at GATK) and make them appear as if they accept .bam.bai (with the corresponding .bai secondary syntax), which we want to be the standard for our pipelines.

Basically, I tried to use an outputEval, but the order of operations makes this impossible.

I’m happy to use a CWL expression specifically for the secondaryFile to do this, but I don’t quite understand what I should return if:

  • The primary file (correctly globbed) is called myfile.bam
  • A file called myfile.bai is present in the output directory
  • The returned secondary file should be named myfile.bam.bai

This would ensure it’s correctly localised in future steps. But maybe I don’t quite understand how secondary files are passed around in CWL, especially in CWLTool.

This topic stems from the questions I started here.

I’ll write out my thought process.

Picking up outputs

I can get the secondary files to be picked up with an expression on the secondaryFiles field on the output.

  out:
    type: File
    outputBinding:
      glob: $(inputs.bam.basename)
    secondaryFiles: |
      ${
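          // Resolve a secondaryFiles-style pattern against a base name:
          // each leading "^" strips one extension before the suffix is appended.
          function resolveSecondary(base, secPattern) {
            if (secPattern[0] == "^") {
              var spl = base.split(".");
              // Drop the last dot-separated part, but always keep at least the first.
              var endIndex = spl.length > 1 ? spl.length - 1 : 1;
              return resolveSecondary(spl.slice(undefined, endIndex).join("."), secPattern.slice(1));
            }
            return base + secPattern
          }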

          return [{
            path: resolveSecondary(self.path, "^.bai"),
            basename: resolveSecondary(self.basename, ".bai"),
          }]
      }
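
To make the two patterns concrete, the helper can be run outside CWL. This is a standalone sketch of the same resolveSecondary function (using slice(0, …), which behaves the same as slice(undefined, …)); the myfile.tar.gz name and the ^^.idx pattern are made up for illustration.

```javascript
// Standalone copy of the helper from the secondaryFiles expression.
// Each leading "^" in the pattern strips one extension from the base name.
function resolveSecondary(base, secPattern) {
  if (secPattern[0] === "^") {
    var parts = base.split(".");
    // Drop the last dot-separated part, but always keep at least the first.
    var endIndex = parts.length > 1 ? parts.length - 1 : 1;
    return resolveSecondary(parts.slice(0, endIndex).join("."), secPattern.slice(1));
  }
  return base + secPattern;
}

var shortIndex = resolveSecondary("myfile.bam", "^.bai");      // "myfile.bai"
var longIndex = resolveSecondary("myfile.bam", ".bai");        // "myfile.bam.bai"
var doubleCaret = resolveSecondary("myfile.tar.gz", "^^.idx"); // "myfile.idx" (hypothetical name)

console.log(shortIndex, longIndex, doubleCaret);
```

So the expression points path at the file that exists on disk (myfile.bai) while advertising the basename that downstream steps expect (myfile.bam.bai).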

However, the expression doesn’t actually rename the file (except on a completed workflow that gets exported); it only references it, e.g.:

  • TaskA generates the files “file.bam” and “file.bai”.
  • TaskB comes along wanting the file “file.bam.bai”.
  • TaskB fails with the error "Missing required secondary file '$name' from file object", even though the file seems to exist in the secondaryFiles array (it depends on what it matches on, I guess).

Error:

cwltool.errors.WorkflowException: Missing required secondary file 'generated-7999e776-212c-11ea-a264-acde48001122.bam.bai' from file object: {
    "location": "file:///private/tmp/docker_tmpv8y7n43v/generated-7999e776-212c-11ea-a264-acde48001122.bam",
    "basename": "generated-7999e776-212c-11ea-a264-acde48001122.bam",
    "nameroot": "generated-7999e776-212c-11ea-a264-acde48001122",
    "nameext": ".bam",
    "class": "File",
    "checksum": "sha1$a1a12417d413ab4847188d6eee7f6e675450656d",
    "size": 2998029,
    "secondaryFiles": [
        {
            "basename": "generated-7999e776-212c-11ea-a264-acde48001122.bam.bai",
            "location": "file:///private/tmp/docker_tmpv8y7n43v/generated-7999e776-212c-11ea-a264-acde48001122.bai",
            "class": "File",
            "nameroot": "generated-7999e776-212c-11ea-a264-acde48001122.bam",
            "nameext": ".bai",
            "checksum": "sha1$e5e276cfbd0a1cfe7828ec744b02de3d8ee78f88",
            "size": 1472592,
            "http://commonwl.org/cwltool#generation": 0
        }
    ],
    "http://commonwl.org/cwltool#generation": 0
}
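
The object in the error carries the new name only in basename; its location still points at the .bai file on disk. The sketch below is not cwltool's matching logic (I don't know exactly what it compares), only a check of the two fields from the object above. It shows that a comparison on basename would succeed while a comparison on the file name in location would not.

```javascript
// Trimmed copy of the secondary File object from the error message.
var secondary = {
  basename: "generated-7999e776-212c-11ea-a264-acde48001122.bam.bai",
  location: "file:///private/tmp/docker_tmpv8y7n43v/generated-7999e776-212c-11ea-a264-acde48001122.bai"
};
var wanted = "generated-7999e776-212c-11ea-a264-acde48001122.bam.bai";

// Last path component of a location URI.
function locationName(location) {
  return location.slice(location.lastIndexOf("/") + 1);
}

var matchesBasename = secondary.basename === wanted;               // true
var matchesLocation = locationName(secondary.location) === wanted; // false

console.log("basename match:", matchesBasename, "location match:", matchesLocation);
```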

Picking up inputs

I’m also having trouble doing something similar on the CommandInputParameter, when I place a very similar expression block in the secondaryFiles field on the input (using location instead of path):

    type: File
    secondaryFiles: |
      ${
          function resolveSecondary(base, secPattern) {
            if (secPattern[0] == "^") {
              var spl = base.split(".");
              var endIndex = spl.length > 1 ? spl.length - 1 : 1;
              return resolveSecondary(spl.slice(undefined, endIndex).join("."), secPattern.slice(1));
            }
            return base + secPattern
          }
          // return resolveSecondary(self.basename, "^.bai")
          return [{
            location: resolveSecondary(self.location, ".bai"),
            basename: resolveSecondary(self.basename, "^.bai"),
          }]
      }

But I get a CWLTool error:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.7/site-packages/cwltool/executors.py", line 169, in run_jobs
    for job in jobiter:
  File "/anaconda3/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 430, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/anaconda3/lib/python3.7/site-packages/cwltool/process.py", line 747, in _init_job
    discover_secondaryFiles=getdefault(runtime_context.toplevel, False)))
  File "/anaconda3/lib/python3.7/site-packages/cwltool/builder.py", line 276, in bind_input
    bindings.extend(self.bind_input(f, datum[f["name"]], lead_pos=lead_pos, tail_pos=f["name"], discover_secondaryFiles=discover_secondaryFiles))
  File "/anaconda3/lib/python3.7/site-packages/cwltool/builder.py", line 332, in bind_input
    sf_location = datum["location"][0:datum["location"].rindex("/")+1]+sfname
TypeError: can only concatenate str (not "dict") to str

This seems to go against the spec, which says:

The expression must return:

  • a filename string relative to the path to the primary File,
  • a File or Directory object with either path or location and basename fields set,
  • or an array consisting of strings or File or Directory objects.
    It is legal to reference an unchanged File or Directory object taken from input as a secondaryFile.

Potential solution

I made a small modification to CWLTool to pull the location and file name from this File object, i.e. in builder.py:

  • Remove L332.
  • Insert the following stub at L325 to get the correct basename (instead of assuming it’s a string).
if isinstance(sfname, string_types):
    sf_location = datum["location"][0:datum["location"].rindex("/")+1]+sfname
else:
    sf_location = sfname["location"]
    sfname = sfname["basename"]

And this seems to solve the problems:

  • Returning a File object in the secondary expression
  • Matching the correct basename for follow up steps
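
For readers who think in expression terms, here is a hypothetical JavaScript rendering of the same branch. It is only a sketch of the logic in the Python stub, not code from CWLTool; the function name resolveSecondaryEntry and the file:///data/ locations are made up for illustration.

```javascript
// Hypothetical port of the patched branch; the actual change is the Python stub.
// primary: the primary File object (needs a location).
// sf: what the secondaryFiles expression returned for one entry,
//     either a file name string relative to the primary or a File object.
function resolveSecondaryEntry(primary, sf) {
  if (typeof sf === "string") {
    // Original behaviour: the name is resolved next to the primary file.
    var dir = primary.location.slice(0, primary.location.lastIndexOf("/") + 1);
    return { location: dir + sf, basename: sf };
  }
  // Patched behaviour: trust the location and basename of the returned File object.
  return { location: sf.location, basename: sf.basename };
}

// Made-up example values.
var primary = { location: "file:///data/myfile.bam", basename: "myfile.bam" };

var fromString = resolveSecondaryEntry(primary, "myfile.bam.bai");
var fromObject = resolveSecondaryEntry(primary, {
  location: "file:///data/myfile.bam.bai",
  basename: "myfile.bai"
});

console.log(fromString, fromObject);
```

The second call is the input case from the expression above: the file on disk is myfile.bam.bai, but the tool sees it as myfile.bai.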

I’ll page @mr-c to see if this change might go against the spec.

Did you try renaming the .bam.bai to ^.bai using the InitialWorkDirRequirement?

Hey @mrc, it took me a while to test this, but I can confirm that InitialWorkDirRequirement can be used to rename the secondary file. However, it also localises the file into the current execution directory (which is not something I always want to happen).

I’d be happy for that to be a recommendation (and happy to write one for the Misc page: https://www.commonwl.org/user_guide/misc/). I think both my proposal and the change within CWLTool are valid within the spec.
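
As a rough illustration of the InitialWorkDirRequirement route, the function below builds the kind of value a listing expression might return. It is a sketch under one assumption: that the runner stages each listed File under its basename, so changing basename renames the staged copy. The function name stageWithShortIndex and the file:///data/ locations are made up.

```javascript
// Sketch of a value for an InitialWorkDirRequirement `listing` expression.
// Assumption (not confirmed here): each listed File is staged under its `basename`.
function stageWithShortIndex(bam) {
  var staged = [bam];
  (bam.secondaryFiles || []).forEach(function (sf) {
    if (sf.basename === bam.basename + ".bai") {
      // Same file on disk, exposed to the tool as <nameroot>.bai.
      staged.push({ class: "File", location: sf.location, basename: bam.nameroot + ".bai" });
    }
  });
  return staged;
}

// Made-up input File object in the .bam.bai convention.
var bam = {
  class: "File",
  location: "file:///data/myfile.bam",
  basename: "myfile.bam",
  nameroot: "myfile",
  secondaryFiles: [
    { class: "File", location: "file:///data/myfile.bam.bai", basename: "myfile.bam.bai" }
  ]
};

var listing = stageWithShortIndex(bam);
console.log(listing.map(function (f) { return f.basename; }));
```

The trade-off is the one already mentioned: everything in the listing is localised into the execution directory.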

Edit:

Hi @illusional

I think there are two different questions here.

  1. how to use an expression to rename a secondaryFile of an input file
  2. how to use an expression to rename a secondaryFile of an output file

For (1), I need to think about this a little bit more, because the intent of the code in builder.py is primarily to validate the input file objects, not modify them. It might be reasonable to accept a modified object, but we need to think through the behavior and write conformance tests. The fact that doing this crashes cwltool shows that this case wasn’t thought through (underspecified behavior), so we have some flexibility to go back and fix it to do something reasonable.

For (2), as @mrc said, you can already use glob to match all the files you need and then construct the desired object with outputEval. I agree that in that case it would have been more convenient for secondaryFiles to be evaluated first. I don’t really have a rationale for the current behavior, but it is well specified, so changing it could break compatibility with existing code; that’s a very high bar.
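
A minimal sketch of what such an outputEval could compute, written as a plain function so it runs outside CWL. It assumes the glob matches both the .bam and the .bai, so that self is an array of File objects; the function name attachRenamedIndex and the example locations are hypothetical.

```javascript
// Hypothetical outputEval logic: in CWL this would be called with `self`.
// globbed: array of File objects matched by a glob covering both the .bam and the .bai.
function attachRenamedIndex(globbed) {
  function endsWith(name, ext) {
    return name.slice(-ext.length) === ext;
  }
  var bam = globbed.filter(function (f) { return endsWith(f.basename, ".bam"); })[0];
  var bai = globbed.filter(function (f) { return endsWith(f.basename, ".bai"); })[0];
  if (!bam || !bai) {
    return null; // nothing to attach; let the caller decide how to handle it
  }
  return {
    class: "File",
    location: bam.location,
    basename: bam.basename,
    // The index keeps its location on disk but is exposed as <bam>.bai.
    secondaryFiles: [
      { class: "File", location: bai.location, basename: bam.basename + ".bai" }
    ]
  };
}

// Made-up glob result.
var globbed = [
  { class: "File", location: "file:///out/file.bai", basename: "file.bai" },
  { class: "File", location: "file:///out/file.bam", basename: "file.bam" }
];

var result = attachRenamedIndex(globbed);
console.log(JSON.stringify(result, null, 2));
```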

Hi @tetron / @mrc,

You’re correct that it is really two different questions, input and output. To address this, I’ve attached two different conformance tests to my original PR.

  1. I can understand the intention. It’s what has been recommended to me in the past, and the spec does state that you can return a File object:

    CommandInputParameter.secondaryFiles

    The expression must return […] a File or Directory object with either path or location and basename fields set, or an array consisting of strings or File or Directory objects.

  2. This is already possible with the current setup. I did say I was concerned that follow-up steps would not pick up the secondary file correctly, but this turned out to be false.

    I created a small workflow that collected and renamed the secondaryFile, connected it to a tool that was expecting the correct secondary format and it worked.

Edit: Or, as you suggested, I give up and construct my secondary files in the output. I’m less keen on this, given that it should be part of the spec and I now have conformance tests.