Hi, I am trying to formalize the JSON configuration taken as input by the biobb command line tools. At the moment these are given as strings, which is OK, but it means the values are always hardcoded and can’t come from the workflow:
step1_pdb_config: '{"pdb_code" : "1aki"}'
step4_editconf_config: '{"box_type": "cubic","distance_to_molecule": 1.0}'
step6_gppion_config: '{"mdp": {"type":"minimization"}}'
step7_genion_config: '{"neutral": "True"}'
step8_gppmin_config: '{"mdp": {"type":"minimization", "nsteps":"5000", "emtol":"500"}}'
step11_gppnvt_config: '{"mdp": {"type":"nvt", "nsteps":"5000", "dt":0.002, "define":"-DPOSRES"}}'
From a CWL point of view it is also a black box for users what fields this JSON expects, unless they look up the biobb documentation, e.g. the properties of grompp.
In theory I thought this would be a great example of using records, especially as these properties have an underlying JSON schema from which we could then auto-generate the CWL record declarations.
However I don’t want to commit to this JSON being embedded in the job file, so I want to retain the option of passing it as a file, similar to --config with the biobb command line.
So here is the idea (using echo instead of grompp for now):
#!/usr/bin/env cwl-runner
cwlVersion: v1.1
class: CommandLineTool
baseCommand: echo
dockerPull: quay.io/biocontainers/biobb_md:0.1.5--py_0
InlineJavascriptRequirement: {}
- entryname: "grompp-config.json"
entry: $(inputs.config_rec || {}) # empty JSON by default
doc: ""
- "null"
- $import: grompp-config.cwl # GromppConfig record
#prefix: --config
# "grompp-config.json" from InitialWorkDirRequirement
valueFrom: "grompp-config.json"
# This does not actually work because the inner "default" fields
# are not allowed, but cwltool fills in lots of nulls for the optionals:
- "null"
- string
- File
prefix: --config
type: stdout
Thus there can be either a --config input to the CWL tool that works the same as on the existing command line, or a config_rec key (probably from the YAML job file) that has nested YAML elements without any tricky '["escaping"]'. (I could not make these two --config inputs mutually exclusive as in http://www.commonwl.org/user_guide/11-records/ without adding yet another level of nesting, which I thought was too cumbersome.)
I don’t know of a way to make the InitialWorkDirRequirement file creation optional depending on whether the input is specified or not; is there another way? It’s not a big problem, as the empty file is simply ignored otherwise; in any case I added || {} to make it valid JSON.
Is this stringifying of the $(inputs.config_rec) record to JSON expected to work across CWL implementations? I found no documentation for this behaviour, but it is exactly what I need. Using $("" + inputs.config_rec) did not work as it gave [object Object].
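The difference between the two expressions can be reproduced in plain JavaScript outside of CWL. This is only a sketch: configRec is a made-up stand-in for inputs.config_rec, toJsonText is a hypothetical helper name, and I am assuming (not confirmed) that a runner writing an object-valued entry does something equivalent to JSON.stringify.

```javascript
// Hypothetical stand-in for inputs.config_rec as seen by a CWL expression.
var configRec = { mdp: { type: "nvt", nsteps: "5000" }, maxwarn: 100 };

// String concatenation calls Object.prototype.toString, which discards the content.
var concatenated = "" + configRec;

// JSON.stringify produces the JSON text explicitly.
var serialized = JSON.stringify(configRec);

// Hypothetical helper: fall back to an empty object when the input is
// null or undefined, mirroring the `|| {}` in the tool description.
function toJsonText(rec) {
  return JSON.stringify(rec || {});
}

console.log(concatenated);
console.log(serialized);
console.log(toJsonText(null));
```

If the implicit serialization turns out not to be portable, writing the entry as $(JSON.stringify(inputs.config_rec || {})) might be a more explicit alternative, although I have not verified that across implementations.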
Now then the problem was how to make the record definition. I managed to get as far as this:
#cwlVersion: v1.0
type: record
name: GromppConfig
doc: "JSON configuration for invoking Grompp building block"
type: string?
doc: "Path of the input MDP file."
doc: "MDP options specification. (Used if *input_mdp_path* is null)"
s:url: "http://manual.gromacs.org/2020-current/user-guide/mdp-options.html"
type: record
name: MDPOptions
type: string?
doc: "directories to include in your topology. Format: -I/home/john/mylib -I../otherlib"
type: string?
doc: |-
defines to pass to the preprocessor, default is no defines.
You can use any defines to control options in your
customized topology files. Options that act on
existing top file mechanisms include:
-DFLEXIBLE will use flexible water instead of rigid
water into your topology, this can be useful for normal mode analysis.
-DPOSRES will trigger the inclusion of posre.itp into
your topology, used for implementing position restraints.
type: enum
name: GrompIntegrator
symbols: [md, md-vv, md-vv-avek, sd, bd, steep, cg, l-bfgs, nm, tpi, tpic, mimic]
doc: |-
Despite the name, this list includes algorithms that are not
actually integrators over time. integrator=steep and all
entries following it are in this category
## TODO: Document all of the fields of mdp file
## http://manual.gromacs.org/2020-current/user-guide/mdp-options.html
## but as they are expected by biobb in JSON variant:
doc: "Default options for the mdp file. Valid values: minimization, nvt, npt, free, index"
#default: "minimization"
- "null"
- type: enum
symbols: [minimization, nvt, npt, free, index]
type: string?
#default: "grompp.mdp"
doc: "Path of the output MDP file."
type: string?
#default: "grompp.top"
doc: "Path the output topology TOP file."
type: int?
#default: 10
doc: "Maximum number of allowed warnings."
type: string?
#default: "gmx"
doc: "Path to the GROMACS executable binary"
type: boolean?
#default: true
doc: "[WF property] Remove temporal files."
type: boolean?
#default: false
doc: "[WF property] Do not execute if output files exist."
type: string?
doc: "Path to the binary executable of your container."
type: string?
#default: "gromacs/gromacs:latest"
doc: "Container Image identifier to execute gromacs from"
type: string?
#default: "/data"
doc: "Path to an internal directory in the container."
type: string?
doc: "Path to the internal CWD in the container."
type: string?
doc: "User number id to be mapped inside the container."
type: string?
#default: "/bin/bash"
doc: "Path to the binary executable of the container shell."
s:url: "https://biobb-md.readthedocs.io/en/latest/gromacs.html#module-gromacs.grompp"
s: http://schema.org/
- http://schema.org/version/latest/schema.rdf
However, as you can see, I had to comment out default as it was not allowed by cwltool 3.0.20200706173533. Strangely, doc was not permitted either, although it is listed on https://www.commonwl.org/v1.1/Workflow.html#InputRecordSchema, so I used label instead - another cwltool
The sad thing is that the optional string? and boolean? etc. in here do not work, because I get them all filled in with null.
Running with a partial grompptest.yaml:
input_mdp_path: "hello"
integrator: steep
type: nvt
output_mdp_path: soup
maxwarn: 100
restart: true
I then get an output JSON that includes all the other keys that are optional, not just the keys I gave above:
{"container_image": null, "container_path": null, "container_shell_path": null, "container_user_id": null, "container_volume_path": null, "container_working_dir": null, "gmx_path": null, "input_mdp_path": "hello", "maxwarn": 100, "mdp": {"define": null, "include": null, "integrator": "steep"}, "output_mdp_path": "soup", "output_top_path": null, "remove_tmp": null, "restart": true, "type": "nvt"}
Now this is a problem because, as you can see in the default comments, lots of these keys have other default values, and the nulls will really confuse the underlying Python tool that is called.
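One workaround I can imagine is dropping the null-valued keys before the record is written out, so that the tool falls back to its own defaults. Below is a minimal sketch in ES5-style JavaScript; stripNulls is a hypothetical helper (not something cwltool provides), and I have only reasoned about it in plain JavaScript, not inside a CWL runner.

```javascript
// Hypothetical helper: recursively drop object keys whose value is null.
// Null items inside arrays are left in place so array positions do not shift.
function stripNulls(value) {
  if (Array.isArray(value)) {
    return value.map(stripNulls);
  }
  if (value !== null && typeof value === "object") {
    var result = {};
    Object.keys(value).forEach(function (key) {
      if (value[key] !== null) {
        result[key] = stripNulls(value[key]);
      }
    });
    return result;
  }
  return value;
}

// A shortened version of the record that cwltool produced above.
var filled = {
  container_image: null,
  input_mdp_path: "hello",
  maxwarn: 100,
  mdp: { define: null, include: null, integrator: "steep" },
  output_top_path: null,
  restart: true,
  type: "nvt"
};

console.log(JSON.stringify(stripNulls(filled)));
```

Such a function could presumably live in the expressionLib of InlineJavascriptRequirement and be called as $(JSON.stringify(stripNulls(inputs.config_rec || {}))), but that wiring is an assumption on my part.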
The second problem is that I can’t allow arbitrary nested JSON for my inner MDPOptions - the options are very extensive upstream and change often, which means the CWL might quickly become out of date.
Right now I seem to be required to specify all of them, but several are of a type that is impossible or very difficult to express in Avro schemas.
How can I do wildcard fields within a record?
According to https://www.commonwl.org/v1.1/Workflow.html#InputRecordSchema, fields is optional, but as raised in cwltool #608, cwltool requires it.
Is it too ambitious? Is there a better way of using the existing JSON schemas from CWL for a record?
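To make the auto-generation idea concrete, here is a rough sketch of how a JSON schema could be turned into a CWL record declaration. Everything about the schema layout is an assumption (I am assuming plain properties / required / enum / description keywords, which may not match the real biobb schemas), schemaToRecord and CWL_TYPES are hypothetical names, and the fallback to Any for unmapped types is only a placeholder, not a tested answer to the wildcard question.

```javascript
// Assumed mapping from JSON Schema primitive type names to CWL type names.
var CWL_TYPES = { string: "string", integer: "int", number: "float", boolean: "boolean" };

// Hypothetical converter: builds a CWL record (fields in map form) from a
// JSON Schema object with "properties". Properties not listed in "required"
// become optional, i.e. a union with "null".
function schemaToRecord(name, schema) {
  var required = schema.required || [];
  var properties = schema.properties || {};
  var fields = {};
  Object.keys(properties).forEach(function (key) {
    var prop = properties[key];
    var type;
    if (prop.type === "object" && prop.properties) {
      type = schemaToRecord(key, prop);            // nested record
    } else if (prop.enum) {
      type = { type: "enum", symbols: prop.enum }; // fixed set of symbols
    } else {
      type = CWL_TYPES[prop.type] || "Any";        // placeholder fallback
    }
    var field = { type: required.indexOf(key) >= 0 ? type : ["null", type] };
    if (prop.description) {
      field.doc = prop.description;
    }
    fields[key] = field;
  });
  return { type: "record", name: name, fields: fields };
}

// Made-up schema fragment, loosely modelled on the properties above.
var exampleSchema = {
  properties: {
    maxwarn: { type: "integer", description: "Maximum number of allowed warnings." },
    type: { type: "string", enum: ["minimization", "nvt", "npt", "free", "index"] },
    mdp: { type: "object", properties: { define: { type: "string" } } }
  }
};

console.log(JSON.stringify(schemaToRecord("GromppConfig", exampleSchema), null, 2));
```

The output is JSON, which is also valid YAML, so it could in principle be saved directly as a file like grompp-config.cwl; whether cwltool accepts the result as-is is something I have not checked.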
These files are currently in our cwl-records
Tested with cwltool in this conda environment:
(test5) stain@biggie:~/src/biobb_example_workflow$ cwltool --version
/home/stain/miniconda3/envs/test5/bin/cwltool 3.0.20200706173533
(test5) stain@biggie:~/src/biobb_example_workflow$
