Hi, I am trying to formalize the JSON configuration that the biobb command line tools take as input. At the moment we pass these as strings, which works, but it means the values are always hardcoded and can’t come from the workflow:
step1_pdb_config: '{"pdb_code" : "1aki"}'
step4_editconf_config: '{"box_type": "cubic","distance_to_molecule": 1.0}'
step6_gppion_config: '{"mdp": {"type":"minimization"}}'
step7_genion_config: '{"neutral": "True"}'
step8_gppmin_config: '{"mdp": {"type":"minimization", "nsteps":"5000", "emtol":"500"}}'
step11_gppnvt_config: '{"mdp": {"type":"nvt", "nsteps":"5000", "dt":0.002, "define":"-DPOSRES"}}'
From a CWL point of view it is also a black box for users which fields this JSON expects, unless they look up the biobb documentation, e.g. the properties of grompp.
In theory I thought this would be a great example of using records, especially as these properties have an underlying JSON schema from which we could then auto-generate the CWL record declarations.
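To make the auto-generation idea concrete, here is a minimal sketch in JavaScript of mapping a simplified JSON Schema onto a CWL record declaration. Everything in it is an assumption for illustration: the helper names `schemaToCwlType` and `schemaToRecord`, the miniature `exampleSchema` (not the real biobb schema), and the choice to fall back to `Any` for anything it does not understand. It does not carry over `default` values.

```javascript
// Hypothetical helpers: map a (simplified) JSON Schema object to a CWL record.
// Only string/integer/number/boolean/enum/object properties are handled.
var PRIMITIVES = { string: "string", integer: "int", number: "double", boolean: "boolean" };

function schemaToCwlType(name, prop) {
  if (prop.enum) {
    return { type: "enum", name: name + "_enum", symbols: prop.enum.slice() };
  }
  if (prop.type === "object") {
    return schemaToRecord(name, prop);
  }
  if (PRIMITIVES.hasOwnProperty(prop.type)) {
    return PRIMITIVES[prop.type];
  }
  return "Any"; // fallback for anything this sketch does not understand
}

function schemaToRecord(name, schema) {
  var required = schema.required || [];
  var fields = {};
  Object.keys(schema.properties || {}).forEach(function (key) {
    var prop = schema.properties[key];
    var cwlType = schemaToCwlType(key, prop);
    var field = {
      // Properties not listed in "required" become optional: ["null", T]
      type: required.indexOf(key) >= 0 ? cwlType : ["null", cwlType]
    };
    if (prop.description) { field.doc = prop.description; }
    fields[key] = field;
  });
  return { type: "record", name: name, fields: fields };
}

// Invented miniature schema, only loosely modelled on the grompp properties
var exampleSchema = {
  type: "object",
  properties: {
    maxwarn: { type: "integer", description: "Maximum number of allowed warnings." },
    restart: { type: "boolean" },
    type: { enum: ["minimization", "nvt", "npt", "free", "index"] }
  }
};

var record = schemaToRecord("GromppConfig", exampleSchema);
console.log(JSON.stringify(record, null, 2));
```

The printed JSON is also valid YAML, so in principle it could be saved as the file that the tool description imports.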
However I don’t want to commit to this JSON being embedded in the job file, so I want to retain the option of passing it as a file, similar to --config with the biobb command line.
So here is the idea (using echo instead of grompp for now):
#!/usr/bin/env cwl-runner
cwlVersion: v1.1
class: CommandLineTool
baseCommand: echo
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/biobb_md:0.1.5--py_0
requirements:
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - entryname: "grompp-config.json"
        entry: $(inputs.config_rec || {}) # empty JSON by default
inputs:
  config_rec:
    doc: ""
    type:
      - "null"
      - $import: grompp-config.cwl # GromppConfig record
    inputBinding:
      #prefix: --config
      # "grompp-config.json" from InitialWorkDirRequirement
      valueFrom: "grompp-config.json"
  # This does not actually work because the inner "default" fields
  # are not allowed, but cwltool fills in lots of nulls for the optionals:
  config:
    type:
      - "null"
      - string
      - File
    inputBinding:
      prefix: --config
outputs:
  concatination:
    type: stdout
Thus there can be either a --config to the CWL that works the same as for the existing command line, or a config_rec key (probably from the YAML job file) that has nested YAML elements without any tricky '["escaping"]'. (I could not make these two --config options exclusive as in http://www.commonwl.org/user_guide/11-records/ without adding yet another level of nesting, which I thought was too cumbersome.)
I don’t know of a way to make the InitialWorkDirRequirement file creation optional depending on whether the input is specified; is there another way? It’s not a big problem, as the empty file is simply ignored otherwise; I put || {} so that it is still valid JSON.
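One direction that might be worth trying is to make the whole `listing` an expression that returns either an empty array or a single entry. The function below is a sketch of what the body of such an expression could compute; the name `configListing` is hypothetical, and whether a given runner accepts Dirent-like objects (`entryname`/`entry`) returned from a `listing` expression is an assumption I have not verified across implementations.

```javascript
// Hypothetical body for something like:  listing: ${ return configListing(inputs); }
// Returns no entries when config_rec is absent, otherwise one JSON file entry.
function configListing(inputs) {
  if (inputs.config_rec === null || inputs.config_rec === undefined) {
    return []; // nothing is staged when no record was supplied
  }
  return [{
    entryname: "grompp-config.json",
    entry: JSON.stringify(inputs.config_rec)
  }];
}

console.log(JSON.stringify(configListing({ config_rec: null })));
console.log(JSON.stringify(configListing({ config_rec: { maxwarn: 100 } })));
```

With this approach the `valueFrom` on the `config_rec` binding would presumably also have to become conditional, so that the file name is not passed on the command line when no file was created.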
Is this stringifying of the $(inputs.config_rec) record to JSON expected to work across CWL implementations? I found no documentation for this behaviour, but it is exactly what I need. Using $("" + inputs.config_rec) did not work, as it gave [object Object].
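The difference between the two spellings is plain JavaScript behaviour rather than anything specific to CWL, as this small sketch shows (the `configRec` value is made up for illustration). Writing `JSON.stringify` explicitly in the expression may be a more portable choice than relying on the runner's implicit serialisation, although that is an assumption I have not checked against other implementations.

```javascript
// A stand-in for inputs.config_rec
var configRec = { maxwarn: 100, mdp: { integrator: "steep" } };

// String concatenation uses the default Object.prototype.toString
var coerced = "" + configRec;

// Explicit serialisation produces the JSON document
var serialised = JSON.stringify(configRec);

console.log(coerced);     // [object Object]
console.log(serialised);  // {"maxwarn":100,"mdp":{"integrator":"steep"}}
```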
Now then the problem was how to make the record definition. I managed to get as far as this:
#cwlVersion: v1.0
type: record
name: GromppConfig
doc: "JSON configuration for invoking Grompp building block"
fields:
  input_mdp_path:
    type: string?
    doc: "Path of the input MDP file."
  mdp:
    doc: "MDP options specification. (Used if *input_mdp_path* is null)"
    s:url: "http://manual.gromacs.org/2020-current/user-guide/mdp-options.html"
    type:
      type: record
      name: MDPOptions
      fields:
        include:
          type: string?
          doc: "directories to include in your topology. Format: -I/home/john/mylib -I../otherlib"
        define:
          type: string?
          doc: |-
            defines to pass to the preprocessor, default is no defines.
            You can use any defines to control options in your
            customized topology files. Options that act on
            existing top file mechanisms include:
            -DFLEXIBLE will use flexible water instead of rigid
            water into your topology, this can be useful for normal mode analysis.
            -DPOSRES will trigger the inclusion of posre.itp into
            your topology, used for implementing position restraints.
        integrator:
          type:
            type: enum
            name: GrompIntegrator
            symbols: [md, md-vv, md-vv-avek, sd, bd, steep, cg, l-bfgs, nm, tpi, tpic, mimic]
          doc: |-
            Despite the name, this list includes algorithms that are not
            actually integrators over time. integrator=steep and all
            entries following it are in this category
        #..
        ## TODO: Document all of the fields of mdp file
        ## http://manual.gromacs.org/2020-current/user-guide/mdp-options.html
        ## but as they are expected by biobb in JSON variant:
  type:
    doc: "Default options for the mdp file. Valid values: minimization, nvt, npt, free, index"
    #default: "minimization"
    type:
      - "null"
      - type: enum
        symbols: [minimization, nvt, npt, free, index]
  output_mdp_path:
    type: string?
    #default: "grompp.mdp"
    doc: "Path of the output MDP file."
  output_top_path:
    type: string?
    #default: "grompp.top"
    doc: "Path the output topology TOP file."
  maxwarn:
    type: int?
    #default: 10
    doc: "Maximum number of allowed warnings."
  gmx_path:
    type: string?
    #default: "gmx"
    doc: "Path to the GROMACS executable binary"
  remove_tmp:
    type: boolean?
    #default: true
    doc: "[WF property] Remove temporal files."
  restart:
    type: boolean?
    #default: false
    doc: "[WF property] Do not execute if output files exist."
  container_path:
    type: string?
    doc: "Path to the binary executable of your container."
  container_image:
    type: string?
    #default: "gromacs/gromacs:latest"
    doc: "Container Image identifier to execute gromacs from"
  container_volume_path:
    type: string?
    #default: "/data"
    doc: "Path to an internal directory in the container."
  container_working_dir:
    type: string?
    doc: "Path to the internal CWD in the container."
  container_user_id:
    type: string?
    doc: "User number id to be mapped inside the container."
  container_shell_path:
    type: string?
    #default: "/bin/bash"
    doc: "Path to the binary executable of the container shell."
s:url: "https://biobb-md.readthedocs.io/en/latest/gromacs.html#module-gromacs.grompp"
$namespaces:
  s: http://schema.org/
$schemas:
  - http://schema.org/version/latest/schema.rdf
However, as you can see, I had to comment out default, as it was not allowed by cwltool 3.0.20200706173533.
Strangely, doc was not permitted either, although it is listed on https://www.commonwl.org/v1.1/Workflow.html#InputRecordSchema, so I used label instead. Is this another cwltool bug?
The sad thing is that the optional string?, boolean? etc. in here do not work the way I hoped, because they all get filled in with null.
Running with a partial grompptest.yaml:
config_rec:
  input_mdp_path: "hello"
  mdp:
    integrator: steep
  type: nvt
  output_mdp_path: soup
  maxwarn: 100
  restart: true
I then get an output JSON that includes all the other keys that are optional, not just the keys I gave above:
{"container_image": null, "container_path": null, "container_shell_path": null, "container_user_id": null, "container_volume_path": null, "container_working_dir": null, "gmx_path": null, "input_mdp_path": "hello", "maxwarn": 100, "mdp": {"define": null, "include": null, "integrator": "steep"}, "output_mdp_path": "soup", "output_top_path": null, "remove_tmp": null, "restart": true, "type": "nvt"}
Now this is a problem because, as you can see in the commented-out default lines, many of these keys have other default values, and the explicit nulls will really confuse the underlying Python tool that is called.
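A possible workaround, sketched below, is to drop the null-valued keys again before the record is serialised, so that the Python tool falls back to its own defaults. The helper name `stripNulls` is hypothetical and the `filled` object is an abbreviated copy of the output above. I assume it could be loaded through `expressionLib` of InlineJavascriptRequirement and called as `$(JSON.stringify(stripNulls(inputs.config_rec || {})))`, but I have not confirmed that on other runners.

```javascript
// Hypothetical helper: recursively remove keys whose value is null/undefined.
// Written in ES5 style so it could be pasted into an expressionLib entry.
function stripNulls(value) {
  if (Array.isArray(value)) {
    return value.map(stripNulls); // array elements are kept, only cleaned inside
  }
  if (value !== null && typeof value === "object") {
    var out = {};
    Object.keys(value).forEach(function (key) {
      if (value[key] !== null && value[key] !== undefined) {
        out[key] = stripNulls(value[key]);
      }
    });
    return out;
  }
  return value; // strings, numbers, booleans pass through unchanged
}

// Abbreviated version of the record as filled in by cwltool
var filled = {
  container_image: null,
  input_mdp_path: "hello",
  maxwarn: 100,
  mdp: { define: null, include: null, integrator: "steep" },
  restart: true,
  type: "nvt"
};

console.log(JSON.stringify(stripNulls(filled)));
// {"input_mdp_path":"hello","maxwarn":100,"mdp":{"integrator":"steep"},"restart":true,"type":"nvt"}
```

The obvious limitation is that a user can then no longer pass an explicit null through to the tool, which seems acceptable here since null is not a meaningful value for these properties.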
The second problem is that I can’t allow arbitrary nested JSON for my inner MDPOptions record: these options are very extensive upstream and change often, which means the CWL could quickly become out of date.
Now I seem to be required to specify all of them, but several of them are of a type that is impossible or very difficult to express in Avro Schemas.
How can I do wildcard fields within a record?
https://www.commonwl.org/v1.1/Workflow.html#InputRecordSchema says fields is optional, but as raised in cwltool #608, fields is required by cwltool.
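Failing real wildcard fields, one compromise could be to keep the typed record for the well-known options and accept the rest as a free-form JSON string that is merged in by an expression. The sketch below assumes a hypothetical extra input (here called `mdp_extra`) and a hypothetical helper `mergeMdpExtras`; neither exists in the files above. Typed values win over the free-form ones, and null placeholders are treated as unset.

```javascript
// Hypothetical helper: overlay free-form MDP options (a JSON string, e.g. from
// an assumed "mdp_extra" input) onto the typed record without mutating it.
function mergeMdpExtras(configRec, mdpExtraJson) {
  var merged = JSON.parse(JSON.stringify(configRec || {})); // deep copy
  var extras = mdpExtraJson ? JSON.parse(mdpExtraJson) : {};
  merged.mdp = merged.mdp || {};
  Object.keys(extras).forEach(function (key) {
    // Only fill keys the typed record left unset (missing or null)
    if (merged.mdp[key] === null || merged.mdp[key] === undefined) {
      merged.mdp[key] = extras[key];
    }
  });
  return merged;
}

var result = mergeMdpExtras(
  { mdp: { integrator: "steep", define: null }, type: "nvt" },
  '{"nsteps": "5000", "define": "-DPOSRES", "integrator": "md"}'
);

console.log(JSON.stringify(result));
// {"mdp":{"integrator":"steep","define":"-DPOSRES","nsteps":"5000"},"type":"nvt"}
```

This gives up validation for the free-form part, but it means the CWL does not have to track every upstream MDP option.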
Is it too ambitious? Is there a better way of using the existing JSON schemas from CWL for a record?
These files are currently in our cwl-records branch.
Tested with cwltool in this conda environment:
(test5) stain@biggie:~/src/biobb_example_workflow$ cwltool --version
/home/stain/miniconda3/envs/test5/bin/cwltool 3.0.20200706173533
(test5) stain@biggie:~/src/biobb_example_workflow$
(test5) stain@biggie:~/src/biobb_example_workflow$ conda list -n test5
# packages in environment at /home/stain/miniconda3/envs/test5:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 0_gnu conda-forge
bagit 1.7.0 py_0 conda-forge
brotlipy 0.7.0 py38h1e0a361_1000 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
cachecontrol 0.11.7 py_0 conda-forge
cairo 1.16.0 h3fc0475_1005 conda-forge
certifi 2020.6.20 py38h32f6830_0 conda-forge
cffi 1.14.0 py38hd463f26_0 conda-forge
chardet 3.0.4 py38h32f6830_1006 conda-forge
coloredlogs 14.0 py38h32f6830_1 conda-forge
cryptography 2.9.2 py38h766eaa4_0 conda-forge
cwltool 3.0.20200706173533 py38h32f6830_0 conda-forge
decorator 4.4.2 py_0 conda-forge
expat 2.2.9 he1b5a44_2 conda-forge
fontconfig 2.13.1 h1056068_1002 conda-forge
freetype 2.10.2 he06d7ca_0 conda-forge
fribidi 1.0.9 h516909a_0 conda-forge
gettext 0.19.8.1 hc5be6a0_1002 conda-forge
glib 2.65.0 h6f030ca_0 conda-forge
graphite2 1.3.13 he1b5a44_1001 conda-forge
graphviz 2.42.3 h0511662_0 conda-forge
harfbuzz 2.4.0 hee91db6_5 conda-forge
html5lib 1.1 pyh9f0ad1d_0 conda-forge
humanfriendly 8.2 py38h32f6830_0 conda-forge
icu 67.1 he1b5a44_0 conda-forge
idna 2.10 pyh9f0ad1d_0 conda-forge
isodate 0.6.0 py_1 conda-forge
jpeg 9d h516909a_0 conda-forge
keepalive 0.5 py_1 conda-forge
ld_impl_linux-64 2.34 h53a641e_0 conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgomp 9.2.0 h24d8f2e_2 conda-forge
libiconv 1.15 h516909a_1006 conda-forge
libpng 1.6.37 hed695b0_1 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
libtiff 4.1.0 hc7e4089_6 conda-forge
libtool 2.4.6 h14c3975_1002 conda-forge
libuuid 2.32.1 h14c3975_1000 conda-forge
libwebp-base 1.1.0 h516909a_3 conda-forge
libxcb 1.13 h14c3975_1002 conda-forge
libxml2 2.9.10 h72b56ed_1 conda-forge
libxslt 1.1.33 h572872d_1 conda-forge
lockfile 0.12.2 py_1 conda-forge
lxml 4.5.1 py38hbb43d70_0 conda-forge
lz4-c 1.9.2 he1b5a44_1 conda-forge
mistune 0.8.4 py38h1e0a361_1001 conda-forge
mypy_extensions 0.4.3 py38h32f6830_1 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
networkx 2.4 py_1 conda-forge
openssl 1.1.1g h516909a_0 conda-forge
pango 1.42.4 h7062337_4 conda-forge
pcre 8.44 he1b5a44_0 conda-forge
pip 20.0.2 py_2 conda-forge
pixman 0.38.0 h516909a_1003 conda-forge
prov 1.5.1 py_1 conda-forge
psutil 5.7.0 py38h1e0a361_1 conda-forge
pthread-stubs 0.4 h14c3975_1001 conda-forge
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pydotplus 2.0.2 pyhd1c1de3_3 conda-forge
pyopenssl 19.1.0 py_1 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pysocks 1.7.1 py38h32f6830_1 conda-forge
python 3.8.2 h8356626_5_cpython conda-forge
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
rdflib 4.2.2 py38_1000 conda-forge
rdflib-jsonld 0.5.0 py38h32f6830_0 conda-forge
readline 8.0 hf8c457e_0 conda-forge
requests 2.24.0 pyh9f0ad1d_0 conda-forge
ruamel.yaml 0.16.5 py38h516909a_1 conda-forge
ruamel.yaml.clib 0.2.0 py38h1e0a361_1 conda-forge
schema-salad 7.0.20200612160654 py_1 conda-forge
setuptools 46.1.3 py38h32f6830_0 conda-forge
shellescape 3.4.1 py_1 bioconda
six 1.15.0 pyh9f0ad1d_0 conda-forge
sparqlwrapper 1.8.5 py38h32f6830_1003 conda-forge
sqlite 3.30.1 hcee41ef_0 conda-forge
tk 8.6.10 hed695b0_0 conda-forge
typing_extensions 3.7.4.2 py_0 conda-forge
urllib3 1.25.9 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.34.2 py_1 conda-forge
xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
xorg-libice 1.0.10 h516909a_0 conda-forge
xorg-libsm 1.2.3 h84519dc_1000 conda-forge
xorg-libx11 1.6.9 h516909a_0 conda-forge
xorg-libxau 1.0.9 h14c3975_0 conda-forge
xorg-libxdmcp 1.1.3 h516909a_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxpm 3.5.13 h516909a_0 conda-forge
xorg-libxrender 0.9.10 h516909a_1002 conda-forge
xorg-libxt 1.1.5 h516909a_1003 conda-forge
xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
xorg-xproto 7.0.31 h14c3975_1007 conda-forge
xz 5.2.5 h516909a_0 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
zstd 1.4.4 h6597ccf_3 conda-forge