Idn2 not working when used as baseCommand

Hi

I am attempting to use idn2 to encode utf-8 domain names as IDNA2008:

$ idn2 é.fr
xn--9ca.fr

When used as a CommandLineTool, the tool fails on UTF-8 strings (ASCII input such as e.fr works):

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: idn2
arguments: ["é.fr"]
inputs: []
outputs: []

INFO [job test.cwl] /tmp/w2iwrcpk$ idn2 \
    é.fr
idn2: toAscii: could not convert string to UTF-8
WARNING [job test.cwl] exited with status: 1

Wrapping idn2 with sh works fine for some reason:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: sh
arguments: ["-c", "idn2", "é.fr"]
inputs: []
outputs: []

Using it with a container (using DockerRequirement) also works fine.

This behavior is unexpected. Am I missing something? I use cwltool. Should I open an issue?

Thanks,
Louis

Hello and welcome, Louis! Thank you for asking your great question.

Looking at the idn2 manual online (or the manual page via man idn2) we learn that

All strings are expected to be encoded in the preferred charset used by your locale

https://www.gnu.org/software/libidn/libidn2/manual/libidn2.html#Invoking-idn2

In POSIX/Linux systems, the locale is communicated via environment variables that start with LC_, like LC_ALL: https://en.wikipedia.org/wiki/Locale_(computer_software)#POSIX_platforms

For CWL, there are very few environment variables set by default: https://www.commonwl.org/v1.2/CommandLineTool.html#Runtime_environment
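The effect of such a sparse environment can be approximated outside CWL with env -i, which starts a command with an empty environment. The sketch below is only an illustration: show_lc_all is a hypothetical helper name, and the commented idn2 lines assume idn2 is installed and behaves as it does in this thread.

```shell
# Run a child shell with an empty environment plus any VAR=value pairs given,
# roughly imitating the sparse environment a CWL runner provides.
# show_lc_all is a hypothetical helper name used only for this illustration.
show_lc_all() {
  env -i PATH="$PATH" "$@" sh -c 'echo "${LC_ALL:-unset}"'
}

show_lc_all                  # prints: unset
show_lc_all LC_ALL=C.UTF-8   # prints: C.UTF-8

# Assumption: with idn2 installed, the same pattern should mirror the thread:
#   env -i PATH="$PATH" idn2 é.fr                  # expected to fail as in the job log
#   env -i PATH="$PATH" LC_ALL=C.UTF-8 idn2 é.fr   # expected to print xn--9ca.fr
```

The child process only sees a locale when the variable is passed in explicitly.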

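When idn2 is invoked through a shell, that shell may set some locale environment variables, which could explain why the sh -c idn2 version appeared to work. Note, though, that in sh -c idn2 é.fr the string é.fr becomes $0 of the shell rather than an argument to idn2, so that run may not have attempted the conversion at all.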

When idn2 is run using a software container in a CWL context, then the CWL runner will only override a few environment variables set by the container. So perhaps the container you used set LC_ALL or similar.

To run tools like idn2 (which need a locale set) from within CWL without wrapping them in a shell call, you can use EnvVarRequirement to set LC_ALL to match your input string encoding. See section 2.12, "Environment Variables", of the Common Workflow Language User Guide.

Here is your example using UTF-8 encoding:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
  EnvVarRequirement:
    envDef:
      LC_ALL: C.UTF-8
baseCommand: idn2
arguments: ["é.fr"]
inputs: []
outputs: []

And the result:

INFO [job discourse-1043.cwl] /tmp/tuyev6mw$ idn2 \
    é.fr
xn--9ca.fr
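One caveat: the C.UTF-8 locale may not be defined on every system, so it can be worth checking before relying on it. The following is a minimal sketch, assuming the standard locale -a command is available. has_locale is a hypothetical helper name, and the comparison ignores case and hyphens because some systems list the locale as C.utf8.

```shell
# Return success if the named locale appears in the output of `locale -a`.
# Names are compared lowercased and with hyphens removed, so "C.UTF-8"
# matches a listing of "C.utf8". has_locale is a hypothetical helper name.
has_locale() {
  want=$(printf '%s' "$1" | tr -d '-' | tr '[:upper:]' '[:lower:]')
  locale -a 2>/dev/null | tr -d '-' | tr '[:upper:]' '[:lower:]' | grep -qxF -- "$want"
}

if has_locale C.UTF-8; then
  echo "C.UTF-8 is available"
else
  echo "C.UTF-8 is not listed; choose another UTF-8 locale from: locale -a"
fi
```

If the check fails, another UTF-8 locale reported by locale -a can be used as the LC_ALL value instead.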

You can even make the encoding configurable:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
inputs:
  domain_name: string
  encoding:
    type: string
    default: C.UTF-8
requirements:
  EnvVarRequirement:
    envDef:
      LC_ALL: $(inputs.encoding)
baseCommand: idn2
arguments: [ $(inputs.domain_name) ]
outputs: []

Thank you very much for the detailed answer :smiley:
