Scatter and collect stdout per scattered input

Hello,

So for my use case, I use stdout and stderr to capture container logs. I am looking into supporting scatter functionality in my compute platform.

So I have the following workflow:

{
	"cwlVersion": "v1.0",
	"class": "Workflow",
	"id": "61b0d5b7eb97fd6c8dbba48c",
	"requirements": {
		"ScatterFeatureRequirement": {}
	},
	"inputs": {
		"message": "string[]"
	},
	"outputs": {
		"echoStdOut": {
			"type": "File[]",
			"outputSource": ["echo/echoStdOut"]
		},
		"echoStdErr": {
			"type": "File[]",
			"outputSource": ["echo/echoStdErr"]
		}
	},
	"steps": {
		"echo": {
			"run": "/tmp/cwl/plugin:6148da6f08b2c40710890b09.cwl",
			"scatter": "message",
			"in": {
				"message": "message"
			},
			"out": ["echoStdOut", "echoStdErr"]
		}
	}
}

And my echo tool basically captures stdout and stderr:

{
	"cwlVersion": "v1.0",
	"$namespaces": {
		"CustomResourceRequirement": "https://polus.org"
	},
	"$schemas": ["https://schema.org/version/latest/schemaorg-current-https.rdf"],
	"id": "echo",
	"class": "CommandLineTool",
	"stdout": "echo.out",
	"stderr": "echo.out",
	"CustomResourceRequirement:gpu": "0",
	"requirements": {
		"DockerRequirement": {
			"dockerPull": "busybox"
		},
		"InlineJavascriptRequirement": {},
		"ResourceRequirement": {},
		"InitialWorkDirRequirement": {
			"listing": []
		}
	},
	"baseCommand": ["echo"],
	"inputs": {
		"message": {
			"type": "string",
			"inputBinding": {
				"prefix": "--message"
			}
		}
	},
	"outputs": {
		"echoStdOut": {
			"type": "stdout"
		},
		"echoStdErr": {
			"type": "stderr"
		}
	}
}

My main question is: how can I use scatter and still capture the logs of each scatter step? Currently, all invocations write to the same echo.out log file, so each one overwrites the last.

Welcome @kannon92 !

	"stdout": "echo.out",
	"stderr": "echo.out",

Try giving different file names here?

Like:

	"stdout": "echo.out",
	"stderr": "echo.err",

As far as dealing with filename conflicts from multiple steps and/or the scattering of a single step: if you are making your own executor, then you need to manage the results of each tool invocation. Typically this is done with a separate output folder per job (a single instance of a non-scattered step, or each permutation of a scattered step).
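The per-job output folder scheme described above can be sketched roughly like this (a minimal illustration with hypothetical names; this is not toil or cwltool API):

```python
import os

def job_outdir(base, step_id, scatter_index=None):
    """Return a unique output directory for one tool invocation.

    Hypothetical executor helper: non-scattered steps get
    <base>/<step_id>; scattered steps get one directory per
    permutation, <base>/<step_id>_<index>, so their stdout/stderr
    files can never collide.
    """
    name = step_id if scatter_index is None else "%s_%d" % (step_id, scatter_index)
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)
    return path
```

With a layout like that, every scatter job can safely write a file named echo.out because each job's working/output directory is distinct.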

Hey @mrc ,

I am using toil for my HPC environment. Is it possible to have each step write into a separate working directory? I thought there was a CLI option for toil that tells it where the cwd is.

Is it possible to take the scatter index and use it as the name of the stdout or stderr file in the tool definition? I guess I could pass it as an argument into the plugin so I can change the name of the log file.
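As far as I know, CWL does not expose the scatter index to the tool, but the stdout and stderr fields accept parameter references, so each scatter job can derive its log file names from its own input. A sketch for the echo tool above, assuming the message values are distinct and filename-safe:

	"stdout": "$(inputs.message).out",
	"stderr": "$(inputs.message).err",

Since scatter runs the tool once per element of the message array, each invocation then gets its own pair of log files.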

So I did some more investigating. If I omit the file name, cwltool generates a random UUID as the log file name; it also detects conflicts between invocations and writes separate files (using that UUID as a base) for each tool invocation.

The toil runner has a bug where it overwrites the existing log file on every tool invocation.

I created an issue in the toil github repo for this.


For those who found this via Google: I submitted a PR that fixes this issue in toil.
