Scatter and collect stdout per scattered input

Hello,

So for my use case, I use stdout and stderr to capture container logs. I am looking into supporting scatter functionality in my compute platform.

So I have the following workflow:

{
	"cwlVersion": "v1.0",
	"class": "Workflow",
	"id": "61b0d5b7eb97fd6c8dbba48c",
	"requirements": {
		"ScatterFeatureRequirement": {}
	},
	"inputs": {
		"message": "string[]"
	},
	"outputs": {
		"echoStdOut": {
			"type": "File[]",
			"outputSource": ["echo/echoStdOut"]
		},
		"echoStdErr": {
			"type": "File[]",
			"outputSource": ["echo/echoStdErr"]
		}
	},
	"steps": {
		"echo": {
			"run": "/tmp/cwl/plugin:6148da6f08b2c40710890b09.cwl",
			"scatter": "message",
			"in": {
				"message": "message"
			},
			"out": ["echoStdOut", "echoStdErr"]
		}
	}
}

And my echo tool basically captures stdout and stderr:

{
	"cwlVersion": "v1.0",
	"$namespaces": {
		"CustomResourceRequirement": "https://polus.org"
	},
	"$schemas": ["https://schema.org/version/latest/schemaorg-current-https.rdf"],
	"id": "echo",
	"class": "CommandLineTool",
	"stdout": "echo.out",
	"stderr": "echo.out",
	"CustomResourceRequirement:gpu": "0",
	"requirements": {
		"DockerRequirement": {
			"dockerPull": "busybox"
		},
		"InlineJavascriptRequirement": {},
		"ResourceRequirement": {},
		"InitialWorkDirRequirement": {
			"listing": []
		}
	},
	"baseCommand": ["echo"],
	"inputs": {
		"message": {
			"type": "string",
			"inputBinding": {
				"prefix": "--message"
			}
		}
	},
	"outputs": {
		"echoStdOut": {
			"type": "stdout"
		},
		"echoStdErr": {
			"type": "stderr"
		}
	}
}

My main question is: how can I use scatter and still capture the logs of each scatter step? Currently, all invocations write to the same echo.out log file, so each one overwrites the last.

Welcome @kannon92 !

	"stdout": "echo.out",
	"stderr": "echo.out",

Try giving different file names here?

Like:

	"stdout": "echo.out",
	"stderr": "echo.err",

As far as dealing with filename conflicts from multiple steps and/or the scattering of a single step: if you are making your own executor, then you need to manage the results of each tool invocation. Typically this is done with a separate output folder per job (a single instance of a non-scattered step, or each permutation of a scattered step).
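The per-job output folder scheme described above can be sketched roughly like this (a minimal illustration with hypothetical names; this is not toil or cwltool API):

```python
import os

def job_outdir(base, step_id, scatter_index=None):
    """Return a unique output directory for one tool invocation.

    Hypothetical executor helper: non-scattered steps get
    <base>/<step_id>; scattered steps get one directory per
    permutation, <base>/<step_id>_<index>, so their stdout/stderr
    files can never collide.
    """
    name = step_id if scatter_index is None else "%s_%d" % (step_id, scatter_index)
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)
    return path
```

With a layout like that, every scatter job can safely write a file named echo.out because each job's working/output directory is distinct.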

Hey @mrc ,

I am using toil for my HPC environment. Is it possible to have each step write into a separate working directory? I thought there was a CLI option for toil that tells it where the cwd is.

Is it possible to take the scatter index and use it as the name of the stdout or stderr file in the tool definition? I guess I could pass it as an argument into the plugin so I can change the name of the log file.
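As far as I know, CWL does not expose the scatter index to the tool, but the stdout and stderr fields accept parameter references, so each scatter job can derive its log file names from its own input. A sketch for the echo tool above, assuming the message values are distinct and filename-safe:

	"stdout": "$(inputs.message).out",
	"stderr": "$(inputs.message).err",

Since scatter runs the tool once per element of the message array, each invocation then gets its own pair of log files.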

So I did some more investigating. If I omit the file name, cwltool generates a random UUID as the log file name; it also detects conflicts between invocations and writes separate files (using that UUID as a base) for each tool invocation.

The toil runner has a bug where it overwrites the existing log file on every tool invocation.

I created an issue in the toil github repo for this.


For those who found this via Google: I submitted a PR that fixes this issue in toil.
