Using CWL at scale in the Genomic Data Commons

mrc · February 27, 2022, 2:46pm

Authors: Shenglai Li, Kyle Hernandez, Zhenyu Zhang

Session 1 (Americas-EMEA) Monday, February 28th, 14:05 UTC

Summary: “The National Cancer Institute’s Genomic Data Commons (GDC) has been using CWL since 2016. We have completed over 1,000,000 CWL workflow jobs across on-prem and AWS combined, with cwltool as our workflow engine. Our current implementation has a limitation: cwltool cannot parallelize workflow steps across multiple instances, so we cannot consistently leverage AWS spot instances. However, since the GDC pipeline automation system (GPAS) was designed to be easily configured with a different CWL engine, we are always investigating other workflow engines. We also want to share some of our experience, including cwltest, modularizing, and improving code readability.

This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. 75N91019D00024 Task Order 17X147F15”

Slides: Using CWL at scale in the Genomic Data Commons.pptx.pdf - Google Drive

See also: GitHub - NCI-GDC/gdc-dnaseq-cwl: CWL for GDC DNASeq workflows

Please leave your questions for the presenter below!

As an alternative to YouTube, this presentation is also available on ConfTube