Submit job on remote cluster

Hi, I have a workflow whose steps are dockerized ML predictions. The actual computation should take place on a remote Dask cluster. The problem is that I cannot find a way to share the steps' inputs and outputs with the remote cluster: they are mounted on container-internal paths (such as //input) that cannot be resolved externally.
Is there a way to solve this problem? Maybe having some control over the Docker-internal paths of inputs and outputs would allow a workaround.
Thanks

I think your problem is that you need an engine that can handle different file stores, but maybe I'm misunderstanding. If your files live in, say, S3, or are served via HTTP/FTP, etc., a smart engine could use the URI scheme to know which file store to use. Alternatively, you could have a step that copies the remote file locally and then passes it to the next step; this is what I do with S3 when using cwltool (a minimal sketch follows below). Is this helpful to your problem?
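For example, the staging step can be little more than a wrapper around the AWS CLI. This is only a sketch, and the bucket and object names are made up:

# copy the remote object into the step's working directory
aws s3 cp s3://my-bucket/input-file ./input-file

The next step then receives ./input-file as an ordinary local file.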

My files are on NFS, visible both to the machine where the CWL runner executes and to the cluster that accepts the jobs submitted by some steps of the workflow. The files are pretty large, so copying them should be avoided.
Steps that submit jobs run in Docker, so they are basically converted into something like:

docker run .... /<RANDOM-DIR>/input-file

/<RANDOM-DIR>/input-file is sent as part of the job submission to the cluster, and obviously cannot be resolved externally.

Basically, the problem is that one or more steps communicate with an external system in order to process their inputs, so the paths where the inputs are staged need to be shared in some way.

As far as I understand, this is not supported by the reference implementation, but maybe it can be achieved by properly configuring (or modifying the way the runner manages) the paths where inputs are moved/linked and the volumes mounted into the Docker container.

It was my understanding that if you are using a remote cluster for execution, you need a shared filesystem with the same file paths on both the local host and the remote system. Is that not the case in your situation? It might be easiest to mount the NFS share identically on both systems (see the sketch below), or to run the workflow directly from the remote system, where the paths are correct.
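For instance, something along these lines on both the submit host and the cluster nodes, where the server name and export path are made up:

# mount the same NFS export at the same path everywhere
sudo mount -t nfs nfs-server:/export/data /mnt/NFS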

An S3 solution sounds like it could be helpful, but it's probably a lot more work and overhead than you want.

Hello @mdrio and welcome! Which workflow platform or engine are you using?

Is this what the Dask documentation calls (under “Distributed computing”) an “un-managed cluster” or something else?

Yes, this is the situation: I have different machines where the NFS share is mounted on the same paths. Unfortunately, things are more complex, since I have the requirement of running the steps in Docker.

The client machine runs a workflow with an input file /mnt/NFS/input. A dockerized step takes it and submits a job to the remote cluster, but inside the Docker container the input file has been remapped to /RANDOM_DIR/input. So the original file /mnt/NFS/input is visible on the cluster, since the same NFS share is mounted there, but obviously /RANDOM_DIR/input cannot be resolved.
I suppose a solution could be achieved by configuring the parent directory of the random directories generated inside the Docker container, and by mounting the NFS share at the same path inside the container.

In other words: if the NFS share is mounted at /mnt/NFS inside the Docker container, and the parent directory for the random directories inside the container is again /mnt/NFS, then the job on the remote cluster will run on a file located at /mnt/NFS/RANDOM_DIR/input, which is visible to the cluster.
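At the Docker level, the effect I am after would be something like the following sketch, where the image name and command are placeholders for the real step; the open question is how to make the runner produce an equivalent invocation:

# bind-mount the NFS share at the same path inside the container, so the
# path handed to the job submission also resolves on the cluster nodes
docker run --rm \
  -v /mnt/NFS:/mnt/NFS \
  my-ml-image \
  submit-job /mnt/NFS/RANDOM_DIR/input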

As far as I understand, this is not possible in the current reference implementation, but it could probably be achieved with minor changes. Does this approach violate the philosophy of CWL in some way?

It sounds interesting. At the moment I would like to keep it simple and use NFS, but I think this is something I will need in the future.

Yes, it is an un-managed Dask cluster.