Another nice thing about setups like yours is reproducibility. So long as you've got your setup in git and you've stored the flags/lines/functions, you can instantly redo the same experiment.



I agree, and I have been working to do this with some of my pipelines as well. One challenge I have been facing is that my compute environment may be quite different from others'. This is mainly the case with distributed computing, which seems to be an essential part of the pipelines: I often want to experiment with multiple hyperparameter settings, which creates a lot of processes to run.

Do you (or the parent, or others) take steps to abstract away the distributed computing layer so that others can run the pipelines in their own distributed computing environments? More specifically, I use Condor (http://research.cs.wisc.edu/htcondor/), but other batch systems like PBS are also popular. Ideally my pipeline would support both systems (and many others).


I ended up writing a simple distributed build system (https://github.com/azag0/caf, undocumented) that abstracts this away. (I use it on a couple of clusters with different batch systems.) You define tasks as directories with some input files and a command, which get hashed to define a unique identity. The build system then helps with distributing these tasks to other machines, executing them, and collecting the processed results.
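For anyone curious, here's a minimal sketch (in Python, not caf itself) of that hash-based task identity idea; the directory layout and command are hypothetical:

```python
# A task is a directory of input files plus a command; hashing the command
# together with the files gives a unique, reproducible task identity.
import hashlib
from pathlib import Path


def task_hash(task_dir, command):
    """Hash the command plus every input file's relative path and contents."""
    h = hashlib.sha256()
    h.update(command.encode())
    for path in sorted(Path(task_dir).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(task_dir)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()


# Example: two directories with identical inputs and the same command produce
# the same hash, so already-executed tasks can be recognized and skipped.
# task_hash("tasks/my_calculation", "run_solver input.dat > out.log")
```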

Ultimately, though, I rely on some common environment (such as a particular binary existing in PATH) that lives outside the build system and would need to be recreated by anyone who wants to reproduce the calculations. I never looked into abstracting that away with something like Docker.

(I plan to document the build system once it's at least roughly stable and I finish my PhD.)


Maybe containers like Docker are useful for your use case?


Docker does a good job of distributing the software needed to run the job, which is definitely part of the issue and something I should use more.

However, I also have in mind a pipeline of scripts where one script may be a prerequisite of another. Condor has some nice abstractions for this: it organizes the scripts/jobs as a directed acyclic graph according to their dependencies. I was thinking other batch systems might support this as well, but part of my challenge is learning how each batch system would run these DAGs. Each one will have its own commands to launch jobs, to wait for one job to finish before running another, to check whether jobs failed, to rerun failed jobs in the case of hardware failure, and so on.

It seems like the DAG representation would contain enough detail for any batch system, but there may be some nuances. For example, I tend to think of these jobs as a command to run, the arguments to give that command, and somewhere to put stdout and stderr (something like the sketch below). But Condor will also report information about job execution in other log files. Cases like this illustrate where my DAG representation (or at least the data tracked in its nodes) might break down, but I haven't used other systems like PBS enough to know for sure.
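To make that concrete, here's a minimal sketch of such a batch-system-agnostic job/DAG representation (names are hypothetical and not tied to any batch system); a Condor or PBS backend would then translate it into that system's submit files:

```python
# Each job is a command with arguments, stdout/stderr destinations, and a
# list of upstream dependencies. The list of jobs forms the DAG.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    command: str
    args: List[str] = field(default_factory=list)
    stdout: str = "out.log"
    stderr: str = "err.log"
    depends_on: List[str] = field(default_factory=list)  # names of parent jobs


pipeline = [
    Job("preprocess", "python", ["preprocess.py"]),
    Job("train_lr_0.01", "python", ["train.py", "--lr", "0.01"],
        depends_on=["preprocess"]),
    Job("train_lr_0.1", "python", ["train.py", "--lr", "0.1"],
        depends_on=["preprocess"]),
]
```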


Apache Airflow defines and runs DAGs in a sane way, IMO. Takes some configuration, but worth it for more complicated projects.
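For a sense of what that looks like, here's a minimal Airflow DAG with two dependent shell tasks (task names and scripts are hypothetical; import paths assume Airflow 2.x):

```python
# Older Airflow releases import BashOperator from
# airflow.operators.bash_operator instead.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="experiment_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,         # run only when triggered manually
    catchup=False,
) as dag:
    preprocess = BashOperator(task_id="preprocess",
                              bash_command="python preprocess.py")
    train = BashOperator(task_id="train",
                         bash_command="python train.py")

    preprocess >> train  # train only starts after preprocess succeeds
```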


[Luigi](https://github.com/spotify/luigi) out of Spotify sounds exactly like what you're looking for. It allows you to specify dependent tasks, pipe input/output between them, and more.
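A minimal sketch of two dependent Luigi tasks (task and file names are hypothetical): Train declares Preprocess as a requirement and reads its output target.

```python
import luigi


class Preprocess(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/preprocessed.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("col1,col2\n1,2\n")  # stand-in for real preprocessing


class Train(luigi.Task):
    def requires(self):
        return Preprocess()  # Luigi runs Preprocess first if its output is missing

    def output(self):
        return luigi.LocalTarget("models/model.txt")

    def run(self):
        with self.input().open() as f:
            data = f.read()
        with self.output().open("w") as f:
            f.write(f"trained on {len(data)} bytes\n")  # stand-in for real training


if __name__ == "__main__":
    luigi.build([Train()], local_scheduler=True)
```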


You should check out pachyderm [1] for setting up automated data pipelines. Also great for reproducibility.

[1]: http://www.pachyderm.io


Yes, this comes in especially handy when reviewers ask for additional experiments.



