
In the intro video they say it’s “based on git but supports large files”.

Are they using Git LFS [1] or did they make something else?

And what is their proposed value add over using git directly?

Edit: They say a little more about the large file stuff

> DVC keeps metafiles in Git instead of Google Docs to describe and version control your data sets and models. DVC supports a variety of external storage types as a remote cache for large files.

So from what they said in the video and what I read on the page, this is probably a limited front-end to make using git easier for people who don't know git.

And in terms of the large file stuff, it seems from what they are saying that they have implemented the equivalent of git-annex [2]. Or maybe they are even using it; I didn't look to see whether they wrote their own or used git-annex.

[1]: https://github.com/git-lfs/git-lfs/blob/master/README.md

[2]: https://git-annex.branchable.com/



Git itself is not suitable for huge files:

- Large binary files tend not to be very "deflatable" (i.e. they don't compress well)

- xdelta (used in Git to diff files) tries to load the entire content of a file into memory at once.

This is why there are solutions like Git LFS, where you keep the actual file versions on a remote server or in cloud storage and use Git to track only the small "metadata" files.

DVC implemented its own solution in order to be SCM-agnostic and cloud-flexible (supporting different remote storage backends).
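
For a rough sense of what that looks like in practice (the file names here are made up, and exact flags/output can vary by DVC version):

  dvc add data/images.zip      # hash the file, move it into .dvc/cache,
                               # and write a small data/images.zip.dvc metafile
  git add data/images.zip.dvc data/.gitignore
  git commit -m "track dataset with DVC"   # only the tiny metafile enters Git history

The multi-GB file itself never touches Git; Git only sees the metafile with the content hash.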

Here's more info comparing DVC to similar/related technologies: https://dvc.org/doc/dvc-philosophy/related-technologies

EDIT: formatting


Thank you for the link, that's the kind of comparison I was looking for, right down to how DVC compares to git-annex :)


They use cloud storage backends as remotes: AWS, Google Cloud, Azure. There's no specific Git LFS support, but compatibility is possible by using Git LFS to track the DVC cache.
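
Roughly (the bucket name is made up; DVC also accepts gs://, azure://, ssh and plain local paths):

  dvc remote add -d storage s3://my-bucket/dvc-cache
  dvc push    # upload the cached data files to the remote
  dvc pull    # fetch them on another machine or after a fresh clone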


Honestly, I'm confused. I went to the site, watched the video. I don't get it... what is it?


Basically you're not going to check in to Git the data used to train your models, or the models themselves, if they are multi-GB.

DVC basically symlinks those big files and checks in the symlinks.

It can also download those files from GCS/S3, and track which file came from where (e.g. if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well).

That's my understanding.
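
For example, the dependency tracking looks roughly like this (a sketch - process.py stands in for whatever actually generates output.data, and the exact syntax depends on the DVC version):

  dvc run -d input.data -o output.data python process.py
  # ...later, after input.data has changed:
  dvc repro    # DVC notices the changed dependency and re-runs the stage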


> if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well

To my understanding you could do the same with Docker. E.g. if you COPY your input files into the image, rebuilding the image would only redo that layer (and the ones after it) if the input files changed.
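
Something like this, relying on Docker's layer cache (assuming a Dockerfile that COPYs input.data in before the preprocessing step):

  docker build -t experiments:latest .   # first build: copies and preprocesses the data
  docker build -t experiments:latest .   # inputs unchanged: COPY and later layers come from cache
  # modify input.data, then:
  docker build -t experiments:latest .   # the COPY layer and everything after it are rebuilt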


Docker can only help if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train, etc. Each of those steps can be divided further: extract an Evaluate step from Train, and so on.

Also, Docker has overhead - a copy of the data needs to be created - while DVC just saves links (symlinks, hardlinks or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.
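
A quick sketch of both points (script and file names are hypothetical):

  # separate stages instead of one black box
  dvc run -d raw.csv -o clean.csv python preprocess.py
  dvc run -d clean.csv -o model.pkl python train.py
  # and the cache can be told to prefer links over copies
  dvc config cache.type reflink,hardlink,symlink,copy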


Good points, I'd appreciate it if you could elaborate, since it seems you've thought a lot about this.

> Docker can only help if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train, etc. Each of those steps can be divided further: extract an Evaluate step from Train, and so on.

Yeah, this is something I've been struggling with. In a project I'm working on I use docker build to 1) set up the environment, 2) get canonical datasets, and 3) pre-process the datasets. However, I've left reproduction as manual instructions, e.g. run the container, call script 1 to repro experiment 1, call script 2 to repro experiment 2, etc. I think I could improve this by providing `--entrypoint` at docker run, or by providing a docker-compose file (wherein I could specify an entrypoint) for each experiment.

What do you think are the generalizability pitfalls in this workflow? How could dvc help?

> Also, Docker has overhead - a copy of the data needs to be created - while DVC just saves links (symlinks, hardlinks or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.

Good point!


I could see using an entrypoint for that. The entrypoint script could take the experiment to run as an argument, and then the docker command would be: docker run -ti experiments:latest experiment-1

I could also see creating a base Dockerfile, and a Dockerfile per experiment. The base Dockerfile would do the setup, and the experiment Dockerfiles would just run the commands necessary to reproduce the experiment and exit.
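
A minimal sketch of the entrypoint idea (the script paths are made up; with ENTRYPOINT ["/entrypoint.sh"] in the Dockerfile, the trailing argument to docker run arrives as $1):

  #!/bin/sh
  # entrypoint.sh - dispatch on the experiment name
  set -e
  case "$1" in
    experiment-1) python scripts/experiment_1.py ;;
    experiment-2) python scripts/experiment_2.py ;;
    *) echo "unknown experiment: $1" >&2; exit 1 ;;
  esac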


The major pitfall: it depends on your goal. Your approach looks good if you just need to retrain an existing model/code in production. However, this approach is not perfect in the ML modeling/development stage. Let me explain...

`--entrypoint` defines a single step/script. You turn your entire ML process into a single black box. That loses the granularity of the ML modeling process, where people tend to separate different stages to make the process more manageable and efficient: manage and version datasets separately from modeling, preprocess data before training, keep training code as a separate unit, plus some problem-specific steps/units.

DVC gives you the ability to increase the granularity of your ML project while still keeping it manageable and reproducible. The steps can still be wrapped in Docker - that is good practice. As @theossuary said, run `docker run -ti experiments:latest` as a step in DVC.
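
As a sketch (paths and names made up; it assumes the image's entrypoint writes its outputs into the mounted workspace), a containerized step can itself be a DVC stage:

  dvc run -d data/clean.csv -o models/model.pkl \
    docker run --rm -v "$PWD:/work" -w /work experiments:latest experiment-1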


my understanding: it's a combination of

1 - git based management (not storage) of data files used in ML experiments;

2 - lightweight pipelines integrated with git to allow reproducibility of outputs and intermediate artifacts;

3 - integrating git with experimentation

If you've worked on a team building ML products, this is something you've at least half-built internally. It lets you share outputs internally with tracked lineage showing how to reproduce them, plus the pipeline management.


It allows you to track the progress of your models. You can improve reproducibility by having a tool like this track training/testing data; you can use it to see where that data was used to train or test a specific model, the parameters with which that model was built, and how that model affects downstream model performance.



