
In the intro video they say it’s “based on git but supports large files”.

Are they using Git LFS [1] or did they make something else?

And what is their proposed value add over using git directly?

Edit: They say a little more about the large file stuff

> DVC keeps metafiles in Git instead of Google Docs to describe and version control your data sets and models. DVC supports a variety of external storage types as a remote cache for large files.

So from what they said in the video and what I read on the page, this is probably a limited front-end to make using git easier for people who don't know git.

And in terms of the large file stuff, it seems from what they are saying that they have implemented the equivalent of git-annex [2]. Or maybe they are even using it; I didn't look to see whether they wrote their own or used git-annex.

[1]: https://github.com/git-lfs/git-lfs/blob/master/README.md

[2]: https://git-annex.branchable.com/



Git itself is not suitable for huge files:

- Large binary files tend not to be very "deflatable" (i.e. they don't compress well)

- xdelta (used in Git to diff files) tries to load the entire content of a file into memory at once.

This is why there are solutions like Git LFS, where you keep the actual file versions on a remote server or in cloud storage and use Git to track only the small "metadata" files.

DVC implemented its own solution in order to be SCM-agnostic and cloud-flexible (supporting different remote storage backends).
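
For a rough sense of what that looks like in practice (the file names here are made up, and exact flags/output can vary by DVC version):

  dvc add data/images.zip      # hash the file, move it into .dvc/cache,
                               # and write a small data/images.zip.dvc metafile
  git add data/images.zip.dvc data/.gitignore
  git commit -m "track dataset with DVC"   # only the tiny metafile enters Git history

The multi-GB file itself never touches Git; Git only sees the metafile with the content hash.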

Here's more info comparing DVC to similar/related technologies: https://dvc.org/doc/dvc-philosophy/related-technologies

EDIT: formatting


Thank you for the link, that's the kind of comparison I was looking for, right down to how DVC compares to git-annex :)


They use cloud storage backends as remotes: AWS, Google Cloud, Azure. There's no specific Git LFS support, but compatibility is possible by using Git LFS to track the DVC cache.
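
Roughly (the bucket name is made up; DVC also accepts gs://, azure://, ssh and plain local paths):

  dvc remote add -d storage s3://my-bucket/dvc-cache
  dvc push    # upload the cached data files to the remote
  dvc pull    # fetch them on another machine or after a fresh clone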


Honestly, I'm confused. I went to the site, watched the video. I don't get it... what is it?


Basically you're not going to check in to Git the data used to train your models, or the models themselves, if they are multi-GB.

DVC basically symlinks those big files and checks in the symlinks.

It can also download those files from GCS/S3, and track which file came from where (e.g. if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well).

That's my understanding.
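
For example, the dependency tracking looks roughly like this (a sketch - process.py stands in for whatever actually generates output.data, and the exact syntax depends on the DVC version):

  dvc run -d input.data -o output.data python process.py
  # ...later, after input.data has changed:
  dvc repro    # DVC notices the changed dependency and re-runs the stage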


> if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well

To my understanding you could do the same with Docker. E.g. if you COPY your input files into the image, rebuilding the image would only redo that layer (and the ones after it) if the input files changed.
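
Something like this, relying on Docker's layer cache (assuming a Dockerfile that COPYs input.data in before the preprocessing step):

  docker build -t experiments:latest .   # first build: copies and preprocesses the data
  docker build -t experiments:latest .   # inputs unchanged: COPY and later layers come from cache
  # modify input.data, then:
  docker build -t experiments:latest .   # the COPY layer and everything after it are rebuilt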


Docker can only help if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train, etc. Each of those steps can be divided further: extract an Evaluate step from Train, and so on.

Also, Docker has overhead - a copy of the data needs to be created - while DVC just saves links (symlinks, hardlinks or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.
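
A quick sketch of both points (script and file names are hypothetical):

  # separate stages instead of one black box
  dvc run -d raw.csv -o clean.csv python preprocess.py
  dvc run -d clean.csv -o model.pkl python train.py
  # and the cache can be told to prefer links over copies
  dvc config cache.type reflink,hardlink,symlink,copy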


Good points, I'd appreciate it if you could elaborate, since it seems you've thought a lot about this.

> Docker can only help if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train, etc. Each of those steps can be divided further: extract an Evaluate step from Train, and so on.

Yeah, this is something I've been struggling with. In a project I'm working on I use docker build to 1) set up the environment, 2) get canonical datasets, and 3) pre-process the datasets. However, I've left reproduction as manual instructions, e.g. run the container, call script 1 to repro experiment 1, call script 2 to repro experiment 2, etc. I think I could improve this by providing `--entrypoint` at docker run, or by providing a docker-compose file (wherein I could specify an entrypoint) for each experiment.

What do you think are the generalizability pitfalls in this workflow? How could dvc help?

> Also, Docker has overhead - a copy of the data needs to be created - while DVC just saves links (symlinks, hardlinks or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.

Good point!


I could see using an entrypoint for that. The entrypoint script could take the experiment to run as an argument, and then the docker command would be: docker run -ti experiments:latest experiment-1

I could also see creating a base Dockerfile, and a Dockerfile per experiment. The base Dockerfile would do the setup, and the experiment Dockerfiles would just run the commands necessary to reproduce the experiment and exit.
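
A minimal sketch of the entrypoint idea (the script paths are made up; with ENTRYPOINT ["/entrypoint.sh"] in the Dockerfile, the trailing argument to docker run arrives as $1):

  #!/bin/sh
  # entrypoint.sh - dispatch on the experiment name
  set -e
  case "$1" in
    experiment-1) python scripts/experiment_1.py ;;
    experiment-2) python scripts/experiment_2.py ;;
    *) echo "unknown experiment: $1" >&2; exit 1 ;;
  esac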


The major pitfall: it depends on your goal. Your approach looks good if you just need to retrain an existing model/code in production. However, this approach is not perfect in the ML modeling/development stage. Let me explain...

`--entrypoint` defines a single step/script. You turn your entire ML process into a single black box. That loses the granularity of the ML modeling process, where people tend to separate different stages to make the process more manageable and efficient: manage and version datasets separately from modeling, preprocess data before training, keep training code as a separate unit, plus some problem-specific steps/units.

DVC gives you the ability to increase the granularity of your ML project while still keeping it manageable and reproducible. The steps can still be wrapped in Docker - that is good practice. As @theossuary said, run `docker run -ti experiments:latest` as a step in DVC.
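
As a sketch (paths and names made up; it assumes the image's entrypoint writes its outputs into the mounted workspace), a containerized step can itself be a DVC stage:

  dvc run -d data/clean.csv -o models/model.pkl \
    docker run --rm -v "$PWD:/work" -w /work experiments:latest experiment-1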


my understanding: it's a combination of

1 - git based management (not storage) of data files used in ML experiments;

2 - lightweight pipelines integrated with git to allow reproducibility of outputs and intermediate artifacts;

3 - integrating git with experimentation

If you've worked on a team building ML products, this is something you've at least half-built internally. It lets you share outputs internally with tracked lineage showing how to reproduce them, plus the pipeline management.


It allows you to track the progress of your models. You can improve reproducibility by having a tool like this track training/testing data; you can use it to see where that data was used to train or test a specific model, the parameters with which that model was built, and how that model affects downstream model performance.



