Hmm. If I understand correctly, in order to reproduce the steps taken in creating machine learning models, I need to version control more things than just the code:
1. Code
2. Configuration (libraries etc)
3. Input/training data
1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of the various versions used in a given run. It's 3 that I don't quite see how to handle.
According to the site, DVC uses object storage to store input data, but that leads to a few questions:
1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?
2. Why wouldn't I just version control the query that created the data along with the code that creates the model?
3. What if I'm working on a large file and make a one-byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?
@mcncfie your understanding is correct. #3 might include output data/models as well, plus intermediate results like preprocessed data. DVC also handles the dependencies between all of these.
Answers:
1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipeline tool or an ML-specific Makefile (see the sketch after this list). Also, DVC is simply faster than LFS, which is critical in 10GB+ cases.
2. This is a great case. However, in some scenarios you would prefer to store the query output along with the query itself, and DVC helps with that.
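To make the "ML-specific Makefile" comparison concrete, here is a minimal sketch of a two-stage pipeline, driven from Python purely for illustration; it assumes the classic `dvc run`/`dvc repro` interface, and every script and file name is a placeholder:

```python
# Minimal sketch of DVC's Makefile-style dependency tracking.
# All paths and scripts (prepare.py, train.py, data/raw.csv, ...) are hypothetical.
import subprocess

def sh(cmd):
    """Run a shell command and fail loudly, the way `make` would."""
    subprocess.run(cmd, shell=True, check=True)

# Track the raw input data: the file goes to the DVC cache/remote,
# only a small .dvc pointer file is committed to Git.
sh("dvc add data/raw.csv")

# Stage 1: preprocessing depends on the raw data and the script, outputs features.
sh("dvc run -d data/raw.csv -d prepare.py -o data/features.csv "
   "python prepare.py data/raw.csv data/features.csv")

# Stage 2: training depends on the features and the script, outputs a model.
sh("dvc run -d data/features.csv -d train.py -o model.pkl "
   "python train.py data/features.csv model.pkl")

# After editing train.py or refreshing the data, re-run only what changed:
sh("dvc repro model.pkl.dvc")
```

Because every stage records its dependencies and outputs, `dvc repro` re-runs only the stages whose inputs actually changed, much like `make` with a Makefile.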
> Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones
Have you looked into using content-defined chunking (à la restic or borgbackup) so that you get deduplication without the need to send around diffs? This is related to a problem that I'm working on solving in OCI (container) images[1].
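For anyone unfamiliar with the idea, here is a toy content-defined chunker in Python that shows why a one-byte edit only invalidates the chunk(s) around it; the window, mask, and size parameters are made up for illustration and are not what restic or borg actually use:

```python
# Toy content-defined chunking: boundaries are picked by a rolling hash over
# the content itself, so unchanged regions keep producing identical chunks
# that deduplicate by hash. Parameters are illustrative only.
import hashlib
import os

WINDOW = 48                        # bytes covered by the rolling hash
MASK = (1 << 13) - 1               # ~8 KiB average chunk (cut when hash & MASK == MASK)
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, 1 << 32
OUT = pow(BASE, WINDOW - 1, MOD)   # coefficient of the byte leaving the window

def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD   # drop the outgoing byte
        h = (h * BASE + byte) % MOD                  # add the incoming byte
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

# Demo: change one byte in the middle of 1 MiB and see how few chunks differ.
original = os.urandom(1024 * 1024)
edited = original[:500_000] + b"X" + original[500_001:]
old = {hashlib.sha256(c).hexdigest() for c in chunks(original)}
new = {hashlib.sha256(c).hexdigest() for c in chunks(edited)}
print(f"{len(new - old)} of {len(new)} chunks would need to be re-uploaded")
```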
Example: a query to the DB gives you different results over time since the data/table evolves. So, you just store the query output (let's say a couple of GBs) in DVC to make your research reproducible.
This is like assigning a random seed to the DB :)
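A rough sketch of that workflow, using sqlite3 as a stand-in for the real database; the query, table, and file names are all hypothetical:

```python
# "Freeze" a query's output so the experiment stays reproducible even though
# the underlying table keeps evolving. Query/table/paths are placeholders.
import csv
import os
import sqlite3
import subprocess

QUERY = "SELECT user_id, feature_a, label FROM events WHERE ts < '2019-02-01'"

# 1. Run the query once and dump the (possibly multi-GB) result to a file.
os.makedirs("data", exist_ok=True)
conn = sqlite3.connect("warehouse.db")            # stand-in for the real DB
cur = conn.execute(QUERY)
with open("data/training_snapshot.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)

# 2. Put the snapshot under DVC and record the pointer in Git: the big file
#    lives in the DVC cache (and in remote object storage after `dvc push`),
#    while Git only tracks the tiny data/training_snapshot.csv.dvc file.
subprocess.run("dvc add data/training_snapshot.csv", shell=True, check=True)
subprocess.run("dvc push", shell=True, check=True)
subprocess.run("git add data/training_snapshot.csv.dvc", shell=True, check=True)
subprocess.run('git commit -m "Freeze training data snapshot"', shell=True, check=True)
```

Checking out that Git commit later and running `dvc pull` brings back exactly the same snapshot, no matter how the table has changed since.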
Sure, some teams combine DVC with Airflow. It gives a clear separation between engineering (reliability) and data science (lightweight, quick iteration). A recent discussion about this: https://twitter.com/FullStackML/status/1091840829683990528
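A rough sketch of what that split can look like, with Airflow only orchestrating `dvc pull`/`repro`/`push` while the stages themselves stay in the DVC pipeline; operator import paths differ between Airflow versions, and the repo path, DAG id, and schedule are placeholders:

```python
# Airflow owns scheduling/retries; DVC owns the ML dependency graph.
# /opt/ml-project, the schedule, and the DAG id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # path varies by Airflow version

with DAG(dag_id="retrain_model",
         start_date=datetime(2019, 2, 1),
         schedule_interval="@daily") as dag:

    pull = BashOperator(          # fetch the latest tracked data from the DVC remote
        task_id="dvc_pull",
        bash_command="cd /opt/ml-project && dvc pull")

    repro = BashOperator(         # re-run only the stages whose dependencies changed
        task_id="dvc_repro",
        bash_command="cd /opt/ml-project && dvc repro")

    push = BashOperator(          # publish refreshed data/models back to the remote
        task_id="dvc_push",
        bash_command="cd /opt/ml-project && dvc push")

    pull >> repro >> push
```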