Hmm. If I understand correctly, in order to reproduce the steps taken in creating machine learning models, I need to version control more things than just the code:
1. Code
2. Configuration (libraries etc)
3. Input/training data
1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of the various versions used in a given run. It's 3 that I don't quite see how to handle.
According to the site, DVC uses object storage to store input data, but that leads to a few questions:
1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?
2. Why wouldn't I just version control the query that created the data along with the code that creates the model?
3. What if I'm working on a large file and make a one-byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?
@mcncfie your understanding is correct. #3 might include output data/models as well, plus intermediate results like preprocessed data. DVC also handles the dependencies between all of these.
Answers:
1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipeline tool or an ML-specific Makefile (see the sketch after this list). Also, DVC is simply faster than LFS, which is critical in 10GB+ cases.
2. This is a great case. However, in some scenarios you would prefer to store the query output along with the query itself, and DVC helps with that.
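To make the "ML-specific Makefile" comparison concrete, here is a minimal sketch of a two-stage pipeline, driven from Python purely for illustration; it assumes the classic `dvc run`/`dvc repro` interface, and every script and file name is a placeholder:

```python
# Minimal sketch of DVC's Makefile-style dependency tracking.
# All paths and scripts (prepare.py, train.py, data/raw.csv, ...) are hypothetical.
import subprocess

def sh(cmd):
    """Run a shell command and fail loudly, the way `make` would."""
    subprocess.run(cmd, shell=True, check=True)

# Track the raw input data: the file goes to the DVC cache/remote,
# only a small .dvc pointer file is committed to Git.
sh("dvc add data/raw.csv")

# Stage 1: preprocessing depends on the raw data and the script, outputs features.
sh("dvc run -d data/raw.csv -d prepare.py -o data/features.csv "
   "python prepare.py data/raw.csv data/features.csv")

# Stage 2: training depends on the features and the script, outputs a model.
sh("dvc run -d data/features.csv -d train.py -o model.pkl "
   "python train.py data/features.csv model.pkl")

# After editing train.py or refreshing the data, re-run only what changed:
sh("dvc repro model.pkl.dvc")
```

Because every stage records its dependencies and outputs, `dvc repro` re-runs only the stages whose inputs actually changed, much like `make` with a Makefile.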
> Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones
Have you looked into using content-defined chunking (à la restic or borgbackup) so that you get deduplication without the need to send around diffs? This is related to a problem that I'm working on solving in OCI (container) images[1].
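For anyone unfamiliar with the idea, here is a toy content-defined chunker in Python that shows why a one-byte edit only invalidates the chunk(s) around it; the window, mask, and size parameters are made up for illustration and are not what restic or borg actually use:

```python
# Toy content-defined chunking: boundaries are picked by a rolling hash over
# the content itself, so unchanged regions keep producing identical chunks
# that deduplicate by hash. Parameters are illustrative only.
import hashlib
import os

WINDOW = 48                        # bytes covered by the rolling hash
MASK = (1 << 13) - 1               # ~8 KiB average chunk (cut when hash & MASK == MASK)
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, 1 << 32
OUT = pow(BASE, WINDOW - 1, MOD)   # coefficient of the byte leaving the window

def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD   # drop the outgoing byte
        h = (h * BASE + byte) % MOD                  # add the incoming byte
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

# Demo: change one byte in the middle of 1 MiB and see how few chunks differ.
original = os.urandom(1024 * 1024)
edited = original[:500_000] + b"X" + original[500_001:]
old = {hashlib.sha256(c).hexdigest() for c in chunks(original)}
new = {hashlib.sha256(c).hexdigest() for c in chunks(edited)}
print(f"{len(new - old)} of {len(new)} chunks would need to be re-uploaded")
```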
Example: a query to the DB gives you different results over time since the data/table evolves. So, you just store the query output (let's say a couple of GBs) in DVC to make your research reproducible.
This is like assigning a random seed to the DB :)
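A rough sketch of that workflow, using sqlite3 as a stand-in for the real database; the query, table, and file names are all hypothetical:

```python
# "Freeze" a query's output so the experiment stays reproducible even though
# the underlying table keeps evolving. Query/table/paths are placeholders.
import csv
import os
import sqlite3
import subprocess

QUERY = "SELECT user_id, feature_a, label FROM events WHERE ts < '2019-02-01'"

# 1. Run the query once and dump the (possibly multi-GB) result to a file.
os.makedirs("data", exist_ok=True)
conn = sqlite3.connect("warehouse.db")            # stand-in for the real DB
cur = conn.execute(QUERY)
with open("data/training_snapshot.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)

# 2. Put the snapshot under DVC and record the pointer in Git: the big file
#    lives in the DVC cache (and in remote object storage after `dvc push`),
#    while Git only tracks the tiny data/training_snapshot.csv.dvc file.
subprocess.run("dvc add data/training_snapshot.csv", shell=True, check=True)
subprocess.run("dvc push", shell=True, check=True)
subprocess.run("git add data/training_snapshot.csv.dvc", shell=True, check=True)
subprocess.run('git commit -m "Freeze training data snapshot"', shell=True, check=True)
```

Checking out that Git commit later and running `dvc pull` brings back exactly the same snapshot, no matter how the table has changed since.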
Sure, some teams combine DVC with Airflow. It gives a clear separation between engineering (reliability) and data science (lightweight, quick iteration). A recent discussion about this: https://twitter.com/FullStackML/status/1091840829683990528
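A rough sketch of what that split can look like, with Airflow only orchestrating `dvc pull`/`repro`/`push` while the stages themselves stay in the DVC pipeline; operator import paths differ between Airflow versions, and the repo path, DAG id, and schedule are placeholders:

```python
# Airflow owns scheduling/retries; DVC owns the ML dependency graph.
# /opt/ml-project, the schedule, and the DAG id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # path varies by Airflow version

with DAG(dag_id="retrain_model",
         start_date=datetime(2019, 2, 1),
         schedule_interval="@daily") as dag:

    pull = BashOperator(          # fetch the latest tracked data from the DVC remote
        task_id="dvc_pull",
        bash_command="cd /opt/ml-project && dvc pull")

    repro = BashOperator(         # re-run only the stages whose dependencies changed
        task_id="dvc_repro",
        bash_command="cd /opt/ml-project && dvc repro")

    push = BashOperator(          # publish refreshed data/models back to the remote
        task_id="dvc_push",
        bash_command="cd /opt/ml-project && dvc push")

    pull >> repro >> push
```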