
1. This looks super cool

2. I admit to not understanding data lakes at all. I thought it was basically a failure case of "we can't figure out how to get this data into a database", because isn't updating it a huge chore? You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow. Don't databases solve this pretty elegantly? Why are there all these tools for dealing with data in flat files?



Hi, I'm one of the maintainers of Daft

1. Thanks! We think so too :)

2. Here's my 2c in defense of flat files:

- Ingestion: ingesting things into a data lake is much easier than writing to a database - all you have to do is drop some JSON, CSVs or protobufs into a bucket (see the sketch after this list). This makes integrating with other systems, especially 3rd parties or vendors, much easier, since there's an open, language-agnostic format to communicate with.

- Multimodal data: certain datatypes (e.g. images, tensors) may not make sense in a traditional SQL database. In a data lake, though, data is usually "schema-on-read", so you can at least ingest it, and the responsibility then falls on the downstream application to make use of it if it can/wants to - super flexible!

- "Always on": with databases, you pay for uptime which likely scales with the size of your data. If your requirements are infrequent accesses of your data then a datalake could save you a lot of money! A common example of this: once-a-day data cleanups and ETL of an aggregated subset of your data into downstream (clean!) databases for cheaper consumption.

On "isn't updating it a huge chore?": many data lakes are partitioned by ingestion time, and applications usually consume a subset of these partitions (e.g. all data over the past week). In practice this means that you can lifecycle your data and put old data into cold-storage so that it costs you less money.


Heya, thanks for the super informative response! I learned a lot here; it was very useful. Good luck with Daft!


A database is typically a live, running program that requires maintenance. For most "databases" there is a strong coupling between the disk format and the executable code. It's not easy to read from a random Postgres database sitting on disk - you can't just do "import postgres" and then "postgres.open('mydb')" in Python.

I’m no data scientist, and have only worked with data lakes a couple times, but I can see why data science tends to be done with very predictable (if inefficient) data formats such as CSV, JSON, and JSONL.

Edit: SQLite is the best of both worlds. It's a database, but it's also "just a file." It's easy to work with, and many languages & frameworks are getting good support for it. SQLite's reliability-first approach means many of the kinks that come with running a database (so much complexity!!) are already ironed out and don't surface as issues. (Things like auto-indexing, auto-vacuuming, avoiding & dealing with corruption, backwards & forwards compatibility, …)
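
To illustrate the "just a file" point, this is the whole setup with the standard-library sqlite3 module (the file name is arbitrary):

    # SQLite really is "just a file": the standard library opens it directly,
    # no server process involved.
    import sqlite3

    conn = sqlite3.connect("mydb.sqlite")  # creates the file if it doesn't exist
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT)")
    conn.execute("INSERT INTO events VALUES (?, ?)", (1, "click"))
    conn.commit()

    for row in conn.execute("SELECT user_id, event FROM events"):
        print(row)
    conn.close()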


While we are on this topic, the challenge with data lakes for Python-based projects like Daft and Quokka (what I work on) is the poor Python support for data lake formats like Delta, Iceberg and Hudi. Delta has the best support, but its Python API is consistently behind the Java one. Iceberg doesn't support Python writes. Hudi has no Python support at all.
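
For context, reads through the delta-rs bindings (the "deltalake" package on PyPI) already look like this - no JVM required; the table URI here is made up:

    # Reading a Delta table from Python via delta-rs ("pip install deltalake").
    # Hypothetical table URI; no Spark or JVM involved on this path.
    from deltalake import DeltaTable

    dt = DeltaTable("s3://my-datalake/events_delta")
    print(dt.version())            # current table version
    table = dt.to_pyarrow_table()  # or dt.to_pandas()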

I have users demanding Iceberg writes and Hudi reads/writes. I don't know what to tell them, since I don't have the resources to add a reader/writer myself for those projects.

Hopefully as DuckDB becomes more popular we will see Python bindings for these popular data lake formats this year.


One aspect of data lakes is that they're more of a starting point for where the data lives. Your normalised, more processed data may end up in a nice clean database, but the data doesn't start out like that.

The other main points are usually

* Data size

* Data access patterns

* Data formats

The more you're looking at "I want to pull 400G of data out of my 30TB set of images from a bunch of machines running a custom Python script, then shut it down in twenty minutes and not start anything else until tomorrow", the more a data lake makes sense vs a database.
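
A hedged sketch of that access pattern - an ephemeral batch job that lists a subset of objects under a prefix, processes them, and leaves nothing running. The bucket, prefix and the process_image function are made up:

    # Ephemeral batch job over a subset of a large object store - nothing stays "on".
    # Hypothetical bucket/prefix; process_image stands in for the custom script's work.
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    def process_image(raw: bytes) -> None:
        ...  # placeholder for the custom per-image work

    for page in paginator.paginate(Bucket="my-datalake", Prefix="images/2023-01-15/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket="my-datalake", Key=obj["Key"])["Body"].read()
            process_image(body)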

> because isn't updating it a huge chore?

Not with the right tools, which can also give you things like a git-like commit experience with branching.
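
The parent doesn't name a tool, but as one hedged example: table formats like Delta Lake (via the deltalake package) record every write as a versioned commit you can inspect and time-travel to, and layers like lakeFS or Nessie add git-style branches on top. A sketch of the commit/time-travel part, with made-up paths and data:

    # Each write to a Delta table is a new versioned commit in its transaction log.
    # Hypothetical URI and data; "deltalake" here is the delta-rs Python package.
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    uri = "s3://my-datalake/events_delta"
    new_batch = pa.table({"user_id": [3, 4], "event": ["click", "view"]})

    write_deltalake(uri, new_batch, mode="append")   # commit a new version

    dt = DeltaTable(uri)
    print(dt.version())       # latest version number
    print(dt.history()[:3])   # recent commits, git-log style
    old = DeltaTable(uri, version=dt.version() - 1)  # time travel to the prior commit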

> You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow.

Why would you be generating new analytics? I feel I've missed something there.


2. I hate the name data lake but I’ve always used them as part of a pipeline. It’s useful to keep the raw data around when you need to recover / replay the pipeline.


I think most data lakes really turn into data swamps, as people/orgs are not diligent enough to keep them in good shape. In the absence of costs, people never delete anything voluntarily, and the garbage will grow monotonically until it takes up all the space.


That applies to databases as much as data lakes.

I don't know why it's so hard to get across that data expires in the same way that language changes.



