
Hi, I'm one of the maintainers of Daft

1. Thanks! We think so too :)

2. Here's my 2c in favor of flat files:

- Ingestion: getting data into a data lake is much easier than writing to a database - all you have to do is drop some JSON, CSVs, or protobufs into a bucket (first sketch after this list). This makes integrating with other systems, especially 3rd-party or vendor systems, much simpler since there's an open, language-agnostic format to communicate with.

- Multimodal data: certain datatypes (e.g. images, tensors) may not make sense in a traditional SQL database. In a data lake, though, data is usually "schema-on-read", so you can at least ingest it, and the responsibility falls on the downstream application to make use of it if it can/wants to (second sketch below) - super flexible!

- "Always on": with databases, you pay for uptime which likely scales with the size of your data. If your requirements are infrequent accesses of your data then a datalake could save you a lot of money! A common example of this: once-a-day data cleanups and ETL of an aggregated subset of your data into downstream (clean!) databases for cheaper consumption.

On "isn't updating it a huge chore?": many data lakes are partitioned by ingestion time, and applications usually consume a subset of these partitions (e.g. all data over the past week). In practice this means that you can lifecycle your data and put old data into cold-storage so that it costs you less money.



Heya, thanks for the super informative response! I learned a lot here; it was very useful. Good luck with Daft!



