The YouTube video “Apache Iceberg: What It Is and Why Everyone’s Talking About It” by Tim Berglund explains data lakes really well in the opening minutes: https://www.youtube.com/watch?v=TsmhRZElPvM
~40 years ago, the data warehouse was invented: an overnight ETL process would collect data from smaller operational databases into a central database (the data warehouse).
~15 years ago, the data lake (e.g., Hadoop) emerged to address scaling and other issues. Same idea, but ELT instead of ETL: less focus on schema up front; collect the data into S3 and transform it later.
Thank you for your work! We use DuckDB with dbt-duckdb in production (because we're on-prem and because we don't need tens of thousands of nodes) and we love it! About the COPY statement: does this mean we can drop Parquet files into the blob storage ourselves? From my understanding, DuckLake is responsible for managing the files on the storage layer.
> About the COPY statement: does this mean we can drop Parquet files into the blob storage ourselves?
Dropping the Parquet files into the blob storage will not work – you have to COPY them through DuckLake so that the catalog database is updated with the required metadata.
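For illustration, here is a rough sketch of that flow from Python, following the ATTACH syntax from the DuckLake announcement. The catalog file, bucket, table, and file names are all made up, and it assumes the DuckLake extension is available and S3 credentials are configured:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Attach a DuckLake catalog; the catalog database records which Parquet files
# belong to which table.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/')")

# Loading through DuckLake (CTAS or COPY ... FROM) writes Parquet files under DATA_PATH
# *and* registers them in the catalog; dropping files into the bucket would not do that.
con.sql("CREATE TABLE lake.events AS SELECT * FROM 'local_events.parquet'")
con.sql("COPY lake.events FROM 'more_events.parquet' (FORMAT PARQUET)")
```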
Ah, drat. I have an application that uses DuckDB to output parquet files. That application is by necessity disconnected from any sense of a data lake. But, I would love to have a good way of then pushing them up to S3 and integrating into a data lake. I’ve been looking into Iceberg and I’ve had the thought, “this is great but I hate the idea of what all these little metadata files will do to latency.”
I was also thinking about this use case when reading the announcement. Let's say you have a bunch of Parquet files already (on the local FS, HTTPS, S3, ...) that you can assume are immutable (or maybe append-only). It would be great if you could attach them to DuckLake without copying them! From the design doc, it seems this should essentially work: you would read those Parquet files to compute the metadata and insert references to the files instead of copying them to the storage you manage. Basically, you want to create the catalog independently of the underlying data.
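As a side note, DuckDB already exposes the per-file and per-row-group statistics that such a "register in place" workflow would need to collect. A small sketch (the S3 path is made up, and it assumes the httpfs extension and S3 credentials are set up):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# File-level information: total row count, number of row groups, format version, ...
con.sql("SELECT * FROM parquet_file_metadata('s3://my-bucket/raw/events.parquet')").show()

# Per-column, per-row-group statistics (min/max values, null counts) that a catalog
# could record to enable pruning without rewriting or moving the file.
con.sql("SELECT * FROM parquet_metadata('s3://my-bucket/raw/events.parquet')").show()
```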
AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.
AWS also had customers with petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases – data science rather than traditional data warehousing. The difference masquerades as a question of size. Even at small sizes, there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements differ between those use cases. Data science tends to be single-user and local, with lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. I also think DuckDB can be used for the latter, but it is not optimized for them.
> [...] there is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that 98% of people using Redshift could potentially get by with DuckDB.
Hi, DuckDB devrel here. DuckDB is an analytical SQL database in the form factor of SQLite (i.e., in-process). This quadrant summarizes its space in the landscape:
It works as a replacement for (or complement to) dataframe libraries due to its speed and (vertical) scalability. It's lightweight and dependency-free, so it also works well as part of data processing pipelines.
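To make the dataframe point concrete, here is a tiny made-up example of DuckDB querying a pandas DataFrame in place, in-process, and handing the result back as a DataFrame:

```python
import duckdb
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})

# DuckDB scans the in-memory DataFrame directly (no server, no copy into a database)
# and returns the aggregate as a pandas DataFrame again.
totals = duckdb.sql("""
    SELECT customer, sum(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").df()
print(totals)
```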
Hello, I'd love to use this but I work with highly confidential data. How can we be sure our data isn't leaking with this new UI? What assurances are there on this, and can you comment on the scope of the MotherDuck server interactions?
I'm a co-author of the blog post. I agree that the wording was confusing – apologies for the confusion. I added a note at the end:
> The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.
I have observed the Makefile effect many times with LaTeX documents. Most researchers I worked with had a LaTeX file full of macros that they had been carrying from project to project for years. These were often inherited from more senior researchers and hammered into heavily modified forks of the article templates used in their field or the thesis templates used at their institution.
This is a great instance of the "Makefile effect", with a possible solution: use Markdown and Pandoc where possible. This won't work in every situation, but sometimes one can quickly compose a basic Beamer presentation or LaTeX paper using largely simple TeX plus the same Markdown syntax you already know from GitHub and Reddit.
That won't solve any of the problems that LaTeX macros solve. Boilerplate in LaTeX serves two purposes.
The first is to factor out frequently used complex notation. To do this in Markdown, you'd need to bolt a macro preprocessor onto Markdown.
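For example, the kind of notation macros that tend to get carried around look something like this (a made-up illustration, not from any particular template):

```latex
% Hypothetical notation macros (require amsmath/amssymb). Changing the notation later
% means editing one definition instead of every occurrence in the document.
\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}       % expectation: \E{X + Y}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}  % norm: \norm{x - y}
```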
The second is to fine-tune typography and layout details (tables are a big offender). That simply cannot be done in Markdown: a table is a table, and if you don't like the style (which most of the time is inadequate), there is no recourse.
Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.
That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones which can usually don't run on a laptop.
Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail highlighting how the two differ is omitted. It seems very unsurprising once rephrased to include that detail: "DuckDB, executed multi-threaded and parallelized, is 2.5x faster than single-threaded wc, even though in doing so DuckDB used 9.3x more CPU."
In fact, to me, the only surprising thing here is how poorly DuckDB does compared to wc: 9.3x more CPU for only a 2.5x improvement, i.e., roughly 27% parallel efficiency.
But an interesting analysis regardless of the takeaways – thank you!
Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!
On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an I/O fee? It certainly adds significant latency, but it doesn't have per-operation pricing:
> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.
I can't remember anymore, but it was either (a) the gp2 volumes were way too slow for the workload or (b) the IOPS charges made it a bad deal. To be clear, I didn't do this with DuckDB – I was hosting a Postgres. I moved to Lightsail instead and was happy with it (you don't get attached SSDs in EC2 until you go to instances that are super large).
DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.
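As a rough illustration (the URL and column names are invented, and the httpfs extension is assumed), a query like the one below only needs to fetch the referenced columns, and row groups whose min/max statistics rule out the filter can be skipped entirely:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Projection pushdown: only 'station', 'temperature', and 'year' are read from the file.
# Filter pushdown: row groups whose statistics exclude year = 2024 are not fetched.
con.sql("""
    SELECT station, avg(temperature) AS avg_temp
    FROM read_parquet('https://example.com/data/weather.parquet')
    WHERE year = 2024
    GROUP BY station
""").show()
```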
(I work at DuckDB Labs.)