The YouTube video “Apache Iceberg: What It Is and Why Everyone’s Talking About It” by Tim Berglund explains data lakes really well in the opening minutes: https://www.youtube.com/watch?v=TsmhRZElPvM
~40 years ago, the data warehouse was invented: an overnight ETL process would collect data from smaller operational databases into a central database (the data warehouse).
~15 years ago, the data lake (e.g., Hadoop) emerged to address scaling and other issues. Same idea, but ELT instead of ETL: less focus on schema up front; collect the data into S3 and transform it later.
Thank you for your work! We use DuckDB with dbt-duckdb in production (because we're on-prem and because we don't need tens of thousands of nodes) and we love it! About the COPY statement: does this mean we can drop Parquet files into the blob storage ourselves? From my understanding, DuckLake is responsible for managing the files on the storage layer.
> About the COPY statement: does this mean we can drop Parquet files into the blob storage ourselves?
Dropping the Parquet files into the blob storage will not work – you have to COPY them through DuckLake so that the catalog database is updated with the required metadata.
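For illustration, here is a rough sketch of that flow from Python, following the ATTACH syntax from the DuckLake announcement. The catalog file, bucket, table, and file names are all made up, and it assumes the DuckLake extension is available and S3 credentials are configured:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Attach a DuckLake catalog; the catalog database records which Parquet files
# belong to which table.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/')")

# Loading through DuckLake (CTAS or COPY ... FROM) writes Parquet files under DATA_PATH
# *and* registers them in the catalog; dropping files into the bucket would not do that.
con.sql("CREATE TABLE lake.events AS SELECT * FROM 'local_events.parquet'")
con.sql("COPY lake.events FROM 'more_events.parquet' (FORMAT PARQUET)")
```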
Ah, drat. I have an application that uses DuckDB to output parquet files. That application is by necessity disconnected from any sense of a data lake. But, I would love to have a good way of then pushing them up to S3 and integrating into a data lake. I’ve been looking into Iceberg and I’ve had the thought, “this is great but I hate the idea of what all these little metadata files will do to latency.”
I was also thinking about this use case when reading the announcement. Let's say you have a bunch of Parquet files already (on the local FS, HTTPS, S3, ...) that you can assume are immutable (or maybe append-only). It would be great if you could attach them to DuckLake without copying them! From the design doc, it seems this should essentially work: you would read those Parquet files to compute the metadata and insert references to the files instead of copying them to the storage you manage. Basically, you want to create the catalog independently of the underlying data.
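As a side note, DuckDB already exposes the per-file and per-row-group statistics that such a "register in place" workflow would need to collect. A small sketch (the S3 path is made up, and it assumes the httpfs extension and S3 credentials are set up):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# File-level information: total row count, number of row groups, format version, ...
con.sql("SELECT * FROM parquet_file_metadata('s3://my-bucket/raw/events.parquet')").show()

# Per-column, per-row-group statistics (min/max values, null counts) that a catalog
# could record to enable pruning without rewriting or moving the file.
con.sql("SELECT * FROM parquet_metadata('s3://my-bucket/raw/events.parquet')").show()
```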
AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.
AWS also had customers with petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases – data science rather than traditional data warehousing. The difference masquerades as a question of size. Even at small sizes, there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements differ between those use cases. Data science tends to be single-user and local, with lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. I also think DuckDB can be used for the latter, but it is not optimized for them.
> [...] there is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that 98% of people using Redshift could potentially get by with DuckDB.
Hi, DuckDB devrel here. DuckDB is an analytical SQL database in the form factor of SQLite (i.e., in-process). This quadrant summarizes its space in the landscape:
It works as a replacement for (or complement to) dataframe libraries due to its speed and (vertical) scalability. It's lightweight and dependency-free, so it also works well as part of data processing pipelines.
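To make the dataframe point concrete, here is a tiny made-up example of DuckDB querying a pandas DataFrame in place, in-process, and handing the result back as a DataFrame:

```python
import duckdb
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})

# DuckDB scans the in-memory DataFrame directly (no server, no copy into a database)
# and returns the aggregate as a pandas DataFrame again.
totals = duckdb.sql("""
    SELECT customer, sum(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").df()
print(totals)
```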
Hello, I'd love to use this but I work with highly confidential data. How can we be sure our data isn't leaking with this new UI? What assurances are there on this, and can you comment on the scope of the MotherDuck server interactions?
I'm a co-author of the blog post. I agree that the wording was confusing – apologies for the confusion. I added a note at the end:
> The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.
I have observed the Makefile effect many times with LaTeX documents. Most researchers I worked with had a LaTeX file full of macros that they had been carrying from project to project for years. These were often inherited from more senior researchers and hammered into heavily modified forks of the article templates used in their field or the thesis templates used at their institution.
This is a great instance of the "Makefile effect", with a possible solution: use Markdown and Pandoc where possible. This won't work in every situation, but sometimes one can quickly compose a basic Beamer presentation or LaTeX paper using largely simple TeX plus the same Markdown syntax you already know from GitHub and Reddit.
That won't solve any of the problems that LaTeX macros solve. Boilerplate in LaTeX serves two purposes.
The first is to factor out frequently used complex notation. To do this in Markdown, you'd need to bolt a macro preprocessor onto Markdown.
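For example, the kind of notation macros that tend to get carried around look something like this (a made-up illustration, not from any particular template):

```latex
% Hypothetical notation macros (require amsmath/amssymb). Changing the notation later
% means editing one definition instead of every occurrence in the document.
\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}       % expectation: \E{X + Y}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}  % norm: \norm{x - y}
```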
The second is to fine-tune typography and layout details (tables are a big offender). That simply cannot be done in Markdown: a table is a table, and if you don't like the style (which most of the time is inadequate), there is no recourse.
Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.
That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones which can usually don't run on a laptop.
Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail highlighting how the two differ is omitted. It seems very unsurprising once rephrased to include that detail: "DuckDB, executed multi-threaded and parallelized, is 2.5x faster than single-threaded wc, even though in doing so DuckDB used 9.3x more CPU."
In fact, to me, the only surprising thing here is how poorly DuckDB does compared to wc: 9.3x more CPU for only a 2.5x improvement, i.e., roughly 27% parallel efficiency.
But an interesting analysis regardless of the takeaways – thank you!
Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!
On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an I/O fee? It certainly adds significant latency, but it doesn't have per-operation pricing:
> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.
I can't remember anymore, but it was either (a) the gp2 volumes were way too slow for the workload or (b) the IOPS charges made it a bad deal. To be clear, I didn't do this with DuckDB – I was hosting a Postgres. I moved to Lightsail instead and was happy with it (you don't get attached SSDs in EC2 until you go to instances that are super large).
DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.
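As a rough illustration (the URL and column names are invented, and the httpfs extension is assumed), a query like the one below only needs to fetch the referenced columns, and row groups whose min/max statistics rule out the filter can be skipped entirely:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Projection pushdown: only 'station', 'temperature', and 'year' are read from the file.
# Filter pushdown: row groups whose statistics exclude year = 2024 are not fetched.
con.sql("""
    SELECT station, avg(temperature) AS avg_temp
    FROM read_parquet('https://example.com/data/weather.parquet')
    WHERE year = 2024
    GROUP BY station
""").show()
```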
(I work at DuckDB Labs.)