Very cool. I wonder if the pluggable storage engine also supports query pushdown.
For example, if you do SELECT ... FROM mytable WHERE key = 42, then if this is a Postgres table, you want the key predicate to be pushed down to Postgres so you don't need to scan the whole table. Same for ORDER BY, joins, LIMIT, OFFSET. For joins between foreign tables from different sources, you'd want the optimizer to be aware of table/column stats in order to pick the right join strategy.
SQLite has a "virtual table" mechanism where you can plug in your own engine. But it's relatively simplistic when it comes to pushdown.
It does push down filters! I'm not sure whether other nodes are pushed down; I think for now the database boundary is at the scan node, but you could add additional optimizer/rewrite rules in the postgres/sqlite/mysql extensions to push down other parts of the plan when applicable.
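As a rough sketch of how to see this for yourself (the connection string and table name below are just placeholders): attach a Postgres database through the postgres extension and check the plan with EXPLAIN. If the filter is pushed down, the predicate shows up at the Postgres scan instead of a separate filter over a full scan:

    -- attach a Postgres database via DuckDB's postgres extension
    INSTALL postgres;
    LOAD postgres;
    ATTACH 'dbname=mydb host=localhost user=postgres' AS pg (TYPE POSTGRES);

    -- inspect where the key = 42 predicate ends up in the plan
    EXPLAIN SELECT * FROM pg.mytable WHERE key = 42;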
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
Exactly what I want from DuckDB. I was playing with using it as a quicker cache for a data app backed by Snowflake; I wonder how hard it'd be to write the ATTACH for that vs. doing it at the client level.
I have been wondering how to support interactive / real-time web apps based on Snowflake data. I suppose pushing a subset of the data needed for a chart down to DuckDB would be one way to do this...
If you're pushing down the data, you're losing the real-time capability, no?
If you want fast, ad hoc, real-time querying, load the data directly into DuckDB or ClickHouse as it's created. Now you'll have sub-100ms responses for most of your queries.
I'd assume they mean users interacting with the chart vs. the first load. So the user sees the base chart (let's say 1MB of data on the server, less depending on what gets pushed to the user), and then additional filters, aggregations, etc. are pretty cheap because the server has a local copy to query against.
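A minimal sketch of that pattern (the extract file, table, and columns are invented): pull a pre-aggregated slice out of the warehouse once, cache it in DuckDB, and let the interactive filters hit the local copy:

    -- one-off: cache a pre-aggregated extract locally
    CREATE TABLE chart_cache AS
        SELECT region, day, SUM(revenue) AS revenue
        FROM read_parquet('warehouse_extract.parquet')
        GROUP BY region, day;

    -- interactive filters/aggregations then run against the local copy
    SELECT day, SUM(revenue) AS revenue
    FROM chart_cache
    WHERE region = 'EMEA'
    GROUP BY day
    ORDER BY day;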
I found this feature by accident while comparing the performance of DuckDB and ClickHouse for materializing from an SQLite file: DuckDB was instant, while ClickHouse took seconds to stream the data from SQLite. Then I found out DuckDB actually uses SQLite directly, including the indices that were stored. Once the table was in a ClickHouse table, ClickHouse was twice as fast without an index as DuckDB or SQLite with an index, but the initial load from SQLite took way longer. For one-shot things like a dbt pipeline, DuckDB might make more sense if all input formats are supported.
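Roughly, the DuckDB side of such a comparison might look like this (file and table names are placeholders): read the SQLite file in place via the sqlite extension, then materialize it into a native table:

    -- read an existing SQLite file directly, then materialize into DuckDB
    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'source.db' AS src (TYPE SQLITE);

    CREATE TABLE events AS SELECT * FROM src.events;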
I wish there were a way to hook DuckDB into Postgres itself as an extension.
Currently there's no proper columnar database/extension in PG. In most use cases one doesn't really require a dedicated "analytics"-only DB, and sometimes everything has to be done through the "primary DB" (typically in the world of enterprise products).
By hooking in DuckDB as an extension, we'd get the best of both worlds from PG and DuckDB, retaining persistence and analytical capabilities without hosting yet another piece of DB infrastructure.
+1. I checked Citus, Hydra, etc., and nothing comes close; they all have their restrictions or gotchas.
I also really want an LSM-based storage engine, but no one has implemented one yet.
Yugabyte built a Postgres-compatible database, but some of its management-related stuff is not open source... Cockroach similarly...
It doesn't support writing data. One has to hook up a worker to write data efficiently to Parquet on a regular basis, or in some similar fashion using PG. It's bits and pieces everywhere, but no one has built the end-to-end use case.
Wouldn't you still miss the part about storing tables column-based in your database rather than row-based? It would be nice to have "create table" or "create materialized view" with a columnar storage option.
Hey! I'm one of the makers of pg_analytics/ParadeDB.
We store tables column-based. You can do CREATE TABLE (...) USING deltalake; when using ParadeDB, and it will be stored column-based (and soon, with support for querying cloud object stores as well!).
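For example, a minimal sketch (the column definitions are invented; the USING clause is the relevant part):

    -- columnar table via the deltalake table access method
    CREATE TABLE sales (
        id       BIGINT,
        amount   NUMERIC,
        sold_at  TIMESTAMP
    ) USING deltalake;

    -- analytical scans then read column-oriented storage
    SELECT COUNT(*), SUM(amount) FROM sales;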
DuckDB is in-memory. It makes sense as an in-memory columnar cache. We went with DataFusion to benefit from the deep integration it has into the ecosystem with Arrow, Parquet, etc.
As for Delta Lake, we would've loved to do Iceberg, but the Rust crate for Iceberg is very new and doesn't have enough functionality yet. We're tracking the project and planning to contribute, so that we can add support for Iceberg eventually.
If Kuzu and LanceDB were added, we could have a perfect way of dealing with graphs, vectors, and tables all together in one ecosystem. Honestly, it may be worthy of a thin wrapper project around those.
They would work so incredibly well together to free people from the horrible corporate "cloud" mess/BS everyone keeps trying to push.
I was just reading about this and wanted to ask: how much do the different storage engines affect performance? I assumed you couldn't just plop the DuckDB query engine on top of the PostgreSQL storage engine and get DuckDB performance.
The harsh reality is that in real-world scenarios, different databases will be in different network segments, probably not open to each other. Switching from one DB to another will limit the capabilities.
What are the performance implications, if any? I assume it's just translating queries as needed into the database engine language, in which case I wouldn't expect much of a performance issue.
Can someone provide an example of where someone is using some other tech, but when you get to X records, Y columns, and Z sorts of queries, it makes sense to switch from MySQL/SQLite/Postgres/etc. to DuckDB?
You generally wouldn't "switch" from one to the other; they'd be used in concert for different purposes.
MySQL/SQLite/Postgres fall under the "transactional" database category, and others like DuckDB, Redshift, or ClickHouse are considered "analytical" ones, meant for aggregating large volumes of data.
It's a pretty deep topic so searching for something like "OLTP vs OLAP databases" will give you a lot of reading material
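Loosely, the difference in workloads looks like this (schema invented for illustration):

    -- OLTP: fetch or update a single row by key, many times per second
    SELECT * FROM orders WHERE order_id = 123456;

    -- OLAP: aggregate over millions of rows, a few times per dashboard load
    SELECT customer_region, COUNT(*) AS orders, SUM(total_amount) AS revenue
    FROM orders
    WHERE ordered_at >= DATE '2023-01-01'
    GROUP BY customer_region;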
Other replies are also right--this is in a family of analytics databases that are unusable for single-row transactions but great at reporting over large amounts of data.
The other tool with a similar focus on querying data where and how it already lives is Trino. (Trino is the engine underlying Amazon Athena. Trino was initially released as Presto, which is now the less-active side of a fork; most of the project's founders work on Trino.) ClickHouse, Redshift, and Snowflake are slightly different analytics tools, more focused on querying data stored their own way.
Common themes across the whole family include storing each column of data separately, compressed in large (>=1MB) blocks, often using 'min-max' indices or other types of index that just allow skipping whole irrelevant blocks, not finding individual entries. Even data not in a columnar format is still compressed, minimally 'indexed' (maybe just partitioned by month), and more compact than it typically is in an OLTP DB.
Also, while something like MySQL wants a server to serve many concurrent transactions so would rather avoid a single query using lots of cores and RAM, these analytics DBs are happy to use lots of resources for a short burst.
Combining all the above, I've seen this family of database chew through reporting tasks in seconds that take minutes on a well-provisioned MySQL node.
Compared to the others, DuckDB lives in a host process you provide and does not itself do multi-node deployments. However, nodes get pretty big, and the columnar tricks pack data down tight, so (as much as in OLTP-land or, likely, more) you shouldn't underestimate what one node is capable of doing!
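To give a flavor of how those min-max indices pay off (names here are purely illustrative): when a query filters on a column the data is roughly ordered by, such as an event timestamp, the engine can discard whole compressed blocks just by comparing the predicate against each block's stored min/max, with no row-level index at all:

    -- blocks whose min/max event_time doesn't overlap March
    -- are skipped entirely; only matching blocks are decompressed
    SELECT user_id, COUNT(*) AS events
    FROM events
    WHERE event_time >= TIMESTAMP '2024-03-01'
      AND event_time <  TIMESTAMP '2024-04-01'
    GROUP BY user_id;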
My search engine mixes data in MySQL, SQLite databases, and the Parquet format. The needs and sizes of these data stores are so disparate that there's not really one store that suits them all. I don't need concurrent write safety for crawl data, for example, and enforcing that would make the crawler slower.
This basically lets me access all this data for ad-hoc data munging. I don't think DuckDB will replace these things, so much as help me integrate them for OLAP-style queries.
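Concretely, that kind of ad-hoc munging can look something like this in DuckDB (paths, tables, and columns are made up): joining crawl data stored in Parquet against a table read straight out of an SQLite file:

    -- attach the SQLite file and query Parquet in the same statement
    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'index.db' AS idx (TYPE SQLITE);

    SELECT d.domain, COUNT(*) AS pages
    FROM read_parquet('crawl/*.parquet') AS c
    JOIN idx.domains AS d ON c.domain_id = d.id
    GROUP BY d.domain
    ORDER BY pages DESC
    LIMIT 20;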
You wouldn't switch, but use DuckDB in addition to your system of record to perform analytical queries. PingCAP's TiDB has a simulator to show the kind of queries that an OLAP DB like DuckDB excels at, although TiDB's OLAP is based on ClickHouse (1).