Very cool. I wonder if the pluggable storage engine also supports query pushdown.
For example, if you do SELECT ... FROM mytable WHERE key = 42, then if this is a Postgres table, you want the key predicate to be pushed down to Postgres so you don't need to scan the whole table. Same for ORDER BY, joins, LIMIT, OFFSET. For joins between foreign tables from different sources, you'd want the optimizer to be aware of table/column stats in order to pick the right join strategy.
SQLite has a "virtual table" mechanism where you can plug in your own engine. But it's relatively simplistic when it comes to pushdown.
It does push down filters! I'm not sure whether other nodes are pushed down; I think for now the database boundary is at the scan node, but you could add additional optimizer/rewrite rules in the postgres/sqlite/mysql extensions to push down other parts of the plan when applicable.
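As a rough sketch of how to see this for yourself (the connection string and table name below are just placeholders): attach a Postgres database through the postgres extension and check the plan with EXPLAIN. If the filter is pushed down, the predicate shows up at the Postgres scan instead of a separate filter over a full scan:

    -- attach a Postgres database via DuckDB's postgres extension
    INSTALL postgres;
    LOAD postgres;
    ATTACH 'dbname=mydb host=localhost user=postgres' AS pg (TYPE POSTGRES);

    -- inspect where the key = 42 predicate ends up in the plan
    EXPLAIN SELECT * FROM pg.mytable WHERE key = 42;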
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
Exactly what I want from DuckDB. I was playing with using it as a quicker cache for a data app backed by Snowflake; I wonder how hard it'd be to write the ATTACH for that vs. doing it at the client level.
I have been wondering how to support interactive / real-time web apps based on Snowflake data. I suppose pushing a subset of the data needed for a chart down to DuckDB would be one way to do this...
If you're pushing down the data, you're losing the real-time capability, no?
If you want fast, ad hoc, real-time querying, load the data directly into DuckDB or ClickHouse as it's created. Now you'll have sub-100ms responses for most of your queries.
I'd assume they mean users interacting with the chart vs. the first load. So the user sees the base chart (let's say 1MB of data on the server, less depending on what gets pushed to the user), and then additional filters, aggregations, etc. are pretty cheap because the server has a local copy to query against.
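A minimal sketch of that pattern (the extract file, table, and columns are invented): pull a pre-aggregated slice out of the warehouse once, cache it in DuckDB, and let the interactive filters hit the local copy:

    -- one-off: cache a pre-aggregated extract locally
    CREATE TABLE chart_cache AS
        SELECT region, day, SUM(revenue) AS revenue
        FROM read_parquet('warehouse_extract.parquet')
        GROUP BY region, day;

    -- interactive filters/aggregations then run against the local copy
    SELECT day, SUM(revenue) AS revenue
    FROM chart_cache
    WHERE region = 'EMEA'
    GROUP BY day
    ORDER BY day;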
I found this feature by accident while comparing the performance of DuckDB and ClickHouse for materializing from an SQLite file: DuckDB was instant, while ClickHouse took seconds to stream the data from SQLite. Then I found out DuckDB actually uses SQLite directly, including the indices that were stored. Once the table was in a ClickHouse table, ClickHouse was twice as fast without an index as DuckDB or SQLite with an index, but the initial load from SQLite took way longer. For one-shot things like a dbt pipeline, DuckDB might make more sense if all input formats are supported.
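Roughly, the DuckDB side of such a comparison might look like this (file and table names are placeholders): read the SQLite file in place via the sqlite extension, then materialize it into a native table:

    -- read an existing SQLite file directly, then materialize into DuckDB
    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'source.db' AS src (TYPE SQLITE);

    CREATE TABLE events AS SELECT * FROM src.events;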
I wish there were a way to hook DuckDB into Postgres itself as an extension.
Currently there's no proper columnar database/extension in PG. In most use cases one doesn't really require a dedicated "analytics"-only DB, and sometimes everything has to be done through the "primary DB" (typically in the world of enterprise products).
By hooking in DuckDB as an extension, we'd get the best of both worlds from PG and DuckDB, retaining persistence and analytical capabilities without hosting yet another piece of DB infrastructure.
+1. I checked Citus, Hydra, etc., and nothing comes close; they all have their restrictions or gotchas.
I also really want an LSM-based storage engine, but no one has implemented one yet.
Yugabyte built a Postgres-compatible database, but some of its management-related stuff is not open source... Cockroach similarly...
It doesn't support writing data. One has to hook up a worker to write data efficiently to Parquet on a regular basis, or in some similar fashion using PG. It's bits and pieces everywhere, but no one has built the end-to-end use case.
Wouldn't you still miss the part about storing tables column-based in your database rather than row-based? It would be nice to have "create table" or "create materialized view" with a columnar storage option.
Hey! I'm one of the makers of pg_analytics/ParadeDB.
We store tables column-based. You can do CREATE TABLE (...) USING deltalake; when using ParadeDB, and it will be stored column-based (and soon, with support for querying cloud object stores as well!).
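For example, a minimal sketch (the column definitions are invented; the USING clause is the relevant part):

    -- columnar table via the deltalake table access method
    CREATE TABLE sales (
        id       BIGINT,
        amount   NUMERIC,
        sold_at  TIMESTAMP
    ) USING deltalake;

    -- analytical scans then read column-oriented storage
    SELECT COUNT(*), SUM(amount) FROM sales;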
DuckDB is in-memory. It makes sense as an in-memory columnar cache. We went with DataFusion to benefit from the deep integration it has into the ecosystem with Arrow, Parquet, etc.
As for Delta Lake, we would've loved to do Iceberg, but the Rust crate for Iceberg is very new and doesn't have enough functionality yet. We're tracking the project and planning to contribute, so that we can add support for Iceberg eventually.
If Kuzu and LanceDB were added, we could have a perfect way of dealing with graphs, vectors, and tables all together in one ecosystem. Honestly, it may be worthy of a thin wrapper project around those.
They would work so incredibly well together to free people from the horrible corporate "cloud" mess/BS everyone keeps trying to push.
I was just reading about this and wanted to ask: how much do the different storage engines affect performance? I assumed you couldn't just plop the DuckDB query engine on top of the PostgreSQL storage engine and get DuckDB performance.
The harsh reality is that in real-world scenarios, different databases will be in different network segments, probably not open to each other. Switching from one DB to another will limit the capabilities.
What are the performance implications, if any? I assume it's just translating queries as needed into the database engine language, in which case I wouldn't expect much of a performance issue.
Can someone provide an example of where someone is using some other tech, but when you get to X records, Y columns, and Z sorts of queries, it makes sense to switch from MySQL/SQLite/Postgres/etc. to DuckDB?
You generally wouldn't "switch" from one to the other; they'd be used in concert for different purposes.
MySQL/SQLite/Postgres fall under the "transactional" database category, and others like DuckDB, Redshift, or ClickHouse are considered "analytical" ones, meant for aggregating large volumes of data.
It's a pretty deep topic so searching for something like "OLTP vs OLAP databases" will give you a lot of reading material
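Loosely, the difference in workloads looks like this (schema invented for illustration):

    -- OLTP: fetch or update a single row by key, many times per second
    SELECT * FROM orders WHERE order_id = 123456;

    -- OLAP: aggregate over millions of rows, a few times per dashboard load
    SELECT customer_region, COUNT(*) AS orders, SUM(total_amount) AS revenue
    FROM orders
    WHERE ordered_at >= DATE '2023-01-01'
    GROUP BY customer_region;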
Other replies are also right--this is in a family of analytics databases that are unusable for single-row transactions but great at reporting over large amounts of data.
The other tool with a similar focus on querying data where and how it already lives is Trino. (Trino is the engine underlying Amazon Athena. Trino was initially released as Presto, which is now the less-active side of a fork; most of the project's founders work on Trino.) ClickHouse, Redshift, and Snowflake are slightly different analytics tools, more focused on querying data stored their own way.
Common themes across the whole family include storing each column of data separately, compressed in large (>=1MB) blocks, often using 'min-max' indices or other types of index that just allow skipping whole irrelevant blocks, not finding individual entries. Even data not in a columnar format is still compressed, minimally 'indexed' (maybe just partitioned by month), and more compact than it typically is in an OLTP DB.
Also, while something like MySQL wants a server to serve many concurrent transactions so would rather avoid a single query using lots of cores and RAM, these analytics DBs are happy to use lots of resources for a short burst.
Combining all the above, I've seen this family of database chew through reporting tasks in seconds that take minutes on a well-provisioned MySQL node.
Compared to the others, DuckDB lives in a host process you provide and does not itself do multi-node deployments. However, nodes get pretty big, and the columnar tricks pack data down tight, so (as much as in OLTP-land or, likely, more) you shouldn't underestimate what one node is capable of doing!
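To give a flavor of how those min-max indices pay off (names here are purely illustrative): when a query filters on a column the data is roughly ordered by, such as an event timestamp, the engine can discard whole compressed blocks just by comparing the predicate against each block's stored min/max, with no row-level index at all:

    -- blocks whose min/max event_time doesn't overlap March
    -- are skipped entirely; only matching blocks are decompressed
    SELECT user_id, COUNT(*) AS events
    FROM events
    WHERE event_time >= TIMESTAMP '2024-03-01'
      AND event_time <  TIMESTAMP '2024-04-01'
    GROUP BY user_id;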
My search engine mixes data in MySQL, SQLite databases, and the Parquet format. The needs and sizes of these data stores are so disparate that there's not really one store that suits them all. I don't need concurrent write safety for crawl data, for example, and enforcing that would make the crawler slower.
This basically lets me access all this data for ad-hoc data munging. I don't think DuckDB will replace these things, so much as help me integrate them for OLAP-style queries.
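Concretely, that kind of ad-hoc munging can look something like this in DuckDB (paths, tables, and columns are made up): joining crawl data stored in Parquet against a table read straight out of an SQLite file:

    -- attach the SQLite file and query Parquet in the same statement
    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'index.db' AS idx (TYPE SQLITE);

    SELECT d.domain, COUNT(*) AS pages
    FROM read_parquet('crawl/*.parquet') AS c
    JOIN idx.domains AS d ON c.domain_id = d.id
    GROUP BY d.domain
    ORDER BY pages DESC
    LIMIT 20;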
You wouldn't switch, but use DuckDB in addition to your system of record to perform analytical queries. PingCAP's TiDB has a simulator to show the kind of queries that an OLAP DB like DuckDB excels at, although TiDB's OLAP is based on ClickHouse (1).