An overview of distributed Postgres architectures (crunchydata.com)
278 points by eatonphil on Jan 8, 2024 | 50 comments



First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distributed key-value store with SQL" :)

The cons of the mentioned distributed shared-nothing SQL databases are questionable:

- "Key-value store" is in fact an LSM-tree-based document store that supports column-level versioning (Postgres supports row-level versioning only).

- "Many internal operations incur high latency." - I guess this conclusion is based on the referenced Gigaom benchmark that was paid for by Microsoft to compare apples to oranges.

- "No local joins in current implementations." (YugabyteDB certainly has colocated tables that store a whole table on a single node. CockroachDB and Spanner might do this as well.)

- "Not actually PostgreSQL..." - There is only one 100% compatible database with Postgres...It's Postgres itself. Citus, CockroachDB, Aurora, Alloy, YugabyteDB, and others can be classified as "not actually Postgres."

- "And less mature and optimized." - Well, both CockroachDB and YugabyteDB are on Gartner's Magic Quadrant of the top 20 cloud databases. Toys don't get there.

It feels like the author joined Crunchy to work on their own distributed version of Postgres. Good move for Crunchy, good luck!


I think this con is very real:

> Related tables and indexes are not necessarily stored together, meaning typical operations such as joins and evaluating foreign keys or even simple index lookups might incur an excessive number of internal network hops. The relatively strong transactional guarantees that involve additional locks and coordination can also become a drag on performance.

You handwaved this away saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.

Edit: Also, before attacking the author's biases, it seems fair to disclose that you appear to work at Yugabyte.


In the case of YugabyteDB, here is how we avoid an "excessive number" of network hops:

- True Index Only Scan. PostgreSQL doesn't store MVCC visibility information in indexes and has to look at the table even for an Index Only Scan. YugabyteDB has a different implementation of MVCC with no bloat, no vacuum, and a true Index Only Scan (a sketch follows after this list). Here is an example: https://dev.to/yugabyte/boosts-secondary-index-queries-with-... This is also used for reference tables (duplicate covering indexes in each region).

- Batching reads and writes. It is not a problem to add 10ms because you join two tables or check a foreign key. What would be problematic is doing that for each row. YugabyteDB batches the read/write operations as much as possible. Here are two examples: https://dev.to/franckpachot/series/25365

- Pushdowns to avoid sending rows that are discarded later. Each node can apply PostgreSQL expressions to offload filtering to the storage nodes. Examples: https://dev.to/yugabyte/yugabytedb-predicate-push-down-pbb

- Loose index scan. With YugabyteDB LSM-Tree indexes, one index scan can read multiple ranges, which avoids multiple roundtrips. An example: https://dev.to/yugabyte/select-distinct-pushdown-to-do-a-loo...

- Locality of the transaction table. If a transaction touches only one node, zone, or region, a local transaction table is used, and it is promoted to the right level depending on what the transaction reads and writes.
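
To make the Index Only Scan point concrete, here is a minimal sketch of the covering-index pattern (table and column names are made up):

    CREATE INDEX orders_by_customer ON orders (customer_id)
        INCLUDE (order_date, total);

    -- this can be answered from the index alone; in YugabyteDB that means
    -- no extra hop to the table's tablets for visibility checks
    EXPLAIN SELECT order_date, total FROM orders WHERE customer_id = 42;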

Most of the time when I've seen people asking to store tables together, it was premature optimization, based on opinions rather than facts. When they try (with the right indexes, of course) they appreciate that the distribution is an implementation detail that the application doesn't have to know about. Of course, there are more and more optimizations in each release. If you have a PostgreSQL application and see low performance, please open a GitHub issue.

I'm also working for Yugabyte as a Developer Advocate. I don't always feel the need to mention it, as I'm writing about facts, not marketing opinions, and who pays my salary has no influence on the response time I see in execution plans ;)


Hey Franck, just wanted to say I appreciate your database writings. I read a whole bunch of them over the years, and always found them interesting and educational.


> You handwaved this away, saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.

I just clarified one-liners listed under the closing "Cons" section. My intention was not to say that the author is utterly wrong. Marco is a recognized expert in the Postgres community. It only feels like he was too opinionated about distributed SQL DBs while wearing his Citus hat.

> Also, before attacking the author's biases, it seems fair to disclose that you appear to work at Yugabyte.

I'm sorry if I sounded biased in my response. I'm with the YugabyteDB team right now, but that's not my first database company, and I bet it won't be my last. Thus, when I respond on my personal accounts, I try to be as objective as possible and don't bother mentioning my current employment.

Anyway, I'm very glad to see that this article got traction on HN. As a Postgres community member, I truly love what's happening with the database and its ecosystem. The healthy competition within the Postgres ecosystem is a big driver of growth for a database that's becoming the Linux of databases.


>> First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distributed key-value store with SQL"

That was the first thing that came to my mind when I read the papers on Spanner and CockroachDB (I haven't read the YugabyteDB paper yet), and surely I'm not the only one.


Referring to Distributed SQL as a key-value store is like defining a monolithic database as block storage. It reduces it to its internal structure. What makes it a database is what is on top: ACID transactions, full SQL features, relational tables, JSON documents, foreign keys... which are not available when sharding is done on top of SQL.


In my experience, Postgres HA stories are always buggy and disappointing for homegrown systems. None of these tools actually gives you solid HA like proprietary databases do. Managed database services can reduce or eliminate some of the problems, but you can't put critical parts of your system on the cloud.


Having run Oracle RAC, MySQL circular replication, Postgres replication since 9.X, Amazon RDS, and others in production, I can say that Oracle RAC was the worst experience.

We had to provide our own active/active storage backend and fabric. If there were any hiccups, the entire system fell over. The horizontal scalability was nice, but if you caused I/O saturation on the backend, you'd end up knocking over the entire cluster due to shared components. Several times, the entire DB just "broke" (a couple of scenarios: one where the DB was using 100% of CPU, one where it was frozen and not accepting connections, and things like hung queries that couldn't be stopped), and it required restarting the whole cluster, for which there was minimal tooling.

Perhaps SQL Server is better, but that comes with an entirely new ecosystem, and other problems.

My biggest thing is being able to crack open the DB and look at the source code when it breaks. In this, the likes of DataStax Cassandra and CockroachDB are great, but I wouldn't call them "proprietary" by any means.


To this day only Oracle RAC provides a true active-active cluster. All the other options are not real clusters, just replication.


Err, CockroachDB would like a word... although I'm not sure if it's just data serving that's active-active, or if query execution is distributed as well.

Also, Cassandra is active-active as well.

I'm sure there are others, but I'm less familiar with them (Yugabyte, Couchbase, etc.)


Active-active is when you don't have any replicas: you have just one data store and multiple 'engines' accessing it. So there is no replication, because there is nothing to replicate. So CockroachDB is definitely not active-active, despite their claims to the contrary; what they call active-active is two-way synchronous replication, which is worse than asynchronous replication because of the latency, and worse than Oracle RAC active-active because it's less fault-tolerant. In Oracle RAC your cluster is up and running even if just one engine is still alive; in the CockroachDB approach, when you lose a majority of nodes the cluster goes down. Obviously Oracle RAC active-active has its own downsides, too, but that's another topic.

EDIT to expand: Think about a SQL transaction. In Oracle RAC it gets picked up by any of the nodes and is executed by that node. All the other nodes see the updated data immediately. In CockroachDB the same transaction would need to be executed by each of the nodes separately, plus the overhead of communication between the nodes.


You have a misconception about the CockroachDB and YugabyteDB architecture. It is not two-way replication: writes go to a Raft group. The transaction can be picked up by any node because the transaction intents are also replicated. The main difference with RAC is that the current version of a row is read/written on the Raft leader and replicated to followers (which can be elected as the new leader). In RAC, the block with the current version of a row is moved back and forth between the instances. That's possible only with a reliable, short-distance, dedicated network. Distributed SQL can scale out to multiple availability zones; RAC cannot.


> In Oracle RAC your cluster is up and running even if just one engine is still alive

So it's not consistent in the CAP theorem sense. Comparing it with other databases that don't lose writes and don't have split-brain issues makes no sense.


Oracle RAC is a shared-storage system: every database instance writes to the same datastore, so as long as a transaction is committed by one node, it'll be seen by the other nodes.

Of course, the issue with all shared-storage systems is how much it costs to have reliable and fast shared storage.


It is always consistent because it is a single-storage database with multiple access points. It's not distributed data, so the CAP theorem doesn't even apply to it.


> you can't put critical parts of your system on the cloud

This seems highly dependent on how you define “critical”. I think most people’s definition allows for everything to be in the cloud.


There are good and solid HA options in the Postgres ecosystem that are fully open. You can use Patroni for standard Postgres, or YugabyteDB, which is fault-tolerant by design.

Specifics are here: https://www.yugabyte.com/postgresql/postgresql-high-availabi...


What would you recommend instead?


If you have enough budget, go for proprietary databases. Stack Overflow runs on a single SQL Server instance with a hot standby. The tooling is just much better and easier to use. The savings on maintenance alone make it worth it.

The free alternative would be MySQL/MariaDB + Galera Cluster. Not as solid as the proprietary ones, but far easier to use and less buggy than Postgres + tons of tools.


>> The free alternative would be MySQL/MariaDB + Galera Cluster. Not as solid as the proprietary ones, but far easier to use and less buggy than Postgres + tons of tools.

Until someone accidentally runs an expensive DDL on your Galera Cluster: now your cluster is down for hours with no way to cancel that query except nuking the entire database and restoring from backup.


Very good overview and easy to read. In recent years I've encountered more and more businesses that use or want to use Postgres instead of other proprietary or NoSQL databases, so this post is very useful.


The article is a marketing promotion for Citus. They ignore Neon, a major innovation in distributed cloud storage solutions. They reduce Distributed SQL to key-value stores, when it is actually the only distributed architecture that preserves all SQL features (ACID, consistent reads, referential integrity, global indexes).

And the only fact about performance is a benchmark comparing elastic and resilient distributed SQL to non-HA Citus running on larger machines.


Big parts of the article cover pretty generic concepts and only mention specific products as examples. I haven't heard of Neon yet. What does Neon do differently? Is there a good introduction/overview?

Good, fair and reproducible benchmarks are a rarity. Do you have any (independent) benchmarks that compare different distributed PostgreSQL-based solutions?


Here is a blog post that should be a good dive into the Neon storage engine. I work at Neon, and am happy to answer any other questions you might have.

https://neon.tech/blog/get-page-at-lsn


All benchmarks are biased because nobody does the same level of config tuning on different databases. But this GigaOm/Microsoft one is a joke. They run Citus with no replication on 12x16+1x32 vCores and the others on 14x16 vCPU (so 14x8 cores) with replication factor 3.


Without good benchmarks it's hardly possible to compare different products.

But I agree that this benchmark seems questionable at best, then. I usually only work with on-premise deployments, so I don't know the details of all the cloud offerings. People will probably judge products on what is available in the first few search results, though.


I fail to see this as a marketing promotion, though; it doesn't seem to have much bias.

Though I do agree that calling distributed SQL a key-value store is inaccurate and maybe a bit inappropriate.


Neon is a very confusing name, and honestly needs to be changed to something that can be distinct in the marketplace.


I mentioned Neon because it is open source and the article says "Not OSS" for "DBMS-optimized cloud storage". Think of it as an open-source Aurora: https://neon.tech/


I think Neon is KDE's Linux distribution.


> Guideline: the durability and availability benefits of network-attached storage usually outweigh the performance downsides, but it’s worth keeping in mind that PostgreSQL can be much faster.

I think it often goes overlooked just how slow network-attached block storage is, though, and some organizations get very surprised when moving from an on-prem data center to the cloud.


This is great.

One thing I'd add is a sense of scale - are these architectures for 100 queries per second, or 100,000, or 100,000,000?


Marco (the author) is probably asleep at this point and could give a deeper perspective. He sort of hits on this when talking about disk latency... Depending on your setup, and just from some personal experience, I know it's not crazy for Postgres queries to run at 1 ms per query. From there you can start to do some math on how many cores, how many queries per second, etc.

Single-node Postgres (on a beefy machine) can definitely manage on the order of 100k transactions per second. When you're pushing the high 100k range into millions, read replicas are a common approach.

When we're talking transactions, there is the question of whether it's simple queries or bigger aggregations, and whether it's writes or reads. For writes, if you can manage any form of multi-row insert or batching with COPY, you can push basic Postgres really far... From some benchmarks, Citus (as mentioned) can hit millions of records per second safely with those approaches, and even without Citus you can get pretty high write throughput.
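
As a rough illustration of the batching point (table and file names are made up):

    -- one round trip for many rows instead of one INSERT per row
    INSERT INTO measurements (device_id, ts, value)
    VALUES (1, now(), 0.5),
           (2, now(), 0.7),
           (3, now(), 0.9);

    -- or bulk-load from a client-side CSV file via psql's \copy
    \copy measurements FROM 'measurements.csv' WITH (FORMAT csv)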


The "disappointing" benchmark mentioned in the article is a shame for GigaOm who published it and for Microsoft who paid for it. They compare Citus with no HA to CockroachDB and YugabyteDB with replication factor 3 Multi-AZ, resilient to data center failure. And they run Citus on 16 cores (=32 vCPU) and the others on 16 vCPU. But your point about "beefy machine" shows the real advantages of Distributed SQL. PostgreSQL and Citus needs downtime to save cost if you don't need that beefy machine all days all year. Scale up and down is downtime, as well as upgrades. Distributed SQL offers elasticity (no downtime to resize the cluster) and high availability (no downtime on failure or maintenance)


RE: "Distributed SQL offers elasticity (no downtime resize"). I'm not sure this is as much of an advantage of distributed databases vs single host databases anymore. Some of the tech to move virtual machines between machines quickly (without dropping TCP connections) is pretty neat. Neon has a blog post about it here[1]. Aurora Serverless V2 does the same thing (but I can't find a detailed technical blog post talking about how it works). Your still limited by "one big host" but its no longer as big of a deal to scale your compute up/down within that limit.

[1] https://neon.tech/blog/scaling-serverless-postgres


Second yes to that - warm PostgreSQL with plenty of RAM can do some fancy things and return an answer sub-millisecond too.

cache is King


But a large cache is expensive in the cloud, and you cannot scale up/down without downtime.


4 TB of RAM is only $71 per hour on AWS RDS. If you're at planetary scale, that's not bad.


Scalability is not the only reason for jumping on a distributed Postgres version.

Some apps might do just 1,000 ops/second but still run on a distributed database for high availability or data locality reasons. For instance, shared-nothing databases usually guarantee RPO=0 (recovery point objective, i.e. no data loss) with an RTO (recovery time objective) measured in seconds for zone- and region-level outages. As for data locality, think automatic data placement/pinning to regions/data centers for regulatory and low-latency reasons (serve read/write requests equally fast for folks living in NYC, London, and Tokyo).
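
On the data locality point, the generic shape is list partitioning by region. A sketch in plain Postgres syntax (table and region names are made up; how partitions get pinned to regions varies by product):

    CREATE TABLE users (
        id     bigint NOT NULL,
        region text   NOT NULL,
        email  text,
        PRIMARY KEY (id, region)   -- the partition key must be part of the PK
    ) PARTITION BY LIST (region);

    CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu');
    CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us');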


Any reason you can't achieve those RPO/RTO with straightforward replication?


You can achieve RPO=0 with Postgres using synchronous logical replication. You would need to replicate to 2+ standbys, because if there is only one standby and it goes down, the primary will get stuck. For failover you would need Patroni or a comparable tool, but I don't know what the RTO is.

But once you outgrow the primary/standby servers' storage or compute capacity, you would need to scale to larger machines, which can incur downtime. With distributed Postgres such as YugabyteDB this is not going to happen, because you can scale horizontally.
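
For illustration, a minimal sketch of a quorum synchronous-commit setup on the primary (shown with physical standbys; the standby names are placeholders):

    -- a commit waits for any one of the two listed standbys to confirm,
    -- so a single standby failure doesn't block the primary
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();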


Funny that it mentions Transparent sharding (with the downsides thereof), but not other kinds of sharding.


Sharding above SQL is not transparent (the application has to do it). Sharding below SQL is what Distributed SQL does (what they call distributed key-value storage, ignoring that the SQL transactions are also distributed).


Under the “DBMS-optimized cloud storage” section, won’t replicating the WAL across various different availability zones entail doing an atomic broadcast if you want any isolation?

It seems to me that you would need to run some sort of consensus algorithm to ensure the replication is consistent, but that's obviously very expensive in terms of latency. Is it actually done this way?


They mention parallelizing them, so it's only as expensive as the slowest write.


Aurora writes each block to 6 replicas and waits for a quorum of 4 to acknowledge. This has a small impact on commit time, which you can see as IO:XactSync (https://dev.to/aws-heroes/aws-aurora-io-xactsync-is-not-a-po...), but provides good HA.


Video of the talk is now up: https://www.youtube.com/watch?v=hcUeMo4lpsw


Why not include Redshift in the mix?


Redshift is a data warehouse; it's not suitable for OLTP use cases.



