First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distributed key-value store with SQL" :)
The cons of the mentioned distributed shared-nothing SQL databases are questionable:
- "Key-value store" is in fact an LSM-tree-based document store that supports column-level versioning (Postgres supports row-level versioning only).
- "Many internal operations incur high latency." - I guess this conclusion is based on the referenced Gigaom benchmark that was paid for by Microsoft to compare apples to oranges.
- "No local joins in current implementations." (YugabyteDB certainly has colocated tables that store a whole table on a single node. CockroachDB and Spanner might do this as well.)
- "Not actually PostgreSQL..." - There is only one 100% compatible database with Postgres...It's Postgres itself. Citus, CockroachDB, Aurora, Alloy, YugabyteDB, and others can be classified as "not actually Postgres."
- "And less mature and optimized." - Well, both CockroachDB and YugabyteDB are on Gartner's Magic Quadrant of the top 20 cloud databases. Toys don't get there.
It feels like the author joined Crunchy to work on their own distributed version of Postgres. Good move for Crunchy, good luck!
> Related tables and indexes are not necessarily stored together, meaning typical operations such as joins and evaluating foreign keys or even simple index lookups might incur an excessive number of internal network hops. The relatively strong transactional guarantees that involve additional locks and coordination can also become a drag on performance.
You handwaved this away, saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.
Edit: Also, before attacking the author's biases, it seems fair to disclose that you appear to work at Yugabyte.
In the case of YugabyteDB, here is how we avoid an "excessive number" of network hops:
- True Index Only Scan. PostgreSQL doesn't store MVCC visibility information in indexes and may have to look at the table even for an Index Only Scan. YugabyteDB has a different MVCC implementation with no bloat, no vacuum, and a true Index Only Scan. Here is an example: https://dev.to/yugabyte/boosts-secondary-index-queries-with-... This is also used for reference tables (duplicate covering indexes in each region). A minimal sketch follows this list.
- Batching reads and writes. It is not a problem to add 10ms because you join two tables or check a foreign key. What would be problematic is doing that for each row. YugabyteDB batches the read/write operations as much as possible. Here are two examples: https://dev.to/franckpachot/series/25365
- Locality of the transaction table. If a transaction touches only one node, zone, or region, a local transaction table is used, and it is promoted to the right level depending on what the transaction reads and writes.
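For the Index Only Scan point, here is a minimal vanilla-PostgreSQL sketch with a covering index (table and column names are made up). On stock Postgres the "Heap Fetches" counter shows how often the index alone was not enough, which is exactly the difference described above:

    -- Covering index: the query below can be answered from the index alone.
    CREATE INDEX orders_customer_covering
      ON orders (customer_id) INCLUDE (order_total);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT order_total FROM orders WHERE customer_id = 42;
    -- Expect "Index Only Scan using orders_customer_covering";
    -- on PostgreSQL, watch the "Heap Fetches" counter that YugabyteDB avoids.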
Most of the time when I've seen people asking to store tables together, it was premature optimization, based on opinions rather than facts. When they try (with the right indexes, of course), they appreciate that the distribution is an implementation detail that the application doesn't have to know about. Of course, there are more and more optimizations in each release. If you have a PostgreSQL application and see low performance, please open a GitHub issue.
I'm also working for Yugabyte as a Developer Advocate. I don't always feel the need to mention it, as I'm writing about facts, not marketing opinions, and who pays my salary has no influence on the response times I see in execution plans ;)
Hey Franck, just wanted to say I appreciate your database writings. I read a whole bunch of them over the years, and always found them interesting and educational.
> You handwaved this away, saying you can just store an entire table on a single node, but that defeats many of the benefits of these sharded SQL databases.
I just clarified the one-liners listed under the closing "Cons" section. My intention was not to say that the author is utterly wrong. Marco is a recognized expert in the Postgres community. It just feels like he was too opinionated about distributed SQL DBs while wearing his Citus hat.
> Also, before attacking the author's biases, it seems fair to disclose that you appear to work at Yugabyte.
I'm sorry if I sounded biased in my response. I'm with the YugabyteDB team right now, but it's not my first database company, and I bet it won't be my last. Thus, when I respond from my personal accounts, I try to be as objective as possible and don't bother mentioning my current employment.
Anyway, I'm very glad to see that this article got traction on HN. As a Postgres community member, I truly love what's happening with the database and its ecosystem. The healthy competition within the Postgres ecosystem is a big driver of growth for the database, which is becoming the Linux of databases.
>> First time seeing someone call Spanner, CockroachDB, and YugabyteDB a "distributed key-value store with SQL"
That was the first thing that came to my mind when I read the papers on Spanner and CockroachDB (I haven't read the YugabyteDB one yet), and surely I'm not the only one.
Referring to Distributed SQL as a key-value store is like defining monolithic databases as block storage. It reduces them to an internal structure.
What makes it a database is what sits on top: ACID transactions, full SQL features, relational tables, JSON documents, foreign keys... none of which are available when sharding is done on top of SQL.
In my experience, the HA stories for Postgres are always buggy and disappointing for homegrown systems. None of these tools actually gives you solid HA like proprietary databases do. Managed database services can reduce or eliminate some of the problems, but you can't put critical parts of your system in the cloud.
Having run Oracle RAC, MySQL circular replication, Postgres replication since 9.X, Amazon RDS, and others in production, I can say that Oracle RAC was the worst experience.
We had to provide our own active/active storage backend and fabric. If there were any hiccups, the entire system fell over. The horizontal scalability was nice, but if you caused I/O saturation on the backend, you'd end up knocking over the entire cluster due to shared components. Several times the entire DB just "broke" (a couple of scenarios: one where the DB was using 100% of CPU, one where the DB was frozen and not allowing connections, and things like hung queries that couldn't be stopped), and it required restarting the whole cluster, for which there was minimal tooling.
Perhaps SQL Server is better, but that comes with an entirely new ecosystem and other problems.
My biggest thing is being able to crack open the DB and look at the source code when it breaks. In this, the likes of DataStax Cassandra and CockroachDB are great, but I wouldn't call them "proprietary" by any means.
Active-active is when you don't have any replicas; you have just one data store and multiple 'engines' accessing it. So there is no replication, because there is nothing to replicate. So CockroachDB is definitely not active-active, despite their claims to the contrary; what they call active-active is two-way synchronous replication, which is worse than asynchronous replication because of the latency, and worse than Oracle RAC active-active because it's less fault-tolerant. In Oracle RAC your cluster is up and running even if just one engine is still alive; in the CockroachDB approach, when you lose a majority of nodes the cluster goes down.
Obviously Oracle RAC active-active has its own downsides too, but that's another topic.
EDIT to expand: Think about a SQL transaction. In Oracle RAC it gets picked up by any of the nodes and executed by that node. All the other nodes see the updated data immediately. In CockroachDB the same transaction would need to be executed by each of the nodes separately, plus the overhead of communication between the nodes.
You have a misconception about the CockroachDB and YugabyteDB architecture. It is not two-way replication; writes go to a Raft group. The transaction can be picked up by any node because the transaction intents are also replicated. The main difference with RAC is that the current version of a row is read/written on the Raft leader and replicated to followers (which can be elected as the new leader). In RAC, the block with the current version of a row is moved back and forth between the instances. That's possible only with a reliable, short-distance, dedicated network. Distributed SQL can scale out to multiple availability zones; RAC cannot.
> In Oracle RAC your cluster is up and running even if just one engine is still alive
So it's not consistent in the CAP theorem sense. Comparing it with other databases that don't lose writes and don't have split-brain issues makes no sense.
Oracle RAC is a shared-storage system; every database instance writes to the same datastore, so as long as a transaction is committed by one node, it'll be seen by the other nodes.
Of course, the issue with all shared-storage systems is how much it costs to have reliable and fast shared storage.
It is always consistent because it is a single-storage database with multiple access points. The data is not distributed, so the CAP theorem doesn't even apply to it.
There are good and solid HA options in the Postgres ecosystem that are fully open. You can use Patroni for standard Postgres, or YugabyteDB, which is fault-tolerant by design.
If you have enough budget, go for proprietary databases. Stack Overflow runs on a single SQL Server instance with a hot standby. The tooling is just much better and easier to use; the savings on maintenance alone are worth it.
The free alternative would be MySQL/MariaDB + Galera Cluster. Not as solid as the proprietary ones, but far easier to use and less buggy than Postgres + tons of tools.
>> The free alternative would be MySQL/MariaDB + Galera Cluster. Not as solid as the proprietary ones, but far easier to use and less buggy than Postgres + tons of tools.
Until someone accidentally runs an expensive DDL on your Galera Cluster: now your cluster is down for hours with no way to cancel that query, except nuking the entire database and restoring from backup.
Very good overview and easy to read. In recent years I've encountered more and more businesses that use or want to use Postgres instead of other proprietary or NoSQL databases, so this post is very useful.
The article is a marketing promotion for Citus. It ignores Neon, a major innovation in distributed cloud storage solutions. It reduces Distributed SQL to key-value stores, when it is actually the only distributed architecture that preserves all SQL features (ACID, consistent reads, referential integrity, global indexes).
And the only fact about performance is a benchmark comparing elastic and resilient Distributed SQL to non-HA Citus running on larger machines.
Big parts of the article are pretty generic concepts and only mention other products as an example. Haven't heard of Neon yet. What does Neon do differently? Is there a good introduction/overview?
Good, fair and reproducible benchmarks are a rarity. Do you have any (independent) benchmarks that compare different distributed PostgreSQL-based solutions?
Here is a blog post that should be a good dive into the Neon storage engine. I work at Neon, and am happy to answer any other questions you might have.
All benchmarks are biased because nobody does the same level of config tuning on different databases. But this GigaOm/Microsoft one is a joke. They run Citus with no replication on 12x16+1x32 vCores and the others on 14x16 vCPU (so 14x8 cores) with replication factor 3.
Without good benchmarks it's hardly possible to compare different products.
But I agree that this benchmark seems questionable at best, then. I usually only work with on-premise deployments, so I don't know the details of all the cloud offerings. People will probably judge products on what is available in the first few search results, though.
I mentioned Neon because it is open source and the article says "Not OSS" for "DBMS-optimized cloud storage". Think of it as an open-source Aurora.
https://neon.tech/
> Guideline: the durability and availability benefits of network-attached storage usually outweigh the performance downsides, but it’s worth keeping in mind that PostgreSQL can be much faster.
I think it often goes overlooked just how slow network-attached block storage is, though, and some organizations get very surprised when moving from an on-prem data center to the cloud.
Marco (the author) is probably asleep at this point and could give a deeper perspective. He sort of hits on this when talking about disk latency... Depending on your setup, and just from some personal experience, I know it's not crazy for Postgres queries to run at 1ms per query. From there you can start to do some math on how many cores, how many queries per second, etc.
Single-node Postgres (with a beefy machine) can definitely manage on the order of 100k transactions per second. When you're pushing the high 100k into the millions, read replicas are a common approach.
When we're talking transactions, the question is whether it's simply basic queries or bigger aggregations, and whether it's writes or reads. For writes, if you can manage any form of multi-row insert or batching with COPY, you can push basic Postgres really far... From some benchmarks, Citus as mentioned can hit millions of records per second safely with those approaches, and even without Citus you can get pretty high write throughput.
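As a rough sketch of that kind of batching (table and column names are made up), a multi-row INSERT or COPY amortizes the per-statement and per-commit overhead:

    -- One round trip and one commit for many rows instead of one per row.
    INSERT INTO events (device_id, recorded_at, payload)
    VALUES (1, now(), '{"t": 20.1}'),
           (2, now(), '{"t": 19.7}'),
           (3, now(), '{"t": 21.4}');

    -- COPY streams rows in bulk, e.g. from a CSV file on the client side:
    -- \copy events (device_id, recorded_at, payload) FROM 'events.csv' WITH (FORMAT csv)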
The "disappointing" benchmark mentioned in the article is a shame for GigaOm who published it and for Microsoft who paid for it. They compare Citus with no HA to CockroachDB and YugabyteDB with replication factor 3 Multi-AZ, resilient to data center failure. And they run Citus on 16 cores (=32 vCPU) and the others on 16 vCPU.
But your point about "beefy machine" shows the real advantages of Distributed SQL. PostgreSQL and Citus needs downtime to save cost if you don't need that beefy machine all days all year. Scale up and down is downtime, as well as upgrades. Distributed SQL offers elasticity (no downtime to resize the cluster) and high availability (no downtime on failure or maintenance)
RE: "Distributed SQL offers elasticity (no downtime resize"). I'm not sure this is as much of an advantage of distributed databases vs single host databases anymore. Some of the tech to move virtual machines between machines quickly (without dropping TCP connections) is pretty neat. Neon has a blog post about it here[1]. Aurora Serverless V2 does the same thing (but I can't find a detailed technical blog post talking about how it works). Your still limited by "one big host" but its no longer as big of a deal to scale your compute up/down within that limit.
Scalability is not the only reason for jumping on a distributed Postgres version.
Some apps might do just 1000 ops/second but still run on a distributed database for high-availability or data-locality reasons. For instance, shared-nothing databases usually guarantee RPO=0 (no data loss; recovery point objective) with an RTO (recovery time objective) measured in seconds for zone- and region-level outages. As for data locality, think automatic data placement/pinning to regions/data centers for regulatory and low-latency reasons (serving read/write requests equally fast for folks living in NYC, London, and Tokyo).
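As a hedged illustration of that kind of placement, here is a minimal row-level geo-partitioning sketch (all names are made up, and the tablespaces eu_ts/us_ts/ap_ts are assumed to exist). On vanilla Postgres a tablespace is only a storage location, while in YugabyteDB a tablespace can also carry a replica placement policy that pins each partition's replicas to a region:

    -- Partition user data by home region so each row is stored and served close to its users.
    CREATE TABLE user_profiles (
      user_id bigint,
      region  text,
      details jsonb,
      PRIMARY KEY (user_id, region)
    ) PARTITION BY LIST (region);

    CREATE TABLE user_profiles_eu PARTITION OF user_profiles
      FOR VALUES IN ('london') TABLESPACE eu_ts;
    CREATE TABLE user_profiles_us PARTITION OF user_profiles
      FOR VALUES IN ('nyc') TABLESPACE us_ts;
    CREATE TABLE user_profiles_ap PARTITION OF user_profiles
      FOR VALUES IN ('tokyo') TABLESPACE ap_ts;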
You can achieve RPO=0 with Postgres using synchronous logical replication. You would need to replicate to 2+ standbys, because if there is only one standby and it goes down, the primary will get stuck. For failover you would need Patroni or a comparable tool, but I don't know what the RTO is.
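A minimal sketch of that on the primary (standby names are made up; "ANY 1 (...)" means a commit waits for at least one of the two standbys, so losing a single standby does not block writes):

    -- Require acknowledgement from at least one of two named standbys before COMMIT returns.
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();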
But once you outgrow the primary/standby servers' storage or compute capacity, you would need to scale to larger machines, which can incur downtime. With a distributed Postgres such as YugabyteDB this is not gonna happen, because you can scale horizontally.
Sharding above SQL is not transparent (the application has to do it).
Sharding below SQL is what Distributed SQL does (what they call a distributed key-value store, ignoring that the SQL transactions are also distributed).
Under the “DBMS-optimized cloud storage” section, won't replicating the WAL across different availability zones entail doing an atomic broadcast if you want any isolation?
It seems to me that you would need to run some sort of consensus algorithm to ensure the replication is consistent, but that's obviously very expensive in latency. Is it actually done this way?