Let's consign CAP to the cabinet of curiosities (brooker.co.za)
135 points by nalgeon 12 months ago | 151 comments


So you’re setting up a multi-region RDS. If region A goes down, do you continue to accept writes to region B?

A bank: No! If region A goes down, do not process updates in B until A is back up! We’d rather be down than wrong!

A web forum: Yes! We can reconcile later when A comes back up. Until then keep serving traffic!

CAP theorem doesn’t let you treat the cloud as a magic infinite availability box. You still have to design your system to pick the appropriate behavior when something breaks. No one without deep insight into your business needs can decide for you, either. You’re on the hook for choosing.


Actually, banks use eventually consistent systems in many cases. They'd rather let a few customers overdraw a bit than stop a customer from being able to buy something on their card when they have the funds for it.


It helps that most bank accounts come with overdraft agreements, and the bank can come after you for repayment with the full force of credit law behind them.


Yeah, banks try to encourage overdrafts because they make a bunch of fees off them. I remember reading about some of them getting in trouble for reordering same-day deposits in a way that maximized the overdraft fees.

https://cw33.com/news/heres-how-banks-rearrange-transactions...


Good point. But I'd argue that's an artifact of history. Traditional banks started out when eventual consistency was the only choice: first with humans manually reconciling paper ledgers based on messages from couriers and telegraphs, later with the introduction of computers in batch processes that ran overnight. The ability to do this live is new to them, and thus their processes are designed to not require it.

Financial institutions set up in this century (like PayPal or fintechs) usually do want consistency at all costs. Your risk exposure is a lot lower if you know people don't spend or withdraw more than they have.


> Good point. But I'd argue that's an artifact of history.

I think there's also less liability for the banks if they accept and build around the risk of both eventual consistency and fraud. Even now, wiring the money out is necessary to pull off a digital bank theft, and even then you need to be extremely fast to get away with the money, extremely fast to not go to prison, and you have to liquidate it extremely quickly (likely into bitcoin). Even small delays in the transactions would make this basically impossible with modern banking, let alone a full clearing house day.


Eventual consistency gives you more availability, at the cost of consistency. It's fine for financial institutions to prefer this trade-off, because any over/under error is likely a fraction of the cost of unavailability.

Also, in the real world, we can solve problems at a different level - like legal or product.


Also the financial institutions are making enough profit that they can afford to lose a little bit in the process of reconciliation. Consider how the 3.5% fees credit cards charge mean they can afford to eat a bit of fraud and actually can tolerate more fraud than a system which had lower fees could.


Requiring availability sounds like a good recipe for global payment system outages. It also makes it a lot harder to buy things on a plane.


Fintechs have some in-house tech but typically involve all kinds of third-party systems. Even if you are interacting with these third parties via APIs (instead of CSV or whatever), you will always need an eventually consistent recon mechanism to make sure both sides agree on what was done. It doesn't have to take days though, I agree.

The only way around this would be to replace stateful apis with some kind of function that takes a reference to a verifiable ledger, that can mutate that and anyone can inspect themselves to see if the operation has been completed. Oh wait :)


In many cases. There are few absolutes at the edges. They have to be decided on a case-by-case basis. Also, failure to decide is deciding.


Very eventually consistent. Like, humans involved sometimes.


Ignoring the CAP theorem doesn't just mean eventually consistent - it means you will lose transactions when replicas that aren't fully replicated get corrupted. Probably not something you want to be caught doing if you're a bank.


Real world financial systems do have transactions that aren’t fully replicated and hence can be lost as a result. But banks just decide that both the probability and sums involved are low enough that they’ll wear the risk.

Example: some ATMs are configured that if the bank computer goes down (or the connection to it), they’ll still let customers withdraw limited sums of money in “offline” mode, keeping a local record of the transaction, that will then be uploaded back to the central bank computers when the connection comes back up.

But, consider this hypothetical scenario: bank computer has a major outage. ATM goes into offline mode. Customer goes to ATM to withdraw $100. ATM grants withdrawal and records card number, sum and timestamp internally (potentially even in two different forms: on disk and a paper-based backup). Customer leaves building. 30 minutes later, bank computer is still down, when a terrorist attack destroys the building and the ATM inside. The customer got $100 from the bank, but all record of it has been destroyed. Yet, this is such an unlikely scenario, and the sums involved are small enough (in the big picture), that banks don’t do anything to mitigate it - the mitigation costs more than the expected value of the risk.
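(For the curious, here's a minimal sketch in Python of that offline journaling idea; the limit, file format, and function names are all invented, not any real ATM protocol.)

  import json, time

  OFFLINE_LIMIT = 100  # assumed per-card cap while the host is unreachable

  def withdraw_offline(card_number, amount, journal_path):
      """Record the withdrawal locally; it gets replayed when the host is back."""
      if amount > OFFLINE_LIMIT:
          return False  # refuse large sums while the balance can't be checked
      entry = {"card": card_number, "amount": amount, "ts": time.time()}
      with open(journal_path, "a") as f:  # append-only; a paper roll is the second copy
          f.write(json.dumps(entry) + "\n")
      return True

  def replay_journal(journal_path, post_to_host):
      """Upload the stored transactions once connectivity returns."""
      with open(journal_path) as f:
          for line in f:
              post_to_host(json.loads(line))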


I'd be surprised if they didn't have insurance that covered the sum of money in the machine


As patio11 says, "The optimum level of fraud is not zero."


Having worked at a couple of banks, this is absolutely true.


That and they love to charge overdraft fees to accounts that don't otherwise make them a ton of money for the privilege.


Actually, CAP theorem is only relevant to a database transaction, not a system. The system has to be available to even LET the card process. This is why no bank would ever use anything BUT an ACID-based system, as data integrity is far more important than speed (and availability, for that matter).


Yes. I see CAP theorem as a (perfectly right) answer to managers demanding literally impossible requirements


You are mischaracterizing what TFA says. It says that partitions either keep a client from reaching any servers (in which case we don't care) or they keep some servers from talking to others (or some clients from talking to all servers), and that in the latter case whenever you can have a quorum (really, a majority of the servers) reachable then you can continue.

You've completely elided the quorum thing to make it sound like TFA is nuts.


Quorum doesn't let you ignore partitions. It can't. You still have to decide what happens to the nodes on the wrong side of the partition, or how to handle multiple partitions (you have 5 nodes. You have 2 netsplits. Now you have clusters of 2 nodes, 2 nodes, and 1 node. Quorum say what?).
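To make that concrete, here's a toy majority-quorum check (node names invented) for the 5-node, double-netsplit case:

  CLUSTER_SIZE = 5
  QUORUM = CLUSTER_SIZE // 2 + 1  # majority = 3

  def has_quorum(reachable_nodes):
      """A group may only make progress if it can see a majority of the cluster."""
      return len(reachable_nodes) >= QUORUM

  # two netsplits -> three disjoint groups of 2, 2 and 1 nodes
  groups = [{"n1", "n2"}, {"n3", "n4"}, {"n5"}]
  print([has_quorum(g) for g in groups])  # [False, False, False]: nobody can write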

I don't think TFA is nuts. I do think the premise that engineers using cloud can ignore CAP theorem is wrong, though. It's a decision you must consider when you're designing a system. As long as everything is running well, it doesn't matter. You've got to make sure that the behavior when something breaks is what you want it to be. And you can't outsource that decision. It can't be worked around.


It's poor timing for this article.

The CrowdStrike incident has demonstrated that no matter what you do, it is possible for any distributed system to be partitioned in such a way that you have to choose between consistency and availability, and if the system is important then that choice is important.


You're ignoring that a system is temporarily partitioned for the duration of the time it takes to retrieve information from all nodes.

Partition intolerance means you can only proceed when all nodes have been reached. This means partition intolerance not only deals with availability in the yes or no sense, but also in terms of latency.


That's true, but I think the catch is that most customers likely didn't view it as an investment that could decrease availability.


> You still have to decide what happens to the nodes on the wrong side of the partition

You don't, because those nodes' clients won't also be isolated from the other servers. That's TFA's point, that in a cloud environment network partitions that affect clients generally deny them access to all servers (something you can't do anything about), and network partitions between servers only isolate some servers and none of the clients.


That’s fine if you value availability over consistency. Which is fine if that’s the appropriate choice for your use case! Even then you have to decide whether to accept updates in that situation, knowing that they may never be reconciled if the partition is permanent.


No, you can have availability and consistency as long as you have enough replicas. Network partitions that leave you w/o a quorum are like the whole cloud being down. Network partitions don't leave some clients only able to reach non-quorums -- this is the big point of TFA.


> No, you can have availability and consistency as long as you have enough replicas.

You've misunderstood the problem.

You cannot be consistent if you cannot resolve conflicts; taking writes after a network partition means you are operating on stale data.

Honestly the closest we've ever gotten to solving the CAP theorem is Spanner[0], which trades off availability by being hugely less performant for writes. I'm aware Spanner is used a lot in Google Cloud (under the hood), but you won't solve CAP by having hundreds of PGSQL replicas, because those aren't using Spanner.

In fact, you can test it out, ElasticSearch orchestrates a huge number of lucene databases, a small install will have a dozen or so replicas and partitions, but you're welcome to crank it to as many nodes as you want then split off a zone, and see what happens.

I'm becoming annoyed at the level of ignorance on display in this thread, so I'm sorry for the curt tone, you can't abstract your way to solving it, it's a fundamental limitation of data.

[0]: https://static.googleusercontent.com/media/research.google.c...


Did you read TFA? Please, the point of TFA is NOT that you don't need distributed consensus algorithms, but that once you have them you don't need to worry about partitions, therefore you can have consistency, and you get "availability" in that the cloud "is always available", and if that sounds like cheating you have to realize that partitions in the cloud don't isolate clients outside the cloud, only servers inside the cloud, and if there's no quorum anywhere then you just treat that as "the cloud is down".


I suspect you never needed CAP in the first place, because even outside of the cloud, you don't seem to pay a lot of attention to whether your service is up or not or how to increase that uptime.

In the same spirit, some businesses are fine with a single db, making backups every night and losing some data when an issue happens. It doesn't mean that people maintaining these systems get to tell others that distributed systems are essentially a non-problem.


> you get "availability" in that the cloud "is always available",

The cloud isn't what's being described as available. The service is available or not. The service is hosted on multiple computers that talk to each other. If they stop talking to each other, the service needs to either become unavailable (choosing consistency) or risk having not up to date data (choosing availability).


I did, and it doesn't provide insight that didn't exist 15 years ago.

Cloud is always available? I have a bridge to sell you.


That presumes you can only have 1 partition at a time. That is not the case. You can optimize for good behavior when you end up with 2 disconnected graphs. It’s absolutely the case that a graph with N nodes can end up with N separate groups. Conceptually, imagine a single switch that connects all of them together and that switch dies.

You can engineer cloud services to be resilient. AWS has done a great job of that in my personal experience. But worst cases can still happen.

If you have an algorithm that runs in n*log(n) most of the time but 2^n sometimes, it can still be super useful. You have to prepare for the bad case though. If you say it’s been running great so far and we don’t worry about it, fine, but that doesn’t mean it can’t still blow up when it’s most inconvenient. It only means it hasn’t yet.


Again, a partition that leaves you with no quorum anywhere is like the cloud being down.


> Again, a partition that leaves you with no quorum anywhere is like the cloud being down.

What does this mean? What is the cloud being down? The cloud isn't a single thing; it's lots of computers in lots of data centres.


It means there will be an article on the front page declaring the outage


A partition that leaves you with no quorum anywhere is like the cloud being down if you value consistency over availability.


Yes, but again, TFA says that you will trust the cloud will be up long enough that you would indeed prefer consistency.

My goodness. So many commenters here today are completely ignoring TFA and giving lessons on CAP -- lessons that are correct, but which completely ignore that TFA is really arguing that clouds move the needle hard in one direction when considering the CAP trade-offs, to the point that "the CAP theorem is irrelevant for cloud systems" (which was this post's original title, though not TFA's).


The issue people are pointing out is that the author conflates "Cloud is good enough for my needs" with "CAP is no longer relevant". The point is especially annoying considering that the author works at a cloud vendor and not a cloud user.


Huh? You can create partitions in ways other than issues with the underlying network. You can mess up firewall rules for example and find yourself with no quorum.


Those aren't the only failure modes - you can have two sets of servers partitioned from one another (in two different data centers), both of which are reachable by clients. Do you allow those partitions to remain available, or do you stop accepting client connections until the partition heals? The "right" choice depends entirely on the application.


In TFA's world-view clients will not talk to isolated servers.


So that would be choosing consistency over availability.


Yes. That's the point, that cloud changes the trade-off such that consistency becomes worthwhile.


Indeed and this is a very subtle point that I found very hard to explain when talking about availability in the event of network partitions.

You have to consider the system as a whole including external clients since the definition of the availability of the entire system ultimately depends on the point of view of the end users of the system.


Quorum just means you are a CP system rather than an AP system.


If you have a partition, then you have to wait until the partition resolves if you want consistency. Otherwise the other partition may receive an update that you are unaware of, that invalidates your consistency constraints.

The author of the article doesn't actually understand the CAP theorem.


How do you set up a multi-region RDS with multi-region writes?


I think that's a thing? It's been a while since I've looked. Even if not, multi-AZ replication with failover is a standard setup and it has the same issues. Suppose you have some frontend servers and an RDS primary instance in us-west-2a and other frontend servers and an RDS replica in us-west-2b. The link between AZs goes down. For simplicity, pretend a failover doesn't happen. (It wouldn't matter if it did. All that would change here is the names of the AZs.)

Do you accept writes in us-west-2a knowing the ones in 2b can't see them? Do you serve reads in 2b knowing they might be showing old information? Or do you shut down 2b altogether and limp along at half capacity in only 2a? What if the problem is that 2a becomes inaccessible to the Internet so that you can no longer use a load balancer to route requests to it? What if the writes in 2a hadn't fully replicated to 2b before the partition?

You can probably answer those for any given business scenario, but the point is that you have to decide them and you can't outsource it to RDS. Some use cases prioritize availability above all else, using eventual consistency to work out the details once connectivity's restored. Others demand consistency above all else, and would rather be down than risk giving out wrong answers. No cloud host can decide what's right for you. The CAP theorem is extremely freaking relevant to anyone using more than one AZ or one region.

(I know you weren't claiming otherwise, just taking a chance to say why cross-AZ still has the same issues as cross-region, as I have heard well-meaning people say they were somehow different. AWS does a great job of keeping things running well to the point that it's news when they don't. Things still happen though. Same for Azure, GCP, and any other cloud offering. However flawless their design and execution, if a volcano erupts in the wrong place, there's gonna be a partition.)


You wish.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition

Incidentally that's where CAP makes its appearance and bites your ass.

No amount of VRRP or UCARP wishful thinking can guarantee a conclusion on which partition is "correct" in the presence of a network partition between load balancer nodes.

Also, who determines where to point the DNS? A single point of failure VPS? Or perhaps a group of distributed machines voting? Yeah.

You still need to perform the analysis. It's just that some cloud providers offer the distributed voting clusters as a service and take care of the DNS and load balancer switchover for you.

And that's still not enough, because you might not want to allow stragglers to write to orphan databases before the whole network fencing kicks in.


The idea that someone in a partition can get around it by clever routing tricks indicates the author has some “new” definition of partition.

Partition isn’t one ethernet cable going bad. It’s all the ethernet cables going bad. Redundant network providers for your data center to handle the idiot with the backhoe isn’t surviving a partition; it’s preventing one in the first place.


You really do need to consider a full partition, and a partial partition. Both happen.

It's most common that you have a full partition when nobody can talk to node A, because it's actually offline. And sometimes you've got a dead uplink for a switch with a couple of nodes behind it that can talk to each other.

But partial partitions are really common too. If Node A can talk (bidirectionally) to B and C, but B and C can each only talk to A, you could do something with clever routing tricks. If you have a large enough network, you see this kind of thing all the time, and it's tempting to consider routing tricks. IMHO, it's a lot more realistic to just have fast feedback between monitoring the network and the people who can go out and clean the ends on the fiber on the switches and routers and whatnot. The internet as it exists is the product of smart people doing routing tricks on a global basis; it's not universally optimal --- you could do better in specific cases all the time, and it's really easy to identify those cases as a human when something is broken --- but actually doing probing for host routing is a big exercise for typically marginal benefits. Magic host routing won't help when a backhoe (or directional drill or boat anchor) has determined your redundant fiber paths are all physically present in the same hole, though.


I've seen partitions in Postgres clusters where a node gets overloaded to the point it's not responding to health checks and the cluster management software performs a failover. Since the previous master can't be reached but is still running, it accepts writes even after a new replica is promoted.

I've seen similar things happen in other software under load as well (which can easily cause cascading failures, too)
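A rough sketch of why that happens: a timeout-based health check can't tell "dead" from "overloaded", so promoting without fencing first leaves the old primary still taking writes. (Hostnames and the fence/promote steps here are hypothetical, not any particular cluster manager.)

  import socket

  def looks_dead(host, port, timeout_s=1.0):
      """Naive health check: 'no answer within the timeout' is all we can observe,
      so an overloaded-but-alive primary looks identical to a dead one."""
      try:
          with socket.create_connection((host, port), timeout=timeout_s):
              return False
      except OSError:
          return True

  def fence(node):
      """Hypothetical: cut the old primary off (firewall rule, STONITH, ...)
      so it cannot keep accepting writes once a replica is promoted."""

  def promote(node):
      """Hypothetical: make the chosen replica the new primary."""

  # Promoting on a failed check *without* fencing first is the split-brain
  # window described above: the old primary is still up and still taking writes.
  if looks_dead("pg-primary.internal", 5432):
      fence("pg-primary.internal")
      promote("pg-replica-1.internal")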


A somewhat plausible scenario that's easy to illustrate would be if you have a DC in the US and one in New Zealand, and all 7 subsea cables connecting New Zealand to the internet go dark.

Something like that happened just this year to 13 African nations [1]

1: https://www.techopedia.com/news/when-africa-lost-internet-ex...


This is not really correct, and assumes the state of a cloud service (let's say load balancing) is binary.

In my experience it's not. The cloud will glitch. The load balancing algo will break subtly for your workload. Your traffic will get blackholed for no apparent reason. I once spent a week trying to convince a cloud provider they fucked up (that's the time it took for them to give us someone who could run the appropriate tcpdump). There was no global outage.

It's on you to determine if this is important for you or not in your case, but you will need to mitigate it above a certain SLA threshold requirement. That's far from consigning CAP to history or a curiosity, which is like saying you will never have network issues if you use "serverless" stuff.

Not talking about anyone in particular, but sometimes I feel people building and using the cloud reach hubris levels of surety in their systems - I've worked both sides of the fence, and I know the long tails of fuckups that impact customers, though...


I remember an article--I think from Gitlab--about random latency and out of order packet issues they were seeing on GCP. Turns out it only happened under a certain amount of load since GCP's routing algorithm uses a different network if traffic is under a certain amount and switches to a more distributed mode when it crosses a threshold.

In addition, sometimes you land on a bad piece of hardware but it's not bad enough to trigger the provider's monitoring. A few months ago we had a bad EC2 instance whose network would drop every 2 hours, causing a bunch of random errors before recovering for a little while.


Could a partition just be the machine hosting one of your DBs going down? When it comes back up, it still has its data, but it's missed out on what happened in the meantime.


Your question was answered by someone else but you've triggered a different form of dread.

We lost a preprod database that held up dev and QA for half a day. To bring the server room up to code they moved the outlets, and when they went to move the power cords, someone thought they could get away with relying on the UPS power during the jiggery. Turned out the UPS that was reporting all green only actually had about 8 seconds of standby power in it, which was a little less than the IT person needed to untangle the cables.

So in the end we were just fucked on a different day than we would have been if we had waited for the next hiccup in the city power.

I mention this because if you've split your cluster between three racks and someone manages to power down an entire rack, you're dangerously close to not having quorum and definitely close to not having sufficient throughput for both consumer traffic and rebuilding/resyncing/resilvering.

It's a slow motion rock slide that is a cluster that was not sized correctly for user traffic plus recovery traffic, as the entire thing limps toward a full-on outage when just one more machine needs to be rebooted, because, for instance, your team decided that restarting a machine every n/48 hours to deal with a slow leak is just fine, since you're using a consensus protocol that will hide the reboots. Maybe rock slide is the wrong word. It's a train wreck. Once it starts you can't stop it, you can only watch and fret.


Not really - if a database stops you fundamentally don't have availability. CAP is specific to network partitions because it wants to grapple with what happens when clients can talk to nodes in your service, but the nodes can't talk to each other.


The thing is, each ethernet cable is a temporal partition. By adding more ethernet cables you're not making partitions less likely, you're making more of them. A distributed system is considered partitioned while the packets are in flight and have not yet reached their destination and come back with a response. This is true even if the system on the other side has not yet failed or the network equipment is functioning properly.


Yep. CAP comes into play when all the mitigations fail. However clever, diligent, and well-executed a network setup, there's some scenario where 2 nodes can't communicate. It's perfectly fine to say we've made it so vastly unlikely that we accept the risk to just take the whole thing offline when that happens. The problem cannot be made to go away. It's not possible.


For a cloud customer, the networking side might be as simple as "run a few replicas in different regions, and let the load balancer deal with it." Or only different AZs. The database problem is much harder to sign away as someone else's job.


TFA doesn't say you don't need Paxos or other such protocols. It says that partitions that leave less than a quorum of servers isolated do not reduce availability, and that partitions that leave you w/o a quorum are not really a thing on the cloud. That last part might not be true, but to the extent that it is true then you can have availability and strong consistency.


It’s not true, but if it were true then it’s true?


No, more like: when the cloud is all down, yeah, you don't have availability, but so what, that's like all servers being down -- that's always a potential problem, but it's less likely on a cloud.


Yep. You can't magic cloud away Consistency by ignoring it and YOLO hoping it will Do The Right Thing™.


I don’t know. People spend a lot of time thinking about network partitions. What’s the difference between a network partition and a network that is very slow? You could communicate via carrier pigeon. Practically, you can easily communicate via people in different regions, who already “know” which partition is correct. Networks are never really partitioned, and the choice of denominator when calculating their speed means they’re never really transferring at 0 bits per second either. Eventually data will start transferring again.

So a pretty simple application design can deal with all of this, and that’s the world we live in, and people deal with delays and move on with life, and they might bellyache about delays for the purpose of getting discounts from their vendors but they don’t really want to switch vendors. If you’re a bank in the highly competitive business of taking 2.95% fees (that should be 0.1%) on transactions, maybe this stuff matters. But like many things in life, like drug prices, that opportunity only exists in the US, it isn’t really relevant anywhere else in the world, and it’s certainly not a math problem or intrinsic as you’re making it sound. That’s just the mythology the Stripe and banks people have kind of labeled on top of their shit, which was Mongo at the end of the day.


"A good driver sometimes misses a turn, a bad driver never misses a turn"

A good engineer knows that all real-time systems are turn-based, and the network is _always_ partitioned


This is exactly the right way to think about it. One should be happy when the network works, and not expect any packet to make it somewhere on any particular schedule. Ideally. In practice it's too expensive to plan this way for most systems, so we compensate by pretending the failure modes don't exist.


I don't know what the formalisms here are, but as someone intermittently (thankfully, at this point, rarely) responsible for some fairly large-scale distributed systems: the "network that is really slow" case is much scarier to me than the "total network partition". Routine partial failure is a big part of what makes distributed systems hard to reason about.


Even more fun are asymmetric network degradations or partitions.


You've got to at least consider what is going to happen.

The whole point of the CAP theorem is that you can't have a one size fits all design. Network partitions exist, and it's not always a full partition where you have two (or more) distinct groups of interconnected nodes. Sometimes everybody can talk to everybody, except node A can't reach node B. It can even be unidirectional, where node A can't send to B, but node B can send to A. That's just how it is --- stuff breaks.

There's designs that highlight consistency in the face of partitions; if you must have the most recent write, you need a majority read or a source of truth --- and if you can't contact that source of truth or a majority of copies in a reasonable time (or at all), you can't serve the read. And you can't confirm a write unless you can contact the source of truth or do a majority write. As a consequence, you're more likely to hit situations where you reach a healthy frontend, but can't do work because it can't reach a healthy backend; otoh, that's something you should plan for anyway.
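The arithmetic behind "majority read or source of truth" is just quorum intersection: with N copies, W write acks and R read replies are guaranteed to overlap whenever R + W > N. A tiny sketch:

  def overlaps(n, w, r):
      """Quorum intersection: any read set of size r must share at least one
      replica with any write set of size w whenever r + w > n."""
      return r + w > n

  print(overlaps(n=5, w=3, r=3))  # True: majority writes + majority reads always meet
  print(overlaps(n=5, w=1, r=1))  # False: a read can miss the single acknowledged copy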

There's designs that highlight availability in the face of partitions; if you must have a read and you can reach a node with the data, go for it. This can be extended to writes too: if you trust a node to persist data locally, you can use a journal entry system rather than a state update system, and reconcile when the partition ends. You'll lose the data if the node's storage fails before the partition ends, of course. And you may end up reconciling into an undesirable state; in banking, it's common to pick availability --- you want customers to be able to use their cards even when the bank systems are offline for periodic maintenance / close of day / close of month processing, or unexpected network issues --- but when the systems come back online, some customers will have total transaction amounts that would have been denied if the system was online. Or you can do last state update wins too --- sometimes that's appropriate.
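And a toy version of that journal-and-reconcile approach (field names invented): record intent while disconnected, replay in order afterwards, and accept that the merged state may be one the online path would have refused, like an overdraft.

  from dataclasses import dataclass

  @dataclass
  class Entry:
      account: str
      delta: int    # +deposit / -withdrawal
      ts: float     # local timestamp at the disconnected node

  def reconcile(balances, offline_entries):
      """Replay journal entries in timestamp order; some accounts may end up
      negative (a state the online path would have rejected)."""
      for e in sorted(offline_entries, key=lambda e: e.ts):
          balances[e.account] = balances.get(e.account, 0) + e.delta
      return balances

  print(reconcile({"alice": 40}, [Entry("alice", -30, 1.0), Entry("alice", -30, 2.0)]))
  # {'alice': -20}: available during the partition, overdrawn after reconciliation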

Of course, there's the underlying horror of all distributed systems. Information takes time to be delivered, so there's no way for a node to know the current status of anything; all information from remote nodes is delayed. This means a node never knows if it can contact another node, it only knows if it was recently able to or not. This also means unless you do some form of locking read, it's not unreasonable for the value you've read to be out of date before you receive it.

Then there's even further horrors in that even a single node is actually a distributed system. Although there's significantly less likelihood of a network partition between CPU cores on a single chip.


There is also one more level of horror people often forget. Sometimes the history unwinds itself. Some outages will require restoration from a backup, and some amount of transactions may be lost. It does not happen often, but it may happen. Such situations may lead to, e.g., sequentially allocated IDs being reused, or originally valid pointers to records now pointing nowhere, and so on.


Everything you're saying is true. It's also true that you can ignore these things, have some problems with some customers, and if you're still good, they'll still shop with you or whatever, and life will go on.

I am really saying that I don't buy into the mythology that someone at AWS or Stripe or whatever knows more about the theoretical stuff than anyone else. It's a cloud of farts. Nobody needs these really detailed explanations. People do them because they're intellectually edifying and self-aggrandizing, not because they're useful.


Suppose we all live under socialism and capitalist rules no longer apply. Only physics.

You run a vacation home exchange/booking site, which you run in 3 instances in America, Europe and Asia because you found that people are happier when the site is snappy without those pesky 50ms delays.

Now suppose it's the middle of the night in one of those 3 regions, so no big deal, but the rollout of a new version brings down the database that's holding the bookings for two hours before it's fixed. Yeah, it was just the simplest solution to have just one database with the bookings, because you wouldn't have to worry about all that pesky CAP stuff.

But people then start to ask if it would be possible to book from regions that are not experiencing the outage while double-bookings would still not happen. So you give it some consideration and then you figure out a brilliant, easy solution. You just


Incidentally, I'm highly suspicious of the claim that those pesky 50ms delays matter much. Checking DOMContentLoaded timings on a couple sites:

Reddit: 1.31 s

Amazon: 1.51 s

Google: 577 ms

CNN: 671 ms

Facebook: 857 ms

Logging into Facebook: 8.37 s

As long as you have TLS termination close to the end user, and your proxy maintains connections to the backend so that extra round-trips aren't needed, the amount of time large popular sites take to load suggests that people wildly overstate how much anyone cares about a fraction of a blink of an eye. A lot of these sites have a ~20 ms time to connect for me. If it were so important, I'd expect to see page loads on popular sites take <100 ms (a lot of that being TCP+TLS handshake), not 800 ms or 8 seconds.


It’s hard to say. It’s common sense that it doesn’t matter that much for rational outcomes. Maybe it matters a lot if you are selling something psychological - like in a sense, before Amazon, people with shopping addiction were poorly served, and maybe 50ms matters to them.

You can produce a document that says these load times matter - Google famously observed that SERP load times have a huge impact on a thing they were measuring. You will never convince those analytics people that the pesky 50ms doesn’t matter, because they operate at a scale where they could probably observe a way that it does.

The database people are usually sincere. But are the retail people? If you work for AWS, you’re working for a retailer. Like so what if saving 50ms makes more money off shopping addicts? The AWS people will never litigate this. But hey, they don’t want their kids using Juul right? They don’t want their kids buying Robux. Part of the mythology I hate about these AWS and Stripe posts is, “The only valid real world application of my abstract math is the real world application that pays my salary.” “The only application that we should have a strictly technical conversation about is my application.”

Nobody would care about CAP at Amazon if it didn’t extract more money from shopping addicts! AWS doesn’t exist without shopping addicts!


What I mean is that Amazon seems to think it doesn't matter that much if they're taking hundreds of ms (or over 1 s) to show me a basic page. The TCP+TLS handshake takes like 60 ms, plus another RTT for the request is 80 ms total. If it takes 10 ms to generate the page (in fact it seems to take over 100), that should still be under 100 ms total on a fresh connection. But instead they send me off to fetch scripts from multiple other subdomains (more TCP+TLS handshakes). That alone completely ruins that 50 ms savings.

So the observation is that apparently Amazon does not think 50 ms is very important. If they did, their page could be loading about 5-10x faster. Likewise with e.g. reddit; I don't know if that site has ever managed to load a page in under 1 s. New reddit is even worse. At one point new reddit was so slow that I could tap the URL bar on my phone, scroll to the left, and change www to old in less time than it took for the page to load. In that context, I find people talking about globally distributed systems/"data at the edge" to save 50 ms to be rather comical.


and travel sites will be a lot slower for other reasons


I once lost an entire Christmas vacation to fixing up the damage caused when an Elasticsearch cluster running in AWS responded poorly to a network partition event and started producing results that ruined our users' day (and business records) in a "costing millions of dollars" kind of way.

It was a very old version of ES, and the specific behavior that led to the problem has been fixed for a long time now. But still, the fact that something like this can happen in a cloud deployment demonstrates that this article's advice rests on an egregiously simplistic perspective on the possible failure modes of distributed systems.

In particular, the major premise that intermittent connectivity is only a problem on internetworks is just plain wrong. Hubs and switches flake out. Loose wires get jiggled. Subnetworks get congested.

And if you're on the cloud, nobody even tries to pretend that they'll tell you when server and equipment maintenance is going to happen.


Twice I’ve had to take an Ethernet cable from an IT lead and throw it away. “It’s still good!” No it fucking isn’t. One of them compromised by pulling it out of the trash and cutting the ends off.

When there’s maintenance going on in the server room and a machine they promised not to touch starts having intermittent networking problems, it’s probably a shitty cable getting jostled. There is an entire generation of devs now that have had physical hardware abstracted away and aren’t learning the lessons.

Though my favorite stories are about the cubicle neighbor who taps their foot, leading to packet loss.


I watched a colleague diagnose a flaky cable. Then he immediately pulled out a pocket knife and sliced the cable in 2. He said he didn't ever even want to be tempted to trust it again.

That's been my model since then. I've never worked with a cable so expensive or irreplaceable that I'd want to take a chance on it a second time. I'm sure they exist. I haven't been around one.


Lightning cable. Gotta mess with it until the iPhone charges. I'm not replacing it until it's tied into a blackwall hitch and still isn't charging.


It's been a while since I've been in a data center, but fiber cables were usually the ones that seemed to have more staying power even when they started to flake. Maybe because some of them were long runs and they weren't cheap. You'd see some absolutely wrenched cable going into a rack at a horrifying angle from the tray on a 30ft run with greater regularity than I'd care to admit.


When I design systems I just think about tiny traitor generals and their sneaky traitor messengers racing in the war, their clocks are broken, and some of them are deaf, blind or both.

CAP or no CAP, chaos will reign.

I think FLP (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) is better way to think about systems.

I think CAP is not as relevant in the cloud because the complexity is so high that nobody even knows what is going on, so just the C part, regardless of the other letters, is ridiculously difficult even on a single computer. A book can be written just to explain write(2)'s surprise attacks.

So you think you have the guarantees the designers said you have, AP or CP, and yet... the impossible will happen twice a day (and 3 times at night when it's your on-call).


Nice! A paper by Michael Fischer of Multiflow fame. I have one of their computers in my garage. Thanks!


I can think of a lot of cloud systems that have C. Businesses will rely on a single machine running a big Oracle DB. Microservice architectures will be eventually consistent as a whole, but individual services will again have a single DB on a single machine.

The single machine is a beastly distributed system in and of itself, with multiple cores, CPUs, NUMA nodes, tiered caches, RAM, and disks. But when it comes down to two writers fighting over a row, it's going to consult some particular physical place in hardware for the lock.


The military lives in this world and will likely encourage people to continue thinking about it. Think about wearables on a submarine, as an example. Does the captain want to know his crew is fatigued, about to get sick, getting less exercise than they did on their last deployment? Yes. Can you talk to a cloud? No. Does the Admiral in Hawaii want to know those same answers about that boat, and every boat in the Group, eventually? Yes. For this situation, datacenter-aware databases are great. There are other solutions for other problems.


Yep.

For those unaware, military networks deal with disruptions, disconnections, intermittent connectivity, and low-bandwidth (DDIL).

https://www.usni.org/magazines/proceedings/sponsored/ddil-en...


> The CAP Theorem is Irrelevant

Just sprinkle the magic "cloud" powder on your system and ignore all the theory.

https://ferd.ca/beating-the-cap-theorem-checklist.html

Let's see, let's pick some checkboxes.

(x) you pushed the actual problem to another layer of the system

(x) you're actually building an AP system


Specifically they pushed the problem to the load-balancer so they get to check this one too.

(x) your solution requires a central authority that cannot be unavailable

I'm amazed the author doesn't consider that the load-balancers in this situation are effectively acting as non-voting witnesses and that the whole system hinges on the LBs being able to correctly determine the healthy nodes. In a single-master setup, the network partition that cuts the line between the LBs and the current master (but nothing else) is a fun one. Everything would be fine if you could promote a new master, but who's gonna tell 'em?


That's a good point. It describes the situation better than my checklist.


You're building a ??? system. Either CP or AP depending on what random values got put into some config.


There's a better rebuttal(*) of CAP in Kleppmann's DDIA, under the title, "The unhelpful CAP theorem".

I won't plagiarize his text, instead the chapter references his blogpost, "Please stop calling databases CP or AP": https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

(*): rebuttal I think is the wrong word, but I couldn't think of better.


Really good read, in fact all of these discussions are really great.

I liked the linearizable explanation, like when Alice knows the tournament outcome but Bob doesn't. A super extension to this would underscore how important this is, and the danger of Alice knowing the outcome but Bob not knowing: just extend the website a bit and make it a sports gambling website. A system under such a partition would allow Alice to make a "sure thing" bet against Bob, so the constraint should be that when Bob's view is stale, it does not take bets. But how does Bob's view know it's stale? It has to query Alice's view! Lots of mind games to play!
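One illustrative way to make Bob's view refuse stale bets (purely a sketch, not how any real site does it): version the state and make the freshness check an explicit round trip to the other view, which is exactly the consistency/availability choice all over again.

  class BetView:
      def __init__(self, applied_version):
          self.applied_version = applied_version

      def accept_bet(self, fetch_leader_version):
          """Only take bets if this view can prove it isn't stale. The version
          fetch is a callable so the 'ask the other view' round trip is explicit;
          if it fails during a partition, we refuse the bet (choosing C over A)."""
          try:
              latest = fetch_leader_version()
          except ConnectionError:
              return False
          return self.applied_version >= latest

  bob = BetView(applied_version=41)
  print(bob.accept_bet(lambda: 42))  # False: Bob's view is behind, so no bets taken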


Plot twist: in the article's drawings, replica one and two are themselves split by a network link, and it could fail.

The author seems to not understand the meaning of the P in CAP.


I missed your comment and just made the same comment with different words. He literally has a picture of a partition.


"Non failing node" does not mean "Non partitioned node", simple as that.

If you treat a partitioned node as "failed", then CAP does not apply. You've simply left it out cold with no read / write capability because you've designated a "quorum" as the in-group.


Someone else took ownership of the problem for you and sells you their solution: "The theoretical issue is irrelevant to me".

Sure. Also, there's a long list of other things that are probably irrelevant to you. That is, until your provider fails and you need to understand the situation in order to provide a workaround.

And slapping "load-balancers" everywhere on your schema is not really a solution, because load-balancers themselves are a distributed system with a state and are subject to CAP, as presented in the schema.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition.

"Somehow, something somewhere will fix my shit hopefully". Also, as a sidenote, a few friends would angrily shake their "it's always DNS" cup reading this.

edit: reading the rest of the blog and author's bio, I'm unsure whether the author is genuinely mistaken, or whether they're advertising their employer's product.


> None of the clients need to be aware that a network partition exists (except a small number who may see their connection to the bad side drop, and be replaced by a connection to the good side).

What a convenient world where the client is not affected by the network partition.


As someone who's worked extensively on distributed systems, including at a cloud provider, after reading this I think the author doesn't actually understand the CAP theorem or the two generals problem. Their conclusions are essentially utterly incorrect.


Many things can be solved by the SEP Field[0]

[0]: https://en.wikipedia.org/wiki/Somebody_else's_problem#Dougla...


The CAP theorem is the quantum mechanics of software, with C*A = O(1) in theory, similar to the uncertainty principle, but in many use cases this value is so small that "classical" expectations of both C and A are fine.


> In practice, the redundant nature of connectivity and ability to use routing mechanisms to send clients to the healthy side of partitions

Iow: You can have CAP as long as you can communicate across "partitions".


More precisely: partitions in the graph theory sense rarely (if ever) happen. In practice it looks more like a complete graph losing some edges.

Let's say you have servers in DCs A and B, and clients at ISPs C and D. Normally servers at A and B can communicate, and clients at both C and D can each reach servers at A and B.

If connectivity between DCs A and B goes down there is technically no partition, because connectivity between the clients and each DC is still working. The servers at one of the DCs can just say "go away, I lost my quorum", and the clients will connect to the other DC. This is the most likely scenario, and in practice the only one you're really interested in solving.

If connectivity between ISP C and DC A goes down, there is no partition because they can just connect to DC B.

If all connectivity for ISP C goes down, that is Not Your Problem.

If all connectivity for DC A goes down, it doesn't matter because it no longer has clients writing to it, and they've all reconnected to DC B.

To have a partition, you'd need to lose the link between DCs A and B, and lose the link from ISP C to DC B, and lose the link from ISP D to DC A. This leaves the clients from ISP C connected only to DC A, and the clients from ISP D connected only to DC B - while at the same time being unable to reconcile between DCs A and B. Is this theoretically possible? Yes, of course. Does this happen in practice - let alone in a close-enough split that you could reasonably be expected to serve both sides? No, not really.
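You can model those cases as plain graph reachability over the surviving links (names arbitrary); it takes all three cuts at once before any client gets stuck on one side:

  from collections import defaultdict, deque

  def reachable(links, start):
      """Plain BFS over the surviving (undirected) links."""
      graph = defaultdict(set)
      for a, b in links:
          graph[a].add(b)
          graph[b].add(a)
      seen, queue = {start}, deque([start])
      while queue:
          for nxt in graph[queue.popleft()] - seen:
              seen.add(nxt)
              queue.append(nxt)
      return seen

  full = {("A", "B"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "B")}
  # losing only the inter-DC link: clients at C (and D) still reach both DCs
  print(reachable(full - {("A", "B")}, "C") & {"A", "B"})  # {'A', 'B'}
  # the real split needs three simultaneous cuts: A-B, C-B and D-A
  split = full - {("A", "B"), ("C", "B"), ("D", "A")}
  print(reachable(split, "C") & {"A", "B"})  # {'A'}
  print(reachable(split, "D") & {"A", "B"})  # {'B'}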


Partition tends to be a problem between peers, where observers can see they aren’t talking to each other.

In the early days that might have been two machines we walked up to seeing different results; now we talk to everything over the internet. We generally mean partition when one VLAN is having trouble, or one ethernet card is dying or has a bad cable. If the entire thing becomes unreachable we just call it an outage.


  DC A: North America
  DC B: South America
  ISP C: North American ISP
  ISP D: South American ISP
A major fiber link between North and South America is cut, disrupting all connectivity between the two areas.

  "lose the link between DCs A and B": check.
  "lose the link from ISP D to DC A.": check.
Replace North and South America with any two disparate locations.


So glad to see that the CAP "theorem" is being recognized as a harmful selfish meme like Fielding's REST paper with a deadly seductive power against the overly pedantic.


I think every couple of months there's yet another article saying the CAP theorem is irrelevant. The problem with these is that they ignore the fact that CAP theorem isn't a guide, a framework or anything else.

It's simply the formalization of a fact, and whether or not that fact is *important* (although still a fact) depends on the actual use case. Hell, it applies even to services within the same memory space, although obviously the probability of losing any of the three is orders of magnitude less than on a network.

Can we please move on?


> Can we please move on?

Doubtful. If there's one thing that new developers have always insisted upon doing it's telling the networking and data store engineers that their foundational limitations are wrong. It seems that everyone has to learn that computers aren't magic through misunderstanding and experience.


Way way back, I had a coworker who aspired to be a hacker get really mad at me because he was just sure that spinning rust was “digital” while I was asserting to him that there’s a DAC in there and that’s how data recovery services work. They crack open the drive and read the magnetic domains to determine not just what the current bytes are but potentially what the previous ones were too.

They’re digital! It’s ones and zeroes! Dude, it’s a fucking magnet. A bunch of them in fact.


Yes, any real digital device is relying on a hopefully well-separated bimodal distribution of something analog.


Hmm this article seems misleading. I suppose it's trying to make the point that application designers usually don't need to think too hard about it, because it's already being addressed by a quorum consensus protocol implemented by someone else. This is a bit of a tautology though; the author seems to be saying 'assume you have a solution to CAP theorem -- now isn't it silly to worry about CAP theorem?'.

One of the fundamental assumptions of CAP theorem is that you can't tell whether or not you have a partition. If you have an oracle that can instantaneously tell you the state of every subsystem, then yeah, CAP is pointless.

But if one of your DBs is connected, reporting itself as alive, and throwing all its writes into /dev/null, you won't be able to route traffic to a quorum of healthy instances because it's not possible to be certain that they're all healthy.

This is what CAP theorem is about: managing data in a distributed system where the status of any given system is fundamentally unknowable because of the Two Generals' Problem (https://en.wikipedia.org/wiki/Two_Generals'_Problem)

In many cases in Cloud though, we can skip that technical stuff and design systems as if we really _did_ have an oracle that could instantaneously and perfectly tell us the state of the system, and things will typically work fine.


The point being made is that with nimble infrastructure the A in CAP can be designed around to such a degree that you may as well be a CP system, unless you have a really good reason to go after that last 0.005% of availability. Not being CP means sacrificing the wonderful benefits that being consistent (linearizability, sequential consistency, strict serializability) makes possible. It's hard to disagree with that sentiment, and it's likely why the Local First ideology is centered on data ownership rather than that extra 0.0005 ounces of availability. Once availability is no longer the center of attention, the design space can be focused on durability or latency: how many copies to read/write before acking.

Unfortunately the point is lost because of the usage of the word "cloud", a somewhat contrived example of solving problems by reconfiguring load balancers (in the real world certain outages might not let you reconfigure!), and missing empathy: you can't tell people not to care about the semantics that thinking about, or not thinking about, availability imposes on the correctness of their applications.

As for the usage of the word cloud: I don't know when a set of machines becomes a cloud. Is it the APIs for management? Or when you have two or more implementations of consensus running on the set of machines?


He's saying that you don't need Partition Tolerance because the network is never actually Partitioned. This is exactly why the Internet and the US Interstate Highway system were invented in the first place.

Or he's saying you don't need Consistency because your system isn't actually distributed; it's just a centralized system with hot backups.

It's unclear what he's trying to say.

No idea why he wrote the blog post. It doesn't increase my confidence in the engineering quality of his employer, AWS.


> If the partition extended to the whole big internet that clients are on, this wouldn’t work. But they typically don’t.

This is the key, that network partitions either keep some clients from accessing any servers, or they keep some servers from talking to each other. The former case is uninteresting because nothing can be done server-side about it. The latter is interesting and we can fix it with load balancers.

This conflicts with the picture painted earlier in TFA where the unhappy client is somehow stuck with the unhappy server, but let's consider that just didactic.

We can also not use load balancers but have the clients talk to all the servers they can reach, when we trust the clients to behave correctly. Some architectures do this, like Lustre, which is why I mention it.

I see several comments here that seem to take TFA as saying that distributed consensus algorithms/protocols are not needed, but TFA does not say that. TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients (if a quorum is not available it's as if the whole cloud is down, then it's not available to any clients). That is a very reasonable assertion, IMO.


> TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients

It's not "partition tolerance" if you declare that partitions typically don't happen.


So basically this is saying that the CAP theorem is irrelevant because a partition is not really a partition (since the load balancer can still reach everybody). Hmm...

I agree that in modern data centers the CAP theorem is essentially irrelevant for intra-DC services, due to the uptime and redundancy of networking H/W (making a partition less likely than other systemic failures).

Across DCs I'll claim it is still absolutely relevant.


The only concrete solution the article proposes that I can think of: Spanner uses quorum to maintain availability and consistency. Your "master" is TrueTime, which is considered reliable enough. You have replicated app backends. If this isn't too generous, let's also say the cloud handles load balancing well enough. CAP isn't violated, but you might say the user no longer worries about it.

Most databases don't work like Spanner, and Spanner has its downsides, two of them being cost and performance. So most of the time, you're using a traditional DB with maybe a RW replica, which will sacrifice significant consistency or availability depending on whether you choose sync or async mode. And you're back to worrying about CAP.


Ok, you could also have multiple synchronous Postgres standbys in a quorum setup so that it's tolerable to lose one. Supposedly Amazon Aurora does that. Comes with costs and downsides too.


Weird article. Different users have different priorities and that’s what the CAP theorem expresses. The article also pretends that there’s a magic “load balancer” in the cloud that always works and also knows which segment of a partitioned network is the “correct” one (one of the points of CAP is that there’s not necessarily a “correct” side), and that no users will ever be on the “wrong” side of the partition. And not only that but all replicas see the exact same network partition. None of this is reality.

But the gist, I guess, is that for most applications it’s not actually that important, and that’s probably true. But when it is important, “the cloud” is not going to save you.


CAP was never designed as an end-all template you blindly apply to large scale systems. Think of it more as a mental starting point: systems have these trade-offs you need to consider. Each system you integrate has complex and nuanced requirements that don't neatly fall into clean buckets.

As always Kleppmann has a great and deep answer for this.

https://martin.kleppmann.com/2015/05/11/please-stop-calling-...


I suspect the CAP theorem factored into the design of these cloud architectures, in such a way that it now seems irrelevant. But it probably was relevant in preventing a lot of other more complex designs.


For example, having 3 AZs allows Paxos/Raft voting.


Paxos being CP.


I kinda see what the author is getting at, but I don't buy the argument.

However, in the example with the network partition, it relies on proper monitoring to work out whether the DB it's attached to is currently in a partition.

Managing reads is a piece of piss, mostly. It's when you need to propagate writes to the rest of the DB system that stuff gets hairy.

Now, most places can run from a single DB, especially as disks are fucking fast now, so CAP is never really that much of a problem. However, when you go multi-region, that's when it gets interesting.


This only addresses one kind of partition.

What if your servers can't talk to each other, but clients can?

What if clients can't connect to any of your servers?

What if there are multiple partitions, and none of them have a quorum?

Also, changing the routing isn't instantaneous, so you will have some period of unavailability between when the partition happens, and when the client is redirected to the partition with the quorum.


> if a quorum of replicas is available to the client, they can still get both strong consistency, and uncompromised availability.

Then it’s not a partition.


This is just saying because the cloud system hides the implications of the theorem from you, it's not relevant.

I suppose it's kinda true in the sense that how to operate a power plant is not relevant when I turn on my lights.


We pay a lot of money to Amazon to not think about the 8 Fallacies as well.

You still get consequences for ignoring them, but they show up as burn rate.


And if you actually use cloud systems long enough, you'll see violations of the "C" in CAP in the form of results that don't make any sense. So even in practice… it's good to understand, because it can and does matter.


This article assumes P(artitions) don't happen, and then concludes you can have both C and A. Congrats, that's the CAP theorem.


> The formalized CAP theorem would call this system unavailable, based on their definition of availability:

Umm, no? That’s a picture of a partition. The partition is not able to make progress because the system is not partition tolerant. If it did it wouldn’t be consistent. It’s still available.


See also:

> In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

* https://en.wikipedia.org/wiki/PACELC_theorem


This seems like what I've noticed on MPP systems (a little before cloud): data replicas give a lot more availability than the number of partition events would suggest.

I likely need to read the paper linked, but it's common to have an MPP database lose a node but maintain data availability. CAP applies at various levels, but the notion of availability differs:

1. all nodes available
2. all data available

Redundancy can make #2 a lot more common than #1.


The CAP theorem is irrelevant if your acceptable response time is greater than the time it takes your partitions to sync.

At that point you get all 3: consistency, availability, partitioning.

In my opinion it should be the CAPR theorem.


Partition encompasses response time. If you have defined SLAs on response time and your partition times exceed that, you're not available.


One option to increase availability though is to reduce partition times.


> The CAP theorem is also irrelevant if your acceptable response time is greater than the time it takes your partitions to sync.

This is really an oversimplification. The more important metric here is the delay between write and read of the same data. Even in that case, if the system write load is unpredictable it will definitely lead to high variance in replication lag. The number of times I had to deal with a race condition for not considering the replication lag factor is more than I would like to admit.


What's the difference between choosing an acceptable response time that is greater than the time it takes for your partitions to sync and giving up on availability?

I don't think it makes sense to say that CAP doesn't apply if you don't need consistency, availability, or tolerance to partitions. CAP is entirely about the need to relax at least one of those three to shore up the others.


You mean CLAP? Latency and availability are basically the same thing. CAP is simple.


No I'm talking about response time 'requirements'


and considering most cloud compute has a better backbone or internal routing, userspace should see this less.

That being said, if this is truly a problem for you, CRDB is basically built with all of this in mind.


Wow have heard about CRDB but just now reading what it does.

That's an incredible piece of engineering.


I mean you just KNOW that while that addition may make sense, some 12-year-old-minded person like me would just start referring to it as the CRAP theorem. And I don’t even dislike the theorem.


If you don't care about costs...


...or physics.


All models are wrong, some are useful. CAP is probably at least as useful as Newtonian mechanics WHEN you are explaining why you just did a bunch of… extra stuff.

I would like to violate CAP, please. I would like to be nearish to c, please.

Here is my passport. I have done the work.


> The point of this post isn’t merely to be the ten billionth blog post on the CAP theorem. It’s to issue a challenge. A request. Please, if you’re an experienced distributed systems person who’s teaching some new folks about trade-offs in your space, don’t start with CAP.

Yeah… no. Just because the cloud offers primitives that allow you to skip many of the challenges that the CAP theorem outlines doesn't mean it's not a critical step in learning about and building novel distributed systems.

I think the author is confusing systems practitioners with distributed systems researchers.

I agree in some part, the former rarely needs to think about CAP for the majority of B2B cloud SaaS. For the latter, it seems entirely incorrect to skip CAP theorem fundamentals in one’s education.

tl;dr — just because Kubernetes (et al.) make building distributed systems easier, it doesn’t mean you should avoid the CAP theorem in teaching or disregard it altogether.


Every time someone tries to deprecate the nice and simple CAP theorem, it grows stronger. It's an unstoppable freight train at this point, like the concept of relational DBs after the NoSQL fad.



