How Eventual is Eventual Consistency? (basho.com)
88 points by pharkmillups on March 2, 2012 | 15 comments



Shameless plug:

Interactive Demo: http://bailis.org/projects/pbs/#demo

PBS for Cassandra: https://github.com/pbailis/cassandra-pbs


And an explanation of the PBS paper as applied to Cassandra: http://www.datastax.com/dev/blog/your-ideal-performance-cons...


Good talk, but it's disappointing that he doesn't mention the issue of node failures: with W=1 and a single node failure at the wrong time, you don't have eventual consistency any more... you have data loss.


This depends. Even with W=R=1, you still send each read/write to every replica. If the replica that acknowledged the write fails, the other two replicas should eventually receive the write.

But--what happens if the write messages to the other two replicas drop? Well, typically the coordinator will hand the write off to replacement replicas (hinted handoff/sloppy quorums). Alternatively, when you detect the message loss via a timeout you could retry. But what happens if the coordinator dies and can't retry? Then you'd better replicate your coordinator. That gets hairy, so hinted handoff and sloppy quorums are easier. However, whichever way you go, it's definitely possible to handle an arbitrary (still fixed) number of node failures without data loss.

In general, though, sloppy quorums/hinted handoff solve this problem. I haven't heard any data loss complaints with Riak/Cassandra/Voldemort due to replication, but I'm very interested to hear if you have.

You can definitely extend PBS to node failure cases/sloppy quorums/hinted handoff. The main reason we didn't was because we don't have good failure data. There's nothing stopping you, and, as we point out in the paper/backup slides, you can potentially hide this in the tail (e.g. the .01% of stale cases) provided your DB "does the right thing".
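
For the curious, here's a rough sketch of what a coordinator doing hinted handoff might look like; the node names, failure rate, and send() RPC are all made up for illustration, not Riak/Cassandra/Voldemort internals:

  import random

  REPLICAS  = ["r1", "r2", "r3"]   # preference list for the key
  FALLBACKS = ["f1", "f2", "f3"]   # nodes willing to hold hints

  def send(node, key, value):
      """Pretend RPC: returns True if the node acked the write."""
      return random.random() > 0.2  # made-up 20% message-loss rate

  def coordinate_write(key, value, w=1):
      acks, missed = 0, []
      for node in REPLICAS:
          if send(node, key, value):
              acks += 1
          else:
              missed.append(node)
      # Sloppy quorum / hinted handoff: park the write on a fallback node,
      # tagged with the replica it's destined for, to be replayed later.
      for node, fallback in zip(missed, FALLBACKS):
          if send(fallback, key, (value, "hint-for:" + node)):
              acks += 1
      return acks >= w  # report success to the client once w acks arrive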


Indeed, if your coordinator isn't a replica you obviously need it to fail as well in order to get data loss -- I was thinking of the symmetric case where every replica is a coordinator and W=1 reduces to "store locally, broadcast, and send an ACK".

And yes, of course there are hairy ways to solve the problem; I just would have liked to see it mentioned so that people realize that it exists.


You make a good point. Without additional safeguards/depending on implementation, W=1 can indeed mean "durability of one".

This also depends on your failure model. If your node crashed (RAM cleared) and your durable storage broke, you're in trouble. If the data was durably persisted and you just have to restart the node, it's a better situation.


Great presentation, Mark! This is what I LOVE about Basho: they use science and metrics to show how Riak compares to other NoSQL stores, rather than marketing.

TL;DR version: In all databases, you have to choose which 2 parts of CAP you want: Consistency, Availability, Partition tolerance. If you choose Availability, just HOW consistent is my data? If I want my data to be more consistent, how does this affect my availability? This gives an actual formula to calculate the best design that meets your application's goals.
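
To make the "formula" part concrete, here's a toy Monte Carlo sketch of the kind of question PBS answers (probability a read started t ms after a write sees the new value); the exponential delays and parameters are invented, not the paper's actual WARS model:

  import random

  def staleness_sketch(n=3, r=1, w=1, t=5.0, trials=100_000, mean_delay=10.0):
      """Estimate P(read started t ms after the write commits sees the write)."""
      fresh = 0
      for _ in range(trials):
          # per-replica write propagation delays (made-up exponential model)
          arrivals = [random.expovariate(1.0 / mean_delay) for _ in range(n)]
          commit = sorted(arrivals)[w - 1]        # write acked once w replicas have it
          contacted = random.sample(range(n), r)  # read goes to r random replicas
          if any(arrivals[i] <= commit + t for i in contacted):
              fresh += 1
      return fresh / trials

  print(staleness_sketch(r=1, w=1, t=0.0))  # well under 1.0
  print(staleness_sketch(r=2, w=2, t=0.0))  # 1.0: R + W > N always intersects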


I can't see the talk at the moment, but here are my 5 cents on the subject:

The main problem with eventual consistency is not how often it happens; it is: what damage will it do WHEN it happens?

Imagine you're a bank handling big clients and you lose track of a write of 50+ million dollars. Where did the money go? How do you differentiate that from a fraud attempt?

If you have customers, how will you tell them that you just lose, probabilistically speaking, "one in 10 million packages"?

But that's a very interesting question that also has philosophical repercussions: how come we live in a society that has not built systems that accept a certain degree of failure?


Eventual consistency isn't about "losing writes". It's about how long it will take for all of your replicas to agree on/observe the last written versions and, in the meantime, you'll read potentially stale data.

Certain data structures inherently tolerate staleness or message reordering: look at your Twitter feed, any kind of log, or other "commutative data structures". If you can't handle staleness, you should probably use stronger consistency.

However, if you can find out about staleness after the fact (an asynchronous callback, for instance), you can run some sort of compensatory routine (e.g., overdraft charges for your bank). Then you have an optimization problem: (cost of compensation)*(number of times you have to run compensation) vs. the benefit you get from weak consistency (latency, availability, whatever).
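
To make the optimization concrete, a back-of-the-envelope sketch with invented numbers:

  # Toy numbers, purely illustrative.
  reads_per_day             = 10_000_000
  p_stale                   = 0.0001    # fraction of reads that act on stale data
  cost_per_compensation     = 0.50      # dollars per compensating action (e.g. a reversal)
  value_of_weak_consistency = 2_000.0   # daily value of the lower latency / higher availability

  expected_compensation = reads_per_day * p_stale * cost_per_compensation
  print(expected_compensation)                               # 500.0 dollars/day
  print(expected_compensation < value_of_weak_consistency)   # True: worth it in this toy case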

There's an awesome paper by Pat Helland about the problem you mention regarding building real-world systems on top of inherently unreliable components. It's called "Building on Quicksand": http://arxiv.org/pdf/0909.1788.pdf


Hmm, I have been downvoted, but I deserve it. So to rephrase what I mean: the problem is, as you say, not losing data that is going to be inserted, but performing operations on the wrong data, which means errors.

Let's say I have 3 pieces of data, A, B, and C, required by a function to compute the outcome of a certain operation (withdraw 10 billion dollars). We change the data in this fashion: A -> A', B -> B', C -> C' -> C"

When I query, because of eventual consistency the f(A,B,C) may very well be:

f(A,B,C), f(A,B',C), f(A,B,C'), f(A,B,C"), and so on. It is simple when you have 3 sources, but what about when you have 50, and what about when the operation uses 50 of those f, each depending on 50 other pieces of data?
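
Something like this, purely for illustration (the version lists are hypothetical):

  import itertools

  # Each input may be read from a replica that has, or hasn't, seen the
  # latest update yet.
  possible_versions = {
      "A": ["A", "A'"],
      "B": ["B", "B'"],
      "C": ["C", "C'", 'C"'],
  }

  combinations = list(itertools.product(*possible_versions.values()))
  print(len(combinations))  # 12 distinct (A, B, C) mixes f() might be fed
  # With 50 inputs, each with a couple of in-flight versions, the number
  # of possible mixes explodes.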

Anyway, again sorry for my poor explanation of the issue!


I think you are conflating eventual consistency with losing data. They are not the same at all. Good eventually consistent systems give you the option to see the entire history of a key, and if there are inconsistencies you can resolve them how you see fit. If you absolutely, for a fact, need to know the value of something, you can do reads at high R values to make sure everyone agrees on the value. No one who markets or sells eventual consistency, be it Amazon or Basho in this case, drops data in cases where consistency isn't guaranteed.


That is true, although one of the things EC has to be careful about is when the mutation history of a key results in different values if it is applied out of order.

Consider the example value:

  Mutations "insert 5 A"                AAAAA
            "delete 2 characters"       AAA
            "insert 3 B"                BBBAAA

vs an out-of-order version:

  Mutations "insert 5 A"                AAAAA
            "insert 3 B"                BBBAAAAA
            "delete 2 characters"       BAAAAA

If you partition and rejoin, even knowing all the timestamps, it can be hard to re-assemble.
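
For what it's worth, the divergence is easy to reproduce; a tiny sketch matching the example above (insert prepends, delete drops from the front):

  def apply(ops, s=""):
      """Apply mutations to a string in the given order."""
      for op, arg in ops:
          if op == "insert":
              s = arg + s      # prepend, as in the example above
          elif op == "delete":
              s = s[arg:]      # drop the first `arg` characters
      return s

  ops = [("insert", "AAAAA"), ("delete", 2), ("insert", "BBB")]
  print(apply(ops))                        # BBBAAA
  print(apply([ops[0], ops[2], ops[1]]))   # BAAAAA: the two orders diverge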

But the basic thesis that EC is not appropriate for all data models is certainly valid. I wouldn't want my bank to use such a model for reconciling transactions; they screw up enough as it is.


Right; you need some notion of a total order or commutativity in your update functions: f(A,B) = f(B,A). "Last writer wins" is one example of commutativity, but that isn't often what you want.

What you want is something like a "Commutative Replicated Data Type" [1], where you define a commutative function specific to your application. Libraries like StateBox allow you to build CRDTs [2]. In fact, your example of document editing was one of the areas where these ideas first came up.
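
As a concrete (if minimal) example of the idea, not StateBox's actual API: a grow-only counter where each replica bumps its own slot and merge takes element-wise maxima, so merges commute no matter the delivery order:

  class GCounter:
      """Grow-only counter CRDT sketch: one slot per replica, merge = max."""
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.counts = {}

      def increment(self, amount=1):
          self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

      def merge(self, other):
          # Commutative, associative, idempotent: merge order doesn't matter.
          for rid, n in other.counts.items():
              self.counts[rid] = max(self.counts.get(rid, 0), n)

      def value(self):
          return sum(self.counts.values())

  a, b = GCounter("a"), GCounter("b")
  a.increment(); b.increment(2)
  a.merge(b); b.merge(a)
  assert a.value() == b.value() == 3  # replicas converge to the same total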

There's also a theorem saying that if your program is "logically monotonic"--that is, if your data only "grows" in size, and facts never change from True to False or vice versa--then your program will work under eventual consistency without modification [3].

Finally, bank accounts have to employ eventual consistency. Banks demand availability from their ATMs and give up consistency to handle partition tolerance: your ATM will still work even if you sever its network connection. However, banks, unlike a lot of software, have well-defined compensation protocols for handling inconsistency. For example, they'll charge you for negative balances left in your account via overdraft fees.

[1] http://arxiv.org/pdf/0710.1784.pdf

[2] http://labs.mochimedia.com/archive/2011/05/08/statebox/

[3] http://databeta.wordpress.com/2010/10/28/the-calm-conjecture...


> your ATM will still work even if you sever its network connection

Do they?


It's all about tradeoffs and acceptable risk. A bank, an e-mail provider, and a game analytics tracking system all have different levels of risk tolerance. (Protip: delivery services already lose way more than one in 10 million packages, for any number of reasons.)

I haven't watched the talk yet either, but actually quantifying the risks associated with eventual consistency is a nice idea. It should help system designers make better decisions as to what tools are trustworthy enough for their application.



