Gryadka is not Paxos, so it's probably wrong (tschottdorf.github.io)
132 points by arjunnarayan on March 24, 2017 | 42 comments


I'm including the Gryadka author's rebuttal here, for completeness:

Thank you for the analysis of my post, but it seems you didn't understand it correctly. Even to read a value you have to execute the full consensus cycle (prepare, accept); in the case of a stable leader we can skip the prepare. So when you read the nil value, the state of the system will be:

A: (value=foo ballot=1 promised=2)
B: (value=nil ballot=2 promised=2)
C: (value=nil ballot=2 promised=2)

Not the one you mentioned in the post:

A: (value=foo ballot=1 promised=2)
B: (value=nil ballot=0 promised=2)
C: (value=nil ballot=0 promised=2)

So the counterexample is incorrect.

I proved the algorithm mathematically by hand and used very aggressive property-based testing with fault injection, so I'm pretty confident in its correctness.
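For readers trying to follow the state tables above, here is a minimal single-process sketch of a read that runs the full prepare/accept round. It uses in-memory acceptors with no networking or failures and is only an illustration of the idea, not Gryadka's actual code; note how reading nil at ballot 2 bumps B and C to ballot=2, as described above:

```python
# Illustrative sketch only: in-memory acceptors, no networking, no failure handling.
from dataclasses import dataclass

@dataclass
class Acceptor:
    value: object = None
    ballot: int = 0       # ballot of the last accepted value
    promised: int = 0     # highest ballot promised so far

    def prepare(self, b):
        if b <= self.promised:
            return None                    # reject stale ballots
        self.promised = b
        return (self.value, self.ballot)   # promise, reporting current accepted state

    def accept(self, b, value):
        if b < self.promised:
            return False
        self.promised = b
        self.ballot = b
        self.value = value
        return True

def read(acceptors, ballot):
    """A read runs the full round: prepare, pick the highest-ballot value, re-accept it."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        raise RuntimeError("no quorum of promises")
    value = max(promises, key=lambda p: p[1])[0]   # value with the highest accepted ballot
    acks = sum(a.accept(ballot, value) for a in acceptors)
    if acks < quorum:
        raise RuntimeError("no quorum of accepts")
    return value

# The scenario from the comment: a reader that only reaches B and C with ballot 2.
a = Acceptor(value="foo", ballot=1, promised=1)
b, c = Acceptor(), Acceptor()
print(read([b, c], 2), b, c)   # nil (None) is read, and B/C now carry ballot=2, promised=2
```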


I'm not really on good enough wifi to reply extensively, but I pushed an update to the post explaining this better. If you carry out the read explicitly, you can still get the anomaly in much the same way. I should've done so from the beginning, but I tried to simplify the argument and went too far.


This is a very good analysis.

I think it makes a slightly stronger argument about MultiPaxos than it needs to. There are other correct ways to use single-decree Paxos, as long as you recognize its limitations. The true limitation is "use every register once", and a log is the gold standard way to do that. Another correct pattern could be a set-only K/V store, or a store for monotonic finite state machine states (at the cost of O(N) rounds for every contender). One real-world example is 2PC coordination, where the state machine is very simple and can be modeled as an ordered sequence of three registers.
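As a toy illustration of the "use every register once" point, and of why a log is the standard answer: if each log slot is a single-decree register decided at most once, compare-and-swap can be expressed as an operation appended to the log and applied in order. The `decide_slot` callback below stands in for a full single-decree Paxos round per slot and is an assumption of this sketch, not something from the post:

```python
# Toy sketch only: compare-and-swap on top of a log of write-once registers.
# `decide_slot(i, proposal)` is assumed to run single-decree Paxos for slot i and
# return whatever value that slot actually decided (ours or a competitor's).

class LogCAS:
    def __init__(self, decide_slot):
        self._decide_slot = decide_slot
        self._next = 0        # next unused log slot
        self._state = None    # register value after applying the log so far

    def cas(self, op_id, expected, new):
        # Keep proposing to fresh slots until the decided entry is our own operation;
        # op_id is a unique tag so identical-looking operations aren't confused.
        while True:
            decided = self._decide_slot(self._next, (op_id, expected, new))
            self._next += 1
            _, exp, val = decided
            ok = (self._state == exp)
            if ok:
                self._state = val
            if decided[0] == op_id:
                return ok     # the outcome of *our* operation, at its place in the log
```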


While we're here, it's also not true that all correct consensus protocols are Paxos. For example, Viewstamped Replication is correct, but a different algorithm (see https://arxiv.org/pdf/1309.5671v3.pdf). There are a number of correct algorithms, and they all smell an awful lot like Paxos, but they aren't all exactly Paxos. That's the hard part: nearly all Paxos-like algorithms are broken, but not all.

Also from the "variants of Paxos" department is Flexible Paxos (https://blog.acolyer.org/2016/09/27/flexible-paxos-quorum-in...) which changes the rules about how to select a quorum.


EPaxos (essentially leaderless Paxos) is interesting as well, but I'm not sure how much deployment it's seen: https://github.com/efficient/epaxos

The repo includes a TLA+ model too, which will hopefully become a trend in newly proposed distributed systems in the future!

Cassandra looked into implementing it, and reached a prototype implementation with ~60% better performance, but it looks like the contributor didn't continue driving it: https://issues.apache.org/jira/browse/CASSANDRA-6246


Five bucks says that 'the contributor' got a job at ScyllaDb.


Apple, actually.


It's good but wrong because it's based on a false assumption about how the read operation works. I explained it in the comments.


http://www.allreadable.com/5b354QWp

"every consensus protocol out there or every fully distributed consensus protocol is either Paxos or Paxos with cruft or broken" - Mike Burrows


that was a great talk, thank you for sharing!


I haven't done any in-depth analysis of Gryadka, but I think the premise of the argument here may be wrong. Even if a particular value isn't accepted by a majority of nodes, that doesn't necessarily make it a dirty read. As long as anyone participating in the algorithm sees the history as if it was committed right before the cas result is committed, it could still be linearizable. I would need to create a formal model to be sure whether it is correct or not, but don't just assume it's wrong because it isn't Paxos (which it isn't, and it shouldn't advertise as if it were).


But depending on node failures, couldn't the same client successfully run cas(nil, A) -> cas(A, B) -> cas(nil, C), with all operations succeeding? Say the first two operations only succeed in writing to a single node (as in the post's example). Then if that single node goes down, the third cas will succeed, which is certainly not linearizable. Note: I'm working off the assumption that the system is correctly described by the post's author. The original author of Gryadka has disputed the description in the post, and I haven't read the source.
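A quick toy check of that claim (a model of a sequential register, not of Gryadka itself): no ordering of those three CAS operations lets all of them succeed against a single register, so a run in which all three return success cannot be linearizable.

```python
# Sequential (linearizable) register model: cas succeeds only if the current value
# matches `expected`. No permutation of the three operations yields three successes.
from itertools import permutations

def run(ops):
    state, results = None, []
    for expected, new in ops:
        ok = (state == expected)
        if ok:
            state = new
        results.append(ok)
    return results

history = [(None, "A"), ("A", "B"), (None, "C")]   # cas(nil,A), cas(A,B), cas(nil,C)
assert not any(all(run(list(p))) for p in permutations(history))
```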


I think the most interesting thing the article states is the following question (and proposed answer):

Is it possible to get compare-and-swap without the log?

TL;DR: I don’t think so.

Riak Ensemble is one such system; it uses single-decree Paxos to try to achieve CAS: https://github.com/basho/riak_ensemble

If only the distinguished leader is allowed to publish a proposal for a value change, you can preserve the CAS invariant. Unfortunately, this comes at the cost of liveness. If you then use a second Paxos group to elect the leader, that removes the liveness issue. On a view change, you have to contact a quorum of nodes and re-propose a new epoch in order to avoid issues.
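A generic sketch of that leader-plus-epoch idea (epoch fencing in the abstract, not riak_ensemble's actual code or API): replicas refuse writes from stale epochs, and a new leader must learn the latest state and claim a higher epoch before it serves CAS.

```python
# Illustrative only: single-process, no networking; leader election itself (the
# "second Paxos group") is assumed to happen elsewhere and simply hands us `replicas`.

class Replica:
    def __init__(self):
        self.epoch, self.value = 0, None

    def read_state(self):
        return self.epoch, self.value

    def apply(self, epoch, value):
        if epoch < self.epoch:
            return False                 # fence out proposals from deposed leaders
        self.epoch, self.value = epoch, value
        return True

class Leader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1
        # View change: adopt the newest state reported, then claim a higher epoch.
        states = [r.read_state() for r in replicas]   # a real system reads a quorum over the network
        self.epoch, self.value = max(states, key=lambda s: s[0])
        self.epoch += 1

    def cas(self, expected, new):
        if self.value != expected:
            return False
        acks = sum(r.apply(self.epoch, new) for r in self.replicas)
        if acks < self.quorum:
            raise RuntimeError("lost quorum; a newer leader may have taken over")
        self.value = new
        return True
```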

I'm curious if someone can poke a hole in riak_ensemble's algorithm.


It would have been nicer to contact the authors of the algorithm before making these (apparently incorrect) claims via blog post. It's just sensationalistic.


For those wondering, "gryadka" is "bed" as in vegetable/garden/flower bed in Russian.


And Redis, multiple instances of which are supposed to be plugged into gryadka, means "radish" in Russian :)


Sounds Mario 2 themed.


I know that consensus is probably the cleanest way to guarantee very strict correctness, but I'm kind of not convinced that synchronous replication is the ideal solution for HA. I have yet to see clear evidence that teams at less-than-Google scale aren't mostly getting by with semisynchronous replication, or even just asynchronous replication.

Always running an ensemble of three or five nodes for each shard seems to be pretty overkill (and expensive, both in terms of money and latency), especially if it's mostly just a way to do automated failovers. I sometimes wonder if there's a good-enough cheaper semisync alternative.


You get other advantages from doing three/five-node replication: online updates are free if you require (n-1) compatibility.

The reality is that if you need to hit 5+ nines and require strong consistency guarantees (like "we never lose our customers' data") at data center scale, you'll probably need something similar. Most people probably don't need those guarantees, however.


I should have clarified -- I meant mostly for things like OLTP databases.

I see why Google built Megastore, Spanner, etc. And sure, ZooKeeper or etcd makes sense for putting small amounts of configuration data.

But most of us aren't Google, and synchronously replicating the entire OLTP database, 3 or 5 nodes times the number of shards, seems kind of absurdly expensive for most people.


A node can handle far more than one shard. So you can basically have 10 nodes, and 8 shards with 5 replicas each. (Just an example)
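To make the arithmetic concrete (same numbers as above; the round-robin placement is just an illustration, not any particular system's scheme): 8 shards times 5 replicas is 40 replicas, which spread over 10 nodes is 4 replicas per node.

```python
# 8 shards x 5 replicas = 40 replicas over 10 nodes = 4 replicas per node.
NODES, SHARDS, REPLICAS = 10, 8, 5

placement = {n: [] for n in range(NODES)}
for shard in range(SHARDS):
    for r in range(REPLICAS):
        placement[(shard * REPLICAS + r) % NODES].append(shard)

# Every node hosts 4 shard replicas, and each shard's replicas land on distinct nodes.
assert all(len(shards) == SHARDS * REPLICAS // NODES for shards in placement.values())
```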


Oops, you are correct -- I somehow forgot that shards are logical, not physical.


I've always thought that what's missing is a solid distributed naming service that uses something like EPaxos (Egalitarian Paxos). Something that is dead simple, auto-healing, and bulletproof. I say EPaxos because while Raft is a simplification in terms of replication, having a single leader breaks symmetry and adds unnecessary complexity and inconsistency in both the implementation and the design, especially in the context of simple operations. And having a single leader pushing data updates is an invitation for people to get too clever with their data model.

It doesn't need to be incredibly performant. Its role is merely to unify the topology. Various services can use whatever strategy best fits the problem. But you have to be able to register and find those services first.

Of course, there are dozens of projects that attempt to provide this functionality, but none are plug-and-play AFAIK, not yet. I shouldn't need a config file, or any kind of configuration except maybe a couple of DNS records to bootstrap discovery and a path to a certificate authority. There shouldn't be _anything_ that a sysadmin could fiddle with.

A long time ago (almost 10 years) I wrote something like this using BerkeleyDB replication. (I think it was basically Raft long before Raft was written down and formalized.) It was a replicated backing store for BIND. All you provided the daemon was the name of an SRV record, a certificate authority, and signed public key credentials.

It wasn't complete. BerkeleyDB left initial replication of bulk data up to the user, and I never completely finished that work before the startup ended. And there were some other problems left unsolved and unexplored. Plus, BerkeleyDB wasn't completely multi-master like EPaxos. But the basic system worked: firing up and shutting down nodes was trivial and seamless.

The database was itself the backing store for an authoritative DNS zone which registered the distributed nodes used for a tunneling/proxy network. (Not a botnet ;) (Think Cloudflare + Sling + ...) To access a slave node you just queried its SRV record, which gave you the location of the gateway. If the slave node disappeared and reappeared at another gateway address (because of a network hiccup), that was fine: logical streams were multiplexed over the transport channel(s).

And there was also a separate service that gatewayed external HTTP requests, so external software didn't have to understand the internal protocols. Using A records and HTTP redirects, clients were usually directed to the terminating node for the tunnel, to reduce internal network traffic, but that was just a small optimization; everything looked the same no matter your ingress or egress points.

The thing was, this was a _completely_ flat architecture. Every node provided each and every service, backend and frontend. Growing and shrinking the network was trivial, as was building the machine images. Testing and development were easy, too, because absolutely _everything_ hung off a DNS zone. The only necessary bit of configuration was providing the signed X.509 certificate for a node with an embedded ID and hostname. Because I was using SRV records, which provided port numbers, you could run multiple nodes on a single machine during development.

It wasn't a perfect system, and it was incomplete. But I did this in just a couple of months, from scratch, alone, and while writing other components, long before all these other projects came out. And my solution was much easier to manage. I'm nowhere near as smart, experienced, or capable as many of the engineers on these modern projects. I don't understand why all the solutions suck.

CockroachDB seems like the best candidate, but of course they're providing much more capability than is necessary for a naming service and are still immature. etcd seems like one of the most mature, but it fails at being zero-maintenance, which is the most important part. The others are generally much too complex in almost every respect. Though I have to admit I haven't been following the space very closely.


That article was like setting foot on an alien world. I have never even heard of a single one of those things.



So is this like the sort of stuff in newfangled cars, where you hear about automatic brakes, and they say "There's 4 computers and at least 3 of them have to agree on the sensor data before they slam the brakes"... kind of thing?


tl;dr1 Yes, read the paper I link at the bottom, or this http://users.ece.utexas.edu/~bevans/courses/ee382c/projects/...

tl;dr2 when you read "Paxos" or "distributed"-anything just assume it's a bunch of nerds arguing about theoretical problems inside a box

Recap of those wiki pages:

- Consensus is many different ways to solve similar problems, all of which involve agreeing on a solution, to different degrees.

- The Two Generals' Problem is a provably unsolvable problem that says you can't know what two generals are going to do if you have no way to prove that communication between them is valid (or happens at all).

- Byzantine faults are basically the same thing, but with 10 people instead of 2.

- Paxos is a bunch of ways of proving various degrees of different kinds of consensus, with different uses, that work most of the time. (also known as "the algorithm family that comp sci majors keep trying to improve on, but then a genius points out that any changes make it work differently, and therefore must suck")

Why do we care? So you can use 5 servers all around the world, store random messages on them, and be sure that a bunch of them have the information you want, that it is correct, that it will still be available somewhere if one of them goes down, and that if one comes back up with shitty data, it will get corrected and become good data again.

The reason why this is hard is we assume that the messages:

1. have no CRC (integrity, "this data is not corrupt")

2. or digital signature (integrity+authenticity, "this message is not corrupt and definitely came from Bob")

3. and that even if they did, it could have been 3a. "wrong" before it was signed, or 3b. "faked" so that we can't tell when a CRC or signature is correct,

4. or the message never arrived,

5. or random valid messages have been saved on one server, and when it rejoins the group it now has data that, good or bad, we don't want in the rest of the pool of servers, because the rest now have to incorporate it, and what if there are conflicts?

The Paxos algorithm family seeks to solve most of this in one go, with exceptions.

Do you need Paxos to solve all those problems? No. Do you need it to solve those problems in the theoretical boundaries of comp sci nerds? Yes. Do you need to know how it works? No. Are there situations where a Paxos network can simply stop working? Yes. Do you need to use Paxos for something you're working on? Probably not. Is it possible Paxos will not solve your problem? Yes. And do you still need plans to back up and restore your data in the event of catastrophic failure? Definitely.

The only paper that I know of currently that I would recommend anyone read on Byzantine failures and their solutions is this one: https://www.cs.indiana.edu/classes/p545/post/lec/fault-toler...

(Granted, I don't know what I am talking about, so take all that with a grain of salt)


I wouldn't worry about it, it's a nerd sniping post.


What about Raft?


Can someone do a TLA+ spec for Gryadka?


I don't know TLA+ yet, but I'll assist anybody by explaining how Gryadka works.


Everybody upvoting this without any discussion.

I suppose "It's not Paxos, so it's probably wrong" is beyond discussion (not that I disagree, of course).


Is it? What about Raft?


"If it’s not Paxos, it’s probably wrong."

Is Bitcoin Paxos?


It's not a (provable, guaranteed) CP system. Bitcoin attempts to provide a global consensus across a large number of actors, but its attempt is based on proof of work, specifically that it's probably difficult to compute hashes with a certain property (leading zeros, last I checked). It in no way guarantees consistency in the face of partitions.

Consider the simplest case: half of Bitcoin users/miners are temporarily split off from the other half, for say a week (all Atlantic/Pacific fibers are broken at once, all satellites fall out of the sky, other huge catastrophes occur simultaneously). Each half would append their own blocks to the blockchain happily, and depending on the chain lengths added, once the partition went away there would be no good way to reconcile. Thus: not consistent in the face of partitions.
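A toy model of that scenario (fixed work per block and a bare "most total work wins" rule, which glosses over real difficulty adjustment): both halves keep confirming transactions for a week, and whichever side loses the fork choice at heal time has those confirmations discarded.

```python
# Both partitions extend their own chain; at reconnection one side's blocks lose.
def fork_choice(chain_a, chain_b):
    # Stand-in for Bitcoin's rule: keep the chain with the greatest total work.
    work = lambda chain: sum(block["work"] for block in chain)
    return chain_a if work(chain_a) >= work(chain_b) else chain_b

genesis = [{"work": 1, "txs": ["coinbase"]}]
side_a = genesis + [{"work": 1, "txs": [f"a{i}"]} for i in range(1008)]   # ~1 week at 10 min/block
side_b = genesis + [{"work": 1, "txs": [f"b{i}"]} for i in range(1009)]

winner = fork_choice(side_a, side_b)
loser = side_a if winner is side_b else side_b
print(len(loser) - 1, "blocks of previously confirmed transactions are orphaned")
```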


> Each half would append their own blocks to the blockchain happily, and depending on the chain lengths added, once the partition went away there would be no good way to reconcile.

Why do you say that? The chain with the greatest total difficulty wins.


Which means that all the people who successfully wrote to and relied on the other chain over the last week suddenly lose all their transactions. That's not what anyone means by CP.


That is not an acceptable form of reconciliation in a CP system.


Bitcoin is AP not CP.


The hard problem is consistency, in terms of tradeoffs and implementation. Pure AP is easy.


What if both happen to have the same total difficulty? Granted, it is unlikely, but so is a failure of a non-Paxos system with lease timeouts on the order of ~10 minutes.


See, for example, Garay et. al: https://eprint.iacr.org/2014/765.pdf



