
I know that consensus is probably the cleanest way to guarantee very strict correctness, but I'm not really convinced that synchronous replication is the ideal solution for HA. I have yet to see clear evidence that teams at less-than-Google scale aren't mostly getting by with semisynchronous replication, or even plain asynchronous replication.

Always running an ensemble of three or five nodes for each shard seems like overkill (and expensive, both in money and latency), especially if it's mostly just a way to do automated failovers. I sometimes wonder if there's a cheaper, good-enough semisync alternative.



You get other advantages from doing three/five-node replication--online updates are free if you require (n-1) compatibility.

The reality is that if you need to hit 5+ nines and require strong consistency guarantees (like "we never lose our customers' data") at data-center scale, you'll probably need something similar. Most people probably don't need those guarantees, however.


I should have clarified -- I meant mostly for things like OLTP databases.

I see why Google built Megastore, Spanner, etc. And sure, ZooKeeper or etcd makes sense for storing small amounts of configuration data.

But most of us aren't Google, and synchronously replicating the entire OLTP database, 3 or 5 nodes times the number of shards, seems kind of absurdly expensive for most people.


A node can handle many shards, not just one. So you could have, say, 10 nodes and 8 shards with 5 replicas each. (Just an example.)
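
To make the arithmetic concrete, here's a minimal sketch (the round-robin placement and labels are purely illustrative, not from any real system): 8 shards x 5 replicas is 40 replicas, i.e. 4 per node across 10 nodes.

    package main

    import "fmt"

    func main() {
        const nodes, shards, replicas = 10, 8, 5
        placement := make([][]string, nodes) // node index -> replica labels
        slot := 0
        for s := 0; s < shards; s++ {
            for r := 0; r < replicas; r++ {
                // Round-robin assignment; since nodes >= replicas, no node
                // ever holds two replicas of the same shard.
                n := slot % nodes
                placement[n] = append(placement[n], fmt.Sprintf("shard%d/r%d", s, r))
                slot++
            }
        }
        for n, p := range placement {
            fmt.Printf("node%d: %v\n", n, p) // 4 replicas per node
        }
    }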


Oops, you are correct -- I somehow forgot that shards are logical, not physical.


I've always thought that what's missing is a solid distributed naming service that uses something like EPaxos (Egalitarian Paxos). Something that is dead simple, auto-healing, and bulletproof. I say EPaxos because while Raft is a simplification in terms of replication, having a single leader breaks symmetry and adds unnecessary complexity and inconsistency in both the implementation and the design, especially in the context of simple operations. And having a single leader pushing data updates is an invitation for people to get too clever with their data model.

It doesn't need to be incredibly performant. Its role is merely to unify the topology. Various services can use whatever strategy best fits the problem. But you have to be able to register and find those services first.

Of course, there are dozens of projects that attempt to provide this functionality, but none are plug-and-play AFAIK, not yet. I shouldn't need a config file, or any kind of configuration except maybe a couple of DNS records to bootstrap discovery and a path to a certificate authority. There shouldn't be _anything_ that a sysadmin could fiddle with.
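
Roughly the kind of bootstrap I have in mind, as a sketch in Go -- the SRV name and CA path below are made-up placeholders, not anything real: one DNS lookup plus one CA file, and nothing else.

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "fmt"
        "log"
        "net"
        "os"
    )

    func main() {
        // The only "configuration": a well-known SRV name and a CA path.
        _, addrs, err := net.LookupSRV("naming", "tcp", "example.internal")
        if err != nil {
            log.Fatal(err)
        }

        caPEM, err := os.ReadFile("/etc/pki/cluster-ca.pem") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(caPEM)

        // SRV records carry ports as well as hosts, so several nodes can
        // even share one machine.
        for _, srv := range addrs {
            addr := fmt.Sprintf("%s:%d", srv.Target, srv.Port)
            conn, err := tls.Dial("tcp", addr, &tls.Config{RootCAs: pool})
            if err != nil {
                continue // try the next SRV target
            }
            fmt.Println("bootstrapped via", addr)
            conn.Close()
            break
        }
    }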

A long time ago (almost 10 years) I wrote something like this using BerkeleyDB replication. (I think it was basically Raft long before Raft was written down and formalized.) It was a replicated backing store for BIND. All you provided the daemon was the name of an SRV record, a certificate authority, and signed public-key credentials.

It wasn't complete. BerkeleyDB left initial replication of bulk data up to the user, and I never completely finished that work before the startup ended. And there were some other problems left unsolved and unexplored. Plus BerkeleyDB wasn't completely multi-master like EPaxos. But the basic system worked: firing up and shutting down nodes was trivial and seamless.

The database was itself the backing store for an authoritative DNS zone which registered the distributed nodes used for a tunneling/proxy network. (Not a botnet ;) (Think Cloudflare + Sling + ...) To access a slave node you just queried its SRV record, which gave you the location of the gateway. If the slave node disappeared and reappeared at another gateway address (because of a network hiccup), that was fine: logical streams were multiplexed over the transport channel(s).

And there was also a separate service that gatewayed external HTTP requests, so external software didn't have to understand the internal protocols. Using A records and HTTP redirects, clients were usually directed to the node terminating the tunnel, to reduce internal network traffic, but that was just a small optimization--everything looked the same no matter your ingress or egress points.
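
The redirect trick was about as simple as it sounds. A hypothetical sketch (lookupTerminator is a made-up stand-in for whatever would consult the naming service, not code from that system):

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // lookupTerminator is hypothetical: it would ask the naming service which
    // node terminates the tunnel for the requested host.
    func lookupTerminator(host string) (node string, ok bool) {
        // ... naming-service lookup elided ...
        return "", false
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        if node, ok := lookupTerminator(r.Host); ok && node != r.Host {
            // Send the client straight to the terminating node so the request
            // isn't proxied across the internal network.
            http.Redirect(w, r, "https://"+node+r.URL.RequestURI(),
                http.StatusTemporaryRedirect)
            return
        }
        fmt.Fprintln(w, "served locally") // this node terminates the tunnel
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }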

The thing was, this was a _completely_ flat architecture. Every node provided each and every service, backend and frontend. Growing and shrinking the network was trivial, as was building the machine images. Testing and development were easy, too, because absolutely _everything_ hung off a DNS zone. The only necessary bit of configuration was the signed X.509 certificate for a node, with an embedded ID and hostname. Because I was using SRV records, which provided port numbers, you could run multiple nodes on a single machine during development.

It wasn't a perfect system, and it was incomplete. But I did this in just a couple of months, from scratch, alone, and while writing other components, long before all these other projects came out. And my solution was much easier to manage. I'm nowhere near as smart, experienced, or capable as many of the engineers on these modern projects. I don't understand why all the solutions suck.

CockroachDB seems like the best candidate, but of course they're providing much more capability than a naming service needs, and they're still immature. etcd seems like one of the most mature, but it fails at being zero-maintenance, which is like the most important part. The others are generally much too complex in almost every respect. Though I have to admit I haven't been following the space very closely.



