The benefits we want to obtain from building distributed systems are:
1) Increased availability
2) Ability to scale (better throughput)
3) Lower latency (get the data closer to the client)
As you said, WAL + Consensus solves the consistency problem in distributed systems. It does, however, work against all three of those desirable properties:
1) You lose availability when consensus cannot be reached
2) Throughput is decreased in the face of contention
3) Latency is worse because you need a quorum across the separate locations (a rough sketch of this cost follows below)
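To make the latency point concrete, here's a minimal sketch (the RTT numbers are made up for illustration) of why a majority-quorum write is bounded by the slowest member of the fastest majority, rather than by your nearest replica:

```rust
// In a majority-quorum protocol, a write commits once the leader has heard
// from a majority of the n replicas (counting itself). Best-case commit
// latency is therefore the ack time of the slowest member of the fastest
// majority, not of the fastest peer.
fn quorum_commit_latency_ms(mut ack_times_ms: Vec<u64>) -> u64 {
    ack_times_ms.sort_unstable();
    let majority = ack_times_ms.len() / 2 + 1;
    ack_times_ms[majority - 1]
}

fn main() {
    // Hypothetical ack times at a US-East leader: itself (0 ms), a same-coast
    // peer (2 ms), central (30 ms), and two West Coast peers (70 ms, 75 ms).
    let acks = vec![0, 2, 30, 70, 75];
    // A quorum of 5 is 3 nodes, so every write waits ~30 ms for the central
    // replica even though two acks arrived almost immediately.
    assert_eq!(quorum_commit_latency_ms(acks), 30);
}
```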
One way to go about this trade-off is to push for "raw power": better networks, better clocks, etc. This is a commendable task and advances here should be celebrated, but there's a wall on the horizon: eventually we will hit actual physical limits (e.g. the speed of light won't let you lower latencies any further). Google's Spanner is probably close to those limits already. What can we do, then, when this is not enough?
The other approach is to work on reducing the need for coordination as much as possible. The paper fits in this realm. What it does is identify a class of (sub)problem specifications that are solvable without coordination: monotonic specifications. It shows that CRDTs are monotonic, though I'm not sure whether every monotonic specification can be expressed as a CRDT (be it operation-based or state-based).
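For intuition, the textbook example of a state-based CRDT is a grow-only set: its merge is a lattice join (set union), which is commutative, associative, and idempotent, so replicas converge no matter in what order or how often they exchange state. A minimal sketch (the type and method names are mine, not the paper's):

```rust
use std::collections::HashSet;
use std::hash::Hash;

// A grow-only set (G-Set), the simplest state-based CRDT. The state only
// ever grows, and merge is set union, so no coordination is needed for
// replicas to converge.
#[derive(Clone)]
struct GSet<T: Eq + Hash + Clone> {
    items: HashSet<T>,
}

impl<T: Eq + Hash + Clone> GSet<T> {
    fn new() -> Self {
        GSet { items: HashSet::new() }
    }
    fn add(&mut self, item: T) {
        self.items.insert(item);
    }
    // The lattice join: take the union of both states.
    fn merge(&mut self, other: &GSet<T>) {
        self.items.extend(other.items.iter().cloned());
    }
}

fn main() {
    let mut a = GSet::new();
    let mut b = GSet::new();
    a.add("x");
    b.add("y");
    let a_before = a.clone();
    a.merge(&b);        // a merges b's state...
    b.merge(&a_before); // ...b merges a's older state, in the other order...
    assert_eq!(a.items, b.items); // ...and they still converge to {x, y}
}
```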
What CALM provides, though, is a "different way to think about the issue". If you can devise a monotonic specification of your problem, then you know that an implementation that provides consistent output without any coordination is possible. Conversely, if your design is not monotonic, an implementation will either require coordination or produce output that isn't consistent.
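A toy illustration of the distinction (my example, not the paper's): a monotonic query's output only ever grows as input arrives, so partial answers never have to be retracted, while a non-monotonic query can be invalidated by a straggler:

```rust
// Monotonic: a filter. More input can only produce more output, so an
// answer emitted early is never retracted later.
fn over_limit(orders: &[(u32, u32)], limit: u32) -> Vec<u32> {
    orders
        .iter()
        .filter(|(_, amount)| *amount > limit)
        .map(|(id, _)| *id)
        .collect()
}

// Non-monotonic: a universally quantified check ("no order exceeds the
// limit"). A late-arriving order can flip the answer from true to false,
// so acting on it safely requires coordination, or waiting until you have
// provably seen all of the input.
fn none_over_limit(orders: &[(u32, u32)], limit: u32) -> bool {
    orders.iter().all(|(_, amount)| *amount <= limit)
}

fn main() {
    let mut orders = vec![(1, 50), (2, 80)]; // (order id, amount)
    assert!(over_limit(&orders, 100).is_empty());
    assert!(none_over_limit(&orders, 100));

    orders.push((3, 500)); // a straggler arrives
    assert_eq!(over_limit(&orders, 100), vec![3]); // old outputs still stand
    assert!(!none_over_limit(&orders, 100)); // old answer had to be retracted
}
```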
Finally, the paper dabbles in the realm of breaking your problem into monotonic and non-monotonic "pieces". The monotonic pieces you can implement without coordination. Non-monotonic pieces either require coordination OR "repair" (e.g. send an e-mail to the customer apologizing that their item is actually unavailable and their order has been cancelled). Once a non-monotonic "piece" has been handled in this manner, the new piece that includes the repair/coordination can be considered monotonic. Once all of your system's pieces have gone through this process, you are guaranteed to have "consistent" output.
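Here's a sketch of what that repair shape might look like (entirely hypothetical names, just to make the pattern concrete): accept orders without coordinating, then compensate after the fact when the merged view shows the non-monotonic invariant was violated:

```rust
// Hypothetical sketch of the "accept now, repair later" pattern. The write
// path is coordination-free; the non-monotonic invariant (stock never goes
// negative) is enforced after the fact by a compensation step.
struct Order {
    id: u64,
    qty: i64,
}

// Optimistic, coordination-free accept against a possibly-stale local estimate.
fn accept_order(order: &Order, local_stock_estimate: i64) -> bool {
    order.qty <= local_stock_estimate
}

// Later, with the merged global view, find the orders we can't honor and
// "repair" them: cancel, refund, send the apology e-mail from above.
fn reconcile(orders: &[Order], true_stock: i64) -> Vec<u64> {
    let mut remaining = true_stock;
    let mut cancelled = Vec::new();
    for o in orders {
        if o.qty <= remaining {
            remaining -= o.qty;
        } else {
            cancelled.push(o.id); // the repair path
        }
    }
    cancelled
}

fn main() {
    let orders = [Order { id: 1, qty: 3 }, Order { id: 2, qty: 4 }];
    // Two replicas each accepted an order against a stale estimate of 5...
    assert!(accept_order(&orders[0], 5) && accept_order(&orders[1], 5));
    // ...but the merged view has only 5 in stock, so order 2 is repaired;
    // nothing on the write path ever had to block on coordination.
    assert_eq!(reconcile(&orders, 5), vec![2]);
}
```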
An interesting analogy would be the "unsafe" blocks in Rust. Strictly speaking, Rust as a whole is not safe because unsafe exists, but the fact that the unsafe pieces are clearly identified makes it easier to reason about program safety as a whole. Similarly, the non-monotonic pieces of your system are where you risk losing consistency, whereas the monotonic pieces are just not a problem (i.e. you can easily get good performance for those parts). By extension, if you can model your system so that all non-monotonic pieces are out of the critical path, the critical path is guaranteed to perform well!
All in all, this is just about giving better tools/thinking frameworks for system designers to minimize the use of coordination in a distributed system, which will invariably result in better performance without sacrificing output "consistency" (i.e.: without violating business rules).
That's an excellent explanation, except for a couple of points which I think can be misinterpreted. A distributed system with consensus will in practice provide higher availability than a single-node system, because it provides fault tolerance. In fact, fault tolerance is the primary point of using (non-Byzantine) consensus. But you are absolutely right that a distributed system using consensus has worse availability than a distributed system with no coordination.
Also, when using a system "with consensus" there is often no need to actually invoke consensus on the read side of the system, in which case you don't have to pay the throughput and latency penalties there. I know you've sort of said this already, but it might be helpful to mention it explicitly.
> A distributed system with consensus will in practice provide higher availability than a single-node system, because it provides fault-tolerance.
I'm not sure this is true. It protects against one class of fault (node failure) but opens you up to another (network failure). As a distsys engineer I am increasingly convinced that fault tolerance is not a good selling point for distribution; the fallacies of distributed computing are real and difficult to accommodate.
That's not really true. With a single, non-replicated server, you're also very much exposed to network failures: if the server's connection goes down, you're screwed. Compare this to a Google replication setup with 2 East Coast, 2 West Coast, and 1 central USA server. The client only needs to be able to reach two of the three data centers (and those two have to be able to communicate with each other). That sounds much more resilient to me - and I guess Google agrees, since they deploy the setup.
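A quick sanity check of that arithmetic: with replicas split 2/2/1 across three regions, a majority quorum is 3 of 5, and any two regions together always hold at least 3 replicas:

```rust
fn main() {
    // Replica counts for the 2 East / 2 West / 1 Central layout.
    let regions = [("east", 2u32), ("west", 2), ("central", 1)];
    let total: u32 = regions.iter().map(|(_, n)| n).sum();
    let majority = total / 2 + 1; // 3 out of 5

    // Every pair of reachable regions already holds a majority, so the
    // system tolerates losing (or being partitioned from) any one region.
    for i in 0..regions.len() {
        for j in (i + 1)..regions.len() {
            let reachable = regions[i].1 + regions[j].1;
            assert!(
                reachable >= majority,
                "{} + {} would not be a quorum",
                regions[i].0,
                regions[j].0
            );
        }
    }
    println!("every 2-region pair reaches a majority of {majority}/{total}");
}
```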
> when using a system "with consensus" there is often no need to actually invoke consensus on the read side of the system, in which case you don't have to pay the throughput and latency penalties.
It doesn't matter if the system is designed for faster reads. There is still coordination, and you still pay the coordination overhead, including the throughput and latency penalties. Without coordination, for example, the part of the database relevant to a client can be stored directly on that client, with no remote reads at all.
You're comparing apples to oranges. If read coordination can be avoided (and it often can, even in systems with consensus), whether you stick a cache on the client or not is completely orthogonal to whether the system uses consensus for write operations.
Think about it: if you read from a replica that is partitioned from the rest of the system and there is no coordination, how would the replica or the client know that the value the replica returns is too old and therefore breaks the strong consistency guarantee?
No, you don't get linearizability of all operations - you might not even read your own earlier writes - but the whole point is that you sometimes don't need these guarantees for reads. You get a consistent snapshot read, and that's often good enough. You can get an idea of how recent the snapshot is based on timestamps, but "recent" is hard to define in a distributed system.
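One way to picture that (a hand-wavy sketch, not any particular system's API): the replica hands back a versioned snapshot, and the client enforces whatever staleness bound it can live with, with no coordination on the read path:

```rust
use std::time::{Duration, SystemTime};

// A coordination-free snapshot read: the replica returns the value together
// with the commit timestamp of the snapshot it came from. The client can
// bound staleness, but it cannot learn about writes committed elsewhere
// after that timestamp; that is exactly the guarantee being given up.
struct Snapshot<T> {
    value: T,
    committed_at: SystemTime,
}

fn read_if_fresh<T>(snap: Snapshot<T>, max_staleness: Duration) -> Option<T> {
    match snap.committed_at.elapsed() {
        // Accept the local snapshot if it is recent enough for this caller.
        Ok(age) if age <= max_staleness => Some(snap.value),
        // Too stale, or the clocks disagree: the caller may fall back to a
        // coordinated (quorum) read instead.
        _ => None,
    }
}

fn main() {
    let snap = Snapshot { value: 42, committed_at: SystemTime::now() };
    assert_eq!(read_if_fresh(snap, Duration::from_secs(5)), Some(42));
}
```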
> but the whole point is that you sometimes don't need these guarantees for reads.
If that was your point, then sure. If you drop consistency, you can drop coordination too. But typically people expect reads to be consistent in consensus-based systems, which requires coordination.
Well-written explanation. I think categorizing a problem into monotonic and non-monotonic pieces will be especially helpful for a developer when the underlying database provides multiple consistency levels to choose from.