Systems need to be robust against uncontrollable failures, like a cosmic ray corrupting an image in transit over the internet, because we can never prevent those.
But systems should quickly and reliably surface bugs, which are controllable failures.
A layer of suffering on top of that simple story is that it's not always clear what is and what is not a controllable failure. Is a logic error in a dependency of some infrastructure tooling somewhere in your stack controllable or not? Somebody somewhere could have avoided making that mistake, but it's not clear that you could.
An additional layer of suffering is that we have a habit of allowing this complexity to creep or flood into our work and telling ourselves that it's inevitable. The author writes:
> Once your system is spread across multiple nodes, we face the possibility of one node failing but not another, or the network itself dropping, reordering, and delaying messages between nodes. The vast majority of complexity in distributed systems arises from this simple possibility.
But somehow, the conclusion isn't "so we shouldn't spread the system across multiple nodes". Yo Martin, can we get the First Law of Distributed Object Design a bit louder for the people at the back?
> systems should quickly and reliably surface bugs, which are controllable failures
I was thinking: if the error exists between keyboard and chair, I want the strictest possible failure mode, one that both catches the mistake and forces me to do things right the first time.
But once the thing is up and running, I want it to be as resilient as possible. Resource corrupted? Try again. Still can't load it? At this point, in "release mode" we want a graceful fallback -- also to prevent eventual bit rot. But during development it should be a red flag of the highest order.
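One way to sketch that split (my own illustration, not anything the author prescribes): gate the failure mode on a build flag, so development fails loudly while release retries and degrades gracefully. In Python, `__debug__` can play the role of the debug/release switch (it is `False` only when the interpreter runs with `-O`); `load_resource`, `loader`, and `fallback` below are hypothetical stand-ins.

```python
def load_resource(path, loader, fallback, retries=2):
    """Load a resource: retry on failure, then crash (dev) or degrade (release)."""
    last_error = None
    for _ in range(1 + retries):
        try:
            return loader(path)  # any callable that may raise OSError
        except OSError as e:
            last_error = e       # resource corrupted? try again
    if __debug__:
        # Development: a resource that never loads is a red flag of the
        # highest order -- surface it immediately and loudly.
        raise RuntimeError(f"could not load {path!r}") from last_error
    # "Release mode" (python -O): graceful fallback instead of a crash.
    return fallback
```

Run under `python -O` and a persistent failure quietly yields `fallback`; run normally and it raises, which keeps the fallback path from silently masking bugs (and eventual bit rot) during development.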
The First Law of Distributed Object Design, for reference: https://www.drdobbs.com/errant-architectures/184414966
And let us never forget to ask ourselves this question:
https://www.whoownsmyavailability.com/