Very sorry to hear about this. Having worked in ops for 10+ years, I know that dropkick-in-the-stomach feeling when everything is going wrong and nothing is working as expected. I hope the author keeps moving forward and takes this as a lesson.
My own two cents, which others have echoed: this is an extremely over-complex setup for the situation. It's been said 100x on HN, but there's a reason for that: the more complex a system is, the more points of failure it has. For all the tools you were running, you should have had a team of ops people. In reality, a database, some servers, and a load balancer would get you 99% of the way there.
Modern engineering is a dumpster fire of complexity, mostly hawked by shills working to sell contracts to enterprises.
You have learned this lesson the hard way, but fewer moving parts means fewer things that can go wrong. Basics like monitoring, backups, and testing only go so far if your system is a Rube Goldberg machine. Hope as an industry someday we can get back to simplicity.
You're right, of course, that complexity makes the system harder to reason about, which makes it easy to "fat finger" a destructive operation with unintended consequences.
However I'd say the truly disastrous issue here wasn't really that. Data loss is something we've always had to plan around, whether due to mechanical failure, bug, or operator error. The actual death knell here was a classic ops blunder:
> From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data. While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.
So they weren't actually backing up the data they thought they were, which is a class of error familiar to everybody who has been working in ops for a while. Unless you test restoring from an actual backup, you have no idea whether your backups are useful.
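For what it's worth, this particular foot-gun is easy to hit because pod volume backups are opt-in. A minimal sketch of what the fix can look like on the CLI, assuming a pre-1.10 Velero with the Restic integration installed (newer releases renamed the flag to --default-volumes-to-fs-backup); the schedule name, cron expression, and TTL are illustrative:

```sh
# Recurring backup that explicitly opts all pod volumes into Restic backups,
# instead of relying on per-pod annotations or the (off-by-default) behavior.
velero schedule create nightly \
  --schedule="0 2 * * *" \
  --default-volumes-to-restic \
  --ttl 720h
```

If you manage these as manifests instead, the equivalent knob is the defaultVolumesToRestic field on the Schedule's backup template.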
Yeah, there are always sharp edges in backup processes. For example, I'd been taking faithful backups of Hyper-V virtual machines, but apparently there's a private key stored elsewhere (and thus not backed up) without which you can't restore the software TPMs (~required since Windows 11). There are so many of these that the only way to really know is to test. Thankfully that is becoming so much easier with all the declarative infrastructure, Docker/containerization, and virtualization tools these days. Spin up a second copy of your infrastructure once in a while and try a restore.
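To make the "actually try a restore" part concrete, a minimal drill might look something like the following, assuming a plain compressed pg_dump-style backup; the image tag, dump path, port, and sanity-check query are all illustrative:

```sh
# Spin up a throwaway Postgres container, load the latest dump into it,
# run a quick sanity check, then tear it down. None of this touches prod.
docker run -d --name restore-test -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:16
sleep 10  # crude wait for the server to start accepting connections
gunzip -c backups/latest.sql.gz | docker exec -i restore-test psql -U postgres
docker exec restore-test psql -U postgres -c "SELECT count(*) FROM orders;"  # 'orders' is a placeholder table
docker rm -f restore-test
```

The point isn't the specific commands; it's that the drill exercises the same artifacts and steps you'd depend on in a real incident.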
> Hope as an industry someday we can get back to simplicity
The services upon services with no idea what the underlying layers are doing, and enterprises collaborating mostly to get stuff 'out the door', mean we may not get back to simplicity easily. At this point I'm only finding work at companies that are basically middleware, and rarely doing anything unique even there. Getting contracts and servicing a business need is all I see when I look for jobs these days. Initially I was doing something quite unique in networking, but now that the client's needs have changed, it's basically just another product servicing clients whose custom requirements big companies like Arista can't meet for them.
My philosophy here is to never automate/GitOps something that you cannot revert manually if it goes wrong. I learned this almost the same way as OP, but in test and dev environments. As my K8s clusters crashed irrecoverably in many different ways (etcd, PVCs, Velero, HDD failures, shared storage, databases) to the point where they needed to be rebuilt, I imagined what that would be like in prod, and the thought quickly made me decide to:
- Not host databases inside K8s
- Not host shared storage or storage clusters in K8s
- Drop etcd; use an external Postgres as the control plane database
- Use Velero, but make sure you do not need to rely on it (back up manifests, not databases or volumes)
- Also keep your manifests in git (a rough sketch of what that export can look like follows this list)
- Generally, treat your entire cluster as ephemeral and optimize for full rebuilds (drive your time to rebuild down to a few hours or minutes).
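A minimal sketch of that manifest export, assuming a nightly job with read access to the cluster; the repo path, resource kinds, and commit message are illustrative, and in practice you'd scope it to whatever your GitOps tooling doesn't already track:

```sh
# Dump the manifests needed for a rebuild into a git repo so the cluster
# itself can be treated as ephemeral.
cd manifests-repo || exit 1
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  mkdir -p "$ns"
  for kind in deployments statefulsets services configmaps ingresses; do
    kubectl -n "$ns" get "$kind" -o yaml > "$ns/$kind.yaml"
  done
done
git add -A && git commit -m "nightly manifest export" || true
```

With manifests in git and data backed up outside the cluster, a full rebuild collapses to roughly: create cluster, apply manifests, restore data.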
> Modern engineering is a dumpster fire of complexity, mostly hawked by shills working to sell contracts to enterprises.
I think that's like 10% of the reason. 80% is following hype (-), resume-driven engineering (-), and/or standardizing on non-proprietary tools to improve job mobility (+). Sure, maybe maintaining the esoteric DSL created by my employer's custom load balancer and DB failover codebases, named after obscure comic book characters[1], may arguably be the "right" thing to do; or I could advocate for k8s, which may be a little overkill for our needs but is easier to onboard new joiners with, and one can actually search for solutions on StackOverflow. Kubernetes is also a jack of all trades and is designed to handle configurations much more complex than mine, but at least it's "portable".
1. "Oh, _La'varo_ is named after a villain that appeared in a 1959 issue of Superman (Volume 7), it once was an LB that avoided hot instances when routing requests, but now its KV store"