Very sorry to hear about this. Working in ops for 10+ years, I know that kicked-in-the-stomach feeling when everything is going wrong and nothing is working as expected. I hope the author keeps moving forward and takes this as a lesson.
My own two cents, which others have echoed: this is an extremely over-complex setup for the situation. It's been said 100x on HN, but there's a reason for that: the more complex a system is, the more points of failure you have. For all the tools you were running, you should have had a team of ops people. In reality, a database, some servers, and a load balancer would get you 99% of the way there.
Modern engineering is a dumpster fire of complexity, mostly hawked by shills working to sell contracts to enterprises.
You have learned this lesson the hard way, but fewer moving parts means fewer things that can go wrong. Basics like monitoring, backups, and testing only go so far if your system is a Rube Goldberg machine. I hope that as an industry we can someday get back to simplicity.
You're right of course that complexity makes it harder to reason about, which makes it easy to "fat finger" a destructive operation with unintended consequences.
However I'd say the truly disastrous issue here wasn't really that. Data loss is something we've always had to plan around, whether due to mechanical failure, bug, or operator error. The actual death knell here was a classic ops blunder:
> From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data. While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.
So they weren't actually backing up the data they thought they were, which is a class of error that is familiar to everybody who has been working in ops for a while. Unless you test the restore of an actual backup process, you have no idea whether you have useful backups.
Yeah, there are always sharp edges on backup processes. For example, I'd been taking faithful backups of Hyper-V virtual machines, but apparently there's a private key stored elsewhere (and thus wasn't backed up) without which you can't restore the software TPMs (~required since Windows 11). There are so many of these that the only way to really know is to test. Thankfully that is becoming so much easier with all the declarative infrastructure, Docker/containerization, and virtualization tools these days. Spin up a second copy of your infrastructure once in a while and try a restore.
> Hope as an industry someday we can get back to simplicity
Services stacked upon services, with no idea what the underlying layers are doing, and enterprises collaborating mostly to get stuff 'out the door', mean we may not get back to simplicity easily. At this point I can only find work at companies that are basically middleware, and even there they're rarely doing anything unique. Winning contracts and servicing a business need is all I see when I try to find jobs these days. Initially I was doing something quite unique in networking, but now that the client's needs have changed, it's basically just another product servicing clients with custom requirements that big companies like Arista can't meet for them.
My philosophy here is never automate/gitops something that you cannot revert manually if something goes wrong. I learned this almost the same way as OP, but in test and dev environments. As my K8s clusters crashed irrecoverably in many different ways (etcd, pvcs, velero, hdd failures, shared storage, databases) to the point where they'd need to be rebuilt, I imagined what it would be like in prod, and the thought made me quickly decide to:
- Not host databases inside k8s
- Not host shared storage or storage clusters in k8s
- Drop etcd, use external postgres as control plane database
- Use velero, but make sure you do not need to rely on Velero (back up manifests, not databases or volumes; see the sketch after this list)
- Also keep your manifests in git
- Generally treat your entire cluster as ephemeral and optimize for full rebuilds (drive your time to rebuild down to a few hours or minutes).
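To make the Velero point concrete, here's a minimal sketch of a recurring backup that only captures cluster objects, assuming a recent Velero version; the schedule name and namespaces are made up, and volume data is deliberately excluded because state lives outside the cluster:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: manifests-only        # hypothetical schedule name
  namespace: velero
spec:
  schedule: "0 */6 * * *"     # every 6 hours
  template:
    includedNamespaces:
      - "*"
    snapshotVolumes: false           # skip cloud volume snapshots
    defaultVolumesToFsBackup: false  # skip file-system (restic/kopia) volume backups
    ttl: 720h0m0s                    # keep each backup for 30 days
```

With databases and shared storage outside the cluster, losing the cluster only costs you a rebuild from git, not your data.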
> Modern engineering is a dumpster fire of complexity, mostly hawked by shills working to sell contracts to enterprises.
I think that's like 10% of the reason. 80% is following hype (-), resume-driven engineering (-), and/or standardizing on non-proprietary tools to improve job mobility (+). Sure, maybe maintaining the esoteric DSL created by my employer's custom load balancer and DB failover codebases named after obscure comic book characters[1] may arguably be the "right" thing to do - or I could advocate for k8s, which may be a little overkill for our needs but is easier to onboard new joiners onto, and one can actually search for solutions on StackOverflow. Kubernetes is also a jack of all trades and is designed to handle configurations much more complex than mine, but at least it's "portable".
1. "Oh, _La'varo_ is named after a villain that appeared in a 1959 issue of Superman (Volume 7), it once was an LB that avoided hot instances when routing requests, but now its KV store"
>I won’t personally be bringing back outdoors.lgbt or firefish.lgbt. Being an admin has been one of the most fulfilling things I have done in a long time and you all have made it such an amazing experience. However, I need to take a step back. I would love to hand the domains over to someone with a similar passion for creating a safe and welcoming community.
I hope the author isn't only quitting because of a feeling of guilt or shame. If that's true, it's certainly ok to give it another try. Tech is toxic enough depending on where you look, and that's not even counting the self-hurt that some of us do to ourselves.
Off-topic, but this website has some pretty crazy dark-patterns!
I went to the user's profile and right-clicked their photo. The contextmenu action is intercepted and replaced with a custom menu mimicking the browser's context menu. This fake context menu has the option "Open in Window". I click it. The website doubles down on the charade and opens a "popup" inside itself - maximize/minimize icons in the action bar and everything.
It's the sort of thing you're taught phishing websites do to lure you into inputting your credentials into the wrong website.
That's not a "dark pattern", it's just the result of of making the browser a OS and giving websites free rein of anything inside.
There is an aptly named browser extension to help combat this called StopTheMadness [1]; I recommend it.
Let's try to save the term Dark Patterns for the things that deserve it, where the goal is to steer the user into making an unfavourable choice for the benefit of someone else.
Firefox has a built-in way to get around context menu blocking: hold down the "shift" key while right-clicking. I wish there were similar built-in overrides for things like blocking text selection and copying. You commonly find that on corporate blogs or lyric sites that want to "protect" their text content; of course you can just get around it by opening the page HTML and copying from there.
Voting with your feet is the strongest veto we have as web users.
I think this is happening because we as a collective are using advertisement based websites with traffic funneled to them by advertisement network operators.
Don’t let Google’s algorithms decide which website you visit. Find alternative search engines and indexes.
I still feel this is a dark pattern. It's attempting to keep engagement on the website and to prevent direct linking of the underlying images. Direct linking to images costs the server resources without providing sufficient marketing value in return. You see a lot of different techniques employed to prevent image interaction; this is one such option.
The fact that they then implemented "Open in Window" rather than allowing the browser to handle it means they know they're affecting functionality the user wants and are choosing to provide an option that isn't the browser default. Why? It's certainly to their benefit. If it were to the user's benefit then it would be standard across all websites rather than provided by the browser by default.
Might be related: I recently joined a company using Flutter web and noticed similar stuff. Yes, the software engineers are wondering: "Why do we reimplement browser stuff?" But the business asks for it and we don't hear complaints from the end users.
Even trying to long-press and copy some of the domain names that the author mentioned was impossible. They've (who is "they" here?) implemented a context menu that interpreted the wrong link and offered to open it in a new tab.
I’m trying very hard not to be flippant here, but I can’t shake the feeling that this Kubernetes norm has to end. I’m not saying Kubernetes needs to disappear, but people need to stop treating it as the new normal, as if VMs and config management were somehow an outdated and incapable alternative.
To me, this is an example of the complexity of Kubernetes coming back to blow your foot off. Remember, Kubernetes exists to make scaling and redundancy easier, but it’s only easier if you fully understand the implications of every configuration that you make.
Complexity causes incidents, so my mantra will always be that if you propose to introduce complexity, have a justification ready for why its inherent risks are outweighed by the benefits.
If you’re deploying your side projects in an infrastructure this complex, I would strongly suggest taking a step back and questioning if the same benefits really couldn’t be achieved in a simpler way.
It's hard to blame Kubernetes as the issue here. The way EKS is implemented, it's basically EC2 instances running a bunch of containers with a convenient control plane and network interface abstracted over them.
If they had used EC2 instances/VMs, they could still have run into the same issue by blowing away a host's local volume they were never actually backing up. I have administered social media websites (not this software, mainly Discourse) and have seen this exact thing happen. Luckily, the backups worked.
> Remember, Kubernetes exists to make scaling and redundancy easier, but it’s only easier if you fully understand the implications of every configuration that you make.
It exists to make it easier. Did anyone ask if it’s something you needed to be doing in the first place?
For many, I think using kubernetes is like buying something you don’t need at the store because it’s on sale and thinking it’s saving you money.
I think we all have a desire to build out infinitely scalable and 100% reliable solutions because _obviously_ the site can’t go offline. But did you ever actually ask what would happen if it did? I worked with one company that, about 8 years into their existence, botched a data migration and went down for an entire month. Customers called CS every day begging them to fix it, they refunded everyone’s fees for the month, and… not one customer cancelled. That company is still around, still going strong, and still growing.
I’m not suggesting going to the other extreme but… if your site goes down for a few hours overnight until you wake up and fix it, for most projects it’s not going to be the end of things. Hell, GitLab is still around!
(On that note… Take _durability_ seriously. Have backups. rclone everything you have into b2 or something. Even if it takes you a few days to restore the entire service, at least it will come back. If you lose everyone’s data _then_ you’re probably done.)
Reduce your abstractions and reduce your dependencies for a more stable environment. It will give you greater control, and at the same time you won’t need to relearn everything every 5 years.
Why does it have to be “containerised” at all? Why can’t the software be directly installed on a *nix virtual machine?
This fatal scenario could have been easily avoided with a basic rsync cronjob to rsync.net or another $10 virtual machine.
> Why does it have to be “containerised” at all? Why can’t the software be directly installed on a *nix virtual machine?
I am very interested in this space. I set up devbox.io for local development, and it even offers the concept of services - like Docker Compose for local development, just directly on the host.
Nasty and hairy stuff like Python is still best shoved into a container to be done with it, but even there we are seeing a move to venv by default (Debian 12+), which should solve a lot of issues, making *nix more viable again.
So I guess the main distinction so far is that containers are more performant, with less isolation, than VMs, and containers have a gigantic ecosystem of build tools. It's easy to build an OCI image. Everyone and their mother has put together a Dockerfile by now. Not so with VMs. That might change if Nix catches on and we get proper tools to fully describe VMs a priori. I reckon the technology is already good enough to make VM startup performant enough (Firecracker, ...)?
I tried cloud-hypervisor recently and it booted the VM in 70ms, but I have no idea how to deploy this. We already have a deployment workflow for OCI images; doing this for VMs would mean a big refactor.
I have setups where the container orchestration was done completely by Terraform or Puppet. Docker Compose also goes a long way if you are not running a FAANG webscale service. If you manage your VMs' basics with configuration management and infrastructure as code anyway, it's easy to bolt containers onto that and treat them like any other installable piece of code. Kubernetes does bring a nice abstraction if you have DevOps teams and separate Ops for your bare metal or cloud.
> If the application is containerised, like most are these days, how do you propose running these on said VM?
As far as I can tell, this kind of load can be perfectly served by a €40 bare metal Hetzner instance (together with a Minecraft server and what not) so even a simple Docker Compose would do.
It's a very complex discussion, almost complex enough to be unsuitable for text, because for every reply ten more "what-ifs" seem to open up. It's hard to be general enough that what you're saying is useful, while being specific enough that you're actually answering the question.
If we're talking about a multi-container application with moving parts, then yes, Kubernetes is suitable exactly because it was created to resolve the issues arising from that architecture. But I don't think it's fair to start a general discussion about architecture from a point of "Step 0: The application is made to run in Kubernetes".
By the way, is the statement that "most applications today are containerized" really true? I would claim that you can only count applications into this category that are published as "containerized by default", and to me that certainly doesn't seem true. I've encountered a few, but they're exceedingly rare. If you also count "applications that are available as a container", then sure, but then the statement should be that "most applications today *can* be containerized", or they are "available" in containerized form.
Whether something is "more complicated than a single container" depends not only on the application, but on how you choose to deploy it. If you take a hypothetical web application running nginx and PHP with a Postgres backend, you can deploy both of those as containers, but you certainly don't have to. You could skip containers entirely, or you could deploy the nginx component as unmanaged containers as if they were an RPM/DEB package, while not involving containers for your database at all. Yes, it requires more configuration for all the things that are no longer "magic", but with static infrastructure the work you put in comes back to you as "simplicity": the moving parts are easier to understand, which means problems are easier to troubleshoot. Yes, you lose "auto-healing", but you also lose "auto-nuking", and so on.
I want to be clear again that I'm not saying Kubernetes is objectively bad, but I am saying that I think a lot of people using it have not considered whether it's a suitable solution for their current problems (never mind imaginary, future problems).
To me, it's like comparing a lawn mower to the space shuttle, and then trying to motivate building the latter by saying that a lawn mower will never be able to get into space: if you're running a web site that is looking at a few hundred thousand users, you're still (probably) not "going to space".
Fundamentally, I think the over-use of Kubernetes is nothing more than an example of premature optimization. The idea goes something like: "If we become the most successful /thing/ in the world, the infrastructure is already made in such a way that we can scale" -- but there are no decisions when it comes to infrastructure where you get the pros but can escape the cons, and those cons for Kubernetes specifically seem to rarely be discussed out of fear of sounding like a Luddite.
You have omitted the crux of the issue: the whole point of GitOps is that you can always roll your deployments back, to any particular commit if needed. The very fact that GitOps was used against itself means it was set up incorrectly.
Kudos to the admin for writing this up and making it public.
Everyone’s lives will involve an irretrievable loss of something, on some scale, at some point. A friend, partner, parent, child — those are the big ones. Your home, job, your pet, a precious object (or in this case, data), or an enjoyment of something: the accumulation can be a lot to handle as time goes on, but it’s also human nature, and it’s important to learn to tackle grief without ignoring it.
Grief is so core to being human that, thankfully, it is an aspect of life where you can find a lot of support. Religious communities, for example, can provide a lot of help that at the same time will be orthogonal to their core mission of worship. You can benefit from the former without having to engage with the latter. Don’t worry about being transactional in looking for help — people will want to help you.
Here we go again, a lot of comments saying k8s and gitops are too complex.
IMHO, your ops team can accidentally delete your data one way or another regardless. Today it was moving files to the wrong directory in git; yesterday it could have been a human executing `rm -rf` against the wrong path. The highlighted item here should be the bad backup process. You can't say your data is backed up unless you actually verify successful restoration from the backup data, on a regular basis. This was true 20 years ago for your SQL database instance, and it is still true today for your k8s PVCs.
FWIW, I have been running GitOps long enough that any PR that moves files around raises the highest alerts in my head. The fundamental issue I have seen in many places is that engineers store infra code in git and call it "GitOps". When you adopt the GitOps concept, the most important thing is to train your engineers to switch to a different mental model, where your git repo is the _desired_ state of your infra -- it's not the actual state of the infra, and it's not a store of your imperative commands to manipulate the infra. When the desired state of your namespace is for it not to exist, your GitOps engine will try to make that happen!
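On the restore-verification point: one hedged way to make the monthly drill exercise the real recurring backups is a Velero Restore of the latest scheduled backup into a scratch namespace (all names here are hypothetical), followed by actually inspecting the data that comes up:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-drill-2024-06                # hypothetical drill name
  namespace: velero
spec:
  backupName: cluster-backup-20240601020000  # a backup produced by the schedule, not an ad-hoc one
  includedNamespaces:
    - firefish                               # hypothetical app namespace
  namespaceMapping:
    firefish: firefish-restore-drill         # restore into a scratch namespace, not over prod
  restorePVs: true                           # volume data must come back too, or the drill fails
```

The drill only counts if you then look inside the scratch namespace and confirm the PVC data is actually there.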
It's a very real problem in K8s operators (ArgoCD being a very complicated one, but an operator nonetheless)...
...does absence signify deletion?
Some projects decide yes, if the YAML specifying your resource is absent on the next reconciliation, then delete the corresponding K8s resources.
Then some projects go with "no, deletion must be explicit, set this flag in your YAML".
But the last approach doesn't work as nicely with GitOps, as you have to do a two-stage workflow to delete - first set the delete flag, then once the resources are deleted, remove the YAML.
But I'm okay with deletion being harder to do if it makes it harder to well, accidentally delete all your stuff.
Because I've never met a K8s workflow where resources were commonly deleted, usually it's creation and update 90% of the time.
But I can understand it'd be annoying if you ended up with dangling resources if you missed an explicit delete flag.
I think this pattern is fine most of the time. I hate it for PVCs. At the least with cloud-like volumes it should leave a final snapshot that sticks around for a time.
I have all the sympathy for the author. I’m sure they don’t feel good right now, but I hope they continue contributing. They’ve learned a hard lesson, but putting it back into their practice is the only way to make it count.
You can set a PV's reclaim policy to "Retain" (the default for dynamically provisioned volumes is "Delete"). When the namespaced claim is deleted, the volume will still be around until it is manually deleted. The docs even suggest doing this for "precious data".
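As a rough sketch of where that setting lives (the volume name, size and CSI driver below are placeholders, not any provider's real values):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: firefish-media              # hypothetical volume name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  # Keep the underlying block volume when the bound PVC is deleted,
  # instead of the default "Delete" behaviour for dynamically provisioned PVs.
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.example.com         # placeholder CSI driver
    volumeHandle: vol-0123abcd      # placeholder volume ID
```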
Guardrails are essential. A properly-designed declarative schema change tool provides guardrails to detect and prevent unwanted deletion / drops / lossy conversions. I develop widely-used software in this space (Skeema) and my tool has always offered drop-prevention functionality ever since its first beta release in 2016.
I also argue that guardrails are equally important for imperative migration tools as well, but more often they're lacking or half-baked, which gives a false sense of safety. For example, down/reverse migrations are a very common landmine for human error. Order-of-operations problems also happen frequently when imperative tools are used by large dev teams, resulting in subtle schema drift between environments when there's disagreement between lexicographic migration file order, git history order, and the actual migration application order on each DB.
By default, ArgoCD doesn't remove resources if the manifests are deleted, but you can switch this by turning on auto prune[1]. Likewise, you can specify whether a persistent volume should be retained or deleted if the k8s PersistentVolume resource is deleted[2]. Both of those settings would have prevented data loss in this case, but of course that's easy to say in hindsight.
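For illustration, a hedged sketch of an ArgoCD Application with pruning off (the repo URL and names are placeholders); precious individual resources can additionally be annotated with `argocd.argoproj.io/sync-options: Prune=false` so even a pruning sync skips them:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: firefish                                # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra.git  # placeholder repo
    path: apps/firefish
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: firefish
  syncPolicy:
    automated:
      selfHeal: true
      prune: false    # live resources are NOT deleted when their manifests disappear
```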
Ugh I did something like this once when I wrote a script to rsync one volume to a backup volume. Everything was great until someone accidentally rm -r ‘d the source volume and the script happily deleted everything on the backup volume to keep it in sync.
Agreed. Reading this kind of thing makes my blood run cold. Your 'accidental sudo rm -rf /' not only destroyed one machine, but all of them. The entire cluster, standbys, backups, storage volumes, the whole enchilada. That's a Business Ending Event, in one merge request.
I'm hard-pressed to think of a scenario where a single sysadmin in an 'old-fashioned' enterprise datacenter could do as much damage with, maybe not exactly a one-liner but a lightweight change instruction.
The risk-reward calculation seems completely bananas to me.
In a nutshell, we "abandon" the resource and clean up abandoned resources after N days. To delete something, you need to explicitly check its name into a list of resources to delete.
So deletions are explicit and a mistake is just un-abandoning.
Terraform, unless invoked as tf-yolo (aka terraform -auto-approve), will present you with a cheerful "Hello, I would like to delete everything. Continue: y/n". This makes it fairly easy to panic before deleting everything. Otoh, if you run <terraform destroy> and don't read the plan carefully... eh.
We're planning to streamline our run approvals a bit, because approving every DNS record addition results in some approval fatigue, but resource deletion will most likely always require human approval.
I have been building and running web applications which are used by millions of users for over 20 years now. But reading this post, I feel like I am looking into a completely different world.
None of the following is something which ever crossed my path:
I can see the point you're trying to make, but you're choosing an interesting way of conveying it, and I disagree.
Restic is a backup tool. Velero is a backup tool for Kubernetes. Vultr is a low cost but still decent "cloud" provider. GitOps is a philosophy which makes sense even on small projects.
None of those are "wrong" or "overcomplicated" options.
The elephant in the room is Kubernetes, which is indeed quite complex, and often gets used as the go-to even where it doesn't make sense [1] either because it's popular or because that's what people know, or because of the ready-made tools from others (e.g. if all you want to deploy exists in the form of Helm charts, it can save you lots of time) but it has its place and brings a lot to the table. You just have to be aware of the risks the complexity brings.
Disclaimer: I work at HashiCorp, I'm a massive fan of Nomad and think it's a better fit than Kubernetes in many cases, but dismissing Kubernetes outright is wrong.
> GitOps is a philosophy which makes sense even on small projects.
Er, does it? The root cause of this catastrophic data loss incident is that in "GitOps" none of the traditional safety checks can be implemented. In normal sysadmin workflows, attempting to delete all your data will yield an "Are you sure?!" type message and you'll probably have to take explicit steps to confirm that this is really what you intended. There will also be dry run modes and other helpers.
Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that doesn't do what you meant, it will just blindly do it.
It seems like this is a pretty major flaw in the whole "philosophy". The whole point of hacking a VCS into a server admin UI is because people think git will let you roll back infrastructure changes easily. But it cannot, because infrastructure isn't a stateless function of your git repository.
> In normal sysadmin workflows, attempting to delete all your data will yield an "Are you sure?!" type message and you'll probably have to take explicit steps to confirm that this is really what you intended
Depends on how you go about it; an mv with the wrong path could erase your data without a warning.
> Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that doesn't do what you meant, it will just blindly do it.
There are no features in git itself for that, but in the way I usually implement GitOps, with a CI/CD system, you can easily have a manual "check this terraform plan's output to make sure you aren't doing anything crazy, and manually click this button to approve" step.
GitOps workflows can have safety gates that require a manual double-check/approval step, for instance if the deployment plan is substantially different or if deletes and loss of data would occur. They weren't implemented in this case, but that doesn't mean they can't be implemented. They probably should be implemented as a takeaway from this article.
Source control is great at auditing why a change occurred. (Keep in mind some of those changes include things like bad merge accidents and sloppy refactors. Source control never guarantees a perfect state of the code at any point in time, only a saved state.) Source control can also give you an estimate in how big of a change occurred (diff size, number of files changed/moved/deleted). You can use those same tools in the process of a GitOps workflow and in setting smart manual gates, not just in post-mortem root cause finding when things go wrong.
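As one hedged illustration of such a gate (not what the incident's setup used; the workflow, action versions and environment name are assumptions): a pipeline where the plan runs automatically but the apply sits behind a protected environment that requires a human reviewer.

```yaml
# .github/workflows/infra.yaml (hypothetical)
name: infra
on:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan -out=tfplan
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan
  apply:
    needs: plan
    runs-on: ubuntu-latest
    # "production" is configured with required reviewers, so this job
    # pauses until a human has read the plan output and approved the run.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform init && terraform apply tfplan
```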
> Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that doesn't do what you meant, it will just blindly do it.
git doesn't do anything (except keep versioned source of declarative state).
Instead, have a look at your state engine, and make it as safe as you care to.
> because infrastructure isn't a stateless function of your git repository
So we agree, there's your problem. Not git, but the state engine function.
While you don't need K8s, for GitOps to work you do need a structured approach for operating your infrastructure. These concepts can help:
“It is relatively easy to manage and scale web apps, mobile backends, and API services right out of the box. Why? Because these applications are generally stateless, so scripts can scale and recover infrastructure from failures without additional knowledge.”
“A larger challenge is managing stateful applications, like databases, caches, and monitoring systems. These systems require application domain knowledge to correctly scale, upgrade, and reconfigure while protecting against data loss or unavailability. We want this application-specific operational knowledge encoded into software … to run and manage the application correctly.”
Start by looking at the Operator Capability Level diagram here:
> If the application is containerised, like most are these days, how do you propose running these on said VM?
It really depends on how you implement it. In simplest setups - yes, you can destroy infrastructure with data and when you recreate the infra the data is gone. But the implementations I worked on were specifically designed to withstand this kind of problem.
> You just have to be aware of the risks the complexity brings.
I'd say most are not aware. It is not enough to say that people should "just" be aware. That is something said from a position of knowledge and awareness, which helps nobody.
Completely agree with the parent comment, simplicity is the first thing that should be reached for. K8s ought to be dismissed by default, because then you have to justify its inclusion. That's probably a way to increase awareness before plunging in.
Nomad is honestly really good. For a long time I've wished that it had a wider reach because I just outright don't want to go back to Kubernetes after using them both.
I'm really fearful that recent events have harmed the chances of that happening, though. It's a shame, so I hope that isn't how it plays out.
> The elephant in the room is Kubernetes, which [...] often gets used as the go-to even where it doesn't make sense either because it's popular or because that's what people know
Or because folks want to pad their resume with k8s, which is what I'm seeing at work most of the time.
> While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.
Can someone explain this? How did they test restores, if the actual restore failed to come up with data?
When I started at a large company a few years back, the company-specified restoration test only checked that the archive restored successfully onto a server, not that there was actually anything in the archive. Digging into it, the archives were all empty due to a commit a few years earlier that added an incorrect exclude option which ended up excluding all files.
They were running for years on the cusp of total failure and had automated restoration tests that caused a false sense of security in the tooling.
The second thing I did was adjust the restoration tooling to validate data existence and over time added validation tests (percent of data matched current live systems, specific fields and values were there, etc).
It's just too easy to screw up, doubly so when time constrained and alone doing the best you can without any oversight.
That’s one of the reasons I started tracking the filesize of the archive and monitoring it over time. Datadog will trigger an alert if our backups are suddenly X% smaller.
Velero can be configured to run on a schedule, but the scheduled command was apparently not the exact same command they were using to perform the manual tests - the scheduled job was missing part of the command, basically.
So they were manually doing a backup and then testing that backup, rather than testing their automated backups? If that's what they were doing, it just makes me wonder... why?
It's possible that the backup restore/test process was still 'automated' to some degree, like a manually-run pipeline or playbook etc.
Obviously there were mistakes made in process design and tools deployment but I don't think this is 'baffling incompetence' more than it is a couple of small mistakes that compounded each other to have a pretty spectacular impact.
I am not sure about the pricing structure of their provider, but maybe it is a Hotel California situation? The provider's data egress pricing was such that they did not want to pay to retrieve the full backup "just for a test"? Instead, they repeated the backup command but swapped the destination for a local location.
> We use #Velero to capture backups of our cluster every 6 hours. From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data
This is why for non hobby stuff I advocate for RDS, or its counterpart on your favorite provider. Running a production db on Kubernetes is looking for trouble unless you really, really know what you are doing.
I would also advise having some backups available outside of your favourite cloud provider. Your cloud provider may fuck up; it has happened many times. You may encounter issues such as your account being deleted or temporarily unavailable. Or the whole datacenter and several zones are on fire.
For example, the "delete" button on GCP deletes the backups along with the database you're deleting.
I would highly recommend not relying on those backups for anything mission-critical. Transferring the backups to another cloud provider is essential if your risk model involves losing your cloud account somehow.
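If the stack already uses Velero, one hedged way to do this is a second BackupStorageLocation pointing at an S3-compatible bucket at a different vendor (the bucket, region and endpoint below are placeholders):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: offsite                     # secondary location, outside the primary cloud
  namespace: velero
spec:
  provider: aws                     # generic S3-compatible object storage plugin
  default: false
  objectStorage:
    bucket: example-offsite-backups # placeholder bucket name
  config:
    region: placeholder-region
    s3ForcePathStyle: "true"
    s3Url: https://s3.example.com   # placeholder S3-compatible endpoint
```

Backups or schedules can then be pointed at this location via their storageLocation field, so at least one copy survives losing the primary account.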
We had a database downtime on RDS. I queried it, because we were paying good money for a failover database. We were told that the downtime was because the database was being updated, and all databases are upgraded at the same time so the failover was also being upgraded.
Never using RDS again. I'll stick with running Postgres on something I can actually manage myself.
Was it a major version upgrade? Postgres WAL replication is a physical replication stream which doesn't work across major versions. I don't know much about RDS Postgres but I'd assume this is the reason. Logical replication can solve this, but has some important limitations (e.g. no replication of DDL), so it's understandable that RDS can't leverage this generically and automatically.
I would expect that RDS MySQL doesn't have this problem since MySQL's built-in replication is logical rather than physical.
That sounds unlike my experience with AWS RDS, and I don't think there would be many users of it if it were generally true. When was this and was it AWS?
Hope you are keeping well and safe. Thank you for reaching back out to me.
I can completely understand that this is not the behavior you expected with regards to engine version upgrades with multi-AZ configuration.
By design, when performing any version upgrade on a multi-AZ instance, the engine version is upgraded on the primary and the secondary at the same time resulting in both the primary and standby being unavailable. Unfortunately, it is not possible for the instance to failover in this scenario since the database level changes needs to be applied at the same time.
As mentioned in my previously correspondence, you do have the ability to opt-out of auto minor version upgrades if this does not suit your requirements [1]. By doing this, minor version updates will no longer be applied automatically and will have to be manually maintained.
As additional information, note that multi-AZ deployment does reduce the downtime in certain activities, however database engine version upgrade does not come under this. Multi-AZ deployments helps RDS automatically perform a failover in the event of any of the following:
1. An Availability Zone outage
2. The primary DB instance fails
3. The DB instance's server type is changed
4. The operating system of the DB instance is undergoing software patching
5. A manual failover of the DB instance was initiated using Reboot with failover
For more information on High Availability (Multi-AZ) deployments, kindly refer the below document for your reference [2].
I sincerely hope I was able to address your concerns Marcus. Please don't hesitate to reach back out to me and I will be happy to assist further as best I can.
Thank you for your time. Wishing you a great weekend ahead, and stay safe.
We value your feedback. Please share your experience by rating this correspondence using the AWS Support Center link at the end of this correspondence. Each correspondence can also be rated by selecting the stars in top right corner of each correspondence within the AWS Support Center.
Very sad. But also a miss on the admin's part, even if an unfortunate one. They did not realize during manual restore testing that the volume was not being backed up.
"Yes and also apparently no. We use #Velero to capture backups of our cluster every 6 hours. From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data. While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data."
I want to feel bad for them, and I do, but really I’m not sure what you expect. They never tested their backup. They tested their ability to manually back up, but I’m not sure why they would ever think to conflate that with automatic processes. If they had automatic backups, why were they relying on manual ones for testing?
Yeah. Kudos to being open about it but this is an amateur mistake. I ran cloud infrastructure for large companies, small companies, and small companies with large data. Disaster Recovery, Disaster Response, Data Backups, Time to restore, etc are all table stakes to running anything in “production”.
They tested their ability to do a backup, but never went through with actually backing it up - automating it - let alone making it a process to do before you make changes in “production”. Amateur. I will give this credit though. It’s hard. Not everyone can do it. Running production that is. Anyone can build for production but actually keeping it live, running, and working nominally with proper SLA and DR? It’s hard. I’m curious what the volumes were even for? I tend to not use storage outside of blob, databases, queues.
ArgoCD deleting PVCs is pretty much exactly the same as someone clicking around in the web UI deleting EBS volumes (or whatever the equivalent is on their cloud provider). It happens all the time. You have to expect it, or you will lose data.
The only thing worse than automation going haywire is someone in a hurry being careless.
Vultr has volume protection too but only when in-use. During a deployment you can easily do what the OP did if you don’t do Canary deployments with red/blue failover for zero downtime. The volume will become “non-in-use” and then…
The real issue here is it all came down to a tab.
“How the backup was created was the key component. Manual backups were made with a command and recurring ones were set up with a yaml manifest. When we ran manual backups for testing, we set the default-volumes-to-fs-backup flag. The yaml manifest HAD the flag, but it was indented one too many times, so it wasn't actually setting the correct key in the velero helm chart, which might as well have not been setting it at all.”
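For reference, a hedged sketch of where that flag is meant to sit, assuming the upstream Velero Helm chart's values layout (the schedule name and retention are illustrative); the whole incident hinged on it being nested one level too deep:

```yaml
# values.yaml fragment for the Velero Helm chart (layout assumed from the upstream chart)
schedules:
  cluster-backup:                   # hypothetical schedule name
    schedule: "0 */6 * * *"
    template:
      includedNamespaces:
        - "*"
      ttl: 168h0m0s
      # Must be a direct child of `template` (the BackupSpec). Indented one
      # level deeper it lands under the wrong key and has no effect.
      defaultVolumesToFsBackup: true
```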
Anyone who would feel a "crippling loss" if their data disappeared must execute a real world real failure test at least annually. Disconnect the storage, attach new storage, and go.
Oof. I've been there. The rush of adrenaline and then just feeling completely wrecked and empty is something that sticks with you for a long time. Take care, lone admin - don't be too hard on yourself.
The only data that I really care about are my photos and videos and they are backed up to four different providers.
But this also made me realize how upset I would be if my blog over at micro.blog was accidentally deleted. It’s more of a journal of my digital nomadding with my wife across the US than anything else.
I immediately went over and started a JSON export in bar (?) format.
I don't understand the backup failure: how did you not notice your PVCs weren't being backed up? Wouldn't you realize the backups were too small or completing too fast? It seems like the root cause is people not having their work checked from first principles.
Honestly, it shouldn’t feel that bad. What was the loss again here? A bunch of users and their posts/memes?! They can create new accounts; it should not be an issue. The real fuckups IMO are the ones that involve people’s lives: industrial automation, robotics, autonomous vehicles, aircraft systems and what not. Unfortunately, you usually don’t see the kind of accountability the OP shows when things go south in those domains.
It sounds like this failure led to the complete destruction of 3 online communities. While this might not be as bad as, say, a fatal multi-car accident on the highway, I bet it led to some heavy grief as users mourned the losses of their communities.
It might not sound like a big deal because it's just "a bunch of users and their posts/memes", but many people get the vast majority of their social interactions through social media. A sudden, total loss of such a site could be devastating to someone's social life. They might've lost contact with an intimate friend group they knew only via aliases with no other way to reach out.
Even for older generations like Baby Boomers, life is an increasingly online experience:
- How many senior citizens are reliant on Facebook for tons of their daily interactions with old friends? Probably a lot!
- Entire degrees from reputable universities can be earned fully remotely, and the transcripts, curricula, forum posts, and even proof that you earned such a degree exist in databases and distributed systems vulnerable to the exact same types of flaws outlined in the OP.
- Dating apps like Tinder / Bumble connect people who form long-lasting and loving relationships, and the sudden loss of such platforms could quickly end your potential lifelong romance.
- LinkedIn networks are of huge importance to some for continued, gainful employment. Are there memes there? Sure! But does that mean it's "just a bunch of posts and memes" whose loss would be fine? Hardly, in my opinion.
My use of the above platforms as examples is not an endorsement of them, BTW. I just wanted to illustrate that the position you're defending will become more and more indefensible as people's lives become increasingly enmeshed in the digital world. Socializing online is still socializing, and a loss can still be painful even if there are memes and usernames involved.
While my knee-jerk reaction is to blame bad backup practices, that's not what I believe happened here. Not entirely at least.
The setup is too complex to easily verify and restore.
Yes, a bespoke box is annoying and "inelegant" but it also works, and if it's backed up following even 80's tier best practice, it can be restored by anyone with a pulse.
This all seems way overly complicated for what could probably be a few services running on a VM or four. Why does Kubernetes/Argo/Helm need to get involved? Why couldn't the whole architecture diagram look a lot more like HN? I feel like we've entirely lost our way with complexity.
IMO it's because of this default in k8s storage class: https://kubernetes.io/docs/concepts/storage/storage-classes/.... Almost everything in a k8s cluster can be ephemeral... except your data! But they make the defaults insane and it eats a lot of people's lunches. It's honestly very sad. I fortunately haven't "lost" data in 25 years, but I still remember the pain when I made this kind of mistake.
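A hedged sketch of overriding that default at the StorageClass level (the provisioner here is a placeholder, not any specific provider's driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block-retain                # hypothetical class name
provisioner: csi.example.com        # placeholder CSI provisioner
reclaimPolicy: Retain               # dynamically provisioned PVs otherwise default to Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```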
That explains why the K8S defaults are bad, but why use K8S at all here?
Personally it seems like a lot of shade-tree server admins are way too eager to bust out much more tooling than is actually necessary for many tasks, just because it's the way that $BIGNAME company does it.
Realistically, most people running a personal or small community's IT system aren't going to need the sort of crazy scalability that container orchestration systems are designed to provide. And if you're not running them at significant scale, the containerization and orchestration-system overhead are often quite significant fractions of the overall resource footprint. Plus there's the unnecessary complexity due to all the levels of abstraction that just don't need to be there.
E.g. for a Mastodon instance, I wouldn't touch Kubernetes or complex orchestration unless I was out of other, more traditional options for scaling. The server side is already broken into a number of components (RoR app, PostgreSQL, Redis, Sidekiq, node.js) which you can separate out onto their own servers, and from there each component has preferred ways of scaling based on need. And while nothing is safe from failure, backing up a bog-standard PostgreSQL VPS is a lot more straightforward than K8S.
If you're doing deployments dozens of times a day, of course automation is desirable. But if you're doing it once, I suspect most people would be hard-pressed to make back the investment of time and additional testing required (well, should be required) of working through a complex container-management architecture.
You kind of explained it yourself. You need to run a number of components, keep track of all of them, be able to update them, scale them, make sure they're healthy and restarted when needed, etc. This is of course possible in a number of ways, but not trivial. You're basically describing an orchestrator such as Kubernetes or Nomad. Especially with an already existing Helm chart covering all the deployment logic (what needs to be deployed, how many instances, health checks, etc.): https://github.com/mastodon/chart it's quite an easy choice instead.
Well I guess I'll share my opinion too. K8s is not just for $BIGNAME companies. Personally I find k8s to be excellent for small shops because it has trivialized the install of large complex apps, opening doors for small groups to have a bigger impact. K8s is simply a package manager for complex deployments.
The learning curve for k8s may be a bit steep initially, but it covers all the areas you _should_ have some understanding of before tackling a large complex install of a web application and puts them in one place: manifests. The best Helm configs provide all the toggles you would need for complex deployments, but provide reasonable defaults (e.g. Mastodon [1]). Literally the application developers are also building the deployment _for_ you. It can't be overstated enough, because of what use is an app if you can't deploy it?
K8s and tools like ArgoCD also encourage some of the best modern practices with IaC. I don't need to hunt for every config file someone might have tinkered with to solve a problem on a server (and hopefully made a doc comment I can find to reproduce it). Every change is there in the manifests. There are OS variations, config variations, etc. and I just don't care about any of that! Just that k8s can run on it and use the resources. The infrastructure and git history is self documenting, allowing me to come into any small team and understand where they are at.
You said orchestration can take a lot of resources: meh. I trust something like k8s to use all of my compute resources more efficiently than a half dozen servers that are probably not sized correctly anyways for the individual components. Digital Ocean will run the control plane for free for example. What the orchestration overhead gives you is trivial reproduction anywhere. It really has nothing to do with "crazy scalability". It's the ability to tear down and bring up the whole cluster in one command. For example you can easily reproduce your full system locally for testing. Good luck doing that reliably with a bunch of servers without making it your full time job and making mistakes constantly.
Re pgsql backups: it's really not any more difficult on k8s. Maybe the problem is too many options: your storage class could do it for you with snapshots, you could have a script connect to the pod and run pg_dump, you could use a third party app like pgadmin4, or you could have sidecars do the work for you (assuming it is configured correctly too). In all cases you need to put the backup data somewhere. You could also use a managed postgres service and completely separate it from your cluster.
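For instance, a rough sketch of the "script that runs pg_dump" option as a CronJob (the image, host, credentials and names are all placeholders); the resulting dump still has to be shipped off-cluster, e.g. to object storage, or it protects against very little:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-dump                   # hypothetical job name
spec:
  schedule: "0 3 * * *"                 # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16        # assumed client image
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials   # hypothetical secret
                      key: password
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -h postgres -U app appdb | gzip > /backup/db-$(date +%F).sql.gz
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-staging    # hypothetical staging PVC; sync it off-cluster
```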
Back to the blog post: really the only problem the admin person had was not understanding the default settings of their storage class. Unfortunately that was also the most devastating. They could have also configured ArgoCD to not delete unknown manifests (opening up options to violate IaC). There is an information gap there, there is a gap in testing and backup strategy, and I hope others can learn and avoid getting bitten by it. But this blog post is not the reason to not use k8s. With some, honestly trivial, tweaks, they could still be supporting their community with these apps.
They said they used Vultr's k8s service, and they provide storage solutions also [2]. Did they contact Vultr to see if they implement some recovery safety for them? 30 day deletion policy, etc.? They probably knew about the problem in minutes, and data might have been or is recoverable in the backend.
If Lily Cohen had used either, recovery would have been possible. Of course, then Lily would have had to explicitly delete the PVs to avoid being billed for the storage.
It's interesting (so far) that none of the comments have addressed the core question. Why is k8s even needed here?
Many comments are focused on how it could have been solved with the given setup. It's as though using k8s is the default conclusion, and that might actually be the problem. Even in the author's post, their reason for choosing a hosting provider was related to k8s.
I'm not familiar with HN architecture. I agree with your sentiment, IMO there was no need to involve k8s here, keep the implementation simple and grow from there.
That sucks. Everyone has been there. On my first sysadmin job I did an rm /* as root and then discovered my backups across the network to a remote tape unit had a buffering issue and were basically useless.
Did this at an insurance company back in '99. They'd been running on desktop PCs with the tops off in a closet (the "server room") with carpeted floors.
Built a cluster in two full racks and started the migration on Friday after COB. Failed all through Saturday and Sunday to restore the data. Had tested the backups (DAT drives) after each backup, which passed, but something was corrupted. Found another backup tape and had it up at 7:45am on Monday. Zero sleep all weekend; had to call in some other admins to get it done. Was both stressful and exhilarating. Became extremely cautious with backup testing after that.
Sure it is. Plenty of companies test load balancing by temporarily shutting off power to one of their data centers and seeing if anyone notices. It's just that usually they can turn the power back on if something does go wrong, but you won't have many recovery options if you blank your only bootable drive. So do use a spare.
Personally, my files like important docs go to:
- iCloud (automatically)
- Local nightly rsync to a zfs share from local Docs folder
- Backblaze
- Some crit stuff to OneDrive
- TimeMachine on a local Linux server at home
Periodically I check them all, verify file counts and sizes, spot check files. Periodically isn’t that often, but I’ve actually needed backed up files and used it.
Data volumes are low, though.
I keep Photos libraries similarly backed up. That’s harder to verify, although size is a pretty good check. Even if the library metadata gets fried, the raw files still exist in the folder structure.
Or said another way -- deletion is dangerous. I've observed that mixing the convenience of automation with the risk of hard deletion is fraught with peril. If you really need automation around deletion, it's best to set up roadblocks and use tombstoning approaches, where archive/backup data is sent to S3 Glacier or something of that ilk. Storage is cheap enough these days that there's no reason not to.
The point of the tooling is that you describe what you want your deploy to look like, and it updates the deploy to match the description. If you delete something from the description and it stays running, that would be very confusing.
I haven’t used their particular tool (ArgoCD) but the ones I have used include an option to keep specific pieces of infrastructure around even when it’s deleted from the description. That’s absolutely what you should be doing for anything that stores data.
It's not too bad. There are two approaches to this one in AWS: 1. You can mark specific resources like db and volumes to be retained on stack deletion. 2. You can't delete some resources easily without deleting the content. Wanna delete a bucket? Explicitly delete all the objects first.
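In CloudFormation terms, the first option is a one-line DeletionPolicy on the resource; a hedged fragment (the volume properties are illustrative, and the same policy works on databases such as AWS::RDS::DBInstance):

```yaml
Resources:
  AppDataVolume:
    Type: AWS::EC2::Volume
    DeletionPolicy: Retain          # keep the volume even if the stack or this resource is deleted
    Properties:
      AvailabilityZone: us-east-1a  # placeholder AZ
      Size: 100
```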
Last week, I upgraded my home laptop's SSD. I decided to take out the old one, do a fresh install on the new one, and restore the backup from a Synology NAS.
The good news: Backup seems complete. It's not the first time I migrated like this, and an older hardware upgrade went a lot less well, so I am happy.
The bad news: It was far from a trouble free process. I succeeded in the end, but had to restore a few times because the first tries were botched and there were some gotchas to learn.
Here we go again, a lot of comments saying the giant ball of knives is too dangerous.
I work with the giant ball of knives every day, wrapped in my custom-tailored full-body kevlar bubble, and I barely ever have any catastrophic life-threatening accidents!
The giant ball of knives is not the issue. Your carelessness when working around the giant ball of knives is the issue. Look inward.
One lesson for all of us, and something I see causing issues every other day at work: testing is often not planned or done based on what is actually present in production, but on whatever makes the tester feel "works for me".
We probably need a Docker-like cure for this "works for me" disease in testing.
That sucks! Now I'm thinking my github action that is doing a `git pull repo && cd repo/ && rm -rf dist/ && populate.sh && git push` in my bash script will totally go rogue one day and kill everything. Any idea, peeps?
Separate your code from your data. Be very careful with your data, and make sure you have backups of your data and that you test and validate your backup procedure. At the very least, verify the integrity of your data backups.
Props to the author for being so open and honest, and for writing it up as a cautionary tale. None of us are perfect, thankfully. The stigma around failing should be eradicated from society.
This would also mean they did not have a staging environment for verifying a botched deployment. Very unfortunate and a double disappointment for finding out backups were missing.
I think they are saying they manually tested what they thought the automated process was doing, then found out their assumption about what it was doing was wrong. So the manual tests were moot.
At least they had fun with trendy tech. Who wants to ssh to servers any more? Throw in ten different tools you barely read documentation for, season it up with some yaml you found on gist.github and join the cool guys!
This is a dumb question, but why can't people just make things simpler? Tools are supposed to make work easier and make things more efficient, but this sort of complexity just seems to hurt, doesn't it?
I wonder if it has to do with mismatches between -as-cattle and -as-pets philosophies and usecases. From the post, it looks like they were using k8s. It's hard to tell given the, you know, total data loss, but it looks like the instance had at max triple-digit users. In my experience running small community stuff like this, the -as-cattle tooling, even though I'm familiar with it from work, is way overkill and introduces far more brittleness than it prevents. My go-to for stuff like this is -as-pets: gimme a box, a daemon, some rc scripts and maybe a backup client. Admin complexity and burnout are huge problems for volunteer sysadmins, and IMO the burden of -as-cattle tooling unnecessarily exacerbates both.
I've had an irc daemon running on the same metal for almost 15 years now, uninterrupted besides occasional OS upgrades and patches. There is basically nothing else happening on the host. It is dumb simple and just works.
Some of that may be on devs, actually: distribution-as-docker-container at the expense of real packaging seems to be on the rise.
> Some of that may be on devs, actually: distribution-as-docker-container at the expense of real packaging seems to be on the rise.
Because it eliminates a whole host of "works on my computer" "which version of X dependency library do you have installed" problems nobody wants to deal with.
At work, we're running a hybrid container/vm setup. We host the applications from our dev-teams as containers, since there is a lot of them and this allows them to do most of the work to integrate new apps with the stack. However, the actual data stores like postgres, glusterfs and such are simple VMs for a simple reason: It has less failure modes and many of the failure modes on a VM are less subtle and strange to deal with. And very few failure modes on a VM go straight to irrecoverable dataloss.
In fact, the company I work at has deployed production like that for many years when our needs were simpler with just 1-2 applications. 1-4 application VMs provisioned by a config management, maybe a loadbalancer, maybe 1-2 DB VMs behind it. Worked like a charm until the new parent company started throwing loads of complexity at it.
This sounds like a good explanation. As an outsider, maybe the complexity seems unnecessary but it's needed at scale (an analogy: what's needed in a kitchen for a spaghetti meal, versus in a restaurant, versus in a canned-spaghetti factory). Problems arise when you try to shoehorn methodologies and tools built for scaled production onto smaller systems, I guess.
You can't choose "simpler" overall. The choice is normally between "simpler for one-off things" and "simpler in aggregate". If you choose the first option too often, you'll learn that doing the simplest thing every time paints you into a corner and you have to deal with all the tech debt one day. Doing things in a designed/automated/generated simpler way means you sometimes end up dealing with complex system failures. You can't avoid the complexity - just decide how you organise it.
Simpler to use, simpler to learn, simpler to design, and simpler to create are all different properties. Sometimes they are unrelated, sometimes complementary, sometimes in tension.
The more layers of automation we add, the more invisible points of failure.
Bullshit. Doing things by hand is more error prone.
Without automation this situation would've played out like: well we got used to running things with e.g. --force or --yes or just hitting yes manually at every prompt. Unfortunately we just nuked our data store.
Alternatively they would've looked at the dashboards, seen perhaps low CPU or memory utilization for the data store namespace and manually nuked it to save some money..
While this smells like a process issue it's mostly an architectural one. It's good that things were namespaced, however, for persistent data the infra should be more or less air gapped from a tooling POV. Updating the persistent data store should be a whole separate CI job, ideally with more intervention required to effect change.
Disclaimer: author of a competing k8s backup solution.
I personally don't think Velero is a solution for production workloads or anything serious. Only an established backup company/devs will have the expertise to implement and handle all the cases and take care of all data loss scenarios. Ideally the k8s authors should have stopped at providing a tool (they have snapshots, like any db/fs) rather than writing their own backup piece. Unfortunately many of the industry solutions are wrappers over Velero, except a few (two?); one of them I implemented from scratch for Commvault.