I can see the point you're trying to make, but you're choosing an interesting wa...

nvm0n2 · on Aug 28, 2023

> GitOps is a philosophy which makes sense even on small projects.

Er, does it? The root cause of this catastrophic data loss incident is that in "GitOps" none of the traditional safety checks can be implemented. In normal sysadmin workflows, attempting to delete all your data will yield an "Are you sure?!" type message and you'll probably have to take explicit steps to confirm that this is really what you intended. There will also be dry run modes and other helpers.

Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that didn't do what you mean, it will just blindly do it.

It seems like this is a pretty major flaw in the whole "philosophy". The whole point of hacking a VCS into a server admin UI is because people think git will let you roll back infrastructure changes easily. But it cannot, because infrastructure isn't a stateless function of your git repository.

sofixa · on Aug 28, 2023

> In normal sysadmin workflows, attempting to delete all your data will yield an "Are you sure?!" type message and you'll probably have to take explicit steps to confirm that this is really what you intended

Depends on how you go about it; an `mv` with the wrong path could erase your data without a warning.

> Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that didn't do what you mean, it will just blindly do it.

There are no features in git itself for that, but in the way I usually implement GitOps, with a CI/CD system, you can easily have a manual "check this terraform plan's output to make sure you aren't doing anything crazy, and manually click this button to approve" step.
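A minimal sketch of what such a check could look at, assuming the plan has been exported to JSON with `terraform show -json` and that the CI system pauses for approval when the script flags something. The function names are hypothetical, and only the `resource_changes[].change.actions` field of the plan JSON is used:

```python
from typing import Any


def destructive_addresses(plan: dict[str, Any]) -> list[str]:
    """Return the addresses of resources the plan would delete or replace.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    A replacement shows up as both "delete" and "create" in `actions`,
    so checking for "delete" covers both cases.
    """
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(change.get("address", "<unknown>"))
    return flagged


def requires_manual_approval(plan: dict[str, Any]) -> bool:
    """Hypothetical policy: any delete or replace needs a human to approve."""
    return bool(destructive_addresses(plan))
```

A pipeline step could load the plan with `json.load`, print the flagged addresses, and exit non-zero so the run stops until someone clicks approve. How the pause is wired up depends on the CI/CD system in use.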

WorldMaker · on Aug 28, 2023

GitOps workflows can have safety gates that require a manual double-check/approval step, for instance if the deployment plan is substantially different or if deletes and loss of data would occur. They weren't implemented in this case, but that doesn't mean they can't be implemented. They probably should be implemented as a takeaway from this article.

Source control is great at auditing why a change occurred. (Keep in mind some of those changes include things like bad merge accidents and sloppy refactors. Source control never guarantees a perfect state of the code at any point in time, only a saved state.) Source control can also give you an estimate of how big a change was (diff size, number of files changed/moved/deleted). You can use those same tools in the process of a GitOps workflow and in setting smart manual gates, not just in post-mortem root cause finding when things go wrong.
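As a rough sketch of the diff-size idea, assuming the pipeline feeds in the text output of `git diff --numstat` (one `added<TAB>deleted<TAB>path` line per file, with `-` in place of counts for binary files). The names and thresholds below are hypothetical, not recommendations:

```python
from dataclasses import dataclass


@dataclass
class DiffSummary:
    files: int
    added: int
    deleted: int


def summarize_numstat(numstat: str) -> DiffSummary:
    """Total up the output of `git diff --numstat`."""
    files = added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        files += 1
        add_count, del_count, _path = parts
        # Binary files report "-" instead of line counts.
        if add_count != "-":
            added += int(add_count)
        if del_count != "-":
            deleted += int(del_count)
    return DiffSummary(files, added, deleted)


def needs_review(summary: DiffSummary, max_files: int = 20, max_deleted: int = 200) -> bool:
    """Hypothetical gate: large or deletion-heavy changes need a manual look."""
    return summary.files > max_files or summary.deleted > max_deleted
```

Line counts are only a proxy: a one-line change can still destroy data, so a gate like this would complement a check on the deployment plan itself rather than replace it.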

Terretta · on Aug 28, 2023

> Because git is intended for source code and not as a way to make stateful changes to servers, there are no features for that. If you push a commit that didn't do what you mean, it will just blindly do it.

git doesn't do anything (except keep a versioned source of declarative state).

Instead, have a look at your state engine, and make it as safe as you care to.

> because infrastructure isn't a stateless function of your git repository

So we agree, there's your problem. Not git, but the state engine function.

While you don't need K8s, for GitOps to work you do need a structured approach for operating your infrastructure. These concepts can help:

https://operatorframework.io/about/

“It is relatively easy to manage and scale web apps, mobile backends, and API services right out of the box. Why? Because these applications are generally stateless, so scripts can scale and recover infrastructure from failures without additional knowledge.”

“A larger challenge is managing stateful applications, like databases, caches, and monitoring systems. These systems require application domain knowledge to correctly scale, upgrade, and reconfigure while protecting against data loss or unavailability. We want this application-specific operational knowledge encoded into software … to run and manage the application correctly.”

Start by looking at the Operator Capability Level diagram here:

https://sdk.operatorframework.io/docs/overview/

You can iterate your ability to operate from desired state through those stages, with focus on the ones that hurt the most (by risk or by repetition).

hdjjhhvvhga · on Aug 28, 2023

> If the application is containerised, like most are these days, how do you propose running these on said VM?

It really depends on how you implement it. In the simplest setups, yes, you can destroy infrastructure with data, and when you recreate the infra the data is gone. But the implementations I worked on were specifically designed to withstand this kind of problem.

politelemon · on Aug 28, 2023

> You just have to be aware of the risks the complexity brings.

I'd say most are not aware. It is not enough to say that people should "just" be aware. That is something said from a position of knowledge and awareness, which helps nobody.

Completely agree with the parent comment: simplicity is the first thing that should be reached for. K8s ought to be dismissed by default, because then you have to justify its inclusion. That's probably a way to increase awareness before plunging in.

earthling8118 · on Aug 28, 2023

Nomad is honestly really good. For a long time I've wished that it had a wider reach because I just outright don't want to go back to Kubernetes after using them both.

I'm really fearful that the recent events have harmed the chances of that happening though. That would be a shame, so I hope that isn't how it plays out.

tannhaeuser · on Aug 28, 2023

> The elephant in the room is Kubernetes, which [...] often gets used as the go-to even where it doesn't make sense either because it's popular or because that's what people know

Or because folks want to pad their resume with k8s, which is what I'm seeing at work most of the time.