
The more layers of automation we add, the more invisible points of failure. Magic is great until it isn’t.

I feel their pain. The sinking feeling when you realize that data is gone and not coming back is an awful experience.




This is a dumb question, but why can't people just make things simpler? Tools are supposed to make work easier and make things more efficient, but this sort of complexity just seems to hurt, doesn't it?


I wonder if it has to do with mismatches between -as-cattle and -as-pets philosophies and use cases. From the post, it looks like they were using k8s. It's hard to tell given the, you know, total data loss, but it looks like the instance had at most triple-digit users. In my experience running small community stuff like this, the -as-cattle tooling, even though I'm familiar with it from work, is way overkill and introduces far more brittleness than it prevents. My go-to for stuff like this is -as-pets: gimme a box, a daemon, some rc scripts and maybe a backup client. Admin complexity and burnout are huge problems for volunteer sysadmins, and IMO the burden of -as-cattle tooling unnecessarily exacerbates both.
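
For concreteness, the "backup client" part of that can be as small as a cron-driven script. A rough sketch (the paths, names and retention count are placeholders, not what any particular instance runs):

  #!/usr/bin/env python3
  """Nightly backup sketch: tar the data dir, keep the last N archives."""
  import tarfile
  from datetime import datetime
  from pathlib import Path

  DATA_DIR = Path("/var/lib/myapp")     # placeholder: wherever the daemon writes
  BACKUP_DIR = Path("/backups/myapp")   # placeholder: ideally a separate disk or host
  KEEP = 14                             # keep two weeks of nightly archives

  def main():
      BACKUP_DIR.mkdir(parents=True, exist_ok=True)
      stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
      archive = BACKUP_DIR / f"myapp-{stamp}.tar.gz"

      # Write the new archive.
      with tarfile.open(archive, "w:gz") as tar:
          tar.add(DATA_DIR, arcname=DATA_DIR.name)

      # Rotate: drop the oldest archives beyond KEEP.
      for old in sorted(BACKUP_DIR.glob("myapp-*.tar.gz"))[:-KEEP]:
          old.unlink()

  if __name__ == "__main__":
      main()

Run it from cron, copy the archives off the box, and the failure mode from the post is covered without any orchestration layer.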

I've had an irc daemon running on the same metal for almost 15 years now, uninterrupted besides occasional OS upgrades and patches. There is basically nothing else happening on the host. It is dumb simple and just works.

Some of that may be on devs, actually: distribution-as-docker-container at the expense of real packaging seems to be on the rise.


> Some of that may be on devs, actually: distribution-as-docker-container at the expense of real packaging seems to be on the rise.

Because it eliminates a whole host of "works on my computer" "which version of X dependency library do you have installed" problems nobody wants to deal with.


At work, we're running a hybrid container/VM setup. We host the applications from our dev teams as containers, since there are a lot of them and this lets the teams do most of the work of integrating new apps with the stack. However, the actual data stores like postgres, glusterfs and such are simple VMs for a simple reason: they have fewer failure modes, and the failure modes a VM does have are less subtle and strange to deal with. And very few failure modes on a VM go straight to irrecoverable data loss.

In fact, the company I work at deployed production like that for many years, back when our needs were simpler with just 1-2 applications: 1-4 application VMs provisioned by config management, maybe a load balancer, maybe 1-2 DB VMs behind it. Worked like a charm until the new parent company started throwing loads of complexity at it.


> Some of that may be on devs, actually: distribution-as-docker-container at the expense of real packaging seems to be on the rise.

You could just run docker or podman or whatever alone on a VM and start/stop containers entirely manually.
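
As a sketch of what that looks like in practice (the image, port and volume names are made up), the whole "orchestration" can be a start/stop helper you keep next to the VM:

  #!/usr/bin/env python3
  """Start/stop a single app container on a plain VM, no orchestrator."""
  import subprocess
  import sys

  NAME = "myapp"                           # placeholder container name
  IMAGE = "registry.example/myapp:1.2.3"   # pin the version explicitly

  def start() -> None:
      subprocess.run([
          "docker", "run", "-d",
          "--name", NAME,
          "--restart", "unless-stopped",
          "-v", "myapp-data:/data",   # named volume, so data outlives the container
          "-p", "8080:8080",
          IMAGE,
      ], check=True)

  def stop() -> None:
      subprocess.run(["docker", "stop", NAME], check=True)
      subprocess.run(["docker", "rm", NAME], check=True)

  if __name__ == "__main__":
      {"start": start, "stop": stop}[sys.argv[1]]()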


This sounds like a good explanation. As an outsider, maybe the complexity seems unnecessary but it's needed for scale (an analogy: what's needed in a kitchen for a spaghetti meal, versus in a restaurant, versus in a canned-spaghetti factory), and problems can arise when you try to shoehorn methodologies and tools for scaled production onto smaller systems, I guess.


You can't choose "simpler" overall. The choice is normally between "simpler for one-off things" and "simpler in aggregate". If you choose the first option too often, you'll learn that doing the simplest thing every time paints you into a corner and you have to deal with all the tech debt one day. Doing things in a designed/automated/generated simpler way means you sometimes end up dealing with complex system failures. You can't avoid the complexity - just decide how you organise it.


Simpler to use, simpler to learn, simpler to design, and simpler to create are all different properties. Sometimes they are unrelated, sometimes complementary, sometimes in tension.


> The more layers of automation we add, the more invisible points of failure.

Bullshit. Doing things by hand is more error prone.

Without automation, this situation would've played out like: well, we got used to running things with e.g. --force or --yes, or just hitting yes manually at every prompt. Unfortunately, we just nuked our data store.

Alternatively, they would've looked at the dashboards, seen perhaps low CPU or memory utilization for the data store namespace, and manually nuked it to save some money.

While this smells like a process issue, it's mostly an architectural one. It's good that things were namespaced; however, for persistent data the infra should be more or less air-gapped from a tooling POV. Updating the persistent data store should be a whole separate CI job, ideally with more intervention required to effect change.
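
As a rough sketch of what "more intervention required" could mean (the namespace names are invented, and the real thing would live in its own pipeline with separate credentials): a wrapper that refuses to touch the data namespaces without an explicit, typed confirmation.

  #!/usr/bin/env python3
  """Sketch of a guard around changes to persistent-data namespaces."""
  import subprocess
  import sys

  PROTECTED_NAMESPACES = {"postgres", "object-storage"}  # invented names

  def apply(namespace: str, manifest: str) -> None:
      if namespace in PROTECTED_NAMESPACES:
          # Force a human into the loop for anything touching persistent data.
          answer = input(f"About to modify protected namespace '{namespace}'. "
                         f"Type the namespace name to confirm: ")
          if answer != namespace:
              sys.exit("Aborted: confirmation did not match.")

      # kubectl apply is additive; deletes would go through an even stricter path.
      subprocess.run(
          ["kubectl", "--namespace", namespace, "apply", "-f", manifest],
          check=True,
      )

  if __name__ == "__main__":
      apply(sys.argv[1], sys.argv[2])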


The opposite of complexity is not manual. It is possible to host a setup that is simpler than the one that was chosen.



