
This greatly depends on the organisation's size and whether you're on-prem or in the cloud. If you have a small-medium startup with a single product and a fairly simple architecture in AWS, this might make sense. But if you're a bank or something with thousands of applications, 100k+ servers on-prem, and a global footprint with availability requirements, this is a vastly different story.

These are obviously two extremes, but you can see there is tons of stuff in the middle, too. The larger folks are the ones that are stressed 24/7 even when they have the tools to do it.



The opposite, actually. If you're that size it's insane not to have fault-tolerant clusters; on-prem or cloud doesn't change much. Time to invest in it.


Yeah, wishful thinking. I work with them every day and get to see under the hood. I 100% agree with you on investing in fault-tolerant clusters / automation. There is a massive trend happening to modernise legacy infrastructure. Almost all the big companies will have some internal platform team that's trying to standardise. It's just extremely slow and takes tons of training and education.

I don't think it's the newly built stuff that companies are worried about. It's the old legacy stuff that's on life support that no one wants to modernise, or maybe can't.


The existence of small and large companies with bad systems architecture isn't an argument for implementing more bad systems architecture. Small or large, it doesn't need to cost more to design stuff well.


For what it's worth -- I agree with you, and that's why folks are flocking to docker/kubernetes and devops tooling like Terraform. I just think you're missing the scope of it. Say, for example, a big box store: they might have 500 locations, and all of a sudden they buy a few more companies and merge all this stuff under a single brand. All of a sudden they have tons of systems all over the place, lots of existing platforms that need to talk to each other, and staff that are resistant to change. This isn't "newly designed stuff"; it's systems that sort of organically grew over a long time. You're talking lots of different operating systems, networks, front-end/back-end systems, the www corporate site, mobile site, rewards sites, all sorts of internal support applications, POS system backends, etc. They probably have 20+ different large database systems and they might not even know all the apps connecting to them. I actually worked with a company like this on a few cloud projects. It was amazing to see the complexity. This is the type of stuff that's running large parts of companies that you interact with on a daily basis.

I guess what I'm getting at is that, sure, if they design new things they will follow modern patterns, but there are so many things that are not modern. They don't have the time or incentive to just go and rebuild all this stuff. There is zero benefit to their bottom line unless there is some burning fire, a way they can extract more money, or a way to save tons of money. So they just keep them on life support and run in a keep-the-lights-on mode until something happens. These are the systems all sysadmins just wish went away, and there are many of these types of things all over the place.


Yeah, I agree. Your initial reply to me didn't mention legacy issues, just org size, and I was interpreting it more in the context of a large org building new systems the same unmaintainable way they always have because of inertia / politics / ossified sysadmins / etc.


The problems are almost always around process. But I've also seen, on occasion, some sadly pragmatic reasons for very slow processes, such as critical legacy software for which a replacement can't be found even with 8+ figure checks waved at vendors.

One of the big trends of the 2010s for cloud software was to cloudwash old stacks that really weren't validated beyond not crashing and burning on an EC2 instance, which is why the entire cloud-native movement exists: to differentiate greenfield cloud architecture services from cloudwashed ones.

Things can be pretty frustrating working with different vendors of different applications and competencies. People will be patching log4j issues for years to come, for example, and that’s probably easier to validate in aggregate than entire kernel upgrades for decrepit, unsupported distros like CentOS 5 that I still hear about being used.


If you are starting from scratch, sure.

Real life is more complicated, and even if the organisational willpower and politics are aligned in a way to _want_ to fix it, this takes a long time.

Chastising someone on HN because they own a system they probably didn't design and might not have the power to fix seems, at best, a little unfair.


I agree, there are many reasons why bad systems architecture might remain in place: inertia, lack of resources, organisational politics, etc.

I didn't chastise anyone.


Sorry, that wasn’t addressed to you specifically, just aiming upwards in the thread.


A modernization effort is like adding a new, alien science officer named Spock to your crew. Who has the patience, skill, and charisma to deal with the annoying Vulcan and his cold logic?


I have seen a lot of places, yet not a single one that runs all the latest software everywhere all the time. This is even true for smaller pieces of software like the Linux kernel.


How would fault tolerance help with privilege escalation?

Would it not just mean that you have more computers to update in your redundant/tolerant cluster?


In this context it means you can take individual servers offline without taking your entire service down. So you can then update each server (even on production systems) live without requiring a maintenance window.

For bonus points you're also not babysitting manually provisioned servers but instead have your software installs automated. So any failure on a server or OS update isn't treated as a maintenance job but rather as just terminating the old server and letting your pipeline auto-build a new one. This is often referred to as "treating your servers as cattle rather than pets", though not everyone likes that analogy.
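
To make that concrete, here's a minimal sketch of the "terminate and rebuild" flow, assuming the fleet sits in an AWS Auto Scaling group (the instance ID and group setup are hypothetical placeholders, not anything specific to the setups discussed above):

    # Sketch of "cattle, not pets", assuming the servers are managed by an
    # AWS Auto Scaling group. The instance ID is a made-up placeholder.
    import boto3

    autoscaling = boto3.client("autoscaling")

    def replace_server(instance_id):
        # Terminate the old server; the Auto Scaling group launches a fresh
        # one from its launch template, so a botched OS update is handled by
        # replacement rather than repair.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=False,  # keep the fleet size constant
        )

    replace_server("i-0123456789abcdef0")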


In the context of a cluster, fault tolerance allows you to replace nodes without downtime. With automation a kernel update can then be a routine, low effort, low stress task.

Honestly, a kernel update has to be a routine, low effort, low stress task. It's a common event that should be seen as part of the normal operation of the system, not as some exceptional event that means someone has to work on the weekend.


Theoretically it means that you can be running regular node OS updates as a matter of course simply by replacing some percentage of them on a rapid cadence.

Then there isn’t any stress to doing it, it’s routine and automated.
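
Something like this rough boto3 sketch is the shape of that cadence, again assuming an AWS Auto Scaling group whose launch template already points at a patched image; the group name, batch fraction, and pause are made-up placeholders:

    # Roll the fleet by replacing a percentage of nodes per pass and letting
    # the Auto Scaling group rebuild them from an updated image.
    import time
    import boto3

    autoscaling = boto3.client("autoscaling")
    GROUP = "web-fleet"      # hypothetical group name
    BATCH_FRACTION = 0.25    # replace a quarter of the nodes per pass
    SETTLE_SECONDS = 600     # crude pause; a real setup would check health instead

    def in_service_instances():
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[GROUP]
        )["AutoScalingGroups"][0]
        return [
            i["InstanceId"]
            for i in group["Instances"]
            if i["LifecycleState"] == "InService"
        ]

    def roll_fleet():
        instances = in_service_instances()
        batch = max(1, int(len(instances) * BATCH_FRACTION))
        for start in range(0, len(instances), batch):
            for instance_id in instances[start:start + batch]:
                autoscaling.terminate_instance_in_auto_scaling_group(
                    InstanceId=instance_id,
                    ShouldDecrementDesiredCapacity=False,  # ASG rebuilds it
                )
            time.sleep(SETTLE_SECONDS)

    roll_fleet()

In practice you'd probably lean on something managed like the ASG's built-in instance refresh or your orchestrator's rolling update, which add proper health checks, but the job has the same shape either way: routine and automated.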



