Aim for Operability, Not SRE as a Cult

hobs · on Oct 21, 2020

In my experience there's almost no place implementing SRE at all, much less as a cult.

parsley27 · on Oct 21, 2020

The author mentions what may be an effective way of viewing SRE: that SRE would run only the most critical systems. But the article doesn't clearly mention a secondary goal of SRE (in my opinion), which is to standardize and support teams running their own products.

Having worked for a company where outages have clearly defined revenue loss amounts in the hundreds of thousands per hour, I found the biggest obstacle to implementing SRE was the IT-as-a-cost-centre model, not anything wrong with SRE itself.

0xbadcafebee · on Oct 21, 2020

I don't think anyone argues that SRE is wrong, but I think most people will be wrong if they assume most places can or will implement it. It's just like DevOps in the sense that it's a corporate efficiency strategy, but it's never seen as all that necessary. Execs just see it as another Six Sigma, something to banter about at the golf club. And nobody outside of sysadmins and managers really gets it at all.

I think the more companies rely on technology, the more screwed they are, because they don't realize how much harder it is to run a business dependent on technology. The fact that SRE exists proves that, I think. The system is by default a garbage fire, because by default you can't push back on product in order to fix your tech debt. Most other businesses seem to grasp the concept that they need to perform maintenance on their machines or they won't be in business for long, but as soon as it's "the magical box with the 0s and 1s inside", they pretend it doesn't require long term investment or a different strategy.

hobs · on Oct 21, 2020

I agree, and I think anyone who's complaining about SRE as a cult is probably writing a blog post to their company or their boss or their consulting consortium - they have a narrow view of what's generally occurring.

Getting higher-ups to understand the idea of increasing quality to reduce cost can be a losing battle, especially when management changes every 2-3 years.

It's almost always easier to cut spending than have some organizational breakthrough that standardizes delivery.

nopit · on Oct 21, 2020

Have you worked in environments where the abstractions piled up so high that they allow SWEs to make silly mistakes and be completely oblivious to their negative impact on others?

hobs · on Oct 21, 2020

Yes, almost all of them.

apple4ever · on Oct 22, 2020

Where I work we are called SREs and are told to read the book, but we rarely follow what is in the book.

Terretta · on Oct 21, 2020

This is a really good discussion of more than just two ways of thinking of DevOps or SRE, as anti-patterns and patterns.

https://web.devopstopologies.com

Every shop has reasons it may be better suited to one of these than to the others; an anti-pattern may even be a better fit than most patterns, or than a pattern done badly.

paledot · on Oct 22, 2020

"For example, an availability level of 99.9% equates to an error budget of 0.01% unsuccessful requests. 0.002% of failing requests in a week would consume 20% of the error budget, and leave 80% for the quarter."

I... what?

tolbish · on Oct 22, 2020

Given 10,000 requests in a quarter, in order to achieve 99.9% availability, 9,990 of those requests must be successful. That leaves 10 that can fail. If two fail in a week, then only 8 more can fail for the rest of the quarter.
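The same arithmetic as a rough Python sketch. The function name `error_budget` is made up for illustration, and the figures (10,000 requests, a 99.9% target, 2 failures) are just the ones from the example above, not from any real system.

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Number of requests that may fail while still meeting the SLO."""
    # round() absorbs floating-point noise in (1 - slo)
    return round(total_requests * (1 - slo))


# Figures from the example: 10,000 requests per quarter, 99.9% target
budget = error_budget(10_000, 0.999)   # requests allowed to fail
failed_this_week = 2
remaining = budget - failed_this_week  # failures still allowed this quarter
consumed_fraction = failed_this_week / budget

print(f"budget={budget}, remaining={remaining}, consumed={consumed_fraction:.0%}")
```

With these inputs the budget works out to 10 failed requests, so 2 failures use up 20% of it.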

paledot · on Oct 22, 2020

Right, but 10,000 requests in a quarter is 833 per week. 2/833 is 0.24%, or 99.76% uptime that week.
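A quick sketch of that weekly view, assuming traffic is spread evenly and a 12-week quarter (which is what the 833 figure implies; a calendar quarter is closer to 13 weeks). `weekly_failure_rate` is a hypothetical helper, not anything from the article.

```python
def weekly_failure_rate(requests_per_quarter: int,
                        weeks_per_quarter: int,
                        failures_this_week: int) -> float:
    """Fraction of one week's requests that failed, assuming even traffic."""
    weekly_requests = requests_per_quarter / weeks_per_quarter
    return failures_this_week / weekly_requests


# Assumption: 12 weeks per quarter, i.e. about 833 requests per week
rate = weekly_failure_rate(10_000, 12, 2)
print(f"{rate:.2%} failed, {1 - rate:.2%} succeeded that week")
```

So the same 2 failures are only 20% of the quarterly budget, yet measured over that one week the success rate is below the 99.9% target. The two numbers answer different questions because they use different windows.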

crmrc114 · on Oct 22, 2020

I really enjoyed this. Smaller SRE groups in smaller companies may not be the same as Google's, but the philosophical DNA should at least match.