Hacker News
Aim for Operability, Not SRE as a Cult (stevesmith.tech)
41 points by slyall on Oct 21, 2020 | 12 comments



In my experience, almost no place implements SRE at all, much less as a cult.


The author mentions what may be an effective way of viewing SRE: that SRE should run only the most critical systems. But the post doesn't clearly mention what is, in my opinion, a secondary goal of SRE: standardizing how teams run their own products and supporting them in doing so.

Having worked for a company where outages had clearly defined revenue losses in the hundreds of thousands per hour, I found the biggest obstacle to implementing SRE was the IT as a Cost Centre model, not anything wrong with SRE itself.


I don't think anyone argues that SRE is wrong, but I think most people will be wrong if they assume most places can or will implement it. It's just like DevOps in the sense that it's a corporate efficiency strategy, but it's never seen as all that necessary. Execs just see it as another Six Sigma, something to banter about at the golf club. And nobody outside of sysadmins and managers really gets it at all.

I think the more companies rely on technology, the more screwed they are, because they don't realize how much harder it is to run a business dependent on technology. The fact that SRE exists proves that, I think. The system is by default a garbage fire, because by default you can't push back on product in order to fix your tech debt. Most other businesses seem to grasp the concept that they need to perform maintenance on their machines or they won't be in business for long, but as soon as it's "the magical box with the 0s and 1s inside", they pretend it doesn't require long-term investment or a different strategy.


I agree, and I think anyone who's complaining about SRE as a cult is probably writing a blog post aimed at their company, their boss, or their consulting consortium; they have a narrow view of what's generally occurring.

Getting higher-ups to understand the idea of increasing quality to reduce cost can be a losing battle, especially when management changes every 2-3 years.

It's almost always easier to cut spending than have some organizational breakthrough that standardizes delivery.


Have you worked in environments where the abstractions piled up so high that SWEs can make silly mistakes and be completely oblivious to their negative impact on others?


Yes, almost all of them.


Where I work we are called SREs and are told to read the book, but we rarely follow what is in the book.


This is a really good discussion of more than just two ways of thinking about DevOps or SRE, laid out as patterns and anti-patterns:

https://web.devopstopologies.com

Every shop has reasons it may be better suited to one of these than to the others; an anti-pattern may even work better than most patterns, or than a pattern done badly.


"For example, an availability level of 99.9% equates to an error budget of 0.01% unsuccessful requests. 0.002% of failing requests in a week would consume 20% of the error budget, and leave 80% for the quarter."

I... what?


Given 10,000 requests in a quarter, in order to achieve 99.9% availability, 9,990 of those requests must be successful. That leaves 10 that can fail. If two fail in a week, then only 8 more can fail for the rest of the quarter.
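
To make that arithmetic concrete, here's a minimal Python sketch. The function and variable names are mine, not from the article, and it only counts requests; it says nothing about how a real SLO tool measures them.

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Number of requests allowed to fail while still meeting the SLO."""
    # round() absorbs float noise: 10_000 * (1 - 0.999) is not exactly 10.0
    return round(total_requests * (1 - slo))


budget = error_budget(10_000, 0.999)    # 10 failures allowed in the quarter
failed_this_week = 2
consumed = failed_this_week / budget    # fraction of the budget used so far
remaining = budget - failed_this_week   # failures still allowed this quarter

print(f"budget={budget}, consumed={consumed:.0%}, remaining={remaining}")
# prints: budget=10, consumed=20%, remaining=8
```

Note that 99.9% of 10,000 leaves a budget of 10 failures, which is 0.1% of requests.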


Right, but 10,000 requests in a quarter is about 833 per week (taking a quarter as 12 weeks). 2/833 is 0.24%, or 99.76% availability that week.
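
The same per-week view as a rough Python sketch. It assumes a 12-week quarter (which is what the 833 figure implies) and traffic spread evenly across weeks; both are simplifications, and the variable names are just for illustration.

```python
requests_per_quarter = 10_000
weeks_per_quarter = 12  # assumption: 10,000 / 12 is roughly 833 per week
failures_this_week = 2

requests_per_week = requests_per_quarter / weeks_per_quarter
weekly_failure_rate = failures_this_week / requests_per_week
weekly_availability = 1 - weekly_failure_rate

print(f"{weekly_failure_rate:.2%} failed, {weekly_availability:.2%} available")
# prints: 0.24% failed, 99.76% available
```

So that one week, taken in isolation, is below 99.9%, while the quarter as a whole can still meet it as long as no more than 8 further requests fail.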


I really enjoyed this. Smaller SRE groups in smaller companies may not be the same as Google's, but the philosophical DNA should at least match.



