I've found that not assigning blame in postmortems isn't just good practice, it's also accurate: the real culprit is usually the checks and balances, since mistakes will always happen, and the goal should be to have failsafes and detection.
I'm reminded of airplane accidents: whenever you hear of one, it's always some amazingly unlikely series of things going exactly wrong to get the plane to crash. We have a tendency to think "wow, what bad luck", but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.
A company's goal should be to increase the number of independent things that all have to go wrong before there is downtime.
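To put rough numbers on that (the failure rates below are invented, and the assumption that the layers fail independently is doing a lot of the work), each extra failsafe multiplies down the odds of everything going wrong at once:

    # Back-of-the-envelope sketch; rates are made up and layers are
    # assumed to fail independently of one another.
    def downtime_probability(layer_failure_rates):
        """Chance that every protective layer fails at the same time."""
        p = 1.0
        for rate in layer_failure_rates:
            p *= rate
        return p

    print(downtime_probability([0.01]))              # one layer:    0.01
    print(downtime_probability([0.01, 0.01, 0.01]))  # three layers: ~1e-06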
One other important point is that the very term "root cause" is extremely harmful in that it presumes a single primary failure and already seeds the idea of one bad actor and, by proxy, blame. Systems today are too complex to blame failures on one or two things - we operate in a very complicated, "complected" world, both in our software and in many of our organizations.
While there are always technical causes for larger technical failures, I've seen far too many RCA post-mortems turn into witch hunts instead of a solemn contemplation of how everyone could have done things better. Such an RCA may ignore that a normally careful engineer was overworked by managers; a lack of relevant monitoring and testing due to budget cuts is never cited; and you'll certainly never see "teams X and Y collaborated too much" listed as a reason for failure in these places. Because in a typical workplace, the company's values and culture are never related to a failure. You can't objectively measure how good or bad a culture is either, so why make it part of post-mortems when you don't think it's a failure?
I don't recall ever having a manager use the term root cause analysis in the way you are implying.
Usually we are looking for the cheapest or most effective process change that will prevent that class of problem happening again.
> but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.
As an aside, I met someone whose research project was a graph theory problem, and the application was modeling the entire process of aircraft control as a state machine using that graph. Effectively, they were working on making it mathematically impossible for a crash to occur, assuming that a certain process is followed (with safety measures ofc).
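I have no idea what their actual formulation looked like, but as a toy sketch of the idea - model the process as a directed graph of states and check that no unsafe state is ever reachable from the start (every name below is made up):

    from collections import deque

    # Invented, drastically simplified state machine for an approach/landing process.
    transitions = {
        "cruise":          ["descent_cleared"],
        "descent_cleared": ["approach"],
        "approach":        ["landing_cleared", "go_around"],
        "go_around":       ["approach"],
        "landing_cleared": ["landed"],
        "landed":          [],
        # "runway_conflict" is never a successor of any state, so if the
        # process is followed it is provably unreachable.
    }

    def reachable(start, graph):
        """Breadth-first search for every state reachable from `start`."""
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in graph.get(queue.popleft(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    assert "runway_conflict" not in reachable("cruise", transitions)

The real work is presumably model checking over a vastly larger graph, including all the allowed emergency transitions, rather than a reachability check this trivial.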
The challenge is to avoid pushing all the risk into that assumption. It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour on the part of its dependencies, environment, users and operators.
If you've seen how air traffic controllers and pilots work, I think that "following the rules" is a very fair assumption to make. But even ignoring that, any implementation would obviously include fail-safes.
> It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour
It isn't though. Seriously, think about how you could safely route several thousand flying hunks of metal (all of which have inertia) through fairly small air corridors while maintaining strict flight schedules. Then think about how you need to factor in all of the edge cases caused by emergencies on planes (these are all included in the process for flight controllers). Then think about how you could mathematically prove that safety.
Yes, it's easier if you assume that people will follow a certain process (and actual flight systems have so many layers of fail-safes that it's ridiculous) but it's definitely not "easy enough".