How Complex Systems Fail (1998)

eternalban · on Sept 19, 2022

Engineering a safer world - Systems thinking applied to thinking, by Nancy Leveson (MIT - 2009) was recommended in a previous discussion as a more comprehensive and systemic treatment: https://news.ycombinator.com/item?id=14131981

http://sunnyday.mit.edu/safer-world.pdf

STAMP ("System-Theoretic Accident Model and Processes") is reviewed here: https://www.sciencedirect.com/science/article/abs/pii/S09504...

And there is a course (lecture notes look great): https://ocw.mit.edu/courses/16-63j-system-safety-spring-2016...

eternityforest · on Sept 19, 2022

The real URL is how.complexsystems.fail

Point 2 is "Complex systems are heavily and successfully defended against failure"

Complex systems do fail. But airplanes are still extremely safe. Because people stacked on even more complex systems, often involving worldwide change in response to an accident that happened once.

You constantly hear about how safe it is to fly. And yet hardly anyone seems to learn fron their successes. When you stop accepting failure and are willing to disrupt everything if it saves even one life, you can do a lot.

Complex systems may be unreliable, but with enough work, it seems we can sometimes make the overall picture safer than not having them.

I can't firmware update all of mankind to never leave a baby in a hot car. But they can put sensors on seats and continually do studies to be sure it's working. Complex systems are sometimes more controllable than people or simple systems.

The choice sometimes seems to be "Add complexity, do nothing, or do something that nobody will accept"

dijit · on Sept 19, 2022

I really see your point here; but I have to caution: Airplanes are "exactly as simple" as they need to be.

There is a lot that goes into their design to simplify things greatly; you're probably thinking of complicated computer systems that are used in planes.

But those computer systems are incredibly simple compared to what we use or build atop of: as simple as they have to be in order to be fully understood.

eternityforest · on Sept 20, 2022

They're still far more complicated than a layman might guess after years of hearing that simple is always better.

I'm guessing things like fly by wire would be automatically assumed to be unsafe by a lot of people.

When cars get features that are anything like what goes into planes, people tend to get upset and say "I'm not an idiot, you should make cars expensive and complicated just because some drivers can't stop crashing without a computer".

Ferret7446 · on Sept 20, 2022

> are willing to disrupt everything if it saves even one life

I feel like you're walking away with the wrong lesson. Disrupting everything is a great way to blow up complex systems. You want to change things gradually, ensuring that the human side can keep up.

eternityforest · on Sept 20, 2022

A lot of the time what happens is the human side doesn't need to keep up.

They'll say "This actuator fails if driven past it's limit in hot weather after a rainstorm, and we have data showing that people can overdrive it accidentally in this condition".

Then they'll replace all affected actuators even if it costs millions .

Or they'll add a software patch to keep you from overdriving it.

What they don't do is say "It's probably fine, people just need to be more careful". If someone made a mistake once, someone else can make it again. Systems have to be built for the people who will actually use them, not theoretical elite users.

On occasion the technical fix has it's own dangers that need to be evaluated and you can't find any substitute for operators doing the right thing(See Gare de Lyon for the perfect example of multiple human errors by different people interacting with complex safety systems).

But only some careful analysis will tell you what's more dangerous.

hutzlibu · on Sept 19, 2022

"I can't firmware update all of mankind to never leave a baby in a hot car. But they can put sensors on seats and continually do studies to be sure it's working."

We can also put mandatory sensors in peoples bodies, to make sure they act and live allright.

But I think this would be overcomplicating things.

eternityforest · on Sept 20, 2022

Complexity wouldn't be the problem, the issue would be violation of people's bodies.

Something like a car seat sensor is just a consumer product safety regulation that does not imply any extreme expense, danger, or violation, except to very extreme anti-tech or anti-regulation people. It's a further development of the same trend as headlight or seatbelt related laws.

Plus, it protects people who have no way of protecting themselves, from mistakes that are made by people who have actively been prevented all their life (via confidence culture) from having the tools to prevent making them.

hutzlibu · on Sept 20, 2022

"Plus, it protects people who have no way of protecting themselves, from mistakes that are made by people who have actively been prevented all their life"

Let me put it that way, idiots who let their babies in a hot car actually would also need a million of other mandated sensors and I would not trust them with small lives, or a dangerous fast and heavy bullet like a car in the first place.

But the default should be trust and not micromanaging people assuming they are all idiots. Because this keeps people as idiots.

eternityforest · on Sept 20, 2022

This isn't a matter of "Keeping people as idiots". It's a matter of natural capability.

For some people there's a constant risk that at any moment your brain will glitch out and drop any file at any time. Even your most ingrained habits aren't reliable. No matter how important something is it can disappear from your mind like it was never there.

It's not a matter of skills you learn, it's a matter of someone who has never in their life been able to master a "When X I will do Y" process, and likely has never even been made aware that there's any issue beyond not trying hard enough or not caring.

I hear "You obviously don't care or you'd remember" pretty much all the time.

They care about their kids. They have probably been taught their whole life that leaving them in a hot car is something that only happens if you don't care.

When it happens they are as shocked as the rest of the world.

We don't have the option to choose who has kids and who doesn't. They do, and they're going to, and if you try to take them away to some horrid state orphanage they will probably have even worse lives.

Tech is a factor we can control. Maybe it will make things better, maybe worse, but that's what the studies are for.

0xbadcafebee · on Sept 19, 2022

Numbers 1, 4-8, and 11-18, are all Truisms. The rest are not:

"2. Complex systems are heavily and successfully defended against failure"

Many complex systems are weakly defended, sometimes not at all. Sometimes the defense is accidental or incidental. Sometimes they are heavily yet unsuccessfully defended. Never attribute to defense that which can be attributed to purely random chance, ignorance, convenience, and avoidance of responsibility.

"3. Catastrophe requires multiple failures – single point failures are not enough."

Catastrophe definitely can and does happen from single points of failure. It's just that in highly defended systems, multiple failures are common.

"9. Human operators have dual roles: as producers & as defenders against failure."

These can be distinct roles, but in practice that requires extra money, staffing, etc which makes it rare. However, there are systems in which defense becomes its own role, often because the producers suck at it or don't want to do it, or are just really busy.

"10. All practitioner actions are gambles."

On the fence about this one. I would say all practitioner changes are gambles. A practitioner looking at a pressure gauge dial is an action, but it isn't a gamble. Unless the gauge needle sticks, and reading it was a critical action... I suppose you could say all actions are gambles, and changes are much more risky gambles, and non-change actions are likely to be seen as non-risky.

_plg_ · on Sept 19, 2022

The Career, Accomplishments, and Impact of Richard I. Cook: A Life in Many Acts (September 12, 2022) https://www.adaptivecapacitylabs.com/blog/2022/09/12/richard...

greyface- · on Sept 19, 2022

2020, 84 comments https://news.ycombinator.com/item?id=25550685

2019, 33 comments https://news.ycombinator.com/item?id=20380055

2017, 21 comments https://news.ycombinator.com/item?id=15002683

2014, 16 comments https://news.ycombinator.com/item?id=8282923

dang · on Sept 19, 2022

Thanks! Macroexpanded:

How Complex Systems Fail (1998) - https://news.ycombinator.com/item?id=25550685 - Dec 2020 (83 comments)

How Complex Systems Fail (1998) [pdf] - https://news.ycombinator.com/item?id=20380055 - July 2019 (33 comments)

How complex systems fail (2002) [pdf] - https://news.ycombinator.com/item?id=15002683 - Aug 2017 (19 comments)

How Complex Systems Fail (1998) [pdf] - https://news.ycombinator.com/item?id=14127543 - April 2017 (7 comments)

How Complex Systems Fail (1998) - https://news.ycombinator.com/item?id=8282923 - Sept 2014 (16 comments)

How Complex Systems Fail [pdf] - https://news.ycombinator.com/item?id=926735 - Nov 2009 (1 comment)

mellavora · on Sept 19, 2022

So it looks like 2020 is still in the lead...

xbar · on Sept 19, 2022

There were some spectacular systems failures in 2020 that earned reminders.

I am of the mind that this document is undertaught.

x32n23nr · on Sept 19, 2022

I highly recommend the book Normal Accidents by Charles Perrow. Perrow argues that multiple and unexpected failures are built into society's complex and tightly coupled systems

bob1029 · on Sept 19, 2022

The root cause thing I would push back on.

Management of complex systems is never a done deal, so there is always the possibility you missed some tiny gap in your process that can still take you out entirely.

A good example of this being the Texas grid in 2021.

redtexture · on Sept 19, 2022

Not a tiny gap at all.

For which there had been numerous warnings for years by industry observers.

That system does not draw upon outside of network utilities, to avoid Federal regulation, hence has limited reserve resources. And the Texas system did not pay providers to have standby reserves. Thus a fragile system, easy to be subject to failure.

Edit:

Texas Was Warned a Decade Ago Its Grid Was Unready for Cold (Bloomberg)

https://www.bloomberg.com/news/articles/2021-02-17/texas-was...

fatneckbeardz · on Sept 20, 2022

the thing is that we all see this system as a failure,

but the people who make the decisions, and who run that system, see it as a success.

their primary goal is not to provide reliable power, their primary goal is profit and ideology. and they have successfully done both those things.

a bunch of people died, nothing will change, and they will face no consequences. texas will remain off the grid and the next winter storm the same thing will happen.

in their book, they are a success.

thats the whole thing about complex systems. at some point, human beings disagree on what the priorities are, so they disagree on what failure is, they disagree on what maintenance is, and they disagree on what "proper function" is.

complex systems are always connected to complex vested interests and flows of power and money.

so people say things like "the boeing 787 failed". . . did it though? It killed hundres of people , but the executives in charge of it made huge profits and faced zero consequences. Boeing stock price is fine, and it will not face any meaningful punishment or consequences from the government. nor will it face any meaningful consequences from the legal system, which is irrevocably twisted in favor of big corporations like them.

From these peoples perspective, the 787 killing hundreds of people is not a failure, its just something that happened that they can hire PR people to deal with. it wont interrupt cash flow (or, at least it wont interrupt their personal bonuses and personal wealth that much) so its basically irrelevant to them.

redtexture · on Sept 22, 2022

Not a gap.

A planned system, as you indicate.

Prior poster indicated the item was a mere gap.

The system operator does not treat the problem as a gap, and also has had no authority to enforce appropriate remedation at the generator-plant level, and the gas-supply level to reduce the cold-caused shutdowns encountered.

---

The Texas Electric Grid Failure Was a Warm-up: One year after the deadly blackout, officials have done little to prevent the next one—which could be far worse.

By Russell Gold, Texas Monthly (February 2022)

https://www.texasmonthly.com/news-politics/texas-electric-gr...

aliasxneo · on Sept 19, 2022

I believe the point is to not look for a single root cause.

kwhitefoot · on Sept 19, 2022

Yes, but that tiny gap is no more the root cause than any other one of many other decisions that preceded the catastrophe.

teddyh · on Sept 19, 2022

s/^/How /

Zealotux · on Sept 19, 2022

Simple systems also fails!

tpoacher · on Sept 23, 2022

How complex systems fail:

Like most things. Slowly at first, then all at once.

tempie_deleteme · on Sept 19, 2022

why read about it when we can experience it by being alive today?

to better understand the times, of course...

dqpb · on Sept 19, 2022

This does not seem very rigorous. Can someone point to a better coverage of this topic?

thraxil · on Sept 19, 2022

It's not intended to be rigorous. The context here is that Richard I. Cook, one of the main figures in safety and resilience engineering, who's published many, many papers on these topics died recently. The "How Complex Systems Fail" paper is intended to be a bit pithy and light; more an attempt at summarizing years of wisdom. See: https://www.adaptivecapacitylabs.com/blog/2022/09/12/richard...

dqpb · on Sept 19, 2022

Well, this sounds wrong to me:

> Catastrophe requires multiple failures – single point failures are not enough

My experience is that a single failure causes a cascade of subsequent failures. This topic is very interesting, but this post is more of a teaser of topics than a real explanation.

advisedwang · on Sept 19, 2022

Places where a single failure in an otherwise perfectly functioning system can cause catastrophic outcomes are relatively easy to identify, relatively easy to argue need to be fixed and relatively easy to fix. As a result mature, complex systems have generally developed safety mechanisms for such issues. Once you have done that you need at least two failures (underlying issue + safety, hot+cold, or two interacting systems).

I would suspect that your experience of single modes of failure being present are one of the following

* Immature system (e.g. a startup) * One where failure is acceptable and so engineering isn't invested in solving these issues (i.e. the author is talking about disasters that kill people, not causing a few mins of ads not getthing shown) * Extreme organizational dysfunction (talking criminal negligence type stuff)

dqpb · on Sept 19, 2022

> Once you have done that you need at least two failures (underlying issue + safety, hot+cold, or two interacting systems).

Ah, I missed the part where he said - except for distributed systems. The thing is, effectively all systems are distributed systems with two or more interacting subsystems.

And no, I'm not talking about immature systems or ones where failure is acceptable. Queuing issues, for example, are well known to cause to cascading effects, and are not trivial to identify or solve.

Even basic correctness issues can be very difficult to identify if you have a large permutation space and no model checking, and will also cascade.

novixyz · on Sept 19, 2022

I agree.

I always thought that late Paul Ciliers' did a great summary on complexity (sorry no online link):

"Complexity in a Nutshell:

I will not provide a detailed description of complexity here, but only summarise the general characteristics of complex systems as I see them.

-Complex systems consist of a large number of elements that in themselves can be simple.

- The elements interact dynamically by exchanging energy or information. These interactions are rich. Even if specific elements only interact with a few others, the effects of these interactions are propagated throughout the system. The interactions are nonlinear.

- There are many direct and indirect feedback loops.

- Complex systems are open systems – they exchange energy or information with their environment – and operate at conditions far from equilibrium. Complex systems have memory, not located at a specific place, but distributed throughout the system. Any complex system thus has a history, and the history is of cardinal importance to the behaviour of the system.

- The behaviour of the system is determined by the nature of the interactions, not by what is contained within the components. Since the interactions are rich, dynamic, fed back, and, above all, nonlinear, the behaviour of the system as a whole cannot be predicted from an inspection of its components. The notion of emergence is used to describe this aspect. The presence of emergent properties does not provide an argument against causality, only against deterministic forms of prediction.

- Complex systems are adaptive. They can (re)organise their internal structure without the intervention of an external agent.

Certain systems may display some of these characteristics more prominently than others. These characteristics are not offered as a definition of complexity, but rather as a general, low-level, qualitative description. If we accept this description (which from the literature on complexity theory appears to be reasonable), we can investigate the implications it would have for social or organisational systems."

Ciliers, P. (2016). Critical Complexity Collected Essays, Walter de Gruyter GmbH. 67

Also if you look up any Dave Snowden's video on YT you'll find plenty of useful info.