The Therac-25 Incident

omginternets · on Feb 15, 2021

The featured comment is great, for those who missed it:

I am a physician who did a computer science degree before medical school. I frequently use the Therac-25 incident as an example of why we need dual experts who are trained in both fields. I must add two small points to this fantastic summary.

1. The shadow of the Therac-25 is much longer than those who remember it. In my opinion, this incident set medical informatics back 20 years. Throughout the 80s and 90s there was just a feeling in medicine that computers were dangerous, even if the individual physicians didn't know why. This is why, when I was a resident in 2002-2006 we still were writing all of our orders and notes on paper. It wasn't until the US federal government slammed down the hammer in the mid 2000's and said no payment unless you adopt electronic health records, that computers made real inroads into clinical medicine.

2. The medical profession, and the government agencies that regulate it, are accustomed to risk and have systems to manage it. The problem is that classical medicine is tuned to "continuous risks." If the Risk of 100 mg of aspirin is "1 risk unit" and the risk of 200 mg of aspirin is "2 risk units" then the risk of 150 mg of aspirin is strongly likely to be between 1 and 2, and it definitely won't be 1,000,000. The mechanisms we use to regulate medicine, with dosing trials, and pharmacokinetic studies, and so forth are based on this assumption that both benefit and harm are continuous functions of prescribed dose, and the physician's job is to find the sweet spot between them.

When you let a computer handle a treatment you are exposed to a completely different kind of risk. Computers are inherently binary machines that we sometimes make simulate continuous functions. Because computers are binary, there is a potential for corner cases that expose erratic, and as this case shows, potentially fatal behavior. This is not new to computer science, but it is very foreign to medicine. Because of this, medicine has a built in blind spot in evaluating computer technology.

dwohnitmok · on Feb 15, 2021

I suspect that a large proportion of ways that abstract planning fail are due to discontinuous jumps, foreseen or unforeseen. That may be manifested in computer programs, government policy, etc.

Continuity of risk, change, incentives, etc. lend themselves to far easier analysis and confidence in outcomes. And higher degrees of continuity as well as lower values of change only make that analysis easier. Of course it's a trade-off: a flat line is the easiest thing to analyze, but also the least useful thing.

In many ways I view the core enterprise of planning as an exercise in trying to smooth out discontinuous jumps (and their analogues in higher degree derivatives) to the best of one's ability, especially if they exist naturally (e.g. your system's objective response may be continuous, but its interpretation by humans is discontinuous, how are you going to compensate to try to regain as much continuity as possible?).

beerandt · on Feb 15, 2021

It's so short-sighted that he doesn't see that medical records being forced so quickly to digital/computers is almost exactly the same problem being played out, just not as directly or dramatically, but with a much wider net, and way more short- and long-term problems (including the software/systems trust mentioned).

pessimizer · on Feb 15, 2021

'Shocking' hack of psychotherapy records in Finland affects thousands

https://www.theguardian.com/world/2020/oct/26/tens-of-thousa...

jbay808 · on Feb 15, 2021

Similar thing happened in Canada:

https://globalnews.ca/news/6311853/lifelabs-data-hack-what-t...

bluGill · on Feb 15, 2021

What has been lost because records were not? Doctors shouldn't have to rely on my memory of what I had years ago.

beerandt · on Feb 16, 2021

There are soooo many problems with electronic records that it's hard to even summarize. But the biggest few, in my opinion:

1) The software influences the medical workflow and becomes a major distraction to visits. What was completely analog and free-form is now binned, discrete, and made more complex.

2) Desired outcome affects what and how things are recorded, instead of vice-versa. Staff learn what they have to do to get the orders or prescription or billing output that they need, which often doesn't line up with the actual diagnosis.

3) It reduces productivity, causes physician burnout, and put most small practices out of business. (Any practice not large enough to self-host records hire and have full time IT/software staff were basically strong armed into selling out to hospitals or other large org.)

4) It tracks both efficiency and patient satisfaction in the same system as the records and billing, which leads to some pretty perverse incentives on multiple levels. (As much heat as the drug companies over the opiod crisis, I'd argue doctors worried about dinging their patient satisfaction scores were just as responsible, by being afraid to tell too many patients "no".)

It goes on... Just a huge cluster of poorly thought out unintended consequences.

And that's just the medical side of things. The technical, financial, and legal aspects all have similar issues.

bluGill · on Feb 16, 2021

You missed my point. Go back and try again.

To address your points

Half of this is a good thing. I want my doctor to follow the proper process and checklist everytime, no matter what. There are many one in a million cases that have the same symptoms as the common thing they see daily. The process is how you catch them and treatment. The doctors office is no place for creativity until everything else has been ruled out.

That doesn't mean there isn't room to improve the user interface. However doctors are in the wrong to be so technology backwards

beerandt · on Feb 17, 2021

Then you're allowing your software developer to dictate medical care, not your doctor.

And frankly, this type of response (the software knows better than the doctor/human) is (again) a root cause of problems like the Therac-25 incedent.

The system needs to conform to the doctors, not vice-versa.

bluGill · on Feb 17, 2021

A bit of both. The medical system has long known that doctors are too much "cowboy" and not following useful process. Many doctors resisted hand washing in the late 1800s.

User interaction studies have made great progress. A lot of software ignores that. Likewise we know a lot of ways to write high quality software that are ignored in your typical web app.

However we also know that doctors are human and they fail often. while computer systems do fail, those failures are much easier to fix once and for all.

craftinator · on Feb 16, 2021

> What was completely analog and free-form is now binned, discrete, and made more complex.

This sentence makes zero sense. Limiting choices inherently makes things LESS complex. That's why we use frameworks for decision making and risk assessment, rather than just doing everything "analog and free form"; it's the same reason we do "structured training" for complex and difficult tasks, rather than just let people try to figure them out. It's just a completely wrong and backwards statement.

ceejayoz · on Feb 16, 2021

> Limiting choices inherently makes things LESS complex.

"Do you want me to kick you in the face, or the groin?" is a substantially more complex choice than "Do you want me to kick you in the face, or the groin, or not at all?"

Sometimes, limited choices require you to shoehorn something into one of those available choices when it's not actually appropriate. Such is the case with medical record systems at times.

bluGill · on Feb 17, 2021

That is a bug in the system them. Don't confuse bugs in the system with the idea that the system cannot/should not be fixed!

giantDinosaur · on Feb 16, 2021

If I limited your choice to answering questions in binary yes/no, do you really think that makes things less complex than a free form & lucid description of an issue/procedure? Perhaps if you are communicating exclusively with a machine..!

mulmen · on Feb 15, 2021

Trust in information technology in medical settings for one.

lmilcin · on Feb 16, 2021

I think in this case the problem wasn't because of some inherent characteristics of the software.

If this was completely analog, mechanical tool, the problem could still have existed. For example, you could have made a mistake using various knobs and switches.

And in this case the solution was to physically design the device so that it is not possible to put it in an incorrect state. Thus it did not matter much if there was a software error as whatever software did would not make it possible to put the beam and the metal shield in an incorrect state.

Using physical safeties is a common strategy. For example, my Instant Pot has multiple physical safeties built in.

For example:

* even if the controlling software or sensor fails, there is overpressure valve that will not let the pressure rise over certain value.

* there is a single piece of metal that both blocks gasses escaping from the pot AND blocks ability to open the pot. If the piece of metal is not in place, there is a hole in the pot and pressure cannot rise. If the piece of metal is in place so the hole is closed, it blocks possibility of opening the pot and creating an explosion.

* there is a specially designed guide that makes it impossible to close the device only partly

* the device is designed so that the weakest element keeping pressure is the seal. If the pressure rises, instead of the pot blowing up the seal will deform, be blown off and let the steam escape in more or less controlled manner.

* there is a bimetal safety device that will turn off the heater if temperature rises too much,

and so on.

You see, none of these features relies on software. There are software safeties but they are redundant in that device does not rely solely on them.

If the company producing Instant Pot can show this kind of safety-conscious design I am sure it can also be applied to dangerous medical devices.

dang · on Feb 15, 2021

(For the curious) the Therac-25 stack on HN:

2019 https://news.ycombinator.com/item?id=21679287

2018 https://news.ycombinator.com/item?id=17740292

2016 https://news.ycombinator.com/item?id=12201147

2015 https://news.ycombinator.com/item?id=9643054

2014 https://news.ycombinator.com/item?id=7257005

2010 https://news.ycombinator.com/item?id=1143776

Others?

kondro · on Feb 15, 2021

It comes up a lot, but it’s an incredibly important story that bares repeating often. Especially with similar issues like the 737-MAX occurring pretty recently.

dang · on Feb 16, 2021

Oh for sure, and the purpose of those links is not to imply that it's too often. Reposts are fine on HN if a story hasn't had a big discussion after a year or so (https://news.ycombinator.com/newsfaq.html). I just list them because people often like to look at the past threads. That's what "For the curious" means :)

yummypaint · on Feb 16, 2021

Yeah aparently we have to constantly relearn that software hotfixes are not an acceptible substitute for hardware interlocks and good overal design practices when lives are at stake.

noefingway · on Feb 15, 2021

As a long time subscriber to the Risks mailing list, I would have to note the extensive discussion of this incident:

http://catless.ncl.ac.uk/Risks/search?query=therac

FiatLuxDave · on Feb 15, 2021

I find it interesting how often the Therac 25 is mentioned on HN (thanks to Dan for the list), but nobody ever mentions that those kind of problems never entirely went away. Therac 25 is just the famous one. You don't have to go back to 1986, there are definitely examples from this century. The root causes are somewhat different, and somewhat the same. But no one seems to be teaching these more modern cases to aspiring programmers in school, at least not to the level where every programmer I know has heard of them.

For example, the issue which caused this in 2007:

https://www.heraldtribune.com/article/LK/20100124/News/60520...

Or the process issues which caused this in 2001:

https://www.fda.gov/radiation-emitting-products/alerts-and-n...

sebmellen · on Feb 15, 2021

I quote this from the article so people may take an interest in reading it. This is the opening paragraph:

> As Scott Jerome-Parks lay dying, he clung to this wish: that his fatal radiation overdose -- which left him deaf, struggling to see, unable to swallow, burned, with his teeth falling out, with ulcers in his mouth and throat, nauseated, in severe pain and finally unable to breathe -- be studied and talked about publicly so that others might not have to live his nightmare.

Shivetya · on Feb 16, 2021

To this day I run into development managers who will dismiss an easily replicated issue on the grounds it won't happen enough, though of course with a big group of users it actually does, to warrant fixing. While this can be true at times and one excuse is always, we have support to fix the data, it tends to a standard by which all fixes are handled.

which is, once you start blowing off testing, having exceptions to avoid every fix, you rarely if ever leave that mode and end up with a stressed support staff and users in the field who think your coders are incompetent.

Hackbraten · on Feb 16, 2021

Snapshot for EU residents, who are blocked from reading the article: https://web.archive.org/web/20210215234240/https://www.heral...

ahepp · on Feb 16, 2021

The original NYT article for the 2007 incident has more details about the problematic software: https://www.nytimes.com/2010/01/24/health/24radiation.html

Leherenn · on Feb 16, 2021

There's a duality to this that often makes me wonder. It seems like a huge issue, but at the same time, the examples everyone hear about are 20+ years old. I would have thought that, with the explosion of digital systems, they should happen regularly, and yet we do not hear of them. I mean, your "unknown" examples are from 20 and 14 year ago.

Are we actually not as bad at mitigating risks when needed as the rest of the industry would led us to believe, or are those incidents just not covered?

time0ut · on Feb 15, 2021

Many years ago, I had an opportunity to work on a similar type of system (though more recent than this). In the final round of interviews, one of the executives asked if I would be comfortable working on a device that could deliver a potentially dangerous dose of radiation to a patient. In that moment, my mind flashed to this story. I try to be a careful engineer and I am sure there are many more safeguards in place now, but, in that moment, I realized I would not be able to live with myself if I harmed someone that way. I answered truthfully and he thanked me and we ended things there.

I do not mean this as a judgement on those who do work on systems that can physically harm people. Obviously, we need good engineers to design potentially dangerous systems. It is just how I realized I really don't have the character to do it.

walrus01 · on Feb 15, 2021

> I do not mean this as a judgement on those who do work on systems that can physically harm people

In industrial non software settings this is not as rare as you'd think, if you research the tower crane, truck crane and rigging industry. Lots of things can kill people. The important part is that we need the appropriate 'belt and suspenders' safety checks and engineering practices to prevent them from doing so.

randomly chosen industrial example:

https://en.wikipedia.org/wiki/PEPCON_disaster

sho_hn · on Feb 15, 2021

> In the final round of interviews, one of the executives asked if I would be comfortable working on a device that could deliver a potentially dangerous dose of radiation to a patient

Automotive software engineer here: I've asked the same in interviews.

"We work on multi-ton machines that can kill people" is a frequently uttered statement at work.

Hydraulix989 · on Feb 15, 2021

I would have considered the fact that for a vast majority of people suffering from cancer, this device helps them instead of harms them. However, I can also imagine leadership at some places trying to move fast and pressure ICs into delivering something that isn't completely bulletproof in the name of bottom line. That is something I would have tried to discern from the executives. Similar tradeoffs have been made before with cars against expected legal costs.

There are plenty of other high stakes software that involve human lives (Uber self driving cars, SpaceX, Patriot missiles) and many of them completely scare me and morally frustrate me as well to the point where I would not like to work on one, but I totally understand if you have a personal profile that is different than mine.

mcguire · on Feb 15, 2021

Good for you to realize that, then.

On the other hand, I'm not entirely sure it's appropriate to be comfortable with that kind of position in any case.

Doxin · on Feb 16, 2021

I feel like if you're comfortable working on such software you'd probably be the least qualified person to do so. Seems to me that you can't be paranoid enough when developing these kinds of systems.

jtchang · on Feb 15, 2021

I can say when I was doing my CS degree this was definitely covered. In fact it is one of the lectures that stood out in my mind at that time. My professor at the time (Bill Leahy) definitely drilled into us the importance of understanding the systems in which we were eventually going to work on.

Not sure if this is still covered today.

Mountain_Skies · on Feb 15, 2021

When I was the graduate teaching assistant for a software engineering lab, the students got a week off to do research on software failures that harmed humans. For many of the students it was the first time they gave any thought to the concept of software causing actual physical harm. I'm glad we were able to expose them to this reality but also was a bit disheartened as they should have thought about it far before a fourth year course in their major.

Jtsummers · on Feb 15, 2021

Per younger, CS, colleagues who went through school in the last 6 years, it was still being taught at their smaller US colleges.

icelancer · on Feb 15, 2021

Was definitely covered in my small school program in Embedded Systems in the early 2000s.

lxgr · on Feb 15, 2021

It was covered in at least one of my classes as well. (Graduated only a few years ago.)

_wjtv · on Feb 15, 2021

The book I use to instruct software engineering uses it, and I do use that chapter.

ufmace · on Feb 15, 2021

> The Therac-25 was the first entirely software-controlled radiotherapy device. As that quote from Jacky above points out: most such systems use hardware interlocks to prevent the beam from firing when the targets are not properly configured. The Therac-25 did not.

This makes me think - there was only one developer there, I guess, who was doing everything in assembly. This software, and the process to produce it, must have been designed in the early days of their devices, when there would be expected to be hardware interlocks to prevent any of the really bad failure modes. I bet they never did change much of the software, or their procedures for developing, testing, qualifying, and releasing it in light of the change from relying on hardware interlocks to the quality of the software being the only thing preventing something terrible from happening.

Gare · on Feb 15, 2021

A quote from the report:

> Related problems were found in the Therac-20 software. These were not recognized until after the Therac-25 accidents because the Therac-20 included hardware safety interlocks and thus no injuries resulted.

The safety fuses were occasionally blowing during the operation of Therac-20, but nobody asked why.

baobabKoodaa · on Feb 15, 2021

> The safety fuses were occasionally blowing during the operation of Therac-20, but nobody asked why.

Have you tried turning it off and on again?

bluGill · on Feb 15, 2021

The software was working just fine for years before on earlier versions with the interlocks. They never checked to see how often or why the interlocks fired before removing them. Turns out those interlocks fired often because of the same bugs.

brians · on Feb 15, 2021

They had two fuses, so they had a 2:1 safety margin! Just like the NASA managers who decided that 30% erosion in an O-ring designed for no erosion meant a 3:1 safety margin.

b3lvedere · on Feb 15, 2021

Interesting story. I've been on the testing site of medical hardware many many moons ago. It's quite amazing what you can and must test. For instance: We had to prove that if our equipment broke by falling damage that the potential debris flying off could not harm the patient.

I always liked the testing philosophy institutes like for instance Underwriter Laboratories had: Your product will fail. This is stated as fact and is not debatable. What kind of fail safes and protection have you made so that when it fails (and it will) it cannot not harm the patient?

buescher · on Feb 15, 2021

Yes. It's amazing the number of engineers that resist doing that analysis - "oh that part won't ever break". Some safety standards do allow for "reliable components" (i.e. if the component is already been scrutinized for safe failure modes, you don't have to consider them) and for submitting reliability analysis or data. I've never seen reliability analysis or data submitted instead of single-point failure analysis though, myself.

Single-point failure analysis techniques like fault trees, event trees, and especially the tabular "failure modes and effects analysis" (FMEA) are so powerful, especially for safety-critical hardware, that when people learn them they want to apply them to everything, including software.

However, FMEA techniques actually have been found to not apply well to software below about the block diagram level. They don't find bugs that would not be found by other methods (static analysis, code review, requirements analysis etc) and they're extremely time and labor intensive. Here's an NRC report that goes into some detail: https://www.nrc.gov/reading-rm/doc-collections/nuregs/agreem...

b3lvedere · on Feb 15, 2021

"Through analysis and examples of several real-life catastrophes, this report shows that FMEA could not have helped in the discovery of the underlying faults. The report concludes that the contribution of FMEA to regulatory assurance of Complex Logic, especially software, in a nuclear power plant safety system is marginal."

Even more interesting! Thank you for this link. I appreciate it. Never too old to learn. :)

anotheraccount9 · on Feb 16, 2021

I used to work in an ICU. The software that controlled the patient's cardiac monitor would not allow us to modify the alarm settings (sensitivity, volume, etc.)

I'm not sure if the settings were locked as a default by the manufacturer, or a protected setting that no one knew how to change, or simply locked by management.

Anyhow, from mild atrial fibrillation to irregular rhythm, the alarm was constantly beeping, which was to be expected, considering the acuity of our patients. Nurses and MD became really annoyed by this.

Eventually, one of the staff covered the speaker from inside, to muffle the almost constant warnings. The staff repeatedly asked management to find a solution to this, and the associated risk.

One day, a patient had a life threatening cardiac incident. The alarm was muffled enough that no one nearby could hear a thing.

lordnacho · on Feb 15, 2021

How much money was AECL making selling these things? You'd think a second pair of eyes on the code would not cost too much. Do I blame the one person? Not really, who in this world hasn't written a race condition at some point? RCs are also one of those things where someone else might spot it a lot sooner than the original writer.

I agree with the sentiment that they took the software for granted. I get the feeling that happens in a lot of settings, most of them less life-threatening than this one. I've come across it myself too, in finance. Somehow someone decides they have invented a brilliant money-making strategy, if they could only get the coders to implement it properly. Of course the coders come back to ask questions, and then depending on the environment it plays out to a resolution. I get the feeling the same thing happened here. Some scientist said "hey all it needs is to send this beam into the patient" and assumed their description was the only level of abstraction that needed to be understood.

jancsika · on Feb 15, 2021

I found this comment from the article fascinating:

> I am a physician who did a computer science degree before medical school. I frequently use the Therac-25 incident as an example of why we need dual experts who are trained in both fields. I must add two small points to this fantastic summary.

> 1. The shadow of the Therac-25 is much longer than those who remember it. In my opinion, this incident set medical informatics back 20 years. Throughout the 80s and 90s there was just a feeling in medicine that computers were dangerous, even if the individual physicians didn't know why. This is why, when I was a resident in 2002-2006 we still were writing all of our orders and notes on paper. It wasn't until the US federal government slammed down the hammer in the mid 2000's and said no payment unless you adopt electronic health records, that computers made real inroads into clinical medicine.

> 2. The medical profession, and the government agencies that regulate it, are accustomed to risk and have systems to manage it. The problem is that classical medicine is tuned to "continuous risks." If the Risk of 100 mg of aspirin is "1 risk unit" and the risk of 200 mg of aspirin is "2 risk units" then the risk of 150 mg of aspirin is strongly likely to be between 1 and 2, and it definitely won't be 1,000,000. The mechanisms we use to regulate medicine, with dosing trials, and pharmacokinetic studies, and so forth are based on this assumption that both benefit and harm are continuous functions of prescribed dose, and the physician's job is to find the sweet spot between them.

> When you let a computer handle a treatment you are exposed to a completely different kind of risk. Computers are inherently binary machines that we sometimes make simulate continuous functions. Because computers are binary, there is a potential for corner cases that expose erratic, and as this case shows, potentially fatal behavior. This is not new to computer science, but it is very foreign to medicine. Because of this, medicine has a built in blind spot in evaluating computer technology.

mcguire · on Feb 15, 2021

I'm not sure I buy that. Or, well, I suppose that those in the medical field believe it, but I don't think they're right.

Consider something like a surgeon nicking and artery while performing some routine surgery, the patient not responding normally to anesthesia or the anesthetist not getting the mixture right and the patient not coming back the way they went in. Or that subset of patients that have poor responses to a vaccine.

Everybody likes to think of the world as a linear system, but it's not.

giantDinosaur · on Feb 16, 2021

Well, an equivalent example is really something more like: the surgeon picking up their scalpel too quickly actually means that what they picked up wasn't a scalpel 10% of the time, even though what they saw with their eyes was a scalpel. That's a rather more terrifying error. It's not really possible in the normal world, but is easily possible in the software one (UI not corresponding to the real program state.)

In other words: tangible objects usually correspond to what we see; in software, you have no way necessarily of knowing if the UI/interface is outright lying to you. It could be doing anything internally, and a single flipped bit deep in some subroutine could cause death.

at_a_remove · on Feb 15, 2021

I rather randomly met a woman with a similar sort of background and trajectory as I have: trained in physics, got sucked into computers via the brain drain. She programmed the models for radiation dosing in the metaphorical descendants of Therac-25. I asked her just how often it was brought up in her work and she mentioned that she trained under someone who was in the original group of people brought in to analyze and understand just what happened with Therac-25. Fascinating stuff.

ed25519FUUU · on Feb 15, 2021

The very worst part of this story is that the manufacture vigorously defended their machine, threatening individuals and hospitals with lawsuits if they spoke out publicly. I have zero doubt this led to more deaths.

AdamJacobMuller · on Feb 16, 2021

Plainly Difficult has an episode on the Therac-25: https://www.youtube.com/watch?v=-7gVqBY52MY

Highly recommend all of his videos, actually.

gww · on Feb 16, 2021

Great channel. I found it a few months ago and have watched all of his videos. I like that he provides enough scientific detail without making it too hard to follow.

buescher · on Feb 15, 2021

Here's a bit more background, from Nancy Leveson (now at MIT): http://sunnyday.mit.edu/papers/therac.pdf

meristem · on Feb 15, 2021

Leveson's Engineering a Safer World[1] is excellent, for those interested in safety engineering.

[1] https://mitpress.mit.edu/books/engineering-safer-world

buescher · on Feb 15, 2021

The systems safety case studies in that are great. It's also available as "open access" (free-as-in-free-beer) at MIT press:

https://direct.mit.edu/books/book/2908/Engineering-a-Safer-W...

dade_ · on Feb 15, 2021

The article says the developer was never named, as if that had anything to do with the actual problems. Everything about this project sounds insane & inept.

Some questions in my mind while reading this article (but I couldn't find them quickly in a search): Who were the executives that were running the company? Sounds like something that should be taught to MBA as well as CS students. Further, the AECL was a crown corporation of the Canadian government. Who was the minister and bureaucrats in charge of the department? What role did they have in solving or covering up the issue?

strken · on Feb 15, 2021

The article has a very long explanation of why one developer is not to blame, and why it's entirely the fault of the company for having no testing procedure and no review process.

joncrane · on Feb 15, 2021

I feel like this makes it to HN once every few years or so.

I know it well from it being the first and main case study in my software testing class as an undergraduate CS major in Washington DC in 1999.

It will never not be interesting.

siltpotato · on Feb 15, 2021

Apparently this is the seventh one. I've never worked on a safety critical system but this is the story that makes me wonder what it's like to do so.

Jtsummers · on Feb 15, 2021

It’s stressful, but often worthwhile. It requires diligence, deliberate action, and patience.

matthias509 · on Feb 15, 2021

I used to work on public safety radio systems. Things which seem like minor issues like clipping the beginning of a transmission every now and then are showstopper defects in that space.

It’s because it can be the difference between “Shoot” and “Don’t shoot.”

sn41 · on Feb 16, 2021

The Therac-25 incident, and Bhopal Gas Tragedy were definitely routine lessons in courses on industrial safety and reliability. (Another reason not to turn your nose up against academic coursework - they do impart useful knowledge.)

I feel that lot of manual errors and UI errors can be mitigated by careful design. Punishing natural careless errors is not very effective, as recent research shows [1]:

[1] https://www.jstage.jst.go.jp/article/jjsca/32/7/32_954/_arti...

ZuLuuuuuu · on Feb 15, 2021

This is one of the infamous incidents where software failure caused harm on humans. I am kind of fascinated about such incidents (since I learn so much by reading about them), are there any other examples of such incidents that you guys know of? It doesn't have to result in harming human, but any software failure related incident which resulted in big consequences.

One another example that comes to my mind is the Toyota "unintended acceleration" incident. Or the "Mars Climate Orbiter" incident.

probably_wrong · on Feb 15, 2021

If big consequences is what you're after, I can think of three typical incidents: the "most expensive hyphen in history" (you can search it like that), it's companion piece, the Mars Climate Orbiter(which I see you added now), and the Denver Airport Baggage System fiasco [1] where bad software planning caused over $500M in delays.

[1] http://calleam.com/WTPF/?page_id=2086

crocal · on Feb 15, 2021

Google Ariane 501. Enjoy.

nickdothutton · on Feb 15, 2021

I mentioned this incident in passing on my post of a few years ago. We studied it many years ago in the early 90s at university. https://blog.eutopian.io/the-age-of-invisible-disasters/

sleepy_keita · on Feb 16, 2021

I recently watched this YouTube video about the Therac-25, it was similar to this article (albeit a little less in-depth). https://www.youtube.com/watch?v=-7gVqBY52MY

AdamJacobMuller · on Feb 16, 2021

Hah!

We both posted the EXACT same video at the exact same time!

Plainly Difficult is an awesome channel.

sleepy_keita · on Feb 16, 2021

Haha! It definitely is. I've been working through their backlog off and on for the past few weeks.

autoditype · on Feb 15, 2021

One of my university professors taught us this incidet, along with a number of other failures, as an ethics lesson. It shook me and I think of it whenever I see automated medical machines, self-driving cars, etc

ericsoderstrom · on Feb 16, 2021

This was (is?) used as a case study in MIT's system's engineering course (6.033)

KMag · on Feb 16, 2021

I was in Course 2 (Mechanical Engineering), but 6.004 and 6.033 were two of my favorite classes at MIT. As I remember, we spent 2 lectures covering many mistakes made in the Therac-25. We also covered the Ariane 5's first launch failure. We also covered a bunch of systems that made good design trade-offs and discussed their design spaces.

lmilcin · on Feb 16, 2021

As a professional developer interested in extreme reliability, I have studied this and similar stories. I have studied high profile software flops, project failures as well as NTSB reports for aircraft and train accidents, reports of buildings failing etc. and find all those reports very illuminating in the sense of a kind of mindset that is necessary to ensure reliability.

I am not working on safety-critical systems but still, the failure of the type of backend systems I work on can bring huge damage to the company.

You can come out with a set of rules that can really improve the reliability of your application.

* Ensure the system always moves between known, valid states. Errors can still be valid states if you plan for them correctly.

I like to compare this to logical induction: if you start with correct state and on every operation ensure that if it executes on correct state it must result in correct state, then you guarantee that the system is always in correct state.

* Ensure incorrect operations are impossible to execute.

As an example, if you want to guard against large data loss, you can choose to not expose ability to remove data. In my current project we took away everybody's write access to the database, we removed all functions that can remove the data from the database and exchanged them with functions that either mask data (setting flags so that application ignores it) or move to archive. If you need to run an operation on database, you need to submit a pull request and go through Code Review. There is additional review before merge where the team goes through checklist to verify the change meets certain standards.

* Ensure anything the client can throw at the application does not have a chance to escalate to other transactions.

All inputs are limited and validated mercilessly, all queries are constructed so that it is possible to predict resources necessary to execute the query. Every type of query has a limit on how many queries of that type can be executed in parallel or in a unit of time and how much results it can return.

All functionality is constructed in such a way that the performance per transaction does not degrade when load increases (for example by amortizing costs and batching work). Then every functionality is load tested and based on testing results it is certified with maximum load that it can process. Everything over that limit is preemptively declined.

* If an algorithm or a solution cannot be guaranteed to work correctly it cannot be used.

For example, if library cannot be guaranteed to use at most X amount of memory for the operation, then it is not suitable.

* Every single failure of the system is investigated.

No system is without faults. The only way to build reliable system is to not ignore errors and always try to figure out underlying failure in the process that let the fault happen.

Most projects deal with errors only after the error is urgent enough or only to fix the immediate cause of the error, rather than the failure in the process.

If you are interested in very reliable software, when you have it fail for whatever reason you need to look at all stages of your process and identify all places that were supposed to prevent the problem but did not.