
As terrible as it was, that the Therac-25 remains one of the most frequently cited examples of software engineering flaws hurting people is somewhat encouraging for the profession. 3 deaths is a tragedy, but the Hyatt bridge collapse a few years earlier was a couple of orders of magnitude worse (114 people, https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...) from what was also a fairly subtle engineering failure.

IMO, harm from software bugs has (so far) been vastly surpassed by harm from explicit choices in system design. The various emissions-cheating scandals have almost certainly taken a real toll on human life, likely running into the hundreds of lives. More subtly, the choices to retain data inappropriately at Ashley Madison (probably) led directly to suicides and serious emotional harm. Those are just the two recent examples that spring to mind as a practicing developer, not an ethicist.

To oversimplify somewhat: when discussing engineering ethics, the harm from software developers building things wrong is swamped by the harm from building the wrong things.



I think that's because, for most applications where bodily harm is a possibility, you generally (in my experience) have hardware protections that will prevent the software from doing anything stupid. Take an elevator, for instance: even if the software controller is buggy (or hacked) and decides it should drop the cabin from the top floor to the ground level at full speed, there are hardware protections (safety brakes, limitations on the motor itself, etc.) that will take over and make sure nobody gets hurt. So for something to go completely wrong, you need both a software and a hardware failure. The main flaw in the Therac-25 was arguably that no such protection was present; the hardware should have been designed to make the bogus configuration impossible to reach through software alone.

I think, unfortunately, this is going to change with the advent of "AI" and related technologies, such as autonomous driving (we've already had a few cases related to self-driving cars, after all). When the total enumerable set of possible configurations becomes too great to exhaustively "whitelist", we won't be able to have foolproof hardware designs anymore. In these situations software bugs can be absolutely devastating.


> I think, unfortunately, this is going to change with the advent of "AI" and related technologies, such as autonomous driving

Yes, the potential cost of software bugs is increasing as software does things that no hardware interlock can stop. And worse than that, as a society we largely haven’t realized we need to optimize for worst case (not average) performance of algorithms, because they WILL be attacked. If you’re lucky, they won’t be attacked by sophisticated, well resourced nation-state attackers. But sometimes that will happen.

The rise of complex algorithms to control complex processes is the real difficulty. Facebook’s banning algorithm is an example of something that has been exploited by attackers.

Let’s hope voting software is not the next target where bugs can be exploited, because changing political decisions can and does produce life-changing effects.


Your point rings true even in this case. There was another Therac (50? 100? It’s been a while since I read about it) machine that had the same bug, but where no one got hurt, thanks to hardware safeguards.


In my opinion, one of the most tragic aspects of these horrific incidents is that the predecessors of the Therac-25 actually had independent protective circuits and other measures to ensure safe operations, which the Therac-25 lacked.

Here is a quote from http://sunnyday.mit.edu/papers/therac.pdf:

"In addition, the Therac-25 software has more responsibility for maintaining safety than the software in the previous machines. The Therac-20 has independent protective circuits for monitoring the electron-beam scanning plus mechanical interlocks for policing the machine and ensuring safe operation. The Therac-25 relies more on software for these functions. AECL took advantage of the computer's abilities to control and monitor the hardware and decided not to duplicate all the existing hardware safety mechanisms and interlocks."

So, regarding these important safety aspects, even the Therac-20 was better than the Therac-25!

The linked post also mentions this:

"Preceding models used separate circuits to monitor radiation intensity, and hardware interlocks to ensure that spreading magnets were correctly positioned."

And indeed, the Therac-20 also had the same software error as the Therac-25! However, quoting again from the paper:

"The software error is just a nuisance on the Therac-20 because this machine has independent hardware protective circuits for monitoring the electron beam scanning. The protective circuits do not allow the beam to turn on, so there is no danger of radiation exposure to a patient."


I have a friend with 40 years of programming experience who is building a computer-controlled milling machine in his basement.

When I asked him about the limit switches, it turned out they are read by software only, and the software will turn off power to the motor controllers if a limit switch is activated.

I asked why he doesn't wire the switches to cut power directly, to be on the safe side.

His answer: "It's too much bother to add the extra circuits."

We are talking less than $20 in parts and a day of his time. If the software sends the controller a command to start moving the head at a certain speed and then crashes, there is nothing to stop the machine from wrecking itself.
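
For illustration, the software-only scheme described above boils down to something like this Arduino-style sketch (the pin numbers and the HIGH-enables-drivers convention are assumptions, not his actual wiring). The entire protection lives inside loop(); if the firmware hangs after commanding a move, nothing here ever runs:

    // Minimal sketch of a software-only limit stop (hypothetical pins).
    const int LIMIT_X_PIN  = 2;   // NC limit switch wired to ground
    const int MOTOR_EN_PIN = 8;   // enable line of the stepper/servo drivers

    void setup() {
      pinMode(LIMIT_X_PIN, INPUT_PULLUP);  // reads LOW while the switch is closed
      pinMode(MOTOR_EN_PIN, OUTPUT);
      digitalWrite(MOTOR_EN_PIN, HIGH);    // assumption: HIGH = drivers enabled
    }

    void loop() {
      // The only protection: if this loop stops running (crash, hang,
      // long blocking call), the drivers simply stay enabled.
      if (digitalRead(LIMIT_X_PIN) == HIGH) {   // circuit opened = limit hit (or wire off)
        digitalWrite(MOTOR_EN_PIN, LOW);        // cut the enable line in software
      }
      // ...motion commands, serial handling, etc.
    }

One cheap hardware fix would be to put the same normally-closed contacts in series with the driver enable line (or the motor supply), so the cut-off no longer depends on this loop still being alive.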

E.C.P.


While it's not terribly uncommon for small hobby machines to depend on software limits, in my years of maintaining "real" production CNC machines I can't think of a single instance of encountering a machine that didn't also include hard limits.

Limit switches typically include two trip points. The first is monitored by the control system; when it is tripped, the control halts execution and stops the machine. The second is wired directly to the servo amplifier so that if, for whatever reason, the control fails to halt the machine when the soft limit is tripped, power is removed and motion is halted. Both limits are fail-safe, such that if they were to become disconnected, the result would be a limit-exceeded condition.


In a situation like that, I wouldn't blame him. Consider how many of these situations he will come across while building his milling machine. If he had to make sure there was a hardware fail-safe for every one of them, it would simply not scale.

3D printers are like this too. They have mechanical limit switches [0] that are read only by software. So if there is a bug in the software, nothing stops it from pushing past the hardware limits and breaking something. The same goes the other way around: if the switch itself is broken, the same thing might happen.

[0] https://i.ebayimg.com/images/g/EYAAAOSwbopZguz4/s-l300.jpg


Most 3D printers don't have massive printing heads. If they drive into the end-stops, the motors will likely just skip steps and be stuck. They are not designed to apply much force.

I'm much more worried about the heating element. Its temperature is usually controlled by the same CPU that also does motion control and G-code parsing. If anything locks up the CPU, the heat might not be turned off in time, and (because you also want fast startup) there is enough power available to melt something. At the very least you would get nasty fumes from overheated plastics, and maybe even from the teflon tape that is often part of the print head. At worst it could start a fire.
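
The usual firmware-side mitigation is a "thermal runaway" style check. Here is a rough sketch of the idea only (the thresholds, names, and 0-255 duty scale are made up, and this is not any particular firmware's implementation): if the heater has been driven hard for a while and the measured temperature hasn't moved, assume the sensor fell off or the output is stuck, and latch the heater off.

    // Illustrative thermal-runaway check, meant to be called from the
    // firmware's main control loop. All values are placeholders.
    const unsigned long STUCK_WINDOW_MS = 20000;  // tolerate 20 s without progress
    const float MIN_RISE_C = 2.0;                 // expected rise within that window

    float windowStartTemp = 0.0;
    unsigned long windowStartMs = 0;
    bool heaterFaulted = false;                   // caller must cut heater power when true

    void checkThermalRunaway(float tempC, float targetC, int heaterDuty) {
      unsigned long now = millis();
      if (heaterFaulted) return;                  // already latched off
      // "Heating hard": near-full duty (0-255 scale) while still far below target.
      bool heatingHard = (heaterDuty > 200) && (tempC < targetC - 5.0);
      if (!heatingHard) {                         // not stressing the heater: reset window
        windowStartTemp = tempC;
        windowStartMs = now;
        return;
      }
      if (now - windowStartMs > STUCK_WINDOW_MS) {
        if (tempC - windowStartTemp < MIN_RISE_C) {
          heaterFaulted = true;                   // thermistor fell off, MOSFET stuck, etc.
        } else {
          windowStartTemp = tempC;                // progress made: slide the window
          windowStartMs = now;
        }
      }
    }

Of course this only helps while the loop calling it is still running, which is exactly the locked-up-CPU scenario above; that is what the watchdog timer mentioned further down the thread is for.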


As a 3D printing enthusiast, I can confirm your fears. It's all in software, and while there are good control systems, nothing's perfect. I had the hotbed fail, and it was smoking when I found it.


Note that (depending on how the 3D printer is wired) a defective switch will result in an "endstop hit" condition. A dislodged switch, however, will happily keep reporting that it's not being hit.
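
The distinction is easy to see in a minimal Arduino-style sketch (the pin number is hypothetical). With a normally-closed switch on a pull-up input, a broken contact or a fallen-off wire reads exactly like "endstop hit", so the machine refuses to move; a switch that is still wired up but has shifted out of position reports nothing wrong, which is the failure mode the parent is pointing at:

    // Normally-closed endstop wired between the pin and ground (hypothetical pin).
    // Closed while idle -> LOW. Pressed, broken, or disconnected -> HIGH.
    const int ENDSTOP_PIN = 3;

    void setup() {
      pinMode(ENDSTOP_PIN, INPUT_PULLUP);
    }

    bool endstopTriggered() {
      // Electrical failures fail safe; a mechanically dislodged but still
      // connected switch does not.
      return digitalRead(ENDSTOP_PIN) == HIGH;
    }

    void loop() {
      if (endstopTriggered()) {
        // halt motion here (e.g. disable the stepper drivers)
      }
    }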


Probably not that big a deal. An operator/program can easily damage a CNC mill by jogging a substantial tool (or the spindle itself) into solid material, which is a much more likely scenario.


...but it's not "much bother" to run the control loop through the complexity of software? This seems to be the equivalent of overengineering in software, where something straightforward is instead performed through many layers of abstraction and indirection.

The straightforward way of implementing this with a bidirectional motor is to wire normally-closed limit switches in series with their appropriate direction signals, such that when the switch is actuated it prevents the motor from going in that direction, but it can still move away from the switch.


Try that with the stepper motors used in 3D printers. Try it with an off-the-shelf driver. Make it so the limit switch only stops motion in one direction.


So that's where the YouTube videos of CNC milling machine failures come from! TFA noted that major causes of the disaster were the culture and the failure to independently unit test the machine.


This seems like a good time to mention that God gave us hardware interrupt inputs and watchdog timers.
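
For the microcontrollers typically used in hobby machines, the watchdog half of that really is only a few lines. The sketch below uses the standard avr-libc watchdog API on an AVR-class board, and assumes (as is typical) that the heater MOSFET and driver enables are wired so a reset or floating pin means "off"; it is an illustration, not a drop-in for any particular firmware:

    #include <avr/wdt.h>   // avr-libc hardware watchdog interface

    void setup() {
      // After a watchdog reset, the pins default back to inputs, so
      // with sensible wiring the heaters and drivers come up disabled.
      wdt_enable(WDTO_2S);   // reset the MCU if not patted within ~2 s
    }

    void loop() {
      // ...read sensors, run the temperature and motion control...
      wdt_reset();           // only reached while the loop is still healthy
    }

Limit switches can likewise be tied to external interrupt pins, so reacting to them doesn't depend on the main loop getting around to polling.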


> It’s been a while since I read about it

It's in the article you're commenting on.


> the Hyatt bridge collapse a few years earlier was a couple of orders of magnitude worse (114 people, https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...) from what was also a fairly subtle engineering failure.

It wasn't subtle at all. The entire design was substandard to begin with and didn't meet code. Suspending a walkway from a piece of square tubing made by welding two pieces of C-channel together was already indefensible, and undersized to boot. Deciding it was OK to hang the lower span off the upper span's substandard tubing was just the last step in a long chain of gross engineering negligence.

Calling it a "subtle" failure is like calling Challenger subtle because it was "just a leaky O-ring", or the Apollo 1 fire subtle because it was "just a tiny spark". There was a completely avoidable cascade of multi-level failures leading up to all of them.


Mechanical engineer here. I don't think the Hyatt Regency bridge collapse was caused by a subtle problem. The design change should have been obviously bad to any practicing civil engineer. Unfortunately, far too many engineers don't perform even basic sanity checks. I'd say a better engineering culture would have caught the problem. Things like this are why I am becoming more and more interested in testing.

Of course, as you have said, building the wrong thing swamps other harms. Unintended consequences are hard to predict, unfortunately, but I am interested in ways to improve this situation. Standards design also interests me, particularly standards which are hard to cheat.


The number one way to prevent building the wrong thing is a professional code of ethics, which software engineers (at least in the US) do not yet have.


May I point you to the ACM/IEEE-CS Software Engineering Code of Ethics: https://ethics.acm.org/code-of-ethics/software-engineering-c... This was a major topic in my professional ethics course in college.


It's a nice gesture. Unfortunately, software engineers are not required to pass any test on it, subscribe to it, or otherwise be held to the standard before they are allowed to write and publish software, so unless and until someone forfeits a lot of money and/or their liberty, it will likely not be widely adopted in any practical sense.


I've never seen this before, this is really interesting. Thank you for the link!


I am personally more concerned with software engineers and network engineers aiding and abetting the imprisonment, torture and execution of people by repressive regimes, by enabling surveillance technology and fucking with internet traffic analysis. Way more people are going to be hurt in the near term by that than by Therac-25-type mistakes.

For example, if you're a Chinese network engineer and you can avoid it, don't take a job setting up tracking and databases of Uyghur people. That is an ethical issue just as important as the Therac-25-type problem.


You don't have to look that far. ICE detaining children and mistreating them already amounts to crimes against humanity. I bet there are IT people working for that agency.

They have killed at least one child and are drugging children against their will, while the children are forcibly taken from their parents and held in worse conditions than terrorists.


In addition to the ACM and IEEE, the National Society of Professional Engineers also has a code of ethics. However, relatively few engineers get licensed in the US--it's mostly needed for signing off on drawings for regulators and that sort of thing--and, in fact, the Software Engineering exam is being phased out.

(I took the Engineer-in-Training exam once upon a time but then stopped practicing engineering, so I never got the PE.)


In Portugal we do have one, and while joining is optional if you don't legally sign off on projects in the company's name, at least it certifies that university degrees are actually teaching proper software engineering.


Do you mean that poor software engineering should have legal consequences?


Not the poster, but I think there are multiple paths of action encouraged by a code similar to other engineering disciplines[1].

More involvement of the legal and insurance industries is one of them. Another is to give software engineers something solid to brace themselves on when pushing back at management: completely aside from consequences for the company, if you're bonded or worry about a license, there are some things you won't let your manager sweet-talk you into. Another is to provide a model of behavior for engineers, like it says on the tin. It doesn't mean everyone will follow it, or even that the model is always absolutely correct. But giving folks a way to think about things when they feel something's off is a good thing.

Yet another is theoretically providing a baseline of competence. I think that depends more on improving informal culture than any formal mechanism, though.

[1] Note: not arguing in favor of one; I haven't made up my mind on what I think about the topic.


In Canada, that's the legal definition of engineering. You may not call yourself an engineer without accreditation and such accreditation will be rescinded if you make severe enough engineering mistakes.


I don't disagree with you, but there are a LOT of people in Canada calling themselves software engineers or network engineers who don't have a degree that qualifies them to wear the iron ring.


That's about being a Professional Engineer, which legally entitles you to certain actions (e.g. signing off on design docs). Anyone can call themselves an engineer of any kind as long as they don't pretend to be a P.Eng. I am not sure, though, that being a P.Eng in software means anything in practice, unlike in e.g. structural engineering.


> which software engineers (at least in the US) do not yet have

"It is difficult to get a man to understand something, when his salary depends on his not understanding it." -- Upton Sinclair


Seems like now we could have CAD software perform static analysis on the design, as well as physics-simulation "unit tests", in order to augment testing.


Yes, that would be ideal. My impression is that CAD software usually integrates with separate software to do stress analysis, e.g. FEA packages, so it's more complicated than you've described, but very possible.


How far advanced are methods to automatically set up FEA simulations on arbitrary inputs? The parameter space for FEA methods is pretty large, and things like meshing can go terribly wrong and lead to utterly wrong results. Trying to run simulations in the background while a building is being designed seems like a goal to strive for, but to be successful the software needs to be able to perform equivalently to an experienced engineer without guidance. That's a tall order.


Sorry, I wouldn't really know, as I work in fluid mechanics, not solid mechanics. In fluids I'm getting the impression that automated meshing is becoming fairly robust in some circumstances, to the point where I believe it is sometimes better than an experienced engineer. Solids seems easier in this respect, but as I said, I don't really know.


I mean, even if the software could do something like flag the Hyatt bridge redesign as failing to meet weight specs, that would be a win. I could be wrong, but it seems like certain civil engineering projects (such as indoor bridges) would be pretty simple to check the math on, compared to a building subject to wind, etc.
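
As a toy illustration of the arithmetic involved (all numbers below are placeholders, not the actual design loads or capacities): in the original Hyatt design the hanger rods were continuous, so each fourth-floor box-beam connection carried only the fourth-floor walkway's share of the load, while the as-built change made that same connection also carry the second-floor walkway's share, roughly doubling its demand. A crude automated check only has to compare demand to rated capacity per connection:

    #include <iostream>

    int main() {
        // Placeholder values purely for illustration; the real connection was
        // reportedly under code even before the change.
        double walkway_share_kN = 90.0;   // load per connection from one walkway
        double capacity_kN      = 100.0;  // hypothetical rated capacity

        double original_demand = walkway_share_kN;        // continuous rods: one walkway
        double as_built_demand = 2.0 * walkway_share_kN;  // offset rods: both walkways

        std::cout << "original design: demand/capacity = "
                  << original_demand / capacity_kN << "\n";
        std::cout << "as-built change: demand/capacity = "
                  << as_built_demand / capacity_kN << "\n";
        // Anything over 1.0 (really, over the allowable ratio once the
        // required safety factor is included) should be flagged automatically.
        return 0;
    }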


Damage-by-software usually isn't spectacular (and therefore isn't likely to get noticed), nor is it necessarily very directly costly in terms of human lives, but I'd argue it's actually more significant in the long term and in the grand scheme of things. Software rules everything, and even slight errors or inefficiencies have an absolutely incredible incidental cost.


We’re already seeing bugs kill or maim with autonomous cars. You can be sure there is far worse data associated with military system flaws that we don’t know about.

In any case, it’s dangerous to explain away software failure by dumping blame on systems. The lack of professional standards in software makes it easy for people to do bad things well.


The difference here is that you treat one person at a time, and the bug doesn't get triggered every time. A bridge just happens to be used by a lot of people at the same time. The bridge didn't get to collapse three times before people figured out something was wrong.

This applies to the vast majority of fields where software is in use, except maybe planes, trains and nuclear power plants (where I sure as hell hope there are a bunch of hardware safeguards in place). So in a sense, we software developers just got lucky that our mistakes kill only one or very few people at a time in most use cases (if at all).

It's still insane that, according to the article, they apparently just had that thing developed using emulated hardware, with no proper security audits, safety guidelines or formal verification.


Software is honestly cheaper and easier to test; other kinds of engineering tests are far more expensive in time and materials, and simulations aren't perfect and can't replicate all real-world conditions.

Definitely agree on the explicitly bad choices though, and since software's impact is often very subtle it might really be impossible to gauge exactly how bad some of those choices end up being.



