You're missing your near misses (surfingcomplexity.blog)
154 points by azhenley 6 days ago | 51 comments





This is part of the Israeli Air Force safety culture: https://www.talentgrow.com/podcast/episode92

"By implementing a unique culture and methodology, the Israeli Air Force became one of the best in the world in terms of quality, safety, and training, effectively cutting accidents by 95%."

Near misses and mistakes are investigated and there's a culture supporting the reporting of these and that has resulted in a huge change to the overall incident rate.

Software is a little like this as well. The observable quality issues are sort of the tip of the iceberg. Many bugs and architectural issues can lurk beneath the surface, and only some random subset of those will have impact. Focusing only on the impactful ones without dealing with what's under the surface may not have a material effect on quality.


> Software is a little like this as well. The observable quality issues are sort of the tip of the iceberg.

Most of the places I've worked cannot even resolve all of their observable quality issues, let alone the hidden ones. Hell, at most places I've worked, either it's a P0 emergency, a P1 must-do, or the defect will basically never be fixed, no matter how visible it is or how easy it is to diagnose and correct. The bug list always grows far faster than things get resolved, until it reaches a certain size and someone decides to "declare bankruptcy" and mass-close everything older than N days. Then the cycle continues.


Similar situation. The P0 interest tends to immediately cease when there is sticky tape mitigation in place as well. There is no resolution. It just dangles waiting to bite again.

It’s demoralizing how the number of times an event has to happen before action occurs grows roughly quadratically with the severity number (1 being highest).

Troubling things have to happen three or four times, and merely super-annoying things dozens of times, before anyone acts, let alone gets a mandate to do so.


> Most of the places I've worked cannot even resolve all of their observable quality issues, let alone the hidden ones.

Just remember...

> Nobody gives a hoot about profit. — W. Edwards Deming


Yep, the only approach I've ever seen to quality from the top is hot air.

There's nothing wrong with improving organizational reporting culture, but the "cutting accidents by 95%" claim seems highly dubious to me. If you take it from the opposite angle, how could a poor reporting culture alone cause twenty times more accidents?

I don't think it's reasonable to expect similar improvements as those claimed here in most settings, and indeed it would be dangerous to do so if it means compromising other safety measures. Much is made of the good reporting culture in civil aviation, for instance, but that only helps if every other aspect of civil aviation has safety regarded as a priority. Reporting culture won't help you when your Boeing 737 MAX 8 has just started an uncommanded nosedive!


> Reporting culture won't help you when your Boeing 737 MAX 8 has just started an uncommanded nosedive!

Yet, it took until the second MCAS-caused crash for the 737 MAX 8 to get grounded. Discounting the role of good incident reporting and investigation based on that example seems silly.

> how could a poor reporting culture alone cause twenty times more accidents?

If you assume that the majority of accidents are caused by unrecognized design or process defects, then near-miss reporting and investigation can allow these defects to be identified before they cause an accident. The plausibility of a 95% reduction depends a great deal on how poor the safety record was going into that change.


> Reporting culture won't help you when your Boeing 737 MAX 8 has just started an uncommanded nosedive!

It should have helped by reporting this many, many times before said uncontrolled nosedive.


Looking at Lion Air Flight 610, all deaths could have been avoided given perfect communication about the fault:

People familiar with the investigation reported that during a flight piloted by a different crew on the day before the crash, the same aircraft experienced a similar malfunction but an extra pilot sitting in the cockpit jumpseat correctly diagnosed the problem and told the crew how to disable the malfunctioning MCAS flight-control system.

...

After the accident, the United States Federal Aviation Administration and Boeing issued warnings and training advisories to all operators of the Boeing 737 MAX series, reminding pilots to follow the runaway stabilizer checklist to avoid letting the MCAS cause similar problems.

...

These training advisories were not fully followed, however, and similar issues caused the crash of Ethiopian Airlines Flight 302 on 10 March 2019, prompting a worldwide grounding of all 737 MAX aircraft.


The first uncommanded-nosedive incident in a MAX that later crashed was actually recovered by the crew, who disabled electronic trim before it overwhelmed their ability to correct the issue with the MAX's smaller manual trim wheels. So there was a near miss before the first crash. The next crew didn't disable electronic trim in time, and the decreased manual trim wheel size meant they couldn't move the horizontal stabilizer back into trim.

Agreed. I know enough about aviation to say that this is simply "goosing the metrics" by redefining "accident" and "frequency".

Related to Boeing, everyone knew their goose was cooked as soon as they didn't have sign-offs and checklists on those door bolts. Hell, even the AvGas delivery guy at a dirt-strip muni airport with no tower MUST have sign-offs and checklists in triplicate.


For a mind-boggling near-miss account that no one cared about, see https://avherald.com/h?article=4b6eb830. Note that the FAA didn't even have a record of it.

And that’s an extreme case. How many less extreme ones happen?

In the US alone, there’s research estimating the number of in-flight fume events at around 2,000 per year. The number of reported fume events is fewer than 10.

Mental degradation is insidious. “Just wear an oxygen mas-”: what if you forgot about the oxygen mask? You forgot to drop the gear.

Extensive training (to the point of automatic response) and human resilience are perhaps the main reasons fumes do not seem to be causing many incidents, but resilience varies by individual, and training cannot drill into pilots the correct intuitive response to every possible scenario. In addition, it’s unknown in how many incidents attributed to pilot error that error was in turn caused by partial mental incapacitation (that perhaps not even the pilot was aware of).


https://en.wikipedia.org/wiki/Fume_event is a bit clearer on what these "fume events" are.

This surprised me:

> It is not mandatory for fume events to be reported in the U.S.


This was more than just a plain fume event: if the people in charge of an aircraft had difficulty interpreting ATC directions and had no recollection of landing and taxiing, that is a clear near miss. One more thing going wrong could have turned it into a disaster.

I am additionally surprised that the captain’s death did not make reporting mandatory.


"Here's one they just made up: near miss. When two planes almost collide they call it a near miss. It's a near hit! A collision is a near miss." ~ George Carlin [0]

[0]: https://www.youtube.com/watch?v=zDKdvTecYAM


When deciding whether to do an incident investigation for a near miss, one aspect to consider is whether it was caught by a safety system as designed, or caught by a lucky accident. The latter should be top of the priority list.

E.g., Bad package deployed to production. Stopped because it didn’t have the “CI tested” flag: low pri. Stopped because someone happened to notice a high CPU alert before the load balancer switched over: high pri.
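
As a minimal sketch, assuming a hypothetical near-miss record with a single "caught by a designed control vs. caught by luck" flag (none of these names come from a real tool), the triage rule above might look like:

    from dataclasses import dataclass

    # Hypothetical near-miss record; the field names are illustrative only.
    @dataclass
    class NearMiss:
        description: str
        caught_by_designed_control: bool  # True if a safety system worked as designed

    def investigation_priority(event: NearMiss) -> str:
        # Prioritize near misses that were stopped by luck over those stopped
        # by a safety system doing its job.
        return "low" if event.caught_by_designed_control else "high"

    print(investigation_priority(NearMiss("bad package, missing 'CI tested' flag", True)))          # low
    print(investigation_priority(NearMiss("bad package, noticed via CPU alert by chance", False)))  # high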


It can be a good idea to count and record these safety-net activations for reporting purposes. If you do so, the discussion of e.g. "Why do we run replicas for our databases?" becomes a very simple one: it has simplified maintenance for us X times this year (by avoiding downtime due to switchovers), and it has avoided Y outages (again thanks to automated switchovers); recovery of the databases after an outage took Z hours, which would have been downtime. Suddenly you've turned the vague "We need it for safety and availability reasons" into concrete admin-hours and downtime-minutes.

Or, "Can't we skip all these tests/other long-running quality process?"


Definitely, counting the ones you have is great data for justifying their existence, and can help identify places you should perhaps put a check farther upstream.

> Stopped because it didn’t have the “CI tested” flag: low pri.

Agree, certainly lower priority than the "caught by chance" case, but I wouldn't say "low priority". That system is the last line of defence and the fact that something is hitting it is definitely a large concern.

In the CrowdStrike fiasco last year, their last line of defence, the "Content Validator", failed.

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

https://www.crowdstrike.com/falcon-content-update-remediatio...


Oh absolutely, if you have the budget then do them all. I just meant “lower pri” than the other.

My assumption is in most orgs you’ll find it hard to take time to investigate any of them, so you have to place your bets.


In The Field Guide to Human Error Investigations Dekker talks about how “number of minor incidents” correlates inversely with “number of fatal incidents” for airlines (scaled per flight hour or whatever). I have forgotten whether this was all airlines or Western only. I wonder if it still holds.

The rest of the book is also quite a good read, including a fun take on Murphy’s Law that goes “Everything that can go wrong will go right”, which is the basis for normalization of deviance: where a group of people (an org, whatever) slowly drifts away from its performance standards as it “gets away with it”.

I wonder how modern organizations fight this. Most critically, I imagine warfighting ability can see massive multipliers based on adherence to design, but also civilian performance to a lesser extent (the outcome is often less catastrophically binary).

Anyway, I got a lot of mileage out of the safety book wrt software engineering.


Safety literature focuses on straightforward ways to build highly reliable systems involving humans. That applies to almost everyone on HN.

What's funny is that the suggestions are usually pretty "common sense". You already know most of the information in Sidney Dekker's books and the NASA guidelines. They're essentially the same principles we all like to see in code.

Things like: Consider the human factors. Make doing the right thing easy. Make wrong things obvious. Trust, but verify. Keep the signal to noise ratio high in communication. Etc.


One big lesson, I think (at least as aviation does safety), is that even if the ideas are common sense, you have to take the common sense and turn it into rules and processes, and make sure people stick to those. And if it's a role where time matters, you have to drill those so people stick to them when seconds count.

There's a lot of value in the books. Applying "common sense" consistently and intentionally is extremely difficult. It's not a deep magic entirely divorced from the practices we already know and agree with, but rather the organizational equivalent of going to the gym.

Are chaos monkeys relevant here? And at some level: testing! You definitely find more severe issues if you actually test stuff. In production.

A simple decision rule in (personal) aviation is three strikes.

e.g., (1) nervous passenger; (2) clouds coming in; (3) running late -> abort.

It produces very conservative decisions, and overcomes the drive to just try.

But the interesting part is that you then realize how often you are at two strikes. That in itself makes you more careful.
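
As a toy sketch of that rule (the factor names come from the example above; the code itself is just an illustration):

    def go_no_go(strikes):
        # Three-strikes rule: abort at three factors, be on alert at two.
        if len(strikes) >= 3:
            return "abort"
        if len(strikes) == 2:
            return "caution: one more strike means abort"
        return "go"

    print(go_no_go(["nervous passenger", "clouds coming in", "running late"]))  # abort
    print(go_no_go(["nervous passenger", "clouds coming in"]))                  # caution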

Two strikes I would call "noticeable"; I wouldn't wait for near-miss events. That also gives a measure of how on-edge we're running.

So at work, I just put a red dot on the calendar if it's a day with something urgent and visible to outsiders, or if we're having problems we don't see our way out of. It keeps us from tolerating long stretches of stress before taking a step back, and we usually also do attribution: if x is causing n>4 red days per month, it gets attention.

Obviously, this varies with context: a high-achieving team would be almost always red internally, but rarely externally.
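
A sketch of that red-dot bookkeeping; the "more than 4 red days per month for one cause" threshold is from the comment above, while the dates and cause labels are made up for illustration:

    from collections import Counter
    from datetime import date

    # Each red day is a (date, attributed cause) pair; in practice this might
    # just be a shared calendar or spreadsheet rather than code.
    red_days = [
        (date(2025, 5, 2), "flaky deploy pipeline"),
        (date(2025, 5, 6), "flaky deploy pipeline"),
        (date(2025, 5, 9), "customer escalation"),
        (date(2025, 5, 13), "flaky deploy pipeline"),
        (date(2025, 5, 20), "flaky deploy pipeline"),
        (date(2025, 5, 27), "flaky deploy pipeline"),
    ]

    def causes_needing_attention(days, threshold=4):
        # Flag any cause attributed to more than `threshold` red days.
        counts = Counter(cause for _, cause in days)
        return [cause for cause, n in counts.items() if n > threshold]

    print(causes_needing_attention(red_days))  # ['flaky deploy pipeline']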


Useful for software or engineering, of course, but also useful for everyday life - safety, relationships, cooking, etc. People (sometimes) learn from painful mistakes, but rarely learn from the painless ones!

Is this basically debriefing (psychology/therapy thing)?

This is the most important thing about riding a motorcycle, too. If you almost crash, you just got lucky. Consider the root causes, don't just be proud you escaped the situation.

Isn't riding a motorcycle itself one of the root causes, sorry? Any mistake made by anyone you meet becomes a major danger to you.

In urban areas, yes, others' mistakes are more likely to ruin your day/life. One mitigating strategy is to only ride in low-traffic areas. (Most of the men on my father's side have had life-threatening accidents on motorcycles.)

Yes, but you can still ride extremely safely. Motorcycle cops do not get into many accidents, for example.

Aha, interesting, I didn't know

I have 10+ year old projects on GitHub and virtually all of them are now riddled with security problems, or their dependencies are in any case. The alert emails are comprehensive. This tells me that software is inherently unsafe; it’s just a matter of time until flaws are found.

OK, you could say that quality is improving and this is becoming less true, but from experience at work I would say that's wishful thinking, and if anything it's the opposite.


> we should be treating our near misses as first-class entities, the way we do with incidents.

That's exactly like in driving. You have to take your close calls seriously and reflect over them to improve your habits of observation.


Working in Big Tech: my colleagues aren't

This can be taken both with admiration and distaste. I'll either be rewarded with a lesson or a knife


The article is spot on. It's pretty much what happened when Maersk was hacked within an inch of bankruptcy.

The flaw was identified, flagged and acknowledged before it happened:

"In 2016, one group of IT executives had pushed for a preemptive security redesign of Maersk’s entire global network. They called attention to Maersk’s less-than-perfect software patching, outdated operating systems, and above all insufficient network segmentation. That last vulnerability in particular, they warned, could allow malware with access to one part of the network to spread wildly beyond its initial foothold, exactly as NotPetya would the next year."

But:

"The security revamp was green-lit and budgeted. But its success was never made a so-called key performance indicator for Maersk’s most senior IT overseers, so implementing it wouldn’t contribute to their bonuses."

Basically, a near miss that no one was incentivised to fix.

If you're interested in this type of story, it's an absolute thriller to read: https://archive.is/Gyu2T#selection-3563.0-3563.212


For better or worse, a near miss has zero cost to the org as a whole and thus justifies zero org-level investment.

That is okay as long as someone is noticing! As stated in the article, these types of near misses are noticed within the team and mitigated at that level so the org doesn’t need to respond.

That’s a cost effective way to deal with them, so I would argue everything works the way it should.


A near miss means your processes need work. If you have any sort of reliable process, actual misses should be vanishingly rare and you'll need to look at proxies anyway to monitor for improvement, so why not use the best one available?

My point was there’s a nuance.

The processes don’t need work - at an org level - because that near miss is noticed and addressed within the team. That’s the most cost effective place for it to happen.

By analogy say you stumble while walking, that causes you to ask whether the concrete was uneven, or your shoes are the wrong size, or whatever else. Treating that trip as a “near miss” and “assessing at an org level”, by way of analogy, might be a bit like going to the doctor to ask if your legs are okay - that would be a disproportionate response.


The problem is with your analogy here. Let's try two marginally less silly examples:

1) You're a sidewalk construction company. You monitor some of your sidewalks to see how often people trip and adjust the leveling or depth.

2) You coach a running team and your runners average 30s lost to stumbles per marathon. You think about better stride trainings or nutrition schedules to minimize that loss.

These are still a bit silly, but far more analogous to actual software companies.


These analogies are great for demonstrating a deeper nuance, I like them!

So for 1) though, the sidewalk construction company is only incentivized to do monitoring and continuous improvement to the extent that a) they do it to compete or b) the client asks for it.

And for 2) stumbles are likely to be quite obvious failures - but the obviousness of the failure doesn’t invalidate your example. To strengthen your point we could target something subtler than stumbles - let’s say 30s is lost to incomplete nutrition plans.

In both cases only organizations that value excellence would consider these initiatives. Which actually aligns with the article, apart from a key difference that both examples given describe top-down searches for near-misses rather than bottom up flagging of them.

I think ultimately this is about operational excellence. An org that really cares will actively look for near misses, but flagging them from the bottom up won't make an org care unless it already did: maybe they don't have an incentive to care, maybe they don't see the value of that work, or maybe they do care but haven't addressed bigger issues yet, or are too low-margin a business to afford to address them, so it's a yes-we-know-but-not-yet type of thing.

And if bottom up flagging doesn’t work the best you can do is address it locally.


That, however, assumes a level of team autonomy that I haven't seen much of in my experience.

And the consequences from a disaster are very likely to affect the entire org.


Oh, well, surprisingly, it seems this article hadn't been posted here yet?

Do enjoy the discussion, and, whatever you do, please don't let the apparent incongruity of "near miss" when it's clear that it should be "near accident" derail the conversation... (insert-innocent-smiley-here)


The word "accident" is generally inappropriate in this context. For example, it's banned from FHA and NHTSA communications [0] among others because it implies that the incident was random or unpreventable. This article is talking about incidents that can be prevented and were narrowly avoided as opposed to "accidents".

The FAA and the NTSB continue to use "accident", but they're somewhat unique in that and they have very specific technical definitions that don't match popular connotations.

[0] https://www.fmcsa.dot.gov/newsroom/crash-not-accident


"Near collision" has some currency, as in this prize-winning photograph.

https://en.wikipedia.org/wiki/1950_Pulitzer_Prize#/media/Fil...


Why do we call it “near miss” when it’s more like a “near hit”?

Out of respect for George Carlin.

It's a miss that is near [the hit].

Planes were near.

Planes missed.



