Aerospace has very high quality standards compared to other industries.
Lots of formal processes capture what would otherwise be informal design decisions elsewhere. In this case, they probably have reams of pages detailing a failure modes and effects analysis (FMEA). One mode is “oops, we sent the wrong command,” and the document would define the specific design mitigation(s) for that outcome until the residual risk falls within an accepted threshold.
As far as I’m aware, no NASA standards call for FMEDA. That doesn’t mean a project manager couldn’t levy it, but it’s rare for a contractor to add extra requirements to a government-funded build.
FMEA relies on a really smart person anticipating all the different combinations of failures worth exploring (N×M of them), not just the N or M individual failures.
Some failures are fairly common, and individual failures might be fairly inert yet have more serious consequences when cascaded with another specific failure. For example: cruise control enabled + failure of the steering-wheel control pad _and_ a previously undetected failure of the brake sensor/brake light circuit = cruise control stuck ON. Note that the combination is inert if the cruise control is OFF when it happens. Contrived example, but you get the idea.
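To make the combinatorial point concrete, here’s a minimal sketch of why cascade analysis scales quadratically. The failure names and the hazard rule are hypothetical stand-ins for the contrived example above, not from any real analysis:

```python
from itertools import combinations

# Hypothetical failure modes for the contrived cruise-control example.
failures = [
    "steering_pad_failed",
    "brake_sensor_failed",
    "throttle_stuck",
    "speed_sensor_drift",
]

# Hypothetical hazard rule: this pair is only dangerous in a particular
# system state (cruise control ON), which single-failure review misses.
def is_hazardous(combo, cruise_on):
    return cruise_on and {"steering_pad_failed", "brake_sensor_failed"} <= set(combo)

singles = [(f,) for f in failures]
pairs = list(combinations(failures, 2))
print(f"{len(singles)} single failures, {len(pairs)} pairs to review")

for combo in singles + pairs:
    for cruise_on in (False, True):
        if is_hazardous(combo, cruise_on):
            print(f"hazard: {combo} with cruise_on={cruise_on}")
```

With N individual failures there are N·(N−1)/2 pairs to review, and more still once system state is in scope, which is where the “really smart person” comes in.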
I have seen a lot of FMEDA (and other tools) used lately to combat concerns about cascading failures, but I’m not sure what’s currently standard at NASA or how they deal with this. I would think cascading failure would be their expected scenario on a 10+ year unmanned mission.
NASA STDs, handbooks, guidebooks, NPDs, and NPRs are all publicly available. They don’t mention FMEDA, and they don’t generally have a detectability column in their FMEAs. IMO they are a little outdated.
I've done for NASA what they were calling FMECA and FTA for a subsystem. They had a lot of freedom to tailor the analysis to the situation, and the end result didn't quite match anything established. We addressed detection in some FMECA columns that aren't traditionally used for detection, and handled events similarly in parts of the FTA. It was a contortion of terminology and format to modernize and maximize the value of the analysis, given their limited resources and the bureaucracy of what they were allowed/required to do on paper.
Here's how I would describe the possible analysis approaches in broad terms, avoiding terminology that NASA does not officially use (a small sketch of the tree idea follows the list).
- Start from the hazard of being pointed in the wrong direction and work backwards to identify the causes, forming a tree.
- Start from the event of commanding the wrong direction and work forwards to identify mitigations or the lack thereof, also forming a tree.
- Start by looking at a component or subsystem, listing all the ways it can fail without regard for the application. Then consider the application and work up toward the causes/events.
- Close any gaps between the top-down and bottom-up approaches.
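For the first two approaches, the resulting structure is a tree of events combined through AND/OR gates. Here's a minimal sketch with hypothetical event names and made-up probabilities; nothing here reflects an actual NASA analysis:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """A node in a fault tree: either a leaf failure or an AND/OR gate."""
    name: str
    gate: str = "LEAF"          # "AND", "OR", or "LEAF"
    prob: float = 0.0           # used only for leaves
    children: list = field(default_factory=list)

    def probability(self) -> float:
        # Assumes independent events; a real analysis must justify that.
        if self.gate == "LEAF":
            return self.prob
        child_probs = [c.probability() for c in self.children]
        if self.gate == "AND":
            p = 1.0
            for cp in child_probs:
                p *= cp
            return p
        # OR gate: 1 - product of the complements
        q = 1.0
        for cp in child_probs:
            q *= 1.0 - cp
        return 1.0 - q

# Top-down: start from the hazard and decompose into causes.
wrong_pointing = Event("pointed in the wrong direction", "OR", children=[
    Event("bad command survives checks", "AND", children=[
        Event("wrong command sent", prob=1e-3),
        Event("command verification fails", prob=1e-2),
    ]),
    Event("attitude sensor failure", prob=1e-4),
])
print(f"top event probability ~ {wrong_pointing.probability():.2e}")
```

The bottom-up pass starts from the leaves instead and has to discover, among other things, that the two command-related leaves only matter in combination.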
Yes, what you're describing is two different approaches to safety analysis. According to the NASA software engineering handbook [1]:
"Software Fault Tree Analysis (SFTA) is a top-down approach to failure analysis which begins with thinking about potential failures or malfunctions (What could go wrong?) and then thinking through all the possible ways that such a failure or malfunction could occur. Fault Tree Analysis (FTA), is often used by the hardware teams to identify potential hazards that might be caused by failures in hardware components or systems, but with the SFTA, the software isn’t considered the hazard, but it can be a cause or contributor when considered in the context of the system."
"The Software Failure Modes and Effects Analysis (SFMEA) is a bottom up approach where each component is examined and all the possible ways it can fail are listed. Each possible failure is traced through the system to see what effect it might have on the system and to determine if it results in a hazardous state. Then the likelihood of the failure and the severity of the system failure can be considered."
But, to the earlier post, these are driven by hard requirements, specifically adherence to NASA STD 7150.2 and NPR 7150.2. Developers/contractors can tailor/waive them with pre-approval, but in general they tend to go in the direction of fewer requirements, not more. This may all be moot because I think Voyager pre-dates any of those requirement documents, and I'm not sure what existed in the late 1970s.
The D (detection) aspect of the FMEA I worked on was motivated by a reliability requirement, not by 7150.2. NASA in the '70s was using FTA and FMEA but avoided putting numbers on the top-level analysis. I imagine they did whatever ad hoc analysis they thought was necessary for such a highly publicized mission, even if it wasn't a separate deliverable.
Edit: The comment you deleted right before I could reply was good! I think people would enjoy and benefit from your description of how the process works if you're willing to repost it.
As you noted, the reliability requirement did in fact flow down from an engineering requirement, which is why they exceeded the minimum FMEA standards. There's no official guidance on where and how exactly to track that information, so they put it in the usual place but in an unusual way. The lack of a standard in Voyager's time probably affected the visibility of the work more than its substance.