Probably the best "debugging in space" story is Don Eyles hacking around the Apollo moon lander program, which was frozen in "core rope" ROM, to work around a hardware failure (a faulty abort switch) on Apollo 14.
The story is findable online, but he also published a book, "Sunburst and Luminary" (Fort Point Press, 2019), about the whole process of getting the moon landing code ready in time, and about the hack.
As I understand it (I just got the book and haven't read it yet), the Apollo Guidance Computer, one each in the Command Module and the LEM, was programmed by the team Margaret Hamilton led (she is credited with coining the term "software engineering"), with a real-time executive and an interpreter emulating a saner machine, and Eyles coded the landing to the interpreter. The rope itself could not be changed in flight, so the workaround had to operate on erasable memory: he came up with a patch on the fly that the astronauts keyed into the AGC by hand, and it saved the mission.
The real-time executive itself saved the day on Apollo 11, when the crew had the rendezvous radar switched on in a configuration the system had not been tested with, and it burned excess CPU cycles during the landing. When the executive trapped a scheduling failure (the famous 1202 and 1201 alarms), it restarted, shed the less important work, and resumed the important tasks from their checkpointed state where they left off. That happened several times during the landing.
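For anyone who wants the idea in code, here is a toy Python sketch of that restart behavior. It is not the AGC's actual design: `Executive`, `Job`, `MAX_SLOTS`, the `critical` flag and the job names are all hypothetical, made up for illustration. It only shows the general pattern of "on overload, shed what is not essential and keep the checkpointed work going".

```python
from dataclasses import dataclass, field

MAX_SLOTS = 3  # hypothetical limit; stands in for a fixed pool of job slots


@dataclass
class Job:
    name: str
    priority: int
    critical: bool = False  # critical jobs survive a restart
    checkpoint: int = 0     # last completed step, saved as the job runs


@dataclass
class Executive:
    jobs: list = field(default_factory=list)
    restarts: int = 0

    def schedule(self, job):
        """Add a job; if no slot is free, treat it as an overload and restart."""
        if len(self.jobs) >= MAX_SLOTS:
            self.restart()
        if len(self.jobs) < MAX_SLOTS:
            self.jobs.append(job)
        # Otherwise every slot holds critical work and the new job is dropped.

    def restart(self):
        """Shed non-critical jobs; critical ones keep their checkpoints."""
        self.restarts += 1
        self.jobs = [j for j in self.jobs if j.critical]

    def run_step(self):
        """Run one step of the highest-priority job and save its progress."""
        if not self.jobs:
            return None
        job = max(self.jobs, key=lambda j: j.priority)
        job.checkpoint += 1
        return job.name


def demo():
    ex = Executive()
    ex.schedule(Job("guidance", priority=10, critical=True))
    ex.run_step()
    ex.run_step()                                  # guidance has done 2 steps
    ex.schedule(Job("display", priority=5))
    ex.schedule(Job("radar_noise", priority=1))
    ex.schedule(Job("radar_noise_2", priority=1))  # no free slot: overload
    ex.run_step()                                  # guidance carries on
    return ex
```

After `demo()`, the executive has restarted once, the display and first radar job are gone, and guidance has continued from its saved step count rather than starting over.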