Software Design Lessons from the Apollo Moon Landing

We all know the story of Apollo 13 from the Ron Howard film starring Tom Hanks. In the film, heroic engineers find solutions to a series of critical problems during the mission and play a key role in bringing the three astronauts safely back to Earth.
A far less well-known story is how quick thinking and thorough testing by software engineers prevented the first moon landing from ending in disaster. If you listen to the tapes of the Apollo 11 moon landing, you will hear Buzz Aldrin call out “Program Alarm - 1202.” This was a software alarm from the Apollo Guidance Computer. The computer was critical not only for the lunar landing but also for the astronauts’ safe return from the lunar surface. A few seconds later, Aldrin calls out “1202 Alarm 1201.” The guidance computer was failing, and the software engineers in mission control had only seconds to decide whether to abort the moon landing. In spite of the recurring software alarms, the decision was “Go,” and the lunar landing succeeded.
What were those 1202 (and 1201) alarms? How could the software engineers be confident that the moon landing could proceed safely?
The 1201 and 1202 alarms were “executive overflow” alarms, meaning the computer was overloaded and had run out of resources. The Apollo Guidance Computer’s software was designed around a priority-based scheduling scheme. The scheduling executive was designed to finish all critical tasks within a given period of time. Lower-priority tasks were given a block of memory and scheduled to run after the critical tasks. If the lower-priority tasks did not complete during a time period, they were rescheduled to run in the next time period, after the critical tasks had executed. The alarms indicated that the computer could not schedule any more tasks because it had run out of memory blocks to assign. The root cause was that the rendezvous radar, which was only used during ascent from the lunar surface, was turned on during the lunar descent. Data from the rendezvous radar caused invalid tasks to be scheduled, and those tasks filled up the memory.
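The scheduling behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Apollo Guidance Computer's actual code: the names `Executive`, `ExecutiveOverflow`, `schedule`, and `run_cycle`, the slot count, and the per-cycle budget are all assumptions made for the example.

```python
import heapq


class ExecutiveOverflow(Exception):
    """Raised when no job slot is free (loosely analogous to a 1202 alarm)."""


class Executive:
    """Toy priority scheduler with a fixed pool of job slots (hypothetical)."""

    def __init__(self, slots):
        self.slots = slots   # fixed number of job slots, chosen arbitrarily
        self._queue = []     # min-heap of (priority, sequence, name, job)
        self._seq = 0        # tie-breaker that preserves submission order

    def schedule(self, priority, name, job):
        """Queue a job; a lower priority number means more critical."""
        if len(self._queue) >= self.slots:
            raise ExecutiveOverflow(f"no free slot for {name!r}")
        heapq.heappush(self._queue, (priority, self._seq, name, job))
        self._seq += 1

    def run_cycle(self, budget):
        """Run at most `budget` jobs, most critical first.

        Jobs that do not fit in this cycle stay queued for the next one.
        """
        done = []
        while self._queue and budget > 0:
            _, _, name, job = heapq.heappop(self._queue)
            job()
            done.append(name)
            budget -= 1
        return done
```

With three jobs queued and a budget of two, the most critical jobs run first and the remaining job waits for the next cycle. Scheduling a job when every slot is occupied raises `ExecutiveOverflow`, which is the condition the alarms reported.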
The software engineers in mission control did not know the root cause of the software alarms when they had to decide if it was safe to proceed with the moon landing. However, they were confident that the computer would continue to process all the critical tasks in spite of the overload. They knew this because of the software’s design and extensive testing.
Two critical design decisions allowed the Apollo Guidance Computer to continue functioning in an overload condition. The first was the use of a priority-based scheduling scheme that guaranteed that all critical tasks would execute before any lower-priority tasks were processed. The second was the software abort and restart mechanism, which allowed critical tasks to resume executing at almost the same point where they had been interrupted by the alarm. The restart capability was extensively tested during system development. Knowing this, the software engineers could be confident that the computer would continue to perform its critical tasks, even when overloaded and restarting about every ten seconds.
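The restart idea can be illustrated with a small checkpointing sketch. It assumes a critical task is split into steps and that progress is recorded only after a step completes, so a restart resumes at the interrupted step instead of starting over. The names `SoftwareRestart` and `run_with_restart`, the step functions, and the restart limit are hypothetical, and the real mechanism was considerably more involved.

```python
class SoftwareRestart(Exception):
    """Signals that the system aborted and restarted (hypothetical)."""


def run_with_restart(steps, checkpoint, max_restarts=3):
    """Run `steps` in order, resuming from the checkpoint after a restart.

    `checkpoint` is a dict that outlives restarts; its "next" entry holds
    the index of the next step to run. Returns the number of restarts.
    """
    restarts = 0
    while checkpoint["next"] < len(steps):
        try:
            steps[checkpoint["next"]]()
            checkpoint["next"] += 1  # record progress only after success
        except SoftwareRestart:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up instead of looping forever


    return restarts


log = []
overload = {"pending": 1}  # simulate a single overload during "steer"


def navigate():
    log.append("navigate")


def steer():
    if overload["pending"]:
        overload["pending"] -= 1
        raise SoftwareRestart("overload")
    log.append("steer")


def throttle():
    log.append("throttle")


checkpoint = {"next": 0}
restarts = run_with_restart([navigate, steer, throttle], checkpoint)
```

In this run, `steer` is interrupted once and then retried, while `navigate` is not repeated because its completion was already recorded in the checkpoint.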
What can we learn from the Apollo 11 Guidance Computer failures? Most of us will never develop life-critical software and even fewer of us will write software for manned spacecraft guidance. However, many of us write mission-critical software. The principles of software design and testing used for the Apollo Guidance Computer are as applicable to modern embedded systems as they were to the aerospace systems of the 1960s.
Reliability starts with the software architecture. The Apollo Guidance Computer used a priority-based scheduler and a safe restart mechanism. These were architectural decisions that could not have been retrofitted into existing code after a problem was discovered in testing. You can’t test reliability into a product after it is built.
Embedded software needs to detect errors and act on them. The Apollo software detected the memory overflow condition, reported the problem with the 1202 alarm, and reacted by restarting the computer. Error detection and recovery are critical parts of embedded software design at every level.
Finally, error recovery mechanisms must be thoroughly tested if we are to have any confidence in them. During the Apollo moon landing, engineers could be confident that it was safe to proceed in spite of the software alarms because they had tested the software restart mechanism. Testing under overload and other anomalous conditions is a required part of embedded systems verification.
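An overload test can be as simple as flooding the system with spurious work and asserting that the critical work is still admitted. The sketch below uses Python's standard `unittest` module against a stand-in function, `run_cycle`, that is purely hypothetical; the job names, slot count, and flood size are assumptions chosen for illustration.

```python
import unittest


def run_cycle(critical_jobs, background_jobs, slots):
    """Hypothetical system under test.

    Admits critical jobs first, fills the remaining slots with background
    jobs, and returns (admitted, shed).
    """
    admitted = list(critical_jobs)
    free = slots - len(admitted)
    if free < 0:
        raise RuntimeError("critical load alone exceeds capacity")
    admitted += background_jobs[:free]
    shed = background_jobs[free:]
    return admitted, shed


class OverloadTest(unittest.TestCase):
    def test_critical_jobs_survive_flood(self):
        critical = ["guidance", "throttle"]
        flood = [f"spurious-{i}" for i in range(1000)]
        admitted, shed = run_cycle(critical, flood, slots=8)
        # Every critical job must be admitted despite the flood.
        self.assertTrue(set(critical) <= set(admitted))
        # Capacity is never exceeded; the excess is shed, not queued.
        self.assertEqual(len(admitted), 8)
        self.assertEqual(len(shed), 994)

    def test_nominal_load_sheds_nothing(self):
        admitted, shed = run_cycle(["guidance"], ["display"], slots=8)
        self.assertEqual(admitted, ["guidance", "display"])
        self.assertEqual(shed, [])
```

Saved in a file, these tests can be run with `python -m unittest`. The point of the design is that the anomalous case is exercised deliberately and routinely, so it is not first discovered in the field.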







