All organizations experience setbacks, and usually the severity of the problems they encounter is proportional to their goals. If you own a coffee shop, what’s the worst that could happen? The fallout from most any problem that arises in such a setting generally stops with you: maybe you experience some equipment failures and your bank account suffers as a result; perhaps your business plan wasn’t tailored to the market, and your bank account suffers a bit more. Maybe you spill an ice cold chocolate-peppermint extra special Christmas coffee all over a regular customer. Maybe it was a scalding hot cup of coffee; then you have to reprint cups warning customers that coffee is, in fact, hot. On this scale, finding the source of the problem and proposing solutions isn’t rocket science.
Other industries, on the other hand, operate on a much higher level of risk. Say you’re in charge of a nuclear power plant that experiences a meltdown. Suddenly, identifying the problem and proposing solutions becomes much more complicated; moreover, the consequences are not limited to the power plant, but rather impact an entire community. Because the failure is more public, you would also be dealing with the added headache of explaining what went wrong to a public that almost certainly is incapable of understanding the technical aspects of the problem–and sometimes, the problem consists entirely of technical aspects.
NASA operates on a scale that is largely incomprehensible to those who did not pay extra good attention in their science and math classes. The space program involves complex astrophysics and a highly specialized language full of words like ‘redshifts’ and ‘dark matter,’ words that take on an almost mythical aspect to those who are not versed in the language of space. Its engineering projects are large–space ship large–and necessitate coordination between multiple agencies, providers, machinists, and scientists to realize. Such scientists work in large collaborations spread out all over the country, operate in units that most people have never heard of and deal with distances that are so great they are incredibly difficult to imagine.
And it costs a whole lot of money.
The work that the dedicated men and women who invest their lives in this industry do has advanced knowledge of space, the great beyond, by leaps and bounds over just a few decades. It feeds the nation inspiration and awe, and though many do not understand precisely how we got a man to the moon or really how we know so much about the universe, it sure makes for great television. And though America might not understand how these missions work or why they succeed, we generally bask in their successes.
The downside, of course, is that when something goes wrong, it goes wrong in a big way. NASA’s failures are rarely private, and tend to garner as much (if not more) media attention than its successes. And while most Americans may not completely understand the technical aspects of what went wrong when something does go wrong, they have little trouble understanding the concept of a price tag. NASA’s missions are expensive, and when they fail, they fail on a scale of millions of dollars.
Root Cause Analysis has been quite successful in breaking down the complexity of such failures to a level that almost anyone can understand. While news reports that an O-ring failure caused the space shuttle Challenger to explode might not explain a whole lot to most people, a root cause analysis of that incident performed as a Cause Map tends to do the trick. Without root cause analysis, however, understanding what went wrong in incidents like the Columbia or Apollo 1 disasters is, well, rocket science.
Here, however, we will discuss a case in which the cause of a mission failure can be traced to an error in calculation that your middle school math teacher warned you against again and again:
Don’t forget to convert your units!
The simplest version of events, a version that we will contextualize and expand on in our root cause analysis, is that the teams responsible for engineering and launching the Mars Climate Orbiter were not on the same page when it came to the units of measurement they operated in. The flight software worked in metric units, while the team on the ground assumed Imperial units. This resulted in a miscalculation of the Orbiter’s trajectory that cost NASA millions of dollars and a good bit of its reputation.
Sometimes root cause analysis is like detective work, working backwards from an affected goal through a complex series of causes and effects until the problems are discovered and understood. Other times, the chain of causes and effects itself is more important.
Understanding that an organization as sophisticated as NASA managed to lose a multi-million dollar piece of equipment by making this kind of relatively simple error is actually quite easy. It’s no secret that the world operates on the metric system while the United States clings to the Imperial system, and problems are bound to result from that fact. Understanding how this error was made and went uncorrected in an organization as sophisticated as NASA–now there’s a complex problem.
Because the Cause Mapping approach to root cause analysis centers on solving a problem by teasing out a chain of events that leads from a failed goal to its root causes, there is no better tool for understanding what happened in a way that permits enacting solutions at multiple levels to ensure it doesn’t happen again.
Now that technology (in some respects) has caught up to human curiosity, we are in a position to gather facts and start answering some of our oldest questions about the cosmos. As residents of the third planet from the sun, it makes sense to start looking for answers on the fourth: Mars, “the red planet.”
Exploration of Mars began in earnest in 1965, with the first successful flyby by Mariner 4, and at present there are three orbiters surveying the planet and two rovers on the surface. Bouncing from one planet to another, however, is not easy, which makes the failure rate of such missions rather high: about two-thirds of all spacecraft destined for Mars fail before completing their missions.
The Mars Climate Orbiter was a space probe designed to be a smaller, less expensive way to explore Mars in the wake of the 1993 loss of the Mars Observer, and was launched atop a Delta II launch vehicle on December 11, 1998. Its mission: to function as an interplanetary weather satellite and communications relay for the Mars Polar Lander. Together, the Mars Climate Orbiter and Mars Polar Lander were to map Mars’ surface, profile the structure of the atmosphere, try to detect surface ice reservoirs, and dig for traces of water beneath the surface–to understand the planet’s history and its potential to sustain, or to have sustained, life.
Now, it’s one thing to say that NASA launched the Mars Climate Orbiter, and another to really understand what that means. After launch, the orbiter reached a final velocity of 5.5 kilometers per second, on a 669 million kilometer trajectory towards Mars. It took the orbiter over nine months to arrive at the point at which it could begin to enter Martian orbit. This is a long time for a multimillion-dollar orbiter to be flying, fast and very far, in space; before we begin to consider what went wrong in this case, it’s only fair to take a minute to marvel at everything that went right.
Are you appropriately impressed? Good.
Nine and a half months after launch, the Mars Climate Orbiter was scheduled to begin the process of establishing an orbit around Mars. The plan was to use a technique called aerobraking to reduce the velocity and slowly move the orbiter from a 14-hour orbit to a 2-hour orbit.
At 09:00:46 UTC on September 23, 1999, the orbiter began its planned insertion. Four minutes and six seconds later, the orbiter passed behind Mars and went out of radio contact. This was 49 seconds earlier than expected, and communication with the Orbiter was never reestablished–the Orbiter had encountered Mars at a lower than anticipated altitude, and fell victim to atmospheric stresses.
Another way to say the same thing, and the way that was chosen by most of the media reports on the story, is as follows: On September 23, 1999, the $125 million Mars Climate Orbiter was lost during the attempt to establish orbit around Mars because of a mathematical error that could easily have been avoided.
To better understand what exactly happened and what could be done to reduce the risk of it happening again, we turn to a root cause analysis of the incident.
Performing root cause analysis on any incident, whether it deals with too many rabbits on an island or the loss of a multimillion dollar piece of equipment in space, always begins by defining the problem as precisely as possible, and logging the results on an easy to read problem outline. We do this by specifying what the problem was, when and where it took place, and how it affected the goals of the organization in question–in this case, NASA.
The trick is to approach an incident with an open mind, so that no possible solutions are neglected in the root cause analysis. When we begin by defining the problem, we consider all definitions of “the problem,” and leave it to the analysis phase of root cause analysis to evaluate their impacts. Thus, though asking any group of people to identify the problem will always, without fail, produce multiple different answers, root cause analysis considers all suggestions relevant at this stage. There are good reasons for this. Aside from the benefits of multiple perspectives, considering all possible “problems” from the start eliminates the risk of confirmation bias. Moreover, if we assume we’ve identified the problem before we start, we preempt any real investigation.
As with any root cause analysis, we begin our investigation of the Mars Climate Orbiter loss by defining the problem. The answers are documented on an outline, from which the root cause analysis will grow. Here, we will summarize the problem as being the loss of the Mars Climate Orbiter.
As to when the incident occurred, any differences that were present should be documented in addition to the date and time. This is fundamental to any root cause analysis, as changes can provide clues as to why the failure occurred. In this example, the date is September 23, 1999 and the time is 9:04 a.m. The orbiter’s accident took place in Mars’ upper atmosphere (the physical/geographic location), during orbital insertion (the process). In some cases, there may also be a business location, where the name of the company and business unit is listed. The Outline can be modified as needed to document any relevant location information.
The most significant difference was that this was a new design.
Finally, we detail each individual organization goal that was adversely affected by the incident in question. The goals reflect the ideal state of an organization–here, if NASA was operating at an ideal state, the mission would have gone swimmingly. In this section we record both potential and actual impacts on the goals.
NASA launched the Mars Climate Orbiter in order to accomplish specific mission goals like recording changes on the Martian surface and looking for evidence of past climate change–all goals that required the Mars orbiter remaining intact. The equipment failed before accomplishing any of these goals (we’ll call them Mission goals). NASA lost a lot of money along with the Orbiter, too. The mission cost $327.6 million in total: $193.1 million for spacecraft development, $91.7 million for launch, and $42.8 million for mission operations. These expenses would thus be listed as a negative impact on the Material and Labor Goal.
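As a quick sanity check on those figures, the three line items do sum to the reported total. A trivial sketch (variable names are ours):

```python
# Mission cost line items from the incident reporting, in millions of USD.
spacecraft_development = 193.1
launch = 91.7
mission_operations = 42.8

total = spacecraft_development + launch + mission_operations
# Rounding guards against floating-point drift in the sum.
assert round(total, 1) == 327.6  # matches the reported mission total
```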
Finally, as we mentioned before, NASA’s failed missions, by their very nature, tend to fail very publicly, and in ways that people are able to grasp much more easily than what NASA had been attempting to do. The complete failure of the Mars Climate Orbiter project negatively affected public support for NASA, thus adversely affecting the Customer Service goal.
The final piece of information documented on the outline is the frequency of the incident, or how often it has or is likely to occur. The frequency is a multiplier that helps us to understand the total magnitude of an issue (if NASA lost everything it launched into space, the problem would suddenly be much, much bigger).
Below is a completed Outline for the Mars Climate Orbiter example.
Root cause analysis provides a complete understanding of an incident by breaking it down into simpler elements. Our root cause analysis therefore proceeds by asking, as a two-year-old might, the question “why” repeatedly of each link in the cause-and-effect chain. We begin with one of the goals that was adversely affected by the incident in question, and build our Cause Map from there.
Starting with the mission goal, our root cause analysis opens with this simple cause and effect relationship.
A few “whys” into the root cause analysis, our Cause Map will look something like this: the mission goal failed because the orbiter was lost, because it was subjected to extreme heat, because it entered the atmosphere of Mars.
Now, the Cause Map above is accurate, but essentially useless if we leave it like this. The only possible solution we can suggest at this point is “stay away from Mars,” which makes studying Mars rather tricky.
Rather than leaving our root cause analysis at this oversimplified stage, we can add detail to the Cause Map in a number of ways. We can add causes left to right by asking why, in between existing Causes by taking smaller steps when asking why, and vertically. To produce an effect, more than one cause is usually required. To determine if additional causes should be added vertically, ask “Is this cause sufficient (on its own) to produce the effect?” If the answer is no, more causes should be added. To check if a cause is documented appropriately, ask “Is the cause necessary to produce the effect?” If the answer is yes, the cause is documented correctly. If the answer is no, the cause should not be included.
Continue to ask “why” questions and build the Cause Map until the level of detail is sufficient to understand the issue.
Below is the Mars Climate Orbiter Cause Map with additional detail added. In this example, hitting the gas environment and traveling at a high velocity were both needed to produce the extreme heat that destroyed the orbiter. Both were necessary to create the effect, so both are listed vertically and separated with an “and”.
Our root cause analysis, at this stage, then reads something like this: NASA’s mission goal failed because the Orbiter was lost after being subjected to extreme heat. Why? Because it hit a gas environment and had been traveling at high velocity. While traveling at a high velocity was part of the plan, it hit a gas environment when it entered Martian atmosphere at a lower trajectory than expected.
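To make the structure concrete, the chain above can be sketched as a simple data structure. This is our own illustrative representation (Cause Mapping itself is a visual diagram, and all names below are hypothetical):

```python
# Each effect maps to its causes. Causes listed together are joined with
# "and": each is necessary, and only together are they sufficient.
cause_map = {
    "Mission goal impacted": ["Orbiter lost"],
    "Orbiter lost": ["Extreme heating"],
    "Extreme heating": ["Hit gas environment",          # and
                        "Traveling at high velocity"],
    "Hit gas environment": ["Entered atmosphere lower than expected"],
}

def why_chain(effect):
    """Repeatedly ask 'why', returning every (effect, cause) link found."""
    links = []
    for cause in cause_map.get(effect, []):
        links.append((effect, cause))
        links.extend(why_chain(cause))
    return links

for effect, cause in why_chain("Mission goal impacted"):
    print(f"{effect}  <- why? -  {cause}")
```

Reading the links left to right reproduces the narrative: the mission goal was impacted because the orbiter was lost, because of extreme heating, because it hit a gas environment *and* was traveling at high velocity.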
As much detail as needed can be incorporated by continuing to add causes, and the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. The greater the impact to the goals, the more detailed the Cause Map will be.
To conduct a thorough root cause analysis, each impacted goal should be added to the Cause Map and built up to a sufficient level of detail by using the same method as above.
Below is the Mars Climate Orbiter Cause Map with all impacted goals added to the analysis.
Now that we’ve built up all of the relevant goals, a few more causes need to be added to the right to understand why the trajectory was lower than expected. This is where the middle school math teacher would come in to remind us to work in the same units when performing calculations.
On September 15th, a trajectory correction maneuver had been executed that was supposed to place the Orbiter in a prime position for orbital insertion, at an altitude of 226 kilometers. In the time between this correction maneuver and the actual insertion, though, the navigation team noted that the Orbiter’s altitude might in fact be much lower. The lowest altitude the Orbiter could survive in this maneuver was 80 kilometers. A day before insertion was to take place, calculations had it at 110 kilometers, which should have been fine. Final calculations, however, put the spacecraft within 57 kilometers of the surface, at which point the orbiter disintegrated.
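The altitude estimates above amount to a simple threshold check. A sketch, using the numbers from the narrative (the function name is ours, not from any actual flight software):

```python
# Lowest closest-approach altitude, in kilometers, the orbiter could survive.
MIN_SURVIVABLE_KM = 80

def insertion_assessment(predicted_periapsis_km):
    """Classify a predicted closest-approach altitude for the maneuver."""
    if predicted_periapsis_km >= MIN_SURVIVABLE_KM:
        return "survivable"
    return "fatal: below minimum survivable altitude"

print(insertion_assessment(226))  # planned altitude
print(insertion_assessment(110))  # day-before estimate, still fine
print(insertion_assessment(57))   # final estimate, below the threshold
```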
What went wrong? The Orbiter’s flight system software was designed to take thrust instructions in Newtons (N), a metric unit; the software on the ground, however, generated instructions in pound-force (lbf), an Imperial unit. 1 pound-force equals 4.44822162 Newtons. This, in turn, goes back to what we were saying about NASA’s projects involving many collaborators from all over the country: engineers from Lockheed Martin in Colorado provided the navigation commands for the Orbiter’s thrusters, while scientists at NASA’s Jet Propulsion Laboratory in California operated in metric units, as has long been the standard at NASA.
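A minimal sketch of the mismatch (our own illustrative code, not the actual flight software): the ground software produced thruster impulse figures in pound-force units, and the flight system consumed them as if they were newtons, leaving every value low by a factor of about 4.45.

```python
LBF_TO_N = 4.44822162  # newtons per pound-force

def to_newtons(value_lbf):
    """The conversion the ground software's output was missing."""
    return value_lbf * LBF_TO_N

reported_lbf = 10.0                  # figure produced in pound-force
consumed_as_newtons = reported_lbf   # the bug: lbf read directly as N
correct_newtons = to_newtons(reported_lbf)

# Every ingested value understates the real impulse by ~4.45x, so the
# modeled trajectory slowly diverges from the actual one.
print(correct_newtons / consumed_as_newtons)  # ~4.448
```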
Once the Cause Map is built to a sufficient level of detail with supporting evidence, it can be used to develop solutions. The Cause Map is used to identify all the possible solutions for a given issue so that the best solutions can be selected. It is easier to identify many possible solutions from the detailed Cause Map than from an oversimplified, high-level analysis.
This is an intermediate level Cause Map of the loss of the Mars Climate Orbiter. This example shows what the Cause Map looks like with 40 causes, evidence blocks and some possible solutions added.
With our root cause analysis complete, we can now fully address the question of why the Mars Climate Orbiter was lost. As the Cause Map shows, a number of factors contributed to the loss, the most obvious being a unit error in the software used to help predict the velocity of the Mars Climate Orbiter, which in turn was used to predict the trajectory the Mars Climate Orbiter would take to enter the Martian atmosphere. This was a simple conversion mistake: the results were in pound-force while the program that predicted velocity assumed newtons, a factor of 4.45 difference. The error in the software resulted in the calculated trajectory being higher than the actual trajectory.
Now, this is the cause that got the most press. It makes for an easy joke. Yet root cause analysis serves to lower levels of risk to an acceptable level, which means proposing multiple solutions that could potentially be enacted all along the chain of cause and effect that led to the loss of the orbiter. So, while pointing out the initial conversion issue is fun, it’s not exactly useful: implementing safety nets along the chain of events goes a whole lot further.
For example, consider the scenario in which the initial conversion error was still made. An effective software validation program would have identified and corrected the error before it resulted in a complete loss of the mission. The ineffectiveness of the software test program was thus clearly also a cause of the loss of the orbiter. Additionally, even if the software error was made and the calculated trajectory was wrong, the mission might still have been saved if the lower trajectory had been found earlier. Early identification of the orbiter’s low trajectory would have allowed the team to take action and potentially raise the trajectory. The ability to raise the trajectory was included in the design, but no attempt to change the trajectory was made because the team didn’t understand how low the actual trajectory was going to be, and the necessary planning wasn’t done to allow quick action at the time of Mars insertion.
Another cause is the inherent difficulties associated with space travel, making measurement of the exact trajectory tricky. The difficulty in determining actual trajectory is one of the reasons that NASA relied on the calculated trajectory to make decisions during the mission of the Mars Climate Orbiter.
The NASA investigation also identified a number of areas where the project team wasn’t effective. The NASA reports weren’t particularly detailed in this area, so it is hard to clearly understand what factors contributed to the ineffective team, but it is clear that there were difficulties in several areas. One area where there were issues was communication among team members. The project team consisted of a number of different organizations in different geographic locations. Additionally, inadequate training was a cause that contributed to the loss of the orbiter. The software conversion mistake indicated that the Software Interface Specification, a document which identified what units to use in software, either wasn’t well understood or wasn’t used by the entire team. A project team that was more effective, with adequate staffing, adequate training, and a more clearly defined organization would have increased the likelihood that the errors that resulted in the loss of the Mars Climate Orbiter would have been caught earlier and corrected.
At a cursory glance, the loss of the Mars Climate Orbiter has only one cause. At a more detailed level it has 4, 12, or even 100 causes. The Cause Mapping approach to root cause analysis permits us to zoom in and out to reveal more or less detail as needed. When performing root cause analysis on any issue, it should be worked to a level of detail sufficient to reduce the risk of the incident recurring to an acceptable level. This is why solutions and work processes at a coffee shop are not as thorough or detailed as they would be at an airline or nuclear power facility. The risk, or impact on the goals, dictates how thorough the solutions need to be. Lower-risk incidents warrant relatively less detailed investigations, while a significantly high risk to an organization’s goals requires a much more thorough analysis.
Possible solutions are typically documented on the Cause Map as a green box above the cause it addresses. When proposing the possible solutions, don’t be concerned about limits, boundaries, schedules or financial constraints. All possible solutions should be added to the Cause Map so everybody can see and think through them, select the best among them, and define an action plan and due dates.
The Mars Climate Orbiter was intended to be a less expensive, more efficient way to explore Mars, and was part of a larger effort to make space exploration “faster, better, cheaper.” It was thus a new design created under a new guiding philosophy, and with new designs and guiding philosophies come increased risk. Its loss, coupled with the loss of the Mars Polar Lander in 1999, prompted NASA to rethink what had been the “acceptable levels of risk” in innovation and exploration. The Mars Surveyor (a lander), which was already in production at the time of the Orbiter loss and had been built from the same design, was shelved. Meanwhile, NASA took some time to rethink the processes it had in place to catch mistakes before they culminate in a complete loss of the mission.
Part of the “cheaper” part of NASA’s formula involved eliminating some layers of error-checking that may have saved the Orbiter–after all, the mission had been a complete success until the discrepancy in units sent the Orbiter to its demise. Certainly, the notion that “the mission was a complete success until it failed” may seem self-evident, but if you take a moment again to consider the levels of complexity involved in the Orbiter’s design, engineering, trajectory, and mission, it should become clear that the parts of the Orbiter’s story that worked are indeed impressive. The lessons that NASA took from this setback thus adjusted its notions of acceptable levels of risk when working out a budget: as our root cause analysis has shown, a little more money spent on safety nets would have saved the mission.
Efforts to better understand the red planet and its history have therefore continued, and significant progress has been made. The Mars Odyssey mission (an orbiter), developed by NASA and contracted out to Lockheed Martin, went forward, though under increased scrutiny. It was launched in 2001 and has been a great success, breaking the record for the longest-serving spacecraft at Mars in December 2010. The Mars Reconnaissance Orbiter, launched in 2005, completed most of the Mars Climate Orbiter’s mission goals, and benefited from the lessons its predecessor left NASA: plenty of ground tests and independent reviews.
As our root cause analysis of the Mars Climate Orbiter incident has shown, even sophisticated organizations can make simple mistakes. The key, and the goal of root cause analysis, is to produce possible solutions at multiple points in the chain of causes and effects so that the level of risk is reduced to an acceptable level. Our root cause analysis has shown that even if a problem arises and errors are made, solutions can be enacted that would catch the error before its consequences become catastrophic.
The Cause Mapping approach to root cause analysis focuses on the basics of the cause-and-effect principle so that it can be applied consistently to day-to-day issues as well as catastrophic, high-risk issues. The steps of Cause Mapping are the same, but the level of detail is different. Focusing on the basics of the cause-and-effect principle makes the Cause Mapping approach to root cause analysis a simple and effective method for investigating safety, environmental, compliance, customer, production, equipment, or service issues.
Want to see more NASA-related Cause Maps? Check out our root cause analysis of the fire aboard Apollo 1 or the space shuttle Columbia disaster.
The images used were produced by NASA. Use of these images is not meant to imply NASA endorsement of Cause Mapping. Many more images of the Climate Orbiter and detailed information on the mission are available on the NASA website.
Information used for the write up is from:
Mars Climate Orbiter Mishap Investigation Board Phase I Report (dated November 10, 1999)
Report on Project Management in NASA by the Mars Climate Orbiter Mishap Investigation Board (dated March 13, 2000)
Schedule a workshop at your location to train your team on how to lead, facilitate, and participate in a root cause analysis investigation.