Many organizations have, from time to time, experienced significant disruption of their IT services - major incidents. This article examines how an organization might turn this to its advantage.

Very often a simple initial failure is made worse by other, unrelated, failures; these might be failures of hardware, software, people or process. The article expands on the material covered during accredited ITIL training courses and describes a systematic way of analyzing chains of events and identifying specific improvements that will address not only the original cause but also the subsequent failures.

Root Cause Analysis

The Service Operation volume of the IT Infrastructure Library recommends that every major problem should be reviewed to learn lessons for the future. However, it gives little or no guidance on how this might be done.

Root Cause Analysis is an excellent technique for addressing the issues identified in Service Operation:

- What was done correctly
- What was done wrong
- What could be done better in future
- How to prevent recurrence
- Whether there has been any third-party responsibility and whether follow-up actions are required

The phrase 'root cause analysis' is often used in a general sense to describe the activity of identifying the underlying cause of an incident (and this is the sense in which it appears to be used in the Glossary of Service Operation). However, the name Root Cause Analysis (RCA) is also given to a specific technique that is intended for use in investigating a series of actions or occurrences that lead to an undesired outcome.

It is particularly useful where a number of contributory causes might be involved; it helps the analyst to avoid the common mistake of becoming fixated on a single cause (usually the very first event). This makes the technique well suited to reviewing a Major Problem, which might have several contributory causes and whose impact might be made worse by the way it is handled.

RCA not only helps us to identify the factors that lead to the loss of service, but also to understand how our response to the incident might have contributed to the overall impact.

RCA helps to identify not only what happened and how it happened but also why. Only by understanding why will we be able to devise workable corrective measures. For instance, suppose a network technician disconnects a working router rather than a broken one. A typical investigation might conclude that human error was the cause and recommend better training, or that technicians should take more care, but neither of these is likely to prevent future occurrences. RCA assumes that mistakes do not just happen but that they have specific causes, and would ask 'why?' In the case of the poor network technician the RCA analyst might ask 'was the router properly labelled?', 'was the technician told which router was faulty?', 'is there a recognized procedure for deciding whether a router is working or not?', 'did the technician know what it was?'.

Root causes have four characteristics:

1. They are specific causes: 'human error', for example, is too general.
2. They are causes that can reasonably be identified: RCA must be cost-beneficial, so the analyst must know when to stop the investigation.
3. They are within the control of the management of the organization. The analyst is looking for causes that can be addressed by the organization. Although adverse weather conditions might very well have triggered the incident, we cannot do anything to affect the weather and so that is not an appropriate root cause. We can of course do something about how we are impacted by adverse weather and perhaps our root causes might lie there.
4. They can be addressed by specific solutions. A vague recommendation such as 'ensure that technicians follow defined procedures' probably means that more thought needs to be given to identifying a specific cause.

I shall discuss the four phases of RCA in part two.