The Use of Root Cause Analysis in Conducting Major Problem Reviews (Part One)

Many organizations have, from time to time, experienced significant disruption of their IT services - major incidents. This article examines how an organization might turn this to their advantage. Very often a simple initial failure is made worse by other, unrelated, failures; these might be failures of hardware, software, people or process. The article expands on the material covered during accredited ITIL training courses and describes a systematic way of analyzing chains of events and identifying specific improvements that will address not only the original cause but also the subsequent failures.

Root Cause Analysis

The Service Operation volume of the IT Infrastructure Library recommends that every major problem should be reviewed to learn lessons for the future. However it gives little or no guidance on how this might be done. Root Cause Analysis is an excellent technique for addressing the issues identified in Service Operation:

- What was done correctly
- What was done wrong
- What could be done better in future
- How to prevent recurrence
- Whether there has been any third-party responsibility and whether follow-up actions are required

The phrase 'root cause analysis' is often used in a general sense to describe the activity of identifying the underlying cause of an incident (and this is the sense that it appears to be used in the Glossary of Service Operation). However, the name Root Cause Analysis (RCA) is also given to a specific technique that is intended for use in investigating a series of actions or occurrences that lead to an undesired outcome.

It is particularly useful where a number of contributory causes might be involved; it helps the analyst to avoid the common mistake of becoming fixated on a single cause (usually the very first event). This technique is particularly useful in reviewing a Major Problem which might have several contributory causes, and whose impact might be made worse by the way it is handled. RCA not only helps us to identify the factors that lead to the loss of service, but also to understand how our response to the incident might have contributed to the overall impact.

RCA helps to identify not only what happened and how it happened but also why. Only by understanding why will we be able to devise workable corrective measures. For instance, suppose a network technician disconnects a working router rather than a broken one. A typical investigation might conclude that human error was the cause and recommend better training or that technicians should take more care but neither of these is likely to prevent future occurrences. RCA assumes that mistakes do not just happen but that they have specific causes, and would ask 'why?' In the case of the poor network technician the RCA analyst might ask 'was the router properly labelled?', 'was the technician told which router was faulty?', 'is there a recognized procedure for deciding whether a router is working or not?', 'did the technician know what it was?'.

Root causes have four characteristics:

1. They are specific causes: 'human error', for example, is too general.
2. They are causes that can reasonably be identified: RCA must be cost beneficial so the analyst must know when to stop the investigation.
3. They are within the control of the management of the organization. The analyst is looking for causes that can be addressed by the organization. Although adverse weather conditions might very well have triggered the incident, we cannot do anything to affect the weather and so that is not an appropriate root cause. We can of course do something about how we are impacted by adverse weather and perhaps our root causes might lie there.
4. They can be addressed by specific solutions. A vague recommendation such as 'ensure that technicians follow defined procedures' probably means that more thought needs to be given to identifying a specific cause.

I shall discuss the four phases of RCA in part two.
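
For readers who like to see a checklist written down, the four characteristics of a root cause described in this article can be sketched as a simple screening test. This is only an illustration and not part of ITIL or of any RCA tool: the class name, the field names and the three example causes (including the "no identifying labels" finding) are hypothetical, and the true/false judgements are ones an analyst would have to make for a real incident.

```python
from dataclasses import dataclass


@dataclass
class CandidateCause:
    """A cause proposed during a review (hypothetical structure)."""
    description: str
    is_specific: bool            # 1. specific, unlike 'human error'
    is_identifiable: bool        # 2. can reasonably be identified
    is_within_control: bool      # 3. within management's control
    has_specific_solution: bool  # 4. addressable by a specific solution


def failed_characteristics(cause):
    """Return the names of the characteristics the candidate does not meet."""
    checks = [
        ("specific", cause.is_specific),
        ("reasonably identifiable", cause.is_identifiable),
        ("within management control", cause.is_within_control),
        ("specific solution", cause.has_specific_solution),
    ]
    return [name for name, ok in checks if not ok]


def is_root_cause(cause):
    """A candidate counts as a root cause only if it meets all four."""
    return not failed_characteristics(cause)


# Illustrative judgements for the examples used in the article.
human_error = CandidateCause("Human error", False, True, True, False)
weather = CandidateCause("Adverse weather", True, True, False, False)
# Hypothetical finding from asking 'was the router properly labelled?'
unlabelled = CandidateCause(
    "Routers in the rack carry no identifying labels", True, True, True, True
)

for cause in (human_error, weather, unlabelled):
    print(cause.description, "->", failed_characteristics(cause) or "root cause")
```

A candidate that fails any check is a prompt to keep asking 'why?' rather than a conclusion in its own right.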