In the IT world, it’s a given that problems will arise in the systems and procedures that power an organization. The infrastructure and applications that a business depends on have multiple layers of complexity, any of which can malfunction for a wide variety of reasons. Chaos theory, and the butterfly effect associated with it, show that very small changes in a system’s starting conditions can have monumental effects on its outcome.
These concepts help explain why it is so difficult to control and make predictions regarding complex systems. The computing environment in which IT professionals make their living often provides a prime example of the butterfly effect in action. One small change or error can have disastrous consequences that cripple a system and, in some cases, the whole organization.
Many IT professionals have made a career based on the fact that computer systems are not infallible. Addressing and fixing problems is one of the primary tasks of a large percentage of information technology workers from DBAs to tech support specialists. If you are currently in one of those types of roles, your job is safe. Perfectly designed and self-correcting systems are not on the immediate horizon.
What is Root Cause Analysis?
When an organization experiences an information system problem that impacts its operations, there are compelling reasons to identify the underlying causes of the issue. Root cause analysis (RCA) provides a framework within which the origins of the problem can be investigated. The purpose of RCA is to find the source of the problem and establish procedures to minimize its recurrence.
Larger enterprises may have regularly scheduled RCA meetings where the most pressing issues affecting the business are discussed with the intention of eliminating their recurrence. Smaller companies may still engage in root cause analysis in a less structured manner.
The goal of an RCA is to identify the how and why behind problems affecting your computer systems. A root cause analysis of an issue consists of the following steps:
Data collection – The quality of data collection is a key factor in the ability to perform a thorough RCA. This is often the most time-consuming step in the process.
Causal factor charting – Organizing the information gathered in the data collection phase is the next step in an RCA. This organization needs to be flexible enough to thoroughly address all of the interconnected parts that may have contributed to the problem being investigated. Obtaining all of the information required to create the chart may entail further data collection efforts.
Root cause identification – Taking all of the collected information into account, the root cause of an issue is identified in this step of the process. A root cause map or another organizational device may be used to facilitate this identification.
Developing and implementing recommendations – Based on the preceding steps, recommendations to correct broken procedures or address oversights which contributed to the problem are established and disseminated to stakeholders.
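As a rough illustration of the charting and identification steps above, a causal factor chart can be modeled as a mapping from each factor to the factors that contributed to it. The sketch below is only one possible representation: the incident, the factor names, and the find_root_causes helper are all hypothetical, and it treats any factor with no recorded contributing cause as a root cause candidate.

```python
from collections import deque

# Hypothetical causal factor chart: each factor maps to the factors
# that contributed to it. An empty list means no deeper cause was recorded.
causal_chart = {
    "checkout page timed out": ["database query ran slowly"],
    "database query ran slowly": [
        "missing index on orders table",
        "statistics were stale",
    ],
    "missing index on orders table": ["schema change skipped review"],
    "statistics were stale": [],
    "schema change skipped review": [],
}


def find_root_causes(chart, problem):
    """Walk the chart breadth-first from the problem and return every
    reachable factor that has no recorded contributing cause."""
    roots = []
    seen = {problem}
    queue = deque([problem])
    while queue:
        factor = queue.popleft()
        causes = chart.get(factor, [])
        if not causes and factor != problem:
            roots.append(factor)
        for cause in causes:
            if cause not in seen:  # guard against cycles in the chart
                seen.add(cause)
                queue.append(cause)
    return roots


root_causes = find_root_causes(causal_chart, "checkout page timed out")
print(root_causes)
```

A candidate found this way is only as good as the data collected. If the chart is incomplete, a factor may look like a root cause simply because nobody has asked what led to it, which is why further data collection is sometimes needed.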
One widely used method of conducting an RCA employs a template designed to identify the ‘five whys’ behind the causes of the event under analysis. This is an iterative process in which the most obvious reasons for the issue are determined first. Each one is then analyzed further in an attempt to uncover any factors that may have contributed to the high-level findings. There are usually less conspicuous elements at the root of the problem being addressed.
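The iterative questioning can be sketched in a few lines of code. This is a simplified illustration rather than a standard template: the five_whys function, its max_depth parameter, and the sample incident are hypothetical, and it follows a single answer at each level, whereas a real analysis often branches into several contributing factors.

```python
def five_whys(problem, answers, max_depth=5):
    """Follow a chain of 'why?' answers from a problem statement.

    `answers` maps a statement to the answer recorded when the team
    asked why it happened. The walk stops after `max_depth` questions
    or when no further answer has been recorded.
    """
    chain = [problem]
    current = problem
    while len(chain) - 1 < max_depth and current in answers:
        current = answers[current]
        chain.append(current)
    return chain


# Hypothetical incident used only to exercise the function.
answers = {
    "nightly report failed": "ETL job ran out of disk space",
    "ETL job ran out of disk space": "temp files were never cleaned up",
    "temp files were never cleaned up": "cleanup script was disabled",
    "cleanup script was disabled": "it was turned off during a migration",
    "it was turned off during a migration": "migration checklist omits scheduled jobs",
}

chain = five_whys("nightly report failed", answers)
for depth, statement in enumerate(chain):
    label = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {statement}")
```

The last entry in the chain is the deepest cause recorded so far. The max_depth limit keeps the walk bounded and can be raised or lowered when an investigation calls for more or fewer questions.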
Why Five Whys?
There are some instances of a root cause analysis that do not lend themselves to going five levels deep to find the underlying cause behind the issue. At other times, five answers may not be enough and still leave questions open regarding the true reason for the outage being analyzed.
The purpose of specifying five questions is to illustrate that a problem is often preceded by a sequence of contributing events, each of which seems harmless on the surface. Failure to address and correct these seemingly innocuous factors will lead to an unsuccessful resolution of the specific issue and leave the organization open to similar problems in the future.
Providing Root Cause Identification
A DBA has several tasks related to root cause analysis when problems impact their systems and databases. They will be called upon to provide information during the data collection phase of the analysis, to help identify the issue, and to implement the final recommendations established by the process.
Tools such as IDERA’s Precise Application Performance Platform can prove to be instrumental in accurately identifying the cause of performance problems. This is accomplished by the ability to drill down into database transactions to find areas of concern such as poorly designed SQL queries and resource shortages. The tool provides detailed steps regarding access paths and furnishes statistics to help understand and isolate the offending system components.
Reducing the time it takes to identify the root cause of problems is one of the major benefits of Precise, according to its users. In addition to the many other advantages that a DBA can enjoy by using this tool, it will make it much easier to navigate those RCA meetings and requests for information. This can only work to simplify the busy life of your database team.