How to conduct Root Cause Analysis

Root cause analysis (RCA) is a method of finding and solving the underlying causes of a problem, rather than just treating the symptoms. RCA can help you prevent the problem from recurring, improve the quality of your products or services, and save time and money. In this blog post, we will show you how to conduct a root cause analysis in six steps, using a running example along the way.

Identify Population

The first step is to identify the population that is affected by the problem. This includes the policies, processes, people, places, resources, assets, risks, and liabilities that are involved or impacted by the problem. By defining the scope and boundaries of the problem, you can focus your analysis on the relevant factors and avoid unnecessary distractions.

For example, suppose you are a manager at a software company and you want to analyze why your customers are experiencing frequent crashes and errors in your application. The population of the problem could be:

  • Policies: the quality standards, testing procedures, and customer feedback mechanisms that you have in place
  • Processes: the development, deployment, and maintenance workflows that you follow
  • People: the developers, testers, support staff, and customers that are involved or affected by the problem
  • Places: the locations where the problem occurs, such as the servers, devices, or networks that host or run your application
  • Resources: the tools, technologies, and platforms that you use to build, test, and deliver your application
  • Assets: the intellectual property, code, data, and documentation that you own or rely on for your application
  • Risks: the potential negative consequences or losses that the problem could cause, such as customer dissatisfaction, reputation damage, or legal liability
  • Liabilities: the obligations or responsibilities that you have to your customers, partners, or regulators regarding the problem

Collect Data

The next step is to collect data related to the problem. This includes both reactive and proactive data. Reactive data is the information that you gather after the problem has occurred, such as feedback, complaints, logs, or reports. Proactive data is the information that you collect before or during the problem, such as surveys, tests, audits, or observations. You can use various methods to collect data, such as automation, interviews, questionnaires, or inspections.

The purpose of collecting data is to understand the problem better, measure its frequency and severity, and identify its possible causes. You should collect as much relevant and reliable data as possible, but also avoid information overload or bias.

For example, to collect data about the crashes and errors in your application, you could:

  • Use automation tools to monitor the performance, availability, and usage of your application
  • Conduct surveys or interviews with your customers to get their feedback and satisfaction levels
  • Review the logs, reports, or tickets that record the incidents and errors that occurred
  • Perform tests or audits on your code, data, or infrastructure to check for defects, vulnerabilities, or inconsistencies
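As an illustration of reviewing logs, here is a minimal Python sketch that counts log entries by level and tallies the most frequent error messages. The log format (timestamp, level, message separated by spaces) and the sample lines are assumptions made for this example; real application logs will need their own pattern.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")

# Made-up sample data standing in for a real log file
sample_log = """\
2024-03-01T09:15:02 ERROR payment timeout
2024-03-01T09:15:07 INFO request served
2024-03-01T09:16:44 ERROR payment timeout
2024-03-01T09:20:13 WARNING slow query
not a structured line
"""


def parse(log_text):
    """Return one dict per line that matches the expected format."""
    records = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that do not fit the format
            records.append(match.groupdict())
    return records


records = parse(sample_log)

# How many entries of each level were recorded?
level_counts = Counter(r["level"] for r in records)

# Which error messages occur most often?
error_messages = Counter(r["message"] for r in records if r["level"] == "ERROR")

print(level_counts)
print(error_messages.most_common(3))
```

Counting by level gives a rough measure of frequency, and the tally of error messages points to the incidents worth recording first.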

Record Incidents

The third step is to record the incidents that are related to the problem. An incident is an event or occurrence that deviates from the expected or desired outcome. There are different types of incidents, such as:

  • Non-conformance: an incident that violates a specification, standard, or requirement
  • Quality problem: an incident that affects the functionality, usability, or reliability of a product or service
  • Inability: an incident that prevents a person, process, or system from performing a task or achieving a goal
  • Outage: an incident that interrupts or reduces the availability or accessibility of a product or service
  • Disaster: an incident that causes significant damage, harm, or loss to a person, process, or system

You should record the incidents in a systematic and consistent way, using a format that captures the essential information, such as:

  • What happened?
  • When did it happen?
  • Where did it happen?
  • Who was involved or affected?
  • How did it happen?
  • Why did it happen?

You can use tools such as spreadsheets, databases, or software to record and organize the incidents.
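If you prefer to keep the record in code, the questions above can be turned into a small data structure. The sketch below is one possible layout in Python, not a standard schema: the `Incident` class, its field names, and the sample incidents are all hypothetical. It writes the records as CSV text so they can be opened in a spreadsheet.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime


@dataclass
class Incident:
    what: str                       # What happened?
    when: datetime                  # When did it happen?
    where: str                      # Where did it happen?
    who: str                        # Who was involved or affected?
    how: str                        # How did it happen?
    why: str = "unknown"            # Why did it happen? Filled in as the analysis progresses
    kind: str = "quality problem"   # One of the incident types listed earlier


def to_csv(incidents):
    """Serialize incidents to CSV text with one column per question."""
    buffer = io.StringIO()
    fieldnames = [f.name for f in fields(Incident)]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for incident in incidents:
        row = asdict(incident)
        row["when"] = incident.when.isoformat()  # store timestamps as ISO 8601 text
        writer.writerow(row)
    return buffer.getvalue()


# Made-up incidents for illustration only
incidents = [
    Incident("App crashed on checkout", datetime(2024, 3, 1, 9, 15),
             "server-a", "customer 1042", "unhandled exception"),
    Incident("Login page unavailable", datetime(2024, 3, 2, 14, 30),
             "server-b", "all users", "service did not restart", kind="outage"),
]

csv_text = to_csv(incidents)
print(csv_text)
```

Using the same fields for every incident is what makes the sorting, grouping, and filtering in the next step possible.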

Identify Patterns

The fourth step is to identify patterns in the incidents that you have recorded. A pattern is a regular or repeated occurrence or behavior that indicates a relationship or connection between the incidents. By identifying patterns, you can narrow down the possible causes of the problem and focus on the most relevant or significant ones.

There are different ways to identify patterns, such as:

  • Sorting: arranging the incidents by a certain attribute or criterion, such as date, location, user, error, or cause
  • Grouping: classifying the incidents into categories or clusters based on their similarities or differences
  • Filtering: selecting or excluding the incidents based on a condition or threshold, such as frequency, severity, or impact
  • Comparing: contrasting the incidents with each other or with a baseline or benchmark, such as expected or desired outcome, previous or current performance, or industry or best practice
  • Visualizing: displaying the incidents using graphs, charts, or diagrams to show their distribution, trends, or correlations
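The first four techniques can be sketched in a few lines of Python using only the standard library. The incident records, the severity scale, and the baseline of one incident per day are invented for this example.

```python
from collections import Counter

# Made-up incident records; severity is on a hypothetical 1 (low) to 5 (high) scale
incidents = [
    {"date": "2024-03-01", "location": "server-a", "error": "timeout", "severity": 3},
    {"date": "2024-03-01", "location": "server-b", "error": "null pointer", "severity": 2},
    {"date": "2024-03-02", "location": "server-a", "error": "timeout", "severity": 4},
    {"date": "2024-03-03", "location": "server-a", "error": "timeout", "severity": 5},
    {"date": "2024-03-03", "location": "server-c", "error": "disk full", "severity": 1},
]

# Sorting: most severe incidents first
by_severity = sorted(incidents, key=lambda i: i["severity"], reverse=True)

# Grouping: count incidents per (location, error) pair
errors_by_location = Counter((i["location"], i["error"]) for i in incidents)

# Filtering: keep only incidents at or above a severity threshold
severe = [i for i in incidents if i["severity"] >= 3]

# Comparing: observed incidents per day against an assumed baseline
baseline_per_day = 1.0
days_observed = len({i["date"] for i in incidents})
observed_per_day = len(incidents) / days_observed

print(errors_by_location.most_common(1))
print(f"{observed_per_day:.2f} incidents/day vs. baseline {baseline_per_day:.2f}")
```

In this sample, the grouping step shows that timeouts on `server-a` account for three of the five incidents, which is the kind of cluster that deserves a closer look in the next step.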

Locate Problem Areas

The fifth step is to locate the problem areas that are causing or contributing to the incidents. A problem area is a factor or element that influences or affects the outcome or performance of a person, process, or system. There are different types of problem areas, such as:

  • Physical: a problem area that involves the tangible or material aspects of a person, process, or system, such as equipment, infrastructure, or environment
  • Logical: a problem area that involves the intangible or abstract aspects of a person, process, or system, such as data, code, or logic
  • Human: a problem area that involves the behavioral or psychological aspects of a person, process, or system, such as skills, knowledge, or attitude
  • Organizational: a problem area that involves the structural or functional aspects of a person, process, or system, such as roles, responsibilities, or communication

You should locate the problem areas by tracing the causal chain or sequence of events that led to the incidents. You can use various techniques to do this, such as:

  • Five whys: a technique that involves asking “why” repeatedly until you reach the root cause of a problem
  • Fishbone diagram: a technique that involves drawing a diagram that resembles a fishbone, with the problem as the head and the problem areas as the bones
  • Fault tree analysis: a technique that involves drawing a diagram that resembles a tree, with the problem as the root and the problem areas as the branches
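To make the five whys concrete, the sketch below walks a chain of cause-and-effect statements in Python. The `five_whys` helper and the causal chain are hypothetical; in practice each answer comes from investigation and discussion, not from a lookup table.

```python
def five_whys(problem, causes, max_depth=5):
    """Follow the 'why' chain from a problem until no deeper cause is known.

    `causes` maps each statement to the answer to "why did this happen?".
    `max_depth` caps the number of questions, which also guards against loops.
    """
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        if current not in causes:
            break  # no further cause recorded: treat this as the root cause
        current = causes[current]
        chain.append(current)
    return chain


# Invented causal chain for the running example
causes = {
    "application crashes on checkout": "payment service times out",
    "payment service times out": "database connection pool is exhausted",
    "database connection pool is exhausted": "connections are not released after errors",
    "connections are not released after errors": "error-handling path skips cleanup",
    "error-handling path skips cleanup": "review checklist does not cover resource cleanup",
}

chain = five_whys("application crashes on checkout", causes)
for depth, statement in enumerate(chain):
    prefix = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{prefix}: {statement}")
```

Note how the chain in this example moves from a logical problem area (code) to an organizational one (the review process), which is typical when the questioning goes deep enough.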

Apply Short-Term Fixes

The sixth step is to apply short-term fixes to the problem areas that you have located. A short-term fix is a solution that addresses the immediate or urgent effects or symptoms of a problem, but does not eliminate or prevent the root cause of the problem. Short-term fixes are also known as quick fixes, temporary fixes, or workarounds.

The purpose of applying short-term fixes is to reduce the frequency or severity of the incidents, improve the satisfaction or safety of the customers or users, and buy time and resources to implement long-term solutions. You should apply short-term fixes that are feasible, effective, and efficient, while avoiding fixes that create new problems or dependencies.

For example, some short-term fixes for the crashes and errors in your application could be:

  • Restarting the server or device that is causing the problem
  • Restoring the database from a backup or repairing the corrupted data
  • Updating the firewall configuration or installing a compatible version of the application
  • Providing alternative or manual methods or instructions to the customers or users
  • Offering compensation or apology to the customers or users

Re-Recording Incidents (Fine Tuning)

After applying short-term fixes, you should re-record the incidents that are related to the problem. This means repeating steps 2 and 3 of the root cause analysis process with the updated data and information. The purpose of re-recording the incidents is to evaluate the effectiveness and efficiency of the short-term fixes.
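One simple way to evaluate a fix is to compare the incident rate before and after it was applied. The Python sketch below assumes two observation windows of equal length; the helper names and the incident counts are made up for illustration.

```python
def incident_rate(count, days):
    """Average number of incidents per day over an observation window."""
    if days <= 0:
        raise ValueError("days must be positive")
    return count / days


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means improvement)."""
    if before == 0:
        raise ValueError("cannot compute a relative change from a zero baseline")
    return (after - before) / before


# Hypothetical counts: 14 days before the fix and 14 days after it
rate_before = incident_rate(count=42, days=14)
rate_after = incident_rate(count=14, days=14)
change = relative_change(rate_before, rate_after)

print(f"Before: {rate_before:.1f}/day, after: {rate_after:.1f}/day, change: {change:.0%}")
```

A drop in the rate suggests that the fix is helping, but because a short-term fix does not remove the root cause, the remaining incidents should still feed back into the analysis.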

