Problem Management Process Design and Root Cause Analysis in Complex Organizations

Problem Management Root Cause Analysis

Date: 10 December 2025
Client Question: What are best practices for structuring a problem management process in a large, complex organization? Are there alternative methodologies to ITIL for problem management?
What tools and techniques can technical teams use to identify root causes, rather than applying temporary fixes? The client seeks to move beyond incident management and implement sustainable, organization-wide problem resolution.

Dear Isac,

Thank you for your inquiry on “Problem Management Process Design and Root Cause Analysis in Complex Organizations”.

The response below provides an overview of best practices for problem management, proactive root cause analysis, integration with ITSM processes, and data-driven improvement to reduce recurring incidents and support workload. For additional insights, we recommend scheduling a follow-up with our I&O experts. We hope you find this response useful!

Problem Management

In many organizations, problem management focuses narrowly on creating root cause analysis (RCA) write-ups for highly visible incidents, and little attention is given to incident prevention. It is also important to distinguish problem management from a separate but related ITSM practice: incident management. Many organizations confuse the two, which often leads to suboptimal results for both. The difference can be illustrated with the analogy of a person who addresses weeds (i.e., incidents) in their lawn by mowing over them (i.e., incident resolution) instead of pulling them out by the roots (i.e., problem resolution) (see Figure 1). Refer to ITSM Best Practices for Problem Management.

Problem management and incident management are distinct but complementary practices; when they are paired, the causes of incidents can be identified and addressed at the root. Refer to Solution Path for Implementing and Optimizing Foundational ITSM Practices.

Figure 1: Contrasting Problem Management and Incident Management

Mowing weeds symbolizes incident management, addressing symptoms but not causes. Removing roots represents problem resolution, targeting underlying issues to prevent recurrence and growth of incidents.

Best Practices

Transition to a problem management practice that actively seeks out and addresses the common underlying causes of high-volume incidents to improve the health of your services and reduce the demand on your support staff.
Improve incident handling by establishing a known error database (KEDB) to manage problems and workarounds. Applying workarounds to known issues will reduce the mean time to repair (MTTR) by enabling support staff to quickly identify and address these recurring incidents.
Focus your efforts on solving problems with the greatest ROI within your domain of influence by balancing the impact of a problem against the effort required to resolve underlying causes.
Measure and communicate the progress and the impact of your actions by comparing the results of your efforts with baseline metrics. Let the data tell your story.
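The prioritization and measurement practices above can be illustrated with a small sketch. It is not part of the Gartner framework; it simply assumes that impact is approximated by the support hours a problem consumes each month and that effort is a one-time estimate in hours. The `Problem` fields, the payback calculation and the sample backlog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    # All fields are illustrative assumptions, not a standard ITSM schema.
    name: str
    monthly_incidents: int      # incidents attributed to this problem per month
    avg_handling_hours: float   # average support effort per incident
    fix_effort_hours: float     # estimated one-time effort to remove the cause


def monthly_hours_saved(p: Problem) -> float:
    """Support hours per month that a permanent fix would free up."""
    return p.monthly_incidents * p.avg_handling_hours


def payback_months(p: Problem) -> float:
    """Months until the fix effort is repaid by avoided incident handling.

    Assumes the problem generates at least one incident per month.
    """
    return p.fix_effort_hours / monthly_hours_saved(p)


def rank_by_payback(problems: list[Problem]) -> list[Problem]:
    """Shortest payback first, i.e. highest impact relative to effort."""
    return sorted(problems, key=payback_months)


def percent_reduction(baseline: float, current: float) -> float:
    """Improvement against a baseline metric, expressed as a percentage."""
    return (baseline - current) / baseline * 100.0


# Hypothetical backlog used only to show the ranking.
backlog = [
    Problem("VPN client drops", 120, 0.5, 80.0),
    Problem("Batch job timeout", 15, 4.0, 300.0),
    Problem("Password sync lag", 200, 0.25, 20.0),
]
ranked = rank_by_payback(backlog)

for p in ranked:
    print(f"{p.name}: payback in {payback_months(p):.1f} months")

# Example of reporting progress against a baseline of 120 incidents per month.
print(f"Incident reduction: {percent_reduction(120, 90):.0f}%")
```

In practice, the impact measure would likely also reflect business criticality and downtime cost, and the ranking should be limited to problems within the team's domain of influence.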

Gartner’s framework consists of a prework phase followed by four action phases (see Figure 2). The framework can be used to implement a new problem management practice or enhance an existing one. For new practices, Gartner recommends following the framework phases in order. For existing practices, use the framework in its entirety or portions of the framework, and choose an order that best aligns with your most pressing needs.

Figure 2: The Guidance Framework

The guidance framework outlines five stages: prework, identification and creation, mitigation, resolution, and optimization. Key steps include defining scope, identifying problems, leveraging knowledge bases, choosing solutions, and optimizing management impact.

Refer to ITIL v3 Process Cheat Sheets. Use the cheat sheets, which cover 21 ITIL processes, to build background on ITIL v3.

For Recurring Incidents and Root Cause Analysis

Organizations must address the causes of their recurring incidents to reduce their impact on the organization. Take the following steps to implement a problem management practice that actively seeks out and addresses causes of recurring issues to reduce the volume and impact of future incidents:

Align the scope of your practice with its intended outcomes. Avoid getting into a situation where the problem management team is identifying problems it can’t address.
Surround your problem manager with a high-performing team of problem solvers to launch your practice. Problem management is a team sport that relies on the ability of multidisciplinary teams to work together on identifying problems and remediating the underlying causes to make a positive impact.
Integrate problem management with complementary practices:

Incident management: Require all incident tickets to have at least one associated CI record from the CMDB. This will help in identifying the source of recurring incidents.
Knowledge management: Leverage your existing knowledge management practice to manage workarounds.
Change management: Integrate with change management by ensuring all fixes that impact entities under change control are logged as changes.

Ensure your incident management is capturing useful data for the problem management practice by implementing a final categorization step. These small enhancements can have a big effect on your practices. For example, since little is known about incidents at creation, they are often miscategorized, which leads to poor data quality. Poor incident data hampers process improvement efforts and the ability to proactively identify the source of your recurring issues — a key activity within your problem management practice. Ensure you are collecting high-quality incident data by requiring the incident resolver to validate what was actually broken.
Avoid harvesting and creating a bunch of problem tickets that will sit idle waiting for available resources to work the backlog. This will waste time, demoralize the team and mismanage the expectations of stakeholders. Instead, pick a problem that generates a high volume of incidents and get started. By targeting high-volume incident drivers, you will immediately begin freeing up your organization’s capacity, while improving the overall health of your technology.
The recurrence of incidents lets us know a problem exists, but the actual causes of the problem are often difficult to pinpoint. The sheer volume and complexity of data generated by humans and various platforms (such as system logs, engineering and development tools, observability platforms, and IT Service Management (ITSM) systems) present significant challenges for traditional RCA techniques (see Figure 3). These techniques often rely heavily on manual interpretation, which can be time-consuming and prone to human oversight, especially when dealing with unstructured data and intricate cause-and-effect relationships.

Figure 3: Root Cause Analysis Techniques

This figure depicts various root cause analysis techniques.
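The earlier recommendations in this section (require a CI on every incident, validate the category at resolution, and start with a high-volume incident driver) can be sketched as a simple analysis of an incident export. This is only an illustration: the field names (`ci`, `initial_category`, `final_category`) and the sample records are assumptions and do not correspond to any particular ITSM tool's schema.

```python
from collections import Counter

# Hypothetical incident export; real ITSM tools use their own field names.
incidents = [
    {"id": "INC001", "ci": "mail-gateway-01",
     "initial_category": "network", "final_category": "certificate"},
    {"id": "INC002", "ci": "mail-gateway-01",
     "initial_category": "email", "final_category": "certificate"},
    {"id": "INC003", "ci": "hr-portal",
     "initial_category": "access", "final_category": "access"},
    {"id": "INC004", "ci": "mail-gateway-01",
     "initial_category": "certificate", "final_category": "certificate"},
    {"id": "INC005", "ci": None,
     "initial_category": "hardware", "final_category": "hardware"},
]


def top_incident_drivers(records, n=3):
    """Count incidents per (CI, validated category) and return the top n.

    Incidents without a CI are skipped, which is why requiring a CI on
    every ticket matters for this kind of analysis.
    """
    counts = Counter(
        (r["ci"], r["final_category"]) for r in records if r.get("ci")
    )
    return counts.most_common(n)


def miscategorization_rate(records):
    """Share of incidents whose category changed at resolution."""
    changed = sum(
        1 for r in records if r["initial_category"] != r["final_category"]
    )
    return changed / len(records)


def missing_ci(records):
    """IDs of incidents that cannot be traced to a CI."""
    return [r["id"] for r in records if not r.get("ci")]


drivers = top_incident_drivers(incidents, n=1)
print("Top driver:", drivers)
print("Miscategorized at creation:", miscategorization_rate(incidents))
print("Incidents without a CI:", missing_ci(incidents))
```

Grouping on the category validated by the resolver, rather than the category chosen at creation, keeps the three mail gateway incidents together even though they were logged under three different initial categories.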

For guidance on leveraging GenAI in problem management practices, see Infographic: Enhance Root Cause Analysis with GenAI.

Additional Note(s) to Read:

ITSM Best Practices for Major Incident Management

Note: If images are not displayed on your device, please try viewing them on a laptop. Alternatively, consult the associated research note for the necessary information.

Note: This research note is archived and some of the insights may not be relevant as of today.

Hope this helps!!
