Misunderstood and Misused - A Rant About Problem Management

It’s interesting to see how members of different communities can view a practice so differently. That is currently the case with problem management, the practice of identifying, removing, or mitigating the causes of, or contributing factors to, service disruptions. For the most part, the IT service management (ITSM) community recognizes the value of problem management. They may admittedly struggle to find the time or resources needed to perform the practice. Or they may find it difficult to justify the actions needed to introduce permanent solutions identified as a result of the practice. But they, for the most part, value the practice.

Conversely, some members of the DevOps community view problem management, or more specifically, root cause analysis, as a complete waste of time. We’ve found that there are a few common reasons for this view that on the surface make sense, but that have counterpoints worth considering.

One reason for the pushback is that the practice of root cause analysis is often misused, particularly in toxic cultures. In those cultures, root cause analysis is focused more on who to blame than on figuring out what happened and how to make improvements. This is extremely counterproductive behavior, so learning to conduct blameless reviews and retrospectives is a great place to start.

Another perspective is that there is no ‘root’ cause in a complex system. An easy response to this perspective is that there is no ‘single’ root cause in a complex system. Increasingly, incidents are caused by unlikely combinations of improbable factors. This also ties into a perspective we sometimes hear, which is that “the root cause is just the place where we decide we know enough to stop analyzing and trying to learn more.” In other words, we grab onto the first explanation we stumble across and accept that as the definitive answer. The reality is that there can be multiple causal factors for a problem, and these factors can span multiple categories such as people, processes, technology, and information.

One way we can address these perspectives is by redefining the term root cause as “the contributing or causal factors underlying a nonconformance.” This definition moves us away from the notion that there is one single, definitive action we can take to categorically eliminate certain incidents, and toward understanding that there is likely a set of factors, one or more of which could rear its ugly head on any given day. Having said that, problem management also teaches us to, at some point, start thinking about probabilities and to focus on the factors that are the most likely to recur and that can cause the greatest impact.
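To make the idea of thinking in probabilities a little more concrete, here is a minimal sketch in Python of ranking causal factors by how likely they are to recur and how much impact they could have. The `CausalFactor` fields, the 1-to-5 impact scale, the likelihood-times-impact score, and the example factors are all assumptions made for illustration; they are not a prescribed problem management method.

```python
from dataclasses import dataclass


@dataclass
class CausalFactor:
    """One contributing factor identified during analysis (hypothetical structure)."""
    name: str
    recurrence_likelihood: float  # estimated probability of recurring, 0.0 to 1.0
    impact: int                   # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> float:
        # Simple likelihood x impact score; other weightings are equally valid.
        return self.recurrence_likelihood * self.impact


def prioritize(factors: list[CausalFactor]) -> list[CausalFactor]:
    """Return the factors ordered from highest to lowest score."""
    return sorted(factors, key=lambda f: f.score, reverse=True)


# Illustrative data only; real estimates would come from incident records.
factors = [
    CausalFactor("untested failover runbook", 0.3, 4),
    CausalFactor("expired certificate", 0.6, 5),
    CausalFactor("noisy alert threshold", 0.8, 2),
]

for factor in prioritize(factors):
    print(f"{factor.name}: {factor.score:.1f}")
```

A score like this is only a conversation starter. The point is that no single factor is treated as "the" root cause, and the team still decides where to focus first.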

Another perspective that at first glance makes sense is that root cause analysis is inherently reactive. The proponents of this perspective make the case that the incidents have already occurred and now we’re trying to figure out why. Wouldn’t it be better to look early in the lifecycle at things like design, development, and testing practices and try to prevent the incidents from happening in the first place by making improvements in those areas? Well, yes. And this is very much in line with W. Edwards Deming’s belief that we should build quality in from start to finish. But… where did the insight come from to understand the problems in these areas? Root cause analysis can help with that as well and, if nothing else, can help quantify the impact of poor practices in those areas.

It’s worth mentioning that the notion of proactive problem management has been around for years. For many organizations, however, dealing with reactive problem management, paying down technical debt, or simply chasing the next shiny new thing often stands in the way of taking a more proactive approach. Here’s the good news. As observability, AIOps, and artificial intelligence become more a part of our lives, we’ll get better at preventative and predictive ways of working and develop the ability to circumvent incidents.

But here’s the reality. For now, incidents are still happening, and in today’s digitally connected world, incidents, particularly those that result in outages, can cripple a company, damage its reputation, and even cause it to lose customers.

So, let’s recognize that with practice comes mastery. Problem management is a critical capability and there are proven methods and techniques that, when used correctly, enable organizations to improve the quality and reliability of their products and services, the experiences of consumers acquiring and using them, and the efficiency and effectiveness of the practices and processes used to design, deliver, and support those products and services. Perhaps most importantly, these methods and techniques enable organizations to improve the knowledge, skills, and decision-making capability of their workforce. And who wouldn’t want a company full of problem solvers?

Not all root cause analysis techniques, however, are equal. For example, a common perception is that the 5 Whys, while easy to use, may cause people to investigate only one causal factor. So, if you think a technique such as the 5 Whys is wrongly leading you down a path that is perceived to have a silver bullet at the end, consider one of the many other problem-solving methods such as Ishikawa diagramming, Pareto analysis, A3, or Kepner-Tregoe that challenge people to explore a more comprehensive set of causal factors.

If there is pushback in your organization to the traditional definition of ‘root cause’, or to the notion of ‘root cause analysis,’ then evolve your vocabulary. As discussed during a recent podcast with John Willis, maybe today what we want to talk about is ‘probable cause analysis’. The use of the term ‘probable’ implies we are reasonably sure, but not certain, and so what we need to do is prove or disprove our hypothesis by conducting experiments.

This doesn’t mean we abandon the problem management practice or proven root cause analysis techniques. It simply means we need to recognize the increasingly dynamic nature of this practice. And we also need to recognize that sometimes all it takes is a simple shift in vocabulary to avoid misunderstandings and misinterpretations.

Problem management isn’t going anywhere and has become critically important as individuals, organizations, and even society rely ever more heavily on digital technologies. This is one reason that high-performing organizations are starting to explore the idea of tying service-level targets to problems.

Given all the possible outcomes of problem management (permanent solution, workaround, known error), you might question how targets are possible. The most common approach is to tie the targets to root (probable) cause analysis. For example, a target could be set that root cause analysis is completed within 5 to 7 days of identifying a problem. This doesn’t mean the problem is solved. It simply means we’ve pulled together the data, evidence, and experts needed to do an analysis. This timeframe makes sense because it is the period during which the data, the evidence, and stakeholders’ recollection of the sequence of events leading to the problem are still relatively fresh. From there, the resulting solution(s) will need to be justified and prioritized, activities that can be managed via adjacent practices such as change management (change enablement in ITIL 4).
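As a rough illustration of how such a target might be tracked, the Python sketch below checks whether the analysis for a problem was completed within a set number of days of the problem being identified. The function name, the status labels, and the default of 7 calendar days (the upper end of the range above) are assumptions for the example; a real service management tool would have its own fields and rules.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed target: analysis completed within 7 calendar days of identification.
RCA_TARGET_DAYS = 7


def rca_target_status(
    identified_on: date,
    analysis_completed_on: Optional[date],
    today: date,
    target_days: int = RCA_TARGET_DAYS,
) -> str:
    """Classify a problem record against the analysis target.

    Returns one of four hypothetical labels:
    "met", "missed", "open" (still within target), or "overdue".
    """
    due = identified_on + timedelta(days=target_days)
    if analysis_completed_on is not None:
        return "met" if analysis_completed_on <= due else "missed"
    return "open" if today <= due else "overdue"


# Example: problem identified on 1 March, analysis finished on 6 March.
print(rca_target_status(date(2024, 3, 1), date(2024, 3, 6), today=date(2024, 3, 10)))
```

Note that the status says nothing about whether the problem is solved. It only reports whether the analysis happened while the evidence and people’s memories were still fresh.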

A last point worth noting is that some of the issues related to problem management are influenced by how tools are configured. For example, some service management tools conflate incident and problem management by requiring that a root cause be identified for every incident. Best practice views these two practices as separate and distinct for several reasons, one of which is that identifying a root cause for every incident doesn’t make the best use of time or resources and so likely drives bad behavior.

When used correctly, root cause analysis is a critical capability for IT teams. If you’re finding that teams are pushing back on the notion of ‘root’ cause, consider evolving your vocabulary. Help individuals learn about, practice, and teach others about the wide range of tools available for problem analysis. Help them develop the expertise needed to quickly identify and analyze the causes of problems and justify the actions needed to solve or prevent them. Challenge people to look beyond the most visible causes of problems. If you only address the most visible causes, you’ll likely fail to improve the system. If you don’t fix the system, it’s likely that problems will recur. By improving the overall system, the specific problem you are addressing is less likely to occur, and there may also be related problems that stem from the same weakness in the system that can be prevented as well.

To learn more about problem management and other core service management practices, consider the following ITSM Academy certification courses:

