
Misunderstood and Misused - A Rant About Problem Management

By Donna Knapp

It’s interesting to see how members of different communities can view a practice so differently. That is currently the case with problem management, the practice of identifying, removing, or mitigating the causes of, or factors contributing to, service disruptions. For the most part, the IT service management (ITSM) community recognizes the value of problem management. Its members may struggle to find the time or resources needed to perform the practice, or find it difficult to justify the actions needed to introduce the permanent solutions the practice identifies. But they, for the most part, value the practice.

Conversely, some members of the DevOps community view problem management, or more specifically, root cause analysis, as a complete waste of time. We’ve found that they give a few common reasons that make sense on the surface, but that have counterpoints worth considering. One reason for the pushback is that the practice of root cause analysis is often misused, particularly in toxic cultures. In those cultures, root cause analysis focuses more on who to blame than on figuring out what happened and how to make improvements. This is extremely counterproductive behavior, so learning to conduct blameless reviews and retrospectives is a great place to start.

Another perspective is that there is no ‘root’ cause in a complex system. An easy response to this perspective is that there is no ‘single’ root cause in a complex system. Increasingly, incidents are caused by unlikely combinations of improbable factors. This also ties into a perspective we sometimes hear which is that “the root cause is just the place where we decide we know enough to stop analyzing and trying to learn more.” In other words, we grab onto the first solution we stumble across and accept that as the definitive answer. The reality is that there can be multiple causal factors for a problem, and these factors can span multiple categories such as people, processes, technology, and information.

One way we can address these perspectives is by redefining the term root cause as “the contributing or causal factors underlying a nonconformance.” This definition moves us away from the notion that there is one single, definitive action we can take to categorically eliminate certain incidents, to understanding that there is likely a set of factors, one or more of which could rear its ugly head on any given day.

Another perspective that makes sense at first glance is that root cause analysis is inherently reactive. The proponents of this perspective make the case that the incidents have already occurred and now we’re trying to figure out why. Wouldn’t it be better to look early in the lifecycle at things like design, development, and testing practices, and try to prevent the incidents from happening in the first place by making improvements in those areas? Well, yes. And this is very much in line with W. Edwards Deming’s belief that we should build quality in from start to finish.

It's worth mentioning that the notion of proactive problem management has been around for years. For many organizations, however, dealing with reactive problem management, paying down technical debt, or simply chasing the next shiny new thing often stands in the way of taking a more proactive approach. Here’s the good news. As observability, AIOps, and artificial intelligence more broadly become a bigger part of our lives, we’ll get better at preventative and predictive ways of working and develop the ability to circumvent incidents.

Not all root cause analysis techniques, however, are equal. For example, a common perception is that the 5 Whys, while easy to use, may cause people to investigate only one causal factor. So, if you think a technique such as the 5 Whys is wrongly leading you down a path that is perceived to have a silver bullet at the end, consider one of the many other problem-solving methods such as Ishikawa diagramming, Pareto analysis, A3, or Kepner-Tregoe that challenge people to explore a more comprehensive set of causal factors.
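As a rough illustration of how one of these techniques widens the view beyond a single cause, here is a minimal Pareto analysis sketch in Python. Everything in it is an assumption made for illustration: the incident data, the factor names, the `pareto` helper, and the 80% threshold are all hypothetical and are not drawn from any particular tool or framework. The idea is simply to tally every contributing factor recorded across incident reviews and see which handful accounts for most occurrences.

```python
from collections import Counter

# Hypothetical data: contributing factors noted during blameless reviews.
# Each incident may list several factors; none of these names are real.
incidents = [
    ["untested config change", "missing alert"],
    ["untested config change", "unclear runbook"],
    ["capacity limit", "missing alert"],
    ["untested config change"],
    ["unclear runbook", "missing alert"],
]

def pareto(incidents, threshold=0.8):
    """Return (factor, count, cumulative share) rows, most frequent first,
    stopping once the cumulative share reaches the threshold."""
    counts = Counter(f for factors in incidents for f in factors)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for factor, n in counts.most_common():
        cumulative += n
        rows.append((factor, n, round(cumulative / total, 2)))
        if cumulative / total >= threshold:
            break
    return rows

for factor, n, share in pareto(incidents):
    print(f"{factor}: {n} occurrences, cumulative share {share}")
```

Note that the result is a ranked set of factors rather than a single answer, which fits the view that incidents rarely have one definitive cause.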

If there is pushback in your organization to the traditional definition of ‘root cause’, or to the notion of ‘root cause analysis,’ then evolve your vocabulary. As discussed during a recent podcast with John Willis, maybe today what we want to talk about is ‘probable cause analysis’.


