By Donna Knapp
It’s interesting to see how members of different communities can view a practice so differently. That is currently the case with problem management, the practice of identifying, removing, or mitigating the causes of, or contributing factors to, service disruptions. For the most part, the IT service management (ITSM) community recognizes the value of problem management. Its members may admittedly struggle to find the time or resources needed to perform the practice, or find it difficult to justify the actions needed to introduce the permanent solutions it identifies. But on the whole, they value the practice.
Conversely, some members of the DevOps community view problem management, or more specifically, root cause analysis, as a complete waste of time. We’ve found that there are a few common reasons that make sense on the surface, but that have counterpoints worth considering. One reason for the pushback is that the practice of root cause analysis is often misused, particularly in toxic cultures. In those cultures, root cause analysis is focused more on whom to blame than on figuring out what happened and how to make improvements. This behavior is extremely counterproductive, so learning to conduct blameless reviews and retrospectives is a great place to start.
Another perspective is that there is no ‘root’ cause in a complex system. An easy response to this perspective is that there is no ‘single’ root cause in a complex system. Increasingly, incidents are caused by unlikely combinations of improbable factors. This also ties into a perspective we sometimes hear, which is that “the root cause is just the place where we decide we know enough to stop analyzing and trying to learn more.” In other words, we grab onto the first explanation we stumble across and accept it as the definitive answer. The reality is that there can be multiple causal factors for a problem, and these factors can span multiple categories such as people, processes, technology, and information.
One way we can address these perspectives is by redefining the term root cause as “the contributing or causal factors underlying a nonconformance.” This definition moves us away from the notion that there is one single, definitive action we can take to categorically eliminate certain incidents, and toward an understanding that there is likely a set of factors, one or more of which could rear its ugly head on any given day.
Another perspective that makes sense at a glance is that root cause analysis is inherently reactive. The proponents of this perspective make the case that the incidents have already occurred and now we’re trying to figure out why. Wouldn’t it be better to look early in the lifecycle at things like design, development, and testing practices, and try to prevent the incidents from happening in the first place by making improvements in those areas? Well, yes. And this is very much in line with W. Edwards Deming’s belief that we should build quality in from start to finish.
It’s worth mentioning that the notion of proactive problem management has been around for years. For many organizations, however, dealing with reactive problem management, paying down technical debt, or simply chasing the next new shiny often stands in the way of taking a more proactive approach. Here’s the good news. As practices and technologies such as observability, AIOps, and artificial intelligence become a bigger part of our lives, we’ll get better at preventative and predictive ways of working and develop the ability to circumvent incidents.
Not all root cause analysis techniques, however, are equal. For example, a common perception is that the 5 Whys, while easy to use, may cause people to investigate only one causal factor. So, if you think a technique such as the 5 Whys is wrongly leading you down a path that appears to have a silver bullet at the end, consider one of the many other problem-solving methods, such as Ishikawa diagramming, Pareto analysis, A3, or Kepner-Tregoe, that challenge people to explore a more comprehensive set of causal factors.
If there is pushback in your organization to the traditional definition of ‘root cause’, or to the notion of ‘root cause analysis’, then evolve your vocabulary. As discussed during a recent podcast with John Willis, maybe what we want to talk about today is ‘probable cause analysis’.
To learn more about problem management and other core service management practices, consider the following ITSM Academy certification courses: