Problem Management for Newbies (Part 2 of 2)
In part one of “Problem Management for Newbies” we looked at
reactive problem management and how problem management can serve as a pillar of
support to incident management. Problem management prevents future incidents
and problems, and minimizes the impact of those that cannot be prevented.
There will always be a need for reactive problem management. IT support can
never guarantee that there will be no outages, and will always need clearly
defined roles, skilled staff and governance to resolve incidents and problems
when they occur. The added value to the business, however, comes from
proactive problem management!
Proactive Problem Management
Proactive problem management gleans management information from the service
desk and from other functions across the organization. By analyzing reports on
the frequency and types of incidents, noting the times at which incidents and
problems occur and, most importantly, understanding the business impact,
problem management teams can work to get to the root of the cause and prevent
further incidents and problems from ever occurring. The Incident Management
process has no control over how many incidents occur! Incident Management can
only give assurance that service will be restored as quickly as possible when
there is an incident. Problem Management, on the other hand, can actually
reduce the volume of incidents and eliminate the negative business impact
that would otherwise have resulted.
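The kind of report analysis described above can be sketched in a few lines of Python. This is only an illustration: the incident records, the field names (category, hour, impact_minutes) and the rank_by_impact function are all assumptions made up for the example, not part of any particular service desk tool.

```python
from collections import defaultdict

# Hypothetical incident records; the field names are illustrative assumptions.
incidents = [
    {"category": "disk", "hour": 2, "impact_minutes": 120},
    {"category": "email", "hour": 9, "impact_minutes": 15},
    {"category": "disk", "hour": 3, "impact_minutes": 90},
    {"category": "network", "hour": 9, "impact_minutes": 45},
    {"category": "disk", "hour": 2, "impact_minutes": 60},
]


def rank_by_impact(records):
    """Return (category, count, total impact) tuples, highest impact first."""
    totals = defaultdict(lambda: [0, 0])  # category -> [count, impact]
    for rec in records:
        totals[rec["category"]][0] += 1
        totals[rec["category"]][1] += rec["impact_minutes"]
    rows = [(cat, count, impact) for cat, (count, impact) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranking = rank_by_impact(incidents)
for category, count, impact in ranking:
    print(f"{category}: {count} incidents, {impact} minutes of impact")
```

Ranking by business impact rather than by raw count reflects the point made above: frequency matters, but impact matters most.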
Proactive Problem Management Techniques
In addition to analyzing trends, proactive problem management uses many
techniques for root cause analysis. Among these are Ishikawa diagrams (also
known as fishbone diagrams), the 5 Whys, Fault Tree Analysis, and other Total
Quality Management (TQM) methods such as brainstorming. While reactive
problem management works with incident management to restore service fast,
proactive problem management may take time to form focus groups, capture,
report and analyze data, and ultimately submit a proposal or Request for
Change (RFC) to resolve the problem and prevent future impact.
The skill set involved in proactive problem management, while technical
and analytical, also requires strong management and facilitation experience.
Keys to Success
The key to success in Proactive Problem Management process maturity is not
only to have clearly defined roles and responsibilities, ownership,
integration and handoff points but, most importantly, to define various
problem models. A problem model is a unique set of steps that defines all of
the roles, responsibilities and procedures for a specific type of problem.
Not all types of problems are the same.
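One way to picture a problem model is as a small data structure: a named type of problem, an owner, and an ordered list of steps. The sketch below is a hypothetical illustration only. The ProblemModel class, the role names and the step wording are assumptions chosen for the example, not a prescribed ITIL format.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemModel:
    """A predefined set of steps for one specific type of problem."""
    name: str
    owner_role: str
    steps: list = field(default_factory=list)

    def next_step(self, completed):
        """Return the first step not yet completed, or None when all are done."""
        for step in self.steps:
            if step not in completed:
                return step
        return None


# Hypothetical models; the roles and steps are illustrative assumptions.
MODELS = {
    "recurring_incident": ProblemModel(
        name="Recurring incident",
        owner_role="problem analyst",
        steps=["analyze trends", "consult vendors", "submit RFC"],
    ),
    "major_problem": ProblemModel(
        name="Major problem",
        owner_role="problem manager",
        steps=[
            "form focus group",
            "build timeline of events",
            "run root cause analysis",
            "propose process changes",
        ],
    ),
}


def select_model(problem_type):
    """Look up the model for a problem type; raises KeyError if undefined."""
    return MODELS[problem_type]


model = select_model("major_problem")
print(model.next_step({"form focus group"}))
```

Keeping a separate model per problem type means nobody has to work out who does what, and in which order, while an outage is in progress.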
Some examples of problems that problem management can resolve:
Recurring Incidents – Example: ABC Company’s problem management team has
analyzed trends and noticed that disk crashes have increased 75% compared to
Q1! What could be the cause? If we don’t know what caused it then “Houston,
we have a problem”! Problem Management might work with vendors and discover
that one of them did have a bad batch of disks. After some research it is
discovered that ABC Company has several hundred of the bad disks installed in
its organization. In this case problem management would submit an RFC and
work with the vendor to proactively replace all disks that are at risk,
preventing future incidents and negative business impact.
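The trend check in this example comes down to simple arithmetic. Here is a minimal sketch, assuming made-up quarterly counts (40 disk crashes in Q1 and 70 in Q2, which works out to the 75% increase mentioned above) and an arbitrary 50% threshold for flagging a category for investigation. The numbers, the threshold and the function name are all hypothetical.

```python
def percent_change(previous, current):
    """Percentage change from the previous period to the current one."""
    if previous == 0:
        raise ValueError("previous period has no incidents to compare against")
    return (current - previous) / previous * 100


# Hypothetical incident counts per category; not real data.
q1 = {"disk": 40, "email": 25}
q2 = {"disk": 70, "email": 26}

THRESHOLD = 50.0  # assumed cut-off for raising a problem record

flagged = [
    category
    for category in q1
    if percent_change(q1[category], q2.get(category, 0)) >= THRESHOLD
]

for category in flagged:
    change = percent_change(q1[category], q2[category])
    print(f"{category}: up {change:.0f}% on Q1, investigate")
```

A category that crosses the threshold is only a prompt to start asking why. The root cause analysis still has to be done by people.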
Major Problems - (you know! The all hands on deck, high impact type) -
Example: A recent problem was identified in which several hundred incidents
were caused by a mirrored server that did not fail over when required.
Surprise! This followed a recent change and impacted vital business processes
for this company. When the problem was investigated, the original cause was
documented as: “Wrong firmware on secondary router prevented the mirrored
server from failing over as it should have”. The firmware was updated and the
problem resolved?! NO! That is reactive problem management. In the above
example, reactive problem management provided a temporary workaround to fix
the mirrored servers by updating the firmware, but the real cause, or root
cause, is WHY did the secondary router have the wrong firmware in the first
place?!
After forming a focus group, building a timeline of events and using RCA
techniques, it was determined that the testing had been performed with only
one router and that there were no criteria in the RFC for that change to
update the secondary router. The real “Problem” was in the Design and
Transition process and procedures. In addition to new procedures in the PMO
and the Service Design lifecycle, two new processes were proposed for Service
Transition. The new processes were “Test and Validation” and “Change
Evaluation”. If they had not taken this root cause analysis further, this
company could have kept doing the same thing over and over, each time
expecting a different result, only to experience chaos and business impact.
Getting to the root of the cause will prevent similar major outages that
could otherwise have occurred after every major change. As we saw in this
last example, the root of the cause will generally go much wider than a
hardware or software break-fix type of solution. Preventing the outage from
ever happening again increases the confidence of staff, the business and
customers, prevents cost overruns and positions a service provider for
success. So… what’s the problem?