Problem Management for Newbies (Part 2 of 2)
In part one of “Problem Management for Newbies” we looked at reactive problem management and how problem management can serve as a pillar of support to incident management. Problem management works to prevent and eliminate future incidents and problems, and to minimize the impact of those that cannot be prevented. There will always be a need for reactive problem management. IT support can never guarantee that there will be no outages, and will always need clearly defined roles, skilled staff and governance to resolve incidents and problems when they occur. The added value to the business, though, comes from proactive problem management!
Proactive Problem Management
Proactive problem management gleans management information from the service desk and from other functions across the organization. By analyzing reports on the frequency and types of incidents, noting the times at which incidents and problems occur and, most importantly, understanding the business impact, problem management teams can get to the root of the cause and prevent further incidents and problems from ever occurring. The Incident Management process has no control over how many incidents occur! Incident Management can only give assurance that service will be restored as quickly as possible when there is an incident. Problem Management, on the other hand, can actually reduce the volume of incidents and eliminate the negative business impact that would otherwise have resulted.
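As a rough illustration of the kind of trend report described above, here is a minimal Python sketch. The incident records, the category names and the 1 to 5 impact score are all made up for the example; a real service desk tool will have its own fields and export format.

```python
from collections import Counter

# Hypothetical incident records: (category, hour_of_day, business_impact 1-5).
# These values are illustrative only and do not come from a real service desk.
incidents = [
    ("disk", 2, 4),
    ("disk", 3, 5),
    ("network", 14, 2),
    ("disk", 2, 4),
    ("email", 9, 1),
    ("network", 15, 3),
]

def summarize(records):
    """Return incident counts and total business impact per category."""
    counts = Counter()
    impact = Counter()
    for category, _hour, score in records:
        counts[category] += 1
        impact[category] += score
    return counts, impact

counts, impact = summarize(incidents)

# Rank categories by total impact so the costliest recurring issue comes first.
ranked = sorted(impact, key=lambda c: impact[c], reverse=True)
for category in ranked:
    print(category, counts[category], impact[category])
```

Ranking by total business impact rather than by raw count reflects the point that understanding the impact matters most; a category with few but severe incidents may deserve attention before a noisy but harmless one.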
Proactive Problem Management Techniques
In addition to analyzing trends, proactive problem management uses many techniques for root cause analysis. Among these are Ishikawa diagrams (aka fishbone diagrams), the 5 Whys, Fault Tree Analysis, and other Total Quality Management (TQM) methods such as brainstorming. While reactive problem management works with incident management to restore service fast, proactive problem management may take time to form focus groups, capture, report and analyze data, and ultimately submit a proposal or Request For Change (RFC) to resolve the problem and prevent future impact. The skill set involved in proactive problem management, while technical and analytical, also requires strong management and facilitation experience.
Keys to Success
The key to success in Proactive Problem Management Process maturity is not only to have clearly defined roles and responsibilities, ownership, integration and handoff points but most importantly to define various problem models. A problem model is a unique set of steps, defining all of the roles, responsibilities and procedures for a specific type of problem. Not all types of problems are the same.
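One way to picture a problem model is as a small data structure that ties each step to the role responsible for it. The sketch below is only an illustration: the class names, the roles and the steps are hypothetical and are not taken from ITIL or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    description: str
    responsible_role: str

@dataclass
class ProblemModel:
    problem_type: str
    owner: str
    steps: List[Step] = field(default_factory=list)

    def roles(self):
        """Return the distinct roles involved, in order of first appearance."""
        seen = []
        for step in self.steps:
            if step.responsible_role not in seen:
                seen.append(step.responsible_role)
        return seen

# Hypothetical model for one specific type of problem.
recurring_hardware = ProblemModel(
    problem_type="recurring hardware failure",
    owner="Problem Manager",
    steps=[
        Step("Analyze incident trend reports", "Problem Analyst"),
        Step("Confirm the defect with the vendor", "Vendor Manager"),
        Step("Submit an RFC for replacement", "Problem Analyst"),
    ],
)

print(recurring_hardware.roles())
```

A major problem would get its own model with different steps and roles, which is the point of defining several models rather than one generic procedure.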
Some examples of problems that problem management can resolve
Recurring Incidents – Example: ABC Company’s problem management team has analyzed trends and noticed that disk crashes have increased 75% compared to Q1! What could be the cause? If we don’t know what caused it then “Houston, we have a problem”! Problem Management might work with vendors and discover that one of them did have a bad batch of disks. After some research it is discovered that ABC Company has several hundred of the bad disks installed in its organization. In this case problem management would submit an RFC and work with the vendor to proactively replace all disks that are at risk, preventing future incidents and negative business impact.
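The quarter-over-quarter comparison in this example is simple arithmetic. Here is a minimal Python sketch; the incident counts (40 and 70) and the 25% alert threshold are assumptions chosen only to reproduce a 75% increase, not figures from ABC Company.

```python
def percent_change(previous, current):
    """Percentage change from the previous period to the current one."""
    if previous == 0:
        raise ValueError("previous period has no incidents; change is undefined")
    return (current - previous) / previous * 100.0

# Hypothetical disk crash counts for two consecutive quarters.
q1_disk_crashes = 40
q2_disk_crashes = 70

# Hypothetical threshold above which the trend is raised as a problem.
ALERT_THRESHOLD_PERCENT = 25.0

change = percent_change(q1_disk_crashes, q2_disk_crashes)
needs_investigation = change > ALERT_THRESHOLD_PERCENT

print(f"Disk crashes changed by {change:.0f}% versus Q1")
print("Raise a problem record" if needs_investigation else "Keep monitoring")
```

The threshold is a judgment call for each organization; the value of the check is that a recurring trend gets flagged for root cause analysis instead of being handled one incident at a time.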
Major Problems – (you know! The all-hands-on-deck, high-impact type) – Example: A recent problem was identified in which the cause of several hundred incidents was that a mirrored server did not fail over when required. Surprise! This followed a recent change and impacted vital business processes for this company. When the problem was investigated, the original cause was documented as: “Wrong firmware on secondary router prevented the mirrored server from failing over as it should have”. The firmware was updated and the problem resolved?! NO! That is reactive problem management. In the above example reactive problem management provided a temporary workaround to fix the mirrored servers by updating the firmware, but the real cause, or root cause, is WHY did the secondary router have the wrong firmware in the first place?!
After forming a focus group, building a timeline of events and using RCA techniques, it was determined that the testing had been performed with only one router and that there were no criteria in the RFC for that change to update the secondary router. The real “Problem” was in the Design and Transition process and procedures. In addition to new procedures in the PMO and the Service Design lifecycle, two new processes were proposed for Service Transition. The new processes were “Test and Validation” and “Change Evaluation”. If they had not taken this root cause analysis further, this company could have kept doing the same thing over and over, each time expecting a different result, only to experience chaos and business impact.
Getting to the root of the cause will prevent similar major outages that could otherwise have occurred after every major change. As we saw in this last example, the root of the cause will generally go much wider than a hardware or software break-fix type of solution. Preventing the outage from ever happening again increases the confidence of staff, the business and customers, prevents cost overruns and positions a service provider for success. So… what’s the problem?