Getting Started with Problem Management
To understand Problem Management, one must first understand that a problem is distinctly different from an incident. A problem is tracked and recorded separately, and working it requires a very different skill set and a different objective than “Incident Management”. Problem records are unique entities and are reported upon separately. A repeatable, lean Problem Management process could very well be the glue that helps IT service providers integrate and automate much of the work required to “prevent”, “eliminate”, and “minimize” the impact of incidents on your business and end-user customers.
While an incident is an unplanned interruption that impacts one or more business services, a problem is the cause of one or more incidents. “I can’t access the ERP system”, “The web portal will not come up!”, and “I can’t log in” are all examples of incidents. The cause, or the “problem”, might be that the router on the network is down. You could therefore have many incidents related to a single problem record. It is important to note that an incident never becomes a problem. The Service Desk agent owns the incident, while the Problem Management team or technician works the problem to identify the root cause of these incidents and, most importantly, to provide a solution to the Service Desk so that the incidents can be resolved. The root cause and solution are logged in the Known Error Database (KEDB), which is owned by Problem Management and shared with the Service Desk and other IT support staff.
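The relationship between incidents, a problem record, and the KEDB can be sketched as a small data model. This is only an illustration, not a description of any particular ITSM tool: the class names, field names, record IDs, and the dictionary standing in for the KEDB are all assumptions, and the solution text is a hypothetical placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Incident:
    """One reported interruption, owned by the Service Desk."""
    incident_id: str
    description: str
    # An incident is linked to a problem; it never turns into one.
    problem_id: Optional[str] = None


@dataclass
class ProblemRecord:
    """The underlying cause, tracked separately from its incidents."""
    problem_id: str
    root_cause: Optional[str] = None
    solution: Optional[str] = None
    incident_ids: List[str] = field(default_factory=list)


def link_incident(incident: Incident, problem: ProblemRecord) -> None:
    """Relate an incident to the problem that caused it."""
    incident.problem_id = problem.problem_id
    problem.incident_ids.append(incident.incident_id)


# Three incidents from the example above (IDs are made up).
incidents = [
    Incident("INC-1", "I can't access the ERP system"),
    Incident("INC-2", "The web portal will not come up!"),
    Incident("INC-3", "I can't log in"),
]

# One problem record collects all of them.
problem = ProblemRecord("PRB-1")
for inc in incidents:
    link_incident(inc, problem)

# Problem Management identifies the cause and records a solution.
problem.root_cause = "The router on the network is down"
problem.solution = "Restore the router"  # hypothetical solution text

# A plain dict stands in for the KEDB shared with the Service Desk.
kedb: Dict[str, Dict[str, Optional[str]]] = {
    problem.problem_id: {
        "root_cause": problem.root_cause,
        "solution": problem.solution,
    }
}
```

Keeping the link as a reference on the incident, instead of converting the incident into a problem, is what lets the two record types be owned and reported on separately.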
Reactive Problem Management
Reactive Problem Management is primarily performed in support of Incident Management, to ensure that the service desk has the solution it needs to resolve the incident. Problem Management will identify the root cause of the incident. In cases where resolving the cause might exceed the service level for the Mean Time To Restore Service (MTRS), Problem Management will attempt to provide a temporary solution to the service desk. This enables the service desk agent to restore service as quickly as possible, meet the agreed targets, and reduce the impact of the outage. This temporary solution is referred to as a “workaround”. A classic example is rebooting the system. The system is rebooted, the user is happy and running. Yay! We met our SLA! The permanent solution might require a Request for Change (RFC). This RFC would be submitted by Problem Management and might take two days, two weeks, or two months to implement, depending on the scope and complexity of the change needed to permanently fix the problem. Although the “Incident” is closed in this example, the “Problem Record” will remain open until the resolution to the problem is completed.
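The reboot example can be sketched in a few lines to show why the two records have independent lifecycles. Everything here is an assumption made for illustration: the class and field names, the four-hour MTRS target, and the timestamps are hypothetical and do not come from any specific tool or SLA.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    opened: datetime
    restored: Optional[datetime] = None

    def met_target(self, mtrs_target: timedelta) -> bool:
        """True if service was restored within the agreed target."""
        if self.restored is None:
            return False
        return self.restored - self.opened <= mtrs_target


@dataclass
class ProblemRecord:
    workaround: Optional[str] = None
    rfc_implemented: bool = False

    @property
    def is_open(self) -> bool:
        """The problem stays open until the permanent fix is in place."""
        return not self.rfc_implemented


MTRS_TARGET = timedelta(hours=4)  # hypothetical agreed target

opened = datetime(2024, 1, 8, 9, 0)  # hypothetical timestamp
incident = Incident(opened=opened)
problem = ProblemRecord(workaround="Reboot the system")

# The service desk applies the workaround 30 minutes after the report.
incident.restored = opened + timedelta(minutes=30)
sla_met = incident.met_target(MTRS_TARGET)
open_after_workaround = problem.is_open

# Later, the RFC for the permanent fix is implemented.
problem.rfc_implemented = True
open_after_rfc = problem.is_open

print(sla_met, open_after_workaround, open_after_rfc)
```

The incident is judged only against the restore target, while the problem record closes only when the permanent change is done, however long that takes.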
If we, the service provider, follow industry best practice and track and record the incident record separately from the problem record, our reports will show that we met the SLA! Even though the problem could take days or weeks to resolve, the problem record stays open until the permanent solution is implemented and recorded in the Known Error Database, which is owned by the Problem Management process. Internally, we can generate management information on the “Problems” to determine the effort that was needed to resolve each problem and the cost to the business of resolving it. This also enables the service provider to demonstrate the value of IT to the business while building the confidence of users and customers. It is a beautiful thing! If you like the value that this reactive side of “Problem Management” can bring to your business, stay tuned for Part Two of “Problem Management for Newbies”, where the focus is on “Proactive Problem Management”!