Posts

Showing posts with the label Problem Management

Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations with the goal of creating ultra-scalable and highly reliable software systems. Google’s mastermind behind SRE, Ben Treynor, describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.” Historically, Dev teams want to release new features in a continuous manner (Change). Ops teams want to make sure that those features don’t break their stuff (Reliability). Of course the business wants both, yet these groups have been incentivized very differently, leading to what Lee Thompson (formerly of E*TRADE) coined the “wall of confusion”. This inherent conflict creates a downward spiral that results in slower feature time to market, longer deployment cycles, increasing numbers of outages, and an ever-increasing amount of technical debt. The discipline of SRE can begin to reduce this dilemma by

Incident vs. Problem

You may have seen a similar blog from the Professor a few years back that talked about the distinction between an incident and a problem. Everything from that article is still relevant. As processes and methods for development and deployment have matured, so has the usage of Incident and Problem Management. This is one of the most often confused points in Agile, LEAN and ITIL adaptations. The ITIL definition is the same. Incident: Any unplanned event that causes, or may cause, a disruption or interruption to service delivery or quality. Problem: The cause of one or more incidents, events, alerts or situations. Where and how we apply Incident and Problem Management is evolving. A decade ago, and still in some organizations, Incident and Problem Management are processes exclusive to Service Operation. ITIL is so very relevant and today we find, with the onset of DevOps and cultural shifts, many organizations are adopting little or zero tolerance

Service Operation and the Service Lifecycle – Yesterday and Today

ITSM Best Practice aligns five main processes with the lifecycle of “Service Operation”: Incident Management, Problem Management, Event Management, Request Fulfillment and Access Management. It was not too long ago that the idea of some of these processes was new to service providers. Most will find them to be common in today’s marketplace. An organization may not literally follow the best practices for the service operation processes but most likely has some close facsimile when executing Incident, Problem, Request Fulfillment, and Event Management processes for provisioning IT services and support. In order to ensure identity management and authorization for access, some form of “Access Management” will also be needed to support an overall security policy in Service Operation. I would like to focus on some thoughts for “Event Management” and early engagement of operational staff in the service lifecycle. As organizations mature they begin to realize the value

Visible Ops

Anyone who has worked in Information Technology knows that there are, and always will be, improvement opportunities available to our organizations. This is especially true in light of the pace of change that is taking place in all market spaces and the level of customer expectations that accompanies that change. If you have worked in IT for a number of years, you may remember when change was not welcomed. Well, the good old days weren’t always that good and tomorrow ain’t as bad as it seems (Billy Joel). The challenge is in getting started. If…
· the processes that are currently being engaged are not as efficient and effective as you would like
· you are finding that your environment isn’t as stable and reliable as it should be
· changes to your environment generally result in an outage and prolonged and repeated firefighting
then… I recommend that you read The Visible Ops Handbook by Gene Kim, Kevin Behr and Geor

The Value of Problem Models

If a problem is the unknown cause of one or more incidents, then how can I design a repeatable model for something that is unknown? The purpose of Problem Management is to manage problems throughout their lifecycle. Problem Management seeks not only to minimize the adverse effect of incidents by providing workarounds, but also to eliminate outages and prevent them from recurring. In Incident Management, ITIL defines an Incident Model as a predefined set of procedures based on the type of incident. So then what is a “Problem Model”? Problem Models: Not all problems are the same. There are many different types of problems and each type will require unique roles and responsibilities, varied skill sets and different timelines and policies based on the complexity of the problem. When considering how to design problem models, consider the workflow required once the “problem” is identified. Approach to Defining Problem Models: One approach is to classify

Incidents and Problems

An incident is an unplanned interruption to an IT service or reduction in the quality of an IT service, and Incident Management is strictly a reactive process. A problem, on the other hand, represents a different perspective of an incident by diagnosing its underlying root cause, which might also be the cause of multiple other incidents. Incidents, however, do not always grow up to become problems. While Incident Management activities focus on restoring services to normal operations as quickly as possible, Problem Management activities determine the root cause, find the most effective and efficient permanent resolution and ultimately prevent the incident from happening again. Problem Management can be both reactive and proactive. Proactive Problem Management identifies weaknesses in the environment before actual incidents occur. These can then be exploited as improvement opportunities. Reactive Problem Management addresses problems that were identified from one or more incidents. The pol

Problem, Incident and Change Management Integration

“Problem Management seeks to minimize the adverse impact of incidents and problems on the business that are caused by underlying errors within the IT infrastructure and to proactively prevent the recurrence of incidents related to those errors. In order to achieve this, Problem Management seeks to get to the root cause of incidents, document and communicate known errors and to initiate actions to improve or correct the situation”. Given that this statement comes directly from the ITIL Best Management Practices text, it’s a wonder more organizations don’t have well-integrated Problem, Incident and Change processes. I never want to say that there is a single silver bullet solution for a given problem and I’m not suggesting that here. However, having a solid CMS (Configuration Management System) is a good step in the right direction. Of course, before we even think of tools we must have rules. Thinking holistically we can create an integrated set of best p

Problem Management for Newbies (Part 2 of 2)

In part one of “Problem Management for Newbies” we looked at reactive Problem Management and how Problem Management can serve as a pillar of support to Incident Management. Problem Management prevents, minimizes and eliminates future incidents and problems. There will always be a need for reactive problem management. IT support can never guarantee that there will not be outages and will always need clearly defined roles, skilled staff and governance for the resolution of incidents and problems when they occur. Added value to the business comes via proactive problem management! Proactive Problem Management: Proactive problem management will glean management information from the function of the service desk, and others across the organization. By viewing and analyzing reports on frequency of incidents, types of incidents,  noting the times that incidents and problems occur and most importantly understanding the bu

Problem Management for Newbies! Part 1 of 2

Getting Started with Problem Management: To understand the process of Problem Management, one must first understand that a problem is distinctly different from an incident. It is tracked and recorded separately, and it requires a very different skill set and has a different objective than “Incident Management”. Problem records are unique entities and are reported upon separately. A repeatable lean problem management process could very well be the glue that helps IT service providers integrate and automate much of the work and effort required to “prevent”, “eliminate” and “minimize” the impact of incidents on your business and end user customers. While an incident is an unplanned interruption that creates an impact to one or more business services, the problem is actually the cause of one or more incidents. Example: “I can’t access the ERP system”, “The web portal will not come up!” and “I can’t log in” are all examples of incidents. The c

The Best of Service Operation, Part 3

The Value of Known Errors and Workarounds. Originally published on December 7, 2010. The goal of Problem Management is to prevent problems and related incidents, eliminate recurring incidents and minimize the impact of incidents that cannot be prevented. Working with Incident Management and Change Management, Problem Management helps to ensure that service availability and quality are increased. One of the responsibilities of Problem Management is to record and maintain information about problems and their related workarounds and resolutions. Over time, this information is continually used to expedite resolution times, identify permanent solutions and reduce the number of recurring incidents. The resulting benefits are greater availability and less disruption to critical business systems. Although Incident and Problem Management are separate processes, they typically use the same or similar tools. This allows for similar categorization and impact coding systems. Each of thes

Reasoning for Problem Management

When it comes to Problem Management two things should come to mind: Root Cause Analysis (RCA) and finding a permanent resolution. How often have you thought about what it takes to conduct these aspects of Problem Management? An important underlying aspect of conducting a Root Cause Analysis and finding the permanent resolution is the reasoning approach used. Three types of basic reasoning approaches are: Inductive (reasoning from specific examples to general rules), Deductive (reasoning from general rules to specific examples) and Abductive (reasoning to the most likely answer). Each has its own uses and can be applied to problem solving and Problem Management at different times and for different reasons. However, when performing the Problem Management process we should be open to using all three reasoning approaches. They all complement each other, with Inductive and Deductive reasoning forming two ends of a spectrum while Abductive thought looks for the balance between the other two a

Problem Management Techniques

Perhaps one of the most underused yet powerful processes from ITIL is Problem Management. Many people recognize the importance of Problem Management, especially in relation to Incident Management. Yet when I ask students if they have implemented a Problem Management process the response is often “We plan to...” or “We started but did not get too far…” or “Not yet.” So what is keeping companies and individuals from using Problem Management to its full effectiveness? I propose that some of the reason is fear, uncertainty and doubt about how to go about “doing” Problem Management. By understanding that the Problem Management process has a number of techniques and tools available to help a service provider identify root cause and recommend permanent resolution, we may be able to remove some of the fear, uncertainty and doubt. What are some of the techniques and how could we apply them? Let’s take a brief look and see what we can uncover. Chronological Analysis: This time-based appr

Effective Brainstorming Session

There are several problem analysis techniques which are discussed in the V3 Service Operation book, including brainstorming. I have used brainstorming sessions often in my career. Brainstorming is used throughout the problem solving process whenever the team needs to generate ideas quickly and effectively. Some sessions have been very valuable, others not so much. What was the difference? Basically, we need some structure around the sessions and some rules of engagement. Let’s begin by defining brainstorming. This is a technique used by a team to quickly generate a list of ideas to solve problems or issues. The relevant people must be gathered together, physically and/or electronically, to increase creativity and idea generation in a very short amount of time. Here are 3 different types of brainstorming methods: Free Wheeling Brainstorming: Participants call out their ideas when they occur to them and in no particular order. A recorder posts all ideas for everyone to see as

Achieving ITSM Balance

In speaking with colleagues and practitioners, I have found that one of the greatest difficulties for companies to overcome in a Service Management implementation is the desire to be more complex and unbalanced than is absolutely necessary. One of the most basic and underlying elements of good Service Management is the achievement of balance in how we approach the delivery of value to the customers and users through services. Balance helps us to find an equitable point that brings value to the customers and users without throwing out the efforts and actions needed to keep IT going. When I speak of balance, I am referring to finding the middle ground between extremes. These include balances like the amount of time and effort spent between Incident Management and Problem Management; or perhaps the balance between flexibility and stability; or even the challenges of being proactive versus reactive; customer/service-centric versus technology-centric. There are a multitude of these types

Incidents when a Defect is Involved

Question: We currently track defects in a system separate from our ticket management system. With that said, my question is: does anyone have suggestions and/or best practices on how to handle incidents when a defect is involved? Should the incident be closed, since the defect is being worked on in another defect tracking system, if it is noted in the incident ticket? I am considering creating an incident status of 'closed-unresolved' so the incident can still be reported on in our ticket management system while we know it is being worked on/tracked in the defect system. With defects, it is possible that we may never work on them because they are very low priority and the impact is low to the user. However, in some cases a defect is being worked on. Should we create a problem ticket instead? Thanks, René W. Answer: René. In ITIL, the activity you are describing is handled by the Problem Management process. ITIL does not use the term “defect” but it does use the term “known er