
Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that takes aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. Ben Treynor, Google’s mastermind behind SRE, describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

Historically, Dev teams want to release new features continuously (Change), while Ops teams want to make sure those features don’t break anything in production (Reliability). The business, of course, wants both, yet the two groups have been incentivized very differently, leading to what Lee Thompson (formerly of E*TRADE) coined the “wall of confusion.” This inherent conflict creates a downward spiral: slower time to market for features, longer deployment cycles, a growing number of outages, and an ever-increasing amount of technical debt.

The discipline of SRE can begin to ease this dilemma by introducing analytics and statistical analyses for green- or red-lighting launches, which helps resolve the tension between stability and agility, operational work and software engineering, and proactive and reactive work. SRE teams are staffed with developer/sysadmin hybrids who not only know how to find problems but also, according to Google’s Melissa Binde, “figure out why it happened, what was the root cause, figure out how to detect it sooner and ideally ensure that it doesn’t happen again.” That sounds a lot like ITIL’s Problem Management process, only on steroids.

So, at a basic level, here is how I understand this works. As we all know from years of experience, and from just being human, nothing is perfect. None of our services ever really achieves 100% uptime; it’s why we invented SLAs. Take it from someone who used to write them.

This is the concept I think is just so cool. If a team agrees to a 99.8% SLA, that gives them an “error budget” of 0.2%: the maximum allowable threshold for service interruptions. The production team can spend this error budget however they see fit and, in turn, release whenever and whatever they want, provided they stay within the SLA. Launches are green-lighted based on past performance. If the service has used up its budget, meaning it is operating at or below the defined SLA, all launches are red-lighted until the team reduces the number of errors to a level that allows launches to proceed.

SREs (Ops) and developers (Dev) therefore have a strong incentive to work together to minimize the number of errors. This ties directly into the cultural and professional movement known as DevOps, which stresses communication, collaboration and integration between software developers and operations professionals while automating the process of software delivery and infrastructure changes.
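To make the error budget arithmetic concrete, here is a minimal sketch in Python. The 30-day measurement window, the downtime figures, and every name in it (`ErrorBudget`, `launch_allowed`, and so on) are my own assumptions for illustration; they are not taken from Google’s tooling or from any particular SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks an error budget derived from an SLA."""

    sla_target: float    # e.g. 0.998 for a 99.8% SLA
    window_minutes: int  # length of the measurement window (assumed)

    @property
    def budget_fraction(self) -> float:
        # A 99.8% SLA leaves 0.2% of the window as allowable interruption.
        return 1.0 - self.sla_target

    @property
    def budget_minutes(self) -> float:
        # The same budget expressed as minutes of downtime in the window.
        return self.budget_fraction * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after the downtime already observed in this window.
        return self.budget_minutes - downtime_minutes

    def launch_allowed(self, downtime_minutes: float) -> bool:
        # Green light only while some budget remains; otherwise red light.
        return self.remaining_minutes(downtime_minutes) > 0


# Assumed example: a 99.8% SLA measured over a 30-day window.
budget = ErrorBudget(sla_target=0.998, window_minutes=30 * 24 * 60)

print(f"Budget: {budget.budget_minutes:.1f} minutes per window")
print("40 min of downtime, launch allowed:", budget.launch_allowed(40))
print("90 min of downtime, launch allowed:", budget.launch_allowed(90))
```

Under these assumptions the 0.2% budget works out to roughly 86 minutes of interruption per 30-day window, so a team that has used 40 minutes can still launch, while one that has used 90 minutes is red-lighted until the window recovers.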


For more information: www.itsmacademy.com/devops
