Site Reliability Engineering (SRE) is
a discipline that incorporates aspects of software engineering and applies them
to operations with the goal of creating ultra-scalable and highly reliable
software systems. Google’s mastermind behind SRE, Ben Treynor, describes
site reliability as “what happens when a software engineer is tasked with what
used to be called operations.”
Historically, Dev teams want to release new features continuously
(Change), while Ops teams want to make sure that those features don’t break
their stuff (Reliability). Of course the business wants both, but these groups
have been incentivized very differently, leading to what Lee Thompson (formerly of E*TRADE) coined the “wall of confusion”. This inherent
conflict creates a downward spiral of slower feature time to market,
longer deployment cycles, increasing numbers of outages, and an ever-increasing
amount of technical debt.
The discipline of SRE can begin to resolve this dilemma by introducing
analytics and statistical analysis for green- or red-lighting launches,
and by helping to balance the competing pulls of stability vs. agility, operational
work vs. software engineering, and proactive vs. reactive work. These SRE teams are staffed with developer/sysadmin hybrids
who not only know how to find problems but, according to Google’s Melissa Binde, “figure
out why it happened, what was the root cause, figure out how to detect it
sooner and ideally ensure that it doesn’t happen again”. Sounds a lot like ITIL’s Problem Management
process, only on steroids.
So at a basic level, here is how I understand this works. As
we all know from years of experience, and just from being human, nothing is perfect.
None of our services ever really achieves 100% uptime; it’s why we invented
SLAs. Take it from someone who used to write them. This is the concept I think is just so
cool: if a team agrees to a 99.8% SLA,
it gives them an “error budget” of 0.2%.
This is the maximum allowable threshold for service interruptions. The
production team can use this error budget however they see fit and, in turn,
release whenever and whatever they want, provided they stay within the SLA. They get green-lighted based on past
performance. If they are
operating at or below the defined SLA, all launches are red-lighted until they
reduce the number of errors to a level that allows the launch to proceed. SREs
(Ops) and developers (Dev) have a strong incentive to work together to minimize
the number of errors. This ties completely into the cultural and professional
movement known as DevOps, which stresses communication, collaboration and
integration between software developers and operational professionals while
automating the process of software delivery and infrastructure changes.
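
To make the error budget arithmetic above concrete, here is a minimal Python sketch. It assumes the budget is measured as minutes of downtime over a rolling 30-day window; that window, and the function names `error_budget_minutes` and `launch_is_green`, are my own illustrative choices and not anything Google or a specific SRE tool prescribes.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an SLA allows over the window.

    Assumption: the SLA is an availability percentage and the window is a
    fixed number of days. Both are illustrative, not a standard.
    """
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - sla_percent) / 100.0


def launch_is_green(sla_percent: float, downtime_minutes: float,
                    window_days: int = 30) -> bool:
    """Green-light a launch only while some error budget remains.

    Once downtime reaches the budget (service at or below the SLA),
    launches are red-lighted.
    """
    return downtime_minutes < error_budget_minutes(sla_percent, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(99.8)
    print(f"99.8% SLA over 30 days allows {budget:.1f} minutes of downtime")
    print(launch_is_green(99.8, downtime_minutes=30.0))   # budget left: True
    print(launch_is_green(99.8, downtime_minutes=100.0))  # budget spent: False
```

With a 99.8% SLA, the 0.2% budget comes to roughly 86 minutes in a 30-day window, so a team that has only burned 30 minutes can keep shipping, while one that has burned 100 minutes has to stop and fix things first.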
For more information: www.itsmacademy.com/devops