Skip to main content

Posts

Showing posts with the label Site Reliabilitu Engineering

Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations with the goal of creating ultra-scalable and highly reliable software systems.  Google’s mastermind behind SRE, Ben Treynor, describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.” Historically, Dev teams want to release new features in a continuous manner (Change). Ops teams want to make sure that those features don’t break their stuff (Reliability). Of course the business wants both, so these groups have been incentivized very differently leading to what Lee Thompson ( (formerly of E*TRADE) coined the “wall of confusion”.  This inherent conflict creates a downward spiral that creates slower feature time to market, longer deployment cycles, increasing numbers of outages, and an ever increasing amount of technical debt. The discipline of SRE can begin to reduce this dilemma by