Posts

Showing posts from 2019

"Site Reliability Engineering: How Google Runs Production Systems"; my key take-aways

" Site Reliability Engineering: How Google Runs Production Systems " edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy contains a number of insights into Google's SRE practice. It is a bit repetitive at times but this assists in drilling in some of the key points. My key take aways were: Google places a 50% cap on all aggregate "ops" work for all SREs - tickets, on-call, manual tasks, etc. The remaining 50% is to do development. If consistently less than 50% development then often some of the burden is pushed back to the development team, or staff are added to the team. An error budget is set that is one minus the availability target (e.g. a 99.99% availability target will have a 0.01% error budget). This budget cannot be overspent and assists in the balance of reliability and the pace of innovation. Cost is often a key factor in determining the proposed availability target for a service. if we were to build an operate