"Site Reliability Engineering: How Google Runs Production Systems"; my key take-aways
"Site Reliability Engineering: How Google Runs Production Systems" edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy contains a number of insights into Google's SRE practice. It is a bit repetitive at times but this assists in drilling in some of the key points.
My key take-aways were:
- Google places a 50% cap on all aggregate "ops" work for all SREs - tickets, on-call, manual tasks, etc.
- The remaining 50% of their time is spent on development work.
- If SREs consistently spend less than 50% of their time on development, some of the operational burden is pushed back to the development team, or staff are added to the team.
- An error budget is set that is one minus the availability target (e.g. a 99.99% availability target will have a 0.01% error budget).
- This budget cannot be overspent, and it helps balance reliability against the pace of innovation.
- Cost is often a key factor in determining the proposed availability target for a service.
- If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be? Would that offset the cost of reaching that level of reliability? (See the sketch below.)
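A rough sketch of the error-budget arithmetic; the availability targets and the 30-day window below are illustrative, not from the book:

```python
# Error budget = 1 - availability target, shown as allowed downtime over an
# illustrative 30-day window.

MINUTES_PER_30_DAYS = 30 * 24 * 60


def error_budget(availability_target: float) -> float:
    """One minus the availability target."""
    return 1.0 - availability_target


for target in (0.999, 0.9999, 0.99999):  # each extra "nine" is costlier to reach
    budget = error_budget(target)
    print(f"{target:.3%} target -> {budget:.3%} error budget, "
          f"~{budget * MINUTES_PER_30_DAYS:.1f} minutes of downtime per 30 days")
```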
- There are three kinds of monitoring outputs:
  - Alerts
    - signifies a human needs to take action immediately
  - Tickets
    - signifies a human needs to take action, but not immediately
  - Logging
    - no one needs to do anything, but it is recorded for diagnostic or forensic purposes
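A minimal sketch of that distinction in code; the handling on the right-hand side (who gets interrupted, and when) is illustrative:

```python
from enum import Enum, auto


class MonitoringOutput(Enum):
    ALERT = auto()   # a human must act immediately
    TICKET = auto()  # a human must act, but not immediately
    LOG = auto()     # recorded for diagnostics/forensics; nobody is interrupted


def handle(output: MonitoringOutput, message: str) -> str:
    if output is MonitoringOutput.ALERT:
        return f"page the on-call now: {message}"
    if output is MonitoringOutput.TICKET:
        return f"file a ticket for working hours: {message}"
    return f"write to the log only: {message}"


print(handle(MonitoringOutput.ALERT, "user-facing error rate above target"))
```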
- Automate yourself out of a job: Automate all the things. Have a low tolerance for repetitive reactive work.
- When creating rules for monitoring and alerting, consider how to reduce false positives and pager burnout:
  - Will I ever be able to ignore this alert knowing it is benign? When and why? How can I avoid this scenario?
  - Does this alert definitely indicate that users are being negatively affected? If not, how can those cases be filtered out?
  - Is the action required urgent, or can it wait until the morning?
- Adopt a philosophy for pages and pagers:
  - Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
  - Every page should be actionable.
  - Every page response should require intelligence. If a page merely merits a robotic response, it shouldn't be a page (e.g. automate it).
  - Pages should be about a novel problem or an event that hasn't been seen before.
- Google's services communicate using Remote Procedure Call (RPC). c.f. gRPC.
- Data is transferred to and from an RPC using protocol buffers
- Google Software Engineers work from a single shared repository.
- If they encounter a problem in a component outside of their project, they can fix the problem and send the proposed changes to the owner for review.
- Changes to source code in an engineer's own project require a review.
- All code is checked into the main branch of the source code tree (mainline). Most major projects don't release directly from the mainline; instead they branch from the mainline at a specific revision and never merge changes back to the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch.
- At Google a time-based metric for availability is usually not meaningful due to globally distributed services. Instead "request success rate" is often more meaningful.
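A minimal sketch of availability measured as a request success rate rather than as uptime; the counts are illustrative:

```python
def request_success_rate(successful_requests: int, total_requests: int) -> float:
    """Availability as the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic in the window; treat as available (a policy choice)
    return successful_requests / total_requests


# e.g. 2,499,998 successes out of 2,500,000 requests in the measurement window
print(f"{request_success_rate(2_499_998, 2_500_000):.5%}")  # 99.99992%
```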
- Google SRE's unofficial motto is "Hope is not a strategy"
- An SLO is a service level objective; a target value or range of values for a service level that is measured by a service level indicator (SLI).
- Common SLIs are request latency, error rate, throughput, correctness and availability.
- Choose just enough SLOs to provide good coverage of your system's attributes. Defend the SLOs you pick; if you can't ever win a conversation about priorities by quoting a particular SLO, it's probably not worth having that SLO.
- Perfection can wait
- You can refine SLO definitions and targets over time as you learn about a system's behaviour. It's better to start with a loose target that you tighten.
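A minimal sketch of recording SLOs as target values or ranges for measured SLIs; the names and numbers are illustrative, not from the book:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    """A target value or range of values for a service level indicator."""
    sli_name: str
    lower_bound: Optional[float] = None  # e.g. minimum acceptable availability
    upper_bound: Optional[float] = None  # e.g. maximum acceptable p99 latency

    def is_met(self, sli_value: float) -> bool:
        if self.lower_bound is not None and sli_value < self.lower_bound:
            return False
        if self.upper_bound is not None and sli_value > self.upper_bound:
            return False
        return True


slos = [
    SLO("availability", lower_bound=0.999),    # a deliberately loose target...
    SLO("latency_p99_ms", upper_bound=400.0),  # ...that can be tightened over time
]
measured = {"availability": 0.9995, "latency_p99_ms": 380.0}

for slo in slos:
    print(slo.sli_name, "met" if slo.is_met(measured[slo.sli_name]) else "violated")
```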
- The Service Reliability Hierarchy provides a good graphical depiction of the elements that go into making a service reliable.
- Prometheus is an open-source monitoring system that shares many similarities with Borgmon, the tool used by Google for white-box monitoring.
- To facilitate mass collection, the metrics format was standardised so that all of a target's metrics can be pulled back in one fetch, e.g. http://webserver:80/varz could return http_requests and errors_total.
- Borgmon also records info such as whether the target responded, what time the collection finished etc.
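A minimal sketch of the idea: one plain-text HTTP endpoint that exposes every metric in a single fetch. The /varz path and counter values are illustrative, and this uses Python's standard library rather than anything from Borgmon or Prometheus:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counters a real server would increment as it handles traffic.
METRICS = {"http_requests": 37, "errors_total": 12}


class VarzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/varz":
            self.send_error(404)
            return
        body = "\n".join(f"{name} {value}" for name, value in METRICS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())


if __name__ == "__main__":
    # e.g. curl http://localhost:8080/varz returns all metrics in one fetch
    HTTPServer(("", 8080), VarzHandler).serve_forever()
```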
- Google utilises Prober for black box testing from an end-user perspective.
- The most important on-call resources are:
  - Clear escalation paths
  - Well-defined incident management procedures
  - A blameless postmortem culture
- Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct! Instead your course of action should be to make the system work as well as it can under the circumstances.
- c.f. a novice pilot is taught that their first responsibility in an emergency is to fly the airplane; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground.
- Test roll-back procedures before large-scale tests
- History is about learning from everyone's mistakes
- Postmortems are key to putting effective prevention plans in place and to learning. Make postmortems widely available (inc. having postmortem reading clubs) and consider that people reading them may have minimal knowledge of the context of the environment.
- Plan for failures / potential incidents.
- If you haven't gamed out your response to potential incidents in advance, principled incident management can go out the window in real-life situations.
- Ask questions such as What if the building power fails…? What if the network equipment racks are standing in two feet of water…? What if the primary datacenter suddenly goes dark…? What if someone compromises your web server…? What do you do? Who do you call? Who will write the check? Do you have a plan? Do you know how to react? Do you know how your systems will react? Could you minimise the impact if it were to happen now? Could the person sitting next to you do the same?
- Have a matrix of all possible combinations of disasters with plans to address each of these.
- Processes should be documented in such a way that any team member can execute a given process in an emergency.
- Document all manual processes
- Document the process for moving your service to a new datacenter
- Automate the process for building and releasing a new version
- Google's incident management system is based on the FEMA Incident Command System.
- This includes several distinct roles that should be delegated to particular individuals for an incident (swap roles around the team for different incidents):
  - Incident Command
    - holds the high-level state of the incident and maintains the living incident document; structures the incident response task force, assigns responsibilities and prioritises
  - Operational Work
  - Communication
    - the public face of the incident response task force; provides periodic updates
  - Planning
    - deals with longer-term issues, such as filing bugs, ordering dinner, arranging handoffs and tracking how the system has diverged from the norm.
- The incident management framework can apply to other operational changes as part of change management, which provides an opportunity to practice the process.
- Weighted round robin is often used for load balancing. Each client keeps a "capability" score for each backend in its subset.
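A minimal sketch of weighted round robin (the "smooth" variant) driven by those per-backend capability scores; how the scores are derived from backend-reported load is out of scope here, and the weights are illustrative:

```python
from collections import Counter


class WeightedRoundRobin:
    def __init__(self, capability: dict):
        self.capability = dict(capability)          # capability score per backend
        self.credit = {b: 0.0 for b in capability}  # running credit per backend

    def pick(self) -> str:
        # Give every backend credit proportional to its capability, send the
        # request to the backend with the most credit, then charge it the total.
        total = sum(self.capability.values())
        for backend, score in self.capability.items():
            self.credit[backend] += score
        chosen = max(self.credit, key=self.credit.get)
        self.credit[chosen] -= total
        return chosen


wrr = WeightedRoundRobin({"backend-a": 8.0, "backend-b": 4.0, "backend-c": 1.0})
# Over 1300 picks the traffic splits roughly 800 / 400 / 100.
print(Counter(wrr.pick() for _ in range(1300)))
```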
- Gracefully handling overload conditions is fundamental to running a reliable serving system.
- Build clients and backends to handle resource restrictions gracefully: redirect when possible, serve degraded results when necessary, and handle resource errors transparently when all else fails.
- On the client side, retain counts of requests and accepts for the last two minutes of history. Use these to throttle and to handle retries (see the sketch after these bullets).
- Assign different criticalities to requests (Google has four levels: CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, SHEDDABLE).
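A minimal sketch of the client-side adaptive throttling idea. The rejection formula, max(0, (requests − K·accepts) / (requests + 1)) with K = 2, follows the book's description; keeping a separate throttler per criticality is an assumption about how the two bullets above might fit together:

```python
import random
import time
from collections import deque
from enum import IntEnum


class Criticality(IntEnum):
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3


class AdaptiveThrottler:
    """Throttle on the client using a sliding window of requests and accepts."""

    def __init__(self, k: float = 2.0, window_seconds: float = 120.0):
        self.k = k                    # lower K throttles more aggressively
        self.window = window_seconds  # "the last two minutes of history"
        self.events = deque()         # (timestamp, accepted_by_backend) pairs

    def _prune(self, now: float) -> None:
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record(self, accepted: bool, now: float = None) -> None:
        """Call for every attempted request; accepted=False if it was rejected."""
        now = time.monotonic() if now is None else now
        self.events.append((now, accepted))
        self._prune(now)

    def should_send(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        self._prune(now)
        requests = len(self.events)
        accepts = sum(1 for _, accepted in self.events if accepted)
        reject_probability = max(0.0, (requests - self.k * accepts) / (requests + 1))
        return random.random() >= reject_probability


# One throttler per criticality, so SHEDDABLE traffic backs off first while
# CRITICAL_PLUS keeps being attempted.
throttlers = {c: AdaptiveThrottler() for c in Criticality}
```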
- The Paxos protocol is often used for strong consistency guarantees in a distributed environment.
- Paxos utilises a consensus approach.
- Google uses this for many things, including cron and Chubby (their distributed lock service).
- BASE (Basically Available, Soft state, Eventual consistency) allows for higher availability than ACID (Atomicity, Consistency, Isolation, Durability), in exchange for a softer distributed consistency guarantee.
- The most important difference between backups and archives is that backups can be loaded back into an application, while archives cannot.
- From the user's point of view, data integrity without expected and regular availability is effectively the same as having no data at all.
- Bugs in applications account for the vast majority of data loss and corruption events.
- The ability to undelete data for a limited time becomes the primary line of defence against the majority of otherwise permanent, inadvertent data loss. c.f. soft delete and then deletion after a reasonable delay (e.g. Mail trash can).
- Backups and data recovery are the second line of defence after soft deletion. Test your data recovery.
- The third line of defence is early detection; put in place out-of-band checks and balances within and between an application's datastores.
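A minimal sketch of the first line of defence, soft deletion with delayed purging; the in-memory store and the retention period are illustrative:

```python
import time

RETENTION_SECONDS = 30 * 24 * 3600  # illustrative delay before data is really gone


class SoftDeleteStore:
    """Mark records as deleted instead of destroying them immediately."""

    def __init__(self):
        self._rows = {}  # key -> (value, deleted_at timestamp or None)

    def put(self, key, value):
        self._rows[key] = (value, None)

    def get(self, key):
        value, deleted_at = self._rows[key]
        if deleted_at is not None:
            raise KeyError(key)  # hidden from normal reads once "deleted"
        return value

    def delete(self, key):
        value, _ = self._rows[key]
        self._rows[key] = (value, time.time())  # soft delete: just mark it

    def undelete(self, key):
        value, _ = self._rows[key]
        self._rows[key] = (value, None)  # recoverable within the retention window

    def purge_expired(self, now=None):
        """Permanently remove rows whose soft-delete delay has elapsed."""
        now = time.time() if now is None else now
        for key, (_, deleted_at) in list(self._rows.items()):
            if deleted_at is not None and now - deleted_at > RETENTION_SECONDS:
                del self._rows[key]
```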
- The ability to control the behaviour of a client from the server side has proven a useful tool (c.f. a config file might enable/disable certain features or set parameters such as how often to sync or retry)
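A minimal sketch of a client honouring server-pushed configuration; the field names and defaults are hypothetical:

```python
import json

# Hypothetical defaults the client falls back to when no server config applies.
DEFAULTS = {"sync_enabled": True, "sync_interval_seconds": 300, "max_retries": 3}


def apply_server_config(server_payload: str) -> dict:
    """Merge a server-provided JSON config over the client's built-in defaults."""
    config = dict(DEFAULTS)
    config.update(json.loads(server_payload))
    return config


# e.g. the server dials back sync frequency and disables a feature remotely,
# without shipping a new client binary.
print(apply_server_config('{"sync_interval_seconds": 900, "sync_enabled": false}'))
```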
- The book provides a good visualisation of a blueprint for bootstrapping an SRE to on-call.
- Have learning paths that are orderly for new SREs, such as:
  - How a query enters the system
  - Frontend serving
  - Mid-tier services
  - Infrastructure
  - Tying it all together: debugging techniques, escalation procedures and emergency scenarios