These are some of the key takeaways from my reading of the Google SRE book. These points stood out to me because they were things seemed not obvious and piqued my interest.
- Beware of mean statistics, percentiles can reveal more information
- This is because the distributions of response times often have long tails.
- Dashboards are pretty, but no one should ever have to watch them.
- Responses to alerts should require intelligence
- The benefits of automation are much more than time saving, in fact this is one of the least important benefits
- Consistency and enabling anyone to do the task (without expertise) are some of the benefits
- Whitebox monitoring is more valuable than blackbox monitoring as it allows you to address problems before they become catastrophes
- End-to-end tests are an example of blackbox monitoring
- 95th percentile response time is an example of whitebox monitoring
- Turnup processes must be owned by the service owner
- Automation workflow should be idempotent when possible