SRE book notes: Emergency Response

Hercules Lemke Merscher - Jan 30 '23 - Dev Community

These are notes on Chapter 13, Emergency Response, of the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series.

The chapter presents three case studies of real outages within Google’s infrastructure, along with the steps taken to remediate and fix each problem.


Things break; that’s life.

First of all, don’t panic!

If you feel overwhelmed, pull in more people.


Because we hadn’t tested our rollback procedures in a test environment, these procedures were flawed, which lengthened the outage. We now require thorough testing of rollback procedures before such large-scale tests.

The moral of the story: always test your rollback and backup recovery procedures, folks. Until you have tested your rollback, you don’t have a rollback.
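To make that moral a bit more concrete, here is a minimal sketch of a rollback drill. It is not from the book: the config dictionary, `apply_change`, `rollback`, and `rollback_drill` are all hypothetical names invented for illustration. The idea is simply to apply a change in a test environment, roll it back, and verify that the restored state matches the original.

```python
import copy


def apply_change(config: dict) -> dict:
    """Hypothetical change: return a new config with a raised connection limit."""
    updated = copy.deepcopy(config)
    updated["max_connections"] = 500
    updated["version"] = config["version"] + 1
    return updated


def rollback(history: list) -> dict:
    """Hypothetical rollback: drop the latest config and return the previous one."""
    if len(history) < 2:
        raise RuntimeError("nothing to roll back to")
    history.pop()
    return copy.deepcopy(history[-1])


def rollback_drill(initial: dict) -> bool:
    """Apply a change, roll it back, and check the original state is restored."""
    history = [copy.deepcopy(initial)]
    history.append(apply_change(history[-1]))
    restored = rollback(history)
    return restored == initial


if __name__ == "__main__":
    ok = rollback_drill({"version": 1, "max_connections": 100})
    print("rollback verified" if ok else "rollback FAILED")
```

In a real system the comparison would be against actual service state rather than a dictionary, but the shape of the check stays the same: change, roll back, compare.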


Time and experience have shown that systems will not only break, but will break in ways that one could never previously imagine. One of the greatest lessons Google has learned is that a solution exists, even if it may not be obvious, especially to the person whose pager is screaming. If you can’t think of a solution, cast your net farther. Involve more of your teammates, seek help, do whatever you have to do, but do it quickly. The highest priority is to resolve the issue at hand quickly. Oftentimes, the person with the most state is the one whose actions somehow triggered the event. Utilize that person.


There is no better way to learn than to document what has broken in the past. History is about learning from everyone’s mistakes. Be thorough, be honest, but most of all, ask hard questions.

Ensure that everyone within the company can learn what you have learned by publishing and organizing postmortems.

Once you have a solid track record for learning from past outages, see what you can do to prevent future ones.


Ask the Big, Even Improbable, Questions: What If…?


When it comes to failures, theory and reality are two very different realms. Until your system has actually failed, you don’t truly know how that system, its dependent systems, or your users will react.




Photo by Jason Leung on Unsplash
