Manual and Automated Runbooks for SREs

Michael Levan - Sep 15 '21 - Dev Community

Over the years, engineers have used a few different names for what we call a Runbook today, but what does the term actually mean?

Simply put, a Runbook is an exact set of instructions on how to fix an issue.

For example, let's say you have a pesky service that constantly goes down. The long-term goal is to fix the underlying code, but that'll take time, so in the meantime you write a Runbook (a specific list of instructions) that anyone can follow to bring the service back up.

What if you want to take things a step further? You can, with automated Runbooks. In this blog post, you'll learn what an automated Runbook is and how you can use one.

What Is A Runbook

At a very high level, a Runbook is a set of instructions. It tells a person, regardless of what tech team they're on, how to fix an issue.

At a more in-depth level?

A Runbook is a set of step-by-step instructions, with no step left out, for resolving a particular issue that may occur. The instructions are extremely detailed, all the way down to "open up the terminal". Even though much of what is included may be obvious to some, the idea is that anyone can run the Runbook, even someone who isn't all that technical or who works in a very different technical discipline.

A key point here is that even when the solution is minor, the path to it may be detailed. If the fix is a simple service restart, getting to the service looks quite different depending on whether it runs on an EC2 instance in AWS or a virtual machine in Azure.

Runbooks are great because there's no guessing involved. Anyone can dive in, fix an issue, and move on with their day. A good Runbook is as detailed as an architecture plan for building a house, so nobody using it is left in doubt.

A Typical Runbook (Documentation)

A Runbook should include a few things:

  • WHAT: Review the actual problem. What tools should the person using the Runbook look at? Which monitoring platforms are needed? The idea here is to save cognitive load for actually solving the problem instead of finding the problem.
  • WHERE: Be sure to share the location of the tools, notes, docs, etc. needed to fix the issue.
  • HOW: Next, start on the actions that are used to remediate the problem (service restarts for example).
  • WHO: If the fix doesn't work, escalation and subject matter experts are important for any Runbook. Figure out who should be included and add their contact information.
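One way to keep those four parts consistent across Runbooks is to treat them as a small data structure. The following is only a sketch: the `Runbook` class, its field names, and the example service, URLs, and contacts are all hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical container for the WHAT / WHERE / HOW / WHO parts of a Runbook."""
    title: str
    what: str                                       # the problem and how to confirm it
    where: List[str] = field(default_factory=list)  # tools, notes, docs
    how: List[str] = field(default_factory=list)    # ordered remediation steps
    who: List[str] = field(default_factory=list)    # escalation contacts

    def missing_sections(self) -> List[str]:
        """Return the names of list sections that are still empty."""
        sections = {"WHERE": self.where, "HOW": self.how, "WHO": self.who}
        return [name for name, items in sections.items() if not items]

    def render(self) -> str:
        """Return the Runbook as plain text with numbered remediation steps."""
        lines = [self.title, "", f"WHAT: {self.what}", "", "WHERE:"]
        lines += [f"  - {item}" for item in self.where]
        lines += ["", "HOW:"]
        lines += [f"  {n}. {step}" for n, step in enumerate(self.how, start=1)]
        lines += ["", "WHO:"]
        lines += [f"  - {contact}" for contact in self.who]
        return "\n".join(lines)


# Illustrative values only; none of these names come from a real system.
example = Runbook(
    title="Restart the example-api service",
    what="example-api stops responding; confirm on the monitoring dashboard.",
    where=["Monitoring dashboard: https://monitoring.example.com"],
    how=["Open up the terminal.", "SSH into the server.", "Restart the service."],
)
print(example.missing_sections())
print(example.render())
```

A check like `missing_sections()` makes it easy to spot a Runbook that has remediation steps but nobody to escalate to.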

Why This Doesn't Fully Work

Runbooks are great and detailed, but they don't fully work in the autonomous world we should be living in.

Here's the problem, though: a Runbook is still manual. An engineer has to do the work by hand, whether it's at 2:00 PM or 2:00 AM.

Here are a few concerns with that:

  • If the steps are detailed with code, commands, and instructions, why can't that be set up to run automatically?
  • Humans can follow instructions, but humans are also prone to errors. Properly tested code is far less so.
  • Engineers have to take time out of their day or night to fix an issue when, in fact, most of the issues can be resolved automatically.
  • MTTR (Mean Time To Recovery) may be longer with manual efforts.
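To make the MTTR point concrete, here is a minimal sketch that computes it as the total recovery time divided by the number of incidents. The function name and the two incident timestamps are made up for illustration; they are not measurements from a real system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average the (started, recovered) durations across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical incidents: one 45-minute recovery at 2:00 AM, one 15-minute recovery at 2:00 PM.
incidents = [
    (datetime(2021, 9, 15, 2, 0), datetime(2021, 9, 15, 2, 45)),
    (datetime(2021, 9, 15, 14, 0), datetime(2021, 9, 15, 14, 15)),
]
print(mean_time_to_recovery(incidents))  # prints 0:30:00
```

Tracking this number before and after automating a Runbook is one way to check whether the automation actually shortens recovery.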

At this point, you may be wondering how to take those fixes and automate them.

Automating Runbooks (Workflows)

Depending on the situation, there are absolutely ways to automate the fix. There are also ways to think about automation for a fix that may not be so obvious.

For example, let's say one fix is to SSH into a server and restart a service. Sounds easy, but there are a lot of steps that go into this:

  • Being on a network that has proper connectivity to the network that the server is on.
  • Valid authentication to the server.
  • The exact commands to run.

Something that seems easy when you're on a terminal in your day-to-day job may be a little bit more cumbersome to automate, but it's certainly doable.
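As a rough sketch of what automating that SSH restart could look like, the snippet below shells out to the `ssh` client from Python. It assumes key-based authentication is already set up, that the remote host uses systemd, and that the remote user can run `sudo systemctl` without a password. The function names, the `ops` user, and the host and service in the usage comment are hypothetical.

```python
import shlex
import subprocess
from typing import Callable, List


def build_restart_command(host: str, service: str, user: str = "ops") -> List[str]:
    """Build the ssh command that restarts a systemd service on a remote host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of prompting for a password
        "-o", "ConnectTimeout=10",  # give up quickly if the network path is broken
        f"{user}@{host}",
        "sudo", "systemctl", "restart",
        shlex.quote(service),       # quote the name for the remote shell
    ]


def restart_service(
    host: str,
    service: str,
    user: str = "ops",
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> bool:
    """Run the restart over SSH and report whether the command exited cleanly."""
    cmd = build_restart_command(host, service, user)
    result = run(cmd, capture_output=True, text=True, timeout=60)
    return result.returncode == 0


# Usage (hypothetical host and service):
#   ok = restart_service("app01.example.com", "nginx")
```

Each prerequisite from the list shows up somewhere: connectivity in the timeout, authentication in `BatchMode`, and the exact command in the argument list. Passing `run` as a parameter keeps the function testable without touching a real server.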

There are a few options you can choose:

  • Create your own automation for Runbooks
  • Use a Runbook automation solution
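If you go with the first option, a home-grown automated Runbook can start out as a small loop: check whether the service is healthy, apply the fix, check again, and escalate to a human if the fix doesn't stick. This is only a sketch of that idea; `run_automated_runbook` and its three callables are hypothetical names that you would wire up to your own health check, remediation, and paging tools.

```python
from typing import Callable


def run_automated_runbook(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    escalate: Callable[[str], None],
    max_attempts: int = 2,
) -> str:
    """Try the fix up to max_attempts times, then hand off to a human."""
    if is_healthy():
        return "healthy"  # nothing to do

    for _ in range(max_attempts):
        remediate()       # the HOW: e.g. restart the service
        if is_healthy():
            return "remediated"

    # The WHO: the fix did not work, so notify the escalation contact.
    escalate(f"remediation failed after {max_attempts} attempts")
    return "escalated"
```

Keeping the escalation step in the automation preserves the WHO part of the written Runbook: when the automated fix fails, a person still gets notified.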

If you're leaning toward an existing solution, PagerDuty Rundeck is a solid one that is free for personal use, so you can test it out and try it with a small team.

Rundeck used to be its own solution, but PagerDuty acquired it. Now you can combine PagerDuty for alerts with Rundeck to automate the response to the alert.

If you're interested in checking out Rundeck, try out the community version via Docker, which you can find here: https://github.com/rundeck/welcome-project-community
