Efficient On-Call Practices For SREs

Michael Levan - Nov 1 '21 - Dev Community

What's the biggest thing that people dread when on-call? That 2:00 AM call saying that some system or service is down. People typically dread it because it happens all of the time and no one is able to actually fix the issue; they just throw duct tape on it. Maybe it's a service that everyone knows is a problem, but no one actually fixes. Maybe it's an issue with a system's resources, but no one wants to take the time to properly set it up with auto-scaling. Maybe it's simply a fluke.

There are several steps that an organization can take to ensure on-call practices are proper and efficient instead of a major headache.

In this blog post, we'll talk about the steps needed for efficient on-call practices.

Schedule With Respect (Better Rotations)

I've been in positions where I was on-call 14-15 days out of the month. As you can imagine, that's not a great quality of life. Being on call 2 weeks out of the month wasn't even the real problem; I typically have my laptop/iPad with me anyway, so carrying that around wasn't a big deal. The real problem was that there were several recurring issues and management didn't want to take the time to fix them, even though doing so would have reduced tech debt.

Another aspect of it is this - even though carrying around a laptop isn't a huge deal, who wants to be on-call 14 days out of the month with no overtime? That's 336 hours each month.

Organizations must think about better on-call rotations. Even though on-call will typically fall on SREs or DevOps teams, this isn't always the case. Instead, think about having development teams be on-call as well.

For example, maybe you're experimenting with high-velocity teams and service-specific teams. If there is a service-specific team, they can be on-call for that service.

If you're a smaller startup and don't have enough employees to separate services and teams like that, you must be upfront with the people you're hiring about what the on-call schedule looks like, and you must pay them for it.
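To see how team size changes the load, here is a minimal sketch of a round-robin rotation in Python. The function names, the weekly shift length, and the placeholder engineer names are all assumptions made for illustration; real scheduling tools also handle things like time zones, holidays, and shift swaps.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Assign consecutive shifts to engineers in round-robin order.

    Returns a list of (shift_start, shift_end, engineer) tuples.
    """
    schedule = []
    names = cycle(engineers)
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days)
        schedule.append((shift_start, shift_end, next(names)))
    return schedule


def on_call_hours(schedule, engineer):
    """Total hours one engineer spends on-call across the schedule."""
    return sum(
        (end - begin).days * 24
        for begin, end, name in schedule
        if name == engineer
    )


# Four weekly shifts split between two people: 14 days (336 hours) each.
two_person = build_rotation(["alice", "bob"], date(2021, 11, 1), shifts=4)

# The same four weeks split between four people: 7 days (168 hours) each.
four_person = build_rotation(
    ["alice", "bob", "carol", "dan"], date(2021, 11, 1), shifts=4
)
```

With only two people in the rotation, each one lands on the 336 hours mentioned above. Adding a service-specific development team to the rotation cuts that in half.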

Define Escalation Paths

One of my first jobs out of school was as a Support Engineer. The responsibilities were broad, but one of them was to be on-call. Since I had just come out of school, I barely knew how to work a backup server, let alone fix an application or a system that was down after-hours. The worst part was that the escalation path was to VPs and C-levels (it was a small company, but not that small). As you can imagine, they weren't pleased to have the issue escalated to them (which raises the question: why be an escalation point?).

The point of the story is this - have proper escalation paths. When you define escalation paths, ensure they include the proper people and that those people understand what being an escalation point means. If you put a VP or a Director who hasn't touched any of the code or the systems on an escalation path, what can they actually do to help out? Not much, so having them as an escalation point doesn't really help anyone.

Instead, think about a few things when defining escalation paths:

  • Who's the team lead or code owner for that application or service?
  • Who are the most senior engineers? They should be the last escalation point, for when no one else can solve the issue.
  • Should entry-level engineers actually be on-call?
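The questions above can be turned into an explicit policy. The following is a minimal Python sketch, not a real paging tool's API: the `EscalationStep` class, the role names, and the wait times are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    role: str
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


# Hypothetical policy: people who know the service come first,
# and the most senior engineers are the last resort.
POLICY = [
    EscalationStep("primary on-call", 15),
    EscalationStep("team lead / code owner", 15),
    EscalationStep("senior engineer", 30),
]


def who_to_page(policy, minutes_unacknowledged):
    """Return the role that should be paged after the alert has gone
    unacknowledged for the given number of minutes."""
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.role
    # Past the end of the policy: stay with the final escalation point.
    return policy[-1].role
```

Note that nobody on this path is a VP or Director who has never touched the system; every step is someone who can actually work the problem.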

Handle It With Automation

Even though we live in a world of everyone throwing the word automation out at everything, there are still two cornerstones of tech that are lacking automation - networking and on-call.

Automating the response to on-call alerts, like a service that needs a restart or a system that's short on resources, is something that's now available. In the cloud world, for example, it's straightforward to have a script auto-scale a cluster that's running out of system resources. You can also have a system that calls upon a script to restart a service.

There are tools out there that help with what's called Automated Runbooks, like xMatters and PagerDuty Rundeck. These platforms help you automate the tasks that no one needs to wake up at 2:00 AM for. The best thing, in this case, is to still get notified that something occurred, but not to wake someone up. Instead, they can view it in the morning and perform a Root Cause Analysis (RCA).
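As a rough illustration of the idea, and not how any particular runbook platform works, here is a minimal Python sketch that tries a service restart first and only pages a human if that fails. The alert dictionary shape, the `"notify"`/`"page"` return values, and the function names are assumptions; `systemctl restart` is used as the restart command on the assumption that the host runs systemd.

```python
import subprocess


def restart_command(service):
    """Build the restart command (assumes a systemd-managed host)."""
    return ["systemctl", "restart", service]


def handle_alert(alert, run=subprocess.run):
    """Try automated remediation before waking anyone up.

    Returns "notify" when the fix worked (log it for morning review)
    or "page" when a human is needed. `run` is injectable so the logic
    can be exercised without touching a real service.
    """
    if alert.get("type") == "service_down":
        result = run(restart_command(alert["service"]), check=False)
        if result.returncode == 0:
            return "notify"
    # Unknown alert type, or the restart failed: escalate to a person.
    return "page"
```

Either way, the event is still recorded. The only thing that changes is whether someone's phone rings at 2:00 AM.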

Of course, not every issue that occurs will be as easy as automating an auto-scaling group or restarting a service, so you'll need to get creative in some cases. Another thing that proper automation allows you to do is identify the problem. For example, if you keep having to auto-scale a cluster because RAM keeps spiking, maybe there's a memory leak in the application. At that point, you can go and fix the code, which will ensure that the alert doesn't happen for that particular issue again.
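One way to surface those recurring problems is to count how often each automated fix fires. The sketch below is a hypothetical example: the event format, the service and action names, and the threshold of three are all assumptions, and in practice the events would come from whatever log your automation writes.

```python
from collections import Counter


def recurring_issues(events, threshold=3):
    """Return the (service, action) pairs whose automated remediation
    ran at least `threshold` times, as candidates for a real fix."""
    counts = Counter((e["service"], e["action"]) for e in events)
    return sorted(key for key, n in counts.items() if n >= threshold)


# Hypothetical remediation log for one week.
events = [
    {"service": "checkout", "action": "scale_up"},
    {"service": "checkout", "action": "scale_up"},
    {"service": "search", "action": "restart"},
    {"service": "checkout", "action": "scale_up"},
]

candidates = recurring_issues(events)
```

Here `checkout` was scaled up three times, so it's flagged as something to investigate in the code rather than keep patching with automation.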

Managers - Keep A Cool Head

Two of the biggest problems bad managers have are:

  • They panic when things get serious
  • They try to go on witch hunts and point fingers

Neither of those behaviors provides any value at all. Running around like a chicken with your head cut off solves zero problems and brings nothing to the table.

Instead, managers must think about keeping a cool head. Solving an on-call issue, whether it's an application that's down or systems that are down, isn't a walk in the park. There are SLAs at stake and people breathing down your neck. As a manager, you must know to keep it cool and keep the heat from leadership away from your engineers. Why? Because your engineers have a job to do: fix the problem. They don't need to hear you freaking out or pointing fingers. Instead, they need you on their side, keeping the drama away from them while they solve the problem.

In short, have a blameless culture.

Pay Overtime

Money doesn't buy back time and it definitely doesn't make waking up at 2:00 AM any easier, but it softens the blow a little bit. As an SRE, you know you'll have to be on-call at some point. The question is: are there better on-call options? The answer is yes.

One of the ways to make being on-call a bit better is by paying your employees to be on-call. Even if they're salaried, they should still have the opportunity to reap some sort of reward. This may sound foreign to some, but there are several organizations that pay overtime when someone is on-call.

This helps for a few reasons:

  • Employees are more inclined to actually fix an issue instead of doing the bare minimum to get back to bed.
  • Employees want to put in the effort to fix the issue because they have a cash incentive.
  • Management gets much smarter about when employees are on-call and what they should get alerted for. It makes the employee's life much better, and it saves the organization money to actually fix the issue instead of just restarting a service and leaving the underlying code broken, which ultimately decreases tech debt.