Making the most of our startup’s on-call rotation

RJ Zaworski - Nov 16 '21 - Dev Community

Will I have to be on call?

In the last hour of Koan’s on-site interview we turn the tables and invite candidates to interview our hiring team. At face value it’s a chance for candidates to raise any open questions that haven’t been answered earlier in the process. It’s also a subtle way to introspect on our own hiring process — after three rounds of interviews and side-channel conversations with the hiring manager, what have we missed? What’s on candidates’ minds? Can we address it earlier in the process?

So, you asked, will I have to be on call?

The middle-of-the-night pager rings? The panicked investigations? Remediation, write-ups, post-mortems?
We get it. We’ve been there.

Patrick Collison’s been there, too:

“Don’t ruin the duck.” There are worse guiding principles for an on-call process (and operational health generally).

So, will I have to be on call?

Yeah, you will. But we’ve gotten a ton out of Koan’s on-call rotation and we hope you will, too. Ready to learn more?

On-call at Koan

We set up Koan’s on-call rotation before we’d heard anything about Patrick’s ducks. Our version of “don’t ruin the duck” included three principles that (if somewhat less evocative) have held up surprisingly well:

  1. We concentrate distractions — our on-call developer is tasked with minimizing context switching for the rest of the team. We’ll escalate incidents if needed, but as much as possible the business of ingesting, diagnosing, and triaging issues in production services stays in a single person’s hands — keeping the rest of the team focused on shipping great product.
  2. We control our own destiny — just like Koan’s culture at large, being on call is much more about results (uptime, resolution time, pipeline throughput, and learning along the way) than about how they come about. Our on-call developer wields considerable authority over how issues are fielded and dispatched, and even over the production release schedule.
  3. We take turns — on-call responsibilities rotate weekly. This keeps everyone engaged with the on-call process and avoids condemning any single person to an eternity (or even an extended period) of pager duty.

These principles have helped us wrangle a fundamentally interrupt-driven process. What we didn’t realize, though, was how much time — and eventually, value — we were recovering between the fire drills.

How bugs begin

Before getting to that, though, we’d be remiss to skip the easiest path to a calm, quiet on-call schedule: don’t release. To paraphrase Descartes, code ergo bugs — no matter how diligent you are in QA, shipping software means injecting change (and therefore new defects) into your production environment.

Not shipping isn’t an option. We’re in the habit of releasing multiple times per day, not to mention all of the intermediate builds pushed to our staging environment via CI/CD. A production issue every now and then is a sign that the system’s healthy, and that we’re staying ambitious and shipping fast.

But it also means that things sometimes break. And when they do, someone has to pick up the phone.

Goals

On the bad days, on-call duty is a steady stream of interruptions punctuated by the occasional crisis. On the good days it isn’t much to write home about. Every day, though, there are at least a few minutes to tighten down screws, solve problems, and explore the system’s nooks and crannies. This is an intentional feature (not a bug) of our on-call rotation, and the payoff has been huge. We’ve:

  • built shared ownership of the codebase and production systems
  • systematized logging, metrics, monitoring, and alerting
  • built empathy for customers (and our support processes)
  • spread awareness of little-used features (we’re always onboarding)
  • iterated on key processes (ingestion/triage, release management, etc.)

You don’t get all that by just passing around a firefighting hat. You need buy-in and — crucially — a healthy relationship with your production environment. Which brings us back to our principles, and the on-call process they enable.

We concentrate distractions

When something breaks, the on-call schedule clarifies who’s responsible for seeing that it’s fixed. As the proverbial umbrella keeping everyone else focused and out of the rain (sometimes a downpour, sometimes a drizzle), you don’t need to fix every problem you see immediately: just investigate, file, and occasionally flag one for immediate attention.

That still means a great deal of on-call time spent ingesting and triaging a steady drip of symptoms from a few different channels (funneled into one queue, as sketched after this list):

  • customer issues escalated by our customer success team
  • internal bug reports casually mentioned in conversations, Slack channels, or email threads
  • exceptions/alerts reported by application and infrastructure monitoring tools
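To make the juggling concrete, here’s a minimal sketch (not Koan’s actual tooling) of how reports from those three channels might funnel into a single triage queue. Every type and field name below is hypothetical.

```typescript
// A minimal sketch (not Koan's actual tooling) of funneling symptoms
// from different channels into one triage queue. All type and field
// names here are hypothetical.

type SymptomSource = "customer-success" | "internal-report" | "monitoring";

interface TriageItem {
  source: SymptomSource;
  summary: string;
  reportedAt: Date;
  needsImmediateAttention: boolean;
}

// One queue, one place for the on-call developer to look.
const triageQueue: TriageItem[] = [];

function ingest(source: SymptomSource, summary: string, urgent = false): void {
  triageQueue.push({
    source,
    summary,
    reportedAt: new Date(),
    needsImmediateAttention: urgent,
  });
}

// The three channels above, feeding the same queue:
ingest("customer-success", "Customer reports goal updates not saving");
ingest("internal-report", "Slack mention: avatar upload fails on Safari");
ingest("monitoring", "Error rate above threshold on /api/goals", true);
```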

Sometimes symptoms aren’t just symptoms, and there’s a real issue underneath. Before you know it, the pager starts ringing—

Enter the pager

The water’s getting warmer. A pager ping isn’t the end of the world, but we’ve tuned out enough false positives that an alert is a good sign that something bad is afoot.

Once you’ve confirmed a real issue, the next step is to classify its severity and impact. A widespread outage? That needs attention immediately. Degraded performance in a specific geography? Not awesome, but something that can probably wait until morning. Whatever it is, we’re looking to you to coordinate our response: updating our status page externally, and either escalating the issue or resolving it yourself.
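For a flavor of how that call gets made, here’s a rough sketch in code. The severity tiers and criteria are illustrative stand-ins, not Koan’s actual policy.

```typescript
// A rough, illustrative take on the severity call described above.
// The tiers and criteria are stand-ins, not Koan's actual policy.

type Severity = "page-now" | "next-morning" | "backlog";

interface Incident {
  widespreadOutage: boolean;
  affectedRegions: string[];
  degradedOnly: boolean;
}

function classify(incident: Incident): Severity {
  // A widespread outage gets attention immediately.
  if (incident.widespreadOutage) return "page-now";

  // Degraded performance limited to one geography can usually wait
  // until morning.
  if (incident.degradedOnly && incident.affectedRegions.length <= 1) {
    return "next-morning";
  }

  // Everything else is filed and prioritized with the rest of the queue.
  return "backlog";
}

// Whatever the answer, the on-call developer coordinates the response:
// status page first, then escalate or resolve.
console.log(classify({ widespreadOutage: false, affectedRegions: ["eu-west"], degradedOnly: true }));
```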

On-call isn’t a private island. There will always be times we need to pause work in progress, call in the team, and get to the bottom of something that’s keeping us down. But the goal is to do it in a controlled fashion, holding as much space for everyone else as you reasonably can.

We control our own destiny

Your responsibilities aren’t purely reactive, however. Controlling your own destiny means having at least a little agency over what breaks and when. This isn’t just wishful thinking. While issues introduced in the past are always a lurking threat — logical edge cases, bottlenecks, resource limits, and so on — the source of most new issues is a new release.

It makes sense, then, for whoever’s on call to have the last word on when (and how) new releases are shipped. That responsibility (roughly sketched in code after this list) includes:

  • managing the release — generating changelogs, reviewing the contents of the release, and ensuring the appropriate people are warned and signatures are obtained
  • debugging release / deployment issues — monitoring both the deployment and its immediate aftermath, and remediating any issues that arise
  • making the call on hotfix releases and rollbacks — as a step sideways from our usual flow, they’re not tools we use often, but they’re there (and very quick) if you need them
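As a loose illustration, release gatekeeping might look something like the sketch below. The `git log` command is real, but the tag, thresholds, and helper names are assumptions, and the actual decision involves judgment no script captures.

```typescript
// An illustrative sketch of release gatekeeping. The git command is real,
// but the tag, thresholds, and helper names are assumptions; the actual
// call involves judgment no script captures.

import { execSync } from "node:child_process";

// Generate a quick changelog from commits since the last release tag.
function changelogSince(lastTag: string): string {
  return execSync(`git log ${lastTag}..HEAD --oneline`, { encoding: "utf8" });
}

// The on-call developer has the last word: ship only when the release has
// been reviewed and the production queue is quiet.
function shouldShip(openProductionIssues: number, reviewed: boolean): boolean {
  return reviewed && openProductionIssues === 0;
}

console.log(changelogSince("v1.2.3")); // hypothetical tag

if (shouldShip(0, true)) {
  console.log("Ship it, then watch the deployment's immediate aftermath.");
} else {
  console.log("Hold the release until the open issues are resolved.");
}
```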

Closing the feedback loop

An unexpected benefit we’ve noticed from coupling on-call and release management duties is the backpressure it puts on both our release cadence and deployment pipeline. If we’re underwater with issues from the previous release, the release manager has strong incentives to see they’re fixed before shipping anything else. Ditto any issues in our CI/CD processes.

Neither comes up too often, fortunately, and while we can’t totally write off the combination of robust systems and generally good luck, it’s just as hard to discount the benefits of tight feedback and an empowered team.

We take turns

But you said, “team!” — a lovely segue to that last principle. Rotating on-call responsibility helps underscore our team’s commitment to leaving a relatively clean bill of health (releases shipped, exceptions handled, tickets closed, etc.) for the next person up. When you’re on call, you’re the single person best placed to deflect issues that would otherwise engulf the entire team. When you’re about to be on call, you’re invested in supporting everyone else in doing the same. You’d love to start your shift with:

  • healthy systems
  • a manageable backlog of support inquiries
  • a clear list of production exceptions
  • a quick brain-dump of issues fielded (and ongoing concerns) from the teammate you’re taking over from

A frequent rotation almost guarantees that everybody’s recently felt the same way. Team members regularly swap shifts (for vacations, appointments, weddings, anniversaries, or any other reason), but it’s never long before you’re back on call.

The rest of the time

Ultimately, we’ve arrived at an on-call process that balances the realities of running software in production with a high degree of agency. We didn’t explicitly prioritize quality of life, and we don’t explicitly track how much time on-call duties are eating up. But collective ownership, individual buy-in, and tight feedback have pushed the former up and the latter down, to the point where you’ll find you have considerable time left over for other things. Ideally you’ll use your turn on call to dig deeper into the issues you touch along the way (one small example follows this list):

  • exploring unfamiliar features (with or without reported bugs)
  • tightening up our CI processes
  • tuning configurations
  • writing regression tests
  • improving logging and observability
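As one small, hypothetical example of how that time gets spent: a bug fielded during the week can be pinned down with a regression test before the shift ends. The validation function and behavior below are invented for illustration.

```typescript
// A hypothetical example of quieter on-call hours at work: a bug fielded
// during the week becomes a regression test. The validation function and
// behavior below are invented for illustration.

import { strict as assert } from "node:assert";

// Say an on-call week surfaced a bug where whitespace-only goal titles
// slipped through validation. The fix is trivial; the test keeps it fixed.
function isValidGoalTitle(title: string): boolean {
  return title.trim().length > 0;
}

// Regression: whitespace-only titles were previously accepted.
assert.equal(isValidGoalTitle("   "), false);
assert.equal(isValidGoalTitle("Ship the Q4 roadmap"), true);

console.log("regression checks passed");
```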

Yes, you’ll be triaging issues, squashing bugs, and maybe even putting out the odd production fire. But you can almost count on having time left over to chip away at the need for on-call in the first place. You’re on the hook to fix things if they break — and empowered to make them better.

So yes, you’ll have to take an on-call shift.

Help us make it a good one!


Cover image by Daniel Seßler on Unsplash
