Shift Shift Forward: Incidents

Melissa McEwen - May 14 '20 - Dev Community

Shift Shift Forward is a podcast that showcases everything that makes Glitch the best place to create on the web. Subscribe on Apple Podcasts, Breaker, Google Podcasts, Overcast, Pocket Casts, RSS, Spotify, Stitcher

We’ve all experienced it before -- you go to your favorite website, but it’s not loading. Or you try clicking on a link to complete a transaction, but your browser times out and you get an error message. It can be frustrating to deal with outages and similar issues from a user perspective, but let’s see what it looks like from the other side. What happens when these incidents occur, and what does it take to get everything running smoothly again?

On this episode, we look at incidents from three different angles: infrastructure, support, and leadership. Mads Hartmann, Cori Schlegel, Antoinette Smith, and Emmett Walsh take us through how their team fixes outages and other issues, senior support engineer Tasha Hewett shows us how she gracefully handles support issues, and we end with a few words from our CEO, Anil Dash.

Looking for bonus content? Our app collection for this episode is where it's at, including links to Glitch apps, profiles, articles, and other information mentioned in this episode!

Executive Producers: Maurice Cherry, Keisha "TK" Dutes

Editor: Brittani Brown

Engineer: Keisha "TK" Dutes

Mixing: C.

Transcript

Jacky Alciné:

This is Shift Shift Forward, a new podcast for Glitch that showcases everything that makes Glitch the best place to create on the web. I'm Jacky Alciné and I work on the platform team. In this episode, we're taking a look at what happens when websites go down, including ours. We talk to Tasha Hewett, our senior support engineer, and learn what it's like communicating information to users during an outage or incident. Later, we hear from our CEO Anil Dash, who gives us his thoughts on these issues from a leadership perspective. But first, TK sat down with some members of the infrastructure team. They're the ones who not only get notified first when site issues happen, but also help identify and fix those issues to get things back on track.

[SEGMENT 1]

Mads Hartmann:

My name is Mads Hartmann and I'm a site reliability engineer at Glitch.

Emmett Walsh:

My name is Emmett Walsh and I'm a site reliability engineer at Glitch.

Cori Schlegel:

My name is Cori Schlegel and I'm a site reliability engineer at Glitch.

Antoinette Smith:

My name is Antoinette Smith and I'm the engineering manager of the infrastructure team at Glitch.

Keisha "TK" Dutes:

What is the process of response for when an incident happens?

Antoinette Smith:

It really depends. When an incident occurs, typically, if it's something that we have an alert for, then whoever is assigned as the pager holder is the first person who gets alerted, and they handle it.

Keisha "TK" Dutes:

About the pager, is there a literal pager that goes off or an email notification?

Antoinette Smith:

There is no literal pager. There's a site called PagerDuty and everyone has an account on PagerDuty. And then, you can download the PagerDuty app or it'll go to your email. It'll also go to our Slack channels, but no physical pagers.

Emmett Walsh:

When the page first goes off, I think maybe the most uncomfortable experience is if you're getting woken up at three in the morning. I would say my first response is just pure confusion. I don't know what's happening. You get called by PagerDuty and you have to answer the call to acknowledge it. It's happened before that I actually can't remember what a phone is or how to answer it for the first 10 seconds. Normally, it's just complete debilitation. And then shortly after that, it's normally a case of, "Oh, God, is this a bad one or is it a good one?" Because you don't really know until you get logged onto your computer, so it can be a minute or two of mounting trepidation where you're like, "Oh, is it going to be a sticky one or not?" Until eventually, you get stuck into it. And then, you very quickly try and figure out what the scope of the impact is. Is it something that's really urgent? Is it something that you can snooze, both figuratively and literally, or is it something where you need to call in the cavalry and get more support?

Keisha "TK" Dutes:

What's the difference between a good one and a bad one? What's sticky? What's snoozable?

Emmett Walsh:

The kind of page where I see it and feel relief is one that is narrow in scope, I understand it really well, and I've seen it before. It's a bit annoying that I'm being woken up to deal with something that we know is a problem, but we haven't had a chance to do a more automated fix. But at least I know exactly what to do and I know I'll be able to get back to sleep very soon. A scary one is one where everything's down, so all users are affected and I don't know why. It's novel, it's unfamiliar, the signals I'm getting from it are confusing. We have different ways of measuring how things are doing. I think that's the case where it's a bit of a tricky one.

Cori Schlegel:

Once you get to the point where you know what the problem is and at least have a possible solution, a candidate for something we can try to resolve things, in some cases it's still going to be hours and hours before we can recover. That's a little dread-inducing, right? Even if you're confident that you can fix it, you're looking forward to another eight hours of monitoring things and restarting services or restarting the hosts, restarting servers, which is one of the things that takes a long time at Glitch. We have a lot of servers. If something affects all of the servers that are serving user projects, that's a thousand servers that we have to recycle. We can't do it quickly. We can only do, I don't know what the rate is, but a few hundred an hour, so that's an eight- or ten-hour outage if it takes us that long to restart or recreate every single server that's serving a user project. Those make for very, very long days.
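
To put rough numbers on that, here's a back-of-the-envelope sketch; the fleet size and recycle rates are just illustrative figures from the conversation, not real Glitch data or tooling.

```python
# Rough estimate of how long a full fleet recycle takes.
# The numbers below are illustrative assumptions, not real Glitch figures.

def estimated_recovery_hours(total_servers: int, recycled_per_hour: int) -> float:
    """Hours needed to restart or recreate every server at a fixed hourly rate."""
    return total_servers / recycled_per_hour

# ~1,000 project-hosting servers, recycled at "a few hundred an hour"
for rate in (100, 200, 300):
    hours = estimated_recovery_hours(1_000, rate)
    print(f"At {rate} servers/hour: about {hours:.1f} hours to recycle the fleet")
```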

Keisha "TK" Dutes:

There are several different time zones across the team, so how do you all manage that? I'll go to you, Mads. When you're receiving the page, are you eating a sandwich or sleeping, or what?

Mads Hartmann:

Sometimes, I'm definitely eating a sandwich when I get a page. The way that we have set it up at Glitch is that we have working-hours coverage. That means I will take the page during my normal working hours, which is nice because it means that the Americans can sleep a little longer and hopefully don't get interrupted. But we also have another schedule, which is the primary. For that, I am on call at night, in the morning, and for all the hours of the day, for a week. That might wake me up. Everybody who's on the rotation has a primary like that where they will have to cover the night-time shift as well.

Keisha "TK" Dutes:

How does each person know their role in fixing the incident?

Mads Hartmann:

The way that we have it set up is that there is an initial person who will get the page, and it's that person's responsibility to escalate if necessary. That person might initially have a look at it and see if they can fix the problem themselves and go back to sleep or back to work. If they can't, then they can escalate to the secondary on-call. When that person joins, they'll try to fix it together. If that doesn't work, you start calling in the cavalry, just pinging everybody and trying to get in as many people as possible who might be able to help.

Cori Schlegel:

There are some cases where the working-hours on-call and the primary on-call and the secondary on-call -- some of whom may occasionally be the same person -- will all get on together at the same time, and they'll be able to isolate where the problem is, but none of them is a person who is equipped to fix it. And so, that's when we'll interrupt somebody, or get somebody else on a different team, often out of bed or out of whatever zone they happen to be in, to join in and try to figure out what the best way to fix the problem is.

Emmett Walsh:

When we decide who to call, if you're the person who picks up the page first, there are things held in tension: we don't want to interrupt people who don't necessarily need to be interrupted, because we want to keep shipping new features and bug fixes, but that's weighed against the very real concern that users are affected. If it's an outage that users are being impacted by, then that's a pretty high priority. Normally, we will prioritize getting people off their normal work to address those things, at least while we're trying to understand what's going on, and maybe all the way through to resolution.
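
Here is a minimal sketch of that escalation flow, purely as an illustration. It is a hypothetical model, not PagerDuty's API or Glitch's actual configuration: the first responder stays on, and each escalation pulls in more people.

```python
# Illustrative model of an escalation chain: primary, then secondary,
# then "the cavalry." Hypothetical sketch only, not a real on-call tool.
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    # Ordered levels; names are placeholders for illustration.
    levels: list = field(default_factory=lambda: [
        ["primary-oncall"],
        ["secondary-oncall"],
        ["infra-team", "domain-experts"],
    ])

    def responders_for(self, failed_attempts: int) -> list:
        """Who should be paged after N unsuccessful attempts to resolve."""
        level = min(failed_attempts, len(self.levels) - 1)
        # Everyone up to and including the current level stays involved.
        return [person for lvl in self.levels[: level + 1] for person in lvl]


policy = EscalationPolicy()
print(policy.responders_for(0))  # ['primary-oncall']
print(policy.responders_for(2))  # primary + secondary + the cavalry
```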

Keisha "TK" Dutes:

In this moment, you're fixing it and it's taking all day. How are you feeling?

Cori Schlegel:

How you feel really depends on what kind of incident it is and what phase of the incident you're in. There's the nobody-can-access-Glitch state and the nobody's-projects-are-running state, because those are two different things. There are at least three different categories, really: nobody can create an account or create a new project; nobody can edit their existing projects; and then the third type is that none of the projects everyone has already created are running. When you don't know what's going on and everybody's affected in any one of those categories, that is fear-inducing and stress-inducing. Panic-inducing maybe is the right phrase, at least for me.

Cori Schlegel:

Once you've sort of figured it out, or at least you have a little bit of a direction to go, then it can be a little easier. You might know that it's going to take you a while to fix, but at least you have a pathway and you can feel a little more confident. There's still a lot of pressure. We need to try to fix this as fast as possible, but it's not panicky. It's just "Let's see how we can do this quickly" rather than "I have no idea what to do." Those are the hard ones, especially at 3:00 AM. I have no idea. I don't even know where I am because it's 3:00 AM. I also have no idea what's going on.

Mads Hartmann:

To me, at least, you're feeling stressed throughout the whole thing, but it can also be quite fun. What you're doing during incident response is trying to figure out what's wrong with the system. That means you're coming up with hypotheses of what might be wrong, and then you're trying to prove or disprove them together with the team. That is extremely challenging and extremely fun for most of it, I think.

Mads Hartmann:

It's also where you learn a ton about your systems. You learn more about your systems when they're broken than when they're working just fine. I think that part is really fun and can be very rewarding. Finally, once you've gone through the whole incident, figured out what's actually wrong with the systems, and fixed it, then you're done. That can be extremely fulfilling. In that moment, I'm always feeling extremely proud of the team that worked on the incident and proud of myself for dealing with it without breaking down. I think that is what makes it worthwhile in the end. Being on the on-call rotation isn't just stress; it can also be very fulfilling and challenging and fun, and a great way to bond with the teammates who are in it with you, trying to fix the systems.

Keisha "TK" Dutes:

How does the user know that the incident has been resolved?

Antoinette Smith:

Well, what's really helpful is that more recently we have a support team. I mean, we've always had a support team, but we haven't had a support team that was actively involved in our incidents. That's been a process change over the past few months. We have a support team, and when an incident occurs, they'll update our status page so that people will know that Glitch is down or some part of Glitch is down. They also handle communicating with people on social media and with people in our forums. They are part of handling the incident, and their major task is making sure that the people outside of our team, the users who are interacting with Glitch, know what's going on. Once things are resolved, once our team is like, "Oh cool, this is done. Things are looking good," then they say, "Okay, this is resolved." They let everybody know.
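
For illustration, here is a hedged sketch of what posting those status updates might look like in code. The endpoint, token, and field names are invented for the example and are not Glitch's or any status-page provider's real API.

```python
# Hypothetical sketch of posting incident updates to a status page.
# The URL, token, and field names are placeholders for illustration only.
import requests

STATUS_API = "https://status.example.com/api/incidents"  # placeholder endpoint
API_TOKEN = "..."  # placeholder credential


def post_status_update(incident_id: str, status: str, message: str) -> None:
    """Publish an update such as 'investigating', 'identified', or 'resolved'."""
    resp = requests.post(
        f"{STATUS_API}/{incident_id}/updates",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"status": status, "body": message},
        timeout=10,
    )
    resp.raise_for_status()


# Only mark things resolved once the infrastructure team confirms it, e.g.:
# post_status_update("abc123", "resolved", "Projects are loading normally again.")
```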

Emmett Walsh:

Maybe I could talk more generally about the process things.

Emmett Walsh:

When we have an outage or some drop in functionality at Glitch, it's not over when we get things recovered again. The job isn't done. Afterwards, we have a follow-up process. We go through an incident retrospective. We first try as best we can to understand what happened. Sometimes we'll have a really good idea in the middle of an incident. Sometimes we'll never quite get to the bottom of it. Either way, we try to figure it out as best we can. Then we get together everyone involved, or people who are interested or might have domain expertise even if they weren't involved. We'll discuss what happened. We'll try and pick out things that went well in how we dealt with the incident, be that from how we noticed it, to how we dealt with it in the moment, to how we communicated about it.

Emmett Walsh:

Likewise, we'll think about things that didn't go so well. Then we also think about where we just got lucky, which, depending on how cynical you're feeling, might also count as a "didn't go so well." Then we'll try and come up with some actions. We try not to be hit by the same thing twice. We'll say, "Hey, in this case, the problem is the way we configure these servers. We'll change the configuration this way and make sure this kind of problem won't happen again." Then we try and prioritize when we can get that work done alongside our normal work of shipping new features or doing maintenance and reliability work.

Keisha "TK" Dutes:

Given all that each of you do on a daily basis to make sure Glitch isn't experiencing issues and is running smoothly, what would you like users to know?

Mads Hartmann:

Alright. One thing I would love users to know is that sometimes incidents happen because we make mistakes. We ship bad code, we don't follow a process correctly, and it's totally our fault. We try to fix that as quickly as we can. Unfortunately, in other cases, people are out to get us. They are trying to use Glitch in ways that are malicious, both for other sites and for Glitch itself. They might try to use Glitch to attack us or take down another site. In cases like that, it will look the same to you: your Glitch will be down, the project will not work. But for us, it's very different. We're doing our best to take care of those cases too. It's not always easy.

Emmett Walsh:

When there's an incident and Glitch is down, I would love users to know that we really care and we really hate it when Glitch is down. We work very hard and very quickly to get it back again.

Cori Schlegel:

I guess, as far as what I'd like Glitch users to know, I'd definitely second what Emmett says. We care. We're Glitch users. When Glitch is down, it affects us as well. We do dogfood our own product; most of us use Glitch for things that are not necessarily part of our day-to-day jobs. Across the board, Glitch is pretty stable. We don't have a lot of really big incidents. I'd like users to remember that it's unusual for Glitch, especially Glitch projects, to be down and unresponsive.

Antoinette Smith:

I guess what I would like users to know, during times when the site is down or some key functionality is down, is that Glitch is still a really small company. The team that handles these incidents is very small, and in addition to shipping features to make Glitch better, that team is also handling these incidents. Everyone is working, I feel, as hard as they possibly can. We want Glitch to be stable. That's the thing we want the most. We're getting there, but it will take us time.

[SEGMENT 2]

Jacky Alciné:

When the website goes down, our support staff is right there on the front line. Next up is Tasha Hewett, our senior support engineer.

Tasha Hewett:

Incident response is a team effort, but more specifically, it's like being part of an orchestra because everybody has their own part to play. And while we all know what each other's role is, we might not necessarily know how they physically play their part. So we have to sit back and let everyone play their own instrument. You have to leave alone the people who are fixing the actual issue, let them do their thing, while still paying careful attention to what's going on and then playing your part when it's your turn.

Tasha Hewett:

So basically what that turns into for me is a whole lot of lurking in Slack, just watching the wheels turn and the suggestions for how we're going to fix it come up, and letting the engineers who are responding do their thing. And then my part in the orchestra is like the gong player's. I don't play much, but when I do, it's very important and the audience is going to hear it. So I have to sit and wait and wait. And when is it my turn? Okay, here's my turn. I'm going to do a status update, or I'm going to post on Twitter. And it has to be at the right time and it has to be accurate. So it can be a lot of pressure, but it's an important role.

Tasha Hewett:

My name is Tasha Hewett and my role at Glitch is senior support engineer. That makes me responsible for making sure that users are getting the help they need quickly and consistently, and also for representing the Glitch community when features, policies, and processes are being worked on. During an incident in particular, my responsibilities are making sure the community is informed about what's going on, making sure that anyone who is experiencing a problem directly related to the incident is getting the help they need, and also letting the rest of the incident response team know about any user-facing issues that they haven't already identified.

Tasha Hewett:

When I hear about an incident, I immediately am going to vet the issue that's been reported. I'm going to see if I can replicate it myself. I'm going to see how many reports are already coming in from users. Our users are really great about detecting things because they're using our site all the time and letting us know if something isn't working.

Tasha Hewett:

It is stressful because, for me in particular, being the gong player, you don't want to mess up the whole orchestra. You don't want to hit that gong or send a message at the wrong time, and that can be stressful. But when incident response is going well, and when you have a really good playbook for how these things play out, you know what your part is. You have a good idea of what you need to do and when you need to do it. And when you do it well, it feels really good to be able to give that information to your users. Even if it's not what they're looking for, even if we're just saying, "Hey, look, this issue is still happening. We're still looking into it. It hasn't been resolved yet," just being able to let them know we're there, and that we're seeing what they're seeing, feels really good and makes all the pressure worthwhile.

Tasha Hewett:

Once an issue has been resolved, it's really important for support to verify that that is true. And then we need to communicate that to our users. It's really important to be honest and not over-promise anything. If we've made a mistake, for example if we said things were resolved too soon and they weren't, we need to own up to that. We need to be transparent, because that helps build trust between yourself and the users.

Tasha Hewett:

The really important thing is never gaslight your users. So if you're going to say things are resolved, make sure that you can say that or make sure you're prepared to say, "Whoops, that was a mistake." Because the last thing you want to do is pretend like everything's okay when your users know it isn't.

Tasha Hewett:

Users can be more disgruntled about issues depending on the severity or type of the incident. But I think it's more related to how the particular incident directly affects the user. If you have a user who is repeatedly telling you that something is wrong and we unintentionally decide it's user error or make them feel like it's user error, that can be really frustrating to them. So it's always important to pay attention to your users and listen to the things they're saying regardless. Always vet their issues. Always check with them to see if what they're seeing is something that we can see too and investigate it. When users realize that you care specifically about how they're being roadblocked by an incident, they will have more trust in you to fix it for them.

Tasha Hewett:

When an incident happens, we'll receive reports from our users via social media and also our forum and email, and the thing I want to communicate to them, and the most important thing I want them to know, is that we care. We feel for them, we know it stinks to be in the middle of working on something and suddenly have to stop for reasons that are out of your control, and for them to know that we are looking at it and we're not going to stop until it gets fixed. Sometimes that happens in 15 minutes. Sometimes it can take a day to really sort out, but we will not forget about it. We are going to make sure that we can get all our Glitch users back on the platform and working on the apps of their dreams, instead of wondering what the heck is going on.

[SEGMENT 3]

Jacky Alciné:

When incidents happen, they affect all parts of the company, including leadership. Now here's a word from the person at the top.

Anil Dash:

Hey everybody, I'm Anil, and I'm the CEO here at Glitch. And whenever we have any downtime, incidents, or problems, it's ultimately my fault. But we always try to say that we are a blameless culture. We try not to point fingers. And so people are very kind not to point fingers at me. But you know what? I think everybody at Glitch feels that way. If there's ever a problem, if there's ever an incident, if there's ever downtime, I think everybody takes it so seriously and personally, because we just all feel that responsibility and that obligation to keep everything working. We love the apps that people build. We love everything people make on Glitch.

Anil Dash:

And so any minute where you can't get to it or access it or do your work or make your project is stressful. We feel it. And it's odd, because everybody in tech has this issue. There's no website you've ever used, even Google, even Facebook, that hasn't gone down at different times. They've all fallen over some time. And so it's a universal experience, but everybody acts like it's totally outrageous, like it never happens, like it's really rare.

Anil Dash:

It is not rare, and it's wild because it can even become part of the culture of a site. Like, I go back in time and I look at seven or eight years ago: Twitter used to not be able to handle its traffic load all the time.

Anil Dash:

I mean, who could blame them? They had gone from nowhere to everywhere in, like, five minutes, and it's hard to keep things running. But it would be especially true when there was a big event, an awards show, or a news event, or something like that. You would get this error message that had a drawing of a blue cartoon whale on it, and they called it the fail whale; that was when Twitter fell over. It was sort of like this running joke that every time you would go to check Twitter during something important, you would get the fail whale.

Anil Dash:

I talked to people that helped fix it there and they felt really terrible about it, and the truth of it is, look, if people are, well, hooked enough on your service that they get mad when it falls down, that's probably a good sign. But you don't feel that way at the time, you feel like you're letting them down.

Anil Dash:

I think it's easy to forget that it's just humans. There's just people that are running around, probably frantically trying to keep this thing running, and that's a perspective that's easy to lose unless you really think about the human part of it.

Anil Dash:

We spent some time at Glitch trying to do that towards the end of 2019. It was a huge year for us. I think we went from a million apps on the platform to probably 5 million. I mean, it was just outrageous. We also saw the first big spam attacks, people trying to use the site to build lots of spam apps or things like that.

Anil Dash:

So it was hard. It's hard to scale anything. It's hard to grow anything. I mean, we had growing pains in every part of the company, and the technology infrastructure was no exception to that. Even though we had this incredibly brilliant tech team, you still have incidents and outages and downtime.

Anil Dash:

So I think by the end of the year, people were really feeling it. They were kind of feeling the pressure of it, and I wrote about it a little bit publicly, which was frankly kind of scary, because I wanted to make it more clear that we're just people trying our best, and that we're going to try and make everything great and bulletproof, but we're not always going to succeed.

Anil Dash:

I mean, our name, Glitch, means a problem. It means a bug. So hopefully that helps people feel more comfortable with the fact that sometimes things are going to break, and hopefully everybody who creates on Glitch sees, one, that there are people who really, really care and are really doing everything to try and make sure that everything you build is there, and works, and is fast and reliable and all that.

Anil Dash:

But also, people create on Glitch all the time and are like, "Oh, man, my code is bad, and my app broke, and ... " You know what? Us, too. It's no different. There's always somebody who's just on the other side that is trying to keep things running and make this thing work, and they really, really try, and that's inspiring.

Anil Dash:

The other part I look at is the community, and I'm like, don't worry if it's not perfect every time and it falls over sometimes, because, like I said, that happens to us, too.

Jacky Alciné:

That's it for this episode of Shift Shift Forward. Visit us online at glitch.com/ssf. Follow us on Twitter. Our Twitter handle is @glitch, and let us know what you think about this episode by using the hashtag #shiftshiftforward.

Jacky Alciné:

If you really liked the show, then subscribe to us on Apple Podcasts, Spotify, Google Podcasts, or wherever you find your favorite shows. Leave us a rating and review. Shift Shift Forward is produced by Maurice Cherry and Keisha "TK" Dutes, with editing by Brittani Brown, sound design by Keisha "TK" Dutes and mixing by C. Special thanks to the entire team at Glitch.

Jacky Alciné:

Here is what's coming up on Shift Shift Forward. We are watching a lot of movies and TV these days, and it's interesting to see how apps, websites, and other consumer tech are portrayed. Some directors set their entire production within desktop screens, mobile phones, and webcam footage, and others use onscreen texting to show conversations and nonverbal interactions between characters. And yet, none of them really feel or look like the tech we actually use in our day-to-day lives. Why is that? We'll take a closer look and find out.
