If you happen to know what SRE is, you might be wondering how it relates to DevOps. Well, let's not beat around the bush. There's no "versus"—there's only a different approach for how to deliver better software faster. In this post, I'll break down each approach and show where DevOps and SRE differ. You'll notice that SRE has an opinionated approach for how to run production systems, whereas DevOps focuses more broadly on people, process, and tools—in that order of importance.
Let's start by setting the foundation for what DevOps and SRE are.
What's DevOps?
I'm not going to spend too much time on definitions, but I'll use them throughout the post to highlight the differences between DevOps and SRE. Of the many definitions of DevOps, I prefer this one from Gene Kim:
DevOps is [the] set of cultural norms and technology practices that [enables] the fast flow of planned work from, among others, development, through tests into operations while preserving world-class reliability, operation and security. DevOps isn't about what you do, but what your outcomes are.
So DevOps is mainly a cultural shift inside the organization, not a group, person, or position. What's essential in DevOps are the outcomes at the finish line. The "what" and "how" of it all matter less. That's why the CALMS model (culture, automation, lean, measurement, and sharing) outlines a set of principles that every DevOps initiative should consider using, and it starts with culture.
Now, let's continue with SRE.
What's SRE?
SRE stands for "site reliability engineering," a term coined by Google. A few years ago, Google engineers wrote a book to explain how they run and operate their systems in production. Then, they wrote a second book on practical ways to implement SRE. Both books are now available for free.
Google's definition of SRE is quite simple:
SRE is what happens when you ask a software engineer to design an operations team.
Therefore, SREs are operations folks with strong development backgrounds, and they apply engineering practices to solve common problems when running systems in production. SREs are responsible for making systems available, resilient, efficient, and compliant with the organization's policies (like change management).
Now, let's get into the details of where SRE differs from DevOps.
Removing Silos in the Organization
The DevOps movement was initiated to eliminate the silo between developers and operators. Developers want to deploy the features they just coded as soon as possible. Operations folks would rather slow deployments down to keep systems available. How does DevOps solve this problem? Besides the CALMS framework, there are also the principles known as the three ways of DevOps, which aim to break down the silos between developers and operators. The DevOps Handbook explains these three ways best and gives you a few ideas for applying them.
SRE also removes silos. The difference is that instead of only finding ways to optimize flow between teams, SREs get their hands dirty. Being where the action is gives SREs better context for supporting systems in production. SREs integrate into the team as consultants, helping developers create more reliable systems. What's most important here is that SREs share ownership of running systems in production with developers.
Measuring a Successful Implementation
DevOps metrics focus mainly on how quickly and frequently deployments happen and how often they go wrong. According to the 2017 report from Puppet and DORA (DevOps Research and Assessment), the key metrics in DevOps are deployment frequency, the lead time from code commit to release, the share of deployments that fail, and the time it takes to recover from a failure. Feedback loops built on these metrics help DevOps continuously improve the quality of systems, and they open the door to experimentation.
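To make those four metrics concrete, here's a minimal sketch of how you might compute them from a list of deployment records. The `Deployment` record and its field names are my own assumptions for illustration, not something the report prescribes, and I'm assuming a failure starts at the moment of deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def _mean_hours(deltas: list[timedelta]) -> float:
    """Average a list of timedeltas in hours; 0.0 when the list is empty."""
    if not deltas:
        return 0.0
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarize the four delivery metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Assumption: recovery time is measured from the failing deployment.
    recoveries = [
        d.restored_at - d.deployed_at
        for d in deployments
        if d.failed and d.restored_at is not None
    ]
    failures = sum(1 for d in deployments if d.failed)
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": _mean_hours(lead_times),
        "change_failure_rate": failures / len(deployments),
        "mean_time_to_restore_hours": _mean_hours(recoveries),
    }
```

In practice you'd pull these records from your CI/CD tool and incident tracker rather than build them by hand, and you might prefer medians over means so that one slow release doesn't skew the picture.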
SRE also depends on metrics to improve systems, but from the reliability perspective. The foundations for SREs are the service-level indicator (SLI), which measures some aspect of the service such as availability or latency; the service-level objective (SLO), which sets the target for that indicator; and the service-level agreement (SLA), which is the commitment made to customers. Together, they show how reliable the system is. SREs use them to decide whether a change gets released to production or not. In SRE, speed and quality are products of reliable systems, and SREs focus on those types of metrics.
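Here's a small sketch of how the pieces fit together, assuming an availability SLI based on request counts. The target values and function names are made up for illustration; your own indicators and thresholds will differ.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_target(sli: float, target: float) -> bool:
    """Check a measured SLI against a target (an SLO or an SLA threshold)."""
    return sli >= target


SLO_TARGET = 0.999  # internal objective (assumed value)
SLA_TARGET = 0.995  # looser promise made to customers (assumed value)

sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.4%}")
print(f"meets SLO: {meets_target(sli, SLO_TARGET)}")
print(f"meets SLA: {meets_target(sli, SLA_TARGET)}")
```

Keeping the SLO stricter than the SLA is a common choice: the team notices it's missing its own objective before it breaks a promise to customers.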
Pursuing CI/CD Practices
DevOps is a huge advocate for automation; I'd say that after culture, automation is the second most crucial aspect. In DevOps, the message is to automate as much as possible and make releases boring. Many activities happen after a developer commits code, and most of them can (and should) be automated. For example, you can automate the application's build so that it runs on a dedicated machine once everybody's work has been integrated. Or you can automate the process of deploying application changes, which is (or should be) the same every time. DevOps pursues CI/CD to increase the velocity and quality of the systems.
SRE pursues CI/CD for a different reason: to reduce the cost of failure. For SREs, the tedious and repetitive tasks that are common in operations (like deployments, application restarts, or backups) aren't appealing. For that reason, SREs reserve a certain amount of time for reducing this operational work, or toil; Google, for example, caps operational work at 50% of an SRE's time. SRE uses the same practices as DevOps, such as canary releases, blue/green deployments, and infrastructure as code. But SRE does so to free up time for more appealing work, like evolving the architecture or implementing new technologies.
Accepting Failures
DevOps fosters a blameless culture because every time something goes wrong, it's a learning opportunity. Accepting that failures will continue to happen is the first step. Instead of putting too much effort into making systems completely failure-proof, a DevOps culture finds ways to tolerate faults. Netflix is the most prominent advocate of this culture, with its Simian Army. Netflix continuously brings parts of its system down so that a real fault is just business as usual. If a set of servers goes down in one zone, automation recreates the servers in a different zone. And Netflix practices this in its production environment all the time.
SREs hold a blameless postmortem every time a failure happens in the system. The idea is to identify what caused the fault, then find ways to keep the same failure from happening again in the same way. SRE also accepts failures, but it puts a number on them, called the error budget. After defining the SLI, SLO, and SLA, SREs determine how much failure is acceptable (the budget), because being 100% available is expensive and, in some cases, not possible.
Therefore, SREs determine how long it would be acceptable for the system to be down. For example, a 99.9% uptime target means the site can be down for about 43 minutes in a 30-day month. If the system has been down for more than that budget in a given month, releases are paused until the next month.
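The arithmetic behind that number is simple: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2 minutes. Here's a minimal sketch of the budget and a release gate built on it. The function names and the 30-day default are my assumptions; real setups usually track the budget over a rolling window in their monitoring tool.

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for an availability SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def can_release(downtime_minutes: float, slo_target: float, period_days: int = 30) -> bool:
    """Release gate: allow releases only while some error budget remains."""
    return downtime_minutes < error_budget_minutes(slo_target, period_days)


budget = error_budget_minutes(0.999)  # about 43.2 minutes for a 30-day month
print(f"budget: {budget:.1f} minutes")
print(f"release after 20 minutes down: {can_release(20, 0.999)}")
print(f"release after 50 minutes down: {can_release(50, 0.999)}")
```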
DevOps and SRE Don't Compete With Each Other
I very much like the way Google relates SRE to DevOps by using the following phrase:
class SRE implements interface DevOps
SRE and DevOps don't compete with each other. SRE is the name Google gave to the way it was already doing DevOps before the term DevOps was coined. There are slight differences, but as happens when a class implements an interface, the implementation might choose to override or extend the base functionality.
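If you'd like to see the analogy as real code, here's a loose sketch. Python has no `implements` keyword, so I'm using an abstract base class as the closest equivalent, and the method names are ones I made up to mirror the sections of this post.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The "interface": says what must be addressed, not how."""

    @abstractmethod
    def measure(self) -> list[str]:
        """Which metrics guide improvement?"""

    @abstractmethod
    def handle_failure(self) -> str:
        """How is a failure dealt with?"""


class SRE(DevOps):
    """One opinionated implementation of the DevOps interface."""

    def measure(self) -> list[str]:
        return ["SLI", "SLO", "SLA"]

    def handle_failure(self) -> str:
        return "blameless postmortem; pause releases once the error budget is spent"

    def reduce_toil(self) -> str:
        # An extension that the interface itself doesn't require.
        return "automate repetitive operational work"


sre = SRE()
print(isinstance(sre, DevOps))
print(sre.measure())
```

Trying to instantiate `DevOps` directly raises a `TypeError`, which fits the analogy nicely: you can't "do DevOps" in the abstract; you need some concrete set of practices.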
I'd say the main difference is that DevOps is a culture that broadly defines a way of doing things. Maybe that's why there are so many definitions of DevOps and so many case studies from companies of different sizes and industries. By contrast, SRE has an opinionated way of doing things, because it comes from a single source: Google's own explanation of how they run systems in production.
My two cents? Study each movement, and pick the practices and principles that work for your organization today. Tomorrow? Well, tomorrow things will change, and you might need to adopt new principles from DevOps and SRE.