People often ask me what I do as a Site Reliability Engineer (SRE), so I decided to make a blog post out of it.
What is a Site Reliability Engineer?
There are many different ways you can define an SRE team, so let me start with how we have chosen to define the SRE team at Kenna.
The SRE team is a group of developers that are focused on using software to optimize performance and ensure stability and reliability across all of our systems.
When talking with our lead operations engineer, we decided that an SRE is a developer+. The plus stands for some bit of extra knowledge beyond that of just writing code. For me, the plus is my comprehensive understanding of how Elasticsearch works. For others, their plus might be the ability to work seamlessly with a framework like Ansible, or maybe they have a deep understanding of containers. The plus can be almost anything tech-related that would help an SRE with their job.
Another trait that I feel characterizes a good SRE is the ability to look at and understand how an entire system works. It is easy to understand small pieces of a system, but the ability to step back and conceptually understand how all the pieces fit together is key to being an SRE. Having a high-level understanding allows us to figure out a system's weakest points and improve on them to ensure reliability across the entire system.
Given how we define SRE at Kenna, I think it is crucial for those entering the industry to start out in a full-stack or backend developer role before moving into an SRE position. I believe taking a year or two to hone your developer chops and get exposed to many types of software tools will allow you to be much more successful if, eventually, you decide you want to become an SRE.
How I Became an SRE
I started out as a full-stack developer. I was doing it all, from the frontend JavaScript to the behind-the-scenes background workers. I continued to do this when I joined Kenna in 2015 as a general Software Engineer. Shortly after I joined Kenna, the senior developer who primarily worked with our Elasticsearch cluster left. I saw his departure as an opportunity to learn something new and possibly own a piece of our infrastructure. I decided to learn everything I could about Elasticsearch. I went to Elasticsearch trainings and I read a lot of the Elasticsearch docs. Slowly, I took on more and more Elasticsearch-focused stories in my day-to-day work.
As time went on, I began to shift my focus to the backend and to Elasticsearch. As my focus shifted, Kenna continued to grow. By the fall of 2017, we had gotten so big that we decided to split up the core dev team into multiple teams. At this point, we had hired an SRE from another company. After he was hired, I was asked if I would like to join him to form the first Kenna SRE team. Since I was already focused on the reliability and performance of one of our core datastores, Elasticsearch, it seemed like the perfect fit. I said yes!
What Site Reliability Engineers Do
When the SRE team was first created, there was no shortage of work to keep us busy. Kenna's platform was growing like crazy, and we were having some real scaling issues. Our team's main focus at the beginning was optimizing our code so it could handle all of the new data we were getting. We spent a lot of days in monitoring tools like Datadog looking for slow queries or hotspots in our code that needed to be optimized. If you want to learn more about exactly what we did during that first year, check out my Cache is King talk. In it, I break down 5 very big optimizations that we made to our codebase that led to some big performance improvements. There are some pretty sexy graphs in there 😉
In addition to making performance improvements, during that first year our team also:
- Overhauled our entire monitoring framework (Post coming soon on this!)
- Added extensive logging to our application and improved our log storage to make it more accessible and searchable
- Improved our admin site, which was essential as more support engineers joined
- Made access control improvements. One example of this was setting up a read-only console for engineers to use when interacting with production
- Updated our continuous integration (CI) workflow to handle multiple virtual private cloud environments
Looking Ahead: Kenna's SRE Roadmap
The SRE team is now in its second year at Kenna and we have accomplished a lot. However, as Kenna continues to grow, there is still plenty to do. Our roadmap currently has the following projects on it:
- Upgrading Elasticsearch to 6.x. Our last upgrade was rough, so we have some extensive testing plans associated with this upgrade.
- Defining service level objectives. Our customers are happy now, but what does that mean in terms of metrics? How fast do we need to load searches to keep customers happy? How fast does data processing need to happen? Our goal is to answer questions like these.
- Wrangling all our new virtual private cloud (VPC) environments. A lot of our large clients want their own virtual private cloud for running Kenna. This means we have a lot of different environments. As you can imagine, working with all of them and keeping them in sync is a challenge. As our VPC numbers increase this year, my team is hoping to make working across all VPCs as seamless as possible.
- Implementing a load testing framework. We are constantly asked, "How much can the platform handle?" Currently, we respond 🤷. It would be nice to actually know what the limits are.
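To make the service level objective item a bit more concrete, here is a minimal sketch of what checking one such objective could look like. This is not our actual tooling: the function names, the sample latencies, and the 1000 ms threshold are all made up for illustration, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms, threshold_ms, pct=95):
    """Hypothetical objective: pct% of requests finish within threshold_ms."""
    return percentile(latencies_ms, pct) <= threshold_ms


# Made-up search latencies in milliseconds, purely for illustration.
search_latencies_ms = [120, 180, 200, 250, 300, 310, 400, 450, 900, 2500]

for pct in (90, 95):
    ok = meets_slo(search_latencies_ms, threshold_ms=1000, pct=pct)
    print(f"p{pct} within 1000 ms: {ok}")
```

The interesting part is not the arithmetic but picking the percentile and threshold: in this made-up data one slow request is enough to pass a 90th percentile objective and fail a 95th percentile one.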
As the SRE team and Kenna continue to grow, I am sure the roles and responsibilities of the team will evolve as well. I cannot wait to see where we are after another year 😃
Why I Love Being An SRE
I LOVE my job! I mean, I really love it. I enjoyed being a full-stack developer and getting to deliver major projects for clients, but I like being an SRE far more. Before, the majority of my day was spent building features that were requested by some of our clients. Now, the work I do affects the entire platform and ALL of our clients. I have the ability to tweak some code in a background job and speed up data processing for all of our clients, not just some. Having the ability to effect change and improvements at that scale is incredible.
I also take what I build very seriously. Thanks to a couple of years spent in a customer service rotation, I am very thoughtful about and sensitive to our clients' experience. Being able to focus on making the platform more stable and reliable for our clients is a dream come true.
If you enjoy working on the backend, and want to get closer to your system's performance, reliability, and scalability, then an SRE role might just be perfect for you!
Further Reading
If you are interested in learning more about what it means to be an SRE, I highly recommend checking out Google's SRE Book!
NOTE: The definition of a Site Reliability Engineer is likely going to be slightly different everywhere. If you are applying for a Site Reliability Engineering job, make sure you are asking the right questions during the interview to ensure the job responsibilities fit what you want.