I was recently asked to move into the role of a Site Reliability Engineer. The only problem was… I didn’t even know what a Site Reliability Engineer was. Over the last few weeks I’ve started to dive in on what Site Reliability Engineering really is, SRE for short, and how to fulfill this role. I’d like to share some of the insights that I’ve gained about the role, what it means, and discuss some of the topics around SRE’s.
The role of an SRE is fairly straight forward on the surface but rather complex and intricate on how it is fulfilled. The job of an SRE is to make sure a website or webservice is reliable. The question is, what does reliability mean?
Does it just mean the website is up and running?
Does it mean there are no bugs?
Does it mean it performs well and there’s no latency?
And once you’ve defined what SRE is, how do you accomplish it?
Lots of questions and lots of answers.
SRE’s are sometimes confused with DevOps and some of their functions are very similar and overlap. Both roles favor automation and building internal systems and tools but SRE’s really differ in what they are trying to accomplish.
In DevOps, we want to improve the internal efficiency of our daily development operations, automating builds and unit tests, creating deployment pipelines, and be really focused on the CI/CD pipeline.
For SRE’s, we want to improve the post-CI/CD side of things. Meaning, once things are deployed, we want to make sure they are running smoothly and when they’re not we want to make sure we can get back to a smooth state as quickly as possible. For SRE’s it’s all about monitoring and alerting. Remember the end goal for an SRE is reliability.
Reliability… there’s that word again… so what does it mean for an SRE? Well it actually means a lot of things that come right down to what level of service you provide for your customers. Reliability is more than just, “is the application up and running?” It actually pertains to a wide variety of categories that all come back to how end-users view the quality of your website or webservice.
Here are a few of the categories that fall under the SRE belt:
- Latency & Response Times
- Data Quality & Correctness
- Uptime & Availability
- Volume & Load
- Errors & Bugs
All of these fall under the umbrella of reliability. After all, what good is a website if it slows to a crawl during peak hours? What value does it have if the data is incorrect or invalid? How useful is a website or webservice if it’s filled with bugs after every release?
These are all areas of responsibility for an SRE. Thankfully there are some great tools and best practices to make the life of an SRE easier. Over the next few weeks we’ll dive into some of those practices and tools and discuss how best to implement them. Even if you aren’t an SRE there will be plenty of valuable information and content for you to add to your Software Engineering tool box!