Tag Archives: Site Reliability Engineering

The Error Budget

In two of my previous posts I’ve covered the SLO and SLI concepts, if you’re not familiar, I would start there:

What is an SLO?

What is an SLI?

But this post here, is really the cream of the crop when it comes to Site Reliability Engineering. The Error Budget, as it’s so called, is what takes the SLO and the SLI concepts from above, and gives us something actionable to work with. Let’s dive into the specifics.



The Error Budget concept, originally created by Google, bridges the gap between Product teams and Engineering teams. Where Product teams want new, bigger, and more features, Engineers are historically burdened with balancing deliverables, technical debt, and well… keeping production from crashing on a Friday afternoon.

…keeping production from crashing on a Friday afternoon.

But no more!

The Error Budget aims to strike a balance between what’s wanted, like new features, against what’s needed, like performance. It works by taking the concepts of SLO’s and SLI’s, ties them together and draws a tight line in the sand on when it’s time to shift gears and focus on technical debt.

Advertisements

This is accomplished by defining SLO’s around an applications performance. Should the application have 99.9% uptime? Should the application be able to handle a request in under .5 seconds? Should the application degrade gracefully in the event of a hardware failure? These are all various areas an SLO can aim for, and once we have the SLO, it’s a matter of implementing the right SLI, or measuring stick, to ensure we’re meeting our targets.

So what actually happens when we don’t meet our objectives of an SLO? What happens when our application isn’t available 99.9% of the time? What happens when that last feature push started a slow degradation of our applications ability to handle a request in under .5 seconds? The Error Budget happens.

The Error Budget is a formal agreement, written and signed off on, by all of the involved parties, and simply states what must happen in the event of a missed SLO. The Error Budget is not a punishment or tool to place blame. The Error Budget is a gauge to let us know, “Hey we’ve done too much and neglected some key areas of service we feel are important, let’s take some time to set that straight.”

The Error Budget is not a punishment or tool to place blame.

When an SLO is missed, the Error Budget covers what happens next. In most cases, a feature freeze is put into place for a certain number of days. This allows the engineering team to shift their focus and prioritize work that will directly improve the SLO target. If the application were to suffer an outage and miss it’s 99.9% uptime target, the engineering team can now focus on, not only how to fix the issue, but how to prevent it from occurring again in the future.

Advertisements

So why is it called an “Error Budget” anyways? It’s named this way because you do in fact have a “Budget” that allows teams to identify when it’s acceptable to take risk and when it’s time to pivot. One thing you do not want to do, is approach a team that has experienced and well-trained members, and try to implement concepts and red-tape that slow them down. If you have a team that can develop features and deliver on performance at the same time, why would you want to disrupt that?

With an Error Budget, you don’t have to worry about that because the budget aspect gives teams the room they need to work quickly and efficiently, and only when an SLO is missed, does the team need to shift gears and change direction. Therefore the actual budget is calculated as a small percentage of hitting a target at 100%. Let’s look at an example:

A product organization decides that they want to ensure their application accepts incoming requests, without errors, 99.9% of the time, over a 30 day period for all of their 1,000 customers. On average, let’s say customers submit roughly 100,000 requests over that same 30 day period. Our budget then becomes 99.9% of 100,000, which allows our teams 100 failed requests before our SLO is violated and our Error Budget is enacted.

Advertisements

If you have a team that can develop features and deliver on performance at the same time, why would you want to disrupt that?

By operating in this fashion, our system can handle a small hiccup without us having to freeze all features and enact the Error Budget. We can also allow our teams to take small, acceptable risks, knowing that if that mid-day release for that key customer goes wrong, and we miss our SLO, it’s time to rethink business hour deployments, BUT on the flip side, if that release only interrupts the system for a moment, and only 10 failed requests are counted, the team is still within their SLO and should carry on as they were.

The Error Budget is more than just a way to tell teams when to pay down technical debt. It provides us the foundation for measuring and accepting risk, helps us to bridge the gap between features and reliability, and finally gets product teams and engineering teams speaking the same language.

I encourage you to take a look at Google’s example Error Budget and hit me with any questions you may have!


What is Site Reliability Engineering?

I was recently asked to move into the role of a Site Reliability Engineer. The only problem was… I didn’t even know what a Site Reliability Engineer was. Over the last few weeks I’ve started to dive in on what Site Reliability Engineering really is, SRE for short, and how to fulfill this role. I’d like to share some of the insights that I’ve gained about the role, what it means, and discuss some of the topics around SRE’s.

The role of an SRE is fairly straight forward on the surface but rather complex and intricate on how it is fulfilled. The job of an SRE is to make sure a website or webservice is reliable. The question is, what does reliability mean?

photo cred: Romson Preechawit

Does it just mean the website is up and running?

Does it mean there are no bugs?

Does it mean it performs well and there’s no latency?

And once you’ve defined what SRE is, how do you accomplish it?

Lots of questions and lots of answers.

SRE’s are sometimes confused with DevOps and some of their functions are very similar and overlap. Both roles favor automation and building internal systems and tools but SRE’s really differ in what they are trying to accomplish.

In DevOps, we want to improve the internal efficiency of our daily development operations, automating builds and unit tests, creating deployment pipelines, and be really focused on the CI/CD pipeline.

Advertisements

For SRE’s, we want to improve the post-CI/CD side of things. Meaning, once things are deployed, we want to make sure they are running smoothly and when they’re not we want to make sure we can get back to a smooth state as quickly as possible. For SRE’s it’s all about monitoring and alerting. Remember the end goal for an SRE is reliability.

Reliability… there’s that word again… so what does it mean for an SRE? Well it actually means a lot of things that come right down to what level of service you provide for your customers. Reliability is more than just, “is the application up and running?” It actually pertains to a wide variety of categories that all come back to how end-users view the quality of your website or webservice.

Here are a few of the categories that fall under the SRE belt:

  • Latency & Response Times
  • Data Quality & Correctness
  • Uptime & Availability
  • Volume & Load
  • Errors & Bugs
photo cred: Chuttersnap

All of these fall under the umbrella of reliability. After all, what good is a website if it slows to a crawl during peak hours? What value does it have if the data is incorrect or invalid? How useful is a website or webservice if it’s filled with bugs after every release?

These are all areas of responsibility for an SRE. Thankfully there are some great tools and best practices to make the life of an SRE easier. Over the next few weeks we’ll dive into some of those practices and tools and discuss how best to implement them. Even if you aren’t an SRE there will be plenty of valuable information and content for you to add to your Software Engineering tool box!