In two of my previous posts I’ve covered the SLO and SLI concepts, if you’re not familiar, I would start there:
But this post here, is really the cream of the crop when it comes to Site Reliability Engineering. The Error Budget, as it’s so called, is what takes the SLO and the SLI concepts from above, and gives us something actionable to work with. Let’s dive into the specifics.
The Error Budget concept, originally created by Google, bridges the gap between Product teams and Engineering teams. Where Product teams want new, bigger, and more features, Engineers are historically burdened with balancing deliverables, technical debt, and well… keeping production from crashing on a Friday afternoon.
…keeping production from crashing on a Friday afternoon.
But no more!
The Error Budget aims to strike a balance between what’s wanted, like new features, against what’s needed, like performance. It works by taking the concepts of SLO’s and SLI’s, ties them together and draws a tight line in the sand on when it’s time to shift gears and focus on technical debt.
This is accomplished by defining SLO’s around an applications performance. Should the application have 99.9% uptime? Should the application be able to handle a request in under .5 seconds? Should the application degrade gracefully in the event of a hardware failure? These are all various areas an SLO can aim for, and once we have the SLO, it’s a matter of implementing the right SLI, or measuring stick, to ensure we’re meeting our targets.
So what actually happens when we don’t meet our objectives of an SLO? What happens when our application isn’t available 99.9% of the time? What happens when that last feature push started a slow degradation of our applications ability to handle a request in under .5 seconds? The Error Budget happens.
The Error Budget is a formal agreement, written and signed off on, by all of the involved parties, and simply states what must happen in the event of a missed SLO. The Error Budget is not a punishment or tool to place blame. The Error Budget is a gauge to let us know, “Hey we’ve done too much and neglected some key areas of service we feel are important, let’s take some time to set that straight.”
The Error Budget is not a punishment or tool to place blame.
When an SLO is missed, the Error Budget covers what happens next. In most cases, a feature freeze is put into place for a certain number of days. This allows the engineering team to shift their focus and prioritize work that will directly improve the SLO target. If the application were to suffer an outage and miss it’s 99.9% uptime target, the engineering team can now focus on, not only how to fix the issue, but how to prevent it from occurring again in the future.
So why is it called an “Error Budget” anyways? It’s named this way because you do in fact have a “Budget” that allows teams to identify when it’s acceptable to take risk and when it’s time to pivot. One thing you do not want to do, is approach a team that has experienced and well-trained members, and try to implement concepts and red-tape that slow them down. If you have a team that can develop features and deliver on performance at the same time, why would you want to disrupt that?
With an Error Budget, you don’t have to worry about that because the budget aspect gives teams the room they need to work quickly and efficiently, and only when an SLO is missed, does the team need to shift gears and change direction. Therefore the actual budget is calculated as a small percentage of hitting a target at 100%. Let’s look at an example:
A product organization decides that they want to ensure their application accepts incoming requests, without errors, 99.9% of the time, over a 30 day period for all of their 1,000 customers. On average, let’s say customers submit roughly 100,000 requests over that same 30 day period. Our budget then becomes 99.9% of 100,000, which allows our teams 100 failed requests before our SLO is violated and our Error Budget is enacted.
If you have a team that can develop features and deliver on performance at the same time, why would you want to disrupt that?
By operating in this fashion, our system can handle a small hiccup without us having to freeze all features and enact the Error Budget. We can also allow our teams to take small, acceptable risks, knowing that if that mid-day release for that key customer goes wrong, and we miss our SLO, it’s time to rethink business hour deployments, BUT on the flip side, if that release only interrupts the system for a moment, and only 10 failed requests are counted, the team is still within their SLO and should carry on as they were.
The Error Budget is more than just a way to tell teams when to pay down technical debt. It provides us the foundation for measuring and accepting risk, helps us to bridge the gap between features and reliability, and finally gets product teams and engineering teams speaking the same language.
I encourage you to take a look at Google’s example Error Budget and hit me with any questions you may have!