Author: Nobl9
What is an error budget, and how does it relate to site reliability? This is a common question we hear when discussing service level objectives, service level indicators, and service level agreements. Each has its role, and they are all interrelated. In this post, we’ll explain what each of these terms means and how they contribute to helping enterprises improve reliability so that customers are happy and keep coming back for more.
What is an error budget?
An error budget can be thought of as a conceptual model for understanding acceptable risk in your services. According to the book Implementing Service Level Objectives, “an error budget is a way of measuring how your service level indicator (SLI) has performed against your service level objective (SLO) over a period of time. It defines how unreliable your service is permitted to be within that period and serves as a signal of when you need to take corrective action.” It is considered a budget because your organization can allocate (or spend) it and track its current balance.
Error budgets are a key component of site reliability engineering, which originated at Google. For example, engineering teams use error budgets to set expectations for how much uptime a service should have in a given time period. Product management defines SLOs for that uptime, and observability tools monitor the actual uptime of services. This allows for a more objective view of the service’s performance.
You can think of an error budget as an early warning signal to tell you when it’s time to shift your work efforts. When your error budget is getting close to being spent, then you can divert resources from innovating to addressing technical debt so that the errors are resolved before customers begin to notice and complain.
Who are error budget stakeholders?
The primary stakeholders involved in creating and using error budgets include site reliability engineers (SREs), DevOps teams, product development/product engineering teams, and product management teams. The titles and department names may be different in your organization, but the primary responsibilities across these groups include:
- Writing code for new features and fixing errors (engineers/developers)
- Managing applications in a production-ready state (SREs)
- Deploying and monitoring applications in production (SREs / DevOps)
- Managing release schedules and prioritizing work to meet customer requirements (product management)
How do you use an error budget?
Site reliability engineers (SREs), product developers, and product managers each have slightly different agendas. At the most basic level, SREs want to ensure reliability, developers want to keep producing new features, and product management needs to balance these two priorities. As a result, these teams negotiate resource allocation when it comes to adding new product or service capabilities and fixing bugs and reliability issues. The risk of allowing reliability, availability, and performance failures needs to be discussed in order to understand the impact on customer satisfaction and engineering productivity. The error budget helps solve this issue by providing objective measurements on achieving reliability targets. An error budget allows these three stakeholders to agree on resource allocation without compromising each team’s goals.
Error budgets can be used to balance efforts on resolving technical debt and releasing new features. If the error budget is available, the team can move forward with delivering new features. If the error budget is exhausted, teams might shift focus to product reliability. In this way, enterprises can balance the need for innovation and maintaining a high level of reliability.
How can SRE teams use error budgets to help maintain performance levels?
SREs face problems that deal with availability in three ways:
- Defining availability in terms of how users expect a service to operate
- Finding an appropriate level of availability for the service
- Creating a plan to deal with failures of availability
An SRE must also consider how changes will affect other services and create relationships between them. This way, they can work together as part of an overall system rather than being disparate parts working independently from each other.
SREs can use error budgets to track the maximum number of failed events or maximum downtime a service can endure before customers complain. Error budgets can also track minimum throughput levels and the correctness or freshness of data. Error budget information provides reliable data to the development and SRE teams so they can confidently set new release velocity.
The error budget is set based on the SLO for that service. Since no service, website, or application can be perfect and running 100% of the time, the team sets an SLO that accepts some amount of downtime or reliability incidents. When error budget is available, teams can confidently focus on innovating and adding new features. But when the error budget approaches zero or goes negative, the team can shift focus to quality assurance, stability, and performance improvements before customers notice. Error budgets provide a metric for making these decisions.
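As a rough illustration of that decision, the sketch below derives the remaining budget from event counts and suggests where to focus. The function names, the fractional `slo_target` parameter, and the sample counts are assumptions made for this example, not part of any particular SLO platform.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent (negative once overspent).

    slo_target is a fraction such as 0.99 (hypothetical parameter naming).
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1); a 100% target leaves no budget")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # The budget is the number of bad events the SLO tolerates in this window.
    allowed_bad_events = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad_events


def suggested_focus(remaining: float) -> str:
    """Budget left: keep shipping features. Budget spent: work on reliability."""
    return "new features" if remaining > 0 else "reliability"


# Made-up numbers: a 99% SLO, 100,000 requests, 250 of them failed.
remaining = error_budget_remaining(0.99, 100_000, 250)
print(f"{remaining:.0%} of the budget left, focus on {suggested_focus(remaining)}")
```

A real policy would likely add intermediate thresholds rather than a single cut-off at zero.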
What are the different types of errors?
There are two types of errors: avoidable and unavoidable. Both will factor into your error budget. Avoidable errors are those that you can eliminate through process improvement or better code management. Unavoidable errors, on the other hand, are inherent in the system and can’t be prevented without sacrificing essential features or functions.
System-wide events are rare, and their impact is minimal on an individual team's error budget. As such, you should not adjust the error budget for these types of events. If the impact of a system-wide event warrants an adjustment, it can be escalated to a monthly report.
It's important to remember that every decision you make comes with a trade-off: maintain more stability, or keep your customers happy with new features. To push the envelope and innovate, you need to find a way to accept a certain level of unavoidable errors. If you care more about reliability than innovation, your system will not succeed.
The team should be involved in defining the SLO, with input from multiple stakeholders. This will ensure that everyone understands what's expected and how they can work together to achieve it.
SLOs are decision-making drivers that help teams find the right balance between velocity and reliability. They also provide transparency into service health so everyone can understand whether they're on track. As customers come to expect more resilient digital services, it's more important than ever to set measurable, concrete targets that allow for an appropriate rate of new feature delivery.
Error budget policies
What makes an error budget real? Error budgets only have value if the enterprise believes they have value. In order to take the error budget seriously, there needs to be a consequence to exceeding the error budget that has a real effect on the allocation of resources within the organization. Since you are trying to create a cultural shift to a team that correctly balances reliability and features, you need to notice when you are leaning too far one way or the other and correct back to center.
An error budget policy simply states what a team must do when they deplete their error budget. If an error budget is exceeded, there are policies to limit further customer impact, such as halting new feature releases for some time. Error budgets help developers take action towards ensuring reliability for customers. The general remedy is easy: if you exceed your error budget, focus on improving reliability. This may be enough of a policy to get most of the benefit, although teams have created more sophisticated policies with various thresholds and rules of escalation.
This policy needs to be communicated to all members of the organization, so there is a common understanding of what is expected. A best practice for this communication is to create a written process document that everyone in the company can refer to with steps on what to do if an error budget is exceeded. This could result in a re-evaluation of your service, adding a focus on reliability to your OKR or KPI, or another remedy to ensure reliable outcomes for customers.
Errors are the result of specific circumstances that occur and affect customers. By understanding these circumstances, teams can create alerts and policies related to their error budget. For example, if an organization's SLO dictates that no more than 1% of requests should return an error code, then the team would create a policy stating that once this threshold is reached, action must be taken (such as a system rollback).
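The 1% example above could be checked with something like the following sketch. The function names and the returned action strings are hypothetical, and treating "reached" as "greater than or equal to" is an assumption; a team could just as reasonably act only once the threshold is exceeded.

```python
from fractions import Fraction

# Assumed threshold from the example: act once 1% of requests return an error.
MAX_ERROR_RATIO = Fraction(1, 100)


def threshold_reached(error_count: int, request_count: int,
                      max_error_ratio: Fraction = MAX_ERROR_RATIO) -> bool:
    """True once the observed error ratio reaches the policy threshold."""
    if request_count == 0:
        return False  # no traffic, nothing to judge
    # Fractions keep the comparison exact at the boundary (e.g. 10 of 1000).
    return Fraction(error_count, request_count) >= max_error_ratio


def policy_action(error_count: int, request_count: int) -> str:
    """Map the check to an action; the action names are placeholders."""
    return "rollback" if threshold_reached(error_count, request_count) else "continue"


print(policy_action(9, 1000))
print(policy_action(10, 1000))
```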
Error budgets and SLOs
Setting objectives is an essential part of any organization. The goal of an SLO is to find the right balance between new feature velocity and reliability. An SLO is an agreed-upon target for how reliable a service should be over time. You define your SLOs based on your SLAs (if you have or offer one) or on other risks to your business, and your error budget is how you track whether you are violating your SLO. SLIs come from your many observability tools and, depending on how you set up your SLOs, may need to be aggregated to provide a holistic view so that you can calculate compliance. In essence, SLIs inform SLOs.
A time frame can be set on an SLO, which helps keep it relevant in terms of how long customers tend to remember failure. The error budget is the maximum amount of a given type of error that the SLO allows within that time frame. Common examples of these metrics include the number of errors or incidents, latency, uptime, and so on – whatever is important for your customer expectations and to meet your SLAs.
SLOs are more granular than SLAs. They help DevOps, SRE, and engineering teams set service level agreements to achieve goals related to error budgets. You may also use SLOs to track other types of services within your organization, for example on the reliability of internal applications used by employees. SLOs are not limited to just servicing customers – providing a good employee experience is also important for attracting and retaining your people.
Error budgets and SLAs
A Service Level Agreement (SLA) is a contract dictating that a service will reliably perform to meet specific customer requirements. For both customers and providers, one of the biggest concerns is ensuring that the SLA is met. This means that providers need to be able to track various aspects of the system to make sure that things are running smoothly. Error budgets are a good way to do this.
SLAs often include a financial penalty if the service level is not met. The SLA will define the minimum level of service that the vendor must provide to their customer - for example, your enterprise may agree to provide your customers with 99.9% uptime of your service. This means you can have about 44 minutes of downtime per month and not be subject to any financial penalty. You may therefore set an SLO for your service to ensure you do not exceed 44 minutes of downtime per month. Your error budget can keep track of downtime during the month and alert you when you approach or exceed that limit so you do not violate your SLA.
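The "about 44 minutes" figure can be reproduced with a short calculation. This is a minimal sketch that assumes an average month of 365.25 / 12 days; the function and constant names are made up for the example. A 30-day month gives a slightly smaller allowance of about 43.2 minutes.

```python
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # assumption: average calendar month


def allowed_downtime_minutes(availability_percent: float,
                             days: float = AVERAGE_DAYS_PER_MONTH) -> float:
    """Minutes of downtime permitted in the window at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


print(f"average month: {allowed_downtime_minutes(99.9):.1f} minutes")
print(f"30-day month:  {allowed_downtime_minutes(99.9, days=30):.1f} minutes")
```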
Using SLOs is a great way to track your SLAs. It's a best practice to set up a tighter SLO than your contractual commitment so you get a warning long before you would have violated your SLA. This approach can save you from a lot of the headaches, paperwork, and financial consequences of breaching an SLA with customers.
How do you implement an error budget?
You can use an SLO as an application's target for the number of good or successful requests. To determine this value, you must first calculate the Service Level Indicator. SLIs tell you whether or not an event has happened and can be used as evidence when calculating the error budget for an SLO. They help you keep track of the frequency of events over time to predict future failures more accurately. SLIs commonly come from your observability and monitoring tools. You may have many of these tools that focus on different parts of your technology stack in order to understand reliability across the entire user journey. The SLI is typically a ratio and can be calculated as follows:
Service Level Indicator = (# of Good Events) / (Total # of Events)
Once you have your SLI, you can then define your Service Level Objective. You’ll set a target for the reliability of your service as a percentage. To then get your acceptable error rate, subtract your SLO target from 100%. For example, if your SLO for a given service is 99% uptime, your SLO’s acceptable error rate is 1%.
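The two steps above, computing the SLI ratio and then deriving the acceptable error rate from the SLO target, can be sketched as follows. The function names and sample counts are illustrative assumptions.

```python
def service_level_indicator(good_events: int, total_events: int) -> float:
    """SLI = (# of good events) / (total # of events), as a fraction."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def acceptable_error_rate(slo_target_percent: float) -> float:
    """Acceptable error rate in percent: 100% minus the SLO target."""
    if not 0.0 <= slo_target_percent <= 100.0:
        raise ValueError("slo_target_percent must be between 0 and 100")
    return 100.0 - slo_target_percent


# Made-up numbers: 9,950 good requests out of 10,000 against a 99% SLO.
sli = service_level_indicator(9_950, 10_000)
print(f"SLI: {sli:.2%}")
print(f"Acceptable error rate: {acceptable_error_rate(99.0)}%")
print("SLO met" if sli * 100 >= 99.0 else "SLO missed")
```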
Now that you know what your error budget is, it's important to track how quickly you're burning through it so that you can stay on top of potential outages. You can use burn rate formulas like this one:
Burn Rate = (# Observed Errors in Period)/(# Acceptable Errors in Period)
where "# Observed Errors" is the number of bad events that happened in a given period and "# Acceptable Errors" is the maximum total number of errors allowed within that period. If the result is > 1, you’re consuming the error budget faster than the SLO allows. If it’s < 1, you’re within budget.
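A minimal sketch of the burn rate formula above, with hypothetical function names and made-up sample numbers:

```python
def burn_rate(observed_errors: int, acceptable_errors: float) -> float:
    """Burn rate = observed errors in the period / acceptable errors in the period."""
    if acceptable_errors <= 0:
        raise ValueError("acceptable_errors must be positive")
    return observed_errors / acceptable_errors


def describe(rate: float) -> str:
    """Interpret the ratio: above 1 means the budget is being spent too fast."""
    if rate > 1:
        return "burning faster than the budget allows"
    return "within budget"


# Made-up numbers: the SLO tolerates 100 errors in the period.
for observed in (50, 150):
    rate = burn_rate(observed, 100)
    print(f"{observed} errors -> burn rate {rate:.2f}: {describe(rate)}")
```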
What factors should be considered when setting up an error budget?
Error budgets get defined when you are setting up your SLOs. If your SLO says a service needs 99.9% uptime, then your error budget for that service should be set at about 44 minutes of downtime per month.
But how do you decide if 99.9% is the right service level objective? Should it be 99.99%? Or can you get by with 95% availability? There is no right answer to this question. It depends on your industry, what your competition offers, and what your customers expect. Cost also plays a big role in making the decision about how many 9s to deliver. The more nines you desire, the more expensive it will be to achieve the desired reliability. Another factor is how often you want or need to add new features to your product or service, and how many resources need to be applied to deliver on time.
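To make the cost of each extra nine more concrete, the sketch below compares the downtime budgets implied by the targets mentioned above. It assumes a 30-day window, and the helper name is made up for this example. Going from 99.9% to 99.99% cuts the allowance by a factor of ten.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # assumed SLO window for the comparison


def downtime_budget(availability_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Downtime the target tolerates over the window."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # timedelta supports multiplication by a float.
    return window * ((100.0 - availability_percent) / 100.0)


for target in (95.0, 99.0, 99.9, 99.99):
    minutes = downtime_budget(target).total_seconds() / 60
    print(f"{target}% availability allows about {minutes:.1f} minutes per 30 days")
```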
Your stakeholders need to discuss what each customer journey is and what the customer expects from it before setting SLOs to define the reliability of each journey/product/feature in a given time period. Then, set SLOs that define how much uptime, latency, or error tolerance the product should have for a given time period - a day, a week, a month, a quarter, etc. Observability tools monitor the service's ability to meet these uptime/latency/error limits and provide SLIs to the SLO platform, which calculates burn against the error budget.
Error budgets are where service level objectives start to become real. They are what convert metrics into action, but only if they are collectively understood and taken seriously. Quantifying our customer experience can have a dramatic impact on an organization.
SLOs and error budgets help cultivate a culture that focuses on customer happiness and your current ability to meet their needs. They help you proactively balance reliability investments right when they’re needed and can provide real data to support a freeze on feature releases for a week or quarter or until your reliability is under control. Managing an error budget across your teams will lead to a positive culture shift, one happy user (and employee) at a time.