Error Budgets Are for Asking Questions

Written by Alex Hidalgo | Aug 9, 2022 9:15:15 PM

In 2016 Google published the first SRE book. The book has sold phenomenally over the last six years and remains one of the best-selling tech books to this day. In this book, many people were first introduced to ideas like handling interrupts, blameless incident culture, and service level objectives.

Service level objectives (SLOs) are an incredibly powerful way to overhaul your thinking about what reliability actually means and how important it is to your users and your company. There is a reason I’ve dedicated my life to them at this point. I’ve witnessed firsthand – on many occasions – how “doing SLOs” is capable of making people’s lives easier and better.

Adopting service level indicators, service level objectives, error budgets, and everything that comes with them can be a truly watershed moment for your team, organization, and company.

But “doing SLOs” isn’t really a great way to talk about things at all, because there are three very distinct components to such an approach: service level indicators (SLIs), service level objectives, and error budgets. I’ve taken to calling these three components “the reliability stack” to give us a better way to talk about the entire approach as a whole.

I don’t want to spend too much time getting into the details of each component of the reliability stack since much has been written about them. But the tl;dr is that SLIs are measurements of the reliability of your service, SLOs are targets for how often you aim to be reliable, and error budgets are a way of capturing how often you have actually been reliable based upon those targets over a window of time. It’s all better data that allows you to better understand how your computer services have performed against the targets.
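To make the tl;dr concrete, here is a minimal sketch of the arithmetic for an event-based SLI. The function names, the event counts, and the 99.9% target are all assumptions chosen for illustration; they don't come from any particular tool, and a time-based SLI would be counted differently.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events in the window that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully reliable
    return good_events / total_events


def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """Error budget for the window: how many bad events the SLO tolerates."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed = allowed_bad_events(slo_target, total_events)
    bad = total_events - good_events
    if allowed == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed


# Hypothetical 30-day window: 1,000,000 requests, 600 of them bad, 99.9% target.
measured = sli(999_400, 1_000_000)                      # the SLI
remaining = budget_remaining(0.999, 999_400, 1_000_000)  # the error budget left
print(f"SLI: {measured:.4%}, error budget remaining: {remaining:.0%}")
```

With these made-up numbers the SLO tolerates about 1,000 bad requests, so 600 bad requests leaves roughly 40% of the budget unspent.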

While I love how many people all over the world have been introduced to SLOs by the first SRE book, I do lament that it barely touches on how to use the data they provide. In fact, only one real option is given: ship features, or don’t.

The basic idea is that if you have error budget remaining you can and should ship features as quickly as you can or want to. On the other hand, if you don’t have error budget remaining you should halt shipping features and focus on reliability work instead. There isn’t anything terribly wrong with this at a conceptual level, but it turns out to be a very limited approach in the real world.
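Written as code, that rule is about as small as a policy can get. This is only a sketch of the idea as summarized above; the function name and the returned strings are hypothetical.

```python
def ship_or_freeze(budget_remaining: float) -> str:
    """The binary rule: any budget left means ship, none left means stop."""
    if budget_remaining > 0:
        return "ship features"
    return "halt features and focus on reliability work"


print(ship_or_freeze(0.4))    # budget left
print(ship_or_freeze(-0.25))  # budget overspent
```

Notice that the only input is a single number. The rule knows nothing about why the budget was spent, who owns the code, or what kind of service it is.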

For example: shouldn’t reliability be considered an important feature of your computer services in the first place?

What does it even mean to not ship features in a scenario where you’ve depleted your error budget?

Does it mean that your developers should stop writing code entirely?

And what do you do if you don’t own the code for the service you’re supporting in the first place?

What do you do if you’re responsible for a service that’s heavily based upon open source software?

Or if your service is actually hardware or networking gear?

Does anyone have to change their focus to reliability work at all?

What if you burned all of your error budget due to the catastrophic failure of an underlying component or a vendor dependency?

Should you still be halting all of your feature releases just because of that?

We could go on for a long time, but you might have already picked up on a better way to use the status of your error budgets: ask meaningful questions that help you make better decisions. Error budgets are for asking: “What is going on and what does that mean?”

Perhaps sometimes it does make sense to freeze feature development work and focus on reliability efforts. But also: sometimes it might not make any sense at all! Sometimes maybe you just need a single member of a single team to change the focus of their upcoming sprint. Sometimes you might need your entire company to switch the focus of their efforts for a full quarter. There isn’t a one-size-fits-all answer for how to react to a severely diminished or depleted error budget.

Adopting service level indicators, service level objectives, error budgets, and everything that comes with them can be a truly watershed moment for your team, organization, and company. They’re a phenomenal approach that gives you better insight into the reliability performance of your systems. They are a remarkable way of making both your engineers and customers happier. But as you embark (and continue!) upon your SLO journey, remember that at their heart they’re better measurements that give you better data to ask better questions about what is actually going on. And asking these better questions leads to better discussions and better decisions. Don’t just listen to what others have written down. Use your data to make the right decisions for you.
