At the base of all things related to SLO-based approaches to reliability are your service level indicators. While we throw around the acronym SLO much more often than SLI, it’s really your SLIs that drive everything forward. After all, without a meaningful SLI you cannot have meaningful SLOs or error budgets. A good SLI has to be your first step in the process, but choosing a good one is not always easy.
You adopt SLOs because you want better data to help you make better reliability decisions, and you want reliability because you want happy users. In fact, in some sense the question “Is our service reliable?” is the same question as “Is our service doing what its users need it to be doing?” Taking this one step further, a meaningful SLI asks the very same question: “Are we doing what users need us to do?” This all seems straightforward enough, but it turns out that these questions are not always easy to answer.
Most people start with a bottom-up approach. First they might want to ensure that a service is up and available for their users. Availability is generally pretty easy to measure, and many people may already have the data needed for such an approach. But then as you think about things a little more you realize that availability doesn’t matter if you’re incredibly slow.
So now you need to think about both availability and latency. Again, both are common types of data to have about a service. If you’re aiming to do what users need from you, it’s easy to see that you generally have to be both available and responsive.
Measuring availability and latency is a pretty good start, but it turns out that an available and responsive service that only returns errors isn’t very useful to anyone at all, so you have to add error rates to your list. You’re now asking a bunch of questions about your service:
- Is my service up?
- Is my service available?
- Is my service responsive?
- Is my service returning enough good responses?
These are all incredibly reasonable questions to ask about your service, which means they’re all great things to measure. But does any of that actually measure what your users need? Probably not.
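To make the bottom-up view concrete, here is a minimal sketch of how those questions might each become a separate ratio over a window of request records. The `Request` record, its field names, and the 300 ms latency threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical record of one request; the field names are assumptions."""
    answered: bool                      # did the service respond at all?
    status: Optional[int] = None        # HTTP-style status, None if unanswered
    latency_ms: Optional[float] = None  # time to respond, None if unanswered


def ratio(good: int, total: int) -> float:
    """Fraction of good events; an empty window is treated as fully good."""
    return good / total if total else 1.0


def bottom_up_metrics(requests: list[Request],
                      max_latency_ms: float = 300.0) -> dict[str, float]:
    """Compute one ratio per bottom-up question (threshold is an assumption)."""
    answered = [r for r in requests if r.answered]
    fast = sum(1 for r in answered
               if r.latency_ms is not None and r.latency_ms <= max_latency_ms)
    ok = sum(1 for r in answered
             if r.status is not None and r.status < 500)
    return {
        "availability": ratio(len(answered), len(requests)),
        "responsiveness": ratio(fast, len(answered)),
        "success": ratio(ok, len(answered)),
    }
```

Each of these numbers is useful telemetry, but none of them says whether the response contained what the user actually needed.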
For example, it’s all well and good if your service is up, available, responsive, and not returning errors; however, it’s not being very reliable if it’s returning data in the wrong format. And even if that data format is correct, it’s not being very reliable if the data isn’t the data actually being asked for. And even if the data format is correct and it’s the data being asked for, your service is not being very reliable if the data is yesterday’s data instead of today’s. Users need fresh and correct data.
We could repeat this thought exercise for a while. The point is that your users likely need many more things from you than you might first expect. Availability and latency don’t tell much of the actual story.
So now you might be asking yourself, “What does this mean for me? Does this mean I need 10 SLIs per service? 20? How many things do I have to be measuring?!”
The answer to that is twofold. Yes, you need to be measuring a lot of things; however, you don’t need many SLIs at all.
Let’s use our example questions from just before. If you can answer the question, “Can a user get fresh and correct data?” you already know all of the rest. A response with fresh and correct data has to also be a response that was returned in the correct data format. And if you’re sending a response in the correct data format, you know that you’re not returning errors. And if you’re sending responses at all, you know that your service is both up and available. It turns out that you can measure many things by measuring only a few.
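One way to picture this is a single "good event" check whose conditions are nested exactly as described: a response only counts as good if it is fresh and correct, which it can only be if it was also well formed, error free, and returned at all. The sketch below is illustrative only. The `Response` record, its fields, the status cutoff, and the one-hour freshness window are all assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Response:
    """Hypothetical record of one response; all fields are assumptions."""
    status: Optional[int]                      # None means nothing came back
    well_formed: bool = False                  # payload is in the expected format
    matches_request: bool = False              # payload is the data asked for
    data_timestamp: Optional[datetime] = None  # when the data was produced


def is_good(resp: Response, now: datetime,
            max_age: timedelta = timedelta(hours=1)) -> bool:
    """True only if the user received fresh and correct data."""
    if resp.status is None or resp.status >= 400:
        return False  # down, unavailable, or returning an error
    if not (resp.well_formed and resp.matches_request):
        return False  # wrong format, or not the data requested
    if resp.data_timestamp is None:
        return False  # freshness cannot be established
    return now - resp.data_timestamp <= max_age


def sli(responses: list[Response], now: datetime) -> float:
    """Good events divided by total events; an empty window counts as 1.0."""
    if not responses:
        return 1.0
    return sum(1 for r in responses if is_good(r, now)) / len(responses)
```

Because every earlier failure mode makes `is_good` return `False`, this one ratio drops whenever the service is down, erroring, malformed, incorrect, or stale, without needing a separate SLI for each.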
This is not to say that you shouldn’t have telemetry telling you about all of those things! You should absolutely have a metric that says, “Are we up?” and one that says, “Are we responsive?” It’s just that those might not make the best SLIs. Think about everything your users actually need from you, and then figure out how you can contain all of that with as few measurements as possible.
Service level indicators are at the bottom of the reliability stack, but you formulate the best ones by looking from the top down.