What is Reliability Anyway?

Jan 25, 2023 | Author: Alex Hidalgo

We all want our services to be reliable, and service level objectives (SLOs) are perhaps our best approach to achieving this. It’s often easy to sell people on the concept of aiming to be the right amount of reliable, but what does reliable actually mean in the first place?

It’s an unfortunate fact that most people in our industry conflate reliability with availability. And while availability is an important part of reliability, it tells only a small part of the overall story!

We can define service reliability as “doing what the service is supposed to do.” Being available for users and customers is of course very important, but a service can easily be available while not actually being reliable. Let’s drill down into an example by thinking about a simple web API. The details of this API are not really important – just think of it as something that has to be able to accept a request and respond to that request appropriately and responsively.

The first thing you need to do is make sure this API is actually up and running. After all, your services aren’t doing much good if they’re not started or are stuck in a crash-loop backoff. A first step in measuring the reliability of this API is therefore checking that it’s running at all. If it isn’t, someone probably needs to take action to ensure that it is.

Next you need to make sure that this service is actually available as well. It doesn’t matter if all your containers are up and running and reporting a healthy status if users can’t send requests to them! This means you need to think about things like the network ingress and egress, your load-balancing and routing layers, and perhaps things like quotas negotiated with your cloud vendors.

But even if your service is both up and available to your users, it can’t be reliable if it’s returning errors for every request, or even for too many of them. Now we’re at a point where if you want to measure the reliability of your service you need to ensure it’s up, that it’s available, and that it isn’t returning too many errors.

Once you’ve accomplished this it might feel like you’re well on your way to measuring the reliability of your API, but there is actually so much more that it needs to do for your users! For example, it doesn’t matter if it is up, available, and not responding with errors if it’s not returning data in the correct format. Or perhaps it’s returning data in the correct format, but it’s not returning the correct data at all! Maybe it’s returning the correct data, but it’s doing it so slowly that the clients and users that rely on it are having a bad experience!
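To make those layers concrete, here is a minimal sketch in Python that walks through them in order and reports the first one that fails. Everything in it is an assumption made for illustration: the `Observation` fields, the `first_failure` function, the rule that any non-200 status counts as an error, and the 0.5 second latency threshold are all hypothetical and would differ for a real service.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One request as seen from the outside (hypothetical fields)."""
    process_running: bool        # is the service up at all?
    status_code: Optional[int]   # None means no response ever reached the client
    body: str
    latency_seconds: float


def first_failure(obs: Observation, expected: dict,
                  max_latency_seconds: float = 0.5) -> Optional[str]:
    """Return the first reliability layer that failed, or None if all passed."""
    if not obs.process_running:
        return "not running"
    if obs.status_code is None:
        return "not available"
    if obs.status_code != 200:   # assumption: anything but 200 is an error
        return "error"
    try:
        payload = json.loads(obs.body)
    except json.JSONDecodeError:
        return "wrong format"
    if payload != expected:
        return "wrong data"
    if obs.latency_seconds > max_latency_seconds:
        return "too slow"
    return None


# A response that is up, available, and error-free can still be unreliable.
slow = Observation(True, 200, '{"id": 1}', latency_seconds=3.0)
print(first_failure(slow, expected={"id": 1}))
```

Notice that each check only makes sense once the ones before it have passed, which is why the order of the checks mirrors the order of the paragraphs above.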

We could continue this thought exercise for a long time. The point is that you really need to think about what your service needs to be doing for you to be measuring its actual reliability.

Perhaps this now sounds like a lot of work, with so many things to measure and track! But it turns out that there is a great trick you can use to measure all of these things at once.

If you know that your API is returning the correct data in a timely manner, you also know that your service is returning things in the correct format, that it’s not responding with too many errors, and that it’s both up and available for your users.

And that’s what a good service level indicator (SLI) is! It’s a measurement that captures as much of the user journey as possible. A good SLI is the true measurement of the reliability of your service, and an SLO-based approach to reliability needs good SLIs in order to be effective.
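One common way to turn this into a number is to treat every request as either a good event or a bad one, and report the fraction that were good. The sketch below assumes each request has already been judged: `True` means the user got the correct data in time, `False` means anything else went wrong. The function names `sli` and `meets_slo` and the 99.9% target are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def sli(good_events: Sequence[bool]) -> float:
    """Fraction of requests that were good from the user's point of view."""
    if not good_events:
        raise ValueError("no events to measure")
    return sum(good_events) / len(good_events)


def meets_slo(good_events: Sequence[bool], target: float = 0.999) -> bool:
    """Compare the measured SLI against an SLO target (assumed: 99.9%)."""
    return sli(good_events) >= target


# 998 good requests out of 1,000 falls short of a 99.9% target.
events = [True] * 998 + [False] * 2
print(sli(events), meets_slo(events))
```

The measurement itself is simple arithmetic. The hard part, and the part that makes an SLI good, is deciding what counts as a good event.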

There is nothing wrong with simply measuring availability, error rates, or latency. These are, after all, very important components of a reliable computer service. But true reliability goes much deeper than that. As you progress on your SLO journey make sure you’re always looking towards how you can improve your measurements and how you can better capture the actual reliability of your services.