After SLOConf: A Conversation About Reliability

Jun 8, 2021 | Author: Erza Zylfijaj

Avg. reading time: 3 minutes

If you didn’t get a chance to attend SLOConf, the first Service Level Objective Conference for Site Reliability Engineers, you missed Nobl9’s own Keri Melich give one of the most popular talks on SLO Basics – a Conversation About Reliability. Don’t worry! It’s still available for you to watch.

Keri’s talk is an easy-to-digest introduction to the basic theory behind SLOs and a great place to start with service level objectives. Beginning with basic acronyms and ending with how SLOs build a happier team experience, Keri walks you through the ways SLOs can improve your workflow.

Begin with a Conversation

“I think one of the biggest responsibilities of an SRE is bridging the gap between different groups within a company.” – Keri Melich

Decisions leadership makes affect developers, and SLOs are a tool that can be used to start conversations around those decisions. For example, one SRE responsibility closely related to SLOs is observability. Simply put, observability is the ability to understand the internal state of a system from its external outputs. This includes things like system stats, availability of a service, dependencies, and errors. For SREs, the goal is to implement and monitor those measurements so they can have more effective conversations about the state of the product and where they should be putting their efforts. One such conversation is whether or not you’re in a good place to release new features without negatively impacting the customer experience. This is where we bring in SLOs.

What’s an SLO?

An SLO is a goal that is set using the data you receive from monitoring. It gauges how well the product is doing and helps to point out trends: maybe there’s consistent downtime when code changes are implemented, for example. It also helps you forecast the availability of your service a bit better. To calculate SLOs, you’re going to use what’s called an SLI, or Service Level Indicator. SLIs are the actual numbers or queries that you pull from your monitoring stack. You apply a little bit of math to them, which tells you how well you’re meeting your goals. The great thing about an SLO is that it can be as generic or specific as you like, as long as you’re recording the data to measure it. And SLOs are really only for internal purposes, so don’t be afraid to change them!

A Specific SLO Case

Let’s say your data is telling you that 99.99% availability is unrealistic, and maybe you haven’t been meeting that goal for months. Change it. Goals work best when you set them in small, digestible chunks. If you set really lofty goals for yourself right away, you won’t get any meaningful data. Instead, you could set your SLO to say that 99.95% of requests on a service will be completed in under 2000 milliseconds. That’s measured from the moment a customer interacts with your website, and it reflects how fast they expect your website to load. In this case, our SLI would be a query that helps calculate the latency of that specific type of request.
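To make the "little bit of math" concrete, here is a minimal Python sketch of that latency SLI. It is an illustration rather than anything from Keri’s talk: the function names and the sample data are hypothetical, and in practice the latencies would come from a query against your monitoring stack.

```python
# Illustrative values taken from the example above.
SLO_TARGET = 0.9995          # 99.95% of requests
LATENCY_THRESHOLD_MS = 2000  # "under 2000 milliseconds"


def latency_sli(latencies_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return the fraction of requests that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests recorded in this window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, target=SLO_TARGET):
    """Return True when the measured SLI is at or above the SLO target."""
    return sli >= target


# Hypothetical window: 10,000 requests, 4 of them slower than 2000 ms.
samples = [120] * 9996 + [2500] * 4
sli = latency_sli(samples)
print(f"SLI: {sli:.4%}, SLO met: {meets_slo(sli)}")
```

With these made-up samples the SLI is 99.96%, which clears the 99.95% target. Ten slow requests out of the same 10,000 would give 99.90% and miss it.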

Comparing the SLI against the SLO leads to the calculation of an error budget. Error budgets are a really helpful way to see how much you’re impacting the customer experience. An error budget is basically the amount of error that your service can accumulate over a period of time before customers start to notice and complain. If you’ve been shipping a lot of new features and have had a lot of downtime, this might be a good indicator that you should focus on reliability, or even fix some existing features. There’s a pretty fine line between your service being available and your service having new features, and customers are always going to want both. If you can consistently ship new features without bugs, your customers are going to be happy. If that’s not possible, it’s a good idea to ask yourself if there is a service you can make more stable to help ship new features without major restarts. It might also be a good indicator that you need to invest more time in QA before you ship something (this one, in particular, might require some buy-in from leadership).
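As a rough sketch of how an error budget can be computed, the Python below treats the budget as the number of requests allowed to miss the SLO in a window. The request counts are invented for illustration, and the function names are hypothetical rather than part of any particular tool.

```python
def error_budget(total_requests, target):
    """Number of requests allowed to miss the SLO in the window."""
    return (1 - target) * total_requests


def budget_remaining(total_requests, bad_requests, target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(total_requests, target)
    if budget == 0:
        # A 100% target leaves no budget at all, so any failure overspends it.
        raise ValueError("a 100% target has no error budget")
    return (budget - bad_requests) / budget


# Hypothetical month: 1,000,000 requests, 200 of them missed the SLO.
total, bad, target = 1_000_000, 200, 0.9995
print(f"Budget: {error_budget(total, target):.0f} bad requests")
print(f"Remaining: {budget_remaining(total, bad, target):.0%}")
```

Here the budget is 500 bad requests, and 200 have been spent, so about 60% remains. A negative result would mean the budget is overspent, which is the kind of signal that might prompt a shift from shipping features to working on reliability.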

You can see from this example how SLOs are a useful tool for talking to leadership. They’re great at helping translate important parts of your observability into impactful and easier-to-digest conversations. When you talk to leadership, you’re going to have to explain why 100% availability is unrealistic and how it’s only going to hurt your business. But if you give yourself a buffer for downtime, you’re going to make happier engineers by giving them a better work-life balance. They won’t be glued to their phones or their machines. And your customers know that sometimes computers don’t work. They’re most likely not going to notice if you have 20 cumulative minutes of downtime per month. On the other hand, if you really are hitting a hundred percent consistently, that gives you room for chaos engineering. Chaos engineering is when you purposefully break your system in a controlled environment in order to learn how to fix it, which helps your engineers prepare for different disaster scenarios.
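When having that conversation with leadership, it can help to translate percentages into minutes. The sketch below is a back-of-the-envelope calculation that assumes a 30-day month. That assumption and the function names are ours, not from the talk.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month: 43,200 minutes


def availability_from_downtime(downtime_minutes, window_minutes=MINUTES_PER_MONTH):
    """Availability as a fraction, given cumulative downtime in the window."""
    return 1 - downtime_minutes / window_minutes


def allowed_downtime(target, window_minutes=MINUTES_PER_MONTH):
    """Minutes of downtime the window can absorb while still meeting the target."""
    return (1 - target) * window_minutes


print(f"20 minutes down: {availability_from_downtime(20):.4%} available")
print(f"99.99% target allows {allowed_downtime(0.9999):.2f} minutes")
print(f"100% target allows {allowed_downtime(1.0):.2f} minutes")
```

Under that assumption, 20 cumulative minutes of downtime still works out to roughly 99.95% availability, while a 99.99% target leaves only about 4.32 minutes per month and a 100% target leaves none.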

This is a lot to digest if you’re new to SLOs. We highly suggest you read Alex Hidalgo’s “Implementing Service Level Objectives” for more information on how you and your team can adopt SLOs. And, of course, watch Keri’s full talk below.