Author: Keri Melich
Explaining SLOs to someone not familiar with them can be intimidating. Where do you start? How do they work? We’re going to break down the where, how and what of SLOs into a few easy steps and help you get anyone to start adopting SLOs.
Step one: How to use customer feedback
An SLO (Service Level Objective) helps you measure your product against future goals by setting a baseline for where you realistically want it to be. Plus, SLOs are a great way to keep your product customer-focused. They will help you find gaps and set goals that will have the largest impact on your customers. Once you find those gaps, you can start using that data to figure out how to prioritize customer feedback.
Here’s the funny thing about customer feedback: it can, of course, be used to set your SLOs, but you can also use your SLOs to manage that feedback. For example, if people are complaining about login errors or outages more than anything else, you should have an SLO that monitors login error counts. If people are complaining about slowness, you should have a latency SLO. This will help you keep track of the issues that are impacting your customers the most. And once you’ve hit an equilibrium where you have a good monitoring stack and well-defined SLOs, you can start to use them as a tool to encourage conversations about the direction of your product.
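To make the login and latency examples concrete, here is a minimal sketch of how the two indicators behind those SLOs could be computed from a batch of requests. The `LoginRequest` type, the sample traffic, and the 99% and 95% targets are all made up for illustration; a real monitoring stack would supply the data and you would pick targets that fit your customers.

```python
from dataclasses import dataclass


@dataclass
class LoginRequest:
    succeeded: bool     # did the login complete without an error?
    latency_ms: float   # how long the request took to serve


def good_ratio(requests, is_good):
    """Fraction of requests that count as 'good' for a given SLO."""
    if not requests:
        return 1.0  # no traffic, so nothing was violated
    return sum(1 for r in requests if is_good(r)) / len(requests)


# Hypothetical sample of login traffic
requests = [
    LoginRequest(True, 120.0),
    LoginRequest(True, 480.0),
    LoginRequest(False, 95.0),
    LoginRequest(True, 210.0),
]

# Hypothetical targets: 99% successful logins, 95% served in under 300 ms
login_success = good_ratio(requests, lambda r: r.succeeded)
fast_enough = good_ratio(requests, lambda r: r.latency_ms < 300.0)

print(f"login success: {login_success:.2%} (target 99%)")
print(f"under 300 ms:  {fast_enough:.2%} (target 95%)")
```

Both indicators use the same "good events over total events" shape, which keeps the login SLO and the latency SLO easy to compare side by side.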
These customer-focused indicators will help you prioritize feature requests based on how well you’re meeting your SLOs. New features can often add new complexity and potential for some minor downtime. If you’re not meeting your SLOs, your customers have likely experienced recent flakiness and shipping a new complex feature may only make that worse. This is a perfect opportunity to start a conversation about reliability vs. feature requests.
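One common way to frame that conversation is an error budget: the amount of unreliability your SLO target allows. The sketch below is only an illustration; the function names, the 25% caution threshold, and the sample numbers are assumptions, not a prescribed policy.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent).

    Assumes slo_target is below 1.0, so some failures are allowed.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def release_advice(remaining, caution_threshold=0.25):
    """Turn the remaining budget into a conversation starter, not a rule."""
    if remaining <= 0:
        return "budget spent: prioritize reliability work"
    if remaining < caution_threshold:
        return "budget is low: ship with caution"
    return "budget available: room to ship new features"


# Hypothetical month: 100,000 logins, 99,400 succeeded, 99% target
remaining = error_budget_remaining(0.99, good_events=99_400, total_events=100_000)
print(f"{remaining:.0%} of the error budget left -> {release_advice(remaining)}")
```

With these sample numbers, 600 of the 1,000 allowed failures have been used, so roughly 40% of the budget is left.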
Step two: Have a conversation (or six)
Let’s say we’re monitoring trial sign-ups and they’re proceeding at a steady pace, but then we release a new feature that causes downtime and our site is unavailable for one hour. Once we recover, the trial sign-up rate dips extremely low and continues to trend that way for the day.
We could start the conversation about why this happened:
- Did it continue to trend downward because the feature (once fixed) was not in high demand? Or did the downtime damage our reputation?
- When we have had downtime in the past, how long did it last? What is the maximum amount of downtime we’ve experienced without major impacts to our sign-up rate?
- Did we train our customers to expect us to be unavailable for longer periods of time?
- Can we correlate a specific amount of profit lost to shipping a new unstable feature that caused downtime?
- Or is the data telling us we need a better CI/CD pipeline, one that includes feature flagging, A/B testing, or a better rollback system?
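As a rough illustration of the profit question above, here is one way to estimate the shortfall in sign-ups after an incident. The hourly counts, the incident hour, and the value per sign-up are invented for the example, and comparing against a simple pre-incident average is an assumption; real traffic has daily and weekly patterns that a proper analysis would need to account for.

```python
def estimated_lost_signups(hourly_signups, incident_hour):
    """Sum the shortfall against the average rate seen before the incident."""
    before = hourly_signups[:incident_hour]
    after = hourly_signups[incident_hour:]
    if not before:
        raise ValueError("need at least one hour of data before the incident")
    baseline = sum(before) / len(before)
    # Only count hours that fall below the baseline as lost sign-ups
    return sum(max(0.0, baseline - count) for count in after)


# Hypothetical hourly trial sign-ups; the outage starts at index 4
hourly = [40, 42, 38, 40, 0, 12, 15, 18]
lost = estimated_lost_signups(hourly, incident_hour=4)

value_per_signup = 25.0  # hypothetical average value of one trial sign-up
print(f"about {lost:.0f} sign-ups lost, roughly ${lost * value_per_signup:,.2f}")
```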
The ability to pinpoint profit loss is an easy way to show leadership that we need better support. Perhaps we need more QA testing before shipping features, or maybe we need to pivot to fixing a backend service before we start offering new features.
You see, there are so many conversations this data can prompt — all leading to a much more efficient and customer-friendly outcome.
Step three: Look at the flip side
It might also be easier to understand this one from the opposite side of the spectrum. Say we release a new (stable) feature and we see trial sign-ups spike. That might tell us that this particular feature was well received or something a lot of customers really wanted. That prompts a different set of questions:
- Where did we find the feedback that led to this feature release?
- How can we prioritize more of those feature requests?
- Which features generate the most profit?
- What do customers really care about?
Step four: Make it realistic!
Anytime we set goals it’s easy to want to overachieve them, but there are diminishing returns when overachieving SLOs. Most people have some amount of personal experience that has taught them computers aren’t always reliable (to say the least!). And over time, we’ve trained ourselves to know this about anything related to computers. So having a limited amount of downtime within a timeframe isn’t fully unexpected behavior, but using SLOs will help us manage how much of an impact that downtime has on our customers.
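To see what "a limited amount of downtime within a timeframe" means in numbers, the sketch below converts an availability target into allowed downtime. The 30-day window and the three targets are just example values.

```python
from datetime import timedelta


def allowed_downtime(slo_target, window):
    """Downtime permitted by an availability target over a time window."""
    return timedelta(seconds=(1.0 - slo_target) * window.total_seconds())


window = timedelta(days=30)  # example window
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target, window).total_seconds() / 60
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves about 43 minutes, while 99.99% leaves only about 4. Each extra nine cuts the allowance by a factor of ten, which is where the diminishing returns come from.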
If your systems are simple and you really aren’t experiencing any downtime, then use that time to find ways your system can break before it actually breaks. This practice is called Chaos Engineering, and it works best when you have SLOs in place. Getting ahead of issues will make troubleshooting unexpected ones much smoother.
We also want to keep our goals realistic so that we set better SLOs rather than lean on SLAs (Service Level Agreements). SLAs are usually set at a low bar that allows for almost anything, which isn’t in the best interest of the customer. Shooting for perfection isn’t possible either. SLOs, however, allow us to lower our reliability goals to something more realistic, so we can concern ourselves less with troubleshooting downtime and managing unrealistic expectations and more with managing our roadmap and delivering feature requests. And that brings us back to step one: using customer feedback. Guess we made a feedback loop!
Step five: Get started!
It’s never too early or too late to start using SLOs. I find there to be less friction the sooner you start, but just remember that no matter where you are in your product’s lifecycle, using SLOs to focus on your customer will also help you! The point at which you implement SLOs will help shape how you use them, your downtime procedures, and how you start the types of conversations that will improve your product and make your customers happy.
Wondering where to start?
We call Nobl9 the cheat code to SLOs for a reason. Come check it out for yourself.