Nobl9 - News and updates on SRE, SLO and general reliability

4 Considerations for a Great SLO Platform

Written by Keri Melich | Aug 29, 2022 6:21:51 PM

Managing a service well comes down to one thing: customer satisfaction. A service without customers isn’t worth much, even if it’s 99.999% reliable. So why have we put the maintenance of our services before the customer experience for so long? It stems from systemic patterns in the workplace where development (the services) and operations (the customer experience) are siloed. I’ve seen so much burnout in our industry due to one person or group firefighting to hold up their side of that equation without much consideration of or consultation with the other. This is why DevOps tools that promote shared principles and collaboration are crucial to our work and happiness, and ultimately to customer satisfaction.

So how do we quantify the benefits of maintaining services with our customers in mind? We can measure them in time spent on toil, time spent firefighting, the number of times we are woken up in the middle of the night to deal with an incident, and our deployment velocity.

As an incident responder, a favorite DevOps tool that has helped me manage my on-call stress is the Service Level Objective (SLO).

Conceptually, SLOs are pretty simple: we’re setting targets for our services based on minimum satisfaction levels determined by our customer journeys and expectations. While that may be easier said than done, there is an ever-growing number of platforms that can help us do just that. To find the right solution for your team, I recommend starting with the following four considerations.

1. Build or buy

Deciding whether to build it or buy it is typically the starting point of any conversation about adopting a new tool. Five years ago this would’ve been a quick decision, but there is now a growing number of SaaS and open source SLO solutions out there. So how do we weigh our options? I measure them in terms of toil. Maintenance is a significant source of toil in our industry, and a platform that integrates many sources (such as monitoring tools, communication tools, incident management tools, identity providers, etc.) is a recipe for large amounts of toil. A great SLO platform should help you deal with toil, not add to it. If your team is small and already fighting fires constantly, you likely don’t have the resources to build and maintain an SLO platform without significant toil. Think about this in terms of annual cost: how many engineers does it take to build and maintain an SLO platform that’s easy to adopt? If they work on this project full time, is their cost less than or equivalent to the cost to buy it?
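To make the annual-cost question concrete, here is a minimal sketch of the comparison. Every name and figure in it (the `BuildEstimate` class, the engineer count, the per-engineer cost, the subscription price) is a hypothetical placeholder, not real pricing; plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class BuildEstimate:
    """Hypothetical inputs for estimating the yearly cost of building in-house."""
    engineers: int                 # full-time engineers building and maintaining the platform
    annual_cost_per_engineer: int  # fully loaded yearly cost per engineer (assumed figure)

    def annual_cost(self) -> int:
        return self.engineers * self.annual_cost_per_engineer


def cheaper_option(build: BuildEstimate, annual_buy_cost: int) -> str:
    """Return "build" only when building is strictly cheaper per year than buying."""
    return "build" if build.annual_cost() < annual_buy_cost else "buy"


# Illustrative numbers only.
build = BuildEstimate(engineers=2, annual_cost_per_engineer=200_000)
decision = cheaper_option(build, annual_buy_cost=150_000)
print(f"Build: {build.annual_cost()} per year, decision: {decision}")
```

This only compares direct cost. The toil of maintaining integrations, which is harder to put a number on, would push the estimate for building higher.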

2. Tool agnosticism

We’re far past the days when a single monitoring solution was enough. Most of us use many monitoring tools, and that’s why it’s so important for our SLO platforms to be tool agnostic. Many SLO platforms are built as extensions to existing monitoring tools, and while this may be useful in a handful of cases, it can quickly limit our ability to scale our SLOs with our services. A great SLO platform should work seamlessly with as many sources as you need, both in your current stack and as you scale in the future.

3. Role-Based Access Control (RBAC)

In a world of various compliance standards (SOC 2, HIPAA, ISO, etc.), RBAC has become a fairly common offering. However, I still find it important to highlight as a consideration for a great SLO platform because of its implications. SLOs are useful because they provide an easy-to-understand overview of the state of the product, something anyone across the company can use in their respective role. This makes them a valuable tool for fostering and facilitating communication, so the ability to give everyone in the company an appropriate level of access remains an important offering that we should not ignore. In fact, I would love for our Nobl9 SLOs to be public, but that’s a post for another day!

4. Incident management

If you have ever been on call, this topic could be the bane of your professional existence, which makes it a crucial consideration for a great SLO platform. SLOs should always help your on-call team, and an SLO platform does this by providing meaningful alerts, like predictive alerting based on the consumption rate of your error budget. But let’s not forget the primary function of SLOs: to map our monitoring data to our customer journeys. If your alerts do not embody this by focusing on your customer journey(s) and measuring actual impacts, SLOs can quickly become another channel of stress and alert fatigue in incident management. A great SLO platform should help you build meaningful alerts based on meaningful customer experiences. 
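As a rough illustration of alerting on the consumption rate of an error budget, the sketch below computes a burn rate and projects when the budget would run out. The function names, the 30-day window, the alert threshold, and the event counts are all assumptions chosen for the example; they do not describe any particular platform's API or defaults.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO window; higher values mean it runs out sooner.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def hours_until_exhausted(rate: float, budget_remaining: float, window_hours: float) -> float:
    """Project time to exhaustion if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


# Illustrative numbers: 99.9% target, 30-day window, 50 failures in 10,000 requests.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
remaining_hours = hours_until_exhausted(rate, budget_remaining=1.0, window_hours=30 * 24)

ALERT_THRESHOLD = 2.0  # hypothetical threshold; tune it to your own tolerance
should_alert = rate >= ALERT_THRESHOLD
print(f"burn rate {rate:.1f}, about {remaining_hours:.0f}h of budget left, alert={should_alert}")
```

For the alert to reflect actual customer impact, the bad and total event counts should come from an indicator tied to a customer journey, such as failed checkouts, not from raw infrastructure metrics.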

An SLO Platform 

A great SLO platform adds the most value when we are at our most stressed. It’s easy to get lost in the constant firefighting, and we often feel that we don’t have the time or resources to set up SLOs. But this is when they are most valuable! Our toolkit can be the first step in the right direction toward a development and deployment culture that focuses on achievable goals. When we take the time to talk about reliability in a realistic way, we’re giving ourselves permission to be “good enough.” Perfect is the enemy of good, and stress can easily make work feel overwhelming. So when we aim for “good” in a culture that does not cultivate stress, we have more time to iterate in ways that actually impact our customers and lead us to “great.”