How Product Managers And SRE Together Can Delight Customers
Author: Brian Singer
Whether or not your website works is not the only thing your customer cares about. Software applications and infrastructure technology each play a crucial role in customer happiness. And the most effective way to ensure your team is keeping those customers happy (and supporting business objectives) is to use Service Level Objectives (SLOs).
Is your technology contributing to customer happiness?
Ideally, your software engineering team should be building infrastructure that supports your customers’ wants and needs. This is why you’re hearing so much about SRE (Site Reliability Engineering) and why “site reliability engineer” is one of the hottest job titles in IT.
However, as we’ve pointed out in other blogs, 100% reliability is not only impossible but extremely expensive. And it’s often in direct opposition to the goal of getting new software features and services to market quickly. SLOs help your team find the optimal balance between reliability and speed.
An iterative approach to SLOs helps to achieve that balance:
- First, use SLOs to engineer a system that is reliable enough to keep customers happy without wasting resources.
- Next, use SLOs to fine-tune performance, improve products, minimize costs, and guide application development, so that your customers and internal business units get what they need from your technology systems.
The Direct Connection between SLOs and Business Objectives
One of the reasons you hear site reliability engineers (SREs) talk so much about “availability” is that it’s easy to understand how customers get frustrated when websites are down or functioning erratically. To illustrate this, we can use a simple e-commerce example:
If I’m trying to buy something from your website and there’s a glitch that prevents me from checking out, I’ll probably come back 20 minutes later and try checking out again. And if the glitch doesn’t happen again, I won’t think twice about it. However, if every time I try to check out, your site bugs out for 20 minutes, then I’m leaving. If that happens, your site’s unavailability has cost you a customer.
This is a case where it’s easy to draw a straight line between site availability and customer happiness.
In other cases, the line between engineering metrics and business results can be less direct, particularly as there are so many different kinds of metrics we can use. But which metrics deserve our attention? The ones which highlight business impacts we actually care about.
Let’s look at a real-world example from a customer:
Among the business KPIs we used was the total number of deployments versus the total revenue we were making. Of course, the more deployments, the more money we made, so the line chart for these metrics tracked beautifully “up and to the right”.
Other than creating a false sense of security and stroking our egos, what did that KPI really do for us? Not much.
With an SLO, we instead tracked the ratio of successful deployments to total deployments. Now here was some useful information! If the total number of attempts was going up faster than the number of successful deployments, we had a problem. We were likely losing a lot of deployments to failures, and that equates to lost revenue.
Our KPI chart showed revenue going up, but our SLO told us revenue was not going up as fast as it should be. We were able to divert resources to focus on the root causes of deployment failures that otherwise would not have been identified as problematic.
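The ratio described in this example can be sketched in a few lines of Python. This is only an illustration, not the customer's actual tooling: the `DeploymentWindow` type, the 95% target, and the monthly counts are all made-up assumptions. It shows how raw deployment counts can keep climbing while the success ratio slips below an objective.

```python
from dataclasses import dataclass


@dataclass
class DeploymentWindow:
    """Deployment counts for one reporting period (hypothetical structure)."""
    attempted: int
    succeeded: int


def success_ratio(window: DeploymentWindow) -> float:
    """Return successful deployments divided by total attempts."""
    if window.attempted == 0:
        # Assumption: a period with no attempts has no failures to report.
        return 1.0
    return window.succeeded / window.attempted


def meets_slo(window: DeploymentWindow, target: float = 0.95) -> bool:
    """Compare the ratio against a target; 0.95 is an assumed value."""
    return success_ratio(window) >= target


# Illustrative numbers only: attempts grow by 500, successes by just 380.
last_month = DeploymentWindow(attempted=1000, succeeded=970)
this_month = DeploymentWindow(attempted=1500, succeeded=1350)

for label, window in (("last month", last_month), ("this month", this_month)):
    print(f"{label}: ratio={success_ratio(window):.2%} "
          f"meets_slo={meets_slo(window)}")
```

In this made-up data, a chart of total deployments would still point up and to the right, yet the ratio drops from 97% to 90%, which is the kind of signal the SLO surfaced.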
This is where SLOs become valuable – shining a light on that which might otherwise go unnoticed.
SLOs can also show us where our technology is working better than it needs to, which means we’re probably overspending. Perhaps we could move that workload to a smaller machine or reduce the resources we are devoting to MTTR (mean time to recovery). Neither of these actions would negatively impact revenue, but our IT costs would go down. Money saved is an example of the direct connection between SLOs and business objectives.
Beyond Availability: SLOs Take You From Good to Better
The best thing about SLOs is that their value to your company goes WAY beyond website availability. SLOs serve as “leading indicators”—not just of something being broken or over-resourced, but also of how effective an experiment or change has been.
Let’s say our company has built an auto-complete function which suggests words or phrases as you type. The expectation is a certain percentage of suggestions are going to be useful and users are going to place them into their document.
We have an SLO for that! For example, we expect that when a user is presented with a suggestion, the user will accept it 50% of the time. We’ll run that SLO in our system and see how it bears out. Sure enough, 50% is the norm. Next, let’s try adjusting our algorithm, adding a new data set of words, or making some other modification. Perhaps, after pushing out this change, we see we’re only hitting 40% instead of 50%. That tells us something very interesting: nothing was necessarily broken in our infrastructure by making this change, but our autocomplete application experienced a significant regression in outcomes. We probably want to roll that change back.
Or, knowing there’s a connection between how often customers pick our auto-complete recommendations and their overall satisfaction, we may take it upon ourselves to improve the SLO here. We may shoot for a 60% success rate by making some changes to the software. We can watch how that plays out and, if we don’t hit that SLO, at least we have more information to factor into future decisions.
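As a rough sketch of the auto-complete example, the acceptance-rate check could look like the Python below. The 50% target and the 40% regression come from the example above. The function names and the suggestion counts are hypothetical, chosen only to reproduce those percentages.

```python
TARGET_ACCEPTANCE = 0.50  # the 50% objective from the example above


def acceptance_rate(shown: int, accepted: int) -> float:
    """Fraction of presented suggestions that users accepted."""
    if shown <= 0:
        raise ValueError("need at least one suggestion shown")
    return accepted / shown


def check_slo(shown: int, accepted: int,
              target: float = TARGET_ACCEPTANCE) -> dict:
    """Return the measured rate and whether it meets the target."""
    rate = acceptance_rate(shown, accepted)
    return {"rate": rate, "met": rate >= target}


# Hypothetical counts before and after an algorithm change.
before_change = check_slo(shown=20_000, accepted=10_000)
after_change = check_slo(shown=20_000, accepted=8_000)

print("before:", before_change)
print("after: ", after_change)

if before_change["met"] and not after_change["met"]:
    # Nothing is down, but outcomes regressed: a candidate for rollback.
    print("Acceptance SLO regressed after the change; consider rolling back.")
```

Raising the objective to 60% would only mean passing `target=0.60`. The measurement stays the same, and only the bar it is compared against moves.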
It’s important to realize SLOs are not just about monitoring availability and response time. In fact (despite the example I shared earlier) we often say the worst SLO is an availability SLO – not because it offers zero insight but because it doesn’t offer enough. Most SLOs can provide specific insight into customer happiness. Better SLO examples: Are your customers annoyed when they get suggestions for products they’ve already bought? Are the purchasing recommendations useful? Is what I’m doing with this algorithm making the customer happy?
SLOs Answer an Existential Question
Is the software you’re building fulfilling its purpose? Or is it just technically meeting its purpose without really serving any customers? SLOs provide the answer. The “value-add” and purpose of SLOs is to help us measure whether our software and infrastructure are providing the experience we want our users to have. The SLO framework helps engineering teams directly tie their efforts to business objectives and to customer happiness. Take it from a bunch of software engineers – having that kind of impact on customers and on business outcomes is exciting!
Bottom line: SLOs certainly aren’t the only thing determining your customers’ happiness, but SLOs are one of the most powerful tools at your disposal to take your company from good to better.