Author: Kit Merker
Much of the available literature about Site Reliability Engineering (SRE) and Service Level Objectives (SLOs) refers to pure, homogeneous infrastructure environments (think hyperscalers, including Google, where the concept of SRE originated). Row after row and data center after data center of similarly sourced and configured racks of gear, configured and operated the same way with the same infrastructure software.
One might assume, then, that SRE is only for the Google types, that it is a poor fit for “low scale,” non-hyperscale environments where real, “wild” world messiness of mixed, heterogeneous cloud infrastructures and dedicated, legacy systems rule. In short, one might assume that SRE isn’t a great fit for organizations that are not yet using it.
Nothing could be further from the truth.
SRE is a great fit for homogeneous environments, even those that are not hyperscale. Let’s walk through a “modest scale” homogeneous scenario. Suppose we’re running a video application in the public cloud and, to decrease latency, we deliver video to our customers from two regional data centers: one on the US East Coast and one on the West. If our East Coast data center goes down and we redirect the traffic to the West Coast, how do we measure our service to East Coast customers? If the metric we use to gauge customer satisfaction is a monitor on the performance of the East Coast servers, we have very little visibility into the actual East Coast user experience during this period of degraded service, which is exactly when customers are most likely to be upset.
What we really need to know are the answers to questions like “Are users experiencing videos that are playing smoothly, or are they stalling?” Or maybe, “Are videos taking too long to start for users who are in the East Coast region?” Another might be, “If East Coast users were already halfway through an hour-long episode when the outage hit, did the video stall out?” Another: “When the flip to the West Coast data center was made, did the video playback recover?”
Clearly, just taking the pulse of servers doesn’t tell you what you need to know about customer happiness. Even here, in this modest-scale homogeneous environment, we need a different style of metrics to assess user experience, and these metrics are our real SLOs.
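To make this concrete, here is a minimal sketch of how the questions above might be turned into a user-facing measurement. It groups playback sessions by where the viewer is, not by which data center served them, so a failover from East to West still shows up in the East Coast number. Everything here is hypothetical: the `PlaybackSession` fields, the 2-second startup threshold, and the 99% target are assumptions chosen for illustration, not values from a real system.

```python
from dataclasses import dataclass


@dataclass
class PlaybackSession:
    user_region: str        # where the viewer is, e.g. "us-east"
    served_from: str        # which data center served the stream
    startup_seconds: float  # time from pressing play to first frame
    stalled: bool           # True if playback stalled at any point


# Hypothetical thresholds, chosen only for this example.
MAX_STARTUP_SECONDS = 2.0
SLO_TARGET = 0.99


def is_good(session):
    """A session is 'good' if it started quickly and never stalled."""
    return session.startup_seconds <= MAX_STARTUP_SECONDS and not session.stalled


def good_session_ratio(sessions, user_region):
    """Fraction of good sessions for viewers in one region, or None if no data."""
    regional = [s for s in sessions if s.user_region == user_region]
    if not regional:
        return None
    return sum(1 for s in regional if is_good(s)) / len(regional)


def meets_slo(sessions, user_region):
    """True when the region's good-session ratio is at or above the target."""
    ratio = good_session_ratio(sessions, user_region)
    return ratio is not None and ratio >= SLO_TARGET


# Made-up sessions during a failover: East Coast viewers served from the West.
sessions = [
    PlaybackSession("us-east", "us-west", 1.2, False),
    PlaybackSession("us-east", "us-west", 1.8, False),
    PlaybackSession("us-east", "us-west", 3.5, True),   # slow start and a stall
    PlaybackSession("us-east", "us-west", 1.1, False),
    PlaybackSession("us-west", "us-west", 0.9, False),
]

print(good_session_ratio(sessions, "us-east"))  # 0.75
print(meets_slo(sessions, "us-east"))           # False
```

Note that the health of the East Coast servers never appears in the calculation. The measurement follows the viewer, which is the point of the scenario.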
Those SLOs help us determine where we might be over-provisioned, where we might be under-provisioned—where we might be paying too much for our cloud hosting and where we might need to add another region. And therein you see the beauty of SRE: Once we have those real SLOs, we have powerful decision-making tools for every type of change we might consider.
Now, let’s address SLOs in the “real world” of the enterprise—heterogeneous environments, where we likely find multiple and often quite different stacks. Perhaps in our real world we have some vSphere, some OpenStack, some bare metal, some AWS, and some z/OS on a mainframe (don’t laugh). How is SRE implementation here different?
In a situation where making changes to your infrastructure is a lot harder and more time consuming and expensive than just pasting a stanza of YAML, it’s all the more important that you evaluate those decisions carefully and that you have very relevant metrics to help you make those decisions. Fortunately, the best SLOs are not specific to Kubernetes or cloud or to any of the elements in your stack. Instead, the best SLOs look at what matters to the consumer of the service. The fundamental benefit of SLOs is that they give us a way to elevate our focus to what is critical—the consumer’s perception of results—rather than opportunistically monitoring too many subsystems and too many “vital signs” that are nothing more than proxies for health.
To illustrate this point, here’s an analogy. Let’s say you’re an elite runner, and you’d like a checkup prior to a marathon. So, you visit a well-respected, modern clinic for elite athletes. The doctors and nurses diligently run a battery of tests with the latest high-tech diagnostic tools. Finding no problems, they call in specialized experts who perform further diagnostics. And yet, after reviewing thousands of charts and statistics, they find nothing wrong. Finally someone thinks to ask, “How do you feel?” You reply, “I feel great. I only came in for a checkup.”
This tongue-in-cheek analogy is intended to make a simple point: health and happiness are not the same thing. Your underlying infrastructure (servers, networks, storage) might be healthy, and the consumers of your service can still be unhappy. If you want to make people happy, infrastructure stats alone aren’t going to get you there. Admittedly, monitoring these “vital signs” is both necessary and useful, but the first thing we should be focused on is how our services are performing and, by extension, how happy the consumers of those services are with that performance.
Understanding the experiences of service consumers is especially important when you add stress (e.g., put a system under heavy load) or make architectural changes to the system, because then the old proxies become less relevant. In our marathon example, your heart rate at rest is not the best indicator of how you’ll perform at mile thirteen. Under the stress and load of mid-marathon conditions, all your vital signs combined won’t reveal as much—or as quickly—about your performance as will simply asking “Are you feeling okay? Any pain? Feel strong enough to keep going?”
Are SLOs of any value to the organization in the absence of a full-scale SRE program? Absolutely, for three key reasons. First, whether or not you have a full-scale SRE program, SLOs will help you achieve better service reliability. Well-crafted SLOs that reflect what’s important to the customer will help you uncover ways to improve service reliability without going the traditional route of buying and operating gold-plated hardware, over-provisioning, or the other expensive methods we’ve used to boost availability in the past.
Secondly, SLOs will help your existing team scale. This is becoming more and more important in a world where the number of services your team is responsible for deploying, iterating, operating, and maintaining is growing. Your ops staff likely isn’t getting much bigger, so SLOs help you achieve greater service reliability with the same team size. In the days when the number of services was limited, a single operator could know the ins and outs of those services so well that they could “hear the engine” and diagnose health by, essentially, listening to the performance data. In an era of rapidly iterating microservices with a high number of interdependencies stretched across multiple infrastructure types and locations, asking operators to do this is folly. With SLOs, you monitor the system from the outside: how is the system performing in the eyes of its users? SLOs are akin to a “black box” that provides a baseline indication of “do I need to look at this thing today?” As a result, using SLOs allows small operations teams to scale, even if they’re not running a full-fledged SRE program.
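The “do I need to look at this thing today?” check can be sketched in a few lines. This is only an illustration under stated assumptions: the service names, event counts, and targets below are invented, and treating a service with no traffic as “fine” is a simplifying choice that a real team might well make differently.

```python
def needs_attention(good_events, total_events, target):
    """Return True when the observed good-event ratio falls below the target."""
    if total_events == 0:
        # Assumption for this sketch: no traffic means nothing to judge.
        return False
    return good_events / total_events < target


# Hypothetical daily counts: (good events, total events, SLO target).
services = {
    "checkout": (99_950, 100_000, 0.999),
    "search": (98_200, 100_000, 0.995),
    "video-start": (0, 0, 0.99),
}

# The short list an operator would actually look at today.
to_review = [
    name
    for name, (good, total, target) in services.items()
    if needs_attention(good, total, target)
]

print(to_review)  # ['search']
```

The operator does not need to know the internals of each service to produce this list, only how each one looks from the outside.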
And finally, using SLOs will help you prepare for the SRE transition, should you be thinking of moving in that direction eventually. But don’t wait for a team of site reliability engineers to appear: start using SLOs today. You’ll reap the benefits no matter what your infrastructure environment may look like.
Image source: https://commons.wikimedia.org/wiki/File:Android_sculpture_with_Noogler_beanie.jpg