As we discussed in the previous post, Site Reliability Engineering (SRE) is an operating model that helps your organization grow and innovate with velocity while maintaining infrastructure reliability for service levels that keep customers happy. (It’s like DevOps on steroids.) The benefits of SRE are many. The end result is customer satisfaction, and the bottom line is efficient revenue growth. Conceptually, it’s a no-brainer.
But to champion SRE, a CEO also needs to understand the decision from a dollars-and-cents viewpoint and be able to defend it to every stakeholder in the organization.
At its very core, SRE is a framework for thinking about ROI and risk, so you might say that the dollars-and-cents analysis you need is built right in, in the form of SLOs and error budgets.
In SRE, service level objectives (SLOs) are defined for many different aspects of an IT service. Each one marks the precise level of service that must be achieved to avoid an unacceptable risk of displeasing the customer. An error budget, then, is the complement of the SLO: it is the error rate we will tolerate for a given set of services, because we expect that error rate will not upset customers enough to warrant prevention. For example, a 99.9% SLO leaves a 0.1% error budget. (The beauty of SRE is that these service-level objectives* are not arbitrary; they are tied directly to business outcomes.)
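To make the SLO and error budget relationship concrete, here is a minimal Python sketch. The function names and the sample numbers are hypothetical, chosen only for illustration, and counting "events" (for example, requests) is an assumption rather than something the post prescribes.

```python
def error_budget(slo: float) -> float:
    """Tolerated failure fraction for an SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return 1.0 - slo


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_bad = error_budget(slo) * total_events
    if allowed_bad == 0:
        # A 100% SLO leaves no budget at all: any failure overspends it.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

In this made-up month, roughly three quarters of the budget is still available to spend on launches and other risky changes.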
Let’s use availability as an example. When we talk about how often your infrastructure is available (uptime), we typically speak in terms of “nines.” If your infrastructure achieves “four nines,” or 99.99% availability, it will be unavailable 52.6 minutes a year. If it achieves “five nines,” your system is up and working 99.999% of the time; that is, it’s down only 5.26 minutes a year. Note that once you have multiple overlapping services and redundant regions, you need to calculate uptime differently: as the proportion of customers served successfully. Measuring uptime only in minutes may overstate your actual reliability.
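The downtime figures above follow from simple arithmetic, sketched below in Python. The sketch assumes a 365.25-day year, and the second function is one hypothetical way to express "proportion served successfully," using request counts as a stand-in for customers.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes (assumes a 365.25-day year)


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime allowed by a time-based availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def served_availability_pct(successful: int, total: int) -> float:
    """Availability as the share of requests served successfully (assumed proxy)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

The request-based measure can diverge from the minute-based one: a short outage at peak traffic fails far more requests than the same outage at 3 a.m.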
Avoiding Gold-Plated Infrastructure
In an ideal world, we’d want our infrastructure to achieve as many nines as possible. However, moving from one class of nines to the next is roughly ten times more expensive, because each leap incurs significant people and infrastructure costs. And when you consider the inherent limitations of physics and the architecture of public networks, consistently delivering five nines of reliability becomes nearly impossible.
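A rough way to see the diminishing returns is to put the "ten times more expensive" rule of thumb next to the downtime each extra nine actually removes. The Python sketch below does that. The 10x multiplier is the article's approximation, not a measured cost model, and the two-nines baseline and function names are hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes a 365.25-day year


def downtime_minutes(nines: int) -> float:
    """Yearly downtime at a given number of nines (2 -> 99%, 3 -> 99.9%, ...)."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def relative_cost(nines: int, baseline_nines: int = 2) -> float:
    """Cost relative to the baseline, assuming roughly 10x per extra nine."""
    return 10.0 ** (nines - baseline_nines)


def minutes_gained(nines: int) -> float:
    """Yearly downtime avoided by moving from (nines - 1) to nines."""
    return downtime_minutes(nines - 1) - downtime_minutes(nines)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(
            f"{n} nines: ~{relative_cost(n):,.0f}x baseline cost, "
            f"{minutes_gained(n):,.1f} fewer minutes of downtime per year"
        )
```

Under these assumptions, each step costs ten times more while buying back ten times fewer minutes, which is why the right question is how many nines customers actually notice.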
So how many nines are good enough? At what point on the “nines class scale” do your customers become unhappy with their service, that is, at what point do they notice and complain, or even walk away? In this case, the SLO is the uptime goal, and the error budget is a small acceptable allowance for the system being down.
SLOs and error budgets keep your customers happy while balancing the competing interests of product stakeholders who want to rapidly launch new features/products and IT operators who want to maximize infrastructure uptime. Here are two examples:
You might be concerned about putting in an error budget system that you can’t overrule. Think about error budget as a fiat currency, and you (and upper management) are the central bank. You can always “print” more error budget, but do it too much and you will devalue the currency!
So, in terms of dollars and cents, here’s your cost/benefit equation:
Like everything else your exec team evaluates for organization-wide implementation, approach SRE from a cost/benefit perspective, and consider your risk profile. Properly implemented and operated, SRE frees your application teams to focus on delivering value to your customers faster, through new features and capabilities, within the risk-adjusted guardrails of SLOs that define what customers are willing to tolerate before bolting. At the same time, it gives your IT operations teams the freedom to make decisions about infrastructure management, unencumbered by the unachievable “never suffer an outage” standard, which accomplishes nothing more than lining the pockets of your service providers and frustrating the best talents of your product managers and application developers.
In short, think of SRE as Smart Resource Engineering: a data-informed approach to delivering what customers want, within the bounds of the imperfections they’re willing to accept. It’s an approach that makes dollars…and sense!

*SLOs are so important that we’ve built Nobl9 on that premise. Our business is about helping our customers to precisely quantify the experience of the user, then statistically and rationally translate that knowledge into wise tradeoffs and informed resource allocation decisions. If that sounds like something that solves a problem in your organization, there are more useful resources here in the Nobl9 blog.
Image Credit: Medienstürmer on Unsplash