If you appreciate the irony of TLA (a self-describing acronym for “three-letter acronym”), this blog is for you. Even if you find the “alphabet soup” of business unappetizing, read on anyway, because we guarantee you’ll find some meaty morsels of value in the following discussion.
Nearly everyone, whether in the C-suite or on the front line of an organization, knows about performance metrics. They allow individuals, teams, and business units to focus on achieving goals that are important to the organization’s mission. Many types of performance measures are commonly used, and we typically refer to them with TLAs. Perhaps you’ve heard of these:
- KPIs—Key Performance Indicators: quantifiable measures used to evaluate the success of an organization, employee, etc. in meeting objectives for performance.
- SLAs—Service Level Agreements: contracts defining the level of service a customer expects from a vendor. These agreements lay out the metrics by which the service is measured as well as the remedies or penalties incurred should the agreed-on service levels not be achieved.
- OKRs—Objectives & Key Results: a simple management framework that helps everyone in the organization see progress toward common goals. Short, inspirational objectives define where you want to go. (Companies typically create three to five high-level, ambitious objectives per quarter.) Key Results are the deliverables that you define for each objective so that you can measure your progress toward achieving that goal. (Each objective should have two to five measurable key results.)
Recently, Site Reliability Engineering (SRE)—the infrastructure and operations discipline popularized by Google—has introduced several new TLAs to our business conversations, including SLO and SLI. Fortunately, we can easily understand SLOs and SLIs by drawing some comparisons to the business performance measures we already know:
|The SRE term…|…is analogous to the business performance term…|
|---|---|
|SLO (Service Level Objective)|SLA. (An SLO is an SLA without contractual penalties.)|
|SLI (Service Level Indicator)|KPI. (An SLI is essentially a “service KPI.”)|
An SLO is like an SLA
You can think of an SLO as an SLA without contractual consequences. SLOs are usually more stringent than SLAs because SLAs don’t necessarily map well to customer expectations. An SLO, by contrast, expresses the service level of infrastructure/operations we need to achieve in order to keep our customers satisfied. Commonly, SLOs pertain to infrastructure service attributes such as availability, latency, data freshness, or degradation. For example, an SLO for availability might address whether a website or application is available when the end user wants to see or use it (e.g., “Customers using the application will find it available 99.95% of the time over a 1-month period”). The key to establishing an SLO is to “define the lowest level of reliability that you can get away with, and state that as your Service Level Objective.” To set the SLO any higher would be to waste money and time on providing an unnecessary degree of reliability.
Keep in mind that SLOs don’t carry contractual consequences, whereas SLAs do: if the agreed benchmarks are not met, penalties (usually financial) must be paid to the customer. Therefore, your SLAs should be less stringent than your SLOs (your internal service objectives). For example, if your SLO is an availability of 99.95% over 1 month, your SLA might be 99.9% over 1 month. That way, as long as you are meeting the internal SLO, you will necessarily meet your SLA and avoid penalties.
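To make the gap between those two numbers concrete, here is a minimal sketch that converts an availability target into the downtime it permits. It assumes "1 month" means a 30-day window, and the function and variable names are ours, chosen purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits.

    Assumption: the measurement window is a fixed number of days
    (30 here, standing in for "1 month").
    """
    window_minutes = window_days * MINUTES_PER_DAY
    return window_minutes * (1 - target_percent / 100)


# Targets taken from the example in the text.
slo_minutes = allowed_downtime_minutes(99.95)  # internal objective
sla_minutes = allowed_downtime_minutes(99.9)   # contractual commitment

# The difference is the cushion between missing the SLO and breaching the SLA.
margin_minutes = sla_minutes - slo_minutes

print(f"SLO allows about {slo_minutes:.1f} minutes of downtime")
print(f"SLA allows about {sla_minutes:.1f} minutes of downtime")
print(f"Cushion before penalties: about {margin_minutes:.1f} minutes")
```

Under these assumptions the SLO permits roughly 21.6 minutes of downtime per window and the SLA roughly 43.2, so a team that misses its internal objective still has about 21.6 minutes of cushion before any penalty applies.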
An SLI is like a KPI
When we measure the performance of a business, there are dozens of metrics we could use, but we typically focus on a few performance indicators that tell us at a glance how the company or a unit within the company is doing. These few metrics are selected because they best express what is truly essential to the company’s success. We call them Key Performance Indicators.
KPIs vary from one company and unit to the next, but some of the commonly used KPIs include profit, cost of goods sold (COGS), sales by region, annual recurring revenue (ARR), customer acquisition cost (CAC), employee turnover rate (ETR), net promoter score (NPS), and the granddaddy of them all, EBITDA (earnings before interest, taxes, depreciation, and amortization). Simple examples, but you get the point. Similarly, when we’re measuring the performance of our infrastructure, we want to focus on a few key indicators (service level indicators, or SLIs) that tell us the most about the user experience and show us where to draw the line when we need to make tradeoffs between operational improvement and pushing new features.
Another similarity between KPIs and SLIs is that both are helpful in aggregating key bits of information. Just as the KPI “profitability” takes into account revenues and expenses, SLIs can tell you about multiple subsystems through a single metric. In the profitability example, the indicator is a ratio. Ratios can be helpful in showing us the balance between conflicting pressures on the business, in this case the tension between two priorities: expenses that drive customer satisfaction and revenue that’s needed to run the business. Another way to think about SLIs is that they are KPIs focused on infrastructure services, or “service KPIs.”
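As a rough illustration of an SLI that rolls several subsystems into one number, the sketch below computes availability as the ratio of successful requests to total requests. Defining the SLI this way is a common convention rather than something this post prescribes, and the subsystem names and request counts are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return availability as a percentage of successful requests."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return 100.0 * good_requests / total_requests


# Hypothetical (good, total) request counts per subsystem for one window.
subsystem_counts = {
    "web": (99_990, 100_000),
    "api": (49_970, 50_000),
}

# Aggregate every subsystem into a single service-level indicator.
good = sum(g for g, _ in subsystem_counts.values())
total = sum(t for _, t in subsystem_counts.values())
sli_percent = availability_sli(good, total)

# Compare the indicator against the 99.95% objective used in the text.
SLO_TARGET_PERCENT = 99.95
meets_slo = sli_percent >= SLO_TARGET_PERCENT

print(f"SLI: {sli_percent:.3f}%  meets SLO: {meets_slo}")
```

Summing the raw counts before dividing weights each subsystem by its traffic. Averaging the per-subsystem percentages instead would give a quiet subsystem the same influence as a busy one, which may or may not match what your users experience.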
“What’s in a name? That which we call a rose | By any other name would smell as sweet”
If you’re sick of TLAs by now, that’s understandable. Even at Nobl9, our teams prefer to say simply “objectives” and “indicators,” and that works well for us. The names you choose do not really matter. What matters is applying the substance behind “SLOs” and “SLIs”; that is critical if you want to create highly reliable and scalable systems that simultaneously support the rapid development and launch of innovative services.