Author: Kit Merker
Congratulations! You are ready to sit down with your team and establish your first Service Level Objective (SLO).
You might be wondering where to start. Here’s an outline of how you could approach your first SLO-setting discussion with your developer and operations teams:
- Share a user story. Suppose you have an e-commerce user story that says the user expects to be able to add things to their cart and immediately check out. Your user has a certain latency threshold for checkout, and when checkout takes longer than that, your user gets upset and abandons their cart.
- Phrase this customer experience issue more precisely as an SLO. What proportion of users should be able to add items to their cart and check out within X amount of time?
- Identify and quantify the risks. What happens if a customer isn’t able to check out within that time frame? What does it cost when the SLO is missed?
- Brainstorm the risk categories together. What are the things that can go wrong that would cause us not to be able to meet the SLO? Your team will respond with a wide variety of risks, likely including “our underlying infrastructure might go down,” “maybe we pushed a buggy update,” “we didn’t anticipate so much demand all at once,” and more.
- Ask “how could we mitigate these risks?” Weigh the resources and cost required to mitigate each risk against the cost of failure: which risks will you address proactively, and which will you leave to chance? Use this information to determine the service level indicators (SLIs) you will use to measure and track your ability to meet the SLO.
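To make the second step concrete, here is a minimal Python sketch of how the checkout SLO could be checked against measured latencies. It is an illustration, not a prescribed implementation: the function name `evaluate_latency_slo`, the 500 ms threshold, the 90% target, and the sample latencies are all hypothetical. The "error budget" figure is a common companion calculation in SRE practice and is included here as an assumption about what your team may want to track.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    total: int               # number of checkout attempts measured
    good: int                # attempts that finished within the threshold
    sli: float               # observed proportion of good attempts
    target: float            # SLO target, e.g. 0.9 for 90%
    met: bool                # True when the observed SLI reaches the target
    budget_remaining: float  # fraction of the error budget left (negative if overspent)


def evaluate_latency_slo(latencies_ms, threshold_ms, target):
    """Compare measured checkout latencies against a latency SLO.

    latencies_ms: one latency measurement per checkout attempt (hypothetical data).
    threshold_ms: the "X amount of time" from the SLO statement.
    target: the proportion of attempts that should finish within the threshold.
    """
    if not latencies_ms:
        raise ValueError("need at least one checkout measurement")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    total = len(latencies_ms)
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    sli = good / total

    # The error budget is the number of slow checkouts the target tolerates.
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    budget_remaining = 1 - actual_bad / allowed_bad

    return SloReport(
        total=total,
        good=good,
        sli=sli,
        target=target,
        met=sli >= target,
        budget_remaining=budget_remaining,
    )


if __name__ == "__main__":
    # Hypothetical sample: ten checkout attempts, two of them slow.
    sample = [120, 340, 95, 2100, 410, 180, 260, 150, 3050, 220]
    report = evaluate_latency_slo(sample, threshold_ms=500, target=0.9)
    print(report)
```

In a real discussion, the threshold and target come out of the user story and the risk conversation, and the latencies come from whichever monitoring system supplies your SLIs.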
As you might imagine, this can be a fairly involved discussion, and all the stakeholders need to contribute their perspectives if they are to buy in to the result. It may take a while to reach agreement, but when you do, you will find that developers and operators share a much more genuine understanding of (a) what the customer wants, (b) what new product features will truly cost, including operational support and capacity, and (c) how to prioritize and make customer-centered decisions when tradeoffs are necessary.
We’d love to hear how your first SLO-setting discussion goes. Let us know on Twitter @nobl9inc.