Author: Alex Nauda
We in the SRE world often speak in generalities about “customer happiness” and how SLOs can help us find that ideal balance between software reliability and the velocity at which we release new features. To be sure, for many of our conversations, it’s expedient to refer to “the customer” as a homogeneous entity, as if we can meet a certain reliability goal and the “collective customer” will be happy. But in the real world, there’s no such thing as the homogeneous customer, and there’s no such thing as a one-size-fits-all SLO. We need to be able to account for customer heterogeneity in our reliability monitoring.
- Fact: Customers differ in where, when, and how they use your infrastructure.
- Fact: Customers differ in how important their workloads are for internal and external end-users.
- Fact: Customers differ in how important their business is to your company.
At Nobl9, our premise is: SLOs are good; SLOs for defined customer segments are better.
We have been thinking a lot about how best to account for diverse user populations when designing reliability monitoring. Here are three ideas that can help you take SLOs to the next level by segmenting user experiences:
1. Understand user experience at the individual user level.
Just because you have a 0.1% failure rate (i.e., 99.9% availability) doesn’t mean that exactly 0.1% of users are experiencing it; the failures may be spread thinly across everyone or concentrated on a handful of users. So how do you know which users experience that failure? Or, to take it a step further, how do you know what any given user is experiencing with your service at any given moment?
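To make the gap between an aggregate failure rate and individual experience concrete, here is a minimal Python sketch. The request log, the user IDs, and the `failure_rate_by_user` helper are all made up for illustration; a real system would read these events from its own telemetry.

```python
from collections import defaultdict

# Hypothetical request log: (user_id, succeeded). Values are invented
# so that the overall failure rate comes out at exactly 0.1%.
requests = (
    [("user-a", True)] * 4990
    + [("user-b", True)] * 4990
    + [("user-c", True)] * 10
    + [("user-c", False)] * 10
)


def failure_rate_by_user(events):
    """Return {user_id: fraction of that user's requests that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for user_id, succeeded in events:
        totals[user_id] += 1
        if not succeeded:
            failures[user_id] += 1
    return {user: failures[user] / totals[user] for user in totals}


# Aggregate view: 10 failures out of 10,000 requests.
overall = sum(1 for _, ok in requests if not ok) / len(requests)

# Per-user view: all of the failures land on a single user.
per_user = failure_rate_by_user(requests)

print(f"overall failure rate: {overall:.3%}")
for user, rate in sorted(per_user.items()):
    print(f"{user}: {rate:.1%}")
```

In this toy data set the service looks healthy in aggregate, yet one user sees half of their requests fail while the other two see no failures at all.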
Techniques are emerging that will help us measure experience at the user level and define what actually happens for a given user. For example, Google has published its approach to this problem, which it calls windowed user-uptime.
Another possible approach is using user IDs (or user emails or user names) to map customer relationships. Multiple user IDs may map to a single organization or a single user ID may map to multiple organizations. Either way, by understanding these connections, we are able to isolate key user experiences.
Here’s a simple illustration: let’s look at a video streaming app. If you know that 0.1% of video playbacks are buffering slowly, you could drill into that and see how a specific customer company is experiencing the delay. Or, further, you could potentially see how a project, team, or individual user is experiencing the service, which is an invaluable insight when that individual user is the CEO trying to show a video during a key investor presentation!
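The drill-down in the video streaming illustration could be sketched roughly as follows. This is only an assumption about how the data might be shaped: the `user_orgs` mapping, the email addresses, the playback events, and the `slow_playback_rate_by_org` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from user ID to organizations. A single user may
# belong to several organizations, and an organization has many users.
user_orgs = {
    "ceo@acme.example": ["acme"],
    "dev@acme.example": ["acme"],
    "consultant@freelance.example": ["acme", "globex"],
    "ops@globex.example": ["globex"],
}

# Hypothetical playback events: (user_id, buffered_slowly).
playbacks = [
    ("ceo@acme.example", True),
    ("ceo@acme.example", False),
    ("dev@acme.example", False),
    ("consultant@freelance.example", False),
    ("ops@globex.example", False),
    ("ops@globex.example", False),
]


def slow_playback_rate_by_org(events, mapping):
    """Return {org: fraction of playbacks that buffered slowly}."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for user_id, buffered_slowly in events:
        # Users missing from the mapping are grouped under "unknown".
        for org in mapping.get(user_id, ["unknown"]):
            totals[org] += 1
            slow[org] += int(buffered_slowly)
    return {org: slow[org] / totals[org] for org in totals}


by_org = slow_playback_rate_by_org(playbacks, user_orgs)
for org, rate in sorted(by_org.items()):
    print(f"{org}: {rate:.0%} of playbacks buffered slowly")
```

One design choice to note: a playback by a user who belongs to several organizations is counted toward each of them. Whether that is right depends on how your product attributes sessions to accounts.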
The point is, we need to explore ways to slice and dice our SLO metrics for unique users.
2. Incorporate business intelligence into an SLO-based “triage system” to assess and prioritize unique user needs.
The triage concept is familiar to the DevOps teams of most B2B SaaS organizations. (“Should we wake up an engineering director now?”) But at Nobl9, we have dreams of doing triage better. The reliability-focused triage system we envision would take into account customer segments and incorporate business intelligence to reveal the unique facets of user relationships.
Imagine how powerful it would be if the SRE team could “see” the value of individual users. For example, suppose User A appears to be simply testing a trial account, but in reality User A is also a huge potential customer that your sales team has been trying to close for two years. That critical information is probably reflected in Salesforce or some other CRM system, but it is totally unknown to the SRE team. To apply SRE in its most potent form, important business insights about individual customers need to be available to the team in real time.
The closest thing we have to that ideal triage system today is when companies segment their customers based on contract value and SLAs. The value of an existing account is a good place to start, but it shouldn’t be the only factor considered in the triage system.
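As a rough sketch of what such a triage system might compute, the Python below ranks accounts by combining a reliability signal with business data. Everything here is an assumption for illustration: the `Account` fields, the idea of pulling `open_pipeline_value` from a CRM, and the scoring formula itself, which is one simple possibility rather than a recommended model.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    annual_contract_value: float   # existing revenue, e.g. from billing data
    open_pipeline_value: float     # hypothetical CRM field: deal sales hopes to close
    error_budget_burn_rate: float  # 1.0 = budget consumed exactly at the sustainable pace


def triage_score(account: Account) -> float:
    """Weight how fast reliability is degrading by what the account is worth."""
    business_value = account.annual_contract_value + account.open_pipeline_value
    return business_value * account.error_budget_burn_rate


accounts = [
    Account("paying-customer", 50_000, 0, 2.0),
    Account("trial-user-a", 0, 500_000, 1.5),
    Account("small-healthy", 5_000, 0, 0.5),
]

# Highest score first: these are the accounts to look at before the others.
ranked = sorted(accounts, key=triage_score, reverse=True)
for account in ranked:
    print(f"{account.name}: {triage_score(account):,.0f}")
```

With contract value alone, the trial account would rank last. Once the pipeline value known to the sales team is included, it moves to the top of the list even though its burn rate is lower than the paying customer’s.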
3. Emerging technologies will enhance our ability to customize reliability metrics to unique user populations.
The idea of prioritizing workloads isn’t new. IT operators are accustomed to making workload adjustments to protect the interests of internal and external customers. For example, we can isolate specific accounts or tenants onto a hardware environment that is more performant and reliable than the general population. Or we can dynamically move customers from fast to slow to even slower lanes of traffic to accommodate workload types, such as a large batch upload that need not be serviced immediately, while preserving performance for other customers whose workloads are more time sensitive.
What’s new today is the idea of using SLOs to guide and automate these workload prioritization and infrastructure decisions, and that’s what Nobl9’s technology is all about.
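To illustrate the general idea (this is a hypothetical sketch, not a depiction of Nobl9’s product), the function below picks a traffic lane for a workload based on whether it is time sensitive and on how much error budget remains for the SLO that protects time-sensitive customers. The lane names, the `choose_lane` function, and the 25% threshold are all assumptions.

```python
from enum import Enum


class Lane(Enum):
    FAST = "fast"
    SLOW = "slow"
    SLOWEST = "slowest"


# Assumed threshold: below this fraction of remaining error budget,
# deferrable work is pushed further out of the way.
LOW_BUDGET_THRESHOLD = 0.25


def choose_lane(time_sensitive: bool, protected_budget_remaining: float) -> Lane:
    """Pick a lane for one workload.

    protected_budget_remaining is the fraction (0.0 to 1.0) of error budget
    left in the current window for the SLO covering time-sensitive customers.
    """
    if time_sensitive:
        return Lane.FAST
    if protected_budget_remaining < LOW_BUDGET_THRESHOLD:
        # The protected SLO is at risk, so deferrable work yields the most.
        return Lane.SLOWEST
    return Lane.SLOW


# A large batch upload that need not be serviced immediately:
print(choose_lane(time_sensitive=False, protected_budget_remaining=0.80).value)
print(choose_lane(time_sensitive=False, protected_budget_remaining=0.10).value)
# An interactive request from a customer with a time-sensitive workload:
print(choose_lane(time_sensitive=True, protected_budget_remaining=0.10).value)
```

The point of the sketch is only that the SLO, rather than a static rule, drives the decision: the same batch upload is treated differently depending on how much error budget is left.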
In the future, we foresee advancements in routing, load balancing, ingress software, service mesh, and other critical infrastructure services that will give us richer features to isolate, track, and proactively adjust service levels using SLOs as our guide. I think we’re just scratching the surface of what is possible.
So, let me ask you: Where do you stand today in your ability to identify your most valuable customers and isolate their unique experiences with your service? At a minimum, I hope I’ve challenged you to think more rigorously about how to segment your users with your own user and customer account data. I hope I’ve also convinced you to take the next steps toward applying SLOs to segments of your users. You can start by segmenting customers using financial data, then layer on the more subjective business intelligence you have, such as projected LTV. Next, pick a segment, put baseline SLOs in place, and begin turning knobs to fine-tune your reliability efforts.
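Those first steps might look something like the following minimal sketch, assuming you can join request counts with annual contract value (ACV) per customer. The customer names, the numbers, the segment thresholds, and the baseline targets are all invented; pick values that fit your own business.

```python
# Hypothetical per-customer data: annual contract value plus good and
# total request counts over the SLO window.
customers = {
    "acme": {"acv": 250_000, "good": 99_950, "total": 100_000},
    "globex": {"acv": 120_000, "good": 49_800, "total": 50_000},
    "initech": {"acv": 8_000, "good": 9_950, "total": 10_000},
}

# Assumed segments, ordered from highest to lowest minimum ACV:
# (segment name, minimum ACV, baseline SLO target)
SEGMENTS = [
    ("enterprise", 100_000, 0.999),
    ("standard", 0, 0.99),
]


def segment_for(acv):
    """Return the first segment whose minimum ACV the customer meets."""
    for name, min_acv, _target in SEGMENTS:
        if acv >= min_acv:
            return name
    raise ValueError(f"no segment for ACV {acv}")


def segment_report(customer_data):
    """Aggregate good/total counts per segment and compare to its target."""
    totals = {name: {"good": 0, "total": 0} for name, _, _ in SEGMENTS}
    for data in customer_data.values():
        segment = segment_for(data["acv"])
        totals[segment]["good"] += data["good"]
        totals[segment]["total"] += data["total"]

    report = {}
    for name, _min_acv, target in SEGMENTS:
        counts = totals[name]
        if counts["total"] == 0:
            continue  # no traffic from this segment in the window
        sli = counts["good"] / counts["total"]
        report[name] = {"sli": sli, "target": target, "meets_slo": sli >= target}
    return report


report = segment_report(customers)
for name, row in report.items():
    status = "OK" if row["meets_slo"] else "MISSING SLO"
    print(f"{name}: SLI {row['sli']:.4%} vs target {row['target']:.1%} -> {status}")
```

In this made-up data the standard segment comfortably meets its baseline while the enterprise segment misses its stricter one, which a single service-wide SLO would not reveal.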
Remember: SLOs are good; SLOs for defined customer segments are better. I can assure you that we’ll continue to explore this topic at Nobl9 and share our insights here in this blog on a regular basis.