Author: Brian Singer
If you are a software product manager, you probably spend a lot of time thinking about how to delight your customers with new features and high-quality services. After all, your objective is to attract and retain customers and generate new sources of revenue for the company.
But what about reliability? Reliability should be just as important to product managers because it impacts customer satisfaction just like new features and capabilities. Reliability is the key to making sure that all of those new and valuable services are actually available to customers when they need them.
This is why reliability should be a component of every product definition.
Consider Netflix. Netflix may have great content and an appealing user interface. However, if Netflix’s hot new shows and movies are taking longer to buffer than customers think they should, customer satisfaction with Netflix plummets, customers churn, and revenue eventually declines. That’s why at Netflix, reliability is everyone’s job.
So, as a product manager, how can you factor reliability into your assessment of customer satisfaction? How can you “have a seat at the reliability table,” especially when it is typically the domain of the IT operations side of the house rather than the software development side?
I want to share some ideas on how product managers can use techniques from Site Reliability Engineering (SRE) to deliver a consistently delightful end-user experience.
Let me start with a personal example. My last startup was a cloud billing service provider for resellers of cloud services. Typically, these resellers sold services from many cloud platforms, and we aggregated and sorted the billing to prepare customized billing statements for their customers—a pretty complex process. We were consuming large volumes of billing data from the resellers, and they wanted their billing data updated pretty quickly, as close to real time as possible.
I remember getting a phone call late at night from an angry customer because his company’s billing data hadn’t yet updated. At that time, the way we resolved the issue with the customer was to engage in what I later learned Google SRE calls “toil.” We went into super-responsive mode: manually logging in, checking the system, calling the engineering team, waking up a lot of people in the wee hours of the morning to get the data processed and refreshed.
When this type of crisis episode occurs, the squeaky wheel (or the blown tire!) becomes the priority. In many cases like this, you have to convince the development team to hold off on releasing new features until the burning issue is fixed. It’s no fun having to say, “We know it’s going to take us three months to resolve the issue, but the customer is angry now, and we have no other choice.” Even if you are able to patch the problem temporarily, the underlying problem remains.
Does that sound familiar?
When the company was acquired by Google, we suddenly had to adapt to Google’s ways of working and its production standards, which included an approach that was new to us: Site Reliability Engineering (SRE). One of the first questions we were asked was, “What SLO do you want to hit?”
And that changed everything.
When I learned what a service level objective (SLO) was, it immediately dawned on me that many of the challenges we had with prioritizing our product roadmap could have been solved if we had created good SLOs around data freshness: how often do we show a customer billing data that is more than 2 hours old? (See The Site Reliability Workbook, p. 269.) So we created those data freshness SLOs, and when we first applied them, our metrics were horrible!
That’s OK, though, because the SLOs provided tremendous clarity to the development team and tremendous insight to the product management team. It became obvious that any new features we wanted to build were pointless until the reliability concerns were addressed. Together, we were able to put our full focus on fixing the data freshness issue, and we dug ourselves out of that hole. We got to the point where we were continuously hitting our SLOs. And, sure enough, customer satisfaction improved tremendously. I don’t think I received a single customer complaint after a year of being at Google and using the SRE model, because we set our own internal expectations higher than our customers’ expectations and held ourselves to them.
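To make the data freshness SLO described above concrete, here is a minimal Python sketch of how such a measurement might be computed. The 2-hour threshold comes from the example above; the 99% target, the function names, and the shape of the input data are assumptions made for illustration, not a description of the system we actually built.

```python
from datetime import datetime, timedelta

# From the example above: billing data older than 2 hours counts as stale.
FRESHNESS_THRESHOLD = timedelta(hours=2)
# Assumed target for illustration only; the article does not state one.
SLO_TARGET = 0.99


def freshness_sli(views):
    """Return the fraction of views that showed fresh data.

    `views` is a hypothetical list of (viewed_at, data_updated_at) pairs,
    one per time a customer looked at their billing data.
    """
    if not views:
        return 1.0  # no views means nothing was served stale
    fresh = sum(
        1
        for viewed_at, data_updated_at in views
        if viewed_at - data_updated_at <= FRESHNESS_THRESHOLD
    )
    return fresh / len(views)


def meets_slo(views, target=SLO_TARGET):
    """True when the measured freshness is at or above the target."""
    return freshness_sli(views) >= target
```

A number like this, tracked over time, is what gives product and engineering a shared view of whether customers are seeing stale data more often than agreed.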
These quick points can help form your approach to SRE as a product manager:
Focus everyone’s attention on customer delight. Everyone in a company wants to delight the customer. Unfortunately, in most companies, tension arises between acquiring customers (developing new features that new customers are asking for) and retaining customers (meeting existing expectations around reliability). But at the end of the day, both are required for the company to succeed. A product manager can use the SRE philosophy to build a culture that balances these two competing interests.
Enhance customer empathy. It’s the product manager’s job to feel what customers feel, to understand what it takes to inspire their long-term happiness and to understand where customers draw the line between delight and disappointment. It’s also the product manager’s job to convey that knowledge to the software developers and IT operations teams supporting the product. SRE underscores the crucial role the product manager plays for the entire company in this regard.
Quantify customer happiness. Two familiar maxims apply: you can’t improve what you can’t measure, and what gets measured gets focused on. The SRE community shares best practices for quantifying customer happiness, such as using anchored scales to identify customer happiness thresholds (e.g., 1 sec = annoying, 3 sec = painful, 5 sec = brutal) and building an inverse happiness index with SLOs.
Make data-driven decisions. A basic tenet of SRE is that all stakeholders need to be involved in setting SLOs focused on company business objectives. In turn, SRE provides data for decision-making that is in alignment with company business objectives as well. Without SLOs, it is very hard to make prioritization decisions based on data. How do you decide how to allocate finite resources? Too often intuition, culture, and organizational momentum at a moment in time are used to make that decision. Decisions made on a hunch create situations where there are “local optimizations”—decisions that are right for one team or one executive but not necessarily the right “global” decision for the organization as a whole.
Understand tradeoffs. SRE provides a way to reconcile the tradeoffs between shipping new features and maintaining availability and reliability. Without SLOs, you have no real way to make sense of the tradeoff you’re making. Take our Netflix example from above. Suppose Netflix has an SLO that a customer can log in and play a video within 15 seconds. If a new feature will push that time beyond 15 seconds, causing you to miss the SLO, you can see that the new feature will actually cause customer happiness to decline. The ensuing compromise may be suboptimal for each team but optimal for the organization as a whole.
Prioritize the reliability roadmap. SRE creates common ground for product and operations teams, so that collaborative, data-driven decisions about roadmap priorities can be made and fully supported.
Take it to the next level. The best product teams at companies like Google, Facebook, Slack, and many others translate these SRE concepts into systems with feedback loops that become an integral part of the product development process and spur innovation at scale.
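The tradeoff point above, using the hypothetical 15-second login-to-play SLO, can be sketched in a few lines of Python. The 15-second threshold comes from that example; the 95% target, the function names, and the idea of modeling a feature as a fixed added delay are simplifying assumptions for illustration, not how Netflix or any real service evaluates features.

```python
# From the hypothetical example above: log in and play within 15 seconds.
THRESHOLD_SECONDS = 15.0
# Assumed target for illustration only: 95% of sessions within the threshold.
SLO_TARGET = 0.95


def fraction_within(latencies, threshold=THRESHOLD_SECONDS):
    """Return the fraction of sessions that started within the threshold."""
    if not latencies:
        return 1.0
    return sum(1 for seconds in latencies if seconds <= threshold) / len(latencies)


def feature_fits_slo(baseline_latencies, added_delay, target=SLO_TARGET):
    """Estimate whether the SLO would still be met after a new feature.

    Simplifying assumption: the feature adds the same delay to every session.
    """
    projected = [seconds + added_delay for seconds in baseline_latencies]
    return fraction_within(projected) >= target


# Usage: hypothetical measured start times, in seconds.
baseline = [8.0, 10.0, 12.0, 14.0]
print(feature_fits_slo(baseline, added_delay=0.5))  # every session stays under 15 s
print(feature_fits_slo(baseline, added_delay=2.0))  # the slowest session now exceeds 15 s
```

Even a rough projection like this turns "should we ship it?" into a question both product and operations can answer from the same data.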
The benefits of applying Google’s SRE approach to our cloud billing product were tremendous. That experience taught me how incredibly valuable SRE could be for product managers, for development teams, and for companies of all sizes beyond Google. That revelation provided part of the inspiration for Nobl9.
You can read more in our blog about how Nobl9 is making the power of the SRE approach accessible to everyone. If you are new to the concept of SRE, I want to encourage you to dig deeper—I guarantee it will be worth your time.