Author: Brian Singer
Product managers play a critical role in driving and supporting reliability objectives. This may seem daunting, but we can apply the same PM principles used during product development to building reliable products.
The SRE approach is the same. Just follow these five steps:
1. Listen to the customer. When it’s time to discuss what your product reliability targets should be, user stories are your most valuable asset. As the product manager, you are the keeper of the stories. Share what you’ve learned about what pleases customers and, conversely, what leaves them upset and disappointed. User stories can serve as the basis for SLOs (service level objectives), because they reveal what is most important to your customers and what makes them happy. Now you just need to quantify those “happy customer moments” and turn them into SLOs. Ideally, your SLOs should align with KPIs (key performance indicators) already used in the organization.
2. Start small and “go for the low-hanging fruit.” Be practical in selecting your first SLO; give your team an opportunity to tackle an obvious need and succeed. Here’s how Google describes it:
[…I]t’s important to differentiate aspirational goals of the product from minimum success criteria (or Minimum Viable Product). Projects can lose credibility and fail by promising too much, too soon; at the same time, if a product doesn’t promise a sufficiently rewarding outcome, it can be difficult to overcome the necessary activation energy to convince internal teams to try something new. Demonstrating steady, incremental progress via small releases raises user confidence in your team’s ability to deliver useful software.
(Site Reliability Engineering, Chapter 18)
So when it comes to your first foray into creating SLOs, don’t try to boil the ocean or make everything perfect at the outset. Instead, pick something that applies to most use cases and is highly visible to users. Availability is a good place to start; you might begin with the login experience, for example. Remember to be reasonable in setting your SLOs. Don’t ask for “four nines” (99.99%) on noncritical workloads. That just sets your team up for failure, and the whole point of Step 2 is to get a solid win.
Here’s a handy discussion guide to help your team define your first SLO.
3. Document. In order to get high-quality feedback, you will need a written representation of the SLOs. At a minimum, write out detailed and specific definitions and ask for feedback. Make sure that people understand what they mean and why they are important. If they aren’t clear to everyone, change them.
Everyone should have access to this SLO documentation, which serves as the “single source of truth.” Although this information could be in a shared document or a spreadsheet, ideally the definitions will be encoded in software, such as into the monitoring, logging, and alerting system, and stored in a central SLO repository with a friendly dashboard.
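As a rough illustration of what "encoded in software" could look like, here is a minimal Python sketch of an SLO definition kept as structured data. The class name, the field names, and the sample login SLO are all assumptions made for this example; they do not come from any particular monitoring product or SLO repository.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SLODefinition:
    """Hypothetical record for one SLO; every field name is illustrative."""

    name: str          # short identifier for the SLO
    user_story: str    # the customer moment this SLO protects
    sli: str           # plain-language description of how it is measured
    target: float      # objective as a fraction, e.g. 0.999 for 99.9%
    window_days: int   # length of the measurement window
    owner: str         # team accountable for the SLO

    def __post_init__(self) -> None:
        # Reject definitions that cannot be measured sensibly.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be a fraction between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


# Illustrative example only: a login availability SLO.
login_slo = SLODefinition(
    name="login-availability",
    user_story="As a returning user, I can log in whenever I need to.",
    sli="successful login requests / total login requests",
    target=0.999,
    window_days=30,
    owner="identity-team",
)

# Serializing to JSON makes the definition easy to review and share.
print(json.dumps(asdict(login_slo), indent=2))
```

Keeping the definition in one reviewable record like this means the written SLO and the one being measured cannot quietly drift apart.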
4. Test and benchmark. If you set a monthly goal of 99.9% availability and have a clear metric source for measuring it (that is, a service level indicator, or SLI), you now need to see how your service has historically performed against that goal. Get your hands on the data and see where you stand. Share this openly so everyone gets the benefit of the experience. The purpose of looking at historical data is simply to test achievability. If you don’t have access to historical data, set your SLO with your best instincts and move on. It’s more important to get going with measuring and tracking in real time.
5. Iterate and improve. Chances are, you’ll get the SLOs wrong initially, but that’s okay. Just as with software projects, you’ll iterate and improve. With time, you’ll hone the metrics to better match user expectations. One common pitfall is to set aspirational SLOs rather than achievable SLOs. For example, if you’ve never measured before, you may aspire to three or four nines for a particular SLO; then, when you see the real data, you may find that only two-and-a-half or three nines are achievable in the current state. Learn and adjust.
Ultimately, you will settle on a small number of SLOs that tell you a lot about service health and customer happiness. With better insight into how the system truly performs, your team may also be shocked by a good dose of reality that will inspire reliability improvements. All of this is good! Now that you have defined SLOs that everyone has seen and provided feedback on, and that have been tested with real data, you can start to truly reap the benefit.
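The benchmarking in Step 4 and the "nines" arithmetic in Step 5 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the monthly request counts are made-up numbers rather than real measurements, and availability is assumed to be measured as good requests divided by total requests over a 30-day window.

```python
from typing import List, Tuple

MINUTES_PER_DAY = 24 * 60


def availability(good_events: int, total_events: int) -> float:
    """SLI as the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full outage a target tolerates over the window."""
    return (1.0 - target) * window_days * MINUTES_PER_DAY


def benchmark(
    history: List[Tuple[str, int, int]], target: float
) -> List[Tuple[str, float, bool]]:
    """Compare each historical period's SLI with the proposed target."""
    results = []
    for label, good, total in history:
        sli = availability(good, total)
        results.append((label, sli, sli >= target))
    return results


# Made-up history: (period label, good requests, total requests).
history = [
    ("month 1", 998_700, 1_000_000),
    ("month 2", 999_500, 1_000_000),
    ("month 3", 999_200, 1_000_000),
]
TARGET = 0.999  # the 99.9% monthly goal used as the example in Step 4

for label, sli, met in benchmark(history, TARGET):
    print(f"{label}: {sli:.4%} ({'met' if met else 'missed'})")

# How much outage each level of "nines" leaves room for in 30 days.
for target in (0.99, 0.995, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} allows about {minutes:.1f} minutes of downtime")
```

If most historical periods miss the proposed target, that is a sign the SLO is aspirational rather than achievable, and a lower starting target may be the more practical choice.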
When you complete these five steps, you will have reached your first milestone toward minimum viable reliability: your first SLO. Congratulations! Now you have a launchpad for expanding your SRE initiatives to address more risks to customer happiness; to build product improvement plans that factor in reliability, capacity, and technical debt; and to refocus your people and your investment strategy on the things that will deliver the highest ROI in terms of customer experience.