Author: Alex Nauda
Originally published at The New Stack on September 14, 2020
Unless you’re operating an Astute-class submarine hundreds of meters beneath the surface of the sea, chances are good that your infrastructure and software development stack relies on third-party technologies. These technologies no doubt add great value and deliver significant cost efficiencies (saving time, money, and other resources), but they also expose your operations to reliability risk. For example, in seven years at my previous company, more than half of the outages we sustained were caused by… you guessed it… third-party outages.
In some cases, you may be lucky enough to have options to choose from for your third-party services, but you will typically pay steep premiums for the service that is more reliable. I once had a third-party vendor that was responsible for three huge outages and several smaller ones in a two-year period. At the time, their service cost was $4,000 a month. The more reliable alternative was charging $16,000 a month. Should we have paid $144,000 more a year for the more reliable service? Well, it depends. (See “Do You Really Need Five Nines?”)
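One way to make "it depends" concrete is a break-even calculation. The sketch below uses only the figures from the anecdote above, plus one loud assumption that the article does not make: that the pricier vendor would have avoided every one of the huge outages. The function name is hypothetical.

```python
def break_even_cost_per_outage(cheap_monthly: float, reliable_monthly: float,
                               outages_avoided_per_year: float) -> float:
    """Cost per outage above which the more reliable vendor pays for itself."""
    annual_premium = (reliable_monthly - cheap_monthly) * 12
    return annual_premium / outages_avoided_per_year


# Figures from the anecdote: $4,000 vs. $16,000 a month, and three huge
# outages in two years (1.5 per year). ASSUMPTION: the pricier vendor
# would have avoided all of them.
threshold = break_even_cost_per_outage(4_000, 16_000, outages_avoided_per_year=3 / 2)
print(f"Premium pays for itself if a major outage costs more than ${threshold:,.0f}")
```

Under that assumption the threshold works out to $96,000 per major outage. If an outage costs you less than that in lost revenue, credits and churn, the cheaper vendor may still be the right call.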
Here’s what I’ve learned: If my system is built upon other systems, I can achieve high reliability only if the underlying systems are even more reliable than my service needs to be.
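The arithmetic behind that lesson can be sketched in a few lines. This is a simplified model, not a measurement: it assumes failures are independent and that every dependency sits on the critical path with no fallback, and the availability figures are hypothetical.

```python
from math import prod
from typing import Sequence


def serial_availability(own: float, dependencies: Sequence[float]) -> float:
    """Estimated availability when every dependency is a hard dependency.

    ASSUMPTIONS: failures are independent, and any single dependency
    outage takes the whole service down (no cache, fallback or redundancy).
    """
    return own * prod(dependencies)


# Hypothetical numbers: our own code is 99.99% available, and we depend
# on two vendors at 99.9% and 99.95%.
combined = serial_availability(0.9999, [0.999, 0.9995])
print(f"Estimated end-to-end availability: {combined:.4%}")
```

With these numbers the combined figure lands at roughly 99.84%, below the weakest vendor. Under this model, a service can never be more available than its least reliable hard dependency.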
That’s why monitoring your external systems is just as important as monitoring your internal ones. The good news is there’s a way to do it without logging in to the third party’s admin console. You should be able to select a few metrics and know whether the third-party system is running in a healthy state. Service Level Objectives (SLOs) are the way to do this at scale.
Let’s discuss how to go about defining SLO-based roles and responsibilities with a third-party vendor. Here are my top tips:
- Start with your own SLOs. Know what’s important to your users and how your vendors may impact their experiences.
- Assess what your vendor knows about SLOs and reliability. You need a mutual understanding that reliability is important and expensive, and that it requires investment. Here at Nobl9, for example, we work with one party that is well established and provides higher reliability than we need. On the other hand, we also work with a small startup that may need our help progressing along the SLO learning curve. In either case, we have to work at our relationships and clearly define our expectations. We have to have an open conversation about reliability, about the data we need, and about our performance targets.
- Insist on using SLOs to monitor the reliability of vendor systems, just as you do for the systems owned by your internal departments. But observe these three “don’ts”: (1) don’t just go by the vendor’s “nines” or their published reliability metrics; (2) don’t just monitor “Are they up?”; and (3) don’t let Service Level Agreements (SLAs) in sales contracts lull you into a false sense of security. Keep in mind that the vendor’s maximum financial risk when it fails to meet an SLA is the contract value. Your maximum financial risk from blowing an error budget and losing a customer is typically far greater than that.
- Use your own reliability metrics to make proactive decisions about mitigating risk. If you truly need Four Nines service as part of your SLO for your customer, and that Four Nines depends on a third party, you need a proactive discussion with the vendor about your risks, failure modes and recovery options. Don’t wait for those to come out after a crisis. Use your SLOs to inform that discussion with standardized data that speaks to the needs of your customers in your business.
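To make the tips above concrete, here is a minimal sketch of an SLO defined on a vendor and measured from your side of the integration. The class, field names, vendor name and event counts are all hypothetical. It does not represent any particular product's API, and it assumes a target below 100%.

```python
from dataclasses import dataclass


@dataclass
class VendorSLO:
    """An SLO we define on a vendor, measured with our own data."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of calls must be "good"
    good_events: int   # calls that met our own success and latency criteria
    total_events: int  # all calls to the vendor in the SLO window

    @property
    def sli(self) -> float:
        """Observed fraction of good events (1.0 if there was no traffic)."""
        return self.good_events / self.total_events if self.total_events else 1.0

    @property
    def allowed_bad_events(self) -> float:
        """The error budget, expressed as a count of bad events."""
        return (1 - self.target) * self.total_events

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means it is blown."""
        if self.total_events == 0:
            return 1.0
        bad = self.total_events - self.good_events
        return 1 - bad / self.allowed_bad_events


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime a time-based availability target permits over the window."""
    return (1 - target) * window_days * 24 * 60


# Hypothetical vendor and counts, for illustration only.
payments = VendorSLO("payments-api", target=0.999,
                     good_events=999_400, total_events=1_000_000)
print(f"SLI: {payments.sli:.4%}, budget remaining: {payments.budget_remaining:.0%}")
print(f"Four nines over 30 days allows {allowed_downtime_minutes(0.9999):.2f} minutes of downtime")
```

In this made-up example the vendor has used about 60% of its error budget. The downtime helper shows why Four Nines deserves a conversation up front: over a 30-day window it leaves only about 4.3 minutes of total downtime.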
I used this approach successfully at a previous company. We had a vendor on a critical path of our service who started having reliability issues, which showed up as recurring user-visible outages and lagging performance over a period of a few months. We looked at our SLO and at how their issues were eating up the error budget. We ended up creating a relationship with a second vendor so that each could cover for the other’s limitations, which gave us the redundancy and resiliency to keep our SLOs on target.
If you have SLOs on external vendors and you see the error budget is about to be blown, you can pull the ripcord and implement a back-up plan. Then have a conversation about improving going forward.
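One way to decide when the budget is "about to be blown" is to project how long it will last at the current burn rate. The sketch below is one possible approach, not a prescribed method. The function names, the 30-day window and the 24-hour threshold are assumptions chosen for illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means a full budget would last exactly one SLO window at this pace.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1 - slo_target)


def hours_until_budget_exhausted(budget_remaining: float, rate: float,
                                 window_hours: float = 30 * 24) -> float:
    """Project how long the remaining budget lasts at the current burn rate."""
    if budget_remaining <= 0:
        return 0.0
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


def should_pull_ripcord(budget_remaining: float, rate: float,
                        min_hours_left: float = 24.0) -> bool:
    """True when the budget is projected to run out within min_hours_left."""
    return hours_until_budget_exhausted(budget_remaining, rate) < min_hours_left


# Hypothetical last-hour traffic to the vendor: 60 bad calls out of 10,000
# against a 99.9% target, with 10% of the 30-day budget still unspent.
rate = burn_rate(bad_events=60, total_events=10_000, slo_target=0.999)
if should_pull_ripcord(budget_remaining=0.10, rate=rate):
    print("Error budget nearly gone: switch to the backup vendor")
```

Here the vendor is burning budget about six times faster than the SLO allows, so the remaining 10% would last roughly 12 hours. That is inside the 24-hour threshold, so the check says to fail over first and talk to the vendor afterwards.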
In summary, I’ll admit to you that I think SLOs are magical in their ability to establish shared vision and collaboration not only between internal departments but with third-party service providers as well. SLOs give you the common language you need to have blame-free conversations and get systems where they need to be to keep customers happy.