Originally published at The New Stack on September 14, 2020
Unless you’re operating an Astute-class submarine hundreds of meters beneath the surface of the sea, chances are good that your infrastructure and software development stack relies on third-party technologies. These technologies no doubt add great value to your stack and deliver significant efficiencies (saving time, money, and other resources), but they also expose your operations to reliability risk. For example, in seven years at my previous company, more than half of the outages we sustained were caused by… you guessed it… third-party outages.
In some cases, you may be lucky enough to have options to choose from for your third-party services, but you will typically pay steep premiums for the service that is more reliable. I once had a third-party vendor that was responsible for three huge outages and several smaller ones in a two-year period. At the time, their service cost was $4,000 a month. The more reliable alternative was charging $16,000 a month. Should we have paid $144,000 more a year for the more reliable service? Well, it depends. (See “Do You Really Need Five Nines?”)
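To make "it depends" a little more concrete, here is a rough break-even sketch in Python. The two monthly prices come from the story above. The cost of an hour of downtime is a purely hypothetical input that you would need to estimate for your own business, and the function names are mine, for illustration only.

```python
def annual_premium(cheap_monthly: float, reliable_monthly: float) -> float:
    """Extra yearly spend for choosing the more reliable vendor."""
    return (reliable_monthly - cheap_monthly) * 12


def break_even_outage_hours(premium_per_year: float, cost_per_outage_hour: float) -> float:
    """Outage hours per year the pricier vendor must prevent to pay for itself."""
    if cost_per_outage_hour <= 0:
        raise ValueError("cost_per_outage_hour must be positive")
    return premium_per_year / cost_per_outage_hour


premium = annual_premium(4_000, 16_000)  # prices from the example above
# HYPOTHETICAL assumption: one hour of outage costs the business $20,000.
hours = break_even_outage_hours(premium, 20_000)
print(f"Premium: ${premium:,.0f}/year, break-even at {hours:.1f} outage hours avoided")
```

Under that assumed downtime cost, the more expensive vendor pays for itself only if it prevents a bit more than seven hours of outage a year. Change the assumption and the answer changes with it, which is the whole point.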
Here’s what I’ve learned: If my system is built upon other systems, I can achieve high reliability only if the underlying systems are even more reliable than my service needs to be.
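One way to see why is to multiply availabilities together. The sketch below assumes that every dependency sits on the critical path and that failures are independent, which is a simplification. The availability figures are hypothetical and not taken from any real vendor.

```python
from math import prod


def composite_availability(own: float, dependencies: list[float]) -> float:
    """Best-case availability when every dependency is on the critical path.

    Assumes failures are independent and that any single dependency failing
    takes the service down, so the individual availabilities multiply.
    """
    return own * prod(dependencies)


# HYPOTHETICAL numbers: our own code is 99.95% available and relies on
# two vendors that are each available 99.9% of the time.
result = composite_availability(0.9995, [0.999, 0.999])
print(f"{result:.4%}")
```

With these assumed inputs the combined figure lands at roughly 99.75%, which is below what either vendor delivers alone. A 99.9% target for the overall service would be out of reach no matter how good our own code is.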
That’s why monitoring your external systems is just as important as monitoring your internal ones. The good news is that there’s a way to do it without logging in to the third party’s admin console. You should be able to select a few metrics and know whether the third-party system is running in a healthy state. Service Level Objectives (SLOs) are the way to do this at scale.
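As a minimal sketch of what "a few metrics" could look like, suppose you record the outcome and latency of each call your service makes to the vendor. The `ProbeResult` type, the 99% target, and the 500 ms threshold below are all hypothetical choices for illustration. They are not part of any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool           # did the call to the vendor succeed?
    latency_ms: float  # response time as measured from our side


def error_budget_remaining(probes, slo_target, max_latency_ms):
    """Return the fraction of the error budget left (negative if overspent).

    A probe is "good" when it succeeded within max_latency_ms.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if not probes:
        return 1.0  # no data yet, so nothing has been spent
    bad = sum(1 for p in probes if not (p.ok and p.latency_ms <= max_latency_ms))
    allowed_bad = (1 - slo_target) * len(probes)
    return 1 - bad / allowed_bad


# HYPOTHETICAL window: 1,000 probes, of which 3 failed and 1 was too slow.
probes = (
    [ProbeResult(ok=True, latency_ms=120.0)] * 996
    + [ProbeResult(ok=False, latency_ms=0.0)] * 3
    + [ProbeResult(ok=True, latency_ms=2500.0)]
)
remaining = error_budget_remaining(probes, slo_target=0.99, max_latency_ms=500.0)
print(f"Error budget remaining: {remaining:.0%}")
```

A 99% target over 1,000 probes allows 10 bad ones. Four have been used here, so about 60% of the budget is left, and nobody had to open the vendor's console to find that out.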
Let’s discuss how to go about defining SLO-based roles and responsibilities with a third-party vendor. Here are my top tips:
I used this approach successfully at a previous company. We had a vendor in the critical path of our service that started having reliability issues, which showed up as user-visible outages and lagging performance on a recurring basis over a period of a few months. We looked at our SLO and at how the vendor’s issues were eating up the error budget. We ended up bringing on a second vendor so that each could cover for the other’s limitations, which gave us the redundancy and resiliency to keep our SLOs on target.
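The arithmetic behind adding a second vendor can be sketched as follows. It assumes the vendors fail independently and that failover between them is instant and flawless, which real systems only approximate. The 99% figures are hypothetical.

```python
def redundant_availability(vendors: list[float]) -> float:
    """Availability of interchangeable vendors used as backups for each other.

    Assumes independent failures and perfect failover: the service is down
    only when every vendor is down at the same time.
    """
    all_down = 1.0
    for availability in vendors:
        all_down *= 1 - availability
    return 1 - all_down


# HYPOTHETICAL numbers: two vendors, each available 99% of the time.
combined = redundant_availability([0.99, 0.99])
print(f"{combined:.2%}")
```

Under those assumptions, two 99% vendors combine to roughly 99.99%. In practice correlated failures and imperfect failover will eat into that number, so treat it as an upper bound.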
If you have SLOs on external vendors and you see that the error budget is about to be blown, you can pull the ripcord and implement a backup plan. Then have a conversation with the vendor about how to improve going forward.
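One common way to decide when the budget is "about to be blown" is to watch the burn rate, meaning how fast the budget is being spent relative to the pace the SLO allows. The sketch below is one possible shape for that check. The threshold of 10 and the function names are hypothetical, and a real setup would usually look at more than one time window.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it will be
    gone in half the window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


def should_fail_over(bad_events: int, total_events: int, slo_target: float,
                     threshold: float = 10.0) -> bool:
    """Pull the ripcord when the budget is burning faster than the threshold.

    HYPOTHETICAL default: 10x is an illustrative value, not a recommendation.
    """
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# HYPOTHETICAL last hour: 50 bad calls out of 1,000 against a 99.9% SLO.
if should_fail_over(50, 1000, slo_target=0.999):
    print("Burning budget too fast: switch to the backup vendor")
```

Here the observed error rate is 5% against an allowance of 0.1%, so the budget is burning about fifty times faster than planned and the check says to switch.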
In summary, I’ll admit to you that I think SLOs are magical in their ability to establish shared vision and collaboration not only between internal departments but with third-party service providers as well. SLOs give you the common language you need to have blame-free conversations and get systems where they need to be to keep customers happy.