
Could Service Level Objectives Have Prevented AT&T's Outage?

By Daniel Ruby

Reliability is part of your product, and yesterday AT&T discovered just how devastating a breakdown in reliability can be. For nearly six hours, the telecommunications giant suffered a massive outage in its cell network, affecting more than 1.7 million customers and disrupting 911 services. As rumors swirled about potential cyberattacks, the company announced that the cause of the outage was a software update - described by a spokesperson as "the application and execution of an incorrect process used as we were expanding our network."

An incident of this magnitude is every SRE's nightmare. Beyond site reliability, though, it is also the nightmare of every CEO, CTO, VP of Marketing, PR representative, and Director of Finance. It doesn't really matter where in the business you stand: a severe breakdown in reliability will leave you scrambling to either get or share answers.

So how do you prevent this? Any reliability professional will see the words "application and execution of an incorrect process" and groan. AT&T's cellular offering is composed of countless internal products and systems, all working together, all contributing to (or detracting from) the overarching reliability of the offering itself. I'm certainly not claiming to know AT&T's reliability stack, but an offering-level dashboard of service level objectives, pulling from the various reliability and observability platforms used across different parts of the organization, could quite possibly have prevented this outage from occurring.

If nothing else, seeing error budgets in real time and historically across a customer offering, and being able to drill down into any systems whose errors spiked in the lead-up to an outage (along with any annotations attached to those spikes), can speed up root-cause analysis and bring services back to customers in minutes rather than hours.
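To make the error-budget idea concrete, here is a minimal sketch in Python of how remaining budget could be computed per system and then ranked for drill-down. Everything in it is an assumption for illustration: the service names, request counts, and the 99.9% target are hypothetical, and the helpers `budget_remaining` and `rank_by_burn` are not part of AT&T's stack or any Nobl9 API.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over an SLO window (hypothetical data)."""
    name: str
    total: int  # all requests observed in the window
    bad: int    # requests that violated the service level indicator


def budget_remaining(window: ServiceWindow, target: float) -> float:
    """Fraction of the error budget left; a negative value means it is overspent."""
    allowed_bad = (1.0 - target) * window.total
    if allowed_bad == 0:
        # No traffic (or a 100% target): nothing burned unless errors occurred.
        return 1.0 if window.bad == 0 else float("-inf")
    return 1.0 - window.bad / allowed_bad


def rank_by_burn(windows, target):
    """Order services so the one that has burned the most budget comes first."""
    return sorted(windows, key=lambda w: budget_remaining(w, target))


if __name__ == "__main__":
    # Hypothetical systems behind a single customer-facing offering.
    windows = [
        ServiceWindow("provisioning", total=1_000_000, bad=400),
        ServiceWindow("call-routing", total=2_000_000, bad=5_000),
        ServiceWindow("billing", total=500_000, bad=100),
    ]
    for w in rank_by_burn(windows, target=0.999):
        print(f"{w.name}: {budget_remaining(w, 0.999):.0%} of error budget remaining")
```

In this made-up data, the hypothetical "call-routing" system has overspent its budget and sorts to the top of the list, which is the kind of signal an offering-level view is meant to surface before or during an incident.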

Reliability has real-world customer impacts, and needs to be understood across an organization. Too often it's viewed as a cost center - how much do we pay for this, and how can we pay less without it breaking? In situations like this - and I am by no means suggesting AT&T has this perspective - the customer experience suffers as SREs and their managers work to justify their budget to executives.

As seen with AT&T's outage, reliability must be viewed as part of a company's product. That does not mean reliability examined in disconnected segments, but the overarching reliability of what a customer interacts with, with disparate metrics normalized into an ongoing view of how a company's product is performing.

The best way to do this is via Service Level Objectives, and the most impactful way to implement SLOs across your organization is with Nobl9.

