Author: Alex Hidalgo
When we talk about service level objectives we often focus on things like the latency or error rates of web-facing APIs. This makes sense because they’re relatively easy to measure and do represent things that are important to our users. At the end of the day we want to ensure that our computer services respond correctly and respond in a timely manner, so these sorts of measurements are a great way to think about how reliable we’re being for our users and customers.
But service levels aren’t just for computers. Before I entered the tech industry I worked in a ton of different roles and capacities, and one of the reasons service level objectives speak to me so much is that as I learned about them I realized that I had always been using them at every job I worked.
It turns out that humans in general are understanding of occasional failures or mistakes and that they’re actually very resilient to them. Nothing is ever perfect all of the time, and most people understand this – no matter what the situation actually is.
For example, I worked at a coffee shop for a while. It was a commuter stop and I worked the opening shift. This meant I had to get there at 05:00 to open by 06:00. It was not at all uncommon for us to end up with a line of 30-40 people stretching the length of the entire store at peak times. I generally worked on the espresso bar, and we’d call down the line for orders so we could get started on the drinks before customers even reached the register. We had a system for taking orders and documenting the customizations people wanted for their drinks. But we also knew we’d make a mistake every once in a while. Either we’d write down the order slightly wrong, or we’d just make a mistake while trying to produce and serve multiple orders per minute.
In fact, we even had a goal as a store. While we were at our busiest, our target was to have no more than 1 in 20 drinks returned for being wrong or otherwise unsatisfactory to the customer. We actually kept track of this number over the course of our shift to figure out how we’d performed that day.
And that’s a service level objective! As a coffee shop we had decided upon a 95% SLO target for customers being happy with their drink! We generally exceeded this target by a wide margin, but it was a reasonable one to track in terms of making decisions about staffing, who was working what station, how we stocked supplies behind the bar, and more.
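To make the arithmetic concrete, here is a minimal Python sketch of how a shift tally could be checked against that 1-in-20 target. The `ShiftTally` type, the function names, and the drink counts are all hypothetical, made up for illustration; the only number taken from the story is the 1-in-20 (95%) target.

```python
from dataclasses import dataclass


@dataclass
class ShiftTally:
    """Hypothetical end-of-shift counts; not real data from the shop."""
    drinks_served: int
    drinks_returned: int


def satisfaction_rate(tally: ShiftTally) -> float:
    """Fraction of drinks that were not returned (the measured service level)."""
    if tally.drinks_served <= 0:
        raise ValueError("need at least one drink served")
    return (tally.drinks_served - tally.drinks_returned) / tally.drinks_served


def allowed_returns(tally: ShiftTally, one_in: int = 20) -> int:
    """How many returns the shift could absorb and still meet a 1-in-N target."""
    return tally.drinks_served // one_in


def met_target(tally: ShiftTally, one_in: int = 20) -> bool:
    """True when at most 1 in `one_in` drinks came back (95% when one_in is 20)."""
    # Integer comparison avoids floating-point rounding at the boundary.
    return tally.drinks_returned * one_in <= tally.drinks_served


busy_morning = ShiftTally(drinks_served=240, drinks_returned=7)
rough_morning = ShiftTally(drinks_served=100, drinks_returned=6)

print(f"{satisfaction_rate(busy_morning):.1%}", met_target(busy_morning))
print(f"{satisfaction_rate(rough_morning):.1%}", met_target(rough_morning))
```

With these made-up numbers, the first shift had room for 12 returns and used 7, so it met the target; the second had room for 5 and used 6, so it missed. The gap between allowed and actual returns is the kind of signal that could inform staffing or station decisions.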
When I worked at a dive bar we didn’t have any formalized SLOs, but I did have my own. My goals were to greet every customer within a minute and to have a drink in front of them within two minutes. I didn’t actually measure any of this with a stopwatch or anything, but it was something I constantly kept in mind while I was working. These goals were easy to reach on most weeknights, but on busy weekend evenings they could sometimes be difficult to achieve.
I knew that if I hit my goals often enough – even on busy nights – I’d get tipped well, would have a good night, and that most of those customers would return. But I also knew that if it was particularly busy I just wouldn’t be able to actually greet every customer in time or deliver them their drink within that time window. However, it turns out that was just fine! I didn’t have an actual percentage in mind while I looked towards these targets, but this was still a service level objective. It was literally an objective for the level of service I was trying to achieve for my customers.
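Nothing at the bar was ever timed, but if someone had held a stopwatch, the same idea could be sketched in a few lines of Python. The timings below are invented for illustration, and the helper name `fraction_within` is hypothetical; only the one-minute and two-minute thresholds come from the story.

```python
GREET_TARGET_S = 60    # greet every customer within a minute
SERVE_TARGET_S = 120   # drink in front of them within two minutes


def fraction_within(durations_s: list[float], threshold_s: float) -> float:
    """Share of customers whose wait was at or under the threshold."""
    if not durations_s:
        raise ValueError("need at least one observation")
    on_time = sum(1 for d in durations_s if d <= threshold_s)
    return on_time / len(durations_s)


# Invented waits (in seconds) for five customers on a busy night.
greet_waits = [20, 45, 30, 90, 50]
serve_waits = [80, 110, 95, 200, 130]

greeted_on_time = fraction_within(greet_waits, GREET_TARGET_S)
served_on_time = fraction_within(serve_waits, SERVE_TARGET_S)

print(f"greeted within a minute: {greeted_on_time:.0%}")
print(f"served within two minutes: {served_on_time:.0%}")
```

Whether those percentages are good enough depends on the target you pick. The story never names one, so the sketch only reports the measured level and leaves the objective open.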
The point is that service level objectives are actually a very natural way of thinking about things. They’re not in any way unique to computer services. Almost everyone who provides a service to someone else sets reasonable targets for themselves. It doesn’t matter what industry you’re in or what the service looks like. Occasionally failing, making a mistake, or missing a target is a thing humans are cool with. And that’s why it makes so much sense to apply this same sensibility to our computer services.
No one needs anyone else to be perfect. Embrace occasional failure and make sure you’re aiming to be the right amount of reliable for your customers and your business while allowing your teams to make mistakes and your services to sometimes fail. After all, it’s inevitable that this will happen.