How to Tailor Your Error Budget Calculation Method to Your Business Case

One of Nobl9’s benefits is that it allows you to monitor your error budget. This gives you full visibility into how reliable your product is and shows you when your customers’ satisfaction might be affected.

Have you ever found yourself wondering how to set up your error budget effectively, so it will indicate when the customer experience begins to suffer? Or which error budget calculation method will best reflect your business case? This article will shed some light on those questions.

Nobl9 offers two error budget calculation methods to choose from: Time Slices and Occurrences.

With the Time Slices method, we count good minutes (minutes in which the system operates within defined boundaries) and compare that count to the total number of minutes in the window.

With the Occurrences method, we count good attempts (for example, requests that are within defined boundaries) against the count of all attempts (i.e., all requests, including requests that perform outside of the defined boundaries). 
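To make the two definitions concrete, here is a minimal Python sketch of both ratios. It is an illustration only, not Nobl9's implementation: the function names and the sample counts are hypothetical, and treating an empty window as fully reliable is an assumption.

```python
from typing import Sequence


def occurrences_reliability(good: Sequence[int], total: Sequence[int]) -> float:
    """Good attempts divided by all attempts, summed over the whole window."""
    all_attempts = sum(total)
    if all_attempts == 0:
        return 1.0  # assumption: a window with no traffic has no failures
    return sum(good) / all_attempts


def time_slices_reliability(minute_is_good: Sequence[bool]) -> float:
    """Good minutes divided by all minutes in the window."""
    if not minute_is_good:
        return 1.0  # assumption: an empty window counts as fully reliable
    return sum(minute_is_good) / len(minute_is_good)


# Hypothetical three-minute window: per-minute good and total request counts.
good = [100, 5, 95]
total = [100, 10, 100]
print(occurrences_reliability(good, total))          # 200 good out of 210 attempts
print(time_slices_reliability([True, False, True]))  # 2 good minutes out of 3
```

Note that the quiet second minute barely moves the Occurrences ratio but costs a full third of the Time Slices ratio.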

So which should you choose? Let’s walk through an example scenario. Suppose you’re looking at traffic on a website that fluctuates throughout the day. During peak load, the service’s performance deteriorates and some requests fail. There’s also a small performance hiccup during the release due to startup costs, although maintenance is usually planned during low-traffic hours.

Let’s assume that at first you decide to go with Time Slices as your error budget calculation method. The disadvantage in this scenario is that every bad minute counts the same. A bad minute during a low-traffic period (say, the release downtime mentioned above, which is scheduled for the middle of the night and probably won’t even be noticed by users) has the same effect on your SLO as a bad minute caused by the platform being overloaded with traffic, which many users are likely to notice.

The Occurrences budgeting method is better suited to this situation. Because fewer attempts are made during low-traffic periods, a failure at those times consumes less of the budget. The method is straightforward and weights impact by the number of requests served, so it gives a more accurate reflection of when your customers’ experience is actually affected.
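The difference can be sketched with some rough arithmetic. The numbers below are hypothetical (a one-day window, a 99% target, one million requests per day, a night-time minute with 20 failed requests and a peak minute with 2,000), and the helper names are made up for illustration.

```python
def budget_consumed_time_slices(bad_minutes: int, total_minutes: int, target: float) -> float:
    """Fraction of the error budget used when the budget is counted in minutes."""
    allowed_bad_minutes = (1 - target) * total_minutes
    return bad_minutes / allowed_bad_minutes


def budget_consumed_occurrences(bad_requests: int, total_requests: int, target: float) -> float:
    """Fraction of the error budget used when the budget is counted in requests."""
    allowed_bad_requests = (1 - target) * total_requests
    return bad_requests / allowed_bad_requests


TARGET = 0.99                 # hypothetical SLO target
MINUTES_PER_DAY = 24 * 60     # one-day window
REQUESTS_PER_DAY = 1_000_000  # hypothetical daily traffic

# Time Slices: one bad minute costs the same at night and at peak.
night_ts = budget_consumed_time_slices(1, MINUTES_PER_DAY, TARGET)
peak_ts = budget_consumed_time_slices(1, MINUTES_PER_DAY, TARGET)

# Occurrences: the cost scales with how many requests actually failed.
night_occ = budget_consumed_occurrences(20, REQUESTS_PER_DAY, TARGET)
peak_occ = budget_consumed_occurrences(2_000, REQUESTS_PER_DAY, TARGET)

print(f"Time Slices: night {night_ts:.1%}, peak {peak_ts:.1%}")
print(f"Occurrences: night {night_occ:.1%}, peak {peak_occ:.1%}")
```

Under these assumptions, the night-time release and the peak overload each burn the same share of a Time Slices budget, while with Occurrences the peak incident burns a hundred times more than the night-time one.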

Next I’ll show you an example where Time Slices is the better method to use, but first I need to explain the concept of the Time Slice Allowance. When you add an SLO using this budgeting method, the Target is the percentage of time slices (minutes) that you want to be good, and the Time Slice Allowance is the percentage of good events a minute must contain to count as good. For example, if your Target is 95% and your Time Slice Allowance is 90%, that means a minute counts as good when at least 90% of its events are within the defined boundaries, and you want 95% of all minutes in the window to be good.

How does it work? Say your application’s goal is to target a good experience 95% of the time, and a good minute is defined as “90% of all responses are under 1000 ms.” Nobl9 will slice your SLO time window into minute-long intervals, and based on your Target, you will be presented with a budget of allowed bad minutes. This method can be useful if you are dealing with a contractual agreement and need to monitor the availability of a given application. Service level agreements are often expressed in these terms (e.g., “the application will be available 99% of the time”).
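As a rough sketch of that arithmetic, the Python below classifies a single minute and computes the bad-minute budget for a window. The 28-day window, the function names, the treatment of a minute with no traffic, and rounding to the nearest minute are all assumptions made for illustration; they are not taken from Nobl9.

```python
from typing import Sequence


def is_good_minute(latencies_ms: Sequence[float],
                   threshold_ms: float = 1000.0,
                   allowance: float = 0.90) -> bool:
    """A minute is good when at least `allowance` of its responses beat the threshold."""
    if not latencies_ms:
        return True  # assumption: a minute with no traffic counts as good
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) >= allowance


def allowed_bad_minutes(window_days: int, target: float) -> int:
    """Error budget expressed as a number of bad minutes in the window."""
    total_minutes = window_days * 24 * 60
    return round(total_minutes * (1 - target))  # assumption: round to the nearest minute


# 9 of 10 responses under 1000 ms: exactly 90%, so the minute is good.
print(is_good_minute([200.0] * 9 + [1500.0]))
# 8 of 10 responses under 1000 ms: 80%, so the minute is bad.
print(is_good_minute([200.0] * 8 + [1500.0] * 2))
# Hypothetical 28-day window with a 95% Target.
print(allowed_bad_minutes(28, 0.95))
```

With a 95% Target over 28 days (40,320 minutes), the budget works out to 2,016 bad minutes.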

I hope that this brief comparison has given you a taste of how flexibly Nobl9 adapts to your business goals. For more information on how to set up error budgets, see the documentation.

If you’re interested in exploring Nobl9 and seeing what else it has to offer, you can sign up for a free 30-day trial at nobl9/signup.

Example of the Occurrences configuration using Threshold metrics.

Example of Time Slices configuration using Threshold metrics.

 


Featured image via Diana Polekhina
