Author: Natalia Sikora-Zimna
Incidents happen. No matter how much you invest in the reliability of your services, complex systems can fail. Moreover, incidents may affect your reliability toolset, leaving you in the dark about the overall health of your system. Losing that visibility can have serious consequences: you may not be aware of issues your customers are experiencing, which can lead to SLA violations and undermine your product’s credibility. Thus, it’s crucial to have processes that allow you to recover quickly when incidents occur.
To bulletproof the post-incident recovery process, Nobl9 has extended the Replay feature (currently in beta) with the ability to retrieve historical data for existing SLOs. Thanks to this new capability, you no longer need to worry about a risky and costly data retrieval process if anything happens to the source of the SLIs feeding your SLOs. As long as your data source has recovered its data after the incident, you can backfill the missing data and recalculate the error budget of any affected SLO in minutes without worrying about data inconsistencies.
Nobl9 is all about reliability. We aim to build reliable software and help our customers monitor their own performance. However, we’re not immune to incidents – and since reliability is at the heart of everything we do, we also invest in our incident management processes. In the past, if anything happened to one of our data sources, we spent a considerable amount of time filling in the missing data with surgical precision. While ensuring our customers were not affected by the incident was worth it, this manual process of retrieving data was both time-consuming and costly. To solve this pain point for all Nobl9 users, we invested in extending Replay to pull historical data for existing SLOs. And guess what? An incident happened.
One of the data sources we integrate with was affected by an outage that resulted in gaps in the data. A substantial number of SLOs were affected. Thanks to Replay, we retrieved the missing data and filled in the holes in just a few hours, avoiding what likely would have been days’ worth of manual effort by several engineers. Moreover, the new capability removed the risk of human error, ensuring correct calculations. Let’s see how to use this feature to manage the post-incident data recovery process.
Pulling historical data for an existing SLO and recalculating its error budget is simple. Make sure Replay is enabled for the data source supporting the affected SLO, then open the SLO’s details pane and click the Reimport Historical Data button at the top left. Choose the period for which you want to reimport data (remember that you won’t be able to go further than the Maximum Period for Historical Data Retrieval parameter set for your data source), click Reimport, and that’s it! Nobl9 will do all the work for you.
An SLO affected by data loss
Reimporting historical data in the SLO details pane
Reimportation of historical data and error budget recalculation in progress
SLO with backfilled data
When you want to fix multiple SLOs and you don’t want to go through this process for each of them individually, you can use sloctl, Nobl9’s command-line tool, to retrieve data for all the SLOs in a single operation. To protect the rate limits of your data source, the SLOs will be queued during the data retrieval process.
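If you have a long list of affected SLOs, it can help to generate the input for a bulk replay with a script instead of typing it by hand. The Python sketch below is only an illustration: the entry fields (`slo`, `project`, `from`), the SLO names, and the 30-day limit are assumptions made for this example, so check the sloctl documentation for the exact file format and command before using anything like it. It also clamps the start time so the request never reaches further back than the maximum period for historical data retrieval configured for the data source.

```python
from datetime import datetime, timedelta, timezone
import json


def build_replay_entries(slos, incident_start, now, max_period):
    """Return one replay entry per (project, slo_name) pair.

    All datetimes are expected to be timezone-aware UTC values.
    The start time is clamped so it never goes further back than
    max_period, mirroring the data source's retrieval limit.
    """
    earliest_allowed = now - max_period
    start = max(incident_start, earliest_allowed)
    stamp = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Field names below are assumed for illustration, not confirmed.
    return [
        {"slo": name, "project": project, "from": stamp}
        for project, name in slos
    ]


if __name__ == "__main__":
    # Hypothetical SLOs and incident window, for illustration only.
    entries = build_replay_entries(
        slos=[("default", "checkout-latency"), ("default", "api-availability")],
        incident_start=datetime(2023, 3, 8, 9, 30, tzinfo=timezone.utc),
        now=datetime(2023, 3, 10, 12, 0, tzinfo=timezone.utc),
        max_period=timedelta(days=30),
    )
    print(json.dumps(entries, indent=2))
```

Building the list in one place keeps the start time identical across all affected SLOs, which makes the backfilled window easier to review before you submit it.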
Refer to the Nobl9 documentation to see the data sources that currently support Replay. Nobl9 is gradually expanding this list.
Replay for existing SLOs is now available to all Nobl9 customers. If you’d like to try Nobl9 and see how it can help your business set up actionable reliability goals, sign up for the Nobl9 Free Edition.