Reliability engineers have gotten on board with the idea of SLOs as a framework for defining reliability goals and taking action to meet those goals. But one refrain we hear over and over again is, “I need to clean up my data before I can define SLOs.” I understand this sentiment, because how to work with imperfect data was one of the most critical design decisions we made when we built Nobl9.
You already have monitoring metrics. Maybe the metrics aren’t perfect, and that’s okay. You also clearly have an appetite to improve your reliability or at least to quantify your reliability to make better decisions or help meet a business goal.
I wanted to share some ideas to help you get from the data you already have to your first set of SLOs. You can then use them to benchmark your current systems’ reliability, start looking at improvement ideas, and watch as your telemetry and the system itself improve over time. Nothing feels better than watching real incremental progress happen toward a goal!
I had one customer who called me up and said, “Alex, something is wrong with this SLO. It’s not showing up right in Nobl9.” I logged in with him, and we looked at the data, which was correct, but the latency goal was extremely tight – 14ms at 99.999%. He believed that the latency goal for this service was five nines, and now the data had shown that it wasn’t adhering to that level.
“Listen,” I replied, “has anyone noticed that you’re not hitting this goal?”
“Well, no,” he admitted.
“In that case, lower the goal! We can still have five nines at a higher latency threshold, but clearly, no one cares if this service is running slower than 14 milliseconds.”
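To make that advice concrete, here is a minimal sketch of how you might check a latency goal against the data you already have, and find the loosest change that would make it achievable. The function names and the sample latencies are made up for illustration; this is not how Nobl9 computes SLOs, just the arithmetic behind the conversation above.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Share of requests that finished at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def smallest_threshold_meeting(latencies_ms, target):
    """Smallest observed latency that at least `target` of requests stay under.

    `target` is a ratio such as 0.99999 for five nines.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(latencies_ms)
    total = len(ordered)
    # Walk up the sorted samples until enough requests are covered.
    # The last sample always covers 100%, so the loop always returns.
    for count, value in enumerate(ordered, start=1):
        if count / total >= target:
            return value


# Hypothetical samples (milliseconds); real data would come from your metrics.
sample_latencies_ms = [5, 8, 9, 10, 11, 12, 12, 13, 14, 40]

if __name__ == "__main__":
    print(fraction_within(sample_latencies_ms, 14))
    print(smallest_threshold_meeting(sample_latencies_ms, 0.9))
```

If the share of requests within the threshold is below your target and nobody has complained, the second function gives you a starting point for a threshold that the service already meets at the same target.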
Another Nobl9 user loaded her metrics into Nobl9 to compare them against the spreadsheets her team used to calculate SLAs. She immediately saw that one of the services was not meeting the goal that the “official” monthly SLA reports said it was meeting. But if that were true, the end users of that service would surely have reported it by now, and our friend hadn’t heard anything. To double-check, she reached out to one of the application leads (luckily, it was an internal service) and found out (lo and behold!) that they had been mildly frustrated with the performance for months! It just wasn’t bad enough to complain about. The team was then able to fix the service, and everyone was happy.
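The comparison in that story boils down to recomputing attainment from raw counts and checking it against the goal. The sketch below assumes you can get a count of good events and total events per service for the reporting period; the service names, numbers, and function names are hypothetical.

```python
def attainment(good_events, total_events):
    """Ratio of good events to all events for one service."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def services_below_target(counts, target):
    """Names of services whose measured attainment is under the target.

    `counts` maps a service name to a (good_events, total_events) pair.
    """
    return sorted(
        name
        for name, (good, total) in counts.items()
        if attainment(good, total) < target
    )


# Hypothetical monthly counts pulled from existing monitoring data.
monthly_counts = {
    "checkout": (99_950, 100_000),
    "search": (99_800, 100_000),
}

if __name__ == "__main__":
    print(services_below_target(monthly_counts, 0.999))
```

Any service this flags that the official report shows as healthy is worth a conversation with its users, even if no one has complained yet.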
I share these stories to highlight how important it is to get started. If you wait for clean, perfect data, you’re going to be waiting forever. Start right now (with dirty data if necessary) and see what you learn. We made sure you could get from ingest to insight in a few clicks. The world is too complicated these days to strive for telemetry perfection, and if you don’t start with something, you may have an incomplete understanding of the quality of your data. Not looking at the data doesn’t make the problem go away.
Where do we start?
I hope you’ll get started with the data you have to build SLOs. If you have a data source that we don’t support (yet), please contact us, and we can share where it falls on our roadmap (we might be working on it already), or we may have a workaround for you to bring that data into Nobl9. We’d love to hear about your SLO journey, your use cases, and how we can take your current data and help you with the noble pursuit of reliable software.
Image Credit: Michael Dziedzic on Unsplash