| Author: Alex Nauda
Reliability Engineers have gotten on board with the idea of SLOs as a framework for defining reliability goals and taking action to meet them. But one common refrain we hear over and over is, “I need to clean up my data before I can define SLOs.” I understand this sentiment; making it easy to start with imperfect data was one of the most critical design decisions we made when we built Nobl9.
You already have monitoring metrics. Maybe the metrics aren’t perfect, and that’s okay. You also clearly have an appetite to improve your reliability, or at least to quantify it so you can make better decisions or meet a business goal.
I wanted to share some ideas to help you get from the data you already have to your first set of SLOs. You can then use them to benchmark your current systems’ reliability, start looking at improvement ideas, and watch as your telemetry and the system itself improve over time. Nothing feels better than watching real incremental progress happen toward a goal!
- Vital Signs – start simple. Pick some availability metrics or basic latency metrics and pull them into Nobl9. You’ll see your SLO immediately and can compare the indicators against multiple thresholds and error budget windows.
- Integrated SLOs – Nobl9 already works with popular monitoring systems like Prometheus, Datadog, and Lightstep. If you’re using one of these, you can get set up in just a few clicks.
- Iterate and improve – looking at data invariably leads to asking questions. These questions will push you in different directions as you figure out where to focus your energy to make the system better. Was the SLO better than you expected? Was it worse, but no one noticed?
I had one customer who called me up and said, “Alex, something is wrong with this SLO. It’s not showing up right in Nobl9.” I logged in with him, and we looked at the data, which was correct, but the latency goal was extremely tight – 14ms at 99.999%. He believed that the latency goal for this service was five nines, and now the data showed that the service wasn’t meeting that level.
“Listen,” I replied, “has anyone noticed that you’re not hitting this goal?”
“Well, no,” he admitted.
“In that case, lower the goal! We can still have five nines at a higher latency threshold, but clearly, no one cares if this service is running slower than 14 milliseconds.”
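For a sense of just how tight a five-nines goal is, here’s a quick back-of-the-envelope calculation (the 30-day window is my example, not anything from the story above):

```python
# Back-of-the-envelope error budget for a 99.999% ("five nines") objective.
# With a ratio SLI, the error budget is the fraction of events allowed to fail.
target = 0.99999
error_budget = 1 - target  # 0.001% of events

# Over a 30-day rolling window, that translates to very little "bad" time:
window_seconds = 30 * 24 * 60 * 60
allowed_bad_seconds = window_seconds * error_budget
print(f"Allowed bad time per 30 days: {allowed_bad_seconds:.0f} seconds")
# Roughly 26 seconds -- one brief blip and the budget is gone.
```

Which is why relaxing the latency threshold, rather than the nines, was the right move: the same 99.999% target becomes achievable once the definition of a “good” request matches what users actually notice.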
Another Nobl9 user loaded her metrics into Nobl9 to compare against the spreadsheets her team used to calculate SLAs. She immediately saw that one of the services was not meeting the goal it appeared to hit in the “official” monthly SLA reports. And if this were true, then the end user of that service would have reported it by now. But our friend hadn’t heard anything. To double-check, she reached out to one of the application leads (luckily, it was an internal service) and found out (lo and behold!) they had been mildly frustrated with the performance for months! It just wasn’t bad enough to complain. They were then able to fix the service, and everyone was happy.
I share these stories to highlight how important it is to get started. If you wait for clean, perfect data, you’re going to be waiting forever. Start right now (with dirty data if necessary) and see what you learn. We made sure you could get from ingest to insight in a few clicks. The world is too complicated these days to strive for telemetry perfection, and if you don’t start with something, you may have an incomplete understanding of the quality of your data. Not looking at the data doesn’t make the problem go away.
Where do we start?
- Availability is the most common metric. Most services have decent availability data at the application layer or at ingress (load balancer, mesh, what have you).
- Password reset? I don’t know of a more universally frustrating user experience than a broken password reset flow. And password reset is surprisingly complicated: it typically involves at least one UI, an API, a mail sending service, the user’s mail service, the user’s mail client — and the user might remember their credentials midstream and abandon the process! If you have a metric that simply counts the start point of a person clicking on “forgot my password” and a separate count of successful completion of reset, you don’t even need to correlate the individual users. Set a goal for that ratio at a threshold you know is working — maybe it’s only 75%, that’s fine — and you can easily measure an issue or outage in the overall user journey.
- Page Load Times – add latency metrics of page load time and define separate thresholds for a happy (say 100ms), laggy (250ms), and painful (1000ms) experience. You can even set separate thresholds on average latency vs p90 vs p99 latency, whatever is appropriate for your system — all based on a single latency metric that you probably already have instrumented at a load balancer or at the application level.
- Overachieving Services. Create objectives (SLOs) for services you know run well and seem very reliable. Look at those green metrics! Now ask yourself, are we overspending on our hosting to support this? Could we reduce capacity? Could we take more risk with this service by, say, shipping more changes faster? Did we get lucky lately on what could be an otherwise fragile service? Ask these tough questions about systems that are seemingly, quietly, humming along.
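To make the ratio and latency ideas above concrete, here is a minimal sketch of turning raw counts and latency samples into SLI values. The function names and every number are illustrative only; this is not a Nobl9 API:

```python
# Minimal sketch: turning raw telemetry into SLI values.
# All numbers below are made up for illustration.

def ratio_sli(good: int, total: int) -> float:
    """Ratio SLI: fraction of good events (e.g. completed password resets
    vs. clicks on "forgot my password"). No per-user correlation needed."""
    return good / total if total else 1.0

def latency_slis(samples_ms, thresholds_ms):
    """For each threshold, the fraction of requests at or under it."""
    n = len(samples_ms)
    return {t: sum(1 for s in samples_ms if s <= t) / n for t in thresholds_ms}

# Password-reset journey: 1000 starts, 790 completions -> a 79% SLI,
# which clears a deliberately modest 75% goal.
sli = ratio_sli(good=790, total=1000)
assert sli >= 0.75

# Page load times checked against happy/laggy/painful thresholds.
samples = [80, 95, 120, 140, 230, 260, 900, 1200]
print(latency_slis(samples, thresholds_ms=[100, 250, 1000]))
# {100: 0.25, 250: 0.625, 1000: 0.875}
```

The point of the sketch is that neither SLI requires perfect data: a pair of counters covers the password-reset journey, and a single latency metric yields all three experience thresholds.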
I hope you’ll get started with the data you have to build SLOs. If you have a data source that we don’t support (yet), please contact us, and we can share where it falls on our roadmap (we might be working on it already), or we may have a workaround for you to bring that data into Nobl9. We’d love to hear about your SLO journey, your use cases, and how we can take your current data and help you with the noble pursuit of reliable software.