Author: Alex Hidalgo
Originally published in 97 Things Every SRE Should Know, November 2020
When someone I've just met asks me what I do for a living, I generally fall back to something along the lines of: "I'm a Site Reliability Engineer. We keep large-scale computer services reliable." For many people this is sufficiently boring, and our general pleasantries continue. But occasionally I run into people who are a bit more curious than that: "Oh, that sounds interesting! How do you do that?"
And that's a difficult question to answer! What is it that SREs actually do? For many years I'd rely on just listing an assortment of things, some of which have made their way into essays within this very book. While an answer like that isn't exactly wrong, it also never felt truly satisfying. There had to be a more cohesive answer, and when I reflect on my decade of performing this job, I think I've finally figured it out. Virtually everything SREs do relies on our ability to do six things: measure, analyze, decide, act, reflect, and repeat.
Measuring does not just mean collecting data. To measure something, you have to have some sort of goal in mind. You don't just collect flour to bake a cake; you measure the flour, otherwise things will end up a mess. SREs need to measure things, because pure data isn't enough. Our data needs to be meaningful. We need to be able to answer the question: "Is this service doing what its users need it to be doing?"
Once you have measurements, the next step is to analyze them. This is where some basic statistics and probability analysis can be helpful. Learn as much as you can from the things you are measuring by using the centuries of study and knowledge mathematicians have made available to us.
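As a small illustration of how even basic statistics can change what a measurement tells you, here is a minimal sketch in Python. The latency values, the 500 ms threshold, and the `summarize` helper are all made up for this example; they are assumptions, not data or tooling from any real service.

```python
from statistics import mean, median

# Hypothetical request latencies in milliseconds (made-up sample data).
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 130, 115, 99]


def summarize(samples, threshold_ms=500):
    """Summarize latency samples: mean, median, and share of 'good' requests.

    threshold_ms is an assumed cutoff for what users would consider
    acceptably fast; a real service would pick its own.
    """
    good = sum(1 for s in samples if s <= threshold_ms)
    return {
        "mean_ms": mean(samples),
        "median_ms": median(samples),
        "good_fraction": good / len(samples),
    }


summary = summarize(latencies_ms)
print(summary)
```

In this sample, one very slow request drags the mean far above the median, while the fraction of requests under the threshold comes closer to answering whether the service is doing what its users need.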
Now you’ve done your best at measuring and analyzing how a certain thing is behaving. Use this analysis to make a decision about how to best move into the future!
Then you must act. You need to do the thing you decided upon. And it could be that this action is to take no action at all!
Finally, once you've acted, reflect on what you decided to do. Take a critical but blameless eye and place it squarely upon whatever you've done. You can generally learn much more from this process than you can from your initial measurement analysis.
Now you start over. Something has either changed about the world due to your decision, or it hasn’t, and you need to keep measuring to see what the real impact of this action, or inaction, actually was. Keep measuring, then analyze, decide, act, reflect, and repeat again and again. It’s the SRE way. Incremental progress is the only reliable way to reliability.
Site reliability engineering is a broad discipline. We are often called upon to be software engineers, system administrators, network engineers, systems architects, and even educators or consultants. But one paradigm that flows through all of those roles is that SRE is data-driven. Measure the things you need to measure, analyze the data you collect, decide what to do with this analysis, act upon your findings, reflect on your decision, and then do it all over, again and again and again.
Measure, analyze, decide, act, reflect, and repeat: That's site reliability engineering in six words.