This is a guest blog by Zac Nickens, SRE Team Captain at OutSystems
Must See SRE
Tuesday, 14:15–15:00 (Opening Plenary Session)
“Don’t Follow Leaders” or “All Models Are Wrong (and So Am I)”
Niall Murphy, RelyAbility
Five years after the publication of the SRE book, it’s a good time to reflect on what it did—the good and the bad, the ugly and the beautiful, and relate it to what is going on in production engineering in general, SRE in particular, and the problems in the field we’ve left unaddressed and/or created for ourselves.
Niall Murphy has worked in Internet infrastructure since the mid-1990s, specializing in large online services. He has worked with all of the major cloud providers from their Dublin, Ireland offices, and most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). He is the instigator, co-author, and editor of the two Google SRE books, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.
Zac’s Commentary: This one is a must-see. When Niall talks, our industry listens, and for good reason: Niall knows reliability. On a deeper level, it is critical for SREs to perform a review and retrospective for our domain, to be honest about the things we still need to work on, and to highlight the methods and approaches which have served us well. If we do it for systems and software, we should do it for ourselves every once in a while, and there is no one better to lead the effort than Niall.
Great Teams Do Great Work
When the SRE Book was published in 2016, the job title of SRE was not widely used outside Google. Fast-forward five years and it seems like every company is hiring SREs. Did the System Administrator and Operations jobs disappear, or have their job titles simply changed?
At the end of the talk, you will know one way of transforming a disjoint team of engineers into a high-performing SRE team. If you are a manager of a team or you are interested in team building, this talk is for you. This is not a technical talk; I will focus solely on how to set up a team for success.
Benjamin Bütikofer is the Head of Platform Services at the Swiss online marketplace Ricardo. He started working in the systems administration space in 1999. He worked in data centers, managed Unix mainframes, and was a Linux admin. After an excursion into Software Development and via acquisition, he ended up at Microsoft. At Microsoft, he built his first team of Site Reliability Engineers.
Zac’s Commentary: An org going through a transition can see a lot of unorganized change. It is critical that we set teams up for success, especially when the big change is “we’re SREs now, now what?”. A talk focused on building and enabling great teams is a must-see for anyone who manages teams, and also for anyone who wants to be on a great team.
Must See SRE
Rethinking the SDLC
Emily Freeman, AWS
The software (or systems) development lifecycle has been in use since the 1960s. And it’s remained more or less the same since before color television and the touchtone phone. While it’s been looped into circles and infinity loops and designed with trendy color palettes, the stages of the SDLC remain almost identical to its original layout.
Yet the ecosystem in which we develop software is radically different. We work in systems that are distributed, decoupled, complex, and can no longer be captured in an archaic model. It’s time to think differently. It’s time for a revolution.
The Revolution model of the SDLC captures the multi-threaded, nonsequential nature of modern software development. It embodies the roles engineers take on and the considerations they encounter along the way. It builds on Agile and DevOps to capture the concerns of DevOps derivatives like DevSecOps and AIOps. And it, well, revolves to embrace the iterative nature of continuous innovation. This talk introduces this new model and discusses the need for how we talk about software to match the experience of development.
Zac’s Commentary: We’re revolutionizing and updating everything about systems and computing every couple of years, and we keep changing how we work and why we do that work, so it stands to reason that we should update the SDLC to be more harmonious with the tools and tactics that we use now. More succinctly, just listen to Emily: she knows what she’s talking about and is one of the important voices in our field!
Turn the SLOs up to 11
SLOs to the X
SLX: An Extended SLO Framework to Expedite Incident Recovery
Qian Ding and Xuan Zhang, Ant Group
This talk is based on a real journey of establishing SLOs for an infrastructure SRE team whose availability target is higher than 99.999%. First, we reveal our process for defining SLOs and demonstrate the gaps between expectations and reality when using SLOs with dev teams. Secondly, we present a unified SLO framework (SLX) designed to help SREs manage hundreds of SLOs. For example, other than using SLO data for basic alerting and weekly reporting, we combine the SLO framework with statistical anomaly detection algorithms to locate the pitfalls automatically. To achieve that, we introduce several new concepts like Service-Level-Factor (SLF) and Service-Level-Dependency (SLD) and use them to build SLO knowledge graphs across multiple infrastructure systems. Finally, we present our intent-driven SLX implementation inspired by the Kubernetes design and the GitOps paradigm.
Zac’s Commentary: How do you achieve reliability above 5 nines, at scale? TBH, I don’t know! But I’m excited to see this talk. Managing SLOs in the hundreds, thousands, and tens of thousands is a challenge. I can’t wait to hear how this team is tackling the scale challenge.
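For a rough sense of what “above 5 nines” means in practice, here is a back-of-the-envelope sketch of the downtime budget implied by an availability target. It is not taken from the talk or from SLX; the function name `allowed_downtime_seconds` and the simple time-based definition of availability are assumptions made for illustration only.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in seconds, for an availability target.

    Assumes a simple time-based definition of availability:
    budget = (1 - availability) * length of the period.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailability = 1.0 - availability_percent / 100.0
    seconds_in_period = period_days * 24 * 60 * 60
    return unavailability * seconds_in_period


if __name__ == "__main__":
    # Compare a few common targets over a 365-day year and a 30-day window.
    for target in (99.9, 99.99, 99.999):
        per_year = allowed_downtime_seconds(target)
        per_30_days = allowed_downtime_seconds(target, period_days=30)
        print(f"{target}%: {per_year:.2f} s per year, {per_30_days:.2f} s per 30 days")
```

Under these assumptions, a 99.999% target leaves roughly 5.26 minutes of downtime per year, or about 26 seconds in a 30-day window, which gives some idea of why incident recovery speed matters so much at this level.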
Must See SRE
DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond
Thomas Depierre, Liveware Problems; John Allspaw, Adaptive Capacity Labs; Paul Hammond
Ten years after a talk that started the DevOps movement, we are bringing John Allspaw and Paul Hammond for a discussion with old men yelling at clouds.
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.
Paul Hammond is a software engineer, manager, and advisor. His career has spanned twenty years at companies including Slack, Adobe, Typekit, Flickr, and the BBC. His recent work has focused on how software is developed and operated, including building Slack’s continuous deployment pipeline, development environments, and test infrastructure.
Zac’s Commentary: Old men yelling at/about Clouds. What more could we ask for? Featuring John Allspaw and Paul Hammond? Sign me up. This is a must-see!
Must See SRE
You’ve Lost That Process Feeling: Some Lessons from Resilience Engineering
David D. Woods, Ohio State University and Adaptive Capacity Labs; Laura Nolan, Slack
Software systems are brittle in various ways, and prone to failures. We can sometimes improve the robustness of our software systems, but true resilience always requires human involvement: people are the only agents that can detect, analyze, and fix novel problems.
But this is not easy in practice. Woods’ Theorem states that as the complexity of a system increases, the accuracy of any single agent’s own model of that system—their ‘process feel’—decreases rapidly. This matters, because we work in teams, and a sustainable on-call rotation requires several people. This talk brings a researcher and a practitioner together to discuss some Resilience Engineering concepts as they apply to SRE, with a particular focus on how teams can systematically approach sharing experiences about anomalies in their systems and create ongoing learning from ‘weak signals’ as well as major incidents.
Laura Nolan is a Senior Staff Engineer and tech lead at Slack, working mainly on service networking and ingress load balancing, as well as occasionally writing outage reports for the Slack Engineering blog. Laura has contributed to a number of books on SRE, including Site Reliability Engineering: How Google Runs Production Systems, Seeking SRE, and 97 Things Every SRE Should Know. She also regularly writes for USENIX’s ;login: magazine and is a member of the USENIX board and SREcon Steering Committee.
Zac’s Commentary: David Woods and Laura Nolan. Guaranteed Must See SRE; when these two offer you lessons, you pay attention, especially when the lessons are about resilience in practice! Throw in bonus John Allspaw and it’s an Allstar event!
Embracing the Known Unknowns
Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform
Mary McGlohon, Google
Machine Learning is often treated as mysterious or unknowable. This can lead to SREs choosing to work around ML-related reliability problems in their systems rather than through them. This avoidance is not only risky but also unnecessary: Any given SRE operates with systems that they themselves may not know in great depth. To manage risk, they use a series of generalized techniques to understand the properties of the system and its failure modes.
In this talk, we apply this outside-in approach towards ML reliability, drawing from experiences with a large-scale ML production platform. We describe common failure modes (spoiler alert: they tend to be the same things that happen in other large systems), and based on these failure modes, recommend best practices for productionization: Monitor systems and protect them from human error. Understand data integrity needs, and meet them. Prioritize pipeline workloads for efficiency and backlog recovery.
Mary McGlohon is a Site Reliability Engineer at Google, who has worked on large-scale ML systems for the past 4 years. Prior to that, her career included data mining research, software development, and distribution pipeline systems. She completed a B.S. in computer science from the University of Tulsa and a Ph.D. in machine learning from Carnegie Mellon University. She is interested in how production techniques can make ML better for human operators and users.
Zac’s Commentary: As SREs, we are used to wading into territory that we might not already know intimately. That’s part of what we do! So we should embrace new-to-us components as they become parts of the systems we care about. This really should be a MUST SEE for any SREs that want to stay near the leading edge of technology – I don’t think ML is going away anytime soon.
Learning From Other Disciplines
When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field
Sarah Butt, Salesforce
In many ways, incident management is the “emergency room” for technical systems. As technology has evolved, it has progressed from auxiliary systems to essential business systems of record, to critical systems of engagement across multiple industries. As these systems become increasingly critical, SRE’s role in incident management and resolution has become vital for any essential technical system. This talk focuses on how various strategies used in the medical field can be applied to incident response. From looking at algorithm-guided decisions (and learning a bit about what “code blue” really means) to discussing approaches to triage and stabilization based on the ATLS protocol, to considering the role of response standardization such as surgical checklists in reducing cognitive overhead (especially when PagerDuty goes off at 2 a.m.!), this talk aims to take key learnings from the medical field and apply them in practical ways to incident management and response. This talk is largely conceptual in nature, with takeaways for attendees from a wide variety of backgrounds and technical experience levels.
Sarah is a former audio engineer turned technology professional who has spent the past 6 years of her career at Salesforce and Dell devoted to customer-perceived reliability. She is a 2021 MBA graduate from The University of Texas (Hook’em!) where she did graduate work studying the intersection of technology, business, and people in the context of SRE. A few of her favorite topics include user-centric monitoring, intelligent alerting, and using innovative technology to drive the high availability of complex distributed systems. Sarah is currently part of Salesforce’s SRE organization, where you’ll likely find her talking about topics such as resilience, observability, and incident management and response. In her free time, you’ll often find her hiking in the Texas Hill Country with Rosie, her yellow lab.
Zac’s Commentary: This should be a must-see for emerging SREs. We can learn from a variety of other fields and disciplines across industries. Thinking outside our boxes is often what leads to inventive and effective strategies for SREs, and we should cherish the opportunities we get to learn from outside the technology sector. This one is going to be great!
What are we working on next? More Known Unknowns…
Panel: Unsolved Problems in SRE
Moderator: Kurt Andersen, Blameless
Panelists: Niall Murphy, RelyAbility; Narayan Desai, Google; Laura Nolan, Slack; Xiao Li, JP Morgan Chase; Sandhya Ramu, LinkedIn
Every field of endeavor has its leading edge where the answers are unclear and active exploration is warranted. Although the phrase “here be dragons” might be an appropriate warning, this panel of intrepid adventurers will venture into that unknown territory.
Kurt Andersen is the head of strategy for Blameless.com. Prior to that, he was one of the leads for the Product-SRE organization at LinkedIn. Across the full spectrum of IT influence, he is strongly committed to developing the best engineers and teams, and enabling them with the right ideas, tools, and connections at the right time. Kurt has been active in the anti-abuse and IETF standards communities for over 20 years. He has spoken at multiple conferences on various aspects of reliability, authentication, and security and written for O’Reilly. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.
Zac’s Commentary: This panel is a list of great people with great minds who all have great experience! Every SRE should be lined up to learn from them, especially since the next generation of great SREs are going to be the ones leading us to tackle the unknowns in reliability and systems. This is a must-see for all SREs, but especially for emerging SREs.
MUST SEE SRE & Learning From Other Disciplines
Friday, October 15, 2021, 04:45–05:15
Food for Thought: What Restaurants Can Teach Us about Reliability
Alex Hidalgo, Nobl9
Nothing is ever perfect; all systems will fail at some point. This is true of everything we might define as a complex system. We could be discussing computer services, living organisms, buildings, or societal structures—at some point failure will occur in these systems. It turns out that failure is actually totally fine, and humans know this innately even if they’re not always aware of their own fault tolerance. Like so many other things, restaurants are complex systems made up of many independent complex systems that all rely on each other, just like our computer services. In this talk let’s use the experiences we’ve all had dining at, ordering from, or working at restaurants to draw parallels to how we can better think about the reliability of computers. From The Floor to The Bar to The Line: restaurants have many lessons to teach us.
Alex Hidalgo is the Director of Site Reliability Engineering at Nobl9 and author of Implementing Service Level Objectives. During his career, he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex’s previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Zac’s Commentary: Alex Hidalgo has become just about synonymous with SRE and SLOs these days, and for good reason. Alex is one of the internet’s preeminent Reliability thinkers and writers (he’s also pretty good for a joke once in a while). This talk is a must-see SRE talk for emerging reliability practitioners based on Alex’s merits alone, but what really makes it a must-see for everyone is that we can learn about reliability engineering from many disciplines and sectors outside of the tech world. Alex is all about embracing failure and risk to make our services or servers (the food kind) more reliable and focused on customer happiness.
Image Credit: Brunno Emmanuelle on unsplash.com