Reliability and SLOs at Scale: Key Lessons from the SRE Pulse Roundtable

Written by Erza Zylfijaj | Jan 22, 2026 3:37:48 PM

Reliability and service level objectives play a central role in how large organizations operate software systems. In this post, we share key lessons from an SRE Pulse roundtable featuring leaders in reliability, observability, and incident response. The discussion focused on how teams implement reliability practices, where challenges emerge at scale, and how SLOs are used in practice across organizations.

The roundtable was hosted by Brian Singer, Chief Product Officer at Nobl9, and included Mandi Walls, Developer Advocate at PagerDuty, Jack Fordyce, Senior Site Reliability Engineer, and Jack Neely from Cardinality Cloud.

How SRE leaders define DevOps and reliability

Mandi Walls opened the discussion by reflecting on how DevOps is understood today.

“It means everybody cares about the customer and the user experience.”

Rather than teams focusing only on their individual responsibilities, she emphasized the importance of understanding how work impacts users and the business as a whole.

Jack Fordyce spoke about the relationship between centrally managed SRE teams and application development teams. He highlighted the need to bridge gaps between these groups so they can work toward shared goals instead of operating in isolation.

What high-functioning reliability practices look like

When discussing what differentiates teams that consistently deliver reliable systems, the panel emphasized habits and collaboration over tooling.

Jack Neely noted:

“Habits and rituals of how software engineering teams and SRE teams work together is the marker of a team that can produce reliable software.”

Mandi Walls expanded on this by highlighting the importance of continuous learning across teams. She described the value of paying attention to what other teams are seeing and learning, and incorporating those insights rather than operating in silos.

Psychological safety as a foundation

The panel also discussed the role of psychological safety in reliability work. Mandi Walls described it as creating an environment where people feel comfortable asking questions, discussing mistakes, and learning from incidents.

She noted that building this kind of culture takes time and sustained effort, but it is essential for teams that want to improve reliability over the long term.

Measuring reliability and communicating impact

Measuring reliability and communicating its value to leadership remains a common challenge.

Jack Fordyce pointed out:

“It is tough if you have not adopted a service level objectives approach.”

Without SLOs or similar mechanisms, it becomes difficult to understand whether reliability efforts are effective or improving over time.
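
To make that concrete, here is a minimal sketch of the kind of calculation an SLO enables. It was not presented at the roundtable; the function name, the request-based availability definition, and the example numbers are all illustrative assumptions. It only shows how an SLO target turns raw success and failure counts into an error budget that can be tracked over time.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize SLO compliance for one time window.

    Hypothetical helper: assumes a request-based availability SLI
    (good requests / total requests) and a target such as 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # Measured availability for the window.
    availability = 1.0 - failed_requests / total_requests
    # The error budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of the budget already spent (values above 1.0 mean it is exhausted).
    budget_consumed = failed_requests / allowed_failures

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": availability >= slo_target,
    }


# Illustrative numbers only: a 99.9% target over one million requests.
report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

Comparing the consumed share of the budget from one window to the next gives a simple signal of whether reliability work is paying off.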

Mandi Walls added that reliability work needs to connect to outcomes leadership already cares about.

“You hope that the work is reflected in the things that your executives care about, like revenue, maybe NPS, or however you are measuring customer user happiness.”

Navigating disagreements and operational challenges

Operational friction between SRE and development teams was another topic of discussion. Mandi Walls emphasized the importance of empathy and shared learning.

“You want folks learning from this and passing it on.”

She highlighted the need for processes that enable knowledge sharing beyond a single team or incident.

Jack Neely reinforced the importance of feedback loops, noting that without them, teams risk staying on the wrong path for a long time without realizing it.

Looking ahead

As the conversation wrapped up, panelists shared areas they continue to focus on in their own organizations.

Mandi Walls spoke about ongoing improvements at PagerDuty, particularly around post-incident reviews, and emphasized that reliability practices continue to evolve.

Jack Fordyce reflected on the challenges of maintaining strong feedback loops in distributed development environments.

“One of the toughest things that I have come across is the feedback loop between a distributed development environment and central SRE organizations.”

Without that feedback loop, SRE teams can struggle to succeed.

Overall, the discussion reinforced that reliability at scale depends on more than tools or frameworks. Collaboration, feedback, learning, and clear communication across teams remain critical to making reliability practices work in large organizations.

Watch the full discussion

To hear the complete conversation, register to watch the full SRE Pulse Roundtable.
If you have specific questions about reliability or SLOs in your organization, get in touch with our team. We are happy to discuss them.
