Engineering AI Reliability:

Proven Strategies for Maximizing Uptime

June 18, 2025

10:00 AM PT / 1:00 PM ET

Organizations across every industry are integrating AI applications into their architectures, often with substantial investments. Those investments bring increased expectations and attention, which means higher demands for availability and performance.

But AI applications are complex. In some ways, managing their reliability is no different from managing any other application. In other ways, it calls for an entirely different approach: the techniques differ, and so does the impact of failures, which in some cases require costly restarts of long-running processes.

With this increased pressure to minimize adverse impacts, how can teams keep their AI applications functioning at a high level?

Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep the uptime of AI applications high and avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.

  • Best practices from customers for high-availability AI applications
  • What’s different between AI applications and other applications
  • How to use reliability data to prioritize AI system improvements

Speakers

Alex Nauda
CTO
Nobl9

Mandi Walls
DevOps Advocate
PagerDuty

Kolton Andrus
Founder & CTO
Gremlin

Recorded Webinars

Missed a session or want to rewatch a favourite?
Explore our library of past webinars and catch up anytime.