Complex AI, Fragile Systems

Proven Strategies for Maximizing Uptime

Recording Now Available

Originally Recorded on June 18, 2025

Organizations across every industry are integrating AI applications into their architectures, often with substantial investments. Those investments bring increased expectations and attention, which means higher demands for availability and performance.

But AI applications are complex. In some ways, managing their reliability is no different from managing any other application. In other ways it requires an entirely different approach: the techniques differ, and so does the impact of failures, which in some cases means costly restarts of long-running processes.

With this increased pressure to minimize adverse impacts, how can teams keep their AI applications functioning at a high level?

Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep the uptime of AI applications high and avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.

You'll learn:

Best practices from customers for high-availability AI applications
What’s different between AI applications and other applications
How to use reliability data to prioritize AI system improvements

Speakers

Alex Nauda
Mandi Walls
Kolton Andrus