Complex AI, Fragile Systems | Proven Strategies for Maximizing AI Uptime | Webinar

Complex AI, Fragile Systems

Recording Now Available

Originally Recorded on June 18, 2025

Organizations across every industry are integrating AI applications into their architectures, often with substantial investment. That investment brings increased expectations and attention, which means higher demands for availability and performance.

But AI applications are complex. In some ways, managing their reliability is no different from managing any other application. In other ways, it calls for an entirely different approach: the techniques required are different, and so is the impact of failures, which in some cases means costly restarts of long-running processes.

With this increased pressure to minimize adverse impacts, how can teams keep their AI applications functioning at a high level?

Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep AI application uptime high and to avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.

Speakers

Alex Nauda

Mandi Walls

Kolton Andrus