Author: Dan Kurson
Avg. reading time: 5 minutes
As marketing organizations evolve to incorporate AI into their everyday operations, ours set out to integrate a Claude agent with HubSpot to manage our CMS. The agent could already read existing pages via a service key, but the scope to edit them was missing. Something as simple as adding an edit scope to the agent’s abilities ended up taking over two and a half hours. The agent repeatedly proposed non-existent scopes, pushed back on our rebuttals, and even tried to convince us the task was impossible because HubSpot had not released that capability in our instance. I am admittedly not the most technical user, and I clearly haven’t spent hours learning all the OAuth scopes available in HubSpot or what they mean (but really, who has time for that?). Neither the AI nor I caught the problem for hours.
A more technically experienced user might have spotted it sooner. That’s not the point. The point is that this kind of inefficiency happens routinely, costs real money in tokens, costs real hours in human time, and isn’t being measured anywhere.
AI agents are increasingly embedded in business-critical workflows that used to take humans hours: pulling data from APIs, debugging code, drafting documents, and managing infrastructure. Adoption has been faster than almost any technology shift in recent memory. What hasn’t kept up is how reliability is measured.
The hidden reliability tax in AI
Look at how AI labs evaluate their models today. The published benchmarks — MMLU, SWE-bench, HumanEval, GPQA — are predominantly single-shot correctness tests. Given a prompt, did the model produce the right output? Move on. These are useful measurements of capability. They are not, on their own, measurements of operational reliability.
The bar is starting to move on this. Anthropic recently published operational metrics from real Claude Code sessions, including turn duration percentiles, human-interrupt rates, and self-initiated stops — data drawn from millions of production conversations. Anthropic’s engineering team also documents how they think about turn count, tool calls, and token totals as standard transcript metrics. Princeton’s Holistic Agent Leaderboard introduced cost-per-task as a default evaluation axis across 21,000+ agent rollouts. OpenAI’s SWE-Lancer ties agent task success to literal dollar payouts. The raw ingredients of operational measurement are increasingly in the public record.
What’s still missing is the layer on top: reliability commitments. No lab publishes an agent SLO with a numeric target. No lab commits to an error budget. No lab offers a customer-facing reliability dashboard analogous to AWS or Stripe status pages. There is no standardized cross-vendor SLI for metrics such as path divergence, post-pushback recovery time, or efficiency regression detection. Every major lab now publishes how capable its agents are. None publishes how reliable they are — not in the way an SRE would mean the word.
That distinction matters because operational reliability is what users actually experience: how often the agent gets to the right answer, how efficiently, and with how much friction. It’s the difference between a model that scores 92% on a benchmark and a model that consistently gets a real task done in the time and tokens you’d expect. The model’s reliability is one thing, but what users experience is the model plus the harness: the full loop from input to output.
A typical AI agent session looks nothing like a benchmark. It’s multi-turn. The agent commits to strategies, sometimes abandons them, sometimes succeeds, sometimes burns through dozens of dead-end attempts. The new generation of lab measurement captures some of this. What still isn’t captured in a form customers can hold a vendor accountable to:
- Turns-to-resolution on real production tasks, by task class
- Token cost on a successful task versus the optimal path
- How often the agent commits to a strategy and abandons it mid-conversation
- How quickly the agent self-corrects after the user pushes back
- How often the user has to push back at all, as a leading indicator of fitness
These are the metrics that decide whether an agent feels like a competent colleague or a cost center. Anthropic’s autonomy paper shows they’re measurable in production. What’s missing is the framework that turns them into reliability commitments customers can plan around. And operational reliability is precisely the thing that shouldn’t depend on every user being maximally vigilant. When the system requires the user to scan every line of output for the phrase that would unstick it, the system is the problem.
This is exactly what SLOs are for
Service Level Objectives exist because component health doesn’t equal a good user experience. A backend can be 100% available and a session can still be miserable. The SRE discipline figured this out because component-level monitoring consistently missed the bad experiences that were actually driving users away.
The fix was to define measurable objectives from the user’s perspective and put error budgets behind them so engineering teams prioritize the right work. SLOs aren’t about blame. They describe outcomes the business cares about, with budgets that force prioritization when those outcomes degrade.
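To make the error-budget idea concrete, here is a minimal sketch of how a budget could be tracked for an agent SLO. The class name `AgentSLO`, the 95% target, and the notion of a "bad session" are illustrative assumptions, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class AgentSLO:
    """Hypothetical SLO: a target fraction of sessions must be 'good'."""
    name: str
    target: float  # e.g. 0.95 means 95% of sessions must meet the objective

    def error_budget(self, total_sessions: int) -> float:
        """Number of bad sessions the target tolerates in this window."""
        return (1.0 - self.target) * total_sessions

    def budget_remaining(self, total_sessions: int, bad_sessions: int) -> float:
        """Fraction of the error budget still unspent.

        1.0 means untouched, 0.0 means exhausted, negative means breached.
        """
        budget = self.error_budget(total_sessions)
        if budget == 0:
            return 0.0 if bad_sessions == 0 else float("-inf")
        return 1.0 - bad_sessions / budget


# Assumed example: 1,000 sessions this month, 20 of which blew past
# the team's agreed turn or token limits.
slo = AgentSLO(name="resolved-within-limits", target=0.95)
remaining = slo.budget_remaining(total_sessions=1000, bad_sessions=20)
```

With these assumed numbers the budget allows roughly 50 bad sessions, so 20 bad sessions leave about 60% of the budget unspent. A team would typically slow feature work and prioritize reliability as that figure approaches zero.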
For AI agents, the equivalent objectives are concrete and measurable today:
- Token cost per recurring workflow. Baseline each workflow on its own — support triage, code review, lead enrichment — and flag runs past p95 of that workflow’s distribution. Comparing across workflows is noise.
- User-pushback rate. How often does the user have to redirect the agent before it lands? How does the agent respond to pushback? And what happens when the user simply believes an agent that says a task is impossible or not worth doing?
- Path-divergence count. How often does the agent commit to a strategy, abandon it, and commit to another? Each divergence counts against reliability.
- Recovery latency after pushback. When the user signals the agent is wrong, how many turns until it corrects?
None of these depend on the user being a prompt expert. They measure the experience that actually happened, regardless of where the friction came from. The SLO captures the outcome and forces the team to own it.
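As a rough illustration, the four objectives above could be computed from a labeled session transcript along these lines. The `Turn` structure and its `strategy`, `pushback`, and `accepted` labels are hypothetical. In practice they would have to come from some upstream annotation or classifier, which this sketch does not provide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str                       # "user" or "agent"
    tokens: int = 0                 # tokens billed for this turn
    strategy: Optional[str] = None  # agent turns: label of the approach tried
    pushback: bool = False          # user turns: user says the agent is wrong
    accepted: bool = False          # agent turns: user accepted this output


def path_divergences(turns):
    """Count how many times the agent switches strategy within a session."""
    labels = [t.strategy for t in turns if t.role == "agent" and t.strategy]
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


def pushback_rate(turns):
    """Share of user turns that are pushback."""
    user_turns = [t for t in turns if t.role == "user"]
    if not user_turns:
        return 0.0
    return sum(t.pushback for t in user_turns) / len(user_turns)


def recovery_latency(turns):
    """Agent turns between the first pushback and the next accepted output.

    Returns None if there was no pushback or the agent never recovered.
    """
    waiting = False
    agent_turns_seen = 0
    for t in turns:
        if t.role == "user" and t.pushback and not waiting:
            waiting = True
        elif waiting and t.role == "agent":
            agent_turns_seen += 1
            if t.accepted:
                return agent_turns_seen
    return None


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def flag_expensive_runs(history, new_runs):
    """Return runs whose token total exceeds p95 of this workflow's history."""
    threshold = p95(history)
    return [run for run in new_runs if run > threshold]


# Assumed example session: the agent tries three strategies and is
# corrected twice before the user accepts the result.
session = [
    Turn("user"),
    Turn("agent", 900, strategy="scope_a"),
    Turn("user", pushback=True),
    Turn("agent", 1200, strategy="scope_b"),
    Turn("user", pushback=True),
    Turn("agent", 800, strategy="scope_c", accepted=True),
]
```

Note that `flag_expensive_runs` takes the history of a single workflow on purpose, which mirrors the point that comparing token cost across workflows is noise.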
Why AI vendors haven’t done this yet
AI vendors have spent the last several years optimizing for capability. Capability unlocks new use cases and is rewarded by benchmarks. But capability and operational reliability are distinct challenges.
SRE went through the same evolution a generation ago. Early operations measured component uptime — was the database up, was the API responding. The shift to user-experience-driven SLOs took years and required a culture change, because component metrics consistently failed to capture the user-experienced reality. AI experiences are at the same inflection point. The published metrics are capability metrics. The next maturity level is end-to-end reliability.
The same logic applies to companies buying AI
This isn’t only a vendor problem. Any organization paying per token for AI capabilities needs to assess reliability, because reliability is the key link between token cost and ROI.
A typical enterprise rolling out AI agents is starting to meter token spend across business-critical workflows in marketing, engineering, customer support, and sales operations. Total spend gets attributed by team and lands in a FinOps dashboard. But how do you know whether that spend went to efficient conversations or to wasted time? An agent circling on a simple task is a major waste of tokens and money.
The good news is that the telemetry we need already exists. Every API call to a major AI vendor returns token counts in the response payload. Integration layers such as opencode or LangChain have visibility into turn count, completion signals, and the user side of the conversation. What’s missing is the framework that turns that raw data into the same kind of operational reliability metric an SRE team would build for any other production system.
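As a sketch of how little glue this requires, the snippet below accumulates token counts from response payloads into a per-session ledger. `SessionLedger` is a hypothetical name. The `usage` field names (`input_tokens`/`output_tokens`, with `prompt_tokens`/`completion_tokens` as a fallback) are assumptions that should be checked against each vendor's API reference, and the payloads here are plain dictionaries, not real SDK objects.

```python
from dataclasses import dataclass


@dataclass
class SessionLedger:
    """Hypothetical per-session record of turns and token usage."""
    workflow: str
    turns: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, response: dict) -> None:
        """Add one API response's usage to the ledger.

        Assumes the payload carries a 'usage' mapping; missing fields
        count as zero rather than raising.
        """
        usage = response.get("usage", {})
        self.input_tokens += usage.get(
            "input_tokens", usage.get("prompt_tokens", 0)
        )
        self.output_tokens += usage.get(
            "output_tokens", usage.get("completion_tokens", 0)
        )
        self.turns += 1

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


# Assumed example payloads, shaped like two different vendors' responses.
ledger = SessionLedger(workflow="support_triage")
ledger.record({"usage": {"input_tokens": 1200, "output_tokens": 300}})
ledger.record({"usage": {"prompt_tokens": 1500, "completion_tokens": 250}})
```

Once every session produces a ledger like this, tagged by workflow, the turn counts and token totals can feed the same percentile and trend checks used for any other production service.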
The same SLOs apply, measured one layer up:
- Cost per resolved task, by workflow. What does it actually cost in tokens for the AI agent to complete a customer support inquiry, a code review, a marketing brief? What’s the p95? Anything past a threshold gets sampled and reviewed.
- Task efficiency over time. Is the agent spending more tokens this month than last for the same work? That’s a regression even if total spend is flat, because volume usually isn’t.
- User-pushback rate as a leading indicator. If a team is having to redirect the AI more often this week than last, something has changed — and you’ll see it in pushback frequency before you see it in spend.
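The first two buyer-side metrics above can be sketched as follows. The tuple layout `(workflow, tokens, resolved)`, the function names, and the 10% tolerance are all illustrative assumptions. Deciding what counts as "resolved" is left to whatever completion signal the integration layer exposes.

```python
from collections import defaultdict


def cost_per_resolved_task(sessions):
    """Tokens spent per resolved task, keyed by workflow.

    `sessions` is an iterable of (workflow, tokens, resolved) tuples.
    Tokens burned on unresolved sessions still count toward the cost,
    so dead-end conversations make resolved tasks look more expensive.
    Workflows with no resolved tasks are omitted.
    """
    tokens = defaultdict(int)
    resolved = defaultdict(int)
    for workflow, spent, ok in sessions:
        tokens[workflow] += spent
        resolved[workflow] += 1 if ok else 0
    return {w: tokens[w] / resolved[w] for w in tokens if resolved[w] > 0}


def efficiency_regressions(previous, current, tolerance=0.10):
    """Workflows whose cost per resolved task rose by more than `tolerance`.

    Returns {workflow: fractional increase}. The default tolerance is an
    arbitrary placeholder, not a recommended threshold.
    """
    regressions = {}
    for workflow, cur in current.items():
        prev = previous.get(workflow)
        if prev and cur > prev * (1 + tolerance):
            regressions[workflow] = cur / prev - 1
    return regressions


# Assumed example: same workflow, two months of sessions.
last_month = cost_per_resolved_task([
    ("support", 1000, True),
    ("support", 1000, True),
])
this_month = cost_per_resolved_task([
    ("support", 1500, True),
    ("support", 1500, False),  # abandoned session, tokens still billed
])
flagged = efficiency_regressions(last_month, this_month)
```

Dividing by resolved tasks instead of total sessions is a deliberate choice here: it makes wasted conversations visible in the metric even when total spend looks flat.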
This is the natural next move for any organization that takes FinOps seriously. AI spend behaves more like an operational service than a static utility. The teams that treat it that way will know which workflows are getting cheaper as models improve, and which are silently regressing because nobody is looking.
Where this is heading
This is the same pattern that played out in cloud infrastructure a decade ago. The hyperscalers won not because their underlying technology was uniquely better, but because their operational reliability was measurable, public, and improving on a known cadence. Customers paid for the predictability and built their own measurement discipline around tracking their consumption against expectations.
AI is heading to the same place, with pressure coming from both directions. Vendors who lead on operational reliability metrics will differentiate. Buyers who instrument their own AI workflows will discover the inefficiencies first and demand more from the platforms they pay for. The framework to do all of this already exists.
Companies on either side of that contract that move now will look much smarter in eighteen months than those still asking why their token bill keeps growing.