
Towards a Science of AI Agent Reliability

Intermediate
Stephan Rabanser, Sayash Kapoor, Peter Kirgis et al. · 2/18/2026
arXiv

Key Summary

  • Accuracy alone can make AI agents look good on paper while still failing in real life; this paper shows how to measure reliability properly.
  • The authors define four pillars of reliability—consistency, robustness, predictability, and safety—with twelve concrete metrics.
  • They evaluate 14 modern models on two complementary benchmarks (GAIA and τ-bench) and find that reliability improvements lag behind accuracy gains.
  • Outcome consistency is low across models, meaning agents often don’t repeat their own successes on identical tasks.
  • Robustness to simple prompt rephrasings varies widely, even when models handle technical faults reasonably well.
  • Calibration (how well confidence matches correctness) has improved in newer models, but discrimination (telling right from wrong) is mixed and sometimes worse.
  • Safety violations are rarer in frontier models, but financial mistakes still happen and rare high-severity errors remain critical.
  • The paper separates reliability from capability using normalized and ratio-based metrics and treats safety as a tail-risk constraint, not an average to blend away.
  • They provide an actionable recipe and metrics that developers, evaluators, and decision-makers can use today.
  • Bottom line: to trust AI agents in the real world, we must measure how they behave, degrade, and fail—not just how often they succeed.

Why This Research Matters

Real-world AI agents touch people, money, and data, so we must know more than their average score. This framework shows whether agents repeat good behavior, stay stable under harmless changes, and know when to ask for help. It makes rare but catastrophic failures visible instead of washing them away in averages, which is vital for trust and regulation. Teams can set deployment gates (e.g., minimum consistency and safety) instead of relying on a single accuracy number. Product owners can compare agents fairly across capability levels, and engineers can target fixes precisely (e.g., prompt robustness vs. outcome consistency). As we move from assistants to automation, these reliability metrics become the difference between helpful and harmful systems.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how a student can ace practice quizzes but still panic and make big mistakes on test day? The score on the practice sheet didn’t tell you how steady, calm, or safe their behavior would be when it really counts.

🥬 Filling (The Actual Concept)

  • What it is: This paper is about AI agent reliability—how steadily, safely, and predictably agents behave—not just how often they get the right answer.
  • How it works: The authors look at agent behavior from four angles: consistency (does it repeat its own success?), robustness (does it handle changes and bumps?), predictability (does its confidence match reality?), and safety (if it fails, how bad is it?). They turn each angle into measurable metrics and test many agents on two benchmarks.
  • Why it matters: Real-world use needs more than good averages. We need to know if the agent repeats good behavior, stays stable under small changes, knows when it’s unsure, and avoids harmful mistakes.

🍞 Bottom Bread (Anchor) Imagine a delivery robot that reaches your house 9 out of 10 times. If the 10th time it drives into traffic, you’d say it’s not reliable—even though its average is high. That’s the difference between accuracy and reliability.

THE WORLD BEFORE Before this research, most agent evaluations focused on a single number: accuracy (the percent of tasks solved). That’s simple and easy to compare, but it hides key facts:

  • It doesn’t show whether the agent repeats its own success when you run the same task twice.
  • It doesn’t show how the agent handles tiny input changes, tool hiccups, or environment updates.
  • It doesn’t show if the agent’s confidence is trustworthy.
  • It doesn’t show how bad failures are when they happen.

This mismatch led to painful surprises in the real world: an AI coding assistant deleted a production database; a shopping agent made an unauthorized purchase; a city chatbot confidently gave illegal advice. On paper, these systems looked capable. In practice, they were unreliable.

THE PROBLEM The key challenge: How do we define and measure reliability for AI agents? Borrowing from safety-critical fields (like aviation and nuclear power), the authors argue that reliability is multi-dimensional. You must measure not just how often it works, but how it behaves when conditions change, how it fails, and how bad those failures can be.

FAILED ATTEMPTS People tried ad-hoc fixes: multiple attempts (pass@k), single-run scores, or narrow robustness tests. But these didn’t form a complete picture. A system that sometimes gets it right in multiple tries may still be unpredictable; a system that scores well once may fall apart with a minor wording change. Without a full, principled set of metrics, teams could not foresee real-world breakdowns.

THE GAP We were missing a holistic, accuracy-independent profile—a way to compare reliability across agents of different capability levels. We also lacked standard ways to test behavior under faults, environment changes, paraphrases, and to separate how often failures happen from how severe they are.

REAL STAKES Why should anyone care? Because deployed agents take actions that touch people, money, and data.

  • A refund bot must treat equal customers equally, every time (consistency).
  • A flight-booking helper must survive a minor API change tomorrow (robustness).
  • A code-review agent must not overstate its certainty (predictability).
  • A database agent must never make rare, catastrophic mistakes (safety).

Without reliability, even a high-accuracy agent can cause real harm, legal risk, or lost trust.

Concepts in Plain Talk (Sandwich Pattern)

  1. Reliability 🍞 Hook: Imagine your favorite toaster. You love it because it toasts your bread the same way every morning without surprises. 🥬 Concept: Reliability is how steadily and safely an AI agent behaves across situations and over time.
  • How it works: We look beyond getting the answer right and measure repeatability, stability under changes, honest confidence, and bounded harm.
  • Why it matters: A toaster that sometimes burns down the kitchen is not a good toaster—even if it makes perfect toast most days. 🍞 Anchor: A reliable calendar assistant won’t suddenly delete all your meetings when you ask it to "reschedule Tuesday."
  2. Consistency 🍞 Hook: You know how a vending machine should give the same snack every time you press B7? 🥬 Concept: Consistency means the agent does the same thing under the same conditions.
  • How it works: We check if outcomes match across runs, if action sequences are similar, and if resource use (time, tokens, cost) is steady.
  • Why it matters: If identical requests flip between success and failure, you can’t plan or trust the system. 🍞 Anchor: A refund bot that approves a claim on Monday but denies the same claim on Tuesday is inconsistent.
  3. Robustness 🍞 Hook: A sturdy umbrella works in drizzle and gusty wind, not just in perfect weather. 🥬 Concept: Robustness means the agent handles small changes and bumps without falling apart.
  • How it works: We test paraphrased prompts, injected faults (like timeouts), and environment changes (like JSON fields reordered).
  • Why it matters: Real systems change. If tiny shifts break the agent, it won’t last in production. 🍞 Anchor: A travel agent should book the right flight whether you say "NYC Friday AM" or "New York City, Friday morning."
  4. Predictability 🍞 Hook: When your friend says, "I’m 80% sure," you expect they’re right about 8 times out of 10. 🥬 Concept: Predictability means the agent’s confidence matches reality and helps you decide when to trust it.
  • How it works: We measure calibration (confidence ≈ correctness) and discrimination (higher confidence on right answers than wrong ones).
  • Why it matters: If the agent sounds certain when it’s wrong, people will get misled. 🍞 Anchor: A code-review agent should only block a merge when it’s truly confident a bug exists—and be right about that level of confidence.
  5. Safety 🍞 Hook: Cars have seatbelts because even careful drivers can have accidents. 🥬 Concept: Safety means even when the agent fails, the damage stays bounded and acceptable.
  • How it works: We measure compliance with rules (no PII leaks, no unauthorized actions) and how severe harms are when violations happen.
  • Why it matters: Rare but catastrophic failures outweigh lots of small wins. 🍞 Anchor: Accidentally sorting a report wrong is minor; deleting the customer database is severe.
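The consistency pillar above can be made concrete with a toy metric. This is an illustrative sketch, not the paper's exact definition: score a batch of repeated runs by the fraction of run pairs whose outcomes agree.

```python
from itertools import combinations

def outcome_consistency(outcomes: list[bool]) -> float:
    """Fraction of run pairs that agree (both succeed or both fail).

    Illustrative stand-in for the consistency idea: 1.0 means the agent
    always repeats its own result on identical inputs; values near 0.5
    mean outcomes flip run to run.
    """
    pairs = list(combinations(outcomes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Five identical runs of the same refund request:
print(outcome_consistency([True, True, True, True, True]))    # -> 1.0 (perfectly repeatable)
print(outcome_consistency([True, False, True, False, True]))  # -> 0.4 (flip-flopping)
```

Note that both agents could report the same 60% "accuracy" while differing wildly in how trustworthy their behavior is run to run.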

02Core Idea

THE “AHA!” MOMENT (ONE SENTENCE) If we judge AI agents the way safety-critical fields judge complex systems—by consistency, robustness, predictability, and safety—then we finally see how they behave, degrade, and fail, not just how often they succeed.

MULTIPLE ANALOGIES (THREE WAYS)

  • Airplane analogy: Don’t just ask, “Does it fly on a calm day?” Ask, “Does it take off the same way every time (consistency), handle turbulence and instrument glitches (robustness), show honest cockpit readouts (predictability), and avoid catastrophic failure even if something breaks (safety)?”
  • Kitchen analogy: A great oven isn’t just hot; it heats evenly (consistency), bakes well even when the recipe wording changes (robustness), has a thermostat that matches the real temperature (predictability), and won’t explode if you press the wrong button (safety).
  • Basketball analogy: A star player isn’t just points-per-game; it’s reliable form (consistency), performance in loud away games (robustness), honest self-assessment of shot quality (predictability), and avoiding flagrant fouls that cost the team the season (safety).

BEFORE VS AFTER

  • Before: One big number (accuracy) made agents look fine until they faced minor word changes, tool hiccups, or tricky edge cases. Surprises popped up in production.
  • After: A 12-metric reliability profile shows whether an agent repeats successes, holds up under changes, knows when it’s unsure, and limits harm. We can compare agents fairly—even when their base accuracy differs—and make safer deployment decisions.

WHY IT WORKS (INTUITION, NO EQUATIONS)

  • Separate what from how: Capability says what the agent can do; reliability says how it does it across time and conditions.
  • Normalize and compare ratios: By controlling for raw accuracy and using ratios (perturbed vs. normal), we measure degradation, not absolute skill.
  • Treat tail risks specially: Averaging safety with other metrics can hide rare disasters. Reporting safety separately keeps catastrophic risks visible.
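The "normalize and compare ratios" idea can be sketched in a few lines. The function name, the clipping at 1.0, and the zero-baseline convention are our own illustrative choices; the paper's exact normalization may differ.

```python
def robustness_ratio(acc_perturbed: float, acc_normal: float) -> float:
    """Degradation under perturbation as a ratio of accuracies.

    A ratio near 1.0 means the agent keeps its baseline skill under the
    perturbation; lower values mean it loses a larger share of it.
    """
    if acc_normal == 0:
        return 0.0  # no baseline skill to measure degradation against
    return min(acc_perturbed / acc_normal, 1.0)

# Two agents each lose 10 accuracy points under paraphrasing, but the
# weaker agent loses a larger share of its skill:
print(robustness_ratio(0.80, 0.90))  # ~0.89
print(robustness_ratio(0.30, 0.40))  # ~0.75 -- less robust, not just less capable
```

The ratio is the point: it measures degradation independently of raw accuracy, so weak and strong agents can be compared on the same scale.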

BUILDING BLOCKS (SANDWICH EXPLANATIONS FOR NEW CONCEPTS)

  6. Performance Metrics 🍞 Hook: Imagine a scoreboard that shows not just the final score, but turnovers, fouls, and shot charts. 🥬 Concept: Performance metrics are the different scorekeepers we use to understand agent behavior from many angles.

  • How it works: Twelve metrics form four pillars—consistency (outcome, trajectory, resource), robustness (prompt, environment, faults), predictability (calibration, discrimination, overall proper scoring), and safety (compliance, harm severity).
  • Why it matters: One number can’t capture complex behavior; many small meters together reveal the truth. 🍞 Anchor: A car report lists miles per gallon, braking distance, crash ratings, and maintenance costs—not just top speed.
  7. Tail Risks 🍞 Hook: Most days are sunny, but you still buy storm insurance for the rare hurricane. 🥬 Concept: Tail risks are rare events with big consequences; you can’t ignore them just because they don’t happen often.
  • How it works: The safety pillar separates “how often bad things happen” from “how bad they are” and avoids hiding disasters in averages.
  • Why it matters: A once-in-a-thousand catastrophic failure can still be unacceptable. 🍞 Anchor: A payment agent that wrongly refunds $0.50 each month isn’t great, but one that occasionally sends $50,000 by mistake is a showstopper.
  8. Failure Modes 🍞 Hook: Bikes can fail by flat tire, broken chain, or loose brakes; each needs a different fix. 🥬 Concept: Failure modes are the specific ways an agent goes wrong—like inconsistent decisions, brittle parsing, overconfidence, or unsafe actions.
  • How it works: The four pillars organize failures into repeatability, change-handling, confidence honesty, and consequence severity.
  • Why it matters: Naming the failure helps you test for it and fix it. 🍞 Anchor: If an agent flips behavior when a JSON field name changes, you’ve found a robustness failure mode.

Putting It Together

  • The framework: Build a reliability profile per agent with the 12 metrics; separate safety; aggregate pillars carefully; compare agents fairly across capability levels.
  • Actionability: Use the metrics to set deployment gates, guide training focus (e.g., boost consistency and discrimination), and design benchmarks that mimic real-world drift.
  • Key insight: Reliability is not a side quest of capability; it’s a co-pilot that keeps capability useful in the wild.

03Methodology

AT A HIGH LEVEL: Input → [Run many times] → [Shake the setup] → [Ask for confidence] → [Check rules and harms] → [Aggregate into a reliability profile]

Step-by-step Recipe

  1. Multi-run evaluation (measuring consistency)
  • What happens: For each task, the agent runs K=5 times under identical settings (temperature set to zero when possible). We log success/failure, the sequence of tool actions, and resource use (tokens, time, cost).
  • Why this step exists: If identical inputs produce different outcomes, plans, or costs, you have a consistency problem that accuracy alone cannot reveal.
  • Example: The same refund request is executed five times. If approvals flip and the action order changes unpredictably, outcome and trajectory consistency are low; if token usage swings 10×, resource consistency is low.
  2. Prompt perturbations (prompt robustness)
  • What happens: Each task is paraphrased J=5 different but semantically equivalent ways (including rephrasings and translations). The agent runs on these paraphrases.
  • Why this step exists: Real users won’t use the same magic words. If meaning-preserving changes break the agent, it will fail in production.
  • Example: “Cancel my subscription” vs. “Please end my plan” vs. “Terminate my membership.” If accuracy drops on paraphrases, R_prompt goes down.
  3. Fault injection (fault robustness)
  • What happens: With probability p_fault=0.2, API calls, auth, or tool responses randomly fail (timeouts, malformed results). We rerun tasks under these injected faults.
  • Why this step exists: Networks hiccup, tools crash. A robust agent retries gracefully or uses fallbacks.
  • Example: A web fetch times out. A robust agent retries or switches tools and still completes the task; a brittle agent quits or hallucinates content.
  4. Environment perturbations (environment robustness)
  • What happens: We change response formats or tool schemas (e.g., reorder JSON fields, rename parameters, shift date formats) without changing semantics.
  • Why this step exists: APIs evolve. Agents that rely on positional or brittle parsing break silently.
  • Example: {price, carrier, departure} becomes {carrier, departure, price}. If the agent still extracts the right field, R_env stays high.
  5. Confidence estimation (predictability)
  • What happens: After solving, the agent self-reports a confidence score in [0,1]. We compare these to actual correctness to measure calibration and discrimination (plus an overall proper scoring like Brier).
  • Why this step exists: Users need to decide when to trust, escalate, or defer. Overconfident wrong answers are dangerous.
  • Example: If 80%-confident outputs are correct about 80% of the time, calibration is good. If correct answers usually score higher than wrong ones, discrimination is good.
  6. Safety judging (safety)
  • What happens: We define constraints (e.g., no PII leaks, confirm purchases, no destructive ops). An LLM judge checks traces for violations and labels harm severity (low/med/high). We compute compliance (no violations) and average harm for violating cases.
  • Why this step exists: Not all errors are equal. Rare severe harms must be visible and minimized.
  • Example: Accidentally exposing a customer’s email is a violation; deleting a database is high-severity harm.
  7. Aggregation and disentangling from capability
  • What happens: Within each pillar, we average sub-metrics (e.g., outcome, trajectory, resource for consistency; prompt, environment, fault for robustness). Safety is reported separately as 1 − (violation probability × expected severity). The overall reliability R uniformly averages the three non-safety pillars.
  • Why this step exists: We want a compact profile without letting any one sub-metric dominate, and we must not hide tail risks by averaging safety away. Normalizations and ratios isolate reliability from raw accuracy so weaker and stronger agents can be compared fairly.
  • Example: An agent with solid calibration and prompt robustness but shaky outcome consistency will show a mixed profile, guiding targeted improvements.
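The fault-injection setup described above (tool calls randomly failing with p_fault = 0.2) can be sketched with a small wrapper. The wrapper and its retry policy are hypothetical illustrations, not the paper's actual harness.

```python
import random

def with_fault_injection(tool_call, p_fault=0.2, max_retries=2, rng=None):
    """Make a tool call randomly fail with probability p_fault, mimicking
    the injected timeouts and malformed responses in the protocol above.
    The retry-on-fault policy is our own illustrative choice.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        for attempt in range(max_retries + 1):
            if rng.random() < p_fault:
                if attempt == max_retries:
                    raise TimeoutError("injected fault: tool call timed out")
                continue  # a robust agent retries gracefully
            return tool_call(*args, **kwargs)

    return wrapped

# With this seed the first call succeeds without hitting a fault:
fetch = with_fault_injection(lambda url: f"contents of {url}",
                             p_fault=0.2, rng=random.Random(0))
print(fetch("https://example.com/flights"))  # -> contents of https://example.com/flights
```

Running a benchmark with and without the wrapper yields the two accuracies whose ratio becomes the fault-robustness score.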

What Breaks Without Each Step

  • No multi-run: You miss flip-flopping behavior.
  • No perturbations: You miss brittleness to harmless changes.
  • No confidence: You can’t set thresholds for trust vs. defer.
  • No safety check: Rare but severe harms stay hidden.
  • No normalization/ratios: Reliability and capability get tangled, making comparisons unfair.
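The aggregation step can be sketched as follows. Only the aggregation shape follows the text (uniform averages within pillars, a uniform average of the three non-safety pillars into R, and safety kept separate as 1 − P(violation) × E[severity]); all the input values and sub-metric names are illustrative placeholders.

```python
def reliability_profile(consistency, robustness, predictability,
                        p_violation, expected_severity):
    """Aggregate sub-metric scores (each in [0, 1]) into a compact profile.

    Safety is deliberately NOT averaged into the overall reliability R,
    so rare catastrophic risks stay visible instead of being washed out.
    """
    def pillar(scores):  # uniform average within a pillar
        return sum(scores) / len(scores)

    R = (pillar(consistency) + pillar(robustness) + pillar(predictability)) / 3
    safety = 1 - p_violation * expected_severity
    return {"reliability": R, "safety": safety}

profile = reliability_profile(
    consistency=[0.55, 0.70, 0.60],     # outcome, trajectory, resource
    robustness=[0.65, 0.85, 0.80],      # prompt, environment, fault
    predictability=[0.75, 0.60, 0.70],  # calibration, discrimination, proper scoring
    p_violation=0.02, expected_severity=0.5,
)
print(profile)  # safety stays a separate, tail-risk-aware number
```

Keeping safety out of R is the design choice that matters: a high average can never compensate for an unacceptable tail risk.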

Concrete Mini-Run Example (GAIA-like task)

  • Task: “Find the author of the PDF on my desktop and email me a one-sentence summary.”
  • Runs: 5 repeats. On 2 runs, the agent opens the wrong file due to a small path parsing difference; on 3 runs, it finishes but uses different action sequences (file→browser→email vs. browser→file→email). Tokens range from 1k to 15k.
  • Signals: Low outcome and sequence consistency, high resource variance. With paraphrase “Please email a one-line recap and who wrote the PDF,” success drops—prompt robustness is weak. Confidence is 0.9 even when wrong—poor calibration. Safety: No destructive ops—good compliance.
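The predictability signals in this mini-run (an agent reporting 0.9 confidence even when wrong) can be quantified with simple stand-ins. Real evaluations use proper scoring rules such as the Brier score and binned calibration error; these one-liner versions are only illustrative.

```python
def calibration_gap(confidences, correct):
    """Gap between average confidence and actual accuracy -- a crude,
    one-bin stand-in for expected calibration error."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(mean_conf - accuracy)

def discrimination_auc(confidences, correct):
    """Probability that a randomly chosen correct answer gets higher
    confidence than a randomly chosen wrong one (AUROC-style; 0.5 = chance)."""
    rights = [c for c, ok in zip(confidences, correct) if ok]
    wrongs = [c for c, ok in zip(confidences, correct) if not ok]
    pairs = [(r, w) for r in rights for w in wrongs]
    return sum((r > w) + 0.5 * (r == w) for r, w in pairs) / len(pairs)

# The agent in the mini-run says 0.9 on every attempt, right or wrong:
confs, outcomes = [0.9] * 5, [True, True, True, False, False]
print(calibration_gap(confs, outcomes))     # ~0.3: overconfident by ~30 points
print(discrimination_auc(confs, outcomes))  # 0.5: no better than chance
```

Both numbers flag the same problem from different angles: the reported confidence carries no information about when to trust this agent.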

Secret Sauce (What makes this method clever)

  • Borrowed wisdom: It imports proven ideas from safety-critical engineering: measure repeatability, stress tests, confidence honesty, and consequence-aware risk.
  • Clean separation: Normalizations and ratios keep reliability separate from accuracy so we can judge agents fairly at different skill levels.
  • Tail-risk visibility: Reporting safety as a constraint (not averaged in) keeps catastrophic risks in plain sight.

04Experiments & Results

The Test: What was measured and why

  • Benchmarks: GAIA (general assistant tasks with browsing, files, multi-step reasoning) and τ-bench (customer service dialogs with consequential actions). They’re complementary: GAIA is open-ended; τ-bench is structured and action-heavy.
  • Models: 14 from OpenAI, Google, and Anthropic across capability tiers (spanning 2024–2026 releases), including newer “reasoning” variants.
  • Protocol: For each agent-benchmark pair: K=5 runs per task (consistency), J=5 paraphrases (prompt robustness), p_fault=0.2 (fault robustness), environment perturbations, confidence self-ratings, and LLM-based safety judging.
  • Goal: Build full reliability profiles (12 metrics), separate from raw accuracy, and study trends over time.

The Competition: What they compared against

  • Against themselves over time: Does reliability rise with newer releases like accuracy does?
  • Across families and sizes: Do bigger or reasoning models gain reliability uniformly?
  • Across benchmarks: Do results transfer from structured (τ-bench) to open-ended (GAIA) tasks?

The Scoreboard (with context)

  • Accuracy climbed steadily—like going from a solid B to a consistent A over 18 months.
  • Reliability rose only modestly—more like nudging from a C+ to a B−, and unevenly across tests. In other words, agents got smarter faster than they got steadier.
  • Consistency: Outcome consistency is often low across models. Even top models sometimes fail to repeat their own successes (think of a student who sometimes aces the same worksheet and sometimes blanks out). Distributional trajectory consistency is higher than sequence consistency—agents pick similar kinds of actions but in different orders, which complicates auditing and mid-execution interruption recovery. Resource use varies widely on GAIA; unpredictable token/time costs make planning hard.
  • Robustness: Surprisingly, many models handle injected technical faults and mild environment shifts reasonably (ceilings in some settings), but prompt robustness is a big separator: small, meaning-preserving paraphrases still break many agents. It’s like a car that survives a pothole but gets confused by a slightly different road sign.
  • Predictability: Calibration has improved—newer models, especially some Claude variants, better align confidence with correctness. But discrimination is mixed: it improved on τ-bench but often worsened on GAIA, meaning that on open-ended tasks some newer models aren’t reliably putting higher confidence on the answers they get right.
  • Safety: Violation rates are lower in frontier models; most violations are low-to-moderate severity. But rare high-severity failures (e.g., financial mistakes, data exposure) still occur and dominate deployment risk.

Surprising Findings

  • Bigger isn’t always steadier: Larger models sometimes have worse outcome or sequence consistency—more ways to solve a task can mean more variability run-to-run.
  • Reasoning helps, but not uniformly: Reasoning models tend to be more reliable than their non-reasoning counterparts, but their reliability gains don’t keep pace with their accuracy gains.
  • GAIA vs. τ-bench: On GAIA (open-ended), reliability progress is slower and discrimination worsens for several newer models; on τ-bench (structured, consequential actions), reliability shows clearer improvements. This suggests environment structure matters a lot.
  • Benchmark quality matters: Using a “clean” subset of τ-bench (with fixed grading errors) boosted predictability and safety scores—showing how label and spec correctness directly affect measured reliability, especially calibration.

Meaningful Numbers (translated)

  • Think of accuracy trending like improving from 80% to 90% over time. Reliability, in contrast, might move from 55–60% to only around 65%—better, but not nearly enough for hands-off use in high-stakes tasks.
  • Prompt robustness ranges from shaky to solid across models; the spread here is one of the biggest between-model differentiators.
  • Safety: High-severity violations are rare, but a single severe case can outweigh many small wins—hence reporting safety separately, not averaged in.

Real-world Echoes

  • The framework would have flagged the famous incidents: inconsistent outcomes (NYC bot), rule non-compliance (unauthorized purchases), and severe-harm potential (destructive operations)—all visible with these metrics before deployment.

05Discussion & Limitations

Limitations (honest assessment)

  • Benchmark coverage: Only two benchmarks (GAIA and τ-bench). Real deployments are broader (tools, domains, longer horizons), so the results are not a universal map.
  • Single scaffold per benchmark: Different agent scaffolds can change behavior and thus reliability profiles; results may vary with architecture choices.
  • LLM-based safety judging: Scalable but imperfect; judges themselves can be inconsistent. Human validation and judge-free checks would strengthen conclusions.
  • Metric choices and weights: Reasonable people may prefer slightly different definitions or aggregations by use case; safety separated from the aggregate R is a principled choice but leaves no single “all-in” number.
  • Capability disentanglement: Normalization and ratio tricks help, but no method is perfect across all tasks and domains.
  • Zero temperature: This reduces one big source of randomness; some real deployments will use nonzero temperature for creativity/accuracy tradeoffs, possibly lowering measured reliability.

Required Resources

  • Access to multiple models, tool-enabled agent scaffolding, and the ability to re-run tasks many times under controlled perturbations.
  • Infrastructure for fault injection, environment mutation, paraphrase generation, and trace logging.
  • Safety judging pipeline (LLM-based or human-in-the-loop).

When NOT to Use This As-is

  • Pure generation tasks where diversity is desired (e.g., brainstorming)—trajectory consistency might be a feature to tune, not a requirement to maximize.
  • Adversarial threat modeling: The paper models natural variations and incidental faults, not worst-case attackers.
  • Ultra-long sessions: The study focuses on episodic tasks; reliability over multi-hour/days-long runs needs extended methods.

Open Questions

  • Long-horizon reliability: How do errors accumulate across long sessions with memory and evolving state?
  • Multi-agent systems: How do failures propagate and how can aggregation/debate be made reliably robust?
  • Better monitors: Which online signals best predict impending failures without relying only on self-reported confidence?
  • Specification and verification: How do we blend natural-language intents with formal safety constraints and runtime validators?
  • Robustness taxonomy: What realistic, parameterized perturbations best simulate production shifts (APIs, schemas, user behavior)?

06Conclusion & Future Work

Three-sentence summary: AI agents look strong on accuracy but still fail unpredictably in the wild; this paper borrows from safety-critical engineering to define reliability as consistency, robustness, predictability, and safety. It turns these pillars into twelve measurable metrics that are disentangled from raw capability and then evaluates 14 models on GAIA and τ-bench, finding reliability gains lagging behind accuracy. The result is a practical reliability profile that reveals how agents behave, degrade, and fail—information accuracy alone cannot provide.

Main Achievement: A principled, multi-metric, accuracy-independent framework—and accompanying evaluation protocol—that exposes reliability weaknesses (especially consistency and predictability) and treats safety as a tail-risk constraint rather than an averageable score.

Future Directions: Extend to long-horizon and multi-agent settings; broaden benchmarks with parameterized, generative environments; strengthen judge reliability with human validation or rule-based checks; design training objectives that directly optimize reliability pillars; develop online monitors that anticipate failures.

Why Remember This: Because deploying agents safely isn’t about one big number. Reliability is the difference between a clever demo and a trustworthy system: stable across runs, sturdy under change, honest about uncertainty, and bounded in harm when things go wrong.

Practical Applications

  • Set reliability thresholds (consistency, robustness, calibration) before promoting an agent from pilot to production.
  • Continuously retest agents with paraphrases, API/schema changes, and fault injection to catch drift before users do.
  • Use confidence calibration and discrimination to route low-confidence cases to humans (selective automation).
  • Adopt safety checks (compliance rules and severity audits) as hard gates around destructive or financial actions.
  • Compare vendors using the 12-metric profile to pick agents that fit your risk tolerance, not just highest accuracy.
  • Plan budgets and SLOs using resource consistency to prevent surprise latency or token-cost spikes.
  • Focus model/scaffold training on weak pillars (e.g., improve outcome and sequence consistency via planning stabilizers).
  • Design parameterized benchmarks that simulate production shifts (renamed fields, new tool versions) to harden robustness.
  • Add runtime monitors that watch confidence, action entropy, and tool error patterns to preempt failures.
  • Document incident post-mortems using the four pillars to speed root-cause analysis and targeted fixes.
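The selective-automation idea from the list above can be sketched as a confidence-threshold router. The function, labels, and the 0.8 threshold are hypothetical; in practice the threshold would be tuned against the agent's measured calibration curve.

```python
def route(task_result, confidence, threshold=0.8):
    """Selective automation: auto-accept high-confidence outputs and
    escalate everything else to a human reviewer.

    Only meaningful if the agent's calibration is good -- otherwise the
    confidence number carries no signal to route on.
    """
    if confidence >= threshold:
        return ("auto", task_result)
    return ("human_review", task_result)

print(route("refund approved", confidence=0.95))  # ('auto', 'refund approved')
print(route("refund approved", confidence=0.55))  # ('human_review', 'refund approved')
```

This is why the paper's calibration and discrimination metrics matter operationally: they determine whether a routing threshold like this one is safe to rely on at all.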
#AI agent reliability #consistency #robustness #predictability #safety #calibration #prompt robustness #fault injection #environment perturbations #Brier score #compliance #harm severity #GAIA benchmark #τ-bench