Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Key Summary
- This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
- The key idea, called Calibrate-Then-Act (CTA), gives the agent a simple map of what is likely true (priors) and how much actions cost.
- With CTA, even small models can balance accuracy and speed better, like choosing to test code only when it’s worth it.
- In a toy “Pandora’s Box” game, adding priors made the model match the optimal strategy about 94% of the time.
- In question answering, CTA helped the model retrieve information only when needed, earning the best overall reward.
- In coding tasks, CTA plus reinforcement learning (CTA-RL) beat standard RL by adapting behavior to different cost settings.
- Calibrating confidence (rescaling it to be accurate) was crucial; it slashed error in confidence from 0.618 to 0.029.
- A tiny BERT prior-predictor learned useful file-format hints from filenames and guided the coding agent effectively.
- End-to-end RL alone didn’t reliably learn the right trade-offs, but adding explicit priors did.
- CTA offers a general recipe for cost-aware exploration across many tools and environments.
Why This Research Matters
CTA helps AI agents stop wasting time and money on steps that don’t pay off, and avoid rushing into wrong answers. It turns vague hunches into calibrated numbers, so the agent can do simple math to decide whether to explore or act. This is useful for coding assistants, research helpers, and customer-support bots that use tools like search. By making decisions adaptive to changing costs, systems stay responsive when speed is crucial and careful when accuracy matters most. The approach also plays nicely with RL, improving generalization rather than locking into one routine. In short, CTA delivers better answers, faster and cheaper, across many real-world tasks.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how before a big test you decide whether to quickly answer questions you’re sure about or spend extra time double-checking the tricky ones? That choice balances speed (finishing on time) and accuracy (getting them right).
🥬 Filling (The Actual Concept) What it is: Many AI agents today must choose between acting now or exploring for more info, and each choice has a cost. How it works:
- The agent can take steps that gather information (like running a web search or a unit test) or it can commit to an answer.
- Each extra step costs something: time, money, or user patience.
- Good behavior means exploring only when the expected gain is bigger than the cost. Why it matters: Without smart trade-offs, agents either waste time over-checking or rush and make mistakes. 🍞 Bottom Bread (Anchor) Imagine a coding helper: if it’s unsure about a CSV file’s format, it might run a cheap test first. If it’s already very sure, it writes the code directly.
🍞 Top Bread (Hook) You know how you make decisions step by step in a board game, changing your plan as you learn new things? Each move affects the next.
🥬 Sequential Decision-Making What it is: A way to model choices that happen in a sequence, where each action gives you new info for the next step. How it works:
- Start with a goal and a current belief about the world.
- Pick an action (explore or commit) and see the result.
- Update your beliefs and choose the next action.
- Stop when you decide the benefits of more exploring are not worth the costs. Why it matters: Without thinking in steps, an agent can’t adapt to new information. 🍞 Anchor Like deciding whether to peek at one more face-down card in a game before placing your final bet.
🍞 Top Bread (Hook) Imagine guessing what’s inside a gift based on the wrapping before you open it.
🥬 Priors What it is: Priors are your starting beliefs about what’s likely true before seeing new evidence. How it works:
- Use clues (like a filename or past experience) to assign likelihoods to possibilities.
- Take an action that reveals information.
- Update beliefs and repeat until ready to decide. Why it matters: Good priors help you explore wisely instead of randomly. 🍞 Anchor A file named data.tsv likely uses tabs; that’s a strong prior that guides the first code attempt.
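The filename-to-prior idea above can be sketched in a few lines. This is a toy illustration only: the function name and the exact probabilities are made up, not the paper's learned predictor.

```python
def delimiter_prior(filename):
    """Toy prior over CSV delimiters from filename cues alone
    (probabilities are made up for illustration)."""
    name = filename.lower()
    if name.endswith(".tsv"):
        return {",": 0.05, ";": 0.05, "\t": 0.90}   # extension strongly suggests tabs
    if "semicolon" in name:
        return {",": 0.10, ";": 0.85, "\t": 0.05}   # explicit hint in the name
    return {",": 0.60, ";": 0.20, "\t": 0.20}       # generic base rates

delimiter_prior("data.tsv")   # tabs dominate, so try '\t' first
```

Even a crude prior like this ranks first attempts far better than guessing uniformly.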
🍞 Top Bread (Hook) Think of a stopwatch that reduces your points the longer you take.
🥬 Discount Factor (Cost of Time/Steps) What it is: A number between 0 and 1 that shrinks your reward for each extra step taken. How it works:
- Start with full reward.
- Each explore step multiplies the reward by the discount (less than 1).
- Commit too late and even correct answers are worth less. Why it matters: Forces agents to care about efficiency, not just accuracy. 🍞 Anchor Retrieving web documents may improve an answer, but it also delays the final reply.
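The discount mechanics can be written out directly (a minimal sketch; the default `gamma = 0.9` is an assumed value, not one from the paper):

```python
def discounted_reward(correct, n_explore_steps, gamma=0.9):
    """Final score: full credit for a correct answer, multiplied by the
    discount factor gamma once per exploration step taken."""
    return (1.0 if correct else 0.0) * gamma ** n_explore_steps

discounted_reward(True, 0)   # -> 1.0: correct and immediate
discounted_reward(True, 2)   # -> 0.9 * 0.9 = 0.81: correct but two steps late
discounted_reward(False, 1)  # -> 0.0: wrong answers score nothing, explored or not
```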
🍞 Top Bread (Hook) You know how your gut feeling about an answer isn’t always right until you check it against real results?
🥬 Calibration What it is: Adjusting stated confidence to match true correctness rates. How it works:
- Ask the model to report how sure it is.
- Compare that to reality on a dataset.
- Learn a mapping (like isotonic regression) that turns raw confidence into accurate probabilities. Why it matters: If confidence is off, the explore-or-act decision will be off too. 🍞 Anchor If “90% sure” turns out correct only 60% of the time, you recalibrate so future 90% claims are truly 90% right.
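As a minimal stand-in for that learned mapping, here is histogram-binning calibration, a simpler cousin of the isotonic regression mentioned above (all names and numbers are illustrative):

```python
def fit_binned_calibrator(confidences, correct, n_bins=5):
    """Histogram-binning calibration: map a raw confidence to the observed
    accuracy of past predictions that fell into the same confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append(y)
    # Per-bin empirical accuracy; fall back to the bin midpoint if a bin is empty.
    rates = [sum(b) / len(b) if b else (i + 0.5) / n_bins
             for i, b in enumerate(bins)]
    return lambda c: rates[min(int(c * n_bins), n_bins - 1)]

# An overconfident model: it says "90% sure" but is right only 6 times in 10.
stated = [0.9] * 10
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
calibrate = fit_binned_calibrator(stated, actual)
calibrate(0.9)   # -> 0.6: a future "90% sure" is now treated as 60%
```

Isotonic regression does the same job with a monotone fitted curve instead of fixed bins.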
🍞 Top Bread (Hook) Picture two strategies: always check everything first, or never check and just guess. Neither is best all the time.
🥬 Pareto-Optimal Exploration What it is: Choosing actions so you can’t improve speed without hurting accuracy (and vice versa). How it works:
- Compare the value of acting now vs. exploring more.
- Pick the action that gives the best balance given current costs and uncertainty.
- Adapt as the situation changes. Why it matters: Fixed, one-size-fits-all strategies leave performance on the table. 🍞 Anchor Sometimes writing one cheap unit test beats a risky full run; other times, a confident direct answer saves time.
🍞 Top Bread (Hook) Imagine you first check the weather forecast (your prior), then decide whether to carry an umbrella or just go.
🥬 Calibrate-Then-Act (CTA) What it is: A framework that first supplies the agent with calibrated priors about the world, then lets it choose actions. How it works:
- Estimate priors (e.g., model’s chance to answer without searching, or likely file format from filename).
- Present these priors and action costs to the agent.
- The agent reasons about the expected value of exploring vs. committing and acts. Why it matters: Separating “what’s likely” from “what to do” makes the agent’s decisions more reliably optimal. 🍞 Anchor If the agent believes it’s 80% likely the delimiter is ';', it may try code first; if only 30%, it might run a unit test.
The world before: LLM agents were good at answering single-turn questions but struggled in multi-step settings where they had to decide whether to search the web, run tools, or verify assumptions. Many systems used fixed routines (e.g., always ask a clarifying question, always read files first), which wasted time on easy tasks and rushed through hard ones.
The problem: How can an agent weigh the cost of extra steps against the benefit of being more certain? This balance depends on two slippery things: uncertainty (how likely current beliefs are correct) and costs (how much each extra step reduces the final reward).
Failed attempts: End-to-end reinforcement learning tried to teach agents everything directly from reward. But uncertainty estimation and decision making got tangled, and the learned behavior often collapsed to a single habit (like always testing first). Pure prompting without priors left the model guessing; uncalibrated confidence misled its choices.
The gap: Agents needed explicit, calibrated priors and a clear view of action costs to reason abstractly about the explore-or-act trade-off.
Real stakes: This choice shows up everywhere—search vs. answer now, test code vs. ship, order another lab test vs. diagnose. Better trade-offs mean lower bills, faster answers, fewer mistakes, and happier users.
02 Core Idea
🍞 Top Bread (Hook) Imagine you’re choosing between peeking at a few puzzle pieces (which takes time) or just solving the puzzle now if you think you’ve got it.
🥬 The Aha! Moment What it is: Give the agent accurate priors about uncertainty and explicit costs, then let it decide—to explore or to act—based on expected value. How it works:
- Estimate priors (what’s likely true before acting).
- Calibrate confidence so probabilities reflect reality.
- Show costs (discounts) for each type of exploration.
- Let the agent compare “act now” vs. “gather info” and pick the better one. Why it matters: Without priors and costs in view, agents can’t reliably balance accuracy and efficiency. 🍞 Anchor A question-answering agent sees: “I’m 80% sure I know this. Retrieving helps to 90% but costs time.” If the time penalty is mild (discount factor near 1), it may retrieve; if the penalty is steep, it answers now.
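The “act now vs. gather info” comparison reduces to a one-line rule. This sketch (hypothetical function name; the numbers follow the 80%-vs-90% anchor above) shows how the same beliefs flip the decision as the discount changes:

```python
def should_explore(p_now, p_after_explore, discount):
    """Explore only if the discounted expected reward of exploring
    beats the expected reward of committing immediately."""
    return p_after_explore * discount > p_now

should_explore(0.80, 0.90, discount=0.95)   # 0.855 > 0.80 -> True: retrieve
should_explore(0.80, 0.90, discount=0.70)   # 0.63 <= 0.80 -> False: answer now
```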
Three analogies:
- Doctor visit: If the diagnosis seems obvious (high prior), skip extra tests; if unclear and tests are cheap (discount factor near 1), run a test.
- Treasure hunt: If one chest is very likely (strong prior), open it first; if none stand out and checking is cheap, inspect more.
- Cooking: If the oven is probably at 350°F (prior) and checking the manual is quick (low cost), peek; if the timer is almost up (delay is costly), bake now.
🍞 Priors (revisited as a building block) What it is: Starting likelihoods about hidden facts. How it works:
- From filenames, past data, or verbalized confidence, estimate probabilities.
- Use these to rank actions.
- Update after observations (posterior thinking). Why it matters: Guides exploration order and when to stop. 🍞 Anchor In coding, a name like report_semicolon.csv signals a high chance of ';' as delimiter, steering first attempts.
🍞 Calibration (building block) What it is: Making confidence numerically honest. How it works:
- Collect pairs of (stated confidence, actual correctness).
- Fit a curve (isotonic regression) mapping raw to calibrated confidence.
- Use calibrated values in decisions. Why it matters: Overconfident or underconfident agents choose the wrong path. 🍞 Anchor After calibration, a model’s “70% sure” is truly correct about 70% of the time.
🍞 Discounted Reward (building block) What it is: A score that shrinks with each extra step. How it works:
- Start with full credit for correct answers.
- Multiply by a discount for each explore action.
- Commit too late, and payoff drops. Why it matters: Prevents endless exploring. 🍞 Anchor Every RETRIEVE step reduces the final reward, so only retrieve when it meaningfully boosts accuracy.
Before vs. After:
- Before: Agents followed static scripts (always retrieve, always test) or guessed based on fuzzy hunches.
- After: Agents evaluate expected value using explicit priors and costs, shifting smoothly between fast-and-confident vs. careful-and-verify.
Why it works (intuition, not equations): Expected value compares two paths: commit now (chance of being right times current reward) versus explore once (discounted reward times the chance exploration helps). When priors and costs are visible and confidence is calibrated, this comparison becomes straightforward and robust.
Building blocks summary:
- Priors from confidence or predictors (e.g., tiny BERT on filenames).
- Calibration to fix misestimated confidence.
- A cost model (discounts) per action type.
- A simple agent loop that chooses the higher expected value at each step.
🍞 Calibrate-Then-Act (CTA) (core concept) What it is: A two-stage recipe—first calibrate beliefs, then choose actions. How it works:
- Calibrate priors (confidence-to-probability and format likelihoods).
- Feed priors and costs into the LLM.
- The LLM reasons step-by-step about explore vs. commit.
- Stop when committing beats exploring. Why it matters: Decouples “knowing” from “doing,” making strategies more optimal and generalizable. 🍞 Anchor In the Pandora’s Box game, CTA makes the agent verify the most promising box first and stop when guessing is better than another costly check.
03 Methodology
At a high level: Input → Estimate/Calibrate Priors → Show Priors and Costs to the Agent → Agent Chooses Explore or Commit → Environment Replies → Repeat or Stop → Output.
🍞 Top Bread (Hook) Imagine a treasure map (priors) and a stopwatch (costs). Each peek at the map takes time, but helps you choose where to dig.
🥬 Sequential Decision Recipe What it is: Turning tools-and-steps into a loop that compares explore vs. act. How it works:
- Inputs: the question/task, available actions, and discounts (costs).
- Priors: estimate what’s likely true (e.g., model’s no-retrieval accuracy; file format likelihoods).
- Agent loop: at each step, compare expected value of committing now vs. exploring once more.
- Observations: after exploring, update what you know and repeat.
- Termination: commit when acting now beats exploring. Why it matters: Makes decisions adaptive and data-driven. 🍞 Anchor For a CSV task, start with the likely delimiter ';'. If uncertain and tests are cheap, run a delimiter unit test; else try the code immediately.
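The loop above can be sketched generically: keep exploring while one more (discounted) step is expected to beat committing now. The belief-update function and all numbers here are assumptions for illustration, not the paper’s implementation.

```python
def cta_loop(p_commit, p_after_step, discount, max_steps=5):
    """Generic explore-or-commit loop: keep exploring while the discounted
    belief after one more step still beats committing at the current belief.
    p_after_step maps the current success probability to the one expected
    after a single additional exploration step."""
    steps = 0
    while steps < max_steps and p_after_step(p_commit) * discount > p_commit:
        p_commit = p_after_step(p_commit)   # take the step, update the belief
        steps += 1
    return p_commit, steps

# Toy belief update: each step closes half the remaining gap to certainty.
gain = lambda p: p + (1 - p) / 2
p_final, n_steps = cta_loop(0.3, gain, discount=0.8)   # explores twice, then commits
```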
Step-by-step details and what breaks without each step:
- Define the environment (the world and moves)
- What happens: Model the problem as a partially observable process with actions (e.g., RETRIEVE, CODE, UNITTEST, ANSWER) and observations (documents, stdout, test results).
- Why it exists: Without a formal structure, the agent can’t evaluate what information each action reveals.
- Example: Running a unit test returns the true delimiter; that observation updates beliefs.
- Introduce costs via discounts
- What happens: Assign a discount factor to each exploration type (e.g., retrieval step or unit test) and multiply the final reward by these factors.
- Why it exists: Prevents infinite exploration; pushes the agent to be efficient.
- Example: If retrieval discount is 0.6, a correct answer after one retrieval counts 60% as much as answering immediately.
- Estimate priors
- QA priors (no-retrieval success):
- What happens: Ask the model for verbalized confidence, then calibrate it (isotonic regression) to get a reliable probability it can answer without search.
- Why it exists: Raw confidence is miscalibrated; decisions based on it will be wrong.
- Example: Calibration cut expected calibration error from 0.618 to 0.029 on PopQA.
- Coding priors (file formats):
- What happens: Train a tiny BERT on filenames to predict probabilities over delimiter, quotechar, skiprows.
- Why it exists: Filename cues often hint at format; explicit probabilities guide exploration.
- Example: The predictor reached about 67% average accuracy on format attributes.
- Calibrate-Then-Act prompting
- What happens: Put priors and costs directly into the prompt so the LLM can reason about them.
- Why it exists: Makes uncertainty and costs explicit, enabling the model to follow near-optimal rules.
- Example: In Pandora’s Box, with priors shown, the model verifies in the right order and knows when to stop.
- Reinforcement learning with priors (CTA-RL)
- What happens: During RL, keep feeding priors and costs so the learned policy keeps separating “what’s likely” from “what to do.”
- Why it exists: End-to-end RL alone tends to learn a single habit; conditioning on priors encourages adaptive policies across cost regimes.
- Example: In coding, CTA-RL beat vanilla RL on discounted reward by about 3.5%.
Task-specific implementations:
- Pandora’s Box (toy):
- Agent sees priors for each box and a discount. It either VERIFYs a box or GUESSes.
- Rule of thumb learned: verify the most likely first; commit when its posterior beats the discounted value of verifying again.
- Knowledge QA with optional RETRIEVE:
- Priors: p_no_context (calibrated from verbalized confidence) and p_with_context (retriever + LLM accuracy estimate).
- Decision: retrieve if p_with_context × discount ≥ p_no_context; otherwise, answer now.
- Outcome: fewer unnecessary retrievals and higher overall reward.
- Coding with selective testing:
- Priors: p(delimiter | filename), p(quotechar | filename), p(skiprows | filename).
- Actions: UNITTEST to reveal exact format, CODE to try a solution, ANSWER to finish.
- Strategy: when code attempts are costly, favor unit tests; when tests are costly, try code early if priors are strong.
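The coding-task strategy in the last bullet can be roughed out as a two-step expected-value comparison. This is a deliberately simplified model, not the paper’s learned policy: the assumption that one discounted retry succeeds after a failed code attempt is ours, made for illustration.

```python
def choose_action(p_format, gamma_test, gamma_code):
    """First move in the file-reading task under a two-step model:
    compare the expected discounted reward of coding straight away
    against running a unit test first.
    p_format: calibrated prior that the guessed file format is right."""
    # Code-first: succeed immediately with prob p_format; otherwise assume
    # the failure is informative enough that one discounted retry succeeds.
    code_first = gamma_code * (p_format + (1 - p_format) * gamma_code)
    # Test-first: the unit test reveals the true format, then one correct
    # code attempt follows.
    test_first = gamma_test * gamma_code
    return "CODE" if code_first >= test_first else "UNITTEST"

choose_action(0.9, gamma_test=0.7, gamma_code=0.95)   # -> "CODE": strong prior, costly test
choose_action(0.3, gamma_test=0.95, gamma_code=0.7)   # -> "UNITTEST": weak prior, cheap test
```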
The secret sauce:
- Decoupling uncertainty estimation (calibrate priors) from action selection (expected value comparison) lets even medium-size LLMs discover near-optimal strategies.
- Explicit priors transform a fuzzy problem into a clear, math-like choice, which LLMs handle well in chain-of-thought.
- Conditioning RL on priors keeps policies flexible across changing costs rather than collapsing into a one-trick routine.
04 Experiments & Results
The test: Do priors and costs, made explicit, actually change behavior in the right way? The authors evaluate three settings.
- Pandora’s Box (toy, clear priors):
- Why: Prove the model can follow the optimal explore-or-act rule when priors and discounts are given.
- Compared to: Prompted models without priors and with/without “thinking” steps.
- Scoreboard:
- Oracle policy: optimal match 100%, reward 0.649.
- Prompted-NT (no thinking): 11% optimal match, reward 0.441.
- Prompted (thinking, no priors): 23% optimal match, reward 0.476.
- CTA-Prompted (priors + thinking): 94% optimal match, reward 0.625.
- Meaning: Adding explicit priors flips the model into the right decision mode; it starts verifying in the right order and stops at the right time.
- Knowledge QA with optional retrieval (PopQA):
- Why: See if the agent learns to retrieve only when worth it, balancing accuracy vs. time cost.
- Baselines:
- Never retrieve: accuracy 0.226, reward 0.226 (fast but often wrong).
- Always retrieve: accuracy 0.578, reward 0.213 (more right but slower, loses on discount).
- Multi-turn agents:
- Prompted-NT: retrieve ~98%, accuracy 0.619, reward 0.244 (retrieves almost always, overpays in cost).
- Prompted: retrieve 61.4%, accuracy 0.501, reward 0.283 (some cost-awareness, but noisy decisions).
- CTA-Prompted: retrieve 65.3%, accuracy 0.512, reward 0.293 (best reward overall; decisions align with the oracle boundary p_with_context × discount ≥ p_no_context).
- Surprising finding: Calibrated confidence was necessary; recalibration cut ECE from 0.618 to 0.029, making decisions reliably cost-aware.
- Coding with selective testing (FILEREADING):
- Why: Test a realistic tool-use loop where format uncertainty matters and costs differ for code vs. tests.
- Priors: a tiny BERT predicts format probabilities from filenames (67% average attribute accuracy).
- Results (averaged over cost ratios):
- Prompted: U=2.67 tests, C=1.42 code attempts, accuracy 0.958, reward 0.229.
- CTA-Prompted: U=2.51, C=1.41, accuracy 0.945, reward 0.240 (slightly fewer steps; better reward).
- RL (no priors): U=2.13, C=1.39, accuracy 0.997, reward 0.259 (strong, but policy becomes static: always test-first).
- CTA-RL: U=1.98, C=1.46, accuracy 0.991, reward 0.268 (best reward; adapts to cost settings and maintains strong accuracy).
- Interpretation: RL alone learned “test-first” regardless of costs (0% code-first), which is suboptimal in some regimes. Conditioning on priors (CTA-RL) produced adaptive behavior—more tests when code is relatively costly, more code-first when tests are costly—tracing the Pareto frontier of rewards across cost ratios.
Overall takeaways:
- Pandora’s Box: priors unlock near-optimal reasoning (94% policy match).
- QA: CTA-Prompted yields the best discounted reward by retrieving only when the math says so.
- Coding: CTA-RL generalizes better than RL, staying on the Pareto frontier across cost regimes.
- Bonus insight: Even simple prior estimators (verbalized confidence + isotonic regression; tiny BERT on filenames) are enough to induce better strategies.
05 Discussion & Limitations
Limitations:
- Prior quality matters: If priors are wrong or poorly calibrated, the agent’s explore-or-act choices may be suboptimal.
- Estimation overhead: Building and calibrating prior estimators (confidence calibration, tiny BERT) adds engineering steps.
- Domain shifts: Priors learned from past data may degrade if filenames or retrieval quality change over time.
- Partial observability: In more complex worlds, modeling all relevant latent variables can be hard.
Required resources:
- A base LLM capable of reasoning with prompts and tool feedback.
- A calibration pipeline (e.g., isotonic regression) for confidence.
- Optional lightweight predictors (e.g., tiny BERT) to estimate structural priors.
- For CTA-RL: RL infrastructure and cost-varied training episodes.
When not to use CTA:
- When exploration is effectively free (no latency/cost), simpler strategies might suffice.
- When priors are unavailable, unstable, or misleading, relying on them can hurt more than help.
- In ultra-high-stakes environments where mistakes are intolerable, you may need stronger verification regardless of costs.
Open questions:
- How to learn richer, task-agnostic priors that transfer across domains and tools?
- Can we calibrate multimodal confidence (text + code + tools) jointly rather than per-tool?
- How to make the agent update its priors online from outcomes without forgetting prior calibration?
- Can we automate cost estimation (dynamic discounts) from user preferences and system load in real time?
- What are the privacy and fairness impacts of priors derived from historical data (e.g., biased filename conventions)?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Calibrate-Then-Act (CTA), which separates uncertainty estimation (via calibrated priors) from action selection, so LLM agents can balance exploration costs against expected gains. With explicit priors and costs in the prompt—and optionally during RL—agents adopt near-optimal strategies in toy, QA, and coding settings. CTA consistently improves discounted reward and leads to adaptive, Pareto-optimal behavior where end-to-end RL alone often defaults to a single habit.
Main achievement: Showing that simple, explicit priors plus calibrated confidence reliably induce cost-aware, optimal-like exploration policies in LLM agents.
Future directions: Broaden prior estimation to richer domains (APIs, planning, science), learn online calibration, and integrate dynamic cost models that reflect real-time conditions. Explore safety-aware extensions where costs include risk and not just time or compute. Develop general-purpose prior predictors that transfer across tasks.
Why remember this: CTA is a clean, reusable recipe—first make beliefs honest and explicit, then act—that helps agents save time and money without sacrificing accuracy. It turns fuzzy hunches into numbers the model can reason with, producing smarter tool use and better user experiences across many applications.
Practical Applications
- Build QA agents that retrieve only when the expected accuracy gain beats the time penalty.
- Create coding assistants that run unit tests selectively based on calibrated uncertainty and code-run costs.
- Design customer-support bots that decide when to search a knowledge base vs. answer directly.
- Calibrate an LLM’s verbalized confidence with isotonic regression before deploying cost-aware decisions.
- Train tiny prior-predictors (e.g., filename-to-format) to guide tool actions with lightweight models.
- Augment RL training by conditioning policies on explicit priors to maintain adaptability across cost settings.
- Implement dynamic discounting to reflect real-time latency or API pricing and plug it into the CTA loop.
- Instrument agent pipelines to log confidence vs. correctness and continuously recalibrate in production.
- Use CTA prompts in multi-tool agents (browsers, code runners, calculators) to standardize explore-or-act choices.
- Prototype Pandora’s Box-style tasks to sanity-check that your agent follows expected optimal rules before deployment.