Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use
Key Summary
- This paper tackles a simple but serious question: can AI agents use paid tools to finish multi-step tasks without blowing the budget?
- The authors introduce INTENT, a lightweight, inference-time planner that guides a strong language model to make cost-aware decisions.
- Instead of predicting exact tool outputs, INTENT predicts whether a tool call will satisfy the agent's intention and uses that to estimate expected cost.
- It simulates an "ideal" future plan where each step works, then risk-adjusts the cost of each step by dividing by its predicted success probability.
- A safety knob called gamma lets users choose between more conservative or more aggressive spending behavior.
- On a cost-augmented version of StableToolBench, INTENT achieves 100% budget feasibility while greatly improving task success over prompting and other enforcers.
- INTENT stays robust when tool prices change, new tools appear, or budgets vary, all without retraining the main agent.
- The method needs only small auxiliary models and logs, not expensive reinforcement learning or heavy tree search.
- Case studies show INTENT doesn't just block bad calls; it nudges the agent toward a cheaper, steadier plan that still solves the task.
- This work reframes budget control for AI agents as intention-aware planning, making economic constraints a first-class part of decision making.
Why This Research Matters
Many real tools cost money, and AI agents increasingly rely on them to do useful work. INTENT lets us trust agents with spending limits, keeping them from overspending while still finishing tasks well. This makes AI safer and more deployable in fields like finance, law, IT operations, and research where budgets are strict. It adapts to changing tool prices and new tools without retraining the big model, which saves time and money. The approach is lightweight, making it practical to plug into existing systems. By treating budget awareness as a core objective, INTENT moves AI closer to being responsible, efficient helpers in the real world.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're shopping with a gift card that has a hard limit. You can fill your cart with helpful items, but if the total goes over the card's amount, the cashier says "Nope!" and you walk away empty-handed. That's exactly the kind of problem AI agents face when they must use paid tools to complete tasks within a strict budget.
The Concept: Large Language Models (LLMs) are quickly becoming agents that can think in steps and use external tools to get things done. But in the real world, many tools cost money per use, and the total spend must stay under a hard budget. The challenge is that tool outcomes are uncertain (sometimes searches fail, APIs vary, or prices change), and agents can't practice for free during deployment.
- How it worked before: Agents planned as they went, calling tools until they had enough info. Costs weren't tracked precisely, or only gently nudged with prompts like "You have $50 total."
- Why it broke: Prompting alone didn't stop budget overruns. Agents retried tools, followed brittle habits, or picked expensive ones too early. Reinforcement learning would be slow and outdated as tool markets change. And classical tree search is too heavy and assumes cheap exploration.
- What was missing: A way to guide a capable agent at inference time, without retraining, so it can foresee likely spending under uncertainty and still aim for success.
Anchor: Think about a school science fair: you have a $20 supply budget. If you buy expensive glitter glue first and it doesn't work, you may not have enough left for the poster board and markers you still need. AI agents face the same tradeoffs with paid tools.
Hook: You know how you can't see the future, but you can make a plan like, "If the first try works, I'll do X; if not, I'll do Y, and I need to save money for the ending"? That's the kind of foresight agents need.
The Concept: Sequential decision making is the idea of choosing the next action based on what's happened so far and what you expect to happen next. For agents using tools, each step (call a tool or stop) changes what they know and how much money remains.
- What people tried: Formulas like online knapsack (divide budget across options), extra training to learn budgets, or heavy planning (MCTS) that simulates many branches.
- Why it didn't work: Tool calls aren't independent; you need outputs from earlier calls to decide later ones. Retraining can't keep up with changing tool markets. And exploring many futures is too slow and costly.
Anchor: It's like planning a school day: if the bus is late (stochastic), you might miss a club meeting (dependency). You need a plan that respects time (budget) while adapting to surprises.
Hook: Picture an online app store for tools where prices shift like supermarket sales. One day a data API costs 10 credits; the next day it's 25.
The Concept: A dynamic tool marketplace is a changing collection of tools, each with its own price per use. The agent sees a snapshot for each task and must pick tools and their order, staying under budget.
- How it works (step by step):
- See user request, budget, and available tools with prices.
- Think a bit and propose a tool call or decide to answer.
- If calling a tool, pay the price, get a (possibly noisy) result, and update the budget.
- Repeat until you answer or run out of money.
- Why it matters: If you ignore price changes or tool reliability, you can overspend or choose a weak plan.
Anchor: It's like buying snacks for a field trip from a store where prices change daily; you must still feed everyone without going over the money you have.
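The step-by-step loop above can be sketched in a few lines of Python. This is a toy sketch, not the paper's environment: the tool name, price, reliability number, and the scripted stand-in "agent" are all invented for illustration.

```python
import random

def run_episode(propose, market, budget, seed=0):
    """One task: pay per call, results may fail, never exceed the hard budget."""
    rng = random.Random(seed)
    remaining, history = budget, []
    while True:
        action = propose(history, remaining)         # ("call", name) or ("answer", text)
        if action[0] == "answer":
            return action[1], remaining
        price, reliability = market[action[1]]
        if price > remaining:                        # can't afford even one more call
            return None, remaining                   # fail rather than overspend
        remaining -= price                           # pay per use, success or not
        history.append((action[1], rng.random() < reliability))  # noisy outcome

# Scripted stand-in for the agent: retry SearchAPI until one call succeeds.
def propose(history, remaining):
    return ("answer", "done") if any(ok for _, ok in history) else ("call", "SearchAPI")

market = {"SearchAPI": (7, 0.8)}                     # 7 credits per call, 80% reliable
answer, left = run_episode(propose, market, budget=50)
```

Note how the retry behavior makes spending stochastic: the same plan can cost one call or several, which is exactly why a budget-aware planner has to reason about expected rather than nominal cost.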
Hook: Imagine you're playing a game where each action can cost coins and might fail. You'd want a coach that says, "If you follow this path, you'll probably spend too much; try a cheaper route."
The Concept: The paper's central problem is budget-constrained tool use: solve tasks with tool calls while never exceeding a hard budget.
- Before: Agents often ignored costs or learned fuzzy rules that didn't hold in new markets.
- Problem: Tool results are uncertain, retries burn money, and you can't just try everything.
- Gap: We need a fast, inference-time planner that predicts future cost risks and gently steers the agent without heavy search or retraining.
Anchor: Like a GPS that avoids toll roads when you have only a few dollars left, the planner must guide the route to finish the trip without running out of money.
02 Core Idea
Hook: You know how when you bake cookies, you don't need to predict the exact shape of each cookie to know if the batch will turn out fine; you just need to know whether the dough is good enough to bake?
The Concept: The "aha!" insight is this: instead of predicting the exact content of every future tool output, predict whether each tool call will satisfy the agent's intention, and use that to estimate how much money the plan will probably cost. That's what INTENT does.
- How it works:
- Extract the plan the agent seems to be aiming for (its intention).
- Simulate an ideal future where each step "works as intended."
- For each step, estimate how many tries it might take by using a success probability.
- Add up a risk-adjusted cost for the whole plan and compare to the remaining budget.
- If it fits, allow the action; if not, block it and give helpful feedback to re-plan.
- Why it matters: Without this, agents either overspend or get too scared to try useful tools.
Anchor: It's like planning a school play: you don't need to know every line to budget for costumes and props. You just need to know which scenes will definitely happen and how likely each special effect will work.
Hook: Imagine three different explanations of the same idea.
The Concept (3 analogies):
- Grocery trip: You plan meals (intention) and estimate if buying eggs will probably work (success chance). If eggs often run out, you budget extra time/money for a second store. You don't predict the exact egg carton details; you just plan around whether you'll likely succeed.
- Road trip: You don't predict the exact shape of clouds; you just see if a route likely stays open. If a bridge is often closed (low success), you budget for a detour (higher cost) ahead of time.
- Video game: You don't predict every enemy move; you check if a strategy usually clears the level. If it's risky, you bring extra potions (budget) or pick a safer route.
Anchor: In all three, you judge whether each step likely works and budget accordingly, not the exact fine-grained future.
Hook: What changes with INTENT?
The Concept: Before vs. After
- Before: Agents either overspent (chasing info) or got too cautious (never trying good but costly tools). Prompting helped but didn't guarantee staying under budget.
- After: INTENT enforces hard budget feasibility by risk-adjusting the plan's cost and gently steering choices, all while keeping the agent's core skills intact.
- Why it works: It focuses on intention satisfaction (did the tool achieve what the reasoning wanted?), which is more stable than predicting exact content in noisy environments.
Anchor: It's like a coach who says, "This play usually works 70% of the time; save enough energy in case you need a second try," instead of trying to predict every bounce of the ball.
Hook: How can we trust the math without equations?
The Concept: Why it works (intuition)
- Key idea: Most agents, when following a plan, keep trying a step until it works (within reason). If step k has a success chance π, then on average you'll need about 1/π tries. So the expected money for that step is price/π.
- INTENT adds a safety knob γ (gamma). Multiply the total expected cost by γ to be more conservative (>1) or less (<1).
- Simulating an "ideal plan" reveals the agent's latent intentions without wandering down failure loops.
š Anchor: If a basketball shot goes in 50% of the time and each shot costs a ticket, you expect to spend two tickets per basket. If you want to be safe, budget three tickets (γ = 1.5).
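The 1/π rule is just the mean of a geometric distribution, and the basketball arithmetic can be checked directly (a sketch; the function names and numbers are illustrative, not from the paper):

```python
def expected_step_cost(price, pi, gamma=1.0):
    """Retry-until-success takes ~1/pi attempts on average, each costing `price`."""
    return gamma * price / pi

def expected_plan_cost(steps, gamma=1.0):
    """Risk-adjusted cost of an ideal plan given as [(price, pi), ...]."""
    return gamma * sum(price / pi for price, pi in steps)

tickets = expected_step_cost(price=1, pi=0.5)             # 50% shots: ~2 tickets per basket
safe = expected_step_cost(price=1, pi=0.5, gamma=1.5)     # conservative budget: 3 tickets
plan = expected_plan_cost([(10, 0.5), (5, 1.0)])          # 10/0.5 + 5/1.0 = 25
```

The γ knob simply scales the whole estimate, so a cautious operator (γ > 1) reserves slack for unlucky retry streaks, while an optimistic one (γ < 1) frees budget for more ambitious plans.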
Hook: What are the building blocks?
The Concept: Building blocks
- Language World Model: predicts tool output structure to drive simulations.
- Intention Predictor: estimates the success chance (π) of a tool call given the agent's current reasoning.
- Conditional Generator: simulates outputs assuming success (to keep the plan on the ideal path).
- Geometric Cost Calibration: turns price and π into expected cost = price/π.
- Risk Preference (γ): dials how strict or bold the planner is.
- Feedback: if an action is blocked, INTENT returns a short, helpful summary (like the predicted action chain and per-step π values) so the agent can re-plan smartly.
- Caches: reuse simulated future steps if the agent stays on the plan, to save time.
Anchor: It's like planning a relay race: you estimate each runner's chance to pass the baton, budget extra time if some are less reliable, and keep a backup plan handy. If the team sticks to the same order, you don't need to re-plan every lap.
03 Methodology
Hook: Imagine getting a class project: you have a question to answer, a fixed allowance for supplies, and a menu of stores with changing prices. You want to finish the project without running out of money.
The Concept: At a high level, INTENT works like a traffic officer for tool calls. It watches what the agent wants to do next and quickly simulates an ideal future to check whether the plan will likely fit inside the budget.
- Overview pipeline: Input (question + budget + tool market) → Agent proposes "next step" → INTENT simulates an ideal plan and estimates risk-adjusted cost → If the total fits the budget, approve; otherwise, block and return a short hint → Agent re-plans until final answer.
- Why it matters: Without this checkpoint, agents can overspend on retries or pick costly tools early and leave no budget for the ending.
Anchor: It's like asking a teacher: "If I buy these supplies first, will I still have enough for the poster board?" If not, you adjust before buying.
Hook: You know how you can imagine how a project would go if everything cooperated? That's the "ideal plan" trick.
The Concept: Step-by-step details
- Initialize the context
- What: The system prompt, the user's question, the budget, and the market snapshot with per-call costs go into the history.
- Why: Everything the agent needs to reason and choose tools is placed in one place.
- Example: "Budget = 50 credits, tools: SearchAPI=7, FinanceAPI=23, WeatherAPI=5."
- Agent proposes a reasoning trace and an action
- What: The agent thinks (reasoning text), then chooses either a tool call with arguments or to answer now.
- Why: The reasoning reveals intention: what the agent expects this tool to achieve.
- Example: "I'll get quarterly cash flow first to find the net income trend; then use the balance sheet to get total assets."
- INTENT extracts intention and simulates an ideal plan
- What: Using the current proposal, INTENT performs ideal trajectory simulation: it forces each simulated tool step to "succeed" (z=1) so the plan doesn't fall into failure loops.
- Why: Directly simulating reality is noisy and can underestimate retries. The ideal path reveals the agent's latent plan structure.
- Example: Simulated steps: [CashFlowAPI → IncomeStatementAPI → Answer].
- INTENT predicts success probability (π) per step
- What: The Intention Predictor reads the reasoning + tool call and estimates the chance that the tool result actually satisfies the intention.
- Why: Not all tools or queries are equally likely to work; π captures that.
- Example: π(CashFlowAPI)=0.35 (brittle), π(IncomeStatementAPI)=0.94 (reliable).
- Geometric cost calibration
- What: Expected cost per step = price / π. Total expected cost = sum over steps. Then multiply by γ (risk preference) for safety.
- Why: If success is rare, expect more retries and higher spend.
- Example: If a tool costs 10 and π=0.5, budget ~20 for that step; with γ=0.5 (slightly optimistic), the total plan cost might still fit.
- Decision: approve or block
- What: If the immediate cost is affordable and γ × (sum of price/π) ≤ remaining budget, approve the action; else block.
- Why: Ensures never exceeding the hard budget while allowing good plans to proceed.
- Example: Approve CashFlowAPI if expected total stays under 50; otherwise block and ask to try a cheaper or more reliable tool.
- Feedback for re-planning
- What: On block, INTENT returns a compact hint: the predicted future action sequence and per-step π (no speculative content).
- Why: The agent learns which part of the plan is risky or expensive and adjusts.
- Example: "Your plan likely exceeds budget due to low π on CashFlowAPI; consider a cheaper or more reliable alternative."
- Simulation reuse (caches)
- What: If the agent's next action matches the simulated plan, skip new simulation and approve quickly. Track the last rejected action and blacklist tools with extremely low π.
- Why: Saves time and avoids thrashing on bad options.
- Example: If the agent follows [IncomeStatementAPI → BalanceSheetAPI] as simulated, keep approving unless costs change.
Anchor: It's like checking your shopping list once, then cruising through the store; you only re-check if you switch to a different recipe.
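Putting the steps together, the approve/block decision can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper name `intent_gate` is invented, and the tool prices and π values are the illustrative ones from the examples above.

```python
def intent_gate(plan, remaining, gamma=1.0):
    """plan: simulated ideal trajectory as [(tool_name, price, pi), ...].
    Returns (approved, feedback)."""
    next_price = plan[0][1]
    if next_price > remaining:                   # immediate affordability check
        return False, "next call alone exceeds remaining budget"
    expected = gamma * sum(price / pi for _, price, pi in plan)
    if expected <= remaining:                    # risk-adjusted plan fits the budget
        return True, None
    # Compact re-planning hint: predicted chain plus per-step success chances,
    # with no speculative tool content.
    chain = " -> ".join(f"{name}(pi={pi:.2f})" for name, _, pi in plan)
    return False, f"expected cost {expected:.1f} exceeds budget {remaining}; {chain}"

plan = [("CashFlowAPI", 10, 0.35), ("IncomeStatementAPI", 7, 0.94)]
ok_rich, _ = intent_gate(plan, remaining=50)     # risk-adjusted cost ~36: approve
ok_poor, hint = intent_gate(plan, remaining=30)  # ~36 > 30: block, return a hint
```

With 50 credits remaining, the plan's risk-adjusted cost (roughly 36 credits, dominated by the brittle CashFlowAPI step) fits and the call is approved; with 30 credits it is blocked, and the hint names the risky step so the agent can swap in a cheaper or more reliable tool.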
Hook: Think of three small helpers that make the system both safe and fast.
The Concept: The secret sauce
- Intention-based factorization: Split "Will this satisfy my goal?" from "What exact content appears?" The first is stable and lets you budget; the second is noisy and not needed for cost control.
- Pessimistic-but-tunable cost: price/π with γ lets you be safe without being frozen. You can turn the knob.
- Minimal intervention: The main agent stays the same. INTENT is a thin planning layer at inference time: no retraining, no heavy tree search.
Anchor: It's like carrying a pocket calculator to check if your plan fits your allowance. You don't rewrite the whole plan; you just sanity-check costs as you go.
04 Experiments & Results
Hook: Imagine running a science fair of 765 projects where each team must buy data from different vendors, each with its own prices. Who can finish great projects without overspending?
The Concept: The authors tested INTENT on StableToolBench, a large benchmark of multi-step tool-use tasks. They added prices to tools and set a hard budget (like 50 credits) per task. They compared INTENT to: RAW (no cost hints), PROMPT (text reminders about costs), and several enforcers (DFSDT, BTP, BATS), plus a simpler Monte Carlo Oracle (MCO).
- What they measured (why it matters):
- Pass Rate (PR): How often the agent finishes correctly (like getting an A on the project).
- Budget-Optimal Pass (OR): Success vs. the best that's achievable under the same budget (how close you are to the gold standard).
- Feasible Rate (FR): How often the agent stays within budget (no overdrafts).
- Average Cost and Price: Whether the agent is actually spending wisely.
- Time/Latency/Tokens: Does the method slow things down too much?
Anchor: It's like grading projects for both quality and whether they stayed within their spending rules.
Hook: What did the scoreboard say?
The Concept: Scoreboard with context
- Standalone agents struggle: With only prompting, strong models still blew the budget surprisingly often or got too cautious. For example, PROMPT (non-reasoning GPT 4.1 mini) had PR=30.9% and FR=67.2%, meaning about one-third of tasks still exceeded the budget.
- Enforcers help but trade off: DFSDT and BTP enforced budgets (FR=100%) but were conservative, leaving performance on the table. BATS improved success but was slow and token-hungry.
- INTENT wins the trade-off:
- Non-reasoning backbone (GPT 4.1 mini): INTENT PR=63.8%, FR=100%, with moderate overhead (time 1.23× vs RAW).
- Reasoning backbone (GPT 5 nano): INTENT PR=76.0%, FR=100%, significantly higher than PROMPT (48.5%) and the enforcers, while keeping average cost lower (AC=29.2) and runtime reasonable (time 1.79×).
- MCO (simpler rollout) helped but lagged INTENT, showing the value of intention-aware cost estimation.
š Anchor: Think of INTENT as getting an A- while always staying within budget, whereas others either got Bs and stayed within budget or got As but broke the rules.
Hook: Can it handle real-world messiness?
The Concept: Robustness under market changes
- New tools: Even with only a fraction of logs to train its small oracle models, INTENT improved smoothly (log-linear scaling early), showing it can adapt quickly.
- Price changes: When some tool prices got cheaper or pricier, PROMPT's performance swung a lot; INTENT stayed steady, reacting rationally to price shifts.
- Budget scaling: With smaller budgets, INTENT still did well; with bigger budgets, it scaled gracefully and used extra room wisely.
Anchor: It's like a shopper who adjusts calmly to sales and price hikes, still making dinner without blowing the allowance.
Hook: Any surprises?
The Concept: Surprising findings
- Prompted agents that "knew" prices still overspent because retries and noisy tools compound costs in sneaky ways.
- A single stochastic rollout (MCO) sometimes underestimates true spend; the intention-aware version stabilizes cost estimates.
- Lightweight hints (predicted action chain plus success probabilities) help the agent re-plan smarter than just saying "too expensive."
Anchor: Telling a friend, "You'll probably run out of money if you do A→B→C," with a note like "A is only 35% likely to work," is more helpful than just "Don't do it."
05 Discussion & Limitations
Hook: Imagine a careful coach who always makes the playoffs but sometimes plays it a bit safe, and who needs a scouting report to judge each play's reliability.
The Concept: Honest assessment
- Limitations (what it can't do):
- If the Intention Predictor's probabilities are poorly calibrated, expected costs (price/π) can be off: too strict or too loose.
- Assuming roughly geometric retries (constant success chance each attempt) may overestimate or underestimate costs for tools that get easier or harder on retries.
- The Language World Model simulates output structure, not exact facts; if tool formats change drastically, simulations may degrade until refreshed.
- There's still inference overhead vs. pure prompting, though INTENT keeps it moderate with caches.
- Required resources: A capable base agent; a small language world model; a small classifier for intention success; interaction logs (and an LLM-as-judge once during training to label success); and a serving stack to run the oracle.
- When NOT to use: If tools are free or nearly free; if the environment is fully deterministic (no retries); if tasks are single-shot and trivial; or if you can't afford any extra latency.
- Open questions: Can we jointly handle multiple resources (money, time, tokens) at once? Can the predictor update online as tools drift? How do we coordinate multiple agents under one shared budget? Can we incorporate value-of-information to sometimes try a risky cheap call when it unlocks big downstream savings?
Anchor: It's like a budget-savvy team that plays smart ball. If the scouting report is off, or the league changes rules, the playbook needs quick updates to stay sharp.
06 Conclusion & Future Work
Hook: Picture handing your allowance to a helper and saying, "Buy what we need, but don't go over." You'll only trust them if they're both smart and careful.
The Concept: Three-sentence summary
- The paper reframes budget-aware tool use as an intention-aware planning problem: predict whether each step satisfies the agent's goal, not the exact content, then budget using price divided by success chance.
- INTENT enforces hard budgets at inference time with a light planning layer that simulates an ideal plan, uses risk calibration, and gives helpful feedback for re-planning.
- On a priced version of StableToolBench, INTENT achieves the best success-feasibility balance, stays robust to market shifts, and adds only mild overhead.
- Main achievement: Showing that a small, intention-aware oracle can reliably steer powerful agents to stay on budget without retraining or heavy search.
- Future directions: Extend to multi-resource control (money, latency, tokens), online calibration as tools drift, and multi-agent shared-budget planning; explore richer value-of-information strategies.
- Why remember this: It turns "budget awareness" into a first-class, practical capability for agentic AI, making delegation safer wherever tools cost real money.
Anchor: In short, INTENT is like a wise shopping buddy for AI agents: it doesn't change who they are, it just helps them finish the job without maxing out the card.
Practical Applications
- Automated financial research that selects cost-effective data APIs and stays within daily spending caps.
- Legal due diligence agents that prioritize cheaper databases first and only escalate to premium sources if needed.
- Cloud incident diagnosis agents that avoid costly, repeated diagnostics when cheaper signals suffice.
- Marketing analytics bots that choose between free/cheap web search and paid data feeds without exceeding budgets.
- Business intelligence workflows that allocate limited budget across multiple queries during peak demand.
- Developer assistants that balance paid code analysis tools with local checks to meet a per-task budget.
- Procurement assistants that simulate ideal sequences of vendor checks and avoid wasteful retries.
- Customer support agents that selectively call premium verification or routing tools only when success chances are high.
- Research agents that throttle expensive literature or patent databases based on predicted usefulness and success.
- Operations dashboards that enforce team-level tool budgets with per-call approvals guided by INTENT.