Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use
Key Summary
- This paper tackles a simple but serious question: can AI agents use paid tools to finish multi-step tasks without blowing the budget?
- The authors introduce INTENT, a lightweight, inference-time planner that guides a strong language model to make cost-aware decisions.
- Instead of predicting exact tool outputs, INTENT predicts whether a tool call will satisfy the agent's intention and uses that to estimate expected cost.
- It simulates an "ideal" future plan where each step works, then risk-adjusts the cost of each step by dividing by its predicted success probability.
- A safety knob called gamma lets users choose between more conservative or more aggressive spending behavior.
- On a cost-augmented version of StableToolBench, INTENT achieves 100% budget feasibility while greatly improving task success over prompting and other enforcers.
- INTENT stays robust when tool prices change, new tools appear, or budgets vary, all without retraining the main agent.
- The method needs only small auxiliary models and logs, not expensive reinforcement learning or heavy tree search.
- Case studies show INTENT doesn't just block bad calls; it nudges the agent toward a cheaper, steadier plan that still solves the task.
- This work reframes budget control for AI agents as intention-aware planning, making economic constraints a first-class part of decision making.
Why This Research Matters
Many real tools cost money, and AI agents increasingly rely on them to do useful work. INTENT lets us trust agents with spending limits, keeping them from overspending while still finishing tasks well. This makes AI safer and more deployable in fields like finance, law, IT operations, and research where budgets are strict. It adapts to changing tool prices and new tools without retraining the big model, which saves time and money. The approach is lightweight, making it practical to plug into existing systems. By treating budget awareness as a core objective, INTENT moves AI closer to being responsible, efficient helpers in the real world.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're shopping with a gift card that has a hard limit. You can fill your cart with helpful items, but if the total goes over the card's amount, the cashier says "Nope!" and you walk away empty-handed. That's exactly the kind of problem AI agents face when they must use paid tools to complete tasks within a strict budget.
The Concept: Large Language Models (LLMs) are quickly becoming agents that can think in steps and use external tools to get things done. But in the real world, many tools cost money per use, and the total spend must stay under a hard budget. The challenge is that tool outcomes are uncertain (sometimes searches fail, APIs vary, or prices change), and agents can't practice for free during deployment.
- How it worked before: Agents planned as they went, calling tools until they had enough info. Costs weren't tracked precisely, or only gently nudged with prompts like "You have $50 total."
- Why it broke: Prompting alone didn't stop budget overruns. Agents retried tools, followed brittle habits, or picked expensive ones too early. Reinforcement learning would be slow and outdated as tool markets change. And classical tree search is too heavy and assumes cheap exploration.
- What was missing: A way to guide a capable agent at inference time, without retraining, so it can foresee likely spending under uncertainty and still aim for success.
Anchor: Think about a school science fair: you have a $20 supply budget. If you buy expensive glitter glue first and it doesn't work, you may not have enough left for the poster board and markers you still need. AI agents face the same tradeoffs with paid tools.
Hook: You know how you can't see the future, but you can make a plan like, "If the first try works, I'll do X; if not, I'll do Y, and I need to save money for the ending"? That's the kind of foresight agents need.
The Concept: Sequential decision making is the idea of choosing the next action based on what's happened so far and what you expect to happen next. For agents using tools, each step (call a tool or stop) changes what they know and how much money remains.
- What people tried: Formulas like online knapsack (divide budget across options), extra training to learn budgets, or heavy planning (MCTS) that simulates many branches.
- Why it didn't work: Tool calls aren't independent; you need outputs from earlier calls to decide later ones. Retraining can't keep up with changing tool markets. And exploring many futures is too slow and costly.
Anchor: It's like planning a school day: if the bus is late (stochastic), you might miss a club meeting (dependency). You need a plan that respects time (budget) while adapting to surprises.
Hook: Picture an online app store for tools where prices shift like supermarket sales. One day a data API costs 10 credits; the next day it's 25.
The Concept: A dynamic tool marketplace is a changing collection of tools, each with its own price per use. The agent sees a snapshot for each task and must pick tools and their order, staying under budget.
- How it works (step by step):
- See user request, budget, and available tools with prices.
- Think a bit and propose a tool call or decide to answer.
- If calling a tool, pay the price, get a (possibly noisy) result, and update the budget.
- Repeat until you answer or run out of money.
- Why it matters: If you ignore price changes or tool reliability, you can overspend or choose a weak plan.
Anchor: It's like buying snacks for a field trip from a store where prices change daily; you must still feed everyone without going over the money you have.
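The step-by-step loop above can be sketched in a few lines of Python. This is a toy sketch, not the paper's environment: the tool name, price, reliability number, and the scripted stand-in "agent" are all invented for illustration.

```python
import random

def run_episode(propose, market, budget, seed=0):
    """One task: pay per call, results may fail, never exceed the hard budget."""
    rng = random.Random(seed)
    remaining, history = budget, []
    while True:
        action = propose(history, remaining)         # ("call", name) or ("answer", text)
        if action[0] == "answer":
            return action[1], remaining
        price, reliability = market[action[1]]
        if price > remaining:                        # can't afford even one more call
            return None, remaining                   # fail rather than overspend
        remaining -= price                           # pay per use, success or not
        history.append((action[1], rng.random() < reliability))  # noisy outcome

# Scripted stand-in for the agent: retry SearchAPI until one call succeeds.
def propose(history, remaining):
    return ("answer", "done") if any(ok for _, ok in history) else ("call", "SearchAPI")

market = {"SearchAPI": (7, 0.8)}                     # 7 credits per call, 80% reliable
answer, left = run_episode(propose, market, budget=50)
```

Note how the retry behavior makes spending stochastic: the same plan can cost one call or several, which is exactly why a budget-aware planner has to reason about expected rather than nominal cost.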
Hook: Imagine you're playing a game where each action can cost coins and might fail. You'd want a coach that says, "If you follow this path, you'll probably spend too much; try a cheaper route."
The Concept: The paper's central problem is budget-constrained tool use: solve tasks with tool calls while never exceeding a hard budget.
- Before: Agents often ignored costs or learned fuzzy rules that didn't hold in new markets.
- Problem: Tool results are uncertain, retries burn money, and you can't just try everything.
- Gap: We need a fast, inference-time planner that predicts future cost risks and gently steers the agent without heavy search or retraining.
Anchor: Like a GPS that avoids toll roads when you have only a few dollars left, the planner must guide the route to finish the trip without running out of money.
02 Core Idea
Hook: You know how when you bake cookies, you don't need to predict the exact shape of each cookie to know if the batch will turn out fine; you just need to know whether the dough is good enough to bake?
The Concept: The "aha!" insight is this: instead of predicting the exact content of every future tool output, predict whether each tool call will satisfy the agent's intention, and use that to estimate how much money the plan will probably cost. That's what INTENT does.
- How it works:
- Extract the plan the agent seems to be aiming for (its intention).
- Simulate an ideal future where each step "works as intended."
- For each step, estimate how many tries it might take by using a success probability.
- Add up a risk-adjusted cost for the whole plan and compare to the remaining budget.
- If it fits, allow the action; if not, block it and give helpful feedback to re-plan.
- Why it matters: Without this, agents either overspend or get too scared to try useful tools.
Anchor: It's like planning a school play: you don't need to know every line to budget for costumes and props. You just need to know which scenes will definitely happen and how likely each special effect will work.
Hook: Imagine three different explanations of the same idea.
The Concept (3 analogies):
- Grocery trip: You plan meals (intention) and estimate if buying eggs will probably work (success chance). If eggs often run out, you budget extra time/money for a second store. You don't predict the exact egg carton details; you just plan around whether you'll likely succeed.
- Road trip: You don't predict the exact shape of clouds; you just see if a route likely stays open. If a bridge is often closed (low success), you budget for a detour (higher cost) ahead of time.
- Video game: You don't predict every enemy move; you check if a strategy usually clears the level. If it's risky, you bring extra potions (budget) or pick a safer route.
Anchor: In all three, you judge whether each step likely works and budget accordingly, not the exact fine-grained future.
Hook: What changes with INTENT?
The Concept: Before vs. After
- Before: Agents either overspent (chasing info) or got too cautious (never trying good but costly tools). Prompting helped but didn't guarantee staying under budget.
- After: INTENT enforces hard budget feasibility by risk-adjusting the plan's cost and gently steering choices, all while keeping the agent's core skills intact.
- Why it works: It focuses on intention satisfaction (did the tool achieve what the reasoning wanted?), which is more stable than predicting exact content in noisy environments.
Anchor: It's like a coach who says, "This play usually works 70% of the time; save enough energy in case you need a second try," instead of trying to predict every bounce of the ball.
Hook: How can we trust the math without equations?
The Concept: Why it works (intuition)
- Key idea: Most agents, when following a plan, keep trying a step until it works (within reason). If step k has a success chance π, then on average you'll need about 1/π tries. So the expected money for that step is price/π.
- INTENT adds a safety knob γ (gamma). Multiply the total expected cost by γ to be more conservative (>1) or less (<1).
- Simulating an "ideal plan" reveals the agent's latent intentions without wandering down failure loops.
š Anchor: If a basketball shot goes in 50% of the time and each shot costs a ticket, you expect to spend two tickets per basket. If you want to be safe, budget three tickets (γ = 1.5).
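The 1/π rule is just the mean of a geometric distribution, and the basketball arithmetic can be checked directly (a sketch; the function names and numbers are illustrative, not from the paper):

```python
def expected_step_cost(price, pi, gamma=1.0):
    """Retry-until-success takes ~1/pi attempts on average, each costing `price`."""
    return gamma * price / pi

def expected_plan_cost(steps, gamma=1.0):
    """Risk-adjusted cost of an ideal plan given as [(price, pi), ...]."""
    return gamma * sum(price / pi for price, pi in steps)

tickets = expected_step_cost(price=1, pi=0.5)             # 50% shots: ~2 tickets per basket
safe = expected_step_cost(price=1, pi=0.5, gamma=1.5)     # conservative budget: 3 tickets
plan = expected_plan_cost([(10, 0.5), (5, 1.0)])          # 10/0.5 + 5/1.0 = 25
```

The γ knob simply scales the whole estimate, so a cautious operator (γ > 1) reserves slack for unlucky retry streaks, while an optimistic one (γ < 1) frees budget for more ambitious plans.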
Hook: What are the building blocks?
The Concept: Building blocks
- Language World Model: predicts tool output structure to drive simulations.
- Intention Predictor: estimates the success chance (π) of a tool call given the agent's current reasoning.
- Conditional Generator: simulates outputs assuming success (to keep the plan on the ideal path).
- Geometric Cost Calibration: turns price and π into expected cost = price/π.
- Risk Preference (γ): dials how strict or bold the planner is.
- Feedback: if an action is blocked, INTENT returns a short, helpful summary (like the predicted action chain and per-step π values) so the agent can re-plan smartly.
- Caches: reuse simulated future steps if the agent stays on the plan, to save time.
Anchor: It's like planning a relay race: you estimate each runner's chance to pass the baton, budget extra time if some are less reliable, and keep a backup plan handy. If the team sticks to the same order, you don't need to re-plan every lap.
03 Methodology
Hook: Imagine getting a class project: you have a question to answer, a fixed allowance for supplies, and a menu of stores with changing prices. You want to finish the project without running out of money.
The Concept: At a high level, INTENT works like a traffic officer for tool calls. It watches what the agent wants to do next and quickly simulates an ideal future to check whether the plan will likely fit inside the budget.
- Overview pipeline: Input (question + budget + tool market) → Agent proposes "next step" → INTENT simulates an ideal plan and estimates risk-adjusted cost → If the total fits the budget, approve; otherwise, block and return a short hint → Agent re-plans until final answer.
- Why it matters: Without this checkpoint, agents can overspend on retries or pick costly tools early and leave no budget for the ending.
Anchor: It's like asking a teacher: "If I buy these supplies first, will I still have enough for the poster board?" If not, you adjust before buying.
Hook: You know how you can imagine how a project would go if everything cooperated? That's the "ideal plan" trick.
The Concept: Step-by-step details
- Initialize the context
- What: The system prompt, the user's question, the budget, and the market snapshot with per-call costs go into the history.
- Why: Everything the agent needs to reason and choose tools is placed in one place.
- Example: "Budget = 50 credits, tools: SearchAPI=7, FinanceAPI=23, WeatherAPI=5."
- Agent proposes a reasoning trace and an action
- What: The agent thinks (reasoning text), then chooses either a tool call with arguments or to answer now.
- Why: The reasoning reveals intention: what the agent expects this tool to achieve.
- Example: "I'll get quarterly cash flow first to find the net income trend; then use the balance sheet to get total assets."
- INTENT extracts intention and simulates an ideal plan
- What: Using the current proposal, INTENT performs ideal trajectory simulation: it forces each simulated tool step to "succeed" (z=1) so the plan doesn't fall into failure loops.
- Why: Directly simulating reality is noisy and can underestimate retries. The ideal path reveals the agent's latent plan structure.
- Example: Simulated steps: [CashFlowAPI → IncomeStatementAPI → Answer].
- INTENT predicts success probability (π) per step
- What: The Intention Predictor reads the reasoning + tool call and estimates the chance that the tool result actually satisfies the intention.
- Why: Not all tools or queries are equally likely to work; π captures that.
- Example: π(CashFlowAPI)=0.35 (brittle), π(IncomeStatementAPI)=0.94 (reliable).
- Geometric cost calibration
- What: Expected cost per step = price / π. Total expected cost = sum over steps. Then multiply by γ (risk preference) for safety.
- Why: If success is rare, expect more retries and higher spend.
- Example: If a tool costs 10 and π=0.5, budget ~20 for that step; with γ=0.5 (slightly optimistic), the total plan cost might still fit.
- Decision: approve or block
- What: If the immediate cost is affordable and γ × (sum of price/π) ≤ remaining budget, approve the action; else block.
- Why: Ensures never exceeding the hard budget while allowing good plans to proceed.
- Example: Approve CashFlowAPI if expected total stays under 50; otherwise block and ask to try a cheaper or more reliable tool.
- Feedback for re-planning
- What: On block, INTENT returns a compact hint: the predicted future action sequence and per-step π (no speculative content).
- Why: The agent learns which part of the plan is risky or expensive and adjusts.
- Example: "Your plan likely exceeds budget due to low π on CashFlowAPI; consider a cheaper or more reliable alternative."
- Simulation reuse (caches)
- What: If the agent's next action matches the simulated plan, skip new simulation and approve quickly. Track the last rejected action and blacklist tools with extremely low π.
- Why: Saves time and avoids thrashing on bad options.
- Example: If the agent follows [IncomeStatementAPI → BalanceSheetAPI] as simulated, keep approving unless costs change.
Anchor: It's like checking your shopping list once, then cruising through the store; you only re-check if you switch to a different recipe.
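Putting the steps together, the approve/block decision can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper name `intent_gate` is invented, and the tool prices and π values are the illustrative ones from the examples above.

```python
def intent_gate(plan, remaining, gamma=1.0):
    """plan: simulated ideal trajectory as [(tool_name, price, pi), ...].
    Returns (approved, feedback)."""
    next_price = plan[0][1]
    if next_price > remaining:                   # immediate affordability check
        return False, "next call alone exceeds remaining budget"
    expected = gamma * sum(price / pi for _, price, pi in plan)
    if expected <= remaining:                    # risk-adjusted plan fits the budget
        return True, None
    # Compact re-planning hint: predicted chain plus per-step success chances,
    # with no speculative tool content.
    chain = " -> ".join(f"{name}(pi={pi:.2f})" for name, _, pi in plan)
    return False, f"expected cost {expected:.1f} exceeds budget {remaining}; {chain}"

plan = [("CashFlowAPI", 10, 0.35), ("IncomeStatementAPI", 7, 0.94)]
ok_rich, _ = intent_gate(plan, remaining=50)     # risk-adjusted cost ~36: approve
ok_poor, hint = intent_gate(plan, remaining=30)  # ~36 > 30: block, return a hint
```

With 50 credits remaining, the plan's risk-adjusted cost (roughly 36 credits, dominated by the brittle CashFlowAPI step) fits and the call is approved; with 30 credits it is blocked, and the hint names the risky step so the agent can swap in a cheaper or more reliable tool.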
Hook: Think of three small helpers that make the system both safe and fast.
The Concept: The secret sauce
- Intention-based factorization: Split "Will this satisfy my goal?" from "What exact content appears?" The first is stable and lets you budget; the second is noisy and not needed for cost control.
- Pessimistic-but-tunable cost: price/π with γ lets you be safe without being frozen. You can turn the knob.
- Minimal intervention: The main agent stays the same. INTENT is a thin planning layer at inference time: no retraining, no heavy tree search.
Anchor: It's like carrying a pocket calculator to check if your plan fits your allowance. You don't rewrite the whole plan; you just sanity-check costs as you go.
04 Experiments & Results
Hook: Imagine running a science fair of 765 projects where each team must buy data from different vendors, each with its own prices. Who can finish great projects without overspending?
The Concept: The authors tested INTENT on StableToolBench, a large benchmark of multi-step tool-use tasks. They added prices to tools and set a hard budget (like 50 credits) per task. They compared INTENT to: RAW (no cost hints), PROMPT (text reminders about costs), and several enforcers (DFSDT, BTP, BATS), plus a simpler Monte Carlo Oracle (MCO).
- What they measured (why it matters):
- Pass Rate (PR): How often the agent finishes correctly (like getting an A on the project).
- Budget-Optimal Pass (OR): Success vs. the best that's achievable under the same budget (how close you are to the gold standard).
- Feasible Rate (FR): How often the agent stays within budget (no overdrafts).
- Average Cost and Price: Whether the agent is actually spending wisely.
- Time/Latency/Tokens: Does the method slow things down too much?
Anchor: It's like grading projects for both quality and whether they stayed within their spending rules.
Hook: What did the scoreboard say?
The Concept: Scoreboard with context
- Standalone agents struggle: With only prompting, strong models still blew the budget surprisingly often or got too cautious. For example, PROMPT (non-reasoning GPT 4.1 mini) had PR=30.9% and FR=67.2%, meaning about one-third of tasks still exceeded the budget.
- Enforcers help but trade off: DFSDT and BTP enforced budgets (FR=100%) but were conservative, leaving performance on the table. BATS improved success but was slow and token-hungry.
- INTENT wins the trade-off:
- Non-reasoning backbone (GPT 4.1 mini): INTENT PR=63.8%, FR=100%, with moderate overhead (time 1.23× vs RAW).
- Reasoning backbone (GPT 5 nano): INTENT PR=76.0%, FR=100%, significantly higher than PROMPT (48.5%) and the enforcers, while keeping average cost lower (AC=29.2) and runtime reasonable (time 1.79×).
- MCO (simpler rollout) helped but lagged INTENT, showing the value of intention-aware cost estimation.
š Anchor: Think of INTENT as getting an A- while always staying within budget, whereas others either got Bs and stayed within budget or got As but broke the rules.
Hook: Can it handle real-world messiness?
The Concept: Robustness under market changes
- New tools: Even with only a fraction of logs to train its small oracle models, INTENT improved smoothly (log-linear scaling early), showing it can adapt quickly.
- Price changes: When some tool prices got cheaper or pricier, PROMPT's performance swung a lot; INTENT stayed steady, reacting rationally to price shifts.
- Budget scaling: With smaller budgets, INTENT still did well; with bigger budgets, it scaled gracefully and used extra room wisely.
Anchor: It's like a shopper who adjusts calmly to sales and price hikes, still making dinner without blowing the allowance.
Hook: Any surprises?
The Concept: Surprising findings
- Prompted agents that "knew" prices still overspent because retries and noisy tools compound costs in sneaky ways.
- A single stochastic rollout (MCO) sometimes underestimates true spend; the intention-aware version stabilizes cost estimates.
- Lightweight hints (predicted action chain plus success probabilities) help the agent re-plan smarter than just saying "too expensive."
Anchor: Telling a friend, "You'll probably run out of money if you do A→B→C," with a note like "A is only 35% likely to work," is more helpful than just "Don't do it."
05 Discussion & Limitations
Hook: Imagine a careful coach who always makes the playoffs but sometimes plays it a bit safe, and who needs a scouting report to judge each play's reliability.
The Concept: Honest assessment
- Limitations (what it can't do):
- If the Intention Predictor's probabilities are poorly calibrated, expected costs (price/π) can be off: too strict or too loose.
- Assuming roughly geometric retries (constant success chance each attempt) may overestimate or underestimate costs for tools that get easier or harder on retries.
- The Language World Model simulates output structure, not exact facts; if tool formats change drastically, simulations may degrade until refreshed.
- There's still inference overhead vs. pure prompting, though INTENT keeps it moderate with caches.
- Required resources: A capable base agent; a small language world model; a small classifier for intention success; interaction logs (and an LLM-as-judge once during training to label success); and a serving stack to run the oracle.
- When NOT to use: If tools are free or nearly free; if the environment is fully deterministic (no retries); if tasks are single-shot and trivial; or if you can't afford any extra latency.
- Open questions: Can we jointly handle multiple resources (money, time, tokens) at once? Can the predictor update online as tools drift? How do we coordinate multiple agents under one shared budget? Can we incorporate value-of-information to sometimes try a risky cheap call when it unlocks big downstream savings?
Anchor: It's like a budget-savvy team that plays smart ball. If the scouting report is off, or the league changes rules, the playbook needs quick updates to stay sharp.
06 Conclusion & Future Work
Hook: Picture handing your allowance to a helper and saying, "Buy what we need, but don't go over." You'll only trust them if they're both smart and careful.
The Concept: Three-sentence summary
- The paper reframes budget-aware tool use as an intention-aware planning problem: predict whether each step satisfies the agent's goal, not the exact content, then budget using price divided by success chance.
- INTENT enforces hard budgets at inference time with a light planning layer that simulates an ideal plan, uses risk calibration, and gives helpful feedback for re-planning.
- On a priced version of StableToolBench, INTENT achieves the best success-feasibility balance, stays robust to market shifts, and adds only mild overhead.
- Main achievement: Showing that a small, intention-aware oracle can reliably steer powerful agents to stay on budget without retraining or heavy search.
- Future directions: Extend to multi-resource control (money, latency, tokens), online calibration as tools drift, and multi-agent shared-budget planning; explore richer value-of-information strategies.
- Why remember this: It turns "budget awareness" into a first-class, practical capability for agentic AI, making delegation safer wherever tools cost real money.
Anchor: In short, INTENT is like a wise shopping buddy for AI agents: it doesn't change who they are, it just helps them finish the job without maxing out the card.
Practical Applications
- Automated financial research that selects cost-effective data APIs and stays within daily spending caps.
- Legal due diligence agents that prioritize cheaper databases first and only escalate to premium sources if needed.
- Cloud incident diagnosis agents that avoid costly, repeated diagnostics when cheaper signals suffice.
- Marketing analytics bots that choose between free/cheap web search and paid data feeds without exceeding budgets.
- Business intelligence workflows that allocate limited budget across multiple queries during peak demand.
- Developer assistants that balance paid code analysis tools with local checks to meet a per-task budget.
- Procurement assistants that simulate ideal sequences of vendor checks and avoid wasteful retries.
- Customer support agents that selectively call premium verification or routing tools only when success chances are high.
- Research agents that throttle expensive literature or patent databases based on predicted usefulness and success.
- Operations dashboards that enforce team-level tool budgets with per-call approvals guided by INTENT.