InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Key Summary
- Many real-life requests to AI helpers are vague, so agents must ask good questions before acting.
- Old training methods gave a single end-of-task score, making it hard to tell which turn helped and which didn't.
- InfoPO teaches agents to value each conversational turn by measuring how much new feedback changes the agent's very next decision.
- It does this with a safe counterfactual: compare the policy with the feedback versus with a masked "no info" version and reward the difference.
- A smart gate then blends this turn-level information reward with the actual task outcome, turning up information when outcomes are unhelpful and turning it down when outcomes are clear.
- Across intent clarification, coding with a collaborator, and tool-using decisions, InfoPO beats strong prompting and RL baselines by about 14-16%.
- InfoPO starts learning earlier, is more stable when group outcomes are tied, and learns to clarify early, then act confidently.
- Theory shows the turn-level reward equals conditional mutual information, and enough total information gain is necessary to succeed.
- It generalizes beyond user chats to environment-interactive tasks (like Sokoban/WebShop) and is robust to different user simulators.
- Compute overhead is modest (about 1.6× wall clock in practice) and requires no extra environment calls.
Why This Research Matters
Real assistants must ask the right questions before acting, or they waste time and make mistakes. InfoPO gives agents a clear, per-turn way to learn which questions actually help, even when final outcomes don't yet separate good from bad attempts. This makes training faster and steadier, producing agents that clarify early and act confidently later. It works across tasks like travel planning, coding with someone, and customer support with tools. The method is theory-backed, practical to implement, and robust when the "user" changes style. As a result, we can build AI helpers that feel more thoughtful, save users time, and improve success rates in everyday workflows.
Detailed Explanation
01 Background & Problem Definition
Let's first set the stage and introduce the key ideas using simple stories.
Hook: You know how when a friend says, "Book me a flight next week," you can't do it until you ask for the city, date, and budget? The Concept (User-Centric Agents): These are AI helpers designed to talk with people, ask for missing details, and then do things. How it works: (1) Listen to the user, (2) Ask clarifying questions, (3) Use tools or take actions, (4) Check results, (5) Repeat until done. Why it matters: Without asking the right questions, the agent guesses and often fails. Anchor: An agent can't buy the right ticket if it doesn't know "from where" or "how much you can spend."
Hook: Imagine playing a turn-by-turn board game where each move depends on what just happened. The Concept (Multi-Turn Interactions): This means solving tasks through back-and-forth steps. How it works: (1) Agent says/does something, (2) User/environment responds, (3) Agent updates its plan, (4) Repeat. Why it matters: Without multi-turn thinking, the agent can't adjust or refine its plan. Anchor: A travel chatbot that first asks your budget, then checks hotels, then confirms booking is doing multi-turn interaction.
Hook: Think of getting a single grade at the end of a long group project: it's hard to know who helped and when. The Concept (Credit Assignment Problem): It's hard to tell which specific step in a long process caused success or failure. How it works: (1) Many steps happen, (2) You get one final reward, (3) It's unclear which steps deserve credit. Why it matters: If we can't reward the right steps, the agent learns slowly or learns the wrong lessons. Anchor: If the agent asked a great question early but messed up the final tool call, a single "fail" score hides that the question was helpful.
Hook: Imagine a teacher who only gives a final score but no comments on your answers along the way. The Concept (Reinforcement Learning, RL): A way to train agents by giving rewards or penalties based on what they do. How it works: (1) Agent tries actions, (2) Gets rewards, (3) Updates policy to do better next time. Why it matters: Without the right rewards at the right time, RL agents struggle on long, chatty tasks. Anchor: If a chatbot only gets a score at the very end of a 10-turn conversation, it doesn't know which question helped.
The world before InfoPO: LLM agents could chat, use tools, and complete simple tasks, but they struggled when user requests were underspecified. Many RL methods for agents used group-relative policy optimization (GRPO), which compares several rollouts for the same prompt to reduce noise. However, they mostly computed a single "trajectory-level" score. This makes learning fragile, especially when the group's outcomes are all similar (like everyone failing in the same way). Even worse, "honorable failures" (the agent asked great clarifying questions but made one bad final call) gave a zero reward, so good questioning wasn't reinforced.
Failed attempts and why they didnât fix it:
- Trajectory-only rewards: Too coarse. They ignore which turn helped.
- Task-specific shaping and heuristics: Hard to generalize across domains; lots of manual design.
- Process reward models (PRMs): Can help, but require extra training data/models and may still miss general, task-agnostic information value.
- Curiosity/novelty bonuses: Encourage exploration but donât directly measure if the new info actually changed what the agent does next.
The gap: We needed a principled, general way to reward useful information gathering at the level of each turn, without extra environment calls, and still guide the agent toward the final goal.
The real stakes: In daily life, your assistant should ask the right questions early (about your budget, your preferences, or the exact issue with your phone) before taking actions that cost time or money. In coding collaboration, it should clarify requirements before writing code. In customer support, it should diagnose quickly and then fix. Without good turn-level learning signals, assistants waste time, frustrate users, and fail more often.
Enter InfoPO: It treats multi-turn interaction as "actively reducing uncertainty." It asks: Did the user's feedback at this turn actually change what the agent is likely to do next? If yes, reward that turn. Then it blends this information reward with the final task outcome in an adaptive way. This teaches agents to clarify first, then act confidently, exactly how helpful humans behave.
02 Core Idea
Let's introduce the main ideas, each in the Sandwich pattern, then connect them.
Hook: You know how detectives keep asking targeted questions to shrink the list of suspects? The Concept (Information-Driven Policy Optimization, InfoPO): A training method that rewards an agent for gathering information that actually changes its next move, while still aiming at the final task. How it works: (1) After each turn's feedback, compare the policy's next-action likelihood with and without that feedback (a masked counterfactual), (2) Give a turn-level "information gain" reward based on the difference, (3) Fuse this with the task's outcome using an adaptive gate, (4) Update the policy. Why it matters: Without valuing informative turns, agents don't learn to clarify early and can get stuck when end rewards don't differentiate rollouts. Anchor: If learning you're "under $250 budget" makes the agent pick different hotels next turn, that turn gets positive credit.
Hook: Imagine watching two versions of the same scene, one with the clue and one where the clue is blanked out, and asking, "Did that clue change what I'd do next?" The Concept (Turn-Level Counterfactual Information Gain Reward): A per-turn reward that measures how much the latest feedback shifts the agent's next-action distribution compared to a "no info" placeholder. How it works: (1) Take the real conversation with feedback, (2) Create a copy where that feedback is replaced by a neutral mask, (3) Compute how the probability of the real next action changes between the two, (4) Use that difference as the turn's reward. Why it matters: Without counterfactual comparison, you can't tell if a turn's feedback truly mattered or if the agent would act the same anyway. Anchor: With the answer "the device shows 2G and data off," the agent's next-step probabilities (like "enable data" vs. "change network mode") shift a lot, so the turn is rewarded.
Hook: Think of a mixing board where you turn up the "information" slider when the song's main melody is too quiet, and turn it down when the melody is loud and clear. The Concept (Variance-Gated Fusion): A way to blend information rewards and outcome rewards by looking at how different the outcomes are within a rollout group. How it works: (1) Measure variance of end-task rewards across the group, (2) If variance is tiny (all similar), it's hard to learn from outcomes, so turn up the weight on information gain, (3) If variance is big (discriminative), trust outcomes more and turn down the info weight, (4) Combine to get a single training signal. Why it matters: Without adaptive blending, training can stall when outcomes don't distinguish good from bad conversations. Anchor: Early in training, many conversations tie at "fail," so the gate leans on info gain; later, as some succeed, the gate shifts focus to winning strategies.
Hook: In science class, to see what really causes a change, you change only one thing at a time. The Concept (Causal Isolation via Teacher Forcing and Masking): Ensuring that the measured change in the next step comes only from the feedback, not random generation noise. How it works: (1) Compare log-probabilities for the same next-action tokens with vs. without the real feedback, (2) Keep everything else fixed, (3) Attribute differences to that feedback. Why it matters: Without isolating causes, you might wrongly reward random fluctuations. Anchor: You read the same future answer the agent actually produced, but test it under "with feedback" vs. "with placeholder": a fair A/B test.
Hook: Picture water flowing through pipes: information flows from what you observe to what you decide. The Concept (Directed Information Flow and Mutual Information Intuition): The turn reward, in expectation, equals how much the feedback tells you about your next action (conditional mutual information), and summed over turns, it equals the directed information from observations to decisions. How it works: (1) Each turn's reward measures info from feedback to next action, (2) Sum across turns to get total info flow, (3) Theory shows a minimum total info is necessary to reliably solve tasks with hidden goals. Why it matters: Without enough information flow, success is mathematically unlikely; you can't guess the hidden goal. Anchor: If there are M hidden intents, you must gather enough bits of info to narrow to the right one; InfoPO's reward tracks that progress.
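The expectation claim above can be written out as a formula. The notation below is a paraphrase (h_t for the history so far, f_t for the turn's feedback, a_{t+1} for the realized next action, ∅ for the neutral mask), not necessarily the paper's exact symbols:

```latex
% Per-turn reward: log-likelihood shift of the realized next action
% when the feedback f_t is present vs. replaced by a neutral mask:
r_t = \log \pi_\theta\big(a_{t+1} \mid h_t, f_t\big)
    - \log \pi_\theta\big(a_{t+1} \mid h_t, \varnothing\big)

% In expectation over the policy's own rollouts, this equals the
% conditional mutual information between feedback and next action,
% and the per-turn terms sum to a directed-information quantity:
\mathbb{E}\,[r_t] = I\big(f_t \,;\, a_{t+1} \mid h_t\big),
\qquad
\sum_{t} \mathbb{E}\,[r_t] = I\big(f_{1:T} \to a_{1:T}\big)
```

Read this way, the necessity result is intuitive: if the directed information stays below roughly log M bits for M equally likely hidden intents, the agent cannot reliably identify the right one.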
Multiple analogies for the key idea:
- Detective analogy: Good questions that change your suspect ranking get points; wild guesses donât.
- Video game fog analogy: Each useful question lifts more fog, changing where you walk next; the reward measures how the path changes when the fog lifts.
- Cooking analogy: Tasting the soup and adjusting seasoning changes your next cooking action; the reward is bigger if that taste caused a different choice.
Before vs. After:
- Before: Agents were often rewarded only at the end; they didn't know which turn helped. They overfit to short scripts, collapsed when outcomes tied, and missed early clarifications.
- After: Agents get per-turn credit for informative feedback, learn faster when outcomes tie, and naturally adopt "clarify-then-act" strategies.
Why it works (intuition, no equations): The more a piece of feedback changes the agent's next-step decision, the more uncertainty it removed. Measuring that change is a clean, task-agnostic signal. Blending it with outcome rewards ensures the agent still aims for real success, not just asking questions forever. The variance gate automatically balances these two over training.
Building blocks:
- Counterfactual masking: Replace feedback with a neutral placeholder to compute a fair comparison.
- Teacher forcing: Evaluate probabilities on the actual next action tokens to keep the test stable.
- Group-relative normalization: Standardize rewards within a group to control scale and stabilize learning.
- Variance-gated fusion: Automatically tune the blend between information and outcomes.
- PPO-style update with KL to a reference: Keep updates stable and avoid drifting too far, too fast.
03 Methodology
High-level recipe: Input (prompt + environment) → Roll out a group of conversations → Compute two signals (task outcome per trajectory, info gain per turn) → Normalize each → Blend with a variance gate → Update the policy with PPO-style clipping and KL regularization → Output: a better agent that asks smarter questions and acts decisively.
Step-by-step, with what/why/examples:
- Grouped rollouts (collect experiences)
- What: For each task prompt, sample G trajectories (conversations) from the current agent until a turn limit or termination.
- Why: Comparing similar conversations reduces noise and gives a stable baseline (group-relative learning).
- Example: For the travel prompt "Hotel in Seattle next weekend," we generate 5 different conversations in parallel; some ask about budget first, some don't.
- Outcome signal (trajectory-level)
- What: Score each full trajectory with the benchmark's external metric (success, accumulated reward, pass rate, etc.).
- Why: This anchors learning to real goals: finishing the booking, passing unit tests, or resolving the customer's issue.
- Example: If the agent booked correctly, success=1; if coding tests passed 7/10, score=0.7.
- Turn-level information gain (intrinsic signal)
- What: For each valid turn t, compute how much the latest feedback changes the policyâs probabilities for the very next action tokens.
- How: (a) Take the real history including the feedback and compute the log-prob of the actual next-action tokens, (b) Take a counterfactual history where the feedback is replaced by a neutral placeholder (like "No information found.") and compute the log-prob of the same tokens, (c) Subtract to get the shift, averaged over the next-action segment.
- Why: This isolates the impact of that feedback on the next decision: dense, turn-level credit without extra environment calls.
- Example: The user says "Keep the hotel under 250." With it, the next-action probabilities concentrate on budget-appropriate options; without it, they are spread out or focus on the wrong actions. The difference is the reward for that turn.
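The subtraction in step (c) is small once per-token log-probabilities are in hand. Scoring tokens under the two histories is model-specific and not shown; `logp_with` and `logp_masked` below are assumed inputs holding per-token log-probs of the same next-action tokens under the real and masked histories:

```python
def info_gain(logp_with, logp_masked):
    """Turn-level information-gain reward.

    Both lists hold log-probabilities of the SAME next-action tokens
    (teacher forcing), scored with the real feedback present vs.
    replaced by a neutral placeholder. Returns the mean per-token shift.
    """
    if len(logp_with) != len(logp_masked) or not logp_with:
        raise ValueError("token scores must be aligned and non-empty")
    return sum(w - m for w, m in zip(logp_with, logp_masked)) / len(logp_with)

# Feedback raises each of 5 tokens from log-prob -1.5 to -1.0,
# so the turn earns +0.5: the feedback made the chosen action likelier.
reward = info_gain([-1.0] * 5, [-1.5] * 5)
```

Because only the feedback span differs between the two scorings, the resulting shift is attributable to that feedback alone.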
- Group-relative normalization (stabilize scales)
- What: Normalize outcome scores across the group to get outcome advantages; normalize turn-level info gains across valid turns in the group to get info advantages.
- Why: Different tasks and turns have different scales. Normalization keeps learning stable and fair.
- Example: If all five rollouts scored around 0.4-0.45, their standardized values center near zero; turns with unusually high info gain also stand out after standardization.
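The normalization step is ordinary standardization within the group; a minimal sketch (the epsilon guard against zero variance is a common implementation choice, not necessarily the paper's):

```python
def group_normalize(values, eps=1e-8):
    """Standardize a group of rewards to zero mean and unit std.

    Applied separately to trajectory outcomes within a rollout group
    and to turn-level info gains across the group's valid turns.
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / (std + eps) for v in values]

# Five tightly clustered outcome scores center near zero after
# normalization; the best rollout (0.45) still stands out on top.
adv = group_normalize([0.40, 0.45, 0.42, 0.41, 0.44])
```

The same helper can be reused for both signals, which keeps their scales comparable before fusion.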
- Variance-gated fusion (adaptive blending)
- What: Compute the standard deviation of outcomes in the group. If outcomes barely differ (near-zero variance), increase the weight on info gain; if they differ a lot, reduce it.
- Why: Early on, outcomes often tie (everyone failing similarly), so info gain keeps learning moving. Later, outcomes become informative, so we prioritize them.
- Example: At the start, most rollouts fail, so the gate turns up the info advantage. By mid-training, some succeed, so the gate shifts weight back to outcomes.
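One way to sketch the gate; the linear-decay form and the `tau` threshold below are my assumptions for illustration (the paper's exact gating function may differ), but the behavior matches the description: full info weight when outcomes tie, zero when they clearly separate:

```python
def info_weight(outcomes, w_max=1.0, tau=0.1):
    """Hypothetical variance gate: the weight on the info-gain advantage
    decays as the standard deviation of group outcomes grows."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    std = (sum((o - mean) ** 2 for o in outcomes) / n) ** 0.5
    return w_max * max(0.0, 1.0 - std / tau)

def fuse(outcome_adv, info_adv, w):
    """Blend the two normalized signals into one training advantage."""
    return outcome_adv + w * info_adv

w_tied = info_weight([0.0, 0.0, 0.0, 0.0])    # all rollouts fail alike
w_spread = info_weight([0.0, 1.0, 0.0, 1.0])  # discriminative outcomes
```

With tied outcomes the gate returns `w_max` and the info signal drives learning; with spread outcomes it returns 0 and the outcome advantage takes over.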
- Token-level assignment and update (PPO + KL)
- What: Broadcast the fused advantage to the tokens of each assistant response segment and run a PPO-style clipped update, plus a KL penalty to a reference model.
- Why: Token-level updates align with how LLMs generate text; clipping and KL keep updates stable and prevent drifting too far from a safe distribution.
- Example: The advantage for the "ask budget" turn is attached to those generated tokens; the policy is nudged to make such turns more likely next time, but clipping and KL avoid overcorrection.
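The per-token objective can be sketched with the standard PPO clipped surrogate plus a simple log-ratio KL penalty toward the reference model; this is one common estimator, and the paper's exact loss and coefficients may differ:

```python
import math

def token_loss(logp_new, logp_old, logp_ref, advantage,
               clip_eps=0.2, kl_coef=0.01):
    """PPO-style clipped surrogate for one token of an assistant turn,
    with a KL penalty to a frozen reference policy."""
    ratio = math.exp(logp_new - logp_old)          # importance ratio
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl = logp_new - logp_ref                       # per-token log-ratio
    return -surrogate + kl_coef * kl

# Before the first update the policies coincide, so the loss is just
# the negated (fused) advantage broadcast to this token.
baseline = token_loss(-1.0, -1.0, -1.0, 1.2)
```

Clipping caps how far a single batch can push the ratio, and the KL term keeps the policy near the reference distribution, matching the stability argument above.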
- No extra environment calls for counterfactuals
- What: Counterfactual evaluations are done offline with teacher forcing; no new actions are executed in the environment.
- Why: Saves time and avoids altering the environment state. Overhead is typically less than 2× wall-clock (around 1.63× in runs).
- Example: For each turn, we just run two forward passes (with feedback vs. with placeholder) on the same next-action tokens.
Concrete micro-example (toy numbers):
- Next action length = 5 tokens. With feedback, average log-prob = −1.0 per token; with placeholder, average log-prob = −1.5. Info gain = (−1.0) − (−1.5) = +0.5 per token, summed/averaged ≈ +0.5 for the turn. After normalization within the group, suppose this becomes +1.2. If outcome variance is tiny, the gate weight might be 0.8, so the blended advantage adds 0.96 (= 0.8 × 1.2) to that turn's token advantages.
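The toy arithmetic checks out line by line (the +1.2 and 0.8 values are the illustrative numbers from the example above, not computed quantities):

```python
# Per-token log-prob with feedback vs. with the neutral placeholder.
gain = (-1.0) - (-1.5)            # +0.5 per token, averaged over 5 tokens

normalized_info_adv = 1.2         # illustrative value after group normalization
gate_weight = 0.8                 # near-zero outcome variance -> high info weight
blended_contribution = gate_weight * normalized_info_adv  # 0.96
```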
What breaks without each step:
- Without grouped rollouts: Noisy baselines; unstable learning.
- Without outcome signal: Agent might keep asking forever and not finish tasks.
- Without info gain: Sparse rewards; slow learning; misses clarifications.
- Without normalization: A few extreme turns dominate; unstable updates.
- Without gating: Training stalls when outcomes tie, or drifts when outcomes dominate too soon.
- Without PPO+KL: Policy may collapse or diverge.
Secret sauce:
- Causal isolation with teacher forcing and masking gives a clean, cheap, per-turn signal.
- Variance-gated fusion keeps learning shaped and purposeful across training phases.
- The signal is task-agnostic, avoiding bespoke heuristics or expensive reward models while still aligning with final goals.
04 Experiments & Results
The tests and why: The authors evaluated three kinds of interactive skills (asking for missing intent, collaborating on code, and making tool-augmented decisions) because real assistants need all three. They measured success rates, pass rates on hidden tests, and average rewards, which tell us if the agent both clarified well and completed tasks correctly.
The competition: InfoPO was compared to strong prompting strategies (ReAct, Reflexion) and multiple RL baselines (UserRL, RAGEN, Search-R1). These are respected methods that either guide reasoning without training or try to stabilize multi-turn RL with group-relative tricks.
Scoreboard with context:
- UserGym (8 diverse gyms): On Qwen2.5-7B, InfoPO was best in 7/8 sub-environments and lifted the macro average. It especially shined in held-out generalization gyms where goals are underspecified, like IntentionGym (1.892 vs. 1.826 for the best baseline) and SearchGym (0.480 vs. 0.446). In travel subsets, it improved tough variants too, meaning it's not just memorizing easy patterns.
- ColBench (collaborative coding): InfoPO improved both Pass and Success. For example, Pass rose to 0.534 vs. 0.457 for the best baseline, even edging GPT-4.1's 0.529 on that metric. This shows the agent learns to clarify data schemas and requirements before coding.
- τ-Bench (airline/retail/telecom, long-horizon with dual control): InfoPO matched or improved on the best open-source baselines in all domains (e.g., Telecom 0.181; Retail 0.188; Air 0.150), notable because tasks often exceed 30 turns and only 178 tasks are available: hard mode for RL.
Making the numbers meaningful:
- A 14-16% relative gain over GRPO-style methods is like moving from a "solid B" to an "A-" in a tough class, especially impressive when the tests are long and tricky.
- Early training often had 30-80% of rollout groups with zero outcome variance (everyone failing in the same way). Where most methods see near-zero learning signals, InfoPO's info gain kept gradients alive, like switching on headlights in fog.
- Training curves showed InfoPO starts improving earlier and oscillates less, meaning it learns faster and steadier.
Surprising findings:
- Emergent "clarify-then-act": As training progressed, the info-gain rewards concentrated on early turns. The agent learned to ask the sharpest questions first, then execute.
- Context-aware interaction depth: In UserGym and ColBench, the agent first explored a bit (more turns) to resolve ambiguity, then shortened responses. In τ-Bench (super long), it pruned from the start, cutting wasteful steps immediately.
- Generalization beyond user chats: On Sokoban and WebShop (environment-interactive tasks), InfoPO avoided the "Echo Trap" collapse seen in standard GRPO and kept improving: evidence that "reduce uncertainty, then act" is a general principle.
- Robust to user simulator shifts: Using stronger or better-prompted simulators at test time nudged scores up in τ-Bench and sometimes ColBench, with mixed effects in UserGym (reflecting stricter user behavior). This suggests InfoPO learns policies that still work when the "user" style changes.
Efficiency:
- Counterfactual evaluation adds extra forward passes but no extra environment calls; wall-clock overhead averaged about 1.63×, far from the naive 2× worst case. Considering the stability and gains, this is a practical trade-off.
05 Discussion & Limitations
Limitations:
- Extra compute per turn: The counterfactual pass adds overhead (about 1.6× wall clock), which may matter for very long contexts or tight budgets.
- Text-centric scope: Experiments focus on text interfaces; extending to multimodal (vision/audio) or embodied agents needs careful design for masking and teacher forcing.
- Simulator realism: Training quality depends on user simulators. If they're inconsistent or unrealistic, the learned interaction style might overfit. The paper mitigates this with user-shift tests, but real human-in-the-loop trials are the gold standard.
- Extremely short tasks: If a task is one-shot or trivially short, info-gain per turn provides little advantage over a standard outcome reward.
- When counterfactuals are ill-defined: For tasks where "removing" feedback breaks coherence (e.g., creative storytelling), masking may be less meaningful.
Required resources:
- A capable base LLM, PPO-style training stack, and GPU memory for long rollouts.
- A simulator or environment to provide feedback and outcomes.
- Reference model weights for KL regularization.
When not to use:
- Purely single-turn Q&A with immediate, reliable labels, where standard supervised fine-tuning may suffice.
- Creative tasks where measuring a "changed next action" isn't aligned with quality.
- Settings with strict latency limits where even modest extra compute per turn is unacceptable.
Open questions:
- Multimodal extension: How to mask and measure info gain from images, audio, or sensors?
- Human feedback: Can we combine InfoPOâs intrinsic signal with light-weight human preferences or PRMs for even better guidance?
- Adaptive gates: Can the gate learn richer signals (beyond variance) to switch emphasis more intelligently?
- Theory to practice: Are there tighter bounds connecting required information gain to success in messy, real-world tasks?
- Safety and alignment: How does rewarding âinformation that changes actionsâ interact with safety constraints and refusal policies?
06 Conclusion & Future Work
Three-sentence summary: InfoPO trains chatty, tool-using agents by rewarding turns that truly reduce uncertainty (measured by how much the latest feedback changes the agent's next action compared to a masked counterfactual) and by blending this signal with end-task rewards via a variance-aware gate. This gives dense, turn-level credit when outcomes tie and keeps optimization focused on real success when outcomes differentiate, leading to earlier, steadier, and stronger learning. Across intent clarification, collaborative coding, and decision-making with tools, InfoPO consistently beats strong baselines, generalizes to new users and environments, and comes with theory showing information gain is necessary for success.
Main achievement: A simple, principled, and scalable way to do turn-level credit assignment (task-agnostic, causally isolated, and adaptively fused with outcomes) that unlocks robust learning for multi-turn agents.
Future directions: Extend to multimodal and embodied settings, mix with lightweight human/process rewards, learn smarter gates, and validate extensively with real users. Explore safety-aware variants where only safe, on-policy information shifts are rewarded.
Why remember this: InfoPO turns "ask good questions first, act confidently next" into a measurable training signal. It's a clean bridge between information theory and practical RL for agents, showing that uncertainty reduction per turn isn't just nice; it's necessary for reliable success in the real world.
Practical Applications
- Train customer support agents to diagnose issues by asking the fewest, highest-impact questions before fixing.
- Improve travel and shopping assistants so they elicit budget and preferences early, then finalize bookings or orders correctly.
- Enhance coding copilots to clarify ambiguous requirements before writing code, boosting pass rates on unit tests.
- Guide tool-using agents (search, databases, APIs) to gather key facts first, then execute accurate, efficient tool calls.
- Speed up troubleshooting flows (e.g., telecom, IT helpdesk) by rewarding steps that quickly narrow down the root cause.
- Strengthen planning agents (research, project setup) to identify missing constraints up front, avoiding rework later.
- Stabilize long-horizon agents (games, simulations) by rewarding information that meaningfully changes the next move.
- Adapt to different user styles or simulators by learning general "clarify-then-act" strategies that transfer well.
- Reduce token waste in long conversations by pruning low-information turns and focusing on decisive actions.
- Combine with lightweight human or process rewards to align agents with domain-specific safety and quality norms.