Interactive Benchmarks
Key Summary
- This paper says we should test AI the way real life works: by letting it ask questions, gather clues, and make smart moves step by step under a limited budget.
- It introduces Interactive Benchmarks, a single framework that measures how well models acquire information and reason during multi-turn interactions.
- There are two main kinds: Interactive Proofs (finding a definite truth with a Judge) and Interactive Games (playing strategically to win rewards).
- In logic puzzles, all models failed without interaction, proving that active questioning is essential to solve these problems.
- In math, interactive checking outperformed pass@k sampling under the same token budget, showing that interaction can save time and improve accuracy.
- In poker, some models made steady profits while balancing aggression and smart risk management, revealing long-horizon strategic strengths and weaknesses.
- In the Trust Game, only a few models beat simple classic strategies, showing lots of room to improve adaptive cooperation and defection.
- The framework gives clearer signals about true reasoning ability and reduces issues like benchmark contamination and memorization.
- Interactive Benchmarks are practical: they mimic real-world tasks where information is incomplete and you must ask, plan, and adapt.
- The authors release a unified, reusable setup so researchers can fairly compare models across interactive tasks.
Why This Research Matters
Real life is interactive: we rarely have all the facts upfront, and we must ask, test, and adapt under limits. Interactive Benchmarks capture this reality by measuring how well AI acquires information and uses it to reason, not just how nicely it answers a one-shot question. This helps us build assistants that can plan research, troubleshoot code, navigate the web, or negotiate, safely and efficiently. It reduces the risk of hallucinations by rewarding verification and self-correction. It also gives clearer, fairer rankings when benchmarks get saturated or contaminated. In short, it's a better compass for advancing practical, trustworthy AI.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you do a science project, you don't get all the answers at once? You have to ask questions, test things, and keep track of what you learn. That back-and-forth is what makes you smarter.
The Concept (Interactive Benchmarks): Interactive Benchmarks are tests where an AI learns and reasons by asking questions and taking actions over multiple turns, all under a limited budget. How it works: (1) The AI sees the current situation, (2) decides what to ask or do next, (3) gets feedback, (4) updates its beliefs, and (5) repeats until it must give an answer or runs out of budget. Why it matters: Without interaction, we judge AIs as if they're taking a one-shot guess, which hides whether they know how to get missing information, the heart of real problem solving.
Anchor: Think of 20 Questions. You don't blurt out the final answer immediately; you ask smart questions to narrow things down. These benchmarks reward that skill.
The World Before: For years, we graded AI with fixed tests like math sheets (GSM8K) or super-knowledge quizzes (MMLU). They're handy, but they're becoming crowded, sometimes leaked into training data, and they don't reflect what happens when the AI has to figure things out mid-task. Preference arenas (like Chatbot Arena) tell us what people like, but not necessarily whether a model is good at careful reasoning. Agentic benchmarks moved closer to real life by letting AIs use tools and act in environments, but they often require heavy setups and still don't directly test whether the AI knows which question to ask next or how to manage a tiny budget of tries.
The Problem: Real tasks are messy: the facts are incomplete, the clock is ticking, and choices cost time and tokens. A key part of intelligence is knowing when you don't know enough, and then actively collecting the most helpful information. Static tests rarely measure this because they make the AI a passive answer machine.
Failed Attempts: (1) Just sample more answers (pass@k). This wastes compute on full wrong solutions and doesn't show if the AI can check and fix itself mid-solution. (2) Rely on human preferences. That mixes style with substance and can't nail down objective reasoning skill. (3) Complex agent setups. They can be too tied to specific tools or environments, making results hard to generalize.
The Gap: We needed a clean, reusable way to measure "information-acquiring reasoning" that: (a) works across different domains, (b) uses strict, interpretable feedback (like yes/no), (c) respects a budget, and (d) makes scores comparable and meaningful.
Hook to New Concepts:
- You know how in a mystery, the detective interviews witnesses to uncover the truth? That's Interactive Proofs: the AI queries a Judge to lock onto a correct explanation.
- You know how in games like poker, you change tactics based on what others might do? That's Interactive Games: the AI chooses actions to maximize long-run rewards.
Why This Matters in Daily Life: Scheduling, research, debugging, medical triage, shopping, and negotiations all require deciding what to ask, when to stop searching, and how to act with uncertainty. If our AIs can't do this, they'll hallucinate, waste tokens, or make poor decisions. Interactive Benchmarks reveal whether an AI can be a smart detective or a careful strategist, not just a good guesser.
Anchor: Imagine planning a field trip. You don't guess a destination and hope. You ask the bus schedule, check the weather, compare prices, and decide. A good AI should work the same way; these benchmarks test that exact skill.
02 Core Idea
Hook: Imagine you're playing 20 Questions with a timer and only 10 questions allowed. You'll pick each question carefully to squeeze the most information from every turn.
The Aha! Moment: The key insight is that intelligence isn't just answering; it's choosing the right next question or action under a budget, then updating beliefs as feedback arrives.
Multiple Analogies:
- Detective Analogy: A great detective doesn't dump all theories at once; they ask the most revealing question next. Interactive Benchmarks test that detective skill.
- Grocery Analogy: Shopping with $10 means you pick items that give the most nutrition per dollar. Here, the AI picks questions/actions that give the most information per budget.
- Telescope Analogy: Astronomers choose where to point a limited-time telescope. The AI does the same with its attention, aiming each query where it reduces uncertainty the most.
Before vs. After:
- Before: AIs were graded mostly on one-shot replies, which mixed lucky guesses with real reasoning and hid self-correction.
- After: AIs are graded on sequences: what they ask, what they learn, how fast they converge, and whether they win in strategic play. We can finally see whether they actively acquire and use information well.
Why It Works (Intuition, no equations):
- Budgeting adds pressure to prioritize the most informative steps first.
- Binary (yes/no) feedback in proofs keeps evaluation objective and prevents hints that solve the puzzle by accident.
- Long-horizon games force planning and adaptation, revealing whether the agent can switch strategies when opponents change.
Building Blocks (with Sandwich explanations):
- Hook: You know how teachers sometimes grade your process, not just the final answer? Interactive Benchmarks: A unified way to score an AI's step-by-step information gathering and reasoning under a limit. How: track the history, allow a next move, give restrained feedback, repeat until stop. Why: Without this, we can't tell if an AI knows how to learn mid-task. Anchor: Like showing your scratch work during math class and being graded on both your steps and your final answer.
- Hook: Imagine there's a referee who secretly knows the truth about a riddle. Interactive Proofs: The AI (Player) questions a Judge who has the hidden truth, getting answers like yes/no/both/irrelevant. How: the AI proposes checks on partial steps and prunes wrong paths quickly. Why: Without this, models waste tokens finishing full wrong solutions. Anchor: Like narrowing a suspect list by asking, "Was the suspect tall?" then "Wearing a hat?" until only one story fits.
- Hook: Think of a strategy game where each move changes what happens next. Interactive Games: The AI acts in an environment (like poker) to maximize reward over time. How: observe state and history, choose an action (fold/call/raise, etc.), see the payoff, adapt. Why: Without this, we can't measure planning, risk, or adaptation to other agents. Anchor: Playing Texas Hold'em: you use odds and reads to decide whether to bluff or fold this turn so future turns go better.
- Hook: When you have only a few questions left, you ask the most revealing one. Epistemic Truth-Seeking: The goal in proofs is to reduce uncertainty until the explanation is locked in. How: ask questions that split the space of possibilities. Why: Without focusing on uncertainty, questions become random and waste budget. Anchor: In 20 Questions, asking "Is it an animal?" cuts the options way more than asking "Does it like pizza?"
- Hook: When you're saving up allowance, you try to get the most fun per dollar. Utility Maximization: In games, the goal is to earn the most long-term reward. How: weigh risks, consider opponents, and choose actions that pay off across future rounds. Why: Without it, you might win a tiny pot now but lose big later. Anchor: Taking a small loss now to set up a big win later, like folding a weak hand to keep chips for a better spot.
- Hook: When the facts are weird, you guess the simplest story that fits them. Abductive Reasoning: Make best-guess explanations from odd clues. How: hypothesize, test with a question, keep or discard. Why: Without it, you cling to bad stories or chase random ideas. Anchor: Seeing wet streets, you suspect it rained; then you ask, "Did it rain?" to confirm.
- Hook: A puzzle piece only makes sense in the whole picture. Contextual Integration: Combine many small answers into one consistent story. How: track constraints; a new "no" might force a different path. Why: Without integration, you get contradictions and dead-ends. Anchor: Assembling a jigsaw: edges and colors from many pieces must match up to form the final image.
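The "split the space" idea behind Epistemic Truth-Seeking can be sketched in a few lines: among candidate yes/no questions, prefer the one whose answer splits the remaining hypotheses closest to 50/50. Everything below (the toy candidates and question functions) is illustrative, not from the paper.

```python
def best_question(candidates, questions):
    """Pick the question whose yes/no split is closest to 50/50,
    i.e., the one that removes the most uncertainty either way."""
    def imbalance(q):
        yes = sum(1 for c in candidates if q(c))
        return abs(yes - (len(candidates) - yes))  # 0 = perfect split
    return min(questions, key=imbalance)

# Toy 20-Questions world: candidates are things with attributes.
candidates = [
    {"name": "dog",   "animal": True,  "flies": False},
    {"name": "eagle", "animal": True,  "flies": True},
    {"name": "rock",  "animal": False, "flies": False},
    {"name": "kite",  "animal": False, "flies": True},
]
is_animal   = lambda c: c["animal"]
can_fly     = lambda c: c["flies"]
likes_pizza = lambda c: c["name"] == "dog"   # splits 1 vs 3: wasteful

q = best_question(candidates, [likes_pizza, is_animal, can_fly])
# is_animal and can_fly both split 2/2; likes_pizza splits 1/3 and loses.
```

The same greedy rule generalizes to weighted hypotheses (entropy reduction), but the 50/50 heuristic already captures why "Is it an animal?" beats "Does it like pizza?"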
03 Methodology
High-level Overview (like a recipe): Input → [Choose setting: Proofs or Games] → [Plan next query/action based on history and budget] → [Get constrained feedback/payoff] → [Update beliefs/strategy] → [Repeat until stop] → Output final answer or total reward.
Core Setup Pieces:
- History h_t: everything seen so far (messages, actions, states).
- Actions a_t: in proofs, ask yes/no-style questions or submit the final answer; in games, choose legal moves.
- Budget B: every action costs something; once out of budget, you must answer.
- Feedback: proofs give {yes, no, both, irrelevant} or {correct/incorrect}; games give observations and payoffs.
Formal Objectives (with friendly numeric examples):
- Interactive Proofs objective: reach a correct final answer while the total cost of questions plus the final guess stays within the budget B. Example: Suppose each question costs 1, a final guess costs 2, and your budget is B = 6. If you ask 4 questions and then guess, the total cost is 4·1 + 2 = 6 (fits); if you try 5 questions plus a guess, 5·1 + 2 = 7 > 6 (over budget), so it's not allowed.
- Interactive Games objective: maximize the expected discounted sum of rewards, r_0 + γ·r_1 + γ²·r_2 + …. Example: If γ = 0.9 and the rewards are 1, 1, 1, the discounted sum is 1 + 0.9·1 + 0.81·1 = 2.71.
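The discounted-sum objective can be checked with a couple of lines; the reward values and discount factor here are illustrative.

```python
def discounted_return(rewards, gamma):
    """Discounted return: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

total = discounted_return([1, 1, 1], 0.9)   # 1 + 0.9 + 0.81 = 2.71
```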
Trust Game Horizon:
- Random number of rounds, with continuation probability δ: after each round the match continues with probability δ and ends with probability 1 - δ. Example: If δ = 0.9, every round has a 90% chance of being followed by another and a 10% chance of being the last.
- Expected length: E[rounds] = 1/(1 - δ). Example: If δ = 0.9, then 1/(1 - 0.9) = 10 rounds on average.
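A quick simulation sanity-checks the expected match length, assuming the horizon is geometric with continuation probability δ (here δ = 0.9, an illustrative value):

```python
import random

def sample_horizon(delta, rng):
    """One Trust Game match length: keep playing with probability delta."""
    rounds = 1
    while rng.random() < delta:
        rounds += 1
    return rounds

rng = random.Random(0)                  # fixed seed for reproducibility
lengths = [sample_horizon(0.9, rng) for _ in range(20000)]
mean = sum(lengths) / len(lengths)      # should land near 1 / (1 - 0.9) = 10
```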
Tournament Scoring (Trust Game):
- Average score per round: total points earned divided by total rounds played, summed across matches. Example: If a model played 2 matches with (6 total points over 5 rounds) and (4 total over 3 rounds), then score = (6 + 4)/(5 + 3) = 10/8 = 1.25.
- Cooperation rate: the fraction of rounds in which the model cooperates. Example: If it cooperates 6 times in 8 rounds, the rate is 6/8 = 0.75.
- Betrayal rate (defect right after the opponent cooperated): BetrayalRate ≈ #(opponent cooperated last round and we defect now) / #(opponent cooperated last round). Example: If the opponent cooperated last round 4 times and we defected the next round once, the rate is 1/4 = 0.25.
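These three metrics can be computed from match logs. The sketch below assumes a log of (our_move, opponent_move) pairs, which is an illustrative format rather than the paper's.

```python
def score_per_round(matches):
    """matches: list of (total_points, rounds) pairs, one per match."""
    points = sum(p for p, _ in matches)
    rounds = sum(r for _, r in matches)
    return points / rounds

def cooperation_rate(moves):
    """moves: list of (our_move, opp_move), each 'C' or 'D'."""
    return sum(1 for ours, _ in moves if ours == "C") / len(moves)

def betrayal_rate(moves):
    """Fraction of opponent-cooperations we answered with a defection."""
    chances = betrayals = 0
    for prev, cur in zip(moves, moves[1:]):
        if prev[1] == "C":              # opponent cooperated last round
            chances += 1
            betrayals += cur[0] == "D"  # ... and we defect now
    return betrayals / chances if chances else 0.0

avg = score_per_round([(6, 5), (4, 3)])  # (6 + 4) / (5 + 3) = 1.25
moves = [("C", "C"), ("C", "C"), ("D", "C"),
         ("C", "D"), ("C", "C"), ("C", "C")]
betrayal = betrayal_rate(moves)          # 1 betrayal / 4 chances = 0.25
```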
Budget Matching for pass@k in Math:
- Choose the smallest k whose sampling cost matches the interactive token budget: k = ⌈(interactive tokens)/(tokens per attempt)⌉. Example: If one pass@1 attempt uses 200 tokens on average and an interactive run uses 600 tokens, then k = 3, because 3 × 200 = 600 is the smallest match.
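The budget-matching rule amounts to a one-line calculation (the function name is ours, for illustration):

```python
import math

def matched_k(tokens_per_attempt, interactive_tokens):
    """Smallest k whose k independent samples cost at least the interactive run."""
    return math.ceil(interactive_tokens / tokens_per_attempt)

k = matched_k(200, 600)   # 3, since 3 * 200 = 600 already covers the budget
```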
Now, step-by-step recipes for each setting:
A) Interactive Proofs: Logic (Situation Puzzles)
- What happens: The Player asks yes/no/both/irrelevant questions to find a hidden explanation (like the reason someone felt happy after being shoved). The Judge only replies with restrained labels: no spoilers.
- Why this step exists: Restricting answers avoids giving away the solution and forces the Player to plan informative splits, pruning wrong stories fast.
- Example with data: In the "sleeping brothers" puzzle, asking "Did the death change the sleeping arrangement?" (yes) narrows the space. Later, "Did you divide the body to place halves on both sides?" (yes) clinches a shocking but consistent explanation.
- Secret sauce: Abductive reasoning + strategic inquiry. Every good question shrinks uncertainty a lot.
B) Interactive Proofs: Math (Interactive checking vs. pass@k)
- What happens: The Player proposes intermediate claims (like a lemma or equation) and asks the Judge if each is valid (yes/no/both/irrelevant). Wrong branches get pruned immediately.
- Why this step exists: Without early checks, a small early mistake wastes a long solution. With checks, wrong branches die early, so fewer tokens are spent and the correct path is reached faster.
- Example with data: In the "Roman numerals love letter" problem, the Player verifies intermediate claims such as "Do we have 27 symbols (26 letters + space)?" (yes), confirms which quantities are needed, and then computes the largest value the problem asks for.
- Secret sauce: Turn long proofs into guided searches with guardrails; you see not just the final answer but the modelās ability to test and fix itself.
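The guided-search idea, prune a branch at the first "no" instead of finishing full wrong solutions, can be sketched as follows. The judge function and branch structure are illustrative assumptions, not the paper's protocol.

```python
def interactive_solve(branches, judge):
    """Check each step with the judge; abandon a branch at the first 'no'.

    branches: candidate solution paths, each a list of intermediate claims.
    judge:    claim -> bool (a stand-in for the yes/no Judge model).
    Returns the surviving branch's final claim and how many checks were spent.
    """
    checks = 0
    for steps in branches:
        survived = True
        for step in steps:
            checks += 1
            if not judge(step):        # prune immediately: no wasted work
                survived = False
                break
        if survived:
            return steps[-1], checks
    return None, checks

valid = {"lemma A", "lemma B", "final answer"}
branches = [["wrong lemma", "more wrong work", "wrong answer"],
            ["lemma A", "lemma B", "final answer"]]
result = interactive_solve(branches, judge=lambda s: s in valid)
# result == ("final answer", 4): the bad branch dies after a single check
```

Without the mid-solution checks, the bad branch would consume all three of its steps before being rejected; with them, it costs one.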
C) Interactive Games: Texas Hold'em Poker
- What happens: The Agent sees cards and betting history, then outputs a legal action plus a brief reason. Invalid outputs trigger a second chance; two failures result in an automatic fold.
- Why this step exists: Strict formats ensure fairness and prevent free-form text from sneaking in extra signals.
- Example with data: With ace-six suited and one raiser, the Agent calls to see the flop because pot odds and playability are good; later it bets for protection when a helpful card appears.
- Secret sauce: Balancing math (odds, ranges) and psychology (opponent tendencies) over many hands.
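The pot-odds arithmetic behind a call like that can be written out explicitly (a textbook simplification, not the Agent's full policy):

```python
def should_call(pot, to_call, win_prob):
    """Call when estimated equity beats the break-even price of the call."""
    pot_odds = to_call / (pot + to_call)   # break-even win probability
    return win_prob > pot_odds

# Facing a 50-chip bet into a 150-chip pot, we need > 25% equity:
call = should_call(pot=150, to_call=50, win_prob=0.30)   # True
```

The psychology half of the "secret sauce" (opponent tendencies) would adjust `win_prob` itself; this sketch only covers the math half.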
D) Interactive Games: Trust Game (Iterated Cooperation/Defection)
- What happens: Each round both players secretly pick COOPERATE or DEFECT; the match may continue (probability δ). The Agent should adapt to the opponent and the horizon.
- Why this step exists: Fixed final rounds cause last-move traps. A random horizon makes short-term betrayal vs. long-term trust a real, learnable tradeoff.
- Example with data: Against a mostly cooperative opponent, staying cooperative earns steady points; against a sneaky defector, switching to DEFECT protects your score.
- Secret sauce: Learning a compact model of the opponent from recent history and updating quickly.
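One simple way to realize that compact opponent model is a sliding window over the opponent's recent moves; the window size and threshold below are illustrative choices, not the paper's policy.

```python
from collections import deque

class AdaptiveAgent:
    """Cooperate while the opponent's recent cooperation rate stays high."""

    def __init__(self, window=5, threshold=0.5):
        self.recent = deque(maxlen=window)   # opponent's last few moves
        self.threshold = threshold

    def act(self):
        if not self.recent:                  # open with cooperation
            return "C"
        coop = self.recent.count("C") / len(self.recent)
        return "C" if coop >= self.threshold else "D"

    def observe(self, opponent_move):
        self.recent.append(opponent_move)

agent = AdaptiveAgent()
for move in ["C", "C", "D", "D", "D"]:       # opponent turns sneaky
    agent.observe(move)
choice = agent.act()                          # "D": protect the score
```

Because the window is bounded, the model "updates quickly": a few rounds of defection are enough to flip the agent's behavior, matching the example above.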
04 Experiments & Results
The Tests and Why They Matter:
- Logic (Situation Puzzles, 46 items): Measures whether models can solve problems that require interaction. All models scored virtually zero without asking questions, proving interaction is essential here.
- Math (HLE subset, 52 items): Compares interactive checking to repeated full solutions (pass@k) under about the same token budget, to test if interaction boosts efficiency and accuracy.
- Poker (No-Limit Hold'em, 5000 hands, 10 tables): Gauges long-horizon strategy, risk management, and execution reliability.
- Trust Game (Iterated Prisoner's Dilemma): Tests adaptive cooperation/defection under a random-length horizon, with classic rule-based baselines for context.
Competition and Setup:
- Players: Grok-4.1-fast, Gemini-3-flash, GPT-5-mini, Kimi-k2-thinking, DeepSeek-v3.2, Qwen3-max.
- Judges: Fixed to Grok-4.1-fast for proofs. Temperature set to 0 for reproducibility.
- Budgets: 20 turns for interactive proofs; pass@k matched roughly by tokens.
Scoreboard with Context:
- Logic (Situation Puzzles): Gemini-3-flash had the top accuracy (about one in three puzzles solved within 20 turns), with GPT-5-mini next. Kimi-k2-thinking solved its wins in fewer turns than most, suggesting efficient questioning when on track.
- Math (Interactive vs pass@k): Interactive checking beat pass@k across models under matched budgets. Grok-4.1-fast led (about three in four correct), GPT-5-mini close behind; Kimi-k2-thinking lagged on this hard subset. This is like getting an A while the non-interactive method averages closer to a B or C under the same study time.
- Poker: Gemini-3-flash earned the highest average winnings per hand and was the most stable winner; Grok-4.1-fast and GPT-5-mini were profitable but swingier. GPT-5-mini played the most hands (high VPIP) and folded the least, reflecting an aggressive style that can win big or lose big.
- Trust Game: Only Qwen3-max and GPT-5-mini beat both classic baselines (Grim Trigger and Tit-for-Tat). Most other models tied or trailed, showing much room to grow in adaptive cooperation.
Surprising Findings:
- Interaction is not optional for certain problems: without asking questions, logic puzzles were basically unsolvable by all models.
- In math, interactive checking doesnāt just help a little; it can change who looks best under the same token budget, revealing hidden strengths in models that can self-correct.
- Strategic profiles differ: a "loose and brave" style (high VPIP) isn't automatically best in poker; discipline matters.
- High cooperation with low betrayal tends to outperform in the Trust Game, but some strong general models still didnāt beat simple baselines, hinting that dynamic social reasoning remains challenging.
Takeaway: Interactive Benchmarks reveal abilities and weaknesses that static tests blur, especially information acquisition, self-correction, and adaptation to others.
05 Discussion & Limitations
Limitations:
- Coverage: The benchmark spans logic, math, poker, and trust games, but real life includes many other interactions (tools, physical sensors, long projects). Expanding domains will improve generality.
- Judge/Environment Bias: Using a fixed Judge model or a particular game engine may favor certain styles. Cross-judging (multiple judges) and environment diversity can reduce bias.
- Budget Choices: Turn and token limits influence strategies. Different budgets could reshuffle rankings; reporting sensitivity analyses helps.
- Learning vs. Testing: Models evaluated here might be fine-tuned to the protocol in the future; guarding against overfitting the benchmark will be important.
Required Resources:
- Deterministic inference (temperature 0) and logging to track exact histories.
- A standardized Judge for proofs and validated game engines for poker and trust games.
- Token accounting for fair pass@k vs. interactive comparisons.
When NOT to Use:
- One-shot tasks with complete information (e.g., a direct factual lookup) don't benefit much from interactive evaluation.
- Extremely tool- or domain-specific workflows may need their own custom interfaces rather than this general protocol.
- If human preference or style is the main target (e.g., creativity contests), preference arenas are a better fit.
Open Questions:
- Training: What learning methods (e.g., reinforcement learning from interaction traces) best improve interactive reasoning under a budget?
- Judge Design: Can we build multi-judge or provably robust verifiers that reduce bias and leakage while staying inexpensive?
- Planning and Memory: How should models store and retrieve intermediate constraints to speed up convergence across many steps?
- Meta-Reasoning: Can models learn to estimate their own uncertainty to pick the most informative next question more reliably?
- Generalization: How well do strategies learned in one interactive domain transfer to another (e.g., from logic puzzles to code debugging)?
06 Conclusion & Future Work
Three-Sentence Summary: Interactive Benchmarks evaluate AI the way real work happens: by letting models actively acquire information and reason over multiple turns under a budget. The framework unifies two settings, Interactive Proofs (truth-seeking with a Judge) and Interactive Games (long-term reward with other agents), and provides principled objectives, clean feedback, and fair budgets. Experiments across logic, math, poker, and trust games show that interaction reveals real reasoning skill and that current models still have ample room to grow.
Main Achievement: A clear, reusable evaluation paradigm that isolates and measures information-acquisition skill (choosing the right next question or action) rather than only grading one-shot answers.
Future Directions: Broaden domains (tools, web, robotics), develop robust multi-judge schemes, train explicitly for interactive performance, and study how interactive skills transfer between tasks.
Why Remember This: If we want trustworthy, practical AI, we must test whether it can figure out what it doesn't know, ask the right questions, and adapt under pressure. Interactive Benchmarks make that ability visible, comparable, and improvable.
Practical Applications
- Interactive tutoring systems that ask clarifying questions before explaining a concept.
- Code assistants that verify intermediate steps (tests, invariants) before proposing full fixes.
- Research agents that plan web queries, check sources, and stop searching when confidence is high.
- Customer support bots that triage by asking the fewest, most revealing questions.
- Medical pre-diagnosis tools that collect key symptoms under time limits and verify red flags.
- Business analytics copilots that probe data with targeted queries to reduce uncertainty fastest.
- Game-playing AIs (e.g., poker) that adapt to opponent styles while managing long-term risk.
- Negotiation assistants that reason about counterpart behavior and adjust offers over rounds.
- Robotics task planners that gather missing sensor info before committing to a path.
- Safety evaluators that interactively test for model hallucinations and force step-by-step verification.