Tool Verification for Test-Time Reinforcement Learning
Key Summary
- The paper fixes a big flaw in test-time reinforcement learning (TTRL): when many wrong answers agree, the model rewards the mistake and gets stuck.
- They add a checker called a verifier that turns each solution into code, runs it, and boosts the vote of solutions that pass the check.
- This creates verification-weighted voting, which favors answers backed by executable evidence rather than mere popularity.
- On hard math problems (like AIME 2024), the method improves accuracy by up to 31.6% relative to standard TTRL, a big jump.
- The approach is simple: sample several solutions, verify them with a code tool, give verified ones bigger votes, pick the weighted winner, and train on that.
- It works best on tougher problems where mistakes pile up, and with stronger verifiers that write and run better code.
- Moderate weighting works best (e.g., a verified answer counting like about five normal votes); too little or too much hurts performance.
- It reduces randomness between runs and achieves similar or better results with fewer sampled solutions, saving test-time compute.
- Very small verifiers can backfire by producing noisy or non-executable code, which misleads the training signal.
- Overall, the method turns test-time learning into verified online data synthesis, making self-improvement more stable and trustworthy.
Why This Research Matters
When AI learns during use, it can accidentally reward its own popular mistakes and get stuck in the wrong answers. This work adds a simple, powerful fix: verify solutions with executable tools and give those verified solutions more say. That makes the training signal more trustworthy without needing labeled answers, which is crucial in the real world where labels are scarce. It improves accuracy most where we need it most—on hard problems that involve long chains of reasoning. It can also cut compute costs by needing fewer sampled attempts to get a reliable training signal. Overall, it makes self-improving AI safer, steadier, and more dependable.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class is choosing the right answer to a math problem by voting, but most classmates made the same mistake. If you just follow the majority, you’ll choose the wrong answer with a lot of confidence.
🥬 The Concept (Majority Voting): Majority voting is a way to pick the answer most people chose.
- What it is: A decision rule where the most frequent answer wins.
- How it works: 1) Collect everyone’s answers, 2) Count how many times each answer appears, 3) Pick the most common one.
- Why it matters: If lots of people share the same mistake, the vote picks a wrong answer. Then any system that trusts this vote can learn the wrong thing. 🍞 Anchor: In a class poll—A (2 votes), B (7 votes), C (1 vote)—the class picks B. If B is actually wrong but popular, the class gets it wrong together.
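The counting rule can be sketched in a few lines of Python (a generic illustration of majority voting, not code from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

# A popular-but-wrong answer wins: "B" is chosen even if "C" is correct.
poll = ["A", "A", "B", "B", "B", "B", "B", "B", "B", "C"]
print(majority_vote(poll))  # -> B
```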
🍞 Hook: You know how a basketball player can get better during a game by noticing what’s working and adjusting their shots right away?
🥬 The Concept (Test-Time Reinforcement Learning, TTRL): TTRL is when a model keeps learning and adjusting while it’s answering new questions, even without seeing the correct answers.
- What it is: A way for models to self-improve on-the-fly using rewards made from their own agreement (self-consistency) instead of teacher-provided labels.
- How it works: 1) The model generates several solution attempts (called rollouts), 2) It picks a “winner” by majority vote and treats that as a pseudo-label, 3) It gives reward to attempts matching the “winner,” 4) It updates itself to make future answers more like the rewarded ones.
- Why it matters: This lets models improve in new situations without needing hand-labeled data, but it can go wrong if the majority is confidently mistaken. 🍞 Anchor: The model solves one problem 32 times, 18 say “42,” 14 say “40.” The model treats “42” as correct (pseudo-label) and rewards those 18 paths, even if “42” is wrong.
🍞 Hook: Picture trying out different ways to solve a puzzle—each try is like a path you walk. Some paths are great; some lead to dead ends.
🥬 The Concept (Rollout): A rollout is one complete attempt the model makes to solve a problem, including its step-by-step reasoning and final answer.
- What it is: A single sampled reasoning path from start to finish.
- How it works: 1) Start from the question (state), 2) Generate steps (actions), 3) Produce a final answer, 4) Use it later for voting and reward.
- Why it matters: Rollouts are the raw material. If rollouts are biased or sloppy, the votes and rewards become unreliable. 🍞 Anchor: For “What is 12×13?” one rollout multiplies incorrectly (gets 132), another correctly (gets 156). Each is a rollout with a final answer.
🍞 Hook: Suppose your class doesn’t know the correct answer, so you guess the “truth” based on what most students say.
🥬 The Concept (Pseudo-label): A pseudo-label is a guessed correct answer made from the model’s own outputs when the real answer is unknown.
- What it is: A stand-in label built from model self-consensus.
- How it works: 1) Gather rollouts, 2) Count final answers, 3) Pick the most frequent as the pseudo-label, 4) Train using it as if it were ground truth.
- Why it matters: Pseudo-labels aim to replace missing labels, but if the guess is wrong, the model is trained in the wrong direction. 🍞 Anchor: If most rollouts say “132” for 12×13, the pseudo-label becomes 132—even though 156 is correct.
🍞 Hook: Think of a rumor lots of people repeat. Even if it’s false, it can sound true because it’s popular.
🥬 The Concept (False-Popular Mode Collapse): This is when the model’s most popular wrong answer keeps winning, gets rewarded, and eventually crowds out the correct answer.
- What it is: A feedback loop where wrong-but-frequent answers dominate learning.
- How it works: 1) A wrong answer is sampled often, 2) Majority voting picks it, 3) Rewards reinforce it, 4) The model becomes even more likely to pick it next time, 5) It collapses onto the wrong mode.
- Why it matters: Once stuck, the model’s learning gets steered toward errors, and self-correction becomes much harder. 🍞 Anchor: If “B” is wrong but keeps getting 60% of votes, the model keeps boosting “B,” and soon nearly all rollouts say “B,” even though “C” was the truth.
🍞 Hook: When you’re unsure about your math, you grab a calculator to double-check so you don’t fool yourself.
🥬 The Concept (Test-Time Verification, TTV): TTV is adding a checker during answering time that tests whether each reasoning path makes sense using external evidence (like code execution).
- What it is: A mechanism that evaluates solution quality at inference time using outside tools.
- How it works: 1) Read the model’s solution, 2) Turn it into a testable form (like Python code), 3) Run the code or checks, 4) Decide if the solution passes verification, 5) Prefer verified answers.
- Why it matters: It breaks the “echo chamber” of self-agreement by asking the world (a tool) for evidence. 🍞 Anchor: If the solution claims 12×13=132, the checker runs code that computes 12*13 and sees it is 156, so it marks the solution as unverified.
The world before: TTRL made it possible to improve models without labeled answers by trusting majority votes across many rollouts. It worked surprisingly well on easier tasks but struggled on harder reasoning problems where small mistakes add up.
The problem: Popular-but-wrong answers could hijack the rewards and steer learning the wrong way (false-popular mode collapse).
Failed attempts: Pure self-consistency (majority voting) and more sampling alone couldn’t reliably fix the issue because they just made the popular wrong answer even more popular.
The gap: The system needed an external reality check during test time.
The paper’s solution: Test-time verification using a verifier that turns solutions into code and checks them with a tool, then gives verified solutions extra voting power.
The stakes: In real life—tutoring, coding, science help—trusting a loud wrong answer is risky. Adding verification makes models more dependable, especially when tasks are tough and getting things right matters.
02 Core Idea
🍞 Hook: You know how in a science fair, judges don’t just listen to how confident you sound—they ask you to run your experiment to prove it works.
🥬 The Concept (The “Aha!” Moment): Shift learning signals from what is most frequent to what is externally verified.
- What it is: Use a tool-verified checker to boost the voting power of solutions backed by executable evidence, not just popularity.
- How it works: 1) Generate several solutions, 2) A verifier turns each into Python, 3) A tool runs the code, 4) Mark rollouts that match the tool’s result as verified, 5) Give verified answers bigger votes, 6) Use the weighted winner to give rewards and update the model.
- Why it matters: This cuts off the feedback loop that rewards false-popular answers and stabilizes self-improvement on unlabeled data. 🍞 Anchor: If 3 rollouts say 132 and 2 say 156, but only the 156 answers pass code checks, a weighted vote can still pick 156 as the training signal.
Three analogies for the same idea:
- Courtroom analogy: Don’t trust the loudest lawyer; trust evidence. The verifier is the evidence collector; the tool is the lab test; the jury gives extra weight to evidence-backed claims.
- Group project analogy: If two teammates tested their design in a simulator and three just guessed, the teacher counts the tested designs more.
- Sports scouting analogy: Don’t rank players only by hype; check their game stats. Verified stats count more than opinions.
Before vs. After:
- Before: Majority voting treats all rollouts equally. Popular mistakes gain power, so training drifts toward them.
- After: Verification-weighted voting gives extra weight to the solutions proven by execution. This redirects training toward grounded correctness.
Why it works (intuition, no heavy math):
- Self-consistency alone can be biased; external signals (like code execution) inject groundable evidence.
- Weighting verified rollouts makes it likelier that the chosen label matches reality, especially on problems with long reasoning chains where small arithmetic slips build up.
- Moderate weights are best: enough to overcome a false-popular answer, but not so large that a single verified rollout drowns out diverse good ideas.
Building blocks (introduced with the Sandwich pattern):
🍞 Hook: Think of a head judge who reads your work, writes a test plan, runs the test, and decides if your claim holds up.
🥬 The Concept (External Verifier): An LLM that extracts the final answer, rewrites the reasoning into code, runs the tool, and decides if the rollout is valid.
- What it is: A smart checker that both plans and evaluates tests.
- How it works: 1) Parse the rollout’s answer, 2) Generate Python to recompute from the problem (not just copy the trace), 3) Execute the code, 4) Compare tool result with the rollout’s answer, 5) Output verified/not-verified.
- Why it matters: It turns fuzzy text into concrete evidence. 🍞 Anchor: For a fraction problem, the verifier writes Python using fractions.Fraction to recompute exactly, then checks that the rollout’s simplified fraction matches the computed one.
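A minimal sketch of the kind of exact check the verifier might emit, assuming a hypothetical fraction problem (“compute 1/3 + 1/6”) and a rollout that claims a simplified answer:

```python
from fractions import Fraction

def verify_fraction_claim(claimed: str) -> bool:
    """Recompute 1/3 + 1/6 exactly and compare with the rollout's claimed answer."""
    recomputed = Fraction(1, 3) + Fraction(1, 6)  # exact arithmetic, no float error
    return Fraction(claimed) == recomputed

print(verify_fraction_claim("1/2"))  # -> True
print(verify_fraction_claim("2/9"))  # -> False (the "add tops and bottoms" slip)
```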
🍞 Hook: When you’re unsure of your arithmetic, you open a calculator app to be sure.
🥬 The Concept (Verification Tool): A code interpreter that executes verifier-generated Python to produce a deterministic check.
- What it is: An external execution engine that provides ground truth for computable steps.
- How it works: 1) Receive code, 2) Run it safely (sandbox), 3) Return the numeric result, 4) Enable the verifier to judge correctness.
- Why it matters: It reduces guessing and catches arithmetic slips reliably. 🍞 Anchor: Code runs 12*13 and returns 156. Any rollout claiming 132 gets flagged as unverified.
🍞 Hook: In student council, experienced members’ votes sometimes count extra on tricky issues because they’ve proven reliable.
🥬 The Concept (Verification-Weighted Voting): A voting rule that gives verified rollouts more vote mass than unverified ones.
- What it is: Weighted majority where verified solutions count more.
- How it works: 1) Assign weight 1 to unverified answers, 2) Assign weight ω > 1 to verified answers, 3) Sum weights per answer, 4) Pick the answer with the largest total weight.
- Why it matters: It can flip the outcome when the (wrong) majority is unverified but a (smaller) verified minority is correct. 🍞 Anchor: If unverified 132 has 3 votes (weight 1 each = 3) and verified 156 has 2 votes (weight ω = 3 each = 6), the winner is 156.
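The weighted rule can be sketched as follows (a generic illustration; `omega` plays the role of the verification weight ω, set to 3 here only to match the running example):

```python
from collections import defaultdict

def weighted_vote(rollouts, omega=3.0):
    """rollouts: list of (answer, verified) pairs; verified votes count omega times."""
    totals = defaultdict(float)
    for answer, verified in rollouts:
        totals[answer] += omega if verified else 1.0
    return max(totals, key=totals.get)

# Three unverified 132s (weight 1 each) vs. two verified 156s (weight 3 each).
rollouts = [(132, False), (132, False), (132, False), (156, True), (156, True)]
print(weighted_vote(rollouts))  # -> 156  (totals: 132 -> 3.0, 156 -> 6.0)
```

Note that with `omega=1.0` this reduces to plain majority voting and the popular wrong answer 132 would win.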
The key takeaway: Trust evidence over echo. By preferring verified rollouts, the model learns from what’s grounded, not just what’s popular.
03 Methodology
High-level pipeline: Input (unlabeled problem) → [Sample multiple rollouts] → [Verifier builds and runs code with the tool] → [Give verified rollouts extra weight] → [Pick weighted winner as pseudo-label] → [Give rewards and update policy].
We now introduce each step “like a recipe,” and include simple math only when needed.
Step 1: Sample rollouts
- What happens: The model generates N different solution attempts (rollouts) to the same question.
- Why this step exists: Multiple tries give us options; diversity helps us spot both popular patterns and minority-but-correct attempts.
- Example: For a math word problem, we sample N=5 rollouts with final answers: {132, 132, 132, 156, 156}.
🍞 Hook: Imagine trying five recipes for cookies and picking the best.
🥬 The Concept (Rollout): A rollout is one full attempt, including reasoning and final answer.
- What it is: A single sampled solution path.
- How it works: The policy “thinks out loud,” then ends with a final answer.
- Why it matters: Voting happens over these final answers, and rewards are assigned per rollout. 🍞 Anchor: One rollout says “12×13=132,” another says “12×13=156.”
Step 2: Extract answers and generate verification code (Verifier)
- What happens: For each rollout, the verifier reads the final answer and writes Python that recomputes it from the problem, not by copying the rollout.
- Why this step exists: Independent recomputation avoids rubber-stamping wrong reasoning.
- Example: If the rollout claims 132, the verifier writes code to compute 12*13 and prints the result.
🍞 Hook: Think of a lab assistant who writes a test for each claim.
🥬 The Concept (External Verifier): An LLM that produces tests and judges results.
- What it is: The planner and referee of verification.
- How it works: Extract answer → Write code → Run tool → Compare → Decide verified/not.
- Why it matters: Without it, we can’t convert text into executable checks. 🍞 Anchor: The verifier writes code using sympy to solve an equation, then compares with the rollout’s answer.
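The “recompute from the problem, not the trace” idea can be sketched with a toy stand-in. In the paper the check code is written by an LLM; here `generate_check_code` is a hardcoded hypothetical that covers only the running-example problem:

```python
def generate_check_code(problem: str) -> str:
    """Toy stand-in for the LLM verifier: emit Python that recomputes the answer
    from the problem statement, NOT by re-running the rollout's steps."""
    assert problem == "What is 12*13?"
    return "result = 12 * 13"

def run_check(code: str, claimed: int) -> bool:
    """Execute the generated check code (stand-in for the sandboxed tool)
    and compare its result with the rollout's claimed answer."""
    namespace = {}
    exec(code, namespace)
    return namespace["result"] == claimed

code = generate_check_code("What is 12*13?")
print(run_check(code, 156))  # -> True  (verified)
print(run_check(code, 132))  # -> False (unverified)
```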
Step 3: Execute the code (Verification Tool)
- What happens: The code interpreter runs the Python safely and returns a number or result.
- Why this step exists: It provides deterministic evidence for claims that are computable.
- Example: Running 12*13 returns 156.
🍞 Hook: Using a calculator to settle a debate.
🥬 The Concept (Verification Tool): The execution engine for tests.
- What it is: A sandboxed code runner.
- How it works: Receive Python → Execute → Return value → Feed back to verifier.
- Why it matters: It greatly reduces uncertainty in checking arithmetic/logical steps. 🍞 Anchor: The tool executes code that expands (x+1)^2 and confirms simplification steps.
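One minimal way to isolate verifier-generated code is a subprocess with a timeout (a sketch only; a production sandbox would also restrict imports, resources, and network access):

```python
import subprocess
import sys
from typing import Optional

def execute_in_subprocess(code: str, timeout: float = 5.0) -> Optional[str]:
    """Run verifier-generated Python in a separate process.
    Return its stdout on success, or None on crash/timeout."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None

print(execute_in_subprocess("print(12 * 13)"))  # -> 156
print(execute_in_subprocess("print(1/0)"))      # -> None (crashing checks count as unverified)
```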
Step 4: Decide verified or not
- What happens: The verifier compares the tool’s output with the rollout’s extracted final answer.
- Why this step exists: To assign a validity flag that will change the vote weight.
- Example: If the rollout said 132 but the tool shows 156, then mark unverified.
Verification decision rule (simple math): v_i = 1 if the tool’s output equals the rollout’s answer y_i, and v_i = 0 otherwise. Example: If the tool returns 156 and the rollout’s answer is 156, then v_i = 1; if the rollout’s answer is 132, then v_i = 0.
Step 5: Give weights to votes
- What happens: Each rollout contributes a vote weight. Unverified rollouts get 1. Verified rollouts get ω (omega), a number bigger than 1.
- Why this step exists: To trust evidence-backed answers more than unverified popularity.
- Example (formula): w_i = ω if v_i = 1, else w_i = 1. If ω = 3 and v_i = 1, then w_i = 3; if v_i = 0, then w_i = 1.
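The decision rule (Step 4) and weighting (Step 5) together are just two small functions (a generic illustration; ω = 3 is only an example value):

```python
def verification_flag(tool_output, rollout_answer) -> int:
    """v_i = 1 if the tool's recomputation matches the rollout's answer, else 0."""
    return 1 if tool_output == rollout_answer else 0

def vote_weight(v: int, omega: float = 3.0) -> float:
    """w_i = omega if verified, else 1."""
    return omega if v == 1 else 1.0

v = verification_flag(156, 156)  # tool says 156, rollout says 156 -> v = 1
print(vote_weight(v))            # -> 3.0
v = verification_flag(156, 132)  # rollout says 132 -> v = 0
print(vote_weight(v))            # -> 1.0
```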
🍞 Hook: In a debate, experts get more sway when they’ve tested their claims.
🥬 The Concept (Verification Weight): The extra voting power for verified rollouts.
- What it is: A scalar that boosts verified votes.
- How it works: If verified, use weight ω; else, use 1.
- Why it matters: It can flip outcomes in favor of correctness. 🍞 Anchor: With counts [132: 3 votes (unverified), 156: 2 votes (verified)] and ω = 3, totals are 132→3, 156→6, so 156 wins.
Step 6: Pick the weighted winner (verification-aware consensus)
- What happens: Sum weights per answer and pick the answer with the highest total.
- Why this step exists: To form a better pseudo-label grounded in evidence.
- Example (formula): ŷ = argmax_a Σ_{i: y_i=a} w_i (sum the weights of all rollouts that answered a, then take the answer with the largest total). If 132 has total weight 3 and 156 has total weight 6, choose 156.
🍞 Hook: In a science fair, the most evidence-backed project wins, not the loudest pitch.
🥬 The Concept (Verification-Weighted Voting): The rule for forming the new training signal.
- What it is: Weighted majority using verification weights.
- How it works: Compute totals per answer and choose the maximum.
- Why it matters: It reduces false-popular mistakes. 🍞 Anchor: Even if 132 appears more often, 156 can still win because its votes are verified and heavier.
Step 7: Reward rollouts and update the model
- What happens: Any rollout whose answer matches the weighted winner gets reward 1, else 0. Then we update the policy.
- Why this step exists: Rewards steer learning toward verified-correct behavior.
- Example (formula): r_i = 1 if y_i = ŷ, else r_i = 0. If ŷ = 156, all rollouts with 156 get 1, the rest get 0.
Optional compact objective (intuition only): maximize the average reward J(θ) = E[r_i] over sampled rollouts. Example: If among 4 rollouts, 3 match ŷ and 1 doesn’t, the average reward is 3/4 = 0.75; updating aims to increase this.
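The reward rule can be sketched as a one-liner over the rollouts’ final answers (a generic illustration of Step 7):

```python
def rewards(answers, pseudo_label):
    """r_i = 1 if rollout i matches the weighted winner, else 0."""
    return [1 if a == pseudo_label else 0 for a in answers]

r = rewards([156, 156, 156, 132], pseudo_label=156)
print(r)                # -> [1, 1, 1, 0]
print(sum(r) / len(r))  # -> 0.75  (the average reward the policy update tries to raise)
```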
The secret sauce:
- Independent recomputation: The verifier recomputes from the original problem, not just rephrasing the rollout’s steps.
- Executable evidence: The tool supplies a clear, checkable number when possible.
- Soft, not hard, trust: A moderate ω (like ω = 5) works best—big enough to overcome false popularity, small enough to keep useful diversity.
- Compute efficiency: Verified rollouts carry more information, so you can often sample fewer of them to reach the same or better accuracy.
04 Experiments & Results
🍞 Hook: When you test a new study method, you don’t just say it “feels better”—you check your grades across different subjects and compare to your old method.
🥬 The Concept (Pass@1): Pass@1 measures how often the first chosen answer is correct.
- What it is: The percentage of problems solved correctly on the first try.
- How it works: 1) For each problem, look at the model’s top answer, 2) Check if it’s correct, 3) Average across all problems.
- Why it matters: It’s a clear, simple scoreboard. 🍞 Anchor: If a model gets 7 out of 10 problems right on the first attempt, Pass@1 is 70%.
🍞 Hook: Suppose you improved your study routine and want to see if your scores went up, and by how much compared to before.
🥬 The Concept (Average Relative Accuracy Gain): This tells you how much better the new method is compared to the baseline.
- What it is: Improvement measured relative to the old score, averaged over tests.
- How it works: 1) Compute improvement per benchmark/model, 2) Express as a percent relative to the baseline, 3) Average across settings.
- Why it matters: It normalizes gains across tasks of different difficulty. 🍞 Anchor: If baseline is 50% and new is 60%, the relative gain is (60−50)/50 = 20%.
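The metric is one line of arithmetic, shown here with the numbers from the anchor example and from the AIME 2024 result reported below:

```python
def relative_gain(baseline: float, new: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (new - baseline) / baseline

print(round(relative_gain(50.0, 60.0), 3))   # -> 0.2   (a 20% relative gain)
print(round(relative_gain(15.8, 20.8), 3))   # -> 0.316 (the 31.6% AIME 2024 jump)
```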
The test: The authors evaluated on three math benchmarks of increasing difficulty: MATH-500 (easier), AMC (medium), and AIME 2024 (hardest). They measured Pass@1 and compared three setups: Baseline model, standard TTRL (majority-vote rewards), and Tᵀᴿᴸ (tool-verified weighted voting). They tested multiple backbones: vanilla, math-specialized, and instruction-tuned models.
The competition: Tᵀᴿᴸ was compared mainly with TTRL, the standard self-consistency training without verification.
The scoreboard (with context):
- On AIME 2024 (hard): Up to a 31.6% relative improvement over TTRL for Qwen-Math-1.5B (15.8% → 20.8%). That’s like moving from a C to a solid B on a very hard exam.
- Across all models: Average gains around 11% relative vs. TTRL, with bigger boosts on harder tasks.
- Trend: The tougher the benchmark or difficulty level, the larger the gains. This matches intuition: more steps mean more chances for small arithmetic errors; executable checks help catch them.
- Model types: Gains appear across vanilla, math-specialized, and instruction-tuned models, suggesting the method is broadly useful, not tied to a single pretraining style.
Surprising findings and useful details:
- Verification alone helps: Even an LLM-only verifier (without running code) gives some gains—showing that structured checking is valuable.
- Tool execution adds more: Running the verifier’s code in a sandbox provides a clear extra lift beyond LLM-only checking.
- Weight tuning matters: A moderate ω works best (e.g., around ω = 5). Too small fails to overcome false popularity; too large is brittle and reduces diversity.
- Compute efficiency: With verification, you often need fewer rollouts to beat TTRL@64. That means better results for less test-time compute.
- Stronger verifiers help: Bigger verifier models write better code and judge more reliably, improving results further.
- Failure case: Very small verifiers (e.g., ~0.5B) can harm performance by producing hardcoded outputs or non-executable code, adding noise to the reward signal.
Bottom line: The method doesn’t just add a gadget; it changes the training signal to trust evidence. That shift yields consistent, sometimes large, gains—especially where they matter most: on the hardest problems.
05 Discussion & Limitations
Limitations:
- Small or weak verifiers can inject noise: If the verifier struggles to follow instructions or generate valid, executable Python, verification becomes unreliable and can misweight votes.
- Limited benefit on easy tasks: When rollouts are already consistent and accurate, verification adds overhead without changing many outcomes.
- Overweighting risk: If ω is too large, learning can collapse onto a small set of verified rollouts, reducing useful diversity and making training brittle to occasional verifier/tool mistakes.
- Partial coverage: Not all reasoning is easily or safely executable. Some problems require logic or world knowledge beyond what code can check.
Required resources:
- A capable verifier LLM that can extract answers, generate concise correct code, and compare results reliably.
- A secure sandboxed code interpreter (e.g., Python with restricted libraries) and infrastructure to run it at test time.
- Compute budget for sampling multiple rollouts and running verifications, though verification can reduce the needed rollout count.
When not to use:
- Very simple tasks where majority voting already matches ground truth with high confidence.
- Settings where code execution is infeasible, unsafe, or disallowed (e.g., strict sandbox restrictions, privacy/security constraints), and no alternative verifiers (symbolic checks, formal solvers) are available.
- Low-resource deployments where adding a verifier and tool runtime would exceed latency or cost budgets.
Open questions:
- Beyond math: How can we design verification tools for domains without obvious executables (e.g., commonsense reasoning, long-form writing)?
- Better verifiers: What architectures and prompts maximize reliable, concise code generation and strict instruction following?
- Adaptive weighting: Can ω be learned per problem or per verifier confidence instead of fixed, to balance reliability and diversity dynamically?
- Multi-signal verification: How well do ensembles of tools (symbolic solvers, unit tests, reward models) combine for even stronger evidence?
- Safety and robustness: How to harden the pipeline against adversarial inputs or tool misuse while preserving performance gains?
06 Conclusion & Future Work
Three-sentence summary:
- The paper introduces Tᵀᴿᴸ, which adds test-time verification to test-time reinforcement learning so that rewards favor evidence-backed answers, not just popular ones.
- It uses a verifier to rewrite solutions as Python, executes them in a code tool, and gives verified rollouts extra voting weight when forming the training signal.
- This reduces false-popular mode collapse, stabilizes self-improvement, and delivers consistent (often large) gains, especially on hard math benchmarks.
Main achievement:
- Turning self-consistency into evidence-weighted self-improvement: a simple, general recipe that upgrades pseudo-label quality by grounding votes in executable checks.
Future directions:
- Build stronger, domain-aware verifiers; add non-code verification (symbolic solvers, unit tests, formal proofs); learn adaptive weights from verifier confidence; and extend beyond math to more open-ended reasoning.
Why remember this:
- It’s a clean, practical idea: trust evidence over echo. With a small architectural change—verification-weighted voting—you convert noisy self-agreement into verified online data synthesis, making on-the-fly learning more reliable when it matters most.
Practical Applications
- Math tutoring systems that double-check computations and favor verified solutions during on-the-fly learning.
- Code generation assistants that run unit-like checks and weight verified code paths higher when refining themselves.
- Data analysis bots that use spreadsheets or Python to verify calculated results before updating their decision strategies.
- Scientific helpers that simulate simple formulas or models to validate claims before learning from them.
- Financial calculators that recompute key metrics (e.g., interest, ROI) with code before reinforcing patterns.
- Operations assistants that verify scheduling or routing computations using solvers and weight those results more in learning.
- Educational quiz apps that self-improve using verified reasoning instead of raw popularity of student-like answers.
- Customer support bots that verify policy rules or price calculations with executable checks before adapting their responses.
- Benchmarking pipelines that use verification-weighted voting to produce cleaner pseudo-labels for continual training.
- Agent systems that isolate tool use to a verifier for reward shaping, keeping the main policy simpler and more stable.