$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners | How I Study AI


Intermediate
Harman Singh, Xiuyu Li, Kusha Sareen et al. · 3/4/2026
arXiv

Key Summary

  • The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
  • They introduce V1, a two-part framework: V_Infer (an inference-time tournament that compares answers in pairs) and V_PairRL (a training method that teaches one model to both solve and judge using pairwise feedback).
  • Pointwise judging (scoring each answer alone) suffers from calibration collapse, meaning scores aren’t on a consistent scale across problems, so the model over-trusts bad answers.
  • Aggregation methods (like RSA) can merge many answers into one, but often cause diversity collapse, throwing away rare correct answers while refining.
  • V_Infer uses an uncertainty-guided Swiss-style tournament that spends most comparisons on the most confusing pairs, boosting accuracy with fewer judge calls.
  • Across code and math benchmarks, V_Infer improves Pass@1 by up to 10% over pointwise judging and is more compute-efficient than popular aggregation methods.
  • V_PairRL co-trains generation and pairwise verification so the verifier learns from the model’s current outputs, yielding 7–9% better test-time scaling vs. standard RL and up to +8.7% base Pass@1 in code.
  • Pairwise self-verification shines most on hard problems and even helps real-world software bug fixing (SWE-bench Lite), outperforming pointwise verification.
  • Combining pairwise verification with aggregation (like RSA) gives even faster, better convergence because pairwise scores act as a reliable “fitness signal.”
  • This work suggests a simple rule: when in doubt, compare two answers directly—then scale that with smart tournaments and train the model to do both solving and judging.

Why This Research Matters

Picking the right answer from many candidates is crucial for real-world reliability—especially in coding, where a single wrong edge case can crash an app. Pairwise self-verification lets models judge more like people do: by comparing two options directly, which is simpler and more accurate than absolute scoring. The result is better performance with smarter use of compute, making everyday tools faster and more dependable. It also helps in areas where there’s no easy ground-truth check at inference time, like open-ended bug fixes or long-form reasoning. By training one model to both solve and judge, we reduce overhead and keep judging calibrated to the model’s current skills. This improves not only final answers but also the model’s underlying reasoning over time.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how during a school quiz, it’s hard to grade your own answer all by itself, but if you look at two classmates’ answers side by side, it’s easier to say which one seems better?

🥬 Filling (The Actual Concept):

  • What it is: Test-time scaling is giving a model extra “thinking time” by letting it try many ideas in parallel and then choosing the best.
  • How it works: (1) The model writes several different solution attempts; (2) something picks the best; (3) we return that.
  • Why it matters: If we can’t pick the best attempt reliably, generating many attempts just wastes compute.

🍞 Bottom Bread (Anchor): Imagine solving 16 math problems in different ways and then selecting the right one; this only helps if your “chooser” is good at spotting the right solution.

🍞 Top Bread (Hook): Imagine a team brainstorming 10 different ways to build a bridge. More ideas are great, but someone has to judge which design actually holds.

🥬 Filling (The Actual Concept):

  • What it is: Parallel reasoning is when an AI tries many different solution paths at once, instead of just one long chain of thought.
  • How it works: The model samples multiple answers independently, exploring varied reasoning steps.
  • Why it matters: Diversity raises the chance that at least one answer is correct (this is like Pass@N), but we still must pick it.

🍞 Bottom Bread (Anchor): It’s like trying 16 different puzzle strategies—at least one might crack it—but only if the judge notices the winning one.
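The Pass@N intuition above can be made concrete with a tiny probability sketch. It uses the simplistic assumption that samples are independent and each is correct with probability p; real samples from one model are correlated, so this is an upper-bound intuition, not the paper's measurement.

```python
def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p) ** n

# Even a modest per-sample accuracy compounds quickly:
for n in (1, 4, 16):
    print(n, round(pass_at_n(0.3, n), 3))  # 0.3, 0.76, 0.997
```

The catch, as the section says, is that this ceiling only pays off if the chooser can find the correct sample.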

🍞 Top Bread (Hook): You know how in math class the final answer is either right or wrong? Counting votes can work there: if most kids say “42,” that’s promising.

🥬 Filling (The Actual Concept):

  • What it is: Majority voting chooses the most common answer across many tries.
  • How it works: Tally answers, pick the one with the most votes.
  • Why it matters: It works only when answers can be checked objectively (like exact numbers), not for open-ended tasks like code patches with tricky edge cases.

🍞 Bottom Bread (Anchor): Majority voting helps for 2+2, but not for “Which code fix actually removes the bug?”
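A minimal sketch of the tally-and-pick step, which only makes sense when answers are exact, comparable values:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common exact answer among the candidates."""
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "42", "7"]))  # → 42
```

For open-ended outputs like code patches, no two candidates are byte-identical, so this tally degenerates and a real judge is needed.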

🍞 Top Bread (Hook): Grading your own essay without any comparison is tough—you might give yourself a 10/10 even if you missed something.

🥬 Filling (The Actual Concept):

  • What it is: Self-verification is when the model judges its own solutions.
  • How it works: After generating answers, the model scores them.
  • Why it matters: Without a good judge, the best answer might be ignored.

🍞 Bottom Bread (Anchor): A coder model writes 16 functions—self-verification decides which function to submit.

🍞 Top Bread (Hook): If you rate each drawing from 1 to 10 on different days, your “10” today might not match your “10” tomorrow.

🥬 Filling (The Actual Concept):

  • What it is: Pointwise verification scores each candidate alone, on a 1–10-like scale.
  • How it works: The model reads one solution and assigns a score; higher is better.
  • Why it matters: Scores aren’t calibrated across problems, so an 8 in one context doesn’t align with an 8 in another—leading to bad picks.

🍞 Bottom Bread (Anchor): The model might give a buggy code solution 10/10 and miss the truly correct one because it can’t compare them directly.
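For contrast, pointwise selection can be sketched as below. `judge_alone` is a hypothetical solo scorer (not the paper's interface), and the toy version illustrates the saturation failure mode where many candidates tie at the top score:

```python
def pick_pointwise(problem, candidates, judge_alone):
    """Score each candidate alone; return the highest-scored one."""
    scores = [judge_alone(problem, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Toy scorer that saturates: anything plausible gets 10/10, so ties
# hide real differences (the calibration problem described above).
toy = lambda p, c: 10 if "return" in c else 5
print(pick_pointwise("sum", ["return a+b", "return a-b", "pass"], toy))
```

With both real candidates scored 10/10, the tie is broken arbitrarily by position, which is exactly the discrimination loss the paper reports for pointwise judging.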

🍞 Top Bread (Hook): Imagine trying to keep every creative idea while combining them into one mega-idea—sometimes you accidentally throw out the best part.

🥬 Filling (The Actual Concept):

  • What it is: Self-aggregation merges several solutions into a new, single solution (e.g., RSA).
  • How it works: The model iteratively refines by combining pieces from different answers.
  • Why it matters: It can cause diversity collapse—rare but correct answers can get lost during merging.

🍞 Bottom Bread (Anchor): Like blending all paints into brown—useful structure disappears, including that one perfect bright color you needed.

🍞 Top Bread (Hook): Choosing the better of two sneakers is usually easier than guessing each sneaker’s “absolute quality score” from 1–10.

🥬 Filling (The Actual Concept):

  • What it is: Pairwise self-verification compares two solutions head-to-head and decides which is better (or declares a tie), often with a stated confidence.
  • How it works: The model reads both, gives each a rating, and the bigger rating “wins.” Repeating this smartly can reveal the true best solution.
  • Why it matters: Comparisons are better calibrated than isolated scores, so the model makes fewer judging mistakes.

🍞 Bottom Bread (Anchor): When the model pitted two code fixes against each other, it noticed which one truly passed hidden edge cases, even when pointwise scores were misleading.
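A single head-to-head step might look like the sketch below; `judge` is a hypothetical callable standing in for the model's side-by-side rating prompt (the paper asks for 1–10 ratings for both candidates, but this interface is an illustrative assumption):

```python
def pick_better(problem, a, b, judge):
    """Return the preferred candidate, or None on a tie."""
    r_a, r_b = judge(problem, a, b)   # side-by-side ratings in [1, 10]
    if r_a == r_b:
        return None                   # tie: no preference this round
    return a if r_a > r_b else b

# Toy judge purely for illustration: prefers the longer candidate.
toy_judge = lambda p, a, b: (min(10, len(a)), min(10, len(b)))
print(pick_better("fix bug", "patch_v1", "longer_patch_v2", toy_judge))
```

The point is that both candidates share one context, so the ratings are on the same scale by construction, sidestepping cross-problem calibration drift.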

The world before this paper: models generated multiple answers, then either (a) judged them one-by-one (pointwise) and often picked wrong due to calibration issues, or (b) merged them (aggregation), risking diversity collapse. The problem: how to choose the best candidate reliably without external test oracles. Failed attempts: pointwise judging over-scored wrong answers; recursive aggregation sometimes reduced Pass@N. The missing piece: use pairwise judging plus an efficient tournament to spend judge-compute where it matters, and also train the same model to be both a great solver and a strong pairwise judge. Real stakes: picking the right code patch, math solution, or policy draft matters for correctness, safety, and speed in everyday tools.

02Core Idea

🍞 Top Bread (Hook): Think of a science fair where judges compare projects in pairs—this is way easier than scoring every project “out of 10” in isolation.

🥬 Filling (The Actual Concept):

  • What it is: The key insight is that models are much better at deciding “Which of these two is better?” than at giving perfect solo scores, so build everything around pairwise comparisons.
  • How it works: (1) Generate many candidate answers in parallel; (2) run a smart, pairwise tournament (V_Infer) that compares the most uncertain pairs more often; (3) train the same model to both solve and compare using RL (V_PairRL) so the judge learns from the model’s current outputs.
  • Why it matters: This turns scattered attempts into reliable wins, boosting Pass@1 without relying on external tests.

🍞 Bottom Bread (Anchor): On coding benchmarks, this pairwise-first approach improved top-1 accuracy by up to 10% versus classic solo scoring.

Three analogies for the same idea:

  • Library game: Give two book summaries to a friend; it’s easier for them to say which is clearer than to grade each out of 10 perfectly.
  • Sports ladder: Teams that play near-equals give you the most information about who’s stronger (Swiss tournament), so you schedule more of those matchups.
  • Taste test: Comparing two cookies side-by-side is simpler and more accurate than guessing absolute sweetness scores.

Before vs. After:

  • Before: Models made many answers but struggled to pick the right one; pointwise scores were unstable; aggregation sometimes erased the one correct outlier.
  • After: Use pairwise judging with an uncertainty-guided tournament to surface the best answer, and train the model to excel at both solving and judging in a co-evolving loop.

Why it works (intuition, not equations):

  • Pairwise logic: Humans and models alike are more consistent at relative choices than absolute ratings. Pairwise preferences can be aggregated into a stable global ranking.
  • Information focus: Comparing near-ties teaches you the most about who’s actually better; Swiss-style pairing concentrates judge-compute on those knife-edge cases.
  • Weighted wins: Not all wins are equal—big margin wins should count more, so we weight comparisons by confidence (rating gap), stabilizing the final ranking.
  • Co-evolution: As the generator learns to produce better answers, the verifier must learn to distinguish finer differences. Training both together keeps the judge “in tune” with the solver’s current strengths and mistakes.

Building blocks (each with the sandwich pattern):

  1. 🍞 Hook: Picking the champion is easier if contenders actually face off. 🥬 Concept: Pairwise self-verification means comparing two candidate answers and rating both. How: Present A vs. B, rate each 1–10, higher wins; repeat across pairs. Why: Side-by-side judgments reduce calibration problems. 🍞 Anchor: Two code patches are judged head-to-head; the one that correctly handles edge cases wins.

  2. 🍞 Hook: In tournaments, close matches are exciting—and informative. 🥬 Concept: Uncertainty-guided ranking orders solutions by weighted win rates, giving more weight to decisive wins and focusing new matches on near-ties. How: Use rating gaps as confidence weights; rank by weighted wins; schedule more comparisons where scores are close. Why: You spend effort where it clarifies the ranking the most. 🍞 Anchor: Two nearly-tied answers keep getting compared until a clear winner emerges.

  3. 🍞 Hook: Swiss tournaments pair teams with similar records for fair, efficient sorting. 🥬 Concept: Tournament-based refinement (Swiss refinement) pairs neighbors in the current ranking window, preferring unseen, close pairs. How: Sort by current score; compare nearby entries; repeat with updated scores. Why: It quickly resolves local uncertainties with few total matches. 🍞 Anchor: Like chess tournaments where 3–1 players face 3–1 players, rapidly revealing the top performers.

  4. 🍞 Hook: A great judge knows when to be sure and when to double-check. 🥬 Concept: V_Infer is the inference-time algorithm that runs the uncertainty-guided, Swiss-style pairwise ranking to pick the best answer. How: Two phases—coverage (everyone compares at least a bit) then refinement (focus on near-ties); final ranking picks the top. Why: Prevents orphan solutions, reduces mis-rankings, and saves compute. 🍞 Anchor: With a 2× verification budget, V_Infer beats pointwise and often surpasses aggregation with fewer calls.

  5. 🍞 Hook: Practice with feedback makes you better at both playing and refereeing. 🥬 Concept: Reinforcement learning (RL) improves a model using rewards for good behavior. How: The model generates, gets correctness rewards, updates its policy. Why: It steadily nudges the model toward better solutions. 🍞 Anchor: Correct code that passes tests gets reward; the model learns to do that more.

  6. 🍞 Hook: If a player and the referee learn together, the matches get cleaner and the calls get sharper. 🥬 Concept: V_PairRL co-trains one model to both generate solutions and judge pairs using pairwise rewards. How: Generate a group of solutions; compute generation rewards from tests; form pairs and reward the verifier for scoring close to ground-truth correctness; update one shared model on both. Why: The judge learns from the solver’s current outputs, avoiding mismatch. 🍞 Anchor: On code tasks, V_PairRL yields 7–9% better scaling than standard RL and boosts base Pass@1 too.

03Methodology

At a high level: Input problem → Parallel generation (N candidates) → Pairwise verification via V_Infer (coverage + Swiss refinement with weighted wins) → Ranked list → Select top-1 (or top-k). For training: Problems + tests → Generate multiple solutions → Compute generation rewards → Build pairs with at least one correct → Compute verification rewards → Update one shared model (V_PairRL).

Step-by-step with sandwiches for key steps:

  1. 🍞 Hook: More tries raise your chances—like taking many shots at the hoop. 🥬 Concept: Parallel generation produces N different answers. How: Sample independently with temperature/top-p; store all candidates. Why: Increases Pass@N—the chance at least one is correct. 🍞 Anchor: For N=16, even if single-shot accuracy is modest, having many candidates often includes a correct one.

  2. 🍞 Hook: Everyone should play at least one match before declaring a champion. 🥬 Concept: Coverage phase in V_Infer gives every solution minimum comparisons (min-degree) and pairs it with similar-scored opponents. How: Start with random disjoint pairs; then target under-sampled solutions and compare them to close-scored neighbors. Why: Prevents “orphan” solutions that never get seen and reduces early noise. 🍞 Anchor: Each of the 16 candidates is compared at least twice before we get fancy.

  3. 🍞 Hook: Close games teach you the most about who’s truly better. 🥬 Concept: Swiss refinement focuses remaining comparisons on near-ties and unseen pairs. How: Sort solutions by current score; compare neighbors within a small window; favor pairs not yet compared and with small score gaps. Why: Maximizes information gain per comparison, raising ranking accuracy with few matches. 🍞 Anchor: If #5 and #6 are nearly tied, we compare them again rather than re-checking #1 vs. #16.

  4. 🍞 Hook: A blowout win should count more than squeaking by. 🥬 Concept: Uncertainty-weighted aggregation uses rating gaps as confidence weights. How: In each pair A vs. B, the model outputs ratings rA and rB in [1,10]; weight = max(|rA−rB|, τ). Update each solution’s score as weighted win rate over its opponents. Why: Decisive wins strongly influence the ranking; near-ties contribute gently, reducing variance. 🍞 Anchor: If A=9, B=3, A’s big margin counts more than A=6, B=5.

  5. 🍞 Hook: Pick a champion. 🥬 Concept: Final selection takes the top-ranked solution after all comparisons within budget. How: Sort by the aggregated scores and choose top-1. Why: That’s the most likely correct answer given the head-to-head evidence. 🍞 Anchor: On LiveCodeBench, this step delivers higher Pass@1 than pointwise scoring at the same budget.

Concrete example:

  • Inputs: N=16 candidate codes, budget B=32 pairwise judge calls, τ=0.1, Swiss window h=8, min-degree=2.
  • Coverage: Random pairing (8 pairs), then additional pairings for under-sampled solutions (≈ 16–20 calls total).
  • Swiss: Use remaining ~12–16 calls to compare close-score neighbors not yet played.
  • Weighted wins: Update each solution’s weighted win-rate μ; pick the highest μ.
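Putting the steps together, here is a simplified sketch in the spirit of V_Infer. The `judge` callable and the parameter names (`tau`, `window`, `min_deg`) follow the description above but are illustrative assumptions, not the authors' implementation:

```python
import random

def tournament(cands, judge, budget=32, tau=0.1, window=4, min_deg=2):
    """Rank candidates by confidence-weighted pairwise wins; return the
    index of the top-ranked candidate."""
    n = len(cands)
    wins, weights = [0.0] * n, [0.0] * n
    degree = [0] * n
    seen, calls = set(), 0

    def score(i):                                # weighted win rate
        return wins[i] / weights[i] if weights[i] else 0.5

    def play(i, j):
        nonlocal calls
        ri, rj = judge(cands[i], cands[j])       # ratings in [1, 10]
        w = max(abs(ri - rj), tau)               # confidence weight
        wins[i if ri >= rj else j] += w
        weights[i] += w
        weights[j] += w
        degree[i] += 1
        degree[j] += 1
        seen.add((min(i, j), max(i, j)))
        calls += 1

    # Coverage phase: random disjoint pairs, then top up any candidate
    # below min_deg comparisons, preferring close-scored opponents.
    order = random.sample(range(n), n)
    for i, j in zip(order[::2], order[1::2]):
        play(i, j)
    while calls < budget and min(degree) < min_deg:
        i = min(range(n), key=degree.__getitem__)
        opps = sorted((k for k in range(n) if k != i),
                      key=lambda k: abs(score(k) - score(i)))
        j = next((k for k in opps if (min(i, k), max(i, k)) not in seen),
                 opps[0])
        play(i, j)

    # Swiss refinement: spend the remaining budget on close-ranked,
    # not-yet-played pairs (near-ties carry the most information).
    while calls < budget:
        ranked = sorted(range(n), key=score, reverse=True)
        pair = next(((ranked[a], ranked[b])
                     for a in range(n)
                     for b in range(a + 1, min(a + 1 + window, n))
                     if (min(ranked[a], ranked[b]),
                         max(ranked[a], ranked[b])) not in seen), None)
        if pair is None:
            break                                # no informative pairs left
        play(*pair)

    return max(range(n), key=score)

# Toy run: ratings are just the candidate values, so 9 should win.
best = tournament([3, 9, 5, 1], lambda a, b: (a, b), budget=12)
print(best)  # → 1 (the candidate with value 9)
```

With a perfect judge the tournament recovers the best candidate; with a noisy judge, the confidence weights and repeated near-tie comparisons damp the noise while keeping total calls near O(N) rather than O(N^2).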

Why each step exists (what breaks without it):

  • No parallel generation: You have nothing to select from—lower Pass@N.
  • No coverage: Some candidates never compared; a hidden gem can’t win.
  • No Swiss refinement: You waste comparisons on obvious mismatches, learning little.
  • No weighting: Narrow wins distort rankings; decisive victories don’t stand out.
  • No budget control: Compute explodes to O(N^2); the method becomes too slow.

Training recipe (V_PairRL):

🍞 Hook: Practice makes perfect—especially when you practice both playing and refereeing. 🥬 Concept: Co-evolving RL trains the same model to solve and to judge pairs at the same time. How it works:

  1. Generate G solutions per problem.
  2. Generation rewards: run code against tests (1 if all pass, else 0). Compute group-relative advantages.
  3. Build pairs that include at least one correct solution (correct–correct or correct–incorrect, i.e., CC or CI) to avoid trivial gaming.
  4. Verification rewards: model rates each solution in the pair to [0,1]. Reward is high only if the score is confidently close to ground truth (within 0.2), encouraging bold, correct judgments and discouraging fence-sitting.
  5. Update one shared policy with combined loss: J = J_Gen + λ·J_PairVerif. Why it matters: The judge learns on in-distribution outputs from the current solver, staying calibrated as the solver improves. 🍞 Anchor: On CodeContests, this co-training improved both the scaled performance and the base Pass@1.
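The verification reward in step 4 can be sketched as a simple thresholded check, assuming the "within 0.2" rule stated above (the paper's exact reward shaping may differ):

```python
def verification_reward(pred: float, is_correct: bool,
                        margin: float = 0.2) -> float:
    """Reward the verifier only when its score in [0, 1] lands
    confidently close to the ground-truth correctness label."""
    target = 1.0 if is_correct else 0.0
    return 1.0 if abs(pred - target) <= margin else 0.0

print(verification_reward(0.9, True))    # confident and right → 1.0
print(verification_reward(0.5, True))    # fence-sitting → 0.0
print(verification_reward(0.1, False))   # confident and right → 1.0
```

Because a hedged 0.5 never falls within the margin of either label, fence-sitting earns nothing, which is the anti-gaming property the recipe relies on.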

Secret sauce:

  • Pairwise over pointwise: Relative comparisons are naturally better calibrated.
  • Uncertainty focus: Compare near-ties for maximal information.
  • Weighted wins: Turn noisy pair outcomes into a stable global ranking.
  • Co-evolution: Keep solver and judge in sync to avoid distribution shift.
  • Anti-hacking rewards: Sparsity threshold and CC/CI pairing prevent degenerate strategies (e.g., always rating 0.5 or generating junk to make judging easy).

04Experiments & Results

The test: Can pairwise verification (V_Infer) and co-training (V_PairRL) help pick the right answer more often, using the same or less compute?

What they measured and why:

  • Pass@1: How often the finally chosen answer is correct—like your final grade.
  • Test-time scaling gains: How much accuracy improves when we spend more comparisons wisely.
  • Compute budget: Total calls (generation + verification), to ensure fair comparisons.

The competition (baselines):

  • Pointwise self-verification: Score each answer alone and pick the highest.
  • Recursive Self-Aggregation (RSA): Iteratively merge solutions into new ones (risking diversity collapse).

Scoreboard with context:

  • V_Infer vs. Pointwise: On code contests, GPT-OSS-20B jumped from about 66.1% to 73.3% (+7.3% absolute), and Qwen3-4B-Instruct rose from 39.4% to 46.1% (+6.7%). On HMMT math, gains up to +10%. Think of this like moving from a B to a solid A on the leaderboard.
  • Budget-matched: With the same total calls, pairwise beats pointwise consistently and scales smoothly with more comparisons, showing compute is used more effectively.
  • Versus RSA: On LiveCodeBench-v6 with N=16, V_Infer reached 76% Pass@1 with only 48 verification calls—higher accuracy than RSA achieved with more calls. That’s like finishing the race faster and ahead.
  • Real-world SWE-bench Lite: Pairwise verification chose better bug-fix patches—33.3% resolved vs. 28.3% (pointwise) and 26.3% (vanilla). It’s the difference between fixing 1 in 3 issues vs. 1 in 4.

Surprising and insightful findings:

  • Biggest wins on hard problems: For the toughest tasks, pairwise verification’s boost was largest (e.g., +23.7% gains with higher budget). That’s where sharp judging matters most.
  • Smart pairing beats random pairing: Uncertainty-guided Swiss refinement outperformed random pair selection by ~3.8% in a test—proof that where you compare matters.
  • Pairwise + RSA is complementary: Using pairwise scores as a “fitness signal” inside RSA sped up convergence and achieved higher final accuracy than RSA alone—like giving the coach a better scoreboard during tryouts.
  • Pointwise saturation: In code, pointwise often gave many candidates 10/10, losing discrimination, while pairwise head-to-head exposed true differences.

V_PairRL training results:

  • Better test-time scaling: With N=16 and a 2× verification budget, V_PairRL beat pointwise co-training by 6–7% across multiple code benchmarks.
  • Better than plain RL even when both used V_Infer at inference: Gains of +1.9% to +8.9% show that teaching pairwise judging during training pays off later.
  • Better base Pass@1 (no scaling): Co-training improved generation quality itself, up to +8.7% on CodeContests, indicating stronger underlying reasoning.
  • Co-evolving > offline multi-task: Training the verifier online on current model outputs beat using offline pre-collected pairs, across all benchmarks.

Bottom line: Pairwise verification, done smartly and trained jointly, reliably raises top-1 accuracy, especially on hard tasks, and uses compute more wisely than classic approaches.

05Discussion & Limitations

Limitations:

  • Compute overhead: Pairwise comparisons add verification calls. V_Infer keeps this efficient (roughly O(N) judge calls), but it is still extra cost compared with simply returning the first sample.
  • Judge quality dependence: The same model judges its own work. While pairwise mitigates bias and calibration issues, self-judging can still err, especially on very subtle or domain-specific nuances.
  • Very similar candidates: When two answers are nearly identical and both plausible, even many comparisons may struggle to separate them (diminishing returns).
  • Training requires verifiable rewards: V_PairRL uses ground-truth tests (e.g., code unit tests). Domains without clear verification signals are harder to train for.
  • Prompt and scale sensitivity: The 1–10 rating prompts and hyperparameters (τ, h, min-degree) matter; poor settings can reduce gains.

Required resources:

  • For inference: Ability to generate N candidates (e.g., 8–16) and run B pairwise LLM calls (e.g., 1–3×N), plus a simple controller implementing coverage + Swiss refinement.
  • For training: An RL setup (e.g., GRPO-style), a dataset with executable or checkable rewards (for code/math), and enough compute for shared generation-and-verification rollouts.

When NOT to use:

  • Obvious, objective tasks where majority voting with exact answers is cheap and sufficient (simple math)—pairwise may be overkill.
  • Extreme latency constraints where any extra judge calls are unacceptable.
  • Domains lacking any way to verify correctness during training (then V_PairRL’s verifier reward is tricky to define).

Open questions:

  • Cross-domain generalization: How well does pairwise judging transfer to long-form writing, science Q&A, or multi-modal tasks without ground truth tests?
  • Multi-winner selection: Can we extend beyond top-1 to top-k or solution set curation robustly?
  • Hybrid strategies: What’s the optimal schedule for mixing aggregation and pairwise verification under strict budgets?
  • Learning the pairing policy: Can we learn to propose pairs end-to-end, further improving efficiency?
  • Adversarial robustness: How resilient is pairwise judging to cleverly misleading solutions?

06Conclusion & Future Work

Three-sentence summary: This paper shows that language models are much better at judging two answers head-to-head than at scoring each alone, and builds a full system around that insight. V_Infer is a smart, uncertainty-guided Swiss tournament that finds the best answer with fewer judge calls, while V_PairRL trains one model to both solve and verify using pairwise rewards so judging and solving improve together. The result is stronger, more scalable reasoning across code and math, with clear gains in real tasks like bug fixing.

Main achievement: Unifying generation and self-verification around pairwise comparison—both at inference (V_Infer) and during training (V_PairRL)—which boosts Pass@1, avoids diversity collapse, and uses compute efficiently.

Future directions: Extend pairwise judging to richer domains (long-form, multi-modal), learn pairing policies end-to-end, blend with aggregation or search more tightly, and develop verifier rewards for tasks without executable tests. Investigate reliability under distribution shift and adversarial cases, and refine training signals to reduce any lingering self-bias.

Why remember this: When you can’t trust absolute scores, compare two answers. That simple switch—plus a clever tournament and co-training—turns many tries into reliable wins, making parallel reasoning truly pay off in practice.

Practical Applications

  • Improve code-generation systems to select robust patches without needing full test suites at inference time.
  • Enhance math problem solvers by reliably choosing the best reasoning path among many.
  • Boost agentic software engineering tools (e.g., issue resolvers) by pairwise-judging candidate patches.
  • Guide evolutionary or aggregation-based methods (like RSA) using pairwise scores as a fitness signal for faster convergence.
  • Deploy efficient test-time scaling under strict compute budgets by focusing comparisons on near-ties.
  • Use pairwise verification in content moderation or QA ranking where absolute scoring is unreliable.
  • Train specialized domain models (law, science, medical coding) with pairwise co-training to improve in-distribution judging.
  • Rank and select best chains-of-thought for knowledge-intensive tasks without external oracles.
  • Calibrate LLM-as-a-judge systems by switching from pointwise to pairwise ratings for more stable decisions.
  • Build top-k candidate shortlists using the final ranking from V_Infer for human-in-the-loop review.
#pairwise self-verification#test-time scaling#parallel reasoning#uncertainty-guided ranking#Swiss tournament#V_Infer#V_PairRL#reinforcement learning#calibration collapse#diversity collapse#recursive self-aggregation#weighted win rate#Pass@1#code generation#math reasoning