Efficient RLVR Training via Weighted Mutual Information Data Selection
Key Summary
- Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.
- Most online methods pick questions mainly by "difficulty," assuming medium-hard ones are always the most useful, which ignores how much we already know about those questions.
- INSIGHT is a new data selection method that scores each question by how much new information it will likely teach (mutual information) and how usefully tricky it is (a gentle difficulty weight).
- It keeps a Bayesian belief (a running estimate) of each question's true success rate using a Beta distribution and updates it as rewards come in.
- The acquisition score is Weighted Mutual Information: a product of mutual information (epistemic exploration) and a smooth difficulty weight (aleatoric exploitation).
- This decouples difficulty from evidence: medium difficulty helps only if we still have uncertainty; once well-known, even medium-hard items teach little.
- Across planning, math, and general reasoning, INSIGHT yields higher accuracy (+1.41 avg in planning/math; +1.01 in general reasoning) and up to ~2.2x faster training.
- It needs no extra rollouts or value networks and adds negligible compute overhead; it slots into standard RLVR pipelines like GRPO.
- Ablations show both parts (MI and the weight) matter; using the mean success rate beats sampled difficulty for stable selection.
- INSIGHT is most helpful for smaller models or low-resource settings where picking the right problems makes a big difference.
Why This Research Matters
Picking the right practice problems can make training large language models faster, cheaper, and more environmentally friendly. INSIGHT focuses training on questions that teach the most right now, instead of blindly chasing "medium-hard" items. That means better math, planning, and science answers sooner, which helps students, teachers, engineers, and researchers. Because it needs no extra rollouts, it fits into existing RL pipelines without slowing them down. More efficient learning lets smaller labs and schools get strong results on modest hardware, broadening access. Over time, this can speed up progress in tutoring systems, coding assistants, and scientific tools that rely on reliable reasoning.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're practicing for a math contest. If you keep redoing questions you already mastered or ones you can't solve yet, you won't learn fast.
The Concept (Reinforcement Learning): RL is a way to train AI by letting it try answers and learn from rewards.
- How it works:
- The AI sees a question (prompt) and generates answers (rollouts).
- A rule checks each answer and gives a reward (right=1, wrong=0 in RLVR).
- The AI updates itself to be more like the answers that earned higher rewards.
- Why it matters: Without smart practice selection, RL wastes time on questions that are too easy or impossibly hard, slowing learning.
Anchor: Like a student doing practice tests with a red pen that marks answers right or wrong and then studies the mistakes to improve.
Hook: You know how some homework has an answer key you can check instantly?
The Concept (RL with Verifiable Rewards, RLVR): RLVR uses tasks where we can automatically verify if an answer is correct.
- How it works:
- The model solves a problem.
- A checker compares its answer to the ground truth.
- The reward is 1 if correct, 0 otherwise, and training proceeds.
- Why it matters: Fast, automatic checking lets the model learn quickly and reliably.
Anchor: Solving arithmetic or unit-conversion problems where a script can check the exact number.
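The binary check at the heart of RLVR can be sketched as a tiny verifier. The function name and the exact-match rule below are illustrative assumptions; real checkers parse task-specific answer formats before comparing.

```python
# Minimal sketch of an RLVR-style binary reward. The exact-match rule is an
# illustrative assumption; real checkers parse task-specific answer formats.
def verify_answer(model_answer: str, ground_truth: str) -> int:
    """Return reward 1 if the answer matches the reference, else 0."""
    # Normalize whitespace and case so " 42 " and "42" count as equal.
    return int(model_answer.strip().lower() == ground_truth.strip().lower())

# Rewards for three sampled answers to a question whose true answer is "42".
rewards = [verify_answer(a, "42") for a in ["42", "41", " 42 "]]
```

These 1/0 rewards are exactly what the training loop and the belief updates described later consume.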
Hook: Think of a teacher picking which problems you'll do next.
The Concept (Data Selection): Data selection is choosing which questions to train on right now.
- How it works:
- Look at a pool of questions.
- Decide which ones will teach the most next.
- Train on those and repeat.
- Why it matters: Bad picks waste time; good picks speed up learning and stabilize training.
Anchor: A coach picking drills that fix your current weak spots instead of repeating what you already do well.
Hook: You know how medium-hard problems feel "just right," not too easy, not too hard?
The Concept (Difficulty-Based Heuristics): These are rules that pick questions near a target success rate (often ~50%).
- How it works:
- Estimate how often the model gets a question right.
- Prefer questions close to 50% right, assumed "most informative."
- Keep selecting such items during training.
- Why it matters: It's simple and sometimes effective early on, but it ignores how much evidence we've already gathered.
Anchor: If you've already done a certain type of 50-50 question 100 times, doing more of the same won't teach much new.
Hook: Imagine flipping a coin (randomness) versus not knowing how biased the coin is (ignorance).
The Concept (Aleatoric vs. Epistemic Uncertainty): Aleatoric is randomness in outcomes; epistemic is uncertainty from not knowing enough yet.
- How it works:
- Aleatoric: Even if you know the rules, outcomes can vary (like a dice roll).
- Epistemic: You're unsure because you lack data; more evidence can reduce this.
- Training should focus on reducing epistemic uncertainty efficiently.
- Why it matters: Difficulty-only rules chase variability (aleatoric) but may ignore whether we still lack knowledge (epistemic).
Anchor: Rolling a die is always random (aleatoric). But guessing a quiz topic before studying is epistemic; you can fix it by studying.
Hook: Picture keeping score of how often you solve a type of problem.
The Concept (Bayesian Inference for Success Rates): We keep a belief about each question's true success rate and update it with new results.
- How it works:
- Start with a prior belief (Beta distribution) about success probability.
- Observe wins/losses (Binomial/Bernoulli rewards).
- Update to a posterior belief (new Beta) after each batch.
- Why it matters: This calibrated belief tells us both the current difficulty and how certain we are about it.
Anchor: If you've seen 2/4 successes, you believe ~50% success but with high uncertainty; after 500/1000 successes, 50% is much more certain.
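The anchor above can be checked numerically with the conjugate Beta update. A minimal sketch, with illustrative helper names (the paper's exact bookkeeping may differ):

```python
# Conjugate Beta-Bernoulli update: Beta(a, b) plus S successes and F failures
# gives Beta(a+S, b+F). Helper names are illustrative, not from the paper.
def update_belief(alpha: float, beta: float, successes: int, failures: int):
    return alpha + successes, beta + failures

def beta_mean_var(alpha: float, beta: float):
    """Mean and variance of a Beta(alpha, beta) belief (closed form)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var

# 2/4 successes: mean ~0.5 but a wide, uncertain belief.
a_small, b_small = update_belief(1.0, 1.0, 2, 2)
# 500/1000 successes: same mean ~0.5, far tighter belief.
a_big, b_big = update_belief(1.0, 1.0, 500, 500)

mean_small, var_small = beta_mean_var(a_small, b_small)
mean_big, var_big = beta_mean_var(a_big, b_big)
```

Both beliefs have mean 0.5, but the second has far smaller variance: the same estimated difficulty, held with much more certainty.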
Hook: Imagine choosing a question that will teach you the most new facts right now.
The Concept (Mutual Information): Mutual information measures how much observing an outcome reduces our uncertainty about something we care about.
- How it works:
- Measure how uncertain we are about a question's true success rate.
- Imagine possible rewards from doing it K times.
- Pick questions whose results would shrink our uncertainty the most.
- Why it matters: It directly targets "learning value," not just "hardness."
Anchor: Picking the experiment that will best reveal whether a bridge design is safe before building the real thing.
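Under a Beta belief with binary rewards, this uncertainty reduction can be computed exactly, because the success count is a sufficient statistic for the K rollouts. The sketch below evaluates I(S; φ) with a simple quadrature; it is one way to compute the quantity, and the closed form the paper actually uses may differ.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binom_pmf(s, K, a, b):
    """P(S=s) when the success rate is Beta(a, b) and S ~ Binomial(K, rate)."""
    return math.exp(
        math.lgamma(K + 1) - math.lgamma(s + 1) - math.lgamma(K - s + 1)
        + log_beta(a + s, b + K - s) - log_beta(a, b)
    )

def binom_entropy(K, p):
    """Shannon entropy (nats) of Binomial(K, p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    h = 0.0
    for s in range(K + 1):
        logpmf = (math.lgamma(K + 1) - math.lgamma(s + 1)
                  - math.lgamma(K - s + 1)
                  + s * math.log(p) + (K - s) * math.log(1 - p))
        h -= math.exp(logpmf) * logpmf
    return h

def mutual_information(a, b, K, grid=2000):
    """I(S; rate) = H(S) - E[H(S | rate)] for S = successes in K rollouts."""
    # Marginal entropy of the success count under the Beta-Binomial predictive.
    h_marginal = 0.0
    for s in range(K + 1):
        p = beta_binom_pmf(s, K, a, b)
        if p > 0:
            h_marginal -= p * math.log(p)
    # Expected conditional entropy, integrated over the Beta belief (midpoint rule).
    h_cond = 0.0
    for i in range(grid):
        phi = (i + 0.5) / grid
        pdf = math.exp((a - 1) * math.log(phi)
                       + (b - 1) * math.log(1 - phi) - log_beta(a, b))
        h_cond += pdf * binom_entropy(K, phi) / grid
    return h_marginal - h_cond

mi_fresh = mutual_information(1, 1, 8)      # barely-tried question
mi_stale = mutual_information(100, 100, 8)  # heavily practiced question
```

The score is always nonnegative and decays as the evidence count grows, which is exactly the "flashlight that dims" behavior described later.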
The World Before: RL for reasoning LLMs got big boosts, but training was costly. Static sampling wasted compute on too-easy or too-hard problems. Online selection via difficulty helped early but faltered later because it ignored how evidence reduces uncertainty over time.
The Problem: How do we select questions that truly maximize learning per step, balancing how hard they are with how much we still have to learn about them?
Failed Attempts: Oversampling methods ran many extra rollouts (expensive). Single-signal difficulty methods equated "medium-hard" with "most informative," even after those items became well-understood.
The Gap: A principled selector that jointly accounts for difficulty and accumulated evidence without extra rollouts.
Real Stakes: Faster, steadier training means cheaper, greener, and better-performing models that plan, calculate, and reason more reliably across education, coding, science, and beyond.
02 Core Idea
Hook: You know how a great tutor doesn't just give you medium-hard problems; they pick the ones that will clear up your biggest confusions right now.
The Concept (Key Insight): Choose data by how much new certainty it will add (mutual information), then gently weight by useful difficulty.
- How it works:
- Keep a Bayesian belief (mean and uncertainty) for each question's success rate.
- Compute how much doing K rollouts would reduce uncertainty (mutual information, MI).
- Multiply by a smooth difficulty weight that prefers informative, non-trivial items.
- Why it matters: Difficulty alone can't tell if we've already "figured it out." MI decays when evidence is large, so we don't overtrain on stale items.
Anchor: Picking the next practice problem that both challenges you and answers the most pressing unknown.
Multiple Analogies:
- Detective Analogy: Don't just chase "hard clues." Chase the clues that will shrink your suspect list the most (MI), favoring clues that aren't trivial or impossible (weight).
- Cooking Analogy: Don't just choose medium-spicy dishes. Choose ingredients that most improve the flavor you're unsure about, while keeping the dish tasty (difficulty weight).
- Studying Analogy: Don't just do 50-50 questions. Do the ones where a few attempts will most reduce your confusion, while staying in a learnable zone.
Before vs After:
- Before: Selection = "Is it medium-hard?"; often re-picks the same items even when well-known; extra rollouts needed for stable estimates.
- After: Selection = "Will this shrink my uncertainty, and is it the right level?"; avoids overfamiliar items; requires no oversampling; more stable progress.
Hook: Think of a flashlight that dims as you learn more about a question; it tells you when to move on.
The Concept (Why It Works, Intuition): Uncertainty reduction naturally fades as evidence builds up for a question.
- How it works:
- Early on, outcomes teach a lot: big MI.
- As counts grow, the belief tightens: MI drops.
- Weighting by expected difficulty keeps focus on fertile learning zones.
- Why it matters: It balances exploration (learn new facts) and exploitation (practice where variance is informative), maximizing learning per step.
Anchor: After 100 tries on the same type of puzzle, your tutor stops giving it to you, not because it's easy, but because you won't learn much more from it.
Hook: Imagine the method as LEGO bricks that click together.
The Concept (Building Blocks):
- What it is: INSIGHT = Bayesian Belief + Multi-rollout Mutual Information + Smooth Difficulty Weight.
- How it works:
- Bayesian Belief: Track successes/failures per question with a Beta prior/posterior.
- MI Term: Estimate expected uncertainty reduction over K rollouts.
- Weight: A gentle function of mean success rate that filters trivial/impossible items and biases toward a target zone.
- Why it matters: Each brick solves a piece: belief calibrates certainty, MI targets learning value, weight shapes curriculum.
Anchor: Like building a study plan that knows what you know, predicts what each problem will teach, and keeps you in the learning sweet spot.
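The third brick, the smooth difficulty weight, can be sketched in a few lines. The Gaussian-bump shape and the parameter names mu/eta below are illustrative assumptions; the paper's exact functional form may differ, but it shares this "bump near a target success rate" behavior.

```python
import math

def difficulty_weight(phi_bar: float, mu: float = 0.4, eta: float = 0.2) -> float:
    """Illustrative smooth weight: near 1 close to the target success rate mu,
    smoothly decaying toward 0 for trivial (phi_bar -> 1) or currently
    impossible (phi_bar -> 0) questions."""
    return math.exp(-((phi_bar - mu) ** 2) / (2 * eta ** 2))

w_easy = difficulty_weight(0.95)  # almost always solved: low weight
w_hard = difficulty_weight(0.05)  # almost never solved: low weight
w_mid = difficulty_weight(0.40)   # learnable zone: highest weight
```

Because the weight is smooth rather than a hard cutoff, borderline questions are down-weighted gradually instead of being discarded outright.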
03 Methodology
At a high level: Pool of questions → Keep/update beliefs → Score by Weighted Mutual Information (WMI) → Pick top-M → Roll out answers and get rewards → Update beliefs and policy → Repeat.
Step-by-step recipe:
- Initialize Bayesian beliefs for each datapoint
- What happens: For every question τ, set a Beta prior (α, β) representing initial successes/failures; usually start uniform (1,1).
- Why this exists: We need a calibrated starting belief about difficulty and our confidence in it.
- Example: For a new math problem type, α=1, β=1 means "about 50% but we're very unsure."
- Sample a larger candidate set T̂ (much bigger than M)
- What happens: Randomly pick a bigger pool (e.g., 16× batch size) from all questions.
- Why this exists: We want variety so we can find the most informative items right now.
- Example: If M=256, collect ~4096 candidates to score.
- Compute Mutual Information I(R1:K; Φτ) for each candidate
- What happens: Estimate how much K rollouts on τ would reduce our uncertainty about its success rate.
- Why this exists: This is the "how much new knowledge will I gain?" core.
- Example: For K=8 and a question with few past tries, MI is high; for a question we've seen a lot, MI is low.
- Compute the smooth difficulty weight w(φ̄τ)
- What happens: Use the current mean success rate φ̄τ to weight questions that are neither trivial nor impossible, with a bias toward a target μ (e.g., ~0.3–0.5).
- Why this exists: Medium-ish difficulty tends to yield helpful gradients; the weight enforces this gently, without ignoring uncertainty.
- Example: A question at φ̄=0.95 gets low weight (too easy); φ̄=0.05 also low (too hard); φ̄≈0.3–0.5 higher weight.
- Form the WMI score and rank
- What happens: A(τ) = w(φ̄τ) × I(R1:K; Φτ); rank candidates by A and pick the top-M.
- Why this exists: Multiplying combines "where we'll learn most" (MI) with "is it usefully difficult" (weight).
- Example: Two 50% items: if one has tons of past tries (low MI), it scores lower than a similar one we barely tried (high MI).
- Generate K rollouts for the selected batch and compute rewards
- What happens: For each chosen question, the model produces K answers; a checker gives binary rewards.
- Why this exists: We need fresh evidence to both train the model and update beliefs.
- Example: For τ, K=8 answers yield successes S=3; we record those 1/0 rewards.
- Update the policy with RL (e.g., GRPO)
- What happens: Use the rewards to compute advantages and update model parameters.
- Why this exists: This is the learning step; better-chosen data → more informative gradients → faster improvement.
- Example: The model leans toward answer patterns that got reward 1.
- Update Bayesian beliefs for each observed τ
- What happens: Add successes S and failures K−S to α, β (optionally with a discount λ); this becomes the new prior.
- Why this exists: Keeps our certainty calibrated for the next selection round.
- Example: If α=3, β=5 and S=3, K=8, then α→6, β→10: a narrower, more confident belief.
- Repeat for T steps
- What happens: Loop selection → rollout → update.
- Why this exists: As the model improves, the selector adapts, always chasing the next best learning opportunities.
- Example: Early rounds pick many unknowns; later rounds shift toward still-uncertain but learnable niches.
Concrete mini-walkthrough with data:
- Inputs: 10,000 math/planning questions; start α=β=1; K=8; M=256; candidate size ~16×.
- Round t: • A tough but under-explored AIME-style item: φ̄≈0.35, few tries → high MI; weight is also decent → high WMI. • A medium-hard Countdown item seen many times: φ̄≈0.5 but large evidence → low MI; weight decent → moderate WMI. • An almost trivial arithmetic item: φ̄≈0.95 → low weight; even if MI is moderate, product stays low.
- Select top-M by WMI, train, and update.
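One round of this recipe can be sketched end to end. Everything below is illustrative: the Gaussian weight shape, the Beta-variance stand-in for the mutual-information term (chosen because it also decays as evidence accumulates), and the simulated coin-flip rollouts standing in for real model generations and a GRPO policy update.

```python
import math
import random

K, M, MU, ETA = 8, 4, 0.4, 0.2  # rollouts/question, batch size, weight params

def uncertainty(a, b):
    """Beta(a, b) variance: a cheap stand-in for the MI term here, since
    both shrink as the evidence count a+b accumulates."""
    t = a + b
    return a * b / (t ** 2 * (t + 1))

def weight(phi_bar):
    """Illustrative smooth difficulty weight centred on target rate MU."""
    return math.exp(-((phi_bar - MU) ** 2) / (2 * ETA ** 2))

def acquisition(a, b):
    """WMI-style score: difficulty weight times remaining uncertainty."""
    return weight(a / (a + b)) * uncertainty(a, b)

# Beliefs per question: one well-practised 50% item plus fresh, untried items.
beliefs = {"stale_medium": (200.0, 200.0)}
beliefs.update({f"fresh_{i}": (1.0, 1.0) for i in range(8)})

# Score all candidates and select the top-M by acquisition score.
scores = {q: acquisition(a, b) for q, (a, b) in beliefs.items()}
batch = sorted(scores, key=scores.get, reverse=True)[:M]

# Simulate K binary-reward rollouts per selected question, then do the
# conjugate belief update (a real run would also update the policy here).
random.seed(0)
for q in batch:
    a, b = beliefs[q]
    successes = sum(random.random() < 0.5 for _ in range(K))
    beliefs[q] = (a + successes, b + K - successes)
```

Even though the stale item sits at the "ideal" 50% success rate, its near-zero remaining uncertainty keeps it out of the batch, which is the decoupling the next section calls the secret sauce.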
The Secret Sauce:
- Decoupling: MI handles "how much we don't know," the weight handles "is it usefully tricky?"
- Mean, not samples: Using φ̄ (mean belief) avoids noisy sampled success rates that jitter rankings.
- Multi-rollout aware: MI is computed for K rollouts, matching common RLVR practice.
- Evidence-aware: As certainty grows, MI shrinks naturally, preventing wasted practice on over-known items.
04 Experiments & Results
The Test: Measure how much INSIGHT improves both accuracy (pass@1) and training speed across planning (Countdown), mathematics (AIME24, AMC23, MATH500, Minerva Math, OlympiadBench), and general reasoning (MMLU, GPQA). Models range from 0.6B to 7B parameters. All methods use GRPO; each problem is evaluated with 16 generations.
The Competition:
- RANDOM: uniform sampling.
- MOPPS: picks items near target difficulty (~50%).
- INVERSE-EVIDENCE: favors least-seen items (epistemic only).
- EXPECTED-DIFFICULTY: uses mean difficulty (not sampled) near target, but ignores evidence.
- Dynamic Sampling (DS): oversamples and filters; strong but very expensive.
The Scoreboard (with context):
- Planning & Math: INSIGHT consistently tops baselines. Average gains up to +1.41 points over strong online baselines; big wins on Countdown (+5.13) and AIME24 (+2.30) in smaller models, like moving from a B to an A- while others hover at B.
- General Reasoning: Smaller but steady improvements; +1.01 average gain overall; notable boosts on MMLU-STEM (+1.14 on 1.7B) and GPQA (+3.16 on 0.6B), akin to getting a few extra questions right where it's hardest.
- Efficiency: On Countdown, INSIGHT reaches target accuracy with fewer steps: up to ~2.2× faster (0.6B), ~1.5× (4B), ~1.6× (7B). That's like finishing your homework in half the time with the same grade.
- Overhead: INSIGHT adds negligible compute versus RANDOM/MOPPS; DS takes >2× training hours but only minor accuracy gains over INSIGHT.
Surprising/Notable Findings:
- Mean beats sample: EXPECTED-DIFFICULTY (mean) often beats MOPPS (sampled), showing that stable beliefs matter.
- Epistemic alone isn't enough: INVERSE-EVIDENCE can match or underperform RANDOM; uncertainty must be paired with learnable difficulty.
- Noise sensitivity: On some noisy general reasoning tasks, random occasionally edges sampled-difficulty methods, reinforcing that noisy proxies can mislead selectors.
- Components matter: Ablations show MI-only underperforms; weight-only is competitive but inconsistent. Together (WMI) is best across the board.
- Difficulty bias μ: Middle settings (around 0.3–0.5) work best; extremes (too easy/hard) hurt, especially for smaller models.
- Candidate size: Larger candidate pools help bigger models; medium pools sometimes suit smaller models better, hinting at capacity-aware exploration.
Bottom line: Scoring by "how much I'll learn" (MI) times "how usefully hard it is" (weight) wins on both accuracy and speed without extra rollouts.
05 Discussion & Limitations
Limitations:
- Near-deterministic tasks: If outcomes are almost always 0 or 1, mutual information is tiny; selection has little room to improve over random.
- Mis-specified rewards: If the checker is noisy or biased, the Bayesian belief gets skewed, and WMI can over- or under-prioritize items.
- Hyperparameters: The difficulty bias (μ, η) and candidate size need light tuning; poor choices can reduce gains, especially for very small models.
- Computation on massive pools: Computing MI and weights for extremely large candidate sets can add modest overhead (though still far less than oversampling-based methods).
Required Resources:
- A verifiable reward function (binary or easily scored) and a standard RLVR setup (e.g., GRPO).
- Modest extra bookkeeping to maintain Beta parameters per question and compute WMI each step.
- Usual GPU resources for RL training; no additional value networks or oversampling passes are needed.
When NOT to Use:
- Purely generative, unscorable tasks where automatic, reliable rewards aren't available.
- Domains where difficulty doesnât correlate with gradient usefulness (e.g., reward hacking shortcuts dominate) and cannot be mitigated.
- Settings with ultra-stationary, noiseless items where simple curricula already suffice.
Open Questions:
- Beyond binary rewards: How to extend WMI robustly to graded or structure-aware rewards while keeping computation light?
- Cross-task transfer: Can beliefs transfer across related problem families to accelerate cold starts?
- Adaptive μ and η: Can we meta-learn the difficulty bias on the fly based on model capacity and training signals?
- Robustness: How to make WMI resilient to adversarial or mislabeled items in open-world data pools?
- Scale-up: What happens at 70B+ models and multi-domain mega-pools? Does WMI still offer large efficiency gains?
06 Conclusion & Future Work
Three-sentence summary: INSIGHT selects training data for RL by multiplying how much a question will reduce uncertainty (mutual information) with a gentle difficulty weight, using Bayesian beliefs that update with each reward. This decouples "how hard it is" from "how much we still need to learn," preventing over-focus on medium-hard but already well-understood items. Across planning, math, and general reasoning, INSIGHT improves accuracy and speeds up training, often dramatically, without extra rollouts.
Main Achievement: A principled, efficient, and plug-and-play selection rule, Weighted Mutual Information, that consistently outperforms difficulty-only heuristics while adding negligible overhead.
Future Directions: Extend to richer rewards and multi-signal uncertainty, meta-learn the difficulty bias, share beliefs across related tasks, and test at larger scales and broader domains.
Why Remember This: It reframes data selection from "pick medium-hard stuff" to "pick what will teach me the most right now, if it's still in the learnable zone," a simple but powerful shift that turns wasted practice into steady, efficient progress.
Practical Applications
- Speed up RLVR training of math-solvers by prioritizing problems with the highest WMI scores.
- Improve planning agents (e.g., Countdown-like tasks) by selecting practice instances that most reduce uncertainty.
- Cut training costs for small labs by avoiding oversampling-based selectors and using INSIGHT's lightweight scoring.
- Stabilize RL training runs by avoiding overly easy or overexposed medium-difficulty items that no longer teach much.
- Auto-curriculum for mixed-domain reasoning datasets (science, finance, etc.) using a single WMI selector.
- Cold-start new domains by quickly identifying which problem families have the biggest learning payoff.
- Tune μ and η to match model capacity: gentler difficulty bias for small models, broader exploration for large ones.
- Use mean-based difficulty (φ̄) instead of sampled rates to reduce ranking noise in online selection.
- Extend to partial-credit rewards (when available) by adapting the MI computation to graded signals.
- Deploy INSIGHT within GRPO implementations (e.g., VeRL) with minimal code changes for immediate gains.