DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Intermediate
Zhongwei Wan, Yun Shen, Zhihao Dou et al. · 2/23/2026
arXiv

Key Summary

  • LLMs trained with simple rewards often latch onto just a few ways of solving problems and stop exploring, which hurts their ability to find other correct answers.
  • DSDR teaches models to explore at two scales at once: picking different overall solution paths (global) and keeping each path flexible word-by-word (local).
  • At the global scale, DSDR only gives extra credit to correct solutions that are different from other correct ones, so the model learns many valid strategies.
  • At the local scale, DSDR adds a length-fair, token-level entropy bonus but only for correct solutions, so each strategy stays healthy without becoming random.
  • A softmax ‘coupling’ uses how globally distinctive a correct solution is to decide how much local exploration to add to it.
  • The authors prove that, if bonuses are bounded, DSDR won’t reduce the best possible correctness and keeps learning signals strong in group-based training.
  • Across multiple math benchmarks and model sizes, DSDR consistently boosts Pass@1, Avg@16, and especially Pass@k as k grows.
  • Removing either the global diversity or the global-to-local coupling makes performance drop, showing both parts are necessary.
  • DSDR improves exploration without chasing wrong answers by rewarding diversity only among correct trajectories.
  • The method is simple to add to GRPO-style RL training and uses lightweight embedding and formula signals to measure diversity.

Why This Research Matters

Real-world problems rarely repeat exactly, so a model that knows only one way to solve a task can fail when details change. DSDR raises the odds that a model learns several correct strategies and keeps each strategy flexible, improving reliability. This helps math tutors show multiple valid methods, code assistants explore different debugging plans, and scientific tools consider alternative hypotheses. Because DSDR rewards diversity only among correct answers, it avoids wandering into noisy or unsafe territory. Stronger Pass@k means more value from multiple attempts, which is how people often use AI systems in practice. Overall, DSDR makes exploration useful, not random—leading to better reasoning under real-world variety.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine your class is solving a tricky math puzzle. If everyone copies the first student who gets the right answer, the class might miss other smart ways to solve it—and the next, slightly different problem could stump everyone.

🥬 The Situation (The World Before):

  • Large language models (LLMs) have become pretty good at step-by-step reasoning when trained with reinforcement learning that uses a checker (a verifier) to tell if a final answer is correct. This setup is called RL with Verifiable Rewards, or RLVR.
  • Group-based methods like GRPO compare a handful of attempts for the same question, learn from which ones are better, and update the model. This makes training steadier than judging each attempt in isolation.
  • This worked well for math, code, and logic problems. Pass@1 (getting it right on the first try) usually goes up.

🍞 Hook: You know how, in games, if you find a winning move, you might keep using it and never learn new tricks? That’s what happens to many LLMs.

🥬 The Problem: Limited deep exploration.

  • Models often collapse onto just a few reasoning “templates”—they put almost all their probability on one or two familiar solution paths.
  • This hurts pass@k (chances of at least one correct answer in k tries) because all k tries look nearly the same.
  • In group-based training, when many attempts are correct and similar, rewards within the group become almost the same. Then the learning signal fades, and progress stalls.
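To make the Pass@k stakes concrete, here is the standard unbiased estimator widely used in sampling-based evaluation (a general formula, not something introduced by this paper): with n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct:
    the probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 8 samples and only 2 correct, extra tries help a lot --
# but only if resampling can actually reach the correct modes:
print(round(pass_at_k(8, 2, 1), 3))  # 0.25
print(round(pass_at_k(8, 2, 4), 3))  # 0.786
```

If all k samples collapse onto the same wrong template, the effective diversity is gone and the measured pass@k stays near pass@1, which is exactly the failure mode described above.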

🍞 Hook: People tried shaking the box to make the marble move around more.

🥬 Failed Attempts:

  • Token-level entropy regularization: adding randomness word-by-word. It injects local noise, but doesn’t reliably create different full solution paths.
  • Trajectory-level diversity alone: trying to spread out final solutions, but once a few templates dominate, the model becomes overconfident within those modes and stops exploring nearby correct variants.
  • Other tricks (temperature tweaks, reward relaxations, intervention in rollouts) help a bit but don’t coordinate exploration across scales.

🍞 Hook: It’s like trying to get a soccer team to play creatively by either yelling “be random!” every pass or telling them to try different formations but never practicing variations inside each formation.

🥬 The Gap:

  • Exploration must be “correctness-aligned”: it should nurture different correct strategies, not just random detours.
  • Exploration must work at two scales and be coordinated: global (different full reasoning paths) and local (healthy variability inside each path), so the model doesn’t become brittle or overconfident.
  • We also need to keep a strong learning signal in group-based training even when many attempts are already correct.

🍞 Hook: Why should we care? Because in real life, puzzles rarely repeat exactly. You want a toolbox, not just one hammer.

🥬 Real Stakes (Daily Life Connections):

  • Math helpers should find multiple correct ways so they adapt when numbers or wording change.
  • Code assistants should try different debugging paths so one bug doesn’t derail all attempts.
  • Scientific helpers should consider multiple hypotheses, not just the first plausible one.
  • Safe systems benefit from diverse, correct plans rather than overconfident single tracks.
  • For students, seeing different correct solutions builds deeper understanding and transfer to new problems.

🍞 Anchor: Picture a maze with many exits that all lead to the treasure. A model that only memorizes one path gets stuck when that hallway is blocked. A model trained to explore many correct paths will still find the treasure—no matter which door is open today.

02Core Idea

🍞 Hook: You know how chefs learn faster when they try different recipes (global) and also tweak the spices inside each recipe (local)?

🥬 The “Aha!” Moment (One sentence): Encourage diversity at two scales—between whole solution paths and within each correct path—and connect them so the most unique correct paths get extra careful exploration.

🥬 Multiple Analogies:

  1. Hiking trails: Try multiple distinct trails up the mountain (global), and on each trail, be flexible about turns and pace (local). Prioritize exploring the trails that look most different from what you already know.
  2. Orchestra: Discover different musical pieces to perform (global), and within each piece, vary the dynamics and tempo tastefully (local). Spend more rehearsal time on the pieces that add the most variety to your concert.
  3. Lego builds: Build different models from the same set (global), and within each model, experiment with piece placements (local). Focus your tinkering on the models that truly broaden your collection.

🥬 Before vs After:

  • Before: Models improved Pass@1 by overfitting a few “favorite” templates. Pass@k barely improved because samples looked the same.
  • After (DSDR): Models discover several distinct correct solution modes and keep each mode flexible. Pass@1, Avg@16, and Pass@k rise together, and larger k helps more because attempts are usefully different.

🥬 Why It Works (Intuition, no equations):

  • Keep the learning signal alive: When many tries are correct, rewards can look identical. A small extra bonus for being a different correct solution spreads those rewards apart just enough to guide learning.
  • Don’t reward wrong detours: Only correct attempts get diversity bonuses and local entropy. This aligns exploration with success instead of noise.
  • Be fair about length: Local entropy is computed per token on correct paths, so longer answers don’t get unfair extra credit.
  • Spend exploration where it counts: Use a softmax over how distinctive each correct solution is to decide where to add more local exploration. Unique correct paths get more attention, growing new solution neighborhoods.
  • The math backs it: With bounded bonuses, you don’t reduce the best possible correctness, and the coupling rule naturally falls out of an optimal allocation argument.
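The "keep the learning signal alive" point can be seen in a toy group-relative advantage calculation (a simplified sketch of GRPO-style normalization; the function name is ours, not the paper's):

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages, GRPO-style: (r - mean) / std.
    If every reward in the group is identical, the signal vanishes."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]

# All four samples correct with identical rewards: no gradient signal.
print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
# A small diversity bonus on one distinct correct answer revives it.
print(group_advantages([1, 1, 1, 1.1]))
```

The second call shows how even a tiny, bounded bonus re-spreads the advantages, so the distinct correct solution gets a positive push while the near-duplicates do not.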

🥬 Building Blocks (each as a Sandwich):

  • 🍞 Hook: Imagine grading a quiz where a checker just says “right” or “wrong.” 🥬 The Concept: Reinforcement Learning with Verifiable Rewards (RLVR) is training where a verifier gives a score (often 0/1) after a whole answer. How it works:

    1. Sample several answers to a prompt.
    2. A verifier marks each as correct or not.
    3. The model updates to make correct answers more likely. Why it matters: It teaches full reasoning, not just copying tokens. 🍞 Anchor: Math problems where you can check the final numeric answer.
  • 🍞 Hook: Think of picking the top student papers in a small group. 🥬 The Concept: Group-based optimization (GRPO) compares attempts within a group to stabilize learning. How it works:

    1. Generate G solutions for the same question.
    2. Compare their rewards relative to each other.
    3. Update the policy using these within-group comparisons. Why it matters: More stable than judging one sample at a time. 🍞 Anchor: Choosing the best of 8 tries for the same puzzle.
  • 🍞 Hook: If a writer always uses the same words, their stories feel flat. 🥬 The Concept: Entropy regularization encourages some randomness in choices. How it works:

    1. Add a small bonus for being less predictable.
    2. This nudges the model to consider alternatives.
    3. But too much randomness becomes noisy. Why it matters: It prevents premature overconfidence. 🍞 Anchor: Temperature or entropy bonuses during text generation.
  • 🍞 Hook: Don’t just vary words—vary the whole plan. 🥬 The Concept: Trajectory-level diversity means encouraging different complete reasoning paths. How it works:

    1. Represent each full solution.
    2. Measure how different it is from others.
    3. Prefer correct solutions that expand coverage. Why it matters: Word-level noise alone rarely makes new plans. 🍞 Anchor: Two distinct proofs of the same theorem.
  • 🍞 Hook: Give more practice to the most unique correct ideas. 🥬 The Concept: Global-to-local coupling connects big-picture diversity to word-level exploration. How it works:

    1. Score how distinctive each correct solution is.
    2. Use a softmax to assign more local exploration to the most distinctive ones.
    3. This grows new neighborhoods of correct solutions. Why it matters: You explore where it helps most. 🍞 Anchor: Spending extra rehearsal time on the pieces that broaden the concert’s variety.
  • 🍞 Hook: Keep each good plan flexible, not brittle. 🥬 The Concept: Token-level entropy regularization (length-invariant, correct-only) keeps per-word choices healthy inside correct paths. How it works:

    1. Average uncertainty per token (so length doesn’t dominate).
    2. Apply it only if the overall answer is correct.
    3. Weight it by the global-to-local coupling. Why it matters: Prevents mode collapse without drifting off-task. 🍞 Anchor: Within a correct algebra method, don’t lock into just one phrasing or step order.
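The length-invariant entropy idea from the last building block can be sketched in a few lines (a toy illustration with hypothetical per-token distributions, not the paper's implementation):

```python
import math

def mean_token_entropy(token_dists):
    """Average per-token entropy over a trajectory, so a long answer
    doesn't accumulate a larger bonus than a short one."""
    ent = 0.0
    for probs in token_dists:
        ent -= sum(p * math.log(p) for p in probs if p > 0)
    return ent / max(len(token_dists), 1)

# Two trajectories of different lengths but the same per-token spread
# receive the same bonus:
short = [[0.5, 0.5]] * 3
long_ = [[0.5, 0.5]] * 30
print(mean_token_entropy(short))  # ln 2 ≈ 0.693
print(mean_token_entropy(long_))  # same value
```

Without the division by length, the 30-token trajectory would earn ten times the bonus of the 3-token one for no extra flexibility, which is the length bias the method avoids.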

03Methodology

At a high level: Prompt → Sample a group of solutions → Compute global diversity on correct ones → Add small diversity bonuses to correct rewards → Allocate local exploration to distinctive correct solutions → Update policy with GRPO + local entropy term → New policy.

Step-by-step (with Sandwich explanations at first use):

  1. Grouped Sampling and Verifiable Rewards (RLVR + GRPO)
  • 🍞 Hook: Like giving 8 students the same problem and seeing how each does.
  • 🥬 What happens: For each prompt, the model generates G solutions. A verifier marks each complete solution correct (1) or incorrect (0). GRPO then compares solutions within this group to compute learning signals. Why this step exists: It stabilizes training by judging solutions relative to peers rather than absolute scores only. Example: For a geometry question, you get 8 proofs; 3 are correct, 5 are not.
  • 🍞 Anchor: You only compare papers from the same assignment and class.
  2. Compute Global Diversity for Each Solution (only shapes correct ones)
  • 🍞 Hook: You know how two stories can look similar in words but use totally different plots? We measure both.
  • 🥬 What happens: a) Semantic diversity: Encode each full solution with a small frozen encoder and compare pairwise cosine distances. More distance = more different meaning/approach. b) Formula diversity: Extract formulas (when present). Count how many formulas in a solution don’t appear in other group members. More unique formulas = more structural novelty. c) Combine both into a single diversity score in [0,1] and clip to a safe range. Why this step exists: It captures trajectory-level differences beyond random word swaps and keeps the signal well-scaled. Example: Two correct solutions—one uses recursion, another uses generating functions—score as more diverse than two paraphrases of the same method.
  • 🍞 Anchor: An essay graded for both overall argument (semantic) and unique use of evidence (formula).
  3. Correct-Only Global Reward Shaping
  • 🍞 Hook: Don’t pay for being different if the answer is wrong.
  • 🥬 What happens: Add a small, clipped diversity bonus to the reward, but only for correct solutions. This spreads out rewards among correct answers so the model learns to keep multiple correct modes. Why this step exists: When many in-group answers are correct, plain rewards become nearly equal, shrinking the learning signal. The bonus restores healthy differences without incentivizing wrong answers. Example: If four correct answers look very similar and one is quite different, the different one gets a tiny extra nudge.
  • 🍞 Anchor: In science fair judging, originality helps—after you pass basic correctness.
  4. Global-to-Local Coupling via Softmax
  • 🍞 Hook: Spend practice time where it grows the repertoire most.
  • 🥬 What happens: Turn the (clipped) global diversity scores of correct solutions into weights using a softmax. Higher score → bigger weight. These weights decide how much local (token-level) exploration to apply to each correct solution. Why this step exists: Not all correct solutions are equally valuable to explore. Focusing on more distinctive ones expands coverage efficiently. Example: If 3 correct solutions have distinctiveness 0.2, 0.3, 0.8, the 0.8 case gets the most local exploration.
  • 🍞 Anchor: Choir practice gives more time to the pieces that add variety to the concert.
  5. Local Positive-Sample Token Entropy (length-invariant)
  • 🍞 Hook: Keep each good plan from becoming stiff.
  • 🥬 What happens: Compute average per-token entropy along each correct trajectory and encourage a modest increase, weighted by the coupling above. This is done in an off-policy-friendly way, so we reuse the sampled tokens and apply standard importance weighting under the current policy. Why this step exists: Prevents overconfidence within a mode, preserves nearby correct variants, and avoids a length bias. Example: If a correct derivation can say “Thus” or “Hence,” the model doesn’t lock into just one phrasing.
  • 🍞 Anchor: In sports, even with a winning play, you practice alternatives so the team stays adaptable.
  6. Integrate with GRPO Objective and Update Policy
  • 🍞 Hook: Combine the team score with targeted drills.
  • 🥬 What happens: The final objective is GRPO (using the diversity-augmented rewards for group normalization) plus the weighted local entropy term on correct samples. The model updates to increase the chances of diverse, correct solutions while keeping each solution mode healthy. Why this step exists: It unifies broad exploration (global) and stable flexibility (local) into one update. Example: The optimizer moves probability mass toward multiple correct modes and avoids making any single mode brittle.
  • 🍞 Anchor: A coach evaluates scrimmages (global performance) and schedules targeted exercises (local flexibility) before the next game.
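The shaping and coupling steps above can be condensed into a deliberately tiny sketch (all names, the 50/50 semantic/formula mix, and the coefficients are our illustrative assumptions; the paper's actual encoder, formula extractor, and values differ):

```python
import math

def cosine_dist(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dsdr_shaping(correct, embeds, formulas, lam_d=0.1, tau=1.0):
    """Toy correct-only shaping: score each correct solution by its mean
    embedding distance to the other correct ones plus the fraction of its
    formulas unseen elsewhere, clip to [0, 1], add a small bonus to the
    reward, and softmax the scores into local-entropy weights."""
    idx = [i for i, c in enumerate(correct) if c]
    scores = {}
    for i in idx:
        sem = (sum(cosine_dist(embeds[i], embeds[j]) for j in idx if j != i)
               / max(len(idx) - 1, 1))
        seen_elsewhere = set().union(*[formulas[j] for j in idx if j != i])
        form = len(formulas[i] - seen_elsewhere) / max(len(formulas[i]), 1)
        scores[i] = min(max(0.5 * sem + 0.5 * form, 0.0), 1.0)
    rewards = [c + lam_d * scores.get(i, 0.0) for i, c in enumerate(correct)]
    z = sum(math.exp(scores[i] / tau) for i in idx)
    weights = {i: math.exp(scores[i] / tau) / z for i in idx}
    return rewards, weights

# Tiny group: 2 correct of 3, one of which also uses a recurrence
# formula no other group member has.
correct = [1, 1, 0]
embeds = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
formulas = [set(), {"a_n = a_{n-1} + 2"}, set()]
rewards, weights = dsdr_shaping(correct, embeds, formulas)
# The formula-bearing solution earns the larger bonus and the larger
# local-entropy weight; the incorrect solution gets no bonus at all.
```

Note how the incorrect trajectory is excluded everywhere: it contributes nothing to the diversity scores, receives no bonus, and gets no entropy weight, which is the "correct-only" alignment in miniature.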

Concrete Mini-Example:

  • Prompt: “Find the expected steps for a random walk to reach the opposite cube corner.”
  • Group of 8 solutions: 2 are correct (one uses a symmetry/state argument, the other sets up and solves linear equations); 6 are incorrect.
  • Diversity: semantically, the embeddings of the two correct solutions are far apart; on formulas, only the equation-based solution includes a specific recurrence system that the others don't.
  • Reward shaping: Both correct get base reward 1; the more distinctive one gets a small extra (clipped) bonus.
  • Coupling: The distinctive one gets a higher weight for local entropy.
  • Update: Next round, the model grows probability mass around both strategies and keeps each flexible.
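The coupling in this example can be made numeric with a quick softmax, reusing the distinctiveness scores 0.2, 0.3, 0.8 from the methodology step (temperature 1 is our choice for illustration):

```python
import math

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax turning distinctiveness scores
    into local-exploration weights that sum to 1."""
    exps = [math.exp(s / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([0.2, 0.3, 0.8])
print([round(w, 3) for w in weights])  # [0.255, 0.281, 0.464]
```

The most distinctive correct solution gets almost half the exploration budget, but the other two still get a meaningful share, so no correct mode is starved.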

The Secret Sauce (why DSDR stands out):

  • Correct-only: Exploration stays aligned with success.
  • Dual-scale: Prevents collapse across modes (global) and within modes (local).
  • Coupling: Invests exploration budget where it grows the frontier fastest.
  • Boundedness: Keeps correctness optimality intact and learning signals strong in group-based RL.

04Experiments & Results

The Test (what they measured and why):

  • Pass@1: Can the model get it right in one shot? Shows top-guess quality.
  • Avg@16: With 16 tries, how often are answers correct on average? Shows distribution quality and stability.
  • Pass@k (k up to 64): As we allow more tries, does performance keep improving? Shows depth and breadth of correct solution modes.

The Competition (baselines):

  • Backbone (no RL): the starting model.
  • GRPO: group-relative RL without special diversity shaping.
  • DAPO: a strong open-source RL system.

Scoreboard with Context:

  • Small model (Qwen2.5-Math-1.5B): DSDR achieved the best average across benchmarks (about 25.4 Pass@1 and 25.6 Avg@16). Think of this like going from a C+ to a solid B across tough math quizzes compared to classmates.
  • Medium model (Qwen3-1.7B): DSDR reached roughly 36.8/36.8 (Pass@1/Avg@16), noticeably ahead of GRPO and DAPO. That’s like scoring an A- when others are at B or B+.
  • Large model (Qwen3-4B): DSDR hit about 48.0/46.8 (Pass@1/Avg@16), again the best overall. As models grow, DSDR’s gains grow too, especially on difficult AIME and Olympiad-style problems.

Pass@k Findings (why it’s meaningful):

  • As k increases, DSDR’s lead widens on hard sets (AIME2024/2025, Olympiad). This shows samples are not just sharper—they are more diverse in correct ways, so more tries really help.
  • On Minerva (hard, sparse-correctness), DSDR still maintains an advantage, suggesting better exploration even when rewards are scarce.
  • On MATH500 (already high baseline), DSDR adds steady improvements without collapsing at large k.
  • Importantly, no drop-offs at high k: diversity is happening among correct solutions, not drifting into noise.

Surprising/Notable Observations:

  • Training Dynamics: Plain GRPO/DAPO show low, flat entropy, indicating limited exploration. DSDR balances entropy: not too high (random), not too low (stuck). Removing global diversity (w/o GD) can push entropy too high (noisy), while removing the coupling (w/o GC) shrinks later exploration (over-concentration). DSDR also keeps semantic and formula-level similarity lower across training (solutions stay more distinct from one another), indicating sustained multi-path exploration.
  • Human-like Diversity Judging: Using an LLM-as-a-judge to rate diversity (1–10), DSDR scores higher diversity and higher pass@32 than DAPO. This supports that DSDR’s diversity is high-quality, not random.
  • Ablations: Removing global diversity or the global-to-local coupling drops performance consistently across model sizes and benchmarks. This shows both ingredients are essential.
  • Hyperparameters: A moderate local entropy coefficient (e.g., λℓ ≈ 0.001) and a small global bonus (λd ≈ 0.001) are stable. Too large an entropy coefficient destabilizes training, as expected.

Plain-English Takeaway:

  • DSDR reliably improves “first try” quality, “average over many tries,” and “best-of-k” performance. It does so by growing multiple correct solution paths and keeping each one flexible, which is exactly what you want when problems vary and you have several attempts.

05Discussion & Limitations

Limitations (be specific):

  • Needs a verifier: RLVR depends on reliable correctness checks. If the verifier is weak or noisy, the signals can mislead learning.
  • Hyperparameter tuning: The strengths of global (λd), local (λℓ), and the softmax temperature (τ) matter. While a small range works well in the paper, other domains may need tuning.
  • Diversity signals depend on tools: The semantic encoder and formula extractor shape what “diverse” means. If these tools miss nuances in a new domain (e.g., legal reasoning), the signal could be less helpful.
  • Computation: Group-based RL with 8 rollouts and long contexts is resource-intensive.
  • Domain scope: The paper focuses on math/problem solving. Applying DSDR to open-ended dialogue or creative writing may need task-specific adjustments and validators.

Required Resources:

  • A group-based RLVR pipeline (e.g., GRPO implementation) and a verifier for the target tasks.
  • Lightweight text encoder for semantic similarity and a formula/structure extractor where applicable.
  • Enough compute to sample groups per prompt and run long-context generations.

When NOT to Use:

  • Tasks without clear or checkable correctness (e.g., purely subjective writing) where a verifier is unavailable or unreliable.
  • Settings where diversity among wrong answers is dangerous (e.g., critical instructions) and no strong correctness filter exists.
  • Extremely short outputs where local entropy either dominates or is irrelevant.

Open Questions:

  • Beyond math: How to best design global diversity signals in domains without formulas (e.g., plans, code architectures, proofs in other fields)?
  • Adaptive tools: Can the semantic encoder or structure extractor adapt during training without reward hacking?
  • Multi-turn settings: How to measure and reward diversity across multi-step dialogues while ensuring end-task success?
  • Verifier robustness: How to handle partial credit or continuous rewards while keeping the benefits of correct-only shaping?
  • Efficiency: Can we reduce the number of rollouts per group or reuse off-policy data more effectively without losing gains?

06Conclusion & Future Work

Three-Sentence Summary:

  • DSDR improves LLM reasoning by encouraging diversity at two scales: it rewards different correct solution paths (global) and keeps each correct path flexible word-by-word (local).
  • A softmax coupling focuses local exploration on the most distinctive correct solutions, and theory shows these bounded bonuses preserve correctness and stabilize group-based learning signals.
  • Experiments across math benchmarks and model sizes show consistent gains in Pass@1, Avg@16, and Pass@k, especially as k grows.

Main Achievement:

  • A correctness-aligned, dual-scale, and tightly coupled exploration strategy that reliably turns many similar correct attempts into a garden of distinct, healthy solution modes.

Future Directions:

  • Extend diversity measures to domains like coding patterns, planning graphs, or multi-turn dialogues.
  • Learn task-specific structure extractors that remain robust to reward hacking.
  • Improve efficiency with smarter sampling, off-policy tricks, or smaller groups without sacrificing signal quality.

Why Remember This:

  • DSDR shows that “explore more” only works when you explore the right things. By rewarding diverse correct paths and keeping each path flexible, it turns exploration into a dependable engine for better reasoning—today and as problems evolve.

Practical Applications

  • Train math-tutor LLMs to present multiple correct solution methods (algebraic, geometric, numerical) for the same problem.
  • Improve code assistants by encouraging diverse debugging and refactoring strategies that still compile and pass tests.
  • Enhance automated theorem-proving by cultivating several distinct proof paths and keeping each path adaptable.
  • Boost planning agents (e.g., study plans, logistics) by growing multiple correct plans and avoiding brittle single routes.
  • Strengthen self-reflection pipelines where the model proposes and evaluates different correct reasoning drafts.
  • Increase reliability in pass@k settings (e.g., generate-then-rank) by ensuring samples are diverse and correct-leaning.
  • Develop classroom tools that show students varied, correct explanations, supporting deeper understanding and transfer.
  • Improve data labeling for reasoning tasks by sampling diverse correct rationales, enriching training corpora.
  • Support scientific assistants in proposing multiple plausible models or hypotheses while maintaining correctness checks.
  • Stabilize RLVR training runs by preserving group-level learning signals even when many samples are already correct.
#DSDR#dual-scale diversity#RLVR#GRPO#reasoning LLMs#entropy regularization#trajectory diversity#global-to-local coupling#pass@k#semantic embedding#formula uniqueness#group-based optimization#correct-only shaping#length-invariant entropy#policy exploration