
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Intermediate
Yuanda Xu, Hejian Sang, Zhengze Zhou et al. · 2/24/2026
arXiv

Key Summary

  • The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.
  • They introduce ACE, a new penalty that hits overconfident wrong answers harder while leaving honest exploration mostly alone.
  • ACE uses a simple confidence score for each full answer (c = log of policy probability over reference probability) to decide how much extra penalty to add.
  • Mathematically, ACE behaves like a selective reverse-KL regularizer that only pushes down overconfident mistakes, plus a small extra term that tempers the push.
  • ACE plugs into GRPO and DAPO with a one-line change to the advantage for wrong rollouts and almost no extra compute.
  • Across three model families (Qwen2.5-Math-7B, Qwen3-8B-Base, Llama-3.1-8B-Instruct), ACE consistently improves Pass@k, especially for larger k.
  • On MATH-500, ACE-DAPO reaches 96.1% Pass@32, beating strong DAPO baselines; on AIME 2025, ACE also lifts large-k performance.
  • ACE reduces the fraction and severity of overconfident errors during training and slows harmful entropy collapse, preserving useful diversity.
  • Softplus works better than ReLU for the confidence modulation because it is smooth and gently differentiates borderline cases.
  • Limitations include reliance on a good reference model, current focus on binary rewards, and potential adjustments needed for very long chains of thought.

Why This Research Matters

ACE helps language models reason better in a way people actually use them: by trying several solutions and picking a good one. It keeps the creative, promising paths alive while turning down the volume on loud-but-wrong ones, which translates to higher success rates when you sample multiple outputs. Because ACE is simple to add, cheap to run, and works with popular algorithms like GRPO and DAPO, practitioners can adopt it quickly in real training pipelines. The method’s selective nature also makes training more efficient by focusing effort where it counts—on correcting the most harmful mistakes. Beyond math, the idea of asymmetric, confidence-aware penalties could benefit coding assistants, scientific problem solvers, and tutoring systems. In short, it’s a small change with broad, practical impact: smarter correction, healthier diversity, and better outcomes.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how in class, sometimes a student gives a wrong answer very confidently, and everyone starts to believe it—even though it’s wrong? If the teacher doesn’t correct that more strongly than a simple mistake, the whole class can get stuck on the wrong idea.

🄬 The Concept (Reinforcement Learning with reasoning AIs):

  • What it is: Reinforcement Learning with Verifiable Rewards (RLVR) trains language models to reason better by giving them a thumbs-up for correct final answers and a thumbs-down for incorrect ones.
  • How it works:
    1. The model reads a math problem and writes a full chain-of-thought (a step-by-step solution).
    2. A verifier checks if the final answer is correct (binary reward: 1 or 0).
    3. The learning algorithm increases the chance of good chains and lowers the chance of bad ones.
    4. A ā€œreference modelā€ gently keeps the student from drifting too far away via a KL penalty.
  • Why it matters: Without this setup, models might babble or latch onto weird shortcuts. RLVR has delivered big Pass@1 (first-try correctness) gains, which made models feel much more reliable.

šŸž Bottom Bread (Anchor): Imagine practicing long-division with an answer key. Every time you get it right, you repeat those steps more. If wrong, you try to do less of that path next time. RLVR is that practice routine for AIs.

šŸž Top Bread (Hook): Imagine a class where only the first hand raised gets attention. Over time, fewer students speak up, even if they might have better ideas.

🄬 The Concept (Diversity collapse and Pass@k):

  • What it is: Pass@k measures if at least one of k tries is right; diversity collapse is when the model focuses on too few solution paths, hurting large-k performance.
  • How it works:
    1. RLVR makes the most confident pattern more likely (great for Pass@1).
    2. But if all wrong answers are punished the same, some bad-but-confident routes survive and hog probability.
    3. This squeezes out other creative or correct paths, lowering the chance that any of the k samples is right.
  • Why it matters: Without diversity, Pass@k falls below even the base model’s, meaning the AI is worse at exploring many reasoning paths.

šŸž Bottom Bread (Anchor): It’s like a soccer team that only practices one attack play. It scores sometimes (Pass@1), but when defenses adapt, having more plays (diverse reasoning) would win more games across multiple attempts (Pass@k).

šŸž Top Bread (Hook): You know how teachers don’t grade all mistakes the same? A careless arithmetic slip is different from confidently insisting 2+2=5.

🄬 The Concept (Uniform penalties are the problem):

  • What it is: Most current methods punish all wrong answers equally inside a group, ignoring how confident the model was in those answers.
  • How it works:
    1. Compute a single ā€œnegative advantageā€ for all wrong rollouts in a group.
    2. Apply equal push-down pressure regardless of whether the model is exploring or doubling down.
    3. Global KL pulls everyone back evenly, including good exploration and bad overconfidence.
  • Why it matters: Without telling mistakes apart, overconfident wrong paths keep their power and block healthy exploration.

šŸž Bottom Bread (Anchor): If a quiz grader marks all wrong answers as -1 point, the student who guessed kindly and the student who stubbornly insisted on a false rule get the same penalty. The stubborn idea sticks around.

šŸž Top Bread (Hook): Picture three kinds of wrong answers: curious guesses, fading fads, and loud-but-wrong myths.

🄬 The Concept (Three error regimes):

  • What it is: Wrong rollouts fall into exploratory errors (probability similar to reference), self-correcting errors (already going down), and overconfident errors (probability growing).
  • How it works:
    1. Exploratory errors: normal ā€œtry and learn.ā€
    2. Self-correcting errors: already being reduced.
    3. Overconfident errors: getting stronger because training rewarded them by mistake.
  • Why it matters: Treating these equally wastes penalties on errors already shrinking and doesn’t push hard enough on the harmful overconfident ones.

šŸž Bottom Bread (Anchor): It’s like coaching: don’t scold a player who’s already fixing their form; do step in when a player keeps proudly practicing a harmful technique.

šŸž Top Bread (Hook): Imagine a thermometer that says how much hotter a new stove is than your old one. That difference matters more than the temperature alone.

🄬 The Concept (Confidence shift):

  • What it is: Confidence shift c = log(Ļ€_policy/Ļ€_ref) measures how much more (or less) confident the new model is than the reference for a specific full answer.
  • How it works:
    1. Compute the model’s total log-probability of a rollout.
    2. Subtract the reference model’s log-probability for the same rollout.
    3. Positive means ā€œmore confident nowā€; negative means ā€œless confident now.ā€
  • Why it matters: This single number flags overconfident errors precisely—the ones we must correct harder.

šŸž Bottom Bread (Anchor): If your new calculator gives 12 as the answer and your old reliable one said 8, the ā€œshiftā€ tells you the new one’s getting bolder. If the answer’s wrong, time to check that confidence.

šŸž Top Bread (Hook): Imagine a gentle spring that pulls you back toward a familiar center no matter which direction you wander.

🄬 The Concept (KL divergence in this setting):

  • What it is: KL divergence measures how far the current model’s outputs drift from a reference model.
  • How it works:
    1. Compare probabilities of outputs under the policy vs. reference.
    2. Add a symmetric penalty for any difference.
    3. Keep the model from drifting too far overall.
  • Why it matters: Helpful safety belt, but it equally shrinks good exploration and bad overconfidence—too blunt for our problem.

šŸž Bottom Bread (Anchor): It’s like a leash that limits how far a puppy wanders—useful, but it can also stop the puppy from exploring the backyard’s safe corners.

šŸž Top Bread (Hook): Think of a sports season where the team learns to win while still trying new plays—they need guidance that’s targeted, not one-size-fits-all.

🄬 The Concept (The gap this paper fills):

  • What it is: We need a penalty that is asymmetric—stronger for overconfident wrong answers, lighter for exploratory or shrinking mistakes.
  • How it works:
    1. Detect overconfidence with the confidence shift.
    2. Multiply the usual negative push by a factor that grows with this shift.
    3. Keep the rest of training the same.
  • Why it matters: This preserves good exploration, reduces harmful traps, and expands the model’s reasoning boundary.

šŸž Bottom Bread (Anchor): It’s like a teacher who says, ā€œI won’t punish brave guesses much, but if you loudly insist on a false rule, we’ll correct that strongly so the class stays on track.ā€

02Core Idea

šŸž Top Bread (Hook): You know how a referee gives tougher fouls for dangerous plays than for small bumps? The goal is fairness plus safety, not just equal treatment.

🄬 The Concept (ACE in one sentence):

  • What it is: ACE is a confidence-aware penalty that multiplies the negative advantage for wrong answers by 1 + α·Softplus(c), so overconfident errors get hit harder.
  • How it works:
    1. For each wrong rollout, compute its confidence shift c = log(Ļ€_policy/Ļ€_ref).
    2. Compute Softplus(c), which smoothly grows as c grows.
    3. Multiply the original negative advantage by (1 + α·Softplus(c)).
    4. Keep the rest (correct answers, clipping, KL) the same.
  • Why it matters: Without ACE, bad but confident paths hog probability and squeeze out diverse, correct ideas at larger sampling budgets.

šŸž Bottom Bread (Anchor): If two students are wrong—one unsure, one loudly confident—ACE tells the teacher to correct the loud one more so the class’s future answers stay healthy and varied.

Three different analogies for the same idea:

  1. Classroom analogy: Exploratory mistakes = curious questions (light correction). Self-correcting mistakes = already fading habits (tiny correction). Overconfident mistakes = loudly repeated myths (strong correction).
  2. Garden analogy: Don’t pull every weed equally. Yank out fast-spreading weeds (overconfident errors) quickly, but let new sprouts (exploration) grow until you know what they are.
  3. Traffic analogy: A gentle speed warning for going 2 mph over (exploratory), almost none if you’re braking already (self-correcting), but a firm ticket for reckless speeding (overconfident mistake).

Before vs After:

  • Before: Uniform penalties treat all mistakes the same, so some wrong paths become entrenched and diversity collapses.
  • After: ACE punishes overconfident wrong paths more, preserving useful exploration and improving Pass@k at larger k.

Why it works (intuition without equations):

  • Overconfident errors are the real troublemakers because training is accidentally feeding them probability.
  • ACE selectively reduces those probabilities more, like targeted pruning.
  • Since ACE mostly leaves exploratory and self-correcting errors alone, the model keeps a broader set of reasoning paths alive.
  • The theory shows ACE acts like a ā€œreverse-KLā€ pressure focused only on overconfident wrong outputs (plus a tempering residual), so it suppresses the right region instead of everywhere.

Building blocks (each with a mini sandwich):

šŸž Hook: Imagine checking how much louder a new microphone is than the old one. 🄬 Confidence shift c:

  • What it is: A score showing how much more (or less) confident the new policy is than the reference for one whole answer.
  • How it works: Take log-prob under policy minus log-prob under reference; positive means ā€œmore confident now.ā€
  • Why it matters: It’s our detector for overconfident errors. šŸž Anchor: If the policy is 4x likelier than the reference to produce a wrong solution, that’s c > 0 and a red flag.

šŸž Hook: Ramps help wheels roll smoothly; steps cause jolts. 🄬 Softplus:

  • What it is: A smooth function that grows gently from near 0 into a linear rise for positive inputs.
  • How it works: For negative c, Softplus(c) ā‰ˆ 0 (tiny extra penalty); for large positive c, Softplus(c) ā‰ˆ c (strong extra penalty).
  • Why it matters: Smooth signals mean steadier learning than a sharp cutoff. šŸž Anchor: Instead of a cliff at c = 0, Softplus is a ramp, so borderline cases aren’t jerked around.
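
To see why the ramp matters, a small comparison of ReLU and Softplus around c = 0 (illustrative only):

```python
import math

def softplus(x):
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def relu(x):
    return max(x, 0.0)

# Near c = 0, ReLU switches off abruptly while Softplus ramps smoothly,
# so borderline rollouts get gently graded penalties instead of a cliff.
for c in [-0.5, -0.1, 0.0, 0.1, 0.5]:
    print(f"c={c:+.1f}  relu={relu(c):.3f}  softplus={softplus(c):.3f}")
```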

šŸž Hook: A whistle that only blows when the dangerous play happens. 🄬 Selective regularization:

  • What it is: ACE’s gradient matches the push of a reverse-KL-like regularizer, but only on overconfident wrong outputs.
  • How it works: The math decomposition shows ACE equals the main selective push plus a small residual; this keeps training stable.
  • Why it matters: It targets the bad region (overconfident errors) without shrinking the good ones. šŸž Anchor: Instead of lowering every voice in a choir, ACE turns down only the off-key loud singer.

šŸž Hook: Many tries beat one try when exploring mazes. 🄬 Pass@k improvement:

  • What it is: A better chance that at least one of k attempts finds the right answer.
  • How it works: By keeping more diverse, promising paths alive, the model’s k samples cover more ground.
  • Why it matters: Real-world users often rely on sampling more than once; breadth wins. šŸž Anchor: With ACE, drawing 32 solutions is like having 32 scouts explore different trails, not 32 copies of the same trail.

03Methodology

At a high level: Prompt → Generate multiple rollouts → Verify each rollout (reward) → Compute advantages → Apply ACE to wrong rollouts only → Optimize policy.

We’ll introduce each step like a recipe and explain why it exists.

Step 1: Make several guesses (rollouts)

  • What happens: For each problem (prompt), the model generates G complete chains of thought (candidate solutions).
  • Why it exists: We need options. If we only had one guess, we couldn’t tell which patterns to encourage or discourage.
  • Example: For a geometry question, the model writes 8 different solution attempts with different diagrams and algebra steps.

Step 2: Check answers with a verifier (binary rewards)

  • What happens: A rule-based or programmatic checker gives 1 if the final answer matches and 0 otherwise.
  • Why it exists: We need a reliable, simple signal. Binary is easy to trust and use.
  • Example: If the answer is 42, only rollouts that output 42 get reward 1; all others get 0.

Step 3: Compute group-normalized advantages (GRPO)

  • What happens: Inside each prompt’s group, we compute an average and standard deviation of rewards and turn each rollout’s reward into an advantage (how much better or worse than average).
  • Why it exists: Normalization makes the learning signal stable across mixed-easy and mixed-hard prompts.
  • Example: If 2 out of 8 are correct, those two get positive advantage; the 6 wrong ones get the same negative advantage under standard GRPO.
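
A sketch of group-normalized advantages, assuming the population standard deviation with a small epsilon for numerical safety (implementations vary on these details):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps),
    computed within the group of rollouts for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 2 of 8 rollouts correct: the two winners share one positive advantage,
# the six losers share one uniform negative advantage (≈ +1.73 / -0.58).
adv = grpo_advantages([1, 1, 0, 0, 0, 0, 0, 0])
```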

Step 4: Compute confidence shift c per rollout

  • What happens: For each full rollout, measure c = log(policy probability) – log(reference probability).
  • Why it exists: This tells us whether the policy has grown more confident than the reference about this exact answer.
  • Example: If a wrong rollout became 3x more likely than before, c is positive and fairly large.

Step 5: Apply ACE only to wrong rollouts

  • What happens: For wrong rollouts, we multiply the negative advantage by (1 + α·Softplus(c)). For correct rollouts, we leave advantages unchanged.
  • Why it exists: This is the heart of the method—overconfident wrong answers get a stronger push down; exploratory or shrinking wrong answers get almost the base push.
  • Example: Three wrong rollouts with c = -2, 0, 2 get multipliers of about 1.13, 1.69, and 3.13 respectively (with α = 1), creating fine-grained penalties.

Step 6: Clipping and KL (unchanged)

  • What happens: Keep GRPO’s clipping to avoid too-big updates, and keep a small global KL penalty to prevent drift.
  • Why it exists: Stability and safety. ACE is a targeted tweak, not a full rewrite.
  • Example: Even after ACE, the importance ratio is clipped to a safe range, and a gentle KL pulls the model toward the reference.
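
A sketch of the clipped surrogate for one token, using the common PPO-style ε = 0.2 (the paper's exact clipping ranges may differ):

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate (to be maximized): take the
    more pessimistic of the raw and clipped importance-weighted terms."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped at 1.2 * 0.5:
print(clipped_objective(1.8, 0.5))   # 0.6
# With a negative (ACE-scaled) advantage, min keeps the raw, more
# negative term, so the push-down is not weakened by clipping:
print(clipped_objective(1.8, -1.5))  # -2.7
```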

Step 7: Update the model

  • What happens: Use the combined loss to compute gradients and run one optimizer step.
  • Why it exists: This is how the model learns from rewards.
  • Example: After each batch of prompts and rollouts, parameters shift so good paths are more likely and bad overconfident paths are less likely.

What breaks without each step:

  • Without multiple rollouts: No sense of diversity; can’t learn which paths to keep.
  • Without verification: No ground truth to guide learning.
  • Without group-normalized advantages: Instability across prompts of different difficulty.
  • Without confidence shift: No way to spot overconfident wrong answers.
  • Without ACE modulation: Uniform penalties; overconfident errors persist.
  • Without clipping/KL: Risk of unstable jumps and reward hacking.
  • Without optimizer step: No learning.

Concrete data walkthrough:

  • Suppose a prompt yields 8 rollouts; 2 are correct (reward 1), 6 are wrong (reward 0).
  • Standard GRPO assigns the same negative advantage to all 6 wrong rollouts.
  • Now compute c for each wrong rollout: say [-1.5, -0.2, 0.0, 0.3, 1.5, 2.0].
  • Softplus(c) is small for negative numbers and grows for positive numbers.
  • ACE multiplies each negative advantage by factors like [1.20, 1.60, 1.69, 1.85, 2.70, 3.13].
  • Result: the two overconfident wrong rollouts (1.5 and 2.0) get much stronger push-down, freeing probability for other paths.
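
The multipliers in this walkthrough can be reproduced in a few lines (assuming α = 1):

```python
import math

def softplus(x):
    # Numerically stable log(1 + e^x).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

c_values = [-1.5, -0.2, 0.0, 0.3, 1.5, 2.0]  # confidence shifts of the 6 wrong rollouts
multipliers = [1.0 + softplus(c) for c in c_values]
print([round(m, 2) for m in multipliers])
# → [1.2, 1.6, 1.69, 1.85, 2.7, 3.13]
```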

The secret sauce (why this is clever):

  • It is rollout-level and confidence-aware: it looks at whole trajectories, not just tokens.
  • It is selective: only wrong answers get modulated, and only strongly if they’re overconfident.
  • It’s nearly free: c is already computed for KL; we add just one Softplus per wrong rollout.
  • It composes with other methods (e.g., DAPO) because it works at a different level (trajectory vs. token).

Mini sandwich for GRPO and advantage: šŸž Hook: Think of grading on a curve within the same class period. 🄬 What it is: GRPO gives each answer a score relative to the group’s average so updates don’t blow up.

  • How it works: Compute mean and spread; scale each reward accordingly.
  • Why it matters: Keeps training stable and fair across prompts. šŸž Anchor: If 2 of 8 rollouts are right, those two get positive scores; the rest get equal negative scores—before ACE steps in to differentiate.

Mini sandwich for rollouts: šŸž Hook: Multiple drafts of an essay often produce a better final piece. 🄬 What it is: Independent full attempts from the model for the same prompt.

  • How it works: Sample G chains-of-thought per prompt.
  • Why it matters: More attempts = more chances to find a solid reasoning path. šŸž Anchor: For a tricky algebra problem, 8 different lines of reasoning give you a better shot than just 1.

Mini sandwich for reference model: šŸž Hook: A reliable map helps you not get lost while exploring new roads. 🄬 What it is: A stable baseline policy used to measure drift and compute c.

  • How it works: Compare policy probabilities against this reference to see shifts.
  • Why it matters: Defines overconfidence consistently. šŸž Anchor: If your new directions differ wildly from the trusted map, proceed carefully—especially if you keep arriving at the wrong destination.

04Experiments & Results

šŸž Top Bread (Hook): Imagine testing a new coaching rule across three different teams and two leagues to see if it truly helps win more games, not just look good in practice.

🄬 The Concept (What they measured and why):

  • What it is: They measured Pass@k on two math benchmarks (MATH-500 and AIME 2025) for k in {1,2,4,8,16,32}.
  • How it works: For each question, sample k full solutions; Pass@k is the chance at least one is correct.
  • Why it matters: Pass@1 shows immediate accuracy; large-k shows exploration breadth and reasoning boundary.

šŸž Anchor: If one coin toss (k=1) misses, try more tosses (k bigger). If your ā€˜tosses’ are diverse, odds go up faster.

The competition (baselines):

  • Base model: pre-trained only.
  • GRPO: standard group-relative RLVR.
  • DAPO: a strong method using token-level clipping (Clip-Higher) and other tricks to preserve diversity.
  • ACE-GRPO and ACE-DAPO: our method added to GRPO or to DAPO, respectively.

Scoreboard with context:

  • Qwen2.5-Math-7B on MATH-500:
    • GRPO Pass@32: 91.3%; ACE-GRPO: 94.3% (like moving from a solid A- to A).
    • DAPO Pass@32: 94.6%; ACE-DAPO: 96.1% (a further bump into A+ territory).
  • Qwen3-8B-Base on MATH-500:
    • GRPO Pass@32: 88.6%; ACE-GRPO: 91.1% (+2.5pp).
    • DAPO Pass@32: 90.4%; ACE-DAPO: 91.6% (+1.2pp).
  • Llama-3.1-8B-Instruct on MATH-500:
    • GRPO Pass@32: 79.3%; ACE-GRPO: 81.5% (+2.2pp).
    • DAPO Pass@32: 80.4%; ACE-DAPO: 82.1% (+1.7pp).
  • AIME 2025 shows the same pattern, with ACE lifting large-k even for a weaker base model like Llama-3.1-8B-Instruct.

Surprising or notable findings:

  • ACE preserves Pass@1 while significantly boosting large-k. This means it doesn’t sacrifice immediate accuracy to get better breadth.
  • ACE plus DAPO still helps, though the gains are smaller than over GRPO. That’s expected: DAPO already protects diversity at the token level; ACE adds trajectory-level selectivity.
  • ACE reduces both the fraction and the magnitude of overconfident errors during training, confirming the proposed mechanism.
  • Entropy analysis shows ACE slows the harmful early collapse of token-level entropy, which correlates with better large-k performance.

Mini sandwich for Pass@k: šŸž Hook: If you roll more dice, your chance to see a six goes up—if your dice aren’t all the same. 🄬 What it is: The probability that at least one of k samples is correct.

  • How it works: Draw k full solutions; check if any is right.
  • Why it matters: Captures how wide your reasoning boundary is. šŸž Anchor: With diverse drafts, by k=32 you usually have at least one that nails the solution.

Mini sandwich for entropy/diversity: šŸž Hook: A playlist with many genres is less likely to bore you than one song on repeat. 🄬 What it is: Entropy measures how spread out the model’s choices are.

  • How it works: Higher entropy = more varied token picks; lower entropy = more concentrated.
  • Why it matters: Too-low entropy too-early means mode collapse: fewer paths, worse Pass@k. šŸž Anchor: ACE keeps more variety in early training, which shows up as better large-k results.
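
A toy illustration of token-level entropy, the quantity whose early collapse ACE slows:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A near-collapsed policy vs. a diverse one over 4 candidate tokens:
print(round(token_entropy([0.97, 0.01, 0.01, 0.01]), 3))  # 0.168 (near mode collapse)
print(round(token_entropy([0.25, 0.25, 0.25, 0.25]), 3))  # 1.386 (= log 4, max variety)
```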

Takeaway from the scoreboard:

  • ACE is consistently helpful across different model families and training recipes.
  • The biggest wins show up where we care most about exploration: larger k.
  • ACE’s mechanism is robust—it improves metrics and the underlying training dynamics (fewer overconfident errors, healthier entropy).

05Discussion & Limitations

Limitations:

  • Reference dependence: ACE defines overconfidence relative to a reference model. If that reference is poorly calibrated, c can mislabel which errors are truly overconfident.
  • Binary rewards focus: The paper targets 0/1 rewards. Extending ACE to graded or process-level rewards needs careful definitions of what counts as an overconfident ā€œerrorā€ versus a partial success.
  • Very long chains: For extremely long chains of thought, sequence-length normalization of c may need refinement to avoid over- or under-penalizing.

Required resources:

  • A reference model to compute c (already standard in RLVR for KL).
  • Verifier to supply binary rewards.
  • Usual RLVR compute for sampling multiple rollouts; ACE adds negligible overhead (one Softplus per wrong rollout).

When not to use:

  • If your task has no reliable verifier (can’t tell right from wrong), ACE can’t know which confident paths are harmful.
  • If the reference model is drastically mismatched (e.g., domain shift), c may be misleading.
  • If you already operate in a regime where diversity is abundant and overconfident errors are rare, ACE’s gains may be small.

Open questions:

  • How to adapt ACE beyond binary rewards to partial credit or step-level process rewards?
  • Can we learn or schedule α automatically based on observed overconfidence dynamics?
  • Would a learned, data-driven modulation function outperform Softplus while staying smooth and stable?
  • Can an adaptive or moving-reference scheme make c more robust when the static reference is weak?
  • How does ACE behave with very long reasoning models (e.g., >10k tokens) and in non-math domains like coding or scientific QA?

06Conclusion & Future Work

Three-sentence summary:

  • The paper identifies a hidden training bug in RLVR: overconfident wrong answers soak up probability because all mistakes are penalized equally.
  • ACE fixes this by multiplying the negative advantage for wrong rollouts with a confidence-aware factor (1 + α·Softplus(c)), strongly targeting overconfident errors and preserving exploration.
  • The result is higher Pass@k across models and benchmarks, with theory linking ACE to a selective reverse-KL regularizer and experiments showing fewer overconfident errors and slower entropy collapse.

Main achievement:

  • A tiny, practical change—one line in the advantage for wrong rollouts—that consistently expands the reasoning boundary (large-k gains) without sacrificing Pass@1, and that composes cleanly with strong baselines like DAPO.

Future directions:

  • Extend ACE to continuous or process-level rewards; make α adaptive; explore moving references; and test across long-CoT and non-math domains.

Why remember this:

  • ACE turns a blunt, uniform penalty into a smart, asymmetric one that corrects the truly harmful mistakes more. It’s a small, principled tweak with big, repeatable benefits: more diversity where it counts, better odds that one of several tries gets the right answer, and a clearer path to robust reasoning in LLMs.

Practical Applications

  • Improve math tutoring models so they maintain diverse solution paths and find correct answers more often with multiple samples.
  • Enhance code-generation assistants by suppressing overconfident but buggy patterns while preserving exploratory refactorings.
  • Boost scientific reasoning systems that need multiple hypotheses, increasing the chance that at least one sample is correct.
  • Stabilize long chain-of-thought training by reducing early entropy collapse and keeping alternative reasoning routes alive.
  • Combine with existing RLVR stacks (GRPO, DAPO) to lift large-k performance with minimal engineering changes.
  • Monitor and reduce overconfident error fractions during training as a diagnostic for healthy exploration.
  • Use ACE in curriculum or staged training to aggressively correct spurious shortcuts that models latch onto.
  • Apply ACE to evaluation-time reranking pipelines by preferring less overconfident wrong paths among candidates.
  • Deploy ACE in domains with deterministic verifiers (logic puzzles, equation solving, program synthesis) for reliable gains.
  • Adopt ACE as a safety layer to avoid models becoming stubborn about incorrect reasoning patterns over time.
#ACE#Reinforcement Learning with Verifiable Rewards#GRPO#DAPO#Overconfident Errors#Confidence Shift#Softplus#Reverse KL#Diversity Collapse#Pass@k#Entropy#KL Divergence#Chain-of-Thought#LLM Reasoning#Selective Regularization