iGRPO: Self-Feedback-Driven LLM Reasoning
Key Summary
- This paper teaches a language model to improve its own math answers by first writing several drafts and then learning to beat its best draft.
- The method, called iGRPO, adds a simple two-stage loop on top of GRPO, a popular reinforcement learning method for language models.
- Stage 1 samples many drafts and picks the best one using the same reward used for training; Stage 2 feeds that best draft back into the model as guidance and trains the model to produce an even better answer.
- Under the same compute budget as GRPO, iGRPO wins across many math benchmarks and model sizes (7B, 8B, 14B).
- With a stronger model and bigger dataset, iGRPO reaches state-of-the-art scores on AIME24 (85.62%) and AIME25 (79.64%).
- Ablations show the idea works with other GRPO-style variants, also benefits from a generative judge, and delays harmful early overconfidence (entropy collapse).
- Training takes only about 13% more time and uses essentially the same GPU memory as GRPO.
- At test time, there is no extra draft step—the model answers in one shot, but it is better because of how it was trained.
- The core insight is dynamic self-conditioning: as the model gets better, the drafts it conditions on get better too, which keeps lifting the model’s reasoning.
Why This Research Matters
iGRPO shows that models can learn a general skill of “improve your own best attempt,” which is how people solve hard problems too. This raises accuracy on tough math tasks without adding any steps at test time, so users get faster, better answers. The idea transfers beyond math: coding with tests, planning multi-step tasks, and verifying scientific calculations can all benefit. Because it reuses simple scalar rewards, it fits current RL pipelines and doesn’t need heavy extra systems. The compute and memory overheads are small, making it practical for real training runs. As models handle more critical, checkable tasks (finance, engineering, education), reliable refinement becomes a key ingredient for trust and usefulness.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you write an essay, your first try isn’t perfect, so you make a rough draft, spot mistakes, and then fix them to make a better final version? Good writers improve by revising.
🥬 Filling (The Actual Concept):
- What it is: This paper is about teaching big language models to solve math problems better by letting them learn from their own drafts, just like we do when we revise essays.
- How it works (story of the field):
- Before: Language models got pretty good at chatting and answering simple questions, but tricky multi-step math often tripped them up—they’d make one small mistake near the end.
- Reinforcement Learning (RL) helped by giving models rewards for good answers, nudging them toward better behavior.
- PPO and then GRPO made RL training practical for huge models by simplifying the math and using group-based comparisons instead of training a separate value network.
- But even with RL, models usually tried to solve a whole problem in a single go—they didn’t pause, reflect, or refine like people do.
- Researchers tried add-ons like self-critique or self-verification, but those make the model juggle extra jobs (write a critique, verify, etc.) that distract from the main goal: getting the answer right.
- Why it matters: Without an easy way to learn from your own best attempt, a model can keep repeating the same kind of mistake. It’s like always turning in your first draft—good, but not your best.
🍞 Bottom Bread (Anchor): Imagine solving a long division problem. If you never check your work, a tiny slip ruins the final answer. A system that encourages “draft → improve” helps catch and fix slips before the final answer.
🍞 Top Bread (Hook): You know how coaches compare players on a team to decide who did better during practice? That comparison helps everyone improve.
🥬 The Concept (GRPO):
- What it is: Group Relative Policy Optimization (GRPO) is an RL training method where the model writes several answers to the same problem, scores them, and learns more from the better ones using group-relative normalization.
- How it works:
- Sample a small group of answers to the same prompt.
- Score each answer with a reward (like 1 if the final number matches the key, else 0).
- Normalize scores within the group (who did better than average).
- Update the model to make higher-scoring answers more likely, with safety guards so it doesn’t change too fast.
- Why it matters: Without group comparison, the model needs a separate value network (more complexity). GRPO stays simpler and stable for big models.
🍞 Anchor: It’s like trying 8 ways to solve a puzzle, grading each, and then learning mostly from the best few so next time you start closer to a winning path.
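The group-normalization step above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation; the function name and the small epsilon are my own choices.

```python
# Minimal sketch of GRPO's group-normalized advantage: score each answer
# relative to its own group, so above-average answers get positive
# advantages and below-average ones get negative advantages.
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize a group's rewards to zero mean, unit-ish scale."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 8 answers to one prompt, 3 correct (reward 1), 5 wrong (reward 0).
advs = group_normalized_advantages([1, 0, 1, 0, 0, 1, 0, 0])
# Correct answers share the same positive advantage; wrong ones are negative,
# and the advantages sum to zero across the group.
```

Because the advantages are relative, the model learns "be more like the winners of this group" even when the raw reward is a bare 0/1 signal.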
🍞 Top Bread (Hook): Think of a GPS that lets you try small detours but keeps you from driving way off course.
🥬 The Concept (PPO):
- What it is: Proximal Policy Optimization (PPO) is an RL method that improves a model but limits how far it can move each update.
- How it works:
- Take a snapshot of current behavior.
- Try actions, see rewards.
- Update the policy, but clip changes so updates are “not too big.”
- Repeat in steps.
- Why it matters: Without limits, training can wobble or crash. PPO’s “guardrails” keep learning steady.
🍞 Anchor: When practicing piano, you increase the speed a little at a time. Jumping from slow to super-fast at once makes you mess up.
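The "not too big" clipping guardrail can be shown for a single action with scalar probabilities. This is the standard PPO clipped surrogate in toy form, assuming a hypothetical scalar setup rather than real token-level log-probabilities.

```python
# Minimal sketch of PPO's clipped surrogate objective for one action.
def ppo_clipped_objective(new_prob, old_prob, advantage, clip_eps=0.2):
    """Cap how much one update can change the policy: the probability
    ratio is clipped to [1 - clip_eps, 1 + clip_eps], and we take the
    more pessimistic (smaller) of the two surrogate values."""
    ratio = new_prob / old_prob
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# A large jump (0.9 vs 0.3 -> ratio 3.0) is capped at ratio 1.2 when the
# advantage is positive, so the incentive stops growing past the boundary.
obj = ppo_clipped_objective(new_prob=0.9, old_prob=0.3, advantage=1.0)
# obj == 1.2, not 3.0
```

Note the asymmetry: with a negative advantage, the `min` keeps the *unclipped* (more punishing) term, so the policy is never shielded from bad moves.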
🍞 Top Bread (Hook): When you solve a riddle, you often mumble a first idea, notice a flaw, then tweak it. That little self-feedback loop is powerful.
🥬 The Concept (Self-Feedback):
- What it is: Self-feedback is when the model looks at its own attempt and uses it as guidance to make a better one.
- How it works:
- Produce an attempt.
- Judge it (with a reward or a soft score).
- Use the judged attempt as a guide to try again, aiming to surpass it.
- Why it matters: Without self-feedback, models tend to one-shot answers and miss easy fixes.
🍞 Anchor: Like writing a rough draft, marking it up with a red pen, and then writing a cleaner final essay.
02 Core Idea
🍞 Top Bread (Hook): Imagine a science fair where you build a bridge, test several versions, pick the strongest one, then use that best design as a blueprint to build an even stronger bridge.
🥬 The Concept (iGRPO):
- What it is: iGRPO is GRPO with a neat twist—after trying several drafts, it picks the best one and feeds it back into the prompt so the model learns to beat its own best attempt.
- How it works:
- Stage 1 (Explore): Sample N drafts for the same problem and score them.
- Pick the highest-scoring draft (the “best draft”).
- Stage 2 (Refine): Append that best draft to the original prompt as a guide and sample G refinements.
- Use GRPO-style updates on these refinements to make surpassing the best draft more likely next time.
- Why it matters: Without this two-stage loop, the model doesn’t get that strong, evolving hint from itself. iGRPO turns its own progress into better training fuel.
🍞 Anchor: It’s like solving a maze: first you try a few paths, choose the one closest to the exit, then start from that path and refine the last turns to finish perfectly.
Multiple analogies:
- Sports: Run several practice sprints, keep the best lap as the pace car, then train to run even faster than that pace.
- Cooking: Try a few cookie recipes, pick the tastiest, then tweak it (a pinch more salt, slightly shorter bake) to make it even better.
- Drawing: Sketch multiple thumbnails, select the strongest composition, and refine lines and shading to a polished piece.
Before vs After:
- Before (GRPO): The model learns from groups of one-shot answers; it improves but still misses easy late fixes.
- After (iGRPO): The model learns to condition on its own best try and specifically practice going from “almost right” to “exactly right.”
Why it works (intuition without equations):
- Dynamic self-conditioning means the guide gets better as the model gets better. Better guides make better refinements, creating a virtuous cycle (bootstrapping).
- Group normalization keeps the signal strong: we prefer refinements that beat the current best.
- Clipping and KL regularization keep updates stable while still letting the policy explore.
Building blocks (mini-sandwiches):
- 🍞 Hook: You know how teachers grade tests with a simple right/wrong? 🥬 Concept (Reward Function): A binary or graded score that tells the model how good an answer was; steps: extract final answer, compare to key (or use a judge); matters because learning needs a clear goal. 🍞 Anchor: “1 if 42, else 0.”
- 🍞 Hook: Comparing friends’ times after a race shows who did relatively better. 🥬 Concept (Group-Normalized Advantage): We compare scores inside the sampled group and favor above-average ones; steps: compute mean, std, normalize; matters because it gives a strong, stable learning signal. 🍞 Anchor: Getting a gold star when you beat the group’s average.
- 🍞 Hook: Don’t sprint too fast on the first lap. 🥬 Concept (Clipping & KL Guardrails): Limit how big each update is and stay near a reference policy; matters because it prevents training from wobbling. 🍞 Anchor: Training wheels on a bike.
- 🍞 Hook: Keep trying options before settling. 🥬 Concept (Entropy & Exploration): Higher mid-training entropy means the model still considers alternatives; matters because it avoids early overconfidence. 🍞 Anchor: Sampling a few ice-cream flavors before choosing.
🍞 Bottom Bread (Anchor): On AIME math, the model first finds its best draft, then learns patterns that finish the last tricky step correctly more often, boosting scores without extra test-time steps.
03 Methodology
At a high level: Prompt → Stage 1 (Exploratory Drafts) → Pick Best Draft → Stage 2 (Conditioned Refinements) → Update Policy → Better One-Shot Answers at Test Time.
Step-by-step (with the Sandwich pattern for each key step):
- 🍞 Hook: When you practice, you try a few ways before choosing the best approach. 🥬 Stage 1: Exploratory Draft Generation
- What it is: The model writes N different drafts for the same problem.
- How it works:
- Sample N drafts from the current snapshot policy.
- Score each draft using a reward (binary exact match or a judge in [0,1]).
- Select the highest-scoring draft; this becomes the self-feedback exemplar.
- Why it matters: Without Stage 1, we have no evolving guidance; we’d be guessing how to refine without a strong anchor. 🍞 Anchor: Try five ways to factor a polynomial, keep the one that gets closest to the correct roots.
- 🍞 Hook: A good example makes it easier to do better next time. 🥬 Building the Augmented Prompt (Dynamic Self-Conditioning)
- What it is: Attach the best draft right after the original question to form an enhanced input.
- How it works:
- Concat(original question, best draft) with a fixed template.
- This context shows the model a “strong prior attempt.”
- The model is asked to produce a refined, improved solution—not a copy.
- Why it matters: Without this, the model lacks a concrete, high-quality scaffold to surpass. 🍞 Anchor: A math workbook shows a worked example before asking you to solve a tougher version.
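The augmented-prompt step can be sketched as plain string templating. The template wording below is an assumption for illustration; the paper's exact phrasing may differ.

```python
# Sketch of dynamic self-conditioning: splice the best Stage 1 draft
# into the prompt with a fixed template, then ask for an improvement.
def build_augmented_prompt(question, best_draft):
    return (
        f"{question}\n\n"
        f"Here is your best previous attempt:\n{best_draft}\n\n"
        "Produce an improved solution that corrects any mistakes and "
        "surpasses this attempt. Do not simply copy it."
    )


prompt = build_augmented_prompt(
    "What is 13 x 17?",
    "13 x 17 = 13 x (20 - 3) = 260 - 39 = 221",
)
```

The key design point is that the draft is *context*, not a target: the model is rewarded for the new completion, so copying the draft verbatim only helps if the draft was already perfect.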
- 🍞 Hook: When a coach watches several attempts after a good warm-up, they can fine-tune your form. 🥬 Stage 2: Conditioned Refinements + GRPO Update
- What it is: Sample G refined answers using the augmented prompt and apply GRPO-style learning.
- How it works:
- Generate G new completions conditioned on (question + best draft).
- Score each completion with the same reward.
- Compute group-normalized advantages (who beat the group’s mean).
- Update the policy with PPO-style clipping and (optionally) KL regularization.
- Why it matters: Without Stage 2 updates, the model would not learn to reliably beat the best draft. 🍞 Anchor: After seeing your best lap time, the coach helps you shave off the final seconds.
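The two stages above fit into one training step, sketched below under stated assumptions: `sample(policy, prompt, k)` returns k completions, `reward(text)` returns a scalar in [0, 1], and `grpo_update` applies the clipped policy update. All three are hypothetical helpers standing in for a real RL stack, and the update itself is abbreviated to its advantage computation.

```python
# High-level sketch of one iGRPO training step (illustrative, not the
# paper's code). Only Stage 2 completions receive gradients.
import statistics


def igrpo_step(policy, prompt, sample, reward, grpo_update,
               n_drafts=4, g_refine=4):
    # Stage 1 (explore): sample N drafts and keep the highest-scoring one.
    drafts = sample(policy, prompt, n_drafts)
    best_draft = max(drafts, key=reward)

    # Dynamic self-conditioning: attach the best draft to the prompt.
    aug_prompt = (f"{prompt}\n\nBest previous attempt:\n{best_draft}\n\n"
                  "Improve on it.")

    # Stage 2 (refine): sample G refinements conditioned on the best draft.
    refinements = sample(policy, aug_prompt, g_refine)
    rewards = [reward(r) for r in refinements]

    # Group-normalized advantages over the refinement group only.
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + 1e-6) for r in rewards]

    # GRPO-style clipped update on the Stage 2 tokens.
    grpo_update(policy, aug_prompt, refinements, advantages)
```

Note how the total rollout budget per prompt is just `n_drafts + g_refine`, which is why iGRPO can be compute-matched against GRPO by splitting the same sample count across the two stages.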
Concrete example with data:
- Question: “What is 13 × 17?”
- Stage 1 drafts produce five final answers: [221, 230, 221, 221, 221]. Since 13 × 17 = 221, each draft ending in 221 earns reward 1 and the 230 draft earns reward 0.
- Best draft = one of the correct 221 drafts.
- Augmented prompt: question + “Best draft tried: 221. Improve clarity and verification.”
- Stage 2 refinements: The model writes a clearer step-by-step multiplication and confirms 221, which earns reward 1 and strengthens that reliable computation pattern.
Secret sauce (why it’s clever):
- It keeps compute roughly the same: we split the same rollout budget between drafts and refinements.
- It reuses the same scalar reward—no special critic, no extra verifier needed.
- It trains a general “refinement skill” that transfers to many math tasks.
- It delays harmful early overconfidence (entropy collapse), preserving exploration longer while still converging.
Implementation notes:
- Reward can be binary exact-match (1 if parsed answer matches key, else 0) or a graded generative judge (0–1) that gives partial credit to near-misses.
- Only Stage 2 tokens get gradients; Stage 1 just chooses the self-feedback draft.
- Inference-time is simple: no drafts are needed; the trained policy answers in one shot.
- With the same total samples per prompt, iGRPO’s main overhead is a modest ~13% extra wall-clock for two decoding rounds, with nearly identical peak GPU memory to GRPO.
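The binary exact-match reward mentioned above can be sketched as follows. The assumption that the final answer appears in a `\boxed{...}` wrapper is mine, a common convention for math benchmarks, not necessarily the paper's exact parser.

```python
# Sketch of a binary exact-match reward: extract the final answer from a
# \boxed{...} span and compare it to the gold key.
import re


def exact_match_reward(completion, gold_answer):
    """Return 1.0 if the parsed final answer matches the key, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # unparseable answers get no credit
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0


exact_match_reward("... so the answer is \\boxed{221}.", "221")  # -> 1.0
exact_match_reward("... so the answer is \\boxed{230}.", "221")  # -> 0.0
```

A generative judge would replace the comparison with a scalar score in [0, 1], which is exactly what lets near-misses survive into Stage 2.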
04 Experiments & Results
The test: The authors measured Pass@1 accuracy (did the first answer match the gold answer?) on multiple math reasoning benchmarks. They kept the total number of generated samples per prompt the same across methods to ensure a fair compute match.
The competition: iGRPO was compared against vanilla GRPO and two strong self-improvement baselines—Self-Verification and Critique-GRPO—across several model families and sizes (7B, 8B, 14B), using shared rewards and training protocols.
Scoreboard with context:
- On an 8B general model (Nemotron-H-8B-Base-8K), iGRPO lifted the macro-average to about 45%, which is like jumping from a solid C to a mid B when others are around low 40s.
- On a strong 7B distilled reasoner (DeepSeek-R1-Distill-Qwen-7B), iGRPO edged out GRPO and the critique/verification baselines, reaching roughly 69.9% average—like squeezing extra points on long, multi-step questions where tiny late slips matter.
- On a math-specialized 7B (OpenMath-Nemotron-7B), GRPO had little headroom, but iGRPO still improved the average to about 76.1%, especially on tougher contests like AIME and AMC.
- At 14B scale, gains persisted: iGRPO improved both DeepSeek-R1-Distill-Qwen-14B and OpenMath-Nemotron-14B, with particularly nice jumps on AIME24 (long-horizon problems).
State of the art:
- With a stronger base (OpenReasoning-Nemotron-7B) and a harder dataset (AceReason-Math), iGRPO hit new highs on AIME24 (85.62%) and AIME25 (79.64%). That’s like getting an A on very challenging competitions.
Surprising findings:
- The same two-stage self-feedback wrapper also improves other GRPO variants (DAPO, GSPO) by ~+1.1 to +1.2 points, suggesting the refinement interface—not GRPO specifics—is the main booster.
- Swapping the binary checker for a generative judge (e.g., GPT-5 scoring in [0,1]) gave an extra ~+0.94 average, likely because partial credit keeps promising near-misses alive for Stage 2 to fix.
- Entropy analysis showed iGRPO keeps exploration higher in mid-training than GRPO, delaying early overconfidence; yet final entropies are similar, so the gain is about better learning dynamics, not permanent randomness.
Compute and resources:
- Nearly identical peak memory to GRPO; small throughput drop due to two decoding stages; about 13% more total training time for the shown setup.
Takeaway: Under the same rollout budget, adding the self-feedback conditioning step consistently makes models more reliable on long, verifiable math problems.
05 Discussion & Limitations
Limitations:
- Modest extra wall-clock time (~13%) due to the two decoding stages, even though peak memory remains almost the same.
- Diminishing returns on already very strong, domain-specialized models; the absolute gains can shrink as the ceiling gets higher.
- Depends on a reward that reflects correctness; binary rewards can be harsh on near-misses unless you use a generative judge.
- The approach trains a refinement skill; tasks that don’t benefit from a “draft → refine” pattern may see smaller gains.
Required resources:
- A standard RL fine-tuning stack for LLMs with GRPO-style updates.
- A reward function: either exact-match parsing or a learned/generative judge that outputs a scalar score.
- Sufficient generation budget per prompt to split between Stage 1 drafts and Stage 2 refinements (e.g., 8 total).
When not to use:
- Very short, trivial questions where drafting adds no value.
- Domains without clear reward signals (no way to check answers) unless you have a reliable judge.
- Extreme low-latency training settings where even a small time overhead is unacceptable.
Open questions:
- How best to allocate N vs. G under various difficulties and model sizes?
- Can multi-round (more than two stages) training give further boosts without large overhead?
- What’s the best design for learned judges to avoid bias or reward hacking?
- Can the idea help beyond math—like coding with unit tests, or science proofs with verifiers—at even larger scales?
- How does dynamic self-conditioning interact with curricula (easy-to-hard) or with mixture-of-experts models?
06 Conclusion & Future Work
Three-sentence summary:
- iGRPO adds a simple two-stage loop to GRPO: try multiple drafts, pick the best, then condition on it to train refinements that beat it.
- This dynamic self-conditioning bootstraps learning—better policies produce better drafts, which produce better training signals—improving verifiable reasoning under the same rollout budget.
- Experiments show consistent gains across models and datasets, including state-of-the-art AIME24/25 results, with minimal extra training overhead and no inference-time cost.
Main achievement:
- Turning the model’s own best attempt into a live, evolving training scaffold that reliably teaches “go from almost right to exactly right.”
Future directions:
- Smarter allocation of drafts vs. refinements, richer judges, multi-round refinement, and broader application to coding, proofs, and planning tasks.
Why remember this:
- iGRPO shows that a tiny training tweak—feeding your best draft back into the prompt—can unlock a powerful, general refinement skill that lifts accuracy on hard multi-step problems without complicating test-time use.
Practical Applications
- Train math tutor bots that fix small late-step errors and explain cleaner solutions.
- Improve code assistants by conditioning on the best unit-tested draft and refining to pass all tests.
- Enhance step-by-step planners (e.g., study schedules, lab procedures) to reduce compounding mistakes.
- Boost data labeling quality in RL pipelines by turning near-miss solutions into correct ones via refinement.
- Strengthen competitive-exam solvers (AIME/AMC) where exact final answers matter.
- Upgrade scientific calculators that require precise multi-step derivations with verifiable endpoints.
- Help spreadsheet and finance models refine calculations to match target checks or reconciliation constraints.
- Enable small on-device models to learn efficient refinement skills under tight compute budgets.
- Pair iGRPO with a generative judge to capture partial credit and convert close attempts into correct answers.
- Apply the self-feedback wrapper to other PPO/GRPO variants to stabilize and lift reasoning across domains.