On-Policy Self-Distillation for Reasoning Compression
Key Summary
- Reasoning models often talk too much, and those extra words can actually make them more wrong.
- This paper introduces OPSDC: the model tells itself "be concise," then learns to act that way even without being told.
- It uses on-policy self-distillation: the model learns from its own outputs, not from external answer keys or reward models.
- A simple training trick does the job: compare normal outputs to "concise-mode" outputs token by token and nudge the normal mode to match.
- OPSDC cuts 35-59% of reasoning tokens while keeping or even improving accuracy, especially on math benchmarks.
- On MATH-500, Qwen3-14B jumps from 70.0% to 86.1% accuracy while using 56.5% fewer tokens.
- It adapts automatically: big cuts on easy problems, gentle cuts on hard ones, without needing a difficulty detector.
- It preserves entropy (the model's ability to explore), unlike many RL length-penalty methods that collapse diversity.
- No ground-truth answers, no length budgets, no reward engineering: just a concise instruction and the model's own rollouts.
- The key insight: extra, redundant tokens are not just wasteful; they can snowball into compounding errors.
Why This Research Matters
Shorter, sharper reasoning means lower cost and faster responses, which is crucial for everyday tools like homework helpers, coding assistants, and chatbots. Because OPSDC needs no answer keys or complex reward models, it's easy to adopt and scales to domains without reliable labels. The method preserves the model's ability to explore, so it stays strong on hard problems rather than collapsing into overly short answers. Automatic difficulty adaptation trims easy tasks aggressively while keeping essential steps on tough ones, improving both efficiency and reliability. By reducing compounding errors, concise reasoning can actually raise accuracy: a win for users who need both speed and correctness. Finally, less compute and fewer tokens mean greener AI, enabling high-quality reasoning on smaller devices or tighter budgets.
Detailed Explanation
01 Background & Problem Definition
You know how some people over-explain everything, repeating points and second-guessing themselves until it gets confusing? That's a lot like how modern reasoning AIs behave. They "think out loud" for thousands of tokens, even when the problem is easy. Before this work, the common wisdom was: more thinking tokens means better answers on hard tasks, so we just had to accept a lot of extra words (and cost). But there was a catch: these models often couldn't stop. Ask a simple question, and they'd still write essays.

What was happening in the world before? Researchers had already built powerful reasoning models that could solve hard problems by showing their work. This helped accuracy on tough tasks but led to waste on easy ones. Serving these models got expensive (more tokens means more compute, more time, more money). Worse, the long reasoning sometimes backfired: a correct line of thought could get buried under detours, doubts, and restarts, causing the final answer to be wrong.

The problem: how do we make models reason concisely, using fewer, sharper steps, without losing the careful thinking needed for hard problems? And can we do it without hand-holding the model with ground-truth answers, custom reward functions, or rigid token budgets that fit some tasks but not others? People tried several approaches, and each one had trade-offs:
- Reinforcement learning (RL) with length penalties: Reward short answers. This needs ground-truth answers to score success and often crushes the model's "exploration drive," making it worse on hard problems.
- Supervised fine-tuning (SFT) on short solutions: Train on someone else's concise reasoning. This can cause "forgetting" because the student stops practicing its own native style and becomes brittle when real prompts differ from the training set.
- Prompting or decoding tricks: Tell the model to be brief or trim its output on the fly. These help a bit but vanish when the prompt changes and usually don't reach large compression safely.

The missing piece was a way to let the model keep its own style and skills while learning to be brief when that's enough. In other words, a method that:
- Uses no answer keys (so it scales to domains without labels),
- Doesn't require complex reward engineering, and
- Automatically adapts: compress easy tasks strongly and hard tasks gently.

This paper fills that gap with OPSDC (On-Policy Self-Distillation for Reasoning Compression). The idea is disarmingly simple: the same model acts as both teacher and student. The teacher is just the model with a "be concise" instruction prepended; the student is the model without that instruction. We then teach the student to behave like the "concise teacher" on the student's own outputs. No ground-truth answers are needed.

Why does this matter in everyday life? Shorter, clearer reasoning means faster responses, lower cloud bills, and greener computing. It means your homework helper, coding assistant, or on-device tutor can be both smarter and snappier. And it means fewer errors caused by overthinking tangents, like when you know the right answer but talk yourself out of it.
02 Core Idea
Aha! Moment in one sentence: Ask the model to be concise, then teach it to behave that way by learning from its own "concise-mode" self, so conciseness becomes the default without needing to be asked.

Multiple analogies (same idea, three ways):
- Traffic cop analogy: You know how a traffic cop waves cars through the fastest route instead of letting them circle around the block? The "concise teacher" shows the shortest safe path; the "student" learns to take that path on its own next time.
- Highlighter analogy: Imagine highlighting only the most important lines in a long article. The teacher (with the concise instruction) highlights the essentials; the student learns to write just those lines when it rewrites the article later.
- Trailblazer analogy: One hiker walks a mountain, marks the easiest route with flags (concise teacher), and the next hiker (student) follows those flags and remembers them for future hikes, needing no flags later.

Before vs. After:
- Before: Models produced long chains of thought, even for simple tasks, wasting tokens and sometimes derailing correct reasoning.
- After: Models reduce their words by about one-third to one-half, while matching or beating accuracy (especially on easy-to-medium problems) without losing their ability to think deeply when needed.

Why it works (deep intuition without equations):
- Extra steps often add noise. Every extra token is a chance to go off-track. Trimming redundant steps cuts off places where errors can snowball.
- The teacher is just the model with a concise instruction. Because it's the same brain with a nudge, its preferences are familiar and safe to learn from.
- The learning signal is gentle and local: we compare the student's next-token choices to the teacher's at the same spot in the student's own writing and nudge toward the teacher only where the student already tends to go. That keeps the model stable and prevents collapse.
- Difficulty-adaptive magic: When a problem is easy, the concise teacher shortens a lot, so the student gets a strong "trim here" signal. When a problem is hard, the teacher still writes more, so the signal to compress is weak. No difficulty classifier needed.

Building blocks, explained with the Sandwich pattern:
- Hook: You know how you can learn from your own past homework by noticing what you did right and what was extra? Concept: On-Policy Self-Distillation is when a model learns from its own fresh outputs, not a static dataset. How it works: (1) The student writes an answer. (2) The teacher (same model + "be concise" instruction) shows how it would write at each step. (3) We gently nudge the student toward the teacher's choices at those exact steps. Why it matters: Without learning on its own outputs, the model forgets its style or chases mismatched examples. Anchor: The student solves a fraction problem, the teacher shows a shorter path at each step, the student soon prefers that shorter path by default.
- Hook: Imagine telling a friend, "please be brief and clear." Concept: Conciseness Instruction is simply a short note like "Solve concisely; avoid unnecessary steps." How it works: Add this sentence before the problem and the model trims fluff while keeping key steps. Why it matters: It unlocks a behavior the model already knows but doesn't default to. Anchor: With the note, it factors a polynomial in three lines instead of twenty.
- Hook: Think of a coach and a player who are actually the same person wearing two hats. Concept: Teacher-Student Framework here means one shared model in two roles: teacher (with the concise instruction) and student (without it). How it works: Use the teacher's token-by-token preferences as targets for the student's updates. Why it matters: No external labels or stronger models are required. Anchor: The "teacher-you" shows a minimal solution; the "student-you" learns to do that next time without a reminder.
- Hook: You know how you adjust more where you're already working, rather than everywhere at once? Concept: Reverse KL Divergence is a way to measure and nudge differences so the student updates mostly where it already places attention. How it works: Compare probabilities for the next token; push the student away from tokens the teacher dislikes, but don't force it toward uncertain areas. Why it matters: This keeps the model stable and avoids the crash that happens when you push too hard everywhere. Anchor: If both agree "the next word is 'therefore'," no change; if the student wants a long detour and the teacher doesn't, we trim the detour.
- Hook: Like updating a playbook every few games once the team improves. Concept: Periodic Teacher Update means we occasionally refresh the teacher's weights to match the improved student, then apply the concise instruction again to get an even crisper target. How it works: Copy the student, make it the teacher, and the teacher-in-concise-mode becomes a slightly better "shortest safe path." Why it matters: Compression deepens step by step without instability. Anchor: After a few training rounds, the new teacher writes even more briefly, so the student learns to be briefer still.
- Hook: Pack lighter for a day trip, heavier for a week-long hike. Concept: Difficulty-Adaptive Compression means easy problems get big cuts; hard ones keep the needed steps. How it works: The concise teacher naturally writes shorter on easy tasks and longer on hard ones; the student mirrors that. Why it matters: You don't need a difficulty detector or per-task budgets. Anchor: A simple sum gets a one-line solution; a contest-level geometry proof keeps its careful steps.
- Hook: Like turning a page-long summary into a crisp paragraph. Concept: Token Reduction means using fewer tokens to say the same essential reasoning. How it works: Remove repetition, restatements, and meandering; keep the logic that changes the answer. Why it matters: Fewer tokens save time and money, and often avoid mistakes. Anchor: Instead of re-deriving and re-checking the same equation three times, solve it once cleanly and move on.
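The Reverse KL Divergence building block can be made concrete in a few lines. The sketch below is illustrative only (the three-token vocabulary and the probability values are invented for this example); it shows why the nudge stays local: each term is weighted by the student's own probability, so tokens the student rarely writes barely contribute.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher): each term is weighted by the student's own
    probability, so the signal is strongest where the student already goes."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

# Toy next-token distributions over ["therefore", "alternatively", "so"]
student = [0.5, 0.4, 0.1]   # the student often starts a detour with "alternatively"
teacher = [0.7, 0.1, 0.2]   # the concise teacher dislikes that detour
print(round(reverse_kl(student, teacher), 3))

# When the two distributions agree, the divergence (and thus the nudge) is zero.
print(reverse_kl([0.5, 0.5], [0.5, 0.5]))
```

Note the `if s > 0` guard: a token the student never picks contributes nothing, which is exactly the "update mostly where the student already places attention" behavior described above.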
03 Methodology
High-level recipe: Input → Student writes a response → Teacher (same model + concise instruction) scores each next step → Compare student vs. teacher, nudge student toward teacher's choices → Every so often, refresh teacher from the improved student → Output is a shorter, accurate reasoning style by default.

Step-by-step, with what/why/examples:
- Prepare two roles from one model
- What happens: We take one reasoning model. The student sees the normal prompt. The teacher sees the same prompt with a short "be concise and correct" instruction added on top.
- Why it exists: The teacher provides a clearer, shorter style that the student can learn from, without using answer keys.
- Example data: Problem: "Factor x^3 - 6x^2 + 11x - 6." Student prompt: normal task instructions. Teacher prompt: same, plus "Solve concisely; avoid unnecessary steps."
- Student writes; teacher guides at each token
- What happens: The student starts writing its hidden reasoning and answer. For each next-token decision, we ask the teacher, "What would your concise mode prefer next?" and read the teacher's probability over tokens for that exact spot.
- Why it exists: Comparing at the exact student-written prefix keeps training on-policy: no mismatch between training data and the model's own style.
- Example: The student begins a long explanation ("Let me check again...") while the teacher gives low preference to that detour and prefers a direct factoring step.
- Nudge the student toward the concise teacher (token-by-token)
- What happens: We gently adjust the student so it places less weight on rambling tokens and more on the teacher's direct path. This nudge is strongest where the student was already likely to go (stability) and weaker elsewhere (safety).
- Why it exists: It avoids crushing the model's ability to explore. Unlike blunt length penalties, this preserves useful uncertainty on hard tasks.
- Example: If the student often writes "Alternatively..." followed by repetition, and the teacher de-emphasizes that word, the student learns to skip it next time.
- Refresh the teacher periodically
- What happens: Every so often (e.g., every 50 training steps), we copy the improved student into the teacher slot. The concise instruction now makes the teacher even more succinct.
- Why it exists: It creates a moving staircase of improvement: each refresh offers a slightly stricter but still safe target, deepening compression without collapse.
- Example: After a few refreshes, the teacher solves the factoring in a few crisp steps; the student learns to do the same without needing the reminder.
- Stop when concise becomes the default
- What happens: After modest training (the paper saw quick convergence), the student habitually picks shorter, more direct reasoning.
- Why it exists: This is the goal: conciseness without asking for it at inference time.
- Example: On a simple algebra task, the model writes a tight, correct derivation in far fewer tokens.

Concrete walk-through with actual content:
- Input: "Find all real x such that x^3 - 6x^2 + 11x - 6 = 0." (In the paper's dataset.)
- Student's first try: Long chain-of-thought: repeats synthetic division twice, re-checks roots, then re-explains the final step after the hidden reasoning; lots of duplication.
- Teacher's concise-mode view: Prefers "Try small integer roots 1, 2, 3; factor quickly; state answer."
- Update: The model learns to prioritize the quick factoring (x-1)(x-2)(x-3) and skip repeats.
- Outcome next time: The model goes straight to testing small roots and factors cleanly, saving hundreds of tokens.

What breaks without each step:
- Without on-policy student rollouts: The model trains on someone else's style and forgets its own (distribution shift).
- Without the concise instruction: No clear "shortest safe path" to imitate.
- Without reverse-KL-style gentle nudges: Training can oscillate or collapse, crushing exploration.
- Without periodic refresh: Compression stalls early (frozen teacher) or destabilizes (updating every step).

The secret sauce:
- Gentle, local learning from yourself: Comparing the student to its own concise-mode view at the exact places it writes keeps changes safe and targeted.
- Automatic difficulty adaptation: Easy tasks get big trims; hard tasks keep needed steps.
- No labels or reward engineering: Simpler, cheaper, and scalable to domains without answer keys.
- Entropy preservation: The model keeps its ability to explore on hard problems instead of being forced into always-short outputs.
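The whole recipe can be sketched end to end. The toy below is a stand-in, not the paper's implementation: the "model" is a single logit over two next tokens (a direct step vs. a detour), the concise instruction is faked as a fixed logit offset, the student is nudged on its own rollout position via a numerical reverse-KL gradient, and the teacher is refreshed from the student every 50 steps (the interval from the example above). All names and constants are illustrative.

```python
import math

def softmax2(z):
    """Two-token softmax over (direct step, detour)."""
    e = math.exp(z)
    return (e / (1 + e), 1 / (1 + e))

def reverse_kl(s, t):
    return sum(si * math.log(si / ti) for si, ti in zip(s, t))

CONCISE_OFFSET = 1.0   # stand-in for the effect of the "be concise" instruction
REFRESH_EVERY = 50     # periodic teacher refresh interval (illustrative)

student_z = 0.0        # student starts undecided between direct step and detour
teacher_z = student_z + CONCISE_OFFSET
for step in range(1, 201):
    if step % REFRESH_EVERY == 0:
        # periodic refresh: copy the improved student, re-apply the instruction
        teacher_z = student_z + CONCISE_OFFSET
    teacher = softmax2(teacher_z)
    # numerical gradient of the reverse-KL nudge at the student's own rollout
    eps = 1e-5
    g = (reverse_kl(softmax2(student_z + eps), teacher)
         - reverse_kl(softmax2(student_z - eps), teacher)) / (2 * eps)
    student_z -= 0.5 * g   # gentle step toward the concise teacher

print(softmax2(student_z)[0] > 0.8)   # student now strongly prefers the direct step
```

Each refresh moves the target one rung up the "moving staircase," so the student's preference for the direct step keeps deepening. This toy omits the natural saturation a real run would show (an actual concise teacher cannot get shorter forever).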
04 Experiments & Results
The test setup: The authors trained on about 13.6k competition-style math problems without ground-truth answers and evaluated on three benchmarks with very different difficulty levels: MATH-500 (500 problems), AIME 2024 (30 problems), and AIME 2025 (30 problems). They measured two main things: (1) accuracy (did the final answer match?), and (2) average number of response tokens (how long was the reasoning). They also checked general knowledge (MMLU) to make sure the model didn't forget broad skills.

Competition (what they compared against):
- Base model (no changes),
- Concise prompt only at inference (no training),
- OPSDC (their method).

Scoreboard with context:
- On MATH-500, Qwen3-14B went from 70.0% to 86.1% accuracy while using about 56.5% fewer tokens. That's like going from a C to a solid A while writing half as much.
- On MATH-500, Qwen3-8B also reached roughly 86.6% accuracy, slashing tokens by about 58.8% (stronger than the prompt-only trick).
- On AIME 2024, Qwen3-14B gained about 10 points (65.8% → 76.3%) with roughly 41% compression.
- On AIME 2025 (hardest), the method still compressed around 35% but traded off a few accuracy points, showing that it eases off on compression for very tough problems.
- General capabilities (MMLU) stayed the same, meaning the model didn't forget broad knowledge while learning to be concise.

Surprising findings:
- Less can be more. Cutting tokens often improved accuracy. Why? Those extra, redundant tokens were not just filler; they were places where the model could introduce mistakes or talk itself out of correct answers.
- Difficulty-adaptive behavior emerged naturally: compression was markedly stronger on easy tasks than on hard ones, without any difficulty detector.
- Entropy (the model's "curiosity" to explore alternatives) stayed stable. RL methods with harsh length penalties often crush entropy; OPSDC preserved it. This is key for tough problems that need exploration.
- The direction of the comparison matters. When they tried a different comparison setup (called "forward KL"), performance became unstable after each teacher refresh, accuracy dipped in saw-tooth patterns, and outputs got too short. The chosen "reverse KL"-style nudge stayed stable and strong.
- Vague can beat precise. Telling the teacher "be concise" worked better than telling it "use exactly 50% fewer tokens." Precise numeric targets inflated compression but hurt accuracy, especially on competition-level tasks.
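For context, the compression percentages reported above are just relative token savings, (before - after) / before. A quick sketch with hypothetical raw counts (the paper reports percentages, not these absolute token numbers, which are chosen here only to reproduce the 56.5% figure):

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of reasoning tokens saved: (before - after) / before."""
    return (before - after) / before

# Hypothetical raw counts; only the resulting percentage matches the paper.
saved = token_reduction(4000, 1740)
print(f"{saved:.1%} fewer tokens")
```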
05 Discussion & Limitations
Limitations:
- It relies on the model being able to follow a "be concise" instruction. Very small or poorly aligned models might not get much benefit until they can follow instructions reliably.
- Most results are for math reasoning. The method is domain-agnostic in design, but broader evaluations (coding, science QA, planning) are still needed.
- Picking the teacher-update interval matters. Updating too often can destabilize training; updating rarely may limit compression depth. The paper found a stable sweet spot around every few dozen steps.

Required resources:
- A standard supervised training stack (no reward models or PPO needed),
- Enough compute to run dual forward passes (student + teacher) per token during training,
- A modest dataset of prompts (no labels needed). The paper used 8 GPUs and trained quickly (they saw strong gains within around 100 steps).

When NOT to use it:
- If you absolutely need long, fully spelled-out chains for auditing (e.g., a formal proof transcript), forcing brevity may hide intermediate detail, even if the final answer is correct.
- If your model cannot follow the concise instruction at all (e.g., too small, unaligned), gains may be limited until you first improve basic instruction following.
- If your decoding stack or product requires fixed, scripted outputs that must include every step verbatim, compression may conflict with those constraints.

Open questions:
- How well does it transfer to coding, scientific reasoning, or tool-using agents without ground-truth rewards?
- Can we automatically tune the teacher-update interval or use a smooth EMA teacher for even better stability?
- How do different wording styles of the concise instruction affect outcomes across model families?
- Can we detect and retain "explanatory" steps users value while still trimming genuine redundancy?
- What are the best safeguards so that compression never trims crucial safety or compliance checks in high-stakes settings?
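On the EMA-teacher question above, a smoother alternative to the hard copy-every-N-steps refresh would blend teacher weights toward the student a little on every step. A minimal sketch, assuming a dict-of-floats stand-in for model weights (the `decay` value and the whole setup are illustrative, not from the paper):

```python
def ema_update(teacher_w, student_w, decay=0.99):
    """Exponential-moving-average teacher: each step keeps `decay` of the old
    teacher and mixes in (1 - decay) of the current student."""
    return {k: decay * teacher_w[k] + (1 - decay) * student_w[k]
            for k in teacher_w}

teacher = {"w": 0.0}
student = {"w": 1.0}   # pretend the student has already improved
for _ in range(300):
    teacher = ema_update(teacher, student)
print(round(teacher["w"], 2))   # teacher drifts smoothly toward the student
```

Compared with the paper's periodic hard copy, the EMA target changes a tiny amount every step instead of jumping every 50, which is the kind of stability trade-off the open question asks about.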
06 Conclusion & Future Work
Three-sentence summary: OPSDC teaches a reasoning model to make concise thinking its default by imitating its own "concise-mode" behavior, no labels or rewards required. It trims 35-59% of tokens while preserving or improving accuracy, especially on easier benchmarks, and it adapts automatically so hard problems keep needed steps. Unlike many RL approaches, it preserves the model's exploratory capacity and stays stable through training.

Main achievement: Showing that a minimalist, on-policy self-distillation setup (just add a concise instruction to create the teacher and gently align the student to it) can both compress and improve reasoning at scale, label-free.

Future directions: Apply this beyond math (coding, science, planning), refine teacher-refresh strategies, personalize the degree of conciseness per user, and combine with selective visibility (keep internal rigor when needed while showing concise summaries).

Why remember this: It flips a common assumption, that more words equal safer reasoning, by proving that shorter, sharper chains often reduce compounding errors and boost accuracy; it's a simple, practical recipe any lab can try without building a full RL pipeline.
Practical Applications
- Homework helpers that show just the key steps, answer faster, and avoid talking themselves into wrong answers.
- Coding assistants that propose concise, correct fixes instead of lengthy, meandering explanations.
- On-device tutoring apps that fit tight memory and compute limits by defaulting to concise reasoning.
- Customer support chatbots that resolve issues quickly without redundant back-and-forth.
- Scientific Q&A systems that summarize derivations while preserving the critical reasoning steps.
- Automated graders or verifiers that benefit from cleaner, shorter student-model solutions.
- Meeting-note generators that keep only decisive reasoning instead of verbose tangents.
- Planning agents that maintain necessary checks on hard tasks but skip fluff on routine tasks.
- Educational tools that can toggle visible detail: concise summaries by default, expanded steps on demand.
- Cost-optimized AI deployments where fewer tokens per answer directly reduce serving bills.