On-Policy Self-Distillation for Reasoning Compression
Key Summary
- Reasoning models often talk too much, and those extra words can actually make them more wrong.
- This paper introduces OPSDC: the model tells itself "be concise," then learns to act that way even without being told.
- It uses on-policy self-distillation: the model learns from its own outputs, not from external answer keys or reward models.
- A simple training trick does the job: compare normal outputs to "concise-mode" outputs token by token and nudge the normal mode to match.
- OPSDC cuts 35-59% of reasoning tokens while keeping or even improving accuracy, especially on math benchmarks.
- On MATH-500, Qwen3-14B jumps from 70.0% to 86.1% accuracy while using 56.5% fewer tokens.
- It adapts automatically: big cuts on easy problems, gentle cuts on hard ones, without needing a difficulty detector.
- It preserves entropy (the model's ability to explore), unlike many RL length-penalty methods that collapse diversity.
- No ground-truth answers, no length budgets, no reward engineering: just a concise instruction and the model's own rollouts.
- The key insight: extra, redundant tokens are not just wasteful; they can snowball into compounding errors.
Why This Research Matters
Shorter, sharper reasoning means lower cost and faster responses, which is crucial for everyday tools like homework helpers, coding assistants, and chatbots. Because OPSDC needs no answer keys or complex reward models, it's easy to adopt and scales to domains without reliable labels. The method preserves the model's ability to explore, so it stays strong on hard problems rather than collapsing into overly short answers. Automatic difficulty adaptation trims easy tasks aggressively while keeping essential steps on tough ones, improving both efficiency and reliability. By reducing compounding errors, concise reasoning can actually raise accuracy: a win for users who need both speed and correctness. Finally, less compute and fewer tokens mean greener AI, enabling high-quality reasoning on smaller devices or tighter budgets.
Detailed Explanation
01 Background & Problem Definition
You know how some people over-explain everything, repeating points and second-guessing themselves until it gets confusing? That's a lot like how modern reasoning AIs behave. They "think out loud" for thousands of tokens, even when the problem is easy. Before this work, the common wisdom was: more thinking tokens means better answers on hard tasks, so we just had to accept a lot of extra words (and cost). But there was a catch: these models often couldn't stop. Ask a simple question, and they'd still write essays.

What was happening in the world before? Researchers had already built powerful reasoning models that could solve hard problems by showing their work. This helped accuracy on tough tasks but led to waste on easy ones. Serving these models got expensive (more tokens means more compute, more time, more money). Worse, the long reasoning sometimes backfired: a correct line of thought could get buried under detours, doubts, and restarts, causing the final answer to be wrong.

The problem: how do we make models reason concisely, using fewer, sharper steps, without losing the careful thinking needed for hard problems? And can we do it without hand-holding the model with ground-truth answers, custom reward functions, or rigid token budgets that fit some tasks but not others? People tried several approaches, and each one had trade-offs:
- Reinforcement learning (RL) with length penalties: Reward short answers. This needs ground-truth answers to score success and often crushes the model's "exploration drive," making it worse on hard problems.
- Supervised fine-tuning (SFT) on short solutions: Train on someone else's concise reasoning. This can cause "forgetting" because the student stops practicing its own native style and becomes brittle when real prompts differ from the training set.
- Prompting or decoding tricks: Tell the model to be brief or trim its output on the fly. These help a bit but vanish when the prompt changes and usually don't reach large compression safely.

The missing piece was a way to let the model keep its own style and skills while learning to be brief when that's enough. In other words, a method that:
- Uses no answer keys (so it scales to domains without labels),
- Doesn't require complex reward engineering, and
- Automatically adapts: compress easy tasks strongly and hard tasks gently.

This paper fills that gap with OPSDC (On-Policy Self-Distillation for Reasoning Compression). The idea is disarmingly simple: the same model acts as both teacher and student. The teacher is just the model with a "be concise" instruction prepended; the student is the model without that instruction. We then teach the student to behave like the "concise teacher" on the student's own outputs. No ground-truth answers are needed.

Why does this matter in everyday life? Shorter, clearer reasoning means faster responses, lower cloud bills, and greener computing. It means your homework helper, coding assistant, or on-device tutor can be both smarter and snappier. And it means fewer errors caused by overthinking tangents, like when you know the right answer but talk yourself out of it.
02 Core Idea
Aha! Moment in one sentence: Ask the model to be concise, then teach it to behave that way by learning from its own "concise-mode" self, so conciseness becomes the default without needing to be asked.

Multiple analogies (same idea, three ways):
- Traffic cop analogy: You know how a traffic cop waves cars through the fastest route instead of letting them circle around the block? The "concise teacher" shows the shortest safe path; the "student" learns to take that path on its own next time.
- Highlighter analogy: Imagine highlighting only the most important lines in a long article. The teacher (with the concise instruction) highlights the essentials; the student learns to write just those lines when it rewrites the article later.
- Trailblazer analogy: One hiker walks a mountain, marks the easiest route with flags (concise teacher), and the next hiker (student) follows those flags and remembers them for future hikes, needing no flags later.

Before vs. After:
- Before: Models produced long chains of thought, even for simple tasks, wasting tokens and sometimes derailing correct reasoning.
- After: Models reduce their words by about one-third to one-half, while matching or beating accuracy (especially on easy-to-medium problems) without losing their ability to think deeply when needed.

Why it works (deep intuition without equations):
- Extra steps often add noise. Every extra token is a chance to go off-track. Trimming redundant steps cuts off places where errors can snowball.
- The teacher is just the model with a concise instruction. Because it's the same brain with a nudge, its preferences are familiar and safe to learn from.
- The learning signal is gentle and local: we compare the student's next-token choices to the teacher's at the same spot in the student's own writing and nudge toward the teacher only where the student already tends to go. That keeps the model stable and prevents collapse.
- Difficulty-adaptive magic: When a problem is easy, the concise teacher shortens a lot, so the student gets a strong "trim here" signal. When a problem is hard, the teacher still writes more, so the signal to compress is weak. No difficulty classifier needed.

Building blocks, explained with the Sandwich pattern:
- Hook: You know how you can learn from your own past homework by noticing what you did right and what was extra? Concept: On-Policy Self-Distillation is when a model learns from its own fresh outputs, not a static dataset. How it works: (1) The student writes an answer. (2) The teacher (same model + "be concise" instruction) shows how it would write at each step. (3) We gently nudge the student toward the teacher's choices at those exact steps. Why it matters: Without learning on its own outputs, the model forgets its style or chases mismatched examples. Anchor: The student solves a fraction problem, the teacher shows a shorter path at each step, the student soon prefers that shorter path by default.
- Hook: Imagine telling a friend, "please be brief and clear." Concept: Conciseness Instruction is simply a short note like "Solve concisely; avoid unnecessary steps." How it works: Add this sentence before the problem and the model trims fluff while keeping key steps. Why it matters: It unlocks a behavior the model already knows but doesn't default to. Anchor: With the note, it factors a polynomial in three lines instead of twenty.
- Hook: Think of a coach and a player who are actually the same person wearing two hats. Concept: Teacher-Student Framework here means one shared model in two roles: teacher (with the concise instruction) and student (without it). How it works: Use the teacher's token-by-token preferences as targets for the student's updates. Why it matters: No external labels or stronger models are required. Anchor: The "teacher-you" shows a minimal solution; the "student-you" learns to do that next time without a reminder.
- Hook: You know how you adjust more where you're already working, rather than everywhere at once? Concept: Reverse KL Divergence is a way to measure and nudge differences so the student updates mostly where it already places attention. How it works: Compare probabilities for the next token; push the student away from tokens the teacher dislikes, but don't force it toward uncertain areas. Why it matters: This keeps the model stable and avoids the crash that happens when you push too hard everywhere. Anchor: If both agree "the next word is 'therefore'," no change; if the student wants a long detour and the teacher doesn't, we trim the detour.
- Hook: Like updating a playbook every few games once the team improves. Concept: Periodic Teacher Update means we occasionally refresh the teacher's weights to match the improved student, then apply the concise instruction again to get an even crisper target. How it works: Copy the student, make it the teacher, and the teacher-in-concise-mode becomes a slightly better "shortest safe path." Why it matters: Compression deepens step by step without instability. Anchor: After a few training rounds, the new teacher writes even more briefly, so the student learns to be briefer still.
- Hook: Pack lighter for a day trip, heavier for a week-long hike. Concept: Difficulty-Adaptive Compression means easy problems get big cuts; hard ones keep the needed steps. How it works: The concise teacher naturally writes shorter on easy tasks and longer on hard ones; the student mirrors that. Why it matters: You don't need a difficulty detector or per-task budgets. Anchor: A simple sum gets a one-line solution; a contest-level geometry proof keeps its careful steps.
- Hook: Like turning a page-long summary into a crisp paragraph. Concept: Token Reduction means using fewer tokens to say the same essential reasoning. How it works: Remove repetition, restatements, and meandering; keep the logic that changes the answer. Why it matters: Fewer tokens save time and money, and often avoid mistakes. Anchor: Instead of re-deriving and re-checking the same equation three times, solve it once cleanly and move on.
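The Reverse KL Divergence building block can be made concrete in a few lines. The sketch below is illustrative only (the three-token vocabulary and the probability values are invented for this example); it shows why the nudge stays local: each term is weighted by the student's own probability, so tokens the student rarely writes barely contribute.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher): each term is weighted by the student's own
    probability, so the signal is strongest where the student already goes."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

# Toy next-token distributions over ["therefore", "alternatively", "so"]
student = [0.5, 0.4, 0.1]   # the student often starts a detour with "alternatively"
teacher = [0.7, 0.1, 0.2]   # the concise teacher dislikes that detour
print(round(reverse_kl(student, teacher), 3))

# When the two distributions agree, the divergence (and thus the nudge) is zero.
print(reverse_kl([0.5, 0.5], [0.5, 0.5]))
```

Note the `if s > 0` guard: a token the student never picks contributes nothing, which is exactly the "update mostly where the student already places attention" behavior described above.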
03 Methodology
High-level recipe: Input → Student writes a response → Teacher (same model + concise instruction) scores each next step → Compare student vs. teacher, nudge student toward teacher's choices → Every so often, refresh teacher from the improved student → Output is a shorter, accurate reasoning style by default.

Step-by-step, with what/why/examples:
- Prepare two roles from one model
- What happens: We take one reasoning model. The student sees the normal prompt. The teacher sees the same prompt with a short "be concise and correct" instruction added on top.
- Why it exists: The teacher provides a clearer, shorter style that the student can learn from, without using answer keys.
- Example data: Problem: "Factor x^3 - 6x^2 + 11x - 6." Student prompt: normal task instructions. Teacher prompt: same, plus "Solve concisely; avoid unnecessary steps."
- Student writes; teacher guides at each token
- What happens: The student starts writing its hidden reasoning and answer. For each next-token decision, we ask the teacher, "What would your concise mode prefer next?" and read the teacher's probability over tokens for that exact spot.
- Why it exists: Comparing at the exact student-written prefix keeps training on-policy: no mismatch between training data and the model's own style.
- Example: The student begins a long explanation ("Let me check again...") while the teacher gives low preference to that detour and prefers a direct factoring step.
- Nudge the student toward the concise teacher (token-by-token)
- What happens: We gently adjust the student so it places less weight on rambling tokens and more on the teacher's direct path. This nudge is strongest where the student was already likely to go (stability) and weaker elsewhere (safety).
- Why it exists: It avoids crushing the model's ability to explore. Unlike blunt length penalties, this preserves useful uncertainty on hard tasks.
- Example: If the student often writes "Alternatively..." followed by repetition, and the teacher de-emphasizes that word, the student learns to skip it next time.
- Refresh the teacher periodically
- What happens: Every so often (e.g., every 50 training steps), we copy the improved student into the teacher slot. The concise instruction now makes the teacher even more succinct.
- Why it exists: It creates a moving staircase of improvement: each refresh offers a slightly stricter but still safe target, deepening compression without collapse.
- Example: After a few refreshes, the teacher solves the factoring in a few crisp steps; the student learns to do the same without needing the reminder.
- Stop when concise becomes the default
- What happens: After modest training (the paper saw quick convergence), the student habitually picks shorter, more direct reasoning.
- Why it exists: This is the goal: conciseness without asking for it at inference time.
- Example: On a simple algebra task, the model writes a tight, correct derivation in far fewer tokens.

Concrete walk-through with actual content:
- Input: "Find all real x such that x^3 - 6x^2 + 11x - 6 = 0." (In the paper's dataset.)
- Student's first try: Long chain-of-thought: repeats synthetic division twice, re-checks roots, then re-explains the final step after the hidden reasoning; lots of duplication.
- Teacher's concise-mode view: Prefers "Try small integer roots 1, 2, 3; factor quickly; state answer."
- Update: The model learns to prioritize the quick factoring (x-1)(x-2)(x-3) and skip repeats.
- Outcome next time: The model goes straight to testing small roots and factors cleanly, saving hundreds of tokens.

What breaks without each step:
- Without on-policy student rollouts: The model trains on someone else's style and forgets its own (distribution shift).
- Without the concise instruction: No clear "shortest safe path" to imitate.
- Without reverse-KL-style gentle nudges: Training can oscillate or collapse, crushing exploration.
- Without periodic refresh: Compression stalls early (frozen teacher) or destabilizes (updating every step).

The secret sauce:
- Gentle, local learning from yourself: Comparing the student to its own concise-mode view at the exact places it writes keeps changes safe and targeted.
- Automatic difficulty adaptation: Easy tasks get big trims; hard tasks keep needed steps.
- No labels or reward engineering: Simpler, cheaper, and scalable to domains without answer keys.
- Entropy preservation: The model keeps its ability to explore on hard problems instead of being forced into always-short outputs.
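The whole recipe can be sketched end to end. The toy below is a stand-in, not the paper's implementation: the "model" is a single logit over two next tokens (a direct step vs. a detour), the concise instruction is faked as a fixed logit offset, the student is nudged on its own rollout position via a numerical reverse-KL gradient, and the teacher is refreshed from the student every 50 steps (the interval from the example above). All names and constants are illustrative.

```python
import math

def softmax2(z):
    """Two-token softmax over (direct step, detour)."""
    e = math.exp(z)
    return (e / (1 + e), 1 / (1 + e))

def reverse_kl(s, t):
    return sum(si * math.log(si / ti) for si, ti in zip(s, t))

CONCISE_OFFSET = 1.0   # stand-in for the effect of the "be concise" instruction
REFRESH_EVERY = 50     # periodic teacher refresh interval (illustrative)

student_z = 0.0        # student starts undecided between direct step and detour
teacher_z = student_z + CONCISE_OFFSET
for step in range(1, 201):
    if step % REFRESH_EVERY == 0:
        # periodic refresh: copy the improved student, re-apply the instruction
        teacher_z = student_z + CONCISE_OFFSET
    teacher = softmax2(teacher_z)
    # numerical gradient of the reverse-KL nudge at the student's own rollout
    eps = 1e-5
    g = (reverse_kl(softmax2(student_z + eps), teacher)
         - reverse_kl(softmax2(student_z - eps), teacher)) / (2 * eps)
    student_z -= 0.5 * g   # gentle step toward the concise teacher

print(softmax2(student_z)[0] > 0.8)   # student now strongly prefers the direct step
```

Each refresh moves the target one rung up the "moving staircase," so the student's preference for the direct step keeps deepening. This toy omits the natural saturation a real run would show (an actual concise teacher cannot get shorter forever).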
04 Experiments & Results
The test setup: The authors trained on about 13.6k competition-style math problems without ground-truth answers and evaluated on three benchmarks with very different difficulty levels: MATH-500 (500 problems), AIME 2024 (30 problems), and AIME 2025 (30 problems). They measured two main things: (1) accuracy (did the final answer match?), and (2) average number of response tokens (how long was the reasoning). They also checked general knowledge (MMLU) to make sure the model didn't forget broad skills.

Competition (what they compared against):
- Base model (no changes),
- Concise prompt only at inference (no training),
- OPSDC (their method).

Scoreboard with context:
- On MATH-500, Qwen3-14B went from 70.0% to 86.1% accuracy while using about 56.5% fewer tokens. That's like going from a C to a solid A while writing half as much.
- On MATH-500, Qwen3-8B also reached roughly 86.6% accuracy, slashing tokens by about 58.8% (stronger than the prompt-only trick).
- On AIME 2024, Qwen3-14B gained about 10 points (65.8% → 76.3%) with roughly 41% compression.
- On AIME 2025 (hardest), the method still compressed around 35% but traded off a few accuracy points, showing that it eases off on compression for very tough problems.
- General capabilities (MMLU) stayed the same, meaning the model didn't forget broad knowledge while learning to be concise.

Surprising findings:
- Less can be more. Cutting tokens often improved accuracy. Why? Those extra, redundant tokens were not just filler; they were places where the model could introduce mistakes or talk itself out of correct answers.
- Difficulty-adaptive behavior emerged naturally: compression was markedly stronger on easy tasks than on hard ones, without any difficulty detector.
- Entropy (the model's "curiosity" to explore alternatives) stayed stable. RL methods with harsh length penalties often crush entropy; OPSDC preserved it. This is key for tough problems that need exploration.
- The direction of the comparison matters. When they tried a different comparison setup (called "forward KL"), performance became unstable after each teacher refresh, accuracy dipped in saw-tooth patterns, and outputs got too short. The chosen "reverse KL"-style nudge stayed stable and strong.
- Vague can beat precise. Telling the teacher "be concise" worked better than telling it "use exactly 50% fewer tokens." Precise numeric targets inflated compression but hurt accuracy, especially on competition-level tasks.
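For context, the compression percentages reported above are just relative token savings, (before - after) / before. A quick sketch with hypothetical raw counts (the paper reports percentages, not these absolute token numbers, which are chosen here only to reproduce the 56.5% figure):

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of reasoning tokens saved: (before - after) / before."""
    return (before - after) / before

# Hypothetical raw counts; only the resulting percentage matches the paper.
saved = token_reduction(4000, 1740)
print(f"{saved:.1%} fewer tokens")
```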
05 Discussion & Limitations
Limitations:
- It relies on the model being able to follow a "be concise" instruction. Very small or poorly aligned models might not get much benefit until they can follow instructions reliably.
- Most results are for math reasoning. The method is domain-agnostic in design, but broader evaluations (coding, science QA, planning) are still needed.
- Picking the teacher-update interval matters. Updating too often can destabilize training; updating rarely may limit compression depth. The paper found a stable sweet spot around every few dozen steps.

Required resources:
- A standard supervised training stack (no reward models or PPO needed),
- Enough compute to run dual forward passes (student + teacher) per token during training,
- A modest dataset of prompts (no labels needed). The paper used 8 GPUs and trained quickly (they saw strong gains within around 100 steps).

When NOT to use it:
- If you absolutely need long, fully spelled-out chains for auditing (e.g., a formal proof transcript), forcing brevity may hide intermediate detail, even if the final answer is correct.
- If your model cannot follow the concise instruction at all (e.g., too small, unaligned), gains may be limited until you first improve basic instruction following.
- If your decoding stack or product requires fixed, scripted outputs that must include every step verbatim, compression may conflict with those constraints.

Open questions:
- How well does it transfer to coding, scientific reasoning, or tool-using agents without ground-truth rewards?
- Can we automatically tune the teacher-update interval or use a smooth EMA teacher for even better stability?
- How do different wording styles of the concise instruction affect outcomes across model families?
- Can we detect and retain "explanatory" steps users value while still trimming genuine redundancy?
- What are the best safeguards so that compression never trims crucial safety or compliance checks in high-stakes settings?
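On the EMA-teacher question above, a smoother alternative to the hard copy-every-N-steps refresh would blend teacher weights toward the student a little on every step. A minimal sketch, assuming a dict-of-floats stand-in for model weights (the `decay` value and the whole setup are illustrative, not from the paper):

```python
def ema_update(teacher_w, student_w, decay=0.99):
    """Exponential-moving-average teacher: each step keeps `decay` of the old
    teacher and mixes in (1 - decay) of the current student."""
    return {k: decay * teacher_w[k] + (1 - decay) * student_w[k]
            for k in teacher_w}

teacher = {"w": 0.0}
student = {"w": 1.0}   # pretend the student has already improved
for _ in range(300):
    teacher = ema_update(teacher, student)
print(round(teacher["w"], 2))   # teacher drifts smoothly toward the student
```

Compared with the paper's periodic hard copy, the EMA target changes a tiny amount every step instead of jumping every 50, which is the kind of stability trade-off the open question asks about.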
06 Conclusion & Future Work
Three-sentence summary: OPSDC teaches a reasoning model to make concise thinking its default by imitating its own "concise-mode" behavior, no labels or rewards required. It trims 35-59% of tokens while preserving or improving accuracy, especially on easier benchmarks, and it adapts automatically so hard problems keep needed steps. Unlike many RL approaches, it preserves the model's exploratory capacity and stays stable through training.

Main achievement: Showing that a minimalist, on-policy self-distillation setup (just add a concise instruction to create the teacher and gently align the student to it) can both compress and improve reasoning at scale, label-free.

Future directions: Apply this beyond math (coding, science, planning), refine teacher-refresh strategies, personalize the degree of conciseness per user, and combine with selective visibility (keep internal rigor when needed while showing concise summaries).

Why remember this: It flips a common assumption, that more words equal safer reasoning, by proving that shorter, sharper chains often reduce compounding errors and boost accuracy; it's a simple, practical recipe any lab can try without building a full RL pipeline.
Practical Applications
- Homework helpers that show just the key steps, answer faster, and avoid talking themselves into wrong answers.
- Coding assistants that propose concise, correct fixes instead of lengthy, meandering explanations.
- On-device tutoring apps that fit tight memory and compute limits by defaulting to concise reasoning.
- Customer support chatbots that resolve issues quickly without redundant back-and-forth.
- Scientific Q&A systems that summarize derivations while preserving the critical reasoning steps.
- Automated graders or verifiers that benefit from cleaner, shorter student-model solutions.
- Meeting-note generators that keep only decisive reasoning instead of verbose tangents.
- Planning agents that maintain necessary checks on hard tasks but skip fluff on routine tasks.
- Educational tools that can toggle visible detail: concise summaries by default, expanded steps on demand.
- Cost-optimized AI deployments where fewer tokens per answer directly reduce serving bills.