Beyond Correctness: Learning Robust Reasoning via Transfer
Key Summary
- This paper teaches language models not just to get the final answer right but to think in a way others can reliably follow.
- It introduces a new training method called RLTR that rewards reasoning that another model can continue to reach the correct answer.
- The key idea is to treat reasoning as meaning that should transfer: if you stop one model halfway, a different model should still be able to finish correctly.
- RLTR adds a transfer reward on top of the usual correctness reward, nudging models to produce clearer, more stable steps.
- On math benchmarks like MATH-500 and AMC23, RLTR boosts both single-try accuracy and majority-vote accuracy across many tries.
- RLTR also learns faster, matching RLVR’s accuracy in about 2.5× fewer training steps while using only a bit more compute per step.
- The method works beyond math, improving scientific question answering on GPQA too.
- Ablations show that a stronger transfer reward and a capable receiver model yield bigger gains in consistency.
- Even though RLTR adds a receiver pass during training, its overall compute to reach a target accuracy is lower thanks to faster learning.
- The big picture: encouraging transferable reasoning makes AI more dependable when you sample multiple answers or change decoding settings.
Why This Research Matters
In real life, we don’t just want right answers—we want explanations others can use, check, and continue. RLTR trains AI to write steps that survive interruptions and handoffs, making systems more dependable when you sample multiple times or change settings. This helps tutors, coding assistants, and research tools produce reasoning that humans (and other AIs) can audit and build on. It also cuts total training compute to reach strong performance, easing costs and environmental impact. By rewarding shareable steps, RLTR moves AI closer to how people learn from each other and work in teams.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine two classmates solving a tricky math problem. One writes super neat steps that any friend can pick up and finish. The other jumps around, and even if they sometimes end at the right answer, nobody else can follow. Which one would you trust more? 🥬 The Concept: Reasoning is the step-by-step thinking that leads from a question to an answer. • How it works: 1) You read the problem. 2) You write small steps that move you forward. 3) You check that each step is clear and correct. • Why it matters: If steps aren’t clear, you might occasionally land on the right answer by luck, but others can’t reuse your work or recover from interruptions. 🍞 Anchor: In math class, a clean solution helps classmates learn and helps you catch mistakes.
🍞 Hook: You know how a coach gives points to players who make good moves, not just to the team that happens to win? 🥬 The Concept: Reinforcement Learning (RL) is a way to train models by giving them rewards for good behavior. • How it works: 1) The model tries something (like writing a solution). 2) It gets a reward (points) based on how good that attempt was. 3) It adjusts to try to get higher rewards next time. • Why it matters: Without rewards, the model can’t tell what to improve. 🍞 Anchor: Like a video game character learning to avoid traps because losing a life gives a “bad” signal.
🍞 Hook: Think of a giant digital librarian who has read tons of books and can talk about almost anything. 🥬 The Concept: Large Language Models (LLMs) are AI systems trained on lots of text to predict and produce useful words and sentences. • How it works: 1) They read your prompt. 2) They guess the next word many times in a row. 3) They form answers and explanations. • Why it matters: LLMs can reason, but their steps can be messy or fragile unless trained carefully. 🍞 Anchor: When you ask a math tutor bot for help, it uses an LLM to show the steps and answer.
🍞 Hook: Imagine judging a bake-off by tasting each cake yourself instead of trusting a friend’s opinion. 🥬 The Concept: Verification Processes check answers directly (like comparing to the correct number), instead of trusting a learned judge. • How it works: 1) The model gives an answer. 2) A simple rule checks if it matches ground truth. 3) Reward is given only if it’s correct. • Why it matters: This avoids “reward hacking,” where a model tricks a learned judge. 🍞 Anchor: A calculator confirming 42 equals 42 is safer than asking a friend who sometimes guesses.
🍞 Hook: You know how teachers sometimes grade only the final answer? That can miss sloppy work along the way. 🥬 The Concept: Reinforcement Learning with Verifiable Reward (RLVR) trains models using rewards from answer correctness and formatting, not from a separate, fallible reward model. • How it works: 1) Model writes reasoning and an answer. 2) A checker verifies if the final answer is right and the format is valid. 3) The model gets reward points accordingly. • Why it matters: RLVR makes training simpler and safer, but it pays attention to the destination more than the journey, so the steps can still be brittle. 🍞 Anchor: A student who gets the right answer but with confusing steps still gets full credit.
🍞 Hook: If five friends each try a puzzle and most agree on the same solution, you feel confident, right? 🥬 The Concept: Statistical Consistency is about getting the same correct result across many tries, not just once. • How it works: 1) Sample multiple answers from the model. 2) See how often they agree on the right answer. 3) Higher agreement means more reliable reasoning. • Why it matters: Without consistency, the model might be lucky sometimes but unreliable overall. 🍞 Anchor: Picking a restaurant when most friends independently choose the same spot.
🍞 Hook: Imagine spinning a wheel several times to see different outcomes. 🥬 The Concept: Sampling Techniques draw multiple model responses by adding controlled randomness (like temperature) to see a variety of solutions. • How it works: 1) Change sampling temperature to allow different word choices. 2) Generate K solutions. 3) Compare and combine them. • Why it matters: If reasoning is robust, many samples should still land on the correct answer. 🍞 Anchor: Trying a few different paths in a maze but still finding the exit most of the time.
🍞 Hook: When a class votes on a field trip, the majority choice usually wins. 🥬 The Concept: Majority Voting at K (Maj@K) picks the most frequent answer among K model attempts and counts it as correct if it matches the truth. • How it works: 1) Generate K answers. 2) Tally the answers. 3) The most common answer is your final pick. • Why it matters: If the model’s reasoning is stable, the correct answer will dominate as K grows. 🍞 Anchor: If 50 out of 64 draws say “Paris,” you trust that over a single guess.
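In code, Maj@K is just a frequency count over K sampled answers. A minimal sketch (the list of samples stands in for K real model calls):

```python
from collections import Counter

def maj_at_k(answers, ground_truth):
    """Majority Voting at K: pick the most frequent answer among K samples
    and count the problem as solved if that answer matches ground truth."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer == ground_truth

# 64 sampled answers: 50 say "Paris", the rest are scattered wrong guesses.
samples = ["Paris"] * 50 + ["Lyon"] * 10 + ["Rome"] * 4
print(maj_at_k(samples, "Paris"))  # True: the majority vote is correct
```

Robust reasoning shows up here as the correct answer dominating the tally even when individual samples vary.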
The world before: LLMs got better using RL, especially with RLVR, which avoids tricky learned reward models and instead rewards verified final answers. This cut down on “reward hacking” and boosted scores on math tasks.
The problem: Focusing only on final answers doesn’t teach models to write steps that are easy to follow, robust under interruptions, or reusable by others. Models could look good on a single try but wobble when you sample many times or change decoding settings; in some cases, more samples even hurt majority voting, hinting that the steps are brittle.
Failed attempts: Process Reward Models try to score steps directly with a learned verifier, but they need lots of step-level labels and can reintroduce reward mismatch and complexity. Self-consistency helps a bit but doesn’t train the reasoning itself to be shareable or stable mid-trajectory.
The gap: We need a training signal that values the reasoning process itself—clear enough that another model could reliably pick it up mid-way.
Real stakes: In tutoring, coding help, science Q&A, and safety-critical uses, we need dependable steps that other systems (or people) can audit, continue, or recover from—even if the first attempt gets cut off or switched to a different “thinker.”
02 Core Idea
🍞 Hook: Imagine you start a puzzle, then hand your half-finished notebook to a friend. If your notes are clear, your friend can finish it correctly. If not, they’ll get lost. 🥬 The Concept: The paper’s key insight is to treat reasoning as meaning that should transfer: a good partial explanation from one model should help another model finish with the right answer. • How it works: 1) One model begins the solution. 2) We stop halfway (truncate). 3) A different model continues from those notes. 4) If the final answer is right, we reward the starter model for being clear and helpful. • Why it matters: Without this, models might produce steps that only they can interpret, breaking under interruptions or different decoders. 🍞 Anchor: A math solution that a substitute teacher can pick up mid-proof and still grade to a correct result.
Explain the idea 3 ways:
- Relay race: Runner A hands a baton (partial reasoning) to Runner B. If the baton is cleanly passed, the team wins. RLTR rewards clean handoffs, not just crossing the finish line.
- Lego instructions: Good instructions let anyone assemble the set from the middle. RLTR rewards writing those instructions.
- Cooking recipe: If a chef leaves halfway, another cook should finish from the written steps. RLTR rewards writing recipes that travel well.
🍞 Hook: You know how teachers sometimes check if your explanation helps a classmate solve the same problem? 🥬 The Concept: Reasoning Transferability is how well a partial reasoning prefix from one model helps a different model reach the correct final answer. • How it works: 1) Generate steps and stop at a random point. 2) Hand them to a separate model. 3) If it finishes correctly, that prefix had high transferability. • Why it matters: Without transferability, steps are fragile or idiosyncratic and won’t survive interruptions or handoffs. 🍞 Anchor: A good outline lets any teammate finish your slideshow accurately.
🍞 Hook: Think of following a trail of breadcrumbs. If the early crumbs are clear, anyone can keep going. 🥬 The Concept: Cross-Model Reasoning Transferability checks that a different, frozen receiver model can continue another model’s partial reasoning to a correct answer. • How it works: 1) Pick a receiver model. 2) Feed it the truncated reasoning. 3) Verify if its final answer matches ground truth. • Why it matters: Passing this test shows the steps are understandable beyond the original model’s “mind.” 🍞 Anchor: A substitute in a group project finishing your half-done section correctly.
🍞 Hook: Imagine a scoreboard that gives points only if a teammate can finish from your notes. 🥬 The Concept: Transfer Reward is an extra reward given when the receiver completes the problem correctly from the truncated steps. • How it works: 1) Start with the usual correctness reward for your own final answer. 2) Add the transfer reward if a different model, starting from your prefix, also gets the right answer. 3) Combine them to train the generator to write reusable steps. • Why it matters: Without the transfer reward, the model has no reason to make its steps stable and shareable. 🍞 Anchor: You earn bonus points when your lab notebook helps a classmate replicate your experiment.
Before vs. After: Before RLTR, RLVR taught models to hit the bullseye but didn’t ensure they drew a clear path. After RLTR, models still aim for the bullseye but also learn to sketch a reliable map that anyone can follow.
Why it works: Optimizing transferability pressures the model to express intermediate structure (definitions, invariants, subgoals) that generalizes across small sampling changes and even across different models.
Building blocks: (a) A generator model writes full reasoning; (b) a truncation step cuts a prefix at random ratios; (c) a frozen receiver continues; (d) verifiable checks compute answer reward and transfer reward; (e) a weighted sum of rewards updates the generator; (f) repeat over tasks.
Intuition: If many different continuations—especially from a different model—can finish correctly, then the early steps must encode stable, interpretable logic, which also boosts multi-sample consistency like Maj@K.
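The weighted sum of rewards can be written explicitly (the notation here is illustrative, not the paper’s exact formulation):

```latex
R(x, y) \;=\; r_{\text{ans}}(x, y) \;+\; \lambda \, r_{\text{trans}}(x, y_{\le \tau})
```

where r_ans ∈ {0, 1} verifies the generator’s own final answer, r_trans ∈ {0, 1} is 1 when the frozen receiver, continuing from the truncated prefix y_{≤τ}, reaches the correct answer, and λ controls the answer/transfer balance that the ablations in the Experiments section vary.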
03 Methodology
High-level pipeline: Input → Generator writes reasoning and answer → Truncate reasoning to a prefix → Receiver continues from the prefix → Verify final answers → Combine answer reward + transfer reward → Update generator policy.
🍞 Hook: Think of writing a solution, tearing it halfway, and asking a friend to finish. If they succeed, your first half was solid. 🥬 The Concept: Output Accuracy is whether the final answer matches the ground truth. • How it works: 1) Extract the final boxed answer. 2) Compare to the correct answer. 3) Reward 1 for match, 0 otherwise. • Why it matters: Without this, you can’t tell if the reasoning solves the problem. 🍞 Anchor: Checking if 24×2 really equals 48.
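A simple checker can make this concrete. The sketch below extracts the last \boxed{...} answer with a regex and compares it to ground truth; the paper’s actual parser may be stricter, so treat this as an assumption, not the real verifier:

```python
import re

def extract_boxed(text):
    """Pull the content of the last \\boxed{...} in a solution string.
    (Simple regex; a real checker may normalize numbers, fractions, etc.)"""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def answer_reward(solution, ground_truth):
    """Output accuracy as a verifiable reward: 1 for a match, 0 otherwise."""
    return 1 if extract_boxed(solution) == str(ground_truth) else 0

print(answer_reward("3x = 12, so x = 4. Final: \\boxed{4}", 4))  # 1
print(answer_reward("I think the answer is 5. \\boxed{5}", 4))   # 0
```

This is also where format sensitivity bites: if the model never writes \boxed{...}, the checker returns no answer and the reward is 0 even for correct reasoning.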
Step-by-step:
- Input and generator rollout • What happens: The generator model (trainable) takes a problem x, produces full reasoning and a final answer. • Why this step: We need the generator’s natural reasoning trace to judge its clarity and correctness later. • Example: On a MATH-500 problem, it writes 300–1,000 tokens of chain-of-thought and outputs \boxed{42}.
- Truncation of the reasoning prefix • What happens: Randomly choose a truncation ratio τ between 0.3 and 0.9; keep the first τ fraction of tokens. • Why this step: It simulates interruptions at unpredictable points, forcing the model to write sturdy steps at any stage. • Example: Cut the 600-token trace at 420 tokens when τ=0.7.
- Receiver continuation • What happens: Feed the (problem x + truncated prefix) to a frozen receiver model; let it continue to a final answer. • Why this step: It tests cross-model understanding; the steps must be understandable to a different “mind.” • Example: The receiver is a 3B model that reads your partial solution and outputs \boxed{42}.
- Verifiable checks and rewards • What happens: Compute two rewards: (a) Answer reward for the generator’s own final answer; (b) Transfer reward if the receiver’s final answer is also correct from the prefix. Optionally include a format reward to ensure clean outputs. • Why this step: Answer reward keeps the task grounded; transfer reward incentivizes reusable intermediate structure. • Example: If both are correct, the combined reward is higher than if only the generator got it right.
- Policy update • What happens: Use the combined reward signal to update the generator via a simple, stable RL procedure (e.g., GRPO), encouraging behaviors that increase both correctness and transferability. • Why this step: Without updating the policy on both signals, the model won’t learn to write portable steps. • Example: Over many batches, the generator learns to define variables clearly, state subgoals, and avoid brittle leaps.
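The steps above can be sketched as one reward computation. This is a minimal, illustrative sketch: `generator` and `receiver` are hypothetical stand-ins for LLM sampling calls, the answer check is a crude suffix match rather than a real verifier, truncation is by characters rather than tokens, and the GRPO policy update itself is not shown:

```python
import random

def rltr_step_rewards(problem, ground_truth, generator, receiver,
                      lam=1.0, tau_range=(0.3, 0.9)):
    """Compute the combined RLTR reward for one rollout (sketch).
    `generator`/`receiver` map a prompt string to a completed solution
    string; the returned scalar would drive the policy update."""
    # 1) Generator rollout: full reasoning plus a final answer.
    solution = generator(problem)
    answer_r = 1 if solution.strip().endswith(str(ground_truth)) else 0

    # 2) Truncate the trace at a random ratio tau, simulating interruption.
    tau = random.uniform(*tau_range)
    prefix = solution[: int(len(solution) * tau)]

    # 3) The frozen receiver continues from (problem + truncated prefix).
    continuation = receiver(problem + "\n" + prefix)
    transfer_r = 1 if continuation.strip().endswith(str(ground_truth)) else 0

    # 4) Weighted combination of answer reward and transfer reward.
    return answer_r + lam * transfer_r

# Toy stand-ins: a "clear" generator whose prefix any receiver can finish.
gen = lambda p: "3x + 2 = 14 so 3x = 12 so x = 4"
recv = lambda p: p + " ... therefore the answer is 4"
print(rltr_step_rewards("If 3x+2=14, find x.", 4, gen, recv))  # 2.0
```

A generator whose prefixes confuse the receiver would earn only the answer reward (1.0 here), which is exactly the pressure toward reusable steps.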
🍞 Hook: When you ask many classmates the same question, if most agree, you trust the answer. 🥬 The Concept: Majority Voting at K (Maj@K) is a test-time metric: generate K answers, pick the most frequent, and check if it’s correct. • How it works: 1) Sample K solutions with temperature. 2) Tally final answers. 3) See if the majority matches ground truth. • Why it matters: Robust steps cause independent samples to converge on the same correct answer. 🍞 Anchor: If 60 of 64 votes say “triangle,” that’s strong evidence.
Secret sauce: The truncation-and-transfer test creates a training signal that directly values clarity in the middle of thinking, not just at the end. Random τ prevents overfitting to fixed cut points. A capable, frozen receiver provides honest, external pressure: if steps are vague, transfer fails and the generator earns less. By combining answer reward and transfer reward, RLTR balances destination and journey.
Concrete mini-example:
- Problem: “If 3x+2=14, what is x?”
- Generator steps: (1) 3x = 12 (2) x = 4; Final: \boxed{4}.
- Truncation at τ=0.5: Keep “3x+2=14 ⇒ 3x=12 …”
- Receiver continuation: “So x=4; \boxed{4}.”
- Rewards: Answer reward = 1 (generator correct); transfer reward = 1 (receiver correct). The update reinforces patterns like isolating variables clearly and boxing the final answer.
What breaks without each step:
- No truncation: The model might hide confusion in the early steps.
- No receiver: You don’t know if steps are understandable to others.
- No transfer reward: The model has no incentive to make steps reusable.
- No answer reward: The model could write pretty steps that go nowhere.
- No format reward: The checker might fail to read the answer correctly.
04 Experiments & Results
The test: Measure both single-try accuracy (how often one sample is right) and multi-try consistency via Maj@K (how often the most common answer among K samples is right). Also track transferability (the receiver’s accuracy when continuing from truncated prefixes). Why these metrics: Accuracy shows skill; Maj@K shows stability; transferability tests whether the steps really help others finish.
The competition:
- Baselines: the original base model and RLVR (answer-only reward).
- Datasets: MATH-500 and GSM8K (moderate), AMC23 and AIME2024 (hard competitions), and GPQA (science QA).
- Settings: the same sampling temperature (T=1.0), with results averaged over three seeds.
Scoreboard with context:
- MATH-500 (in-distribution, moderate): Base: Acc 71.0, Maj@16 81.2, Maj@64 82.6. RLVR: Acc 76.2, Maj@16 80.2, Maj@64 82.2. RLTR: Acc 77.0, Maj@16 83.8, Maj@64 84.2. Meaning: RLTR gets an A- on both single tries and group votes, while RLVR looks like an A- on single tries but slips to a B+ when many samples vote. RLTR’s +3.6 points over RLVR on Maj@16 (and +2.0 on Maj@64) is like jumping from a B to a solid A on consistency.
- GSM8K (out-of-distribution, moderate): Base: Acc 89.1, Maj@64 93.3. RLVR: Acc 89.4, Maj@64 92.9. RLTR: Acc 92.0, Maj@64 94.2. Meaning: RLTR improves both accuracy and consistency out of domain; RLVR slightly helps single tries but loses to the base model on large-K voting.
- AMC23 (hard): Base: Acc 46.2, Maj@64 60.8. RLVR: Acc 52.8, Maj@64 61.7. RLTR: Acc 53.5, Maj@64 67.5. Meaning: On tough competition problems, RLTR clearly increases majority-vote reliability, like getting several more problems right when you let the class vote.
- AIME2024 (hardest): Base: Acc 9.8, Maj@64 16.7. RLVR: Acc 11.6, Maj@64 18.9. RLTR: Acc 14.8, Maj@64 21.1. Meaning: RLTR’s gains grow on the toughest problems, where clean, transferable steps matter most.
- GPQA (science QA): Base: Acc 32.4, Maj@16 35.2. RLVR: Acc 33.0, Maj@16 37.0. RLTR: Acc 34.8, Maj@16 37.7. Meaning: The idea generalizes beyond math; better transfer leads to steadier science answers too.
Training dynamics and efficiency:
- RLTR reaches RLVR’s accuracy with about 2.5× fewer steps, learning faster thanks to the extra, process-sensitive signal.
- Even though RLTR adds a receiver pass (≈7% more FLOPs per step), it still requires much less total compute to reach the same accuracy level, since it needs far fewer steps. Think of paying a tiny toll but taking a freeway that’s much shorter overall.
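The compute trade-off is simple arithmetic, using the two figures quoted above (≈7% per-step overhead, ≈2.5× fewer steps):

```python
# Normalized compute to reach the target accuracy: steps * cost-per-step.
rlvr_total = 1.0 * 1.00          # RLVR: baseline steps at baseline cost
rltr_total = (1 / 2.5) * 1.07    # RLTR: 2.5x fewer steps, 7% pricier each

print(f"RLTR total compute vs RLVR: {rltr_total / rlvr_total:.0%}")  # 43%
```

So despite the receiver pass, RLTR reaches the target with well under half of RLVR’s total compute.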
Surprising and telling findings:
- Transferability strongly tracks Maj@K over training: when prefixes help a different model finish, many samples from the generator also agree more often.
- RLVR can keep its single-sample grade while its multi-sample consistency drops later in training, like a student who gets some answers right but becomes less predictable when you re-ask in different ways.
- RLTR improves Pass@K diversity while boosting accuracy, avoiding the pitfall where models only concentrate probability on a few brittle patterns.
Ablations:
- Reward ratio: Heavier emphasis on the transfer reward boosts high-K Maj@K and Pass@K, confirming that rewarding reusable steps pays off for consensus.
- Receiver choice: A stronger receiver model provides a clearer transfer signal, further improving high-K Maj@K, though RLTR still beats RLVR even with a smaller receiver.
05 Discussion & Limitations
Limitations:
- Extra compute per step: RLTR adds a receiver rollout, increasing per-step cost by about 7%. The good news is that faster learning more than compensates, lowering total compute to a target accuracy.
- Receiver dependence: The training signal quality depends on the receiver. A too-weak receiver might under-credit good prefixes; a too-strong or stylistically mismatched receiver could bias the signal.
- Domain assumptions: RLTR relies on verifiable end answers (math, code execution, multiple choice). Open-ended tasks without clear ground truth need adaptations.
- Truncation design: Uniform random truncation between 0.3 and 0.9 works well, but the best truncation curriculum or adaptive strategy is still an open question.
- Format sensitivity: If the final answer is misformatted, verification can fail; format rewards help, but strict parsers can be brittle.
Required resources:
- A trainable generator LLM and a frozen receiver LLM.
- A verifiable-reward dataset (ground-truth answers).
- GPUs with enough memory for RL and rollouts; the paper used 8× H200 141 GB for main development.
- An RL framework supporting GRPO-style training and verifiable checking.
When not to use:
- Ultra-low-latency applications where any extra rollout is unacceptable.
- Tasks with no reliable verification signals (e.g., subjective writing quality without preference labels).
- Scenarios with highly specialized private receivers that misinterpret general prefixes (mismatched style or domain).
Open questions:
- Can we adaptively choose truncation points where clarity matters most (e.g., at subgoal boundaries)?
- What about multi-receiver ensembles to reduce bias and better approximate “anyone can finish”?
- How does RLTR combine with Process Reward Models to blend portable prefixes with finer step-scoring, while avoiding proxy pitfalls?
- Can transferability work as a verifier at test time (accepting solutions whose prefixes multiple receivers can complete correctly)?
- Beyond math and science, how well does RLTR scale to tool-use agents, multi-modal tasks, or interactive planning?
06 Conclusion & Future Work
Three-sentence summary: This paper proposes RLTR, a simple but powerful extension to RLVR that rewards reasoning which transfers: if a different model can finish your partial steps to the right answer, you earn bonus points. This pushes models to write clearer, more stable steps, improving both accuracy and multi-sample consistency (Maj@K) and learning much faster. The method generalizes beyond math, reduces total compute to reach a target accuracy, and preserves diversity better than standard approaches.
Main achievement: Turning “explain it so others can finish” into a training signal (transfer reward) that measurably raises robustness in reasoning and consensus across samples.
Future directions: Use transfer as a verifier at test time; explore adaptive truncation curricula; mix multiple receivers; combine with lightweight process supervision; extend to code, tool-use, and multi-modal reasoning. Why remember this: RLTR shifts focus from only “Was the answer right?” to also “Are the steps reusable by others?”, which aligns AI reasoning with how humans learn, teach, and trust explanations.
Practical Applications
- Math tutoring that shows clear, reusable steps students and teachers can follow.
- Coding assistants whose partial plans remain usable if generation is interrupted or handed to another tool.
- Scientific Q&A systems that provide stable intermediate reasoning for peer review.
- Team-of-agents setups where one model’s partial plan can be reliably picked up by another.
- Education platforms that grade not only final answers but also the transferability of explanations.
- Auto-debugging pipelines where interrupted traces still allow a verifier or second model to finish.
- Safer deployment via test-time checks that prefer solutions with high cross-model transferability.
- Knowledge distillation where clean, transferable steps help train smaller student models.
- Interactive planning tools (e.g., robotics or scheduling) that can switch planners mid-run without losing the plot.
- Evaluation dashboards that track transferability alongside accuracy and Maj@K to monitor robustness.