RelayGen: Intra-Generation Model Switching for Efficient Reasoning
Key Summary
- RelayGen is a training-free way to switch between a big model and a small model while one answer is being generated.
- It watches for easier parts in a long reasoning chain and hands those parts to a smaller, faster model.
- Hard parts stay on the big model, so accuracy stays close to the big model's level.
- The paper finds simple words like "Therefore," or "Now," often signal easier stretches, based on measured model certainty.
- Answer writing after the thinking phase is very stable, so the small model can usually handle it without changing the final answer.
- RelayGen reduces waiting time and cost, and it also plays nicely with speculative decoding for extra speed.
- On AIME 2025 with Qwen3-32B, RelayGen plus speculative decoding reached up to 2.2× speedup with under a 2% accuracy drop.
- It needs no extra training or learned routers, just a short offline calibration to pick good switch cues.
- Switching happens at sentence-sized chunks, avoiding the overhead of token-by-token routing.
- RelayGen shows that coarse-grained, difficulty-aware control can deliver strong efficiency without heavy new machinery.
Why This Research Matters
RelayGen makes powerful reasoning models faster and cheaper to use by letting smaller models help at the right times. This means shorter wait times for students, coders, and researchers who rely on long chain-of-thought responses. It can lower energy use and costs in data centers by keeping the large model active only when it truly helps. Because RelayGen is training-free and simple to integrate, organizations can adopt it without building new routers or retraining models. Its compatibility with speculative decoding stacks multiple speedups together, unlocking even more efficiency. In practice, this opens the door to deploying strong reasoning systems on tighter budgets and potentially closer to the edge, where compute is limited.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're solving a long math problem. Some parts are tough puzzles, and some parts are just writing out the final answer neatly. You wouldn't call your genius friend for the neat-writing part; you'd ask them for help only on the hardest steps.
🥬 The Concept (Large Reasoning Models, or LRMs): LRMs are big AI models that solve complex, multi-step problems by writing long chains of thought before giving the final answer. How it works:
- The model thinks out loud in many steps (the reasoning stage).
- Then it writes a short final answer (the answer stage).
- Longer, smarter models tend to do better on tricky puzzles, but take more time and money to run. Why it matters: If we use the biggest model for every single word, it gets slow and expensive, even when the text is easy. 🍞 Anchor: Like asking a math whiz to invent the strategy, then having a classmate copy it neatly.
The world before: As reasoning got better, models grew bigger and wrote longer chains of thought. That increased accuracy, but also made inference expensive and slow. People noticed that not every part of the chain is equally hard, yet most systems still treated it all the same.
The problem: How do we speed up long reasoning without losing the accuracy boost from a big model? We want to save time on the easy stretches while keeping the big model on the hard parts.
🥪 New Concept Sandwich – Input-level routing 🍞 Hook: You know how a coach picks one player to play the whole game, no matter what? 🥬 The Concept: Input-level routing chooses one model per question and sticks with it. How it works: You score the difficulty of the whole input, pick a model, and let it generate everything. Why it matters: It misses the fact that difficulty changes inside one answer, so it can overuse the big model or underperform on hard parts. 🍞 Anchor: Using one backpack for a whole trip, even when sometimes you only need a light daypack.
🥪 New Concept Sandwich – Token-level routing 🍞 Hook: Imagine switching drivers every single second on a road trip. 🥬 The Concept: Token-level routing chooses which model to use at every word. How it works: A trained router predicts, for each token, whether the big or small model should produce it. Why it matters: It's very fine-grained, but needs extra training, adds system complexity, and breaks other speedup tools like speculative decoding. 🍞 Anchor: It's like changing drivers so often that you lose time instead of saving it.
Failed attempts: Step-level or heuristic approaches split the answer into chunks but often pick chunks using hand-made rules that don't always match where the hard thinking really is. Token-level routers are precise but heavy and hard to deploy, often clashing with speculative decoding.
The gap: We need a simple, training-free way to switch models inside one answer, at just the right times, using signals that truly reflect difficulty, and that still works with speculative decoding.
🥪 New Concept Sandwich – Difficulty varies within one output 🍞 Hook: When you write an essay, brainstorming is hard, but writing the final summary is easier. 🥬 The Concept: Difficulty rises and falls during a single generated answer. How it works: Early parts might explore and revise; later parts may summarize or conclude. The model's uncertainty changes accordingly. Why it matters: If we can detect easy stretches, we can switch to a smaller model to save time. 🍞 Anchor: You do deep thinking first, then just clean up the final paragraph.
Real stakes: Faster answers mean lower costs, less waiting, and greener computing. It lets more people use powerful reasoning on everyday devices, and it makes big systems more practical in classrooms, coding assistants, and research tools.
02 Core Idea
🥪 Aha! Moment In one sentence: Watch for moments when the generation becomes easier and hand off those stretches to a smaller model; keep the hard thinking on the big model.
Three analogies:
- Relay race: The sprinter (big model) handles steep hills; on flat ground, they pass the baton to a steady jogger (small model).
- Chef and prep cook: The head chef designs the dish (reasoning), then the prep cook plates and serves (answering) quickly.
- Off-road to highway: Use a 4x4 for the rocky trail, then switch to a scooter for smooth city streets.
Before vs. After:
- Before: Either one model did everything, or complex routers decided token-by-token, adding overhead and hurting compatibility with speculative decoding.
- After: Training-free, segment-level handoffs triggered by empirically chosen switch cues. Big model for hard parts; small model for easy continuations and final answers.
🥪 New Concept Sandwich – Discourse-level cues 🍞 Hook: In stories, words like "Therefore," or "Now," tell you what kind of sentence is coming next. 🥬 The Concept: Discourse-level cues are words that signal shifts in the flow of reasoning, like moving from exploring to summarizing. How it works: The paper profiles many cue words in advance and keeps only those that, on average, are followed by easier, more predictable text for a given model pair. Why it matters: Good cues let us switch models at the right times without training a router. 🍞 Anchor: Seeing "Thus," often means a neat wrap-up is coming, a perfect time to hand off.
🥪 New Concept Sandwich – Token probability margin 🍞 Hook: When you're sure the next word in a sentence is "the," you barely hesitate; when unsure, you pause. 🥬 The Concept: The token probability margin is how much more confident the model is in its top guess than its second-best guess for the next word. How it works: Larger margins mean higher certainty. The paper measures average margins after certain cues and picks cues that reliably lead to higher certainty. Why it matters: Certainty is a useful thermometer for difficulty; higher certainty often means the small model can handle it. 🍞 Anchor: If you're 95% sure the next word is "Therefore," that's a low-difficulty moment.
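To make the margin concrete, here is a minimal sketch in plain Python (with made-up logit values, not real model outputs) that computes the gap between the top two next-token probabilities after a softmax:

```python
import math

def top2_margin(logits):
    """Probability margin between the top-1 and top-2 next-token candidates."""
    # Softmax over raw logits (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return probs[0] - probs[1]

# A confident step: one candidate dominates, so the margin is large.
confident = top2_margin([8.0, 2.0, 1.0, 0.5])
# An uncertain step: two candidates compete, so the margin is small.
uncertain = top2_margin([3.0, 2.9, 1.0, 0.5])
```

A large margin signals a low-difficulty moment where the small model can safely take over; a small margin signals genuine uncertainty best left to the big model.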
Why it works (intuition):
- During the reasoning stage, uncertainty jumps around; after certain cues, the text often becomes summarizing or reflective, which is easier.
- The answer stage depends on the entire previous reasoning, so it's more like formatting the final thought than inventing it; small models do well there.
- Switching at sentence-sized chunks avoids constant overhead and stays compatible with speculative decoding.
Building blocks:
- Offline cue selection using probability margins from a small calibration set (no training).
- Runtime switching using stop tokens at selected cues and at the boundary between thinking and answering.
- Prefix caching so switching doesnât mean redoing all the work.
- Full handoff of the final answer to the small model because itâs stable and cheap.
03 Methodology
High-level recipe: Prompt → Big model starts reasoning → On certain cue words, delegate a sentence to small model → Return to big model for next hard bit → After </think>, small model writes the final answer.
Step 1: Offline pick the right cues (once per big/small pair) What happens: The team collects a small set of reasoning traces (e.g., 40 AMC problems, 4 traces each → 160 samples). For each candidate cue (like "Thus," "Now," "Therefore,"), they measure the average token probability margin from the cue to the end of the sentence and compare it to the global average. Why this step exists: Not all cues mean "easy." Some cues begin new explorations. Picking only cues that consistently raise certainty avoids bad handoffs. Example: If "Thus," is usually followed by higher-confidence sentences, it becomes a switch cue; if "So," is inconsistent, it's rejected.
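The cue-selection step above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `(token, margin)` trace layout and the acceptance rule (post-cue mean margin above the global mean) are simplifying assumptions.

```python
from statistics import mean

def select_cues(traces, candidates):
    """Keep cue words whose post-cue sentences show above-average certainty.

    traces: list of token streams, each a list of (token, margin) pairs.
    candidates: cue strings to evaluate (e.g. "Therefore,", "So,").
    """
    all_margins = [m for trace in traces for _, m in trace]
    global_avg = mean(all_margins)

    selected = []
    for cue in candidates:
        post_cue = []
        for trace in traces:
            for i, (tok, _) in enumerate(trace):
                if tok == cue:
                    # Collect margins from the cue to the end of the sentence.
                    for tok2, m in trace[i:]:
                        post_cue.append(m)
                        if tok2.endswith("."):
                            break
        # Accept the cue only if it reliably precedes higher-certainty text.
        if post_cue and mean(post_cue) > global_avg:
            selected.append(cue)
    return selected

# One toy trace: high certainty after "Therefore,", low after "So,".
trace = [("We", 0.2), ("try", 0.3), ("Therefore,", 0.5), ("x", 0.9),
         ("=", 0.95), ("7.", 0.9), ("So,", 0.4), ("maybe", 0.1), ("not.", 0.2)]
cues = select_cues([trace], ["Therefore,", "So,"])
```

On this toy data only "Therefore," survives selection, mirroring the paper's point that inconsistent cues must be rejected.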
Step 2: Start reasoning on the big model What happens: At runtime, the big model begins generating the chain-of-thought. Its stop tokens include the approved cue words and the special boundary token that ends reasoning (like </think>). Why this step exists: We want the big model to lead the hard thinking but pause right when an easy segment begins. Example: The model writes, "We solved for x. Therefore," and stops on "Therefore," because it's a chosen switch cue.
Step 3: Temporary handoff to the small model for an easy sentence What happens: After a cue triggers a stop, the small model continues just until the sentence ends (e.g., at a period). Then control returns to the big model if reasoning continues. Why this step exists: Sentences after good cues are often simpler and more predictable. The small model handles them fast and cheaply. Example: Small model writes, "Therefore, x = 7 and y = 3." On hitting the period, it stops; control returns to the big model.
Step 4: Repeat as needed during reasoning What happens: This back-and-forth can happen multiple times. Thanks to prefix caching (reusing past context), the system doesn't re-run the entire prompt each time, only the new bits. Why this step exists: Keep hard chunks on the big model and easy chunks on the small model, many times, without paying switching penalties. Example: Later, "Now," appears; the small model takes one sentence; then control returns to the big model again.
Step 5: Final handoff for the answer stage What happens: When the special token indicates the end of thinking (e.g., </think>), the rest (the answer formatting) is fully delegated to the small model. Why this step exists: The paper shows answer writing is highly stable under handoff (about 99.86% matching the big model), while being the most expensive per token due to long attention spans. That makes it the perfect moment to save time. Example: Small model formats: "Final answer: 7."
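Putting Steps 2 through 5 together, the runtime relay loop might look like the sketch below. The `big`/`small` callables and their `(text, stop)` return shape are hypothetical stand-ins; a real engine would implement them with stop-token lists and prefix caching as described above.

```python
def relay_generate(big, small, prompt, cues, think_end="</think>"):
    """Relay-style generation loop (illustrative sketch, not the paper's code).

    big/small are callables: (context, stop_tokens) -> (text, stop_hit).
    """
    out = prompt
    stops = list(cues) + [think_end]
    while True:
        # Big model leads the hard reasoning, pausing at any switch cue.
        text, stop = big(out, stops)
        out += text
        if stop == think_end:
            # Final handoff: the small model writes the entire answer.
            answer, _ = small(out, [])
            return out + answer
        # Temporary handoff: small model completes one easy sentence.
        sentence, _ = small(out, ["."])
        out += sentence

# Tiny scripted stand-ins for the two models (illustrative only).
big_steps = [("We solve for x. ", "Therefore,"), ("Check: consistent. ", "</think>")]
small_steps = [("Therefore, x = 7. ", "."), ("Final answer: 7", None)]

def big(ctx, stops):
    return big_steps.pop(0)

def small(ctx, stops):
    return small_steps.pop(0)

result = relay_generate(big, small, "Q: find x. ", ["Therefore,"])
```

The scripted run interleaves one small-model sentence inside the big model's reasoning and then delegates the whole answer stage, matching the recipe above.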
Secret sauce:
- Coarse, sentence-sized switching triggered by empirically validated cues, not training.
- Answer-stage stability: small model finishes reliably while saving lots of compute.
- Compatibility: Because switching happens in chunks, speculative decoding can still form long draft spans for extra speed.
Concrete data example:
- Calibration set: 40 AMC problems × 4 traces = 160 samples; took ~100 minutes offline on two A100s (about 80 minutes to generate traces, 20 minutes to compute margins and pick cues). Smaller calibration sets (10–40 samples) still worked well in ablations.
- Example cues that worked (for certain model pairs): "Thus," "Therefore," "Now," "Similarly," "Specifically," and variants.
What breaks without each step:
- Without cue selection: You might switch at the wrong times, harming accuracy.
- Without segment boundaries: Switching too often adds overhead and blocks speculative decoding.
- Without full answer handoff: You'd miss the biggest cheap-win zone where small models excel.
- Without prefix caching: Switching would force costly re-prefills, erasing the gains.
04 Experiments & Results
The tests and why: The paper checks two main things: accuracy and speed. Accuracy is measured by pass@1 on reasoning benchmarks (how often the top answer is correct). Speed is measured as end-to-end latency and reported as a speedup compared to using only the large model.
Competitors:
- Small model only: Fast but much weaker.
- Large model only: Strong but slow and expensive.
- Speculative Thinking: Step-level heuristic switching using hand-picked cues.
- R2R: Token-level routing with a trained router (more complex and often incompatible with speculative decoding).
Scoreboard with context:
- Accuracy: With Qwen3-32B/Qwen3-1.7B on AIME 2025, large model alone gets about 70% pass@1; small model alone about 32%. RelayGen reaches about 68.3%, keeping close to the big model while using the small model for easy parts. On GPQA-Diamond, RelayGen is also close to the big model and clearly above the small model. Compared to Speculative Thinking, RelayGen keeps more accuracy (because cue selection is evidence-based). Against R2R, RelayGen is competitive or better on several benchmarks without training a router.
- Speed: RelayGen by itself achieves speedups similar to R2R but with higher large-model utilization (it keeps the hard parts on the big model). Crucially, RelayGen composes with Eagle-3 speculative decoding to reach up to about 2.20× speedup, while staying training-free. That's like finishing two homework sets in the time it used to take to finish one, without your grades really dropping.
Surprising findings:
- Answer-stage stability: When the small model finishes the answer after the big model's reasoning, the answer matches the big model's answer in about 99.86% of cases (728 samples, only one mismatch). This is a free win for speed and cost.
- Small calibration works: Even with as few as 10–40 calibration samples, performance remained strong. You don't need lots of data to pick good cues.
- Not all cues are equal: Using every candidate cue hurts accuracy. Selecting only cues that reliably raise certainty matters a lot.
Big picture: RelayGen hits a sweet spot, delivering near-big-model accuracy with meaningful speedups, and it stacks with speculative decoding instead of fighting it.
05 Discussion & Limitations
Limitations:
- Works best when the model actually writes out long, structured reasoning. If tasks are short or don't have a clear reasoning/answer split, there's less to gain.
- Needs a reasonably capable small model. If the small model can't even handle the easy sentences, switching hurts.
- Cues are model-pair specific. A cue that's easy for one pair might not be for another; a brief calibration is required.
- Most results shown are on English. Multilingual behavior likely transfers but needs testing.
Required resources:
- One large and one small model available at inference (often two GPUs or careful scheduling on one).
- vLLM or an engine with prefix caching and simple stop-token control.
- A tiny calibration pass (10–160 samples) done once per model pair; no training.
When not to use:
- Very short answers or tasks without visible "thinking" segments.
- Highly creative writing where cues don't map to difficulty.
- Ultra safety-critical settings where any offloading risk is unacceptable (until further validation).
Open questions:
- Can cue discovery be automated further or adapted on-the-fly per user/task?
- What's the best way to pick among multiple small models (e.g., a math specialist vs. a generalist)?
- How does this scale with extremely long contexts and memory-optimized hardware?
- Can we extend beyond cue words to structural markers (lists, equations, verified steps) while staying training-free?
- How does this work across many languages and domains with different discourse patterns?
06 Conclusion & Future Work
Three-sentence summary: RelayGen is a simple, training-free way to speed up long reasoning by switching between a big and a small model at the right moments. It uses evidence-based cue words and sentence-sized segments to keep hard thinking on the big model and easy parts on the small one, then hands the final answer to the small model. It preserves most of the big model's accuracy, composes with speculative decoding, and delivers up to about 2.2× speedups.
Main achievement: Showing that coarse-grained, empirically guided switching inside a single answer is enough to capture real difficulty shifts, with no trained router needed, while staying compatible with speculative decoding.
Future directions: Broaden to multilingual setups; refine automatic cue selection; consider multiple small models specialized by domain; integrate smarter runtime signals (still training-free); and explore tighter hardware-software co-design to cut latency further.
Why remember this: RelayGen proves you don't need complicated token-by-token routing to get big wins. A little bit of smart, sentence-level timing, grounded in actual model certainty, can make large reasoning models faster, cheaper, and easier to deploy at scale.
Practical Applications
- Math tutoring systems that keep deep derivations on a big model but let a small model format and present final answers.
- Coding assistants that let the big model solve tricky logic while a small model writes boilerplate or summarizes fixes.
- Customer support bots that use the big model for complex escalation reasoning and a small model for standard replies.
- Research assistants that offload literature summarization to a small model after the big model identifies key insights.
- Exam prep tools that let the big model plan multi-step solutions and the small model generate step-by-step explanations.
- Educational content creation where the big model drafts the outline and the small model cleans and formats lesson text.
- Data analysis write-ups where the big model designs the analysis and the small model produces readable reports.
- On-device assistants that leverage a cloud big model for tough parts while running easy continuations locally.
- Workflow automation agents that use the big model to decide actions and the small model to generate logs and summaries.