Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
Key Summary
- • Reasoning Cache (RC) is a new way for AI to think in steps: it writes some thoughts, makes a short summary, throws away the long thoughts, and then keeps going using only the summary.
- • This simple loop lets the AI keep improving for a very long time at test time, even longer than what it practiced during training (it extrapolates).
- • RC works because most models are better at reading and using a clear summary than at starting from scratch every time (a summarization-generation asymmetry).
- • The team trains the model with short-horizon reinforcement learning (RL) that rewards good answers when reasoning from summaries, plus a replay buffer that safely reuses past summaries.
- • With only a 16k-token training budget, the RC-trained 4B model reaches about 70% on HMMT 2025 when allowed 512k tokens at test time, beating bigger specialized reasoning models.
- • RC decoding alone already helps instruction-following models; training with RC helps much more and generalizes beyond math to scientific reasoning (FrontierScience).
- • RC avoids the usual long-context problems: each step stays short and in-distribution, so the model doesn't get verbose or repetitive as it reasons longer.
- • Summaries must be the right size (about 1–2 paragraphs): too short loses key ideas; too long acts like clutter and hurts progress.
- • RC-trained models also use external scaffolds (like RSA and DSM Agent) better, because they learned to reason from self-generated guidance.
- • RC isn't perfect: it uses myopic rewards, relies on good summarization skills, and helps less on search-heavy tasks that need full detailed logs.
Why This Research Matters
RC makes "more thinking time" actually pay off. It means a homework helper that keeps improving its solution when you let it run longer. It lets science assistants and planners reliably push through tough steps without getting lost in long, messy text. Because RC keeps each step short and focused, it's faster and more memory-friendly than giant single-pass generations. RC-trained models also plug into existing toolchains and scaffolds more effectively, amplifying benefits from things we already know work. In short, RC turns extra compute into real, steady progress on hard problems people care about.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you do a big school project, you don't write everything at once? You make notes, then a short outline, and keep improving using that outline so you don't get lost.
🥬 The Concept: Autoregressive decoding.
- What it is: Itās a way an AI writes answers one token (tiny piece of text) at a time, using everything it has already written as context.
- How it works:
- Read the question.
- Predict the next token based on all previous tokens.
- Repeat until you think you're done.
- Why it matters: Without it, the AI can't build long answers step-by-step; but if the answer gets very long, it can drift off-topic or repeat. 🍞 Bottom Bread (Anchor): Like finishing a story by adding one word at a time while rereading the whole story so far before each new word.
🍞 Top Bread (Hook): Imagine your teacher gives you a time limit for each homework problem. You have to choose how much to think before writing your final answer.
🥬 The Concept: Token budget.
- What it is: A limit on how many tokens (pieces of text) the model can use to think and answer.
- How it works:
- Pick a maximum number of tokens allowed.
- The model uses part of it to think (reasoning) and part to answer.
- When the budget runs out, it must stop.
- Why it matters: If the problem is hard but the budget is too small, the model can't think long enough to solve it. 🍞 Bottom Bread (Anchor): Like having only 5 minutes to solve a puzzle: you might stop just before the final move.
🍞 Top Bread (Hook): Think of learning to shoot basketball free throws. You try, get a score (reward), and adjust based on what worked.
🥬 The Concept: Reinforcement Learning (RL).
- What it is: A way for AI to learn by trying different answers and getting rewards for good outcomes.
- How it works:
- The model generates an answer.
- It gets a reward (e.g., correct or not).
- It updates itself to make good answers more likely next time.
- Why it matters: RL teaches procedures for solving problems, not just copying examples. 🍞 Bottom Bread (Anchor): Like practicing until you sink more shots because you learn from each score.
🍞 Top Bread (Hook): Have you ever read a super long message and lost the point in the middle?
🥬 The Concept: Distribution shift (when test time looks different from training time).
- What it is: The model runs into kinds of long contexts and states it didn't practice on during training.
- How it works:
- During training, responses are short/medium.
- At test time, responses get much longer.
- The model now generates in unfamiliar territory and can repeat or ramble.
- Why it matters: Even strong models can get worse if they leave the "familiar zone." 🍞 Bottom Bread (Anchor): Like suddenly writing a 20-page essay when you've only practiced 2 pages: you might repeat yourself.
🍞 Top Bread (Hook): Imagine you can keep improving your drawing if you spend more time: 10 minutes looks okay, 1 hour looks great.
🥬 The Concept: Extrapolation of reasoning.
- What it is: The ability for the model to get better as you give it more test-time compute than it saw in training.
- How it works:
- Start with a process that works in short runs.
- Let it run for more steps at test time.
- Each extra step improves or refines the solution.
- Why it matters: Without extrapolation, giving extra time or tokens won't help on harder tasks. 🍞 Bottom Bread (Anchor): Like practicing a piano piece longer and actually getting better, not just playing the same mistakes louder.
The world before: LLMs could reason with chain-of-thought, and RL could make them more strategic. But training was done with fixed token budgets and data. At test time, harder problems required longer thinking, and models either stopped early (because training taught them to finish within the budget) or kept going but got repetitive when contexts grew far beyond training lengths.
The problem: How to let models keep improving for much longer at test time (hours, many turns, or hundreds of thousands of tokens) without retraining weights and without getting lost.
Failed attempts:
- Just increase the training budget: Too expensive, and it still doesn't generalize to even longer horizons.
- Self-refine/self-verify by prompting: Helps some, but still conditions on ever-longer raw traces and runs into distribution shift and verbosity.
- Special curricula/datasets: Gains taper after about 3–4× the training length.
The gap: We need a decoding procedure that keeps each step short and in-distribution but lets total progress grow linearly with the number of steps. And we need training that teaches the model to use this procedure.
Real stakes: In daily life, this means tutors that can stick with a tricky math proof, science assistants that can explore ideas for days, coders that can debug and refine big projects, and planners that improve with more time rather than rambling. Without reliable long-horizon improvement, extra compute is wasted, answers stagnate, and users can't trust the model to persevere on hard problems.
02 Core Idea
🍞 Top Bread (Hook): You know how you solve a big puzzle by doing a chunk, jotting a short note of what you learned, then using that note to do the next chunk?
🥬 The Concept: Iterative decoding.
- What it is: A way for the model to work in turnsāthink a bit, then compress that thinking, then think more using only the compressed note.
- How it works:
- Generate reasoning for a short, fixed length.
- Summarize it into a short "note."
- Throw away the long text and keep the note.
- Repeat, conditioning only on the note.
- Why it matters: Each step stays short (safe and familiar), but the total number of steps can grow very large. 🍞 Bottom Bread (Anchor): Like hiking by following a simple trail note each mile instead of carrying a heavy scroll of everything you've seen.
🍞 Top Bread (Hook): Imagine you're better at using a good outline than writing a perfect essay from scratch.
🥬 The Concept: Summarization-generation asymmetry.
- What it is: Models are usually better at reading and acting on a clear summary than at solving a problem cold.
- How it works:
- Write a clean, 1–2 paragraph summary of what was tried and found.
- Use that as guidance to verify, refine, or explore next.
- Repeat, so the model keeps momentum.
- Why it matters: This gap powers RC: summaries guide better next steps than raw, messy long text. 🍞 Bottom Bread (Anchor): Like cooking better when you follow a neat recipe card than when you skim a messy, 20-page cooking diary.
🍞 Top Bread (Hook): Think of a stopwatch that lets you add more 5-minute chunks whenever you want.
🥬 The Concept: Reasoning Cache (RC).
- What it is: A generate-summarize-repeat decoding algorithm where the summary is the "cache" that carries progress forward.
- How it works:
- Reason up to HR tokens (a short, fixed step).
- Summarize to HS tokens (tiny).
- Discard the long trace; keep only the summary.
- Start the next step conditioned on the summary.
- Why it matters: Total progress scales with turns (T×HR), while each turn stays short and familiar. 🍞 Bottom Bread (Anchor): Like keeping a small index card that grows smarter each turn, instead of lugging around every page you ever wrote.
The "Aha!" in one sentence: If you keep each step short and pass forward only a compact, useful summary, you can safely chain many steps and get better the longer you try.
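The loop behind that one sentence can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reason` and `summarize` are hypothetical stand-ins for model calls under the reasoning instruction IR and summarizing instruction IS, and the token budgets are enforced here by simple truncation.

```python
def reasoning_cache_decode(problem, reason, summarize, T, HR, HS):
    """Generate-summarize-repeat decoding: only a short summary survives each turn."""
    summary = ""  # the "cache"; empty on turn 0
    trace = ""
    for _ in range(T):
        # Reason for at most HR tokens, conditioned only on the problem + summary.
        trace = reason(problem, summary)[:HR]
        # Compress the new trace, merged with the old summary, into at most HS tokens.
        summary = summarize(problem, summary, trace)[:HS]
        # The long reasoning trace is discarded here; only `summary` carries forward.
    return trace, summary  # the final turn's reasoning trace holds the answer
```

With toy stand-ins for the two model calls, total reasoning grows as T×HR, while no single call ever conditions on more than the problem plus an HS-sized summary.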
Three analogies:
- Travel journal: Take a trip, write a short highlight, toss all receipts, plan the next leg from the highlight.
- Science lab notebook: Run a short experiment, log key findings, design the next step from the log.
- Video game boss fight: Try a strategy, save a short "what worked/failed" note, then restart using that tip for a better run.
Before vs After:
- Before: Long answers drift; extra tokens don't help much beyond training lengths; models may stop early.
- After: Add more RC turns; accuracy keeps rising; models stay on-track because each turn looks like training.
Why it works (intuition, not equations):
- Keep distribution stable: Every turn is a short, familiar generation, so no long-context weirdness.
- Use the asymmetry: Clear summaries are easier to act on than raw, noisy chains-of-thought.
- Make progress monotonic: Summaries focus on conclusions and next steps, so each turn tends to move forward by verifying, refining, or exploring.
Building blocks:
- Two instructions: one for Reasoning (IR) and one for Summarizing (IS).
- Budgets: HR for reasoning, HS for the summary, with HS ≪ HR.
- Turns: T turns give Htest ≈ T×HR total reasoning.
- Training: Short-horizon RL rewards correct answers when reasoning from summaries; a replay buffer reuses useful summaries so later-turn states are seen during training.
- Output: The final reasoning trace from the last turn is the answer.
03 Methodology
At a high level: Input problem → RC decoding turn (Reasoning HR → Summary HS) → repeat T times → final answer. For training: Input batch → run a few RC turns to collect summaries → sample summaries → generate K reasoning rollouts from each summary → give outcome rewards → update policy (GRPO) → store summaries in replay buffer → next batch.
Step 1: RC Decoding (one turn)
- What happens: The model reads the problem plus the current summary (if any), writes up to HR tokens of reasoning, then writes a short HS-token summary of that reasoning (merging with the previous summary), and discards the long reasoning.
- Why this step exists: It keeps each generation short and in-distribution and focuses the next turn with a clean guide.
- Example with data: On an HMMT problem, the model writes 2–4k tokens checking a strategy, then summarizes in ~200–800 tokens: "Tried A, got contradiction B; new path: try C using lemma D; partial result E." The next turn uses that to verify E or try C.
Step 2: Iteration across many turns
- What happens: Repeat turn 1 for T turns. Each turn can choose to verify prior conclusions, refine a promising path, or explore alternatives (empirically the most common is verification, then exploration, then refinement).
- Why this step exists: Extra turns add more compute safely and productively; progress compounds.
- Example: Turns 1ā2 explore two algebraic setups; turn 3 verifies a critical identity; turn 4 refines the clean path and computes the final numeric answer.
Step 3: Training objective (short-horizon RL)
- What happens: For each training problem, run RC for T_train short turns to collect summaries. Sample N_summ summaries and, for each, generate K reasoning rollouts (length ≤ HR). Reward each rollout by the correctness of its final answer. Use GRPO to update the policy so better rollouts get higher probability.
- Why this step exists: It directly teaches the model to produce correct answers when conditioned on summaries, strengthening the "read a short plan and execute it" skill that powers RC.
- Example: Suppose for a problem, we collect 3 summaries. From summary #2, generate 8 solutions; 3 are correct. Those get positive advantage; the others get lower. The model updates to make future summary-conditioned reasoning more accurate.
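The reward-to-update step can be illustrated with GRPO's group-relative baseline: each rollout's reward is compared against the mean of its group of K rollouts, normalized by the group's standard deviation. A minimal sketch of just the advantage computation, omitting the actual policy-gradient update and any clipping:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages for K rollouts from one summary:
    reward minus group mean, divided by group std (when non-zero)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all rollouts equal: no learning signal
    return [(r - mean) / std for r in rewards]

# The example from the text: 8 rollouts from one summary, 3 correct (reward 1).
rewards = [1, 0, 1, 0, 0, 1, 0, 0]
advs = group_relative_advantages(rewards)
# Correct rollouts get positive advantage; incorrect ones get negative.
```

Because the baseline is the group mean, advantages always sum to zero within a group: the update shifts probability mass toward the better rollouts from the same summary.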
Step 4: Summary replay buffer (off-policy learning)
- What happens: Store problem-summary pairs in a buffer. In later epochs, sample from this buffer to start RC rollouts and to condition training rollouts.
- Why this step exists: It widens coverage of summary states without generating very long on-policy traces and helps the model practice later-turn states it wouldn't often see otherwise.
- Example: A great turn-3 summary from last week's run can be reused now to train better turn-4 reasoning, even if we don't regenerate all turns.
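A minimal sketch of such a buffer, keyed by problem and sampled to seed off-policy rollouts; the eviction policy and uniform sampling here are illustrative assumptions, not details from the paper:

```python
import random
from collections import defaultdict

class SummaryReplayBuffer:
    """Stores problem -> [summaries]; sampled pairs seed off-policy RC rollouts."""

    def __init__(self, max_per_problem=8, seed=None):
        self.store = defaultdict(list)
        self.max_per_problem = max_per_problem
        self.rng = random.Random(seed)

    def add(self, problem, summary):
        summaries = self.store[problem]
        summaries.append(summary)
        if len(summaries) > self.max_per_problem:
            summaries.pop(0)  # drop the oldest summary (illustrative policy)

    def sample(self, n):
        """Uniformly sample up to n (problem, summary) pairs for training rollouts."""
        pairs = [(p, s) for p, ss in self.store.items() for s in ss]
        return self.rng.sample(pairs, min(n, len(pairs)))
```

A sampled pair is used exactly like an on-policy summary: the model conditions on it and generates a fresh ≤ HR reasoning rollout, so old summaries keep paying for themselves.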
Step 5: Practical training recipe
- Stage I: No replay buffer yet. Focus on early turns (including turn 0 with no summary) to solidify the skill of reasoning-from-summary and good summarization formats.
- Stage II: Enable replay buffer and add some harder problems so the model learns to leverage and build on strong later-turn summaries.
- Hyperparameters used in the paper: HR=16k, HS≈2k, T_train=3, K=8, N_summ=2; base model Qwen3-4B-Instruct-2507; trained model RCT-4B.
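Plugging in these reported values, the budget arithmetic works out as follows; the `RCConfig` dataclass is just an illustration (only the numbers come from the paper):

```python
from dataclasses import dataclass

@dataclass
class RCConfig:
    HR: int = 16_000   # per-turn reasoning budget (tokens)
    HS: int = 2_000    # summary budget (tokens), HS << HR
    T_train: int = 3   # RC turns unrolled during training
    K: int = 8         # reasoning rollouts per sampled summary
    N_summ: int = 2    # summaries sampled per problem

    def test_horizon(self, T):
        """Total effective reasoning: Htest ~= T * HR, while any single
        generation stays bounded by roughly HR + HS regardless of T."""
        return T * self.HR

cfg = RCConfig()
# e.g. 32 turns of 16k-token reasoning give a 512k-token effective horizon,
# matching the largest test-time budget reported, with no single call over 16k.
```

This is the decoupling in miniature: the test-time horizon grows with T while the per-call budget, and hence the distribution each call sees, never changes.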
Secret sauce (why it's clever):
- Decouples horizon from step length: You can scale T (the number of turns) arbitrarily while keeping each turn short, cheap, and familiar.
- Exploits the asymmetry: Acting from a clean summary is easier than decoding from scratch for long spans.
- Safe RL signal: Because HR is short and most steps end with an attempted answer, you can use outcome rewards per step without tricky long-range credit assignment.
- Replay helps breadth: Off-policy summaries expose the model to a rich variety of later-turn states without huge on-policy costs.
Concrete walk-through (mini case): Problem: "The sum of a few perfect squares uses the digits 1–9 exactly once; find the minimum sum."
- Turn 1 (HR): Try small squares; collect digits used; hit a conflict. Summary (HS): "Tested 1^2, 3^2, 5^2, 6^2, 28^2 → digits cover 1–9 exactly once ignoring zeros; candidate sum S; needs verification of minimality."
- Turn 2 (HR): Verify minimality with swaps and pruning. Summary (HS): "No cheaper combination found under bounds; minimality likely holds."
- Turn 3 (HR): Final clean derivation and boxed answer. Discard previous long text; only summaries guided each new step.
What breaks without each step:
- No summaries: Context grows and drifts; repetition and verbosity appear; extrapolation stalls.
- No HR cap: Steps get too long and off-distribution; training/inference become slow and unstable.
- No replay: The model under-practices later-turn states; gains at large T shrink.
- No per-step rewards: The model isnāt pushed to be precise when executing from summaries; summaries become less useful.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine a spelling bee where you can ask for more hints. The question: do more hints actually help you spell tougher words right?
🥬 The Concept: Benchmarks and baselines.
- What it is: Benchmarks are standard tests; baselines are other models or methods to compare against.
- How it works:
- Pick fair test sets released after training (to avoid leaks): AIME 2025, HMMT 2025 (Nov), IMO-AnswerBench, FrontierScience.
- Compare RC-trained model (RCT-4B) against base and strong alternatives (Qwen3-4B/30B, Qwen-4B-Thinking, Polaris-4B, standard RL, self-refine/verify).
- Vary token budgets to see if more turns truly help.
- Why it matters: We need to know if RC really scales performance with more test-time compute, not just in easy cases. 🍞 Bottom Bread (Anchor): Like racing multiple bikes on the same track and timing them at different distances.
The tests and why:
- Accuracy on math/science problems measures if longer RC horizons pay off.
- Pass@k (try up to k samples) shows how likely a method solves a hard problem with multiple attempts.
- Termination rate checks whether models actually finish their reasoning within HR, i.e., within the lengths seen during training.
Scoreboard with context (selected highlights):
- Extrapolation without training: RC decoding alone improves Qwen3-4B-Instruct-2507 as Htest rises far beyond ~16k; accuracy gains ~17% up to 192k.
- Trained RC (RCT-4B): With 16k training budget, extrapolates to 512k test tokens and improves HMMT 2025 from ~40% to ~70% (like going from a C to nearly an A).
- Across benchmarks at large budgets (192k–256k):
- AIME 2025: Base 46% → RCT-4B 74.9%.
- HMMT 2025 (Nov): Base 39.8% → RCT-4B 66.3%.
- IMO-AnswerBench: Base 33.5% → RCT-4B 49.4%.
- FrontierScience: Base 23.3% → RCT-4B 34.1% (not in the training domain!), suggesting learned, general strategies.
- Against strong 4B+ models with autoregressive decoding, RCT-4B + RC wins on 3 of 4 benchmarks and is competitive with much larger models.
Surprising findings:
- Instruction-following matters: Using a "thinking-specialist" model with weaker instruction-following reduced RC gains, showing the need for the summarization-generation asymmetry.
- Summary size sweet spot: Too-short summaries lose key ideas; too-long summaries behave like clutter. About 1–2 paragraphs worked best.
- Small HR hurts: HR=8k was okay; HR=4k caused many early terminations and worse performance; there wasn't enough room for a meaningful chunk of progress per turn.
- Token usage scales linearly with budget under RC: the model keeps using extra turns productively rather than stopping early.
- Hard problems: On an adversarial set where the base model fails in 256 tries, RCT-4B with RC lifts pass@16 from ~20% (base) to ~35%, showing true new problem-solving, not just polishing easy ones.
- Scaffolds: Plugging RCT-4B into RSA or DSM Agent increased scores more than with base or standard-RL models; adding RC inside those scaffolds added further gains.
🍞 Bottom Bread (Anchor): Picture giving the model more "rounds" to think, with a clean checklist after each round, and seeing its grade rise steadily while others plateau or ramble.
05 Discussion & Limitations
🍞 Top Bread (Hook): If you study with flashcards, it works great for facts, but not for building a giant Lego castle where every tiny piece matters.
🥬 The Concept: When RC may not fit.
- What it is: RC is best when progress can be summarized cleanly; it's weaker when you must preserve lots of fine-grained search details.
- How it works:
- RC compresses; compression drops unimportant tokens.
- In search-heavy tasks, many details are important.
- Summaries then risk losing critical state.
- Why it matters: Know the task's shape before choosing RC. 🍞 Bottom Bread (Anchor): Fine for summarizing a chapter; risky for recording every move in a chess endgame search.
Limitations:
- Myopic rewards: Training rewards per-turn correctness, not multi-turn plans where early exploration pays off later; future non-myopic objectives could help.
- Summary-generation training: Directly optimizing summaries was hard (credit assignment) and even hurt performance; better reward designs are needed.
- Model requirements: Works best with instruction-following models that respect summaries; thinking-specialists may need extra tuning.
- Resource needs: Training still requires RL infrastructure, data curation, and compute; though RC steps are shorter and more efficient than giant long-context runs, there is engineering overhead.
When not to use:
- Pure search or enumeration tasks where every tried path must be tracked.
- Settings where summaries must be exact logs rather than compact abstractions.
Open questions:
- Can we design non-myopic rewards that teach multi-turn exploration without hurting stability?
- How do we train summary generation itself, possibly with better proxy rewards or verifiers?
- Can we mix RC with external tools (retrieval, code execution) and still keep steady extrapolation?
- What's the best automatic "summary length controller" that adapts HS to task difficulty?
🍞 Bottom Bread (Anchor): Think of RC as a strong "outline-and-iterate" study method. It shines on concept-building tasks; for raw exhaustive search, consider different tools or hybrid designs.
06 Conclusion & Future Work
Three-sentence summary: RC is a simple generate-summarize-repeat decoding method that keeps each step short but chains many steps, so models improve the longer they think. Training with short-horizon RL to execute from summaries, and reusing summaries via replay, teaches models to extrapolate well beyond training lengths. The result is a 4B model that beats larger systems on tough math and science tests when given bigger test-time budgets.
Main achievement: Decoupling effective reasoning horizon from step length so test-time compute reliably buys progress, with a training method that makes summary-conditioned reasoning a core skill.
Future directions: Design non-myopic rewards that encourage early exploration that pays off later; directly train better summaries with improved credit assignment; extend RC to open-ended domains like proofs and research workflows; and blend RC with external tools and scaffolds adaptively.
Why remember this: RC turns "more time" into "more progress," safely and efficiently. It's a practical recipe for continual, long-horizon reasoning that generalizes across domains and scales with the compute you can afford.
Practical Applications
- • Math tutoring that improves step-by-step when given more time, verifying and refining solutions.
- • Scientific reasoning assistants that carry forward compact findings across long investigations.
- • Coding agents that summarize failing runs, then fix and test iteratively without bloated logs.
- • Planning and scheduling tools that refine multi-day plans using short, focused updates.
- • Proof assistants that verify lemmas, summarize progress, and attempt the next step reliably.
- • Data analysis pipelines that record compact insights and branch to explore promising leads.
- • Customer support bots that summarize long threads and propose precise next actions.
- • Legal or policy drafting that iteratively refines arguments from concise position summaries.
- • Research brainstorming that captures key hypotheses and designs the next experiment turn by turn.
- • Benchmark farming: efficiently allocate test-time tokens to push accuracy on difficult evals.