
When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Intermediate
Leheng Sheng, Yongtao Zhang, Wenchang Ma et al. · 2/11/2026
arXiv

Key Summary

  • Long texts overwhelm many language models, which forget important bits and slow down as the context grows.
  • MemAgent helped by reading in chunks and keeping a running memory, but it often stuffed that memory with junk and never knew when to stop.
  • This paper adds two simple, text-controlled gates: an update gate (only write when it’s useful) and an exit gate (stop when you’ve got enough).
  • These gates are trained with reinforcement learning that gives rewards for correct updating, correct exiting, clean formatting, and right answers.
  • With the gates, memory stops ballooning, and the model quits early when the last clue has been found, saving a lot of time.
  • Across many long-context QA tasks, GRU-Mem beats the older MemAgent and can be up to 4x faster at inference.
  • The method works especially well when evidence is sparse or appears early after reranking, where early exit shines.
  • Ablations show the reward mix (alpha) balances stability and learning; RL training gives the biggest gains on harder tasks.
  • Limitations: focused on QA, training is trickier with multiple rewards, and some tasks need reading everything so early exit is off.
  • Bottom line: teaching models when to write and when to stop makes long-context reasoning both smarter and cheaper.

Why This Research Matters

Real documents are long, and most of what’s inside is not relevant to a single question. Teaching models to write to memory only when it helps and to stop reading once they have enough makes answers come faster and more reliably. That means assistants can deal with contracts, medical histories, or big codebases without drowning in details. It also saves computing time and energy, which reduces costs and environmental impact. In settings where evidence shows up early, the speedups are dramatic. The method’s ideas—gates plus targeted rewards—can inspire better systems for summarization, research assistance, and beyond.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re reading a giant book to answer one question. You don’t read every word with the same focus—you skim, you jot a few notes, and you stop once you’ve got what you need.

🥬 Filling (The Actual Concept): Long-context reasoning is when an AI reads very long texts and still figures out the right answer.

  • How it works (before this paper): Most models try to read everything at once or use tricks to stretch their memory, but they start missing clues as the text grows.
  • Why it matters: Without good long-context reasoning, AI can’t handle books, large manuals, patient histories, or huge codebases.

🍞 Bottom Bread (Anchor): Think of a student answering a history question after skimming 1,000 pages. If they can’t manage long reading, they’ll likely miss the key paragraph and guess.

🍞 Top Bread (Hook): You know how it’s hard to find a single needle in a huge haystack? The bigger the haystack, the easier it is to get lost.

🥬 Filling (The Actual Concept): Evidence sparsity (the “needle in a haystack” problem) means only a few parts of a long text actually matter for the answer.

  • How it works: Important facts are scattered; most paragraphs are irrelevant.
  • Why it matters: If the model treats every sentence as equally important, it wastes time and misses the key clues.

🍞 Bottom Bread (Anchor): When you ask “What city did the event happen in?”, only one sentence names the city; the rest is fluff.

🍞 Top Bread (Hook): Imagine reading a long article in small pieces and keeping a sticky-note summary that you update as you go.

🥬 Filling (The Actual Concept): Recurrent memory (like in MemAgent) reads the text chunk-by-chunk and keeps a running textual memory to answer at the end.

  • How it works: Split the long text into chunks; for each chunk, update a memory note; after all chunks, answer using that note.
  • Why it matters: This avoids feeding the entire book to the model at once and can work beyond the model’s built-in context window.

🍞 Bottom Bread (Anchor): It’s like condensing each chapter into a few bullet points, then answering the quiz from your bullets.

🍞 Top Bread (Hook): Have you ever overfilled your backpack with random stuff until it’s too heavy to carry?

🥬 Filling (The Actual Concept): Memory explosion is when the model’s running memory grows with irrelevant or noisy info.

  • How it works: If you add notes from every chunk—even empty ones—your summary bloats and becomes hard to use.
  • Why it matters: A bloated memory makes future updates worse and slower, and the model may miss key evidence later.

🍞 Bottom Bread (Anchor): If your study sheet becomes 20 pages of mixed notes, you can’t find the one crucial formula during the test.

🍞 Top Bread (Hook): You know when you’ve already found your lost keys and there’s no need to keep searching?

🥬 Filling (The Actual Concept): An exit mechanism lets the model stop scanning once it has enough evidence.

  • How it works: After each chunk, decide whether the last necessary clue has appeared; if yes, stop and answer.
  • Why it matters: Without an exit, the model wastes time reading the rest of the haystack even after finding the needle.

🍞 Bottom Bread (Anchor): If you spot the answer in paragraph 2, you don’t need to read to paragraph 200.

The World Before: Researchers tried three main routes: (1) change the transformer’s attention so it can look farther with less cost; (2) stretch position embeddings to squeeze in longer inputs; (3) use recurrent, chunked reading (like MemAgent) to process massive contexts. These helped, but problems remained. Even MemAgent, which loops through chunks and updates a memory, could balloon the memory with junk and never knew when to stop.

The Problem: Two issues caused real pain:

  • Memory explosion: indiscriminate updates pile up noise and cost.
  • No exit: the loop was hard-coded to read every chunk, even when the answer was already solvable.

Failed Attempts: Feed-everything-at-once struggled with “lost in the middle.” Sparse or linear attention reduced compute but still degraded on huge contexts. Context-extension tricks helped with longer inputs but not with reasoning over scattered clues. MemAgent’s chunked memory worked better, but the “always update, never stop early” habits still wasted compute and muddied memory.

The Gap: What was missing was teaching the model two human-like skills: (1) only write to memory when it’s useful, and (2) stop reading when you’ve got enough.

Real Stakes: In daily life, this affects how well assistants read long contracts, scan patient histories, search massive code repositories, summarize meeting logs, and answer multi-document questions. Efficiency also saves money, energy, and time, especially when the important bits appear early after reranking.

02Core Idea

🍞 Top Bread (Hook): Imagine packing for a trip with two smart rules: only pack what you’ll really use, and stop packing when your suitcase is full enough.

🥬 Filling (The Actual Concept): The key insight: add two simple gates—one to decide when to update memory and one to decide when to stop reading—and train them with rewards.

  • How it works:
    1. Read the next chunk with the question and current memory.
    2. Propose a candidate memory update.
    3. Update gate says “yes” (write) or “no” (skip).
    4. Exit gate says “continue” or “end” (stop early).
    5. If “end,” answer immediately from the final memory.
  • Why it matters: This prevents memory explosion and avoids wasted computation, making long-context reasoning stable and fast.

🍞 Bottom Bread (Anchor): It’s like taking notes only when you see a fact you need, and once you have all the facts, you close the book and answer.
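The five-step loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper’s code: `memory_agent` and `answer_agent` are hypothetical stand-ins for the LLM calls, and the toy agents below just flag a chunk as evidence when it mentions the topic.

```python
def gated_memory_loop(question, chunks, memory_agent, answer_agent):
    """Sketch of a gated recurrent memory loop (GRU-Mem style)."""
    memory = ""
    steps = 0
    for chunk in chunks:
        steps += 1
        candidate, update, nxt = memory_agent(question, chunk, memory)
        if update == "yes":   # update gate: write only when the chunk helps
            memory = candidate
        if nxt == "end":      # exit gate: stop once evidence is complete
            break
    return answer_agent(question, memory), steps

# Toy stand-ins for the LLM agents (assumed behavior for illustration).
def toy_memory_agent(question, chunk, memory):
    if "Strasbourg" in chunk:
        return (memory + " " + chunk).strip(), "yes", "end"
    return memory, "no", "continue"

def toy_answer_agent(question, memory):
    return memory

chunks = ["filler text", "Strasbourg has 276,170 inhabitants (2014)", "more filler"]
answer, steps = gated_memory_loop("What is the population of Strasbourg?",
                                  chunks, toy_memory_agent, toy_answer_agent)
# The loop stops after the second chunk and never reads the third.
```

Note how the skip branch keeps the old memory untouched, which is exactly what prevents the bloat described above.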

🍞 Top Bread (Hook): You know how you don’t keep every receipt—only the important ones?

🥬 Filling (The Actual Concept): GRU-Mem is a gated recurrent memory system for long texts.

  • How it works: For each chunk, the model emits (a) a candidate memory, (b) an update gate decision, and (c) an exit gate decision, then proceeds accordingly.
  • Why it matters: The gates keep memory clean and let the loop finish early, directly tackling the two pain points of MemAgent.

🍞 Bottom Bread (Anchor): It’s like a tidy notebook with only useful bullets, and you stop note-taking once the final clue appears.

🍞 Top Bread (Hook): Picture a filter on your camera that lets through only the best light.

🥬 Filling (The Actual Concept): The update gate decides whether to write the candidate memory into the running memory.

  • How it works: After reading a chunk, the model scores if the chunk helps the question; “yes” writes the update; “no” keeps the old memory.
  • Why it matters: Without this, the memory fills with noise, grows too big, and future useful updates get drowned out.

🍞 Bottom Bread (Anchor): When reading Animorphs info, you’d keep the parts naming the series and its companion books, not random background fluff.

🍞 Top Bread (Hook): When you find the last puzzle piece, you stop searching the box.

🥬 Filling (The Actual Concept): The exit gate decides when the loop can stop because the last needed evidence has arrived.

  • How it works: After each chunk, decide “continue” or “end.” If “end,” pass the memory to the answer agent right away.
  • Why it matters: Saves time and compute, especially if reranking brings key evidence early.

🍞 Bottom Bread (Anchor): If the population number of Strasbourg shows up, you stop scanning and answer.

🍞 Top Bread (Hook): Training a puppy works best when you give rewards for the exact behaviors you want.

🥬 Filling (The Actual Concept): Reinforcement learning (RL) trains the gates by giving rewards for good updates, good exits, correct formatting, and correct answers.

  • How it works: The model generates full trajectories; it gets (1) an outcome reward for a right answer, (2) an update reward per step for correct “yes/no,” (3) an exit reward for stopping at the right time, and (4) a format reward for clean structured outputs.
  • Why it matters: If you only reward the final answer, the model may learn bad habits like over-updating or never stopping early.

🍞 Bottom Bread (Anchor): The model gets a small “treat” each time it skips an empty chunk or stops exactly when the last clue appears, not just when it answers correctly.

Multiple Analogies:

  • Backpack analogy: Only pack essentials (update gate), and leave home once ready (exit gate).
  • Treasure hunt: Keep only real clues (update gate); stop searching when you find the last one (exit gate).
  • Librarian: Add sticky notes only for key facts (update gate); close the book when you have the answer (exit gate).

Before vs After:

  • Before (MemAgent): Updates memory every step, even on empty chunks; always processes all chunks.
  • After (GRU-Mem): Selectively updates on evidence chunks; stops early when the last evidence is found.

Why It Works (intuition, not math): Noise compounds when you keep writing junk; selective writing prevents that. Time is wasted when you keep reading after finding the last clue; early exit saves it. Specific rewards teach both habits clearly. Mixing trajectory-level and step-level advantages balances learning to answer well and learning to gate well.

Building Blocks:

  • Candidate memory: the proposed new summary for this step.
  • Update gate: “write or skip?”
  • Exit gate: “continue or end?”
  • Structured format: think → check → update → next.
  • Rewards: outcome, update, exit, format.
  • Advantage mixing: combine trajectory-level and turn-level signals with a balancing knob (alpha).
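The building blocks above map naturally onto a small record type per step. A minimal sketch, with field names chosen to mirror the paper’s output tags (the class itself is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TurnOutput:
    """One step of the memory agent's structured output."""
    think: str    # private reasoning trace
    check: str    # update gate decision: "yes" or "no"
    update: str   # candidate memory for this step
    next: str     # exit gate decision: "continue" or "end"

turn = TurnOutput(think="Chunk names the series and its companion book",
                  check="yes",
                  update="Animorphs is a YA sci-fi series told in first person",
                  next="continue")
```

Keeping the gate decisions as explicit fields, rather than burying them in free text, is what makes the loop’s control flow parseable.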

03Methodology

At a high level: Input (Question Q, chunks C1..CT, previous memory) → [Step A: Read and reason] → [Step B: Decide update] → [Step C: Decide exit] → [If end, Answer] → Output (Answer Â).

🍞 Top Bread (Hook): Imagine solving a mystery by flipping through pages: think about what you see, decide if it’s clue-worthy, write it down if yes, and stop once you’ve got all the clues.

🥬 Filling (The Actual Concept): GRU-Mem is a step-by-step loop with two gates.

  • How it works (recipe):
    1. Split the long context into fixed-size chunks.
    2. For chunk t, feed (Q, Ct, Mt−1) to the memory agent.
    3. The agent produces: (a) a private reasoning trace <think>…</think>, (b) an update decision <check>yes/no</check>, (c) a candidate memory <update>…</update>, and (d) an exit decision <next>continue/end</next>.
    4. If update = yes, set Mt ← M̂t; else Mt ← Mt−1.
    5. If exit = end, stop and send Mt to the answer agent to produce Â.
  • Why it matters: Each piece serves a purpose—reasoning focuses attention, the update gate prevents memory bloat, and the exit gate saves compute.

🍞 Bottom Bread (Anchor): For a question about Strasbourg’s population, the model skips unrelated chunks, updates when it sees “Strasbourg … 276,170 inhabitants (2014),” then exits and answers.

Step-by-Step Details:

  1. Chunking the context
  • What happens: The long text is split into T chunks of fixed size (e.g., 5,000 tokens).
  • Why this step exists: It prevents overloading the model’s context window and lets the loop handle very long inputs.
  • Example: A 500k-token document becomes 100 chunks of 5k tokens each.
  2. Reading and thinking (<think>)
  • What happens: The agent internally reasons about whether the new chunk contains evidence.
  • Why it matters: Without thinking, the model may update randomly.
  • Example: “This paragraph names ‘Animorphs’ and mentions companion books—relevant!”
  3. Update decision (<check>yes/no</check>)
  • What happens: The agent declares whether the chunk is useful.
  • Why it matters: Without this decision, memory would grow with junk (memory explosion).
  • Example: If the chunk is only general background, output <check>no</check>.
  4. Candidate memory (<update>…</update>) and committing
  • What happens: The agent writes a clean, compact summary of new evidence; it is committed only if <check>yes</check>.
  • Why it matters: Keeps the memory precise and small.
  • Example: “Animorphs is a YA sci-fi series told in first person; The Hork-Bajir Chronicles is a companion narrating enslavement.”
  5. Exit decision (<next>continue/end</next>)
  • What happens: The agent decides whether it has seen the last needed evidence.
  • Why it matters: Saves time by avoiding extra chunks once the answer is solvable.
  • Example: After reading the population number, output <next>end</next>.
  6. Answering
  • What happens: The answer agent reads (Q, final M) and outputs Â.
  • Why it matters: Separates memory building from final answering.
  • Example: Outputs “276,170.”
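The chunking step is simple enough to show directly. A minimal sketch (the function name and default size are illustrative; the paper’s exact chunking code is not given):

```python
def chunk_tokens(tokens, chunk_size=5000):
    """Split a token sequence into fixed-size chunks; the last may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

doc = list(range(500_000))   # stand-in for a 500k-token document
chunks = chunk_tokens(doc)
# Matches the example above: 100 chunks of 5,000 tokens each.
```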

🍞 Top Bread (Hook): Like training a sports team, you don’t just grade the final score; you also reward good passes and smart timeouts.

🥬 Filling (The Actual Concept): Reinforcement learning teaches the loop to gate well and answer well.

  • How it works:
    • Outcome reward: +1 if the final answer is correct; else 0.
    • Update reward: Per step, +1 for correct yes/no on evidence-present/absent; −1 for mistakes.
    • Exit reward: Best when stopping exactly at the last-evidence chunk; penalize too-early or too-late stops (early is worse).
    • Format reward: +1 only if every turn is correctly formatted with think/check/update/next.
    • Advantage mixing: Combine trajectory-level (answer/exit/format) and turn-level (update) signals with a weight alpha.
  • Why it matters: Rewarding only final answers is too blunt; fine-grained rewards shape good habits at each step.

🍞 Bottom Bread (Anchor): The model earns small rewards for skipping empty chunks and stopping exactly when it sees the last Animorphs clue, plus the big reward for answering right.
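The reward recipe above can be sketched numerically. The specific reward magnitudes and the mixing rule below are assumptions for illustration; the paper only specifies the qualitative shape (per-step update rewards, an exit reward that penalizes early stops more than late ones, and trajectory/turn advantage mixing with a weight alpha):

```python
def step_rewards(update_preds, evidence_flags, stop_step, last_evidence_step,
                 answer_correct, format_ok, alpha=0.9):
    """Illustrative GRU-Mem-style reward shaping (values are assumed)."""
    outcome = 1.0 if answer_correct else 0.0
    fmt = 1.0 if format_ok else 0.0
    # Update reward: +1 per correct yes/no gate decision, -1 per mistake.
    update = [1.0 if (p == "yes") == flag else -1.0
              for p, flag in zip(update_preds, evidence_flags)]
    # Exit reward: best at the last-evidence chunk; early stops are worse
    # than late ones because they can lose evidence, not just waste time.
    if stop_step == last_evidence_step:
        exit_r = 1.0
    elif stop_step < last_evidence_step:
        exit_r = -1.0
    else:
        exit_r = -0.5
    trajectory = outcome + exit_r + fmt
    # Advantage mixing: blend the trajectory-level signal with each
    # turn-level update signal using the weight alpha.
    mixed = [alpha * trajectory + (1 - alpha) * u for u in update]
    return trajectory, mixed

# A two-chunk run: skip the empty chunk, write on the evidence chunk,
# stop exactly at the last-evidence step, answer correctly.
traj, mixed = step_rewards(["no", "yes"], [False, True],
                           stop_step=2, last_evidence_step=2,
                           answer_correct=True, format_ok=True)
```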

🍞 Top Bread (Hook): Imagine a checklist that must be followed, or the referee calls a foul.

🥬 Filling (The Actual Concept): Structured output formatting ensures the system can parse decisions reliably.

  • How it works: Each turn must include <think>, <check>yes/no</check>, <update>…</update>, and <next>continue/end</next>.
  • Why it matters: Without strict formatting, the loop can’t tell the gate decisions and memory content apart, breaking the workflow.

🍞 Bottom Bread (Anchor): If the agent writes “yes” outside <check>…</check>, the parser can’t find it; the format reward trains it to be neat every time.
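A parser for this turn format could be a single regular expression. This is a sketch under the assumption that the four tags appear in the fixed order shown above; the paper does not publish its parsing code:

```python
import re

TURN_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*"
    r"<check>(?P<check>yes|no)</check>\s*"
    r"<update>(?P<update>.*?)</update>\s*"
    r"<next>(?P<next>continue|end)</next>",
    re.DOTALL,
)

def parse_turn(text):
    """Return the four fields, or None if the turn is malformed
    (a malformed turn would forfeit the format reward)."""
    m = TURN_PATTERN.search(text)
    return m.groupdict() if m else None

turn = ("<think>Population found.</think><check>yes</check>"
        "<update>Strasbourg: 276,170 (2014)</update><next>end</next>")
fields = parse_turn(turn)
# A stray "yes" outside <check>…</check> would simply fail to parse.
```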

Secret Sauce:

  • Text-controlled gates trained with separate rewards give clear signals for “when to write” and “when to stop.”
  • Advantage decomposition (trajectory vs. turn) stabilizes learning so the model doesn’t overfit to only final answers or only gating.
  • Early-exit inference mode (optional) converts better judgment into real speedups when tasks don’t require reading everything.

04Experiments & Results

The Test: The authors evaluate long-context question answering across in-distribution (HotpotQA) and out-of-distribution tasks, including single-key and multi-key “needle in a haystack” setups, multi-query, and multi-value tasks. Contexts range from 7k up to 896k tokens. They measure answer accuracy and inference time.

The Competition: GRU-Mem is compared to MemAgent using the same backbone models (Qwen2.5-3B and Qwen2.5-7B). Two inference modes are tested for GRU-Mem: with early exit (w EG) and without early exit (w/o EG), since some tasks require scanning everything.

The Scoreboard (with context):

  • Accuracy: GRU-Mem generally outperforms MemAgent across datasets. Think of it like getting A/A− grades where MemAgent gets B/B−, especially on out-of-distribution NIAH tasks where evidence is sparse.
  • Speed: GRU-Mem is significantly faster. Without early exit, it’s commonly about 2x faster. With early exit, it can be up to 4x faster in several settings, like MK-1, while keeping or improving accuracy.
  • Smaller models benefit most: With the 3B backbone, GRU-Mem’s stability helps more, reducing the sharp drops MemAgent shows on tougher NIAH tasks.

Surprising/Notable Findings:

  • Memory growth under control: Tracking memory size over long runs shows GRU-Mem’s memory grows slowly and avoids hitting the 1024-token cap where performance and cost spike; MemAgent often hits that cap quickly.
  • Early evidence advantage: When the last piece of evidence is guaranteed to appear early (e.g., top 20% or even top 10% after reranking), GRU-Mem’s exit gate shines—cutting inference time to about one-quarter compared to MemAgent while maintaining accuracy.
  • Exit accuracy: GRU-Mem learns to stop exactly at the last-evidence chunk the majority of the time, with high exact-stop ratios and low early/late stops as training progresses.
  • Training dynamics: Tuning the alpha (advantage mixing weight) balances two goals—answering well and gating well. An alpha around 0.9 yields stable validation rewards and balanced update accuracy on evidence-present and evidence-free chunks.
  • RL matters more on hard tasks: Adding RL significantly boosts performance on challenging datasets like HotpotQA, SQuAD, and the multi-key series.

Contextualizing the Numbers:

  • “Up to 4x faster” means if MemAgent takes 400 seconds, GRU-Mem can do it in around 100 seconds with early exit—like finishing the test in one-quarter the time.
  • “Generally higher accuracy” means fewer misses when the crucial clues are scattered, because GRU-Mem avoids burying those clues under noisy memory updates.

Takeaway: The two gates, trained with the right rewards, directly fix MemAgent’s two biggest issues—memory bloat and never knowing when to stop—leading to more accurate answers and a big reduction in wasted compute.

05Discussion & Limitations

Limitations:

  • Task scope: The paper focuses on long-context QA. Other tasks like summarization, timeline building, or code refactoring remain to be tested.
  • Training stability: Multiple rewards (outcome, update, exit, format) can destabilize training. This required lower off-policy drift and longer training for convergence.
  • Exit risk: If the exit gate fires too early on tasks that require scanning everything (e.g., “list all items”), the answer could be incomplete. That’s why a no-exit inference mode is provided.
  • Dependence on chunking and prompts: Fixed chunk sizes and strict formatting help reliability, but the best chunking strategy or prompt phrasing may vary by domain.

Required Resources:

  • An RL training stack capable of multi-turn policy optimization (e.g., DAPO/GRPO variants), with GPUs (the paper evaluates on 8-GPU nodes).
  • Long-context data for training and validation, plus tools to detect evidence-present vs. evidence-free chunks for reward signals.

When NOT to Use:

  • Exhaustive tasks: If the question asks for “all” items or requires global coverage, disable early exit (w/o EG mode) or don’t use gating.
  • Very short contexts: The overhead of the loop might not pay off.
  • Safety-critical domains with invisible evidence: If missing a late-arriving clue has high cost, keep exit off or require extra confidence checks.

Open Questions:

  • Can GRU-Mem generalize to summarization, coding assistance, or multi-agent settings?
  • How to calibrate exit confidence, especially under noisy retrieval or adversarial ordering of evidence?
  • Can we learn chunk sizes adaptively or switch to hybrid memories (text + vector) to further reduce bloat?
  • How do better rerankers and embeddings interact with the exit gate for even earlier stopping?
  • Are there supervised or self-play signals that can complement RL to stabilize training further?

06Conclusion & Future Work

Three-Sentence Summary: Long texts overwhelm models because useful clues are sparse; older systems like MemAgent helped by reading in chunks but often wrote down too much and never knew when to stop. GRU-Mem fixes this with two text-controlled gates: only update memory when it matters and stop as soon as the last needed clue appears, trained via well-shaped rewards. The result is stronger accuracy and big speedups (up to 4x) on long-context QA.

Main Achievement: Turning “when to write” and “when to stop” into first-class, trainable decisions—via an update gate and an exit gate—makes long-context reasoning both stable and efficient.

Future Directions: Extend beyond QA to summarization and coding; refine exit confidence and combine with reranking; explore adaptive chunking and hybrid memories; stabilize multi-reward RL with improved advantage decomposition or auxiliary objectives.

Why Remember This: It shows that small, human-like habits—selective note-taking and knowing when to finish—can transform how models handle huge contexts, saving time, cost, and energy while improving answers.

Practical Applications

  • Contract review: Scan lengthy agreements, keep only clauses relevant to a question, and stop early once all needed clauses are found.
  • Medical records QA: Pull only pertinent patient facts into memory and answer without reading the entire chart.
  • Codebase search: Update memory only on files that contain signals for a bug or feature, and stop once the last clue is identified.
  • E-discovery/legal search: Traverse large document sets, logging only relevant passages and exiting when sufficient evidence is collected.
  • Customer support: Read long ticket histories, keep critical steps, and answer as soon as the final needed detail appears.
  • Academic research: Skim many papers, capture only key findings for a question, and stop when the last required citation is located.
  • Meeting assistants: Process transcripts in chunks, keep decisions and action items, and finalize summaries early when complete.
  • Compliance checks: Flag only regulation-relevant text and stop when all compliance criteria have been verified.
  • RAG pipelines: Combine with reranking so key evidence appears early, boosting the impact of early exit.
  • Data extraction: Pull specific fields (dates, ids, totals) from massive reports, halting once all fields are captured.
Tags: gated recurrent memory, update gate, exit gate, long-context reasoning, reinforcement learning, MemAgent, needle in a haystack, early exit, textual memory, chunk-by-chunk processing, advantage decomposition, DAPO/GRPO, Qwen2.5, format reward, multi-reward RL