
The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Intermediate
Xiaoyuan Liu, Tian Liang, Dongyang Ma et al. · 2/12/2026
arXiv

Key Summary

  • This paper gives language models a 'wand' to manage their own memory, instead of relying on humans to stuff the prompt for them.
  • The new models, called StateLMs, can read long texts in chunks, take notes, delete clutter, and keep only what matters.
  • They use a 'Pensieve' setup: a small external notebook for key facts plus tools to prune unneeded context.
  • A learned loop of read → note → delete keeps the context short and clean (a 'sawtooth' pattern) even for million-token inputs.
  • StateLMs beat standard LLMs on long-document QA while using only about one quarter of the active context.
  • On chat memory tasks, StateLMs improve accuracy by 10%–20% over standard LLMs.
  • On deep research (BrowseComp-Plus), StateLM reaches up to 52% accuracy, while standard LLMs hover near 5%.
  • Training mixes supervised trajectories from a teacher with reinforcement learning to refine tool use and state control.
  • The approach turns passive predictors into state-aware agents that manage their own reasoning process.
  • Limits include keyword search misses on implicit queries, occasional tool-call formatting errors, and slow buildup of stubs if pruning is mistimed.

Why This Research Matters

Many real tasks involve more information than a model can see at once. By learning to summarize key facts into notes and delete bulky text, StateLMs stay accurate even with huge inputs while keeping costs down. This makes long customer chats, big reports, and multi-step research actually practical for AI. Teams can rely less on fragile, human-written pipelines and let the model handle memory for itself. The approach scales across different domains without task-specific rewrites. In everyday life, it means faster answers, fewer mistakes, and tools that feel more like helpful, organized partners.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine your backpack during the school year. At first, it’s neat. But as you keep stuffing papers in and never take any out, it gets messy and you can’t find anything.

🄬 The Concept (Statelessness of LLMs): What it is: Most language models are stateless, which means they only look at whatever text you put in front of them right now and don’t truly manage a long-term memory of their own. How it works: 1) You give the model a big prompt. 2) It predicts the next words. 3) If you want it to remember something, you must paste it back in again later. 4) The context grows and grows. Why it matters: Without its own memory management, the model’s prompt becomes a junk drawer—useful facts get buried, and the model wastes brainpower on irrelevant text. šŸž Anchor: If you ask a model 50 questions in a chat without cleaning up, the early important bits may get pushed out or drowned by fluff, so answers get worse over time.

šŸž Hook: You know how a librarian decides which books go on the cart and which stay shelved so people can find things faster?

🄬 The Concept (Context Window): What it is: The context window is the model’s short-term workspace—only text inside it can be directly used to think and answer. How it works: 1) Text is tokenized. 2) Only up to the window limit is processed. 3) Anything beyond is invisible. Why it matters: If the window fills up, either you must drop something or the model can’t see the rest—accuracy drops when key info isn’t inside the window. šŸž Anchor: If the model can only read 128 pages at once, but your book is 500 pages, you must choose which 128 pages to show; choose badly, and it misses the answer.
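A tiny sketch can make the window limit concrete. This is an illustration only: it treats words as tokens and uses a made-up `visible_context` helper, whereas real models use subword tokenizers and windows of tens of thousands of tokens.

```python
# Illustrative only: words stand in for tokens, and the window is tiny.
def visible_context(text: str, window: int) -> list[str]:
    """Keep only the last `window` tokens; everything earlier is invisible."""
    tokens = text.split()
    return tokens[-window:]

doc = "intro " * 100 + "the answer is 42"
print(visible_context(doc, window=8))
# the final tokens survive; the 100 'intro' tokens are silently dropped
```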

šŸž Hook: Think of a coach whispering plays to a player every second. The player just follows orders and never chooses plays themselves.

🄬 The Concept (Context Engineering): What it is: Context engineering is when humans craft the model’s prompt, deciding what to include and exclude. How it works: 1) Retrieve documents, 2) select chunks, 3) format tools/prompts, 4) feed it all into the model. Why it matters: It makes the model dependent on human ā€˜wizards’—clever workflows help, but the model stays passive. šŸž Anchor: A script searches your files and stuffs the top 5 chunks into the prompt; if it picks the wrong chunks, the model can’t fix that on its own.

šŸž Hook: Picture using a search bar before reading a whole encyclopedia.

🄬 The Concept (RAG – Retrieval-Augmented Generation): What it is: RAG finds likely relevant chunks from a database and hands them to the model. How it works: 1) Turn the question into a vector or keywords, 2) retrieve similar text, 3) stuff retrieved text into the prompt, 4) answer. Why it matters: Helpful, but the model still accepts whatever is retrieved, can’t prune or reorganize by itself, and can drown in extra text. šŸž Anchor: Ask ā€œWho is the mayor?ā€ RAG fetches bios; if the right line got missed, the model can’t search again on its own.
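The retrieve-then-stuff pattern can be sketched in a few lines. Everything here is a simplified stand-in: keyword-overlap scoring instead of a real vector or BM25 retriever, and a hypothetical `retrieve` helper, not any particular RAG framework's API.

```python
# Toy RAG step: score chunks by word overlap with the query, then stuff
# the top hits into the prompt. Real systems use embeddings or BM25.
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

chunks = [
    "The mayor of Springfield is Joe Quimby.",
    "Springfield has a nuclear power plant.",
    "The weather today is sunny.",
]
top = retrieve("who is the mayor of springfield", chunks)
prompt = "Context:\n" + "\n".join(top) + "\nQuestion: Who is the mayor?"
print(top[0])  # the mayor sentence ranks first
```

Note the one-way flow: if `retrieve` misses the right line, the model downstream has no tool to search again, which is exactly the limitation StateLM addresses.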

šŸž Hook: Imagine a tidy student who summarizes each chapter onto a note card and tosses the heavy printouts.

🄬 The Concept (Agentic Memory Systems): What it is: Agentic memory frameworks try to help models by paging info in/out or summarizing states. How it works: 1) Predefined routines manage memory, 2) the model follows those routines, 3) summaries replace raw text. Why it matters: Better than nothing, but still bound to human-designed schedules; the model doesn’t truly invent its own strategy. šŸž Anchor: A system forces a summary every 3 steps; even if step 2 needed more detail, the schedule won’t adapt.

šŸž Hook: You know how the best learners decide what to keep, what to toss, and when to review?

🄬 The Concept (The Gap): What it is: We lacked models that could manage their own context on the fly. How it works: Give the model tools to index, take notes, and delete clutter—and train it to choose when to use each. Why it matters: Without this, long chats, huge documents, and deep research stay fragile and error-prone. šŸž Anchor: Reading a 300-page report becomes reasonable if the model can skim, save key facts, and clear the rest—just like a student preparing for a test.

šŸž Hook: Think of forgetting unhelpful details so your brain focuses on the big idea.

🄬 The Concept (Real Stakes): What it is: Long memory matters for everyday AI uses. How it works: 1) Customer support chats span weeks, 2) legal/medical papers are massive, 3) research needs many search/reading passes. Why it matters: Better memory management means faster, more accurate help and less cost. Without it, the AI gets slower, pricier, and more wrong. šŸž Anchor: A tutoring bot that remembers a student’s progress across months and trims old fluff gives better advice in seconds, not minutes.

02Core Idea

šŸž Hook: You know how a chef keeps only the ingredients needed for this dish on the counter and puts everything else back, to stay neat and fast?

🄬 The Concept (StateLM): What it is: StateLM is a language model that learns to manage its own context using tools—reading, note-taking, and pruning—so it keeps only what’s useful. How it works: 1) Look at the task and size of input, 2) index or search to find promising bits, 3) read a chunk, 4) write a short note of key facts in an external notebook, 5) delete the bulky chunk from the active context, 6) repeat until ready, 7) answer. Why it matters: Without this loop, long tasks overflow the window or bury the answer; with it, the model stays sharp and efficient. šŸž Anchor: Like a detective who jots clues in a pocket notebook and tosses old newspapers, the model travels light yet remembers the important parts.

šŸž Hook: Imagine Dumbledore pulling thoughts into a Pensieve to revisit later.

🄬 The Concept (Pensieve Paradigm): What it is: A setup where the model keeps compact, durable notes outside the live chat and can prune the chat itself. How it works: 1) External notebook stores distilled insights, 2) deleteContext removes bulky messages from the visible history, 3) notes can be re-read on demand. Why it matters: The model breaks free from a fixed, ever-growing prompt and builds its own sustainable memory. šŸž Anchor: Read a chapter, save two bullet facts to the notebook, delete the raw chapter text from the chat, and keep going.

šŸž Hook: Think of a saw blade—up, then a quick drop, then up again.

🄬 The Concept (Sawtooth Context): What it is: The model’s live context grows as it explores, then drops when it deletes clutter, repeating in waves. How it works: 1) Add new reads → context rises, 2) take notes, 3) delete raw text → context falls, 4) repeat. Why it matters: Prevents slow, monotonic bloat that ruins long-horizon reasoning. šŸž Anchor: While processing a 1M-token corpus, the active context hovers around 32K because old chunks are promptly cleared.
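The sawtooth shape is easy to simulate. The numbers below are invented for illustration (the paper's chunks and budgets differ); the point is just that context rises on each read and falls after each delete, while the small notes accumulate.

```python
# Simulated sawtooth: each cycle reads a bulky chunk, keeps a short note,
# and deletes the chunk. Token counts are made up for illustration.
def sawtooth(chunks: int, chunk_tokens: int, note_tokens: int) -> list[int]:
    sizes, ctx = [], 0
    for _ in range(chunks):
        ctx += chunk_tokens          # readChunk: context rises
        sizes.append(ctx)            # peak of this tooth
        ctx += note_tokens           # note: tiny addition
        ctx -= chunk_tokens          # deleteContext: context falls
        sizes.append(ctx)            # trough: only notes remain
    return sizes

print(sawtooth(3, 10_000, 50))
# [10000, 50, 10050, 100, 10100, 150] -- peaks stay near one chunk,
# troughs grow only by the note size
```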

šŸž Hook: You know how spring cleaning frees up space and makes you faster at finding things?

🄬 The Concept (deleteContext tool): What it is: A command that removes chosen past messages from the visible history. How it works: 1) Identify the message ID to remove, 2) call deleteContext(msg_id), 3) the system leaves a tiny stub, 4) space and focus improve. Why it matters: Without deletion, clutter piles up; with it, the model avoids overflows and distraction. šŸž Anchor: After reading a 10,000-token chunk and saving two sentences of notes, the model deletes the chunk’s message so new chunks can fit.
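A minimal sketch of what a deleteContext-style tool might do, assuming a dict-backed message store and a made-up stub format (the paper only says a tiny stub remains):

```python
# Hypothetical context store: delete() swaps a bulky message for a stub.
class Context:
    def __init__(self):
        self.messages: dict[int, str] = {}
        self.next_id = 0

    def add(self, text: str) -> int:
        msg_id, self.next_id = self.next_id, self.next_id + 1
        self.messages[msg_id] = text
        return msg_id

    def delete(self, msg_id: int) -> None:
        # Assumed stub format; the real system's marker is unspecified.
        self.messages[msg_id] = f"[deleted message {msg_id}]"

    def size(self) -> int:
        # Rough size in whitespace-separated tokens.
        return sum(len(m.split()) for m in self.messages.values())

ctx = Context()
mid = ctx.add("a very long chunk " * 500)   # ~2000 tokens in context
ctx.delete(mid)
print(ctx.size())                            # only the 3-token stub remains
```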

šŸž Hook: Picture sticky notes on your desk with only the key points you need today.

🄬 The Concept (Note-taking memory): What it is: A compact external notebook that stores distilled, task-relevant facts. How it works: 1) write/update a note (note/updateNote), 2) later, load it when needed (readNote). Why it matters: Notes preserve meaning while shedding bulk; without them, deleting raw text would lose the facts. šŸž Anchor: ā€œChapter 3: culprit left-handed; alibi weak; motive = debt.ā€ That small note replaces 20 pages.
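The notebook can be pictured as a small key-value store. The class below mirrors the note/updateNote/readNote tool names, but the keys, method signatures, and dict backing are assumptions for illustration:

```python
# Hypothetical external notebook: compact notes survive context pruning.
class Notebook:
    def __init__(self):
        self.notes: dict[str, str] = {}

    def note(self, key: str, text: str) -> None:
        self.notes[key] = text

    def update_note(self, key: str, text: str) -> None:
        # Append to an existing note (or start one if missing).
        self.notes[key] = (self.notes.get(key, "") + " " + text).strip()

    def read_note(self, key: str) -> str:
        return self.notes.get(key, "")

nb = Notebook()
nb.note("ch3", "culprit left-handed; alibi weak")
nb.update_note("ch3", "motive = debt")
print(nb.read_note("ch3"))
# "culprit left-handed; alibi weak motive = debt" -- a few words replace 20 pages
```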

šŸž Hook: Imagine a travel guide that helps you jump straight to the best sights.

🄬 The Concept (Indexing and Search): What it is: Tools (buildIndex, searchEngine, readChunk) that let the model target relevant text instead of scanning everything. How it works: 1) buildIndex structures the document, 2) searchEngine finds candidate chunks, 3) readChunk inspects selected parts in detail. Why it matters: Saves time and context; without targeting, the model scans forever or misses the answer window. šŸž Anchor: Ask ā€œWhen did the policy change?ā€ The model searches for ā€˜policy change’ hits, opens two chunks, notes the date, deletes the chunks.
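A toy version of the buildIndex / searchEngine / readChunk trio, assuming fixed-size word chunks and keyword-overlap scoring (the paper's search tool is BM25-based, so this is a deliberate simplification):

```python
# Illustrative index-search-read tools over a long text.
def build_index(text: str, chunk_words: int = 50) -> list[str]:
    """Split the document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def search_engine(index: list[str], query: str, k: int = 2) -> list[int]:
    """Return indices of the k chunks with the most query-word overlap."""
    q = set(query.lower().split())
    scores = [(len(q & set(c.lower().split())), i) for i, c in enumerate(index)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

def read_chunk(index: list[str], i: int) -> str:
    return index[i]

doc = "filler " * 60 + "the policy change happened in 2018 " + "filler " * 60
idx = build_index(doc)
hits = search_engine(idx, "policy change 2018")
print(read_chunk(idx, hits[0]))  # the chunk containing the policy sentence
```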

šŸž Hook: Think of learning to plan your own study routine, not just copying a friend’s schedule.

🄬 The Concept (Learning the Loop): What it is: The model is trained—first by copying expert examples, then by trial-and-error rewards—to choose tools wisely. How it works: 1) Supervised fine-tuning on good trajectories, 2) reinforcement learning with rewards for correct, well-formatted, within-budget finishes, 3) snapshots guide credit to key states. Why it matters: Without learning, the model misuses tools (over-deletes, under-notes, or overflows context). With learning, stable strategies emerge. šŸž Anchor: After training, the model regularly deletes read chunks right after note-taking and checks budget before loading more.

Multiple Analogies:

  • Backpack analogy: Keep essentials in the front pocket (notes), recycle old worksheets (deleteContext), and grab only the needed pages from the binder (readChunk).
  • Chef analogy: Mise en place (notes), fetch ingredients as needed (search), clear the counter (deleteContext) to keep cooking smooth.
  • Detective analogy: Build a case file (index), read a lead (chunk), write a clue card (note), toss the newspaper (delete), solve the case (finish).

Before vs After:

  • Before: The model waited for humans to stuff the prompt; context just grew until it broke.
  • After: The model engineers its own context, staying lean and accurate across very long inputs.

Why It Works (intuition): The information bottleneck is the live context, not total storage. By distilling signal into notes and pruning noise, the model keeps high-utility bits within its attention while discarding bulk that would dilute focus.

Building Blocks: external notebook, deleteContext, buildIndex/searchEngine/readChunk, note/updateNote/readNote, budget checks, and a learned policy that orchestrates them into a repeating, efficient loop.

03Methodology

At a high level: Input (user query + long text) → Analyze size and budget → Build index (if long) → Search for candidates → Read a chunk → Note key facts → Delete the bulky chunk → Repeat until ready → Finish with an answer.

šŸž Hook: You know how you first glance at a book’s thickness before deciding to skim or read deeply?

🄬 The Concept (analyzeText + checkBudget): What it is: Tools to gauge input size and how much context/turn budget remains. How it works: 1) analyzeText estimates scale, 2) checkBudget reports free space and allowed steps, 3) strategy adjusts accordingly. Why it matters: Without this, the model might try to load too much and overflow. šŸž Anchor: Seeing the document is huge, the model decides to index and process in chunks, checking budget every few steps.
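One way to picture the gating logic, with invented thresholds: the 32K figure echoes the active-context budget reported later, but the helper names and the decision rule itself are assumptions for illustration.

```python
# Hypothetical analyzeText + checkBudget gating a strategy choice.
def analyze_text(text: str) -> int:
    """Rough size estimate (words standing in for tokens)."""
    return len(text.split())

def check_budget(used: int, context_budget: int = 32_000) -> int:
    """How much active-context room is left."""
    return context_budget - used

def choose_strategy(text: str, used: int) -> str:
    size, free = analyze_text(text), check_budget(used)
    return "index_and_chunk" if size > free else "read_directly"

print(choose_strategy("word " * 100_000, used=1_000))  # index_and_chunk
print(choose_strategy("short question", used=1_000))   # read_directly
```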

Step-by-step (with what/why/example):

  1. Build an index (buildIndex)
  • What happens: The model constructs a searchable structure over the long text.
  • Why: To avoid dumping the entire document into the prompt.
  • Example: A 500-page report gets split into labeled chunks; now the model can look up ā€˜Chapter 9: budget policy’ fast.
  2. Search for candidate segments (searchEngine)
  • What happens: The model queries keywords/phrases to find likely relevant chunks.
  • Why: To reduce scanning and load only promising text.
  • Example: Query ā€˜policy change 2018’ returns three candidate chunks; the model selects two to read.
  3. Read a chunk (readChunk)
  • What happens: Load the selected chunk’s raw text into context.
  • Why: The model needs details to answer precisely.
  • Example: The chunk shows, ā€œPolicy revised June 12, 2018.ā€
  4. Take notes (note/updateNote)
  • What happens: The model writes concise, durable notes into the external notebook.
  • Why: Keep the facts without keeping the bulk.
  • Example: Note: ā€˜Revision date = 2018-06-12; Scope: procurement rules; Exceptions: small vendors’.
  5. Delete clutter (deleteContext)
  • What happens: Remove the just-read chunk and the message that created the note.
  • Why: Free space and keep the active context clean.
  • Example: After writing the note, the model deletes the 10,000-token chunk, leaving only a tiny stub.
  6. Reuse notes (readNote)
  • What happens: Pull the compact notes back into context when synthesizing the final answer.
  • Why: Re-ground reasoning in verified facts without rereading heavy raw text.
  • Example: For the final write-up, the model loads the ā€˜policy revision’ note and cites the date.
  7. Finish (finish)
  • What happens: The model returns the final answer via a proper tool call.
  • Why: Signals the end, ensuring format and evaluation compatibility.
  • Example: ā€˜Final answer: The policy changed on June 12, 2018.’
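The seven steps above can be composed into a single loop. All tool bodies here are plain-Python stand-ins that mirror the paper's tool names; the real StateLM issues these as learned tool calls rather than running fixed code, and the crude "keep question words and digits" distillation is purely an assumption.

```python
# Sketch of the search-read-note-delete loop with stand-in tools.
def answer_long_document(question: str, document: str,
                         chunk_words: int = 50) -> str:
    # 1. buildIndex: fixed-size word chunks
    words = document.split()
    index = [" ".join(words[i:i + chunk_words])
             for i in range(0, len(words), chunk_words)]
    notes, context = [], []                  # external notebook, live context

    # 2. searchEngine: rank chunks by keyword overlap with the question
    q = set(question.lower().split())
    hits = sorted(range(len(index)),
                  key=lambda i: -len(q & set(index[i].lower().split())))

    for i in hits[:2]:
        context.append(index[i])             # 3. readChunk: context rises
        relevant = [w for w in index[i].split()
                    if w.lower() in q or w[:1].isdigit()]
        if relevant:
            notes.append(" ".join(relevant)) # 4. note: crude distillation
        context.pop()                        # 5. deleteContext: context falls

    return "; ".join(notes)                  # 6-7. readNote + finish

doc = "filler " * 40 + "policy revised June 12 2018 " + "filler " * 40
print(answer_long_document("when was the policy revised", doc))
```

The live `context` list stays near-empty between iterations while `notes` carries the distilled facts forward, which is the sawtooth in miniature.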

The Secret Sauce: The Search–Read–Note–Delete loop. Each pass adds knowledge (note) and subtracts weight (delete). This creates the sawtooth pattern: context grows briefly during reading, then shrinks after pruning, staying within budget.

Training like a recipe:

  • Stage 1: Supervised Fine-Tuning (SFT)
    • What happens: A strong teacher model demonstrates good trajectories using the toolset. The authors keep only correct, well-managed runs (rejection sampling) and balance action types so deletion doesn’t drown out rarer actions.
    • Why: Teaches solid habits: build index when long, take notes before deleting, check budget, and finish cleanly.
    • Example: From a 20-turn expert session, the training dataset creates multiple step-by-step samples that show exactly when to note and delete.

  • Stage 2: Reinforcement Learning (RL)
    • What happens: The model tries full episodes and earns a reward for being correct, finished, well formatted, and within limits, with penalties for incorrect or unfinished runs. State snapshots around context edits help assign credit.
    • Why: SFT shows ā€˜how’; RL lets the model discover even better timing and tool combinations for tough cases.
    • Example: The model learns that deleting right after note-taking avoids brief overflows later.
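The episode reward described above might be shaped roughly like this; the exact weights and the additive structure are assumptions, not the paper's values.

```python
# Hypothetical reward shaping for one episode. Weights are illustrative.
def episode_reward(correct: bool, finished: bool,
                   well_formatted: bool, within_budget: bool) -> float:
    if not finished:
        return -1.0                  # unfinished episodes are penalized outright
    reward = 1.0 if correct else -0.5
    if not well_formatted:
        reward -= 0.2                # e.g. a malformed finish/tool call
    if not within_budget:
        reward -= 0.2                # overflowed context or round budget
    return reward

print(episode_reward(True, True, True, True))   # 1.0: the ideal episode
```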

Technical guardrails:

  • Budgets: A round budget and a context budget prevent runaway loops.
  • Formatting: Answers must be returned via finish and follow task rules.
  • Snapshot sampling: Keep training fair across long/short trajectories.

Putting it all together on actual data:

  • Input → The model sees a 189K-token long-document QA question.
  • Analyze → It detects ā€˜too long,’ so it builds an index.
  • Search/Read → It pulls two chunks with candidate dates.
  • Note/Delete → Saves the correct date, deletes both chunks.
  • CheckBudget → Still safe; reads one more chunk to confirm.
  • ReadNote/Finish → Loads notes, writes the final answer, and stops.

Outcome: A compact, tidy working memory that scales to very long tasks without drowning in text.

04Experiments & Results

šŸž Hook: Think of a scavenger hunt. The best teams not only find clues fast, they also don’t carry every old clue forever—they keep a summary and move on.

🄬 The Concept (What the tests measured): What it is: The authors checked whether models could find facts in huge texts, remember long chats, and do deep web-style research, all while staying within a small active context. How it works: 1) Synthetic ā€˜needle in a haystack’ tests (find one key sentence in up to 2M tokens), 2) Long document QA (NovelQA, āˆžBench), 3) Chat memory (LongMemEval-S), 4) Deep research (BrowseComp-Plus). Why it matters: If a model can keep accuracy high while managing its own memory, it proves real progress beyond window size tricks. šŸž Anchor: It’s like checking whether a student can ace open-book exams by taking good notes and cleaning their desk, not by trying to stack the whole library on it.

The competition: Standard Qwen3 instruct models (4B/8B/14B) with a 128K window; memory agents like ReadAgent and RL-MemAgent; a strong 235B Qwen with the Pensieve setup.

Scoreboard with context:

  • Needle-in-a-Haystack (no search tool):
    • Baselines crumbled beyond 128K, dropping near 0% at 1M tokens.
    • StateLM variants stayed robust: the 14B reached around mid-90% up to 1M and ~84% at 2M. That’s like still getting an A when others are failing.
  • Long Document QA (NovelQA Copyright ~135K avg; āˆžBench En.MC ~189K avg):
    • StateLM used only a 32K active context (about one quarter of the 128K baseline) and still outperformed: StateLM-8B reached around 84% on NovelQA vs ~66% for Qwen3-8B, with similar strong gains on āˆžBench.
    • Gains were biggest when the answer evidence sat late in the document (e.g., the 128–256K token range), showing the value of search–note–delete.
  • Chat Memory (LongMemEval-S ~115K avg):
    • StateLM improved accuracy by about 10%–20% over standard LLMs across scales, indicating better long-term conversational recall without bloating context.
  • Deep Research (BrowseComp-Plus ~552K avg):
    • The standout: StateLM-14B-RL reached up to 52% vs ~5% for the vanilla 14B LLM. That’s like jumping from a near F to a solid pass on a very tough project.

Surprising/Notable findings:

  • Even with the search tool disabled on the needle task, StateLM used scanning plus disciplined delete habits to remain strong at extreme lengths—evidence that the loop itself (read→note→delete) is powerful.
  • RL fine-tuning added extra points on some benchmarks (e.g., āˆžBench), while extra SFT alone sometimes hurt—suggesting exploration-based polishing matters.
  • Tool-use patterns adapted by task: longer inputs triggered more searches, not more memory writes, keeping the notebook lean and focused.

Bottom line: Managing one’s own context beats simply enlarging the window or following a fixed human script. The sawtooth approach scaled where others stalled.

05Discussion & Limitations

šŸž Hook: Imagine a great note-taker who still sometimes struggles when questions are sneaky or when a binder’s tabs aren’t perfect.

🄬 The Concept (Limitations): What it is: Where the method currently falls short. How it works: 1) Keyword search (BM25) can miss paraphrased or implicit clues, 2) occasional formatting errors in long runs (e.g., malformed tool calls), 3) deleted messages leave tiny stubs that can slowly accumulate, 4) poor timing (too-late pruning) can cause brief overflows. Why it matters: Knowing the edges helps improve the next version and tells users when to be cautious. šŸž Anchor: If you search for ā€˜soccer’ but the text says ā€˜football,’ you might miss it; also, if you keep too many tiny sticky tabs, the notebook still bulks up over time.

Required resources:

  • A tool-enabled environment (index, search, notebook, delete) and a judge for RL when using open-ended answers.
  • Training data with expert trajectories and compute for both SFT and RL; inference settings enforcing budgets.

When not to use:

  • Very short tasks where full text fits easily—overhead of tools may not pay off.
  • Highly implicit questions where keyword search is weak and no semantic retriever is available.
  • Strictly deterministic pipelines that forbid multi-step interactions.

Open questions:

  • Can stronger semantic search (dense retrieval, hybrid BM25+dense) reduce misses on paraphrases?
  • Can we compress or garbage-collect stubs better to avoid long-horizon drip growth?
  • What’s the optimal schedule for note merging and deletion under different domains?
  • How does this scale with larger base models and thinking modes without blowing budget?
  • Can the model learn when to switch between scanning and search automatically, beyond prompts?

06Conclusion & Future Work

Three-sentence summary: This paper turns language models from passive predictors into StateLMs that actively manage their own context using a Pensieve of notes plus pruning tools. By learning a read→note→delete loop, StateLMs keep a tidy ā€˜sawtooth’ context that stays sharp on massive inputs. Experiments show big wins across long-document QA, long chat memory, and deep research, far beyond what bigger windows or scripted agents alone delivered.

Main achievement: Giving the model the ā€˜wand’—a learned, general toolkit for self-engineering its context—so reasoning becomes a stateful, manageable process.

Future directions: Upgrade retrieval to richer semantic search; improve stub handling and note merging; expand training with longer, trickier trajectories; and test larger backbones and thinking modes while preserving budget discipline.

Why remember this: It reframes progress from ā€œmake the window biggerā€ to ā€œmake the model smarter about its own memory.ā€ That shift unlocks durable, scalable reasoning across tasks and lengths—more like how skilled humans study, summarize, and clean as they go.

Practical Applications

  • Customer support bots that remember weeks of conversation but prune old clutter to stay fast and accurate.
  • Document assistants that read 500-page reports by indexing, skimming relevant parts, and saving concise notes.
  • Research agents that search, read, note, and prune across many web pages to produce reliable summaries.
  • Legal or policy review tools that extract statutes, dates, and clauses while discarding bulk text promptly.
  • Education tutors that track a student’s progress over months using compact notes instead of long chat histories.
  • Medical literature triage that surfaces trial outcomes and patient criteria from massive PDFs without overloading context.
  • Enterprise knowledge assistants that build internal note libraries and keep prompts clean during multi-department queries.
  • Codebase explorers that index repositories, read targeted files, record API facts, and delete raw diffs after noting.
  • Financial analysis copilots that extract key metrics and dates from filings while keeping a lean working memory.
  • E-discovery tools that index evidence, record salient facts, and sustain long investigations without prompt bloat.
Tags: Stateful language models · Pensieve paradigm · Context pruning · External memory · Note-taking memory · Indexing and search · Long-context reasoning · Reinforcement learning for tool use · Supervised trajectories · Needle-in-a-haystack · Chat memory · Deep research agents · Sawtooth context · Self-context engineering · Memory-augmented LLMs