
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Beginner
Zhenting Wang, Huancheng Chen, Jiayun Wang et al. · 3/4/2026
arXiv

Key Summary

  • This paper teaches long-horizon AI agents to remember everything exactly without stuffing their whole history into the prompt at once.
  • Instead of keeping a giant, messy history in the prompt, the agent keeps a tiny, neat summary paired with stable labels (indices) that point to full details stored outside.
  • When the agent needs an old detail, it dereferences an index to pull back the exact original evidence, not a fuzzy guess.
  • A special training method, MemexRL, uses rewards and penalties to teach when to compress, what to store, how to label it, and when to fetch it.
  • This approach is much less lossy than normal summarization because the full-fidelity evidence is kept safe in an external store.
  • Theory shows that if only a few pieces need to be dereferenced each step, the agent can make optimal decisions while keeping its working context small.
  • In a hard, modified ALFWorld benchmark, task success jumped from about 24% to about 86%, while peak working context shrank by roughly 43%.
  • After training, the agent compressed fewer times but retrieved more often, showing it learned to build and reuse a precise memory index.
  • Overall, Memex turns long problem-solving into a pointer-based workflow: small brain in the moment, big library on the side.

Why This Research Matters

Long, real-world tasks often need exact details from many steps ago—like an account ID, a code snippet, or a test log. Memex lets AI agents keep their active thinking light while never losing access to full, exact evidence. That means fewer mistakes, faster decisions, and less wasted time re-running tools. In customer support, it can recall a precise policy clause; in coding, it can reopen the exact failing log line. In research, it can pull a quoted paragraph with perfect fidelity. This is a practical path to AI that works reliably over hours, days, or even ongoing projects.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how a school backpack gets super heavy if you try to carry every book, worksheet, and project all the time? It’s hard to find what you need, and you get tired fast.

🥬 Filling (The Actual Concept)

  • What it is: The paper studies how AI agents can handle long tasks without overloading their limited “backpack” (context window) by using a smart, indexed memory.
  • How it works: Instead of carrying everything, the agent keeps a short, tidy summary with labeled pointers (indices) to a big, exact archive of past details stored outside the backpack. When needed, it uses a pointer to fetch the precise item.
  • Why it matters: Without this, long tasks make the agent’s prompt grow too big, slow, and blurry, which can hide or lose important facts.

🍞 Bottom Bread (Anchor) Imagine you keep a study guide (short summary) and put all notes in labeled binders at home. In class, you look up the binder label when you need exact details.

— New Concept: Machine Learning — 🍞 Hook You know how you get better at basketball by practicing shots and learning from misses? 🥬 The Concept

  • What: Machine Learning is when computers get better at tasks by learning from data and feedback.
  • How: 1) See examples, 2) try to predict or act, 3) compare with the right answer, 4) adjust to do better next time.
  • Why: It lets computers improve instead of being told every tiny rule. 🍞 Anchor A spam filter learns what emails are spam by seeing lots of labeled examples and fixing its mistakes.

— New Concept: Reinforcement Learning — 🍞 Hook Imagine training a puppy: it gets treats for good tricks and no treat for chewing shoes. 🥬 The Concept

  • What: Reinforcement Learning (RL) teaches an agent by rewarding good actions and penalizing bad ones.
  • How: 1) The agent tries actions, 2) gets rewards or penalties, 3) updates its strategy to earn more rewards.
  • Why: For long tasks, rules are not obvious; rewards make the agent discover strategies that work over many steps. 🍞 Anchor A maze-solving robot learns to turn at the right spots because correct paths lead to reward.

— New Concept: Reward Shaping — 🍞 Hook You know how getting small stickers along the way keeps you motivated before the big trophy at the end? 🥬 The Concept

  • What: Reward shaping gives extra helpful signals (stickers) so the agent learns faster and avoids bad habits.
  • How: 1) Define main success reward, 2) add small penalties for waste (like repeating actions), 3) add signals for good memory use.
  • Why: Without shaping, the agent might only learn from the final result and miss what went wrong in the middle. 🍞 Anchor A student earns points for neat notes (good memory), loses points for copying the same line twice (redundant), and still aims for the A on the final project.

— New Concept: Memory Management — 🍞 Hook Imagine your desk: if you leave every paper on top, you can’t find the one you need. 🥬 The Concept

  • What: Memory management is organizing what to keep at hand versus what to file away.
  • How: 1) Keep a small working set close, 2) archive the rest neatly, 3) retrieve exactly what’s needed later.
  • Why: Without it, the agent’s prompt gets cluttered, slow, and important details slip by. 🍞 Anchor You keep a to-do list on the desk and file detailed receipts in folders you can pull out by name.

— New Concept: Context Management — 🍞 Hook When you bake cookies, you keep the recipe steps in mind but don’t carry the whole cookbook in your hands. 🥬 The Concept

  • What: Context management is deciding what information the AI should keep in its short-term prompt right now.
  • How: 1) Track current goal, 2) include just the needed notes and labels pointing to deeper info, 3) add/remove as the task changes.
  • Why: Without it, the AI tries to read a giant history every time and gets slower and more confused. 🍞 Anchor A chef keeps today’s recipe card on the counter and pulls a pantry label to fetch a specific spice jar if needed.

The World Before: Agents tried to keep long histories in the prompt or squish them into summaries. Truncation throws away details; summaries are lossy and can forget key numbers, IDs, or code pieces. Similarity search helps find related text, but gets fuzzy when many near-duplicates exist and doesn’t teach the agent how to structure its own memory.

The Problem: Long tasks require reusing exact old evidence (like an object ID, a log line, or a code snippet) many steps later. Finite context windows and lossy summaries make that hard.

Failed Attempts: Giant rolling prompts become huge and slow; hand-tuned summaries drop needed precision; semantic retrieval can pick the wrong near-duplicate. None give the agent stable, reusable references to exact artifacts.

The Gap: A way to keep a short in-context state that still guarantees exact, faithful access to past evidence.

The Real Stakes: This affects coding assistants recalling earlier errors, customer agents remembering exact account IDs, research assistants reusing earlier paper quotes, or robots recalling a location label seen at the start. Getting details wrong can break tools, waste time, or produce incorrect answers.

02Core Idea

🍞 Top Bread (Hook) Imagine a tidy notebook page that lists only the main points and a set of sticky-note labels like A1, B2, C3. Each label points to a full binder on a shelf where all the raw details live.

🥬 Filling (The Actual Concept)

  • What it is: The main innovation is Memex: a system where the agent keeps a compact indexed summary in its prompt and stores full-fidelity evidence outside under stable indices it can dereference later.
  • How it works (recipe):
    1. As the agent works, it periodically CompressExperience: writes a short, structured summary plus a list of (index → description) labels.
    2. It archives exact artifacts (tool outputs, code, logs) into an external key–value store under those indices.
    3. Later, if a past detail is needed, the agent calls ReadExperience(index) to pull the exact block back into context.
    4. Reinforcement learning (MemexRL) teaches when to compress, what to index, how to label clearly, and when to retrieve.
  • Why it matters: It breaks the usual tradeoff between keeping the context short and preserving precision. The agent’s active context stays small, yet nothing is thrown away; exact facts are one dereference away.

🍞 Bottom Bread (Anchor) A student’s one-page study guide lists: “A1—full class notes Week 1; B2—labs; C3—practice tests.” In an exam prep moment, they open binder B2 to see the exact original lab steps.
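The two-level design above can be sketched in a few lines of Python. `ExperienceStore`, `compress`, and `read` are illustrative names standing in for the paper's CompressExperience/ReadExperience tools, and the external store is just an in-memory dict; this is a minimal sketch, not the paper's implementation:

```python
class ExperienceStore:
    """Sketch of Memex-style indexed memory: a compact in-context
    summary plus an external key-value archive of exact evidence."""

    def __init__(self):
        self.archive = {}   # index -> full-fidelity content (never lossy)
        self.summary = []   # (index, short description) pairs kept in-context

    def compress(self, index, description, content):
        """Archive the exact content; keep only a labeled pointer in-context."""
        self.archive[index] = content
        self.summary.append((index, description))

    def read(self, index):
        """Dereference: return the exact archived block, not a fuzzy guess."""
        return self.archive[index]


store = ExperienceStore()
store.compress("ctx_ids", "object IDs seen in the kitchen",
               "cabinet_3, drawer_7, mug_2")
# Later, only the tiny summary travels in the prompt; the exact IDs
# come back with one dereference:
assert store.read("ctx_ids") == "cabinet_3, drawer_7, mug_2"
```

The in-context part is only `store.summary` (a few labeled lines); everything in `store.archive` stays out of the prompt until explicitly dereferenced.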

Multiple Analogies (3 ways):

  • Library Analogy: The card catalog (summary) stays on the desk, while entire books (evidence) remain on shelves. You pull the exact book only when needed.
  • Computer Analogy: Your desktop has shortcuts (indices) to big files stored on the drive. Clicking a shortcut opens the full file instantly.
  • Kitchen Analogy: The fridge has labeled bins (indices). The shopping list (summary) says “Sauce base in Bin S1.” You grab S1 only when you start cooking.

Before vs After:

  • Before: Long prompts get bloated; summaries lose details; retrieval is fuzzy and can miss the exact right chunk.
  • After: Prompts are short and structured; exact evidence is archived verbatim; retrieval is precise through stable indices.

Why It Works (intuition):

  • Decisions at each step rarely need the whole history—just the current plan and a few exact facts. If the summary tells you which facts matter and where they live, a small number of targeted fetches is enough. This keeps thinking light and accurate.

Building Blocks (with Sandwich Intros):

  • New Concept: Memex 🍞 Hook: You know how your brain remembers main ideas and your notebook keeps the full details? 🥬 Concept: Memex is the agent’s two-level memory: tiny in-context index + full outside archive. It replaces clutter with pointers. Why: Without Memex, long trails drown the agent in tokens or lose key details. 🍞 Anchor: A travel checklist with file names lets you open the exact PDF ticket when boarding.

  • New Concept: Indexed Experience Memory 🍞 Hook: Imagine labeling each science project box “P1 Volcano,” “P2 Solar Car,” so you can find them later. 🥬 Concept: It’s the structure where each label (index) maps to exact stored content. The in-context part shows short descriptions and labels; the outside store holds the full stuff. Why: Without indices, retrieval is fuzzy and error-prone. 🍞 Anchor: “Index K: Screenshot of error at step 12.” Later, you reopen the exact screenshot.

  • New Concept: Dereferencing 🍞 Hook: You know how clicking a hyperlink jumps to the exact webpage? 🥬 Concept: Dereferencing means using an index to fetch its precise archived content. Why: Without it, you’d re-run tools or guess from memory, which is slow and risky. 🍞 Anchor: ReadExperience('ctx_units_code_excerpt_002') returns the exact code lines needed to make a fix.

  • New Concept: MemexRL 🍞 Hook: Think of a coach who not only trains you to play but also to take smarter notes so future practices go better. 🥬 Concept: MemexRL is the reinforcement learning setup that teaches the agent how and when to compress, index, and retrieve under a context budget. Why: Without training, the agent might write poor labels or forget to fetch at the right moment. 🍞 Anchor: The agent learns that storing object IDs under 'ctx_locations' and later reading them wins more tasks.

03Methodology

At a high level: Input (system prompt + task) → Work steps (reasoning and tool calls) → CompressExperience (write a short indexed summary and archive full details) → Continue working with a tiny context → ReadExperience(index) only when exact evidence is needed → Output (finish the task).
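The high-level pipeline above can be sketched as a control loop. Here a scripted `Action` list stands in for real LLM decisions, and every name (`run_agent`, `Action`, the tool set) is hypothetical; this is a sketch of the loop's shape, not the paper's code:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """Stand-in for one LLM decision (tool call or memory operation)."""
    name: str
    args: dict = field(default_factory=dict)

def run_agent(script, tools, store, task="put the mug away"):
    """Minimal Memex control loop: act, compress the history into an
    indexed summary, dereference exact evidence on demand, finish."""
    context = ["system prompt", task]      # the working context M stays small
    for action in script:                  # a real agent would call llm(context)
        if action.name == "CompressExperience":
            for index, blob in action.args["blocks"]:
                store[index] = blob        # archive exact evidence in D
            context = ["system prompt", task, action.args["summary"]]
        elif action.name == "ReadExperience":
            context.append(store[action.args["index"]])  # exact dereference
        elif action.name == "Finish":
            return context, action.args["answer"]
        else:                              # ordinary tool call
            context.append(tools[action.name]())

# Scripted episode: look, compress, later dereference the exact IDs, finish.
tools = {"look": lambda: "IDs: mug_2, drawer_7"}
store = {}
script = [
    Action("look"),
    Action("CompressExperience", {
        "blocks": [("ctx_ids", "IDs: mug_2, drawer_7")],
        "summary": "Scanned room; exact IDs archived under ctx_ids.",
    }),
    Action("ReadExperience", {"index": "ctx_ids"}),
    Action("Finish", {"answer": "mug_2"}),
]
context, answer = run_agent(script, tools, store)  # answer == "mug_2"
```

Note how CompressExperience overwrites the context down to `[system, task, summary]`, while ReadExperience temporarily appends one exact block — the two moves that keep the prompt small without losing fidelity.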

Step-by-step (like a recipe), with Sandwich intros for the key operations:

  1. Initialization
  • What happens: The agent starts with just the system prompt and the user’s task in its context and an empty external store D.
  • Why this exists: Keeps the slate clean; only the truly needed info goes in.
  • Example: M = [system, task], D = {}.
  2. Work Loop: Think → Act → Observe
  • What happens: At each step, the agent writes a short thought, calls a tool (e.g., search, open file), and appends the tool’s result to the context.
  • Why this exists: You can’t solve tasks without interacting and gathering facts.
  • Example: “Search for function name” → tool returns a list of file hits.
  3. New Concept: CompressExperience 🍞 Hook Imagine you stop mid-project to tidy your desk: you write a cheat sheet (summary) and file detailed papers into labeled folders. 🥬 The Concept
  • What: CompressExperience replaces the long working history with a compact indexed summary and stores full-fidelity blocks in D under stable indices.
  • How:
    1. Choose what to keep short in-context (plans, verified progress) and which artifacts to archive exactly (tool outputs, code spans, logs).
    2. Create an Index Map: pairs of (index, brief description).
    3. Save content blocks either by writing them directly (Option A) or by anchor-based extraction (Option B) that copies the exact span from conversation.
    4. Overwrite current working context so it becomes [system, task, IndexedSummary].
  • Why: Without compression, the context grows huge; without indices, retrieval later is fuzzy; without exact archiving, you lose precision. 🍞 Anchor In the SymPy bug example, the agent stores code excerpts as 'ctx_units_code_excerpt_001/002' and a repro script as 'ctx_repro_script_001', while the in-context summary lists what each index contains.
  4. Dual Archiving Modes (the clever bit)
  • What happens:
    • Option A (authoring): the model writes a concise but accurate block (e.g., notes, structured lists).
    • Option B (anchor extraction): the model gives 3 short anchors (start, mid, end) to capture an exact, verbatim span.
  • Why this exists: Some details must be exact (IDs, code, logs). Anchors guarantee precise copying without bloating the context.
  • Example: Anchors around the “def _collect_factor_and_dimension” lines copy the exact code.
  5. New Concept: ReadExperience (Dereference) 🍞 Hook You know how you grab the exact folder labeled “Science Fair 2024 – Photos” when you need one picture? 🥬 The Concept
  • What: ReadExperience(index) pulls the exact archived block back into the working context.
  • How: Look up D[index] and append it so the model can see the original content again.
  • Why: Without retrieving, the agent would re-run tools needlessly or guess from memory. 🍞 Anchor When needing the exact object IDs or a code snippet, it calls ReadExperience('ctx_locations') or ReadExperience('ctx_units_code_excerpt_002').
  6. Soft Triggering via Context Status
  • What happens: At each step, a small message reports working tokens and the threshold (e.g., “working=6932, threshold=8000”), nudging the agent to compress at good times.
  • Why this exists: Makes compression a learnable skill instead of a hard cutoff; sometimes it’s smart to finish first, sometimes to compress early.
  • Example: The agent compresses when working hits ~80% of threshold unless it’s one step from finishing.
  7. New Concept: MemexRL (Training) 🍞 Hook Think of a sports drill where you get points for winning, lose points for wasting time, and also get style points for clean play. 🥬 The Concept
  • What: A GRPO/PPO-style RL training loop gives returns that mix task success with three memory penalties: context overflow, redundant tool calls, and malformed tool formats.
  • How:
    1. Roll out multiple attempts per task.
    2. Score each attempt: success reward minus penalties.
    3. Update the policy to prefer attempts that solved tasks and used memory well.
    4. Segment trajectories at each compression so earlier write decisions still get credit from the final outcome.
  • Why: Without these signals, the agent won’t learn good indices, timely compression, or exact retrieval. 🍞 Anchor After training, the agent compresses fewer times but retrieves more, indicating it learned to store once and reuse precisely.
  8. Secret Sauce (what’s clever)
  • Stable indices: Clear, reusable labels that make later dereferences precise.
  • Pointer-heavy summaries: Keep only what helps decide next steps and how to fetch details.
  • Dual-mode archiving: Mix concise notes with exact verbatim spans where precision matters.
  • Soft triggers + penalties: Turn context management into a learnable, rewarded skill.
  • Segmented credit assignment: Ensure compression quality gets judged by the final success.
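The anchor-based extraction in Option B can be sketched as plain substring matching over the transcript: find the start anchor, confirm the mid anchor appears after it, and copy everything through the end anchor verbatim. The function name and the matching rules here are assumptions; the paper's exact mechanics may differ:

```python
def extract_by_anchors(transcript, start, mid, end):
    """Copy the exact span that begins at `start`, passes through `mid`,
    and finishes at the end of `end`. Returns None if the anchors don't
    line up, so the caller can fall back to authoring a block (Option A)."""
    i = transcript.find(start)
    if i == -1:
        return None
    j = transcript.find(mid, i)
    if j == -1:
        return None
    k = transcript.find(end, j)
    if k == -1:
        return None
    return transcript[i:k + len(end)]   # verbatim copy, never a paraphrase

log = ("step 11 ok\nTraceback: boom\n"
       "  in quantity_simplify\nValueError: dim\nstep 13")
span = extract_by_anchors(log, "Traceback", "quantity_simplify",
                          "ValueError: dim")
```

Because the span is copied directly from the conversation, IDs, code, and log lines survive byte-for-byte — the model only has to produce three short anchors, not the whole block.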

Concrete Mini-Example with data:

  • Before: M holds 10 tool outputs (IDs, code, logs) ≈ 5,000 tokens.
  • CompressExperience → Summary (≈300 tokens) + D = {ctx_ids, ctx_code, ctx_logs}.
  • Next step needs one ID → ReadExperience('ctx_ids') adds ≈150 tokens temporarily.
  • Working context stays small (≈450 tokens) instead of 5,000, yet the exact ID is available.
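The MemexRL return described in the training step (task success minus the three memory penalties) can be sketched as a simple scoring function. The penalty weights below are made up for illustration; the paper's actual coefficients are not reproduced here:

```python
def score_rollout(success, overflow_steps, redundant_calls, malformed_calls,
                  w_overflow=0.1, w_redundant=0.05, w_malformed=0.05):
    """Shaped return for one rollout: success reward minus penalties for
    context overflow, redundant tool calls, and malformed tool formats."""
    reward = 1.0 if success else 0.0
    penalty = (w_overflow * overflow_steps
               + w_redundant * redundant_calls
               + w_malformed * malformed_calls)
    return reward - penalty

# A winning rollout with 2 overflow steps and 1 redundant call still
# scores below a clean win, nudging the policy toward tidy memory use.
clean = score_rollout(True, 0, 0, 0)    # 1.0
sloppy = score_rollout(True, 2, 1, 0)   # 0.75
```

In a GRPO/PPO-style update these scores would be compared across multiple rollouts of the same task, so attempts that both succeed and manage memory cleanly get preferred.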

04Experiments & Results

The Test

  • Goal: See if MemexRL actually helps agents finish long, multi-step tasks while keeping the working context small.
  • Setup: A tougher, modified ALFWorld where easy hints are removed: no automatic list of valid actions, the initial room IDs are hidden unless you look once, the look action is limited to one use, and summaries are truncated to 300 tokens so IDs must be archived and later retrieved.
  • Why this matters: It forces the agent to store and precisely retrieve earlier facts (like object/location IDs) instead of keeping them in the live prompt.

The Competition

  • Baseline: The same agent without MemexRL training (so less skilled at when/how to compress, what to index, and when to retrieve).
  • Evaluated Model: Qwen3-30B variant with tool understanding, trained with MemexRL.

Metrics

  • Task Success Rate: Did the agent complete the household task?
  • Peak Working Context Length: How big did the active prompt get at its highest point?
  • Memory Tool Usage: How often did the agent compress vs. retrieve?

Scoreboard (with context)

  • Success Rate: 24.2% → 85.6% after MemexRL. That’s like going from losing 3 out of 4 games to winning most matches.
  • Peak Working Context: 16,934 → 9,634 tokens (about a 43% drop). That’s like packing half as much in your backpack but still having every book on a shelf nearby.
  • Training dynamics: During training, rollout success climbed from ~20% to >90%, while total penalties improved (less context overflow, fewer redundant tool calls, cleaner formats).

Behavioral Shifts (surprising and important)

  • CompressExperience calls per episode: dropped from ~6.5 to ~3. This means the agent learned to compress at better, more meaningful checkpoints instead of constantly rewriting context.
  • ReadExperience calls per episode: increased from ~1 to ~6–7. The agent started to rely on dereferencing exact stored evidence rather than re-running the same tools.
  • Interpretation: RL didn’t just teach the agent to summarize harder. It taught it to build a reusable, precise index and come back to it exactly when needed—just like good human note-taking.

Why the Results Make Sense

  • The environment forces precision (IDs matter). Since summaries are truncated, the only path to those IDs later is to retrieve from the archive. MemexRL rewards that behavior and penalizes re-doing the same action, nudging the agent toward smart retrieval.

Takeaway

  • The numbers show both better accuracy and better efficiency: more wins with a lighter working context, proving the pointer-based design works in practice, not just in theory.

05Discussion & Limitations

Limitations

  • Quality depends on the base LLM: If the model struggles with tool use or instruction following, it may write poor indices or summaries.
  • Training cost: MemexRL involves multi-step rollouts, external tools, and reward computation; this can be compute-intensive.
  • Index quality: Bad labels (too vague or too many) can still make retrieval messy. The method helps learn good labels, but it’s not magic.
  • Retrieval bounds: Theory assumes only a small number of dereferences per step. If a task truly needs many blocks at once, the working context can still grow.

Required Resources

  • An LLM with tool-use ability and enough capacity to follow the memory protocol.
  • An external key–value store for archived blocks, plus plumbing for anchor-based extraction and retrieval.
  • RL infrastructure (rollouts, rewards, GRPO/PPO updates) and datasets of long-horizon tasks.

When NOT to Use

  • Very short or single-shot tasks where context never gets large—overhead may not be worth it.
  • Tasks that never need exact past evidence (e.g., pure brainstorming) where lossy summaries are already fine.
  • Extremely time- or budget-constrained settings where RL training cannot be afforded.

Open Questions

  • Automated index naming: Can we learn naming schemes that guarantee uniqueness and usefulness across domains?
  • Multi-agent memory: How should several agents share and trust each other’s indices and archives?
  • Robustness: How to detect and repair broken anchors or missing blocks gracefully?
  • Safety and privacy: How to govern what gets archived and who can dereference it in sensitive domains?
  • Scaling laws: How do summary size, dereference count, and archive granularity trade off as tasks get 10× longer?

06Conclusion & Future Work

Three-Sentence Summary Memex lets an agent keep only a compact indexed summary in its prompt while archiving full-fidelity evidence outside, so nothing important is lost. MemexRL trains the agent to decide when to compress, what to store, how to label it, and when to retrieve, using rewards that value smart memory use. Together, they enable long-horizon success with a smaller working context and precise access to past details.

Main Achievement The paper shows that indexed experience memory—paired with RL—turns long problem-solving into a pointer-based workflow that is far less lossy than summary-only approaches, and it improves both success rate and efficiency in a challenging benchmark.

Future Directions

  • Smarter automated indexing and anchor selection to reduce human prompt design.
  • Sharing indexed memories across tasks and agents, with permissions and trust signals.
  • Extending to multimodal artifacts (images, audio, traces) and real-world tools.
  • Studying how few dereferences are needed as tasks scale to hundreds or thousands of steps.

Why Remember This It reframes memory for LLM agents: keep thinking space tiny, never throw evidence away, and use stable indices to fetch exact facts on demand. That simple principle makes long, tool-heavy workflows both accurate and efficient.

Practical Applications

  • Coding assistants that store exact error logs and code excerpts under indices, then retrieve them to fix bugs later.
  • Customer support bots that index previous case notes and pull the exact account details when needed.
  • Research copilots that keep a compact plan and index full paper quotes, tables, and figures for precise citation.
  • Data analysis agents that archive raw query results and retrieve them instead of re-running expensive jobs.
  • IT troubleshooters that store configuration snapshots and system logs, reopening the right one by label to compare states.
  • Education tutors that index a student’s past attempts, mistakes, and solutions, retrieving them to personalize new lessons.
  • Business workflow agents that label API responses and invoices for precise recall during audits and reconciliations.
  • Healthcare admin assistants that index prior authorizations and lab results for exact retrieval under strict compliance.
  • Robotics planners that store maps and object detections, recalling precise IDs when revisiting locations.
  • Legal drafting assistants that index contract clauses and precedents to quote the exact language later.
Tags: indexed memory, LLM agents, long-horizon tasks, context window, context compression, external memory, key–value store, dereferencing, reinforcement learning, reward shaping, tool use, summarization, experience archive, retrieval, MemexRL