Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Key Summary
- This paper teaches a language-model agent to explore smarter by combining two ways of learning (on-policy and off-policy) with a simple, self-written memory.
- The agent writes short 'tips' after each try, saves them in memory, and uses the best tips to guide future tries.
- During training, the agent sometimes uses tips to act and sometimes acts without tips, so it learns to do well both with and without memory.
- Off-policy updates distill good behaviors found with tips into the model itself, so the agent won't need tips later to perform well.
- A stability trick masks extremely unlikely tokens during off-policy learning, preventing training collapse.
- An intrinsic reward nudges the agent to visit new places or states, making it more curious and less stuck.
- On ScienceWorld, EMPO improved over GRPO by 128.6%, and on WebShop by 11.3%.
- In new, unseen tasks, the trained agent adapts quickly using only memory (no weight updates), showing strong generalization.
- The method is simple to add to existing GRPO-style training and uses retrieval of up to 10 short tips per step.
- EMPO points to agents that learn from their own experiences like curious students with notebooks, not just from pretraining.
Why This Research Matters
Many real tasks don't reward you until the very end, so agents must be curious to discover what works. EMPO shows how a simple memory of one-sentence tips can dramatically boost exploration without needing human-written demonstrations. By distilling tip-guided wins back into the model, the agent becomes capable even when the memory is removed. This reduces deployment costs and makes the agent more robust in new settings. The idea generalizes beyond text games or shopping to coding, math, and multimodal tasks. In short, EMPO paves a path to agents that learn like people: try, reflect, remember, and then internalize.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to explore a new science museum. If you only walk to the first shiny exhibit you recognize and never peek into unknown rooms, you'll miss the coolest discoveries. Real learning needs curiosity.
The Concept: Before this paper, many Large Language Model (LLM) agents behaved like tourists who only visit familiar spots. They relied heavily on what they already knew from pretraining and did just a little bit of searching in new places. That works when tasks match their memory, but it fails when the environment hides important things in unfamiliar corners.
How it worked before (step by step):
- The agent reads a task and uses its pretrained knowledge to choose actions.
- If it fails, it usually doesn't remember why or what to try next (beyond a final score).
- Training updates often reinforce what it already does well, not what it needs to discover.
Why it matters: Without real exploration, agents stall. They may never learn to find a missing red light bulb in another room or try a new sequence of steps in a website.
Anchor: In ScienceWorld, an agent told to "turn on the red light bulb" must first search around rooms to find it. Agents that don't explore just stare at the empty hallway and give up.
Hook: You know how you keep a school notebook? When you mess up a math problem, you write a note: "Don't forget to carry the 1!" That tiny memory helps next time.
The Concept: Researchers tried giving LLM agents an external memory, like a notebook, to store reflections, examples, or past attempts. This kind of "memory-augmented agent" can improve without changing its brain (its parameters).
How it works:
- After a try, the agent writes a short tip about what worked or failed.
- Next time, it retrieves similar tips and uses them while acting.
- It avoids past mistakes and tries new things.
What breaks without it: The agent repeats the same errors because the only feedback is a single reward number, not guidance.
Anchor: A tip like "You didn't find the red bulb in the hallway; check the workshop" can redirect the very next attempt.
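The write-retrieve loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a simple bag-of-words similarity in place of the learned embeddings a real agent would use, and the `TipMemory` class and its method names are invented for this sketch.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

class TipMemory:
    """Stores one-sentence tips keyed by the situation they were written in."""
    def __init__(self, k: int = 10):
        self.k = k          # the paper retrieves up to 10 tips per step
        self.entries = []   # list of (situation, tip) pairs

    def add(self, situation: str, tip: str) -> None:
        self.entries.append((situation, tip))

    def retrieve(self, situation: str):
        """Return the k tips whose situations look most like the current one."""
        scored = sorted(self.entries,
                        key=lambda e: bow_cosine(situation, e[0]),
                        reverse=True)
        return [tip for _, tip in scored[:self.k]]

mem = TipMemory()
mem.add("hallway, looking for red bulb",
        "Bulb was not in the hallway; check the workshop first.")
mem.add("kitchen, boiling water",
        "Activate the stove before placing the pot.")
tips = mem.retrieve("standing in hallway searching for the red light bulb")
```

Because the hallway tip shares words with the query, it is ranked first and lands in the prompt for the next attempt.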
Hook: Imagine studying for a test only by taking the test (learning on-policy) versus also learning from solved examples (off-policy). Which helps you improve faster? Both together!
The Concept: In reinforcement learning (RL), on-policy learning means updating from what your current behavior just did, while off-policy learning means also learning from data produced under different conditions (like with extra hints).
How it works:
- On-policy: Use your latest tries to update you in the same situation.
- Off-policy: Learn to do the same good actions even when the extra help (like tips) is removed.
- Balance both so you're strong with hints and strong without hints.
Why it matters: If you only learn with tips, you might depend on them. If you never learn from tip-guided successes, you might never discover the good path.
Anchor: It's like reading the worked solution to a math problem and then practicing to solve it on your own without looking, so you can ace the test.
Hook: Think of curiosity points. If you open a door you've never opened, you get a tiny "yay!" even if there's no treasure yet.
The Concept: Intrinsic rewards give the agent small bonuses for exploring new states, not just for finishing a task.
How it works:
- Track which states you've seen before.
- If a state is new or unusual, give a bonus.
- This keeps the agent curious and prevents it from looping the same path.
Why it matters: Without curiosity, agents can get stuck trying the same comfortable moves and never learn how the world really works.
Anchor: In a shopping website, clicking a new filter or a new category can earn a tiny bonus, nudging the agent to discover the right item faster later.
Hook: Picture two worlds. In the old world, LLM agents mostly followed their comfort zone. In the new world, they write notes to themselves, try new things, and learn to carry these lessons even when the notes are taken away.
The Concept: The gap was clear: we needed a unified way to connect memory (non-parametric updates) with policy learning (parametric updates) and blend on-policy and off-policy learning.
How it works:
- Use memory to guide exploration during rollouts.
- Update the model both with the memory (on-policy) and without it (off-policy) so knowledge moves into the modelās parameters.
- Use intrinsic curiosity to keep searching.
What breaks without it: Performance saturates quickly, exploration is weak, and any gains from memory don't become permanent skills.
Anchor: EMPO is this bridge: it helps the agent use its notebook to explore now and also learn the lesson so well that, later, it can do the task without the notebook.
02 Core Idea
Hook: You know how training wheels help you learn to ride a bike, but the goal is to ride smoothly even after you remove them?
The Concept (Aha! in one sentence): EMPO teaches an LLM agent to explore with the help of self-written tips (memory) and then distills those successes into its own parameters so it can perform well even without tips.
How it works (like a recipe):
- Roll out episodes sometimes with memory tips and sometimes without.
- After each episode, write a short tip and store it in memory.
- Update the model two ways: on-policy (with tips) and off-policy (distill the tip-guided behavior into no-tip behavior).
- Add intrinsic curiosity bonuses to seek new states.
Why it matters: The agent learns to be curious and skillful not only when it has reminders, but also when it doesn't. That is true mastery.
Anchor: It's like solving a maze with a hint, then practicing the same path without any hint until you remember it by heart.
Multiple analogies:
- Teacher and student: The tip-using policy acts like a teacher showing a smart path; off-policy learning makes the no-tip policy a student who learns to do the same thing independently.
- Hiking with a map: First you hike with a map (tips) to find a great trail; later you can walk it from memory without the map.
- Cooking with a recipe card: You follow a recipe (tips) to make a dish perfectly; after a few times, you don't need the card because you've internalized the steps.
Before vs After:
- Before: Agents relied on pretraining and did shallow exploration; memory methods helped short-term but didn't sink into the model.
- After: The agent uses memory to explore deeply now and off-policy distillation to bake those wins into the model, making the improvements stick even without memory.
Why it works (intuition):
- Exploration with tips finds trajectories that plain pretraining would rarely try.
- On-policy updates reinforce behaviors under the exact tip-conditioned context for stability.
- Off-policy updates transfer only the good parts of those behaviors to the no-tip policy, so the model becomes robust.
- Intrinsic rewards keep the policy's attention wide, not narrow, so it keeps discovering.
- A stability mask ignores super unlikely tokens during off-policy updates to prevent math from blowing up.
Building blocks (Sandwich for each new idea):
Hook: You know how a dog learns tricks for treats?
The Concept: Reinforcement Learning (RL) is learning by trying actions and getting rewards.
How it works:
- See a situation (state).
- Pick an action.
- Get a reward and a new state.
- Repeat, and adjust to get more rewards over time.
Why it matters: Without RL, the agent can't improve from its own experience.
Anchor: A shopping agent clicks pages and gets a higher final score when it buys the right item.
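The see-act-reward loop above can be written down directly. Below is a minimal sketch with a toy one-dimensional environment; the `toy_step` function and its rules are made up purely for illustration.

```python
def run_episode(policy, env_step, start_state, max_steps=10):
    """Generic RL loop: observe a state, pick an action, collect the reward."""
    state, total_reward = start_state, 0.0
    for _ in range(max_steps):
        action = policy(state)                          # pick an action
        state, reward, done = env_step(state, action)   # environment responds
        total_reward += reward                          # accumulate reward
        if done:
            break
    return total_reward

def toy_step(state, action):
    """Toy environment: positions on a number line; reaching 3 wins."""
    next_state = state + action
    return next_state, (1.0 if next_state == 3 else 0.0), next_state == 3

# A policy that always steps right reaches the goal and earns the reward.
score = run_episode(lambda s: 1, toy_step, start_state=0)
```

An RL algorithm's job is to adjust the policy so that `score` goes up over many such episodes.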
Hook: Choosing between your favorite ice cream or tasting a new flavor is a real dilemma!
The Concept: Exploration vs. exploitation is the tradeoff between trying new actions (explore) and using what works (exploit).
How it works:
- Sometimes do the best-known action.
- Sometimes try something new to learn more.
- Balance changes during training.
Why it matters: All exploit means you never discover; all explore means slow progress.
Anchor: In ScienceWorld, exploring a new room might reveal the missing tool.
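One classic way to balance the two is epsilon-greedy action selection. This is a generic RL illustration, not EMPO's mechanism; the 20% exploration rate and the action values are invented for the sketch.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon try a random action (explore);
    otherwise take the best-known action (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

rng = random.Random(0)
q = [0.1, 0.9, 0.4]   # estimated value of each of three actions
picks = [epsilon_greedy(q, epsilon=0.2, rng=rng) for _ in range(1000)]
best_share = picks.count(1) / len(picks)   # mostly the best action
```

Most picks exploit action 1 (the best known), but every action still gets sampled, so the agent never stops learning about the others.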
Hook: A diary helps you remember important lessons.
The Concept: Memory-augmented agents store short tips about what worked or failed.
How it works:
- After an episode, write a concise tip.
- Next time, retrieve up to 10 relevant tips to guide actions.
- Keep updating the memory with new tips.
Why it matters: It adds continuity between attempts.
Anchor: "Didn't find the bulb in the hallway; try the workshop" is exactly the kind of tip that saves time.
Hook: Practicing during the game teaches fastest.
The Concept: On-policy learning updates the model from actions it just took under the same conditions.
How it works:
- Roll out with or without tips.
- Use those exact conditions to compute updates.
- This is stable and consistent.
Why it matters: It directly improves the policy that produced the data.
Anchor: If you solved a step using tips, on-policy updates improve the tipped behavior.
Hook: You can also learn from worked examples you didn't produce yourself.
The Concept: Off-policy learning updates the no-tip policy using data produced with tips.
How it works:
- Treat tip-guided trajectories as teacher examples.
- Update the no-tip policy to imitate the good ones (more on this in Methodology).
- Keep only the helpful parts (guided by rewards).
Why it matters: The agent gets good even without tips later.
Anchor: Like seeing a perfect solution, then practicing to do it solo.
Hook: Curiosity points make adventures fun.
The Concept: Intrinsic rewards give small bonuses for visiting new or unusual states.
How it works:
- Compare the current state to states in memory.
- If it's novel enough, give a bonus.
- The agent keeps exploring new regions.
Why it matters: Curiosity prevents getting stuck.
Anchor: Clicking a new website filter earns a bonus that might help find the exact product later.
03 Methodology
At a high level: Input (task) → Rollout (sometimes with memory tips, sometimes without) → Write/Store a tip → Update policy (on-policy or off-policy) → Output (a better, more curious agent).
Key steps in detail (with kid-friendly "why" and concrete mini-examples):
- Generate episodes (rollouts) in two modes
- What happens: For each step, the agent either uses retrieved tips (memory-augmented prompting) or ignores memory and just uses the task and current state. The choice is random with a set probability (e.g., 25% with memory, 75% without).
- Why this step exists: Mixing modes teaches the agent to operate both with hints and without them, so it doesn't become dependent on tips.
- Example: In ScienceWorld, the agent might first try without memory and fail. In the next episode, it retrieves up to 10 short tips reminding it to search the workshop; this time it finds the bulb.
- Compute returns and advantages for the rollouts
- What happens: Each trajectory gets a total return (sum of step rewards). We then compare multiple trajectories of the same task to find which ones did better than average (relative advantage).
- Why this step exists: We need a fair way to boost actions from better-performing rollouts and reduce actions from worse ones.
- Example with math:
Return of a trajectory is the sum of its step rewards: R = r1 + r2 + ... + rT. For example, if the step rewards are 10, 20, and 30, then R = 60.
Relative advantage (like GRPO) standardizes each return against its group of rollouts: A_i = (R_i − mean) / std. For example, if three trajectories have returns 20, 40, and 60, then mean = 40 and std ≈ 16.3. For R = 60, A ≈ +1.22.
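The return and group-relative advantage computations are easy to verify in code. This is a sketch of the standard GRPO-style normalization, not the paper's code; whether the divisor is the population or sample standard deviation is an implementation detail, and the population version is assumed here.

```python
from statistics import mean, pstdev

def grpo_advantages(returns, eps=1e-8):
    """Standardize each trajectory's return against its group of rollouts,
    so better-than-average rollouts get positive advantages."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

step_rewards = [10.0, 20.0, 30.0]
R = sum(step_rewards)                              # return of one trajectory
advantages = grpo_advantages([20.0, 40.0, 60.0])   # group of three returns
```

Advantages in a group always sum to (approximately) zero: boosting the best rollouts necessarily comes at the expense of the worst ones.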
- Write a short self-generated tip and store it in memory
- What happens: After each episode, the agent summarizes what went wrong or right in one sentence and stores it along with a similarity key and a score.
- Why this step exists: The tip is a breadcrumb trail for future attempts, maintaining continuity across episodes.
- Example: "You didn't find the red bulb in the hallway. Next time, check the workshop first."
- Retrieve up to 10 relevant tips before each step (when using memory)
- What happens: The agent finds similar past experiences using a simple embedding similarity and includes those tips in the prompt.
- Why this step exists: Reminders can prevent repeated mistakes and seed smarter exploration.
- Example: If the current page looks like a product list in WebShop, the agent fetches tips about using filters or reading specs first.
- On-policy update: learn under the same conditions
- What happens: For episodes generated with tips, we sometimes update using the same tip-conditioned prompts. For episodes without tips, we update using no-tip prompts.
- Why this step exists: On-policy updates are stable because the model is updated on exactly what it did.
- Example: If the agent solved a subtask with tips, on-policy updates reinforce those exact actions under the same tipped context.
- Off-policy update: distill from tips into no-tips
- What happens: For episodes generated with tips, we sometimes update the model as if no tips had been present; this is off-policy. We calculate how surprising the action is under the no-tip policy and push the model to reproduce good actions even without tips.
- Why this step exists: This is the secret sauce: turn tip-guided wins into built-in skills.
- Example with math:
The importance sampling ratio is ρ = π_no-tip(action) / π_tip(action) when updating the no-tip policy from tip-generated data. For example, if the current no-tip policy assigns the action probability 0.30 and the old tip-conditioned policy assigned 0.25, then ρ = 1.2. A higher ρ upweights this action's contribution (bounded by clipping for stability).
- Stability trick: very low-probability tokens can explode the math. EMPO masks these by ignoring tokens whose probability falls below a small threshold, preventing training from blowing up.
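A per-token sketch of the ratio plus the stability mask is shown below. This is an illustration under assumptions: the floor of 1e-4 and the clip range are invented values, not the paper's exact settings, and a real implementation would work in log-probabilities on GPU tensors.

```python
def masked_ratios(p_no_tip, p_tip_old, floor=1e-4, clip_lo=0.8, clip_hi=1.2):
    """Importance ratios for distilling tip-guided data into the no-tip policy.
    Tokens the no-tip policy finds nearly impossible are masked out (None),
    so their huge gradients cannot blow up training; the rest are clipped."""
    out = []
    for p_new, p_old in zip(p_no_tip, p_tip_old):
        if p_new < floor:                # stability mask: skip this token
            out.append(None)
        else:                            # PPO-style clipped ratio
            out.append(min(max(p_new / p_old, clip_lo), clip_hi))
    return out

# Three tokens: ordinary, nearly impossible under no-tip, and identical.
ratios = masked_ratios([0.30, 1e-6, 0.10], [0.25, 0.50, 0.10])
```

The middle token would have produced a ratio of about 2e-6 with an enormous log-probability gradient; masking it keeps the update finite while the other tokens still carry the distillation signal.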
- Encourage curiosity with intrinsic rewards
- What happens: The agent gets a tiny bonus for visiting states that are new compared to memory (measured by similarity). This keeps policy entropy healthy and exploration active.
- Why this step exists: Real environments often don't give rewards for just exploring, so intrinsic bonuses keep the agent moving.
- Example with math:
If the intrinsic reward is defined as r = 1/√(n+1), where n is the number of similar past states, then for n = 3 we get r = 1/√4 = 0.5. If it's a brand-new state with n = 0, the bonus is 1.0, bigger to celebrate novelty.
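The count-based bonus can be checked in code. Note the 1/√(n+1) shape is one common choice, used here only to make the worked example concrete; the paper measures novelty by similarity of the current state to states in memory.

```python
from math import sqrt

def novelty_bonus(n_similar):
    """Curiosity bonus that shrinks as similar states accumulate in memory."""
    return 1.0 / sqrt(n_similar + 1)

bonus_seen = novelty_bonus(3)   # three similar past states seen before
bonus_new = novelty_bonus(0)    # brand-new state gets the full bonus
```

The bonus never reaches zero, so even well-visited regions keep a sliver of appeal, but fresh states are always the most attractive.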
- Balance the mix of modes
- What happens: Two knobs control how often we use memory during rollouts and how often we do off-policy updates.
- Why this step exists: Too few memory rollouts and you won't explore; too many and you might overfit to tips. Too many off-policy updates and stability can suffer; too few and you don't distill enough.
- Example: The paper finds stable defaults (e.g., 25% memory rollouts; 2/3 off-policy among memory rollouts) that work well across tasks.
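The two knobs can be sketched as a tiny sampler. The 25% memory-rollout rate and the 2/3 off-policy share come from the defaults quoted above; the function and mode names are invented for this sketch.

```python
import random

def sample_update_mode(rng, p_memory=0.25, p_off_policy=2 / 3):
    """Decide how this rollout is generated and updated: plain on-policy,
    tip-guided on-policy, or tip-guided but distilled off-policy."""
    if rng.random() >= p_memory:
        return "no_tips_on_policy"       # 75%: plain rollout, on-policy update
    # Memory rollout: most of these feed the off-policy distillation step.
    if rng.random() < p_off_policy:
        return "tips_off_policy"         # update the no-tip policy from tip data
    return "tips_on_policy"              # update under the same tipped context

rng = random.Random(7)
modes = [sample_update_mode(rng) for _ in range(20000)]
frac_memory = 1 - modes.count("no_tips_on_policy") / len(modes)
frac_off = modes.count("tips_off_policy") / len(modes)
```

Over many rollouts, about a quarter use memory, and roughly two thirds of those (about one sixth overall) become off-policy distillation updates.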
The secret sauce (why EMPO is clever):
- Self-generated memory lowers the barrier to exploration; no human labels or extra teacher model needed.
- Off-policy distillation transforms "I can do it with hints" into "I can do it without hints."
- A simple stability mask on very unlikely tokens prevents gradient blow-ups.
- Curiosity bonuses keep the agent from getting lazy, maintaining healthy exploration pressure.
- All parts fit tightly: memory finds better paths; on-policy locks in stability; off-policy internalizes the good paths; curiosity keeps the search broad.
04 Experiments & Results
The test and why it matters:
- ScienceWorld: Text-based science labs where you must plan, test ideas, and find items across rooms. Rewards range from about −100 (failures) to 100 (successes). This benchmark exposes whether an agent truly explores new states or just repeats known patterns.
- WebShop: A realistic shopping website where the agent must search, filter, compare, and buy items that match constraints. This checks multi-step reasoning and web navigation.
The competition (baselines):
- Reflexion (non-parametric): Uses memory-only reflections without changing the model weights.
- Retrospex (offline RL): Trains from logged data with a critic (IQL) to score actions; no online exploration.
- GRPO (online RL): Strong baseline that compares groups of trajectories to compute relative advantages.
- GiGPO (online RL, WebShop only): Groups similar observations for finer credit assignment, improving GRPO's advantage estimates.
Scoreboard with context:
- ScienceWorld: EMPO averaged far higher returns than all baselines, improving over GRPO by 128.6%. That's like jumping from a C to an A+ while others hover around B levels. Several tasks that started negative reached the perfect 100 score under EMPO. This shows that memory-boosted exploration plus off-policy distillation helps the agent keep learning instead of plateauing.
- WebShop: EMPO beat GRPO and GiGPO in both average score and success rate. The gains (~11.3% over GRPO) are smaller than in ScienceWorld but still meaningful: a solid step up in a competitive setting where advanced baselines already perform well.
- Out-of-distribution (OOD) adaptation: After training on one task, the EMPO model faced new tasks. Without any weight updates, just by adding memory again, EMPO adapted in a handful of trials. This indicates the trained model has learned how to use memory effectively as a tool for exploration, a skill that transfers.
Making numbers meaningful:
- In ScienceWorld, GRPO's learning curves sometimes flattened early, suggesting it got stuck exploiting. EMPO's curves kept rising, signaling ongoing, fruitful exploration.
- In WebShop, where many methods are already strong, EMPO's consistent lead says: even among high performers, exploration and distillation still add value.
Surprising findings:
- Non-parametric memory alone (like Reflexion) helped but saturated quickly; it didn't create lasting skills. EMPO's off-policy step made those short-term gains stick inside the model.
- Too much off-policy can hurt stability, but a simple mask that skips very low-probability tokens fixed blow-ups in practice, an unexpectedly effective and simple patch.
- The intrinsic curiosity bonus wasn't super picky; several settings worked, suggesting the method is robust to the exact shape of the exploration bonus.
Practical takeaways:
- Memory is most powerful when it is both a crutch during exploration and a teacher whose lessons are distilled into the student's own brain.
- A small set of up to 10 retrieved tips was enough to move the needle.
- The trained EMPO agent performs well even with memory turned off at test time, showing the core promise: explore with help, then graduate beyond it.
05 Discussion & Limitations
Limitations (honest assessment):
- Simple retrieval: The memory uses a basic similarity search. Smarter retrieval (better embeddings, rerankers, or multi-hop recall) could fetch even more helpful tips.
- Model variety: Results are shown primarily with Qwen2.5-7B-Instruct; more models and sizes should be tested to confirm broad generalization.
- Domain coverage: Only two environments are studied. More domains (math, code, multi-hop QA, multimodal tasks) would test robustness.
- Off-policy stability: Even with masking, off-policy updates can be sensitive. Alternative algorithms (e.g., conservative policy gradients or trust-region variants) might further improve stability.
Required resources:
- GPUs for online RL with LLMs (the paper used A100s).
- A memory server or lightweight in-process store for tips, with embeddings for similarity search.
- An RL training framework (e.g., GRPO-based) that supports multi-step rollouts.
When not to use EMPO:
- If your environment gives perfect demonstrations and little need for exploration, simple SFT or offline RL may suffice.
- If you cannot afford online interaction costs (time or compute), a purely offline method may be more practical.
- If safety-critical constraints forbid trial-and-error, exploration-focused online RL may be risky.
Open questions:
- Can more advanced memory retrieval (like trainable retrievers or planning-aware search) boost exploration further?
- How does EMPO scale with much larger models? Do the gains grow or shrink?
- Can we unify the intrinsic reward with the memory signal (e.g., novelty-aware tip scoring) for even better credit assignment?
- Are there off-policy objectives better suited to language (e.g., sequence-level distillation losses) that remain stable without heavy masking?
06 Conclusion & Future Work
3-sentence summary: EMPO is a hybrid online RL method that lets an LLM agent explore using self-written memory tips and then distills those tip-guided successes back into the no-tip policy. By combining on-policy stability, off-policy distillation, and a simple curiosity bonus, plus a small but effective stability mask, EMPO learns faster and more robustly. It works well in ScienceWorld and WebShop and adapts quickly to new tasks using memory without any weight updates.
Main achievement: Showing that self-generated memory plus off-policy distillation turns short-term exploratory wins into long-term, parameterized skills, yielding large gains over strong online RL baselines.
Future directions: Improve memory retrieval quality, try larger and different base LLMs, explore richer off-policy objectives tailored to language, and extend to new domains like math, coding, multi-hop QA, and multimodal RL. Investigate tighter integration of curiosity bonuses with memory signals.
Why remember this: EMPO is like training wheels that teach the rider how to balance even after the wheels come off: explore now with tips, but bake the learning into the model so it thrives without them. It's a clear path toward more curious, adaptable, and general LLM agents.
Practical Applications
- Customer support bots that explore new troubleshooting paths and then remember the best ones for future chats.
- Coding assistants that try different debugging strategies, store useful tips, and later fix similar bugs without hints.
- Educational tutors that adapt lesson plans based on student mistakes, then internalize better teaching sequences.
- Shopping agents that learn to use filters and comparisons effectively and keep those skills even without stored tips.
- Robotics tasks where the robot explores tool use safely in simulation, then transfers the learned strategy to the real world.
- Data labeling assistants that learn smarter search-and-verify routines for ambiguous cases.
- Scientific discovery helpers that explore experimental steps, record reflections, and then internalize better lab protocols.
- Task-oriented dialogue systems that explore questioning strategies to gather missing information.
- Game-playing agents that explore novel strategies, then bake those strategies into their policy for tournaments.
- Web automation tools that learn robust click-and-form sequences to handle varied site layouts.