Free(): Learning to Forget in Malloc-Only Reasoning Models
Key Summary
- LLMs can think for many steps, but when they keep every step forever, the extra tokens turn into noise and make answers worse, not better.
- The paper says most models are like a computer that only allocates memory (malloc) but never frees it; they keep piling up notes until they crash into loops.
- Free()LM adds a tiny plug-in, the Free-Module, that switches the model between thinking and cleaning so it can forget useless parts of its own thoughts.
- In cleaning mode, the Free-Module outputs small JSON commands with a prefix and suffix to delete big redundant chunks safely and quickly.
- A special reward-checked training pipeline teaches the Free-Module what to delete by keeping only examples that don’t hurt accuracy (and often help it).
- Across 8B to 685B models, Free()LM improves accuracy by about 3.3% on tough reasoning benchmarks while using fewer tokens.
- On ultra-long problems where a top model dropped to 0% accuracy, Free()LM recovered performance to roughly 50% by pruning the noisy parts.
- It also cut memory (KV cache) usage by about 45% in tests, trading around 56% extra latency for much cleaner, more stable reasoning.
- The Free-Module trained on an 8B model even helped very different, larger models, acting like a universal context-cleaning service.
- The big idea: real intelligence isn’t just about remembering more; it’s also about knowing what to forget.
Why This Research Matters
Real-world assistants often face long, messy tasks: solving big math problems, debugging long logs, summarizing hours of meetings, or planning many-step actions. If they keep every detail forever, they slow down, run out of memory, and start making more mistakes. Free()LM shows that teaching models to forget safely makes them more accurate and stable on exactly those hard cases. It also reduces memory use, which is crucial for running large models affordably at scale. By turning “think more” into “think cleaner,” we get systems that are more dependable in classrooms, workplaces, and research labs. This shift paves the way for AI that can handle long projects without collapsing under its own notes.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how your desk gets messy if you never throw away old worksheets? At first it’s fine, but after a while, you can’t find your homework because the pile gets in the way.
🥬 The Concept: malloc-only architecture
- What it is: Many language models act like a desk that only adds papers and never throws any away—they allocate memory (malloc) but don’t free it.
- How it works:
- The model writes down every step of its thinking in the same space.
- Even if some steps are wrong, repeated, or already resolved, they still stay there.
- Over time, these extra notes crowd the space and confuse later steps.
- Why it matters: Without a way to forget, long thinking sessions get noisy and can break down into loops. 🍞 Anchor: Imagine solving a math puzzle on one whiteboard without erasing old tries—eventually the board is so full that new steps are hard to place or read.
🍞 Hook: Imagine trying to carry every toy you own at once—you drop things and can’t move well.
🥬 The Concept: token accumulation
- What it is: Token accumulation is when a model keeps adding more and more words to its context, even if many are no longer useful.
- How it works:
- The model generates step-by-step thoughts (tokens).
- These tokens pile up, mixing the helpful ones with dead ends and repeats.
- The model must read through this bigger pile to make each next step.
- Why it matters: The bigger the messy pile, the harder it is to find the right clues, so accuracy can drop. 🍞 Anchor: Like reading a mystery but the book keeps inserting duplicate chapters—you’ll waste time and might miss the real clue.
🍞 Hook: Think of planning a long road trip with many stops—you need to remember some notes (gas, maps) but not all (a wrong turn you already corrected).
🥬 The Concept: long-horizon reasoning
- What it is: Long-horizon reasoning is solving problems that take lots of steps and a lot of text to explain.
- How it works:
- Start with a plan and explore different paths.
- Keep useful facts and correct partial results.
- Drop stale or wrong attempts as the plan becomes clearer.
- Why it matters: Without dropping the stale parts, long problems overload memory and the model starts looping or failing. 🍞 Anchor: In math olympiad problems, models sometimes need 10k–100k tokens of thinking; if they never clean up, they can crash into repetitive loops.
The World Before: People learned that giving models more time to think (more tokens) often helps—up to a point. But many teams noticed a surprise: beyond a certain size, longer thoughts hurt performance. On AIME-style math, when reasoning tokens filled most of the context window, answers got worse and models often repeated themselves.
The Problem: Models were acting like “append-only” notebooks. They kept every trial step with no way to prune mistakes or redundant checks. So, test-time scaling (just think longer) hit a wall: more thinking turned into more noise.
🍞 Hook: You know how a friend can sometimes tidy their notes while they work, but telling them to do it without practice doesn’t go well?
🥬 The Concept: in-context learning (ICL)
- What it is: ICL means asking a model to do a new skill just by showing it examples in the prompt, without updating its weights.
- How it works:
- Write careful instructions in the prompt about what to delete.
- Hope the model follows them while it solves problems.
- Evaluate whether results improve.
- Why it matters: Without training, the model often deletes the wrong things (important anchors) or barely helps. 🍞 Anchor: The paper tried ICL and even used a strong helper model, but gains were small (~1%) because guessing what to forget is hard without learning.
Failed Attempts: Heuristic compression methods tried to keep tokens with high attention scores and drop others. But attention isn’t the same as logical importance. These heuristics sometimes broke the chain of thought, causing more loops and longer, worse outputs. ICL-style “please delete the fluff” prompting also underperformed, often cutting needed context or being too timid.
🍞 Hook: Think of trimming a tree—you want to cut dead branches, not healthy ones.
🥬 The Concept: context pruning
- What it is: Context pruning means deleting parts of the model’s internal notes that are truly redundant or obsolete.
- How it works:
- Identify spans that don’t help the current line of reasoning.
- Remove them while keeping essential anchors.
- Continue reasoning on the cleaned notes.
- Why it matters: Without smart pruning, long solutions become cluttered and fragile. 🍞 Anchor: If you erase long, repeated self-corrections but keep the key lemmas, the next steps are faster and safer.
The Gap: We needed a learned, precise way to forget—something trained to recognize and safely prune only the junk, not the anchors. It had to be lightweight, easy to plug into many models, and safe enough to not hurt short, ordinary questions.
Real Stakes: This matters for tutors that solve step-by-step math, coding assistants that debug long logs, analysts who summarize long documents, and agents that plan over many actions. If they can’t clean their own workspace, they waste time, memory, and sometimes fail completely. Freeing memory wisely is as important as thinking deeply.
02 Core Idea
🍞 Hook: Imagine you have a smart notebook that can pause writing, skim its pages, cross out the unhelpful parts, and then keep writing clearly.
🥬 The Concept: Free()LM
- What it is: Free()LM is a way to give models a built-in forgetting skill so they can clean their own notes while they think.
- How it works:
- Add a tiny plug-in (Free-Module) to the model.
- Alternate between two modes: Reasoning (think) and Cleaning (forget).
- In Cleaning, output small commands that mark big redundant spans to delete.
- Resume Reasoning on the cleaned context.
- Why it matters: Without this, long thoughts get jammed with repeats and dead ends, causing crashes; with it, the chain stays compact and focused. 🍞 Anchor: On very long math problems where a giant model fell to 0% accuracy, Free()LM’s cleaning brought it back to around 50%.
The Aha! Moment (one sentence): To think for a long time, a model must also learn to forget—freeing useless context is as vital as adding more steps.
Multiple Analogies:
- Backpack analogy: You hike farther when you drop rocks (redundant notes) instead of carrying everything forever.
- Whiteboard analogy: Erase dead-end calculations to leave the main proof visible; otherwise, new steps get squeezed out.
- Kitchen analogy: Clean as you cook; otherwise, the counter fills with peels and wrappers, and cooking slows or stops.
Before vs After:
- Before: Models wrote every step and never erased; extra thinking often turned into loops and lower accuracy.
- After: Models periodically clean, keeping only useful steps; accuracy rises while total tokens shrink.
Why It Works (intuition, no equations): A model’s next step depends on the text it sees. If that text has lots of misleading or repeated content, its attention spreads thin and it can latch onto the wrong clues. By training a module to recognize and cut truly redundant spans, the remaining text becomes a sharper, lower-noise guide. This lowers distraction, keeps key anchors, and helps the model pick the right next step more often.
Building Blocks:
🍞 Hook: You know how adding a small tool to your pencil makes it a mechanical pencil without changing the pencil itself?
🥬 The Concept: Free-Module (a LoRA adapter)
- What it is: The Free-Module is a small, trainable add-on (via LoRA) that can be merged into the model during cleaning and unmerged during reasoning.
- How it works:
- Attach the adapter to the frozen backbone (no big retraining).
- Merge it to switch into Cleaning mode; unmerge to go back to Reasoning.
- Train it on examples where safe deletions kept or improved correctness.
- Why it matters: It’s lightweight, plug-and-play, and can generalize across different backbones. 🍞 Anchor: An 8B Free-Module improved both a 235B Qwen model and a different DeepSeek model, like a universal eraser tool.
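The merge/unmerge switch can be sketched with plain NumPy. This is our own toy illustration, not the paper's API: merging folds the low-rank product B·A into the frozen weight to enter Cleaning mode, and unmerging subtracts it back out to return to Reasoning mode.

```python
import numpy as np

class LoRALayer:
    """Toy LoRA adapter on one weight matrix (names are illustrative)."""
    def __init__(self, d_in, d_out, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen backbone weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factors
        self.B = rng.normal(size=(d_out, rank)) * 0.01
        self.alpha = alpha
        self.merged = False

    def merge(self):
        """Fold the adapter into W: switch to Cleaning mode."""
        if not self.merged:
            self.W = self.W + self.alpha * (self.B @ self.A)
            self.merged = True

    def unmerge(self):
        """Subtract the adapter back out: return to Reasoning mode."""
        if self.merged:
            self.W = self.W - self.alpha * (self.B @ self.A)
            self.merged = False

layer = LoRALayer(8, 8)
W0 = layer.W.copy()
layer.merge()    # Cleaning mode: W now includes alpha * B @ A
layer.unmerge()  # Reasoning mode: backbone restored
assert np.allclose(layer.W, W0)
```

Because the backbone stays frozen and the adapter is additive, the switch is cheap in both directions, which is what makes frequent think–clean–think cycling practical.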
🍞 Hook: Think of sticky tabs you use to mark where to cut a long paper roll.
🥬 The Concept: Prefix–Suffix pruning commands
- What it is: During cleaning, the module outputs tiny JSON commands that specify a prefix and a suffix to define the chunk to delete.
- How it works:
- Find a unique start (prefix) and end (suffix) text in the context.
- Delete everything between them in one go.
- Repeat if there are multiple redundant spans.
- Why it matters: A few command tokens can remove thousands of junk tokens cheaply and reliably. 🍞 Anchor: One short JSON line can replace a 2,000-token ramble with a single <Del> marker, immediately shrinking the context.
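A minimal executor for these commands might look like the sketch below, assuming a JSON list of objects with "prefix" and "suffix" fields as described above (the exact field names are our guess at the format):

```python
import json

def apply_prune_commands(context: str, commands_json: str, marker: str = "<Del>") -> str:
    """Delete each span running from a prefix through its suffix,
    replacing it with a short marker."""
    for cmd in json.loads(commands_json):
        start = context.find(cmd["prefix"])
        end = context.find(cmd["suffix"], start)
        if start == -1 or end == -1:
            continue  # anchors not found: skip rather than risk a bad cut
        end += len(cmd["suffix"])
        context = context[:start] + marker + context[end:]
    return context

ctx = ("Lemma 1 holds. I might have made an error above, rechecking... "
       "let me recompute. Lemma 2 follows.")
cmds = '[{"prefix": "I might have made an error", "suffix": "let me recompute."}]'
print(apply_prune_commands(ctx, cmds))
# → "Lemma 1 holds. <Del> Lemma 2 follows."
```

Note the asymmetry that makes this cheap: the command itself is a few dozen tokens, but the span it removes can be thousands.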
🍞 Hook: Imagine a referee approving only the cleanups that don’t make the team play worse.
🥬 The Concept: Reward-checked training
- What it is: The module learns from candidate deletions that are kept only if they preserve or improve accuracy in repeated trials.
- How it works:
- Generate many candidate deletions using prompting.
- For each, run multiple rollouts before and after deletion.
- Keep the example only if accuracy stays the same or goes up.
- Why it matters: This teaches the module to prune true redundancy, not essential anchors. 🍞 Anchor: After filtering, 6,648 high-quality examples remained to train precise forgetting.
Put together, Free()LM is a “think–clean–think” loop powered by a small adapter that knows where it’s safe to erase. That’s why it makes long, careful reasoning both sharper and more reliable.
03 Methodology
At a high level: Input problem → Reasoning Mode (generate steps) → Trigger → Cleaning Mode (output prefix/suffix deletions) → Prune context → Resume Reasoning on cleaned notes → Final answer.
🍞 Hook: Imagine you’re solving a big jigsaw puzzle. You place pieces, pause sometimes to clear the table of wrong clusters, then keep going.
🥬 The Concept: Reasoning–Cleaning cycle
- What it is: A loop where the model alternates between thinking (writing steps) and tidying (deleting junk) to keep the workspace clear.
- How it works:
- Start in Reasoning Mode (Free-Module off/unmerged) and generate steps.
- After L_clean tokens, switch to Cleaning Mode (merge Free-Module).
- Output prefix–suffix commands to prune redundant spans.
- Unmerge the module and continue reasoning on the cleaned context.
- Why it matters: Without the cycle, long chains bloat and can spiral into loops. 🍞 Anchor: Every 5,000 tokens (typical setting), the model pauses, cleans, and continues.
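The cycle above can be sketched as a toy loop. `ToyModel` and `ToyFreeModule` are stand-ins we invented for illustration; a real Free-Module proposes prefix–suffix commands over text rather than deduplicating steps:

```python
class ToyModel:
    """Stand-in for the backbone: emits canned reasoning steps."""
    def __init__(self, steps):
        self.steps = list(steps)
    def done(self):
        return not self.steps
    def generate_step(self):
        return self.steps.pop(0)

class ToyFreeModule:
    """Stand-in for the Free-Module: flags exact-duplicate steps as redundant."""
    def __init__(self):
        self.merged = False
    def merge(self):
        self.merged = True
    def unmerge(self):
        self.merged = False
    def clean(self, context):
        seen, kept = set(), []
        for step in context:
            if step in seen:
                kept.append("<Del>")  # replace the redundant span with a marker
            else:
                seen.add(step)
                kept.append(step)
        return kept

def think_clean_loop(model, free_module, l_clean=2):
    context, since_clean = [], 0
    while not model.done():
        context.append(model.generate_step())  # Reasoning mode (adapter unmerged)
        since_clean += 1
        if since_clean >= l_clean:             # trigger: L_clean steps since last clean
            free_module.merge()                # switch to Cleaning mode
            context = free_module.clean(context)
            free_module.unmerge()              # back to Reasoning mode
            since_clean = 0
    return context

model = ToyModel(["step A", "step B", "step B", "step C"])
print(think_clean_loop(model, ToyFreeModule()))
# → ['step A', 'step B', '<Del>', 'step C']
```

The fixed trigger (`l_clean`) mirrors the paper's fixed pruning interval; smarter, signal-driven triggers are listed as an open question later in this article.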
Each Step in Detail:
- Reasoning Mode (unmerged)
- What happens: The backbone model runs normally, generating chain-of-thought tokens.
- Why this exists: We want the base reasoning quality unchanged; cleaning only happens when needed.
- Example: The model solves a geometry proof, adding lemmas and subcases for a few thousand tokens.
- Triggering Cleaning
- What happens: When the generated token count since the last clean reaches the pruning interval L_clean, we switch modes.
- Why this exists: A fixed interval guarantees regular tidy-ups before clutter dominates.
- Example: After 5,000 new tokens, we stop generation and call the Free-Module.
- Cleaning Mode (merged Free-Module)
- What happens: The module scans context and emits JSON commands like [{"prefix": "...", "suffix": "..."}] describing spans to delete.
- Why this exists: Deleting by spans is robust and efficient; a few command tokens can remove huge junk segments.
- Example with data: If the context contains a repeated self-correction paragraph starting with “I might have made an error” and ending with “let me recompute,” the module outputs those as prefix/suffix anchors.
🍞 Hook: Think of using two bookmarks to mark the start and end of a section you want to rip out from a notebook.
🥬 The Concept: Prefix–Suffix anchors
- What it is: Short snippets that uniquely identify where a deletion should start and stop.
- How it works:
- Choose distinctive text for prefix and suffix.
- Delete the minimal text between them via a regex-like match.
- Replace with a simple marker (e.g., <Del>).
- Why it matters: It avoids complicated token-by-token editing; it’s fast and easy to parse programmatically. 🍞 Anchor: In code, a single regex call removes everything from prefix to suffix and inserts <Del>.
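The "single regex call" mentioned in the anchor might look like this minimal sketch (the function name and marker are illustrative):

```python
import re

def prune_span(context: str, prefix: str, suffix: str, marker: str = "<Del>") -> str:
    """Delete the minimal span from prefix through suffix in one substitution.
    The non-greedy .*? keeps the cut minimal; re.DOTALL lets it cross newlines."""
    pattern = re.escape(prefix) + r".*?" + re.escape(suffix)
    return re.sub(pattern, marker, context, count=1, flags=re.DOTALL)

text = "Keep this. START junk junk junk END Keep that too."
print(prune_span(text, "START", "END"))
# → "Keep this. <Del> Keep that too."
```

Escaping the anchors with `re.escape` matters in practice, since reasoning text is full of characters (parentheses, brackets, operators) that regex engines would otherwise treat as syntax.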
- Execute Pruning
- What happens: An external executor reads the JSON, finds those spans, and replaces them with a small marker.
- Why this exists: Separating decision (what to delete) from action (deleting it) keeps the model simple and reliable.
- Example: A 3,000-token dead-end proof block becomes “<Del>,” shrinking the context.
- Resume Reasoning
- What happens: We unmerge the Free-Module to go back to normal generation on the cleaned context.
- Why this exists: After cleaning, we want the backbone’s full reasoning power without the adapter’s influence.
- Example: The next steps proceed smoothly, now focusing on the surviving key lemmas.
Two Ways to Resume Efficiently:
- Re-prefilling (used in the paper): Reuse the cached keys/values for the unchanged prefix, and only re-run the modified tail. It’s widely supported and stable.
- KV Cache Pruning (future-friendly): Directly cut the deleted blocks from the cache and rotate positions to realign. It can be faster but needs special serving support.
🍞 Hook: It’s like reopening a chapter without rereading the whole book—just skim the parts that changed.
🥬 The Concept: Re-prefilling
- What it is: A trick to avoid recomputing attention on the entire context by only reprocessing the changed part after deletion.
- How it works:
- Keep the cache for the unchanged prefix.
- Re-run the altered suffix only.
- Continue decoding.
- Why it matters: Saves time and compute, making frequent cleanups practical. 🍞 Anchor: After deleting a 2k-token ramble in the middle, we reuse the first 15k tokens’ cache and only re-prefill the last 5k.
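A toy sketch of the idea at the token level, assuming we can compare the old and cleaned token sequences directly (real serving stacks track this inside the attention cache):

```python
def re_prefill(old_tokens, new_tokens, cache_len):
    """After an in-place deletion, find how much of the KV cache is still
    valid and return the tail that must be re-prefilled."""
    # Longest common prefix between the old and the cleaned sequences:
    # everything before the first edit keeps identical cache entries.
    keep = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        keep += 1
    keep = min(keep, cache_len)
    return keep, new_tokens[keep:]  # (reusable cache entries, tokens to re-run)

old = ["t%d" % i for i in range(20)]
new = old[:8] + ["<Del>"] + old[15:]  # a mid-context span was pruned
kept, tail = re_prefill(old, new, cache_len=20)
print(kept, tail[:2])
# → 8 ['<Del>', 't15']
```

The savings scale with where the deletion lands: pruning near the end of a long context means almost the whole cache survives, while pruning near the start forces most of it to be recomputed.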
Training: Learning to Forget
- Challenge: There’s no simple label saying “this paragraph is redundant.”
- Solution: Build candidates, then keep only those that pass a reward check.
Data Synthesis (like practicing on realistic messes):
- Break long real trajectories into 1k-token chunks and simulate cleaning step-by-step, always conditioning on previously cleaned history (not the pristine original). This teaches the module to operate in the same messy states it will see at test time.
- Use a strong helper model to propose many deletions, creating ~8,000 candidate training cases.
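The chunk-by-chunk synthesis can be sketched as follows; `cleaner` stands in for the helper model proposing deletions, and the toy dedupe cleaner is purely illustrative:

```python
def synthesize_cleaning_states(trajectory_tokens, cleaner, chunk=1000):
    """Walk a long trajectory chunk by chunk, always conditioning the next
    cleaning step on the already-cleaned history, not the pristine original."""
    cleaned_history = []
    training_states = []
    for i in range(0, len(trajectory_tokens), chunk):
        state = cleaned_history + trajectory_tokens[i:i + chunk]
        training_states.append(list(state))  # candidate input for a cleaning step
        cleaned_history = cleaner(state)     # carry forward the cleaned version
    return training_states

def dedupe(tokens):
    """Toy cleaner: drop consecutive duplicates."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

states = synthesize_cleaning_states(["a", "a", "b", "b", "c", "c"], dedupe, chunk=2)
print(states)
# → [['a', 'a'], ['a', 'b', 'b'], ['a', 'b', 'c', 'c']]
```

The key design choice is in the loop: each state mixes cleaned history with fresh raw text, which is exactly the messy situation the Free-Module will face at test time.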
Reward Mechanism (the referee):
- For each candidate deletion, run multiple rollouts on the original context and on the cleaned context.
- Keep the example only if accuracy stays the same or improves (Acc_new ≥ Acc_raw).
- This distilled the set to 6,648 high-quality “safe deletion” examples.
🍞 Hook: Like a coach who only adds drills to practice if the team plays at least as well after trying them.
🥬 The Concept: Reward-checked pruning examples
- What it is: Training cases approved only when deletion doesn’t hurt accuracy in controlled tests.
- How it works:
- Propose deletion.
- Test before vs after multiple times.
- Keep only safe or beneficial ones for training.
- Why it matters: Prevents the module from learning risky deletions that clip essential anchors. 🍞 Anchor: After this filter, the module learned to cut verbose, redundant spans without triggering the prune-then-regenerate failure seen with Gemini-based cleaning.
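The referee logic reduces to a simple accept/reject rule. In the sketch below, `acc_fn` is a stand-in for running one full rollout and grading the final answer:

```python
import statistics

def reward_check(acc_fn, raw_context, cleaned_context, rollouts=8):
    """Keep a candidate deletion only if accuracy does not drop after cleaning."""
    acc_raw = statistics.mean(acc_fn(raw_context) for _ in range(rollouts))
    acc_new = statistics.mean(acc_fn(cleaned_context) for _ in range(rollouts))
    return acc_new >= acc_raw  # Acc_new >= Acc_raw: the deletion is safe

def toy_acc(ctx):
    """Toy accuracy model: shorter contexts answer more reliably."""
    return 1 if len(ctx) <= 10 else 0

assert reward_check(toy_acc, "x" * 20, "x" * 8) is True    # cleaning helped: keep
assert reward_check(toy_acc, "x" * 8, "x" * 20) is False   # cleaning hurt: discard
```

Running multiple rollouts per side is what makes the check robust: a single lucky or unlucky sample should not decide whether a deletion pattern enters the training set.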
The Secret Sauce:
- A tiny LoRA adapter learns semantic redundancy, not just attention-based importance. It operates via ultra-light JSON commands to remove huge spans, and it’s trained only on verified-safe cleanups. That combination delivers higher accuracy with fewer tokens and even transfers to other models—like a universal, learned eraser.
04 Experiments & Results
The Test: The team measured how often the first answer was correct (pass@1), how long the answers were (number of tokens), and how much shorter the context became after cleaning (reduction ratio). They tested on math-heavy, long-horizon datasets (AIME24/25, HLE, HMMT, BeyondAIME, BrUMO25, IMOAnswerBench) and on general, shorter tasks (BBH, MMLU-Pro, MMLU-STEM, GPQA) to check safety.
The Competition: They compared three kinds of approaches:
- Vanilla backbones (no cleaning).
- Heuristic compression (e.g., H2O, ThinkClearly), which drop tokens using attention-like rules.
- ICL-based cleaners, including prompts to the backbone itself and to a strong external model (Gemini-2.5-Pro).
🍞 Hook: When you study for a big exam, you might try different ways to shrink your notes—randomly crossing out lines, asking a friend what to skip, or learning a skill to summarize.
🥬 The Concept: pass@1 (first-try accuracy)
- What it is: The share of problems solved correctly in the first attempt.
- How it works:
- Sample an answer once per problem (or per rollout set).
- Check if it matches the gold solution.
- Average across problems.
- Why it matters: It reflects practical, one-shot performance. 🍞 Anchor: Saying “87% pass@1” is like scoring an A when most others got B’s.
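Computing pass@1 is a one-liner in spirit; this sketch assumes one sampled answer per problem:

```python
def pass_at_1(predictions, gold):
    """First-try accuracy: fraction of problems whose single sampled
    answer matches the reference."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(pass_at_1(["42", "17", "9"], ["42", "18", "9"]))
# → 0.6666666666666666
```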
Scoreboard With Context:
- Across 8B to 235B+ scales, Free()LM improved average pass@1 by about 3.3% while cutting tokens substantially.
- On Qwen3-8B, Free()LM scored about 48.1% vs 44.2% for vanilla (a solid bump), used about 21% fewer tokens than vanilla and ~13% fewer than Gemini-based cleaning, and still outperformed the Gemini cleaner on accuracy.
- On Qwen3-30B, gains grew: average pass@1 rose by roughly 4.8% over vanilla, with a ~12% token reduction.
- On Qwen3-235B, Free()LM still improved accuracy while reducing tokens by around 26% on average.
Long-Horizon Recovery:
- The dramatic story is on ultra-long HLE cases: the vanilla 235B model’s accuracy fell to 0% beyond ~80k response tokens—total collapse from context pollution.
- Free()LM compressed those massive traces by about 40–50%, pulling them back into a “sweet spot” (around 40k–70k) where the backbone works well.
- Result: accuracy rebounded to about 50% in that hard tail. That’s like reviving a team from a losing streak to winning half their games again.
Heuristic Failures:
- Methods like H2O and ThinkClearly didn’t just underperform; they sometimes made outputs longer and more loopy.
- Why? Attention-based token importance isn’t the same as logical necessity. Cutting the wrong anchors breaks the proof and triggers repetition or regeneration.
ICL Limitations:
- Prompt-only or external-model cleaning helped a little (about 1–2% in places) but often deleted needed setup steps, causing the backbone to re-generate them later.
- Case studies showed this prune-then-regenerate pattern wastes tokens and harms stability.
Cross-Model Generalization:
- An 8B-trained Free-Module improved a 235B Qwen model on IMOAnswerBench and even boosted a different architecture (DeepSeek V3.2-Speciale), raising accuracy by ~2.3% while slashing tokens by ~46%.
- This suggests the Free-Module learned a generally useful redundancy sense, not just a model-specific trick.
System Trade-offs:
- Latency rose by ~56% due to extra cleaning steps and re-prefilling, but KV cache memory dropped by ~45%—a big win for serving where memory bandwidth is precious.
- The team expects latency could drop toward ~20% overhead if native KV-pruning becomes available.
Surprising Findings:
- Cleaning helped most when things were longest and messiest—that’s where vanilla models were weakest.
- More thinking wasn’t better; smarter forgetting was. With the right eraser, models both spoke less and answered better.
05 Discussion & Limitations
Limitations:
- Extra latency: Cleaning cycles add overhead for generating commands and re-prefilling. In tests, latency rose ~56% per sample, though memory use improved sharply.
- Occasional over-pruning: Even with reward-checked training, rare deletions can remove mild anchors, requiring some regeneration.
- Dependence on good training data: The Free-Module’s precision comes from carefully filtered examples; weaker pipelines may teach riskier behaviors.
- Serving complexity: Native KV cache pruning and position realignment aren’t widely supported yet, so the fastest path remains future work.
Required Resources:
- A backbone LLM (8B–685B) and a small LoRA adapter.
- Training infra to run many rollouts per candidate deletion for reward filtering.
- A serving stack that can handle periodic mode switches, JSON-structured outputs, and efficient re-prefilling.
When NOT to Use:
- Very short tasks where reasoning is brief: The overhead of cleaning may not pay off, and the Free-Module would rarely trigger anyway.
- Highly structured prompts that already enforce brevity: Extra cleaning may add unnecessary latency.
- Latency-critical, memory-plentiful settings: If speed matters more than memory or stability, the overhead could be undesirable.
Open Questions:
- Can we learn dynamic triggers (not just fixed intervals) that predict the best cleanup times based on entropy, repetition, or novelty signals?
- How far does cross-model generalization go—across languages, domains, or multimodal reasoning?
- Can native KV cache pruning be standardized in serving frameworks to cut latency overhead significantly?
- Could the Free-Module learn constructive compression (summarize-and-keep) in addition to deletion, retaining gist while freeing space?
- How does forgetting interact with retrieval-augmented setups—can we jointly decide what to forget and what to fetch next?
Big Picture: Free()LM shows that memory management is part of reasoning. The best gains appear exactly where models used to fail: in long, noisy chains. Turning “remember everything” into “remember what matters” is a shift from raw length-scaling to smart, sustainable thinking.
06 Conclusion & Future Work
Three-Sentence Summary:
- Long thoughts can backfire when models never forget: extra tokens turn into noise and loops, crashing accuracy on the hardest problems.
- Free()LM adds a small Free-Module that switches between thinking and cleaning, issuing tiny prefix–suffix commands to delete big redundant spans safely.
- Trained with reward-checked examples, it boosts accuracy across 8B–685B models, revives ultra-long reasoning, and reduces memory use—proving that forgetting is a core skill for sustainable intelligence.
Main Achievement:
- Establishing a practical, learned free() operation that consistently improves long-horizon reasoning while cutting context size, and that even generalizes across very different backbones.
Future Directions:
- Native KV cache pruning to reduce latency; smarter, data-driven triggers for when to clean; and moving beyond deletion to learned summarization that compresses without losing meaning. Extending the universal “eraser” to multilingual, multimodal, and tool-augmented settings could broaden impact.
Why Remember This:
- It reframes test-time scaling: not more thinking, but cleaner thinking. Like erasing your scratch work at the right times, Free()LM keeps the proof readable, the memory light, and the answers sharp—exactly what long-horizon AI needs.
Practical Applications
- Math tutors that keep proofs tidy by deleting dead-end steps, improving correctness on hard problems.
- Coding assistants that prune noisy logs and repeated stack traces while debugging long-running issues.
- Research assistants that clean intermediate notes when summarizing long papers or multi-document dossiers.
- Customer support bots that trim repetitive dialogue in long tickets, staying focused on the root cause.
- Planning agents (e.g., travel or supply chain) that drop superseded subplans to avoid confusion and loops.
- Meeting summarizers that erase tangents and redundancies to keep only decisions, action items, and key facts.
- Legal and policy analyzers that prune duplicative arguments in long briefs to maintain clear reasoning chains.
- Game-playing or simulation agents that forget explored dead ends to plan deeper without running out of context.
- Data analysts that streamline step-by-step explorations, keeping only validated findings and critical assumptions.