🎓 How I Study AI

Panini: Continual Learning in Token Space via Structured Memory

Intermediate
Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali et al. · 2/16/2026
arXiv

Key Summary

  • Panini is a way for AI to keep learning new facts without changing its brain by storing them as tiny linked Q&A facts in an external memory.
  • Instead of re-reading long document chunks, Panini builds a structured map called a Generative Semantic Workspace (GSW) and reads only the small Q&A pieces it needs.
  • A new reader called RICR (Reasoning Inference Chain Retrieval) follows chains of clues across the GSW, like a detective, to answer multi-step questions.
  • Across six QA benchmarks, Panini scores about 5–7 points higher than strong baselines and shines especially on multi-hop questions.
  • Panini uses far fewer tokens (about 2–30× less) at answer time because it feeds the model compact Q&A evidence, not big passages.
  • On special tests with missing evidence, Panini answers correctly when it can and says 'N/A' when it cannot, reducing hallucinations.
  • Panini works with open-source models too; even if accuracy drops a bit, its advantage over baselines remains, especially on multi-hop tasks.
  • Its memory format (GSW) is reusable: plugging it into other agents like Search-R1 boosts their scores without retraining.
  • Ablations show Panini stays strong with narrow beams and that its chain-scoring choice matters; it’s also robust as corpora grow with distractors.

Why This Research Matters

Panini shows how to keep AI reliable as information changes by storing new facts as tiny, linked Q&A pieces instead of re-reading big text blocks. This saves time and money because the model only reads what it needs, which means fewer tokens and faster responses. It also reduces the risk of hallucinations because the final answer must be grounded in retrieved evidence or return ‘N/A’. That makes it safer for real-world uses in health, finance, and law where wrong guesses are costly. Panini’s structured memory layer is reusable too, boosting other systems without retraining. Finally, it works with open-source models, making private, on-prem deployments more practical.

Detailed Explanation

01Background & Problem Definition

🍞 Top Bread (Hook) You know how when you learn something new for school, you don’t rewrite your entire notebook—you just add a new sticky note and link it to the right page so you can find it fast later?

🥬 Filling (The Actual Concept)

  • What it is: This paper is about helping AI do the same thing—add new facts as small, well-placed notes (not rewrite its whole brain) and find them quickly later.
  • How it works (big picture):
    1. Keep the main AI model fixed (no weight updates).
    2. When new documents arrive, extract tiny Q&A facts and link them into a structured memory (like labeled sticky notes connected by strings).
    3. When a question comes, follow the links to get only the exact facts needed.
    4. Answer from those facts, or say "N/A" if the facts aren’t there.
  • Why it matters: Without this, AIs keep re-reading the same long passages, waste time and money, and sometimes grab irrelevant text that causes made-up answers.

🍞 Bottom Bread (Anchor) Imagine a homework question: “Who was Lothair II’s mother, and when did she die?” Instead of reading two Wikipedia pages over and over, the AI jumps along a chain of tiny Q&As—(Lothair II → mother?) and (mother → death date?)—and answers fast.

— New Concept — Non-Parametric Continual Learning (NPCL) 🍞 Hook: Imagine adding new magnets to your fridge without repainting the fridge each time. 🥬 The Concept:

  • What it is: NPCL stores new knowledge outside the model instead of changing its internal weights.
  • How it works:
    1. Keep the model frozen.
    2. Store new facts externally with structure.
    3. Retrieve only what’s needed at question time.
  • Why it matters: It avoids expensive retraining and catastrophic forgetting. 🍞 Anchor: A news article about a new mayor is saved as a few linked facts; the AI can use them immediately without retraining.

— Existing Baseline — Retrieval-Augmented Generation (RAG) 🍞 Hook: You know how you bookmark whole web pages even when you only need one sentence? 🥬 The Concept:

  • What it is: RAG finds and feeds long text chunks to the AI to help it answer.
  • How it works:
    1. Search for relevant passages.
    2. Stuff the top chunks into the prompt.
    3. Let the AI read and answer.
  • Why it matters: It helps, but the AI keeps re-reading bulky text and can be distracted by extra details. 🍞 Anchor: To answer “Where was Obama born?”, RAG might pass a whole biography page, not just the one birth sentence.

— Task Type — Multi-Hop Question Answering 🍞 Hook: Solving a mystery often needs several clues in the right order. 🥬 The Concept:

  • What it is: Multi-hop QA needs combining facts across steps (e.g., person → mother → death date).
  • How it works:
    1. Break a big question into simple steps.
    2. Answer step 1 to unlock step 2, and so on.
    3. Combine the steps for the final answer.
  • Why it matters: Many real questions require connecting dots across sources. 🍞 Anchor: “Who wrote the book that inspired the movie Jaws?” requires step 1 (what book inspired Jaws?) then step 2 (who wrote that book?).

The World Before and The Problem

  • Before: LLMs could read long contexts (RAG) or get retrained with new data (parametric continual learning). But long contexts cause “lost in the middle” issues, and retraining is costly and risky (forgetting, misaligning safety).
  • Failed Attempts: Better retrievers/rerankers still hand long chunks to the reader; some graph and summarization methods compress themes but don’t support detailed reasoning across entities and events; agentic retrieval is powerful but requires many model calls per question.
  • The Gap: We need a write-time structure that captures specific, linkable facts and a read-time process that follows reasoning chains—fast, accurate, and willing to abstain when evidence is missing.
  • Real Stakes: Users want up-to-date, personalizable, low-latency answers. Fewer prompt tokens reduce cost and energy. Reliable abstention reduces hallucinations in sensitive domains like health, finance, and law.

02Core Idea

🍞 Top Bread (Hook) Imagine building a Lego city where each brick is labeled and snaps only to the right neighbors. When someone asks, “How do I get from the library to the park?”, you just follow the Lego path—no need to rummage through a box of random pieces.

🥬 Filling (The Actual Concept)

  • The Aha! in one sentence: Write experiences as tiny, linked Q&A facts (a Generative Semantic Workspace), then read by following reasoning chains (RICR) instead of re-reading big text chunks.
  • Multiple analogies:
    1. Map analogy: GSW is a city map of facts; RICR is your GPS that plots the shortest path of clues.
    2. Recipe analogy: GSW is a recipe card box (each card = Q&A); RICR picks only the cards needed to bake one cake.
    3. Detective analogy: GSW is a wall of labeled clues; RICR is the detective’s string that connects suspect → event → time.
  • Before vs After: • Before: Systems fetched big passages; the reader re-processed lots of fluff and sometimes got confused. • After: Systems fetch only compact Q&A facts and chain them, saving tokens and increasing precision.
  • Why it works (intuition): • Facts are stored atomically (one question → one answer) and attached to entities/events. This reduces noise. • Retrieval is guided step-by-step using the previous answer, preventing drift. • A small beam of alternative chains hedges against early mistakes. • Final answers are grounded in retrieved Q&A, supporting abstention when evidence is missing.
  • Building blocks:
    1. Generative Semantic Workspace (GSW): an entity- and event-aware network of Q&A pairs.
    2. Dual Indexing: a sparse index over entities + a dense index over Q&A pairs.
    3. Question Decomposition: turn complex questions into ordered single-hop sub-questions.
    4. RICR (Reasoning Inference Chain Retrieval): hop-by-hop retrieval with beam search and chain scoring.
    5. Evidence-Only Answering: pass deduplicated Q&A evidence (not long chunks) to the reader; answer or abstain.

— New Concept — Generative Semantic Workspace (GSW) 🍞 Hook: You know how you make flashcards where each card has one clear question and one answer? 🥬 The Concept:

  • What it is: GSW is a structured memory that stores tiny Q&A facts linked to entities and events.
  • How it works:
    1. Find entities (people, places, dates) and verb-phrases (born on, married to, ruled over).
    2. Make bidirectional Q&A pairs for each verb-phrase (Who was born on X? Where was Y born?).
    3. Link everything so you can jump from one entity to the next via questions.
  • Why it matters: It makes the set of involved entities explicit and lets retrieval follow precise, factual links instead of vague text. 🍞 Anchor: “When was Barack Obama born?” ↔ “Who was born on August 4, 1961?” are both cards; they link Obama to that date.
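The bidirectional flashcard idea can be sketched as a tiny data structure. This is a minimal illustration, not the paper's implementation; the names (`GSW`, `add_fact`, `qa_for_entity`) and the template-based question phrasing are assumptions made for the sketch:

```python
from collections import defaultdict

class GSW:
    """Toy Generative Semantic Workspace: QA facts attached to entities."""

    def __init__(self):
        # entity -> list of (question, answer) "flashcards"
        self.qa_by_entity = defaultdict(list)

    def add_fact(self, subject, verb_phrase, obj):
        # Store both directions so retrieval can hop either way.
        forward = (f"What was {subject} {verb_phrase}?", obj)
        inverse = (f"Who was {verb_phrase} {obj}?", subject)
        self.qa_by_entity[subject].append(forward)
        self.qa_by_entity[obj].append(inverse)

    def qa_for_entity(self, entity):
        return self.qa_by_entity[entity]

gsw = GSW()
gsw.add_fact("Barack Obama", "born on", "August 4, 1961")
```

Because the inverse card is indexed under the date, a hop can just as easily start from "August 4, 1961" and land on "Barack Obama" as the other way around.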

— New Concept — Reasoning Inference Chain Retrieval (RICR) 🍞 Hook: Think of solving a maze by always following signposts you’ve already passed. 🥬 The Concept:

  • What it is: RICR is a hop-by-hop retriever that follows chains of Q&A links through the GSW to answer multi-step questions.
  • How it works:
    1. Decompose the question into atomic sub-questions.
    2. For each sub-question, fetch candidate Q&A pairs using dual indexing.
    3. Keep a small set (beam) of best partial chains; score them by the strength of all hops.
    4. Advance to the next hop by plugging the last answer into the next sub-question.
    5. After the last hop, deduplicate the Q&A evidence and answer.
  • Why it matters: It avoids grabbing unrelated text and stays anchored to the reasoning path. 🍞 Anchor: For “When did Lothair II’s mother die?”, RICR finds mother = Ermengarde of Tours, then finds her death date, and answers.

— New Concept — Beam Search (in RICR) 🍞 Hook: When you’re not sure which hallway leads to the treasure, check a few likely ones in parallel. 🥬 The Concept:

  • What it is: Beam search keeps a few top candidate chains at each step instead of committing to one too early.
  • How it works:
    1. Start with top-k candidates for hop 1.
    2. Extend each by top-k at hop 2.
    3. Keep only the top-B chains overall (unique current answers) using a cumulative score that penalizes weak links.
  • Why it matters: If one early guess is wrong, other chains can still succeed. 🍞 Anchor: If both “Ermengarde of Tours” and “Ermengarde of Hesbaye” show up as candidate mothers, the beam carries both until evidence favors one.
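One beam step can be sketched as below. The function name `extend_beam` and the flat `(answer, score)` candidate format are illustrative assumptions, but the two key ideas (geometric-mean scoring and pruning to top-B chains with unique current answers) follow the description above:

```python
import math

def extend_beam(beam, candidates, B=3):
    """Extend each partial chain by one hop, then prune (illustrative sketch).

    beam: list of (chain, hop_scores) partial chains
    candidates: list of (answer, score) options for the current hop
    Returns top-B extended chains, one per unique current answer, ranked
    by the geometric mean of all hop scores (penalizes any weak link).
    """
    extended = []
    for chain, scores in beam:
        for answer, s in candidates:
            new_scores = scores + [s]
            gmean = math.exp(sum(math.log(x) for x in new_scores) / len(new_scores))
            extended.append((chain + [answer], new_scores, gmean))
    extended.sort(key=lambda t: t[2], reverse=True)
    kept, seen = [], set()
    for chain, scores, _ in extended:
        if chain[-1] not in seen:  # keep only unique current answers
            seen.add(chain[-1])
            kept.append((chain, scores))
        if len(kept) == B:
            break
    return kept

# Both candidate mothers survive the prune and ride along in the beam.
beam = [(["Lothair II"], [1.0])]
mothers = [("Ermengarde of Tours", 0.9), ("Ermengarde of Hesbaye", 0.6)]
beam = extend_beam(beam, mothers, B=2)
```

The stronger Tours chain ranks first, but the Hesbaye chain is retained so a later hop can still overturn an early mistake.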

— New Concept — Evidence-Only Answering and Abstention 🍞 Hook: If your worksheet doesn’t have enough numbers, you don’t guess—you leave it blank or write “N/A”. 🥬 The Concept:

  • What it is: The answerer sees only the retrieved Q&A evidence and must answer from it—or output “N/A” if evidence is insufficient.
  • How it works:
    1. Provide compact Q&A pairs to the reader.
    2. Instruct it to answer only from these or say “N/A”.
    3. Evaluate answers and refusals separately on special splits.
  • Why it matters: This reduces hallucinations and teaches the system to admit uncertainty. 🍞 Anchor: If no Q&A card has the date of a person’s death, the model returns “N/A” instead of guessing.
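A toy stand-in for evidence-only answering: the real system prompts a reader LLM, whereas here "grounding" is reduced to a keyword lookup over the deduplicated QA pairs, purely for illustration:

```python
def answer_from_evidence(sub_question_key, evidence):
    """Answer strictly from retrieved QA evidence, else abstain with 'N/A'."""
    for question, answer in evidence:
        if sub_question_key in question.lower():
            return answer
    return "N/A"  # no supporting QA card -> abstain rather than guess

evidence = [("Who was Lothair II's mother?", "Ermengarde of Tours")]
```

Asking about the mother returns the grounded answer; asking about a death date that no card covers yields "N/A" instead of a guess.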

03Methodology

At a high level: Documents → Write-time GSW building + Dual Indexing → One-shot Question Decomposition → RICR (hop-by-hop chain retrieval with a beam) → Deduplicated Q&A evidence → Answer or N/A.

Step 1: Write-Time Structured Memory (GSW)

  • What happens: Each document is turned into a Generative Semantic Workspace (GSW): • Entities (with roles/states): Barack Obama (person; 44th President), Honolulu (location), August 4, 1961 (date), etc. • Verb-phrase/event nodes: born on, married to, ruled over, etc. • Bidirectional Q&A pairs: “When was Obama born?” ↔ “Who was born on Aug 4, 1961?”
  • Why this step exists: If you don’t structure at write time, later retrieval must keep scanning lots of raw text, which is slow and noisy.
  • Example: From a Lothair II passage, extract entities (Lothair II, Ermengarde of Tours, Teutberga), events (married to, son of), and Q&A pairs for both directions.

Step 2: Dual Indexing the Memory

  • What happens:
    1. Sparse entity index (BM25): index entity names + role/state snippets, for fast entity lookups.
    2. Dense Q&A index: index every Q&A pair as a vector, for semantic matches to sub-questions.
  • Why this step exists: Entity hits provide precise anchor points; Q&A hits provide semantic coverage. Using both increases recall and precision.
  • Example: Query mentions “Lothair II”; entity index nominates his node; dense Q&A index also finds QAs about his family.
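The merge of the two indices can be sketched with toy scoring functions. A real system would use BM25 and learned embeddings; the word-overlap score and hand-written cosine below are deliberate simplifications:

```python
def sparse_entity_hits(query, entity_index):
    """Toy BM25 stand-in: rank entities by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(e, len(q & set(e.lower().split()))) for e in entity_index]
    return [e for e, s in sorted(scored, key=lambda t: -t[1]) if s > 0]

def dense_qa_hits(query_vec, qa_index, top_k=2):
    """Toy dense retrieval: cosine similarity over precomputed QA vectors."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
    ranked = sorted(qa_index, key=lambda item: -cos(query_vec, item[1]))
    return [qa for qa, _ in ranked[:top_k]]

entities = ["Lothair II", "Ermengarde of Tours", "Honolulu"]
hits = sparse_entity_hits("Who was Lothair II's mother?", entities)

qa_index = [
    (("Who was Lothair II's mother?", "Ermengarde of Tours"), [0.9, 0.1]),
    (("Where was Obama born?", "Honolulu"), [0.1, 0.9]),
]
dense = dense_qa_hits([1.0, 0.0], qa_index, top_k=1)
```

In the full pipeline the candidates from both routes are merged and handed to a cross-encoder reranker; the entity route pins down precise anchors while the dense route catches paraphrases.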

Step 3: Question Decomposition

  • What happens: An LLM rewrites the user’s question into a minimal sequence of atomic sub-questions, with placeholders that pass answers to the next hop (like <ENTITY_Q1>).
  • Why this step exists: Many questions need multi-hop reasoning. Without decomposition, retrieval can overfit the first hop and miss the bridge to the second.
  • Example: “Who died later, Lothair II’s mother or Amadeus I’s father?” becomes two parallel chains: (mother of Lothair II → her death date) and (father of Amadeus I → his death date).
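The placeholder mechanism can be sketched as template substitution. `<ENTITY_PREV>` and the dictionary lookup are illustrative stand-ins for the paper's `<ENTITY_Q1>`-style placeholders and the actual retrieval step:

```python
def run_chain(sub_questions, lookup):
    """Execute decomposed sub-questions, piping each hop's answer
    into the next hop's template (illustrative sketch)."""
    answer = None
    for template in sub_questions:
        question = template.replace("<ENTITY_PREV>", answer or "")
        answer = lookup[question]
    return answer

# Toy fact store standing in for GSW retrieval.
facts = {
    "Who was Lothair II's mother?": "Ermengarde of Tours",
    "When did Ermengarde of Tours die?": "20 March 851",
}
chain = ["Who was Lothair II's mother?", "When did <ENTITY_PREV> die?"]
```

Hop 1 resolves the placeholder-free question; its answer is spliced into hop 2's template, turning a two-hop question into two single-hop lookups.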

Step 4: RICR – Hop-by-Hop Chain Retrieval with a Beam

  • What happens (the inner loop per sequence):
    1. Input sub-question q_t(x_t), where x_t is the current entity (from question text or previous hop answer).
    2. Retrieve candidates by merging: • Entity-anchored candidates via BM25 over entities (collect their attached QAs), and • Dense Q&A candidates via the QA index.
    3. Rerank candidates with a cross-encoder (e.g., Voyage Rerank-2.5), keep top-k.
    4. Extend each current chain by one QA candidate; compute chain score as the geometric mean over hop scores to penalize any weak link.
    5. Prune to top-B chains, ensuring unique current answers (avoid redundant paths to the same entity).
    6. Move to the next hop: set x_{t+1} to the chosen answer a_t.
  • Why this step exists: Committing too early can lock in an error; the beam provides recovery. Geometric-mean scoring favors consistently strong evidence across hops.
  • Example: For “When did Lothair II’s mother die?” hop 1 finds mother candidates (Tours vs Hesbaye). Hop 2 asks for each one’s death date. The Tours chain wins on cumulative score.
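The chain score used in substep 4 can be written out explicitly; assuming \(s_t\) denotes the reranker score of the QA pair chosen at hop \(t\) of a \(T\)-hop chain:

```latex
\mathrm{score}(c_{1:T}) = \Big(\prod_{t=1}^{T} s_t\Big)^{1/T}
```

Because the scores are multiplied, a single weak hop pulls the whole chain down sharply, which is why the geometric mean "penalizes any weak link" more aggressively than an arithmetic mean would.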

Step 5: Evidence Packaging for the Reader

  • What happens: After finishing all hops (and all parallel sequences if any), deduplicate the set of QA pairs used by top chains and pass only these compact, grounded items to the reader LLM.
  • Why this step exists: Sending only short Q&A facts slashes token usage and reduces distraction.
  • Example: The final evidence might be just two lines: “Who was Lothair II’s mother? → Ermengarde of Tours” and “When did Ermengarde of Tours die? → 20 March 851.”

Step 6: Answer Generation with Abstention

  • What happens: The reader is instructed to answer strictly from the provided Q&A evidence or output “N/A” if it’s insufficient.
  • Why this step exists: It calibrates the system—no evidence, no guess—reducing hallucinations.
  • Example: If the death date was never found among retrieved QAs, the reader outputs “N/A”.

The Secret Sauce

  • Write-time structure (GSW) turns messy passages into reusable, linkable atoms.
  • Dual indexing merges the precision of entity search with the coverage of semantic Q&A search.
  • Chain-following with a small beam balances efficiency and robustness.
  • Evidence-only answering keeps token budgets low and faithfulness high.

What breaks without each piece?

  • Without GSW: You re-read chunks repeatedly and drown in extra text.
  • Without dual indexing: You either miss entities (QA-only) or miss semantics (entity-only), hurting recall/precision.
  • Without decomposition: Multi-hop bridges are often missed, leading to wrong or unsupported answers.
  • Without beam scoring: Early mistakes propagate; the system can’t recover.
  • Without abstention: The reader may hallucinate when evidence is missing.

04Experiments & Results

The Test: What they measured and why

  • Benchmarks: Six QA datasets—NQ and PopQA (single-hop), plus MuSiQue, 2WikiMultihopQA, HotpotQA, and LV-Eval (multi-hop/long-context)—to test both simple lookups and multi-step reasoning.
  • Metrics: • EM/F1: Are answers exactly or mostly correct? • Token usage: How many tokens does the answerer read? (Cost/speed proxy.) • Reliability (Platinum splits): How well does the system answer when it can (Ans) and refuse with “N/A” when it can’t (Unans)?

The Competition: Who Panini was up against

  • Chunk retrieval: BM25, BM25+reranker, dense embedding retrievers (with/without rerankers).
  • Structure-augmented RAG: RAPTOR, GraphRAG, HippoRAG/HippoRAG2.
  • Agentic systems: IRCoT and Search-R1 (multi-step, LLM-in-the-loop search strategies).

The Scoreboard: Numbers with meaning

  • Overall accuracy: Panini gets the best average F1 (~56%), about 5–7 points higher than strong baselines like HippoRAG2 (~53%).
  • Multi-hop strength: The largest gains show up on MuSiQue, 2Wiki, and HotpotQA—where careful chaining matters most.
  • Token savings: Panini cuts answer-time tokens by about 2× compared to chunk retrieval and 5–30× compared to structure-augmented and agentic baselines because it passes compact Q&A facts instead of long passages.
  • Reliability (Platinum): Panini breaks the usual trade-off: it keeps the highest answerable accuracy (around 80%) while maintaining strong refusal accuracy (roughly the low-to-mid 70s). In plain terms, it both answers more of what’s actually answerable and says “N/A” when the library doesn’t have the book.

Surprising/Notable Findings

  • One-shot planning suffices: Panini decomposes once and then uses non-parametric retrieval; it still beats agentic loops that call the LLM many times.
  • Small beam works: Even beam width B=1 is competitive; B≈3–5 gives modest but steady gains, so you don’t need wide, expensive searches.
  • Robust to growth: When the corpus grows with distractors (but the true evidence stays the same), Panini’s structured memory degrades less than BM25/dense retrievers.
  • Reusable memory: Swapping Search-R1’s chunk retrieval for Panini’s GSW-based retrieval (no retraining) improves Search-R1’s average F1, showing GSW is a drop-in retrieval layer.
  • Open-source pipeline: Replacing proprietary models with open-source ones lowers absolute scores but Panini’s edge widens on multi-hop benchmarks; smaller models’ noisier GSWs still work because beam search recovers from extraction errors.

What this means in human terms

  • Panini doesn’t just read faster; it reads smarter—only the sentences that matter, stitched together in the right order.
  • It also knows when to stop and say “I don’t have enough info,” which is crucial for safety and trust.

05Discussion & Limitations

Limitations (be specific)

  • No latent-link caching yet: If the same cross-document bridge is repeatedly discovered at read time, Panini doesn’t currently save that link to make the next query even faster.
  • Write-time cost and quality: Building high-quality GSWs can be more expensive with proprietary models and less reliable with small open-source models (occasionally missing verb-phrases or inverses).
  • Light reconciliation: Entities are reconciled confidently within documents but not exhaustively across documents; richer, policy-driven reconciliation could help but may add cost and complexity.
  • Beyond fact-centric QA: Narrative reasoning and multimodal inputs (video/audio) will require extending the representation of time, space, and cross-modal events.

Required Resources

  • Storage: External memory for Q&A pairs and two indices (entity BM25 + dense QA index).
  • Compute: One-time write-time extraction; modest read-time retrieval + reranking; a single LLM call for decomposition and a compact evidence prompt for answering.

When NOT to Use

  • Very small, static corpora: Simpler RAG may be enough.
  • Tasks needing broad, stylistic synthesis (e.g., creative summaries) where long passages are helpful and factual precision is secondary.
  • Situations with zero tolerance for any write-time cost (you cannot pre-process documents).

Open Questions

  • What should be consolidated across documents, when, and by which policy (frequency, centrality, utility)?
  • How can the system learn which external facts to later internalize parametrically (guiding future fine-tuning)?
  • Can abstention be further calibrated with uncertainty estimates over chains?
  • How should we generalize GSW to narratives and multimodal streams (events with temporal/spatial extents, cross-modal anchors)?

06Conclusion & Future Work

Three-Sentence Summary

Panini stores new experiences as small, linked Q&A facts (GSW) and answers questions by following reasoning chains (RICR) instead of re-reading long chunks. This design boosts accuracy—especially on multi-hop questions—while using far fewer tokens, and it reduces hallucinations by answering only from retrieved evidence or saying “N/A.” The structured memory is reusable and even improves other agents when plugged in.

Main Achievement

The paper shows that investing computation at write time to build an entity- and event-aware Q&A network yields faster, more accurate, and more reliable read-time reasoning—with fewer tokens—across diverse QA tasks.

Future Directions

  • Cache frequently used cross-document links; add selective, policy-driven reconciliation.
  • Lower write-time cost with better open-source constructors and two-pass repair.
  • Extend the framework to narratives and multimodal streams (e.g., video scenes, timelines).
  • Use the external memory to guide what to internalize parametrically in future training.

Why Remember This

Panini reframes continual learning: don’t keep stuffing longer contexts or keep retraining; instead, structure experiences once and reason cheaply forever after. That shift—from reading everything to reading only the right things—pays off in accuracy, efficiency, and trustworthiness.

Practical Applications

  • Enterprise knowledge assistants that keep up with new policies by adding structured Q&A facts without retraining.
  • Research copilots that chain facts across papers (author → affiliation → dataset year) to answer complex questions.
  • Customer support bots that quickly link known product issues to fixes with minimal token usage.
  • Compliance and legal tools that abstain when documents lack sufficient evidence, reducing risky guesses.
  • Healthcare triage assistants that connect symptoms to guidelines while citing exact evidence Q&As.
  • News trackers that add fresh events as Q&A links and answer timeline queries efficiently.
  • Education tutors that build structured memories of course materials and answer multi-step homework questions.
  • Developer assistants that index APIs as Q&A facts (method → parameter → default) to answer configuration questions.
  • Personal knowledge bases that reconcile events across notes (meeting → attendee → action item) for fast recall.
  • Search agents that plug in GSW retrieval to improve multi-hop reasoning without retraining.
#non-parametric continual learning #structured memory #Generative Semantic Workspace #reasoning inference chain retrieval #multi-hop question answering #retrieval-augmented generation #beam search #entity indexing #dense QA retrieval #cross-encoder reranking #abstention #token efficiency #knowledge consolidation #graph-based retrieval #open-source LLM pipeline