Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
Key Summary
- Not all wrong answers from large language models (LLMs) mean they never learned the fact; many times the model knows it but can't pull it out on demand.
- This paper separates two very different problems: empty shelves (the model never stored the fact) vs. lost keys (the fact is stored but hard to find).
- They introduce "knowledge profiling," which labels each fact by whether it's encoded and how easily it can be recalled (directly, only with thinking, or not at all).
- A new benchmark, WikiProfile, builds 10 tests per fact from real Wikipedia text, so we can check encoding and recall in realistic settings without needing model internals.
- Frontier models encode about 95–98% of the tested facts, but still fail to recall 25–33% of them without thinking; recall, not encoding, is the main bottleneck.
- Models especially struggle with rare (long-tail) facts and reverse questions (B is A) even when those facts are encoded.
- Multiple-choice "verification" shows models can often recognize the right answer for reverse questions that they can't freely recall in generation, pointing to a recall issue, not missing knowledge.
- Letting models "think" (do inference-time computation) recovers 40–65% of encoded-but-not-recalled facts, most strongly for rare facts and reverse questions.
- The takeaway: future gains may come less from scaling up training and more from improving how models access and use the knowledge they already store.
Why This Research Matters
If most facts are already stored in strong models, we can get big gains by improving how models access that knowledge instead of only making them bigger. This helps assistants answer more reliably, especially for niche questions and tricky reverse lookups people naturally ask. It also cuts costs by deciding when to use slower "thinking" modes only when needed. In safety-critical domains (like medicine or law), distinguishing missing knowledge from recall failure guides the right fix: better training vs. better prompting and reasoning policies. Finally, fairer performance across popular and rare facts means broader, more inclusive AI help for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how sometimes you study a lot for a quiz, but when a tricky question is asked in a different way, your mind goes blank, even though you learned it? That doesn't mean you never learned it; it just means recall is hard under pressure.
🥬 Filling (The Actual Concept):
- What it is: This paper asks a simple question with big consequences: when an AI gets a fact wrong, is it because the fact was never learned (empty shelves) or because it can't find the fact at the moment (lost keys)?
- How it works: For years, we mostly judged AI factuality with a single score, accuracy, treating all mistakes the same. The authors argue we should judge at the level of each fact and separate two steps: (1) encoding (did the model store the fact during training?), and (2) recall (can the model pull that fact out when asked in different ways?). They build a new benchmark that tests both.
- Why it matters: If shelves are empty, the fix is more or better training data. If keys are lost, the fix is better prompts, better use of model reasoning at inference time, or post-training techniques, not just more data.
🍞 Bottom Bread (Anchor): Imagine asking, "Which famous band played their first gig at the Boardwalk club?" If the AI fails, is it because it never learned "Oasis" (empty shelves)? Or does it know "Oasis played their first gig at the Boardwalk" but can't reverse the relation to answer who (lost keys)?
🍞 Top Bread (Hook): Think of a spelling bee. Even great spellers sometimes freeze when a word is asked backward or in a strange sentence.
🥬 Filling:
- What it is: Before this work, most evaluations tallied right/wrong answers without explaining why errors happened.
- How it works: The field tried many things (making models bigger, adding knowledge graphs, editing facts inside models, and probing model internals), but these didn't clearly tell us if errors were from missing knowledge or retrieval trouble.
- Why it matters: Without knowing the cause, we might waste effort: scaling a model to "store more" won't fix a recall glitch.
🍞 Bottom Bread (Anchor): If a student forgets your name mid-conversation, buying them a bigger backpack (more capacity) won't help. Teaching them a trick to recall names might.
🍞 Top Bread (Hook): Imagine a library. It's not enough for the book to exist; you have to find it quickly, even if the request is phrased oddly.
🥬 Filling (New Concepts, in Sandwich form):
- Factuality Evaluations
- What: Checks of whether an AI's answers are true or false.
- How: Traditionally, ask a question and mark the answer right/wrong.
- Why: Alone, this hides whether the book isn't in the library (not encoded) or is just mis-shelved (recall failure).
- Anchor: Scoring a trivia team by "how many right" won't tell you if they never learned geography or just blanked under pressure.
- Encoding Failures (Empty Shelves)
- What: The model didn't store the fact during training.
- How: Maybe the data never had it, or the model lacked capacity.
- Why: If true, you need better data or a bigger model.
- Anchor: If your notes never included "Capital of France," you can't recall "Paris."
- Recall Failures (Lost Keys)
- What: The model stored the fact but can't retrieve it in a new phrasing or reversed direction.
- How: The question's wording or direction differs from how it was learned.
- Why: Fixes involve better prompts, post-training, or inference-time thinking.
- Anchor: You know a classmate's name but can't say it until someone gives the first letter.
- Long-tail Knowledge
- What: Rare or niche facts.
- How: These appear less during training; models encode many of them but recall is fragile.
- Why: Without targeting recall, assistants will keep missing important niche facts.
- Anchor: A zoo guide can name common animals fast, but might hesitate on a rare lizard's species.
- Reverse Questions
- What: Asking the relation backward (B→A) instead of forward (A→B).
- How: Models often learn "A is related to B" but struggle to answer "Which A goes with B?"
- Why: Highlights recall trouble tied to how facts were read in training.
- Anchor: You can answer "Oasis → Boardwalk club," but stumble on "Boardwalk club → who?"
🍞 Bottom Bread (Anchor): The paper fills the gap by profiling facts from real Wikipedia text to see whether each one is stored and how accessible it is, with and without letting the model "think."
02 Core Idea
🍞 Top Bread (Hook): Imagine sorting every item in your backpack into five bins: not packed, easy to grab, only grab if you stop and think, guessed from other items, or still can't find. That's a lot clearer than just saying "I found it" or "I didn't."
🥬 Filling (Aha! Moment):
- What it is: The key insight is to shift from grading questions to profiling facts: first check if a fact is encoded in the model, then measure how easily it can be recalled, in direct form, reversed form, and with or without letting the model think.
- How it works (recipe):
- Extract a fact from real Wikipedia text as a subject-object proposition.
- Test encoding by recreating a pre-training-like context and asking the model to complete or answer within that context (no thinking allowed).
- Test recall by asking direct and reverse questions with different phrasings (with and without thinking).
- Place the fact into one of five knowledge profiles based on results: Encoding Failure, Direct Recall, Recall with Thinking, Recall Failure, or Inference without Encoding.
- Why it matters: Now we can tell if errors need more training data (empty shelves) or better retrieval/usage (lost keys). This changes how we improve models.
🍞 Bottom Bread (Anchor): For the Oasis example, the method can tell whether "Oasis → Boardwalk club" is stored and whether the model can answer both directions, with or without extra thinking steps.
🍞 Top Bread (Hook): Three analogies to see the same idea:
- Library and Keys: Books (facts) can be there but hard to find; thinking is like asking a helpful librarian to search deeper.
- Detective Work: Solving a case depends on both having the clues (encoded) and noticing which clue matters now (recall).
- School Quiz: You might recognize the right answer on a multiple-choice test (verification) even if you couldn't recall it freely (generation).
🥬 Filling:
- Before vs. After:
- Before: One accuracy score; wrong answers looked the same, so we kept scaling models.
- After: We see that frontier models already encode most facts, but recall is the choke point, especially for rare facts and reversals. Solutions shift toward better retrieval and inference-time strategies.
- Why it works (intuition):
- Encoding test "primes" the model with the same kind of text it saw during training. If it can fill the fact there, the fact is likely stored.
- Knowledge test requires robustness: the model should answer across phrasings and directions. If it fails here, recall (not storage) is the issue.
- Multiple-choice vs. generation separates recognition from recall: recognizing the right choice is easier than pulling the answer out unprompted.
- Building blocks (with Sandwich mini-blocks):
- Facts as Propositions
- What: A subject-object link taken from natural text.
- How: Use the order in the source sentence to define direction.
- Why: Lets us test both direct and reverse questions cleanly.
- Anchor: "Oasis (subject) → Boardwalk club (object)."
- Direct vs. Reverse Questions
- What: Direct asks for the object; reverse asks for the subject.
- How: Create two phrasings for each to test robustness to wording.
- Why: Reveals direction-sensitive recall issues (reversal curse).
- Anchor: "Where did Oasis play first?" vs. "Who played first at the Boardwalk?"
- Thinking (Inference-time Computation)
- What: Let the model write intermediate steps before the final answer.
- How: Use chain-of-thought or reasoning-optimized modes.
- Why: Often unlocks stored but hard-to-reach facts.
- Anchor: Like pausing to list clues before naming the suspect.
- Five Knowledge Profiles
- What: The categories that describe each fact's status.
- How: Combine encoding (yes/no) with recall (direct/with thinking/never) to sort.
- Why: Pinpoints what to fix: data vs. retrieval vs. reasoning policy.
- Anchor: A shelf label that says "here but needs a step stool" vs. "not in stock."
🍞 Bottom Bread (Anchor): After profiling, the model's mistakes stop being a mystery. We can see, for each fact, whether we need more books (training) or a better way to find them (recall/think).
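To make the proposition idea concrete, here is a minimal sketch of how a direct and a reverse question could be built from one subject-object fact. The `Proposition` class and the question templates are illustrative assumptions, not the paper's actual question-generation prompts.

```python
from dataclasses import dataclass

@dataclass
class Proposition:
    subject: str   # entity appearing first in the source sentence
    relation: str  # phrase linking subject to object
    obj: str       # entity appearing later in the source sentence

def direct_question(p: Proposition) -> str:
    # Direct (A -> B): the subject is given; the object is asked for.
    return f"{p.subject} {p.relation} which entity?"

def reverse_question(p: Proposition) -> str:
    # Reverse (B -> A): the object is given; the subject is asked for.
    return f"Which entity {p.relation} {p.obj}?"

oasis = Proposition("Oasis", "played their first gig at", "the Boardwalk club")
print(direct_question(oasis))   # Oasis played their first gig at which entity?
print(reverse_question(oasis))  # Which entity played their first gig at the Boardwalk club?
```

The reversal curse is exactly the observation that models answer the first template far more reliably than the second, even for the same stored fact.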
03 Methodology
🍞 Top Bread (Hook): Imagine building a science fair project that tests whether classmates remember facts from a textbook and whether they can still answer if you reword the question or flip it around. You'd need a careful plan.
🥬 Filling (High-level overview):
- What it is: A recipe to test, for each fact, (1) if it's encoded and (2) how easily it's recalled.
- How it works (pipeline): Input (Wikipedia page) → Fact extraction (pick a subject-object fact) → Create 10 tasks (2 for encoding, 4 for knowledge, 4 multiple-choice versions) → Generate 8 answers per task per model → Grade answers with an LLM grader → Assign each fact to a knowledge profile.
- Why it matters: This turns a flat right/wrong score into a detailed map of where knowledge lives and how accessible it is.
🍞 Bottom Bread (Anchor): For the Oasis example, we first test if the model can fill in "Oasis played their first gig at ___" in the original paragraph, then ask both direct and reverse questions in natural ways and multiple-choice forms.
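The pipeline can be sketched as plain control flow. Everything below is a stub (real task construction, sampling, and LLM grading are far more involved); the helper names and the 0.5 threshold are assumptions drawn from the description above.

```python
def make_tasks(fact):
    """Build the 10 tasks per fact: 2 encoding, 4 open-ended knowledge, 4 MC."""
    return {
        "encoding": [("completion", fact), ("contextual", fact)],
        "knowledge": [("direct", fact, i) for i in range(2)]
                   + [("reverse", fact, i) for i in range(2)],
        "mc": [("mc", fact, i) for i in range(4)],
    }

def accuracy(task, model, grader, n_samples=8):
    """Sample 8 answers at temperature 1 and grade each; return mean accuracy."""
    answers = [model(task) for _ in range(n_samples)]
    return sum(grader(task, a) for a in answers) / n_samples

def profile_fact(fact, model, grader, threshold=0.5):
    tasks = make_tasks(fact)
    # Existential test: the fact "encodes" if ANY encoding task clears the bar.
    encoded = any(accuracy(t, model, grader) >= threshold
                  for t in tasks["encoding"])
    # Universal test: the model "knows" only if ALL knowledge questions clear it.
    knows = all(accuracy(t, model, grader) >= threshold
                for t in tasks["knowledge"])
    return encoded, knows
```

With a real model and grader plugged in, `profile_fact` would run twice, once with thinking disabled and once enabled, and the two results combined into the final profile.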
Detailed steps (each with Sandwich pattern):
- Fact Extraction from Natural Text
- What: Pull a single, unambiguous fact from a real Wikipedia paragraph.
- How: Do named-entity recognition, pick an object entity that makes a specific, non-trivial completion (not time-sensitive), and ensure the left context leads to exactly one correct answer. Balance topics and entity types.
- Why: If facts are vague or have multiple answers, we can't tell if the model failed or the question was unfair.
- Anchor: From "Love and Money are a Scottish band … formed by three former members of Friends Again … along with bassist Bobby Paterson," extract the fact linking subject (the band creators) to object (Bobby Paterson).
- Encoding Tests (No Thinking Allowed)
- What: Two tasks that mimic training conditions to see if the model stored the fact.
- How:
- Proposition Completion: Give the left context and ask the model to complete the sentence that would include the object.
- Contextual Question: Ask a high-verbatim (close-to-text) question appended to the same context.
- Why: Completion alone can be ambiguous for chat-tuned models; the contextual question clarifies the target. Excluding thinking avoids confusing storage with on-the-fly inference.
- Anchor: "Oasis played their first gig on 14 August 1991 at ____" and "Using the above paragraph: Where did Oasis play their first gig?"
- Knowledge Tests (With and Without Thinking)
- What: Four closed-book questions (two direct, two reverse), each with different phrasings; plus four multiple-choice versions.
- How: Ask minimal but unambiguous questions; then create natural rephrasings; then make multiple-choice formats with realistic distractors of the same entity type.
- Why: Tests whether recall is robust to wording and direction; multiple-choice separates recognition from free recall.
- Anchor: Direct: "Where did Oasis play their first gig?" Reverse: "Which band played their first gig at the Boardwalk club?" and their natural rephrasings; MC versions list four plausible bands/venues.
- Refinement and Filtering with Web Search
- What: Automatic quality control to keep only precise, unique-answer questions.
- How: A prompted LLM checks for specificity and minimalism; then a search-grounded LLM filters out any question with multiple possible answers or unclear wording.
- Why: Prevents garbage-in, garbage-out; unclear items would blur the difference between encoding and recall.
- Anchor: If "Which film premiered at the 18th Rome Film Festival?" could refer to several films, add minimal context like the director's name.
- Response Generation and Grading
- What: For each task, sample 8 answers per model at temperature 1; grade with an LLM autorater into CORRECT/INCORRECT (and rare PARTIALLY/OTHER, which are handled carefully).
- How: Compute per-question accuracy over gradable responses; define encodes if any encoding task for the fact passes a threshold; define knows if all knowledge questions for that fact pass.
- Why: Multiple samples reduce randomness; existential vs. universal tests cleanly separate storage from robust recall.
- Anchor: A fact "encodes" if at least one encoding test clears 50%; it "knows" if all four knowledge questions clear 50% (with or without thinking, depending on the profile being measured).
- Knowledge Profiles (Five Bins)
- What: For each fact, decide if it's (a) Encoding Failure, (b) Direct Recall, (c) Recall with Thinking, (d) Recall Failure, or (e) Inference without Encoding.
- How: Combine encoding result (yes/no) with knowledge results (without and with thinking) to sort.
- Why: Each bin points to a different fix: more data, better prompts/post-training, use of thinking, or being cautious about guesses.
- Anchor: If the model encodes "Oasis → Boardwalk" but fails direct questions and even with thinking, it's a Recall Failure; if thinking rescues it, it's Recall with Thinking.
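The threshold rules and the five bins can be written out as small functions. This is a sketch of the decision logic as described above; the exact handling of Inference without Encoding is an assumption.

```python
def encodes(encoding_accuracies, threshold=0.5):
    # Existential: at least one encoding task clears the threshold.
    return any(a >= threshold for a in encoding_accuracies)

def knows(knowledge_accuracies, threshold=0.5):
    # Universal: every knowledge question clears the threshold.
    return all(a >= threshold for a in knowledge_accuracies)

def knowledge_profile(encoded, knows_fast, knows_thinking):
    """Sort one fact into one of the five knowledge profiles."""
    if not encoded:
        # Correct answers without storage suggest on-the-fly inference/guessing.
        if knows_fast or knows_thinking:
            return "Inference without Encoding"
        return "Encoding Failure"
    if knows_fast:
        return "Direct Recall"
    if knows_thinking:
        return "Recall with Thinking"
    return "Recall Failure"
```

For the Oasis example above, `knowledge_profile(True, False, False)` gives "Recall Failure" and `knowledge_profile(True, False, True)` gives "Recall with Thinking."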
The Secret Sauce:
- Using natural text (not just triples) better matches how models learn facts.
- Splitting recognition (MC) from recall (generation) reveals hidden knowledge.
- Conditioning on encoding isolates retrieval difficulty from missing data.
- Letting models "think" shows how much recall can improve without retraining.
🍞 Bottom Bread (Anchor): In the end, each fact gets a clear label saying whether the book is in the library and how easy it is to grab: straight away, after some thinking, or not at all.
04 Experiments & Results
🍞 Top Bread (Hook): Picture a giant quiz bowl: 2,150 facts, 10 tests per fact, 13 different teams (LLMs), and millions of answers checked by neutral judges.
🥬 Filling (The Test):
- What it is: The authors measured, for each model, (1) how many facts are encoded and (2) how many of those are recalled directly, recalled only with thinking, or not recalled. They also compared direct vs. reverse questions and rare vs. popular facts.
- How it works: 2,150 Wikipedia-derived facts × 10 tasks per fact (2 encoding, 4 knowledge, 4 MC) × 8 samples per task × 13 models → 4.5 million graded responses. Thresholds decide encodes/knows, with careful handling of edge labels.
- Why it matters: This scale gives a trustworthy picture of whether today's strongest models struggle more with storage or with retrieval.
🍞 Bottom Bread (Anchor): Think of grading not just a score, but a detailed report showing which facts were stored, which were easy, which needed thinking, and which stayed hidden.
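A quick back-of-the-envelope check on the response count: the listed factors alone give about 2.24 million, so reaching roughly 4.5 million presumably reflects running each task both with and without thinking; that doubling factor is an assumption here, not something stated in this summary.

```python
facts, tasks_per_fact, samples, models = 2150, 10, 8, 13
one_pass = facts * tasks_per_fact * samples * models
print(one_pass)      # 2,236,000 graded responses for a single pass
# If each task is also run in a thinking-enabled mode, the total doubles:
print(2 * one_pass)  # 4,472,000, i.e. roughly the 4.5 million reported
```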
The Competition and Scoreboard (with meaning):
- Encoding is nearly saturated in frontier models: about 95–98% of facts are encoded. That's like getting an A for "having the books."
- Yet recall without thinking still misses 25–33% of facts, more like a B- for "finding the books on demand."
- Thinking (letting the model write out steps) recovers 40–65% of the facts that were encoded but not initially recalled. This is like using a step-stool to reach high shelves.
Surprising and Insightful Findings (with Sandwich mini-blocks):
- Long-tail (Rare) Facts
- What: Rare facts are almost as encoded as popular ones.
- How: The encoding gap is small (just a few points), but the recall gap is big (often 20+ points).
- Why: Rare facts are stored but harder to fetch unless the model can think more.
- Anchor: A museum guide "knows" about a rare fossil but needs a moment to remember its exact name.
- Reverse Questions and the Reversal Curse
- What: Models do worse on reverse generation (B→A) than direct, even when encoded.
- How: But in multiple-choice verification, reverse isn't harder (and is sometimes easier), showing the association is there.
- Why: This points to recall trouble tied to direction, not missing knowledge.
- Anchor: You can recognize the right face in a lineup even if you couldnât name it from memory.
- Phrasing Sensitivity
- What: Across high-verbatim vs. natural rephrasings, no significant performance difference was found.
- How: Many hypothesis tests with FDR correction saw no phrasing effect.
- Why: In this controlled setup, direction and rarity matter more than wording style.
- Anchor: Whether you ask, "Where was their first gig?" or "Where did they first perform?", the real hurdle is direction or rarity.
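For readers unfamiliar with FDR correction, here is a minimal Benjamini-Hochberg sketch of the kind of multiple-testing control mentioned above. The p-values are made up for illustration; the paper's actual tests are not reproduced here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected with the FDR controlled at alpha."""
    m = len(p_values)
    # Rank p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])

# One p-value per phrasing comparison (illustrative numbers only):
print(benjamini_hochberg([0.001, 0.21, 0.04, 0.62, 0.015]))  # [0, 4]
```

Finding no phrasing effect means that, after this correction, none of the high-verbatim vs. rephrased comparisons were rejected.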
Big Picture:
- Scaling shifts errors from storage to retrieval: as models get bigger, fewer empty shelves but more lost keys relative to remaining errors.
- Thinking helps most where it's hardest: rare facts and reverse questions see the biggest recall gains.
🍞 Bottom Bread (Anchor): For a frontier model, the shelves are almost full, but the filing system is fragile: reverse lookups and niche items need extra effort (thinking) to retrieve consistently.
05 Discussion & Limitations
🍞 Top Bread (Hook): If you want to fix a leaky faucet, you first need to know where the leak is. Is it the water supply (encoding) or the handle mechanism (recall)?
🥬 Filling (Honest assessment):
- Limitations (what this can't do):
- Mostly single-hop facts from English Wikipedia; very time-sensitive or multi-hop logic isn't the focus.
- Behavioral signals can mis-tag a few cases (e.g., rare instances of inference without encoding or grader edge cases).
- Autoraters (LLM graders) are strong but not perfect; the team checked agreement and filtered unclear items, yet small biases can remain.
- Thinking helps but costs compute and latency; deciding when to trigger it is an open challenge.
- Threshold choices and sampling (8 responses) are principled, but different settings could slightly shift counts.
- Required resources:
- Access to LLMs with and without thinking, many forward passes for sampling, and an LLM grader.
- Web search grounding for filtering questions, plus compute to process millions of answers.
- When not to use:
- Live, fast-changing facts (sports scores today) that violate the "not time-sensitive" rule.
- Non-English corpora unless the pipeline is adapted.
- Tasks requiring multi-document synthesis or long reasoning chains (beyond single-hop factual checks).
- Open questions:
- Can we detect, before answering, whether thinking is needed, saving time and cost?
- Can training or post-training reduce direction-based recall asymmetry (reversal curse) without overfitting?
- How do internal representations relate to these behavioral profiles, and can we predict profiles from activations?
- How should RAG and parametric recall coordinate: when to fetch vs. when to think?
- Can similar profiling diagnose long-form generation or multi-step reasoning consistency?
🍞 Bottom Bread (Anchor): Just like a coach uses game tape to plan the next practice, these profiles show whether to train more plays (data) or practice faster recall (thinking/prompting) to win more games (correct facts).
06 Conclusion & Future Work
🍞 Top Bread (Hook): You know how the answer is "on the tip of your tongue," and saying a few related clues suddenly brings it back? That's the heart of this paper.
🥬 Filling (Takeaway):
- 3-Sentence Summary: The authors introduce knowledge profiling, which separates encoding (is the fact stored?) from recall (can it be retrieved across directions and phrasings?). On a new, realistic benchmark (WikiProfile), frontier LLMs encode most facts (≈95–98%), but recall remains the main bottleneck, especially for rare facts and reverse questions. Letting models "think" at inference time rescues a large share (40–65%) of otherwise inaccessible, already-encoded facts.
- Main Achievement: Turning one flat accuracy score into a fact-by-fact map that tells us what to fix (data vs. retrieval/usage), backed by a scalable benchmark drawn from natural text.
- Future Directions: Smarter triggers for when to think; post-training to strengthen reverse recall; data curricula that teach both directions; integrating profiles with RAG to decide when to fetch vs. recall; extending profiling to long-form and multi-hop tasks.
- Why Remember This: Because the shelves aren't empty; the keys are hard to find. Improving how models access what they already know could move the needle more than just making them bigger.
🍞 Bottom Bread (Anchor): Next time a model misses a fact, don't assume it never learned it; ask whether a nudge to think, or a better way to ask, might unlock what's already inside.
Practical Applications
- Diagnose model errors: determine whether a wrong answer came from missing knowledge or recall failure and choose the right fix.
- Adaptive prompting: trigger chain-of-thought only when a profile predicts recall difficulty (e.g., reverse or long-tail queries).
- Curriculum design: add reverse-direction questions during training/alignment to reduce direction asymmetry.
- Evaluation upgrades: report knowledge profiles alongside accuracy to guide product and research decisions.
- Data selection: prioritize collecting or augmenting data for facts that show recall fragility rather than simply scaling all data.
- RAG coordination: use profiles to decide when to retrieve from external sources vs. rely on parametric recall.
- Guardrails and fallback: if reverse recall is weak, switch to multiple-choice style or verification-first strategies.
- Personalized tutoring: teach students (and models) to use "thinking steps" when a concept is on the tip of the tongue.
- Search interfaces: for rare facts, present recognition-style suggestions (choices) to boost success.
- Model selection: pick models not only by accuracy but by recall robustness for your domain's rare or reverse queries.
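The adaptive-prompting idea above could look something like the router below. This is purely a sketch: the feature names (`is_reverse`, `is_long_tail`) are hypothetical signals a real system would have to estimate, e.g. via a classifier or an entity-popularity lookup.

```python
def should_think(features):
    """Heuristic trigger: enable inference-time thinking when recall is
    likely fragile (reverse-direction or long-tail queries)."""
    return features.get("is_reverse", False) or features.get("is_long_tail", False)

def answer(query, features, fast_model, thinking_model):
    # Route to the slower thinking mode only when the heuristic fires.
    model = thinking_model if should_think(features) else fast_model
    return model(query)

fast = lambda q: f"fast:{q}"
slow = lambda q: f"thinking:{q}"
print(answer("Capital of France?", {}, fast, slow))  # fast path
print(answer("Which band played at the Boardwalk?", {"is_reverse": True}, fast, slow))  # thinking path
```

The design trade-off is the one the paper highlights: thinking recovers many encoded-but-unrecalled facts, but it costs latency and compute, so it should fire only where recall is predicted to be fragile.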