Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
Key Summary
- Not all wrong answers from large language models (LLMs) mean they never learned the fact; many times the model knows it but can't pull it out on demand.
- This paper separates two very different problems: empty shelves (the model never stored the fact) vs. lost keys (the fact is stored but hard to find).
- They introduce "knowledge profiling," which labels each fact by whether it's encoded and how easily it can be recalled (directly, only with thinking, or not at all).
- A new benchmark, WikiProfile, builds 10 tests per fact from real Wikipedia text, so we can check encoding and recall in realistic settings without needing model internals.
- Frontier models encode about 95–98% of the tested facts, but still fail to recall 25–33% of them without thinking; recall, not encoding, is the main bottleneck.
- Models especially struggle with rare (long-tail) facts and reverse questions (B is A) even when those facts are encoded.
- Multiple-choice "verification" shows models can often recognize the right answer for reverse questions that they can't freely recall in generation, pointing to a recall issue, not missing knowledge.
- Letting models "think" (do inference-time computation) recovers 40–65% of encoded-but-not-recalled facts, most strongly for rare facts and reverse questions.
- The takeaway: future gains may come less from scaling up training and more from improving how models access and use the knowledge they already store.
Why This Research Matters
If most facts are already stored in strong models, we can get big gains by improving how models access that knowledge instead of only making them bigger. This helps assistants answer more reliably, especially for niche questions and tricky reverse lookups people naturally ask. It also cuts costs by deciding when to use slower "thinking" modes only when needed. In safety-critical domains (like medicine or law), distinguishing missing knowledge from recall failure guides the right fix: better training vs. better prompting and reasoning policies. Finally, fairer performance across popular and rare facts means broader, more inclusive AI help for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how sometimes you study a lot for a quiz, but when a tricky question is asked in a different way, your mind goes blank, even though you learned it? That doesn't mean you never learned it; it just means recall is hard under pressure.
🥬 Filling (The Actual Concept):
- What it is: This paper asks a simple question with big consequences: when an AI gets a fact wrong, is it because the fact was never learned (empty shelves) or because it can't find the fact at the moment (lost keys)?
- How it works: For years, we mostly judged AI factuality with a single score, accuracy, treating all mistakes the same. The authors argue we should judge at the level of each fact and separate two steps: (1) encoding (did the model store the fact during training?), and (2) recall (can the model pull that fact out when asked in different ways?). They build a new benchmark that tests both.
- Why it matters: If shelves are empty, the fix is more or better training data. If keys are lost, the fix is better prompts, better use of model reasoning at inference time, or post-training techniques, not just more data.
🍞 Bottom Bread (Anchor): Imagine asking, "Which famous band played their first gig at the Boardwalk club?" If the AI fails, is it because it never learned "Oasis" (empty shelves)? Or does it know "Oasis played their first gig at the Boardwalk" but can't reverse the relation to answer who (lost keys)?
🍞 Top Bread (Hook): Think of a spelling bee. Even great spellers sometimes freeze when a word is asked backward or in a strange sentence.
🥬 Filling:
- What it is: Before this work, most evaluations tallied right/wrong answers without explaining why errors happened.
- How it works: The field tried many things (making models bigger, adding knowledge graphs, editing facts inside models, and probing model internals), but these didn't clearly tell us if errors were from missing knowledge or retrieval trouble.
- Why it matters: Without knowing the cause, we might waste effort: scaling a model to "store more" won't fix a recall glitch.
🍞 Bottom Bread (Anchor): If a student forgets your name mid-conversation, buying them a bigger backpack (more capacity) won't help. Teaching them a trick to recall names might.
🍞 Top Bread (Hook): Imagine a library. It's not enough for the book to exist; you have to find it quickly, even if the request is phrased oddly.
🥬 Filling (New Concepts, in Sandwich form):
- Factuality Evaluations
- What: Checks of whether an AI's answers are true or false.
- How: Traditionally, ask a question and mark the answer right/wrong.
- Why: Alone, this hides whether the book isn't in the library (not encoded) or is just mis-shelved (recall failure).
- Anchor: Scoring a trivia team by "how many right" won't tell you if they never learned geography or just blanked under pressure.
- Encoding Failures (Empty Shelves)
- What: The model didn't store the fact during training.
- How: Maybe the data never had it, or the model lacked capacity.
- Why: If true, you need better data or a bigger model.
- Anchor: If your notes never included "Capital of France," you can't recall "Paris."
- Recall Failures (Lost Keys)
- What: The model stored the fact but can't retrieve it in a new phrasing or reversed direction.
- How: The question's wording or direction differs from how it was learned.
- Why: Fixes involve better prompts, post-training, or inference-time thinking.
- Anchor: You know a classmate's name but can't say it until someone gives the first letter.
- Long-tail Knowledge
- What: Rare or niche facts.
- How: These appear less during training; models encode many of them but recall is fragile.
- Why: Without targeting recall, assistants will keep missing important niche facts.
- Anchor: A zoo guide can name common animals fast, but might hesitate on a rare lizard's species.
- Reverse Questions
- What: Asking the relation backward (B→A) instead of forward (A→B).
- How: Models often learn "A is related to B" but struggle to answer "Which A goes with B?"
- Why: Highlights recall trouble tied to how facts were read in training.
- Anchor: You can answer "Oasis → Boardwalk club," but stumble on "Boardwalk club → who?"
🍞 Bottom Bread (Anchor): The paper fills the gap by profiling facts from real Wikipedia text to see whether each one is stored and how accessible it is, with and without letting the model "think."
02 Core Idea
🍞 Top Bread (Hook): Imagine sorting every item in your backpack into five bins: not packed, easy to grab, only grab if you stop and think, guessed from other items, or still can't find. That's a lot clearer than just saying "I found it" or "I didn't."
🥬 Filling (Aha! Moment):
- What it is: The key insight is to shift from grading questions to profiling facts: first check if a fact is encoded in the model, then measure how easily it can be recalled, in direct form, reversed form, and with or without letting the model think.
- How it works (recipe):
- Extract a fact from real Wikipedia text as a subject-object proposition.
- Test encoding by recreating a pre-training-like context and asking the model to complete or answer within that context (no thinking allowed).
- Test recall by asking direct and reverse questions with different phrasings (with and without thinking).
- Place the fact into one of five knowledge profiles based on results: Encoding Failure, Direct Recall, Recall with Thinking, Recall Failure, or Inference without Encoding.
- Why it matters: Now we can tell if errors need more training data (empty shelves) or better retrieval/usage (lost keys). This changes how we improve models.
🍞 Bottom Bread (Anchor): For the Oasis example, the method can tell whether "Oasis → Boardwalk club" is stored and whether the model can answer both directions, with or without extra thinking steps.
🍞 Top Bread (Hook): Three analogies to see the same idea:
- Library and Keys: Books (facts) can be there but hard to find; thinking is like asking a helpful librarian to search deeper.
- Detective Work: Solving a case depends on both having the clues (encoded) and noticing which clue matters now (recall).
- School Quiz: You might recognize the right answer on a multiple-choice test (verification) even if you couldn't recall it freely (generation).
🥬 Filling:
- Before vs. After:
- Before: One accuracy score; wrong answers looked the same, so we kept scaling models.
- After: We see that frontier models already encode most facts, but recall is the choke point, especially for rare facts and reversals. Solutions shift toward better retrieval and inference-time strategies.
- Why it works (intuition):
- Encoding test "primes" the model with the same kind of text it saw during training. If it can fill the fact there, the fact is likely stored.
- Knowledge test requires robustness: the model should answer across phrasings and directions. If it fails here, recall (not storage) is the issue.
- Multiple-choice vs. generation separates recognition from recall: recognizing the right choice is easier than pulling the answer out unprompted.
- Building blocks (with Sandwich mini-blocks):
- Facts as Propositions
- What: A subject-object link taken from natural text.
- How: Use the order in the source sentence to define direction.
- Why: Lets us test both direct and reverse questions cleanly.
- Anchor: "Oasis (subject) → Boardwalk club (object)."
- Direct vs. Reverse Questions
- What: Direct asks for the object; reverse asks for the subject.
- How: Create two phrasings for each to test robustness to wording.
- Why: Reveals direction-sensitive recall issues (reversal curse).
- Anchor: "Where did Oasis play first?" vs. "Who played first at the Boardwalk?"
- Thinking (Inference-time Computation)
- What: Let the model write intermediate steps before the final answer.
- How: Use chain-of-thought or reasoning-optimized modes.
- Why: Often unlocks stored but hard-to-reach facts.
- Anchor: Like pausing to list clues before naming the suspect.
- Five Knowledge Profiles
- What: The categories that describe each fact's status.
- How: Combine encoding (yes/no) with recall (direct/with thinking/never) to sort.
- Why: Pinpoints what to fix: data vs. retrieval vs. reasoning policy.
- Anchor: A shelf label that says "here but needs a step stool" vs. "not in stock."
🍞 Bottom Bread (Anchor): After profiling, the model's mistakes stop being a mystery. We can see, for each fact, whether we need more books (training) or a better way to find them (recall/think).
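To make the proposition idea concrete, here is a minimal sketch of how a direct and a reverse question could be built from one subject-object fact. The `Proposition` class and the question templates are illustrative assumptions, not the paper's actual question-generation prompts.

```python
from dataclasses import dataclass

@dataclass
class Proposition:
    subject: str   # entity appearing first in the source sentence
    relation: str  # phrase linking subject to object
    obj: str       # entity appearing later in the source sentence

def direct_question(p: Proposition) -> str:
    # Direct (A -> B): the subject is given; the object is asked for.
    return f"{p.subject} {p.relation} which entity?"

def reverse_question(p: Proposition) -> str:
    # Reverse (B -> A): the object is given; the subject is asked for.
    return f"Which entity {p.relation} {p.obj}?"

oasis = Proposition("Oasis", "played their first gig at", "the Boardwalk club")
print(direct_question(oasis))   # Oasis played their first gig at which entity?
print(reverse_question(oasis))  # Which entity played their first gig at the Boardwalk club?
```

The reversal curse is exactly the observation that models answer the first template far more reliably than the second, even for the same stored fact.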
03 Methodology
🍞 Top Bread (Hook): Imagine building a science fair project that tests whether classmates remember facts from a textbook and whether they can still answer if you reword the question or flip it around. You'd need a careful plan.
🥬 Filling (High-level overview):
- What it is: A recipe to test, for each fact, (1) if it's encoded and (2) how easily it's recalled.
- How it works (pipeline): Input (Wikipedia page) → Fact extraction (pick a subject-object fact) → Create 10 tasks (2 for encoding, 4 for knowledge, 4 multiple-choice versions) → Generate 8 answers per task per model → Grade answers with an LLM grader → Assign each fact to a knowledge profile.
- Why it matters: This turns a flat right/wrong score into a detailed map of where knowledge lives and how accessible it is.
🍞 Bottom Bread (Anchor): For the Oasis example, we first test if the model can fill in "Oasis played their first gig at ___" in the original paragraph, then ask both direct and reverse questions in natural ways and multiple-choice forms.
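The pipeline can be sketched as plain control flow. Everything below is a stub (real task construction, sampling, and LLM grading are far more involved); the helper names and the 0.5 threshold are assumptions drawn from the description above.

```python
def make_tasks(fact):
    """Build the 10 tasks per fact: 2 encoding, 4 open-ended knowledge, 4 MC."""
    return {
        "encoding": [("completion", fact), ("contextual", fact)],
        "knowledge": [("direct", fact, i) for i in range(2)]
                   + [("reverse", fact, i) for i in range(2)],
        "mc": [("mc", fact, i) for i in range(4)],
    }

def accuracy(task, model, grader, n_samples=8):
    """Sample 8 answers at temperature 1 and grade each; return mean accuracy."""
    answers = [model(task) for _ in range(n_samples)]
    return sum(grader(task, a) for a in answers) / n_samples

def profile_fact(fact, model, grader, threshold=0.5):
    tasks = make_tasks(fact)
    # Existential test: the fact "encodes" if ANY encoding task clears the bar.
    encoded = any(accuracy(t, model, grader) >= threshold
                  for t in tasks["encoding"])
    # Universal test: the model "knows" only if ALL knowledge questions clear it.
    knows = all(accuracy(t, model, grader) >= threshold
                for t in tasks["knowledge"])
    return encoded, knows
```

With a real model and grader plugged in, `profile_fact` would run twice, once with thinking disabled and once enabled, and the two results combined into the final profile.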
Detailed steps (each with Sandwich pattern):
- Fact Extraction from Natural Text
- What: Pull a single, unambiguous fact from a real Wikipedia paragraph.
- How: Do named-entity recognition, pick an object entity that makes a specific, non-trivial completion (not time-sensitive), and ensure the left context leads to exactly one correct answer. Balance topics and entity types.
- Why: If facts are vague or have multiple answers, we can't tell if the model failed or the question was unfair.
- Anchor: From "Love and Money are a Scottish band … formed by three former members of Friends Again … along with bassist Bobby Paterson," extract the fact linking subject (the band creators) to object (Bobby Paterson).
- Encoding Tests (No Thinking Allowed)
- What: Two tasks that mimic training conditions to see if the model stored the fact.
- How:
- Proposition Completion: Give the left context and ask the model to complete the sentence that would include the object.
- Contextual Question: Ask a high-verbatim (close-to-text) question appended to the same context.
- Why: Completion alone can be ambiguous for chat-tuned models; the contextual question clarifies the target. Excluding thinking avoids confusing storage with on-the-fly inference.
- Anchor: "Oasis played their first gig on 14 August 1991 at ____" and "Using the above paragraph: Where did Oasis play their first gig?"
- Knowledge Tests (With and Without Thinking)
- What: Four closed-book questions (two direct, two reverse), each with different phrasings; plus four multiple-choice versions.
- How: Ask minimal but unambiguous questions; then create natural rephrasings; then make multiple-choice formats with realistic distractors of the same entity type.
- Why: Tests whether recall is robust to wording and direction; multiple-choice separates recognition from free recall.
- Anchor: Direct: "Where did Oasis play their first gig?" Reverse: "Which band played their first gig at the Boardwalk club?" and their natural rephrasings; MC versions list four plausible bands/venues.
- Refinement and Filtering with Web Search
- What: Automatic quality control to keep only precise, unique-answer questions.
- How: A prompted LLM checks for specificity and minimalism; then a search-grounded LLM filters out any question with multiple possible answers or unclear wording.
- Why: Prevents garbage-in, garbage-out; unclear items would blur the difference between encoding and recall.
- Anchor: If "Which film premiered at the 18th Rome Film Festival?" could refer to several films, add minimal context like the director's name.
- Response Generation and Grading
- What: For each task, sample 8 answers per model at temperature 1; grade with an LLM autorater into CORRECT/INCORRECT (and rare PARTIALLY/OTHER, which are handled carefully).
- How: Compute per-question accuracy over gradable responses; define encodes if any encoding task for the fact passes a threshold; define knows if all knowledge questions for that fact pass.
- Why: Multiple samples reduce randomness; existential vs. universal tests cleanly separate storage from robust recall.
- Anchor: A fact "encodes" if at least one encoding test clears 50%; it "knows" if all four knowledge questions clear 50% (with or without thinking, depending on the profile being measured).
- Knowledge Profiles (Five Bins)
- What: For each fact, decide if it's (a) Encoding Failure, (b) Direct Recall, (c) Recall with Thinking, (d) Recall Failure, or (e) Inference without Encoding.
- How: Combine encoding result (yes/no) with knowledge results (without and with thinking) to sort.
- Why: Each bin points to a different fix: more data, better prompts/post-training, use of thinking, or being cautious about guesses.
- Anchor: If the model encodes "Oasis → Boardwalk" but fails direct questions and even with thinking, it's a Recall Failure; if thinking rescues it, it's Recall with Thinking.
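The threshold rules and the five bins can be written out as small functions. This is a sketch of the decision logic as described above; the exact handling of Inference without Encoding is an assumption.

```python
def encodes(encoding_accuracies, threshold=0.5):
    # Existential: at least one encoding task clears the threshold.
    return any(a >= threshold for a in encoding_accuracies)

def knows(knowledge_accuracies, threshold=0.5):
    # Universal: every knowledge question clears the threshold.
    return all(a >= threshold for a in knowledge_accuracies)

def knowledge_profile(encoded, knows_fast, knows_thinking):
    """Sort one fact into one of the five knowledge profiles."""
    if not encoded:
        # Correct answers without storage suggest on-the-fly inference/guessing.
        if knows_fast or knows_thinking:
            return "Inference without Encoding"
        return "Encoding Failure"
    if knows_fast:
        return "Direct Recall"
    if knows_thinking:
        return "Recall with Thinking"
    return "Recall Failure"
```

For the Oasis example above, `knowledge_profile(True, False, False)` gives "Recall Failure" and `knowledge_profile(True, False, True)` gives "Recall with Thinking."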
The Secret Sauce:
- Using natural text (not just triples) better matches how models learn facts.
- Splitting recognition (MC) from recall (generation) reveals hidden knowledge.
- Conditioning on encoding isolates retrieval difficulty from missing data.
- Letting models "think" shows how much recall can improve without retraining.
🍞 Bottom Bread (Anchor): In the end, each fact gets a clear label saying whether the book is in the library and how easy it is to grab: straight away, after some thinking, or not at all.
04 Experiments & Results
🍞 Top Bread (Hook): Picture a giant quiz bowl: 2,150 facts, 10 tests per fact, 13 different teams (LLMs), and millions of answers checked by neutral judges.
🥬 Filling (The Test):
- What it is: The authors measured, for each model, (1) how many facts are encoded and (2) how many of those are recalled directly, recalled only with thinking, or not recalled. They also compared direct vs. reverse questions and rare vs. popular facts.
- How it works: 2,150 Wikipedia-derived facts × 10 tasks per fact (2 encoding, 4 knowledge, 4 MC) × 8 samples per task × 13 models → 4.5 million graded responses. Thresholds decide encodes/knows, with careful handling of edge labels.
- Why it matters: This scale gives a trustworthy picture of whether today's strongest models struggle more with storage or with retrieval.
🍞 Bottom Bread (Anchor): Think of grading not just a score, but a detailed report showing which facts were stored, which were easy, which needed thinking, and which stayed hidden.
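A quick back-of-the-envelope check on the response count: the listed factors alone give about 2.24 million, so reaching roughly 4.5 million presumably reflects running each task both with and without thinking; that doubling factor is an assumption here, not something stated in this summary.

```python
facts, tasks_per_fact, samples, models = 2150, 10, 8, 13
one_pass = facts * tasks_per_fact * samples * models
print(one_pass)      # 2,236,000 graded responses for a single pass
# If each task is also run in a thinking-enabled mode, the total doubles:
print(2 * one_pass)  # 4,472,000, i.e. roughly the 4.5 million reported
```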
The Competition and Scoreboard (with meaning):
- Encoding is nearly saturated in frontier models: about 95–98% of facts are encoded. That's like getting an A for "having the books."
- Yet recall without thinking still misses 25–33% of facts, more like a B- for "finding the books on demand."
- Thinking (letting the model write out steps) recovers 40–65% of the facts that were encoded but not initially recalled. This is like using a step-stool to reach high shelves.
Surprising and Insightful Findings (with Sandwich mini-blocks):
- Long-tail (Rare) Facts
- What: Rare facts are almost as encoded as popular ones.
- How: The encoding gap is small (just a few points), but the recall gap is big (often 20+ points).
- Why: Rare facts are stored but harder to fetch unless the model can think more.
- Anchor: A museum guide "knows" about a rare fossil but needs a moment to remember its exact name.
- Reverse Questions and the Reversal Curse
- What: Models do worse on reverse generation (B→A) than direct, even when encoded.
- How: But in multiple-choice verification, reverse isn't harder (and is sometimes easier), showing the association is there.
- Why: This points to recall trouble tied to direction, not missing knowledge.
- Anchor: You can recognize the right face in a lineup even if you couldnât name it from memory.
- Phrasing Sensitivity
- What: Across high-verbatim vs. natural rephrasings, no significant performance difference was found.
- How: Many hypothesis tests with FDR correction saw no phrasing effect.
- Why: In this controlled setup, direction and rarity matter more than wording style.
- Anchor: Whether you ask, "Where was their first gig?" or "Where did they first perform?", the real hurdle is direction or rarity.
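For readers unfamiliar with FDR correction, here is a minimal Benjamini-Hochberg sketch of the kind of multiple-testing control mentioned above. The p-values are made up for illustration; the paper's actual tests are not reproduced here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected with the FDR controlled at alpha."""
    m = len(p_values)
    # Rank p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])

# One p-value per phrasing comparison (illustrative numbers only):
print(benjamini_hochberg([0.001, 0.21, 0.04, 0.62, 0.015]))  # [0, 4]
```

Finding no phrasing effect means that, after this correction, none of the high-verbatim vs. rephrased comparisons were rejected.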
Big Picture:
- Scaling shifts errors from storage to retrieval: as models get bigger, fewer empty shelves but more lost keys relative to remaining errors.
- Thinking helps most where it's hardest: rare facts and reverse questions see the biggest recall gains.
🍞 Bottom Bread (Anchor): For a frontier model, the shelves are almost full, but the filing system is fragile: reverse lookups and niche items need extra effort (thinking) to retrieve consistently.
05 Discussion & Limitations
🍞 Top Bread (Hook): If you want to fix a leaky faucet, you first need to know where the leak is. Is it the water supply (encoding) or the handle mechanism (recall)?
🥬 Filling (Honest assessment):
- Limitations (what this can't do):
- Mostly single-hop facts from English Wikipedia; very time-sensitive or multi-hop logic isn't the focus.
- Behavioral signals can mis-tag a few cases (e.g., rare instances of inference without encoding or grader edge cases).
- Autoraters (LLM graders) are strong but not perfect; the team checked agreement and filtered unclear items, yet small biases can remain.
- Thinking helps but costs compute and latency; deciding when to trigger it is an open challenge.
- Threshold choices and sampling (8 responses) are principled, but different settings could slightly shift counts.
- Required resources:
- Access to LLMs with and without thinking, many forward passes for sampling, and an LLM grader.
- Web search grounding for filtering questions, plus compute to process millions of answers.
- When not to use:
- Live, fast-changing facts (sports scores today) that violate the "not time-sensitive" rule.
- Non-English corpora unless the pipeline is adapted.
- Tasks requiring multi-document synthesis or long reasoning chains (beyond single-hop factual checks).
- Open questions:
- Can we detect, before answering, whether thinking is needed, saving time and cost?
- Can training or post-training reduce direction-based recall asymmetry (reversal curse) without overfitting?
- How do internal representations relate to these behavioral profiles, and can we predict profiles from activations?
- How should RAG and parametric recall coordinate: when to fetch vs. when to think?
- Can similar profiling diagnose long-form generation or multi-step reasoning consistency?
🍞 Bottom Bread (Anchor): Just like a coach uses game tape to plan the next practice, these profiles show whether to train more plays (data) or practice faster recall (thinking/prompting) to win more games (correct facts).
06 Conclusion & Future Work
🍞 Top Bread (Hook): You know how the answer is "on the tip of your tongue," and saying a few related clues suddenly brings it back? That's the heart of this paper.
🥬 Filling (Takeaway):
- 3-Sentence Summary: The authors introduce knowledge profiling, which separates encoding (is the fact stored?) from recall (can it be retrieved across directions and phrasings?). On a new, realistic benchmark (WikiProfile), frontier LLMs encode most facts (≈95–98%), but recall remains the main bottleneck, especially for rare facts and reverse questions. Letting models "think" at inference time rescues a large share (40–65%) of otherwise inaccessible, already-encoded facts.
- Main Achievement: Turning one flat accuracy score into a fact-by-fact map that tells us what to fix (data vs. retrieval/usage), backed by a scalable benchmark drawn from natural text.
- Future Directions: Smarter triggers for when to think; post-training to strengthen reverse recall; data curricula that teach both directions; integrating profiles with RAG to decide when to fetch vs. recall; extending profiling to long-form and multi-hop tasks.
- Why Remember This: Because the shelves aren't empty; the keys are hard to find. Improving how models access what they already know could move the needle more than just making them bigger.
🍞 Bottom Bread (Anchor): Next time a model misses a fact, don't assume it never learned it; ask whether a nudge to think, or a better way to ask, might unlock what's already inside.
Practical Applications
- Diagnose model errors: determine whether a wrong answer came from missing knowledge or recall failure and choose the right fix.
- Adaptive prompting: trigger chain-of-thought only when a profile predicts recall difficulty (e.g., reverse or long-tail queries).
- Curriculum design: add reverse-direction questions during training/alignment to reduce direction asymmetry.
- Evaluation upgrades: report knowledge profiles alongside accuracy to guide product and research decisions.
- Data selection: prioritize collecting or augmenting data for facts that show recall fragility rather than simply scaling all data.
- RAG coordination: use profiles to decide when to retrieve from external sources vs. rely on parametric recall.
- Guardrails and fallback: if reverse recall is weak, switch to multiple-choice style or verification-first strategies.
- Personalized tutoring: teach students (and models) to use "thinking steps" when a concept is on the tip of the tongue.
- Search interfaces: for rare facts, present recognition-style suggestions (choices) to boost success.
- Model selection: pick models not only by accuracy but by recall robustness for your domain's rare or reverse queries.
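The adaptive-prompting idea above could look something like the router below. This is purely a sketch: the feature names (`is_reverse`, `is_long_tail`) are hypothetical signals a real system would have to estimate, e.g. via a classifier or an entity-popularity lookup.

```python
def should_think(features):
    """Heuristic trigger: enable inference-time thinking when recall is
    likely fragile (reverse-direction or long-tail queries)."""
    return features.get("is_reverse", False) or features.get("is_long_tail", False)

def answer(query, features, fast_model, thinking_model):
    # Route to the slower thinking mode only when the heuristic fires.
    model = thinking_model if should_think(features) else fast_model
    return model(query)

fast = lambda q: f"fast:{q}"
slow = lambda q: f"thinking:{q}"
print(answer("Capital of France?", {}, fast, slow))  # fast path
print(answer("Which band played at the Boardwalk?", {"is_reverse": True}, fast, slow))  # thinking path
```

The design trade-off is the one the paper highlights: thinking recovers many encoded-but-unrecalled facts, but it costs latency and compute, so it should fire only where recall is predicted to be fragile.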