NanoKnow: How to Know What Your Language Model Knows
Key Summary
- NanoKnow is a new benchmark that checks whether a language model's answers come from what it saw during training or from extra text we give it at question time.
- It splits questions into two groups: supported (the answer exists in the model's training data) and unsupported (it doesn't), using a careful three-step search-and-verify pipeline.
- The more often a fact appeared in the training data, the better the model remembered it in closed-book mode (no extra context provided).
- Giving the model the right passage (open-book/RAG) helps reduce its dependence on memorizing, but facts seen during training still make answers more accurate.
- Parametric knowledge (what's inside the model) and external knowledge (what you feed it) work best together; they are complementary.
- Distractor passages (irrelevant text) hurt accuracy, especially when the correct answer is buried in the middle of other texts (the "lost in the middle" effect).
- Smaller models benefit more from being given the right context, while bigger models remember more on their own.
- NanoKnow uses open, fully traceable training data (FineWeb-Edu) and open nanochat models, so researchers can precisely track where knowledge comes from.
- The dataset and tools are publicly released, enabling fair, controlled studies of how pre-training data shapes what LLMs can and can't do.
Why This Research Matters
NanoKnow makes AI models more trustworthy by showing exactly when they are recalling from memory versus reading from added context. This helps builders design better systems that use retrieval wisely, place key evidence where the model will actually use it, and avoid harmful distractors. Educators, fact-checkers, and app developers can identify gaps in a model's knowledge and decide when to add authoritative sources. Because the data and models are open, anyone can reproduce results and run fair comparisons. Over time, this transparency supports safer, more reliable AI that explains what it knows and why. It also guides better data curation, focusing training on underrepresented but important facts.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how some kids remember every fact from a book they read, while others do better when they can look things up during a quiz? Computers that talk, like chatbots, are a bit like those kids. They learn from lots of text, and sometimes they remember; other times they need reminders.
🥬 Filling (Concept 1: Large Language Model, LLM)
- What it is: An LLM is a computer program that learns to predict and generate words by reading huge amounts of text.
- How it works:
- Read tons of sentences from books and websites.
- Learn patterns about which words come next.
- Use those patterns to answer questions and write text.
- Why it matters: Without LLMs, we wouldn't have helpful chatbots or smart writing tools.
🍞 Bottom Bread (Anchor): When you ask a chatbot, "What's the capital of France?", it uses what it learned to say "Paris."
🥬 Filling (Concept 2: Pre-training Data)
- What it is: Pre-training data is the giant pile of text the model reads to learn language before you ever chat with it.
- How it works:
- Gather lots of text (like an online library).
- Train the model to guess missing words.
- The model stores patterns from this reading inside its parameters (its memory).
- Why it matters: If the data doesn't have a fact, the model probably won't know it.
🍞 Bottom Bread (Anchor): If the model never sees "The aorta is a main artery," it may not know that answer later.
🥬 Filling (Concept 3: Parametric Knowledge)
- What it is: Parametric knowledge is the stuff the model has absorbed into its internal weights from pre-training.
- How it works:
- See facts many times.
- Internalize patterns and connections.
- Recall some of them when asked.
- Why it matters: Without this, the model would have no built-in memory and would be helpless without notes.
🍞 Bottom Bread (Anchor): If the model saw "December 1972" often as the last time humans walked on the Moon, it can answer that from memory.
🥬 Filling (Concept 4: Benchmark Dataset)
- What it is: A benchmark is a fair test set to see how good a model is at a task.
- How it works:
- Collect many questions.
- Define correct answers.
- Measure how often the model is right.
- Why it matters: Without a benchmark, we can't compare models or track progress.
🍞 Bottom Bread (Anchor): Like a school quiz with an answer key that lets teachers grade everyone the same way.
🥬 Filling (Concept 5: Natural Questions, NQ)
- What it is: NQ is a set of real questions people typed into Google.
- How it works:
- Use real-world queries.
- Match them with short factual answers.
- Check if models answer correctly from general knowledge.
- Why it matters: It tests how well models handle everyday questions.
🍞 Bottom Bread (Anchor): "When was the last time anyone was on the moon?" is a typical NQ-style question.
🥬 Filling (Concept 6: SQuAD)
- What it is: SQuAD is a set of questions where answers are exact text spans from Wikipedia.
- How it works:
- Provide a short article.
- Ask a question about it.
- The answer is a sentence or phrase in that article.
- Why it matters: It tests careful reading and pinpointing exact answers.
🍞 Bottom Bread (Anchor): Given a paragraph about the heart, the answer to "What is the main artery?" is "the aorta," found in that text.
Before this paper, researchers had a big problem: most models' pre-training data was a black box, unknown or inaccessible. That made it hard to tell whether a model's answer came from memory (parametric knowledge) or from hints added to the prompt (external context). People tried to guess by running closed-book questions (no extra text) and open-book questions (with extra text), but it was impossible to be sure what the model had truly seen during training.
🥬 Filling (Concept 7: Open-Book QA / Retrieval-Augmented Generation, RAG)
- What it is: Open-book QA (often done via RAG) gives the model helpful passages at answer time.
- How it works:
- Retrieve likely relevant documents.
- Paste them into the prompt with the question.
- Let the model read and answer using this evidence.
- Why it matters: Without it, the model might guess or rely on shaky memory.
🍞 Bottom Bread (Anchor): Like letting a student use a textbook to answer a quiz question.
The gap: we needed a way to prove whether an answer existed in the model's training data and how often it appeared there. Enter FineWeb-Edu, an open, educational web corpus, and nanochat, small LLMs trained only on it. This transparency created a rare chance to trace knowledge, given the right tool.
🥬 Filling (Concept 8: FineWeb-Edu Corpus)
- What it is: A huge, open collection of educational web pages used to train models like nanochat.
- How it works:
- Gather and clean educational pages.
- Split them into shards (files) for easy access.
- Use them to pre-train models.
- Why it matters: Because it's open, we can check exactly what the model saw.
🍞 Bottom Bread (Anchor): If a Wikipedia-like page in FineWeb-Edu says "Capillary action draws kerosene up a wick," we can point to that exact place.
🥬 Filling (Concept 9: nanochat Models)
- What it is: A family of small, open LLMs trained only on FineWeb-Edu.
- How it works:
- Train on FineWeb-Edu.
- Release the checkpoints and data.
- Test how well they answer questions.
- Why it matters: Open models plus open data let us truly trace knowledge.
🍞 Bottom Bread (Anchor): If nanochat answers a SQuAD question, we can check whether its training set contained that exact answer.
Real stakes: Knowing where a model's knowledge comes from helps us trust it more, fix gaps faster, and design better systems (like RAG) that don't get tricked by distractors. It's like labeling what's from the student's brain and what's from their notes, so teachers and students (and engineers and users) can make smarter choices.
02 Core Idea
🍞 Top Bread (Hook): Imagine sorting your school questions into two piles: ones that were definitely covered in last night's reading, and ones that weren't. Now you can test whether you answered from memory or from looking things up.
The "Aha!" in one sentence: If we can tell which questions' answers exist in the model's training data (and how often), we can cleanly separate what the model knows from memory versus what it learns from added evidence.
🥬 Filling (Concept 10: Supported vs. Unsupported Splits)
- What it is: Supported questions have answers present in the training corpus; unsupported ones donât.
- How it works:
- Search the training corpus for documents that might contain the answer.
- Check for exact answer strings.
- Use an LLM to verify the context truly answers the question.
- Why it matters: Without this split, we can't fairly judge memory versus help-from-text.
🍞 Bottom Bread (Anchor): "When was the last time anyone was on the moon?" is supported if the corpus has a passage that says "No one has walked on the Moon since December 1972."
Three analogies for the main idea:
- Library vs. Brain: Your brain (parametric knowledge) remembers some facts from class. The library (external context) has many more. NanoKnow labels which quiz questions were in your textbook (supported) versus not (unsupported), so we can see when you're using memory or the library.
- Recipe vs. Pantry: A chef remembers common recipes (supported) but sometimes needs to check a cookbook (external context). NanoKnow tells us which dishes the chef already memorized.
- Map vs. Compass: A hiker memorizes well-traveled trails (supported) but needs a map for new paths (unsupported). NanoKnow marks which answers are well-traveled.
Before vs. After:
- Before: We guessed whether models were memorizing or relying on prompts; training data was a black box.
- After: With open data and NanoKnow's supported/unsupported labels, we can directly test memory, the effect of extra evidence, and how often-seen facts boost accuracy.
Why it works (intuition, no equations):
- Memory strengthens with repetition: answers seen many times in training become easier to recall (higher closed-book accuracy).
- Evidence helps: providing the right passage reduces the need for memory, lifting performance especially on rare facts.
- Complementarity: having both memory and evidence is best; prior knowledge guides attention and interpretation, even when text is provided.
- Noise hurts: irrelevant text steals attention and pushes the right answer out of focus, especially if it's placed away from the start (lost-in-the-middle).
🥬 Filling (Concept 11: Answer Frequency)
- What it is: How many different training documents (verified) contain the correct answer.
- How it works:
- Count LLM-verified matches across the corpus.
- Group into buckets (rare, low, medium, high).
- Compare accuracy across buckets.
- Why it matters: Frequency predicts how well the model remembers answers in closed-book mode.
🍞 Bottom Bread (Anchor): If "aorta" appears in lots of FineWeb-Edu docs, models recall it more easily than a super-rare fact.
Building blocks of the idea:
- Open corpus (FineWeb-Edu) and open models (nanochat) give full traceability.
- A robust pipeline (search, string-match, LLM-verify) finds true answer contexts.
- Labeled splits (supported/unsupported) enable clean experiments.
- Multiple prompting setups (closed-book, with FineWeb context, with original SQuAD context) show how memory and evidence interact.
- Distractor tests reveal how irrelevant info harms reasoning and recall.
Put simply, NanoKnow is the sorting hat for model knowledge: which answers are in the modelâs memory, which need outside help, and how repetition and noise change the outcome.
03 Methodology
At a high level: Question → [Search training data] → [String-match answer] → [LLM verifies context] → Supported or Unsupported label, plus frequency counts → Use labels to run fair experiments.
🥬 Filling (Concept 12: Search Index)
- What it is: A special data structure that lets us quickly find documents related to a query.
- How it works:
- Break all documents into terms.
- Build an index so term lookups are fast.
- Given a question, retrieve top candidate passages.
- Why it matters: Without an index, finding relevant documents in a huge corpus would be too slow.
🍞 Bottom Bread (Anchor): Like an alphabetical index in a book that helps you jump straight to the right pages.
Step-by-step pipeline:
- BM25 Retrieval
- What happens: We use BM25, a classic keyword search method, to fetch the top 100 likely documents from FineWeb-Edu for each question.
- Why this step exists: It narrows millions of documents down to a manageable set where the answer might live.
- Example: For "What is the main artery that takes blood from the heart?", BM25 pulls passages that mention "artery," "heart," and likely "aorta."
🥬 Filling (Concept 13: BM25)
- What it is: A tried-and-true scoring formula that ranks documents by how well their words match the query.
- How it works:
- Count overlapping words between query and document.
- Reward important words and penalize overly common ones.
- Return the best-scoring documents first.
- Why it matters: Without a strong first-pass retriever, you miss good candidates.
🍞 Bottom Bread (Anchor): Searching "aorta heart artery" surfaces docs that literally discuss those terms.
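The scoring idea behind BM25 can be sketched in a few lines. This is a minimal, self-contained Okapi BM25 with standard defaults (k1=1.5, b=0.75), not the actual retrieval stack used for NanoKnow; the `bm25_rank` helper and toy documents are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ranked by Okapi BM25 score, best first."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avg_len = sum(len(t) for t in toks) / n
    df = Counter()  # document frequency: in how many docs each term appears
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get high idf; very common terms get little weight.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation plus length normalization.
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avg_len))
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "The aorta is the main artery that carries blood away from the heart.",
    "Baseball season that year ended in October.",
    "Rivers carry water from the mountains down to the sea.",
]
ranking = bm25_rank("main artery blood heart", docs)  # best match first
```

In the real pipeline this first pass would run over an inverted index of millions of FineWeb-Edu documents and keep the top 100 per question.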
- Answer String Matching
- What happens: We lowercase text, clean spaces, and check if the exact answer string appears in any retrieved document.
- Why this step exists: It's a fast way to detect potential answer locations, but it can produce false positives (same word, wrong meaning).
- Example: The string "Paris" could match a song lyric instead of the city question, so we need more checks.
🥬 Filling (Concept 14: String Matching)
- What it is: A simple check to see if the exact answer text appears.
- How it works:
- Normalize the text.
- Search for the answer substring.
- Flag candidates for deeper verification.
- Why it matters: It's a quick filter that finds needles in haystacks, but it can mistake shiny straws for needles.
🍞 Bottom Bread (Anchor): Finding "December 1972" in a paragraph suggests it might answer the moon question, but only if the paragraph is truly about moon landings.
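The normalize-then-match step above can be sketched as follows; the function names and exact normalization rules are assumptions, but they mirror the lowercase-and-clean-spaces check described.

```python
import re

def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_answer(answer, document):
    """Return the character offset of the answer string in the
    normalized document, or -1 if it never appears."""
    return normalize(document).find(normalize(answer))

doc = "Apollo 17 came home safely. No one has walked on the\nMoon since December 1972."
offset = find_answer("December 1972", doc)  # candidate found: offset >= 0
miss = find_answer("December 1973", doc)    # no match: -1
```

Any document with a non-negative offset becomes a candidate that still has to survive the LLM verification step.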
- LLM-Based Verification
- What happens: We extract a window of about 512 words around the match (256 before, 256 after) and ask another LLM to decide whether this context truly answers the question.
- Why this step exists: It filters out coincidental matches and keeps only real evidence.
- Example: If the context says, "No one has walked on the Moon since December 1972," the LLM marks TRUE; if it's unrelated, it marks COINCIDENTAL.
🥬 Filling (Concept 15: LLM-Judge / Verification)
- What it is: Using a language model as a careful checker to confirm that a passage genuinely answers a question.
- How it works:
- Show the question, the found answer string, and the context.
- Ask the LLM to label TRUE or COINCIDENTAL with a reason.
- Keep only TRUE matches.
- Why it matters: Without verification, many matches would be misleading.
🍞 Bottom Bread (Anchor): The judge says, "TRUE: The sentence directly states the last moon landing date," or "COINCIDENTAL: The date is about baseball season, not the moon."
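Extracting the roughly 512-word verification window (256 words before and after the match) can be sketched like this; `context_window` is a hypothetical helper, not the released code.

```python
def context_window(document, match_start, words_before=256, words_after=256):
    """Return roughly words_before + words_after words centered on a
    character offset, mirroring the ~512-word verification window."""
    before = document[:match_start].split()   # words preceding the match
    after = document[match_start:].split()    # match word plus what follows
    return " ".join(before[-words_before:] + after[:words_after])

doc = ("Apollo 17 launched in 1972. No one has walked on the Moon "
       "since December 1972. Later missions were cancelled.")
window = context_window(doc, doc.find("December"), words_before=5, words_after=3)
```

The window (rather than the whole document) is what gets shown to the verifier LLM alongside the question and the matched answer string.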
Labeling and frequency:
- After verification, questions with at least one TRUE match become supported; otherwise unsupported.
- We also count how many distinct verified documents contain the answer to compute answer frequency and place each question into buckets: Rare (1–5), Low (6–20), Medium (21–50), High (51+).
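The bucket ranges above map directly to a small helper; the bucket labels follow the ranges in the text, while treating a count of zero as "unsupported" is this sketch's own framing.

```python
def frequency_bucket(verified_doc_count):
    """Map a verified-document count to a frequency bucket.
    Zero verified matches is treated here as 'unsupported'."""
    if verified_doc_count == 0:
        return "unsupported"
    if verified_doc_count <= 5:
        return "rare"      # 1-5 documents
    if verified_doc_count <= 20:
        return "low"       # 6-20 documents
    if verified_doc_count <= 50:
        return "medium"    # 21-50 documents
    return "high"          # 51+ documents
```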
🥬 Filling (Concept 16: Supported/Unsupported Labels)
- What it is: A tag per question telling us if its answer exists in training data.
- How it works:
- If a verified context exists in the corpus, label supported.
- Otherwise label unsupported.
- Record document IDs and character offsets for traceability.
- Why it matters: These labels power fair comparisons of memory vs. external help.
🍞 Bottom Bread (Anchor): "Aorta" with a verified passage is supported; a niche trivia question with no match is unsupported.
Efficient access and traceability:
- Each FineWeb-Edu document has a unique ID encoding its shard number and row offset (like shar0151_20323). This lets us fetch the exact source text quickly, without scanning the whole dataset.
- We also store character offsets so we can pull precise answer windows (e.g., 200 words around the first match in some experiments).
- This precision enables fast, repeatable RAG experiments at scale.
🥬 Filling (Concept 17: Fast Lookup with IDs)
- What it is: A way to jump straight to the exact document and position where the answer appears.
- How it works:
- Parse the shard and row from the ID.
- Fetch that single row from the right file.
- Slice the text by character offsets.
- Why it matters: Without fast lookup, running thousands of experiments would be painfully slow.
🍞 Bottom Bread (Anchor): Like skipping to page 151, line 20,323 in a huge encyclopedia in milliseconds.
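The parse-then-slice idea can be sketched in a few lines; the ID layout and helper names here are illustrative assumptions, not the exact released scheme.

```python
def parse_doc_id(doc_id):
    """Split an ID like 'shard0151_20323' into (shard, row).
    This ID layout is illustrative, not the released format."""
    shard_part, row_part = doc_id.rsplit("_", 1)
    shard = int("".join(ch for ch in shard_part if ch.isdigit()))
    return shard, int(row_part)

def answer_window(text, offset, radius=100):
    """Slice a character window around a stored answer offset."""
    return text[max(0, offset - radius): offset + radius]

shard, row = parse_doc_id("shard0151_20323")  # jump straight to the row
```

With the shard and row in hand, only one file row needs to be read, and the stored character offsets pick out the exact answer window.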
Evaluation setups:
- Closed-Book: Prompt with just the question.
- With FineWeb Context: Give the oracle answer passage from the training corpus.
- With Original Context (SQuAD): Provide the original Wikipedia passage used to create the SQuAD question.
🥬 Filling (Concept 18: Closed-Book vs. Open-Book)
- What it is: Closed-book tests memory only; open-book adds helpful passages.
- How it works:
- Closed-book: Model answers from parametric knowledge.
- Open-book: Model reads the provided context and answers.
- Compare results to see how memory and evidence interact.
- Why it matters: Without comparing both, we can't tell how much the model relies on memory versus context.
🍞 Bottom Bread (Anchor): Taking a quiz from memory vs. taking it with the right textbook page open.
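The two setups differ only in whether passages are prepended to the question. A minimal sketch (the template wording is an assumption, not the paper's exact prompt):

```python
def build_prompt(question, contexts=()):
    """Closed-book when contexts is empty; open-book otherwise.
    The template wording here is illustrative."""
    parts = [f"Passage {i}: {ctx}" for i, ctx in enumerate(contexts, 1)]
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)

closed = build_prompt("What is the main artery?")
open_book = build_prompt("What is the main artery?",
                         ["The aorta is the main artery."])
```

The same function covers all three evaluation setups: an empty context list for closed-book, the verified FineWeb passage for the training-corpus condition, or the original SQuAD Wikipedia paragraph.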
Metrics:
- Exact Match (EM): Does the model's output contain the exact correct answer string?
- LLM-Judge Accuracy: A separate LLM checks if the model's answer is correct even if wording differs.
🥬 Filling (Concept 19: Exact Match vs. LLM-Judge)
- What it is: EM is a strict string check; LLM-Judge is a smarter grader for meaning.
- How it works:
- EM: Look for the exact phrase.
- LLM-Judge: Read the model's answer and judge correctness.
- Report both for robustness.
- Why it matters: Without both, we might miss correct paraphrases or count random string matches as correct.
🍞 Bottom Bread (Anchor): "December 1972" vs. "It happened in December of 1972": EM may miss the second; LLM-Judge won't.
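Exact Match can be implemented as a normalized containment check; this sketch follows the description above, and the normalization details are assumptions.

```python
import re

def exact_match(prediction, gold):
    """Strict check: does the normalized prediction contain the
    normalized gold answer string?"""
    norm = lambda s: re.sub(r"\s+", " ", s.lower()).strip()
    return norm(gold) in norm(prediction)

hit = exact_match("The answer is December 1972.", "December 1972")
miss = exact_match("It happened in December of 1972", "December 1972")
```

The second case is exactly the paraphrase problem: the answer is right, but the intervening "of" breaks the substring match, which is why the LLM-judge metric is reported alongside EM.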
In summary, the methodology builds a reliable map from questions to training-data evidence, labels what's truly known vs. unknown, counts how often answers were seen, and then uses those labels to fairly test memory, evidence, and the effects of noise.
04 Experiments & Results
The Test: The authors asked four main questions.
- Does seeing an answer more often during training boost closed-book accuracy? What about when adding external context?
- How does closed-book compare to open-book on supported questions?
- In open-book mode, do models still do better on supported than unsupported questions?
- How much do distractors (irrelevant passages) hurt accuracy, and does their position matter?
The Competition: The team evaluated several nanochat checkpoints at three sizes (about 561M, 1.9B, 2.2B parameters) and compared:
- Closed-book vs. with FineWeb context (oracle training passage) for both NQ and SQuAD.
- For SQuAD, also with the original Wikipedia context.
- Across answer frequency buckets (Rare, Low, Medium, High) derived from verified training matches.
Results with context:
- Answer Frequency Matters (Closed-Book)
- Accuracy rose clearly with answer frequency. High-frequency answers performed more than twice as well as rare ones in closed-book mode for the larger models.
- Interpretation: Repetition during training strengthens parametric memory; the model recalls common facts much more easily.
- Evidence Helps but Memory Still Counts (Open-Book)
- Adding the correct FineWeb passage increased accuracy for every model, with the biggest relative gains on the smaller models. For SQuAD, the original Wikipedia passage worked even better than the FineWeb one (it's more tailored to the question).
- Even in open-book mode, accuracy still trended upward with answer frequency, though more gently; evidence reduces but doesn't erase the memory advantage.
- Plain-English read: RAG is like giving good notes to a student; it helps a lot, especially when the student hasn't memorized much. But students who already know the topic still do best when reading those notes.
- Supported Beats Unsupported in Open-Book
- When given the original SQuAD context, models were consistently more accurate on supported questions than unsupported ones across all sizes.
- Meaning: Parametric knowledge and external knowledge work together. Prior familiarity helps the model read and use the context better.
- Distractors Hurt, Placement Matters (Lost in the Middle)
- Adding irrelevant passages reduced accuracy compared to having only the answer passage.
- More distractors caused larger drops; e.g., in one setup, accuracy fell from about 0.478 to 0.367 (LLM-judged) as distractors increased from 1 to 4.
- The answer was best placed near the question (top of prompt) and worst when sandwiched between distractors (middle), showing the "lost in the middle" effect.
Surprising or notable findings:
- The smallest model (≈561M) didn't show the same strong memorization trend by frequency in closed-book, suggesting limited capacity to store many facts.
- Even an oracle answer passage doesn't fully level the playing field: prior exposure still gives an edge.
Scoreboard in simple terms:
- Closed-book: Higher frequency = better memory; bigger models = more memorization.
- Open-book w/ FineWeb: Big boost for all, especially small models; frequency still helps but less so.
- Open-book w/ Original SQuAD context: Best performance, because the context is laser-targeted.
- Distractors: Always harmful; keeping the answer near the top and limiting noise yields better outcomes.
🥬 Filling (Concept 20: Distractors and Lost-in-the-Middle)
- What it is: Distractors are irrelevant passages; lost-in-the-middle is when the model struggles to use information placed in the middle of long prompts.
- How it works:
- Add irrelevant text before or around the answer.
- The modelâs attention gets split or misled.
- Middle-placed answers are used least effectively.
- Why it matters: Without careful retrieval and ordering, RAG systems can underperform.
🍞 Bottom Bread (Anchor): Imagine a worksheet where the right hint is buried between three unrelated stories; it's much harder to find and use.
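The placement experiments boil down to where the gold passage sits among the distractors. A sketch of that manipulation (a hypothetical helper, not the paper's code):

```python
def place_answer(answer_passage, distractors, position):
    """Insert the gold passage at the top, middle, or bottom of the
    list of distractor passages fed into the prompt."""
    passages = list(distractors)
    index = {"top": 0,
             "middle": len(passages) // 2,
             "bottom": len(passages)}[position]
    passages.insert(index, answer_passage)
    return passages

distractors = ["story A", "story B", "story C", "story D"]
top = place_answer("gold passage", distractors, "top")
middle = place_answer("gold passage", distractors, "middle")
bottom = place_answer("gold passage", distractors, "bottom")
```

Holding the passages fixed and varying only `position` is what isolates the lost-in-the-middle effect from the sheer number of distractors.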
Overall, the experiments turn fuzzy guesses into clear, testable facts: repetition boosts memory, evidence boosts everyone (especially small models), prior knowledge and evidence add up, and noise kills performance if not controlled.
05 Discussion & Limitations
Limitations:
- Only FineWeb-Edu-trained models: The findings are clean because data is open, but results may differ for models trained on other corpora with different styles and topics.
- Coverage bias: SQuAD aligns closely with Wikipedia, and FineWeb-Edu contains lots of Wikipedia-like text, so supported rates are high; other domains may see lower overlaps.
- Exact-string reliance upstream: While LLM verification filters coincidences well, initial string matches can miss paraphrased or indirect answers that never use the exact phrase.
- Scale and capacity: The smallest model didn't show clear frequency-linked memorization, so conclusions about memory strength vs. size depend on model capacity.
- Oracle contexts: Using known answer passages is great for analysis, but real-world RAG often retrieves imperfect snippets, so actual gains may be lower.
Required Resources:
- Access to the large FineWeb-Edu index (hundreds of GB) and compute capable of fast retrieval and many LLM calls for verification and judging.
- Scripts and tooling (released) to reproduce indexing, projection, and evaluation reliably.
When Not to Use:
- If the pre-training data is closed or mixed and cannot be aligned with the evaluation tasks, NanoKnow's supported/unsupported labeling can't be applied directly.
- If your task requires creative or long-form reasoning rather than fact lookup, answer frequency in pre-training may be a weaker predictor of performance.
Open Questions:
- Beyond frequency: How do other properties (recency, source quality, wording diversity) shape memorization and generalization?
- Topic composition: How does the subject mix in pre-training (science vs. sports vs. law) influence downstream strengths and weaknesses?
- Robust RAG: What retrieval and re-ranking tricks best fight distractors and lost-in-the-middle effects? Can we train models to be noise-resistant?
- Attribution at scale: Can we combine this projection approach with gradient- or representation-based attribution to triangulate the exact training instances that shaped a particular answer?
In short, NanoKnow gives us a powerful lens, but extending it across corpora, tasks, and retrieval setups will deepen our understanding of how models learn and use knowledge.
06 Conclusion & Future Work
Three-sentence summary:
- NanoKnow is a benchmark that uses open training data (FineWeb-Edu) to label which questions a model could have learned during pre-training (supported) and which it couldn't (unsupported), along with how often answers appeared.
- With these labels, the paper shows that repetition in training boosts closed-book performance, adding the right context (RAG) lifts everyone (especially smaller models), and prior exposure still helps even with evidence.
- Distractors consistently hurt, particularly when the correct passage is buried in the middle, highlighting the need for precise retrieval and careful prompt ordering.
Main achievement:
- Turning a long-standing mystery (what a model knows from memory vs. what it learns from supplied text) into a controlled, measurable evaluation using open data and a validated projection pipeline.
Future directions:
- Extend the method to other open corpora and tasks to study topic effects, data quality, and time-sensitive knowledge.
- Develop retrieval and prompt-structuring strategies that minimize distractor harm and "lost in the middle" effects.
- Combine frequency-aware analysis with training-data attribution tools for even finer tracing of knowledge origins.
Why remember this:
- NanoKnow is a practical, transparent "knowledge X-ray" for LLMs. It helps researchers and builders know when models are recalling, when they're reading, and how to make both work better together in real systems.
Practical Applications
- Audit an LLM's knowledge: identify which FAQs it can answer from memory vs. which require retrieval.
- Design smarter RAG prompts: place the most relevant evidence near the top and minimize distractors.
- Curate training data: boost coverage of rare but important facts identified as low-frequency.
- Evaluate vendor models: run supported/unsupported tests to compare memory vs. reliance on context.
- Build study tools: teach students (and models) to combine memory with well-placed notes for best results.
- Detect outdated knowledge: unsupported or low-frequency answers flag areas needing updated sources.
- Tune retrieval systems: prefer high-precision passages and re-rank to avoid lost-in-the-middle issues.
- Set guardrails: when closed-book confidence is low, force retrieval from trusted sources.
- Benchmark domain shifts: project new domain questions (e.g., law/medical) onto open corpora to assess gaps.
- Measure improvements: track how changes in data, retrieval, or prompt structure affect supported vs. unsupported accuracy.