LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Key Summary
- LatentLens is a simple, training-free way to translate what a model "sees" in image patches into clear words and phrases.
- Instead of matching vision features to single vocabulary tokens, LatentLens compares them to word-in-sentence meanings taken from many layers of a language model.
- Across 10 vision-language models and all layers, LatentLens finds that most visual tokens (about 72%) are actually interpretable.
- Older methods like LogitLens and EmbeddingLens miss much of this, reporting only about 23% and 30% interpretability.
- A surprising finding called the Mid-Layer Leap shows that visual tokens at the input often align best with middle language-model layers (e.g., layers 8–16), not with the very first layer.
- LatentLens gives full-word and even sentence-level descriptions, which are easier to understand than subword pieces or next-token guesses.
- The approach works widely, including on a strong off-the-shelf model (Qwen2-VL), not just in a controlled lab setup.
- This supports the idea that vision and language spaces are more aligned than we thought, helping explain why simple connectors can turn LLMs into VLMs.
- LatentLens can guide debugging, reduce hallucinations, and make AI explanations more human-friendly.
- There are limits: it needs a big embedding database, may inherit corpus biases, and results can vary with different architectures or domains.
Why This Research Matters
LatentLens lets us see what a model "thinks" an image patch means, in plain words, at any layer. That clarity makes AI systems easier to debug, safer to deploy, and more trustworthy to users. By showing that most visual tokens are interpretable, it supports building tools that catch hallucinations early and explain decisions transparently. The Mid-Layer Leap insight helps engineers design better connectors and training strategies. Because the method is training-free and generalizes to off-the-shelf models, it can be adopted quickly by researchers and practitioners. Ultimately, this bridges the gap between pixels and language, helping AI align more closely with human understanding.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you look at a picture, you don't just see pixels; you see a dog, a clock tower, or a red shirt? You also use the rest of the scene to understand what each part means (like "clocks" making more sense when you also see a tower).
The Concept (Vision-Language Model, VLM): What it is: A VLM is an AI that reads both pictures and words so it can talk about images. How it works: 1) A vision encoder chops the image into pieces (patches) and turns each into a visual token; 2) A small "connector" maps those tokens into the language model's space; 3) The language model mixes vision and text to answer questions or write captions. Why it matters: Without a VLM, you'd need separate AIs for seeing and talking, and they wouldn't share meaning well. Anchor: Think of the AI as a tour guide who can both look at landmarks and explain them in sentences.
Hook: Imagine your backpack has hidden pockets that store useful stuff even when you can't see it.
The Concept (Latent Representations): What it is: Latent representations are hidden numbers inside the model that capture the meaning of things (like an image patch). How it works: 1) The model turns input into numbers; 2) These numbers evolve through layers; 3) Each layer's numbers express richer, more contextual meaning. Why it matters: If we don't understand these hidden numbers, we can't tell what the model "knows" or "means" at each step. Anchor: A patch showing a clock tower becomes a vector that "means" something like "tower with clocks."
Hook: Suppose you want to find the closest library to your house: you check which one is nearest.
The Concept (Nearest Neighbor Retrieval): What it is: A way to find the most similar thing in a big collection. How it works: 1) You have a query vector; 2) You compare it (via a similarity score) to many stored vectors; 3) You pick the top-k most similar; 4) You read off their labels as your best guesses. Why it matters: Without nearest neighbors, you can't easily translate "mystery vectors" into meaningful words. Anchor: Given a visual token, we look up the closest word-in-sentence vectors and use their words as descriptions.
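The retrieval loop described above can be sketched in a few lines. This is a toy NumPy illustration with made-up vectors and labels, not the paper's actual index:

```python
import numpy as np

def top_k_neighbors(query, bank, labels, k=3):
    """Return the k labels whose stored vectors are most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                  # cosine similarity to every stored vector
    idx = np.argsort(-sims)[:k]   # indices of the k highest scores
    return [labels[i] for i in idx], sims[idx]

# Toy bank: three stored "meaning" vectors with hypothetical word labels.
bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
labels = ["clock", "tower", "clock tower"]
words, scores = top_k_neighbors(np.array([0.9, 0.8]), bank, labels, k=2)
# The query points between both axes, so the combined "clock tower" entry wins.
```

In practice the bank holds millions of vectors, so a real system would use an approximate nearest-neighbor index rather than a full scan.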
Hook: Picture using a small adapter to plug a camera into your computer.
The Concept (Shallow MLP Transformation/Connector): What it is: A tiny neural adapter that maps visual features into the language model's space. How it works: 1) Take image vectors; 2) Pass them through a linear layer or small MLP; 3) Output vectors shaped like the language model's inputs. Why it matters: Without this adapter, the language model can't "understand" vision vectors as if they were words. Anchor: A 3-layer MLP lets a frozen LLM read image patches and write captions.
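As a concrete sketch of such a connector, here is a minimal two-layer MLP in NumPy. The dimensions and random weights are toy stand-ins: a real connector maps, e.g., a 1024-dim CLIP feature to a 4096-dim LLM embedding, and its weights are trained rather than random:

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, hidden_dim, llm_dim = 8, 16, 12   # toy sizes, not the paper's

# Trained parameters in a real model; random placeholders here.
w1, b1 = rng.normal(size=(vision_dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, llm_dim)), np.zeros(llm_dim)

def connector(patch_vecs):
    """Map vision-encoder patch vectors into the LLM's embedding space."""
    h = np.maximum(patch_vecs @ w1 + b1, 0.0)  # hidden layer with ReLU (GELU in practice)
    return h @ w2 + b2                         # project to the LLM embedding width

patches = rng.normal(size=(4, vision_dim))     # four image patches
tokens = connector(patches)
# tokens has shape (4, 12): four "word-like" vectors the LLM can consume.
```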
Hook: Imagine trying to identify an object by matching it only to single letters in the alphabet. Tricky!
The Concept (EmbeddingLens): What it is: A method that matches a hidden vector to single input-word embeddings from the model's vocabulary. How it works: 1) Compare a hidden vector to all input embeddings; 2) Sort by cosine similarity; 3) Return the top tokens. Why it matters: It's simple, but it only returns subword pieces, which can be hard to understand and may miss context. Anchor: It might say the best match is "clock" or even just "clo," which is not very helpful.
Hook: Now imagine peeking at what letter the model would write next if it had to guess right now.
The Concept (LogitLens): What it is: A method that projects a hidden vector to output scores over the vocabulary (next-token space). How it works: 1) Multiply by the unembedding matrix; 2) Get scores (logits) for each token; 3) Pick the top tokens as the interpretation. Why it matters: It's popular, but it can return punctuation or language-mixing tokens, and it works best in late layers. Anchor: On a tower-with-clocks patch, it might return "," or foreign characters instead of a clear phrase.
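In code, LogitLens is just a matrix multiply against the unembedding matrix followed by a top-k read-out. The four-token vocabulary and weights below are invented for illustration:

```python
import numpy as np

def logit_lens(hidden, unembed, vocab, k=3):
    """Project a hidden vector to next-token logits and read off the top-k tokens."""
    logits = unembed @ hidden          # one score per vocabulary token
    top = np.argsort(-logits)[:k]
    return [vocab[i] for i in top]

vocab = [",", "clo", "clock", "tower"]      # toy vocabulary
unembed = np.array([[0.1, 0.0],             # one row per vocabulary token
                    [0.4, 0.1],
                    [0.9, 0.2],
                    [0.2, 0.8]])
top_tokens = logit_lens(np.array([1.0, 0.5]), unembed, vocab, k=2)
# A hidden state pointing mostly along the first axis surfaces "clock" first.
```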
Hook: Think of reading a word like "bank." You only know if it means money-bank or river-bank when you see the sentence around it.
The Concept (Contextualized Token Representations): What it is: A word's vector that already includes its sentence context and layer. How it works: 1) Feed many sentences to the LLM; 2) Save each word's vector at multiple layers; 3) Now each word is "meaningful in context." Why it matters: Without context, you risk matching the wrong sense of a word or missing richer meaning. Anchor: "stories" in "a building with many stories" is the right sense for floors.
Hook: Imagine most of the good hints about a puzzle live in the middle pages of the book, not the first page.
The Concept (Mid-Layer Leap): What it is: Visual tokens at the input often match best with middle LLM layers, not the earliest ones. How it works: 1) Project image patches into the LLM; 2) Compare them to contextual word vectors from all layers; 3) Find that the strongest matches often sit around layers 8–16. Why it matters: This suggests the connector maps visual tokens to semantic (not just lexical) spots in the LLM. Anchor: A patch of a clock tower aligns more with the "tower with clocks" meaning from a mid layer than with raw word pieces from layer 0.
The World Before: People could plug images into LLMs with a tiny connector and get good captions or answers, but they didn't really know what the model's hidden vision vectors meant at each layer. The Problem: Were these visual tokens meaningful "words" inside the LLM, and how could we see that meaning clearly? Failed Attempts: EmbeddingLens and LogitLens often returned token fragments, punctuation, or late-layer guesses, underestimating interpretability. The Gap: No one compared visual tokens to contextualized, layer-specific text meanings. The Stakes: Clear interpretations mean safer, more trustworthy systems: better debugging, fewer hallucinations, and explanations that people can understand.
02 Core Idea
Hook: You know how it's easier to identify a mystery word when you see it in a full sentence, not by itself?
The Concept (LatentLens): What it is: LatentLens explains what a visual token means by comparing it to word-in-sentence vectors saved from many layers of an LLM. How it works: 1) Precompute a huge library of contextualized word vectors from real sentences (and multiple layers); 2) Take a visual token vector from any LLM layer; 3) Find the top-k nearest contextualized vectors (using cosine similarity); 4) Use their words (and nearby text) as the description. Why it matters: Without LatentLens, we miss how interpretable visual tokens already are, especially at early layers and across models. Anchor: A patch of a building aligns with "building with many stories," not just "building" or a stray comma.
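The matching step above can be sketched as a search over per-layer banks of contextual vectors. The two tiny banks below are hypothetical; the real library holds millions of vectors per layer:

```python
import numpy as np

def latent_lens(visual_vec, layer_banks):
    """Compare one visual token to contextual word vectors from every stored
    LLM layer; return (best_layer, best_label, cosine_similarity)."""
    q = visual_vec / np.linalg.norm(visual_vec)
    best = (None, None, -1.0)
    for layer, (bank, labels) in layer_banks.items():
        b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        sims = b @ q
        i = int(np.argmax(sims))
        if sims[i] > best[2]:
            best = (layer, labels[i], float(sims[i]))
    return best

banks = {
    0: (np.array([[1.0, 0.0]]), ["clo"]),                # isolated subword-like entry
    8: (np.array([[0.6, 0.8]]), ["tower with clocks"]),  # contextual mid-layer phrase
}
layer, label, sim = latent_lens(np.array([0.5, 0.9]), banks)
# The visual token lands nearest the mid-layer phrase, echoing the Mid-Layer Leap.
```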
The "Aha!" Moment in one sentence: Compare visual tokens to contextualized text vectors (from multiple layers), not to raw input/output embeddings, and their meanings suddenly become clear.
Three analogies:
- Dictionary vs. Storybook: Old methods used a dictionary of isolated words; LatentLens uses a storybook where words live in sentences, so meanings are richer.
- GPS vs. Local Landmarks: Instead of matching by a global grid (one fixed embedding matrix), LatentLens uses local landmarks (layer-specific, context-aware vectors) to find the closest match.
- Single Note vs. Melody: A single token is like one note. LatentLens hears the note within the melody (the sentence), which tells you its real role.
Before vs. After:
- Before: Visual tokens looked random or only readable at the very end of the network; lots of subword fragments or punctuation.
- After: Most visual tokens are interpretable at most layers, with meaningful, sentence-aware labels.
Why It Works (intuition, no equations):
- Visual tokens projected into the LLM space are not bare words; they already carry semantics from the vision encoder and connector.
- Contextualized text vectors from middle layers represent similar semantic granularity, so cosine similarity clicks into place.
- Since the connector aims to make "word-like" vectors the frozen LLM can use, mid-layer contexts are the natural neighbors.
Building Blocks:
- A big, curated corpus (e.g., Visual Genome phrases) covers many visual concepts.
- Layer-wise contextual embeddings capture multiple levels of language meaning.
- Fast nearest-neighbor search maps visual tokens to text meanings without extra training.
- An evaluation judge (LLM-based) checks whether the match fits the highlighted image patch.
Anchor: When you ask, "What's in this red box?" LatentLens might say, "gray tower with multiple clocks," where older methods might say "," or "clo."
03 Methodology
At a high level: Image patch → Vision encoder → Shallow MLP connector → LLM layers (hidden vectors) → LatentLens nearest-neighbor search in contextual text vectors → Top-k descriptions.
Step-by-step (like a recipe):
1. Build the Context Library (once):
  - What happens: Take a large sentence corpus (e.g., Visual Genome), run each sentence through the LLM, and save each word's hidden vector at several layers (e.g., layers 1, 2, 4, 8, 16, 24, L-2, L-1).
  - Why it exists: Words change meaning with context and depth; storing many layer-wise examples gives you a rich "map" of meanings. Without it, you'd be stuck with isolated subwords or shallow matches.
  - Example: For the sentence "a building with many stories," you store the vector for "stories" at layer 8 (and at other layers).
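That library-building pass can be sketched as follows. The `fake_llm_hidden_states` function is a stand-in for reading per-layer hidden states out of a real LLM, and the layer list, dimension, and two-sentence corpus are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
LAYERS = [1, 2, 4, 8]   # a subset of the layers the study stores
DIM = 12                # toy hidden size

def fake_llm_hidden_states(tokens):
    """Stand-in for an LLM forward pass: one random vector per token per layer.
    A real pipeline would capture the model's actual hidden states here."""
    return {layer: rng.normal(size=(len(tokens), DIM)) for layer in LAYERS}

library = {layer: {"vecs": [], "words": []} for layer in LAYERS}
for sentence in ["a building with many stories", "stone tower with gold clocks"]:
    tokens = sentence.split()
    states = fake_llm_hidden_states(tokens)
    for layer in LAYERS:
        for word, vec in zip(tokens, states[layer]):
            library[layer]["vecs"].append(vec)    # the word's contextual vector
            library[layer]["words"].append(word)  # its label, for read-out later
# library[8] now holds a layer-8 vector for every word occurrence in the corpus.
```

The study additionally caps the number of stored contexts per token per layer and compresses vectors to save space; those details are omitted here.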
2. Turn Pictures into "Word-like" Vectors:
  - What happens: A vision encoder (e.g., CLIP ViT-L/14, DINOv2, or SigLIP) turns the image into patch vectors. A tiny connector (a linear layer or small MLP) maps each patch vector into the LLM's embedding space. Then the LLM processes the sequence of visual tokens (optionally mixed with text).
  - Why it exists: The LLM expects vectors shaped like words; the connector is the adapter. Without it, the LLM can't make sense of visual inputs.
  - Example: A patch showing a clock face becomes a vector that the LLM can treat like the meaning of a word.
3. Pick the Right Lens (Layer-aware Matching):
  - What happens: Choose any LLM layer at which you want to interpret the visual token (early, middle, or late). Take that visual token's hidden vector and compare it to contextual text vectors from the same layer (and optionally across layers). Compute cosine similarity and retrieve the top-k nearest neighbors.
  - Why it exists: Meanings evolve by layer. Matching at the right depth captures the level where semantics line up best. Without layer-aware matching, you might miss the true meaning.
  - Example: A layer-0 visual token might best match layer-8 text contexts (the "Mid-Layer Leap").
4. Read Out Descriptions (Words, Phrases, Sentences):
  - What happens: For each nearest neighbor, return the main word (and optionally its sentence snippet). You can also merge subword pieces into full words because you know the sentence.
  - Why it exists: Humans read words-in-context more easily than isolated subword pieces or punctuation. Without this, results feel cryptic.
  - Example: Bold the matched token in "stone tower with gold clocks" to show exactly what matched: clocks.
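Merging subword pieces back into whole words is straightforward once word boundaries are known. This sketch assumes a SentencePiece-style convention where a leading "▁" marks the start of a word, which may differ from the tokenizer a given model uses:

```python
def merge_subwords(pieces):
    """Rejoin subword pieces into full words, treating a leading '▁' as a word start."""
    words, current = [], ""
    for piece in pieces:
        if piece.startswith("▁"):      # a new word begins
            if current:
                words.append(current)
            current = piece[1:]
        else:                          # continuation of the current word
            current += piece
    if current:
        words.append(current)
    return words

merged = merge_subwords(["▁clo", "cks", "▁tower"])
# "clo" + "cks" fuse back into the full word "clocks".
```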
5. Judge Interpretability (Optional but Helpful):
  - What happens: A separate LLM judge sees the image region (red box) and the top-5 candidate words, and decides whether at least one is a concrete, abstract, or global match.
  - Why it exists: Interpretability can be fuzzy; the judge standardizes decisions. Without it, you'd rely on ad-hoc eyeballing.
  - Example: For a leafy patch, the judge might label "foliage" as concrete.
Concrete data example:
- Input: A city scene with a tower that has two clock faces.
- At layer 0: The visual token for the tower patch retrieves neighbors like "tower," "clocks," and "large tower with clocks" from mid layers.
- At layer 16: The same patch retrieves even higher-similarity matches like "gray tower with multiple clocks."
Secret Sauce (what makes it clever):
- Context beats isolation: Matching to contextualized word vectors (sentence-aware, layer-aware) reveals meaning that older methods can't see.
- Layer symmetry: Visual tokens are often most like mid-layer text, so LatentLens naturally meets them where their meaning lives.
- Training-free: You leverage precomputed text vectors and the model's own geometry; no extra model training is needed.
What breaks without each step:
- No context library: You fall back to subwords or next-token guesses; many results look like punctuation or random fragments.
- No connector: The LLM can't "hear" vision tokens as language-like signals.
- No layer-aware matching: Early tokens seem uninterpretable even when they're not.
- No judge: Hard to compare methods fairly or quantify success.
Implementation notes from the study:
- LLMs used: OLMo-7B, Qwen2-7B, LLaMA3-8B; vision encoders: CLIP, DINOv2, SigLIP (9 combos).
- Corpus: ~3M Visual Genome phrases; store up to 20 contextual vectors per token per layer; use float8 to save space.
- Evaluation: 100 random visual patches × 9 layers × 3 methods, judged by an LLM; caption quality checked by an independent LLM judge (DCScore).
04 Experiments & Results
The Test: The authors asked, "How often can we clearly explain a visual token at each layer?" They measured the percentage of interpretable tokens (a token counts if at least one of its top-5 words fits the highlighted region) using an LLM judge validated against humans (substantial agreement, κ ≈ 0.68). They also checked where in the LLM (which layer) the best text neighbors come from and validated caption quality to ensure the models behave reasonably overall.
The Competition: LatentLens vs. EmbeddingLens vs. LogitLens. All methods are training-free and use the same underlying models. LatentLens is the only one that matches to contextualized, layer-aware text vectors; EmbeddingLens matches to input embeddings; LogitLens projects to output scores (next-token space).
The Scoreboard (with context):
- Interpretable tokens (higher is better):
  - LatentLens ≈ 72% on average across models and layers. That's like scoring a strong A when most others get a C.
  - EmbeddingLens ≈ 30%. That's a low D, missing lots of meaning.
  - LogitLens ≈ 23%. That's near failing for early layers, better later, but still far behind.
- Consistency across layers: LatentLens stays high from early to late layers; LogitLens is especially weak early and improves late; EmbeddingLens varies by model family.
- Generalization: An off-the-shelf, fully tuned model (Qwen2-VL-7B-Instruct) also shows high LatentLens interpretability (~60–73% across layers), proving it's not a one-off trick.
Surprising Findings:
- Mid-Layer Leap: Even at the input, visual tokens best match mid-layer contextual text (e.g., layer 8 or 16), not the first layer. This hints that the connector maps visual patches to semantic territory rather than raw lexical space.
- Minimal Drift: Visual tokens change less across LLM layers than text tokens do, suggesting visual patches arrive already "semantically cooked" by the vision encoder + connector.
- DINOv2 (trained without text) still yields interpretable tokens. That supports a deeper alignment between vision and language spaces.
- Full-word and sentence-level neighbors improve clarity: LatentLens avoids the subword fragments and punctuation often returned by LogitLens.
- Rendered text is handled well: LatentLens often recovers the exact words in the image; LogitLens sometimes guesses plausible next tokens instead of what's truly visible.
Context on caption quality:
- The controlled models scored around 6.0/10 on DCScore (a GPT-4o judge). Qwen2-VL-7B-Instruct, a strong reference model, scored 8.5/10. This confirms the controlled systems are competent enough that their internals are worth interpreting.
Takeaway: With the right comparison set (contextualized, layer-aware text vectors), most visual tokens are understandable. Older lenses underreport interpretability because they look in the wrong place.
05 Discussion & Limitations
Limitations (be specific):
- Storage overhead: You must precompute and store a large bank of contextual vectors across layers. Even with compression, a bigger corpus and more layers mean more storage.
- Corpus bias: If the corpus (e.g., Visual Genome) emphasizes certain objects or cultural contexts, your nearest neighbors may reflect that, skewing interpretations toward nouns or familiar scenes.
- Architecture scope: Results cover transformer LLMs with simple connectors; other adapters (Q-Former, Perceiver) or end-to-end finetuning might change patterns.
- Judging subjectivity: Even with an LLM judge aligned to humans, "interpretability" has gray areas; nuances can be lost in binary decisions.
Required Resources:
- A capable LLM backbone to generate contextual vectors (ideally the same backbone used in the VLM you're analyzing).
- A large phrase corpus; compute and storage to embed it across multiple layers.
- A nearest-neighbor index (e.g., cosine similarity search).
- Optional: access to an LLM judge (API cost considerations) for large-scale evaluation.
When NOT to Use:
- Domains very unlike your corpus (e.g., specialized medical imagery), where stored contexts won't match; results may look vague or off-target.
- When you cannot store or retrieve from a large embedding bank.
- If you only care about next-token prediction behavior late in the network; LogitLens may suffice there.
Open Questions:
- How universal is the Mid-Layer Leap across more architectures, training regimes, and tasks?
- Can dynamic corpus generation (evolving better contexts) systematically boost interpretability further?
- How does interpretability relate causally to performance and hallucination rates? Does improving one improve the other?
- Beyond images, can the same idea decode other non-text inputs (speech, soft prompts, latent thoughts) into language?
Honest Assessment: LatentLens is a practical, training-free tool that reveals a lot more meaning in visual tokens by comparing them to the "right" references: contextual, layer-wise text. It's not magic: it's limited by the corpus and requires storage. But it's a big step toward transparent, trustworthy vision-language systems.
06 Conclusion & Future Work
Three-sentence summary: LatentLens explains what visual tokens mean by matching them to contextualized, layer-aware text vectors rather than to bare embeddings or output logits. This reveals that most visual tokens are interpretable across layers and models, and it uncovers a Mid-Layer Leap where early visual tokens align best with mid-layer language representations. The approach is training-free, generalizes to strong off-the-shelf systems, and yields human-friendly descriptions.
Main Achievement: Turning hidden visual vectors into clear words and phrases reliably, across architectures and layers, by using the model's own contextual language geometry as the reference space.
Future Directions:
- Build domain-specific or dynamic corpora to improve coverage and reduce bias.
- Extend to other modalities (speech, soft prompts, latent reasoning) and to natively multimodal transformers.
- Use LatentLens for causal studies (ablate interpretable vs. non-interpretable tokens) and to reduce hallucinations in practice.
Why Remember This: It changes where we "look" to understand visual tokens, from isolated tokens to contextual, layer-wise meanings, showing that vision and language are more aligned than we thought. That simple shift makes the complex feel clear, paving the way for safer, more transparent AI that can tell us not just the answer, but what it "saw" to get there.
Practical Applications
- Debugging VLMs by inspecting what each image patch means at different layers (catching misinterpretations early).
- Reducing hallucinations by verifying whether key visual tokens align with correct contextual words (e.g., object names).
- Designing better connectors by targeting the mid-layer semantics revealed by the Mid-Layer Leap.
- Building explainable AI interfaces that show users sentence-level descriptions for highlighted regions.
- Dataset and domain auditing by spotting where the model's nearest neighbors reflect biases or gaps.
- Teaching tools for AI literacy that visualize how models turn pixels into meanings step by step.
- Safety checks in medical or industrial settings by validating that critical regions match expected terms.
- Prompt and instruction tuning guidance by understanding which layers and contexts anchor desired meanings.
- Model selection and evaluation by comparing interpretability curves across architectures and training regimes.
- Interactive demos for product teams to explore and communicate how their multimodal models actually work.