Imagination Helps Visual Reasoning, But Not Yet in Latent Space
Key Summary
- The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?
- Using causal tests, the authors show two big breaks: changing the input barely changes the latent tokens, and changing the latent tokens barely changes the answer.
- Probing tests also show these latent tokens carry little specific visual information and look very similar to each other.
- Because the latent route seems weak, the authors try a clearer route: make the model “imagine” in plain text by describing the key visual changes it would have made.
- This new method, called CapImagine, teaches models to write mini-captions like “zoom into the top-left corner; the label says 42,” instead of hiding this step in latents.
- Across tough vision benchmarks, CapImagine beats strong latent-space baselines like Monet, including +4.0% on HR-Bench-8K and +4.9% on MME-RealWorld-Lite.
- Causal tests on CapImagine show a strong link: change the imagined text and the final answer changes a lot.
- The takeaway: imagination helps visual reasoning, but today it works better in text than in hidden latent space.
Why This Research Matters
When models help with real documents, signs, or instructions, tiny details matter. If the model’s “inner thoughts” don’t truly reflect the picture or steer the answer, it can fail silently in high-stakes moments. This paper shows a clearer path: make the imagination explicit in text so each step is grounded and checkable. That boosts accuracy on tough, high-resolution tasks and lets humans see and trust the reasoning process. In the near term, this can improve assistive tech, education tools, and workplace automation where visual precision is key. Long term, it guides researchers to build truly causal, interpretable visual reasoning systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how, when you solve a puzzle, you might whisper to yourself, “Look at the corner piece… now check the red stripe,” so your thoughts guide your eyes step by step? Models that see and read try to do something similar.
🥬 Filling (The Actual Concept):
- What it is: A Multimodal Large Language Model (MLLM) is a computer brain that can read words and look at pictures, then answer questions about both.
- How it works: 1) It turns the picture and the words into numbers, 2) mixes them in its “thinking layers,” and 3) writes an answer word by word.
- Why it matters: Without a shared brain that understands both pictures and words, the model can’t connect what it sees (like a traffic sign) with what it’s asked (like “What does the sign say?”).
🍞 Bottom Bread (Anchor): Imagine asking, “How many apples are on the table?” The MLLM looks at the photo and the question together and answers “Three.”
🍞 Top Bread (Hook): Imagine you’re looking at a map and someone asks, “Which road reaches the park?” You scan, zoom in, and compare places before answering.
🥬 Filling (The Actual Concept):
- What it is: Visual reasoning is thinking steps about pictures to reach a correct answer.
- How it works: 1) Notice what matters, 2) check details (like zooming in), 3) connect clues, 4) decide the answer.
- Why it matters: Without careful steps, the model might miss tiny but crucial details, like a small arrow or a number.
🍞 Bottom Bread (Anchor): On a worksheet image, “What is the value in the top-left cell?” requires the model to find the grid, locate top-left, and read the number.
🍞 Top Bread (Hook): Sometimes people use tools—like a magnifying glass or a highlighter—to look closer before deciding.
🥬 Filling (The Actual Concept):
- What it is: Tool-augmented visual reasoning lets models call tools (like zoom-in or crop) to inspect an image before answering.
- How it works: 1) Pick a tool (zoom, draw), 2) apply it to the image, 3) observe the result, 4) continue reasoning.
- Why it matters: Without tools, tiny text or small objects can be missed.
🍞 Bottom Bread (Anchor): The model might zoom into a street sign, read “Oak Ave,” then answer the question about the street’s name.
🍞 Top Bread (Hook): Now imagine thinking quietly in your head without drawing anything—just forming a picture in your mind.
🥬 Filling (The Actual Concept):
- What it is: Latent tokens are the model’s invisible thought-bits—hidden states it passes to itself while “thinking.”
- How it works: 1) The model starts producing special hidden tokens, 2) uses them to continue thinking, 3) then switches back to normal words to give the final answer.
- Why it matters: If these hidden tokens truly carry the model’s inner visual thoughts, they should change with the input and steer the answer.
🍞 Bottom Bread (Anchor): It’s like the model silently telling itself, “Focus left… read the label… compare colors,” but without saying this out loud.
🍞 Top Bread (Hook): People hoped to teach models to “imagine” inside their heads, not just with external tools.
🥬 Filling (The Actual Concept):
- What it is: Latent Visual Reasoning (LVR) is a method where the model reasons inside its hidden space using latent tokens instead of showing text or new images.
- How it works: 1) Train the model so its hidden tokens line up with meaningful visual features, 2) let it generate a sequence of these tokens during reasoning, 3) use them to influence the final answer.
- Why it matters: If it works, LVR could be fast, flexible, and private (no need to render extra images or long text).
🍞 Bottom Bread (Anchor): Instead of writing “zoom top-right,” the model would produce several hidden tokens that supposedly mean “zoom top-right,” then answer based on them.
The World Before: MLLMs became good at general Q&A but struggled when answers depended on tiny, high-resolution details or multi-step spatial logic. Two paths formed: (1) tool-use (zoom/draw) and (2) hidden-space imagination (LVR). Tool-use worked but needed predefined tools and could be slow or rigid. LVR looked elegant: no tools, no extra images, just internal thoughts.
The Problem: No one knew if those hidden tokens truly carried the visual thoughts. Were they changing with the input? Did they actually control the answer?
Failed Attempts: Many teams supervised these tokens using teacher features or compressed vision embeddings. Results looked okay on benchmarks, but the inner mechanism stayed murky.
The Gap: We needed a causal microscope to check whether input → latent → answer was a real working chain.
Real Stakes: In everyday use—reading receipts, checking safety signs, counting pills on a label, or aligning parts in a manual—we need models that truly look, think, and explain. If the “thinking” tokens don’t actually think, the model might succeed by shortcuts and fail when details matter.
02 Core Idea
🍞 Top Bread (Hook): You know how a detective checks if each clue truly matters by asking, “If this clue changed, would my conclusion change?”
🥬 Filling (The Actual Concept):
- What it is: Causal Mediation Analysis is a way to test whether a middle step (the mediator) actually carries influence from input to output.
- How it works: 1) Treat input as the cause, the middle step as the mediator, and the answer as the effect, 2) poke the input and see if the mediator changes, 3) poke the mediator and see if the answer changes, 4) conclude if the chain really works.
- Why it matters: Without a real causal chain, we might believe a mirage—tokens that look like thoughts but don’t steer the answer.
🍞 Bottom Bread (Anchor): If changing the photo doesn’t change the hidden tokens, and changing the hidden tokens doesn’t change the answer, then those tokens weren’t doing the work.
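The two pokes can be sketched numerically. This is a toy illustration of the causal chain X → Z → Y, not the paper's models: the encoder, decoder, and weights below are made-up stand-ins. A healthy mediator shows a large effect in both tests; the paper found both near zero for latent tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

W_enc = rng.normal(size=(8, 4))   # hypothetical input -> latent map
W_dec = rng.normal(size=(4, 3))   # hypothetical latent -> answer map

def encode(x):
    """Input -> mediator (latent tokens), toy stand-in."""
    return np.tanh(x @ W_enc)

def decode(z):
    """Mediator -> answer logits, toy stand-in."""
    return z @ W_dec

x1, x2 = rng.normal(size=8), rng.normal(size=8)
z1, z2 = encode(x1), encode(x2)

# Test 1: does changing the input change the mediator?
input_effect = float(np.linalg.norm(z1 - z2))

# Test 2: does force-setting the mediator (the do-operator) change the answer?
y_clean = decode(z1)
y_doped = decode(rng.normal(size=4))        # do(Z := noise)
mediator_effect = float(np.linalg.norm(y_clean - y_doped))

# Both effects are clearly nonzero here; the paper reports both links
# are nearly broken for today's latent imagination tokens.
print(input_effect, mediator_effect)
```

In the actual study, the "effect size" is measured as answer accuracy change on real benchmarks rather than a vector norm, but the logic is the same.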
The “Aha!” Moment in One Sentence: The paper shows that today’s latent imagination tokens barely carry causal influence, while making the imagination explicit in text strongly drives correct visual reasoning.
Multiple Analogies:
- Factory Analogy: If you swap in different raw materials (inputs) but the assembly line’s middle machine (latent tokens) behaves the same, and breaking that machine doesn’t change the product (answer), it’s probably not the real worker.
- Classroom Analogy: If a tutor’s secret notes (latents) never change with questions, and erasing them doesn’t change the student’s final answer, the notes weren’t teaching anything.
- GPS Analogy: If the map route (mediator) never updates when you change destinations, and changing the route doesn’t change where you end up, the GPS route isn’t controlling your trip.
🍞 Top Bread (Hook): Imagine you flip through totally different images—cats, road signs, tables—but your inner “imagination tokens” hardly budge.
🥬 Filling (The Actual Concept):
- What it is: Input–Latent Disconnect means changing the input barely changes the latent tokens.
- How it works: Researchers measured similarity of latent tokens across many different inputs and found they were highly alike, even across tasks, and got more alike as more latent tokens were produced.
- Why it matters: If latents don’t reflect the input, they can’t carry the needed details for the answer.
🍞 Bottom Bread (Anchor): Whether the question showed a menu or a map, the model produced near-identical hidden states.
🍞 Top Bread (Hook): Now imagine you wildly scramble those hidden tokens—and the final answer stays almost the same.
🥬 Filling (The Actual Concept):
- What it is: Latent–Answer Disconnect means changing the latent tokens barely changes the final answer.
- How it works: The team replaced latent tokens with a constant blob, added noise, even set them near zero; the answers hardly changed.
- Why it matters: If the mediator can be scrambled without changing results, it isn’t mediating.
🍞 Bottom Bread (Anchor): On several benchmarks, answers barely moved even when latents were replaced with random noise.
🍞 Top Bread (Hook): Picture asking someone to use only the notes they scribbled earlier to answer a new but related question—and they can’t.
🥬 Filling (The Actual Concept):
- What it is: Probing Analysis checks whether the latent tokens themselves hold useful visual facts.
- How it works: 1) Save the latent tokens from a question-image pair, 2) ask new questions about the same region/object, 3) see if latents alone can answer.
- Why it matters: If latents store real visual evidence, they should help with related questions.
🍞 Bottom Bread (Anchor): Latents performed worse than even text-only guessing, while real images let models score high—showing latents lacked crucial details.
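The probing idea can be illustrated with a toy linear probe. Everything below is synthetic (the paper probes real model latents by asking new questions); the point is that features which actually encode the evidence separate the classes, while collapsed features hover near chance.

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(feats, labels):
    """Nearest-centroid probe: can the features predict the label?"""
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(feats[:, None] - centroids[None], axis=-1)
    preds = np.argmin(dists, axis=1)
    return float((preds == labels).mean())

labels = rng.integers(0, 2, size=200)

# Informative features shift with the label (like real image evidence).
informative = rng.normal(size=(200, 16)) + 3.0 * labels[:, None]

# Collapsed features barely vary at all (like the paper's latent tokens).
collapsed = np.tile(rng.normal(size=16), (200, 1)) + 0.01 * rng.normal(size=(200, 16))

acc_informative = probe_accuracy(informative, labels)
acc_collapsed = probe_accuracy(collapsed, labels)
print(acc_informative, acc_collapsed)   # high vs. near chance (0.5)
```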
🍞 Top Bread (Hook): So what if, instead of hiding imagination, we say it out loud?
🥬 Filling (The Actual Concept):
- What it is: CapImagine is a text-space imagination method that teaches the model to write short, precise descriptions of the visual changes it would have made (like zooming or highlighting).
- How it works: 1) Rewrite training data so every intermediate image manipulation becomes a clear sentence (“Zoom into the label; it reads 42”), 2) refine the chain for smooth logic, 3) filter low-quality items, 4) train the model to reason using these explicit imagination steps.
- Why it matters: When imagination is written down, it must track the input and it directly guides the answer—making the causal chain strong.
🍞 Bottom Bread (Anchor): With CapImagine, if you alter the imagined text (for example, claiming the label says 24 instead of 42), the answer changes a lot—proof that the imagination step truly matters.
Before vs After:
- Before: We assumed hidden latents were doing smart visual thinking.
- After: The tests show today’s latents are weak mediators; explicit text imagination is a stronger driver of correct answers.
Why It Works (Intuition): Text descriptions force the model to anchor each step to observable details (like positions, numbers, and words in the image). This prevents vague, collapsing hidden states and builds a visible, checkable path from input to answer. In short, explicit words keep the model honest and causal.
Building Blocks:
- Prove two disconnects (input→latent and latent→answer) with interventions and similarity checks.
- Show weak semantics in latents via probing.
- Replace hidden imagination with compact text captions tied to real intermediate visual evidence.
- Train and evaluate across vision-centric benchmarks to confirm consistent gains.
03 Methodology
At a high level: Input (image + question) → Causal tests on hidden latents (are they doing work?) → Build a text-imagination dataset → Train CapImagine → Output (answers grounded by explicit imagined text).
🍞 Top Bread (Hook): Think of a science fair project where you poke one thing at a time to see what changes.
🥬 Filling (The Actual Concept):
- What it is: The do-operator (intervention) is a way to “force-set” a variable and see what happens next.
- How it works: 1) Freeze or replace the mediator (latents), 2) observe the change in answers, 3) conclude causal power from the effect size.
- Why it matters: Without clean interventions, we can’t tell if the middle step actually matters.
🍞 Bottom Bread (Anchor): The team set all latent tokens to a fixed tensor or random noise, then checked if the model’s answers changed.
Step A: Measure Input → Latent (X → Z)
- What happens: Feed many different images and questions to latent-reasoning models (e.g., Monet, LVR, Mirage) and record the latent tokens they generate over time. Compute cosine similarity between latents across instances and within each instance as reasoning unfolds.
- Why this exists: If latents carry input-specific thoughts, they should differ across different inputs and differ across steps. High similarity means they’re not reflecting the input or evolving meaningfully.
- Example with data: Across V*, MME, OCRBench-v2, TableVQA, and a planning dataset, the latent tokens stayed very similar—even across tasks—and got more similar as more tokens were produced.
🍞 Top Bread (Hook): You know how you check sameness by seeing how much two arrows point in the same direction?
🥬 Filling (The Actual Concept):
- What it is: Cosine similarity tells how alike two vectors point; 1.0 is “same direction,” 0.0 is “unrelated.”
- How it works: Compare pairs of latent-token vectors; high values mean they’re very similar.
- Why it matters: If everything is very similar, there’s little specific information.
🍞 Bottom Bread (Anchor): Latent tokens across very different images often had high cosine similarity, signaling sameness.
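As a quick numerical illustration (toy vectors, not real model states), collapsed latents behave like a shared direction plus tiny input-specific noise:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)

# Independent random vectors (what input-specific latents should look like).
z_menu = rng.normal(size=64)
z_map = rng.normal(size=64)
print(cosine(z_menu, z_map))    # near 0: unrelated directions

# Collapsed latents: a shared component plus tiny noise, the pattern
# the paper reports across very different inputs and tasks.
shared = rng.normal(size=64)
z_a = shared + 0.05 * rng.normal(size=64)
z_b = shared + 0.05 * rng.normal(size=64)
print(cosine(z_a, z_b))         # near 1: almost identical
```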
Step B: Intervene on Latents → Answer (Z → Y)
- What happens: Replace all latent tokens with a constant blob, add Gaussian noise, or set them near zero, then measure answer changes on V*, HR-Bench, MME-RealWorld-Lite, and a spatial-planning set.
- Why this exists: If latents are true mediators, big changes to them should cause big changes in answers.
- Example with data: Even extreme changes caused only tiny answer differences (sometimes none). In one strong case (stage-2 Mirage, set near zero), repetition harmed results—but most interventions barely moved scores.
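The three interventions can be sketched as follows. Random arrays stand in for the real captured latents; in the actual experiment each perturbed variant is fed back through the frozen model and answer accuracy is compared, with tiny changes indicating a weak mediator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent tokens captured mid-reasoning: (num_tokens, hidden_dim).
latents = rng.normal(size=(8, 32))

# Shapes are preserved so decoding can still run on each variant.
constant = np.broadcast_to(latents.mean(axis=0), latents.shape).copy()  # one fixed blob
noisy = latents + rng.normal(scale=1.0, size=latents.shape)             # Gaussian noise
near_zero = np.full_like(latents, 1e-6)                                  # content wiped out

for name, z in [("constant", constant), ("noise", noisy), ("near-zero", near_zero)]:
    # How far each intervention moves the latents (all are large perturbations).
    print(name, float(np.linalg.norm(z - latents)))
```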
Step C: Probing Latents for Semantics
- What happens: Save latents from a given image-question, then ask new multiple-choice questions about the same region/object (different attributes) using only the latents. Compare to runs using the real image.
- Why this exists: If latents store visual facts, they should help answer related questions.
- Example with data: Latents performed worse than text-only guessing, while real-image baselines scored around 76.7%—showing latents lacked needed details.
Secret Sauce (for Analysis): The careful, two-sided causal check (poke input; poke latents) plus semantic probing makes a strong case that current latent tokens don’t mediate visual reasoning.
Building CapImagine (Text-Space Imagination)
Step D: Rewrite Interleaved Data into Text Imagination
- What happens: Start from Monet-SFT-125K (which had intermediate images like zoom-ins, highlights, drawings). Convert each visual manipulation into a short, concrete sentence: “Zoom into top-left receipt corner; total is $8.49,” or “Highlight the third bar; its label reads July.” Then refine the whole chain so the steps flow logically.
- Why this exists: Replace hidden, lossy latent steps with explicit, checkable sentences that must reflect what’s in the image.
- Example with data: For visual search subsets (Visual-CoT, Zebra-CoT), the model generates tight captions focusing on the zoomed region. For manipulation subsets (Refocus, CogCoM), the model verbalizes the new markings or revealed numbers.
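The rewriting step can be sketched like this, assuming a hypothetical record schema (the real Monet-SFT-125K format may differ): each stored image manipulation becomes one explicit, checkable sentence.

```python
# Hypothetical record schema for an intermediate manipulation step.
def verbalize(step):
    """Turn one image manipulation record into an explicit caption."""
    if step["op"] == "zoom":
        return f"Zoom into the {step['region']}; {step['observation']}."
    if step["op"] == "highlight":
        return f"Highlight {step['target']}; {step['observation']}."
    return f"{step['op'].capitalize()} step; {step['observation']}."

steps = [
    {"op": "zoom", "region": "top-left receipt corner",
     "observation": "the total is $8.49"},
    {"op": "highlight", "target": "the third bar",
     "observation": "its label reads July"},
]

# Join the verbalized steps into one imagination chain; the paper then
# refines such chains for smooth logic before training.
chain = " ".join(verbalize(s) for s in steps)
print(chain)
```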
🍞 Top Bread (Hook): When writing instructions, sloppy notes cause mistakes.
🥬 Filling (The Actual Concept):
- What it is: Data filtering picks only high-quality training examples with clear, consistent reasoning and answers.
- How it works: 1) Automatically judge if the reasoning aligns with the final answer and isn’t ambiguous, 2) remove flawed items, 3) keep a smaller but cleaner set.
- Why it matters: Bad examples teach the model to be confused.
🍞 Bottom Bread (Anchor): After filtering, about 17k high-quality items remained, and performance improved over using the full noisy set.
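A minimal sketch of such a filter, with a made-up item schema and simple heuristics standing in for the paper's automatic judge:

```python
# Hypothetical item schema: a reasoning chain, the answer that chain
# implies, and the gold answer.
def keep(item):
    """Keep only items whose chain is consistent with the answer and non-trivial."""
    chain, pred, gold = item["chain"], item["chain_answer"], item["answer"]
    consistent = pred.strip().lower() == gold.strip().lower()
    grounded = len(chain.split()) >= 5   # crude proxy for a non-empty chain
    return consistent and grounded

data = [
    {"chain": "Zoom into the label; it reads 42.",
     "chain_answer": "42", "answer": "42"},
    {"chain": "Probably 7.",               # vague and inconsistent: dropped
     "chain_answer": "7", "answer": "9"},
]

filtered = [d for d in data if keep(d)]
print(len(filtered))
```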
Step E: Train and Evaluate
- What happens: Fine-tune Qwen2.5-VL-7B on the rewritten, filtered data to produce CapImagine. Ensure no train-test mismatch (e.g., don’t train with extra images you won’t have at test time). Compare to Monet and others on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, BLINK (jigsaw, multi-view), and TableVQA.
- Why this exists: A fair, controlled setup shows whether text imagination truly beats latent imagination.
- Example with data: CapImagine beats Monet by noticeable margins, including +4.0% on HR-Bench-8K and +4.9% on MME-RealWorld-Lite.
Secret Sauce (for CapImagine):
- Explicit, grounded steps: Each imagined sentence ties to a concrete region or manipulation.
- Logical smoothing: Chains are refined for coherence.
- Quality control: Filtering removes misleading examples.
- Causal punch: Changing these sentences changes answers—a healthy mediator.
Efficiency Note: Despite longer text, CapImagine’s decoding time is similar to latent Monet and about twice as fast as a tool-heavy agent (DeepEyes), offering a practical accuracy–speed balance.
04 Experiments & Results
The Test: The authors measured whether latent tokens truly mediate reasoning (by input and latent interventions) and whether a text-based imagination (CapImagine) can outperform latent-based methods on vision-centric tasks. They focused on accuracy across benchmarks that need fine-grained perception (V*, HR-Bench-4K/8K, MME-RealWorld-Lite), compositional and multi-view reasoning (BLINK jigsaw and multi-view), and structured tables/diagrams (TableVQA).
The Competition: Models included strong open-source baselines (InternVL3-8B, Qwen2.5-VL-7B), latent-imagination methods (LVR, Monet), and tool-using agents (PixelReasoner, DeepEyes). A proprietary model (GPT-4o) is also reported for context.
Scoreboard with Context:
- Latent Causal Tests: Input→Latent showed very high similarity among latent tokens across different images and tasks, meaning latents hardly reflected input differences. Latent→Answer showed that even extreme perturbations (constant tensor, noise, near-zero) barely changed answers, suggesting weak causal impact. Probing showed latents couldn’t answer related questions about the same region, underperforming even text-only guesses, while image-based runs scored around 76.7%—latents lacked actual visual facts.
- CapImagine vs Monet: On V*, CapImagine reached about 85.9% vs Monet’s 83.3% (a steady edge). On HR-Bench-8K, CapImagine was +4.0 points, which is like going from a strong B to a solid A on a hard exam. On MME-RealWorld-Lite, CapImagine led by +4.9 points, showing robustness to messy, real images. On BLINK jigsaw and multi-view, CapImagine beat Monet and LVR by over 10 points, a big leap for spatial composition tasks. On TableVQA, CapImagine was +6.1% over Monet, indicating better number/label grounding.
- Against Tool Users: CapImagine outperformed PixelReasoner and approached DeepEyes, suggesting explicit text imagination can rival (or complement) external zoom/draw tools, with simpler infrastructure and comparable or better speed.
Surprising Findings:
- Latent tokens were so similar across instances that they looked like placeholders, not evolving thoughts. Over time, they even collapsed further, becoming more alike with each step.
- Scrambling latents barely changed answers—sometimes performance even ticked up slightly—hinting the model largely ignored these tokens during decision-making.
- In CapImagine, altering the imagined text (corrupting key details) caused dramatic drops—proof that the text imagination truly mediates the result. For example, on HR-Bench-4K, performance plunges when the imagined content is intentionally misled, often below random guess, showing a tight causal link.
Takeaway Numbers (meaningful scale):
- +4.0% on HR-Bench-8K and +4.9% on MME-RealWorld-Lite vs Monet translates to many more correct answers on small-text, fine-detail, and real-world images where mistakes are costly.
- Over 10-point boosts on BLINK jigsaw/multi-view mean clearly better global structure and multi-view reasoning—traditionally tough for models.
Bottom line: The latent path looks weakly causal today, while the text-imagination path is strongly causal and produces higher, more trustworthy scores across diverse, detail-heavy benchmarks.
05 Discussion & Limitations
Limitations:
- CapImagine uses longer text chains, which can add decoding time compared to ultra-short hidden latents. While the paper shows similar speed to Monet in practice, text may still be slower in some settings.
- CapImagine is designed as a clean testbed to expose the latent gap, not necessarily the final, optimal imagination approach. There might be tasks where super-fine visual nuances are hard to compress into short sentences.
- Natural language granularity is limited compared to high-dimensional latent vectors; with better latent design, future methods might carry richer details compactly.
Required Resources:
- A capable MLLM backbone (like Qwen2.5-VL-7B).
- Curated interleaved datasets where intermediate images (zoom/highlight/draw) can be verbalized.
- Compute for fine-tuning and quality filtering (the paper used GPUs with 80G memory).
When NOT to Use:
- Ultra time-critical or bandwidth-locked environments where even modest extra text generation isn’t acceptable.
- Tasks requiring pixel-perfect or continuous geometry beyond what short text can express (e.g., precise 3D rotations without visuals).
- Data-scarce domains where you cannot reliably rewrite manipulations into faithful text.
Open Questions:
- Can we redesign latent tokens to avoid collapse, tightly bind them to inputs, and increase their causal grip on answers?
- What hybrid designs (short text + improved latents + selective tools) yield the best accuracy–speed–causality trade-off?
- How do we automatically generate faithful, compact imagination text for any visual domain?
- Can we prove stronger causal guarantees on larger, noisier datasets and in real-time settings?
- How do we measure “explainability quality” so that the imagination text not only helps accuracy but also helps humans audit reasoning?
06 Conclusion & Future Work
Three-Sentence Summary: The paper checks whether hidden “imagination” tokens really carry visual reasoning and finds two key breaks: inputs barely change them, and changing them barely changes answers. Probing confirms these tokens hold little actionable visual detail. Replacing hidden imagination with explicit, well-written text steps (CapImagine) creates a strong causal link and significantly improves accuracy on tough vision benchmarks.
Main Achievement: A careful causal mediation study that demystifies latent visual reasoning today, plus a simple, effective alternative—text-space imagination—that outperforms leading latent-space methods while being more interpretable and causally faithful.
Future Directions: Build truly causal latent mechanisms that don’t collapse, explore hybrids blending concise text imagination with lightweight tools, and develop automatic, trustworthy imagination-text generation. Improve evaluation to judge not only accuracy but also causal faithfulness and explanation quality. Extend the framework to videos, 3D, and robotics tasks requiring long-horizon spatial planning.
Why Remember This: It shows that “imagination helps”—but the medium matters. Right now, saying the quiet part out loud (in text) leads to clearer, more controllable, and more accurate visual reasoning than hiding it in latents. It’s a roadmap for building models that really look, really think, and let us see their thinking.
Practical Applications
- Document triage: Read invoices/receipts by explicitly writing the focused region and extracted numbers before computing totals.
- Safety checks: Describe zoomed labels on machinery or chemicals to confirm warnings or usage steps.
- Education helpers: Show step-by-step imagined text (e.g., “Look at the third bar; it says July; value is 42”) so students learn how to read charts.
- Customer support: Diagnose product issues from photos by explicitly describing the inspected parts and observed states.
- Warehouse/retail: Count or verify SKUs by describing zoomed shelf regions, label text, and mismatches.
- Accessibility: Provide detailed, guided descriptions of small or cluttered visual elements for low-vision users.
- Data entry QA: Cross-check table values with explicit text imagination steps to reduce copy errors.
- Technical manuals: Walk through diagrams step by step in text (identify part, read label, compare spec) for reliable troubleshooting.
- Medical pre-screening (non-diagnostic): For simple tasks like reading device displays or dosage labels, write explicit inspection steps before output.
- Robotics perception planning: Use text-imagined inspection steps to decide which visual cues to check before acting.