
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Intermediate
Changwoo Baek, Jouwon Song, Sohyeon Kim et al. Ā· 3/1/2026
arXiv

Key Summary

  • Big picture: Vision-language models look at hundreds of image pieces (tokens), which makes them slow and sometimes chatty with mistakes called hallucinations.
  • Two popular ways to shrink tokens are attention-based (keep what looks most important) and diversity-based (keep a wide variety), but no one fully mapped their trade-offs.
  • This paper measures how spread-out the kept tokens really are using effective rank (erank) and how focused the model’s gaze is using attention entropy.
  • Finding 1: Diversity-heavy pruning keeps more variety but causes more hallucinations; attention-heavy pruning keeps focused evidence and hallucinates less.
  • Finding 2: Simple images (few key parts) favor attention-based pruning; complex images (many parts) favor diversity-based pruning.
  • They turn these insights into a tiny adaptive rule that changes pruning based on image complexity measured by erank (or entropy).
  • Their adaptive pruning reliably matches or beats strong baselines across nine benchmarks and reduces hallucinations on CHAIR.
  • The same trends hold across several popular models (LLaVA-1.5-7B/13B, LLaVA-NeXT-7B, Qwen2.5-VL-7B), so the idea is model-agnostic.
  • Efficiency stays high: big speed/memory savings with minimal overhead to compute erank/entropy.
  • Takeaway: Don’t pick one pruning style for everything; let the image decide how much to trust attention vs. diversity.

Why This Research Matters

This work makes multimodal AI faster without just chopping off important information at random. It gives a simple, reliable way to decide whether to trust focused clues or to cover a wider range, based on each image. That reduces harmful hallucinations, which is crucial for assistive tech used by people who are blind or low vision. It also saves compute and battery life, enabling smarter vision assistants on edge devices like phones and robots. Because the rule is training-free and model-agnostic, it can be dropped into many systems today. In short, it’s an easy upgrade path to make vision-language models both quicker and more trustworthy.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re packing for a trip with a tiny backpack. You can’t bring every toy, so you must choose the ones that matter most without missing something important.

🄬 Concept 1 — Large Vision-Language Models (LVLMs)

  • What it is: LVLMs are AI systems that read pictures and words together to answer questions, describe scenes, or reason.
  • How it works: (1) An image is chopped into many small pieces (visual tokens). (2) Each token becomes a vector (a list of numbers). (3) A language model reads these tokens like words and talks about the picture.
  • Why it matters: Pictures create hundreds of tokens. Processing them all is slow and expensive. šŸž Anchor: Like reading a comic with hundreds of panels at once—awesome, but it takes forever unless you skip repeated panels.

🄬 Concept 2 — Visual Tokens

  • What it is: Tiny puzzle pieces of the picture that the model reads.
  • How it works: The vision encoder splits the image into patches and turns each into a numeric vector.
  • Why it matters: More tokens = more work; many are redundant. šŸž Anchor: Think of 100 photos of the same cat pose—you don’t need all of them to know it’s a cat.

🄬 Concept 3 — Visual Token Pruning

  • What it is: A way to throw out unhelpful or repeated tokens so the model runs faster.
  • How it works: (1) Score or compare tokens, (2) keep K best ones, (3) discard the rest.
  • Why it matters: Without pruning, computation grows very fast and responses slow down. šŸž Anchor: Packing only your best toys so your backpack isn’t too heavy.

🄬 Concept 4 — Attention-Based Pruning

  • What it is: Keep tokens that the model’s attention thinks are most important.
  • How it works: (1) Compute attention scores, (2) sort tokens by score, (3) keep the top ones.
  • Why it matters: Without it, you might keep too many background tokens and miss the main subject. šŸž Anchor: A spotlight in a theater shines on the lead actor, not the curtains.

🄬 Concept 5 — Diversity-Based Pruning

  • What it is: Keep a wide mix of tokens that look different, so you cover more of the scene.
  • How it works: (1) Measure similarity among tokens, (2) avoid keeping near-duplicates, (3) choose tokens that are far apart in feature space.
  • Why it matters: Without diversity, you might keep many near-identical tokens and miss smaller or far-away objects. šŸž Anchor: A fruit salad tastes better with many fruits, not just a bowl of apples.

🄬 Concept 6 — Hallucination (in captions/answers)

  • What it is: When the model says an object is there when it isn’t.
  • How it works: With scattered or uncertain evidence, the model guesses and may add false details.
  • Why it matters: Hallucination hurts trust—especially for accessibility or safety-critical tasks. šŸž Anchor: Saying ā€œthere’s a dog in the photoā€ when the image shows only a cat.

🄬 Concept 7 — Attention Score Entropy

  • What it is: A number that tells how spread-out the model’s attention is across tokens.
  • How it works: (1) Turn attention scores into probabilities, (2) compute entropy—low = focused, high = spread-out.
  • Why it matters: If attention is focused, a few tokens carry most of the truth; if spread-out, you need broader coverage. šŸž Anchor: If your eyes lock onto one spot in a Where’s Waldo page, focus is low; if they scan everywhere, focus is high.
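The focus meter from Concept 7 fits in a few lines. This is an illustrative sketch, not the paper's code; the `focused` and `spread` vectors below are made-up examples:

```python
import numpy as np

def attention_entropy(attn_scores):
    """Shannon entropy of an attention distribution over visual tokens.
    Low entropy = focused on a few tokens; high entropy = spread out."""
    p = attn_scores / attn_scores.sum()  # step 1: turn scores into probabilities
    p = p[p > 0]                         # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())  # step 2: compute entropy

focused = np.array([0.90, 0.05, 0.03, 0.02])  # gaze locked on one token
spread = np.array([0.25, 0.25, 0.25, 0.25])   # gaze scanning everywhere
print(attention_entropy(focused) < attention_entropy(spread))  # True
```

A handy sanity check: a uniform distribution over N tokens gives the maximum possible entropy, log N.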

🄬 Concept 8 — Effective Rank (erank)

  • What it is: A measure of how many meaningful directions the token features really use (effective dimensions).
  • How it works: (1) Look at the singular values (strengths) of features, (2) compute an entropy-like score, (3) higher erank = more diverse features.
  • Why it matters: Tells you if the image’s information is concentrated or spread across many kinds of features. šŸž Anchor: A choir using many voice parts (soprano, alto, tenor, bass) has higher erank than a unison chant.
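The variety meter can be sketched the same way. A minimal erank implementation, assuming the standard definition (exponential of the entropy of normalized singular values); the rank-1 vs. random matrices are toy stand-ins for a unison chant vs. a full choir:

```python
import numpy as np

def effective_rank(tokens):
    """erank = exp(entropy of the normalized singular values).
    tokens: (N, d) matrix of token feature vectors."""
    s = np.linalg.svd(tokens, compute_uv=False)  # singular values ("strengths")
    p = s / s.sum()
    p = p[p > 1e-12]                             # ignore numerical noise
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
unison = np.outer(rng.normal(size=64), rng.normal(size=32))  # rank-1: one "voice"
choir = rng.normal(size=(64, 32))                            # many feature directions
print(effective_rank(unison))  # close to 1
print(effective_rank(choir))   # much larger
```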

🄬 Concept 9 — Image Complexity

  • What it is: How many different visual clues and objects spread across the scene.
  • How it works: Complex images tend to have higher attention entropy and higher erank; simple images have lower values.
  • Why it matters: It tells you whether to prefer attention-based or diversity-based pruning. šŸž Anchor: A clean sheet of paper with one big number is simple; a messy collage of small stickers is complex.

The world before this paper: Many teams sped up LVLMs with token pruning. Some trusted attention to keep the main clues; others trusted diversity to cover more ground; some mixed both with fixed recipes. But no one clearly measured (1) how much true variety each method keeps, (2) how that diversity relates to hallucination, and (3) which images prefer which strategy.

The problem: Without a map of these trade-offs, we might apply the wrong pruning to the wrong image—either losing crucial details or inducing hallucinations.

Failed attempts: Pure attention often over-focuses (great on simple scenes, but can miss small scattered objects). Pure diversity often over-spreads (better coverage, but can sound speculative and hallucinate). Fixed hybrids don’t adapt to each image’s needs.

The gap: A clear, quantitative way to connect diversity, attention focus, hallucination risk, and image complexity—then adapt pruning per image.

Real stakes: Faster, safer multimodal AI helps screen readers describe photos more reliably, robots make steadier decisions, and phones run smarter assistants without draining batteries or making things up.

02Core Idea

šŸž Hook: You know how some puzzles have one big obvious piece to start with, but others need you to gather many small edge pieces first? You don’t solve every puzzle the same way.

Aha in one sentence: Measure how concentrated vs. spread-out an image’s evidence is (using erank or attention entropy), then adapt pruning—lean on attention for simple images and on diversity for complex ones—to keep the right tokens and reduce hallucinations.

🄬 Concept 10 — Hybrid Pruning (but adaptive)

  • What it is: A prune-chooser that adjusts how much it trusts attention vs. diversity per image.
  • How it works: (1) Estimate image complexity via erank or attention entropy, (2) set a diversity threshold that scales with complexity, (3) pick high-attention tokens but prune neighbors by similarity using that adaptive threshold, (4) optionally adjust how many tokens to keep.
  • Why it matters: One-size-fits-all fails because images differ; this makes pruning image-aware. šŸž Anchor: Choosing sneakers for sprints (attention) or hiking boots for trails (diversity) depending on today’s path.

Three analogies to see the same idea:

  1. Puzzle strategy: If the picture has a huge, bright sun (simple), start with that (attention). If it’s a busy cityscape (complex), keep a mix of windows, roads, and sky pieces (diversity).
  2. School project team: For a short task, keep the most efficient experts (attention). For a complex project, keep a diverse team—designers, writers, testers (diversity).
  3. Backpacking: If you’re going to a theme park (simple), bring your ticket and phone (attention). If you’re going camping (complex), bring varied gear—map, rope, flashlight, first aid (diversity).

Before vs. after:

  • Before: Pick attention, diversity, or a fixed mix—hope it works for all images.
  • After: Let the image tell you what it needs; adapt the mix so you keep evidence that is reliable and just diverse enough.

Why it works (intuition, no equations):

  • Attention scores rank tokens by importance; that’s great when key information clusters in a few places.
  • erank/entropy tell you how many different feature directions matter; high values mean the scene spreads evidence around.
  • By scaling a similarity threshold with this complexity, you regulate how much variety to keep while still anchoring selections in high-attention regions.
  • This balance cuts hallucinations (by staying grounded in high-attention tokens) while covering more objects when necessary (by enforcing diversity at higher complexity).

Building blocks (small pieces that make the idea run):

  • A focus meter (attention entropy) and a variety meter (erank) to read the image’s information layout.
  • A ranked list of tokens by attention—so you start from the most promising clues.
  • A diversity gate (adaptive similarity threshold): if two tokens are too similar, keep one; how strict this gate is depends on image complexity.
  • An optional token-count knob: keep a few more tokens when images are complex; prune harder when simple.

Bottom line: The core idea is simple—listen to what the image looks like (focused or busy) and adjust pruning so you keep the right kind of evidence.

03Methodology

At a high level: Image → vision encoder makes tokens → measure complexity (erank or entropy) → sort tokens by attention → adaptive similarity-pruning loop → keep K tokens → send to language model → answer.

Step 1: Get tokens and attention

  • What happens: The vision encoder turns the image into N token vectors. We read the class token’s attention to all visual tokens and average across heads.
  • Why this step exists: We need a ranking of tokens by how much the model ā€œcaresā€ about them to start from the most promising evidence.
  • Example: Suppose N = 576 tokens. The top 5 attention scores land on a dog’s face and collar; the bottom scores are blank wall patches.

Step 2: Measure image complexity (erank or attention entropy)

  • What happens: Compute erank (effective dimensions used by tokens) or attention entropy (how spread-out attention is). Higher means more complex and dispersed evidence.
  • Why this step exists: Without a complexity signal, we’d use the same pruning policy for every image and either over-focus or over-spread.
  • Example: A page of large printed digits gives low entropy/low erank (simple). A street market scene with signs, people, fruits—high entropy/high erank (complex).

🄬 Concept 11 — Adaptive Similarity Threshold (Ļ„)

  • What it is: A gate that says, ā€œIf a new token looks too similar to what we already kept, skip it.ā€ The strictness grows with image complexity and token rank.
  • How it works: (1) Sort tokens by attention, (2) for each pick at position i, set Ļ„_i proportional to (image erank / average erank) Ɨ i, capped at Ļ„_max, (3) prune candidates whose cosine distance to any kept token is less than Ļ„_i, (4) continue until K tokens are kept.
  • Why it matters: A low Ļ„ (a lenient gate) hugs the attention ranking and keeps fine details when information is concentrated. A high Ļ„ (a strict gate) forces variety when information is spread out. šŸž Anchor: If your pantry is small (simple scene), avoid duplicates by a tiny margin; if you’re feeding a crowd (complex scene), insist each new dish is clearly different.
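One way to realize this gate is a threshold that scales with the erank ratio and the pick position, capped at Ļ„_max. The constants below are placeholders, since the paper's exact values aren't reproduced here:

```python
def tau_schedule(i, erank_img, erank_avg, base_tau=0.02, tau_max=0.3):
    """Similarity threshold for the pick at position i.
    Grows with image complexity (erank ratio) and with i, so later picks
    must be increasingly distinct. Constants are illustrative, not the paper's."""
    return min(base_tau * (erank_img / erank_avg) * i, tau_max)

print(tau_schedule(5, erank_img=40, erank_avg=50))   # simple image: small tau
print(tau_schedule(5, erank_img=100, erank_avg=50))  # complex image: larger tau
```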

Step 3: Adaptive attention-guided selection loop

  • What happens:
    1. Start with the highest-attention token; keep it.
    2. For the next candidates (in attention order), compute cosine distance to kept tokens.
    3. If distance < τ_i, drop it (too similar); else, keep it.
    4. Repeat until K tokens are kept.
  • Why this step exists: It fuses attention (reliability) with diversity (coverage) based on the image’s needs.
  • Example with numbers: For a simple image, Ļ„_i stays small. You might keep nose, eye, and collar tokens of a dog because they aren’t identical; background tokens get pruned. For a complex image, Ļ„_i grows, so you keep tokens from the fruit stand, a sign, a person’s face, and the street—broad coverage.
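The loop above can be condensed into a small greedy procedure. This is a sketch under the assumptions already stated, with `tau_fn` standing in for whatever adaptive schedule is used; a real implementation would also backfill from the remaining top-attention tokens if the loop runs out of candidates before reaching K:

```python
import numpy as np

def adaptive_select(tokens, attn, K, tau_fn):
    """Greedy attention-guided selection with a diversity gate.
    tokens: (N, d) features; attn: (N,) attention scores;
    tau_fn(i): similarity threshold when i tokens are already kept."""
    order = np.argsort(-attn)  # highest-attention tokens first
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [order[0]]          # always anchor on the top token
    for idx in order[1:]:
        if len(kept) == K:
            break
        # cosine distance from this candidate to every already-kept token
        dist = 1.0 - feats[kept] @ feats[idx]
        if dist.min() >= tau_fn(len(kept)):  # far enough from all kept tokens
            kept.append(idx)
    return [int(i) for i in kept]

# Two identical "wall" patches: with tau = 0.5 the duplicate gets skipped.
tokens = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attn = np.array([0.4, 0.3, 0.2, 0.1])
print(adaptive_select(tokens, attn, K=2, tau_fn=lambda i: 0.5))  # [0, 2]
```

With the gate turned off (`tau_fn=lambda i: 0.0`) this degenerates to plain top-K attention pruning, which makes the attention-vs.-diversity trade-off easy to probe.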

Step 4 (optional): Adaptive token count

  • What happens: If erank is high, keep a few more tokens than the budget; if low, keep fewer—staying near a target average to preserve efficiency.
  • Why this step exists: Some images truly need more pieces to avoid missing essentials, others don’t.
  • Example: Average budget is 88 tokens. A very simple picture drops to 70–75; a very complex one rises to ~100.
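The token-count knob from this step could look like the following; the scaling rule and the 70/88/100 bounds are illustrative, matching only the example numbers above:

```python
def adaptive_budget(erank_img, erank_avg, base_K=88, min_K=70, max_K=100):
    """Scale the token budget with image complexity.
    Illustrative rule: more complex images (higher erank) earn more tokens,
    clipped so the average stays near base_K."""
    K = round(base_K * erank_img / erank_avg)
    return max(min_K, min(max_K, K))

print(adaptive_budget(erank_img=40, erank_avg=50))  # simple image → 70
print(adaptive_budget(erank_img=60, erank_avg=50))  # complex image → 100
```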

Step 5: Feed pruned tokens to the language model

  • What happens: The language model now reads a shorter, smarter list of tokens and produces an answer or caption.
  • Why this step exists: With fewer but better-chosen tokens, we save time and memory while staying accurate and grounded.
  • Example: The caption names the main objects without inventing a nonexistent ā€œmicrowaveā€ in the scene.

The secret sauce:

  • Use erank/entropy as honest thermometers for how the image spreads information. That guides how much to trust attention (precision) vs. diversity (coverage).
  • Keep the loop simple and training-free—just a few fast matrix ops and distance checks—so it’s robust and easy to drop into many LVLMs.

What breaks without each piece:

  • No attention sorting: You might anchor on unimportant tokens and miss the main subject.
  • No complexity measure: You’d apply the same rule everywhere, leading to hallucinations on complex scenes or missed details on simple ones.
  • No similarity threshold: You’d keep redundant tokens and waste your budget.
  • No cap on Ļ„: Threshold could get too loose and prune away fine details by over-penalizing similarity.

Concrete toy example (retain K=4):

  • Attention order: A (0.20), B (0.18), C (0.15), D (0.12), E (0.05), ...
  • Simple image (low erank): Ļ„ small. Keep A, B, C (they differ enough). D is very similar to C—skip. Next different-enough token is E—keep. Final set: {A,B,C,E} (tight cluster plus one extra).
  • Complex image (high erank): Ļ„ larger. Keep A, B; C is too similar to B—skip; D is different—keep; E different—keep. Final set: {A,B,D,E} (broader coverage).

04Experiments & Results

The test: They checked two kinds of things—speed/accuracy across many benchmarks and hallucinations (CHAIR) in captions. They also compared against popular pruning baselines and hybrids.

The competition: Attention-based (FasterVLM, VisionZip, VisPruner, PruMerge+), diversity-based (DivPrune, FPS sampler), and in-LLM pruning (SparseVLM, PyramidDrop). The new method is a simple pre-pruning module attached before the language model.

The scoreboard with context:

  • Across nine benchmarks (VQAv2, GQA, VizWiz, TextVQA, ScienceQA, MME, POPE, MMBench, MMBench-CN), the adaptive method consistently preserved or improved accuracy versus fixed attention/diversity mixes—especially when keeping only 64–128 tokens.
  • Example framing: At 64 tokens, some attention-only methods lose over 25% (like going from an A to a C), while the adaptive method loses only about 3% and even edges out strong attention or diversity baselines (like keeping a solid A-).
  • Hallucination results (CHAIR): Diversity-first methods (like DivPrune) scored higher on the hallucination metrics C_S and C_I (more false objects), though they had higher recall (they mentioned more real objects too). Attention-first methods had lower hallucinations but sometimes missed secondary objects. The adaptive method struck a balance: close to full-token reliability while saving big on compute.
  • Complexity-dependent success: On simple datasets (like OCR/ScienceQA images), low erank/entropy means attention-focused pruning shines. On complex datasets (POPE/MME complex categories), high erank/entropy means diversity-focused pruning wins. The adaptive method learns this pattern and adjusts on the fly.

Surprising (and useful) findings:

  • More diversity in kept tokens often correlates with more hallucinations in captions. Diversity covers more ground but can tempt the model to speculate.
  • Replacing a fraction of diversity-chosen tokens with top-attention tokens steadily reduces hallucination (like turning down a ā€œspeculation dialā€).
  • A very small, training-free rule—just scaling a similarity threshold by erank/entropy—was enough to match or beat more elaborate systems.

Efficiency notes:

  • Computing erank is lightweight (a few milliseconds) compared to full inference; entropy is even simpler. Pre-pruning slashes FLOPs and GPU memory because fewer tokens go through all LLM layers.

Generality:

  • The same trends and gains appeared on LLaVA-1.5-7B/13B, LLaVA-NeXT-7B, and Qwen2.5-VL-7B. That suggests the principle (listen to image complexity, balance attention and diversity) is model-agnostic.

Bottom-line numbers to remember:

  • Hallucination-balanced sweet spot at 64 tokens: the adaptive method achieves C_S ā‰ˆ 52.2 and C_I ā‰ˆ 15.9 with recall ā‰ˆ 75.7—near the full-token baseline’s trustworthiness but much faster.
  • At 128 tokens, the adaptive rule yields small but consistent accuracy gains over both top attention- and diversity-style baselines—like nudging from a B+ to an A- while carrying a lighter backpack.

05Discussion & Limitations

Limits (be specific):

  • Very tiny or rare details: If crucial evidence is highly localized inside a cluttered image, diversity pressure could still dilute attention near that spot.
  • Threshold tuning: Although the rule is simple and capped, different datasets might prefer slightly different caps or scaling constants.
  • Complexity proxy: erank/entropy are good but imperfect thermometers; extreme lighting or heavy motion blur can shift them.
  • Caption length trade-off: Diversity methods often write longer captions (higher recall, but more hallucination). The adaptive method may still shorten some outputs versus full tokens.
  • Compute budget: erank adds a small overhead; on ultra-low-power devices, even milliseconds matter.

Resources required:

  • A vision encoder that exposes attention and token embeddings.
  • Ability to compute pairwise cosine distances or a batched similarity check.
  • A small SVD/eigendecomposition for erank (or simpler entropy-only variant) plus one thresholding loop.

When not to use:

  • If latency is dominated elsewhere (e.g., huge language headroom or slow I/O) and tokens aren’t the bottleneck, gains may be marginal.
  • If your model rarely hallucinates and your images are uniformly simple, fixed attention-based pruning might be simpler.

Open questions:

  • Can we predict fine-grained failure cases (e.g., many tiny, similar objects) and momentarily boost the token count automatically?
  • Could we learn a better complexity signal that is even faster than erank but more precise than entropy?
  • How does adaptation interact with training-time distillation—can we teach the model to prefer adaptive pruned inputs?
  • Can the same adaptive idea guide video token pruning across frames (temporal diversity vs. attention)?
  • Is there a way to softly reweight (rather than hard prune) tokens using the same complexity signals for smoother control?

06Conclusion & Future Work

Three-sentence summary:

  • The paper measures how focused vs. spread-out image evidence is (via attention entropy and erank) and shows that attention-based pruning suits simple images while diversity-based pruning suits complex ones.
  • It also uncovers that keeping more diverse tokens can raise hallucination risk, while anchoring on high-attention tokens lowers it.
  • Using these facts, the authors design a tiny adaptive pruning rule that balances attention and diversity per image, improving accuracy, reliability, and efficiency across many benchmarks and models.

Main achievement:

  • Turning empirical patterns (complexity ↔ best pruning style, diversity ↔ hallucination risk) into a training-free, image-aware pruning mechanism that consistently works in practice.

Future directions:

  • Smarter complexity meters (faster, more robust), soft reweighting rather than hard pruning, adaptive token counts learned end-to-end, and extensions to video or 3D.

Why remember this:

  • It’s a clear lesson on adaptivity: images differ, so pruning should too. By listening to simple signals (erank/entropy), we can choose the right mix of precision (attention) and coverage (diversity), making LVLMs both faster and more trustworthy.

Practical Applications

  • Screen readers that describe user photos with fewer invented objects.
  • On-device assistants that answer questions about your camera view while conserving battery.
  • Retail shelf-checking robots that stay fast and avoid misreporting missing items.
  • Warehouse scanners that identify packages and labels without slowing operations.
  • AR translation apps that read signs and menus accurately in crowded scenes.
  • Smart home cameras that send reliable alerts (e.g., detecting a package) with less compute.
  • Medical triage systems that summarize imaging scenes while reducing false mentions.
  • Autonomous drones that navigate complex environments with efficient visual processing.
  • Content moderation tools that caption images quickly and avoid speculative claims.
  • Education tools that help students query diagrams/photos, adapting to simple vs. busy images.
Tags: visual token pruning Ā· attention-based pruning Ā· diversity-based pruning Ā· effective rank Ā· attention entropy Ā· image complexity Ā· hallucination reduction Ā· adaptive threshold Ā· LVLM efficiency Ā· multimodal reasoning Ā· pre-pruning Ā· cosine distance Ā· hybrid pruning Ā· model-agnostic Ā· vision-language models