
Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Intermediate
Yuxuan Yang, Zhonghao Yan, Yi Zhang et al. (2/23/2026)
arXiv

Key Summary

  • Hepato-LLaVA is a special AI that reads giant microscope pictures of the liver and answers medical questions about cancer.
  • It uses a new trick called Sparse Topo-Pack Attention to keep important tiny details while also seeing the big picture.
  • Instead of squishing the whole image down and losing clues, it groups nearby image tiles into small packs and summarizes them smartly.
  • The team built HepatoPathoVQA, a 33,000-question dataset that teaches the AI to think from zoomed-out to zoomed-in views like a real pathologist.
  • A lightweight connector (Q-Former) helps the image side talk clearly to the language side without overwhelming the system.
  • Three training stages (MAE, MoCo, and instruction tuning with LoRA) help the model first learn textures, then structures, then how to answer clinical questions.
  • On their test set, Hepato-LLaVA scored 0.83 on average, beating the next best whole-slide model (0.66) by a big margin.
  • It reached 0.97 accuracy on some single-choice morphology questions and stayed strong across all zoom levels (WSI, ROI, Patch).
  • Ablations show the sparse, topology-aware design and the 32-query connector give better accuracy than dense or very large token setups.
  • This can help pathologists catch liver cancer patterns faster and more reliably, especially for tricky early-stage cases.

Why This Research Matters

This work helps pathologists spot and explain liver cancer patterns more reliably by keeping tiny details and the big picture in sync. It can reduce missed early signs that change staging and treatment, improving patient outcomes. Hospitals with fewer specialists can benefit from a model that reasons like a seasoned pathologist, providing consistent, explainable answers. The efficient, sparse design means faster processing of massive slides, saving time in busy clinics. The multi-scale dataset pushes AI to justify answers across zoom levels, making it easier to trust. Over time, similar ideas could support other cancers and bring high-quality diagnostics to more places.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine trying to find a tiny ant on a giant football field. If you zoom out too much, you can’t see the ant. If you zoom in too much, you forget where on the field you are. Pathologists face this problem when they look for cancer in huge microscope images of the liver.

🥬 The Concept (Whole Slide Images, WSIs): WSIs are ultra-large, gigapixel pictures of tissue on a slide. How it works: 1) A slide is scanned at high magnification; 2) The image is split into many small tiles (patches); 3) Computers analyze these tiles to find patterns. Why it matters: Without WSIs, we can’t capture tiny details that reveal early cancer clues.

🍞 Anchor: Think of Google Maps: city view (whole slide), neighborhood view (ROI), and street view (patch). You need all three to navigate.

🍞 Hook: You know how a doctor looks at your X-ray and then zooms into a suspicious spot? The doctor switches between big-picture and close-up to make a decision.

🥬 The Concept (Hepatocellular Carcinoma, HCC): HCC is a common type of liver cancer that looks different from place to place in the same slide. How it works: 1) The tumor can be patchy; 2) Important signs (like tumor edges or tiny vessel invasion) may span multiple tiles; 3) Correct diagnosis requires combining local details with the whole context. Why it matters: Missing small but important clues can mean a wrong stage and the wrong treatment plan.

🍞 Anchor: It’s like judging a forest’s health: you must see both the whole forest and the leaves’ spots.

The world before: AI for pathology often used two shortcuts. Some models shrank the whole giant image into a smaller thumbnail, which made things fast but blurry—tiny features (like cell shapes) got erased. Others collected thousands of patch features into a long list and then tried to smoosh them into a single slide summary. That kept details but created tons of repeated, noisy features and made the AI slow and confused.

The problem: 1) How do we compress a massive slide without throwing away important local clues? 2) How do we let the model work at multiple zoom levels (slide, region, and patch) the way real pathologists do?

Failed attempts: 1) Thumbnail models kept the story of the slide but lost the plot twists in the small details. 2) Patch-stacking models kept all the details but didn’t respect where tiles are on the 2D grid—treating them like a shuffled 1D list—so they mixed unrelated regions and wasted attention. 3) Many systems had only slide-level inputs, so they couldn’t answer fine-grained, patch-level questions.

The gap: We needed a model that understands the tissue’s 2D layout (topology), can summarize local neighborhoods without losing the thread, and can talk fluently about both big and small findings. We also needed teaching data that asks questions at three scales, guiding the model to connect zoomed-out impressions with zoomed-in evidence.

Real stakes: HCC diagnosis affects surgery decisions, follow-up plans, and patient survival. Early, subtle signs can change the stage and the treatment. Better AI assistance means: 1) fewer missed clues, 2) more consistent readings between doctors, and 3) helpful explanations that match clinical reasoning. This especially matters in busy hospitals and places with fewer specialists.

🍞 Hook: You know how group projects work best when each person summarizes their part and one leader keeps everyone aligned?

🥬 The Concept (Multiple Instance Learning, MIL): MIL treats a slide as a bag of many patch instances and learns from the bag’s label. How it works: 1) Each slide contains many patches; 2) The model finds which patches matter; 3) It aggregates them to predict the slide’s label. Why it matters: Without MIL ideas, the model wouldn’t know which small areas drive the diagnosis.

🍞 Anchor: It’s like a teacher grading a class by seeing which students’ work shows the key skills—some students (patches) matter more for the final grade.

02Core Idea

🍞 Hook: You know how a city is made of neighborhoods, and each neighborhood has its own vibe, but the highways connect them all? If you plan your trip by neighborhoods first, you don’t get lost in every single street.

🥬 The Concept (Sparse Topo-Pack Attention): It is a way for the model to group nearby patches into small packs, summarize each pack, and let these summaries talk to each other and to a global overview token. How it works: 1) Split the slide into k×k neighborhoods (packs); 2) Create a summary token for each pack and one global token for the whole slide; 3) Allow dense attention inside each pack (local details), summary-to-patch attention for aggregation, summary-to-summary attention for long-range structure, and a global sink token to broadcast context; 4) Block other, unhelpful connections. Why it matters: Without this, the model wastes effort attending to far-apart areas that don’t belong together, loses the 2D structure, and either drops details or drowns in redundancy.

🍞 Anchor: Picture a school: students discuss in small groups (packs), group leaders share highlights (summary tokens), and the principal (global token) keeps the school-wide plan on track.

The “Aha!” in one sentence: Respect the tissue’s 2D map by packing local tiles, summarizing them, and sparsely connecting only the interactions that match how pathologists think.

Three analogies:

  1. Newspaper layout: Read each article (pack) fully, then skim headlines (summaries) to connect stories across the page, while the front-page banner (global token) sets the theme.
  2. Lego city: Build houses (patches) into blocks (packs), then link blocks with roads (summary-to-summary), guided by a city plan (global token).
  3. Detective work: Inspect each room (pack) closely, let each room’s captain report key clues (summary), and keep the case board (global) as the master view.

Before vs After:

  • Before: Models either blurred tiny details (thumbnails) or got overloaded by thousands of loosely organized tokens (dense 1D sequences).
  • After: The model keeps sharp local evidence, compresses it into meaning-rich summaries, and shares only the right information across the slide using a sparse, topology-aware pattern.

Why it works (intuition): Cancer clues are clustered locally (e.g., margins, cell patterns). Nearby tiles usually share meaning; far tiles often don’t. So, focus dense attention locally, let summaries do the long-range talking, and keep one global context anchor. This mirrors a pathologist’s workflow: local inspection → summarize → compare regions → final conclusion.

Building blocks:

  • Packs: small k×k windows of patches.
  • Summary tokens: learned mini-reports of each pack’s evidence.
  • Global token: a macro reference frame that stabilizes interpretation.
  • Hierarchical sparse mask: a set of rules that allows only useful attention links (intra-pack dense, summary↔patch within a pack, summary↔summary across packs, and global sink), cutting compute to about 1% of fully dense attention without losing accuracy.
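The attention rules above can be written down as one boolean mask. Here is a minimal sketch (illustrative names, not the paper's actual code) that builds the four allowed link types for a toy patch grid; the real model works on much larger grids and learned tokens:

```python
import numpy as np

def topo_pack_mask(grid=6, k=3):
    """Boolean attention mask for a grid x grid patch grid with k x k packs,
    one summary token per pack, and one global sink token.
    True = attention allowed. A simplified sketch of the hierarchical
    sparse mask; all names here are illustrative."""
    n_packs_side = grid // k
    n_packs = n_packs_side ** 2
    n_patch = grid * grid
    n_tok = n_patch + n_packs + 1            # patches, then summaries, then global
    mask = np.zeros((n_tok, n_tok), dtype=bool)

    # which pack each patch belongs to, in row-major order
    pack_of = np.array([(r // k) * n_packs_side + (c // k)
                        for r in range(grid) for c in range(grid)])

    # 1) dense attention inside each pack (local details)
    mask[:n_patch, :n_patch] = pack_of[:, None] == pack_of[None, :]

    # 2) summary <-> patch links within the same pack (aggregation)
    for p in range(n_packs):
        s = n_patch + p
        members = np.where(pack_of == p)[0]
        mask[s, members] = True
        mask[members, s] = True

    # 3) summary <-> summary for long-range structure
    mask[n_patch:n_patch + n_packs, n_patch:n_patch + n_packs] = True

    # 4) global sink token talks to and hears from everything
    mask[-1, :] = True
    mask[:, -1] = True
    return mask
```

Counting the True entries shows why this is cheap: most patch pairs are simply never connected, so attention cost stays far below the dense all-to-all version.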

🍞 Hook: You know how bilingual friends translate between groups so everyone understands?

🥬 The Concept (Multi-modal Large Language Model, MLLM): An MLLM understands images and text together and can answer questions. How it works: 1) The slide encoder turns images into visual tokens; 2) A connector reshapes them; 3) The language model reads those tokens and produces answers. Why it matters: Without an MLLM, we can’t turn visual clues into clear, step-by-step clinical answers.

🍞 Anchor: It’s like showing pictures to a smart storyteller who explains what’s going on in words a doctor can use.

🍞 Hook: Imagine quizzes that start with big-picture questions and then ask for close-up proof.

🥬 The Concept (Hierarchical VQA): It asks and answers questions at three scales: whole slide (WSI), region (ROI), and patch. How it works: 1) Start with overall morphology; 2) Use that context to guide ROI questions; 3) Zoom further for patch-level, cell detail questions; 4) Tie everything back to diagnosis and staging. Why it matters: Without scale-aware questions, models can sound fluent but miss the actual microscopic evidence.

🍞 Anchor: It’s like a teacher asking, “What’s the book about?” then “What happens in chapter 3?” then “What does this sentence mean?”

03Methodology

High-level recipe: Input (gigapixel WSI) → Patch encoder (tile features) → Slide encoder with Sparse Topo-Pack Attention (local aggregation + global context) → Connector (Q-Former queries) → Language model (answers and explanations).

Step A: Patch encoding and grid layout

  • What happens: The huge slide is cut into non-overlapping P×P patches; a frozen feature encoder turns each patch into a D-dimensional vector arranged on a 2D grid.
  • Why it exists: Trying to feed the whole raw slide is impossible; features make the data manageable while keeping important patterns.
  • Example: A 2048×2048 region becomes a 64×64 grid of small patches, each now a vector like [0.2, −0.1, …].
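Step A can be sketched in a few lines. The random projection below stands in for the paper's frozen pathology encoder (which we don't have here); everything else, tiling and the 2D feature grid, follows the description above:

```python
import numpy as np

def tile_and_encode(slide, P=32, D=8, seed=0):
    """Cut an (H, W, 3) image array into non-overlapping P x P patches
    and map each patch to a D-dim feature on a 2D grid.
    The 'encoder' is a stand-in random projection, NOT the real
    frozen feature encoder from the paper."""
    H, W, _ = slide.shape
    gh, gw = H // P, W // P                       # grid dimensions
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((P * P * 3, D)) / np.sqrt(P * P * 3)
    feats = np.empty((gh, gw, D))
    for i in range(gh):
        for j in range(gw):
            patch = slide[i * P:(i + 1) * P, j * P:(j + 1) * P]
            feats[i, j] = patch.reshape(-1) @ proj   # one D-dim vector per tile
    return feats
```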

🍞 Hook: You know how neighborhoods make a big city easier to manage?

🥬 The Concept (Hierarchical Sampling for dataset construction): It selects WSIs, then ROIs, then local patches, following a pathologist’s workflow. How it works: 1) Find interesting regions with a graph method; 2) Clean out boring parts; 3) Sample 10× and 20× patches around ROI centers. Why it matters: Without guided sampling, you waste time on empty glass and miss the action.

🍞 Anchor: It’s like starting with a country map, then a city map, then a street map before deciding where to walk.

Step B: Sparse Topo-Pack Attention (the slide encoder)

  • What happens: The model builds packs (k=3 means 3×3 patches per pack), creates a summary token for each pack by re-encoding the local image, adds one global token from a resized whole slide, then applies a hierarchical sparse mask: intra-pack dense, summary↔patch for aggregation, summary↔summary for long-range, and a global sink for macro context.
  • Why it exists: It preserves 2D tissue structure, reduces redundant attention to far-away patches, and keeps both detail and context.
  • Why it exists: It preserves 2D tissue structure, reduces redundant attention to far-away patches, and keeps both detail and context.
  • Example: A tumor edge crossing three neighboring packs is captured because summaries from each pack talk to each other, while distant normal tissue doesn’t steal attention.

🍞 Hook: Think of a master key that opens only the right doors, saving time.

🥬 The Concept (Q-Former Connector): A lightweight module with learnable queries that pull the most important visual info into a fixed number of tokens the language model can digest. How it works: 1) Fix the slide encoder; 2) Train queries to attend to summary tokens; 3) Output a small set of visual-language-ready embeddings. Why it matters: Without the connector, the language model gets overwhelmed by too many, variable-length tokens.

🍞 Anchor: It’s like a team of journalists asking targeted questions and writing a short, clear brief for the editor.
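The key idea of the connector, a fixed set of learnable queries cross-attending over a variable number of summary tokens, can be sketched as one attention step. Weights here are random stand-ins for what training would learn; the real Q-Former also stacks self-attention and feed-forward layers:

```python
import numpy as np

def qformer_pool(summaries, n_queries=32, d=16, seed=0):
    """One cross-attention step of a Q-Former-style connector:
    n_queries learnable queries attend over however many pack summary
    tokens the slide produced, and emit a fixed-size output the
    language model can read. Random weights, for illustration only."""
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_queries, d))   # learnable in practice
    Wk = rng.standard_normal((summaries.shape[1], d))
    Wv = rng.standard_normal((summaries.shape[1], d))
    K, V = summaries @ Wk, summaries @ Wv
    scores = queries @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over summaries
    return attn @ V                                 # always (n_queries, d)
```

Note the point of the design: whether the slide yields 7 summaries or 700, the output is always exactly n_queries tokens, so the language model never gets overwhelmed.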

Training pipeline (three stages):

Stage 1: MAE pretraining (texture then structure)

  • What happens: The model learns by reconstructing what’s been masked. Phase 1 masks patches within packs (learn textures). Phase 2 masks entire packs (learn long-range structure).
  • Why it exists: It teaches the encoder both fine details and how big patterns fit together.
  • Example: Rebuilding missing tiles of a liver nodule, then later rebuilding whole missing neighborhoods.

🍞 Hook: Like solving jigsaw puzzles first with small missing pieces, then with whole chunks gone.

🥬 The Concept (MAE, Masked Autoencoder): A self-supervised way to learn by predicting missing parts of an image. How it works: 1) Hide some patches/packs; 2) Force the model to predict them; 3) Improve feature quality. Why it matters: Without MAE, features can be shallow and miss key tissue textures.

🍞 Anchor: It’s like practicing to draw a picture when parts are covered by sticky notes.
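The two masking phases differ only in what gets hidden. A toy sketch (grid sizes and ratios are illustrative, not the paper's settings): phase 1 hides random patches inside every pack, phase 2 hides entire packs:

```python
import numpy as np

def mae_mask(grid=6, k=3, phase=1, ratio=0.5, seed=0):
    """Boolean mask (True = hidden) over a grid x grid patch grid.
    Phase 1 hides random patches inside each pack (learn textures);
    phase 2 hides whole k x k packs (learn long-range structure).
    A sketch of the two-phase schedule, not the paper's exact code."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid, grid), dtype=bool)
    if phase == 1:
        for r0 in range(0, grid, k):
            for c0 in range(0, grid, k):
                # hide a fraction of the k*k patches in this pack
                for t in rng.permutation(k * k)[:int(k * k * ratio)]:
                    mask[r0 + t // k, c0 + t % k] = True
    else:
        packs = [(r0, c0) for r0 in range(0, grid, k)
                 for c0 in range(0, grid, k)]
        # hide a fraction of whole packs
        for r0, c0 in rng.permutation(packs)[:max(1, int(len(packs) * ratio))]:
            mask[r0:r0 + k, c0:c0 + k] = True
    return mask
```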

Stage 2: MoCo pretraining (semantic discrimination)

  • What happens: Use Momentum Contrast to make features of similar regions close and different regions far in feature space, focusing on summary tokens and operating at feature-level to avoid heavy I/O.
  • Why it exists: It sharpens the ability to tell look-alike regions apart and stabilizes learning.
  • Example: Two similar tumor margins become closer in the queue, while a tumor margin and normal bile duct move apart.

🍞 Hook: You know how you learn faces better by comparing them side by side?

🥬 The Concept (MoCo, Momentum Contrast): A method that builds a memory queue of negatives and a momentum-updated encoder to learn discriminative features. How it works: 1) Make positive pairs by noising the same token; 2) Keep many negatives in a queue; 3) Train with contrastive loss. Why it matters: Without MoCo, the model may confuse important but subtly different patterns.

🍞 Anchor: It’s like flashcards where you keep new examples flowing so you don’t mix them up.
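The contrastive objective behind MoCo is the InfoNCE loss: pull a query toward its positive (momentum-encoded) view and away from the negatives waiting in the queue. A minimal single-query sketch, with illustrative inputs rather than real summary-token features:

```python
import numpy as np

def info_nce(q, k_pos, queue, tau=0.07):
    """InfoNCE loss for one query vector: the positive is the
    momentum-encoded view k_pos, the negatives come from the memory
    queue. All vectors are L2-normalized first. A minimal MoCo-style
    sketch (tau is the usual temperature)."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, k_pos, queue = norm(q), norm(k_pos), norm(queue)
    logits = np.concatenate([[q @ k_pos], queue @ q]) / tau
    logits -= logits.max()                        # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Intuition check: when the query matches its positive and differs from the queue, the loss is near zero; when it matches a negative instead, the loss is large, which is exactly the pressure that pushes look-alike regions apart in feature space.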

Stage 3: Instruction tuning (alignment and diagnosis)

  • What happens: First, train the connector on image–text captions (HepatoPathoCaption) while keeping the big models frozen—this aligns visuals with words. Then fine-tune the connector and the language model with the multi-scale Q&A (HepatoPathoVQA) using LoRA for efficient updates.
  • Why it exists: It teaches the system to answer clinically and to chain reasoning from WSI to ROI to patch.
  • Example: “Low-power view shows a solitary nodule; ROI indicates nodule-in-nodule; patch reveals poor differentiation → diagnosis and pT stage.”

🍞 Hook: Imagine upgrading a bicycle by swapping only a few parts instead of rebuilding the whole thing.

🥬 The Concept (LoRA, Low-Rank Adaptation): A way to fine-tune large models by adding small trainable layers instead of changing all weights. How it works: 1) Insert low-rank adapters; 2) Train just those; 3) Keep everything else frozen for efficiency. Why it matters: Without LoRA, training would be too slow and memory-hungry.

🍞 Anchor: It’s like clipping on lightweight training wheels to adjust your ride without replacing the bike.
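The LoRA trick is easiest to see on a single linear layer: keep the big weight matrix W frozen and train only a small low-rank correction B·A. A sketch under the usual conventions (zero-initialized B so training starts from the frozen model; `r` and `alpha` are the standard rank and scaling hyperparameters):

```python
import numpy as np

class LoRALinear:
    """A frozen weight W plus a trainable low-rank update B @ A.
    Only A and B (r * (d_in + d_out) numbers) would be trained,
    instead of the full d_out x d_in matrix. Minimal sketch."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                   # frozen pretrained weight
        d_out, d_in = W.shape
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))                # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # frozen path + scaled low-rank path
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the adapted layer behaves exactly like the frozen one until training moves B, which is why LoRA fine-tuning is safe to bolt onto a working model.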

Dataset creation and reasoning

🍞 Hook: You know how a tour guide first describes the city skyline, then points out a district, then a single landmark?

🥬 The Concept (Hierarchical VQA, revisited within the dataset): The pipeline uses big-picture descriptions as context for the next zoom level, ensuring consistency from macro to micro. How it works: 1) Generate WSI captions; 2) Feed them as context to ROI prompts; 3) Then to patch prompts; 4) Produce multi-scale Q&As. Why it matters: Without chained context, answers at different scales can contradict each other.

🍞 Anchor: “Overall: solitary 3.0 cm mass; ROI: clear tumor border; Patch: poor differentiation; Conclusion: pT1b without microvascular invasion.”

Tools inside the sampling

🍞 Hook: Think of measuring how alike two songs are by the angle of their vibes, not their loudness.

🥬 The Concept (Cosine Similarity): A way to group patches into regions by how similar their features point, regardless of size. How it works: 1) Compute feature angles; 2) Connect similar patches with a Minimum Spanning Tree; 3) Grow compact ROIs. Why it matters: Without it, ROIs would be messy and miss coherent tissue regions.

🍞 Anchor: Like clustering similar-colored Lego bricks to build a clean section.
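Cosine similarity itself is one line, and the grouping idea can be shown with a simplified stand-in: grow a region from a seed patch by pulling in patches whose features point the same way (the paper uses a Minimum Spanning Tree; this greedy version just illustrates the similarity test):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: the angle between two feature vectors,
    ignoring their magnitudes."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_region(feats, seed_idx, thresh=0.9):
    """Grow a region from a seed patch by adding patches whose features
    point in a similar direction. A simplified stand-in for the paper's
    MST-based ROI growing; threshold is illustrative."""
    region = {seed_idx}
    for i, f in enumerate(feats):
        if i != seed_idx and cosine_sim(feats[seed_idx], f) >= thresh:
            region.add(i)
    return sorted(region)
```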

🍞 Hook: When groups speak, a spokesperson shares a clear summary.

🥬 The Concept (Summary Token): A compact representation of a local pack that captures its main diagnostic evidence. How it works: 1) Re-encode the pack image; 2) Store the essence; 3) Use it for long-range reasoning. Why it matters: Without summaries, the model either loses details or drowns in thousands of tokens.

🍞 Anchor: It’s like a bullet-point recap of a chapter you just read.

04Experiments & Results

The test: The team built HepatoPathoBench (a held-out split from their dataset) to check if the model can: 1) describe morphology (open-ended), 2) make diagnoses (open-ended), 3) answer single- and multi-choice questions, and 4) stay consistent across scales (WSI, ROI, Patch). They used METEOR (measures text quality), WSI-P (LLM-assisted clinical correctness), and Accuracy (for choice questions, with partial credit 0.5 for incomplete multi-choice).
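The choice-question scoring rule described above (full credit for exact answers, 0.5 for a correct-but-incomplete multi-choice pick) can be sketched as a tiny function. This is my reading of the stated rule, not the benchmark's actual scoring code:

```python
def choice_score(pred, gold):
    """Score one choice answer: 1.0 for an exact set match, 0.5 when the
    prediction is a non-empty, correct-but-incomplete subset of the gold
    options, 0 otherwise. A sketch of the partial-credit rule as stated."""
    pred, gold = set(pred), set(gold)
    if pred == gold:
        return 1.0
    if pred and pred < gold:        # proper subset: right picks, but missing some
        return 0.5
    return 0.0
```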

The competition: They compared three groups of 7B models: 1) general medical LLMs (HuatuoGPT, Lingshu), 2) thumbnail-based pathology MLLMs (Quilt-LLaVA, Patho-R1), and 3) WSI-based pathology MLLMs (SlideChat, WSI-LLaVA). Hepato-LLaVA starts from WSI-LLaVA and adds the new encoder, connector, and training pipeline via LoRA.

The scoreboard (with context):

  • Average score: Hepato-LLaVA 0.83, which is like scoring an A, while the best WSI-based runner-up, SlideChat, got 0.66 (a solid C+). Thumbnail models lagged at 0.50–0.57 (D to low C range) because shrinking slides erased key details.
  • Open-ended WSI-P: Morphology 0.79 and Diagnosis 0.75 for Hepato-LLaVA, beating SlideChat (0.70, 0.72). That means the free-text answers matched expert-like reasoning more often.
  • Close-ended accuracy: Morphology single-choice hit 0.97 (A+), and multi-choice 0.88, clearly ahead of Patho-R1 (0.87, 0.50), even though Patho-R1 uses reinforcement learning tricks.
  • Multi-scale stability: Hepato-LLaVA scored 0.82 (WSI), 0.83 (ROI), 0.83 (Patch) versus its backbone WSI-LLaVA at 0.65, 0.67, 0.64. This shows the new attention and connector really help across zoom levels.

Surprising findings:

  • Sparser can be smarter: Using a 32-query Q-Former connector outperformed using 500 queries or all pack tokens. More tokens added noise and hurt accuracy. This supports the idea that diagnostic signals are sparse and should be summarized.
  • Two-stage connector training helps: Pretraining the connector on captions (with the big models frozen), then doing VQA fine-tuning, improved ROI accuracy by about +2.76% compared to direct fine-tuning.
  • Consistency across scales: The model didn’t trade off WSI-level understanding for patch-level detail; it kept both strong, mirroring how a pathologist works.

Qualitative case: For a solitary 3.0 cm HCC without microvascular invasion, the model gave the correct pT1b stage and explained why (pT1a if ≤2 cm; pT2 if vascular invasion present), showing reasoning that aligns with AJCC 8th edition rules.

05Discussion & Limitations

Limitations:

  • Domain specificity: The system is tailored to hepatocellular carcinoma. Applying it directly to other cancers may need retraining and possibly different topology rules.
  • Data generation bias: Although expert-validated, the hierarchical captions and Q&As partly come from an external model (Gemini-3-flash), which could pass along its own biases or gaps.
  • Resolution constraints: While multi-scale is supported, extreme edge cases (unusual artifacts, rare variants) may still challenge the encoder.
  • Compute and storage: Training on gigapixel slides requires strong GPUs, careful I/O handling, and a feature-store strategy, even with sparsity.

Required resources:

  • A slide scanner pipeline or access to public WSI datasets (e.g., TCGA) and institutional data use approvals.
  • GPUs with enough memory to handle feature grids and sparse attention, plus storage for precomputed features.
  • Clinical expertise for validation and prompt design.

When not to use:

  • If you only have small, cropped images (no WSI context), simpler models may suffice.
  • If your task is pure patch classification without need for long-range structure, the sparse topology benefits may be minor.
  • In time-critical, on-device settings with very limited memory, even sparse attention might be too heavy without distillation.

Open questions:

  • Generalization: How well does Sparse Topo-Pack Attention transfer to other organs (breast, lung) and staining variations?
  • Active sampling: Can the model learn where to zoom next by itself, like a pathologist moving the microscope?
  • Trust and auditing: How to attach calibrated uncertainty and transparent evidence chains for clinical deployment?
  • Few-shot adaptation: Can LoRA adapters for new hospitals or scanners be learned quickly while preserving safety and accuracy?

06Conclusion & Future Work

Three-sentence summary: Hepato-LLaVA is a multi-modal AI that reads giant liver slides by grouping nearby tiles into small packs, summarizing them, and sharing only the right information across the slide. A new multi-scale dataset (HepatoPathoVQA) and a three-stage training pipeline teach it to reason from zoomed-out context to zoomed-in details. This approach greatly improves accuracy over existing models while staying efficient.

Main achievement: Introducing Sparse Topo-Pack Attention—a topology-aware, hierarchical, and sparse attention scheme that preserves local diagnostic evidence, maintains global coherence, and slashes redundant computation.

Future directions: Extend the method to other cancers and stains, add active zoom policies that choose the next ROI automatically, and develop stronger uncertainty estimates and safety checks for clinical use. Lighter distilled versions could bring near-real-time assistance to more hospitals.

Why remember this: It shows that respecting the tissue’s 2D map and mirroring a pathologist’s workflow—local inspection, summarization, and global comparison—lets AI keep both detail and context, turning overwhelming gigapixel images into clear, clinically useful answers.

Practical Applications

  • Assist pathologists with second-opinion reports that link WSI, ROI, and patch evidence to a final diagnosis.
  • Pre-screen large slide batches to flag suspicious regions for priority review.
  • Generate structured pathology captions that summarize key morphology for tumor boards.
  • Provide training cases for residents with step-by-step, multi-scale reasoning and staged Q&A.
  • Support staging decisions (e.g., AJCC pT categories) with clear citations of visual evidence.
  • Standardize reporting across institutions by aligning free-text findings to consistent diagnostic language.
  • Enable telepathology consults by packaging concise summary tokens and explanations for remote experts.
  • Adapt to new scanners or sites using LoRA adapters without retraining the entire model.
  • Speed up research studies by mining large WSI cohorts for morphological patterns linked to outcomes.
  • Facilitate quality control by highlighting cases where model uncertainty suggests a manual double-check.
#Hepato-LLaVA#Hepatocellular Carcinoma#Whole Slide Images#Sparse Attention#Topology-aware Modeling#Multi-modal Large Language Model#Hierarchical VQA#Q-Former#LoRA#Masked Autoencoder#Momentum Contrast#Cosine Similarity#Multiple Instance Learning#Digital Pathology#WSI Slide Encoder