
From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Intermediate
Hongrui Jia, Chaoya Jiang, Shikun Zhang et al. · 2/26/2026
arXiv

Key Summary

  • Large multimodal models (LMMs) can look at pictures and read text, but they still miss tricky cases, like tiny chart labels or multi-step math.
  • This paper introduces DPE, a diagnose-and-correct training loop that first finds a model’s blind spots and then makes custom practice exactly for those weaknesses.
  • A team of helper agents searches the web for fresh images, edits them when needed, writes questions, and double-checks answers so the practice is realistic and correct.
  • DPE controls the mix of practice topics using a diagnostic report, so the model spends more time on what it’s bad at and less on what it already knows.
  • Training uses reinforcement learning with verifiable rewards, so the model gets credit only when its answers and reasoning truly match the ground truth.
  • With only about 3–4 thousand targeted examples, DPE beats training on a much larger static dataset (about 47 thousand) across many benchmarks.
  • Compared to a popular self-evolving method (VisPlay), DPE improves more, avoids ups-and-downs during training, and transfers well to stronger base models.
  • DPE especially helps long-tail skills like OCR on real charts, multi-image reasoning, and visual math that used to stall or regress.
  • Diversity checks show DPE keeps both text and images varied over time, preventing the model from overfitting to narrow patterns.
  • Overall, DPE is a scalable, steady way to keep multimodal models improving under open-ended, real-world tasks.

Why This Research Matters

Real-world tasks mix text and visuals: reading receipts, understanding charts, following maps, and checking math from photos. DPE helps AI handle these reliably by focusing practice on exactly what the AI currently struggles with. Because it can search and edit images, DPE covers unusual layouts and long-tail cases that trip models up in everyday use. Verifiable rewards reduce confident mistakes, which means safer assistants for classrooms, offices, and homes. With smaller, smarter datasets, teams can improve models faster and cheaper. Over time, this approach can power more trustworthy tools for learning, analysis, accessibility, and decision support.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how a student can read a story and also look at pictures to answer questions, but still get stuck on tiny details like small labels or tricky diagrams? That’s what happens to today’s AI that reads and sees.

🥬 Filling (The Actual Concept):

  • What it is: Large Multimodal Models (LMMs) are AIs that can understand both pictures and text together.
  • How it works: They take in images and words, build an internal picture of what’s going on, and then answer questions or solve problems.
  • Why it matters: Without good training on many kinds of pictures and questions, they miss rare or hard situations—like faint chart legends or multi-step geometry.

🍞 Bottom Bread (Anchor): Imagine asking an AI, “What is the tallest building shown in this skyline photo?” and “What does this tiny note under the chart say?” LMMs need to do both well to help in the real world.

The World Before: For years, we trained LMMs on static datasets: big collections of images and text that don’t change. This worked for common, easy cases but hit a wall on rare, long-tail tasks. Even as reinforcement learning (RL) made language models much better at reasoning, multimodal models still struggled: there wasn’t enough clean, labeled visual-reasoning data, and training recipes didn’t adjust to the model’s actual mistakes.

🍞 Top Bread (Hook): Imagine practicing piano with the same five songs forever. You’d improve a bit, but your weak fingers would stay weak.

🥬 Filling (The Actual Concept):

  • What it is: Static data means the practice set never changes.
  • How it works: Models see the same distribution of examples over and over, even after they’ve already mastered many of them.
  • Why it matters: Time gets wasted on what’s already easy, while tricky, rare skills (the long tail) stay weak.

🍞 Bottom Bread (Anchor): It’s like a soccer team that only practices dribbling straight ahead, so they stumble whenever a defender pressures from the side.

Researchers tried “self-evolving” training: the model generates new questions and answers for itself. That sounds great, but two big issues popped up.

🍞 Top Bread (Hook): You know how guessing what you did wrong on a math test without seeing the teacher’s notes can lead you to study the wrong thing?

🥬 Filling (The Actual Concept):

  • What it is: Lack of interpretable diagnostics means the system doesn’t clearly say which skill failed and why.
  • How it works: Past methods rely on rough signals (like uncertainty) instead of pinpointing the exact weakness (e.g., “misread axis units”).
  • Why it matters: The model keeps making fancy but unfocused practice, drifting away from real gaps.

🍞 Bottom Bread (Anchor): It’s like telling a student “be less confused” instead of “you mix up area and perimeter—study that specifically.”

🍞 Top Bread (Hook): Imagine writing new quiz questions but always on the SAME few pictures. You can only get so creative.

🥬 Filling (The Actual Concept):

  • What it is: Scarcity of visual diversity means training reuses the same images, limiting coverage of rare scenes.
  • How it works: Text prompts change, but pictures don’t, so layout variety, fonts, and unusual visuals aren’t covered.
  • Why it matters: Skills like OCR or chart reading hit a plateau and can even get worse.

🍞 Bottom Bread (Anchor): If every page you practice on uses the same font and layout, you’ll falter on a messy, real-life receipt.

The Gap: We needed a loop that (1) clearly diagnoses exact blind spots, (2) generates fresh, targeted visual practice from large image pools, and (3) reinforces only when answers are verifiably correct. That’s what this paper builds.

🍞 Top Bread (Hook): Think of a great coach who watches your game, names the exact moves you need to practice, fetches the right drills, and scores you fairly.

🥬 Filling (The Actual Concept):

  • What it is: Diagnostic-driven Progressive Evolution (DPE) is a repeatable loop: diagnose → generate targeted data → reinforce → re-diagnose.
  • How it works: A diagnostic agent finds weaknesses, a team of helper agents creates new, verified practice focused on those weaknesses, and RL updates the model; then it checks again.
  • Why it matters: The model spends most of its effort where it needs it most, avoiding wasted practice and instability.

🍞 Bottom Bread (Anchor): If the model often misreads chart legends, DPE creates more varied chart images with clear and tricky legends and asks legend-focused questions. The model improves exactly where it’s weak.

Real Stakes: This matters beyond benchmarks. In daily life we read street signs, receipts, forms, and medical charts; we interpret maps and diagrams; we compare photos or frames in videos. Without precise, targeted practice, multimodal AIs will keep stumbling on the very details people rely on. With DPE’s smart, evolving practice, assistants can help check homework involving diagrams, summarize meeting whiteboard photos, or read complex charts—more reliably and safely.

02Core Idea

🍞 Top Bread (Hook): You know how the fastest way to get better at a sport is to find your bad habits, drill them with the right equipment, and get instant feedback?

🥬 Filling (The Actual Concept):

  • What it is: The Aha! is to turn training into a diagnose-and-correct spiral where each round targets the exact skills the model currently lacks.
  • How it works: First, diagnose failures by skill category (like OCR, charts, math). Next, generate new, diverse images and questions that stress those weak skills. Finally, reinforce with verifiable rewards and repeat.
  • Why it matters: This keeps learning focused, stable, and efficient, even with tiny data budgets.

🍞 Bottom Bread (Anchor): If the model drops steps in math proofs, the next round includes images of math problems that require step-by-step answers and checks that each step is present.

Multiple Analogies:

  • Tutor Analogy: A tutor doesn’t give random worksheets. They check which topics you missed, then assign carefully chosen problems. That’s DPE.
  • Doctor Analogy: A doctor runs tests (diagnosis), prescribes targeted therapy (generation), and monitors recovery (re-diagnose). That’s DPE.
  • Game Coach Analogy: A coach reviews game tape, notes weak plays, sets up drills that recreate those situations, and measures improvement. That’s DPE.

🍞 Top Bread (Hook): Imagine first introducing the team, then the playbook, then the referee, in that order.

🥬 Filling (The Actual Concept):

  • What it is: DPE’s Building Blocks are (1) Adaptive Diagnosis, (2) Multi-Agent Data Generation with tools, and (3) Reinforcement Learning with verifiable rewards.
  • How it works: Diagnosis assigns category proportions and specific failure patterns; agents fetch/edit images and craft matching questions; RL updates the model based on checked answers; then the loop repeats.
  • Why it matters: Each block fixes a past pain point—unclear gaps, stale visuals, and unreliable feedback.

🍞 Bottom Bread (Anchor): If OCR is low accuracy due to tiny fonts, diagnosis boosts the OCR quota; agents find receipts, invoices, and signs with varied fonts; RL rewards only exact text matches.

Before vs After:

  • Before: Training followed a fixed recipe, hoping that bigger datasets would cover everything. Self-evolving methods guessed where to go, often drifting.
  • After: Training steers itself. It measures, aims, and fires—again and again—so gains are broad and steady, especially on the long tail.

🍞 Top Bread (Hook): You know how practice should be not too easy and not too hard—just right—to help you learn fastest?

🥬 Filling (The Actual Concept):

  • What it is: Difficulty-aware selection means keeping moderately hard samples that give strong learning signals.
  • How it works: If tasks are always solved or always failed, the model learns little; mid-difficulty examples produce bigger, clearer updates.
  • Why it matters: This boosts improvement per example—crucial when you only have a few thousand training items.

🍞 Bottom Bread (Anchor): Think of choosing math problems with a 50/50 success rate—tough enough to stretch you, but not so hard you give up.

🍞 Top Bread (Hook): Picture a project manager who sets quotas: this many chart tasks, that many map tasks.

🥬 Filling (The Actual Concept):

  • What it is: Category proportion control means the data mix is explicitly set each round based on diagnostic results.
  • How it works: Weak categories get more samples; strong ones get fewer, keeping practice balanced and efficient.
  • Why it matters: Prevents overtraining on mastered skills and neglecting weak ones.

🍞 Bottom Bread (Anchor): If chart understanding dropped last round, the next set increases chart questions with varied legends and units until performance rebounds.
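The quota idea above can be sketched in a few lines. Inverse-accuracy weighting is an illustrative assumption here (the paper sets category proportions from its diagnostic report, but does not necessarily reduce to this exact formula); `quotas_from_accuracy` and the category names are hypothetical:

```python
def quotas_from_accuracy(accuracy, total_samples):
    """Allocate more of the next round's sample budget to weaker categories.

    Inverse-accuracy weighting is an illustrative choice, not the paper's
    exact allocation rule.
    """
    # A category at 90% accuracy gets weight 0.1; one at 40% gets 0.6.
    weights = {cat: 1.0 - acc for cat, acc in accuracy.items()}
    total_w = sum(weights.values())
    return {cat: round(total_samples * w / total_w) for cat, w in weights.items()}

acc = {"charts": 0.55, "ocr": 0.40, "geometry": 0.70, "natural_scenes": 0.90}
q = quotas_from_accuracy(acc, total_samples=1000)
print(q)  # OCR (weakest) gets the largest share; natural scenes the smallest.
```

The same rule automatically rebalances in later rounds: once OCR accuracy recovers, its weight shrinks and the freed budget flows to whatever is now weakest.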

Why It Works (intuition, no equations):

  • Clear signals: Verifiable rewards mean the model gets credit only for truly correct, grounded answers—reducing hallucinations.
  • Right difficulty: Mid-hardness examples carry the most teaching power, so each update counts.
  • Fresh visuals: New images and edits prevent overfitting to familiar layouts and fonts, covering long-tail cases.
  • Tight loop: Continuous diagnose-generate-reinforce aligns training with current needs, reducing instability and wasted effort.

🍞 Top Bread (Hook): Imagine a school that updates tomorrow’s lesson plan using today’s quiz mistakes.

🥬 Filling (The Actual Concept):

  • What it is: DPE is a self-correcting, evolving curriculum tailor-made for an LMM’s changing abilities.
  • How it works: Today’s failures set tomorrow’s quotas, images, and question styles; RL locks in the new skill before the next check.
  • Why it matters: That’s how you turn blind spots into gains.

🍞 Bottom Bread (Anchor): After seeing the model confuse chart legends with labels, DPE purposely adds more legend-focused tasks from diverse charts, fixes the confusion next round, and moves on to the next weakness.

03Methodology

High-Level Recipe: Input (current model) → Diagnose weaknesses → Generate targeted, verified data → Reinforcement learning update → Output (better model) → Repeat.

🍞 Top Bread (Hook): You know how a good study plan starts with a check-up test, then picks the right exercises, then re-tests?

🥬 Filling (The Actual Concept):

  • What it is: Iterative Training is repeating a cycle of measure → practice → improve.
  • How it works: Each loop starts with diagnosis, builds the right practice set, updates the model, and then measures again.
  • Why it matters: Skills don’t improve all at once—this keeps practice focused over time.

🍞 Bottom Bread (Anchor): Like weekly piano lessons: play a piece (diagnose), get exercises for weak spots (generate), and return next week better (reinforce & repeat).

Step A: Adaptive Diagnosis

🍞 Top Bread (Hook): Imagine sorting a report card by subjects—math, reading, science—and writing a note under each about what went wrong.

🥬 Filling (The Actual Concept):

  • What it is: The Diagnostic Mechanism tests the model on a small, diverse set, scores both reasoning steps and final answers, and tags failures by capability category.
  • How it works: It samples about 200 tasks from a diagnostic pool and groups them into 12 visual categories (e.g., charts, OCR-heavy images, geometry, maps, natural scenes). For each category, it computes accuracy and summarizes error patterns (like “ignored axis units” or “dropped a math step”). It then sets category quotas for the next round.
  • Why it matters: Without precise attribution and quotas, generation would spray effort everywhere and miss weak skills.

🍞 Bottom Bread (Anchor): If text-heavy images (OCR) score low and common errors say “misread small fonts,” the next round reserves more OCR slots with tiny-font images.
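The aggregation step might look like this sketch, assuming per-task results already carry a category and an error tag (the function and field names are hypothetical; the paper's 12-category scheme and step-level scoring are richer than this):

```python
from collections import Counter, defaultdict

def build_diagnostic_report(results):
    """Aggregate per-task diagnostic results into a per-category report.

    `results` is a list of dicts like
    {"category": "ocr", "correct": False, "error_tag": "misread small fonts"}.
    """
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r)
    report = {}
    for cat, items in by_cat.items():
        acc = sum(r["correct"] for r in items) / len(items)
        # Most common failure patterns among the incorrect answers.
        errors = Counter(r["error_tag"] for r in items if not r["correct"])
        report[cat] = {"accuracy": acc, "top_errors": errors.most_common(3)}
    return report

results = [
    {"category": "ocr", "correct": False, "error_tag": "misread small fonts"},
    {"category": "ocr", "correct": False, "error_tag": "misread small fonts"},
    {"category": "ocr", "correct": True,  "error_tag": None},
    {"category": "charts", "correct": False, "error_tag": "ignored axis units"},
    {"category": "charts", "correct": True,  "error_tag": None},
]
report = build_diagnostic_report(results)
print(report["ocr"])  # low accuracy plus 'misread small fonts' as top error
```

A report like this gives the next stage both a number (accuracy, which drives quotas) and a label (the error pattern, which drives what kind of images and questions to generate).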

Step B: Multi-Agent Questioner System

🍞 Top Bread (Hook): Think of a team where one plans, one finds pictures, one writes questions, and one checks the work.

🥬 Filling (The Actual Concept):

  • What it is: A Multi-Agent System turns the diagnostic report into a verified training set with the right category mix.
  • How it works: Four agents collaborate: Planner (sets category and instructions), Image Selector (searches, filters, edits/combines images), Question Generator (writes questions and answers), and Validation (approves only correct, solvable, well-formatted samples). The team respects category quotas so the final set matches diagnostic goals.
  • Why it matters: Without roles and checks, data gets noisy, off-target, or mislabeled, which confuses training.

🍞 Bottom Bread (Anchor): The Planner asks for a bar chart with a legend and tricky units; the Image Selector finds/edits it; the Question Generator asks “What value does the red bar represent in kilograms?”; Validation confirms the answer is verifiable and formatted correctly.

Agent Details (What/Why/Example):

  • Planner Agent
    • What: Translates category quotas and failure notes into per-sample plans (target category, image needs, question type, and focus).
    • Why: Ensures every sample connects directly to a known weakness.
    • Example: “Category: charts; Image: must include axis labels and legend; Question: numeric with units; Focus: legend-unit confusion.”
  • Image Selector Agent (Search–Filter–Edit)
    • What: Pulls images from a large external pool, filters for quality and fit, and edits (crop/overlay/compose) to build edge cases.
    • Why: Fresh, varied visuals stop overfitting to one layout and cover the long tail.
    • Example: Stitch two floor plans and ask to compare room areas; overlay faint labels to test OCR.
  • Question Generator Agent
    • What: Writes the question and reference answer that match the plan and image.
    • Why: Keeps questions aligned to weaknesses and ensures answers are checkable.
    • Example: “Which route on the map is shorter? Answer in miles.”
  • Validation Agent
    • What: Gates samples on category fit, solvability, answer verifiability, and format.
    • Why: Removes noisy or unsolvable items that would destabilize learning.
    • Example: Reject a multiple-choice question if options are missing.
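The four-agent flow can be sketched as plain control flow, with each LLM or tool call replaced by a stub (`run_generation_round` and all field names are hypothetical; only the Planner → Image Selector → Question Generator → Validation ordering mirrors the paper):

```python
def run_generation_round(quotas, failure_notes, image_pool):
    """Sketch of the multi-agent pipeline. Each agent is a plain function
    standing in for an LLM/tool call; real search, editing, and validation
    are far richer than this."""
    accepted = []
    for category, count in quotas.items():          # Planner: respect quotas.
        for _ in range(count):
            plan = {"category": category,
                    "focus": failure_notes.get(category, "general")}
            # Image Selector: pick a fitting image (search/edit omitted).
            image = next((img for img in image_pool
                          if img["category"] == category), None)
            if image is None:
                continue
            # Question Generator: write a question and reference answer.
            sample = {
                "category": category,
                "image": image["id"],
                "question": f"[{plan['focus']}] question about {image['id']}",
                "answer": "reference answer",
            }
            # Validation gate (simplified): require question and answer.
            if sample["question"] and sample["answer"]:
                accepted.append(sample)
    return accepted

pool = [{"id": "chart_001", "category": "charts"},
        {"id": "sign_007", "category": "ocr"}]
batch = run_generation_round({"charts": 2, "ocr": 1},
                             {"charts": "legend-unit confusion"}, pool)
print(len(batch))
```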

Step C: Reinforcement Learning with Verifiable Rewards

🍞 Top Bread (Hook): Think of a scoring judge who only gives points when your answer is truly correct and follows the rules.

🥬 Filling (The Actual Concept):

  • What it is: Reinforcement Learning (RL) updates the model using rewards based on whether answers are correct and well-grounded.
  • How it works: The model tries answers; it gets a reward when the final answer (and sometimes reasoning steps) match the reference. A method called GRPO (Group Relative Policy Optimization) keeps updates stable by normalizing rewards within groups of sampled answers and keeping the model close to a safe reference.
  • Why it matters: Without reliable rewards, the model could learn to be confidently wrong or to hallucinate details.

🍞 Bottom Bread (Anchor): For a chart question requiring a numeric answer with units, the model is rewarded only if both the number and units match the ground truth.
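A toy version of the reward and the grouped normalization might look like this (the regex-based unit check and the helper names are illustrative assumptions; GRPO's full objective also includes a clipped policy ratio and a KL penalty toward the reference model, omitted here):

```python
import re
import statistics

def verifiable_reward(prediction, reference):
    """Reward 1.0 only when the number AND the units both match the
    reference -- a simplified stand-in for verifiable-reward checks."""
    pat = r"(-?\d+(?:\.\d+)?)\s*([a-zA-Z]+)"
    p, r = re.search(pat, prediction), re.search(pat, reference)
    if not (p and r):
        return 0.0
    return float(p.group(1) == r.group(1)
                 and p.group(2).lower() == r.group(2).lower())

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of rollouts
    sampled for the same prompt (mean-centered, std-scaled)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Four rollouts for one chart question; only two match "12 kg" exactly.
rollouts = ["12 kg", "12 kg", "12 lbs", "13 kg"]
rewards = [verifiable_reward(ans, "12 kg") for ans in rollouts]
advs = grpo_advantages(rewards)
print(rewards)  # unit or number mismatches score 0.0
print(advs)     # correct rollouts get positive advantage, wrong ones negative
```

Note how "12 lbs" earns nothing despite the right number: the reward only fires when the answer is fully grounded, which is what discourages confident-but-wrong outputs.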

Difficulty-Aware Filtering

🍞 Top Bread (Hook): You learn most from problems that you sometimes solve and sometimes miss—just hard enough.

🥬 Filling (The Actual Concept):

  • What it is: Keep moderately difficult samples that carry the strongest learning signals.
  • How it works: Extremely easy or impossible items teach less; mid-difficulty items provide more informative gradients for RL.
  • Why it matters: With small data budgets, maximizing learning per sample is crucial.

🍞 Bottom Bread (Anchor): If the model solves almost all “read the big title” tasks, those get down-weighted; instead, it practices “read tiny footnotes” more.
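As a sketch, the filter could keep samples whose measured success rate falls in a middle band (the 0.2–0.8 thresholds and all names here are assumptions, not the paper's values):

```python
def keep_mid_difficulty(samples, success_rates, low=0.2, high=0.8):
    """Keep samples the model sometimes solves and sometimes misses.
    Near-certain successes and near-certain failures carry weak learning
    signals, so they are dropped."""
    return [s for s in samples if low <= success_rates[s["id"]] <= high]

samples = [{"id": "big_title"}, {"id": "tiny_footnote"}, {"id": "impossible_blur"}]
rates = {"big_title": 0.98, "tiny_footnote": 0.50, "impossible_blur": 0.02}
kept = keep_mid_difficulty(samples, rates)
print([s["id"] for s in kept])  # only the mid-difficulty task survives
```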

Iterative Loop and Stability

🍞 Top Bread (Hook): Imagine a spiral staircase—each turn climbs higher but also checks footing before the next step.

🥬 Filling (The Actual Concept):

  • What it is: DPE repeats diagnose → generate → reinforce for several rounds.
  • How it works: After each update, the system re-diagnoses, adjusts quotas, and refreshes images and questions; validation keeps quality high.
  • Why it matters: Prevents distribution drift and oscillations that can plague self-evolving pipelines.

🍞 Bottom Bread (Anchor): Over three rounds, OCR accuracy rises as the mix targets OCR more at first, then balances back once it improves.

Secret Sauce (What makes it clever?):

  • Explicit failure attribution: Know exactly what broke and why.
  • Category quotas: Turn insights into concrete data budgets.
  • Tool use for images: Break free from static datasets to cover long-tail visuals.
  • Verifiable rewards + right difficulty: Reward only true success, on samples that teach most.
  • Tight, repeating loop: Convert blind spots into steady gains.

04Experiments & Results

The Test: The authors trained two open-source LMMs (Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct) using either DPE or a popular self-evolving baseline (VisPlay). They evaluated across 11 tough benchmarks covering STEM reasoning (like MMMU and MMStar), visual math (MathVista, MathVerse, MathVision), OCR (CharXiv, ChartQA), multi-image reasoning (BLINK), and hallucination control (HallusionBench). Accuracy was the metric throughout.

🍞 Top Bread (Hook): Think of comparing two training plans for a team across many different games—math quizzes, map reading, chart decoding, and spotting made-up facts.

🥬 Filling (The Actual Concept):

  • What it is: A head-to-head test of DPE vs. VisPlay and also a check against strong state-of-the-art systems.
  • How it works: Keep settings fair (same base models, similar rounds), then measure gains per benchmark and overall.
  • Why it matters: We want to know if focusing on blind spots really beats generating more of the same.

🍞 Bottom Bread (Anchor): If DPE says “charts are weak,” we expect chart benchmarks like CharXiv to improve the most—and they do.

The Competition: VisPlay represents a strong self-evolving method. State-of-the-art models include powerful, much larger or proprietary systems. DPE’s claim is not just to beat VisPlay but to do more with less data compared to static training.

Scoreboard with Context:

  • Against VisPlay on Qwen2.5-VL-7B:
    • CharXiv (OCR-heavy): DPE climbs about +4.11 points over VisPlay by Iteration 3, a notable jump on hard, real-world charts.
    • MMMU (broad STEM): DPE steadily rises (around 54.44 → 56.44), while VisPlay oscillates—showing DPE’s stable targeting.
    • HallusionBench: DPE edges ahead (about 69.19% vs. 68.35%), meaning fewer confident mistakes.
    • Overall: DPE’s averages trend smoothly upward, avoiding the dips that VisPlay shows on tasks like BLINK and MMMU.
  • Transfer to Qwen3-VL-8B (a stronger base):
    • Big jumps on MMMU (+3.67) and MMStar (+10.86), showing that DPE still adds value even when starting strong.
  • Beating bigger/closed models on hard tasks:
    • With an 8B backbone, DPE averages around mid-60s and sets new highs on visual math (e.g., MathVista ~76.2 and MathVision ~53.88), surpassing much larger open models and even proprietary ones on some tasks.
    • Hallucination control is also top-tier (around 74+ on HallusionBench), outperforming GPT-4o in these tests.

Data Efficiency Surprises:

  • Static training with ~47K examples hits plateaus on key tasks.
  • DPE, with only ~3–4K tightly targeted examples generated over iterations, beats the large static set on benchmarks like MMMU, HallusionBench, MathVista, and RealWorldQA. That’s like getting an A+ after studying a custom 3-chapter guide instead of a whole 30-chapter book that doesn’t match your weaknesses.

Diversity and Quality Checks:

  • Text and Image Diversity: DPE keeps increasing diversity across iterations (higher mean pairwise distances in embeddings). VisPlay spikes early but then slides, indicating narrowing coverage. More diversity means less overfitting to certain layouts or phrasing.
  • Question Quality (judged by three independent LLMs): DPE maintains near-ceiling clarity, solvability, and correctness scores across iterations, while VisPlay’s quality notably drops in later rounds. This shows DPE’s validation and planning keep the data clean and helpful.

What Stood Out:

  • Targeted OCR gains: When diagnosis boosted chart/text-dense quotas, CharXiv improved steadily across iterations.
  • Stability: DPE avoids the improve-then-drop patterns seen when diagnostics are removed.
  • Image tools matter: Removing search/editing limits visual variety and caps gains—especially for OCR and math layouts.

Takeaway: DPE doesn’t just inch forward; it lifts multiple hard skills at once, with fewer samples, more stability, and stronger transfer compared to self-evolving baselines that lack explicit diagnosis or visual diversity.

05Discussion & Limitations

Limitations:

  • Compute and orchestration: Running multiple high-quality agents (planning, retrieval/editing, question writing, validation) and RL updates requires coordination and resources.
  • Dependence on external image pools: If the pool is biased (e.g., fonts, cultures, diagram styles), generated data may reflect those biases unless carefully curated.
  • Diagnostic pool coverage: If the small diagnostic sample misses certain rare skills, quotas may undercorrect them.
  • Reward verification scope: Some complex, open-ended tasks are hard to verify automatically; over-focusing on easily verifiable tasks could skew practice if not balanced.

Required Resources:

  • Access to capable agent models for planning, generation, and validation; an image search API; basic image editing tools; and an RL finetuning stack.
  • A seed dataset and a diagnostic pool with category labels to kick-start meaningful analysis.

When Not to Use:

  • Highly constrained settings where external image retrieval/editing is prohibited or data cannot leave a secure boundary.
  • Tasks with inherently subjective answers (e.g., aesthetics) where verifiable rewards are unreliable.
  • Ultra-low compute environments where multi-agent generation and RL are impractical.

Open Questions:

  • Richer diagnostics: Can we integrate finer-grained failure tags (e.g., reading curved text, glare effects) or automatically discover new categories?
  • Better verification: How to reliably score step-by-step reasoning across diverse modalities and formats?
  • Continual deployment: How to safely run DPE online so models keep learning from fresh, real-world streams without drifting or amplifying bias?
  • Efficiency: Can we compress the agent team or share computations to reduce cost while keeping quality and diversity high?
  • Generalization: How well does DPE handle video, 3D scenes, or cross-lingual OCR at scale?

Overall, DPE is a strong step toward adaptive, data-efficient multimodal training, but it still relies on careful engineering of diagnostics, tools, and verification to realize its full potential.

06Conclusion & Future Work

Three-Sentence Summary: This paper presents DPE, a diagnose-and-correct training loop for multimodal models that turns blind spots into targeted practice using multi-agent data generation and verifiable-reward RL. By explicitly controlling category quotas and refreshing visuals via search and editing, DPE avoids the stagnation and instability common in self-evolving pipelines. With only a few thousand, well-chosen examples, DPE achieves broad, steady gains across 11 benchmarks, especially on long-tail skills like OCR and visual math.

Main Achievement: Converting opaque self-evolution into an interpretable, quota-driven loop that continuously sources diverse images and crafts weakness-focused questions, enabling stable and data-efficient improvement.

Future Directions: Enrich diagnostic signals (discover new failure types automatically), broaden modalities (video, 3D, multilingual OCR), and strengthen verification of reasoning steps. Streamline the multi-agent pipeline to reduce cost and latency, and explore safe, online continual training with bias checks.

Why Remember This: Training that measures-first, aims, and then practices—again and again—beats bigger-but-blinder approaches. DPE shows that smart targeting, fresh visuals, and reliable feedback can push multimodal reasoning further with far less data, making real-world assistants sharper, steadier, and more trustworthy.

Practical Applications

  • Document and receipt reading assistants that handle messy scans, tiny fonts, and unusual layouts.
  • Business dashboard helpers that interpret complex charts with correct units, legends, and comparisons.
  • Education tools that solve visual math problems step by step from worksheets or whiteboard photos.
  • Navigation and map-reading assistants that compare routes and distances from images or screenshots.
  • Quality-control systems that verify labels, diagrams, and measurements in engineering or architecture images.
  • Healthcare support that reads medical charts or scans with strict format checks to avoid misinterpretation.
  • Research summarizers that extract accurate findings from figures and plots in scientific papers.
  • Customer support bots that understand product photos, manuals, and troubleshooting diagrams.
  • Accessibility tools that read and explain visual content (charts, forms) clearly for low-vision users.
  • Content moderation or fact-checking that flags hallucinations by requiring verifiable, grounded answers.
#Large Multimodal Models #Diagnostic-driven Progressive Evolution #Reinforcement Learning #GRPO #Error Attribution #Multi-Agent Data Generation #OCR #Visual Math #Chart Understanding #Hallucination Mitigation #Category Quotas #Difficulty-aware Sampling #Image Search and Editing #Continual Training #Long-tail Coverage