
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Intermediate
Zimo Wen, Boxiu Li, Wanbo Zhang et al. · 3/3/2026
arXiv

Key Summary

  • This paper builds UniG2U-Bench, a big test to find out when making pictures (generation) actually helps models understand pictures and text together.
  • They sort 3,000 problems into 7 categories and 30 subtasks that often need drawing, tracing, or step-by-step visual changes to solve.
  • They compare each "unified" model (can understand and draw) to its matching base vision-language model (understands only) so the test is fair.
  • Overall, adding drawing skills did not make models better at understanding; unified models usually scored lower than their base models.
  • Making a picture before answering (Generate-then-Answer) often hurt performance because wrong drawings can mislead the answer step.
  • Generation helped in certain cases like mazes, sliding puzzles, spatial tracking, and visual illusions, where visual steps act like a "picture chain-of-thought."
  • They introduce two new scores: RA (how well the drawing follows instructions) and AL (how well the drawing supports the final answer).
  • Tasks and models showed patterns: tasks with similar thinking styles behaved alike, and models built on the same base model behaved similarly.
  • The paper concludes we need better training data, smarter ways to align drawing with reasoning, and checks to avoid error propagation.
  • Use UniG2U-Bench to decide when to draw, when not to draw, and how to design future unified models that help rather than hurt understanding.

Why This Research Matters

In the real world, many problems are easier with the right sketch—but only if that sketch is correct. This work shows when AI should sketch and when it should not, which can make assistants more accurate and faster. It helps build better educational tools (like geometry tutors) that know when drawing will aid learning. It guides map and robot planners to use visuals wisely, improving navigation and safety. It supports chart-reading systems to avoid decorative but misleading graphics. Most importantly, it gives engineers a practical checklist—measure drawing faithfulness (RA), usefulness (AL), and only draw when it truly helps.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re solving a tricky maze on paper. It’s much easier if you can draw your path as you go, right? But what if your pencil keeps drawing the wrong lines—now it confuses you instead of helping.

🥬 The Concept: Before this research, many AI systems could either understand images and text together (answer questions about a picture) or generate images (draw/edit), and newer “unified” models started doing both in one brain. The big question no one cleanly answered was: does drawing actually help the model think better? Or does it just get in the way?

How it used to work:

  1. Vision-Language Models (VLMs) could look at a picture and a question, then answer in text. They didn’t draw.
  2. Image generators could draw from text prompts but didn’t necessarily understand the question deeply.
  3. Unified models tried to do both, hoping shared skills would make them stronger overall.

Why that wasn’t enough:

  • Existing tests usually checked “Can it answer?” and “Can it draw?” separately. That’s like testing a student’s math and art on different days and never seeing if sketching helps them solve geometry.
  • Many vision benchmarks could be solved by just writing long captions and letting a text-only model answer—so they didn’t force models to do truly visual thinking.

🍞 Anchor: Think of geometry problems where you must draw an extra helper line to see a triangle similarity. Without that line, the answer stays hidden. This paper asks: can models draw that helper line—and will it actually help them get the right answer?

🍞 Hook: You know how some puzzles (like sliding tiles) are easier if you move pieces step by step? Your brain uses the moves to keep track of the state.

🥬 The Concept: Many real tasks—mazes, physics diagrams, spatial tracking, chart reading—benefit from creating intermediate visuals: small, helpful pictures between the start and the final answer.

How it often fails today:

  1. Benchmarks allowed text-only shortcuts (long captions), so models didn’t need to draw.
  2. When models did draw, no one systematically checked if the drawing improved the final answer.
  3. Differences in model size and training made comparisons unfair.

Why this matters:

  • If drawing helps, we can build better tutors, safer navigators, and clearer tools. If drawing misleads, we must avoid it or fix how it’s done.

🍞 Anchor: If you’ve ever traced a shortest path on a map with your finger, you know how a little visual action can clear up confusion fast—unless your finger slips, and then you get lost. This benchmark asks which tasks are like the helpful finger-tracing and which ones are just finger-slips.

02Core Idea

🍞 Hook: You know how showing your work in math can help you think—unless your first step is wrong, and then all the steps after go wrong too?

🥬 The Concept: The key insight is that generation (drawing) does not automatically help understanding. It helps only when the task truly needs visual steps and when the intermediate drawings are accurate and aligned with the reasoning.

How it works:

  1. Build a benchmark (UniG2U) with tasks that often need visual transformations (mazes, spatial tracking, geometry, physics, illusions, charts).
  2. For each unified model, compare it against its own matching base VLM to fairly isolate the effect of adding drawing.
  3. Test two modes: Direct (answer without drawing) and Generate-then-Answer (draw first, then answer).
  4. Score the drawings with RA (does the drawing follow the instruction and look right?) and AL (does the drawing actually support the final answer?).

Why it matters:

  • Without this careful setup, we might believe drawing helps everywhere, when it actually helps only in specific cases and can even hurt.

🍞 Anchor: Like using a ruler in geometry—great when measuring angles, pointless when spelling a word. The paper figures out when the “visual ruler” helps and when it’s just clutter.

Three analogies for the same idea:

  1. Whiteboard Brain: Drawing on a whiteboard helps you think through multi-step puzzles, but a messy or wrong sketch confuses you. That’s Direct vs GtA with good/bad drawings.
  2. Map Tracing: Tracing your route helps you find a path in a maze; tracing on a blurry or wrong map misleads you. That’s RA/AL: the map’s faithfulness (RA) and whether it supports the destination (AL).
  3. Lego Manual: Building step-by-step diagrams helps finish a model; a misprinted step ruins the build. GtA is like following visual steps—bad steps, bad build.

Before vs After:

  • Before: People assumed that if a model could draw, it might also understand better by default.
  • After: We learn drawing helps mainly in transformation-heavy tasks (mazes, sliding puzzles, spatial tracking, some illusions). In most other tasks, drawing either does nothing or makes things worse.

Why it works (intuition without equations):

  • Visual steps reduce memory load when the task has many moving parts. But if the visual step is wrong, it injects false facts into the reasoning chain, causing error propagation.

Building blocks (each with a mini-sandwich):

🍞 Hook: Imagine two types of students—one just answers questions; the other can also draw diagrams. 🥬 The Concept: Vision-Language Models (VLMs) vs Unified Multimodal Models (UMMs).

  • What it is: VLMs read images + text and answer; UMMs can do that and also generate images.
  • How it works: VLMs map vision+text to answers; UMMs share parameters for both answering and drawing.
  • Why it matters: Comparing a UMM to its base VLM shows whether adding drawing helps or hurts. 🍞 Anchor: A student who can sketch might solve geometry faster—or scribble wrong lines and get stuck.

🍞 Hook: You know how building a quick sketch can help you think? 🥬 The Concept: Generation-to-Understanding (G2U).

  • What it is: Using image generation as a tool to improve understanding/answers.
  • How it works: Model creates an intermediate picture that captures a visual step, then uses it to answer.
  • Why it matters: If G2U works, pictures become a thinking aid, not just decoration. 🍞 Anchor: Drawing helper lines in geometry before solving for an angle.

🍞 Hook: Sometimes you answer right away; other times you draw first, then answer. 🥬 The Concept: Direct vs Generate-then-Answer (GtA).

  • What it is: Direct = answer without drawing; GtA = draw an intermediate picture, then answer.
  • How it works: In GtA, the new image is fed back into the model as extra context.
  • Why it matters: This reveals whether explicit pictures act like a visual chain-of-thought. 🍞 Anchor: Solving a maze in your head (Direct) vs tracing the path with a pencil (GtA).

🍞 Hook: A neat sketch that follows the rules is more helpful than a messy one. 🥬 The Concept: RA (Reasoning-to-Visual Alignment) and AL (Answer-to-Visual Alignment).

  • What it is: RA checks if the drawing follows instructions and looks correct; AL checks if the drawing supports the final answer.
  • How it works: A rubric scores visuals for instruction-following, quality, and relevance (RA), and checks final answer consistency with the drawing and question (AL).
  • Why it matters: High RA/AL is linked to helpful drawings; low RA/AL risks error propagation. 🍞 Anchor: A ruler-straight helper line (high RA) that clearly shows equal angles and leads to the right solution (high AL).

🍞 Hook: Different toolkits lead to different habits. 🥬 The Concept: End-to-End vs Decoupled vs Agentic UMMs.

  • What it is: End-to-End shares one set of parameters for draw+answer; Decoupled stitches separate drawing and answering modules; Agentic calls external tools.
  • How it works: E2E trains all together; Decoupled passes images between modules; Agentic uses tool APIs.
  • Why it matters: These designs shape how generation affects understanding and how errors travel. 🍞 Anchor: One multitool vs two separate tools vs a toolbox you borrow from a neighbor.

03Methodology

At a high level: Input (image + question) → Pick inference mode (Direct or GtA) → If GtA: generate intermediate image(s) → Answer → Score both the answer and the intermediate images.
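The high-level flow above can be sketched as a small dispatch. The `answer` and `generate_image` interfaces here are hypothetical stand-ins for a real unified model; the point is the data flow (the generated image is fed back in as extra context), not the model calls.

```python
# Minimal sketch of the two-mode evaluation flow, under assumed model
# interfaces: answer(images, question) and generate_image(image, question).

def evaluate(model, image, question, mode="direct"):
    if mode == "direct":
        # Direct: answer straight from the original image and question.
        return model.answer([image], question)
    elif mode == "gta":
        # Generate-then-Answer: draw an intermediate picture first, then
        # answer using both the original and the generated image.
        intermediate = model.generate_image(image, question)
        return model.answer([image, intermediate], question)
    raise ValueError(f"unknown mode: {mode}")

class StubModel:
    """Toy stand-in so the flow can be exercised without a real model."""
    def answer(self, images, question):
        return f"answer given {len(images)} image(s)"
    def generate_image(self, image, question):
        return "intermediate-image"
```

Running both modes on the same model and inputs is exactly what lets the benchmark ask whether the extra picture helped or hurt.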

Step-by-step recipe:

  1. Build a task set where drawing could help.
  • What happens: The authors curate 3,000 problems in 7 categories and 30 subtasks (real-world apps, geometry, physics, puzzles & games, chart/table reasoning, spatial intelligence, perception reasoning). Each is chosen because visual transformation might be useful (e.g., drawing a path, marking angles, simulating motion).
  • Why this step exists: If tasks don’t need visuals, drawing won’t help; we must test on cases where pictures could reasonably matter.
  • Example: Maze task—tracing a path; geometry—adding an auxiliary line; physics—drawing force vectors; charts—replotting a simpler view.
  2. Pair each unified model with its matching base VLM.
  • What happens: For a given unified model (can draw+answer), find the most faithful base VLM it’s built upon (understands only). Match compute budgets and prompts.
  • Why this step exists: To isolate the effect of adding drawing; otherwise, differences could be due to model size or training, not drawing.
  • Example: Compare OmniGen2 to its Qwen2.5-VL base; compare Bagel to its Qwen-family base, etc.
  3. Run two inference modes for unified models.
  • What happens: Direct mode—just answer. GtA mode—first generate an intermediate picture (or a sequence), then answer using both the original and generated images.
  • Why this step exists: To see if visual chain-of-thought (GtA) helps more than using the model’s internal representations (Direct).
  • Example: Sliding puzzle—generate the next state frame-by-frame, then output the move list.
  4. Keep budgets and decoding fixed.
  • What happens: Use standardized token limits and greedy decoding (temperature 0), with uniform image handling.
  • Why this step exists: Prevent extra compute or randomness from creating fake gains.
  • Example: All models answer under the same text length limits; image resolutions are handled consistently.
  5. Measure answers and measure drawings.
  • What happens: Main score is accuracy (exact match or judged with GPT-4o when free-form). Two new drawing scores: RA (does the drawing follow instructions and look right?) and AL (does the drawing actually support the answer?).
  • Why this step exists: A drawing can look okay but be useless, or look nice but be wrong. RA and AL tell us if the picture is both faithful and helpful.
  • Example: A geometry drawing with clearly marked equal angles (high RA) that leads to a correct angle computation (high AL).
  6. Analyze correlations.
  • What happens: Look for patterns across tasks and models by comparing deltas (unified minus base) instead of raw scores.
  • Why this step exists: Deltas isolate the generation effect, showing which task types benefit and which model families rise or fall together.
  • Example: Maze and sliding puzzles cluster together; models sharing the same base show similar behaviors.

The Secret Sauce:

  • Strict base–unified pairing + two-mode testing + RA/AL diagnostics. This trio distinguishes “drawing that helps” from “drawing that hurts.”

Mini-sandwiches for key method components:

🍞 Hook: Two ways to solve: think in your head or sketch first. 🥬 The Concept: Direct vs GtA protocols.

  • What it is: Two testing modes to see if pictures help.
  • How it works: Direct uses internal reasoning; GtA externalizes it into an image.
  • Why it matters: Identifies when visual steps act like a helpful chain-of-thought. 🍞 Anchor: Maze solved by mental path vs pencil-traced path.

🍞 Hook: A tidy rubric helps fair grading. 🥬 The Concept: RA/AL metrics.

  • What it is: RA judges drawing fidelity; AL judges whether the drawing supports the final answer.
  • How it works: Weighted rubrics (e.g., instruction adherence, visual quality, task relevance; answer consistency).
  • Why it matters: Shows if low performance is due to bad pictures, bad reasoning, or both. 🍞 Anchor: Geometry: Are parallel lines marked correctly (RA)? Did that drawing truly lead to the right angle (AL)?

🍞 Hook: Families share habits. 🥬 The Concept: Model taxonomy and correlation analysis (End-to-End, Decoupled, Agentic; plus architecture families).

  • What it is: Group models by design and shared backbones; compute correlations of gains/losses across tasks.
  • How it works: Use Spearman correlations on the deltas.
  • Why it matters: Reveals that base representations, more than generation style, shape behavior. 🍞 Anchor: Students trained by the same teacher solving problems in similar ways, even if they use different colored pens.

Concrete data-flow example (Geometry):

  • Input: A triangle diagram with a question: “Find angle BE.”
  • Direct: The model analyzes the diagram and answers 120°.
  • GtA: The model first generates a diagram adding an auxiliary line, then answers 120°.
  • RA/AL: If the added line is wrong, RA drops and AL drops, and the final answer can become wrong.

Concrete data-flow example (Maze):

  • Input: A maze image with a start and a goal.
  • Direct: The model outputs moves like [left, left, up…].
  • GtA: The model generates one image per move showing the dot’s new position, then outputs the move list.
  • RA/AL: If each step image shows a legal, consistent path, RA/AL are high and accuracy improves.
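The "legal, consistent path" check in the maze example can itself be made programmatic, which is also what the paper's self-checking direction points toward. This is a hypothetical sketch: the grid encoding (0 = open, 1 = wall) and move names are assumptions, not the benchmark's format.

```python
# Hypothetical self-check for maze-style GtA steps: before trusting a
# generated step image, verify the move it implies is legal on the grid.
# Grid format (0 = open cell, 1 = wall) and move names are assumptions.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def legal_step(grid, pos, move):
    """True if `move` from `pos` stays inside the maze and off the walls."""
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[0])
    return in_bounds and grid[r][c] == 0

maze = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
]
# From (0, 0): moving down is open; moving right hits a wall.
```

A validator like this is exactly the kind of guard that keeps one bad intermediate image from becoming a false premise for every step after it.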

04Experiments & Results

The Test (what they measured and why):

  • Main goal: Does adding drawing help models understand better? When?
  • Setup: 3,000 tasks across 7 categories; for each unified model, compare with its own base VLM.
  • Two modes: Direct vs GtA. Also score drawings (RA/AL). Look for task/model correlation patterns.

The Competition (who went head-to-head):

  • 35 systems: 11 base VLMs, 21 unified models, 3 agentic setups. Unified models spanned end-to-end, decoupled, and different generation styles (AR, diffusion, flow). Agentic systems used strong planners plus external image tools.

The Scoreboard (with context):

  • Big picture: Most unified models scored worse than their base VLMs on overall understanding. Think of it as dropping from a solid B to a C+ just by adding drawing powers.
  • Direct vs GtA: GtA often did worse than Direct. If Direct is like solving in your head, GtA is like sketching first—but bad sketches tripped models up.
  • Where generation helped: Spatial Intelligence (multi-step spatial reasoning), some Puzzles & Games (maze, sliding), and visual illusions. Here, GtA sometimes improved scores—like turning a C into a B—because the pictures acted like a visual chain-of-thought.
  • RA/AL insights: High RA/AL in perception-heavy tasks didn’t translate to better answers (drawings were nice but redundant). Low RA in strict-logic tasks (geometry/physics) hurt badly (bad drawings misled the solver).

More color on results:

  • Overall degradation theme: Even without drawing (Direct), unified models often underperformed base VLMs. This suggests an “alignment tax”: training a single brain to both draw and understand introduces objective interference that slightly weakens pure discrimination.
  • GtA pitfalls: In tasks with strict rules (e.g., exact angles, force diagrams), small drawing errors (low RA) became wrong premises for the final step (low AL), pulling down accuracy noticeably.
  • Structured gains: In tasks that benefit from state externalization—maze steps, sliding tile positions, tracking moving objects—GtA raised accuracy by offloading memory into pictures.

Surprising findings:

  • Pretty but pointless: In perception tasks, models could draw tidy intermediate visuals (high RA/AL), yet these pictures didn’t help final answers. The drawings were correct but unnecessary.
  • Base matters most: Models sharing the same base VLM showed very similar gain/loss patterns. Sharing just a generation architecture (e.g., both using diffusion) did not make behaviors strongly alike. The inherited base representation dominated.
  • Agentic ceiling: Tool-using agentic systems sometimes formed strong drawings and decisions, but their gains still depended on task structure—tools didn’t magically fix error propagation in strict-logic settings.

Take-home metrics made simple:

  • Accuracy: Final answer correctness.
  • Δ (G2U Gain): Unified score minus base VLM score. Mostly negative overall; positive in specific spatial/illusion/multi-step tasks.
  • RA: How faithfully the drawing followed the instructions and structure.
  • AL: How well the final answer matched the drawing and the question.

Bottom line: Drawing can be a superpower for the right puzzle, but a booby trap for the wrong one. Knowing when to draw is half the victory.

05Discussion & Limitations

Limitations (be specific):

  • Coverage bias: Even with 3,000 tasks, real-world diversity is bigger. Some domains (e.g., complex CAD-like diagrams, scientific plotting standards) aren’t fully captured.
  • RA/AL reliance on LLM judging: Although rubric-based, the RA/AL evaluation uses GPT-4o to grade complex visuals, which can introduce judging variance.
  • Edit training gaps: Some unified models lack explicit image-edit (i2i) training; forcing them to generate edited intermediates is out-of-distribution and predictably harms results.
  • Budget constraints: Strict token and resolution caps are fair but might hide benefits that appear with larger budgets or higher-res images.

Required resources:

  • Models: Access to both base VLMs and their unified counterparts (often large, GPU-heavy).
  • Evaluation stack: A consistent pipeline (e.g., lmms-eval), curated prompts, and access to a strong judge model for RA/AL.
  • Data handling: High-quality image pre/post-processing and reproducible decoding settings.

When NOT to use GtA:

  • Pure recognition tasks (logos, icons, in-scene identification): Drawing is redundant and can distract.
  • Strict-constraint math/physics unless you have reliable diagram generation: Small visual errors cascade into big answer errors.
  • ChartQA without structured chart synthesis: Decorative charts without correct axes/scales won’t help.

Open questions:

  • How to reduce the alignment tax: Can we design training objectives that keep generative power without weakening pure understanding?
  • Self-checking visuals: Can models validate and correct their own diagrams (e.g., verify parallelism, angles, connectivity) before answering?
  • Routing policies: Can a meta-controller predict drawing utility (based on task type and confidence) and decide whether to draw or skip?
  • Better intermediates: Beyond pixels—can vector diagrams, scene graphs, or symbolic sketches provide more reliable scaffolds?
  • Scaling laws: How do G2U effects change with model/data scale, and at what point do gains outweigh the alignment tax?

06Conclusion & Future Work

Three-sentence summary:

  • UniG2U-Bench is a large, careful testbed that asks a precise question: when does drawing (generation) help a model understand better?
  • The answer: not by default—unified models often do worse than their base VLMs, and drawing before answering frequently hurts unless the task truly needs visual state tracking and the generated images are faithful.
  • Clear wins appear in transformation-heavy tasks (mazes, sliding, spatial tracking, some illusions), and model behaviors cluster by shared base backbones and task structures.

Main achievement:

  • A principled, fair evaluation framework—base–unified pairing, Direct vs GtA protocols, and RA/AL diagnostics—that isolates the true causal impact of generation on understanding across 7 categories and 30 subtasks.

Future directions:

  • Reduce alignment tax with better multi-objective training and representation-level alignment.
  • Add reliability through self-verifying, constraint-aware diagram generation.
  • Smarter routing to only draw when drawings are likely to help; explore structured intermediates (vectors/graphs) over raw pixels.
  • Study scaling laws and collect richer, diverse training data that exercise spatial and transformation reasoning.

Why remember this:

  • It replaces a fuzzy belief (“drawing must help”) with evidence-based rules: draw when the task needs visual state and you can draw faithfully; otherwise, skip. That simple, practical lesson can guide the next generation of multimodal AI.

Practical Applications

  • Design a router that decides per task whether to use GtA or stick to Direct mode, based on predicted drawing utility.
  • Add self-checks to generated diagrams (e.g., angle sums, parallel lines, legal maze moves) before answering, to reduce error propagation.
  • Use RA/AL scoring during training to reinforce generating faithful, helpful intermediates rather than pretty but useless pictures.
  • Adopt structured intermediates (vectors, graphs, overlays) for geometry/physics instead of raw pixel drawings to improve reliability.
  • Train specialized visual CoT skills for spatial tasks (maze, sliding, tracking) where GtA consistently helps.
  • Benchmark unified models with their exact base VLMs using UniG2U-Bench to fairly assess true G2U gains.
  • Create classroom assistants that decide when to show a diagram step and when to explain directly, improving learning outcomes.
  • In robotics or navigation, externalize only critical path steps (with validation) to aid planning without compounding visual errors.
  • For chart QA, prefer table extraction or vector re-expression over freehand chart drawing to maintain numeric fidelity.
  • Monitor alignment tax during training and adjust objectives to preserve base understanding performance.
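The first application, a "draw or skip" router, can be sketched in a few lines. The task taxonomy and threshold here are illustrative assumptions motivated by the paper's findings: draw only for transformation-heavy tasks, and only when expected drawing fidelity (an RA-like estimate) is high enough to avoid error propagation.

```python
# Sketch of a "draw or skip" router. Task categories and the RA threshold
# are illustrative assumptions, not values from the paper.

GTA_FRIENDLY = {"maze", "sliding_puzzle", "spatial_tracking", "visual_illusion"}

def choose_mode(task_type: str, expected_ra: float, ra_threshold: float = 0.7) -> str:
    """Pick Generate-then-Answer only where drawing is likely to help."""
    if task_type in GTA_FRIENDLY and expected_ra >= ra_threshold:
        return "generate_then_answer"
    return "direct"
```

So a maze with a reliable drawer gets the visual chain-of-thought, while a chart question, where the paper found drawings decorative at best, stays in Direct mode no matter how pretty the sketch would be.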
Tags: Unified multimodal models · Vision-language models · Generation-to-Understanding (G2U) · Generate-then-Answer (GtA) · Direct inference · Reasoning-to-Visual Alignment (RA) · Answer-to-Visual Alignment (AL) · Visual chain-of-thought · Spatial intelligence · Multimodal benchmarking · Image generation · Error propagation · Alignment tax · Agentic tool-use · Representation coupling