PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Intermediate
Junxian Li, Kai Liu, Leyang Chen et al. · 2/6/2026
arXiv

Key Summary

  • PlanViz is a new test that checks whether AI image models can plan and draw useful computer-related pictures like routes on maps, work flowcharts, and website screens.
  • It focuses on three planning-heavy tasks: route planning, workflow diagramming, and web & UI displaying, for both image generation and image editing.
  • The authors built a high-quality, human-annotated dataset with reference images and key points so scoring is fair and precise.
  • They introduce PlanScore, which grades results on three things: correctness (did it solve the task), visual quality (does it look right), and efficiency (no extra junk).
  • A powerful multimodal model acts as a judge using the key points and reference images, and its scores matched human ratings closely.
  • Across many models, making brand-new images is easier than editing existing ones while keeping layout and details intact.
  • Closed-source models like GPT-Image-1 and Seedream-4.5 are currently stronger than most open-source models, especially on correctness.
  • Models do better on closed-ended questions with clear details than open-ended ones with fuzzy goals.
  • Even when pictures look great, they can still be wrong for the task—visual quality and correctness often don’t rise together.
  • PlanViz points to a big research need: tie careful planning to pixel-level drawing and editing so AIs can be reliable computer-use assistants.

Why This Research Matters

We spend much of our lives using computers—planning trips, following procedures, and working with apps and websites. If AI is going to be a real helper, it must do more than produce pretty images; it must plan and draw pictures that truly solve tasks. PlanViz tests exactly that: can a model create routes that follow real paths, diagrams that make sense, and screens that stay organized after actions? The benchmark reveals where today’s systems fall short, especially in precise editing that preserves layout and text. This helps researchers focus on the hard problems that matter most for real users. Over time, better performance on PlanViz could mean smarter assistants that help with everyday tasks at home, school, and work.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how planning a family trip needs more than pretty pictures—you need a map, a route, and a schedule that actually works?

🥬 The Concept: Unified Multimodal Models (UMMs)

  • What it is: UMMs are AI systems that can both understand things (like a question or a picture) and generate things (like images) in one model.
  • How it works:
    1. Read your text prompt and/or look at an image.
    2. Reason about what you asked (like a mini plan).
    3. Draw or edit an image to match your request.
    4. Optionally explain what it did.
  • Why it matters: Without UMMs, you’d need separate tools to understand and then to draw, making it hard to solve real tasks smoothly.

🍞 Bottom Bread (Anchor): Imagine asking, “Draw a map route for my museum visit.” A good UMM should figure out the route and draw it clearly on the map.

🍞 Top Bread (Hook): Imagine trying to change a slide or website layout—you can’t just make it pretty; it must be organized and correct.

🥬 The Concept: Computer-use tasks

  • What it is: Tasks tied to things we do on computers—planning routes on maps, making flowcharts, or designing/adjusting web pages and app screens.
  • How it works:
    1. Understand the goal (e.g., visit exhibits 1–7).
    2. Plan steps or layout (order, positions, actions).
    3. Produce or edit the image to show the plan.
  • Why it matters: If the AI gets the plan wrong, the picture is useless—even if it looks nice.

🍞 Bottom Bread (Anchor): A to-do app screen must keep its buttons, lists, and labels in the right places or you can’t use it.

🍞 Top Bread (Hook): Think of art class vs. shop class. Art can be free-form; shop requires precise measurements to make something that works.

🥬 The Concept: Image generation vs. image editing

  • What it is: Generation creates a new image from scratch; editing changes a given image while keeping what should stay the same.
  • How it works:
    1. Generation: Turn instructions into a brand-new picture.
    2. Editing: Keep the original layout and content, then change only what the instruction asks.
  • Why it matters: Editing is tougher—mess up the layout or text, and the result breaks the task.

🍞 Bottom Bread (Anchor): If you add a new step to a flowchart, you must not scramble the other boxes or arrows.

🍞 Top Bread (Hook): You know how a recipe isn’t just ingredients—it’s the steps in the right order.

🥬 The Concept: Planning inside visuals

  • What it is: Planning means organizing steps or positions so the final image isn’t just pretty but functionally correct.
  • How it works:
    1. Identify the goal (destination, task outcome, or page state).
    2. Break into steps or layout rules.
    3. Place elements in the right order/locations.
    4. Check constraints (paths exist, totals add up, labels readable).
  • Why it matters: Without planning, you get “nice-looking wrong answers.”

🍞 Bottom Bread (Anchor): A route line that goes through a lake instead of a road looks fine at first glance but is wrong for travel.

The world before this paper: AI image models were great at making natural, artistic scenes and following creative prompts. Benchmarks mostly measured how pretty or realistic images looked, or whether a few objects matched a prompt. But real computer-use visuals—like GUI screens, flowcharts, and maps—demand structure, text clarity, and step-by-step logic. The problem was that we didn’t know if UMMs could plan and draw these functional images, and there wasn’t a good way to test them.

What people tried: Prior benchmarks evaluated natural or science images, sometimes with logic, but rarely focused on planning-heavy computer-use tasks. Popular metrics like FID or LPIPS measure “looks,” not “does this solve the task?” Some works used AI-as-judge, but not with task-specific key points and references.

The missing ingredient: A benchmark with real planning tasks + human-annotated references + a scoring system that checks task correctness, visual clarity, and no extra junk. Also, tests should include both open-ended (no single right answer) and closed-ended (clear target) cases to reflect real life.

Why we should care: Day-to-day computer use needs reliable planning visuals—trip routes, step-by-step diagrams for work or school, and accurate previews of UI after user actions. If AI can’t do these right, it can’t truly assist us in tasks that matter, even if the pictures look beautiful.

02Core Idea

🍞 Top Bread (Hook): Imagine a school test that doesn’t just check handwriting—it checks whether you actually solved the math problem correctly, neatly, and without scribbling extra stuff.

🥬 The Concept: PlanViz (the new benchmark)

  • What it is: PlanViz is a testbed that measures how well AI can plan and then generate or edit images for real computer-use tasks.
  • How it works:
    1. Three sub-tasks: route planning, workflow diagramming, web & UI displaying.
    2. Each has generation and editing versions.
    3. Human-made questions, reference images, and key points define success.
    4. A judge model scores correctness, visual quality, and efficiency.
  • Why it matters: Without a planning-aware test, models might score high on pretty images but fail at being useful.

🍞 Bottom Bread (Anchor): “Draw a museum route visiting exhibits 1–7.” PlanViz checks if the line follows real paths, visits the right exhibits, and keeps the map intact.

One-sentence Aha: If you want AIs to be good computer-use helpers, you must test if they can plan, not just paint.

Three analogies:

  • Driving test: Not just how shiny your car is, but whether you follow the route, obey signs, and arrive safely.
  • Recipe check: Not just a pretty dish photo, but whether you did the steps in order and the dish is edible.
  • Lego build: Not just colorful bricks, but whether your bridge actually stands and connects the right points.

Before vs. after PlanViz:

  • Before: Models could pass by making nice-looking pictures that loosely matched prompts.
  • After: Models must prove they can plan steps, place elements properly, and edit precisely without breaking what should stay the same.

Why it works (intuition, not equations):

  • Tasks are defined with human “key points” (what must be right).
  • Reference images show the target shape or acceptable solution.
  • A strong judge model compares the output against key points and references.
  • Scores combine correctness (did you solve it), visual quality (is it clear), and efficiency (no extras).

Building blocks (with mini “sandwiches”):

🍞 Hook: You know how Google Maps gives you turn-by-turn directions?

🥬 Route Planning

  • What it is: Making a sensible path between places under certain rules.
  • How it works: Choose destinations, pick order, draw path along real roads/paths, avoid forbidden areas.
  • Why it matters: A line that cuts across buildings is useless.

🍞 Anchor: Visiting exhibits 1–7 in order on a museum map.

🍞 Hook: Imagine writing a how-to guide as a neat flowchart.

🥬 Workflow Diagramming

  • What it is: Drawing step-by-step diagrams that explain a process.
  • How it works: List steps, group them, connect with arrows, label clearly.
  • Why it matters: Mixed-up arrows cause confusion.

🍞 Anchor: A diagram for “how to apply for a visa.”

🍞 Hook: Picture a store shelf where every item has its proper spot.

🥬 Web & UI Displaying

  • What it is: Laying out titles, buttons, tables, and sidebars correctly on a screen.
  • How it works: Keep structure, place components in the right regions, update state after actions.
  • Why it matters: If buttons move randomly, users get lost.

🍞 Anchor: A blog layout with title at top, content center, sidebar right.

🍞 Hook: Sometimes teachers give a precise problem; sometimes they ask for creative ideas.

🥬 Open-ended vs. Closed-ended questions

  • What it is: Open-ended has many acceptable solutions; closed-ended has specific targets.
  • How it works: Open-ended allows creative layouts; closed-ended checks exact items/steps.
  • Why it matters: Clearer instructions help models plan better.

🍞 Anchor: “Any food-themed route in Tokyo” (open) vs. “Visit markets A, B, C in order” (closed).

03Methodology

At a high level: Input (prompt ± image) → Model generates/edits → Judge model compares to key points and references → Output scores (Correctness, Visual Quality, Efficiency).

🍞 Top Bread (Hook): Think of grading a science fair project—you check if it works, looks clear, and isn’t cluttered.

🥬 The Concept: PlanScore (the grading system)

  • What it is: A three-part score that checks if the picture solves the task, looks good, and avoids extra junk.
  • How it works:
    1. Correctness (Cor): Count how many key points are satisfied.
    2. Visual Quality (Vis): Rate clarity, layout consistency, readable text.
    3. Efficiency (Ef): Penalize unnecessary or irrelevant content.
  • Why it matters: A beautiful but wrong solution is still wrong, and a correct but messy picture is hard to use.

🍞 Bottom Bread (Anchor): If 7 of 10 required steps appear in a flowchart, Cor ≈ 0.7; if text is readable and arrows are neat, Vis is high; if no extra boxes appear, Ef is high.
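To make the three-part grading concrete, here is a minimal Python sketch of how PlanScore-style aggregation could work. The equal weighting, the 0–5 judge scales, and the function name are assumptions for illustration; the paper’s exact formula may differ.

```python
# Hypothetical sketch of PlanScore-style aggregation (not the paper's exact formula).
# Assumes: correctness = fraction of key points satisfied;
# visual quality and efficiency are judge ratings on a 0-5 scale.

def plan_score(satisfied_key_points, total_key_points, vis_0_to_5, ef_0_to_5):
    cor = satisfied_key_points / total_key_points   # fraction of checklist met, 0..1
    vis = vis_0_to_5 / 5.0                          # normalize judge rating to 0..1
    ef = ef_0_to_5 / 5.0
    overall = (cor + vis + ef) / 3.0                # equal weights (assumption)
    return {"Cor": cor, "Vis": vis, "Ef": ef, "Overall": round(overall, 3)}

# Example: 7 of 10 flowchart steps present, clean visuals, no clutter.
print(plan_score(7, 10, vis_0_to_5=4, ef_0_to_5=5))
```

This mirrors the anchor above: 7 of 10 key points gives Cor ≈ 0.7, while good visuals and no clutter keep Vis and Ef high.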

Recipe-style steps:

  1. Data collection (for editing tasks)
  • What happens: Humans collect real screenshots of maps, apps, and websites; they ensure clarity and readable text.
  • Why it exists: Realistic inputs make the test meaningful.
  • Example: A museum map with labeled exhibits for route editing.
  2. Question design (for generation and editing)
  • What happens: Humans write prompts spanning route planning, workflows, and UI screens; mark them as open- or closed-ended.
  • Why it exists: Covers daily tasks and degrees of guidance.
  • Example: “Draw a flowchart for preparing a job interview.”
  3. Human annotation (references + key points)
  • What happens: Annotators create reference images and list key points that define success.
  • Why it exists: Gives the judge a clear target and exact checklist.
  • Example: Key points might include “visit exhibits 1–7” and “follow existing paths.”
  4. Quality check (blind review)
  • What happens: Different annotators verify difficulty and correctness of references and key points; disagreements are resolved.
  • Why it exists: Keeps the ground truth fair and consistent.
  • Example: If two reviewers disagree, they discuss and fix the item.
  5. Prompt style transformation
  • What happens: Prompts are rewritten into different styles (same meaning) to test robustness.
  • Why it exists: Real users speak differently; models should handle variety.
  • Example: “Plan a tapas route in Barcelona” could be phrased in 10 ways.
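The preparation steps above yield, for every task, a bundle of prompt, references, and key points. A rough sketch of what one such item could look like as a data record (all field names here are hypothetical, not taken from the paper):

```python
# Hypothetical record for one PlanViz-style benchmark item; field names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BenchmarkItem:
    task: str                  # "route_planning" | "workflow" | "web_ui"
    mode: str                  # "generation" | "editing"
    question_type: str         # "open_ended" | "closed_ended"
    prompt: str                # the instruction shown to the model
    prompt_variants: list = field(default_factory=list)  # style-transformed rewrites
    input_image: Optional[str] = None                    # screenshot path (editing only)
    reference_images: list = field(default_factory=list)
    key_points: list = field(default_factory=list)       # checklist used for Correctness

item = BenchmarkItem(
    task="route_planning",
    mode="editing",
    question_type="closed_ended",
    prompt="Draw a route visiting exhibits 1-7 in order on this museum map.",
    input_image="museum_map.png",
    key_points=["visits exhibits 1-7 in order", "route follows existing paths"],
)
print(item.task, len(item.key_points))
```

Keeping the key points attached to each item is what lets the judge grade correctness as a checklist rather than a vague impression.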

🍞 Top Bread (Hook): You know how teachers use rubrics with checkboxes?

🥬 The Concept: Key points

  • What it is: A checklist of must-haves for each task.
  • How it works:
    1. Annotators define general and specific requirements.
    2. The judge checks which are satisfied in the output.
    3. Correctness is the fraction satisfied.
  • Why it matters: Without a checklist, scoring becomes vague or unfair.

🍞 Bottom Bread (Anchor): For a UI table mockup: “has header row,” “≥3 data rows,” “right sidebar present.”

🍞 Top Bread (Hook): Imagine a fair referee who watches both the play and the instant replay.

🥬 The Concept: MLLM-as-judge

  • What it is: A strong vision-language model that reads the prompt, looks at the output, compares to key points and reference images, and assigns scores.
  • How it works:
    1. Input: prompt, model output, references, key points.
    2. Evaluate: mark which key points are met; rate Vis and Ef on 0–5.
    3. Normalize scores to 0–1.
  • Why it matters: This makes evaluation scalable and consistent, and it correlates well with human ratings.

🍞 Bottom Bread (Anchor): If humans give Cor≈0.68 and the judge gives Cor≈0.65, they agree closely.
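The judge’s input can be pictured as a rubric-style prompt assembled from the task and its key points. The sketch below shows one hypothetical way to build such a prompt; the wording, function name, and response format are assumptions, not the paper’s actual judging template.

```python
# Hypothetical assembly of a rubric-style judge prompt (illustrative only).
# In practice the prompt would be sent to a vision-language model together
# with the candidate image and the reference image.

def build_judge_prompt(task_prompt, key_points):
    checklist = "\n".join(f"{i + 1}. {kp}" for i, kp in enumerate(key_points))
    return (
        "You are grading an AI-generated image for the task below.\n"
        f"Task: {task_prompt}\n"
        "Compare the candidate image with the reference image.\n"
        "For each key point, answer yes or no:\n"
        f"{checklist}\n"
        "Then rate Visual Quality and Efficiency on a 0-5 scale."
    )

print(build_judge_prompt(
    "Draw a route visiting exhibits 1-7.",
    ["visits exhibits 1-7 in order", "route follows existing paths"],
))
```

The fraction of “yes” answers becomes Correctness, and the two 0–5 ratings are normalized to 0–1, matching the scoring described above.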

Details for each sub-task (what breaks without the step):

  • Route Planning: Must follow actual paths and visit required places. Without this check, models might draw pretty but impossible lines.
  • Workflow Diagramming: Must include correct steps and clean connections. Without this, flowcharts can be misleading.
  • Web & UI Displaying: Must keep or form correct layout and update state after actions. Without this, screens become confusing or wrong.

🍞 Top Bread (Hook): Think of packing a school bag—take what you need, not your entire room.

🥬 The Concept: Efficiency (Ef)

  • What it is: A measure that discourages adding extra, irrelevant items.
  • How it works: The judge subtracts points when the output includes unnecessary icons, boxes, or text.
  • Why it matters: Clutter hides the important parts and wastes attention.

🍞 Bottom Bread (Anchor): A blog page mockup shouldn’t include pop-ups or random charts if the prompt never asked for them.
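One simple way to picture the efficiency penalty: start from a full score and subtract for each extra element the judge flags. This is only an illustrative scheme; in the benchmark, Ef comes from the judge model’s 0–5 rating, not a fixed formula.

```python
# Illustrative efficiency penalty (an assumption, not the paper's rubric):
# start at 5 and lose a point per unnecessary element, floored at 0.

def efficiency_score(num_extra_elements, max_score=5, penalty_per_item=1):
    return max(0, max_score - penalty_per_item * num_extra_elements)

print(efficiency_score(0))  # clean mockup -> 5
print(efficiency_score(3))  # three unrequested widgets -> 2
print(efficiency_score(9))  # heavy clutter -> floored at 0
```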

Secret sauce:

  • Planning-first design of tasks.
  • Human-anchored key points and references.
  • Robust, automated judging aligned with human intuition.
  • A structure that exposes the gap between “looks good” and “is correct.”

04Experiments & Results

🍞 Top Bread (Hook): Imagine a scoreboard where some teams look stylish but don’t score the goals that win the game.

The Test: Models had to perform both generation (from scratch) and editing (change an existing image) across three sub-tasks: route planning, workflow diagramming, and web & UI displaying. Scores measured correctness (solve the task), visual quality (clean and coherent), and efficiency (no extra stuff). A strong judge model compared outputs to key points and reference images; human studies showed high agreement with the judge.

The Competition: 13 unified multimodal models and 9 image-only generators/editors—both open-source and closed-source. Well-known names included GPT-Image-1, Seedream-4.5, OmniGen2, Qwen-Image, Bagel, and more.

The Scoreboard (with context):

  • Editing is hard for everyone. Even top performers stayed around two-thirds of human-level averages, and most models had correctness far below human references. That’s like getting a C+ where humans get A’s.
  • Generation shows a big gap: closed-source models (e.g., Seedream-4.5, GPT-Image-1, Gemini3-Pro-Image) reached high correctness and overall scores, while many open-source models hovered below mid-levels. It’s like the varsity team vs. junior varsity.
  • Sub-task differences: Route planning for generation was harder than workflow and UI; in editing, all three were tough.
  • Open vs. closed-ended: Clearer, detailed instructions (closed-ended) boosted correctness for most models. Think of it as getting a clearer rubric.
  • Thinking modes: Enabling “thinking” helped some generation tasks but gave limited or inconsistent gains in editing, hinting that the bottleneck is precise, layout-preserving drawing under constraints.

Surprising findings:

  • Visual quality can be high even when correctness is near zero—models produce “pretty but wrong” images.
  • Conversely, aiming for correctness sometimes hurts visual polish—“right but rough.”
  • Score distributions showed open-source models clustering near low correctness in tough scenarios, while top closed-source models clustered near high correctness in generation but still dipped in editing.

Concrete examples:

  • A museum route: Many models drew paths that didn’t strictly follow real walkways or missed required exhibits.
  • A visa flowchart: Open-source models often sketched boxes but with missing or incorrect steps; top models captured more correct steps and clearer arrows.
  • A blog page mockup: Good models kept the title, content, and sidebar in place; weaker ones misplaced elements or added clutter.

Bottom line: PlanViz reveals that planning-focused editing remains the steep hill, and that being visually impressive doesn’t guarantee task success. Clear constraints help, but models still need better spatial reasoning and step-following to be trustworthy assistants.

05Discussion & Limitations

Limitations:

  • Coverage: While diverse, the dataset size (hundreds of items) is smaller than massive natural-image corpora and focuses on three sub-tasks.
  • Judge dependence: The scoring pipeline relies on a strong MLLM; although it correlates well with humans, judge bias or blind spots can persist.
  • Editing difficulty: Tasks that demand preserving layout and text fidelity remain especially hard; results may underrepresent future improvements in these areas.
  • Style sensitivity: Open-source models showed variability under different prompt phrasings, indicating robustness gaps.

Required resources:

  • Human annotators to craft questions, references, and key points.
  • Access to high-quality judge models and APIs for candidate models.
  • Computing time to run multiple evaluations and ensure stability.

When NOT to use PlanViz:

  • If your goal is purely artistic or photorealistic image aesthetics without any task planning.
  • If the task does not require preserving layout or following specific steps.
  • If you need pixel-perfect metrics (e.g., compression artifacts) rather than task success.
  • If the domain involves sensitive, private screens where sharing references is not possible.

Open questions:

  • How can models tightly couple reasoning steps with pixel-accurate generation and editing?
  • What architectures or training schemes preserve layout and text while making precise edits?
  • How to make models robust to diverse prompt styles without losing correctness?
  • Can we reduce reliance on a single judge by using ensembles or hybrid human+AI judging?
  • How to expand to more computer-use scenarios (e.g., spreadsheets, dashboards, code editors) while keeping evaluations precise?

06Conclusion & Future Work

Three-sentence summary: PlanViz is a planning-first benchmark that tests whether AI models can generate and edit images for real computer-use tasks, not just make pretty pictures. It introduces human-anchored key points, reference images, and a judge that scores correctness, visual quality, and efficiency. Results show a large gap—especially in editing—highlighting the need to better connect planning with precise, layout-preserving visuals.

Main achievement: A practical, task-adaptive evaluation framework (PlanViz + PlanScore) that exposes the difference between “looks good” and “solves the task,” driving progress toward truly helpful computer-use assistants.

Future directions: Improve spatial reasoning and planning-to-pixels alignment; develop editing methods that keep layouts and text intact; strengthen robustness to prompt styles; broaden computer-use domains; and refine judging with more diverse evaluators.

Why remember this: PlanViz reframes success in image AI from beauty to usefulness. By grading planning, clarity, and restraint together, it sets a higher bar for models to become reliable helpers for maps, workflows, and screens—the kinds of visuals we rely on every day.

Practical Applications

  • Evaluate new multimodal models for practical computer-use assistance before deploying them to users.
  • Train models with PlanViz-style key points to improve planning accuracy in visual tasks.
  • Use PlanScore diagnostics to identify whether failures come from correctness, visual quality, or efficiency.
  • Benchmark editing features (e.g., UI mockups, map annotations) in design tools to ensure layout preservation.
  • Guide dataset collection by adding key points and references for new domains like spreadsheets or dashboards.
  • Stress-test prompt robustness by rewriting queries in multiple styles and measuring score stability.
  • Incorporate the MLLM-as-judge pipeline into CI testing for AI products that generate or edit screenshots.
  • Prioritize closed-ended templates in production flows when safety and reliability are critical.
  • Design human-in-the-loop review steps focusing on low-Cor but high-Vis cases (pretty but wrong).
  • Track progress across sub-tasks separately (routes, workflows, UIs) to target the toughest weaknesses.
Tags: PlanViz benchmark, planning-oriented image generation, image editing evaluation, computer-use tasks, route planning AI, workflow diagramming AI, web UI layout generation, PlanScore (correctness, visual quality, efficiency), MLLM-as-judge, unified multimodal models (UMMs), open-ended vs. closed-ended prompts, layout-preserving editing, spatial reasoning in AI, key point annotation, benchmarking UMMs