See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Key Summary
- Modern image generators can still make strange mistakes like extra fingers or melted faces, and today's vision-language models (VLMs) often miss them.
- Past fixes depended on many human labels and mostly covered simple noise or blur, which no longer matches the newest models' failures.
- This paper introduces ArtiAgent, a fully automated, three-agent system that makes clean/artifact image pairs and rich labels at scale.
- The key trick is a new inversion-injection method that nudges where and what a diffusion transformer paints, creating realistic structural flaws on purpose.
- ArtiAgent builds a 100K-image training set with local boxes and text explanations, without needing human annotators.
- A new test set, ArtiBench (1K recent-model images), measures detection, localization, and explanation of artifacts with human ground truth.
- Open-source VLMs fine-tuned on ArtiAgent data beat or match proprietary models like GPT-5 on multiple artifact tasks.
- The trained VLM can guide generation to avoid flaws and can help inpainting tools fix flawed regions automatically.
- Results scale with more synthetic data: even 1K examples help, and 100K gives strong gains, especially for binary detection.
- This offers a practical path to see and fix visual flaws in high-stakes uses like medicine, robotics, and autonomy.
Why This Research Matters
Tiny visual glitches can lead to big misunderstandings in places where safety matters, like medical imaging or self-driving cars. This work shows how to mass-produce modern, realistic artifact examples without human labelers, so models can learn exactly what to watch for. By teaching VLMs to detect, localize, and explain flaws, we gain trustworthy visual inspectors that are fast and scalable. The same knowledge can steer generators away from mistakes and help fix flawed regions after the fact. Altogether, this improves the reliability and usefulness of AI imagery across everyday creativity and high-stakes decision-making.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a photo can look almost perfect but something tiny feels off, like a person with six fingers or a dog with two noses? Your eyes catch it, but it's easy to miss if you're rushing.
The Concept: Structural visual artifacts are realistic-looking pictures with broken shapes (extra, missing, warped, or fused parts) that break common sense. How it works:
- A powerful image generator draws a lovely scene.
- Somewhere, structure goes wrong: an extra ear, a missing hand, a melted face, or two objects stuck together.
- The picture still looks sharp and colorful, but the "logic" of the scene is broken. Why it matters: If we don't notice these subtle mistakes, we may trust flawed images, which is risky in medicine, robotics, and self-driving. Anchor: A child holding a teddy bear, but the child's hand has six fingers. The pixels look great, but the anatomy is wrong.
Hook: Imagine a smart helper that can look at an image and explain it to you in words.
The Concept: Vision-Language Models (VLMs) are AIs that see pictures and talk about them. How it works:
- They look at the image.
- They connect what they see to words.
- They answer questions or describe what's happening. Why it matters: They could be our automatic inspectors for image quality, if they can spot tricky artifacts. Anchor: You ask, "Is anything wrong with this photo?" A VLM should reply, "Yes, the hand has an extra finger inside this box."
Hook: Think of image generators as artists who follow instructions to paint from noise into a detailed picture.
The Concept: Diffusion models are AIs that turn random noise into images by gradually refining details. How it works:
- Start with static-like noise.
- Step by step, remove noise and add structure.
- End with a photo-like image. Why it matters: Even the best of these artists can produce subtle structural mistakes, especially with people and animals. Anchor: "Draw a surfer on a wave," but the model paints a duplicated hand or a warped face.
Hook: Before, people tried to fix these flaws with lots of human labeling and simple errors like blur or noise.
The Concept: Earlier datasets and methods targeted old artifact types and relied on expensive human labels. How it works:
- Humans annotate many images with masks and text.
- Models learn to detect those artifact styles.
- But new generators make different, more realistic structural mistakes. Why it matters: Human labeling is slow and costly, and old labels don't match today's failures. Anchor: A dataset that teaches "blur" won't help much when the real problem is "extra fingers."
Hook: So what was missing? A way to create many realistic, modern-style flaws with good labels, without humans.
The Concept: The gap is scalable, up-to-date, annotator-free data that matches how new models fail. How it works:
- Start from clean real images.
- Automatically inject believable structural mistakes in the right spots.
- Auto-label what changed, where, and why. Why it matters: With this, we can train VLMs and generators to notice and fix modern artifacts. Anchor: Make a clean bear image and an "extra paw" version, plus a box and a sentence: "There is an extra paw on the bear."
Hook: Why should anyone care? Because tiny visual mistakes can lead to big real-world risks.
The Concept: High-stakes fields like medicine, robotics, and autonomous driving need images that make anatomical and physical sense. How it works:
- A diagnostic image must not hallucinate or miss body parts.
- A robot's camera must not "see" fused objects.
- A self-driving system must not misread warped road signs. Why it matters: Catching and correcting structural artifacts boosts safety, trust, and usefulness. Anchor: In a medical image, an extra or missing structure could mislead a diagnosis if nobody notices.
02 Core Idea
Hook: Imagine teaching a lifeguard to spot rip currents by showing both safe oceans and cleverly faked dangerous ones, with clear arrows and notes.
The Concept: The paper's key insight is to automatically create believable structural flaws in images, and auto-label them, by gently re-routing where and what a diffusion transformer paints during reconstruction. How it works:
- Look at a clean real image and find parts like "face," "hand," or "wheel."
- Decide what kind of flaw to make (extra, missing, warped, or fused) and where.
- During diffusion's "re-draw" (inversion-and-restore), nudge the positions and content of selected patches so the flaw appears naturally.
- Keep the rest of the image consistent.
- Filter low-quality results and write local and global explanations. Why it matters: Now we can mass-produce realistic artifact examples with boxes and text, no humans needed, so models learn to see and fix what they used to miss. Anchor: Take a surfer photo and subtly add a second hand right below the first; label the box and explanation. Repeat at scale.
Three analogies for the same idea:
- Map rerouting: A city map app (the diffusion model) normally sends traffic along correct streets; we gently redirect a few cars (patches) onto wrong lanes, creating controlled "traffic jams" (artifacts) while the rest of the city stays normal.
- Sticker swapbook: You lift a sticker (a patch) from one page and place it in a slightly wrong spot on another page; now the page looks real but wrong in a meaningful way.
- Orchestra seat shuffle: During rehearsal, swap where a few musicians sit (position) and who plays (content). The symphony still sounds cohesive, but a violin shows up in the tuba section, noticeably off if you listen carefully.
Before vs. After:
- Before: Models trained on human-labeled, older-style artifacts; VLMs miss subtle, structural mistakes; image pipelines can't auto-correct.
- After: Automated, modern structural artifacts at scale; VLMs learn to detect, localize, and explain; generators can be steered away from flaws or auto-fixed.
Hook: You know how moving a sticker's place and swapping what's printed on it can change a picture a lot, even if the rest stays the same?
The Concept: Inversion-injection is the trick: while the diffusion transformer "rebuilds" an image, we inject two things into chosen patches, where they pretend to be (position) and what they carry (content), to craft realistic local flaws. How it works:
- Invert: Turn the clean image into the model's internal notes (cached values).
- Choose target patches (what to change) and reference patches (what to borrow).
- Position injection: Tell target patches to act like they live at the reference places.
- Value injection: Give target patches the reference content.
- Denoise forward: The model paints a natural-looking image... with a precise structural flaw. Why it matters: Without position + content control, artifacts look fake or the background breaks; with both, flaws look believable and local. Anchor: Borrow a right arm's content and place it slightly below where it should be, so the surfer appears to have two right arms while the ocean stays perfect.
Hook: Think of a team with clear jobs: a scout, a builder, and a critic.
The Concept: ArtiAgent is a three-agent system (Perception finds parts, Synthesis injects artifacts, Curation filters and explains) that builds clean/artifact pairs with rich labels. How it works:
- Perception: Spot entities and subentities, like "person → face, hand."
- Synthesis: Pick an artifact type and apply patch mapping tools (add, remove, distort, fuse).
- Inversion-injection: Perform precise position + value swaps while reconstructing the image.
- Curation: Filter weak results with LPIPS or a VLM, then produce local boxes and easy-to-read explanations. Why it matters: This pipeline scales without humans yet produces training-ready data. Anchor: From a beach photo, Perception finds the surfer's hand; Synthesis adds a second hand; Curation keeps only convincing results and writes, "There is an extra hand on the surfer."
03 Methodology
At a high level: Real image → Perception agent (find entities/parts) → Synthesis agent (plan patch mappings + inversion-injection) → Curation agent (filter + explain) → Output: clean/artifact pairs with boxes and text.
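The flow above can be sketched as one function. Everything here is illustrative: `perceive`, `synthesize`, and `curate` are hypothetical callables standing in for the paper's agents, not its actual API.

```python
def artiagent_pipeline(image, perceive, synthesize, curate):
    """Sketch of the three-agent flow; the agent callables are assumptions."""
    parts = perceive(image)                           # entities + subentities with regions
    artifact_img, region = synthesize(image, parts)   # patch mapping + inversion-injection
    return curate(image, artifact_img, region)        # filtered, labeled sample (or None)
```

Each stage only passes data forward, so agents can be swapped or upgraded independently.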
Hook: You know how a mechanic first identifies parts, then decides what to tweak, and finally checks the work?
The Concept: The Perception agent (entities and subentities) finds trustworthy targets for artifact injection. How it works:
- Vocabulary: Use a VLM to list entities (like person, dog, car) with parts (face, hand, wheel), grouped into two levels: peripheral (fingers, ears) and intermediate (face, arm).
- Grounding: Use Grounded-SAM to draw regions for entities and parts in the actual image.
- Match: Link each sub-part to its parent (a hand belongs to this person) via overlap. Why it matters: If you don't know where the parts are, you can't place realistic, local artifacts. Anchor: Detect a "surfer → hand" region so later we can duplicate or remove it correctly.
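The overlap-based matching step can be sketched as follows. The box format and the greatest-overlap rule are assumptions for illustration; the paper only states that sub-parts are linked to parents via overlap.

```python
def box_intersection(a, b):
    """Overlap area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def match_parts_to_parents(parts, parents):
    """Assign each sub-part box to the parent entity box it overlaps most.

    parts, parents: {name: (x1, y1, x2, y2)}; returns {part_name: parent_name}.
    """
    matches = {}
    for part_name, part_box in parts.items():
        best = max(parents, key=lambda p: box_intersection(part_box, parents[p]))
        matches[part_name] = best
    return matches

parents = {"surfer": (100, 50, 300, 400), "board": (80, 350, 350, 420)}
parts = {"hand": (250, 180, 290, 230)}
print(match_parts_to_parents(parts, parents))  # {'hand': 'surfer'}
```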
Hook: Think of picking puzzle pieces to move from one spot to another.
The Concept: The target–reference patch mapping toolbox chooses which patches to change (target) and which patches to borrow from (reference), tailored to the artifact type. How it works:
- Add (duplication): Copy a sub-part (like a hand) and place it nearby without colliding with other parts.
- Remove (omission): Fill the sub-part's region with nearby background patches.
- Distort (distortion): Shuffle or jitter patches within a part to warp its shape but keep appearance consistent.
- Fuse (fusion): In overlapping entities, swap border patches so their boundaries blend unnaturally. Why it matters: Carefully planned patch moves create believable flaws instead of random glitches. Anchor: For "extra paw," Add picks nearby empty patches to place a second paw with minimal overlap.
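To make the target–reference idea concrete, here is a minimal sketch of one tool, the Add (duplication) mapping on a flat patch grid. The grid layout, index arithmetic, and fixed-offset placement are illustrative assumptions; the actual toolbox also avoids collisions with other parts.

```python
def duplicate_mapping(grid_w, grid_h, source_idxs, offset):
    """Plan a 'duplication' artifact: map each source (reference) patch to a
    shifted target patch on a grid_w x grid_h grid, dropping targets that
    fall off the grid. Returns {target_idx: reference_idx}."""
    dx, dy = offset
    mapping = {}
    for idx in source_idxs:
        x, y = idx % grid_w, idx // grid_w
        tx, ty = x + dx, y + dy
        if 0 <= tx < grid_w and 0 <= ty < grid_h:
            mapping[ty * grid_w + tx] = idx
    return mapping

# Duplicate a 2-patch "hand" two rows below its original position
print(duplicate_mapping(8, 8, [9, 10], (0, 2)))  # {25: 9, 26: 10}
```

Remove, distort, and fuse would produce the same kind of target-to-reference dictionary, just with different selection rules (background patches as references, shuffled in-part indices, or swapped border patches).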
Hook: Imagine telling a painter not only what to paint, but where their brush should think it is.
The Concept: Inversion-injection modifies a diffusion transformer's self-attention inputs so selected patches "think" they stand at new positions and carry new content. How it works:
- Inversion: Run the image backward to cache the model's value embeddings for each layer.
- Position injection: For target patches, apply the reference's positional code so they act like they're located elsewhere.
- Value injection: For target patches, reuse cached content from the references; for background patches, keep their original content.
- Denoise: Let the model render a coherent scene where only the chosen structure is wrong. Why it matters: Position-only or value-only tweaks are weaker; combining both makes local, realistic structural artifacts while preserving the rest of the image. Anchor: Move the "nose" position a bit and swap in content from a nearby patch, yielding a subtly warped face that still fits the photo.
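The position + value swap at one attention layer can be sketched like this. This is a toy stand-in, not the paper's implementation: real positional codes are embeddings (e.g., rotary codes) rather than bare indices, and the injection runs inside the transformer's forward pass.

```python
def inject(values, pos_ids, mapping):
    """Sketch of position + value injection for one self-attention layer.

    values:  list of per-patch content vectors cached during inversion
    pos_ids: list of per-patch positional indices (toy stand-in for real codes)
    mapping: {target_patch: reference_patch}
    Target patches take the reference's position AND content; all other
    patches keep their originals, so the background stays intact."""
    v, p = [row[:] for row in values], pos_ids[:]
    for tgt, ref in mapping.items():
        p[tgt] = pos_ids[ref]   # position injection: act as if located at ref
        v[tgt] = values[ref][:] # value injection: carry ref's content
    return v, p
```

Applying only the `p` line (position-only) or only the `v` line (value-only) corresponds to the weaker ablations the text mentions.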
Hook: Think of a quality inspector who keeps only the most convincing examples and writes helpful notes.
The Concept: The Curation agent filters poor artifacts and writes local and global explanations. How it works:
- For distortion: Use LPIPS (a perceptual difference score) to keep changes that are visible but not absurd.
- For add/remove/fuse: Show a VLM three images (context-masked original, original crop, artifact crop) and ask if the change is real and in-bounds.
- Explanations: The same VLM explains what's wrong in each box (local) and summarizes the whole image (global). Why it matters: Without filtering and clear text, training would include confusing or low-quality examples. Anchor: Keep only the surfer images where the extra hand is really there and nicely blended; add a box and, "There is an extra hand on the surfer."
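The LPIPS-based distortion filter amounts to a band check: visible but not absurd. A minimal sketch, assuming precomputed LPIPS scores and illustrative thresholds (the paper's actual cutoffs are not given here):

```python
def keep_distortion(lpips_score, low=0.1, high=0.6):
    """Band filter for distortion samples: the perceptual change must be
    visible (>= low) but not absurd (<= high). Thresholds are illustrative."""
    return low <= lpips_score <= high

scores = {"a.png": 0.02, "b.png": 0.31, "c.png": 0.85}
kept = [name for name, s in scores.items() if keep_distortion(s)]
print(kept)  # ['b.png']
```

Samples below the band are imperceptible edits (useless as artifacts); samples above it are obviously broken images that would teach the wrong lesson.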
Examples with actual data steps:
- Input: A real image of a child with a teddy bear.
- Perception: Entities: child, teddy; Subentities: child → face, hand; teddy → ear, paw.
- Synthesis: Choose "duplication" on the child's hand; the Add tool picks nearby empty patches to place a second hand; inversion-injection applies position + value swaps during reconstruction.
- Curation: The VLM judges, "An extra hand is present in the region." Local text: "There is an extra hand on the child." Global text: "The child appears to have an extra hand, creating an unnatural look."
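A finished sample from these steps could be stored as a record like the one below. The field names, file names, and box coordinates are hypothetical; the paper only specifies that each pair carries boxes plus local and global text.

```python
import json

# Hypothetical record schema for one clean/artifact training pair
sample = {
    "clean_image": "child_teddy.png",
    "artifact_image": "child_teddy_extra_hand.png",
    "artifact_type": "duplication",
    "box": [412, 318, 470, 380],  # illustrative pixel coordinates
    "local_text": "There is an extra hand on the child.",
    "global_text": "The child appears to have an extra hand, creating an unnatural look.",
}
print(json.dumps(sample, indent=2))
```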
The secret sauce:
- Combining position and value injection inside a diffusion transformer during inversion-restoration yields realistic, localized structural artifacts that older editing tricks struggle to produce.
- The three-agent design turns this into a repeatable recipe that scales to 100K examples with boxes and explanations.
04 Experiments & Results
Hook: If you build a great practice set, your team should win more games, not just at home, but also away.
The Concept: The authors train open-source VLMs on ArtiAgent's 100K examples and test them on several benchmarks, including a new one called ArtiBench. How it works:
- Build a 100K clean/artifact dataset with boxes and local/global text via ArtiAgent.
- Fine-tune VLMs (e.g., Qwen2.5-VL-7B, InternVL3.5-8B) on detection (is there an artifact?), localization (where?), and explanation (what's wrong?).
- Evaluate on ArtiBench (1K recent-model images, human-labeled) and prior sets (RichHF, LOKI, SynthScars).
- Compare against proprietary VLMs (GPT-4o, GPT-5, Gemini-2.5-Pro) and segmentation baselines (PAL, DiffDoctor, LEGION). Why it matters: Beating strong baselines on modern, human-checked data shows the synthetic training really teaches models to see subtle flaws. Anchor: On ArtiBench binary detection, ArtiAgent-trained models reach around 0.63 accuracy, like getting an A- while many others score closer to B- or lower.
The competition and tasks:
- Detection (true/false): Does the image contain any artifact?
- Localization (mIoU & F1): Can you draw accurate boxes or masks where the flaw is?
- Explanation (ROUGE, embedding similarity): Can you describe whatâs wrong in plain language?
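For the localization task, the core metric is box IoU averaged over samples. A minimal sketch of that computation (box format and simple pairing are assumptions; benchmark scoring details may differ):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def mean_iou(preds, gts):
    """mIoU over paired predicted / ground-truth boxes."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```

A perfect box scores 1.0; a box that misses the flaw entirely scores 0.0, so mIoU directly rewards precise localization.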
Scoreboard with context:
- Detection: ArtiAgent improves open models (e.g., InternVL3.5-8B jumps by about 26.5 percentage points in accuracy on ArtiBench) and matches or exceeds GPT-5/Gemini-2.5-Pro in several settings. This is like moving from guessing to reliably noticing the oddities.
- Localization: On newer benchmarks (ArtiBench), older segmentation methods that did well on older data struggle; ArtiAgent-trained VLMs generalize better, like a player who adapts to a new rulebook.
- Explanation: ROUGE and semantic similarity rise notably, showing clearer, human-like descriptions of what's wrong (e.g., "an extra finger on the right hand").
Surprising findings:
- Even small synthetic subsets (as little as 1K examples) already beat some strong baselines for localization and explanation. That's like getting big gains from a short, focused practice.
- Binary detection benefits the most from lots of variety; performance keeps rising up to 100K.
Beyond perception: two downstream wins:
- Reward-guided generation: Train a preference scorer (Bradley–Terry with CLIP) that prefers the clean over the artifact image. At sampling time, do best-of-N search guided by this reward. Across six rounds, the reward steadily increases, and images look more artifact-free over time. Think of a coach picking the best of many takes, with a score that favors fewer flaws.
- VLM-guided correction: The trained VLM finds a flawed region and its description; an inpainting model (FLUX inpainting) fixes it; the VLM re-checks; repeat until clean. Qualitatively, this loop removes extra parts and warped seams while keeping the rest natural.
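Both downstream loops can be sketched compactly. The Bradley–Terry loss below is the standard pairwise form; `generate`, `reward`, `detect`, and `inpaint` are hypothetical callables standing in for the diffusion sampler, the CLIP-based scorer, the trained VLM, and the inpainting model.

```python
import math
import random

def bt_loss(r_clean, r_artifact):
    """Bradley-Terry pairwise loss: small when the scorer rates the clean
    image above its artifact counterpart."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_clean - r_artifact))))

def best_of_n(prompt, generate, reward, n=4, seed=0):
    """Best-of-N search: sample n candidates, keep the highest-reward one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng.random()) for _ in range(n)]
    return max(candidates, key=reward)

def correction_loop(image, detect, inpaint, max_rounds=3):
    """Detect-inpaint-recheck loop: detect returns None when the image is
    clean, else a (box, description) finding for the inpainter to fix."""
    for _ in range(max_rounds):
        finding = detect(image)
        if finding is None:
            return image
        image = inpaint(image, *finding)
    return image
```

In production, `max_rounds` caps cost when the inpainter cannot fully remove a flaw.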
Why the results matter:
- They show that synthetic-but-realistic, well-labeled data about modern structural artifacts can lift open VLMs to or above proprietary levels for crucial safety tasks.
- They also show practical pathways: fewer flaws at generation time and automatic repairs after the fact. Anchor: Given the prompt "a man holding a phone," the reward-guided sampler moves from a six-finger image early on to a normal hand later, as the reward goes up.
05 Discussion & Limitations
Hook: Even great tools have limits; knowing them helps you use the tools wisely.
The Concept: Limitations and cautions clarify where ArtiAgent shines and where it may struggle. How it works:
- Data source dependence: ArtiAgent starts from real images and a specific diffusion transformer; unusual scenes or edge cases may still be hard.
- Coverage gaps: It targets four structural types (duplication, omission, distortion, fusion). Other rare or exotic artifacts may remain uncovered.
- Tool sensitivity: The realism of injected artifacts depends on patch mapping choices and inversion-injection hyperparameters; poorly chosen settings can fail filtering.
- VLM filters: The curation VLM can occasionally over- or under-accept borderline cases; thresholds and prompts matter. Why it matters: Understanding these limits prevents overconfidence and guides future improvements. Anchor: If a scene is extremely crowded or stylized, the Add tool might place parts too close to others, and the VLM filter might disagree with a human.
Required resources:
- A diffusion transformer with access to internal attention inputs (e.g., FLUX.1-dev) and an inversion-restoration method (e.g., FireFlow) to enable injection.
- A capable VLM (e.g., GPT-4o or similar) for perception vocabulary, filtering, and explanations during data synthesis.
- Compute to synthesize ~100K examples and fine-tune target VLMs; storage for paired images and annotations.
When not to use:
- If your use-case concerns text alignment errors (wrong object entirely) rather than structural artifacts, other datasets are more suitable.
- If you cannot access a compatible diffusion transformer or inversion pipeline, artifact injection won't run as designed.
- For ultra-fine medical subtleties requiring expert anatomical nuance, human-in-the-loop validation may still be needed.
Open questions:
- Can we expand to more artifact families (e.g., illumination-physical mismatches) while keeping realism high?
- How to make filters provably robust across domains (cartoons, thermal, medical) without human oversight?
- Can we integrate reward models directly into the sampler to avoid best-of-N brute search?
- How to generalize beyond images to video (temporal artifacts) or 3D scenes (geometric consistency)?
06 Conclusion & Future Work
Three-sentence summary:
- This paper presents ArtiAgent, a three-agent system that automatically injects realistic structural artifacts into images and auto-labels them with boxes and explanations by manipulating where and what a diffusion transformer paints.
- Training open-source VLMs on the resulting 100K pairs significantly improves artifact detection, localization, and explanation on a new modern benchmark (ArtiBench) and prior datasets, rivaling or surpassing proprietary systems.
- The trained models also guide generation away from flaws and help inpainting loops fix artifacts automatically, improving end-to-end image pipelines.
Main achievement:
- A practical, scalable recipe (target–reference patch mapping plus inversion-injection inside a diffusion transformer) turns clean images into realistic, richly annotated artifact examples without human labels.
Future directions:
- Broaden artifact categories and domains (video, 3D, medical), tighten filters for harder edge cases, and integrate reward models directly into samplers for faster, cleaner generations.
Why remember this:
- It shows that the best way to teach models to see and fix modern flaws is to create those flaws on purpose, with precision, realism, and clear explanations, so the models learn exactly what to watch for and how to respond.
Practical Applications
- Build and release artifact-aware VLMs that can flag and explain subtle flaws in AI-generated images.
- Add an artifact filter to image-generation pipelines to automatically reject candidates it flags as flawed during sampling.
- Use VLM-guided inpainting to repair detected artifact regions in batch image post-processing.
- Train reward models on ArtiAgent pairs to steer best-of-N sampling toward cleaner images in production.
- Create domain-specific artifact datasets (e.g., product photos, robotics scenes) by re-running ArtiAgent on custom image pools.
- Benchmark vendor models with ArtiBench to compare artifact awareness before deployment.
- Generate targeted test cases (e.g., extra digits, fused tools) for safety evaluations in robotics and autonomy.
- Provide explainable QA checks for creative workflows, helping artists quickly spot and fix subtle structural errors.
- Pre-train quality-assurance bots for marketplaces to flag implausible listings or ads with structural visual mistakes.