AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
Key Summary
- AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
- It includes 209 tasks across 25 sub-domains and 7 big categories (like Commerce, Geography, Technology) built from real photos, screenshots, and diagrams.
- Tasks force agents to mix tools such as web search, image search, page visiting, and a code interpreter for image processing and math.
- State-of-the-art models still struggle: the best model (Gemini-3-Pro with tools) scored only 27.3% accuracy, showing lots of room for improvement.
- Hard tasks can take more than 25 tool calls, and average tool turns are high, highlighting the need for long-horizon planning and checking.
- Most failures start with visual misidentification (misreading small or subtle details), which then leads to wrong searches and wrong answers.
- Ablation shows the full tool set works best; mixing visual manipulation with retrieval beats using only one kind of tool.
- Multi-image inputs often help because extra views reduce ambiguity and improve grounding, even though cross-image reasoning is required.
- Test-time scaling (generating multiple solutions and picking the best) helps, but still doesn't solve the benchmark.
- AgentVista aims to accelerate more reliable multimodal agents for realistic shopping, travel planning, troubleshooting, and more.
Why This Research Matters
AgentVista pushes AI beyond toy problems and into the messy reality of everyday life, where answers depend on tiny visual cues plus reliable web facts and careful math. This matters for shopping with health constraints, planning travel by reading schedules and prices, and fixing things from photos, all of which require many steps. By revealing where current agents fail, especially on fine-grained visual grounding, AgentVista guides researchers toward the most impactful improvements. Clear, checkable answers also make progress easy to measure and compare. In short, this benchmark is a practical map for building AI helpers that people can trust in real, high-stakes tasks.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're helping a friend fix their bike. You look at a photo of the chain, compare it to a diagram online, read a how-to page, and do a bit of math to make sure you buy the right length of chain. That's a lot of steps, and all of them depend on what you see in the picture.
The Concept: Multimodal agents are AIs that can use different kinds of information (like images and text) together to solve problems. In real life, solving things from photos usually needs many steps, multiple tools, and careful checking. Before this paper, most tests focused on single-turn questions or just one tool, so they missed how messy, long, and realistic real tasks actually are. Why it matters: Without testing those real, many-step challenges, we don't know if agents can truly help with everyday jobs like shopping with allergies in mind or planning a bus route by reading a posted schedule.
Anchor: Think of an AI that helps you plan a weekend trip. It must read a transit map photo, check store hours online, compute fares, and order stops. Older tests didn't truly check all of that together.
New Concept 1: Multimodal Agents. Hook: You know how you can understand a comic by reading speech bubbles (text) and also looking at the drawings (images) at the same time? The Concept: Multimodal agents combine text and images to figure things out. How it works: 1) Look at the image for clues, 2) read any text or constraints, 3) connect the dots between what's seen and what's asked, 4) decide next steps. Why it matters: If an agent ignores either the picture or the words, it can miss the solution. Anchor: An agent sees a recipe screenshot and a pantry photo; it must match ingredients in the picture to the recipe list.
New Concept 2: Tool Use. Hook: When building a birdhouse, you use a saw, a ruler, and nails, not just one tool. The Concept: Tool use means the agent calls helpers like web search, image search, page visiting, or a code interpreter. How it works: 1) Choose the right tool, 2) feed it a clear query or code, 3) read the result, 4) decide the next action. Why it matters: Without the right tool at the right time, the agent gets stuck or guesses. Anchor: To price flooring, the agent searches for product specs online, then uses code to multiply plank coverage by room size.
New Concept 3: Long-Horizon Task Execution. Hook: Cooking a three-course meal takes many steps over time; you can't do it in one move. The Concept: Long-horizon task execution means solving problems that need many steps and checks before the final answer. How it works: 1) Plan steps, 2) gather info, 3) verify, 4) adjust, 5) continue until done. Why it matters: If an agent can't handle long sequences, it fails as soon as the job gets complex. Anchor: Diagnosing a gadget from a photo, reading a manual online, checking a part number, and then calculating replacement cost.
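The tool-use and long-horizon ideas above can be sketched as a plan-act-observe loop with a fixed turn budget. This is a minimal illustration, not the paper's actual harness: the `model.propose` interface, the tool names, and the action dictionary shape are all hypothetical.

```python
# A minimal plan-act-observe loop for a tool-using agent.
# `model` and the tool registry are hypothetical stand-ins; AgentVista's
# real harness is not specified at this level of detail.

def run_agent(model, tools, task, max_turns=30):
    """Let the model pick a tool each turn until it commits to an answer."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model.propose(history)  # e.g. {"tool": "websearch", "arg": "..."}
        if action["tool"] == "final_answer":
            return action["arg"]                       # concise, checkable answer
        observation = tools[action["tool"]](action["arg"])  # call the chosen tool
        history.append((action["tool"], observation))       # feed the result back
    return None  # turn budget exhausted without a final answer
```

The key property the benchmark stresses is visible here: every turn spends budget, so an agent that misreads the image early wastes its remaining turns chasing the wrong leads.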
New Concept 4: Realistic Visual Scenarios. Hook: Finding your friend in a crowded playground isn't easy; there are tiny clues everywhere. The Concept: Realistic visual scenarios are real-world pictures full of detail, clutter, and subtle cues. How it works: 1) Spot small but key details, 2) compare across views, 3) keep constraints in mind. Why it matters: If images are too simple, we won't know whether the agent can handle real-life messiness. Anchor: A pantry shelf with many sauces, tiny allergy icons, and different sizes, all of which matter to the final choice.
New Concept 5: Visual Reasoning. Hook: Solving a jigsaw puzzle means using the picture clues to fit pieces together. The Concept: Visual reasoning is figuring things out from images. How it works: 1) Identify objects and text in pictures, 2) measure or compare regions when needed, 3) link what you see to what you must decide. Why it matters: If an agent sees but doesn't reason, it can't use the image to solve the task. Anchor: Matching a floor pattern across two room photos before ordering matching planks.
New Concept 6: Hybrid Tool Use. Hook: Sometimes you need a Swiss Army knife; many tools used together beat any single tool. The Concept: Hybrid tool use means mixing image tools (crop/measure) with web tools (search/visit) and code. How it works: 1) Inspect the image (crop/zoom), 2) search for missing facts, 3) visit pages to verify, 4) compute or parse with code, 5) repeat. Why it matters: Real tasks need both seeing and knowing; one alone often fails. Anchor: Crop a LEGO step diagram to read a part number, search a guide, then compute which step went wrong.
New Concept 7: Benchmark Evaluation. Hook: A good school test checks what you truly learned, not just if you memorized one fact. The Concept: Benchmark evaluation measures how well agents perform on standard tasks with clear answers. How it works: 1) Present tasks, 2) let the agent act with tools, 3) compare final answer to ground truth, 4) compute accuracy. Why it matters: Without solid tests, we can't tell if models are actually improving. Anchor: A math quiz with exact numeric answers; either you got it right or not.
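The grading in Concept 7 reduces to exact-match accuracy over final answers. A sketch, assuming answers are short strings and using only trivial normalization (the paper itself uses a fixed model judge to check answer formats):

```python
def normalize(answer: str) -> str:
    """Light cleanup so '27.27 %' and '27.27%' compare equal."""
    return answer.strip().lower().replace(" ", "")

def exact_match_accuracy(predictions, ground_truths):
    """Fraction of tasks where the final answer exactly matches the reference."""
    correct = sum(
        normalize(pred) == normalize(truth)
        for pred, truth in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```

Because each task has a single deterministic answer, this binary scoring avoids partial-credit debates entirely.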
New Concept 8: User-Centric Tasks. Hook: When you help your grandma pick groceries, you think about her needs (like allergies) first. The Concept: User-centric tasks are written like real requests with real constraints (time, budget, safety). How it works: 1) Capture real intent, 2) keep constraints central, 3) lead to a single, verifiable answer. Why it matters: Agents must be helpful to people's real needs, not just toy puzzles. Anchor: "Which nut-free chocolate sauce here has the lowest sugar per 100 g, and what's its price in USD?"
Before this paper, many tests were either single-turn or focused on just one skill (like browsing or code). They didn't fully test the realistic, long, cross-tool workflows people truly need. AgentVista fills that gap by pairing messy, detail-rich images with multi-step tool interactions and exact answers.
02 Core Idea
Hook: Picture a scavenger hunt where each clue is hidden in a photo, and to solve it you must look closely, search the web, open pages, and sometimes do a bit of math. Only when you combine all of that do you win.
The Concept: The "aha!" is to build a benchmark that faithfully tests generalist multimodal agents on realistic, long-horizon tasks that demand interleaved, hybrid tool use grounded in visual evidence, with clear, verifiable final answers. How it works (big picture):
- 1) Provide real, detail-rich images and multi-image views. 2) Require agents to mix tools: web search, image search, visit, and code interpreter. 3) Make tasks long and constraint-heavy. 4) Score only the final exact answer to keep evaluation simple and fair. Why it matters: Without such a test, agents might look smart on easy, one-step demos but fail in real life.
Three analogies:
- Orchestra analogy: The agent is a conductor. Image tools are strings (fine detail), web search is brass (loud facts), page visit is woodwinds (context), and code is percussion (precise timing and math). Only together does the music (solution) sound right.
- Detective analogy: Photos are the crime scene, web search is the library, visiting pages is interviewing witnesses, and code is the lab test. You need all to crack the case.
- Swiss Army rescue: Each tool handles a different obstacle; you must know when to flip out which blade and how to combine them to finish the rescue safely.
Before vs. After:
- Before: Benchmarks checked single-turn Q&A or one tool skill at a time.
- After: AgentVista checks if agents can survive messy reality: long plans, subtle visuals, and multiple tools, end to end.
Why it works (intuition):
- Real images force true visual grounding (no shortcutting by memorized text alone).
- Interleaved tools make the agent prove it can see, retrieve, verify, and compute.
- Deterministic answers let us score cleanly (like grading math), avoiding debates.
- Long horizons expose whether the agent can stay on track without drifting.
Building blocks:
- Diverse domains: 25 sub-domains across 7 categories (commerce, geography, entertainment, technology, society, academics, culture).
- Realistic images: 308 images (single and multi-image) with clutter and small details.
- Tool suite: websearch, imagesearch, visit, code interpreter (for image processing and programming/math).
- Quality pipeline: agent-centric filtering, expert finalization, execution filtering, and two-round verification.
- Exact answers: concise, checkable outputs (numbers, names, short phrases).
New Concept 9: The AgentVista Benchmark. Hook: Think of AgentVista as a super-tough obstacle course for AI, designed like a real city instead of a neat indoor gym. The Concept: AgentVista is a benchmark of 209 ultra-challenging tasks that require mixing visual understanding with long, interleaved tool use to reach a single correct answer. How it works: 1) Start from real images and real needs, 2) write vision-centric queries with constraints, 3) require natural tool mixing (at least two tool types), 4) verify with reproducible steps and fixed answers. Why it matters: It reveals whether agents can actually help with realistic jobs, not just pass classroom quizzes. Anchor: A home-renovation task where the agent must match flooring across photos, confirm room orientation, fetch product specs, and compute total cost. Only a correct final price passes.
New Concept 10: Error Analysis. Hook: When your coach tests your soccer skills, they don't only see if you can kick; they watch your whole play across the field. The Concept: Error analysis studies where and why agents fail across the whole play. How it works: 1) Label failure types (like misreading a picture or making math errors), 2) find main bottlenecks, 3) guide improvements. Why it matters: Fixing the right weakness (often visual misidentification here) makes real progress. Anchor: If an agent misreads a tiny nut-free icon, it will search the wrong product and compute the wrong price, even with perfect math later.
In short, AgentVista's central idea is to fairly and firmly test what real helpers need: seeing the right tiny details, retrieving the right facts, checking themselves with code, and doing it all over many steps until the answer is nailed down.
03 Methodology
High-level recipe: Input images and a real question → [Step A: Agent-centric filtering of candidate tasks] → [Step B: Expert finalization into realistic, vision-first requests] → [Step C: Execution filtering to ensure tool necessity and reproducibility] → [Step D: Two-round verification for stability, visual dependence, and a unique answer] → Output: The final AgentVista benchmark with clear scoring.
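The four-stage funnel above is essentially a chain of keep/discard filters over candidate tasks. A sketch of that shape, where the predicate functions stand in for the human and model checks described in Steps A-D (the field names and thresholds are illustrative, not the authors' actual tooling):

```python
# The curation funnel as chained filters. Each stage's predicate is a
# placeholder for the human/model checks in Steps A-D; the task fields
# (visual_richness, tool_types_needed, ...) are invented for illustration.

def curate(candidates, stages):
    """Run each stage's keep-predicate in order, narrowing the pool."""
    tasks = candidates
    for name, keep in stages:
        tasks = [t for t in tasks if keep(t)]
    return tasks

stages = [
    ("A: agent-centric filtering", lambda t: t["visual_richness"] > 0.5),
    ("B: expert finalization",     lambda t: t.get("deterministic_answer") is not None),
    ("C: execution filtering",     lambda t: t["tool_types_needed"] >= 2),
    ("D: two-round verification",  lambda t: t["reproduced_twice"]),
]
```

The ordering matters: cheap automated checks (Step A) run over 300k+ images, while expensive human verification (Step D) only touches the survivors, which is how the pool shrinks to 209 tasks.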
Step A: Agent-centric filtering (from 300k+ images to strong candidates)
- What happens: Models (e.g., CLAUDE-OPUS-4) help filter out low-visual or trivial cases (like pure OCR snapshots), propose an initial tool-compatible query, and humans keep only images with rich, non-trivial visual evidence.
- Why this exists: Without weeding out easy or text-only cases, the benchmark wouldn't test visual grounding or long-horizon behavior.
- Example: A messy pantry photo is kept because you must read small nutrition labels and compare products under allergy constraints; a crisp signboard with big text is dropped.
Step B: Expert finalization (turning candidates into ultra-challenging but natural tasks)
- What happens: Trained annotators rewrite each query as a realistic, self-contained user request that still depends on visual details. They ensure hybrid tool use is necessary and record the deterministic answer plus key evidence and tool steps.
- Why this exists: Agents should solve genuine, life-like tasks with clear constraints (like times, budgets, or compatibility) and end with a single verifiable answer.
- Example: "Which nut-free chocolate sauce here has the lowest sugar per 100 g, and what is its price in USD?" requires checking the photo first (to pick the candidates) and then web pages for sugar content and price.
Step C: Execution filtering (prove it works in the tool environment)
- What happens: Each task is executed end-to-end in the same tool environment to confirm that the answer is reproducible. They keep only tasks that require interleaving at least two tool types. They also remove tasks solvable with no tool access.
- Why this exists: If a task can be solved from the prompt alone or needs only one tool hop, it doesn't stress the agent's planning and grounding the way real problems do.
- Example: If a shopping question can be answered by a single web lookup ignoring the image, it's removed.
Step D: Two-round verification (final safety checks)
- What happens: Round 1 removes tasks with weak visual dependence or shaky answers. Round 2 independently reproduces the tools-and-evidence trail to confirm the same exact answer.
- Why this exists: To avoid unstable answers or tasks that depend on non-durable facts; the benchmark should be clear, fair, and stable.
- Example: If a price changes over time, annotators may add a date constraint so the correct answer remains checkable.
The tool suite (what agents can use and why)
- websearch: find pages for missing facts. Without it, you can't pull in external knowledge.
- imagesearch: find reference photos or do reverse lookups. Without it, many visual matching tasks stall.
- visit: open a specific URL and extract page content. Without it, you can't read the details you just found.
- code interpreter: process images (crop, resize, measure), parse data, and compute. Without it, you can't verify tiny cues or do reliable math.
Concrete walk-through: The flooring example
- Input: Three images (room views and product references) and a query asking to match style, verify the correct room orientation, retrieve Lifeproof vinyl plank specs, and compute total material cost.
- Likely steps: (1) Code interpreter to zoom/crop floor patterns across photos, (2) websearch for product page, (3) visit to read coverage per box, (4) code interpreter to compute needed boxes for the measured room, (5) finish with the final USD total.
- What breaks without each step: Skip cropping and you might match the wrong pattern; skip visit and you might miss the exact coverage per box; skip code and you might miscalculate cost.
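The arithmetic in step (4) is simple but easy to fumble: boxes only come whole, so the agent must round up, not just divide. A sketch with made-up numbers (the room size, per-box coverage, price, and 10% waste allowance are all hypothetical; the real task's figures are not reproduced here):

```python
import math

def flooring_cost(room_sqft: float, coverage_per_box_sqft: float,
                  price_per_box: float, waste_factor: float = 1.10) -> float:
    """Total cost in USD: whole boxes only, with a waste allowance.

    The 10% waste_factor is a common rule of thumb, not from the paper.
    """
    needed_sqft = room_sqft * waste_factor
    boxes = math.ceil(needed_sqft / coverage_per_box_sqft)  # can't buy a partial box
    return boxes * price_per_box

# Hypothetical figures: 180 sq ft room, 20.06 sq ft per box, $44.98 per box.
total = flooring_cost(180, 20.06, 44.98)  # 10 boxes -> $449.80
```

Note that plain division (198 / 20.06 ≈ 9.87) would undercount by a full box, which is exactly the kind of calculation error the benchmark's code-interpreter step is meant to catch.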
Dataset summary and constraints
- Scale: 209 queries, 308 images; both single-image (72.2%) and multi-image (27.8%).
- Diversity: 25 sub-domains across 7 categories (commerce, geography, entertainment, technology, society, academics, culture).
- Evaluation: Final answers are concise and deterministic (numbers, short names), judged for exact match by a fixed model.
Secret sauce (whatās clever)
- Natural, interleaved hybrid-tool workflows: The tasks look like what real people do: inspect a photo, search, open pages, compute, repeat.
- Strict visual dependence: Queries hide key facts in the image, so agents must truly see, not just guess from text.
- Verifiability and stability: Clear final answers make scoring fair and reproducible, even as models evolve.
- Long-horizon pressure: Budgets up to 30 tool turns force agents to plan, check, and persist, just like real problem solving.
At a high level, AgentVista isn't just a set of questions. It's a carefully constructed obstacle course that proves whether an agent can look, think, search, verify, and calculate, over and over, until it lands on one correct, checkable answer.
04 Experiments & Results
The test: What did they measure and why?
- They measured exact-answer accuracy: did the agent's final answer match the ground truth (like grading a math result)? This keeps scoring simple and fair.
- They also capped tool interactions at 30 turns and used temperature 0.6 to keep conditions comparable.
- A fixed judge (GPT-4.1) checked whether final outputs fit the required answer format.
The competition: Which models were compared?
- Strong commercial and open-source multimodal models, including GPT-4.1, the GPT-5 family (GPT-5/5.1/5.2), Gemini-3-Flash, Gemini-3-Pro, Grok-4, Claude series (Sonnet/Opus), and Qwen3-VL-235B, among others.
The scoreboard (with context):
- Overall difficulty: Even the best model, Gemini-3-Pro with tools, scored 27.27%. On an ordinary test that would be a failing grade; here it reflects how hard the exam is by design, and everyone struggles, not just one student.
- Many models were below 15% overall, especially open-source models (around 10–13%), showing a large performance gap.
- Tool turns were high: GPT-5.2 averaged 13.85 turns, and several models exceeded 10 turns, indicating long-horizon reasoning rather than quick wins.
- Category strengths varied: GPT-5.2 led Technology; GPT-5 and GPT-5.1 were strong in Commerce; Gemini-3-Pro did best overall and led Geography; Claude models did relatively well where careful reading and constraint following mattered.
- Multi-image often helped: Most models did better with multi-image inputs than single-image, sometimes by a big jump (Gemini-3-Pro: ~23.7% → ~36.8%). Extra views reduce ambiguity.
Tool-use patterns (who used what?):
- GPT-5 models leaned heavily on the code interpreter (often cropping, measuring, and calculating). Crop was the most common image operation across models, showing how critical local visual grounding is.
- Gemini and Claude models used web search more, favoring retrieval-driven workflows.
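Crop, the most common image operation reported above, is just region extraction so the model can re-examine a small area at full attention. A toy sketch on a nested-list "image" (real agents would use an imaging library such as Pillow inside the code interpreter; this stdlib-only version is for illustration):

```python
def crop(image, top, left, height, width):
    """Return the height x width sub-region starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

# A 4x4 toy "image" whose pixels are just index labels 0..15.
img = [[r * 4 + c for c in range(4)] for r in range(4)]

# Zoom into the 2x2 centre patch, as an agent would to read a tiny label.
patch = crop(img, 1, 1, 2, 2)  # -> [[5, 6], [9, 10]]
```

The point of crop-and-verify loops is that re-reading a magnified patch gives the model a second, cleaner look at exactly the cues it is most likely to misidentify.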
Ablation (what happens when you remove tools?):
- Full tool access worked best for both Gemini-3-Pro and Claude-Sonnet-4.5. Mixing visual manipulation (code) with retrieval (search/visit) was superior to either alone.
- Gemini-3-Pro: All-tools 27.27% > Vision-only 20.10% > No-tool 18.18%; Search-only 26.32% was close to all-tools, suggesting strong perception plus retrieval carried it.
- Claude-Sonnet-4.5: All-tools 17.70% ā Vision-only 17.22% > Search-only 13.40% ā No-tool 13.40%, suggesting it leaned more on image manipulation than on retrieval.
Where do models fail (error analysis)?
- Dominant failure: Visual misidentification (misreading subtle details) across models. This matches the benchmark's design: small, key cues in cluttered scenes.
- Second: Knowledge hallucination (asserting facts not supported by images or sources). Long-tail facts under constraints are still hard, even with web search.
- Others: Tool execution failures, calculation errors, and instruction misinterpretation also appeared but were less frequent overall.
Test-time scaling (does sampling more help?):
- Generating multiple solutions and picking the best (Best-of-K) improved accuracy for Gemini-3-Flash: from ~21.05% at K=1 up to ~30.62% at K=16. The upper bound Pass@K hit ~51.67% at K=16. Randomly picking one did not help with larger K.
- Translation: Better selection helps, but even many samples don't "solve" AgentVista; the core skills, visual grounding and long-horizon tool use, still need to improve.
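Pass@K (did at least one of K samples succeed) can be estimated from per-task sample outcomes. This sketch uses the standard unbiased estimator from the code-generation literature; whether the paper computes Pass@K exactly this way is an assumption, not stated in the text above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n total samples is among the c correct ones."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

The gap the numbers above describe, roughly 30.62% Best-of-K accuracy versus a roughly 51.67% Pass@K ceiling at K=16, is exactly the room that better answer selection (reward models, self-verification) could still claim without retraining.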
Takeaway: AgentVista exposes the real bottlenecks: seeing tiny yet crucial visual details correctly, retrieving the right facts, and stitching them together over long sequences with verification. Today's best agents can sometimes do it, but not reliably enough yet.
05 Discussion & Limitations
Limitations (what this can't do yet):
- Coverage is broad but finite: 209 tasks won't cover every possible real-world scenario. Some domains or rare edge cases may be underrepresented.
- Tool scope is controlled: Only four tool categories (web search, image search, visit, code interpreter). Real deployments may need more specialized tools (APIs, mapping engines, proprietary catalogs).
- Temporal drift: Even with checks and time constraints, some web facts can change, so long-term stability needs monitoring.
- Judge dependency: Using GPT-4.1 as a judge is practical but not perfect; misjudgments can happen in tricky formatting cases.
Required resources (to use it well):
- An agent that can parse images, plan multi-step workflows, and call tools reliably (including code-based image processing).
- Enough compute to handle up to 30 tool turns and image operations.
- Logging of tool calls and intermediate outputs for audit and error analysis.
When not to use (mismatches):
- If you only care about simple, single-hop image Q&A (e.g., "What color is the car?"), AgentVista is overkill.
- If your agent cannot call external tools or run code, many tasks become impossible by design.
- If your domain is purely text-based (no visual grounding needed), a text-only benchmark is more suitable.
Open questions (what we still don't know):
- Training: What forms of reinforcement learning or curriculum (e.g., repeated grounding) best teach interleaved perception and tool use?
- Robustness: How to make visual grounding work reliably under blur, occlusion, and very subtle cues?
- Memory and planning: What architectures best track constraints and partial evidence over 10–30+ steps without drifting?
- Tool selection: How should agents decide when to crop/measure vs. when to search/visit vs. when to compute, with minimal wasted calls?
- Evaluation: Beyond exact answers, are there scalable ways to credit partial progress without complicating scoring or hurting reproducibility?
Bottom line: AgentVista is a tough, realistic yardstick that makes current limits obvious and gives teams a clear target for progress.
06 Conclusion & Future Work
Three-sentence summary: AgentVista is a rigorous benchmark of 209 realistic, visually grounded tasks that force multimodal agents to mix tools over many steps and deliver exact, verifiable answers. Experiments show today's best systems still struggle (Gemini-3-Pro tops out at 27.3%), mainly due to visual misidentification and long-horizon tool-use challenges. By exposing these real bottlenecks, AgentVista provides a clear path for improving perception, planning, retrieval, and verification together.
Main achievement: Creating a high-fidelity, long-horizon, hybrid-tool benchmark with strict visual dependence and deterministic answers, so we can fairly measure whether agents truly handle real-world, multi-step workflows.
Future directions: Improve visual grounding (especially fine-grained cues), strengthen interleaved reasoning via reinforcement learning, enhance tool selection and self-verification, and expand domains and tools while keeping answers reproducible. Better test-time selection and reward models could also narrow the gap toward the upper bound.
Why remember this: AgentVista shifts the goalposts from toy demos to reality: if an agent can succeed here, it's far more likely to help with everyday tasks like shopping with allergies, planning routes under schedules, or troubleshooting from photos. It's the kind of benchmark that will shape the next generation of trustworthy, useful multimodal agents.
Practical Applications
- Evaluate your multimodal agent's readiness for real deployments by running it on AgentVista and analyzing failures.
- Design training curricula that emphasize fine-grained visual grounding (e.g., crop-and-verify loops) based on common error types.
- Tune tool-selection policies (when to crop/measure versus when to search/visit) to reduce wasted calls and drift.
- Adopt verifiable-answer formats in your product's workflows to enable robust, auto-graded QA and regression testing.
- Use ablation-style experiments (vision-only, search-only) to identify which capabilities limit your model's performance.
- Incorporate test-time selection (Best-of-K) and reward models to improve reliability without retraining.
- Develop RL or feedback pipelines that reward interleaved perception-retrieval-verification behavior on long horizons.
- Benchmark open-source agents against commercial ones to spot capability gaps and prioritize research investments.
- Build domain-specific extensions (e.g., mapping or EHR tools) while keeping AgentVista's verifiable-answer discipline.
- Create error dashboards that track visual misidentification vs. hallucination rates and guide targeted model/data fixes.