OmniGAIA: Towards Native Omni-Modal AI Agents
Key Summary
- OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.
- Most older tests only used two types of information at once (like just pictures and text), which missed how real life mixes many senses together.
- The benchmark builds each question from an omni-modal event graph, which links who/what/when/where across video, audio, and images, then hides key bits so the AI must reason step by step.
- OmniAtlas is a matching AI agent recipe that thinks while calling tools (web search, browser, code) and uses active perception to "look" or "listen" only where needed.
- To train OmniAtlas, the authors synthesize solution paths using hindsight-guided tree exploration, then do supervised fine-tuning and a fine-grained fixer called OmniDPO.
- On the OmniGAIA leaderboard, the top proprietary model (Gemini-3-Pro) scores 62.5 Pass@1 while a strong open model (Qwen-3-Omni) gets 13.3.
- Using the OmniAtlas recipe boosts Qwen-3-Omni from 13.3 to 20.8 Pass@1, mainly by improving when and how to use tools and verify facts.
- Hard tasks remain tough because they demand long chains of reasoning and careful tool use; simply making models bigger did not fix this.
- Analyses show two common failure modes: not calling tools enough and getting stuck in the wrong search direction (tool-query drift).
- This work points toward native omni-modal assistants that can ground what they see and hear, double-check with the web, compute precisely, and explain their answers.
Why This Research Matters
Real life mixes sights, sounds, and words, and we often double-check facts online and do quick math; OmniGAIA tests whether AI can do the same. This helps build assistants that watch long videos, listen carefully, and then verify dates or places before answering. It can boost accessibility tools that describe what's on screen and confirm information from the web. It can support students and teachers by combining lab videos, spoken notes, and web facts into one reliable explanation. It encourages safer AI by grounding answers in evidence rather than guesses. It also shows developers where models fail, like weak searches or early wrong turns, so training can improve the exact steps that matter. Overall, it moves AI from just recognizing things to solving problems like a careful helper.
Detailed Explanation
01 Background & Problem Definition
You know how in real life you use many senses together (your eyes, ears, and words) to figure things out, and you also grab tools like your phone's browser or a calculator when you need them? That's how humans solve real problems: we look, listen, think, and check. For a long time, most AI systems didn't work that way. Many could only handle two types of information at once (for example, picture + text), and even then, they mostly recognized what was there instead of planning multi-step solutions.

Before this work, the world of multi-modal AI focused on short, simple tasks. A model might answer, "What color is the car?" from a photo, or "What did the speaker say?" from a short clip. These tests were useful for perception (seeing and hearing), but not for full problem solving. Real problems are messier. A student might watch a long field-trip video with background sounds, read a sign on a bridge in the distance, remember a movie reference, search the web for dates, and finally compute an age. That's not just seeing or hearing; it's seeing + hearing + thinking + tool use over many steps.

The problem researchers faced was clear: there wasn't a strong way to test whether an AI could do all of that together. Existing benchmarks usually covered two modalities and short clips, asked multiple-choice questions (easy to guess), and didn't require the model to use outside tools to verify facts. Without a realistic test, it's hard to build a truly helpful AI assistant.

People tried a few things that didn't fully work. First, they made bigger models, hoping size alone would teach better reasoning. But results showed that simply adding parameters didn't close the gap. Second, they made bi-modal tests (image+text or audio+text) and got great perception scores, but these didn't measure if the AI could plan, search, verify, and compute across long contexts. Third, some agent systems used tools, but only in text-only worlds; they rarely mixed tools with vision and audio in one flow.
What was missing? A benchmark and an agent that are native to omni-modality (all senses together), plus real tool use and long-horizon reasoning. The paper fills this gap with two pieces: (1) OmniGAIA, a benchmark of 360 tasks across nine domains that force an AI to watch/listen/see and then use tools (like web search or a calculator) over multiple turns, and (2) OmniAtlas, a training recipe and agent behavior that naturally interleaves thinking with tool calls and uses active perception to inspect just the right video seconds or image regions.

Why should anyone care? Because daily life is omni-modal. Think of accessibility tools that listen to a lecture and read a diagram, travel assistants that analyze a vlog and check opening hours online, or science helpers that watch an experiment video, hear measurements, and compute results. Without a strong test like OmniGAIA and a capable agent recipe like OmniAtlas, AI assistants can miss details, trust wrong memories, or do the math before checking the facts. This work makes a step toward assistants that ground what they see and hear, verify with sources, and only then compute and answer.
02 Core Idea
The "aha!" in one sentence: Build a benchmark (OmniGAIA) and a matching agent recipe (OmniAtlas) that treat vision, audio, and language as first-class citizens while the agent plans, searches, verifies, and computes in multi-turn steps.
Multiple analogies:
- Detective kit: The AI is a detective who watches security footage (video), listens to witnesses (audio), reads signs (images), and then checks the city records (web tools) before doing the final math (calculator) to solve the case.
- Field trip journal: The AI goes on a long field trip, writes notes from what it sees and hears, looks up facts in a library, and calculates totals to finish the report.
- Cooking show + pantry: The AI watches and listens to a full cooking show, then runs to the pantry (web) to fetch missing ingredients (facts) and uses a scale (code tool) to measure precisely.
Before vs. after:
- Before: Tests mostly checked recognition in short, bi-modal settings. Agents often didn't verify with external tools and didn't handle long, interwoven clues.
- After: The benchmark requires multi-hop reasoning across minutes-long videos and audio, and models are judged on whether their open-form answers are verifiably correct. The agent recipe shows how to interleave thought, tool calls, and active perception to actually solve such problems.
Why it works (intuition):
- Real problems are chains. An event graph captures those chains across modalities. Hiding key bits (fuzzification) makes guessing hard and forces reasoning.
- Thinking while acting (Tool-Integrated Reasoning) lets the agent fetch missing evidence when its internal knowledge is uncertain.
- Active perception focuses attention on the exact frames, seconds, or image crops that matter, so details aren't lost.
- Training on good solution paths (hindsight-guided exploration + supervised fine-tuning) builds the right habits; OmniDPO then fixes specific weak steps (like bad searches) without breaking what already works.
Building blocks (each explained with the Sandwich pattern):
🍞 You know how you think while using a calculator or looking things up online? 🥬 Tool-Integrated Reasoning (TIR)
- What it is: The AI interleaves its thoughts with tool calls (search, browser, code), step by step.
- How it works: (1) Think about whatās missing; (2) call a tool; (3) read what comes back; (4) update the plan; (5) repeat until ready; (6) give the final answer.
- Why it matters: Without TIR, the AI guesses from memory, can't verify facts, and often answers confidently but wrong.
🍞 Example: Asking, "How old was this bridge when filming began?" The AI searches for the build year, finds the filming date, and then uses code to subtract.
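The TIR loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `web_search` and `code_exec` are stand-in stubs with canned results, and a real agent would interleave model-generated thoughts between the calls.

```python
# Minimal sketch of a Tool-Integrated Reasoning (TIR) loop.
# The tools below are hypothetical stand-ins, not the paper's actual APIs.

def web_search(query: str) -> str:
    """Stub search tool: returns canned facts for the bridge example."""
    facts = {
        "Ruby Street Bridge year built": "built in 1935",
        "The Blues Brothers filming start": "filming began in July 1979",
    }
    return facts.get(query, "no result")

def code_exec(expression: str) -> int:
    """Stub code tool: evaluates a simple arithmetic expression."""
    return eval(expression)  # acceptable only in this toy sketch

def solve_bridge_age() -> int:
    # Think: the build year is missing -> call a tool.
    built = web_search("Ruby Street Bridge year built")
    # Think: the filming date is missing -> call another tool.
    filmed = web_search("The Blues Brothers filming start")
    # Think: both facts gathered -> compute instead of guessing.
    build_year = int(built.split()[-1])
    film_year = int(filmed.split()[-1])
    return code_exec(f"{film_year} - {build_year}")

print(solve_bridge_age())  # -> 44
```

The point of the pattern is the alternation: each tool result feeds back into the next "thought" rather than the model answering from memory in one shot.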
🍞 Imagine a giant scrapbook linking who/what/where/when across sights and sounds. 🥬 Omni-modal Event Graph
- What it is: A map of entities and events connected across video, audio, and images.
- How it works: (1) Extract visual, audio, and text clues; (2) make nodes for people, objects, places, times; (3) link relations (e.g., "this sign appears at 02:30"); (4) expand with web facts; (5) generate questions by hiding key nodes.
- Why it matters: Without a graph, tasks become random facts instead of logical trails you can follow.
🍞 Example: The "Ruby Street Bridge" node links to "built: 1935" and "Joliet Iron Works" so the AI can trace a full path.
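A minimal sketch of such a graph for the bridge example, assuming invented node and edge names (the paper's actual schema is richer and spans timestamps, speakers, and media references):

```python
# Tiny illustrative event graph: nodes are entities, labeled edges are relations.
graph = {
    "Ruby Street Bridge": {
        "near": "Joliet Iron Works",
        "built": 1935,
        "mentioned_in_audio_at": "02:30",
    },
    "The Blues Brothers": {
        "features_location": "Ruby Street Bridge",
        "filming_began": 1979,
    },
}

def trace(node: str, relation: str):
    """Follow one labeled edge from a node: a single multi-hop step."""
    return graph[node][relation]

# Multi-hop trail: movie -> bridge -> build year -> age at filming.
bridge = trace("The Blues Brothers", "features_location")
age = trace("The Blues Brothers", "filming_began") - trace(bridge, "built")
print(bridge, age)  # -> Ruby Street Bridge 44
```

Hiding any node on this trail (the bridge name, the build year) turns a lookup into a chain the model must reconstruct.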
🍞 Think of a life-skills exam that mixes watching, listening, and proving your answer. 🥬 OmniGAIA (the benchmark)
- What it is: A 360-task test that requires multi-hop reasoning with tools over video/audio/images.
- How it works: (1) Build event graphs from real media; (2) expand with retrieval and tools; (3) fuzz key info; (4) verify with LLM and humans; (5) ask open-form questions that must be checked.
- Why it matters: Without OmniGAIA, we can't fairly measure if an AI truly reasons and verifies across modalities.
🍞 Example: "Name the bridge in the video and its age at the movie's filming start," forcing seeing, searching, and computing.
🍞 Picture a flashlight you point only where details matter. 🥬 Active Omni-Modal Perception
- What it is: The agent can request just the right video seconds, audio slice, or image crop.
- How it works: (1) Notice uncertainty; (2) call read_video/read_audio/read_image with exact ranges; (3) inspect details; (4) continue reasoning.
- Why it matters: Without it, you downsample everything and miss tiny but crucial clues.
🍞 Example: Zooming on a far-away bridge sign instead of scanning the whole video again.
🍞 Imagine trying different paths in a maze, then keeping the ones that reach the cheese. 🥬 Hindsight-Guided Tree Exploration
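A rough sketch of active perception: the agent requests only the slice it needs instead of reprocessing all media. The call names mirror the ones described above, but these implementations are toy stand-ins over a list of frame labels.

```python
# Hedged sketch of active perception with stand-in media and tools.
FULL_VIDEO = [f"frame_{t}" for t in range(600)]  # pretend 10-minute video at 1 fps

def read_video(start: int, end: int):
    """Return only the requested seconds, not the whole video."""
    return FULL_VIDEO[start:end]

def read_image(frame: str, crop_box):
    """Return a 'crop' of one frame (here, just a labeled string)."""
    x0, y0, x1, y1 = crop_box
    return f"{frame}[{x0}:{x1},{y0}:{y1}]"

# The agent is uncertain about a sign near 02:30, so it inspects 148-155 s only.
clip = read_video(148, 155)
crop = read_image(clip[0], (40, 10, 120, 60))
print(len(clip), crop)  # 7 frames inspected instead of 600
```

The design choice mirrors the "flashlight" analogy: cost scales with the region inspected, so small distant text survives instead of being averaged away by whole-media downsampling.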
- What it is: A way to explore multiple solution steps and keep the successful ones.
- How it works: (1) Branch several next steps; (2) use a verifier with the known answer; (3) prune wrong branches; (4) store the good trajectory for training.
- Why it matters: Without it, the agent learns from noisy or failed paths and picks up bad habits.
🍞 Example: Testing three search queries and only saving the one that correctly finds "Ruby Street Bridge (1935)".
🍞 Think of a coach who pauses a replay at the first mistake and shows the fixed move. 🥬 OmniDPO
- What it is: A fine-grained preference learning step that corrects the first error in a failed solution.
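Hindsight-guided pruning, in miniature: branch several candidate queries, verify each against the known answer, and keep only the successful trajectory for training. The queries and canned results below are invented for illustration.

```python
# Sketch of hindsight-guided tree exploration with a verifier that knows
# the gold answer. CANNED stands in for real search results.
CANNED = {
    "Chicago bridge": "ambiguous: many bridges",
    "movable bridge Joliet": "Jackson Street Bridge (1920)",
    "Ruby Street Bridge Joliet year built": "Ruby Street Bridge (1935)",
}

def verifier(result: str, gold: str) -> bool:
    """Hindsight check: does this branch reach the known answer?"""
    return gold in result

def explore(candidate_queries, gold):
    kept = []
    for query in candidate_queries:      # branch several next steps
        result = CANNED[query]           # act
        if verifier(result, gold):       # verify with hindsight
            kept.append((query, result)) # store the good trajectory
    return kept

good = explore(list(CANNED), "Ruby Street Bridge (1935)")
print(good)  # only the successful branch survives for training
```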
- How it works: (1) Find the first wrong step; (2) create a corrected version; (3) train a preference for the fixed step over the faulty one.
- Why it matters: Without it, small but crucial mistakes (like a bad search query) keep happening.
🍞 Example: Replacing a vague query "Chicago bridge" with "Ruby Street Bridge Joliet year built".
🍞 Building a birdhouse takes hammering, measuring, and checking alignment over several tries. 🥬 Multi-turn Tool Use
- What it is: Using tools over multiple steps instead of once.
- How it works: (1) Plan; (2) search; (3) read; (4) search again; (5) compute; (6) answer.
- Why it matters: One tool call rarely solves everything; complex tasks need a chain of checks.
🍞 Example: Search bridge name → search construction year → search filming start → compute age.
03 Methodology
At a high level: Input (video/image + audio) → Event mining → Event-graph building and expansion with tools → Question generation by fuzzification → LLM + human quality checks → Evaluation with an omni-modal agent (OmniAtlas) that uses active perception + Tool-Integrated Reasoning.
Step-by-step (benchmark construction):
- Discover signals from media (what happens):
- What: Split long videos into second-level clips; describe scenes/events/sounds; run speech-to-text with timestamps; tag speakers and non-speech audio; OCR and object/face detection for images.
- Why: If we don't extract fine-grained signals, we miss small but vital clues (like a tiny sign on a bridge).
- Example: From the video, we get "speaker says 'Ruby Street Bridge'" with a timestamp and a visual note "movable bascule bridge ahead."
- Build the omni-modal event graph (how things connect):
- What: Turn entities (bridge, place, year) and events (speaker mentions movie, camera pans) into a graph with cross-modal links.
- Why: Without structure, it's hard to craft multi-hop problems that are solvable and verifiable.
- Example: Node "Ruby Street Bridge" links to "Joliet Iron Works" and "built: 1935" and to an audio clip where the bridge is mentioned.
- Expand with tools (find missing pieces):
- What: Use a strong agent to search related media, do web_search + page_browser for sources, vision QA for external images, and code_executor for computations.
- Why: Real problems need outside facts and crossāchecks; otherwise, answers lean on guesses.
- Example: Web search confirms "The Blues Brothers filming began July 1979," linking that date into the graph.
- Generate questions via fuzzification (force reasoning):
- What: Hide or abstract key nodes/edges (for example, mask the bridge name or its year) so the model must traverse the graph.
- Why: If you just ask for a visible fact, the task becomes trivial; fuzzing demands multi-hop reasoning.
- Example: Ask, "What is the bridge name and how many years old was it when filming began?" This hides both the identity and age, requiring search + compute.
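A tiny sketch of fuzzification, with assumed field names: key values are masked out of the visible facts so the question can only be answered by recovering them through perception and search, then computing.

```python
# Illustrative fuzzification: hide key graph values so direct lookup fails.
fact = {"bridge": "Ruby Street Bridge", "built": 1935, "filming_began": 1979}
hidden = {"bridge", "built"}  # mask the identity and the year

question = ("What is the bridge name, and how many years old was it "
            "when filming began?")
ground_truth = {"bridge": fact["bridge"],
                "age": fact["filming_began"] - fact["built"]}

# The model sees only the question plus the media; the masked fields must be
# recovered (see the sign, search the year) before the age can be computed.
visible = {k: v for k, v in fact.items() if k not in hidden}
print(visible, ground_truth["age"])
```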
- Quality inspection (ensure solvability):
- What: First, LLM screening checks clarity, cross-modal necessity, and uniqueness; then humans verify media grounding and correctness.
- Why: Without checks, you risk unclear questions or multiple correct answers.
- Example: Reviewers confirm that only "Ruby Street Bridge; 44" fits all evidence.
Step-by-step (agent and training recipe):
- Tool-Integrated Reasoning (TIR):
- What: The agent alternates between thinking and acting (tool calls) and appends tool outputs back into context.
- Why: This lets the agent fetch evidence exactly when needed.
- Example: Think "need build year," call search, read result, then plan the next step.
- Active omni-modal perception:
- What: The agent selectively inspects media with read_video(start, end), read_audio(start, end), read_image(crop_box).
- Why: Saves tokens/cost and keeps details; whole-media downsampling can hide small but important text or sounds.
- Example: Zoom a distant sign for the bridge name rather than processing the entire video again.
- Hindsight-guided tree exploration (trajectory synthesis):
- What: Generate multiple candidate next steps and keep only branches that reach the correct answer (verified by a checker).
- Why: Trains on strong, successful paths instead of noisy ones.
- Example: Keep the branch that searches "Ruby Street Bridge built 1935," drop branches that chase the wrong bridge.
- Supervised fine-tuning with masking:
- What: Train on the agent's own thoughts and tool-call tokens but mask out tool responses (so the model doesn't memorize results).
- Why: Teaches how to think and act, not to copy web text.
- Example: The model learns to write a good query, not the exact phrasing of a page snippet.
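The masking rule can be sketched as a simple loss mask over labeled spans. The role labels here are illustrative; real training would apply the mask at the token level inside the loss computation.

```python
# Sketch of SFT loss masking: the agent's own thoughts and tool calls are
# trained on, while tool-response spans are excluded so the model learns to
# act, not to memorize retrieved text.
trajectory = [
    ("thought", "need the build year"),
    ("tool_call", 'web_search("Ruby Street Bridge year built")'),
    ("tool_response", "built in 1935"),     # excluded from the loss
    ("thought", "now compute 1979 - 1935"),
]

def loss_mask(traj):
    """1 = span contributes to the training loss, 0 = masked out."""
    return [0 if role == "tool_response" else 1 for role, _ in traj]

print(loss_mask(trajectory))  # -> [1, 1, 0, 1]
```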
- OmniDPO (fine-grained error correction):
- What: For each failed try, find the first wrong step and pair it with a corrected version; train preferences to favor the fix.
- Why: Small step-level fixes (like better queries) stop cascades of errors.
- Example: Replace "Chicago bridge Blues Brothers" with "Ruby Street Bridge Joliet year built" and learn to prefer the latter.
What breaks without each step:
- No event mining → miss crucial details.
- No graph → no reliable multi-hop control.
- No tool expansion → unverifiable or shallow tasks.
- No fuzzification → trivial lookup instead of reasoning.
- No checks → ambiguous or unsolvable questions.
- No TIR/active perception → weak evidence gathering; small details lost.
- No guided exploration/SFT → noisy habits.
- No OmniDPO → recurring first-step mistakes.
The secret sauce:
- Event-graph + fuzzification makes tasks hard but fair.
- TIR + active perception gives the agent a "look/listen where needed" superpower.
- Hindsight-guided exploration + masked SFT build good routines; OmniDPO polishes weak links.
- Together, they shift models from guessing to grounded, tool-verified answers.
04 Experiments & Results
The test: OmniGAIA measures whether an AI can combine video, audio, and images with multi-turn tool use to produce a verifiable open-form answer. The main score is Pass@1: did the model's first final answer match (or get judged equivalent to) the ground truth?
The competition: The authors compare many omni-modal models. Proprietary models include Gemini-3-Pro/Flash and Gemini-2.5. Open models include Qwen-3-Omni, Qwen-2.5-Omni, Baichuan-Omni, MiniCPM-O, Ming-Omni, and LongCat-Omni. All get the same tools (web search, browser, code).
The scoreboard with context:
- Gemini-3-Pro: 62.5 Pass@1, like scoring an A when the test is tricky and long.
- Gemini-3-Flash: 51.7, also strong.
- Qwen-3-Omni (open baseline): 13.3, more like a D; shows how hard these tasks are without better tool use.
- OmniAtlas recipe on Qwen-3-Omni: 20.8, an absolute +7.5 jump, moving toward a solid C on a very tough exam.
- Scaling alone didn't fix things: a huge open model with many more parameters still underperformed a smaller one, proving that smart tool policies and reasoning matter more than just size.
Error analysis (why models miss):
- Most misses came from two issues: (1) ineffective tool use (either not calling tools enough or calling them poorly) and (2) reasoning errors (stringing facts together wrong).
- On hard tasks, open models nearly saturated tool misuse (about 90-96%) and had high reasoning errors (about 80-90%). When early searches go wrong, later steps crumble.
- OmniAtlas reduced tool misuse and reasoning errors significantly, but perception mistakes (misreading visuals or audio) stayed noticeable; the senses themselves still need improvement.
Tool calls: more is not always better:
- Runs with zero or few tool calls almost always failed; outside evidence is essential.
- But very long runs with many calls also failed often; this is "thrashing," where the AI keeps searching without resolving uncertainty.
- OmniAtlas shifted models from under-calling to a healthier, more active tool pattern, raising success without simply spamming tools.
Native vs tool-based perception:
- Strong models do best with native omni-modal inputs; swapping in perception tools (ask a separate vision/audio helper) didn't beat native sensing and raised latency.
- For weaker models, perception tools help on easy/medium tasks but still struggle on hard, long-chain reasoning.
Training effectiveness:
- Supervised fine-tuning on good trajectories did most of the heavy lifting (big drop in bad tool use, solid Pass@1 jump).
- OmniDPO added consistent, across-the-board gains by fixing the earliest wrong step in failed solutions.
Surprising findings:
- Making the model larger without teaching better tool habits didn't yield big wins.
- The biggest stride came from pairing event-graph-built tasks with an agent that actively verifies and from training that fixes the first wrong turn.
- The benchmark's open-form answers plus LLM-as-a-Judge were necessary to catch correct reasoning even when the formatting didn't match exactly.
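The Pass@1 scoring described earlier (exact match first, then an equivalence judgment for open-form answers) can be sketched like this; `judge_equivalent` is a trivial string-normalizing stand-in, not a real LLM judge.

```python
# Sketch of Pass@1 scoring: exact match first, then a judge fallback for
# open-form answers that are correct but formatted differently.

def judge_equivalent(pred: str, gold: str) -> bool:
    """Toy judge: case- and whitespace-insensitive comparison."""
    return pred.strip().lower() == gold.strip().lower()

def pass_at_1(predictions, golds):
    hits = 0
    for pred, gold in zip(predictions, golds):
        if pred == gold or judge_equivalent(pred, gold):
            hits += 1
    return 100.0 * hits / len(golds)

preds = ["Ruby Street Bridge; 44", "ruby street bridge; 44", "Jackson Bridge"]
golds = ["Ruby Street Bridge; 44"] * 3
print(pass_at_1(preds, golds))  # about 66.7: 2 of 3 count as correct
```

The second prediction illustrates why the judge fallback matters: exact match alone would wrongly score it as a miss.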
05 Discussion & Limitations
Limitations:
- Perception bottlenecks remain: many errors are still basic seeing/hearing mistakes, especially in long or noisy media.
- Hard multiāhop chains are fragile: an early bad search cascades into failure.
- LLM-as-a-Judge introduces a dependency on another model for grading (mitigated by exact-match first), though it's a common practice for open-form answers.
- Data and compute demands are high for training (multi-GPU nodes, many trajectories).
Required resources:
- For evaluation: tool access (web search, browser, code executor), and the media inputs.
- For training OmniAtlas: curated trajectories from guided exploration, supervised fine-tuning infrastructure, and OmniDPO runs; GPUs with enough memory to train omni-modal towers.
When not to use:
- If the task is purely single-modality and trivial (e.g., read a single number in a clear image), the full agentic pipeline is overkill.
- If you cannot permit web access (privacy or policy), tool-integrated verification may be limited.
- Ultra-low-latency or on-device scenarios may dislike the extra tool-call cost.
Open questions:
- How to teach models to detect and repair wrong search direction earlier (prevent tool-query drift)?
- Can we couple native perception with smarter compression so we don't miss tiny details?
- What reward modeling or RL signals best improve long-horizon tool policies without overfitting to the benchmark?
- How to generalize from web verification to embodied actions in the physical world while keeping safety and privacy?
06 Conclusion & Future Work
Three-sentence summary: OmniGAIA is a tough, realistic benchmark that forces AI to watch, listen, look, and then use tools over multiple steps to deliver a checked, open-form answer. OmniAtlas is a matching agent recipe with Tool-Integrated Reasoning, active perception, hindsight-guided exploration, supervised fine-tuning, and OmniDPO to fix early mistakes. Together, they show that better tool use and reasoning, not just bigger models, move open models much closer to reliable omni-modal assistance.
Main achievement: Turning omni-modal problem solving into a fair, verifiable test and providing a practical training path that measurably boosts open models (13.3 → 20.8 Pass@1 on Qwen-3-Omni).
Future directions: (1) Omni-modal agent RL to optimize full policies; (2) richer tool ecosystems (MCP-style services) for broader tasks; (3) embodied omni-modal agents that act in the physical world.
Why remember this: It reframes multi-modal AI from "Can you see and hear?" to "Can you see, hear, plan, verify, and compute, like a real assistant?" and offers both the exam (OmniGAIA) and the study guide (OmniAtlas) to get there.
Practical Applications
- Educational helpers that watch a classroom demo video, transcribe key points, verify facts online, and compute results for a lab worksheet.
- Travel guides that analyze vlog footage, read signs, listen to narration, and check museum hours or bridge histories on the web.
- Accessibility tools that describe on-screen action and audio, then confirm names, dates, and places from reliable sources.
- News fact-checkers that align video/audio evidence with web citations before summarizing verified claims.
- Customer support bots that watch a troubleshooting video, hear device beeps, search manuals, and compute settings.
- Sports analysts that review long match footage, align commentary, and compute player stats or timelines.
- Historical archivists that parse old films, read captions, match locations online, and reconstruct accurate event timelines.
- Science note-takers that watch experiment recordings, extract spoken measurements, and calculate derived quantities.
- City information kiosks that read street signs from images, listen to tourist questions, and fetch local facts from the web.
- Media study assistants that cross-check movie locations, filming dates, and landmarks across video, audio, and online references.