OmniGAIA: Towards Native Omni-Modal AI Agents
Key Summary
- OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.
- Most older tests only used two types of information at once (like just pictures and text), which missed how real life mixes many senses together.
- The benchmark builds each question from an omni-modal event graph, which links who/what/when/where across video, audio, and images, then hides key bits so the AI must reason step by step.
- OmniAtlas is a matching AI agent recipe that thinks while calling tools (web search, browser, code) and uses active perception to "look" or "listen" only where needed.
- To train OmniAtlas, the authors synthesize solution paths using hindsight-guided tree exploration, then do supervised fine-tuning and a fine-grained fixer called OmniDPO.
- On the OmniGAIA leaderboard, the top proprietary model (Gemini-3-Pro) scores 62.5 Pass@1 while a strong open model (Qwen-3-Omni) gets 13.3.
- Using the OmniAtlas recipe boosts Qwen-3-Omni from 13.3 to 20.8 Pass@1, mainly by improving when and how to use tools and verify facts.
- Hard tasks remain tough because they demand long chains of reasoning and careful tool use; simply making models bigger did not fix this.
- Analyses show two common failure modes: not calling tools enough and getting stuck in the wrong search direction (tool-query drift).
- This work points toward native omni-modal assistants that can ground what they see and hear, double-check with the web, compute precisely, and explain their answers.
Why This Research Matters
Real life mixes sights, sounds, and words, and we often double-check facts online and do quick math; OmniGAIA tests whether AI can do the same. This helps build assistants that watch long videos, listen carefully, and then verify dates or places before answering. It can boost accessibility tools that describe what's on screen and confirm information from the web. It can support students and teachers by combining lab videos, spoken notes, and web facts into one reliable explanation. It encourages safer AI by grounding answers in evidence rather than guesses. It also shows developers where models fail, like weak searches or early wrong turns, so training can improve the exact steps that matter. Overall, it moves AI from just recognizing things to solving problems like a careful helper.
Detailed Explanation
01 Background & Problem Definition
You know how in real life you use many senses together (your eyes, ears, and words) to figure things out, and you also grab tools like your phone's browser or a calculator when you need them? That's how humans solve real problems: we look, listen, think, and check. For a long time, most AI systems didn't work that way. Many could only handle two types of information at once (for example, picture + text), and even then, they mostly recognized what was there instead of planning multi-step solutions.

Before this work, the world of multi-modal AI focused on short, simple tasks. A model might answer, "What color is the car?" from a photo, or "What did the speaker say?" from a short clip. These tests were useful for perception (seeing and hearing), but not for full problem solving. Real problems are messier. A student might watch a long field-trip video with background sounds, read a sign on a bridge in the distance, remember a movie reference, search the web for dates, and finally compute an age. That's not just seeing or hearing; it's seeing + hearing + thinking + tool use over many steps.

The problem researchers faced was clear: there wasn't a strong way to test whether an AI could do all of that together. Existing benchmarks usually covered two modalities and short clips, asked multiple-choice questions (easy to guess), and didn't require the model to use outside tools to verify facts. Without a realistic test, it's hard to build a truly helpful AI assistant.

People tried a few things that didn't fully work. First, they made bigger models, hoping size alone would teach better reasoning. But results showed that simply adding parameters didn't close the gap. Second, they made bi-modal tests (image+text or audio+text) and got great perception scores, but these didn't measure if the AI could plan, search, verify, and compute across long contexts. Third, some agent systems used tools, but only in text-only worlds; they rarely mixed tools with vision and audio in one flow.
What was missing? A benchmark and an agent that are native to omni-modality (all senses together), plus real tool use and long-horizon reasoning. The paper fills this gap with two pieces: (1) OmniGAIA, a benchmark of 360 tasks across nine domains that force an AI to watch/listen/see and then use tools (like web search or a calculator) over multiple turns, and (2) OmniAtlas, a training recipe and agent behavior that naturally interleaves thinking with tool calls and uses active perception to inspect just the right video seconds or image regions.

Why should anyone care? Because daily life is omni-modal. Think of accessibility tools that listen to a lecture and read a diagram, travel assistants that analyze a vlog and check opening hours online, or science helpers that watch an experiment video, hear measurements, and compute results. Without a strong test like OmniGAIA and a capable agent recipe like OmniAtlas, AI assistants can miss details, trust wrong memories, or do the math before checking the facts. This work makes a step toward assistants that ground what they see and hear, verify with sources, and only then compute and answer.
02 Core Idea
The "aha!" in one sentence: Build a benchmark (OmniGAIA) and a matching agent recipe (OmniAtlas) that treat vision, audio, and language as first-class citizens while the agent plans, searches, verifies, and computes in multi-turn steps.
Multiple analogies:
- Detective kit: The AI is a detective who watches security footage (video), listens to witnesses (audio), reads signs (images), and then checks the city records (web tools) before doing the final math (calculator) to solve the case.
- Field trip journal: The AI goes on a long field trip, writes notes from what it sees and hears, looks up facts in a library, and calculates totals to finish the report.
- Cooking show + pantry: The AI watches and listens to a full cooking show, then runs to the pantry (web) to fetch missing ingredients (facts) and uses a scale (code tool) to measure precisely.
Before vs. after:
- Before: Tests mostly checked recognition in short, bi-modal settings. Agents often didn't verify with external tools and didn't handle long, interwoven clues.
- After: The benchmark requires multi-hop reasoning across minutes-long videos and audio, and models are judged on whether their open-form answers are verifiably correct. The agent recipe shows how to interleave thought, tool calls, and active perception to actually solve such problems.
Why it works (intuition):
- Real problems are chains. An event graph captures those chains across modalities. Hiding key bits (fuzzification) makes guessing hard and forces reasoning.
- Thinking while acting (Tool-Integrated Reasoning) lets the agent fetch missing evidence when its internal knowledge is uncertain.
- Active perception focuses attention on the exact frames, seconds, or image crops that matter, so details aren't lost.
- Training on good solution paths (hindsight-guided exploration + supervised fine-tuning) builds the right habits; OmniDPO then fixes specific weak steps (like bad searches) without breaking what already works.
Building blocks (each explained with the Sandwich pattern):
🍞 You know how you think while using a calculator or looking things up online? 🥬 Tool-Integrated Reasoning (TIR)
- What it is: The AI interleaves its thoughts with tool calls (search, browser, code), step by step.
- How it works: (1) Think about whatās missing; (2) call a tool; (3) read what comes back; (4) update the plan; (5) repeat until ready; (6) give the final answer.
- Why it matters: Without TIR, the AI guesses from memory, can't verify facts, and often answers confidently but wrong.
🍞 Example: Asking, "How old was this bridge when filming began?" The AI searches for the build year, finds the filming date, and then uses code to subtract.
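The TIR loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `web_search` and `code_exec` are stand-in stubs with canned results, and a real agent would interleave model-generated thoughts between the calls.

```python
# Minimal sketch of a Tool-Integrated Reasoning (TIR) loop.
# The tools below are hypothetical stand-ins, not the paper's actual APIs.

def web_search(query: str) -> str:
    """Stub search tool: returns canned facts for the bridge example."""
    facts = {
        "Ruby Street Bridge year built": "built in 1935",
        "The Blues Brothers filming start": "filming began in July 1979",
    }
    return facts.get(query, "no result")

def code_exec(expression: str) -> int:
    """Stub code tool: evaluates a simple arithmetic expression."""
    return eval(expression)  # acceptable only in this toy sketch

def solve_bridge_age() -> int:
    # Think: the build year is missing -> call a tool.
    built = web_search("Ruby Street Bridge year built")
    # Think: the filming date is missing -> call another tool.
    filmed = web_search("The Blues Brothers filming start")
    # Think: both facts gathered -> compute instead of guessing.
    build_year = int(built.split()[-1])
    film_year = int(filmed.split()[-1])
    return code_exec(f"{film_year} - {build_year}")

print(solve_bridge_age())  # -> 44
```

The point of the pattern is the alternation: each tool result feeds back into the next "thought" rather than the model answering from memory in one shot.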
🍞 Imagine a giant scrapbook linking who/what/where/when across sights and sounds. 🥬 Omni-modal Event Graph
- What it is: A map of entities and events connected across video, audio, and images.
- How it works: (1) Extract visual, audio, and text clues; (2) make nodes for people, objects, places, times; (3) link relations (e.g., "this sign appears at 02:30"); (4) expand with web facts; (5) generate questions by hiding key nodes.
- Why it matters: Without a graph, tasks become random facts instead of logical trails you can follow.
🍞 Example: The "Ruby Street Bridge" node links to "built: 1935" and "Joliet Iron Works" so the AI can trace a full path.
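A minimal sketch of such a graph for the bridge example, assuming invented node and edge names (the paper's actual schema is richer and spans timestamps, speakers, and media references):

```python
# Tiny illustrative event graph: nodes are entities, labeled edges are relations.
graph = {
    "Ruby Street Bridge": {
        "near": "Joliet Iron Works",
        "built": 1935,
        "mentioned_in_audio_at": "02:30",
    },
    "The Blues Brothers": {
        "features_location": "Ruby Street Bridge",
        "filming_began": 1979,
    },
}

def trace(node: str, relation: str):
    """Follow one labeled edge from a node: a single multi-hop step."""
    return graph[node][relation]

# Multi-hop trail: movie -> bridge -> build year -> age at filming.
bridge = trace("The Blues Brothers", "features_location")
age = trace("The Blues Brothers", "filming_began") - trace(bridge, "built")
print(bridge, age)  # -> Ruby Street Bridge 44
```

Hiding any node on this trail (the bridge name, the build year) turns a lookup into a chain the model must reconstruct.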
🍞 Think of a life-skills exam that mixes watching, listening, and proving your answer. 🥬 OmniGAIA (the benchmark)
- What it is: A 360-task test that requires multi-hop reasoning with tools over video/audio/images.
- How it works: (1) Build event graphs from real media; (2) expand with retrieval and tools; (3) fuzz key info; (4) verify with LLM and humans; (5) ask open-form questions that must be checked.
- Why it matters: Without OmniGAIA, we can't fairly measure if an AI truly reasons and verifies across modalities.
🍞 Example: "Name the bridge in the video and its age at the movie's filming start," forcing seeing, searching, and computing.
🍞 Picture a flashlight you point only where details matter. 🥬 Active Omni-Modal Perception
- What it is: The agent can request just the right video seconds, audio slice, or image crop.
- How it works: (1) Notice uncertainty; (2) call read_video/read_audio/read_image with exact ranges; (3) inspect details; (4) continue reasoning.
- Why it matters: Without it, you downsample everything and miss tiny but crucial clues.
🍞 Example: Zooming on a far-away bridge sign instead of scanning the whole video again.
🍞 Imagine trying different paths in a maze, then keeping the ones that reach the cheese. 🥬 Hindsight-Guided Tree Exploration
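A rough sketch of active perception: the agent requests only the slice it needs instead of reprocessing all media. The call names mirror the ones described above, but these implementations are toy stand-ins over a list of frame labels.

```python
# Hedged sketch of active perception with stand-in media and tools.
FULL_VIDEO = [f"frame_{t}" for t in range(600)]  # pretend 10-minute video at 1 fps

def read_video(start: int, end: int):
    """Return only the requested seconds, not the whole video."""
    return FULL_VIDEO[start:end]

def read_image(frame: str, crop_box):
    """Return a 'crop' of one frame (here, just a labeled string)."""
    x0, y0, x1, y1 = crop_box
    return f"{frame}[{x0}:{x1},{y0}:{y1}]"

# The agent is uncertain about a sign near 02:30, so it inspects 148-155 s only.
clip = read_video(148, 155)
crop = read_image(clip[0], (40, 10, 120, 60))
print(len(clip), crop)  # 7 frames inspected instead of 600
```

The design choice mirrors the "flashlight" analogy: cost scales with the region inspected, so small distant text survives instead of being averaged away by whole-media downsampling.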
- What it is: A way to explore multiple solution steps and keep the successful ones.
- How it works: (1) Branch several next steps; (2) use a verifier with the known answer; (3) prune wrong branches; (4) store the good trajectory for training.
- Why it matters: Without it, the agent learns from noisy or failed paths and picks up bad habits.
🍞 Example: Testing three search queries and only saving the one that correctly finds "Ruby Street Bridge (1935)".
🍞 Think of a coach who pauses a replay at the first mistake and shows the fixed move. 🥬 OmniDPO
- What it is: A fine-grained preference learning step that corrects the first error in a failed solution.
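Hindsight-guided pruning, in miniature: branch several candidate queries, verify each against the known answer, and keep only the successful trajectory for training. The queries and canned results below are invented for illustration.

```python
# Sketch of hindsight-guided tree exploration with a verifier that knows
# the gold answer. CANNED stands in for real search results.
CANNED = {
    "Chicago bridge": "ambiguous: many bridges",
    "movable bridge Joliet": "Jackson Street Bridge (1920)",
    "Ruby Street Bridge Joliet year built": "Ruby Street Bridge (1935)",
}

def verifier(result: str, gold: str) -> bool:
    """Hindsight check: does this branch reach the known answer?"""
    return gold in result

def explore(candidate_queries, gold):
    kept = []
    for query in candidate_queries:      # branch several next steps
        result = CANNED[query]           # act
        if verifier(result, gold):       # verify with hindsight
            kept.append((query, result)) # store the good trajectory
    return kept

good = explore(list(CANNED), "Ruby Street Bridge (1935)")
print(good)  # only the successful branch survives for training
```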
- How it works: (1) Find the first wrong step; (2) create a corrected version; (3) train a preference for the fixed step over the faulty one.
- Why it matters: Without it, small but crucial mistakes (like a bad search query) keep happening.
🍞 Example: Replacing a vague query "Chicago bridge" with "Ruby Street Bridge Joliet year built".
🍞 Building a birdhouse takes hammering, measuring, and checking alignment over several tries. 🥬 Multi-turn Tool Use
- What it is: Using tools over multiple steps instead of once.
- How it works: (1) Plan; (2) search; (3) read; (4) search again; (5) compute; (6) answer.
- Why it matters: One tool call rarely solves everything; complex tasks need a chain of checks.
🍞 Example: Search bridge name → search construction year → search filming start → compute age.
03 Methodology
At a high level: Input (video/image + audio) → Event mining → Event-graph building and expansion with tools → Question generation by fuzzification → LLM + human quality checks → Evaluation with an omni-modal agent (OmniAtlas) that uses active perception + Tool-Integrated Reasoning.
Step-by-step (benchmark construction):
- Discover signals from media (what happens):
- What: Split long videos into second-level clips; describe scenes/events/sounds; run speech-to-text with timestamps; tag speakers and non-speech audio; OCR and object/face detection for images.
- Why: If we don't extract fine-grained signals, we miss small but vital clues (like a tiny sign on a bridge).
- Example: From the video, we get "speaker says 'Ruby Street Bridge'" with a timestamp and a visual note "movable bascule bridge ahead."
- Build the omni-modal event graph (how things connect):
- What: Turn entities (bridge, place, year) and events (speaker mentions movie, camera pans) into a graph with cross-modal links.
- Why: Without structure, it's hard to craft multi-hop problems that are solvable and verifiable.
- Example: Node "Ruby Street Bridge" links to "Joliet Iron Works" and "built: 1935" and to an audio clip where the bridge is mentioned.
- Expand with tools (find missing pieces):
- What: Use a strong agent to search related media, do web_search + page_browser for sources, vision QA for external images, and code_executor for computations.
- Why: Real problems need outside facts and crossāchecks; otherwise, answers lean on guesses.
- Example: Web search confirms "The Blues Brothers filming began July 1979," linking that date into the graph.
- Generate questions via fuzzification (force reasoning):
- What: Hide or abstract key nodes/edges (for example, mask the bridge name or its year) so the model must traverse the graph.
- Why: If you just ask for a visible fact, the task becomes trivial; fuzzing demands multi-hop reasoning.
- Example: Ask, "What is the bridge name and how many years old was it when filming began?" This hides both the identity and age, requiring search + compute.
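A tiny sketch of fuzzification, with assumed field names: key values are masked out of the visible facts so the question can only be answered by recovering them through perception and search, then computing.

```python
# Illustrative fuzzification: hide key graph values so direct lookup fails.
fact = {"bridge": "Ruby Street Bridge", "built": 1935, "filming_began": 1979}
hidden = {"bridge", "built"}  # mask the identity and the year

question = ("What is the bridge name, and how many years old was it "
            "when filming began?")
ground_truth = {"bridge": fact["bridge"],
                "age": fact["filming_began"] - fact["built"]}

# The model sees only the question plus the media; the masked fields must be
# recovered (see the sign, search the year) before the age can be computed.
visible = {k: v for k, v in fact.items() if k not in hidden}
print(visible, ground_truth["age"])
```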
- Quality inspection (ensure solvability):
- What: First, LLM screening checks clarity, cross-modal necessity, and uniqueness; then humans verify media grounding and correctness.
- Why: Without checks, you risk unclear questions or multiple correct answers.
- Example: Reviewers confirm that only "Ruby Street Bridge; 44" fits all evidence.
Step-by-step (agent and training recipe):
- Tool-Integrated Reasoning (TIR):
- What: The agent alternates between thinking and acting (tool calls) and appends tool outputs back into context.
- Why: This lets the agent fetch evidence exactly when needed.
- Example: Think "need build year," call search, read result, then plan the next step.
- Active omni-modal perception:
- What: The agent selectively inspects media with read_video(start, end), read_audio(start, end), read_image(crop_box).
- Why: Saves tokens/cost and keeps details; whole-media downsampling can hide small but important text or sounds.
- Example: Zoom a distant sign for the bridge name rather than processing the entire video again.
- Hindsight-guided tree exploration (trajectory synthesis):
- What: Generate multiple candidate next steps and keep only branches that reach the correct answer (verified by a checker).
- Why: Trains on strong, successful paths instead of noisy ones.
- Example: Keep the branch that searches "Ruby Street Bridge built 1935," drop branches that chase the wrong bridge.
- Supervised fine-tuning with masking:
- What: Train on the agent's own thoughts and tool-call tokens but mask out tool responses (so the model doesn't memorize results).
- Why: Teaches how to think and act, not to copy web text.
- Example: The model learns to write a good query, not the exact phrasing of a page snippet.
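The masking rule can be sketched as a simple loss mask over labeled spans. The role labels here are illustrative; real training would apply the mask at the token level inside the loss computation.

```python
# Sketch of SFT loss masking: the agent's own thoughts and tool calls are
# trained on, while tool-response spans are excluded so the model learns to
# act, not to memorize retrieved text.
trajectory = [
    ("thought", "need the build year"),
    ("tool_call", 'web_search("Ruby Street Bridge year built")'),
    ("tool_response", "built in 1935"),     # excluded from the loss
    ("thought", "now compute 1979 - 1935"),
]

def loss_mask(traj):
    """1 = span contributes to the training loss, 0 = masked out."""
    return [0 if role == "tool_response" else 1 for role, _ in traj]

print(loss_mask(trajectory))  # -> [1, 1, 0, 1]
```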
- OmniDPO (fine-grained error correction):
- What: For each failed try, find the first wrong step and pair it with a corrected version; train preferences to favor the fix.
- Why: Small step-level fixes (like better queries) stop cascades of errors.
- Example: Replace "Chicago bridge Blues Brothers" with "Ruby Street Bridge Joliet year built" and learn to prefer the latter.
What breaks without each step:
- No event mining → miss crucial details.
- No graph → no reliable multi-hop control.
- No tool expansion → unverifiable or shallow tasks.
- No fuzzification → trivial lookup instead of reasoning.
- No checks → ambiguous or unsolvable questions.
- No TIR/active perception → weak evidence gathering; small details lost.
- No guided exploration/SFT → noisy habits.
- No OmniDPO → recurring first-step mistakes.
The secret sauce:
- Event-graph + fuzzification makes tasks hard but fair.
- TIR + active perception gives the agent a "look/listen where needed" superpower.
- Hindsight-guided exploration + masked SFT build good routines; OmniDPO polishes weak links.
- Together, they shift models from guessing to grounded, tool-verified answers.
04 Experiments & Results
The test: OmniGAIA measures whether an AI can combine video, audio, and images with multi-turn tool use to produce a verifiable open-form answer. The main score is Pass@1: did the model's first final answer match (or get judged equivalent to) the ground truth?
The competition: The authors compare many omni-modal models. Proprietary models include Gemini-3-Pro/Flash and Gemini-2.5. Open models include Qwen-3-Omni, Qwen-2.5-Omni, Baichuan-Omni, MiniCPM-O, Ming-Omni, and LongCat-Omni. All get the same tools (web search, browser, code).
The scoreboard with context:
- Gemini-3-Pro: 62.5 Pass@1, like scoring an A when the test is tricky and long.
- Gemini-3-Flash: 51.7, also strong.
- Qwen-3-Omni (open baseline): 13.3, more like a D; shows how hard these tasks are without better tool use.
- OmniAtlas recipe on Qwen-3-Omni: 20.8, an absolute +7.5 jump, moving toward a solid C on a very tough exam.
- Scaling alone didn't fix things: a huge open model with many more parameters still underperformed a smaller one, proving that smart tool policies and reasoning matter more than just size.
Error analysis (why models miss):
- Most misses came from two issues: (1) ineffective tool use (either not calling tools enough or calling them poorly) and (2) reasoning errors (stringing facts together wrong).
- On hard tasks, open models nearly saturated tool misuse (about 90-96%) and had high reasoning errors (about 80-90%). When early searches go wrong, later steps crumble.
- OmniAtlas reduced tool misuse and reasoning errors significantly, but perception mistakes (misreading visuals or audio) stayed noticeable; the senses themselves still need improvement.
Tool calls: more is not always better:
- Runs with zero or few tool calls almost always failed; outside evidence is essential.
- But very long runs with many calls also failed often; this is "thrashing," where the AI keeps searching without resolving uncertainty.
- OmniAtlas shifted models from under-calling to a healthier, more active tool pattern, raising success without simply spamming tools.
Native vs tool-based perception:
- Strong models do best with native omni-modal inputs; swapping in perception tools (ask a separate vision/audio helper) didn't beat native sensing and raised latency.
- For weaker models, perception tools help on easy/medium tasks but still struggle on hard, long-chain reasoning.
Training effectiveness:
- Supervised fine-tuning on good trajectories did most of the heavy lifting (big drop in bad tool use, solid Pass@1 jump).
- OmniDPO added consistent, across-the-board gains by fixing the earliest wrong step in failed solutions.
Surprising findings:
- Making the model larger without teaching better tool habits didn't yield big wins.
- The biggest stride came from pairing event-graph-built tasks with an agent that actively verifies and from training that fixes the first wrong turn.
- The benchmark's open-form answers plus LLM-as-a-Judge were necessary to catch correct reasoning even when the formatting didn't match exactly.
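The Pass@1 scoring described earlier (exact match first, then an equivalence judgment for open-form answers) can be sketched like this; `judge_equivalent` is a trivial string-normalizing stand-in, not a real LLM judge.

```python
# Sketch of Pass@1 scoring: exact match first, then a judge fallback for
# open-form answers that are correct but formatted differently.

def judge_equivalent(pred: str, gold: str) -> bool:
    """Toy judge: case- and whitespace-insensitive comparison."""
    return pred.strip().lower() == gold.strip().lower()

def pass_at_1(predictions, golds):
    hits = 0
    for pred, gold in zip(predictions, golds):
        if pred == gold or judge_equivalent(pred, gold):
            hits += 1
    return 100.0 * hits / len(golds)

preds = ["Ruby Street Bridge; 44", "ruby street bridge; 44", "Jackson Bridge"]
golds = ["Ruby Street Bridge; 44"] * 3
print(pass_at_1(preds, golds))  # about 66.7: 2 of 3 count as correct
```

The second prediction illustrates why the judge fallback matters: exact match alone would wrongly score it as a miss.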
05 Discussion & Limitations
Limitations:
- Perception bottlenecks remain: many errors are still basic seeing/hearing mistakes, especially in long or noisy media.
- Hard multiāhop chains are fragile: an early bad search cascades into failure.
- LLM-as-a-Judge introduces a dependency on another model for grading (mitigated by exact-match first), though it's a common practice for open-form answers.
- Data and compute demands are high for training (multi-GPU nodes, many trajectories).
Required resources:
- For evaluation: tool access (web search, browser, code executor), and the media inputs.
- For training OmniAtlas: curated trajectories from guided exploration, supervised fine-tuning infrastructure, and OmniDPO runs; GPUs with enough memory to train omni-modal towers.
When not to use:
- If the task is purely single-modality and trivial (e.g., read a single number in a clear image), the full agentic pipeline is overkill.
- If you cannot permit web access (privacy or policy), tool-integrated verification may be limited.
- Ultra-low-latency or on-device scenarios may dislike the extra tool-call cost.
Open questions:
- How to teach models to detect and repair wrong search direction earlier (prevent tool-query drift)?
- Can we couple native perception with smarter compression so we don't miss tiny details?
- What reward modeling or RL signals best improve long-horizon tool policies without overfitting to the benchmark?
- How to generalize from web verification to embodied actions in the physical world while keeping safety and privacy?
06 Conclusion & Future Work
Three-sentence summary: OmniGAIA is a tough, realistic benchmark that forces AI to watch, listen, look, and then use tools over multiple steps to deliver a checked, open-form answer. OmniAtlas is a matching agent recipe with Tool-Integrated Reasoning, active perception, hindsight-guided exploration, supervised fine-tuning, and OmniDPO to fix early mistakes. Together, they show that better tool use and reasoning, not just bigger models, move open models much closer to reliable omni-modal assistance.
Main achievement: Turning omni-modal problem solving into a fair, verifiable test and providing a practical training path that measurably boosts open models (13.3 → 20.8 Pass@1 on Qwen-3-Omni).
Future directions: (1) Omni-modal agent RL to optimize full policies; (2) richer tool ecosystems (MCP-style services) for broader tasks; (3) embodied omni-modal agents that act in the physical world.
Why remember this: It reframes multi-modal AI from "Can you see and hear?" to "Can you see, hear, plan, verify, and compute, like a real assistant?" and offers both the exam (OmniGAIA) and the study guide (OmniAtlas) to get there.
Practical Applications
- Educational helpers that watch a classroom demo video, transcribe key points, verify facts online, and compute results for a lab worksheet.
- Travel guides that analyze vlog footage, read signs, listen to narration, and check museum hours or bridge histories on the web.
- Accessibility tools that describe on-screen action and audio, then confirm names, dates, and places from reliable sources.
- News fact-checkers that align video/audio evidence with web citations before summarizing verified claims.
- Customer support bots that watch a troubleshooting video, hear device beeps, search manuals, and compute settings.
- Sports analysts that review long match footage, align commentary, and compute player stats or timelines.
- Historical archivists that parse old films, read captions, match locations online, and reconstruct accurate event timelines.
- Science note-takers that watch experiment recordings, extract spoken measurements, and calculate derived quantities.
- City information kiosks that read street signs from images, listen to tourist questions, and fetch local facts from the web.
- Media study assistants that cross-check movie locations, filming dates, and landmarks across video, audio, and online references.