Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Intermediate
Romain Froger, Pierre Andrews, Matteo Bettini et al. · 2/12/2026
arXiv

Key Summary

  • Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
  • It runs on the ARE platform, which simulates a phone-like world where time keeps moving even while the agent is thinking.
  • Every state-changing action is checked by a write-action verifier, so success is judged fairly and can be used for reinforcement learning from verifiable rewards.
  • Gaia2 covers seven abilities: Execution, Search, Ambiguity, Adaptability, Time, Noise, and Agent2Agent collaboration.
  • No model is best at everything: GPT-5 (high) leads overall (about 42% pass@1) but misses time-sensitive tasks; Claude-4 Sonnet is faster but pricier; Kimi-K2 leads open-source models near 20%.
  • Results show strong trade-offs between reasoning depth, speed, robustness, and budget, and expose a big sim2real gap.
  • The Time split proves that slow thinking can hurt deadlines; instant-time tests raise scores a lot, so inference speed and orchestration matter.
  • Agent2Agent helps smaller models more, and mixed teams (strong planner + cheaper executors) can be a smart compute trade.
  • Because actions are verifiable at each step, Gaia2 is directly useful for training future agents to be reliable in the wild.

Why This Research Matters

Real life doesn’t pause; deadlines, messages, and app updates happen while you think. Gaia2 tests and trains AI agents to be reliable in that world by checking every important action as time moves on. This makes assistants better at scheduling, messaging, and following multi-step instructions without missing critical windows. It also encourages cost-aware, fast responses—vital for products that must act on time. Teams of agents become practical too, letting a smart planner guide cheaper executors to save money. With Gaia2 and ARE, the community can build, compare, and improve agents that are not just smart on paper but useful in practice.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re helping your parent plan a busy day on their phone. While you type, messages arrive, meetings change, and the timer you set keeps ticking. You have to act fast, double-check details, and sometimes ask for help.

🥬 The Concept: Large Language Models (LLMs) are AI systems that read and write text to help with tasks. How it works:

  1. They take in words (a prompt) like a question or a plan.
  2. They predict the next useful words step by step.
  3. They can call tools (like calendar or email) if given access. Why it matters: Without LLMs, computers can’t flexibly understand instructions across many apps and steps. 🍞 Anchor: When you ask an AI to “email my teacher and move my dentist appointment,” an LLM can plan the words and actions to try to do just that.

🍞 Hook: You know how a school test shows what you learned? We need fair tests for AI, too.

🥬 The Concept: A benchmark is a set of tasks that fairly measures what an AI can do. How it works:

  1. Make tasks that represent real problems.
  2. Have clear rules and scoring.
  3. Compare different AIs using the same setup. Why it matters: Without a good benchmark, we might believe an AI is great when it only learned the test’s tricks. 🍞 Anchor: A driving test checks steering, braking, and parking; an AI benchmark checks reading instructions, using tools, and finishing on time.

🍞 Hook: Picture a video game that only changes when you press a button versus a live soccer game where time and players keep moving even if you stand still.

🥬 The Concept: A synchronous environment only changes when the agent acts, but an asynchronous environment keeps changing over time no matter what the agent does. How it works:

  1. Synchronous: pause until you move.
  2. Asynchronous: time advances; events happen independently (messages arrive, meetings start).
  3. The agent must observe new events and respond in time. Why it matters: Many real tasks are asynchronous—deadlines don’t wait for slow thinking. 🍞 Anchor: While the agent drafts an email, a calendar invite can arrive and change the plan; the AI must notice and adapt.
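The difference between the two settings can be made concrete with a toy event queue. The sketch below is illustrative only (the class and method names are assumptions, not the actual ARE API): events carry timestamps and fire whenever the clock passes them, so time spent "thinking" can silently deliver new events.

```python
import heapq

class AsyncEnvironment:
    """Toy asynchronous world: scheduled events fire at their timestamps,
    whether or not the agent has acted. Names are illustrative, not ARE's API."""

    def __init__(self):
        self.clock = 0.0
        self._events = []  # min-heap of (timestamp, description)

    def schedule(self, at, description):
        heapq.heappush(self._events, (at, description))

    def advance(self, thinking_seconds):
        """Time spent 'thinking' still moves the clock; return every
        event that fired while the agent was busy."""
        self.clock += thinking_seconds
        fired = []
        while self._events and self._events[0][0] <= self.clock:
            fired.append(heapq.heappop(self._events)[1])
        return fired

env = AsyncEnvironment()
env.schedule(5.0, "email: meeting moved to 3 PM")
env.schedule(60.0, "reminder: dentist")
missed = env.advance(10.0)  # a slow 10-second reasoning step
# the email arrived while the agent was thinking
```

In a synchronous benchmark, `advance` would never be called between agent actions; here, every reasoning step has a time cost the agent must budget for.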

🍞 Hook: Think of gold stars for homework—but only if the answer is truly correct, not just liked.

🥬 The Concept: Reinforcement Learning from Verifiable Rewards (RLVR) is a way to train AIs using rewards that can be checked automatically for correctness. How it works:

  1. Define a task with a result you can verify (like a correct calendar change).
  2. Give a reward only when the verified result is right.
  3. The AI learns which actions earn reliable rewards. Why it matters: Without verifiable rewards, models can overfit to preferences or style instead of correctness. 🍞 Anchor: If the task is “schedule a 2 PM dentist visit,” the AI gets a reward only if the calendar truly shows that event at 2 PM.
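A minimal sketch of what makes a reward "verifiable": the reward function inspects the resulting state, not the agent's own report. The data shapes and function name below are assumptions for illustration, not the paper's implementation.

```python
def verifiable_reward(calendar, expected_title, expected_hour):
    """Reward 1.0 only if the verified end state is correct --
    the calendar really contains the requested event."""
    for event in calendar:
        if event["title"] == expected_title and event["hour"] == expected_hour:
            return 1.0
    return 0.0

# The agent edited the calendar; we check the state, not the agent's claim.
calendar_after = [{"title": "dentist", "hour": 14}]
reward = verifiable_reward(calendar_after, "dentist", 14)  # 1.0
```

Because the check is automatic and exact, it can be run millions of times during RL training without a human in the loop.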

🍞 Hook: You know how the real world is messy—people send noisy group texts and apps sometimes fail? AIs need to survive that noise.

🥬 The Concept: Noise robustness means an AI still works well when there are distractions, errors, or random changes. How it works:

  1. Expect tool hiccups or spammy messages.
  2. Keep the goal in focus.
  3. Recover when something fails. Why it matters: Without robustness, one small glitch can derail the whole task. 🍞 Anchor: If the shopping API briefly fails, the AI retries or tries another step instead of giving up.

🍞 Hook: Teamwork makes the dream work—like a class project where someone researches, another writes, and another edits.

🥬 The Concept: Multi-agent collaboration is when several AIs divide a big job into smaller parts and message each other to finish faster and better. How it works:

  1. A main agent plans and delegates.
  2. Sub-agents execute focused subtasks (like “check contacts”).
  3. They report back; the main agent integrates results. Why it matters: Without collaboration, one model must do everything, which can be slow, costly, or error-prone. 🍞 Anchor: The planner AI asks a “Contacts” agent to find people in Berlin, then a “Calendar” agent to schedule a meeting, and merges the answers.

The world before Gaia2: Most agent tests were like “frozen games”—the world stood still until the agent moved. They mainly checked if the final answer matched, not whether each step was correct or timely. That missed real-world headaches like: messages arriving mid-task, tool failures, unclear instructions, and coordinating with teammates. People tried bigger models, fancy prompting, or chaining tools, but performance looked good mostly on static, single-thread tasks. The gap: we lacked a benchmark where time flows, events pop up on their own, and every write action is verifiably checked along the way.

Real stakes: In daily life, an assistant that misses a deadline or updates the wrong contact creates real problems. In companies, slow or brittle agents waste money and break workflows. Gaia2 exists to close that gap: to train and measure AIs that can be correct, fast, resilient, and cooperative in a world that never pauses.

02Core Idea

🍞 Hook: Imagine a science fair where projects have to run live all day. Judges don’t just peek at the end—they watch how you build, fix mistakes, handle surprises, and finish on time.

🥬 The Concept: The key insight is to benchmark AI agents in a live, event-driven world where time advances, surprises happen, and every state-changing step is verified. How it works:

  1. Build a realistic, asynchronous phone-like world (ARE platform).
  2. Create human-authored scenarios that require abilities like adapting to changes and meeting deadlines.
  3. Verify each write action against an oracle so credit is precise and reliable. Why it matters: Without live timing and verifiable steps, we can’t honestly measure (or train) real-world agent skills. 🍞 Anchor: While the agent drafts a message, a new email arrives changing the plan; Gaia2 checks whether the agent adapts correctly and on time, step by step.

Three analogies for the same idea:

  1. Cooking show: The clock never stops, surprise ingredients arrive, and judges score each cooking step—not just the final dish.
  2. Orchestra: The conductor (agent) must react to a sudden tempo change and cue sections (tools) on beat, while an adjudicator checks each entry was correct.
  3. Sports drill: Plays keep running; substitutes enter unexpectedly; coaches grade every move, not just the scoreboard.

Before vs After:

  • Before: Agents looked strong on frozen, single-turn tasks with lenient judging.
  • After: In Gaia2, models must balance thinking time vs deadlines, resist noise, ask to clarify ambiguity, and coordinate with other agents—revealing real trade-offs in accuracy, speed, and cost.

🍞 Hook: Think of a theme park with rides (apps), a clock that keeps ticking, and a map of what should happen when.

🥬 The Concept: ARE (Agents Research Environments) is the platform that powers Gaia2’s live, event-based simulations. How it works:

  1. Apps: phone-like tools (Contacts, Calendar, Email) that keep their own state.
  2. Events: everything that happens (tool calls, messages, scheduled updates) with timestamps.
  3. Notifications: which events are shown to the agent (like phone notifications) to study proactivity.
  4. Scenarios: initial state plus an event DAG (what depends on what) and verification rules. Why it matters: Without a flexible live-world engine, you can’t fairly test timing, adaptivity, or collaboration. 🍞 Anchor: A Calendar event ping arrives while the agent is writing; the notification policy decides if it gets surfaced, letting us see if the agent notices and adjusts.

🍞 Hook: You know how teachers grade steps in math to see if you really understood, not just guessed the answer.

🥬 The Concept: The Write-Action Verifier checks every state-changing action, not just the final outcome. How it works:

  1. Separate read vs write tools; reads are free to explore.
  2. Map each oracle write action to an agent’s action with checks:
    • Consistency (right tool, right arguments),
    • Causality (parents before children),
    • Timing (within allowed windows),
    • Completeness (all oracle writes matched).
  3. Use exact checks for rigid fields and an LLM rubric for flexible text. Why it matters: Without action-level checks, agents can wander, get lucky, or game a final-answer judge. 🍞 Anchor: If the oracle says “reply to email ID 123 at t+3 min,” success only counts if the agent really replied to that ID and at the right time.
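The four checks above can be sketched as a matching problem between the agent's write trace and the oracle's. This is a simplified illustration under assumed data shapes (dicts with `tool`, `args`, `t`, and oracle-side `parents` indices), not the real ARE verifier, which also applies rubric-based checks to free-text arguments.

```python
def verify_writes(agent_writes, oracle_writes, max_delay=180):
    """Match each oracle write to an agent write under the four checks.
    Oracle entries may list 'parents': indices of oracle writes that
    must be matched to earlier agent actions."""
    matched = {}  # oracle index -> agent index
    for oi, oracle in enumerate(oracle_writes):
        for ai, agent in enumerate(agent_writes):
            if ai in matched.values():
                continue
            consistent = (agent["tool"] == oracle["tool"]
                          and agent["args"] == oracle["args"])       # consistency
            on_time = abs(agent["t"] - oracle["t"]) <= max_delay     # timing
            causal = all(p in matched and
                         agent_writes[matched[p]]["t"] <= agent["t"]
                         for p in oracle.get("parents", []))         # causality
            if consistent and on_time and causal:
                matched[oi] = ai
                break
    return len(matched) == len(oracle_writes)                        # completeness

oracle = [{"tool": "Email.reply", "args": {"id": 123}, "t": 0, "parents": []},
          {"tool": "Email.forward", "args": {"id": 123}, "t": 60, "parents": [0]}]
agent = [{"tool": "Email.reply", "args": {"id": 123}, "t": 10},
         {"tool": "Email.forward", "args": {"id": 123}, "t": 70}]
# verify_writes(agent, oracle) -> True: both writes matched, in causal order
```

Dropping either write, reordering them, or being too late fails the run, which is exactly what makes the signal dense enough to train on.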

🍞 Hook: Think of seven gym stations, each training a different muscle so you become an all-around athlete.

🥬 The Concept: Gaia2 measures seven abilities: Execution, Search, Ambiguity, Adaptability, Time, Noise, and Agent2Agent. How it works:

  1. Execution: many correct write steps in order.
  2. Search: gather facts via reads, then report.
  3. Ambiguity: detect impossible/unclear instructions and ask.
  4. Adaptability: adjust plan when the world changes.
  5. Time: do things before deadlines while the clock moves.
  6. Noise: keep going despite errors/distractors.
  7. Agent2Agent: coordinate via messages with app-agents. Why it matters: Without testing each ability, a model might ace easy tasks but fail in messy, real situations. 🍞 Anchor: A task like “update all friends under 25 by +1 year” stresses Execution; “if no one replies in 3 minutes, order a cab” stresses Time.

Why this works: Intuitively, real success requires the right balance: think enough to avoid mistakes, but not so long you miss deadlines; explore reads before risky writes; recover from noise; and call for help when tasks are ambiguous. By rewarding only verifiable steps, the benchmark nudges agents toward dependable habits. By letting time flow and events happen on their own, it exposes what synchronous tests hide. By including collaboration, it opens a path to smarter, cheaper teams.

03Methodology

At a high level: User/task → Mobile world (ARE) with apps and time → Agent runs with a ReAct-style loop → Environment keeps sending notifications/events → Agent reads, then writes → Write-Action Verifier checks each state change → Pass/Fail and logs.

🍞 Hook: Imagine a pretend smartphone that’s actually a lab: real apps, real schedules, and a real clock.

🥬 The Concept: The Mobile environment is a simulated phone with 12+ apps and 101 tools, backed by rich synthetic data (contacts, emails, chats, calendar, files, rides, shopping). How it works:

  1. Each app keeps its own state (e.g., Emails has inbox/sent).
  2. Tools are typed as read (no state change) or write (changes state), with consistent APIs.
  3. Universes are populated with coherent personas and cross-app links so data matches. Why it matters: Without a realistic sandbox, agents can’t practice complex, long-horizon tasks. 🍞 Anchor: An agent can read Chats to find a friend’s city (reads) and then message them (write) with a consistent interface.

🍞 Hook: You know how your phone sends you only certain notifications? Too many pings is distracting.

🥬 The Concept: Notifications choose which events the agent sees (low, medium, high verbosity), supporting studies of proactivity vs reactivity. How it works:

  1. Events are queued with timestamps.
  2. A policy decides which events push alerts to the agent context.
  3. The agent can still proactively read data even if not notified. Why it matters: Without control over what the agent sees, we can’t fairly test its initiative. 🍞 Anchor: On medium verbosity, the agent gets notified about replies to its own messages but not every background update.
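A notification policy can be sketched as a simple filter over the event queue. The verbosity levels and field names here are assumptions chosen to match the description above, not ARE's actual configuration.

```python
def notify(event, verbosity, agent_id):
    """Decide whether an event is pushed into the agent's context.
    Illustrative policy: high surfaces everything, medium only replies
    to the agent's own messages, low pushes nothing (the agent must poll)."""
    if verbosity == "high":
        return True
    if verbosity == "medium":
        return event.get("reply_to") == agent_id
    return False

evt = {"type": "chat", "reply_to": "agent-1"}
# notify(evt, "medium", "agent-1") -> True; background updates stay hidden
```

Varying this one function changes how proactive an agent must be: on low verbosity, everything depends on the agent deciding to read its apps.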

🍞 Hook: Think of a recipe card with steps and arrows showing what depends on what.

🥬 The Concept: The Event DAG (Directed Acyclic Graph) captures scenario structure: which actions must come before others, and what’s independent. How it works:

  1. Nodes are events (tool calls, scheduled updates, validations).
  2. Edges are dependencies (parents before children).
  3. Some nodes are timed relative to parents (e.g., +180 seconds). Why it matters: Without a clear DAG, we can’t judge ordering, timing, or independence fairly. 🍞 Anchor: “Send invites” must happen before “collect replies”; either can be done before “finalize room booking,” as long as the timing matches.
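The dependency-plus-offset structure can be sketched as a tiny scheduler that resolves each node's firing time from its parents. The node format is an assumption for illustration; real ARE scenarios carry much richer event payloads.

```python
def schedule_dag(nodes):
    """Toy event-DAG scheduler: each node fires at max(parent firing
    times) + its relative offset in seconds.
    Node shape (assumed): {"id": str, "parents": [ids], "offset": float}."""
    times, remaining = {}, list(nodes)
    while remaining:
        progressed = False
        for node in list(remaining):
            if all(p in times for p in node["parents"]):
                base = max((times[p] for p in node["parents"]), default=0)
                times[node["id"]] = base + node["offset"]
                remaining.remove(node)
                progressed = True
        if not progressed:
            raise ValueError("cycle detected: not a DAG")
    return times

dag = [
    {"id": "send_invites", "parents": [], "offset": 0},
    {"id": "collect_replies", "parents": ["send_invites"], "offset": 180},
    {"id": "finalize_booking", "parents": ["collect_replies"], "offset": 60},
]
# schedule_dag(dag) -> send_invites at t=0, replies at t=180, booking at t=240
```

Nodes with no path between them stay independent, which is what lets the verifier accept any causally valid ordering rather than one rigid script.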

🍞 Hook: When you solve a puzzle, you usually look around first before moving the key piece.

🥬 The Concept: Read vs Write tools separate exploration from commitment. How it works:

  1. Reads gather info without changing the world.
  2. Writes change state and are carefully verified.
  3. The agent can do unlimited reads but must be precise with writes. Why it matters: Without this split, agents would be punished for exploring, discouraging safe planning. 🍞 Anchor: The agent can search Contacts and Chats freely; only when it updates a contact’s age does the verifier check accuracy.

🍞 Hook: Picture a simple rhythm: think, act, observe; then repeat.

🥬 The Concept: A ReAct-style scaffold runs the agent in steps, injecting fresh notifications before each model call and checking for stopping after. How it works:

  1. Pre-step: add any new notifications to context.
  2. Model produces a structured tool call (JSON) or a user reply.
  3. Post-step: run the tool, collect outputs, check for termination. Why it matters: Without a clean loop that handles async events, models miss updates or loop forever. 🍞 Anchor: Right before calling Calendar.add_event, a new email notification arrives; the pre-step makes sure the model sees it first.
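The three-phase loop above can be sketched in a few lines. The `model` and `env` interfaces here are assumptions for illustration (a callable returning a structured action, an environment with `pop_notifications` and `run_tool`), not the scaffold's real API.

```python
def react_loop(model, env, max_steps=20):
    """Pre-step: inject fresh notifications. Step: model emits a tool
    call or a final reply. Post-step: execute the tool, check stopping."""
    context = []
    for _ in range(max_steps):
        context.extend(env.pop_notifications())   # pre-step: fresh async events
        action = model(context)                   # structured tool call or reply
        if action["type"] == "reply":
            return action["text"]                 # post-step: termination
        observation = env.run_tool(action["tool"], action["args"])
        context.append(observation)
    return None                                   # step budget exhausted

# Tiny demo with stubs (purely illustrative):
class StubEnv:
    def __init__(self):
        self.pending = ["notification: new email"]
    def pop_notifications(self):
        fresh, self.pending = self.pending, []
        return fresh
    def run_tool(self, tool, args):
        return f"{tool} -> ok"

def stub_model(context):
    if "Calendar.add_event -> ok" in context:
        return {"type": "reply", "text": "done"}
    return {"type": "tool", "tool": "Calendar.add_event", "args": {}}

result = react_loop(stub_model, StubEnv())  # "done"
```

The key design point is the pre-step injection: notifications enter the context before each model call, so an event arriving mid-task is seen before the next action is committed.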

🍞 Hook: Getting a badge only if you truly completed a quest keeps players honest.

🥬 The Concept: The Write-Action Verifier provides fine-grained, reproducible evaluation usable as a reward function (RLVR). How it works:

  1. Count and type-check tools vs oracle.
  2. Check order with the scenario’s DAG.
  3. Enforce timing windows where required.
  4. Use exact and rubric-based argument checks. Why it matters: Without trustworthy verification, training and evaluation can be gamed. 🍞 Anchor: If the oracle expects “send email to mom with code” and “forward to dad,” the run only passes if both exactly happened, in causally valid order.

Step-by-step example (Time scenario):

  • Input: “Message each colleague I’m meeting today to ask who orders the cab. If no reply in 3 minutes, order a default cab.”
  • Step A (Explore): Read Calendar to find who you’re meeting; read Contacts to get chat handles.
  • Step B (Act): Send messages (write: Chats.create_and_add_message) and start a timer (System.wait or reasoning loop with time awareness).
  • Step C (Observe): Notifications deliver replies; if the right reply arrives in time, finish; else, after 3 minutes, call Cabs.order.
  • Step D (Verify): Checks that messages were sent, the timing was respected, and the cab was ordered only when required.
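Steps B and C above boil down to a watch-the-clock fallback, sketched below. The helper names (`send_message`, `poll_replies`, `wait`, `order_cab`) are illustrative stand-ins, not real ARE tools.

```python
def cab_fallback(env, colleagues, deadline=180):
    """Message everyone, then watch the clock; order a default cab
    only if no reply arrives within the deadline (in seconds)."""
    for person in colleagues:
        env.send_message(person, "Who orders the cab?")
    start = env.clock
    while env.clock - start < deadline:
        reply = env.poll_replies()
        if reply is not None:
            return f"{reply} orders the cab"
        env.wait(10)          # polling costs time; the clock keeps moving
    env.order_cab("default")  # deadline passed with no reply
    return "ordered default cab"

class StubEnv:
    """Minimal stand-in world: optionally delivers one reply at reply_at."""
    def __init__(self, reply_at=None, reply=None):
        self.clock, self.reply_at, self.reply = 0, reply_at, reply
    def send_message(self, person, text): pass
    def wait(self, seconds): self.clock += seconds
    def order_cab(self, kind): self.cab = kind
    def poll_replies(self):
        ready = self.reply_at is not None and self.clock >= self.reply_at
        return self.reply if ready else None

silent = cab_fallback(StubEnv(), ["Ana", "Bo"])          # "ordered default cab"
quick = cab_fallback(StubEnv(30, "Ana"), ["Ana", "Bo"])  # "Ana orders the cab"
```

Note how a model that "thinks" for four minutes before polling would blow the deadline even with perfect reasoning, which is exactly what the Time split measures.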

The secret sauce:

  • Asynchronous time pressure: thinking costs time, so agents must trade depth vs deadlines.
  • Verifiable action traces: credit is fair and dense, perfect for RL.
  • Modular, realistic apps: re-usable APIs make results meaningful beyond this one benchmark.
  • Scenario DAGs + notifications: precise control over what matters and what the agent notices.
  • Collaboration mode (Agent2Agent): a native way to measure planning vs execution in teams.

04Experiments & Results

🍞 Hook: Think of a scoreboard that doesn’t just say win or lose—it shows how fast you ran, how carefully you played, and how much energy you spent.

🥬 The Concept: pass@1 is the percent of scenarios the model solves on the first try. How it works:

  1. Run each scenario up to its limits.
  2. If all required write actions pass verification, count it as success.
  3. Report the share of successes. Why it matters: Without a clear success rate, we can’t compare models fairly. 🍞 Anchor: If a model solves 42 of 100 tasks correctly end-to-end, its pass@1 is 42%.
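The metric itself is a one-liner; a run counts only if every required write action passed verification:

```python
def pass_at_1(outcomes):
    """pass@1 as described above: the percentage of scenarios whose
    required write actions all passed verification on the first run."""
    return 100.0 * sum(outcomes) / len(outcomes)

score = pass_at_1([True] * 42 + [False] * 58)  # 42.0
```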

🍞 Hook: Imagine you have a spending cap per mission; you want the best score before the money runs out.

🥬 The Concept: Budget scaling curves plot success vs maximum allowed cost, exposing cost–accuracy trade-offs. How it works:

  1. Cap the per-scenario budget; count only runs under that budget.
  2. Vary the cap from tiny to large.
  3. See which models deliver more success per dollar. Why it matters: Without cost-normalized views, expensive models might look unfairly superior. 🍞 Anchor: At the same $0.50 budget, one model might solve far more tasks than another, even if its top-line accuracy is lower without budget limits.
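A budget scaling curve can be sketched as follows; the run format (solved flag plus dollar cost) is an assumption for illustration:

```python
def budget_curve(runs, caps):
    """For each spending cap, a run counts as a success only if it
    solved the task AND stayed under the cap.
    runs: list of (solved: bool, cost_dollars: float)."""
    curve = {}
    for cap in caps:
        wins = sum(1 for solved, cost in runs if solved and cost <= cap)
        curve[cap] = 100.0 * wins / len(runs)
    return curve

runs = [(True, 0.10), (True, 0.60), (False, 0.05), (True, 0.40)]
# budget_curve(runs, [0.50, 1.00]) -> {0.50: 50.0, 1.00: 75.0}
```

Plotting `curve` over many caps shows whether a model's accuracy comes cheaply or only at a high per-scenario spend.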

Overall results (high level):

  • GPT-5 (high) achieved about 42% pass@1 overall, leading the pack but scoring 0% on the Time split in default (real-latency) mode.
  • Claude-4 Sonnet was competitive overall and much faster, trading higher cost for speed.
  • Gemini 2.5 Pro did well on Time thanks to low latency.
  • Kimi-K2 led open-source models around 20% pass@1 and showed strong adaptability.
  • No single model dominated all abilities; curves plateaued, suggesting that models plus standard scaffolds are missing ingredients for steady gains.

Split difficulty:

  • Easiest: Execution and Search (consistent with prior benchmarks being close to saturated).
  • Harder: Ambiguity (ask-clarify) and Adaptability (update plan mid-task).
  • Hardest: Noise and Time (robustness and temporal responsiveness).

Surprises and insights:

  • Time split inverse scaling: deeper “thinking” models missed deadlines; when latency was removed (instant mode), their Time scores jumped a lot (e.g., GPT-5 high 0% → ~34%). This proves inference speed and orchestration matter for time-critical tasks.
  • Exploration helps: higher tool-call counts and more output tokens generally correlated with better performance. But Sonnet and Kimi-K2 were notable outliers—strong performance with fewer tokens, hinting at efficiency advantages.
  • Collaboration helps small models more: In Agent2Agent tests, adding collaborators boosted lighter models’ pass@k scaling, but gave limited gains to top models. Heterogeneous teams (strong planner + stronger or cheaper executors) often worked best.
  • Verifier reliability: The ARE Verifier reached ~0.99 precision and ~0.95 recall against human-labeled runs, beating an in-context LLM judge. Also, they hardened it against “judge hacking” that tried to trick soft checks with noisy text chunks.

What the numbers really mean:

  • A 42% overall pass@1 is like getting an A on some skills and a C or D on others; it’s good, but far from human-like reliability under pressure.
  • A low Time score at default speed is like missing your bus because you over-explained your plan.
  • A strong Noise score is like finishing homework despite classmates talking loudly.

Takeaway: To act usefully in the wild, agents need more than reasoning—they need timing sense, robust recovery, and good teamwork. Gaia2 makes those gaps visible and measurable.

05Discussion & Limitations

🍞 Hook: Even great students have weak spots; knowing them helps you practice smarter.

Limitations:

  • Time orchestration: The baseline single-thread ReAct loop can’t always express parallel actions that some Time tasks demand (though parallel tool-calling improves latency, it didn’t fix Time performance gaps by itself).
  • Domain scope: The Mobile world is rich but focused on consumer apps; other domains (enterprise, browsing, robotics) will need additional environments.
  • Synthetic universes: Data is carefully generated for coherence but still synthetic; some deep cross-app consistencies remain future work.
  • Verifier dependence on LLM rubrics: Soft checks are robust but not perfect; they need thoughtful prompts and anti-hacking defenses.

Required resources:

  • Models with long context windows (≥128K) and stable, low-latency serving for Time tasks.
  • Compute budget for multi-run evaluations and, in RLVR training, for large-scale action-level credit.
  • Engineering to integrate models, tools, and logs with ARE, plus GPU/CPU for verifier runs.

When not to use:

  • If you only care about single-turn Q&A with no tool use, simpler static benchmarks are cheaper and sufficient.
  • If latency doesn’t matter for your app (e.g., overnight batch jobs), the Time split may be less relevant.
  • If you can’t expose or simulate your tools’ state transitions, action-level verification may be hard to adopt.

Open questions:

  • Adaptive compute: How should agents decide when to think deeply vs act quickly to meet deadlines?
  • Better collaboration: What protocols help agents communicate intent, state, and partial results with minimal token cost?
  • Stronger robustness: How to make agents resistant to tool failures and distractors without overfitting to noise patterns?
  • Richer verifiers: Can we blend scalar, rubric, and preference signals to cover subjective tasks while keeping training stable?
  • Generalization: How well do gains on Gaia2 transfer to real devices and production systems, closing the sim2real gap?

🍞 Anchor: Treat Gaia2 like a practice gym: it shows where you wobble (timing, noise, teamwork) so you can target training, upgrade orchestration, and verify real improvements.

06Conclusion & Future Work

Three-sentence summary: Gaia2, built on the ARE platform, benchmarks AI agents in live, asynchronous environments where time advances and events happen independently. It verifies every state-changing action, enabling reliable scores and direct training with reinforcement learning from verifiable rewards. Results reveal sharp trade-offs between reasoning, speed, robustness, and cost, with no model best across all abilities—especially under time pressure and noise.

Main achievement: The paper delivers a practical, extensible framework and benchmark that finally tests—and can train—agents on what real life requires: acting correctly, in order, on time, despite surprises, and sometimes as a team, with step-by-step verifiable credit.

Future directions:

  • Build adaptive-compute agents that switch between fast, shallow and slow, deep reasoning based on time pressure.
  • Expand environments (desktop, web, enterprise) with richer cross-app consistency.
  • Advance collaboration protocols and heterogeneous-team strategies.
  • Strengthen verifiers and anti-hacking rules, blending scalar and rubric rewards.

Why remember this: If yesterday’s tests were like paused video games, Gaia2 is the real match with a running clock, surprise plays, and a ref watching every move. Passing here means an agent is not just smart—it’s ready for the real world.

Practical Applications

  • Build AI schedulers that set, move, and confirm appointments on time despite last-minute changes.
  • Create robust email and chat assistants that track threads, ask clarifying questions, and ignore spammy distractions.
  • Deploy customer-support agents that coordinate with back-end tools and teammates to resolve tickets quickly.
  • Train shopping assistants to monitor stock or price changes and buy within strict timing windows.
  • Automate office workflows (e.g., follow-ups after no reply for N minutes) with action-level verification.
  • Use heterogeneous agent teams (planner + executors) to cut costs while keeping quality high.
  • Stress-test production agents with controlled noise (API failures, irrelevant events) before launch.
  • Adopt action-level rewards (RLVR) to reliably improve agents' step-by-step correctness.
  • Evaluate orchestration choices (parallel tool-calling, notification policies) to meet real latency targets.
#Gaia2 #ARE platform #asynchronous environments #write-action verifier #RLVR #agent benchmarking #temporal reasoning #noise robustness #multi-agent collaboration #Agent2Agent #event DAG #notifications #ReAct scaffold #budget scaling curves #pass@1