General Agent Evaluation
Key Summary
- This paper shows how to fairly test "general-purpose" AI agents that should work in many places without special tweaks.
- The authors introduce a Unified Protocol that turns every task into the same simple recipe: task, context, and actions.
- They built Exgentic, a framework that plugs any agent into any benchmark using adaptors, then runs and logs everything the same way.
- They launched the first Open General Agent Leaderboard to compare popular agents across six very different environments.
- Results: general agents can match specialized, hand-tuned agents on many tasks, but the language model behind the agent matters most.
- Claude Opus 4.5-based setups scored the highest on average (about 0.66), Gemini 3 was close (0.60), and GPT 5.2 was cheaper but weaker (0.40).
- Tool shortlisting and schema guards helped agents avoid mistakes and handle big toolboxes efficiently.
- Cost vs. performance is a real tradeoff: the very best scores can cost 30× more per task than the most efficient setups.
- The framework is text-only today and doesn't yet cover visual/web interactions; expanding support and reducing cost are next steps.
Why This Research Matters
This work makes it finally possible to test "one agent for many jobs" fairly, which is what real users need. Companies can compare options based on both quality and cost, instead of guessing from single-domain demos. Developers save weeks of custom wiring by plugging agents and benchmarks into the same narrow waist once. The public leaderboard gives everyone a clear, shared scoreboard that encourages honest progress. The findings also guide practical choices, like when to pay for top accuracy and when to prioritize efficiency. Over time, this should speed up the creation of assistants that seamlessly switch between tasks the way people do.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how kids can learn to play many different games in the same playground? It would be weird if you needed a brand-new playground for tag, another for hide-and-seek, and another for soccer. You want one place where all games can be played and fairly scored.
🥬 Filling (The Actual Concept)
- What it is: This paper asks, "How do we fairly test AI agents that should work anywhere, not just in one special place?" and builds tools to do that.
- How it works (story of the field):
- The world before: AI agents got good at narrow jobs like fixing code, browsing the web, or answering customer questions, but usually with custom setups and manual tuning for each domain.
- The problem: If every benchmark uses its own language, buttons, and rules, you can't see which agent is truly general. Agents and tests don't speak the same "protocol," so people glue them together case-by-case. That's slow, unfair, and can break the agent's natural way of working.
- Failed attempts: Some projects tried to bring multiple tasks under one umbrella, but forced all agents to talk a single way (like only through a web browser or only via the command line). That hides what the agent is really good at and still doesn't let agents use their favorite native tools.
- The gap: We needed a shared, simple way for any agent to talk to any benchmark, without rewriting everyone's code, and a fair tournament where many kinds of tasks are compared the same way.
- Real stakes: In real life, an assistant should switch from booking flights to editing files to researching facts, without a human rewriting the rules each time. Companies want to deploy agents that don't require a custom engineering project for every new job.
- Why it matters: Without a fair, shared test, we can't tell if agents are truly general or just good at one carefully prepared show.
🍞 Bottom Bread (Anchor) Imagine testing runners by making one sprint on sand, another on grass, and another on ice, with different timers at each place. You'd have no idea who's best overall. This paper builds a single fair track and a single timer, so we can finally compare runners fairly.
🍞 Top Bread (Hook) You know how schools use the same style of test to compare students fairly across different classes?
🥬 The Concept: Benchmarking frameworks
- What it is: A benchmarking framework is a standard way to test and compare agents on tasks.
- How it works:
- Pick tasks (like web browsing, customer service, coding).
- Define how agents talk to the task world.
- Score them the same way every time.
- Why it matters: Without a shared scoring and communication method, comparisons aren't apples-to-apples.
🍞 Bottom Bread (Anchor) It's like a school giving the same reading test to all sixth graders, no matter their classroom; now we can compare fairly.
🍞 Top Bread (Hook) Imagine a Swiss Army knife that should work for camping, cooking, and repairs, without being rebuilt each time.
🥬 The Concept: General-purpose agents
- What it is: General-purpose agents are AI helpers meant to handle many different tasks in different environments without special rewiring.
- How it works:
- Read the task (what to do).
- Use tools (how to act) provided by the environment.
- Decide step-by-step, often by planning, searching, or calling tools.
- Why it matters: If an agent only works after lots of custom tuning, it's not truly general and won't scale to real-world variety.
🍞 Bottom Bread (Anchor) Like a phone assistant that can book a flight, send a text, update a spreadsheet, and summarize emails, without a separate app for each one.
02 Core Idea
🍞 Top Bread (Hook) You know how a universal charger lets you power many devices by snapping on the right tip? You don't buy a new charger for every gadget.
🥬 The Concept: The "Aha!" Moment
- What it is: The key idea is a Unified Protocol, a simple, shared way for any agent and any benchmark to talk, plus Exgentic, a framework that uses adaptors to connect them and run fair evaluations, ending with a public leaderboard.
- How it works:
- Unified Protocol turns every task into the same three-part recipe: task, context, actions.
- Adaptors translate from this recipe into the agentās favorite style (tool-calls, MCP, Python functions) and the benchmarkās original interface.
- An orchestrator runs the conversation: observe → choose action → get result, until done.
- The same logging and scoring across everything builds a fair scoreboard.
- Why it matters: Without a shared "plug," you need custom cables for every agent-benchmark pair. With the Unified Protocol, you plug once, and things just work.
🍞 Bottom Bread (Anchor) Think of it like a multi-sport tournament that supplies standard balls, scoreboards, and rules, so basketball, soccer, and tennis can all be played and compared fairly.
Multiple Analogies
- Airport analogy: The Unified Protocol is like a central terminal where any airline (agent) and any gate (benchmark) use the same signs. Adaptors are the jet bridges that fit any plane.
- Language analogy: It's like everyone agreeing to speak in simple subject-verb-object sentences. Translators (adaptors) handle accents so all speakers understand each other.
- Power strip analogy: The protocol is a universal power strip; adaptors match prongs; Exgentic is the circuit breaker keeping everything safe, measured, and comparable.
Before vs After
- Before: Agents and benchmarks spoke different dialects; comparisons were messy; customization everywhere.
- After: One narrow waist connects all sides. Add a new agent or a new benchmark once; it works with the rest. Now we can measure true generalization.
Why It Works (intuition behind the math)
- Narrow waist principle: When many producers and consumers meet at a tiny shared core, complexity shrinks from "connect everyone to everyone" to "connect everyone to the core."
- Canonical task representation: Standardizing on task, context, actions removes hidden assumptions and forces fairness.
- Controlled interfaces: Adaptors preserve each sideās natural behavior, avoiding distortions from forced single-protocol testing.
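The narrow-waist arithmetic can be made concrete with a back-of-envelope calculation, using the study's own counts (5 agent architectures, 6 benchmarks). The function name is ours, for illustration only:

```python
# Pairwise wiring needs N*M adaptors; a shared protocol needs only N+M.
def integrations_needed(n_agents, n_benchmarks, shared_protocol):
    if shared_protocol:
        return n_agents + n_benchmarks  # everyone connects to the core once
    return n_agents * n_benchmarks      # every agent wired to every benchmark

print(integrations_needed(5, 6, shared_protocol=False))  # -> 30
print(integrations_needed(5, 6, shared_protocol=True))   # -> 11
```

With 5 agents and 6 benchmarks, the shared core cuts 30 bespoke integrations down to 11, and the gap widens as either side grows.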
Building Blocks (explained simply) 🍞 Top Bread (Hook) Imagine a recipe card you always fill out the same way, no matter the dish.
🥬 The Concept: Canonical Task Representation (task, context, actions)
- What it is: A fixed three-part description every benchmark provides the agent.
- How it works:
- Task: what to do (the goal in words).
- Context: what you're allowed to know (like policies or rules).
- Actions: buttons you can press (tools with parameters and outputs).
- Why it matters: Without this, agents might guess hidden rules or miss allowed actions, making tests unfair.
🍞 Bottom Bread (Anchor) For airline customer service, task = "help the user within policy," context = the policy text, actions = book reservation, cancel, message, etc.
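The three-part recipe can be sketched as plain data structures. This is a hypothetical layout (field and class names are ours, not the paper's exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class ActionSpec:
    name: str           # e.g. "cancel_reservation": a button the agent can press
    parameters: dict    # parameter names mapped to type descriptions
    description: str = ""

@dataclass
class CanonicalTask:
    task: str                                   # what to do (the goal in words)
    context: str                                # what the agent may rely on
    actions: list[ActionSpec] = field(default_factory=list)

# The airline anchor above, expressed in this sketch:
airline = CanonicalTask(
    task="Help the user within policy.",
    context="(the airline policy text goes here)",
    actions=[
        ActionSpec("message", {"text": "str"}, "Reply to the simulated user"),
        ActionSpec("cancel_reservation", {"reservation_id": "str"}),
    ],
)
print([a.name for a in airline.actions])  # -> ['message', 'cancel_reservation']
```

Because every benchmark fills out the same card, an agent adaptor only ever has to read one shape.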
🍞 Top Bread (Hook) You know how some friends prefer texting, others voice notes, and others emails? You still send the same message.
🥬 The Concept: Exgentic framework
- What it is: The system that runs everything: spins up sessions, feeds observations to the agent, executes actions in the benchmark, logs, and scores.
- How it works:
- Start a session with task, context, actions.
- Loop: benchmark sends observation → agent picks an action → benchmark runs it → returns new observation.
- Stop when finished or limits are reached; save results and costs.
- Why it matters: Without a consistent referee, results wouldn't be comparable or reproducible.
🍞 Bottom Bread (Anchor) Like a game master who keeps the clock, tracks moves, and posts the final score for every match in a league.
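The start-loop-stop cycle above can be sketched in a few lines. This is a minimal toy, assuming made-up `Benchmark`/`Agent` interfaces; Exgentic's real APIs will differ:

```python
class EchoBenchmark:
    """Toy environment: the episode ends once the agent says 'finish'."""
    def reset(self):
        return "observation: say finish to end"
    def step(self, action):
        done = (action == "finish")
        return ("done" if done else "keep going"), done

class OneShotAgent:
    """Toy agent: always finishes on its first turn."""
    def act(self, observation):
        return "finish"

def run_episode(benchmark, agent, max_steps=100):
    obs = benchmark.reset()                 # start a session, first observation
    for steps in range(1, max_steps + 1):   # safety limit on turns
        action = agent.act(obs)             # agent picks an action
        obs, done = benchmark.step(action)  # benchmark executes it
        if done:
            break
    return obs, steps                       # final result + effort taken

result, steps = run_episode(EchoBenchmark(), OneShotAgent())
print(result, steps)  # -> done 1
```

A real orchestrator adds process isolation, parallelism, caching, and standardized logging around this same loop.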
🍞 Top Bread (Hook) Think of a scoreboard at a tournament that everyone can see.
🥬 The Concept: Open General Agent Leaderboard
- What it is: A public ranking of agents across many environments, all tested the same way.
- How it works:
- Run lots of tasks from different domains.
- Use the same metrics: success rate, cost per task, steps.
- Publish results so people can compare and improve.
- Why it matters: It shines a light on true generalization instead of one-off tricks.
🍞 Bottom Bread (Anchor) It's like a season-long standings table showing which teams perform well at home, away, and in different weather.
03 Methodology
High-Level Recipe: Input → [Adapt Benchmarks] → [Adapt Agents] → [Run Orchestrator Loop] → Output (scores, costs, logs)
Step 1: Adapt Benchmarks to the Unified Protocol
- What happens: Each benchmark is converted into the shared "task, context, actions" format.
- Why this step exists: Benchmarks come with hidden assumptions (like how to submit a solution). If we don't make them explicit, agents won't know the proper moves.
- Example with data:
  - SWE-Bench Verified: The reference agent clones repos in a ready bash environment and says "complete and submit" to finish. So Exgentic exposes actions: bash(command: str) and submit_patch(summary: str). The task text clearly states: all edits must be made through bash; submission captures staged diffs.
  - τ-Bench (Airline/Retail/Telecom): Expose a message action for chatting with the simulated user and tool actions like cancel_reservation(reservation_id) or search_direct_flight(...). The task says "follow the given policy," and the policy lives in context.
  - BrowseComp+: Expose search(query), get_document(id), and submit(answer, explanation, confidence). The task warns: don't chat; only finish with submit.
  - AppWorld: Keep the native interpreter approach and wrap its many APIs (468 tools) as actions; the task and context explain app rules and credentials flow.
Step 2: Adapt Agents to the Unified Protocol
- What happens: Create agent-side adaptors that turn protocol "actions" into the agent's native style: Python functions (Smolagent), tool calls (ReAct), or MCP tools (Claude Code, OpenAI Solo).
- Why this step exists: Agents expect different message formats; without adaptors, they can't "press" the benchmark's buttons naturally.
- Example with data:
  - Tool-calling agents (ReAct): The adaptor maps actions to OpenAI-style tools and converts special actions (like message) accordingly.
  - MCP-based agents (OpenAI Solo, Claude Code): The adaptor exposes benchmark actions as MCP tools and routes observations back.
  - Code-generating agents (Smolagent): Each action becomes a Python function; runtime errors serve as feedback for schema mistakes.
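For the tool-calling case, an adaptor might translate a protocol action into the public OpenAI-style tool definition format. A hedged sketch (the helper name is ours; the dict layout follows the widely used function-calling schema, not Exgentic's internal code):

```python
import json

def action_to_openai_tool(name, params, description=""):
    """Map a canonical action (name + parameter types) to a tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in params.items()},
                "required": list(params),  # treat all parameters as required
            },
        },
    }

tool = action_to_openai_tool(
    "search_direct_flight",
    {"origin": "string", "destination": "string"},
    "Search flights between two airports",
)
print(json.dumps(tool["function"]["parameters"]["required"]))
# -> ["origin", "destination"]
```

An MCP adaptor would do the analogous translation into MCP tool descriptors, and a code-generating adaptor would emit a Python function signature instead.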
Step 3: Orchestrator Loop
- What happens: Exgentic spins up a session, sends the first observation, waits for the agent's action, runs it in the benchmark, returns the new observation, and repeats. It stops on finish, no-action, or safety limits.
- Why this step exists: Keeps agents and benchmarks in sync, ensures isolation, parallelism, caching, and standard logs.
- Example with data:
  - In τ-Bench, the environment provides the user's next message; the agent chooses to reply (message) or use a tool (e.g., search_direct_flight). The loop cycles until the agent submits the final answer or hits step limits.
Secret Sauce: The Narrow Waist + Non-Intrusive Adaptors
- Instead of editing third-party code, Exgentic uses outside adaptors that translate protocols and synchronize turns. Agents and benchmarks run in separate processes as-is, preserving their original behavior. This keeps integration repeatable and faithful.
Key Components that Help Agents (explained simply) 🍞 Top Bread (Hook) Imagine a giant toolbox with 468 tools; it's hard to pick the right one quickly.
🥬 The Concept: Tool shortlisting
- What it is: A filter that picks a small, likely-useful subset of tools before the agent decides.
- How it works:
- Read the current goal and observation.
- Score tools for relevance.
- Keep the top few so the model isn't overwhelmed.
- Why it matters: Some models canāt even load hundreds of tools. Shortlisting makes impossible tasks possible and speeds everything up.
🍞 Bottom Bread (Anchor) In AppWorld with 468 tools, GPT 5.2 fails without shortlisting but becomes usable when we narrow the tool list.
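The score-and-keep-top-few steps above can be sketched with a toy relevance scorer. Real systems typically use embeddings or an LLM ranker; the keyword-overlap scorer here is our simplifying assumption:

```python
def shortlist_tools(goal, tools, k=3):
    """Keep the k tools whose name/description best overlap the goal's words."""
    goal_words = set(goal.lower().split())
    def score(tool):
        name, description = tool
        words = set((name.replace("_", " ") + " " + description).lower().split())
        return len(goal_words & words)  # crude relevance: shared-word count
    return [name for name, _ in sorted(tools, key=score, reverse=True)[:k]]

tools = [
    ("send_email", "send an email message"),
    ("search_flights", "search for a flight"),
    ("cancel_reservation", "cancel a flight reservation"),
    ("update_spreadsheet", "edit spreadsheet cells"),
]
picked = shortlist_tools("cancel my flight reservation", tools, k=2)
print(picked)  # -> ['cancel_reservation', 'search_flights']
```

With hundreds of tools, only the shortlist is sent to the model, which both fits context limits and speeds up each decision.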
🍞 Top Bread (Hook) You know how spell-check underlines mistakes so you can fix them right away?
🥬 The Concept: Schema guard
- What it is: A checker that spots when an action call has the wrong shape (like a missing parameter) and lets the agent correct itself.
- How it works:
- Validate the action against its schema before execution.
- If invalid, raise an error back to the agent.
- The agent tries again with a corrected call.
- Why it matters: Prevents wasted steps and weird failures from silly formatting mistakes.
🍞 Bottom Bread (Anchor) If an agent calls cancel_reservation() without reservation_id, the guard flags it so the agent can add the missing field.
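A minimal schema guard can be sketched as pre-execution validation. The call and schema formats below are our assumptions for illustration, not Exgentic's actual ones:

```python
class SchemaError(ValueError):
    """Raised back to the agent so it can retry with a corrected call."""

def guard(call, schemas):
    """Validate an action call's shape against its declared parameters."""
    name, args = call["name"], call.get("args", {})
    if name not in schemas:
        raise SchemaError(f"unknown action: {name}")
    missing = [p for p in schemas[name] if p not in args]
    extra = [a for a in args if a not in schemas[name]]
    if missing or extra:
        raise SchemaError(f"missing={missing} extra={extra}")
    return True

schemas = {"cancel_reservation": ["reservation_id"]}

guard({"name": "cancel_reservation", "args": {"reservation_id": "R42"}}, schemas)
try:
    guard({"name": "cancel_reservation", "args": {}}, schemas)  # bad call
except SchemaError as err:
    feedback = str(err)  # fed back as the observation, prompting a retry
print(feedback)  # -> missing=['reservation_id'] extra=[]
```

The key point is that the invalid call never reaches the environment; the error message becomes feedback the agent can act on in its next step.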
Metrics
- Success Rate: Did the agent accomplish the task under the benchmark's original rules?
- Cost per Task: How much did model calls cost on average?
- Average Steps: How many back-and-forths did it take?
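Aggregating the three metrics from logged episodes is straightforward. A toy sketch (the record fields are our assumptions about what a per-episode log might contain):

```python
import statistics

episodes = [
    {"success": True,  "cost": 0.12, "steps": 18},
    {"success": False, "cost": 0.30, "steps": 42},  # failures often cost more
    {"success": True,  "cost": 0.10, "steps": 15},
]

success_rate = sum(e["success"] for e in episodes) / len(episodes)
cost_per_task = statistics.mean(e["cost"] for e in episodes)
avg_steps = statistics.mean(e["steps"] for e in episodes)

print(round(success_rate, 2), round(cost_per_task, 2), round(avg_steps, 1))
# -> 0.67 0.17 25.0
```

Because every benchmark logs the same fields under the Unified Protocol, these numbers stay apples-to-apples across domains.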
End-to-End Example (SWE-Bench Verified)
- Input: Task describes bug-fix rules; context may be empty; actions include bash and submit_patch.
- Steps:
- Agent runs bash to inspect files and tests.
- Agent edits code via bash commands.
- Agent verifies locally.
- Agent calls submit_patch once with a summary.
- Output: Benchmark applies the patch to hidden tests and reports success/failure, plus the framework logs steps and cost.
04 Experiments & Results
The Test: What and Why
- They evaluated 5 agent architectures across 3 strong language models on 6 benchmarks (coding, browsing for deep research, customer service in 3 domains, and multi-app tasks), with up to 100 turns per task.
- They measured success rate (main score), average cost per task (money), and average steps (effort). This tells us who wins, who's efficient, and how hard the tasks feel.
The Competition: Who and How
- Agents: ReAct, ReAct Short (with tool shortlisting), Smolagent (code-generating), OpenAI Solo (MCP-based), Claude Code (MCP and coding features).
- Models: Claude Opus 4.5, Gemini 3, GPT 5.2.
- Benchmarks: BrowseComp+ (deep research), SWE-Bench Verified (software bugs), AppWorld (multi-app tasks), and τ-Bench (Airline, Retail, Telecom customer service).
The Scoreboard (with context)
- Overall model standings: Claude Opus 4.5 reached 0.66 average success, Gemini 3 scored 0.60, and GPT 5.2 scored 0.40. This is like Claude getting a solid A-, Gemini a B+, and GPT 5.2 a C.
- Best per benchmark (examples):
  - τ-Bench-Telecom: OpenAI Solo + Gemini 3 hit 0.89 (very high).
  - SWE-Bench Verified: OpenAI Solo + Claude Opus 4.5 hit 0.81, matching or exceeding top domain-specific results on the sampled set.
  - AppWorld: Smolagent + Claude Opus 4.5 scored 0.70 (near the published top of 0.73).
- No single agent dominates all domains: OpenAI Solo often wins where structured APIs and coding matter; Smolagent shines in multi-application and web-like tasks. This shows different scaffolds fit different task shapes.
Surprising Findings 🍞 Top Bread (Hook) Think of picking a backpack: a bigger one carries more, but it may be heavier and costlier.
🥬 The Concept: Cost-performance tradeoff (Pareto frontier)
- What it is: A curve showing the best possible tradeoffs between performance and cost; past this line, you can't improve one without hurting the other.
- How it works:
- Plot each agent-model by success (up) and cost (right).
- The frontier is the set with no better alternative in both dimensions.
- Choose a point based on your budget and quality needs.
- Why it matters: It guides practical choices; sometimes "good enough and cheap" beats "best but pricey."
🍞 Bottom Bread (Anchor) GPT 5.2 setups were the most cost-efficient, while Claude Opus 4.5 achieved top raw performance at up to 30× the cost.
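The plot-and-filter procedure above can be sketched directly: keep every configuration that no other configuration beats on both cost and success at once. The sample numbers below are illustrative, not the paper's measurements:

```python
def pareto_frontier(points):
    """points: (name, cost, success). Keep configs no rival dominates."""
    frontier = []
    for name, cost, success in points:
        dominated = any(
            c <= cost and s >= success and (c < cost or s > success)
            for _, c, s in points  # some rival is at least as cheap AND as good
        )
        if not dominated:
            frontier.append(name)
    return frontier

configs = [
    ("cheap-weak", 0.05, 0.40),
    ("mid", 0.50, 0.55),
    ("pricey-strong", 1.50, 0.66),
    ("dominated", 0.60, 0.50),  # costlier AND weaker than "mid": filtered out
]
print(pareto_frontier(configs))  # -> ['cheap-weak', 'mid', 'pricey-strong']
```

Picking a point on this frontier is then a budget decision: move right for more accuracy, left for lower cost per task.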
What Drives Performance Most?
- Variance decomposition shows model choice explains about 28.2% of the score differences; agent architecture only 0.6%. Translation: the model matters far more than the scaffold, though the pairing still matters.
Model Stability Across Architectures 🍞 Top Bread (Hook) Imagine a bike that rides smoothly on many roads; you can focus on the route instead of fixing the bike.
🥬 The Concept: Model stability
- What it is: How consistently a model performs when you swap different agent architectures.
- How it works:
- Run many agent variants with the same model.
- Measure spread (standard deviation) of scores.
- Lower spread = more stable.
- Why it matters: Stable models let teams improve agents without constant retuning.
🍞 Bottom Bread (Anchor) Claude Opus 4.5 showed the most stability (0.66 average with a low 0.06 spread), so developers can focus on agent design with fewer surprises.
Other Patterns
- Failures often took more steps (and thus more cost) than successes, especially in interaction-heavy tasks like AppWorld and BrowseComp+ (e.g., ReAct took 110.7% more steps on AppWorld failures).
- Cross-benchmark correlations were mostly positive, driven by model strength: models that do well in one place tend to do well elsewhere, but agent rankings can reshuffle within the same model.
- Tool shortlisting rescued GPT 5.2 in tool-rich worlds; schema guards helped top performers catch and fix invalid tool calls.
Bottom Line
- General agents can match or beat specialized baselines on many tasks without per-environment tuning.
- But to win, pick your model wisely, consider tool management (shortlisting), and watch costs.
05 Discussion & Limitations
Limitations
- Text-only interactions: The current framework focuses on text-based agent-environment exchanges. Visual interfaces, full web GUIs, or rich multimodal worlds still need adaptors and likely small protocol extensions.
- Coverage and cost: Evaluations are expensive; the study used a subset of agents and models, meaning some strong open-source options werenāt tested here.
- Protocol scope: The Unified Protocol fits many existing patterns, but new or unusual environments may require expanding the action or context design.
Required Resources
- Access to multiple LLMs (API keys), compute for running isolated sessions (Docker/containers), and budget for inference costs (reported total of about $22K for the study).
- Engineering time to write lightweight adaptors for new agents/benchmarks, though base adaptors reduce effort.
When NOT to Use
- Purely visual or highly interactive web tasks where the core interaction isn't textual yet (until proper adaptors exist).
- Ultra-low-latency or ultra-high-volume deployments where even efficient configurations are still too costly without further pruning or caching.
- Situations demanding custom, domain-specific tricks you explicitly want to use for maximum score on one benchmark (that would break the notion of generality).
Open Questions
- Can we reduce evaluation cost with smart sampling, early stopping, or outcome prediction, while keeping results reliable?
- How do we grow from text to multimodal and embodied settings without breaking the narrow waist?
- Which agent components (planning, memory, critique) most improve cross-domain robustness when model quality is held constant?
- How should safety, policy compliance, and alignment be measured consistently across domains in a general-agent leaderboard?
- Can we design training or fine-tuning methods that improve true cross-benchmark generalization, not just single-domain scores?
06 Conclusion & Future Work
Three-Sentence Summary
- This paper introduces a Unified Protocol and the Exgentic framework to fairly evaluate general-purpose agents across many different environments, all with the same simple task recipe: task, context, actions.
- Using these, the authors launch the first Open General Agent Leaderboard and show that general agents can match specialized ones without per-environment tuning, though model choice drives most performance differences.
- They also reveal practical tradeoffs: cost vs accuracy, the power of tool shortlisting and schema guards, and the importance of model stability.
Main Achievement
- Turning general-agent evaluation into a first-class, practical reality via a narrow-waist protocol, non-intrusive adaptors, and a public, reproducible leaderboard.
Future Directions
- Extend beyond text to visual/web/multimodal tasks while preserving the narrow waist.
- Lower evaluation costs through smarter sampling, caching, and early stopping.
- Explore which agent components most reliably boost cross-domain robustness and stability.
- Study safety and alignment metrics that travel well across domains.
Why Remember This
- It's the universal charger moment for agent evaluation: one plug that fits many worlds. With a fair field and clear scoreboard, the community can finally focus on building agents that truly generalize: useful anywhere, not just in a lab built for one trick.
Practical Applications
- Evaluate your in-house agent across coding, browsing, and customer support with one setup to find its true strengths.
- Choose the right model-agent pairing for your budget by comparing cost per task on the Pareto frontier.
- Add tool shortlisting to handle environments with hundreds of tools and avoid model limits.
- Enable reliable tool use by adding a schema guard to catch and fix invalid calls automatically.
- Run A/B tests on planning or memory modules across multiple benchmarks to see real cross-domain impact.
- Standardize agent evaluation in your org with a single protocol so teams can share results apples-to-apples.
- Prototype a new benchmark by writing only one adaptor and instantly make it compatible with many agents.
- Cut evaluation costs by caching runs and limiting steps using the framework's orchestration controls.
- Diagnose failure patterns by inspecting standardized trajectories and step counts across domains.
- Track progress over time on the Open General Agent Leaderboard to guide roadmap and procurement.