General Agent Evaluation | How I Study AI

General Agent Evaluation

Intermediate
Elron Bandel, Asaf Yehudai, Lilach Eden et al. Ā· 2/26/2026
arXiv

Key Summary

  • This paper shows how to fairly test "general-purpose" AI agents that should work in many places without special tweaks.
  • The authors introduce a Unified Protocol that turns every task into the same simple recipe: task, context, and actions.
  • They built Exgentic, a framework that plugs any agent into any benchmark using adaptors, then runs and logs everything the same way.
  • They launched the first Open General Agent Leaderboard to compare popular agents across six very different environments.
  • Results: general agents can match specialized, hand-tuned agents on many tasks, but the language model behind the agent matters most.
  • Claude Opus 4.5-based setups scored the highest on average (about 0.66), Gemini 3 was close (0.60), and GPT 5.2 was cheaper but weaker (0.40).
  • Tool shortlisting and schema guards helped agents avoid mistakes and handle big toolboxes efficiently.
  • Cost vs. performance is a real tradeoff: the very best scores can cost 30Ɨ more per task than the most efficient setups.
  • The framework is text-only today and doesn’t yet cover visual/web interactions; expanding support and reducing cost are next steps.

Why This Research Matters

This work makes it finally possible to test ā€œone agent for many jobsā€ fairly, which is what real users need. Companies can compare options based on both quality and cost, instead of guessing from single-domain demos. Developers save weeks of custom wiring by plugging agents and benchmarks into the same narrow waist once. The public leaderboard gives everyone a clear, shared scoreboard that encourages honest progress. The findings also guide practical choices—like when to pay for top accuracy and when to prioritize efficiency. Over time, this should speed up the creation of assistants that seamlessly switch between tasks the way people do.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook) You know how kids can learn to play many different games in the same playground? It would be weird if you needed a brand-new playground for tag, another for hide-and-seek, and another for soccer. You want one place where all games can be played and fairly scored.

🄬 Filling (The Actual Concept)

  • What it is: This paper asks, ā€œHow do we fairly test AI agents that should work anywhere, not just in one special place?ā€ and builds tools to do that.
  • How it works (story of the field):
    1. The world before: AI agents got good at narrow jobs like fixing code, browsing the web, or answering customer questions—but usually with custom setups and manual tuning for each domain.
    2. The problem: If every benchmark uses its own language, buttons, and rules, you can’t see which agent is truly general. Agents and tests don’t speak the same ā€œprotocol,ā€ so people glue them together case-by-case. That’s slow, unfair, and can break the agent’s natural way of working.
    3. Failed attempts: Some projects tried to bring multiple tasks under one umbrella, but forced all agents to talk a single way (like only through a web browser or only via command-line). That hides what the agent is really good at and still doesn’t let agents use their favorite native tools.
    4. The gap: We needed a shared, simple way for any agent to talk to any benchmark—without rewriting everyone’s code—and a fair tournament where many kinds of tasks are compared the same way.
    5. Real stakes: In real life, an assistant should switch from booking flights to editing files to researching facts, without a human rewriting the rules each time. Companies want to deploy agents that don’t require a custom engineering project for every new job.
  • Why it matters: Without a fair, shared test, we can’t tell if agents are truly general or just good at one carefully prepared show.

šŸž Bottom Bread (Anchor) Imagine testing runners by making one sprint on sand, another on grass, and another on ice—with different timers at each place. You’d have no idea who’s best overall. This paper builds a single fair track and a single timer, so we can finally compare runners fairly.

šŸž Top Bread (Hook) You know how schools use the same style of test to compare students fairly across different classes?

🄬 The Concept: Benchmarking frameworks

  • What it is: A benchmarking framework is a standard way to test and compare agents on tasks.
  • How it works:
    1. Pick tasks (like web browsing, customer service, coding).
    2. Define how agents talk to the task world.
    3. Score them the same way every time.
  • Why it matters: Without a shared scoring and communication method, comparisons aren’t apples-to-apples.

šŸž Bottom Bread (Anchor) It’s like a school giving the same reading test to all sixth graders, no matter their classroom—now we can compare fairly.

šŸž Top Bread (Hook) Imagine a Swiss Army knife that should work on camping, cooking, and repairs—without being rebuilt each time.

🄬 The Concept: General-purpose agents

  • What it is: General-purpose agents are AI helpers meant to handle many different tasks in different environments without special rewiring.
  • How it works:
    1. Read the task (what to do).
    2. Use tools (how to act) provided by the environment.
    3. Decide step-by-step, often by planning, searching, or calling tools.
  • Why it matters: If an agent only works after lots of custom tuning, it’s not truly general and won’t scale to real-world variety.

šŸž Bottom Bread (Anchor) Like a phone assistant that can book a flight, send a text, update a spreadsheet, and summarize emails—without a separate app for each one.

02Core Idea

šŸž Top Bread (Hook) You know how a universal charger lets you power many devices by snapping on the right tip? You don’t buy a new charger for every gadget.

🄬 The Concept: The ā€œAha!ā€ Moment

  • What it is: The key idea is a Unified Protocol—a simple, shared way for any agent and any benchmark to talk—plus Exgentic, a framework that uses adaptors to connect them and run fair evaluations, ending with a public leaderboard.
  • How it works:
    1. Unified Protocol turns every task into the same three-part recipe: task, context, actions.
    2. Adaptors translate from this recipe into the agent’s favorite style (tool-calls, MCP, Python functions) and the benchmark’s original interface.
    3. An orchestrator runs the conversation: observe → choose action → get result, until done.
    4. The same logging and scoring across everything builds a fair scoreboard.
  • Why it matters: Without a shared ā€œplug,ā€ you need custom cables for every agent-benchmark pair. With the Unified Protocol, you plug once, and things just work.

šŸž Bottom Bread (Anchor) Think of it like a multi-sport tournament that supplies standard balls, scoreboards, and rules, so basketball, soccer, and tennis can all be played and compared fairly.

Multiple Analogies

  • Airport analogy: The Unified Protocol is like a central terminal where any airline (agent) and any gate (benchmark) use the same signs. Adaptors are the jet bridges that fit any plane.
  • Language analogy: It’s like everyone agreeing to speak in simple subject-verb-object sentences. Translators (adaptors) handle accents so all speakers understand each other.
  • Power strip analogy: The protocol is a universal power strip; adaptors match prongs; Exgentic is the circuit breaker keeping everything safe, measured, and comparable.

Before vs After

  • Before: Agents and benchmarks spoke different dialects; comparisons were messy; customization everywhere.
  • After: One narrow waist connects all sides. Add a new agent or a new benchmark once; it works with the rest. Now we can measure true generalization.

Why It Works (intuition behind the math)

  • Narrow waist principle: When many producers and consumers meet at a tiny shared core, complexity shrinks from ā€œconnect everyone to everyoneā€ to ā€œconnect everyone to the core.ā€
  • Canonical task representation: Standardizing on task, context, actions removes hidden assumptions and forces fairness.
  • Controlled interfaces: Adaptors preserve each side’s natural behavior, avoiding distortions from forced single-protocol testing.
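The narrow-waist arithmetic is easy to check: with N agents and M benchmarks, pairwise gluing needs N Ɨ M custom integrations, while a shared core needs only N + M adaptors (one per side). A toy calculation using this study's 5 agents and 6 benchmarks:

```python
# Narrow-waist arithmetic: pairwise integrations vs. one adaptor per side.
# The function name is just for illustration.

def integrations(n_agents: int, m_benchmarks: int) -> tuple[int, int]:
    pairwise = n_agents * m_benchmarks      # every agent wired to every benchmark
    narrow_waist = n_agents + m_benchmarks  # every side wired once to the core
    return pairwise, narrow_waist

pairwise, narrow_waist = integrations(5, 6)  # the study's 5 agents, 6 benchmarks
assert (pairwise, narrow_waist) == (30, 11)
```

Even at this small scale the shared core nearly triples the savings, and the gap widens as either side grows.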

Building Blocks (explained simply)

šŸž Top Bread (Hook) Imagine a recipe card you always fill out the same way, no matter the dish.

🄬 The Concept: Canonical Task Representation (task, context, actions)

  • What it is: A fixed three-part description every benchmark provides the agent.
  • How it works:
    1. Task: what to do (the goal in words).
    2. Context: what you’re allowed to know (like policies or rules).
    3. Actions: buttons you can press (tools with parameters and outputs).
  • Why it matters: Without this, agents might guess hidden rules or miss allowed actions, making tests unfair.

šŸž Bottom Bread (Anchor) For airline customer service, task = ā€œhelp the user within policy,ā€ context = the policy text, actions = book reservation, cancel, message, etc.

šŸž Top Bread (Hook) You know how some friends prefer texting, others voice notes, and others emails? You still send the same message.

🄬 The Concept: Exgentic framework

  • What it is: The system that runs everything: spins up sessions, feeds observations to the agent, executes actions in the benchmark, logs, and scores.
  • How it works:
    1. Start a session with task, context, actions.
    2. Loop: benchmark sends observation → agent picks an action → benchmark runs it → returns new observation.
    3. Stop when finished or limits are reached; save results and costs.
  • Why it matters: Without a consistent referee, results wouldn’t be comparable or reproducible.

šŸž Bottom Bread (Anchor) Like a game master who keeps the clock, tracks moves, and posts the final score for every match in a league.

šŸž Top Bread (Hook) Think of a scoreboard at a tournament that everyone can see.

🄬 The Concept: Open General Agent Leaderboard

  • What it is: A public ranking of agents across many environments, all tested the same way.
  • How it works:
    1. Run lots of tasks from different domains.
    2. Use the same metrics: success rate, cost per task, steps.
    3. Publish results so people can compare and improve.
  • Why it matters: It shines a light on true generalization instead of one-off tricks.

šŸž Bottom Bread (Anchor) It’s like a season-long standings table showing which teams perform well across home, away, and different weather.

03Methodology

High-Level Recipe: Input → [Adapt Benchmarks] → [Adapt Agents] → [Run Orchestrator Loop] → Output (scores, costs, logs)

Step 1: Adapt Benchmarks to the Unified Protocol

  • What happens: Each benchmark is converted into the shared ā€œtask, context, actionsā€ format.
  • Why this step exists: Benchmarks come with hidden assumptions (like how to submit a solution). If we don’t make them explicit, agents won’t know the proper moves.
  • Example with data:
    • SWE-Bench Verified: The reference agent clones repos in a ready bash environment and says ā€œcomplete and submitā€ to finish. So Exgentic exposes actions: bash(command: str) and submit_patch(summary: str). The task text clearly states: all edits must be through bash; submission captures staged diffs.
    • Ļ„-Bench (Airline/Retail/Telecom): Expose a message action for chatting with the simulated user and tool actions like cancel_reservation(reservation_id) or search_direct_flight(...). The task says ā€œfollow the given policy,ā€ and the policy lives in context.
    • BrowseComp+: Expose search(query), get_document(id), submit(answer, explanation, confidence). The task warns: don’t chat; only finish with submit.
    • AppWorld: Keep the native interpreter approach and wrap its many APIs (468 tools) as actions; the task and context explain app rules and credentials flow.

Step 2: Adapt Agents to the Unified Protocol

  • What happens: Create agent-side adaptors that turn protocol ā€œactionsā€ into the agent’s native style: Python functions (Smolagent), tool calls (ReAct), or MCP tools (Claude Code, OpenAI Solo).
  • Why this step exists: Agents expect different message formats; without adaptors, they can’t ā€œpressā€ the benchmark’s buttons naturally.
  • Example with data:
    • Tool-calling agents (ReAct): The adaptor maps actions to OpenAI-style tools and converts special actions (like message) accordingly.
    • MCP-based agents (OpenAI Solo, Claude Code): The adaptor exposes benchmark actions as MCP tools and routes observations back.
    • Code-generating agents (Smolagent): Each action becomes a Python function; runtime errors serve as feedback for schema mistakes.
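To make the tool-calling case concrete, here is a hedged sketch of how an adaptor might turn one protocol action into an OpenAI-style tool definition. The dict layout follows the public function-calling format; the helper name and action fields are made up for illustration:

```python
# Illustrative adaptor step: one protocol action -> one OpenAI-style tool.
# The JSON-schema layout mirrors the public function-calling format; the
# helper name and the example action are assumptions, not the paper's code.

def action_to_tool(name: str, params: dict[str, str], description: str = "") -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in params.items()},
                "required": list(params),
            },
        },
    }

tool = action_to_tool(
    "cancel_reservation", {"reservation_id": "string"},
    "Cancel an existing reservation.")
assert tool["function"]["parameters"]["required"] == ["reservation_id"]
```

An MCP adaptor would do the analogous translation into MCP tool descriptors, and a code-generating adaptor would emit a Python function with the same name and parameters.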

Step 3: Orchestrator Loop

  • What happens: Exgentic spins up a session, sends the first observation, waits for the agent’s action, runs it in the benchmark, returns the new observation, and repeats. It stops on finish, no-action, or safety limits.
  • Why this step exists: Keeps agents and benchmarks in sync, ensures isolation, parallelism, caching, and standard logs.
  • Example with data:
    • In Ļ„-Bench, the environment provides the user’s next message; the agent chooses to reply (message) or use a tool (e.g., search_flight). The loop cycles until the agent submits the final answer or hits step limits.
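The loop described above fits in a few lines. The agent/benchmark stand-ins and method names here are toys for illustration, not Exgentic's real interfaces:

```python
# Minimal sketch of the observe -> act -> execute loop. All class and
# method names are stand-ins, not the framework's actual API.

def run_session(agent, benchmark, max_steps=100):
    """Run one session until the agent stops, the task ends, or limits hit."""
    log = []
    observation = benchmark.reset()              # first observation
    for _ in range(max_steps):
        action = agent.act(observation)          # agent picks an action
        if action is None:                       # no-action -> stop
            break
        observation = benchmark.execute(action)  # benchmark runs it
        log.append((action, observation))
        if benchmark.done:                       # task finished
            break
    return log

# Tiny demo: a toy benchmark that finishes after two actions.
class ToyBenchmark:
    def __init__(self):
        self.done, self._steps = False, 0
    def reset(self):
        return "obs-0"
    def execute(self, action):
        self._steps += 1
        self.done = self._steps >= 2
        return f"obs-{self._steps}"

class ToyAgent:
    def act(self, observation):
        return "noop"

log = run_session(ToyAgent(), ToyBenchmark())
assert len(log) == 2 and log[-1][1] == "obs-2"
```

In the real framework this loop also handles isolation, parallelism, caching, and standardized logging, per the step description above.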

Secret Sauce: The Narrow Waist + Non-Intrusive Adaptors

  • Instead of editing third-party code, Exgentic uses outside adaptors that translate protocols and synchronize turns. Agents and benchmarks run in separate processes as-is, preserving their original behavior. This keeps integration repeatable and faithful.

Key Components that Help Agents (explained simply)

šŸž Top Bread (Hook) Imagine a giant toolbox with 468 tools—it’s hard to pick the right one quickly.

🄬 The Concept: Tool shortlisting

  • What it is: A filter that picks a small, likely-useful subset of tools before the agent decides.
  • How it works:
    1. Read the current goal and observation.
    2. Score tools for relevance.
    3. Keep the top few so the model isn’t overwhelmed.
  • Why it matters: Some models can’t even load hundreds of tools. Shortlisting makes impossible tasks possible and speeds everything up.

šŸž Bottom Bread (Anchor) In AppWorld with 468 tools, GPT 5.2 fails without shortlisting but becomes usable when we narrow the tool list.

šŸž Top Bread (Hook) You know spell-check that underlines mistakes so you can fix them right away?

🄬 The Concept: Schema guard

  • What it is: A checker that spots when an action call has the wrong shape (like a missing parameter) and lets the agent correct itself.
  • How it works:
    1. Validate the action against its schema before execution.
    2. If invalid, raise an error back to the agent.
    3. The agent tries again with a corrected call.
  • Why it matters: Prevents wasted steps and weird failures from silly formatting mistakes.

šŸž Bottom Bread (Anchor) If an agent tries cancelrl_rlr​eservation() without reservationin_ini​d, the guard flags it so the agent can add the missing field.

Metrics

  • Success Rate: Did the agent accomplish the task under the benchmark’s original rules?
  • Cost per Task: How much did model calls cost on average?
  • Average Steps: How many back-and-forths did it take?
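All three metrics are straightforward averages over per-task run records. A sketch, with illustrative record fields:

```python
# Aggregating the three reported metrics from per-task records.
# The record field names are assumptions for illustration.

def summarize(runs: list[dict]) -> dict:
    n = len(runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in runs) / n,
        "avg_steps": sum(r["steps"] for r in runs) / n,
    }

runs = [
    {"success": True,  "cost_usd": 0.40, "steps": 12},
    {"success": False, "cost_usd": 0.90, "steps": 30},
]
stats = summarize(runs)
assert stats["success_rate"] == 0.5 and stats["avg_steps"] == 21.0
```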

End-to-End Example (SWE-Bench Verified)

  • Input: Task describes bug fix rules; context may be empty; actions include bash and submit_patch.
  • Steps:
    1. Agent runs bash to inspect files and tests.
    2. Agent edits code via bash commands.
    3. Agent verifies locally.
    4. Agent calls submit_patch once with a summary.
  • Output: Benchmark applies the patch to hidden tests and reports success/failure, plus the framework logs steps and cost.

04Experiments & Results

The Test: What and Why

  • They evaluated 5 agent architectures across 3 strong language models on 6 benchmarks (coding, browsing for deep research, customer service in 3 domains, and multi-app tasks), with up to 100 turns per task.
  • They measured success rate (main score), average cost per task (money), and average steps (effort). This tells us who wins, who’s efficient, and how hard the tasks feel.

The Competition: Who and How

  • Agents: ReAct, ReAct Short (with tool shortlisting), Smolagent (code-generating), OpenAI Solo (MCP-based), Claude Code (MCP and coding features).
  • Models: Claude Opus 4.5, Gemini 3, GPT 5.2.
  • Benchmarks: BrowseComp+ (deep research), SWE-Bench Verified (software bugs), AppWorld (multi-app tasks), and Ļ„-Bench (Airline, Retail, Telecom customer service).

The Scoreboard (with context)

  • Overall model standings: Claude Opus 4.5 ā‰ˆ 0.66 average success, Gemini 3 ā‰ˆ 0.60, GPT 5.2 ā‰ˆ 0.40. This is like Claude getting a solid A-, Gemini a B+, and GPT 5.2 a C.
  • Best per benchmark (examples):
    • Ļ„-Bench-Telecom: OpenAI Solo + Gemini 3 hit 0.89 (very high).
    • SWE-Bench Verified: OpenAI Solo + Claude Opus 4.5 hit 0.81, matching or exceeding top domain-specific results on the sampled set.
    • AppWorld: Smolagent + Claude Opus 4.5 scored 0.70 (near the published top of 0.73).
  • No single agent dominates all domains: OpenAI Solo often wins where structured APIs and coding matter; Smolagent shines in multi-application and web-like tasks. This shows different scaffolds fit different task shapes.

Surprising Findings šŸž Top Bread (Hook) Think of picking a backpack: a bigger one carries more, but it may be heavier and costlier.

🄬 The Concept: Cost-performance tradeoff (Pareto frontier)

  • What it is: A curve showing the best possible tradeoffs between performance and cost—you can’t improve one without hurting the other past this line.
  • How it works:
    1. Plot each agent-model by success (up) and cost (right).
    2. The frontier is the set with no better alternative in both dimensions.
    3. Choose a point based on your budget and quality needs.
  • Why it matters: It guides practical choices; sometimes ā€œgood enough and cheapā€ beats ā€œbest but pricey.ā€

šŸž Bottom Bread (Anchor) GPT 5.2 setups were the most cost-efficient, while Claude Opus 4.5 achieved top raw performance at 3–33Ɨhigher33Ɨ higher33Ɨhigher cost.

What Drives Performance Most?

  • Variance decomposition shows model choice explains about 28.2% of the score differences; agent architecture only 0.6%. Translation: the model matters far more than the scaffold, though the pairing still matters.

Model Stability Across Architectures

šŸž Top Bread (Hook) Imagine a bike that rides smoothly on many roads—you can focus on the route instead of fixing the bike.

🄬 The Concept: Model stability

  • What it is: How consistently a model performs when you swap different agent architectures.
  • How it works:
    1. Run many agent variants with the same model.
    2. Measure spread (standard deviation) of scores.
    3. Lower spread = more stable.
  • Why it matters: Stable models let teams improve agents without constant retuning.

šŸž Bottom Bread (Anchor) Claude Opus 4.5 showed the most stability (meanā‰ˆ0mean ā‰ˆ 0meanā‰ˆ0.66, low STDā‰ˆ0STD ā‰ˆ 0STDā‰ˆ0.06), so developers can focus on the agent design with fewer surprises.

Other Patterns

  • Failures often took more steps (and thus more cost) than successes—especially in interaction-heavy tasks like AppWorld and BrowseComp+ (e.g., ReAct +110.7% steps on AppWorld failures).
  • Cross-benchmark correlations were mostly positive, driven by model strength: models that do well in one place tend to do well elsewhere, but agent rankings can reshuffle within the same model.
  • Tool shortlisting rescued GPT 5.2 in tool-rich worlds; schema guards helped top performers catch and fix invalid tool calls.

Bottom Line

  • General agents can match or beat specialized baselines on many tasks without per-environment tuning.
  • But to win, pick your model wisely, consider tool management (shortlisting), and watch costs.

05Discussion & Limitations

Limitations

  • Text-only interactions: The current framework focuses on text-based agent-environment exchanges. Visual interfaces, full web GUIs, or rich multimodal worlds still need adaptors and likely small protocol extensions.
  • Coverage and cost: Evaluations are expensive; the study used a subset of agents and models, meaning some strong open-source options weren’t tested here.
  • Protocol scope: The Unified Protocol fits many existing patterns, but new or unusual environments may require expanding the action or context design.

Required Resources

  • Access to multiple LLMs (API keys), compute for running isolated sessions (Docker/containers), and budget for inference costs (reported total ā‰ˆ $22K for the study).
  • Engineering time to write lightweight adaptors for new agents/benchmarks, though base adaptors reduce effort.

When NOT to Use

  • Purely visual or highly interactive web tasks where the core interaction isn’t textual yet (until proper adaptors exist).
  • Ultra-low-latency or ultra-high-volume deployments where even efficient configurations are still too costly without further pruning or caching.
  • Situations demanding custom, domain-specific tricks you explicitly want to use for maximum score on one benchmark (that would break the notion of generality).

Open Questions

  • Can we reduce evaluation cost with smart sampling, early stopping, or outcome prediction—while keeping results reliable?
  • How do we grow from text to multimodal and embodied settings without breaking the narrow waist?
  • Which agent components (planning, memory, critique) most improve cross-domain robustness when model quality is held constant?
  • How should safety, policy compliance, and alignment be measured consistently across domains in a general-agent leaderboard?
  • Can we design training or fine-tuning methods that improve true cross-benchmark generalization, not just single-domain scores?

06Conclusion & Future Work

Three-Sentence Summary

  • This paper introduces a Unified Protocol and the Exgentic framework to fairly evaluate general-purpose agents across many different environments, all with the same simple task recipe: task, context, actions.
  • Using these, the authors launch the first Open General Agent Leaderboard and show that general agents can match specialized ones without per-environment tuning—though model choice drives most performance differences.
  • They also reveal practical tradeoffs: cost vs accuracy, the power of tool shortlisting and schema guards, and the importance of model stability.

Main Achievement

  • Turning general-agent evaluation into a first-class, practical reality via a narrow-waist protocol, non-intrusive adaptors, and a public, reproducible leaderboard.

Future Directions

  • Extend beyond text to visual/web/multimodal tasks while preserving the narrow waist.
  • Lower evaluation costs through smarter sampling, caching, and early stopping.
  • Explore which agent components most reliably boost cross-domain robustness and stability.
  • Study safety and alignment metrics that travel well across domains.

Why Remember This

  • It’s the universal charger moment for agent evaluation: one plug that fits many worlds. With a fair field and clear scoreboard, the community can finally focus on building agents that truly generalize—useful anywhere, not just in a lab built for one trick.

Practical Applications

  • Evaluate your in-house agent across coding, browsing, and customer support with one setup to find its true strengths.
  • Choose the right model–agent pairing for your budget by comparing cost per task on the Pareto frontier.
  • Add tool shortlisting to handle environments with hundreds of tools and avoid model limits.
  • Enable reliable tool use by adding a schema guard to catch and fix invalid calls automatically.
  • Run A/B tests on planning or memory modules across multiple benchmarks to see real cross-domain impact.
  • Standardize agent evaluation in your org with a single protocol so teams can share results apples-to-apples.
  • Prototype a new benchmark by writing only one adaptor and instantly make it compatible with many agents.
  • Cut evaluation costs by caching runs and limiting steps using the framework’s orchestration controls.
  • Diagnose failure patterns by inspecting standardized trajectories and step counts across domains.
  • Track progress over time on the Open General Agent Leaderboard to guide roadmap and procurement.
#general-purpose-agents #agent-evaluation #unified-protocol #Exgentic #leaderboard #tool-shortlisting #schema-guard #MCP #SWE-Bench-Verified #Ļ„-Bench #BrowseComp+ #AppWorld #cost-performance-tradeoff #model-stability #orchestrator