LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth
Key Summary
- LOCA-bench is a test that challenges AI agents to work correctly as their to-do list and background information grow very, very long.
- It controls how much extra information the agent must handle while keeping the actual task the same, so we can see how growing context alone affects performance.
- As context gets longer, most models make more mistakes, a problem called context rot, even when their official context windows are huge.
- LOCA-bench uses realistic tools (like email, spreadsheets, and databases) through mock servers so tasks are reproducible and results are easy to verify.
- The benchmark shows that better context engineering strategies, especially programmatic tool calling, can rescue a lot of lost accuracy.
- Frontier models (like GPT-5.2-Medium and Claude-4.5-Opus) handle long contexts better than open-source ones, and the gap widens with length.
- Agents often fail by not exploring enough pages, forgetting instructions, mixing up facts later, or skipping evidence from other sources.
- LOCA-bench is open-source, scalable to potentially infinite context sizes, and decoupled from specific tools or scaffolds so others can extend it.
- It provides clear, binary scoring with rule-based checks, plus efficiency metrics like trajectory length and number of tool calls.
- These findings help developers design smarter agent scaffolds so AI stays reliable on long, real-world jobs.
Why This Research Matters
Real-life AI helpers don’t just answer one question; they work for hours, reading emails, checking databases, and filling spreadsheets as new info appears. LOCA-bench shows how easily accuracy slips when that information grows, and which fixes actually help. This matters for businesses that rely on agents to handle inventory, invoices, research, and schedules without missing steps. The benchmark’s controlled growth and verifiable scoring make results trustworthy and comparable across models. Its open-source toolkit also gives developers practical strategies—like programmatic tool calling—to keep agents reliable. In short, LOCA-bench helps turn impressive demos into dependable day-to-day performance.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how cleaning your room is easy when there are just a few toys out, but gets overwhelming when every drawer and shelf is stuffed? Even if you have big bins, finding the right toy gets harder as the mess grows.
🥬 Filling (The Actual Concept)
- What it is: This paper studies how AI “helpers” (language agents) behave when the amount of information they must remember and use keeps growing during a task.
- How it works: In real jobs like checking emails, filling spreadsheets, and querying databases, an AI gathers new info over time. That info piles up in its context window (the text it can see at once). LOCA-bench increases that pile in a controlled way, without changing what the task asks for, to see how well the AI stays accurate.
- Why it matters: Without a careful test like this, we can’t tell if an AI fails because the task is hard or just because the context got too long and messy.
🍞 Bottom Bread (Anchor) Imagine a homework helper who first reads one page of your notes and does great, but starts forgetting steps when you hand them a giant binder. LOCA-bench checks exactly that.
🍞 Top Bread (Hook) Imagine you’re reading a long comic book. At first, you remember who each character is, but 200 pages later, you start mixing them up.
🥬 Filling (The Actual Concept: Context Rot)
- What it is: Context rot is when an AI’s reliability drops as you add more tokens (text) to its input, even if the task itself hasn’t changed.
- How it works: As the conversation or tool outputs grow, the model treats some parts less clearly, forgets earlier rules, or mixes up details when reasoning or coding.
- Why it matters: If we ignore context rot, we’ll think a big context window solves everything, but the AI may still get confused in long jobs.
🍞 Bottom Bread (Anchor) When asked to build an exam schedule from emails and announcements, the AI starts fine but later forgets to include email info because the context got long and cluttered.
🍞 Top Bread (Hook) Picture a city map. A tiny map is easy to scan, but a super-detailed city atlas can be overwhelming unless you know what to focus on.
🥬 Filling (The Actual Concept: Environment Description Length)
- What it is: Environment Description Length is the number of tokens needed to represent all the info an agent would need to read from tools for a task.
- How it works: LOCA-bench runs scripted tool calls (like list all emails, fetch all announcements), concatenates the outputs, tokenizes them, and counts tokens.
- Why it matters: It’s a clean dial to turn up complexity while keeping the task the same, so we can study how “more background info” alone affects performance.
🍞 Bottom Bread (Anchor) If the task is “collect all final exams,” increasing the number of courses and messages increases the Environment Description Length without changing the goal.
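The EDL measurement above can be sketched in a few lines of Python. The two scripted tool calls and the whitespace tokenizer below are illustrative stand-ins, not the benchmark's actual tools or tokenizer:

```python
# Sketch of Environment Description Length (EDL) measurement.
# The two scripted tool calls and the whitespace tokenizer are
# illustrative stand-ins, not the benchmark's actual implementation.

def list_all_emails():
    # Hypothetical scripted tool call returning its raw text output.
    return "From: prof@u.edu\nSubject: Final exam moved to June 3\n"

def list_all_announcements():
    return "CS101: Final exam on June 1, 9:00 in Hall A\n"

def measure_edl(scripted_calls, tokenize):
    """Concatenate all scripted tool outputs and count tokens."""
    combined = "".join(call() for call in scripted_calls)
    return len(tokenize(combined))

# Crude whitespace tokenizer as a placeholder for a model tokenizer.
edl = measure_edl([list_all_emails, list_all_announcements], str.split)
print(edl)
```

The key property is that `measure_edl` never looks at the task prompt: the dial measures only how much the environment would make an agent read.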
🍞 Top Bread (Hook) Think of a super helper who reads your emails, checks your calendar, and fills your spreadsheet like a smart assistant.
🥬 Filling (The Actual Concept: Long-Context Agents)
- What it is: Long-context agents are AIs that act step by step using tools while tracking a lot of information over time.
- How it works: They explore (call tools), gather results, remember instructions, reason across sources, and take the next action.
- Why it matters: Real tasks grow their context during execution. An agent must stay organized and accurate as it learns more.
🍞 Bottom Bread (Anchor) An agent that paginates through a product catalog, filters items, writes a CSV, and emails subscribers is a long-context agent in action.
🍞 Top Bread (Hook) You know how cooks keep their counters tidy so they don’t spill salt into the sugar?
🥬 Filling (The Actual Concept: Context Management Strategies)
- What it is: Ways to keep the AI’s working memory tidy so it doesn’t overflow or lose track.
- How it works: Edit or compress history, clear stale tool outputs, store things in memory files, or switch to code-driven tool use.
- Why it matters: Without management, the agent gets overwhelmed, explores less, and makes more mistakes.
🍞 Bottom Bread (Anchor) Clearing old tool outputs after you’ve saved the key numbers is like tossing empty boxes so the kitchen stays usable.
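The "clear stale tool outputs" move above can be sketched as a small edit over the message history. The message format and the placeholder string are assumptions for illustration, not the benchmark's scaffold API:

```python
# Sketch of one context-management edit: replace older tool outputs in
# the message history with short placeholders once their facts are saved.
# The message format here is an illustrative assumption.

def clear_stale_tool_outputs(history, keep_last=2):
    """Keep the newest tool outputs verbatim; stub out older ones."""
    tool_idxs = [i for i, m in enumerate(history) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[cleared tool output]"} if i in stale else m
        for i, m in enumerate(history)
    ]

history = [
    {"role": "tool", "content": "page 1: 100 products ..."},
    {"role": "assistant", "content": "Saved the key numbers."},
    {"role": "tool", "content": "page 2: 100 products ..."},
    {"role": "tool", "content": "page 3: 50 products ..."},
]
tidy = clear_stale_tool_outputs(history)
print(tidy[0]["content"])
```

Note that the assistant's own notes are kept intact; only bulky, already-digested tool dumps are tossed.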
🍞 Top Bread (Hook) Imagine redesigning your backpack so heavy books go in smart pockets and quick notes go in a side pouch.
🥬 Filling (The Actual Concept: Context Engineering Strategies)
- What it is: Practical techniques to reshape, shrink, or externalize context so models focus on what matters.
- How it works: Summarize history, clear thinking blocks, use memory tools, track remaining context, or run tools via programs to avoid dumping long outputs.
- Why it matters: Good engineering can rescue accuracy under extreme context growth.
🍞 Bottom Bread (Anchor) Replacing a 20-page tool dump with a short, computed result is classic context engineering.
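One of the techniques listed above, summarizing history, can be sketched as collapsing older turns into a single short message. The trivial `summarize()` below is a placeholder; a real scaffold would call a model to write the summary:

```python
# Sketch of history compaction: replace a long run of old messages with
# one short summary message. The summarize() here is a trivial
# placeholder; a real scaffold would have a model write the summary.

def summarize(messages):
    return "Summary of %d earlier steps: key facts retained." % len(messages)

def compact_history(history, keep_recent=3):
    """Collapse everything but the most recent turns into one summary."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"step {i}"} for i in range(10)]
compacted = compact_history(history)
print(len(compacted))
```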
🍞 Top Bread (Hook) Sticky notes help you remember important details without rereading the whole book.
🥬 Filling (The Actual Concept: Memory Tools)
- What it is: Tools to save and retrieve key facts across steps (files to create/read/update/delete memory).
- How it works: The agent writes summaries or facts to memory and loads them when needed instead of keeping everything in the active context.
- Why it matters: It frees up the context window and reduces forgetting.
🍞 Bottom Bread (Anchor) Save the list of “already-checked courses” in memory so the agent doesn’t re-scan them every time.
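A minimal sketch of such a file-backed memory tool, assuming a simple JSON key-value store (the class and method names are illustrative, not the benchmark's memory API):

```python
import json
import os
import tempfile

# Minimal sketch of a file-backed memory tool. The JSON key-value
# design and API names are illustrative assumptions.
class MemoryStore:
    def __init__(self, path):
        self.path = path

    def write(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def read(self, key):
        return self._load().get(key)

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

# Save already-checked courses so the agent need not re-scan them.
store = MemoryStore(os.path.join(tempfile.mkdtemp(), "memory.json"))
store.write("checked_courses", ["CS101", "MATH200"])
print(store.read("checked_courses"))
```

Because the facts live on disk rather than in the transcript, the agent can drop the raw tool outputs they came from.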
🍞 Top Bread (Hook) Sometimes it’s faster to write a quick script than to click a thousand buttons.
🥬 Filling (The Actual Concept: Programmatic Tool Calling)
- What it is: The agent writes code that orchestrates tools, consumes intermediate outputs, and returns only the final, compact result.
- How it works: The agent submits code to a special tool; that code can call other tools, handle pagination, filter data, and return a neat summary.
- Why it matters: It cuts context bloat and makes exploration systematic, improving accuracy and cost.
🍞 Bottom Bread (Anchor) Instead of pasting 10 pages of product listings, the code loops pages, picks only low-stock items, and returns a small, clean list.
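A sketch of that pattern, with a hypothetical paginated `list_products` tool standing in for a real catalog API:

```python
# Sketch of programmatic tool calling: code paginates a catalog tool,
# filters low-stock items, and returns only a compact result.
# list_products is a hypothetical paginated mock tool.

CATALOG = [{"id": i, "stock": i % 7} for i in range(250)]

def list_products(page, per_page=100):
    start = page * per_page
    return CATALOG[start:start + per_page]

def low_stock_report(threshold=2):
    """Loop all pages, keep only low-stock items, return a small list."""
    results, page = [], 0
    while True:
        batch = list_products(page)
        if not batch:
            break
        results += [p["id"] for p in batch if p["stock"] < threshold]
        page += 1
    return results  # only this compact list enters the agent's context

report = low_stock_report()
print(len(report))
```

The intermediate 250 raw rows never touch the model's context; only the filtered IDs come back.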
🍞 Top Bread (Hook) When packing for a trip, you check the space left in your suitcase so you don’t burst the zipper.
🥬 Filling (The Actual Concept: Context Awareness)
- What it is: The agent knows how much context capacity remains and adapts behavior.
- How it works: After each tool call, it gets feedback (e.g., tokens left) and chooses whether to summarize, skip, or proceed.
- Why it matters: Prevents overfilling and encourages timely, focused actions.
🍞 Bottom Bread (Anchor) If only 10% context remains, the agent summarizes older steps before fetching more emails.
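A sketch of the budget check, assuming a 200K-token window and a 10% low-water mark (both numbers are illustrative, not the benchmark's settings):

```python
# Sketch of context-budget awareness: after each tool call the agent
# sees how many tokens remain and decides whether to summarize first.
# The window size and threshold are illustrative assumptions.

def decide_next_action(used_tokens, window=200_000, low_water=0.10):
    remaining = window - used_tokens
    if remaining / window <= low_water:
        return "summarize_history"   # compact before fetching more
    return "continue_exploring"

print(decide_next_action(50_000))    # plenty of room left
print(decide_next_action(185_000))   # under 10% remaining
```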
🍞 Top Bread (Hook) Solving a mystery sometimes needs clues from many rooms, not just one.
🥬 Filling (The Actual Concept: Complex Reasoning)
- What it is: Thinking across multiple sources and steps, keeping mappings and constraints straight.
- How it works: Retrieve, align, check edge cases, follow formats, and combine facts correctly.
- Why it matters: Long tasks fail without careful multi-step reasoning.
🍞 Bottom Bread (Anchor) Linking each exam from email and Canvas to the right course ID and then sorting by start time is complex reasoning.
- The World Before: Benchmarks mostly tested single-step retrieval (find a needle in a haystack), not the dynamic, growing contexts of real agents.
- The Problem: As agents explore, their context balloons and accuracy rots.
- Failed Attempts: Big context windows and static long-text tests didn't reveal dynamic failures like under-exploration or instruction drift.
- The Gap: We needed a way to dial up context length while keeping the task the same, to isolate the effect of growth.
- Real Stakes: From coding to scheduling and data analysis, agents must stay reliable over long runs or they'll miss emails, misfile products, or ship wrong reports.
02 Core Idea
🍞 Top Bread (Hook) Imagine a treadmill that speeds up slowly while you try to read homework notes out loud. You need a fair way to see if mistakes happen because the notes got longer, not because the assignment changed.
🥬 Filling (The Actual Concept)
- Aha! Moment in one sentence: Keep the task the same but grow the environment’s information in a controlled way, so we can isolate how context length alone affects an agent’s accuracy.
- How it works: LOCA-bench builds realistic mock environments (email, Canvas, spreadsheets, BigQuery, WooCommerce), scales how much info they contain, measures Environment Description Length in tokens, and evaluates if agents still complete the exact same task.
- Why it matters: This reveals true long-context weaknesses—like under-exploration, instruction forgetting, and hallucination drift—that static benchmarks can hide.
🍞 Bottom Bread (Anchor) As we increase the number of courses and announcements, the agent must still produce the same clean exam schedule. If it starts missing emails at 128K tokens, we know context growth—not task change—caused the drop.
Multiple Analogies for the Same Idea:
- Library Analogy: Before vs. After
- Before: You gave the student one giant chapter and asked one question. They find a single fact.
- After: You let the student walk the whole library, with stacks getting taller and taller, but the question stays the same. Now we learn if they can still find and combine the right books as the library grows.
- Kitchen Analogy: Organization under load
- Before: Cook one simple dish from a short recipe.
- After: Same dish, but the pantry is now massive and cluttered. We learn if the cook can still pick the correct ingredients without getting lost.
- Hiking Analogy: Trail markers
- Before: One trail with a single lookout point.
- After: Many side trails appear, but the final lookout is unchanged. We test if the hiker follows markers and doesn’t get distracted as the trail map gets busier.
Before vs. After:
- Before LOCA-bench: Long-context tests often meant finding a snippet in one big text once.
- After LOCA-bench: Agents must plan, call tools, paginate, merge evidence, and obey formats while context grows steadily. We see whether they stay careful at every step.
Why It Works (Intuition without equations):
- Ceteris paribus for agents: By fixing the task and only scaling the environment’s description length, any accuracy change is linked to handling more context.
- Verifiability: Rule-based checks on the final environment state keep scoring objective, not subjective.
- Diversity and realism: Multiple environments and tools mimic real work (email + spreadsheet + database), stressing exploration and multi-source reasoning.
- Decoupling: Separating tasks, environments, and scaffolds lets us test both models and context strategies fairly.
Building Blocks (Explained with Sandwich):
🍞 Top Bread (Hook) You know how you can tidy your backpack or write a mini checklist to avoid carrying every paper?
🥬 Filling (The Actual Concept: Context Engineering Strategies)
- What it is: A toolkit to reduce or reshape live context.
- How it works: Clear tool outputs, compress history, save facts to memory, run code to shrink outputs, and show the remaining budget.
- Why it matters: Less clutter → better focus → higher accuracy.
🍞 Bottom Bread (Anchor) Summarize the last 20 chat turns into a 10-line plan before fetching 50 more emails.
🍞 Top Bread (Hook) Writing a short script can beat copy-pasting pages of data.
🥬 Filling (The Actual Concept: Programmatic Tool Calling)
- What it is: Code that orchestrates tools and returns only essentials.
- How it works: The code paginates, filters, aggregates, and emits a compact summary.
- Why it matters: Fewer irrelevant tokens and more reliable control flow.
🍞 Bottom Bread (Anchor) Loop through product pages, keep only low-stock items, return a tiny table.
🍞 Top Bread (Hook) Sticky notes save brain space.
🥬 Filling (The Actual Concept: Memory Tools)
- What it is: File-like memory to store and fetch key facts later.
- How it works: The agent offloads must-keep info to memory files and reloads it on demand.
- Why it matters: Prevents re-reading huge logs and reduces forgetting.
🍞 Bottom Bread (Anchor) Save a list of already-processed courses so you don't double-count exams.
🍞 Top Bread (Hook) Don't overpack your suitcase.
🥬 Filling (The Actual Concept: Context Awareness)
- What it is: The agent knows how many tokens are left.
- How it works: After each tool call, it gets a budget update and adapts.
- Why it matters: Stops overflows and triggers smart summaries at the right time.
🍞 Bottom Bread (Anchor) If the window is nearly full, summarize announcements before pulling all emails.
Put together, these blocks explain why LOCA-bench can both reveal weaknesses and point to fixes that actually help in real, long-running tasks.
03 Methodology
High-level Recipe: Input → Scalable Environment → Agent Exploration with Tools → Growing Context (Measured) → Binary Success Check (+ Efficiency Metrics)
Step-by-Step (like a cooking show):
- Input and Task Setup
- What happens: Start with a realistic task prompt (e.g., “Collect all final exams and fill exam_schedule.xlsx”).
- Why it exists: It mirrors how an agent begins real jobs—just a goal and tool access.
- Example: The prompt names the date, mentions Canvas and Email as sources, and requires a sorted Excel file.
- Build Mock Servers that Mirror Real Tools
- What happens: Create local, database-backed servers for Google Calendar, Canvas, Email, BigQuery, Google Sheets, Snowflake, and WooCommerce, matching request/response formats of the real services.
- Why it exists: Avoid flaky logins, rate limits, and changing APIs; make testing fast, fair, and repeatable.
- Example: canvas-list_announcements returns data just like the real Canvas tool schema.
- Adjustable Environment State via Templates and Generators
- What happens: Hand-written templates describe courses, exams, announcements, emails, products, invoices, etc. A generator assembles them into a concrete environment using configuration knobs (e.g., number of courses, percent of info in email vs Canvas, amount of distracting text, presence of edge cases).
- Why it exists: Lets us scale complexity while keeping the task semantics unchanged.
- Example: Set 100 courses with 60% of exams announced on Canvas and 40% by email, plus a few courses with no exams.
- Measure Environment Description Length (EDL)
- What happens: Run scripted tool calls to fetch everything an agent would need to read (e.g., all announcements + relevant emails), concatenate outputs, tokenize, and count tokens.
- Why it exists: EDL is the “complexity dial” that grows context without changing the goal.
- Example: The combined tool outputs measure 128K tokens under a standard tokenizer.
- Agents and Scaffolds
- What happens: Treat an agent as Model + Scaffold (the controller that manages context, memory, and tools). We can swap models (Claude, GPT, Gemini, DeepSeek, etc.) and scaffolds (ReAct, Claude Agent SDK, etc.).
- Why it exists: Real performance depends on both the model and its surrounding strategy.
- Example: Use ReAct for baseline runs and add context engineering options like clearing tool results or programmatic tool calling.
- Execution and Context Growth
- What happens: The agent reads the task, explores via tool calls, accumulates tool outputs in context, and writes files or updates services.
- Why it exists: This recreates dynamic, long-running workflows where evidence piles up.
- Example: The agent paginates a product catalog (100 items per page), writes discount_products.csv, and emails subscribers.
- Verifiable, Binary Evaluation
- What happens: A rule-based script checks the final environment state or outputs. If they match the ground truth (correct rows, correct schema, correct sorting), score = 1; else 0.
- Why it exists: Objective, reproducible scoring captures true task completion.
- Example: For the exam task, the checker verifies that all exams (from both Canvas announcements and emails) are recorded with correct course links and sorted by start time.
- Efficiency Metrics
- What happens: Record trajectory length (total tokens processed), number of tool calls, and tool-output token length.
- Why it exists: These show exploration effort and context pressure.
- Example: A model that retrieves more tool-output tokens often performs better—up to limits.
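The rule-based binary check from the evaluation step above can be sketched for the exam task. The expected rows and tuple schema here are illustrative, not the benchmark's real ground truth or checker:

```python
# Sketch of a rule-based binary checker for the exam-schedule task.
# The expected rows and (course, start_time) schema are illustrative.

EXPECTED = [
    ("CS101", "2025-06-01 09:00"),
    ("MATH200", "2025-06-02 13:00"),
]

def check_schedule(rows):
    """Score 1 only if every exam is present and rows are sorted by time."""
    if sorted(rows, key=lambda r: r[1]) != rows:
        return 0                       # wrong sort order
    if set(rows) != set(EXPECTED):
        return 0                       # missing or extra exams
    return 1

print(check_schedule(EXPECTED))        # correct final state
print(check_schedule(EXPECTED[:1]))    # missing an exam, so it fails
```

All-or-nothing scoring like this is deliberately strict: a schedule missing one exam from email evidence scores 0, which is exactly the failure mode the benchmark wants to surface.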
Secret Sauce (Why LOCA-bench is clever):
- Controlled growth: By scaling EDL only, we isolate context-length effects from task changes.
- Realistic diversity: Multiple environments and 280 tools mimic real office/engineering workflows, not just single-document search.
- Decoupling: Clean interfaces separate tasks, tools, environments, and scaffolds so researchers can plug in new strategies.
- Built-in context engineering: The scaffold can clear stale tool outputs, compact history, use memory tools, track context budget, and switch to programmatic tool calling.
Concrete Data Examples:
- Insufficient exploration: The agent fetches only the first 100 products, finds no matches, and stops—while more qualifying products sit on page 2+. This fails the checker because discount_products.csv is incomplete.
- Instruction following slip: The agent writes record.csv but changes column names (A_conversion_pct instead of A_conversion %), failing schema checks.
- Hallucination drift: The agent reads a vibration=1.61 value for M006 but later records 2.46 in code, producing a wrong anomaly_report.csv.
- Programmatic tool calling win: A short script loops pages, filters low-stock items, and returns a tiny list, cutting context and reducing mistakes.
Putting It All Together (Pipeline View): Task Prompt → Generate/Scale Environment (EDL set) → Agent + Scaffold choose actions → Tools return outputs → Context grows (tracked) → Optional context engineering applied → Final action(s) → Rule-based success check + metrics.
04 Experiments & Results
The Test: What and Why
- We evaluated seven strong models (Claude-4.5-Opus, GPT-5.2-Medium, Gemini-3-Flash, DeepSeek-V3.2-Thinking, MiniMax-M2.1, GLM-4.7, Kimi-K2-Thinking) on 15 seed tasks adapted from Toolathlon across seven Environment Description Lengths (8K, 16K, 32K, 64K, 96K, 128K, 256K tokens).
- We measured binary task accuracy (pass/fail), plus efficiency metrics: trajectory length, number of tool calls, and total tool-output tokens retrieved.
- Goal: Reveal how accuracy and behavior change as context grows—without changing the underlying task.
The Competition: Who/What Against
- Frontier proprietary models vs. strong open-source models, all using their maximum context windows. When inputs exceeded a model’s limit, we truncated to the most recent tokens to simulate real usage under pressure.
The Scoreboard: Results with Context
- Short context (8K): Most models are strong—like getting an A on a normal test. Claude-4.5-Opus hits 96% at 8K; several others exceed 70%.
- As EDL grows: Accuracy drops fast for every model, showing context rot in action. By 256K, many models perform like a rushed, overwhelmed student.
- Concrete example: Claude-4.5-Opus drops from 96.0% (8K) to 14.7% (256K). GPT-5.2-Medium averages 51.2% overall and stays steadier than most at high EDLs. Gemini-3-Flash averages 38.3%.
- Frontier vs open-source gap widens with length: At bigger EDLs, frontier models often score 2–3× higher than open-source peers.
Surprising Findings and Behaviors
- Plateau in exploration: As EDL passes ~96K, models’ trajectory lengths and tool-call counts often stop growing proportionally. They explore less relative to available info. This under-exploration correlates with accuracy drops.
- Retrieval volume matters: Models that retrieve more tool-output tokens generally do better, up to a point. Frontier models (e.g., GPT-5.2-Medium and Gemini-3-Flash) pull in more tool output than open-source ones.
- Context truncation side effects: Models with shorter max windows (e.g., 130K) re-call tools to recover truncated info, inflating tool-call counts and risking inconsistencies.
Context Engineering Helps (Especially Programmatic Tool Calling)
- At EDL=128K, adding strategies improved accuracy and sometimes shortened runs:
  - GPT-5.2-Medium: Base 38.7% → 49.3% with Programmatic Tool Calling (PTC), while trajectory length shrank notably (141K → ~102K effective).
  - Gemini-3-Flash: Base 21.3% → 30.7% with PTC, also reducing trajectory length (101K → 76K).
  - DeepSeek-V3.2-Thinking: PTC boosts accuracy from 10.7% → 24.0% and cuts trajectory from 191K → 103K.
- Memory tools and context awareness often help frontier models more than open-source ones, showing that scaffolds and model training interact strongly.
Scaffold Comparisons (Claude-specific)
- With Claude-4.5-Opus at 128K EDL: Our PTC raised accuracy from 34.0% → 40.0%, while Anthropic’s official PTC reached 49.3%. Surprisingly, running through the Claude Agent SDK (with features like subagents) dropped to 26.7%—likely due to misused advanced features that bloated irrelevant context.
Failure Modes Under Long Context
- Declining complex reasoning: Missing cross-source merges (e.g., skipping email evidence) leads to incomplete outputs.
- Weaker instruction following: Changing required CSV column names breaks strict schemas.
- Insufficient exploration: Stopping after first results page misses qualifying items on later pages.
- Hallucination-like drift: Correctly retrieved numbers are later miscopied in code, causing wrong final files.
Bottom Line
- LOCA-bench exposes how simply growing context can derail agents—even with massive context windows—and shows which strategies most reliably put them back on track.
05 Discussion & Limitations
Limitations
- Coverage: While 15 diverse seed tasks and 280 tools are substantial, they still can’t span every real-world scenario (e.g., highly interactive web UIs or multi-user workflows).
- Model context limits: Some models cap at 130K–260K tokens; truncation strategies can bias behavior (e.g., repeated re-calls). Results partly reflect these practical constraints.
- Strategy availability: Not all models integrate equally with memory tools or official programmatic tool calling; comparing their best possible scaffolds isn’t always apples-to-apples.
- Mock servers vs. production: Mock tools match real schemas, but real systems include latency, partial outages, and noisy data that may further stress agents.
Required Resources
- Compute and API access for evaluated models, plus the LOCA-bench toolkit.
- Ability to run mock servers locally (databases + services).
- Time to execute long trajectories (cost varies by model and strategy).
When NOT to Use
- If you only need single-step retrieval tests (e.g., one-file Q&A), simpler long-context benchmarks may suffice.
- If you can’t or won’t use tools, LOCA-bench’s agentic focus might exceed your needs.
- If your interest is generation quality without tool use (e.g., creative writing), this setup may be overkill.
Open Questions
- Training vs. inference: Which mix of training data, tool schemas, and scaffold hints most improves long-context reliability?
- Optimal memory design: What memory abstractions balance faithfulness, compactness, and ease of retrieval under heavy context pressure?
- Robustness to noise: How do agents cope when tool outputs contain conflicting or corrupted entries at large scale?
- Cost vs. accuracy trade-offs: How to best schedule summaries, memory writes, and PTC to minimize tokens while preserving correctness?
- Generalization: Will gains from context engineering on these tasks transfer to unseen domains like legal discovery or biomedical analytics?
06 Conclusion & Future Work
Three-Sentence Summary
- LOCA-bench fairly tests language agents by keeping tasks the same while growing the environment’s information, isolating the effect of long context.
- It reveals sharp accuracy drops as context grows (context rot) and highlights common failure modes like weak exploration, instruction drift, and hallucination under load.
- Smart context engineering—especially programmatic tool calling—substantially improves reliability and efficiency in long, realistic workflows.
Main Achievement
- A practical, open, extensible benchmark and toolkit that makes extreme, controllable context growth testable and actionable, providing both hard numbers and plug-in strategies that actually help.
Future Directions
- Expand tasks and environments (e.g., more enterprise systems, web UIs), deepen memory tooling, and co-design training with scaffolds so models learn to manage context as they act.
- Develop unified policies that adaptively choose between summarization, memory writes, and programmatic pipelines based on live context budget and uncertainty.
- Add robustness tests with noisy, conflicting, or drifting data to mirror messy real-world backends.
Why Remember This
- LOCA-bench changes the question from “Can your model read a long file?” to “Can your agent stay accurate as the world gets busier?”—the core challenge of real deployments. It not only diagnoses long-context problems but also points to practical fixes developers can use today.
Practical Applications
- Evaluate which model and scaffold combination stays reliable on your company's longest workflows before deploying.
- Use programmatic tool calling to paginate, filter, and aggregate tool results so only compact summaries enter context.
- Add context awareness to show remaining token budget and trigger summaries before the window overflows.
- Adopt memory tools to store key facts (IDs, mappings, checkpoints) and reload them later instead of re-reading long logs.
- Apply context editing policies (clear stale tool outputs, compress old turns) to keep the working memory tidy.
- Stress-test your agent on LOCA-bench-like tasks with growing EDL to identify when exploration plateaus.
- Instrument your scaffold to track trajectory length, tool calls, and tool-output tokens as early warning signals.
- Create rule-based evaluators for your own tasks so you can measure real completion, not just good-looking text.
- Design prompts that explicitly require pagination and schema validation to prevent early stopping and format drift.
- Compare official vs. custom implementations of PTC or memory to see which aligns best with your chosen model.
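The scaffold instrumentation suggested in the list above can be sketched as a thin wrapper around each tool. Whitespace token counting stands in for a real tokenizer, and the wrapped lambda is a hypothetical tool:

```python
# Sketch of instrumenting a scaffold to record efficiency metrics:
# number of tool calls and total tool-output tokens. Whitespace
# counting is a placeholder for a real tokenizer.

class MetricsRecorder:
    def __init__(self):
        self.tool_calls = 0
        self.tool_output_tokens = 0

    def wrap(self, tool):
        def wrapped(*args, **kwargs):
            out = tool(*args, **kwargs)
            self.tool_calls += 1
            self.tool_output_tokens += len(str(out).split())
            return out
        return wrapped

metrics = MetricsRecorder()
fetch = metrics.wrap(lambda page: f"page {page}: alpha beta gamma")
fetch(1)
fetch(2)
print(metrics.tool_calls, metrics.tool_output_tokens)
```

Watching these counters plateau while the environment keeps growing is the early warning sign of under-exploration described in the results.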