BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing?
Key Summary
- BeyondSWE is a new benchmark that tests code agents on tougher, more real-life tasks than single-repo bug fixing.
- It stretches two dimensions at once: how big a change the agent must make (resolution scope) and how much outside knowledge it must bring in (knowledge scope).
- There are four task types: CrossRepo (use other repos), DomainFix (use science/domain know-how), DepMigrate (upgrade whole codebases for breaking dependency changes), and Doc2Repo (build a full repo from a spec).
- Even the best current models solve under 45% of cases on average, showing a big gap from simpler benchmarks.
- The team also built SearchSWE, which lets agents search the web while coding, like real developers do.
- Search sometimes helps (especially for domain facts and migration guides) but sometimes hurts (noise, wrong versions), so gains are inconsistent.
- The benchmark includes strict Docker environments, P2P/F2P tests, and anti-cheating rules to keep results fair and reproducible.
- Doc2Repo shows that models can pass many tests but still rarely produce fully correct, coherent repositories.
- Analysis suggests that doing more searches is not the same as doing better searches; the quality and timing of searches matter most.
- BeyondSWE and SearchSWE give researchers realistic challenges and tools to build the next generation of reliable code agents.
Why This Research Matters
Real software work often spans multiple repos, depends on evolving libraries, and requires specialized domain knowledge. BeyondSWE measures these realities directly, so success here maps better to on-the-job ability than single-repo bug fixes alone. It reveals that current code agents still struggle with end-to-end coherence, version pinning, and the careful use of external information. SearchSWE shows that simply giving an agent the internet is not enough—the agent must know when, where, and how to trust what it finds. By exposing these gaps with clean, reproducible tests, this work provides a roadmap for training and evaluating more dependable coding assistants. As these agents improve on BeyondSWE, they will be more ready to help teams maintain, migrate, and build real systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how fixing a bike is very different from building a whole bike from scratch, and both are different from swapping in a new kind of wheel that doesn’t fit quite the same? Programmers face similar situations with software all the time.
🥬 Filling (The Actual Concept)
- What it is: This paper studies whether today’s code AIs can handle real-world software jobs that go far beyond tiny, single-file bug fixes.
- How it works (step by step):
- Look at the world before: popular tests (benchmarks) mostly asked AIs to do small, local fixes in one code repository.
- Spot the problem: real developers use knowledge from many places, change many files across a repo, adapt to breaking library updates, and sometimes build whole projects from a written plan.
- See failed attempts: newer benchmarks added more repos or harder issues, but still mostly stayed inside a single repo and a small patch.
- Name the missing piece: we need tests that require both bigger changes (resolution scope) and outside knowledge (knowledge scope).
- Propose the solution: create BeyondSWE, a benchmark with four realistic settings that combine these two axes, and build SearchSWE to let AIs search the web while coding.
- Why it matters: If we only test small, local bug fixes, we will think AIs are ready for work when they’re not. Real jobs demand cross-repo reasoning, domain know-how, codebase-wide migrations, and building full systems from specs.
🍞 Bottom Bread (Anchor) Imagine hiring a helper: if you only check whether they can tighten a screw, you won’t know whether they can replace a wheel, find a part from another store, or build the bike from a blueprint. BeyondSWE tests all of that.
🍞 Top Bread (Hook) You know how a school science fair project needs not just doing an experiment, but also reading sources, following rules, and organizing your work? Software tasks are like that too.
🥬 Filling (The Actual Concept)
- What it is: The “world before” had benchmarks like SWE-bench and variants that were super useful but mostly focused on within-one-repo fixes.
- How it works:
- These benchmarks choose a bug report in a single repo, set up tests, and ask the agent to make a fix so tests pass.
- They prevent cheating (for example, the agent cannot simply edit the tests) and judge correctness with pass-to-pass (P2P) and fail-to-pass (F2P) tests.
- They became the go-to yardstick and pushed the field forward.
- Why it matters: As models improved on these, many people assumed agents were ready for much wider, real-world work—but that isn’t guaranteed.
🍞 Bottom Bread (Anchor) It’s like getting great at spelling quizzes and then being asked to write a full story with characters, chapters, and facts. Passing the quiz doesn’t prove you can write the whole book.
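The P2P/F2P verdict these benchmarks use can be sketched in a few lines of Python. The function name and dict-based result format below are illustrative, not the benchmark's actual harness:

```python
# Sketch of the resolved-instance verdict used by SWE-bench-style harnesses.
# The function name and result format are illustrative assumptions.

def is_resolved(p2p: dict[str, bool], f2p: dict[str, bool]) -> bool:
    """Resolved means: every pass-to-pass test still passes (no regressions)
    AND every fail-to-pass test now passes (the reported bug is fixed)."""
    return all(p2p.values()) and all(f2p.values())

# A patch that fixes the bug but breaks existing behavior does not count:
is_resolved({"test_existing": False}, {"test_bug": True})  # -> False
```

Requiring both sets to pass is what makes the metric strict: a fix that trades one bug for another scores zero.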
🍞 Top Bread (Hook) Imagine cooking dinner: sometimes you follow a family recipe (local knowledge), sometimes you Google a new cooking method (external knowledge), and sometimes you re-plan the whole meal for a friend with allergies (big refactor).
🥬 Filling (The Actual Concept)
- What it is: The paper introduces two key axes to better match real developer work: resolution scope (how big the change is) and knowledge scope (how much you must look beyond the repo).
- How it works:
- Resolution scope ranges from small function patches to whole-repo transformations and full repo creation.
- Knowledge scope ranges from using only what’s inside the repo to using external docs, other repos, or the open web.
- Combining them yields four task families that mirror real life.
- Why it matters: Without these axes, we miss critical skills like cross-repo borrowing, domain reasoning, and coordinated codebase migrations.
🍞 Bottom Bread (Anchor) Think of homework: sometimes you just fix a sentence (small patch), sometimes you rewrite the whole essay (repo-wide change), sometimes you write from a prompt (build from spec), and often you must use the internet or a textbook (external knowledge).
02 Core Idea
🍞 Top Bread (Hook) You know how a great soccer tryout tests dribbling, passing, shooting, teamwork, and stamina—not just one skill? A fair coding test should sample all the skills real programmers use.
🥬 Filling (The Actual Concept)
- What it is: The key insight is to evaluate code agents along two axes—resolution scope and knowledge scope—to create four realistic tasks, then study how adding search (SearchSWE) helps or hurts.
- How it works (the “Aha!” moment):
- Expand resolution scope to include not only tiny fixes but also repo-wide migrations and full repo generation.
- Expand knowledge scope so tasks sometimes need info from external repos, scientific domains, or the web.
- Build a clean, reproducible environment with strict tests and anti-cheating safeguards.
- Add SearchSWE so agents can search like humans do, then measure whether that integration actually improves results.
- Why it matters: If agents only do well locally, they will fail in real workplaces where tasks cross boundaries, depend on changing libraries, and require domain expertise.
🍞 Bottom Bread (Anchor) It’s like testing a chef by asking them to fry an egg, cater a party, adapt to gluten-free recipes, and cook from a written menu—then checking if using a cookbook (search) truly helps.
🍞 Top Bread (Hook) Imagine two maps: one shows how big a job is (tiny fix to giant overhaul), the other shows where you can look for help (only what’s in your backpack vs. the whole library).
🥬 Filling (The Actual Concept)
- What it is: Resolution scope and knowledge scope are the two maps BeyondSWE uses to place tasks.
- How it works:
- Resolution scope: Local function patch → repo-wide migration → full repo generation.
- Knowledge scope: Only repo → other repos and docs → domain science → open web.
- Four task types emerge: CrossRepo, DomainFix, DepMigrate, Doc2Repo.
- SearchSWE lets the agent walk from the backpack to the library when needed.
- Why it matters: These axes make gaps visible that single-repo tests can hide.
🍞 Bottom Bread (Anchor) Picture a 2x2 wall chart in a classroom: left-to-right is small-to-big changes, bottom-to-top is local-to-global knowledge. Each BeyondSWE task pins a card to a different square.
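That 2x2 wall chart can be written down as a small lookup table. The axis values below are paraphrased descriptions of each task family's emphasis, not the paper's official labels:

```python
# Illustrative 2x2 placement of the four BeyondSWE task families.
# Axis values are paraphrases, not the paper's official axis labels.
TASK_GRID = {
    # (resolution scope, knowledge scope)
    ("local patch", "other repositories"): "CrossRepo",
    ("local patch", "scientific domain"): "DomainFix",
    ("repo-wide", "library migration docs"): "DepMigrate",
    ("full repo", "written specification"): "Doc2Repo",
}

def task_for(resolution: str, knowledge: str) -> str:
    """Look up which task family sits at a given cell of the 2x2 chart."""
    return TASK_GRID[(resolution, knowledge)]
```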
🍞 Top Bread (Hook) You know how asking the internet can help with homework—but sometimes it shows the wrong version of a formula or a confusing explanation?
🥬 Filling (The Actual Concept)
- What it is: SearchSWE is a simple, general framework that lets the agent interleave coding with web search and browsing.
- How it works:
- Agent works inside Docker (local context) to read code, run tests, and edit files.
- When stuck, it calls web search and a browser tool (global context) to fetch docs, posts, and examples.
- A blocklist prevents cheating by visiting the target repo or future commits.
- After patching, evaluation happens in a fresh container to avoid side effects.
- Why it matters: It reveals whether today’s models can combine search with coding like human developers do; results show the integration is hard and inconsistent.
🍞 Bottom Bread (Anchor) It’s like building Lego sets: you try pieces locally, peek at the instruction booklet (search) when needed, and finally show your build to a new teacher who checks it with the original rules (fresh evaluation).
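A minimal sketch of the anti-cheating blocklist idea, assuming the agent filters candidate URLs before browsing. The helper names and the blocked host/path patterns are hypothetical, not SearchSWE's real API:

```python
# Sketch of a URL blocklist check, assuming search results are filtered
# before the BrowserTool fetches them. Names and patterns are hypothetical.
from urllib.parse import urlparse

BLOCKED_HOST_PATHS = [
    ("github.com", "/example-org/target-repo"),  # hypothetical target repo
]

def is_blocked(url: str) -> bool:
    """Reject URLs pointing at the target repository (including its PRs
    and future commits), so the agent cannot look up the gold patch."""
    parsed = urlparse(url)
    return any(
        parsed.netloc == host and parsed.path.startswith(path)
        for host, path in BLOCKED_HOST_PATHS
    )

def filter_results(urls: list[str]) -> list[str]:
    """Keep only search results the agent is allowed to browse."""
    return [u for u in urls if not is_blocked(u)]
```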
03 Methodology
🍞 Top Bread (Hook) Imagine a science lab kit: you get a clear problem card, a safe lab station, and tests that say whether your experiment really works.
🥬 Filling (The Actual Concept)
- What it is: BeyondSWE’s methodology is a recipe for building realistic tasks plus a fair way to judge solutions.
- High-level overview: Input (problem statement + Docker env + tests) → Agent explores and edits code (with or without SearchSWE) → Agent proposes a patch → Fresh container replays tests to verify.
Step-by-step like a recipe:
- Task formulation
- What happens: Each instance includes (a) a problem statement (issue or spec), (b) a Docker image with dependencies, and (c) a test suite split into P2P and F2P tests.
- Why this step exists: So the agent focuses on solving the problem, not on installing compilers or guessing versions.
- Example data: In CrossRepo, an issue links to upstream repos; in Doc2Repo, the agent gets only repocument.md and an empty workspace.
- Automated environment construction
- What happens: An LLM agent inside Ubuntu builds a working Dockerfile by cloning the repo at the pre-fix commit, running tests, and installing whatever is missing (apt-get, pip, compilers). Then the command history is distilled into a minimal Dockerfile.
- Why it matters: Recreating historical environments is hard; this ensures reproducibility and saves human time.
- Example: If NumPy needs a specific BLAS or a C compiler, the agent installs them until tests run consistently.
- Strict environmental inspection
- What happens: Build the Dockerfile and run tests five times. Before the patch: P2P must pass; F2P must fail. After applying the gold patch: both must pass. Any flakiness → discard.
- Why it matters: Prevents noisy or non-deterministic cases from skewing results.
- Example: If a test randomly fails once in five runs, the instance is removed.
- Four task builders
- CrossRepo
  - What happens: Find PRs that reference external repos (about 3,000 candidates → 200 curated instances across 67 repos).
  - Why it matters: Real bugs often relate to upstream changes or patterns in sibling projects.
  - Example: A server ignoring a host argument gets fixed by adapting behavior seen in a linked project.
- DomainFix
  - What happens: With experts in 11 fields (e.g., astronomy, quantum physics), collect ~800 PRs; keep 72 issues across 12 repos after triple expert review.
  - Why it matters: Many libraries live in scientific domains where you need real subject knowledge.
  - Example: Add sparse Cholesky to speed up positive-definite sparse matrices in cvxpy.
- DepMigrate
  - What happens: Find major library upgrades (e.g., Pydantic v1→v2, NumPy 1.x→2.x), collect ~7,000 PRs, keep 178 issues across 120 repos.
  - Why it matters: Upstream breaking changes force repo-wide edits across many files.
  - Example: Replace Pydantic v1 validators with v2 field/model validators across the codebase.
- Doc2Repo
  - What happens: From recent, active, starred repos, generate masked specs (no code, no structure), keep 50 instances; tests are adapted from originals.
  - Why it matters: Building a repo from just a specification measures end-to-end design and implementation ability.
  - Example: Implement a UnifiedDocumentLoader API for converting files to Markdown with optional OCR.
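The Pydantic v1→v2 example hints at the mechanical half of a DepMigrate-style rewrite. A toy codemod sketch using the decorator rename; note that real migrations also change semantics (v2 field validators must be classmethods, `pre=True` becomes `mode="before"`), which a plain rename cannot capture:

```python
# Toy codemod illustrating only the mechanical part of a Pydantic v1 -> v2
# migration: renaming `@validator(...)` to `@field_validator(...)`.
# Real migrations also require semantic changes a rename cannot express.
import re

def rename_validators(source: str) -> str:
    """Rewrite `@validator(...)` decorators to the v2 `@field_validator(...)`
    spelling across a source file."""
    return re.sub(r"@validator\(", "@field_validator(", source)

v1_snippet = "@validator('name')\ndef check_name(cls, v):\n    return v.strip()\n"
v2_snippet = rename_validators(v1_snippet)  # now spelled @field_validator('name')
```

Repo-wide, an agent must apply this kind of rewrite consistently across many files while also handling the cases where the old and new APIs behave differently.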
- Evaluation protocol
- What happens: Extract the agent’s git diff, apply it to a fresh container, restore untouched tests, and run P2P/F2P.
- Why it matters: Prevents cache, config edits, or test tampering from faking success.
- Example: Agent can’t slip fixes into the tests; only real code changes count.
- SearchSWE framework
- What happens: Add a SearchTool (via web search) and BrowserTool (retrieve and summarize pages) to the normal code-agent tools.
- Why it matters: Lets us measure whether search helps on tasks that obviously need external info.
- Example: For DepMigrate, an agent may search “NumPy 2.0 API changes array interface” and adapt call sites accordingly.
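The strict inspection step above can be sketched as a determinism filter. Here `run_tests` is a stand-in for executing the suite inside the built container, and the tuple result format is an assumption:

```python
# Sketch of the strict environmental inspection: run the suite several times
# and keep an instance only if results are deterministic and match the
# expected P2P/F2P pattern. `run_tests` stands in for a container run.

def inspect_instance(run_tests, gold_patch_applied: bool, n_runs: int = 5) -> bool:
    """Before the gold patch: P2P must pass and F2P must fail, every run.
    After the gold patch: both must pass, every run.
    Any run-to-run variation means flakiness, so the instance is discarded."""
    runs = [run_tests() for _ in range(n_runs)]  # [(p2p_ok, f2p_ok), ...]
    if len(set(runs)) != 1:                      # any flakiness -> discard
        return False
    expected = (True, True) if gold_patch_applied else (True, False)
    return runs[0] == expected
```

Running the suite five times is what catches tests that "randomly fail once", as in the example above.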
Secret Sauce:
- Carefully pairing strong environment engineering (deterministic Docker + double-checks) with task designs that require both big changes and outside knowledge. Plus, the anti-cheating blocklist and fresh-container replay make the scores trustworthy.
🍞 Bottom Bread (Anchor) Think of it like a fair science contest: everyone gets the same sealed kit, follows the same rules, and turns in a result that’s tested in a brand-new lab setup so no one can hide shortcuts.
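The fresh-container replay described in the evaluation protocol can be sketched with the container operations injected as callables (the function names are assumptions about the flow, not a real API); the ordering is what defeats tampering:

```python
# Sketch of the fresh-container evaluation replay. Container operations are
# injected as callables so the flow is clear without requiring Docker here.

def evaluate_patch(start_fresh_container, apply_diff, restore_tests, run_suite,
                   diff: str) -> bool:
    """Replay the agent's patch in a clean environment:
    1) start a fresh container (drops the agent's caches and config edits),
    2) apply only the extracted git diff,
    3) restore the original test files (any test tampering is overwritten),
    4) run P2P and F2P; both must pass to count as resolved."""
    ctx = start_fresh_container()
    apply_diff(ctx, diff)
    restore_tests(ctx)
    p2p_ok, f2p_ok = run_suite(ctx)
    return p2p_ok and f2p_ok
```

Because tests are restored after the diff is applied, edits the agent made to test files simply never take effect.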
04 Experiments & Results
🍞 Top Bread (Hook) Imagine a marathon with four different terrains—city streets, forest trails, steep hills, and a track—so no runner can win by being good at just one thing.
🥬 Filling (The Actual Concept)
- What it is: The team evaluated many top models with and without SearchSWE across the four BeyondSWE tasks and measured how often they truly solved the problem.
- The test (what they measured and why):
- For CrossRepo, DomainFix, and DepMigrate: Resolved Rate (% of instances where both P2P and F2P pass).
- For Doc2Repo: Pass Rate (average percent of tests passed) and (Almost) Correct Count (instances where all, or nearly all, tests pass).
- Why: These metrics reflect true fixes, protect old behaviors, and capture how close full repo builds get to perfect.
- The competition (who/what compared):
- OpenHands: a strong baseline agent framework with local tools only.
- SearchSWE: the same plus web search/browse.
- Multiple top and code-focused models (e.g., Gemini 3 Pro, GPT-5.2, DeepSeek-V3.2, Kimi-K2, GLM-4.7, MiniMax-M2.1, Seed-Coder, Qwen variants).
- The scoreboard (results with context):
- Big picture: Even the best averages stay under about 45% resolved. That’s like getting a high C/low B when simpler tests looked like A-level work.
- DomainFix is consistently the hardest: rarely above ~36% across models—domain science remains a major stumbling block.
- Doc2Repo pass rates hover around ~45–55%, yet the number of fully correct repos is tiny (often 0–2). Models get many parts right but struggle with coherent, end-to-end systems.
- CrossRepo and DepMigrate do better than DomainFix but still below SWE-bench-style performances, showing cross-repo reasoning and large refactors are unsolved.
- SearchSWE helps sometimes but not always. For example, Gemini 3 Pro gains notably on DomainFix (+7.5%) and DepMigrate (+2.3%), likely thanks to good docs/migration guides. But some models drop on certain tasks (e.g., Seed-Coder declines on several when search is enabled).
- Surprising findings:
- More searches do not mean better outcomes: a model that searches rarely but precisely (e.g., ~1 call per instance) can beat another that searches a lot (4–5 calls) but drowns in noise.
- Search can hurt: wrong-version docs, ambiguous forum advice, and off-topic pages can mislead the agent into making incorrect changes.
- Code-specialized models sometimes integrate external info worse than generalists, possibly because they’re tuned to rely heavily on local-repo reasoning.
🍞 Bottom Bread (Anchor) It’s like students taking a multi-subject exam: some study guides help a ton for science but confuse them for history. The best students know exactly when to peek at notes and when to trust their own reasoning—and they still find the hardest sections tough.
05 Discussion & Limitations
🍞 Top Bread (Hook) Think about riding a bike with training wheels: they help you balance, but sometimes they slow you down or make turning weird. Tools can help or get in the way.
🥬 Filling (The Actual Concept)
- Limitations (what this can’t do):
- Models still struggle with deep domain reasoning; even with search, true scientific understanding is hard.
- Doc2Repo shows that stitching parts into a coherent architecture remains a major weakness.
- Search relevance and version matching are fragile; the web often surfaces latest docs, not the pinned legacy versions in the Docker image.
- Benchmark scope is Python-focused; other ecosystems may reveal new patterns.
- Required resources (what you need to use this):
- GPU/CPU to run LLMs and agent loops.
- Docker to reproduce environments reliably.
- Web search API access (for SearchSWE) and content extraction tools.
- Enough time budget per instance (up to hundreds of tool calls per trial).
- When NOT to use (situations where this fails):
- If you cannot allow internet access, tasks depending on external knowledge will not be solvable by design.
- If your codebase relies on highly proprietary tools with no public docs, search won’t help much.
- If you require zero nondeterminism but your dependencies are naturally flaky: BeyondSWE filters out a lot of flakiness, but some real-world repos are simply noisy.
- Open questions (what we still don’t know):
- How to train models to reliably detect and correct version mismatches between local env and web docs?
- How to estimate search trustworthiness and filter semantic drift before it contaminates the code plan?
- What curricula or feedback signals best teach “when to search” vs. “reason locally”?
- How to scaffold end-to-end system design so Doc2Repo solutions are both modular and globally coherent?
🍞 Bottom Bread (Anchor) It’s like learning when to ask a teacher for help versus solving a math problem yourself, and how to spot a wrong hint on the board. We know asking can help—but learning to ask the right way is the real skill.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Imagine testing a builder not just on hammering a nail, but also on renovating a whole house and constructing a new one from a blueprint—sometimes while checking manuals.
🥬 Filling (The Actual Concept)
- 3-Sentence Summary: BeyondSWE is a benchmark that tests code agents on bigger, more realistic tasks along two axes: how large the fix is and how much outside knowledge is needed. Results show a clear capability gap: even top models stay under about 45% success and struggle to integrate web search consistently. SearchSWE lets agents search like developers, but integrating that information with code reasoning is still hard.
- Main Achievement: Defining and releasing a rigorous, realistic evaluation suite plus a clean search-enabled framework that expose—and help study—the disconnect between searching and coding.
- Future Directions: Train models to detect version mismatches, assess source reliability, and decide “when to search”; design better workflows for end-to-end repository generation; and broaden language/ecosystem coverage.
- Why Remember This: It re-centers evaluation on what real programmers do daily—use outside knowledge, refactor whole codebases, and build from specs—so progress here will mean code agents are truly ready for real work.
🍞 Bottom Bread (Anchor) Just like a music exam that tests scales, duets, improvisation, and full performances, BeyondSWE checks all the key skills a real software “musician” needs—and shows where today’s players still need practice.
Practical Applications
- Assess a code agent’s readiness for real projects by running it on CrossRepo, DomainFix, DepMigrate, and Doc2Repo tasks.
- Use SearchSWE to prototype search-integrated workflows and measure when web lookup actually helps your agent.
- Train agents to detect and reconcile version mismatches by comparing local package versions against web docs before applying changes.
- Design migration playbooks (e.g., NumPy 1→2, Pydantic v1→v2) and test whether agents can follow them repo-wide.
- Benchmark domain-augmented models (e.g., with science-specific finetuning) on DomainFix to validate real expertise gains.
- Evaluate end-to-end system design by tasking agents with Doc2Repo and tracking architecture coherence and test coverage.
- Implement anti-cheating safeguards (blocklists, fresh-container evaluation) in your own internal agent evaluations.
- Analyze agent search logs to improve search timing and precision (fewer, higher-quality calls rather than more calls).
- Use BeyondSWE instances to create curricula that gradually scale from local fixes to full repo builds.
- Adopt deterministic Docker environments and repeated test runs to eliminate flaky evaluation results.
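The version-mismatch check suggested in the list above could start as simply as comparing major versions before trusting retrieved documentation. A sketch under that assumption (the function name is illustrative):

```python
# Sketch of a guard an agent could run before applying advice from a web
# page: flag documentation whose major version differs from the locally
# pinned one (e.g. NumPy 2.x docs retrieved while the env pins NumPy 1.x).

def major_mismatch(installed_version: str, doc_version: str) -> bool:
    """True when the doc targets a different major version than the one
    installed, so its API advice may not apply to this environment."""
    return installed_version.split(".")[0] != doc_version.split(".")[0]

# Example: the env pins numpy 1.26.4 but search returned 2.0 migration docs.
major_mismatch("1.26.4", "2.0")  # -> True: don't apply the advice blindly
```

A production version would also handle pre-release tags and ranges, but even this crude check catches the wrong-version-docs failure mode the experiments highlight.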