
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Intermediate
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov et al. Ā· 2/27/2026
arXiv

Key Summary

  • SWE-rebench V2 is a giant, language-agnostic robot pipeline that turns real GitHub pull requests into safe, runnable software tasks for training AI coding agents.
  • It builds Docker-based, reproducible environments, runs tests before and after the fix, and keeps only tasks with clear fail-to-pass signals.
  • An interactive setup agent figures out how to install and test each repository once, then reuses that recipe for all related tasks.
  • Quality is checked by an ensemble of LLM judges calibrated against human-labeled SWE-bench Verified data to filter unclear or risky tasks.
  • The release includes 32,000+ fully containerized tasks across 20 programming languages and 3,600+ repositories, plus 120,000+ PR-derived tasks with install/test recipes.
  • Repository-specific log parsers are auto-generated so test results can be read consistently across many ecosystems.
  • Rich diagnostic metadata flags tricky cases (like brittle tests or hidden naming rules) so researchers can filter or build curricula.
  • Ablations show interactive setup beats non-interactive scripts, and pass rates across seven models provide realistic difficulty signals.
  • Everything is designed for large-scale reinforcement learning with stable rewards, not just for evaluation.

Why This Research Matters

Real software breaks in many languages, not just one. SWE-rebench V2 delivers a huge, multilingual set of real, runnable tasks so AI coders can practice fairly and safely. By shipping pre-built containers and strong test signals, it reduces guesswork and reward noise during training. The LLM-judge filters and diagnostic labels help teams build smart curricula, starting simple and adding messy real-world cases later. This can accelerate AI agents that fix bugs, improve performance, and assist developers across ecosystems. Ultimately, it makes open-source projects healthier and developer time more productive.

Detailed Explanation

01Background & Problem Definition

šŸž Hook: Imagine you’re building a big LEGO city with friends who speak many different languages. You all need the same kind of table, lighting, and instructions so no one’s towers fall over or get mixed up.

🄬 The Concept: Software engineering (SWE) agents are AIs that read code, propose fixes, and check themselves by running tests. They learn best when they can practice thousands of real tasks, each inside a clean, repeatable ā€œplayroomā€ where the rules (tests) are fair and stable. How it works (before this paper):

  1. People collected tasks from GitHub issues and linked pull requests.
  2. They tried to build the code and run tests to see if a fix really worked.
  3. They kept the tasks with trustworthy test signals.

Why it matters: Without many reliable practice tasks, these AI agents can’t improve steadily, especially when we want them to work in many programming languages.

šŸž Anchor: Think of a soccer team that can only practice with one ball on a broken field. They’ll never get good at real games.

šŸž Hook: You know how different board games need different pieces and rules? Code ecosystems are like that—Python uses pip/pytest, Java uses Maven/JUnit, Go uses go test, and so on.

🄬 The Concept (Executable Environments): An executable environment is a self-contained setup where the project installs and tests run predictably every time. How it works:

  1. Gather the right tools and dependencies.
  2. Build or install the project.
  3. Run its full test suite and capture results.

Why it matters: Without a stable environment, tests flicker or fail for the wrong reasons, giving bad training signals.

šŸž Anchor: It’s like baking cookies in an oven you know is exactly 350°F, not one that randomly jumps between hot and cold.

šŸž Hook: Imagine putting each science experiment in its own sealed lab box so spills don’t ruin other experiments.

🄬 The Concept (Containerized Environments): Containerization (like Docker) packs the code, tools, and dependencies into a neat box so anyone can run it the same way. How it works:

  1. Start from a language-specific base image (e.g., Python, Java, Go).
  2. Add repo-specific install steps and test commands.
  3. Save the image so it runs the same way later.

Why it matters: Without containers, tiny machine differences can break builds or tests.

šŸž Anchor: Like sending a recipe with the exact ingredients and a mini-oven in the box.

šŸž Hook: When you take a math test, there are right-or-wrong answers—no guessing what the teacher meant.

🄬 The Concept (Test-based Validation): This checks correctness by running tests before the fix (should fail) and after the fix (should pass)—aka fail-to-pass (F2P). How it works:

  1. Apply test changes from the PR.
  2. Run tests (they should fail at first).
  3. Apply the code fix.
  4. Run tests again (they should pass now).

Why it matters: Without F2P, you can’t be sure a task teaches the AI the intended skill.

šŸž Anchor: It’s like a pre-test that shows the problem exists and a post-test that proves you solved it.
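The fail-to-pass check above can be sketched in a few lines. The function names and status dictionaries here are illustrative; in the real pipeline, per-test statuses come from parsed test-runner logs.

```python
# Minimal sketch of the fail-to-pass (F2P) check. Statuses are
# assumed to be "pass" / "fail" strings keyed by test ID.

def find_fail_to_pass(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return test IDs that flip from fail (test patch only) to pass (after the fix)."""
    return [
        test_id
        for test_id, status in before.items()
        if status == "fail" and after.get(test_id) == "pass"
    ]

def is_valid_task(before: dict[str, str], after: dict[str, str]) -> bool:
    # A task is kept only if at least one test flips fail -> pass.
    return len(find_fail_to_pass(before, after)) > 0

# Example: one test fails with only the test patch, then passes after the fix.
before = {"test_parse": "fail", "test_render": "pass"}
after = {"test_parse": "pass", "test_render": "pass"}
assert find_fail_to_pass(before, after) == ["test_parse"]
assert is_valid_task(before, after)
```

A task where every test already passes before the fix would be rejected by this check, since nothing flips.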

šŸž Hook: You know how a smart checklist app can guide you to pack for any trip—beach, camping, or snow—by following the same steps with different items?

🄬 The Concept (Automated Pipeline): An automated pipeline is a robot assembly line that harvests tasks, builds environments, runs tests, and filters quality without humans doing each step. How it works:

  1. Mine GitHub pull requests and issues.
  2. Synthesize install and test scripts.
  3. Execute tests before/after patches.
  4. Filter unclear tasks and add diagnostic labels.

Why it matters: Without automation, you can’t scale to tens of thousands of reliable tasks.

šŸž Anchor: Like a factory that turns raw ingredients into neatly labeled lunchboxes.

šŸž Hook: Imagine a universal phone charger that works in any country with just a small adapter.

🄬 The Concept (Language-Agnostic Systems): A language-agnostic system uses one general workflow that works across many programming languages, plus small reusable templates per language. How it works:

  1. Share the overall pipeline.
  2. Swap in language-specific base images, test runners, and parsers.
  3. Reuse what works; only customize what’s needed.

Why it matters: Without this, you’d rebuild a new pipeline for every language, which doesn’t scale.

šŸž Anchor: One travel plan, different plug adapters.

šŸž Hook: Training a puppy with treats works better than just talking to it.

🄬 The Concept (Reinforcement Learning): RL teaches AI agents by letting them try actions (like editing code) and giving rewards (like tests passing) when they succeed. How it works:

  1. The agent proposes a change.
  2. The environment runs tests.
  3. The agent gets feedback (reward) and improves.

Why it matters: Without many stable, repeatable tasks, RL can’t learn reliably.

šŸž Anchor: A video game player gets points for winning levels; more fair levels mean better training.
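The loop above can be sketched as a toy simulation. Everything here is invented for illustration: `patch_quality` stands in for how good the agent's policy is, and the "environment" is just a coin flip rather than a real test suite.

```python
import random

# Toy sketch of the RL loop: the "agent" proposes a patch, the
# "environment" runs tests, and the reward is 1 only when the tests pass.

def run_tests(patch_quality: float) -> bool:
    # Stand-in for executing the full test suite in a container;
    # higher patch quality means passing is more likely.
    return random.random() < patch_quality

def episode(patch_quality: float) -> int:
    # Binary reward: did the previously failing tests pass after the edit?
    return 1 if run_tests(patch_quality) else 0

random.seed(0)
rewards = [episode(patch_quality=0.3) for _ in range(1000)]
print(f"empirical success rate: {sum(rewards) / len(rewards):.2f}")
```

The point of the binary, test-based reward is its stability: the agent is paid only for real fail-to-pass flips, not for plausible-looking edits.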

šŸž Hook: Imagine a super-reader who can skim lots of text and judge if instructions are clear.

🄬 The Concept (Large Language Models, LLMs): LLMs are AI text experts that can read code, docs, and logs, and then write helpful text or scripts. How it works:

  1. Read inputs (like README, errors, tests).
  2. Predict useful next steps (commands, fixes, summaries).
  3. Iterate based on feedback.

Why it matters: LLMs power setup agents, judges, and parsers in this pipeline.

šŸž Anchor: A very fast librarian who can also draft instructions.

šŸž Hook: Think of three teachers grading the same essay to avoid one person’s bias.

🄬 The Concept (LLM Judges): LLM judges are models that rate whether an issue is clear enough to solve and whether tests align with the problem, often used in ensembles. How it works:

  1. Read the issue, patch, and test patch.
  2. Score clarity and alignment.
  3. Keep tasks only if all judges agree they’re clear enough.

Why it matters: Without good screening, you train on confusing or unfair tasks.

šŸž Anchor: If three referees say the field is playable, the match can start.
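The strict-consensus rule in step 3 can be sketched as below. The judge names, the 0-3 score scale, and the threshold are assumptions for illustration, not the paper's exact rubric.

```python
# Sketch of a unanimous-consensus filter: an instance survives only
# if every judge rates the issue clear enough.

CLARITY_THRESHOLD = 2  # hypothetical cutoff on an assumed 0-3 scale

def keep_instance(scores_by_judge: dict[str, int]) -> bool:
    # All judges must meet the threshold; one dissent filters the task out.
    return all(score >= CLARITY_THRESHOLD for score in scores_by_judge.values())

assert keep_instance({"judge_a": 2, "judge_b": 3, "judge_c": 2})
assert not keep_instance({"judge_a": 3, "judge_b": 2, "judge_c": 1})
```

A softer variant averages the scores instead of requiring unanimity; the experiments later in this article compare exactly these two strategies (consensus favors precision, averaging favors coverage).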

The world before SWE-rebench V2 had strong evaluation benchmarks (like SWE-bench and multilingual variants), but not enough large, language-diverse, reproducible training tasks. Manual setup didn’t scale, and many pipelines focused on Python or a few ecosystems. The gap was a single, language-agnostic, automated pipeline that could produce reliable, RL-ready tasks at scale with ready-to-run environments and detailed metadata. SWE-rebench V2 fills that gap by harvesting 32,000+ fully containerized tasks across 20 languages (plus 120,000+ PR-derived tasks with recipes), bundling pre-built images, and attaching diagnostic labels so teams can build clean curricula or stress tests. In short, it turns the messy internet of code into a giant, fair practice field for AI software agents.

02Core Idea

The ā€œaha!ā€ moment: Use one end-to-end workflow for many programming languages by separating general steps (mine PRs, build containers, run F2P tests, filter quality) from small, reusable language templates (base images, runners, parsers), then validate and label everything automatically.

Three analogies:

  1. Airport security: Same process for everyone (scan bag, check ID), but a few lanes handle special items; SWE-rebench V2 uses one pipeline with per-language adapters.
  2. Universal remote: One controller runs many TVs with small code maps; the pipeline is the remote, language templates are the maps.
  3. School cafeteria: One serving line feeds all grades; only the trays differ by size. The line is the pipeline; trays are the language bits.

Before vs. After:

  • Before: Lots of one-off setups, mostly Python-centric, manual checks, and scattered infrastructure. Training sets were small, noisy, or hard to reproduce.
  • After: A unified, automated funnel for 20 languages that ships pre-built Docker images, repository-specific test parsers, and issue clarity filtering via an ensemble of LLM judges. Plus, it adds instance-level diagnostics so you can filter brittle or ambiguous tasks.

Why it works (intuition):

  • Standardize what can be shared (the funnel); modularize what can’t (language adapters). This avoids rebuilding the world for each language.
  • Run full test suites pre- and post-fix to lock in a stable reward (fail-to-pass), which is gold for RL.
  • Let an interactive setup agent iterate on real error logs to discover correct install/test steps once per repo, then amortize that cost across many tasks from the same repo.
  • Calibrate LLM-based quality filters against human-verified data to keep only tasks with clear specs. Add diagnostics to keep the training signal steerable.

Building blocks:

  • Data mining: From 29.5M PRs to 32,079 executable, issue-linked tasks after filters.
  • Base images: Pre-built Docker images per language (e.g., different JDK versions) for reproducible execution.
  • Setup synthesis: mini-SWE-agent with a strong code LLM (e.g., Qwen3-Coder-480B) learns install/test commands from repo clues and error logs.
  • Test oracle extraction: Always run the full suite, build repository-specific log parsers, and require fail-to-pass tests.
  • Quality filtering: Three independent LLM judges (gpt-oss-120b, GLM-4.7, DeepSeek-V3.2) gatekeep underspecified issues.
  • Diagnostics: Flags like TEST_SUITE_COUPLING, IMPLICIT_NAMING, and EXTERNAL_DEPENDENCY let you curate or build curricula.
  • PR-based expansion: 120k+ extra tasks with synthetic problem statements from PR descriptions and patches to broaden training coverage beyond issue-linked PRs.

In essence, SWE-rebench V2 is a large multilingual buffet that’s safe to eat from: it prepares each dish in its own clean kitchen (container), checks the taste before and after seasoning (F2P), has three chefs agree the recipe is clear (LLM judges), and labels dishes so diners with allergies (research goals) can pick safely.

03Methodology

At a high level: Raw GitHub PRs → (1) Preliminary collection → (2) Setup synthesis → (3) Execution-based validation → (4) Issue-clarity filtering → (5) Metadata enrichment → Outputs: 32k+ containerized tasks + 120k+ PR-derived tasks.

Stage 1: Preliminary Data Collection

  • What happens: Use GitHub Archive to gather PRs, issues, commit SHAs, and repo attributes; clone repos at scale and extract patches from git locally to avoid API limits. Link PRs to issues when possible and keep only those PRs that add or modify tests, have permissive licenses, and are merged/resolved.
  • Why this step exists: Without tests, you can’t form a reliable reward signal for RL. Without permissive licenses, redistribution is risky. Without local cloning, rate limits throttle scale.
  • Example: Start with 29,511,758 PRs across 145,306 repos. Requiring tests cuts to 8,593,722 PRs. Requiring issue links and tests yields 805,598 PRs across 50,797 repos. After repo-level filters tuned per language (stricter on high-resource ecosystems like Python/Java/Go; looser on long-tail languages), about 21,692 repos remain, leading to 41,349 candidates with fail-to-pass, then 32,079 after issue-text filtering.

Stage 2: Setup Synthesis (Interactive Agent)

  • What happens: Build Docker base images per language (e.g., Python, Go, Java with JDK 11/17/21). For each repo, an interactive setup agent (mini-SWE-agent with Qwen3-Coder-480B-A35B-Instruct) discovers install commands and a test command that reports per-test results verbosely, iterating on real error logs. It assembles an install_config.json with ā€œinstallā€ and ā€œtest_cmdā€ fields, and prefers XML reports when possible for stable parsing.
  • Why this step exists: One correct, reusable setup per repo amortizes costs across many tasks from that repo; without it, every task rebuilds the wheel and breaks often.
  • What breaks without it: Non-interactive guesses miss edge cases, causing flaky installs or wrong test runners. Compiled languages need explicit rebuilds after patching to avoid stale binaries.
  • Example: A Go repo might yield install: [go mod download], test_cmd: [go test ./... -run . -v]. A Java repo might use Maven with surefire XML reports for robust parsing.
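Putting the Go example into the install_config.json shape described above might look like this. Only the ā€œinstallā€ and ā€œtest_cmdā€ keys come from the text; treating each as a list of shell commands is an assumption about the exact schema.

```python
import json

# Hedged sketch of a per-repo install_config.json for the Go example.
install_config = {
    "install": ["go mod download"],
    "test_cmd": ["go test ./... -run . -v"],
}

# The recipe is discovered once per repository, then reused for every
# task mined from that repo.
print(json.dumps(install_config, indent=2))
```

For a Maven-based Java repo, the same shape would hold with different commands (e.g., an `mvn` install step and a test command that emits surefire XML reports).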

Stage 3: Execution-based Validation (Fail-to-Pass)

  • What happens: Use multi-stage Docker builds. Apply test patch and run full suite (expect failing tests). Then apply the solution patch and run again. Keep only instances where at least one test flips from fail to pass. Generate structured logs via repo-specific parsers.
  • Why this step exists: Full-suite execution ensures coverage and catches regressions; fail-to-pass proves the task has a meaningful learning signal.
  • What breaks without it: You might keep trivial or misleading tasks (e.g., passing already), giving zero or noisy reward for RL.
  • Example: In Rust, cargo test runs pre-fix to show a failing assertion, then post-fix to confirm the assertion passes—this pair becomes the training oracle.

Stage 4: Filtering by Issue Clarity (LLM Judges)

  • What happens: Three LLM judges (gpt-oss-120b, GLM-4.7, DeepSeek-V3.2) read the issue, patch, and test patch using a SWE-bench-Verified-style prompt. An instance is kept only if all judges agree the issue is adequately specified.
  • Why this step exists: Ambiguous instructions create unfair or confusing tasks, which poison learning signals.
  • What breaks without it: Agents might ā€œlearnā€ to game tests or flail on unclear goals.
  • Example: If the issue text lacks acceptance criteria or depends on private external docs, the judges likely reject it.

Stage 5: Metadata Enrichment (Diagnostics + Interfaces)

  • What happens: A meta-prompt (with gpt-oss-120b) labels tasks with issue codes such as B1 TEST_SUITE_COUPLING, B2 IMPLICIT_NAMING, B3 EXTERNAL_DEPENDENCY, etc. It also extracts explicit interfaces (method signatures) that tests call.
  • Why this step exists: Researchers need to curate subsets (e.g., start with ā€œcleanā€ A tasks, later add B1 for robustness) and build curricula.
  • What breaks without it: You can’t separate environment pathologies from model weaknesses, making analysis and training design much harder.
  • Example: A task tagged B2 warns that tests expect specific function names not mentioned in the issue; SFT curricula can avoid these early on.

PR-based Task Expansion

  • What happens: For repos where issue-linked tasks worked, reuse the synthesized install/test recipes to add PR-only tasks. Generate a problem statement from the PR description and patch with a careful prompt that avoids leaking implementation details.
  • Why this step exists: Issue links bottleneck scale. PR-based tasks add 120,000+ more instances for pretraining or RL warm-up.
  • Example: A PR that improves error handling without a linked issue gets a synthetic, high-level ā€œProblem/Expected Behaviorā€ brief and is included with its recipe.

Repository-specific Log Parsers

  • What happens: From a batch of successful logs, the pipeline generates a parser (with Qwen3-Coder-480B) that maps raw runner output or XML into a standardized per-test status. Retries happen if parsing fails elsewhere.
  • Why this step exists: Test runners differ widely, especially in C/C++ ecosystems; consistent parsing is required for uniform rewards.
  • What breaks without it: You can’t compare or aggregate results cleanly, and RL rewards get noisy.
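A minimal parser of this kind, targeting pytest-style verbose output, could look like the sketch below. The real parsers are generated per repository and per runner, so the regex and status names here are illustrative only.

```python
import re

# Map raw test-runner output lines to a standardized per-test status.
LINE_RE = re.compile(r"^(?P<test>\S+)\s+(?P<status>PASSED|FAILED|SKIPPED)")
STATUS_MAP = {"PASSED": "pass", "FAILED": "fail", "SKIPPED": "skip"}

def parse_log(log: str) -> dict[str, str]:
    """Extract {test_id: status} from verbose runner output; ignore other lines."""
    statuses = {}
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            statuses[m.group("test")] = STATUS_MAP[m.group("status")]
    return statuses

log = """
tests/test_api.py::test_get PASSED
tests/test_api.py::test_post FAILED
"""
print(parse_log(log))
# {'tests/test_api.py::test_get': 'pass', 'tests/test_api.py::test_post': 'fail'}
```

Feeding a standardized status map like this into the fail-to-pass check is what makes rewards comparable across ecosystems as different as pytest, JUnit XML, and cargo test.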

Secret Sauce

  • Reuse: Infer setup once per repo; reuse for all tasks—huge speedup.
  • Full-suite + F2P: Strong, simple reward signal across languages.
  • Ensemble filtering + diagnostics: Keep clarity high and make the dataset steerable.
  • Pre-built images: Immediate, reproducible execution at scale.

Two more mini ā€œsandwichesā€ for vital notions:

  • šŸž Hook: Failing a test first and passing it later proves the fix mattered. 🄬 Concept (Fail-to-Pass Tests): A task is valid when at least one test flips from fail (before) to pass (after). Steps: Run tests after applying only test changes; then run after the fix; check the flip. Why it matters: Ensures the reward teaches the intended behavior. šŸž Anchor: Like showing a broken light bulb, then the same bulb glowing after you flip the right switch.
  • šŸž Hook: Sticky notes on each task help you study smarter. 🄬 Concept (Diagnostic Metadata): Labels that tell you what might be tricky (brittle tests, hidden naming, external links). Steps: Analyze issues, patches, and runs; apply codes B1–B7; attach summaries. Why it matters: Lets you build curricula and avoid pitfalls early. šŸž Anchor: It’s like marking chapters as ā€œintro,ā€ ā€œchallenge,ā€ or ā€œbonus problems.ā€

04Experiments & Results

The Test: The paper measures whether the pipeline really builds correct environments, whether clarity filtering matches human judgments, and how hard the final tasks are for modern models.

Setup Synthesis Ablations (installation success)

  • What and why: Compare non-interactive scripts versus an interactive mini-SWE-agent using different LLMs and context lengths. Success means the automated setup reproduces the same fail-to-pass test set as the trusted manual setup.
  • Scoreboard (pass@k):
    • Non-interactive pipelines: pass@1 ā‰ˆ 12.1%; even with retries they rise only modestly.
    • Interactive agent with mid/large models: pass@1 jumps to ~17–27% depending on model, and pass@10 reaches ~46–63%. Qwen3-Coder-480B with 32–128k context performs best; longer context helps a bit but not always, and 32k is usually enough.
  • Meaning: That’s like going from guessing the right recipe 1 time in 8 to 1 time in 4 (or better), and with 10 tries you get close to 2Ɨ more repos installed. Interactivity clearly matters.
  • Surprise: More context isn’t a magic wand; too much can distract the agent or cause loops.
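Numbers like pass@1 and pass@10 are typically computed with the standard unbiased estimator: given n attempts with c successes, estimate the probability that at least one of k sampled attempts succeeds. A sketch (the pipeline's exact bookkeeping may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k) for n attempts, c successes."""
    if n - c < k:
        # Too few failures to fill a k-sized sample with them: success is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 setup attempts on a repo, 3 of which reproduced the reference F2P set:
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

Averaging this quantity over all repos gives the percentages quoted above.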

Issue Clarity Filtering (LLM judges)

  • What and why: Match LLM-based ratings against SWE-bench Verified human annotations. Test prompts (baseline, SPICE, enhanced variants), models (gpt-oss-120b, GLM-4.7, DeepSeek-V3.2, etc.), and ensemble strategies.
  • Scoreboard:
    • Prompting: The ā€œVerified+ā€ prompt had the best F1, while ā€œVerified-Eā€ (which adds the patch and test patch) had the highest precision; it was chosen when precision matters most for filtering.
    • Models: gpt-oss-120b balanced best across metrics; some models traded recall for precision.
    • Ensembling: Averaging scores across judges improved robustness and F1; strict consensus increased precision but cut recall.
  • Meaning: If you care most about not letting bad tasks slip through (precision), Verified-E + consensus is strong. For broader coverage, averaging works better.

Diagnostic Study (task difficulty and pathologies)

  • What and why: Run seven frontier models (DeepSeek-V3.2, Gemini 3 Flash, GLM-4.7, GPT-5.2 medium, gpt-oss-120b, MiniMax-M2.1, Claude Opus-4.5) across 300 tasks from Python, JS, Go, Rust, Scala to observe pass rates and failure modes.
  • Scoreboard (pass@1 across all 300): Claude Opus-4.5 ā‰ˆ 25%, GLM-4.7 ā‰ˆ 21%, Gemini ā‰ˆ 18%, DeepSeek V3.2 ā‰ˆ 17%, GPT-5.2 ā‰ˆ 17%, MiniMax ā‰ˆ 19%, gpt-oss-120b ā‰ˆ 9%. Per-language, Python hit ~36% pass@1 for top model; Scala was toughest (~19%). Pass@3 bumps each by ~6–10 points.
  • Meaning: Even the strongest model solves only about a quarter of these tasks on the first try, which is realistic difficulty. It also confirms multi-language variety changes what models find easy or hard.
  • Surprising findings:
    • Test Suite Coupling: Some correct fixes fail due to regressions elsewhere; this is useful for training agents to avoid collateral damage, but tricky for early RL.
    • Implicit Naming: Tests sometimes assume specific symbol names not stated in issues; metadata flags these (B2) so you can filter or provide hints.
    • External Dependencies: Tasks that reference changing URLs or private docs (B3) can hinder reproducibility but are valuable for tool-augmented agents (e.g., with browsing).

Scale and Coverage

  • Final issue-linked corpus: 32,079 tasks from 3,600+ repos across 20 languages, with pre-built images for reproducibility.
  • PR-derived expansion: 120,000+ tasks with install/test recipes and synthetic problem statements from PR descriptions.
  • Patch stats: Median 3 files / 34 lines changed; 90th percentile 9 files / 181 lines—plenty of challenge.

Bottom line: The pipeline’s interactive setup reliably beats non-interactive methods, the clarity filter is tunable for precision or coverage, and the resulting tasks present a meaningful, diverse challenge to today’s best models—exactly what you want for RL training.

05Discussion & Limitations

Limitations

  • Single-container focus: The pipeline targets projects that fit in one Docker image. Multi-service systems (databases, queues, microservices) are out of scope for now, limiting coverage of complex, long-horizon tasks.
  • Training ablations not included: While metadata and diagnostics are provided, the paper doesn’t run end-to-end training ablations to show how filtering choices (e.g., excluding B2 IMPLICIT_NAMING early) shift RL learning curves—this is future work.
  • Environment pathologies: Even with filters, some tasks contain flaky tests, implicit requirements, or external links. Metadata helps, but perfect cleanliness at this scale is unrealistic.
  • Compute and storage: Building and storing pre-built images, running full test suites, and executing an interactive agent are resource-intensive; smaller labs may need to subset.

Required Resources

  • Container runtime (Docker), sufficient disk/network for large image caches, and distributed workers for mining and building.
  • Access to capable LLMs for setup synthesis, parsing, and judging (or compatible open models and prompts).

When Not to Use

  • If you need multi-service orchestration (e.g., spinning up DBs, message brokers) with strict SLA-like non-functional requirements today.
  • If you require 100% human-verified instances with zero ambiguity; this corpus balances scale and realism.

Open Questions

  • Curriculum design: What’s the best way to sequence A-only tasks into B1/B2/B3 for robust RL without derailing learning?
  • Reward shaping: How to grant partial credit for regression-avoidance or near-correct fixes to speed up RL convergence?
  • Tool-augmented agents: How much do browsing or build-tool plugins help on B3 EXTERNAL_DEPENDENCY or tricky build ecosystems?
  • Long-horizon repairs: How to extend this pipeline to multi-repo or multi-service changes with iterative test/deploy loops?

Overall, SWE-rebench V2 makes a practical trade: maximum multilingual, executable scale with rich metadata so users can tailor the dataset to their needs, knowing that not every edge case is solved by design.

06Conclusion & Future Work

Three-sentence summary: SWE-rebench V2 is a language-agnostic, automated pipeline that turns real GitHub histories into reproducible, fail-to-pass software tasks at scale. It uses interactive setup synthesis, full-suite validation, and an ensemble of LLM judges to keep tasks clear and reliable, then ships pre-built environments plus diagnostic metadata. The release includes 32k+ containerized tasks across 20 languages and 120k+ PR-derived tasks with recipes, purpose-built for large-scale RL training of SWE agents.

Main Achievement: Unifying a single, reusable construction workflow across many languages—paired with pre-built Docker images, repository-specific parsers, calibrated LLM filtering, and instance-level diagnostics—so training-ready tasks are available at unprecedented scale and diversity.

Future Directions: Increase setup retries for higher yield; broaden long-tail language coverage; support multi-service environments; and enrich reward signals to include performance and resource metrics. Systematic training ablations using diagnostic-based curricula will clarify how best to stage difficulty and noise for RL.

Why Remember This: It transforms the messy web of open-source code into a massive, clean practice field where AI coders can safely learn, across many languages, with reliable tests and knobs for difficulty—bringing us closer to trustworthy software agents that can fix real bugs in the wild.

Practical Applications

  • Train RL coding agents on reliable, multilingual tasks with stable, test-based rewards.
  • Create curricula: start with A-class (clean) tasks, then add B1/B2 for robustness training.
  • Benchmark agent improvements across languages using the same execution contract.
  • Stress-test agents on brittle or coupled test suites to improve regression-avoidance skills.
  • Pretrain repository-aware log parsers or tools that read and normalize test outputs.
  • Evaluate setup strategies or LLM choices for environment synthesis in new ecosystems.
  • Build language-specific teaching modules using diagnostic labels and interface summaries.
  • Study prompt and ensemble strategies for automated issue clarity filtering.
  • Prototype tool-augmented agents (e.g., with web browsing) on B3 external-dependency tasks.
#SWE-rebench V2#software engineering agents#reinforcement learning#fail-to-pass tests#Docker containers#language-agnostic pipeline#LLM judges#setup synthesis#test oracle extraction#multilingual benchmarks#repository-level tasks#automated dataset construction#diagnostic metadata#log parser generation