
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Intermediate
Jialong Chen, Xander Xu, Hu Wei et al. · 3/4/2026
arXiv

Key Summary

  • SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.
  • It simulates a real software team’s Continuous Integration (CI) loop with two roles: an Architect (plans) and a Programmer (codes).
  • Instead of a single pass/fail, it tracks progress over up to 20 rounds and scores long-term maintainability using EvoScore.
  • A clever metric called normalized change measures improvements and regressions on a common scale from -1 to 1.
  • Across 100 real-world tasks averaging 233 days and 71 commits, agents must evolve the base code toward a later target version.
  • Top models get better at long-term maintenance, but most still struggle to avoid regressions; only a few reach a zero-regression rate above 0.5.
  • Claude Opus leads overall, with GLM-5 also strong; providers show different preferences for short-term vs long-term gains.
  • SWE-CI reveals what snapshot tests miss: whether early code choices help or hurt later changes.
  • It ships with ready-to-run Docker environments and strict rules to mirror real CI and keep results reproducible.

Why This Research Matters

Most real software work is maintenance, not just writing new code once. SWE-CI measures whether AI agents can make many small, safe changes without breaking older features. That’s exactly what teams need when shipping weekly updates, fixing bugs, or adding features on tight schedules. By rewarding long-term stability with EvoScore, SWE-CI encourages agents to write cleaner, more extensible code. The benchmark’s realistic CI loop and strict rules make results trustworthy and comparable. Over time, this can help companies choose better AI partners and guide research toward agents that are reliable teammates, not just clever sprinters.

Detailed Explanation


01Background & Problem Definition

You know how building a treehouse isn’t just about hammering the last nail once? You keep improving it—adding a ladder, fixing a squeaky board, making space for friends—over weeks and months. Software is like that too.

🍞 Hook: Imagine a class project where you add something new to your poster every day. If you paste things anywhere just to finish fast, later pieces won’t fit, and you’ll have to rip tape and redo work. 🥬 The Concept (Functional Correctness): It means “does the code do what the tests ask right now?”

  • What it is: A simple check that today’s code matches today’s rules.
  • How it works: 1) Run tests; 2) See which ones pass; 3) If all pass, you’re correct today.
  • Why it matters: Without it, we don’t know if the feature even works. 🍞 Anchor: Pressing a calculator’s square-root button should give the right number—if it does, that function is correct.
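The three steps above can be made concrete with a toy pytest-style check; `safe_sqrt` is an invented function for illustration, not code from the paper:

```python
import math

def safe_sqrt(x: float) -> float:
    """Return the square root, raising ValueError on negative input."""
    if x < 0:
        raise ValueError("negative input")
    return math.sqrt(x)

# pytest-style tests: if these pass today, the function is
# functionally correct today -- and that is all they tell us.
def test_safe_sqrt_basic():
    assert safe_sqrt(9.0) == 3.0

def test_safe_sqrt_negative():
    try:
        safe_sqrt(-1.0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Running `pytest` on this file executes both tests; an all-green run is exactly the "correct today" snapshot the concept describes.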

🍞 Hook: You know how your room stays nice only if you tidy a little each day? If you stop, the mess piles up. 🥬 The Concept (Long-term Code Maintenance): Keeping code clean and easy to change over months or years.

  • What it is: The ongoing work to improve, fix, and extend software without breaking it.
  • How it works: 1) Add features in small steps; 2) Keep things organized; 3) Check old features still work; 4) Repeat.
  • Why it matters: Without maintenance, new changes get slower and break more stuff. 🍞 Anchor: A well-organized binder makes tomorrow’s homework easier to file; a messy one wastes time.

🍞 Hook: If you rush homework by scribbling, you finish fast today but spend more time fixing it later. 🥬 The Concept (Technical Debt): Shortcuts in code that create extra work in the future.

  • What it is: Messy or brittle code that “borrows time” now but “charges interest” later.
  • How it works: 1) Skip cleanup; 2) Add quick hacks; 3) Future changes take longer; 4) Bugs sneak in.
  • Why it matters: Debt compounds; each new change costs more and risks more. 🍞 Anchor: Leaving dishes in the sink saves 2 minutes now but costs 20 minutes tomorrow.

🍞 Hook: When you change one puzzle piece, you check that the whole picture still looks right. 🥬 The Concept (Regression Testing): Re-checking that old features still work after changes.

  • What it is: Tests to catch when something that used to pass now fails.
  • How it works: 1) Make a change; 2) Run all tests; 3) If any once-passing test fails, that’s a regression; 4) Fix it.
  • Why it matters: Without it, software quality silently drifts downward. 🍞 Anchor: A chef retastes the soup after adding salt to make sure it didn’t become too salty.
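The check in steps 2 and 3 boils down to a set difference between the tests that passed before and after a change; a minimal sketch with invented test names:

```python
def find_regressions(before_passing: set, after_passing: set) -> set:
    """Tests that passed before the change but no longer pass after it."""
    return before_passing - after_passing

before = {"test_parse", "test_format", "test_round_trip"}
after = {"test_parse", "test_round_trip", "test_new_feature"}
print(find_regressions(before, after))  # {'test_format'}
```

Any non-empty result is a regression, no matter how many new tests started passing.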

🍞 Hook: To compare racers fairly, you put them on the same track and time them the same way. 🥬 The Concept (Benchmarking): A standardized way to compare performance.

  • What it is: A fair test with shared rules and scoring.
  • How it works: 1) Pick tasks; 2) Run everyone under the same setup; 3) Score with the same metric; 4) Compare.
  • Why it matters: Without benchmarks, we argue opinions instead of using evidence. 🍞 Anchor: A spelling bee gives everyone the same word lists and rules.

Before SWE-CI, code benchmarks mostly checked one-shot functional correctness. Great for "Does it work today?" but not for "Can it keep working as things change?" Real software evolves: new features arrive, libraries update, and designs stretch. Research and industry also know that 60%–80% of the total cost of software is maintenance, and quality tends to degrade unless teams actively manage it. But snapshot-style tests hide that story. A brittle fix and a clean design can both pass the same test today. The difference only shows up later—when the next change either slides in easily or jams like a sticky drawer.

SWE-CI fills this gap by shifting the lens from short-term snapshots to long-term evolution. It builds tasks from real GitHub histories: each task starts at an earlier commit (the base) and aims toward a later commit (the oracle), with an average of 233 days and 71 commits of real changes in between. The benchmark runs a mini CI loop many times, asking agents to read failing tests, plan small improvements, code them, and repeat. It observes not only whether agents make progress, but also whether they avoid regressions and keep the door open for future changes. In short, SWE-CI tests how well an agent can be a reliable teammate over time—not just a clever sprinter for one round.

Why should you care? Because your favorite apps and websites don’t succeed from one perfect day. They succeed because teams can add features without breaking old ones, move fast without tripping later, and keep quality steady even as complexity grows. SWE-CI gives us a way to measure if AI helpers can actually do that.

02Core Idea

The “aha!” in one sentence: Don’t just check if an AI can fix code today—watch how its choices help or hurt tomorrow, using a CI-style loop and a future-weighted score.

Analogy 1 (Garden): Planting one flower is easy; keeping the garden thriving through seasons shows true skill. EvoScore is the season-long report card. Analogy 2 (Lego City): Anyone can snap one brick; building a city that stays sturdy as you add trains and bridges is the real challenge. Analogy 3 (School Project Series): A nice first poster is great, but the series grade depends on how well each new piece fits the story without redoing past work.

🍞 Hook: Picture a relay race. The first runner can sprint wildly, but if they hand off the baton poorly, the team stumbles. 🥬 The Concept (Continuous Integration, CI): A practice of making small, frequent changes and checking everything fits together each time.

  • What it is: A loop of plan → code → test → repeat.
  • How it works: 1) Make a tiny change; 2) Run tests; 3) Fix issues; 4) Merge; 5) Repeat often.
  • Why it matters: Many small, safe steps beat one risky giant leap. 🍞 Anchor: Updating a group slideshow one slide at a time and previewing after each change so the whole deck stays smooth.

Before vs After:

  • Before: Benchmarks gave a single task; if your code passed the tests once, you “won.”
  • After: SWE-CI gives many rounds; your early choices affect later ease, stability, and progress.
  • Before: Pass/fail hid how messy the code was.
  • After: Normalized change and EvoScore reveal steady gains vs fragile jumps and backslides.

Why it works (intuition, no math):

  • Early design choices either open doors (clean structure) or close them (technical debt).
  • By replaying many CI cycles, we surface compounding effects—good habits snowball into easier future steps; bad habits multiply into regressions.
  • Weighting later rounds more (with gamma) rewards agents who invest in maintainability, not just quick wins.
  • Separating roles (Architect plans, Programmer codes) mirrors real teams, reducing confusion and encouraging focused, test-driven steps.

Building blocks explained with the Sandwich Pattern:

🍞 Hook: If you try to measure height with random rulers, you can’t compare fairly. 🥬 The Concept (Normalized Change): A way to place progress or backsliding on a common -1 to 1 scale.

  • What it is: A scaled score of how many tests you improved or broke compared to the starting point and the goal.
  • How it works: 1) Count tests passing now; 2) Compare to base and to target; 3) Scale so closing the full gap is +1; breaking all initially passing tests is -1; 4) Works no matter how big the project is.
  • Why it matters: Fair comparisons across tasks and models. 🍞 Anchor: Like grading improvement in running from your first time to your goal time, regardless of the track length.
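A plausible sketch of that scaling, assuming gains are measured against the base-to-oracle gap and losses against the tests passing at the base; the paper's exact formula may differ:

```python
def normalized_change(passed_now: int, passed_base: int, passed_oracle: int) -> float:
    """Scale progress to [-1, 1]: +1 means the full base-to-oracle gap
    is closed; -1 means every test passing at the base now fails."""
    if passed_now >= passed_base:
        gap = passed_oracle - passed_base
        return (passed_now - passed_base) / gap if gap > 0 else 0.0
    # Regression branch: scale by how many base tests could be broken.
    return (passed_now - passed_base) / passed_base if passed_base > 0 else -1.0

print(normalized_change(60, 40, 80))  # 0.5: closed half the gap
print(normalized_change(20, 40, 80))  # -0.5: broke half the base passes
```

Because both branches are ratios, a small project and a huge one land on the same -1 to 1 scale.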

🍞 Hook: Report cards matter more near the end of the term. 🥬 The Concept (EvoScore): A future-weighted average of your normalized change across rounds.

  • What it is: A single number summarizing your long-term progress, giving more weight to later rounds.
  • How it works: 1) After each CI round, compute normalized change; 2) Assign bigger weights to later rounds; 3) Average; 4) Higher is better.
  • Why it matters: Rewards agents who keep code easy to change. 🍞 Anchor: A music teacher who values improvement in later recitals more than early practices.
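In miniature, the weighted average could look like this; the `gamma ** t` weight scheme is an assumption consistent with "later rounds count more," not the paper's exact definition:

```python
def evo_score(changes: list, gamma: float = 1.1) -> float:
    """Future-weighted average of per-round normalized change.
    With gamma >= 1, later rounds receive larger weights."""
    weights = [gamma ** t for t in range(len(changes))]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

steady = [0.1, 0.2, 0.3, 0.4]  # keeps improving round after round
sprint = [0.4, 0.3, 0.2, 0.1]  # fast start that fades
print(evo_score(steady) > evo_score(sprint))  # True
```

The two trajectories have the same plain average; only the future weighting separates the steady improver from the fading sprinter.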

🍞 Hook: A coach draws the play; the player executes it on the field. 🥬 The Concept (Architect–Programmer Dual-Agent Evaluation): Two coordinated roles to mimic real CI teamwork.

  • What it is: Architect reads failures and writes small high-level requirements; Programmer implements them.
  • How it works: 1) Architect summarizes failing tests; 2) Locates root causes; 3) Designs behavior-level requirements; 4) Programmer understands, plans, codes minimal changes; 5) Repeat.
  • Why it matters: Clear roles reduce chaos, encourage small steps, and reveal maintainability. 🍞 Anchor: A designer specifies what a bridge must hold; an engineer chooses how to build it.

Together, these pieces turn code evaluation from a snapshot into a story. Instead of asking, “Can you pass today’s test?”, SWE-CI asks, “Can you keep passing as the world changes?”

03Methodology

At a high level: Base repository + Oracle repository + Tests + Docker environment → Build CI loop (Architect then Programmer) → Run tests and compute normalized change → Repeat up to 20 rounds → Output EvoScore, zero-regression rate, and detailed logs.

Data curation recipe (how SWE-CI builds tasks from real projects):

  • Inputs: Thousands of real Python repositories on GitHub.
  • Step 1: Repository Collection.
    • What happens: Pick repos with 500+ stars, 3+ years maintenance, permissive license (e.g., MIT/Apache-2.0), clear config/deps, and unit tests.
    • Why it exists: Ensures mature, real-world projects with meaningful histories and runnable tests.
    • Example: A popular data library with pyproject.toml and pytest tests gets in; a tiny toy repo does not.
  • Step 2: Commit Span Extraction.
    • What happens: Keep main branch only; split history into longest stretches where dependencies don’t change; take endpoints as base/oracle pairs; require at least 1,000 lines of source changes (excluding tests).
    • Why it exists: Stable dependencies avoid environment whiplash; large diffs ensure non-trivial evolution.
    • Example: A span before pydantic upgrades from 1.x to 2.x is kept separate from the span after that upgrade.
  • Step 3: Environment Construction.
    • What happens: Auto-generate a Dockerfile from the oracle’s configs; build and run oracle tests to verify; if a missing dependency blocks launch, auto-inject it and rebuild; discard pairs with other failures.
    • Why it exists: Reproducibility; fewer flaky runs; more usable tasks via self-repair.
    • Example: Missing a small plugin? Inject it automatically; a broken test suite? Discard.
  • Step 4: Case Filtering.
    • What happens: Run the oracle’s test suite against the base code inside the same environment; drop any pairs that can’t launch; drop pairs where the pass-count gap is < 5; rank survivors by time span and commits; pick top 100.
    • Why it exists: Guarantees a meaningful challenge and smooth runtime.
    • Example: If base vs oracle differ by only 2 passing tests, it’s too small—removed.

The CI-loop protocol (how agents work inside a task):

  • Inputs: Base code, frozen oracle test report, and the same test suite and Docker image.
  • Round structure (max 20 rounds):
    1. Architect reads failing tests and their summaries; traces into tests and source; writes 1–5 high-level, incremental requirements (XML) focused on behavior, not code.
    2. Programmer reads the requirements; inspects relevant files; plans minimal changes; edits only source (not tests); saves.
    3. External runner executes tests with pytest; produces a fresh report; computes normalized change against base and oracle; logs regressions.
    4. Loop repeats with updated code.
  • Why these steps: They mirror real teams: planners keep scope tight; coders implement; tests judge; small steps make learning visible and controllable.
  • Concrete mini-example: If many tests fail due to a missing behavior in a parser, the Architect might write: “Implement parse_date so it accepts 'YYYY-MM-DD' and raises ValueError on bad input.” The Programmer then adds that function and minimal wiring.
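The round structure can be sketched as a harness loop. The callables stand in for the LLM roles and the external runner; the toy usage below treats "code" as a count of implemented behaviors, purely for illustration:

```python
def run_task(code, architect, programmer, run_tests, score, max_rounds=20):
    """One SWE-CI-style task: repeat Architect -> Programmer -> test run."""
    history = []
    for _ in range(max_rounds):
        report = run_tests(code)        # the harness runs tests, not the agents
        reqs = architect(report)        # 1-5 behavior-level requirements
        code = programmer(code, reqs)   # minimal edits to source, never tests
        history.append(score(run_tests(code)))
    return code, history

# Toy simulation: "code" = number of behaviors implemented, out of 10.
final, hist = run_task(
    code=3,
    architect=lambda report: ["add one missing behavior"],
    programmer=lambda code, reqs: min(code + 1, 10),
    run_tests=lambda code: code,        # passing tests == behaviors implemented
    score=lambda passes: passes / 10,
    max_rounds=5,
)
print(final, hist)  # 8 [0.4, 0.5, 0.6, 0.7, 0.8]
```

The key design point survives even in this toy: the agents never execute tests themselves; only the harness does, which keeps scoring honest.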

Scoring details:

  • Normalized change: A scaled measure from -1 to 1 that fairly credits improvement and penalizes breaking previously passing tests.
  • EvoScore: A future-weighted average of normalized change across rounds; later rounds count more (gamma ≥ 1 by default), so sustainable progress beats early sprints that stall.
  • Zero-regression rate: The share of tasks where no previously passing test ever gets broken during the entire run.
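The third metric can be sketched from per-round passing-test sets; the data shapes here are assumptions, not the harness's actual log format:

```python
def zero_regression_rate(task_logs) -> float:
    """Share of tasks in which no once-passing test ever fails later.
    Each task log is a sequence of passing-test name sets, one per round."""
    def task_is_clean(rounds) -> bool:
        ever_passed = set()
        for passing in rounds:
            if ever_passed - passing:   # a once-passing test now fails
                return False
            ever_passed |= passing
        return True
    return sum(task_is_clean(t) for t in task_logs) / len(task_logs)

clean = [{"test_a"}, {"test_a", "test_b"}]   # only gains
dirty = [{"test_a"}, {"test_b"}]             # test_a regressed
print(zero_regression_rate([clean, dirty]))  # 0.5
```

Note the strictness: one broken test in one round disqualifies the whole task, which is why most models score below 0.25 on this metric.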

Secret sauce (what’s clever here):

  • Evolution, not snapshots: The benchmark observes compounding effects—exactly where maintainability shows up.
  • Future-weighted scoring: Rewards long-game design, not just early fireworks.
  • Role separation: Architect vs Programmer creates cleaner thinking and avoids prompt overload; it’s closer to how real CI works.
  • Dependency-stable spans: By fixing dependencies across a task’s span, changes focus on code design, not random environment failures.
  • Self-repairing Docker builds: Automatically patch missing deps to keep valuable tasks instead of discarding them.
  • Frozen oracle test report: Ensures a stable target and consistent comparisons across models.

System guardrails and tools:

  • Tests via pytest and pytest-json-report; 3600-second timeout per run.
  • Agent framework: iFlow CLI; up to 20 iterations.
  • Constraints: Architect writes only requirement.xml; Programmer edits only /app/code/ (not tests); neither runs tests directly—the harness does.
  • Both agents can be powered by the same base LLM unless specified.

Putting it all together: SWE-CI turns each task into a mini time machine. Start at an old commit with the final destination defined by a later commit’s behavior and environment. Then, like a real team, iterate small improvements, measure fairly each step, and score how well you sustained quality over the journey.

04Experiments & Results

The test: Can AI agents maintain and evolve real codebases over many rounds without losing quality? To find out, SWE-CI runs 100 tasks from 68 repositories, each covering on average 233 days and 71 commits of history, with substantial source changes (≥500 lines, tests excluded). For each task, agents have up to 20 CI rounds to move the base code toward the oracle behavior. The harness measures three things: normalized change per round (how much you closed the gap or regressed), EvoScore (a future-weighted average of progress), and zero-regression rate (how often you never broke anything that once passed).

The competition: 18 models from 8 providers were evaluated under the same setup (pytest + Docker, fixed environment, same test suites). This isn’t a head-to-head coding sprint like HumanEval—it’s a marathon with checkpoints. The key is how a model’s early choices make later steps smoother or rougher.

The scoreboard with context:

  • Overall leaders: The Claude Opus series consistently tops EvoScore across the observation period—think A+ while many others hover around B-level. GLM-5 is another strong runner, placing near the front.
  • Acceleration over time: Within each provider family, newer releases always do better, and models after 2026 show bigger jumps. Imagine each new model climbing a grade level in maintainability skills.
  • Long-term vs short-term preference: Adjusting gamma (the weight on later rounds) reshuffles rankings. MiniMax, DeepSeek, and GPT tend to benefit more when later rounds matter (long-term-minded). Kimi and GLM look better when early rounds are weighed more (short-term gains). Qwen, Doubao, and Claude remain steadier across settings. This suggests differing training priorities: some invest in future stability, others in quick wins.
  • Regression control is hard: Most models have a zero-regression rate under 0.25—meaning in most tasks they break at least one test that previously passed. Only two Claude Opus variants exceed 0.5. That’s like saying even strong students often smudge earlier work while fixing new parts.

Surprising findings:

  • Provider consistency: Models from the same provider behave similarly as gamma changes, hinting at stable internal training pipelines and values (e.g., risk appetite for quick fixes vs careful structuring).
  • Token scale: The study burned through more than 10 billion tokens—evidence that long-horizon evaluation is compute-hungry and that maintainability isn’t revealed in tiny samples.
  • Stability under environment control: By fixing dependencies within each task, the benchmark kept runs cleaner, making regressions more about code choices than flaky environments.

What the numbers really mean:

  • A high EvoScore isn’t just “more tests passed”; it’s “you passed more, later, after many changes,” which indicates design choices that aged well.
  • A low zero-regression rate flags reliability risks: moving fast but knocking over old dominoes.
  • Sensitivity to gamma shows the trade-off between sprinting early vs pacing for the whole race. Teams that live on long release trains may favor higher gamma; teams doing quick hotfixes may prefer lower gamma.
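That trade-off can be shown in miniature with the same weighted-average idea: a fast-starting "sprinter" and a steady "pacer" swap ranks as the weighting shifts (gamma below 1 stands in here for settings that weight early rounds more; the benchmark's own sweep may be parameterized differently):

```python
def weighted_progress(changes, gamma):
    """Weighted average of per-round change; larger gamma favors later rounds."""
    weights = [gamma ** t for t in range(len(changes))]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

sprinter = [0.5, 0.1, 0.1, 0.1]  # big early gains that stall
pacer = [0.1, 0.2, 0.3, 0.4]     # slow start, sustained improvement

print(weighted_progress(sprinter, 0.5) > weighted_progress(pacer, 0.5))  # True
print(weighted_progress(pacer, 2.0) > weighted_progress(sprinter, 2.0))  # True
```

Neither trajectory is "better" in the abstract; the weighting encodes whether your team values hotfix speed or long-release stability.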

In plain terms: SWE-CI turns code evaluation into an endurance test. Some agents start fast but stumble; others pace themselves and finish stronger. Today, the field has promising leaders, but reliable, regression-free long-term maintenance remains a hard bar that most models don’t clear yet.

05Discussion & Limitations

Limitations:

  • Python-only: SWE-CI focuses on Python projects. Results may differ in languages with other styles (e.g., C++ headers, Java interfaces, Rust borrow checker), where maintainability pressures look different.
  • Unit-test visibility: The benchmark relies on unit tests and an oracle snapshot; aspects like performance, security, UI/UX, and documentation quality aren’t directly measured.
  • Dependency stability assumption: Each task holds dependencies constant between base and oracle spans. Real-world maintenance often includes library upgrades that introduce new friction.
  • Agent protocol bias: The Architect–Programmer split is realistic but not the only workflow; pair-programming agents or tool-rich planners might behave differently.
  • Compute intensity: Running up to 20 rounds across 100 tasks with Docker and long test timeouts eats resources; smaller labs may find full runs costly.

Required resources:

  • Containerized CI: Docker images built from oracle configs, with occasional auto-injected dependencies.
  • Testing stack: pytest, pytest-json-report, and enough time budget (up to 3600 seconds per run) for heavy suites.
  • Agent framework: An orchestration tool (e.g., iFlow CLI) to manage multi-round loops, logs, and scoring.
  • Model access: Stable LLM APIs or local models capable of reasoning over multi-file repositories.

When not to use SWE-CI:

  • Projects without meaningful test suites or with fast-shifting dependencies during the chosen span.
  • Non-Python codebases or domains where integration tests, performance, or end-to-end behavior matter more than unit-level evolution.
  • Real-time or GUI-heavy apps where correctness is hard to capture via unit tests alone.

Open questions:

  • Planning durability: How can agents plan changes that remain flexible across many future rounds without over-engineering?
  • Regression shields: Can we teach agents to set guardrails (e.g., auto-checklists, invariant tracking) that cut regressions drastically?
  • Credit assignment: Which specific early choices cause late failures? Can we trace and learn from those causes automatically?
  • Training signals: Can EvoScore (or its components) be used as a training objective to grow long-horizon coding instincts?
  • Team topologies: Would more roles (e.g., Reviewer, Tester) or tool-augmented agents (e.g., static analyzers, type checkers) lift stability further?

06Conclusion & Future Work

In three sentences: SWE-CI is a benchmark that evaluates AI coding agents over many CI-style rounds, focusing on long-term maintainability instead of one-shot fixes. It introduces normalized change, EvoScore, and a dual-agent (Architect–Programmer) loop to reveal how early choices affect later progress and regressions. Experiments across 18 models show clear leaders but also a broad struggle to avoid regressions and sustain quality.

Main achievement: Turning code evaluation into an evolution story—with future-weighted scoring and realistic team roles—so we can finally measure what real software work demands: the ability to keep going without crumbling.

Future directions: Extend beyond Python and unit tests; incorporate dependency upgrades; add reviewers and static analysis tools; and explore training agents directly against EvoScore-like objectives. Richer telemetry (e.g., code churn patterns, architectural smell detectors) could pinpoint which habits most improve long-horizon stability.

Why remember this: SWE-CI shifts our mindset from “Did it pass today?” to “Will it keep passing tomorrow?” It’s a practical, reproducible way to see if AI can be a trustworthy teammate for the long haul, not just a clever one-turn fixer.

Practical Applications

  • Choose AI coding assistants based on EvoScore and zero-regression rate to match your team’s release cadence.
  • Use SWE-CI as a pre-production gate to catch agents that cause regressions during multi-round changes.
  • Train or fine-tune models with feedback shaped by normalized change and EvoScore to grow long-horizon discipline.
  • Benchmark internal agent workflows (e.g., add a Reviewer role) and compare against the dual-agent baseline.
  • Design CI policies that favor maintainability (higher gamma) for long-lived products and evaluate provider fit.
  • Run A/B tests of prompts or tools (e.g., static analyzers) to see which setups reduce regressions on SWE-CI tasks.
  • Teach TDD and CI best practices to junior developers using SWE-CI’s structured Architect/Programmer pattern.
  • Audit code evolution strategies (e.g., refactor-first vs feature-first) by measuring long-term impacts on EvoScore.
  • Stress-test dependency management by adding tasks with controlled library upgrades in internal variants.
  • Create hiring or upskilling challenges focused on multi-round maintenance rather than single-shot fixes.
#SWE-CI#continuous integration#code maintainability#EvoScore#normalized change#regression testing#technical debt#dual-agent evaluation#repository-level benchmark#software evolution#pytest#Docker#zero-regression rate#long-horizon coding#CI loop