LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Key Summary
- LongCLI-Bench is a new test that checks how well AI coding agents can handle long, realistic software projects in the command line, not just tiny coding puzzles.
- It uses two kinds of tests at once: Fail→Pass (did you add the new feature?) and Pass→Pass (did you avoid breaking old stuff?), plus step-by-step scores to see exactly where things went wrong.
- Tasks are big and real: roughly 15,000 lines of code and over 1,000 minutes of expert time on average, across four categories (build from scratch, add a feature, fix a bug, refactor).
- Top agents today still struggle: most pass rates are below 20%, and many runs stall before finishing even 30% of the required steps.
- Self-correction (trying again using test feedback) helps a bit, but not enough to solve most long tasks.
- Human-agent teamwork, like giving a good plan first or stepping in with guidance, boosts success much more than self-correction alone.
- The benchmark avoids “data contamination” by not scraping GitHub directly and by rewriting and curating tasks carefully.
- Results suggest we should build better planning and execution skills into agents and design tools that make human-agent collaboration smooth and effective.
- Step-level scoring turns a simple pass/fail into a useful progress map, showing early bottlenecks and helping diagnose failure points.
Why This Research Matters
Software in the real world is more like a marathon than a sprint, and LongCLI-Bench finally tests AI helpers on that reality. It shows whether agents can add new features without secretly breaking old ones, which is critical for safety and reliability. By scoring each step, teams can see exactly where agents get stuck—often early in setup or planning—and fix those weak spots. The benchmark also shows that a little human guidance can go a long way, pointing toward practical teamwork rather than pure autonomy. Because it avoids contamination and runs in controlled environments, results are trustworthy and reproducible. Over time, this will lead to agents that help teams ship features faster, with fewer bugs, and with more confidence.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how building a treehouse takes many weekends—first planning, then getting wood, then measuring, cutting, and finally assembling? It isn’t one quick step; it’s a long journey where each part depends on the last.
🥬 AI-Assisted Programming (the prerequisite):
- What it is: AI-assisted programming is when an AI helps you write and fix code.
- How it works: 1) You tell the AI what you want, 2) it suggests code, 3) you run it, 4) you both fix issues together.
- Why it matters: Without AI helpers, writing software takes longer and small mistakes can slow you down. 🍞 Anchor: Like asking a smart calculator to help with hard math but for code.
🍞 Hook: Imagine a robot helper who can’t just write recipes but can also cook in your kitchen, using your stove and tools.
🥬 Agentic Programming (acting in the environment):
- What it is: Agentic programming is when the AI not only writes code but also plans steps and runs tools in a real computer environment.
- How it works: 1) Reads your goal, 2) plans tasks, 3) runs commands in the terminal, 4) checks results, 5) adjusts the plan.
- Why it matters: Without this, the AI can’t handle real projects that need setting up, testing, and fixing across many steps. 🍞 Anchor: Like a robot chef that shops, cooks, tastes, and tweaks the recipe, not just lists ingredients.
🍞 Hook: Think of a school test that’s only one question vs. a whole semester project with milestones.
🥬 Benchmark (the measuring tool):
- What it is: A benchmark is a fair test used to compare how well different AIs do on the same problems.
- How it works: 1) Collect tasks, 2) set clear rules, 3) run AIs on them, 4) score results.
- Why it matters: Without solid benchmarks, we can’t tell if new AIs are truly better. 🍞 Anchor: Like a track meet with the same distance, timers, and rules for every runner.
🍞 Hook: If your treehouse plan only checks the first nails, you might miss that the roof leaks later.
🥬 Long-horizon Tasks (many steps over time):
- What it is: Long-horizon tasks are projects that require many connected steps where early choices affect later success.
- How it works: 1) Plan, 2) prepare environment, 3) code multiple modules, 4) test and fix, 5) avoid breaking old parts.
- Why it matters: Short puzzles can’t reveal whether an AI can stay organized and steady over hours or days. 🍞 Anchor: Building a whole game from menus to graphics to saving scores, not just writing one function.
🍞 Hook: If a quiz uses answers from last year’s class, some students might already know them.
🥬 Data Contamination (leaky test answers):
- What it is: Contamination happens when an AI has seen the test or its answers during training.
- How it works: If training data includes the same tasks or code, the AI may “remember” instead of reasoning.
- Why it matters: Without preventing this, we overestimate how smart the AI really is. 🍞 Anchor: Like practicing with the teacher’s exact test key instead of learning the material.
🍞 Hook: Many old coding tests are like single quiz questions; real software is like a semester-long project.
🥬 The Problem:
- What it is: Existing coding benchmarks focus on short tasks, risk contamination, and use simple pass/fail scoring.
- How it works: AIs look good on tiny puzzles but stumble on real, multi-step projects.
- Why it matters: We need to know if agents can truly act like helpful engineers, not just puzzle-solvers. 🍞 Anchor: It’s the difference between solving one riddle and organizing an entire school play.
🍞 Hook: Imagine reading a mystery but only getting told “You failed” without knowing which clue you missed.
🥬 The Gap:
- What it is: We lacked a clean, long-horizon benchmark with detailed, step-by-step feedback.
- How it works: We need both a) realistic tasks and b) fine-grained scoring that shows where progress stops.
- Why it matters: Without this, we can’t teach agents to plan better or fix where they actually stumble. 🍞 Anchor: A report card that shows strengths and weaknesses per subject, not just one overall grade.
🍞 Hook: If software that schedules your bus routes breaks, people miss work.
🥬 Real Stakes:
- What it is: Real-world software is complex and long-lived, so agents must plan well and avoid breaking old features.
- How it works: Agents must read requirements, set up environments, code changes, and run tests safely.
- Why it matters: Better agent skills mean safer apps, faster updates, and less downtime. 🍞 Anchor: Your favorite game gets a new level without crashing your saved files.
02 Core Idea
🍞 Hook: Imagine testing a marathon runner, not by a 10-second sprint, but by a full 42 km race with mile markers to see where they slow down.
🥬 The “Aha!” Moment:
- What it is: Create a long-horizon, contamination-controlled benchmark (LongCLI-Bench) that tests agents on realistic CLI projects with dual-set tests and step-level scores.
- How it works: 1) Curate tough, real tasks, 2) measure both new features and regressions, 3) score each step, 4) analyze where runs break.
- Why it matters: Without this, we can’t truly judge or improve agents’ planning and execution skills. 🍞 Anchor: A race with checkpoints so coaches know exactly where to train the runner harder.
🍞 Hook: Think of a playground obstacle course that tests climbing, balancing, and jumping—not just running.
🥬 LongCLI-Bench (the course):
- What it is: A benchmark of 20 big, real software tasks run through the command line.
- How it works: Tasks span four categories: build from scratch, add features, fix bugs, and refactor; each comes with a clean environment, requirements, tests, and scoring.
- Why it matters: It mirrors real engineering work, not tiny puzzles. 🍞 Anchor: Starting with a repo skeleton, wiring modules, squashing bugs, and cleaning code style—end to end.
🍞 Hook: When you build new LEGO parts onto a castle, you must also make sure old towers don’t collapse.
🥬 Dual-Set Testing Protocol (two views):
- What it is: Two test sets—Fail→Pass (F2P) checks if new requirements are met; Pass→Pass (P2P) checks that old functionality still works.
- How it works: F2P starts failing on the initial repo and should pass after the agent’s edits; P2P should pass both before and after edits.
- Why it matters: Without both, you might add cool features but secretly break existing ones. 🍞 Anchor: Adding a drawbridge (F2P) while ensuring the castle walls still stand (P2P).
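The dual-set rule above can be boiled down to a few lines of Python. This is a minimal sketch: the dict shapes and function names are illustrative, not the benchmark's actual harness API.

```python
def evaluate_submission(f2p_results, p2p_results):
    """Judge an agent's edited repo under the dual-set protocol.

    f2p_results / p2p_results map test name -> bool (True = passed),
    from running each suite on the repo AFTER the agent's edits.
    These names and shapes are illustrative, not the real harness.
    """
    new_features_met = all(f2p_results.values())  # F2P: every new-requirement test passes
    nothing_broken = all(p2p_results.values())    # P2P: every pre-existing test still passes
    # A task counts as "passed" only when BOTH conditions hold.
    return new_features_met and nothing_broken

# Example: the drawbridge works, but one castle wall fell.
print(evaluate_submission(
    {"test_drawbridge_opens": True},
    {"test_wall_north": True, "test_wall_south": False},
))  # → False
```

The key design point: a shiny new feature cannot compensate for a regression, because the two suites are AND-ed, never averaged.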
🍞 Hook: If your GPS only says “wrong route,” that’s not helpful; you want to know at which turn you messed up.
🥬 Step-Level Scoring (progress map):
- What it is: A score showing how many sub-steps are done correctly.
- How it works: Break tasks into steps, test each, and award points for each success.
- Why it matters: Without it, all failures look the same, and you can’t see early vs. late breakdowns. 🍞 Anchor: A to-do list where you can check off each box to see real progress.
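The progress map idea can be sketched directly. Assuming each sub-step has its own test and the results arrive as an ordered list of booleans (an assumption; the real scoring pipeline is richer), the score and the "where did it break?" diagnosis look like:

```python
def step_score(step_results):
    """Fraction of sub-step tests that pass.

    step_results: ordered list of booleans, one per sub-step test
    (an illustrative shape, not the benchmark's real data format).
    """
    if not step_results:
        return 0.0
    return sum(step_results) / len(step_results)

def first_stall(step_results):
    """Index of the first failing sub-step, or None if all passed.
    This is what turns a bare 'fail' into 'it broke at step 3'."""
    for i, ok in enumerate(step_results):
        if not ok:
            return i
    return None

runs = [True, True, False, False, False]
print(step_score(runs))   # → 0.4
print(first_stall(runs))  # → 2
```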
🍞 Hook: Buying a new phone from a trusted store beats finding one with unknown history.
🥬 Contamination Control (clean data):
- What it is: Avoiding pre-seen answers by using curated CS assignments and hand-crafted real workflows instead of scraping public repos.
- How it works: Rewrite names and stories, build independent tests, and isolate environments.
- Why it matters: Without clean data, you can’t be sure the AI is thinking, not just copying. 🍞 Anchor: A fresh exam written by the teacher this term, not last year’s test.
🍞 Hook: A chef needs a good kitchen, not just a recipe.
🥬 Isolated Environments (fair kitchens):
- What it is: Each task runs in its own Docker-based setup with precise dependencies.
- How it works: Build a Dockerfile, freeze versions, and run tests inside it.
- Why it matters: Without this, results vary across machines, making scores unfair. 🍞 Anchor: Baking with the same oven temperature every time so results are comparable.
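A harness might invoke each task's frozen image roughly like this. This is a hedged sketch: the image tag (`longcli/task-07:frozen`) and the `pytest -q` test command are hypothetical stand-ins, since the paper does not publish its naming scheme.

```python
import subprocess

def docker_test_command(image, test_cmd="pytest -q"):
    """Build the `docker run` invocation for a task's pinned image.
    The tag (e.g. "longcli/task-07:frozen") is hypothetical. `--rm`
    gives every attempt a clean, throwaway container with the same
    frozen dependencies baked into the image."""
    return ["docker", "run", "--rm", image, "sh", "-c", test_cmd]

def run_tests_in_container(image, test_cmd="pytest -q", timeout_s=3600):
    """Run the suite inside the container; exit code 0 means it passed."""
    proc = subprocess.run(
        docker_test_command(image, test_cmd),
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.returncode == 0, proc.stdout
```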
🍞 Hook: Sometimes a friend telling you “start with the frame first” helps you build faster.
🥬 Human-Agent Collaboration (team play):
- What it is: People guide agents with plans or checkpoints to improve success.
- How it works: Inject a high-level plan or answer agent requests with step guidance (but no direct code); limit to a few nudges.
- Why it matters: Without teamwork, agents often loop on early errors; with guidance, pass rates jump. 🍞 Anchor: A coach giving play-by-play strategy boosts the team’s score.
03 Methodology
🍞 Hook: Picture assembling a complex LEGO city: you collect sets, write a build plan, set up a clean table, follow instructions, and check each block fits.
🥬 High-Level Pipeline (overview):
- What it is: The recipe for building LongCLI-Bench from scratch.
- How it works: Input (task sources) → Data collection and filtering → Requirement document → Isolated environment + solution repo → Dual test suites + step scoring → Verification loop → Evaluation harness.
- Why it matters: Without a careful build process, tasks could be too easy, inconsistent, or unfair. 🍞 Anchor: A factory line where each station ensures quality before passing items along.
🍞 Hook: Choosing the right ingredients makes or breaks the cake.
🥬 Data Sources and Initial Codebase:
- What it is: Tasks come from 958 CS assignments and 50 real-world, multi-step workflows.
- How it works: Collect, run a strong model (Codex) to gauge difficulty, discard easy/unclear ones, keep challenging long-horizon tasks.
- Why it matters: Without tough, realistic tasks, agents look better than they truly are. 🍞 Anchor: Selecting only the hardest puzzles for a true championship.
🍞 Hook: Clear blueprints stop builders from guessing where doors go.
🥬 Requirement Document:
- What it is: A precise description of what to build and where to start.
- How it works: Spell out functional goals and entry points; rewrite names/stories to avoid lookup; hand-craft for real workflows.
- Why it matters: Without clear entry points, correct code might fail tests just due to mismatch. 🍞 Anchor: A map with “Start here” and labeled checkpoints.
🍞 Hook: Baking in the same oven avoids surprise results.
🥬 Environment and Solution Codebase:
- What it is: Each task has a Docker environment and a separate human-written solution repo.
- How it works: Iteratively solve the task while recording exact dependencies; lock them in a Dockerfile; keep solution separate to avoid bias in tests.
- Why it matters: Without reproducible environments, results vary; without a known-good solution, solvability is uncertain. 🍞 Anchor: A tested recipe and a calibrated oven.
🍞 Hook: A good exam tests understanding, not memorization tricks.
🥬 Test Suite and Scoring Design:
- What it is: Two test sets: Fail→Pass (did you meet new requirements?) and Pass→Pass (did you avoid breaking existing features?), plus step-level scoring.
- How it works: Write tests only from the requirement doc (not the solution); verify environment state (like ports/logs), not just command strings; compute step scores from sub-tests.
- Why it matters: Without P2P, agents may ship regressions; without step scores, we can’t see partial progress. 🍞 Anchor: A driving test that checks both parking a new car and still obeying old traffic rules.
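"Verify environment state, not command strings" is worth making concrete. A sketch of two such checks, assuming tests probe the running system directly (the specific port and log path would be whatever a given task defines):

```python
import socket

def port_is_listening(host, port, timeout_s=2.0):
    """State check: is something actually accepting connections here?
    Verifying observable state beats grepping the agent's command
    history — "ran `./server start`" proves nothing if the server
    crashed a second later."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

def log_contains(log_path, needle):
    """State check on a log file written by the system under test.
    (Path and marker string are illustrative, per-task choices.)"""
    try:
        with open(log_path) as f:
            return needle in f.read()
    except FileNotFoundError:
        return False
```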
🍞 Hook: Quality control catches wobbly chairs before they reach customers.
🥬 Verification and Quality Control:
- What it is: A strict loop to ensure tasks are valid and fair.
- How it works: A task is valid if F2P fails on the initial repo but passes on the solution, and P2P passes on both; iterate fixes to docs/tests/env until the condition holds; discard after repeated failures; final expert review.
- Why it matters: Without this, hidden flaws sneak in and scores become unreliable. 🍞 Anchor: Inspectors checking every bolt on a roller coaster.
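The validity condition in that loop is precise enough to state as code. A sketch, where `run_suite(repo, suite)` is a hypothetical callable returning True iff every test in the named suite passes on that repo:

```python
def task_is_valid(run_suite, initial_repo, solution_repo):
    """Validity condition from the QC loop (illustrative sketch).

    - F2P must FAIL on the initial repo (the feature really is missing)
      and PASS on the human solution (the task really is solvable).
    - P2P must pass on BOTH (existing behavior is stable throughout).
    """
    return (not run_suite(initial_repo, "f2p")
            and run_suite(solution_repo, "f2p")
            and run_suite(initial_repo, "p2p")
            and run_suite(solution_repo, "p2p"))
```

If any clause fails, the docs, tests, or environment get fixed and the check reruns; tasks that never converge are discarded.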
🍞 Hook: A race needs a starter pistol, lap tracker, and clear finish line.
🥬 Evaluation Workflow:
- What it is: The harness that runs agents and scores them.
- How it works: Initialize Docker; give requirements; let the agent plan and act via CLI; run tests at end or timeout; parse outputs for pass rate and step scores; support multiple attempts and self-correction rounds.
- Why it matters: Without a standard harness, results can’t be compared. 🍞 Anchor: Same stopwatch and rules for every runner.
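The harness loop can be sketched as follows. `agent_step` (one plan/act/observe cycle, returning True when the agent declares it is done) and `run_tests` (returning a `(passed, step_score)` pair for the repo's final state) are hypothetical hooks standing in for the real CLI agent and test runner:

```python
import time

def evaluate_attempts(agent_step, run_tests, max_seconds=3600, attempts=3):
    """Evaluation-harness loop (illustrative, not the real interface).

    Each attempt lets the agent act until it declares done or the
    clock runs out; only the FINAL environment state is scored.
    """
    results = []
    for _ in range(attempts):
        deadline = time.monotonic() + max_seconds
        done = False
        while not done and time.monotonic() < deadline:
            done = agent_step()  # one plan/act/observe cycle in the terminal
        results.append(run_tests())  # (passed, step_score) at completion or timeout
    pass_at_k = any(passed for passed, _ in results)  # Pass@3 when attempts == 3
    return pass_at_k, results
```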
🍞 Hook: Different chores show different skills.
🥬 Task Categories (engineering taxonomy):
- What it is: Four types—From Scratch, Feature Addition, Bug Fix, Refactor.
- How it works: Assess planning/setup (0→1), extending systems (N→N+1), diagnosis and repair (No→Yes), and design cleanup (A→A′).
- Why it matters: Without diversity, an agent might look great in one area but fail elsewhere. 🍞 Anchor: Building a house, adding a room, fixing leaks, and reorganizing the layout.
🍞 Hook: Subjects in school show broad mastery.
🥬 Domain Coverage (system, web, data, ML, apps, devops):
- What it is: Six domains to test varied technical skills.
- How it works: Mix OS/compiler tasks, web backends, data pipelines, ML training/eval, app logic, and DevOps setup.
- Why it matters: Real teams face all of these; agents should too. 🍞 Anchor: Math, science, history, art—balanced report card.
🍞 Hook: A coach’s hints can turn a near-miss into a win.
🥬 Self-Correction and Human-Agent Collaboration Modes:
- What it is: Optional rounds using feedback (self-correction) and human guidance (plan injection, interactive help).
- How it works: Self-correction reruns with test feedback; plan injection seeds a high-level roadmap; interactive guidance lets humans nudge steps when the agent asks—no code given.
- Why it matters: Without these modes, we can’t study how agents learn from feedback or benefit from teamwork. 🍞 Anchor: Reviewing game footage and getting a better playbook before the rematch.
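The self-correction mode described above is a simple feedback loop. A sketch, assuming `attempt_fix(feedback)` reruns the agent with the failure report and `run_tests()` returns `(passed, feedback_text)` — both hypothetical hooks:

```python
def self_correct(attempt_fix, run_tests, max_rounds=3):
    """Self-correction loop sketch: rerun the agent with test feedback.

    Mirrors the study's setup: each round hands the agent the failing
    tests' output, capped at a few rounds (returns diminish by round 3).
    """
    passed, feedback = run_tests()
    rounds = 0
    while not passed and rounds < max_rounds:
        attempt_fix(feedback)  # agent edits the repo using the failure report
        passed, feedback = run_tests()
        rounds += 1
    return passed, rounds
```

Plan injection and interactive guidance differ only in where the extra signal comes from: a human-written roadmap up front, or human step-level answers mid-run, instead of raw test output.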
Secret Sauce:
- Carefully filtered, contamination-controlled, long-horizon tasks; dual-set tests that catch regressions; and step-level scoring that reveals where the chain breaks. Together, these turn a blurry pass/fail into a clear, actionable performance map.
04 Experiments & Results
🍞 Hook: If you only look at the final scoreboard, you miss which quarter your team fell behind. Checkpoints matter.
🥬 The Test (what was measured):
- What it is: Pass Rate (both F2P and P2P must fully pass), plus F2P and P2P step-level scores. Also Pass@3 and runtime.
- How it works: Run agents on 20 long-horizon tasks in identical Docker environments; average over three runs with a unified system prompt.
- Why it matters: Without detailed metrics, we can’t tell whether poor results come from missing new features, breaking old ones, or getting stuck early. 🍞 Anchor: A basketball box score with points per quarter, not just the final tally.
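Aggregating those metrics over tasks is straightforward. A sketch, assuming per-task results arrive as dicts with a boolean `passed` plus F2P/P2P step scores in [0, 1] (an illustrative shape, not the paper's data format):

```python
def summarize(task_results):
    """Average the reported metrics over tasks.

    Each entry: {"passed": bool, "f2p_step": float, "p2p_step": float}.
    Pass rate counts only fully passed tasks (both suites at 100%),
    while step scores credit partial progress.
    """
    n = len(task_results)
    return {
        "pass_rate": sum(t["passed"] for t in task_results) / n,
        "f2p_step": sum(t["f2p_step"] for t in task_results) / n,
        "p2p_step": sum(t["p2p_step"] for t in task_results) / n,
    }
```

This is exactly why a high average P2P step score (>98%) can coexist with a sub-20% pass rate: partial credit accumulates in the step columns but never in `pass_rate`.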
🥬 The Competition (who ran):
- What it is: Commercial CLI assistants (Codex with GPT-5.x-Codex series; Claude Code with Claude-Opus/Sonnet) and OpenHands using top open-source models (DeepSeek-V3.1, GLM-4.6, Qwen3-235B-A22B).
- How it works: Same harness, same prompt, and three attempts per setting.
- Why it matters: Fair comparisons show where strengths and weaknesses truly lie. 🍞 Anchor: All runners on the same track, same weather.
🥬 The Scoreboard (contextualized):
- What it is: Overall pass rates are under 20% for all systems; best single-turn pass rate around 16.7% (e.g., Claude-Opus-4.6).
- How it works: Although P2P step scores are very high on average (>98%), P2P pass rates drop to about 70–88% whenever edits are complex, revealing regressions.
- Why it matters: That’s like getting an A on easy drills but losing points when the real game gets tough. 🍞 Anchor: An 87% step score might feel high, but if it’s not 100% on both sets, the task doesn’t count as “passed.”
🥬 Surprising Findings:
- Early Bottlenecks: Most failures happen before 30% F2P completion. Agents often stall early due to planning, environment setup, or misdiagnosis.
- Self-Correction Helps (a bit): Rerunning with feedback improves pass rates but has diminishing returns by the third round.
- Regression Risk: As agents expand changes late in self-correction, some P2P pass rates dip—more edits can mean new breakage.
- Human Help Works Best: Plan injection and interactive guidance significantly raise pass rates (e.g., to about 50–61.7% in tested pairs) while reducing wasted effort. 🍞 Anchor: A coach’s halftime adjustments can swing the game more than just “trying harder” with the same bad plan.
🥬 Step-Level Distribution (where they stumble):
- What it is: A large share of runs land in the 0–30% F2P bracket across agents.
- How it works: This indicates trouble at the very start (setup, dependency install, entry points).
- Why it matters: Fixing the first mile of the marathon may unlock everything after. 🍞 Anchor: If you start a puzzle with the wrong corner pieces, you never build the picture.
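The distribution analysis amounts to bucketing runs by F2P step score. A sketch with illustrative bracket edges (the 0–30% bracket matches the bottleneck discussed above; the finer cut points are an assumption):

```python
from collections import Counter

def f2p_bracket(score):
    """Coarse progress bracket for a run's F2P step score in [0, 1].
    Bracket edges are illustrative."""
    if score < 0.3:
        return "0-30%"
    if score < 0.6:
        return "30-60%"
    if score < 1.0:
        return "60-99%"
    return "100%"

# Toy distribution: three of five runs stall in the first bracket.
histogram = Counter(f2p_bracket(s) for s in [0.05, 0.10, 0.25, 0.45, 1.0])
print(histogram["0-30%"])  # → 3
```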
Runtime Notes:
- Self-correction doesn’t always cost more time; effects vary by agent. Gains are real but smaller in later rounds. Human guidance costs some interaction but pays off in higher pass rates and fewer dead-end loops.
05 Discussion & Limitations
🍞 Hook: Even great teams have weak spots; the key is to know them and train smarter.
🥬 Limitations (be specific):
- What it is: Manual curation is heavy—about 40 hours per task—so the dataset is relatively small (20 tasks) though deep.
- How it works: Human-crafted requirements, environments, and tests ensure quality but slow scaling; step scores don’t judge code quality or efficiency.
- Why it matters: We still need broader coverage and more dimensions (style, performance, security) in future versions. 🍞 Anchor: A gourmet menu with few dishes—excellent, but not a buffet yet.
🥬 Required Resources:
- What it is: Dockerized environments, testing harness, and agent frameworks.
- How it works: You need compute to run terminal agents through long workflows; consistent Docker layers keep results fair.
- Why it matters: Without the right kitchen and tools, the recipe can’t be followed. 🍞 Anchor: You can’t bake bread without an oven.
🥬 When NOT to Use:
- What it is: Quick checks of tiny functions, teaching beginner syntax, or ultra-fast leaderboard sprints.
- How it works: For those, simpler function-level benchmarks are fine.
- Why it matters: LongCLI-Bench shines on long, realistic pipelines—not micro-snippets. 🍞 Anchor: Don’t use a marathon to pick a sprinter.
🥬 Open Questions:
- What it is: How to scale curation without losing quality? How to measure code safety, performance, and maintainability? How to build agents that reason about environment state robustly? What are best practices for human-agent teamwork?
- How it works: Explore semi-automated task generation, richer metrics (latency, memory, regression depth), better planning modules, and collaboration protocols.
- Why it matters: These answers will turn agents from clever coders into dependable software teammates. 🍞 Anchor: From talented rookies to seasoned pros—training plans matter.
06 Conclusion & Future Work
🍞 Hook: When you want to know if a bridge is safe, you test it under real traffic, not just toy cars.
🥬 3-Sentence Summary:
- What it is: LongCLI-Bench is a contamination-aware benchmark of 20 large, realistic, long-horizon CLI tasks with dual-set tests (Fail→Pass and Pass→Pass) and step-level scoring.
- How it works: It evaluates whether agents can add new features without breaking old ones and pinpoints where multi-step workflows fail.
- Why it matters: Today’s agents pass under 20% of tasks, but human planning and guidance greatly boost success, revealing where to focus future research. 🍞 Anchor: A marathon course with mile markers shows not just who finishes, but where runners need better training.
Main Achievement:
- Turning pass/fail into a detailed progress map on long, real engineering tasks—exposing early-stage bottlenecks and regression risks.
Future Directions:
- Scale tasks while preserving quality; add metrics for performance, security, and maintainability; improve agent planning/execution; refine human-agent collaboration protocols (e.g., plan injection standards).
Lasting Impact:
- By testing what truly matters in real software work—planning, environment grounding, and non-breaking changes—LongCLI-Bench sets a higher bar and a clearer path for building agents we can trust on real teams.
Practical Applications
- Evaluate new AI coding agents on realistic, multi-step projects before using them in production.
- Design training curricula that target early-stage failures (environment setup, dependency management, entry points).
- Adopt dual-set testing in your CI to catch regressions whenever adding features.
- Use step-level scoring in internal tooling to pinpoint where long workflows break and prioritize fixes.
- Pilot human-agent collaboration protocols (plan injection and interactive guidance) to boost real-world success.
- Harden development environments with Docker to ensure reproducible runs for agents and humans.
- Benchmark different prompts and scaffolds (e.g., planning-first) to improve agent reliability on your codebase.
- Create internal requirement documents with explicit entry points to reduce agent confusion and false negatives.
- Track P2P pass rates during refactors to minimize breaking changes in legacy systems.
- Use self-correction loops sparingly and pair them with plan reviews to prevent thrashing and new regressions.