CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion
Key Summary
- CLI-Gym is a new way to create lots of realistic computer-fixing tasks for AI by safely breaking and then repairing software environments inside containers.
- Instead of waiting for rare, human-written problems, an agent starts from a healthy setup and deliberately makes careful changes until some tests fail, turning that broken state into a training task.
- This process is called agentic environment inversion: it flips the usual direction of coding by moving from a good state to a bad one on purpose to learn how to get back.
- The team generated 1,655 environment-intensive CLI tasks from 29 popular GitHub projects—about 20× more than prior human-built sets.
- They also recorded 291 clean “how I fixed it” trajectories and used them to fine-tune models named LiberCoder.
- LiberCoder-32B and LiberCoder-235B-A22B reached 38.9% and 46.1% Pass@1 on Terminal-Bench 1.0, beating many bigger open models.
- The key is execution feedback: the agent tries commands, sees errors and test results, and learns from that loop to craft useful training tasks.
- More diverse repositories and higher-quality trajectories mattered more than just having more data.
- This is the first public pipeline that scales up environment-intensive tasks for agents working in command-line interfaces.
Why This Research Matters
Software often breaks because of environment issues, not just code mistakes. CLI-Gym gives AI agents a safe, repeatable way to practice diagnosing and fixing real system problems using containers, commands, and tests. This means faster bug fixes, stronger CI pipelines, and fewer “works on my machine” disasters. Teams can train smaller, cheaper models to compete with larger ones by feeding them targeted, high-quality repair trajectories. For developers and companies, that translates to lower costs, more reliable deployments, and happier users. Ultimately, it moves AI from being a code typist to being a true system troubleshooter.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a bike shop doesn’t just build bikes—they road-test them, tighten bolts, and fix weird rattles that only show up when you actually ride? Real software is like that. It doesn’t just need code; it needs the right tools, libraries, and settings to run smoothly.
🥬 The Concept (Command Line Interfaces): What it is: A command line interface (CLI) is a text window where you type commands to control a computer. How it works:
- You type a command like `pip install something`.
- The computer runs it and prints messages.
- You read the output to decide what to do next.
Why it matters: Without a CLI, agents can’t actually install things, run tests, or change settings—they’d be stuck just talking, not doing.
🍞 Anchor: When an AI runs `pytest` in a terminal and sees which tests fail, it’s using the CLI the same way a developer would.
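The read-the-output loop above can be sketched in a few lines. This is a minimal illustration, assuming Python's standard `subprocess` module rather than any tooling from the paper:

```python
import subprocess
import sys

def run_command(cmd):
    """Run a shell command and return (exit_code, combined output) --
    the two signals a CLI agent reads after every step."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

# The agent inspects the exit code and output to decide its next step.
# Here we run a trivial command that just prints a message.
code, output = run_command([sys.executable, "-c", "print('2 tests passed')"])
```

An exit code of 0 means success; anything else signals an error worth reading about in the output.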
🥬 The Concept (Dockerfiles): What it is: A Dockerfile is a recipe for building a clean, repeatable computer environment. How it works:
- Start from a base image (like “Ubuntu”).
- Add steps: install Python, copy code, set variables.
- Build an image and run it in a container—like a fresh kitchen every time. Why it matters: Without Dockerfiles, everyone’s environment is slightly different, so results become messy and unreliable. 🍞 Anchor: If you can rebuild the same Dockerfile on any laptop and the tests pass the same way, the setup is solid.
🥬 The Concept (Unit Tests): What it is: Unit tests are tiny checkups that confirm each piece of software behaves correctly. How it works:
- Each test asks a small question (e.g., “Does sorting work?”).
- Run tests; they pass or fail.
- Failing tests point to what broke. Why it matters: Without unit tests, you don’t know whether your changes helped or hurt. 🍞 Anchor: If a library update accidentally changes behavior, unit tests catch it before users do.
🥬 The Concept (Execution Feedback): What it is: Execution feedback is the live response the system gives after each command—errors, logs, and test results. How it works:
- The agent runs a command (like installing a package).
- It reads errors or success messages.
- It decides the next step based on that feedback.
Why it matters: Without feedback, agents would be guessing in the dark and repeat mistakes.
🍞 Anchor: If `pip install` says “version conflict,” the agent switches versions instead of trying the same thing again.
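That version-switching behavior can be simulated in a few lines. `fake_install` below is a toy stand-in for `pip install` (the version numbers are invented), but the loop shows the essential pattern: read the feedback, then change the action:

```python
def fake_install(package, version):
    """Toy resolver standing in for `pip install`: only one version fits."""
    if version == "2.1":
        return "ok"
    return f"ERROR: version conflict for {package}=={version}"

def install_with_feedback(package, candidates):
    """Try versions in order, reading the output after each attempt,
    instead of blindly retrying the same command."""
    for version in candidates:
        output = fake_install(package, version)
        if "ERROR" not in output:
            return version  # feedback says this version works
        # feedback says conflict: move on to the next candidate
    return None

resolved = install_with_feedback("somepkg", ["3.0", "2.5", "2.1"])
```

Without the feedback check, the loop would be guessing in the dark; with it, the agent converges on a working version.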
🍞 Hook: Imagine two kinds of homework. One is writing an essay (mostly about your words). The other is running a science experiment (lots of setup with equipment). Both are work, but they’re different.
🥬 The Concept (Environment-intensive Tasks): What it is: These are tasks where most of the challenge is fixing or configuring the computer environment, not writing a lot of new code. How it works:
- Prepare the container (right OS, Python, packages).
- Hit a problem (missing library, wrong path, permissions).
- Use commands and configs to fix it until tests pass.
Why it matters: Without these tasks, agents won’t learn real-world troubleshooting skills beyond code typing.
🍞 Anchor: Diagnosing why `import openssl` fails on Linux isn’t about code edits—it’s about packages, paths, and system libraries.
🍞 Hook: Think of coding like cooking. Sometimes you write a recipe (code). Other times you fix the oven (environment). Both matter.
🥬 The Concept (Agentic Coding): What it is: Agentic coding is when an AI acts like a developer—reading, running, installing, editing, and testing in loops. How it works:
- Observe: read errors and tests.
- Plan: choose commands or edits.
- Act: execute steps in the terminal or files.
- Check: run tests again. Why it matters: Without this loop, the AI can’t improve or verify its fixes. 🍞 Anchor: An AI that edits a config, runs tests, and reverts if things get worse is doing agentic coding.
The World Before: AI was getting good at writing code for known tasks (like SWE-bench) because Git histories and pull requests give tons of examples. But when it came to environment problems—package conflicts, broken system libs, Docker issues—there wasn’t much public data. Terminal-Bench showed that even huge models struggled to fix real command-line problems, often solving under 40%.
The Problem: There was no scalable, public way to build lots of environment-intensive tasks. Code has version control. Environments don’t—they’re messy, personal, and rarely recorded.
Failed Attempts: People tried small, human-written task sets and one-shot LLM-generated Dockerfiles. Those didn’t scale, and single-pass generation without feedback often missed realistic failure modes.
The Gap: We needed a way to create many, diverse, reproducible environment problems with clear verification, all from real projects.
Real Stakes: In real life, software breaks because of environments—a mismatched CUDA, a missing library, a path change. Teaching agents to handle that means faster CI pipelines, more reliable deployments, and fewer “it works on my machine” headaches.
02 Core Idea
🍞 Hook: Imagine learning to fix a LEGO model by first gently breaking it in different ways, so you practice putting it back together. You don’t wait for accidents—you create safe practice problems.
🥬 The Concept (Agentic Environment Inversion): What it is: Agentic environment inversion means starting from a healthy setup and deliberately (but safely) making changes until some tests fail, then turning that broken state into a training task. How it works:
- Begin with a Dockerized repo where all unit tests pass (gold state).
- An agent runs commands with feedback to nudge the environment into failure (poor state).
- Record what changed (as a Dockerfile) and which tests failed.
- Package the result as a realistic task the next agent must fix. Why it matters: Without inversion, we wait for rare real-world failures or write them by hand; with inversion, we can scale and diversify tasks responsibly. 🍞 Anchor: If the agent changes a library version so tests break, we save that exact change and the failing tests as a new challenge.
The Aha! Moment in One Sentence: If repos don’t store environment histories, have an agent simulate those histories by reversing the usual direction—move from good to broken—so we can learn to get back.
Multiple Analogies:
- Reverse cooking: Bake the perfect cake, then vary oven temperature or swap sugar types to see which mistakes cause which flops—now you know how to fix them.
- Map-making: If the goal is to navigate home, first explore side roads that get you lost (safely), then record those routes so others can practice getting back.
- Sports drills: Don’t wait for a random bad pass in a game; practice specific tricky passes and recoveries so you master them.
Before vs After:
- Before: Few environment tasks, hand-written, limited diversity; big models still stumble.
- After: Thousands of reproducible tasks built from real repos; agents practice on many failure types and improve fast.
🥬 The Concept (Simulation of Environment Histories): What it is: This is reenacting how an environment could have changed over time by applying command sequences and observing outcomes. How it works:
- Start at a known good image.
- Apply realistic changes (installs, uninstalls, path tweaks).
- Watch tests to see which changes matter.
- Save the path taken so it’s repeatable. Why it matters: Without simulation, we can’t study “how we got here” or reproduce bugs reliably. 🍞 Anchor: The recorded Dockerfile acts like a time machine—you can replay exactly how the environment got broken.
Why It Works (Intuition, no equations):
- Feedback loop: Actions → errors/tests → refined actions. This makes the agent explore targeted, plausible failures instead of random chaos.
- Ground truth: Unit tests provide a clear signal for “this is broken/this is fixed,” turning subjective guesses into objective tasks.
- Reproducibility: Dockerfiles encode the exact steps, so others can rebuild the same problem reliably.
Building Blocks:
- Gold state (all tests pass) as the starting point.
- Inversion prompts derived from unit tests to guide where to poke.
- Agent with a rich CLI action space and execution feedback.
- Automatic task packaging: Dockerfile + failing tests + a natural-language issue.
🥬 The Concept (Task Derivation): What it is: Turning a broken environment plus its symptoms (failing tests and errors) into a clean, reusable training task. How it works:
- Detect which tests fail.
- Summarize errors into a short issue statement.
- Bundle the reproducible Dockerfile and tests. Why it matters: Without clean task packaging, you can’t train or fairly evaluate agents. 🍞 Anchor: Each CLI-Gym task is a zip of the environment, the problem description, and the tests that prove it’s fixed.
🥬 The Concept (Pass@1): What it is: Pass@1 is the percent of tasks an agent solves on the first try. How it works:
- Run the agent once per task.
- Check if verification scripts pass.
- Count successes and divide by total tasks. Why it matters: It shows how dependable the agent is when it only gets one shot. 🍞 Anchor: Scoring 46.1% Pass@1 is like getting nearly half your pop-quiz questions right on the first attempt in a very tough class.
03 Methodology
At a high level: GitHub repo → Build gold environment → Generate inversion prompts → Agent explores and breaks → Tests fail → Package task (Dockerfile + issue + tests) → Use tasks to train agents → Evaluate on Terminal-Bench.
🥬 The Concept (Docker Image/Container): What it is: A Docker image is a frozen blueprint; a container is that blueprint running. How it works:
- Build image from a Dockerfile recipe.
- Run image to get a container (a fresh, isolated room).
- Do all experiments inside the container. Why it matters: Without containers, experiments leak across machines and are hard to repeat. 🍞 Anchor: If a task fails on your laptop and mine the same way in containers, we’re seeing the same bug, not two different computers.
Step-by-step recipe:
- Construct the gold instance.
- What happens: From a chosen repo (e.g., pandas), start with a base image, install dependencies, run all unit tests until everything passes. Save as the gold state.
- Why this exists: We need a clean starting point so we know any new failures came from our controlled changes.
- Example: Build `task-pandas:latest`; confirm 100% of tests pass.
- Generate inversion prompts from unit tests.
- What happens: Sample a subset of tests and ask an LLM to propose disruption directions that could realistically make those tests fail (e.g., dependency misconfig, path issues), while tracking past tasks for diversity.
- Why this exists: It steers the agent toward meaningful, varied failures—not random destruction.
- Example: For tests touching SQLite, suggest scenarios that might break dynamic linking or version expectations.
- Agentic environment inversion (the exploration loop).
- What happens: The agent runs commands in the CLI (install/uninstall, edit configs, change permissions, tweak env vars), constantly reading execution feedback and adjusting.
- Why this exists: Only a live, feedback-driven loop can discover realistic, reproducible failure modes.
- Example: The agent might adjust library versions or environment variables until specific tests start failing.
- Check tests and crystallize a broken state.
- What happens: After the agent stops, run the selected tests. If at least one fails, mark the environment as a valid poor state. If all pass, discard this attempt.
- Why this exists: Unit tests convert vague “looks broken” into objective ground truth.
- Example: Two tests fail with ImportError and one times out—this is now a valid instance.
- Summarize the degradation as a Dockerfile.
- What happens: The agent writes out a minimal Dockerfile snippet encoding the exact commands it executed to reach the poor state.
- Why this exists: Others must be able to rebuild the same broken environment deterministically.
- Example: The snippet pins a specific package version and sets an environment variable that triggers the failure.
- Auto-generate a natural-language issue.
- What happens: Using the failing tests and logs, an LLM writes a short, user-style problem statement (optionally with a hint), without giving away the fix.
- Why this exists: Tasks should look like real bug reports developers see.
- Example: “Several I/O-related unit tests are failing after recent environment changes. Please investigate and repair the environment so the tests pass.”
- Package the task instance.
- What happens: Bundle (i) the executable environment (base + Dockerfile), (ii) the issue text, and (iii) the fail-to-pass tests.
- Why this exists: Standardization makes training, sharing, and evaluation easy.
- Example: A single folder contains the Dockerfile snippet, run-tests.sh, and task.yaml.
- Collect repair trajectories.
- What happens: Use strong models (via OpenHands) to attempt repairs, record successful step-by-step solutions, and filter out trivial or “cheating” fixes.
- Why this exists: High-quality repair demonstrations are powerful supervision for training.
- Example: Keep trajectories with thoughtful diagnosis and multi-step recovery; drop ones that exploit cached artifacts.
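The final filtering step above can be sketched as a simple pass over recorded runs. The thresholds and field names here are illustrative heuristics, not the paper’s actual filter:

```python
def filter_trajectories(trajectories):
    """Keep only repair demonstrations worth training on: successful,
    non-trivial, and not exploiting cached artifacts."""
    kept = []
    for traj in trajectories:
        if not traj["solved"]:
            continue  # only successful repairs become supervision
        if traj["num_steps"] < 3:
            continue  # likely a trivial or accidental fix
        if traj.get("used_cached_artifacts"):
            continue  # "cheating" fixes teach nothing about diagnosis
        kept.append(traj)
    return kept

raw = [
    {"solved": True, "num_steps": 42, "used_cached_artifacts": False},
    {"solved": True, "num_steps": 1, "used_cached_artifacts": False},
    {"solved": False, "num_steps": 30, "used_cached_artifacts": False},
    {"solved": True, "num_steps": 25, "used_cached_artifacts": True},
]
clean = filter_trajectories(raw)
```

This mirrors the paper’s 417-to-291 reduction in spirit: most raw successes survive, but the ones that would teach bad habits are dropped.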
🥬 The Concept (Gold vs Poor States): What it is: Gold means “all tests pass.” Poor means “at least one fails.” How it works:
- Start at gold.
- Move to poor by controlled changes.
- Later, train agents to go back to gold. Why it matters: Clear bookends turn messy debugging into a learnable path. 🍞 Anchor: A task begins poor (red tests) and ends gold (green tests) when solved.
🥬 The Concept (Trajectory): What it is: A trajectory is the full, ordered story of an agent’s actions and observations from start to finish. How it works:
- Observe feedback.
- Choose an action.
- Repeat until done. Why it matters: Without trajectories, models can’t learn the strategies humans use to debug. 🍞 Anchor: A good trajectory might be 30–60 careful steps that install, check, edit, and retest.
Secret Sauce:
- Inversion with feedback: The agent doesn’t guess once; it iterates with test-verified signals.
- Dockerfile replay: Every broken state is reproducible and portable.
- Unit-test guidance: Tests aim the search at semantically meaningful failures.
- Diversity memory: Prompting avoids generating the same kind of task over and over.
Concrete data in action:
- 29 Python repos → 1,655 CLI tasks.
- 417 raw successes → 291 filtered, high-quality repair trajectories.
- Training on those 291 led to large gains on Terminal-Bench.
04 Experiments & Results
🥬 The Concept (Terminal-Bench): What it is: Terminal-Bench is a tough set of real CLI challenges where agents must fix environment and system issues under verification. How it works:
- Each task sets up a container with a problem.
- The agent interacts only via terminal tools.
- Success means the official verification scripts pass. Why it matters: It measures real troubleshooting, not just code writing. 🍞 Anchor: If your agent can pass Terminal-Bench, it can likely fix many “works on my machine” errors in the wild.
The Test: Train on CLI-Gym trajectories and measure Pass@1 and Pass@3 on Terminal-Bench 1.0 and 2.0 using the OpenHands agent framework. Also compare to strong baselines, both closed and open weights.
The Competition: Baselines include Qwen3 families, Kimi-K2, GLM-4.6, Minimax, and well-known closed models like Claude and GPT series (using their own agents). Some entries use specialized agents (e.g., Terminus) while we standardize on OpenHands.
Scoreboard with Context:
- LiberCoder-32B: 38.9% Pass@1 on Terminal-Bench 1.0. That’s like jumping from a D to a solid B compared to its starting point (10.3%), and it even beats some 480B+ models.
- LiberCoder-235B-A22B: 46.1% Pass@1 on v1.0, a +21.1 point boost from its base; 31.0% on v2.0 (+12.9). Think of it as moving into the top tier of open models for v1.0 with a relatively small, targeted training set.
- Pass@3 also rises notably, showing that with a few tries the agent can often course-correct.
Surprising Findings:
- Quality > Quantity: Using 291 filtered, high-quality repair trajectories outperformed training on more, but noisier, data once the model had basic agentic skills.
- Diversity wins: With a fixed number of trajectories, spreading them across more repositories gave steady gains. Different repos trigger different failure modes—great for generalization.
- Diminishing returns: Gains flatten past ~200 trajectories, hinting that variety and clarity of supervision matter more than raw count.
- Behavior change: Better training reduced “stuck in loops” failures from about 43% down to 3% in scaling studies—agents became more decisive and less repetitive.
Ablations (what moved the needle):
- Pretraining on generic SWE trajectories helps initialize agentic habits (navigating repos, editing, testing).
- Training solely on CLI-Gym data yields even bigger jumps, since it teaches environment-centric skills.
- Combining both is best: general code instincts + specialized environment troubleshooting.
Category-wise effects:
- Big improvements in software engineering, system administration, security, debugging, and file operations.
- Harder areas like gaming and scientific computing remain trickier and under-addressed by current data.
Takeaway: With only 291 clean, successful repair stories and 1,655 realistic tasks, a 32B model reaches near-40% Pass@1 on v1.0 and a 235B model hits 46.1%—evidence that targeted, reproducible environment practice beats naive parameter scaling for CLI troubleshooting.
05 Discussion & Limitations
Limitations:
- Repo and test dependence: If a repository lacks good unit tests, it’s harder to aim the inversion at meaningful failures.
- Coverage gaps: Some domains (e.g., scientific computing with special hardware, certain OS-specific subtleties) appear less frequently and remain hard.
- Agent framework sensitivity: Results vary with the agent harness; OpenHands is general, but specialized agents can score higher.
- Long-context strain: Better agents explore more and can hit context limits, causing late-run mistakes.
Required Resources:
- Container runtime (Docker/Compose) and compute to build images and run tests.
- An LLM agent capable of iterative terminal use and reading logs.
- Storage for base images and generated instances (CLI-Gym is relatively storage-efficient).
When NOT to Use:
- Pure algorithmic tasks that don’t involve environments (no need for inversion overhead).
- Highly specialized, non-containerizable systems (e.g., tightly coupled hardware drivers without emulation).
- Situations requiring strong security isolation from any environment modifications beyond controlled containers.
Open Questions:
- How to expand beyond Python and Linux into more OSes and mixed-language stacks without exploding complexity?
- How to synthesize safe, hardware-accelerated failures (CUDA, ROCm) reproducibly?
- Can we auto-tune task difficulty so agents get a perfect curriculum from easy to expert?
- How to better handle long horizons without hitting context limits—summarization, memory, or new agent designs?
- Can we turn more real CI failure logs into starting points for inversion to mirror production incidents even more closely?
06 Conclusion & Future Work
Three-sentence summary: CLI-Gym flips the usual repair story: start from a healthy environment, use an agent to safely induce realistic failures with feedback, and package the results into reproducible CLI tasks. With 1,655 tasks and 291 high-quality repair trajectories, fine-tuned models (LiberCoder) gain big, reliable boosts on Terminal-Bench, surpassing many larger open models. This is the first public, scalable pipeline for environment-intensive tasks, unlocking broader progress in agentic troubleshooting.
Main achievement: Turning environment data scarcity into a solvable problem by simulating environment histories via agentic inversion—and proving it works at scale with strong benchmark results.
Future directions:
- Broaden language/OS coverage and include GPU/accelerator environments.
- Smarter curricula and difficulty control to avoid plateaus.
- Memory- and context-optimized agents for long, complex sessions.
- Blend in real CI/CD incidents to mirror production even more closely.
Why remember this: Because most real-world breakages aren’t about writing more code—they’re about fixing the world the code lives in. CLI-Gym finally gives agents a big, realistic playground to practice that and get measurably better.
Practical Applications
- Train internal code agents to fix CI failures by reproducing and repairing environment breakages in containers.
- Augment onboarding for engineers with realistic, replayable troubleshooting drills drawn from your own repos.
- Stress-test deployment pipelines by safely generating failure cases (version conflicts, path issues) and verifying auto-remediation.
- Benchmark different agent frameworks on the same reproducible environment tasks to pick the best one for your stack.
- Build a curriculum for AI assistants that starts with easy environment issues and scales to multi-step, system-level debugging.
- Continuously harvest repair trajectories from successful runs to steadily fine-tune your in-house agent.
- Use inversion to mirror real incidents from logs, turning them into training tasks that improve mean time to recovery.
- Validate infrastructure changes (e.g., Python or OS upgrades) by simulating likely failure modes before rollout.
- Create a shared library of Dockerfile-encoded failure scenarios so teams can practice and standardize recovery playbooks.