
On Data Engineering for Scaling LLM Terminal Capabilities

Intermediate
Renjie Pi, Grace Lam, Mohammad Shoeybi et al. · 2/24/2026
arXiv

Key Summary

  • This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
  • The authors build Terminal-Task-Gen, a two-part pipeline: adapt existing math/code/SWE prompts into terminal tasks, and generate new terminal tasks from seeds and a skill list.
  • They release a large dataset (Terminal-Corpus) and fine-tune Qwen3 base models into Nemotron-Terminal models specialized for terminal use.
  • On the tough Terminal-Bench 2.0, Nemotron-Terminal-8B jumps from 2.5% to 13.0%, the 14B from 4.0% to 20.2%, and the 32B from 3.4% to 27.4%.
  • Skill-based synthetic tasks are the biggest driver of gains; combining different data sources beats any single source alone.
  • Keeping imperfect (even failed) trajectories helps models learn to recover from mistakes; strict filtering actually hurts performance.
  • Longer context windows didn’t help here; most useful supervision fits in standard context, and longer tails were noisier.
  • Curriculum learning (easy first, hard later) didn’t outperform simply mixing all data together during training.
  • Using a small set of pre-built Docker images makes generating and running many tasks far more scalable than building a new image per task.
  • The team open-sourced the models and most datasets to help the community build better terminal agents.

Why This Research Matters

Better terminal agents can save huge amounts of time for developers, data scientists, and IT teams by automating setup, data processing, and debugging. With trustworthy tests and safe containers, these agents can perform real, checkable work rather than just suggesting ideas. Smaller, well-trained models that perform like larger ones reduce costs, making powerful tools more accessible to startups, classrooms, and researchers. The approach also teaches recovery from failures, which is key to real-world reliability. Open-sourcing models and datasets speeds up community progress and encourages reproducible research. Over time, this can lead to assistants that maintain projects, enforce best practices, and keep systems healthy around the clock.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how learning to ride a bike on a smooth driveway is different from riding through a busy park? The second one needs many small skills working together—balance, brakes, looking around, and quick decisions.

🥬 Filling (The Actual Concept): What it is: Teaching AI to use the computer terminal (the text window where you type commands) is like the busy park—models must combine lots of small skills in long, careful steps. How it works: 1) The model reads a goal (like “install a package and run tests”). 2) It tries commands one by one. 3) It watches the terminal’s responses. 4) It adjusts, debugs, and keeps going until tests pass. 5) Its steps are recorded as a “trajectory” (a play-by-play). Why it matters: Without practice data that looks like real terminal work, models guess or get stuck. They might know code, but freeze when a dependency fails or a file path is wrong.

🍞 Bottom Bread (Anchor): Imagine a task: “Create solution.py, read input.csv, compute stats, and save results.json.” The model needs to navigate files, install Python packages, run tests, and fix errors—just like a real developer.
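The loop above (try a command, read the terminal's response, adjust, record everything) can be sketched in a few lines of Python. This is a toy illustration of what a recorded "trajectory" might look like, not the paper's actual data format.

```python
import subprocess

def run_step(command, trajectory):
    """Run one shell command and record the observation as a trajectory step."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    trajectory.append({
        "command": command,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    })
    return result.returncode == 0

# A tiny "goal": create a file, then read it back to verify.
trajectory = []
run_step("echo 'hello' > greeting.txt", trajectory)
succeeded = run_step("cat greeting.txt", trajectory)
```

Each step captures both the command and what the terminal said back, so a failed step (nonzero exit code, error text in stderr) is recorded just as faithfully as a successful one.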

🍞 Top Bread (Hook): Imagine taking a cooking quiz with only multiple-choice questions, then suddenly being asked to cook a full dinner in a real kitchen.

🥬 Filling (Terminal-Bench 2.0): What it is: A standard test that checks whether AI agents can actually complete end-to-end tasks in a real terminal. How it works: 1) Each task has an instruction, 2) a Docker container (a safe mini-computer, the "kitchen" in this analogy), 3) tests that auto-check success, and 4) an oracle solution for humans to verify. 5) An agent interacts step-by-step using an approved JSON format to send keystrokes. Why it matters: Without such a test, we’d only know if models “sound right,” not if they can truly finish the job.

🍞 Bottom Bread (Anchor): A task might say: “Compile this C project, fix the build error, and run tests.” Terminal-Bench 2.0 checks if you really fixed it by running a scripted test.
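The four pieces described above can be sketched as a simple record. The field names in this dict are our own illustration, not the benchmark's actual schema:

```python
# Illustrative sketch of the pieces a Terminal-Bench-style task bundles.
# Field names are invented for this example, not the benchmark's real schema.
task = {
    "instruction": "Compile this C project, fix the build error, and run tests.",
    "docker_image": "example/build-env:latest",  # the safe, isolated environment
    "test_command": "pytest tests/ -q",          # scripted check of success
    "oracle_solution": "patch Makefile, then run make && pytest",  # human-verifiable
}

def is_verifiable(task):
    """A task is checkable only if it ships both an environment and tests."""
    return bool(task.get("docker_image")) and bool(task.get("test_command"))
```

The key property is the last one: a task without an environment and an automatic test cannot give a trustworthy pass/fail signal.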

🍞 Top Bread (Hook): Have you ever tried a puzzle where you only get hints, not the whole answer key?

🥬 Filling (The Problem): What it is: People didn’t reveal the secret recipes of data that train great terminal agents, leaving everyone else guessing. How it works: 1) Researchers tried fancy “agent scaffolds” (like extra tools) to boost performance. 2) Others wrapped existing datasets with terminal adapters. 3) Some built multi-agent factories to generate new tasks. Why it matters: These approaches were either hard to scale, mismatched to real terminal workflows, or too costly. The field needed a data strategy that was scalable, targeted, and practical.

🍞 Bottom Bread (Anchor): It’s like wanting more kids to learn biking in parks, but only having smooth-driveway practice sheets or a complicated training machine that few can afford.

🍞 Top Bread (Hook): Think of building legos with two buckets: one full of classic bricks and one full of special shapes for tricky corners.

🥬 Filling (The Gap and Idea): What it is: The paper proposes combining two data buckets—dataset adaptation (classic bricks) and synthetic task generation (special shapes)—into one pipeline called Terminal-Task-Gen. How it works: 1) Adapt high-quality math/code/SWE prompts into terminal tasks to build fundamental skills at scale. 2) Generate new tasks from seeds and from a skill taxonomy to target missing abilities. 3) Run agents in Docker to collect trajectories. 4) Filter/decontaminate. 5) Fine-tune base models. Why it matters: This balanced recipe makes enough data (breadth) while also laser-targeting real terminal skills (depth), which finally moves the needle.

🍞 Bottom Bread (Anchor): With both buckets, the team trains Nemotron-Terminal models that jump from single digits to double digits on Terminal-Bench 2.0—even beating much larger models in places.

🍞 Top Bread (Hook): Imagine practicing soccer by only kicking stationary balls vs. practicing game-like drills.

🥬 Filling (Failed Attempts): What it is: Prior tactics either over-simplified (wrapping non-interactive tasks) or over-complicated (heavy multi-agent systems and per-task Docker builds). How it works: 1) Adapters often assume one-shot answers, not multi-step recovery. 2) Multi-agent pipelines burn time building/repairing Dockerfiles. 3) Fresh environments per task are expensive. Why it matters: These friction points slowed down data generation right when we needed lots and lots of high-quality, realistic practice.

🍞 Bottom Bread (Anchor): If it takes 10 minutes just to set up the practice field for each drill, you won’t run many drills.

🍞 Top Bread (Hook): Think of a library that includes a map of all skills and a practice track that safely records your laps.

🥬 Filling (Real Stakes): What it is: Better terminal agents mean real help for developers, scientists, and IT teams. How it works: 1) Automate setup, data cleaning, model training, and debugging. 2) Recover from errors like missing packages or path mismatches. 3) Run safely in containers. Why it matters: Faster, safer, and more reliable software work saves time and reduces frustration in everyday tasks.

🍞 Bottom Bread (Anchor): A student can ask the agent to “set up a project, read this dataset, and produce a report,” and the agent actually gets it done inside a checked, sandboxed environment.

02Core Idea

🍞 Top Bread (Hook): You know how a great coach doesn’t just say “practice more,” but designs drills that build the exact muscles you need?

🥬 Filling (Terminal-Task-Gen): What it is: A data-making machine that mixes adapted datasets and freshly invented tasks to train stronger terminal agents. How it works: 1) Adapt existing high-quality prompts into terminal form to cover basics at scale. 2) Generate new tasks two ways: from seed problems and from a skill taxonomy that composes 3–5 primitives per task. 3) Run all tasks in stable, pre-built Docker images to avoid setup headaches. 4) Collect trajectories via a standardized agent (Terminus 2). 5) Post-process (decontaminate, filter smartly). 6) Fine-tune models to create Nemotron-Terminal. Why it matters: This lets us scale data efficiently and aim precisely at skills real terminal agents need, improving performance without needing gigantic models.

🍞 Bottom Bread (Anchor): After training on this curated mix, the 32B model reaches 27.4% on Terminal-Bench 2.0—beating a 480B code model on that benchmark.

🍞 Top Bread (Hook): Imagine learning piano with sheets adapted from famous songs plus custom exercises that target your weak fingers.

🥬 Filling (Multiple Analogies): 1) Fitness: Dataset adaptation is general cardio; skill-based synthetic tasks are targeted strength training. 2) School: Old test questions (adapted) teach fundamentals; new teacher-made problems (synthetic) target tricky concepts. 3) Lego: Adapted data are standard bricks; skill-composed tasks are special pieces to finish complex builds. Why it matters: Doing both together is what grows capability the fastest.

🍞 Bottom Bread (Anchor): The model that practiced both reading classic code/math prompts (as terminal tasks) and doing novel multi-skill drills learned to navigate, install, fix, and verify reliably.

🍞 Top Bread (Hook): Think of switching from random practice to a planned routine that closes your weaknesses first.

🥬 Filling (Before vs After): What it is: Before—data was scattered, mismatched, or too costly; After—one pipeline creates abundant, realistic, verifiable terminal data. How it works: 1) Stable environments (pre-built Docker) remove setup failures. 2) Domain prompts ensure task diversity and coherence. 3) Test suites give reliable pass/fail signals. Why it matters: Training focuses on true terminal competence rather than wrestling with infrastructure noise.

🍞 Bottom Bread (Anchor): Instead of spending time building one-off Dockerfiles, the system spends time learning how to recover from a pip install error.

🍞 Top Bread (Hook): Like a math teacher who doesn’t just give answers but sets up clear, checkable problems.

🥬 Filling (Why It Works): What it is: The secret is verifiable, multi-step supervision in environments that stay consistent. How it works: 1) Tests encode ground truth. 2) Agents’ step-by-step trajectories expose error states and recoveries. 3) Composition of primitive skills ensures coverage. 4) Not over-filtering keeps valuable examples of failures and fixes. Why it matters: Models learn not just “what to do,” but “what to do when things go wrong,” which is essential in terminals.

🍞 Bottom Bread (Anchor): A failed make build, followed by the right install, then a rerun—and finally a green test—teaches resilience better than a single, perfect script.

🍞 Top Bread (Hook): Picture a toolbox organized by job: screwdrivers, wrenches, hammers—all labeled so you can build any furniture.

🥬 Filling (Building Blocks): What it is: The idea breaks into parts—dataset adaptation, seed-based generation, skill-based generation with a taxonomy, pre-built Docker images, agent-driven trajectory collection, and selective filtering. How it works: 1) Use adapted prompts to scale breadth. 2) Turn seeds into concrete terminal tasks with files and pytest. 3) Compose primitive skills to invent novel scenarios. 4) Run all in 9 domain-specific Docker images. 5) Collect agent steps with Terminus 2. 6) Keep data diverse (don’t over-filter). 7) Mix all data for SFT to train Nemotron-Terminal. Why it matters: Each piece removes a bottleneck—together they make a smooth, scalable data engine.

🍞 Bottom Bread (Anchor): With these blocks, even an 8B model jumps from 2.5% to 13.0% on TB2.0—proof the toolbox works.

03Methodology

At a high level: Input → [Dataset Adaptation] → [Synthetic Task Generation: Seed-based + Skill-based] → [Stable Docker Environments] → [Trajectory Generation with Terminus 2] → [Post-Processing & Decontamination] → [Supervised Fine-Tuning] → Output (Nemotron-Terminal models)

Step 1: Dataset Adaptation

  • What happens: High-quality math, code, and SWE prompts are wrapped into terminal-style instructions using a consistent template. For SWE, buggy files are materialized in the environment. No tests are added here (they’re prompts + environments).
  • Why this exists: It quickly builds a large foundation of terminal practice from trusted sources, making the model comfortable with file paths, command execution, and writing outputs.
  • Example: “Solve the code problem and write the answer to /app/solution.py.” The agent must create the file, run it, and manage errors.
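Step 1 can be sketched as simple template wrapping. The template wording below is our assumption; the paper only states that a consistent template is used:

```python
# Illustrative sketch of dataset adaptation: wrap a plain math/code/SWE prompt
# into a terminal-style instruction. Template wording is invented.
ADAPTATION_TEMPLATE = (
    "You are working in a terminal. Solve the following problem and write "
    "your final answer to {output_path}.\n\nProblem:\n{problem}\n"
)

def adapt_prompt(problem, output_path="/app/solution.py"):
    """Turn a standalone prompt into a terminal task instruction."""
    return ADAPTATION_TEMPLATE.format(problem=problem, output_path=output_path)

instruction = adapt_prompt("Compute the sum of the first 100 positive integers.")
```

Because the wrapper is uniform, thousands of existing prompts can be converted cheaply while forcing the model to practice file creation and command execution on every one.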

Step 2: Synthetic Task Generation (Seed-based)

  • What happens: Take an existing problem (seed) with optional reference solution (ground truth). Use an LLM to craft a terminal task with explicit files, I/O formats, and pytest-based tests. The reference solution guides test design only; it’s never shown to the agent.
  • Why this exists: Many great problem statements lack terminal structure and verifiable tests. This step turns them into runnable, checkable training data.
  • Example: A seed about computing eigenvalues becomes: “Read matrix.csv, compute eigenvalues, write to results.json. Tests check numeric tolerances and file format.”
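A pytest-style check for the eigenvalue example might look like the sketch below. The expected values and tolerance are invented for illustration; the point is that the reference solution shapes the test, while the agent only ever sees the instruction:

```python
import json
import math

# Sketch of the kind of verifiable check a seed-derived task might ship.
# Expected values and tolerance are illustrative assumptions.
def check_results(path="results.json", expected=(2.0, 5.0), tol=1e-6):
    with open(path) as f:
        data = json.load(f)
    values = data["eigenvalues"]
    assert len(values) == len(expected), "wrong number of eigenvalues"
    for got, want in zip(sorted(values), sorted(expected)):
        assert math.isclose(got, want, abs_tol=tol), f"{got} != {want}"
    return True

# Demo: an agent output in the right format and within tolerance passes.
with open("results.json", "w") as f:
    json.dump({"eigenvalues": [5.0, 2.0000000001]}, f)
```

Numeric tolerances matter here: an exact-equality check would wrongly fail correct solutions that differ only in floating-point rounding.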

Step 3: Synthetic Task Generation (Skill-based via Skill Taxonomy)

  • What happens: Use a curated list of primitive skills across domains (e.g., process management, parsing JSON/CSV, graph traversal, authentication). Prompt an LLM to combine 3–5 primitives into novel, realistic scenarios with clear tests.
  • Why this exists: It surgically targets gaps (like dependency management or data querying) the model struggles with, ensuring multifaceted practice.
  • Example: “In a security environment, parse a log file, detect failed logins, and verify a lockout rule. Tests programmatically validate behavior.”
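The skill-composition step can be sketched as sampling 3–5 primitives and asking a generator model to weave them into one task. The skill names below echo examples from the description; the prompt wording is our own:

```python
import random

# Sketch of skill-based task generation: sample 3-5 primitives from a
# taxonomy and build an LLM generation prompt around them.
SKILL_TAXONOMY = [
    "process management", "JSON parsing", "CSV parsing", "graph traversal",
    "authentication checks", "log analysis", "dependency management",
    "file permissions",
]

def compose_task_prompt(rng, k_min=3, k_max=5):
    """Sample primitive skills and build a generation prompt around them."""
    skills = rng.sample(SKILL_TAXONOMY, rng.randint(k_min, k_max))
    prompt = (
        "Design a realistic terminal task that exercises: " + ", ".join(skills)
        + ". Include concrete files and pytest-based tests."
    )
    return prompt, skills

prompt, chosen = compose_task_prompt(random.Random(0))
```

Because the combinations are sampled, the pipeline keeps producing novel multi-skill scenarios instead of repeating the same handful of drills.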

🍞 Top Bread (Hook): You know how playing on a sturdy field is easier than a muddy lot? 🥬 Filling (Pre-built Docker Images): What it is: Nine domain-specific images pre-install common packages. How it works: 1) Pick a domain image (e.g., data science with pandas/scikit-learn). 2) Place task files in it. 3) Run tasks without costly per-task Docker building. Why it matters: Eliminates setup failures and speeds up generation at scale. 🍞 Bottom Bread (Anchor): Thousands of tasks can run on the same few images, rather than building thousands of Dockerfiles.

Step 4: Trajectory Generation with Terminus 2

  • What happens: Agents interact with the terminal via a structured JSON: analysis, plan, and commands (keystrokes + durations). The environment returns outputs; the agent proceeds turn by turn until done.
  • Why this exists: It standardizes how to collect clear, reproducible, multi-step demonstrations.
  • Example: The agent runs ls, installs dependencies, edits files, runs pytest, reads errors, and retries.
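One agent turn in a Terminus-2-style format might look like the sketch below. The paper describes analysis, a plan, and keystroke commands with durations; the exact field names here are illustrative, not the real schema:

```python
import json

# Hedged sketch of one structured agent turn: analysis of the last output,
# a plan, and the keystrokes to send next. Field names are assumptions.
turn = {
    "analysis": "pytest failed with ModuleNotFoundError for 'requests'.",
    "plan": "Install the missing dependency, then rerun the tests.",
    "commands": [
        {"keystrokes": "pip install requests\n", "duration_sec": 30},
        {"keystrokes": "pytest -q\n", "duration_sec": 60},
    ],
}

message = json.dumps(turn)    # what the agent sends each turn
parsed = json.loads(message)  # what the harness parses back
```

Structuring every turn this way is what makes trajectories reproducible: the harness can replay the exact keystrokes and compare outputs later.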

🍞 Top Bread (Hook): Think of cleaning your recipe book so it doesn’t accidentally include an answer key to the test. 🥬 Filling (Decontamination & Filtering Strategies): What it is: Remove any overlap with the evaluation set (≥14-gram), strip identity leaks, and avoid unwanted language. Test-based filtering is optional. How it works: 1) Decontaminate to avoid cheating. 2) Compare three strategies: keep all, keep only complete, or keep only successful trajectories. Why it matters: Surprisingly, keeping even failed attempts helps the model learn recovery; over-filtering throws away useful lessons and hurts scores. 🍞 Bottom Bread (Anchor): The 8B model scored far better when all trajectories were kept (12.4%) versus success-only (5.06%).
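The ≥14-gram decontamination rule can be sketched in a few lines: drop any training sample that shares a 14-token n-gram with the evaluation set. This is a minimal whitespace-tokenized sketch, not the paper's exact implementation:

```python
# Minimal sketch of n-gram decontamination: remove training samples that
# overlap the evaluation set in any 14-token window.
def ngrams(text, n=14):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_samples, eval_texts, n=14):
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [s for s in train_samples if not (ngrams(s, n) & eval_grams)]
```

Note this step is separate from trajectory filtering: overlap with the test set is always removed (that would be cheating), while failed trajectories are deliberately kept.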

Step 5: Supervised Fine-Tuning (SFT)

  • What happens: Train Qwen3 base models on the combined adapted and synthetic data using long but standard context (32,768 tokens for SFT; 40,960 for eval), AdamW optimizer, cosine schedule, and gradient clipping.
  • Why this exists: This distills the collected step-by-step knowledge into the model’s weights.
  • Example: The model learns patterns like “pip install failed → try apt-get update or adjust version constraints → rerun tests.”
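The SFT setup described above can be summarized as a small config sketch. The context length, optimizer, and cosine schedule come from the text; the learning rate and clip norm below are illustrative assumptions:

```python
# Sketch of the SFT configuration. max_seq_len, optimizer, and schedule
# follow the text; learning_rate and grad_clip_norm are assumed values.
sft_config = {
    "base_model": "Qwen3-8B",
    "max_seq_len": 32_768,     # SFT context window (evaluation uses 40,960)
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "grad_clip_norm": 1.0,     # assumed
    "learning_rate": 2e-5,     # assumed
}

def exceeds_context(example_len, config):
    """Flag trajectories that would be truncated at the context boundary."""
    return example_len > config["max_seq_len"]
```

A check like `exceeds_context` matters because, as the next section notes, the few trajectories longer than the standard window tended to be noisy rather than informative.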

🍞 Top Bread (Hook): Imagine trying to read a super-long chapter versus a normal one—longer isn’t always better. 🥬 Filling (Long Context Training): What it is: Tests whether bigger context windows help. How it works: Train/evaluate at 32k–65k tokens, with/without YaRN scaling. Why it matters: Here, longer context didn’t help; most good examples already fit, and very long ones were noisy. 🍞 Bottom Bread (Anchor): Standard Qwen3 settings with ~41k eval context worked best.

Step 6: Data Mixing Strategy

  • What happens: Compare mixed (all-at-once) vs. two-stage curriculum (adapters then synthetic). Mixed wins.
  • Why this exists: Simple, robust training often beats complicated schedules.
  • Example: 8B model: mixed ~13.0% vs. curriculum ~10.4%.

Secret Sauce

  • Stable environments (pre-built Docker) + compositional skill design + verifiable tests + not over-filtering + mixed training = scalable, targeted learning that teaches both success paths and recovery paths.

04Experiments & Results

🍞 Top Bread (Hook): Picture a science fair where everyone brings their best project, and judges give precise scores, not just gold stars.

🥬 Filling (The Test): What it is: Evaluate on Terminal-Bench 2.0 using the standard Terminus 2 agent to keep comparisons fair. How it works: 1) Models receive tasks with instructions, environments, and tests. 2) They act step-by-step. 3) A pass/fail score is recorded. 4) Results are averaged with uncertainty. Why it matters: We can tell if the model really finishes real terminal jobs, not just chats about them.

🍞 Bottom Bread (Anchor): It’s like checking whether a robot actually assembles the toy and the wheels spin, not just describing how it would.

The Competition

  • Baselines include Qwen3-8B/14B/32B and larger open/closed-source models (e.g., Qwen3-Coder-480B). The question: can smarter data beat bigger size?

The Scoreboard (with Context)

  • Nemotron-Terminal-8B: 13.0% vs. Qwen3-8B’s 2.47%—that’s a five-fold jump, like going from barely finishing to passing several tough tasks.
  • Nemotron-Terminal-14B: 20.2% vs. Qwen3-14B’s 4.04%—like moving from a D to a solid C+/B- depending on the curve.
  • Nemotron-Terminal-32B: 27.4%, surpassing the much larger Qwen3-Coder-480B at 23.9%—like a smaller car overtaking a truck on a steep hill because it’s tuned correctly.

Category Breakdowns (Why It Matters)

  • Big turnarounds: for the 32B model, Data Querying went from 0.0 to 60.0 and Model Training from 0.0 to 50.0, showing real operational competence.
  • Security improved from 2.5 to 27.5; Data Processing from 5.0 to 50.0; SWE from 5.0 to 31.7.
  • These are exactly the areas where step-by-step terminal skills—installing deps, file ops, interpreting errors—are crucial.

Surprising Findings

  • Don’t over-filter: Keeping failed and incomplete trajectories helps. For synthetic tasks, no-filter scored 12.4% vs. 6.74% (complete-only) and 5.06% (success-only). Failure teaches recovery.
  • Long context didn’t help here: Standard Qwen3 context did best. Longer sequences likely included more noise.
  • Curriculum (two-stage) vs. Mixed: Mixed won (13.0% vs. 10.4%), suggesting that exposure to varied data at once is beneficial for these skills.

Ablations: Which Data Matters Most?

  • Adapters alone: Math (5.39%), Code (6.29%), SWE (7.02%). Combined adapters: 9.66%—diversity helps.
  • Synthetic tasks: Skill-based is the star (12.4%); adding seed-based didn’t raise the mean further but reduced variance (more robust performance).

Scaling with Data and Model Size

  • More data → better results for both 8B and 14B. The 14B model benefits even more from scaling, showing capacity × data both matter.

🍞 Bottom Bread (Anchor): It’s like a soccer team that not only practiced more drills but the right ones. The players learned how to recover from a bad pass, not just how to make perfect passes in ideal conditions—and that’s what wins games.

05Discussion & Limitations

🍞 Top Bread (Hook): Even the best game plan has trade-offs—you can’t practice every drill every day.

🥬 Filling (Limitations): What it is: Constraints and caveats of the approach. How it works: 1) Compute and infra: Generating and running thousands of tasks still needs GPUs and container orchestration; it’s efficient but not free. 2) Domain coverage: The 9 Docker images cover a lot, but rare stacks or odd toolchains may be missing. 3) Test quality: Pytest checks are strong, yet tricky scientific edge cases can slip through. 4) Teacher dependence: Using a strong teacher model (DeepSeek-V3.2) biases what gets generated and how trajectories look. 5) Generalization: Great on TB2.0, but new, unseen enterprise toolchains may require new domains or tweaks. Why it matters: Knowing these limits helps you plan deployments and decide when to extend the pipeline.

🍞 Bottom Bread (Anchor): If your company uses an unusual database engine, you may need a new pre-built image and fresh tasks.

Required Resources

  • You’ll need: access to GPUs for SFT, container runners (Docker or Singularity), storage for datasets and images, and orchestration (Harbor/Daytona-like) for large-scale runs.

When NOT to Use

  • If your target tasks aren’t verifiable by tests or state checks, this pipeline loses its advantage.
  • If you require per-task custom OS features beyond the shared images, pre-built images may constrain you.
  • If you only need one-shot code answers without environment interaction, this is overkill.

Open Questions

  • RL on top of SFT: How much can verifiable rewards further boost long-horizon planning and recovery?
  • Better test synthesis: Can we auto-generate even stronger, more realistic test suites with coverage guarantees?
  • Active data selection: Which failures are most educational to keep? Can we prioritize them adaptively?
  • Domain expansion: What set of new domain images most economically unlocks the next 10 points on TB2.0?

🍞 Bottom Bread (Anchor): Think of upgrading from practice drills (SFT) to scrimmages with a scoreboard (RL). The question is: how much farther can the team climb?

06Conclusion & Future Work

Three-Sentence Summary

  • This paper shows that smarter data beats sheer scale for terminal agents: Terminal-Task-Gen combines adapted datasets and synthetic tasks (seed- and skill-based) to train Nemotron-Terminal models.
  • With stable, pre-built Docker images, verifiable tests, and a “don’t over-filter” philosophy, the models learn both how to succeed and how to recover from failure.
  • The result is large, reliable gains on Terminal-Bench 2.0—smaller models rival or beat much larger ones—and open releases let the community build further.

Main Achievement

  • A practical, scalable data engineering framework (Terminal-Task-Gen) that lifts terminal performance dramatically by focusing on task realism, verifiability, and targeted skill composition.

Future Directions

  • Add reinforcement learning with test-based rewards; expand domains and images; improve test generation; develop active data selection that keeps the most instructive failures; explore multi-agent planning only where it measurably helps.

Why Remember This

  • It reframes the challenge: terminal mastery comes from the right practice in the right environments with the right feedback. By curating data—not just growing parameters—you can teach models to handle the messy, step-by-step reality of the command line, which is exactly where real software work happens.

Practical Applications

  • Automated project bootstrapping: create environments, install dependencies, and run initial tests.
  • Reliable data pipelines: parse files, transform datasets, and verify outputs with pytest.
  • Hands-free model training: set up experiments, run training scripts, and confirm metrics.
  • Debugging assistants: reproduce errors, apply targeted fixes, rerun tests, and document changes.
  • SWE issue resolution: localize bugs, generate patches, and validate with test suites.
  • Security checks: verify auth rules, parse logs for anomalies, and validate lockout policies in sandboxed envs.
  • System administration: manage services, check permissions, and script routine maintenance.
  • Education sandboxes: students practice real terminal tasks safely with automatic grading.
  • Benchmarking agents: use standardized tasks to compare terminal-capable models fairly.
  • Onboarding playbooks: codify company-specific terminal workflows with tests so agents can execute them consistently.
Tags: Terminal-Bench 2.0, terminal agents, synthetic task generation, dataset adaptation, skill taxonomy, trajectory generation, Docker containers, supervised fine-tuning, data engineering, long context training, filtering strategies, Nemotron-Terminal, Qwen3 fine-tuning, agent benchmarks, verifiable testing