
Endless Terminals: Scaling RL Environments for Terminal Agents

Intermediate
Kanishk Gandhi, Shivam Garg, Noah D. Goodman et al. Ā· 1/23/2026
arXiv Ā· PDF

Key Summary

  • Endless Terminals is an automatic factory that builds thousands of realistic, checkable computer-terminal tasks so AI agents can practice and improve with reinforcement learning.
  • It creates each task in four steps: write a clear goal, set up a safe container, write tests that check the final result, and keep only tasks a strong model can actually solve.
  • The team generated 3,255 verified tasks covering file work, logs, data processing, scripting, archiving, databases, and more—without human labeling.
  • Using a very simple agent loop and plain PPO (no fancy tools, no retrieval, no multi-agent systems), models improved a lot on a held-out dev set.
  • Qwen2.5-7B jumped from 10.7% to 53.3% on the dev set; Llama-3.2-3B rose from 4.0% to 18.2%; Qwen3-8B-openthinker-sft went from 42.6% to 59.0%.
  • These gains transferred to a human-made benchmark (TerminalBench 2.0), beating other fine-tuned versions of the same base models, even ones with more complex scaffolds.
  • Simple RL can work very well—if you feed it enough diverse, automatically verifiable environments.
  • Main limits: tasks look a bit like neat puzzles, not messy real-world requests, and the solvability check depends on a frontier model, which caps difficulty.
  • Failure analysis shows many losses come from getting stuck in loops or using up the turn limit; successful runs try more varied commands after mistakes.

Why This Research Matters

Many jobs rely on terminal work—organizing data, parsing logs, making backups, and automating scripts. Endless Terminals turns the hard part of training helpful AI assistants—getting lots of safe, graded practice—into an automated pipeline. This means more reliable copilots for IT operations, data teams, and developers, saving time and reducing errors. It also lowers dependence on expensive human labeling or proprietary teachers by letting models learn directly from interaction and tests. As the pipeline grows and gets more natural, it could make command-line tools accessible to more people, speeding up everyday computing tasks. In short, it builds the playground where terminal skills can be learned at scale.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine learning to ride a bike in a tiny hallway. You can pedal, but you can’t really practice turns, bumps, or speed. You won’t get great at biking unless you have lots of space and different paths to try.

🄬 The Concept (Reinforcement Learning):

  • What it is: Reinforcement Learning (RL) is a way for AIs to learn by trying actions and getting rewards when they do well.
  • How it works:
    1. The AI looks at a situation.
    2. It picks an action.
    3. The environment reacts and gives a reward (like a point) or nothing.
    4. The AI repeats this, learning which actions lead to more rewards over time.
  • Why it matters: Without good, varied practice environments, the AI can’t truly learn useful behaviors—it’s like biking in that tiny hallway.

šŸž Anchor: A robot dog learns to fetch better when it can run in a big park with many kinds of sticks and places, not just a living room.

Before this paper, RL helped language models think better at math and code because those areas had tons of small, checkable tasks. But for using a computer terminal—where you type commands, read outputs, fix mistakes, and keep going—there weren’t big, scalable training environments. People had built small benchmarks to test agents, but those were made for grading, not for teaching. Think: a few exam sheets, not a full set of daily practice worksheets.

šŸž Hook: You know how a good video game gives you many levels, different enemies, and clear win conditions so you can improve by playing a lot?

🄬 The Concept (Terminal Agents):

  • What it is: A terminal agent is an AI that solves tasks by typing commands in a computer shell (like cd, ls, grep) and reading the output.
  • How it works:
    1. Reads the goal and the latest terminal output.
    2. Thinks about the next step.
    3. Types one command.
    4. Sees what happened and repeats until done.
  • Why it matters: Real computer work needs many steps, error recovery, and careful checking. Without practice on many multi-step tasks, the agent can’t get robust.

šŸž Anchor: Like a student learning to use a calculator, spreadsheets, and file folders step by step, a terminal agent learns to combine commands to reach a goal.

The problem: training needs a river of tasks that are diverse, safe, and automatically checkable. But most existing datasets are small, human-curated, and expensive. Other approaches try to distill from stronger models (which is pricey and limited by the teacher) or repurpose evaluation benchmarks for practice (which risks overfitting and remains too narrow).

Failed attempts included:

  • Training on fixed, tiny benchmarks meant just for testing—agents memorized patterns but didn’t generalize.
  • Supervised finetuning on human-made traces—useful but bottlenecked by annotation cost and teacher quality limits.
  • Using coding/shell datasets not designed for multi-turn interaction—missing the back-and-forth nature of real terminal work.

The gap: We needed a fully automatic pipeline that could endlessly create realistic terminal tasks with:

  • Clear goals a user might request
  • Safe, isolated environments (so each task starts clean)
  • Automatic tests that prove success or failure
  • A built-in check that the tasks are solvable—not broken or underspecified

šŸž Hook: Imagine a factory that stamps out new math puzzles daily, each with an answer key and a little testing machine that says ā€˜Correct!’ or ā€˜Try again.’

🄬 The Concept (PPO):

  • What it is: Proximal Policy Optimization (PPO) is a stable RL training method that nudges an AI to improve while avoiding too-big jumps that break it.
  • How it works:
    1. Let the agent try tasks and collect rewards.
    2. Update its brain just enough (within safe bounds) to do better on what worked.
    3. Repeat many times.
  • Why it matters: Without stability, training can wobble: one day great, next day terrible. PPO keeps learning steady.

šŸž Anchor: Like a coach who says ā€œGreat shot—aim a tiny bit more to the left next time,ā€ PPO encourages small, safe improvements that add up.

Real stakes: Terminal skills power everyday work—organizing files, parsing logs, making backups, transforming data, and running scripts. Better terminal agents could:

  • Help IT teams respond to outages faster
  • Save data analysts hours on repetitive shell tasks
  • Assist developers in triage and debugging
  • Make learning command-line tools friendlier for newcomers

But none of this happens unless we can scale the environments. That’s the heart of why this research exists: give RL the ā€˜big park’ it needs to truly learn terminal skills.

02Core Idea

šŸž Hook: Picture a never-ending puzzle book that not only writes new puzzles every day but also comes with answer keys and a quick ā€˜checker’ stamp. You’ll never run out of good practice.

🄬 The Concept (Endless Terminals):

  • What it is: Endless Terminals is an automatic, four-stage pipeline that keeps generating, setting up, testing, and validating terminal tasks—no humans needed.
  • How it works:
    1. Generate a user-style task description plus hidden ground-truth details.
    2. Build a safe container and verify the starting state with tests.
    3. Create final tests that check exactly what ā€˜done’ looks like.
    4. Try solving with a strong model; keep tasks that are solvable, toss the rest.
  • Why it matters: Without endless, checkable tasks, RL can’t scale. With them, even simple RL gets much stronger.

šŸž Anchor: Like a cooking school that endlessly creates new recipes, kitchens, and taste-tests—and only serves dishes that a master chef could actually complete.

The Aha! moment in one sentence: If we can endlessly and automatically produce safe, diverse, verifiable terminal tasks, then even a very simple RL setup can learn a lot.

Three analogies:

  1. Game levels: The pipeline is a level generator that tests itself and only ships fair, beatable levels.
  2. Gym workouts: It’s a personal trainer that designs varied exercises, sets up the equipment safely, and checks your form every session.
  3. Science lab: It builds experiments (tasks), prepares clean lab benches (containers), writes measurement tools (tests), and confirms experiments are doable.

Before vs. After:

  • Before: Few, fixed tasks; overfitting; weak generalization; expensive human curation.
  • After: Thousands of varied tasks; automatic checking; scalable training; measured transfer to human-made benchmarks.

Why it works (intuition, not equations):

  • Diversity teaches robustness: Seeing many file, log, and data tasks trains flexible habits instead of brittle tricks.
  • Automatic tests give clear rewards, which PPO needs to steadily improve behavior.
  • Solvability filtering removes broken or impossible tasks, so training time isn’t wasted.
  • Safe containers ensure each practice run starts clean and fair.

Building blocks, each with a clear role:

  • Task generation: Writes realistic goals plus hidden ground-truth for tests.
  • Container setup with initial tests: Guarantees the starting world is exactly as promised.
  • Final tests: Define success precisely so rewards are reliable.
  • Solvability filtering: Proves at least one correct path exists.
  • Minimal agent loop: Keeps the learning signal simple and focused on reasoning and command use.

šŸž Hook: You know how lego kits come with instructions and a final picture so you know if you built it right?

🄬 The Concept (Completion Tests):

  • What it is: Completion tests are scripts that check the final state—files, contents, configs—to decide if the task is solved.
  • How it works: They run after the agent says ā€œdone,ā€ verifying exact results.
  • Why it matters: Without solid checking, rewards get noisy, and the agent can’t learn reliably.

šŸž Anchor: Like the picture on the lego box: if your castle looks like the picture, you passed.

With these pieces clicking together, the pipeline produced 3,255 verified tasks. Training simple PPO agents on this set pushed big gains on a held-out dev set and even improved results on TerminalBench 2.0, a human-made benchmark. The key lesson: when the practice field gets big and well-structured, even a plain training recipe can shine.

03Methodology

High-level recipe: Input → Generate Task → Build & Validate Container → Write Final Tests → Filter by Solvability → Output a verified task ready for RL.
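
The sketch below strings these four stages together in Python. Every stage helper is a hypothetical placeholder supplied by the caller; the paper describes the stages in prose, so treat this as a rough map rather than its implementation.

```python
# Rough map of the four-stage pipeline. Each stage helper is a hypothetical
# placeholder passed in by the caller; only the overall flow mirrors the recipe.

def make_verified_task(category, complexity, scenario,
                       generate_task, build_container, write_final_tests,
                       attempt_solve, n_attempts=16):
    task, ground_truth = generate_task(category, complexity, scenario)   # 1. write the goal + hidden answers
    container = build_container(task, ground_truth)                      # 2. safe container, verified start state
    tests = write_final_tests(task, ground_truth)                        # 3. tests that define "done"
    solvable = any(attempt_solve(task, container, tests)                 # 4. keep only tasks a strong model
                   for _ in range(n_attempts))                           #    can solve at least once (pass@16)
    return (task, container, tests) if solvable else None                # discard unsolvable candidates
```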

šŸž Hook: Think of a teacher making a brand-new quiz, setting up a quiet classroom, preparing an answer key, and giving the quiz only if it’s fair and solvable.

🄬 The Concept (Task Generation):

  • What it is: The pipeline asks a language model to write a realistic user-style task plus hidden ground-truth used only by the testers.
  • How it works:
    1. Randomly pick a category (e.g., file ops, logs), a complexity level, and a scenario (e.g., DevOps debugging logs).
    2. Produce a clear task the agent will see.
    3. Produce hidden ground-truth (exact file contents, paths) for tests.
  • Why it matters: Diversity prevents overfitting; hidden truth enables accurate, automatic grading.

šŸž Anchor: Like a quiz that looks natural to students but also has a private teacher’s answer sheet.

šŸž Hook: Imagine doing experiments in separate, safe mini-labs so spills never mix.

🄬 The Concept (Containerized Environments):

  • What it is: Each task runs inside its own clean container (e.g., Docker or Apptainer) so the starting world is correct and safe.
  • How it works:
    1. Write a container definition based on the task’s needs.
    2. Build it and run initial state tests to ensure all prerequisites exist (files, dirs, repos, processes).
    3. If tests fail, feed the error back to the model to fix the container (up to a few rounds).
  • Why it matters: Without clean starts, tasks become flaky and learning breaks.

šŸž Anchor: Like giving each student a fresh lab kit so they don’t inherit someone else’s mess.

šŸž Hook: You know how a scoreboard tells you if your team actually won?

🄬 The Concept (Completion Tests):

  • What it is: Scripts that check whether the agent achieved the exact end state.
  • How it works:
    1. Use hidden ground-truth to write checks (e.g., file exists and contents match).
    2. Make sure these tests fail before the task starts and only pass when truly solved.
  • Why it matters: This makes the reward binary, clear, and trustworthy.

šŸž Anchor: Like a science test that only gives full credit when every required measurement is correct.

šŸž Hook: Think of sorting fruits—keep ripe ones, toss the rest.

🄬 The Concept (Solution Filtering):

  • What it is: The pipeline tries to solve each task with a strong model multiple times and keeps the task only if at least one attempt succeeds.
  • How it works:
    1. Sample 16 solution attempts (interactive command sessions) from a capable model.
    2. If any pass, the task is marked solvable; if none pass, discard it.
  • Why it matters: Removes broken or impossible tasks so training time isn’t wasted.

šŸž Anchor: Like a puzzle magazine only printing crosswords that a test-solver was able to finish.

šŸž Hook: You know how follow-the-directions worksheets often use clear labels so no one gets lost?

🄬 The Concept (XML Structure for the Agent Loop):

  • What it is: The agent outputs are wrapped in simple tags so the system knows what to execute.
  • How it works:
    1. The model thinks, then writes <command>...</command> for one shell command.
    2. It writes <command>done</command> when finished.
    3. The shell returns stdout, stderr, and exit code, which get appended to history.
  • Why it matters: Structure keeps the loop tidy: one command per turn, clear signals, reliable parsing.

šŸž Anchor: Like putting your answer inside a labeled box on a test so the teacher can find it instantly.

šŸž Hook: Imagine a game where you only earn a point if you actually beat the level.

🄬 The Concept (Task Solvability Filtering, pass@16):

  • What it is: A pass@16 rule keeps tasks where at least one of 16 solution attempts works.
  • How it works:
    1. Run 16 independent interactive tries with a strong model.
    2. Keep the task if any attempt passes the final tests.
  • Why it matters: Ensures every task has a real path to success, so rewards mean something.

šŸž Anchor: Like shooting 16 free throws; if any go in, the shot type is possible—and the drill is worth practicing.

Minimal agent and training details (what happens during RL):

  • The agent sees the whole conversation history (thoughts, commands, outputs).
  • It can reason in text before each command.
  • Episodes end when the agent says done, runs out of turns, or hits a token limit.
  • Reward is binary: 1 if final tests pass, else 0.
  • PPO trains the policy using these episode-level signals; no KL penalty; clipping bounds keep updates stable.
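
Putting those rules together, one episode might be scored roughly as in the sketch below. The helpers and the turn/token limits are illustrative assumptions, not the paper's exact settings; only the binary pass/fail reward mirrors the description above.

```python
# Sketch of one RL episode: the agent acts until it says done, runs out of
# turns, or exceeds a token budget; the reward is 1.0 only if the final tests
# pass. Helpers and the specific limits are illustrative.

def rollout(task, agent_turn, execute_turn, run_final_tests,
            max_turns=30, max_tokens=16_000):
    history = f"Task: {task}\n"
    tokens_used = 0
    for _ in range(max_turns):
        output, n_tokens = agent_turn(history)              # model reasons, then emits one <command>
        tokens_used += n_tokens
        history, finished = execute_turn(output, history)   # run the command, append the results
        if finished or tokens_used > max_tokens:
            break
    reward = 1.0 if run_final_tests() else 0.0              # binary, episode-level reward
    return history, reward
```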

The secret sauce:

  • The clever part isn’t a fancy algorithm—it’s the scale and quality control of the environments. By procedurally creating tasks with built-in verification and solvability checks, the team turned a scarce resource (good training tasks) into an abundant one, unlocking strong learning even with a plain PPO setup.

04Experiments & Results

The test: Measure how often the agent fully solves tasks (passes the final tests). This is reported on a held-out development set (from the same pipeline) and on human-curated benchmarks like TerminalBench 2.0 to check generalization.

The competition: Compare base models, our RL-trained models (+RL Ours), and other fine-tuned or RL-trained variants (including systems with more complex agentic scaffolds).

Scoreboard with context:

  • On the Endless Terminals dev set:

    • Llama-3.2-3B: 4.0% → 18.2% (about 4.5Ɨ improvement).
    • Qwen2.5-7B: 10.7% → 53.3% (about 5Ɨ improvement).
    • Qwen3-8B-openthinker-sft: 42.6% → 59.0% (a stronger base gets even better).

    Interpreting: moving from a low pass rate to above half for Qwen2.5-7B is like going from struggling on quizzes to earning a comfortable B+/A-.

  • Transfer to human-curated TerminalBench 2.0:

    • Llama-3.2-3B: 0.0% → 2.2%.
    • Qwen2.5-7B: 2.2% → 3.4%.
    • Qwen3-8B-openthinker-sft: 1.1% → 6.7%.

    Interpreting: absolute numbers are modest (the benchmark is hard), but the gains are consistent and beat other fine-tuned versions of the same base models, even those using richer scaffolds. That's like moving from near zero to making steady, nontrivial progress on a tough final exam.

Surprising findings:

  • Simple beats fancy when environments scale: A very minimal agent loop plus vanilla PPO outperformed more complex scaffolds once trained on lots of diverse, verified tasks.
  • Stronger bases gain more: The model that began with supervised traces (Qwen3-8B-openthinker-sft) reached the best transfer, suggesting SFT and RL complement each other.
  • Failure patterns matter: Many failures came from loop behaviors (repeating the same commands) and turn exhaustion (running out of steps). Successful runs showed higher command diversity after the first error, meaning they tried new ideas rather than repeating mistakes.

šŸž Hook: Ever see a toy car stuck against a wall, wheels spinning, not going anywhere?

🄬 The Concept (Loop Failures):

  • What it is: When an agent repeats the same or similar commands after an error instead of exploring alternatives.
  • How it works: Low command diversity after the first mistake signals it’s trapped in a cycle.
  • Why it matters: Looping wastes turns and blocks progress.

šŸž Anchor: Like trying the same wrong password over and over instead of resetting it or checking the username.

Extra details:

  • The dataset spans many categories: biggest shares include file operations and log management; others range from scripting to databases.
  • Solution lengths vary widely; most tasks need 1,000–4,000 characters of interaction, with a long tail for complex ones.
  • About half of generated candidates were filtered out by solvability checks, leaving a cleaner, stronger training set.

05Discussion & Limitations

Limitations:

  • Tasks look like tidy, well-specified puzzles. Real users often ask fuzzy questions or forget details. Automating that ā€˜messiness’ while keeping tests reliable is hard.
  • Solvability filter depends on a frontier model. If the checker can’t solve a task, it gets thrown out—even if it’s a good, challenging problem. This sets a moving difficulty ceiling.
  • Domain gaps remain: performance drops on specialized areas (e.g., cryptanalysis, bioinformatics, certain ML tasks) where background knowledge is thin.

Required resources:

  • Container infrastructure (e.g., Apptainer or Docker) to build and run many tasks in parallel.
  • Compute for PPO training and for solvability checks (multiple solution attempts per task).
  • Storage for datasets, containers, logs, and rollouts.

When not to use:

  • If you need natural, ambiguous, back-and-forth conversations (e.g., asking clarifying questions) as the main skill. The current pipeline favors precise, checkable goals.
  • If frontier-model-based filtering is unavailable or too costly.
  • If you must target highly specialized domains not well covered by the current task generator.

Open questions and next steps:

  • How to generate ā€˜fuzzy’ but still verifiable tasks that mimic real user behavior without breaking automatic testing?
  • Can self-play raise the ceiling by creating tasks just beyond the current skill level instead of relying on a fixed frontier solver?
  • Would denser rewards (partial credit for passing some tests) speed learning while staying stable?
  • Could richer scaffolds (retrieval, tools, multi-agents) on top of Endless Terminals deliver bigger gains, or does simplicity often suffice?
  • Can learned world models of terminal dynamics make training more sample-efficient by letting agents ā€˜imagine’ outcomes before running commands?

06Conclusion & Future Work

Three-sentence summary: Endless Terminals is an autonomous, four-stage pipeline that mass-produces safe, diverse, verifiable terminal tasks for RL. Training very simple agents with vanilla PPO on these tasks leads to large gains on a dev set and measurable improvements on a human-curated benchmark. The core message is that scaling environments unlocks the power of simple RL.

Main achievement: Showing that a clean, automated pipeline for generating and validating terminal tasks can, by itself, lift performance substantially—often beating more complex agent setups trained on smaller or noisier data.

Future directions:

  • Add more natural, ambiguous requests while preserving verifiability.
  • Reduce dependence on frontier model filtering via self-play and adaptive difficulty.
  • Explore partial-reward schemes and world models for efficiency.
  • Layer in optional scaffolds (retrieval, tools) to test complementarity with the scaled environment.

Why remember this: In agent training, the environment is as important as the algorithm. Endless Terminals turns a major bottleneck—lack of large, checkable, realistic practice—into a scalable resource, proving that with the right playground, even simple players can learn powerful skills.

Practical Applications

  • Train internal terminal copilots to automate routine file, log, and data-processing tasks safely.
  • Benchmark and improve DevOps assistants that triage incidents by parsing and summarizing logs.
  • Create custom task packs (e.g., database ops, backups) to upskill agents for company-specific workflows.
  • Pre-train student-facing shell tutors that give hints and verify solutions with tests.
  • Stress-test new agent architectures or prompting strategies on a large, verified task suite.
  • Evaluate the impact of retrieval or tool-use scaffolds by layering them on top of the same tasks.
  • Generate domain-targeted training sets (e.g., archiving, checksum verification) without manual labeling.
  • Rapidly iterate RL recipes (reward shaping, curriculum) with automatic pass/fail feedback.
  • Build safety sandboxes where agents can practice risky commands inside isolated containers.
#reinforcement learning Ā· #PPO Ā· #terminal agents Ā· #procedural generation Ā· #containerized environments Ā· #automatic evaluation Ā· #solvability filtering Ā· #Apptainer Ā· #Docker Ā· #TerminalBench Ā· #interactive RL Ā· #self-improving agents Ā· #test-driven RL training Ā· #environment scaling