Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Key Summary
- Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
- Each task runs inside a safe, ready-to-use computer box (a Docker container) and is verified by strict tests that check the final result, not the steps.
- Humans wrote real solution scripts for every task, so we know each task is actually solvable and what success looks like.
- Even the strongest models today solve under 65% of tasks, which means the benchmark is hard enough to tell top AIs apart.
- The biggest mistakes are execution errors like ignoring instructions, repeating steps, or missing when to stop, showing where agents need the most improvement.
- Command-level failures often happen because the requested program isn't installed or on PATH, revealing that practical setup skills are crucial.
- Model choice matters more than the agent scaffold in many cases, and newer models show steady gains over time on this benchmark.
- Terminal-Bench includes careful quality control, audits, and anti-cheating checks, but internet access and public data still pose contamination risks.
- The benchmark is outcome-driven: agents can use any valid approach as long as the final state passes the tests.
- Results and tasks are open, and the team plans to release harder task sets as models improve so the benchmark stays challenging.
Why This Research Matters
Many real jobs—from running servers to analyzing data—happen in terminals, so testing AI in this environment shows if they can actually help at work. Terminal-Bench focuses on outcomes, which aligns with real business needs: something either works in production or it doesn’t. By revealing the most common mistakes agents make, the benchmark points researchers to fixes that could unlock big productivity gains. Teams can compare models fairly, avoiding hype and choosing what actually gets results for their workflows. As AI gets better, the benchmark will evolve, keeping the bar high and preventing premature trust in unproven agents. This ultimately leads to safer, more reliable AI assistants that can handle valuable, long, and complex tasks.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how following a recipe is different from actually cooking dinner for your whole family? Reading steps is easy—cooking a tricky dish perfectly, on time, with everything warm, is hard.
🥬 Filling (The Actual Concept)
- What it is: The world before this paper had many AI tests, but most were like short quizzes, not full, realistic jobs.
- How it works (what the world looked like):
- AI models were mostly tested on tiny tasks (like writing one function) or in simulated worlds that don’t match real computers.
- Agents that use terminals became popular for real work (coding, data science, cybersecurity), but tests didn’t match the messiness professionals face daily.
- When tasks got longer and more complex, older benchmarks couldn’t tell strong models apart—they were too easy or too artificial.
- Why it matters: Without realistic tests, we might think an AI is job-ready when it’s only quiz-ready. That’s risky for real projects, deadlines, and money.
🍞 Bottom Bread (Anchor): Imagine a spelling bee champion asked to write a whole mystery novel. Winning the bee doesn’t prove they can plan characters, fix plot holes, and finish a book. We need a different test for that.
🍞 Top Bread (Hook): Imagine talking to a computer like you text a friend—but instead of emojis, you type little magic words that make the computer do stuff.
🥬 Filling (The Actual Concept)
- What it is: Command Line Interfaces (CLIs) are a way to control a computer by typing commands.
- How it works:
- You type a command like ls (list files) or grep (search text).
- The computer runs the command and prints results back as text.
- You chain commands to do powerful things fast (e.g., find + xargs + rm to clean folders).
- Why it matters: So much real work—software builds, data pipelines, servers—runs in terminals. If an AI can’t handle a CLI, it can’t handle many real jobs.
🍞 Bottom Bread (Anchor): A developer might type git status to check code changes, then pytest to run tests. An agent must do the same to solve real tasks.
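To make that concrete, here is a minimal, hedged sketch of the kind of command chaining described above; the directory name and search term are illustrative, not taken from the benchmark:

```bash
# List recently changed Python files in a project, search them for TODO notes,
# and count how many were found. "project/" and "TODO" are illustrative.
find project/ -name "*.py" -mtime -7 -print0 \
  | xargs -0 grep -n "TODO" \
  | wc -l
```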
🍞 Top Bread (Hook): Think of a lunchbox that keeps your sandwich from soaking your apple. Everything stays separate and fresh.
🥬 Filling (The Actual Concept)
- What it is: Docker containers are neat, sealed computer environments that keep apps and tools isolated and reproducible.
- How it works:
- You write a Dockerfile describing what to install (like Python 3.10, gcc).
- You build an image (a snapshot).
- You run a container (a live, clean room) where your commands won’t mess up your whole computer.
- Why it matters: Tests must be fair and repeatable. Containers guarantee everyone—human or AI—solves the same task in the same setup.
🍞 Bottom Bread (Anchor): If a task needs Python 3.10 and specific libraries, the container ensures those exact versions exist every time.
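As a minimal sketch of that build-and-run cycle (the image tag `terminal-task` is an illustrative name, not one used by the benchmark):

```bash
# Build an image (the snapshot) from the Dockerfile in the current directory.
docker build -t terminal-task .
# Start a disposable container (the clean room); --rm throws it away on exit,
# so nothing done inside can mess up the host machine.
docker run --rm -it terminal-task bash
```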
The Problem: Before this paper, we had a gap. AI agents were starting to act like junior engineers using terminals, but our tests weren’t measuring whether they could do real, long, messy, professional tasks (like setting up servers, compiling systems, or reproducing research). Small, single-step coding quizzes don’t capture that. Prior attempts tried: synthetic puzzles (too clean), sandboxed toy tools (not real terminals), or narrow skills like text-to-Bash (can’t measure long, multi-step workflows). The missing piece: a high-skill, diverse, outcome-checked benchmark built from real workflows inside real terminals.
What’s at Stake: In the real world, people use terminals to keep websites running, secure companies, analyze science data, and release software. If we trust an untested agent, we risk bugs, outages, or breaches. If we under-trust capable agents, we miss huge productivity gains. A tough, realistic, and fair benchmark determines when agents are truly ready to help on valuable work.
02 Core Idea
🍞 Top Bread (Hook): Imagine a school science fair where each project is a real lab experiment, not a worksheet. You don’t get points for trying—you get points if your experiment really works.
🥬 Filling (The Actual Concept)
- What it is: Terminal-Bench is a hard, realistic test set where AI agents must finish real terminal tasks and are graded only on the final outcome.
- How it works:
- Each task comes with its own container (clean room), instructions (what to achieve), and tests (how we check the end result).
- The agent can use any valid commands and tools to get the job done.
- Tests verify the final computer state—not the agent’s words—to avoid being tricked.
- Why it matters: This measures what counts in real jobs: can the agent get to a correct, working result under real conditions?
🍞 Bottom Bread (Anchor): If the task is “set up an HTTPS server,” the agent only passes if the tests confirm a valid self-signed certificate and a working server—no excuses.
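For flavor, here is a hedged sketch of what an outcome check in that spirit could look like; the port and the use of curl and openssl are assumptions for illustration, not the benchmark's actual tests:

```bash
# Fail unless something answers HTTPS locally.
curl -ks https://localhost:443/ >/dev/null || { echo "FAIL: no HTTPS response"; exit 1; }
# Print the subject and issuer of the certificate the server actually serves;
# for a self-signed certificate these two fields match.
openssl s_client -connect localhost:443 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
echo "PASS"
```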
Multiple Analogies (same idea, 3 ways):
- Sports field: Not a quiz about soccer rules—play a real match and see the score at the end.
- Cooking show: Not naming ingredients—cook a dish; judges taste it to decide.
- Puzzle room: Not picking locks in theory—escape by actually solving the room.
Before vs After:
- Before: Benchmarks often measured tiny code pieces or synthetic steps, so many strong models looked the same.
- After: With Terminal-Bench, top models separate clearly—some complete over half the tasks, others much less—revealing true readiness for real work.
Why It Works (intuition without equations):
- Realistic environment: The terminal is where real engineering happens, so skill transfers directly.
- Outcome-based grading: Tests don’t care how you got there, only that it works—like production systems do.
- Diversity of tasks: Different fields (systems, ML, security, data) prevent overfitting one trick.
- Reproducibility via containers: Everyone solves the same problem the same way, making fair comparisons possible.
Building Blocks:
- Isolated task environments (Docker images) with all required files.
- Clear instructions describing acceptable end states.
- Human-written solution scripts proving solvability.
- Comprehensive tests verifying final states only.
- A neutral agent harness (like Terminus 2) to compare models fairly.
- Error analyses to map where and why agents fail.
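A hypothetical sketch of how those building blocks might sit together on disk (file names are illustrative, not the benchmark's exact layout):

```bash
# my-task/
# ├── Dockerfile       # pinned, isolated environment with all required files
# ├── instruction.md   # the acceptable end state, described in plain language
# ├── solution.sh      # human-written script proving the task is solvable
# └── tests/           # outcome checks that inspect only the final state
```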
🍞 Top Bread (Hook): You know how a teacher shows you one good way to solve a tricky math problem so you know it’s possible?
🥬 Filling (The Actual Concept)
- What it is: Human-Written Solutions are step-by-step scripts that actually solve each task.
- How it works:
- An expert writes a real sequence of commands that completes the task.
- Running that script inside the container makes all tests pass.
- Reviewers check that it mirrors a realistic workflow and doesn’t cheat.
- Why it matters: This proves the task is solvable, defines expectations, and catches broken or unfair tests.
🍞 Bottom Bread (Anchor): For “rebuild a C library and run its tests,” the solution script installs deps, compiles the code, and runs unit tests so we know the task is doable.
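A hedged sketch of what such a solution script might contain, assuming a generic autotools-style C library; the package and build commands are illustrative, not a real task's oracle:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Install build prerequisites inside the container.
apt-get update && apt-get install -y build-essential
# Configure, compile, and run the library's own unit tests.
./configure
make -j"$(nproc)"
make check
```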
03 Methodology
At a high level: Instruction + Container → Agent interacts via terminal → Outcome tests run → Pass/Fail recorded and analyzed.
Step 1: Build the Task Environment
- What happens: Curators create a Dockerfile and context with all needed code/data pinned to versions.
- Why this step exists: Without a stable, reproducible environment, results could change due to outside updates, breaking fairness.
- Example: A task needs Python 3.10, gcc, and specific crypto libs; the Dockerfile installs these exact versions so everyone sees the same system.
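A minimal sketch of that pinning idea, written as a shell heredoc that emits a small Dockerfile; the base image and library version are illustrative assumptions, not a real task's environment:

```bash
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
# Pin system and Python dependencies so every run sees the same toolchain.
RUN apt-get update && apt-get install -y --no-install-recommends gcc \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir cryptography==42.0.5
COPY . /app
WORKDIR /app
EOF
```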
🍞 Top Bread (Hook): Imagine a friendly robot hall monitor checking every classroom for safety and fairness before students arrive.
🥬 Filling (The Actual Concept)
- What it is: Automated Quality Control Tools are scripts and LLM checks that scan tasks for mistakes or sneaky shortcuts.
- How it works:
- Deterministic checks ensure, for example, tests and solutions aren’t baked into the image.
- LLM checks look for underspecified instructions or typos that could trip agents unfairly.
- An adversarial exploit agent tries to cheat to expose vulnerabilities in test design.
- Why it matters: Catching these issues early prevents agents from passing by loopholes and keeps the benchmark rigorous.
🍞 Bottom Bread (Anchor): If a task accidentally copies the tests into the container, a tool flags it so solvers can’t peek and cheat.
Step 2: Write the Instruction and Tests
- What happens: Authors write a clear instruction describing acceptable end states, and tests that check only the final state.
- Why this step exists: Ambiguity confuses agents and reviewers. Tests must match the instruction exactly.
- Example: Instruction says “produce output.json with fields id, score,” and tests verify the file exists and fields are correct—no hidden requirements.
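As a hedged illustration, a final-state check along those lines could be a small shell test; the file path and the use of jq are assumptions:

```bash
# Pass only if output.json exists and carries both required fields.
test -f output.json || { echo "FAIL: output.json missing"; exit 1; }
jq -e 'has("id") and has("score")' output.json >/dev/null \
  || { echo "FAIL: required fields missing"; exit 1; }
echo "PASS"
```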
Step 3: Provide a Human-Written Solution
- What happens: An expert creates a realistic solution.sh showing a believable workflow.
- Why this step exists: Proves solvability and sets a standard; if the solution breaks, the task is invalid.
- Example: For a build task, solution.sh runs cmake .. && make && ctest and ensures artifacts are where tests expect them.
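A hedged sketch of such a solution.sh for a CMake project; the directory names and artifact path are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p build && cd build
cmake ..                     # configure the project
make -j"$(nproc)"            # compile
ctest --output-on-failure    # run the project's test suite
# Put the built artifact where the outcome tests expect it (path is illustrative).
cp ./libexample.so /usr/local/lib/
```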
Step 4: Run the Agent in the Container
- What happens: The agent (often via the neutral Terminus 2 harness) interacts only through Bash commands to explore, edit, run, and verify.
- Why this step exists: Using a headless terminal isolates the core capability—operating a real CLI without extra crutches.
- Example: The agent clones a repo, inspects error logs with less, installs missing libs with apt-get, and reruns builds.
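A hedged sketch of the kind of command sequence an agent might issue at this step; the repository URL, log phrasing, and package name are made up for illustration:

```bash
git clone https://example.com/team/project.git && cd project
make 2>&1 | tee build.log                         # first build attempt, keep the log
grep -i "error" build.log                         # inspect what went wrong
apt-get update && apt-get install -y libssl-dev   # install the missing library
make                                              # rerun the build and confirm it passes
```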
🍞 Top Bread (Hook): Imagine a school scoreboard that tallies wins and losses for every team, on the same field, with the same rules.
🥬 Filling (The Actual Concept)
- What it is: An Agent Evaluation Framework is the system that runs, times, and scores agent attempts consistently.
- How it works:
- Launch many containers in parallel (at scale) using a harness (e.g., Harbor) and a sandbox provider.
- Feed the instruction; record turns, tokens, and command outputs.
- Run the tests; log pass/fail, cost, and time.
- Why it matters: Controlled, repeatable runs let us compare models fairly and discover performance trends.
🍞 Bottom Bread (Anchor): The framework might run 32–100 containers at once, record that Model A took 12 minutes and 4 million tokens, and log whether it passed.
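Terminal-Bench does this with a dedicated harness (e.g., Harbor); as a rough, hedged sketch of the underlying idea using only plain Docker and xargs (the image name, test path, and trial count are illustrative):

```bash
# Run the same containerized task for several trials in parallel and record pass/fail.
run_trial() {
  id="$1"
  if docker run --rm terminal-task bash /tests/run_tests.sh >"logs/trial_${id}.log" 2>&1; then
    echo "trial ${id}: PASS"
  else
    echo "trial ${id}: FAIL"
  fi
}
export -f run_trial
mkdir -p logs
seq 1 5 | xargs -P5 -I{} bash -c 'run_trial "$@"' _ {}
```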
Step 5: Verify and Audit Quality
- What happens: Multiple human reviewers audit each task for specificity, solvability, and integrity; repeated runs check consistency.
- Why this step exists: Some flaws only show up under human inspection; audits prevent silent leaks like future git commits.
- Example: Reviewers confirm that tests won’t pass if the agent simply echoes the final answer without real computation.
🍞 Top Bread (Hook): You know how after a game, coaches watch the replay to see exactly where things went wrong?
🥬 Filling (The Actual Concept)
- What it is: Error Analysis is the process of studying failed attempts to find patterns of mistakes and how to fix them.
- How it works:
- Sample failed trials; label failures into categories like Execution, Coherence, Verification.
- Use LLMs and humans to judge and cluster error types (e.g., disobeyed instruction, weak verification).
- Quantify which failures are most common to target improvements.
- Why it matters: Knowing the most frequent and costly mistakes helps build better agents.
🍞 Bottom Bread (Anchor): If many failures say “command not found,” teams know to improve how agents install tools or check PATH first.
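A hedged sketch of one such tally over a folder of trial logs; the log layout and the phrases matched are assumptions:

```bash
# Count how often each rough failure signature appears across the saved logs.
grep -rhoiE "command not found|no such file or directory|permission denied" logs/ \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn
```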
Secret Sauce (what makes it clever):
- Outcome-only testing resists prompt fluff and focuses on working results.
- Real terminal environments mirror professional workflows.
- Human-verified solvability keeps tasks fair and meaningful.
- Systematic audits and adversarial checks reduce shortcuts.
- Large-scale, repeatable runs let us see real trends, not one-off luck.
04 Experiments & Results
The Test: Researchers measured resolution rate (percent of tasks an agent fully completes), time, tokens, cost, and failure types. They wanted to know: Which models actually finish real jobs in a terminal, and how do they fail when they don’t?
The Competition: Six agent scaffolds and sixteen leading models (closed and open-weight) were compared across 89 tough tasks. Each model-agent combo was run at least five times, totaling 32,155 trials.
Scoreboard with Context:
- Best result: GPT-5.2 with Codex CLI solved about 63% of tasks—like getting an A when many tests are meant to be hard finals.
- Strong runners-up: Claude Opus 4.5 (Terminus 2) and Gemini 3 Pro (Terminus 2) at roughly 58% and 57%, respectively.
- Open-weight leaders: Kimi K2 Thinking (Terminus 2) at roughly 36%—solid but behind top proprietary models.
- Smaller or older models: Around 15% on average—more like passing some quizzes but struggling with projects.
- Trend: Newer releases clearly improve over time; in about eight months, state-of-the-art performance nearly doubled on this benchmark.
Cost and Effort:
- Running the full benchmark for a single model can cost $100 or more, depending on pricing and run lengths.
- Most agents finish attempts in under 20 minutes, but some runs stretch to ~2 hours and nearly 100 million tokens—showing these are truly long-horizon tasks.
- More tokens or turns don’t automatically mean more success; strategy matters more than chattiness.
Empirical vs Human Difficulty:
- Human authors labeled tasks as medium or hard for people. Then researchers defined empirical difficulty by how many frontier models solved them.
- Correlation is positive: over 90% of the tasks humans labeled hard are also empirically hard for models.
- Biggest gap: Some tasks humans mark medium are actually hard for models—often those needing adversarial creativity, like bypassing XSS filters.
Failure Analyses (Trajectory-level and Command-level):
- Execution errors dominate among top models: disobeying specs, repeating steps, or not recognizing when to stop.
- Coherence and verification errors also occur: claiming success when logs disagree, forgetting context, or weak checks.
- Command-level failures: The biggest single issue is “command not found” or missing PATH (~24.1%), followed by runtime app failures and missing files—very practical, nuts-and-bolts problems.
- Distinct error signatures mean different fixes: Models heavy on execution mistakes need stricter instruction-following and planning; models with balanced lapses need better self-checking.
Surprising Findings:
- Model choice often matters more than the agent scaffold for final performance.
- More turns/tokens don’t guarantee better results—smart actions beat longer chats.
- Some tasks remain unsolved by any model, marking a clear frontier for future work (e.g., complex system config, kernel driver compilation, database migration).
05 Discussion & Limitations
Limitations:
- Public benchmark: Because tasks and solutions are open, future models could be trained on them (contamination), weakening fairness. Canary strings help detect but can’t prevent this.
- Internet access: Agents can install packages or search docs; in theory they could find oracle solutions. Not observed so far, but vigilance is needed.
- Non-determinism: Even with pinned dependencies and images, network or hardware variability can affect runs.
- Residual task flaws: Despite multi-hour human audits per task, a few might still be underspecified or exploitable.
Required Resources:
- Container runtime (Docker) capable of building and running images reliably.
- Harness (e.g., Harbor) and a sandbox provider to launch many jobs in parallel.
- Model API access and budget (costs vary widely by model and number of trials).
- Storage for logs, traces, and artifacts to enable audits and error analysis.
When NOT to Use:
- Purely GUI-based workflows where mouse/visual interaction is essential (this focuses on terminal tasks).
- Tasks tightly coupled to private, unstable, or rate-limited external APIs that break reproducibility.
- Tiny, single-function coding quizzes (simpler, cheaper benchmarks fit better).
Open Questions:
- How to maintain a private, contamination-resistant test set while keeping community contributions vibrant?
- Can we standardize stronger, automated measures of test quality (beyond human audits), like coverage metrics for task specs?
- What agent designs best reduce execution errors—planning modules, stricter interpreters, or built-in verifiers?
- How should benchmarks evolve as models approach 100% on today’s tasks—cadence of new task sets, difficulty scaling, and meta-benchmarking?
- Can command-level error signals (e.g., frequent “command not found”) be turned into automatic tool-installer skills without overfitting to the benchmark?
06 Conclusion & Future Work
Three-Sentence Summary: Terminal-Bench 2.0 is a rigorous benchmark of 89 real, containerized terminal tasks that grades agents by final outcomes, not words. Today’s best models solve under 65%, revealing meaningful headroom and clearly separating capabilities. Careful audits, human-written solutions, and failure analyses spotlight where agents stumble and how to improve.
Main Achievement: Turning real terminal workflows into fair, reproducible, outcome-verified tasks that can reliably differentiate frontier agents on long, professional problems.
Future Directions:
- Release fresh task sets as models improve so the benchmark stays challenging and relevant.
- Explore private or rotating hidden tasks to reduce contamination while preserving openness.
- Build stronger, partly automated measures of task specification quality and flakiness control.
- Integrate smarter self-verification and tool-install strategies to target dominant failure modes.
Why Remember This: If we want trustworthy AI coworkers, we must test them doing actual work. Terminal-Bench moves beyond quizzes to realistic jobs in real environments, showing what agents can truly finish—and where they still need to grow. It’s a compass for building reliable, job-ready AI.
Practical Applications
- Choose the right AI model for your engineering team by comparing resolution rates and costs on Terminal-Bench.
- Harden your agent scaffold by targeting dominant execution failures (e.g., enforce instruction adherence and termination checks).
- Add automatic tool-install and PATH-check routines to reduce 'command not found' errors (see the sketch after this list).
- Use outcome-only tests in your internal pipelines to judge agents by working results, not chat quality.
- Adopt Docker-based reproducible task setups for your own agent evaluations and hiring screens.
- Run failure analyses on your agent logs to prioritize improvements (execution vs. coherence vs. verification).
- Prototype self-verification steps (rerun core tests, schema checks) before agents declare tasks complete.
- Benchmark new model releases regularly to monitor capability growth and decide upgrade timing.
- Design private, rotating task sets for sensitive workflows to reduce training contamination risks.
- Educate teams with curated tasks mirroring your stack (builds, migrations, deployments) to train agent-guided SOPs.
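For the tool-install and PATH-check idea above, a hedged sketch of a pre-flight routine an agent scaffold could run before the real work starts; the tool list and the apt-based fallback assume a Debian-like container:

```bash
# Ensure each required tool is on PATH; install it if missing.
for tool in jq curl make; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "installing missing tool: $tool"
    apt-get update -qq && apt-get install -y -qq "$tool"
  fi
done
```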