AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate
Keyu Li, Junhao Shi, Yang Xiao et al. · 1/16/2026
arXiv · PDF

Key Summary

  • AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.
  • It includes 32 real-world scenarios with 138 tasks that often take hours, around 90 tool uses, and up to 1 million tokens of context.
  • A simulated user gives step-by-step feedback so humans don’t have to babysit the AI during long projects.
  • A Docker sandbox safely runs and clicks through the AI’s apps and code, then collects screenshots and logs for grading.
  • Automatic graders use clear rubrics; some parts are judged by rules and others by vision/text AIs acting as referees.
  • Closed-source models scored higher on average (48.4%) than open-source ones (32.1%), but everyone still struggled on very long tasks.
  • Different models showed different strengths: some were better at fixing mistakes after feedback, others used fewer tokens or preferred certain tools.
  • Models often did best inside their own “home” toolkits (agentic scaffolds), showing that pairing the right framework with the right model matters.
  • AgencyBench is open-source and aims to help build agents that are more efficient, better at self-correcting, and useful in real life.
  • The benchmark turns long, messy, real tasks into a safe, repeatable, and fully automated test so the community can compare agents fairly.

Why This Research Matters

Real jobs aren’t short riddles; they’re long projects with many steps, and AgencyBench finally measures that. By automating feedback and grading inside a safe sandbox, it removes the need for constant human supervision, so testing can scale. It exposes what really separates agents: not just who answers well once, but who can plan, adapt, and finish reliably. It also shows the true costs in time and tokens, helping teams pick the most efficient options for their budgets. Finally, it highlights how much the right tools (scaffolds) matter, guiding builders to pair models with environments that unlock their best performance.

Detailed Explanation


01Background & Problem Definition

You know how a school spelling test checks only one skill, but building a school play checks many skills over weeks—like writing, acting, costumes, and lights? Early AI tests were like spelling tests: short and single-skill. Real life is more like a school play: long, messy, and full of moving parts.

🍞 Hook: Imagine you’re judging a science fair where each project takes weeks and needs experiments, graphs, and a working demo. A quick yes/no quiz wouldn’t capture who did the best job. 🥬 The Concept (Benchmark): A benchmark is a standardized test for AIs to compare their abilities.

  • What it is: A carefully designed set of tasks and scoring rules to measure how well different AI agents perform.
  • How it works: 1) Collect realistic tasks, 2) Define clear instructions and expected outputs, 3) Score with consistent rubrics.
  • Why it matters: Without a good benchmark, we can’t tell which AI agent is better at real jobs. 🍞 Anchor: A spelling bee vs. a full science project showcase; the second is a better benchmark for real scientific ability.

Before AgencyBench, many tests focused on narrow skills like browsing a webpage, calling one tool, or fixing a single code bug. Those are useful, but they miss what happens in the wild: long projects with dozens of steps, changing goals, and lots of trial and error.

🍞 Hook: You know how planning a school festival takes many meetings, lists, and deliveries over time? 🥬 The Concept (Long-Horizon Tasks): Long-horizon tasks are projects that take many steps and a long time to finish.

  • What it is: Tasks that require extended memory, planning, and many actions over hours or days.
  • How it works: 1) Start with a big goal, 2) Break it into smaller tasks, 3) Keep track of progress and changes, 4) Adjust based on feedback.
  • Why it matters: Without handling long horizons, an AI forgets earlier steps, repeats mistakes, or loses track of the plan. 🍞 Anchor: Building a small game over five lessons—each lesson adds features and depends on the last.

Researchers also ran into a huge bottleneck: real tasks usually needed a human to guide the AI every few steps. That doesn’t scale when you want to test hundreds of tasks automatically.

🍞 Hook: Imagine a teacher having to watch every single group project live, pausing every few minutes to give tips. They would never finish grading the class! 🥬 The Concept (Scenario): A scenario is a themed, multi-step mini-world where tasks build on each other.

  • What it is: A realistic storyline (like building a Gomoku game) split into 1–5 tasks that grow in difficulty.
  • How it works: 1) Start with a simple version, 2) Add features, 3) Fix bugs, 4) Add polish and stress tests.
  • Why it matters: Without scenarios, we only test short sprints, not real multi-stage projects. 🍞 Anchor: Level 1–5 of a game development project, where each level adds new rules and tools.

Even when tasks were realistic, scoring them was hard. Who checks if the layout is centered, the replay button works, or the database writes are correct? Doing this by hand is slow.

🍞 Hook: Think of trying to check every line in a student’s code by hand versus running an automatic test that says “all 10 checks passed.” 🥬 The Concept (Tool Call): A tool call is when an AI uses a helper tool like running a shell command, editing a file, or doing web search.

  • What it is: A single action the agent performs using its toolbox.
  • How it works: 1) The AI decides an action, 2) Calls the tool with inputs, 3) Reads outputs, 4) Plans next steps.
  • Why it matters: Real jobs need many precise tool calls; without them, the AI can’t actually build or run things. 🍞 Anchor: The AI opens a file, edits a function, runs tests, and checks logs—each is a tool call.

AgencyBench steps in to fix these gaps. It offers 32 real-world scenarios spanning game development, front-end, back-end, code generation, research, and MCP tool use—138 tasks in total. On average, a scenario needs about 90 tool calls, up to 1 million tokens of context, and hours of execution. To make it scalable, a simulated user gives iterative feedback, and a Docker sandbox safely runs the agent’s work to produce screenshots, videos, and logs for grading. With clear rubrics, everything can be scored automatically. This turns long, human-heavy testing into a repeatable, fair, and hands-free process that anyone can run.
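
To make that structure concrete, here is a minimal sketch in Python of how a scenario, its tasks, and their rubric items might be represented. The class and field names are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """One checkable requirement, e.g. 'board renders at 640±4 px wide'."""
    description: str
    weight: float = 1.0

@dataclass
class Task:
    """A single stage of a scenario, graded against its own rubric."""
    query: str                   # natural-language requirements given to the agent
    deliverables: list[str]      # files or artifacts the agent must produce
    rubric: list[RubricItem] = field(default_factory=list)

@dataclass
class Scenario:
    """A themed, multi-step project made of a few dependent tasks."""
    name: str                    # e.g. "Gomoku"
    domain: str                  # game dev, front-end, back-end, code, research, MCP
    tasks: list[Task] = field(default_factory=list)

# A two-task slice of a hypothetical Gomoku scenario.
gomoku = Scenario(
    name="Gomoku",
    domain="game development",
    tasks=[
        Task(
            query="Build a playable 15x15 Gomoku board in the browser.",
            deliverables=["index.html", "styles.css", "app.js"],
            rubric=[RubricItem("Board renders at 640±4 px wide"),
                    RubricItem("Clicking an empty cell places a stone")],
        ),
        Task(
            query="Add a winner banner and a replay button.",
            deliverables=["app.js"],
            rubric=[RubricItem("Winner banner appears on five in a row"),
                    RubricItem("Replay button resets the board")],
        ),
    ],
)
```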

02Core Idea

The “aha!” in one sentence: Make a realistic, long, automated playground where AI agents can do big, multi-step jobs while a simulated user and a safe sandbox check and score everything without humans standing by.

We’ll explain the idea three different ways:

  1. Theme park analogy: The agent enters a theme park (the scenario) with many rides (tasks). A robot guide (the user simulation agent) gives tips when the rider gets stuck. Safety rails (Docker sandbox) keep everyone safe. At the exit, scanners (rubrics + judges) check ride photos and stamps to grade the trip.
  2. Cooking show analogy: Chefs (agents) cook multi-course meals. A quiet assistant (user simulator) tastes and suggests fixes. A test kitchen (Docker) isolates the cooking. Judges (rubrics + LLM-as-judge) score taste, timing, and presentation.
  3. School project analogy: Students (agents) build a working app over weeks. A TA (user simulator) gives corrections. A lab room (Docker) locks down the environment. A grading sheet (rubric) plus expert reviewers (judges) assign scores.

Now, the building blocks—in the order that makes everything click.

🍞 Hook: Imagine practicing piano with a friendly teacher who tells you what went wrong right after you play a note. 🥬 The Concept (User Simulation Agent): A user simulation agent is a program that pretends to be a human user and gives feedback.

  • What it is: An automated partner that checks which requirements you missed and explains what to fix.
  • How it works: 1) Compare your work to the rubric, 2) List the missed parts, 3) Give concrete suggestions, 4) Repeat for another try.
  • Why it matters: Without this, a person has to sit there and guide every step, which doesn’t scale. 🍞 Anchor: After your app fails 4 out of 10 checks, the simulator returns exactly those 4 problems with tips to fix them.

🍞 Hook: You know how a sandbox lets you build castles without making a mess in the living room? 🥬 The Concept (Docker Sandbox): A Docker sandbox is a safe, isolated computer environment to run and test software.

  • What it is: A contained box where we can run code, render UIs, click buttons, and record results without risking the main system.
  • How it works: 1) Spin up a clean container, 2) Run the agent’s app or scripts, 3) Capture screenshots/videos/logs, 4) Shut it down.
  • Why it matters: Without isolation, tests could break your computer or leak settings, and results wouldn’t be reproducible. 🍞 Anchor: The Gomoku game runs in Docker; the system records clicks and takes screenshots to verify animations and layout.

🍞 Hook: Picture a robot referee who watches the whole game, checks the rulebook, and writes the score. 🥬 The Concept (Automated Evaluation Framework): An automated evaluation framework is a system that grades agents’ work using scripts and judges.

  • What it is: A pipeline that takes deliverables and artifacts, applies rubrics, and outputs scores and comments.
  • How it works: 1) Gather outputs (files, screenshots, logs), 2) Run rule checks, 3) Ask LLM judges for visual/subjective parts, 4) Combine scores.
  • Why it matters: Without automation, grading long tasks would be slow and inconsistent. 🍞 Anchor: For a front-end page, the framework checks pixel sizes with rules and asks a vision model if the layout looks correct.

🍞 Hook: Think of training wheels that help you ride straight but can be removed later. 🥬 The Concept (Agentic Scaffolds): Agentic scaffolds are support tools and structures that help agents plan, remember, and use tools effectively.

  • What it is: A toolbox (file editing, shell, web search, memory) plus routines for planning and feedback.
  • How it works: 1) Provide tools, 2) Guide multi-step reasoning, 3) Store context/memory, 4) Use feedback to improve.
  • Why it matters: Without scaffolds, even strong models can wander, forget steps, or misuse tools. 🍞 Anchor: The scaffold lets the agent open files, run tests, search docs, and save notes to a memory bank.
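
As a rough illustration only (not the paper's actual scaffold API), a scaffold can be pictured as a small toolbox the agent is allowed to call, plus a scratch memory:

```python
import subprocess
from pathlib import Path

class MiniScaffold:
    """A toy scaffold: file tools, a shell tool, and a note-taking memory."""

    def __init__(self, workspace: Path):
        self.workspace = workspace
        self.memory: list[str] = []          # simple in-run memory bank

    def read_file(self, name: str) -> str:
        return (self.workspace / name).read_text()

    def write_file(self, name: str, content: str) -> None:
        (self.workspace / name).write_text(content)

    def run_shell(self, command: str) -> str:
        # Run a command inside the workspace and return its combined output.
        result = subprocess.run(command, shell=True, cwd=self.workspace,
                                capture_output=True, text=True, timeout=120)
        return result.stdout + result.stderr

    def remember(self, note: str) -> None:
        self.memory.append(note)             # persists across steps of one rollout
```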

Why it works: Real work needs many steps, clear goals, safe execution, and steady feedback. AgencyBench gives agents a realistic stage (scenarios), a teacher (user simulator), a safe lab (Docker), and a fair grader (rubrics + judges). This combination reveals if an agent can start, adapt, and finish big jobs—not just answer one-off questions.

Before vs. after: Before, we compared short sprints; after, we compare marathons. Before, humans had to step in every few turns; after, a simulator handles feedback. Before, grading was manual; after, grading is automated and reproducible. The result: a clearer, fairer picture of which agents are truly ready for real-world work.

03Methodology

At a high level: Input (scenario with tasks and rubrics) → Agent works inside a scaffolded workspace with tools → User simulation agent gives feedback if needed → Deliverables sync to a Docker sandbox for safe execution and recording → Automated evaluators score with rubrics → Output (scores, comments, artifacts).

Let’s teach the pipeline like a recipe and introduce key ideas as we go.

🍞 Hook: Imagine giving a student a project sheet with clear goals and a grading checklist. 🥬 The Concept (Rubric): A rubric is a checklist used to grade work fairly and consistently.

  • What it is: A list of must-have features with pass/fail or scored criteria.
  • How it works: 1) Define criteria (e.g., button exists, layout width, function returns correct value), 2) Run checks, 3) Map passes to a score.
  • Why it matters: Without rubrics, two graders might give very different scores for the same work. 🍞 Anchor: “Board must be 640±4px wide” and “winner banner must appear”—these are rubric items for the Gomoku game.
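
For objective items like these, a rubric check is often just a tiny function. The sketch below is hypothetical, not code from the released benchmark:

```python
def check_board_width(measured_px: float, target: float = 640, tol: float = 4) -> bool:
    """Rule-based rubric item: board must be 640±4 px wide."""
    return abs(measured_px - target) <= tol

def check_winner_banner(dom_text: str) -> bool:
    """Rule-based rubric item: a winner banner must appear in the rendered page."""
    return "winner" in dom_text.lower()

# Map passed checks to a 0-10 score, as described above.
checks = [check_board_width(641.5), check_winner_banner("Winner: Black!")]
score = 10 * sum(checks) / len(checks)   # both pass -> 10.0
```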

Step 1: Prepare the workspace and tools

  • What happens: Each task starts in a clean, isolated workspace so past runs don’t leak into new ones. The agent receives the query (requirements), expected deliverables, and the rubric. It also gets a toolbox: read/write files, run shell commands, search the web, and manage long-term memory when available.
  • Why this step exists: Without isolation, results aren’t reproducible; without tools, the agent can’t actually build and test things.
  • Example: The agent creates index.html, styles.css, and app.js, then runs a shell command to start a local server.

🍞 Hook: Think of a tidy desk where everything for your project is in one place and doesn’t mix with someone else’s. 🥬 The Concept (Workspace): A workspace is a clean folder and environment dedicated to a single run.

  • What it is: A controlled place to store code, logs, and intermediate files for one task.
  • How it works: 1) Start fresh, 2) Keep only what’s needed, 3) Save artifacts, 4) Reset for the next task.
  • Why it matters: Without it, old files or settings could quietly break the new task. 🍞 Anchor: Task 2’s files live in their own workspace so Task 1’s mistakes don’t interfere.

🍞 Hook: And when it’s time to grade, you move the project to the teacher’s desk with special testing tools. 🥬 The Concept (Eval-space): Eval-space is the area where scoring happens using the collected artifacts.

  • What it is: A separate environment that receives screenshots, videos, and logs from the sandbox for grading.
  • How it works: 1) Sync artifacts from Docker, 2) Run evaluation scripts, 3) Produce scores and comments.
  • Why it matters: Keeping evaluation separate ensures fair, consistent grading and easy auditing. 🍞 Anchor: The video showing a 300ms-per-move replay is played and scored in eval-space.
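
A minimal sketch of the workspace/eval-space split, with invented directory names:

```python
import shutil
from pathlib import Path

def fresh_workspace(run_root: Path, task_id: str) -> Path:
    """Create an empty, isolated folder for one task so runs never mix."""
    ws = run_root / "workspace" / task_id
    if ws.exists():
        shutil.rmtree(ws)                 # wipe anything left from earlier runs
    ws.mkdir(parents=True)
    return ws

def sync_to_evalspace(run_root: Path, task_id: str, artifacts: list[Path]) -> Path:
    """Copy screenshots, videos, and logs into a separate grading area."""
    es = run_root / "evalspace" / task_id
    es.mkdir(parents=True, exist_ok=True)
    for artifact in artifacts:
        shutil.copy2(artifact, es / artifact.name)
    return es
```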

Step 2: Do the work (reason → act → check)

  • What happens: The agent plans a step, calls a tool (edit a file, run tests), reads outputs (test results, errors), then decides the next step. This loop continues until deliverables look ready.
  • Why: Big tasks require many small, correct steps. Skipping the loop leads to fragile work that fails later.
  • Example: After adding a replay button, the agent runs the app, notices a TypeError in checkWinner(), then edits the code to fix indexing.
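
The loop itself can be summarized in a few lines; the sketch below assumes hypothetical agent and scaffold helpers (plan_next_step, execute_tool, is_done) rather than any real SDK:

```python
def run_task(agent, scaffold, task, max_steps: int = 200):
    """Simplified reason -> act -> check loop; helper names are illustrative."""
    history = []                                   # becomes the rollout later
    for _ in range(max_steps):
        step = agent.plan_next_step(task, history)        # reason
        observation = scaffold.execute_tool(step)         # act (one tool call)
        history.append((step, observation))               # check the result
        if agent.is_done(task, history):                  # deliverables ready?
            break
    return history
```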

🍞 Hook: Picture someone writing a diary of every move in a chess match so you can replay it later. 🥬 The Concept (Rollout): A rollout is the full history of queries, actions, tool results, and feedback across tasks.

  • What it is: The timeline of how the agent worked through the scenario.
  • How it works: 1) Record each action and result, 2) Insert user feedback when below threshold, 3) Continue until done.
  • Why it matters: Without rollouts, you can’t study decisions, learn from errors, or reproduce results. 🍞 Anchor: The Gomoku scenario has five task rollouts chained into one complete story.

Step 3: Get feedback only when needed

  • What happens: If the current score is below the pass threshold, the user simulation agent lists the exact failed rubrics and why, plus specific “do this next” advice.
  • Why: This turns trial-and-error into targeted fixes and saves human time.
  • Example: “Your last-move ring is 20px, but the rubric requires 26±3px. Increase the ring size and re-run.”
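
Conceptually, the simulator only needs the rubric results to write this kind of feedback. A sketch under that assumption (the threshold and wording are invented):

```python
def simulate_user_feedback(rubric_results: dict[str, bool],
                           suggestions: dict[str, str],
                           pass_threshold: float = 0.8) -> str | None:
    """Return targeted feedback if the score is below threshold, else None."""
    score = sum(rubric_results.values()) / len(rubric_results)
    if score >= pass_threshold:
        return None                                   # no feedback needed
    failed = [item for item, ok in rubric_results.items() if not ok]
    lines = [f"{len(failed)} requirement(s) still fail:"]
    for item in failed:
        hint = suggestions.get(item, "Re-check this requirement and try again.")
        lines.append(f"- {item}: {hint}")
    return "\n".join(lines)

# If 4 of 10 checks fail, the agent gets back exactly those 4 items with tips.
```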

Step 4: Run in a safe place and capture evidence

  • What happens: Deliverables move into a Docker sandbox. Automated scripts open the app, click buttons, type input, and record screenshots and videos.
  • Why: We need reliable evidence (what the user would see) and safe execution isolated from the host machine.
  • Example: The sandbox clicks H8 and H9, records the pulsing ring, and saves movesturns.webm.
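
As a rough sketch of this step (the container image, port, and page selectors are invented; the real harness is more elaborate), serving the deliverables in a container and capturing evidence with Playwright might look like this:

```python
import subprocess
from playwright.sync_api import sync_playwright

def serve_in_sandbox(workspace: str, port: int = 8080) -> str:
    """Serve the agent's files from an isolated nginx container (illustrative)."""
    container_id = subprocess.run(
        ["docker", "run", "-d", "--rm",
         "-v", f"{workspace}:/usr/share/nginx/html:ro",
         "-p", f"{port}:80", "nginx:alpine"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return container_id

def capture_evidence(url: str, screenshot_path: str) -> None:
    """Open the app, click two cells, and save a screenshot for grading."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.click("#cell-H8")          # hypothetical board-cell selectors
        page.click("#cell-H9")
        page.screenshot(path=screenshot_path)
        browser.close()

# Usage sketch: cid = serve_in_sandbox("/tmp/run1"); capture_evidence("http://localhost:8080", "board.png")
# followed by subprocess.run(["docker", "stop", cid]) to tear the sandbox down.
```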

Step 5: Grade with rules and referees

  • What happens: Evaluation scripts do two kinds of grading:
    • Rule-based checks: Perfect for objective facts (file exists, JSON matches, function returns 'Black').
    • LLM-as-judge: For visuals and complex behaviors, text and vision models grade against the rubric.
  • Why: Some things are black-and-white; others need expert judgment.
  • Example: The width (640±4px) is rule-checked; the polish of the layout is judged by a vision model.

🍞 Hook: Like asking an art teacher to grade the poster and a math teacher to grade the graph. 🥬 The Concept (LLM-as-Judge): LLM-as-judge uses large models to grade subjective or visual parts.

  • What it is: Text and vision AIs that read code or see screenshots and score them by the rubric.
  • How it works: 1) Provide artifacts and rubric, 2) Ask for a score and justification, 3) Combine with rule checks.
  • Why it matters: Without this, visual behavior and UX would be hard to evaluate automatically. 🍞 Anchor: A vision judge confirms the replay animation timing is visible and smooth.
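
One simple way to blend the two kinds of grading into a single 0–10 score is sketched below; the equal weighting is an assumption for illustration, not the paper's actual aggregation rule, and the judge scores would come from the text/vision models described above:

```python
def combine_scores(rule_results: list[bool],
                   judge_scores: list[float],
                   rule_weight: float = 0.5) -> float:
    """Blend objective rule checks with 0-1 judge scores into a 0-10 total."""
    rule_part = sum(rule_results) / len(rule_results) if rule_results else 0.0
    judge_part = sum(judge_scores) / len(judge_scores) if judge_scores else 0.0
    return 10 * (rule_weight * rule_part + (1 - rule_weight) * judge_part)

# e.g. 3 of 4 rules pass and the vision judge gives 0.9 for layout polish:
total = combine_scores([True, True, True, False], [0.9])   # -> 8.25
```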

Step 6: Report and repeat

  • What happens: The system outputs a 0–10 score, comments, and artifacts. If allowed, the agent tries again using feedback.
  • Why: Final reports let us compare agents, and retries measure self-correction.
  • Example: Pass@1 and Pass@2 show how many tasks agents pass within one or two attempts, i.e., with at most zero or one round of simulator feedback.

🍞 Hook: Some students can fix their mistakes quickly; others need many tries. 🥬 The Concept (Agentic Scaffolds): Agentic scaffolds (revisited) are the frameworks and tools that make retries effective.

  • What it is: The environment and patterns for planning, acting, remembering, and using feedback.
  • How it works: 1) Tool APIs, 2) Memory banks, 3) Planning templates, 4) Feedback–repair loops.
  • Why it matters: Without good scaffolds, retries waste time and tokens. 🍞 Anchor: A model runs best inside its native SDK because the tools and prompts match how it was trained to think.

The secret sauce: co-optimizing the model with the scaffold and the evaluation loop. AgencyBench doesn’t just measure if an agent can do a step; it measures if the agent can keep going, fix itself with feedback, handle tools smartly, and finish the whole job reliably.

04Experiments & Results

The test: AgencyBench measures how well agents complete realistic, long scenarios across six abilities: game dev, front-end, back-end, code generation, research, and MCP tool use. Each scenario averages about 90 tool calls, up to 1M tokens, and hours of runtime, so it stresses memory, planning, and error recovery.

🍞 Hook: Think of grades on a report card—simple, clear, and comparable. 🥬 The Concept (Average Score): Average score is the percent of rubric items an agent satisfies, mapped to 0–10.

  • What it is: A single number showing overall task success.
  • How it works: 1) Count passed checks, 2) Divide by total, 3) Convert to a score.
  • Why it matters: Without a clear score, it’s hard to compare agents fairly. 🍞 Anchor: Passing 6 of 10 checks equals 60%—like a 6 out of 10.

🍞 Hook: How many students finish correctly on the first try versus needing a redo? 🥬 The Concept (Pass@k): Pass@k is the percent of tasks that reach the passing threshold within k attempts, i.e., with at most k−1 rounds of simulator feedback.

  • What it is: A measure of how quickly agents succeed with limited chances.
  • How it works: 1) Track which tasks pass within k attempts, 2) Divide by the total number of tasks.
  • Why it matters: Without it, we can’t see who benefits from feedback or who gets it right away. 🍞 Anchor: Pass@1 shows first-try (zero-feedback) success; Pass@2 allows one round of simulator feedback.

🍞 Hook: Some kids fix a worksheet in one extra try; others need five. Which one is more efficient? 🥬 The Concept (Attempt Efficiency): Attempt efficiency is success per attempt—a measure of how much score you get for each try.

  • What it is: A way to compare agents regardless of how many attempts they used.
  • How it works: 1) Take average score, 2) Divide by average attempts, 3) Higher is better.
  • Why it matters: Without it, an agent might look good just by trying many times. 🍞 Anchor: If two agents score similarly but one needs fewer rounds, it’s more attempt-efficient.

🍞 Hook: If two students get the same grade but one wrote a whole novel to get there, who used words more wisely? 🥬 The Concept (Token Efficiency): Token efficiency is success per token—a measure of how much score you get per token used.

  • What it is: A way to judge cost-effectiveness in long contexts.
  • How it works: 1) Take average score, 2) Divide by tokens used, 3) Higher means better use of context.
  • Why it matters: Without it, agents could waste massive context for tiny gains. 🍞 Anchor: An agent that scores well using half the tokens is more token-efficient.
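
Pulling the four metrics together, a minimal sketch (the data layouts are invented; the definitions follow the descriptions above):

```python
def average_score(task_scores: list[float]) -> float:
    """Mean of the 0-10 task scores; multiply by 10 to read it as a percentage."""
    return sum(task_scores) / len(task_scores)

def pass_at_k(attempts_to_pass: list[int | None], k: int) -> float:
    """Share of tasks passed within k attempts (at most k-1 feedback rounds).
    None means the task never reached the passing threshold."""
    passed = sum(1 for a in attempts_to_pass if a is not None and a <= k)
    return passed / len(attempts_to_pass)

def attempt_efficiency(avg_score: float, avg_attempts: float) -> float:
    """Score earned per attempt; higher means fewer retries wasted per point."""
    return avg_score / avg_attempts

def token_efficiency(avg_score: float, total_tokens: float) -> float:
    """Score earned per token; higher means better use of long contexts."""
    return avg_score / total_tokens

# Example: three tasks, passed on attempts 1 and 2, one never passed.
print(pass_at_k([1, 2, None], k=1))   # 0.33... -> only the first-try pass counts
print(pass_at_k([1, 2, None], k=2))   # 0.66... -> the feedback-assisted pass counts too
```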

Now, the scoreboard with context:

  • Overall, closed-source models averaged 48.4% vs. 32.1% for open-source models—like a solid C+ compared to a D+. Even top models struggled with very long, multi-step tasks, showing the challenge is real.
  • GPT-5.2 led proprietary models at about 56.5%, while GLM-4.6 led open-source at about 38.6%. Qwen-3-235B-A22B-Thinking was lowest at about 27.0%.
  • Pass@1 and Pass@2 showed how feedback helps: GPT-5.2’s relative improvement from Pass@1 to Pass@2 was about 88.9%, and the Claude models also gained over 80%. Kimi-K2 and Qwen-3 improved dramatically after feedback (up to 300% and ~200% relative), while DeepSeek-V3.2 barely improved, showing weak self-correction.
  • Resource usage varied a lot. GPT-5.2 used about 3.4M tokens and 89 turns to reach top scores (“brute force” but strong). Grok-4.1-Fast was very thrifty (~1.2M tokens, ~0.3 hours) and earned the best token efficiency. Claude-4.5-Sonnet used many tokens (~4.1M) but didn’t get matching gains, resulting in low token efficiency.
  • Specializations: Gemini-3-Pro dominated game and front-end, GPT-5.2 did best at back-end and code, and Claude-4.5-Sonnet topped research. Among open-source models, GLM-4.6 was the most balanced.
  • Tool-use behaviors showed “personalities”: GPT-5.2 and Claude-Opus leaned on shell commands, Gemini-3-Pro used memory tools more than others, Qwen-3 emphasized file edits, and Grok-4.1-Fast and GLM-4.6 relied heavily on web search.
  • Framework effects (agentic scaffolds) were big: Claude-Opus jumped by ~20.5% inside the Claude SDK compared to a neutral scaffold, and GPT-5.2 slightly preferred OpenAI’s SDK. Some open-source models also had sweet spots with certain scaffolds, while others lost ground when moved.

Surprises:

  • Feedback didn’t help everyone equally; some agents were great at adjusting, others were stubborn.
  • Faster or thriftier agents sometimes delivered the best “bang for buck” even if their raw scores weren’t top.
  • “Home-field advantage” was real: pairing the right scaffold with the right model changed outcomes notably.

05Discussion & Limitations

Limitations:

  • Model coverage: The AI landscape moves fast. AgencyBench tested a representative set, not everything. Results are a snapshot, not the last word.
  • Domain scope: Tasks live in digital, software-style worlds (games, web apps, code, research). Physical robots and real-world sensors are out of scope for now.
  • Cost and time: Long-horizon runs can eat tokens and hours. That’s the point—but it also means running many models at scale is expensive.
  • Judge dependence: Though carefully validated, LLM judges and user simulators still reflect the strengths and weaknesses of the chosen models.

Required resources:

  • Compute to run long scenarios, including millions of tokens and many tool calls.
  • Container infrastructure (Docker) for safe, reproducible runs.
  • Access to the judging models (text and vision) if using LLM-as-judge for visuals and subjective criteria.

When not to use it:

  • If you only need short, single-step tests—lighter benchmarks are faster and cheaper.
  • If you’re evaluating embodied robotics or tasks needing physical interaction—AgencyBench doesn’t cover that.
  • If your environment forbids sandboxing or containerization—safe execution is a core requirement here.

Open questions:

  • How can we reduce tokens and time while keeping reliability high? Can planning or memory tools replace sheer context size?
  • What designs make agents better at self-correction after feedback without human hints?
  • Which scaffold patterns generalize across models so performance depends less on “home-field” SDKs?
  • Can we expand beyond software-style tasks into mixed reality or agent–human teams while still keeping grading automated?

🍞 Hook: Think of how sports teams play better on their home field with familiar lights and locker rooms. 🥬 The Concept (Ecosystem Synergy): Ecosystem synergy means a model often performs best inside the toolkits and prompts it was designed or tuned for.

  • What it is: A performance boost when using a model’s native agent framework.
  • How it works: 1) Aligned tool APIs, 2) Matching prompt styles, 3) Optimized memory/planning routines, 4) Reduced friction.
  • Why it matters: Without acknowledging this, we might misjudge a model’s true capability or miss easy wins. 🍞 Anchor: Claude-Opus scored much higher in the Claude-Agent SDK than in a generic scaffold.

06Conclusion & Future Work

Three-sentence summary: AgencyBench is a big, realistic, and fully automated test bed for autonomous agents, built around long, multi-step scenarios that mirror real work. It combines a user simulation agent, a Docker sandbox, and rubric-based (plus LLM) grading to evaluate results at scale without a human in the loop. Experiments show strong models still struggle on very long tasks, resource use matters, and model–scaffold pairing can change outcomes a lot.

Main achievement: Turning long-horizon, real-world agent evaluation into a safe, repeatable, and automated pipeline that the community can run and trust.

Future directions:

  • Make agents more efficient with planning and memory so they do more with fewer tokens and attempts.
  • Design scaffolds that generalize across models, reducing “home-field” gaps.
  • Extend coverage to new domains (e.g., partially embodied tasks) while keeping automated grading.
  • Improve feedback-driven self-correction so agents learn faster from misses.

Why remember this: AgencyBench raises the bar from short sprints to real marathons, giving us a clear way to see which agents can start, adapt, and finish big jobs. It doesn’t just produce a leaderboard; it’s a diagnostic tool that shows what to fix—efficiency, self-correction, and scaffold fit—so agents can become truly useful in everyday, economically valuable work.

Practical Applications

  • Compare different agent stacks to choose the most efficient model–scaffold pair for your engineering team.
  • Stress-test your in-house agent on long, multi-step workflows (e.g., build–test–deploy) before shipping to production.
  • Tune your agent’s feedback loop by analyzing Pass@1 vs. Pass@2 to improve self-correction strategies.
  • Profile token and attempt efficiency to cut cloud costs without losing accuracy.
  • Validate UI and game behaviors automatically by using Docker-based visual checks and LLM judges.
  • Benchmark research assistants on deep-dive tasks (multi-source search, synthesis) with clear rubrics.
  • Identify your agent’s tool-use weaknesses (e.g., over-reliance on web search) and adjust the toolbox or prompts.
  • Run A/B tests on different scaffolds (OpenAI/Claude/custom) to find the best ecosystem fit for your model.
  • Create internal leaderboards for long-horizon tasks to guide model upgrades and budget planning.
  • Use rollouts to debug agent reasoning paths and design better planning or memory modules.
Tags: autonomous agents · long-horizon evaluation · agent benchmarking · user simulation agent · docker sandbox · automated grading · rubric-based evaluation · LLM-as-judge · agentic scaffolds · tool use analysis · token efficiency · attempt efficiency · Pass@k · ecosystem synergy · real-world scenarios