Towards a Science of Scaling Agent Systems
Key Summary
- Multi-agent AI teams are not automatically better; their success depends on matching the team's coordination style to the job's structure.
- The paper defines what a true "agentic" task is and compares five architectures across four real-world-style benchmarks under fair, controlled settings.
- A simple rule of thumb emerges: if a single agent already scores above about 45%, adding teammates often hurts more than it helps.
- There's a tool-coordination trade-off: the more tools a task needs, the more extra teamwork chatter slows things down.
- Errors can snowball in some team topologies: independent agents amplify mistakes 17.2×, while centralized teams hold this to 4.4×.
- A predictive model using measurable signals (efficiency, overhead, error amplification, redundancy, message density) explains over half of performance differences (R² ≈ 0.524).
- The best architecture differs by domain: centralized shines in parallelizable finance (+80.8%), decentralized is best for dynamic web navigation (+9.2%), and all multi-agent variants hurt sequential planning (-39% to -70%).
- The framework correctly picks the best strategy 87% of the time on new tasks and generalizes to newer models.
- Cost matters: multi-agent setups consume far more tokens per success, so using them only when they truly help saves money and time.
Why This Research Matters
AI agents are moving from toy demos to real jobs, where wasted tokens and wrong answers have real costs. This work gives builders a practical way to choose when to use a single agent or a team and which team style to use. By predicting success from simple, measurable signals, it prevents deploying flashy but inefficient setups. It also helps you avoid error cascades by choosing coordination that actually reduces, rather than spreads, mistakes. Ultimately, these rules make AI systems cheaper, faster, and more reliable in day-to-day workflows like research, support, and analytics.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you and your friends are doing a school project. Sometimes splitting up jobs (research, writing, drawing) makes you finish faster. Other times, everyone talking at once just makes a mess.
🥬 The Situation (The World Before): Agents (AI systems that think, plan, and act in steps) were being used everywhere: searching the web, helping with finances, coding, and planning long tasks. Lots of people believed that "more agents working together" would always beat a single smart agent. But results were mixed. Some demos looked great, others fell apart. Why? Because teams add coordination: messages, turn-taking, role assignments. That can help, or it can eat up the entire budget of attention and tokens.
🍞 Example: Three classmates can write a report faster than one, but if they keep passing notes about who does what, they might run out of time before writing anything.
🍞 Hook: You know how some tests check knowledge with one question, but real science projects take many steps? Those are different kinds of challenges.
🥬 New Concept: Agentic Evaluation
- What it is: A way to test AI on real, interactive, multi-step tasks where the AI must gather info, adapt, and try again.
- How it works:
- The AI interacts with an environment (webpages, tools, files) over several turns.
- It gets partial information, so it must ask, look up, and verify.
- It changes its plan based on feedback and new facts.
- Why it matters: If you use single-shot quizzes to judge agents, you'll miss whether they can actually work through real problems.
🍞 Anchor: A browsing agent can't answer "Which CEO just resigned and why?" with one glance. It must search, open pages, compare sources, and update its answer.
🍞 Hook: When you invite more friends to help, sometimes it's awesome, sometimes it's chaos.
🥬 Problem: No Principles
- What it is: We lacked a scientific way to predict when adding agents helps or hurts.
- How it works: People tried many team shapes (central leader, group chat, no talking) but also changed prompts, tools, or token budgets, so results were confounded.
- Why it matters: Without clear rules, teams got built on guesswork, wasting time, money, and patience.
🍞 Anchor: If you let one group use calculators and another group not, you can't tell if "group work" or "calculators" caused better scores.
🍞 Hook: Think about recipes. You need to know how doubling or tripling affects cook time and ingredients.
🥬 New Concept: Scaling Laws (for agents)
- What it is: Rules that describe how performance changes when we add more agents, tools, or capability.
- How it works:
- Measure performance as you vary number of agents and coordination style.
- Track costs: messages, turns, and tokens.
- Find patterns that predict gains or losses.
- Why it matters: Without these rules, we keep repeating the same mistakes as teams grow.
🍞 Anchor: If adding more cooks just fills the kitchen and slows the meal, you've learned a scaling law: crowding hurts.
🍞 Hook: If every friend needs a different gadget for their part, coordinating gets tricky.
🥬 Gap to Fill
- What it is: A clean, fair test bed that holds prompts, tools, and token budgets constant while only changing the team structure and model level.
- How it works: Compare single-agent and four team styles across four benchmarks (finance, web browsing, planning, and workplace tasks) using three model families.
- Why it matters: Now we can attribute differences to coordination itself, not side effects.
🍞 Anchor: It's like testing runners on the same track, with the same shoes, and the same distance, only changing whether they run alone or pass a baton.
🍞 Hook: Why should you care? Because messy teamwork wastes time, money, and trust.
🥬 Real Stakes
- What it is: Picking the right AI setup for real jobs (research, coding, customer ops) saves cost and increases reliability.
- How it works: If we can predict which coordination style fits a task, we can avoid expensive, slow, or wrong answers.
- Why it matters: In finance research or support workflows, delay and errors can be costly.
🍞 Anchor: If an AI team burns 5× more tokens but answers worse, your app becomes slow, pricey, and unreliable.
02 Core Idea
🍞 Hook: Imagine you're building a soccer team. You don't win just by adding more players; you win by putting them in the right positions for the game plan.
🥬 The Aha!
- What it is: The key insight is that performance comes from matching the team's coordination style to the task's structure, not from simply adding more agents.
- How it works:
- Measure simple signals during teamwork: efficiency, overhead, error growth, redundancy, message density.
- Combine them with task properties (tool count, decomposability, single-agent baseline) and model capability.
- Use these to predict which architecture will do best.
- Why it matters: This turns guesswork into a science that picks the right team shape 87% of the time.
🍞 Anchor: If a task splits naturally into parts (like finance: revenue, costs, market), a central coordinator helps. If it's a strict sequence (like step-by-step crafting plans), teams slow you down.
Three Analogies:
- Orchestra vs. Soloist: A conductor (centralized) helps when many sections play different parts in parallel; a solo violin (single agent) shines on a piece that must flow in strict order.
- Kitchen Brigade: A head chef coordinates parallel stations for a banquet (centralized); but making a delicate soufflé (sequential) is best done by one focused cook.
- Rescue Team: Many searchers fan out for clues (decentralized) in a big park, but a narrow cave crawl (sequential) is safer with one careful expert.
Before vs. After:
- Before: "More agents is better" was the common belief, and results varied wildly across papers.
- After: We have quantitative rules: tool-coordination trade-off, capability saturation around a 45% single-agent baseline, and topology-dependent error amplification. We can predict winners.
Why It Works (intuition):
- Extra agents fragment the context and consume tokens with messaging. That helps exploration when tasks split up nicely, but hurts when every step depends tightly on the last. Error checks (like a central reviewer) stop mistakes from snowballing, but at a coordination cost. The best setup balances these forces for the specific job.
Building Blocks (with Sandwich explanations):
🍞 Hook: You know how too much talking during a group project can eat the time you need to actually do the work?
🥬 New Concept: Tool-Coordination Trade-off
- What it is: The more tools a task needs, the more expensive (in tokens and time) team coordination becomes.
- How it works:
- Each agent needs tokens to think and to call tools.
- Teams also need tokens to message and synchronize.
- With many tools, these costs pile up and squeeze out real reasoning.
- Why it matters: On tool-heavy tasks, teams often lose to a single well-equipped agent.
🍞 Anchor: If a recipe uses 16 gadgets, passing updates between eight cooks burns time; you're still preheating when dinner was due.
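To make the trade-off concrete, here is a back-of-the-envelope sketch under a fixed token budget. The per-tool-call and per-message token costs, and the assumption that every agent pair exchanges a few messages, are illustrative guesses, not numbers from the paper.

```python
# Illustrative sketch of the tool-coordination trade-off under a fixed token budget.
# The per-call and per-message costs below are invented for illustration only.

def reasoning_budget(total_tokens: int, n_agents: int, n_tools: int,
                     tokens_per_tool_call: int = 400,
                     tokens_per_message: int = 250,
                     messages_per_pair: int = 3) -> int:
    """Tokens left for actual reasoning after tool calls and coordination chatter."""
    tool_cost = n_tools * tokens_per_tool_call        # every setup pays for its tools
    pairs = n_agents * (n_agents - 1) // 2            # who can talk to whom
    coordination_cost = pairs * messages_per_pair * tokens_per_message
    return total_tokens - tool_cost - coordination_cost

# With many tools, a four-agent team keeps noticeably less budget for reasoning
# than a single well-equipped agent.
print(reasoning_budget(30_000, n_agents=1, n_tools=16))  # 23600
print(reasoning_budget(30_000, n_agents=4, n_tools=16))  # 19100
```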
🍞 Hook: Imagine adding more friends to help with homework, but the score barely goes up.
🥬 New Concept: Capability Saturation
- What it is: When a single agent already performs well (about 45% or more), adding teammates often gives diminishing or negative returns.
- How it works:
- The better the soloist, the less there's left to fix.
- Coordination adds overhead even when nothing needs fixing.
- Net effect: the team can do worse than the solo agent.
- Why it matters: Don't pay extra for a team when one agent is already strong enough.
🍞 Anchor: If one student already aces the test, adding helpers who need meetings just wastes time.
🍞 Hook: Remember the telephone game where a small mistake becomes a big misunderstanding?
🥬 New Concept: Error Amplification
- What it is: Team topologies can grow small mistakes into big failures.
- How it works:
- Independent agents don't check each other, so errors multiply (17.2×).
- Centralized teams route through a checker, containing errors (to 4.4×).
- Debate (decentralized) helps some, but also adds costs.
- Why it matters: Picking a topology that absorbs errors beats one that spreads them.
🍞 Anchor: A teacher reviewing all group answers (centralized) catches more mistakes than groups handing in papers separately (independent).
🍞 Hook: You wouldn't use a bulldozer to frost a cake.
🥬 New Concept: Architecture-Task Alignment
- What it is: The best team design depends on the job: parallel tasks like finance prefer centralized; open exploration like web browsing prefers decentralized; strict sequences prefer single agents.
- How it works:
- Identify task structure (decomposable vs. sequential).
- Estimate tool load and single-agent baseline.
- Choose the coordination style that matches.
- Why it matters: Right-size the team and save tokens, time, and headaches.
🍞 Anchor: Finance (+80.8% with centralized) splits into parts, but PlanCraft (strict sequences) gets slower and worse with teams (-39% to -70%).
03 Methodology
At a high level: Task + Tools + Model → Choose architecture → Run with matched budgets → Log coordination signals → Predict performance and pick the best.
Step 1: Define true agentic tasks and hold everything else constant
- What happens: The authors use four benchmarks that require multi-step interaction: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. They lock prompts, tools, and total token budgets the same across all setups.
- Why this exists: If tools or budgets differ, you can't tell whether coordination or unfair advantages caused differences.
- Example: Every setup gets the same web search, file, or browser tools; only the team topology (single vs. centralized vs. decentralized vs. independent vs. hybrid) changes.
Step 2: Compare five architectures fairly
- What happens: They test Single-Agent (SAS) and four Multi-Agent Systems (MAS): Independent, Centralized, Decentralized, Hybrid.
- Why this exists: These cover key styles (no communication, hub-and-spoke, peer debate, and a mix), so we can attribute effects to coordination, not random variation.
- Example: In Centralized, an orchestrator assigns sub-tasks and checks work; in Decentralized, agents discuss and reach consensus; in Independent, agents donāt talk and outputs are just collected.
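For readers who like to see the comparison as data, here is a minimal sketch of the five topologies; the dataclass and its field names are our own shorthand for the descriptions above.

```python
# Minimal representation of the five architectures compared in the study.
# The field names are our shorthand; the comments paraphrase the text above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Topology:
    name: str
    multi_agent: bool        # more than one worker agent
    has_orchestrator: bool   # a hub that assigns sub-tasks and checks work
    peer_messaging: bool     # agents talk to each other directly

ARCHITECTURES = [
    Topology("single",        multi_agent=False, has_orchestrator=False, peer_messaging=False),
    Topology("independent",   multi_agent=True,  has_orchestrator=False, peer_messaging=False),  # outputs just collected
    Topology("centralized",   multi_agent=True,  has_orchestrator=True,  peer_messaging=False),  # hub-and-spoke
    Topology("decentralized", multi_agent=True,  has_orchestrator=False, peer_messaging=True),   # debate / consensus
    Topology("hybrid",        multi_agent=True,  has_orchestrator=True,  peer_messaging=True),   # orchestrator + peer talk
]
```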
Sandwich intros for core metrics and pieces:
🍞 Hook: Like a classroom, you can track how often kids talk, how long projects take, and how many answers are right.
🥬 New Concept: Coordination Overhead
- What it is: Extra cost (tokens/turns) from agents messaging and synchronizing.
- How it works:
- Count total turns and messages vs. single agent.
- Compute relative increase (e.g., +285%).
- Tie it to success per token.
- Why it matters: Overhead can drown out any benefit from teamwork.
🍞 Anchor: Hybrid had the most overhead (about 6.2× more turns than the single agent).
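One plausible way to operationalize this metric (the paper's exact formula may differ) is the relative increase in turns or tokens over the single-agent baseline:

```python
# Coordination overhead as relative increase over the single-agent baseline;
# one reasonable operationalization, not necessarily the paper's exact formula.
def coordination_overhead(team_turns: float, solo_turns: float) -> float:
    """Fraction of extra turns the team spends relative to the single agent."""
    return team_turns / solo_turns - 1.0

# A hybrid run using about 6.2x the turns of the single agent:
print(f"{coordination_overhead(team_turns=62, solo_turns=10):+.0%}")  # +520%
```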
🍞 Hook: Imagine your group gets 10 minutes. If you spend 8 just planning, only 2 are left to solve the problem.
🥬 New Concept: Efficiency (Success per cost)
- What it is: How many successes you get for your turns/tokens.
- How it works:
- Measure success per 1,000 tokens and per turn.
- Normalize to compare apples-to-apples.
- Higher is better value.
- Why it matters: It decides if a setup is worth the money and time.
🍞 Anchor: The single agent scored about 67.7 successes per 1,000 tokens; centralized dropped to ~21.5.
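A minimal sketch of "success per cost," normalized to 1,000 tokens; the run counts and token totals below are placeholders chosen only to land near the figures quoted above.

```python
# Success per 1,000 tokens; higher means better value for the budget spent.
def success_per_kilotoken(n_successes: float, total_tokens: float) -> float:
    return 1000.0 * n_successes / total_tokens

# Placeholder inputs, scaled to echo the quoted ~67.7 vs ~21.5 figures.
solo_efficiency = success_per_kilotoken(n_successes=130, total_tokens=1_920)
team_efficiency = success_per_kilotoken(n_successes=130, total_tokens=6_050)
print(round(solo_efficiency, 1), round(team_efficiency, 1))  # 67.7 21.5
```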
🍞 Hook: When teammates keep repeating the same point, time gets wasted.
🥬 New Concept: Redundancy
- What it is: How much different agents repeat the same work.
- How it works:
- Compare outputs for similarity.
- Track overlap vs. diversity.
- Some redundancy helps catch errors; too much is waste.
- Why it matters: The sweet spot balances shared grounding with fresh ideas.
🍞 Anchor: Centralized had a median redundancy around 0.41, close to the measured sweet spot.
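As a stand-in for whatever similarity measure the authors used, mean pairwise word overlap (Jaccard) between agent outputs captures the same idea:

```python
# Redundancy as mean pairwise word overlap between agent outputs.
# Jaccard over word sets is a stand-in; the paper's exact measure isn't given here.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0

def redundancy(outputs: list[str]) -> float:
    """0 means every agent said something different; 1 means everyone said the same thing."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

print(redundancy(["revenue grew 12%", "revenue grew 12%", "costs fell slightly"]))  # ~0.33
```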
🍞 Hook: If nobody checks, small errors grow like snowballs.
🥬 New Concept: Error Amplification (as a metric)
- What it is: How many times more likely errors become under a given topology.
- How it works:
- Measure factual error rates in team vs. solo.
- Compute a factor (e.g., 17.2× for independent).
- Compare across architectures.
- Why it matters: Topologies that absorb errors are safer.
🍞 Anchor: Centralized cut amplification down to about 4.4× by validating through the orchestrator.
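The metric itself is just a ratio of error rates; the rates below are placeholders picked to reproduce the quoted 17.2× and 4.4× factors.

```python
# Error amplification: how many times more frequent factual errors are
# under a topology than in the solo baseline. Rates below are placeholders.
def error_amplification(team_error_rate: float, solo_error_rate: float) -> float:
    return team_error_rate / solo_error_rate

print(round(error_amplification(team_error_rate=0.344, solo_error_rate=0.02), 1))  # 17.2 (independent)
print(round(error_amplification(team_error_rate=0.088, solo_error_rate=0.02), 1))  # 4.4  (centralized)
```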
🍞 Hook: Think of a school's "reading level": a quick sense of capability.
🥬 New Concept: Intelligence Index
- What it is: A combined score of model skill across reasoning, coding, and knowledge.
- How it works:
- Use external benchmarks to assign a level.
- Compare performance trends as level rises.
- Watch for linear or nonlinear gains.
- Why it matters: Capability interacts with coordination; stronger models don't automatically make teams better if coordination mismatches the task.
🍞 Anchor: Across families, higher Intelligence Index helped linearly, but benefits still depended on architecture choice.
Step 3: Run controlled experiments (N=180)
- What happens: For each benchmark, model family, and architecture, they run many trials with matched budgets.
- Why this exists: To build enough data to see reliable patterns and not be fooled by luck.
- Example: Finance-Agent under Centralized rose about +80.8% vs. SAS; PlanCraft fell sharply for all MAS.
Step 4: Fit a predictive model from measured signals
- What happens: A mixed-effects model uses intelligence, tool count, agent count, single-agent baseline, and coordination signals (efficiency, overhead, error amplification, redundancy, message density).
- Why this exists: To predict performance and pick the best architecture on new tasks.
- Example: The model explains over half of the variance (R² ≈ 0.524) and correctly chooses the best team design 87% of the time.
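A hypothetical sketch of this kind of fit using statsmodels; the column names, formula terms, and grouping variable are assumptions, since the paper's exact specification isn't reproduced here.

```python
# Hypothetical mixed-effects fit; column names, formula, and grouping are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("runs.csv")  # assumed layout: one row per (benchmark, model, architecture) run

model = smf.mixedlm(
    "performance ~ intelligence + tool_count + agent_count + solo_baseline"
    " + efficiency + overhead + error_amplification + redundancy + message_density",
    data=df,
    groups=df["benchmark"],   # random intercept per benchmark
)
result = model.fit()
print(result.summary())       # inspect which coordination signals carry the weight
```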
Secret Sauce
- Match architecture to task decomposability and tool load.
- Use efficiency and overhead as "budget guards."
- Watch the 45% single-agent threshold to avoid over-teaming.
04 Experiments & Results
The Test (What and Why)
- Measure success on four agentic benchmarks that require multi-step interaction: Finance-Agent (financial analysis), BrowseComp-Plus (deep web research), PlanCraft (sequential planning), and Workbench (realistic tool workflows).
- Why: These cover structured parallel tasks, dynamic open-world tasks, and strict sequences, which makes them great for seeing when teams help or hurt.
The Competition (Who vs. Who)
- Baseline: Single-Agent System (SAS).
- Challengers: Four Multi-Agent Systems (MAS): Independent, Centralized, Decentralized, Hybrid.
- Models: Three LLM families across multiple capability levels (all run under matched token budgets and identical tools).
The Scoreboard (with context)
- Finance-Agent: Centralized improves by +80.8% (like jumping from a C to a solid A). Decentralized and Hybrid also shine (+74.5% and +73.1%). Why? Finance splits naturally into parallel sub-analyses that a coordinator can stitch together.
- BrowseComp-Plus: Decentralized gives +9.2% (a small but real bump), Centralized ~+0.2% (basically flat), and Independent drops (-35%) because uncoordinated exploration duplicates mistakes.
- PlanCraft: All MAS hurt performance severely (-39% to -70%). Why? It's a strict sequence: decomposition adds waste, and every extra message steals tokens from the next needed step.
- Workbench: Small effects (about -11% to +6%). It's mixed: some tasks decompose, others are short and tool-heavy where overhead dominates.
Surprising Findings
- Capability Saturation: When the solo baseline is already strong (~45%+), teams usually backfire. Diminishing improvements can't outrun coordination costs.
- Error Dynamics: Independent teams amplified errors 17.2×; centralized cut this to 4.4× via validation bottlenecks.
- Tool-Coordination Trade-off: More tools predict worse multi-agent efficiency; overhead × tool count compounds into big losses for complex toolchains.
- Prediction Works: Using measured coordination signals, the model explains over half the performance differences and picks the best architecture 87% of the time on new setups.
- Generalization: Validation on a newer model family configuration (released after the study) kept errors low (MAE ≈ 0.071) and confirmed most principles.
05 Discussion & Limitations
Limitations
- Architecture Coverage: The study focuses on five canonical designs and team sizes up to nine; very large swarms might behave differently (and likely hit communication walls).
- Model Diversity: Mixed capability levels were tested within each family, but not wildly different base architectures or specialized fine-tunes; true epistemic diversity remains underexplored.
- Prompt Sensitivity: Prompts were controlled for fairness, not optimized per model. With tuning, some architectures might shift their sweet spots.
- Benchmark Breadth: Four strong domains were used, but not embodied robots, multimodal long horizons, or multi-user social tasks.
- Economics: Teams often cost 3–6× more tokens per success; practical deployments must mind efficiency.
Required Resources
- Access to multiple capable LLMs (with tool use), a standardized tool layer (browser, code exec, file I/O), and orchestration to log turns, messages, and costs.
- Budget to run matched-token experiments to avoid confounds.
When NOT to Use Multi-Agent Teams
- Sequential tasks with tight step dependencies (e.g., PlanCraft-like pipelines).
- Tool-heavy workflows where overhead × tools overwhelms signal (many APIs, deep chains).
- When the single agent already passes the 45% threshold on your metric.
Open Questions
- Can smarter protocols (sparse messaging, early-exit, role-aware routing) beat the overhead wall?
- How much does true epistemic diversity (different model types) improve robustness vs. add noise?
- Can we automatically detect decomposability and pick the right architecture on the fly?
- What changes in multimodal, embodied, or weeks-long tasks with memory beyond context windows?
06 Conclusion & Future Work
3-Sentence Summary
- Multi-agent AI teams only help when their coordination style matches the job's structure; otherwise, overhead and error cascades erase gains.
- Measurable signals (efficiency, overhead, error amplification, redundancy, and message density) predict performance across tasks and models (R² ≈ 0.524).
- The framework chooses the right architecture 87% of the time and reveals key laws: tool-coordination trade-off, a ~45% single-agent saturation point, and topology-dependent error growth.
Main Achievement
- Turning multi-agent design from a guessing game into a predictive science that links task properties to the best coordination topology.
Future Directions
- Design lighter protocols (sparse comms, early exits, distilled coordinators) to cross the overhead wall.
- Add real epistemic diversity (different model types and specializations) and new domains (embodied, multimodal, long-horizon) to stress-test the laws.
- Build automatic āmeta-orchestratorsā that estimate decomposability and pick (or even morph) the architecture on demand.
Why Remember This
- Because "more agents" isn't a free lunch. The right team for the right task saves tokens, time, and trust; now we have simple rules and a working predictor to make that choice confidently.
Practical Applications
- Use the 45% rule: if your single agent already scores above ~45% on your task metric, prefer single-agent over multi-agent.
- Estimate task decomposability: if the work splits into parallel subproblems, try centralized; if it's open exploration, try decentralized; if strictly sequential, stick to single-agent.
- Count tools before teaming: for tool-heavy tasks (e.g., 12–16 tools), avoid high-overhead topologies like hybrid; consider SAS or a lean decentralized setup.
- Track efficiency (success per 1,000 tokens): if it drops below half your SAS baseline, your team is likely over-coordinating.
- Add a validation bottleneck: for error-prone tasks, centralized coordination can contain mistakes; avoid independent topologies.
- Cap agent count: beyond 3–4 agents, overhead often dominates under fixed budgets; scale depth of reasoning instead of headcount.
- Log message density: if you exceed ~0.39 messages/turn without gains, cut communication rounds or enforce early exits.
- Choose sub-agent strength over strong orchestrators: invest capability where the real work happens; keep the coordinator lean.
- Pilot and predict: collect coordination signals on a small slice, feed them to the model (efficiency, overhead, redundancy, error amplification), and pick the architecture before full deployment (see the sketch below).
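To tie these rules together, here is a rough decision helper that encodes the thresholds quoted in this article (the ~45% baseline, the 12-plus-tool warning, and the decomposable-versus-sequential split). The function and exact cutoffs are our own distillation, not the paper's algorithm.

```python
# Rule-of-thumb architecture picker distilled from the guidance above.
# Cutoffs are our reading of the article, not the paper's exact algorithm.

def pick_architecture(solo_baseline: float, decomposable: bool,
                      sequential: bool, n_tools: int) -> str:
    if solo_baseline >= 0.45:
        return "single"          # capability saturation: don't pay for a team
    if sequential:
        return "single"          # tight step dependencies punish coordination
    if n_tools >= 12:
        return "single"          # tool-coordination trade-off dominates
    if decomposable:
        return "centralized"     # parallel sub-tasks plus a validating orchestrator
    return "decentralized"       # open-ended exploration benefits from debate

print(pick_architecture(0.30, decomposable=True,  sequential=False, n_tools=6))   # centralized
print(pick_architecture(0.52, decomposable=True,  sequential=False, n_tools=6))   # single
print(pick_architecture(0.30, decomposable=False, sequential=False, n_tools=4))   # decentralized
```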