
AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Beginner
Jianhao Ruan, Zhihao Xu, Yiran Peng et al. Ā· 2/3/2026
arXiv Ā· PDF

Key Summary

  • AOrchestra is like a smart conductor that builds the right mini-helpers (sub-agents) on demand to solve big, multi-step tasks.
  • It represents every agent as a simple four-part recipe: Instruction, Context, Tools, and Model, which makes helpers easy to create and swap.
  • The orchestrator never does the task itself; it plans, creates the right helper at the right time, and decides when to stop.
  • Carefully choosing just the useful context for each helper beats giving no context or dumping everything, which avoids confusion.
  • AOrchestra learns in two ways: fine-tuning for better planning and in-context learning for cheaper-yet-strong model routing.
  • Across GAIA, Terminal-Bench 2.0, and SWE-Bench-Verified, AOrchestra outperforms popular systems like ReAct, OpenHands, and mini-SWE.
  • With Gemini-3-Flash, it reaches 80.00 pass@1 on GAIA, 52.86 on Terminal-Bench, and 82.00 on SWE-Bench-Verified.
  • It’s framework-agnostic and plug-and-play: you can swap in different sub-agent backends (like ReAct-style or mini-SWE-style) and still get strong results.
  • The system finds better cost–performance trade-offs by choosing smaller models for simple steps and bigger ones only when needed.
  • This approach reduces human hand-crafting of roles and workflows while increasing adaptability in wild, open-ended tasks.

Why This Research Matters

Many real tasks are long and messy, like configuring servers, fixing real bugs, or researching multi-step questions. AOrchestra lets an AI plan like a project manager and then spin up the right specialist with the right tools and just enough context. This reduces human hand-crafting of roles and avoids drowning helpers in irrelevant information. Picking small models for easy steps and strong ones only when needed keeps costs in check. Because sub-agents are plug-and-play, teams can reuse existing tools and frameworks instead of rebuilding from scratch. In short, it makes complex automation more accurate, cheaper, and easier to adapt to new problems.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how a school play needs a director who calls in the right actors at the right time, gives them their lines, and makes sure each scene has the props it needs? Without a director, everyone talks over each other or forgets what to do next.

🄬 Filling (The Actual Concept):

  • What it is: AOrchestra is a way to run AI helpers like a well-directed play, where a main 'orchestrator' calls in just the right sub-agent for each scene.
  • How it works: Step by step, the orchestrator breaks a big job into smaller jobs, gives each sub-agent exactly the right instructions and materials, and then decides what to do next based on the results.
  • Why it matters: Without this direction, helpers either share too much and get confused or miss key facts, so long tasks fall apart.

šŸž Bottom Bread (Anchor): Imagine researching a school project: you might ask a 'web search helper' to find sources, a 'math helper' to check numbers, and a 'writer helper' to summarize. AOrchestra is the teacher who tells each helper exactly what to do, with just the notes they need.

šŸž Top Bread (Hook): Picture before this work: AI agents were like teams that always followed the same script, no matter what. If a new kind of problem appeared, the script didn’t fit.

🄬 Filling:

  • What it is: The world before had two common patterns. One treated sub-agents as isolated threads to keep their notes clean. The other hard-coded fixed roles (like 'Coder' or 'Searcher') you set up by hand.
  • How it works: Isolated threads reduced context mess but didn’t add new skills. Fixed roles added skills but were rigid and required lots of manual setup.
  • Why it matters: Real problems change a lot. Fixed scripts and rigid teams can’t cover all the surprising sub-tasks.

šŸž Bottom Bread: If you always bring the same three kids to every group project (writer, artist, presenter), you’ll do fine for posters but struggle with coding a website or running an experiment.

šŸž Top Bread (Hook): Imagine carrying your whole backpack to every class instead of only the books you need. You’d be slow and distracted.

🄬 Filling:

  • What it is: Long tasks suffer from 'context rot'—too much, stale, or irrelevant info makes models worse over time.
  • How it works: As conversations grow, flooding every sub-agent with everything leads to noise. Hiding too much starves them.
  • Why it matters: Without careful context control, helpers make more mistakes as tasks get longer.

šŸž Bottom Bread: If you hand your friend the entire library to find one fact, they’ll waste time; if you give them nothing, they can’t help. The sweet spot is a small, relevant packet.

šŸž Top Bread (Hook): Think of LEGOs: you can build anything if you have the right pieces and a clear plan.

🄬 Filling:

  • What it is: The missing piece was a simple, universal way to describe any agent so it can be created on demand.
  • How it works: AOrchestra defines every agent as a four-part recipe: Instruction (goal), Context (the evidence), Tools (what actions it can take), and Model (the brain). This makes new helpers easy to assemble.
  • Why it matters: With this recipe, the orchestrator can spawn exactly the helper needed for the next subtask—no overkill, no missing parts.

šŸž Bottom Bread: Building a 'map-reading helper' on the fly: Instruction = find shortest route; Context = today’s traffic; Tools = map API; Model = a small reasoning LLM. Done.

šŸž Top Bread (Hook): Why should anyone care? Because your apps, websites, and code editors are turning into smart teammates.

🄬 Filling:

  • What it is: A practical way to automate complex, multi-step digital work reliably and affordably.
  • How it works: The orchestrator picks cheaper models for easy steps and stronger ones for tricky parts, manages context precisely, and plugs in the right tools.
  • Why it matters: This saves time and money, reduces human hand-tuning, and handles surprise tasks better.

šŸž Bottom Bread: Planning a trip: cheap model searches flights, a code tool parses CSV prices, a stronger model decides trade-offs, and the orchestrator stitches it all into a perfect itinerary.

02Core Idea

šŸž Top Bread (Hook): Imagine a coach who builds the perfect lineup right before each play—sometimes you need speed, sometimes strength, sometimes strategy.

🄬 Filling (The Actual Concept):

  • What it is: The 'aha!' is to treat sub-agents as on-demand specialists created from a simple four-part recipe: Instruction, Context, Tools, Model.
  • How it works: For each subtask, the orchestrator fills in that recipe, spawns a tailored helper, gets the result, and repeats until done. It only has two actions: Delegate(this recipe) or Finish(final answer).
  • Why it matters: This avoids one-size-fits-all teams or messy info piles. Each helper is just capable enough, with just enough info, for just this step.

šŸž Bottom Bread (Anchor): Solving a bug: the orchestrator creates a 'log-reader helper' (Instruction: find error root cause; Context: failing logs; Tools: file view; Model: small), then a 'patcher helper' (Instruction: fix and test; Tools: edit+pytest; Model: larger), then finishes.

šŸž Top Bread (Hook): Think of a Swiss Army knife vs. a toolbox. One tool tries to do everything; the other lets you pick exactly what you need.

🄬 Filling: Four-Tuple Abstraction

  • What it is: A four-tuple is a neat list of four items that fully describe an agent: Instruction (what to do), Context (what to look at), Tools (what it can use), Model (which brain to think with).
  • How it works:
    1. Set Instruction: clear goal and success check,
    2. Curate Context: only the most relevant notes,
    3. Pick Tools: minimal actions needed,
    4. Choose Model: balance smarts and cost.
  • Why it matters: Without this structure, helpers are either too weak (missing tools/info) or too expensive (overpowered brain for a tiny job).

šŸž Bottom Bread: A data-cleaning helper: Instruction = fix date formats; Context = sample rows; Tools = code runner; Model = small coder model.

šŸž Top Bread (Hook): If everyone in a meeting tries to type on the same laptop, chaos. One person should coordinate.

🄬 Filling: Orchestrator

  • What it is: A planner that never executes environment actions; it only delegates or finishes.
  • How it works: It decomposes goals into subtasks, instantiates the right helper via Delegate(four-tuple), reads the result, and loops. When satisfied, it calls Finish.
  • Why it matters: Separation of planning and doing keeps the system clean, controllable, and learnable.

šŸž Bottom Bread: The orchestrator says: 'Now create a web-search helper with just the question keywords and a search tool.' After results come back, it says: 'Now create a calculator helper with only the needed numbers and a code tool.'

šŸž Top Bread (Hook): Ever bring your entire closet on a weekend trip? Overpacking makes everything harder.

🄬 Filling: Curated Context

  • What it is: The orchestrator passes only task-relevant context to each helper.
  • How it works: It selects and compresses just the clues that matter from history.
  • Why it matters: Too little context causes guesswork; too much causes confusion. Curated context hits the sweet spot.

šŸž Bottom Bread: Instead of sending a helper the whole chat log, the orchestrator sends only the three facts needed to verify a museum artifact.

šŸž Top Bread (Hook): Think of choosing bikes: a simple city bike for errands and a mountain bike for tough trails.

🄬 Filling: Cost–Performance Trade-off (Model Routing)

  • What it is: Dynamically choose which model to use at each step to balance accuracy and cost.
  • How it works: Use small, cheap models for routine tasks; reserve big models for hard reasoning or final checks.
  • Why it matters: This keeps bills low while keeping answers strong.

šŸž Bottom Bread: The orchestrator uses a small model to scrape a page and a strong model to interpret a tricky legal paragraph.

šŸž Top Bread (Hook): Practice makes better coaches.

🄬 Filling: Learnable Orchestration

  • What it is: The orchestrator can be improved by supervised fine-tuning (learning from examples) and in-context learning (improving its own prompt).
  • How it works: SFT teaches better decomposition and tuple building; ICL iteratively edits the orchestrator’s strategy to reach Pareto-efficient cost–performance.
  • Why it matters: A better planner multiplies the power of every helper.

šŸž Bottom Bread: After collecting rollouts, the system updates the orchestrator’s prompt to pick cheaper models more often while keeping success rates high.

03Methodology

šŸž Top Bread (Hook): Imagine a recipe card machine. For each cooking step, it prints a tiny card: the goal, the ingredients you actually need, the tools to use, and who should cook it.

🄬 Filling (Overview):

  • What it is: High-level flow is Input → Orchestrator (plans) → Delegate(four-tuple) → Sub-Agent executes → Orchestrator decides next step or Finish.
  • How it works: The orchestrator only has two actions: Delegate(Φ) to spawn a specialized helper defined by Φ = (Instruction, Context, Tools, Model), or Finish(answer) to stop.
  • Why it matters: This keeps decisions simple, controllable, and modular.

šŸž Bottom Bread (Anchor): For a travel plan, the orchestrator delegates: Φ1 = (Instruction: find cheapest flights this weekend, Context: city pairs and dates, Tools: web search+scraper, Model: small). Then Φ2 = (Instruction: compare layovers, Context: top 3 flight options, Tools: code runner, Model: medium). Finally, Finish(best choice).

— Step-by-step ‘recipe’ —

  1. Task intake and state update
  • What happens: The user’s goal becomes the initial state. As results come back (observations), the state is updated.
  • Why it exists: Without a shared state, the planner can’t reflect or adapt.
  • Example: After reading 'install nginx and expose port 80', the state logs what worked (apt install succeeded) and what failed.
  2. Subtask decomposition
  • What happens: The orchestrator splits the big goal into the next actionable subtask.
  • Why it exists: One big leap is risky; small steps are safer and clearer.
  • Example: First, 'update package index'; later, 'start nginx and verify homepage'.
  3. Four-tuple construction Φ = (I, C, T, M)
  • Instruction (I)
    • What happens: Write a precise, self-contained subtask goal and success check.
    • Why it exists: Vague goals waste steps.
    • Example: 'Extract the second sentence from the paper’s abstract and return only the number of datasets mentioned.'
  • Context (C)
    • What happens: Select only relevant notes, prior results, or artifacts.
    • Why it exists: Avoids confusion from irrelevant chatter.
    • Example: 'Keep the link to the museum page and last attempt’s artifact ID; drop unrelated searches.'
  • Tools (T)
    • What happens: Grant the minimal tools needed (search, scrape, execute code, shell, file edit, etc.).
    • Why it exists: Too many tools increase the error surface; too few block progress.
    • Example: Give 'viewfile' and 'pytest' for code fixes, but not web tools.
  • Model (M)
    • What happens: Pick a model that matches difficulty and budget.
    • Why it exists: Save money on easy steps; spend wisely on hard ones.
    • Example: Use a compact model for formatting, a stronger one for tough reasoning.
  4. Delegate(Φ) and sub-agent execution
  • What happens: The system spawns the helper with that exact recipe and runs it up to a step limit.
  • Why it exists: Isolation guarantees a clean working memory and permission set.
  • Example data: In GAIA, a sub-agent might call GoogleSearchAction, ExtractUrlContentAction, and ExecuteCodeAction, then return 'status=done' with a summary.
  5. Read observation and update state
  • What happens: The sub-agent returns a structured observation (result summary, artifacts, any errors). The state integrates it.
  • Why it exists: The orchestrator needs trustworthy breadcrumbs to plan next moves.
  • Example: 'Completed: package installed; Issues: service not started; Message: next run systemctl start nginx.'
  6. Decide next action: more delegation or Finish
  • What happens: The orchestrator evaluates if the goal is met. If yes, Finish(answer). If not, repeat decomposition and spawn a new helper.
  • Why it exists: Prevents overthinking and needless steps.
  • Example: After tests pass on SWE-Bench, call Finish immediately to submit.
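Putting the steps together, one full Delegate round for the nginx example might look like the sketch below; the tool and model names are assumptions, not the paper's identifiers:

```python
# One Delegate(Φ) step for "install nginx and expose port 80" (illustrative only).
phi_install = {
    "instruction": "Install nginx and verify the service responds on port 80.",
    "context": ["Fresh Ubuntu container; package index already updated."],
    "tools": ["shell_execute"],
    "model": "mid-size-model",
}

# A structured observation the sub-agent might return when it stops.
observation = {
    "status": "partial",
    "completed": "package installed",
    "issues": "service not started",
    "message": "next: run systemctl start nginx",
}
# The orchestrator folds this into its state and either delegates a follow-up
# helper ("start nginx and re-check port 80") or calls Finish if the goal is met.
```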

— The secret sauce —

A) Curated context routing

  • Why clever: The orchestrator’s selective context passing consistently outperforms 'no context' and 'full dump' by avoiding both starvation and overload.
  • Example: In ablations, curated context scored highest compared to the other two settings.

B) Capability composition per subtask

  • Why clever: Giving each helper exactly the tools and model it needs reduces errors and cost.
  • Example: A read-only search helper can’t accidentally modify files.

C) Learnable orchestration policy

  • Why clever: The orchestrator is trainable via SFT (better decomposition and tuple crafting) and improvable via in-context learning (cheaper routing with similar performance). This nudges the system toward Pareto-efficient frontiers.
  • Example: On GAIA, ICL improved accuracy while cutting average cost.

— Concrete examples with data —

  • GAIA web task: Φsearch = (I: find official museum entry; C: question keywords; T: search+extract; M: small). Result returns URL+snippet. Next Φread = (I: extract number asked; C: URL text; T: code runner; M: medium). Finish with short answer.
  • Terminal-Bench: Φinstall = (I: install nginx; C: task requirements; T: shell execute; M: mid). After 'done', orchestrator calls submit to run tests in the same container.
  • SWE-Bench: Φlocalize = (I: find failing function; C: failing test logs; T: execute+viewfile; M: mid). Then Φpatch = (I: edit and pass tests; C: target file lines; T: editfile+pytest; M: strong). Finish if tests pass.

04Experiments & Results

šŸž Top Bread (Hook): Think of a science fair where teams solve different kinds of challenges: research questions, lab setups, and code puzzles.

🄬 Filling (The Test):

  • What it is: Three tough benchmarks measured how well the system handles real tool use and long, multi-step tasks—GAIA (general digital tasks), Terminal-Bench 2.0 (Linux terminal workflows), and SWE-Bench-Verified (real GitHub bug fixes).
  • How it works: Metrics were pass@1 and pass@3 (success within 1 or 3 tries).
  • Why it matters: These settings mirror real-world agent work: searching the web, running commands, and editing code.

šŸž Bottom Bread (Anchor): It’s like asking, 'Can your team answer the quiz first try? If not, what about within three tries?'

— The competition —

  • Baselines: ReAct (single-agent), OpenHands (general agent framework), mini-SWE-agent (coding-focused), and Claude Code (production CLI with pre-defined sub-agents).

— The scoreboard (with Gemini-3-Flash) —

  • GAIA: AOrchestra 80.00 pass@1 vs best baseline OpenHands 66.06. That’s like jumping from a solid B to a clear A.
  • Terminal-Bench 2.0: AOrchestra 52.86 pass@1 vs best baseline 34.29. That’s a big bump on hard, hands-on setups.
  • SWE-Bench-Verified: AOrchestra 82.00 pass@1 vs mini-SWE 56.00. That’s a strong step up on real repositories.

Across backbones, AOrchestra consistently outperforms or matches the best baselines, and the paper also reports a 16.28% relative improvement over the strongest baseline with Gemini-3-Flash.

— Surprising and insightful findings —

  1. Context control wins: In a GAIA ablation (50 samples), curated context beat both 'no-context' and 'full-context' settings. The lesson: not too little, not too much—just right.
  2. Learnable planner: Replacing the main model with a smaller Qwen3-8B still beat some baselines, and SFT on that small orchestrator boosted accuracy notably (from 56.97% to 68.48%), showing orchestration is a skill you can train.
  3. Cheaper plus better: In-context learning improved accuracy while reducing average cost (e.g., under mixed models on GAIA: from 72.12% at $0.70 to 75.15% at $0.57), moving along a Pareto frontier.
  4. Plug-and-play sub-agents: Swapping the sub-agent backend (ReAct-style or mini-SWE-style) kept AOrchestra strong and above their standalone versions, confirming framework-agnostic design.

— Why these results matter —

  • On GAIA, better pass@1 shows the orchestrator’s decomposition and context routing help answer correctly sooner.
  • On Terminal-Bench, improved reliability under strict containers shows robust planning and tool scoping.
  • On SWE-Bench, strong passes show the method’s ability to localize bugs, patch code, and pass tests in real projects.

05Discussion & Limitations

šŸž Top Bread (Hook): Even the best coaches have limits when players are tired, tools are missing, or the rulebook changes mid-game.

🄬 Filling (Limitations and when not to use):

  • What it can’t do (yet):
    • If tools or sandboxes are missing or flaky (e.g., web fetch fails), sub-agents can’t act, no matter how good the plan is.
    • If problems demand deep domain expertise and there’s no suitable model or tool available, decomposition alone won’t save the day.
    • Very long horizons still risk context drift if summaries miss key details.
    • Training data quality matters: poor orchestration trajectories limit SFT gains.
  • Required resources:
    • Access to multiple models (small-to-strong) and a tool stack (search, extraction, shells, code runners, etc.).
    • Orchestration prompts and optional SFT compute for improvement cycles.
    • Sandbox backends (Docker/E2B) and API keys (e.g., search, content extraction).
  • When not to use:
    • One-shot, simple Q&A where a single model call suffices—overhead would be unnecessary.
    • Tasks with zero tool access or strict offline rules—capability composition can’t shine.
    • Ultra-hard specialized domains with no curated tools or datasets.
  • Open questions:
    • How best to summarize long trajectories without losing critical edge cases?
    • Can orchestration learn causal patterns (which subtask failures predict later costs) to prune bad paths earlier?
    • How to certify safety when dynamically composing tools and models?
    • Can we learn universal 'tuple construction policies' that transfer across domains with minimal tuning?

šŸž Bottom Bread (Anchor): If you only need to add two numbers, you don’t hire a whole team. But when building a treehouse with plumbing and wiring, a good orchestrator and specialized helpers pay off.

06Conclusion & Future Work

šŸž Top Bread (Hook): Imagine building exactly the helper you need, exactly when you need it, with just the right instructions, notes, tools, and brain.

🄬 Filling (Takeaway):

  • 3-sentence summary: AOrchestra treats every sub-agent as a simple four-part recipe—Instruction, Context, Tools, Model—so the orchestrator can create tailor-made helpers on demand. The orchestrator only delegates or finishes, keeping planning clean and execution focused. This design boosts accuracy, trims costs, and stays flexible across tools and agent backends.
  • Main achievement: Showing that on-the-fly, tuple-based specialization plus curated context and learnable planning reliably beats fixed roles and context dumps on three tough benchmarks.
  • Future directions: Smarter context summarization, stronger cost-aware routing, richer safety checks for tool composition, and broader cross-domain transfer of learned orchestration.
  • Why remember this: The four-tuple is a simple mental model for building agents like LEGO—clear parts, easy swaps, and just-in-time assembly—making complex automation practical and adaptable.

šŸž Bottom Bread (Anchor): Next time you see an AI system fix code, configure servers, or research tricky facts, picture a conductor filling out a tiny recipe card—goal, notes, tools, brain—handing it to a specialist, and repeating until the job is perfectly done.

Practical Applications

  • Automate software bug triage and patching: localize failures, edit files, and run tests with on-demand code helpers.
  • Run reliable server setup scripts: install, configure, and verify services in fresh containers with step-scoped tools.
  • Research assistant workflows: search, extract, and compute answers with curated context to avoid misinformation.
  • Data wrangling pipelines: spawn small code agents to clean, validate, and summarize datasets step by step.
  • Customer support playbooks: route simpler questions to cheaper models and escalate tricky cases to stronger ones.
  • Cost-aware analytics: compose cheap parsing helpers and reserve expensive reasoning only for ambiguous summaries.
  • Educational tutors: dynamically create a math-checker, a steps-explainer, or a unit-converter per exercise.
  • Compliance reviews: isolate a document analyzer with read-only tools and escalate edge cases to a stronger model.
  • Product prototyping: plug different agent backends into the same orchestrator to A/B test workflows quickly.
  • Enterprise RPA upgrades: replace rigid scripts with adaptive, tuple-defined micro-agents that fit each subtask.
#agent orchestration #sub-agent-as-tools #four-tuple abstraction #Instruction-Context-Tools-Model #dynamic agent creation #curated context #model routing #cost–performance trade-off #supervised fine-tuning #in-context learning #framework-agnostic #plug-and-play agents #long-horizon tasks #GAIA benchmark #SWE-Bench