Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Beginner
Jinyang Wu, Guocheng Zhai, Ruihan Jin et al. · 1/7/2026
arXiv · PDF

Key Summary

  • ATLAS is a system that picks the best mix of AI models and helper tools for each question, instead of using just one model or a fixed tool plan.
  • It has two paths: a quick, training-free path that uses clusters of similar questions, and a smarter, learn-to-adapt path trained with reinforcement learning.
  • The quick path is great when the question looks like ones we have seen before; it uses past results and cost to choose a model–tool pair fast.
  • The RL path is best for unfamiliar questions; it learns to think in steps and decide when to call which model and tool.
  • Across 15 benchmarks, ATLAS beats strong routers and even rivals or outperforms GPT-4o on many tasks, with big gains in both known and new domains.
  • ATLAS also boosts visual reasoning by orchestrating multimodal tools like chart readers, counters, and OCR.
  • The method generalizes when the pool of models and tools changes, improving performance without retraining.
  • It balances accuracy with cost by scoring model–tool choices using both performance history and price.
  • Carefully designed rewards (format, correctness, and efficiency) teach the RL router to follow rules, get answers right, and spend wisely.
  • This shows a shift from relying on one huge model to smartly coordinating many smaller models and tools.

Why This Research Matters

ATLAS shows that smart coordination beats brute force: the right mix of smaller models and tools can outperform a single giant model while saving costs. This means more affordable, capable AI assistants for classrooms, coding, science help, and customer support. Because it adapts when new tools or models are added, organizations can upgrade capabilities without retraining everything. Its ability to switch between quick, known solutions and careful, exploratory reasoning makes it reliable in the real world where tasks are unpredictable. Strong multimodal results mean it can also understand charts, images, and tables, not just text. In short, ATLAS is a playbook for building practical, flexible AI systems that think, choose, and verify like a good teammate.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how big school projects need different skills—like writing, math, and drawing—and you do better when you use the right tool for each part? A ruler for lines, a calculator for numbers, and your brain for planning.

🥬 Filling (The Actual Concept)

  • What it is: Before ATLAS, AI systems often tried to answer everything with one big model or a fixed recipe for calling tools, even when problems needed different strengths.
  • How it works (story of the field):
    1. Large Language Models (LLMs) became great at writing and reasoning, but sometimes they guess or make math mistakes.
    2. Tools like calculators, code runners, and web search help fix those gaps.
    3. People built routers that choose a model for each question, and planners that call tools by a fixed script. But routers treated models in isolation, and tool plans stayed rigid.
    4. Real questions vary a lot (math vs. coding vs. science vs. pictures), so the best solution often needs both the right model and the right tool—together.
  • Why it matters: Without a smart way to match each question to the best model–tool pair, AI wastes money, makes wrong calls, or both.

🍞 Bottom Bread (Anchor) Imagine asking, “What’s 13.7 × 24.3?” A big model might try to reason in words, but a calculator tool makes it precise. Asking, “Where was Justin Bieber born?” needs web search. The right pairing wins.

🍞 Top Bread (Hook) You know how you sort homework into folders—math here, reading there—so you can find what you need faster?

🥬 Filling (The Actual Concept: LLMs and Tool-Augmented Inference)

  • What it is: LLMs are text-smart AIs; tool-augmented inference means they also use helpers (calculator, code runner, web search, OCR).
  • How it works:
    1. The AI reads your question.
    2. It decides if a tool would help (e.g., exact math, real-time facts).
    3. It calls the tool, gets results, then writes the final answer.
  • Why it matters: Without tools, LLMs can be slow, imprecise, or out-of-date.

🍞 Bottom Bread (Anchor) A coding task? Use a code runner to test the function before answering. That turns guesses into guaranteed passes.
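To make this decide-call-answer loop concrete, here is a minimal Python sketch. Every name in it (`pick_tool`, `run_calculator`, `run_search`, `draft_answer`) is a hypothetical placeholder standing in for an LLM call or a tool API, not the paper's actual interface.

```python
# Minimal sketch of tool-augmented inference; all helpers are placeholders.

def pick_tool(question: str) -> str:
    """Stand-in for an LLM deciding whether a tool would help."""
    if any(ch.isdigit() for ch in question):
        return "calculator"
    if question.lower().startswith(("who", "where", "when")):
        return "search"
    return "none"

def run_calculator(question: str) -> str:
    return "calculator result (placeholder)"

def run_search(question: str) -> str:
    return "search snippet (placeholder)"

def draft_answer(question: str, evidence: str) -> str:
    """Stand-in for the LLM writing the final answer from tool output."""
    return f"Answer to '{question}' using: {evidence or 'model knowledge only'}"

def answer_with_tools(question: str) -> str:
    tool = pick_tool(question)                  # step 2: decide if a tool helps
    if tool == "calculator":
        evidence = run_calculator(question)     # step 3: call the tool
    elif tool == "search":
        evidence = run_search(question)
    else:
        evidence = ""
    return draft_answer(question, evidence)     # then write the final answer

print(answer_with_tools("What is 13.7 x 24.3?"))
```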

🍞 Top Bread (Hook) Think about picking the right teammate for each sport: sprinter for races, goalie for soccer, and so on.

🥬 Filling (The Actual Concept: The Problem)

  • What it is: The challenge is picking the best model–tool pair for each question from many choices.
  • How it works:
    1. Many LLMs have different strengths (math, code, facts, images).
    2. Many tools do different things (compute, search, parse images, verify steps).
    3. The combinations explode into a high-dimensional puzzle.
  • Why it matters: Random or fixed choices leave performance on the table and can raise costs.

🍞 Bottom Bread (Anchor) If you always pick “one-size-fits-all,” you might use a coder model for history trivia or a web search for pure algebra—both wasteful and error-prone.

🍞 Top Bread (Hook) Imagine trying two strategies for your homework: quick filing by topic when it’s familiar, and careful step-by-step planning when it’s tricky.

🥬 Filling (The Actual Concept: The Gap and Why ATLAS)

  • What it is: Past systems either chose a model OR used tools by a fixed script; few optimized model–tool pairs dynamically.
  • How it works:
    1. Routers ignored tools.
    2. Tool-users ignored model differences.
    3. RL methods optimized parts in isolation.
  • Why it matters: The missing piece is orchestrating models AND tools together, flexibly.

🍞 Bottom Bread (Anchor) A geometry problem with a diagram needs a vision model plus a geometry tool, not just any LLM.

🍞 Top Bread (Hook) You know how real life throws surprises—new topics or weird problems? You need a plan that works for old and new.

🥬 Filling (The Actual Concept: Real Stakes)

  • What it is: AI assistants in the wild face mixed tasks: math, code, science, charts, and more.
  • How it works: They must decide fast when a question is familiar and switch gears when it’s new.
  • Why it matters: This impacts classroom help, coding copilots, customer support, research assistance, and more.

🍞 Bottom Bread (Anchor) ATLAS helps a single assistant act like a team captain: for a chart question, call the chart tool; for counting objects in a photo, call the counting tool; for a proof, call a verifier.

02Core Idea

🍞 Top Bread (Hook) Imagine a smart librarian who first checks which shelf your question belongs to, and if it’s a new kind of question, they explore step by step, asking experts along the way.

🥬 Filling (The Actual Concept: The Aha!)

  • What it is: ATLAS’s key insight is to orchestrate the best model–tool pair per question using two paths—fast clustering for known stuff and RL-driven multi-step routing for the unknown.
  • How it works:
    1. If a question matches known clusters, use cached knowledge of what worked best (and at what cost) to pick quickly.
    2. If not, let an RL policy think in steps, decide when to call which model and tool, and learn from rewards.
  • Why it matters: You get both speed and adaptability, instead of choosing just one.

🍞 Bottom Bread (Anchor) A routine arithmetic task gets Calculator+Light LLM fast; a novel math puzzle triggers deeper planning with a math model and a verifier.

🍞 Top Bread (Hook) You know how you sort socks by color (quick) but try on new shoes by walking around (careful)?

🥬 Filling (Multiple Analogies)

  • Analogy 1 (Kitchen): Quick path is the recipe card drawer (use what worked before); RL path is the chef tasting and adjusting for a new dish.
  • Analogy 2 (Sports): Quick path picks last game’s winning lineup; RL path changes strategy mid-game based on how opponents play.
  • Analogy 3 (Maps): Quick path follows the usual route; RL path explores detours when roads are closed.

🍞 Bottom Bread (Anchor) If a coding question looks like ones seen before, pick Coder-7B + Python execution; if it’s an odd new spec, the RL router may code, test, and revise.

🍞 Top Bread (Hook) Think of a before-and-after makeover: from a single tool belt to a customizable toolbox.

🥬 Filling (Before vs. After)

  • Before: One model or a fixed tool script, ignoring model–tool synergy and costs.
  • After: Dynamic pairing of models and tools, guided by past success or live exploration, balancing accuracy with price.
  • Why it matters: Smaller open models plus smart tools can rival or beat giant models on many tasks.

🍞 Bottom Bread (Anchor) ATLAS(cluster) rivals GPT-4.1 and beats GPT-4o on many mixed tasks using only 7B–8B models and the right tools.

🍞 Top Bread (Hook) You know how teachers grade on both neatness and correct answers—and also encourage using time wisely?

🥬 Filling (Why It Works: Intuition)

  • What it is: ATLAS optimizes three things at once: structure, correctness, and efficiency.
  • How it works:
    1. Clusters create neighborhoods of similar questions so past winners are reused.
    2. A utility score balances accuracy and cost for each model–tool pair.
    3. RL learns decision rules: think first, then call tools when needed, and prefer efficient experts.
  • Why it matters: Without this, either you overpay for small gains or underperform on hard problems.

🍞 Bottom Bread (Anchor) For trivia, a quick search on a lighter model; for proofs, call a verifier. The router learns these habits.

🍞 Top Bread (Hook) Imagine building blocks you can snap together to solve any puzzle.

🥬 Filling (Building Blocks)

  • What it is: ATLAS is made of modular parts.
  • How it works:
    1. Semantic embedding and clustering (training-free).
    2. Historical stats: accuracy and token cost per model–tool in each cluster.
    3. Utility scoring with a performance–cost trade-off.
    4. RL policy with two actions: think or route to a model–tool pair.
    5. Reward trio: format (follow rules), outcome (correctness), selection (efficiency).
  • Why it matters: Each block is simple; together they create a flexible, powerful reasoner.

🍞 Bottom Bread (Anchor) A web question goes to Llama-3.1 + Web Search; a geometry diagram goes to a vision LLM + Geo tool; a tough math proof adds a PRM verifier.

03Methodology

🍞 Top Bread (Hook) Imagine a smart help desk: first it checks if your question matches a known category and sends it to the usual expert; if not, it investigates in steps and brings in the right specialists.

🥬 Filling (High-Level Recipe)

  • What it is: ATLAS routes a question through two possible paths.
  • How it works: Input → Quick Path (Cluster-Based) OR Adaptive Path (RL Multi-Step) → Output.
  • Why it matters: This is both fast on familiar tasks and robust on unfamiliar ones.

🍞 Bottom Bread (Anchor) A typical calculator problem is handled instantly via the quick path; a novel science logic problem triggers multi-step RL routing with search and verification.
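A tiny sketch of that two-path dispatch follows. The familiarity test here (a distance threshold to the nearest cluster centroid) is an assumption for illustration, and the `quick_path`/`rl_path` labels are placeholders, not the paper's exact decision rule.

```python
import numpy as np

# Sketch of the two-path dispatch; threshold rule is an assumption.
def dispatch(question_embedding: np.ndarray,
             centroids: np.ndarray,
             threshold: float = 0.5) -> str:
    # Distance to the nearest cluster centroid as a rough familiarity signal.
    dists = np.linalg.norm(centroids - question_embedding, axis=1)
    if dists.min() <= threshold:
        return "quick_path"   # cluster-based routing (training-free)
    return "rl_path"          # RL-driven multi-step routing

centroids = np.random.rand(8, 384)   # placeholder cluster centers
q = np.random.rand(384)              # placeholder question embedding
print(dispatch(q, centroids))
```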

— Training-Free Cluster-Based Routing —

🍞 Top Bread (Hook) You know how a library groups similar books on the same shelf so you can find them quickly?

🥬 Filling (Semantic Embeddings and Clustering)

  • What it is: Represent each question as a point in a meaning-space and group similar ones into clusters.
  • How it works:
    1. Encode the question with a pretrained text encoder to get an embedding.
    2. Use k-means to form K clusters with centroids.
    3. Each cluster collects stats from history.
  • Why it matters: Similar questions tend to prefer similar model–tool pairs.

🍞 Bottom Bread (Anchor) Most coding problems land in a cluster where “Coder-7B + Python” historically wins.
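A minimal sketch of this clustering step uses scikit-learn's k-means on placeholder vectors; in the real system a pretrained text encoder produces the embeddings, and the number of clusters K is a design choice assumed here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the training-free clustering step. A pretrained text encoder
# would embed each question; random vectors stand in here.
rng = np.random.default_rng(0)
question_embeddings = rng.normal(size=(1000, 384))   # placeholder embeddings

K = 16                                               # number of clusters (assumed)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(question_embeddings)

# Each cluster then accumulates historical stats (accuracy, token cost)
# for every model-tool pair observed on its questions.
print(kmeans.cluster_centers_.shape)   # (16, 384) centroids used at lookup time
```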

🍞 Top Bread (Hook) Imagine choosing between two backpacks: one has better tools but is heavier (costlier); the other is lighter but simpler.

🥬 Filling (Utility Score: Performance–Cost Trade-off)

  • What it is: A score that balances how accurate a pair was and how much it costs.
  • How it works:
    1. For each model–tool in a cluster, compute empirical accuracy from past tasks.
    2. Estimate average token cost (input and output) times price.
    3. Combine them with a tunable weight alpha into one utility score.
  • Why it matters: Without balancing, you might always pick the priciest option or the cheapest but weak one.

🍞 Bottom Bread (Anchor) For a simple fact query, a cheaper model + search wins the score; for tricky math, a math model + verifier is worth the spend.
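The paper's exact formula is not reproduced here, but a plausible form is a weighted accuracy term minus a normalized cost term, controlled by alpha. The sketch below is an illustrative assumption, not the verbatim equation.

```python
def utility(accuracy: float, avg_cost: float, max_cost: float, alpha: float = 0.7) -> float:
    """Hedged sketch of a performance-cost utility for one model-tool pair.

    accuracy : empirical accuracy of the pair within the cluster (0..1)
    avg_cost : average dollar cost per question (token usage x price)
    max_cost : most expensive pair in the cluster, used for normalization
    alpha    : trade-off weight (this exact form is an assumption)
    """
    normalized_cost = avg_cost / max_cost if max_cost > 0 else 0.0
    return alpha * accuracy - (1.0 - alpha) * normalized_cost

# A cheaper pair with decent accuracy can outscore a pricier, slightly better one.
print(utility(accuracy=0.78, avg_cost=0.002, max_cost=0.02))   # light model + search
print(utility(accuracy=0.84, avg_cost=0.020, max_cost=0.02))   # heavy model + verifier
```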

🍞 Top Bread (Hook) Think of asking, “Which shelf is closest to my book?” and then grabbing the best-known tool from that shelf.

🥬 Filling (Proximal Lookup and Execution)

  • What it is: At inference, find the nearest cluster and pick the top-scoring model–tool pair.
  • How it works:
    1. Embed the new question; pick the nearest centroid.
    2. Retrieve the pair with the highest utility; execute it.
    3. Use a fallback when stats are missing.
  • Why it matters: This makes routing very fast for familiar domains.

🍞 Bottom Bread (Anchor) A question about world capitals jumps to the “facts” cluster and selects Llama-3.1 + Web Search.
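Putting the pieces together, here is a hedged sketch of the lookup: find the nearest centroid, read off the stored utilities, and fall back when a cluster has no history. The model and tool names, and the toy 2-D centroids, are illustrative only.

```python
import numpy as np

# cluster_stats maps each cluster to utility scores per model-tool pair (illustrative).
cluster_stats = {
    0: {("llama-3.1-8b", "web_search"): 0.61, ("coder-7b", "python_exec"): 0.35},
    1: {("coder-7b", "python_exec"): 0.72, ("math-7b", "calculator"): 0.48},
}
centroids = np.array([[0.1, 0.9], [0.8, 0.2]])   # toy 2-D centroids

def route(question_embedding, fallback=("llama-3.1-8b", "none")):
    nearest = int(np.argmin(np.linalg.norm(centroids - question_embedding, axis=1)))
    pairs = cluster_stats.get(nearest)
    if not pairs:                       # fallback when stats are missing
        return fallback
    return max(pairs, key=pairs.get)    # highest-utility model-tool pair

print(route(np.array([0.75, 0.25])))    # -> ('coder-7b', 'python_exec')
```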

— RL-Driven Multi-Step Routing —

🍞 Top Bread (Hook) Imagine solving a puzzle by thinking a bit, trying a tool, checking the result, and repeating until you’re sure.

🥬 Filling (Policy with Think and Route Actions)

  • What it is: An RL agent that alternates between internal thinking and choosing a model–tool to call.
  • How it works:
    1. State: the original question plus the growing context of steps and tool outputs.
    2. Actions: think (decompose, plan) or route (pick model–tool, call it).
    3. Stop when confident; produce the final answer.
  • Why it matters: Some problems need multiple steps and cross-checks; a single shot won’t cut it.

🍞 Bottom Bread (Anchor) For a geometry problem with a diagram, the agent thinks, calls OCR if needed, calls a geometry tool, checks, then answers.
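A simplified sketch of this alternating loop is below. The `policy_step` and `call_expert` functions and the action format are stand-ins for the trained router and the expert APIs, not the paper's exact interfaces.

```python
# Sketch of the think/route loop; all pieces are illustrative placeholders.

def policy_step(question, context):
    """Stand-in for the RL policy: returns ('think', text),
    ('route', (model, tool)), or ('answer', text)."""
    if not context:
        return ("think", "Break the problem into sub-steps.")
    if len(context) == 1:
        return ("route", ("math-7b", "calculator"))
    return ("answer", "final answer (placeholder)")

def call_expert(model, tool, question, context):
    return f"{model}+{tool} output (placeholder)"

def solve(question, max_steps=6):
    context = []                                   # state = question + history
    for _ in range(max_steps):
        action, payload = policy_step(question, context)
        if action == "think":
            context.append(("think", payload))     # internal reasoning step
        elif action == "route":
            model, tool = payload
            context.append(("tool", call_expert(model, tool, question, context)))
        else:
            return payload                         # stop when confident
    return "no answer within budget"

print(solve("Solve the geometry problem in the diagram."))
```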

🍞 Top Bread (Hook) You know how teachers reward neat work, correct answers, and smart use of time?

🥬 Filling (Reward Design: Format, Outcome, Selection)

  • What it is: Three rewards teach the agent to follow rules, be right, and be efficient.
  • How it works:
    1. Format Reward: penalizes broken syntax (e.g., missing tags or malformed tool calls) to keep interactions stable.
    2. Outcome Reward: boosts correctness of the final answer.
    3. Model-Selection Reward: gently nudges choices toward efficient experts.
  • Why it matters: Without format, the agent fumbles tool calls; without outcome, it doesn’t aim for right answers; without selection, costs balloon.

🍞 Bottom Bread (Anchor) On MBPP (coding), selecting Coder-7B + Python is rewarded; on trivia, calling search with a lighter model is nudged.
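A toy sketch of how the three signals might combine follows. The specific checks and weights are assumptions for illustration; only the idea of summing format, outcome, and selection terms comes from the paper.

```python
# Toy composite reward: format + outcome + selection (weights assumed).

def format_reward(trajectory_text: str) -> float:
    """Penalize malformed tool calls / missing tags (toy check)."""
    return 1.0 if "<route>" in trajectory_text and "</route>" in trajectory_text else -1.0

def outcome_reward(predicted: str, gold: str) -> float:
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def selection_reward(chosen_pair, efficient_pairs) -> float:
    """Small bonus for picking a cost-efficient expert."""
    return 0.2 if chosen_pair in efficient_pairs else 0.0

def total_reward(trajectory_text, predicted, gold, chosen_pair, efficient_pairs):
    return (format_reward(trajectory_text)
            + outcome_reward(predicted, gold)
            + selection_reward(chosen_pair, efficient_pairs))

print(total_reward("<route>coder-7b|python</route>", "42", "42",
                   ("coder-7b", "python"), {("coder-7b", "python")}))   # 2.2
```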

🍞 Top Bread (Hook) Think of jogging within safe limits so you improve steadily without injury.

🥬 Filling (Stable Learning with PPO and a Reference Policy)

  • What it is: PPO training with KL regularization to a reference policy keeps learning stable.
  • How it works:
    1. Collect trajectories by interacting with tools.
    2. Score them with the composite reward.
    3. Update the policy while staying close to the reference to avoid wild swings.
  • Why it matters: Prevents the router from collapsing into bad habits or overfitting.

🍞 Bottom Bread (Anchor) Training curves show faster, higher convergence with the selection reward and clean format rules.
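For readers who want to see the shape of the objective, here is a toy PPO-style loss with a KL penalty toward a frozen reference policy, written in PyTorch. The coefficients and the dummy log-probabilities are placeholders, not the paper's settings.

```python
import torch

def ppo_kl_loss(logp_new, logp_old, logp_ref, advantages,
                clip_eps=0.2, kl_coef=0.05):
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_penalty = (logp_new - logp_ref).mean()                 # stay near the reference
    return policy_loss + kl_coef * kl_penalty

# Dummy per-token log-probabilities and advantages (placeholders).
logp_new = torch.tensor([-1.0, -0.8], requires_grad=True)
logp_old = torch.tensor([-1.1, -0.9])
logp_ref = torch.tensor([-1.2, -1.0])
advantages = torch.tensor([0.5, -0.3])

loss = ppo_kl_loss(logp_new, logp_old, logp_ref, advantages)
loss.backward()   # gradients flow into the router policy's parameters
print(loss.item())
```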

— The Secret Sauce —

🍞 Top Bread (Hook) Imagine a Swiss Army knife that knows when to flip out the right blade—and learns new tricks as you add tools.

🥬 Filling (What Makes ATLAS Clever)

  • What it is: Joint optimization of models and tools, not one or the other.
  • How it works:
    1. Uses empirical clusters for speed on knowns.
    2. Uses RL planning for adaptability on unknowns.
    3. Balances accuracy and cost.
    4. Still works when you add new models/tools without retraining.
  • Why it matters: This orchestration beats single-model thinking and fixed tool scripts.

🍞 Bottom Bread (Anchor) When a math-specialized model and a result-checker are added later, ATLAS uses them right away and scores higher.

04Experiments & Results

🍞 Top Bread (Hook) Picture a school decathlon—math, coding, science, logic, and even reading charts and pictures. The winner must switch skills smoothly.

🥬 Filling (The Test)

  • What it is: ATLAS is tested on 15 benchmarks across math (AIME, AMC), code (HumanEval, MBPP), arithmetic (Calculator), commonsense (NQ, WebQ), logic (LogiQA2), science (GPQA), and multimodal (ChartQA, Geometry3K, TallyQA, CountBench, TableVQA).
  • How it works:
    1. Measure accuracy per dataset.
    2. Compare to routers like RouterDC, MLPRouter, BertRouter, and even closed models like GPT-4o and GPT-4.1.
    3. Evaluate in-distribution (trained on all tasks) and out-of-distribution (trained on only a few, tested on the rest).
  • Why it matters: Real assistants face mixed tasks, not just one.

🍞 Bottom Bread (Anchor) Think of it like testing a student on homework they practiced (ID) and surprise quizzes (OOD).

🍞 Top Bread (Hook) You know that getting 90% when others get 70% is not just a number—it’s a big gap.

🥬 Filling (The Competition and Scoreboard with Context)

  • In-Distribution (familiar tasks): ATLAS(cluster) averages 63.5%, beating RouterDC by +10.1%. On AMC (math), it hits 82.5%; on HumanEval (code), 91.5%. It outperforms GPT-4o overall and approaches GPT-4.1 using smaller models plus tools.

  • Out-of-Distribution (new tasks): ATLAS(RL) averages 59.4%, which is +13.1% over RouterDC (46.3%) and +10.2% over ATLAS(cluster). On tough AIME24/25, cluster routing drops hard (down to 13.3%/3.3%), but RL holds strong (43.3%/33.3%).

  • Multimodal: ATLAS reaches 68.9% on average, surpassing the best single-tool baselines by +4.3%. It wins on chart questions, object counting, geometry, and table VQA via dynamic tool chaining.

  • Pool Extension: Adding math-specialized models and an outcome checker without retraining lifts ATLAS(RL) from 59.4% to 61.7% (+2.3%), with the biggest gains on math.

  • Why it matters: These are like moving from a B- class average to an A- while also paying attention to cost.

🍞 Bottom Bread (Anchor) On a chart question, ATLAS may call OCR, then a chart parser, then a math step; on a web fact, it just searches once—spending wisely.

🍞 Top Bread (Hook) Surprises make stories fun—what did we not expect?

🥬 Filling (Surprising Findings)

  • What it is:
    1. Small models plus the right tools can beat or rival large closed models on many tasks.
    2. The RL policy learns when to think more (hard tasks) and spend less (easy tasks), adjusting API call counts per dataset.
    3. With self-consistency sampling, performance scales up (e.g., AIME24 from 43.3% to 70.0% at SC@16), showing strong test-time ensembling.
  • Why it matters: Orchestration can matter more than sheer model size.

🍞 Bottom Bread (Anchor) It’s like a clever coach making a small team play smarter—and win against bigger teams.
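The self-consistency trick mentioned above is simple to sketch: sample several answers and keep the most common one. The sampled strings below are placeholders, not real model outputs.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over sampled answers (test-time ensembling)."""
    most_common, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common

samples = ["204", "198", "204", "204", "210", "204"]   # e.g., six samples for one item
print(self_consistency(samples))   # -> "204"
```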

05Discussion & Limitations

🍞 Top Bread (Hook) No tool is perfect—like a bike is great on roads but not on sand.

🥬 Filling (Limitations, Resources, When Not to Use, Open Questions)

  • Limitations (what it can’t do yet):
    1. Focused on text and images; audio/video orchestration remains future work.
    2. Assumes reliable APIs; outages or latency can hurt performance.
    3. RL rewards depend on available signals; label-scarce settings need stronger self-verification.
    4. Cluster path can misroute truly novel questions.
  • Required Resources:
    1. Access to a pool of LLMs and tools (calculator, code runner, web search, process/outcome reward verifiers, OCR, chart/counter/geometry tools).
    2. Moderate GPU for RL training (policy is small, e.g., 3B) and API budgets for profiling and inference.
  • When NOT to Use:
    1. Single-domain, stable tasks where one specialized model already excels cheaply.
    2. Extremely latency-sensitive apps that cannot afford multi-step routing.
    3. Offline or air-gapped settings where tool APIs are unavailable.
  • Open Questions:
    1. How to extend to audio/video and robotics tools?
    2. Can we train with purely self-generated rewards (e.g., verifiers, majority voting) to reduce labels?
    3. How to guarantee fairness and safety as tools/models change over time?
    4. How to auto-tune the performance–cost trade-off for different users or budgets?

🍞 Bottom Bread (Anchor) If you only ever answer arithmetic in a fixed worksheet, a single calculator model is enough. But for a mixed exam, ATLAS shines.

06Conclusion & Future Work

🍞 Top Bread (Hook) Think of ATLAS as a conductor leading an orchestra of models and tools so each instrument plays at the right time.

🥬 Filling (Takeaway)

  • 3-Sentence Summary: ATLAS introduces a dual-path system that picks the best model–tool pair per question using training-free clustering for familiar tasks and RL-driven multi-step routing for new ones. It outperforms strong routers and even rivals or beats larger closed models across 15 text and visual benchmarks. The method balances accuracy and cost and adapts as new models and tools appear.
  • Main Achievement: Showing that smart orchestration of heterogeneous models and tools can surpass single-model scaling, delivering both stronger performance and better efficiency.
  • Future Directions: Expand to audio/video and robotics tools, strengthen self-verification and label-free training signals, and personalize routing to user budgets and preferences. Explore safety-aware and fairness-aware rewards.
  • Why Remember This: ATLAS marks a shift from building one giant model to coordinating many specialized parts—like moving from a soloist to a full, well-timed orchestra.

🍞 Bottom Bread (Anchor) In practice, that means a single assistant that can code, calculate, search, read charts, count objects, and verify proofs—choosing the right combo each time.

Practical Applications

  • Build a helpdesk assistant that routes billing questions to a retrieval model + web search and math invoices to a calculator tool.
  • Create a coding copilot that generates code, executes tests in a sandbox, and retries with fixes using RL-guided steps.
  • Deploy a study tutor that solves algebra with a calculator, verifies proofs with a PRM, and explains steps clearly.
  • Analyze business reports by reading tables and charts with multimodal tools and summarizing key metrics.
  • Support research by routing literature queries to search tools and complex derivations to code/math tools for verification.
  • Automate data validation by combining OCR for scanned documents with calculators and rule checkers.
  • Enhance customer support by selecting light models for common FAQs and heavier reasoning + search for rare issues.
  • Assist medical triage systems by routing domain terms to specialized models and verifying with retrieval tools (under supervision).
  • Improve classroom AI by switching between text explanation, code execution for simulations, and chart interpretation.
  • Prototype orchestration platforms that can integrate new tools on the fly and immediately benefit without retraining.
#ATLAS #LLM routing #tool augmentation #model–tool synergy #reinforcement learning #PPO #semantic clustering #utility trade-off #multimodal reasoning #out-of-distribution generalization #process reward model #cost-aware inference #self-consistency #orchestration
Version: 1