TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration
Key Summary
- Multi-agent systems are like teams of expert helpers; the tricky part is choosing which helpers to ask for each question.
- Most routers forced a single choice, even when multiple experts were relevant, causing mistakes and confusion.
- TCAndon-Router (TCAR) first writes down its reasoning in plain language, then selects a smart subset of agents, not just one.
- TCAR lets companies add new agents by simply adding their descriptions, with no retraining needed.
- After the chosen agents answer in parallel, a special Refining Agent combines their ideas into one clear, high-quality reply.
- A two-step training recipe (Supervised Fine-Tuning plus Reinforcement Learning) makes TCAR's choices accurate and its explanations stable.
- Across public datasets and Tencent Cloud data, TCAR matched or beat bigger models, especially when queries were ambiguous or cross-domain.
- Reasoning visibly helped: with reasoning chains, routing got more robust and interpretable, and conflicts dropped.
- The system stayed efficient: on average it picked only ~1.37 agents, so costs and delays stayed low.
- Limitations include reliance on good agent descriptions and some challenges with rare, highly specialized cases.
Why This Research Matters
In real companies, questions are messy and often touch multiple specialties at once, so picking just one expert often fails. TCAR explains its choices in human language, which makes it easier to debug, trust, and improve over time. It can select a small team of agents instead of one, then a Refining Agent merges their ideas into a single strong answer. Businesses can add new experts instantly by appending descriptions, so the system naturally scales with new products and services. This approach saves support time, improves accuracy on tricky cases, and keeps costs low by usually selecting just a few agents. The result is faster resolutions, happier users, and more reliable AI helpdesks.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how in a big hospital, there's a heart doctor, a bone doctor, and a lung doctor, and the front desk must decide who you should see? If they pick the wrong doctor, you waste time and don't get better.
🥬 The Concept: Multi-Agent Systems (MAS)
- What it is: A Multi-Agent System is a team of specialized AI helpers (agents) that each know how to solve certain kinds of problems.
- How it works:
- Break a big problem into smaller parts.
- Send each part to the expert agent best suited to handle it.
- Combine the agentsā answers into a final solution.
- Why it matters: Without MAS, one general helper must do everything, often doing some parts poorly, like asking a family doctor to do heart surgery. 🍞 Anchor: A cloud company gets a complaint: "My website is slow." One agent is great at networks, another at CDNs, and another at servers. The team can diagnose faster together than one agent alone.
🍞 Hook: Imagine a school's main office deciding which teacher should help with a student's question about math, music, or sports.
🥬 The Concept: Routing
- What it is: Routing is the decision of which agent(s) should handle each incoming question.
- How it works:
- Read the question.
- Compare it with what each agent can do.
- Pick the most suitable agent(s).
- Why it matters: Bad routing sends questions to the wrong helpers, lowering accuracy and wasting time. 🍞 Anchor: A "payment failed" question should go to the billing agent, not the marketing agent.
🍞 Hook: Think of choosing sneakers versus hiking boots depending on the trip.
🥬 The Concept: Performance-Based Routing
- What it is: A strategy that picks models mainly by speed and cost for the question's difficulty.
- How it works:
- Estimate how hard the question is.
- Use a small, cheap model if it's easy; a big, powerful one if it's hard.
- Why it matters: Without it, you might always use the biggest model, paying too much and waiting too long. 🍞 Anchor: Simple weather questions use a small model; long legal summaries use a large model.
🍞 Hook: If you're fixing a leaky sink, you want a plumber, not a gardener.
🥬 The Concept: Task-Based Routing
- What it is: A strategy that picks domain experts (agents) based on what the question is about.
- How it works:
- Understand the topic (e.g., networking vs. database).
- Match it to the right specialist agent.
- Why it matters: Without task-based routing, even great systems answer with the wrong expertise. 🍞 Anchor: A "database timeout" goes to the database agent, not the UI agent.
🍞 Hook: Imagine a librarian forced to place a book on exactly one shelf, even if it belongs to two categories.
🥬 The Concept: Single-Label Routing
- What it is: A rule that forces the router to pick only one agent for each query.
- How it works:
- Turn the query into one label.
- Send it to only that agent.
- Why it matters: Many real problems involve multiple domains; forcing one choice causes errors and brittleness. 🍞 Anchor: "Website is slow" could involve network, CDN, and application. One label misses parts of the problem.
🍞 Hook: Think of two teachers who both can help with "science fair" questions; who should lead?
🥬 The Concept: Agent Conflict
- What it is: When multiple agents overlap and more than one could reasonably handle the same query.
- How it works:
- The router notices overlapping skills.
- If forced to choose one, it risks the wrong pick.
- Why it matters: Ignoring overlap leads to mistakes, confusion, and lower trust in the system. 🍞 Anchor: A "payment latency" could be finance (billing delays) or networking (API latency). Both might help.
The World Before: MAS existed and worked well on neat, clean tasks. In companies, however, questions are messy. "My website lags" might be network, CDN, and app bottlenecks all at once. Many routers used single-label routing. That caused two headaches: (1) overlapping agent skills led to conflicts, and (2) adding new agents needed retraining, so systems couldn't grow quickly.
Failed Attempts:
- Performance-based routing saved money but didn't send questions to domain experts.
- Static task routers chose exactly one agent, even for multi-intent queries.
- Some LLM-based routers predicted a single best agent without explaining why, so it was hard to debug or improve.
The Gap: We needed a router that (a) explains its choices in plain language, (b) can pick multiple agents when needed, and (c) accepts new agents without retraining.
Real Stakes: In support centers, the wrong route wastes hours. In healthcare triage, the wrong expert can delay care. In cloud operations, misrouting slows incident response and harms SLAs. This paper proposes TCAR to make routing smarter, clearer, and easier to grow.
02 Core Idea
🍞 Hook: Imagine a smart front desk that first writes down its thinking, then calls all the right experts, and finally has a head teacher tidy up the combined answer.
🥬 The Concept: Natural-Language Reasoning Chain
- What it is: A step-by-step explanation, in plain words, showing how the router linked the question to the right agents.
- How it works:
- The router reads the question and all agent descriptions.
- It lists possible causes and which agents cover them.
- It writes a short, structured "why this choice" explanation.
- Why it matters: Without this, routing is a black box; mistakes are hard to find and fix. 🍞 Anchor: "Webpage slow → could be network, CDN, or app. Network checks latency; CDN checks edge nodes; app checks database queries."
🍞 Hook: Think of a coach who doesn't pick just one player, but the exact subset who play best together.
🥬 The Concept: Adaptive Reasoning Router (TCAR)
- What it is: A router that first reasons in language and then selects a subset of relevant agents (not just one).
- How it works:
- Build a prompt with the query plus agent descriptions.
- Generate a reasoning chain.
- Output up to a few agent IDs that fit the reasoning.
- Why it matters: Without selecting a smart subset, conflicts get crushed into a single risky guess. 🍞 Anchor: TCAR may choose {Network, CDN} together for "slow at certain regions," rather than forcing one.
🍞 Hook: Picture several chefs cooking parts of one meal, then a head chef plates it perfectly.
🥬 The Concept: Collaborative Execution Pipeline
- What it is: A process where the selected agents each answer in parallel, then a coordinator fuses their answers into one.
- How it works:
- TCAR picks the agents.
- Each agent writes its best answer.
- A downstream module merges them.
- Why it matters: Without collaboration, you miss complementary insights (like network plus app clues). 🍞 Anchor: Network agent flags packet loss; app agent spots slow SQL; together the root cause becomes clear.
🍞 Hook: Think of a newspaper editor who combines reporters' drafts into one clear story.
🥬 The Concept: Refining Agent
- What it is: A special agent that compares multiple agent answers, resolves conflicts, and writes the final response.
- How it works:
- Read all candidate answers.
- Keep the accurate, non-overlapping parts.
- Explain or reconcile any disagreements.
- Why it matters: Without a Refiner, users get multiple partial, possibly conflicting answers. 🍞 Anchor: If CDN says "edge issue" and app says "database issue," the Refiner checks both and recommends the correct order to test.
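Below is a minimal sketch of how a Refining Agent prompt might be assembled from the selected agents' drafts; the prompt wording, field names, and the call_llm stand-in are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: build one refinement prompt from several agent drafts.
def build_refiner_prompt(query: str, drafts: dict) -> str:
    """Combine each selected agent's draft answer into a single refinement request."""
    joined = "\n\n".join(f"[{agent}]\n{text}" for agent, text in drafts.items())
    return (
        "You are a Refining Agent. Merge the drafts below into one final answer.\n"
        "Keep accurate, non-overlapping points, reconcile conflicts, and say which\n"
        "hypothesis to test first.\n\n"
        f"User query: {query}\n\nCandidate drafts:\n{joined}\n\nFinal answer:"
    )

# Example with the CDN-vs-application disagreement from the anchor above:
prompt = build_refiner_prompt(
    "Checkout page is slow for some users",
    {"CDN": "Likely an edge caching issue.", "App": "Slow SQL on the checkout query."},
)
# final_answer = call_llm(prompt)  # call_llm is a stand-in for your LLM client
```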
🍞 Hook: Adding a new teammate should be as simple as introducing them at the morning meeting.
🥬 The Concept: Dynamic Agent Onboarding
- What it is: The ability to add a new agent by providing a description, with no retraining of the router.
- How it works:
- Write a clear natural-language description of the new agent's skills.
- Append it to the agent list.
- TCAR immediately considers it during routing.
- Why it matters: Without this, growing businesses must constantly retrain routers, slowing expansion. 🍞 Anchor: "Cache Agent" added today starts getting caching questions right away.
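A minimal sketch of what dynamic onboarding could look like in code, assuming agents live in a plain list of id-plus-description entries and the router prompt is rebuilt from that list on every query (the names and structure are illustrative, not the paper's implementation):

```python
# Hypothetical agent registry: onboarding = appending a description, no retraining.
AGENTS = [
    {"id": "network", "description": "Diagnoses latency, packet loss, and routing issues."},
    {"id": "cdn", "description": "Handles edge caching, cache hit rates, and regional slowness."},
]

def agent_catalog() -> str:
    """Render the registry as the natural-language block fed to the router prompt."""
    return "\n".join(f"- {a['id']}: {a['description']}" for a in AGENTS)

# Onboard a Cache Agent today; the very next query can already be routed to it,
# because the catalog text (and therefore the router prompt) now includes it.
AGENTS.append({"id": "cache", "description": "Tunes application caches and eviction policies."})
print(agent_catalog())
```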
🍞 Hook: Think of a referee who fairly considers all sides before deciding.
🥬 The Concept: Multi-Agent Conflict Resolution
- What it is: Handling overlapping expertise by selecting multiple agents and later merging their outputs.
- How it works:
- Keep conflicts visible by outputting a set of agents.
- Let each speak.
- Resolve differences via the Refiner.
- Why it matters: Without this, the router hides uncertainty and often guesses wrong. 🍞 Anchor: For "latency spikes," TCAR routes to Network and CDN; the Refiner aligns their findings.
Multiple Analogies (same idea, three ways):
- School: Guidance counselor (router) writes notes (reasoning), sends a student to math and science clubs (subset), and the homeroom teacher (Refiner) summarizes a study plan.
- Hospital: Triage nurse (router) lists symptoms (reasoning), sends to cardiology and pulmonology (subset), chief physician (Refiner) finalizes the diagnosis.
- Factory: Dispatcher (router) logs issue (reasoning), calls electrical and mechanical teams (subset), shift manager (Refiner) delivers one repair plan.
Before vs After:
- Before: One-label picks, hidden logic, brittle under ambiguity, hard to grow.
- After: Reason-then-select subset, explanations you can read, multiple experts when needed, plug-in new agents.
Why It Works (intuition):
- Explaining first forces careful matching between the query and agent skills.
- Selecting a subset preserves useful overlap instead of erasing it.
- Aggregating answers converts conflict into complementary evidence.
- Training with rewards that balance correctness, coverage, and brevity keeps the set precise, complete, and small.
Building Blocks:
- Reasoning generator (<reason> tag) that explains the mapping from query to agents.
- Subset selector that outputs up to a few agent IDs.
- Parallel agent answering.
- Refining Agent to integrate and resolve.
- Training: Supervised Fine-Tuning (to learn the pattern) plus Reinforcement Learning (to polish accuracy, coverage, and consistency).
03 Methodology
At a high level: Input (User query + Agent descriptions) → Reason-then-Select (TCAR) → Parallel Answers (Chosen agents) → Aggregate (Refining Agent) → Output (One final answer)
🍞 Hook: Imagine making a sandwich: you lay out ingredients (descriptions), think about what fits (reason), pick slices (agents), toast them together (parallel answers), and plate it nicely (refine).
🥬 The Concept: Supervised Fine-Tuning (SFT)
- What it is: Teaching the model by showing many examples of good reasoning and correct agent sets.
- How it works:
- Prepare data with queries, agent descriptions, a model instruction, a reasoning chain, and chosen agents.
- Train the model to copy the structure: <reason> … </reason> + <ID> … </ID>.
- Ensure it learns to align query meaning to agent skills.
- Why it matters: Without SFT, the model may not format answers correctly or connect queries to capabilities. 🍞 Anchor: Show examples where "billing error" maps to the Billing agent, with a short reason.
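For concreteness, a single SFT example might be serialized roughly like this; the JSON field names are assumptions, while the <reason>/<ID> tags follow the output format described above.

```python
# Hypothetical shape of one SFT example: given the instruction, query, and agent
# descriptions, the model is trained to reproduce the target string verbatim.
sft_example = {
    "instruction": "Select the agents that should handle the query. Explain your "
                   "choice inside <reason> tags, then list agent IDs inside <ID> tags.",
    "query": "I was charged twice for last month's invoice.",
    "agents": "- billing: invoices, refunds, duplicate charges\n"
              "- network: latency, packet loss, routing",
    "target": "<reason>The user reports a duplicate charge on an invoice, which is a "
              "billing problem rather than a connectivity one.</reason><ID>billing</ID>",
}
```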
🍞 Hook: Training a puppy with treats helps it learn the right tricks.
🥬 The Concept: Reinforcement Learning (RL)
- What it is: Improving choices by giving rewards for good agent sets and discouraging bad ones.
- How it works:
- Let the model propose agent sets.
- Reward precision-like behavior (few wrong agents) and recall-like behavior (cover all correct agents).
- Add a small penalty for picking too many agents (keep it concise).
- Why it matters: Without RL, the model may overfit templates or be too cautious, missing needed agents. 🍞 Anchor: If the true set is {Network, CDN} and the model outputs {Network, App, CDN}, it gets dinged for the extra "App."
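A toy version of such a reward could look like the function below; the exact weights and functional form are assumptions, but it shows how precision-like, recall-like, and length terms can be balanced.

```python
# Toy reward for a predicted agent set versus the gold set (weights are illustrative).
def routing_reward(predicted: set, gold: set,
                   alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.1) -> float:
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)
    precision = hits / len(predicted)                            # few irrelevant agents
    recall = hits / len(gold)                                    # cover all needed agents
    length_penalty = gamma * max(0, len(predicted) - len(gold))  # don't over-list
    return alpha * precision + beta * recall - length_penalty

print(routing_reward({"Network", "CDN"}, {"Network", "CDN"}))         # exact match scores best
print(routing_reward({"Network", "App", "CDN"}, {"Network", "CDN"}))  # extra "App" is dinged
```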
Step-by-step recipe (a code sketch of the full flow follows this recipe):
- Build the Router Prompt
- What happens: Concatenate the routing instruction, the user query, and the natural-language descriptions of all available agents.
- Why it exists: The router must compare the question with what each agent claims they can do.
- Example: "Query: 'Checkout is slow'; Agents: Network (latency, packet loss), App (API timeouts, DB), CDN (edge caching)."
- Generate a Natural-Language Reasoning Chain (<reason> … </reason>)
- What happens: TCAR lists plausible causes, the relevant technical stack, and role boundaries.
- Why it exists: The text explanation forces careful matching and makes debugging easy.
- Example: "Slow checkout could be network latency or DB lock contention; Network can measure RTT; App can inspect DB queries."
- Select a Small Subset of Agents (<ID>AgentID</ID>)
- What happens: TCAR outputs one to a few agent IDs (up to a cap, e.g., 3) aligned with the reasoning.
- Why it exists: Many real questions need multiple experts; the cap keeps costs manageable.
- Example: Output Network + App for "slow checkout under heavy load."
- Parallel Agent Responses
- What happens: Each chosen agent answers independently using its domain tools or knowledge.
- Why it exists: Parallelism reduces latency and gathers complementary evidence.
- Example: Network agent returns traceroute insights; App agent returns slow query logs.
- Aggregation by the Refining Agent
- What happens: The Refiner reads all candidate answers, merges overlapping parts, explains conflicts, and writes one final response.
- Why it exists: Users need one clear solution, not multiple partial drafts.
- Example: "Network shows no packet loss; App reveals slow SQL; optimize the DB index first."
- Training Details (the "secret sauce")
- SFT formatting choice: Use a unified <reason> tag (instead of model-specific tags) so various instruction models can learn the same habit.
- RL via DAPO-style optimization: Filter out low-entropy samples that were already easy after SFT; concentrate training on harder, ambiguous cases where routing matters most.
- Reward shaping: Balance three forces: a precision-like reward (fewer irrelevant agents), a recall-like reward (cover all truly needed agents), and a length penalty (don't over-list agents).
- Why it matters: This balance prevents both under-selection (missing experts) and over-selection (wasteful, confusing responses).
- Example: For ground truth {CDN, Network}, the best reward comes from exactly those two; {CDN} misses coverage; {CDN, Network, App} trips the length penalty.
- Dynamic Agent Onboarding
- What happens: Add a new agent by appending its natural-language description; no router retraining.
- Why it exists: Enterprises evolve; routing must keep up without weeks of model updates.
- Example: Add a "Caching" agent today; tomorrow TCAR can route cache-related tickets to it.
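Here is the promised sketch of the full recipe as code. The prompt wording, the tag parsing, and the three stand-in functions (call_router, call_agent, refine) are illustrative assumptions; in practice each would wrap a real model call.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for the three model calls; swap in real LLM clients.
def call_router(prompt: str) -> str:
    return ("<reason>Slow checkout may be network latency or slow SQL; Network can "
            "measure RTT, App can inspect queries.</reason><ID>network</ID><ID>app</ID>")

def call_agent(agent_id: str, query: str) -> str:
    return f"[{agent_id}] draft diagnosis for: {query}"

def refine(query: str, drafts: dict) -> str:
    return "Merged answer: " + " | ".join(drafts.values())

def route_and_answer(query: str, agents: list, max_agents: int = 3) -> str:
    # 1) Router prompt = instruction + user query + natural-language agent descriptions.
    catalog = "\n".join(f"- {a['id']}: {a['description']}" for a in agents)
    prompt = ("Explain your routing inside <reason> tags, then list the chosen agent "
              f"IDs inside <ID> tags.\n\nQuery: {query}\n\nAgents:\n{catalog}")
    # 2) Reason-then-select: parse the reasoning chain and the agent-ID subset (capped).
    output = call_router(prompt)
    reasoning = re.search(r"<reason>(.*?)</reason>", output, re.DOTALL)
    chosen = re.findall(r"<ID>(.*?)</ID>", output)[:max_agents]
    print("Routing reason:", reasoning.group(1) if reasoning else "(none)")
    # 3) The chosen agents answer in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = dict(zip(chosen, pool.map(lambda a: call_agent(a, query), chosen)))
    # 4) The Refining Agent merges the drafts into one final reply.
    return refine(query, drafts)

agents = [
    {"id": "network", "description": "latency, packet loss, routing"},
    {"id": "app", "description": "API timeouts, slow database queries"},
    {"id": "cdn", "description": "edge caching, regional slowness"},
]
print(route_and_answer("Checkout is slow under heavy load", agents))
```

Because the agent descriptions are just text in the prompt, onboarding a new agent here is again only a matter of appending another entry to the agents list.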
The Secret Sauce:
- Reason-then-select turns hidden guesses into readable logic.
- Subset selection preserves uncertainty and overlap without exploding costs.
- A Refiner converts multi-perspective drafts into one strong, trustworthy answer.
- SFT+RL training sculpts behavior to be accurate, complete, and concise.
04 Experiments & Results
🍞 Hook: You know how a science fair judge doesn't just look at the score, but also compares projects side by side?
🥬 The Concept: F1 Score
- What it is: A number that balances being precise (not picking wrong agents) and being complete (not missing needed agents).
- How it works:
- Measure precision (How many picks are correct?).
- Measure recall (How many correct ones did you include?).
- Combine them into one score (F1) so you can compare models fairly.
- Why it matters: Without F1, a router might look good by picking very few agents (high precision) but miss important ones (low recall). 🍞 Anchor: If the true set is {Network, CDN} and you pick only {Network}, you look precise but you missed CDN, so F1 drops.
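For intuition, this is how precision, recall, and F1 can be computed over agent sets (a toy illustration, not the paper's evaluation code):

```python
# F1 over agent sets: balances precision (no wrong picks) and recall (no misses).
def set_f1(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

print(set_f1({"Network"}, {"Network", "CDN"}))         # ~0.67: precise, but misses CDN
print(set_f1({"Network", "CDN"}, {"Network", "CDN"}))  # 1.0: exact match
```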
The Tests (What they measured and why):
- Datasets: CLINC150 (lots of classes), HWU64 (cross-domain ambiguity), MINDS14 (multilingual), SGD (multi-turn dialogue), and QCloud (real enterprise cloud operations with frequent conflicts).
- Metrics: Accuracy on single-agent datasets; F1 on multi-agent cases; End-to-End Task Success after the Refining Agent.
- Goal: Check if reasoning + subset selection + refinement beats single-label routers and large general LLMs, especially when queries are ambiguous or cross-domain.
The Competition:
- Strong proprietary and open-source models: GPT-5.1, Claude-4.5, DeepSeek-v3.1, ArchRouter, and the Qwen3 family (the TCAR base).
The Scoreboard (with context):
- TCAR (only 4B parameters) achieved state-of-the-art or near-SOTA across datasets, especially on MINDS14 (multilingual), SGD (multi-turn), and QCloud (real enterprise ambiguity).
- On CLINC150 (very many classes, hence long prompts), TCAR was strong but slightly behind the very largest general LLMs; the ultra-long agent list stretched small-model sequence handling.
- On QCloud, where overlaps and conflicts are common, TCAR's F1 surpassed even top general LLMs (like getting an A+ when most models were getting an A- or B+), showing robustness in messy, real-world settings.
Surprising/Notable Findings:
- Reasoning Helps: Adding explicit reasoning chains consistently improved performance versus no-reasoning ablations, suggesting better generalization and interpretability.
- RL Matters: After SFT, applying RL (DAPO-style) improved recall while keeping precision high, fixing the SFT tendency to be too conservative (picking only one agent).
- Refining Agent Shines on Troubleshooting: In human preference tests on QCloud, aggregating multiple agents' answers beat "pick one at random," especially for troubleshooting (win rate much higher), while for simple consultation a single agent often sufficed.
- Efficient in Practice: Although TCAR can select multiple agents, it averaged about 1.37 agents per query, so costs stayed low and no combinatorial explosion happened.
What It Means:
- The combo of reason-then-select + collaboration + refinement turns overlapping domains from a problem into an advantage.
- The training recipe (SFT + RL with smart rewards) tunes the router for enterprise realities: ambiguity, overlap, and growth.
- Even a compact 4B model, if trained and structured well, can rival or beat much larger models on the routing task that enterprises actually need.
05 Discussion & Limitations
Limitations (be specific):
- Depends on Agent Descriptions: If an agent's description is vague or incomplete, the reasoning chain may align to the wrong skills, causing misrouting.
- Long-Tail, Niche Knowledge: Rare configurations or specialized jargon still trip the model when training data is sparse.
- Reasoning-Prediction Mismatch: Sometimes the written reasoning looks sensible, but the final agent list doesn't fully match it; improving this alignment is an open problem.
- Ultra-Long Contexts: Very large agent catalogs (e.g., 150+) can stretch small models' attention and memory limits.
Required Resources:
- A capable instruction-following base model (here, a 4B model worked well for release).
- Training data with queries, agent descriptions, reasoning, and gold agent sets.
- RL compute for DAPO-style optimization and enough rollouts to learn from ambiguous cases.
- A downstream strong LLM for the Refining Agent if you need top-tier aggregation quality.
When NOT to Use:
- Single-domain, simple workloads where one expert always suffices; TCAR's multi-agent machinery adds little.
- Situations with poor or missing agent descriptions; results will be unstable until descriptions are improved.
- Extremely tight latency budgets where even small reasoning chains or a second aggregation pass are unacceptable.
Open Questions:
- How to guarantee tighter consistency between reasoning and final selections?
- Can structured reasoning constraints (checklists, schemas) further boost reliability?
- How to compress very long agent catalogs (e.g., summarization, retrieval) without losing routing accuracy?
- Can on-the-fly description repair (auto-edit unclear agent blurbs) improve robustness?
- What are the best human-in-the-loop strategies to refine routing in production (active learning, feedback loops)?
06 Conclusion & Future Work
3-Sentence Summary: TCAndon-Router (TCAR) is a reasoning-centric router that writes down why it picks agents and then selects a small subset rather than forcing a single choice. The chosen agents answer in parallel and a Refining Agent merges their ideas into one strong reply, turning conflicts into complementary evidence. Trained with SFT and RL, TCAR matches or beats larger models on public and real enterprise data, especially when queries are ambiguous or cross-domain.
Main Achievement: Turning routing from a black-box single label into an interpretable, multi-agent, reason-then-select process, with dynamic onboarding and downstream refinement.
Future Directions:
- Add structured constraints to reasoning chains to better align explanations and selections.
- Improve efficiency for ultra-long agent catalogs via retrieval or agent summarization.
- Explore lighter-weight Refining Agents and tighter cost controls.
- Auto-improve weak agent descriptions with LLM editing tools.
Why Remember This: TCAR shows that explanations plus subsets beat guesses plus single labels. By embracing overlap and then organizing it, multi-agent systems become more accurate, scalable, and trustworthy for the messy problems real users actually have.
Practical Applications
- Enterprise IT support desks that route incidents to Networking, Security, or Database agents, with a Refiner producing a single fix plan.
- Customer service triage that selects Billing, Shipping, and Returns agents for complex orders and merges their guidance into one reply.
- Cloud operations centers diagnosing outages by combining Network, CDN, and Application agents' findings.
- Healthcare intake bots that consult multiple specialty agents (symptom checker, medication safety) before giving a triage suggestion.
- E-commerce platforms that route catalog issues to Content, Pricing, and Inventory agents and deliver one consolidated correction.
- Developer platforms where Build, Test, and Deploy agents coordinate to debug failing pipelines.
- Cybersecurity incident response that loops in Threat Intel, EDR, and Network Forensics agents and outputs a unified playbook.
- Education helpdesks that blend Financial Aid, Registration, and Housing agents into one coherent student answer.
- Smart city support where Traffic, Utilities, and Public Safety agents collaborate on city service tickets.
- Research assistants that query Literature, Data, and Methods agents and synthesize one research note.