Token-Level LLM Collaboration via FusionRoute
Key Summary
- Big all-in-one language models are powerful but too expensive to run everywhere, while small specialists are cheaper but narrow.
- FusionRoute lets several small specialist models work together at every single word (token) while staying fast.
- A tiny 'router' picks the best expert for the next token and also adds a small corrective nudge to the expert's scores.
- This nudge is called complementary logits; adding them fixes expert mistakes right when they happen.
- Theory shows that picking experts alone can't always reach the best answers, but adding the router's nudge can.
- FusionRoute is trained in two stages: first to route and predict tokens (SFT), then to improve preferences with CDPO.
- Across math, coding, and instruction following, FusionRoute beats sequence-level voting, token-level controlled decoding, model merging, and even direct fine-tuning on average.
- It stays competitive with the best domain experts on their home turf while being more robust on mixed tasks.
- The gains are larger for bigger models, showing complementary routing matters more as capacity grows.
- It is simple, efficient, and works with off-the-shelf experts without joint training or architecture changes.
Why This Research Matters
FusionRoute lets organizations deploy smaller, cheaper specialist models that, together, act like a strong general assistant, cutting costs without sacrificing quality. It helps everyday users get better math, code, and instruction answers from a single assistant without manually picking which model to use. The method is efficient because it avoids generating multiple full answers or retraining all experts together. Its token-level corrections catch tiny mistakes early, improving reliability in real time. As models scale, the benefits grow, giving better output quality from the same expert team. This can make AI assistance more accessible to schools, startups, and nonprofits that can't afford giant monolithic models.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you might ask your math friend for algebra help, your coding friend for Python tips, and your language-arts friend to polish your essay? No single friend is perfect at everything, but your team is amazing when each one helps at the right time.
🥬 Filling (The Actual Concept):
- What it is: The paper is about a way to make different small language models (each good at a certain thing) work together smoothly so we get big-model quality without big-model cost.
- How it works (story-before-steps): Before this work, people tried three main styles to combine experts, and each had problems.
- Mixture-of-Experts inside one giant model: effective but costly and inflexible; you must train everything together and usually keep model structures similar.
- Multi-agent debates or sequence-level collaboration: each expert writes a full response, and then we pick or merge the best. That's slow (many full answers), can inflate context, and can even hurt quality.
- Model merging: blend the parameters of multiple experts into one model. It's training-free but touchy: one expert's strengths can damage another's (parameter interference).
- Why it matters: In real apps, you want a single, affordable helper that's good at math, coding, and following instructions without you guessing which model to pick.
🍞 Bottom Bread (Anchor) Imagine a homework helper that, on a math step, listens to the math expert; on code lines, listens to the coding expert; and for explanations, listens to the writing expert, switching smoothly word by word.
---
🍞 Top Bread (Hook) You know how big, fancy calculators can do everything but cost a lot, while small calculators are cheap but limited?
🥬 Filling (The Actual Concept):
- The World Before: Giant general-purpose LLMs handled many tasks but were too expensive to run widely. Small specialists did great within their domains but stumbled outside.
- The Problem: We want general-purpose skill from a team of smaller specialists without paying the cost of an ultra-giant model, and we need it to work automatically per word, not just per whole answer.
- Failed Attempts:
- MoE: needs joint training and architecture match; costly and rigid.
- Multi-agent debates: many full answers, long contexts, possible quality drop as you add agents.
- Model merging: sensitive hyperparameters; capabilities can clash.
- Prior token-level collaboration: chooses among expert outputs per token, but if experts are weak or selection is off, things break.
- The Gap: A per-token method that is both robust (doesn't crumble when experts misfire) and efficient (no massive extra compute) was missing.
- Real Stakes: Everyday assistants should be strong at reasoning, coding, and instructions without users picking a specialist or paying for a supermodel; businesses need cost-effective, reliable quality.
🍞 Bottom Bread (Anchor) Think of a smartphone keyboard that predicts your next word; now imagine it instantly switching between a math brain, a coding brain, and a writing brain while also adding tiny corrections so you rarely mistype: fast and accurate, all the time.
---
🍞 Top Bread (Hook) You know how even your best friend sometimes mishears you, and you jump in with a gentle, quick correction so the story stays on track?
🥬 Filling (The Actual Concept):
- What's missing: Prior token-level methods only picked which expert to trust, but didn't allow a tiny helper to correct the expert's choice on the fly.
- Why it matters: If no expert alone has the exact right move for a given word, just picking the "least wrong" expert still isn't enough. You need a little nudge.
- This Paper's Answer: Add a small, trainable router that both (1) picks the expert and (2) contributes a small corrective score to guide the final word choice. This combo is called FusionRoute.
🍞 Bottom Bread (Anchor) Like a theater director who both chooses the right actor for each line and whispers a quick cue so the line lands perfectly.
02 Core Idea
🍞 Top Bread (Hook) Imagine a relay race where each runner (expert) sprints the stretch they're best at, and a coach (router) not only signals who runs next but also shouts a tiny tip that corrects their stride mid-race.
🥬 Filling (The Actual Concept):
- The "Aha!" Moment in one sentence: Don't just pick the right expert per token; add a tiny complementary push from a router so the final next-word choice is better than any expert alone.
Multiple Analogies:
- Orchestra: The conductor (router) selects which section (expert) leads the next bar and adds a small cue to tempo and volume (complementary logits) so the piece stays harmonious.
- GPS + Driver: The GPS (router) picks which driver (expert) should take the wheel for the next block and gives a tiny voice nudge ("turn a little earlier") to avoid small mistakes.
- Cooking Team: The head chef (router) chooses the right sous-chef (expert) for each step and tosses a pinch of seasoning (complementary logits) to balance the flavor.
Before vs After:
- Before: Token-level methods asked, "Which expert's choice should we follow right now?" If experts disagreed or were weak, results could wobble.
- After (FusionRoute): The router still picks one expert but also adds its own small correction, letting the system fix local mistakes without heavy compute or joint retraining.
Why It Works (intuition, no equations):
- Pure routing assumes "some expert is near-perfect at every step," which often isn't true. When no expert alone fits perfectly, you need a small adjustable helper.
- By adding a routerâs tiny, learned scores to the expertâs scores, the combined choice explores a bigger space of possible next words, making it easier to match the right move.
- In decision-making language, choosing actions (words) step-by-step benefits from both a good specialist and a coachâs hint; the hints reduce the chance of getting stuck with a not-quite-right action.
Building Blocks (in dependency order, each with the Sandwich pattern):
- 🍞 Hook: You know how a teacher first shows you examples before asking your preferences? 🥬 The Concept: Supervised Fine-Tuning (SFT)
- What it is: SFT teaches a model to imitate good examples so it learns basic skills.
- How it works:
- Collect example questionâanswer pairs.
- Train the model to predict the next word in those answers.
- Repeat until the model writes similar quality responses.
- Why it matters: Without SFT, the model may not have steady, sensible writing to build on. 🍞 Anchor: Like practicing solved math problems before trying your own.
- 🍞 Hook: Imagine a superhero team; each hero has a specialty. 🥬 The Concept: Multi-LLM collaboration
- What it is: Let different language models team up so each one does the part they're best at.
- How it works:
- Keep specialists for math, coding, and instructions.
- Decide, step by step, who should act.
- Combine their strengths into one fluent response.
- Why it matters: One model rarely excels at everything; teaming up boosts overall skill. 🍞 Anchor: Use your math friend for equations and your writing friend for explanations in the same answer.
- 🍞 Hook: Think of adding scores from quizzes to decide your overall grade. 🥬 The Concept: Logit addition
- What it is: A way to add score-vectors from two models before picking the next word.
- How it works:
- Get the expertâs scores for each possible next word.
- Get the routerâs small corrective scores.
- Add the two score lists word-by-word.
- Pick the top-scoring word.
- Why it matters: Without adding, you can't gently correct the expert at the exact token. 🍞 Anchor: If the expert leans toward `print(` in code, the router can nudge toward `print("Hello")`. A minimal code sketch of this addition follows below.
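To make the mechanics concrete, here is a minimal PyTorch sketch of logit addition; the vocabulary size, score values, and variable names are all made up for illustration:

```python
import torch

# Toy scores over an 8-token vocabulary; both models must share this vocabulary.
expert_logits = torch.tensor([2.1, 0.3, 1.9, -0.5, 0.0, 0.7, -1.2, 0.4])
router_logits = torch.tensor([-0.4, 0.0, 0.5, 0.0, 0.1, -0.2, 0.0, 0.0])

combined = expert_logits + router_logits  # element-wise, one sum per vocabulary entry
next_token_id = combined.argmax().item()  # greedy pick over the fused scores

# The expert alone would pick token 0 (score 2.1); the router's small nudge
# (-0.4 there, +0.5 on token 2) flips the winner to token 2 (1.9 + 0.5 = 2.4).
print(next_token_id)  # -> 2
```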
- 🍞 Hook: Like a hospital calling the right specialist for each patient. 🥬 The Concept: Mixture-of-Experts (MoE)
- What it is: A big model that routes inputs to different learned experts inside one network.
- How it works:
- Train experts and a router together.
- For each input, the router picks a few experts.
- Combine their outputs.
- Why it matters: It's powerful but requires joint training, similar architectures, and lots of compute. 🍞 Anchor: FusionRoute gets MoE-like benefits without retraining or merging all experts. A toy gating sketch follows below for contrast.
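For contrast with FusionRoute, here is a toy sketch of classic MoE gating inside one network; all sizes and names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=16, n_experts=3, vocab=100):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # internal router
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, vocab) for _ in range(n_experts)]
        )

    def forward(self, h):  # h: (batch, d_model)
        weights = torch.softmax(self.gate(h), dim=-1)            # (batch, n_experts)
        outs = torch.stack([e(h) for e in self.experts], dim=1)  # (batch, n_experts, vocab)
        # Weighted mix of expert outputs, trained jointly end to end;
        # that joint coupling is exactly what FusionRoute avoids.
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (batch, vocab)

print(TinyMoE()(torch.randn(2, 16)).shape)  # torch.Size([2, 100])
```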
- 🍞 Hook: When you tell a story, each new word depends on what you've already said. 🥬 The Concept: Token-level Markov Decision Process (MDP)
- What it is: A way to view writing one word at a time as a sequence of decisions based on the current context.
- How it works:
- State = what's been written so far.
- Action = which word to add next.
- Reward = how good the partial text is (e.g., correctness, clarity).
- Goal = choose actions that maximize overall goodness.
- Why it matters: It explains why per-word expert choice plus small corrections can lead to better whole answers. 🍞 Anchor: Like picking the next chess move based on the board right now. A tiny sketch of this framing appears below.
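The sketch below spells out the MDP framing in code as a reading aid; the reward function is a stand-in of our own (real systems would score correctness or clarity), not anything from the paper:

```python
from dataclasses import dataclass

@dataclass
class State:
    prompt: str
    generated: str  # state = everything written so far

def step(state: State, action: str) -> State:
    # Taking an action (emitting one token) deterministically extends the text.
    return State(state.prompt, state.generated + action)

def reward(state: State) -> float:
    # Illustrative stand-in: reward 1.0 only if the text reaches the right answer.
    return float(state.generated.strip() == "4")

s = step(State("Compute 24 / 6 = ", ""), "4")
print(reward(s))  # -> 1.0
```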
- 🍞 Hook: Picture a stage manager who picks the right actor and also gives an earpiece cue. 🥬 The Concept: FusionRoute
- What it is: A framework with a router that selects an expert at each token and adds a corrective score (complementary logit).
- How it works:
- Read the prompt and what's been written.
- Score which expert should speak next.
- Compute a small corrective score vector.
- Add expert scores + correction and pick the top word.
- Why it matters: Without the correction, pure selection can't reach the best policy in many cases. 🍞 Anchor: The router chooses "math expert" for a fraction step and nudges from 3.14 to π when needed.
- 🍞 Hook: Like a coach whispering a tiny fix: "step left a bit." 🥬 The Concept: Complementary logit generation
- What it is: The router produces a small score tweak to improve the expertâs next-word choice.
- How it works:
- Look at context.
- Predict a gentle score adjustment across the vocabulary.
- Add it to the expertâs scores.
- Choose the improved word.
- Why it matters: This expands what the system can express beyond any single expert. 🍞 Anchor: In code, the expert suggests a variable; the router's nudge steers to the correct variable name.
- 🍞 Hook: After practice, you ask a friend which of two answers they prefer so you can adjust your style. 🥬 The Concept: Preference optimization
- What it is: Training that shifts the model toward responses people prefer.
- How it works:
- Show pairs of answers to a rater (human or model).
- Learn to make preferred answers more likely.
- Keep base skills but align tone, helpfulness, and correctness.
- Why it matters: Without preferences, the model may be fluent but not what users want. 🍞 Anchor: Choosing the clearer, kinder explanation over a confusing one.
🍞 Bottom Bread (Anchor) In a multi-step math-and-code problem, FusionRoute picks the math expert for formulas, the coding expert for function syntax, and the instruction expert for the explanation, while the router adds tiny nudges so each token is as accurate and well-phrased as possible.
03 Methodology
At a high level: Prompt → Router reads context → Router picks expert and creates a tiny corrective score → Add expert scores + correction → Pick next token (greedy) → Repeat.
Step-by-step with Sandwich explanations where concepts first appear:
- Input and Context Reading
- What happens: The router LLM reads the prompt and the already-generated words.
- Why this step exists: It must understand the current situation to know which expert fits and how to correct.
- Example: After "Let's compute 24 ÷ 6 =", the router senses math mode.
- Token-level Expert Selection (from Multi-LLM collaboration) 🍞 Hook: Like picking the right teammate to take the next move. 🥬 The Concept: The router outputs a small set of weights, one per expert, and picks the highest.
- How it works:
- Use a lightweight linear head on the router's last hidden state to score experts.
- Select the expert with the highest score for this token.
- Keep it flexible so choices can change word by word.
- Why it matters: Without per-token routing, you'd overuse one model or switch too slowly. 🍞 Anchor: During a proof step, route to the math expert just for "= 4". A minimal sketch of such a routing head follows below.
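Here is a minimal sketch of the selection step, assuming the router is a small causal LM that exposes its last hidden state; the dimensions and the expert ordering are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model, n_experts = 16, 3
routing_head = nn.Linear(d_model, n_experts)  # lightweight linear head

hidden = torch.randn(1, d_model)              # router's last hidden state at this token
expert_scores = routing_head(hidden)          # one score per expert
chosen = expert_scores.argmax(dim=-1).item()  # select the best-fitting expert

print(chosen)  # e.g. 0 = math, 1 = code, 2 = instruction (assumed order)
```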
- Complementary Logit Generation + Logit Addition 🍞 Hook: Imagine giving a soft hint to improve a friend's answer. 🥬 The Concept: The router also produces complementary logits, then we add them to the chosen expert's logits.
- How it works:
- Compute the expert's scores over the vocabulary.
- Compute the router's small corrective scores over the same vocabulary.
- Add the two score vectors.
- Choose the top-scoring token (greedy decoding).
- Why it matters: If the expert is slightly off, the router's nudge can fix it immediately. 🍞 Anchor: The expert leans to "cos" but context requires "sin"; the router's nudge flips the choice.
- Greedy Decoding
- What happens: Pick the top-scoring next token each time; it's simple and fast.
- Why this step exists: Keeps inference efficient and stable across many tasks.
- Example: Among candidates, the combined score makes "return answer" the top token in code. A decode-loop sketch that combines selection, correction, and greedy picking follows below.
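The sketch below ties the three steps into one loop. Everything is a toy stand-in: random hidden states instead of a real router LM, plain linear layers instead of real experts, and invented names throughout:

```python
import torch

torch.manual_seed(0)
VOCAB, D = 50, 16

experts = [torch.nn.Linear(D, VOCAB) for _ in range(3)]  # stand-ins for frozen specialists
routing_head = torch.nn.Linear(D, 3)                     # picks one expert per token
comp_head = torch.nn.Linear(D, VOCAB)                    # produces complementary logits

def router_hidden(ids):
    # Stand-in for the router LM's last hidden state over prompt + text so far.
    return torch.randn(D)

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        h = router_hidden(ids)
        k = routing_head(h).argmax().item()   # 1) select one expert for this token
        fused = experts[k](h) + comp_head(h)  # 2) expert logits + router's nudge
        ids.append(fused.argmax().item())     # 3) greedy pick on the fused scores
    return ids

print(generate([1, 7, 9]))
```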
- Training Phase 1: Supervised Fine-Tuning (SFT) for Routing and Base Skills 🍞 Hook: Practice with worked examples before playing in a tournament. 🥬 The Concept: Train the router to (a) predict next tokens and (b) learn to route at places where experts truly differ.
- How it works:
- Language modeling loss: improve the router's general next-token prediction on mixed-domain data.
- Routing loss: only supervise tokens where experts disagree, so gradients focus on meaningful choices.
- The router learns which expert tends to fit which kind of token in context.
- Why it matters: If you train routing on easy tokens (like punctuation), it learns nothing useful. 🍞 Anchor: On a line where math and code experts propose different next words, the router learns to pick the one matching the ground truth. A sketch of the disagreement-masked loss follows below.
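This sketch shows one way to mask the routing loss to disagreement positions. It is our interpretation under assumed shapes (each expert's greedy prediction per position), not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, n_experts, vocab = 6, 3, 50
expert_preds = torch.randint(vocab, (n_experts, T))  # each expert's greedy next token
targets = torch.randint(vocab, (T,))                 # ground-truth next tokens
router_expert_logits = torch.randn(T, n_experts, requires_grad=True)

# Supervise routing only where experts disagree; unanimous tokens teach nothing.
disagree = (expert_preds != expert_preds[0]).any(dim=0)  # (T,) bool mask

# Label = an expert whose prediction matches the target (first match; a real
# setup would also mask positions where no expert is correct).
labels = (expert_preds == targets).float().argmax(dim=0)  # (T,)

routing_loss = F.cross_entropy(router_expert_logits[disagree], labels[disagree])
# The full Phase 1 objective adds the router's ordinary LM loss on the same data.
print(routing_loss)
```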
- Training Phase 2: Complemented Direct Preference Optimization (CDPO) 🍞 Hook: After learning basics, you ask referees which answers they prefer to polish your style. 🥬 The Concept: Adjust only the router's base model with preference learning while treating expert outputs as fixed, so the router learns when to nudge more.
- How it works:
- Use pairs of responses (preferred vs. less preferred).
- Compare how the combined policy (expert + router) scores those pairs.
- Update only the router's base model, not the tiny routing head.
- If experts are already strong, few changes; if weak on a prompt, the router learns stronger nudges.
- Why it matters: This targets the exact spots where experts struggle, improving robustness. 🍞 Anchor: If the coding expert forgets an edge case, CDPO helps the router learn to add a corrective token there. A DPO-style sketch of the objective follows below.
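A DPO-style sketch of the idea, with loud assumptions: the sequence log-probs come from the combined policy (frozen expert logits plus the router's complementary logits), the reference policy is frozen, and beta plus all numbers are toy values:

```python
import torch
import torch.nn.functional as F

def cdpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO form: make the combined policy prefer the chosen response
    # more strongly than the frozen reference policy does.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Toy sequence log-probs; in training, gradients would flow only into the
# router's base model, never into the experts or the tiny routing head.
loss = cdpo_loss(
    logp_chosen=torch.tensor([-12.0], requires_grad=True),
    logp_rejected=torch.tensor([-14.5]),
    ref_chosen=torch.tensor([-13.0]),
    ref_rejected=torch.tensor([-13.5]),
)
loss.backward()
print(round(loss.item(), 3))  # -> 0.598
```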
- Mixed Training for Stability
- What happens: Mix SFT samples and preference pairs in one training loop; update routing head only on SFT tokens; update the base model on preference pairs.
- Why this step exists: Keeps routing sharp while teaching the base model how and when to correct.
- Example: In one batch, the router practices choosing the math expert on disagreements; in another, it learns to prefer clearer phrasing. A sketch of this alternation appears below.
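A sketch of the alternation, assuming the two losses above and a hypothetical split of parameters into two optimizers; the batch format is invented for illustration:

```python
def train_step(batch, base_opt, head_opt):
    """One mixed-training step; `batch` carries precomputed loss tensors."""
    if batch["kind"] == "sft":
        # SFT batches: LM loss + disagreement-masked routing loss.
        # Only here does the tiny routing head receive updates.
        loss = batch["lm_loss"] + batch["routing_loss"]
        head_opt.zero_grad()
        loss.backward()
        head_opt.step()
    else:
        # Preference batches: the CDPO loss updates only the router's base
        # model, keeping routing sharp while teaching when and how to correct.
        loss = batch["cdpo_loss"]
        base_opt.zero_grad()
        loss.backward()
        base_opt.step()
    return loss.item()
```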
- Secret Sauce
- Complementary logits are the clever twist. Pure expert selection assumes a perfect expert exists at every token; that's unrealistic. The router's nudge expands what the system can express, letting it reach near-optimal choices more often.
Putting It Together on Real Data (concrete walkthrough):
- Prompt: "Write a Python function to sum primes up to n and explain your reasoning."
- Router reads: detects code + explanation ahead.
- For âdef â and function name, router picks coding expert.
- Router adds a tiny nudge to prefer a common pattern (sieve or primality test loop).
- For the explanation sentence, router picks instruction expert and nudges for clarity.
- On an off-by-one risk, router nudges token choice to fix the boundary check.
- Greedy decoding chooses the final tokens with these combined scores. Result: A clean function and a readable explanation, better than either expert alone.
Why things break without each step:
- No routing: one expert tries to do everything; quality drops on out-of-domain parts.
- No complementary logits: when experts are all slightly wrong at a token, you can't fix it.
- No preference learning: output may be correct but less aligned to what readers prefer.
- No disagreement-focused routing loss: the router learns from easy tokens and gets confused later.
04 Experiments & Results
The Test: What they measured and why
- Cross-domain accuracy: Can one FusionRoute model handle math (GSM8K, MATH500), code (MBPP, HumanEval), and instruction following (IfEval) without per-task checkpoint switching?
- Overall response quality: Pairwise win rate on a general held-out set judged by GPT-4o, reflecting clarity, correctness, and style.
The Competition: Baselines
- Sequence Selection: Each expert writes a full answer; pick the best via a reward model. Simple, but slow and can degrade with long contexts.
- Collab (controlled decoding): Token-level selection guided by an external reward signal each step; strong idea but computationally heavier and can be unstable if the reward or experts misfire.
- Model Merging (DARE, TaskArithmetic): Training-free parameter blending; can lose specializations.
- Fine-tuned Model: Directly fine-tune the base model with the same data; tests whether collaboration actually helps.
The Scoreboard: Results with context
- Llama-3 family (8B scale):
- FusionRoute average accuracy ≈ 0.566 across five benchmarks, higher than direct fine-tuning (≈ 0.536), sequence selection (≈ 0.466), Collab (≈ 0.502), and merging baselines (≈ 0.368–0.424).
- On the HumanEval coding benchmark, FusionRoute ≈ 0.63 vs. fine-tuned ≈ 0.58; that's like moving from a B to a solid B+/A-.
- On IfEval (instruction following), FusionRoute ≈ 0.69 vs. fine-tuned ≈ 0.72 (close), but FusionRoute wins on the overall average thanks to stronger coding and math balance.
- Gemma-2 family (2B scale):
- FusionRoute average ≈ 0.426, beating sequence selection (≈ 0.408), Collab (≈ 0.360), merging methods (≈ 0.224–0.268), and slightly besting direct fine-tuning (≈ 0.394).
- Even at small scale, FusionRoute's balance across domains edges out single-model fine-tuning.
General-Quality Win Rate (GPT-4o judge):
- FusionRoute wins far more often than the fine-tuned model at both 8B and 2B scales. Counting ties as half-wins (win rate = (wins + 0.5 × ties) / total comparisons), its win rate jumps notably at 8B, showing bigger benefits when models are larger.
- Interpretation: As models scale, just picking experts gets brittle; FusionRoute's corrective nudge better uses that extra capacity to polish answers.
Surprising Findings:
- Complementary logits matter a lot. Routing-only (no correction) beats some baselines but still trails full FusionRoute, especially on coding and instruction tasks where local mistakes are common.
- Stability: Training the router on expert disagreements (SFT) makes token-level routing more reliable than controlled-decoding methods that depend on external reward signals at every step.
- Specialization preserved: FusionRoute stays competitive with domain experts on their home datasets while providing stronger averaged performance across all tasks, like having your cake and eating it too.
Concrete Examples of Improvements:
- Math: When a step could choose between similar tokens (e.g., fraction vs. decimal), the router's nudge helps pick the format matching the solution style.
- Code: When syntax has multiple plausible continuations, the corrective logits encourage the idiomatic or bug-free option (e.g., correct variable names, boundary checks).
- Instructions: The router favors clearer, on-task phrasing, improving judged quality without sacrificing correctness.
Efficiency Context:
- No need to sample full responses from all experts and then pick: FusionRoute operates per token with a single expert plus a tiny router correction, which is faster and more cost-effective for deployment.
05 Discussion & Limitations
Limitations (honest view):
- Coverage gaps: If all experts are weak for a specific pattern, the router's small correction can help but won't replace missing deep knowledge entirely.
- Preference data quality: CDPO benefits depend on the quality and diversity of preference pairs; biased or narrow preferences can misguide polishing.
- Router overfitting risk: Without mixed training (decoupling routing head updates from preference updates), routing can destabilize.
- Reward mismatch: External evaluation (e.g., GPT-4o judging) aligns with human preferences but is not perfect; some improvements may be style-sensitive.
Required Resources:
- Several off-the-shelf experts (e.g., math, code, instruction) with similar tokenizers and vocabularies, so their logits can be added.
- Compute for post-training the router (SFT + CDPO), though far less than retraining full MoE or giant single models.
- Preference data (or a reliable proxy rater) for CDPO.
When NOT to Use:
- Single, well-defined domain with one standout expert: routing overhead may not pay off versus fine-tuning one model.
- Extremely low-latency microcontrollers: even a tiny router head adds overhead compared to a single model.
- Highly divergent architectures/tokenizers among experts: if logits can't be aligned, FusionRoute's addition step won't apply directly.
Open Questions:
- Dynamic expert pools: How to add/remove experts over time without retraining the router from scratch?
- Safety and pluralistic goals: Can multiple safety/helpfulness experts be fused with complementary logits at decode-time to balance trade-offs reliably?
- Stronger theoretical guarantees: Beyond current assumptions, can we bound performance when experts partially overlap or tokenizers differ?
- Adaptive correction strength: Can the router learn when to back off fully (trust the expert) versus when to override more aggressively, using uncertainty estimates?
06 Conclusion & Future Work
Three-sentence summary:
- FusionRoute lets multiple specialized language models collaborate at the token level, with a tiny router that picks the best expert for the next word and adds a small corrective score.
- Theory shows pure expert selection can't generally reach the optimal decoding policy, while adding complementary logits expands what the system can achieve; practice confirms better accuracy and judged quality across math, coding, and instructions.
- It's efficient, doesn't require retraining all experts together, and stays competitive with domain specialists while giving strong general-purpose performance.
Main Achievement:
- A simple, robust, and principled token-level collaboration method that marries per-token expert routing with a trainable corrective signal, outperforming sequence-level voting, controlled decoding, model merging, and direct fine-tuning on average.
Future Directions:
- Plug-and-play expert management (hot-swapping experts), richer preference signals (multidimensional alignment), and uncertainty-aware correction strength.
- Extending to multimodal experts (text, code, math, images) and safety/value trade-offs during decoding.
Why Remember This:
- FusionRoute's key idea (select the expert, then add a tiny corrective nudge) turns brittle expert-picking into a robust, general-purpose policy. It's an elegant, practical bridge between specialist teams and all-in-one capability.
Practical Applications
- Unified helpdesk bots that handle troubleshooting (logic), code snippets, and polite customer responses in one chat.
- Educational tutors that switch between math solving, code examples, and clear explanations sentence by sentence.
- Developer copilots that write code and instantly explain design choices in natural language.
- Data analysis assistants that compute formulas, write scripts, and summarize results in clear reports.
- Document assistants that draft policy text but switch to calculations or pseudo-code inline when needed.
- Customer service agents that follow instructions precisely while generating small code patches for common issues.
- Research aides that mix formal reasoning, small code experiments, and readable conclusions within one response.
- On-device assistants that combine smaller experts efficiently instead of running a massive all-in-one model.
- Workflow automation bots that choose specialized skills at each step (APIs, validations, explanations).
- Safety-aligned decoding that blends a helpful expert with a safety expert and a corrective router to reduce risky outputs.