RelayLLM: Efficient Reasoning via Collaborative Decoding
Key Summary
- RelayLLM lets a small model do the talking and only asks a big model for help on a few, truly hard tokens.
- The small model learns a special <call>n</call> command to borrow exactly n helper tokens from the big model, then continues on its own.
- A two-stage training plan (supervised warm-up + GRPO reinforcement learning) teaches when to ask for help and for how long.
- A difficulty-aware reward encourages independence on easy problems, help-seeking on medium ones, and exploration on very hard ones.
- Across six math benchmarks, accuracy rises from 42.5% to 49.52% while calling the big model for only 1.07% of tokens.
- This saves about 98.2% of the token cost compared to a performance-matched random router and bridges roughly 60% of the gap to the large model.
- RelayLLM beats prior token-level routing (like CITER) without needing an extra controller network, reducing latency and overhead.
- Teacher-free tests show the student keeps some of the learned reasoning skills even when the big model is unavailable.
- Dynamic, just-enough help outperforms fixed-length help strategies at much lower cost.
- Results generalize beyond math to harder general-knowledge tasks (e.g., MMLU-Pro), showing learned help-seeking transfers.
Why This Research Matters
RelayLLM cuts costs and speeds up answers by using the big model only for the few words that truly need expert help. That means better math tutors, coding assistants, and customer support that run faster and cheaper. Phones and edge devices can handle more reasoning locally, saving data, battery, and time. Companies get lower cloud bills and smaller carbon footprints while improving accuracy. Even if the expert model goes offline, the student keeps some learned reasoning skills. This approach makes smart help-seeking a standard tool for AI, not a last resort.
Detailed Explanation
01 Background & Problem Definition
You know how in class, sometimes you can do most of a worksheet yourself but need the teacher only for one tricky step? It would be silly to hand the whole paper to the teacher when you only need help on a single line. That’s how computers using language models used to act: they either tried everything alone or sent the entire problem to a big model, even if just a tiny part was hard.
🍞 Top Bread (Hook): Imagine you’re baking cookies with a friend. You can mix the dough, scoop it, and set the timer. You only need your friend to help you lift the heavy tray into the oven for a moment. 🥬 The Concept (Reinforcement Learning): What it is: A way for an AI to learn by trying, getting rewards or penalties, and improving like practicing a sport. How it works: 1) The AI tries a move. 2) A rule gives a score (reward). 3) The AI repeats good moves more often. 4) Over time, the AI learns a policy—what to do in each situation. Why it matters: Without rewards, the AI can’t tell which choices were helpful and won’t get better. 🍞 Anchor: Like a video game character that learns to avoid lava because every time it falls in, it loses points.
The world before: Big models (LLMs) are very smart but expensive and slow to run; small models (SLMs) are fast and cheap but get stuck on tough reasoning. Teams tried to combine them. Most systems used “routing” or “cascading”: if a question looked hard, they sent the whole thing to the big model. That helped accuracy but wasted lots of compute because the small model could have handled many steps.
🍞 Top Bread (Hook): Think of a school office that sends every tough-looking question straight to the principal, even if a teacher could answer most of it and only one sentence needs the principal. 🥬 The Concept (Cascading Methods): What it is: A step-by-step handoff where, once flagged as hard, a query goes entirely to a larger model. How it works: 1) Check difficulty. 2) If “hard,” pass the whole task to the big model. 3) Big model solves from that point on. Why it matters: Without care, you overpay—using a principal for tasks a teacher could finish. 🍞 Anchor: Like sending the whole pizza to a master chef to add one basil leaf.
🍞 Top Bread (Hook): Picture a traffic officer deciding which cars go to a highway and which to a side road. 🥬 The Concept (Routing Mechanisms): What it is: A system that decides which model should process a request. How it works: 1) Look at the input. 2) Predict difficulty or needed skill. 3) Send the entire job to a small or big model. Why it matters: If routing is too coarse, you send the whole job to the big model and waste resources. 🍞 Anchor: Like using a bulldozer to plant a tulip because the path was marked “construction.”
The problem: Coarse, all-or-nothing handoffs burn time and money. Prior token-level approaches needed extra controller networks, which add latency and cost. The gap: We needed a way for the small model itself to ask the big one for help only at exact, critical moments—and only for a few tokens—without an external controller.
The real stakes: Faster answers on phones, lower cloud bills, greener computing, smoother math tutoring, more responsive customer chatbots, and better on-device apps where you often have weak connectivity or strict latency limits.
02 Core Idea
Aha! Let the small model drive the conversation and call the big model only for a few, critical tokens using a special command, then take back control.
Three analogies:
- Walkie-talkie coach: The player (small model) plays the game and pings the coach (big model) for a quick tip mid-play, then keeps playing. No need to hand over the game.
- Flashlight in a tunnel: Most of the path is bright (easy). When it gets dark (hard), turn on the flashlight briefly.
- Spell-check for ideas: You write the paragraph; for a tricky sentence, you ask a smarter friend to supply a phrase, then you continue.
🍞 Top Bread (Hook): You know how friends take turns telling a story, and one friend jumps in only when the plot gets confusing? 🥬 The Concept (Token-Level Collaborative Decoding): What it is: The small model and large model take turns at the token level, not the whole question. How it works: 1) Small model writes tokens. 2) If stuck, it emits a special call. 3) Big model writes n helper tokens. 4) Small model resumes and continues. Why it matters: You pay for help only when and where it’s needed, saving lots of compute. 🍞 Anchor: Like asking a friend to fill in one tricky sentence in your essay rather than rewriting the entire paper.
🍞 Top Bread (Hook): Imagine raising your hand in class and saying, “May I get exactly two hints, please?” 🥬 The Concept (Command Token Generation): What it is: The small model produces a command like <call>n</call> to request exactly n tokens from the big model. How it works: 1) Small model detects a hard spot. 2) It outputs <call>n</call>. 3) Generation pauses; the big model writes n tokens. 4) Control returns to the small model. Why it matters: Precise calls prevent over-helping and wasted cost. 🍞 Anchor: Like asking the librarian, “Can I see the next two pages of the answer key?”—no more, no less.
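To make the command concrete, here is a minimal sketch, in Python, of how a decoding loop might detect and parse the <call>n</call> tag in the student's output. The regex, the parse_call helper, and the max_budget cap are illustrative assumptions, not the paper's implementation.

```python
import re

# Matches the delegation command, e.g. "<call>8</call>", and captures n.
CALL_PATTERN = re.compile(r"<call>(\d+)</call>")

def parse_call(text: str, max_budget: int = 512):
    """Return (position, n) for the first <call>n</call> command, or None.

    max_budget is an illustrative cap on helper tokens; in practice the
    limit would be the room left in the context window.
    """
    match = CALL_PATTERN.search(text)
    if match is None:
        return None
    n = min(int(match.group(1)), max_budget)
    return match.start(), n

# Example: the student pauses after asking for 8 helper tokens.
print(parse_call("From the given info... <call>8</call>"))  # -> (23, 8)
```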
Before vs. after: Before, once a problem seemed hard, systems shipped everything to the big model. After, the small model remains the boss, borrowing only a few tokens when needed, then finishing the job itself. This slashes cost while recovering much of the big model’s smarts.
Why it works: In many reasoning problems, most steps are easy, and only a few steps are “make-or-break.” If you add expert guidance exactly at those critical points, you fix errors early and stay on track, all while spending very few extra tokens.
Building blocks (with sandwiches):
- 🍞 Hook: Like practicing a dance routine slowly before performing. 🥬 The Concept (Supervised Warm-Up Phase): What it is: A short, guided practice to teach the small model how to write valid <call>n</call> commands anywhere in a sequence. How it works: 1) Sample text from the small model. 2) Insert call commands at random token positions with varied lengths. 3) Fine-tune so the model learns the syntax and timing. Why it matters: Without warm-up, the model might never learn to issue correct calls. 🍞 Anchor: Like rehearsing “raise hand now” cues before a school play.
- 🍞 Hook: Imagine comparing your quiz score to your study group’s average. 🥬 The Concept (GRPO: Group Relative Policy Optimization): What it is: A reinforcement learning method that rewards outputs that beat a group’s average, with gentle nudging to stay close to a reference model. How it works: 1) Generate several candidate answers. 2) Score each with a verifier-based reward. 3) Normalize by the group mean. 4) Update the policy toward better-than-average samples. Why it matters: It stabilizes learning and focuses on relative improvements. 🍞 Anchor: Like earning a gold star when you outperform your table group.
- 🍞 Hook: Think of three bins: easy, medium, and “whoa, super hard.” 🥬 The Concept (Difficulty-Aware Reward): What it is: A reward that changes based on how hard the problem seems (judged by the group’s results). How it works: 1) If someone solved it without calling, reward independence extra. 2) If only call-users got it right, penalize stubborn no-calls. 3) If nobody solved it, give a small bonus for trying help (exploration). Why it matters: It teaches the small model when to be brave, when to ask, and when to explore. 🍞 Anchor: Like giving different stickers for solving solo, wisely asking a teacher, or bravely trying a new approach.
Together, these pieces let the small model become a smart captain: it sails most of the journey solo and radios the lighthouse only when the waters get foggy—for just long enough to pass the rocks.
03 Methodology
At a high level: Input question → Small model writes tokens → If stuck, it emits <call>n</call> → Big model writes n helper tokens → Small model resumes and finishes → Output answer.
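A minimal sketch of that relay loop, under stated assumptions: slm_generate and llm_generate are hypothetical wrappers around the student and teacher decoders (the student stops either at the end of its answer or right after emitting a call tag; the teacher returns at most n tokens). This illustrates the protocol, not the authors' code.

```python
import re

CALL_AT_END = re.compile(r"<call>(\d+)</call>\s*$")

def strip_calls(text: str) -> str:
    """Remove <call>n</call> markers so the teacher sees an ordinary prompt."""
    return re.sub(r"<call>\d+</call>", "", text)

def relay_decode(prompt, slm_generate, llm_generate, max_rounds: int = 20) -> str:
    """Student-led decoding with token-precise delegation to a teacher.

    Assumed callables (illustrative, not the paper's API):
      slm_generate(text) -> str : continues `text`, stopping at the end of the
                                  answer or right after a <call>n</call> tag.
      llm_generate(text, n) -> str : returns at most n continuation tokens.
    """
    history = prompt
    for _ in range(max_rounds):
        history += slm_generate(history)       # 1) student writes tokens
        call = CALL_AT_END.search(history)
        if call is None:                       # 2) no help requested: finished
            break
        n = int(call.group(1))
        clean = strip_calls(history)           # 3) hide control tokens from the teacher
        history += llm_generate(clean, n)      # 4) teacher contributes n tokens
    return history                             # 5) the student's view keeps its call markers
```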
Step-by-step recipe with why each step exists and what breaks without it:
- Small model (SLM) starts generating. What happens: The SLM reads the prompt and writes tokens, one by one. Why it exists: Most steps are easy and cheap for the SLM. What breaks without it: If you default to the big model, you pay too much for simple steps. Example: Solving “3 cats catch 3 rats in 3 minutes; how long for 100 cats and 100 rats?” The SLM can recall proportional reasoning and write the setup.
- Detect a tough spot and issue a precise help request. What happens: The SLM outputs <call>n</call> to ask for exactly n helper tokens. Why it exists: Precision prevents over-helping and cost bloat. What breaks without it: A vague, open help request risks the big model taking over unnecessarily. Example: The SLM writes, “From the given info… <call>8</call>” to get 8 guiding tokens.
- Pause SLM; hand clean context to the big model (LLM). What happens: The system strips the call markers and forwards the history so the LLM sees a normal input. Why it exists: Keeps the LLM’s input distribution familiar, improving quality. What breaks without it: The LLM might be confused by control tokens and respond oddly. Example: The LLM continues the reasoning for 8 tokens, filling in the tricky step.
- Insert LLM tokens back and resume SLM control. What happens: The LLM’s tokens are appended, and the SLM continues writing with the full history (including its original <call>n</call> markers in its own view). Why it exists: The SLM needs to remember its delegation choices and learn from the LLM’s hints. What breaks without it: The SLM could forget what it asked and repeat or contradict the hint. Example: After the 8 helper tokens, the SLM wraps up: “Therefore, it takes 3 minutes.”
- Stop on final answer formatting. What happens: The SLM (or the pipeline) parses a final answer (e.g., inside \boxed{}). Why it exists: Ensures we can verify correctness deterministically on math tasks (RL with verifiable reward). What breaks without it: The reward becomes noisy or ambiguous, making learning unstable. Example: “Please reason step by step, and put your final answer within \boxed{}.”
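Because the reward relies on verifiable correctness, a rule-based checker can extract the \boxed{...} answer and compare it to the reference. A minimal sketch (the normalization is an illustrative simplification, not the paper's exact grader):

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in a model output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def _norm(s: str) -> str:
    # Illustrative normalization; a real grader would be more careful.
    return s.replace(" ", "").rstrip(".")

def verify(output: str, gold: str) -> bool:
    """Rule-based correctness check that backs the verifiable reward."""
    answer = extract_boxed(output)
    return answer is not None and _norm(answer) == _norm(gold)

print(verify(r"... Therefore, it takes \boxed{3} minutes.", "3"))  # True
```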
Training like a coach:
- Supervised warm-up (teach the gesture): The SLM first practices how to raise its hand (emit a valid <call>n</call>) at any token position, with varied n (from 1 up to thousands, clipped to the room remaining in the sequence). Self-sampled data from the SLM prevents a distribution shift. Why needed: Without it, the SLM may never produce valid calls during RL. (A minimal data-construction sketch follows this list.)
- Reinforcement learning with GRPO (teach the timing): We generate groups of rollouts, score them with a verifier, and update toward above-average samples while staying close to a reference model via KL regularization. Why needed: It aligns behavior with better-than-peers outcomes rather than noisy single samples.
- Data filtering (teach on solvable ground): Before RL, filter out examples where the teacher LLM fails at least half the time. Why needed: Calling a teacher that can’t help only wastes tokens and confuses learning.
- Difficulty-aware reward (teach judgment): Three scenarios guide behavior: Student-Solvable (boost solo success; calls get only simple reward), Teacher-Dependent (penalize no-call stubbornness; reward helpful calls), and Teacher-Unsolvable (tiny reward for trying calls to encourage exploration).
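Here is a minimal sketch of how the warm-up data described above could be constructed from a self-sampled student rollout; make_warmup_example, the whitespace tokenization, and the length range are illustrative assumptions.

```python
import random

def make_warmup_example(tokens: list[str], max_call_len: int = 1000) -> str:
    """Build one supervised warm-up target from a self-sampled student rollout.

    A <call>n</call> command is inserted at a random token position, with n
    drawn at random and clipped so it never exceeds the tokens remaining
    after the insertion point. Ranges and formatting are illustrative.
    """
    pos = random.randrange(1, len(tokens))                 # where help begins
    n = min(random.randint(1, max_call_len), len(tokens) - pos)
    command = f"<call>{n}</call>"
    # The target teaches the student both the command syntax and how to
    # resume smoothly after the n "helper" tokens.
    return " ".join(tokens[:pos] + [command] + tokens[pos:])

# Hypothetical usage on a whitespace-tokenized student rollout:
rollout = "3 cats catch 3 rats in 3 minutes , so the rate stays constant .".split()
print(make_warmup_example(rollout))
```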
🍞 Top Bread (Hook): Think of practicing drills—first learn the motion, then learn when to use it in a real game. 🥬 The Concept (Reinforcement Learning with Verifiable Reward—RLVR under GRPO): What it is: An RL setup where a rule-based checker can say “right or wrong,” turning answers into reliable rewards, and GRPO compares you to your group. How it works: 1) Generate multiple answers. 2) Verify correctness and subtract call cost. 3) Normalize by group mean. 4) Update the policy to favor strong, efficient samples. Why it matters: Reliable rewards and group comparison stabilize and speed up learning. 🍞 Anchor: Like a spelling bee where a judge confirms correct words, and you learn by edging past the table average.
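Putting the last two ideas together, here is a minimal sketch of how a difficulty-aware reward and a GRPO-style group-relative advantage could be computed for one group of rollouts. The reward constants and the Rollout fields are illustrative assumptions, not the paper's exact values.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    correct: bool       # did the verifier accept the final answer?
    used_call: bool     # did this rollout request teacher tokens?
    call_ratio: float   # fraction of tokens written by the teacher

def reward(r: Rollout, group: list) -> float:
    """Difficulty-aware reward for one rollout, judged against its group.

    All constants (bonuses, penalties, cost weight) are illustrative.
    """
    solo_solved = any(g.correct and not g.used_call for g in group)
    any_solved = any(g.correct for g in group)
    score = (1.0 if r.correct else 0.0) - 0.5 * r.call_ratio   # accuracy minus call cost
    if solo_solved:                    # student-solvable: prize independence
        if r.correct and not r.used_call:
            score += 0.5
    elif any_solved:                   # teacher-dependent: discourage stubborn no-calls
        if not r.used_call:
            score -= 0.5
    else:                              # teacher-unsolvable: small nudge to explore help
        if r.used_call:
            score += 0.1
    return score

def group_advantages(group: list) -> list:
    """GRPO-style signal: each rollout's reward relative to the group mean."""
    scores = [reward(r, group) for r in group]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]
```

Subtracting the group mean is what makes the update "relative": only rollouts that beat their peers get pushed up, which keeps learning stable even when absolute rewards are noisy.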
Secret sauce:
- The SLM is both solver and controller—no extra controller model at every token.
- Token-precise delegation (<call>n</call>) means “just enough” help.
- Difficulty-aware rewards shape wise help-seeking habits.
- Dynamic length beats fixed length: don’t buy a 500-token hint when 20 will do.
Concrete mini-walkthrough: On a GSM8K-style math word problem, the SLM writes the first few reasoning steps. It notices an uncertain transformation and calls <call>12</call>. The LLM fills in a crisp algebraic step in 12 tokens. The SLM continues, computes the final number, and outputs \boxed{…}. Verified correct and with low call ratio, this sample earns a strong reward and trains the SLM to use future calls sparingly and smartly.
04 Experiments & Results
The test: Six math-heavy reasoning benchmarks (Minerva, MATH-500, GSM8K, Olympiad-Bench, AIME-2024, AIME-2025) plus extra general-reasoning tests (BBEH, MMLU-Pro, SuperGPQA). We measure accuracy and call ratio (what fraction of tokens came from the big model). A judge (GPT-4o-mini) checks correctness when needed.
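For clarity, the two headline metrics can be computed as below; the result fields and the evaluate helper are illustrative assumptions.

```python
def evaluate(results: list) -> tuple:
    """Compute accuracy and call ratio; field names are illustrative.

    Each result is expected to look like:
      {"correct": bool, "teacher_tokens": int, "total_tokens": int}
    """
    accuracy = sum(r["correct"] for r in results) / len(results)
    call_ratio = (sum(r["teacher_tokens"] for r in results) /
                  sum(r["total_tokens"] for r in results))
    return accuracy, call_ratio
```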
The competition: Baselines include the small models before RL (Base), a standard GRPO-tuned small model (GRPO), a token-level router with an external controller (CITER), and conceptual query-level routers (Random/Perfect). RelayLLM competes while keeping the big model’s token usage tiny.
The scoreboard with context:
- Average accuracy climbs from 42.50% (Base, Qwen3-1.7B) to 49.52% with RelayLLM (Difficulty-Aware), which is like raising your test grade from a solid B- to a high B+/A- region across tough subjects.
- Call ratio averages just 1.07%—about 1 out of every 100 tokens—so costs are chopped by roughly 98.2% compared to a performance-matched random router.
- RelayLLM recovers about 60% of the gap between the small and large model (teacher at 54.12%), showing that a few critical tokens can unlock much of the big model’s value (see the quick check below).
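The roughly-60% figure follows directly from the reported averages; a quick check:

```python
student, relay, teacher = 42.50, 49.52, 54.12   # reported average accuracies (%)
gap_recovered = (relay - student) / (teacher - student)
print(f"{gap_recovered:.1%}")   # -> 60.4% of the student-teacher gap
```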
Highlights by datasets:
- Minerva (tough math): With Qwen3-0.6B, RelayLLM rises from 15.81% to 23.53% while calling only 0.77% of tokens—nearly 49% relative improvement.
- GSM8K and MATH-500: Big gains and high pass@1 while keeping calls under ~1%.
- Olympiad and AIME: Accuracy improves meaningfully; avg@32 on AIME reflects robustness across samples.
Surprising findings:
- Teacher-free gains: When we block calls at test time, the RelayLLM student still beats GRPO on easier sets, meaning it absorbed some reasoning patterns during collaboration.
- Dynamic help length wins: Models retrained to always request fixed lengths (e.g., 100 or 500 tokens) can match accuracy only with much higher cost; RelayLLM achieves similar or better results with far fewer tokens.
- Right teacher, right fit: Swapping in a different teacher LLM at inference can reduce accuracy compared to the train-time teacher, hinting that teacher-student alignment matters; nevertheless, even weaker teachers often help compared to no teacher at all.
- Reward design matters: Removing the independence bonus spikes the call ratio (over-reliance on help). Removing the exploration reward hurts accuracy on very hard queries (fear of asking for help).
- Data filtering helps: Training on examples the teacher can sometimes solve keeps calls useful and trims wasted compute.
Bottom line: RelayLLM beats GRPO and CITER on accuracy-vs-cost, delivering a rare combo—higher scores and dramatically lower token bills.
05 Discussion & Limitations
Limitations:
- Teacher dependence and alignment: The approach assumes the teacher can often produce helpful tokens. Mismatch between the train-time and test-time teacher can reduce gains.
- Over-reliance risk: If rewards are tuned poorly, the student may call too often, raising cost.
- Verifier need: RL benefits from reliable correctness checks (easy for math), but tasks without clear verification could be harder to train.
- Tokenizer/format coupling: Smooth collaboration is easiest within the same model family and tokenizer; cross-family setups may need extra care.
- Latency spikes: Each call introduces a server round trip; on high-latency networks, too many tiny calls could slow responses.
Required resources:
- A capable small model and a stronger large model (ideally the same family for smooth tokenization).
- An inference stack that supports pausing and resuming generation (e.g., API + stop sequences).
- RL training compute and a verifier or rule-based checker for rewards.
- Optional: vLLM or a similar server to host the teacher efficiently.
When not to use:
- Trivially easy or very short queries where the SLM is already strong (the extra machinery adds little).
- Tasks without clear correctness signals (reward design becomes guesswork).
- Ultra-low-latency edge cases with unstable connectivity where mid-stream calls are impractical.
- Scenarios where the teacher adds little value (it fails often or is barely stronger than the student).
Open questions:
- Can we learn uncertainty estimates so the SLM predicts when a call will most likely help (calibrated confidence)?
- How do we best handle multiple teachers (math expert, code expert) and pick the right one per call?
- Can we extend beyond math-style verifiers to reliable rewards in open-ended tasks?
- How do we make cross-family teacher swaps robust, reducing distribution shift?
- Can we compress the learned help-seeking policy into a single, stronger SLM over time (distillation)?
06 Conclusion & Future Work
Three-sentence summary: RelayLLM turns the small model into a smart captain that writes most of the answer itself and borrows just a handful of expert tokens at the exact hard moments. A two-stage training plan—supervised warm-up plus GRPO with a difficulty-aware reward—teaches when to ask for help and for how long. This lifts accuracy substantially while keeping the big model’s token share near 1%, slashing cost by about 98% versus naive routing.
Main achievement: Showing that token-level, student-led collaboration with precise <call>n</call> commands can recover much of a large model’s reasoning power at a tiny fraction of the token cost, without adding an extra controller network.
Future directions: Better confidence calibration for when to call, support for multiple specialist teachers, robust rewards beyond math, distilling the help-seeking skill back into the SLM, and making cross-teacher swaps less sensitive. Why remember this: RelayLLM reframes help from “send the whole problem away” to “ask for a few perfect words at the perfect time,” unlocking big-brain results with pocket-change compute.
Practical Applications
- On-device math tutoring that calls the cloud expert only for rare, tough steps to save data and battery.
- Customer support chatbots that solve most tickets locally and ask the big model for just a few clarifying tokens.
- Code assistants that request short expert snippets for tricky refactors while keeping most edits local.
- Educational apps that teach problem-solving by modeling when to try solo and when to ask for help.
- Edge IoT devices that perform reasoning with minimal connectivity, borrowing tiny expert bursts when available.
- Healthcare triage bots that consult a larger model briefly for ambiguous cases, reducing latency and cost.
- Financial analysis tools that use expert tokens for complex risk steps while running standard checks on-device.
- Game NPCs that think fast on their own and request short expert strategies for boss-level decisions.
- Accessibility tools (e.g., math explanation readers) that stay responsive by minimizing expert calls.
- Document assistants that draft summaries locally and request just-enough expert tokens for nuanced legal or technical passages.