
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Intermediate
Yao Tang, Li Dong, Yaru Hao et al. · 1/13/2026
arXiv · PDF

Key Summary

  ‱ The paper introduces Multiplex Thinking, a new way for AI to reason by sampling several likely next tokens at once and blending them into a single continuous "multiplex" token.
  ‱ It keeps the good parts of normal token-by-token sampling (stochastic exploration) while packing more ideas into fewer tokens (continuous representation).
  ‱ When the model is sure, the multiplex token behaves like a normal token; when unsure, it compactly carries several possibilities without making the sequence longer.
  ‱ Because each multiplex token is built from independently sampled tokens, its probability is easy to compute, so it works naturally with on-policy reinforcement learning.
  ‱ Across six tough math benchmarks, Multiplex Thinking beats strong Chain-of-Thought and RL baselines at Pass@1 and keeps scaling better up to Pass@1024.
  ‱ It also writes shorter solutions on average, showing better token efficiency than standard methods.
  ‱ Using width K ≄ 2 gives a big jump in accuracy, with diminishing gains for larger K, pointing to a sweet spot around K = 3 for efficiency.
  ‱ A training-free version already helps at inference, and both weighted and simple averaging work similarly well, showing robustness.
  ‱ Entropy analysis shows Multiplex Thinking maintains exploration longer during RL, helping it avoid getting stuck on one wrong path.
  ‱ Overall, this approach offers a practical path to smarter, faster, and cheaper reasoning in large language models.

Why This Research Matters

Multiplex Thinking helps AI explore several good ideas at once without writing long, slow explanations, making it both smarter and faster. This means better homework help, code assistance, and planning with less waiting and lower compute bills. Because it keeps true randomness while staying compact, it learns more effectively with RL—crucial for real-world systems that must improve over time. Its self-adaptive behavior uses detailed exploration only when needed, so it saves resources on easy steps and spends effort on the hard ones. Shorter responses are especially helpful for on-device models and bandwidth-limited settings. Finally, its clean probability model and robust results make it easier to integrate into current training pipelines and to combine with existing search strategies.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you're solving a big puzzle with lots of pieces. If you try only one way to place the next piece, you might get stuck. But if you keep a few good placements in mind at once, you can move faster and make fewer mistakes.

đŸ„Ź The Concept (Chain-of-Thought, CoT):

  • What it is: Chain-of-Thought is when an AI writes out its steps, one token at a time, to solve hard problems.
  • How it works: (1) Read the question; (2) Produce a first reasoning token; (3) Use that to produce the next; (4) Repeat until an end-of-thinking token; (5) Give the final answer.
  • Why it matters: Without CoT, models often jump to answers and fail on multi-step logic. CoT slows down to explain, which boosts accuracy.

🍞 Anchor: When asked a math word problem, CoT makes the model show steps like “First find distance; then compute time,” which often leads to the right answer.
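
To make this concrete, here is a tiny, illustrative prompt pair (my own example, not from the paper): the second prompt invites the model to write out its reasoning before answering, which is the essence of CoT prompting.

```python
question = "A train travels 120 km at 60 km/h. How long does the trip take?"

# Direct prompting: the model is nudged to answer immediately.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-Thought prompting: the model is nudged to reason step by step first.
cot_prompt = f"{question}\nLet's think step by step, then state the final answer.\n"

print(cot_prompt)
```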

🍞 Hook: You know how reading every single word in a long book out loud takes forever? That’s like AI doing long chains of tiny steps.

đŸ„Ź The Concept (The Problem):

  • What it is: Traditional CoT can be slow and expensive because it produces long, low-bandwidth sequences of discrete tokens.
  • How it works: Each attempt is one full path (depth-first), so trying alternatives means generating many long traces, which costs compute and time.
  • Why it matters: If exploring different ideas takes too long, the AI can’t efficiently search for better solutions or learn from trial and error during RL.

🍞 Anchor: It’s like trying one maze path all the way to a dead end, then walking back and trying another—over and over—very time-consuming.

🍞 Hook: Imagine mixing a smoothie: you can taste bananas, strawberries, and mango at once without drinking three separate cups.

đŸ„Ź The Concept (Continuous “Soft” Tokens):

  • What it is: Some methods blend many token possibilities into one continuous vector (a soft token) so the model can represent multiple ideas at once.
  • How it works: (1) Look at the next-token probabilities; (2) Make a weighted average of token embeddings; (3) Feed that single vector forward.
  • Why it matters: It shortens sequences and packs more info per step. But if it’s deterministic, every run looks the same—bad for exploration and RL.

🍞 Anchor: A soft token is like a smoothie flavor mix. It’s compact, but if you always blend with the same recipe, you never try new tastes.
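
As a rough illustration of the deterministic soft-token recipe above, here is a minimal NumPy sketch; the vocabulary, embeddings, and probabilities are toy values, not the paper's.

```python
import numpy as np

# Toy vocabulary of 4 tokens with 3-dimensional embeddings (made-up values).
embedding_matrix = np.array([
    [0.1, 0.9, 0.0],   # "add"
    [0.8, 0.1, 0.2],   # "subtract"
    [0.3, 0.3, 0.9],   # "therefore"
    [0.0, 0.0, 1.0],   # "[eot]"
])

# Next-token distribution predicted by the model at this step (toy values).
probs = np.array([0.40, 0.25, 0.30, 0.05])

# A deterministic "soft token": the probability-weighted average of all
# vocabulary embeddings. Compact, but identical on every run.
soft_token = probs @ embedding_matrix
print(soft_token)
```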

🍞 Hook: Think of practicing free throws. If the outcome is always the same in practice, you never learn to adjust your aim from misses.

đŸ„Ź The Concept (Reinforcement Learning, RL):

  • What it is: RL teaches models through rewards for good outcomes and fewer rewards (or none) for bad ones.
  • How it works: (1) Try; (2) Get a score; (3) Nudge the model to make good tries more likely next time.
  • Why it matters: RL needs on-policy stochastic trials—real, varied attempts—to learn effectively. Deterministic soft tokens don’t provide that variety.

🍞 Anchor: Like training a dog: it tries a trick, gets a treat when right; with practice, it learns what to do.

🍞 Hook: Opening a mystery box can be very surprising if you truly don’t know what’s inside.

đŸ„Ź The Concept (Entropy):

  • What it is: Entropy measures uncertainty—how spread out the options are.
  • How it works: High entropy means many plausible choices; low entropy means one or two strong favorites.
  • Why it matters: In hard reasoning steps, higher entropy signals decision points where exploration pays off.

🍞 Anchor: If a menu has one obvious favorite, choosing is easy (low entropy). If every dish looks great, you need to explore more (high entropy).
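
A quick way to see this numerically, as a minimal sketch (natural-log entropy of two toy distributions):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in nats; ignores zero-probability entries."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Low entropy: one obvious favorite.
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17

# High entropy: every option looks plausible.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39 (= ln 4)
```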

The World Before: LLMs used CoT to reason better but at the cost of long sequences. Exploring alternate reasoning paths meant producing many full traces—like depth-first search—making training and evaluation expensive. Continuous tokens helped compress thoughts but were usually deterministic, so runs looked the same and didn’t align well with RL’s need for stochastic exploration.

The Gap: We needed a way to keep the model’s ability to explore (stochastic sampling), but to represent multiple options compactly (continuous vectors), and to do so with a well-defined probability that RL can optimize directly.

Real Stakes: Faster, smarter reasoning helps homework helpers, coding assistants, and on-device models that have small token budgets. It saves energy and money, reduces latency, and lets models search more effectively on tough problems like math Olympiad questions or complex planning tasks.

02 Core Idea

🍞 Hook: Imagine you’re deciding what to do after school. Instead of picking just one plan now, you keep three good ideas in your pocket and carry them forward together.

đŸ„Ź The Concept (Multiplex Thinking):

  • What it is: A reasoning method where, at each step, the model samples K likely tokens and merges them into one continuous “multiplex token.”
  • How it works: (1) Compute next-token probabilities; (2) Independently sample K tokens; (3) Turn them into embeddings; (4) Aggregate into a single vector; (5) Use that as the next thinking token; (6) When the end-of-thinking token appears, switch to regular decoding for the answer.
  • Why it matters: It keeps stochastic exploration (sampling) yet compresses multiple options into one step, making reasoning shorter and friendlier to RL.

🍞 Anchor: It’s like carrying a mini-backpack with several small tools instead of a giant toolbox for each try—you’re prepared without weighing yourself down.

The “Aha!” in one sentence: Sample several plausible next tokens and merge their embeddings into one continuous token so the model explores more ideas without making the sequence longer—and with probabilities that RL can directly optimize.

Three Analogies:

  1. Voting Council: Several advisors (K samples) each suggest a word; their suggestions are combined into one council decision (multiplex token) that still reflects who voted for what.
  2. Trail Map Overlay: Instead of walking one path to the end, you sketch a transparent map overlay of a few promising routes and carry that single map to the next decision point.
  3. Smoothie with Sprinkles: You blend a base flavor (continuous mix) but keep track of which sprinkles you added (stochastic samples), so you can still learn which ingredients made it tasty.

Before vs After:

  • Before: Discrete CoT did one path per token and needed many long traces to explore. Deterministic soft tokens compressed steps but removed randomness, hurting RL’s learning.
  • After: Multiplex Thinking preserves randomness by sampling K tokens and merges them into a compact vector, enabling breadth-wise exploration with shorter sequences and clean probabilities for RL.

Why It Works (intuition, no equations):

  • Entropy Boost: Sampling K times expands the exploration space. High-uncertainty steps gather richer information in a single token.
  • Self-Adaptive: If the model is confident, the K samples tend to agree, and the multiplex token behaves like a normal token. If unsure, the token carries multiple possibilities forward, delaying premature commitment.
  • Probability That Plays Nice: Because each of the K samples is drawn independently from the same distribution, the total probability of a multiplex step is just the product of the sampled token probabilities. This makes the entire trajectory’s probability easy to compute, so RL can learn on-policy directly.
  • Embedding Prior: By averaging actual vocabulary embeddings (optionally reweighted by the model’s own probabilities), the multiplex token stays aligned with what the model already understands about words.
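
In symbols, the factorization behind "Probability That Plays Nice" looks like this (the notation is chosen here for illustration; the paper's exact symbols may differ). If at step i the model draws K tokens t_{i,1}, ..., t_{i,K} independently from its next-token distribution given the question x and the earlier multiplex tokens c_{<i}:

```latex
% Probability of one multiplex step = product of the K independent draws
p_\theta(c_i \mid x, c_{<i}) \;=\; \prod_{j=1}^{K} p_\theta\!\left(t_{i,j} \mid x, c_{<i}\right)

% Log-probability of a T-step thinking trace = sum over steps and samples
\log p_\theta(c_{1:T} \mid x) \;=\; \sum_{i=1}^{T} \sum_{j=1}^{K} \log p_\theta\!\left(t_{i,j} \mid x, c_{<i}\right)
```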

Building Blocks:

  • K Independent Samples: Draw K token IDs at each step from the model’s distribution.
  • Aggregation: Convert those IDs to embeddings and combine them—either simple averaging or probability-weighted averaging over the sampled set.
  • Multiplex Token: A single continuous vector that represents a small “bundle” of plausible next steps.
  ‱ Factorized Probability: The log-probability of the multiplex trajectory is just the sum of the log-probabilities of all the sampled discrete tokens (equivalently, its probability is the product of their probabilities).
  • Switch to Answer: When the special end-of-thinking token is sampled, the model moves to standard discrete decoding for the final answer.
  • RL-Friendly: With on-policy sampling and tractable probabilities, we can directly maximize rewards over multiplex trajectories.

🍞 Anchor: Suppose the model must choose the next word from {‘add’, ‘simplify’, ‘because’, ‘[eot]’}. With K=3, it might sample {‘add’, ‘simplify’, ‘simplify’}. The multiplex token then encodes both ideas—with ‘simplify’ slightly stronger—and carries them as one vector into the next step, helping the model weigh both options while keeping the sequence short.
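
Below is a minimal NumPy sketch of a single multiplex step as just described. The vocabulary, embeddings, probabilities, and variable names are toy choices of mine for illustration; the paper's actual implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["add", "simplify", "because", "[eot]"]
embedding_matrix = rng.normal(size=(len(vocab), 8))   # toy 8-dim embeddings
probs = np.array([0.35, 0.45, 0.15, 0.05])            # toy next-token distribution

K = 3
# (1) Independently sample K token ids from the model's distribution.
sampled_ids = rng.choice(len(vocab), size=K, p=probs)

# (2) Sparse sample vector s_i: a histogram over just the sampled tokens.
s_i = np.zeros(len(vocab))
for tok in sampled_ids:
    s_i[tok] += 1.0 / K

# (3a) Uniform averaging: mean of the sampled embeddings.
c_uniform = s_i @ embedding_matrix

# (3b) Probability-weighted averaging over the sampled set only.
mask = s_i > 0
w = np.where(mask, probs, 0.0)
w = w / w.sum()
c_weighted = w @ embedding_matrix

# (4) Log-probability contribution of this multiplex step: sum over the K samples.
step_logprob = np.log(probs[sampled_ids]).sum()

print([vocab[t] for t in sampled_ids], round(float(step_logprob), 3))
```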

03 Methodology

High-Level Recipe: Input → Compute next-token distribution → Sample K tokens → Aggregate embeddings into one multiplex token → Repeat until end-of-thinking → Switch to normal decoding for the final answer → (Optional) Train with RL using the multiplex rollout probability.

Step-by-Step (what, why, example):

  1. Start the thinking phase
  • What happens: Given a question and a special begin-of-thinking token, the model computes next-token probabilities for the first reasoning step.
  • Why it exists: We need a probability over the whole vocabulary to decide which candidate tokens to sample.
  • Example: Vocabulary V = {add, subtract, therefore, [eot]}; probabilities = {0.40, 0.25, 0.30, 0.05}.
  2. Independently sample K tokens
  • What happens: Draw K tokens from the distribution. Each draw is independent and reflects the model’s beliefs.
  • Why it exists: Stochastic sampling preserves exploration. Multiple samples widen coverage of plausible next moves.
  • Example (K=3): samples = [add, therefore, add].
  3. Build a sparse sample vector s_i
  • What happens: For the current step i, turn each sampled token into a one-hot vector and average them. This creates a small histogram over just the sampled set.
  • Why it exists: s_i compactly records which tokens appeared and how often, while staying tied to real vocabulary items.
  • Example: s_i(add)=2/3, s_i(therefore)=1/3, others 0.
  4. Aggregate into a continuous multiplex token c_i (see the code sketch after this list)
  • What happens: Map s_i into embedding space using the vocabulary embedding matrix, then optionally reweight by the model’s probabilities restricted to the sampled tokens.
  • Why it exists: Embedding aggregation encodes multiple choices into a single vector that the transformer can process at the next step.
  • Example:
    • Uniform averaging: c_i = average(embedding(add), embedding(add), embedding(therefore)).
    • Weighted averaging: among {add, therefore}, weight by normalized probs (e.g., add 0.57, therefore 0.43) to reflect model confidence.
  5. Feed c_i forward to get the next distribution
  • What happens: Treat c_i like the next “thinking token” the transformer sees, then compute the next-token distribution again.
  • Why it exists: This allows the model to carry its multi-option thought bundle forward without lengthening the sequence.
  • Example: After c_i, the model might shift probabilities toward steps that follow naturally from both ‘add’ and ‘therefore’.
  6. Self-adaptive behavior
  • What happens: If the distribution is sharp (low entropy), K samples often agree, making c_i nearly identical to a discrete token. If it’s uncertain (high entropy), samples spread out, encoding diverse paths.
  • Why it exists: This automatically uses more “parallel thinking” when needed and simplifies when obvious.
  • Example: On an easy arithmetic step, samples = [add, add, add]. On a tricky algebra decision, samples = [factor, substitute, expand].
  7. Stopping and switching to the final answer
  • What happens: When the model samples the special end-of-thinking token [eot], it transitions from thinking to normal discrete decoding for the final answer.
  • Why it exists: This cleanly ends the compact reasoning phase and lets the model present the solution in plain text.
  • Example: After 12 multiplex steps, one of the K samples is [eot]; the model then outputs “Therefore, x=4.”
  8. Probability of a multiplex trajectory (for RL)
  • What happens: Because each of the K samples is drawn independently, the log-probability of the whole thinking trace is the sum of the log-probabilities of all sampled tokens across all steps.
  • Why it exists: A clean probability lets us apply on-policy RL to multiplex rollouts, reinforcing better reasoning paths.
  • Example: If at step i the samples were [add, therefore, add], you add log p(add) + log p(therefore) + log p(add) to the trajectory log-prob.
  9. Reinforcement Learning objective
  • What happens: Use a verifiable reward (e.g., math answer correctness) to weight the joint log-prob of the multiplex trace and the final answer, then update the model.
  • Why it exists: RL nudges the model to produce multiplex traces that more often lead to correct answers.
  • Example: Correct answer → reward 1; wrong → 0. The gradient pushes up probabilities on good multiplex steps.
  10. Secret sauce: branch-and-merge per token
  • What’s clever: Instead of exploring many long, separate chains (depth-first), each step branches (sample K) and then merges (aggregate embeddings) into one compact vector—like mini breadth-first exploration at every token.
  • Why it matters: You keep the exploration benefits of multiple candidates without paying the full sequence-length cost.
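
Putting the steps above together, here is a self-contained sketch of a multiplex rollout loop. It is not the authors' code: the transformer forward pass is replaced by a random stub so the example runs on its own, and every name and number is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

VOCAB = ["add", "simplify", "therefore", "[eot]"]
EOT = VOCAB.index("[eot]")
EMB = rng.normal(size=(len(VOCAB), 8))   # toy embedding table

def next_distribution(context_vectors):
    """Stand-in for a transformer forward pass: returns a next-token
    distribution given the thinking vectors so far. Toy logic only."""
    logits = rng.normal(size=len(VOCAB))
    logits[EOT] += 0.2 * len(context_vectors)   # make stopping more likely over time
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def multiplex_rollout(K=3, max_steps=20):
    context, logprob = [], 0.0
    for _ in range(max_steps):
        probs = next_distribution(context)
        samples = rng.choice(len(VOCAB), size=K, p=probs)   # branch: K independent draws
        logprob += np.log(probs[samples]).sum()             # factorized log-probability
        c_i = EMB[samples].mean(axis=0)                      # merge: one multiplex vector
        context.append(c_i)
        if EOT in samples:   # end-of-thinking sampled: switch to discrete answer decoding
            break
    return context, logprob

trace, lp = multiplex_rollout()
print(f"{len(trace)} multiplex steps, trajectory log-prob {lp:.2f}")
```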

Concrete Mini-Walkthrough:

  • Prompt: “Solve: If 2x + 3 = 11, find x.”
  • Step 1 distribution favors ‘subtract’, ‘2’, ‘3’. With K=3: [subtract, subtract, 3]. c_1 encodes that ‘subtract’ is dominant but ‘3’ is relevant.
  • Step 2 now leans toward ‘3’ (as in “subtract 3 from both sides”). Samples: [3, 3, 3]. c_2 collapses to a discrete-like token.
  • Step 3 favors ‘both sides’. Samples: [both-sides, both-sides, simplify]. c_3 keeps both options.
  • A few steps later, [eot] appears among samples, so the model switches to normal decoding and outputs “x = 4.”

What breaks without each step:

  • No independent K sampling → no stochastic exploration, RL loses signal.
  • No embedding aggregation → sequence length balloons, losing efficiency.
  • No [eot] switch → thinking doesn’t end cleanly; answers get delayed or messy.
  • No factorized probability → cannot do principled on-policy RL over multiplex traces.

Secret Sauce Summary:

  • Stochastic plus continuous: You get exploration and compression together.
  • Self-adaptive: Collapses to discrete when confident; superposes ideas when uncertain.
  • RL-ready: Exact rollout probabilities let you learn from rewards cleanly.
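
As a sketch of the "RL-ready" point, the snippet below computes a GRPO-style, group-normalized policy-gradient surrogate from rollout log-probabilities and verifiable rewards. It is a simplification (no clipping, KL penalty, or autograd) and not the authors' training code.

```python
import numpy as np

def grpo_style_loss(logprobs, rewards, eps=1e-6):
    """Sketch of a group-relative policy-gradient loss for multiplex rollouts.

    logprobs: per-rollout log-probabilities (sum of sampled-token log-probs,
              thinking trace plus final answer), as produced by the policy.
    rewards:  verifiable rewards for the same rollouts (e.g., 1 if the final
              answer is correct, else 0).
    """
    logprobs = np.asarray(logprobs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Group-normalized advantages: each rollout's reward relative to the other
    # rollouts sampled for the same question.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Policy-gradient surrogate: push up log-probs of above-average rollouts.
    return float(-(adv * logprobs).mean())

# 4 rollouts for one question: two correct, two wrong (toy numbers).
print(grpo_style_loss(logprobs=[-35.2, -40.1, -33.8, -38.5],
                      rewards=[1, 0, 1, 0]))
```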

04 Experiments & Results

The Test: The authors measure Pass@1 (single try) and Pass@k (best of k tries up to 1024) on six challenging math benchmarks: AIME 2024, AIME 2025, AMC 2023, MATH-500, Minerva Math, and OlympiadBench. They also track sequence length and entropy during training to understand exploration and efficiency.
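
For reference, Pass@k is usually reported with the standard unbiased estimator (introduced with HumanEval): given n sampled solutions per problem of which c are correct, it estimates the chance that at least one of k draws is correct. A small sketch follows; the paper's exact evaluation script may differ.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1024 generations for one problem, 40 of them correct.
print(round(pass_at_k(1024, 40, 1), 4))    # Pass@1 = c/n ~ 0.039
print(round(pass_at_k(1024, 40, 64), 4))   # Pass@64
```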

The Competition:

  • Discrete CoT: Standard chain-of-thought decoding.
  • Discrete RL: Same backbones trained with GRPO but using only discrete tokens.
  • Stochastic Soft Thinking: A training-free continuous baseline that adds Gumbel noise to make soft tokens stochastic.

Scoreboard with Context (7B backbone examples):

  • AIME 2024: Multiplex 20.6% vs Discrete RL 17.2% vs Stochastic Soft Thinking 20.3% (Multiplex wins narrowly, like edging first in a close race).
  • AMC 2023: Multiplex 50.7% vs Discrete RL 44.7% (a solid jump, like moving from a B to a strong B+/A-).
  • MATH-500: Multiplex 78.0% vs Discrete RL 74.1% vs Stochastic Soft Thinking 76.5% (Multiplex leads, nearing the ceiling on an easier set).
  • OlympiadBench: Multiplex 41.7% vs Discrete RL 38.0% vs Stochastic Soft Thinking 40.6% (Multiplex on top).
  • Across both 1.5B and 7B models and six datasets, Multiplex Thinking achieves the best Pass@1 in 11 of 12 settings.

Test-Time Scaling (Pass@k up to 1024):

  • On hard tasks like AIME 2025 (7B), Discrete RL plateaus around ~40% as k grows, while Multiplex Thinking keeps climbing to ~55% at k=1024. That’s like everyone else getting stuck, but Multiplex continues discovering new correct paths with more samples.
  • On easier sets (e.g., MATH-500), all methods saturate quickly; gains are smaller because there’s less headroom.

Sampling Efficiency and Sequence Length:

  • Multiplex Thinking tends to need fewer samples to reach a given accuracy target and writes shorter solutions. In one test, an inference-only version with a 4k token limit matched or beat Discrete CoT at 5k tokens—about 20–25% fewer tokens for similar or better results.
  • Training dynamics show that as K increases, response length drops while accuracy rises, consistent with each multiplex token carrying more information.

K Width Ablation:

  • Going from K=1 (Discrete RL) to K≄2 provides a big accuracy jump across datasets.
  • Gains from K=2 to K=3 or K=6 are smaller, suggesting a sweet spot around K=3 for efficiency.

Entropy and Exploration:

  • Entropy reduction during RL is smaller for larger K, meaning Multiplex maintains exploration longer and avoids collapsing too early to a single path. This matches the higher Pass@k upper bounds.

Surprising/Robust Findings:

  • Training-free Multiplex (inference only) already helps over Discrete CoT and is competitive with Stochastic Soft Thinking; adding RL boosts it further.
  • Two aggregation strategies—simple averaging vs probability-weighted over the sampled set—perform similarly, implying the core benefit comes from branching-and-merging itself, not the exact weights.

Takeaway: Multiplex Thinking both improves top-line accuracy and reduces token usage. It explores better at high k, learns better with RL, and stays robust across design choices.

05 Discussion & Limitations

Limitations:

  • Independence Assumption: The K samples are drawn independently at each step, which may miss subtle dependencies among alternatives.
  • Blending Blur: Aggregating multiple candidates into one vector can blur distinct options; the model must learn to disentangle them later.
  • Compute Overhead: Sampling K tokens per step adds some overhead (though it avoids extra forward passes for separate paths). Very large K yields diminishing returns.
  • Tuning Sensitivity: Choosing K, temperature, and top-p affects exploration; poor settings can under- or over-explore.
  • Scope: The method was tested on math reasoning; behavior on other domains (code generation, long-form writing) needs broader study.

Required Resources:

  • LLM backbones (e.g., 1.5B–7B), GPUs, and verifiable-reward datasets (~40k examples used here).
  • A runtime that supports multiplex sampling and embedding aggregation per step.
  • RL infrastructure (e.g., GRPO) to optimize on-policy multiplex trajectories.

When NOT to Use:

  • Ultra-low-latency, single-shot tasks where even small sampling overhead is unacceptable.
  • Tasks demanding fully deterministic outputs (e.g., exact phrasing) where exploration isn’t helpful.
  • Simple problems with low entropy where CoT already solves near 100%—multiplex adds little.
  • Pipelines where downstream components cannot handle compacted reasoning representations.

Open Questions:

  • Adaptive K: Can the model learn to choose K per step based on uncertainty to save compute?
  • Smarter Aggregation: Nonlinear or attention-based merges over sampled embeddings—do they help?
  • Better Credit Assignment: How to attribute reward to specific sampled candidates inside a multiplex token?
  • Hybrid Search: How to blend multiplex tokens with tree search (Tree-of-Thought) for even stronger exploration?
  • Multimodal Extension: Can images, code tokens, and text be multiplexed together effectively?
  • Stopping Policies: Beyond [eot], can learned stopping improve stability and reduce reward hacking even further?

06 Conclusion & Future Work

Three-Sentence Summary: Multiplex Thinking samples several plausible tokens at each step and merges them into a single continuous token, preserving exploration while shortening sequences. Because its probabilities factorize cleanly over the sampled tokens, it fits naturally with on-policy reinforcement learning. Across challenging math benchmarks, it outperforms discrete CoT and RL baselines from Pass@1 up to Pass@1024 while writing shorter, denser reasoning traces.

Main Achievement: Bridging discrete stochasticity and continuous compactness in a way that keeps vocabulary alignment, enables tractable RL over entire thinking rollouts, and boosts both accuracy and token efficiency.

Future Directions: Add adaptive K per step, experiment with smarter aggregation functions, combine with parallel search strategies like self-consistency or Tree-of-Thought, extend to multimodal and code reasoning, and co-design inference kernels for multiplex operations. Studying richer reward signals and better credit assignment inside multiplex tokens could further enhance learning.

Why Remember This: It shows that we don’t have to choose between exploring many ideas (stochastic) and keeping thoughts short (continuous). With token-wise branch-and-merge, models can think broadly and efficiently—unlocking smarter, faster, and cheaper reasoning at scale.

Practical Applications

  ‱ Math tutoring systems that solve problems with fewer tokens while maintaining high accuracy.
  ‱ Coding assistants that consider multiple code completions per step but keep responses short.
  ‱ On-device AI (phones, wearables) where token budgets and energy are limited.
  ‱ Automated theorem provers or solvers that need strong exploration without exploding sequence length.
  ‱ Customer support bots that reason about multi-step policies efficiently and respond faster.
  ‱ Educational tools that show concise reasoning steps while exploring alternative explanations under the hood.
  ‱ Planning assistants (travel, schedules) that weigh multiple next actions compactly before deciding.
  ‱ Scientific assistants that explore different hypothesis updates per step without long write-ups.
  ‱ Verification-based training pipelines (RLVR) that benefit from on-policy stochastic rollouts with tractable probabilities.
  ‱ Search frameworks (Self-Consistency, Best-of-N) enhanced by multiplex tokens to improve sample efficiency.
#Multiplex Thinking #chain-of-thought #continuous token #stochastic soft reasoning #reinforcement learning #on-policy #entropy #embedding aggregation #branch-and-merge #Pass@k #token efficiency #GRPO #RLVR #test-time compute #DeepSeek-R1