ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Key Summary
- ConceptMoE teaches a language model to group easy, similar tokens into bigger ideas called concepts, so it spends more brainpower on the hard parts.
- A small 'chunk' module looks at neighboring tokens and decides where to place boundaries, creating variable-length concepts that adapt to the meaning of the text (and of images in multimodal settings).
- The model then runs its heavy computation on the shorter list of concepts, and later shares each concept back to all its tokens using a 'dechunk' step and joint decoding.
- To keep comparisons fair, saved compute is reallocated so ConceptMoE and the baseline MoE use the same total FLOPs and parameters (excluding attention-map compute), revealing real architectural gains.
- Across tests, ConceptMoE beats standard MoE: +0.9 points on language pretraining, +2.3 on long context, and +0.6 on multimodal; in continual training with layer looping it gains +5.5 points.
- Even when compute is matched, attention-map work and the KV cache shrink by about R×, speeding up prefill by up to 175% and decoding by up to 117% at R = 2 on long inputs.
- Ablations show adaptive chunking outperforms fixed merging, cosine-similarity routing generalizes better than a simple linear router, and joint decoding improves downstream results.
- Compression that’s too aggressive (like R = 4) hurts reasoning, while moderate compression (R ≈ 1.5–2) balances speed and accuracy.
- The design is minimally invasive: add a chunk/dechunk module and a few extra QKV projections in the last decoder layers, making it practical both for new training and for converting existing MoEs.
Why This Research Matters
ConceptMoE makes language models act more like people: they focus on big ideas first, then use those ideas to handle details. This means better answers when reading long documents, writing code, or reasoning through multi-step problems. By compressing easy stretches and focusing compute on hard parts, it shrinks the biggest bottlenecks (attention compute and the KV cache) without needing more hardware. Because it works inside standard MoE systems and keeps total compute and parameters fairly matched, improvements are due to smarter processing, not bigger models. It also converts existing MoEs with minimal changes, so organizations can get faster, better models without starting over. In short, it makes large models both sharper and snappier in real-world use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re reading a book aloud to a friend. Easy parts like “the, and, of” fly by quickly, but when you reach a twisty mystery clue, you slow down and think harder. You naturally spend more effort where it counts.
🥬 The Concept: Large language models (LLMs) don’t do that. They usually spend the same amount of compute on every token, even though some tokens are obvious and others are tricky.
- What it is: The paper asks, “Can a model spend less effort on easy stretches and more on tough spots, automatically?”
- How it works (idea-level):
- Notice where tokens are similar and predictable.
- Merge those into bigger ideas, called concepts.
- Use heavy compute on the shorter list of concepts.
- Share the rich concept info back to all the original tokens for decoding.
- Why it matters: Without this, models waste power on routine words and may not have enough focus left for dense, reasoning-heavy parts.
🍞 Anchor: Think of summarizing a paragraph into a few bullet points for yourself, then using those bullets to quickly remember each sentence. That’s concept-first, then token-level recall.
🍞 Hook: You know how we break words into pieces (like “play” + “ing”) so computers can process language more easily?
🥬 The Concept (Tokenization): Tokenization turns text into small units (tokens) the model understands.
- What it is: A way to split text into pieces (tokens) that a model can handle.
- How it works:
- A tokenizer maps text to token IDs.
- The model embeds these IDs into vectors.
- Layers transform these vectors to make predictions.
- Why it matters: Token size affects how many steps the model must process. More tokens mean more compute.
🍞 Anchor: “Unbelievable” might be split into “un”, “believe”, “able” so the model reuses what it knows.
🍞 Hook: Imagine a school where different teachers are experts—math, music, art—and a smart principal sends each question to the right teachers.
🥬 The Concept (MoE Architecture): Mixture of Experts (MoE) uses many specialist sub-networks and routes each token to a few of them.
- What it is: A model made of many experts where a router picks which experts to use per token.
- How it works:
- A router scores experts for a token.
- The top few experts process that token.
- Their outputs are combined.
- Why it matters: You get big-model power without running every expert for every token.
🍞 Anchor: A tricky geometry problem goes to the math teacher; a poem goes to the literature teacher.
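To make the routing idea concrete, here is a minimal PyTorch sketch of top-k expert routing. The expert count, layer sizes, and tiny MLP experts are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sizes, not the paper's)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.router(x)                                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top few experts
        weights = weights.softmax(dim=-1)                      # normalize their mixing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # combine the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(16, 64)).shape)                    # torch.Size([16, 64])
```

Real MoE layers dispatch tokens to experts in parallel rather than with Python loops, but the principle is the same: only a few experts run per token, so total parameters can be large while per-token compute stays modest.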
🍞 Hook: When you read, you group words into ideas—like “New York City” is one idea, not three separate words.
🥬 The Concept (ConceptMoE): ConceptMoE upgrades MoE to think in concepts, not just tokens.
- What it is: A system that adaptively merges nearby, similar tokens into concept units, runs heavy compute on them, then shares the results back.
- How it works:
- A small chunk module decides boundaries between concepts by measuring similarity of neighboring tokens.
- The concept model (the heavy part) processes the shorter sequence of concepts.
- A dechunk module maps concept info back to every token, and the decoder jointly uses token+concept signals.
- Why it matters: It allocates compute where meaning is dense and saves effort where text is routine.
🍞 Anchor: It’s like reading a sentence by first forming a few key ideas, deeply understanding them, then explaining each word using those ideas.
🍞 Hook: Sometimes you can skim “the cat sat on the mat,” but you slow down for a tough riddle.
🥬 The Concept (Adaptive Token-to-Concept Compression): Merge easy, similar tokens together so the model can focus on the hard stuff.
- What it is: A learnable way to compress sequences by turning several tokens into one concept when they carry similar meaning.
- How it works:
- Measure how much a token changes meaning compared to the previous one.
- If meaning barely changes, keep merging; if it shifts, start a new concept.
- Aim for a target compression ratio R (e.g., 2 means half as many concepts as tokens).
- Why it matters: Shorter concept sequences mean less heavy compute, which can be reallocated to improve quality or saved for speed.
🍞 Anchor: Turning 1,000 tokens into 500 concepts (R = 2) is like packing clothes into fewer vacuum bags so your suitcase is lighter.
🍞 Hook: Think of using a highlighter to mark where new ideas begin in a paragraph.
🥬 The Concept (Chunk Module): A small learnable tool that decides where to split tokens into concepts.
- What it is: A boundary detector that compares neighboring tokens via cosine similarity to decide if a new concept should start.
- How it works:
- Project each token into query and key vectors.
- Compute cosine similarity with the previous token.
- Low similarity → start a new concept; high similarity → keep merging.
- Use an auxiliary loss so, on average, you hit the target compression ratio R.
- Why it matters: Without smart boundaries, you’d either merge too much (lose detail) or too little (lose efficiency).
🍞 Anchor: Reading “New York City mayor” as one big idea versus splitting after every word.
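Here is a minimal sketch of that boundary rule. The (1 − similarity)/2 mapping to a boundary probability and the 0.5 threshold are illustrative assumptions; in the paper, the auxiliary loss described below is what keeps the overall boundary rate near the target R.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkBoundary(nn.Module):
    """Sketch: propose concept boundaries from neighbor cosine similarity (illustrative)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)   # token -> query
        self.k_proj = nn.Linear(d_model, d_model)   # token -> key

    def forward(self, h):                           # h: (seq_len, d_model) token features
        q, k = self.q_proj(h), self.k_proj(h)
        # cosine similarity between each token's query and the PREVIOUS token's key
        sim = F.cosine_similarity(q[1:], k[:-1], dim=-1)      # (seq_len - 1,)
        p_boundary = 0.5 * (1.0 - sim)                        # low similarity -> probability near 1
        p_boundary = torch.cat([torch.ones(1), p_boundary])   # the first token always starts a concept
        return p_boundary, p_boundary > 0.5                   # probabilities + hard boundary decisions

p, boundaries = ChunkBoundary()(torch.randn(12, 64))
print(boundaries.int().tolist())                              # 1 marks the start of a new concept
```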
🍞 Hook: If you summarize a chapter into a few big points, you still need to explain each sentence later.
🥬 The Concept (Dechunk + Joint Decoding): After processing concepts, share each concept back to all its tokens and decode with both token and concept info.
- What it is: Dechunk maps concept features to each token; joint decoding mixes token and concept signals inside attention.
- How it works:
- Smooth neighboring concepts with EMA so boundaries stabilize.
- Map each token to its concept and add concept features to token features.
- In the last decoder layers, queries/keys/values get extra terms from concepts (joint decoding).
- Why it matters: Without dechunk+joint decoding, concept knowledge wouldn’t fully help each token prediction.
🍞 Anchor: A teacher gives the whole class a clear summary (concept), then each student (token) uses it to answer their specific question.
🍞 Hook: When you do chores, you spend more time scrubbing a sticky pan and less time wiping a clean counter.
🥬 The Concept (Implicit Compute Allocation): The model quietly spends more compute where meaning is dense and less where it’s repetitive.
- What it is: Compute budgeting that happens automatically through concept merging, not manual per-token knobs.
- How it works:
- Easy stretches merge into fewer concepts.
- Hard stretches break into more, smaller concepts.
- Heavy compute goes to the concept sequence, so effort naturally follows difficulty.
- Why it matters: You get better results per unit of compute.
🍞 Anchor: Skimming the easy pages lets you spend extra time on the tricky chapter.
🍞 Hook: If you always pack exactly 10 shirts, you might be wrong on vacation—maybe you need 7 or 13.
🥬 The Concept (Auxiliary Loss and Boundary Noise): Training helpers that keep compression on target and make boundaries robust.
- What it is: Extra loss terms to match the target R, plus small randomness to avoid fragile boundary choices.
- How it works:
- Add a balancing loss so the fraction of boundaries hits the planned R.
- Add mild random flips near uncertain boundaries during training.
- This prevents over-compression on new data.
- Why it matters: Without these, compression can drift, hurting accuracy.
🍞 Anchor: Practice packing with a little randomness so you’re ready for surprises on the trip.
Finally, why this research? Earlier attempts either used giant vocabularies (costly, and yielding only modest compression) or fixed, rule-based merging (not adaptive). ConceptMoE offers adaptive, learnable merging with fair, compute-matched tests, showing real quality and speed gains you can feel in long documents, coding help, and multimodal tasks.
02 Core Idea
🍞 Hook: You know how you first make a few key bullet points before writing a report? You think hard about the bullets, then the sentences become easy.
🥬 The Aha!: Do the heavy thinking on a shorter list of learned concepts instead of every token, then share that rich knowledge back to each token when predicting.
Multiple Analogies:
- Packing Analogy: Compress many clothes (tokens) into a few vacuum bags (concepts). You carry fewer bags (less heavy compute), but everything inside is still there when you unpack (dechunk + joint decoding).
- School Analogy: The principal (chunk module) groups similar questions into one topic (concept), sends the topic to the best teachers (experts in MoE), and then shares their guidance with every student’s question (joint decoding).
- Map Analogy: Instead of walking every street (token) in a city, you learn the main routes (concepts) first. Once you know the routes well, finding each house (token) is fast and accurate.
Before vs After:
- Before: Every token got equal compute. Long inputs and redundant parts wasted time and memory, especially in attention and KV cache.
- After: The model adaptively merges easy stretches into concepts, runs heavy compute on fewer steps, and still returns detailed help to each token during decoding.
- Net effect: Better accuracy (especially for long context), less attention overhead, faster prefill/decoding, and fair comparisons by reallocating saved compute.
Why It Works (intuition, no equations):
- Many neighboring tokens carry similar meaning (think “New York City marathon”). Measuring similarity lets the model guess where meaning changes.
- If meaning stays similar, merge—because processing once as a concept captures the gist. If meaning shifts, start a new concept to keep detail.
- The big compute (attention + MoE experts) is used on the short concept list, which focuses the model on genuinely new information.
- When decoding, giving the token both its own features and its concept’s features ensures no information is lost—quite the opposite: tokens get an extra, richer context.
Building Blocks (each introduced with a mini-sandwich):
- 🍞 Hook: Like drawing lines between paragraphs. 🥬 Chunk Module: Learns where to start a new concept by checking how much the meaning changes between neighbors (cosine similarity). Why it matters: Good lines keep ideas clear. 🍞 Anchor: Splitting “the-the-the” type stretches together, starting fresh at a topic shift.
- 🍞 Hook: Summaries help everyone in class. 🥬 Dechunk + EMA: Smooth concept boundaries with EMA to stabilize merges, then hand concept info back to every token. Why it matters: Each token benefits from rich, shared understanding. 🍞 Anchor: The teacher’s summary helps all students answer their own questions.
- 🍞 Hook: Mix group memory with personal notes. 🥬 Joint Decoding: Blend concept features with token features inside attention (extra QKV terms). Why it matters: Tokens leverage group-level understanding to predict better. 🍞 Anchor: Using both your outline (concept) and your sentence notes (token) when writing.
- 🍞 Hook: Use more time where needed. 🥬 Implicit Compute Allocation: Easy spans merge (fewer steps), tough spans split (more steps); compute follows difficulty automatically. Why it matters: Improves efficiency without micro-managing tokens. 🍞 Anchor: Skim easy pages; study hard pages.
- 🍞 Hook: Pack the right amount. 🥬 Auxiliary Loss + Boundary Noise: Keep compression near a target R and make decisions robust to shifts. Why it matters: Prevents over-compressing on new data. 🍞 Anchor: Practice packing under small surprises so your plan works on any trip.
- 🍞 Hook: Fair race rules matter. 🥬 Fair Compute Reallocation: Since MoE can change how many experts are active without changing total parameters, the paper matches total FLOPs and parameters to compare fairly. Why it matters: Shows the gains come from the idea—not extra capacity. 🍞 Anchor: Two runners with the same weight and shoes; one just runs the route smarter.
Put together, ConceptMoE compresses where safe, focuses compute where needed, and still decodes every token with help from its concept—delivering both higher quality and lower latency, especially for long sequences.
03 Methodology
At a high level: Tokens → Encoder → Chunk (to concepts) → Concept Model (heavy compute) → Dechunk (share concepts back) → Decoder with Joint Decoding → Output logits.
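Before the step-by-step recipe, here is a compact, shape-level sketch of that flow. Every callable and helper below is a placeholder assumed for illustration (the real encoder, chunk module, concept model, and decoder are full Transformer components), not the authors' implementation.

```python
import torch

def pool_by_segment(h, seg_ids):
    """Mean-pool token features into one vector per concept segment."""
    n_seg = int(seg_ids.max()) + 1
    sums = torch.zeros(n_seg, h.shape[-1]).index_add_(0, seg_ids, h)
    counts = torch.zeros(n_seg).index_add_(0, seg_ids, torch.ones(len(seg_ids)))
    return sums / counts[:, None]

def concept_moe_forward(tokens, encoder, chunker, concept_model, decoder):
    """Skeleton of the flow above; every callable here is a placeholder."""
    h = encoder(tokens)                                 # light token-level refinement
    boundaries = chunker(h)                             # True where a new concept starts
    seg_ids = boundaries.long().cumsum(0) - 1           # token -> index of its concept
    concepts = concept_model(pool_by_segment(h, seg_ids))   # heavy MoE compute on the SHORT sequence
    h = h + concepts[seg_ids]                           # dechunk: broadcast each concept back to its tokens
    return decoder(h)                                   # decoder then mixes token + concept signals

# toy run with identity stand-ins for the heavy components
x = torch.randn(10, 16)
bnd = torch.tensor([1, 0, 0, 1, 0, 1, 0, 0, 0, 1], dtype=torch.bool)
print(concept_moe_forward(x, lambda t: t, lambda h: bnd, lambda c: c, lambda h: h).shape)
```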
Step-by-step recipe with sandwiches for new pieces:
- Input and Encoder
- What happens: The text (and possibly image patches) become token embeddings. A light encoder refines them a bit.
- Why this step exists: Prepares stable features before deciding where to chunk.
- Example: A 1024-token sequence becomes a 1024×d matrix of vectors.
- 🍞 Hook: Drawing lines where ideas change. 🥬 Chunk Module (boundary detection by similarity)
- What happens:
- Project each token into query/key vectors.
- Compute cosine similarity with the previous token.
- Low similarity → start a new concept boundary; high similarity → keep merging.
- The first token is always a boundary to ensure every segment has a concept.
- Why this step exists: It adapts chunk sizes to meaning, creating fewer but richer units where safe.
- Example with data: If tokens 1–5 are very similar, the chunk module merges them into 1 concept; token 6 starts a new concept if similarity drops.
- 🍞 Hook: Vacuum-bag your clothes. 🥬 Compression Ratio R
- What happens: The model aims for a target average R = N/M (tokens per concept), like R = 2 meaning half as many concepts.
- Why this step exists: Controls how short the concept sequence becomes to budget compute.
- Example: 1024 tokens → about 512 concepts at R ≈ 2.
- 🍞 Hook: Keep the packing balanced. 🥬 Auxiliary Loss for target R
- What happens:
- The model adds a balancing loss so the share of boundaries matches R on average.
- Statistics are computed across the device’s batch so harder samples can get lower R and easier samples higher R.
- Why this step exists: Without it, the model might over-merge or under-merge.
- Example: If evaluation drifts toward higher compression, training pushes probabilities back toward the planned mean (≈ 1/R).
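A simple balancing term of this kind could look like the sketch below; the squared-error form is an assumption for illustration, and the paper's exact loss may differ. The small weight mentioned in the ablations (λ) would scale this term in the total loss.

```python
import torch

def compression_aux_loss(p_boundary, target_ratio=2.0):
    """Push the average boundary rate toward 1/R (squared-error form is an assumption)."""
    target_rate = 1.0 / target_ratio          # e.g., R = 2 -> a boundary on ~50% of tokens
    mean_rate = p_boundary.mean()             # averaged across the device's batch, as described above
    return (mean_rate - target_rate) ** 2     # scaled by a small weight (the λ in the ablations)

p = torch.rand(4, 1024)                       # boundary probabilities for a small batch
print(float(compression_aux_loss(p, target_ratio=2.0)))
```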
- 🍞 Hook: Practice under a little randomness to handle surprises. 🥬 Boundary Noise
- What happens:
- Probabilities near 0.5 are sharpened and sampled (Bernoulli), so uncertain boundaries flip sometimes.
- This reduces overfitting to exact boundary thresholds and aligns train/eval behavior.
- Why this step exists: Prevents unexpected over-compression on new distributions.
- Example: With τ = 6, around 4% of tokens flip, improving robustness with minimal training loss impact.
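One plausible way to implement this sharpen-then-sample step is sketched below. Scaling the logit by τ is an assumption about the sharpening; the section only states that near-0.5 probabilities are sharpened and then Bernoulli-sampled so uncertain boundaries occasionally flip.

```python
import torch

def noisy_boundaries(p, tau=6.0, training=True):
    """Sharpen boundary probabilities, then sample them so uncertain boundaries can flip."""
    if not training:
        return p > 0.5                                             # deterministic thresholding at evaluation
    p_sharp = torch.sigmoid(tau * torch.logit(p, eps=1e-6))        # confident probs stay near 0 or 1
    return torch.bernoulli(p_sharp).bool()                         # only probabilities near 0.5 flip often

p = torch.rand(100_000)
flips = (noisy_boundaries(p) != (p > 0.5)).float().mean()
print(f"fraction of flipped boundary decisions: {flips.item():.2%}")
```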
- 🍞 Hook: Study the outline, not every sentence. 🥬 Concept Model (heavy compute on concepts)
- What happens:
- The shorter concept sequence goes through the heavy MoE Transformer stack (attention + experts).
- Compute reallocation keeps total FLOPs and parameters matched to the baseline.
- Why this step exists: Running heavy layers on fewer steps saves attention work and KV cache, enabling speedups.
- Example: With R = 2, attention map FLOPs and KV cache in the concept block drop by about 2×, even when total compute is repurposed elsewhere.
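As a rough back-of-envelope on why the shorter concept sequence helps, the sketch below sizes the KV cache before and after compression. The layer, head, and precision numbers are made-up illustrative values, not the paper's configuration; the point is simply that cache size scales with sequence length, so it shrinks by about R× in the concept block.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Rough KV-cache size: one K and one V vector per layer per position (illustrative config)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

tokens, R = 131_072, 2.0                              # a long 128K-token input, compressed 2x
baseline = kv_cache_bytes(tokens)
concept = kv_cache_bytes(int(tokens / R))             # the concept block caches one entry per concept
print(f"{baseline / 2**30:.1f} GiB -> {concept / 2**30:.1f} GiB (~{baseline / concept:.1f}x smaller)")
```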
- 🍞 Hook: Smooth your outline for clarity. 🥬 EMA Smoothing of Concepts
- What happens:
- An exponential moving average blends neighboring concept vectors based on boundary confidence.
- If the model learns two adjacent concepts are better as one, EMA nudges probabilities so they merge.
- Why this step exists: Stabilizes boundaries and speeds convergence.
- Example: Two adjacent chunks, “Simple and easy-to-” and “understand picture,” may merge into one concept as training discovers they belong together.
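The recurrence below is one plausible form of this smoothing, assuming the boundary confidence acts as the EMA gate; the paper's exact formula may differ. A low-confidence boundary lets the previous concept bleed in, which is the nudge-toward-merging behavior described above.

```python
import torch

def ema_smooth_concepts(concepts, boundary_conf):
    """Blend each concept with its predecessor, gated by how confident its boundary is."""
    smoothed = concepts.clone()
    for i in range(1, len(concepts)):
        p = boundary_conf[i]                                   # confidence that a new concept starts here
        smoothed[i] = p * concepts[i] + (1 - p) * smoothed[i - 1]
    return smoothed

c = torch.randn(5, 16)                                         # 5 concept vectors
conf = torch.tensor([1.0, 0.9, 0.2, 0.95, 0.6])                # low confidence at index 2 -> blends with 1
print(ema_smooth_concepts(c, conf).shape)                      # torch.Size([5, 16])
```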
- 🍞 Hook: Share the summary with everyone. 🥬 Dechunk back to tokens
- What happens:
- Each token is assigned the concept that covers its position and gets that concept vector added to its token vector.
- Why this step exists: Ensures every token benefits from concept-level understanding.
- Example: Tokens 1–5 all get the same enriched concept vector mixed into their features.
- 🍞 Hook: Use both your outline and your notes. 🥬 Joint Decoding in the last decoder layers
- What happens:
- Self-attention queries/keys/values are augmented with extra projections of the concept vector for each token.
- This costs little (few extra projections in the last layers) but boosts downstream performance.
- Why this step exists: Forces the model to actually use concept information when predicting tokens.
- Example: The token “mayor” benefits from the concept covering “New York City mayor,” improving accuracy.
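A single-head sketch of such augmented attention is shown below. The head count, sizes, and plain additive combination are illustrative assumptions; in the paper, these extra projections are added only in the last decoder layers.

```python
import math
import torch
import torch.nn as nn

class JointDecodingAttention(nn.Module):
    """Single-head self-attention whose Q/K/V each get an extra term from the token's concept vector."""
    def __init__(self, d=64):
        super().__init__()
        self.q_tok, self.k_tok, self.v_tok = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.q_cpt, self.k_cpt, self.v_cpt = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, h_tok, c_tok):                     # both: (seq_len, d)
        q = self.q_tok(h_tok) + self.q_cpt(c_tok)        # token query + concept term
        k = self.k_tok(h_tok) + self.k_cpt(c_tok)
        v = self.v_tok(h_tok) + self.v_cpt(c_tok)
        causal = torch.ones(len(q), len(q)).triu(1).bool()          # mask out future positions
        scores = (q @ k.T) / math.sqrt(q.shape[-1])
        scores = scores.masked_fill(causal, float("-inf"))
        return scores.softmax(dim=-1) @ v

h = torch.randn(12, 64)                                  # token features after dechunk
c = torch.randn(12, 64)                                  # each token's (broadcast) concept vector
print(JointDecodingAttention()(h, c).shape)              # torch.Size([12, 64])
```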
- 🍞 Hook: Make the race fair. 🥬 Compute Reallocation Strategies (MoE makes this possible)
- What happens:
- Strategy 1: Increase number of activated experts (C_moe) to spend saved compute.
- Strategy 2: Loop intermediate layers (increase L_C effectively) and modestly increase experts.
- Strategy 3: Slightly scale attention/hidden sizes in the concept model and adjust experts, adding two small projectors (h→c and c→z).
- Why this step exists: To compare fairly with the baseline under the same total parameters and average per-token FLOPs (excluding attention map savings) and isolate architectural benefits.
- Example: At R = 2, you can double layers or raise expert count to match compute, yet attention map FLOPs and KV cache in the concept block still shrink, driving speed.
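The toy calculator below illustrates the budget-matching logic for Strategies 1 and 2, assuming expert FFN cost scales linearly with the number of activated experts and that attention-map savings are excluded from the match, as in the paper's protocol. The baseline numbers are placeholders, not the paper's configuration.

```python
def matched_activated_experts(baseline_top_k, R):
    """Strategy 1: only 1/R of positions run the heavy block, so activating ~R x more
    experts per concept roughly restores the average per-token FLOPs budget."""
    return round(baseline_top_k * R)

def matched_looped_layers(baseline_layers, R):
    """Strategy 2: alternatively, loop the intermediate layers ~R x (rough estimate)."""
    return round(baseline_layers * R)

print(matched_activated_experts(baseline_top_k=8, R=2.0))   # ~16 activated experts
print(matched_looped_layers(baseline_layers=24, R=2.0))     # ~48 effective layers
```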
- Vision-Language Extension
- What happens: Apply chunking to both text and image tokens (from a ViT/NaViT). The model often compresses images more and text less.
- Why this step exists: Visual tokens are often more redundant spatially; adaptive compression exploits that.
- Example: Long multimodal inputs show strong speedups and better long-document understanding.
The Secret Sauce:
- Adaptive, similarity-based chunking focuses compute where it matters.
- Dechunk + joint decoding ensure no information loss—concept knowledge actively helps each token.
- MoE’s flexible activation lets the paper reallocate compute to keep comparisons fair while enjoying inherent attention/KV reductions from shorter concept sequences.
04 Experiments & Results
The Test: The authors measure whether concept-level processing improves both accuracy and efficiency when you hold total parameters and average per-token FLOPs equal to the baseline (excluding attention map compute). They test on language pretraining, long-context understanding, multimodal tasks, and practical continual training (CT), plus real latency on big models.
The Competition: Standard MoE models of comparable size and compute. Prior dynamic compression works are discussed, but fair matching was not always possible there. MoE makes it possible to adjust activated experts/layers to match compute here.
Scoreboard with Context:
- Language Pretraining (12B/24B total params, matched compute):
- ConceptMoE edges out MoE: about +0.9 points on downstream averages. That’s like going from a solid B to a B+ when everyone studied the same hours.
- Long Context (60B VLM):
- ConceptMoE +2.3 points overall, with higher gains in tasks like Needle-in-a-Haystack and long-document reasoning. That’s like finding hidden facts faster because the pages became shorter and clearer.
- Multimodal (60B VLM):
- ConceptMoE +0.6 points overall, notably better in reasoning and understanding. Slight dips occur for fine-grained localization tasks where strict spatial order matters more.
- Continual Training Conversion (90B):
- ConceptMoE-top15 roughly matches MoE (lossless conversion), while ConceptMoE-top11-loop8 gains +5.5 points on Open Benchmarks without extra FLOPs. Training from scratch with the same idea adds another +0.9 (total +6.4).
- Inference Speed (300B baseline):
- Prefill speedups reach up to 175% and decoding speedups up to 117% at R = 2 on long sequences; even quality-oriented setups that double the layer count still come out ahead on long inputs. With the CT-friendly R = 1.5, prefill gains reach up to 43.6% and decoding gains up to 53.3%.
Surprising/Interesting Findings:
- Train vs Eval Mismatch: A linear boundary router slightly improves training loss but hurts downstream scores versus cosine similarity routing—suggesting overfitting. Similarly, removing joint decoding lowers training loss but degrades real tasks. Moral: Better generalization beats tiny train-loss wins.
- Chunking Strategy Matters: Adaptive chunking beats fixed-size merging. Fixed chunks can cut across meanings, hurting learning and downstream results. Adaptive boundaries preserve semantic coherence.
- Right Amount of Compression: R ≈ 1.5–2 is a sweet spot. Aggressive R = 4 compresses too much, especially damaging reasoning and math—like summarizing so hard you lose the point.
- Multimodal Adaptation: Under a global R, the model compresses images more and text less, matching human intuition that vision tokens can be more redundant. This self-balancing behavior emerges from the training objective.
Numbers made meaningful:
- +0.9 points on language: Think of this as consistently answering one more question right out of every hundred across many tests.
- +2.3 on long context: When reading long documents, this is a noticeable bump—like moving up a letter grade on tough, lengthy assignments.
- +5.5 in CT with layer looping: That’s a big jump without extra FLOPs, showing the architectural change—not just more compute—does the work.
- Up to 175% prefill speedup: For giant inputs, it’s like turning a 10-second wait into roughly 3.6 seconds, which feels noticeably snappier.
Ablations (what worked and why):
- Auxiliary loss weight λ: Too large hurts training; λ = 0.03 is a stable choice.
- Boundary Noise: Bernoulli noise (τ = 4–6) improves robustness; τ = 6 used by default balances stability and performance.
- Chunking strategies: Dynamic > Fixed > None, confirming adaptiveness matters.
- Router: Cosine similarity generalizes better than a linear router that simply predicts boundaries.
- Joint Decoding: Keeping it improves real-world tasks even if train loss is slightly worse without it.
Bottom line: With careful, fair matching of compute and parameters, ConceptMoE reliably improves quality and speeds up attention-heavy stages by operating on fewer, smarter steps (concepts) while still decoding every token with concept help.
05 Discussion & Limitations
Limitations:
- Over-Compression Risk: Pushing R too high (e.g., 4×) harms reasoning and math. Some domains need fine granularity.
- Spatial Tasks Sensitivity: Slight drops on localization/chart-text tasks suggest sequential treatment of image tokens can blur spatial structure.
- Router Choices: A linear router can look good in training but generalize worse than cosine similarity; careful design is needed.
- Distribution Shifts: Train/eval differences can drift compression. Boundary noise helps but does not fully eliminate the issue.
Required Resources:
- MoE Infrastructure: You need a Mixture-of-Experts stack and the ability to adjust activated experts/layers for fair matching.
- Training Scale: Benefits are shown at tens to hundreds of billions of tokens; expect meaningful compute and data budgets.
- Engineering for CT: Minimal, but still requires adding a chunk/dechunk module and extra QKV projectors in the last decoder layers.
When NOT to Use:
- Ultra fine-grained vision tasks (precise localization, chart OCR) where preserving exact spatial order is paramount.
- Tiny models or tiny datasets where the overhead of adding modules may outweigh gains.
- Highly noisy or domain-shifted data without tuning R or boundary noise—compression may drift and hurt accuracy.
Open Questions:
- Adaptive R per modality/segment: Can we learn different target R values for text vs image or even per document section automatically?
- Richer similarity signals: Beyond local cosine similarity, can global or multi-hop signals mark better boundaries?
- Spatially aware multimodal chunking: How to preserve 2D relationships while still compressing visual tokens?
- Theoretical bounds: What are principled limits on compression before information-theoretic loss harms reasoning?
- Beyond MoE: How well does concept-level processing translate to dense models when fairness (params/FLOPs) is strictly controlled?
Overall, ConceptMoE is a strong, practical step toward adaptive compute: great for long inputs, balanced tasks, and CT conversions—while reminding us to avoid over-compressing and to treat vision structure with care.
06 Conclusion & Future Work
Three-Sentence Summary:
- ConceptMoE teaches models to merge similar neighboring tokens into concepts, run heavy compute on the shorter concept sequence, and then share concept knowledge back to each token during decoding.
- Thanks to MoE, saved compute is reallocated so total parameters and average FLOPs match the baseline, revealing real architectural gains—better accuracy on language, long context, and multimodal tasks.
- Even under compute-matching, attention map work and KV cache shrink about R×, yielding big prefill/decoding speedups; minimal code changes make CT conversion practical.
Main Achievement: Demonstrating that adaptive, similarity-based token-to-concept compression—combined with dechunking and joint decoding—delivers both higher quality and notable efficiency, under rigorous, fair comparisons.
Future Directions:
- Learn variable R across modalities/segments and incorporate richer boundary cues.
- Develop spatially aware visual chunking to help fine-grained localization.
- Extend concept-level processing to other architectures and domains (speech, time series) with fairness controls.
- Provide theory on optimal compression levels for different datasets.
Why Remember This: It reframes processing from tokens to concepts—just like humans do—so models spend effort where meaning changes. That simple shift boosts accuracy, slashes attention overhead, and makes long-context inference much faster, all with minimal, CT-friendly changes. In short: think in concepts first, decode tokens better.
Practical Applications
- Speed up long-document chatbots that need to read and summarize thousands of tokens.
- Improve code assistants by focusing compute on tricky logic while skimming boilerplate.
- Enhance retrieval-augmented generation where many tokens are repetitive citations.
- Accelerate multimodal QA systems that combine long text and many image patches.
- Reduce serving costs by shrinking attention and KV cache usage, especially at large batch sizes.
- Upgrade existing MoE models via continual training to get immediate latency gains.
- Stabilize performance under data shifts by using boundary noise and calibrated compression ratios.
- Boost long-context tasks like legal/medical reviews where detail plus scale both matter.
- Make streaming assistants more responsive during prefill and decoding phases.
- Provide a template for adaptive compute in other domains (speech, time series) with similar redundancy.