
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Intermediate
Kewei Zhang, Ye Huang, Yufan Deng et al. · 1/12/2026
arXiv · PDF

Key Summary

  • Transformers are powerful but slow because regular self-attention compares every token with every other token, a cost that grows quadratically with sequence length.
  • Linear attention is fast but loses accuracy because it squashes all tokens into one shared summary, so different queries see almost the same thing.
  • This paper names the core failure mode global context collapse: the diversity in what different tokens focus on disappears.
  • MHLA (Multi-Head Linear Attention) splits tokens into blocks, builds local summaries, and lets each query mix those summaries differently.
  • This restores most of the flexibility of softmax attention while keeping linear-time speed and low memory.
  • MHLA improves ImageNet accuracy by up to 3.6%, boosts image generation quality by up to 12.6%, and lifts video generation scores by 41% over vanilla linear attention.
  • It also beats strong linear and recurrent baselines on several NLP benchmarks and long-context tests.
  • MHLA needs no extra heavy modules like convolutions or extra attention blocks, and maps cleanly to standard fast matrix multiplies (GEMMs).
  • The trick works because it raises the attention matrix rank and lowers its entropy, meaning more diversity and sharper focus.
  • MHLA keeps streaming/stateful execution and maintains linear complexity, so it scales to huge images and very long videos.

Why This Research Matters

Long documents, big images, and videos demand attention methods that are both smart and fast. MHLA keeps the linear-time speed of efficient attention while bringing back the sharp, query-specific focus that makes models accurate. That means better photo realism, more consistent and vibrant videos, and language models that can reason across many pages. Because MHLA uses standard fast matrix multiplies, it fits easily into today’s hardware and software stacks. This can lower costs, shorten training time, and enable on-device models for privacy and responsiveness. In short, MHLA helps AI handle more data with less waiting—and higher quality.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how when you study for a test, you don’t reread the entire textbook word-for-word—you scan for the important parts? Attention in AI tries to do the same thing: focus on what matters.

🥬 Filling (The Actual Concept):

  • What it is: Attention Mechanism is a way for AI to weigh which pieces of information are most important for each decision.
  • How it works:
    1. Look at all the pieces (tokens) in the input.
    2. Score how related each piece is to what we need right now.
    3. Combine the most relevant pieces more strongly and the less relevant ones weakly.
  • Why it matters: Without attention, the model treats everything equally and wastes time and memory, missing the crucial bits.

🍞 Bottom Bread (Anchor): When a model answers “What’s the capital of France?”, attention boosts “capital” and “France” and downplays filler words, so it says “Paris.”

🍞 Top Bread (Hook): Imagine a big class discussion where every student listens to every other student before speaking—that’s very thorough, but it takes a long time.

🥬 Filling (The Actual Concept):

  • What it is: Self-Attention is attention that compares every token to every other token to build context.
  • How it works:
    1. Turn each token into queries (what I’m looking for), keys (how I can be found), and values (my content).
    2. Compare every query with every key to get importance scores.
    3. Use these scores to blend values into a context for each token.
  • Why it matters: Without self-attention, models can’t flexibly relate distant parts (like the subject of a sentence and its verb far away). But it’s expensive because it checks all pairs.

🍞 Bottom Bread (Anchor): In a long sentence, self-attention helps the model connect “The cake that Mom baked yesterday” with “was delicious,” even if they’re far apart.
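
If you like seeing the recipe in code, here is a minimal NumPy sketch of single-head softmax self-attention; the variable names, shapes, and random inputs are illustrative choices, not taken from the paper. Notice the (N, N) score matrix: that all-pairs comparison is exactly the quadratic cost the rest of this article tries to avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_self_attention(Q, K, V):
    """Vanilla self-attention: every query is scored against every key, O(N^2) pairs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (N, N) pairwise importance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # blend values per query

rng = np.random.default_rng(0)
N, d = 6, 4                              # 6 tokens, 4-dimensional features
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_self_attention(Q, K, V)
print(out.shape)                         # (6, 4): one context vector per token
```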

🍞 Top Bread (Hook): Now think of a shortcut: instead of every student listening to everyone, they read a class summary page.

🥬 Filling (The Actual Concept):

  • What it is: Linear Attention is a faster form of attention that builds one summary of all tokens so each token doesn’t compare with every other token directly.
  • How it works:
    1. Convert queries and keys into special features (a “kernel map”).
    2. Compress all keys and values into a single global summary.
    3. Let each query use that same summary to get context.
  • Why it matters: It runs in linear time, so it’s much faster on long sequences. But because everyone shares the same summary, different queries can’t personalize what they see.

🍞 Bottom Bread (Anchor): If 1,000 news articles are squeezed into one paragraph and every reporter uses that same paragraph, they’ll all write similar stories, missing unique angles.
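
Here is a matching NumPy sketch of vanilla linear attention under the same illustrative setup; the ELU+1 feature map is a common choice in the literature, not necessarily the kernel used in this paper. The key point is that all keys and values collapse into one small summary S that every query reads.

```python
import numpy as np

def phi(x):
    # A common nonnegative feature map for linear attention (ELU + 1);
    # a stand-in choice, not necessarily the kernel used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """All keys and values are squeezed into ONE global summary shared by every query."""
    Qf, Kf = phi(Q), phi(K)                  # (N, d) feature-mapped queries and keys
    S = Kf.T @ V                             # (d, d) global KV summary
    z = Kf.sum(axis=0)                       # (d,)  global normalizer
    return (Qf @ S) / (Qf @ z)[:, None]      # cost is O(N * d^2), not O(N^2)

rng = np.random.default_rng(0)
N, d = 1000, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)              # every query reads the same summary S
print(out.shape)                             # (1000, 16)
```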

🍞 Top Bread (Hook): Imagine writing the entire plot of a long movie on the back of a business card—too much gets lost.

🥬 Filling (The Actual Concept):

  • What it is: A Key–Value (KV) Summary is the compressed bundle of all keys and values that linear attention uses instead of looking at each token one-by-one.
  • How it works:
    1. Add up transformed keys and their paired values into a fixed-size matrix.
    2. Reuse this same matrix for all queries.
  • Why it matters: It saves time and memory, but if the sequence is long, a fixed-size summary can’t hold all the needed details.

🍞 Bottom Bread (Anchor): Like a book summary: it’s quick to read, but if you need a specific quote, it may not be there.
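
To see why the summary is "fixed-size", here is a tiny, hedged sketch of the streaming update: each new token is folded into the same small state, so memory never grows with sequence length (the feature map is again just a stand-in).

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # stand-in nonnegative feature map

d = 16
S = np.zeros((d, d))        # the KV summary: fixed size, independent of sequence length
z = np.zeros(d)             # running normalizer

rng = np.random.default_rng(0)
for _ in range(10_000):                          # stream 10k tokens, one at a time
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S += np.outer(phi(k), v)                     # fold the new token into the same state
    z += phi(k)

print(S.shape, z.shape)                          # (16, 16) (16,) -- still just a tiny state
```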

🍞 Top Bread (Hook): Picture a smoothie with every fruit you can find; at the end, every sip tastes the same—no strawberry highlight, no mango zing.

🥬 Filling (The Actual Concept):

  • What it is: Global Context Collapse is when a shared global summary makes the model’s focus blur so much that different queries see nearly the same context.
  • How it works:
    1. All tokens are crammed into one fixed-sized summary.
    2. As sequences get longer, many details compete for the same tiny space.
    3. The model’s attention becomes uniform and loses selectivity.
  • Why it matters: Without diversity, the model can’t zoom in on the most relevant tokens for each query, hurting accuracy—especially for long inputs like big images or videos.

🍞 Bottom Bread (Anchor): When summarizing a 2-hour movie into one sentence, different viewers (queries) can’t pick different scenes—they only get the same bland line.
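
A rough way to see the collapse numerically (a toy illustration of mine, not an experiment from the paper): the implicit attention matrix of linear attention factors through the feature dimension, so its rank can never exceed d no matter how long the sequence is, while softmax attention has no such cap.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # stand-in nonnegative feature map

rng = np.random.default_rng(0)
N, d = 512, 16
Q, K = rng.standard_normal((N, d)), rng.standard_normal((N, d))

# Implicit attention matrix of linear attention: row i holds the weights query i
# puts on every token. It factors as phi(Q) @ phi(K).T, so its rank can never
# exceed d, however long the sequence gets -- the formal face of collapse.
A_lin = phi(Q) @ phi(K).T
A_lin /= A_lin.sum(axis=1, keepdims=True)

# Softmax attention: the exponential breaks that low-rank factorization, so
# different queries can express far more distinct weighting patterns.
scores = Q @ K.T / np.sqrt(d)
A_soft = np.exp(scores - scores.max(axis=1, keepdims=True))
A_soft /= A_soft.sum(axis=1, keepdims=True)

print(np.linalg.matrix_rank(A_lin))    # at most 16 (capped by the feature dimension d)
print(np.linalg.matrix_rank(A_soft))   # typically far higher on random inputs
```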

The world before this paper: Softmax self-attention was accurate but too slow for huge images, long documents, and videos because it checks every pair of tokens. Linear attention was fast and memory-friendly but lost accuracy due to using a single shared KV summary. People tried bolting on extra parts, like depthwise convolutions or gating modules, to restore accuracy. These helped a bit but brought back more compute cost and still struggled as sequences grew even longer.

The problem: How can we keep the speed and memory savings of linear attention but restore the “different queries see different things” superpower of softmax attention?

Failed attempts: Extra convolutions, hybrid layers, and gating added complexity and cost. Even then, the root issue—everyone sharing the same global summary—remained, so performance still sagged with longer sequences.

The gap: We needed a way to give each query a more personalized view of the context without breaking the linear-time budget.

The real stakes: This matters for your daily tech because better long-context attention powers sharper photo generation, crisper long videos, smarter chat over long documents, and faster models on your devices. If attention can be both fast and expressive, we get higher quality and lower wait times without huge hardware.

02 Core Idea

🍞 Top Bread (Hook): Imagine a big library split into rooms by topic. Instead of one tiny summary for the whole library, you make a short summary for each room. Then, for each question, you mix the most relevant room summaries. Faster than reading every book, smarter than one summary for all.

🥬 Filling (The Actual Concept):

  • What it is: Multi-Head Linear Attention (MHLA) splits tokens into multiple blocks (like rooms), builds a local summary for each block, and lets each query mix those summaries differently. This restores diversity while keeping linear-time speed.
  • How it works (recipe):
    1. Partition tokens into M non-overlapping blocks (token-level heads).
    2. For each block, compute a local key–value summary (its mini report).
    3. For each query’s block, learn mixing weights over all local summaries to form a customized global view.
    4. Within chosen blocks, fine-tune contributions using the query–key match, then produce the output.
  • Why it matters: Without this, linear attention forces everyone to share one summary, causing global context collapse. With MHLA, different queries re-combine blocks differently, bringing back focus and detail—yet still in linear time.

🍞 Bottom Bread (Anchor): A question about “the goalie’s save in the second half” mostly mixes the sports room and the match-timeline room, not the cooking room. Another question mixes different rooms. That’s MHLA.
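
Here is a compact NumPy sketch of the whole idea as I read it; contiguous 1D blocks, the ELU+1 feature map, and a hand-set mixing matrix are my simplifications, not the authors' reference implementation. The four steps are the ones in the recipe above: partition, summarize per block, mix summaries per query block, then apply each query.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))     # stand-in kernel feature map

def mhla(Q, K, V, M, C):
    """Toy multi-head linear attention over contiguous 1D blocks.

    Q, K, V : (N, d) token projections, with N divisible by M
    M       : number of token blocks ("token-level heads")
    C       : (M, M) nonnegative mixing matrix; row b says how strongly queries
              in block b draw on each block's local summary (learned in the paper)
    """
    N, d = Q.shape
    B = N // M
    Qf = phi(Q).reshape(M, B, d)                   # partition queries into blocks
    Kf = phi(K).reshape(M, B, d)
    Vb = V.reshape(M, B, d)

    S = np.einsum('mbd,mbe->mde', Kf, Vb)          # (M, d, d) local KV summaries
    z = Kf.sum(axis=1)                             # (M, d)    local normalizers

    S_mix = np.einsum('bm,mde->bde', C, S)         # each query block's custom "global" view
    z_mix = C @ z                                  # (M, d)

    num = np.einsum('mbd,mde->mbe', Qf, S_mix)     # apply each query to its own mixture
    den = np.einsum('mbd,md->mb', Qf, z_mix)[..., None]
    return (num / den).reshape(N, d)

rng = np.random.default_rng(0)
N, d, M = 64, 16, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
C = np.eye(M) + 0.1                                # locality-biased start: trust your own block most
out = mhla(Q, K, V, M, C)
print(out.shape)                                   # (64, 16), computed in time linear in N
```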

The “Aha!” moment in one sentence: If linear attention loses diversity by using one global summary, restore diversity by building many local summaries and letting each query learn how to mix them.

Three analogies:

  • Detective squad: Each detective (block) writes a brief report. For a new clue (query), the chief mixes the most relevant reports to solve the case.
  • Buffet sampler: Instead of blending all foods into one smoothie, you keep dishes separate and let each eater (query) pick a personal plate mix.
  • Map tiles: Rather than one fuzzy world map, keep detailed tiles per region and let each traveler (query) stitch together the tiles they need.

Before vs. After:

  • Before (vanilla linear attention): One shared summary; quick but bland. Different queries look similar.
  • After (MHLA): Many local summaries mixed per query block; still quick, now sharp and diverse like softmax attention—but without the quadratic cost.

Why it works (intuition):

  • Sandwich: Rank and Entropy. 🍞 Hook: Think of a choir. A larger choir can sing richer harmonies (higher rank). If everyone whispers equally (high entropy), the song lacks highlights. 🥬 Concept: Rank measures how many independent “directions” attention can express; entropy measures how spread-out or focused attention is.
    • How: MHLA adds blockwise components that different queries can combine, increasing independent directions (rank). Mixing and intra-block reweighting sharpen focus, lowering entropy.
    • Why: Higher rank captures more diverse patterns; lower entropy means crisper selection. 🍞 Anchor: In images, MHLA can focus on edges in one block and textures in another; the mix depends on the query, yielding detailed, clean generations.
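
If you want to poke at this yourself, here are two small diagnostics in the spirit of the paper's rank/entropy analysis; the exact definitions the authors use may differ, so treat this as a rough probe rather than their metric.

```python
import numpy as np

def effective_rank(A, tol=1e-6):
    """Count singular values above a small fraction of the largest one (a rough proxy)."""
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > tol * s[0]).sum())

def mean_row_entropy(A):
    """Average Shannon entropy of the attention rows; lower means sharper, more selective focus."""
    P = np.clip(A, 1e-12, None)
    P = P / P.sum(axis=1, keepdims=True)
    return float(-(P * np.log(P)).sum(axis=1).mean())

# Usage: pass any (N, N) matrix of attention weights (softmax, linear, or MHLA's
# implicit weights) and compare: more expressive attention shows a higher
# effective rank and a lower mean row entropy.
rng = np.random.default_rng(0)
A = rng.random((128, 128))
A /= A.sum(axis=1, keepdims=True)
print(effective_rank(A), round(mean_row_entropy(A), 3))
```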

Building blocks of MHLA:

  • Token partitioning: Split the sequence into blocks that are spatial (2D), spatiotemporal (3D), or 1D chunks.
  • Local summaries: Compute compact KV summaries per block.
  • Multi-Head Mixing: Learn an M-by-M coefficient matrix that says how much each query block should use each local summary.
  • Intra-block refinement: Within chosen blocks, use the query–key match to weight tokens, preserving token-level detail.
  • Efficient compute: All steps map to standard fast matrix multiplies (GEMMs), preserving linear complexity and streaming.

🍞 Bottom Bread (Anchor): In video generation, MHLA lets a frame’s queries pull more from blocks covering moving objects and less from static background blocks, keeping motion crisp without slowing the model down.

03 Methodology

At a high level: Input tokens → Split into blocks → Build local KV summaries → Learn query-specific mixtures over summaries → Intra-block reweighting → Output contexts (linear-time).

Step 1: Partition tokens into non-overlapping blocks (token-level heads)

  • What happens: Break the sequence of N tokens into M blocks. In images, use spatial tiles; in video, use space–time tiles; in text, contiguous chunks.
  • Why this step exists: Blocks let us keep multiple local summaries (instead of one global), creating pieces we can later mix per query to restore diversity.
  • Example: A 32×32 image patch grid forms 16 blocks (4×4 grid). Each block groups nearby pixels/patches that often share local structure.

🍞 Top Bread (Hook): Like shelving a library by rooms so each room gets its own mini catalog.

🥬 Filling (The Actual Concept):

  • What it is: Token Partitioning is organizing tokens into groups that will each get a local summary.
  • How it works:
    1. Choose M (number of blocks).
    2. Assign each token to one block based on position.
    3. Proceed per block in parallel.
  • Why it matters: Without grouping, you’re stuck with one summary for all tokens; grouping creates mixable building blocks.

🍞 Bottom Bread (Anchor): Group puzzle pieces by color/region before assembling; it’s faster to find what you need.
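
A minimal sketch of the two most common partitioning choices, assuming N divides evenly into blocks (handling ragged edges is omitted): contiguous 1D chunks for text, and an m x m grid of spatial tiles for images, matching the 32×32 grid, 16-block example above.

```python
import numpy as np

def partition_1d(tokens, M):
    """Text/audio: split an (N, d) sequence into M contiguous chunks -> (M, N // M, d)."""
    N, d = tokens.shape
    return tokens.reshape(M, N // M, d)              # assumes N is divisible by M

def partition_2d(tokens, H, W, m):
    """Images: split an H x W patch grid into an m x m grid of spatial tiles.
    Returns (m*m, (H//m) * (W//m), d): one row of tokens per tile."""
    d = tokens.shape[-1]
    th, tw = H // m, W // m
    x = tokens.reshape(m, th, m, tw, d)              # split rows and columns into tiles
    x = x.transpose(0, 2, 1, 3, 4)                   # bring (tile_row, tile_col) together
    return x.reshape(m * m, th * tw, d)

rng = np.random.default_rng(0)
patches = rng.standard_normal((32 * 32, 64))         # a 32x32 grid of 64-dim patch tokens
blocks = partition_2d(patches, 32, 32, 4)            # 4x4 tiling -> 16 blocks, as in the example
print(blocks.shape)                                   # (16, 64, 64): 16 blocks of 64 tokens each
```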

Step 2: Compute local key–value (KV) summaries per block

  • What happens: For each block, gather its keys and values into a compact summary (its “mini report”) and compute a normalizer if needed.
  • Why this step exists: Local summaries retain block-specific patterns (edges, textures, words in a paragraph) so different queries can later emphasize different blocks.
  • Example: In an image, a block covering a cat’s ear summarizes fine edge information; another block covering background summarizes flat color regions.

🍞 Top Bread (Hook): Think of each room in a museum writing a short guide to its exhibits.

🥬 Filling (The Actual Concept):

  • What it is: Local KV Summary is a block’s compressed description of its tokens.
  • How it works:
    1. Collect keys and values from tokens in the block.
    2. Combine them into a fixed-size matrix summary.
    3. Save a blockwise normalizer (optional in very long settings for stability).
  • Why it matters: Without local summaries, we’d fall back to a single global summary, losing variety.

🍞 Bottom Bread (Anchor): A neighborhood guidebook tells you what’s nearby, different from the downtown guidebook.
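
In code, a block's "mini report" is just the usual linear-attention state computed per block. The sketch below assumes already-partitioned (M, B, d) tensors and the same stand-in feature map as before; it is an illustration of the step, not the authors' implementation.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))        # stand-in kernel feature map

def local_kv_summaries(K_blocks, V_blocks):
    """K_blocks, V_blocks: (M, B, d) partitioned keys and values.
    Returns S: (M, d, d) per-block KV summaries and z: (M, d) per-block normalizers."""
    Kf = phi(K_blocks)
    S = np.einsum('mbd,mbe->mde', Kf, V_blocks)        # sum of outer(phi(k), v) inside each block
    z = Kf.sum(axis=1)                                 # optional normalizer, kept per block
    return S, z

rng = np.random.default_rng(0)
M, B, d = 16, 64, 32
K_blocks = rng.standard_normal((M, B, d))
V_blocks = rng.standard_normal((M, B, d))
S, z = local_kv_summaries(K_blocks, V_blocks)
print(S.shape, z.shape)                                # (16, 32, 32) (16, 32)
```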

Step 3: Multi-Head Mixing to form query-specific mixtures

  • What happens: For every query block i, learn a vector of nonnegative mixing weights over all M blocks, and use it to create a custom mixture of local summaries.
  • Why this step exists: It’s the core that makes MHLA query-conditioned—different queries do not see the same summary anymore.
  • Example: Text queries about a “chapter’s main idea” might weight the block containing topic sentences more than side-details blocks.

🍞 Top Bread (Hook): Like a DJ mixing several tracks to make a custom playlist for each listener.

🥬 Filling (The Actual Concept):

  • What it is: Multi-Head Mixing is a learnable way for each query block to combine local summaries into its own global view.
  • How it works:
    1. Build an M×M coefficient matrix; each row is a block’s mixing weights over all summaries.
    2. Initialize to favor nearby blocks (locality bias), then learn it end-to-end.
    3. Clip and normalize weights to keep training stable.
  • Why it matters: Without mixing, queries all share the same summary; with it, queries recompose context differently.

🍞 Bottom Bread (Anchor): For a sports clip, the mixing weights lift blocks with the ball and players; for a landscape clip, they lift sky/ground blocks differently.
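
Here is one way the mixing matrix could be initialized and stabilized, following the description above (locality bias, clipping, normalization). The 1D-neighborhood scheme and the constants are assumptions of mine for illustration, not the paper's exact recipe, and in practice the matrix would be a learnable parameter trained end-to-end.

```python
import numpy as np

def init_mixing_matrix(M, self_weight=1.0, neighbor_weight=0.25):
    """Locality-biased start: each block trusts itself most and its 1D neighbors a little."""
    dist = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return self_weight * np.eye(M) + neighbor_weight * (dist == 1)

def stabilize(C, max_val=2.0):
    """Keep training stable: clip to a nonnegative range, then normalize each row."""
    C = np.clip(C, 0.0, max_val)
    return C / (C.sum(axis=1, keepdims=True) + 1e-8)

C = stabilize(init_mixing_matrix(M=8))
print(np.round(C, 2))          # each row: mostly its own block, a little from the neighbors
```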

Step 4: Intra-block reweighting and output computation

  • What happens: Within the selected blocks, we still compare the query and the tokens’ keys to fine-tune which tokens matter most. Then we produce each token’s output context.
  • Why this step exists: Mixing picks which blocks to emphasize; intra-block reweighting picks which tokens inside those blocks matter. Both are needed for sharp focus.
  • Example: In a paragraph block, bolded keywords get more weight than filler words.
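
To see both stages of selectivity at once, the sketch below unrolls the implicit per-token weights for a single query in the toy formulation used earlier: a block-level mixing weight times a token-level query-key match, then normalized. The real method never materializes these N weights; it works on the compact summaries instead, but the two views agree.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))         # stand-in kernel feature map

def implicit_token_weights(q, K, block_ids, C, query_block):
    """Unrolled view for ONE query: block-level mixing weight times token-level
    query-key match, then normalized. Equivalent to the summary-based compute
    in the toy mhla() sketch above, but spelled out token by token for intuition."""
    token_match = phi(K) @ phi(q)                      # (N,) token-level relevance
    block_weight = C[query_block, block_ids]           # (N,) block-level mixing weight
    w = block_weight * token_match
    return w / w.sum()

rng = np.random.default_rng(0)
N, d, M = 64, 16, 4
K = rng.standard_normal((N, d))
q = rng.standard_normal(d)
block_ids = np.repeat(np.arange(M), N // M)            # contiguous 1D blocks
C = np.eye(M) + 0.1                                    # toy mixing matrix
w = implicit_token_weights(q, K, block_ids, C, query_block=0)
print(w.shape, round(float(w.sum()), 3))               # (64,) 1.0
```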

Step 5: Efficient and streaming-friendly compute (GEMMs)

  • What happens: All major operations (build summaries, multiply by mixing matrix, apply to queries) boil down to fast general matrix multiplications (GEMMs) that GPUs/TPUs handle extremely well. MHLA keeps linear complexity in N, with a small extra term in M that’s typically much smaller than N.
  • Why this step exists: To ensure speed on modern hardware without special kernels or heavy add-ons.
  • Example: On high-res images and long videos, MHLA’s throughput matches linear attention but with much better quality.

🍞 Top Bread (Hook): Imagine using a super-fast calculator that’s already built into your computer to do all the heavy lifting.

🥬 Filling (The Actual Concept):

  • What it is: GEMM Operations are standard, highly optimized matrix multiplications.
  • How it works:
    1. Represent summaries and mixes as matrices.
    2. Multiply them using GPU-tuned GEMMs.
    3. Reuse results per block to avoid extra work.
  • Why it matters: Without GEMMs, we’d need custom, slower computations or add heavy modules that cost time.

🍞 Bottom Bread (Anchor): Like using a race car on a racetrack designed for speed instead of building a new road.
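
A back-of-the-envelope comparison of multiply-add counts shows why this stays fast. The accounting is deliberately simplified (no heads, no constants), d = 64 and M = 16 are my assumptions, and only the ~31,500-token length comes from the paper's video setting.

```python
def softmax_attention_madds(N, d):
    """Roughly: Q @ K^T plus weights @ V, each about N * N * d multiply-adds."""
    return 2 * N * N * d

def mhla_madds(N, d, M):
    build_summaries = N * d * d        # every token folded into its block's (d, d) summary
    mix_summaries = M * M * d * d      # mix the M summaries; tiny because M << N
    apply_queries = N * d * d          # each query times its mixed (d, d) summary
    return build_summaries + mix_summaries + apply_queries

N, d, M = 31_500, 64, 16               # video-scale length from the paper; d and M assumed
print(f"softmax attention: {softmax_attention_madds(N, d):,}")
print(f"MHLA             : {mhla_madds(N, d, M):,}")
```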

Causality and long sequences (optional detail): For language modeling or videos, we can process blocks in order and maintain per-block summaries for streaming. MHLA mixes only past blocks for the current block (causal mask), keeping the same linear-time behavior as chunked linear attention, but with more expressive, query-specific mixtures.
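
A hedged sketch of that chunkwise causal idea: each chunk's queries mix only the summaries of itself and earlier chunks, so the state stays bounded and generation can stream. For brevity this toy ignores causality within a chunk, which a real implementation would also handle.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))          # stand-in kernel feature map

def mhla_causal_stream(chunks, C):
    """Process chunks in order; each chunk mixes only its own and PAST summaries.
    chunks: list of (Q, K, V) tuples, each of shape (B, d). C: (M, M) mixing matrix.
    Within-chunk causality is ignored here to keep the sketch short."""
    S_hist, z_hist, outputs = [], [], []
    for t, (Q, K, V) in enumerate(chunks):
        Qf, Kf = phi(Q), phi(K)
        S_hist.append(Kf.T @ V)                          # this chunk's local KV summary
        z_hist.append(Kf.sum(axis=0))
        c = C[t, :t + 1]                                 # causal mask over blocks
        S_mix = np.tensordot(c, np.stack(S_hist), axes=1)
        z_mix = c @ np.stack(z_hist)
        outputs.append((Qf @ S_mix) / (Qf @ z_mix)[:, None])
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
M, B, d = 8, 32, 16
C = np.eye(M) + 0.1
chunks = [tuple(rng.standard_normal((B, d)) for _ in range(3)) for _ in range(M)]
out = mhla_causal_stream(chunks, C)
print(out.shape)                                         # (256, 16): linear-time, streaming
```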

Putting it together: Input → (Partition) → (Local KV per block) → (Mix per query block) → (Intra-block refine) → Output. If any step is removed, something breaks: no partition (no diversity), no local KV (no locality), no mixing (no query-conditioned view), no refinement (too coarse focus). The secret sauce is the two-stage selectivity—block-level mixing plus token-level weighting—delivered with pure GEMMs so it stays fast.

04 Experiments & Results

The test: The authors checked whether MHLA restores lost expressivity without losing speed. They measured accuracy for classification (ImageNet-1K), quality for image generation (FID, IS, sFID), long-sequence performance for video generation (VBench scores and latency), and language modeling quality (perplexity, commonsense reasoning, MMLU, LongBench).

The competition: MHLA was compared against self-attention (accurate but slow), vanilla linear attention (fast but less expressive), and recent enhanced linear attentions (e.g., Focused/Inline/MALA/RALA/GLA) as well as strong recurrent/state-space baselines (Mamba, Mamba2, GDN). This is like racing a new car against both the fastest traditional car and the most fuel-efficient ones.

The scoreboard with context:

  • Image classification (ImageNet-1K):
    • On DeiT-T and DeiT-S backbones, MHLA beat vanilla linear attention by a wide margin and even outperformed self-attention in final accuracy while keeping FLOPs the same as linear attention. Think of going from a B- to a solid A without extra study time.
    • On VLT models, MHLA set state-of-the-art results among efficient attention backbones, edging past carefully engineered baselines.

  • Image generation (DiT/DiG):
    • Across scales (Small to XL), MHLA consistently lowered FID versus linear attention and often matched or surpassed self-attention quality—yet maintained near-linear-attention throughput. That’s like painting a sharper picture in half the time.
    • On DiT-S at 512px, MHLA achieved better FID while roughly doubling throughput compared to self-attention.
    • On Sana T2I finetuning, MHLA improved FID and CLIP scores over the base and beat the PixArt series on some metrics, converging faster in training.

  • Video generation (Wan2.1-1.3B):
    • With extreme sequence length (~31,500 tokens), vanilla linear attention suffered big quality drops (global context collapse). MHLA preserved quality near the original FlashAttention model while delivering up to 2.1× speedup. A hybrid variant (2/3 layers MHLA) gave a 1.6× speedup with even better overall scores.

  • NLP (0.3–0.34B scale on 10B tokens):
    • MHLA matched or exceeded strong linear/recurrent baselines on commonsense reasoning and achieved top MMLU among the tested efficient models.
    • On LongBench, MHLA excelled especially in Multi-Doc QA, Summarization, and Code, leading the average score across peers.

Surprising findings:

  • No heavy add-ons needed: While some prior fixes use depthwise convs or gating to recover performance, MHLA reached or surpassed self-attention levels at large scales without them. In fact, adding extra modules sometimes hurt at XL scale.
  • Higher rank and lower entropy: MHLA significantly increased the effective rank of the attention scores and reduced entropy (i.e., sharper focus) compared to linear attention—quantitatively confirming the restored expressivity.
  • Sweet spot for head number M: A moderate number of token-level heads (e.g., 16) often gave the best trade-off, keeping overhead negligible while delivering strong gains.

What the numbers mean in plain words:

  • 3.6% accuracy boost on ImageNet: That’s like moving from a solid A- to an A when others stayed the same or slipped.
  • 12.6% FID improvement on image generation: Images look cleaner and more realistic, especially at higher resolutions.
  • 41% improvement in video generation over vanilla linear attention: Motion and details stay consistent across long sequences, without slowing down.
  • Strong NLP reasoning and long-context scores: The model keeps track of long documents better and answers more reliably.

Taken together, the experiments show MHLA keeps the speed of linear attention while bringing back much of the “smart focusing” power of full self-attention.

05 Discussion & Limitations

Limitations and caveats:

  • Choosing M (number of token-level heads): Too few blocks may underfit (not enough diversity); too many add overhead and can blur local summaries. Practical recipes (e.g., M much smaller than N) work well, but tuning may be needed per domain.
  • Block boundaries: Hard block splits can misalign with semantic boundaries (e.g., an object crossing tiles). Although mixing softens this, finer or adaptive partitioning could help further.
  • Extremely global interactions: Some tasks may need rich, fine-grained global patterns that no block mixture fully captures; in such cases, adding a small fraction of full attention layers or hierarchical mixing might help.
  • Coefficient matrix learning: While initialized with a locality bias and stabilized by clipping/normalization, the mixing matrix adds a learned component that must be trained well; poor training or very small datasets could limit gains.
  • Kernel feature choices: Different kernel maps in linear attention can affect stability and accuracy; MHLA inherits these sensitivities and must pick robust ones (sometimes omitting normalization in ultra-long settings helps).

Required resources:

  • Standard GPU/TPU hardware that accelerates GEMMs (which most accelerators do extremely well).
  • Typical memory for linear attention models; MHLA adds minimal overhead for the mixing matrix and per-block summaries.
  • Training time comparable to linear attention; often faster convergence was observed in practice during finetuning.

When not to use MHLA:

  • Very short sequences or tiny models where full self-attention is already cheap and highly accurate; the extra design may not pay off.
  • Ultra-scarce data regimes where learning a mixing matrix is unstable; a simpler attention might suffice.
  • Scenarios where exact softmax attention is mandatory (e.g., certain interpretability pipelines) and compute is not a concern.

Open questions:

  • Adaptive partitioning: Can blocks be learned dynamically to follow objects, topics, or syntax for even better mixing?
  • Multiscale mixing: How best to combine local, mid-range, and global blocks across layers or pyramids?
  • Theory frontiers: Precise bounds linking M, block size, and achievable rank/entropy improvements across tasks.
  • Hybrid stacks: What’s the optimal blend of MHLA and occasional full-attention or convolutional layers for difficult, global-heavy tasks?
  • Long-horizon stability: Best practices for coefficient regularization and kernel choices when context windows run into hundreds of thousands of tokens.

06 Conclusion & Future Work

Three-sentence summary: Linear attention is fast but loses expressivity because it forces all queries to share one global summary, causing global context collapse. MHLA fixes this by splitting tokens into blocks, making local summaries, and learning query-specific mixtures—restoring diversity and sharp focus while keeping linear-time speed. It consistently improves results in vision, video, and language, often matching or beating self-attention quality with linear attention efficiency.

Main achievement: MHLA shows that token-level multi-head partitioning plus learned block mixing can recover much of softmax attention’s expressivity—raising rank and lowering entropy—without adding heavy modules or quadratic cost.

Future directions: Learn the blocks adaptively, expand to multiscale mixing across layers, and explore hybrids that sprinkle in rare full-attention layers for globally delicate tasks. Probe ultra-long streaming scenarios in video and text, refine stability tricks, and strengthen theoretical guarantees about rank growth.

Why remember this: MHLA reframes the trade-off between speed and smarts in attention. It proves you can stay linear-time yet think in a richly selective, query-dependent way—unlocking higher-quality generation, better long-context understanding, and faster models that run at modern scales.

Practical Applications

  • Speed up high-resolution image generation in diffusion transformers while improving FID quality.
  • Accelerate long video generation and editing with better motion consistency at lower latency.
  • Boost accuracy in vision backbones (e.g., DeiT/VLT) without increasing FLOPs or adding heavy modules.
  • Improve long-document reasoning in small/medium language models for summarization and multi-document QA.
  • Enable faster, higher-quality on-device AI by keeping computations to standard GEMMs.
  • Enhance real-time perception in robotics by focusing attention on relevant spatial blocks efficiently.
  • Reduce cloud inference costs for generative services by maintaining linear complexity with better outputs.
  • Upgrade existing linear-attention models (e.g., DiT, GLA-based systems) by swapping in MHLA with minimal code changes.
  • Support streaming/causal workloads (chatbots, live video) via chunkwise MHLA with stable training.
  • Build hybrid stacks that combine MHLA with occasional full-attention layers for tasks needing global precision.
#Multi-Head Linear Attention#Linear Attention#Self-Attention#Global Context Collapse#Key–Value Summary#Attention Rank#Attention Entropy#GEMM#Diffusion Transformer#Long-Context Modeling#Video Generation#ImageNet Classification#Mamba#GLA#FlashAttention