
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

Intermediate
MiniCPM Team, Wenhao An, Yingfa Chen et al. · 2/12/2026
arXiv

Key Summary

  • MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention—sparse and linear—to read very long texts quickly and accurately.
  • It uses a 1:3 ratio (25% sparse, 75% linear) so it keeps important details while staying fast and memory-friendly.
  • A smart layer selection algorithm decides where to place the sparse layers for best results.
  • HyPE (Hybrid Positional Encoding) applies RoPE to linear layers but not to sparse layers, helping the model remember order without hurting long-distance memory.
  • Instead of training from scratch, the team converts a pretrained Transformer using a continual training recipe (HALO + staged training), cutting training cost by about 75%.
  • On long-context tests like RULER, MRCR, and NoLiMa, MiniCPM-SALA outperforms similar-size models and holds up even at 128K–1M tokens.
  • On a single NVIDIA A6000D GPU, it runs up to 3.5× faster than a full-attention 8B model at 256K tokens and can handle 1M-token inputs where others run out of memory.
  • It even extrapolates beyond its training length to 2M tokens with strong scores, without special tricks.
  • General skills (coding, math, knowledge) stay competitive with full-attention models despite the speed and memory savings.
  • This approach makes million-token tasks practical on consumer GPUs, unlocking new long-document and long-horizon applications.

Why This Research Matters

Many real problems involve huge amounts of text: full code repositories, legal contracts, technical manuals, scientific papers, and multi-day chat histories. MiniCPM-SALA shows that we can process these million-token workloads on a single GPU—sometimes even on a consumer card—without sacrificing core skills. This unlocks practical applications like on-device document analysis, repository-scale coding assistance, and long-horizon planning for agents. It also reduces infrastructure cost: instead of giant clusters and full-attention memory walls, you get fast, stable inference on modest hardware. Finally, its conversion pipeline means existing Transformers can be upgraded into hybrids, making the path to long context accessible to many teams.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine trying to read an entire encyclopedia in one sitting. If you tried to pay equal attention to every single word, your brain would get overwhelmed fast. You need a plan: skim most parts quickly and zoom in on the parts that matter.

🥬 Filling (The Actual Concept):

  • What it is: MiniCPM-SALA is a new way for AI to read very long texts by mixing two reading styles—sparse attention (zooming in) and linear attention (skimming)—so it stays fast, accurate, and memory-friendly.
  • How it works (step by step):
    1. Recognize the bottleneck: regular Transformers compare every word to every other word, which gets very slow and memory-hungry as text grows (quadratic cost and huge KV-Cache).
    2. Use sparse attention to carefully focus on the most important connections without checking everything.
    3. Use linear attention to scan through the whole text with a running summary, which keeps cost growing only linearly.
    4. Combine them in one model, placing sparse layers where they help most and linear layers everywhere else, so you get both precision and speed.
    5. Train the hybrid starting from a strong pretrained model, not from scratch, to save time and compute.
  • Why it matters: Without this mix, pure full attention runs out of memory on million-token inputs, pure sparse still stores huge caches, and pure linear may lose important details. The hybrid keeps the strengths and softens the weaknesses.

🍞 Bottom Bread (Anchor): Think of a student preparing for a mega-exam from a 1,000-page binder. They skim most pages (linear attention) but stop to study key diagrams and formulas closely (sparse attention). That’s how MiniCPM-SALA handles ultra-long inputs.
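To make the cost gap concrete, here is a back-of-envelope FLOP comparison between full (quadratic) and linear attention. The head dimension of 128 is an illustrative assumption, not a number from the paper:

```python
def attention_flops(seq_len, d=128):
    """Rough FLOP counts per layer (constants dropped):
    full attention scales as n^2 * d, linear attention as n * d^2."""
    return {"full": seq_len ** 2 * d, "linear": seq_len * d * d}

short = attention_flops(4_000)       # typical chat-length input
long_ = attention_flops(1_000_000)   # million-token input
# At 4K tokens full attention costs ~31x more; at 1M the gap grows to ~7,800x.
print(short["full"] // short["linear"], long_["full"] // long_["linear"])
```

The key point: the linear term stays proportional to length, so the advantage keeps widening as the context grows.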

— New Concept 1 — Sparse Attention 🍞 Top Bread (Hook): You know how you highlight only the key sentences in a textbook instead of rereading the whole thing every time? 🥬 Filling:

  • What it is: Sparse attention lets the model focus computation on selected, important parts (like local windows or special anchor tokens) instead of everywhere at once.
  • How it works:
    1. Break the big attention map into a small pattern (e.g., a sliding window and a few global tokens).
    2. Compute attention only in those spots.
    3. Keep the rest ignored to save compute.
  • Why it matters: It cuts computation a lot, but still needs to store the full KV-Cache of past tokens, so memory can still be heavy at million-token scale. 🍞 Bottom Bread (Anchor): When reading a science chapter, you reread the summary boxes and figures (selected spots), not every paragraph.
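A minimal sketch of one classic sparse pattern, a causal sliding window plus a few global "anchor" tokens. This is illustrative only; InfLLM-V2's actual block selection is more sophisticated, and the window and anchor counts here are made up:

```python
import numpy as np

def sparse_mask(seq_len, window=4, n_global=2):
    """Boolean attention mask: token i may attend to a local window of
    recent tokens plus the first few global anchor tokens (causal)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True               # local sliding window
        mask[i, :min(n_global, i + 1)] = True  # global anchor tokens
    return mask

mask = sparse_mask(64)
full_entries = 64 * 65 // 2  # entries a causal full-attention mask would compute
print(int(mask.sum()), full_entries)  # far fewer positions than full attention
```

Only the `True` positions are computed, which is where the compute savings come from; the KV-Cache for past tokens still has to be kept around, matching the memory caveat above.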

— New Concept 2 — Linear Attention 🍞 Top Bread (Hook): Imagine reading a super-long scroll where you keep a running summary as you go, instead of comparing every sentence to every other sentence. 🥬 Filling:

  • What it is: Linear attention processes text with a running state so the cost grows linearly with length.
  • How it works:
    1. Turn the attention formula into a form that updates a compact state as each new token arrives.
    2. Use that state to compute outputs without storing all past keys/values.
    3. Move through the sequence once, keeping memory small.
  • Why it matters: It’s very fast and memory-light, but compressing history can lose some fine-grained details. 🍞 Bottom Bread (Anchor): It’s like taking quick notes while listening to a long podcast, so you don’t need to replay the entire episode to remember the big ideas.

— New Concept 3 — Hybrid Attention Mechanism 🍞 Top Bread (Hook): Think of wearing reading glasses you can flip: normal view for fast scanning, magnifier for tricky lines. 🥬 Filling:

  • What it is: A single model that mixes sparse attention (detail zoom) and linear attention (fast scan) across layers.
  • How it works:
    1. Decide a ratio: 25% sparse, 75% linear.
    2. Place sparse layers where long-range precision matters most; fill the rest with linear layers for speed.
    3. Train them to cooperate so details are preserved and throughput stays high.
  • Why it matters: It balances accuracy with memory/compute efficiency so million-token inputs become practical. 🍞 Bottom Bread (Anchor): In a huge rulebook, you skim most sections but switch to careful reading for the penalty rules and exceptions.

— New Concept 4 — Layer Selection Algorithm 🍞 Top Bread (Hook): Like a coach choosing which players go on the field at which times to win the game. 🥬 Filling:

  • What it is: A method to choose which specific Transformer layers should be sparse vs. linear to maximize performance.
  • How it works:
    1. Start from a pretrained model.
    2. Keep the first and last layers stable; use an algorithm (from HALO) to pick which middle layers stay precise (sparse) vs. turn linear.
    3. Train so the chosen spots become specialists.
  • Why it matters: Randomly mixing layers is suboptimal; smart placement lifts accuracy without losing speed. 🍞 Bottom Bread (Anchor): A librarian goes straight to the reference shelves for key facts and uses general shelves for background reading, saving time while getting reliable info.
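As a toy stand-in for the real algorithm, here is one simple placement heuristic: keep the edge layers out of the linear pool and spread the remaining sparse layers evenly through the middle. The paper's HALO-based selection actually scores layers; this sketch only shows the shape of the resulting layout:

```python
def place_sparse_layers(n_layers=32, sparse_ratio=0.25):
    """Toy layout: mark ~sparse_ratio of layers as 'sparse', including the
    first and last layer, spacing the rest evenly; everything else is linear.
    (The paper's layer selection algorithm is smarter than this.)"""
    n_sparse = max(2, round(n_layers * sparse_ratio))
    # evenly spaced layer indices from the first layer to the last
    idx = {round(i * (n_layers - 1) / (n_sparse - 1)) for i in range(n_sparse)}
    return ["sparse" if i in idx else "linear" for i in range(n_layers)]

layout = place_sparse_layers()
print(layout.count("sparse"), layout.count("linear"))  # → 8 24, the 1:3 mix
```

With 32 layers and a 0.25 ratio this yields the paper's 1:3 sparse-to-linear mix.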

— New Concept 5 — Hybrid Positional Encoding (HyPE) 🍞 Top Bread (Hook): You know how GPS plus a paper map together make it easier to navigate both big highways and tiny local streets? 🥬 Filling:

  • What it is: A position system that applies RoPE to linear attention layers but removes RoPE in sparse layers to protect long-distance memory.
  • How it works:
    1. Linear layers get RoPE to keep track of relative order during fast scanning.
    2. Sparse layers skip RoPE so distant tokens don’t get their signals weakened by position effects.
    3. Together, the model remembers “where things are” without losing far-away links.
  • Why it matters: Bad positional choices can blur long-range memory; HyPE avoids that and boosts length extrapolation. 🍞 Bottom Bread (Anchor): For a cross-country trip, you use GPS for turn-by-turn (order) but a big atlas for the grand picture (far distances stay clear).
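The HyPE rule itself is tiny: rotate queries/keys with RoPE in linear layers, and leave them untouched (NoPE) in sparse layers. Below is a minimal sketch using the common "rotate-half" RoPE variant; the pairing of dimensions and the base are illustrative assumptions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate-half RoPE: pair dimension i with i + d/2 and rotate each
    pair by an angle proportional to the token position."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

def hype_query(q, pos, layer_kind):
    """HyPE: RoPE only in linear layers; sparse layers skip it (NoPE)
    so distant tokens aren't attenuated by positional rotation."""
    return rope(q, pos) if layer_kind == "linear" else q
```

Since RoPE is a pure rotation, it preserves vector norms, so the linear layers gain order information without changing attention magnitudes; the sparse layers see position-free queries and keys.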

02Core Idea

🍞 Top Bread (Hook): Picture a marathon reader who can flip between speed-reading and deep study, finishing giant books quickly while still catching tiny clues.

🥬 Filling (The Actual Concept):

  • Aha! Moment in one sentence: Mix 25% high-fidelity sparse attention (InfLLM-V2) with 75% globally efficient linear attention (Lightning Attention), place them smartly, and use HyPE so the model stays both fast and accurate on million-token inputs.

Multiple Analogies (3 ways):

  1. Camera analogy: Use a wide-angle lens (linear) for the whole scene and a zoom lens (sparse) for the important details.
  2. City traffic: Highways (linear) move lots of cars fast; local streets with traffic lights (sparse) handle precise turns where accuracy matters.
  3. Study method: Skim most pages (linear), but carefully work through key examples and theorems (sparse).

Before vs. After:

  • Before: Full attention was accurate but hit memory walls; sparse attention saved compute but still stored massive KV-Cache; linear attention was light but sometimes lost details.
  • After: The hybrid keeps strong recall of important links (sparse layers) while cruising through long text (linear layers), so 1M-token contexts run on a single GPU without big quality drops.

Why It Works (intuition, not equations):

  • Sparse layers act like durable memory anchors that preserve exact token-to-token interactions where needed.
  • Linear layers build a smooth running summary so cost grows only linearly, dramatically cutting memory.
  • HyPE gives each part the right sense of position: RoPE for scanning layers, no RoPE for distant-recall layers.
  • QK-Normalization calms spikes in long sequences, and output gates regulate information flow so attention doesn’t get stuck on unhelpful tokens.

Building Blocks (bite-size): — Lightning Attention (linear) 🍞 Hook: Like conveyor-belt reading—fast and steady. 🥬 Filling:

  • What: A linear attention variant with good length generalization.
  • How: Keeps a compact running state instead of full KV-Cache.
  • Why: Makes million-token throughput possible. 🍞 Anchor: Summarize as you go, not after you read everything.

— InfLLM-V2 (sparse) 🍞 Hook: A spotlight you can turn on for key spots. 🥬 Filling:

  • What: Dense-sparse switchable attention with no extra parameters.
  • How: Focus on selected patterns (windows/anchors) to keep precise links.
  • Why: Preserves fine details across long ranges. 🍞 Anchor: Revisit highlighted paragraphs, not the whole book.

— HyPE (positional) 🍞 Hook: GPS for the fast road, atlas for the long trip. 🥬 Filling:

  • What: RoPE in linear layers, no RoPE in sparse layers.
  • How: Keeps local order while protecting far-distance memory.
  • Why: Enables robust length extrapolation. 🍞 Anchor: Know the next turn and the overall route.

— Layer Selection (placement) 🍞 Hook: Put experts where they matter. 🥬 Filling:

  • What: Algorithmically decide which layers become sparse vs. linear.
  • How: Keep edge layers stable, convert middles wisely.
  • Why: Better accuracy than uniform mixing. 🍞 Anchor: Assign the right jobs to the right teammates.

— Conversion + Continual Training 🍞 Hook: Remodel a solid house instead of rebuilding from the ground up. 🥬 Filling:

  • What: Start from a strong Transformer (MiniCPM-4.0) and convert.
  • How: HALO conversion, then staged training on short-to-long sequences and SFT.
  • Why: Saves ~75% training cost vs. from-scratch. 🍞 Anchor: Upgrade an engine rather than crafting a new one from raw metal.

03Methodology

High-level Flow: Input tokens → Embedding → Stack of attention blocks (25% sparse, 75% linear, each followed by FFN) with HyPE, QK-Norm, and output gates → Output tokens.

Step-by-step (what, why, example):

  1. Start from a pretrained backbone.
  • What happens: Load an intermediate MiniCPM-4.0 checkpoint trained on 7T tokens.
  • Why: Strong starting point means less training cost and better stability.
  • Example: Instead of teaching from ABCs, you tutor a student who already knows grammar.
  2. HALO Conversion (Architecture Conversion).
  • What happens: Convert selected softmax-attention layers into Lightning Attention (linear); keep first and last layers unconverted; mark some layers to become sparse later.
  • Why: Minimizes disruption to learned knowledge while preparing the hybrid layout.
  • Example: You reorganize a library’s layout without throwing away the books.
  3. Continual Stable-Training (4K context, sparse off).
  • What happens: Train all parameters so converted linear layers coordinate with embeddings, FFNs, and the remaining attention layers. Use constant LR after warmup, big batch.
  • Why: Smooths the handoff so the model stays stable and accurate at normal lengths.
  • Example: A newly merged team learns to work together on small projects first.
  4. Short-Decay Training (4K context, sparse off).
  • What happens: Train on ~1T tokens with an exponentially decaying LR, emphasizing high-quality and reasoning-rich data (PDF corpora, curated selections, synthetic reasoning).
  • Why: Reinforces general knowledge and reasoning before stretching context.
  • Example: Master core skills before attempting a marathon.
  5. Long-Decay Training (extend to 32K → 160K → 520K, sparse on).
  • What happens: Enable sparse attention; gradually increase sequence length and adjust LR downward; up-sample long-context data.
  • Why: Teaches the hybrid to coordinate sparse (precision) and linear (throughput) at scale.
  • Example: Train for longer and longer hikes, carrying a bigger backpack each time.
  6. Supervised Fine-Tuning (64K → 140K, sparse on).
  • What happens: Use high-quality reasoning tasks (code, math, knowledge, tool use, long-context Q&A). LR warmup then decay.
  • Why: Polishes real-world skills, including cross-document retrieval and long reasoning chains.
  • Example: Practice with realistic mock exams after basic training.

Secret Sauce (clever bits):

  • 1:3 layer mix with algorithmic placement: Most layers are linear for speed; the chosen sparse layers act as precise memory anchors where they matter.
  • HyPE: RoPE in linear layers preserves local order; no RoPE in sparse layers protects far-distance recall and boosts length extrapolation.
  • QK-Normalization: Keeps attention activations stable on very long sequences.
  • Output gates: Prevent attention sink and over-focusing; they throttle outputs to keep learning balanced.
  • Lightning + InfLLM-V2: Linear layers shrink memory needs; sparse layers keep critical token-to-token links without extra parameters.
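The QK-Normalization trick is easy to see in code: RMS-normalizing queries and keys before the dot product puts a hard ceiling on attention logits, no matter how large the raw activations get on long sequences. The learnable scales `gamma_q`/`gamma_k` are assumptions for illustration:

```python
import numpy as np

def qk_norm(q, k, gamma_q=1.0, gamma_k=1.0, eps=1e-6):
    """RMS-normalize q and k so the attention logit q·k is bounded by
    the head dimension, keeping long-sequence activations stable."""
    qn = q / (np.sqrt(np.mean(q ** 2, axis=-1, keepdims=True)) + eps) * gamma_q
    kn = k / (np.sqrt(np.mean(k ** 2, axis=-1, keepdims=True)) + eps) * gamma_k
    return qn, kn
```

After normalization each vector has unit RMS, so by Cauchy-Schwarz the logit can never exceed the head dimension — the spikes that otherwise grow with sequence length simply cannot occur.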

Concrete Data Points (from the pipeline):

  • Conversion stage: ~1.3B tokens at 512 length; only converted linear layers train.
  • Stable + Short-Decay: 314.6B + ~1T tokens at 4K.
  • Long-Decay: 32K (102.2B), 160K (62.9B), 520K (50.6B).
  • SFT: 64K (204.5B), 140K (213.3B) tokens.
  • Total Transformer-to-hybrid cost: ~2T tokens (~25% of training-from-scratch).

Example Walkthrough (256K tokens doc):

  • Prefill: Linear layers stream through the document with compact states; sparse layers selectively form precise links for critical spans.
  • Decode: As the model generates, it needs far less KV memory than full attention because 75% of layers use linear states.
  • Result: Lower TTFT and faster end-to-end latency compared to a full-attention 8B baseline, with strong accuracy on retrieval and reasoning.
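The decode-time memory win follows from simple arithmetic: only the sparse layers keep a KV-Cache, and they are 25% of the stack. The head counts and fp16 byte width below are illustrative assumptions, not MiniCPM-SALA's actual configuration:

```python
def kv_cache_gib(seq_len, n_caching_layers, n_kv_heads=8, head_dim=128,
                 bytes_per=2):
    """KV-Cache size in GiB: seq_len * layers * heads * head_dim,
    times 2 (K and V), times bytes per value (2 for fp16)."""
    return (seq_len * n_caching_layers * n_kv_heads * head_dim
            * 2 * bytes_per) / 2 ** 30

full = kv_cache_gib(1_000_000, 32)    # all 32 layers cache KV
hybrid = kv_cache_gib(1_000_000, 8)   # only the 25% sparse layers do
print(round(full, 1), round(hybrid, 1))  # → 122.1 30.5
```

Under these toy numbers a dense 32-layer cache at 1M tokens already exceeds 100 GiB, while caching only the sparse quarter cuts that by 4×, which is why the hybrid fits where full-attention baselines hit OOM.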

04Experiments & Results

The Test (what and why):

  • General skills: Knowledge (CMMLU, MMLU-Pro), coding (HumanEval, MBPP, LCB), math (AIME24/25), and other reasoning sets (BBH, IFEval) to ensure the hybrid didn’t harm everyday abilities.
  • Long context: RULER, MRCR, NoLiMa to check retrieval, reasoning, and stability as context grows from 64K to 128K and beyond.
  • Speed and memory: TTFT and end-to-end latency across 64K–1M tokens on A6000D (96GB) and RTX 5090 (32GB), both non-quantized and GPTQ INT4.

The Competition (similar-size baselines): Qwen3-8B, Nemotron-Nano-v2-9B, MiniCPM-4.1-8B, Ministral-3-R-8B, Falcon-H1R-7B.

The Scoreboard (with context):

  • General capability: Average score ~76.5, competitive among open 7B–9B models. Coding is strong (HumanEval ~95.1, MBPP ~89.1). Math is solid (AIME24 ~83.8; AIME25 ~78.3). Translation: The hybrid stays sharp on regular tasks.
  • Long context: RULER at 128K ~89.4; NoLiMa at 128K ~23.9—better than peers, showing fewer quality drop-offs with length. Overall long-context average ~39.0 beats comparable baselines.
  • Ultra-long extrapolation: Despite training to 520K, the model scores ~87.1 at 512K, ~86.3 at 1,000K, and ~81.6 at 2,048K—no special tricks like YaRN needed.
  • Speed and memory: On A6000D at 256K, TTFT drops from ~180.8s (full-attention baseline) to ~51.6s—about 3.5× faster. Qwen3-8B hits OOM at 512K–1,024K, but MiniCPM-SALA completes inference. On RTX 5090 (32GB), the baseline OOMs even earlier (128K non-quantized), while MiniCPM-SALA still reaches 1M tokens.

Surprising Findings:

  • A 9B hybrid outperforms even an 80B model at 1M tokens on RULER (86.3 vs. 80.3), showing that smarter attention plus training can beat sheer size for long contexts.
  • HyPE’s “RoPE for linear, no RoPE for sparse” design appears key to stable length extrapolation—distant memory stays crisp instead of fading.
  • Converting a strong Transformer via continual training preserved general skills while unlocking ultra-long windows, confirming this is a practical path for other models too.

05Discussion & Limitations

Limitations:

  • Task variability: The best sparse/linear placement may differ across domains; the fixed 1:3 ratio is strong on average but might be suboptimal for some niche tasks.
  • Compression trade-offs: Linear layers still compress history, so rare edge cases needing exact, dense pairwise matching across the entire sequence might prefer more sparse or full attention.
  • Tuning complexity: The multi-stage schedule (HALO, stable, decay phases, SFT) requires careful engineering and data curation.
  • Benchmark gaps: While general skills hold up, some benchmarks (e.g., MMLU-Pro) don’t top the board; maximal short-context peak scores may still come from full-attention or specialized models.

Required Resources:

  • Training: Multi-stage compute over ~2T tokens; careful LR/batch schedules; curated long-context and reasoning data.
  • Inference: For 1M-token contexts, 32–96GB GPUs are recommended; INT4 quantization can help on consumer cards.

When NOT to Use:

  • Very short inputs (e.g., chat snippets under a few thousand tokens) where full attention is already fast enough and simpler.
  • Tasks needing uniformly precise, dense interactions across all tokens (e.g., certain algorithmic or exact-matching problems) may benefit from more sparse or full-attention layers.
  • Extremely low-latency streaming on tiny devices with strict power constraints may prefer lightweight SSMs or distilled small models.

Open Questions:

  • Dynamic allocation: Can the model choose sparse vs. linear layers on-the-fly per input or per head?
  • Memory policy: Can we further shrink or tier the sparse KV-Cache with learned eviction or compression?
  • Architecture combos: How best to mix hybrids with MoE or SSMs for even larger gains?
  • Data strategy: What’s the optimal blend of long-context pretraining data for stability, reasoning, and retrieval without overfitting to synthetic patterns?

06Conclusion & Future Work

Three-Sentence Summary: MiniCPM-SALA mixes 25% sparse attention with 75% linear attention, placed by a layer selection algorithm and guided by HyPE, to handle million-token contexts efficiently. It converts a pretrained Transformer via continual training, preserving general skills while cutting training cost by ~75% versus starting from scratch. The result is a 9B model that runs up to 3.5× faster than a full-attention 8B baseline at 256K and cleanly scales to 1M+ tokens on single GPUs.

Main Achievement: Showing that a thoughtfully engineered sparse-linear hybrid can match or exceed full-attention baselines on long-context tasks, while delivering major speed and memory wins—and doing so with a cost-effective conversion pipeline.

Future Directions: Make the hybrid adaptive (dynamic layer selection), explore finer-grained gating and normalization, combine with MoE/SSMs, and refine data curricula for even better extrapolation. Better KV policies and positional designs could push stable context windows further while staying accurate.

Why Remember This: It’s a blueprint for practical million-token AI on everyday GPUs—proof that smart attention design and efficient training can beat brute-force scaling, opening the door to real-world long-doc understanding, codebase reasoning, and long-horizon agents.

Practical Applications

  • Analyze entire software repositories to answer cross-file coding questions and suggest refactors.
  • Search, summarize, and cross-reference very long legal or policy documents on a single GPU.
  • Support agents that remember multi-day conversations and evolving task states without forgetting early details.
  • Process giant technical manuals to power smart troubleshooting assistants for field technicians.
  • Review thousands of research papers to map trends, compare methods, and extract key findings.
  • Provide on-device long-document Q&A and summarization for privacy-sensitive industries.
  • Plan complex projects end-to-end with timelines, dependencies, and evolving updates in one giant context.
  • Handle call center knowledge bases and logs to offer accurate, context-aware support in real time.
  • Index and retrieve from massive archival records (news, scientific literature, government documents).
  • Create long-form writing (books, reports) that remains consistent across chapters and references.
#long-context modeling#sparse attention#linear attention#hybrid attention#Lightning Attention#InfLLM-V2#HyPE#HALO#KV-Cache#QK-Normalization#output gates#RULER#MRCR#NoLiMa#1M tokens