MiMo-V2-Flash Technical Report
Key Summary
- MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.
- It mixes two kinds of attention—small local “windows” most of the time and full “global” views sometimes—so it handles very long texts without slowing down too much.
- A special helper called Multi-Token Prediction (MTP) lets the model guess several next words at once and then quickly check them, speeding up answers by up to about 2.6× in tests.
- The model learns from many expert teachers at once through a method called MOPD, which gives feedback on every token, helping it master coding, math, tools, and more without forgetting.
- It was trained on 27 trillion tokens, can read 32k tokens natively, and was later stretched to 256k while keeping strong long-context skills.
- On tough real-world coding tests like SWE-Bench Verified, it solves about 73 out of 100 issues, leading among open models of similar or larger size.
- Thanks to its “sliding-window attention” and a clever bias called an attention sink, it keeps quality high even with small windows (128 tokens) and a 5:1 local-to-global pattern.
- Its expert system (MoE) uses 309B total parameters, but only 15B “wake up” per token, saving compute while keeping brains where needed.
- During RL training and agent tasks, MTP reduces rollout bottlenecks by parallelizing token guesses, which keeps GPUs busy and learning efficient.
- MiMo-V2-Flash is open-sourced (including the 3-layer MTP head), inviting the community to build faster, smarter agents.
Why This Research Matters
Real-world tasks like fixing software bugs, studying long research papers, or planning multi-step actions need models that are both smart and fast. MiMo-V2-Flash shows how to mix local and global attention to handle long contexts without huge slowdowns. Its multi-teacher learning keeps skills balanced, so it can code, reason, and use tools without forgetting older abilities. Multi-token prediction cuts waiting time, making chat and agent responses noticeably snappier. Open-sourcing the model and its MTP heads lets researchers and developers build practical, high-speed agents for everyday work. By improving both brains and speed, it makes advanced AI more useful and affordable. This helps teams deploy capable assistants that handle complex workflows, not just short chats.
Detailed Explanation
01 Background & Problem Definition
You know how when you read a really long book, you skim most pages but stop and carefully read the important parts? Before models like MiMo-V2-Flash, AI tried to carefully read everything, everywhere, all at once. That worked for short notes, but for huge documents, code bases, or long conversations, it got slow and pricey, and sometimes the model lost track of the big picture.
The World Before: Large language models (LLMs) were great chatters, but three frictions kept showing up:
- Long-context slowdown: Full attention (looking at every word against every other word) becomes very expensive as text grows. It’s like comparing each sticker in a giant sticker book with every other sticker just to decide which one to look at.
- Reasoning vs. speed trade-off: Models that reasoned better often ran slower, especially in reinforcement learning (RL) rollouts where the model has to act step-by-step in real tools or code environments.
- Post-training “see-saw”: When researchers trained a model to be great at one thing (say, math), it sometimes got worse at another (like creative writing). Tuning one skill could un-tune another.
The Problem: We need a model that can (1) think deeply, (2) use tools like a capable agent, and (3) handle very long inputs—all while staying fast and affordable. Long contexts are essential for real coding tasks (entire repositories), research papers, law documents, or multi-turn tool use. But paying the full price of global attention everywhere is like turning on every light in a stadium to find a pencil.
Failed Attempts:
- All-global attention: Accurate but too slow and memory-hungry for very long contexts.
- Only-local attention: Fast, but misses far-away clues, so it can forget important info earlier in the text.
- Simple distillation from one teacher: Learns a specialty well but tends to lose balance across tasks; the “see-saw” persists.
- RL with big batches only: Keeps GPUs busy but can be less stable and wastes compute when tasks vary in length.
The Gap: We were missing a design that smartly mixes local focus with occasional global checks, plus a post-training method to blend many expert skills without losing any, and an inference trick to generate tokens faster than one-at-a-time.
Real Stakes: In everyday life, this matters because:
- Coding agents need to open files, run tests, and fix real bugs that may require reading thousands of lines across many files.
- Students and professionals need help with math and science problems that involve multi-step reasoning and references to earlier parts.
- Browsing and tool-using agents must remember what happened many steps ago while planning next actions.
- Slow models cost more to run and can’t respond quickly; fast-but-shallow models make more mistakes.
Enter MiMo-V2-Flash: It’s built to be fast and strong at the same time. It uses a hybrid sliding-window attention so most layers look locally (cheap) and some look globally (smart). It adds a helpful bias (attention sink) so the small windows don’t lose quality. It speeds up decoding using Multi-Token Prediction (MTP) so it can guess several next words at once. And it learns from multiple specialized teachers (MOPD), getting token-by-token advice from experts in math, coding, safety, search, and more. Altogether, it turns the old trade-offs into a win-win where the model stays sharp across tasks but answers quicker, especially on long contexts and in RL agents.
02 Core Idea
Aha! Moment in one sentence: Mix mostly-local attention with occasional global attention, teach the model from multiple expert teachers at the token level, and use multi-token prediction to draft and verify several words at once—so you get long-context strength, stable multi-skill learning, and speed.
Multiple Analogies:
- City navigation: Most of the time you follow local street signs (sliding windows), but at key junctions you check a city map (global attention). You also learn from many tour guides (multi-teacher distillation) and drive in a carpool lane that lets you jump multiple blocks at a time (MTP).
- Cooking a feast: You focus on each dish near you (local attention), occasionally step back to check the entire menu (global attention), learn recipes from several master chefs (MOPD), and prep multiple ingredients in parallel (MTP) to serve faster.
- Sports team: Most plays are handled by nearby teammates (local), but the coach sometimes calls a play that considers the full field (global). Several specialist coaches train the team (MOPD), and players practice sequences of moves at once (MTP) to speed up execution.
Before vs After:
- Before: All-global made long inputs slow; single-teacher training caused see-saw skill trade-offs; decoding one token at a time kept GPUs idle and RL rollouts slow.
- After: Hybrid attention balances speed and memory; many teachers give dense feedback so skills add up instead of trading off; MTP drafts multiple tokens, getting 2.6× speedups with high acceptance where text is predictable.
Why It Works (intuition, no equations):
- Local windows act like built-in regularization: they keep each layer focused on nearby info, avoiding getting distracted by far-away noise.
- A few global layers restore long-distance links, so the model still catches distant clues.
- The learnable attention sink is like a safe drain for unhelpful tokens, preventing cluttered focus.
- MTP raises arithmetic intensity: instead of doing a tiny step many times, it proposes several steps and verifies them in parallel (see the sketch after this list).
- Multi-teacher, on-policy distillation gives token-by-token hints right where the student is already looking, so learning is stable and efficient.
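To make the arithmetic-intensity point concrete, here is a back-of-the-envelope sketch in Python. Only the ~3.6 acceptance length and the 3 draft heads come from the report; the assumption that one light draft step costs about 10% of a full-model forward pass is purely illustrative.

```python
# Toy throughput model for draft-and-verify decoding. The 0.1 draft
# cost (10% of a full forward pass) is a guess, not a reported number.

def speedup(accept_len: float, num_draft_steps: int, draft_cost: float) -> float:
    """Accepted tokens per unit of full-model compute, relative to
    plain one-token-at-a-time decoding (which yields 1 token per unit)."""
    cycle_cost = num_draft_steps * draft_cost + 1.0  # cheap drafts + 1 verify pass
    return accept_len / cycle_cost

print(speedup(accept_len=3.6, num_draft_steps=3, draft_cost=0.1))  # ≈ 2.77
```

Under these toy assumptions the estimate lands near the reported ~2.6×; real speedups also depend on batch size and memory bandwidth.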
Building Blocks (Sandwich explanations):
🍞 Top Bread (Hook): Imagine you have a simple calculator that passes numbers straight through pipes.
🥬 Filling (FFN – Feed-Forward Network):
- What: An FFN is a stack of simple layers that transforms inputs in one forward pass, with no loops.
- How: It takes a vector in, stretches/squeezes it with learned weights, applies an activation, then projects it back.
- Why: Without FFNs, the model can’t reshape information between attention steps, making it too weak.
🍞 Bottom Bread (Anchor): In MiMo-V2-Flash, FFNs appear inside both dense layers and experts to refine meanings after attention.
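A minimal sketch of the filling above; the dimensions and the SiLU activation are illustrative assumptions, not the model's actual configuration:

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Expand, activate, project back: the 'stretch/squeeze' described above.
    Sizes here are placeholders, not MiMo-V2-Flash's real dimensions."""
    def __init__(self, d_model: int = 1024, d_hidden: int = 4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # stretch the vector
        self.act = nn.SiLU()                      # nonlinearity (assumed choice)
        self.down = nn.Linear(d_hidden, d_model)  # squeeze it back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```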
🍞 Hook: Training a puppy with treats.
🥬 Filling (RL – Reinforcement Learning):
- What: RL teaches by rewarding good actions and giving less reward for bad ones.
- How: The model tries steps, gets scores, and updates to do better next time.
- Why: Without RL, the model struggles with multi-step tasks like debugging code or searching the web.
🍞 Anchor: MiMo-V2-Flash learns coding and tool-use by running commands, seeing test results, and improving.
🍞 Hook: Reading a long comic with a sliding magnifying glass.
🥬 Filling (SWA – Sliding Window Attention):
- What: The model mainly looks at a small nearby span of tokens (a window) instead of the whole text.
- How: Each layer focuses on, say, the 128 surrounding tokens.
- Why: Without SWA, attention cost explodes on long inputs.
🍞 Anchor: MiMo-V2-Flash uses 128-token windows in most layers to keep memory and compute low.
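A sketch of the windowed causal mask under the standard formulation (each token sees itself plus the previous window − 1 positions); only the 128-token window size comes from the report:

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """True where query i may attend to key j: causal, and at most
    `window` positions back. Apply by setting False entries to -inf
    before the attention softmax."""
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]       # how far back key j is from query i
    return (dist >= 0) & (dist < window)     # causal AND inside the window
```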
🍞 Hook: Guessing the next three words of a sentence, not just one.
🥬 Filling (MTP – Multi-Token Prediction):
- What: The model predicts several future tokens in one go and then checks them.
- How: Light MTP heads propose drafts; the main model verifies them in parallel.
- Why: Without MTP, decoding is one-at-a-time and slow; RL rollouts bottleneck.
🍞 Anchor: With 3 MTP layers, MiMo-V2-Flash sees up to ~3.6 tokens accepted and about 2.6× speedup.
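To show the draft-and-verify flow, here is a hypothetical greedy decoding step; `draft_heads`, `predict`, and `predict_parallel` are stand-ins for illustration, not the released model's API:

```python
def mtp_decode_step(main_model, draft_heads, ctx):
    """One speculative cycle: cheap drafts, one parallel verification,
    keep the agreed prefix. Always advances by at least one token."""
    # 1) Each light head extends the draft by one token (3 heads -> 3 drafts).
    drafts = []
    for head in draft_heads:
        drafts.append(head.predict(ctx + drafts))
    # 2) One full-model forward pass scores the context end and every
    #    draft position at once, returning len(drafts) + 1 predictions.
    verified = main_model.predict_parallel(ctx, drafts)
    # 3) Accept the longest prefix where drafts match the main model,
    #    then append the main model's own token at the first mismatch.
    accepted = []
    for d, v in zip(drafts, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(verified[len(accepted)])
    return accepted
```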
🍞 Hook: A hospital with many specialists but only a few needed per patient.
🥬 Filling (MoE – Mixture-of-Experts):
- What: Many expert FFNs exist; a router picks a small subset per token.
- How: Tokens are routed to 8 of 256 experts, saving compute.
- Why: Without MoE, you’d run every expert every time, which is wasteful.
🍞 Anchor: MiMo-V2-Flash has 309B total parameters but only 15B are active per token.
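A sketch of top-k routing under common MoE conventions; the 8-of-256 figures match the report, while the gating and weight-normalization details are assumptions:

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Score all experts per token, keep the best k; only those k expert
    FFNs run for that token. Gating details here are generic, not the
    report's exact recipe."""
    def __init__(self, d_model: int = 1024, n_experts: int = 256, k: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        scores = self.gate(x)                          # (n_tokens, n_experts)
        weights, expert_idx = torch.topk(scores, self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)       # renormalize the chosen 8
        return weights, expert_idx                     # run only these experts
```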
🍞 Hook: Zooming in most of the time, zooming out sometimes.
🥬 Filling (Hybrid Attention Architecture):
- What: Interleave many SWA layers with fewer global-attention layers.
- How: A 5:1 pattern (five SWA blocks, then one global block) repeats.
- Why: Without hybrid, you either lose long-range info (all-local) or waste compute (all-global).
🍞 Anchor: This design cuts KV cache and attention compute by nearly 6× for long contexts.
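The repeating pattern can be sketched as below. Note that a pure 5:1 pattern over 8 groups yields 40 SWA + 8 GA layers, whereas the report's stated split is 39 SWA + 9 GA, so at least one group presumably deviates; this sketch shows only the idealized repetition:

```python
def hybrid_schedule(groups: int = 8, swa_per_group: int = 5) -> list[str]:
    """Idealized 5:1 interleaving: five sliding-window blocks, then one
    global block, repeated for each hybrid group (48 layers total)."""
    layers: list[str] = []
    for _ in range(groups):
        layers += ["SWA"] * swa_per_group + ["GA"]
    return layers

print(hybrid_schedule()[:6])  # ['SWA', 'SWA', 'SWA', 'SWA', 'SWA', 'GA']
```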
🍞 Hook: A librarian who can throw unhelpful notes into a bin.
🥬 Filling (Learnable Attention Sink Bias):
- What: A learned bias that lets attention safely “ignore” unhelpful tokens.
- How: It competes in the attention softmax so distractions get low weight.
- Why: Without it, small windows can lose quality; with it, hybrid models match or beat full-global baselines.
🍞 Anchor: Using this sink, MiMo-V2-Flash’s 128-window hybrid outperforms larger-window variants.
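One common way to realize a sink, shown as a sketch: a learned logit enters the softmax alongside the real keys, so a query can park probability mass on "nothing". The report's exact parameterization may differ:

```python
import torch

def attention_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    """scores: (n_queries, n_keys) pre-softmax logits; sink_logit: a learned
    scalar. The sink competes in the softmax, so the weights on real keys
    can sum to less than 1 when nothing is worth attending to."""
    sink = sink_logit.expand(scores.size(0), 1)                 # one sink per query
    probs = torch.softmax(torch.cat([sink, scores], dim=-1), dim=-1)
    return probs[:, 1:]                                         # drop the sink column
```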
🍞 Hook: Taking classes from many top tutors at once.
🥬 Filling (MOPD – Multi-Teacher On-Policy Distillation):
- What: The student learns from multiple specialized teachers with token-level guidance while sampling from its own policy.
- How: During rollouts, teachers provide dense KL-based rewards; outcome rewards can be added, too.
- Why: Without MOPD, improving one skill can harm another; MOPD preserves peak skills across domains.
🍞 Anchor: After MOPD, the model matches or exceeds the best teacher on many math and code benchmarks.
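A minimal sketch of the dense signal, assuming the per-token reward is a negated KL divergence between the teacher's and student's next-token distributions; the report's exact reward shaping and importance-sampling filter are not reproduced here:

```python
import torch
import torch.nn.functional as F

def token_kl_rewards(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """One reward per sampled token: higher when the student's distribution
    is close to the domain teacher's. Shapes: (n_tokens, vocab_size)."""
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher_logits, dim=-1)
    kl = (t.exp() * (t - s)).sum(dim=-1)   # KL(teacher || student), per token
    return -kl                             # dense, token-level reward
```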
03 Methodology
At a high level: Data → Pre-training (hybrid attention + MoE + MTP head) → Context extension (32k → 256k) → Supervised Fine-Tuning → Specialized RL Teachers → MOPD (student learns from many teachers with token-level rewards) → Fast Inference with MTP.
Step-by-step, like a recipe:
1. Build the backbone with efficient attention and experts
   - What happens: Construct a Transformer with Mixture-of-Experts FFNs and a hybrid attention schedule: five Sliding Window Attention (SWA) blocks followed by one Global Attention (GA) block, repeated across 8 hybrid groups (total 48 layers: 39 SWA + 9 GA). Use Grouped-Query Attention to save memory, RoPE for positions, and FP8 mixed precision for speed and stability.
   - Why this exists: Long inputs make full attention too expensive; the 5:1 ratio makes most layers cheap while a few layers reconnect long-range info. MoE keeps total capacity high but compute low by activating only ~15B of 309B parameters per token (rough cache arithmetic appears in the sketch after this recipe).
   - Example data: For a 64k-token repository, most layers attend locally to nearby files/lines, while global layers connect related functions across folders.
2. Add a lightweight Multi-Token Prediction (MTP) head
   - What happens: During pre-training, one light MTP head learns to predict multiple next tokens. In post-training, that head is replicated into 3 heads for multi-step drafts. The MTP head uses SWA and a small dense FFN (no MoE) so drafting is cheap.
   - Why this exists: Drafting multiple tokens increases arithmetic intensity and reduces decoding time. The lightweight design avoids becoming a new bottleneck.
   - Example: When generating code, MTP proposes several likely next tokens for a loop; the main model verifies and accepts most of them, jumping ahead.
3. Pre-train on 27T tokens with a long-range data bias
   - What happens: Train at 32k context on diverse high-quality data (web, books, papers, math, code) with an emphasis on long-range dependencies (e.g., repo-level code, PRs, issues). Use FP8 mixed precision and MoE routing. Later extend to 256k with schedule adjustments.
   - Why this exists: Pre-training teaches the “language of everything,” especially long-form patterns needed for retrieval and reasoning across large contexts.
   - Example: The model learns that a function definition early in a file explains an error thrown much later.
4. Supervised Fine-Tuning (SFT)
   - What happens: Teach the base model to follow instructions in many domains (reasoning, coding, agents, thinking/no-thinking modes). Carefully tune MoE stability using metrics like the number of zero-grad parameters and hyperparameters like the expert-bias update rate.
   - Why this exists: SFT activates helpful formats, styles, and task structures so the model communicates clearly and obeys prompts.
   - Example: The model learns to show its work in math when asked for chain-of-thought, or to provide final answers in code tasks with the right format.
5. Train specialized teacher models with RL (agentic and non-agentic)
   - What happens: Create domain teachers for code, search, tool-use, math, general reasoning, and safety. For code, automatically set up environments (Kubernetes, Docker), run tests, and reward passing fixes. For web dev, verify by rendered videos. For search, build fact-graph queries with clear answers. Scale environments to 100k+ tasks.
   - Why this exists: Specialized RL lets each teacher become best-in-class in its area using the right rewards and environments.
   - Example: A code teacher repeatedly edits files, runs unit tests, and learns which edit patterns resolve failures fastest.
6. Multi-Teacher On-Policy Distillation (MOPD)
   - What happens: The student samples from its own policy and, for each token, gets dense, token-level KL rewards from the domain teacher for that prompt type; outcome-based rewards (like GRPO) can be added. Importance sampling filters out off-distribution tokens.
   - Why this exists: On-policy, token-level guidance merges skills without the see-saw. Dense feedback is more informative than sparse final scores alone.
   - Example: On a math step, the math teacher nudges the student’s next-token probabilities toward the teacher’s distribution, keeping the reasoning path on track.
7. Inference-time acceleration with MTP
   - What happens: At runtime, the 3 MTP heads propose drafts; the main model verifies them in parallel. Average acceptance length depends on uncertainty (entropy). More predictable text (like boilerplate code) yields longer accepted runs (up to ~3.6 tokens), boosting speed to ~2.6×.
   - Why this exists: Real deployments care about latency and cost. MTP makes the same GPUs deliver more tokens per second without extra hardware.
   - Example: While generating a webpage template, many tokens are easy and accepted in chunks, so responses stream faster.
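To put rough numbers on step 1's design (the cache arithmetic promised above): a per-token KV-cache estimate that deliberately ignores head counts, GQA, and numeric precision, since those scale both designs equally. The layer counts and window size come from the report; everything else is simplification:

```python
def kv_entries(n_global: int, n_swa: int, ctx_len: int, window: int = 128) -> int:
    """A global layer caches keys/values for the whole context; a sliding-
    window layer only keeps the most recent `window` positions."""
    return n_global * ctx_len + n_swa * min(window, ctx_len)

ctx = 256_000
full = kv_entries(48, 0, ctx)        # hypothetical all-global 48-layer model
hybrid = kv_entries(9, 39, ctx)      # the report's 9 GA + 39 SWA split
print(full / hybrid)                 # ≈ 5.3× smaller cache, in the ballpark
                                     # of the report's "nearly 6×" claim
```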
The Secret Sauce:
- Hybrid SWA + global attention with a learnable attention sink: It makes a small window work surprisingly well, drastically cutting KV-cache and compute while preserving long-range understanding.
- MOPD’s token-level, multi-teacher signals: It’s like getting graded on every word by the right tutor for the job, so the model learns precisely and keeps multiple skills.
- Lightweight MTP used as a native draft: Instead of a separate draft model, the built-in MTP is tuned to be fast and accurate, unlocking large speedups in both inference and RL rollouts.
04 Experiments & Results
The Test: The team measured three big things.
- Capability: How well does the model do on tough benchmarks in math, science, coding, reasoning, and writing?
- Long-context strength: Can it retrieve and reason across 32k–256k tokens without falling apart?
- Speed: Does MTP really make decoding and RL rollouts faster in practice?
The Competition: MiMo-V2-Flash is compared with strong open models like DeepSeek-V3.2 and Kimi-K2 (including their “Thinking” models), and with closed models like Gemini 3.0 Pro, Claude Sonnet 4.5, and GPT-5-High on some tests.
The Scoreboard (with context):
- SWE-Bench Verified ~73.4%: That’s like fixing 73 out of 100 real GitHub issues—top-tier among open models and close to frontier systems.
- SWE-Bench Multilingual ~71.7%: Leading results among open models, showing strong cross-language coding ability.
- AIME 2025 ~94.1%: Competitive high-end math performance, like acing a tough contest.
- GPQA-Diamond ~84.3%: Strong graduate-level science Q&A, showing deep, technical understanding.
- LiveCodeBench-v6 ~85.1%: Excellent code-generation and repair skills in a contamination-safe setting.
- MMLU-Pro ~84.9%: High performance on a more robust general knowledge/reasoning suite.
- LongBench V2 and MRCR: The hybrid SWA architecture stays robust, surpassing larger full-attention baselines, with near-100% retrieval from 32k to 256k in tests.
Speed and Efficiency:
- MTP Acceptance Length up to ~3.6: On low-uncertainty text (like web-dev scaffolding), the model often accepts 3–4 drafted tokens at once.
- Decoding Speedup up to ~2.6×: With 3 MTP layers, speedup grows with acceptance length; this holds across reasonable batch sizes.
- RL Rollouts: MTP parallelism keeps GPUs busy even when tasks have long tails, reducing idleness and cutting time-to-train.
Surprising Findings:
- Smaller sliding windows (128) with the attention sink outperformed larger windows (512) in long-context SFT, and even beat all-global baselines on several tasks. This suggests a “divide-and-conquer” dynamic where local layers learn local patterns well and global layers handle distant links cleanly.
- Large-scale code-agent RL generalized to other domains (math, search, reasoning), hinting that agentic training teaches transferable problem-solving skills.
- In some creative-writing settings, optimizing for precise reasoning can slightly reduce style scores versus certain baselines, showing the classic balance between logic and flair.
Bottom line: MiMo-V2-Flash reaches or exceeds the level of larger open models on many reasoning and agent tasks, while being markedly faster in practice thanks to MTP—especially on long contexts where the hybrid attention shines.
05 Discussion & Limitations
Limitations (be specific):
- Knowledge capacity: With 15B active parameters (309B total MoE), it may carry less raw encyclopedic memory than the very largest dense models, which can matter for niche facts.
- Creative writing trade-offs: Optimizing hard reasoning and tool-use sometimes slightly reduces creative flair compared to models tuned primarily for style.
- Teacher quality dependency: MOPD’s success depends on having strong, well-aligned domain teachers and reliable outcome reward models; weak teachers can transfer weaknesses.
- Architecture scope: The hybrid SWA + sink setup works very well here, but the space of attention designs is vast; more exploration could find even better ratios or schedules.
- Extreme long noise: While 256k contexts are supported, very noisy or adversarially long inputs can still distract any model; careful prompting and context management help.
Required Resources:
- Training: FP8 mixed precision across large clusters; MoE-friendly frameworks; data pipelines for 27T tokens; RL infrastructure with containers, Kubernetes, and scalable verifiers.
- Post-training: Teacher training compute for multiple domains; orchestration for MOPD rollouts; tool managers and schedulers for high utilization.
- Inference: Engines that support MTP speculative decoding efficiently; KV-cache aware scheduling for long contexts.
When NOT to Use:
- Tiny edge devices with strict memory limits and no acceleration; lighter models may be better.
- Purely stylistic or open-ended creative writing as the main goal; a style-tuned model might outperform.
- Tasks requiring offline, fully-audited factual databases where retrieval-augmented systems or specialized knowledge graphs are essential.
Open Questions:
- What is the optimal SWA:GA ratio across sizes and domains? Does 5:1 generalize, or could adaptive schedules work better?
- Can the attention sink mechanism be further understood theoretically to guide design choices?
- How far can MTP scale (more heads/steps) before accuracy dips outweigh speed gains?
- What are the best practices for teacher-student co-evolution cycles over many generations?
- How to automatically detect and prevent reward hacking across diverse RL environments?
06 Conclusion & Future Work
Three-Sentence Summary: MiMo-V2-Flash combines a hybrid sliding-window-plus-global attention design, multi-teacher on-policy distillation, and lightweight multi-token prediction to deliver fast, strong reasoning and agent abilities on very long contexts. It matches or beats larger open models on many math, coding, and tool-use benchmarks while achieving up to ~2.6× decoding speedups with 3 MTP layers. Open-sourced weights (including MTP heads) make it a practical, community-ready foundation for advanced agents.
Main Achievement: Showing that a small-window hybrid attention (with a learnable attention sink), paired with MOPD and MTP, can turn the old speed-versus-strength trade-off into a combined win: long-context power, multi-domain mastery, and real deployment speed.
Future Directions:
- Scale teacher-student co-evolution in MOPD to keep pushing multi-skill mastery.
- Explore adaptive hybrid schedules and smarter sinks for even better long-context efficiency.
- Expand RL environments and verifiers for broader tool-use, safety, and planning.
- Tune MTP depth and acceptance strategies for task-aware speed/accuracy control.
Why Remember This: MiMo-V2-Flash reframes how we think about long-context, multi-skill LLMs—it shows that careful attention design, token-level multi-teacher learning, and built-in speculative decoding can make models not just smarter, but truly faster and more useful in the wild.
Practical Applications
- Automated bug fixing across large repositories where the model reads, edits, and tests code end-to-end.
- Long-document analysis and summarization for legal, medical, or technical reports up to hundreds of thousands of tokens.
- Agentic web browsing that plans searches, opens pages, and extracts answers with tool-use verification.
- Data engineering assistants that navigate multi-file ETL pipelines and suggest safe code changes.
- Education helpers that solve multi-step math problems, show work, and reference earlier parts of lessons.
- Terminal and DevOps agents that execute commands, interpret logs, and repair configurations.
- Productivity copilots that draft documents faster via MTP and keep style consistent over long contexts.
- Research aides that connect insights across many papers, retaining citations and long-range references.
- Web development agents that generate and verify front-end code via video-based render checks.
- Interactive tutoring systems that blend reasoning and code execution to teach programming concepts.