Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Key Summary
- Nemotron 3 Nano is a new open-source language model that mixes two brain styles (Mamba and Transformer) and adds a team of special experts (MoE) so it thinks better while running much faster.
- It was taught on 25 trillion tokens and then carefully coached with examples (SFT), practice-with-scores in many sandboxes (RLVR), and human-preference style rewards (RLHF).
- Only about 3.2B of its 31.6B parameters wake up for each step, so it’s efficient without losing accuracy.
- On tough benchmarks for math, code, tools, and chat, it matches or beats similarly sized open models while delivering up to 3.3× higher throughput in an 8K-in/16K-out test.
- It can read super long inputs (up to 1,000,000 tokens) and still answer well, after a special long-context training phase.
- A smart chat template gives “reasoning on/off” and “reasoning budget” control, so it can think briefly or deeply when you ask it to.
- Reinforcement learning across many environments at once improves lots of skills together, instead of helping one skill and hurting others.
- A careful FP8 post-training quantization keeps almost all the accuracy (about 99% median recovery) but speeds up inference and lowers memory use.
- The team is releasing model weights, recipes, code, and most datasets so others can reproduce and build on the work.
Why This Research Matters
Faster, smarter language models make helpful tools more affordable and accessible, from classroom tutors to coding assistants. By activating only a few experts at a time, Nemotron 3 Nano delivers strong reasoning without wasting compute, lowering costs and latency. Its ability to handle million-token contexts unlocks applications like analyzing long legal, scientific, or enterprise documents. Multi-environment RL ensures broad, reliable skills, so gains in one area don’t break another. Careful FP8 quantization means you can deploy at scale with near-full accuracy and better throughput. Because it’s open with released weights, recipes, and data, schools, startups, and researchers can reproduce, audit, and extend the work.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a Swiss Army knife has lots of tools in one handle, but you only flip out the ones you need? It’s handy and light because you don’t use every tool at the same time.
🥬 The Concept (in steps, following the dependency graph):
- Deep Learning
- What it is: A way for computers to learn patterns from lots of data using layered “brain-like” calculators.
- How it works: (a) Show many examples, (b) let the model guess, (c) tell it how wrong it was, (d) nudge its knobs (weights) to improve, (e) repeat a lot.
- Why it matters: Without deep learning, models can’t learn complex skills like reading, coding, or reasoning.
- Example: Teaching a model to spot cats by showing millions of photos and correcting it when it’s wrong.
- Neural Networks
- What it is: The basic deep-learning machine made of layers of tiny math units (“neurons”).
- How it works: Numbers go in, layers mix and match them, a prediction comes out; training adjusts the connections.
- Why it matters: Without networks, deep learning has no engine to learn from data.
- Example: A calculator stack that takes words and predicts the next word.
- Transformer Architecture
- What it is: A popular neural network design for language that understands relationships between words all at once.
- How it works: It uses attention to look at every word and decide which others are important, then transforms the info through stacked layers.
- Why it matters: Without Transformers, models would struggle to keep track of complex sentences.
- Example: Reading “Paris is the capital of France,” the model links “Paris” and “France.”
- Attention Mechanism
- What it is: A scoring system that tells the model which parts of the input to focus on.
- How it works: (a) compare each word with others, (b) score relevance, (c) mix information using higher scores more.
- Why it matters: Without attention, the model treats all words equally and gets confused.
- Example: In “What’s the capital of France?”, it pays most attention to “capital” and “France.”
- Reinforcement Learning (RL)
- What it is: Learning by trying things and getting rewards or penalties.
- How it works: (a) the model acts, (b) a judge gives a score, (c) the model updates to get higher scores next time.
- Why it matters: Without RL, it’s hard to teach multi-step skills with clear success checks.
- Example: Solving a math problem and getting a point only if the final answer is correct.
- Human Feedback
- What it is: People rating which answers are better.
- How it works: Humans compare two replies and pick the preferred one; a reward model learns to predict those preferences.
- Why it matters: Without it, models can be technically correct but unhelpful or impolite.
- Example: Choosing the friendlier, clearer reply to a question.
- Machine Learning (umbrella)
- What it is: Teaching computers to learn from data instead of hard-coded rules.
- How it works: Data in → model guesses → compare to target → update → repeat.
- Why it matters: Without ML, models can’t adapt or improve from experience.
- Example: Learning spelling patterns from reading many stories.
- Neural Network Training
- What it is: The process of adjusting weights to reduce mistakes on training data.
- How it works: Use optimizers (like AdamW), learning-rate schedules, and big batches of data.
- Why it matters: Without careful training, models can be unstable or underperform.
- Example: Starting slow, going steady, then cooling down the learning rate (warmup–stable–decay); a minimal sketch of this schedule follows after this list.
- Model Optimization
- What it is: Tricks to make models faster, smaller, or more accurate.
- How it works: Choose architectures, data, schedules, and compression that fit the goal.
- Why it matters: Without it, inference is slow and expensive.
- Example: Using fewer active parameters per step to save compute.
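As a concrete illustration of the warmup–stable–decay idea mentioned under Neural Network Training above, here is a minimal Python sketch. The phase fractions and learning rates are illustrative assumptions, not the values used for Nemotron 3 Nano.

```python
def wsd_learning_rate(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
                      warmup_frac=0.01, decay_frac=0.2):
    """Warmup-stable-decay schedule: ramp up, hold steady, then cool down.

    All hyperparameters here are illustrative placeholders, not the
    values reported for Nemotron 3 Nano.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:                      # warmup: linear ramp from 0 to peak
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                        # stable: hold the peak rate
        return peak_lr
    # decay: linear cooldown from peak to min over the final stretch
    progress = (step - stable_end) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Example: print the rate at a few points of a 1,000-step toy run
for s in [0, 5, 500, 900, 999]:
    print(s, round(wsd_learning_rate(s, 1000), 6))
```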
The World Before: Transformers were strong but expensive at run time because every layer always worked in full. Many models got better by just getting bigger, which made them slow and costly. Also, models often learned great at one skill but training that skill could hurt others. And long context (reading huge documents) was hard and slow.
The Problem: Can we build a model that (a) reasons well like top models, (b) runs much faster, and (c) handles super long inputs—without sacrificing quality?
Failed Attempts:
- Just shrink the model: fast but loses accuracy.
- Make it only a Transformer: accurate but slow on long sequences.
- Train RL on one task: that task improves but others get worse.
- Use full-precision everywhere: accurate but less throughput.
The Gap: We needed a model that only “wakes up” the parts it needs (efficiency), mixes architectures suited for different sequence skills, is trained with multi-environment RL so gains are broad, and is compressed smartly so speed stays high with accuracy intact.
The Stakes (Why you should care): Faster, smarter models mean better homework help, coding assistants that run on fewer GPUs, chatbots that remember long histories, and agents that can use tools step-by-step to actually get things done—all while being open so schools, startups, and researchers can build on it.
🍞 Anchor: Imagine a team project where only the right experts join each meeting, you combine two planning styles (one quick for long timelines and one detailed for key steps), practice across many challenge rooms, and then pack your notes efficiently. That’s Nemotron 3 Nano for language tasks.
02 Core Idea
🍞 Hook: Imagine a school with 128 club experts. For each problem, a counselor picks just 6 best-fit experts to help, instead of calling the whole school to the meeting. Fast and focused!
🥬 The Concept:
- Aha! Moment (one sentence): Sparsely activating a few experts from a Mixture-of-Experts layer inside a hybrid Mamba–Transformer, then polishing skills with multi-environment RL and careful quantization, delivers top-tier reasoning at much higher speed.
Multiple Analogies (three ways):
- Sports Team: Only the right players take the field for each play (MoE), while the playbook has two styles—long-range scanning (Mamba) and precise targeting (Attention). Practice drills across many fields (RL in many environments) and carry a light kit (FP8 quantization).
- Kitchen: 128 chefs specialize in cuisines; a head waiter routes dishes to 6 chefs who fit the order best. One cooking method is great for slow stews over long time (Mamba), another for quick sears with sharp focus (Attention). The kitchen rehearses many menus (RL) and uses compact containers (FP8) without losing taste.
- Library: A smart indexer (Mamba) flips through long books quickly, while a reference librarian (Attention) zooms into key pages. Only 6 of 128 subject experts answer your tricky question. The library trims heavy binders to lighter paper (quantization) but keeps the info.
Before vs After:
- Before: To get better results, people turned up model size or always used all layers, which slowed everything down. RL on one domain could break others. Long-context reading was fragile and expensive.
- After: Nemotron 3 Nano activates just ~3.2B of 31.6B parameters per step (6/128 experts), keeps or beats accuracy, runs up to 3.3Ă— faster in an 8K-in/16K-out test, and reads up to 1M tokens. Multi-environment RL steadily lifts many skills together. FP8 keeps speed high with ~99% median accuracy preserved.
Why it works (intuition):
- MoE specialization means experts learn different sub-skills; picking a small set per token gives you the right skill without paying for them all.
- Hybrid Mamba–Attention: Mamba (state-space modeling) shines at long, efficient sequence processing; attention layers laser-focus on the most relevant pieces.
- Multi-environment RL trains many muscles at once with verifiable rewards and a curriculum that starts easier and shifts harder.
- Selective FP8 quantization speeds things up while leaving the most delicate parts (a few attention and adjacent Mamba layers) in higher precision to protect accuracy.
Building Blocks:
- Mixture-of-Experts with a learned router (granular MoE, 6 of 128 experts, with shared experts) for sparse scaling; a minimal router sketch follows after this list.
- Hybrid pattern of Mamba-2 blocks and GQA attention blocks; RMSNorm, squared ReLU in experts, no positional embeddings.
- Two-phase pretraining on 25T tokens, then long-context continuous pretraining.
- SFT with a chat template enabling reasoning on/off and token budget control; tool-tag formatting; multi-turn and multi-step traces.
- RLVR across coding, math, instruction following, long-context, tool use, structured JSON, with GRPO and curriculum sampling.
- RLHF using a strong Generative Reward Model plus Group Relative Length Control to reduce unnecessary verbosity.
- Post-training FP8 quantization with selective BF16 for the most sensitive layers and FP8 KV cache.
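To make the routing building block more tangible, here is a minimal, self-contained sketch of sigmoid-gated top-k expert routing for a single token (6 of 128 experts). The dimensions are toy values, the gate normalization is a simplification, and shared experts and the load-balancing loss are omitted; it illustrates the idea rather than the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 128   # expert MLPs available in each MoE layer
TOP_K = 6           # experts actually activated per token
D_MODEL = 64        # toy hidden size (illustrative, not the real width)

# Toy parameters: a router matrix and one tiny weight matrix per expert.
router_w = rng.normal(scale=0.02, size=(D_MODEL, NUM_EXPERTS))
expert_w = rng.normal(scale=0.02, size=(NUM_EXPERTS, D_MODEL, D_MODEL))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def moe_forward(token_vec):
    """Route one token through the top-k experts chosen by sigmoid gating."""
    scores = sigmoid(token_vec @ router_w)            # (NUM_EXPERTS,) gate scores in [0, 1]
    top_idx = np.argsort(scores)[-TOP_K:]             # indices of the 6 best-scoring experts
    gates = scores[top_idx] / scores[top_idx].sum()   # simplified: renormalize over chosen experts

    out = np.zeros_like(token_vec)
    for gate, idx in zip(gates, top_idx):             # only 6 of 128 experts do any work
        expert_out = np.maximum(token_vec @ expert_w[idx], 0.0) ** 2  # squared-ReLU expert
        out += gate * expert_out
    return out, top_idx

token = rng.normal(size=D_MODEL)
output, chosen = moe_forward(token)
print("experts used:", sorted(chosen.tolist()))
```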
🍞 Anchor: Think of an emergency room. A triage nurse (router) calls 6 right specialists from 128. One system tracks the patient’s full history efficiently (Mamba), another zooms in on the latest vitals (Attention). The hospital trains on many realistic drills (RL), and keeps equipment light but accurate (FP8). Patients (your tasks) get fast, high-quality care.
03 Methodology
🍞 Hook: Picture a relay race: the baton (your text) passes through stages—big practice (pretraining), special drills (long-context), coaching with answers (SFT), scoring in many gyms (RLVR), judging by preferences (RLHF), and a lighter uniform (quantization).
🥬 The Concept (High level): Input → Pretraining (25T tokens) → Long-Context CPT → SFT (chat and tools; reasoning control) → RLVR (multi-env GRPO with curriculum) → RLHF (GenRM + length control) → Quantization (FP8 selective) → Output
Step-by-step:
- Pretraining on 25T tokens with Hybrid MoE Mamba–Transformer
- What happens: The model learns general language, code, math, and world knowledge. Architecture: 52 layers; MoE replaces standard FFNs with 128 experts, 6 active per token; hybrid stack interleaves Mamba-2 and GQA attention blocks. Router uses sigmoid gating and load balancing.
- Why this step: Without broad pretraining, later specialization won’t stick. MoE gives accuracy with fewer active parameters, saving compute.
- Example: From Nemotron-CC-v2.1 and Nemotron-CC-Code-v1, it reads diverse web pages and code, learning how functions and proofs look.
- Data Mixture and Two-Phase Curriculum
- What happens: Phase 1 emphasizes diversity (web crawl, code, math, multilingual). Phase 2 turns up high-quality sources (e.g., Wikipedia, STEM SFT-style) near the end of training.
- Why: Without a quality-focused finish, knowledge stays noisy; without diversity first, coverage is thin.
- Example: Early it sees many websites; later it studies well-edited explanations and structured STEM problems.
- Long-Context Extension (LC-Phase)
- What happens: Continuous pretraining with very long sequences (mix of 512k and 4k lengths) plus retrieval-like tasks. Total ~121B extra tokens. This teaches stable reading up to 1M tokens.
- Why: Without LC-Phase, performance at extreme lengths drops or short-context skills suffer.
- Example: Answering questions that require scanning across hundreds of thousands of tokens.
- SFT with a Smart Chat Template and Reasoning Control
- What happens: The model is fine-tuned on curated multi-step, multi-turn traces for math, code, tools, terminal, proofs, science, chat, safety, and instruction following. Tool calls use simple XML-like tags. The chat template allows reasoning on/off and token budget control (sometimes removing or truncating internal thinking during training so the model learns to vary its depth).
- Why: Without SFT, the model is less aligned to act as a helpful assistant or agent. Without reasoning controls, it can be too terse or too verbose.
- Example: “Solve this coding bug in steps and run tests.” You can ask it to think briefly or deeply.
- Multi-Environment RL from Verifiable Rewards (RLVR)
- What happens: The model practices across many gyms simultaneously—competitive math and coding (unit tests), instruction following (checkers), JSON schema adherence, long-context Q&A (LLM judge), and agentic tool use in sandboxed databases. Training uses GRPO (a PPO-style variant) with masked importance sampling and a curriculum that shifts from easy to hard by pass rate, and freezes the router weights for stability; a minimal sketch follows after this list.
- Why: Training one skill at a time can boost that skill but harm others. Multi-env RL builds balanced strength and avoids regressions.
- Example: For coding, it’s correct only if unit tests pass; for JSON, only exact schema matches earn reward.
- RLHF with a Generative Reward Model (GenRM) + Group Relative Length Control
- What happens: A strong model (Qwen3-235B…) is trained via RL to be a judge that reasons and scores pairs of answers (helpfulness + ranking). During RLHF, Nemotron 3 Nano samples 16 responses per prompt. A circular comparison scheme reduces judge calls from O(N^2) to O(N). Group Relative Length Control gives zero-sum rewards that favor shorter thinking/answering only when quality is high, cutting verbosity ~30% without losing accuracy.
- Why: Without GenRM, preference judgments can be brittle; without length control, the model can overthink simple prompts.
- Example: For “Explain photosynthesis simply,” short, clear answers score better than long, rambling ones.
- Post-Training Quantization to FP8 (Selective)
- What happens: Using post-training quantization (PTQ), most weights, activations, and the KV cache move to FP8, while 6 sensitive attention layers and the 6 Mamba layers immediately preceding them stay in BF16. The Conv1D in Mamba also stays in BF16. Calibration uses 1K SFT samples.
- Why: Full FP8 everywhere can hurt accuracy; selective keeps speed while preserving results (~99% median recovery).
- Example: Like packing your backpack light but keeping fragile items in sturdy cases.
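Two mechanics from the RLVR step above are easy to sketch: GRPO-style group-relative advantages (each sampled rollout is scored against the mean of its own group) and a curriculum that keeps prompts whose current pass rate sits in a useful band. The snippet below is a simplified illustration; the reward functions, thresholds, and masked importance-sampling details are placeholders rather than the released recipe.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each rollout is scored relative to its group.

    rewards: verifiable 0/1 (or scalar) rewards for G rollouts of one prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                       # all rollouts tied: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def curriculum_filter(prompt_pass_rates, low=0.1, high=0.9):
    """Keep prompts that are neither hopeless nor already solved.

    The [low, high) band is an illustrative choice; shifting it upward over
    training moves the curriculum from easier to harder prompts.
    """
    return [p for p, rate in prompt_pass_rates.items() if low <= rate < high]

# Example: 8 rollouts of one coding prompt, rewarded 1 only if unit tests pass
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
print(group_relative_advantages(rewards))

# Example: pick which prompts stay in the next RL batch
pass_rates = {"prompt_a": 0.0, "prompt_b": 0.375, "prompt_c": 1.0}
print(curriculum_filter(pass_rates))
```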
Secret Sauce:
- Granular MoE with learned router (6/128) plus shared experts—high accuracy with low active compute.
- Hybrid Mamba–Attention stacking—efficient long sequences plus sharp focus.
- Unified, curriculum-based RL across many environments—steady, uniform improvement.
- GenRM + length control—concise, high-quality answers.
- Selective FP8 quantization—speed without big accuracy drops; a configuration sketch follows below.
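As a rough illustration of selective quantization, the sketch below builds a per-layer precision plan: FP8 by default, with a handful of sensitive attention layers and the Mamba layers immediately before them kept in BF16. The layer indices and the toy layer layout are hypothetical placeholders; the actual recipe identifies the sensitive layers from calibration data.

```python
from dataclasses import dataclass

@dataclass
class LayerPlan:
    index: int
    kind: str        # "attention" or "mamba" in this toy stack
    precision: str   # "fp8" or "bf16"

def build_precision_plan(layer_kinds, sensitive_attention=(7, 15, 23, 31, 39, 47)):
    """Assign FP8 by default; keep sensitive attention layers and the Mamba
    layer right before each of them in BF16.

    The `sensitive_attention` indices are made-up placeholders for illustration.
    """
    keep_bf16 = set()
    for idx in sensitive_attention:
        keep_bf16.add(idx)
        if idx - 1 >= 0 and layer_kinds[idx - 1] == "mamba":
            keep_bf16.add(idx - 1)   # protect the preceding Mamba layer too
    return [
        LayerPlan(i, kind, "bf16" if i in keep_bf16 else "fp8")
        for i, kind in enumerate(layer_kinds)
    ]

# Toy 52-layer hybrid stack: mostly Mamba blocks with occasional attention blocks
layer_kinds = [("attention" if i % 8 == 7 else "mamba") for i in range(52)]
plan = build_precision_plan(layer_kinds)
print(sum(p.precision == "bf16" for p in plan), "layers kept in BF16")
```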
🍞 Anchor: Think of preparing for a triathlon. You cross-train swimming, biking, and running (multi-env RL), get a coach who knows what judges like (GenRM), carry a lighter bottle without spilling (FP8 selective), and only bring the teammates you need to the race (MoE).
04 Experiments & Results
🍞 Hook: Imagine a science fair where your project is judged on speed, accuracy, teamwork, and the ability to handle a giant binder of notes without getting lost.
🥬 The Concept:
- The Tests (What and Why):
- Accuracy on reasoning (AIME25, GPQA, math/code sets), coding (LiveCodeBench), agentic tool use (TerminalBench, SWE-Bench, TauBench, BFCL), chat and instruction following (Arena-Hard-V2, IFBench, Scale AI Multi-Challenge), long-context (RULER-100 at 256k/512k/1M, AA-LCR), and multilingual (MMLU-ProX, WMT24++). Throughput is measured on H200/H100 with an 8K-in/16K-out scenario to reflect heavy generation loads.
- Why: This covers step-by-step thinking, real tool execution, following tricky instructions, reading huge contexts, and speaking multiple languages—skills people actually need.
- The Competition (Who):
- Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B are strong open models around this size; some have high accuracy but slower throughput or shorter context.
- The Scoreboard (With Context):
- Throughput: Up to 3.3× faster than Qwen3-30B-A3B-Thinking-2507 and 2.2× faster than GPT-OSS-20B in an 8K input / 16K output setting on a single H200 (using best of vLLM or TRT-LLM per model, FP8 for Nemotron/Qwen, MXFP4 weights + BF16 activations for GPT-OSS-20B). That’s like finishing three laps while others do one.
- Long Context: Supports up to 1M tokens and outperforms peers on RULER-100 at 256k/512k/1M; AA-LCR stable with reasoning on.
- Reasoning/Agentic Highlights: AIME25 with tools ~99.17%; LiveCodeBench v6 68.25; strong on IFBench prompt (71.51) and Arena-Hard-V2 average (67.65), outperforming peers; competitive on GPQA and coding.
- Base Model: Nemotron 3 Nano 30B-A3B Base generally beats Qwen3-30B-A3B-Base on many academic/code/math/commonsense/long-context tasks, reflecting strong pretraining.
- Quantization: The FP8 model shows ~99% median accuracy recovery vs BF16 while improving throughput (the FP8 KV cache enables larger batch sizes).
- Surprising Findings:
- Multi-environment RLVR lifted many benchmarks together; training one environment at a time tended to harm others, but all-at-once with curriculum kept improvements smooth.
- RLVR surpassed or matched a heavily fine-tuned SFT baseline—even when SFT was pushed to convergence.
- Mixing 512k and 4k sequences in long-context CPT retained or improved short-context results compared to only 512k.
- Group Relative Length Control reduced verbosity by ~30% without hurting accuracy—shorter, clearer answers when appropriate; a minimal sketch of the zero-sum length reward follows below.
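To show how a zero-sum length reward can work, the sketch below hands out a bonus inside one group of sampled responses: among answers that clear a quality bar, shorter-than-average ones gain reward and longer ones lose it, and the bonuses cancel out across the group. The threshold and weighting are illustrative assumptions, not the paper's exact formulation.

```python
def group_relative_length_bonus(responses, quality_threshold=0.8, weight=0.1):
    """Zero-sum length bonus within one group of sampled responses.

    responses: list of (quality_score, num_tokens). Only responses whose
    quality clears the threshold compete on length; everyone else gets 0.
    The threshold and weight are illustrative placeholders.
    """
    eligible = [(q, n) for q, n in responses if q >= quality_threshold]
    if len(eligible) < 2:
        return [0.0] * len(responses)

    mean_len = sum(n for _, n in eligible) / len(eligible)
    bonuses = []
    for q, n in responses:
        if q >= quality_threshold:
            # shorter than the group mean -> positive bonus, longer -> negative
            bonuses.append(weight * (mean_len - n) / mean_len)
        else:
            bonuses.append(0.0)
    return bonuses

# Example: four sampled answers (quality, token count); concise good answers win
group = [(0.9, 200), (0.95, 600), (0.85, 400), (0.3, 100)]
print([round(b, 3) for b in group_relative_length_bonus(group)])
```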
🍞 Anchor: If a robot student practices math, coding, and lab work in many classrooms at once, carries a lighter backpack, and learns to give concise but correct answers, it will finish quizzes faster and still score top grades. That’s what Nemotron 3 Nano showed on the scoreboard.
05 Discussion & Limitations
🍞 Hook: Think of a race car—fast and clever, but it still needs the right fuel, track, and maintenance to shine.
🥬 The Concept:
- Limitations (be specific):
- Hardware Fit: While efficient at inference, training this system (25T tokens; LC-Phase; large RL) needs serious GPU clusters and orchestration.
- Edge Devices: Even with FP8, a 30B model is still large for tiny devices; quantized or distilled variants may be needed.
- Judge Dependence: RLHF quality depends on the Generative Reward Model and data; biases in the judge can shape model style.
- Extreme Edge Cases: Some very long, cross-document tasks or unusual tool schemas may still trip the model.
- Multilingual Breadth: Good performance in several high-resource languages, but not all languages are equally covered.
- Required Resources:
- For training: Multi-node GPU clusters (H100/H200), NeMo RL/Gym stack, dataset pipelines, and verification servers (unit tests, schema checkers, LLM judges).
- For inference: Optimized runtimes (TRT-LLM or vLLM), FP8 support, and enough memory to hold selective BF16 layers and FP8 caches.
- When NOT to Use:
- Ultra-low-power environments (microcontrollers, basic phones) where even quantized 30B is too big.
- Tasks that require legally auditable, fully transparent internal reasoning traces, especially when reasoning is turned off or the thinking budget is constrained.
- Scenarios where reward models or judges are not trusted (e.g., needing provably neutral or certified evaluators).
- Open Questions:
- How far can sparse MoE scale with more or fewer experts per token before diminishing returns?
- What are best practices for balancing Mamba vs Attention blocks as context grows to multi-million tokens?
- Can we further reduce reliance on LLM judges with cheaper/verifiable signals, especially for chat helpfulness?
- How to extend multilingual depth to many low-resource languages without hurting efficiency?
- Can selective quantization be automated per-layer for each deployment target to maximize speed/accuracy trade-offs?
🍞 Anchor: A great bicycle still needs the right rider, tuned gears, and safe roads. Nemotron 3 Nano rides fast, but picking the route, checking the brakes (judges), and choosing the right tires (quantization) makes all the difference.
06 Conclusion & Future Work
🍞 Hook: Imagine a brainy, speedy study buddy that calls just the right experts when needed, reads giant books, and learns good habits from many classrooms at once.
🥬 The Concept:
- 3-Sentence Summary: Nemotron 3 Nano combines a sparse Mixture-of-Experts with a hybrid Mamba–Transformer to deliver strong reasoning at much higher speed. It is trained on 25T tokens, refined with SFT, multi-environment RLVR, and RLHF using a Generative Reward Model plus length control, and then compressed with selective FP8 quantization for efficient deployment. The result is an open model that matches or beats peers on many benchmarks, reads up to 1M tokens, and runs up to 3.3× faster in demanding generation settings.
- Main Achievement: Proving that sparse MoE inside a hybrid Mamba–Attention stack, trained with unified, curriculum-based RL across many environments and finished with careful quantization, can push the accuracy–throughput frontier forward.
- Future Directions: Explore dynamic expert counts per token, smarter router learning during RL, expanded multilingual training (especially low-resource), more verifiable reward signals for chat helpfulness, and automated per-layer precision selection for each hardware target.
- Why Remember This: It shows we don’t have to choose between smart and fast—you can get both by waking only the right experts, mixing architectures wisely, practicing across many verifiable arenas, and packing the model efficiently.
🍞 Anchor: Like a championship relay team that picks the best runners for each stretch, trains on many tracks, and travels light, Nemotron 3 Nano reaches the finish line faster without dropping the baton of accuracy.
Practical Applications
- Build a coding copilot that runs faster on fewer GPUs while passing more unit tests on real repositories.
- Deploy a customer-support agent that follows complex, multi-step instructions and uses tools (APIs, databases) reliably.
- Summarize and cross-reference million-token legal or scientific documents without chunking them into tiny parts.
- Create classroom tutors that can switch reasoning on/off and adjust thinking depth to fit short answers or deep explanations.
- Automate data extraction into strict JSON schemas for enterprise workflows, with verifiable adherence checks.
- Run large-scale document QA systems for research teams, with long-context retrieval and stable reasoning.
- Power terminal-based automation agents to reproduce issues and fix bugs using containerized environments.
- Translate and explain technical content across multiple high-resource languages with improved clarity.
- Train domain-specific variants (e.g., finance, medicine, engineering) using the released recipes and selective FP8 for deployment.
- Integrate concise-answer preferences via RLHF length control to keep chat responses informative but short.