NVIDIA Nemotron 3: Efficient and Open Intelligence
Key Summary
- •Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and cheaper.
- •They mix three ideas—experts (MoE), a speedy sequence engine (Mamba-2), and a little attention—so they handle long thoughts without slowing down.
- •LatentMoE shrinks what gets sent to experts so the model can use more and better experts without extra cost, boosting accuracy per byte.
- •Multi-Token Prediction (MTP) trains the model to guess several steps ahead and doubles as built-in draft tokens for fast generation.
- •Training in NVFP4 (a tiny-number format) makes pretraining much faster on new GPUs while keeping accuracy close to high-precision training.
- •Nemotron 3 handles super-long inputs—up to 1 million tokens—without the usual position-embedding headaches.
- •A big multi-environment reinforcement learning phase teaches reasoning, tools, coding, and long-context skills all at once.
- •At inference, you can set a 'reasoning budget' to trade a bit of accuracy for a lot of speed, like picking quick or careful mode.
- •Nemotron-3-Nano-30B shows about 3.3× higher throughput than a similar Transformer MoE model while maintaining strong benchmark accuracy.
- •NVIDIA plans to openly release weights, software, recipes, and most training data, making high-quality agentic AI widely accessible.
Why This Research Matters
Many real tasks are long and messy: full project codebases, months of chat logs, or multi-document research. Nemotron 3 makes it practical to handle these without grinding to a halt, so helpful AI agents can finally work across truly large contexts. Its ‘think more when needed, think less when not’ budget control lets products balance speed and accuracy per request. Faster, lower-cost training with NVFP4 and efficient inference means more teams can build and deploy capable agents without massive budgets. Open releases of weights, software, and recipes accelerate community research and transparent evaluation. Together, this raises the bar for reliable, scalable AI assistance in everyday tools—IT helpdesks, coding copilots, research assistants, and beyond.
Detailed Explanation
01Background & Problem Definition
You know how you can read a short story quickly but need to slow down for a giant book so you don’t lose track of details? Early AI models were great at short stories but stumbled on giant books and complicated tasks. They often needed lots of time and memory to think carefully, especially when keeping track of long histories or multi-step plans. This made it hard to build practical helpers, like agents that read long IT logs, fix codebases, or follow multi-step checklists without getting lost.
🍞 Hook: Imagine a mega group project where every classmate adds notes. Keeping track of who said what for months is tough, and rereading everything every time wastes time. 🥬 The Concept (Mixture-of-Experts Architecture): It’s a model design where only the most relevant specialist networks (‘experts’) wake up for each token. How it works: 1) A gate scores which experts fit the current token, 2) The top-K experts process it, 3) Their answers are combined. Why it matters: Without MoE, the whole model works on every token, making it slow and expensive. 🍞 Anchor: Like calling only the math whiz for a math question and the grammar guru for an essay—faster and better.
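To make the gate-then-top-K routing concrete, here is a minimal PyTorch sketch of a Mixture-of-Experts layer. The dimensions, expert count, and top-K are illustrative choices for this example only, not Nemotron 3's actual configuration.

```python
# Minimal sketch of top-K Mixture-of-Experts routing (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)   # 1) score every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.top_k, -1)   # 2) keep only the top-K experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                        # 3) combine the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(10, 256)).shape)               # torch.Size([10, 256])
```

Only the selected experts run for a given token, which is exactly where the compute savings come from.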
Before Nemotron 3, many models used Transformers with self-attention everywhere. Attention is powerful but gets slower as the text grows because it must look back at more and more words and keep a big KV cache. That’s like rereading larger chunks every step—awesome for accuracy but tough for speed and memory.
🍞 Hook: Think of a conga line where each dancer must keep eye contact with everyone behind them—it gets tricky as the line grows. 🥬 The Concept (Mamba–Transformer Hybrid): This mixes fast sequence layers (Mamba-2) with a few attention layers and MoE. How it works: 1) Most layers use Mamba-2 to carry a compact “state” forward (constant memory), 2) Some layers do attention for all-to-all information when needed, 3) MoE layers add specialists for higher accuracy per compute. Why it matters: It keeps speed high on long generations while preserving the benefits of occasional attention. 🍞 Anchor: Like taking smooth highways (Mamba) most of the trip and only using busy downtown streets (attention) when necessary.
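A tiny numerical sketch makes the memory argument behind the hybrid design visible: a recurrent Mamba-style layer carries a fixed-size state no matter how long the sequence gets, while an attention layer's KV cache grows with every generated token. The shapes below are arbitrary, chosen only for illustration.

```python
# Contrast a constant-size recurrent state with a growing attention KV cache.
import torch

d_state, d_model, n_steps = 64, 256, 1000
proj = torch.randn(d_model, d_state) / d_model ** 0.5

# Recurrent-style (Mamba-like) layer: the carried state never grows.
state = torch.zeros(d_state)
for _ in range(n_steps):
    x = torch.randn(d_model)                 # one new token's activation
    state = 0.9 * state + x @ proj           # stand-in for the SSM state update

# Attention layer: the KV cache grows by one (key, value) pair per token.
kv_cache = []
for _ in range(n_steps):
    x = torch.randn(d_model)
    kv_cache.append((x.clone(), x.clone()))  # stand-in for projected keys/values

print("recurrent state floats:", state.numel())                        # 64
print("KV cache floats after 1000 tokens:",
      sum(k.numel() + v.numel() for k, v in kv_cache))                  # 512000
```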
People also tried bigger dense models, but they hit memory and latency walls. Others tried long-context tricks like RoPE scaling, but those sometimes break when you go far beyond the lengths seen during training.
🍞 Hook: You know how sending huge photos over a slow internet takes forever? If you compress smartly, you can send more pictures faster. 🥬 The Concept (LatentMoE): It routes tokens through experts in a smaller ‘latent’ size, then maps back up. How it works: 1) Shrink token dimension from d to ℓ for routing and expert compute, 2) Use the saved bandwidth to increase number of experts and top-K, 3) Project results back to full size. Why it matters: You get more (and better) expert help without paying extra time or memory. 🍞 Anchor: Like folding a big poster into a small envelope to mail it cheaply, then unfolding it perfectly on arrival.
Yet even with better layers, generation is still slow by default: the model predicts one token at a time, paying a full forward pass for every single token.
🍞 Hook: If you’re pretty sure how a sentence will end, you can say a few words at once instead of pausing after every word. 🥬 The Concept (Multi-Token Prediction, MTP): The model predicts several future tokens at once. How it works: 1) During training, it learns to guess 2–n steps ahead, 2) At inference, these guesses act as draft tokens, 3) The main model quickly verifies them (speculative decoding). Why it matters: This gives richer learning signals and speeds up generation. 🍞 Anchor: Like drafting three chess moves you’re confident about, then quickly checking they’re safe before playing.
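Here is a minimal sketch of the training side of MTP: auxiliary heads predict 1, 2, ... steps ahead from the same backbone states, and their losses are added to the usual objective. The head count, vocabulary size, and dimensions are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a Multi-Token Prediction (MTP) loss with k-step-ahead heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_heads_ahead = 1000, 128, 2
backbone_out = torch.randn(4, 32, d_model)           # (batch, seq, d_model) from the trunk
tokens = torch.randint(0, vocab, (4, 32))            # target token ids

heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_heads_ahead)])

loss = 0.0
for k, head in enumerate(heads, start=1):            # head k predicts the token k steps ahead
    logits = head(backbone_out[:, :-k])              # only positions that have a k-ahead target
    targets = tokens[:, k:]                          # shift targets by k
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

print("combined next-token + MTP-style loss:", float(loss))
```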
Developers also need control: sometimes you want the model to think deeply; other times you want it to answer fast.
🍞 Hook: When you take a timed quiz, you budget how long to spend on each question. 🥬 The Concept (Granular Reasoning Budget Control): You set a limit for ‘thinking’ tokens; the model stops thinking when the budget is up and answers. How it works: 1) User gives a token budget, 2) Model generates a partial think-trace, 3) A special token signals ‘stop thinking,’ then it outputs the final answer. Why it matters: You can dial speed vs. accuracy per query. 🍞 Anchor: Like telling yourself, “2 minutes per problem,” so you finish the whole test.
There’s also the question of training speed and cost. Big models are expensive to train at high precision.
🍞 Hook: Switching a video from 4K to 720p loads faster while still looking fine on a phone. 🥬 The Concept (NVFP4 Training): A low-precision number format for weights, activations, and gradients to speed up training on special GPUs. How it works: 1) Quantize tensors to FP4 with clever block scaling and rounding, 2) Keep a few sensitive parts in higher precision, 3) Use fast FP4 matrix ops. Why it matters: You train faster and cheaper while keeping accuracy close to BF16. 🍞 Anchor: Like reading a slightly smaller-print book that still has all the same words—faster page turns, same story.
Finally, long context is crucial—many tasks need hundreds of thousands or even a million tokens.
🍞 Hook: Taking notes in a long lecture helps you remember the start when you get to the end. 🥬 The Concept (Long Context Handling without RoPE issues): Nemotron 3 relies on Mamba’s implicit position sense and avoids RoPE in attention, reducing out-of-range problems. How it works: 1) Pretrain and fine-tune with very long sequences (up to 512k), 2) Add long-context RL environments, 3) Evaluate on 1M-token tasks. Why it matters: The model can use giant inputs without breaking position logic. 🍞 Anchor: Like using a timeline instead of fragile sticky notes that fall off when the paper gets too long.
To tie skills together—math, coding, tools, chat—the model needs feedback from realistic situations.
🍞 Hook: A dog learns faster when it practices many tricks with treats and guidance, not just one trick over and over. 🥬 The Concept (Multi-environment Reinforcement Learning): Train across many tasks at once and reward good behavior. How it works: 1) Asynchronous rollouts generate experiences, 2) GRPO stabilizes learning with masked importance sampling, 3) MTP speeds up sampling. Why it matters: The model learns broad, balanced skills without forgetting others. 🍞 Anchor: Like a sports camp that rotates soccer, swimming, and track so you become an all-around athlete.
This paper fills the gap by showing how to blend these ideas—hybrid layers, LatentMoE, MTP, NVFP4, long-context training, and multi-environment RL—into one open family that is both accurate and efficient for real agents.
02Core Idea
The ‘aha!’ is that you can keep high accuracy while dramatically boosting speed and memory efficiency by routing most work through Mamba-2 and MoE (with LatentMoE), using only a few attention layers, and supercharging training and inference with MTP and NVFP4.
Analogy 1 (Highways + Specialists + Cruise Control):
- Highways (Mamba-2) move you fast most of the way.
- Downtown roads (attention) handle tricky intersections when needed.
- Specialists (MoE) hop in for targeted skills.
- Cruise control (MTP) predicts several steps ahead to keep motion smooth.
- Fuel-efficient engine (NVFP4) keeps costs down.
Analogy 2 (School Team Project):
- A few meetings with everyone (attention) align the big picture.
- Most work happens in small focused groups (Mamba-2 + MoE) to stay efficient.
- A coordinator guesses the next tasks (MTP) so people can move in parallel.
- Using lighter files and notes (NVFP4) lets you share faster without losing meaning.
Analogy 3 (Kitchen Brigade):
- The head chef rarely steps in (few attention layers), but when they do, it matters.
- Most cooking is done by stations that pass along the dish state (Mamba-2).
- Specific chefs handle special dishes (MoE/LatentMoE).
- Prep cooks line up the next ingredients ahead of time (MTP), speeding dinner service.
Before vs After:
- Before: Dense Transformers paid a big memory/time tax on long outputs; MoE was limited by bandwidth/communication; training at high precision was costly; long contexts broke position handling; post-training often siloed skills.
- After: Hybrid Mamba–Transformer MoE cuts cache and bandwidth costs; LatentMoE gets more experts without more traffic; NVFP4 speeds training while keeping accuracy; long-context works up to 1M tokens; multi-environment RL builds broad, agentic skills; MTP both improves quality and speeds generation.
Why it works (intuition):
- Most tokens don’t need all-to-all attention every step; a compact state (Mamba-2) carries useful history cheaply.
- Expertise should be sparse: sending only the right tokens to the right experts avoids wasted compute.
- Communication is the bottleneck in MoE; shrink routed dimensions (LatentMoE) to spend the savings on more/better experts.
- Predicting multiple steps ahead creates stronger learning signals and aligns perfectly with speculative decoding.
- Precision lower than BF16 can be stable when paired with smart scaling and exceptions for sensitive layers.
- Training on long sequences and multiple tasks builds robust generalization to huge contexts and varied reasoning.
Building blocks (with sandwich intros kept concise):
- 🍞 Hook: Picking the right helper saves time. 🥬 MoE: Route tokens to top-K experts and combine outputs; without it, you pay dense costs. 🍞 Anchor: Call the right tutor for each question.
- 🍞 Hook: Freeways beat backroads for long trips. 🥬 Mamba-2 layers: Keep a constant-size state so compute/memory don’t grow with sequence; without them, long generations bog down. 🍞 Anchor: A backpack that stays light no matter how long the hike is.
- 🍞 Hook: Fold big posters into envelopes. 🥬 LatentMoE: Do expert work in a smaller space, then expand; without it, MoE hits bandwidth limits. 🍞 Anchor: Send more letters without higher postage.
- 🍞 Hook: Plan a few chess moves. 🥬 MTP: Predict multiple tokens to learn better and draft outputs; without it, you lose speed and planning gains. 🍞 Anchor: Say a sentence smoothly instead of spelling it letter by letter.
- 🍞 Hook: Lower-res videos load fast. 🥬 NVFP4: Train in FP4 with careful scaling and exceptions; without it, costs soar. 🍞 Anchor: Same movie, less data.
- 🍞 Hook: Take notes in long classes. 🥬 Long context without RoPE issues: Train for ultra-long inputs and rely on Mamba’s implicit positions; without it, models forget or misplace info. 🍞 Anchor: A reliable timeline that never runs out of space.
- 🍞 Hook: Sports camp beats single-sport drills. 🥬 Multi-environment RL: Learn many skills together, stabilized by GRPO; without it, models overfit to one skill. 🍞 Anchor: All-around athletes perform better in games.
03Methodology
At a high level: Input → Tokenization → Hybrid Backbone (Mamba-2 + sparse MoE + a few attention layers) → MTP heads (for multi-token predictions) → Losses (next-token + MTP) → Pretraining in NVFP4 → Long-context CPT/SFT → Multi-environment RL post-training → Inference with budget control + speculative decoding.
Step A: Tokenization and Input Handling
- What happens: Text/code is tokenized into subword tokens; very long sequences (up to 1M tokens) are accepted for certain tasks.
- Why it exists: Standardize inputs and enable long-context modeling.
- Example: A repository-level code context of 1.1M tokens is fed in; the model maintains understanding across files, imports, and comments.
Step B: Hybrid Backbone Forward Pass
- What happens: Layers mostly alternate between Mamba-2 and MoE; a few attention layers are interleaved. Mamba-2 carries a constant-size state; MoE routes tokens to top-K experts; attention provides occasional global mixing.
- Why it exists: Minimize growing KV cache costs while preserving high-fidelity routing when needed.
- Example: For an 8k input / 16k output reasoning session, the model uses fast Mamba-2 recurrent-style updates for most layers, invoking attention layers sparingly to reconcile facts across distant parts of a long context.
Step C: LatentMoE Routing
- What happens: Before entering experts, tokens are projected from dimension d to a smaller latent dimension ℓ; experts operate in ℓ; outputs are projected back to d. The gate (still operating in d) picks the top-K experts; the total expert count N and the top-K can be increased thanks to the reduced bandwidth (see the sketch after this step).
- Why it exists: Reduce per-token communication and memory while increasing expert diversity and nonlinear budget.
- Example: Shrink d=4096 to ℓ=1024 (≈4× smaller), expand total experts from 128 to 512, and top-K from 6 to 22, improving performance across MMLU, code, and math at similar cost.
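The sketch below focuses on the part of Step C that saves bandwidth: projecting each token from d down to ℓ before it is dispatched to experts, and back up afterwards. The d=4096 and ℓ=1024 values follow the example above; the bf16 byte count is an assumption made only to quantify the traffic reduction.

```python
# Minimal sketch of the LatentMoE projection: expert traffic happens in ell, not d.
import torch
import torch.nn as nn

d, ell = 4096, 1024
down = nn.Linear(d, ell, bias=False)   # applied before dispatching a token to its experts
up = nn.Linear(ell, d, bias=False)     # applied after combining the experts' outputs

token = torch.randn(1, d)
latent = down(token)                   # this smaller vector is what travels through all-to-all
restored = up(latent)

bytes_per_val = 2                      # assuming bf16 activations
print("routed bytes/token, standard MoE:", d * bytes_per_val)    # 8192
print("routed bytes/token, LatentMoE:  ", ell * bytes_per_val)   # 2048  (~4x less)
print(restored.shape)                  # torch.Size([1, 4096])
```

The bandwidth saved this way is what the paper reinvests in a larger expert pool and a higher top-K.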
Step D: Multi-Token Prediction (MTP) Heads
- What happens: Auxiliary heads predict multiple future tokens from the same hidden state(s), producing draft tokens for speculative decoding (a draft-and-verify sketch follows this step).
- Why it exists: Provide richer training signals (planning) and speed up inference without a separate draft model.
- Example: On a coding prompt, the model drafts the next two tokens, which are accepted ≈97% of the time, letting the verifier skip many single-step checks.
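To show how MTP drafts translate into speed, here is a toy draft-and-verify loop in the spirit of speculative decoding. The mtp_draft and main_model_next functions are hypothetical stand-ins (simple arithmetic rules), rigged so the first draft is always accepted and the second never is, making both acceptance and fallback visible.

```python
# Toy speculative decoding loop: draft tokens, verify them, keep the matching prefix.
def mtp_draft(context, n=2):
    # Hypothetical stand-in for the MTP heads' draft tokens.
    last = context[-1]
    return [last + 1, last + 3]          # first draft "right", second deliberately "wrong"

def main_model_next(context):
    # Hypothetical stand-in for one verified decoding step of the main model.
    return context[-1] + 1

def speculative_step(context):
    drafts = mtp_draft(context)
    accepted = []
    for tok in drafts:                                        # verify drafts left to right
        if main_model_next(context + accepted) == tok:
            accepted.append(tok)                              # match -> token accepted "for free"
        else:
            break                                             # first mismatch ends acceptance
    if len(accepted) < len(drafts):
        accepted.append(main_model_next(context + accepted))  # always gain >= 1 verified token
    return accepted

ctx = [0]
for _ in range(5):
    ctx += speculative_step(ctx)
print(ctx)   # two tokens gained per verification step in this toy
```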
Step E: Losses and Training Signals
- What happens: Standard next-token loss plus auxiliary MTP losses optimize the backbone and heads. Curriculum includes general knowledge, code, math, reading comprehension, commonsense, and synthetic long-context tasks.
- Why it exists: Balance breadth (many domains) and depth (planning, long-range retrieval).
- Example: Compared to a baseline 8B-active MoE, adding MTP improves average performance by ~2.4% across MMLU, ARC, RACE, GSM8K, and MBPP.
Step F: NVFP4 Pretraining Recipe
- What happens: Weights, activations, and gradients are quantized to FP4 with micro-block scaling (E4M3 block scales, FP32 global scales), stochastic rounding on gradients, and random Hadamard transforms (RHT) on the inputs to the weight-gradient (wgrad) computation (a simplified quantization sketch follows this step). Sensitive layers (QKV, attention projections) remain in BF16; Mamba output projections use MXFP8.
- Why it exists: Achieve high throughput (≈3× FP8 peak on GB300) with small loss gap to BF16 (<1% on Nano; <0.6% on larger A8B MoE).
- Example: Two runs (BF16 vs NVFP4) track similar loss/accuracy curves on MMLU, GSM8K, ARC, and code benchmarks.
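As a rough illustration of Step F, the sketch below quantizes a tensor in 16-element blocks onto the FP4 (E2M1) value grid with one scale per block, then measures the round-trip error. Real NVFP4 training additionally uses E4M3 block scales under an FP32 tensor scale, stochastic rounding on gradients, and Hadamard transforms, none of which are modeled here.

```python
# Simplified block-scaled FP4-style quantize/dequantize (core idea only).
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_dequantize_fp4(x, block=16):
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True) / 6.0          # map each block's max onto FP4 max (6)
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = (x / scale).unsqueeze(-1)                        # (blocks, 16, 1)
    idx = (scaled.abs() - E2M1_GRID).abs().argmin(dim=-1)     # nearest representable magnitude
    q = E2M1_GRID[idx] * scaled.squeeze(-1).sign()
    return (q * scale).reshape(-1)

w = torch.randn(1024)
w_fp4 = quantize_dequantize_fp4(w)
print("mean abs quantization error:", float((w - w_fp4).abs().mean()))
```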
Step G: Long-Context Continued Pretraining (CPT) and SFT
- What happens: CPT at up to 512k tokens, SFT at up to 256k, plus a long-context RL environment up to 32k. No RoPE in attention (Mamba provides implicit positions), avoiding out-of-distribution RoPE issues.
- Why it exists: Build robust long-context capabilities up to 1M during inference.
- Example: On RULER at 1M tokens, Nemotron-3-Nano (MoE hybrid) degrades gracefully, whereas a dense hybrid baseline drops abruptly.
Step H: Multi-environment RL Post-training
- What happens: Asynchronous rollout generation across math, coding (SWE-Bench, LiveCodeBench), instruction following (IFBench), agentic tool use, long context, chat, and search. Training uses GRPO with masked importance sampling (a minimal sketch of the group-relative objective follows this step). MTP accelerates rollouts.
- Why it exists: Jointly improve diverse agentic capabilities without stage-wise forgetting or reward hacking.
- Example: Over a single RL run, scores on AIME25, GPQA, the τ²-Bench average, and others steadily increase together.
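The core of GRPO is the group-relative advantage: several rollouts are sampled for the same prompt, each rollout's reward is standardized within its group, and token log-probabilities are reweighted with a clipped importance ratio. The sketch below shows only that core; the masked importance sampling and asynchronous rollout machinery from Step H are omitted.

```python
# Minimal sketch of GRPO's group-relative advantage and clipped objective.
import torch

def group_relative_advantages(rewards):             # rewards: (group_size,)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_token_loss(logp_new, logp_old, advantage, clip=0.2):
    ratio = torch.exp(logp_new - logp_old)           # per-token importance ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantage
    return -torch.minimum(unclipped, clipped).mean() # PPO-style clipped surrogate

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])    # e.g. pass/fail on a coding task
adv = group_relative_advantages(rewards)
logp_old = torch.randn(5, 12)                        # (rollouts, tokens) log-probs at sampling time
logp_new = logp_old + 0.01 * torch.randn(5, 12)      # log-probs under the updated policy
print(grpo_token_loss(logp_new, logp_old, adv.unsqueeze(1)))
```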
Step I: Inference: Speculative Decoding + Reasoning Budget Control
- What happens: MTP provides draft tokens; the main model verifies them quickly. Users can set a 'thinking' token budget; on hitting the cap, the model emits a </think> token and proceeds to the final answer (a minimal control-loop sketch follows this step).
- Why it exists: Combine speed and controllability—choose fast answers for easy prompts, deeper chains for hard ones.
- Example: On a math puzzle, increasing budget from 200 to 1000 thinking tokens moves along an accuracy vs. tokens curve, letting users pick the best trade-off.
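A minimal control loop for the reasoning budget could look like the sketch below: thinking tokens are counted, and once the budget is spent a closing </think> tag is forced so the model proceeds to its answer. The generate_token stub and the exact tag strings are assumptions made for illustration.

```python
# Minimal sketch of budget-controlled "thinking" followed by a final answer.
def generate_token(context):
    # Hypothetical stand-in for one decoding step of the real model.
    return "…"

def generate_with_budget(prompt, think_budget=200, max_answer_tokens=64):
    output, spent, closed = ["<think>"], 0, False
    while spent < think_budget:                      # thinking phase, capped by the budget
        tok = generate_token(prompt + "".join(output))
        output.append(tok)
        spent += 1
        if tok == "</think>":                        # model chose to stop thinking early
            closed = True
            break
    if not closed:
        output.append("</think>")                    # budget exhausted: force the stop-think tag
    for _ in range(max_answer_tokens):               # answer phase
        tok = generate_token(prompt + "".join(output))
        if tok == "<eos>":
            break
        output.append(tok)
    return "".join(output)

print(generate_with_budget("Solve 17 * 24.", think_budget=5)[:80])
```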
Secret Sauce:
- The hybrid design keeps most compute in Mamba-2 and MoE, slashing cache/communication costs.
- LatentMoE redirects bandwidth savings into more experts and higher K for better accuracy per byte.
- MTP tightly couples quality and speed, acting as built-in drafts for speculative decoding.
- NVFP4 scales pretraining while preserving accuracy through careful precision exceptions.
- Long-context training + multi-environment RL makes the model both broad and robust.
04Experiments & Results
The Test: The team measured both how accurate the models are on tough tasks (math, coding, instruction following, tool use, chat) and how fast they run (tokens per second per GPU), especially on long contexts and long generations. They also tracked training stability/accuracy gaps between NVFP4 and BF16 and evaluated the benefits of LatentMoE and MTP.
The Competition: Nemotron-3-Nano-30B-A3B was compared to similar-size Transformer MoEs like Qwen3-30B-A3B-Thinking and an open GPT-OSS-20B baseline. For LatentMoE vs standard MoE, both models had ~8B active and ~73B total parameters, trained 1T tokens with matched hyperparameters.
The Scoreboard (with context):
- Throughput: Nemotron-3-Nano-30B-A3B achieves ~3.3× higher output tokens/s/GPU than Qwen3-30B-A3B on typical reasoning workloads (e.g., 8k in/16k out), with even bigger gains for longer sequences. That’s like finishing three long essays in the time others finish one.
- Accuracy on Reasoning/Agentic Benchmarks: Across Arena-Hard-v2 (chat), AIME25 (math), IFBench (instruction following), τ²-Bench (tool use), SWE-Bench/LCB (coding), and RULER@1M (long context), the hybrid design achieves state-of-the-art or competitive accuracy while maintaining speed. One reported top score is 99.2% on a key benchmark (as shown in the figure).
- Long Context: On RULER at 1M tokens, Nemotron-3-Nano (MoE hybrid) shows significantly better robustness than a dense hybrid predecessor (Nemotron 2 Nano), indicating graceful degradation rather than a cliff.
- LatentMoE vs Standard MoE: With similar active/total params and training budget, LatentMoE outperforms standard MoE across MMLU-Pro, MMLU, Code, Math, and Commonsense (e.g., MMLU 70.10% → 72.11%; Code 51.95% → 55.14%). That’s like moving from a solid B to a strong B+/A- average without extra runtime cost.
- MTP Ablation: Adding MTP to an 8B-active MoE base improves accuracy by ~2.4% on average across diverse tasks (MMLU, ARC, RACE, GSM8K, MBPP), confirming both better learning signals and synergy with speculative decoding.
- NVFP4 Stability: The relative loss difference vs BF16 drops below 1% for Nano and below 0.6% for a larger A8B MoE. Downstream curves (MMLU, GSM8K, ARC, HumanEval, etc.) closely track BF16, showing that faster FP4 training maintains accuracy.
Surprising/Notable Findings:
- Avoiding RoPE in attention and leaning on Mamba’s implicit positional handling simplifies long-context extension and helps the model keep improving NLL as sequences grow to 1M tokens in code corpora.
- Reasoning-budget control yields smooth accuracy vs. token curves, enabling precise speed/quality trade-offs per query.
- MTP’s high draft-token acceptance (≈97% for the first two tokens in ablations) demonstrates that a separate draft model isn’t necessary for big wins in latency.
Takeaway: The hybrid Mamba–Transformer MoE with LatentMoE, plus MTP and NVFP4 training, sets a new efficiency-accuracy frontier—particularly for long, agentic reasoning—while matching or exceeding the accuracy of similarly sized Transformer MoEs.
05Discussion & Limitations
Limitations:
- Super and Ultra are announced but not fully released at the time of writing; some claims (e.g., broader SOTA coverage) will be verified upon release and external replication.
- While long-context tests (e.g., RULER@1M) and code NLL trends are positive, real-world 1M-token tasks vary widely; edge cases like heavily interleaved multimodal docs or extremely sparse cues need further study.
- LatentMoE and the hybrid stack rely on efficient all-to-all and Mamba kernels; throughput gains depend on high-quality system support and interconnects.
- NVFP4 stability requires careful recipes and higher precision for sensitive layers; portability to older hardware may be limited.
- Multi-environment RL reduces reward hacking but does not eliminate it; careful environment design and safety evaluations remain necessary.
Required Resources:
- Modern GPUs with NVFP4-capable stacks (e.g., Blackwell generation) for best training efficiency.
- High-bandwidth interconnects for MoE all-to-all.
- Large-scale training data (more than 10T tokens) and robust RL infrastructure (asynchronous rollouts, GRPO, NeMo-RL/Gym).
When NOT to Use:
- If you must deploy on legacy hardware lacking efficient Mamba/MoE kernels or high-speed interconnects, a dense small Transformer may be simpler.
- Ultra-low-latency, micro-batch=1 scenarios on tiny edge devices might favor smaller dense models without MoE communication.
- If your tasks never exceed short contexts and don’t need deep reasoning, a standard compact Transformer could suffice.
Open Questions:
- How does LatentMoE scale beyond current N and K—where is the point of diminishing returns?
- What are the best curricula for ultra-long multimodal contexts (text + code + tables + images) without RoPE?
- Can MTP be extended to structured drafts (AST nodes for code, equation trees for math) for even higher acceptance and planning gains?
- How do budget-control policies generalize across domains—can models learn to self-budget tokens adaptively?
- What safety and alignment frameworks pair best with multi-environment RL at scale?
06Conclusion & Future Work
In three sentences: Nemotron 3 mixes Mamba-2, sparse experts (MoE/LatentMoE), and a pinch of attention to deliver fast, accurate long-context reasoning, supercharged by MTP for planning and NVFP4 for affordable training. It’s trained and tuned across many environments so it can code, calculate, follow instructions, use tools, and handle million-token inputs, all while letting users pick a ‘thinking budget’ at inference. NVIDIA plans open releases of weights, software, recipes, and much of the data, making high-quality agentic AI widely accessible.
Main achievement: Establishing a practical, open blueprint for state-of-the-art agentic models that sustain accuracy while dramatically improving throughput—especially for long, multi-step tasks—via a hybrid Mamba–Transformer MoE with LatentMoE and MTP.
Future directions: Scale Super and Ultra with full releases; expand long-context training to richer multimodal settings; explore structured MTP drafts; automate adaptive reasoning budgets; refine NVFP4 recipes across more architectures; and deepen safety with multi-environment RL.
Why remember this: Nemotron 3 shows that you don’t have to choose between speed and smarts. With the right mix of layers, smarter expert routing, future-token planning, efficient numerics, and broad RL, you can build open, practical AI agents that think long and run fast.
Practical Applications
- •Enterprise IT ticket triage and resolution across months of logs and configurations.
- •Codebase-wide refactoring, bug localization, and pull-request review for large repositories.
- •Long-form research assistants that synthesize insights from books, papers, and reports.
- •Customer support agents that remember extended histories and coordinate multi-step tool use.
- •Data engineering copilots that track and fix schema changes across many files over time.
- •Mathematics and science tutors that show step-by-step reasoning with adjustable depth.
- •Compliance and audit assistants that scan huge document sets for policy violations.
- •Legal or policy draft analysis over very long briefs and case histories.
- •Multi-agent orchestration for complex workflows (search, plan, code, verify) at high throughput.
- •On-device or edge deployments of the Nano model for cost-efficient, high-volume workloads.