Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
Key Summary
- Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions created in three thinking styles and with or without Python help.
- It mixes competition-style AoPS problems with community questions from StackExchange-Math to teach models many ways people actually solve math.
- A new training recipe called sequential bucketed training makes long-context fine-tuning 2–3× faster while keeping accuracy within 1–3% of the slower method.
- Models trained on Nemotron-Math beat those trained on OpenMathReasoning in controlled, apples-to-apples tests on AIME and HMMT.
- Adding StackExchange-Math improves robustness on open-ended math (HLE-Math) without hurting competition scores.
- Under high-reasoning mode with Python tools, both Qwen3-8B and Qwen3-30B-A3B hit 100% majority-vote accuracy on AIME 2024 and 2025.
- The dataset keeps only correct, verified reasoning traces and filters out problems that are too easy to ensure strong learning signals.
- Training from short to long sequences by buckets avoids wasting compute on long-context settings when most data are short.
- Care is needed to keep mode balance at long lengths so medium/low modes don't accidentally turn into always-long, high-depth chains.
- This work shows that diverse supervision plus efficient long-context training can make small and larger models converge to similar strong math skills.
Why This Research Matters
Better math reasoning means safer, more helpful AI tutors that can show their work and check their calculations. Multi-mode supervision teaches models to be concise when problems are simple and thorough when they are tricky, much like a good teacher. Tool integration makes the model more dependable by catching arithmetic and algebraic slips that humans often make. The efficient training schedule cuts costs and energy use, making strong long-context models more accessible to schools, nonprofits, and small labs. Mixing contest and community problems improves robustness to real-world phrasing and messy inputs. Together, these advances push AI from just giving answers to actually reasoning transparently and accurately.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how learning math is easier when you see many different solution styles: some quick, some detailed, and sometimes you even grab a calculator to double-check? Models need that, too.
The Concept (Fundamentals of Machine Learning): Machine learning is when a computer learns patterns from many examples so it can solve new problems on its own. How it works:
- Collect examples and answers.
- Show them to the model and let it try.
- Nudge it closer to the right answers each time. Why it matters: Without lots of good examples, the model copies bad habits or forgets important steps. Anchor: Like practicing 100 fraction problems teaches you the common tricks you can reuse on a new fraction puzzle.
Hook: Imagine reading a really long mystery novel: you need to remember clues from the beginning to solve the ending.
The Concept (Long-Context Training): Long-context training helps a model read and use very long chains of steps or long problem histories. How it works:
- Give the model long inputs (like full step-by-step math solutions).
- Train it to pay attention across thousands of tokens.
- Teach it to keep details straight over many steps. Why it matters: Without long-context ability, the model forgets earlier steps and makes mistakes later. Anchor: Solving a geometry proof needs earlier definitions and lemmas; long-context lets the model keep them in mind.
The World Before: Many math datasets taught models with single-style solutions (mostly formal, competition-style problems), so models learned a narrow voice and sometimes brittle habits. Tool use (like Python) was often missing, and long explanations were limited, so models struggled with multi-step calculations, verification, and remembering long chains.
The Problem: We needed a dataset that teaches many different ways to reason (short, medium, deep), includes tool use (like Python for checking or computing), covers both formal contest problems and real-life style questions, and supports very long reasoning chains. We also needed a way to fine-tune models on long sequences without wasting tons of compute.
Failed Attempts: People tried making only harder contest problems (great for difficulty but narrow in style), generating solutions in a single reasoning mode (uniform and less robust), or training everything at max context (very slow and inefficient, because most samples are short).
The Gap: Missing were (1) multi-mode reasoning supervision, (2) tool-integrated reasoning traces, (3) diverse community questions, and (4) an efficient training recipe for ultra-long contexts.
Real Stakes: Better math reasoning helps AI tutors explain steps clearly, helps engineers and scientists avoid calculation slips, and saves energy and money by training efficiently. It can also make small models more capable, bringing high-quality math help to more students and schools.
02 Core Idea
Hook: Imagine learning from three teachers at once: one who sketches quick shortcuts, one who explains just enough, and one who shows every careful detail while checking with a calculator.
The Concept (Nemotron-Math): Nemotron-Math is a huge math dataset with 7.5M step-by-step solutions that come in three reasoning modes and with or without Python tool help. How it works:
- Gather problems from AoPS (contests) and StackExchange-Math (community).
- Use a strong teacher model to generate multiple solutions: high/medium/low depth, with/without Python.
- Keep only solutions that reach correct answers and remove trivial problems. Why it matters: Without diverse, verified solutions, models learn a single brittle style and make unforced errors. Anchor: It's like a math library where every problem has several worked solutions, from a quick sketch to a full check-with-Python write-up.
Aha! Moment (one sentence): Teach models with multi-mode, tool-integrated, long-form solutions and train them from short to long sequences in stages to get high accuracy fast.
Three Analogies:
- Orchestra: Different instruments (modes) make richer music; a conductor (training schedule) brings them together smoothly.
- Hiking: Start on easy trails (short sequences), then medium, then summit (128K tokens) so you don't burn out.
- Toolbox: Pencils for notes, calculators (Python) for checks; the right tool at the right time makes you more reliable.
Before vs After:
- Before: Single-style, competition-heavy data; slow long-context training; weaker robustness.
- After: Multi-style, tool-aware data; 2–3× faster long-context fine-tuning; stronger accuracy (e.g., +13.1% on AIME25 no-tool) and even 100% majority-vote on AIME with tools.
Why It Works (intuition):
- Diversity trains flexibility: seeing high/medium/low modes teaches when to be brief or thorough.
- Tools reduce arithmetic slips: Python verifies steps and numbers.
- Filtering keeps signals strong: only correct traces remain.
- Length-staged training uses compute wisely: most data are short, so don't force long settings until needed.
Building Blocks:
- Sources: AoPS (structured) + StackExchange-Math (diverse language and topics).
- Multi-Mode + TIR: High/medium/low with/without Python.
- Quality Control: Majority votes, LLM-as-judge, remove too-easy problems.
- Efficient Training: Sequential bucketed schedule (16K → 32K → 64K → 128K).
- Evaluation: AIME, HMMT, HLE-Math; pass@1 and maj@k to reflect single-shot and ensemble-like reliability.
Anchor: Like studying algebra from a contest workbook plus a student forum and practicing with and without a calculator, then starting with short quizzes before tackling full-length exams.
03 Methodology
High-level recipe: Problem → multi-mode solution generation (with/without Python) → verify and filter → group by length buckets → fine-tune from short to long → evaluate.
Hook: Think of cooking a big meal: prep ingredients, follow a recipe step by step, taste and adjust, then serve.
The Concept (Data Sourcing: AoPS + StackExchange-Math): Combine structured contest problems (AoPS) with real-world style community questions (StackExchange-Math). How it works:
- Collect problems from AoPS and StackExchange-Math (plus MathOverflow).
- Remove proof-only tasks and benchmark overlaps.
- Keep challenging questions by filtering out those the teacher finds trivial. Why it matters: Without both structured and community problems, the model won't generalize to different phrasings and contexts. Anchor: It's like studying from both a polished textbook and a lively Q&A forum.
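The text above only says that proof-only tasks and benchmark overlaps are removed; it does not specify how. The following Python sketch shows one plausible way to do it, using simple keyword heuristics and an assumed 8-gram overlap test; the marker list, threshold, and function names are illustrative assumptions, not the paper's actual decontamination rules.

```python
# Illustrative source cleaning: drop proof-only tasks and problems that overlap
# evaluation benchmarks. The keyword heuristics, 8-gram test, and 0.5 threshold
# are assumptions for this sketch, not the paper's exact decontamination rules.
import re

PROOF_MARKERS = ("prove that", "show that", "prove or disprove")  # assumed heuristic

def ngrams(text: str, n: int = 8) -> set[str]:
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_proof_only(problem: str) -> bool:
    # Heuristic: the question asks for a proof rather than a checkable final answer.
    text = problem.lower()
    return any(marker in text for marker in PROOF_MARKERS)

def overlaps_benchmark(problem: str, benchmark_ngrams: set[str], threshold: float = 0.5) -> bool:
    grams = ngrams(problem)
    if not grams:
        return False
    return len(grams & benchmark_ngrams) / len(grams) >= threshold

def clean_sources(problems: list[str], benchmark_problems: list[str]) -> list[str]:
    bench: set[str] = set()
    for p in benchmark_problems:
        bench |= ngrams(p)
    return [p for p in problems if not is_proof_only(p) and not overlaps_benchmark(p, bench)]
```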
Hook: Imagine three coaches: one says "be concise," one says "explain enough," one says "explain everything and check it."
The Concept (High/Medium/Low Reasoning Modes): Three depths of thinking for the same problem. How it works:
- Low mode: short heuristic chains.
- Medium mode: moderate detail.
- High mode: long, careful steps with self-checks. Why it matters: Without mode variety, the model overfits to one pace of thinking. Anchor: Picking between a quick sketch solution and a full derivation depending on the problem.
Hook: Sometimes you grab a calculator or run a small script to be sure.
The Concept (Python Tool-Integrated Reasoning, TIR): Let solutions call Python to compute, verify, and explore. How it works:
- Insert Python calls during reasoning.
- Execute code to get exact numbers or symbolic steps.
- Use outputs to guide or confirm the next steps. Why it matters: Without tools, long arithmetic or algebra can go wrong even if the logic is fine. Anchor: Checking a big combinatorics count with a quick Python loop.
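To make that anchor concrete, here is the kind of quick check a tool-integrated trace might run: confirming a combinatorial count by brute force before trusting the pencil-and-paper derivation. This is an illustrative example, not a trace taken from the dataset.

```python
# Illustrative tool-integrated check: confirm a combinatorial count by brute
# force before trusting the pencil-and-paper result. Here we verify that the
# number of derangements of 5 items (permutations with no fixed point) is 44,
# matching the inclusion-exclusion formula D(n) = n! * sum_{k=0..n} (-1)^k / k!.
from itertools import permutations
from math import factorial

def derangements_brute_force(n: int) -> int:
    # Enumerate all n! permutations and count those with no fixed point.
    return sum(
        all(p[i] != i for i in range(n))
        for p in permutations(range(n))
    )

def derangements_formula(n: int) -> int:
    # Closed form from inclusion-exclusion, rounded to the nearest integer.
    return round(factorial(n) * sum((-1) ** k / factorial(k) for k in range(n + 1)))

assert derangements_brute_force(5) == derangements_formula(5) == 44
print(derangements_brute_force(5))  # 44
```

When the loop and the closed form agree, the trace can carry the confirmed number forward; when they disagree, the model knows to revisit its reasoning.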
Step-by-step pipeline:
- Generate solutions: Use a strong teacher (gpt-oss-120b) to produce solutions under 6 settings per problem (high/medium/low × with/without Python), with 8 attempts per setting using different random seeds.
- Why this step: Diversity plus redundancy improves chances of correct and varied traces.
- Example: For a geometry ratio problem, low mode might outline key ratios, medium adds intermediate steps, high mode derives every equality and uses Python to compute exact angles.
- Verify answers and clean: Keep only solutions that match expected answers (by automatic checking or LLM-as-judge) and replace noisy forum answers with a majority vote when needed (see the quality-control sketch after this list).
- Why this step: Ensures the dataset teaches correct habits.
- Example: If the forum says 42 but 16 high-mode traces agree on 40 and show consistent math, adopt 40.
- Filter out trivial problems: If low-mode passes ≥80% of the time, drop the problem.
- Why this step: Trivial items waste tokens and donāt teach new skills.
- Example: Simple linear equation where nearly all attempts succeed is removed.
- Bucket by length and train sequentially: Group traces by token count and train in stages 16K → 32K → 64K → 128K with matched parallelism for each length (see the schedule sketch below).
- Why this step: Most data are short; using long-context settings on short data is wasteful.
- Example: On 16K, you use a fast configuration (about 18s/step) instead of the heavy 128K setup (~25s/step).
- Maintain mode balance at long lengths: Sample some medium/low mode at the 128K stage so those modes don't collapse into always-long behavior.
- Why this step: Preserves the intended differences among modes.
- Example: Without balance, medium begins to produce overly long chains even when a concise path suffices.
- Fine-tune student models: Train Qwen3-8B and Qwen3-30B-A3B under identical recipes across modes and with/without tools.
- Why this step: Tests scaling and generality of the supervision and recipe.
- Example: Both models approach similar final accuracy on AIME and HLE-Math.
- Evaluate: Use competition benchmarks (AIME24/25, HMMT 24–25) and open-domain math (HLE-Math). Compute pass@1 and maj@k (majority vote over multiple attempts).
- Why this step: Single-shot checks reliability; majority vote shows stable reasoning behavior.
- Example: Hitting 100% maj@16 on AIME means the majority answer across 16 attempts is correct on every problem.
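A minimal Python sketch of the quality-control steps from the pipeline above: repair a noisy forum answer by majority vote over generated traces, keep only traces that match the (possibly repaired) answer, and drop problems that low mode already solves at least 80% of the time. The 80% cutoff comes from the text; the 75% agreement threshold and the trace fields are assumptions for illustration.

```python
# Sketch of the quality-control steps above. The 80% low-mode cutoff is stated
# in the text; the 75% agreement threshold and the trace fields ("final_answer",
# "mode") are illustrative assumptions.
from collections import Counter

def repair_answer(reference: str, trace_answers: list[str], min_agreement: float = 0.75) -> str:
    """Adopt the traces' majority answer if it disagrees with a noisy reference
    but is strongly agreed upon (e.g., 16 high-mode traces converging on 40
    against a forum answer of 42)."""
    if not trace_answers:
        return reference
    majority, count = Counter(trace_answers).most_common(1)[0]
    if majority != reference and count / len(trace_answers) >= min_agreement:
        return majority
    return reference

def keep_correct(traces: list[dict], answer: str) -> list[dict]:
    """Keep only traces whose final answer matches the (possibly repaired) answer."""
    return [t for t in traces if t["final_answer"] == answer]

def is_trivial(traces: list[dict], answer: str, pass_threshold: float = 0.8) -> bool:
    """Drop a problem if low-mode attempts already succeed at least 80% of the time."""
    low = [t for t in traces if t["mode"] == "low"]
    if not low:
        return False
    passes = sum(t["final_answer"] == answer for t in low)
    return passes / len(low) >= pass_threshold
```

In the forum example above, 16 high-mode traces agreeing on 40 against a reference of 42 would clear the agreement threshold, so 40 becomes the adopted answer.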
Secret Sauce:
- Multi-mode + Tools + Long-form traces = Diverse, correct supervision.
- Sequential bucketed training = 2–3× speedup with tiny accuracy trade-offs (sketched below).
- Careful filtering and answer repair = Strong signals with less noise.
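Here is a minimal sketch of that sequential bucketed schedule: traces are grouped into 16K/32K/64K/128K token buckets, each stage trains with settings matched to its maximum length, and the final 128K stage mixes in some medium/low-mode samples to preserve mode balance. `train_stage`, the sample fields, and the 10% rebalancing fraction are placeholders for illustration, not the NeMo-Skills/NeMo-RL API used in the paper.

```python
# Sketch of the sequential bucketed schedule. `train_stage`, the sample fields
# ("num_tokens", "mode"), and the 10% rebalancing fraction are placeholders for
# illustration, not the NeMo-Skills/NeMo-RL API used in the paper.
import random

BUCKETS = [16_384, 32_768, 65_536, 131_072]  # max sequence length per stage

def bucket_by_length(samples: list[dict]) -> dict[int, list[dict]]:
    buckets: dict[int, list[dict]] = {limit: [] for limit in BUCKETS}
    for s in samples:
        for limit in BUCKETS:
            if s["num_tokens"] <= limit:
                buckets[limit].append(s)  # smallest bucket that fits the sample
                break
    return buckets

def rebalance_modes(long_bucket: list[dict], shorter: list[dict], fraction: float = 0.1) -> list[dict]:
    # Mix some medium/low-mode samples into the 128K stage so those modes
    # don't drift toward always-long, high-depth behavior.
    pool = [s for s in shorter if s["mode"] in ("medium", "low")]
    extra = random.sample(pool, min(len(pool), int(fraction * len(long_bucket))))
    return long_bucket + extra

def run_schedule(samples: list[dict], train_stage) -> None:
    buckets = bucket_by_length(samples)
    shorter = [s for limit in BUCKETS[:-1] for s in buckets[limit]]
    for limit in BUCKETS:
        data = buckets[limit]
        if limit == BUCKETS[-1]:
            data = rebalance_modes(data, shorter)
        # Each stage runs with parallelism/memory settings matched to its max
        # length, so short data never pays the cost of the 128K configuration.
        train_stage(data, max_seq_len=limit)
```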
Anchor: Like training for a marathon: you run short distances fast (short buckets, efficient settings), sprinkle in hill training (tools and high mode), then do a few long runs (128K) to be fully ready for race day.
04 Experiments & Results
Hook: Think of a science fair where every project is judged by multiple criteria (speed, accuracy, and reliability) and compared to last year's winners.
The Concept (Evaluation and Metrics): We test trained models on trusted math exams and measure single-try accuracy and majority-vote accuracy across several attempts. How it works:
- Benchmarks: AIME24, AIME25, HMMT 24–25 (competition style), and HLE-Math (diverse, open-domain).
- Metrics: pass@1 (one try) and maj@k (majority vote over k tries; k=16 for contests, 4 for HLE-Math).
- Tool settings: with or without Python TIR, under high/medium/low modes. Why it matters: Without fair tests and clear metrics, we don't know if improvements are real or robust. Anchor: Like grading a math quiz (single shot) and also checking the most common answer across a study group (majority vote).
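A small sketch of how the two metrics can be computed from k sampled answers per problem. Averaging per-sample accuracy is one common way to estimate pass@1, and majority-vote ties are broken arbitrarily here, so the details may differ from the paper's evaluation harness.

```python
# Sketch of the two metrics given k sampled answers per problem. Averaging
# per-sample accuracy is one common way to estimate pass@1, and majority-vote
# ties are broken arbitrarily here; the paper's harness may differ in details.
from collections import Counter

def pass_at_1(attempts: list[list[str]], truths: list[str]) -> float:
    """Single-try accuracy: fraction of sampled answers that are correct, averaged over problems."""
    per_problem = [
        sum(a == t for a in answers) / len(answers)
        for answers, t in zip(attempts, truths)
    ]
    return sum(per_problem) / len(per_problem)

def maj_at_k(attempts: list[list[str]], truths: list[str], k: int = 16) -> float:
    """Majority vote over the first k attempts for each problem."""
    correct = 0
    for answers, t in zip(attempts, truths):
        majority, _ = Counter(answers[:k]).most_common(1)[0]
        correct += majority == t
    return correct / len(truths)

# Tiny example: two problems, four attempts each.
attempts = [["40", "40", "41", "40"], ["7", "9", "7", "7"]]
truths = ["40", "7"]
print(pass_at_1(attempts, truths))      # (3/4 + 3/4) / 2 = 0.75
print(maj_at_k(attempts, truths, k=4))  # 1.0
```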
The Competition (Baselines):
- OpenMathReasoning (updated pipeline) for data comparison.
- Pretrained Qwen3-8B and Qwen3-30B-A3B as model baselines.
Scoreboard Highlights (with context):
- Nemotron-Math vs OpenMathReasoning (no tool, high mode, matched AoPS problems and size): Nemotron-Math wins across AIME24/25 and HMMT; for example, on AIME25 pass@1 jumps from 59.38% to 77.08% (like moving from a C to a strong B+/A–), and maj@16 from 71.67% to 90%.
- Adding StackExchange-Math: Improves HLE-Math robustness consistently (e.g., more diverse language helps) while keeping or slightly improving competition scores.
- Sequential bucketed vs full-length training: Accuracy typically within 1–3% of full-length training but with 2–3× faster end-to-end cost, like finishing the same homework with a study hack that saves hours but loses almost no points.
- Peak results: High mode with Python TIR achieves 100% maj@16 on AIME24/25 for both Qwen3-8B and Qwen3-30B-A3B, showing that strong supervision + tools can reach perfect ensemble reliability on these contests.
- Notable jump: On AIME25 (no tool), high-mode fine-tuning boosts Qwen3-30B-A3B by 13.1% absolute pass@1 over the baseline (71.67% → about mid-80s), a big practical gain.
Surprising Findings:
- Scaling parity: Qwen3-8B and Qwen3-30B-A3B show very similar learning curves and final accuracy under the same recipe, suggesting the supervision is strong enough that size matters less than expected.
- HLE quirk: 8B slightly outperforms 30B on HLE-Math without tools in some runs.
- Mode drift risk: If the 128K stage uses only high-mode data, medium/low start producing long chains, blurring mode differences; balancing fixes this.
05 Discussion & Limitations
Hook: Even a great study plan has trade-offs, like saving time but needing reminders not to skip certain topics.
The Concept (Limitations and Trade-offs): This approach is powerful but not magic; it depends on careful data curation and training choices. How it works:
- Mode imbalance risk: Long buckets contain fewer medium/low samples; if ignored, these modes drift toward long, high-depth behavior.
- Data skew: Most traces are short; truly ultra-long examples are rarer.
- Teacher reliance: Generation depends on gpt-oss-120b quality; teacher biases can transfer.
- Judging noise: LLM-as-judge for HLE-Math can introduce subtle grading bias.
- Domain scope: Proof-heavy tasks and non-math domains aren't the focus here. Why it matters: Knowing where it can stumble helps you apply it wisely and improve it next time. Anchor: Like remembering to solve a few word problems at the end of practice so you don't forget how to parse tricky language.
Required Resources:
- Multi-GPU training with tensor/context parallelism, NeMo-Skills/NeMo-RL stack, long-context memory.
- Dataset access and storage for 7.5M traces.
When NOT to Use:
- Pure proof-theory tasks needing formal verification.
- Multimodal math (images/diagrams) not covered here.
- Extremely short-context-only training where long-context overhead gives no benefit.
- Domains far from math without adaptation.
Open Questions:
- How to best mix high/medium/low and with/without tools for different goals?
- Can the bucketed strategy transfer to code, science QA, or legal reasoning?
- How to expand verifiable, ultra-long traces without overfitting to the teacher's style?
- What's the best way to combine this SFT recipe with RL or self-play for further gains?
06 Conclusion & Future Work
Three-sentence summary: Nemotron-Math provides 7.5M diverse, verified, and often tool-augmented math reasoning traces across three modes, blending contest problems with community questions. A sequential bucketed training schedule makes 128K-context fine-tuning 2–3× faster with only a 1–3% accuracy trade-off, and models trained this way outperform prior datasets in controlled tests. With high mode and Python tools, both small and larger models reach perfect majority-vote accuracy on AIME, showing the strength and generality of the supervision.
Main achievement: Uniting multi-mode, tool-integrated long-form supervision with an efficient long-context fine-tuning recipe that delivers state-of-the-art accuracy at a fraction of the training cost.
Future directions: Extend to proofs and multimodal math, refine mode/tool mixing policies, automate long-trace verification further, and explore synergy with reinforcement learning for even stronger reasoning.
Why remember this: It shows that the right kind of examples (diverse, verified, tool-aware) plus a smart training schedule can teach powerful math skills efficiently, even making smaller models compete with bigger ones, pointing the way to more accessible, reliable AI tutors and problem solvers.
Practical Applications
- Build step-by-step AI math tutors that adapt their explanation depth to the student's needs.
- Develop homework helpers that verify calculations with Python to reduce arithmetic errors.
- Create curriculum-aligned practice systems mixing contest-style and real-world questions.
- Train efficient long-context models for grading and giving feedback on multi-step solutions.
- Design robust math assistants for engineers and scientists that handle long derivations.
- Construct evaluation pipelines using pass@1 and maj@k to track reliability under multiple tries.
- Use the bucketed schedule to fine-tune other long-context tasks (e.g., code proofs, legal reasoning).
- Bootstrap high-quality datasets by majority-voting multiple teacher solutions to clean noisy answers.
- Balance reasoning modes during training to preserve concise vs detailed behaviors for different tasks.