Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Key Summary
- The paper introduces Nemotron-Cascade, a step-by-step (cascaded) reinforcement learning recipe that trains an AI across domains like alignment, instructions, math, coding, and software engineering—one at a time.
- Instead of mixing everything together, the model learns each subject in sequence, which simplifies training and avoids breaking skills learned earlier.
- A key surprise: starting with RLHF (learning from human preferences) makes the model write clearer, shorter thoughts, which then boosts math and code results later.
- The unified 8B model can switch between 'thinking' and 'instant' modes using simple /think and /no_think flags, closing the gap with a dedicated 'thinking' model.
- The 14B thinking model beats its much bigger SFT teacher (DeepSeek-R1-0528) on LiveCodeBench v5/v6/Pro and reaches silver-medal performance on IOI 2025.
- Careful tricks like on-policy GRPO, dynamic filtering, and staged response-length extension keep RL stable and efficient.
- For software engineering, they design an execution-free reward to scale training without running heavy test containers.
- Across many benchmarks (MMLU, AIME, LiveCodeBench, SWE-bench Verified), Cascade RL delivers state-of-the-art or best-in-class numbers for 8B/14B models.
- Later domain RL stages rarely harm earlier skills—and often improve them—showing resistance to catastrophic forgetting.
- They publish data and recipes, making the approach transparent and reproducible.
Why This Research Matters
A single, reliable AI that can both think deeply and answer instantly is easier to deploy, safer to control, and more useful in real workflows. By training domains in sequence with the right rewards, companies can get top results without juggling multiple models or risking that new skills erase old ones. Clearer, shorter chains of thought make the AI more efficient and cheaper to run, while still solving tough math and coding problems. Scalable SWE training means AI can help fix real software faster without massive infrastructure overhead. The open recipes and data help the community build on this work, accelerating progress in transparent, reproducible ways.
Detailed Explanation
01 Background & Problem Definition
You know how in school you don’t learn science, math, and art all mixed together in one super-lesson? You usually learn them in blocks: first math class, then science, then art. That way your brain can focus and get good at each subject without getting confused. Training AI to reason well in many areas—like answering questions politely, following precise instructions, solving math puzzles, writing code, and fixing real software bugs—has had the same problem: when everything gets mixed together at once, things get messy.

What the world looked like before: Big AI models improved with reinforcement learning (RL), which means they try things, get a score, then adjust to do better next time. But each domain gives rewards differently and at different speeds. Math can be checked quickly with rules (fast). Code needs to be run and tested (slow). Human preference alignment uses a learned judge (moderate). Blending all these at once made the training setup complicated, slowed everything down, and made it hard to pick the right settings (like how long answers can be or how often to update).

Another twist: model families were splitting into two types. Some were 'thinking' models that write long chains of thought before answering, and others were 'instruct' models that give quick answers. Keeping two separate models is harder to manage and share. People also worried about 'catastrophic forgetting,' where learning a new skill knocks out an old one.

The specific problem: How can we train one general-purpose reasoning model that (1) gets top results across many domains, (2) can do both 'think a lot' and 'answer fast,' (3) stays stable, fast, and simple to train, and (4) doesn’t forget earlier skills when it learns new ones?

What failed attempts looked like: Many tried mixing prompts from all domains together for one big RL run. That raised engineering overhead (very different reward speeds), made curricula (like length limits) awkward, and sometimes caused skill interference. Others tried to build one unified model but found the thinking mode got worse than a dedicated thinking model, so they split models again.

The gap: We needed a training plan that (a) respects domain differences, (b) lets us tune settings per domain, (c) sequences learning so gains pile up, and (d) offers a clean way to switch between thinking and non-thinking behaviors without training two separate models.

Real stakes in daily life:
• Better code assistants that can fix real bugs in big projects.
• Tutors that solve math step by step without getting lost.
• Helpers that follow detailed instructions exactly when it matters (e.g., legal forms, medical notes, safety policies).
• Faster, cheaper training pipelines that still reach state-of-the-art results.
• One reliable model you can control turn-by-turn (/think vs /no_think), rather than juggling multiple models.
02 Core Idea
🍞 Hook: Imagine learning to cook by mastering breakfast first, then lunch, then dinner. You keep what you learned, and each new meal adds skills without ruining the old ones.

🥬 The Concept: The key insight is to do reinforcement learning in a sequence of domains—first align with people (RLHF), then learn to follow strict instructions (IF-RL), then math, then code, then software engineering—so each stage can be tuned perfectly and earlier wins are preserved.

Why it works:
• RL samples from the model’s own behavior, so useful habits keep showing up and don’t disappear.
• Rewards across domains often agree (be correct, be clear, avoid nonsense), so learning in one area tends to help others.
• Doing domains in order lets us choose the right token budgets, temperatures, and verifiers for each stage.

The change before vs. after: Before, one giant blended RL run was hard to stabilize and tune. After, Cascaded RL simplifies engineering, increases stability, and lifts scores across the board, with minimal forgetting.

Building blocks (explained with mini 'sandwiches' below): 1) RLHF, 2) Cascade RL itself, 3) Reward Modeling, 4) Thinking vs Instruct Mode, 5) IF-RL, 6) Math RL, 7) Code RL, 8) SWE RL, 9) Why catastrophic forgetting is resisted, 10) GRPO on-policy training, 11) Dynamic filtering, 12) Response-length extension, 13) Test-time scaling on IOI, 14) Execution-free SWE verifier.

🍞 Hook: You know how a teacher gives thumbs up or down on your essay?
🥬 RLHF (Reinforcement Learning from Human Feedback):
• What: The model learns to write answers people prefer.
• How: A reward model scores each response; higher-scored styles and contents get reinforced.
• Why it matters: Without it, answers can be rambling or unhelpful.
🍞 Example: Picking the clearer, kinder explanation when helping with homework.

🍞 Hook: Think of finishing kindergarten before first grade, then second—step by step.
🥬 Cascade RL:
• What: Train domains in sequence (RLHF → IF-RL → Math → Code → SWE).
• How: Tune length limits, reward checks, and temperatures per stage.
• Why it matters: Mixing everything at once slows training and causes conflicts.
🍞 Example: After learning polite chatting, the model learns strict instructions, then hard math, then precise coding, and finally big software fixes.

🍞 Hook: A referee decides points by the rules.
🥬 Reward Modeling (RM):
• What: A model that scores answers by human preference.
• How: It compares pairs of answers and learns which is better.
• Why it matters: Bad scoring steers learning the wrong way.
🍞 Example: Rewarding concise, correct advice over long, off-topic rambles.

🍞 Hook: Sometimes you brainstorm out loud; sometimes you give a quick answer.
🥬 Thinking Mode vs Instruct Mode:
• What: Two ways to answer—long chains of thought (/think) or fast replies (/no_think).
• How: A simple flag in the user message switches modes per turn.
• Why it matters: You get control for speed or depth when you need it.
🍞 Example: /think for tricky puzzles; /no_think for a quick date conversion.

🍞 Hook: Imagine assembling furniture exactly as the manual says.
🥬 IF-RL (Instruction-Following RL):
• What: Train the model to respect strict constraints (like word limits or formats).
• How: Use a rule-checker plus human-preference signals to reward both correctness and quality.
• Why it matters: A perfect format is useless if the content is awful—and vice versa.
🍞 Example: “Write under 150 words, include 3 bullets, and no emojis.”

🍞 Hook: Solving math puzzles gets easier when you can think longer without drifting.
🥬 Math RL:
• What: Improve step-by-step math using fast, rule-based checks for correct answers.
• How: Gradually allow longer thoughts (24K → 32K → 40K) while filtering too-easy or impossible items.
• Why it matters: Without careful length control, the model either truncates or wastes tokens.
🍞 Example: Counting ways to tile a floor and boxing the final number.

🍞 Hook: Programming contests are like timed puzzles with test cases as referees.
🥬 Code RL:
• What: Improve competitive coding by rewarding only code that passes all tests.
• How: Use parallel/asynchronous verifiers; sample with higher temperature to explore.
• Why it matters: Without strict tests, the model might pass easy cases but fail edge cases.
🍞 Example: Sorting a giant list within time limits and passing hidden tests.

🍞 Hook: Fixing a real app’s bug is like finding the right wire and repairing it.
🥬 SWE RL:
• What: Improve code repair patches for large repos without running heavy containers.
• How: Reward with patch similarity (lexical and semantic) to ground truth; train on long, mixed-file prompts.
• Why it matters: Execution is slow; a smart, scalable reward speeds learning.
🍞 Example: Changing only the faulty function to stop a crash.

🍞 Hook: When you learn a new song, you don’t forget how to tie your shoes.
🥬 Resistance to Catastrophic Forgetting:
• What: Earlier skills stick around as you learn new ones.
• How: In RL, useful behaviors keep getting sampled; rewards across domains align (be accurate, be clear).
• Why it matters: Without this, each new stage could erase past abilities.
🍞 Example: After code RL, math scores stay solid or even improve.

🍞 Hook: Practice with today’s playbook—not yesterday’s.
🥬 GRPO On-Policy Training:
• What: Collect samples from the current model and update once—no stale data.
• How: Use group-normalized rewards and token-level learning.
• Why it matters: It stabilizes learning and reduces weird drifts.
🍞 Example: Grading today’s essays and immediately teaching from them.

🍞 Hook: Keep only workouts that actually make you stronger.
🥬 Dynamic Filtering:
• What: Reuse problems that give learning signal; drop ones that are too easy or unsolvable.
• How: After each epoch, resample a small fraction of easy/hard items; keep ~90% useful ones.
• Why it matters: Without it, the model wastes time.
🍞 Example: Stop practicing 1+1 after you’ve mastered it.

🍞 Hook: Start with short runs; then try longer marathons.
🥬 Response-Length Extension:
• What: Train in stages to compress, stabilize, and then extend thinking length.
• How: 24K → 32K → 40K token budgets tuned per stage.
• Why it matters: Too short truncates; too long gets messy.
🍞 Example: Writing a clear proof that grows only when needed.

🍞 Hook: Learn more during the test by listening to feedback.
🥬 Test-Time Scaling (IOI):
• What: Generate many candidates, pick the tail-best, submit, read judge feedback, and iterate up to 50 rounds.
• How: Keep short history; share insights across subtasks.
• Why it matters: Feedback turns tests into learning opportunities.
🍞 Example: Improving a constructive algorithm after each partial score.

🍞 Hook: If you don’t run every machine, compare designs smartly.
🥬 Execution-Free SWE Verifier:
• What: Reward patch similarity (text and meaning) when running tests is too heavy.
• How: Combine lexical diff and an LLM’s semantic YES-probability.
• Why it matters: Lets you train at scale without spinning up containers.
🍞 Example: Accepting a different-but-correct refactor that fixes the bug.
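To make the last block concrete, here is a minimal Python sketch of an execution-free patch reward. It is not the paper's implementation: the rough parseability check, the Jaccard-over-changed-lines stand-in for the LLM judge's YES-probability, and the 50/50 blend are all illustrative assumptions; only the 1 / 0 / -1 special cases and the "semantic plus lexical" idea come from the text.

```python
import difflib


def lexical_similarity(pred_patch: str, gold_patch: str) -> float:
    """Character-level similarity between two diffs, in [0, 1]."""
    return difflib.SequenceMatcher(None, pred_patch, gold_patch).ratio()


def _changed_lines(patch: str) -> set[str]:
    """Added/removed lines of a unified diff (ignoring file headers)."""
    return {ln for ln in patch.splitlines()
            if ln.startswith(("+", "-")) and not ln.startswith(("+++", "---"))}


def semantic_yes_probability(pred_patch: str, gold_patch: str) -> float:
    """Crude stand-in for an LLM judge's probability of answering YES to
    'does the predicted patch implement the same fix as the reference?'.
    Here: overlap of changed lines; a real system would query a model."""
    a, b = _changed_lines(pred_patch), _changed_lines(gold_patch)
    return len(a & b) / max(1, len(a | b))


def swe_reward(pred_patch: str, gold_patch: str, lexical_weight: float = 0.5) -> float:
    """Execution-free reward: 1 for an identical patch, 0 if nothing changed,
    -1 if the output is not a parseable diff, otherwise a blend of semantic
    and lexical similarity (the weight here is illustrative)."""
    pred, gold = pred_patch.strip(), gold_patch.strip()
    if pred == gold:
        return 1.0
    if pred == "":  # the model changed nothing
        return 0.0
    if not pred.startswith(("diff ", "--- ", "+++ ")):  # very rough parseability check
        return -1.0
    sem = semantic_yes_probability(pred, gold)
    lex = lexical_similarity(pred, gold)
    return (1 - lexical_weight) * sem + lexical_weight * lex
```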
03 Methodology
High-level recipe: Input (prompts by domain) → SFT (foundations) → RLHF (polish clarity) → IF-RL (obey constraints) → Math RL (deep reasoning) → Code RL (pass tests) → SWE RL (fix real bugs) → Output (a unified or dedicated thinking model).

Step-by-step with why it exists, what breaks without it, and examples:

1) Supervised Fine-Tuning (SFT)
• What happens: The base Qwen3-8B/14B models learn from curated examples across general chat, math, code, science, tools, and SWE. For the unified model, user prompts include /think or /no_think so the model learns both modes.
• Why it exists: SFT gives the model basic skills and a consistent style before RL fine-tunes behavior.
• What breaks without it: RL from a weak base struggles; rewards become noisy; learning slows.
• Example data: “Explain gravity” (both thinking and non-thinking), math/code problems (thinking), tool-calling conversations, SWE localization/repair/test generation.
• Secret sauce: Parallel generation to keep style consistent; careful filtering (e.g., 9-gram decontamination).

2) RLHF (alignment first)
• What happens: The model’s answers are scored by a large reward model (72B) trained on preference data. We penalize language mixing and unfinished chains.
• Why it exists: It reduces verbosity and repetition, making thoughts clearer. That later helps math/code under fixed token budgets.
• What breaks without it: Later RL stages waste tokens on rambling, and training can be unstable.
• Example: Asking for travel tips yields shorter, structured, helpful answers.
• Secret sauce: On-policy GRPO; no KL term; balanced mode training for unified models.

3) IF-RL (instruction following)
• What happens: Verifiable constraints (like word limits or bullet counts) are combined with the preference reward to avoid reward hacking. For unified models, IF-RL uses non-thinking mode to preserve quality.
• Why it exists: To be precisely obedient when rules matter.
• What breaks without it: The model might write lovely-but-invalid answers (wrong format or ignored constraints).
• Example: “Output exactly three numbered steps under 120 words.”
• Secret sauce: Combined reward (rule-based + human-preference), dynamic filtering for stable gradients.

4) Math RL (verifiable reasoning with staged length)
• What happens: Train on tough math tasks; verify boxed final answers with a fast rule-checker; apply a language-mixing penalty; manage length in stages (24K → 32K → 40K).
• Why it exists: Hard problems need longer, organized chains of thought; staged growth avoids truncation or waste.
• What breaks without it: Either answers get cut off or thoughts balloon with fluff.
• Example: Multi-step combinatorics problem finished with a boxed integer.
• Secret sauce: Epoch-level dynamic filtering to keep ~90% of problems useful for learning signal.

5) Code RL (unit-test mastery)
• What happens: Reward only if all tests pass; use parallel, asynchronous verification to cut latency; sample with higher temperature (e.g., 1.0) for exploration (a minimal sketch of this all-or-nothing reward follows this step).
• Why it exists: Exactness matters in code; tests are truth.
• What breaks without it: Overfits to easy cases; misses tricky edges; training crawls from slow verifications.
• Example: Implementing a function that handles big inputs and corner cases.
• Secret sauce: Careful data filtering for reliable tests; temperature tuning to balance exploration and stability.
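Here is a minimal, self-contained sketch of a pass-all-tests reward with parallel verification, as in step 5. The toy subprocess runner and the thread-pool fan-out are illustrative stand-ins for the paper's asynchronous, sandboxed verifiers; only the all-or-nothing reward rule is taken from the text.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class TestCase:
    stdin: str
    expected_stdout: str


def run_one_test(program: str, test: TestCase, timeout_s: float = 2.0) -> bool:
    """Toy verifier: run the candidate as a Python script on the test input
    and compare stdout. A production verifier would execute in a sandbox,
    support other languages, and enforce memory limits."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            input=test.stdin, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == test.expected_stdout.strip()


def code_reward(program: str, tests: list[TestCase], max_workers: int = 16) -> float:
    """All-or-nothing reward: 1.0 only if every hidden test passes.
    Tests run in parallel to keep verification latency low."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda t: run_one_test(program, t), tests)
        return 1.0 if all(results) else 0.0


# Example: reward a solution that prints the sum of two integers.
solution = "a, b = map(int, input().split()); print(a + b)"
tests = [TestCase("1 2", "3"), TestCase("10 -4", "6")]
print(code_reward(solution, tests))  # 1.0
```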
6) SWE RL (repair real software)
• What happens: Optimize diffs (patches) to match the ground-truth fix using a scalable, execution-free reward combining lexical and semantic similarity; train on long mixed-file prompts; extend contexts in stages (16K → 24K).
• Why it exists: Full container execution for every patch is too slow to scale; similarity-based signals let us train bigger and faster.
• What breaks without it: Training becomes bottlenecked by infrastructure, and models can’t learn from many examples.
• Example: Editing only the faulty method to pass regression without breaking other features.
• Secret sauce: Two-stage long-context curriculum; randomized file order; minimum prompt length to ensure meaningful context.

7) Mode control (/think and /no_think)
• What happens: The user appends /think or /no_think per turn. No template hacks; explicit flags only.
• Why it exists: Predictable switching at any time in a chat.
• What breaks without it: Confusing cues; inconsistent behavior.
• Example: “Convert 5pm PT to CET /no_think” then “Explain daylight saving conflicts /think”.

8) GRPO (Group Relative Policy Optimization) on-policy
• What happens: For each prompt, sample several answers from the current policy; normalize rewards within the group; update once (a minimal sketch of the group-normalized advantages appears at the end of this section).
• Why it exists: Stability and simplicity; no importance weights or KL penalty.
• What breaks without it: Drift from stale data; fragile updates.
• Example: Sampling 8 versions of a code solution and reinforcing the best one(s).

9) Dynamic filtering (keep what teaches, drop what doesn’t)
• What happens: After each epoch, drop always-solved and never-solved items; resample a small fraction so the model doesn’t forget or miss late breakthroughs (a sketch also appears at the end of this section).
• Why it exists: Efficiency and steady progress even late in training.
• What breaks without it: Wasting compute on no-signal items; noisy gradients.
• Example: Stop showing arithmetic facts that are already perfect.

10) Response-length extension (three phases)
• What happens: Compression (24K) to avoid overlong rambles; stabilize; then extension (32K, 40K) to unlock depth for hard items.
• Why it exists: Token budgets are finite; we need thoughtfulness without bloat.
• What breaks without it: Truncation or inefficiency, both hurting accuracy.
• Example: AIME hard items get solvable only after length extension.

11) Test-time scaling (IOI strategy)
• What happens: In contests, generate many candidates, pick tail-best, use judge feedback, and share insights across subtasks over 50 rounds.
• Why it exists: Feedback converts attempts into learning; scores climb round by round.
• What breaks without it: You leave performance on the table, especially for constructive problems.
• Example: Iteratively improving a constructive algorithm for IOI P2 Triples.

12) Execution-free SWE verifier (scale-up reward)
• What happens: Reward = 1 if identical patch; -1 if unparseable; 0 if unchanged; otherwise a semantic-YES probability plus lexical similarity.
• Why it exists: Massively more scalable than running every test suite during training.
• What breaks without it: Training dataset size and throughput remain small; progress slows.
• Example: Rewarding a refactor that fixes the bug even if text differs.

Putting it all together: Inputs include domain-specific prompts and verifiers (or reward models). The pipeline tunes hyperparameters per domain (length limits, temperatures, sampling counts). The unified 8B model learns both modes; the 14B model targets peak thinking-mode performance. The result is a set of models that (1) are stable to train, (2) rarely forget, (3) reach or beat state-of-the-art scores across many benchmarks, and (4) allow precise control over thinking depth.
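To ground step 8, here is a minimal sketch of the group-normalized reward computation at the heart of GRPO. Whether the standard deviation appears in the denominator and the exact epsilon are implementation details assumed here; the token-level policy-gradient update itself is omitted.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: `rewards` has shape (num_prompts, group_size),
    one scalar per sampled response. Each reward is centered and scaled by its
    own group's statistics, so only relative quality within the group matters."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Example: 2 prompts, 4 sampled responses each (e.g., 1.0 = all tests passed).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
advantages = grpo_advantages(rewards)
# Every token of response (i, j) is then reinforced in proportion to
# advantages[i, j] in a single on-policy update (no importance weights, no KL).
```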
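Step 9's filtering rule can likewise be sketched in a few lines, under the assumption that "useful" means a pass rate strictly between 0 and 1 for the samples drawn this epoch; the 10% probe fraction is illustrative, not the paper's setting.

```python
import random


def dynamic_filter(pass_rates: dict[str, float],
                   probe_fraction: float = 0.1,
                   seed: int = 0) -> list[str]:
    """Epoch-level filtering: prompts the model always solves (rate 1.0) or
    never solves (rate 0.0) give no gradient under group-normalized rewards,
    so most are dropped; a small random probe set is kept so late
    breakthroughs and forgetting can still be detected."""
    rng = random.Random(seed)
    informative = [pid for pid, r in pass_rates.items() if 0.0 < r < 1.0]
    extremes = [pid for pid, r in pass_rates.items() if r in (0.0, 1.0)]
    probes = rng.sample(extremes, k=int(len(extremes) * probe_fraction))
    return informative + probes
```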
04 Experiments & Results
The test: The authors evaluate on a wide, modern suite—MMLU/MMLU-Pro (general knowledge), GPQA-Diamond (hard science Q&A), ArenaHard/IFEval/IFBench (alignment and strict instruction following), AIME 2024/2025 (math), LiveCodeBench v5/v6/Pro (competitive programming), SWE-bench Verified (software engineering), and BFCL-V3 (tool use). They measure pass@1 (often averaged across multiple generations, like avg@8 or avg@64) with long thinking budgets (up to 64K tokens) and tuned sampling (e.g., temperature 0.6 for reasoning benchmarks; 1.0 for code RL training).

The competition: Strong baselines include DeepSeek-R1-0528 (671B; also used as an SFT teacher), Qwen3 variants, Gemini-2.5, OpenAI o3/o4-mini, and specialized SWE systems like DeepSWE-32B.

The scoreboard, with context:
• LiveCodeBench (code). The 14B thinking model achieves 77.5 (v5) and 74.6 (v6) avg@8—beating DeepSeek-R1-0528 (74.8/73.3) despite being far smaller. The unified 8B reaches 74.3 (v5) and 71.1 (v6), matching or surpassing much larger baselines. That’s like a high school chess champ beating an older, taller opponent using better strategy, not brute force.
• IOI 2025 (contest setting). With a feedback-driven test-time scaling pipeline (up to 50 rounds, 20 candidates each), the 14B thinking model reaches 343.37 total points—solid silver medal territory. On P2 Triples (constructive subtask), it surpasses internal strong baselines, showing the value of iterative refinement (a sketch of this generate-select-submit loop appears at the end of this section).
• AIME 2024/2025 (math). After Math RL’s staged length extension, the 8B unified model hits 90.2 (AIME24) and 80.1 (AIME25), while the 14B thinking model reaches 89.7/83.3—near top-tier small-model performance. That’s like consistently getting As on tough math quizzes.
• SWE-bench Verified (software repair). The 14B thinking model gets 43.1 pass@1—above specialized 32B open models like DeepSWE-32B (42.2). The unified 8B achieves 37.2, very close to the 8B thinking model’s 38.5, showing the unified approach holds up in tough, realistic settings.
• Alignment and instruction following. RLHF massively boosts ArenaHard; IF-RL then rockets IFEval and IFBench while holding human-preference alignment mostly steady via mixed rewards.

Surprising findings:
• RLHF improves math and code even without math/code prompts—because it trims verbosity and repetition, making thoughts more token-efficient.
• Later RL stages rarely hurt earlier domains; they sometimes even add gains (e.g., math RL nudging code/SWE up). That’s strong evidence for resistance to catastrophic forgetting.
• Higher temperature helps Code RL performance (more exploration) but can destabilize training entropy; the authors show how to balance that.
• Execution-free SWE reward scales training to bigger datasets without Docker execution overhead, yet still lifts verification scores in challenging benchmarks.

Bottom line: Across nearly all benchmarks, Nemotron-Cascade models deliver best-in-class numbers among their size class, and the 14B thinking model even overtakes its huge SFT teacher on coding. The unified 8B narrows or closes the gap with a dedicated thinking 8B, while outperforming on strict instruction adherence.
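The IOI test-time scaling pipeline can be sketched as a skeleton. All of the interfaces here (generate_candidates, select_best, submit) and the rolling feedback history are hypothetical; only the round/candidate counts and the overall generate-select-submit-refine structure come from the text, and details like "tail-best" selection and cross-subtask insight sharing are left abstract.

```python
def contest_loop(problem: str,
                 generate_candidates,   # (prompt, n) -> list[str] of programs
                 select_best,           # list[str] -> str
                 submit,                # str -> (score: float, feedback: str)
                 rounds: int = 50,
                 candidates_per_round: int = 20,
                 history_window: int = 3):
    """Hypothetical skeleton of a feedback-driven generate-select-submit loop:
    sample many candidates, submit the most promising one, fold the judge's
    feedback into the next round's prompt, and keep the best submission seen."""
    history: list[str] = []
    best_score, best_solution = float("-inf"), None
    for round_idx in range(rounds):
        prompt = problem + "\n\nRecent judge feedback:\n" + "\n".join(history[-history_window:])
        candidates = generate_candidates(prompt, candidates_per_round)
        chosen = select_best(candidates)
        score, feedback = submit(chosen)
        if score > best_score:
            best_score, best_solution = score, chosen
        history.append(f"Round {round_idx}: score {score:.1f}. {feedback}")
    return best_solution, best_score
```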
05 Discussion & Limitations
Limitations:
• IF-RL and RLHF tug in slightly different directions: strict verifiers want exact formats, while human preference rewards style and helpfulness. Without careful reward design, IFEval or ArenaHard can dip.
• Reward models are imperfect proxies. A top RewardBench score doesn’t guarantee the best downstream RLHF. This adds variance in outcomes.
• Long reasoning budgets (e.g., 64K) bring costs: more compute per sample and careful length curricula are required to avoid truncation or token waste.
• Code RL can be temperature-sensitive; high exploration helps, but can cause entropy spikes and instability.
• SWE RL’s execution-free reward scales well but may miss nuances that only full execution could catch.

Required resources:
• Strong backbone LLMs (8B/14B) and large RMs (e.g., 72B) for robust RLHF.
• Distributed training with parallel/asynchronous verifiers for code; high-throughput data pipelines; long-context hardware support.
• Carefully curated, deduplicated, and decontaminated datasets across domains.

When not to use:
• If you only need a single narrow skill (e.g., just math), a specialized RL recipe might be simpler.
• If you lack verification signals (tests, rules, or rewards) and can’t build a strong RM, cascaded RL’s benefits shrink.
• If you must run on tiny hardware with very short context, the long-context strategies won’t fully apply.

Open questions:
• Can we learn a single, robust generative reward model that handles both human preference and strict constraints, reducing trade-offs?
• How do we stabilize high-temperature exploration (great for code) without entropy explosions?
• Can execution-free SWE rewards be enhanced to better approximate full test execution outcomes?
• What’s the best automatic curriculum to pick stages, lengths, and filtering thresholds per model size?
• How can we push unified models even further so they always match or exceed dedicated thinking models in every reasoning task?
06 Conclusion & Future Work
Three-sentence summary: Nemotron-Cascade trains a reasoning model one domain at a time—alignment, instruction following, math, code, and software engineering—so each stage is tuned right and earlier skills stick. This cascaded approach delivers state-of-the-art results for 8B/14B models across many benchmarks, with the 14B thinking model beating its huge SFT teacher on code and earning a silver-like score in IOI 2025. A unified 8B model cleanly switches between deep thinking and instant answers using simple flags, closing the gap with a dedicated thinking model.

Main achievement: Proving that domain-wise, sequential RL scales across many tasks while resisting catastrophic forgetting, simplifying engineering, and enabling a single model to handle both thinking and non-thinking modes.

Future directions: Build better unified reward models that blend human preference and strict constraint checking; stabilize high-temperature exploration; further improve long-context efficiency; and strengthen execution-free SWE rewards as proxies for full test runs.

Why remember this: It shows a clear, reproducible path to train one model that reasons broadly and reliably—like a student who aces math, writes code, follows directions, and still answers quickly when you need speed—without the usual trade-offs.
Practical Applications
- Build one assistant that can switch between fast answers (/no_think) and deep reasoning (/think) per user turn.
- Fine-tune internal models by domain sequence (alignment → instructions → math → code → SWE) to reduce training complexity.
- Use RLHF first to trim verbosity and improve token efficiency before math/code RL.
- Adopt staged response-length curricula for math to prevent truncation and overlong rambles.
- Train code assistants with strict pass-all-tests rewards and higher sampling temperatures for better exploration.
- Scale SWE repair training with execution-free patch-similarity rewards to avoid heavy test execution.
- Apply dynamic filtering to keep only learning-rich problems in RL batches and speed convergence.
- Use combined rewards (rule-based + preference) for instruction-following to prevent format-only reward hacking.
- Deploy test-time scaling loops (generate–select–submit) in contests or code review to turn feedback into quick gains.
- Implement /think and /no_think flags in chat templates for predictable, per-turn mode control (a minimal sketch follows below).
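As a concrete version of the last item, here is a minimal sketch of appending per-turn mode flags to user messages before they go through a chat template. Placing the flag at the end of the turn and the helper names are assumptions; the point is that the switch is an explicit, per-turn marker in the user message rather than a template hack.

```python
from enum import Enum


class Mode(Enum):
    THINK = "/think"        # long chain-of-thought before answering
    NO_THINK = "/no_think"  # fast, direct reply


def user_turn(content: str, mode: Mode) -> dict:
    """One user message with an explicit per-turn reasoning flag appended."""
    return {"role": "user", "content": f"{content} {mode.value}"}


messages = [
    user_turn("Convert 5pm PT to CET", Mode.NO_THINK),
    # ... assistant reply ...
    user_turn("Explain how daylight saving time can shift this conversion", Mode.THINK),
]
# The list is then rendered with the model's chat template, e.g. via
# tokenizer.apply_chat_template(messages, add_generation_prompt=True).
```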