On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Key Summary
- The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
- Using a fully controlled, synthetic math-like world, the authors separate the effects of pre-training, mid-training, and RL.
- RL gives true new capability only when the task is just a little harder than the model's comfort zone (its "edge of competence") and there is leftover headroom from pre-training.
- For transferring skills to new wordings or topics (contextual generalization), even tiny pre-training exposure (about 1%) to the new context is enough for RL to spread the skill widely.
- Adding a mid-training stage (a bridge between pre-training and RL) boosts performance under the same compute budget, especially on hard, out-of-distribution problems compared to RL alone.
- Designing RL data to match the model's edge of competence yields large gains (up to +42% pass@128) on deeper problems.
- Process-aware rewards that check each reasoning step reduce reward hacking and improve faithful reasoning, adding ~4–5% pass@1 on the hardest settings.
- A task-aware compute split works best: more mid-training + light RL for reliability on near-range tasks, and heavier RL (with some mid-training) for far-range, very hard tasks.
- The framework clarifies why past studies disagreed about RL: differences came from how much pre-training coverage there was and how well RL data matched the edge of competence.
Why This Research Matters
This study gives a clear, practical recipe for making language models reason better instead of just guessing better. By seeding minimal exposure in pre-training, adding a mid-training bridge, aiming RL at the edge of competence, and rewarding steps (not just answers), we get deeper and more reliable reasoning. That means homework helpers that explain their steps, coding copilots that avoid brittle shortcuts, and decision tools that generalize to new situations without breaking. It also saves compute by avoiding wasted RL on tasks that are too easy or too hard. Most importantly, it reduces reward hacking so improvements are real, not illusions.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine coaching a student for a big math contest. First, they read a lot of textbooks (pre-training). Later, you give them specially chosen practice sheets (mid-training). Finally, you let them practice with timed drills and instant feedback (RL). If their scores go up, was it the textbooks, the sheets, or the drills? Or the mix?
🥬 The Concept (The World Before): Before this paper, researchers knew that reinforcement learning could make language models seem better at reasoning, especially when they show their steps. But nobody could say for sure whether RL actually taught the model new abilities or just polished things it already knew. That's because modern training pipelines are messy: huge, opaque pre-training corpora; loosely defined mid-training; and RL that pushes on top of unknown foundations.
How it worked before (step by step):
- Pre-training on the internet: Models absorb all sorts of facts and patterns, but we don't know which reasoning skills they truly learned.
- Optional mid-training: Some teams add a middle step with instruction-like data, but its role isn't carefully measured.
- RL post-training: Models get rewards for good answers, but the rewards can be gamed (reward hacking) and often mix with earlier knowledge in unknown ways.
Why that mattered: Conflicting claims popped up. Some said RL mostly sharpens (improves pass@1 but not pass@128), meaning no real new capability. Others showed big gains on synthetic tasks and argued RL truly extends ability.
🍞 Anchor: Think of students who get better after practice. Are they learning new math or just learning test tricks? Without a clean lab-like setup, it's hard to tell.
🍞 Hook: You know how a good science experiment isolates one variable at a time? That's the key to figuring out cause and effect.
🥬 The Concept (The Problem): The field lacked a controlled way to separate the roles of pre-training, mid-training, and RL on reasoning. Without control, we cannot say whether improvements come from new abilities or better sampling of old ones.
How they fix it:
- Build synthetic reasoning tasks with atomic steps and clean dependency graphs (so structure is known).
- Render problems into different surface contexts (zoo, school, festival) without changing the underlying logic.
- Parse solutions step-by-step and grade not just the final answer but the whole reasoning process.
- Carefully split data into non-overlapping sets for pre-, mid-, and post-training.
Why it matters: With clear structure, you can tell whether RL truly extends reasoning depth and transfers skills to new contexts, or just memorizes shortcuts.
🍞 Anchor: It's like giving two kids identical Lego kits but different instruction pages. If one builds taller towers (deeper reasoning) or can build in a new theme (context transfer), you can measure which training stage helped.
🍞 Hook: Imagine a hiking path with three stages: a flat warm-up (pre-training), a hill bridge (mid-training), and a steep summit push (RL). How do these pieces add up to reach higher peaks?
🥬 The Concept (Failed Attempts and the Gap): Past attempts used real-world corpora and loosely defined rewards. But hidden overlaps and unverified steps muddled conclusions. The missing piece was a lab-grade setup where every training distribution, task difficulty, and evaluation rule is known and controllable.
How the paper fills the gap:
- It creates a fully controllable world of math-like problems defined by directed acyclic graphs (DAGs), ensuring the exact reasoning structure is known.
- It measures two key abilities: extrapolative generalization (tackling deeper compositions) and contextual generalization (transferring across surface stories).
- It uses process-verified evaluation that counts a solution as correct only when all steps and the final answer match.
Why it matters: This lets us pinpoint when RL truly adds new capability, when mid-training is essential, and how tiny pre-training exposure unlocks big transfer later.
🍞 Anchor: Picture testing shoes on a treadmill that controls slope and speed exactly. Now you can say whether the laces, the sole, or the training plan made the runner faster.
🍞 Hook: Why should you care? Because the way we train models affects your homework help, coding assistants, and creative tools.
🥬 The Concept (Real Stakes): If we think RL creates new skills when it's merely polishing old ones, we could waste huge compute budgets or trust unreliable reasoning. With clean evaluation, we can design cheaper, safer, and smarter training pipelines.
How it plays out:
- Better planning: Match RL data to the model's edge of competence to get real gains.
- Better coverage: Seed pre-training with just 1% exposure to long-tail contexts to unlock cross-context transfer.
- Better reliability: Use mid-training and process-aware rewards to reduce reward hacking and improve faithful reasoning.
Why it matters: Everyday tools become more accurate and fair, and we avoid being fooled by models that guess the right answer for the wrong reasons.
🍞 Anchor: It's the difference between a tutor who teaches understanding step-by-step versus one who only drills answers. The first builds skill; the second risks shortcuts.
02 Core Idea
🍞 Hook: You know how video games get fun when the level is just hard enough to stretch you, not so easy it's boring, and not so hard it's impossible?
🥬 The Concept (The Aha! Moment): The key insight is that true reasoning gains happen when pre-training leaves headroom, RL is aimed exactly at the model's edge of competence, mid-training bridges the distributions, and rewards check both the final answer and the steps.
How it works:
- Pre-training installs basic reasoning "atoms."
- Mid-training strengthens those atoms right where the model starts to struggle, aligning them with what RL will target.
- RL then explores just-beyond-known territory (the edge of competence), composing atoms into deeper chains.
- Process-aware rewards ensure the model earns points for correct steps, not just lucky answers.
Why it matters: Without headroom, RL only polishes; without calibration, RL flails; without process checks, RL can be gamed.
🍞 Anchor: It's like building muscle: learn form (pre-train), practice with guided sets (mid-train), then attempt slightly heavier weights (RL) with a coach who also watches your form (process reward).
Multiple Analogies:
- School Curriculum: Learn times tables (pre-train), do structured worksheets (mid-train), then timed quizzes just above your comfort zone (RL) graded for both steps and answer (process reward).
- Cooking: Master chopping and mixing (pre-train), follow recipe families (mid-train), then improvise dishes that are one step harder (RL) while a chef checks each step (process reward).
- Sports: Drill fundamentals (pre-train), scrimmage with targeted plays (mid-train), then play tougher opponents (RL) with a referee checking not only the score but also fouls (process reward).
Before vs After:
- Before: RL results looked inconsistent; sometimes big gains, sometimes not.
- After: Gains depend on two switches: (a) headroom from pre-training and (b) RL difficulty calibrated to the edge of competence; plus a boost from mid-training and process-verified rewards.
Why It Works (Intuition):
- Pre-training stores reusable primitives.
- Mid-training positions those primitives right at the frontier, making them accessible for composition.
- RL thrives when rewards are neither too sparse (too hard) nor too redundant (too easy).
- Process rewards densify feedback and prevent shortcut solutions that don't generalize.
Building Blocks (explained with Sandwich):
- Pre-training 🍞 Hook: Imagine learning the alphabet before writing essays. 🥬 What it is: Pre-training teaches fundamental reasoning atoms and patterns. How it works: Feed lots of varied, structured problems; learn operations and step formats; stop before covering the hardest cases to leave headroom. Why it matters: Without atoms or headroom, RL can't extend ability, only polish it. 🍞 Anchor: A kid who knows addition can learn multi-step word problems next.
- Mid-training 🍞 Hook: Think of a bridge connecting two hills. 🥬 What it is: A focused stage that narrows data toward what RL will emphasize. How it works: Continue next-token prediction on a tighter distribution where the model has partial competence. Why it matters: It stabilizes optimization and primes the model to benefit from RL. 🍞 Anchor: Practice sheets that match the upcoming test.
- Reinforcement Learning (RL) 🍞 Hook: Like practicing free throws and getting points for each swish. 🥬 What it is: Learning by trial-and-reward, steering the model toward policies that score higher. How it works: Sample solutions; grade them; update the policy to prefer higher-reward traces. Why it matters: RL grows breadth and depth when aimed at the edge of competence. 🍞 Anchor: A player improves fastest when shots are challenging but makeable.
- Extrapolative Generalization 🍞 Hook: Imagine stacking more Lego blocks than you've ever stacked before. 🥬 What it is: Solving problems with deeper chains of operations than seen during training. How it works: Compose learned atoms into longer sequences. Why it matters: Shows true depth growth, not just memorization. 🍞 Anchor: Going from 5-step puzzles to 12-step ones.
- Contextual Generalization 🍞 Hook: Same melody, different instruments. 🥬 What it is: Transferring the same reasoning graph to new surface stories. How it works: Keep structure; swap context words (zoo → school). Why it matters: Real tasks vary in wording; you need stable transfer. 🍞 Anchor: Solving a budget problem whether it's framed as animals or teachers.
- Process-Level Rewards 🍞 Hook: Gold stars for showing your work. 🥬 What it is: Rewards that credit correct steps, not just final answers. How it works: Parse the solution into steps; compare to ground-truth graph; reward step accuracy. Why it matters: Cuts reward hacking and improves faithful reasoning. 🍞 Anchor: A math teacher who grades both steps and the result.
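To make the process-level reward concrete, here is a minimal Python sketch. It assumes each solution step can be parsed into a (target, operation, operands, value) tuple and that the gold graph is available as a dictionary; the step format and names are illustrative assumptions, not the paper's implementation.

```python
def process_reward(predicted_steps, gold_graph):
    """Fraction of predicted steps whose operation, dependencies, and value
    all match the gold dependency graph (illustrative process-level signal)."""
    if not predicted_steps:
        return 0.0
    correct = 0
    for target, op, operands, value in predicted_steps:
        gold = gold_graph.get(target)
        if gold is None:
            continue  # the step invents a node that the gold graph does not contain
        gold_op, gold_operands, gold_value = gold
        if op == gold_op and set(operands) == set(gold_operands) and value == gold_value:
            correct += 1
    return correct / len(predicted_steps)

# Example: the gold graph says Total = Lions + Elephants = 12.
gold = {"Total": ("+", ("Lions", "Elephants"), 12)}
print(process_reward([("Total", "+", ("Lions", "Elephants"), 12)], gold))  # 1.0
print(process_reward([("Total", "-", ("Lions", "Elephants"), 12)], gold))  # 0.0: wrong operation
```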
03 Methodology
At a high level: Input (synthetic, graph-defined problems) → Pre-training (learn atoms) → Mid-training (bridge and focus) → RL (edge-of-competence exploration with process-aware rewards) → Output (solutions scored by process-verified pass@k).
Step-by-step recipe:
- Create a controllable world of problems 🍞 Hook: Think of blueprint-based puzzles where every piece and connection is known. 🥬 What happens: Each problem is built from a dependency graph (a DAG) where nodes are quantities and edges are arithmetic dependencies. We also attach a "context template" (like zoo/animals or school/teachers) that changes the story but not the structure. Why this step exists: Control. It prevents training-test leaks and lets us separate structure (math) from wording (context). Example: A graph that says Total = Lions + Elephants can be rendered as either a zoo problem or a school supplies problem. 🍞 Anchor: Same skeleton, different costumes. (A generation sketch appears after this recipe.)
- Pre-training (learn the atoms, keep headroom) 🍞 Hook: Study the basics across many examples, but stop short of mastering the hardest. 🥬 What happens: A 100M-parameter model is trained on 10B tokens with operation counts op=2–10 across multiple contexts to build arithmetic and reasoning-format skills, leaving deeper ranges (op>10) underexposed. Why this step exists: To install primitives and deliberately leave room for RL to expand. Example: The model sees many 2–10-step problems in zoo/school/festival styles and learns to show steps. 🍞 Anchor: A student who's solid on medium problems but hasn't yet tackled the longest chains.
- Mid-training (bridge to RL) 🍞 Hook: Targeted practice right where things start feeling tough. 🥬 What happens: Keep the same next-token objective but narrow the data to the model's emerging frontier (e.g., op=11–14) with structures similar to what RL will emphasize. Why this step exists: It aligns representations, improves stability, and makes RL sample-efficient under fixed compute. Example: If op=11–14 is the frontier, mid-training focuses there with structured solutions. 🍞 Anchor: Practice sheets that mirror the next challenge.
- RL post-training with GRPO (push at the edge) 🍞 Hook: Try, get scored, try again. 🥬 What happens: Use GRPO-style RL to sample multiple solution attempts, score them, and update the model toward higher-reward behaviors. Critically, choose RL data at the edge of competence: the tasks are hard but possible. Why this step exists: Pushing on tasks that are too easy brings polishing only; too hard brings sparse rewards and little learning. Edge calibration maximizes learning. Example: Train on op=11–14 and then evaluate gains on op=15–20. 🍞 Anchor: Practicing shots that you miss sometimes but can learn to make. (A GRPO sketch appears after this recipe.)
- Process-verified evaluation (grade every step) 🍞 Hook: No partial credit for lucky guesses. 🥬 What happens: Parse the model's solution into a predicted graph; compare each step's dependencies and values to the gold graph. A sample counts as correct only if all steps and the final answer match. Why this step exists: Prevents reward or evaluation hacking and measures faithful reasoning. Example: If the model claims Total = Lions − Elephants when it should be +, the sample is counted wrong even if the final number happens to match. 🍞 Anchor: A teacher who checks the algebra line by line.
- Contextual generalization tests (same graph, new words) 🍞 Hook: Can you sing the song in a new key? 🥬 What happens: Keep the abstract structure but swap the story template (e.g., from zoo to school). Vary how much the model saw this new template in pre-training (0%, 0.1%, 1%, 10%). Why this step exists: To test whether tiny exposure (≥1%) is enough for RL to spread skills across contexts. Example: A 12-step budget graph appears in a "teachers-school" story instead of "animals-zoo." 🍞 Anchor: Doing the same math in a different word problem.
- Compute-aware mixing (mid-training vs RL under a fixed budget) 🍞 Hook: You have one pizza of compute: how do you slice it? 🥬 What happens: Convert RL's rollout cost to an equivalent token budget; then test splits: Full Mid, Full RL, Light-RL, Medium-RL, Heavy-RL, all using the same edge-range data (e.g., op=11–14). Why this step exists: To discover how to allocate compute for the best near-range vs far-range gains. Example: Light-RL wins pass@1 on edge tasks; Heavy-RL wins on the hardest OOD problems. 🍞 Anchor: Training plans that differ for consistency vs maximum reach.
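Referenced in step 1: a minimal sketch of how graph-defined problems with swappable surface contexts might be generated. The context templates, entity names, and operation set below are illustrative assumptions, not the paper's actual generator.

```python
import random

# Hypothetical context templates: same graph structure, different "costume".
CONTEXTS = {
    "zoo":    {"entities": ["Lions", "Elephants", "Zebras"], "story": "The zoo counts its animals."},
    "school": {"entities": ["Teachers", "Desks", "Books"],   "story": "The school tallies its supplies."},
}

def sample_problem(num_ops, context_name, seed=0):
    """Build a small DAG of arithmetic dependencies, then render it in a chosen context."""
    rng = random.Random(seed)
    ctx = CONTEXTS[context_name]
    # Leaf nodes carry given quantities; each derived node depends on two earlier nodes.
    values = {name: rng.randint(1, 9) for name in ctx["entities"]}
    steps, nodes = [], list(values)
    for i in range(num_ops):
        a, b = rng.sample(nodes, 2)
        op = rng.choice(["+", "-"])
        result = values[a] + values[b] if op == "+" else values[a] - values[b]
        node = f"Q{i}"
        values[node] = result
        steps.append((node, op, (a, b), result))  # gold reasoning step
        nodes.append(node)
    givens = ", ".join(f"{k} = {values[k]}" for k in ctx["entities"])
    question = f"{ctx['story']} Given {givens}, find {node}."
    return question, steps

question, gold_steps = sample_problem(num_ops=4, context_name="zoo")
print(question)  # the same 4-step graph could be re-rendered with context_name="school"
```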
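Referenced in step 4: a minimal sketch of the group-relative advantage at the heart of GRPO-style updates, assuming one group of process-verified rollout rewards per prompt. Clipping, KL regularization, and the policy-gradient step itself are omitted.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and standard deviation of its
    own group, so rollouts compete only against siblings of the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Edge-of-competence prompts yield mixed rewards, so advantages carry signal;
# all-failure (too hard) or all-success (too easy) groups collapse to ~zero advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0]))  # informative learning signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))  # no learning signal
```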
The secret sauce:
- Edge-of-competence targeting: RL learns best where tasks are just beyond current skill.
- Process-aware rewards: Dense step feedback reduces reward hacking and yields truer reasoning.
- Clean splits and graphs: By disentangling structure from context and preventing contamination, we can attribute gains correctly.
Extra sandwich for a key evaluation idea: pass@k. 🍞 Hook: Imagine you get k chances to answer a question. 🥬 What it is: pass@k counts a task as solved if any of k sampled solutions is fully correct (steps + answer). How it works: Sample k times; check each with process verification; success if at least one is perfect. Why it matters: pass@1 shows single-try sharpness; pass@128 shows true capability range. 🍞 Anchor: Like taking multiple shots at a hard basketball hoop; eventually you see if you can really make it.
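A minimal sketch of process-verified pass@k; here `sample_solution` (a model sampler) and `verify_process` (a step-by-step grader against the gold graph) are assumed callables introduced for illustration.

```python
def pass_at_k(problems, sample_solution, verify_process, k=128):
    """Fraction of problems with at least one of k samples fully correct,
    where 'correct' means every step and the final answer match the gold graph."""
    solved = 0
    for problem, gold_graph in problems:
        if any(verify_process(sample_solution(problem), gold_graph) for _ in range(k)):
            solved += 1
    return solved / len(problems)
```

Comparing pass@1 against pass@128 computed this way is what separates sharpening from genuine capability growth.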
04 Experiments & Results
The tests and why:
- Extrapolative generalization (depth): Can the model solve problems with more steps (op=11–14, op=15–20) than it saw in pre-training (op=2–10)? This shows real depth gains.
- Contextual generalization (breadth): Can it transfer the same graph to new surface stories (e.g., teachers-school) after small pre-training exposure?
- Reward design: Do process-aware rewards reduce reward hacking and improve faithful reasoning?
- Compute mixing: Under the same total budget, how should we split mid-training vs RL?
Competition/baselines:
- Base model after pre-training on op=2–10.
- RL applied to different difficulty buckets: op=7–10 (ID), 9–12 (mixed), 11–14 (edge), 17–20 (hard).
- Mid-training + RL vs RL-only under equalized compute.
Scoreboard with context:
- When does RL produce true new capability?
- Finding: RL on ID ranges boosts pass@1 but not pass@128; that is polishing, not extending.
- Key win: RL trained at the edge (op=11–14) lifts pass@128 on OOD tasks (including op=15–20), delivering genuine capability gains.
- Scale: Well-calibrated settings report up to +42% pass@128 on deeper tasks. Analogy: Like moving from consistent 2-point shots to making 3s; you can now reach further, not just hit the same shots more often.
- How much pre-training exposure is enough for context transfer?
- Finding: With 0%–0.1% exposure to the long-tailed context during pre-training, RL fails to transfer. With ≥1% exposure, RL reliably spreads skill across contexts, even to hard op=20 problems, sometimes adding up to +60% pass@128.
- Intuition: RL can't build from a void; it needs a seed. Analogy: Learn just a few words of a new language, and a good tutor (RL) helps you speak paragraphs.
- Mid-training's role under fixed compute
- Finding: Adding a mid-training bridge improves OOD performance under the same budget and beats RL-only by about +10.8% on OOD-hard in reported comparisons.
- Pattern: Light-RL (more mid, less RL) gets the best pass@1 on OOD-edge. Heavy-RL (less mid, more RL) wins on the hardest OOD tasks in pass@1 and pass@128. Analogy: More drills for consistency; more scrimmage for peak reach.
- Process-aware rewards reduce reward hacking
- Finding: Mixing outcome and process signals (e.g., 0.2 outcome + 0.8 process) lifts pass@1 by ~4–5% on op=15–20 and improves reasoning fidelity. A strict gate (reward the outcome only if the process is perfect) also helps.
- Outcome: Fewer structural errors (missing nodes, wrong dependencies) and more faithful chains. Analogy: Students stop guessing and start showing correct algebra.
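A minimal sketch of the two reward schemes above, assuming a boolean final-answer check and a step-accuracy fraction from the process verifier. The 0.2/0.8 weights follow the reported blend; everything else is illustrative.

```python
def blended_reward(outcome_correct, step_accuracy, w_outcome=0.2, w_process=0.8):
    """Blend the sparse final-answer signal with the dense step-accuracy signal."""
    return w_outcome * float(outcome_correct) + w_process * step_accuracy

def gated_reward(outcome_correct, step_accuracy):
    """Strict gate: credit the outcome only when every step checks out."""
    return float(outcome_correct and step_accuracy == 1.0)

print(blended_reward(True, 0.75))  # 0.8 -- right answer, but one sloppy step still loses credit
print(gated_reward(True, 0.75))    # 0.0 -- the strict gate rejects imperfect reasoning
```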
Surprising findings:
- Tiny pre-training exposure (as low as ~1%) to a long-tail context unlocks strong cross-context RL transferāmuch less than many expect.
- RL on tasks that are too hard (far OOD) stalls because rewards become too sparse; on tasks that are too easy, it just sharpens pass@1.
- Mid-training, often under-discussed, substantially conditions the model for RL's benefits.
Extra sandwich for "edge of competence." 🍞 Hook: Think of a running pace you can barely keep: tiring but doable. 🥬 What it is: The difficulty zone where the model fails at pass@1 but succeeds at pass@k. How it works: Filter RL data so examples lie just beyond the model's easy zone. Why it matters: Maximizes learning signal; avoids both boredom and hopelessness. 🍞 Anchor: Lifting a weight you can't do cold, but can after a few tries.
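A minimal sketch of selecting edge-of-competence RL data, assuming per-problem pass@1 and pass@k estimates have already been measured for the current model; the threshold values are illustrative, and the filter should be refreshed as the model improves.

```python
def edge_of_competence(problem_ids, pass1, passk, pass1_max=0.2, passk_min=0.1):
    """Keep problems the model rarely solves in one try (low pass@1)
    but can solve within k tries (non-trivial pass@k): hard but possible."""
    return [pid for pid in problem_ids
            if pass1[pid] <= pass1_max and passk[pid] >= passk_min]

# Hypothetical measurements: p1 is too easy, p3 is hopeless, p2 sits at the edge.
pass1 = {"p1": 0.9, "p2": 0.05, "p3": 0.0}
passk = {"p1": 1.0, "p2": 0.6,  "p3": 0.0}
print(edge_of_competence(["p1", "p2", "p3"], pass1, passk))  # ['p2']
```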
Numbers with meaning:
- +42% pass@128 when RL is well-calibrated at the edge.
- +60% pass@128 for contextual transfer once pre-training has ≥1% exposure to the long-tail context.
- +10.8% OOD-hard advantage for mid-training + RL over RL-only under fixed compute.
- +4–5% pass@1 on the hardest ranges by adding process rewards.
Take-home: The who, when, and how of RL gains are now clearer: seed in pre-training; bridge in mid-training; aim RL at the edge; and reward the process.
05 Discussion & Limitations
Limitations (be specific):
- Synthetic world: The tasks are controllable arithmetic-style problems with DAG-defined steps. Real-world language has messier dependencies and ambiguities.
- Model scale: Results use 100M-parameter models; scaling behavior might shift with larger models.
- Operation set: Focused on arithmetic (+, −, ×, ÷). Other reasoning types (logic, code, commonsense) may differ.
- Parser dependence: Process verification relies on parsing structured steps; free-form reasoning styles might be harder to evaluate robustly.
- Objective scope: RL used GRPO-like setups; other RL variants or credit-assignment strategies could change dynamics.
Required resources:
- Data: ~10B pre-training tokens from the synthetic generator, plus curated splits for mid- and post-training.
- Compute: Enough to train a 100M model and perform RL rollouts (e.g., rollout multiplicity ~6, context length ~2048), and to run process parsing and grading.
- Engineering: Clean deduplication, distributional splits, and robust solution parsing.
When NOT to use:
- If your target tasks are already saturated by pre-training (high pass@128), RL aimed in-domain likely won't extend capability, only sharpen pass@1.
- If there's zero pre-training exposure to a brand-new context and no way to seed it, RL alone probably won't transfer.
- If you canāt parse or verify process steps, pure outcome rewards may invite reward hacking.
- If compute is extremely scarce, a mid-training bridge may be more efficient than RL-only.
Open questions:
- Transfer to natural domains: How do these patterns carry over to math word problems in the wild, coding, or multimodal reasoning?
- Scaling laws: How do headroom, edge calibration, and mid-training impact change with larger (billion-scale) models?
- Automated edge finding: Can we continuously detect and track the model's edge of competence for self-paced curricula?
- Reward design: What's the best mix of process vs outcome signals across domains? Can learned process reward models replace exact graph checks?
- Data curricula: What are optimal mid-training distributions for different downstream goals (robustness vs reach)?
06 Conclusion & Future Work
Three-sentence summary: This paper builds a clean, controlled lab for studying how pre-training, mid-training, and RL interact to shape reasoning in language models. It finds that true capability gains arise when pre-training leaves headroom, RL targets the edge of competence, mid-training bridges distributions, and process-aware rewards prevent shortcutting. With these pieces in place, models generalize deeper (extrapolation) and across new stories (context) more reliably.
Main achievement: The authors reconcile conflicting RL results by showing they depend on two key dials, pre-training coverage and RL difficulty calibration, while highlighting mid-training and process-verified rewards as powerful, practical levers.
Future directions: Test the recipe on larger models and real-world domains; automate edge-of-competence tracking for curricula; explore richer process reward models; and expand beyond arithmetic graphs to logic, programming, and multimodal tasks.
Why remember this: It turns fuzzy training folklore into a clear playbook: seed minimal exposure, build a mid-bridge, aim RL at the edge, and reward the steps. Following this playbook can save compute, boost reliability, and grow genuine reasoning ability rather than polished guesswork.
Practical Applications
- Filter RL training data to the model's edge of competence (low pass@1 but non-zero pass@k) and refresh this filter as the model improves.
- Seed long-tail contexts with at least ~1% exposure in pre-training to unlock robust cross-context transfer during RL.
- Add a mid-training stage that mirrors the upcoming RL distribution to stabilize optimization and improve RL efficiency under fixed compute.
- Use process-verified evaluation (parse steps to a graph) and count a solution correct only if all steps plus the final answer match.
- Mix rewards: combine outcome reward with dense process-level signals (e.g., 0.2 outcome + 0.8 process) to reduce reward hacking.
- Adopt task-aware compute splitting: more mid-training + light RL for reliability near distribution; heavier RL (with some mid-training) for far-OOD gains.
- Continuously monitor pass@1 vs pass@k to distinguish sharpening from true capability growth.
- Design contextual templates that vary wording while keeping structure fixed to regularly test breadth generalization.
- Track structural error types (missing nodes, dependency mismatches) to diagnose when process rewards are needed most.
- Avoid RL-only on in-domain saturated tasks; redirect compute toward mid-training or edge-calibrated RL where it can extend ability.