
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Beginner
Charlie Zhang, Graham Neubig, Xiang Yue Ā· 12/8/2025
arXiv Ā· PDF

Key Summary

  • The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
  • Using a fully controlled, synthetic math-like world, the authors separate the effects of pre-training, mid-training, and RL.
  • RL gives true new capability only when the task is just a little harder than the model’s comfort zone (its ā€œedge of competenceā€) and there is leftover headroom from pre-training.
  • For transferring skills to new wordings or topics (contextual generalization), even tiny pre-training exposure (about 1%) to the new context is enough for RL to spread the skill widely.
  • Adding a mid-training stage (a bridge between pre-training and RL) boosts performance under the same compute budget, especially on hard, out-of-distribution problems compared to RL alone.
  • Designing RL data to match the model’s edge of competence yields large gains (up to +42% pass@128) on deeper problems.
  • Process-aware rewards that check each reasoning step reduce reward hacking and improve faithful reasoning, adding ~4–5% pass@1 on the hardest settings.
  • A task-aware compute split works best: more mid-training + light RL for reliability on near-range tasks, and heavier RL (with some mid-training) for far-range, very hard tasks.
  • The framework clarifies why past studies disagreed about RL: differences came from how much pre-training coverage there was and how well RL data matched the edge of competence.

Why This Research Matters

This study gives a clear, practical recipe for making language models reason better instead of just guessing better. By seeding minimal exposure in pre-training, adding a mid-training bridge, aiming RL at the edge of competence, and rewarding steps (not just answers), we get deeper and more reliable reasoning. That means homework helpers that explain their steps, coding copilots that avoid brittle shortcuts, and decision tools that generalize to new situations without breaking. It also saves compute by avoiding wasted RL on tasks that are too easy or too hard. Most importantly, it reduces reward hacking so improvements are real, not illusions.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine coaching a student for a big math contest. First, they read a lot of textbooks (pre-training). Later, you give them specially chosen practice sheets (mid-training). Finally, you let them practice with timed drills and instant feedback (RL). If their scores go up, was it the textbooks, the sheets, or the drills? Or the mix?

🄬 The Concept (The World Before): Before this paper, researchers knew that reinforcement learning could make language models seem better at reasoning—especially when they show their steps. But nobody could say for sure whether RL actually taught the model new abilities or just polished things it already knew. That’s because modern training pipelines are messy: huge, opaque pre-training corpora; loosely defined mid-training; and RL that pushes on top of unknown foundations.

How it worked before (step by step):

  1. Pre-training on the internet: Models absorb all sorts of facts and patterns, but we don’t know which reasoning skills they truly learned.
  2. Optional mid-training: Some teams add a middle step with instruction-like data, but its role isn’t carefully measured.
  3. RL post-training: Models get rewards for good answers, but the rewards can be gamed (reward hacking) and often mix with earlier knowledge in unknown ways.

Why that mattered: Conflicting claims popped up. Some said RL mostly sharpens (improves pass@1 but not pass@128), meaning no real new capability. Others showed big gains on synthetic tasks and argued RL truly extends ability.

šŸž Anchor: Think of students who get better after practice. Are they learning new math or just learning test tricks? Without a clean lab-like setup, it’s hard to tell.

šŸž Hook: You know how a good science experiment isolates one variable at a time? That’s the key to figuring out cause and effect.

🄬 The Concept (The Problem): The field lacked a controlled way to separate the roles of pre-training, mid-training, and RL on reasoning. Without control, we cannot say whether improvements come from new abilities or better sampling of old ones.

How they fix it:

  1. Build synthetic reasoning tasks with atomic steps and clean dependency graphs (so structure is known).
  2. Render problems into different surface contexts (zoo, school, festival) without changing the underlying logic.
  3. Parse solutions step-by-step and grade not just the final answer but the whole reasoning process.
  4. Carefully split data into non-overlapping sets for pre-, mid-, and post-training.

Why it matters: With clear structure, you can tell whether RL truly extends reasoning depth and transfers skills to new contexts—or just memorizes shortcuts.

šŸž Anchor: It’s like giving two kids identical Lego kits but different instruction pages. If one builds taller towers (deeper reasoning) or can build in a new theme (context transfer), you can measure which training stage helped.

šŸž Hook: Imagine a hiking path with three stages: a flat warm-up (pre-training), a hill bridge (mid-training), and a steep summit push (RL). How do these pieces add up to reach higher peaks?

🄬 The Concept (Failed Attempts and the Gap): Past attempts used real-world corpora and loosely defined rewards. But hidden overlaps and unverified steps muddled conclusions. The missing piece was a lab-grade setup where every training distribution, task difficulty, and evaluation rule is known and controllable.

How the paper fills the gap:

  1. It creates a fully controllable world of math-like problems defined by directed acyclic graphs (DAGs), ensuring exact reasoning structure is known.
  2. It measures two key abilities: extrapolative generalization (tackling deeper compositions) and contextual generalization (transferring across surface stories).
  3. It uses process-verified evaluation to only count solutions correct when all steps and the final answer match.

Why it matters: This lets us pinpoint when RL truly adds new capability, when mid-training is essential, and how tiny pre-training exposure unlocks big transfer later.

šŸž Anchor: Picture testing shoes on a treadmill that controls slope and speed exactly. Now you can say whether the laces, the sole, or the training plan made the runner faster.

šŸž Hook: Why should you care? Because the way we train models affects your homework help, coding assistants, and creative tools.

🄬 The Concept (Real Stakes): If we think RL creates new skills when it’s merely polishing old ones, we could waste huge compute budgets or trust unreliable reasoning. With clean evaluation, we can design cheaper, safer, and smarter training pipelines.

How it plays out:

  1. Better planning: Match RL data to the model’s edge of competence to get real gains.
  2. Better coverage: Seed pre-training with just 1% exposure to long-tail contexts to unlock cross-context transfer.
  3. Better reliability: Use mid-training and process-aware rewards to reduce reward hacking and improve faithful reasoning.

Why it matters: Everyday tools become more accurate and fair, and we avoid being fooled by models that guess the right answer for the wrong reasons.

šŸž Anchor: It’s the difference between a tutor who teaches understanding step-by-step versus one who only drills answers. The first builds skill; the second risks shortcuts.

02Core Idea

šŸž Hook: You know how video games get fun when the level is just hard enough to stretch you, not so easy it’s boring, and not so hard it’s impossible?

🄬 The Concept (The Aha! Moment): The key insight is that true reasoning gains happen when pre-training leaves headroom, RL is aimed exactly at the model’s edge of competence, mid-training bridges the distributions, and rewards check both the final answer and the steps.

How it works:

  1. Pre-training installs basic reasoning ā€œatoms.ā€
  2. Mid-training strengthens those atoms right where the model starts to struggle, aligning them with what RL will target.
  3. RL then explores just-beyond-known territory (the edge of competence), composing atoms into deeper chains.
  4. Process-aware rewards ensure the model earns points for correct steps, not just lucky answers.

Why it matters: Without headroom, RL only polishes; without calibration, RL flails; without process checks, RL can be gamed.

šŸž Anchor: It’s like building muscle: learn form (pre-train), practice with guided sets (mid-train), then attempt slightly heavier weights (RL) with a coach who also watches your form (process reward).

Multiple Analogies:

  1. School Curriculum: Learn times tables (pre-train), do structured worksheets (mid-train), then timed quizzes just above your comfort zone (RL) graded for both steps and answer (process reward).
  2. Cooking: Master chopping and mixing (pre-train), follow recipe families (mid-train), then improvise dishes one step harder (RL) while a chef checks each step (process reward).
  3. Sports: Drill fundamentals (pre-train), scrimmage with targeted plays (mid-train), then play tougher opponents (RL) with a referee checking not only the score but also fouls (process reward).

Before vs After:

  • Before: RL results looked inconsistent—sometimes big gains, sometimes not.
  • After: Gains depend on two switches: (a) headroom from pre-training and (b) RL difficulty calibrated to the edge of competence; plus a boost from mid-training and process-verified rewards.

Why It Works (Intuition):

  • Pre-training stores reusable primitives.
  • Mid-training positions those primitives right at the frontier, making them accessible for composition.
  • RL thrives when rewards are neither too sparse (too hard) nor too redundant (too easy).
  • Process rewards densify feedback and prevent shortcut solutions that don’t generalize.

Building Blocks (explained with Sandwich):

  1. Pre-training šŸž Hook: Imagine learning the alphabet before writing essays. 🄬 What it is: Pre-training teaches fundamental reasoning atoms and patterns. How it works: Feed lots of varied, structured problems; learn operations and step formats; stop before covering the hardest cases to leave headroom. Why it matters: Without atoms or headroom, RL can’t extend ability—only polish. šŸž Anchor: A kid who knows addition can learn multi-step word problems next.

  2. Mid-training šŸž Hook: Think of a bridge connecting two hills. 🄬 What it is: A focused stage that narrows data toward what RL will emphasize. How it works: Continue next-token prediction on a tighter distribution where the model has partial competence. Why it matters: It stabilizes optimization and primes the model to benefit from RL. šŸž Anchor: Practice sheets that match the upcoming test.

  3. Reinforcement Learning (RL) šŸž Hook: Like practicing free throws and getting points for each swish. 🄬 What it is: Learning by trial-and-reward, steering the model toward policies that score higher. How it works: Sample solutions; grade them; update the policy to prefer higher-reward traces. Why it matters: RL grows breadth and depth when aimed at the edge of competence. šŸž Anchor: A player improves fastest when shots are challenging but makeable.

  4. Extrapolative Generalization šŸž Hook: Imagine stacking more Lego blocks than you’ve ever stacked before. 🄬 What it is: Solving problems with deeper chains of operations than seen during training. How it works: Compose learned atoms into longer sequences. Why it matters: Shows true depth growth—not just memorization. šŸž Anchor: Going from 5-step puzzles to 12-step ones.

  5. Contextual Generalization šŸž Hook: Same melody, different instruments. 🄬 What it is: Transferring the same reasoning graph to new surface stories. How it works: Keep structure; swap context words (zoo → school). Why it matters: Real tasks vary in wording; you need stable transfer. šŸž Anchor: Solving a budget problem whether it’s framed as animals or teachers.

  6. Process-Level Rewards šŸž Hook: Gold stars for showing your work. 🄬 What it is: Rewards that credit correct steps, not just final answers. How it works: Parse the solution into steps; compare to ground-truth graph; reward step accuracy. Why it matters: Cuts reward hacking and improves faithful reasoning. šŸž Anchor: A math teacher who grades both steps and the result.

03Methodology

At a high level: Input (synthetic, graph-defined problems) → Pre-training (learn atoms) → Mid-training (bridge and focus) → RL (edge-of-competence exploration with process-aware rewards) → Output (solutions scored by process-verified pass@k).

Step-by-step recipe:

  1. Create a controllable world of problems šŸž Hook: Think of blueprint-based puzzles where every piece and connection is known. 🄬 What happens: Each problem is built from a dependency graph (a DAG) where nodes are quantities and edges are arithmetic dependencies. We also attach a ā€œcontext templateā€ (like zoo/animals or school/teachers) that changes the story but not the structure. Why this step exists: Control. It prevents training–test leaks and lets us separate structure (math) from wording (context). Example: A graph that says Total = Lions + Elephants can be rendered as either a zoo problem or a school supplies problem. šŸž Anchor: Same skeleton, different costumes. (A code sketch after this list makes the generator idea concrete.)

  2. Pre-training (learn the atoms, keep headroom) šŸž Hook: Study the basics across many examples, but stop short of mastering the hardest. 🄬 What happens: A 100M-parameter model is trained on 10B tokens with operation counts op=2–10 across multiple contexts to build arithmetic and reasoning format skills, leaving deeper ranges (op>10) underexposed. Why this step exists: To install primitives and deliberately leave room for RL to expand. Example: The model sees many 2–10 step problems in zoo/school/festival styles and learns to show steps. šŸž Anchor: A student who’s solid on medium problems but hasn’t yet tackled the longest chains.

  3. Mid-training (bridge to RL) šŸž Hook: Targeted practice right where things start feeling tough. 🄬 What happens: Keep the same next-token objective but narrow the data to the model’s emerging frontier (e.g., op=11–14) with structures similar to what RL will emphasize. Why this step exists: It aligns representations, improves stability, and makes RL sample-efficient under fixed compute. Example: If op=11–14 is the frontier, mid-training focuses there with structured solutions. šŸž Anchor: Practice sheets that mirror the next challenge.

  4. RL post-training with GRPO (push at the edge) šŸž Hook: Try, get scored, try again. 🄬 What happens: Use GRPO-style RL to sample multiple solution attempts, score them, and update the model toward higher-reward behaviors. Critically, choose RL data at the edge of competence: the tasks are hard-but-possible. Why this step exists: Pushing on tasks that are too easy brings polishing only; too hard brings sparse rewards and little learning. Edge calibration maximizes learning. Example: Train on op=11–14 and then evaluate gains on op=15–20. šŸž Anchor: Practicing shots that you miss sometimes but can learn to make. (A sketch of the group-relative scoring idea also follows this list.)

  5. Process-verified evaluation (grade every step) šŸž Hook: No partial credit for lucky guesses. 🄬 What happens: Parse the model’s solution into a predicted graph; compare each step’s dependencies and values to the gold graph. A sample counts as correct only if all steps and the final answer match. Why this step exists: Prevents reward or evaluation hacking and measures faithful reasoning. Example: If the model claims Total = Lions āˆ’ Elephants when it should be +, the sample is counted wrong even if the final number happens to match. šŸž Anchor: A teacher who checks the algebra line by line.

  6. Contextual generalization tests (same graph, new words) šŸž Hook: Can you sing the song in a new key? 🄬 What happens: Keep the abstract structure but swap the story template (e.g., from zoo to school). Vary how much the model saw this new template in pre-training (0%, 0.1%, 1%, 10%). Why this step exists: To test whether tiny exposure (≄1%) is enough for RL to spread skills across contexts. Example: A 12-step budget graph appears in a ā€œteachers–schoolā€ story instead of ā€œanimals–zoo.ā€ šŸž Anchor: Doing the same math in a different word problem.

  7. Compute-aware mixing (mid-training vs RL under a fixed budget) šŸž Hook: You have one pizza of compute—how do you slice it? 🄬 What happens: Convert RL’s rollout cost to an equivalent token budget; then test splits: Full Mid, Full RL, Light-RL, Medium-RL, Heavy-RL, all using the same edge-range data (e.g., op=11–14). Why this step exists: To discover how to allocate compute for best near-range vs far-range gains. Example: Light-RL wins pass@1 on edge tasks; Heavy-RL wins on the hardest OOD problems. šŸž Anchor: Training plans that differ for consistency vs maximum reach.
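
To make step 1 concrete, here is a minimal Python sketch of a graph-defined problem generator in the spirit the paper describes. Everything here is an illustrative assumption (the OPS table, the CONTEXTS wording, the chain-shaped structure, which is the simplest DAG), not the authors' actual code; it only shows how one underlying structure can be rendered into several surface stories.

```python
import random

# Illustrative sketch (not the authors' generator): a problem is a tiny dependency
# graph of quantities. Here we use a chain (the simplest DAG): each step applies an
# arithmetic operation to the running quantity, and the surface story is attached
# separately so the same structure can wear different "costumes".

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

# Hypothetical context templates: same math, different wording.
CONTEXTS = {
    "zoo":    ("lions", "elephants", "animals at the zoo"),
    "school": ("teachers", "aides", "staff at the school"),
}

def make_problem(num_ops, rng):
    """Build a chain of `num_ops` arithmetic steps; return (start, steps, answer)."""
    value = rng.randint(1, 9)                 # starting leaf quantity
    start, steps = value, []
    for _ in range(num_ops):
        op = rng.choice(sorted(OPS))          # sorted for reproducibility
        operand = rng.randint(1, 9)
        value = OPS[op](value, operand)
        steps.append((op, operand, value))    # (operation, other quantity, running value)
    return start, steps, value

def render(start, steps, answer, context):
    """Render the same step structure under a chosen surface context."""
    noun_a, noun_b, group = CONTEXTS[context]
    lines = [f"There are {start} {noun_a}. Over {len(steps)} events, {noun_b} change the count:"]
    for op, operand, running in steps:
        lines.append(f"  apply {op} {operand}  (running total: {running})")
    lines.append(f"How many {group} are there at the end? (gold answer: {answer})")
    return "\n".join(lines)

rng = random.Random(0)
start, steps, answer = make_problem(num_ops=3, rng=rng)
print(render(start, steps, answer, "zoo"))
print(render(start, steps, answer, "school"))  # same skeleton, different costume
```

Swapping the context key changes every noun in the story while leaving the dependency structure, and therefore the gold reasoning trace, untouched; that separation is exactly what the contextual-generalization tests in step 6 rely on.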
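
For step 4, here is a minimal sketch of the group-relative scoring idea at the heart of GRPO-style updates: sample a group of attempts per problem, reward each, and normalize rewards within the group so that better-than-average traces get positive weight. The clipped policy-gradient objective and KL penalty of a full GRPO implementation are omitted, and the reward values below are made up for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: six rollouts for one edge-of-competence problem, scored by a process-aware grader.
rewards = [0.0, 0.2, 1.0, 0.0, 0.8, 0.2]
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={r:.1f}  advantage={a:+.2f}")  # above-average traces get positive weight
```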

The secret sauce:

  • Edge-of-competence targeting: RL learns best where tasks are just beyond current skill.
  • Process-aware rewards: Dense step feedback reduces reward hacking and yields truer reasoning.
  • Clean splits and graphs: By disentangling structure from context and preventing contamination, we can attribute gains correctly.

Extra sandwich for key evaluation idea—pass@k šŸž Hook: Imagine you get k chances to answer a question. 🄬 What it is: pass@k counts a task as solved if any of k sampled solutions is fully correct (steps + answer). How it works: Sample k times; check each with process verification; success if at least one is perfect. Why it matters: pass@1 shows single-try sharpness; pass@128 shows true capability range. šŸž Anchor: Like taking multiple shots at a hard basketball hoop—eventually you see if you can really make it.
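
Here is a minimal sketch of process-verified pass@k, assuming an illustrative solution format ("name = a op b -> value", one step per line) rather than the paper's actual parser. The rule matches the description above: a sample counts only if every step's dependencies, operation, and value match the gold graph, and a task is solved at k if any of the k samples passes.

```python
def parse_steps(solution_text):
    """Parse lines like 'total = lions + elephants -> 12' into structured steps (assumed format)."""
    steps = []
    for line in solution_text.strip().splitlines():
        lhs, rhs = line.split("=")
        expr, value = rhs.split("->")
        a, op, b = expr.split()
        steps.append({"target": lhs.strip(), "inputs": {a, b}, "op": op, "value": int(value)})
    return steps

def process_correct(solution_text, gold_steps):
    """Correct only if every step's dependencies, operation, and value match the gold graph."""
    try:
        pred = parse_steps(solution_text)
    except ValueError:
        return False  # unparseable solutions get no credit
    if len(pred) != len(gold_steps):
        return False
    return all(p["target"] == g["target"] and p["inputs"] == g["inputs"]
               and p["op"] == g["op"] and p["value"] == g["value"]
               for p, g in zip(pred, gold_steps))

def pass_at_k(samples, gold_steps):
    """pass@k: solved if at least one of the k sampled solutions is process-verified correct."""
    return any(process_correct(s, gold_steps) for s in samples)

# Tiny demo: the first attempt gets the right kind of number the wrong way; only the second counts.
gold = [{"target": "total", "inputs": {"lions", "elephants"}, "op": "+", "value": 12}]
attempts = ["total = lions - elephants -> 2", "total = lions + elephants -> 12"]
print(pass_at_k(attempts, gold))  # True
```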

04Experiments & Results

The tests and why:

  • Extrapolative generalization (depth): Can the model solve problems with more steps (op=11–14, op=15–20) than it saw in pre-training (op=2–10)? This shows real depth gains.
  • Contextual generalization (breadth): Can it transfer the same graph to new surface stories (e.g., teachers–school) after small pre-training exposure?
  • Reward design: Do process-aware rewards reduce reward hacking and improve faithful reasoning?
  • Compute mixing: Under the same total budget, how should we split mid-training vs RL?

Competition/baselines:

  • Base model after pre-training on op=2–10.
  • RL applied to different difficulty buckets: op=7–10 (ID), 9–12 (mixed), 11–14 (edge), 17–20 (hard).
  • Mid-training + RL vs RL-only under equalized compute.

Scoreboard with context:

  1. When does RL produce true new capability?
  • Finding: RL on ID ranges boosts pass@1 but not pass@128—polishing, not extending.
  • Key win: RL trained at the edge (op=11–14) lifts pass@128 on OOD tasks (including op=15–20), delivering genuine capability gains.
  • Scale: Well-calibrated settings report up to +42% pass@128 on deeper tasks. Analogy: Like moving from consistent 2-point shots to making 3s—you can now reach further, not just hit the same shots more often.
  2. How much pre-training exposure is enough for context transfer?
  • Finding: With 0%–0.1% exposure to the long-tailed context during pre-training, RL fails to transfer. With ≄1% exposure, RL reliably spreads skill across contexts—even to hard op=20 problems—sometimes adding up to +60% pass@128.
  • Intuition: RL can’t build from a void; it needs a seed. Analogy: Learn just a few words of a new language, and a good tutor (RL) helps you speak paragraphs.
  3. Mid-training’s role under fixed compute
  • Finding: Adding a mid-training bridge improves OOD performance under the same budget and beats RL-only by about +10.8% on OOD-hard in reported comparisons.
  • Pattern: Light-RL (more mid, less RL) gets the best pass@1 on OOD-edge. Heavy-RL (less mid, more RL) wins on the hardest OOD tasks in pass@1 and pass@128. Analogy: More drills for consistency; more scrimmage for peak reach.
  4. Process-aware rewards reduce reward hacking
  • Finding: Mixing outcome and process signals (e.g., 0.2 outcome + 0.8 process) lifts pass@1 by ~4–5% on op=15–20 and improves reasoning fidelity. A strict gate (reward outcome only if process is perfect) also helps. (A sketch of this mixing appears just after this list.)
  • Outcome: Fewer structural errors (missing nodes, wrong dependencies) and more faithful chains. Analogy: Students stop guessing and start showing correct algebra.
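
Here is a hedged sketch of how such a mixed reward might be computed, assuming a process grader that returns the fraction of verified-correct steps. The 0.2/0.8 weights follow the example in the text, and the "strict gate" variant only pays the outcome reward when every step is correct. This illustrates the idea; it is not the authors' reward function.

```python
def mixed_reward(step_correct_fraction, final_answer_correct,
                 w_outcome=0.2, w_process=0.8, strict_gate=False):
    """Combine a sparse outcome signal with a dense process signal (weights from the text's example)."""
    outcome = 1.0 if final_answer_correct else 0.0
    if strict_gate:
        # Strict gate: the outcome reward only counts if the whole process is perfect.
        outcome = outcome if step_correct_fraction == 1.0 else 0.0
    return w_outcome * outcome + w_process * step_correct_fraction

# A lucky guess with broken steps earns little; a faithful chain earns nearly full reward.
print(mixed_reward(step_correct_fraction=0.25, final_answer_correct=True))  # 0.4
print(mixed_reward(step_correct_fraction=1.0,  final_answer_correct=True))  # 1.0
```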

Surprising findings:

  • Tiny pre-training exposure (as low as ~1%) to a long-tail context unlocks strong cross-context RL transfer—much less than many expect.
  • RL on tasks that are too hard (far OOD) stalls because rewards become too sparse; on tasks that are too easy, it just sharpens pass@1.
  • Mid-training—often under-discussed—substantially conditions the model for RL’s benefits.

Extra sandwich for ā€œedge of competenceā€ šŸž Hook: Think of a running pace you can barely keep—tiring but doable. 🄬 What it is: The difficulty zone where the model fails at pass@1 but succeeds at pass@k. How it works: Filter RL data so examples lie just beyond the model’s easy zone. Why it matters: Maximizes learning signal; avoids both boredom and hopelessness. šŸž Anchor: Lifting a weight you can’t do cold, but can after a few tries.
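
A minimal sketch of how one might filter RL training data to this zone, assuming access to a model sampler and the process-verified checker sketched earlier: keep problems the model rarely solves in one try but can sometimes solve within k attempts. The hooks `sample_solutions`, `is_correct`, and the `max_solve_rate` threshold are hypothetical stand-ins, not a real API.

```python
def at_edge_of_competence(problem, sample_solutions, is_correct, k=128, max_solve_rate=0.25):
    """Keep a problem if the model rarely solves it in one try but can within k samples.

    `sample_solutions(problem, n)` draws n solutions from the current model and
    `is_correct(solution, problem)` is a process-verified checker; both are assumed hooks.
    """
    samples = sample_solutions(problem, n=k)
    solve_rate = sum(is_correct(s, problem) for s in samples) / k
    return 0.0 < solve_rate <= max_solve_rate  # non-zero pass@k, low per-sample success

def build_edge_dataset(problems, sample_solutions, is_correct, k=128):
    """Edge-calibrated RL data: hard-but-possible problems only; refresh as the model improves."""
    return [p for p in problems if at_edge_of_competence(p, sample_solutions, is_correct, k)]
```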

Numbers with meaning:

  • +42% pass@128 when RL is well-calibrated at the edge.
  • +60% pass@128 for contextual transfer once pre-training has ≄1% exposure to the long-tail context.
  • +10.8% OOD-hard advantage for mid-training + RL over RL-only under fixed compute.
  • +4–5% pass@1 on hardest ranges by adding process rewards.

Take-home: The who, when, and how of RL gains are now clearer: seed in pre-training; bridge in mid-training; aim RL at the edge; and reward the process.

05Discussion & Limitations

Limitations (be specific):

  • Synthetic world: The tasks are controllable arithmetic-style problems with DAG-defined steps. Real-world language has messier dependencies and ambiguities.
  • Model scale: Results use 100M-parameter models; scaling behavior might shift with larger models.
  • Operation set: Focused on arithmetic (+, āˆ’, Ɨ, Ć·). Other reasoning types (logic, code, commonsense) may differ.
  • Parser dependence: Process verification relies on parsing structured steps; free-form reasoning styles might be harder to evaluate robustly.
  • Objective scope: RL used GRPO-like setups; other RL variants or credit-assignment strategies could change dynamics.

Required resources:

  • Data: ~10B pre-training tokens from the synthetic generator, plus curated splits for mid- and post-training.
  • Compute: Enough to train a 100M model and perform RL rollouts (e.g., rollout multiplicity ~6, context length ~2048), and to run process parsing and grading.
  • Engineering: Clean deduplication, distributional splits, and robust solution parsing.

When NOT to use:

  • If your target tasks are already saturated by pre-training (high pass@128), RL aimed in-domain likely won’t extend capability—only sharpen pass@1.
  • If there’s zero pre-training exposure to a brand-new context and no way to seed it, RL alone probably won’t transfer.
  • If you can’t parse or verify process steps, pure outcome rewards may invite reward hacking.
  • If compute is extremely scarce, a mid-training bridge may be more efficient than RL-only.

Open questions:

  • Transfer to natural domains: How do these patterns carry over to math word problems in the wild, coding, or multimodal reasoning?
  • Scaling laws: How do headroom, edge calibration, and mid-training impact change with larger (billion-scale) models?
  • Automated edge finding: Can we continuously detect and track the model’s edge of competence for self-paced curricula?
  • Reward design: What’s the best mix of process vs outcome signals across domains? Can learned process reward models replace exact graph checks?
  • Data curricula: What are optimal mid-training distributions for different downstream goals (robustness vs reach)?

06Conclusion & Future Work

Three-sentence summary: This paper builds a clean, controlled lab for studying how pre-training, mid-training, and RL interact to shape reasoning in language models. It finds that true capability gains arise when pre-training leaves headroom, RL targets the edge of competence, mid-training bridges distributions, and process-aware rewards prevent shortcutting. With these pieces in place, models generalize deeper (extrapolation) and across new stories (context) more reliably.

Main achievement: The authors reconcile conflicting RL results by showing they depend on two key dials—pre-training coverage and RL difficulty calibration—while highlighting mid-training and process-verified rewards as powerful, practical levers.

Future directions: Test the recipe on larger models and real-world domains; automate edge-of-competence tracking for curricula; explore richer process reward models; and expand beyond arithmetic graphs to logic, programming, and multimodal tasks.

Why remember this: It turns fuzzy training folklore into a clear playbook—seed minimal exposure, build a mid-bridge, aim RL at the edge, and reward the steps. Following this playbook can save compute, boost reliability, and grow genuine reasoning ability rather than polished guesswork.

Practical Applications

  • Filter RL training data to the model’s edge of competence (low pass@1 but non-zero pass@k) and refresh this filter as the model improves.
  • Seed long-tail contexts with at least ~1% exposure in pre-training to unlock robust cross-context transfer during RL.
  • Add a mid-training stage that mirrors the upcoming RL distribution to stabilize optimization and improve RL efficiency under fixed compute.
  • Use process-verified evaluation (parse steps to a graph) and count a solution correct only if all steps plus the final answer match.
  • Mix rewards: combine outcome reward with dense process-level signals (e.g., 0.2 outcome + 0.8 process) to reduce reward hacking.
  • Adopt task-aware compute splitting: more mid-training + light RL for reliability near distribution; heavier RL (with some mid-training) for far-OOD gains.
  • Continuously monitor pass@1 vs pass@k to distinguish sharpening from true capability growth.
  • Design contextual templates that vary wording while keeping structure fixed to regularly test breadth generalization.
  • Track structural error types (missing nodes, dependency mismatches) to diagnose when process rewards are needed most.
  • Avoid RL-only on in-domain saturated tasks; redirect compute toward mid-training or edge-calibrated RL where it can extend ability.
Tags: edge of competence, process-verified evaluation, process-level rewards, mid-training, reinforcement learning for LLMs, extrapolative generalization, contextual generalization, dependency graphs (DAG), pass@128, reward hacking, controlled synthetic reasoning, GRPO, GSM-Infinite, compute budget allocation, reasoning primitives