Rethinking Expert Trajectory Utilization in LLM Post-training

Intermediate
Bowen Ding, Yuhan Chen, Jiayang Lv et al. · 12/12/2025
arXiv · PDF

Key Summary

  ‱ The paper asks how to best use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.
  ‱ It introduces the Plasticity-Ceiling Framework: think of final skill as “how good you are now (from SFT)” plus “how much room you still have to grow with RL.”
  ‱ The authors test many training styles and find a clear winner: do Supervised Fine-Tuning (SFT) first, then do Reinforcement Learning (RL)—sequentially, not mixed together.
  ‱ They show the best time to switch from SFT to RL is when SFT’s validation loss has flattened (Stable) or is just barely rising (Mild Overfitting), but not later (Severe Overfitting).
  ‱ They refute the idea that “less is more” for data in this setting: bigger SFT datasets raise the ultimate performance ceiling; harder examples act like a multiplier on top.
  ‱ Minimum SFT validation loss is a strong, cheap-to-measure predictor of how high your final performance ceiling can be after RL.
  ‱ On math benchmarks, the sequential SFT-then-RL approach beats Pure RL and Synchronized SFT-RL (mixed) methods by a notable margin.
  ‱ These findings give simple, actionable rules for when to switch to RL and how to choose and scale SFT data to get the most from expert trajectories.

Why This Research Matters

Reasoning-heavy applications—math tutors, coding copilots, science helpers—need models that not only talk but also solve multi-step problems reliably. This paper gives a simple, tested recipe to get there: build a broad SFT base, then switch to RL at the right moment. It shows that bigger SFT datasets raise your ultimate ceiling, and that slightly harder examples multiply gains, so teams can plan data collection wisely. With minimum validation loss as a predictor, labs can choose the best checkpoints and datasets before spending expensive RL compute. The result is better performance with fewer training runs, less instability, and clearer roadmaps for future improvements.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how you first learn to ride a bike with training wheels, and only after you’re steady do you try riding without them and maybe even learn some tricks? If you skip the training wheels or take them off too soon, you wobble and fall.

đŸ„Ź The Concept (Large Language Models, LLMs): LLMs are computer programs that learn to read, write, and reason from lots of text. How it works: (1) Pretrain on tons of text to learn general language patterns; (2) Post-train to sharpen skills for specific tasks; (3) Use careful practice to reach higher reasoning ability. Why it matters: Without this second stage of training, models can talk, but they struggle to solve tricky problems step by step.

🍞 Anchor: Think of an LLM as a student who read a whole library (pretraining) but still needs classes and homework (post-training) to ace math contests.

🍞 Hook: Imagine learning math from a solution key. You copy how experts solve problems before trying your own strategies.

đŸ„Ź The Concept (Supervised Fine-Tuning, SFT): SFT is when a model studies examples with correct, step-by-step solutions (expert trajectories). How it works: (1) Show the prompt and the expert’s full solution; (2) The model imitates each step; (3) Repeat across many examples to build strong habits. Why it matters: Without SFT, the model lacks solid, reliable ways to start solving hard problems.

🍞 Anchor: It’s like practicing long division by following your teacher’s worked examples until the steps feel natural.

🍞 Hook: Think about training a puppy: do the trick, get a treat; do it wrong, no treat.

đŸ„Ź The Concept (Reinforcement Learning, RL): RL teaches a model by rewarding good answers and not rewarding bad ones. How it works: (1) The model tries different steps; (2) A rule checks correctness and gives a reward; (3) The model changes to favor moves that earn rewards. Why it matters: Without RL, the model can imitate but struggles to explore new or smarter paths.

🍞 Anchor: It’s like trying different puzzle moves and keeping the ones that get you closer to the solution.

🍞 Hook: Picture trying to read the textbook and play a graded quiz game at the same time; it’s easy to get confused.

đŸ„Ź The Concept (Synchronized SFT-RL): This approach mixes imitation (SFT) and reward-based practice (RL) in one loop. How it works: (1) Train on expert steps; (2) Also generate your own steps; (3) Blend both signals to update the model together. Why it matters: If not handled carefully, this can become unstable—especially with big, hard datasets—causing the model to wobble instead of improve.

🍞 Anchor: It’s like flipping quickly between a teacher’s notes and a live test; you might not learn either one well.

🍞 Hook: Think of two phases: first learn the safe way with guidance, then practice freely to master tricks.

đŸ„Ź The Concept (Sequential SFT-then-RL): Train in two clean stages—first SFT to build a foundation, then RL to push further. How it works: (1) Do SFT until the learning stabilizes; (2) Switch to RL to explore and improve; (3) Stop when gains level off. Why it matters: Mixing signals too early or too late either hurts stability or wastes potential.

🍞 Anchor: Like learning to play piano: first master scales with a teacher, then experiment with jazz improvisation.

The World Before: Researchers knew we needed both SFT and RL to turn chatty models into strong reasoners. But no one had a clear, reliable recipe for when to switch from SFT to RL, how much SFT data to use, or whether to train both at once. Some teams tried synchronized training and reported quick wins—but mostly on small datasets. Others used simple two-stage pipelines that worked in practice, but with rules-of-thumb rather than firm guidance.

The Problem: We lacked a theory that explains how expert trajectories (SFT data) should be used to maximize the final, after-RL performance, and we lacked solid rules for timing the SFT→RL switch and for choosing data scale and difficulty.

Failed Attempts: (1) Pure SFT: Imitates well but stalls; (2) Pure RL: Explores fast but hits a low ceiling without a strong base; (3) Synchronized SFT-RL: Can look efficient early, but tends to be unstable or top out early on large, harder datasets.

The Gap: A unifying way to measure “how high can we go” (the ceiling) and “how much room is left to grow” (plasticity) was missing, along with practical, measurable signals to guide switching time and data choices.

Real Stakes: This matters for math tutors, coding copilots, scientific assistants, and any app where careful, step-by-step reasoning is crucial. Better rules save compute, reduce instability, and produce models that solve harder problems more reliably.

02 Core Idea

🍞 Hook: Imagine building a tower. The first floors must be strong (foundation), and then you decide how many more floors you can safely add (headroom). If the base is weak, you can’t go very high.

đŸ„Ź The Concept (Aha!): Final performance = solid foundation from SFT + remaining headroom that RL can still turn into gains. How it works: (1) Use SFT to lock in dependable skills; (2) Measure how much “growth space” is left (plasticity); (3) Spend RL compute to convert that space into real performance; (4) Time the switch so you get a strong base without crushing the room to grow. Why it matters: If you switch too early, the base is shaky; too late, you lose the chance to grow higher.

🍞 Anchor: It’s like baking: first make a sturdy cake layer, then add frosting and decorations. Bad base? Your cake collapses. Too much fiddling with the base? You run out of time to decorate.
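
In symbols (using the paper's notation P_sft, PL_rl, and A_post from the Building Blocks list below), the idea can be written as a simple decomposition. The additive form here is a reading of the summary above, not necessarily the paper's exact formula:

```latex
% Plasticity-Ceiling decomposition (sketch; the paper's exact form may differ)
% A_post : final reachable performance after SFT then RL (the ceiling)
% P_sft  : performance locked in by the SFT foundation
% PL_rl  : plasticity, i.e., the headroom RL can still convert into gains
A_{\text{post}} \;=\; P_{\text{sft}} \;+\; \mathrm{PL}_{\text{rl}}
```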

Three Analogies:

  • Sports: Warm-up drills (SFT) teach basic moves; scrimmages with scoring (RL) push strategy. Start scrimmaging after you’re steady, or the team won’t improve much.
  • Music: Practice scales with a teacher (SFT), then improvise and get audience feedback (RL). Start improvising once scales are solid.
  • Gardening: Build healthy roots (SFT) before pushing for blooms with special feed (RL). If roots are weak or overwatered, the plant won’t flourish.

🍞 Hook: You know how some jars are already almost full, and there’s just a little space left at the top?

đŸ„Ź The Concept (Plasticity): Plasticity is how much improvement room is left after SFT for RL to use. How it works: (1) Start from SFT performance; (2) Estimate how far you could still go; (3) RL climbs that remaining ladder. Why it matters: If SFT leaves tiny room, RL can’t add much no matter how hard it tries.

🍞 Anchor: If your spelling is already near-perfect, extra practice won’t raise your grade much—that’s low plasticity.

🍞 Hook: Picture a height limit sign at an amusement park ride.

đŸ„Ź The Concept (Performance Ceiling): The ceiling is the highest performance you can reach with lots of compute. How it works: (1) Track performance as you spend compute; (2) Fit a curve that levels off; (3) The top of that curve is the ceiling. Why it matters: Chasing gains past the ceiling wastes time and energy.

🍞 Anchor: It’s like knowing your test score can’t go above 100—once you’re close, more cramming won’t help much.
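
To make “fit a curve that levels off” concrete, here is a minimal sketch that fits a saturating curve to accuracy-versus-compute points and reads off its ceiling. The exponential form, the sample numbers, and the function names are illustrative assumptions, not the paper's actual fitting procedure:

```python
# Sketch: estimate a performance ceiling by fitting a saturating curve to
# (compute, accuracy) points. The exponential-saturation form and the sample
# numbers are illustrative assumptions, not the paper's exact fit.
import numpy as np
from scipy.optimize import curve_fit

def saturating(c, ceiling, rate, start):
    # Accuracy rises with compute c and levels off at `ceiling`.
    return ceiling - (ceiling - start) * np.exp(-rate * c)

compute = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])        # relative training FLOPs
accuracy = np.array([55.0, 62.0, 68.0, 72.0, 74.5, 75.5])  # benchmark scores

params, _ = curve_fit(saturating, compute, accuracy, p0=[80.0, 0.3, 50.0])
ceiling, rate, start = params
print(f"Estimated ceiling: {ceiling:.1f}")                 # the level the curve approaches
```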

🍞 Hook: Imagine a dashboard light that tells you, “Okay, you’re ready to switch!”

đŸ„Ź The Concept (Minimum SFT Validation Loss): This is the lowest error your model reaches on a held-out SFT set. How it works: (1) Train with SFT; (2) Watch the validation loss curve; (3) When it bottoms out (Stable) or just barely rises (Mild Overfitting), you’re at the best time to switch to RL. Why it matters: Switching too soon or too late hurts the final ceiling.

🍞 Anchor: Like shooting hoops: once your practice shots are consistently accurate and no longer improving, it’s time to play a real game to grow further.

Before vs After:

  • Before: People guessed when to switch, debated mixing SFT with RL, and argued whether tiny, high-quality datasets were enough.
  • After: We have a map: (1) Do SFT to Stable or Mild Overfitting, then switch; (2) Use big SFT datasets to raise the ceiling; (3) Prefer harder examples to multiply gains; (4) Use minimum validation loss as a reliable predictor.

Why It Works (Intuition):

  • SFT builds dependable, reusable skills that RL can trust.
  • If SFT is too short, RL wastes time fixing basics instead of improving strategy.
  • If SFT goes too long into overfitting, the model becomes rigid, and RL can’t reshape it easily (plasticity shrinks).
  • Big, diverse SFT data teaches broader skills and leaves more useful headroom for RL to climb.

Building Blocks:

  • Foundation (P_sft): The performance after SFT. Stronger P_sft → better starting point.
  • Headroom (PL_rl): A measure of how much RL can still add. Bigger PL_rl → more potential.
  • Ceiling (A_post): The final reachable level after both stages.
  • Switch Signal: Minimum validation loss tells you when SFT is “done enough.”
  • Data Knobs: Scale drives the main gains; difficulty is a multiplier, especially when scale is limited.

03 Methodology

High-level overview: Input (prompts + expert step-by-step solutions) → SFT (learn by imitation) → Monitor validation loss (find Stable/Mild Overfitting) → Switch to RL (learn by reward) → Output (higher final reasoning performance).
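
In code, that pipeline is a short control loop. The sketch below is a hypothetical skeleton: the callables train_sft_epoch, eval_val_loss, and run_rl stand in for whatever training stack you use; only the switch rule (stop SFT at, or just past, the validation-loss minimum, then hand the best checkpoint to RL) reflects the paper's recipe.

```python
# Hypothetical skeleton of the sequential SFT-then-RL recipe described above.
# The callables are placeholders for your own training stack; only the switch
# rule (stop SFT near the validation-loss minimum, then run RL) comes from
# the paper's findings.

def sequential_sft_then_rl(model, train_sft_epoch, eval_val_loss, run_rl,
                           max_sft_epochs=5, mild_overfit_tol=0.10):
    best_loss, best_ckpt = float("inf"), model
    for _ in range(max_sft_epochs):
        model = train_sft_epoch(model)      # imitate expert trajectories for one epoch
        val_loss = eval_val_loss(model)     # held-out SFT validation loss
        if val_loss < best_loss:            # new minimum: this is the Stable checkpoint
            best_loss, best_ckpt = val_loss, model
        elif val_loss > best_loss * (1 + mild_overfit_tol):
            break                           # rise beyond the mild band: stop SFT here
    return run_rl(best_ckpt)                # RL converts the remaining plasticity into gains
```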

Step 1: Prepare data and models

  • What happens: Collect expert trajectories (prompt plus worked solution) for SFT. Prepare RL prompts for reward-based practice. Use a strong base model (e.g., Qwen2.5-7B) and a cross-check smaller model (Llama3.2-3B).
  • Why this step exists: Without clean expert data, SFT can’t build reliable habits; without appropriate RL prompts, rewards won’t teach the right skills.
  • Example: For math, gather hundreds of thousands of high-quality, step-by-step solutions distilled from a powerful teacher model and filter benchmark overlaps.

🍞 Hook: Imagine learning from many homework sheets.

đŸ„Ź The Concept (Data Scale): Data scale is how much SFT data you have. How it works: (1) More examples teach more patterns; (2) The model’s base grows broader; (3) RL has more to build on. Why it matters: Too little data makes the base narrow and lowers the final ceiling.

🍞 Anchor: Training with 889K examples usually beats training with 1K, even if the 1K are great.

🍞 Hook: Think of starting with easy puzzles and moving to harder ones.

đŸ„Ź The Concept (Trajectory Difficulty): It’s how hard the training examples are. How it works: (1) Harder items push the model’s limits; (2) It learns deeper strategies; (3) RL gains multiply on top of that. Why it matters: If everything is too easy, the model stalls early.

🍞 Anchor: A mix with more challenging math problems often leads to better final scores.

Step 2: Supervised Fine-Tuning (SFT)

  • What happens: Train the model to imitate expert steps token by token across many examples.
  • Why this step exists: It builds a solid, general foundation—like muscle memory for solving.
  • Example with data: On a big set (e.g., ~889K math problems), run SFT for multiple epochs, saving checkpoints regularly.

🍞 Hook: Like watching your practice test score each week.

đŸ„Ź The Concept (Validation Loss and Sub-phases): Validation loss measures how well you imitate on held-out data. How it works: (1) Track the curve over time; (2) Stable means lowest loss (within ~2% tolerance); (3) Mild Overfitting means a small rise (under ~10%); (4) Severe Overfitting means a big rise (10%+). Why it matters: This tells you when to stop SFT and start RL.

🍞 Anchor: When your practice score stops improving and just wiggles, it’s time to change tactics.
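
A tiny helper for labeling these sub-phases from a validation-loss reading might look like the sketch below. The ~2% and ~10% tolerances follow the description above; the function name and interface are illustrative, not the paper's code.

```python
# Sketch: classify an SFT checkpoint's sub-phase from its validation loss,
# using the ~2% / ~10% tolerances described above. The thresholds follow the
# text's rough guide; the helper itself is illustrative.

def sft_phase(current_loss, min_loss, stable_tol=0.02, mild_tol=0.10):
    rise = (current_loss - min_loss) / min_loss   # relative rise above the minimum
    if rise <= stable_tol:
        return "stable"             # at or near the minimum: good time to switch to RL
    if rise <= mild_tol:
        return "mild_overfitting"   # small rise: still an acceptable switch point
    return "severe_overfitting"     # large rise: switching now hurts the final ceiling

# Example: minimum loss 0.80, current loss 0.85 -> ~6% rise -> "mild_overfitting"
print(sft_phase(0.85, 0.80))
```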

Step 3: Choose the switch time (SFT → RL)

  • What happens: Pick the SFT checkpoint at Stable or Mild Overfitting.
  • Why this step exists: Switching too early leaves basics weak; too late makes the model rigid, shrinking RL’s effect.
  • Example: On the largest dataset, best results came from switching when validation loss had flattened; on smaller/easier datasets, Mild Overfitting also worked well.

Step 4: Reinforcement Learning (RL)

  ‱ What happens: The model generates answers; a rule checks correctness (reward = 1 for correct, 0 for incorrect); the model updates to make rewarded answers more likely. Algorithms like DAPO (with a robust variant, DAPOdc) stabilize training for math. A minimal reward sketch follows this list.
  • Why this step exists: RL turns remaining headroom (plasticity) into real gains.
  • Example: Given math prompts, sample multiple solutions per prompt, grade them automatically (Math-Verify), and update the model to favor correct chains of thought.
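
Below is a minimal sketch of the binary reward this step describes. The paper grades answers with an automatic checker (Math-Verify); the extract_final_answer helper and plain string comparison here are stand-ins for such a verifier, not that library's API.

```python
# Sketch of the binary reward described above: 1.0 if a sampled solution's
# final answer matches the reference, else 0.0. The answer extraction and
# string comparison stand in for a real verifier such as Math-Verify.

def extract_final_answer(solution_text: str) -> str:
    # Toy convention: assume the final line of the solution carries the answer.
    return solution_text.strip().splitlines()[-1].strip()

def binary_reward(sampled_solution: str, reference_answer: str) -> float:
    return 1.0 if extract_final_answer(sampled_solution) == reference_answer.strip() else 0.0

# Several sampled solutions for one prompt, graded automatically.
samples = ["Step 1: ...\nStep 2: ...\n42", "I think the answer is\n41"]
print([binary_reward(s, "42") for s in samples])  # [1.0, 0.0]
```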

Step 5: Avoid synchronized pitfalls

  • What happens: The paper also tests mixed (synchronized) SFT-RL methods (e.g., UPT, LUFFY, SRFT). They often show fast early gains but plateau early or show instability under larger, harder datasets.
  • Why this step exists: Establish a fair baseline: sequential SFT-then-RL is more stable and reaches a higher ceiling.
  • Example: On a general model, synchronized methods quickly flatten below the sequential approach; on a math-specialized starting model, they look briefly efficient, but still don’t reach the sequential ceiling.

Step 6: Measure and fit the ceiling

  • What happens: Track performance as compute increases, and fit a curve that shows the maximum reachable level (ceiling) and how much headroom was converted.
  • Why this step exists: It tells you if you’re close to done or still have meaningful gains ahead.
  • Example: Pure RL topped out in the low 70s (like a solid B), while SFT-then-RL reached the high 70s (like an A- to A), on math evaluations.

The Secret Sauce:

  • A simple, measurable switch rule: use minimum SFT validation loss (Stable or Mild Overfitting) as your “ready” signal.
  • Scale first, difficulty next: Big SFT data raises the ceiling; harder examples multiply gains, especially when data is limited.
  • Keep phases clean: Separate SFT and RL for stability and higher final performance.

04 Experiments & Results

The Test: The team evaluated math reasoning across cleaned benchmarks (GSM8K, OlympiadBench, Minerva, MATH, AIME24/25). They measured pass@1 accuracy and, for small sets like AIME, averaged across multiple samples for stability. They also tracked compute used (FLOPs) and fitted curves to estimate the ultimate performance ceiling after training.
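
For concreteness, here is a minimal sketch of pass@1 averaged over several samples per problem, as done for small sets like AIME. The helper name and the numbers are illustrative, not the paper's evaluation code.

```python
# Sketch: pass@1 averaged over several samples per problem. `graded` maps each
# problem to a list of 0/1 outcomes for its sampled answers; the numbers are
# illustrative, not results from the paper.

def averaged_pass_at_1(graded: dict) -> float:
    # Per problem: fraction of samples that are correct; then average across
    # problems. With one sample per problem this reduces to plain pass@1.
    per_problem = [sum(outcomes) / len(outcomes) for outcomes in graded.values()]
    return sum(per_problem) / len(per_problem)

graded = {
    "aime_p1": [1, 0, 1, 1],  # 3 of 4 samples correct
    "aime_p2": [0, 0, 0, 1],  # 1 of 4 samples correct
}
print(averaged_pass_at_1(graded))  # 0.5
```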

The Competition: Four contenders lined up—(1) Pure SFT (imitation only), (2) Pure RL (GRPO, DAPO), (3) Synchronized SFT-RL (UPT, LUFFY, SRFT), and (4) Sequential SFT-then-RL (SFT → RL). Everyone trained on matched math-style data; synchronized methods got the same on-policy/off-policy mix; sequential methods used large SFT then RL.

Scoreboard with Context:

  • Pure RL: Quick early jumps but topped out around 74.3 points (think: strong B). Efficient at first, limited height later.
  ‱ Best Synchronized (LUFFY): Ceiling of about 72.7 (a middling B). Efficient early on, but unstable and with a lower final ceiling on general models.
  • Pure SFT: Slow and steady to around 76.9 (B+ to A-). Stronger than Pure RL or synchronized methods but still not the top.
  • Sequential SFT-then-RL: Best performer at roughly 78.1 on Qwen2.5-7B (firm A-/A). First build a great base with SFT, then climb higher with RL.

Surprising Findings:

  • “Less is More” didn’t hold here. A small, high-quality SFT set (like ~1K examples) reached decent scores quickly but hit a low ceiling and left little room for RL to help. Big SFT sets (hundreds of thousands) reached higher bases and left more plasticity for RL, creating a much taller final tower.
  • Harder data paid off. At the same scale (≈102K), using harder trajectories raised both SFT performance and RL plasticity—like doing tougher drills that later make game day easier.
  • Minimum validation loss was a powerful compass. Datasets/checkpoints with lower SFT validation loss reliably led to higher final ceilings after RL, so you can predict which setup will win before spending on RL.

Cross-Model Check (Llama3.2-3B): With much smaller compute budgets (capped RL), the same story repeated: the sequential SFT-then-RL pipeline won big, switching at the Stable SFT phase worked best, scale beat small fancy sets, and harder examples helped but couldn’t replace scale.

Bottom line: The winning recipe was clear—SFT until validation loss stabilizes or just begins a mild rise, then RL. Scale your SFT data first; add difficulty to multiply gains; and use minimum validation loss to choose checkpoints and datasets before you spend on RL.

05 Discussion & Limitations

Limitations:

  • Domain scope: The strongest tests are in mathematical reasoning; results likely generalize but should be validated in other domains (coding, science QA, multi-turn planning).
  • Reward design: Using simple correctness rewards works for math, but richer tasks may need more nuanced or partial-credit rewards.
  • Overfitting boundaries: While Stable/Mild Overfitting is a good switch window, exact thresholds can vary with model, domain, and noise in validation data.
  • Synchronization tuning: Synchronized SFT-RL might improve with better stabilizers, but current results show fragility at scale.

Required Resources:

  • Data: Large, high-quality SFT trajectories are crucial; filtering and deduplicating matter.
  • Compute: SFT over large corpora plus RL sampling requires notable FLOPs; careful monitoring avoids waste.
  • Tooling: Reliable validation sets and automatic verifiers (for rewards) make the pipeline feasible.

When NOT to Use:

  • Extremely tiny compute budgets: If you cannot afford enough SFT to reach Stable, RL won’t add much and might destabilize.
  • Very sparse or noisy reward tasks without a good verifier: RL learning signal may be too weak or misleading.
  • Non-reasoning tasks where imitation already reaches the ceiling: RL may be unnecessary.

Open Questions:

  • Can synchronized methods be made as stable and high-ceiling as sequential ones with new regularizers or curricula?
  • What’s the best way to design partial-credit rewards for multi-step reasoning beyond binary correctness?
  • How do these rules extend to multi-turn agents and tool use where feedback is delayed or noisy?
  • Can we automate the SFT→RL switch with learned detectors beyond validation loss (e.g., plasticity probes)?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces the Plasticity-Ceiling Framework, which splits final performance into the SFT foundation you already achieved and the RL headroom you can still claim. Across extensive tests, the simple, stable winner is sequential SFT-then-RL—switching at SFT’s Stable (or Mild Overfitting) point, using big SFT datasets, and preferring harder examples to multiply gains. Minimum SFT validation loss emerges as a cheap, predictive indicator of the ultimate performance you can reach.

Main achievement: Turning expert trajectory usage from guesswork into a predictable, data-driven recipe: scale SFT first, watch validation loss to time the switch, and let RL harvest the remaining plasticity.

Future directions: Stabilize synchronized methods at scale, design richer reward functions that capture partial correctness, extend the framework to multi-turn tools and agents, and develop automatic switch controllers trained to detect plasticity peaks.

Why remember this: Because it gives you a clear, reliable way to build better reasoning models—first lay a broad foundation with lots of good examples, then switch to rewards right when the practice plateaus, and watch your ceiling rise.

Practical Applications

  ‱ Train a new reasoning model by first scaling SFT on a large dataset, then switching to RL when SFT validation loss stabilizes.
  ‱ Use minimum SFT validation loss to pick the best checkpoint for RL instead of guessing based on SFT accuracy alone.
  ‱ Prioritize collecting more SFT trajectories before curating extreme difficulty; add harder examples as a multiplier once scale is adequate.
  ‱ Avoid synchronized SFT-RL for large, hard datasets unless you have strong stabilizers; prefer the sequential pipeline.
  ‱ On small/easy datasets, allow a short Mild Overfitting phase before switching to RL; on large/high-quality data, switch at Stable.
  ‱ Automate a “switch-to-RL” trigger that monitors validation loss minima and tolerance bands (Stable/Mild).
  ‱ Budget compute by estimating the expected ceiling: if plasticity is small, stop early; if large, invest in more RL steps.
  ‱ For math or coding, use binary (or graded) verifiers to supply reliable RL rewards with minimal manual labeling.
  ‱ When adapting smaller models, emphasize reaching SFT saturation even more, since early RL may not help and can regress.
  ‱ Track both foundation (P_sft) and headroom (PL_rl) dashboards to guide training decisions and detect overfitting risk.
#Supervised Fine-Tuning · #Reinforcement Learning · #Expert Trajectories · #Plasticity-Ceiling Framework · #Validation Loss · #Performance Ceiling · #Plasticity · #Data Scale · #Trajectory Difficulty · #Sequential SFT-then-RL · #Synchronized SFT-RL · #Math Reasoning · #GRPO · #DAPO · #Scaling Laws