Rethinking Expert Trajectory Utilization in LLM Post-training
Key Summary
- The paper asks how to best use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.
- It introduces the Plasticity-Ceiling Framework: think of final skill as "how good you are now (from SFT)" plus "how much room you still have to grow with RL."
- The authors test many training styles and find a clear winner: do Supervised Fine-Tuning (SFT) first, then do Reinforcement Learning (RL), sequentially, not mixed together.
- They show the best time to switch from SFT to RL is when SFT's validation loss has flattened (Stable) or is just barely rising (Mild Overfitting), but not later (Severe Overfitting).
- They refute the idea that "less is more" for data in this setting: bigger SFT datasets raise the ultimate performance ceiling; harder examples act like a multiplier on top.
- Minimum SFT validation loss is a strong, cheap-to-measure predictor of how high your final performance ceiling can be after RL.
- On math benchmarks, the sequential SFT-then-RL approach beats Pure RL and Synchronized SFT-RL (mixed) methods by a notable margin.
- These findings give simple, actionable rules for when to switch to RL and how to choose and scale SFT data to get the most from expert trajectories.
Why This Research Matters
Reasoning-heavy applications (math tutors, coding copilots, science helpers) need models that not only talk but also solve multi-step problems reliably. This paper gives a simple, tested recipe to get there: build a broad SFT base, then switch to RL at the right moment. It shows that bigger SFT datasets raise your ultimate ceiling, and that slightly harder examples multiply gains, so teams can plan data collection wisely. With minimum validation loss as a predictor, labs can choose the best checkpoints and datasets before spending expensive RL compute. The result is better performance with fewer training runs, less instability, and clearer roadmaps for future improvements.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how you first learn to ride a bike with training wheels, and only after you're steady do you try riding without them and maybe even learn some tricks? If you skip the training wheels or take them off too soon, you wobble and fall.
The Concept (Large Language Models, LLMs): LLMs are computer programs that learn to read, write, and reason from lots of text. How it works: (1) Pretrain on tons of text to learn general language patterns; (2) Post-train to sharpen skills for specific tasks; (3) Use careful practice to reach higher reasoning ability. Why it matters: Without this second stage of training, models can talk, but they struggle to solve tricky problems step by step.
Anchor: Think of an LLM as a student who read a whole library (pretraining) but still needs classes and homework (post-training) to ace math contests.
Hook: Imagine learning math from a solution key. You copy how experts solve problems before trying your own strategies.
The Concept (Supervised Fine-Tuning, SFT): SFT is when a model studies examples with correct, step-by-step solutions (expert trajectories). How it works: (1) Show the prompt and the expert's full solution; (2) The model imitates each step; (3) Repeat across many examples to build strong habits. Why it matters: Without SFT, the model lacks solid, reliable ways to start solving hard problems.
Anchor: It's like practicing long division by following your teacher's worked examples until the steps feel natural.
Hook: Think about training a puppy: do the trick, get a treat; do it wrong, no treat.
The Concept (Reinforcement Learning, RL): RL teaches a model by rewarding good answers and not rewarding bad ones. How it works: (1) The model tries different steps; (2) A rule checks correctness and gives a reward; (3) The model changes to favor moves that earn rewards. Why it matters: Without RL, the model can imitate but struggles to explore new or smarter paths.
Anchor: It's like trying different puzzle moves and keeping the ones that get you closer to the solution.
Hook: Picture trying to read the textbook and play a graded quiz game at the same time; it's easy to get confused.
The Concept (Synchronized SFT-RL): This approach mixes imitation (SFT) and reward-based practice (RL) in one loop. How it works: (1) Train on expert steps; (2) Also generate your own steps; (3) Blend both signals to update the model together. Why it matters: If not handled carefully, this can become unstable, especially with big, hard datasets, causing the model to wobble instead of improve.
Anchor: It's like flipping quickly between a teacher's notes and a live test; you might not learn either one well.
Hook: Think of two phases: first learn the safe way with guidance, then practice freely to master tricks.
The Concept (Sequential SFT-then-RL): Train in two clean stages: first SFT to build a foundation, then RL to push further. How it works: (1) Do SFT until the learning stabilizes; (2) Switch to RL to explore and improve; (3) Stop when gains level off. Why it matters: Mixing signals too early or too late either hurts stability or wastes potential.
Anchor: Like learning to play piano: first master scales with a teacher, then experiment with jazz improvisation.
The World Before: Researchers knew we needed both SFT and RL to turn chatty models into strong reasoners. But no one had a clear, reliable recipe for when to switch from SFT to RL, how much SFT data to use, or whether to train both at once. Some teams tried synchronized training and reported quick wins, but mostly on small datasets. Others used simple two-stage pipelines that worked in practice, but with rules-of-thumb rather than firm guidance.
The Problem: We lacked a theory that explains how expert trajectories (SFT data) should be used to maximize the final, after-RL performance, and we lacked solid rules for timing the SFT→RL switch and for choosing data scale and difficulty.
Failed Attempts: (1) Pure SFT: Imitates well but stalls; (2) Pure RL: Explores fast but hits a low ceiling without a strong base; (3) Synchronized SFT-RL: Can look efficient early, but tends to be unstable or top out early on large, harder datasets.
The Gap: A unifying way to measure "how high can we go" (the ceiling) and "how much room is left to grow" (plasticity) was missing, along with practical, measurable signals to guide switching time and data choices.
Real Stakes: This matters for math tutors, coding copilots, scientific assistants, and any app where careful, step-by-step reasoning is crucial. Better rules save compute, reduce instability, and produce models that solve harder problems more reliably.
02 Core Idea
Hook: Imagine building a tower. The first floors must be strong (foundation), and then you decide how many more floors you can safely add (headroom). If the base is weak, you can't go very high.
The Concept (Aha!): Final performance = solid foundation from SFT + remaining headroom that RL can still turn into gains. How it works: (1) Use SFT to lock in dependable skills; (2) Measure how much "growth space" is left (plasticity); (3) Spend RL compute to convert that space into real performance; (4) Time the switch so you get a strong base without crushing the room to grow. Why it matters: If you switch too early, the base is shaky; too late, you lose the chance to grow higher.
Anchor: It's like baking: first make a sturdy cake layer, then add frosting and decorations. Bad base? Your cake collapses. Too much fiddling with the base? You run out of time to decorate.
Three Analogies:
- Sports: Warm-up drills (SFT) teach basic moves; scrimmages with scoring (RL) push strategy. Start scrimmaging after you're steady, or the team won't improve much.
- Music: Practice scales with a teacher (SFT), then improvise and get audience feedback (RL). Start improvising once scales are solid.
- Gardening: Build healthy roots (SFT) before pushing for blooms with special feed (RL). If roots are weak or overwatered, the plant won't flourish.
Hook: You know how some jars are already almost full, and there's just a little space left at the top?
The Concept (Plasticity): Plasticity is how much improvement room is left after SFT for RL to use. How it works: (1) Start from SFT performance; (2) Estimate how far you could still go; (3) RL climbs that remaining ladder. Why it matters: If SFT leaves tiny room, RL can't add much no matter how hard it tries.
Anchor: If your spelling is already near-perfect, extra practice won't raise your grade much; that's low plasticity.
Hook: Picture a height limit sign at an amusement park ride.
The Concept (Performance Ceiling): The ceiling is the highest performance you can reach with lots of compute. How it works: (1) Track performance as you spend compute; (2) Fit a curve that levels off; (3) The top of that curve is the ceiling. Why it matters: Chasing gains past the ceiling wastes time and energy.
Anchor: It's like knowing your test score can't go above 100: once you're close, more cramming won't help much.
Hook: Imagine a dashboard light that tells you, "Okay, you're ready to switch!"
The Concept (Minimum SFT Validation Loss): This is the lowest error your model reaches on a held-out SFT set. How it works: (1) Train with SFT; (2) Watch the validation loss curve; (3) When it bottoms out (Stable) or just barely rises (Mild Overfitting), you're at the best time to switch to RL. Why it matters: Switching too soon or too late hurts the final ceiling.
Anchor: Like shooting hoops: once your practice shots are consistently accurate and no longer improving, it's time to play a real game to grow further.
Before vs After:
- Before: People guessed when to switch, debated mixing SFT with RL, and argued whether tiny, high-quality datasets were enough.
- After: We have a map: (1) Do SFT to Stable or Mild Overfitting, then switch; (2) Use big SFT datasets to raise the ceiling; (3) Prefer harder examples to multiply gains; (4) Use minimum validation loss as a reliable predictor.
Why It Works (Intuition):
- SFT builds dependable, reusable skills that RL can trust.
- If SFT is too short, RL wastes time fixing basics instead of improving strategy.
- If SFT goes too long into overfitting, the model becomes rigid, and RL can't reshape it easily (plasticity shrinks).
- Big, diverse SFT data teaches broader skills and leaves more useful headroom for RL to climb.
Building Blocks:
- Foundation (P_sft): The performance after SFT. Stronger P_sft → better starting point.
- Headroom (PL_rl): A measure of how much RL can still add. Bigger PL_rl → more potential.
- Ceiling (A_post): The final reachable level after both stages (written out as a short formula after this list).
- Switch Signal: Minimum validation loss tells you when SFT is "done enough."
- Data Knobs: Scale drives the main gains; difficulty is a multiplier, especially when scale is limited.
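In plain notation, the framework's claim is just an addition, shown below using the shorthand names from this summary (A_post, P_sft, PL_rl); these symbols are our rendering of the idea, not a formula copied verbatim from the paper.

```latex
% Plasticity-Ceiling decomposition (shorthand for the idea above)
%   A_post : final reachable performance after SFT + RL (the ceiling)
%   P_sft  : foundation, the performance already locked in by SFT
%   PL_rl  : plasticity, the headroom RL can still convert into gains
A_{\text{post}} = P_{\text{sft}} + PL_{\text{rl}}
```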
03 Methodology
High-level overview: Input (prompts + expert step-by-step solutions) → SFT (learn by imitation) → Monitor validation loss (find Stable/Mild Overfitting) → Switch to RL (learn by reward) → Output (higher final reasoning performance).
Step 1: Prepare data and models
- What happens: Collect expert trajectories (prompt plus worked solution) for SFT. Prepare RL prompts for reward-based practice. Use a strong base model (e.g., Qwen2.5-7B) and a cross-check smaller model (Llama3.2-3B).
- Why this step exists: Without clean expert data, SFT can't build reliable habits; without appropriate RL prompts, rewards won't teach the right skills.
- Example: For math, gather hundreds of thousands of high-quality, step-by-step solutions distilled from a powerful teacher model and filter benchmark overlaps.
Hook: Imagine learning from many homework sheets.
The Concept (Data Scale): Data scale is how much SFT data you have. How it works: (1) More examples teach more patterns; (2) The model's base grows broader; (3) RL has more to build on. Why it matters: Too little data makes the base narrow and lowers the final ceiling.
Anchor: Training with 889K examples usually beats training with 1K, even if the 1K are great.
Hook: Think of starting with easy puzzles and moving to harder ones.
The Concept (Trajectory Difficulty): It's how hard the training examples are. How it works: (1) Harder items push the model's limits; (2) It learns deeper strategies; (3) RL gains multiply on top of that. Why it matters: If everything is too easy, the model stalls early.
Anchor: A mix with more challenging math problems often leads to better final scores.
Step 2: Supervised Fine-Tuning (SFT)
- What happens: Train the model to imitate expert steps token by token across many examples.
- Why this step exists: It builds a solid, general foundation, like muscle memory for solving.
- Example with data: On a big set (e.g., ~889K math problems), run SFT for multiple epochs, saving checkpoints regularly.
Hook: Like watching your practice test score each week.
The Concept (Validation Loss and Sub-phases): Validation loss measures how well you imitate on held-out data. How it works: (1) Track the curve over time; (2) Stable means lowest loss (within ~2% tolerance); (3) Mild Overfitting means a small rise (under ~10%); (4) Severe Overfitting means a big rise (10%+). Why it matters: This tells you when to stop SFT and start RL (a small phase-detector sketch follows below).
Anchor: When your practice score stops improving and just wiggles, it's time to change tactics.
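To make the switch rule concrete, here is a minimal sketch of a phase detector that applies the tolerance bands above (within ~2% of the minimum for Stable, a relative rise under ~10% for Mild Overfitting) to a validation-loss history. The function name, interface, and exact thresholds are illustrative assumptions, not code from the paper.

```python
def classify_sft_phase(val_losses, stable_tol=0.02, mild_tol=0.10):
    """Label the current SFT sub-phase from a validation-loss history.

    Bands follow the rough tolerances described above: within ~2% of the
    minimum -> stable, a relative rise under ~10% -> mild overfitting,
    anything larger -> severe overfitting.
    """
    min_loss = min(val_losses)
    rise = (val_losses[-1] - min_loss) / min_loss  # relative rise above the minimum
    if rise <= stable_tol:
        return "stable"              # good time to switch to RL
    if rise <= mild_tol:
        return "mild_overfitting"    # still an acceptable switch window
    return "severe_overfitting"      # too late; plasticity has shrunk

# Example: loss flattened, then ticked up slightly -> mild overfitting
print(classify_sft_phase([0.92, 0.81, 0.78, 0.80]))
```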
Step 3: Choose the switch time (SFT → RL)
- What happens: Pick the SFT checkpoint at Stable or Mild Overfitting.
- Why this step exists: Switching too early leaves basics weak; too late makes the model rigid, shrinking RL's effect.
- Example: On the largest dataset, best results came from switching when validation loss had flattened; on smaller/easier datasets, Mild Overfitting also worked well.
Step 4: Reinforcement Learning (RL)
- What happens: The model generates answers; a rule checks correctness (reward = 1 for correct, 0 for incorrect); the model updates to make rewarded answers more likely. Algorithms like DAPO (with a robust variant DAPOdc) stabilize training for math. A minimal reward sketch follows this step.
- Why this step exists: RL turns remaining headroom (plasticity) into real gains.
- Example: Given math prompts, sample multiple solutions per prompt, grade them automatically (Math-Verify), and update the model to favor correct chains of thought.
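Below is a minimal sketch of the binary reward rule described in this step. The `is_correct` callback stands in for an automatic verifier such as Math-Verify; the wrapper, its names, and the toy checker are illustrative, not the authors' implementation.

```python
from typing import Callable, List

def binary_rewards(
    samples: List[str],
    reference: str,
    is_correct: Callable[[str, str], bool],
) -> List[float]:
    """Assign 1.0 to sampled solutions whose final answer the verifier
    accepts and 0.0 otherwise, matching the rule-based reward above."""
    return [1.0 if is_correct(sample, reference) else 0.0 for sample in samples]

# Toy usage with a naive string-match "verifier" (a real setup would use
# an answer parser/checker such as Math-Verify)
naive_check = lambda solution, ref: solution.strip().endswith(ref)
print(binary_rewards(["... so the answer is 42", "... the answer is 41"],
                     "42", naive_check))  # [1.0, 0.0]
```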
Step 5: Avoid synchronized pitfalls
- What happens: The paper also tests mixed (synchronized) SFT-RL methods (e.g., UPT, LUFFY, SRFT). They often show fast early gains but then plateau or become unstable on larger, harder datasets.
- Why this step exists: Establish a fair baseline: sequential SFT-then-RL is more stable and reaches a higher ceiling.
- Example: On a general model, synchronized methods quickly flatten below the sequential approach; on a math-specialized starting model, they look briefly efficient, but still don't reach the sequential ceiling.
Step 6: Measure and fit the ceiling
- What happens: Track performance as compute increases, and fit a curve that shows the maximum reachable level (ceiling) and how much headroom was converted.
- Why this step exists: It tells you if you're close to done or still have meaningful gains ahead.
- Example: Pure RL topped out in the low 70s (like a solid B), while SFT-then-RL reached the high 70s (like an A- to A) on math evaluations; a toy curve-fitting sketch follows this step.
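As a toy illustration of how a ceiling can be read off such a curve, the sketch below fits a saturating function to made-up accuracy-versus-compute points with SciPy and reports the asymptote. The functional form, the numbers, and the parameter names are assumptions for illustration; the paper's exact fitting procedure may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(compute, ceiling, scale, rate):
    """Accuracy rises with compute and levels off at `ceiling`."""
    return ceiling - scale * np.exp(-rate * compute)

# Hypothetical (normalized compute, accuracy) points, for illustration only
compute = np.array([0.1, 0.3, 0.6, 1.0, 1.5, 2.0])
accuracy = np.array([55.0, 63.0, 69.0, 73.0, 75.5, 76.5])

params, _ = curve_fit(saturating, compute, accuracy, p0=[80.0, 30.0, 1.0])
print(f"estimated performance ceiling ~ {params[0]:.1f}")
```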
The Secret Sauce:
- A simple, measurable switch rule: use minimum SFT validation loss (Stable or Mild Overfitting) as your "ready" signal.
- Scale first, difficulty next: Big SFT data raises the ceiling; harder examples multiply gains, especially when data is limited.
- Keep phases clean: Separate SFT and RL for stability and higher final performance.
04 Experiments & Results
The Test: The team evaluated math reasoning across cleaned benchmarks (GSM8K, OlympiadBench, Minerva, MATH, AIME24/25). They measured pass@1 accuracy and, for small sets like AIME, averaged across multiple samples for stability. They also tracked compute used (FLOPs) and fitted curves to estimate the ultimate performance ceiling after training.
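For small benchmarks like AIME, the averaged pass@1 described above can be computed as in this minimal sketch; the helper and its inputs are illustrative, not the authors' evaluation harness.

```python
from statistics import mean
from typing import List

def averaged_pass_at_1(correctness: List[List[bool]]) -> float:
    """Average pass@1 over several samples per problem: `correctness`
    holds one inner list of per-sample verdicts for each problem."""
    per_problem = [mean(1.0 if ok else 0.0 for ok in samples)
                   for samples in correctness]
    return mean(per_problem)

# Two problems, four samples each: 3/4 and 1/4 correct -> 0.5 overall
print(averaged_pass_at_1([[True, True, True, False],
                          [False, True, False, False]]))  # 0.5
```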
The Competition: Four contenders lined up: (1) Pure SFT (imitation only), (2) Pure RL (GRPO, DAPO), (3) Synchronized SFT-RL (UPT, LUFFY, SRFT), and (4) Sequential SFT-then-RL (SFT → RL). Everyone trained on matched math-style data; synchronized methods got the same on-policy/off-policy mix; sequential methods used large SFT then RL.
Scoreboard with Context:
- Pure RL: Quick early jumps but topped out around 74.3 points (think: strong B). Efficient at first, limited height later.
- Best Synchronized (LUFFY): About 72.7 ceiling (medium B). Early efficiency, but instability and lower final ceiling on general models.
- Pure SFT: Slow and steady to around 76.9 (B+ to A-). Stronger than Pure RL or synchronized methods but still not the top.
- Sequential SFT-then-RL: Best performer at roughly 78.1 on Qwen2.5-7B (firm A-/A). First build a great base with SFT, then climb higher with RL.
Surprising Findings:
- "Less is More" didn't hold here. A small, high-quality SFT set (like ~1K examples) reached decent scores quickly but hit a low ceiling and left little room for RL to help. Big SFT sets (hundreds of thousands) reached higher bases and left more plasticity for RL, creating a much taller final tower.
- Harder data paid off. At the same scale (~102K), using harder trajectories raised both SFT performance and RL plasticity, like doing tougher drills that later make game day easier.
- Minimum validation loss was a powerful compass. Datasets/checkpoints with lower SFT validation loss reliably led to higher final ceilings after RL, so you can predict which setup will win before spending on RL.
Cross-Model Check (Llama3.2-3B): With much smaller compute budgets (capped RL), the same story repeated: the sequential SFT-then-RL pipeline won big, switching at the Stable SFT phase worked best, scale beat small fancy sets, and harder examples helped but couldnât replace scale.
Bottom line: The winning recipe was clear: SFT until validation loss stabilizes or just begins a mild rise, then RL. Scale your SFT data first; add difficulty to multiply gains; and use minimum validation loss to choose checkpoints and datasets before you spend on RL.
05 Discussion & Limitations
Limitations:
- Domain scope: The strongest tests are in mathematical reasoning; results likely generalize but should be validated in other domains (coding, science QA, multi-turn planning).
- Reward design: Using simple correctness rewards works for math, but richer tasks may need more nuanced or partial-credit rewards.
- Overfitting boundaries: While Stable/Mild Overfitting is a good switch window, exact thresholds can vary with model, domain, and noise in validation data.
- Synchronization tuning: Synchronized SFT-RL might improve with better stabilizers, but current results show fragility at scale.
Required Resources:
- Data: Large, high-quality SFT trajectories are crucial; filtering and deduplicating matter.
- Compute: SFT over large corpora plus RL sampling requires notable FLOPs; careful monitoring avoids waste.
- Tooling: Reliable validation sets and automatic verifiers (for rewards) make the pipeline feasible.
When NOT to Use:
- Extremely tiny compute budgets: If you cannot afford enough SFT to reach Stable, RL won't add much and might destabilize.
- Very sparse or noisy reward tasks without a good verifier: RL learning signal may be too weak or misleading.
- Non-reasoning tasks where imitation already reaches the ceiling: RL may be unnecessary.
Open Questions:
- Can synchronized methods be made as stable and high-ceiling as sequential ones with new regularizers or curricula?
- What's the best way to design partial-credit rewards for multi-step reasoning beyond binary correctness?
- How do these rules extend to multi-turn agents and tool use where feedback is delayed or noisy?
- Can we automate the SFT→RL switch with learned detectors beyond validation loss (e.g., plasticity probes)?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces the Plasticity-Ceiling Framework, which splits final performance into the SFT foundation you already achieved and the RL headroom you can still claim. Across extensive tests, the simple, stable winner is sequential SFT-then-RL: switching at SFT's Stable (or Mild Overfitting) point, using big SFT datasets, and preferring harder examples to multiply gains. Minimum SFT validation loss emerges as a cheap, predictive indicator of the ultimate performance you can reach.
Main achievement: Turning expert trajectory usage from guesswork into a predictable, data-driven recipe: scale SFT first, watch validation loss to time the switch, and let RL harvest the remaining plasticity.
Future directions: Stabilize synchronized methods at scale, design richer reward functions that capture partial correctness, extend the framework to multi-turn tools and agents, and develop automatic switch controllers trained to detect plasticity peaks.
Why remember this: Because it gives you a clear, reliable way to build better reasoning models: first lay a broad foundation with lots of good examples, then switch to rewards right when the practice plateaus, and watch your ceiling rise.
Practical Applications
- Train a new reasoning model by first scaling SFT on a large dataset, then switching to RL when SFT validation loss stabilizes.
- Use minimum SFT validation loss to pick the best checkpoint for RL instead of guessing based on SFT accuracy alone.
- Prioritize collecting more SFT trajectories before curating extreme difficulty; add harder examples as a multiplier once scale is adequate.
- Avoid synchronized SFT-RL for large, hard datasets unless you have strong stabilizers; prefer the sequential pipeline.
- On small/easy datasets, allow a short Mild Overfitting phase before switching to RL; on large/high-quality data, switch at Stable.
- Automate a "switch-to-RL" trigger that monitors validation loss minima and tolerance bands (Stable/Mild).
- Budget compute by estimating the expected ceiling: if plasticity is small, stop early; if large, invest in more RL steps.
- For math or coding, use binary (or graded) verifiers to supply reliable RL rewards with minimal manual labeling.
- When adapting smaller models, emphasize reaching SFT saturation even more, since early RL may not help and can regress.
- Track both foundation (P_sft) and headroom (PL_rl) dashboards to guide training decisions and detect overfitting risk.