Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Intermediate
Dylan Zhang, Yufeng Xu, Haojin Wang et al. · 2/1/2026
arXiv · PDF

Key Summary

  • The paper shows that a model that looks great after supervised fine-tuning (SFT) can actually do worse after the same reinforcement learning (RL) than a model that looked weaker at SFT time.
  • This happens because SFT learns from one distribution of examples (the behavior policy), while RL learns from the model’s own rollouts (the target policy).
  • The authors propose PEAR, a simple way to reweight the SFT loss so it better matches what the model will see during RL.
  • PEAR uses importance sampling with likelihood ratios to upweight helpful tokens and downweight misleading ones.
  • There are three variants: sequence-level, token-level (suffix-based), and block-level for stability.
  • PEAR is a plug-in to standard SFT or KL-distillation and adds little overhead after you compute probabilities on the offline data.
  • Across logic games and math benchmarks on Qwen and DeepSeek-distilled models, PEAR consistently improves post-RL results.
  • On AIME-2025, PEAR raises pass@8 by up to 14.6% after RL, and on logic games it can improve absolute accuracy by around 40% versus standard SFT initializations.
  • PEAR-initialized models align better with RL’s learning direction and drift less during RL, suggesting a smoother offline-to-online transition.

Why This Research Matters

PEAR helps turn offline practice into real, on-policy performance, which means better results with the same RL compute budget. For math tutors, coding assistants, and reasoning agents, this leads to more reliable multi-step reasoning and fewer dead-end attempts. Teams can pick SFT objectives that actually prepare for RL rather than just inflate offline scores, saving time and money. It also provides a clearer principle for dataset design: emphasize trajectories the model will revisit during RL. As reasoning LLMs are deployed in education, programming, and science, this translates to more accurate solutions and improved user trust.

Detailed Explanation

01 Background & Problem Definition

You know how you might study with a practice book, but the real test ends up being a bit different? Sometimes the way you practiced doesn’t prepare you for how the test is actually given. That’s been happening to AI models that learn to reason step by step. They usually go through two stages: first, supervised fine-tuning (SFT), where they copy from labeled examples; then reinforcement learning (RL), where they try, get feedback, and improve using their own attempts. For a while, people tried to make the SFT stage as strong as possible, hoping that a higher score after SFT would mean an even higher score after RL.

🍞 Hook: Imagine practicing basketball by only shooting from a taped spot on the floor. You might become amazing at that spot—but during the real game, you’ll be moving and shooting from many angles. If your practice doesn’t match the game, your game-day performance can suffer.

🥬 The Concept (Supervised Fine-Tuning, SFT): SFT is when the model learns by copying good answers from a prepared dataset. How it works: (1) Show the model a question and a correct solution; (2) Train it to predict each next word in that solution; (3) Repeat across many examples; (4) Measure accuracy on similar data. Why it matters: Without SFT, the model starts RL from a worse place; but if SFT is misaligned with what RL will see, the model can get stuck later.

🍞 Anchor: A student studies worked-out math solutions and learns how good ones look. That’s SFT.
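
To make this concrete, here is a minimal PyTorch sketch of the SFT objective: average next-token cross-entropy over the logged solution, with prompt tokens masked out. The tensor names and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, loss_mask):
    """Plain SFT: average next-token cross-entropy over response tokens.

    logits:     [batch, seq_len, vocab]  model predictions for each position
    target_ids: [batch, seq_len]         tokens the model should produce
    loss_mask:  [batch, seq_len]         1 for response tokens, 0 for prompt/padding
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    return (per_token * loss_mask).sum() / loss_mask.sum()
```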

🍞 Hook: Now think of RL like practicing the basketball game itself—dribbling, moving, deciding when to pass or shoot, and getting points for good choices.

🥬 The Concept (Reinforcement Learning, RL): RL is when the model improves by generating its own answers and getting reward feedback. How it works: (1) The model tries to solve a problem; (2) It gets a reward based on correctness; (3) It updates itself to make higher-reward decisions more likely; (4) It repeats this on-policy—sampling from its own current behavior. Why it matters: Without RL, models often plateau on complex reasoning; RL pushes them beyond copying into actively improving.

🍞 Anchor: The student now tackles fresh problems without answers, checks with a grader, and learns from the score.
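
Below is a bare-bones, REINFORCE-style sketch of that on-policy loop. It is only a toy illustration of "generate, score, reinforce", not the GRPO recipe the paper actually uses, and sample_fn / reward_fn are hypothetical helpers standing in for a sampler and a verifier.

```python
import torch

def toy_rl_step(model, optimizer, prompt_ids, sample_fn, reward_fn):
    """One toy on-policy update: sample a rollout, score it, reinforce it.

    sample_fn(model, prompt_ids) -> (token_ids, log_probs)  # model's own attempt (hypothetical helper)
    reward_fn(token_ids) -> float                            # e.g., 1.0 if the final answer checks out
    """
    token_ids, log_probs = sample_fn(model, prompt_ids)  # rollout from the current (target) policy
    reward = reward_fn(token_ids)                         # verifiable reward, as in the paper's tasks
    loss = -reward * log_probs.sum()                      # push up log-probs of rewarded rollouts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```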

Before this paper, many teams pushed SFT scores higher and higher, assuming this would give better starting points for RL. But the authors found something surprising: after running the exact same RL on different SFT checkpoints, the model that was best at SFT sometimes ended up worse after RL than a model that looked weaker at SFT time. In other words, offline rankings flipped.

🍞 Hook: You know how practicing only multiple-choice might not prepare you for a fill-in-the-blank test? The type of practice and the type of test don’t match.

🥬 The Concept (Distribution Mismatch): Distribution mismatch is when the data used for SFT (from a behavior policy that generated those examples) looks different from what the model produces during RL (from the target policy it is learning). How it works: (1) During SFT, the model is trained under prefixes and continuations from the behavior policy; (2) During RL, the model generates its own prefixes and continuations; (3) Small early differences snowball across long reasoning chains; (4) Some SFT-trained paths are rarely visited by the model during RL. Why it matters: If you train heavily on paths RL won’t revisit, you waste learning on places the model won’t go.

🍞 Anchor: Practicing word problems that assume a certain hint style won’t help much if your real exam never provides those hints.
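
A toy calculation (with made-up numbers) shows how quickly small per-token differences compound over a long reasoning chain:

```python
# Suppose the model agrees with a logged token 99% of the time. Over a
# 1,000-token solution, the chance it would reproduce that exact logged path is:
p_token_agreement = 0.99
seq_len = 1000
print(p_token_agreement ** seq_len)   # ~4.3e-05: RL will almost never revisit this path verbatim
```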

People tried several SFT tweaks: adding KL penalties to stay close to a base model, reweighting tokens by confidence, or masking easy/hard tokens. Some improved SFT scores. But the paper shows those offline gains don’t reliably translate to better after-RL performance. The missing piece was a way to make SFT care about how RL will actually explore and learn.

🍞 Hook: Imagine if your practice sessions counted extra for shots you’re likely to take in a real game and counted less for weird shots you’d almost never use.

🥬 The Concept (Importance Sampling + Log-Likelihood Ratios): Importance sampling is a way to give more weight to examples that match what you’ll see later. Log-likelihood ratios compare how likely a token (or sequence) is under the target policy vs. the behavior policy. How it works: (1) Compute, at each token, how much more/less likely the target policy would be to generate that token than the behavior policy; (2) Multiply these ratios across the future continuation to see if that path is plausible for the target policy; (3) Use this as a weight for the SFT loss. Why it matters: Without these weights, SFT treats all logged paths equally, even ones the model won’t follow during RL.

🍞 Anchor: When practicing, you give extra credit to plays you’ll actually run in a match and discount drills that won’t show up.
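
A toy example of the weight itself, using illustrative probabilities (the same 0.2 vs. 0.5 numbers reappear in the methodology walkthrough below):

```python
import math

# How likely is the logged token under each policy?
p_target, p_behavior = 0.2, 0.5
log_ratio = math.log(p_target) - math.log(p_behavior)  # ~ -0.92: the target policy finds it less likely
weight = math.exp(log_ratio)                            # = 0.4: this step's SFT loss gets downweighted
print(weight)

# For a whole continuation, the per-token log-ratios are summed (then exponentiated and clipped),
# so one weight reflects how plausible the entire remaining path is for the target policy.
```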

That’s the gap the paper fills. Their method, PEAR, reweights the SFT loss so that the model learns more from trajectories it’s likely to encounter during RL, thus preparing it for smoother and stronger on-policy improvement. The stakes are real: compute is expensive, and getting more out of the RL phase with the same budget can mean better reasoning models that are more reliable at math and logic tasks we care about in school, coding, and science.

02 Core Idea

🍞 Hook: Picture two maps for a treasure hunt. One map shows where someone else wandered (SFT data). The other map shows where you’re actually going to walk (RL rollouts). If you only practice the first map, you may not get better at following the second.

🥬 The Core Idea in One Sentence: Make SFT care about where RL will go by reweighting the SFT loss with importance weights that prefer logged continuations the model’s target policy would actually revisit.

Multiple Analogies:

  1. Sports practice: Weight drills that match real game plays higher; weight unusual drills lower.
  2. Cooking class: Practice recipes with steps and ingredients you’ll actually use in your kitchen setup, not just any fancy demo you watched once.
  3. Study guide: Spend more time on problem types you’ll face on test day; skim those that won’t appear.

🍞 Anchor: If your soccer team always counterattacks, you train counterattacks more than rare set plays.

🍞 Hook: You know how planning ahead helps you avoid dead-ends? We want SFT to plan ahead for RL’s on-policy world.

🥬 Why It Works (Intuition):

  • RL improves the model based on its own generated paths (target policy). If SFT overfits to paths the model won’t walk during RL, RL wastes time unlearning or correcting.
  • Importance sampling with likelihood ratios asks: “From this point onward, how likely is the rest of the logged solution under the target policy compared to the behavior policy?” If the answer is “not very,” downweight that token’s loss. If it’s “very,” upweight it. This focuses learning on prefixes and continuations that will matter later in RL.
  • Over long reasoning chains, tiny mismatches grow. Suffix-based weighting looks forward, not just at the current token, so it matches the long-horizon reality of RL.

🍞 Anchor: In a maze, if your usual path won’t take the left tunnel, practicing the left tunnel doesn’t help much—practice the corridors you’ll actually take.

Before vs. After:

  • Before: SFT optimized uniform token losses; higher offline accuracy often felt like a universal win, but didn’t guarantee a better RL outcome.
  • After: SFT is tuned to be RL-friendly: it rewards trajectories that the target policy will see again, making RL improvement faster and larger with the same budget.

🍞 Anchor: It’s like reorganizing your study time so the topics that show up on the real test get more hours.

Building Blocks (explained in order, each with the sandwich pattern):

  1. 🍞 Hook: Think of SFT as learning from a solutions manual. 🥬 The Concept (Supervised Fine-Tuning, SFT): Train the model to predict each next token of good answers. How: read prompt + correct continuation; minimize next-token loss; repeat. Why it matters: SFT gives a capable starting point. 🍞 Anchor: Copying a teacher’s worked example line-by-line.

  2. 🍞 Hook: RL is like playing the actual game. 🥬 The Concept (Reinforcement Learning, RL): Improve by trying and receiving reward. How: generate answers; score them; update probabilities toward higher-reward choices; repeat on-policy. Why it matters: RL turns static knowledge into active reasoning. 🍞 Anchor: Solving a fresh math problem and learning from the grade.

  3. 🍞 Hook: Practicing on a piano but performing on a guitar won’t match. 🥬 The Concept (Distribution Mismatch): SFT data (behavior policy) differs from RL rollouts (target policy). How: learn from logged paths vs. self-generated paths; small prefix shifts compound. Why it matters: Training on paths RL won’t use wastes compute and can slow or hurt RL. 🍞 Anchor: Studying multiple-choice but facing free-response.

  4. 🍞 Hook: Highlight the parts of your notes most likely to be on the exam. 🥬 The Concept (Importance Sampling): Weight examples by how representative they are of future use. How: compute ratios of target vs. behavior likelihood; upweight matches, downweight mismatches. Why it matters: It reduces mismatch and focuses learning where it pays off later. 🍞 Anchor: Spend extra study time on frequently tested topics.

  5. 🍞 Hook: Compare two weather reports to decide which is more believable. 🥬 The Concept (Log-Likelihood Ratios): Compare how likely a token (or sequence) is under target vs. behavior policy. How: subtract log-probabilities; sum across continuations; exponentiate if needed; clip for stability. Why it matters: This becomes the weight signal. 🍞 Anchor: If forecast A says 90% rain and B says 10%, the ratio tells you whom to trust.

  6. 🍞 Hook: Not all mistakes deserve equal practice time. 🥬 The Concept (Loss Reweighting): Multiply the normal SFT loss by an importance weight so some tokens count more. How: compute weights; stabilize; stop gradients through weights; apply to per-token loss. Why it matters: It refocuses SFT toward RL-relevant portions of data. 🍞 Anchor: Grading homework where trickier, more relevant questions count extra.

  7. 🍞 Hook: Look ahead in the book to see if your current page leads somewhere useful. 🥬 The Concept (Suffix-based Weighting): Weight a token by how plausible the rest of the sequence is under the target policy. How: multiply ratios from t+1 to T; optionally discount by gamma; clip. Why it matters: It handles long-horizon effects central to reasoning. 🍞 Anchor: If the rest of the solution is unlikely, don’t spend much practice on this branch.

  8. 🍞 Hook: Sometimes you study in sentences, not words. 🥬 The Concept (Block-level Weighting): Give the same weight to a chunk of tokens to reduce variance. How: partition into blocks; compute within-block and future-suffix ratios; assign stable weights per block. Why it matters: It balances precision and stability. 🍞 Anchor: Grade a paragraph as a whole when word-by-word grading is too noisy.

  9. 🍞 Hook: A whole story’s tone can guide how you treat each sentence. 🥬 The Concept (Sequence-level Weighting): Weight every token by the full sequence ratio. How: product of ratios over the entire trajectory; clip; apply. Why it matters: It’s simple and often strong in practice. 🍞 Anchor: If a whole solution fits your style, practice it more overall.

03 Methodology

At a high level: Prompt–response dataset → compute target-vs-behavior likelihood ratios → turn them into weights (sequence, block, or token suffix) → multiply these weights into the standard SFT (or KL-distillation) loss → train the model → use this checkpoint to start RL.

Step-by-step, with the sandwich pattern whenever a new mechanism appears:

  1. 🍞 Hook: Imagine you have a binder of old, verified solutions (the offline buffer). 🥬 What happens: Gather prompt–response pairs generated by a known behavior policy (e.g., sampling from a larger model and verifying answers). Why this step exists: You need clean, labeled trajectories to imitate. Example: 100k correct logic puzzle solutions or math solutions. 🍞 Anchor: Your teacher’s binder of correct past exams.

  2. 🍞 Hook: To compare two storytellers, you check how likely each one is to tell the same next sentence. 🥬 What happens (compute per-token log-likelihoods): For each token in each trajectory, compute log p_theta(y_t | x, y_<t) under the target model and log p_beta(y_t | x, y_<t) under the behavior model. Why this step exists: These will form the log-ratios that power importance sampling. Example: If the target gives 0.2 and the behavior gives 0.5, the log-ratio is log(0.2) − log(0.5). 🍞 Anchor: Comparing how two narrators would likely continue the same story.

  3. 🍞 Hook: When multiplying many small numbers, calculators can underflow; you use logs to stay safe. 🥬 The Concept (Numerical Stabilization): Clip per-token log-ratios within safe bounds and compute sums in log-space before exponentiating and clipping final weights. Why this step exists: Long sequences can explode/vanish; clipping keeps training stable. Example: Clip log-ratios to [−0.08, 0.3] and final log-weights to [−10, 5]. 🍞 Anchor: Keeping your numbers from going off the page.

  4. 🍞 Hook: Sometimes you judge a whole essay; sometimes you judge a paragraph; sometimes a sentence. 🥬 The Concept (Three weighting variants):

    • Sequence-level: weight every token by the product over the whole trajectory (simple, strong).
    • Block-level: split into chunks; give each chunk a weight based on future continuation plausibility (stability vs. detail).
    • Token-level suffix: for each token, weight by how plausible the rest of the sequence is from there (precise long-horizon focus).
    Why this step exists: Different tasks need different granularity-stability tradeoffs. Example: B=1 is token-level; B>1 is block-level. 🍞 Anchor: Grading an essay, a paragraph, or a sentence depending on how noisy things are.

  5. 🍞 Hook: If a practice solution is unlikely under your usual style, don’t let it sway you much. 🥬 The Concept (Suffix-based token weights): For token t, compute weight G_t = gamma^(T−t) × product of ratios from t+1 to T. Why this step exists: It prefers tokens that lead into continuations the target policy would actually follow during RL. Example: If the remainder is implausible, G_t is small, so that token’s loss contributes little. 🍞 Anchor: If the rest of the path is a dead-end for you, don’t spend much time there.

  6. 🍞 Hook: Use the same recipe base, just change the seasoning amounts. 🥬 The Concept (Reweighting the SFT/KD loss): Keep the loss form (e.g., NLL or forward-KL) but multiply each token’s loss by the clipped, stop-gradient weight. Why this step exists: It keeps your objective familiar while steering learning toward RL-relevant data. Example: L = sum_t stopgrad(G_t) × loss_t (a code sketch of this reweighting and the three weighting variants follows this list). 🍞 Anchor: Same cake batter, different topping amounts.

  7. 🍞 Hook: Learning from mistakes also helps, as long as you don’t overreact to noisy parts. 🥬 The Concept (Optional negative examples): If you have verified failures, add a repulsive term that pushes the model away from those full sequences, weighted at sequence-level for stability. Why this step exists: It avoids reinforcing known-bad patterns while respecting on-policy relevance. Example: Mix 50k positives with 50k negatives; reduce the likelihood of the negative trajectories. 🍞 Anchor: Practice not doing common mistakes you’ve cataloged.

  8. 🍞 Hook: Test-day success needs both good prep and good play. 🥬 Online RL (unchanged): After SFT with PEAR, run the same RL recipe (e.g., GRPO with fixed LR, batch size, and KL coefficient). Why this step exists: You want a fair comparison to see if PEAR initializations learn better under identical RL compute. Example: Same GRPO hyperparameters across all initializations. 🍞 Anchor: Same game, same rules, just different warm-ups.
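
Putting steps 2 through 6 together, here is a minimal PyTorch sketch of the weight computation and the reweighted loss. The function names, the block-weight sharing rule, and the default hyperparameters are my own illustrative assumptions based on the description above, not the authors' released code; the clip bounds reuse the example values from step 3.

```python
import torch
import torch.nn.functional as F

def pear_weights(logp_target, logp_behavior, variant="suffix",
                 gamma=0.99, block_size=32,
                 ratio_clip=(-0.08, 0.3), weight_clip=(-10.0, 5.0)):
    """Importance weights for one logged response of length T.

    logp_target, logp_behavior: [T] per-token log-probs of the logged tokens
    under the target (trained) and behavior (data-generating) policies.
    Returns a [T] tensor of non-negative weights with no gradient attached.
    """
    with torch.no_grad():
        log_ratio = (logp_target - logp_behavior).clamp(*ratio_clip)
        # Each future step contributes log(gamma * ratio_s) to a token's weight.
        step = torch.log(torch.tensor(gamma, dtype=log_ratio.dtype)) + log_ratio

        if variant == "sequence":
            # One weight for the whole trajectory, shared by every token.
            log_w = log_ratio.sum().expand_as(log_ratio).clone()
        else:
            # Suffix sums via a reversed cumsum: S_t = sum_{s >= t} step_s,
            # then drop the current step so G_t only looks at tokens t+1..T.
            suffix = torch.flip(torch.cumsum(torch.flip(step, [0]), 0), [0])
            log_w = suffix - step
            if variant == "block":
                # Share the suffix weight computed at each block boundary across
                # the block (one plausible reading of the block-level variant).
                for start in range(0, log_w.numel(), block_size):
                    log_w[start:start + block_size] = log_w[start]

        return torch.exp(log_w.clamp(*weight_clip))


def pear_sft_loss(logits, target_ids, loss_mask, logp_behavior, variant="suffix"):
    """Standard next-token NLL, with each token's loss scaled by its PEAR weight."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    logp_target = -per_token                      # log p_theta(y_t | x, y_<t)
    losses = []
    for i in range(target_ids.size(0)):           # per-sequence weights for clarity
        mask = loss_mask[i].bool()
        w = pear_weights(logp_target[i][mask].detach(),
                         logp_behavior[i][mask], variant=variant)
        losses.append((w * per_token[i][mask]).sum() / mask.sum())
    return torch.stack(losses).mean()
```

Because the behavior-policy log-probabilities can be cached once over the offline dataset, the only per-step overhead beyond vanilla SFT is the weight arithmetic itself.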

The Secret Sauce:

  • PEAR’s key trick is “weigh the future, not just the current token.” One-step ratios can be myopic; suffix-based ratios align with long-horizon reasoning, just like RL’s return-to-go logic. This makes SFT gradients point more in the same direction as RL updates, reducing offline-to-online friction.

Concrete data example:

  • Suppose token A is common in the logged data but under the target policy, A is rarely followed by the logged B→C path. Sequence-level or suffix-based weights will downweight tokens leading to this A→B→C branch. As a result, the model doesn’t overlearn a path it won’t take during RL, freeing capacity to learn branches it will explore.

04 Experiments & Results

The Test: The authors chose verifiable reasoning tasks where answers can be checked automatically. This reduces noise and lets them compare many SFT objectives fairly. They trained multiple model sizes and families, always running the same RL recipe afterward to isolate the effect of the SFT stage.

🍞 Hook: Imagine two runners using the same race plan; the only difference is how they warmed up.

🥬 What they measured: Pass@1 and Pass@8 on logic puzzles and math benchmarks after RL, plus some offline-to-online alignment metrics like gradient angles and parameter drift. Why: The real goal is strong after-RL performance; alignment metrics explain why some warm-ups (SFT objectives) help RL more. Example: On AIME-2025, they focus on pass@8 because math often benefits from multiple tries.

🍞 Anchor: Same race strategy, different warm-ups leading to different race times.
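
For reference, pass@k is usually computed with the standard unbiased estimator from the code-generation literature; this small sketch may differ from the authors' exact evaluation script.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct (n >= k)."""
    if n - c < k:
        return 1.0                        # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 16 attempts with 5 correct: chance a batch of 8 contains at least one correct answer
print(round(pass_at_k(n=16, c=5, k=8), 3))   # 0.987
```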

The Competition: They compared PEAR to several SFT variants—plain SFT (NLL), SFT+KL with different KL strengths, token-adaptive reweighting (TALR), and hard/soft reweighting based on token probabilities (Top/Bottom-P, Top/Bottom-LogP), as well as a one-step importance baseline that only looks at the current token.

Key Scoreboard Highlights (with context):

  • Logic games (SynLogic/Enigmata): PEAR-initialized checkpoints improved pass@1 substantially after RL, with some cases showing around 40% absolute accuracy gains over SFT initializations. That’s like turning a mid-pack race into a near-podium finish.
  • Math (AIME-2024/2025, AMC-2023, MATH-500, MINERVA): On AIME-2025, PEAR boosted pass@8 by up to 14.6% after RL compared to SFT. That’s like moving from a B to an A with the same number of study hours.
  • Across Qwen2.5/3 and a DeepSeek-distilled model, and across sizes (0.6B to 8B), PEAR consistently delivered better after-RL performance under the same RL budget.

Surprising Findings:

  • Offline ≠ Online: Some objectives that topped offline SFT metrics underperformed after RL. Rankings flipped. Choosing by offline pass@1 alone can be misleading.
  • One-step importance weighting underperformed suffix-based PEAR. Looking only at the current token is too myopic for long-horizon reasoning; weighing the future continuation works better.
  • PEAR plays nicely with KL-based knowledge distillation: reusing already-computed probabilities to build weights brings extra gains with minimal overhead.

Alignment Analyses:

  • Gradient direction: The average principal angle between PEAR’s offline gradients and RL gradients was smaller than for other SFT variants. Translation: PEAR pushes in a direction RL also wants to go. (A simple cosine-similarity version of this check is sketched below.)
  • Parameter drift (NSS): PEAR-initialized models needed less change during RL to get big gains. That suggests PEAR fixed some mismatches in offline training, so RL didn’t waste updates correcting them.
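
As a rough, do-it-yourself proxy for that gradient-direction analysis, one can compare the offline and online loss gradients directly; this cosine-similarity sketch is a simplification of the paper's principal-angle metric, not their exact diagnostic.

```python
import torch
import torch.nn.functional as F

def grad_alignment(model, offline_loss, online_loss):
    """Cosine similarity between the gradients of an offline (SFT) loss and an
    online (RL surrogate) loss. Values near 1 mean the offline objective pushes
    the parameters roughly where RL wants them to go."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_offline = torch.autograd.grad(offline_loss, params, retain_graph=True, allow_unused=True)
    g_online = torch.autograd.grad(online_loss, params, allow_unused=True)
    flat_offline = torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                              for g, p in zip(g_offline, params)])
    flat_online = torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                             for g, p in zip(g_online, params)])
    return F.cosine_similarity(flat_offline, flat_online, dim=0)
```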

Transfer Test:

  • Even when the online RL distribution (e.g., Enigmata subset) differed from the offline domain (e.g., SynLogic or synthetic math solutions), PEAR-initialized models still ended up stronger after RL than SFT ones under identical RL compute. This shows PEAR isn’t overfitting to the offline data’s quirks but is genuinely preparing for on-policy learning.

Takeaway: If the final goal is better post-RL reasoning, don’t pick your SFT recipe by offline scores alone. Pick the one that prepares the model to learn on-policy—PEAR does that by upweighting logged continuations the target policy will actually revisit.

05 Discussion & Limitations

Limitations:

  • You must compute target and behavior probabilities on the offline data to get the ratios; while this is a one-time cost, it is extra work compared to vanilla SFT.
  • Importance ratios over long sequences can be high-variance; although PEAR uses discounting, clipping, and block-level variants to stabilize, extremely long horizons may still pose challenges.
  • Gains are shown on verifiable reasoning games and competitive math; other domains (e.g., open-ended dialogue with noisy rewards) may need careful tuning.
  • If you only care about offline SFT scores and never do RL, PEAR is not designed to maximize your offline metric.

Required Resources:

  • Access to both the behavior policy (to log its probabilities) and the target model (to log its probabilities) on the offline dataset.
  • Some engineering to cache token log-probs and implement weighting/clipping in the training loop.
  • Usual RL compute (e.g., GRPO) for the online phase—PEAR aims to make this compute more effective, not remove it.

When NOT to Use:

  • If you do not plan to run online RL at all; PEAR optimizes for RL readiness, not offline leaderboard scores.
  • If the behavior and target policies are virtually identical and the distribution mismatch is negligible, importance reweighting may add little benefit.
  • In ultra-short tasks where long-horizon effects are minimal, one-step schemes might suffice, reducing the need for suffix-based weights.

Open Questions:

  • How best to set discount gamma and clipping for extremely long chains or other modalities (e.g., code with tools)?
  • Can we learn the weighting schedule end-to-end, adapting to model size and domain automatically?
  • How does PEAR interact with more advanced RL algorithms, credit assignment schemes, or verifier-assisted training?
  • Can sequence-level negative weighting be extended to token-wise negative credit without instability over long horizons?
  • What are the best diagnostics (beyond gradient angles and NSS) to predict RL readiness after SFT?

06 Conclusion & Future Work

Three-Sentence Summary: The paper shows that stronger SFT scores do not guarantee better after-RL performance because SFT learns from a different distribution than RL uses. To fix this, PEAR reweights SFT losses using importance sampling so that logged continuations likely under the target policy count more. As a result, with the same RL budget, models initialized with PEAR learn faster and reach higher reasoning accuracy.

Main Achievement: PEAR provides a simple, plug-in loss reweighting inspired by off-policy evaluation (OPE) that consistently improves post-RL performance across logic and math benchmarks, delivering up to a 14.6% pass@8 gain on AIME-2025 and sizable gains on verifiable logic games.

Future Directions: Automate hyperparameter choices (discount, clipping, block size); integrate with more RL variants; extend to tool-use/code settings; and build better readiness metrics to predict post-RL outcomes from offline checkpoints.

Why Remember This: It reframes what “good SFT” means—don’t optimize SFT in isolation; prepare for RL by weighting the parts of data your on-policy learner will actually see. PEAR turns offline practice into real game strength.

Practical Applications

  • Train math reasoning models that achieve higher accuracy after RL with the same compute by using PEAR-weighted SFT.
  • Improve code-generation agents by reweighting offline traces toward continuations the target model will follow during RL fine-tuning.
  • Enhance data curation by prioritizing sequences with high target-vs-behavior likelihood ratios, increasing RL readiness.
  • Stabilize long-form reasoning SFT with block-level PEAR weights to reduce variance while keeping future-aware credit.
  • Integrate PEAR with KL-distillation pipelines for minimal-overhead gains by reusing computed probabilities.
  • Mix in negative verified trajectories and apply sequence-level repulsion to avoid reinforcing known-bad solutions.
  • Use gradient-alignment checks (e.g., principal angle) to validate that your offline updates point in RL’s direction.
  • Port PEAR initializations across related domains (e.g., from synthetic logic to new puzzle sets) to boost transfer during RL.
  • Tune discount gamma and clipping bounds to manage long-horizon variance in tool-use or code execution settings.
  • Adopt PEAR as a general plug-in objective in SFT libraries to make RL post-training more compute-efficient.
#Supervised Fine-Tuning#Reinforcement Learning#Distribution Mismatch#Importance Sampling#Off-Policy Evaluation#Log-Likelihood Ratio#Loss Reweighting#Suffix-based Weighting#Block-level Weighting#Sequence-level Weighting#Policy Initialization#Reasoning LLMs#GRPO#Math Benchmarks#Verifiable Reasoning
Version: 1