ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
Key Summary
- Traditional supervised fine-tuning (SFT) makes a model copy one answer too exactly, which can cause overfitting to the exact wording instead of the real idea.
- The paper’s key insight is that tokens with high model probability usually carry the core meaning, while low-probability tokens are mostly stylistic or replaceable.
- ProFit trains only on high-probability (high-value) tokens and masks low-probability ones, so the model learns the important logic without getting distracted.
- This simple, one-threshold masking breaks the cost-vs-diversity trade-off: it avoids expensive multi-answer datasets but still captures semantic flexibility.
- Across many benchmarks (GPQA-Diamond, GSM8K, MATH-500, AIME’24, IFEval) and models (Qwen, Llama, OLMo), ProFit beats standard SFT, often by large margins.
- Theory shows low-probability tokens produce large gradients that can overpower the learning signal; masking them keeps training on track.
- Ablations confirm that focusing on high-probability tokens improves faster, scales better with LoRA rank, and makes a stronger start for later reinforcement learning.
- ProFit is easy to add to current SFT pipelines: compute token probabilities online, gate tokens with a threshold, and backprop only through kept tokens.
- Limitations: the “low-probability = non-core” rule fits logic-heavy tasks best and may not hold for creative writing; the fixed threshold could be made adaptive.
- Bottom line: ProFit helps models learn the real ideas, not just the exact phrasing, using the model’s own sense of confidence to guide training.
Why This Research Matters
When we fine-tune AI, we want it to learn the real ideas, not just copy a single way of saying them. ProFit gives models a simple, built-in compass—token probability—to find the most meaningful parts of an answer and learn from those first. This makes the AI better at reasoning, following instructions, and handling new phrasings it never saw before. It also saves time and money by avoiding the need to collect many extra answers for each question. Because training becomes cleaner and more stable, ProFit models are stronger starting points for later improvements like reinforcement learning. In everyday tools, that means clearer explanations, fewer silly mistakes from wording changes, and more trustworthy help.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a teacher might grade you only if you write the answer in exactly their favorite words? That would be unfair, because there are many good ways to say the same idea.
🥬 Filling (The Actual Concept)
- What it is: Supervised fine-tuning (SFT) teaches a language model by showing it questions and one correct answer, then asking it to copy that answer token by token.
- How it works: 1) Show the model a question and a single reference answer. 2) The model predicts each next token and is pushed to match the exact reference tokens. 3) This repeats for lots of examples so the model learns to imitate. 4) Predictions that differ from the reference tokens are penalized, even when they mean the same thing.
- Why it matters: Without care, SFT makes the model overfit the exact surface phrasing, forgetting that language has many good ways to say the same thought.
🍞 Bottom Bread (Anchor): If the teacher’s key is “The answer is 42,” SFT may punish “It’s 42” or “I think it’s 42,” even though they mean the same thing.
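A minimal sketch of this standard SFT objective in PyTorch, assuming a causal language model that produces logits of shape (batch, seq_len, vocab); the tensor names and the -100 masking convention are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def standard_sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Token-by-token imitation: every reference token contributes equally.

    logits: (batch, seq_len, vocab) model outputs over prompt + reference answer.
    labels: (batch, seq_len) reference token ids, with -100 marking positions
            (e.g., the prompt) that should not be trained on.
    """
    # Shift so position t predicts token t+1, the usual causal-LM setup.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # Cross-entropy pushes the model toward the exact reference token at every step.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```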
🍞 Top Bread (Hook): Imagine describing your favorite movie. You could say “It’s exciting and smart” or “It’s thrilling and clever.” Different words, same idea.
🥬 The Concept: The one-to-many nature of language means there are many valid phrasings for the same meaning.
- How it works: 1) Choose any idea. 2) List different wordings. 3) Notice they keep the core meaning but change style. 4) A training system must avoid punishing these valid variations.
- Why it matters: If training punishes synonyms or harmless phrasing tweaks, the model learns to mimic a script instead of understanding.
🍞 Anchor: “The capital of France is Paris” vs. “France’s capital: Paris.” Same truth, different surface.
🍞 Top Bread (Hook): Think of glitter mixed with gold sand. If you grab everything, you’ll take lots of glitter; you want the gold.
🥬 The Concept: Overfitting is when a model memorizes exact wording instead of learning the idea, so it fails when words change.
- How it works: 1) The model is pushed to copy exact tokens. 2) It mistakes style for substance. 3) On new phrasing, it stumbles. 4) It looks smart on the training set but weak on new inputs.
- Why it matters: Overfitting hurts generalization—what really counts in the real world.
🍞 Anchor: A student who memorizes the answer sheet can’t solve a similar question with different words on test day.
🍞 Top Bread (Hook): When you sing along to a song, you usually know the next word really well at the chorus, but you’re less sure in the tricky verse.
🥬 The Concept: Token probability is the model’s confidence that a specific next token is the right one in context.
- How it works: 1) The model looks at the sentence so far. 2) It scores each possible next token from low to high. 3) Higher scores mean “this fits best.” 4) The chosen token has the highest probability.
- Why it matters: These probabilities hint at which parts are core to meaning (often high) and which are flexible (often low).
🍞 Anchor: In “The answer is 42,” the model is very confident about “42” after “The answer is,” but less fixed on filler like “actually.”
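A small sketch of how a training loop can read these per-token probabilities off the model's logits (PyTorch; shapes and names are illustrative, and prompt positions are assumed to be handled separately):

```python
import torch

def reference_token_probs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Return p_t, the model's probability of each ground-truth next token.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) reference token ids.
    """
    shift_logits = logits[:, :-1, :]        # prediction for the next position
    shift_labels = labels[:, 1:]            # the token that actually comes next
    probs = shift_logits.softmax(dim=-1)    # full next-token distribution
    # Pick out the probability assigned to the reference token at each position.
    return probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)
```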
🍞 Top Bread (Hook): Like building a LEGO castle: some bricks are load-bearing (core), others are decorations (trivial).
🥬 The Concept: Core tokens carry key logic or facts; trivial tokens are replaceable style or fluff.
- How it works: 1) Compare many valid answers. 2) See which tokens must stay to keep the meaning (core). 3) See which can change without breaking the idea (trivial). 4) Notice core tokens tend to have higher model probability.
- Why it matters: Mixing signals makes training noisy; focusing on core tokens teaches the real structure.
🍞 Anchor: In a math solution, numbers and steps are core; words like “clearly” or “actually” are trivial.
🍞 Top Bread (Hook): Suppose you ask five friends to write the same answer differently. You learn the main idea by noticing what stays the same.
🥬 The Concept: Multi-reference SFT uses several correct answers to be more flexible, but it’s expensive and can confuse training.
- How it works: 1) Collect multiple answers per question. 2) Train to accept all of them. 3) Handle conflicts across styles. 4) Costs and gradient conflicts can hurt convergence.
- Why it matters: It slightly helps diversity but is costly and not always stable.
🍞 Anchor: Hiring many tutors gives variety, but if they disagree on style, a student can get mixed signals.
🍞 Top Bread (Hook): If two kids tug a boat in opposite directions, it barely moves.
🥬 The Concept: Gradient overshadowing means big gradients from unhelpful tokens can pull learning away from the important tokens.
- How it works: 1) Low-probability tokens create large error signals. 2) These dominate the update. 3) The model chases style noise instead of core logic. 4) Progress slows or derails.
- Why it matters: Stopping these noisy pulls keeps training focused and stable.
🍞 Anchor: Shouting trivia can drown out a quiet, correct explanation—so you hush the shouters to hear the important part.
02 Core Idea
🍞 Top Bread (Hook): Imagine highlighting only the key sentences in your textbook and studying those first—you learn faster because you focus on what matters most.
🥬 The Concept: The “aha!” is to use the model’s own token probabilities to keep high-value (core) tokens and mask low-value (trivial) ones during SFT.
- How it works: 1) As the model reads the reference answer, it assigns a probability to each next token. 2) If a token’s probability is above a threshold, we train on it; if below, we mask it. 3) Only the kept tokens send gradients to update the model. 4) This strips away style noise and centers learning on the logical skeleton.
- Why it matters: It breaks the need for costly multi-answer datasets while avoiding single-answer overfitting, improving generalization.
🍞 Bottom Bread (Anchor): For “The answer is actually 42,” ProFit learns strongly from “answer” and “42,” but mostly ignores “actually,” so it won’t get stuck on that filler word.
Multiple Analogies
- Spotlight analogy: On stage, ProFit turns the spotlight on the lead actors (core tokens) and dims the background extras (trivial tokens).
- Metal detector analogy: On a beach, ProFit is a detector beeping loudly for gold nuggets (high-prob tokens) and staying quiet for bottle caps (low-prob tokens).
- Recipe analogy: ProFit keeps the flour, eggs, and baking time (core steps) and skips the sprinkles (style) so every cake rises right.
Before vs After
- Before: SFT forced exact copying of one answer, punishing valid rewordings and often overfitting.
- After: ProFit preserves meaning, not wording, by training mainly on high-probability, semantically important tokens.
Why It Works (intuition)
- High-probability tokens tend to be the glue holding the reasoning together. Low-probability tokens are often flourish or optional phrasing. Theory shows low-prob tokens create big gradients that can yank learning off course; masking them keeps updates aligned with real logic. Empirically, probability distributions of core vs trivial tokens differ a lot, so a simple threshold is a strong separator.
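The "big gradients" part of this intuition follows from the textbook gradient of softmax cross-entropy with respect to the logits; a short sketch of that standard result (notation is ours: z_t are the logits at step t, y_t the reference token):

```latex
L_t = -\log p_\theta(y_t \mid x, y_{<t}),
\qquad
\frac{\partial L_t}{\partial z_{t,v}} = p_\theta(v \mid x, y_{<t}) - \mathbf{1}[v = y_t]
```

For the reference token itself (v = y_t), the gradient magnitude is 1 minus the token's probability: near 0 when the model is already confident, near 1 when the token is unlikely. So low-probability (often stylistic) tokens contribute the largest per-token gradients, which is exactly the overshadowing effect that masking removes.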
Building Blocks (introduced in Sandwich style)
🍞 Hook: You know how you can trust a thermometer, but you still set a cutoff like “fever is above 100.4°F”?
🥬 Concept: The probability threshold τ is the cutoff for keeping tokens.
- What: A simple number between 0 and 1. Tokens with probability higher than τ are trained; the rest are masked.
- How: 1) Compute token probabilities online. 2) Compare each to τ. 3) Keep if above, mask if below. 4) Only kept tokens contribute to loss.
- Why: A clear rule turns noisy text into clean learning signals.
🍞 Anchor: If τ = 0.5, a token with probability 0.7 is kept and one with 0.3 is masked.
🍞 Hook: Like taping a switch in place so it doesn’t wiggle while you work.
🥬 Concept: Stop-gradient keeps the mask decision from being changed by backprop.
- What: A trick so the mask is treated as fixed during the gradient step.
- How: 1) Compute probabilities. 2) Detach them (no gradients). 3) Make the binary mask. 4) Use mask to gate the loss.
- Why: Keeping the mask out of the gradient computation avoids awkward non-differentiable terms and keeps training stable.
🍞 Anchor: You decide “train only on these words,” and that choice doesn’t get nudged by the optimizer.
🍞 Hook: Adding a smart filter to a familiar machine.
🥬 Concept: ProFit is a plug-in to standard SFT.
- What: The same SFT loop with a mask in front of the loss.
- How: 1) Forward pass → probabilities. 2) Make mask via τ. 3) Multiply per-token loss by the mask. 4) Backprop only through kept tokens.
- Why: Minimal code change, big focus boost.
🍞 Anchor: It’s like putting a sieve in the pipeline that only lets valuable grains through.
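Putting the three building blocks together, here is a minimal sketch of the masked loss in PyTorch. It illustrates the idea described above (threshold, stop-gradient, gating) and is not the authors' reference implementation; names and the -100 convention are assumptions:

```python
import torch
import torch.nn.functional as F

def profit_masked_loss(logits: torch.Tensor, labels: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Cross-entropy computed only on tokens whose probability exceeds tau.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) reference ids,
    with -100 on positions (e.g., the prompt) that are never trained on.
    """
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    valid = shift_labels != -100

    # 1) Probability of each reference token, detached (stop-gradient):
    #    the keep/mask decision itself must not be optimized.
    with torch.no_grad():
        probs = shift_logits.softmax(dim=-1)
        p_ref = probs.gather(-1, shift_labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)

    # 2) Binary mask: keep high-probability (core) tokens, drop the rest.
    keep = (p_ref > tau) & valid

    # 3) Per-token cross-entropy, gated by the mask; only kept tokens backprop.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.clamp_min(0).reshape(-1),
        reduction="none",
    ).view_as(shift_labels)

    kept = keep.float()
    return (per_token * kept).sum() / kept.sum().clamp_min(1.0)
```

If every token in a batch falls below τ, the clamp keeps the division safe and the batch simply contributes zero loss; with a moderate τ, plenty of signal remains.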
03 Methodology
High-level Overview: Input → Model predicts token probabilities → Apply probability threshold mask (with stop-gradient) → Compute loss only on kept tokens → Update model.
Step-by-Step (with Sandwich where new ideas appear)
- Prepare data and model
- What happens: Use standard instruction–response pairs and a base LLM (optionally with LoRA for parameter-efficient tuning).
- Why this step exists: You need a starting model and examples to learn from.
- Example: Q: “What is 6×7?” A: “The answer is 42.”
🍞 Hook: Like attaching a small steering wheel on top of a big one to change direction cheaply.
🥬 Concept: LoRA (optional) fine-tunes with low-rank adapters.
- What: A way to update far fewer parameters by inserting small rank-limited matrices.
- How: 1) Freeze original weights. 2) Add two small matrices A and B. 3) Learn A and B to nudge behavior. 4) Scale and combine with the frozen base.
- Why: Saves memory and stabilizes training.
🍞 Anchor: You can fine-tune an 8B model on 8 GPUs without changing all its weights.
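A generic low-rank adapter sketch (the standard LoRA idea, not code from this paper); the rank, scaling, and module names are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # 1) freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # 2) two small matrices: project down to `rank`, then back up
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as a no-op update
        self.scaling = alpha / rank                 # 4) scale before combining

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 3) learned low-rank nudge added on top of the frozen base output
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```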
- Forward pass and get probabilities
- What happens: For each position t in the reference answer, the model outputs a probability over the vocabulary; we read p_t for the ground-truth token.
- Why this step exists: These probabilities are our “importance detector.”
- Example: For “…answer is 42”, p(“42”) is very high; p(“actually”) may be lower.
- Build the probability mask using a threshold τ
🍞 Hook: Like a gate that opens only for VIP guests.
🥬 Concept: Probability-guided masking.
- What: Make a binary mask M_t that is 1 if p_t > τ and 0 otherwise.
- How: 1) Detach p_t (stop-gradient). 2) Compare to τ. 3) Set M_t accordingly. 4) Keep only tokens with M_t = 1 for loss.
- Why: This filters out low-value, stylistic tokens and prevents their gradients from dominating.
🍞 Anchor: If τ = 0.5 and a token has p = 0.72, it contributes to learning; with p = 0.18, it’s ignored.
- Compute masked loss
- What happens: Multiply each token’s loss by M_t and sum. Only kept tokens backpropagate.
- Why this step exists: It concentrates gradient updates on core logic.
- Example with data: In “The answer is actually 42,” only losses from “answer” and “42” may count if τ is moderate, while “actually” is masked.
- Backpropagate and update
- What happens: Standard optimizer step updates either full parameters or LoRA adapters, guided by the masked loss.
- Why this step exists: This is where learning happens, now focused and less noisy.
- Example: After many batches, the model gets better at reasoning steps and factual tokens without clinging to filler words.
- Secret sauce: Stop big, bad gradients from low-prob tokens
🍞 Hook: If a small kid and a grown-up both push a cart, the grown-up sets the direction; huge but wrong pushes can wreck the path.
🥬 Concept: Gradient overshadowing control.
- What: Low-prob tokens create large gradients that can hijack learning; masking prevents this.
- How: 1) Identify low-prob tokens (big error). 2) Zero out their loss via M_t=0. 3) High-prob core tokens steer the update. 4) Training stays stable.
- Why: Keeps the learning direction aligned with meaning.
🍞 Anchor: On math problems, numbers and key steps guide learning; fluff can’t yank the model off-track anymore.
- Practical settings
- Threshold τ: Tune via a small grid (e.g., 0.1–0.9). Results show that training on tokens with p > τ works best; training only on tokens with p < τ underperforms.
- Batch/sequence: Use your usual SFT setup (the paper uses up to 8k–32k tokens for long reasoning).
- Inference: Same as usual; ProFit only changes training.
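A minimal training-loop sketch that ties these steps together, reusing the masked-loss helper from the Core Idea section. The model, optimizer, and dataloader names are placeholders for whatever your SFT stack provides, and the model is assumed to return logits of shape (batch, seq_len, vocab):

```python
import torch

def train_profit(model, dataloader, optimizer, tau: float = 0.5, epochs: int = 1):
    """Usual SFT loop with the ProFit mask gating the loss; inference is unchanged."""
    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            input_ids = batch["input_ids"]      # prompt + reference answer ids
            labels = batch["labels"]            # reference ids, -100 over the prompt
            logits = model(input_ids)           # assumed shape: (batch, seq_len, vocab)
            loss = profit_masked_loss(logits, labels, tau=tau)  # helper sketched earlier
            optimizer.zero_grad()
            loss.backward()                      # gradients flow only through kept tokens
            optimizer.step()
```

Here τ is just another hyperparameter, so the small grid from the practical settings above (e.g., 0.1–0.9) can be swept with this same loop.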
What breaks without each step
- No probabilities: You lose the built-in importance detector.
- No mask: You re-introduce noise from trivial tokens.
- No stop-gradient: The mask becomes unstable and hard to optimize.
- No threshold tuning: You may keep too much noise or throw out too much signal.
Concrete mini-walkthrough
- Input: “Explain why 6×7=42.” Reference: “Compute 6×7 = 42.”
- Probabilities: High on “compute,” “6,” “×,” “7,” “42”; lower on fillers.
- Mask: Keep high-prob tokens; drop low ones.
- Update: Loss only from kept tokens. After training, the model solves similar problems even when phrasing changes.
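A toy illustration of the mask, reusing the running example with a filler word; the probabilities are made up for illustration, not measured from any model:

```python
# Hypothetical per-token probabilities for "The answer is actually 42."
token_probs = [("The", 0.62), ("answer", 0.84), ("is", 0.91),
               ("actually", 0.07), ("42", 0.96), (".", 0.58)]
tau = 0.5
kept = [tok for tok, p in token_probs if p > tau]
print(kept)  # ['The', 'answer', 'is', '42', '.'] -- 'actually' is masked out
```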
Why this method is clever
- It uses the model’s own confidence to cheaply find important tokens, needs no extra answers, and removes harmful gradients from superficial text—simple rule, big payoff.
04 Experiments & Results
The Test: What they measured and why
- Benchmarks: GPQA-Diamond (hard science Q&A), GSM8K and MATH-500 (math reasoning), AIME’24 (very hard math), IFEval (instruction following).
- Why these: Together they check if the model learned true reasoning and instruction skills, not just surface phrasing.
The Competition: Baselines
- Vanilla (un-tuned base models): Starting point.
- Standard SFT: Regular token-by-token imitation of one answer.
- Entropy-based and DFT: Probability/uncertainty-aware methods that reweight or select tokens but don’t do hard masking like ProFit.
The Scoreboard (with context)
- Qwen3-4B-Base average accuracy: Standard SFT ≈ 41.39%; ProFit ≈ 52.33% (about +11 points). That’s like jumping from a solid B to an A.
- Qwen3-14B-Base: Standard SFT actually drops below the Vanilla model (a warning sign of overfitting to style), but ProFit rises to ≈ 58.72%, about +5.6 points over Vanilla—like turning a slump into a clear win.
- Qwen3-0.6B and Llama-3.1-8B: ProFit consistently edges out SFT and other methods, showing the idea scales from tiny to larger models.
- OLMo-2-7B: Even here, ProFit is the top among fine-tuning strategies, proving universality across architectures.
Surprising Findings
- Multi-answer training (3 references) did not consistently help and sometimes hurt on complex tasks, suggesting gradient conflicts from style differences.
- Training only on low-probability tokens (p<τ) performs worse than baseline; these tokens don’t build solid reasoning chains.
- With LoRA rank growth, core-token training improves monotonically, but global SFT and trivial-token training show a U-shape, revealing optimization interference at medium ranks.
- ProFit converges faster per epoch—often beating the baseline’s best score in the very first epoch—showing that removing noise speeds learning.
- For reinforcement learning (GRPO) on math, ProFit is a superior warm start: higher Pass@4 and Avg@4, more stable KL divergence, healthy entropy, and longer, deeper chain-of-thought responses.
Make the numbers meaningful
- Think of scores as class grades across different subjects: math quizzes (GSM8K, MATH-500), super-tough olympiads (AIME’24), science comprehension (GPQA-Diamond), and following directions (IFEval). ProFit doesn’t just cram one teacher’s answer key—it learns the concepts, so it aces multiple subjects better than methods that memorize wording.
Takeaway
- Focusing on high-probability tokens is a simple rule with big effects: better accuracy, faster learning, sturdier scaling, and a stronger foundation for later RL.
05 Discussion & Limitations
Limitations (be specific)
- The core assumption—low-probability tokens are usually non-core—fits logic-heavy tasks best (math, structured reasoning). In creative writing, rare words might actually be the sparkle, so masking them could dull style.
- A single global threshold τ is simple and stable but not optimal. Some problems are harder and may need a different τ; an adaptive rule could work even better.
- If the base model’s probability estimates are poor (e.g., very small models or unusual domains), the “importance detector” is less reliable.
- Extremely long outputs with many near-threshold tokens may need careful tuning to avoid over-masking.
Required resources
- Same compute class as SFT, plus negligible overhead to read probabilities and apply masks. Works with common training stacks (e.g., LoRA + standard SFT frameworks).
When NOT to use
- Highly creative generation, poetry, or style transfer tasks where low-probability choices can be the point.
- Datasets where the single reference is known to be noisy or idiosyncratic in its core tokens—masking might keep the wrong signals.
- Very small datasets where masking removes too much learning signal; consider a lower τ or warm up with standard SFT.
Open questions
- Adaptive thresholds: Can we tune τ per example, per step, or by uncertainty bands for a better signal-to-noise trade-off?
- Hybrid signals: Can we combine probabilities with syntax, attribution, or causal tracing to tag core logic even more precisely?
- Multi-reference lite: If we add just a tiny sprinkle of extra answers, can ProFit use them to calibrate τ automatically?
- Beyond SFT: How does probability-guided masking pair with preference learning and RL to further lift reasoning?
06 Conclusion & Future Work
3-Sentence Summary
- ProFit fine-tunes language models by training mainly on high-probability (core) tokens and masking low-probability (trivial) ones.
- This reduces overfitting to surface phrasing, keeps gradients focused on real logic, and improves accuracy across reasoning and instruction tasks.
- It is simple to add, inexpensive compared to multi-answer datasets, and even makes a stronger starting point for reinforcement learning.
Main Achievement
- Turning the model’s own token probabilities into a reliable, low-cost spotlight that finds and trains the true semantic skeleton of answers.
Future Directions
- Adaptive thresholds that adjust by example difficulty, better core-token detectors that mix probability with structure signals, and extensions to preference learning and RL.
Why Remember This
- ProFit is a small change with outsized impact: it teaches models to learn ideas, not just imitate wording—making them more robust, faster learners, and better reasoners in the real world.
Practical Applications
- Fine-tune a customer-support bot that understands requests even when phrased differently, by focusing training on core intent tokens.
- Train math and science tutors that learn real solution steps instead of fixating on filler words, improving correctness and clarity.
- Build instruction-following assistants that obey formatting and constraints more reliably by emphasizing high-probability instruction tokens.
- Reduce compute costs by keeping single-answer datasets while still avoiding overfitting to their exact wording.
- Use ProFit as a warm start before reinforcement learning to reach higher scores faster with more stable training.
- Combine with LoRA to fine-tune large models on modest hardware while filtering out stylistic noise.
- Improve enterprise document assistants to extract and act on key facts, not superficial phrasing, across varied writing styles.
- Enhance data-cleaning pipelines by identifying low-value tokens in references that can be masked during training.
- Stabilize domain adaptation (e.g., legal, medical) where precise terms are core and generic phrasing is not.
- Accelerate curriculum learning by starting with a higher τ (very core tokens) and gradually lowering it as the model matures, as sketched below.
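One possible way to realize that curriculum idea, as a sketch; the schedule shape and the start/end values are assumptions, not from the paper:

```python
def tau_schedule(epoch: int, total_epochs: int,
                 tau_start: float = 0.7, tau_end: float = 0.3) -> float:
    """Linearly lower the keep-threshold over training: early epochs train only on
    the most confident (core) tokens, later epochs admit more of the reference."""
    if total_epochs <= 1:
        return tau_end
    frac = epoch / (total_epochs - 1)
    return tau_start + (tau_end - tau_start) * frac
```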