Diversity or Precision? A Deep Dive into Next Token Prediction
Key Summary
- The paper shows that teaching a language model with a special “reward-shaped” next-token objective can make later reinforcement learning (RL) work much better.
- They reinterpret the usual cross-entropy loss as a one-step policy gradient, turning next-token prediction into a tiny RL problem.
- A single knob, beta (β), makes the model more precise (low entropy) or more diverse (high entropy) by scaling rewards for the correct token.
- They also treat wrong tokens differently: keep several top alternatives alive (local diversity) but strongly push down unlikely tail tokens (rank-aware negative shaping).
- Surprisingly, starting with a precision-oriented prior (globally lower entropy) creates a better exploration space for RL than starting out extra-diverse.
- Across dense and MoE models from 1B to 20B+ parameters, the approach scales well and improves scores on reasoning benchmarks in math, logic, and coding.
- During RL, overly high global entropy can collapse quickly and shorten reasoning; precision priors and local diversity avoid this and boost final scores.
- The method keeps pass@k strong by mixing high precision with targeted local exploration among top candidates.
- It unifies supervised pre-training and RL under one view, suggesting smarter pre-training can unlock better downstream reasoning with less RL headache.
Why This Research Matters
This work shows that better “first lessons” for language models make later “advanced classes” in reasoning far more successful. By dialing in precision and diversity during pre-training, we can avoid fragile RL runs and grow longer, clearer chains of thought for math and code. That means more reliable homework help, safer coding assistants, and tools that reason instead of just guessing. It can also cut compute costs by reducing trial-and-error during RL, since the model starts closer to a good policy. Schools, developers, and researchers gain a recipe for building models that are both smart and steady. Over time, this approach could generalize beyond text to multimodal systems that plan and reason across images, tools, and data.
Detailed Explanation
01 Background & Problem Definition
You know how when you learn a new board game, the way you practice at first makes a huge difference in how well you play later? If you practice with clear rules and good feedback, you’ll win more when the real game starts. Training big language models (LLMs) is like that: how we pre-train them shapes how well they reason later when we fine-tune them with reinforcement learning (RL).
🍞 Top Bread (Hook): Imagine you’re guessing the next word in a sentence, like a fill‑in‑the‑blank game. 🥬 The Concept: Next-Token Prediction is when an AI tries to guess the next token (a piece of a word) given the tokens before it.
- How it works:
- Read the previous tokens (the context).
- Look at a list of all possible next tokens.
- Assign a probability to each token.
- Pick one and learn from whether it was right.
- Repeat for the next position.
- Why it matters: Without next-token prediction, the model can’t learn language patterns or build up knowledge. 🍞 Bottom Bread (Anchor): Seeing “The capital of France is …,” the model should place high probability on “Paris.”
Before this paper, LLMs were mostly trained with cross-entropy loss, a standard supervised signal that pushes up the correct next token and, by softmax normalization, pushes down the others. More recently, RL has supercharged reasoning by rewarding full, correct solutions (like math answers). But there was a mystery: why do some models thrive in RL and others stall, even with the same algorithm?
🍞 Top Bread (Hook): You know how a coach gives you points when you make a good move so you’ll try that move more often? 🥬 The Concept: Reinforcement Learning (RL) lets an AI try things, get rewards, and adjust to get more rewards next time.
- How it works:
- The model acts (e.g., writes a solution step).
- It gets a reward (e.g., the answer is correct!).
- It updates itself to make good actions more likely.
- Why it matters: RL turns models from copycats into problem solvers that search for good reasoning paths. 🍞 Bottom Bread (Anchor): In math problems, the model receives a reward only if the final result passes tests.
The big insight here is that pre-training already sets the model’s “exploration space” — the shapes of probability over tokens that RL will explore. If pre-training leaves probabilities too flat or too spiky, RL behaves differently later. Many believed that more diversity (higher entropy) is always better for exploration. But that hasn’t always matched practice.
🍞 Top Bread (Hook): Think about how you share attention among your friends in a group chat: too evenly, and you miss the most important message; too narrowly, and you ignore useful suggestions. 🥬 The Concept: Token-Output Distribution is the spread of probabilities the model assigns to all possible next tokens.
- How it works:
- Compute a score for each token.
- Turn scores into probabilities that sum to 1 (softmax).
- Higher probability = more likely to be picked.
- Why it matters: This distribution controls what the model tries during RL — its exploration menu. 🍞 Bottom Bread (Anchor): If “Paris” has 80% and “London” 10%, the model will mostly pick “Paris,” but still sometimes try “London.”
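To make this “exploration menu” concrete, here is a minimal Python/PyTorch sketch; the tokens and scores are invented for illustration:

```python
import torch

# Hypothetical scores (logits) for four candidate next tokens.
tokens = ["Paris", "London", "Lyon", "Purple"]
logits = torch.tensor([4.0, 1.9, 1.8, 0.2])

# Softmax turns scores into probabilities that sum to 1.
probs = torch.softmax(logits, dim=-1)
for tok, p in zip(tokens, probs.tolist()):
    print(f"{tok:7s} {p:.3f}")   # "Paris" gets most of the mass

# Sampling mostly picks "Paris" but occasionally tries the alternatives --
# this spread is the exploration menu that RL later works with.
next_token = tokens[torch.multinomial(probs, num_samples=1).item()]
print("sampled:", next_token)
```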
People tried tweaks like label smoothing or focal loss to change this distribution. But these didn’t explicitly connect pre-training to the later RL exploration space. The missing link: treat next-token prediction as a tiny RL step and shape its rewards directly.
🍞 Top Bread (Hook): You know how you can adjust the “spiciness” of a dish so it’s tasty but not overwhelming? 🥬 The Concept: Entropy Control manages how spread-out (random) the model’s token probabilities are.
- How it works:
- Measure how even the probabilities are.
- Use knobs to make them flatter (more diverse) or peakier (more precise).
- Keep global randomness healthy while preserving good alternatives locally.
- Why it matters: Too much randomness confuses training; too little prevents discovering better answers. 🍞 Bottom Bread (Anchor): A story-writing AI with slightly higher entropy explores plots; a math model with lower entropy sticks to solid steps.
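A tiny sketch of how entropy measures that spread (the probability values below are invented for illustration):

```python
import torch

def entropy(probs: torch.Tensor) -> float:
    """Shannon entropy in nats: higher = more spread out (diverse)."""
    return -(probs * probs.log()).sum().item()

peaked  = torch.tensor([0.90, 0.05, 0.04, 0.01])   # precision-oriented
flatter = torch.tensor([0.40, 0.25, 0.20, 0.15])   # diversity-oriented

print(f"peaked  entropy: {entropy(peaked):.3f} nats")   # lower
print(f"flatter entropy: {entropy(flatter):.3f} nats")  # higher
# Entropy control means steering where on this spectrum the model's
# next-token distributions sit before RL begins.
```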
The gap this paper fills is a unified, RL-style view of cross-entropy that enables reward shaping right in pre-training. That way, we don’t leave the diversity-versus-precision balance to chance — we dial it in from day one, so RL inherits a well-prepared exploration space.
🍞 Top Bread (Hook): Choosing between a big box of crayons (diverse) and a sharp pencil (precise) depends on the job. 🥬 The Concept: Diversity–Precision Trade-off is the balance between exploring many options and focusing on the right one.
- How it works:
- If you boost precision, you increase the probability of the best answer.
- If you boost diversity, you keep more alternatives alive.
- Good training balances them for the task.
- Why it matters: The wrong balance makes RL wander or get stuck. 🍞 Bottom Bread (Anchor): For coding tests, precise syntax often beats wild ideas; for brainstorming, you want more variety.
Real stakes: Better pre-training means less RL headache, stronger reasoning, and more reliable AI helpers for homework, coding, science projects, or everyday questions — without needing to oversample or over-tune later.
02 Core Idea
Aha! Moment in one sentence: If you treat next-token prediction as a one-step RL problem, you can shape its rewards to craft a token distribution that makes later RL exploration more effective — and a precision-first prior usually wins.
🍞 Top Bread (Hook): Imagine training wheels on a bike set just right — not too loose (wobbly) and not too tight (can’t turn). 🥬 The Concept: Cross-Entropy as Policy Gradient reframes the usual loss as a one-step policy-gradient update.
- How it works:
- Consider each token choice as an action.
- Give reward when the action equals the ground-truth token.
- The gradient looks like policy gradient with that reward.
- Why it matters: This unlocks principled reward shaping during pre-training, not just after. 🍞 Bottom Bread (Anchor): Picking “Paris” gets a reward; the learning step mirrors an RL update that increases the chance of “Paris” next time.
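Here is a minimal PyTorch sketch of that reframing on toy logits: the gradient of cross-entropy on the ground-truth token is identical to the gradient of a one-step policy-gradient loss that assigns reward 1 to that token and 0 to every other action.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0], requires_grad=True)
target = torch.tensor([0])   # index of the ground-truth token ("Paris")

# (a) Standard cross-entropy on the correct next token.
ce_loss = F.cross_entropy(logits.unsqueeze(0), target)
ce_grad, = torch.autograd.grad(ce_loss, logits)

# (b) One-step policy gradient: maximize reward * log pi(action),
#     with reward 1 for the ground-truth token and 0 for everything else.
reward = 1.0
pg_loss = -reward * F.log_softmax(logits, dim=-1)[target.item()]
pg_grad, = torch.autograd.grad(pg_loss, logits)

print(torch.allclose(ce_grad, pg_grad))   # True: identical update direction
# Both gradients equal softmax(logits) - one_hot(target): push the correct
# token up and, through the softmax normalization, push the others down.
```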
Three analogies for the same idea:
- Flashlight analogy: Narrow beam (precision) shows details to move forward confidently; a slightly broader beam (local diversity) lets you see nearby paths without wasting light everywhere.
- DJ mixer analogy: One fader controls global precision (β knob); another boosts or calms nearby alternatives (rank-aware negatives). Good mixing = clearer music (reasoning).
- GPS analogy: Set a strong default route (precision) but keep a short list of detours (local diversity) in case of traffic.
Before vs. After:
- Before: Pre-training aimed at accuracy, and people assumed higher entropy helps RL explore. But RL sometimes collapsed entropy early, cut reasoning short, and plateaued.
- After: With reward-shaped pre-training, models start RL with a precision-oriented global prior and healthy local choices. RL stays stable longer, explores smarter, and reaches higher scores.
🍞 Top Bread (Hook): You know how teachers can grade not just “right” vs. “wrong,” but also how close a try was? 🥬 The Concept: Reward-Shaping Strategy adds smart feedback to both correct and incorrect tokens.
- How it works:
- Scale positive rewards with a β factor: β<0 strengthens confidence in the correct token (lower global entropy); β>0 relaxes it (higher global entropy).
- Split negatives into top-k vs. tail: give small nonzero signals to high-ranking wrong tokens (keep them alive), and suppress low-probability tail tokens more.
- Combine to form the single-step reward for each token.
- Why it matters: The model learns a distribution that is precise overall but still has plausible backups. 🍞 Bottom Bread (Anchor): “Paris” gets a big boost; “Lyon” and “Marseille” get a tiny cushion; random rare tokens get pushed down.
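A minimal code sketch of the idea, with purely hypothetical magnitudes (β = -0.5, a +0.05 cushion, a -0.20 tail push) since the paper’s exact constants aren’t given here:

```python
import torch

# Hypothetical magnitudes, for illustration only -- not the paper's constants.
tokens = ["Paris", "Lyon", "London", "Purple"]
probs  = torch.tensor([0.50, 0.20, 0.20, 0.10])   # model's current beliefs
ground_truth, k, beta = 0, 3, -0.5                # beta < 0: precision-first

shaped = torch.zeros_like(probs)

# Positive reward on the correct token, scaled by the beta knob:
# beta < 0 boosts it (sharper peak), beta > 0 softens it (flatter spread).
shaped[ground_truth] = 1.0 * (1.0 - beta)

# Rank-aware negatives: small cushion for high-ranking wrong tokens,
# stronger push-down for the low-probability tail.
topk = set(probs.topk(k).indices.tolist())
for i in range(len(tokens)):
    if i != ground_truth:
        shaped[i] = 0.05 if i in topk else -0.20

print({t: round(v, 2) for t, v in zip(tokens, shaped.tolist())})
# Paris gets a big boost (1.5), Lyon/London a small cushion (0.05),
# and the tail token Purple is suppressed (-0.2).
```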
Why it works (intuition, no equations):
- RL benefits from a strong starting point: a peaked distribution over good next steps (precision) gives RL a clear path to climb.
- But reasoning also needs escape hatches: preserving a few nearby alternatives (local diversity) prevents dead ends and supports recovery.
- High global entropy sounds explorative, but in practice it can collapse fast under RL pressure, shrinking reasoning length. Precision-first priors resist this collapse.
🍞 Top Bread (Hook): Think of your backpack: a few great tools you always keep, plus space for one or two extras. 🥬 The Concept: Rank-Aware Negative Shaping keeps the best few wrong tokens available but prunes the noisy tail.
- How it works:
- Identify top-k candidates by probability.
- Apply a small, stabilizing signal to these top-k negatives.
- Strongly down-weight tail tokens so they don’t crowd the head.
- Why it matters: You don’t waste capacity on junk, yet you still have nearby alternatives when needed. 🍞 Bottom Bread (Anchor): Keep “London” as a backup for “Paris,” but don’t waste time on “Purple.”
Building blocks, briefly:
- One-step RL view of cross-entropy.
- β for global entropy control (precision knob).
- Top-k rank-aware negative shaping for local entropy.
- On-policy perspective during token selection.
- A distribution that is easy for RL to refine into long, reliable chains of thought.
03 Methodology
At a high level: Text input → Pre-training with reward-shaped next-token objective → Mid-training (longer context, more reasoning data) → RL on verifiable tasks (math/coding) → Better end-to-end reasoning.
Step by step with the Sandwich pattern when new ideas appear:
- Frame next-token prediction as a tiny RL problem. 🍞 Top Bread (Hook): Flipping one coin can still be a game with a score. 🥬 The Concept: Stochastic Decision Process means each token pick is an action made with uncertainty.
- What happens:
- The model sees context (state).
- It samples a token (action) from its policy (probabilities).
- It gets a reward depending only on this state–action.
- Why this step exists: It lets us apply policy-gradient logic cleanly to each token emission. 🍞 Bottom Bread (Anchor): At “France is …,” sampling “Paris” gets reward; sampling others gets shaped signals.
- Reinterpret cross-entropy as policy gradient. 🍞 Top Bread (Hook): Grading each answer immediately is like giving a mini-reward every time. 🥬 The Concept: Policy Gradient Optimization updates the policy in the direction of actions with higher rewards.
- What happens:
- The intrinsic reward equals “1 if correct, 0 if not,” scaled by the model’s own probability terms.
- The gradient looks just like a one-step policy update.
- Thus, cross-entropy is a special RL case.
- Why this step exists: Once we see CE as PG, we can shape rewards instead of being stuck with a single rule. 🍞 Bottom Bread (Anchor): Correct tokens act like “wins” in a one-move game; the model shifts to win more often.
- Add the positive reward scaling factor β (global entropy control). 🍞 Top Bread (Hook): Turning up a flashlight makes the center brighter. 🥬 The Concept: Entropy Control via β adjusts how concentrated probability becomes on the ground-truth token.
- What happens:
- If β<0, correct tokens get extra reward → higher peak on the right answer (low global entropy, precision-first).
- If β>0, reward is gentler → flatter distribution (high global entropy, diversity-first).
- β=0 recovers standard cross-entropy behavior.
- Why this step exists: It gives a single knob to tune global precision vs. diversity before RL. 🍞 Bottom Bread (Anchor): With β<0, “Paris” probability shoots up faster than normal. (A toy code sketch of this knob follows below.)
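Here is the promised toy sketch. It stands in for the real objective by simply weighting the cross-entropy on the correct token by (1 - β); that weighting is an assumption for illustration, not the paper’s formula, but it reproduces the qualitative effect of the knob:

```python
import torch
import torch.nn.functional as F

def train_with_beta(beta: float, steps: int = 50, lr: float = 0.5) -> torch.Tensor:
    """Illustrative stand-in for the beta knob (not the paper's exact objective):
    weight the CE on the ground-truth token by (1 - beta), so beta < 0 pushes
    harder toward the correct token and beta > 0 pushes more gently."""
    logits = torch.zeros(4, requires_grad=True)   # toy 4-token vocabulary
    target = torch.tensor([0])                    # ground-truth token index
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        loss = (1.0 - beta) * F.cross_entropy(logits.unsqueeze(0), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=-1)

for beta in (-0.5, 0.0, 0.5):
    p = train_with_beta(beta)
    ent = -(p * p.log()).sum().item()
    print(f"beta={beta:+.1f}  p(correct)={p[0].item():.3f}  entropy={ent:.3f}")
# beta < 0 ends sharpest (lowest global entropy), beta = 0 is plain CE,
# beta > 0 ends flattest (highest global entropy).
```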
- Shape negative tokens by rank (local entropy control). 🍞 Top Bread (Hook): Keep a short backup list, not every wrong guess. 🥬 The Concept: Rank-Aware Negative Shaping treats high-ranking vs. tail negatives differently.
- What happens:
- Compute top-k tokens by current probability.
- Slightly support these top-k negatives (so they don’t vanish).
- Suppress out-of-top-k tail tokens (so noise doesn’t grow).
- Why this step exists: It preserves nearby alternatives essential for robust reasoning while cutting long tails. 🍞 Bottom Bread (Anchor): Keep “London” and “Lyon”; push “Zzqw” down. (A code sketch of this shaping follows below.)
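And here is the promised sketch of one plausible way to implement rank-aware shaping; the extra tail penalty and its weight are assumptions for illustration, not the paper’s exact loss:

```python
import torch
import torch.nn.functional as F

def shaped_loss(logits: torch.Tensor, target: int, k: int = 3,
                tail_weight: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: plain CE on the ground-truth token, plus a
    penalty only on the probability mass outside the top-k candidates."""
    probs = torch.softmax(logits, dim=-1)
    ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))

    keep = set(torch.topk(probs, k).indices.tolist()) | {target}
    tail_mask = torch.ones_like(probs, dtype=torch.bool)
    tail_mask[list(keep)] = False

    return ce + tail_weight * probs[tail_mask].sum()

logits = torch.tensor([2.0, 1.2, 1.1, -0.5, -0.7], requires_grad=True)
shaped_loss(logits, target=0).backward()
print(logits.grad)
# Tail tokens' logits get an extra positive gradient (pushed down harder),
# while the top-k wrong tokens stay close to the plain-CE update and so
# keep a small cushion of probability.
```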
- Train across stages.
- Pre-training (500B tokens): Apply the new objective across dense and MoE models; perplexity (PPL) and entropy both converge well. β<0 lowers global entropy; rank-aware shaping fine-tunes local entropy.
- Mid-training (100B tokens): Extend context length, add more reasoning content, keep hyperparameters; track knowledge vs. reasoning scores.
- RL stage (math/coding with verifiable rewards): Use on-policy RL (e.g., GRPO-like) to train for long chain-of-thought; sample many solutions and evaluate Avg@128, Cons@128, Pass@k.
- What breaks without each step:
- Without the one-step RL framing: You can’t cleanly define token-level rewards; shaping becomes ad hoc.
- Without β: No global knob to steer precision; RL may face unstable entropy behavior.
- Without rank-aware negatives: Either top alternatives die (hurting recovery) or the tail balloons (wasting probability).
- Without staged training: Long reasoning may not emerge; context scaling lags.
- Concrete mini-example (toy vocabulary):
- Context: “The capital of France is …”
- Candidates: [Paris, Lyon, London, Purple]
- Start: Model gives [0.5, 0.2, 0.2, 0.1].
- With β<0: Increase reward on “Paris” → [0.6, 0.18, 0.17, 0.05].
- Rank-aware shaping (k=3): Slight cushion for Lyon/London; tail token Purple is pushed down.
- After a few updates: [0.75, 0.14, 0.10, 0.01]. Precise globally, still a couple of backups locally.
- The secret sauce:
- Start precise globally so RL has a clear backbone and avoids early entropy collapse (which shortens reasoning chains).
- Maintain local diversity among a few top negatives so RL can branch where it matters (critical “fork” tokens) without drowning in noise.
- This precise-backbone-plus-local-wiggle-room distribution makes RL both stable and capable of discovering better paths.
04 Experiments & Results
The Test: The authors evaluated three phases — pre-training, mid-training, and RL — across dense and MoE models at multiple sizes (1B to 20B+). They tracked language knowledge, commonsense, logic, math, and coding using 19+ benchmarks (e.g., MMLU, ARC, GSM8K, MATH-500, OlympiadBench, HumanEval+, MBPP+). For generation-heavy tasks, they reported pass@k by sampling many answers.
🍞 Top Bread (Hook): Grading isn’t just about a single score; it’s how fast you improve and how steady you stay. 🥬 The Concept: Pass@k estimates the chance at least one try is correct if you make k independent attempts.
- How it works:
- Sample m answers; count how many are correct.
- Use a formula to estimate success with k attempts.
- Bigger pass@k = better upper bound on capability.
- Why it matters: Reasoners need both precision (get it right) and diversity (find one correct in several tries). 🍞 Bottom Bread (Anchor): For coding, pass@64 is like turning in 64 different drafts; you pass if any one compiles and solves the problem.
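For reference, this is the unbiased pass@k estimator that is standard in code/math evaluation; the summary above doesn’t spell out the paper’s exact protocol, so treat this as the common convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: of n sampled answers, c are correct;
    estimate the chance that at least one of k draws (without replacement)
    from those n samples is correct."""
    if n - c < k:                 # fewer than k wrong samples: guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 128 sampled solutions to one problem, 20 of them correct.
for k in (1, 8, 64):
    print(f"pass@{k:<2d} = {pass_at_k(128, 20, k):.3f}")
# Averaging this per-problem value over a benchmark gives the reported pass@k;
# Avg@128 is (typically) just the mean correctness across the 128 samples.
```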
Competition and baselines: Standard cross-entropy training (β=0, no rank-aware negatives) served as the baseline. The new approach varied β (e.g., β<0 vs β>0) and negative shaping (top-k cushions vs. tail suppression) to observe effects on PPL, entropy, performance, and RL trajectory.
Scoreboard with context (high-level patterns):
- Pre-training: Changing rewards altered entropy without hurting final PPL. Precision-first (β<0) or tail suppression tended to scale better as model size grew.
- Mid-training: On 4B, 10B, 20B models, β<0 consistently matched or beat baseline on knowledge and reasoning averages; β>0 rarely won.
- RL: Precision-first (β<0) yielded stronger Avg@128, Cons@128, and Pass@64 curves across 4B and 10B models; rank-aware local diversity further boosted stability and peak performance.
- Pass@k analysis: Precision-first did not collapse diversity in practice. Instead, it maintained enough variation to cover solution space while boosting correctness, improving pass@k in math/coding tasks.
Make the numbers meaningful (examples from trends reported):
- Think of a class where most students (baselines) get B’s on math tests. The precision-first prior pushes the model toward A-range performance, especially after RL. Gains of a few points on tough math sets are like moving from an 88 to a 92 — a big deal when tests are hard.
- On coding (HumanEval+, MBPP+), the mix of precision-first with local diversity was like catching more bugs before submission: pass@k curves rose more reliably than baselines.
- For logic (ARC-e/c, BBH), precision-first models climbed faster during RL and sustained their lead, suggesting better exploration at key decision “forks.”
Surprising findings:
- High global entropy (β>0) didn’t guarantee robust exploration. Under RL, its entropy often collapsed quickly, and response length — a proxy for long chain-of-thought — shrank early. That hurt reasoning.
- Precision-first global entropy (β<0) avoided early collapse. When paired with local diversity (rank-aware negatives), it delivered the steadiest RL curves and highest peaks.
- Bigger models benefitted more from the precision-first prior, indicating this shaping plays nicely with scaling laws.
Why this is convincing:
- The same story appears across architectures (dense and MoE), sizes, and tasks.
- Metrics sensitive to reasoning depth (e.g., pass@k, long outputs) improved in tandem with stability.
- Entropy and response-length traces lined up with performance: stable entropy and growing lengths correlated with better reasoning outcomes.
05 Discussion & Limitations
Limitations:
- Hyperparameter sensitivity: β, top-k size, and negative weights need tuning. The best settings may differ by model size and domain.
- Task mismatch: A precision-first prior is great for verifiable reasoning (math, code), but open-ended creativity might prefer more global diversity.
- Architecture dependence: Results are shown on Qwen-like dense/MoE backbones; other architectures should be tested for generality.
- Training cost: Running full pre-train → mid-train → RL pipelines is compute-intensive; small labs may struggle to replicate at scale.
- Metric scope: While many benchmarks are covered, broader domains (e.g., multimodal, dialogue safety) weren’t the focus.
Required resources:
- Large corpora (hundreds of billions of tokens), long-context training capability, and robust RL infrastructure (on-policy sampling, verifiers for math/code).
- Monitoring tools for entropy, response length, and pass@k to guide β and top-k choices.
When not to use:
- Purely creative writing where surprise and novelty dominate correctness.
- Low-data or tiny-model settings where careful shaping may overfit or be hard to tune.
- Tasks without verifiable rewards; the benefits show strongest when RL has clear correctness signals.
Open questions:
- Can β and top-k be learned automatically per-context, driven by uncertainty estimates?
- How does this interact with latent-reasoning or looped architectures that do internal thinking before emitting tokens?
- Can we design curricula that change β over training (anneal from diverse to precise) for even better scaling?
- How does reward shaping affect calibration and confidence estimation?
- Will similar ideas help multimodal models where token choices span text and vision?
06 Conclusion & Future Work
Three-sentence summary: This paper reframes next-token prediction as a one-step RL problem and shapes its rewards to balance diversity and precision during pre-training. A precision-first global prior (β<0) plus rank-aware local diversity builds a token distribution that RL can refine more easily, avoiding early entropy collapse and improving reasoning. Across sizes and architectures, this leads to stronger, stabler performance on math, logic, and coding.
Main achievement: A simple, principled pre-training objective — with a single global knob (β) and a rank-aware negative scheme — that reliably creates a better exploration space for downstream RL and lifts end-to-end reasoning.
Future directions:
- Auto-tune β and k based on per-token uncertainty; schedule β over training.
- Extend to latent/looped reasoning models and multimodal settings.
- Study effects on calibration, safety, and interpretability.
Why remember this: It shows that smarter pre-training can pre-arrange the “search field” for RL, making models not just know more, but think better. Instead of hoping RL fixes everything later, we prepare the ground early — precise where it counts, and still flexible where it helps.
Practical Applications
- Pre-train math and coding assistants with β<0 to improve stability and final pass@k after RL.
- Use rank-aware negative shaping to preserve top plausible edits in code autocompletion while suppressing noisy completions.
- Monitor entropy and response length during RL; if entropy collapses early, strengthen the precision prior and increase local diversity (smaller tail, healthier top-k).
- Adopt a curriculum: start with stronger precision (β<0) during pre-training, then relax slightly or adjust top-k during mid-training for robustness.
- Auto-tune β per domain: lower β (more precision) for math/code, slightly higher β (more diversity) for brainstorming tasks.
- Combine with verifiers: pair precision-first pre-training with unit tests or math checkers to magnify RL gains.
- Deploy pass@k sampling for production coding agents; keep k modest if precision is high to save compute.
- Apply to MoE models: use the same reward shaping without auxiliary losses to stabilize expert routing outcomes.
- Use entropy/length dashboards during training to catch and correct early collapse or over-diffusion.
- In small-scale fine-tunes, approximate the effect by increasing weight on correct tokens and pruning low-probability tails.