Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers
Key Summary
- Decoding (how a language model picks the next word) isn't a bag of tricks; it's a clean optimisation problem over probabilities.
- Many famous methods (Greedy, Softmax/temperature sampling, Top-K, and Top-P) fall out as exact solutions when you pick different regularisers and constraints.
- A key idea is to pick a whole distribution first (not a single token), then sample or take the top; this "distribution-first" view unifies deterministic and random decoding.
- Active vs. inactive tokens appear naturally from optimality conditions: some tokens get positive probability; others are pushed to zero by the geometry of the simplex.
- Softmax is the solution of "score minus entropy penalty"; Top-K and Top-P are Softmax on smaller, rule-based subsets; Sparsemax gives exact zeros without manual truncation.
- Mirror ascent is a safe, fast "climb" that respects probability geometry, letting us solve harder decoders when no closed form exists.
- Best-of-K (BoK) is a new coverage-focused regulariser that boosts the chance good answers appear at least once across K samples, anchored by a KL trust region.
- BoK improves high-temperature performance substantially (e.g., +18.6% on MATH500 at τ=0.9 for Qwen2.5-Math-7B) with only a small runtime cost.
- This framework turns "folklore" into design: decide the behaviour you want (diversity, sparsity, stability), write it as a regulariser/constraint, and solve.
- Bottom line: decoding is not a hack; it is optimisation, which makes building new decoders easier, clearer, and more reliable.
Why This Research Matters
This work gives us a dependable way to shape how AI speaks, reasons, and explores, replacing ad-hoc tricks with a single, clear optimisation layer. It can make multi-try workflows (like self-consistency and reranking) much more effective by spending samples wisely on distinct, promising alternatives. Developers gain a design language: pick the behaviour (diverse, safe, sparse, close to model) and encode it as a regulariser or constraint, instead of fiddling blindly with knobs. Users benefit from more reliable answers at higher creativity settings, reducing the frustration of repeated, similar wrong attempts. Because the method adds only a small computational overhead, it's practical for real systems and can improve accuracy without retraining. Over time, this shifts the field from folklore to first-principles engineering, speeding up innovation and making AI behaviour more transparent.
Detailed Explanation
01 Background & Problem Definition
You know how when you pick snacks, sometimes you always grab the same cookie, and other times you try something new? Old language-model decoding felt like that: flip a few knobs like "Top-K" or "temperature" and hope the snack (next word) tastes good. People saw these choices as tricks: Greedy here, Top-P there; useful, but not clearly connected. This paper says: stop guessing. There's a simple principle underneath it all: optimisation.
🍞 Hook: Imagine choosing a class leader every day. Some days you pick the most helpful student (Greedy). Other days you put the top helpers' names in a hat and draw (Top-K/Top-P). It looks like a bunch of different rules.
🥬 The Concept (Greedy Decoding):
- What it is: Greedy decoding always picks the single highest-scoring next word.
- How it works: 1) Score all words; 2) Find the biggest score; 3) Pick that word.
- Why it matters: Without Greedy, you might overcomplicate easy choices; with only Greedy, you get stuck being too predictable. 🍞 Anchor: If the model is sure "Paris" is best after "capital of France is…", Greedy simply picks "Paris."
Before this paper, the situation was: lots of techniques (Greedy, Softmax sampling, Top-K, Top-P, beam search) felt separate. People tuned them by feel, not from a shared rulebook. The problem: this makes it hard to know which method to use, why it works, or how to create better ones.
Researchers tried things like:
- Turning the temperature knob (higher = more adventurous, lower = safer), but that alone can swing from boring to chaotic.
- Truncating with Top-K or Top-P, which helps but still feels like manual pruning, not a principled choice.
- Heavier search (like beam search), which can be slow and still not capture what we want (e.g., diverse but correct options).
What was missing? A single, simple lens to see all these methods at once. The gap was a unifying principle that explains why each method behaves as it does, and guides the design of new ones on purpose, not by folklore.
🍞 Hook: You know how baking recipes tell you the goal (a soft cookie) and which trade-offs to make (more butter for softness, more flour for structure)?
🥬 The Concept (Regulariser):
- What it is: A regulariser is a preference or penalty that nudges the probability distribution to have certain shapes (smooth, sparse, stable, close to a reference, etc.).
- How it works: 1) Start with model scores; 2) Add a "preference" term; 3) Balance score vs. preference; 4) Optimise to get a distribution.
- Why it matters: Without regularisers, the decoder may be too sharp (always the same) or too messy (too random). Regularisers steer behaviour. 🍞 Anchor: If you want variety in classmates' turns, you add a rule "don't pick the same kid too often." That's a regulariser.
This paper reframes decoding as optimisation over a probability distribution (not a single token) each step. Choose a distribution that balances "pick high-scoring tokens" with "follow desired structure" (like smoothness, sparsity, or staying near the model). This viewpoint instantly ties together methods:
- Greedy = no regulariser.
- Softmax/temperature = entropy regulariser.
- Top-K/Top-P = Softmax on a restricted set.
- Sparsemax = a quadratic penalty that allows true zeros.
🍞 Hook: Imagine a pie chart where each slice is a token's chance. The whole pie must be 100%.
🥬 The Concept (Probability Simplex):
- What it is: The set of all valid probability distributions: every slice is nonnegative and all slices add up to 1.
- How it works: 1) Assign each token a slice size; 2) No slice goes below zero; 3) All slices sum to one full pie.
- Why it matters: Without the simplex rules, your "probabilities" could be impossible (like negative chances or sums above 100%). 🍞 Anchor: If "cat" has 40%, "dog" 30%, and "bird" 30%, that's a legal pie on the simplex.
Why should anyone care? Because decoding is the doorway between the model and everything we do with it: homework help, coding assistance, science explanations, and more. Poor decoding wastes compute, repeats itself, and misses good answers hiding just off-center. In multi-try setups (like sampling a few answers and picking the best), naive methods overspend on the same guess and underspend on useful alternatives. This paper's view makes decoding safer, smarter, and easier to improve, like swapping a toolbox of random gadgets for one adjustable power tool that fits the job.
02 Core Idea
Aha! In one sentence: Treat decoding as an optimisation problem over a probability distribution that trades off model score against regularisers and constraints on the probability simplex.
Three analogies to lock it in:
- Chef's tasting spoon: You don't plate the first bite; you first mix flavours (build a distribution), balancing tasty (high score) with house style (regulariser), then serve (sample/mode).
- Sports draft: You don't draft just the single top player; you shape a whole team (distribution) under rules (salary cap = constraints) and philosophy (regulariser), then pick your lineup.
- Classroom turns: You want helpful answers (scores), but also fairness and variety (regulariser). You keep rules like "everyone's share must be nonnegative and sum to 100%" (simplex), then pick who speaks.
Before vs. After:
- Before: Many separate tricks (Greedy, Softmax, Top-K, Top-P) felt unrelated.
- After: One master objective; different regularisers/constraints pick out different decoders exactly. Temperature isn't a hack; it's the strength of an entropy penalty. Top-K/Top-P are just constraints that trim the pie before softmaxing.
Why it works (intuition, no equations):
- The simplex makes you share 100% probability across tokens. The optimality conditions then split tokens into two groups: active (get some pie) and inactive (get zero).
- Some regularisers (entropy) hate zeros, so they keep everyone slightly positive; this is Softmax.
- Other regularisers (quadratics like Sparsemax) are okay with zeros, so low-score tokens vanish automatically (no manual Top-K needed).
- If you want the draws across K samples to "cover" good alternatives at least once, you add a coverage utility. It gives diminishing returns to over-allocating one token and rewards spreading enough mass to see other strong options at least once. Add a KL "stay close to the model" leash to avoid drifting too far; that's Best-of-K (BoK).
Building blocks (introduced with Sandwich style when needed):
🍞 Hook: Think of a stage show: some performers are on stage (active), some are backstage (inactive). 🥬 The Concept (Active and Inactive Tokens):
- What it is: Tokens are active if they get positive probability; inactive if they get zero.
- How it works: 1) Optimise under simplex rules; 2) Active tokens satisfy a balance condition; 3) Inactive ones are pushed to zero by the geometry/penalty.
- Why it matters: Without this split, we can't explain sparsity (some tokens truly removed) or smoothness (everyone gets a tiny slice). 🍞 Anchor: In Softmax, almost all tokens are "on stage," even if whispering; in Sparsemax, only a few sing while the rest are silent.
🍞 Hook: You know how "surprise" makes a story lively? 🥬 The Concept (Entropy & Softmax Sampling):
- What it is: Softmax is the distribution that balances score with an entropy bonus (which likes some spread, i.e., surprise).
- How it works: 1) Start with scores; 2) Add a term that rewards variety; 3) The best balance is Softmax with temperature controlling spread.
- Why it matters: Without this, you might always pick the same token; with too much, you're too random. 🍞 Anchor: Asking "What's the capital of France?": Softmax gives "Paris" a big slice, but tiny slices to others, just in case.
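To make the entropy-regularised picture concrete, here is a minimal NumPy sketch of temperature softmax on the running toy scores (an illustration, not the paper's code):

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Solution of "expected score plus entropy bonus": softmax of scores/temperature."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()              # log-sum-exp trick: subtract max before exponentiating
    e = np.exp(z)
    return e / e.sum()

# Toy scores: cat=3, dog=2, bird=0.
p = softmax([3, 2, 0])            # peaked, but every token stays positive
p_hot = softmax([3, 2, 0], 10.0)  # high temperature: much flatter distribution
```

Lowering the temperature sharpens the distribution toward Greedy; raising it flattens toward uniform, which is exactly the "strength of the entropy penalty" view.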
🍞 Hook: Picking only from your top five snacks is faster than scanning every snack. 🥬 The Concept (Top-K Sampling):
- What it is: Limit choices to the K best tokens, then apply Softmax only there.
- How it works: 1) Find top K by score; 2) Zero out others; 3) Renormalise and sample.
- Why it matters: Without trimming, tiny but bad tokens still get sampled; with too much trimming, you lose useful variety. 🍞 Anchor: Choosing a dessert only from your top 5 favourites.
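The three steps above fit in a few lines; a minimal sketch (illustrative, not the paper's code):

```python
import numpy as np

def top_k_probs(scores, k, temperature=1.0):
    """Softmax restricted to the k highest-scoring tokens; the rest get exact zero."""
    scores = np.asarray(scores, dtype=float)
    keep = np.argsort(scores)[-k:]       # indices of the k best scores
    z = scores[keep] / temperature
    z -= z.max()                         # numerical stability
    e = np.exp(z)
    probs = np.zeros_like(scores)
    probs[keep] = e / e.sum()            # renormalise inside the trimmed support
    return probs

q = top_k_probs([3, 2, 0], k=2)          # "bird" is trimmed away entirely
```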
🍞 Hook: Instead of "five items," imagine "enough items to reach a target fullness." 🥬 The Concept (Top-P/Nucleus Sampling):
- What it is: Pick the smallest set of tokens whose model probabilities sum to at least P, then Softmax within that set.
- How it works: 1) Sort tokens by model probability; 2) Accumulate until you hit P; 3) Renormalise and sample.
- Why it matters: It adapts to confidence: fewer tokens when sure, more when unsure. 🍞 Anchor: Fill your backpack until it reaches a weight limit; some days that's 3 books, other days 7.
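The sort-accumulate-renormalise recipe can be sketched directly (an illustration with made-up probabilities, not the paper's code). Note how the nucleus shrinks when the model is confident:

```python
import numpy as np

def top_p_probs(model_probs, p):
    """Keep the smallest set of highest-probability tokens whose mass reaches p."""
    probs = np.asarray(model_probs, dtype=float)
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, p)) + 1    # tokens needed to reach mass p
    out = np.zeros_like(probs)
    keep = order[:n_keep]
    out[keep] = probs[keep] / probs[keep].sum()  # renormalise within the nucleus
    return out

sure = top_p_probs([0.90, 0.06, 0.04], p=0.9)    # confident: nucleus is one token
unsure = top_p_probs([0.40, 0.35, 0.25], p=0.9)  # uncertain: nucleus keeps all three
```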
🍞 Hook: Sometimes you want only a few strong choices, not a crowd. 🥬 The Concept (Sparsemax):
- What it is: A softmax-like method that naturally puts exact zeros on weak tokens.
- How it works: 1) Use a quadratic regulariser; 2) Optimise; 3) Low-score tokens drop to zero automatically.
- Why it matters: It avoids the softmax "long tail" that can pick weird tokens. 🍞 Anchor: Sharing candy with only a few kids so they each get enough to enjoy.
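Sparsemax has a well-known closed form: it is the Euclidean projection of the scores onto the simplex. A compact NumPy version (illustrative sketch of the standard algorithm):

```python
import numpy as np

def sparsemax(scores):
    """Project the score vector onto the probability simplex (quadratic regulariser)."""
    s = np.asarray(scores, dtype=float)
    z = np.sort(s)[::-1]                        # scores in descending order
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > np.cumsum(z)          # which top-k sets are "active"
    k_max = k[support][-1]                      # size of the active support
    tau = (np.cumsum(z)[k_max - 1] - 1) / k_max # threshold subtracted from scores
    return np.maximum(s - tau, 0.0)             # weak tokens hit exact zero

peaked = sparsemax([3, 2, 0])        # top gap is large: all mass lands on "cat"
shared = sparsemax([1.5, 1.0, 0.0])  # closer scores share mass: [0.75, 0.25, 0]
```

Unlike Top-K, no K is chosen by hand: the support size falls out of the optimisation.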
🍞 Hook: Compare your current recipe to a trusted one. 🥬 The Concept (KL Divergence):
- What it is: A measure of how different one distribution is from a reference.
- How it works: 1) Take your proposed probabilities; 2) Compare to the modelās; 3) Penalise being too far.
- Why it matters: Without KL anchoring, a new decoder might drift away from what the model believes. 🍞 Anchor: If the chef's new sauce is wildly different from the house sauce, KL says "too far, dial it back."
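The "compare and penalise" steps above are a one-liner in practice; a minimal sketch with made-up distributions:

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p): how far a proposed distribution q has drifted from the model's p."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    nz = q > 0                     # convention: 0 * log(0/p) contributes 0
    return float(np.sum(q[nz] * np.log(q[nz] / p[nz])))

house = [0.7, 0.2, 0.1]                          # the model's "house sauce"
same = kl_divergence([0.7, 0.2, 0.1], house)     # identical recipes: zero penalty
far = kl_divergence([0.98, 0.01, 0.01], house)   # positive: drifted a long way
```

Adding λ·KL(q‖p) to the objective is the "leash": the larger λ, the closer the decoder stays to the model's own beliefs.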
All together, the master recipe is: choose a distribution that maximises "expected score minus a regulariser," optionally inside a constraint set (like Top-K/P). That single idea turns a shelf of tricks into plug-and-play design.
03 Methodology
At a high level: Model scores → Pick a probability distribution by solving an optimisation → Sample (or take the most likely) → Next token.
Step-by-step, like a recipe:
- Collect ingredients (scores): For each token, the model gives a score (higher means more suitable next word).
- Why this step exists: You need a notion of "goodness." Without scores, you can't rank tokens.
- Example: Scores: cat=3, dog=2, bird=0.
- Decide to choose a distribution first, not a single token.
- Why: It unifies deterministic (Greedy) and stochastic (sampling) methods and lets you encode preferences as regularisers.
- Example: Instead of jumping to "cat," you first decide cat 0.6, dog 0.35, bird 0.05.
- Optimise "score minus regulariser" on the simplex, possibly with constraints.
- Why: The simplex keeps probabilities valid; regularisers shape behaviour; constraints cut out unwanted tokens.
- Example:
- Greedy: no regulariser, so all mass on the max (cat).
- Softmax: entropy regulariser; everyone gets a slice, bigger for higher scores.
- Top-K: force zero outside top K, then Softmax on the rest.
- Top-P: choose the smallest mass-covering set, then Softmax.
- Sparsemax: quadratic penalty; weak tokens become exact zero.
- Sample from the distribution (or take the mode) to produce the next token.
- Why: This turns the optimised distribution into an actual word.
- Example: With probabilities [0.6, 0.35, 0.05], you usually get cat, sometimes dog, rarely bird.
Detailing each classic decoder under the master view:
- Greedy = λ=0 (no regulariser). Active tokens must tie for the top score; the solution puts all mass on the max. Without this, "deterministic" decoding wouldn't be explained by the same principle.
- Softmax = negative entropy regulariser. It dislikes zeros, so all tokens stay slightly positive. Without entropy, you'd be too peaky (or undefined when ties happen).
- Top-K/Top-P = same Softmax but on a trimmed support set. Without constraints, tiny long-tail tokens still distract; with constraints, you focus on the right region.
- Sparsemax = quadratic regulariser that allows zeros. Without it, youād need manual truncation to get true sparsity.
🍞 Hook: Walking on Earth using a flat map can mislead your steps. 🥬 The Concept (Mirror Ascent):
- What it is: An optimisation method that takes "geometry-aware" steps on the probability simplex using a divergence (like KL), not Euclidean distance.
- How it works: 1) Start at a safe distribution (often the model's own); 2) Compute gradients of your objective; 3) Do a multiplicative update that stays on the simplex; 4) Repeat a few times.
- Why it matters: Plain projected gradients can fight the simplex at the edges; mirror ascent keeps probabilities valid and stable. 🍞 Anchor: It's like climbing a hill with shoes designed for that terrain: you slip less and get to the top faster.
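A single mirror-ascent step is just a multiplicative update followed by renormalisation. The sketch below uses a deliberately simple objective (unregularised expected score, whose maximiser is the greedy vertex) to show the mechanics; it is an illustration, not the paper's decoder:

```python
import numpy as np

def mirror_step(q, grad, stepsize=0.5):
    """One KL-geometry ascent step: multiplicative update, then renormalise."""
    logits = np.log(q) + stepsize * grad   # work in log-space (the mirror map)
    logits -= logits.max()                 # log-sum-exp trick for stability
    e = np.exp(logits)
    return e / e.sum()                     # stays on the simplex automatically

s = np.array([3.0, 2.0, 0.0])   # token scores; gradient of E_q[s] w.r.t. q is s
q = np.ones(3) / 3              # start from a safe point (uniform)
for _ in range(50):
    q = mirror_step(q, s)
# q climbs toward the greedy vertex [1, 0, 0] without ever leaving the simplex.
```

With a regulariser added to the gradient (entropy, KL, coverage), the same loop solves the harder objectives that have no closed form.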
Secret Sauce:
- Geometry-aware updates (mirror ascent) + a flexible regulariser. You can write a new behaviour (e.g., coverage across K samples), then solve it stably on the simplex, even without a closed-form formula.
Concrete mini-example (3 tokens): scores [3,2,0].
- Greedy: [1,0,0].
- Softmax(temperature=1): roughly [0.71, 0.26, 0.04].
- Top-2 Softmax: zero "bird," renormalise: [0.73, 0.27, 0].
- Sparsemax: on these exact scores the top gap (3 - 2 = 1) is large enough that all mass lands on "cat" ([1, 0, 0]); with closer scores such as [1.5, 1, 0], Sparsemax shares: [0.75, 0.25, 0].
Now the new design: Best-of-K (BoK)
🍞 Hook: If you try opening a combination lock K times, it's wise to spread guesses so at least one hits. 🥬 The Concept (Best-of-K / BoK):
- What it is: A coverage-focused regulariser that boosts the chance good options appear at least once across K samples, while staying near the model with a KL anchor.
- How it works: 1) Define a "hit at least once" utility for each token across K tries; 2) Weight important tokens more; 3) Add a KL leash to stay close to the model; 4) Optimise with mirror ascent.
- Why it matters: Without BoK, many samples can repeat the same guess; with BoK, you spend your K-budget to cover strong alternatives. 🍞 Anchor: Trying five different problem-solving steps and keeping the best: BoK raises the chance one of them is the winner.
Diminishing returns makes BoK smart: once a token is likely to appear at least once, BoK prefers to invest in other good tokens that aren't covered yet. The KL term prevents drifting too far from the model's beliefs. Practically, you run just a few mirror-ascent steps per token (like 2–5), which converge quickly and add only small overhead.
Implementation sketch for BoK per token step:
- Inputs: scores s, reference model distribution p, nonnegative weights w (importance), hyperparameters (K, λ for KL strength, β for coverage strength), stepsize, and number of mirror steps.
- Loop J times: compute the gradient of [score − λ·KL + β·coverage], apply a multiplicative mirror update, renormalise (a softmax-like normalisation), and use a log-sum-exp trick to stay numerically safe.
- Output: the BoK distribution q*, then sample or take the mode.
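The sketch above might look as follows in Python. The coverage utility used here, Σᵢ wᵢ·(1 − (1 − qᵢ)^K) ("token i is drawn at least once in K tries"), is one natural reading of the description; the paper's exact utility, parameterisation, and schedule may differ:

```python
import numpy as np

def bok_distribution(p, w, K=5, beta=1.0, lam=1.0, stepsize=0.2, steps=5):
    """Best-of-K style decoder via mirror ascent (illustrative sketch).

    Coverage utility assumed: sum_i w_i * (1 - (1 - q_i)^K), i.e. reward
    token i appearing at least once in K draws, with a KL leash to p.
    """
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    q = p.copy()                                         # start at the model's own distribution
    for _ in range(steps):
        cover_grad = beta * w * K * (1.0 - q) ** (K - 1) # diminishing returns in q_i
        kl_grad = -lam * (np.log(q / p) + 1.0)           # pull back toward p
        logits = np.log(q) + stepsize * (cover_grad + kl_grad)
        logits -= logits.max()                           # log-sum-exp trick
        e = np.exp(logits)
        q = e / e.sum()                                  # softmax-like renormalisation
    return q

p = np.array([0.80, 0.15, 0.05])   # model over-commits to one guess
w = np.array([1.0, 1.0, 1.0])      # hypothetical: all tokens equally worth covering
q = bok_distribution(p, w, K=5)
# q shifts mass toward under-covered alternatives while staying near p.
```

Because the coverage gradient shrinks as (1 − qᵢ)^(K−1), already-likely tokens stop attracting mass, which is exactly the "diminishing returns" behaviour described above.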
04 Experiments & Results
The Test: The authors checked if BoK (the new coverage-anchored decoder) actually helps when you draw several samples and want at least one good answer. They measured task accuracy on three very different benchmarks: MATH500 (math problem solving), GPQA-diamond (hard, "Google-proof" questions), and HumanEval (coding problems). They varied temperature from low (almost deterministic) to high (very exploratory) to see where BoK shines.
The Competition: BoK was compared to two strong baselines used everywhere:
- Base: standard autoregressive sampling from the model distribution with temperature.
- Top-K: restrict to the top 50 tokens, then sample (K=50). Same prompts, same lengths, same seeds: a fair fight.
The Scoreboard (with context):
- On Qwen2.5-Math-7B at very high temperature (τ=0.9), Base accuracy on MATH500 is 53.0%. That's like barely passing. Top-K gets 56.2% (a small bump). BoK jumps to 71.6%, a huge lift (+18.6% over Base, +15.4% over Top-K), like going from a C to a solid A when questions are tricky and randomness is high.
- GPQA and HumanEval show a similar story at higher temperatures: BoK often outperforms both baselines, e.g., +6.06% and +14.64% boosts in those high-entropy regimes. The gains are biggest exactly where vanilla sampling is diverse-but-unreliable.
- On Qwen2.5-7B (general model), BoK also reduces the high-temperature accuracy drop and frequently matches or beats baselines. At lower temperatures (near-deterministic), improvements are smaller; sometimes Base or Top-K are similar. That's expected: when the model is very confident, exploration matters less.
Surprising (and reassuring) findings:
- Robustness: Multiple (β, λ) settings work well; there's a wide stable region, not a brittle "one lucky setting." This matters in practice: you don't want to babysit tuning knobs.
- Speed-cost trade-off: Mirror ascent converges fast. Just 2–5 steps per token already give most of the benefit. On MATH500, BoK adds roughly 1 second over ~16 seconds total: a small cost for big gains. With 2 steps, accuracy jumps from 64.4% to 69.6% (nearly free); with 5 steps, up to 73.0%.
Meaning behind the numbers:
- High temperature means more creativity but also more bad detours. BoK keeps the creativity while stacking the deck so at least one of your K tries lands on something good. That's why you see big lifts at τ=0.7–0.9.
- In low-temperature zones, the model already concentrates probability on strong answers, so there's less room for improvement; BoK still holds its own without much downside.
Bottom line: BoK reliably increases "best-of-several" success by making coverage a first-class goal, delivering higher accuracy especially when exploration is high, with minimal runtime overhead.
05 Discussion & Limitations
Limitations:
- Per-step, not full-sequence: The paper focuses on choosing each next token well; it doesn't yet couple choices across time to enforce sequence-level goals (like total length, global style, or end-to-end coverage across steps).
- Weight choice: BoK needs importance weights (w). If these weights are poor proxies for "good alternatives," the coverage utility can push the distribution in unhelpful directions.
- Trust-region tuning: The KL anchor (λ) balances staying near the model vs. exploring. Extreme settings can under- or over-explore.
- Multi-sample only: If you only ever take one sample, BoK's special advantage (coverage across K) doesn't apply.
- Compute: Mirror ascent adds a small but real overhead. On very tight latency budgets, even +1s can matter.
Required resources:
- Access to the modelās token scores (logits or probabilities) at each step.
- A few extra vector ops per token step for mirror updates (2–5 iterations are often enough).
- Modest memory for per-step distributions and gradients.
When NOT to use:
- Strictly deterministic production flows where every millisecond and exact repeatability are paramount; Greedy or very low-temperature sampling may be better.
- Safety-critical settings where exploration must be minimal and thoroughly vetted.
- Single-shot generation where you never use multiple samples or reranking; BoK's coverage benefit won't shine.
Open questions:
- Sequence-level optimisation: Can we lift the per-step framework to jointly plan over full sentences or solutions?
- Learned or adaptive weights: Can w be learned online from a verifier or task success signal to steer coverage more intelligently?
- Compute-aware utilities: Can we bake reranking/verifier costs right into the objective for smarter budget use?
- Beyond KL: What happens with other divergences or constraints (e.g., group sparsity, tool-use constraints, retrieval-aware supports)?
06 Conclusion & Future Work
Three-sentence summary: Decoding isn't a pile of tricks; it's one optimisation problem over the probability simplex that balances model score with regularisers and constraints. Classic methods like Greedy, Softmax, Top-K, Top-P, and Sparsemax all pop out as exact solutions when you choose the right regulariser/constraint, and harder cases are solved cleanly with mirror ascent. Building on this, Best-of-K (BoK) turns multi-sample coverage into a first-class objective, delivering large gains at high temperatures with little extra compute.
Main achievement: A unifying, geometry-aware framework that both explains old decoders and makes inventing new, better ones (like BoK) straightforward.
Future directions: Extend from per-step to full-sequence objectives, learn coverage weights from verifiers, design compute-aware utilities that encode rerankers or tool calls, and explore richer constraint sets (structured sparsity, retrieval-aware supports).
Why remember this: It replaces heuristics with first principles. Once you see decoding as "score minus regulariser on the simplex," you get a master key: swap in the behaviour you want, and the optimiser gives you the decoder, no folklore required.
Practical Applications
- Boost self-consistency pipelines: Use BoK to sample K diverse, high-quality candidates so the best one is more likely to be present.
- Improve verifier/reranker setups: Feed BoK-generated candidate sets to a judge or verifier to raise final answer accuracy.
- Stabilise high-temperature creative writing: Keep creativity while increasing the chance of at least one coherent, on-topic sample.
- Enhance math/code generation: At higher temperatures, increase the odds that at least one sampled solution compiles or solves the problem.
- Constrain decoding safely: Add constraints (like Top-P supports or validity masks) directly in the optimisation for safer outputs.
- Reduce tail-risk tokens without manual truncation: Use Sparsemax-style regularisers to naturally zero out junk tokens.
- Compute-aware sampling: Spend a fixed K-sample budget on complementary candidates rather than duplicates of the same idea.
- Plug-and-play improvement without retraining: Apply BoK as a decoding-time layer on existing models.
- Adaptive behaviour by task: Swap regularisers (e.g., entropy for variety, KL for conservatism) depending on the application's needs.
- Prototype new decoders quickly: Express desired behaviours as regularisers/constraints and solve with mirror ascent.