
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Intermediate
Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer et al. · 2/20/2026
arXiv

Key Summary

  • Decoding (how a language model picks the next word) isn’t a bag of tricks; it’s a clean optimisation problem over probabilities.
  • Many famous methods—Greedy, Softmax/temperature sampling, Top-K, and Top-P—fall out as exact solutions when you pick different regularisers and constraints.
  • A key idea is to pick a whole distribution first (not a single token), then sample or take the top—this ā€˜distribution-first’ view unifies deterministic and random decoding.
  • Active vs. inactive tokens appear naturally from optimality conditions: some tokens get positive probability; others are pushed to zero by the geometry of the simplex.
  • Softmax is the solution of ā€˜score minus entropy penalty’; Top-K and Top-P are Softmax on smaller, rule-based subsets; Sparsemax gives exact zeros without manual truncation.
  • Mirror ascent is a safe, fast ā€˜climb’ that respects probability geometry, letting us solve harder decoders when no closed form exists.
  • Best-of-K (BoK) is a new coverage-focused regulariser that boosts the chance good answers appear at least once across K samples, anchored by a KL trust region.
  • BoK improves high-temperature performance substantially (e.g., +18.6% on MATH500 at Ļ„=0.9 for Qwen2.5-Math-7B) with only a small runtime cost.
  • This framework turns ā€˜folklore’ into design: decide the behaviour you want (diversity, sparsity, stability), write it as a regulariser/constraint, and solve.
  • Bottom line: decoding is not a hack; it is optimisation, which makes building new decoders easier, clearer, and more reliable.

Why This Research Matters

This work gives us a dependable way to shape how AI speaks, reasons, and explores, replacing ad-hoc tricks with a single, clear optimisation layer. It can make multi-try workflows (like self-consistency and reranking) much more effective by spending samples wisely on distinct, promising alternatives. Developers gain a design language: pick the behaviour (diverse, safe, sparse, close to model) and encode it as a regulariser or constraint, instead of fiddling blindly with knobs. Users benefit from more reliable answers at higher creativity settings, reducing the frustration of repeated, similar wrong attempts. Because the method adds only a small computational overhead, it’s practical for real systems and can improve accuracy without retraining. Over time, this shifts the field from folklore to first-principles engineering, speeding up innovation and making AI behaviour more transparent.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

You know how when you pick snacks, sometimes you always grab the same cookie, and other times you try something new? Old language-model decoding felt like that—flip a few knobs like ā€œTop-Kā€ or ā€œtemperatureā€ and hope the snack (next word) tastes good. People saw these choices as tricks: Greedy here, Top-P there—useful, but not clearly connected. This paper says: stop guessing. There’s a simple principle underneath it all—optimisation.

šŸž Hook: Imagine choosing a class leader every day. Some days you pick the most helpful student (Greedy). Other days you put the top helpers’ names in a hat and draw (Top-K/Top-P). It looks like a bunch of different rules.

🄬 The Concept (Greedy Decoding):

  • What it is: Greedy decoding always picks the single highest-scoring next word.
  • How it works: 1) Score all words; 2) Find the biggest score; 3) Pick that word.
  • Why it matters: Without Greedy, you might overcomplicate easy choices; with only Greedy, you get stuck being too predictable. šŸž Anchor: If the model is sure ā€œParisā€ is best after ā€œcapital of France isā€¦ā€, Greedy simply picks ā€œParis.ā€

Before this paper, the situation was: lots of techniques—Greedy, Softmax sampling, Top-K, Top-P, beam search—felt separate. People tuned them by feel, not from a shared rulebook. The problem: this makes it hard to know which method to use, why it works, or how to create better ones.

Researchers tried things like:

  • Turning the temperature knob (higher = more adventurous, lower = safer), but that alone can swing from boring to chaotic.
  • Truncating with Top-K or Top-P, which helps but still feels like manual pruning, not a principled choice.
  • Heavier search (like beam search), which can be slow and still not capture what we want (e.g., diverse but correct options).

What was missing? A single, simple lens to see all these methods at once. The gap was a unifying principle that explains why each method behaves as it does—and guides the design of new ones on purpose, not by folklore.

šŸž Hook: You know how baking recipes tell you the goal (a soft cookie) and which trade-offs to make (more butter for softness, more flour for structure)?

🄬 The Concept (Regulariser):

  • What it is: A regulariser is a preference or penalty that nudges the probability distribution to have certain shapes (smooth, sparse, stable, close to a reference, etc.).
  • How it works: 1) Start with model scores; 2) Add a ā€œpreferenceā€ term; 3) Balance score vs. preference; 4) Optimise to get a distribution.
  • Why it matters: Without regularisers, the decoder may be too sharp (always the same) or too messy (too random). Regularisers steer behaviour. šŸž Anchor: If you want variety in classmates’ turns, you add a rule ā€œdon’t pick the same kid too often.ā€ That’s a regulariser.

This paper reframes decoding as optimisation over a probability distribution (not a single token) each step. Choose a distribution that balances ā€œpick high-scoring tokensā€ with ā€œfollow desired structureā€ (like smoothness, sparsity, or staying near the model). This viewpoint instantly ties together methods:

  • Greedy = no regulariser.
  • Softmax/temperature = entropy regulariser.
  • Top-K/Top-P = Softmax on a restricted set.
  • Sparsemax = a quadratic penalty that allows true zeros.

šŸž Hook: Imagine a pie chart where each slice is a token’s chance. The whole pie must be 100%.

🄬 The Concept (Probability Simplex):

  • What it is: The set of all valid probability distributions—every slice is nonnegative and all slices add up to 1.
  • How it works: 1) Assign each token a slice size; 2) No slice goes below zero; 3) All slices sum to one full pie.
  • Why it matters: Without the simplex rules, your ā€œprobabilitiesā€ could be impossible (like negative chances or sums above 100%). šŸž Anchor: If ā€œcatā€ has 40%, ā€œdogā€ 30%, and ā€œbirdā€ 30%, that’s a legal pie on the simplex.

Why should anyone care? Because decoding is the doorway between the model and everything we do with it: homework help, coding assistance, science explanations, and more. Poor decoding wastes compute, repeats itself, and misses good answers hiding just off-center. In multi-try setups (like sampling a few answers and picking the best), naive methods overspend on the same guess and underspend on useful alternatives. This paper’s view makes decoding safer, smarter, and easier to improve—like swapping a toolbox of random gadgets for one adjustable power tool that fits the job.

02Core Idea

Aha! In one sentence: Treat decoding as an optimisation problem over a probability distribution that trades off model score against regularisers and constraints on the probability simplex.

Three analogies to lock it in:

  1. Chef’s tasting spoon: You don’t plate the first bite; you first mix flavours (build a distribution), balancing tasty (high score) with house style (regulariser), then serve (sample/mode).
  2. Sports draft: You don’t draft just the single top player; you shape a whole team (distribution) under rules (salary cap = constraints) and philosophy (regulariser), then pick your lineup.
  3. Classroom turns: You want helpful answers (scores), but also fairness and variety (regulariser). You keep rules like ā€œeveryone’s share must be nonnegative and sum to 100%ā€ (simplex), then pick who speaks.

Before vs. After:

  • Before: Many separate tricks—Greedy, Softmax, Top-K, Top-P—felt unrelated.
  • After: One master objective, different regularisers/constraints pick out different decoders exactly. Temperature isn’t a hack—it’s the strength of an entropy penalty. Top-K/Top-P are just constraints that trim the pie before softmaxing.

Why it works (intuition, no equations):

  • The simplex makes you share 100% probability across tokens. The optimality conditions then split tokens into two groups: active (get some pie) and inactive (get zero).
  • Some regularisers (entropy) hate zeros, so they keep everyone slightly positive—this is Softmax.
  • Other regularisers (quadratics like Sparsemax) are okay with zeros, so low-score tokens vanish automatically (no manual Top-K needed).
  • If you want the draws across K samples to ā€œcoverā€ good alternatives at least once, you add a coverage utility. It gives diminishing returns to over-allocating one token and rewards spreading enough mass to see other strong options at least once. Add a KL ā€œstay close to the modelā€ leash to avoid drifting too far—that’s Best-of-K (BoK).

Building blocks (introduced with Sandwich style when needed):

šŸž Hook: Think of a stage show: some performers are on stage (active), some are backstage (inactive). 🄬 The Concept (Active and Inactive Tokens):

  • What it is: Tokens are active if they get positive probability; inactive if they get zero.
  • How it works: 1) Optimise under simplex rules; 2) Active tokens satisfy a balance condition; 3) Inactive ones are pushed to zero by the geometry/penalty.
  • Why it matters: Without this split, we can’t explain sparsity (some tokens truly removed) or smoothness (everyone gets a tiny slice). šŸž Anchor: In Softmax, almost all tokens are ā€œon stage,ā€ even if whispering; in Sparsemax, only a few sing while the rest are silent.

šŸž Hook: You know how ā€˜surprise’ makes a story lively? 🄬 The Concept (Entropy & Softmax Sampling):

  • What it is: Softmax is the distribution that balances score with an entropy bonus (which likes some spread, i.e., surprise).
  • How it works: 1) Start with scores; 2) Add a term that rewards variety; 3) The best balance is Softmax with temperature controlling spread.
  • Why it matters: Without this, you might always pick the same token; with too much, you’re too random. šŸž Anchor: Asking ā€œWhat’s the capital of France?ā€ā€”Softmax gives ā€œParisā€ a big slice, but tiny slices to others, just in case.
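To make this concrete, the entropy-regularised solution is ordinary temperature softmax. A minimal pure-Python sketch (an illustration of the idea, not the paper’s code) showing how temperature controls the spread:

```python
import math

def softmax(scores, temperature=1.0):
    """The entropy-regularised decoder: maximise expected score plus an
    entropy bonus, whose exact solution is softmax with a temperature."""
    # Subtract the max score first (log-sum-exp trick) for numerical safety.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Lower temperature sharpens the pie; higher temperature flattens it.
p_cool = softmax([3.0, 2.0, 0.0], temperature=0.5)  # top slice ~0.88
p_warm = softmax([3.0, 2.0, 0.0], temperature=2.0)  # top slice ~0.55
```

Both results are valid points on the simplex (nonnegative, summing to 1); only the spread changes.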

šŸž Hook: Picking only from your top five snacks is faster than scanning every snack. 🄬 The Concept (Top-K Sampling):

  • What it is: Limit choices to the K best tokens, then apply Softmax only there.
  • How it works: 1) Find top K by score; 2) Zero out others; 3) Renormalise and sample.
  • Why it matters: Without trimming, tiny but bad tokens still get sampled; with too much trimming, you lose useful variety. šŸž Anchor: Choosing a dessert only from your top 5 favourites.
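The trim-then-softmax recipe can be sketched in a few lines (illustrative helper names, not the paper’s implementation):

```python
import math

def top_k_dist(scores, k, temperature=1.0):
    """Softmax restricted to the k highest-scoring tokens; all others get
    exactly zero probability before renormalising."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])                       # support set: top-k tokens
    m = max(scores[i] for i in keep)            # stability shift
    exps = [math.exp((scores[i] - m) / temperature) if i in keep else 0.0
            for i in range(len(scores))]
    total = sum(exps)
    return [e / total for e in exps]

# Scores cat=3, dog=2, bird=0 with k=2: bird is trimmed, rest renormalise
# to roughly [0.73, 0.27, 0.0].
p = top_k_dist([3.0, 2.0, 0.0], k=2)
```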

šŸž Hook: Instead of ā€œfive items,ā€ imagine ā€œenough items to reach a target fullness.ā€ 🄬 The Concept (Top-P/Nucleus Sampling):

  • What it is: Pick the smallest set of tokens whose model probabilities sum to at least P, then Softmax within that set.
  • How it works: 1) Sort tokens by model probability; 2) Accumulate until you hit P; 3) Renormalise and sample.
  • Why it matters: It adapts to confidence—fewer tokens when sure, more when unsure. šŸž Anchor: Fill your backpack until it reaches a weight limit; some days that’s 3 books, other days 7.
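The same idea in code: build the smallest mass-covering set, then renormalise within it (a minimal sketch over the model’s probabilities, not the paper’s code):

```python
def top_p_dist(model_probs, p_target):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    model probability reaches p_target, zero the rest, renormalise."""
    order = sorted(range(len(model_probs)),
                   key=lambda i: model_probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += model_probs[i]
        if mass >= p_target:        # nucleus is full: stop adding tokens
            break
    out = [model_probs[i] if i in keep else 0.0
           for i in range(len(model_probs))]
    total = sum(out)
    return [x / total for x in out]

# Nucleus of [0.6, 0.3, 0.07, 0.03] at p=0.85 is {0.6, 0.3}: the tail drops.
q = top_p_dist([0.6, 0.3, 0.07, 0.03], p_target=0.85)
```

Note how the support size adapts: a confident distribution fills the nucleus with few tokens, an uncertain one needs more.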

šŸž Hook: Sometimes you want only a few strong choices, not a crowd. 🄬 The Concept (Sparsemax):

  • What it is: A softmax-like method that naturally puts exact zeros on weak tokens.
  • How it works: 1) Use a quadratic regulariser; 2) Optimise; 3) Low-score tokens drop to zero automatically.
  • Why it matters: It avoids the softmax ā€œlong tailā€ that can pick weird tokens. šŸž Anchor: Sharing candy with only a few kids so they each get enough to enjoy.
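One standard closed-form computation of sparsemax (the Euclidean projection of the scores onto the simplex) fits in a short function; this is a general sketch of that projection, not code from the paper:

```python
def sparsemax(scores):
    """Quadratic-regularised decoder: project the score vector onto the
    simplex. Tokens below the learned threshold tau get exactly zero."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1 + j * zj > cumsum:        # token j is still in the support
            tau = (cumsum - 1) / j     # threshold for support of size j
    return [max(s - tau, 0.0) for s in scores]

# With scores [3, 2.5, 0] the weakest token is zeroed automatically:
p = sparsemax([3.0, 2.5, 0.0])   # -> [0.75, 0.25, 0.0]
```

No manual K or P: the zeros fall out of the optimisation itself.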

šŸž Hook: Compare your current recipe to a trusted one. 🄬 The Concept (KL Divergence):

  • What it is: A measure of how different one distribution is from a reference.
  • How it works: 1) Take your proposed probabilities; 2) Compare to the model’s; 3) Penalise being too far.
  • Why it matters: Without KL anchoring, a new decoder might drift away from what the model believes. šŸž Anchor: If the chef’s new sauce is wildly different from the house sauce, KL says ā€œtoo far—dial it back.ā€
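The leash itself is a one-line sum; a minimal sketch (using the usual convention that zero-probability terms contribute nothing):

```python
import math

def kl_divergence(q, p):
    """KL(q || p): how far a proposed distribution q has drifted from the
    reference p. Zero when identical, and always nonnegative."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Identical recipes: no penalty. A drifted recipe: positive penalty.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
drift = kl_divergence([0.9, 0.1], [0.5, 0.5])
```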

All together, the master recipe is: choose a distribution that maximises ā€œexpected score minus a regulariser,ā€ optionally inside a constraint set (like Top-K/P). That single idea turns a shelf of tricks into plug-and-play design.

03Methodology

At a high level: Model scores → Pick a probability distribution by solving an optimisation → Sample (or take the most likely) → Next token.

Step-by-step, like a recipe:

  1. Collect ingredients (scores): For each token, the model gives a score (higher means more suitable next word).
     • Why this step exists: You need a notion of ā€œgoodness.ā€ Without scores, you can’t rank tokens.
     • Example: Scores: cat=3, dog=2, bird=0.
  2. Decide to choose a distribution first, not a single token.
     • Why: It unifies deterministic (Greedy) and stochastic (sampling) methods and lets you encode preferences as regularisers.
     • Example: Instead of jumping to ā€œcat,ā€ you first decide cat 0.6, dog 0.35, bird 0.05.
  3. Optimise ā€œscore minus regulariser,ā€ on the simplex, possibly with constraints.
     • Why: The simplex keeps probabilities valid; regularisers shape behaviour; constraints cut out unwanted tokens.
     • Example:
       • Greedy: no regulariser, so all mass on the max (cat).
       • Softmax: entropy regulariser; everyone gets a slice, bigger for higher scores.
       • Top-K: force zero outside top K, then Softmax on the rest.
       • Top-P: choose the smallest mass-covering set, then Softmax.
       • Sparsemax: quadratic penalty; weak tokens become exact zero.
  4. Sample from the distribution (or take the mode) to produce the next token.
     • Why: This turns the optimised distribution into an actual word.
     • Example: With probabilities [0.6, 0.35, 0.05], you usually get cat, sometimes dog, rarely bird.

Detailing each classic decoder under the master view:

  • Greedy = Ī»=0 (no regulariser). Active tokens must tie for the top score; solution puts all mass on the max. Without this, ā€œdeterministicā€ decoding wouldn’t be explained by the same principle.
  • Softmax = negative entropy regulariser. It dislikes zeros, so all tokens stay slightly positive. Without entropy, you’d be too peaky (or undefined when ties happen).
  • Top-K/Top-P = same Softmax but on a trimmed support set. Without constraints, tiny long-tail tokens still distract; with constraints, you focus on the right region.
  • Sparsemax = quadratic regulariser that allows zeros. Without it, you’d need manual truncation to get true sparsity.

šŸž Hook: Walking on Earth using a flat map can mislead your steps. 🄬 The Concept (Mirror Ascent):

  • What it is: An optimisation method that takes ā€œgeometry-awareā€ steps on the probability simplex using a divergence (like KL), not Euclidean distance.
  • How it works: 1) Start at a safe distribution (often the model’s own); 2) Compute gradients of your objective; 3) Do a multiplicative update that stays on the simplex; 4) Repeat a few times.
  • Why it matters: Plain projected gradients can fight the simplex at the edges; mirror ascent keeps probabilities valid and stable. šŸž Anchor: It’s like climbing a hill with shoes designed for that terrain—you slip less and get to the top faster.
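The multiplicative step is short in code. This sketch uses the classic exponentiated-gradient form of mirror ascent under KL geometry (an illustration of the update, not the paper’s implementation):

```python
import math

def mirror_ascent_step(q, grad, step):
    """One mirror-ascent step with KL geometry: a multiplicative update
    q_i <- q_i * exp(step * grad_i), then renormalise. The iterate never
    leaves the simplex, so no projection is needed."""
    m = max(grad)  # shift gradients (log-sum-exp style) for numerical safety
    new = [qi * math.exp(step * (g - m)) for qi, g in zip(q, grad)]
    total = sum(new)
    return [x / total for x in new]

# Climb toward the objective <q, s> for scores s = [3, 2, 0], starting safe
# at the uniform distribution. Mass concentrates on the best token.
q = [1 / 3, 1 / 3, 1 / 3]
scores = [3.0, 2.0, 0.0]
for _ in range(50):
    q = mirror_ascent_step(q, scores, step=0.5)
# q approaches the greedy vertex [1, 0, 0] while staying a valid distribution
```

With a fixed gradient this recovers Greedy in the limit; the interesting cases are objectives (like BoK’s) whose gradient changes with q.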

Secret Sauce:

  • Geometry-aware updates (mirror ascent) + a flexible regulariser. You can write a new behaviour (e.g., coverage across K samples), then solve it stably on the simplex, even without a closed-form formula.

Concrete mini-example (3 tokens): scores [3,2,0].

  • Greedy: [1, 0, 0].
  • Softmax (temperature=1): roughly [0.71, 0.26, 0.04].
  • Top-2 Softmax: zero ā€œbird,ā€ renormalise to [0.73, 0.27, 0].
  • Sparsemax: on these scores all mass lands on ā€œcatā€ ([1, 0, 0]); with closer scores such as [3, 2.5, 0] it gives [0.75, 0.25, 0], zeroing the weakest token exactly.

Now the new design: Best-of-K (BoK)

šŸž Hook: If you try opening a combination lock K times, it’s wise to spread guesses so at least one hits. 🄬 The Concept (Best-of-K / BoK):

  • What it is: A coverage-focused regulariser that boosts the chance good options appear at least once across K samples, while staying near the model with a KL anchor.
  • How it works: 1) Define a ā€œhit at least onceā€ utility for each token across K tries; 2) Weight important tokens more; 3) Add a KL leash to stay close to the model; 4) Optimise with mirror ascent.
  • Why it matters: Without BoK, many samples can repeat the same guess; with BoK, you spend your K-budget to cover strong alternatives. šŸž Anchor: Trying five different problem-solving steps and keeping the best—BoK raises the chance one of them is the winner.

Diminishing returns makes BoK smart: once a token is likely to appear at least once, BoK prefers to invest in other good tokens that aren’t covered yet. The KL term prevents drifting too far from the model’s beliefs. Practically, you run just a few mirror-ascent steps per token (like 2–5), which converge quickly and add only small overhead.

Implementation sketch for BoK per token step:

  • Inputs: scores s, reference model distribution p, nonnegative weights w (importance), hyperparameters (K, Ī» for KL strength, β for coverage strength), stepsize, and number of mirror steps.
  • Loop J times: compute gradient of [score āˆ’ Ī» KL + β coverage], apply multiplicative mirror update, renormalise (a softmax-like normalisation), use a log-sum-exp trick to stay numerically safe.
  • Output: the BoK distribution q*, then sample or take the mode.
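The sketch above can be written out as a short loop. This is a hedged illustration, not the authors’ code: it assumes the ā€œhit at least onceā€ utility takes the form 1 āˆ’ (1 āˆ’ q_i)^K, and the hyperparameter names (K, lam, beta, step, n_steps) are chosen here for readability.

```python
import math

def bok_distribution(scores, p_ref, w, K=5, lam=1.0, beta=1.0,
                     step=0.1, n_steps=5):
    """Illustrative BoK-style decoder: a few mirror-ascent steps on
        <q, scores> + beta * sum_i w_i * (1 - (1 - q_i)^K) - lam * KL(q || p_ref).
    The coverage term has diminishing returns, so mass spreads across
    strong alternatives; the KL term keeps q near the model's p_ref."""
    q = list(p_ref)  # start at the model's own distribution (a safe anchor)
    for _ in range(n_steps):
        # Gradient of [score - lam * KL + beta * coverage] w.r.t. q_i.
        grad = [s - lam * (math.log(qi / pi) + 1.0)
                + beta * wi * K * (1.0 - qi) ** (K - 1)
                for s, qi, pi, wi in zip(scores, q, p_ref, w)]
        m = max(grad)                     # log-sum-exp shift for stability
        q = [qi * math.exp(step * (g - m)) for qi, g in zip(q, grad)]
        total = sum(q)                    # softmax-like renormalisation
        q = [qi / total for qi in q]
    return q

q_bok = bok_distribution(scores=[3.0, 2.0, 0.0],
                         p_ref=[0.7, 0.2, 0.1],
                         w=[1.0, 1.0, 1.0])
```

Because the multiplicative update starts from p_ref and renormalises each round, q stays a strictly positive, valid distribution throughout; with uniform weights, coverage shifts mass toward under-covered runner-up tokens.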

04Experiments & Results

The Test: The authors checked if BoK (the new coverage-anchored decoder) actually helps when you draw several samples and want at least one good answer. They measured task accuracy on three very different benchmarks: MATH500 (math problem solving), GPQA-diamond (hard, ā€œGoogle-proofā€ questions), and HumanEval (coding problems). They varied temperature from low (almost deterministic) to high (very exploratory) to see where BoK shines.

The Competition: BoK was compared to two strong baselines used everywhere:

  • Base: standard autoregressive sampling from the model distribution with temperature.
  • Top-K: restrict to the top 50 tokens, then sample (K=50). Same prompts, same lengths, same seeds—fair fight.

The Scoreboard (with context):

  • On Qwen2.5-Math-7B at very high temperature (Ļ„=0.9), Base accuracy on MATH500 is 53.0%. That’s like barely passing. Top-K gets 56.2% (a small bump). BoK jumps to 71.6%—that’s a huge lift (+18.6% over Base, +15.4% over Top-K), like going from a C to a solid A when questions are tricky and randomness is high.
  • GPQA and HumanEval show a similar story at higher temperatures: BoK often outperforms both baselines, e.g., +6.06% and +14.64% boosts in those high-entropy regimes. The gains are biggest exactly where vanilla sampling is diverse-but-unreliable.
  • On Qwen2.5-7B (general model), BoK also reduces the high-temperature accuracy drop and frequently matches or beats baselines. At lower temperatures (near-deterministic), improvements are smaller; sometimes Base or Top-K are similar. That’s expected—when the model is very confident, exploration matters less.

Surprising (and reassuring) findings:

  • Robustness: Multiple (β, Ī») settings work well—there’s a wide stable region, not a brittle ā€œone lucky setting.ā€ This matters in practice: you don’t want to babysit tuning knobs.
  • Speed-cost trade-off: Mirror ascent converges fast. Just 2–5 steps per token already give most of the benefit. On MATH500, BoK adds roughly 1 second over ~16 seconds total—small cost for big gains. With 2 steps, accuracy jumps from 64.4% to 69.6% (nearly free); with 5 steps, up to 73.0%.

Meaning behind the numbers:

  • High temperature means more creativity but also more bad detours. BoK keeps the creativity while stacking the deck so at least one of your K tries lands on something good. That’s why you see big lifts at Ļ„=0.7–0.9.
  • In low-temperature zones, the model already concentrates probability on strong answers, so there’s less room for improvement—BoK still holds its own without much downside.

Bottom line: BoK reliably increases ā€œbest-of-severalā€ success by making coverage a first-class goal, delivering higher accuracy especially when exploration is high, with minimal runtime overhead.

05Discussion & Limitations

Limitations:

  • Per-step, not full-sequence: The paper focuses on choosing each next token well; it doesn’t yet couple choices across time to enforce sequence-level goals (like total length, global style, or end-to-end coverage across steps).
  • Weight choice: BoK needs importance weights (w). If these weights are poor proxies for ā€œgood alternatives,ā€ the coverage utility can push the distribution in unhelpful directions.
  • Trust-region tuning: The KL anchor (Ī») balances staying near the model vs. exploring. Extreme settings can under- or over-explore.
  • Multi-sample only: If you only ever take one sample, BoK’s special advantage (coverage across K) doesn’t apply.
  • Compute: Mirror ascent adds a small but real overhead. On very tight latency budgets, even +1s can matter.

Required resources:

  • Access to the model’s token scores (logits or probabilities) at each step.
  • A few extra vector ops per token step for mirror updates (2–5 iterations is often enough).
  • Modest memory for per-step distributions and gradients.

When NOT to use:

  • Strictly deterministic production flows where every millisecond and exact repeatability are paramount—Greedy or very low-temperature may be better.
  • Safety-critical settings where exploration must be minimal and thoroughly vetted.
  • Single-shot generation where you never use multiple samples or reranking—BoK’s coverage benefit won’t shine.

Open questions:

  • Sequence-level optimisation: Can we lift the per-step framework to jointly plan over full sentences or solutions?
  • Learned or adaptive weights: Can w be learned online from a verifier or task success signal to steer coverage more intelligently?
  • Compute-aware utilities: Can we bake reranking/verifier costs right into the objective for smarter budget use?
  • Beyond KL: What happens with other divergences or constraints (e.g., group sparsity, tool-use constraints, retrieval-aware supports)?

06Conclusion & Future Work

Three-sentence summary: Decoding isn’t a pile of tricks—it’s one optimisation problem over the probability simplex that balances model score with regularisers and constraints. Classic methods like Greedy, Softmax, Top-K, Top-P, and Sparsemax all pop out as exact solutions when you choose the right regulariser/constraint, and harder cases are solved cleanly with mirror ascent. Building on this, Best-of-K (BoK) turns multi-sample coverage into a first-class objective, delivering large gains at high temperatures with little extra compute.

Main achievement: A unifying, geometry-aware framework that both explains old decoders and makes inventing new, better ones (like BoK) straightforward.

Future directions: Extend from per-step to full-sequence objectives, learn coverage weights from verifiers, design compute-aware utilities that encode rerankers or tool calls, and explore richer constraint sets (structured sparsity, retrieval-aware supports).

Why remember this: It replaces heuristics with first principles. Once you see decoding as ā€œscore minus regulariser on the simplex,ā€ you get a master key: swap in the behaviour you want, and the optimiser gives you the decoder—no folklore required.

Practical Applications

  • Boost self-consistency pipelines: Use BoK to sample K diverse, high-quality candidates so the best one is more likely to be present.
  • Improve verifier/reranker setups: Feed BoK-generated candidate sets to a judge or verifier to raise final answer accuracy.
  • Stabilise high-temperature creative writing: Keep creativity while increasing the chance of at least one coherent, on-topic sample.
  • Enhance math/code generation: At higher temperatures, increase the odds that at least one sampled solution compiles or solves the problem.
  • Constrain decoding safely: Add constraints (like Top-P supports or validity masks) directly in the optimisation for safer outputs.
  • Reduce tail-risk tokens without manual truncation: Use Sparsemax-style regularisers to naturally zero out junk tokens.
  • Compute-aware sampling: Spend a fixed K-sample budget on complementary candidates rather than duplicates of the same idea.
  • Plug-and-play improvement without retraining: Apply BoK as a decoding-time layer on existing models.
  • Adaptive behaviour by task: Swap regularisers (e.g., entropy for variety, KL for conservatism) depending on the application’s needs.
  • Prototype new decoders quickly: Express desired behaviours as regularisers/constraints and solve with mirror ascent.
#decoding as optimisation#probability simplex#softmax sampling#top-k#top-p nucleus sampling#sparsemax#mirror ascent#KL divergence#regularisation#coverage objective#best-of-k#self-consistency#reranking#temperature sampling#active/inactive tokens