Top-p Sampling (Nucleus Sampling)
A method for controlling AI output randomness by only considering the smallest set of tokens whose combined probability exceeds a threshold p.
Top-p sampling (also called nucleus sampling) is a method for controlling the randomness of AI text generation. Instead of considering every possible next token, the model only considers the smallest set of tokens whose cumulative probability exceeds a threshold p, and samples from that set.
How top-p works
When generating each token, the model calculates probabilities for every word in its vocabulary. Top-p filtering:
- Sorts tokens by probability from highest to lowest.
- Adds probabilities cumulatively until the running total exceeds p.
- Discards all remaining tokens.
- Samples randomly from the kept tokens, after renormalising their probabilities so they sum to 1.
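The four steps above can be sketched in a few lines of Python. This is a minimal illustration (the function names and use of NumPy are my own, not a specific library's API):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p, then renormalise the survivors to sum to 1.
    `probs` is a 1-D array of token probabilities summing to 1."""
    order = np.argsort(probs)[::-1]           # step 1: sort high -> low
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)      # step 2: running total
    # First position where the running total exceeds p; everything up to
    # and including it forms the "nucleus", the rest is discarded (step 3).
    cutoff = np.searchsorted(cumulative, p, side="right") + 1
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return kept, kept_probs

def sample_top_p(probs, p=0.9, rng=np.random.default_rng()):
    """Step 4: sample one token index from the renormalised nucleus."""
    kept, kept_probs = top_p_filter(np.asarray(probs), p)
    return int(rng.choice(kept, p=kept_probs))
```

With the distribution from the example below (0.5, 0.3, 0.15, 0.05) and p = 0.9, `top_p_filter` keeps the first three tokens.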
For example, with top-p = 0.9:
- If the top 3 tokens have probabilities 0.5, 0.3, and 0.15 (cumulative: 0.95), only these three are considered.
- The remaining thousands of tokens are excluded.
- The model samples from these three based on their relative probabilities.
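The renormalisation in this example works out as follows (the token strings are hypothetical, chosen only to make the arithmetic concrete):

```python
# The three surviving tokens; cumulative probability 0.95 > 0.9.
probs = {"the": 0.5, "a": 0.3, "an": 0.15}

# Renormalise so the kept probabilities sum to 1.
total = sum(probs.values())                      # 0.95
renormalised = {tok: p / total for tok, p in probs.items()}
# "the" -> 0.5 / 0.95 ~ 0.526, "a" -> ~ 0.316, "an" -> ~ 0.158
```

Note that the tokens keep the same relative ordering and ratios; only the excluded tail's probability mass is redistributed.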
Why top-p is useful
Top-p adapts dynamically to the model's confidence:
- When the model is highly confident (one token has 95% probability), top-p restricts to just that token, behaving like low temperature.
- When the model is uncertain (many tokens share probability), top-p allows more variety, preserving creative options.
This adaptive behaviour is why top-p often produces more natural text than temperature alone.
Top-p vs temperature
- Temperature scales the entire probability distribution. Low temperature sharpens all peaks; high temperature flattens everything.
- Top-p removes the tail of unlikely tokens while preserving the relative probabilities of likely ones.
In practice:
- Use temperature when you want uniform control over randomness.
- Use top-p when you want the model to be creative where it is uncertain but decisive where it is confident.
- Most practitioners adjust one and leave the other at default. Adjusting both simultaneously can produce unpredictable results.
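The contrast can be made concrete: temperature rescales the logits before the softmax, reshaping the whole distribution, whereas top-p only truncates its tail. A rough sketch (illustrative values, not any particular model's logits):

```python
import numpy as np

def apply_temperature(logits, t):
    """Divide every logit by t, then softmax. Low t sharpens the
    distribution; high t flattens it. Affects every token's probability."""
    scaled = np.asarray(logits) / t
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.0, -1.0])
low  = apply_temperature(logits, 0.5)   # sharper: mass concentrates on token 0
mid  = apply_temperature(logits, 1.0)   # unchanged distribution
high = apply_temperature(logits, 2.0)   # flatter: mass spreads toward the tail
```

Top-p, by contrast, leaves the ratios between the surviving tokens exactly as they were; it only drops the unlikely tail and renormalises.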
Common settings
- Top-p = 1.0: No filtering. All tokens are considered. This is the default for most providers.
- Top-p = 0.9: A common "quality" setting. Removes only the very unlikely tokens.
- Top-p = 0.5: More focused. Only the most probable tokens are considered.
- Top-p = 0.1: Very restrictive. Nearly deterministic output.
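You can see how these settings translate into nucleus sizes by counting surviving tokens at each threshold. The Zipf-like distribution below is a made-up stand-in for a real model's next-token probabilities:

```python
import numpy as np

def nucleus_size(probs, p):
    """Number of tokens kept by top-p filtering at threshold p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(min(np.searchsorted(cumulative, p, side="right") + 1, len(probs)))

# A hypothetical skewed distribution over a 1000-token vocabulary.
ranks = np.arange(1, 1001)
weights = 1.0 / ranks
probs = weights / weights.sum()

sizes = {p: nucleus_size(probs, p) for p in (1.0, 0.9, 0.5, 0.1)}
# Smaller p keeps a smaller nucleus; at p = 0.1 only the top token
# survives here, because it alone carries more than 10% of the mass.
```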
Other sampling methods
- Top-k sampling: Considers only the k most likely tokens regardless of their probabilities.
- Min-p sampling: A newer method that filters tokens below a minimum probability relative to the top token.
- Greedy decoding: Always picks the most likely token (equivalent to temperature 0 or top-p approaching 0).
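For comparison, the three alternatives above can be sketched in the same style (again illustrative, not a library API):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, whatever their probabilities."""
    order = np.argsort(probs)[::-1][:k]
    return order, probs[order] / probs[order].sum()

def min_p_filter(probs, min_p=0.05):
    """Keep tokens whose probability is at least min_p times the
    top token's probability, then renormalise."""
    keep = np.where(probs >= min_p * probs.max())[0]
    return keep, probs[keep] / probs[keep].sum()

def greedy(probs):
    """Always pick the single most likely token."""
    return int(np.argmax(probs))
```

Unlike top-p, top-k keeps a fixed number of tokens regardless of how probability is distributed, while min-p adapts the cutoff to the model's confidence in its top choice.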
Why This Matters
Top-p sampling gives you fine-grained control over AI output diversity. Understanding it, alongside temperature, lets you tune AI output for different use cases, from deterministic data extraction to creative brainstorming, using just two parameters.
Continue learning in Advanced
This topic is covered in our lesson: Advanced Inference Parameters