Top-p Sampling (Nucleus Sampling)
A method for controlling AI output randomness by only considering the smallest set of tokens whose combined probability exceeds a threshold p.
Top-p sampling (also called nucleus sampling) is a method for controlling the randomness of AI text generation. Instead of considering every possible next token, the model only considers the smallest set of tokens whose cumulative probability exceeds a threshold p, and samples from that set.
How top-p works
When generating each token, the model calculates probabilities for every word in its vocabulary. Top-p filtering:
- Sorts tokens by probability from highest to lowest.
- Adds probabilities cumulatively until the running total exceeds p.
- Discards all remaining tokens.
- Samples randomly from the kept tokens, after renormalising their probabilities so they sum to 1.
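The four steps above can be sketched in a few lines of Python. This is a minimal illustration (the function names and use of NumPy are my own, not a specific library's API):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p, then renormalise the survivors to sum to 1.
    `probs` is a 1-D array of token probabilities summing to 1."""
    order = np.argsort(probs)[::-1]           # step 1: sort high -> low
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)      # step 2: running total
    # First position where the running total exceeds p; everything up to
    # and including it forms the "nucleus", the rest is discarded (step 3).
    cutoff = np.searchsorted(cumulative, p, side="right") + 1
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return kept, kept_probs

def sample_top_p(probs, p=0.9, rng=np.random.default_rng()):
    """Step 4: sample one token index from the renormalised nucleus."""
    kept, kept_probs = top_p_filter(np.asarray(probs), p)
    return int(rng.choice(kept, p=kept_probs))
```

With the distribution from the example below (0.5, 0.3, 0.15, 0.05) and p = 0.9, `top_p_filter` keeps the first three tokens.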
For example, with top-p = 0.9:
- If the top 3 tokens have probabilities 0.5, 0.3, and 0.15 (cumulative: 0.95), only these three are considered.
- The remaining thousands of tokens are excluded.
- The model samples from these three based on their relative probabilities.
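The renormalisation in this example works out as follows (the token strings are hypothetical, chosen only to make the arithmetic concrete):

```python
# The three surviving tokens; cumulative probability 0.95 > 0.9.
probs = {"the": 0.5, "a": 0.3, "an": 0.15}

# Renormalise so the kept probabilities sum to 1.
total = sum(probs.values())                      # 0.95
renormalised = {tok: p / total for tok, p in probs.items()}
# "the" -> 0.5 / 0.95 ~ 0.526, "a" -> ~ 0.316, "an" -> ~ 0.158
```

Note that the tokens keep the same relative ordering and ratios; only the excluded tail's probability mass is redistributed.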
Why top-p is useful
Top-p adapts dynamically to the model's confidence:
- When the model is highly confident (one token has 95% probability), top-p restricts to just that token, behaving like low temperature.
- When the model is uncertain (many tokens share probability), top-p allows more variety, preserving creative options.
This adaptive behaviour is why top-p often produces more natural text than temperature alone.
Top-p vs temperature
- Temperature scales the entire probability distribution. Low temperature sharpens all peaks; high temperature flattens everything.
- Top-p removes the tail of unlikely tokens while preserving the relative probabilities of likely ones.
In practice:
- Use temperature when you want uniform control over randomness.
- Use top-p when you want the model to be creative where it is uncertain but decisive where it is confident.
- Most practitioners adjust one and leave the other at default. Adjusting both simultaneously can produce unpredictable results.
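The contrast can be made concrete: temperature rescales the logits before the softmax, reshaping the whole distribution, whereas top-p only truncates its tail. A rough sketch (illustrative values, not any particular model's logits):

```python
import numpy as np

def apply_temperature(logits, t):
    """Divide every logit by t, then softmax. Low t sharpens the
    distribution; high t flattens it. Affects every token's probability."""
    scaled = np.asarray(logits) / t
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.0, -1.0])
low  = apply_temperature(logits, 0.5)   # sharper: mass concentrates on token 0
mid  = apply_temperature(logits, 1.0)   # unchanged distribution
high = apply_temperature(logits, 2.0)   # flatter: mass spreads toward the tail
```

Top-p, by contrast, leaves the ratios between the surviving tokens exactly as they were; it only drops the unlikely tail and renormalises.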
Common settings
- Top-p = 1.0: No filtering. All tokens are considered. This is the default for most providers.
- Top-p = 0.9: A common "quality" setting. Removes only the very unlikely tokens.
- Top-p = 0.5: More focused. Only the most probable tokens are considered.
- Top-p = 0.1: Very restrictive. Nearly deterministic output.
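You can see how these settings translate into nucleus sizes by counting surviving tokens at each threshold. The Zipf-like distribution below is a made-up stand-in for a real model's next-token probabilities:

```python
import numpy as np

def nucleus_size(probs, p):
    """Number of tokens kept by top-p filtering at threshold p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(min(np.searchsorted(cumulative, p, side="right") + 1, len(probs)))

# A hypothetical skewed distribution over a 1000-token vocabulary.
ranks = np.arange(1, 1001)
weights = 1.0 / ranks
probs = weights / weights.sum()

sizes = {p: nucleus_size(probs, p) for p in (1.0, 0.9, 0.5, 0.1)}
# Smaller p keeps a smaller nucleus; at p = 0.1 only the top token
# survives here, because it alone carries more than 10% of the mass.
```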
Other sampling methods
- Top-k sampling: Considers only the k most likely tokens regardless of their probabilities.
- Min-p sampling: A newer method that filters tokens below a minimum probability relative to the top token.
- Greedy decoding: Always picks the most likely token (equivalent to temperature 0 or top-p approaching 0).
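For comparison, the three alternatives above can be sketched in the same style (again illustrative, not a library API):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, whatever their probabilities."""
    order = np.argsort(probs)[::-1][:k]
    return order, probs[order] / probs[order].sum()

def min_p_filter(probs, min_p=0.05):
    """Keep tokens whose probability is at least min_p times the
    top token's probability, then renormalise."""
    keep = np.where(probs >= min_p * probs.max())[0]
    return keep, probs[keep] / probs[keep].sum()

def greedy(probs):
    """Always pick the single most likely token."""
    return int(np.argmax(probs))
```

Unlike top-p, top-k keeps a fixed number of tokens regardless of how probability is distributed, while min-p adapts the cutoff to the model's confidence in its top choice.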
Why This Matters
Top-p sampling gives you fine-grained control over AI output diversity. Understanding it, alongside temperature, lets you tune AI output for different use cases, from deterministic data extraction to creative brainstorming, using just two parameters.
Continue learning in Advanced
This topic is covered in our lesson: Advanced Inference Parameters