
Test-Time Compute

Last reviewed: April 2026

Additional computation spent during inference — such as generating and evaluating multiple responses — to improve the quality of an AI model's output.

Test-time compute refers to additional computational resources spent during inference (when the model is generating a response) to improve output quality. Instead of generating a single answer and returning it, the model spends more time thinking, exploring alternatives, and verifying its reasoning.

The core idea

Traditionally, AI model quality was determined entirely during training. Once trained, the model's capabilities were fixed — it spent the same compute on every query regardless of difficulty. Test-time compute breaks this assumption by allowing the model to spend more effort on harder problems.

How test-time compute works

Several techniques fall under the test-time compute umbrella:

  • Best-of-N sampling: Generate N different responses and select the best one using a verifier or scoring function. More responses (more compute) mean a better chance of finding an excellent answer.
  • Chain-of-thought reasoning: The model generates intermediate reasoning steps before producing a final answer. More reasoning tokens mean more compute and often better results.
  • Tree search: The model explores multiple reasoning paths (like a chess engine exploring different move sequences) and selects the most promising one.
  • Self-verification: The model generates an answer, then checks its own work, potentially revising multiple times.
  • Majority voting: Generate multiple answers and take the most common one. Simple but effective for factual questions.
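To make the first and last of these techniques concrete, here is a minimal, runnable sketch of best-of-N sampling and majority voting. The `generate` and `score` functions are stand-ins introduced for illustration: in a real system, `generate` would sample from a model and `score` would be a learned verifier or a programmatic checker.

```python
import random
from collections import Counter

random.seed(0)


def generate(prompt: str) -> str:
    """Stand-in for one sampled model response.

    Draws from a fixed answer pool so the sketch runs without a model;
    most samples are correct ("4"), some are not.
    """
    return random.choice(["4", "4", "4", "5", "3"])


def score(prompt: str, answer: str) -> float:
    """Stand-in verifier: 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if answer == "4" else 0.0


def best_of_n(prompt: str, n: int) -> str:
    """Best-of-N: sample n candidates, return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))


def majority_vote(prompt: str, n: int) -> str:
    """Majority voting: sample n candidates, return the most common one."""
    candidates = [generate(prompt) for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]


print(best_of_n("What is 2 + 2?", 8))
print(majority_vote("What is 2 + 2?", 8))
```

Both strategies spend n times the compute of a single sample; best-of-N needs a scoring function, while majority voting needs only that correct answers agree more often than incorrect ones do.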

Why test-time compute matters now

OpenAI's o1 and o3 models demonstrated that significant performance gains are possible by spending more compute at inference time. These models "think" for longer on difficult problems, producing reasoning traces that can be dozens of times longer than the final answer. The result is substantially better performance on maths, coding, and reasoning benchmarks.

The scaling implications

Test-time compute creates a new scaling axis. Previously, the main lever for better performance was training-time scaling — a larger model trained on more data — which is expensive and takes months. Test-time compute lets you improve performance on demand by spending more at inference time. You can even adapt the compute budget to the difficulty: easy questions get quick answers, hard questions get extended reasoning.
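Adapting the budget to difficulty can be sketched in a few lines. Everything here is an assumption for illustration: `difficulty` uses prompt length as a toy proxy, whereas a real system might use a lightweight classifier or the model's own uncertainty, and the token numbers are arbitrary.

```python
def difficulty(prompt: str) -> float:
    """Toy difficulty estimate in [0, 1]: longer prompt = harder.

    A real router would use a learned signal, not length.
    """
    return min(len(prompt) / 200, 1.0)


def compute_budget(prompt: str, base_tokens: int = 256,
                   max_tokens: int = 8192) -> int:
    """Scale the reasoning-token budget linearly with estimated difficulty."""
    d = difficulty(prompt)
    return int(base_tokens + d * (max_tokens - base_tokens))


print(compute_budget("What is 2 + 2?"))        # near the base budget
print(compute_budget("Prove that ... " * 30))  # capped at the max budget
```

The point is the shape of the policy, not the numbers: cheap queries stay cheap, and the expensive extended-reasoning path is reserved for inputs that look hard.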

Trade-offs

  • Cost: More compute per query means higher per-request costs.
  • Latency: Thinking longer means waiting longer for a response.
  • Diminishing returns: Beyond a certain point, additional compute yields minimal improvement.
  • Not universally beneficial: Some tasks (simple classification, extraction) do not benefit from additional reasoning.
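The diminishing-returns point can be made quantitative with a toy model: if each independent sample solves the problem with probability p and a perfect verifier accepts any correct sample, the chance of success with N samples is 1 − (1 − p)^N. This is an idealized assumption (real samples are correlated and verifiers are imperfect), but it shows why each doubling of compute buys less.

```python
# Success probability of best-of-N under independent samples and a
# perfect verifier: 1 - (1 - p)**n. Note how the gain per doubling shrinks.
p = 0.3  # assumed per-sample success rate
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(1 - (1 - p) ** n, 3))
# 1 -> 0.3, 2 -> 0.51, 4 -> 0.76, 8 -> 0.942, 16 -> 0.997, 32 -> 1.0
```

Going from 1 to 4 samples adds about 46 percentage points; going from 16 to 32 adds under one — the extra compute is nearly wasted.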

Why This Matters

Test-time compute represents a fundamental shift in how AI performance is scaled. Understanding it helps you evaluate new "reasoning" models, anticipate the cost and latency trade-offs of advanced AI features, and decide when paying for additional compute per query is justified by better results.
