Inference

Last reviewed: April 2026

The process of an AI model generating output from your input. Every time you send a prompt and get a response, that is inference.

Inference is what happens when a trained AI model processes your input and produces output. Type a question into ChatGPT and get an answer: that is inference. Ask Claude to summarise a document, write an email, or analyse data: inference again.

Training and inference are the two fundamental phases of AI:

  • Training is the learning phase. It happens once (or periodically) and requires enormous computing resources. It produces the model.
  • Inference is the using phase. It happens every time someone sends a prompt. Each individual request needs far less computing power than training did, but at scale, inference costs add up quickly.

How inference works

When you send a prompt to an LLM, here is what happens, with each generated token taking only milliseconds:

  1. Tokenisation: Your text is broken into tokens — the atomic units the model works with. The sentence "How do I improve my marketing?" becomes approximately 8 tokens.
  2. Processing: The tokens pass through the model's neural network layers. Each layer transforms the representation, applying the patterns learned during training.
  3. Generation: The model predicts the most probable next token, appends it, then predicts the next, and the next. This continues until the model produces a complete response or hits a length limit.

This process is called autoregressive generation — the model generates one token at a time, each informed by everything that came before it.
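The three steps above can be sketched in a few lines of Python. The "model" here is a toy stand-in (a real forward pass scores an entire vocabulary with a neural network), but the autoregressive structure, predict then append then repeat, is the same:

```python
# Illustrative sketch of autoregressive generation. toy_next_token is a
# hypothetical stand-in for a real model's forward pass.

def toy_next_token(tokens):
    """Pretend forward pass: return the 'most probable' next token
    given everything generated so far. Real models score a vocabulary
    of tens of thousands of tokens; this toy just cycles through one."""
    vocab = ["Try", "improving", "your", "marketing", "with", "data", "."]
    return vocab[len(tokens) % len(vocab)]

def generate(prompt_tokens, max_new_tokens=7):
    tokens = list(prompt_tokens)            # step 1 (tokenisation) assumed done
    for _ in range(max_new_tokens):
        next_tok = toy_next_token(tokens)   # steps 2-3: process, predict
        tokens.append(next_tok)             # append and feed back in
        if next_tok == ".":                 # a stop token ends the response
            break
    return tokens

print(generate(["How", "do", "I", "improve", "my", "marketing", "?"]))
```

The loop captures why long responses cost time: every new token requires another full pass through the model, conditioned on all the tokens before it.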

Why inference speed matters

Inference speed determines how quickly you get responses. Faster inference means:

  • More responsive AI assistants
  • Lower costs for high-volume applications
  • Better user experience for real-time applications

AI companies invest heavily in inference optimisation. Techniques include model quantisation (reducing the precision of model weights to speed up calculation), batching (processing multiple prompts simultaneously), and purpose-built hardware (inference-optimised chips).
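To make quantisation concrete, here is a minimal sketch of the core idea: mapping 32-bit float weights to 8-bit integers with a single scale factor. Production schemes are more involved (per-channel scales, activation quantisation), but the trade-off is the same: a quarter of the memory per weight, in exchange for a small rounding error.

```python
# Minimal sketch of int8 weight quantisation with one per-tensor scale.
# Values and scheme are illustrative, not any framework's actual method.

def quantise_int8(weights):
    """Map floats into the int8 range [-127, 127] using one scale factor."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.05, 0.91]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Each quantised value fits in 1 byte instead of 4; the restored
# weights differ from the originals by at most one rounding step.
```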

Inference costs

For businesses using AI via APIs, inference is typically billed per token — both input tokens (your prompt) and output tokens (the response). A longer, more detailed prompt costs more to process. A longer response costs more to generate.

Understanding this cost structure helps you optimise your AI spending:

  • Write concise, focused prompts rather than including unnecessary context
  • Use cheaper, faster models for simple tasks and reserve expensive models for complex ones
  • Cache responses for repeated queries when possible
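As a back-of-envelope illustration of per-token billing, the sketch below uses hypothetical prices, not any vendor's real rates:

```python
# Rough cost model for per-token API pricing. Both prices are assumed
# placeholder figures, not a real vendor's rates.

INPUT_PRICE_PER_1K = 0.003    # assumed: $ per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.015   # assumed: $ per 1,000 output tokens

def prompt_cost(input_tokens, output_tokens):
    """Dollar cost of one API call: prompt tokens plus response tokens."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

single = prompt_cost(500, 800)   # one 500-token prompt, 800-token reply
monthly = single * 100_000       # the same call 100,000 times a month
```

Note how the output side dominates here: with these assumed rates, 800 response tokens cost several times more than the 500 prompt tokens, which is why trimming verbose responses often saves more than trimming prompts.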

Inference vs training costs

Training a frontier model can cost over $100 million. Inference for a single prompt costs fractions of a cent. But because inference happens billions of times across millions of users, the aggregate inference cost can exceed training cost. This is why AI pricing and efficiency are major industry concerns.
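The arithmetic behind that comparison is easy to sketch. Every figure below is an assumption chosen only to illustrate the shape of the trade-off:

```python
# Illustrative comparison of one-off training cost vs ongoing inference
# cost. All numbers are assumptions, not real figures for any model.

training_cost = 100_000_000      # assumed: $100M one-off training run
cost_per_prompt = 0.002          # assumed: a fifth of a cent per prompt
prompts_per_day = 500_000_000    # assumed: daily traffic at scale

daily_inference = prompts_per_day * cost_per_prompt
days_to_match_training = training_cost / daily_inference
# Under these assumptions, aggregate inference spending overtakes the
# entire training budget in roughly 100 days.
```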


Why This Matters

Every AI interaction your team has is an inference event with an associated cost and latency. Understanding inference helps you make informed decisions about which AI model to use for which task, how to structure prompts for cost efficiency, and how to evaluate whether real-time AI features are feasible for your products. It also demystifies AI pricing — when a vendor charges per token, you now understand exactly what you are paying for.

Learn More

Continue learning in Foundations

This topic is covered in our lesson: How Large Language Models Actually Work