Inference

Last reviewed: April 2026

The process of an AI model generating output from your input. Every time you send a prompt and get a response, that is inference.

Inference is what happens when a trained AI model processes your input and produces output. Type a question into ChatGPT and get an answer: that is inference. Ask Claude to summarise a document, write an email, or analyse data: inference again.

Training and inference are the two fundamental phases of AI:

  • Training is the learning phase. It happens once (or periodically) and requires enormous computing resources. It produces the model.
  • Inference is the using phase. It happens every time someone sends a prompt. Each individual request needs far less computing power than training did, but at scale, inference costs add up quickly.

How inference works

When you send a prompt to an LLM, here is what happens, with each generated token taking only milliseconds:

  1. Tokenisation: Your text is broken into tokens — the atomic units the model works with. The sentence "How do I improve my marketing?" becomes approximately 8 tokens.
  2. Processing: The tokens pass through the model's neural network layers. Each layer transforms the representation, applying the patterns learned during training.
  3. Generation: The model predicts the most probable next token, appends it, then predicts the next, and the next. This continues until the model produces a complete response or hits a length limit.

This process is called autoregressive generation — the model generates one token at a time, each informed by everything that came before it.
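The three steps above can be sketched in a few lines of Python. The "model" here is a toy stand-in (a real forward pass scores an entire vocabulary with a neural network), but the autoregressive structure, predict then append then repeat, is the same:

```python
# Illustrative sketch of autoregressive generation. toy_next_token is a
# hypothetical stand-in for a real model's forward pass.

def toy_next_token(tokens):
    """Pretend forward pass: return the 'most probable' next token
    given everything generated so far. Real models score a vocabulary
    of tens of thousands of tokens; this toy just cycles through one."""
    vocab = ["Try", "improving", "your", "marketing", "with", "data", "."]
    return vocab[len(tokens) % len(vocab)]

def generate(prompt_tokens, max_new_tokens=7):
    tokens = list(prompt_tokens)            # step 1 (tokenisation) assumed done
    for _ in range(max_new_tokens):
        next_tok = toy_next_token(tokens)   # steps 2-3: process, predict
        tokens.append(next_tok)             # append and feed back in
        if next_tok == ".":                 # a stop token ends the response
            break
    return tokens

print(generate(["How", "do", "I", "improve", "my", "marketing", "?"]))
```

The loop captures why long responses cost time: every new token requires another full pass through the model, conditioned on all the tokens before it.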

Why inference speed matters

Inference speed determines how quickly you get responses. Faster inference means:

  • More responsive AI assistants
  • Lower costs for high-volume applications
  • Better user experience for real-time applications

AI companies invest heavily in inference optimisation. Techniques include model quantisation (reducing the precision of model weights to speed up calculation), batching (processing multiple prompts simultaneously), and purpose-built hardware (inference-optimised chips).
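To make quantisation concrete, here is a minimal sketch of the core idea: mapping 32-bit float weights to 8-bit integers with a single scale factor. Production schemes are more involved (per-channel scales, activation quantisation), but the trade-off is the same: a quarter of the memory per weight, in exchange for a small rounding error.

```python
# Minimal sketch of int8 weight quantisation with one per-tensor scale.
# Values and scheme are illustrative, not any framework's actual method.

def quantise_int8(weights):
    """Map floats into the int8 range [-127, 127] using one scale factor."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.05, 0.91]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Each quantised value fits in 1 byte instead of 4; the restored
# weights differ from the originals by at most one rounding step.
```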

Inference costs

For businesses using AI via APIs, inference is typically billed per token — both input tokens (your prompt) and output tokens (the response). A longer, more detailed prompt costs more to process. A longer response costs more to generate.

Understanding this cost structure helps you optimise your AI spending:

  • Write concise, focused prompts rather than including unnecessary context
  • Use cheaper, faster models for simple tasks and reserve expensive models for complex ones
  • Cache responses for repeated queries when possible
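As a back-of-envelope illustration of per-token billing, the sketch below uses hypothetical prices, not any vendor's real rates:

```python
# Rough cost model for per-token API pricing. Both prices are assumed
# placeholder figures, not a real vendor's rates.

INPUT_PRICE_PER_1K = 0.003    # assumed: $ per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.015   # assumed: $ per 1,000 output tokens

def prompt_cost(input_tokens, output_tokens):
    """Dollar cost of one API call: prompt tokens plus response tokens."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

single = prompt_cost(500, 800)   # one 500-token prompt, 800-token reply
monthly = single * 100_000       # the same call 100,000 times a month
```

Note how the output side dominates here: with these assumed rates, 800 response tokens cost several times more than the 500 prompt tokens, which is why trimming verbose responses often saves more than trimming prompts.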

Inference vs training costs

Training a frontier model can cost over $100 million. Inference for a single prompt costs fractions of a cent. But because inference happens billions of times across millions of users, the aggregate inference cost can exceed training cost. This is why AI pricing and efficiency are major industry concerns.
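The arithmetic behind that comparison is easy to sketch. Every figure below is an assumption chosen only to illustrate the shape of the trade-off:

```python
# Illustrative comparison of one-off training cost vs ongoing inference
# cost. All numbers are assumptions, not real figures for any model.

training_cost = 100_000_000      # assumed: $100M one-off training run
cost_per_prompt = 0.002          # assumed: a fifth of a cent per prompt
prompts_per_day = 500_000_000    # assumed: daily traffic at scale

daily_inference = prompts_per_day * cost_per_prompt
days_to_match_training = training_cost / daily_inference
# Under these assumptions, aggregate inference spending overtakes the
# entire training budget in roughly 100 days.
```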


Why This Matters

Every AI interaction your team has is an inference event with an associated cost and latency. Understanding inference helps you make informed decisions about which AI model to use for which task, how to structure prompts for cost efficiency, and how to evaluate whether real-time AI features are feasible for your products. It also demystifies AI pricing — when a vendor charges per token, you now understand exactly what you are paying for.

Learn More

Continue learning in Foundations

This topic is covered in our lesson: How Large Language Models Actually Work