Decoder-Only Model
A transformer architecture that generates text by predicting one token at a time from left to right, used by GPT, Claude, and most modern large language models.
A decoder-only model is a type of transformer architecture that generates text autoregressively: predicting one token at a time, where each prediction is conditioned on all the tokens that came before it. This is the architecture behind GPT, Claude, Llama, and most of today's large language models.
The original transformer
The original transformer architecture (2017) had two parts: an encoder that processed the input text and a decoder that generated the output text. This encoder-decoder design was built for translation: the encoder understood the source language, and the decoder produced the target language.
Why decoder-only?
Researchers discovered that the decoder alone, when scaled up with massive data and parameters, could handle an enormous range of tasks. By framing everything as text generation (question-answering becomes generating an answer, translation becomes generating text in another language, summarization becomes generating a shorter version), a single decoder-only model could be a generalist.
How decoder-only models work
The model processes a sequence of tokens from left to right. At each position, it uses attention mechanisms to consider all previous tokens but not future ones; this restriction is called "causal masking". It then predicts a probability distribution over possible next tokens. During generation, it selects a token, appends it to the sequence, and repeats.
This left-to-right constraint is what makes the model generative. It cannot "peek ahead" at future tokens, which means each prediction is a genuine generation step.
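The loop described above can be sketched in a few lines. This is a toy illustration, not a real transformer: `causal_mask` shows the "no peeking ahead" attention pattern, and the next-token function is a hypothetical stand-in for the model's prediction step.

```python
def causal_mask(n):
    # Position i may attend only to positions 0..i (no future tokens).
    return [[j <= i for j in range(n)] for i in range(n)]

def generate(next_token_fn, prompt, max_new_tokens):
    # Autoregressive decoding: predict, append, repeat.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each prediction is conditioned on ALL tokens generated so far.
        tokens.append(next_token_fn(tokens))
    return tokens

# Hypothetical "model": just continues the sequence by adding 1.
toy = lambda toks: toks[-1] + 1
print(generate(toy, [1, 2, 3], 3))  # [1, 2, 3, 4, 5, 6]
```

A real model would replace `toy` with a transformer forward pass that applies the causal mask inside attention and samples from the resulting next-token distribution.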
Decoder-only vs other architectures
- Encoder-only (like BERT): Processes the entire input bidirectionally. Excellent for understanding and classification but not for generation.
- Encoder-decoder (like T5, BART): Separate modules for understanding input and generating output. Good for translation and summarization.
- Decoder-only (like GPT, Claude): Unified architecture for both understanding and generation. Scales extremely well and handles diverse tasks.
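The architectural difference above largely comes down to the attention mask. A minimal sketch, contrasting the full (bidirectional) mask an encoder-only model uses with the causal (lower-triangular) mask a decoder-only model uses:

```python
def full_mask(n):
    # Encoder-style: every position attends to every other position.
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    # Decoder-style: position i attends only to positions 0..i.
    return [[j <= i for j in range(n)] for i in range(n)]

# In a 4-token sequence, position 1 sees all 4 tokens bidirectionally,
# but only tokens 0 and 1 under causal masking.
assert sum(full_mask(4)[1]) == 4
assert sum(causal_mask(4)[1]) == 2
```

Encoder-decoder models combine both: full masking in the encoder, causal masking in the decoder, plus cross-attention from decoder to encoder.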
Why this architecture dominates
Decoder-only models dominate for several reasons: they are simpler to train at scale, the next-token prediction objective works with any text data, and they can be applied to virtually any task through appropriate prompting. The simplicity of the training objective, just predicting the next token, belies the remarkable capabilities that emerge at scale.
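The next-token objective itself is simple enough to sketch: at each position, the loss is the negative log-probability the model assigns to the token that actually came next, averaged over the sequence. The probabilities below are hypothetical placeholders for a model's output.

```python
import math

def next_token_loss(probs_per_step, targets):
    # probs_per_step[i]: dict mapping candidate tokens to the model's
    # predicted probability at step i. targets[i]: the true next token.
    # Loss is the average negative log-likelihood of the true tokens.
    losses = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(losses) / len(losses)

# Two prediction steps with made-up probabilities.
probs = [{"cat": 0.5, "dog": 0.5}, {"sat": 1.0}]
print(round(next_token_loss(probs, ["cat", "sat"]), 4))  # 0.3466
```

Because any raw text provides its own targets (each token is the label for the prefix before it), this objective needs no human annotation, which is a key reason it scales so well.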
Why This Matters
Understanding decoder-only architecture helps you grasp how modern AI models like ChatGPT and Claude actually work. It explains why these models are so versatile, why they generate text from left to right, and why context window size is such an important specification.