Decoder-Only Model
A transformer architecture that generates text by predicting one token at a time from left to right, used by GPT, Claude, and most modern large language models.
A decoder-only model is a type of transformer architecture that generates text autoregressively: predicting one token at a time, where each prediction is conditioned on all the tokens that came before it. This is the architecture behind GPT, Claude, Llama, and most of today's large language models.
The original transformer
The original transformer architecture (2017) had two parts: an encoder that processed the input text and a decoder that generated the output text. This encoder-decoder design was built for translation: the encoder understood the source language, and the decoder produced the target language.
Why decoder-only?
Researchers discovered that the decoder alone, when scaled up with massive data and parameters, could handle an enormous range of tasks. By framing everything as text generation (question-answering becomes generating an answer, translation becomes generating text in another language, summarization becomes generating a shorter version), a single decoder-only model could be a generalist.
How decoder-only models work
The model processes a sequence of tokens from left to right. At each position, it uses attention mechanisms to consider all previous tokens but not future ones; this restriction is called "causal masking". It then predicts a probability distribution over possible next tokens. During generation, it selects a token, appends it to the sequence, and repeats.
This left-to-right constraint is what makes the model generative. It cannot "peek ahead" at future tokens, which means each prediction is a genuine generation step.
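The loop described above can be sketched in a few lines. This is a toy illustration, not a real transformer: `causal_mask` shows the "no peeking ahead" attention pattern, and the next-token function is a hypothetical stand-in for the model's prediction step.

```python
def causal_mask(n):
    # Position i may attend only to positions 0..i (no future tokens).
    return [[j <= i for j in range(n)] for i in range(n)]

def generate(next_token_fn, prompt, max_new_tokens):
    # Autoregressive decoding: predict, append, repeat.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each prediction is conditioned on ALL tokens generated so far.
        tokens.append(next_token_fn(tokens))
    return tokens

# Hypothetical "model": just continues the sequence by adding 1.
toy = lambda toks: toks[-1] + 1
print(generate(toy, [1, 2, 3], 3))  # [1, 2, 3, 4, 5, 6]
```

A real model would replace `toy` with a transformer forward pass that applies the causal mask inside attention and samples from the resulting next-token distribution.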
Decoder-only vs other architectures
- Encoder-only (like BERT): Processes the entire input bidirectionally. Excellent for understanding and classification but not for generation.
- Encoder-decoder (like T5, BART): Separate modules for understanding input and generating output. Good for translation and summarization.
- Decoder-only (like GPT, Claude): Unified architecture for both understanding and generation. Scales extremely well and handles diverse tasks.
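The architectural difference above largely comes down to the attention mask. A minimal sketch, contrasting the full (bidirectional) mask an encoder-only model uses with the causal (lower-triangular) mask a decoder-only model uses:

```python
def full_mask(n):
    # Encoder-style: every position attends to every other position.
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    # Decoder-style: position i attends only to positions 0..i.
    return [[j <= i for j in range(n)] for i in range(n)]

# In a 4-token sequence, position 1 sees all 4 tokens bidirectionally,
# but only tokens 0 and 1 under causal masking.
assert sum(full_mask(4)[1]) == 4
assert sum(causal_mask(4)[1]) == 2
```

Encoder-decoder models combine both: full masking in the encoder, causal masking in the decoder, plus cross-attention from decoder to encoder.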
Why this architecture dominates
Decoder-only models dominate for several reasons: they are simpler to train at scale, the next-token prediction objective works with any text data, and they can be applied to virtually any task through appropriate prompting. The simplicity of the training objective, just predicting the next token, belies the remarkable capabilities that emerge at scale.
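The next-token objective itself is simple enough to sketch: at each position, the loss is the negative log-probability the model assigns to the token that actually came next, averaged over the sequence. The probabilities below are hypothetical placeholders for a model's output.

```python
import math

def next_token_loss(probs_per_step, targets):
    # probs_per_step[i]: dict mapping candidate tokens to the model's
    # predicted probability at step i. targets[i]: the true next token.
    # Loss is the average negative log-likelihood of the true tokens.
    losses = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(losses) / len(losses)

# Two prediction steps with made-up probabilities.
probs = [{"cat": 0.5, "dog": 0.5}, {"sat": 1.0}]
print(round(next_token_loss(probs, ["cat", "sat"]), 4))  # 0.3466
```

Because any raw text provides its own targets (each token is the label for the prefix before it), this objective needs no human annotation, which is a key reason it scales so well.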
Why This Matters
Understanding decoder-only architecture helps you grasp how modern AI models like ChatGPT and Claude actually work. It explains why these models are so versatile, why they generate text from left to right, and why context window size is such an important specification.