Positional Encoding
A technique that gives transformer models information about word order, since the attention mechanism alone does not know which words come first.
Positional encoding is a technique used in transformer models to inject information about word order into the model's processing. Without it, a transformer would treat "the dog bit the man" and "the man bit the dog" as identical, because the attention mechanism processes all words simultaneously without any inherent notion of sequence.
Why transformers need positional encoding
Unlike older architectures (RNNs, LSTMs) that process words one at a time and naturally know word order, transformers process all words in parallel. This parallelism is what makes transformers fast, but it means they have no built-in concept of first, second, or third: they see a bag of words, not a sentence.
Positional encoding solves this by adding position information to each word's representation before the transformer processes it.
How it works
Each position in a sequence gets a unique positional vector, a set of numbers that encodes its position. This vector is added to the word's embedding (its meaning representation). The result is a combined representation that carries both what the word means and where it appears.
The original transformer paper used sinusoidal functions (mathematical wave patterns) to generate positional vectors. Each position gets a unique combination of sine and cosine values at different frequencies. This approach has the elegant property that relative positions can be computed from the encodings themselves.
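The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the sinusoidal scheme from the original paper (the base constant 10000 comes from that paper; the toy embeddings here are random placeholders, not real word vectors):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: even dimensions use sine,
    odd dimensions use cosine, at frequencies that fall off with depth."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # one angle per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add position information to each token's embedding before the first layer.
embeddings = np.random.randn(6, 8)                  # 6 tokens, model dim 8 (toy sizes)
encoded = embeddings + sinusoidal_encoding(6, 8)    # "what it means" + "where it is"
```

Because the encodings are simply added, the rest of the transformer needs no changes; every downstream layer sees vectors that already carry position information.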
Modern approaches
- Learned positional embeddings: Instead of fixed mathematical formulas, the model learns the best positional representations during training. Used in GPT-style models.
- Rotary Position Embedding (RoPE): Encodes position using rotation matrices, enabling better generalisation to longer sequences than seen during training. Used in Llama and many modern models.
- ALiBi (Attention with Linear Biases): Adds position-dependent biases directly to attention scores rather than to embeddings.
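Of the approaches above, RoPE is the least intuitive from a description alone. The sketch below shows the core idea under simplifying assumptions (a single vector rather than batched attention heads, and the common base of 10000): consecutive feature pairs are rotated by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative distance.

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Simplified Rotary Position Embedding: rotate each consecutive
    pair of features by a position-dependent angle."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                        # split into (even, odd) pairs
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin                  # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The relative-position property falls out of the rotation: rotating a query by position m and a key by position n leaves their dot product dependent only on n minus m, which is what lets RoPE-based models reason about distances rather than absolute slots.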
Why this matters for context length
Positional encoding directly affects how long a context window a model can handle. Models trained with a fixed positional scheme often degrade on inputs longer than their training length. Techniques like RoPE and ALiBi are designed to extrapolate to longer sequences, which is one reason context windows have grown from roughly 2,000 tokens to over 1 million.
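ALiBi's extrapolation-friendly design is simple enough to sketch directly: instead of modifying embeddings, it subtracts a penalty from each attention score that grows linearly with the distance between query and key. The slope value below is an arbitrary illustration; real models assign a fixed, head-specific set of slopes.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """Linear attention bias: each query is penalised in proportion to
    how far back the key it attends to sits (causal direction only)."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]   # query index - key index
    return -slope * np.maximum(distance, 0)              # future positions get 0
                                                         # (masked out anyway)
```

Because the penalty is defined by distance alone, the same formula applies unchanged at sequence lengths never seen in training, which is the source of ALiBi's length extrapolation.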
A practical analogy
Imagine a team meeting where everyone writes their ideas on sticky notes and throws them on a table. Without numbering the notes (positional encoding), you cannot tell the order of the discussion. With numbering, you can reconstruct the conversation flow.
Why This Matters
Positional encoding explains a key architectural detail that affects context window limits and how well models handle long documents. Understanding it helps you appreciate why some models handle lengthy inputs better than others, which matters when choosing models for document-heavy applications.
Continue learning in Advanced
This topic is covered in our lesson: Understanding Model Architectures