Positional Encoding
A technique that gives transformer models information about word order, since the attention mechanism alone does not know which words come first.
Positional encoding is a technique used in transformer models to inject information about word order into the model's processing. Without it, a transformer would treat "the dog bit the man" and "the man bit the dog" as identical, because the attention mechanism processes all words simultaneously without any inherent notion of sequence.
Why transformers need positional encoding
Unlike older architectures (RNNs, LSTMs) that process words one at a time and naturally know word order, transformers process all words in parallel. This parallelism is what makes transformers fast, but it means they have no built-in concept of first, second, or third: they see a bag of words, not a sentence.
Positional encoding solves this by adding position information to each word's representation before the transformer processes it.
How it works
Each position in a sequence gets a unique positional vector, a set of numbers that encodes its position. This vector is added to the word's embedding (its meaning representation). The result is a combined representation that carries both what the word means and where it appears.
The original transformer paper used sinusoidal functions (mathematical wave patterns) to generate positional vectors. Each position gets a unique combination of sine and cosine values at different frequencies. This approach has the elegant property that relative positions can be computed from the encodings themselves.
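The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the sinusoidal scheme from the original paper (the base constant 10000 comes from that paper; the toy embeddings here are random placeholders, not real word vectors):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: even dimensions use sine,
    odd dimensions use cosine, at frequencies that fall off with depth."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # one angle per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add position information to each token's embedding before the first layer.
embeddings = np.random.randn(6, 8)                  # 6 tokens, model dim 8 (toy sizes)
encoded = embeddings + sinusoidal_encoding(6, 8)    # "what it means" + "where it is"
```

Because the encodings are simply added, the rest of the transformer needs no changes; every downstream layer sees vectors that already carry position information.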
Modern approaches
- Learned positional embeddings: Instead of fixed mathematical formulas, the model learns the best positional representations during training. Used in GPT-style models.
- Rotary Position Embedding (RoPE): Encodes position using rotation matrices, enabling better generalisation to longer sequences than seen during training. Used in Llama and many modern models.
- ALiBi (Attention with Linear Biases): Adds position-dependent biases directly to attention scores rather than to embeddings.
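Of the approaches above, RoPE is the least intuitive from a description alone. The sketch below shows the core idea under simplifying assumptions (a single vector rather than batched attention heads, and the common base of 10000): consecutive feature pairs are rotated by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative distance.

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Simplified Rotary Position Embedding: rotate each consecutive
    pair of features by a position-dependent angle."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                        # split into (even, odd) pairs
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin                  # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The relative-position property falls out of the rotation: rotating a query by position m and a key by position n leaves their dot product dependent only on n minus m, which is what lets RoPE-based models reason about distances rather than absolute slots.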
Why this matters for context length
Positional encoding directly affects how long a context window a model can handle. Models trained with a fixed positional scheme often degrade on inputs longer than their training length. Techniques like RoPE and ALiBi are designed to extrapolate to longer sequences, which is one reason context windows have grown from roughly 2,000 tokens to over 1 million.
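ALiBi's extrapolation-friendly design is simple enough to sketch directly: instead of modifying embeddings, it subtracts a penalty from each attention score that grows linearly with the distance between query and key. The slope value below is an arbitrary illustration; real models assign a fixed, head-specific set of slopes.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """Linear attention bias: each query is penalised in proportion to
    how far back the key it attends to sits (causal direction only)."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]   # query index - key index
    return -slope * np.maximum(distance, 0)              # future positions get 0
                                                         # (masked out anyway)
```

Because the penalty is defined by distance alone, the same formula applies unchanged at sequence lengths never seen in training, which is the source of ALiBi's length extrapolation.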
A practical analogy
Imagine a team meeting where everyone writes their ideas on sticky notes and throws them on a table. Without numbering the notes (positional encoding), you cannot tell the order of the discussion. With numbering, you can reconstruct the conversation flow.
Why This Matters
Positional encoding explains a key architectural detail that affects context window limits and how well models handle long documents. Understanding it helps you appreciate why some models handle lengthy inputs better than others, which matters when choosing models for document-heavy applications.
Continue learning in Advanced
This topic is covered in our lesson: Understanding Model Architectures