Vision Transformer (ViT)
An architecture that applies the transformer model — originally designed for text — to image recognition, by splitting images into patches and processing them like words in a sentence.
A Vision Transformer (ViT) is a neural network architecture that applies the transformer — the same architecture behind ChatGPT and Claude — to image understanding tasks. Instead of processing text tokens, it processes image patches, achieving state-of-the-art results on many computer vision benchmarks.
How ViT works
The key insight of ViT is treating an image like a sentence:
- Patch splitting: The image is divided into fixed-size patches (typically 16x16 pixels). A 256x256 image becomes a 16x16 grid of 256 patches.
- Embedding: Each patch is flattened and projected into an embedding vector — the same way text tokens are embedded in language models.
- Position encoding: Each patch embedding receives a positional encoding so the model knows where in the image each patch came from.
- Transformer processing: The sequence of patch embeddings is fed through standard transformer encoder layers with self-attention, allowing every patch to attend to every other patch.
- Classification: A special learnable classification token, prepended to the patch sequence, aggregates information from all patches as it passes through the layers; its final representation feeds the classification head that produces the output.
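The patch-splitting step above can be sketched in plain Python. This is a toy illustration of the bookkeeping only (real implementations use tensor libraries and learned projections); the function name and shapes are illustrative:

```python
# Toy sketch of ViT patchification: split an image into fixed-size
# square patches and flatten each patch into a vector, the way text
# tokens become a sequence of embeddings. Illustrative only.

def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patches."""
    h = len(image)
    w = len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            flat = []
            for i in range(top, top + patch_size):
                for j in range(left, left + patch_size):
                    flat.extend(image[i][j])  # append the pixel's channels
            patches.append(flat)
    return patches

# A 256x256 RGB image yields (256/16)^2 = 256 patches,
# each flattened to 16 * 16 * 3 = 768 numbers.
image = [[[0.0, 0.0, 0.0] for _ in range(256)] for _ in range(256)]
patches = patchify(image)
print(len(patches))     # 256 patches
print(len(patches[0]))  # 768 values per patch
```

In a real ViT, each 768-value flattened patch is then multiplied by a learned projection matrix to produce the patch embedding, and a positional encoding is added before the transformer layers.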
Why this was surprising
Before ViT, computer vision was dominated by convolutional neural networks (CNNs), which had been the standard for over a decade. CNNs process images using local filters that slide across the image, building up from small features (edges, textures) to large ones (objects, scenes).
The surprise was that transformers — with no built-in understanding of spatial locality — could match or exceed CNNs when given enough training data. This demonstrated the remarkable versatility of the attention mechanism.
Advantages over CNNs
- Global attention: Every patch can attend to every other patch from the first layer, allowing ViT to capture long-range relationships in images more easily than CNNs.
- Scalability: ViTs scale more efficiently with data and compute, improving consistently as both increase.
- Unification: Using the same architecture for text and images simplifies multimodal systems that need to process both.
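The global-attention advantage above can be made concrete with a toy single-head self-attention in plain Python. This is a hand-rolled simplification (real ViTs use learned query/key/value projections and multiple heads); the vectors and names here are made up for illustration:

```python
import math

# Minimal single-head self-attention over a few toy "patch embeddings":
# every output is a weighted mix of ALL inputs, so information flows
# between distant patches in a single layer.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """Each embedding attends to every other via scaled dot-product scores."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)  # attention over ALL patches, sums to 1
        out = [sum(w * v[i] for w, v in zip(weights, embeddings))
               for i in range(d)]
        outputs.append(out)
    return outputs

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy patch embeddings
mixed = self_attention(patches)
# Each output is a convex combination of all three inputs -- unlike a
# convolution, which only mixes a local neighbourhood of pixels.
```

Because the attention weights form a convex combination, each output coordinate stays within the range of the input coordinates, but every patch contributes to every output from the very first layer.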
Limitations and trade-offs
ViTs require significantly more training data than CNNs to achieve comparable performance. With small datasets, CNNs still tend to win because their built-in assumptions about spatial structure act as useful prior knowledge. ViTs also require more computation for inference, though optimised variants are closing this gap.
Impact on multimodal AI
ViT's success was a crucial step towards today's multimodal AI models. Because images and text are now processed using the same transformer architecture, building systems that understand both — like GPT-4 with vision or Claude's image understanding — became architecturally straightforward.
Why This Matters
Vision Transformers are the reason modern AI assistants can understand images alongside text. When you upload a screenshot to Claude and ask it to explain what it sees, the underlying architecture draws on ViT's innovation. Understanding this helps you evaluate multimodal AI tools and their capabilities.