Vision Transformer (ViT)
An architecture that applies the transformer model — originally designed for text — to image recognition, by splitting images into patches and processing them like words in a sentence.
A Vision Transformer (ViT) is a neural network architecture that applies the transformer — the same architecture behind ChatGPT and Claude — to image understanding tasks. Instead of processing text tokens, it processes image patches, achieving state-of-the-art results on many computer vision benchmarks.
How ViT works
The key insight of ViT is treating an image like a sentence:
- Patch splitting: The image is divided into fixed-size patches (typically 16x16 pixels). A 256x256 image becomes a 16x16 grid of 256 patches.
- Embedding: Each patch is flattened and projected into an embedding vector — the same way text tokens are embedded in language models.
- Position encoding: Each patch embedding receives a positional encoding so the model knows where in the image each patch came from.
- Transformer processing: The sequence of patch embeddings is fed through standard transformer encoder layers with self-attention, allowing every patch to attend to every other patch.
- Classification: A special learnable classification token, prepended to the patch sequence, aggregates information from all patches as it passes through the layers; its final representation feeds the classification head that produces the output.
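The patch-splitting step above can be sketched in plain Python. This is a toy illustration of the bookkeeping only (real implementations use tensor libraries and learned projections); the function name and shapes are illustrative:

```python
# Toy sketch of ViT patchification: split an image into fixed-size
# square patches and flatten each patch into a vector, the way text
# tokens become a sequence of embeddings. Illustrative only.

def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patches."""
    h = len(image)
    w = len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            flat = []
            for i in range(top, top + patch_size):
                for j in range(left, left + patch_size):
                    flat.extend(image[i][j])  # append the pixel's channels
            patches.append(flat)
    return patches

# A 256x256 RGB image yields (256/16)^2 = 256 patches,
# each flattened to 16 * 16 * 3 = 768 numbers.
image = [[[0.0, 0.0, 0.0] for _ in range(256)] for _ in range(256)]
patches = patchify(image)
print(len(patches))     # 256 patches
print(len(patches[0]))  # 768 values per patch
```

In a real ViT, each 768-value flattened patch is then multiplied by a learned projection matrix to produce the patch embedding, and a positional encoding is added before the transformer layers.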
Why this was surprising
Before ViT, computer vision was dominated by convolutional neural networks (CNNs), which had been the standard for over a decade. CNNs process images using local filters that slide across the image, building up from small features (edges, textures) to large ones (objects, scenes).
The surprise was that transformers — with no built-in understanding of spatial locality — could match or exceed CNNs when given enough training data. This demonstrated the remarkable versatility of the attention mechanism.
Advantages over CNNs
- Global attention: Every patch can attend to every other patch from the first layer, allowing ViT to capture long-range relationships in images more easily than CNNs.
- Scalability: ViTs scale more efficiently with data and compute, improving consistently as both increase.
- Unification: Using the same architecture for text and images simplifies multimodal systems that need to process both.
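The global-attention advantage above can be made concrete with a toy single-head self-attention in plain Python. This is a hand-rolled simplification (real ViTs use learned query/key/value projections and multiple heads); the vectors and names here are made up for illustration:

```python
import math

# Minimal single-head self-attention over a few toy "patch embeddings":
# every output is a weighted mix of ALL inputs, so information flows
# between distant patches in a single layer.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """Each embedding attends to every other via scaled dot-product scores."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)  # attention over ALL patches, sums to 1
        out = [sum(w * v[i] for w, v in zip(weights, embeddings))
               for i in range(d)]
        outputs.append(out)
    return outputs

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy patch embeddings
mixed = self_attention(patches)
# Each output is a convex combination of all three inputs -- unlike a
# convolution, which only mixes a local neighbourhood of pixels.
```

Because the attention weights form a convex combination, each output coordinate stays within the range of the input coordinates, but every patch contributes to every output from the very first layer.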
Limitations and trade-offs
ViTs require significantly more training data than CNNs to achieve comparable performance. With small datasets, CNNs still tend to win because their built-in assumptions about spatial structure act as useful prior knowledge. ViTs also require more computation for inference, though optimised variants are closing this gap.
Impact on multimodal AI
ViT's success was a crucial step towards today's multimodal AI models. Because images and text are now processed using the same transformer architecture, building systems that understand both — like GPT-4 with vision or Claude's image understanding — became architecturally straightforward.
Why This Matters
Vision Transformers are the reason modern AI assistants can understand images alongside text. When you upload a screenshot to Claude and ask it to explain what it sees, the underlying architecture draws on ViT's innovation. Understanding this helps you evaluate multimodal AI tools and their capabilities.