Core AI

Computer Vision

Last reviewed: April 2026

The field of AI that enables machines to interpret and understand visual information from images and videos, including object recognition, scene understanding, and visual analysis.

Computer vision is the branch of artificial intelligence that enables machines to interpret and understand visual information — images, videos, and real-time camera feeds. When your phone unlocks with face recognition, when a self-driving car identifies pedestrians, or when a factory inspection system spots defective products, that is computer vision.

What computer vision can do

Modern computer vision systems can perform a remarkable range of tasks:

Image classification: Identifying what is in an image. "This photo contains a cat." "This X-ray shows signs of pneumonia."
Object detection: Finding and locating specific objects within an image. "There are three cars and two pedestrians in this street scene."
Semantic segmentation: Understanding every pixel in an image. "These pixels are road, these are pavement, these are sky."
Facial recognition: Identifying or verifying individuals from facial features
Optical character recognition (OCR): Reading text from images — receipts, documents, signs, handwriting
Pose estimation: Understanding human body positions and movements
Image generation: Creating new images from text descriptions (DALL-E, Midjourney)

How computer vision works

Modern computer vision relies on deep learning, particularly convolutional neural networks (CNNs) and, increasingly, vision transformers (ViTs):

Training: The model is shown millions of labelled images. It learns to recognise visual patterns — edges, shapes, textures, objects, and scenes.
Feature extraction: Each layer of the neural network detects increasingly complex features. Early layers detect edges and colours. Middle layers detect shapes and textures. Deep layers recognise objects and scenes.
Inference: Given a new image, the trained model identifies what it sees based on the patterns it learned.

Multi-modal AI

A major development in computer vision is the rise of multi-modal AI — models that understand both text and images. Claude, GPT-5.4, and Gemini can all process images alongside text, enabling:

Describing what is in a photo
Answering questions about charts and diagrams
Reading and interpreting screenshots
Analysing visual data alongside textual context

This convergence of language and vision capabilities is making AI more versatile and practical for business applications.

Business applications

Computer vision is already deployed across industries:

Retail: Visual search ("find products that look like this"), inventory monitoring, cashierless checkout
Manufacturing: Quality inspection, defect detection, equipment monitoring
Healthcare: Medical imaging analysis, pathology screening, patient monitoring
Agriculture: Crop health monitoring, yield estimation, pest detection
Security: Surveillance analysis, access control, threat detection
Insurance: Damage assessment from photos, claims processing
Real estate: Property valuation from images, virtual tours
Logistics: Package sorting, warehouse navigation, delivery verification

Practical considerations

When evaluating computer vision solutions:

Accuracy requirements: How critical are errors? Medical imaging demands near-perfect accuracy. Product photo tagging is more forgiving.
Speed requirements: Real-time processing (self-driving cars, security) versus batch processing (document scanning, photo organisation)
Data privacy: Facial recognition and surveillance raise significant privacy and ethical concerns
Edge vs cloud: Some applications need on-device processing (phones, cameras) while others can send images to cloud servers

Want to go deeper?

This topic is covered in our Foundations level. Access all 100+ lessons free.

Why This Matters

Computer vision is expanding AI beyond text into the visual world, opening up applications in manufacturing, healthcare, retail, and security. Understanding computer vision helps you identify opportunities where visual data — product images, documents, facility monitoring, medical scans — could be analysed automatically. As multi-modal AI models become standard, the ability to process images alongside text will become a core capability in every AI-powered workflow.

Related Terms

Deep Learning

A subset of machine learning that uses neural networks with many layers to learn complex patterns. The 'deep' refers to the number of layers, not the depth of understanding.

Neural Network

A computing system loosely inspired by the human brain, made of layers of interconnected nodes that learn to recognise patterns in data.

Classification

An AI task that assigns input to predefined categories. Spam detection, sentiment analysis, and image recognition are all classification tasks.

Multi-Modal AI

AI that can process and generate multiple types of content — text, images, audio, and video — within a single model. Claude, GPT-5.4, and Gemini are all multi-modal.

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determines what the AI can and cannot do well.

Learn More

Continue learning in Foundations

This topic is covered in our lesson: AI vs Machine Learning vs Deep Learning

← Back to Glossary