
Computer Vision

Last reviewed: April 2026

The field of AI that enables machines to interpret and understand visual information from images and videos, including object recognition, scene understanding, and visual analysis.

Computer vision is the branch of artificial intelligence that enables machines to interpret and understand visual information — images, videos, and real-time camera feeds. When your phone unlocks with face recognition, when a self-driving car identifies pedestrians, or when a factory inspection system spots defective products, that is computer vision.

What computer vision can do

Modern computer vision systems handle a wide range of tasks:

  • Image classification: Identifying what is in an image. "This photo contains a cat." "This X-ray shows signs of pneumonia."
  • Object detection: Finding and locating specific objects within an image. "There are three cars and two pedestrians in this street scene."
  • Semantic segmentation: Assigning a class label to every pixel in an image. "These pixels are road, these are pavement, these are sky."
  • Facial recognition: Identifying or verifying individuals from facial features
  • Optical character recognition (OCR): Reading text from images — receipts, documents, signs, handwriting
  • Pose estimation: Locating body joints (keypoints) to estimate human posture and movement
  • Image generation: Creating new images from text descriptions (DALL-E, Midjourney)
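
The first three tasks differ mainly in the shape of what the model returns. The sketch below is illustrative only: the class names, scores, and pixel coordinates are made-up values for a single street-scene image, and `Classification`, `Detection`, `count_objects`, and `pixel_share` are hypothetical names, not part of any particular library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Classification:
    label: str    # one label for the whole image
    score: float  # model confidence between 0 and 1


@dataclass
class Detection:
    label: str
    score: float
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


# Image classification: a single answer per image (made-up values).
classification = Classification(label="street", score=0.94)

# Object detection: one labelled bounding box per object found (made-up values).
detections = [
    Detection("car", 0.98, (10, 40, 120, 110)),
    Detection("car", 0.95, (130, 45, 230, 115)),
    Detection("car", 0.91, (240, 50, 330, 120)),
    Detection("pedestrian", 0.89, (340, 30, 370, 118)),
    Detection("pedestrian", 0.86, (380, 32, 408, 119)),
]

# Semantic segmentation: one class id per pixel (a tiny 3x4 "image").
CLASS_NAMES = {0: "sky", 1: "road", 2: "pavement"}
segmentation = [
    [0, 0, 0, 0],
    [2, 1, 1, 2],
    [2, 1, 1, 2],
]


def count_objects(dets):
    """Summarise detections as a count per label."""
    return dict(Counter(d.label for d in dets))


def pixel_share(mask):
    """Fraction of pixels assigned to each class name."""
    names = [CLASS_NAMES[class_id] for row in mask for class_id in row]
    counts = Counter(names)
    return {name: counts[name] / len(names) for name in counts}


print(count_objects(detections))  # {'car': 3, 'pedestrian': 2}
print(pixel_share(segmentation))  # each of the three classes covers a third of the pixels
```

Classification gives one answer per image, detection gives one answer per object, and segmentation gives one answer per pixel. That is why the later tasks generally cost more to label and to run.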

How computer vision works

Modern computer vision relies on deep learning, particularly convolutional neural networks (CNNs) and, increasingly, vision transformers (ViTs):

  1. Training: The model is shown a large set of labelled images, often millions, and its internal weights are adjusted to reduce the errors in its predictions. In the process it learns to recognise visual patterns: edges, shapes, textures, objects, and scenes.
  2. Feature extraction: In a CNN, successive layers detect increasingly complex features. Early layers respond to edges and colours, middle layers to shapes and textures, and deep layers to whole objects and scenes.
  3. Inference: Given a new image, the trained model identifies what it sees based on the patterns it learned.
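
To make the "early layers detect edges" idea concrete, here is a minimal sketch of the sliding-filter operation at the heart of a convolutional layer, in plain Python. It assumes a tiny hand-made greyscale image and a hand-written, Sobel-style vertical-edge filter. In a trained CNN the filter weights are learned from data rather than written by hand, and `convolve2d` is an illustrative helper, not a library function.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image with no padding.

    Each output value is the weighted sum of the pixels under the kernel
    (cross-correlation, which is what CNN layers compute).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# 5x6 greyscale image: dark (0) on the left, bright (9) on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Hand-written vertical-edge filter; a trained network learns such weights.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = convolve2d(image, VERTICAL_EDGE)
for row in feature_map:
    print(row)  # [0, 36, 36, 0] for every row
```

The output is zero over the flat dark and flat bright regions and large where the brightness changes, so the filter acts as an edge detector. Deeper layers apply the same operation to the feature maps produced by earlier layers, which is how simple patterns are combined into more complex ones.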

Multi-modal AI

A major development in computer vision is the rise of multi-modal AI — models that understand both text and images. Claude, GPT-4o, and Gemini can all process images alongside text, enabling:

  • Describing what is in a photo
  • Answering questions about charts and diagrams
  • Reading and interpreting screenshots
  • Analysing visual data alongside textual context

This convergence of language and vision means a single model can handle tasks that previously required separate text and image systems, which makes AI more versatile and practical for business applications.

Business applications

Computer vision is already deployed across industries:

  • Retail: Visual search ("find products that look like this"), inventory monitoring, cashierless checkout
  • Manufacturing: Quality inspection, defect detection, equipment monitoring
  • Healthcare: Medical imaging analysis, pathology screening, patient monitoring
  • Agriculture: Crop health monitoring, yield estimation, pest detection
  • Security: Surveillance analysis, access control, threat detection
  • Insurance: Damage assessment from photos, claims processing
  • Real estate: Property valuation from images, virtual tours
  • Logistics: Package sorting, warehouse navigation, delivery verification

Practical considerations

When evaluating computer vision solutions:

  • Accuracy requirements: How critical are errors? Medical imaging demands near-perfect accuracy. Product photo tagging is more forgiving.
  • Speed requirements: Real-time processing (self-driving cars, security) versus batch processing (document scanning, photo organisation)
  • Data privacy: Facial recognition and surveillance raise significant privacy and ethical concerns
  • Edge vs cloud: Some applications need on-device processing (phones, cameras) while others can send images to cloud servers
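
The accuracy and speed questions above can be turned into numbers before choosing a solution. The sketch below is a rough illustration using made-up counts from a hypothetical defect-inspection trial. `precision_recall` and `per_frame_budget_ms` are illustrative helper names, and the 30 frames-per-second figure is an assumed camera rate.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: how many flagged items were real. Recall: how many real items were found."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def per_frame_budget_ms(frames_per_second):
    """Time available to process one frame if the system must keep up with the camera."""
    return 1000.0 / frames_per_second


# Made-up trial: 90 defects caught, 10 false alarms, 30 defects missed.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(precision, recall)  # 0.9 0.75

# Assumed 30 fps camera feed.
print(round(per_frame_budget_ms(30), 1))  # 33.3
```

Which number matters most depends on the cost of each kind of error. A missed tumour (low recall) is far more serious than a mis-tagged product photo, and a real-time system that needs more than its per-frame budget falls behind the camera, whereas a batch job only finishes later.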

Why This Matters

Computer vision is expanding AI beyond text into the visual world, opening up applications in manufacturing, healthcare, retail, and security. Understanding computer vision helps you identify opportunities where visual data (product images, documents, facility monitoring, medical scans) could be analysed automatically. As multi-modal AI models become standard, the ability to process images alongside text is likely to become a core capability in many AI-powered workflows.
