Deep Reinforcement Learning

Last reviewed: April 2026

Deep reinforcement learning (deep RL) combines deep neural networks with reinforcement learning, enabling AI agents to learn complex strategies through trial and error in environments that are too rich for traditional reinforcement learning to handle.

How it differs from standard reinforcement learning

Standard reinforcement learning works well when the number of possible states is manageable, as in a simple board game. But real-world environments have millions or billions of possible states (think of every possible frame in a video game). Deep RL uses a neural network to approximate the value of states, or the best action to take in them, so the agent can generalise to states it has never seen.
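As a minimal sketch of that idea (a toy example, not from any real system): instead of storing a separate value for every state in a table, we can approximate Q(s, a) with a tiny linear "network" over hand-picked features of the state. Updates to the shared weights generalise across states the agent has never visited, which is what makes huge state spaces tractable. The corridor environment, feature choices, and hyperparameters below are all illustrative assumptions.

```python
import random

random.seed(0)

N = 10                  # corridor of positions 0..9; position 9 pays reward 1
ACTIONS = [-1, +1]      # step left, step right

def features(s):
    # Hand-picked features; a deep network would learn its own features.
    return [1.0, s / (N - 1)]

def q_value(w, s, a):
    # Linear approximation: Q(s, a) = w[a] . features(s)
    return sum(wi * fi for wi, fi in zip(w[a], features(s)))

def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.2):
    w = {a: [0.0, 0.0] for a in ACTIONS}
    for _ in range(episodes):
        s = random.randrange(N - 1)          # start each episode somewhere new
        for _ in range(50):
            # epsilon-greedy: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q_value(w, s, act))
            s2 = min(max(s + a, 0), N - 1)
            done = s2 == N - 1
            r = 1.0 if done else 0.0
            # TD target, then a gradient step on the shared weights
            target = r if done else gamma * max(q_value(w, s2, b) for b in ACTIONS)
            td_error = target - q_value(w, s, a)
            for i, fi in enumerate(features(s)):
                w[a][i] += alpha * td_error * fi
            s = s2
            if done:
                break
    return w

w = train()
# After training, stepping right should look better than stepping left.
print([max(ACTIONS, key=lambda act: q_value(w, s, act)) for s in range(N - 1)])
```

Because the weights are shared, a reward discovered near the goal immediately raises the estimated value of stepping right everywhere, something a state-by-state table cannot do.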

Landmark achievements

  • Atari games (2013) – DeepMind's DQN learned to play dozens of Atari games from raw pixels, matching or exceeding human performance on many of them
  • AlphaGo (2016) – defeated the world champion at Go, a game with more possible positions than there are atoms in the observable universe
  • AlphaStar (2019) – reached grandmaster level in StarCraft II, a complex real-time strategy game
  • Robotics – learning dexterous manipulation, walking, and navigation from simulated experience

Key concepts

  • Policy – the strategy the agent follows (maps observations to actions)
  • Reward function – defines what "good" behaviour looks like (the agent tries to maximise cumulative reward)
  • Exploration vs. exploitation – balancing trying new things against sticking with what works
  • Sim-to-real transfer – training in simulation (cheap, fast, safe) and deploying in the real world
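The exploration-vs.-exploitation trade-off can be sketched with a classic two-armed bandit (an illustrative toy, not from the article): the agent's policy is epsilon-greedy, usually exploiting the arm with the best observed average reward, but exploring a random arm with probability epsilon. The payout numbers below are made up.

```python
import random

random.seed(42)
TRUE_PAYOUT = [0.3, 0.7]   # hidden win probabilities the agent must discover

def pull(arm):
    # Reward function: 1 on a win, 0 otherwise
    return 1.0 if random.random() < TRUE_PAYOUT[arm] else 0.0

def run(steps=5000, epsilon=0.1):
    counts = [0, 0]        # how often each arm was pulled
    totals = [0.0, 0.0]    # cumulative reward per arm
    for _ in range(steps):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(2)                      # explore
        else:
            avgs = [totals[a] / counts[a] for a in (0, 1)]
            arm = avgs.index(max(avgs))                    # exploit
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals

counts, totals = run()
print(counts)  # the better arm (index 1) ends up pulled far more often
```

With epsilon set to zero the agent can lock onto the first arm that happens to pay out; with epsilon too high it wastes pulls on the known-worse arm. That tension is exactly the trade-off named in the list above.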

Challenges

  • Sample inefficiency – deep RL often needs millions of trials to learn, making training expensive
  • Reward hacking – agents find unexpected ways to maximise reward without achieving the intended goal
  • Instability – training can be brittle, with performance suddenly collapsing
  • Safety – trial-and-error learning is dangerous in high-stakes environments (healthcare, finance, physical systems)

RLHF connection

Reinforcement learning from human feedback (RLHF) uses deep RL principles to align language models with human preferences. Rather than maximising a game score, the model learns to generate responses that humans rate as helpful, honest, and harmless.
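The reward-modelling step behind RLHF can be sketched in miniature (every name, feature, and number below is illustrative, not from any real system): humans compare pairs of responses, and we fit a reward model so that preferred responses score higher, using the Bradley-Terry formulation P(a preferred over b) = sigmoid(reward(a) - reward(b)). A real pipeline would then fine-tune the language model with RL to maximise this learned reward; here we only use it to rank candidates.

```python
import math

def features(response):
    # Crude hand-made features; a real reward model reads the whole text
    # with a neural network.
    return [len(response) / 100.0, 1.0 if "here" in response.lower() else 0.0]

def reward(w, response):
    return sum(wi * fi for wi, fi in zip(w, features(response)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Pretend preference data: (preferred, rejected) pairs from human raters.
pairs = [
    ("Here is a step-by-step answer with the reasoning spelled out.", "No."),
    ("Sure. Here is the likely cause, and how to verify it yourself.", "Figure it out."),
    ("It depends. Here are the two main cases and what to do in each.", "Whatever."),
]

w = [0.0, 0.0]
for _ in range(200):                      # simple logistic-regression loop
    for good, bad in pairs:
        p = sigmoid(reward(w, good) - reward(w, bad))
        for i in range(2):                # gradient ascent on log-likelihood
            w[i] += 0.1 * (1.0 - p) * (features(good)[i] - features(bad)[i])

candidates = ["No.", "Here is a detailed answer explaining each step."]
best = max(candidates, key=lambda r: reward(w, r))
print(best)  # the reward model prefers the longer, more helpful reply
```

Note how easily this setup could be reward-hacked: an agent could pad responses to game the length feature, which is why real reward models and their training data are far richer than this sketch.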


Why this matters

Deep RL is behind some of AI's most impressive achievements and, through RLHF, helps align models like ChatGPT and Claude with human preferences. Understanding it helps you recognise both the potential (AI that learns complex strategies) and the limitations (expensive, sometimes unstable, and prone to unexpected behaviour).
