Deep Reinforcement Learning
A training approach that combines deep neural networks with reinforcement learning, enabling AI to learn complex strategies through trial and error in rich environments.
Deep reinforcement learning (deep RL) combines deep neural networks with reinforcement learning, enabling AI agents to learn complex strategies through trial and error in environments that are too rich for traditional reinforcement learning to handle.
How it differs from standard reinforcement learning
Standard reinforcement learning works well when the number of possible states is manageable – like a simple board game. But real-world environments have millions or billions of possible states (think of every possible frame in a video game). Deep RL uses neural networks to approximate the value of states or the best action to take, handling this complexity.
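To see what the neural network replaces, here is a minimal sketch of *tabular* Q-learning on a made-up five-state corridor (the environment, hyperparameters, and reward values are illustrative, not from any benchmark). The `q[state][action]` table below is exactly the thing that stops scaling – deep RL swaps it for a network that takes the state as input and predicts the action values:

```python
import random

# Tiny corridor: states 0..4, reward 1.0 for reaching state 4.
N_STATES, ACTIONS = 5, (0, 1)  # action 0 = left, 1 = right

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1  # (next state, reward, done)

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    # One value per (state, action) pair -- feasible here, impossible
    # when states are raw video frames; that is where a network comes in.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability eps, else exploit
            a = rng.choice(ACTIONS) if rng.random() < eps else max(ACTIONS, key=lambda x: q[s][x])
            s2, r, done = step(s, a)
            # Q-learning update toward the bootstrapped target
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES)]
```

After training, the greedy policy walks right toward the reward from every non-terminal state. Deep Q-networks (DQN) keep this same update rule but compute `q` with a neural network, plus tricks like replay buffers and target networks to stabilise training.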
Landmark achievements
- Atari games (2013) – DeepMind's DQN learned to play dozens of Atari games from raw pixels, matching or exceeding human performance
- AlphaGo (2016) – defeated the world champion at Go, a game with more possible positions than atoms in the universe
- AlphaStar (2019) – reached grandmaster level in StarCraft II, a complex real-time strategy game
- Robotics – learning dexterous manipulation, walking, and navigation from simulated experience
Key concepts
- Policy – the strategy the agent follows (maps observations to actions)
- Reward function – defines what "good" behaviour looks like (the agent tries to maximise cumulative reward)
- Exploration vs. exploitation – balancing trying new things against sticking with what works
- Sim-to-real transfer – training in simulation (cheap, fast, safe) and deploying in the real world
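Two of these concepts fit in a few lines of Python. This sketch uses made-up observations and rewards purely for illustration – a policy is just a mapping from observations to actions, and the "cumulative reward" the agent maximises is the discounted sum of per-step rewards:

```python
# A policy maps observations to actions; here, a trivial lookup table.
# The observation and action labels are invented for this example.
policy = {"too_cold": "increase_heat", "too_hot": "decrease_heat"}

def discounted_return(rewards, gamma=0.99):
    """Cumulative reward the agent tries to maximise, with each future
    reward discounted by gamma per timestep."""
    g = 0.0
    for r in reversed(rewards):  # fold from the final step backwards
        g = r + gamma * g
    return g

# The same reward is worth less the later it arrives.
early = discounted_return([1.0] + [0.0] * 10)  # reward now
late = discounted_return([0.0] * 10 + [1.0])   # reward after 10 steps
```

The discount factor `gamma` encodes how much the agent cares about the future: near 1.0 it plans long-term, near 0.0 it grabs immediate reward.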
Challenges
- Sample inefficiency – deep RL often needs millions of trials to learn, making it expensive
- Reward hacking – agents find unexpected ways to maximise reward without achieving the intended goal
- Instability – training can be brittle, with performance suddenly collapsing
- Safety – trial-and-error learning is dangerous in high-stakes environments (healthcare, finance, physical systems)
RLHF connection
Reinforcement learning from human feedback (RLHF) uses deep RL principles to align language models with human preferences. Rather than maximising a game score, the model learns to generate responses that humans rate as helpful, honest, and harmless.
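The "responses humans rate as better" signal is typically distilled into a reward model trained on pairwise comparisons. Below is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for that step – in a real system the scores come from a learned network over full responses, not the scalar placeholders here:

```python
import math

def preference_loss(score_chosen, score_rejected):
    # Pairwise objective for reward-model training: push the score of
    # the human-preferred response above the score of the rejected one.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the model scores the preferred response higher,
# and grows when it prefers the rejected one.
confident = preference_loss(3.0, 0.0)  # model agrees with the human
wrong = preference_loss(0.0, 3.0)      # model disagrees
```

Once trained, the reward model's score stands in for the "game score", and standard deep RL machinery optimises the language model against it.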
Why This Matters
Deep RL is behind some of AI's most impressive achievements and is the technique used to align models like ChatGPT and Claude with human values. Understanding it helps you recognise both the potential (AI that learns complex strategies) and the limitations (expensive, sometimes unstable, and prone to unexpected behaviour).