Reinforcement Learning
A machine learning approach where an AI learns by trial and error, receiving rewards for good outcomes and penalties for bad ones. Used to train game-playing AI and to fine-tune LLMs.
Reinforcement learning (RL) is a type of machine learning where an AI agent learns by interacting with an environment, taking actions, and receiving feedback in the form of rewards or penalties. Over thousands or millions of attempts, the agent learns which actions lead to the best outcomes.
The analogy
Think of teaching a dog to fetch. You do not give the dog a manual. Instead, you reward good behaviour (bringing the ball back) and ignore or gently discourage bad behaviour (running away with it). Over many repetitions, the dog learns the pattern that maximises rewards.
Reinforcement learning works similarly — except the "dog" is a software agent, the "ball" is a task or environment, and the "treat" is a numerical reward signal.
How reinforcement learning works
The RL process involves:
- Agent: The AI system that takes actions
- Environment: The world the agent operates in (a game board, a simulation, a task)
- State: The current situation the agent observes
- Action: What the agent chooses to do
- Reward: The feedback signal — positive for good outcomes, negative for bad ones
- Policy: The strategy the agent develops for choosing actions
The agent observes its state, takes an action, receives a reward, observes the new state, and repeats. Over time, it develops a policy that maximises cumulative reward.
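To make the loop above concrete, here is a minimal sketch using tabular Q-learning, one common RL algorithm chosen purely for illustration. The Corridor environment, its reward values, and every parameter name are hypothetical and not drawn from any particular library.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Reward: +1 for reaching the goal, a small penalty for every other step.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    # Q-table: estimated cumulative reward for each (state, action) pair.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Policy: mostly pick the best-known action, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward + discounted future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    # The learned policy: the highest-valued action in each state.
    return [0 if left > right else 1 for left, right in q]
```

Calling greedy_policy(train()) returns one action per position. With these settings the agent should learn to move right from every non-goal position, since that is the shortest route to the reward.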
Famous reinforcement learning achievements
RL has produced some of AI's most impressive demonstrations:
- AlphaGo (2016): DeepMind's system defeated Lee Sedol, one of the world's strongest Go players, at a game whose search space is far too large for brute-force search. It was first trained on human expert games and then improved by playing against itself.
- AlphaStar (2019): DeepMind's agent beat professional StarCraft II players at a real-time strategy game that demands long-term planning with incomplete information.
- Robotics: RL teaches robots to walk, grasp objects, and navigate environments through trial and error, either on physical hardware or in simulation.
RLHF: Reinforcement Learning from Human Feedback
For business professionals, the most relevant application of RL is RLHF (Reinforcement Learning from Human Feedback). This is a key technique used to make LLMs like ChatGPT and Claude helpful, harmless, and honest.
The process:
- A pre-trained language model generates multiple responses to a prompt
- Human evaluators rank the responses from best to worst
- A reward model is trained on these human preferences
- The language model is fine-tuned using RL to produce responses the reward model scores highly
RLHF is a large part of why modern AI assistants are conversational and helpful rather than simply continuing a prompt with statistically likely text. The RL training shaped their behaviour to align with human preferences.
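The reward-model step in the process above can be sketched as follows. This is a simplified illustration, assuming each response has already been reduced to a short list of numeric features and that the reward model is a plain linear function trained with a pairwise preference loss. Real reward models are neural networks over text, and the feature names here are hypothetical. The final RL fine-tuning step is not shown.

```python
import math


def reward(weights, features):
    # Linear reward model: a weighted sum of the response's features.
    return sum(w * x for w, x in zip(weights, features))


def preference_loss(weights, chosen, rejected):
    # Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    # It is small when the human-preferred response scores higher.
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [helpfulness, rudeness].
# Each pair is (response the human ranked higher, response ranked lower).
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]
weights = train_reward_model(pairs, dim=2)
```

After training, the model gives a positive weight to the first feature and a negative weight to the second, so it scores each preferred response above its rejected counterpart. In RLHF, a score like this becomes the reward signal that the language model is fine-tuned to maximise.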
Reinforcement learning vs supervised learning
- Supervised learning: Learns from labelled examples (right answers provided). Best for well-defined tasks with clear correct answers.
- Reinforcement learning: Learns from experience and rewards (discovers right answers through exploration). Best for sequential decision-making and tasks where the right approach is not known in advance.
Business applications
Beyond RLHF, reinforcement learning has practical business applications:
- Recommendation engines: Learning which content to show users based on engagement signals
- Dynamic pricing: Adjusting prices in real time based on demand patterns
- Supply chain optimisation: Learning optimal inventory and routing decisions
- Ad placement: Determining which ads to show to maximise click-through and conversion
- Resource allocation: Optimising scheduling, staffing, and capacity decisions
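As an illustration of the ad placement case above, here is a minimal sketch of an epsilon-greedy multi-armed bandit, one of the simplest reward-driven methods. The click rates are made-up simulation inputs and the function name is hypothetical. A production system would learn from real user feedback and typically use more sophisticated methods.

```python
import random


def run_ad_bandit(click_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate choosing among ads whose true click rates are unknown to the agent."""
    rng = random.Random(seed)
    shows = [0] * len(click_rates)
    clicks = [0] * len(click_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or sum(shows) == 0:
            # Explore: show a random ad to keep gathering evidence.
            ad = rng.randrange(len(click_rates))
        else:
            # Exploit: show the ad with the best observed click rate so far.
            estimates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
            ad = estimates.index(max(estimates))
        # Simulated user response; the click (1) or no click (0) is the reward.
        clicked = rng.random() < click_rates[ad]
        shows[ad] += 1
        clicks[ad] += int(clicked)
    return shows, clicks
```

For example, run_ad_bandit([0.1, 0.5, 0.9]) returns how often each ad was shown and clicked. When the click rates are well separated like this, most impressions should go to the best-performing ad, while the epsilon parameter keeps a small share of traffic for testing the alternatives.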
Why This Matters
Reinforcement learning, through RLHF, is a key technique that helped turn raw language models into the helpful AI assistants you use today. Understanding RLHF helps explain why ChatGPT and Claude behave the way they do, and why different AI products feel different despite using similar underlying technology. For business applications, RL can power the recommendation and optimisation systems that drive revenue for digital platforms.
This topic is covered in our lesson: AI vs Machine Learning vs Deep Learning