Reinforcement Learning

Last reviewed: April 2026

A machine learning approach where an AI learns by trial and error, receiving rewards for good outcomes and penalties for bad ones. Used to train game-playing AI and to fine-tune LLMs.

Reinforcement learning (RL) is a type of machine learning where an AI agent learns by interacting with an environment, taking actions, and receiving feedback in the form of rewards or penalties. Over thousands or millions of attempts, the agent learns which actions lead to the best outcomes.

The analogy

Think of teaching a dog to fetch. You do not give the dog a manual. Instead, you reward good behaviour (bringing the ball back) and ignore or gently discourage bad behaviour (running away with it). Over many repetitions, the dog learns the pattern that maximises rewards.

Reinforcement learning works similarly — except the "dog" is a software agent, the "ball" is a task or environment, and the "treat" is a numerical reward signal.

How reinforcement learning works

The RL process involves:

Agent: The AI system that takes actions
Environment: The world the agent operates in (a game board, a simulation, a task)
State: The current situation the agent observes
Action: What the agent chooses to do
Reward: The feedback signal — positive for good outcomes, negative for bad ones
Policy: The strategy the agent develops for choosing actions

The agent observes its state, takes an action, receives a reward, observes the new state, and repeats. Over time, it develops a policy that maximises cumulative reward.
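
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the corridor environment, the reward of +1 at the last cell, and all parameter values are assumptions made up for this example, and tabular Q-learning is just one of many RL algorithms.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0 and receives a reward of +1 only when it reaches cell 4.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # step left, step right


def step(state, action):
    """Environment: apply an action and return (new_state, reward, done)."""
    new_state = min(max(state + action, 0), GOAL)
    done = new_state == GOAL
    return new_state, (1.0 if done else 0.0), done


def choose_action(q, state, rng, epsilon):
    """Policy: mostly pick the best-known action, sometimes explore."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    # Break ties at random so the untrained agent does not get stuck.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: estimate the value of each (state, action) pair."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng, epsilon)
            new_state, reward, done = step(state, action)
            # Value of the best action available from the new state.
            best_next = 0.0 if done else max(q[(new_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = new_state
    return q


def greedy_policy(q):
    """The learned strategy: the highest-valued action in each cell."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


if __name__ == "__main__":
    print(greedy_policy(train()))  # every cell maps to +1 (step right)
```

Nobody tells the agent that "right" is the correct answer. It finds that out because moving right is the only behaviour that ever leads to a reward.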

Famous reinforcement learning achievements

RL has produced some of AI's most impressive demonstrations:

AlphaGo (2016): DeepMind's system defeated Lee Sedol, one of the world's top Go players, at a game so complex that brute-force search is impractical. It was first trained on games played by human experts and then improved by playing a very large number of games against itself.
AlphaStar (2019): Beat professional StarCraft II players by learning complex real-time strategy.
Robotics: RL teaches robots to walk, grasp objects, and navigate environments through trial and error, either on physical hardware or in simulation.

RLHF: Reinforcement Learning from Human Feedback

For business professionals, the most relevant application of RL is RLHF (Reinforcement Learning from Human Feedback). It is one of the main techniques used to make LLMs such as ChatGPT and Claude helpful, harmless, and honest.

The process:

A pre-trained language model generates multiple responses to a prompt
Human evaluators rank the responses from best to worst
A reward model is trained on these human preferences
The language model is fine-tuned using RL to produce responses the reward model scores highly

RLHF is a large part of why modern AI assistants are conversational and helpful rather than simply continuing a prompt with statistically likely text: the RL training shaped their behaviour to align with human preferences.
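
To make the reward-model step of the process more concrete, here is a minimal sketch of learning a scoring function from human preferences. Everything in it is a simplifying assumption for illustration: real reward models are neural networks that read the full text of a response, whereas this toy version scores three hypothetical hand-made features with a linear model. A ranking from best to worst is treated as a set of (preferred, rejected) pairs, and the model is trained with a pairwise logistic loss. The final RL fine-tuning step is not shown.

```python
import math

# Hypothetical features describing a response:
# [answers_the_question, polite_tone, rambling]
PREFERENCE_PAIRS = [
    # (features of the preferred response, features of the rejected one)
    ([1.0, 1.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 1.0]),
    ([1.0, 1.0, 0.0], [0.0, 0.0, 1.0]),
]


def reward(weights, features):
    """Toy reward model: a weighted sum of the response features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so preferred responses score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of the loss -log(sigmoid(margin)) w.r.t. the margin.
            grad = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad * (preferred[i] - rejected[i])
    return weights


if __name__ == "__main__":
    weights = train_reward_model(PREFERENCE_PAIRS, n_features=3)
    candidates = {
        "direct and polite": [1.0, 1.0, 0.0],
        "polite but rambling": [0.0, 1.0, 1.0],
    }
    # The language model would then be fine-tuned towards high-scoring replies.
    for name, features in candidates.items():
        print(name, round(reward(weights, features), 2))
```

The human evaluators never write down a rule such as "do not ramble". The model infers it from which responses they preferred.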

Reinforcement learning vs supervised learning

Supervised learning: Learns from labelled examples (right answers provided). Best for well-defined tasks with clear correct answers.
Reinforcement learning: Learns from experience and rewards (discovers right answers through exploration). Best for sequential decision-making and tasks where the right approach is not known in advance.

Business applications

Beyond RLHF, reinforcement learning has practical business applications:

Recommendation engines: Learning which content to show users based on engagement signals
Dynamic pricing: Adjusting prices in real time based on demand patterns
Supply chain optimisation: Learning optimal inventory and routing decisions
Ad placement: Determining which ads to show to maximise click-through and conversion
Resource allocation: Optimising scheduling, staffing, and capacity decisions
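
As an illustration of the ad placement case, the sketch below uses an epsilon-greedy "multi-armed bandit", one of the simplest RL-style methods. It is an assumption-laden toy rather than how any particular platform works: the three click-through rates are invented, users are simulated with random numbers, and a real system would also take the user and context into account.

```python
import random


def run_bandit(click_rates, rounds=10000, epsilon=0.1, seed=0):
    """Learn which ad earns the most clicks, using only click feedback.

    click_rates holds the true (hidden) click probability of each ad and is
    used only to simulate users; the learner never reads it directly.
    """
    rng = random.Random(seed)
    n_ads = len(click_rates)
    shows = [0] * n_ads        # how often each ad has been displayed
    estimates = [0.0] * n_ads  # running estimate of each ad's click rate
    for _ in range(rounds):
        if rng.random() < epsilon:
            ad = rng.randrange(n_ads)  # explore: try a random ad
        else:
            ad = max(range(n_ads), key=lambda i: estimates[i])  # exploit
        # Simulated user: clicks with the ad's hidden probability.
        reward = 1.0 if rng.random() < click_rates[ad] else 0.0
        shows[ad] += 1
        # Incremental average of the rewards observed for this ad.
        estimates[ad] += (reward - estimates[ad]) / shows[ad]
    return estimates, shows


if __name__ == "__main__":
    # Hypothetical click-through rates for three ads.
    estimates, shows = run_bandit([0.05, 0.10, 0.30])
    print("estimated click rates:", [round(e, 3) for e in estimates])
    print("times shown:", shows)
```

The `epsilon` parameter controls the trade-off between exploring (occasionally showing an ad that looks worse, in case the estimate is wrong) and exploiting (showing the ad that currently looks best). The same trade-off appears in dynamic pricing and recommendation settings.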

Why This Matters

Reinforcement learning, in the form of RLHF, is a key technique that turned raw language models into the helpful AI assistants you use today. Understanding RLHF helps explain why ChatGPT and Claude behave the way they do, and why different AI products feel different despite using similar underlying technology. For business applications, RL is one of the techniques behind the recommendation and optimisation systems that drive revenue for digital platforms.

Related Terms

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.

Supervised Learning

A machine learning approach where the model learns from labelled examples — input data paired with correct answers. The most common type of machine learning in business applications.

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determine what the AI can and cannot do well.

Fine-Tuning

Training an existing AI model on your specific data to improve its performance on your specific tasks. Like giving the AI specialised on-the-job training.

Large Language Model (LLM)

A type of AI trained on vast amounts of text to understand and generate human language. ChatGPT, Claude, and Gemini are all LLMs.

AI Agent

An AI system that can take actions autonomously — browsing the web, running code, calling APIs, and completing multi-step tasks with minimal human intervention.