
Policy Gradient

Last reviewed: April 2026

A reinforcement learning technique where the AI directly learns the best action to take in each situation by adjusting its decision-making policy based on rewards.

Policy gradient is a family of reinforcement learning algorithms in which the AI directly learns a policy (a mapping from situations to actions) by adjusting the policy's parameters in the direction that increases expected reward.

What is a policy?

In reinforcement learning, a policy is the AI's decision-making strategy. Given a situation (state), the policy determines what action to take. A chess-playing AI's policy maps board positions to moves. A text-generating AI's policy maps the conversation so far to the next token to produce.
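In practice a policy is often stochastic: rather than picking one action, it assigns a probability to each available action, typically via a softmax over per-action scores. A minimal sketch (the three logit values here are hypothetical):

```python
import math

def softmax_policy(logits):
    """Convert per-action scores (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A hypothetical state with three available actions: the policy spreads
# probability over them instead of choosing one deterministically.
probs = softmax_policy([2.0, 1.0, 0.1])
```

The probabilities always sum to 1, and higher-scored actions get more probability mass, which is what lets the agent both exploit what it knows and keep exploring.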

How policy gradient works

The core idea is straightforward:

  1. The agent follows its current policy and takes actions in an environment.
  2. It receives rewards (positive or negative) based on the outcomes of those actions.
  3. Actions that led to good rewards are made more likely in the future.
  4. Actions that led to poor rewards are made less likely.

Mathematically, the policy parameters are adjusted using gradient ascent: moving in the direction that increases expected total reward. This is where the name "policy gradient" comes from.
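The four steps above can be sketched with REINFORCE, the simplest policy gradient algorithm, on a toy one-state, three-action problem (the environment here is an assumption for illustration: only action 0 pays a reward):

```python
import math
import random

def softmax(logits):
    """Turn per-action logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE-style gradient-ascent step.

    For a softmax policy, the gradient of log pi(action) w.r.t. logit i
    is (1 if i == action else 0) - pi_i. Scaling that by the reward makes
    rewarded actions more likely and leaves unrewarded ones less likely.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    probs = softmax(logits)                                # 1. follow policy
    action = random.choices(range(3), weights=probs)[0]
    reward = 1.0 if action == 0 else 0.0                   # 2. observe reward
    logits = reinforce_update(logits, action, reward)      # 3/4. shift policy
```

After training, the policy assigns most of its probability to the rewarded action. Real implementations differentiate through a neural network instead of three logits, but the update rule is the same idea.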

Why policy gradients matter for modern AI

Policy gradient methods are central to how modern LLMs are aligned with human preferences. RLHF (Reinforcement Learning from Human Feedback), the technique that made ChatGPT conversational, uses policy gradient algorithms (specifically PPO, Proximal Policy Optimisation) to fine-tune language models based on human preference data.

The process:

  1. The LLM generates multiple responses to a prompt.
  2. Human raters rank the responses by quality.
  3. A reward model learns to predict human preferences.
  4. Policy gradient optimisation adjusts the LLM to produce responses the reward model scores highly.
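Step 4 typically uses PPO's clipped surrogate objective, which caps how far any single update can move the policy away from the one that generated the data. A minimal per-sample sketch (the ratio and advantage values in the comments are hypothetical):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one sampled action.

    ratio     = pi_new(a|s) / pi_old(a|s), how much more likely the new
                policy makes the sampled action.
    advantage = estimate of how much better the action was than expected.

    Taking the min of the clipped and unclipped terms removes the
    incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, gains are capped once ratio exceeds 1 + eps;
# with a negative advantage, the objective stays pessimistic.
inside = ppo_clipped_objective(1.0, 2.0)    # unclipped region
capped = ppo_clipped_objective(1.5, 1.0)    # clipped at 1 + eps = 1.2
penal = ppo_clipped_objective(0.5, -1.0)    # pessimistic branch
```

In RLHF, `advantage` is derived from the reward model's score, so this clipping is what keeps the LLM from drifting too far from its pre-trained behaviour in a single update.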

Advantages of policy gradient methods

  • They can handle continuous or large action spaces where other RL methods struggle.
  • They directly optimise for the desired behaviour.
  • They work with stochastic policies, which can be beneficial for exploration.

Limitations

  • High variance: Policy gradient estimates can be noisy, leading to unstable training.
  • Sample inefficiency: They often require many interactions with the environment to learn effectively.
  • Local optima: They can converge to mediocre policies if not carefully tuned.
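The high-variance problem is commonly tackled by subtracting a baseline (such as the average reward in a batch) from each return before weighting the gradient. This leaves the expected gradient unchanged but shrinks the size of each per-sample update. A sketch with toy, hypothetical returns:

```python
def centred_weights(rewards):
    """Subtract the batch-mean reward as a baseline from each return."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Hypothetical batch of episode returns. All are positive, so raw
# REINFORCE would push *every* sampled action up; centred weights instead
# only push up the better-than-average episodes.
rewards = [10.0, 12.0, 9.0, 11.0]
weights = centred_weights(rewards)
```

The centred weights sum to zero and are much smaller in magnitude than the raw returns, while preserving which episodes count as good or bad, which is the intuition behind the critic in actor-critic methods.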

Variants and improvements

REINFORCE, PPO, A2C, TRPO, and SAC are all policy gradient variants: REINFORCE is the original algorithm, while the others add actor-critic or trust-region machinery to address challenges around stability, efficiency, and performance.


Why This Matters

Policy gradient methods are the hidden engine behind AI alignment β€” the process that transforms a raw language model into a helpful, harmless assistant. Understanding them gives you insight into how AI companies shape model behaviour, which directly affects the quality and safety of every AI tool you use.
