
Policy Gradient

Last reviewed: April 2026

A reinforcement learning technique where the AI directly learns the best action to take in each situation by adjusting its decision-making policy based on rewards.

Policy gradient is a family of reinforcement learning algorithms in which the AI directly learns a policy (a mapping from situations to actions) by adjusting the policy's parameters in the direction that increases expected reward.

What is a policy?

In reinforcement learning, a policy is the AI's decision-making strategy. Given a situation (state), the policy determines what action to take. A chess-playing AI's policy maps board positions to moves. A text-generating AI's policy maps the conversation so far to the next token to produce.
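In practice a policy is often stochastic: rather than picking one action, it assigns a probability to each available action, typically via a softmax over per-action scores. A minimal sketch (the three logit values here are hypothetical):

```python
import math

def softmax_policy(logits):
    """Convert per-action scores (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A hypothetical state with three available actions: the policy spreads
# probability over them instead of choosing one deterministically.
probs = softmax_policy([2.0, 1.0, 0.1])
```

The probabilities always sum to 1, and higher-scored actions get more probability mass, which is what lets the agent both exploit what it knows and keep exploring.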

How policy gradient works

The core idea is straightforward:

  1. The agent follows its current policy and takes actions in an environment.
  2. It receives rewards (positive or negative) based on the outcomes of those actions.
  3. Actions that led to good rewards are made more likely in the future.
  4. Actions that led to poor rewards are made less likely.

Mathematically, the policy parameters are adjusted using gradient ascent: moving in the direction that increases expected total reward. This is where the name "policy gradient" comes from.
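The four steps above can be sketched with REINFORCE, the simplest policy gradient algorithm, on a toy one-state, three-action problem (the environment here is an assumption for illustration: only action 0 pays a reward):

```python
import math
import random

def softmax(logits):
    """Turn per-action logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE-style gradient-ascent step.

    For a softmax policy, the gradient of log pi(action) w.r.t. logit i
    is (1 if i == action else 0) - pi_i. Scaling that by the reward makes
    rewarded actions more likely and leaves unrewarded ones less likely.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    probs = softmax(logits)                                # 1. follow policy
    action = random.choices(range(3), weights=probs)[0]
    reward = 1.0 if action == 0 else 0.0                   # 2. observe reward
    logits = reinforce_update(logits, action, reward)      # 3/4. shift policy
```

After training, the policy assigns most of its probability to the rewarded action. Real implementations differentiate through a neural network instead of three logits, but the update rule is the same idea.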

Why policy gradients matter for modern AI

Policy gradient methods are central to how modern LLMs are aligned with human preferences. RLHF (Reinforcement Learning from Human Feedback), the technique that made ChatGPT conversational, uses policy gradient algorithms (specifically PPO, Proximal Policy Optimisation) to fine-tune language models based on human preference data.

The process:

  1. The LLM generates multiple responses to a prompt.
  2. Human raters rank the responses by quality.
  3. A reward model learns to predict human preferences.
  4. Policy gradient optimisation adjusts the LLM to produce responses the reward model scores highly.
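Step 4 typically uses PPO's clipped surrogate objective, which caps how far any single update can move the policy away from the one that generated the data. A minimal per-sample sketch (the ratio and advantage values in the comments are hypothetical):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one sampled action.

    ratio     = pi_new(a|s) / pi_old(a|s), how much more likely the new
                policy makes the sampled action.
    advantage = estimate of how much better the action was than expected.

    Taking the min of the clipped and unclipped terms removes the
    incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, gains are capped once ratio exceeds 1 + eps;
# with a negative advantage, the objective stays pessimistic.
inside = ppo_clipped_objective(1.0, 2.0)    # unclipped region
capped = ppo_clipped_objective(1.5, 1.0)    # clipped at 1 + eps = 1.2
penal = ppo_clipped_objective(0.5, -1.0)    # pessimistic branch
```

In RLHF, `advantage` is derived from the reward model's score, so this clipping is what keeps the LLM from drifting too far from its pre-trained behaviour in a single update.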

Advantages of policy gradient methods

  • They can handle continuous or large action spaces where other RL methods struggle.
  • They directly optimise for the desired behaviour.
  • They work with stochastic policies, which can be beneficial for exploration.

Limitations

  • High variance: Policy gradient estimates can be noisy, leading to unstable training.
  • Sample inefficiency: They often require many interactions with the environment to learn effectively.
  • Local optima: They can converge to mediocre policies if not carefully tuned.
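The high-variance problem is commonly tackled by subtracting a baseline (such as the average reward in a batch) from each return before weighting the gradient. This leaves the expected gradient unchanged but shrinks the size of each per-sample update. A sketch with toy, hypothetical returns:

```python
def centred_weights(rewards):
    """Subtract the batch-mean reward as a baseline from each return."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Hypothetical batch of episode returns. All are positive, so raw
# REINFORCE would push *every* sampled action up; centred weights instead
# only push up the better-than-average episodes.
rewards = [10.0, 12.0, 9.0, 11.0]
weights = centred_weights(rewards)
```

The centred weights sum to zero and are much smaller in magnitude than the raw returns, while preserving which episodes count as good or bad, which is the intuition behind the critic in actor-critic methods.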

Variants and improvements

REINFORCE, PPO, A2C, TRPO, and SAC are all policy gradient variants: REINFORCE is the original algorithm, while the others add actor-critic or trust-region machinery to address challenges around stability, efficiency, and performance.


Why This Matters

Policy gradient methods are the hidden engine behind AI alignment β€” the process that transforms a raw language model into a helpful, harmless assistant. Understanding them gives you insight into how AI companies shape model behaviour, which directly affects the quality and safety of every AI tool you use.
