
Gradient Descent

Last reviewed: April 2026

The optimisation algorithm that trains neural networks by iteratively adjusting model parameters in the direction that reduces prediction errors.

Gradient descent is the optimisation algorithm at the heart of machine learning training. It is the method by which models learn — iteratively adjusting their parameters to minimise errors.

The intuition

Imagine you are blindfolded on a hilly landscape and need to find the lowest valley. You cannot see, but you can feel the slope under your feet. The strategy is simple: take a step in the direction that goes most steeply downhill. Repeat until the ground levels out. That is gradient descent.
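The analogy can be made concrete in one dimension. The sketch below minimises the illustrative function f(x) = x², whose slope at any point is 2x; the starting point and step size are arbitrary choices for the example:

```python
# Minimal 1D gradient descent on f(x) = x**2, whose minimum is at x = 0.

def descend(x, learning_rate=0.1, steps=100):
    for _ in range(steps):
        slope = 2 * x                   # "feel the slope under your feet"
        x = x - learning_rate * slope   # step in the downhill direction
    return x

x_final = descend(x=5.0)  # ends very close to 0, the bottom of the valley
```

Each iteration shrinks the distance to the minimum by a constant factor, which is the "repeat until the ground levels out" step of the analogy.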

How it works mathematically

  1. The model makes predictions on training data
  2. A loss function measures how wrong the predictions are
  3. The gradient (mathematical slope) of the loss with respect to each parameter is calculated β€” this tells you which direction each parameter should move to reduce the error
  4. Each parameter is updated by a small step in the direction that decreases the loss
  5. Repeat for many iterations
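The five steps above can be sketched for the simplest possible model, y = w · x with one parameter and a mean-squared-error loss. The tiny dataset (generated from w = 3) and the hyperparameters are illustrative:

```python
# Gradient descent for a one-parameter model y = w * x on data where y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (x, y) pairs

def train(w=0.0, learning_rate=0.05, iterations=200):
    for _ in range(iterations):
        # Steps 1-3: for MSE loss, the gradient with respect to w is the
        # mean over examples of 2 * x * (prediction - target).
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        # Step 4: move w a small step against the gradient.
        w -= learning_rate * grad
    return w

w_learned = train()  # converges to roughly 3.0, the true slope
```

With real networks the loop is identical in shape; the gradient is simply computed by backpropagation over millions of parameters instead of by hand for one.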

The learning rate

The learning rate controls how big each step is:

  • Too large — the model overshoots the minimum, bouncing around or diverging
  • Too small — training takes forever, and the model may get stuck in a suboptimal spot
  • Just right — the model steadily converges to a good solution

Finding the right learning rate is one of the most important hyperparameter decisions in training.
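The three regimes are easy to observe on the same toy function f(x) = x² (the specific rates below are illustrative, not recommendations):

```python
# How far from the minimum (x = 0) we end up after 50 steps, per learning rate.
def run(learning_rate, x=1.0, steps=50):
    for _ in range(steps):
        x -= learning_rate * 2 * x  # gradient of x**2 is 2*x
    return abs(x)

too_small  = run(0.001)  # barely moves: still close to the start
just_right = run(0.1)    # converges to nearly 0
too_large  = run(1.1)    # each step overshoots and grows: divergence
```

With a rate of 1.1, each update multiplies the distance from the minimum by 1.2, so the iterates explode rather than converge.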

Variants of gradient descent

  • Batch gradient descent — computes the gradient using the entire dataset. Accurate but slow.
  • Stochastic gradient descent (SGD) — computes the gradient using a single random example. Fast but noisy.
  • Mini-batch gradient descent — the practical middle ground. Uses a batch of examples (typically 32 to 512).
  • Adam, AdaGrad, RMSprop — advanced optimisers that adapt the learning rate for each parameter individually. Adam is the most popular default choice.
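Mini-batch gradient descent can be sketched by extending the earlier one-parameter example: each update uses a small random batch instead of the whole dataset. The dataset size, batch size, and learning rate here are illustrative:

```python
import random

random.seed(0)
# 20 noiseless examples from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 21)]

def train_minibatch(w=0.0, learning_rate=0.002, batch_size=8, epochs=50):
    for _ in range(epochs):
        random.shuffle(data)  # visit examples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of MSE estimated on the batch only.
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

w = train_minibatch()  # converges to roughly 3.0
```

Each batch gives a noisy but cheap estimate of the full-dataset gradient, which is why mini-batching dominates in practice.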

Challenges

  • Local minima — the algorithm might find a valley that is not the deepest. In practice, for neural networks with millions of parameters, most local minima are "good enough."
  • Saddle points — flat regions where the gradient is near zero, causing training to stall
  • Vanishing/exploding gradients — in deep networks, gradients can become very small or very large, destabilising training
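A common mitigation for exploding gradients is gradient clipping: if the gradient vector's norm exceeds a threshold, rescale it to that threshold before the update. A minimal sketch (the threshold of 1.0 is an illustrative choice):

```python
import math

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        grad = [g * max_norm / norm for g in grad]
    return grad

clipped = clip_gradient([3.0, 4.0])  # norm 5.0, rescaled down to norm 1.0
```

The clipped gradient keeps its direction but not its magnitude, so a single huge gradient cannot throw the parameters far off course.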

Why This Matters

Gradient descent is why training large AI models requires enormous computing resources — it involves billions of these small parameter adjustments, repeated across massive datasets. Understanding this helps you appreciate why AI training is expensive and why breakthroughs in optimisation algorithms have practical business implications.


Learn More


This topic is covered in our lesson: How LLMs Actually Work