
Gradient Descent

Last reviewed: April 2026

The optimisation algorithm that trains neural networks by iteratively adjusting model parameters in the direction that reduces prediction errors.

Gradient descent is the optimisation algorithm at the heart of machine learning training. It is the method by which models learn — iteratively adjusting their parameters to minimise errors.

The intuition

Imagine you are blindfolded on a hilly landscape and need to find the lowest valley. You cannot see, but you can feel the slope under your feet. The strategy is simple: take a step in the direction that goes most steeply downhill. Repeat until the ground levels out. That is gradient descent.
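The analogy can be made concrete in one dimension. The sketch below minimises the illustrative function f(x) = x², whose slope at any point is 2x; the starting point and step size are arbitrary choices for the example:

```python
# Minimal 1D gradient descent on f(x) = x**2, whose minimum is at x = 0.

def descend(x, learning_rate=0.1, steps=100):
    for _ in range(steps):
        slope = 2 * x                   # "feel the slope under your feet"
        x = x - learning_rate * slope   # step in the downhill direction
    return x

x_final = descend(x=5.0)  # ends very close to 0, the bottom of the valley
```

Each iteration shrinks the distance to the minimum by a constant factor, which is the "repeat until the ground levels out" step of the analogy.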

How it works mathematically

  1. The model makes predictions on training data
  2. A loss function measures how wrong the predictions are
  3. The gradient (mathematical slope) of the loss with respect to each parameter is calculated β€” this tells you which direction each parameter should move to reduce the error
  4. Each parameter is updated by a small step in the direction that decreases the loss
  5. Repeat for many iterations
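The five steps above can be sketched for the simplest possible model, y = w · x with one parameter and a mean-squared-error loss. The tiny dataset (generated from w = 3) and the hyperparameters are illustrative:

```python
# Gradient descent for a one-parameter model y = w * x on data where y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (x, y) pairs

def train(w=0.0, learning_rate=0.05, iterations=200):
    for _ in range(iterations):
        # Steps 1-3: for MSE loss, the gradient with respect to w is the
        # mean over examples of 2 * x * (prediction - target).
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        # Step 4: move w a small step against the gradient.
        w -= learning_rate * grad
    return w

w_learned = train()  # converges to roughly 3.0, the true slope
```

With real networks the loop is identical in shape; the gradient is simply computed by backpropagation over millions of parameters instead of by hand for one.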

The learning rate

The learning rate controls how big each step is:

  • Too large — the model overshoots the minimum, bouncing around or diverging
  • Too small — training takes forever, and the model may get stuck in a suboptimal spot
  • Just right — the model steadily converges to a good solution

Finding the right learning rate is one of the most important hyperparameter decisions in training.
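The three regimes are easy to observe on the same toy function f(x) = x² (the specific rates below are illustrative, not recommendations):

```python
# How far from the minimum (x = 0) we end up after 50 steps, per learning rate.
def run(learning_rate, x=1.0, steps=50):
    for _ in range(steps):
        x -= learning_rate * 2 * x  # gradient of x**2 is 2*x
    return abs(x)

too_small  = run(0.001)  # barely moves: still close to the start
just_right = run(0.1)    # converges to nearly 0
too_large  = run(1.1)    # each step overshoots and grows: divergence
```

With a rate of 1.1, each update multiplies the distance from the minimum by 1.2, so the iterates explode rather than converge.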

Variants of gradient descent

  • Batch gradient descent — computes the gradient using the entire dataset. Accurate but slow.
  • Stochastic gradient descent (SGD) — computes the gradient using a single random example. Fast but noisy.
  • Mini-batch gradient descent — the practical middle ground. Uses a batch of examples (typically 32 to 512).
  • Adam, AdaGrad, RMSprop — advanced optimisers that adapt the learning rate for each parameter individually. Adam is the most popular default choice.
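Mini-batch gradient descent can be sketched by extending the earlier one-parameter example: each update uses a small random batch instead of the whole dataset. The dataset size, batch size, and learning rate here are illustrative:

```python
import random

random.seed(0)
# 20 noiseless examples from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 21)]

def train_minibatch(w=0.0, learning_rate=0.002, batch_size=8, epochs=50):
    for _ in range(epochs):
        random.shuffle(data)  # visit examples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of MSE estimated on the batch only.
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

w = train_minibatch()  # converges to roughly 3.0
```

Each batch gives a noisy but cheap estimate of the full-dataset gradient, which is why mini-batching dominates in practice.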

Challenges

  • Local minima — the algorithm might find a valley that is not the deepest. In practice, for neural networks with millions of parameters, most local minima are "good enough."
  • Saddle points — flat regions where the gradient is near zero, causing training to stall
  • Vanishing/exploding gradients — in deep networks, gradients can become very small or very large, destabilising training
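A common mitigation for exploding gradients is gradient clipping: if the gradient vector's norm exceeds a threshold, rescale it to that threshold before the update. A minimal sketch (the threshold of 1.0 is an illustrative choice):

```python
import math

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        grad = [g * max_norm / norm for g in grad]
    return grad

clipped = clip_gradient([3.0, 4.0])  # norm 5.0, rescaled down to norm 1.0
```

The clipped gradient keeps its direction but not its magnitude, so a single huge gradient cannot throw the parameters far off course.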

Why This Matters

Gradient descent is why training large AI models requires enormous computing resources — it involves billions of these small parameter adjustments, repeated across massive datasets. Understanding this helps you appreciate why AI training is expensive and why breakthroughs in optimisation algorithms have practical business implications.


Learn More


This topic is covered in our lesson: How LLMs Actually Work