Weight Decay
A regularisation technique that prevents AI model weights from growing too large during training, encouraging simpler models that generalise better to new data.
Weight decay is a regularisation technique used during neural network training that penalises large weight values, encouraging the model to use smaller, more distributed weights. This prevents the model from becoming overly complex and relying too heavily on any single feature, improving its ability to generalise to new data.
How weight decay works
During training, the model adjusts its weights to minimise a loss function: the mathematical measure of how wrong its predictions are. Weight decay adds a penalty term to this loss function that grows as the weights get larger:
Total loss = Prediction loss + (weight decay coefficient * sum of squared weights)
This means the model faces a trade-off: it wants to minimise prediction errors, but it also wants to keep weights small. The weight decay coefficient controls the balance: larger values enforce stronger regularisation.
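The penalised loss above can be sketched in a few lines. This is a minimal illustration, not any library's API; the function name, weights, and the coefficient `lam` are made up for the example.

```python
# Minimal sketch of a loss with an L2 (weight decay) penalty.
# All names and values here are illustrative.

def l2_penalised_loss(prediction_loss, weights, lam=0.01):
    """Total loss = prediction loss + lam * sum of squared weights."""
    penalty = lam * sum(w * w for w in weights)
    return prediction_loss + penalty

# Same prediction loss, but larger weights incur a larger penalty,
# so the optimiser is nudged towards the smaller-weight solution.
small = l2_penalised_loss(1.0, [0.1, -0.2, 0.3], lam=0.01)
large = l2_penalised_loss(1.0, [1.0, -2.0, 3.0], lam=0.01)
assert small < large
```

Because the penalty is added to the loss, its gradient (proportional to each weight) pulls every weight towards zero at each update step.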
Why large weights are problematic
Large weights indicate that the model is placing extreme importance on specific features or feature combinations. This typically means the model has memorised patterns in the training data rather than learning general rules. When it encounters new data where those specific patterns do not hold, it fails.
Think of it as a doctor who diagnoses every patient based entirely on one symptom. They might be right for the training cases, but in the real world, diagnosis requires weighing many symptoms together. Weight decay forces the model to distribute its "attention" across multiple features.
Weight decay versus L2 regularisation
Weight decay and L2 regularisation produce identical updates under vanilla stochastic gradient descent. However, with modern adaptive optimisers like Adam (used to train most neural networks today), they differ:
- L2 regularisation: Adds the penalty to the loss function before computing gradients. The effective penalty depends on the optimiser's learning rate.
- Decoupled weight decay (AdamW): Applies weight decay directly to the weights, independent of the gradient computation. This produces more consistent regularisation.
The AdamW optimiser (Adam with decoupled weight decay) has become the standard for training large language models because of this more principled regularisation.
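The equivalence under vanilla SGD can be checked with a one-step sketch. The values below are made up for illustration; with Adam the two rules diverge, because the L2 penalty gradient gets rescaled by Adam's per-parameter statistics while decoupled decay does not.

```python
# One SGD step on a single scalar weight, two formulations.
# lr, lam, w, grad are illustrative values, not from any real training run.

lr, lam = 0.1, 0.01
w, grad = 2.0, 0.5

# L2 regularisation: the penalty gradient (lam * w) is folded into the
# loss gradient before the update.
w_l2 = w - lr * (grad + lam * w)

# Decoupled weight decay: take the plain gradient step, then shrink the
# weight directly, independent of the gradient computation.
w_decoupled = w - lr * grad - lr * lam * w

# Under vanilla SGD the two updates coincide.
assert abs(w_l2 - w_decoupled) < 1e-12
```

With an adaptive optimiser, only the first formulation passes `lam * w` through the adaptive scaling, which is why decoupled decay gives more consistent regularisation across parameters.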
Choosing the coefficient
The weight decay coefficient is a hyperparameter that needs tuning:
- Too small: Insufficient regularisation. The model may still overfit.
- Too large: Over-regularisation. The model is too constrained and cannot learn complex patterns.
- Typical values: 0.01 to 0.1 for most applications. Modern large language model training commonly uses values around 0.1.
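To build intuition for what the coefficient does, the sketch below applies decoupled decay alone, with no data gradient. Each step multiplies the weight by (1 - lr * lam), so decay on its own shrinks weights geometrically towards zero; the learning rate and coefficient are illustrative.

```python
# Illustrative sketch: decoupled weight decay with no gradient term
# shrinks a weight geometrically by (1 - lr * lam) per step.
# Values are made up for the example.

lr, lam = 0.1, 0.1  # a coefficient around 0.1, as in large-model training
w = 1.0
for _ in range(100):
    w -= lr * lam * w  # pure decay step, no prediction-loss gradient

# After 100 steps: w = (1 - 0.01) ** 100, roughly 0.366.
assert 0.36 < w < 0.37
```

In real training the prediction-loss gradient pushes back against this shrinkage, and the equilibrium between the two forces is what the coefficient tunes.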
Weight decay in practice
Weight decay is one of the most commonly used regularisation techniques, applied almost universally in neural network training. It is simple to implement, computationally cheap, and effective across a wide range of architectures and tasks. Along with dropout, early stopping, and data augmentation, it forms the standard regularisation toolkit.
Why This Matters
Weight decay is a fundamental technique for building AI models that perform reliably in production, not just in testing. Understanding regularisation helps you evaluate whether a model has been properly trained and is likely to maintain its performance when deployed on real-world data.
Related Terms
Continue learning in Advanced
This topic is covered in our lesson: AI Infrastructure and Deployment
Training your team on AI? Enigmatica offers structured enterprise training built on this curriculum. Explore enterprise AI training.