Weight Decay
A regularisation technique that prevents AI model weights from growing too large during training, encouraging simpler models that generalise better to new data.
Weight decay is a regularisation technique used during neural network training that penalises large weight values, encouraging the model to use smaller, more distributed weights. This prevents the model from becoming overly complex and relying too heavily on any single feature, improving its ability to generalise to new data.
How weight decay works
During training, the model adjusts its weights to minimise a loss function: the mathematical measure of how wrong its predictions are. Weight decay adds a penalty term to this loss function that grows as the weights get larger:
Total loss = Prediction loss + (weight decay coefficient * sum of squared weights)
This means the model faces a trade-off: it wants to minimise prediction errors, but it also wants to keep weights small. The weight decay coefficient controls the balance: larger values enforce stronger regularisation.
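The penalised loss above can be sketched in a few lines. This is a minimal illustration, not any library's API; the function name, weights, and the coefficient `lam` are made up for the example.

```python
# Minimal sketch of a loss with an L2 (weight decay) penalty.
# All names and values here are illustrative.

def l2_penalised_loss(prediction_loss, weights, lam=0.01):
    """Total loss = prediction loss + lam * sum of squared weights."""
    penalty = lam * sum(w * w for w in weights)
    return prediction_loss + penalty

# Same prediction loss, but larger weights incur a larger penalty,
# so the optimiser is nudged towards the smaller-weight solution.
small = l2_penalised_loss(1.0, [0.1, -0.2, 0.3], lam=0.01)
large = l2_penalised_loss(1.0, [1.0, -2.0, 3.0], lam=0.01)
assert small < large
```

Because the penalty is added to the loss, its gradient (proportional to each weight) pulls every weight towards zero at each update step.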
Why large weights are problematic
Large weights indicate that the model is placing extreme importance on specific features or feature combinations. This typically means the model has memorised patterns in the training data rather than learning general rules. When it encounters new data where those specific patterns do not hold, it fails.
Think of it as a doctor who diagnoses every patient based entirely on one symptom. They might be right for the training cases, but in the real world, diagnosis requires weighing many symptoms together. Weight decay forces the model to distribute its "attention" across multiple features.
Weight decay versus L2 regularisation
Weight decay and L2 regularisation produce identical updates under vanilla stochastic gradient descent. However, with modern adaptive optimisers like Adam (used to train most neural networks today), they differ:
- L2 regularisation: Adds the penalty to the loss function before computing gradients. The effective penalty depends on the optimiser's learning rate.
- Decoupled weight decay (AdamW): Applies weight decay directly to the weights, independent of the gradient computation. This produces more consistent regularisation.
The AdamW optimiser (Adam with decoupled weight decay) has become the standard for training large language models because of this more principled regularisation.
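The equivalence under vanilla SGD can be checked with a one-step sketch. The values below are made up for illustration; with Adam the two rules diverge, because the L2 penalty gradient gets rescaled by Adam's per-parameter statistics while decoupled decay does not.

```python
# One SGD step on a single scalar weight, two formulations.
# lr, lam, w, grad are illustrative values, not from any real training run.

lr, lam = 0.1, 0.01
w, grad = 2.0, 0.5

# L2 regularisation: the penalty gradient (lam * w) is folded into the
# loss gradient before the update.
w_l2 = w - lr * (grad + lam * w)

# Decoupled weight decay: take the plain gradient step, then shrink the
# weight directly, independent of the gradient computation.
w_decoupled = w - lr * grad - lr * lam * w

# Under vanilla SGD the two updates coincide.
assert abs(w_l2 - w_decoupled) < 1e-12
```

With an adaptive optimiser, only the first formulation passes `lam * w` through the adaptive scaling, which is why decoupled decay gives more consistent regularisation across parameters.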
Choosing the coefficient
The weight decay coefficient is a hyperparameter that needs tuning:
- Too small: Insufficient regularisation. The model may still overfit.
- Too large: Over-regularisation. The model is too constrained and cannot learn complex patterns.
- Typical values: 0.01 to 0.1 for most applications. Modern large language model training commonly uses values around 0.1.
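To build intuition for what the coefficient does, the sketch below applies decoupled decay alone, with no data gradient. Each step multiplies the weight by (1 - lr * lam), so decay on its own shrinks weights geometrically towards zero; the learning rate and coefficient are illustrative.

```python
# Illustrative sketch: decoupled weight decay with no gradient term
# shrinks a weight geometrically by (1 - lr * lam) per step.
# Values are made up for the example.

lr, lam = 0.1, 0.1  # a coefficient around 0.1, as in large-model training
w = 1.0
for _ in range(100):
    w -= lr * lam * w  # pure decay step, no prediction-loss gradient

# After 100 steps: w = (1 - 0.01) ** 100, roughly 0.366.
assert 0.36 < w < 0.37
```

In real training the prediction-loss gradient pushes back against this shrinkage, and the equilibrium between the two forces is what the coefficient tunes.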
Weight decay in practice
Weight decay is one of the most commonly used regularisation techniques, applied almost universally in neural network training. It is simple to implement, computationally cheap, and effective across a wide range of architectures and tasks. Along with dropout, early stopping, and data augmentation, it forms the standard regularisation toolkit.
Why This Matters
Weight decay is a fundamental technique for building AI models that perform reliably in production, not just in testing. Understanding regularisation helps you evaluate whether a model has been properly trained and is likely to maintain its performance when deployed on real-world data.
Related Terms
Continue learning in Advanced
This topic is covered in our lesson: AI Infrastructure and Deployment
Training your team on AI? Enigmatica offers structured enterprise training built on this curriculum. Explore enterprise AI training.