Scaling Law
The empirical observation that AI model performance improves predictably as you increase model size, training data, and compute, following mathematical power laws.
Scaling laws are empirical observations that AI model performance improves in a predictable, mathematical relationship as you increase three factors: model size (number of parameters), amount of training data, and computational resources used for training.
The key discovery
In 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models" (Kaplan et al.), showing that the performance of language models follows power-law relationships with scale. This means that if you plot model performance against model size (or data or compute) on a log-log graph, you get a straight line. Performance improves smoothly and predictably as you scale up.
This was transformative because it meant AI labs could predict how good a model would be before spending the resources to train it. It gave them a roadmap: if you want a model that is X% better, you need Y% more parameters, Z% more data, and W% more compute.
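The straight-line-on-a-log-log-plot claim is easy to see numerically. The sketch below generates losses from a hypothetical power law and recovers the exponent with a linear fit in log space; the constants A and alpha are illustrative, not values from any real model family.

```python
import numpy as np

# Hypothetical power law: loss(N) = A * N**(-alpha).
# A and alpha are made-up constants for illustration only.
A, alpha = 10.0, 0.076
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])  # model sizes (parameters)
loss = A * N ** (-alpha)

# Taking logs turns the power law into a straight line,
# so a linear fit in log space recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(round(-slope, 3))             # recovered exponent alpha
print(round(np.exp(intercept), 1))  # recovered prefactor A
```

This is why scaling laws are predictive: fit the line on small, cheap training runs, then extrapolate it to estimate the loss of a much larger run before paying for it.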
The three scaling axes
- Parameters: More parameters mean more capacity to learn patterns. But parameters alone are not enough: a huge model trained on insufficient data will underperform.
- Training data: More data provides more patterns to learn from. But data alone is not enough: a small model cannot absorb the knowledge in a massive dataset.
- Compute: More training compute (GPU hours) lets the model see more data and adjust its parameters more times. This is often the binding constraint.
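The interaction between the first two axes can be captured in a toy loss model where loss falls as a power of both model size N and dataset size D, so neither can be scaled alone. All constants below are made up for illustration; the additive two-term form is only loosely inspired by published scaling-law papers.

```python
# Toy loss: one power-law term penalises too few parameters (N),
# another penalises too little data (D). Constants are illustrative.
def toy_loss(N, D, Nc=8.8e13, Dc=5.4e13, aN=0.076, aD=0.095):
    return (Nc / N) ** aN + (Dc / D) ** aD

balanced = toy_loss(N=1e9, D=1e10)       # medium model, medium data
big_starved = toy_loss(N=1e11, D=1e8)    # huge model, little data
small_flooded = toy_loss(N=1e7, D=1e12)  # tiny model, huge data

# Under this toy model, the balanced run beats both imbalanced ones.
print(balanced < big_starved and balanced < small_flooded)
```

The qualitative point survives any reasonable choice of constants: starving a huge model of data, or flooding a tiny model with data, both leave loss on the table compared with scaling N and D together.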
Chinchilla scaling
A 2022 paper from DeepMind (the Chinchilla paper) refined scaling laws by showing that many models were over-parameterised and under-trained. For a given compute budget, it is better to train a smaller model on more data than a larger model on less data. This shifted the field toward training smaller, better-fed models.
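The Chinchilla recipe can be sketched with two widely quoted approximations: training compute C ≈ 6·N·D FLOPs (N parameters, D training tokens), and a compute-optimal ratio of roughly 20 tokens per parameter. Treat both as rules of thumb rather than exact results.

```python
import math

def compute_optimal(C, tokens_per_param=20):
    """Split a compute budget C (FLOPs) into model size and data size.

    Assumes C = 6 * N * D and the Chinchilla-style ratio D = 20 * N,
    so C = 6 * tokens_per_param * N**2, which we solve for N.
    """
    N = math.sqrt(C / (6 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Chinchilla itself used roughly 5.76e23 FLOPs; this sketch should
# land near its reported 70B parameters and 1.4T tokens.
N, D = compute_optimal(5.76e23)
print(f"{N:.2e} params, {D:.2e} tokens")
```

Note that both N and D grow as the square root of the compute budget: doubling compute means scaling the model and the dataset by about 1.4x each, not doubling either one alone.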
Implications for the AI industry
- Massive investment: Scaling laws justify the billions being spent on GPU clusters: there is mathematical evidence that more compute produces better models.
- Diminishing returns: Performance keeps improving, but each constant gain in performance requires a multiplicative increase in compute; progress is roughly linear in the logarithm of resources spent.
- Cost of frontier models: Training costs for leading models have grown from millions to hundreds of millions to potentially billions of dollars.
Beyond loss scaling
Recent research shows that scaling laws apply not just to training loss but to downstream task performance, though the relationship is less clean. Some capabilities (like reasoning) appear to emerge suddenly at certain scales rather than improving smoothly.
Why This Matters
Scaling laws explain the AI arms race and the massive investments AI companies are making. Understanding them helps you appreciate why models keep getting better, why AI compute costs are so high, and why smaller, more efficient models (which bend the scaling curve) represent genuinely important breakthroughs.
Continue learning in Advanced
This topic is covered in our lesson: Understanding Model Architectures