Pre-training
The initial, large-scale training phase where a foundation model learns general knowledge from vast amounts of data before being specialised for specific tasks.
Pre-training is the first and most resource-intensive phase of building a modern AI model. During pre-training, the model learns general knowledge and capabilities from an enormous, diverse dataset, typically trillions of tokens of text, code, and other data.
How pre-training works for language models
The pre-training objective for most large language models is deceptively simple: given a sequence of tokens, predict the next one. The model is trained on this task across trillions of examples drawn from books, websites, code repositories, scientific papers, and more.
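A minimal sketch of the next-token objective, using a toy bigram count model in place of a neural network. The corpus and vocabulary here are illustrative assumptions; a real model learns probabilities over a huge vocabulary rather than counting pairs:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the trillions of tokens used in real pre-training.
corpus = "the cat sat on the mat the cat ate the food".split()

# "Train" by counting how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, more than any other word
```

The same idea scales up in a real model: instead of a count table, a neural network maps the entire preceding context to a probability distribution over the next token.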
Through this simple task, the model learns:
- Grammar and syntax
- Facts and knowledge
- Reasoning patterns
- Multiple languages
- Code structure
- Writing styles and formats
The scale of pre-training
Pre-training a frontier language model requires:
- Data: trillions of tokens from diverse, curated sources
- Compute: thousands of GPUs running for weeks or months
- Cost: tens to hundreds of millions of dollars
- Energy: significant electricity consumption
This is why only a handful of organisations can afford to pre-train frontier models.
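These compute figures can be sanity-checked with the widely used rule of thumb that training takes roughly 6 × N × D floating-point operations, where N is the parameter count and D the number of training tokens. The model size, token count, and per-GPU throughput below are illustrative assumptions, not figures for any specific model:

```python
# Back-of-envelope pre-training compute estimate (all inputs are assumptions).
params = 70e9    # a 70B-parameter model
tokens = 2e12    # 2 trillion training tokens
flops = 6 * params * tokens  # ~6*N*D rule of thumb

gpu_flops_per_sec = 300e12   # ~300 TFLOP/s sustained per GPU
gpus = 4096
seconds = flops / (gpu_flops_per_sec * gpus)
days = seconds / 86400
print(f"{flops:.2e} FLOPs, roughly {days:.0f} days on {gpus} GPUs")
```

Even with thousands of GPUs, the run takes days to weeks, and the hardware, power, and engineering behind it are what put frontier-scale pre-training out of reach for most organisations.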
Pre-training vs. fine-tuning
- Pre-training builds the foundation: broad knowledge and general capabilities
- Fine-tuning specialises the foundation: adapting the model for specific tasks, domains, or behaviours
Think of pre-training as a university education (broad knowledge) and fine-tuning as professional training (specific skills).
The pre-training pipeline
- Data collection: gathering and curating enormous datasets
- Data cleaning: removing duplicates, filtering low-quality content, handling sensitive data
- Tokenisation: converting text into the numerical tokens the model processes
- Training: the actual compute-intensive process of learning from the data
- Evaluation: testing the pre-trained model on benchmarks to assess capabilities
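The tokenisation step above can be sketched with a toy character-level tokeniser. Real models use subword schemes such as byte-pair encoding; the text and vocabulary here are assumptions chosen for illustration:

```python
text = "pre-training builds the foundation"

vocab = sorted(set(text))                     # tiny character-level vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

tokens = [stoi[ch] for ch in text]            # encode text into token ids
decoded = "".join(itos[i] for i in tokens)    # decode ids back into text
assert decoded == text                        # the round trip is lossless
```

The model only ever sees the integer ids, which is why vocabulary design and tokenisation choices ripple through everything downstream.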
Pre-training data matters
The quality and composition of pre-training data profoundly shape the model's capabilities and biases. Models trained on more code are better at programming; models trained on more scientific literature are better at technical reasoning. Data curation is one of the most impactful, and least visible, decisions in model development.
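Data composition is often controlled with explicit mixture weights over sources, sampling each training document in proportion to its source's weight. The sources and weights below are hypothetical, purely to illustrate the mechanism:

```python
import random

# Hypothetical mixture weights over data sources (assumption, not real figures).
mixture = {"web": 0.6, "code": 0.2, "books": 0.1, "papers": 0.1}

rng = random.Random(0)  # fixed seed for reproducibility
# Draw the source for each of 10,000 training documents, weighted by mixture.
sample = rng.choices(list(mixture), weights=mixture.values(), k=10_000)

counts = {src: sample.count(src) for src in mixture}
print(counts)  # "web" dominates, matching its 0.6 weight
```

Shifting these weights, say, upweighting code, is one of the main levers labs use to steer what a pre-trained model ends up good at.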
Why This Matters
Pre-training is the foundation on which all modern AI capabilities are built. Understanding it helps you appreciate why AI models have the strengths and limitations they do, why training frontier models is so expensive, and why the quality and composition of training data is such a contentious and important topic in the industry.