Pre-training
The initial, large-scale training phase where a foundation model learns general knowledge from vast amounts of data before being specialised for specific tasks.
Pre-training is the first and most resource-intensive phase of building a modern AI model. During pre-training, the model learns general knowledge and capabilities from an enormous, diverse dataset, typically trillions of tokens of text, code, and other data.
How pre-training works for language models
The pre-training objective for most large language models is deceptively simple: given a sequence of tokens, predict the next one. The model is trained on this task across trillions of examples drawn from books, websites, code repositories, scientific papers, and more.
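A minimal sketch of the next-token objective, using a toy bigram count model in place of a neural network. The corpus and vocabulary here are illustrative assumptions; a real model learns probabilities over a huge vocabulary rather than counting pairs:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the trillions of tokens used in real pre-training.
corpus = "the cat sat on the mat the cat ate the food".split()

# "Train" by counting how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, more than any other word
```

The same idea scales up in a real model: instead of a count table, a neural network maps the entire preceding context to a probability distribution over the next token.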
Through this simple task, the model learns:
- Grammar and syntax
- Facts and knowledge
- Reasoning patterns
- Multiple languages
- Code structure
- Writing styles and formats
The scale of pre-training
Pre-training a frontier language model requires:
- Data: trillions of tokens from diverse, curated sources
- Compute: thousands of GPUs running for weeks or months
- Cost: tens to hundreds of millions of dollars
- Energy: significant electricity consumption
This is why only a handful of organisations can afford to pre-train frontier models.
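These compute figures can be sanity-checked with the widely used rule of thumb that training takes roughly 6 × N × D floating-point operations, where N is the parameter count and D the number of training tokens. The model size, token count, and per-GPU throughput below are illustrative assumptions, not figures for any specific model:

```python
# Back-of-envelope pre-training compute estimate (all inputs are assumptions).
params = 70e9    # a 70B-parameter model
tokens = 2e12    # 2 trillion training tokens
flops = 6 * params * tokens  # ~6*N*D rule of thumb

gpu_flops_per_sec = 300e12   # ~300 TFLOP/s sustained per GPU
gpus = 4096
seconds = flops / (gpu_flops_per_sec * gpus)
days = seconds / 86400
print(f"{flops:.2e} FLOPs, roughly {days:.0f} days on {gpus} GPUs")
```

Even with thousands of GPUs, the run takes days to weeks, and the hardware, power, and engineering behind it are what put frontier-scale pre-training out of reach for most organisations.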
Pre-training vs. fine-tuning
- Pre-training builds the foundation: broad knowledge and general capabilities
- Fine-tuning specialises the foundation: adapting the model for specific tasks, domains, or behaviours
Think of pre-training as a university education (broad knowledge) and fine-tuning as professional training (specific skills).
The pre-training pipeline
- Data collection: gathering and curating enormous datasets
- Data cleaning: removing duplicates, filtering low-quality content, handling sensitive data
- Tokenisation: converting text into the numerical tokens the model processes
- Training: the actual compute-intensive process of learning from the data
- Evaluation: testing the pre-trained model on benchmarks to assess capabilities
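The tokenisation step above can be sketched with a toy character-level tokeniser. Real models use subword schemes such as byte-pair encoding; the text and vocabulary here are assumptions chosen for illustration:

```python
text = "pre-training builds the foundation"

vocab = sorted(set(text))                     # tiny character-level vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

tokens = [stoi[ch] for ch in text]            # encode text into token ids
decoded = "".join(itos[i] for i in tokens)    # decode ids back into text
assert decoded == text                        # the round trip is lossless
```

The model only ever sees the integer ids, which is why vocabulary design and tokenisation choices ripple through everything downstream.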
Pre-training data matters
The quality and composition of pre-training data profoundly shape the model's capabilities and biases. Models trained on more code are better at programming; models trained on more scientific literature are better at technical reasoning. Data curation is one of the most impactful, and least visible, decisions in model development.
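Data composition is often controlled with explicit mixture weights over sources, sampling each training document in proportion to its source's weight. The sources and weights below are hypothetical, purely to illustrate the mechanism:

```python
import random

# Hypothetical mixture weights over data sources (assumption, not real figures).
mixture = {"web": 0.6, "code": 0.2, "books": 0.1, "papers": 0.1}

rng = random.Random(0)  # fixed seed for reproducibility
# Draw the source for each of 10,000 training documents, weighted by mixture.
sample = rng.choices(list(mixture), weights=mixture.values(), k=10_000)

counts = {src: sample.count(src) for src in mixture}
print(counts)  # "web" dominates, matching its 0.6 weight
```

Shifting these weights, say, upweighting code, is one of the main levers labs use to steer what a pre-trained model ends up good at.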
Why This Matters
Pre-training is the foundation on which all modern AI capabilities are built. Understanding it helps you appreciate why AI models have the strengths and limitations they do, why training frontier models is so expensive, and why the quality and composition of training data is such a contentious and important topic in the industry.