
Pre-training

Last reviewed: April 2026

The initial, large-scale training phase where a foundation model learns general knowledge from vast amounts of data before being specialised for specific tasks.

Pre-training is the first and most resource-intensive phase of building a modern AI model. During pre-training, the model learns general knowledge and capabilities from an enormous, diverse dataset — typically trillions of tokens of text, code, and other data.

How pre-training works for language models

The pre-training objective for most large language models is deceptively simple: given a sequence of words, predict the one that comes next. The model is trained on this task across trillions of text examples from books, websites, code repositories, scientific papers, and more.

Through this simple task, the model learns:

  • Grammar and syntax
  • Facts and knowledge
  • Reasoning patterns
  • Multiple languages
  • Code structure
  • Writing styles and formats
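To make the objective concrete, here is a deliberately tiny sketch of next-word prediction. It replaces the neural network with simple successor counts over a toy corpus; real models learn these statistics (and far richer ones) with gradient descent over trillions of tokens.

```python
from collections import Counter, defaultdict

# Toy illustration of the next-word objective: count which word follows
# which in a tiny corpus, then predict the most frequent successor.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return successors[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on" — both "sat" occurrences are followed by "on"
```

Everything a language model "knows" is learned this way: patterns that help predict the next token, whether grammatical, factual, or stylistic.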

The scale of pre-training

Pre-training a frontier language model requires:

  • Data — trillions of tokens from diverse, curated sources
  • Compute — thousands of GPUs running for weeks or months
  • Cost — tens to hundreds of millions of dollars
  • Energy — significant electricity consumption

This is why only a handful of organisations can afford to pre-train frontier models.
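A back-of-the-envelope calculation shows why. A common rule of thumb from the scaling-law literature estimates training compute as roughly 6 × N × D FLOPs (N = parameters, D = training tokens). The GPU throughput, utilisation, and model size below are illustrative assumptions, not vendor figures.

```python
# Rough pre-training cost estimate using the ~6 * N * D FLOPs rule of thumb.
def pretraining_flops(params, tokens):
    return 6 * params * tokens

def gpu_hours(flops, flops_per_gpu_per_sec=4e14, utilization=0.4):
    # 4e14 FLOP/s sustained and 40% utilisation are assumed round numbers.
    return flops / (flops_per_gpu_per_sec * utilization) / 3600

# Hypothetical example: a 70B-parameter model trained on 1.4T tokens.
flops = pretraining_flops(params=70e9, tokens=1.4e12)
hours = gpu_hours(flops)
print(f"{flops:.2e} FLOPs, ~{hours:,.0f} GPU-hours")
```

Even under these generous assumptions the result is on the order of a million GPU-hours — weeks of wall-clock time across thousands of GPUs, before any failed runs or experiments.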

Pre-training vs. fine-tuning

  • Pre-training builds the foundation — broad knowledge and general capabilities
  • Fine-tuning specialises the foundation — adapting the model for specific tasks, domains, or behaviours

Think of pre-training as a university education (broad knowledge) and fine-tuning as professional training (specific skills).
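Mechanically, the two phases run the same training loop: fine-tuning continues updating the *same* parameters, just on a much smaller, task-specific dataset, typically with a lower learning rate. The `train()` function below is a placeholder stand-in, not a real API.

```python
# Sketch: pre-training and fine-tuning are the same loop over different data.
def train(model, dataset, lr):
    # Placeholder: one "update" per example; a real loop computes
    # gradients of the next-token loss and adjusts the weights.
    for example in dataset:
        model["updates"] += 1
        model["last_lr"] = lr
    return model

model = {"updates": 0, "last_lr": None}
model = train(model, dataset=range(1_000_000), lr=3e-4)  # pre-training: broad corpus
model = train(model, dataset=range(10_000), lr=3e-5)     # fine-tuning: narrow corpus
```

The asymmetry in dataset sizes is the point: fine-tuning is cheap precisely because pre-training already did the expensive general learning.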

The pre-training pipeline

  1. Data collection — gathering and curating enormous datasets
  2. Data cleaning — removing duplicates, filtering low-quality content, handling sensitive data
  3. Tokenisation — converting text into the numerical tokens the model processes
  4. Training — the actual compute-intensive process of learning from the data
  5. Evaluation — testing the pre-trained model on benchmarks to assess capabilities
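Steps 2 and 3 can be sketched in a few lines. This minimal version uses exact deduplication by content hash, a crude length filter, and whitespace "tokenisation" into integer IDs; production pipelines use fuzzy deduplication, learned quality filters, and subword tokenisers such as BPE, so this shows only the shape of the process.

```python
import hashlib

def clean(documents, min_words=3):
    """Drop exact duplicates (by hash) and very short documents."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen or len(doc.split()) < min_words:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

def tokenise(documents):
    """Map each whitespace-separated word to an integer ID."""
    vocab, ids = {}, []
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))  # assign next free ID
            ids.append(vocab[word])
    return ids, vocab

docs = ["the cat sat on the mat", "the cat sat on the mat", "hi", "dogs chase cats"]
kept = clean(docs)                    # duplicate and too-short docs removed
token_ids, vocab = tokenise(kept)     # the token stream the model trains on
```

The training step then consumes `token_ids` as one long stream, predicting each ID from the ones before it.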

Pre-training data matters

The quality and composition of pre-training data profoundly shape the model's capabilities and biases. Models trained on more code are better at programming. Models trained on more scientific literature are better at technical reasoning. Data curation is one of the most impactful — and least visible — decisions in model development.


Why This Matters

Pre-training is the foundation on which all modern AI capabilities are built. Understanding it helps you appreciate why AI models have the strengths and limitations they do, why training frontier models is so expensive, and why the quality and composition of training data are such a contentious and important topic in the industry.
