Corpus
A large, structured collection of text used to train or evaluate AI language models, ranging from curated datasets to vast web crawls containing billions of documents.
A corpus (plural: corpora) is a large collection of text assembled for the purpose of training, evaluating, or studying language models. Think of it as the reading material an AI model learns from: the bigger and more diverse the corpus, the broader the model's understanding of language.
What makes a good corpus
The quality of an AI model is directly tied to the quality of its training corpus. A good corpus has several characteristics:
- Scale: Modern large language models train on corpora containing trillions of tokens, effectively a substantial fraction of the publicly available internet, plus books, academic papers, and code repositories.
- Diversity: A corpus that only contains scientific papers would produce a model that writes like a researcher. A diverse corpus spanning conversation, journalism, fiction, technical writing, and everyday language produces a model that can adapt to many contexts.
- Quality: Not all text is equally useful. Training on poorly written, factually incorrect, or toxic text degrades model quality. Corpus curation, the filtering and cleaning of the data, is a critical step.
- Representativeness: A corpus that underrepresents certain languages, cultures, or perspectives will produce a model with corresponding blind spots and biases.
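Curation heuristics like the scale and quality checks above can be illustrated in a few lines of code. This is a minimal sketch, not a real pipeline; the function name and thresholds are invented for illustration:

```python
import hashlib

def clean_corpus(documents, min_words=50):
    """Apply two common curation heuristics: drop very short
    documents and remove exact duplicates via content hashing."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        if len(doc.split()) < min_words:
            continue  # too short to carry useful signal
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines go much further, adding language identification, toxicity filtering, and near-duplicate detection (for example with MinHash), but the shape is the same: a sequence of filters applied to every document.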
Famous corpora in AI
- Common Crawl: A massive web crawl containing petabytes of raw text from billions of web pages. It forms the backbone of many training datasets but requires extensive cleaning.
- The Pile: An 825-gigabyte curated dataset combining 22 diverse sources including books, Wikipedia, GitHub code, and scientific papers.
- C4 (Colossal Clean Crawled Corpus): A cleaned version of Common Crawl used to train Google's T5 model.
- RedPajama: An open-source dataset designed to replicate the training data composition of Meta's LLaMA model.
Corpus and bias
Because AI models learn from their training corpus, any biases present in that text become biases in the model. If the corpus contains more text from certain demographics, regions, or viewpoints, the model will reflect those skews. This is why responsible AI development includes careful corpus analysis and debiasing techniques.
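A first pass at corpus analysis can be as simple as counting how often certain terms appear, as a crude proxy for representational skew. A minimal sketch; the watch-list of terms is something an analyst would choose, and the examples here are invented:

```python
from collections import Counter

def term_frequencies(documents, terms):
    """Count occurrences of watch-listed terms across a corpus,
    a rough first signal of over- or under-representation."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for term in terms:
            counts[term] += words.count(term)
    return counts
```

Real debiasing work uses far more sophisticated tools, but even simple counts like these can reveal which perspectives a corpus amplifies before any training begins.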
Domain-specific corpora
For enterprise applications, organisations often create domain-specific corpora: collections of text from their own industry, documentation, or internal knowledge bases. Fine-tuning a model on a domain-specific corpus produces far better results for specialised tasks than using a general-purpose model alone.
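Assembling a domain-specific corpus usually means converting internal documents into a uniform, machine-readable format before fine-tuning. A hedged sketch assuming a simple JSON Lines layout with `text` and `source` fields, which is a common convention rather than a fixed standard:

```python
import json

def to_jsonl(documents, source_name):
    """Serialise in-memory documents into JSON Lines, the
    one-record-per-line format many fine-tuning tools accept."""
    lines = []
    for doc in documents:
        record = {"text": doc.strip(), "source": source_name}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Tagging each record with its source makes it easy to audit the corpus later, for example to check how much of the training mix came from each internal knowledge base.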
Why This Matters
The training corpus is the single biggest determinant of what an AI model knows and how it behaves. Understanding this helps you evaluate why certain models perform better for your industry, why biases appear in AI outputs, and why fine-tuning on your own data can dramatically improve results for domain-specific tasks.
Continue learning in Essentials
This topic is covered in our lesson: How Large Language Models Actually Work