Corpus
A large, structured collection of text used to train or evaluate AI language models, ranging from curated datasets to vast web crawls containing billions of documents.
A corpus (plural: corpora) is a large collection of text assembled for the purpose of training, evaluating, or studying language models. Think of it as the reading material an AI model learns from: the bigger and more diverse the corpus, the broader the model's understanding of language.
What makes a good corpus
The quality of an AI model is directly tied to the quality of its training corpus. A good corpus has several characteristics:
- Scale: Modern large language models train on corpora containing trillions of tokens, effectively a substantial fraction of the publicly available internet, plus books, academic papers, and code repositories.
- Diversity: A corpus that only contains scientific papers would produce a model that writes like a researcher. A diverse corpus spanning conversation, journalism, fiction, technical writing, and everyday language produces a model that can adapt to many contexts.
- Quality: Not all text is equally useful. Training on poorly written, factually incorrect, or toxic text degrades model quality. Corpus curation, the filtering and cleaning of the data, is a critical step.
- Representativeness: A corpus that underrepresents certain languages, cultures, or perspectives will produce a model with corresponding blind spots and biases.
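Curation heuristics like the scale and quality checks above can be illustrated in a few lines of code. This is a minimal sketch, not a real pipeline; the function name and thresholds are invented for illustration:

```python
import hashlib

def clean_corpus(documents, min_words=50):
    """Apply two common curation heuristics: drop very short
    documents and remove exact duplicates via content hashing."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        if len(doc.split()) < min_words:
            continue  # too short to carry useful signal
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines go much further, adding language identification, toxicity filtering, and near-duplicate detection (for example with MinHash), but the shape is the same: a sequence of filters applied to every document.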
Famous corpora in AI
- Common Crawl: A massive web crawl containing petabytes of raw text from billions of web pages. It forms the backbone of many training datasets but requires extensive cleaning.
- The Pile: An 825-gigabyte curated dataset combining 22 diverse sources including books, Wikipedia, GitHub code, and scientific papers.
- C4 (Colossal Clean Crawled Corpus): A cleaned version of Common Crawl used to train Google's T5 model.
- RedPajama: An open-source dataset designed to replicate the training data composition of Meta's LLaMA model.
Corpus and bias
Because AI models learn from their training corpus, any biases present in that text become biases in the model. If the corpus contains more text from certain demographics, regions, or viewpoints, the model will reflect those skews. This is why responsible AI development includes careful corpus analysis and debiasing techniques.
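A first pass at corpus analysis can be as simple as counting how often certain terms appear, as a crude proxy for representational skew. A minimal sketch; the watch-list of terms is something an analyst would choose, and the examples here are invented:

```python
from collections import Counter

def term_frequencies(documents, terms):
    """Count occurrences of watch-listed terms across a corpus,
    a rough first signal of over- or under-representation."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for term in terms:
            counts[term] += words.count(term)
    return counts
```

Real debiasing work uses far more sophisticated tools, but even simple counts like these can reveal which perspectives a corpus amplifies before any training begins.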
Domain-specific corpora
For enterprise applications, organisations often create domain-specific corpora: collections of text from their own industry, documentation, or internal knowledge bases. Fine-tuning a model on a domain-specific corpus produces far better results for specialised tasks than using a general-purpose model alone.
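Assembling a domain-specific corpus usually means converting internal documents into a uniform, machine-readable format before fine-tuning. A hedged sketch assuming a simple JSON Lines layout with `text` and `source` fields, which is a common convention rather than a fixed standard:

```python
import json

def to_jsonl(documents, source_name):
    """Serialise in-memory documents into JSON Lines, the
    one-record-per-line format many fine-tuning tools accept."""
    lines = []
    for doc in documents:
        record = {"text": doc.strip(), "source": source_name}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Tagging each record with its source makes it easy to audit the corpus later, for example to check how much of the training mix came from each internal knowledge base.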
Why This Matters
The training corpus is the single biggest determinant of what an AI model knows and how it behaves. Understanding this helps you evaluate why certain models perform better for your industry, why biases appear in AI outputs, and why fine-tuning on your own data can dramatically improve results for domain-specific tasks.
Continue learning in Essentials
This topic is covered in our lesson: How Large Language Models Actually Work