GGUF (GPT-Generated Unified Format)
A file format for storing quantised AI models, designed for efficient local execution and widely used by tools like llama.cpp to run large language models on consumer hardware.
GGUF (GPT-Generated Unified Format) is a file format designed for storing and distributing quantised AI models that can run efficiently on consumer hardware. It is the successor to the GGML format and has become the standard for running large language models locally using tools like llama.cpp, Ollama, and LM Studio.
Why GGUF exists
Running a large language model typically requires expensive server-grade GPUs with large amounts of VRAM. GGUF addresses this by packaging models in a format optimised for quantised inference: running models at reduced numerical precision, which dramatically lowers hardware requirements while preserving most of the model's quality.
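To make the idea of reduced precision concrete, here is a minimal sketch of simple absolute-max 8-bit quantisation. This is an illustration of the general concept only; GGUF's actual quantisation schemes (the Q4_K, Q5_K families and so on) are block-wise and considerably more sophisticated.

```python
def quantise_8bit(weights):
    """Map floats to int8 range [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantise(qweights, scale):
    """Recover approximate floats from the quantised integers."""
    return [q * scale for q in qweights]

weights = [0.12, -0.97, 0.5, 0.031]
q, s = quantise_8bit(weights)
restored = dequantise(q, s)
# Each restored value is close to, but not exactly, the original:
# the storage cost drops from 32 bits per weight to 8 bits plus one scale.
```

The trade-off is visible even in this toy version: lower bit widths mean coarser rounding, which is why the quantisation variants below differ in both file size and quality.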
What is in a GGUF file
A GGUF file contains everything needed to run a model:
- Model weights: The learned parameters, stored in quantised format (4-bit, 5-bit, 8-bit, etc.)
- Architecture information: The model's structure, such as layer count, hidden size, and attention head count.
- Tokenizer data: The vocabulary and rules for converting text to tokens and back.
- Metadata: Model name, author, licensing, quantisation method, and other descriptive information.
This self-contained design means a single file is all you need to run a model: no separate configuration files, no dependency management.
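The self-contained layout starts with a small fixed-size binary header. The sketch below parses just that fixed header (magic bytes, format version, tensor count, metadata key-value count, all little-endian, per the published GGUF specification); the variable-length metadata and tensor data that follow are omitted for brevity.

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

# A synthetic header for demonstration (version 3, 2 tensors, 5 metadata entries):
header = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
print(parse_gguf_header(header))  # {'version': 3, 'tensors': 2, 'metadata_kvs': 5}
```

In practice you would hand the file to a runtime like llama.cpp rather than parse it yourself, but the header shows why a single file suffices: the format itself announces how many tensors and metadata entries follow.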
Quantisation variants
GGUF files come in various quantisation levels, typically indicated in the filename:
- Q8_0: 8-bit quantisation. Highest quality, largest files. Minimal quality loss.
- Q6_K: 6-bit. Good balance of quality and size.
- Q5_K_M: 5-bit with medium optimisation. Popular middle ground.
- Q4_K_M: 4-bit with medium optimisation. The sweet spot for many users.
- Q3_K_S: 3-bit. Smallest files, noticeable quality degradation.
- Q2_K: 2-bit. Very small but significant quality loss.
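The file sizes these variants produce can be estimated from the parameter count and an effective bits-per-weight figure. The figures in the table below are my rough approximations (the K-quants mix precisions across weight blocks, so the effective rate is not a whole number) and are meant only for ballpark sizing.

```python
# Approximate effective bits per weight for common GGUF quantisation levels.
# These are rough assumed values, not exact figures from any specific release.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_S": 3.5, "Q2_K": 2.6,
}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Ballpark file size in GB for a model with the given parameter count."""
    total_bits = BITS_PER_WEIGHT[quant] * params_billion * 1e9
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at Q4_K_M lands around 4 GB:
print(round(estimate_size_gb(7, "Q4_K_M"), 1))  # 4.2
```

Real files also carry metadata and non-quantised tensors, so published sizes run slightly above these estimates, but the arithmetic explains why Q4_K_M of a 7B model is a fraction of the roughly 14 GB needed at 16-bit precision.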
The local AI ecosystem
GGUF has become central to the local AI movement, the growing community of users running AI models on their own hardware rather than using cloud APIs. Tools like Ollama and LM Studio make it straightforward to download a GGUF model and start chatting with it in minutes, with no cloud dependency, no API costs, and complete data privacy.
Choosing the right quantisation
The best quantisation level depends on your hardware and quality requirements. For most users with 16 GB of RAM, a Q4_K_M version of a 7-billion-parameter model runs well. Users with more memory can step up to Q5 or Q6 for better quality. The key is experimenting: download two versions, compare their outputs on your specific tasks, and choose the best trade-off.
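A quick feasibility check before downloading can be sketched as follows. The helper and its 30% overhead factor are my own hypothetical approximation, standing in for the memory the runtime needs beyond the weights themselves (KV cache, activation buffers, the host application).

```python
def fits_in_memory(model_size_gb: float, ram_gb: float, overhead: float = 1.3) -> bool:
    """Rough check: weights plus ~30% headroom for KV cache and runtime
    buffers should fit in available memory. The overhead factor is a guess
    and grows with context length."""
    return model_size_gb * overhead <= ram_gb

# A ~4.2 GB Q4_K_M 7B file fits comfortably in 16 GB of RAM,
# while a ~40 GB file does not:
print(fits_in_memory(4.2, 16))  # True
print(fits_in_memory(40.0, 16))  # False
```

This kind of back-of-the-envelope check narrows the candidates; actual throughput and quality still come down to trying the files on your own tasks, as suggested above.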
Why This Matters
GGUF is the format that makes local AI practical. Understanding it helps you evaluate whether running models on your own hardware, rather than paying for cloud APIs, is viable for your use case, and how to choose the right quality-performance trade-off.