GGUF (GPT-Generated Unified Format)
A file format for storing quantised AI models, designed for efficient local execution and widely used by tools like llama.cpp to run large language models on consumer hardware.
GGUF (GPT-Generated Unified Format) is a file format designed for storing and distributing quantised AI models that can run efficiently on consumer hardware. It is the successor to the GGML format and has become the standard for running large language models locally using tools like llama.cpp, Ollama, and LM Studio.
Why GGUF exists
Running a large language model typically requires expensive server-grade GPUs with large amounts of VRAM. GGUF addresses this by packaging models in a format optimised for quantised inference: running models at reduced numerical precision, which dramatically lowers hardware requirements while preserving most of the model's quality.
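To make the idea of reduced precision concrete, here is a minimal sketch of simple absolute-max 8-bit quantisation. This is an illustration of the general concept only; GGUF's actual quantisation schemes (the Q4_K, Q5_K families and so on) are block-wise and considerably more sophisticated.

```python
def quantise_8bit(weights):
    """Map floats to int8 range [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantise(qweights, scale):
    """Recover approximate floats from the quantised integers."""
    return [q * scale for q in qweights]

weights = [0.12, -0.97, 0.5, 0.031]
q, s = quantise_8bit(weights)
restored = dequantise(q, s)
# Each restored value is close to, but not exactly, the original:
# the storage cost drops from 32 bits per weight to 8 bits plus one scale.
```

The trade-off is visible even in this toy version: lower bit widths mean coarser rounding, which is why the quantisation variants below differ in both file size and quality.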
What is in a GGUF file
A GGUF file contains everything needed to run a model:
- Model weights: The learned parameters, stored in quantised format (4-bit, 5-bit, 8-bit, etc.)
- Architecture information: The model's structure, such as layer count, hidden size, and attention head count.
- Tokenizer data: The vocabulary and rules for converting text to tokens and back.
- Metadata: Model name, author, licensing, quantisation method, and other descriptive information.
This self-contained design means a single file is all you need to run a model: no separate configuration files, no dependency management.
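The self-contained layout starts with a small fixed-size binary header. The sketch below parses just that fixed header (magic bytes, format version, tensor count, metadata key-value count, all little-endian, per the published GGUF specification); the variable-length metadata and tensor data that follow are omitted for brevity.

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

# A synthetic header for demonstration (version 3, 2 tensors, 5 metadata entries):
header = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
print(parse_gguf_header(header))  # {'version': 3, 'tensors': 2, 'metadata_kvs': 5}
```

In practice you would hand the file to a runtime like llama.cpp rather than parse it yourself, but the header shows why a single file suffices: the format itself announces how many tensors and metadata entries follow.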
Quantisation variants
GGUF files come in various quantisation levels, typically indicated in the filename:
- Q8_0: 8-bit quantisation. Highest quality, largest files. Minimal quality loss.
- Q6_K: 6-bit. Good balance of quality and size.
- Q5_K_M: 5-bit with medium optimisation. Popular middle ground.
- Q4_K_M: 4-bit with medium optimisation. The sweet spot for many users.
- Q3_K_S: 3-bit. Smallest files, noticeable quality degradation.
- Q2_K: 2-bit. Very small but significant quality loss.
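The file sizes these variants produce can be estimated from the parameter count and an effective bits-per-weight figure. The figures in the table below are my rough approximations (the K-quants mix precisions across weight blocks, so the effective rate is not a whole number) and are meant only for ballpark sizing.

```python
# Approximate effective bits per weight for common GGUF quantisation levels.
# These are rough assumed values, not exact figures from any specific release.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_S": 3.5, "Q2_K": 2.6,
}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Ballpark file size in GB for a model with the given parameter count."""
    total_bits = BITS_PER_WEIGHT[quant] * params_billion * 1e9
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at Q4_K_M lands around 4 GB:
print(round(estimate_size_gb(7, "Q4_K_M"), 1))  # 4.2
```

Real files also carry metadata and non-quantised tensors, so published sizes run slightly above these estimates, but the arithmetic explains why Q4_K_M of a 7B model is a fraction of the roughly 14 GB needed at 16-bit precision.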
The local AI ecosystem
GGUF has become central to the local AI movement, the growing community of users running AI models on their own hardware rather than using cloud APIs. Tools like Ollama and LM Studio make it straightforward to download a GGUF model and start chatting with it in minutes, with no cloud dependency, no API costs, and complete data privacy.
Choosing the right quantisation
The best quantisation level depends on your hardware and quality requirements. For most users with 16 GB of RAM, a Q4_K_M version of a 7-billion-parameter model runs well. Users with more memory can step up to Q5 or Q6 for better quality. The key is experimenting: download two versions, compare their outputs on your specific tasks, and choose the best trade-off.
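A quick feasibility check before downloading can be sketched as follows. The helper and its 30% overhead factor are my own hypothetical approximation, standing in for the memory the runtime needs beyond the weights themselves (KV cache, activation buffers, the host application).

```python
def fits_in_memory(model_size_gb: float, ram_gb: float, overhead: float = 1.3) -> bool:
    """Rough check: weights plus ~30% headroom for KV cache and runtime
    buffers should fit in available memory. The overhead factor is a guess
    and grows with context length."""
    return model_size_gb * overhead <= ram_gb

# A ~4.2 GB Q4_K_M 7B file fits comfortably in 16 GB of RAM,
# while a ~40 GB file does not:
print(fits_in_memory(4.2, 16))  # True
print(fits_in_memory(40.0, 16))  # False
```

This kind of back-of-the-envelope check narrows the candidates; actual throughput and quality still come down to trying the files on your own tasks, as suggested above.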
Why This Matters
GGUF is the format that makes local AI practical. Understanding it helps you evaluate whether running models on your own hardware, rather than paying for cloud APIs, is viable for your use case, and how to choose the right quality-performance trade-off.