Model Serving
The infrastructure and process of making a trained AI model available to receive requests and return predictions in real time.
Model serving is the process of deploying a trained AI model so that applications can send it requests and receive predictions. It is the bridge between a model that works in a research notebook and one that powers a live product.
From training to serving
Training a model and serving a model are fundamentally different tasks. Training is a batch process: you feed data in, wait hours or days, and get a trained model out. Serving is a real-time process: you need to handle thousands of simultaneous requests with low latency and high reliability.
Key components of model serving
- Model loading: Getting the model into GPU or CPU memory so it is ready to process requests.
- Request handling: Receiving incoming requests, preprocessing inputs, running inference, and returning results.
- Batching: Grouping multiple requests together to process them more efficiently on GPU hardware.
- Scaling: Adding or removing computing resources based on demand.
- Monitoring: Tracking latency, throughput, error rates, and model performance.
- Versioning: Running multiple model versions simultaneously for A/B testing or gradual rollouts.
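The loading, request-handling, and batching components above can be sketched as a minimal in-process serving loop. Everything here is a stand-in for illustration: the dummy model, the character-level "tokeniser", and the single batched call are placeholders, not a real serving framework.

```python
from dataclasses import dataclass

@dataclass
class DummyModel:
    """Hypothetical stand-in for a trained model: 'inference' sums token ids."""

    def predict_batch(self, batch):
        # Batching: one call handles many requests at once, which is
        # how real servers keep GPU hardware busy.
        return [sum(tokens) for tokens in batch]

def preprocess(text):
    # Stand-in tokeniser: map each character to an integer id.
    return [ord(c) for c in text]

def serve(model, requests):
    # Request handling: preprocess each input, run one batched
    # inference call, and pair outputs back with their requests.
    batch = [preprocess(r) for r in requests]
    outputs = model.predict_batch(batch)
    return dict(zip(requests, outputs))

model = DummyModel()  # Model loading: done once, then reused for every request.
results = serve(model, ["hi", "ok"])
```

A production server wraps this same loop in an HTTP layer and adds the scaling, monitoring, and versioning concerns listed above.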
Serving approaches
- Managed API services: Use a provider like OpenAI, Anthropic, or Google. They handle all serving infrastructure. Simplest but least control.
- Serverless inference: Platforms like AWS Lambda or Hugging Face Inference Endpoints spin up compute on demand. Pay only for what you use.
- Dedicated infrastructure: Run models on your own or rented GPU servers. Most control but most operational burden.
- Edge deployment: Run smaller models directly on user devices (phones, browsers). Lowest latency but limited model size.
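With the managed-API approach, serving reduces to an HTTP call to the provider. This sketch only constructs the JSON payload for an OpenAI-style chat-completions endpoint; the URL and model name are illustrative placeholders, and no request is actually sent, so no network access or API key is needed.

```python
import json

# Illustrative endpoint for an OpenAI-style API; a real client would
# POST this payload there with an Authorization header.
API_URL = "https://api.example.com/v1/chat/completions"

def build_request(prompt, model="example-model", max_tokens=256):
    # The provider runs all serving infrastructure; the client only
    # builds a request body and parses the JSON response.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

payload = build_request("Summarise model serving in one line.")
```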
Common serving frameworks
- vLLM: High-performance serving specifically for LLMs.
- TensorRT-LLM: NVIDIA's optimised LLM serving framework.
- Triton Inference Server: NVIDIA's general-purpose model serving solution.
- TorchServe: PyTorch's native serving framework.
Optimisation techniques
- Quantisation: Reducing the numerical precision of model weights (e.g., from 32-bit floats to 8-bit integers) to cut memory use and increase throughput.
- KV-cache optimisation: Efficiently managing the attention key-value cache so that earlier tokens in a sequence are not recomputed on every generation step.
- Continuous batching: Dynamically batching requests as they arrive rather than waiting for fixed batch sizes.
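To make quantisation concrete, here is a toy symmetric int8 quantisation of a weight vector in pure Python. Real serving stacks quantise per-tensor or per-channel with optimised kernels; this sketch only shows the core idea of trading precision for memory.

```python
def quantize_int8(weights):
    # Symmetric quantisation: map floats in [-max_abs, max_abs]
    # onto the int8 range [-127, 127] using a single scale factor.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; the difference from the originals
    # is the quantisation error.
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.75]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Each weight now occupies 1 byte instead of 4, at the cost of a small rounding error per value.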
Why This Matters
Model serving determines the real-world performance, cost, and reliability of your AI features. A well-served model feels instant and scales seamlessly. A poorly served one is slow, expensive, and prone to outages. Understanding serving options helps you choose between managed services and self-hosting based on your actual needs.