
Model Serving

Last reviewed: April 2026

The infrastructure and process of making a trained AI model available to receive requests and return predictions in real time.

Model serving is the process of deploying a trained AI model so that applications can send it requests and receive predictions. It is the bridge between a model that works in a research notebook and one that powers a live product.

From training to serving

Training a model and serving a model are fundamentally different tasks. Training is a batch process β€” you feed data in, wait hours or days, and get a trained model out. Serving is a real-time process β€” you need to handle thousands of simultaneous requests with low latency and high reliability.

Key components of model serving

  • Model loading: Getting the model into GPU or CPU memory so it is ready to process requests.
  • Request handling: Receiving incoming requests, preprocessing inputs, running inference, and returning results.
  • Batching: Grouping multiple requests together to process them more efficiently on GPU hardware.
  • Scaling: Adding or removing computing resources based on demand.
  • Monitoring: Tracking latency, throughput, error rates, and model performance.
  • Versioning: Running multiple model versions simultaneously for A/B testing or gradual rollouts.
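The batching component above can be sketched in a few lines. This is a minimal, illustrative helper (the name `collect_batch` and the default limits are assumptions, not any framework's API): it blocks for the first queued request, then keeps pulling until the batch is full or a wait budget expires, which is the basic trade-off between GPU efficiency (bigger batches) and latency (shorter waits).

```python
import time
from queue import Empty, Queue

def collect_batch(requests: Queue, max_batch: int = 8, max_wait: float = 0.01):
    """Group pending requests into one batch for a single model call.

    Waits for at least one request, then keeps accepting more until the
    batch is full or the wait budget runs out.
    """
    batch = [requests.get()]                  # block until one request arrives
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break                             # wait budget exhausted
    return batch
```

With ten requests already queued and the default `max_batch=8`, one call returns the first eight and leaves the rest for the next batch; real serving systems layer scheduling and per-request deadlines on top of this core loop.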

Serving approaches

  • Managed API services: Use a provider like OpenAI, Anthropic, or Google. They handle all serving infrastructure. Simplest but least control.
  • Serverless inference: Platforms like AWS Lambda or Hugging Face Inference Endpoints spin up compute on demand. Pay only for what you use.
  • Dedicated infrastructure: Run models on your own or rented GPU servers. Most control but most operational burden.
  • Edge deployment: Run smaller models directly on user devices (phones, browsers). Lowest latency but limited model size.

Common serving frameworks

  • vLLM: High-performance serving specifically for LLMs.
  • TensorRT-LLM: NVIDIA's optimised LLM serving framework.
  • Triton Inference Server: NVIDIA's general-purpose model serving solution.
  • TorchServe: PyTorch's native serving framework.

Optimisation techniques

  • Quantisation: Reducing model precision (e.g., from 32-bit to 8-bit) to reduce memory and increase speed.
  • KV-cache optimisation: Efficiently managing the key-value cache for transformer models.
  • Continuous batching: Dynamically batching requests as they arrive rather than waiting for fixed batch sizes.
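Quantisation, the first technique above, can be illustrated with a minimal sketch. This toy version assumes simple symmetric int8 rounding with one scale per tensor; production systems typically use per-channel scales, calibration data, and library kernels rather than Python loops, so treat `quantize_int8` as a hypothetical helper, not a real API.

```python
def quantize_int8(weights):
    """Symmetric int8 quantisation: map each float to an integer in [-127, 127].

    Storing int8 values plus one float scale uses roughly a quarter of the
    memory of the original 32-bit floats.
    """
    scale = max(abs(w) for w in weights) / 127
    if scale == 0:
        scale = 1.0                       # all-zero weights: any scale works
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]
```

Round-to-nearest keeps the reconstruction error of each weight within half of one quantisation step (`scale / 2`), which is why modest precision reductions often cost little model quality while cutting memory and bandwidth substantially.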

Why This Matters

Model serving determines the real-world performance, cost, and reliability of your AI features. A well-served model feels instant and scales seamlessly. A poorly served one is slow, expensive, and prone to outages. Understanding serving options helps you choose between managed services and self-hosting based on your actual needs.
