
Model Serving

Last reviewed: April 2026

The infrastructure and process of making a trained AI model available to receive requests and return predictions in real time.

Model serving is the process of deploying a trained AI model so that applications can send it requests and receive predictions. It is the bridge between a model that works in a research notebook and one that powers a live product.

From training to serving

Training a model and serving a model are fundamentally different tasks. Training is a batch process β€” you feed data in, wait hours or days, and get a trained model out. Serving is a real-time process β€” you need to handle thousands of simultaneous requests with low latency and high reliability.

Key components of model serving

  • Model loading: Getting the model into GPU or CPU memory so it is ready to process requests.
  • Request handling: Receiving incoming requests, preprocessing inputs, running inference, and returning results.
  • Batching: Grouping multiple requests together to process them more efficiently on GPU hardware.
  • Scaling: Adding or removing computing resources based on demand.
  • Monitoring: Tracking latency, throughput, error rates, and model performance.
  • Versioning: Running multiple model versions simultaneously for A/B testing or gradual rollouts.
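The batching component above can be sketched in a few lines. This is a minimal, illustrative helper (the name `collect_batch` and the default limits are assumptions, not any framework's API): it blocks for the first queued request, then keeps pulling until the batch is full or a wait budget expires, which is the basic trade-off between GPU efficiency (bigger batches) and latency (shorter waits).

```python
import time
from queue import Empty, Queue

def collect_batch(requests: Queue, max_batch: int = 8, max_wait: float = 0.01):
    """Group pending requests into one batch for a single model call.

    Waits for at least one request, then keeps accepting more until the
    batch is full or the wait budget runs out.
    """
    batch = [requests.get()]                  # block until one request arrives
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break                             # wait budget exhausted
    return batch
```

With ten requests already queued and the default `max_batch=8`, one call returns the first eight and leaves the rest for the next batch; real serving systems layer scheduling and per-request deadlines on top of this core loop.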

Serving approaches

  • Managed API services: Use a provider like OpenAI, Anthropic, or Google. They handle all serving infrastructure. Simplest but least control.
  • Serverless inference: Platforms like AWS Lambda or Hugging Face Inference Endpoints spin up compute on demand. Pay only for what you use.
  • Dedicated infrastructure: Run models on your own or rented GPU servers. Most control but most operational burden.
  • Edge deployment: Run smaller models directly on user devices (phones, browsers). Lowest latency but limited model size.

Common serving frameworks

  • vLLM: High-performance serving specifically for LLMs.
  • TensorRT-LLM: NVIDIA's optimised LLM serving framework.
  • Triton Inference Server: NVIDIA's general-purpose model serving solution.
  • TorchServe: PyTorch's native serving framework.

Optimisation techniques

  • Quantisation: Reducing model precision (e.g., from 32-bit to 8-bit) to reduce memory and increase speed.
  • KV-cache optimisation: Efficiently managing the key-value cache for transformer models.
  • Continuous batching: Dynamically batching requests as they arrive rather than waiting for fixed batch sizes.
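Quantisation, the first technique above, can be illustrated with a minimal sketch. This toy version assumes simple symmetric int8 rounding with one scale per tensor; production systems typically use per-channel scales, calibration data, and library kernels rather than Python loops, so treat `quantize_int8` as a hypothetical helper, not a real API.

```python
def quantize_int8(weights):
    """Symmetric int8 quantisation: map each float to an integer in [-127, 127].

    Storing int8 values plus one float scale uses roughly a quarter of the
    memory of the original 32-bit floats.
    """
    scale = max(abs(w) for w in weights) / 127
    if scale == 0:
        scale = 1.0                       # all-zero weights: any scale works
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]
```

Round-to-nearest keeps the reconstruction error of each weight within half of one quantisation step (`scale / 2`), which is why modest precision reductions often cost little model quality while cutting memory and bandwidth substantially.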

Why This Matters

Model serving determines the real-world performance, cost, and reliability of your AI features. A well-served model feels instant and scales seamlessly. A poorly served one is slow, expensive, and prone to outages. Understanding serving options helps you choose between managed services and self-hosting based on your actual needs.
