Model Serving
Model serving is the infrastructure that makes trained models available for real-time or batch predictions. The goal is to serve predictions with low latency, high throughput, and high availability.
Serving Patterns
Online (real-time) inference: respond to requests within milliseconds. Used for recommendation, fraud detection, search, chatbots.
Batch inference: process large datasets offline (e.g., overnight batch scoring). Latency requirements are relaxed, so throughput and cost matter more than response time; infrastructure is simpler.
Streaming inference: process a continuous stream of events; latency in seconds. Used for monitoring, real-time features, alert systems.
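As a rough illustration of the batch pattern above, the sketch below scores an offline dataset in fixed-size chunks rather than one request at a time. The names `chunked`, `score_offline`, and `model_fn` are hypothetical placeholders; `model_fn` stands in for any function that maps a list of records to a list of predictions.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(iterable)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def score_offline(records, model_fn, chunk_size=1000):
    """Score records chunk by chunk; model_fn is a hypothetical batch
    predictor that takes a list of records and returns a list of scores."""
    for chunk in chunked(records, chunk_size):
        yield from model_fn(chunk)


# Toy "model" that doubles each input, used only for demonstration.
scores = list(score_offline(range(5), lambda xs: [2 * x for x in xs], chunk_size=2))
```

Because nothing waits on an individual prediction, the chunk size can be chosen purely for throughput.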
REST API Serving
Wrap the model in an HTTP server; accept requests as JSON; return predictions as JSON.
FastAPI + Uvicorn: FastAPI is a lightweight async Python web framework; Uvicorn is the ASGI server that runs it. Simple to build; good for low-to-medium QPS.
Flask: older synchronous (WSGI) Python web framework. Less efficient for many concurrent requests.
Deployment: containerize with Docker; deploy to Kubernetes or a managed container service such as Cloud Run.
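Bottleneck: the Python GIL prevents threads in a single process from executing Python code in parallel. Use multiple worker processes (e.g., gunicorn) for CPU-bound work; async frameworks help mainly with I/O-bound concurrency.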
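A minimal sketch of the JSON-in, JSON-out pattern described above, written against Python's standard library only so that it runs without extra dependencies; a real service would more likely use FastAPI or one of the dedicated frameworks. The `predict` function, the `/predict` path, and the `features` field are hypothetical placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model: returns the mean of the input features."""
    return sum(features) / len(features)


def handle_predict(body: bytes):
    """Parse a JSON request body and return (status_code, response_bytes)."""
    try:
        features = json.loads(body)["features"]
        if not isinstance(features, list) or not features:
            raise ValueError("features must be a non-empty list")
        return 200, json.dumps({"prediction": predict(features)}).encode("utf-8")
    except (KeyError, TypeError, ValueError):
        # Malformed JSON, missing field, or non-numeric features.
        return 400, json.dumps({"error": "invalid request"}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start a single-process server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the request logic in `handle_predict` separates it from the HTTP layer, so it can be tested without opening a socket and reused under a different server.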
Dedicated Model Serving Frameworks
TorchServe: official PyTorch serving library. Model archive format; REST API; batching; metrics.
TensorFlow Serving: gRPC + REST; model versioning; multiple models on one server.
Triton Inference Server (NVIDIA): multi-framework (PyTorch, TensorFlow, ONNX, TensorRT). Dynamic batching; concurrent model execution; streaming.
Ray Serve: distributed serving on Ray. Python-native; supports arbitrary computation (preprocessing, postprocessing, ensembles). Easy auto-scaling.
vLLM: optimized serving for LLMs. Continuous batching, PagedAttention for KV cache management. High-throughput LLM inference.
Containerization and Orchestration
Docker: package the model, dependencies, and server code into a container image. Reproducible; portable.
Kubernetes: orchestrate containers at scale. Deployments manage replicas and Services expose them; the Horizontal Pod Autoscaler (HPA) scales replicas based on CPU utilization or custom metrics such as QPS. Supports rolling updates (zero-downtime deployments).
Helm charts: Kubernetes configuration templates. Standard for packaging ML serving deployments.
Scaling and Load Balancing
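Horizontal scaling: add more serving replicas behind a load balancer. Scales throughput roughly linearly, as long as no shared dependency becomes the bottleneck.
Vertical scaling: use larger GPUs or more GPUs per pod. Reduces per-request latency and lets larger models fit in memory.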
Auto-scaling: scale replicas based on QPS or GPU utilization. HPA in Kubernetes; KEDA for event-driven scaling.
Load balancer: distribute requests across replicas. Round-robin, least-connections, or consistent hashing (for sticky routing).
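The auto-scaling rule above can be made concrete with the replica formula documented for the Kubernetes HPA: desired = ceil(current replicas × current metric / target metric). The sketch below applies that formula with a min/max clamp. The function name and default bounds are illustrative assumptions, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA-style calculation (no tolerance or stabilization).

    current_metric and target_metric are per-replica averages of the same
    quantity, e.g. QPS per replica or GPU utilization in percent.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas each seeing 150 QPS against a target of 100 QPS per replica.
scaled_up = desired_replicas(4, 150, 100)
```

In this example the load is 1.5 times the target, so the replica count grows by the same factor, from 4 to 6.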
Batching
Group multiple requests into a single batch for efficient GPU execution.
Static batching: wait until a batch of fixed size is full. Adds latency; when traffic is low, requests wait a long time while the GPU sits idle.
Dynamic batching (Triton): accumulate requests for a configurable time window; batch whatever is queued. Balances latency and throughput.
Continuous batching (vLLM): for LLMs, new requests are inserted into the ongoing generation batch as soon as a slot is free. Eliminates the idle time that arises when short sequences must wait for the longest sequence in the batch to finish.
Optimal batch size: limited by GPU memory (KV cache for LLMs; activations for CNNs). Larger batches increase GPU utilization up to the memory limit.
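The time-window idea behind dynamic batching can be sketched in a few lines. This is an illustrative toy and not how Triton implements it: the function `collect_batch` and its parameters `max_batch_size` and `max_wait_s` are assumed names. It blocks for the first request, then gathers more until the batch is full or the window expires.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size=8, max_wait_s=0.01):
    """Return a list of 1..max_batch_size requests taken from the queue.

    The wait window starts when the first request is dequeued, so a lone
    request is delayed by at most max_wait_s before being processed.
    """
    batch = [request_queue.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with a partial batch
    return batch
```

A serving loop would call `collect_batch` repeatedly and run the model once per returned batch. Raising `max_wait_s` trades latency for fuller batches; setting it to zero sends each request as soon as it is dequeued, together with at most whatever is already waiting.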
Model Versioning and A/B Testing
Canary deployment: route a small fraction of traffic (e.g., 5%) to a new model version; monitor metrics; gradually increase.
Shadow deployment: the new model receives a copy of production traffic, but its predictions are not returned to users; they are logged and compared offline with the production model's outputs.
Blue/green deployment: maintain two identical environments; switch traffic instantly on validation.
Feature flags: control which model version a user receives. Useful for gradual rollouts and experimentation.
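One common way to implement the traffic split for a canary or flag-based rollout is to hash a stable identifier into a bucket, so that the same user always lands on the same version. The sketch below assumes a string `user_id` and the version labels `"canary"` and `"stable"`, all of which are illustrative names.

```python
import hashlib


def assign_version(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically map a user to 'canary' or 'stable'.

    The first 8 bytes of the SHA-256 digest are scaled to [0, 1); users
    whose value falls below canary_fraction receive the new version.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"


# Approximate share of 10,000 synthetic users routed to the canary.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_version(u) == "canary" for u in users) / len(users)
```

Increasing `canary_fraction` during a gradual rollout only moves additional users to the new version; users already on the canary stay there, because their bucket value does not change.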
Caching
Response caching: cache prediction results for identical inputs. Effective for low-cardinality inputs (e.g., autocomplete for common queries).
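KV cache (LLMs): cache the attention key/value tensors of tokens already processed, so each decoding step does not recompute them. Prefix caching (vLLM, TGI) reuses these tensors across requests that share a prompt prefix, which avoids recomputing the same system prompt for every request.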
Embedding cache: precompute and cache embeddings for static items (product catalog, knowledge base passages). Avoid redundant GPU computation.
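Response caching for identical inputs can be sketched with the standard library's `functools.lru_cache`. The function `run_model` is a hypothetical stand-in for an expensive inference call, and the `stats` dictionary exists only to show how often the model is actually invoked.

```python
from functools import lru_cache

stats = {"model_calls": 0}  # counts real (uncached) model invocations


def run_model(query: str) -> str:
    """Hypothetical expensive inference call; here it just upper-cases."""
    stats["model_calls"] += 1
    return query.upper()


@lru_cache(maxsize=1024)
def cached_predict(query: str) -> str:
    """Return a cached prediction when the exact same query was seen before."""
    return run_model(query)


first = cached_predict("weather today")
second = cached_predict("weather today")  # served from the cache
info = cached_predict.cache_info()
```

Cached entries become stale when a new model version is deployed; calling `cached_predict.cache_clear()` at deployment time is one simple way to handle that. The cache key must be hashable, so structured inputs would need to be normalized first, for example into a tuple or a canonical string.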
Latency Components
Total latency = Network (client → server)
+ Queue wait
+ Preprocessing
+ GPU compute (inference)
+ Postprocessing
+ Network (server → client)
Profiling each component reveals the bottleneck. GPU compute usually dominates for large models; for small, fast models or clients far from the server, the network round trip can dominate.
P99 latency: the 99th percentile latency, i.e., the value within which 99% of requests complete. Often $3$–$5\times$ higher than the median due to outlier requests (cold starts, large inputs). SLOs are typically defined on P99.
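The latency breakdown and the percentile definition above can be checked with a short calculation. All numbers below are made-up illustrative values, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-component timings for one request, in milliseconds.
components_ms = {
    "network_in": 2.0,
    "queue_wait": 1.5,
    "preprocessing": 0.5,
    "gpu_compute": 12.0,
    "postprocessing": 0.5,
    "network_out": 2.0,
}
total_ms = sum(components_ms.values())
bottleneck = max(components_ms, key=components_ms.get)

# Hypothetical end-to-end latencies: 98 typical requests and 2 slow outliers.
latencies_ms = [10.0] * 98 + [50.0] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With these assumed numbers, GPU compute is the largest component, and two slow requests out of a hundred are enough to put P99 at five times the median, which is why averages alone can hide tail latency.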