Model Monitoring
Model monitoring tracks the health of deployed models over time. Production models degrade as the world changes; monitoring detects degradation early so corrective action can be taken.
Why Models Degrade
Data drift: the distribution of input features $P(X)$ changes over time. The model was trained on past data; present data looks different.
Concept drift: the relationship between inputs and outputs $P(Y \mid X)$ changes. For example, spam patterns change as spammers adapt; consumer behavior shifts seasonally.
Label distribution shift: the frequency of target classes changes ($P(Y)$ changes). A fraud model trained on a 1% fraud rate may encounter a 3% fraud rate during a fraud campaign.
Upstream data issues: a feature pipeline breaks; missing fields, schema changes, or incorrect aggregations silently corrupt inputs.
Monitoring Categories
Infrastructure Metrics
Hardware and service health. Monitored by platform teams.
- GPU/CPU utilization, memory usage.
- Request latency (P50, P95, P99).
- Error rate (HTTP 5xx, exception counts).
- Throughput (requests per second).
Data Drift
Statistical comparison of current input distributions to a reference distribution (the training data or a recent healthy window).
Population Stability Index (PSI):
\[\text{PSI} = \sum_{i} (A_i - E_i) \ln\!\left(\frac{A_i}{E_i}\right)\]
$A_i$: actual (current) fraction of observations in bin $i$. $E_i$: expected (reference) fraction in the same bin. Common rule of thumb: PSI < 0.1: no significant drift; 0.1-0.2: moderate drift; > 0.2: significant drift.
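The PSI formula above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name `psi`, the shared `bin_edges` argument, and the `eps` floor for empty bins are assumptions introduced here.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    def fractions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            # Values outside the edges are clamped into the first or last bin.
            idx = 0
            for i in range(1, len(bin_edges) - 1):
                if v >= bin_edges[i]:
                    idx = i
            counts[idx] += 1
        total = len(values)
        # eps (an assumed floor) avoids log(0) and division by zero for empty bins.
        return [max(c / total, eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Reference window: half the values in each of two bins.
reference = [0.5] * 50 + [1.5] * 50
# Current window: mass has moved toward the upper bin.
current = [0.5] * 20 + [1.5] * 80

score = psi(reference, current, bin_edges=[0, 1, 2])
print(f"PSI = {score:.3f}")
```

Here the bin fractions move from (0.5, 0.5) to (0.2, 0.8), which falls in the "significant drift" band of the rule of thumb.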
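KL divergence / Wasserstein distance: distance measures between the current and reference distributions. PSI is itself a symmetrized KL divergence over binned data; Wasserstein distance additionally reflects how far probability mass has moved, not only how much of it moved.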
Kolmogorov-Smirnov test: non-parametric test for distribution shift in continuous features.
Chi-squared test: categorical feature distribution shift.
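As an illustration of the Kolmogorov-Smirnov approach for a continuous feature, the sketch below computes the two-sample KS statistic by hand and compares it to the standard large-sample critical value $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. The function names and the default `alpha` are assumptions; in practice a statistics library would normally supply the test and an exact p-value.

```python
import math

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Step past every observation equal to x in both samples (handles ties).
        while i < n and ref[i] == x:
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_drift(reference, current, alpha=0.05):
    """Return (statistic, drift_flag) using the asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    d = ks_statistic(reference, current)
    return d, d > critical

reference = list(range(100))
shifted = [x + 50 for x in range(100)]   # same shape, shifted by 50
d, drifted = ks_drift(reference, shifted)
```

The asymptotic threshold is an approximation that is reasonable for moderately large windows; for small samples an exact test is preferable.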
Prediction Drift
Monitor the distribution of model outputs/predictions.
- Distribution of predicted probabilities.
- Distribution of predicted classes.
- Mean prediction score over time.
If predictions shift significantly without a corresponding ground-truth shift, it may indicate input drift or model issues.
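The three quantities listed above can be summarized per window without any labels. The sketch below is one possible way to do it, assuming each window is available as a list of predicted scores and a list of predicted classes; the function names and the use of total variation distance for the class distribution are choices made for this example.

```python
from collections import Counter
from statistics import mean

def class_shares(predicted_labels):
    """Fraction of predictions assigned to each class."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: count / total for label, count in counts.items()}

def prediction_drift(ref_scores, cur_scores, ref_labels, cur_labels):
    """Compare a current window of predictions with a reference window."""
    ref_share = class_shares(ref_labels)
    cur_share = class_shares(cur_labels)
    labels = set(ref_share) | set(cur_share)
    # Total variation distance between the predicted-class distributions.
    tvd = 0.5 * sum(
        abs(ref_share.get(l, 0.0) - cur_share.get(l, 0.0)) for l in labels
    )
    return {
        "mean_score_shift": mean(cur_scores) - mean(ref_scores),
        "class_tvd": tvd,
    }

report = prediction_drift(
    ref_scores=[0.1, 0.2, 0.3, 0.4],
    cur_scores=[0.3, 0.4, 0.5, 0.6],
    ref_labels=["legit"] * 90 + ["fraud"] * 10,
    cur_labels=["legit"] * 70 + ["fraud"] * 30,
)
```

The full score distribution can be compared with the same statistical tests used for input features.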
Model Performance Metrics
Compare predictions to ground truth when labels become available (often with a delay).
- Classification: accuracy, F1, AUC.
- Regression: MAE, RMSE.
- Ranking: NDCG, MRR.
Label delay: labels may not be immediately available (e.g., for churn, a customer's label is only known months later). Design the monitoring pipeline for delayed feedback.
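One way to handle delayed labels is to store predictions keyed by a request ID and score only the subset whose labels have arrived so far. The sketch below assumes hypothetical dictionaries keyed by request ID; it also reports coverage, since a metric computed on a small labeled fraction may not represent the full traffic.

```python
def accuracy_on_labeled(predictions, labels):
    """Score only the predictions whose ground truth has arrived.

    predictions: {request_id: predicted_class}
    labels:      {request_id: true_class}, filled in as labels arrive
    Returns (accuracy, coverage); accuracy is None when nothing is labeled yet.
    """
    matched = [
        (predictions[rid], truth)
        for rid, truth in labels.items()
        if rid in predictions
    ]
    if not matched:
        return None, 0.0
    accuracy = sum(pred == truth for pred, truth in matched) / len(matched)
    coverage = len(matched) / len(predictions)
    return accuracy, coverage

# Four logged predictions; only two labels have arrived so far.
predictions = {"r1": 1, "r2": 0, "r3": 1, "r4": 0}
labels = {"r1": 1, "r2": 1}
accuracy, coverage = accuracy_on_labeled(predictions, labels)
```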
Monitoring Workflow
Production requests
→ Log (inputs, outputs, metadata)
→ Feature drift detection (statistical tests)
→ Prediction drift detection
→ Performance monitoring (when labels arrive)
→ Alert if threshold exceeded
→ Trigger retraining or investigation
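The workflow above can be sketched as a list of checks evaluated over a window of logged requests. Everything named here (`Check`, `run_checks`, the record fields, and the threshold values) is hypothetical and chosen only to illustrate the flow from logged records to alerts.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

@dataclass
class Check:
    name: str                               # label used in the alert text
    metric: Callable[[List[dict]], float]   # one number from a window of records
    threshold: float                        # alert when the metric exceeds this

def run_checks(window: List[dict], checks: List[Check]) -> List[str]:
    """Evaluate every check on a window of logged requests and collect alerts."""
    alerts = []
    for check in checks:
        value = check.metric(window)
        if value > check.threshold:
            alerts.append(f"{check.name}: {value:.3f} > {check.threshold}")
    return alerts

# Logged records: inputs and outputs (field names are illustrative).
window = [
    {"amount": 120.0, "score": 0.91},
    {"amount": 95.0, "score": 0.88},
    {"amount": 110.0, "score": 0.93},
]
reference_mean_amount = 50.0   # assumed value from a healthy reference window
reference_mean_score = 0.90    # assumed value from a healthy reference window

checks = [
    Check("feature_drift(amount)",
          lambda w: abs(mean(r["amount"] for r in w) - reference_mean_amount),
          25.0),
    Check("prediction_drift(score)",
          lambda w: abs(mean(r["score"] for r in w) - reference_mean_score),
          0.10),
]
alerts = run_checks(window, checks)
```

In this example only the feature check fires; the resulting alert strings would then be routed to an alerting channel or used to open an investigation.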
Alerting
Threshold-based: alert when a metric exceeds a fixed threshold. Simple but prone to false positives from seasonal patterns.
Anomaly detection: model the expected metric distribution; alert on statistical outliers. More adaptive.
Integration: PagerDuty, OpsGenie, Slack webhooks, Datadog, Grafana.
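To make the contrast between the two alerting styles concrete, the sketch below implements a simple form of anomaly detection: a z-score against recent history. The function name, the default threshold of 3 standard deviations, and the `min_points` guard are assumptions for this example; seasonal metrics would need a baseline that accounts for seasonality.

```python
from statistics import mean, stdev

def zscore_alert(history, value, z_threshold=3.0, min_points=10):
    """Alert when value is a statistical outlier relative to recent history."""
    if len(history) < min_points:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # any change from a perfectly flat baseline
    return abs(value - mu) / sigma > z_threshold

# Recent daily error rates (illustrative numbers).
history = [0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.11, 0.09, 0.10]

normal_day = zscore_alert(history, 0.11)   # within normal variation
bad_day = zscore_alert(history, 0.20)      # far outside recent history
```

A fixed threshold would need manual retuning whenever the baseline moves, whereas the z-score adapts as `history` is updated.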
Tools
| Tool | Focus |
|---|---|
| Evidently AI | Open-source drift + quality reports |
| Arize | LLM and classical model monitoring |
| WhyLabs | Data logging, drift, explainability |
| Fiddler | Bias, explainability, performance |
| Prometheus + Grafana | Infrastructure + custom metrics |
| MLflow | Experiment tracking (limited monitoring) |
Shadow Mode Testing
Run a new model in shadow mode alongside the production model: both process every request, but only the production model’s output is returned. Log both outputs and compare them offline.
Safer than A/B testing for high-stakes decisions (medical, financial), since the new model’s output never reaches users and is never acted on.
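A minimal sketch of shadow mode follows, assuming both models are plain callables and the log is an in-memory list (a real system would write to durable storage, and would usually run the shadow model asynchronously to avoid adding latency). The names `handle_request` and `agreement_rate` are hypothetical.

```python
def handle_request(features, production_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow prediction only."""
    prod_pred = production_model(features)
    record = {"features": features, "production": prod_pred,
              "shadow": None, "error": None}
    try:
        record["shadow"] = shadow_model(features)
    except Exception as exc:
        # A shadow failure must never affect the response to the caller.
        record["error"] = repr(exc)
    shadow_log.append(record)
    return prod_pred

def agreement_rate(shadow_log):
    """Fraction of logged requests where both models gave the same output."""
    scored = [r for r in shadow_log if r["error"] is None]
    if not scored:
        return None
    return sum(r["production"] == r["shadow"] for r in scored) / len(scored)

# Toy models: each flags a request when the input exceeds a cutoff.
production = lambda x: x > 0
candidate = lambda x: x > 1

log = []
responses = [handle_request(x, production, candidate, log) for x in [0, 1, 2, 3]]
rate = agreement_rate(log)
```

Disagreements in the log are the cases worth inspecting before promoting the candidate model.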
Retraining Triggers
Scheduled retraining: retrain on a fixed schedule (weekly, monthly). Simple; may miss rapid drift.
Performance-based: retrain when monitored metrics fall below a threshold.
Drift-based: retrain when data drift exceeds a threshold, even before labels arrive.
Continuous training: stream new data into a replay buffer; retrain incrementally (online learning or periodic micro-fine-tunes).
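The scheduled, performance-based, and drift-based triggers can be combined in one policy check. The sketch below is an assumption-laden illustration: the class name, the field names, and the default thresholds are invented for the example, and `auc` is optional because labels may not have arrived yet.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RetrainPolicy:
    max_age_days: int = 30   # scheduled trigger (illustrative default)
    min_auc: float = 0.80    # performance trigger (illustrative default)
    max_psi: float = 0.2     # drift trigger (illustrative default)

def retrain_reasons(policy: RetrainPolicy, model_age_days: int,
                    auc: Optional[float], psi: float) -> List[str]:
    """Return every trigger that currently fires; an empty list means no retrain."""
    reasons = []
    if model_age_days >= policy.max_age_days:
        reasons.append("scheduled")
    if auc is not None and auc < policy.min_auc:
        reasons.append("performance")
    if psi > policy.max_psi:
        reasons.append("drift")
    return reasons

policy = RetrainPolicy()
# Labels have not arrived yet (auc=None), but input drift is already high.
early = retrain_reasons(policy, model_age_days=5, auc=None, psi=0.35)
healthy = retrain_reasons(policy, model_age_days=5, auc=0.91, psi=0.03)
```

Returning the list of reasons, instead of a single boolean, makes it easier to record why a retraining run was started.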