Model Evaluation
How do we measure model performance?
Model evaluation measures how well a learned model generalizes to unseen data. It guides model selection, debugging, and deployment decisions. The choice of metric must align with the real-world cost structure of errors.
Train / Validation / Test Split
- Training set: used to fit model parameters.
- Validation set: used for hyperparameter tuning and model selection.
- Test set: used once, at the end, for unbiased estimation of generalization performance.
Contaminating the test set (using it to guide any modeling decision) yields optimistically biased performance estimates. Typical splits: 60/20/20 or 70/15/15.
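A minimal sketch of a 70/15/15 split over example indices, using only the standard library. The function name `split_indices`, the fixed seed, and the choice to shuffle before cutting are illustrative assumptions; time-ordered or grouped data usually needs a different splitting strategy.

```python
import random

def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

The three index lists are disjoint, so no test example can leak into fitting or tuning as long as the test indices are set aside until the final evaluation.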
Classification Metrics
Confusion Matrix
For binary classification with predicted label $\hat{y}$ and true label $y$:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Derived metrics:
\[\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\]
\[\text{Precision} = \frac{TP}{TP + FP}\]
\[\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}\]
\[\text{Specificity} = \frac{TN}{TN + FP}\]
\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]
\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
$\beta > 1$ weights recall higher; $\beta < 1$ weights precision higher.
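The derived metrics can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions, not something the formulas prescribe.

```python
def binary_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute the derived metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,  # equals F1 when beta == 1
    }

m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"])            # 0.85
print(round(m["f_beta"], 4))    # 0.8421
```

With `beta=1` the result agrees with the count form $\frac{2 \cdot TP}{2 \cdot TP + FP + FN} = 80/95$.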
ROC Curve and AUC
Plot True Positive Rate (Recall) vs. False Positive Rate ($\frac{FP}{FP + TN}$) across all classification thresholds.
AUC-ROC: area under the ROC curve. Ranges $[0, 1]$; 0.5 is random; 1.0 is perfect. Threshold-independent measure of discriminative ability.
Interpretation: AUC = probability that the model ranks a random positive example higher than a random negative example.
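The ranking interpretation can be checked by brute force: compare every positive score with every negative score and count how often the positive wins. This is a sketch for small inputs (it is quadratic in the number of examples); the name `auc_pairwise` is illustrative, and counting ties as half a win is an assumed convention.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_pairwise(y, s)
print(round(auc, 4))  # 0.8889
```

Here 8 of the 9 positive/negative pairs are ordered correctly; the only mistake is the positive scored 0.6 falling below the negative scored 0.7.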
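AUC-PR (area under the Precision-Recall curve): preferred for class-imbalanced problems. AUC-ROC can look overly optimistic when negatives greatly outnumber positives, because the false positive rate stays small even when false positives outnumber true positives.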
Multi-class Metrics
Extend binary metrics by averaging:
| Averaging | How |
|---|---|
| Macro | Compute metric per class, take unweighted mean |
| Weighted | Compute per class, weight by class support |
| Micro | Aggregate TP, FP, FN across all classes, then compute |
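The three averaging schemes in the table can be compared on a small example. The sketch below uses F1 as the per-class metric; the helper names are illustrative, and it assumes single-label multi-class data, where micro-averaged F1 coincides with accuracy.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """One-vs-rest TP/FP/FN counts for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    per_class = {c: f1(**k) for c, k in counts.items()}
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] / n for c in per_class)
    micro = f1(
        sum(k["tp"] for k in counts.values()),
        sum(k["fp"] for k in counts.values()),
        sum(k["fn"] for k in counts.values()),
    )
    return macro, weighted, micro

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
macro, weighted, micro = averaged_f1(y_true, y_pred)
print(round(macro, 4), round(weighted, 4), round(micro, 4))  # 0.6746 0.7279 0.7143
```

Macro is lowest here because the two small classes have the weakest per-class F1 and count as much as the large class.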
Matthews Correlation Coefficient (MCC)
\[\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
Reliable single metric even with class imbalance. Ranges $[-1, 1]$.
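A small sketch of why MCC is informative under imbalance: a classifier that always predicts the negative class on a 95/5 split reaches 95% accuracy but scores 0 on MCC. Returning 0.0 when the denominator is zero is an assumed convention (the formula itself is undefined there).

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Always predicting "negative" on 95 negatives and 5 positives.
print((0 + 95) / 100)                  # accuracy: 0.95
print(mcc(tp=0, fp=0, tn=95, fn=5))    # 0.0

# A reasonably good classifier on balanced-ish data.
print(round(mcc(tp=40, fp=10, tn=45, fn=5), 3))  # 0.704
```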
Regression Metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Penalizes large errors heavily |
| RMSE | $\sqrt{\text{MSE}}$ | Same units as target |
| MAE | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i\rvert$ | Less sensitive to outliers than MSE |
| MAPE | $\frac{100}{n}\sum\frac{\lvert y_i - \hat{y}_i\rvert}{\lvert y_i\rvert}$ | Scale-free; undefined if any $y_i = 0$ |
| $R^2$ | $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | Fraction of variance explained; $\leq 1$ |
| Adjusted $R^2$ | $1 - (1 - R^2)\frac{n-1}{n-p-1}$ | Penalizes extra features |
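The formulas in the table translate almost line for line into code. This is a sketch with made-up numbers; the function name and the `n_features` argument (the $p$ in adjusted $R^2$) are illustrative. MAPE is left to raise `ZeroDivisionError` when a target is zero, matching "undefined" in the table.

```python
import math

def regression_metrics(y_true, y_pred, n_features=1):
    """MSE, RMSE, MAE, MAPE, R^2 and adjusted R^2 for paired lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100 / n * sum(abs(e) / abs(t) for e, t in zip(errors, y_true))
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae,
            "mape": mape, "r2": r2, "adj_r2": adj_r2}

# Illustrative values only.
r = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 4.0, 8.0])
print(r["mse"], r["mae"])                      # 1.3125 0.875
print(round(r["r2"], 4), round(r["adj_r2"], 4))  # 0.6441 0.4661
```

The single error of 2.0 contributes 4.0 of the 5.25 total squared error but only 2.0 of the 3.5 total absolute error, which is the sense in which MSE penalizes large errors more heavily than MAE.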
Ranking Metrics
Used in information retrieval and recommendation.
Precision@k: fraction of top-$k$ recommendations that are relevant.
Mean Average Precision (MAP):
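\[\text{MAP} = \frac{1}{Q} \sum_{q=1}^Q \text{AP}(q), \quad \text{AP}(q) = \frac{1}{R_q}\sum_{k=1}^{n} P(k) \cdot \mathbf{1}[\text{item}_k \text{ is relevant}]\]
where $Q$ is the number of queries, $R_q$ is the number of relevant items for query $q$, $n$ is the number of ranked items, and $P(k)$ is the precision at cutoff $k$.
NDCG@k (Normalized Discounted Cumulative Gain):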
\[\text{DCG@k} = \sum_{i=1}^k \frac{2^{r_i} - 1}{\log_2(i+1)}, \quad \text{NDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}\]
where $r_i$ is the relevance score of the item at position $i$ and IDCG@k is the DCG@k of the ideal ranking (items sorted by decreasing relevance).
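The ranking metrics above can be sketched for a single query as follows. Function names are illustrative. Two simplifying assumptions are made: every relevant item for the query appears somewhere in the ranked list (so $R_q$ is the number of relevant flags in the list), and the ideal ranking for NDCG is obtained by sorting the same list of gains.

```python
import math

def precision_at_k(relevance, k):
    """relevance: 0/1 flags in ranked order."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """AP for one query, assuming all relevant items appear in the list."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k) evaluated at each relevant position
    return total / hits if hits else 0.0

def dcg_at_k(gains, k):
    """gains: graded relevance scores r_i in ranked order."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0

flags = [1, 0, 1, 1, 0]
print(round(precision_at_k(flags, 3), 4))   # 0.6667
print(round(average_precision(flags), 4))   # 0.8056
print(round(ndcg_at_k([0, 1], 2), 4))       # 0.6309
```

MAP is then the mean of `average_precision` over all queries.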
Calibration
A well-calibrated model’s predicted probability $\hat{p}$ matches the empirical frequency of the positive class.
Reliability diagram (calibration curve): plot mean predicted probability vs. fraction of positives in each probability bin. Perfect calibration lies on the diagonal.
Expected Calibration Error (ECE):
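\[\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|\]
where $B_m$ is the set of predictions falling in bin $m$, $\text{acc}(B_m)$ is the empirical accuracy (or fraction of positives) in that bin, and $\text{conf}(B_m)$ is its mean predicted probability.
Brier score: $\frac{1}{n}\sum(\hat{p}_i - y_i)^2$. Combines calibration and discrimination.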
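A sketch of both scores for binary predictions. It assumes equal-width bins on $[0, 1]$ and uses the binary variant of ECE that matches the reliability diagram: the gap between the fraction of positives and the mean predicted probability in each bin. Function names and the example values are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece_binary(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        m = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[m].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        frac = sum(y for _, y in b) / len(b)   # fraction of positives
        ece += len(b) / n * abs(frac - conf)
    return ece

probs = [0.15, 0.15, 0.85, 0.85]
labels = [0, 0, 1, 1]
print(round(ece_binary(probs, labels), 4))   # 0.15
print(round(brier_score(probs, labels), 4))  # 0.0225
```

The example model separates the classes perfectly yet is under-confident, so its ECE is nonzero even though its ranking is flawless.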
Calibration methods: Platt scaling (logistic regression on outputs), isotonic regression, temperature scaling.
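As an illustration of temperature scaling, the sketch below divides binary logits by a single scalar $T$ and picks the $T$ that minimizes negative log-likelihood on held-out data. It is deliberately simplified: a coarse grid search stands in for gradient-based optimization, the binary sigmoid case stands in for the more common multi-class softmax case, and the logits and labels are made up.

```python
import math

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels under sigmoid(z / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / temperature))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.1 * k for k in range(1, 101)]  # assumed search range 0.1..10
    return min(grid, key=lambda t: nll(logits, labels, t))

# Over-confident model: about 98% confidence on every example, 75% correct.
val_logits = [4, 4, 4, -4, -4, -4, 4, -4]
val_labels = [1, 1, 0, 0, 0, 1, 1, 0]
T = fit_temperature(val_logits, val_labels)
print(T, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, T))
```

A fitted $T > 1$ softens the probabilities toward 0.5. Because dividing by a positive constant does not change the ordering of scores, ranking metrics such as AUC-ROC are unaffected.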
Choosing the Right Metric
| Scenario | Recommended Metric |
|---|---|
| Balanced classes | Accuracy, F1 |
| Imbalanced classes | AUC-PR, F1 (weighted), MCC |
| False positives costly (spam filter) | Precision |
| False negatives costly (cancer screening) | Recall |
| Probabilistic outputs | AUC-ROC, Brier score, log-loss |
| Regression with outliers | MAE, Huber loss |
| Ordinal/rank predictions | NDCG, Spearman correlation |
Diagnostic Tools
Learning curves: plot train and validation loss as a function of training set size. Reveals underfitting (both losses high) vs. overfitting (train loss low, validation loss high).
Residual analysis: for regression, plot residuals $e_i = y_i - \hat{y}_i$ vs. $\hat{y}_i$ or vs. input features. Non-random patterns indicate model misspecification.
Confusion matrix heatmap: reveals systematic class confusions.
Error analysis: manually inspect misclassified examples to identify patterns.
See Cross Validation for unbiased metric estimation and Bias Variance Tradeoff for interpreting train vs. test error.