Model Evaluation

How do we measure model performance?

Model evaluation measures how well a learned model generalizes to unseen data. It guides model selection, debugging, and deployment decisions. The choice of metric must align with the real-world cost structure of errors.

Train / Validation / Test Split

Training set: used to fit model parameters.
Validation set: used for hyperparameter tuning and model selection.
Test set: used once, at the end, for unbiased estimation of generalization performance.

Contaminating the test set (using it to guide any modeling decision) yields optimistically biased performance estimates. Typical train/validation/test splits: 60/20/20 or 70/15/15.
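A minimal sketch of a 70/15/15 split, assuming the examples are independent so that a single random shuffle is appropriate (time-ordered or grouped data would need a different scheme). The function name and its parameters are illustrative, not taken from any library.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything left over is training data
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed keeps the test set identical across experiments, which makes it easier to avoid touching it by accident.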

Classification Metrics

Confusion Matrix

For binary classification with predicted label $\hat{y}$ and true label $y$:

	Predicted Positive	Predicted Negative
Actual Positive	TP	FN
Actual Negative	FP	TN

Derived metrics:

\[\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}\]

\[\text{Precision} = \frac{TP}{TP + FP}\]

\[\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}\]

\[\text{Specificity} = \frac{TN}{TN + FP}\]

\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]

\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]

$\beta > 1$ weights recall higher; $\beta < 1$ weights precision higher.
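The formulas above can be sketched directly from the four confusion-matrix counts. The counts in the example are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the definitions prescribe.

```python
def binary_metrics(tp, fp, fn, tn, beta=1.0):
    """Derive the standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,  # equals F1 when beta == 1
    }

m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
m_recall_heavy = binary_metrics(tp=40, fp=10, fn=20, tn=30, beta=2.0)
print(m)
```

With these counts recall (0.667) is lower than precision (0.8), so the $\beta = 2$ score comes out below F1, reflecting the heavier weight on recall.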

ROC Curve and AUC

Plot True Positive Rate (Recall) vs. False Positive Rate ($\frac{FP}{FP + TN}$) across all classification thresholds.

AUC-ROC: area under the ROC curve. Ranges $[0, 1]$; 0.5 is random; 1.0 is perfect. Threshold-independent measure of discriminative ability.

Interpretation: AUC = probability that the model ranks a random positive example higher than a random negative example.
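The ranking interpretation can be checked with a brute-force sketch that compares every positive-negative pair. It is quadratic in the number of examples, so it is meant as an illustration rather than a practical implementation. Counting ties as half a win is the usual Mann-Whitney convention, and the labels and scores are invented.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = auc_by_ranking(y_true, scores)
print(auc)  # 8 of the 9 positive-negative pairs are ordered correctly
```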

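AUC-PR (Precision-Recall curve): preferred for class-imbalanced problems. AUC-ROC can look overly optimistic when negatives greatly outnumber positives, because the false positive rate stays small even when false positives outnumber true positives.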

Multi-class Metrics

Extend binary metrics by averaging:

Averaging	How
Macro	Compute metric per class, take unweighted mean
Weighted	Compute per class, weight by class support
Micro	Aggregate TP, FP, FN across all classes, then compute
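To make the three averaging schemes concrete, here is a small sketch that computes macro, weighted, and micro F1 for a single-label multi-class problem. The labels and predictions are invented, and the helper name is hypothetical.

```python
from collections import Counter

def averaged_f1(y_true, y_pred):
    """Return (macro, weighted, micro) F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, weighted, micro

y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a"] * 5 + ["b"] + ["b"] * 2 + ["a"] + ["a"]
macro, weighted, micro = averaged_f1(y_true, y_pred)
print(macro, weighted, micro)
```

The rare class "c" is never predicted, which drags the macro average well below the other two. In the single-label setting, micro F1 coincides with accuracy.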

Matthews Correlation Coefficient (MCC)

\[\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]

Reliable single metric even with class imbalance. Ranges $[-1, 1]$.
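A short sketch of the MCC formula, with a made-up imbalanced example. Returning 0.0 when the denominator vanishes is an assumed convention, since the formula is undefined in that case.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that always predicts "negative" on a 95/5 split:
always_negative = mcc(tp=0, fp=0, fn=5, tn=95)
accuracy = (0 + 95) / 100

balanced_example = mcc(tp=40, fp=10, fn=20, tn=30)
print(accuracy, always_negative, balanced_example)
```

The always-negative classifier reaches 95% accuracy but an MCC of 0, which is why MCC is harder to game under class imbalance.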

Regression Metrics

Metric	Formula	Notes
MSE	$\frac{1}{n}\sum(y_i - \hat{y}_i)^2$	Penalizes large errors heavily
RMSE	$\sqrt{\text{MSE}}$	Same units as target
MAE	$\frac{1}{n}\sum|y_i - \hat{y}_i|$	Less sensitive to outliers than MSE
MAPE	$\frac{100}{n}\sum\frac{|y_i - \hat{y}_i|}{|y_i|}$	Scale-free; undefined if any $y_i = 0$
$R^2$	$1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$	Fraction of variance explained; $\leq 1$, negative if the model is worse than predicting $\bar{y}$
Adjusted $R^2$	$1 - (1 - R^2)\frac{n-1}{n-p-1}$	Penalizes extra features
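The table can be turned into a few lines of code. This is a sketch on invented numbers; `n_features` stands in for $p$ in the adjusted $R^2$ formula, and the function assumes no true value is zero (MAPE) and that the targets are not all identical ($R^2$).

```python
import math

def regression_metrics(y_true, y_pred, n_features=1):
    """Compute the regression metrics from the table above."""
    n = len(y_true)
    errors = [y - yh for y, yh in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 / n * sum(abs(e) / abs(y) for e, y in zip(errors, y_true)),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

y_true = [2.0, 4.0, 6.0, 8.0, 10.0]
y_pred = [2.5, 3.5, 6.0, 9.0, 9.0]
reg = regression_metrics(y_true, y_pred)
print(reg)
```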

Ranking Metrics

Used in information retrieval and recommendation.

Precision@k: fraction of top-$k$ recommendations that are relevant.

Mean Average Precision (MAP):

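\[\text{MAP} = \frac{1}{Q} \sum_{q=1}^Q \text{AP}(q), \quad \text{AP}(q) = \frac{1}{R_q}\sum_{k=1}^{n} P(k) \cdot \mathbf{1}[\text{item}_k \text{ is relevant}]\]

where $Q$ is the number of queries, $R_q$ is the number of relevant items for query $q$, $n$ is the length of the ranked list, and $P(k)$ is the precision of the top $k$ results.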
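A minimal sketch of AP and MAP, taking each query's results as a list of 0/1 relevance flags in ranked order. The optional `n_relevant` argument is a hypothetical convenience for the case where some relevant items never appear in the ranked list; by default the count is taken from the list itself.

```python
def average_precision(relevance, n_relevant=None):
    """AP for one query; `relevance` holds 0/1 flags in ranked order."""
    if n_relevant is None:
        n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0  # assumed convention when a query has no relevant items
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at k, counted only at relevant ranks
    return total / n_relevant

def mean_average_precision(rankings):
    """Unweighted mean of AP over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = mean_average_precision(rankings)
print(map_score)
```

For the first query AP is (1/1 + 2/3) / 2 and for the second it is (1/2) / 1, so the mean is 2/3.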

NDCG@k (Normalized Discounted Cumulative Gain):

\[\text{DCG@k} = \sum_{i=1}^k \frac{2^{r_i} - 1}{\log_2(i+1)}, \quad \text{NDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}\]

where $r_i$ is the relevance score of the item at position $i$, and IDCG@k is the DCG@k of the ideal ranking (items sorted by decreasing relevance).
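The DCG and NDCG definitions can be sketched as below. The ideal ranking is obtained by sorting the same relevance list, which assumes every relevant item already appears in that list. The relevance grades are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with the exponential gain 2^r - 1."""
    return sum(
        (2 ** r - 1) / math.log2(i + 1)
        for i, r in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the relevance-sorted (ideal) list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

ranked_relevances = [3, 2, 3, 0, 1]
ndcg3 = ndcg_at_k(ranked_relevances, k=3)
print(ndcg3)
```

A list that is already sorted by relevance scores exactly 1.0; swapping the grade-2 and grade-3 items at positions 2 and 3 is what pushes this example below 1.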

Calibration

A well-calibrated model’s predicted probability $\hat{p}$ matches the empirical frequency of the positive class.

Reliability diagram (calibration curve): plot mean predicted probability vs. fraction of positives in each probability bin. Perfect calibration lies on the diagonal.

Expected Calibration Error (ECE):

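\[\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|\]

where $B_m$ is the set of examples whose predicted confidence falls in bin $m$, $\text{acc}(B_m)$ is the accuracy within the bin, $\text{conf}(B_m)$ is the mean predicted confidence within the bin, and $n$ is the total number of examples.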

Brier score: $\frac{1}{n}\sum(\hat{p}_i - y_i)^2$. Combines calibration and discrimination.
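A sketch of both scores for a binary problem. For ECE it uses the binary analogue that matches the reliability diagram: within each bin, the fraction of positives plays the role of $\text{acc}(B_m)$ and the mean predicted probability plays the role of $\text{conf}(B_m)$. Equal-width bins and the tiny dataset are assumptions made for illustration.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Binary ECE with equal-width probability bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_hat):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        frac_pos = sum(y for y, _ in members) / len(members)
        mean_p = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(frac_pos - mean_p)
    return ece

y_true = [0, 0, 1, 1]
p_hat = [0.1, 0.3, 0.7, 0.9]
brier = brier_score(y_true, p_hat)
ece = expected_calibration_error(y_true, p_hat, n_bins=2)
print(brier, ece)
```

With two bins, the low bin has mean probability 0.2 but no positives, and the high bin has mean probability 0.8 but only positives, so each bin contributes a gap of 0.2. ECE depends on the binning, so values are only comparable when the number of bins is the same.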

Calibration methods: Platt scaling (logistic regression on outputs), isotonic regression, temperature scaling.

Choosing the Right Metric

Scenario	Recommended Metric
Balanced classes	Accuracy, F1
Imbalanced classes	AUC-PR, F1 (weighted), MCC
False positives costly (spam filter)	Precision
False negatives costly (cancer screening)	Recall
Probabilistic outputs	AUC-ROC, Brier score, log-loss
Regression with outliers	MAE, Huber loss
Ordinal/rank predictions	NDCG, Spearman correlation

Diagnostic Tools

Learning curves: plot train and validation loss as a function of training set size. Reveals underfitting (both losses high) vs. overfitting (train loss low, validation loss high).

Residual analysis: for regression, plot residuals $e_i = y_i - \hat{y}_i$ vs. $\hat{y}_i$ or vs. input features. Non-random patterns indicate model misspecification.

Confusion matrix heatmap: reveals systematic class confusions.

Error analysis: manually inspect misclassified examples to identify patterns.

See Cross Validation for unbiased metric estimation and Bias Variance Tradeoff for interpreting train vs. test error.