Model Evaluation (Deep Learning)

How do we evaluate Deep Learning models?

Evaluating deep learning models requires the same statistical rigor as classical ML but adds concerns specific to neural networks: training curves, generalization gaps, calibration, and the gap between benchmark performance and real-world behavior.

For standard evaluation metrics (accuracy, AUC, F1, RMSE, etc.) see Model Evaluation.

Training and Validation Curves

Plot loss and metrics as a function of training steps or epochs.

Healthy training:

Training loss decreases smoothly.
Validation loss follows training loss closely.
Both plateau at a low value.

Overfitting:

Training loss continues to decrease.
Validation loss flattens or increases.
Gap between train and val grows over time.

Underfitting:

Both training and validation losses are high and plateau early.
Model lacks capacity or is undertrained.

Unstable training:

Loss oscillates or spikes.
Causes: learning rate too high, exploding gradients, bad batches.
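The four patterns above can be turned into a rough automated check. The following is only a heuristic sketch: the function name `diagnose_curves` and the thresholds `high_loss`, `gap_tol`, and `spike_factor` are illustrative assumptions, not standard values, and would need tuning for the loss scale of a real task.

```python
import numpy as np

def diagnose_curves(train_loss, val_loss, high_loss=1.0, gap_tol=0.1, spike_factor=1.5):
    """Label a pair of per-epoch loss curves with one of the four patterns.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    train = np.asarray(train_loss, dtype=float)
    val = np.asarray(val_loss, dtype=float)

    # Unstable: training loss jumps by more than spike_factor between epochs.
    if np.any(train[1:] > spike_factor * train[:-1]):
        return "unstable"

    # Overfitting: validation loss has risen above its minimum
    # while training loss has kept decreasing since that epoch.
    best_epoch = int(np.argmin(val))
    if val[-1] > val[best_epoch] + gap_tol and train[-1] < train[best_epoch]:
        return "overfitting"

    # Underfitting: both losses end at a high value.
    if train[-1] > high_loss and val[-1] > high_loss:
        return "underfitting"

    return "healthy"

healthy = diagnose_curves([2.0, 1.0, 0.5, 0.3, 0.25], [2.1, 1.1, 0.6, 0.4, 0.35])
overfit = diagnose_curves([2.0, 1.0, 0.5, 0.2, 0.05], [2.1, 1.1, 0.7, 0.9, 1.2])
underfit = diagnose_curves([2.0, 1.8, 1.7, 1.7, 1.7], [2.1, 1.9, 1.8, 1.8, 1.8])
unstable = diagnose_curves([2.0, 1.0, 3.0, 0.8, 2.5], [2.1, 1.2, 3.2, 1.0, 2.6])
print(healthy, overfit, underfit, unstable)
```

A check like this does not replace looking at the plotted curves; it is mainly useful for flagging runs in a large sweep.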

Generalization Gap

$$ \text{Gap} = \mathcal{J}_\text{val}(\theta^*) - \mathcal{J}_\text{train}(\theta^*) $$

A large gap indicates overfitting; a small gap with high loss on both sets indicates underfitting.

Modern observation: overparameterized networks (many more parameters than training examples) can achieve near-zero training loss and still generalize well. The classical bias-variance intuition (more parameters $\Rightarrow$ more overfitting) does not always hold. See double descent in Bias Variance Tradeoff.

Checkpointing and Early Stopping

Save model weights at the epoch with the best validation metric (checkpoint). Use these weights for final evaluation.

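Early stopping: stop training if validation loss has not improved for $p$ consecutive epochs ($p$ is the patience), then use the best checkpoint. Training to completion and restoring the best checkpoint gives a model that is at least as good on the validation metric, but it does not save the compute that stopping early does.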
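The patience rule can be sketched on a recorded list of validation losses. This is a minimal illustration: `early_stopping_epoch` and the `min_delta` parameter (the smallest decrease that counts as an improvement) are assumptions for the example, not part of any specific framework.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: this is where a checkpoint would be saved.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop here, restore best_epoch weights

    # Patience never ran out: training used every recorded epoch.
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In a real training loop the same counter logic would wrap the per-epoch validation step, with the checkpoint written at the point marked in the comment.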

Per-Class and Disaggregated Evaluation

Aggregate metrics (overall accuracy) can hide poor performance on subgroups:

Compute precision, recall, and F1 per class. Identify which classes are hardest.
Slice evaluation: compute metrics separately for each data slice (age group, geography, device type).
Example: a model with 95% overall accuracy might have only 60% accuracy on a minority class, because that class contributes few examples to the aggregate.
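As a small illustration of disaggregated evaluation, the sketch below computes per-class precision, recall, and F1, plus accuracy per slice, using plain NumPy. The helper names and the toy labels and slice IDs are made up for the example.

```python
import numpy as np

def per_class_prf(y_true, y_pred, num_classes):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[c] = (precision, recall, f1)
    return out

def accuracy_by_slice(y_true, y_pred, slice_ids):
    """Return {slice: accuracy} for each distinct value in slice_ids."""
    y_true, y_pred, slice_ids = map(np.asarray, (y_true, y_pred, slice_ids))
    correct = y_true == y_pred
    return {str(s): float(correct[slice_ids == s].mean()) for s in np.unique(slice_ids)}

# Toy data: overall accuracy is 0.75, but slice "b" is entirely wrong.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
slices = ["a", "a", "a", "b", "a", "a", "a", "b"]

overall_acc = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
prf = per_class_prf(y_true, y_pred, num_classes=3)
slice_acc = accuracy_by_slice(y_true, y_pred, slices)
print(overall_acc, prf, slice_acc)
```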

Confusion Matrix Analysis

Visualize as a heatmap (rows = true classes, columns = predicted classes). Systematic off-diagonal patterns reveal class confusions. For a 10-class problem, a $10 \times 10$ matrix shows which classes are confused with each other.
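A minimal NumPy sketch of building the matrix (rows = true classes, columns = predicted classes) and listing the largest off-diagonal entries. The helper `top_confusions` is a hypothetical convenience function, and the labels are toy data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of examples with true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def top_confusions(cm, k=3):
    """Return up to k (true_class, predicted_class, count) triples, largest first."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    order = np.argsort(off, axis=None)[::-1][:k]
    pairs = [tuple(int(i) for i in np.unravel_index(idx, off.shape)) for idx in order]
    return [(t, p, int(off[t, p])) for t, p in pairs if off[t, p] > 0]

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
worst = top_confusions(cm)
print(cm)
print(worst)
```

Here class 0 is most often mistaken for class 1, which is the kind of systematic pattern the heatmap is meant to expose.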

Calibration

A well-calibrated model’s predicted probabilities match empirical frequencies: among predictions made with confidence 0.8, about 80% should be correct. Calibration is critical for risk-sensitive applications.

Temperature scaling: post-hoc calibration by learning a scalar $T > 0$:

$$ \hat{p} = \text{softmax}(\mathbf{z} / T) $$

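$T > 1$: softens the distribution (more uncertain); $T < 1$: sharpens it. The optimal $T$ is found by minimizing negative log-likelihood on a held-out validation set. Because dividing by a positive $T$ does not change the ordering of the logits, the predicted class and therefore accuracy are unchanged.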
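The fitting step can be sketched with a simple grid search over $T$ that minimizes validation negative log-likelihood. This is an illustration under stated assumptions: the validation logits are synthetic (true logits scaled by 3 to mimic overconfidence), and a grid search stands in for the gradient-based optimizer that would typically be used in practice.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits / T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Return the temperature on the grid with the lowest validation NLL."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)  # hypothetical search range
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: labels are drawn from softmax(true_logits),
# but the "model" reports logits that are 3x too large (overconfident).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 5))
val_labels = np.array([rng.choice(5, p=p) for p in softmax(true_logits)])
val_logits = 3.0 * true_logits

T_hat = fit_temperature(val_logits, val_labels)
print("fitted T:", T_hat)
print("NLL before:", nll(val_logits, val_labels, 1.0))
print("NLL after: ", nll(val_logits, val_labels, T_hat))
```

Since the synthetic logits were inflated by a factor of 3, the fitted temperature should land near 3, and the validation NLL drops while the predicted classes stay the same.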

Expected Calibration Error (ECE): see Model Evaluation.
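For convenience, a short sketch of the usual binned ECE estimate: predictions are grouped into equal-width confidence bins, and the absolute difference between accuracy and mean confidence in each bin is averaged, weighted by bin size. The choice of 10 bins is a common default, assumed here rather than taken from this page.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are (lo, hi]; top-class confidence is always > 0, so nothing is dropped.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Overconfident toy model: always 95% confident, but only 60% correct.
ece_overconfident = expected_calibration_error([0.95] * 10, [1] * 6 + [0] * 4)

# Calibrated toy model: 75% confident and 75% correct.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(ece_overconfident, ece_calibrated)
```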

Modern deep neural networks are often overconfident: they output high-probability predictions even when wrong. Temperature scaling, label smoothing, and mixup can improve calibration.

Robustness Evaluation

Measure performance under distribution shift:

Evaluation Type	What It Tests
In-distribution (i.i.d.) test set	Standard generalization
Out-of-distribution (OOD)	Shift in $P(X)$ (new domains, corruptions)
Adversarial examples	Worst-case small perturbations
Corrupted inputs	Blur, noise, JPEG artifacts (ImageNet-C)
Subgroup evaluation	Minority classes, rare conditions

Confidence under distribution shift: a model should output low confidence (high entropy) for OOD inputs, not high-confidence wrong predictions.
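Predictive entropy is one simple way to quantify the confidence mentioned above. The sketch below computes it for rows of class probabilities; the example probability vectors are made up, and any threshold for flagging an input as OOD would have to be chosen on validation data.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy (in nats) of each row of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    # Clip inside the log so that 0 * log(0) evaluates to 0 instead of NaN.
    return -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=-1)

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: model is sure
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum entropy for 4 classes: log(4)

entropies = predictive_entropy([confident, uncertain])
print(entropies)
```

A desirable model produces entropies closer to the second row on OOD inputs; a high-confidence wrong prediction looks like the first row.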

Benchmark Evaluation

Standard benchmarks provide reproducible comparison points:

Domain	Benchmark	Metric
Image classification	ImageNet-1k	Top-1, Top-5 accuracy
Object detection	COCO	mAP@50, mAP@50:95
Language modeling	Penn Treebank, The Pile	Perplexity
NLU	GLUE, SuperGLUE	Average score
Question answering	SQuAD 2.0	EM, F1
MT	WMT	BLEU, ChrF

Caveat: overfitting to benchmarks (evaluation set leakage, benchmark saturation) can give a misleadingly optimistic picture. Always evaluate on held-out data from the target distribution.

Loss vs. Metric

The training loss (cross-entropy, MSE) and the evaluation metric (accuracy, AUROC) often differ. Do not confuse them:

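Accuracy depends only on the predicted class, while cross-entropy depends on the probability assigned to the true class. A model can therefore have good accuracy but poor cross-entropy when it is confidently wrong on a few examples (miscalibrated confidences).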
Monitor both training loss and downstream metric throughout training.
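A toy example makes the difference concrete: the two sets of predictions below make exactly the same mistakes, so accuracy is identical, but the more confident set has a much higher cross-entropy because of its one confident error. The probabilities are invented for illustration.

```python
import numpy as np

def accuracy(probs, labels):
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

labels = np.array([0, 0, 1, 1])

# Both models predict classes [0, 0, 1, 0]; the last example is wrong in both.
probs_cautious = np.array([[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]])
probs_confident = np.array([[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.99, 0.01]])

acc_cautious = accuracy(probs_cautious, labels)
acc_confident = accuracy(probs_confident, labels)
ce_cautious = cross_entropy(probs_cautious, labels)
ce_confident = cross_entropy(probs_confident, labels)
print(acc_cautious, acc_confident)
print(ce_cautious, ce_confident)
```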

Hyperparameter Sensitivity Analysis

After finding the best configuration, measure how sensitive results are to hyperparameter changes:

Sweep learning rate and weight decay over $\pm 1$ order of magnitude around the chosen values.
Report mean $\pm$ std over multiple random seeds.
A model whose performance varies $>2\%$ across seeds or minor hyperparameter changes should be treated with caution.
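Reporting over seeds can be as simple as the sketch below. The scores are placeholder numbers, not results from any experiment, and the use of the sample standard deviation (`ddof=1`) is an assumption that suits a small number of seeds.

```python
import numpy as np

def summarize_seeds(scores):
    """Return (mean, sample std, max - min) of a metric across random seeds."""
    scores = np.asarray(scores, dtype=float)
    mean = float(scores.mean())
    std = float(scores.std(ddof=1))  # sample standard deviation
    spread = float(scores.max() - scores.min())
    return mean, std, spread

# Placeholder test accuracies from three hypothetical seeds.
seed_scores = [0.90, 0.92, 0.91]
mean, std, spread = summarize_seeds(seed_scores)
print(f"{mean:.3f} +/- {std:.3f} (range {spread:.3f})")
```

The range is worth reporting next to the standard deviation because, with only a few seeds, a single outlier run is easier to spot that way.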

Practical Checklist

Train/val/test split is clean; no leakage.
Evaluation metric matches the real-world objective.
Report results on a test set that is evaluated only once, after all model selection is finished.
Report mean and standard deviation over at least 3 random seeds.
Analyze failure modes and per-class performance.
Check calibration if model outputs are used as probabilities.
Evaluate robustness to expected distribution shifts.