Bias Variance Tradeoff

What is the Bias-Variance Tradeoff?

The bias-variance tradeoff arises from decomposing the expected generalization error of a learning algorithm into three terms: squared bias (systematic error), variance (sensitivity to the training data), and irreducible noise. Because bias and variance typically move in opposite directions as model complexity changes, minimizing total error requires balancing the two.

Decomposition

For a regression problem with squared error loss, the expected test error at a point $x$ is:

$$ \mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2 $$

where:

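$y = f(x) + \epsilon$, with $\mathbb{E}[\epsilon] = 0$ and $\text{Var}[\epsilon] = \sigma^2$
$\hat{f}(x)$ is the prediction of a model fit to a randomly drawn training set; the expectations $\mathbb{E}[\cdot]$ are taken over training sets (and, on the left-hand side, over the noise $\epsilon$)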

Bias:

$$ \text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x) $$

How far the average prediction is from the true function.

Variance:

$$ \text{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right] $$

How much the prediction changes across different training sets.

Irreducible noise $\sigma^2$: inherent randomness in the data. Cannot be reduced by any model.
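
The decomposition can be derived in two steps. The sketch below assumes that the noise $\epsilon$ at the test point is independent of the training set, so that cross terms involving $\epsilon$ vanish. First, separate the noise:

$$ \mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}[\epsilon^2] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[f(x) - \hat{f}(x)] + \mathbb{E}[(f(x) - \hat{f}(x))^2] = \sigma^2 + \mathbb{E}[(f(x) - \hat{f}(x))^2] $$

Then add and subtract $\mathbb{E}[\hat{f}(x)]$ inside the remaining square. The cross term is zero because $\mathbb{E}[\hat{f}(x) - \mathbb{E}[\hat{f}(x)]] = 0$:

$$ \mathbb{E}[(f(x) - \hat{f}(x))^2] = (\mathbb{E}[\hat{f}(x)] - f(x))^2 + \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] $$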

Interpretation

Term	Meaning	Caused by
High bias	Model systematically misses the true pattern	Underfitting, model too simple
High variance	Model is highly sensitive to training data	Overfitting, model too complex
Irreducible noise	Noise in labels or features	Measurement error, label ambiguity

Underfitting: high bias, low variance. Model does not capture the signal.

Overfitting: low bias, high variance. Model fits noise specific to training set.

Sweet spot: model complexity where total error is minimized.

The Tradeoff

As model complexity increases:

Bias decreases (more expressive model can fit true function).
Variance increases (more parameters, more sensitivity to training set).

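In the classical picture, total expected error is a U-shaped curve as a function of complexity: it falls while the reduction in bias dominates and rises once the growth in variance takes over.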

$$ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \sigma^2 $$

Identifying from Train/Test Error

Situation	Train Error	Test Error	Diagnosis
High bias	High	High	Underfitting
High variance	Low	High	Overfitting
Optimal	Low	Low (close to train)	Good generalization
Irreducible noise dominates	Near $\sigma^2$	Near $\sigma^2$ (cannot go lower in expectation)	Improve data quality
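
The first two rows of the table can be reproduced with a small experiment. The sketch below is illustrative only: it assumes a hypothetical target $\sin(2\pi x)$ with noise variance 0.1, a training set of 15 evenly spaced points, and polynomial fits of degree 1, 3, and 9. The names `train_test_mse` and `errors` are made up for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
NOISE_STD = np.sqrt(0.1)  # assumed noise level (variance 0.1)

def target(x):
    # Assumed true function for this illustration.
    return np.sin(2 * np.pi * x)

# Small training set on an even grid, large test set for a stable estimate.
x_train = np.linspace(0.0, 1.0, 15)
y_train = target(x_train) + rng.normal(0.0, NOISE_STD, x_train.size)
x_test = rng.uniform(0.0, 1.0, 1000)
y_test = target(x_test) + rng.normal(0.0, NOISE_STD, x_test.size)

def train_test_mse(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coefs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - P.polyval(x_train, coefs)) ** 2)
    test_mse = np.mean((y_test - P.polyval(x_test, coefs)) ** 2)
    return train_mse, test_mse

errors = {d: train_test_mse(d) for d in (1, 3, 9)}
for d, (tr, te) in errors.items():
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

Under these assumptions, the linear fit should show high error on both sets (the high-bias row), while the degree-9 fit should show a low training error with a clearly larger test error (the high-variance row). Exact values depend on the seed and sample sizes.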

Reducing Bias

Use a more expressive model (higher degree polynomial, deeper network).
Add features or feature interactions.
Reduce regularization strength.
Train longer (for iterative methods).
Engineer better features.

Reducing Variance

Collect more training data: for many estimators, variance shrinks roughly as $O(1/n)$ in the training set size $n$.
Increase regularization (L1, L2, dropout, early stopping).
Reduce model complexity (fewer parameters, shallower depth, fewer features).
Use ensemble methods: bagging averages predictions across models trained on bootstrap samples, reducing variance while leaving bias roughly unchanged.
Data augmentation.
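
The effect of regularization strength on both terms can be computed exactly for ridge regression, because the fitted values are a linear function of the targets. The sketch below is one possible illustration, not a reference implementation. It assumes a fixed design of 20 evenly spaced inputs, degree-9 polynomial features, a hypothetical target $\sin(2\pi x)$ with noise variance 0.1, and a ridge penalty that, for simplicity, also applies to the intercept. The function name `ridge_bias2_variance` is made up for this example.

```python
import numpy as np

NOISE_VAR = 0.1                            # assumed noise variance
x = np.linspace(0.0, 1.0, 20)              # assumed fixed design
f_true = np.sin(2 * np.pi * x)             # assumed true function
X = np.vander(x, N=10, increasing=True)    # degree-9 polynomial features

def ridge_bias2_variance(lam):
    """Average squared bias and variance of ridge fitted values at the design points."""
    # Hat matrix: fitted values are H @ y (the intercept is penalized too).
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    # E[y_hat] = H f, so the bias vector is H f - f.
    bias2 = np.mean((H @ f_true - f_true) ** 2)
    # Cov[y_hat] = NOISE_VAR * H H^T; average its diagonal.
    variance = NOISE_VAR * np.trace(H @ H.T) / x.size
    return bias2, variance

lams = (1e-4, 1e-2, 1.0)
results = [ridge_bias2_variance(lam) for lam in lams]
for lam, (b2, var) in zip(lams, results):
    print(f"lambda = {lam:g}: bias^2 = {b2:.5f}, variance = {var:.5f}")
```

As the penalty grows, the variance term falls and the squared bias rises, which is the tradeoff that regularization strength controls.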

Formal Example: Polynomial Regression

True function: $f(x) = \sin(2\pi x)$, noise $\sigma^2 = 0.1$.

Model	Bias	Variance
Degree 0 (constant)	High	Very low
Degree 1 (linear)	High	Low
Degree 3 (cubic)	Low	Medium
Degree 9 (complex)	Very low	Very high

Of these four models, degree 3 typically achieves the best balance when training data is limited.
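
The qualitative entries in the table can be checked by Monte Carlo: draw many training sets, fit each polynomial degree to every one, and measure how the predictions behave on a fixed grid. The sketch below assumes 20 training points per dataset and 500 datasets. Both are arbitrary choices, as are the names `bias2_and_variance` and `results`, and the exact numbers will change with them.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
NOISE_VAR = 0.1      # from the example above
N_TRAIN = 20         # assumed training-set size
N_DATASETS = 500     # assumed number of simulated training sets

x_grid = np.linspace(0.0, 1.0, 50)
f_true = np.sin(2 * np.pi * x_grid)

def bias2_and_variance(degree):
    """Estimate squared bias and variance, averaged over x_grid."""
    preds = np.empty((N_DATASETS, x_grid.size))
    for i in range(N_DATASETS):
        # Draw a fresh training set and fit a polynomial to it.
        x = rng.uniform(0.0, 1.0, N_TRAIN)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(NOISE_VAR), N_TRAIN)
        preds[i] = P.polyval(x_grid, P.polyfit(x, y, degree))
    mean_pred = preds.mean(axis=0)                 # estimate of E[f_hat(x)]
    bias2 = np.mean((mean_pred - f_true) ** 2)     # squared bias
    variance = np.mean(preds.var(axis=0))          # variance across datasets
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (0, 1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

With so few training points, the degree-9 variance estimate can be dominated by a handful of wildly oscillating fits, so its bias estimate is noisy as well. Increasing `N_DATASETS` stabilizes both.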

Bagging as Variance Reduction

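If $\hat{f}_1, \ldots, \hat{f}_B$ are identically distributed (but not necessarily independent) models, each with variance $\sigma^2$ and pairwise correlation $\rho$ (here $\sigma^2$ denotes the variance of a single model's prediction, not the noise variance):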

$$ \text{Var}\!\left[\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right] = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2 $$

As $B \to \infty$, variance approaches $\rho \sigma^2$. Decorrelating models (via random subspace, bootstrap) drives $\rho$ down and further reduces variance.
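
The formula can be checked numerically. The sketch below does not train any models. It assumes the $B$ predictions are jointly Gaussian and builds them as a shared component plus an individual component, which gives every prediction variance $\sigma^2$ and every pair correlation $\rho \ge 0$. The function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_of_average(B, rho, sigma2, n_draws=200_000):
    """Empirical variance of the mean of B equicorrelated predictions."""
    # A shared component induces correlation rho between every pair (needs rho >= 0).
    shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_draws, 1))
    own = rng.normal(0.0, np.sqrt((1.0 - rho) * sigma2), size=(n_draws, B))
    preds = shared + own          # each column has variance sigma2
    return preds.mean(axis=1).var()

def formula_variance(B, rho, sigma2):
    """Variance of the average according to the formula above."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

empirical = variance_of_average(B=10, rho=0.3, sigma2=2.0)
predicted = formula_variance(B=10, rho=0.3, sigma2=2.0)
print(f"empirical = {empirical:.3f}, formula = {predicted:.3f}")
```

For these values the formula gives $0.3 \cdot 2 + 0.7 \cdot 2 / 10 = 0.74$, and the floor as $B \to \infty$ is $\rho \sigma^2 = 0.6$.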

Double Descent

Double descent is a modern observation that goes beyond the classical bias-variance tradeoff: as model complexity continues past the interpolation threshold (where training error reaches zero), test error can decrease again.

Three regimes:

Classical: increasing complexity reduces bias, increases variance (U-curve).
Interpolation threshold: model is just large enough to fit training data; test error peaks.
Over-parameterized: extremely large models can memorize training data while still generalizing well, an effect attributed to factors such as implicit regularization from gradient descent and benign overfitting.

Relevant for deep neural networks and large kernel machines. Challenges the classical assumption that overfitting always hurts.

Bias-Variance in Classification

For 0-1 loss, the decomposition is more complex and non-additive. An analogous (but different) decomposition exists:

Bias: error of the main prediction (most frequent label across datasets).
Variance: probability that prediction differs from the main prediction.
Net error is not simply bias$^2$ + variance.
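
These definitions can be made concrete with a toy example. The sketch below uses a hypothetical matrix of binary predictions (rows are models trained on different datasets, columns are test points) and assumes noise-free labels, so the true label is also the best achievable prediction. All data and names are invented for illustration.

```python
import numpy as np

# Hypothetical predictions: 5 models (rows) on 4 test points (columns).
preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
])
y_true = np.array([0, 1, 0, 1])  # assumed noise-free labels

def main_prediction(column):
    """Most frequent label predicted for one test point."""
    values, counts = np.unique(column, return_counts=True)
    return values[np.argmax(counts)]

main = np.apply_along_axis(main_prediction, 0, preds)
bias = (main != y_true).astype(int)          # 0-1 loss of the main prediction
variance = (preds != main).mean(axis=0)      # P(prediction differs from main)
avg_error = (preds != y_true).mean(axis=0)   # expected 0-1 loss per point

for j in range(y_true.size):
    print(f"point {j}: bias = {bias[j]}, variance = {variance[j]:.1f}, "
          f"error = {avg_error[j]:.1f}")
```

At the third test point the main prediction is wrong (bias 1), and the error is 0.8, which equals bias minus variance. In this two-class, noise-free setting, variance lowers the error on biased points and raises it on unbiased ones, which is why the terms do not simply add.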

The qualitative insight still holds: simpler models underfit, complex models overfit.

See Cross Validation for empirically estimating these quantities and Ensemble Learning for structured variance reduction.