Regularization

Regularization refers to any technique that reduces overfitting (high variance) by adding constraints, noise, or implicit bias to the learning process. It improves generalization by preventing the model from memorizing training data idiosyncrasies.

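Formal framing: explicit (penalty-based) regularization adds a weighted penalty term to the objective, where $\lambda \ge 0$ sets the regularization strength: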

\[\mathcal{J}(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i)}_{\text{empirical risk}} + \lambda \underbrace{\Omega(\theta)}_{\text{regularizer}}\]

L2 Regularization (Weight Decay)

Penalizes the squared $\ell^2$ norm of weights:

\[\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\sum_j \theta_j^2\]

Modified gradient, writing $\nabla_\theta \mathcal{L}$ for the gradient of the empirical risk term:

\[\nabla_\theta \mathcal{J} = \nabla_\theta \mathcal{L} + \lambda \theta\]

Update rule:

\[\theta \leftarrow \theta(1 - \eta\lambda) - \eta \nabla_\theta \mathcal{L}\]

The factor $(1 - \eta\lambda)$ shrinks (decays) the weights at each step, hence “weight decay.”
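As a minimal sketch of the update rule above, the snippet below checks numerically that a plain SGD step on the penalized gradient matches the "shrink, then descend" form. The function names, the parameter values, and the gradient vector are illustrative assumptions, not taken from any library.

```python
def sgd_step_l2(theta, grad_loss, lr, lam):
    """One SGD step using the penalized gradient grad_loss + lam * theta."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad_loss)]


def sgd_step_decay(theta, grad_loss, lr, lam):
    """The same step in weight-decay form: shrink by (1 - lr * lam), then descend."""
    return [t * (1.0 - lr * lam) - lr * g for t, g in zip(theta, grad_loss)]


theta = [1.0, -2.0, 0.5]
grad = [0.1, -0.3, 0.0]  # hypothetical gradient of the data loss

a = sgd_step_l2(theta, grad, lr=0.1, lam=0.01)
b = sgd_step_decay(theta, grad, lr=0.1, lam=0.01)
print(a)
print(b)
```

The two results agree up to floating-point rounding, which is why the terms "L2 regularization" and "weight decay" are used interchangeably for plain SGD.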

Effect: shrinks each weight toward zero in proportion to its magnitude, discouraging the large weights that often accompany overfitting. Because the shrinkage vanishes as a weight approaches zero, L2 does not produce sparsity.

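Bayesian interpretation: equivalent to MAP estimation with an isotropic Gaussian prior $p(\theta) = \mathcal{N}(0, \sigma^2 I)$. If the data term is the summed negative log-likelihood, $\sigma^2 = 1/\lambda$; with the $1/n$ averaging used in the objective above, $\sigma^2 = 1/(n\lambda)$.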

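Note: with plain SGD, adding $\lambda\theta$ to the gradient and decaying the weights directly give the same update. In Adam they are not equivalent, because the $\lambda\theta$ term is rescaled by the adaptive per-parameter step size. Use decoupled weight decay (AdamW) with Adam: apply the decay directly to the weights rather than adding $\lambda\theta$ to the gradient.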
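The difference can be illustrated with a scalar sketch of a single Adam step. The function `adam_step` and its arguments `l2` and `decoupled_wd` are hypothetical names introduced here for illustration; the hyperparameter defaults are the commonly quoted Adam values and are assumptions, not recommendations from this text.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One scalar Adam step (t is the 1-based step count).

    l2           : coefficient added to the gradient (classic L2 penalty).
    decoupled_wd : coefficient applied directly to the weight (AdamW-style).
    """
    g = grad + l2 * theta                    # L2 term goes through the adaptive scaling
    m = b1 * m + (1 - b1) * g                # first-moment estimate
    v = b2 * v + (1 - b2) * g * g            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    decayed = theta * (1 - lr * decoupled_wd)  # decay bypasses the adaptive scaling
    theta = decayed - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Zero data gradient, so only the regularizer moves the weight.
for start in (10.0, 1.0):
    with_l2, _, _ = adam_step(start, 0.0, 0.0, 0.0, 1, l2=0.1)
    with_wd, _, _ = adam_step(start, 0.0, 0.0, 0.0, 1, decoupled_wd=0.1)
    print(start, start - with_l2, start - with_wd)
```

In this first step, the L2 variant shrinks both weights by roughly the same absolute amount (about `lr`), because Adam normalizes the gradient magnitude away. The decoupled variant shrinks each weight in proportion to its size, which is the behavior "weight decay" is meant to have.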

L1 Regularization (Lasso)

Penalizes the $\ell^1$ norm:

\[\Omega(\theta) = \|\theta\|_1 = \sum_j \lvert\theta_j\rvert\]

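Modified subgradient (the penalty is not differentiable at $\theta_j = 0$, where any value in $[-1, 1]$ is a valid subgradient of $\lvert\theta_j\rvert$): $\nabla_\theta \mathcal{J} = \nabla_\theta \mathcal{L} + \lambda \cdot \operatorname{sign}(\theta)$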

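Effect: drives some weights to exactly zero (sparse solutions), performing implicit feature selection. With plain subgradient descent the weights tend to oscillate around zero rather than land on it; exact zeros usually require a proximal (soft-thresholding) update or coordinate descent.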

Bayesian interpretation: MAP estimation with a Laplace prior $p(\theta) \propto e^{-\lambda\lvert\theta\rvert}$.

Less common in deep learning than L2: the penalty is non-differentiable at zero and, unlike the squared $\ell^2$ norm, is not invariant to rotations of the parameter space.
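One common way to obtain exact zeros with an $\ell^1$ penalty is a proximal gradient step: take a gradient step on the data loss, then soft-threshold each weight. The sketch below assumes that approach; the function names and numbers are illustrative only.

```python
def soft_threshold(theta, threshold):
    """Elementwise proximal operator of threshold * |theta|."""
    out = []
    for t in theta:
        if t > threshold:
            out.append(t - threshold)
        elif t < -threshold:
            out.append(t + threshold)
        else:
            out.append(0.0)  # weights inside [-threshold, threshold] become exactly zero
    return out


def proximal_l1_step(theta, grad_loss, lr, lam):
    """Gradient step on the data loss, followed by soft-thresholding with lr * lam."""
    stepped = [t - lr * g for t, g in zip(theta, grad_loss)]
    return soft_threshold(stepped, lr * lam)


# Hypothetical weights with a zero data gradient, to isolate the penalty's effect.
theta = proximal_l1_step([0.5, -0.02, 0.01], [0.0, 0.0, 0.0], lr=0.1, lam=0.5)
print(theta)
```

Here the threshold is `lr * lam = 0.05`, so the two small weights are set to exactly zero while the large one is shrunk by a constant amount. Contrast this with L2, where the shrinkage is proportional to the weight.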

Elastic Net

Combines the L1 and L2 penalties with a mixing weight $\alpha \in [0, 1]$:

\[\Omega(\theta) = \alpha \|\theta\|_1 + \frac{1-\alpha}{2}\|\theta\|_2^2\]

Inherits sparsity from L1 and the grouping effect from L2 (strongly correlated features tend to receive similar coefficients). Common in linear models for tabular data.
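The penalty formula above can be evaluated directly. This is a small sketch that assumes the mixing weight `alpha` lies in $[0, 1]$; the function name is hypothetical.

```python
def elastic_net_penalty(theta, alpha):
    """alpha * ||theta||_1 + (1 - alpha) / 2 * ||theta||_2^2."""
    l1 = sum(abs(t) for t in theta)
    l2_squared = sum(t * t for t in theta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared


weights = [3.0, -4.0]
pure_l1 = elastic_net_penalty(weights, alpha=1.0)   # reduces to the L1 penalty
pure_l2 = elastic_net_penalty(weights, alpha=0.0)   # reduces to the L2 penalty
mixed = elastic_net_penalty(weights, alpha=0.5)
print(pure_l1, pure_l2, mixed)
```

Setting `alpha` to 1 or 0 recovers the pure L1 or L2 penalty, so a single hyperparameter sweep covers both regularizers and everything in between.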

Data Augmentation

Artificially expands the training set by applying transformations that preserve the label or, as in Mixup and CutMix, modify it in a known way:

Images:

Random crop and resize
Horizontal/vertical flip
Color jitter (brightness, contrast, saturation, hue)
Gaussian blur, noise injection
Cutout (random rectangular masking)
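Mixup: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$, $\tilde{y} = \lambda y_i + (1-\lambda)y_j$, with a mixing coefficient $\lambda \in [0, 1]$ typically drawn from $\text{Beta}(\alpha, \alpha)$ (not the regularization strength $\lambda$ used earlier)
CutMix: paste a patch from one image into another and mix the labels in proportion to the patch area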
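The Mixup rule in the list above can be sketched for flat feature vectors and one-hot labels as follows. Drawing the coefficient from a $\text{Beta}(\alpha, \alpha)$ distribution follows the usual Mixup recipe; the function name, the default `alpha=0.2`, and the toy inputs are assumptions made for illustration.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Convex combination of two examples.

    x_* are feature lists of equal length, y_* are one-hot (or soft) label lists.
    Returns the mixed features, mixed labels, and the sampled coefficient.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)  # seeded generator so the sketch is reproducible
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, x_mix, y_mix)
```

Because the labels are mixed with the same coefficient as the inputs, the soft target still sums to one whenever both source labels do.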

Text:

Synonym substitution
Back-translation
Random token masking or deletion
EDA (Easy Data Augmentation): synonym replacement, random insertion, random swap, random deletion

Audio:

Time stretching
Pitch shifting
SpecAugment (masking frequency/time blocks)

Data augmentation is one of the most effective regularization techniques. Because the transformations are sampled randomly each time an example is drawn, the supply of distinct augmented examples is effectively unlimited.

Early Stopping

Stop training when validation loss stops decreasing and begins to increase.

Patience $p$: stop if validation loss does not improve for $p$ consecutive epochs, then restore the checkpoint with the lowest validation loss.
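The patience rule can be sketched as a function over a recorded sequence of per-epoch validation losses. In a real training loop the same bookkeeping would sit inside the epoch loop and save a checkpoint whenever the best loss improves; the function name and the loss values below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops at the first epoch where the loss has failed to improve
    for `patience` consecutive epochs; best_epoch is the checkpoint to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return len(val_losses) - 1, best_epoch


stop, best = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6], patience=3)
print(stop, best)
```

In this example training stops at epoch 5 and restores epoch 2, so the later improvement to 0.6 is never seen. A larger patience trades extra compute for a lower risk of stopping on a temporary plateau.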

Effect: implicit regularization; limits the effective number of gradient steps, preventing the model from fitting noise.

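Approximately equivalent to L2 (Tikhonov) regularization under certain conditions: for a quadratic loss minimized by gradient descent from zero initialization, stopping after $T$ steps with learning rate $\eta$ corresponds to $\lambda \approx 1/(\eta T)$ in the L2-regularized problem.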

Noise Injection

Adding noise to inputs, weights, hidden activations, or gradients, or softening the targets, acts as regularization:

Input noise: $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. For linear models with squared loss, equivalent in expectation to Tikhonov (L2) regularization.
Weight noise: $\tilde{W} = W + \epsilon$. Makes model robust to weight perturbations; related to flatness/generalization.
Label smoothing: soft targets prevent overconfident predictions. See Loss Functions.
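The input-noise equivalence for linear models can be verified in a few lines. As a worked sketch, assume a linear predictor $f_w(x) = w^\top x$, squared loss, and isotropic noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ independent of $(x, y)$; these modelling choices are assumptions made for the derivation:

\[\mathbb{E}_\epsilon\left[(w^\top(x + \epsilon) - y)^2\right] = (w^\top x - y)^2 + 2(w^\top x - y)\,\mathbb{E}_\epsilon[w^\top \epsilon] + \mathbb{E}_\epsilon\left[(w^\top \epsilon)^2\right] = (w^\top x - y)^2 + \sigma^2 \|w\|_2^2\]

The cross term vanishes because the noise has zero mean, and the last term is $w^\top (\sigma^2 I) w$. In the notation of the L2 section, with $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$, this corresponds to $\lambda = 2\sigma^2$.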

Max-Norm Constraint

Constrains the norm of each neuron’s weight vector to be at most $c$:

\[\|w_j\|_2 \leq c\]

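Applied after each gradient step by projecting onto the ball of radius $c$: if $\|w_j\|_2 > c$, rescale $w_j \leftarrow c\, w_j / \|w_j\|_2$. More robust than L2 for very large learning rates, since the weights stay bounded regardless of step size; commonly used alongside dropout.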
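The projection step can be sketched as follows, assuming each row of a weight matrix holds the incoming weights of one neuron. That layout and the function names are assumptions made for this example.

```python
import math


def project_max_norm(w, c):
    """Project one weight vector onto the L2 ball of radius c."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= c:
        return list(w)  # already inside the ball: leave unchanged
    scale = c / norm
    return [v * scale for v in w]  # rescale so the norm equals c


def apply_max_norm(weight_rows, c):
    """Apply the max-norm projection to every neuron's weight vector."""
    return [project_max_norm(row, c) for row in weight_rows]


# Hypothetical weights after a gradient step: the first row violates the constraint.
W = apply_max_norm([[3.0, 4.0], [0.3, 0.4]], c=1.0)
print(W)
```

Only the first row, which has norm 5, is rescaled. Unlike an L2 penalty, the projection leaves weight vectors that already satisfy the constraint untouched.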

Summary Comparison

Technique	Mechanism	Produces Sparsity	Main Use
L2 (weight decay)	Shrinks weights	No	Default in deep learning
L1	Absolute value penalty	Yes	Feature selection, linear models
Elastic Net	L1 + L2	Yes (partial)	Tabular regression
Data augmentation	Effective data increase	No	Vision, NLP, audio
Early stopping	Limits optimization steps	No	Always recommended
Dropout	Random unit removal	Implicit	Dense, RNN layers
Batch normalization	Noise from batch stats	No	Deep networks
Label smoothing	Soft targets	No	Classification
Max-norm	Constrained optimization	No	With dropout

See Dropout and Batch Normalization for dedicated coverage of those techniques.