Regularization
Regularization refers to any technique that reduces overfitting (high variance) by adding constraints, noise, or implicit bias to the learning process. It improves generalization by preventing the model from memorizing training data idiosyncrasies.
Formal framing: explicit regularizers add a penalty term $\Omega(\theta)$ to the training objective, weighted by a strength $\lambda \geq 0$:
\[\mathcal{J}(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i)}_{\text{empirical risk}} + \lambda \underbrace{\Omega(\theta)}_{\text{regularizer}}\]
L2 Regularization (Weight Decay)
Penalizes the squared $\ell^2$ norm of weights:
\[\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\sum_j \theta_j^2\]
Modified gradient:
\[\nabla_\theta \mathcal{J} = \nabla_\theta \mathcal{L} + \lambda \theta\]
Update rule (gradient descent with learning rate $\eta$):
\[\theta \leftarrow \theta(1 - \eta\lambda) - \eta \nabla_\theta \mathcal{L}\]
The factor $(1 - \eta\lambda)$ shrinks (decays) the weights at each step, hence “weight decay.”
Effect: shrinks all weights toward zero proportionally to their magnitude. Discourages large weights that could indicate overfitting. Does not produce sparsity.
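The equivalence between the penalized gradient and the shrink-then-step update can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example gradient values are hypothetical and chosen only for illustration.

```python
def l2_penalized_step(theta, grad_loss, lr, lam):
    """One gradient step on loss + (lam / 2) * ||theta||^2."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad_loss)]


def weight_decay_step(theta, grad_loss, lr, lam):
    """The same step written as multiplicative shrinkage plus a loss-gradient step."""
    return [t * (1.0 - lr * lam) - lr * g for t, g in zip(theta, grad_loss)]


theta = [1.0, -2.0, 0.5]
grad_loss = [0.1, 0.0, -0.3]  # hypothetical gradient of the empirical risk

penalized = l2_penalized_step(theta, grad_loss, lr=0.1, lam=0.01)
decayed = weight_decay_step(theta, grad_loss, lr=0.1, lam=0.01)
```

Under plain gradient descent the two forms agree up to floating-point rounding, which is why the names are often used interchangeably in that setting.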
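Bayesian interpretation: equivalent to MAP estimation with a zero-mean isotropic Gaussian prior whose variance is proportional to $1/\lambda$, e.g. $p(\theta) = \mathcal{N}(0, \lambda^{-1} I)$ for a suitably scaled likelihood.
Note: with Adam, L2 regularization and weight decay are not equivalent. Adding $\lambda\theta$ to the gradient means the penalty is rescaled by the per-parameter adaptive step, so weights are no longer shrunk in proportion to their magnitude. Use decoupled weight decay (AdamW) instead: apply the decay directly to the weights rather than adding $\lambda\theta$ to the gradient.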
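To make the difference concrete, here is a simplified single-step sketch of Adam with either an L2 penalty or decoupled decay. It is a plain-Python illustration and not the implementation of any particular library; the function name `adam_step`, its parameters, and the default hyperparameters are assumptions made for this example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam step over a list of scalar parameters.

    l2 > 0:           lam * theta is added to the gradient (L2 penalty),
                      so it passes through the adaptive rescaling.
    weight_decay > 0: decoupled (AdamW-style) decay applied directly to
                      the weights, outside the adaptive rescaling.
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        g = g + l2 * th                        # L2 penalty enters via the gradient
        mi = beta1 * mi + (1 - beta1) * g      # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g  # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)          # bias corrections
        v_hat = vi / (1 - beta2 ** t)
        adaptive = m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(th - lr * adaptive - lr * weight_decay * th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


theta = [1.0, 100.0]
zeros = [0.0, 0.0]  # zero loss gradient isolates the effect of the penalty

with_l2, _, _ = adam_step(theta, zeros, zeros, zeros, t=1, l2=0.01)
decoupled, _, _ = adam_step(theta, zeros, zeros, zeros, t=1, weight_decay=0.01)

shrink_l2 = [a - b for a, b in zip(theta, with_l2)]
shrink_decoupled = [a - b for a, b in zip(theta, decoupled)]
```

In this first step with zero loss gradient, the L2 variant moves both weights by roughly the learning rate regardless of their size, while the decoupled variant shrinks each weight in proportion to its magnitude.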
L1 Regularization (Lasso)
Penalizes the $\ell^1$ norm:
\[\Omega(\theta) = \|\theta\|_1 = \sum_j \lvert\theta_j\rvert\]
Modified gradient: $\nabla_\theta \mathcal{J} = \nabla_\theta \mathcal{L} + \lambda \cdot \text{sign}(\theta)$
Effect: drives some weights to exactly zero (sparse solutions). Performs implicit feature selection.
Bayesian interpretation: MAP estimation with a Laplace prior $p(\theta) \propto e^{-\lambda\lvert\theta\rvert}$.
Less common in deep learning than L2 due to non-differentiability at zero and lack of scale invariance.
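One common way to obtain exact zeros despite the kink at zero is a proximal (soft-thresholding) step. This technique is not described above and is swapped in here purely as an illustration; the function names and example values are hypothetical.

```python
def soft_threshold(theta, threshold):
    """Proximal operator of threshold * ||theta||_1.

    Shrinks each entry toward zero by `threshold` and sets entries whose
    magnitude is at most `threshold` to exactly zero.
    """
    out = []
    for t in theta:
        if t > threshold:
            out.append(t - threshold)
        elif t < -threshold:
            out.append(t + threshold)
        else:
            out.append(0.0)
    return out


def proximal_l1_step(theta, grad_loss, lr, lam):
    """Gradient step on the loss, followed by soft-thresholding for the L1 term."""
    stepped = [t - lr * g for t, g in zip(theta, grad_loss)]
    return soft_threshold(stepped, lr * lam)


theta = [0.5, -0.02, 0.003]
sparse = proximal_l1_step(theta, [0.0, 0.0, 0.0], lr=0.1, lam=0.5)
```

With these values the threshold is `lr * lam = 0.05`, so the two small weights are set to exactly zero while the large weight is only shrunk.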
Elastic Net
Combines the L1 and L2 penalties with a mixing weight $\alpha \in [0, 1]$:
\[\Omega(\theta) = \alpha \|\theta\|_1 + \frac{1-\alpha}{2}\|\theta\|_2^2\]
Inherits sparsity from L1 and the grouping effect from L2. Common in linear models for tabular data.
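As a small illustration, the elastic net penalty and one valid subgradient can be written directly from the formula. The function names are hypothetical, and taking `sign(0) = 0` at zero is one admissible subgradient choice assumed here.

```python
def elastic_net_penalty(theta, alpha):
    """alpha * ||theta||_1 + (1 - alpha) / 2 * ||theta||_2^2."""
    l1 = sum(abs(t) for t in theta)
    l2_squared = sum(t * t for t in theta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared


def elastic_net_subgradient(theta, alpha):
    """A subgradient of the penalty; uses sign(0) = 0 at the kink."""
    def sign(t):
        return (t > 0) - (t < 0)
    return [alpha * sign(t) + (1.0 - alpha) * t for t in theta]


theta = [3.0, -4.0]
pure_l1 = elastic_net_penalty(theta, alpha=1.0)   # reduces to the L1 norm
pure_l2 = elastic_net_penalty(theta, alpha=0.0)   # reduces to half the squared L2 norm
mixed = elastic_net_penalty(theta, alpha=0.5)
```

Setting `alpha` to 1 or 0 recovers the pure L1 and pure L2 penalties from the earlier sections.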
Data Augmentation
Artificially expands the training set by applying transformations that preserve the label, or that mix labels together with the inputs (Mixup, CutMix):
Images:
- Random crop and resize
- Horizontal/vertical flip
- Color jitter (brightness, contrast, saturation, hue)
- Gaussian blur, noise injection
- Cutout (random rectangular masking)
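- Mixup: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$, $\tilde{y} = \lambda y_i + (1-\lambda)y_j$, where $\lambda \in [0, 1]$ is a mixing coefficient (typically drawn from a Beta distribution), not the regularization strength used above
- CutMix: paste a patch from one image into another and mix the labels in proportion to the patch area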
Text:
- Synonym substitution
- Back-translation
- Random token masking or deletion
- EDA (Easy Data Augmentation)
Audio:
- Time stretching, pitch shifting, SpecAugment (masking frequency/time blocks)
Data augmentation is one of the most effective regularization techniques. Because transformations are sampled at random each time an example is drawn, the supply of distinct augmented examples is effectively unlimited.
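The Mixup formula from the image list can be sketched in a few lines. This is an illustrative plain-Python version operating on flat lists; the function name, the `alpha=0.2` default for the Beta distribution, and the toy inputs are assumptions, not values from the text.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Mix two examples and their (one-hot or soft) labels.

    The mixing coefficient lam is drawn from Beta(alpha, alpha), so it
    lies in [0, 1]; inputs and labels are interpolated with the same lam.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_mix, y_mix, lam = mixup([0.0, 1.0, 2.0], [1.0, 0.0],
                          [4.0, 3.0, 2.0], [0.0, 1.0], rng=rng)
```

Because a fresh coefficient is drawn on every call, repeated passes over the same pair of examples yield different mixed samples.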
Early Stopping
Stop training when validation loss stops decreasing and begins to increase.
Patience $p$: stop if validation loss does not improve for $p$ consecutive epochs. Restore best checkpoint.
Effect: implicit regularization; limits the effective number of gradient steps, preventing the model from fitting noise.
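Under certain conditions (for example, gradient descent on a quadratic loss), early stopping is approximately equivalent to L2 (Tikhonov) regularization: stopping after $T$ steps with learning rate $\eta$ corresponds to $\lambda \approx 1/(\eta T)$ in the L2-regularized problem.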
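The patience rule can be sketched as a small function over a recorded sequence of validation losses. The function name and the loss values are hypothetical; in a real training loop the checkpoint would be saved wherever the best loss is updated.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; `best_epoch` is the checkpoint to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a checkpoint would be saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all recorded epochs.
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With these losses and a patience of 3, the best loss occurs at epoch 2 and training stops at epoch 5 (zero-based), after three epochs without improvement.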
Noise Injection
Adding noise to inputs, weights, or targets acts as regularization:
- Input noise: $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. For linear models with squared loss, this is equivalent in expectation to Tikhonov (L2) regularization.
- Weight noise: $\tilde{W} = W + \epsilon$. Makes the model robust to weight perturbations; related to the link between flat minima and generalization.
- Label smoothing: soft targets prevent overconfident predictions. See Loss Functions.
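The input-noise claim for linear models can be checked numerically: for a linear prediction $w^\top x$ with squared loss, the expected loss under Gaussian input noise equals the clean loss plus $\sigma^2\|w\|_2^2$. The sketch below compares a Monte Carlo estimate against that closed form; the function names and the toy values are assumptions for illustration.

```python
import random


def noisy_squared_loss(w, x, y, sigma, n_samples, rng):
    """Monte Carlo estimate of E[(w . (x + eps) - y)^2], eps ~ N(0, sigma^2 I)."""
    total = 0.0
    for _ in range(n_samples):
        pred = sum(wi * (xi + rng.gauss(0.0, sigma)) for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / n_samples


def tikhonov_loss(w, x, y, sigma):
    """Clean squared loss plus the penalty sigma^2 * ||w||_2^2."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2 + sigma ** 2 * sum(wi * wi for wi in w)


w, x, y, sigma = [2.0, -1.0], [1.0, 3.0], 0.5, 0.5
closed_form = tikhonov_loss(w, x, y, sigma)
estimate = noisy_squared_loss(w, x, y, sigma, n_samples=20000,
                              rng=random.Random(0))
```

The Monte Carlo estimate only approximates the closed form, with an error that shrinks as `n_samples` grows.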
Max-Norm Constraint
Constrains the norm of each neuron’s weight vector to be at most $c$:
\[\|w_j\|_2 \leq c\]
Applied post-gradient-step by projecting onto the ball. More robust than L2 for very large learning rates; commonly used alongside dropout.
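The projection step can be sketched for a single weight vector as follows. The function name is hypothetical; in practice the projection would be applied to each neuron's incoming weight vector after every optimizer step.

```python
import math


def project_max_norm(w, c):
    """Project w onto the L2 ball of radius c.

    Vectors already inside the ball are returned unchanged; longer
    vectors are rescaled to have norm exactly c, keeping their direction.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [wi * scale for wi in w]


clipped = project_max_norm([3.0, 4.0], c=1.0)    # norm 5, rescaled to norm 1
untouched = project_max_norm([0.3, 0.4], c=1.0)  # norm 0.5, left as is
```

Unlike an L2 penalty, this constraint leaves weights untouched until they reach the boundary, so it only acts on vectors that have grown too large.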
Summary Comparison
| Technique | Mechanism | Produces Sparsity | Main Use |
|---|---|---|---|
| L2 (weight decay) | Shrinks weights | No | Default in deep learning |
| L1 | Absolute value penalty | Yes | Feature selection, linear models |
| Elastic Net | L1 + L2 | Yes (partial) | Tabular regression |
| Data augmentation | Effective data increase | No | Vision, NLP, audio |
| Early stopping | Limits optimization steps | No | Always recommended |
| Dropout | Random unit removal | Implicit | Dense, RNN layers |
| Batch normalization | Noise from batch stats | No | Deep networks |
| Label smoothing | Soft targets | No | Classification |
| Max-norm | Constrained optimization | No | With dropout |
See Dropout and Batch Normalization for dedicated coverage of those techniques.