Optimization
The Problem
Find parameters $\theta$ that minimize (or maximize) an objective function:
\[\theta^* = \arg\min_\theta \mathcal{L}(\theta)\]In ML, $\mathcal{L}$ is the loss function computed over training data.
Conditions for Optimality
First-order necessary condition: $\nabla \mathcal{L}(\theta^*) = 0$ (gradient is zero at a critical point)
Second-order conditions:
- Local minimum: $\nabla \mathcal{L} = 0$ and $H \succ 0$ (positive definite Hessian)
- Local maximum: $\nabla \mathcal{L} = 0$ and $H \prec 0$
- Saddle point: $\nabla \mathcal{L} = 0$ and $H$ indefinite
Convex functions have only global minima, so any local minimum is global.
Gradient Descent
Iteratively move in the negative gradient direction:
\[\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)\]where $\eta$ is the learning rate (step size).
Variants
| Variant | Gradient computed over | Notes |
|---|---|---|
| Batch GD | Entire dataset | Accurate but slow per step |
| Stochastic GD (SGD) | Single sample | Noisy but fast, good regularization effect |
| Mini-batch GD | Small batch (e.g. 32–512) | Standard in deep learning |
Learning Rate
- Too large → diverges or oscillates
- Too small → slow convergence
- Tuned via grid search, learning rate schedules, or adaptive methods
Momentum
Accumulates velocity in the gradient direction to smooth updates and escape local optima:
\[v_{t+1} = \beta v_t + \nabla \mathcal{L}(\theta_t)\] \[\theta_{t+1} = \theta_t - \eta v_{t+1}\]Typical $\beta = 0.9$.
Adaptive Optimizers
AdaGrad
Scales learning rate by accumulated squared gradients, giving large updates for rare features:
\[G_t = G_{t-1} + g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t\]Downside: learning rate monotonically decreases to zero.
RMSProp
Uses exponential moving average of squared gradients (fixes AdaGrad’s diminishing rate):
\[G_t = \rho G_{t-1} + (1-\rho) g_t^2\]Adam (Adaptive Moment Estimation)
Combines momentum (first moment) + RMSProp (second moment):
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\] \[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\] \[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Most widely used optimizer in deep learning.
AdamW
Adam with decoupled weight decay. Applies L2 regularization directly to weights, not through the gradient. Fixes Adam’s poor weight decay behavior.
Newton’s Method
Uses second-order information (Hessian) for faster convergence:
\[\theta_{t+1} = \theta_t - H^{-1} \nabla \mathcal{L}(\theta_t)\]- Quadratic convergence near optimum
- Computing and inverting $H$ is $O(n^3)$, infeasible for large models
Quasi-Newton methods (L-BFGS) approximate the Hessian from gradient history. Used in smaller ML problems.
Convexity
A function $f$ is convex if:
\[f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \forall \lambda \in [0,1]\]- Convex: linear regression, logistic regression, SVMs, lasso
- Non-convex: neural networks (multiple local minima, saddle points)
In practice: deep networks are non-convex but gradient descent works well due to the loss landscape properties. Most local minima have similar loss values.
Learning Rate Schedules
| Schedule | Description |
|---|---|
| Step decay | Multiply $\eta$ by factor every $k$ epochs |
| Exponential decay | $\eta_t = \eta_0 e^{-kt}$ |
| Cosine annealing | $\eta$ follows cosine curve to near-zero |
| Warmup | Ramp up $\eta$ for first few steps, then decay |
| Cyclical | Oscillate between min and max $\eta$ |
Regularization (as Constrained Optimization)
Adds a penalty term to prevent overfitting:
\[\mathcal{L}_{reg}(\theta) = \mathcal{L}(\theta) + \lambda R(\theta)\]| Regularizer | Penalty | Effect |
|---|---|---|
| L2 (Ridge) | $\lVert\theta\rVert_2^2$ | Small weights, smooth solution |
| L1 (Lasso) | $\lVert\theta\rVert_1$ | Sparse weights, feature selection |
| Elastic Net | $\alpha\lVert\theta\rVert_1 + (1-\alpha)\lVert\theta\rVert_2^2$ | Combines both |
Constrained Optimization
Minimize $f(\mathbf{x})$ subject to $g_i(\mathbf{x}) = 0$ and $h_j(\mathbf{x}) \leq 0$.
Lagrangian: $\mathcal{L}(\mathbf{x}, \lambda, \mu) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \mu_j h_j(\mathbf{x})$
KKT conditions generalize this to inequality constraints.
Used in SVMs, RLHF, and constrained policy optimization.