Gradient Descent
Gradient descent minimizes a differentiable loss $\mathcal{J}(\theta)$ by iteratively moving parameters in the direction of the negative gradient. The gradient $\nabla_\theta \mathcal{J}$ points in the direction of steepest ascent; moving opposite to it decreases the loss.
\[\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{J}(\theta)\]Formal Setup
Objective: $\min_\theta \mathcal{J}(\theta) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i)$
Gradient: $\nabla_\theta \mathcal{J}(\theta) \in \mathbb{R}^p$ where $p = \lvert\theta\rvert$
Learning rate (step size): $\eta > 0$
Update rule:
\[\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{J}(\theta^{(t)})\]Batch (Full) Gradient Descent
Computes the exact gradient using all $n$ training examples per update:
\[\nabla_\theta \mathcal{J} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)\]Advantages:
- Exact gradient; deterministic updates.
- Guaranteed to converge to a local minimum for convex objectives (with appropriate $\eta$).
Disadvantages:
- One update requires processing all $n$ samples: $O(n)$ per step.
- Impractical for large datasets.
- Converges slowly; does not escape shallow local minima.
Convergence Analysis
For a $\beta$-smooth convex function (Lipschitz-continuous gradient, $|\nabla f(x) - \nabla f(y)| \leq \beta |x - y|$):
With $\eta = 1/\beta$:
\[\mathcal{J}(\theta^{(T)}) - \mathcal{J}(\theta^*) \leq \frac{\beta \|\theta^{(0)} - \theta^*\|^2}{2T}\]Convergence rate: $O(1/T)$. For $\mu$-strongly convex functions: linear convergence $O(\exp(-\mu T / \beta))$.
Learning Rate
The learning rate $\eta$ is the most critical hyperparameter.
| $\eta$ value | Behavior |
|---|---|
| Too large | Diverges or oscillates; may overshoot minimum |
| Too small | Converges very slowly |
| Well-chosen | Smooth, efficient convergence |
| Adaptive | Adjusts per-parameter (AdaGrad, Adam) |
Choosing $\eta$:
- Start at $10^{-3}$ for Adam, $10^{-1}$ for SGD with momentum.
- Use learning rate range test (LR finder): increase $\eta$ exponentially over one epoch; pick value just before loss diverges.
- Lipschitz constant of loss provides a theoretical upper bound: $\eta < 1/L$.
Gradient Descent on Non-Convex Objectives
Neural network losses are non-convex. Key phenomena:
Local minima: points where $\nabla \mathcal{J} = 0$ and Hessian is positive definite. In high dimensions, most critical points are saddle points, not local minima.
Saddle points: $\nabla \mathcal{J} = 0$ but Hessian is indefinite. Gradient descent slows near saddle points; perturbation or momentum helps escape.
Plateaus: regions where $|\nabla \mathcal{J}| \approx 0$ but $\mathcal{J}$ is not at a minimum.
Sharp vs. flat minima: flat minima (low Hessian eigenvalues) generalize better. Sharp minima have high curvature and are more sensitive to parameter perturbations.
Loss Landscape Geometry
For an $L$-layer neural network with width $n \to \infty$: in the infinite-width limit, the loss landscape becomes convex (Neural Tangent Kernel regime). In practice, overparameterization still helps gradient descent find good solutions.
Implicit regularization: gradient descent with small $\eta$ and large batch finds solutions with specific inductive biases (e.g., minimum norm solutions for overparameterized linear models).
Gradient Clipping
Prevents exploding gradients by scaling the gradient when its norm exceeds a threshold $\tau$:
\[g \leftarrow \begin{cases} g & \|g\| \leq \tau \\ \tau \cdot g / \|g\| & \|g\| > \tau \end{cases}\]Critical for RNNs and Transformers on long sequences. Typical $\tau \in [0.5, 5.0]$.
Comparison of Gradient Descent Variants
| Variant | Gradient Estimate | Batch Size | Notes |
|---|---|---|---|
| Batch GD | Exact | $n$ | Deterministic; slow |
| Stochastic GD (SGD) | Single sample | 1 | Noisy; fast updates |
| Mini-batch GD | Approximate | $B$ | Best in practice |
See Stochastic Gradient Descent and Optimization Algorithms for variants with momentum, adaptive rates, etc.