Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) approximates the true gradient using a small random subset (mini-batch) of the training data at each step. The noise in the gradient estimate acts as implicit regularization and helps escape sharp minima, making SGD the workhorse optimizer for deep learning.

\[\theta^{(t+1)} = \theta^{(t)} - \eta_t \hat{g}_t\]

where $\hat{g}t = \frac{1}{B}\sum{i \in \mathcal{B}t} \nabla\theta \mathcal{L}(f_\theta(x_i), y_i)$ is the mini-batch gradient.

Pure SGD (Batch Size = 1)

\[\hat{g}_t = \nabla_\theta \mathcal{L}(f_\theta(x_{i_t}), y_{i_t})\]

Unbiased estimate: $\mathbb{E}[\hat{g}t] = \nabla\theta \mathcal{J}(\theta)$

High variance: each step is a noisy estimate; updates zigzag around the true gradient direction.

Advantage: one pass over data gives $n$ updates (vs. 1 for batch GD).

Mini-Batch SGD

Uses batch size $B \in [1, n]$. Standard in deep learning with $B \in {32, 64, 128, 256}$.

Variance of mini-batch gradient:

\[\text{Var}[\hat{g}_t] = \frac{\sigma^2}{B}\]

where $\sigma^2$ is the per-sample variance. Larger batches reduce gradient variance but each step costs more compute.

Gradient noise: $\hat{g}t = \nabla\theta \mathcal{J}(\theta) + \epsilon_t$ where $\epsilon_t$ is zero-mean noise with variance $\propto 1/B$.

Epoch

One epoch = one complete pass over the training data. Number of steps per epoch = $\lceil n/B \rceil$.

Training typically runs for tens to hundreds of epochs. Total gradient steps $\approx \text{epochs} \times n/B$.

SGD with Momentum

Accumulates a velocity vector $\mathbf{v}$ to smooth updates and accelerate in consistent directions:

\[\mathbf{v}_{t+1} = \mu \mathbf{v}_t - \eta \hat{g}_t\] \[\theta^{(t+1)} = \theta^{(t)} + \mathbf{v}_{t+1}\]

$\mu \in [0.9, 0.99]$: momentum coefficient (exponential decay of past gradients).
Steady-state velocity: $\mathbf{v} = \frac{\eta}{1 - \mu} \hat{g}$. With $\mu = 0.9$: effective learning rate is $10\times$ larger in consistent directions.

Physical intuition: ball rolling down a hill accumulates speed; momentum helps traverse flat regions and reduces oscillation in ravines.

Nesterov Accelerated Gradient (NAG)

Evaluates gradient at a “look-ahead” position:

\[\mathbf{v}_{t+1} = \mu \mathbf{v}_t - \eta \nabla_\theta \mathcal{J}(\theta + \mu \mathbf{v}_t)\] \[\theta^{(t+1)} = \theta^{(t)} + \mathbf{v}_{t+1}\]

Corrects the momentum overshoot; converges faster than standard momentum on convex problems. Convergence rate $O(1/T^2)$ vs. $O(1/T)$ for GD.

Noise and Generalization

SGD noise has a beneficial regularization effect:

Flat minima hypothesis: the gradient noise biases SGD toward wide, flat minima where $\lambda_{\max}(H)$ (largest Hessian eigenvalue) is small. Flat minima generalize better under distribution shift.

Noise scale: $N \propto \frac{\eta \sigma^2}{B}$. Increasing $\eta$ or decreasing $B$ increases noise.

Linear scaling rule (Goyal et al.): when multiplying $B$ by $k$, multiply $\eta$ by $k$ and use a linear warmup phase to maintain the effective noise scale.

Learning Rate Schedule

SGD benefits significantly from decaying the learning rate over training.

Schedule	Formula	Notes
Step decay	$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$	Reduce by $\gamma$ every $s$ steps
Cosine annealing	$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$	Smooth; very common
Linear decay	$\eta_t = \eta_0 (1 - t/T)$	Simple; used in Transformers
Warmup	Linearly increase $\eta$ for first $w$ steps	Stabilizes early training
Cyclical (CLR)	Oscillates between $\eta_{\min}$ and $\eta_{\max}$	Can improve final accuracy
1-cycle	Warmup to $\eta_{\max}$, decay to $\eta_{\min}$, anneal to 0	Used in fastai; strong results

Convergence of SGD

For a $\beta$-smooth non-convex function, SGD with $\eta_t = \eta_0 / \sqrt{T}$:

\[\frac{1}{T}\sum_{t=1}^T \mathbb{E}[\|\nabla \mathcal{J}(\theta^{(t)})\|^2] \leq O\!\left(\frac{\sigma}{\sqrt{T}}\right)\]

Converges to a stationary point (not necessarily a local minimum) at rate $O(1/\sqrt{T})$.

Batch Size Effects

Batch size	Gradient variance	Steps per epoch	Memory	Generalization
1 (pure SGD)	Very high	$n$	Minimal	Often good
32-256 (mini-batch)	Moderate	$n/B$	Moderate	Good
$n$ (full batch)	Zero	1	High	Often worse
Very large ($>8192$)	Very low	Few	Very high	Degraded (needs careful LR tuning)

Large-batch training degrades generalization unless compensated by increased $\eta$ (linear scaling rule) and warmup.

SGD vs. Adam

Property	SGD + Momentum	Adam
Convergence speed	Slower (needs tuning)	Faster
Generalization	Often better (CV, language)	Comparable or slightly worse
Hyperparameter sensitivity	High (LR, momentum, schedule)	Lower (but $\beta_1, \beta_2, \epsilon$)
Sparse gradients	Poor	Good
Common use	ResNets, CNNs (image), LLM fine-tune	Transformers, NLP pretraining

See Optimization Algorithms for Adam, AdaGrad, RMSProp.