Backpropagation
How does Backpropagation work?
Backpropagation (backprop) is an efficient algorithm for computing the gradient of the loss $\mathcal{L}$ with respect to all network parameters $\theta$ by applying the chain rule in reverse order through the computational graph. It runs in $O(p)$ time (proportional to the number of parameters), the same cost as a single forward pass.
The Chain Rule
For a composition $f(g(x))$:
\[\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]For a multi-variable composition $\ell = \mathcal{L}(z)$, $z = Wh + b$:
\[\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial W}\]Backpropagation applies this chain rule systematically across all layers.
Forward Pass
Compute and cache intermediate activations ${\mathbf{z}^{(l)}, \mathbf{h}^{(l)}}$ for each layer $l = 1, \ldots, L$:
\[\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\] \[\mathbf{h}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})\]Output: $\hat{\mathbf{y}} = \mathbf{h}^{(L)}$, then compute $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$.
These cached values are reused during the backward pass.
Backward Pass
Define the error signal (upstream gradient) at layer $l$:
\[\boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}\]Output layer ($l = L$):
\[\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \sigma'^{(L)}(\mathbf{z}^{(L)})\]For softmax + cross-entropy: $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$ (clean closed form).
Propagation (layer $l$ from $l+1$):
\[\boldsymbol{\delta}^{(l)} = \left(W^{(l+1)T} \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'^{(l)}(\mathbf{z}^{(l)})\]Parameter gradients:
\[\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{h}^{(l-1)})^T\] \[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]Algorithm Summary
Forward pass:
for l = 1 to L:
z[l] = W[l] @ h[l-1] + b[l]
h[l] = σ(z[l])
compute L(h[L], y)
Backward pass:
δ[L] = ∂L/∂z[L]
for l = L-1 downto 1:
δ[l] = (W[l+1].T @ δ[l+1]) * σ'(z[l])
for l = 1 to L:
∂L/∂W[l] = δ[l] @ h[l-1].T
∂L/∂b[l] = δ[l]
Computational Cost
- Forward pass: $O(\sum_l n_l \cdot n_{l-1})$ multiplications.
- Backward pass: same asymptotic cost as forward pass ($\approx 2\times$ wall-clock time).
- Memory: $O(\sum_l n_l)$ to store activations. Memory grows with depth and batch size.
Automatic Differentiation (Autograd)
Modern frameworks implement backprop via automatic differentiation, not symbolic or numerical differentiation.
Forward mode AD: accumulates derivatives alongside the primal computation. Efficient when input dimension is small; $O(p)$ passes for $p$ parameters.
Reverse mode AD (backprop): one backward pass computes all $p$ partial derivatives simultaneously. $O(1)$ passes regardless of $p$. Ideal for neural networks where $p \gg$ number of outputs.
Tape-based (eager): PyTorch records operations dynamically; backward traverses the recorded tape.
Symbolic/compiled: TensorFlow (v1), JAX’s jit: builds static computation graph; enables optimization passes (operation fusion, XLA compilation).
Vanishing and Exploding Gradients
The gradient at layer $l$ involves a product of Jacobians:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \prod_{k=l}^{L-1} \frac{\partial \mathbf{h}^{(k+1)}}{\partial \mathbf{h}^{(k)}}\]Each Jacobian factor is $W^{(k)T} \cdot \text{diag}(\sigma’(\mathbf{z}^{(k)}))$.
Vanishing: if $|W^{(k)T}| \cdot |\sigma’(\mathbf{z}^{(k)})| < 1$ on average, the product shrinks exponentially with depth. Early layers learn extremely slowly.
Exploding: if the product grows exponentially, gradients blow up. Weights diverge.
Solutions:
| Problem | Solution |
|---|---|
| Vanishing | ReLU activation, residual connections, careful init |
| Exploding | Gradient clipping, careful init, weight norm |
| Both | Batch normalization, layer normalization |
Jacobians and the General Case
For vector-to-vector functions $\mathbf{f}: \mathbb{R}^m \to \mathbb{R}^n$, the Jacobian is:
\[J_f = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}\]Backprop computes vector-Jacobian products (VJPs) $\mathbf{v}^T J_f$, where $\mathbf{v}$ is the upstream gradient. This is cheaper than materializing the full Jacobian.
Numerical Gradient Checking
Finite difference approximation to verify backprop implementation:
\[\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\theta + \epsilon e_i) - \mathcal{L}(\theta - \epsilon e_i)}{2\epsilon}\]Use $\epsilon = 10^{-5}$. Compare against analytical gradient; relative error should be $< 10^{-5}$. Only used for debugging; too expensive for training.