Matrix Calculus
Why Matrix Calculus?
Neural network parameters live in matrices and tensors. Computing gradients w.r.t. entire weight matrices (rather than scalar-by-scalar) requires matrix calculus, the language of vectorized backpropagation.
Layout Conventions
Two competing conventions for arranging derivatives:
| Convention | Gradient of scalar $f$ w.r.t. vector $\mathbf{x}$ | Notes |
|---|---|---|
| Numerator layout | Row vector $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{1 \times n}$ | Used in some ML texts |
| Denominator layout | Column vector $\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{n \times 1}$ | Used in ML frameworks |
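These notes use denominator layout for gradients of scalar-valued functions (a column gradient vector, or a matrix shaped like the parameter), consistent with most ML code. Jacobians of vector-valued functions keep the usual $m \times n$ shape, so transposes appear when gradients are propagated through the chain rule.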
Scalar by Vector
For $f: \mathbb{R}^n \to \mathbb{R}$:
\[\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n\]
Common identities:
| $f(\mathbf{x})$ | $\frac{\partial f}{\partial \mathbf{x}}$ |
|---|---|
| $\mathbf{a}^T \mathbf{x}$ | $\mathbf{a}$ |
| $\mathbf{x}^T \mathbf{a}$ | $\mathbf{a}$ |
| $\mathbf{x}^T \mathbf{x}$ | $2\mathbf{x}$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ |
| $\mathbf{x}^T A \mathbf{x}$ (symmetric $A$) | $2A\mathbf{x}$ |
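
The quadratic-form identity can be sanity-checked numerically. The sketch below compares $(A + A^T)\mathbf{x}$ against a central finite-difference gradient; the helper names (`quad_form`, `analytic_grad`, `numeric_grad`) and the test values are illustrative assumptions, and plain Python lists are used so it runs without dependencies.

```python
def quad_form(A, x):
    """Return x^T A x for a square matrix A (list of rows) and vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_grad(A, x):
    """Gradient (A + A^T) x in denominator layout (same shape as x)."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


# Deliberately non-symmetric A, so 2Ax would be the wrong answer here.
A = [[1.0, 2.0, 0.5], [-1.0, 3.0, 4.0], [0.0, -2.0, 1.5]]
x = [0.3, -1.2, 2.0]

g_analytic = analytic_grad(A, x)
g_numeric = numeric_grad(lambda v: quad_form(A, v), x)
max_err = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print("max abs difference:", max_err)
```

The same `numeric_grad` helper works as a quick check for any of the other rows in the table.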
Vector by Vector (Jacobian)
For $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian:
\[J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}, \quad J_{ij} = \frac{\partial f_i}{\partial x_j}\]
Common identities:
| $\mathbf{f}(\mathbf{x})$ | $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ |
|---|---|
| $A\mathbf{x}$ | $A$ |
| $\mathbf{x}$ | $I$ |
| $\sigma(\mathbf{x})$ (elementwise) | $\text{diag}(\sigma'(\mathbf{x}))$ |
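
To illustrate the last row, the following sketch builds a finite-difference Jacobian of an elementwise function and confirms that it is diagonal. The logistic sigmoid is chosen as the activation purely for illustration, and `numeric_jacobian` is a hypothetical helper, not part of any library.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, used here as an example elementwise activation."""
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_vec(x):
    """Apply the sigmoid to every component of x."""
    return [sigmoid(t) for t in x]


def numeric_jacobian(f, x, eps=1e-6):
    """Approximate J[i][j] = d f_i / d x_j (shape m x n) by central differences."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


x = [-1.0, 0.0, 2.5]
J = numeric_jacobian(sigmoid_vec, x)

# For the sigmoid, sigma'(t) = sigma(t) * (1 - sigma(t)).
diag_expected = [sigmoid(t) * (1.0 - sigmoid(t)) for t in x]
diag_err = max(abs(J[i][i] - diag_expected[i]) for i in range(3))
off_diag_max = max(abs(J[i][j]) for i in range(3) for j in range(3) if i != j)
print("diagonal error:", diag_err, "largest off-diagonal entry:", off_diag_max)
```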
Scalar by Matrix
For $f: \mathbb{R}^{m \times n} \to \mathbb{R}$, the gradient has the same shape as the matrix:
\[\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times n}, \quad \left(\frac{\partial f}{\partial A}\right)_{ij} = \frac{\partial f}{\partial A_{ij}}\]
Common identities:
| $f(A)$ | $\frac{\partial f}{\partial A}$ |
|---|---|
| $\text{tr}(A)$ | $I$ |
| $\text{tr}(AB)$ | $B^T$ |
| $\text{tr}(A^T B)$ | $B$ |
| $\text{tr}(ABA^T)$ | $A(B + B^T)$ |
| $\log \det(A)$ (invertible $A$, $\det A > 0$) | $A^{-T} = (A^{-1})^T$ |
| $\mathbf{a}^T A \mathbf{b}$ | $\mathbf{a}\mathbf{b}^T$ |
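
As a quick check that a matrix gradient has the same shape as the matrix, the sketch below perturbs each entry of a non-square $A$ and compares the result with the outer product $\mathbf{a}\mathbf{b}^T$. The function names and sample values are assumptions made for this example.

```python
def bilinear(a, A, b):
    """Return a^T A b for A of shape m x n, a of length m, b of length n."""
    return sum(a[i] * A[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def numeric_matrix_grad(f, A, eps=1e-6):
    """Central-difference gradient of scalar f w.r.t. each entry of A."""
    m, n = len(A), len(A[0])
    G = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            A_plus = [row[:] for row in A]
            A_minus = [row[:] for row in A]
            A_plus[i][j] += eps
            A_minus[i][j] -= eps
            G[i][j] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return G


a = [1.0, -2.0]                               # length m = 2
b = [0.5, 3.0, -1.0]                          # length n = 3
A = [[0.2, -0.7, 1.1], [2.0, 0.4, -0.3]]      # shape 2 x 3

G_numeric = numeric_matrix_grad(lambda M: bilinear(a, M, b), A)
G_analytic = [[a[i] * b[j] for j in range(3)] for i in range(2)]  # outer product a b^T

max_err = max(
    abs(G_numeric[i][j] - G_analytic[i][j]) for i in range(2) for j in range(3)
)
print("gradient shape:", len(G_numeric), "x", len(G_numeric[0]), "max error:", max_err)
```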
The Chain Rule in Matrix Form
For $f(\mathbf{y})$ where $\mathbf{y} = A\mathbf{x} + \mathbf{b}$:
\[\frac{\partial f}{\partial \mathbf{x}} = A^T \frac{\partial f}{\partial \mathbf{y}}\]
For $f(\mathbf{y})$ where $\mathbf{y} = g(\mathbf{x})$ and $J_g$ is the Jacobian of $g$:
\[\frac{\partial f}{\partial \mathbf{x}} = J_g^T \frac{\partial f}{\partial \mathbf{y}}\]
This is exactly what happens in a neural network backward pass: each layer multiplies the upstream gradient by the transpose of its local Jacobian.
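
A minimal sketch of the affine case, assuming the illustrative choice $f(\mathbf{y}) = \tfrac{1}{2}\|\mathbf{y}\|^2$ (so that $\partial f / \partial \mathbf{y} = \mathbf{y}$). It computes $A^T \, \partial f / \partial \mathbf{y}$ and compares it with a finite-difference gradient taken directly with respect to $\mathbf{x}$. All names and values are hypothetical.

```python
def affine(A, x, b):
    """Return y = A x + b for A of shape m x n."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]


def f(y):
    """Example scalar function: half the squared norm of y."""
    return 0.5 * sum(v * v for v in y)


def grad_f(y):
    """Gradient of f with respect to y (equal to y for this choice of f)."""
    return list(y)


def transpose_matvec(A, v):
    """Return A^T v without forming the transpose explicitly."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


A = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # shape 2 x 3
b = [0.1, -0.3]
x = [0.2, -0.4, 1.0]

# Chain rule: df/dx = A^T (df/dy), evaluated at y = A x + b.
y = affine(A, x, b)
grad_chain = transpose_matvec(A, grad_f(y))

# Independent check: differentiate the composition f(A x + b) numerically.
eps = 1e-6
grad_numeric = []
for k in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[k] += eps
    x_minus[k] -= eps
    diff = f(affine(A, x_plus, b)) - f(affine(A, x_minus, b))
    grad_numeric.append(diff / (2 * eps))

max_err = max(abs(u - v) for u, v in zip(grad_chain, grad_numeric))
print("max abs difference:", max_err)
```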
Gradients Through a Linear Layer
Layer: $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, loss $\mathcal{L}$
Given upstream gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{y}}$:
\[\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{x}^T\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = W^T \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\]
Gradients Through Elementwise Operations
For $\mathbf{y} = \sigma(\mathbf{x})$ (elementwise activation):
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \odot \sigma'(\mathbf{x})\]
Here $\odot$ is the elementwise product. It replaces a full matrix multiplication because the Jacobian of an elementwise function is diagonal.
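
The linear-layer formulas and the elementwise rule can be combined into a small backward pass. The sketch below is one possible arrangement under stated assumptions: a linear layer followed by `tanh`, with a squared-error loss against a target `t`. The names `forward`, `backward`, `z`, `h`, and `t` are illustrative and do not come from the notes. The analytic gradients are compared with finite differences.

```python
import math


def forward(W, b, x, t):
    """Linear layer z = W x + b, activation h = tanh(z), loss 0.5 * ||h - t||^2."""
    z = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(b))]
    h = [math.tanh(v) for v in z]
    loss = 0.5 * sum((hi - ti) ** 2 for hi, ti in zip(h, t))
    return loss, h


def backward(W, x, h, t):
    """Backpropagate through the loss, the activation, and the linear layer."""
    dh = [hi - ti for hi, ti in zip(h, t)]                # dL/dh
    dz = [g * (1.0 - hi * hi) for g, hi in zip(dh, h)]    # elementwise: dL/dh * tanh'(z)
    dW = [[dzi * xj for xj in x] for dzi in dz]           # outer product dz x^T
    db = list(dz)                                         # dL/db = dz
    dx = [sum(W[i][j] * dz[i] for i in range(len(dz))) for j in range(len(x))]  # W^T dz
    return dW, db, dx


W = [[0.5, -1.0, 0.3], [1.2, 0.1, -0.7]]  # shape 2 x 3
b = [0.05, -0.2]
x = [1.0, 0.5, -1.5]
t = [0.3, -0.6]

loss, h = forward(W, b, x, t)
dW, db, dx = backward(W, x, h, t)

eps = 1e-6

# Finite-difference check of dL/dW, entry by entry.
err_W = 0.0
for i in range(2):
    for j in range(3):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        fd = (forward(W_plus, b, x, t)[0] - forward(W_minus, b, x, t)[0]) / (2 * eps)
        err_W = max(err_W, abs(fd - dW[i][j]))

# Finite-difference check of dL/dx.
err_x = 0.0
for j in range(3):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[j] += eps
    x_minus[j] -= eps
    fd = (forward(W, b, x_plus, t)[0] - forward(W, b, x_minus, t)[0]) / (2 * eps)
    err_x = max(err_x, abs(fd - dx[j]))

print("max error in dW:", err_W, "max error in dx:", err_x)
```

Note that `dW` has the same shape as `W` and `dx` the same shape as `x`, as the denominator-layout convention requires.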
Softmax Jacobian
For softmax $\mathbf{p} = \text{softmax}(\mathbf{z})$:
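\[\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)\]
In matrix form: $J = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$
Combined with the cross-entropy loss $\mathcal{L} = -\sum_i y_i \log p_i$, where $\mathbf{y}$ now denotes the target distribution (one-hot, or any vector summing to 1), the gradient simplifies to $\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{y}$.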
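
The simplification can be verified by pushing the cross-entropy gradient through the softmax Jacobian explicitly. This is a minimal sketch with made-up logits and a one-hot target. The max-subtraction inside `softmax` is a common numerical-stability trick and an addition of this example.

```python
import math


def softmax(z):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


z = [2.0, -1.0, 0.5]   # example logits
y = [0.0, 0.0, 1.0]    # one-hot target
n = len(z)

p = softmax(z)

# Softmax Jacobian: J[i][j] = p_i * (delta_ij - p_j), i.e. diag(p) - p p^T.
J = [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)] for i in range(n)]

# Cross-entropy L = -sum_i y_i log p_i, so dL/dp_i = -y_i / p_i.
dL_dp = [-y[i] / p[i] for i in range(n)]

# Chain rule in denominator layout: dL/dz = J^T dL/dp.
dL_dz = [sum(J[i][j] * dL_dp[i] for i in range(n)) for j in range(n)]

# Closed form from the text: p - y.
shortcut = [p[j] - y[j] for j in range(n)]

max_err = max(abs(u - v) for u, v in zip(dL_dz, shortcut))
print("max abs difference:", max_err)
```

In practice the closed form is used directly, which avoids forming the $n \times n$ Jacobian and the division by $p_i$.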
Trace and Cyclic Property
\[\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\]
Cyclic invariance is frequently used to rearrange matrix derivative expressions. Only cyclic shifts are allowed: in general $\text{tr}(ABC) \neq \text{tr}(ACB)$.
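As a small worked example of this kind of rearrangement (a sketch that relies only on the identities tabulated under Scalar by Matrix), the gradient of the bilinear form $\mathbf{a}^T A \mathbf{b}$ can be recovered from the trace rule. A scalar equals its own trace, so the cyclic property gives
\[\mathbf{a}^T A \mathbf{b} = \text{tr}(\mathbf{a}^T A \mathbf{b}) = \text{tr}(A \mathbf{b}\mathbf{a}^T)\]
Applying $\frac{\partial}{\partial A}\text{tr}(AB) = B^T$ with $B = \mathbf{b}\mathbf{a}^T$ then yields
\[\frac{\partial}{\partial A}\left(\mathbf{a}^T A \mathbf{b}\right) = (\mathbf{b}\mathbf{a}^T)^T = \mathbf{a}\mathbf{b}^T\]
which agrees with the corresponding table entry.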
Vectorization
The vec operator stacks the columns of a matrix $A \in \mathbb{R}^{m \times n}$ into a single vector:
\[\text{vec}(A) \in \mathbb{R}^{mn}\]
The Kronecker product $A \otimes B$ is used to express matrix derivatives as standard vector Jacobians:
\[\text{vec}(AXB) = (B^T \otimes A)\text{vec}(X)\]
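
The identity can be checked on small integer matrices, where the arithmetic is exact. The helpers below (`matmul`, `transpose`, `vec`, `kron`) are hand-written for this sketch and assume column-major stacking for `vec`, as in the definition above.

```python
def matmul(P, Q):
    """Matrix product of P (p x k) and Q (k x q), both given as lists of rows."""
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def transpose(M):
    """Return the transpose of M."""
    return [list(col) for col in zip(*M)]


def vec(M):
    """Stack the columns of M into one flat list (column-major order)."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]


def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    p, q = len(P), len(P[0])
    r, s = len(Q), len(Q[0])
    return [
        [P[i // r][j // s] * Q[i % r][j % s] for j in range(q * s)]
        for i in range(p * r)
    ]


A = [[1, 2, 0], [-1, 3, 2]]          # 2 x 3
X = [[2, -1], [0, 4], [1, 1]]        # 3 x 2
B = [[1, 0, 2, -1], [3, 1, 0, 2]]    # 2 x 4

# Left-hand side: vec(A X B), a vector of length 2 * 4 = 8.
lhs = vec(matmul(matmul(A, X), B))

# Right-hand side: (B^T kron A) vec(X), with B^T kron A of shape 8 x 6.
K = kron(transpose(B), A)
vX = vec(X)
rhs = [sum(k * v for k, v in zip(row, vX)) for row in K]

print("identity holds:", lhs == rhs)
```

Read as a Jacobian, this says that the map $X \mapsto AXB$ has derivative $B^T \otimes A$ with respect to $\text{vec}(X)$.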