Normalizing Flows
Normalizing flows are generative models that learn an exact, invertible mapping between a simple base distribution and the data distribution. They provide exact likelihood evaluation and efficient sampling.
The Core Idea
Let $z \sim p_Z(z)$ be a simple distribution (e.g., $\mathcal{N}(0, I)$). Define an invertible, differentiable function $f_\theta: \mathcal{Z} \to \mathcal{X}$. Then $x = f_\theta(z)$ is a sample from the model.
Change of variables formula:
\[p_X(x) = p_Z(f_\theta^{-1}(x)) \left|\det J_{f_\theta^{-1}}(x)\right|\]
or equivalently, writing $z = g_\theta(x) = f_\theta^{-1}(x)$:
\[\log p_X(x) = \log p_Z(g_\theta(x)) + \log \left|\det J_{g_\theta}(x)\right|\]
The log absolute Jacobian determinant accounts for the volume change of the transformation.
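As a small worked check of the change of variables formula, the sketch below uses a one-dimensional affine map $x = a z + b$ with a standard normal base. The values of `a` and `b` and the function names are illustrative assumptions, not part of any library. The flow density should coincide with the closed-form density of $\mathcal{N}(b, a^2)$.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log density of N(0, 1), evaluated elementwise."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(x, a, b):
    """log p_X(x) for x = a * z + b with z ~ N(0, 1) and a != 0."""
    z = (x - b) / a                      # g(x) = f^{-1}(x)
    log_abs_det = -np.log(np.abs(a))     # log |dg/dx| = -log |a|
    return standard_normal_logpdf(z) + log_abs_det

a, b = 2.0, -1.0                         # illustrative parameters (assumed)
x = np.linspace(-6.0, 4.0, 5)
flow_logp = affine_flow_logpdf(x, a, b)

# Closed-form reference: x is distributed as N(b, a^2).
exact_logp = (-0.5 * ((x - b) / a) ** 2
              - np.log(np.abs(a)) - 0.5 * np.log(2.0 * np.pi))
```

The Jacobian term is what keeps the density normalized: without `log_abs_det`, the result would integrate to $|a|$ instead of 1.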
Composing Flows
A normalizing flow is typically a composition of $K$ simple invertible transformations:
\[x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z)\]
Writing $x_0 = z$ and $x_k = f_k(x_{k-1})$, so that $x_K = x$:
\[\log p_X(x) = \log p_Z(z) + \sum_{k=1}^K \log \left|\det J_{f_k^{-1}}(x_k)\right|\]
Each $f_k$ must:
- Be invertible.
- Have a tractable Jacobian determinant.
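The composition rule can be sketched with a few scalar layers. The `Affine` and `Sinh` classes below are hypothetical toy layers chosen because their inverses and derivatives are known in closed form. Each `inverse` returns both the inverted value and the log absolute derivative of the inverse, so the log-density is a running sum over layers visited in reverse order.

```python
import numpy as np

class Affine:
    """x = a * z + b with a != 0 (toy layer, assumed for illustration)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def forward(self, z):
        return self.a * z + self.b
    def inverse(self, x):
        # Returns f^{-1}(x) and log |d f^{-1} / dx|.
        return (x - self.b) / self.a, np.full_like(x, -np.log(abs(self.a)))

class Sinh:
    """x = sinh(z); the inverse is arcsinh with derivative 1 / sqrt(1 + x^2)."""
    def forward(self, z):
        return np.sinh(z)
    def inverse(self, x):
        return np.arcsinh(x), -0.5 * np.log1p(x ** 2)

def sample(layers, z):
    """Push base samples through f_1, ..., f_K."""
    for layer in layers:
        z = layer.forward(z)
    return z

def log_prob(layers, x):
    """Invert f_K, ..., f_1 and accumulate the log-determinants."""
    total = np.zeros_like(x)
    for layer in reversed(layers):
        x, log_det = layer.inverse(x)
        total += log_det
    base = -0.5 * (x ** 2 + np.log(2.0 * np.pi))   # standard normal base
    return base + total

layers = [Affine(0.5, 1.0), Sinh(), Affine(2.0, -3.0)]
z = np.random.default_rng(0).standard_normal(5)
x_samples = sample(layers, z)
logp = log_prob(layers, x_samples)
```

Because every layer is elementwise here, each Jacobian is diagonal; real flows use layers that mix dimensions while keeping the determinant cheap.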
Coupling Layers
The dominant building block. Split $x$ into two halves $[x_1, x_2]$:
\(y_1 = x_1\) \(y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)\)
where $s$ (scale) and $t$ (translation) are arbitrary neural networks of $x_1$.
Inverse:
\(x_1 = y_1\) \(x_2 = (y_2 - t(y_1)) \odot \exp(-s(y_1))\)
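Jacobian determinant: $\det J = \prod_i \exp(s_i(x_1))$, so the log-determinant is simply $\log \lvert\det J\rvert = \sum_i s_i(x_1)$. The Jacobian is triangular, so no general determinant computation is needed.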
Networks $s$ and $t$ can be arbitrarily complex (no invertibility constraint on them), enabling expressive transformations with cheap Jacobian computation.
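A minimal NumPy sketch of an affine coupling layer follows. The "networks" `s` and `t` are stand-ins (a single random linear map, with a `tanh` on the scale), and the sizes `D` and `d` are arbitrary assumptions; any function of $x_1$ would work in their place.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2                                   # total dims; first d pass through
W_s = 0.1 * rng.standard_normal((d, D - d))   # hypothetical scale weights
W_t = rng.standard_normal((d, D - d))         # hypothetical translation weights

def s(x1):
    """Scale network stand-in; tanh keeps the scales bounded."""
    return np.tanh(x1 @ W_s)

def t(x1):
    """Translation network stand-in."""
    return x1 @ W_t

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    log_det = np.sum(s(x1))                   # log |det J| = sum_i s_i(x1)
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))        # s and t are never inverted
    return np.concatenate([y1, x2])

x = rng.standard_normal(D)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Note that the inverse only evaluates `s` and `t` in the forward direction, which is why they need no invertibility constraint.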
Autoregressive Flows
Generalize coupling layers. Each dimension is transformed conditioned on all previous ones:
\[y_i = f(x_i; c_i), \quad c_i = g(x_1, \ldots, x_{i-1})\]
Masked Autoregressive Flow (MAF): uses a MADE-style masked MLP to compute all conditionals in one forward pass. Fast density estimation; slow sampling (sequential).
Inverse Autoregressive Flow (IAF): reverses the direction. Fast sampling; slow density estimation. Useful as a posterior in VAEs (IAF-VAE).
| Flow | Forward (density) | Inverse (sampling) |
|---|---|---|
| MAF | Fast (parallel) | Slow (sequential) |
| IAF | Slow (sequential) | Fast (parallel) |
| Coupling | Fast | Fast |
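The asymmetry in the table can be seen in a small MAF-style affine autoregressive sketch. The conditioner here is a strictly lower-triangular linear map, a simplifying assumption standing in for a MADE network, and the names `to_latent` and `from_latent` are hypothetical. The density direction needs one conditioner call; sampling needs one call per dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
mask = np.tril(np.ones((D, D)), k=-1)            # row i sees only x_1..x_{i-1}
A_mu = rng.standard_normal((D, D)) * mask
A_alpha = 0.3 * rng.standard_normal((D, D)) * mask

def conditioner(x):
    """Return shift mu_i and log-scale alpha_i, each a function of x_{<i}."""
    return A_mu @ x, np.tanh(A_alpha @ x)

def to_latent(x):
    """Density direction: all dimensions computed in one parallel pass."""
    mu, alpha = conditioner(x)
    u = (x - mu) * np.exp(-alpha)
    return u, -np.sum(alpha)                     # log |det du/dx|

def from_latent(u):
    """Sampling direction: x_i needs x_{<i}, so dimensions are filled in order."""
    x = np.zeros(D)
    for i in range(D):
        mu, alpha = conditioner(x)               # row i only reads x[:i]
        x[i] = u[i] * np.exp(alpha[i]) + mu[i]
    return x

x = rng.standard_normal(D)
u, log_det = to_latent(x)
x_rec = from_latent(u)
```

An IAF-style layer would swap the roles, conditioning on the latent `u` instead of `x`, which makes sampling the parallel direction.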
Real NVP (Non-Volume Preserving)
Stacks multiple coupling layers, alternating which half of dimensions is kept fixed. Uses checkerboard and channel-wise masking for images.
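\[\log p(x) = \log p_Z(z) + \sum_{k=1}^K \sum_i s_k^{(i)}(x_1)\]
Here the coupling layers are applied in the data-to-latent direction, and $x_1$ denotes the pass-through half of the input to layer $k$. Provides exact likelihood for images; reported density-estimation results are competitive with, though somewhat behind, autoregressive models such as PixelCNN.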
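The stacked log-likelihood can be sketched on plain vectors. This is a toy under stated assumptions: linear stand-ins for each layer's $s$ and $t$, a standard normal base, and a simple swap of halves in place of the checkerboard and channel-wise masks used for images.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 3
half = D // 2
# Hypothetical per-layer weights for the scale and translation stand-ins.
layers = [(0.1 * rng.standard_normal((half, half)),
           rng.standard_normal((half, half))) for _ in range(K)]

def to_latent(x):
    """Map data to latent through K coupling layers, summing log-determinants."""
    h, total_log_det = x, 0.0
    for W_s, W_t in layers:
        h1, h2 = h[:half], h[half:]
        s = np.tanh(h1 @ W_s)
        y2 = h2 * np.exp(s) + h1 @ W_t
        total_log_det += np.sum(s)
        h = np.concatenate([y2, h1])      # swap halves: a permutation, |det| = 1
    return h, total_log_det

def log_prob(x):
    z, total_log_det = to_latent(x)
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * D * np.log(2.0 * np.pi)
    return log_pz + total_log_det

x = rng.standard_normal(D)
logp = log_prob(x)
```

Swapping after every layer means the half that was just transformed becomes the conditioning half of the next layer, so every dimension is eventually updated.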
Glow
Extension of Real NVP with:
- Actnorm: per-channel affine transform initialized via data-dependent statistics (replaces batch norm).
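- Invertible 1×1 convolution: a learnable linear mixing of channels with weight matrix $W$, generalizing the fixed channel permutations used between coupling layers. For an $h \times w$ feature map the log-determinant is $h \cdot w \cdot \log \lvert\det W\rvert$; parameterizing $W$ by its LU decomposition makes this cheap to compute.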
- Affine coupling layers.
Architecture: $L$ levels, each with $K$ steps of (Actnorm → 1×1 Conv → Affine Coupling).
First model to demonstrate high-quality $256\times256$ face synthesis with exact likelihoods.
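The invertible 1×1 convolution amounts to applying the same $c \times c$ matrix at every spatial position. The sketch below assumes a channels-last layout and a near-orthogonal random `W`; it computes the determinant directly with `np.linalg.slogdet` and does not implement the LU parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c = 3, 2, 4                                  # illustrative tensor shape
# Start from an orthogonal matrix and perturb it slightly (assumed init).
W = np.linalg.qr(rng.standard_normal((c, c)))[0] + 0.1 * rng.standard_normal((c, c))

def conv1x1(x, W):
    """Apply y[i, j, :] = W @ x[i, j, :] at every spatial position."""
    return x @ W.T

def conv1x1_logdet(x, W):
    """Log |det| of the full map: one factor of |det W| per pixel."""
    return x.shape[0] * x.shape[1] * np.linalg.slogdet(W)[1]

def conv1x1_inverse(y, W):
    return y @ np.linalg.inv(W).T

x = rng.standard_normal((h, w, c))
y = conv1x1(x, W)
log_det = conv1x1_logdet(x, W)
x_rec = conv1x1_inverse(y, W)
```

Viewed on the flattened tensor, the Jacobian is block diagonal with $h \cdot w$ copies of $W$, which is where the $h \cdot w$ factor comes from.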
Continuous Normalizing Flows (CNF)
Instead of discrete transformation steps, define a continuous-time flow via an ODE:
\[\frac{dx}{dt} = f_\theta(x, t), \quad x(0) = z, \quad x(1) = \tilde{x}\]
Instantaneous change of variables:
\[\frac{d \log p(x(t))}{dt} = -\text{tr}\!\left(\frac{\partial f_\theta}{\partial x(t)}\right)\]
The trace of the Jacobian (much cheaper than the determinant) is the only quantity needed. Stochastic trace estimators (Hutchinson) make this scalable.
FFJORD: uses Hutchinson’s estimator for unbiased, cheap log-likelihood. Solve the ODE with a black-box ODE solver.
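A toy illustration of both ingredients, assuming a linear vector field $f(x, t) = A x$ so that the Jacobian is simply $A$ and the exact answer is known. The fixed-step Euler loop is a stand-in for a black-box ODE solver, and `hutchinson_trace` is a hypothetical helper that only needs Jacobian-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = 0.5 * rng.standard_normal((D, D))        # assumed linear dynamics

def f(x, t):
    """Vector field; its Jacobian with respect to x is A for all x and t."""
    return A @ x

def hutchinson_trace(jvp, dim, n_samples, rng):
    """Estimate tr(J) as the mean of eps^T (J eps) over Rademacher eps."""
    eps = rng.choice([-1.0, 1.0], size=(n_samples, dim))
    return np.mean([e @ jvp(e) for e in eps])

def integrate(z, n_steps=1000):
    """Euler-integrate x(t) and the change in log-density from t = 0 to 1."""
    x, delta_logp = z.copy(), 0.0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        delta_logp -= np.trace(A) * dt       # d log p / dt = -tr(df/dx)
        x = x + f(x, k * dt) * dt
    return x, delta_logp

z = rng.standard_normal(D)
x1, delta_logp = integrate(z)                # log p(x1) = log p_Z(z) + delta_logp
trace_estimate = hutchinson_trace(lambda v: A @ v, D, 20000, rng)
```

Here the exact trace is used inside the integrator for determinism; for a neural vector field in high dimensions, the stochastic estimate would replace `np.trace(A)`.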
Flow Matching
A more efficient training paradigm for CNFs (Lipman et al. 2022):
Instead of maximizing likelihood via the ODE, directly regress the vector field that transforms $p_0$ to $p_1$:
\[\mathcal{L}_\text{FM} = \mathbb{E}_{t, x(t)}\!\left[\|v_\theta(x(t), t) - u_t(x(t))\|^2\right]\]
where $u_t$ is a target vector field (e.g., the constant velocity along a straight-line path between noise and data). Training requires no ODE simulation, which makes it simpler and more scalable than FFJORD. Used in Stable Diffusion 3 (rectified flows).
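In practice the marginal field $u_t$ is not available in closed form, and the conditional variant of the loss is regressed instead: sample a noise point $x_0$, a data point $x_1$, and a time $t$, then match the velocity of the straight line between them. The sketch below estimates that loss for a given callable `v`. The shifted-noise "data" and its deterministic pairing are toy assumptions chosen so that the optimal field is known exactly.

```python
import numpy as np

def cfm_loss(v, x0, x1, t):
    """Monte Carlo conditional flow matching loss for straight-line paths.

    x0: noise samples (N, D); x1: data samples (N, D); t: times in [0, 1], shape (N,).
    """
    xt = (1.0 - t[:, None]) * x0 + t[:, None] * x1   # point on the line at time t
    target = x1 - x0                                 # constant velocity of that line
    return np.mean(np.sum((v(xt, t) - target) ** 2, axis=1))

rng = np.random.default_rng(0)
N, D = 1024, 2
x0 = rng.standard_normal((N, D))
shift = np.array([3.0, -1.0])
x1 = x0 + shift              # toy data: each noise point paired with its shifted copy
t = rng.uniform(size=N)

# The constant field v(x, t) = shift transports x0 to x1 exactly.
loss_perfect = cfm_loss(lambda x, t: np.broadcast_to(shift, x.shape), x0, x1, t)
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x), x0, x1, t)
```

With real data, `x0` and `x1` would typically be drawn independently and `v` would be a neural network trained by gradient descent on this loss.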
Spline Flows
Replace affine coupling transforms with monotone splines (piecewise-polynomial bijections). More expressive per layer; still have tractable inverses and Jacobians.
Neural Spline Flows: rational-quadratic splines with learnable knots. Reported state-of-the-art results among flows on tabular density estimation benchmarks when introduced.
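To show why monotone splines keep inverses and Jacobians tractable, the sketch below uses the simplest case, a piecewise-linear spline, rather than the rational-quadratic form of Neural Spline Flows. The bin widths and heights are fixed illustrative values (in a flow they would be network outputs), and inputs are assumed to lie inside the interval `[lo, hi]`.

```python
import numpy as np

def make_knots(widths, heights, lo=-3.0, hi=3.0):
    """Turn positive bin widths and heights into increasing knots on [lo, hi]."""
    cw, ch = np.cumsum(widths), np.cumsum(heights)
    xk = lo + (hi - lo) * np.concatenate([[0.0], cw / cw[-1]])
    yk = lo + (hi - lo) * np.concatenate([[0.0], ch / ch[-1]])
    return xk, yk

def spline_forward(x, xk, yk):
    """Monotone piecewise-linear map and its elementwise log-derivative."""
    y = np.interp(x, xk, yk)
    slopes = np.diff(yk) / np.diff(xk)           # one positive slope per bin
    bins = np.clip(np.searchsorted(xk, x, side="right") - 1, 0, len(slopes) - 1)
    return y, np.log(slopes[bins])

def spline_inverse(y, xk, yk):
    """Swapping the knot arrays inverts a monotone interpolant."""
    return np.interp(y, yk, xk)

widths = np.array([1.0, 2.0, 0.5, 1.5])          # assumed values
heights = np.array([0.5, 1.0, 2.0, 1.5])         # assumed values
xk, yk = make_knots(widths, heights)

x = np.linspace(-2.9, 2.9, 50)
y, log_det = spline_forward(x, xk, yk)
x_rec = spline_inverse(y, xk, yk)
```

Rational-quadratic splines follow the same bin-lookup pattern but are smooth across knots, which is part of what makes them more expressive per layer.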
Comparison with Other Generative Models
| Property | Normalizing Flow | VAE | GAN | Diffusion |
|---|---|---|---|---|
| Exact likelihood | Yes | No (ELBO) | No | No (ELBO) |
| Sampling speed | Fast | Fast | Fast | Slow |
| Latent space | Yes (exact inversion) | Approximate | None | Implicit |
| Expressiveness | Limited by architecture | High | High | Highest |
| Training stability | Stable (MLE) | Stable | Unstable | Stable |
| Memory | High (invertibility) | Low | Low | Moderate |
Applications
- Density estimation: model tabular data distributions; anomaly detection.
- Variational inference: IAF as flexible posterior in VAEs.
- Physics simulations: normalizing flows as surrogate models for complex posteriors.
- Image synthesis: Glow (high-res faces), Real NVP.
- Audio: WaveGlow (WaveNet + Glow for TTS).