Variational Autoencoders
Variational Autoencoders (VAEs) are latent variable generative models trained by maximizing a tractable lower bound on the log-likelihood. They simultaneously learn a generative model $p_\theta(x \mid z)$ and an inference model (encoder) $q_\phi(z \mid x)$.
The Generative Model
\[p_\theta(x) = \int p_\theta(x \mid z) p(z) \, dz\]
- Prior: $p(z) = \mathcal{N}(0, I)$, a standard Gaussian in $\mathbb{R}^d$.
- Decoder: $p_\theta(x \mid z)$, a neural network mapping latent $z$ to a distribution over $x$.
- Marginal likelihood: the integral is intractable for nonlinear decoders.
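For intuition, the integral can be estimated by naive Monte Carlo in a toy case where the exact answer is known. The sketch below assumes a hypothetical 1-D linear-Gaussian decoder $p(x \mid z) = \mathcal{N}(wz, s^2)$, for which $p(x) = \mathcal{N}(0, w^2 + s^2)$ in closed form. The names `w`, `s`, and all function names are illustrative and not part of the model described above.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_mc(x, w, s, num_samples, rng):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ N(0,1)}[p(x | z)]."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                 # sample from the prior p(z)
        total += normal_pdf(x, w * z, s ** 2)   # decoder likelihood p(x | z)
    return total / num_samples

def marginal_exact(x, w, s):
    """Closed form, available only because this toy decoder is linear."""
    return normal_pdf(x, 0.0, w ** 2 + s ** 2)

rng = random.Random(0)
estimate = marginal_mc(x=1.0, w=1.0, s=1.0, num_samples=50_000, rng=rng)
exact = marginal_exact(x=1.0, w=1.0, s=1.0)
```

With a nonlinear decoder and a high-dimensional $z$, most prior samples give negligible $p_\theta(x \mid z)$, so this estimator needs impractically many samples.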
The ELBO
Since $\log p_\theta(x)$ is intractable, introduce an approximate posterior $q_\phi(z \mid x)$ (the encoder) and derive the Evidence Lower Bound (ELBO):
\[\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))\]
Derivation:
\[\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) \geq \mathcal{L}(\theta, \phi; x)\]
since $D_\text{KL} \geq 0$. Equality holds when $q_\phi(z \mid x) = p_\theta(z \mid x)$ exactly.
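The identity follows in a few lines from Bayes' rule, $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$, together with the fact that $\log p_\theta(x)$ does not depend on $z$ and can therefore be placed inside an expectation over $q_\phi(z \mid x)$:
\[
\begin{aligned}
\log p_\theta(x) &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x)\right] \\
&= \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p(z)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\
&= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_\text{KL}(q_\phi(z \mid x) \| p(z))}_{\mathcal{L}(\theta, \phi; x)} + D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))
\end{aligned}
\]
The third line adds and subtracts $\log q_\phi(z \mid x)$ inside the expectation.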
Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$. Encourages the decoder to reconstruct $x$ from $z$.
KL term: $D_\text{KL}(q_\phi(z \mid x) \| p(z))$. Regularizes the approximate posterior toward the prior. For a diagonal Gaussian encoder and a standard Gaussian prior, it has a closed form:
\[D_\text{KL}(\mathcal{N}(\mu, \text{diag}(\sigma^2)) \| \mathcal{N}(0, I)) = \frac{1}{2}\sum_{j=1}^d (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)\]
Gaussian VAE
Encoder: $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$
The encoder network outputs $\mu_\phi(x)$ and $\log \sigma^2_\phi(x)$ (log-variance for numerical stability).
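Given these two encoder outputs, the closed-form KL term can be evaluated directly. The following is a minimal sketch in plain Python; the function name and the list-based representation of $\mu$ and $\log \sigma^2$ are assumptions made for illustration. Working with the log-variance means $\sigma_j^2$ is recovered with `exp` and $\log \sigma_j^2$ is available without ever taking the logarithm of a small number.

```python
import math

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, logvar)
    )

# The KL is zero when the posterior equals the prior ...
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and grows as the mean moves away from the origin.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```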
Decoder: the likelihood depends on the data type:
- Continuous data: $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$ with fixed $\sigma^2$; the reconstruction loss is then MSE, up to a scale of $1/(2\sigma^2)$ and an additive constant.
- Binary data: $p_\theta(x \mid z) = \text{Bernoulli}(\mu_\theta(z))$, reconstruction loss = binary cross-entropy.
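Both reconstruction losses are negative log-likelihoods under the chosen decoder distribution. A minimal sketch, assuming the decoder output is already available as a list of means or probabilities (the function names and the clamping constant `eps` are illustrative choices):

```python
import math

def gaussian_nll(x, mean, sigma2):
    """-log N(x; mean, sigma2 * I) for a fixed scalar variance sigma2."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return 0.5 * sq_err / sigma2 + 0.5 * d * math.log(2.0 * math.pi * sigma2)

def bernoulli_nll(x, probs, eps=1e-7):
    """-log Bernoulli(x; probs): binary cross-entropy summed over dimensions."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

x = [1.0, 0.0]
nll_close = gaussian_nll(x, [0.5, 0.0], sigma2=1.0)
nll_far = gaussian_nll(x, [0.0, 0.0], sigma2=1.0)
bce_uninformed = bernoulli_nll(x, [0.5, 0.5])
```

Keeping the $1/(2\sigma^2)$ factor explicit shows that the choice of $\sigma^2$ sets the relative weight of reconstruction against the KL term.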
The Reparameterization Trick
The ELBO involves $\mathbb{E}_{q_\phi(z \mid x)}[\cdot]$. Drawing $z \sim q_\phi(z \mid x)$ directly is not a differentiable operation with respect to $\phi$, so gradients cannot flow back through the sample to the encoder.
Reparameterization: sample $\epsilon \sim \mathcal{N}(0, I)$, then set:
\[z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\]
Now $z$ is a deterministic function of $\phi$, $x$, and $\epsilon$, and all randomness sits in $\epsilon$, which does not depend on $\phi$. Gradients $\partial z / \partial \phi$ exist, so backpropagation flows through the sampling operation.
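The transformation can be sketched in a few lines. This example assumes the encoder outputs are given as plain lists `mu` and `logvar` (hypothetical names), and uses a finite difference in place of automatic differentiation to illustrate that, with $\epsilon$ held fixed, $z$ varies smoothly with the encoder outputs.

```python
import math
import random

def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, elementwise, with sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

rng = random.Random(0)
mu, logvar = [0.5, -1.0], [0.0, math.log(4.0)]   # sigma = [1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in mu]          # noise drawn independently of phi

z = reparameterize(mu, logvar, eps)

# With eps held fixed, dz/dmu = 1 (checked here by a finite difference).
h = 1e-6
z_shifted = reparameterize([mu[0] + h, mu[1]], logvar, eps)
dz0_dmu0 = (z_shifted[0] - z[0]) / h
```

In an automatic differentiation framework the same expression would be written on tensors, and the gradient with respect to `mu` and `logvar` would be obtained by backpropagation rather than by finite differences.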
Training Objective
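\[-\mathcal{L}(\theta, \phi; x) = \mathcal{L}_\text{recon} + \mathcal{L}_\text{KL}, \quad \mathcal{L}_\text{recon} = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)], \quad \mathcal{L}_\text{KL} = D_\text{KL}(q_\phi(z \mid x) \| p(z))\]
Maximizing the ELBO is equivalent to minimizing this loss: maximize the reconstruction likelihood and minimize the KL divergence.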
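As a sanity check on the objective, the ELBO can be compared with the exact log-likelihood in a toy model where both are available in closed form. The sketch assumes a hypothetical 1-D linear-Gaussian model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(wz, s^2)$, with $q(z \mid x) = \mathcal{N}(m, v)$; the symbols `w`, `s`, `m`, `v` and the function names are illustrative only.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_marginal(x, w, s):
    """Exact log p(x): marginally x ~ N(0, w^2 + s^2)."""
    var = w * w + s * s
    return -0.5 * (LOG_2PI + math.log(var)) - 0.5 * x * x / var

def kl_to_prior(m, v):
    """KL( N(m, v) || N(0, 1) ) in one dimension."""
    return 0.5 * (m * m + v - math.log(v) - 1.0)

def elbo_exact(x, w, s, m, v):
    """Closed-form ELBO for q(z | x) = N(m, v)."""
    expected_sq_err = (x - w * m) ** 2 + w * w * v   # E_q[(x - w z)^2]
    recon = -0.5 * (LOG_2PI + math.log(s * s)) - 0.5 * expected_sq_err / (s * s)
    return recon - kl_to_prior(m, v)

def elbo_one_sample(x, w, s, m, v, eps):
    """Single-sample estimate: reparameterized z, analytic KL."""
    z = m + math.sqrt(v) * eps
    log_lik = -0.5 * (LOG_2PI + math.log(s * s)) - 0.5 * (x - w * z) ** 2 / (s * s)
    return log_lik - kl_to_prior(m, v)

x, w, s = 1.0, 1.0, 1.0
v_post = 1.0 / (1.0 + w * w / (s * s))   # exact posterior variance
m_post = v_post * w * x / (s * s)        # exact posterior mean

m, v = 0.3, 0.6                          # an arbitrary, inexact q
rng = random.Random(0)
n = 20_000
mc_elbo = sum(elbo_one_sample(x, w, s, m, v, rng.gauss(0.0, 1.0)) for _ in range(n)) / n
```

In this toy model the ELBO at `(m_post, v_post)` coincides with `log_marginal`, any other `(m, v)` gives a strictly smaller value, and the average of the single-sample estimates approaches `elbo_exact`.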
Mini-batch ELBO with one MC sample per data point (a single sample is usually adequate in practice when mini-batches are reasonably large):
\[\mathcal{L}(\theta, \phi; x) \approx \log p_\theta(x \mid z) - D_\text{KL}(q_\phi(z \mid x) \| p(z)), \quad z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
KL Annealing and $\beta$-VAE
KL collapse (also called posterior collapse): early in training, the KL term can dominate and push $q_\phi(z \mid x) \to p(z)$, so the latent code carries no information about the input. The decoder then learns to ignore $z$, and the latent space goes unused.
KL annealing: multiply the KL term by a weight $\beta$; start with $\beta = 0$ (pure reconstruction) and gradually increase $\beta$ to 1.
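One simple way to realize such a schedule is a linear warm-up. The linear shape and the `warmup_steps` parameter are assumptions for this sketch; the text above only requires that $\beta$ rises gradually from 0 to 1.

```python
def linear_kl_weight(step, warmup_steps):
    """Linearly ramp beta from 0 to 1 over warmup_steps, then hold at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))

def annealed_loss(recon_nll, kl, step, warmup_steps):
    """Negative ELBO with the KL term scaled by the current beta."""
    return recon_nll + linear_kl_weight(step, warmup_steps) * kl

betas = [linear_kl_weight(t, warmup_steps=1000) for t in (0, 250, 1000, 5000)]
# betas == [0.0, 0.25, 1.0, 1.0]
```

Here `recon_nll` and `kl` stand for the per-batch reconstruction loss and KL term computed elsewhere in the training loop.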
$\beta$-VAE (Higgins et al. 2017): intentionally upweight the KL term by $\beta > 1$:
\[\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta \cdot D_\text{KL}(q_\phi(z \mid x) \| p(z))\]
A larger $\beta$ encourages disentangled latent representations, in which individual $z_j$ correspond to independent factors of variation, typically at some cost in reconstruction quality. $\beta = 1$ recovers the standard VAE.
Latent Space Properties
A well-trained VAE encodes semantically similar inputs to nearby latent codes. The continuous, smooth latent space enables:
- Interpolation: linearly interpolate between two encodings $z_1, z_2$ and decode intermediate points.
- Arithmetic: $z_\text{man with glasses} - z_\text{man} + z_\text{woman} \approx z_\text{woman with glasses}$.
- Conditional generation: condition decoder on class label or other attribute.
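The first two operations are plain vector arithmetic on latent codes. A minimal sketch, with latent codes represented as Python lists and all function names chosen for illustration; decoding the resulting codes is left to the trained decoder and is not shown.

```python
def lerp(z1, z2, t):
    """Point a fraction t of the way from z1 to z2 (t = 0 gives z1, t = 1 gives z2)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z1, z2)]

def interpolation_path(z1, z2, num_points):
    """Evenly spaced latent codes from z1 to z2, endpoints included (num_points >= 2)."""
    return [lerp(z1, z2, i / (num_points - 1)) for i in range(num_points)]

def attribute_arithmetic(z_source, z_without, z_with):
    """Add the attribute direction (z_with - z_without) to z_source."""
    return [s + (w - wo) for s, wo, w in zip(z_source, z_without, z_with)]

path = interpolation_path([0.0, 0.0], [1.0, 2.0], num_points=5)
shifted = attribute_arithmetic([0.0, 1.0], [1.0, 0.0], [1.0, 3.0])
```

For the glasses example, `z_source` would be the code for "woman", `z_without` the code for "man", and `z_with` the code for "man with glasses".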
Hierarchical VAEs
A single diagonal-Gaussian posterior is often too restrictive to approximate the true posterior well. Hierarchical VAEs stack multiple latent layers:
\[p_\theta(x) = \int p_\theta(x \mid z_1) p_\theta(z_1 \mid z_2) \cdots p_\theta(z_{L-1} \mid z_L) p(z_L) \, dz_{1:L}\]
NVAE and VDVAE are deep hierarchical VAEs with residual connections in the latent hierarchy; they achieve competitive image generation quality.
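The factorization implies ancestral sampling: draw $z_L$ from the prior, then sample each lower layer conditioned on the one above it. The sketch below uses a toy 1-D chain with hypothetical linear-Gaussian conditionals $p(z_l \mid z_{l+1}) = \mathcal{N}(a z_{l+1}, s^2)$; in real hierarchical VAEs each conditional is parameterized by a neural network, and a final decoder step maps $z_1$ to $x$.

```python
import random

def sample_hierarchy(num_layers, a, s, rng):
    """Ancestral sample [z_L, ..., z_1] from a toy 1-D chain.

    Assumes p(z_L) = N(0, 1) and p(z_l | z_{l+1}) = N(a * z_{l+1}, s^2).
    """
    z = rng.gauss(0.0, 1.0)                   # top-level latent z_L from the prior
    chain = [z]
    for _ in range(num_layers - 1):
        z = a * z + s * rng.gauss(0.0, 1.0)   # sample the next layer down
        chain.append(z)
    return chain

rng = random.Random(0)
chain = sample_hierarchy(num_layers=3, a=0.5, s=1.0, rng=rng)
```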
VQ-VAE (Vector Quantized VAE)
Replaces the continuous Gaussian latent with a discrete codebook:
- Encoder maps $x$ to continuous vectors $z_e(x) \in \mathbb{R}^d$.
- Each vector is replaced by its nearest codebook entry: $z_q = e_{k^*}$ with $k^* = \arg\min_k \|z_e(x) - e_k\|_2$.
- Decoder reconstructs $x$ from $z_q$.
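Straight-through estimator: since the argmin is non-differentiable, copy gradients directly from the decoder input $z_q$ to the encoder output $z_e$. The codebook is updated either through the codebook term in the loss below or, as an alternative to that term, via an exponential moving average of the encoder outputs assigned to each entry.
Loss: $\mathcal{L} = \|x - \hat{x}\|_2^2 + \|\text{sg}(z_e) - e\|_2^2 + \beta \|z_e - \text{sg}(e)\|_2^2$
where $\text{sg}$ is the stop-gradient operator and $e$ is the selected codebook entry. Discrete tokenizers of this kind are used in DALL-E (v1), neural audio codecs, and as the backbone for image generation with Transformers.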
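The quantization step and the value of the loss can be sketched without an autodiff framework. The function names and the default `beta=0.25` are assumptions for this illustration. Stop-gradient affects only which parameters receive gradients, not the numeric value, so the two latent terms evaluate to the same squared distance here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, entry) of the codebook vector nearest to z_e."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]

def vq_loss_value(x, x_hat, z_e, e, beta=0.25):
    """Numeric value of the three-term loss (gradient routing not modelled)."""
    recon = squared_distance(x, x_hat)
    codebook_term = squared_distance(z_e, e)     # gradient would reach e only
    commitment_term = squared_distance(z_e, e)   # gradient would reach z_e only
    return recon + codebook_term + beta * commitment_term

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, z_q = quantize([0.9, 0.2], codebook)
loss = vq_loss_value([1.0], [0.5], [0.9, 0.2], z_q)
```

In an automatic differentiation framework, the straight-through pass is commonly written as $z_q = z_e + \text{sg}(e_{k^*} - z_e)$, which has the value $e_{k^*}$ in the forward pass and the identity gradient with respect to $z_e$ in the backward pass.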
Comparison with Other Generative Models
| Property | VAE | GAN | Diffusion |
|---|---|---|---|
| Exact likelihood | No (ELBO) | No | No (ELBO) |
| Sampling speed | Fast | Fast | Slow |
| Sample quality | Moderate (blurry) | High | Highest |
| Latent space | Structured, smooth | Unstructured | Implicit |
| Training stability | Stable | Unstable (mode collapse) | Stable |
| Encoder (inference) | Yes | No | No (or DDIM inversion) |