Latent Variable Models

Latent variable models (LVMs) explain observed data $x$ by positing unobserved (latent) variables $z$ that generate or cause $x$. The joint distribution factorizes as:

$$ p_\theta(x, z) = p_\theta(x \mid z) \cdot p(z) $$

The marginal likelihood $p_\theta(x) = \int p_\theta(x \mid z) p(z) \, dz$ is the quantity of interest for generative modeling, but this integral is typically intractable.
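
To make the integral concrete, the sketch below estimates $\log p_\theta(x)$ by naive Monte Carlo, averaging $p_\theta(x \mid z)$ over samples from the prior. The 1-D linear Gaussian model (parameters `w` and `noise_std`) is an assumption, chosen only because its exact marginal is known and can be compared against.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def mc_log_marginal(x, w, noise_std, n_samples, rng):
    """Estimate log p(x) = log E_{z ~ N(0,1)}[N(x; w z, noise_std^2)]."""
    z = rng.standard_normal(n_samples)               # z_s ~ p(z)
    log_lik = log_normal_pdf(x, w * z, noise_std ** 2)
    top = log_lik.max()                              # log-mean-exp for stability
    return top + np.log(np.mean(np.exp(log_lik - top)))

rng = np.random.default_rng(0)
x, w, noise_std = 1.0, 1.0, 1.0                      # illustrative values
estimate = mc_log_marginal(x, w, noise_std, 200_000, rng)
# For this toy model the marginal is exactly N(0, w^2 + noise_std^2).
exact = log_normal_pdf(x, 0.0, w ** 2 + noise_std ** 2)
print(estimate, exact)
```

In high dimensions few prior samples land where $p_\theta(x \mid z)$ is non-negligible, so the variance of this estimator grows quickly. That is one way to see why the integral is treated as intractable in practice.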

Why Latent Variables?

Compact representation: $z$ captures the essential structure of $x$ in a lower-dimensional space.
Disentanglement: different dimensions of $z$ may correspond to interpretable factors of variation.
Compositionality: complex data distributions can be modeled as mixtures or hierarchies of simpler distributions.
Missing data / semi-supervised learning: latent variables naturally handle unobserved quantities.

Mixture Models

Mixture models are the simplest latent variable models: $z$ is a single categorical variable that selects one of $K$ components:

$$ p(x) = \sum_{k=1}^K p(z=k) \cdot p(x \mid z=k) = \sum_{k=1}^K \pi_k \cdot p_k(x) $$

Gaussian Mixture Model (GMM):

$$ p(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k) $$

Parameters $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ are estimated via the EM algorithm (see below).
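
As an illustration, a GMM log-density can be evaluated with a log-sum-exp over components. This is a minimal NumPy sketch; the function names `gaussian_logpdf` and `gmm_logpdf` are illustrative and do not come from any library.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of N(x; mean, cov) for each row of x, shape (n, d)."""
    d = mean.shape[0]
    diff = x - mean
    chol = np.linalg.cholesky(cov)
    # Solve L y = diff^T so that ||y||^2 is the squared Mahalanobis distance.
    y = np.linalg.solve(chol, diff.T)
    maha = np.sum(y ** 2, axis=0)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def gmm_logpdf(x, weights, means, covs):
    """log p(x) = logsumexp_k [log pi_k + log N(x; mu_k, Sigma_k)]."""
    log_terms = np.stack(
        [np.log(w) + gaussian_logpdf(x, m, c)
         for w, m, c in zip(weights, means, covs)],
        axis=1,
    )                                                # shape (n, K)
    top = log_terms.max(axis=1, keepdims=True)       # stabilize the exponentials
    total = top + np.log(np.exp(log_terms - top).sum(axis=1, keepdims=True))
    return total[:, 0]
```

Working in log space avoids underflow when a point lies far from every component.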

The EM Algorithm

Expectation-Maximization is a general method for MLE in latent variable models. It alternates between:

E-step: compute the expected complete-data log-likelihood under the posterior $p(z \mid x, \theta_\text{old})$:

$$ Q(\theta \mid \theta_\text{old}) = \mathbb{E}_{z \sim p(z \mid x, \theta_\text{old})}[\log p_\theta(x, z)] $$

M-step: maximize $Q$ w.r.t. $\theta$:

$$ \theta_\text{new} = \arg\max_\theta Q(\theta \mid \theta_\text{old}) $$

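Convergence: each EM iteration never decreases the log-likelihood $\log p_\theta(x)$. Under mild regularity conditions the iterates converge to a stationary point, typically a local maximum rather than the global one, so the result depends on initialization.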
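
The monotonicity property follows from a standard decomposition, sketched here for completeness; the symbol $H$ is notation introduced only for this argument. Writing $\log p_\theta(x) = \log p_\theta(x, z) - \log p_\theta(z \mid x)$ and taking the expectation under $p(z \mid x, \theta_\text{old})$ gives:

$$ \log p_\theta(x) = Q(\theta \mid \theta_\text{old}) + H(\theta \mid \theta_\text{old}), \quad H(\theta \mid \theta_\text{old}) = -\mathbb{E}_{z \sim p(z \mid x, \theta_\text{old})}[\log p_\theta(z \mid x)] $$

The difference in the second term is a KL divergence and therefore non-negative:

$$ H(\theta \mid \theta_\text{old}) - H(\theta_\text{old} \mid \theta_\text{old}) = D_\text{KL}(p(z \mid x, \theta_\text{old}) \| p_\theta(z \mid x)) \geq 0 $$

Since the M-step ensures $Q(\theta_\text{new} \mid \theta_\text{old}) \geq Q(\theta_\text{old} \mid \theta_\text{old})$, both terms are non-decreasing:

$$ \log p_{\theta_\text{new}}(x) - \log p_{\theta_\text{old}}(x) \geq Q(\theta_\text{new} \mid \theta_\text{old}) - Q(\theta_\text{old} \mid \theta_\text{old}) \geq 0 $$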

For GMMs:

E-step: compute responsibilities $r_{ik} = p(z_i = k \mid x_i, \theta) \propto \pi_k \mathcal{N}(x_i; \mu_k, \Sigma_k)$.
M-step: update $\pi_k$, $\mu_k$, $\Sigma_k$ using weighted sample statistics.
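
The two steps for a GMM can be sketched in NumPy as follows. Several choices here are assumptions rather than part of the algorithm as stated: means are initialized from random data points, all components start from the overall data covariance, a small ridge `reg` keeps covariances invertible, and the loop runs for a fixed number of iterations instead of testing convergence.

```python
import numpy as np

def em_gmm(x, K, n_iter=50, seed=0, reg=1e-6):
    """Fit a K-component GMM to x of shape (n, d); returns parameters and log-likelihoods."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(n, size=K, replace=False)]
    sigma = np.stack([np.cov(x.T).reshape(d, d) + reg * np.eye(d)] * K)
    log_liks = []
    for _ in range(n_iter):
        # E-step: log of pi_k * N(x_i; mu_k, Sigma_k), then normalize over k.
        log_r = np.empty((n, K))
        for k in range(K):
            diff = x - mu[k]
            sol = np.linalg.solve(sigma[k], diff.T).T
            maha = np.sum(diff * sol, axis=1)
            _, log_det = np.linalg.slogdet(sigma[k])
            log_r[:, k] = np.log(pi[k]) - 0.5 * (d * np.log(2 * np.pi) + log_det + maha)
        top = log_r.max(axis=1, keepdims=True)
        log_norm = top + np.log(np.exp(log_r - top).sum(axis=1, keepdims=True))
        log_liks.append(log_norm.sum())              # log-likelihood at current parameters
        r = np.exp(log_r - log_norm)                 # responsibilities r_ik
        # M-step: responsibility-weighted sample statistics.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ x) / nk[:, None]
        for k in range(K):
            diff = x - mu[k]
            sigma[k] = (r[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return pi, mu, sigma, np.array(log_liks)
```

Calling `em_gmm(x, K=2)` on an `(n, d)` array returns the mixing weights, means, covariances, and the log-likelihood trace, which should be non-decreasing up to the tiny perturbation introduced by `reg`.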

Factor Analysis

Continuous latent variables; linear Gaussian model:

$$ z \sim \mathcal{N}(0, I), \quad x = Wz + \mu + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \Psi) $$

where $W \in \mathbb{R}^{d \times k}$ ($k \ll d$) is the factor loading matrix and $\Psi$ is diagonal.

Marginal: $p(x) = \mathcal{N}(x; \mu, WW^T + \Psi)$.
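
The marginal covariance can be checked empirically by sampling from the generative model. In this sketch the loading matrix, mean, and noise variances are arbitrary illustrative values.

```python
import numpy as np

def sample_factor_analysis(n, W, mu, psi_diag, rng):
    """Draw n samples of x = W z + mu + eps with z ~ N(0, I), eps ~ N(0, diag(psi_diag))."""
    d, k = W.shape
    z = rng.standard_normal((n, k))
    eps = rng.standard_normal((n, d)) * np.sqrt(psi_diag)
    return z @ W.T + mu + eps

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])   # d = 3, k = 2 (illustrative)
mu = np.array([1.0, -1.0, 0.0])
psi_diag = np.array([0.1, 0.2, 0.3])

x = sample_factor_analysis(200_000, W, mu, psi_diag, rng)
model_cov = W @ W.T + np.diag(psi_diag)              # W W^T + Psi
sample_cov = np.cov(x.T)
print(np.abs(sample_cov - model_cov).max())          # small for large n
```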

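PCA as special case: with isotropic noise, $\Psi = \sigma^2 I$, the model becomes probabilistic PCA, whose maximum-likelihood $W$ spans the principal subspace. The limit $\sigma^2 \to 0$ recovers standard PCA.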

Probabilistic PCA

Explicit probabilistic model for PCA. Marginal and posterior have closed forms:

$$ p(z \mid x) = \mathcal{N}(z; M^{-1}W^T(x-\mu), \sigma^2 M^{-1}) $$

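where $M = W^T W + \sigma^2 I$. The posterior mean (which is also the MAP estimate of $z$, since the posterior is Gaussian) is a shrunken projection of $x - \mu$ onto the column space of $W$; as $\sigma^2 \to 0$ it becomes the orthogonal projection $(W^T W)^{-1} W^T (x - \mu)$, which corresponds to the PCA projection.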

Provides principled handling of missing data and model selection via marginal likelihood.
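
For illustration, the sketch below fits probabilistic PCA using the closed-form maximum-likelihood solution of Tipping and Bishop (an addition not stated above: $W$ is built from the top-$k$ eigenvectors of the sample covariance and $\sigma^2$ is the mean of the discarded eigenvalues), then evaluates the posterior from the formula above. The function names are illustrative, the rotation ambiguity in $W$ is resolved by taking the identity rotation, and `k` must be smaller than the data dimension.

```python
import numpy as np

def fit_ppca(x, k):
    """Closed-form ML estimates (W, mu, sigma2) for PPCA with k < d latent dimensions."""
    mu = x.mean(axis=0)
    cov = np.cov(x.T, bias=True)                     # ML (biased) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[k:].mean()                      # average discarded variance
    scale = np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    W = eigvecs[:, :k] * scale                       # U_k (Lambda_k - sigma2 I)^{1/2}
    return W, mu, sigma2

def ppca_posterior(x, W, mu, sigma2):
    """Posterior p(z | x) = N(M^{-1} W^T (x - mu), sigma2 M^{-1}), one mean per row of x."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    mean = np.linalg.solve(M, W.T @ (x - mu).T).T
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov
```

The posterior covariance is the same for every data point; only the mean depends on $x$.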

Variational Inference

When the posterior $p(z \mid x, \theta)$ is intractable (nonlinear decoder, deep networks), use a variational approximation $q_\phi(z \mid x) \approx p_\theta(z \mid x)$.

Variational lower bound (ELBO):

$$ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z)) $$

This is the objective optimized by VAEs. See Variational Autoencoders.
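
The gap in the bound equals $D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))$, so the ELBO is tight exactly when $q_\phi$ matches the true posterior. The sketch below checks this numerically on a 1-D linear Gaussian model, an assumption chosen because every term has a closed form; the parameters `w`, `s2`, `m`, and `v` are illustrative.

```python
import numpy as np

def log_marginal(x, w, s2):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(w z, s2), so x ~ N(0, w^2 + s2)."""
    var = w ** 2 + s2
    return -0.5 * (np.log(2.0 * np.pi * var) + x ** 2 / var)

def elbo(x, w, s2, m, v):
    """Closed-form ELBO for q(z | x) = N(m, v) in the same model."""
    expected_log_lik = (-0.5 * np.log(2.0 * np.pi * s2)
                        - ((x - w * m) ** 2 + w ** 2 * v) / (2.0 * s2))
    kl_to_prior = 0.5 * (v + m ** 2 - 1.0 - np.log(v))
    return expected_log_lik - kl_to_prior

def true_posterior(x, w, s2):
    """Mean and variance of the exact Gaussian posterior p(z | x)."""
    v = 1.0 / (1.0 + w ** 2 / s2)
    m = v * w * x / s2
    return m, v

x, w, s2 = 2.0, 1.0, 1.0
m_star, v_star = true_posterior(x, w, s2)
print(log_marginal(x, w, s2))             # exact log-evidence
print(elbo(x, w, s2, m_star, v_star))     # equal: the bound is tight at the true posterior
print(elbo(x, w, s2, 0.0, 1.0))           # strictly smaller for a mismatched q
```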

Hierarchical Latent Variable Models

Stack multiple layers of latent variables:

$$ p_\theta(x) = \int p_\theta(x \mid z_1) p_\theta(z_1 \mid z_2) \cdots p(z_L) \, dz_{1:L} $$
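
Sampling from such a model proceeds top-down (ancestral sampling). The sketch below uses two layers with linear Gaussian conditionals, an assumption made only so that the marginal covariance of $x$ can be checked in closed form; practical hierarchical models use nonlinear conditionals. The matrices `A` and `B` and the noise scales are illustrative.

```python
import numpy as np

def sample_two_layer(n, A, B, s1, s0, rng):
    """z2 ~ N(0, I), z1 | z2 ~ N(A z2, s1^2 I), x | z1 ~ N(B z1, s0^2 I)."""
    z2 = rng.standard_normal((n, A.shape[1]))                      # top-level latent
    z1 = z2 @ A.T + s1 * rng.standard_normal((n, A.shape[0]))      # p(z1 | z2)
    x = z1 @ B.T + s0 * rng.standard_normal((n, B.shape[0]))       # p(x | z1)
    return x, z1, z2

rng = np.random.default_rng(0)
A = np.array([[1.0], [-0.5]])                         # maps z2 (dim 1) to z1 (dim 2)
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # maps z1 (dim 2) to x (dim 3)
s1, s0 = 0.5, 0.3

x, z1, z2 = sample_two_layer(200_000, A, B, s1, s0, rng)
cov_z1 = A @ A.T + s1 ** 2 * np.eye(2)
model_cov = B @ cov_z1 @ B.T + s0 ** 2 * np.eye(3)
sample_cov = np.cov(x.T)
print(np.abs(sample_cov - model_cov).max())           # small for large n
```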

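Motivation: with a single latent layer, a factorized Gaussian approximate posterior $q_\phi(z \mid x)$ cannot represent complex, multi-modal true posteriors. A hierarchy yields more expressive priors and approximate posteriors, with latents at different levels of abstraction.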

Inference network: a hierarchical encoder $q_\phi(z_{1:L} \mid x)$, factorized here in bottom-up order as:

$$ q_\phi(z_{1:L} \mid x) = q_\phi(z_1 \mid x) \prod_{l=2}^L q_\phi(z_l \mid z_{l-1}, x) $$

NVAE / VDVAE: deep hierarchical VAEs with 30+ latent groups that achieve high-quality image generation.

Discrete Latent Variables

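Sampling a categorical (or other discrete) latent variable is not differentiable with respect to the distribution's parameters, so the reparameterization trick does not apply directly. Workarounds: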

REINFORCE / score function estimator:

$$ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}[f(z) \nabla_\phi \log q_\phi(z)] $$

Unbiased, but high variance; in practice it requires many samples or control variates (baselines).
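
As a small worked case, the estimator can be compared against an exact gradient for a Bernoulli latent. Everything specific here is an assumption made for illustration: the logit parameterization $q_\phi(z=1) = \sigma(\phi)$, the test function `f`, and the use of the sample mean of $f$ as a baseline (which adds a negligible bias at this sample size).

```python
import numpy as np

def f(z):
    """Arbitrary test function of the discrete latent (illustrative)."""
    return (z - 0.3) ** 2

def score_function_grad(phi, n_samples, rng, use_baseline=False):
    """Estimate d/dphi E_q[f(z)]; returns the estimate and the per-sample variance."""
    p = 1.0 / (1.0 + np.exp(-phi))
    z = (rng.random(n_samples) < p).astype(float)    # z ~ Bernoulli(p)
    score = z - p                                    # d/dphi log q_phi(z) for a logit parameter
    fz = f(z)
    baseline = fz.mean() if use_baseline else 0.0    # simple control variate
    per_sample = (fz - baseline) * score
    return per_sample.mean(), per_sample.var()

def exact_grad(phi):
    """Analytic gradient: E_q[f] = p f(1) + (1 - p) f(0), with dp/dphi = p (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-phi))
    return p * (1.0 - p) * (f(1.0) - f(0.0))

rng = np.random.default_rng(0)
print(exact_grad(0.5))
print(score_function_grad(0.5, 200_000, rng))
print(score_function_grad(0.5, 200_000, rng, use_baseline=True))
```

Both estimates agree with the analytic gradient, but the per-sample variance is noticeably lower with the baseline.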

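Straight-through estimator: use the discrete $z$ in the forward pass; in the backward pass, treat the discretization as the identity (or a smooth surrogate) so that gradients flow through it. The resulting gradient is biased but low-variance. Used in VQ-VAE.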
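
A minimal sketch of the idea for nearest-codebook quantization follows. NumPy has no automatic differentiation, so the backward pass is written by hand; the function names are hypothetical, and the codebook and commitment losses used by VQ-VAE are omitted.

```python
import numpy as np

def vq_forward(z_e, codebook):
    """Forward pass: replace each encoder output with its nearest codebook vector."""
    # Squared distances between every z_e row and every code, shape (n, K).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def vq_backward_straight_through(grad_z_q):
    """Backward pass: treat d z_q / d z_e as the identity and copy the gradient."""
    return grad_z_q.copy()

z_e = np.array([[0.1, 0.0], [0.9, 1.2]])             # encoder outputs (illustrative)
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_q, idx = vq_forward(z_e, codebook)
grad_z_q = np.array([[0.5, -0.5], [1.0, 2.0]])       # pretend gradient from the decoder
grad_z_e = vq_backward_straight_through(grad_z_q)
```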

Gumbel-Softmax (concrete distribution):

$$ z_k = \frac{\exp((\log \pi_k + g_k) / \tau)}{\sum_j \exp((\log \pi_j + g_j) / \tau)}, \quad g_k \sim \text{Gumbel}(0,1) $$

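Temperature: as $\tau \to 0$, samples approach discrete one-hot vectors but gradients become high-variance; larger $\tau$ gives smoother, lower-variance but more biased relaxations. For any $\tau > 0$ the sample is differentiable in $\pi$. A common practice is to anneal $\tau$ from a high to a low value during training.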
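
The sampling formula translates directly to NumPy. In this sketch the function name and the example probabilities are illustrative; the argmax of each relaxed sample is an exact categorical draw (the Gumbel-max trick), which gives a simple sanity check.

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau, rng):
    """Relaxed one-hot samples; log_pi has shape (..., K), tau > 0."""
    g = rng.gumbel(loc=0.0, scale=1.0, size=log_pi.shape)   # g_k ~ Gumbel(0, 1)
    y = (log_pi + g) / tau
    y = y - y.max(axis=-1, keepdims=True)                   # stabilize the softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.5])
log_pi = np.broadcast_to(np.log(pi), (100_000, 3))

soft = gumbel_softmax_sample(log_pi, tau=0.5, rng=rng)      # soft samples on the simplex
sharp = gumbel_softmax_sample(log_pi, tau=0.05, rng=rng)    # closer to one-hot
freq = np.bincount(soft.argmax(axis=1), minlength=3) / soft.shape[0]
print(freq)                                                  # close to pi
print(soft.max(axis=1).mean(), sharp.max(axis=1).mean())
```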

Topic Models

LVMs for discrete text data.

Latent Dirichlet Allocation (LDA): each document is a mixture of topics; each topic is a distribution over words. Each word $w$ in a document is drawn from:

$$ p(w \mid \theta, \phi) = \sum_{k=1}^K \theta_k \, \phi_k(w) $$

where $\theta \sim \text{Dirichlet}(\alpha)$ (document-topic proportions) and $\phi_k \sim \text{Dirichlet}(\beta)$ (topic-word distributions). Inference via variational EM or collapsed Gibbs sampling.
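
The generative process can be sketched as follows. Symmetric Dirichlet priors with scalar `alpha` and `beta`, a fixed document length, and the function name are assumptions made for illustration; this samples from the model and does not perform inference.

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, K, V, alpha, beta, rng):
    """Sample word ids from LDA with K topics and a vocabulary of size V."""
    phi = rng.dirichlet(np.full(V, beta), size=K)            # topic-word distributions, (K, V)
    theta = rng.dirichlet(np.full(K, alpha), size=n_docs)    # document-topic proportions, (n_docs, K)
    docs = np.empty((n_docs, doc_len), dtype=int)
    for d in range(n_docs):
        topics = rng.choice(K, size=doc_len, p=theta[d])     # topic assignment per word
        for i, k in enumerate(topics):
            docs[d, i] = rng.choice(V, p=phi[k])             # word drawn from its topic
    return docs, theta, phi

rng = np.random.default_rng(0)
docs, theta, phi = sample_lda_corpus(5, 20, K=3, V=10, alpha=0.5, beta=0.5, rng=rng)
word_probs = theta @ phi     # row d is p(w | theta_d, phi), the mixture over topics
```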

Disentangled Representations

A latent space is disentangled if individual dimensions of $z$ correspond to independent, interpretable factors of variation.

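$\beta$-VAE: weights the KL term of the ELBO by $\beta > 1$, pushing the approximate posterior toward the factorized prior and encouraging more independent latent dimensions, typically at some cost in reconstruction quality.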
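
A minimal sketch of the per-example objective, assuming a diagonal Gaussian encoder and a standard normal prior so that the KL term has a closed form. The function names and the default `beta=4.0` are illustrative, and `recon_log_lik` stands for $\log p_\theta(x \mid z)$ evaluated at a sampled $z$.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(recon_log_lik, mu, log_var, beta=4.0):
    """Negative beta-weighted ELBO per example; beta = 1 recovers the standard VAE loss."""
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return -(recon_log_lik - beta * kl)
```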

TC-VAE: explicitly penalizes the total correlation $D_\text{KL}(q(z) \| \prod_j q(z_j))$ of the aggregate posterior $q(z)$, which measures dependence between latent dimensions.
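
To build intuition for total correlation, it helps to evaluate it where a closed form exists. For a Gaussian $q(z) = \mathcal{N}(0, \Sigma)$ the total correlation is $\tfrac{1}{2}\left(\sum_j \log \Sigma_{jj} - \log\det\Sigma\right)$. The Gaussian assumption is for illustration only: in practice $q(z)$ is an aggregate over data points and the term must be estimated from minibatches.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation KL(q(z) || prod_j q(z_j)) for q(z) = N(0, cov)."""
    _, log_det = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - log_det)

independent = np.diag([1.0, 2.0, 0.5])
correlated = np.array([[1.0, 0.8], [0.8, 1.0]])
print(gaussian_total_correlation(independent))   # zero up to rounding: dimensions are independent
print(gaussian_total_correlation(correlated))    # -0.5 * log(1 - 0.8**2), positive
```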

Metrics: Mutual Information Gap (MIG), DCI disentanglement score, SAP score.

Disentanglement is useful for controllable generation and fairness (sensitive attributes confined to specific $z_j$).