Generative Modeling Overview

What is Generative Modeling?

A generative model learns the underlying data distribution $p(x)$ (or joint $p(x, y)$) so it can produce new samples that look like they came from that distribution. This contrasts with discriminative models, which only learn $p(y \mid x)$.

Formal objective: given samples $\{x_i\}_{i=1}^n \sim p_\text{data}(x)$, learn a model $p_\theta(x)$ such that $p_\theta \approx p_\text{data}$.
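As a minimal illustration of this objective, the sketch below fits a one-dimensional Gaussian $p_\theta$ to samples by maximum likelihood and then draws new samples from the fitted model. The Gaussian family, the stand-in data distribution (mean 2.0, standard deviation 0.5), and the helper name `fit_gaussian` are assumptions chosen for the example, not part of the definition above.

```python
import math
import random

def fit_gaussian(samples):
    """Maximum-likelihood estimate of mean and std for a 1-D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(var)

rng = random.Random(0)
# Stand-in for p_data (assumed for the demo): Gaussian with mean 2.0, std 0.5.
data = [rng.gauss(2.0, 0.5) for _ in range(5000)]

# "Learning" here is just estimating theta = (mu, sigma) from the samples.
mu_hat, sigma_hat = fit_gaussian(data)

# Generation: draw new samples from the learned model p_theta.
new_samples = [rng.gauss(mu_hat, sigma_hat) for _ in range(5)]
print(mu_hat, sigma_hat, new_samples)
```

Real generative models replace the two-parameter Gaussian with a neural network, but the pattern of fitting $\theta$ to samples and then sampling from $p_\theta$ is the same.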

Why Generative Models Matter

Data synthesis: generate training data for downstream tasks, augment rare classes.
Representation learning: learn compact latent spaces that capture semantic structure.
Density estimation: assign likelihoods to inputs, which enables anomaly detection and compression.
Conditional generation: generate $x$ given conditioning signal $c$ (text-to-image, speech synthesis, code generation).
Understanding: modeling $p(x)$ requires understanding the data, not just discriminating classes.

The Core Challenge

$p_\text{data}(x)$ is unknown and lives in a very high-dimensional space (e.g., a $256 \times 256$ RGB image is a point in $\mathbb{R}^{196608}$). The model must generalize from finite samples to a continuous distribution.

Curse of dimensionality: naive approaches (histograms, KDE) become hopelessly sparse in high dimensions. All modern generative models use some form of structured inductive bias to make this tractable.
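The sparsity of histograms can be made concrete with a small experiment. The sketch below, which assumes uniform data on the unit cube and 10 bins per axis (both illustrative choices), counts what fraction of histogram cells receive at least one of 10,000 samples as the dimension grows.

```python
import random

def occupied_fraction(n_samples, dim, bins_per_axis=10, seed=0):
    """Fraction of histogram cells that contain at least one uniform sample."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_samples):
        # Index of the cell this sample falls into, one coordinate per axis.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(dim))
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** dim

for d in (1, 2, 4, 6):
    print(d, occupied_fraction(10_000, d))
```

With 6 dimensions there are $10^6$ cells, so 10,000 samples can fill at most 1% of them. An image with 196,608 dimensions is far beyond what any dataset could cover this way.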

Taxonomy of Generative Models

Family	Tractable likelihood	Sampling speed	Key mechanism
Autoregressive	Yes (exact)	Slow (sequential)	Chain rule factorization
VAE	Approximate (ELBO)	Fast	Amortized variational inference
GAN	No	Fast	Adversarial training
Normalizing Flow	Yes (exact)	Fast	Invertible transformations
Diffusion	Approximate (ELBO)	Slow (iterative)	Denoising score matching
Energy-based	Unnormalized	Slow (MCMC)	Learned energy function
Score-based	No explicit	Slow (SDE/ODE)	Score function estimation

Evaluating Generative Models

Evaluation is harder than for discriminative models since there is no single ground-truth label.

Sample Quality

FID (Fréchet Inception Distance): compares statistics of real and generated images in a feature space (InceptionV3 activations):

$$ \text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right) $$

Lower is better. FID is the most widely used metric for images.
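The FID formula can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: the inputs are already-extracted feature vectors (random Gaussian arrays stand in for InceptionV3 activations here), and the function name `fid_from_features` is hypothetical. The trace of the matrix square root is computed from the eigenvalues of $\Sigma_r \Sigma_g$ rather than by forming the square root explicitly.

```python
import numpy as np

def fid_from_features(feats_real, feats_gen):
    """FID between two sets of feature vectors, each of shape (n, d)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative for
    # positive semi-definite covariances (up to numerical error).
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))     # stand-in "real" features
same = rng.normal(0.0, 1.0, size=(2000, 8))     # same distribution as real
shifted = rng.normal(0.5, 1.0, size=(2000, 8))  # mean shifted by 0.5 per dim

fid_same = fid_from_features(real, same)
fid_shifted = fid_from_features(real, shifted)
print(fid_same, fid_shifted)
```

Two sample sets from the same distribution give a small but nonzero value because of finite-sample noise, which is one reason reported FID scores depend on the number of samples used.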

IS (Inception Score): measures sharpness (low $H(y \mid x)$) and diversity (high $H(y)$):

$$ \text{IS} = \exp(\mathbb{E}_x[D_\text{KL}(p(y \mid x) \| p(y))]) $$

Higher is better, but IS never compares generated samples against real data.
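A minimal sketch of the IS computation, assuming the classifier posteriors $p(y \mid x_i)$ are already available as rows of an array. The function name `inception_score` and the two toy inputs are illustrative; in practice the rows would come from an InceptionV3 classifier applied to generated images.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (n, k) array whose rows are class posteriors p(y | x_i)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the samples
    # Per-sample KL divergence between p(y | x_i) and p(y).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse: each sample is assigned to a different class.
sharp_diverse = np.eye(4)
# Uncertain: every sample gets the uniform posterior.
blurry = np.full((4, 4), 0.25)

is_sharp = inception_score(sharp_diverse)
is_blurry = inception_score(blurry)
print(is_sharp, is_blurry)
```

The score is bounded between 1 (posteriors identical to the marginal) and the number of classes (confident predictions spread evenly over all classes), so the first toy input reaches the maximum of 4 and the second the minimum of 1, up to the small `eps` correction.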

Precision and Recall: precision measures sample quality (fraction of generated samples that are realistic); recall measures coverage (fraction of real data modes that are captured).

Likelihood

For models that compute exact or approximate log-likelihood, report bits per dimension (BPD):

$$ \text{BPD} = -\frac{\log_2 p_\theta(x)}{d} $$

where $d$ is the dimensionality. Lower is better.
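Training code usually reports negative log-likelihood in nats (natural log), so the conversion to BPD involves a factor of $\ln 2$. The sketch below shows the conversion and checks it on a uniform model over 256 intensity levels per dimension, which should cost exactly 8 bits per dimension. The helper name `bits_per_dim` and the $3 \times 32 \times 32$ image size are assumptions for illustration.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

d = 3 * 32 * 32  # assumed image size: 3 channels, 32 x 32 pixels

# A uniform model assigns probability 1/256 to each dimension independently,
# so its total negative log-likelihood is d * ln(256) nats.
uniform_nll = d * math.log(256)
bpd = bits_per_dim(uniform_nll, d)
print(bpd)
```

The 8 BPD of the uniform model is a useful sanity baseline for 8-bit images: a trained model should score well below it.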

Caveat: high likelihood does not guarantee good visual quality, and vice versa. Pixel-level likelihoods weight all dimensions equally; perceptual quality depends on high-level structure.

Latent Variable Models

Many generative models introduce a latent variable $z$:

$$ p_\theta(x) = \int p_\theta(x|z) p(z) \, dz $$

The prior $p(z)$ is typically $\mathcal{N}(0, I)$. Sampling: draw $z \sim p(z)$, then $x \sim p_\theta(x|z)$.

The integral is generally intractable, motivating approximate inference (VAEs) or implicit models (GANs).
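To make the sampling procedure and the marginal integral concrete, the sketch below uses a toy linear-Gaussian model, $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(a z, \sigma^2)$, for which the marginal happens to be known in closed form: $\mathcal{N}(0, a^2 + \sigma^2)$. The model, the constants `A` and `SIGMA`, and the function names are assumptions chosen so that a naive Monte Carlo estimate of $p(x)$ can be checked against the exact value.

```python
import numpy as np

A, SIGMA = 2.0, 0.5  # assumed toy decoder: x | z ~ N(A * z, SIGMA^2)

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def sample_x(n, rng):
    """Ancestral sampling: draw z ~ N(0, 1), then x ~ p(x | z)."""
    z = rng.standard_normal(n)
    return A * z + SIGMA * rng.standard_normal(n)

def marginal_mc(x, n_latents, rng):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    z = rng.standard_normal(n_latents)
    return float(normal_pdf(x, A * z, SIGMA).mean())

rng = np.random.default_rng(0)
xs = sample_x(10_000, rng)
estimate = marginal_mc(1.0, 200_000, rng)
exact = float(normal_pdf(1.0, 0.0, np.sqrt(A**2 + SIGMA**2)))
print(xs.var(), estimate, exact)
```

Averaging over prior samples works here because $z$ is one-dimensional. With a high-dimensional latent and a neural decoder, almost no prior samples explain a given $x$, so the estimate needs impractically many draws.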

Conditional Generation

All generative model families can be conditioned on a signal $c$:

$$ p_\theta(x|c) \quad \text{or} \quad p_\theta(x|z, c) $$

Conditioning mechanisms: concatenate $c$ to input, use cross-attention (Transformers), use classifier-free guidance (diffusion models), or condition via adaptive normalization (AdaIN, FiLM).
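Of these mechanisms, FiLM-style modulation is the simplest to write down: the conditioning signal predicts a per-channel scale and shift that are applied to the features. The sketch below is a bare NumPy illustration. The linear projections `w_gamma` and `w_beta` are random placeholders standing in for learned weights, and all shapes are assumed for the example.

```python
import numpy as np

def film(h, c, w_gamma, w_beta):
    """FiLM: scale and shift features h with parameters predicted from c."""
    gamma = c @ w_gamma  # per-channel scale, shape (batch, channels)
    beta = c @ w_beta    # per-channel shift, shape (batch, channels)
    return gamma * h + beta

rng = np.random.default_rng(0)
batch, cond_dim, channels = 4, 3, 8
h = rng.standard_normal((batch, channels))   # intermediate features
c = rng.standard_normal((batch, cond_dim))   # conditioning vectors
# Placeholder projection weights; in a real model these are learned.
w_gamma = rng.standard_normal((cond_dim, channels))
w_beta = rng.standard_normal((cond_dim, channels))

out = film(h, c, w_gamma, w_beta)
print(out.shape)
```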

Classifier-free guidance (Ho & Salimans 2021): combine the conditional and unconditional scores, extrapolating past the conditional score in the direction away from the unconditional one:

$$ \tilde{\nabla}_x \log p(x|c) = (1 + w)\nabla_x \log p_\theta(x|c) - w \nabla_x \log p_\theta(x) $$

Guidance scale $w > 0$ trades diversity for fidelity. Used in DALL-E 2, Stable Diffusion, Imagen.
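The guidance rule above is a one-line combination of two model outputs. The sketch below applies it to hand-picked toy vectors; in a real sampler, `score_cond` and `score_uncond` would come from the same network evaluated with and without the conditioning signal. The function name `cfg_score` and the numbers are illustrative assumptions.

```python
import numpy as np

def cfg_score(score_cond, score_uncond, w):
    """Combine conditional and unconditional scores with guidance scale w."""
    return (1.0 + w) * score_cond - w * score_uncond

score_cond = np.array([1.0, 0.0])    # toy conditional score
score_uncond = np.array([0.5, 0.5])  # toy unconditional score

guided = cfg_score(score_cond, score_uncond, w=2.0)
print(guided)  # [ 2. -1.]
```

Setting $w = 0$ recovers the plain conditional score. Larger $w$ pushes the result further along the direction from the unconditional score to the conditional one.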

Common Architectures Used as Backbones

Architecture	Typical Use
U-Net	Diffusion model denoiser (image)
Transformer (decoder)	Autoregressive models (language, image)
Transformer (encoder-decoder)	Conditional generation (text-to-image)
ResNet / CNN	GAN generator and discriminator
MLP	Normalizing flows (coupling layers)