Generative Adversarial Networks
Generative Adversarial Networks (GANs) frame generative modeling as a two-player minimax game between a generator $G$ and a discriminator $D$. The generator tries to fool the discriminator; the discriminator tries to distinguish real from generated samples.
The Minimax Objective
\[\min_G \max_D \; V(G, D) = \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]\]
- $G_\theta: z \to x$; maps noise $z \sim \mathcal{N}(0,I)$ to a sample in data space.
- $D_\phi: x \to [0,1]$; estimates the probability that $x$ is real.
Optimal discriminator (for fixed $G$):
\[D^*(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_G(x)}\]
Optimal generator: at the global optimum, $p_G = p_\text{data}$ and $D^*(x) = 1/2$ everywhere.
Minimax connection: at the optimal discriminator, the value function equals $V(G, D^*) = 2 \cdot \text{JSD}(p_\text{data} \,\|\, p_G) - \log 4$, where JSD is the Jensen-Shannon divergence, so minimizing over $G$ minimizes the JSD between $p_\text{data}$ and $p_G$.
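The identity above can be checked numerically on a small discrete example. The sketch below is only an illustration: the two three-outcome distributions are made-up values, and the helper names (`optimal_discriminator`, `value_fn`, `jsd`) are hypothetical, not part of any library.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_fn(p_data, p_g, d):
    """V(G, D) = E_data[log D(x)] + E_G[log(1 - D(x))] on a finite support."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d) if pg > 0)
    return real_term + fake_term

def kl(p, q):
    """KL(p || q), skipping outcomes where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up distributions over three outcomes (assumption for illustration).
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

d_star = optimal_discriminator(p_data, p_g)
v_star = value_fn(p_data, p_g, d_star)
identity_rhs = 2 * jsd(p_data, p_g) - math.log(4)
print(v_star, identity_rhs)  # the two values should agree up to rounding
```

Setting `p_g` equal to `p_data` makes every entry of `d_star` equal to 1/2 and the value equal to $-\log 4$, matching the global optimum described above.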
Training in Practice
Alternate between:
- Discriminator step: sample real batch $\{x_i\}$ and fake batch $\{G(z_i)\}$; maximize $V$ w.r.t. $\phi$.
- Generator step: sample $\{z_i\}$; minimize $V$ w.r.t. $\theta$.
Non-saturating generator loss: instead of minimizing $\log(1 - D(G(z)))$, which saturates when the discriminator confidently rejects generated samples, maximize $\log D(G(z))$. Same fixed point but stronger gradients early in training.
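The alternating scheme can be sketched on a one-dimensional toy problem with hand-derived gradients. Everything here is an assumption made for illustration: the data distribution $\mathcal{N}(3, 1)$, the affine generator $G(z) = az + b$, the logistic discriminator $D(x) = \sigma(wx + c)$, and the hyperparameters. Real implementations use neural networks and automatic differentiation.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_round(p, rng, batch=64, lr=0.05, data_mean=3.0):
    """One discriminator step followed by one non-saturating generator step."""
    # Discriminator step: gradient ascent on V w.r.t. phi = (w, c).
    real = [rng.gauss(data_mean, 1.0) for _ in range(batch)]
    fake = [p["a"] * rng.gauss(0.0, 1.0) + p["b"] for _ in range(batch)]
    d_real = [sigmoid(p["w"] * x + p["c"]) for x in real]
    d_fake = [sigmoid(p["w"] * x + p["c"]) for x in fake]
    grad_w = (sum((1 - d) * x for d, x in zip(d_real, real))
              - sum(d * x for d, x in zip(d_fake, fake))) / batch
    grad_c = (sum(1 - d for d in d_real) - sum(d_fake)) / batch
    p["w"] += lr * grad_w
    p["c"] += lr * grad_c

    # Generator step: gradient ascent on E[log D(G(z))] w.r.t. theta = (a, b).
    noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
    d_gen = [sigmoid(p["w"] * (p["a"] * z + p["b"]) + p["c"]) for z in noise]
    grad_a = sum((1 - d) * p["w"] * z for d, z in zip(d_gen, noise)) / batch
    grad_b = sum((1 - d) * p["w"] for d in d_gen) / batch
    p["a"] += lr * grad_a
    p["b"] += lr * grad_b
    return p

rng = random.Random(0)
params = {"a": 1.0, "b": 0.0, "w": 0.0, "c": 0.0}
for _ in range(200):
    train_round(params, rng)
print(params)
```

The sketch takes one discriminator step per generator step; other ratios are common, and nothing here guarantees convergence.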
Training Instabilities
Mode Collapse
Generator produces a limited subset of the data distribution, ignoring most modes. Discriminator cannot detect this: it scores each sample individually against the current generator output, so a lack of diversity across samples goes unpenalized.
Mitigations: minibatch discrimination, unrolled GANs, diverse training objectives.
Vanishing Gradients
If the discriminator is too good early on, $D(G(z)) \approx 0$ and $\nabla_\theta \log(1-D(G(z))) \approx 0$. Generator receives no useful signal.
Mitigations: non-saturating loss, careful learning rate balancing, Wasserstein loss.
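To see why the non-saturating loss helps, compare the gradients of both generator objectives with respect to the discriminator's pre-sigmoid logit. This is a minimal sketch assuming $D = \sigma(\text{logit})$; the function names are hypothetical.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)): equals -D(G(z))."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of log(sigmoid(logit)): equals 1 - D(G(z))."""
    return 1.0 - sigmoid(logit)

# A confident discriminator assigns a very negative logit to a generated sample.
for logit in (-8.0, -2.0, 0.0):
    print(logit, sigmoid(logit), saturating_grad(logit), non_saturating_grad(logit))
```

When $D(G(z)) \approx 0$ the saturating gradient has magnitude close to 0, while the non-saturating gradient is close to 1, so the generator still receives a usable signal.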
Training Oscillation
Generator and discriminator do not converge to a fixed point but cycle around it.
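A common toy stand-in for this behavior (an assumption here, not a GAN itself) is the bilinear game $\min_x \max_y \, xy$, whose only fixed point is $(0, 0)$. Under simultaneous gradient descent-ascent the iterates rotate around the fixed point and drift outward, since each step multiplies the distance from the origin by $\sqrt{1 + \eta^2}$ for step size $\eta$.

```python
import math

def simultaneous_gda(x, y, lr, steps):
    """Simultaneous gradient descent on x and ascent on y for V(x, y) = x * y."""
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(math.hypot(x, y))
    return radii

radii = simultaneous_gda(x=1.0, y=0.0, lr=0.1, steps=100)
print(radii[0], radii[-1])  # distance from the fixed point at start and end
```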
Wasserstein GAN (WGAN)
Replaces JSD with Earth Mover’s (Wasserstein-1) distance, which is smoother and provides meaningful gradients even when the distributions have disjoint support:
\[W(p, q) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)]\]
The critic $f$ (unbounded, replaces $D$) must satisfy the 1-Lipschitz constraint.
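The disjoint-support claim can be illustrated in one dimension, where $W_1$ between two equal-size empirical samples reduces to the mean absolute gap between sorted values. The sketch below is an assumption-laden toy: point masses at 0 and at a shifted location, with hypothetical helper names. It computes the distance directly rather than through a learned critic.

```python
import math

def wasserstein1_1d(xs, ys):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between discrete distributions."""
    def kl(u, v):
        return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = [0.0] * 4
results = []
for shift in (2.0, 1.0, 0.5):
    fake = [x + shift for x in real]
    w1 = wasserstein1_1d(real, fake)
    # Point masses at 0 and at `shift` never overlap: p = [1, 0], q = [0, 1].
    divergence = jsd([1.0, 0.0], [0.0, 1.0])
    results.append((shift, w1, divergence))
    print(shift, w1, divergence)
```

As the generated samples move closer to the real ones, $W_1$ shrinks with the shift while the JSD stays at $\log 2$, which is why the latter gives no gradient in this regime.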
WGAN objective:
\[\min_G \max_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\text{data}}[f(x)] - \mathbb{E}_{z}[f(G(z))]\]
Gradient penalty (WGAN-GP): enforce the Lipschitz constraint by penalizing deviations of the gradient norm $\|\nabla_{\hat{x}} f(\hat{x})\|_2$ from 1, where $\hat{x}$ is sampled along straight lines between real and generated samples:
\[\mathcal{L}_\text{GP} = \lambda \, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2]\]
More stable training than weight clipping.
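A minimal sketch of the penalty term follows, assuming a toy critic $f(x) = \tanh(w \cdot x + b)$ whose input gradient is available in closed form. The critic, the helper names, and the default $\lambda = 10$ are illustrative assumptions. With a neural critic the gradient would come from automatic differentiation instead.

```python
import math
import random

def critic_grad(x, w, b):
    """Analytic input gradient of the toy critic f(x) = tanh(w . x + b)."""
    t = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(1.0 - t * t) * wi for wi in w]

def gradient_penalty(real, fake, w, b, rng, lam=10.0):
    """lam * mean((||grad f(x_hat)||_2 - 1)^2) over random interpolates x_hat."""
    total = 0.0
    for x_r, x_f in zip(real, fake):
        eps = rng.random()  # uniform in [0, 1): position on the real-fake line
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(x_r, x_f)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w, b)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real)

rng = random.Random(0)
real = [[rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(8)]
fake = [[rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(8)]
penalty = gradient_penalty(real, fake, w=[0.6, 0.8], b=0.0, rng=rng)
print(penalty)
```

The penalty is added to the critic loss; it is zero only when the gradient norm equals 1 at every sampled interpolate.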
Progressive GAN (ProGAN)
Trains both generator and discriminator starting at low resolution ($4\times4$) and progressively adds layers to increase resolution. New layers are faded in smoothly using a blending factor $\alpha$.
Produces very high-quality faces (CelebA-HQ); enables 1024×1024 image generation. Foundation for StyleGAN.
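The fade-in can be sketched as a linear blend between the upsampled output of the previous resolution and the output of the newly added layers, with $\alpha$ ramped from 0 to 1. The nearest-neighbour upsampling and the single-channel list-of-lists image format are assumptions made to keep the example self-contained.

```python
def upsample_nearest_2x(img):
    """Nearest-neighbour 2x upsampling of a 2-D list of floats."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]  # repeat each pixel horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out

def fade_in(old_lowres, new_highres, alpha):
    """Blend (1 - alpha) * upsampled old output with alpha * new-layer output."""
    up = upsample_nearest_2x(old_lowres)
    return [[(1.0 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
            for up_row, new_row in zip(up, new_highres)]

old = [[0.0, 1.0],
       [1.0, 0.0]]                              # 2x2 output of the existing layers
new = [[0.5] * 4 for _ in range(4)]             # 4x4 output of the new layers
blended = fade_in(old, new, alpha=0.25)
print(blended)
```

At `alpha=0` the network behaves exactly like the lower-resolution model; at `alpha=1` only the new layers contribute.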
StyleGAN / StyleGAN2
Introduces a mapping network $f: z \to w$ (8-layer MLP) that maps the latent code $z$ to an intermediate latent space $\mathcal{W}$. Style vectors from $\mathcal{W}$ are injected via Adaptive Instance Normalization (AdaIN) at each generator layer:
\[\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}\]
where $y = (y_s, y_b)$ come from a learned affine transform of $w$.
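For a single feature map $x_i$, the AdaIN formula amounts to normalizing the map to zero mean and unit standard deviation, then applying the style's scale and bias. A minimal sketch, assuming the feature map is flattened to a list and adding a small `eps` for numerical safety (not part of the formula above):

```python
import math

def adain(channel, scale, bias, eps=1e-8):
    """AdaIN for one feature map x_i: y_s * (x - mu) / sigma + y_b."""
    n = len(channel)
    mu = sum(channel) / n
    var = sum((v - mu) ** 2 for v in channel) / n
    sigma = math.sqrt(var + eps)  # eps guards against a constant feature map
    return [scale * (v - mu) / sigma + bias for v in channel]

feature_map = [1.0, 2.0, 3.0, 4.0]
styled = adain(feature_map, scale=2.0, bias=5.0)
print(styled)
```

After the operation, the feature map's mean equals the style bias and its standard deviation equals the style scale, regardless of the statistics it had before.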
StyleGAN2 improvements: replaces AdaIN with weight demodulation to remove normalization artifacts; adds path length regularization for a smoother $\mathcal{W}$ space.
Stochastic variation: per-pixel Gaussian noise injected at each layer adds fine-grained stochastic detail (hair, pores) independent of the style code.
The $\mathcal{W}$ space tends to be more disentangled than $\mathcal{Z}$: it is not forced to follow a fixed prior, so the mapping network can learn to linearize the data manifold.
Conditional GANs (cGAN)
Condition both generator and discriminator on a label $y$:
\[\min_G \max_D \; \mathbb{E}[\log D(x, y)] + \mathbb{E}[\log(1 - D(G(z, y), y))]\]
Projection discriminator: project label embedding and take inner product with discriminator features. Stronger conditioning than concatenation.
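The projection idea can be sketched as a logit made of two parts: an unconditional score on the discriminator features plus the inner product between a learned label embedding and those features. The feature vector, embedding table, and linear unconditional head below are made-up values and hypothetical names for illustration only.

```python
def projection_logit(features, label, embeddings, psi_w, psi_b=0.0):
    """Logit = psi(features) + <embed(label), features>."""
    unconditional = sum(w * h for w, h in zip(psi_w, features)) + psi_b
    projection = sum(e * h for e, h in zip(embeddings[label], features))
    return unconditional + projection

features = [1.0, 2.0]                        # assumed discriminator features of x
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # assumed learned label embeddings
psi_w = [0.5, 0.5]                           # assumed linear unconditional head

logit_y0 = projection_logit(features, 0, embeddings, psi_w)
logit_y1 = projection_logit(features, 1, embeddings, psi_w)
print(logit_y0, logit_y1)
```

The same image features produce different logits under different labels, so the label interacts with every feature dimension rather than being appended as an extra input.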
BigGAN: large-scale cGAN with class-conditional BatchNorm in generator and truncation trick ($z$ sampled from truncated normal for quality/diversity tradeoff).
Image-to-Image Translation
Pix2Pix
Paired image-to-image translation (edge maps to photos, segmentation to scenes). Combines adversarial loss with L1 reconstruction loss:
\[\mathcal{L} = \mathcal{L}_\text{cGAN}(G, D) + \lambda \mathbb{E}[\|y - G(x)\|_1]\]
CycleGAN
Unpaired image translation. Two generators $G: X \to Y$ and $F: Y \to X$. Cycle consistency:
\[\mathcal{L}_\text{cyc} = \mathbb{E}[\|F(G(x)) - x\|_1] + \mathbb{E}[\|G(F(y)) - y\|_1]\]
Learns mappings without paired data (horses to zebras, summer to winter).
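The cycle consistency term can be sketched with toy elementwise mappings standing in for the two generators. The functions `G`, `F`, and `F_bad` below are hypothetical placeholders, not learned networks; expectations are replaced by batch means.

```python
def l1_norm(a, b):
    """||a - b||_1 for two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def cycle_loss(G, F, xs, ys):
    """Mean of ||F(G(x)) - x||_1 over xs plus mean of ||G(F(y)) - y||_1 over ys."""
    forward = sum(l1_norm(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

def G(x):
    """Toy map X -> Y."""
    return [2.0 * v + 1.0 for v in x]

def F(y):
    """Exact inverse of G, so both cycles reconstruct their input."""
    return [(v - 1.0) / 2.0 for v in y]

def F_bad(y):
    """Imperfect inverse: forgets the offset applied by G."""
    return [v / 2.0 for v in y]

xs = [[1.0, -2.0], [0.5, 4.0]]
ys = [[3.0, 5.0], [-1.0, 0.0]]
print(cycle_loss(G, F, xs, ys), cycle_loss(G, F_bad, xs, ys))
```

The loss is zero when `F` inverts `G` and grows when the round trip fails to reconstruct the input, which is what ties the two unpaired mappings together.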
GAN Evaluation
FID (Fréchet Inception Distance): standard metric. Lower is better. See Generative Modeling Overview.
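FID is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images. The sketch below is a simplified illustration under strong assumptions: it treats each feature dimension as independent (diagonal covariance) and uses made-up feature vectors, whereas the actual metric uses Inception network features and full covariance matrices.

```python
import math

def gaussian_stats(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var

def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between Gaussians with diagonal covariance (assumption)."""
    mu_a, var_a = gaussian_stats(feats_a)
    mu_b, var_b = gaussian_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # Diagonal case of Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term

real_feats = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]       # made-up features
shifted_feats = [[f[0] + 3.0, f[1]] for f in real_feats]
print(frechet_distance_diag(real_feats, real_feats),
      frechet_distance_diag(real_feats, shifted_feats))
```

Identical feature sets give a distance of zero; shifting one feature dimension by 3 without changing its spread contributes the squared shift.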
Precision and Recall: precision = sample fidelity; recall = mode coverage.
Truncation trick: sample $z$ from a truncated normal. Stronger truncation (a smaller threshold): better quality, less diversity.
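One way to sketch the truncated sampling is by resampling: draw each component of $z$ from a standard normal and redraw it whenever its magnitude exceeds the threshold. The function name and the threshold values are assumptions for illustration.

```python
import random

def truncated_normal(dim, threshold, rng):
    """Sample z with N(0, 1) components, redrawing any with |value| > threshold."""
    z = []
    while len(z) < dim:
        v = rng.gauss(0.0, 1.0)
        if abs(v) <= threshold:
            z.append(v)
    return z

rng = random.Random(0)
z_wide = truncated_normal(dim=8, threshold=2.0, rng=rng)    # mild truncation
z_narrow = truncated_normal(dim=8, threshold=0.5, rng=rng)  # strong truncation
print(z_wide)
print(z_narrow)
```

A smaller threshold keeps latent codes near the mode of the prior, which is where the quality gain and the loss of diversity both come from.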
Comparison of Key GAN Variants
| Variant | Key Contribution | Objective / Reported Metric |
|---|---|---|
| Vanilla GAN | Original formulation | JSD |
| WGAN | Wasserstein distance | W-distance |
| WGAN-GP | Gradient penalty for Lipschitz | W-distance |
| ProGAN | Progressive growing | FID (faces) |
| StyleGAN2 | Style-based generator, $\mathcal{W}$ space | FID (SOTA faces) |
| BigGAN | Large-scale class-conditional | FID (ImageNet) |
| CycleGAN | Unpaired image translation | FID, user study |