Entropy and KL Divergence
Entropy
Shannon entropy measures the uncertainty or information content of a random variable.
Discrete Entropy
For a discrete random variable $X$ with PMF $P(X)$:
\[H(X) = -\sum_x P(x) \log P(x)\]Properties:
- $H(X) \geq 0$ (non-negative)
- $H(X) = 0$ iff $X$ is deterministic
- Maximized when $X$ is uniform: $H_{\text{max}} = \log \lvert\mathcal{X}\rvert$
- Base of logarithm: $\log_2$ gives bits, $\ln$ gives nats
Differential Entropy
For a continuous random variable $X$ with PDF $f(x)$:
\[h(X) = -\int f(x) \log f(x) dx\]Differences from discrete entropy:
- Can be negative (e.g., narrow Gaussian)
- Not invariant to coordinate transformations
- Requires Jacobian correction under change of variables
Entropy of Common Distributions
| Distribution | Entropy |
|---|---|
| Bernoulli($p$) | $-p \log p - (1-p) \log(1-p)$ |
| Uniform($a, b$) | $\log(b - a)$ |
| Normal($\mu, \sigma^2$) | $\frac{1}{2} \log(2\pi e \sigma^2)$ |
| Exponential($\lambda$) | $1 - \log \lambda$ |
Joint and Conditional Entropy
Joint Entropy
Uncertainty in multiple variables together:
\[H(X, Y) = -\sum_{x,y} P(x, y) \log P(x, y)\]Conditional Entropy
Uncertainty in $Y$ given knowledge of $X$:
\[H(Y|X) = -\sum_{x,y} P(x, y) \log P(y|x)\]Chain rule:
\[H(X, Y) = H(X) + H(Y|X)\]Property: $H(Y \mid X) \leq H(Y)$ (conditioning reduces entropy)
Equality holds iff $X$ and $Y$ are independent.
Mutual Information
Mutual information (MI) measures the amount of information shared between two variables:
\[I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\]Equivalent expressions:
\[I(X; Y) = H(Y) - H(Y|X)\] \[I(X; Y) = H(X) - H(X|Y)\] \[I(X; Y) = H(X) + H(Y) - H(X, Y)\]Properties:
- $I(X; Y) \geq 0$ (non-negative)
- $I(X; Y) = 0$ iff $X$ and $Y$ are independent
- Symmetric: $I(X; Y) = I(Y; X)$
- $I(X; X) = H(X)$ (self-information equals entropy)
Conditional Mutual Information
\[I(X; Y | Z) = H(X|Z) - H(X|Y, Z)\]Information shared between $X$ and $Y$ given knowledge of $Z$.
KL Divergence
Kullback-Leibler divergence measures how much one distribution diverges from another.
Definition
For discrete distributions $P$ and $Q$:
\[D_{\text{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]For continuous distributions:
\[D_{\text{KL}}(P \Vert Q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]Properties
- $D_{\text{KL}}(P \Vert Q) \geq 0$ (Gibbs’ inequality)
- $D_{\text{KL}}(P \Vert Q) = 0$ iff $P = Q$ (almost everywhere)
- Not symmetric: $D_{\text{KL}}(P \Vert Q) \neq D_{\text{KL}}(Q \Vert P)$ in general
- Not a metric: doesn’t satisfy triangle inequality
Relationship to Entropy and Cross-Entropy
Cross-entropy:
\[H(P, Q) = -\sum_x P(x) \log Q(x)\]Relationship:
\[D_{\text{KL}}(P \Vert Q) = H(P, Q) - H(P)\]Minimizing cross-entropy is equivalent to minimizing KL divergence (since $H(P)$ is constant).
KL Divergence for Common Distributions
Bernoulli:
\[D_{\text{KL}}(\text{Bern}(p) \Vert \text{Bern}(q)) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\]Normal:
\[D_{\text{KL}}(\mathcal{N}_0 \Vert \mathcal{N}_1) = \frac{1}{2} \left[\frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} + \frac{\sigma_0^2}{\sigma_1^2} - 1 + \log \frac{\sigma_1}{\sigma_0}\right]\]Multivariate Normal:
\[D_{\text{KL}}(\mathcal{N}_0 \Vert \mathcal{N}_1) = \frac{1}{2} \left[\text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log \frac{|\Sigma_1|}{|\Sigma_0|}\right]\]Forward vs Reverse KL
Forward KL ($D_{\text{KL}}(P \Vert Q)$)
- Zero-avoiding: $Q$ must cover all regions where $P$ has mass
- Tends to overestimate the support of $P$
- Used in: maximum likelihood, variational inference (variational distribution approximates true posterior)
Reverse KL ($D_{\text{KL}}(Q \Vert P)$)
- Zero-forcing: $Q$ concentrates on high-probability regions of $P$
- Tends to underestimate the support of $P$
- Used in: EM algorithm, reinforcement learning (policy optimization)
Jensen-Shannon Divergence
Symmetrized, smoothed version of KL divergence:
\[\text{JSD}(P \Vert Q) = \frac{1}{2} D_{\text{KL}}(P \Vert M) + \frac{1}{2} D_{\text{KL}}(Q \Vert M)\]where $M = \frac{1}{2}(P + Q)$ (mixture distribution).
Properties:
- Always finite (unlike KL)
- Symmetric: $\text{JSD}(P \Vert Q) = \text{JSD}(Q \Vert P)$
- Bounded: $0 \leq \text{JSD} \leq \log 2$ (for $\log_2$)
- $\sqrt{\text{JSD}}$ is a proper metric
Cross-Entropy in Machine Learning
Classification Loss
For true label $y$ (one-hot) and predicted probabilities $\hat{y}$:
\[\mathcal{L} = -\sum_{i=1}^C y_i \log \hat{y}_i\]Binary classification:
\[\mathcal{L} = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]\]Multi-class classification:
\[\mathcal{L} = -\sum_{c=1}^C y_c \log \hat{y}_c\]Minimizing cross-entropy loss = maximizing log-likelihood = minimizing KL divergence between true and predicted distributions.
Applications in Information Theory
Source Coding Theorem
The minimum expected code length to encode samples from $P$ is $H(P)$ bits.
Using a code optimized for $Q$ when the true distribution is $P$ gives expected length $H(P, Q)$.
Rate-Distortion Theory
Minimum rate (bits) needed to represent a source within distortion $D$:
\[R(D) = \min_{P(\hat{X}|X) : E[d(X, \hat{X})] \leq D} I(X; \hat{X})\]Applications in Machine Learning
Variational Inference
Approximate intractable posterior $P(\theta \mid D)$ with variational distribution $q(\theta)$:
\[q^*(\theta) = \arg\min_q D_{\text{KL}}(q(\theta) \Vert P(\theta | D))\]ELBO (Evidence Lower Bound):
\[\log P(D) \geq E_q[\log P(D | \theta)] - D_{\text{KL}}(q(\theta) \Vert P(\theta))\]Variational Autoencoders (VAEs)
KL divergence regularizes the latent space:
\[\mathcal{L} = E_{q(z|x)}[\log P(x|z)] - D_{\text{KL}}(q(z|x) \Vert P(z))\]Expectation-Maximization (EM)
E-step: Compute expected complete-data log-likelihood. M-step: Maximize w.r.t. parameters.
Equivalent to minimizing reverse KL divergence.
Information Bottleneck
Compress input $X$ while preserving information about target $Y$:
\[\min_{P(Z|X)} I(X; Z) - \beta I(Z; Y)\]Trade-off between compression and prediction.
Contrastive Learning
InfoNCE loss estimates mutual information:
\[\mathcal{L} = -\log \frac{\exp(\text{sim}(x, x^+)/\tau)}{\sum_{x^-} \exp(\text{sim}(x, x^-)/\tau)}\]Lower bound on mutual information between representations.