Entropy and KL Divergence

Entropy

Shannon entropy measures the uncertainty or information content of a random variable.

Discrete Entropy

For a discrete random variable $X$ with PMF $P(X)$:

\[H(X) = -\sum_x P(x) \log P(x)\]

Properties:

$H(X) \geq 0$ (non-negative)
$H(X) = 0$ iff $X$ is deterministic
Maximized when $X$ is uniform: $H_{\text{max}} = \log \lvert\mathcal{X}\rvert$
Base of logarithm: $\log_2$ gives bits, $\ln$ gives nats

Differential Entropy

For a continuous random variable $X$ with PDF $f(x)$:

\[h(X) = -\int f(x) \log f(x) dx\]

Differences from discrete entropy:

Can be negative (e.g., narrow Gaussian)
Not invariant to coordinate transformations
Requires Jacobian correction under change of variables

Entropy of Common Distributions

Distribution	Entropy
Bernoulli($p$)	$-p \log p - (1-p) \log(1-p)$
Uniform($a, b$)	$\log(b - a)$
Normal($\mu, \sigma^2$)	$\frac{1}{2} \log(2\pi e \sigma^2)$
Exponential($\lambda$)	$1 - \log \lambda$

Joint and Conditional Entropy

Joint Entropy

Uncertainty in multiple variables together:

\[H(X, Y) = -\sum_{x,y} P(x, y) \log P(x, y)\]

Conditional Entropy

Uncertainty in $Y$ given knowledge of $X$:

\[H(Y|X) = -\sum_{x,y} P(x, y) \log P(y|x)\]

Chain rule:

\[H(X, Y) = H(X) + H(Y|X)\]

Property: $H(Y \mid X) \leq H(Y)$ (conditioning reduces entropy)

Equality holds iff $X$ and $Y$ are independent.

Mutual Information

Mutual information (MI) measures the amount of information shared between two variables:

\[I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\]

Equivalent expressions:

\[I(X; Y) = H(Y) - H(Y|X)\] \[I(X; Y) = H(X) - H(X|Y)\] \[I(X; Y) = H(X) + H(Y) - H(X, Y)\]

Properties:

$I(X; Y) \geq 0$ (non-negative)
$I(X; Y) = 0$ iff $X$ and $Y$ are independent
Symmetric: $I(X; Y) = I(Y; X)$
$I(X; X) = H(X)$ (self-information equals entropy)

Conditional Mutual Information

\[I(X; Y | Z) = H(X|Z) - H(X|Y, Z)\]

Information shared between $X$ and $Y$ given knowledge of $Z$.

KL Divergence

Kullback-Leibler divergence measures how much one distribution diverges from another.

Definition

For discrete distributions $P$ and $Q$:

\[D_{\text{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]

For continuous distributions:

\[D_{\text{KL}}(P \Vert Q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]

Properties

$D_{\text{KL}}(P \Vert Q) \geq 0$ (Gibbs’ inequality)
$D_{\text{KL}}(P \Vert Q) = 0$ iff $P = Q$ (almost everywhere)
Not symmetric: $D_{\text{KL}}(P \Vert Q) \neq D_{\text{KL}}(Q \Vert P)$ in general
Not a metric: doesn’t satisfy triangle inequality

Relationship to Entropy and Cross-Entropy

Cross-entropy:

\[H(P, Q) = -\sum_x P(x) \log Q(x)\]

Relationship:

\[D_{\text{KL}}(P \Vert Q) = H(P, Q) - H(P)\]

Minimizing cross-entropy is equivalent to minimizing KL divergence (since $H(P)$ is constant).

KL Divergence for Common Distributions

Bernoulli:

\[D_{\text{KL}}(\text{Bern}(p) \Vert \text{Bern}(q)) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\]

Normal:

\[D_{\text{KL}}(\mathcal{N}_0 \Vert \mathcal{N}_1) = \frac{1}{2} \left[\frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} + \frac{\sigma_0^2}{\sigma_1^2} - 1 + \log \frac{\sigma_1}{\sigma_0}\right]\]

Multivariate Normal:

\[D_{\text{KL}}(\mathcal{N}_0 \Vert \mathcal{N}_1) = \frac{1}{2} \left[\text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log \frac{|\Sigma_1|}{|\Sigma_0|}\right]\]

Forward vs Reverse KL

Forward KL ($D_{\text{KL}}(P \Vert Q)$)

Zero-avoiding: $Q$ must cover all regions where $P$ has mass
Tends to overestimate the support of $P$
Used in: maximum likelihood, variational inference (variational distribution approximates true posterior)

Reverse KL ($D_{\text{KL}}(Q \Vert P)$)

Zero-forcing: $Q$ concentrates on high-probability regions of $P$
Tends to underestimate the support of $P$
Used in: EM algorithm, reinforcement learning (policy optimization)

Jensen-Shannon Divergence

Symmetrized, smoothed version of KL divergence:

\[\text{JSD}(P \Vert Q) = \frac{1}{2} D_{\text{KL}}(P \Vert M) + \frac{1}{2} D_{\text{KL}}(Q \Vert M)\]

where $M = \frac{1}{2}(P + Q)$ (mixture distribution).

Properties:

Always finite (unlike KL)
Symmetric: $\text{JSD}(P \Vert Q) = \text{JSD}(Q \Vert P)$
Bounded: $0 \leq \text{JSD} \leq \log 2$ (for $\log_2$)
$\sqrt{\text{JSD}}$ is a proper metric

Cross-Entropy in Machine Learning

Classification Loss

For true label $y$ (one-hot) and predicted probabilities $\hat{y}$:

\[\mathcal{L} = -\sum_{i=1}^C y_i \log \hat{y}_i\]

Binary classification:

\[\mathcal{L} = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]\]

Multi-class classification:

\[\mathcal{L} = -\sum_{c=1}^C y_c \log \hat{y}_c\]

Minimizing cross-entropy loss = maximizing log-likelihood = minimizing KL divergence between true and predicted distributions.

Applications in Information Theory

Source Coding Theorem

The minimum expected code length to encode samples from $P$ is $H(P)$ bits.

Using a code optimized for $Q$ when the true distribution is $P$ gives expected length $H(P, Q)$.

Rate-Distortion Theory

Minimum rate (bits) needed to represent a source within distortion $D$:

\[R(D) = \min_{P(\hat{X}|X) : E[d(X, \hat{X})] \leq D} I(X; \hat{X})\]

Applications in Machine Learning

Variational Inference

Approximate intractable posterior $P(\theta \mid D)$ with variational distribution $q(\theta)$:

\[q^*(\theta) = \arg\min_q D_{\text{KL}}(q(\theta) \Vert P(\theta | D))\]

ELBO (Evidence Lower Bound):

\[\log P(D) \geq E_q[\log P(D | \theta)] - D_{\text{KL}}(q(\theta) \Vert P(\theta))\]

Variational Autoencoders (VAEs)

KL divergence regularizes the latent space:

\[\mathcal{L} = E_{q(z|x)}[\log P(x|z)] - D_{\text{KL}}(q(z|x) \Vert P(z))\]

Expectation-Maximization (EM)

E-step: Compute expected complete-data log-likelihood. M-step: Maximize w.r.t. parameters.

Equivalent to minimizing reverse KL divergence.

Information Bottleneck

Compress input $X$ while preserving information about target $Y$:

\[\min_{P(Z|X)} I(X; Z) - \beta I(Z; Y)\]

Trade-off between compression and prediction.

Contrastive Learning

InfoNCE loss estimates mutual information:

\[\mathcal{L} = -\log \frac{\exp(\text{sim}(x, x^+)/\tau)}{\sum_{x^-} \exp(\text{sim}(x, x^-)/\tau)}\]

Lower bound on mutual information between representations.