Information Theory

Information theory quantifies uncertainty, information content, and communication efficiency. In ML, it provides the theoretical foundation for loss functions, compression, and learning bounds.

Entropy

Shannon entropy measures the expected surprise (uncertainty) of a random variable $X$:

$$ H(X) = -\sum_x p(x) \log p(x) $$

Uses $\log_2$ → bits; $\ln$ → nats
$H(X) \geq 0$, equals 0 when $X$ is deterministic
Maximized by the uniform distribution
For binary $X$: $H = -p\log p - (1-p)\log(1-p)$ (binary entropy function)

Differential entropy for continuous distributions:

$$ h(X) = -\int p(x) \log p(x)\, dx $$

Can be negative (unlike discrete entropy).

Joint and Conditional Entropy

$$ H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y) $$

$$ H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x) $$

Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$

Conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$

Cross-Entropy

Measures the expected number of bits needed to encode samples from $p$ using a code optimized for $q$:

$$ H(p, q) = -\sum_x p(x) \log q(x) $$

Cross-entropy loss in classification:

$$ \mathcal{L} = -\sum_i y_i \log \hat{y}_i $$

where $y$ is the true label (one-hot) and $\hat{y}$ is the predicted probability.

Minimizing cross-entropy = maximizing log-likelihood = MLE

KL Divergence

Measures how much distribution $q$ diverges from the true distribution $p$:

$$ D_{KL}(p \Vert q) = \sum_x p(x) \log \frac{p(x)}{q(x)} $$

Properties:

$D_{KL}(p \Vert q) \geq 0$ (Gibbs’ inequality)
$D_{KL}(p \Vert q) = 0$ iff $p = q$
Not symmetric: $D_{KL}(p \Vert q) \neq D_{KL}(q \Vert p)$ in general
Relationship: $H(p, q) = H(p) + D_{KL}(p \Vert q)$

Forward KL $D_{KL}(p \Vert q)$: zero-avoiding, $q$ must cover all of $p$. Reverse KL $D_{KL}(q \Vert p)$: zero-forcing, $q$ concentrates where $p$ is high. Used in variational inference (ELBO).

Mutual Information

Measures the amount of information shared between $X$ and $Y$:

$$ I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} $$

$$ I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) $$

$I(X; Y) = 0$ iff $X$ and $Y$ are independent
$I(X; Y) = D_{KL}(p(x,y) \Vert p(x)p(y))$

Used in: feature selection, representation learning (InfoNCE), IB (information bottleneck).

Jensen’s Inequality

For a convex function $f$:

$$ f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] $$

For a concave function (like $\log$):

$$ \log(\mathbb{E}[X]) \geq \mathbb{E}[\log X] $$

Jensen’s inequality is used to derive the ELBO in VAEs and EM algorithm.

Data Processing Inequality

If $X \to Y \to Z$ form a Markov chain:

$$ I(X; Z) \leq I(X; Y) $$

Processing data cannot increase the information about the original source. Foundation of the information bottleneck theory of deep learning.

Minimum Description Length (MDL)

The best model is the one that gives the shortest combined description of:

The model itself
The data given the model

Connects information theory to Bayesian inference and model selection.

Maximum Entropy Principle

Among all distributions satisfying known constraints, choose the one with maximum entropy (the least committal distribution).

No constraints → uniform distribution
Known mean and variance → Gaussian distribution (maximum entropy for fixed mean/variance)

Used in: feature-based models (MaxEnt classifiers), energy-based models.

Entropy in Decision Trees

Information gain for feature $A$ with respect to target $Y$:

$$ \text{IG}(Y, A) = H(Y) - H(Y|A) $$

Greedy split criterion: choose the feature that maximizes information gain. Used in ID3, C4.5 algorithms.

Gini impurity is an alternative to entropy for CART:

$$ G = 1 - \sum_k p_k^2 $$