Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable.

Given i.i.d. data $D = \{x_1, \ldots, x_n\}$ and a parametric model $P(x \mid \theta)$:

$$ \hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \prod_{i=1}^n P(x_i \mid \theta) $$

The Likelihood Function

Likelihood: $L(\theta) = P(D \mid \theta)$ viewed as a function of $\theta$.

Important distinction:

$P(D \mid \theta)$ as a function of $D$: probability of data
$L(\theta)$ as a function of $\theta$: likelihood of parameters

Note: The likelihood is NOT a probability distribution over $\theta$; it need not sum or integrate to 1 over $\theta$.

Log-Likelihood

Since $\log$ is monotonically increasing, maximizing likelihood is equivalent to maximizing log-likelihood:

$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i \mid \theta) $$

Advantages:

Converts products to sums (easier to compute)
More numerically stable (avoids underflow)
Derivatives are simpler
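
The underflow point can be seen in a few lines. The following is a minimal sketch using Bernoulli data; the function names and the 2000-flip sample are illustrative assumptions, not part of the notes above.

```python
import math

def likelihood(xs, theta):
    """Direct product of Bernoulli probabilities (prone to underflow)."""
    prod = 1.0
    for x in xs:
        prod *= theta if x == 1 else 1.0 - theta
    return prod

def log_likelihood(xs, theta):
    """Sum of log-probabilities; stays finite where the product underflows."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in xs)

xs = [1, 0] * 1000                  # 2000 hypothetical coin flips
direct = likelihood(xs, 0.5)        # 0.5 ** 2000 is below the float64 range
stable = log_likelihood(xs, 0.5)    # about 2000 * log(0.5)
```

Here `direct` collapses to `0.0`, so every value of $\theta$ would look equally bad, while `stable` remains an ordinary finite number that can still be compared across parameter values.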

Finding the MLE

Analytical Solution

Set the derivative of the log-likelihood to zero:

$$ \frac{\partial \ell(\theta)}{\partial \theta} = 0 $$

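Solve for $\theta$ if a closed-form solution exists, then confirm that the stationary point is a maximum (for example, by checking that the second derivative is negative).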

Numerical Optimization

When no closed-form solution exists, use an iterative method:

Gradient ascent on $\ell(\theta)$
Newton-Raphson / Fisher scoring
EM algorithm (for latent variable models)
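
As an illustration of the Newton-Raphson option, here is a minimal sketch applied to a Poisson log-likelihood. The Poisson model does have a closed-form MLE (the sample mean), which is exactly why it is used here: the iterative answer can be checked. The function name, starting value, and tolerance are assumptions made for the example.

```python
def newton_poisson_mle(xs, lam=1.0, tol=1e-10, max_iter=100):
    """Newton-Raphson on the Poisson log-likelihood in lambda.

    Assumes sum(xs) > 0 and a starting value in (0, 2 * mean(xs));
    outside that range this plain Newton iteration can leave lambda > 0.
    """
    n, s = len(xs), sum(xs)
    for _ in range(max_iter):
        score = s / lam - n          # first derivative of the log-likelihood
        hessian = -s / lam ** 2      # second derivative of the log-likelihood
        step = score / hessian
        lam -= step                  # Newton update: lam - score / hessian
        if abs(step) < tol:
            break
    return lam

xs = [2, 4, 3, 1, 5, 3]              # hypothetical counts with mean 3
lam_hat = newton_poisson_mle(xs)
```

On this sample the iteration settles on the sample mean within a handful of steps. Gradient ascent would use only `score` with a step size, at the cost of slower convergence.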

Examples

Bernoulli / Binomial

Data: $x_1, \ldots, x_n \in \{0, 1\}$, $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$

Log-likelihood:

$$ \ell(\theta) = \sum_{i=1}^n [x_i \log \theta + (1-x_i) \log(1-\theta)] $$

Derivative:

$$ \frac{\partial \ell}{\partial \theta} = \frac{\sum x_i}{\theta} - \frac{n - \sum x_i}{1-\theta} = 0 $$

MLE:

$$ \hat{\theta}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i $$

(sample mean = proportion of successes)
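
A quick numerical sanity check of this result, as a sketch: the closed-form estimate is compared with a brute-force search over a grid of $\theta$ values. The sample and the grid resolution are arbitrary choices for illustration.

```python
import math

def bernoulli_log_likelihood(xs, theta):
    """Log-likelihood of i.i.d. Bernoulli data for 0 < theta < 1."""
    k = sum(xs)                      # number of successes
    n = len(xs)
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

xs = [1, 0, 1, 1, 0, 1, 1, 0]        # hypothetical sample: 5 successes in 8 trials
theta_hat = sum(xs) / len(xs)        # closed-form MLE (sample mean)

# Brute-force check: evaluate the log-likelihood on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(xs, t))
```

Both approaches pick out the same value, 5/8, because the log-likelihood is concave in $\theta$ with a single peak at the sample proportion.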

Normal Distribution

Data: $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$

Log-likelihood:

$$ \ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 $$

MLE for mean:

$$ \hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} $$

MLE for variance:

$$ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 $$

Note: This variance estimator is biased. It divides by $n$ rather than $n-1$, so $E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n} \sigma^2$.
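
The difference between the two divisors is easy to see numerically. Below is a small sketch with a made-up sample; `normal_mle` is a hypothetical helper name, and the standard library's `statistics` module is used only as a cross-check.

```python
import statistics

def normal_mle(xs):
    """Return the MLE of the mean and of the variance (divides by n)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_mle = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_mle

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]    # hypothetical sample
mu_hat, var_mle = normal_mle(xs)

# Rescaling by n / (n - 1) gives the usual unbiased sample variance.
var_unbiased = var_mle * len(xs) / (len(xs) - 1)

# Cross-check: pvariance divides by n, variance divides by n - 1.
pvar = statistics.pvariance(xs)
svar = statistics.variance(xs)
```

The MLE matches `statistics.pvariance` and is always slightly smaller than `statistics.variance`; the gap shrinks as $n$ grows.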

Poisson Distribution

Data: $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$

Log-likelihood:

$$ \ell(\lambda) = \sum_{i=1}^n [x_i \log \lambda - \lambda - \log(x_i!)] $$

MLE:

$$ \hat{\lambda}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} $$
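
The step between the log-likelihood and the estimator follows the same pattern as the Bernoulli case. Differentiating with respect to $\lambda$ (the term $\log(x_i!)$ does not depend on $\lambda$) and setting the result to zero:

$$ \frac{\partial \ell}{\partial \lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0 \quad \Longrightarrow \quad \hat{\lambda} = \frac{1}{n} \sum_{i=1}^n x_i $$

The second derivative, $-\sum_{i=1}^n x_i / \lambda^2$, is negative whenever at least one count is positive, so this stationary point is a maximum.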

Multinomial / Categorical

Data: counts $n_1, \ldots, n_k$ for $k$ categories, $\sum n_i = n$

Log-likelihood:

$$ \ell(p_1, \ldots, p_k) = \sum_{i=1}^k n_i \log p_i $$

MLE (with constraint $\sum p_i = 1$):

$$ \hat{p}_i = \frac{n_i}{n} $$

(relative frequencies)
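
One standard way to handle the constraint is a Lagrange multiplier. In the sketch below, the multiplier $\nu$ and the Lagrangian $\mathcal{J}$ are symbols introduced here for the derivation:

$$ \mathcal{J}(p_1, \ldots, p_k, \nu) = \sum_{i=1}^k n_i \log p_i + \nu \left(1 - \sum_{i=1}^k p_i\right) $$

$$ \frac{\partial \mathcal{J}}{\partial p_i} = \frac{n_i}{p_i} - \nu = 0 \quad \Longrightarrow \quad p_i = \frac{n_i}{\nu} $$

Substituting into $\sum p_i = 1$ gives $\nu = \sum n_i = n$, and therefore $\hat{p}_i = n_i / n$.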

Properties of MLE

Consistency

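As $n \to \infty$, $\hat{\theta}_{\text{MLE}} \to \theta_0$ in probability, where $\theta_0$ is the true parameter value (assuming the model is correctly specified and standard regularity conditions hold).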

Asymptotic Normality

As $n \to \infty$:

$$ \sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1}) $$

where $I(\theta)$ is the Fisher information of a single observation.

Asymptotic Efficiency

MLE attains the Cramér-Rao lower bound asymptotically: for large $n$ its variance approaches $1 / (n I(\theta_0))$, the minimum possible for an unbiased estimator.

Invariance

If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$.
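
To make invariance concrete, the sketch below estimates the odds $\theta / (1 - \theta)$ of a Bernoulli model in two ways: by transforming the MLE of $\theta$, and by maximizing the likelihood directly in the odds parametrization. The sample, the grid, and the function name are assumptions for illustration.

```python
import math

def log_likelihood_in_odds(xs, odds):
    """Bernoulli log-likelihood reparametrized by odds = theta / (1 - theta)."""
    theta = odds / (1.0 + odds)
    k, n = sum(xs), len(xs)
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

xs = [1, 0, 1, 1, 0, 1, 1, 0]                  # hypothetical sample: 5 of 8
theta_hat = sum(xs) / len(xs)                  # MLE of theta
odds_from_invariance = theta_hat / (1.0 - theta_hat)

# Direct maximization over a grid of odds values in (0, 10).
grid = [i / 1000 for i in range(1, 10000)]
odds_direct = max(grid, key=lambda o: log_likelihood_in_odds(xs, o))
```

Up to the grid resolution, the two estimates agree at about 1.667, so no separate optimization is needed for the transformed parameter.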

Fisher Information

Measures how much information the data provides about $\theta$:

$$ I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log P(X \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log P(X \mid \theta)\right] $$

Cramér-Rao bound: For $n$ i.i.d. observations, any unbiased estimator $\hat{\theta}$ satisfies:

$$ \text{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)} $$
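
As a worked check, take the Bernoulli model $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ and use the second-derivative form of the Fisher information, with $E[X] = \theta$:

$$ \frac{\partial^2}{\partial \theta^2} \log P(X \mid \theta) = -\frac{X}{\theta^2} - \frac{1 - X}{(1-\theta)^2} \quad \Longrightarrow \quad I(\theta) = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)} $$

The bound is therefore $\theta(1-\theta)/n$. The sample mean is unbiased and has variance $\text{Var}(\bar{x}) = \theta(1-\theta)/n$, so in this model the MLE attains the bound exactly, not only asymptotically.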

MLE vs MAP

Aspect	MLE	MAP
Formula	$\arg\max_\theta P(D \mid \theta)$	$\arg\max_\theta P(D \mid \theta) P(\theta)$
Prior	None (implicit uniform)	Explicit
Result	Point estimate	Point estimate
Regularization	None	Prior acts as regularizer
Paradigm	Frequentist	Hybrid (point estimate from Bayesian framework)

Connection: MAP with a zero-mean Gaussian prior on $\theta$ is equivalent to MLE with an L2 penalty on $\theta$ (L2 regularization).
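
The reasoning behind this connection can be sketched in two lines. Assume an isotropic prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, where the prior variance $\tau^2$ is a symbol introduced here. Taking the log of the MAP objective and dropping terms that do not depend on $\theta$:

$$ \log P(\theta) = -\frac{1}{2\tau^2} \|\theta\|_2^2 + \text{const} $$

$$ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \ell(\theta) - \frac{1}{2\tau^2} \|\theta\|_2^2 \right] $$

This is the log-likelihood with an L2 penalty of strength $1/(2\tau^2)$: a tighter prior (smaller $\tau^2$) means stronger regularization, and $\tau^2 \to \infty$ recovers the MLE.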

MLE in Machine Learning

Linear Regression

Assuming $y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i$ with i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$:

$$ \hat{\mathbf{w}}_{\text{MLE}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{w}^T \mathbf{x}_i)^2 $$

MLE under Gaussian noise = least squares.
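
The equivalence can be checked on a toy problem. The sketch below fits a one-feature model $y = w_0 + w_1 x$ (an intercept is the same as a constant feature in $\mathbf{x}$) by closed-form least squares and then confirms that the Gaussian log-likelihood is lower at nearby parameter values. The data, the fixed noise variance of 1, and the function names are assumptions for illustration.

```python
import math

def fit_least_squares(xs, ys):
    """Closed-form least squares for y = w0 + w1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_bar - w1 * x_bar, w1

def gaussian_log_likelihood(xs, ys, w0, w1, sigma2=1.0):
    """Log-likelihood of the data under Gaussian noise with variance sigma2."""
    n = len(xs)
    sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical inputs
ys = [1.1, 2.9, 5.2, 6.8, 9.1]        # hypothetical targets
w0, w1 = fit_least_squares(xs, ys)

ll_at_fit = gaussian_log_likelihood(xs, ys, w0, w1)
ll_nearby = max(
    gaussian_log_likelihood(xs, ys, w0 + d0, w1 + d1)
    for d0 in (-0.1, 0.1)
    for d1 in (-0.1, 0.1)
)
```

Because the log-likelihood depends on the weights only through the sum of squared errors, `ll_at_fit` exceeds `ll_nearby` for any choice of `sigma2`.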

Logistic Regression

For binary classification with $P(y=1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$:

$$ \hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \sum_{i=1}^n [y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T \mathbf{x}_i))] $$

MLE = minimizing cross-entropy loss.
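
Unlike linear regression, this objective has no closed-form maximizer, so it is usually optimized iteratively. The following is a minimal sketch of plain gradient ascent for a single feature plus a bias term `b` (equivalent to a constant feature in $\mathbf{x}$). The data, learning rate, and step count are arbitrary choices, and the data is deliberately not linearly separable so that a finite MLE exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, b, xs, ys):
    """Bernoulli log-likelihood (negative cross-entropy, summed over the data)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gradient(w, b, xs, ys):
    """Gradient of the log-likelihood: sum of (y_i - p_i) times the input."""
    residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs))
    grad_b = sum(residuals)
    return grad_w, grad_b

def fit_logistic(xs, ys, lr=0.1, n_steps=2000):
    """Gradient ascent on the log-likelihood, starting from zero weights."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        grad_w, grad_b = gradient(w, b, xs, ys)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]   # hypothetical feature values
ys = [0, 0, 0, 1, 0, 1, 1, 1]                        # labels overlap near x = 0
w_hat, b_hat = fit_logistic(xs, ys)
```

At the returned parameters the gradient is numerically zero and the log-likelihood is higher than at the starting point. In practice Newton-type methods are often preferred because they need far fewer iterations.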

Neural Networks

Training neural networks with cross-entropy or MSE loss is equivalent to MLE under a matching assumption about the output distribution:

Cross-entropy: MLE for categorical output
MSE: MLE for Gaussian output with fixed variance

Problems with MLE

Overfitting

MLE maximizes fit to the training data and includes no penalty for model complexity.

Solution: Use MAP or add regularization.

No Uncertainty Quantification

MLE gives a single point estimate and, on its own, no measure of uncertainty.

Solution: Full Bayesian inference or bootstrap.
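
The bootstrap option can be sketched briefly: resample the data with replacement, recompute the MLE on each resample, and use the spread of those estimates as a standard error. The helper name `bootstrap_se`, the sample, the number of resamples, and the seed are all assumptions for the example.

```python
import random

def bootstrap_se(xs, estimator, n_resamples=2000, seed=0):
    """Nonparametric bootstrap estimate of an estimator's standard error."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [xs[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    mean = sum(estimates) / n_resamples
    var = sum((e - mean) ** 2 for e in estimates) / (n_resamples - 1)
    return var ** 0.5

xs = [1] * 120 + [0] * 80                      # hypothetical sample: 120 of 200
theta_hat = sum(xs) / len(xs)                  # Bernoulli MLE
se_boot = bootstrap_se(xs, lambda s: sum(s) / len(s))

# Asymptotic standard error from the Fisher information, for comparison.
se_asymptotic = (theta_hat * (1 - theta_hat) / len(xs)) ** 0.5
```

For this simple model the bootstrap value lands close to the asymptotic one, about 0.035. The bootstrap is most useful where no such formula is available.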

Non-identifiability

Multiple parameter values may give the same likelihood.

Example: Label switching in mixture models.

Boundary Solutions

MLE can give degenerate solutions (e.g., zero variance, perfect separation).

Solution: Add prior / regularization.