Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable.
Given i.i.d. data $D = \{x_1, \ldots, x_n\}$ and a parametric model $P(x \mid \theta)$:
\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \prod_{i=1}^n P(x_i \mid \theta)\]
The Likelihood Function
Likelihood: $L(\theta) = P(D \mid \theta)$ viewed as a function of $\theta$.
Important distinction:
- $P(D \mid \theta)$ as a function of $D$: probability of data
- $L(\theta)$ as a function of $\theta$: likelihood of parameters
Note: Likelihood is NOT a probability distribution over $\theta$ (it need not sum or integrate to 1).
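
The distinction can be checked numerically. The sketch below assumes a Bernoulli model with a hypothetical three-point dataset (the names `likelihood`, `area_over_theta`, and `total_over_data` are illustrative). It integrates $L(\theta)$ over $\theta$, then sums $P(D \mid \theta)$ over every possible dataset at a fixed $\theta$.

```python
import numpy as np

# Hypothetical data: three Bernoulli trials with two successes.
x = np.array([1, 1, 0])

def likelihood(theta, x):
    """Bernoulli likelihood L(theta) = prod_i theta^x_i * (1 - theta)^(1 - x_i)."""
    theta = np.asarray(theta, dtype=float)
    s = x.sum()
    return theta**s * (1.0 - theta) ** (len(x) - s)

# Midpoint rule on [0, 1]: approximates the integral of L(theta) over theta.
m = 100_000
grid = (np.arange(m) + 0.5) / m
area_over_theta = likelihood(grid, x).mean()

# Summing P(D | theta) over all 2^3 possible datasets at a fixed theta.
theta_fixed = 0.3
datasets = [np.array([a, b, c]) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
total_over_data = sum(float(likelihood(theta_fixed, d)) for d in datasets)

print(area_over_theta)   # close to 1/12, not 1
print(total_over_data)   # close to 1
```

As a function of the data the values sum to 1, while as a function of $\theta$ the area is $\int_0^1 \theta^2 (1-\theta)\, d\theta = 1/12$.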
Log-Likelihood
Since $\log$ is monotonically increasing, maximizing likelihood is equivalent to maximizing log-likelihood:
\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i \mid \theta)\]
Advantages:
- Converts products to sums (easier to compute)
- More numerically stable (avoids underflow)
- Derivatives are simpler
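
The numerical-stability point is easy to see in practice. The following sketch uses synthetic standard-normal data (the sample size and the helper names `normal_pdf` and `normal_logpdf` are assumptions for illustration) and compares the raw product of densities with the sum of log-densities.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)  # synthetic data, for illustration only

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

raw_likelihood = np.prod(normal_pdf(x, 0.0, 1.0))    # product of 2000 numbers below 0.4
log_likelihood = np.sum(normal_logpdf(x, 0.0, 1.0))  # sum of 2000 moderate numbers

print(raw_likelihood)  # underflows to 0.0 in float64
print(log_likelihood)  # finite negative number
```

Each density is at most about 0.4, so the product is far below the smallest positive float64 value, while the log-likelihood remains representable.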
Finding the MLE
Analytical Solution
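Set the derivative of the log-likelihood (the score) to zero:
\[\frac{\partial \ell(\theta)}{\partial \theta} = 0\]
Solve for $\theta$ if a closed form exists, then confirm that the stationary point is a maximum (for example, by checking that the second derivative is negative).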
Numerical Optimization
When no closed form exists, maximize $\ell(\theta)$ iteratively:
- Gradient ascent on $\ell(\theta)$
- Newton-Raphson / Fisher scoring
- EM algorithm (for latent variable models)
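
As a minimal sketch of the Newton-Raphson option, the code below fits a Bernoulli success probability $\theta = \sigma(\eta)$ by iterating on the logit $\eta$. This model does have a closed-form MLE (the sample mean), which is used here only to check the iterative answer. The function name `newton_bernoulli_logit`, the starting point, and the tolerance are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_bernoulli_logit(x, eta0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson on l(eta) = s*eta - n*log(1 + exp(eta)), with theta = sigmoid(eta)."""
    x = np.asarray(x, dtype=float)
    n, s = x.size, x.sum()
    eta = eta0
    for _ in range(max_iter):
        p = sigmoid(eta)
        score = s - n * p              # first derivative of l(eta)
        hessian = -n * p * (1.0 - p)   # second derivative, always negative
        step = score / hessian
        eta = eta - step               # Newton update: eta - l'(eta) / l''(eta)
        if abs(step) < tol:
            break
    return sigmoid(eta)

x = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # hypothetical 0/1 data
theta_hat = newton_bernoulli_logit(x)
print(theta_hat, x.mean())  # the two values should agree closely
```

If the data are all zeros or all ones, the maximum lies at the boundary and $\eta$ diverges, so this sketch is only meaningful for mixed data.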
Examples
Bernoulli / Binomial
Data: $x_1, \ldots, x_n \in \{0, 1\}$, $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$
Log-likelihood:
\[\ell(\theta) = \sum_{i=1}^n [x_i \log \theta + (1-x_i) \log(1-\theta)]\]
Derivative:
\[\frac{\partial \ell}{\partial \theta} = \frac{\sum x_i}{\theta} - \frac{n - \sum x_i}{1-\theta} = 0\]
MLE:
\[\hat{\theta}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i\]
(sample mean = proportion of successes)
Normal Distribution
Data: $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$
Log-likelihood:
\[\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
MLE for mean:
\[\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}\]
MLE for variance:
\[\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\]
Note: This estimator is biased (it divides by $n$, not $n-1$); its expectation is $\frac{n-1}{n}\sigma^2$.
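
The two normal-distribution estimators and the bias in the variance can be illustrated with a short NumPy sketch. The sample below is synthetic, and the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)  # synthetic sample

n = x.size
mu_hat = x.sum() / n
var_mle = ((x - mu_hat) ** 2).sum() / n              # divides by n (the MLE)
var_unbiased = ((x - mu_hat) ** 2).sum() / (n - 1)   # divides by n - 1

# NumPy's ddof argument selects the divisor n - ddof.
assert np.isclose(var_mle, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))
print(mu_hat, var_mle, var_unbiased)
```

The MLE is always the smaller of the two variance estimates, by the factor $(n-1)/n$, and the gap vanishes as $n$ grows.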
Poisson Distribution
Data: $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$
Log-likelihood:
\[\ell(\lambda) = \sum_{i=1}^n [x_i \log \lambda - \lambda - \log(x_i!)]\]
MLE:
\[\hat{\lambda}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}\]
Multinomial / Categorical
Data: counts $n_1, \ldots, n_k$ for $k$ categories, $\sum n_i = n$
Log-likelihood:
\[\ell(p_1, \ldots, p_k) = \sum_{i=1}^k n_i \log p_i\]
MLE (with constraint $\sum p_i = 1$):
\[\hat{p}_i = \frac{n_i}{n}\]
(relative frequencies)
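
The constrained maximization behind this result can be sketched with a Lagrange multiplier. The multiplier symbol $\nu$ is introduced here for illustration and does not appear elsewhere in these notes.
\[\mathcal{L}(p_1, \ldots, p_k, \nu) = \sum_{i=1}^k n_i \log p_i + \nu \left(1 - \sum_{i=1}^k p_i\right)\]
Setting the derivative with respect to each $p_i$ to zero:
\[\frac{\partial \mathcal{L}}{\partial p_i} = \frac{n_i}{p_i} - \nu = 0 \quad\Rightarrow\quad p_i = \frac{n_i}{\nu}\]
Substituting into the constraint fixes the multiplier:
\[\sum_{i=1}^k \frac{n_i}{\nu} = 1 \quad\Rightarrow\quad \nu = \sum_{i=1}^k n_i = n \quad\Rightarrow\quad \hat{p}_i = \frac{n_i}{n}\]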
Properties of MLE
Consistency
Under standard regularity conditions, as $n \to \infty$ the estimator converges in probability to the true parameter value: $\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0$.
Asymptotic Normality
As $n \to \infty$:
\[\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})\]
where $I(\theta)$ is the Fisher information of a single observation.
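
A small simulation can make this limit concrete. The sketch below assumes a Bernoulli model, for which the per-observation Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so the limiting variance is $\theta_0(1-\theta_0)$. The true parameter, sample size, and number of replications are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.3     # assumed true parameter for the simulation
n = 500          # sample size per simulated dataset
reps = 20_000    # number of simulated datasets

# Each draw is the success count of one dataset; the Bernoulli MLE is count / n.
successes = rng.binomial(n, theta0, size=reps)
theta_hat = successes / n

z = np.sqrt(n) * (theta_hat - theta0)
empirical_var = z.var()
theoretical_var = theta0 * (1.0 - theta0)  # inverse Fisher information for Bernoulli

print(empirical_var, theoretical_var)  # the two should be close
```

A histogram of `z` would also look approximately Gaussian, although this sketch only compares variances.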
Asymptotic Efficiency
The MLE attains the Cramér-Rao lower bound asymptotically: for large $n$ its variance approaches $1/(n I(\theta_0))$, the smallest variance an unbiased estimator can have.
Invariance
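If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$. For example, the MLE of the standard deviation $\sigma$ of a normal distribution is $\sqrt{\hat{\sigma}^2_{\text{MLE}}}$.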
Fisher Information
Measures how much information a single observation $X$ provides about $\theta$:
\[I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log P(X \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log P(X \mid \theta)\right]\]
The two forms agree under standard regularity conditions.
Cramér-Rao bound: Any unbiased estimator $\hat{\theta}$ based on $n$ i.i.d. observations satisfies:
\[\text{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}\]
MLE vs MAP
| Aspect | MLE | MAP |
|---|---|---|
| Formula | $\arg\max_\theta P(D \mid \theta)$ | $\arg\max_\theta P(D \mid \theta) P(\theta)$ |
| Prior | None (implicit uniform) | Explicit |
| Result | Point estimate | Point estimate |
| Regularization | None | Prior acts as regularizer |
| Bayesian? | Frequentist | Hybrid (point estimate from Bayesian framework) |
Connection: MAP with a zero-mean Gaussian prior is equivalent to MLE with L2 regularization.
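
A short derivation shows where the L2 penalty comes from. It assumes an isotropic prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, where the prior scale $\tau$ is a symbol introduced here for illustration. Taking logs of the MAP objective:
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log P(D \mid \theta) + \log P(\theta)\right] = \arg\max_\theta \left[\ell(\theta) - \frac{1}{2\tau^2} \|\theta\|_2^2\right]\]
because $\log P(\theta) = -\frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant that does not depend on $\theta$. The penalty weight is $1/(2\tau^2)$: a tighter prior (small $\tau$) regularizes more strongly, and letting $\tau \to \infty$ recovers the MLE.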
MLE in Machine Learning
Linear Regression
Assuming $y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i$ with i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$:
\[\hat{\mathbf{w}}_{\text{MLE}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{w}^T \mathbf{x}_i)^2\]
MLE under Gaussian noise = least squares.
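
The equivalence can be illustrated on synthetic data. This is a sketch under assumed ground-truth weights and noise level; `gaussian_log_likelihood` is a helper defined here for illustration. It solves the least-squares problem with `np.linalg.lstsq` and checks that perturbing the solution does not raise the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])   # assumed ground truth for the simulation
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Least-squares solution, which is the MLE of w under Gaussian noise.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def gaussian_log_likelihood(w, sigma2):
    resid = y - X @ w
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - resid @ resid / (2.0 * sigma2)

# The noise-variance MLE is the mean squared residual at w_hat.
sigma2_hat = np.mean((y - X @ w_hat) ** 2)

# A perturbed weight vector should not achieve a higher log-likelihood.
w_perturbed = w_hat + 0.01 * rng.normal(size=d)
ll_hat = gaussian_log_likelihood(w_hat, sigma2_hat)
ll_perturbed = gaussian_log_likelihood(w_perturbed, sigma2_hat)
print(w_hat, ll_hat >= ll_perturbed)
```

For any fixed $\sigma^2$, the log-likelihood depends on $\mathbf{w}$ only through the residual sum of squares, which is why the two problems share a solution.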
Logistic Regression
For binary classification with $P(y=1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$:
\[\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \sum_{i=1}^n [y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T \mathbf{x}_i))]\]
MLE = minimizing cross-entropy loss.
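
Logistic regression has no closed-form MLE, so the objective above is maximized iteratively. The following is a minimal gradient-ascent sketch on synthetic data. The learning rate, iteration count, ground-truth weights, and the function names `log_likelihood` and `fit_logistic` are all assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    z = X @ w
    # y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) simplifies to y*z - log(1 + exp(z)).
    return np.sum(y * z - np.logaddexp(0.0, z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Gradient ascent on the average log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w)) / len(y)  # gradient of the average log-likelihood
        w = w + lr * grad
    return w

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])  # first column is an intercept
w_true = np.array([-0.5, 2.0, -1.0])                            # assumed ground truth
y = (rng.random(500) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_logistic(X, y)
print(w_hat)
print(log_likelihood(w_hat, X, y) >= log_likelihood(np.zeros(3), X, y))
```

The negative of `log_likelihood` divided by the number of samples is the mean cross-entropy loss, so ascending one is the same as descending the other.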
Neural Networks
Training neural networks with cross-entropy or MSE loss is equivalent to MLE:
- Cross-entropy: MLE for categorical output
- MSE: MLE for Gaussian output with fixed variance
Problems with MLE
Overfitting
MLE maximizes fit to the training data and includes no penalty for model complexity.
Solution: Use MAP or add regularization.
No Uncertainty Quantification
MLE by itself returns a single point estimate with no accompanying measure of uncertainty.
Solution: Full Bayesian inference or bootstrap.
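
As a sketch of the bootstrap option, the code below resamples a synthetic 0/1 dataset with replacement and recomputes the Bernoulli MLE each time. The dataset, the number of resamples, and the percentile interval are illustrative assumptions; other bootstrap variants exist.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=400)  # synthetic 0/1 data with assumed theta = 0.3

theta_hat = x.mean()                # Bernoulli MLE on the original sample

# Nonparametric bootstrap: resample the data with replacement and re-estimate.
n_boot = 5000
idx = rng.integers(0, x.size, size=(n_boot, x.size))
boot_estimates = x[idx].mean(axis=1)

boot_se = boot_estimates.std(ddof=1)
ci_low, ci_high = np.percentile(boot_estimates, [2.5, 97.5])

# Plug-in standard error from the asymptotic variance, for comparison.
asymptotic_se = np.sqrt(theta_hat * (1.0 - theta_hat) / x.size)
print(theta_hat, boot_se, asymptotic_se, (ci_low, ci_high))
```

For this simple model the bootstrap standard error and the asymptotic one should nearly coincide; the bootstrap becomes more useful when no convenient variance formula is available.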
Non-identifiability
Multiple parameter values may give the same likelihood.
Example: Label switching in mixture models.
Boundary Solutions
MLE can give degenerate solutions (e.g., zero variance, perfect separation).
Solution: Add prior / regularization.
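
A tiny example shows both the problem and the fix. It assumes a Bernoulli sample in which every trial succeeded, and a Beta(2, 2) prior for the MAP estimate. The prior choice is purely illustrative.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 1])  # hypothetical sample: five successes, no failures

theta_mle = x.mean()           # equals 1.0, a boundary estimate

# Under theta = 1, a single future failure has probability zero.
with np.errstate(divide="ignore"):
    log_prob_failure_mle = np.log(1.0 - theta_mle)  # -inf

# MAP under a Beta(a, b) prior (an assumed choice): the posterior is
# Beta(s + a, n - s + b), and its mode is the MAP estimate.
a, b = 2.0, 2.0
s, n = x.sum(), x.size
theta_map = (s + a - 1.0) / (n + a + b - 2.0)
log_prob_failure_map = np.log(1.0 - theta_map)      # finite

print(theta_mle, log_prob_failure_mle)
print(theta_map, log_prob_failure_map)
```

With these numbers the MAP estimate is $6/7$, which keeps a nonzero probability for outcomes that were not observed.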