Bayesian Inference
Bayesian inference treats parameters as random variables and updates beliefs using Bayes’ theorem:
\[P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}\]
- $\theta$: unknown parameter(s)
- $D$: observed data
- $P(\theta)$: prior (belief before seeing data)
- $P(D \mid \theta)$: likelihood (probability of data given parameters)
- $P(\theta \mid D)$: posterior (updated belief after seeing data)
- $P(D)$: marginal likelihood / evidence (normalizing constant)
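
As a minimal sketch of the update, consider a parameter that can take only two values. The two candidate coins, the equal prior, and the flip data below are made up for illustration; the function name `likelihood` is likewise an assumption, not part of any library.

```python
# Two hypothetical coins: a fair one and one biased towards heads.
thetas = [0.5, 0.8]   # candidate values of theta (assumed for illustration)
prior = [0.5, 0.5]    # P(theta): equal belief in each coin
data = [1, 1, 0, 1]   # observed flips, 1 = heads (made-up data)

def likelihood(theta, flips):
    """P(D | theta) for independent Bernoulli flips."""
    p = 1.0
    for x in flips:
        p *= theta if x == 1 else 1.0 - theta
    return p

# Numerator of Bayes' theorem for each candidate value
unnormalized = [likelihood(t, data) * p for t, p in zip(thetas, prior)]
evidence = sum(unnormalized)                      # P(D)
posterior = [u / evidence for u in unnormalized]  # P(theta | D)
print(posterior)
```

Three heads in four flips shift belief towards the biased coin, but the small sample leaves substantial mass on the fair one.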
The Evidence (Marginal Likelihood)
\[P(D) = \int P(D \mid \theta) P(\theta) \, d\theta\]
For discrete parameters, replace the integral with a sum.
Role:
- Normalizes the posterior to sum/integrate to 1
- Used for model comparison: higher evidence means the data are more probable under that model, averaged over its prior
- Often intractable for continuous parameters (requires approximation)
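
For a one-dimensional parameter the integral can simply be evaluated numerically. The sketch below assumes a Bernoulli likelihood with a Uniform(0, 1) prior and illustrative data (3 successes in 4 trials), and checks a midpoint-rule estimate against the closed form available in this special case.

```python
import math

k, n = 3, 4  # hypothetical data: 3 successes in 4 Bernoulli trials

def integrand(theta):
    # P(D | theta) * P(theta), with a Uniform(0, 1) prior so P(theta) = 1
    return theta**k * (1.0 - theta) ** (n - k)

# Midpoint rule on [0, 1] with m equal-width cells
m = 10_000
evidence_numeric = sum(integrand((i + 0.5) / m) for i in range(m)) / m

# Closed form for this special case: the Beta function B(k + 1, n - k + 1)
evidence_exact = math.exp(
    math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
)
print(evidence_numeric, evidence_exact)
```

The same brute-force approach becomes impractical as the number of parameters grows, which is why approximations are needed in general.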
Conjugate Priors
A prior is conjugate to a likelihood if the posterior is in the same family as the prior.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli | Beta($\alpha, \beta$) | Beta($\alpha + \text{successes}, \beta + \text{failures}$) |
| Binomial | Beta($\alpha, \beta$) | Beta($\alpha + k, \beta + n-k$) |
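| Poisson | Gamma($\alpha, \beta$), with rate $\beta$ | Gamma($\alpha + \sum x_i, \beta + n$) |
| Normal (known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_n, \sigma_n^2$) with $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$ and $\mu_n = \sigma_n^2 (\mu_0/\sigma_0^2 + \sum x_i/\sigma^2)$ |
| Normal (known $\mu$) | Inverse-Gamma($a, b$) on $\sigma^2$ | Inverse-Gamma($a + n/2, b + \tfrac{1}{2}\sum (x_i - \mu)^2$) |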
| Multinomial | Dirichlet($\alpha_1, \ldots, \alpha_k$) | Dirichlet($\alpha_1 + n_1, \ldots, \alpha_k + n_k$) |
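
Each row of the table reduces inference to parameter arithmetic. A minimal sketch for the Poisson row, assuming the Gamma prior is parameterized by shape and rate; the function name and the counts are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior (shape, rate) for a Gamma(alpha, beta) prior on a Poisson rate."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical prior Gamma(2, 1) and three observed counts
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, [3, 5, 4])
posterior_mean = alpha_post / beta_post  # mean of Gamma(shape, rate) is shape / rate
print(alpha_post, beta_post, posterior_mean)
```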
Example: Beta-Bernoulli Model
Prior: $\theta \sim \text{Beta}(\alpha, \beta)$
\[P(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}\]
Likelihood: $D = \{x_1, \ldots, x_n\}$, $x_i \sim \text{Bernoulli}(\theta)$
\[P(D \mid \theta) = \theta^k (1-\theta)^{n-k}\]
where $k = \sum_i x_i$ (number of successes).
Posterior:
\[P(\theta \mid D) \propto P(D \mid \theta) P(\theta) \propto \theta^{k+\alpha-1} (1-\theta)^{n-k+\beta-1}\]
\[\theta \mid D \sim \text{Beta}(\alpha + k, \beta + n - k)\]
Interpretation: Prior acts like “pseudo-counts” added to observed data.
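
The update can be sketched in a few lines. The function name, the Beta(2, 2) prior, and the data are assumptions chosen for illustration; the second half shows that updating one observation at a time gives the same posterior as a single batch update.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    k = sum(data)   # number of successes
    n = len(data)   # number of trials
    return alpha + k, beta + n - k

data = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up data: 6 successes in 8 trials
batch = beta_bernoulli_update(2, 2, data)

# Sequential updating: yesterday's posterior is today's prior
a, b = 2, 2
for x in data:
    a, b = beta_bernoulli_update(a, b, [x])
sequential = (a, b)

print(batch, sequential)
```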
Point Estimation in Bayesian Framework
Maximum a Posteriori (MAP):
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta P(D \mid \theta) P(\theta)\]
Like MLE, but with the prior acting as a regularizer.
Posterior Mean:
\[\hat{\theta}_{\text{mean}} = E[\theta \mid D] = \int \theta P(\theta \mid D) \, d\theta\]
Minimizes squared error loss.
Posterior Median:
\[P(\theta \leq \hat{\theta}_{\text{median}} \mid D) = \tfrac{1}{2}\]
Minimizes absolute error loss (for a scalar parameter).
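
The three estimates generally differ for a skewed posterior. The sketch below assumes a hypothetical Beta(8, 4) posterior: the mode and mean use the standard closed forms for the Beta distribution, and the median is found by accumulating mass on a grid.

```python
import math

a, b = 8.0, 4.0  # hypothetical Beta posterior parameters

def beta_pdf(theta):
    """Density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    )

theta_map = (a - 1) / (a + b - 2)  # mode of Beta(a, b), valid for a > 1 and b > 1
theta_mean = a / (a + b)           # mean of Beta(a, b)

# Median: accumulate posterior mass on a midpoint grid until it reaches 0.5
m = 100_000
cdf = 0.0
theta_median = None
for i in range(m):
    theta = (i + 0.5) / m
    cdf += beta_pdf(theta) / m
    if cdf >= 0.5:
        theta_median = theta
        break

print(theta_map, theta_mean, theta_median)
```

Because this posterior is skewed to the left, the mean falls below the median, which falls below the mode.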
Bayesian Prediction
Predicting new data $x_{\text{new}}$ integrates over all possible parameter values:
\[P(x_{\text{new}} \mid D) = \int P(x_{\text{new}} \mid \theta) P(\theta \mid D) \, d\theta\]
This is the posterior predictive distribution.
Key difference from frequentist plug-in prediction: it accounts for parameter uncertainty instead of conditioning on a single point estimate.
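
For a Bernoulli likelihood with a conjugate Beta prior, the integral has a closed form. As a worked sketch, assuming the posterior is $\theta \mid D \sim \text{Beta}(\alpha + k, \beta + n - k)$ with $k$ successes in $n$ trials:

\[P(x_{\text{new}} = 1 \mid D) = \int_0^1 \theta \, P(\theta \mid D) \, d\theta = E[\theta \mid D] = \frac{\alpha + k}{\alpha + \beta + n}\]

With a uniform Beta(1, 1) prior and illustrative data $k = 6$, $n = 8$, this gives $7/10$, slightly pulled toward $1/2$ compared with the plug-in estimate $6/8$.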
Credible Intervals
A $100(1-\alpha)\%$ credible interval is a range $[a, b]$ such that:
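\[P(a \leq \theta \leq b \mid D) = 1 - \alpha\]
Interpretation: “Given the data and the model, there is a $1-\alpha$ probability that $\theta$ lies in this interval.” A frequentist confidence interval does not support this statement; its $1-\alpha$ refers to the long-run coverage of the interval-building procedure.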
Highest Density Interval (HDI): The narrowest interval containing $1-\alpha$ of the posterior mass.
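
Both kinds of interval are easy to estimate from posterior samples. In the sketch below, draws from a hypothetical Beta(8, 4) posterior stand in for real posterior samples, and the function names `central_interval` and `hdi` are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
# Stand-in for posterior draws: samples from a hypothetical Beta(8, 4) posterior
samples = sorted(random.betavariate(8, 4) for _ in range(20_000))

def central_interval(sorted_samples, mass=0.95):
    """Interval that leaves roughly equal numbers of samples in each tail."""
    n = len(sorted_samples)
    width = int(mass * n)
    start = (n - width) // 2
    return sorted_samples[start], sorted_samples[start + width]

def hdi(sorted_samples, mass=0.95):
    """Narrowest window spanning the same number of samples (assumes mass < 1)."""
    n = len(sorted_samples)
    width = int(mass * n)
    best = min(
        range(n - width),
        key=lambda i: sorted_samples[i + width] - sorted_samples[i],
    )
    return sorted_samples[best], sorted_samples[best + width]

central = central_interval(samples)
narrowest = hdi(samples)
print(central, narrowest)
```

For a symmetric posterior the two intervals nearly coincide; for a skewed one the HDI shifts toward the mode and is never wider than the central interval.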
Priors: Choosing and Interpreting
Informative priors: Encode domain knowledge (e.g., previous studies, physical constraints).
Weakly informative priors: Regularize without strongly influencing (e.g., Normal(0, 10) for regression coefficients).
Non-informative / reference priors: Minimal influence (e.g., Uniform, Jeffreys prior).
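Improper priors: Do not integrate to a finite value, so they are not probability distributions, but they can still yield proper posteriors (e.g., $P(\theta) \propto 1$ for real-valued parameters). Whether the posterior is proper must be checked case by case.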
Hierarchical Bayesian Models
Parameters themselves have priors with hyperparameters:
\[\theta \sim P(\theta \mid \phi), \quad \phi \sim P(\phi)\]
Example: Modeling student test scores across schools:
- Each school has its mean $\mu_j$
- School means come from a population distribution: $\mu_j \sim \text{Normal}(\mu_0, \tau^2)$
- Hyperpriors on $\mu_0$ and $\tau$
Benefit: Partial pooling shares statistical strength across groups.
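
Partial pooling can be illustrated with a simplified calculation. The sketch assumes the population parameters $\mu_0$ and $\tau$ and a within-school standard deviation `sigma` are known and fixed; in the full hierarchical model they would have hyperpriors and be inferred. Under those assumptions the posterior mean of each $\mu_j$ is a precision-weighted average of the school's sample mean and $\mu_0$. All names and numbers are hypothetical.

```python
def partial_pool(school_mean, n_students, sigma, mu0, tau):
    """Posterior mean of a school's mean under a Normal-Normal model
    with known sigma (within-school sd), mu0 and tau (population mean and sd)."""
    precision_data = n_students / sigma**2
    precision_prior = 1.0 / tau**2
    weight = precision_data / (precision_data + precision_prior)
    return weight * school_mean + (1.0 - weight) * mu0

# Two hypothetical schools with the same observed mean but different sizes
large_school = partial_pool(80.0, n_students=100, sigma=10.0, mu0=70.0, tau=5.0)
small_school = partial_pool(80.0, n_students=4, sigma=10.0, mu0=70.0, tau=5.0)
print(large_school, small_school)
```

The small school's estimate is shrunk further toward the population mean, because its own data carry less information.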
Computational Methods
Analytical Solutions
Only available for conjugate models or simple cases.
Grid Approximation
Discretize parameter space, compute posterior at each point.
- Simple but scales poorly with dimension
- Good for teaching and 1-2 parameter problems
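
A minimal sketch of grid approximation for a Bernoulli likelihood with a Beta prior, where the exact posterior mean is known and can be used as a check. The prior parameters, the data, and the grid size are assumptions chosen for illustration.

```python
alpha, beta = 2.0, 2.0  # hypothetical Beta prior
k, n = 6, 8             # made-up data: 6 successes in 8 trials

m = 1001
grid = [i / (m - 1) for i in range(m)]  # evenly spaced points on [0, 1]

# Unnormalized posterior: likelihood times prior at each grid point
unnormalized = [
    t ** (k + alpha - 1) * (1 - t) ** (n - k + beta - 1) for t in grid
]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]  # probabilities summing to 1

grid_mean = sum(t * p for t, p in zip(grid, posterior))
exact_mean = (alpha + k) / (alpha + beta + n)  # mean of the exact Beta posterior
print(grid_mean, exact_mean)
```

With $d$ parameters the grid needs $m^d$ points, which is the poor scaling noted above.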
Laplace Approximation
Approximate posterior with a Normal centered at the MAP:
\[P(\theta \mid D) \approx \mathcal{N}(\hat{\theta}_{\text{MAP}}, H^{-1})\]
where $H$ is the negative Hessian of $\log P(\theta \mid D)$ evaluated at the mode, so that $H$ is positive definite and $H^{-1}$ is a valid covariance.
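
A one-dimensional sketch, assuming a hypothetical Beta(8, 4) posterior. The variance of the approximating Normal is the inverse of the negative second derivative of the log posterior at the mode, estimated here by a central finite difference; the exact posterior standard deviation is printed for comparison.

```python
import math

a, b = 8.0, 4.0  # hypothetical Beta posterior parameters

def log_post(theta):
    """Log posterior up to an additive constant."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

theta_map = (a - 1) / (a + b - 2)  # mode of Beta(a, b) for a, b > 1

# Central finite-difference estimate of the second derivative at the mode
h = 1e-4
second_deriv = (
    log_post(theta_map + h) - 2 * log_post(theta_map) + log_post(theta_map - h)
) / h**2

laplace_var = -1.0 / second_deriv  # inverse of the negative second derivative
laplace_sd = math.sqrt(laplace_var)
exact_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(theta_map, laplace_sd, exact_sd)
```

The approximation is centred on the mode rather than the mean and ignores skewness, so it is closest when the posterior is nearly symmetric.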
Markov Chain Monte Carlo (MCMC)
Generate samples from the posterior using a Markov chain.
Metropolis-Hastings:
- Propose $\theta' \sim q(\theta' \mid \theta_t)$
- Accept with probability $\alpha = \min\left(1, \frac{P(\theta' \mid D) \, q(\theta_t \mid \theta')}{P(\theta_t \mid D) \, q(\theta' \mid \theta_t)}\right)$
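
The two steps above can be sketched as a random-walk sampler. This is a minimal illustration, not a production implementation: the target is a hypothetical Beta(8, 4) posterior known only up to its normalizing constant (the constant cancels in the acceptance ratio), the Gaussian proposal is symmetric so the $q$ terms cancel as well, and the step size and burn-in length are arbitrary choices.

```python
import math
import random

def log_target(theta, a=8.0, b=4.0):
    """Unnormalized log posterior; -inf outside the support (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def metropolis_hastings(n_samples, step=0.2, start=0.5, seed=0):
    rng = random.Random(seed)
    theta = start
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept with probability min(1, ratio); otherwise keep the current value
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples

draws = metropolis_hastings(20_000)[2_000:]  # drop an arbitrary burn-in
mh_mean = sum(draws) / len(draws)
print(mh_mean)
```

The sample mean should land close to the exact posterior mean $8/12$; repeated values in `draws` correspond to rejected proposals.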
Gibbs Sampling:
- Sample each parameter from its conditional distribution given others
- Special case of Metropolis-Hastings with acceptance = 1
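
A minimal sketch of Gibbs sampling for a target whose conditionals are known exactly: a standard bivariate Normal with correlation `rho`, for which $x \mid y \sim \text{Normal}(\rho y, 1 - \rho^2)$ and symmetrically for $y \mid x$. The target, the function name, and the burn-in length are assumptions chosen for illustration.

```python
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    rng = random.Random(seed)
    cond_sd = (1 - rho**2) ** 0.5  # sd of each full conditional
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)  # sample x | y
        y = rng.gauss(rho * x, cond_sd)  # sample y | x, using the new x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(50_000)
kept = samples[1_000:]  # discard a short burn-in
# Both marginals are N(0, 1), so the mean of x * y estimates rho
mean_xy = sum(x * y for x, y in kept) / len(kept)
print(mean_xy)
```

Every draw is accepted, but strongly correlated parameters make successive samples highly dependent, which slows mixing.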
Hamiltonian Monte Carlo (HMC):
- Uses gradient information for efficient exploration
- Implemented in Stan, PyMC, NumPyro
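
To show the mechanics only, here is a hand-written one-dimensional sketch targeting a standard Normal, with potential energy $U(q) = q^2/2$ and gradient $q$. It is not how Stan, PyMC, or NumPyro implement HMC (those libraries add adaptive tuning and automatic differentiation); the step size and number of leapfrog steps are arbitrary assumptions.

```python
import math
import random

def hmc_standard_normal(n_samples, step_size=0.2, n_leapfrog=10, seed=0):
    rng = random.Random(seed)
    potential = lambda q: 0.5 * q * q  # U(q) = -log density, up to a constant
    grad = lambda q: q                 # dU/dq
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)        # resample the auxiliary momentum
        q_new, p_new = q, p
        # Leapfrog integration: half step in p, full steps in q, half step in p
        p_new -= 0.5 * step_size * grad(q_new)
        for i in range(n_leapfrog):
            q_new += step_size * p_new
            if i < n_leapfrog - 1:
                p_new -= step_size * grad(q_new)
        p_new -= 0.5 * step_size * grad(q_new)
        # Metropolis correction for the integrator's energy error
        h_old = potential(q) + 0.5 * p * p
        h_new = potential(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

draws = hmc_standard_normal(20_000)
hmc_mean = sum(draws) / len(draws)
hmc_var = sum((d - hmc_mean) ** 2 for d in draws) / len(draws)
print(hmc_mean, hmc_var)
```

The gradient steers each proposal along a trajectory of nearly constant energy, so distant proposals are still accepted with high probability.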
Variational Inference
Approximate posterior with a simpler distribution $q(\theta)$ by minimizing KL divergence:
\[q^*(\theta) = \arg\min_q D_{\text{KL}}(q(\theta) \Vert P(\theta \mid D))\]
- Faster than MCMC but introduces approximation bias
- Scales to large datasets
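
The KL divergence to the posterior cannot be evaluated directly, because the posterior contains the intractable evidence. A standard identity, sketched here in the notation above, shows why the problem is still solvable:

\[\log P(D) = \underbrace{E_{q}\left[\log P(D \mid \theta) + \log P(\theta) - \log q(\theta)\right]}_{\text{ELBO}(q)} + D_{\text{KL}}(q(\theta) \Vert P(\theta \mid D))\]

Since $\log P(D)$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), which involves only the likelihood, the prior, and $q$ itself.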
Bayesian vs Frequentist
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables |
| Data | Random (one draw from repeated sampling) | Fixed once observed |
| Inference | Point estimates, confidence intervals | Full posterior, credible intervals |
| Prior information | Not incorporated | Explicitly modeled |
| Computation | Optimization | Integration / sampling |
| Interpretation | Long-run frequency | Degree of belief |