Sampling Methods
Sampling methods generate random draws from a probability distribution. In ML, they are used for:
- Approximating intractable integrals (Monte Carlo)
- Generating synthetic data
- Bayesian inference (posterior sampling)
- Training generative models
Simple Random Sampling
With replacement: Each draw is independent, same distribution.
Without replacement: Each item can be selected only once.
Inverse Transform Sampling
If $F$ is the CDF of the target distribution and $U \sim \text{Uniform}(0, 1)$:
\[X = F^{-1}(U)\]has distribution $F$.
Algorithm:
- Sample $u \sim \text{Uniform}(0, 1)$
- Return $x = F^{-1}(u)$
Limitation: Requires invertible CDF (not available for many distributions).
Rejection Sampling
Sample from a target distribution $P(x)$ using a proposal distribution $Q(x)$ that is easier to sample from.
Algorithm:
- Sample $x \sim Q(x)$
- Sample $u \sim \text{Uniform}(0, 1)$
- Accept $x$ if $u < \frac{P(x)}{M Q(x)}$, where $M$ satisfies $P(x) \leq M Q(x)$ for all $x$
- Otherwise, reject and repeat
Acceptance rate: $1/M$ (higher $M$ = more rejections = less efficient)
Advantage: Exact samples from target distribution.
Disadvantage: Inefficient in high dimensions (most samples rejected).
Importance Sampling
Estimate expectations under $P(x)$ using samples from a different distribution $Q(x)$.
For $E_{x \sim P}[f(x)]$:
\[E_{x \sim P}[f(x)] = \int f(x) \frac{P(x)}{Q(x)} Q(x) dx \approx \frac{1}{N} \sum_{i=1}^n w_i f(x_i)\]where $x_i \sim Q(x)$ and $w_i = \frac{P(x_i)}{Q(x_i)}$ (importance weights).
Self-normalized importance sampling:
\[E_{x \sim P}[f(x)] \approx \sum_{i=1}^n \tilde{w}_i f(x_i), \quad \tilde{w}_i = \frac{w_i}{\sum_j w_j}\]Used when $P(x)$ is known only up to a normalizing constant.
Problem: High variance if $Q$ is very different from $P$.
Markov Chain Monte Carlo (MCMC)
Generate samples using a Markov chain whose stationary distribution is the target $P(x)$.
Metropolis-Hastings Algorithm
Algorithm:
- Initialize $\theta_0$
- For $t = 0, 1, 2, \ldots$:
- Sample proposal $\theta’ \sim q(\theta’ \mid \theta_t)$
- Compute acceptance ratio: \(\alpha = \min\left(1, \frac{P(\theta' | D) q(\theta_t | \theta')}{P(\theta_t | D) q(\theta' | \theta_t)}\right)\)
- Accept $\theta_{t+1} = \theta’$ with probability $\alpha$
- Otherwise $\theta_{t+1} = \theta_t$
Key insight: Only need ratios of posteriors, so normalizing constant $P(D)$ cancels out.
Special case: Symmetric proposal ($q(\theta’ \mid \theta) = q(\theta \mid \theta’)$):
\[\alpha = \min\left(1, \frac{P(\theta' | D)}{P(\theta_t | D)}\right)\]Gibbs Sampling
Special case of Metropolis-Hastings where proposals are from conditional distributions.
Algorithm:
- Initialize $(\theta_1^{(0)}, \ldots, \theta_k^{(0)})$
- For $t = 0, 1, 2, \ldots$:
- Sample $\theta_1^{(t+1)} \sim P(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_k^{(t)}, D)$
- Sample $\theta_2^{(t+1)} \sim P(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, D)$
- Continue for all variables
Acceptance rate: Always 1 (no rejections).
Requirement: Need to sample from all conditional distributions.
Hamiltonian Monte Carlo (HMC)
Uses gradient information to propose distant states with high acceptance probability.
Idea: Introduce auxiliary momentum variable $r$ and simulate Hamiltonian dynamics:
\[H(\theta, r) = -\log P(\theta | D) + \frac{1}{2} r^T M^{-1} r\]Algorithm:
- Sample momentum $r \sim \mathcal{N}(0, M)$
- Simulate Hamiltonian dynamics using leapfrog integration
- Accept/reject based on Hamiltonian change
Advantages:
- Explores parameter space more efficiently than random-walk MCMC
- Fewer correlations between samples
Implementation: No-U-Turn Sampler (NUTS) in Stan, PyMC, NumPyro.
Block Sampling and Slice Sampling
Block sampling: Update groups of correlated variables together (improves mixing).
Slice sampling:
- Introduce auxiliary variable $u \sim \text{Uniform}(0, P(\theta))$
- Sample $\theta$ uniformly from the “slice” ${\theta : P(\theta) > u}$
- No tuning parameters needed
Convergence Diagnostics
How to know if MCMC has converged to the stationary distribution?
Trace Plots
Plot parameter values vs iteration. Look for:
- Stationary (no trend)
- Good mixing (rapid fluctuations, not stuck)
Gelman-Rubin Statistic (R-hat)
Run multiple chains from different starting points. Compare within-chain and between-chain variance:
\[\hat{R} = \sqrt{\frac{\text{between-chain variance}}{\text{within-chain variance}}}\]$\hat{R} < 1.1$ indicates convergence.
Effective Sample Size (ESS)
Number of independent samples equivalent to the correlated MCMC samples:
\[\text{ESS} = \frac{N}{1 + 2 \sum_k \rho_k}\]where $\rho_k$ is autocorrelation at lag $k$.
Sequential Monte Carlo (Particle Filters)
For state-space models and time series.
Algorithm:
- Initialize particles ${x_0^{(i)}}$ with weights ${w_0^{(i)}}$
- For each time step:
- Propagate particles through transition model
- Update weights based on observation likelihood
- Resample particles (keep high-weight, discard low-weight)
Used in: tracking, robotics, online inference.
Monte Carlo Integration
Approximate integrals using random samples:
\[\int f(x) P(x) dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim P(x)\]Error: $O(1/\sqrt{N})$ (independent of dimension!)
Applications:
- Computing expectations
- Marginal likelihood estimation
- Predictive distributions
Variance Reduction Techniques
Antithetic Variates
Use negatively correlated samples to reduce variance.
Control Variates
Subtract a correlated variable with known expectation:
\[E[f(X)] = E[f(X) - c(g(X) - E[g(X)))]\]Stratified Sampling
Divide domain into strata, sample from each proportionally.