Statistical Modeling
Statistical modeling uses probability distributions to describe data-generating processes and make inferences or predictions.
General Framework
A statistical model consists of:
- Response variable (dependent): $Y$ (what we want to predict/explain)
- Predictor variables (independent): $X_1, X_2, \ldots, X_p$ (features, covariates)
- Parameters: $\theta$ (unknown quantities to estimate)
- Error term: $\epsilon$ (unexplained variability)
General form:
\[Y = f(X; \theta) + \epsilon\]
Linear Regression
Simple Linear Regression
One predictor variable:
\[Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\]
Parameters:
- $\beta_0$: intercept (expected $Y$ when $X = 0$)
- $\beta_1$: slope (expected change in $Y$ per one-unit increase in $X$)
OLS estimates:
\[\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
Multiple Linear Regression
Multiple predictors:
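\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]
Matrix form, where $X$ is the $n \times (p+1)$ design matrix whose first column is all ones:
\[\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
OLS solution (when $X^T X$ is invertible):
\[\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}\]
Assumptions: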
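- Linearity: the expected response is linear in the parameters
- Independence: errors are independent across observations
- Homoscedasticity: constant error variance
- Normality: errors are normally distributed (needed for exact inference, not for the OLS point estimates)
- No perfect multicollinearity: no predictor is an exact linear combination of the others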
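The closed-form estimates above can be checked numerically. The following is a minimal sketch on synthetic data (the data-generating values 2 and 3 and all variable names are illustrative assumptions, not part of the original notes); it computes the simple-regression slope and intercept and confirms that the matrix formula gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = 2 + 3x + noise
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Simple linear regression: closed-form slope and intercept
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Matrix form: design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# Solve the normal equations (X^T X) beta = X^T y without forming an inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq solves the same least-squares problem via a more stable factorization
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system rather than inverting $X^T X$ explicitly is a common numerical choice; `np.linalg.lstsq` tends to be the safer option when predictors are nearly collinear.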
Generalized Linear Models (GLMs)
Extension of linear regression for non-normal responses.
Components:
- Random component: $Y$ follows an exponential-family distribution (e.g., normal, binomial, Poisson)
- Systematic component: Linear predictor $\eta = X\boldsymbol{\beta}$
- Link function: $g(\mu) = \eta$ relates mean to linear predictor
Logistic Regression
Binary outcome $Y \in \{0, 1\}$:
\[\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
\[P(Y=1) = \sigma(\mathbf{x}^T \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^T \boldsymbol{\beta}}}\]
Likelihood:
\[L(\boldsymbol{\beta}) = \prod_{i=1}^n P(Y_i=1)^{y_i} (1 - P(Y_i=1))^{1-y_i}\]
MLE: Maximize the log-likelihood (no closed form; use iterative methods such as Newton-Raphson or IRLS).
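One way to carry out the iterative maximization is Newton-Raphson on the log-likelihood. The sketch below is an illustration under assumptions: the function names, the fixed iteration count, and the simulated data are hypothetical, and it omits safeguards (step halving, separation checks) that production fitters include.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    """Maximize the Bernoulli log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                 # gradient of the log-likelihood
        w = p * (1.0 - p)                    # Bernoulli variances
        info = X.T @ (X * w[:, None])        # negative Hessian (Fisher information)
        beta = beta + np.linalg.solve(info, grad)
    return beta

# Simulated data with assumed true coefficients (intercept -0.5, slope 1.5)
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.uniform(size=n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic_newton(X, y)

# At the maximum the gradient (score) should be numerically zero
score = X.T @ (y - sigmoid(X @ beta_hat))
```

Each Newton step here is equivalent to one weighted least-squares solve, which is why the same procedure is often described as iteratively reweighted least squares.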
Poisson Regression
Count outcome $Y \in \{0, 1, 2, \ldots\}$:
\[\log(E[Y]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
\[E[Y] = \exp(\mathbf{x}^T \boldsymbol{\beta})\]
Used for: count data, rate data (with an offset for exposure), contingency tables.
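Because of the log link, coefficients act multiplicatively on the mean: a one-unit increase in a predictor multiplies $E[Y]$ by $e^{\beta_j}$. A small sketch with hypothetical coefficient values (0.5 and 0.3 are assumptions chosen only for illustration):

```python
import numpy as np

# Hypothetical coefficients: intercept and one predictor
beta = np.array([0.5, 0.3])

def poisson_mean(x, beta):
    """E[Y | x] under the log link; x includes a leading 1 for the intercept."""
    return np.exp(x @ beta)

mu_at_1 = poisson_mean(np.array([1.0, 1.0]), beta)   # predictor = 1
mu_at_2 = poisson_mean(np.array([1.0, 2.0]), beta)   # predictor = 2

# Ratio of means for a one-unit increase equals exp(beta_1)
rate_ratio = mu_at_2 / mu_at_1
```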
Multinomial Logistic Regression
Multi-class outcome $Y \in \{1, \ldots, K\}$:
\[P(Y = k) = \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\boldsymbol{\beta}_j^T \mathbf{x}}}\]
Reference class (usually $K$) has $\boldsymbol{\beta}_K = \mathbf{0}$ so that the parameters are identifiable.
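The class probabilities with a zero reference vector can be sketched as follows. The coefficient matrix `B` and the input `x` are hypothetical values; the function stores only the $K-1$ free coefficient vectors and appends a zero score for the reference class.

```python
import numpy as np

def multinomial_probs(x, B):
    """Class probabilities for one observation.

    B has shape (K-1, d): coefficient vectors for classes 1..K-1.
    The reference class K has its coefficients fixed at zero.
    """
    scores = np.append(B @ x, 0.0)      # linear predictor of class K is 0
    scores = scores - scores.max()      # shift for numerical stability
    expo = np.exp(scores)
    return expo / expo.sum()

# Hypothetical coefficients for K = 3 classes and d = 2 inputs (intercept + one feature)
B = np.array([[0.2, 1.0],
              [-0.4, 0.5]])
x = np.array([1.0, 0.7])

probs = multinomial_probs(x, B)
```

With this parameterization, $\log\big(P(Y=k)/P(Y=K)\big) = \boldsymbol{\beta}_k^T \mathbf{x}$, so each coefficient vector describes log-odds relative to the reference class.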
Model Selection
Goodness of Fit Measures
R-squared (coefficient of determination):
\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\]
Proportion of variance explained by the model.
Adjusted R-squared:
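\[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]
where $n$ = number of observations, $p$ = number of predictors. Penalizes the number of predictors, so it increases only when an added predictor improves the fit enough.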
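Both measures follow directly from the residual and total sums of squares. A minimal sketch, using made-up observed and fitted values (the arrays are illustrative and not the output of an actual fit):

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative observed values and hypothetical fitted values
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([3.5, 4.5, 7.5, 8.5, 11.5, 12.5])

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)
```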
AIC (Akaike Information Criterion):
\[\text{AIC} = 2k - 2\log(\hat{L})\]
where $k$ = number of parameters, $\hat{L}$ = maximized likelihood.
BIC (Bayesian Information Criterion):
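\[\text{BIC} = k \log(n) - 2\log(\hat{L})\]
where $n$ = number of observations. Penalizes model complexity more strongly than AIC whenever $\log(n) > 2$ (that is, $n \geq 8$). For both criteria, lower values indicate a better trade-off between fit and complexity.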
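For a linear model with normal errors, the maximized log-likelihood has a closed form in terms of the residual sum of squares, so both criteria are easy to compute. The sketch below assumes that Gaussian setting and counts the error variance as one of the $k$ parameters; the data and names are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(y, y_hat):
    """Maximized Gaussian log-likelihood, using the MLE sigma^2 = RSS / n."""
    n = len(y)
    sigma2_mle = np.sum((y - y_hat) ** 2) / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1.0)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * np.log(n) - 2 * log_lik

# Synthetic data (illustrative)
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
log_lik = gaussian_log_likelihood(y, X @ beta_hat)

k = X.shape[1] + 1            # regression coefficients plus the error variance
model_aic = aic(log_lik, k)
model_bic = bic(log_lik, k, n)
```

The two criteria differ only in the penalty term, so their gap is $k(\log n - 2)$ for any model fitted to the same data.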
Model Selection Strategies
Forward selection: Start with no predictors, add most significant one at a time.
Backward elimination: Start with all predictors, remove least significant one at a time.
Stepwise selection: Combination of forward and backward.
Best subset: Fit all $2^p$ possible predictor combinations and keep the best (computationally expensive for large $p$).
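Forward selection can be sketched as a greedy loop. Note the substitution: instead of significance tests, this illustration adds the predictor that lowers a Gaussian AIC the most and stops when no candidate improves it. The helper names and the simulated data are assumptions.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit with normal errors (error variance counted in k)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)
    k = X.shape[1] + 1
    return 2 * k - 2 * log_lik

def forward_select(X, y):
    """Greedily add the column that most reduces AIC; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best_aic = gaussian_aic(intercept, y)    # start from the intercept-only model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {
            j: gaussian_aic(np.hstack([intercept, X[:, selected + [j]]]), y)
            for j in candidates
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:
            break                            # no candidate improves the criterion
        selected.append(j_best)
        best_aic = scores[j_best]
    return selected

# Simulated data: only columns 0 and 2 truly affect y
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=n)

chosen = forward_select(X, y)
```

Backward elimination follows the same pattern in reverse, starting from all predictors and dropping the one whose removal improves the criterion most.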
Regularization
Ridge Regression (L2)
\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left[\sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p \beta_j^2\right]\]
Solution:
\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}\]
Effect: Shrinks coefficients toward zero and stabilizes estimates under multicollinearity, since $X^T X + \lambda I$ is invertible for any $\lambda > 0$.
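The closed-form ridge solution is a one-line linear solve. The sketch below assumes centered data so that no intercept is needed (the penalty as written applies to every coefficient), and uses two nearly collinear synthetic predictors to show the shrinkage; the names and the value $\lambda = 10$ are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two nearly collinear predictors (illustrative)
rng = np.random.default_rng(4)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# Center so the model needs no intercept term
X = X - X.mean(axis=0)
y = y - y.mean()

lam = 10.0
beta_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X, y, lam=lam)
```

In practice predictors are usually standardized before applying the penalty, because the amount of shrinkage depends on each predictor's scale.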
Lasso (L1)
\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left[\sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p |\beta_j|\right]\]
Effect: Produces sparse solutions (some coefficients exactly zero).
Used for: feature selection, high-dimensional settings.
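Unlike ridge, the lasso has no closed-form solution in general. One common approach is cyclic coordinate descent with soft-thresholding, sketched below for the objective exactly as written above (squared error without a $1/2$ factor, hence the threshold $\lambda/2$). The fixed sweep count, data, and $\lambda = 100$ are assumptions for illustration.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning 0 when |rho| <= t."""
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)                  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with predictor j's contribution added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data: only the first two of six predictors matter
rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

lam = 100.0
beta_lasso = lasso_coordinate_descent(X, y, lam)
```

The soft-threshold step is what produces exact zeros: any coefficient whose partial correlation with the residual falls below the threshold is set to zero rather than merely shrunk.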
Elastic Net
Combines L1 and L2:
\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left[\sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2\right]\]
Diagnostics
Residual Analysis
Residuals: $e_i = y_i - \hat{y}_i$
Check:
- Residuals vs fitted: should show no pattern (linearity)
- Q-Q plot: points should lie close to the reference line if residuals are approximately normal
- Scale-location: constant variance (homoscedasticity)
- Leverage vs residuals: identify influential points
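The quantities behind these plots can be computed directly from an OLS fit. The sketch below uses synthetic data and illustrative names; Cook's distance is included as one common influence measure, which is an addition to the checks listed above.

```python
import numpy as np

# Synthetic data (illustrative)
rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# Leverage: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Internally studentized residuals (used in scale-location plots)
p = X.shape[1]
sigma2_hat = np.sum(residuals ** 2) / (n - p)
std_resid = residuals / np.sqrt(sigma2_hat * (1.0 - leverage))

# Cook's distance combines residual size and leverage into one influence score
cooks_d = std_resid ** 2 * leverage / (p * (1.0 - leverage))
```

Plotting `residuals` against `fitted`, and `std_resid` against `leverage`, reproduces the first and last checks in the list.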
Multicollinearity
Variance Inflation Factor (VIF):
\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]
where $R_j^2$ is the $R^2$ from regressing $X_j$ on the other predictors.
Common rules of thumb treat $\text{VIF}_j > 5$ (stricter) or $\text{VIF}_j > 10$ (more lenient) as a sign of problematic multicollinearity.
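The VIF definition translates into one auxiliary regression per predictor. A minimal sketch, assuming the input matrix holds predictors only (an intercept is added inside each auxiliary regression); the function name and the synthetic predictors are illustrative.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress X_j on the remaining predictors plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2_j)
    return out

# x2 nearly duplicates x1; x3 is unrelated to both
rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
```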
Cross-Validation
k-fold CV:
- Split data into $k$ folds
- Train on $k-1$ folds, test on held-out fold
- Repeat for all folds, average performance
Used for: model selection, hyperparameter tuning, estimating generalization error.
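The three steps above can be sketched with plain NumPy. This illustration estimates the out-of-sample mean squared error of an OLS fit; the function names, the choice of MSE as the metric, and the synthetic data are assumptions.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def cv_mse(X, y, k=5, seed=0):
    """Average held-out mean squared error of OLS over k folds."""
    folds = k_fold_indices(len(y), k, np.random.default_rng(seed))
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on k-1 folds, evaluate on the held-out fold
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        pred = X[test_idx] @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data with unit noise variance (illustrative)
rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

mse = cv_mse(X, y, k=5)
example_folds = k_fold_indices(n, 5, np.random.default_rng(0))
```

Any preprocessing that learns from the data (scaling, imputation, feature selection) should be fitted inside each training split, otherwise information from the held-out fold leaks into the estimate.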
Missing Data
Types of Missingness
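MCAR (Missing Completely At Random): Missingness is independent of both observed and unobserved data.
MAR (Missing At Random): Missingness depends only on observed data, not on the missing values themselves.
MNAR (Missing Not At Random): Missingness depends on unobserved data, including the missing values themselves, even after conditioning on observed data.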
Handling Missing Data
Listwise deletion: Remove rows with any missing values (biased if not MCAR).
Mean/median imputation: Replace with column mean/median (underestimates variance).
Multiple imputation: Create multiple imputed datasets, analyze each, and pool the results (e.g., with Rubin's rules).
Model-based: Use models that handle missing values natively (e.g., XGBoost).
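The variance shrinkage caused by mean imputation is easy to demonstrate. The sketch below removes values completely at random from a synthetic sample (the sample size, missing count, and names are illustrative) and compares the variance of the complete cases with the variance after mean imputation.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=10.0, scale=2.0, size=100)

# Remove 30 values completely at random (MCAR) for illustration
missing = np.zeros(x.size, dtype=bool)
missing[rng.choice(x.size, size=30, replace=False)] = True
x_obs = np.where(missing, np.nan, x)

# Listwise deletion: keep only the observed values
complete = x_obs[~np.isnan(x_obs)]

# Mean imputation: fill every gap with the observed mean
x_imputed = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

var_complete = complete.var(ddof=1)
var_imputed = x_imputed.var(ddof=1)
```

The imputed values add nothing to the sum of squared deviations while increasing the sample size, so the sample variance shrinks by the factor $(m-1)/(n-1)$, where $m$ is the number of observed values and $n$ the total.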