Ensemble Learning

Why do Ensembles work?

Ensemble learning combines predictions from multiple models (base learners) to produce a final prediction that is more accurate and robust than any individual model. The key insight: models make different errors; averaging reduces variance without increasing bias proportionally.

Ensemble prediction (regression):

\[\hat{y} = \frac{1}{M} \sum_{m=1}^M \hat{f}_m(x)\]

For classification: majority vote or averaging predicted probabilities.

Why Ensembles Work

From the Bias Variance Tradeoff: if $M$ models are uncorrelated and each has variance $\sigma^2$, the averaged prediction has variance:

\[\text{Var}[\bar{f}] = \rho\sigma^2 + \frac{1-\rho}{M}\sigma^2\]

As $M \to \infty$: variance $\to \rho\sigma^2$. Decorrelating base learners (reducing $\rho$) magnifies variance reduction.

Bagging (Bootstrap Aggregating)

Bagging trains $M$ models on $M$ different bootstrap samples of the training data. Predictions are averaged (regression) or majority-voted (classification).

Bootstrap sample: sample $n$ points from the training set with replacement. Each sample contains $\approx 63.2\%$ unique points; the remaining $\approx 36.8\%$ are out-of-bag (OOB).

OOB error: each sample is OOB for $\approx 1/3$ of trees. Average OOB predictions give a free validation error estimate without a separate val set.

Variance reduction: averaging over $M$ models reduces variance by a factor of $M$ (for independent models) or $\rho + (1-\rho)/M$ for correlated ones.

Bias: unchanged from individual models.

Random Forests

Bagging + random feature subspace at each split.

Key modifications:

Each tree is grown on a bootstrap sample.
At each split, only $m$ randomly chosen features are considered ($m \approx \sqrt{d}$ for classification, $m \approx d/3$ for regression).
Trees are grown to maximum depth (low bias, high variance; variance reduced by averaging).

Feature importance:

Impurity-based (Gini importance): mean decrease in impurity across all splits using that feature. Fast but biased toward high-cardinality features.
Permutation importance: measure decrease in OOB accuracy when a feature’s values are randomly permuted. Model-agnostic and unbiased.

Boosting

Boosting trains base learners sequentially, each correcting the errors of the previous ensemble. Reduces bias; can also reduce variance.

AdaBoost

Maintains a distribution $D_t$ over training samples. Misclassified samples get higher weight in the next round.

For weak learners $h_t$ with weighted error $\epsilon_t$:

\[\alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}\]

Update weights:

\[D_{t+1}(i) \propto D_t(i) \exp(-\alpha_t y_i h_t(x_i))\]

Final prediction:

\[F(x) = \text{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\]

Training error bound: $\text{err}{\text{train}} \leq \exp!\left(-2\sum{t=1}^T \gamma_t^2\right)$ where $\gamma_t = 0.5 - \epsilon_t$ is the edge over random.

Gradient Boosting

Generalizes boosting to arbitrary differentiable loss functions. Each new model fits the pseudo-residuals (negative gradient of the loss with respect to current predictions).

At step $m$:

Compute pseudo-residuals: $r_i^{(m)} = -\frac{\partial \mathcal{L}(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$
Fit a base learner $h_m$ to ${(x_i, r_i^{(m)})}$.
Update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$, where $\nu$ is the learning rate (shrinkage).

For squared loss: $r_i^{(m)} = y_i - F_{m-1}(x_i)$. Reduces to fitting residuals.

XGBoost

Extreme Gradient Boosting. Adds second-order Taylor expansion to the loss, L1/L2 regularization on leaf weights, and column subsampling. Objective at step $m$:

\[\mathcal{L}^{(m)} \approx \sum_{i=1}^n [g_i f_m(x_i) + \frac{1}{2} h_i f_m^2(x_i)] + \Omega(f_m)\]

where $g_i = \partial_{F} \mathcal{L}(y_i, F)$, $h_i = \partial^2_{F} \mathcal{L}(y_i, F)$, and $\Omega(f) = \gamma T + \frac{1}{2}\lambda |w|^2$.

Optimal leaf weight: $w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

LightGBM

Gradient boosting with:

Gradient-based One-Side Sampling (GOSS): keeps samples with large gradients; randomly samples those with small gradients. Reduces data without losing information from hard examples.
Exclusive Feature Bundling (EFB): bundles mutually exclusive sparse features to reduce dimensionality.
Leaf-wise growth: grows the leaf with the largest loss reduction, rather than level-wise. More accurate but can overfit on small datasets.

CatBoost

Handles categorical features natively via ordered target statistics; avoids target leakage using permutation-based encoding. Ordered boosting reduces overfitting. Strong default performance without extensive tuning.

Stacking (Stacked Generalization)

Trains a meta-learner on the out-of-fold predictions of base learners.

Procedure:

Split training data into $k$ folds.
Train each base model $M_j$ on $k-1$ folds; generate predictions on held-out fold. Collect OOF predictions as new features.
Train meta-learner on the OOF predictions as inputs, with original labels.
For inference: predict with each base model; feed predictions to meta-learner.

Meta-learner: typically a simple model (logistic regression, linear regression, light GBM) to avoid overfitting.

Voting and Averaging

Hard voting: majority class label across classifiers.

Soft voting: average predicted probabilities; class with highest average wins. Generally superior to hard voting when classifiers are calibrated.

Weighted averaging: $\hat{y} = \sum_m w_m \hat{y}_m$, weights learned on validation or by optimization.

Comparison

Method	Bias	Variance	Sequential	Comments
Bagging	Same	Lower	No	Easy to parallelize
Random Forest	Slightly higher	Much lower	No	Best single-model alternative
AdaBoost	Lower	Varies	Yes	Sensitive to noise/outliers
Gradient Boosting	Lower	Varies	Yes	Highly accurate; needs tuning
XGBoost / LightGBM	Lower	Varies	Yes	State of the art on tabular data
Stacking	Lower	Lower	No	Powerful; complex pipeline

Practical Considerations

Random Forest and gradient boosting (XGBoost, LightGBM) are the dominant choices for tabular data.
Gradient boosting benefits more from hyperparameter tuning than Random Forest.
Stacking and blending often provide 0.5-2% improvement over single best model in competitions.
Diverse base learners (different algorithms, feature sets, data subsets) improve ensemble quality more than duplicating the same model.