Hyperparameter Tuning
Hyperparameters are configuration choices that control the learning process but are not learned from data: learning rate, regularization strength, tree depth, number of layers. Tuning finds the configuration that maximizes validation performance.
Distinguish from model parameters (learned during training, e.g., weights $\mathbf{w}$).
Search Strategies
Grid Search
Exhaustively evaluates all combinations of a discrete hyperparameter grid.
Algorithm: for each $(h_1, h_2, \ldots, h_k) \in \mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times \mathcal{H}_k$, train and evaluate via CV.
Complexity: $O(\lvert\mathcal{H}_1\rvert \times \cdots \times \lvert\mathcal{H}_k\rvert \times k\text{-fold cost})$.
Advantage: systematic, reproducible.
Disadvantage: exponential in number of hyperparameters; wastes resources on unimportant ones.
Random Search
Samples hyperparameter configurations at random from specified distributions.
Key insight (Bergstra & Bengio 2012): if only a few hyperparameters matter, random search explores each of them more densely than grid search with the same budget.
For $n$ trials, each hyperparameter is sampled with $n$ distinct values vs. grid search which repeats the same values across all rows.
Advantage: efficient when some hyperparameters are unimportant; easy to parallelize.
Disadvantage: no learning from past evaluations.
Bayesian Optimization
Builds a surrogate model of the objective function $f: \mathcal{H} \to \mathbb{R}$ and uses it to select the next most promising configuration to evaluate.
Algorithm:
- Evaluate $f$ at a few initial random points.
- Fit surrogate model (e.g., Gaussian Process) to observed ${(h_i, f(h_i))}$.
- Maximize acquisition function to select next $h$.
- Evaluate $f(h)$, add to observations. Repeat.
Acquisition functions:
| Function | Idea |
|---|---|
| Expected Improvement (EI) | $\mathbb{E}[\max(f(h) - f^*, 0)]$ |
| Upper Confidence Bound (UCB) | $\mu(h) + \kappa \sigma(h)$; balances exploration and exploitation |
| Probability of Improvement (PI) | $P(f(h) > f^* + \xi)$ |
Surrogate models:
- Gaussian Process (GP): provides uncertainty estimates; exact but scales $O(n^3)$ with evaluations.
- Tree Parzen Estimator (TPE): models $P(h \mid f(h) < \text{threshold})$ and its complement; scales better.
- Random Forest: used in SMAC; handles categorical hyperparameters well.
Bayesian optimization is most effective when function evaluations are expensive.
Successive Halving and Hyperband
Successive Halving: allocate a small budget (epochs) to many configurations; keep the top fraction; increase budget; repeat.
| Round | Configurations | Budget per config |
|---|---|---|
| 1 | $n$ | $r$ |
| 2 | $n/\eta$ | $r\eta$ |
| $\ldots$ | $\ldots$ | $\ldots$ |
| $\log_\eta n$ | 1 | $r\eta^{\log_\eta n}$ |
Hyperband: runs successive halving with multiple brackets at different initial $n$, trading off exploration vs. exploitation across brackets. Robust to the $n$ vs. $r$ choice.
BOHB (Bayesian Optimization + HyperBand): replaces random sampling in Hyperband with TPE-based Bayesian optimization. State of the art for many benchmarks.
Key Hyperparameters by Algorithm
| Algorithm | Key Hyperparameters |
|---|---|
| Gradient Boosted Trees | n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight |
| Random Forest | n_estimators, max_depth, max_features, min_samples_leaf |
| Neural Networks | learning_rate, batch_size, n_layers, hidden_size, dropout, weight_decay |
| SVM | C, kernel, gamma (RBF) |
| k-NN | k, distance metric |
| Lasso / Ridge | alpha ($\lambda$) |
Search Space Design
- Learning rate: log-uniform in $[10^{-5}, 10^{-1}]$. Always search on log scale.
- Regularization strength: log-uniform in $[10^{-5}, 10]$.
- Integer hyperparameters: uniform discrete (tree depth, n_estimators).
- Categorical: one-hot in Bayesian methods or direct in TPE.
- Conditional hyperparameters: e.g., gamma is only relevant if kernel = RBF. Use conditional search spaces in frameworks like Optuna.
Practical Considerations
- Always use validation data (or CV) for tuning. Never use test set.
- Use early stopping to terminate unpromising runs cheaply.
- Parallelize across hardware when possible.
- Start coarse (wide ranges), then refine around good regions.
- Budget for final model: refit on full training data with best hyperparameters.
- Log all experiments (hyperparameters + metrics) for reproducibility and analysis.
Software
| Library | Search Strategy | Notable Features |
|---|---|---|
scikit-learn GridSearchCV / RandomizedSearchCV |
Grid, Random | Integrated with sklearn pipeline |
| Optuna | TPE, CMA-ES | Pruning, distributed, conditional search |
| Ray Tune | Grid, Random, BOHB, PBT | Distributed, scheduler integration |
| Hyperopt | TPE | Functional API |
| Ax / BoTorch | Bayesian (GP) | Research-grade, multi-objective |
| SMAC3 | Random Forest surrogate | Strong for algorithm configuration |
Population Based Training (PBT)
PBT trains a population of models simultaneously. Underperforming models copy hyperparameters and weights from better-performing ones and perturb the hyperparameters:
- Exploit: copy weights and hyperparameters from a top-performing member.
- Explore: randomly perturb the copied hyperparameters.
Discovers schedules (e.g., learning rate schedules) not possible with fixed hyperparameters. Used at DeepMind for AlphaGo, Waymo, and language model training.