Supervised Learning

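Supervised learning trains a model $f_\theta$ on labeled pairs $\{(x_i, y_i)\}_{i=1}^n$ to predict $y$ for unseen $x$. The label $y$ provides direct supervision about the correct output. Training typically minimizes the empirical risk, the average of a loss $\mathcal{L}$ over the training set: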

\[\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i)\]
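The objective above can be illustrated with a deliberately tiny sketch. It assumes a one-parameter model $f_\theta(x) = \theta x$, squared-error loss, and a grid search standing in for a real optimizer; the names `empirical_risk`, `squared_error`, and the toy data are hypothetical.

```python
def squared_error(pred, target):
    """Per-example loss L(f(x), y)."""
    return (pred - target) ** 2

def empirical_risk(theta, data, loss):
    """Average loss of the assumed model f_theta(x) = theta * x over the data."""
    return sum(loss(theta * x, y) for x, y in data) / len(data)

# Toy data generated by y = 2x (an assumption for illustration).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Crude arg-min: evaluate the risk on a grid of candidate parameters.
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0
theta_hat = min(candidates, key=lambda t: empirical_risk(t, data, squared_error))
```

In practice the grid search is replaced by a closed-form solution or gradient-based optimization, but the quantity being minimized is the same.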

Task Types

| Task | Output space $\mathcal{Y}$ | Examples |
| --- | --- | --- |
| Binary Classification | $\{0, 1\}$ | Spam detection, fraud detection |
| Multi-class Classification | $\{1, \ldots, K\}$ | Image classification, NER |
| Multi-label Classification | $\{0,1\}^K$ | Tagging, multi-disease diagnosis |
| Regression | $\mathbb{R}$ | House price, temperature forecasting |
| Ordinal Regression | Ordered categories | Rating prediction |
| Structured Prediction | Sequences, trees, graphs | Machine translation, parsing |

Loss Functions

Classification Losses

Binary cross-entropy (log loss), with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{p}_i$:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^n [y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)]\]
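A minimal sketch of this formula in plain Python. The clipping constant `eps` is an added assumption to avoid `log(0)`; it is not part of the definition.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so both logs are finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])    # equals log(2)
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what pushes the model toward calibrated probabilities.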

Categorical cross-entropy, with one-hot labels $y_{ik}$ and predicted class probabilities $\hat{p}_{ik}$:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{p}_{ik}\]
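As a sketch, assuming one-hot label rows and probability rows that each sum to one (the function name and the `eps` floor are illustrative choices):

```python
import math

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Mean cross-entropy over examples; each row has K entries."""
    total = 0.0
    for y_row, p_row in zip(y_onehot, p_pred):
        for y, p in zip(y_row, p_row):
            if y:  # with one-hot labels only the true class contributes
                total += y * math.log(max(p, eps))
    return -total / len(y_onehot)

labels = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(labels, probs)
```

With one-hot labels the double sum collapses to the negative log-probability assigned to the true class, averaged over examples.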

Hinge loss (SVM), with labels $y_i \in \{-1, +1\}$ and raw score $f(x_i)$:

\[\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i f(x_i))\]
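A short sketch of the hinge loss, assuming labels coded as -1 or +1 and unthresholded scores. The three toy examples are chosen to show the three regimes.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; labels must be -1 or +1, scores are raw f(x)."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)

# Example 1: correct with margin >= 1  -> loss 0
# Example 2: correct but inside margin -> loss 0.5
# Example 3: misclassified             -> loss 2.5
value = hinge_loss([1, -1, 1], [2.0, -0.5, -1.5])
```

Unlike cross-entropy, the hinge loss is exactly zero once an example is classified correctly with a margin of at least one.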

Regression Losses

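| Loss | Formula | Notes |
| --- | --- | --- |
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Sensitive to outliers |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Robust to outliers |
| Huber | $\frac{1}{2}e^2$ for $\lvert e \rvert \leq \delta$, $\delta(\lvert e \rvert - \frac{1}{2}\delta)$ otherwise, with $e = y_i - \hat{y}_i$ | Quadratic near zero, linear in the tails |
| Log-cosh | $\frac{1}{n}\sum_i \log\cosh(y_i - \hat{y}_i)$ | Smooth everywhere; roughly quadratic for small errors, roughly linear for large ones |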
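To make the outlier behavior concrete, the sketch below evaluates the four losses on a list of residuals $e_i = y_i - \hat{y}_i$ containing one outlier. It assumes the standard Huber definition with a factor of one half on the quadratic branch; the function names and residual values are illustrative.

```python
import math

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def huber(errors, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it (continuous at delta)."""
    total = 0.0
    for e in errors:
        a = abs(e)
        total += 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return total / len(errors)

def log_cosh(errors):
    # math.cosh overflows for very large |e|; fine for this toy example.
    return sum(math.log(math.cosh(e)) for e in errors) / len(errors)

errors = [0.5, -0.5, 10.0]  # the last residual is an outlier
results = {f.__name__: f(errors) for f in (mse, mae, huber, log_cosh)}
```

The outlier dominates MSE because it enters squared, while MAE, Huber, and log-cosh grow only linearly in it.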

Core Algorithms

Linear Models

Linear Regression: $\hat{y} = \mathbf{w}^T x + b$. Closed-form solution, with a column of ones appended to $X$ to absorb $b$ and assuming $X^T X$ is invertible: $\mathbf{w} = (X^T X)^{-1} X^T y$.
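For a single feature with an intercept, the normal equations reduce to a ratio of centered sums, which keeps a sketch free of matrix code. This is an illustrative special case rather than a general solver, and `fit_simple_linear_regression` is a hypothetical name.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope w and intercept b for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)  # must be non-zero
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b

# Toy data lying exactly on y = 2x + 1.
w, b = fit_simple_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With many features, a numerically stable least-squares routine is generally preferable to forming and inverting $X^T X$ explicitly.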

Logistic Regression: $\hat{p} = \sigma(\mathbf{w}^T x + b)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$. Optimized via gradient descent on cross-entropy.
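A minimal sketch of batch gradient descent on the mean cross-entropy, assuming a single feature. The learning rate, epoch count, and toy data are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Gradient descent on mean binary cross-entropy for p = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The gradient of the loss w.r.t. the logit is (p - y) per example.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Slightly overlapping classes so the optimum is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic_regression(xs, ys)
p_high = sigmoid(w * 2.0 + b)
p_low = sigmoid(w * -2.0 + b)
```

On perfectly separable data the unregularized weights grow without bound, which is one reason an L2 penalty is commonly added.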

Ridge Regression (L2): adds $\lambda \|\mathbf{w}\|_2^2$ to MSE. Shrinks weights toward zero, but rarely makes any of them exactly zero.

Lasso (L1): adds $\lambda \|\mathbf{w}\|_1$ to MSE. Produces sparse solutions; some weights become exactly zero.
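The difference between the two penalties is easiest to see in one dimension, where both have closed forms. The sketch below assumes a single feature, no intercept, and the penalty added directly to the MSE as described above; `fit_1d` and `soft_threshold` are hypothetical helper names.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, clamping to exactly zero inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def fit_1d(xs, ys, lam=0.0, penalty=None):
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n  # (1/n) * sum x_i y_i
    z = sum(x * x for x in xs) / n                # (1/n) * sum x_i^2
    if penalty == "ridge":  # minimizes MSE + lam * w^2
        return rho / (z + lam)
    if penalty == "lasso":  # minimizes MSE + lam * |w|
        return soft_threshold(rho, lam / 2.0) / z
    return rho / z          # ordinary least squares

xs, ys = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]  # weak true effect, w = 0.1
w_ols = fit_1d(xs, ys)
w_ridge = fit_1d(xs, ys, lam=1.0, penalty="ridge")
w_lasso = fit_1d(xs, ys, lam=1.0, penalty="lasso")
```

Ridge rescales the estimate and leaves it non-zero, whereas the soft-thresholding step in lasso sets a sufficiently weak coefficient to exactly zero.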

Tree-based Models

Decision Tree: recursively partitions feature space by choosing splits that maximize information gain or minimize Gini impurity.

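Gini impurity: $G = 1 - \sum_{k=1}^K p_k^2$, where $p_k$ is the fraction of samples in the node belonging to class $k$.

Entropy: $H = -\sum_{k=1}^K p_k \log_2 p_k$. Information gain is the reduction in entropy produced by a split: the parent's entropy minus the size-weighted average entropy of the children.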
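These impurity measures can be sketched directly from class counts. Information gain is computed here as the parent's entropy minus the size-weighted entropy of the two children; the function names are illustrative.

```python
import math
from collections import Counter

def class_proportions(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

parent = ["a", "a", "b", "b"]
perfect = information_gain(parent, ["a", "a"], ["b", "b"])
useless = information_gain(parent, ["a", "b"], ["a", "b"])
```

A tree learner evaluates candidate splits this way and keeps the one with the highest gain (or lowest weighted Gini).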

Trees overfit easily; this is mitigated by pruning and by constraints such as a minimum number of samples per leaf or a maximum depth.

Support Vector Machines

An SVM finds the maximum-margin hyperplane separating two classes with labels $y_i \in \{-1, +1\}$:

\[\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T x_i + b) \geq 1\]

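Soft-margin SVM introduces slack variables $\xi_i \geq 0$ that permit margin violations, with $C > 0$ controlling how heavily they are penalized:

\[\min_{\mathbf{w},b,\xi} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]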
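At the optimum each slack equals $\max(0, 1 - y_i(\mathbf{w}^T x_i + b))$, so the soft-margin problem can be written as an unconstrained hinge-loss objective. The sketch below minimizes that objective with plain subgradient descent. This is a simple stand-in for illustration, not the quadratic-programming or SMO solvers used by SVM libraries, and the step size and epoch count are arbitrary assumptions.

```python
def fit_linear_svm(X, Y, C=1.0, lr=0.01, epochs=1000):
    """Subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b)))."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for x, y in zip(X, Y):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # slack is active, so the hinge term contributes
                grad_w = [g - C * y * xj for g, xj in zip(grad_w, x)]
                grad_b -= C * y
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy linearly separable data with labels in {-1, +1}.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
Y = [1, 1, 1, -1, -1, -1]
w, b = fit_linear_svm(X, Y)
scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
```

Larger values of `C` penalize violations more heavily and approach the hard-margin solution; smaller values favor a wider margin at the cost of more violations.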

Kernel trick: replaces $x_i^T x_j$ with $K(x_i, x_j)$ to handle non-linear boundaries implicitly.

| Kernel | Formula |
| --- | --- |
| Linear | $x^T x'$ |
| Polynomial | $(x^T x' + c)^d$ |
| RBF (Gaussian) | $\exp(-\gamma \lVert x - x' \rVert^2)$ |
| Sigmoid | $\tanh(\kappa x^T x' + c)$ |
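Each kernel is simply a function of two input vectors. A sketch with plain lists follows; the default hyperparameter values are arbitrary placeholders.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, c=1.0, d=2):
    return (dot(a, b) + c) ** d

def rbf_kernel(a, b, gamma=0.5):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(a, b, kappa=0.1, c=0.0):
    return math.tanh(kappa * dot(a, b) + c)

x, z = [1.0, 2.0], [3.0, 4.0]
values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z),
    "rbf": rbf_kernel(x, z),
    "sigmoid": sigmoid_kernel(x, z),
}
```

The RBF kernel equals 1 for identical points and decays toward 0 with distance, with `gamma` controlling how quickly similarity falls off.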

k-Nearest Neighbors

Predict $y$ for $x$ using the majority vote (classification) or mean (regression) of the $k$ closest training points $\mathcal{N}_k(x)$. For regression:

\[\hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i\]

Non-parametric. No training phase beyond storing the data. Brute-force inference costs $O(nd)$ distance computations per query without indexing structures such as k-d trees.
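A brute-force sketch of both prediction modes, assuming Euclidean distance and no indexing structure. Function names and toy data are illustrative.

```python
import math
from collections import Counter

def k_nearest_targets(train_X, train_y, query, k):
    """Targets of the k training points closest to query (brute force)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return [train_y[i] for i in order[:k]]

def knn_classify(train_X, train_y, query, k=3):
    neighbors = k_nearest_targets(train_X, train_y, query, k)
    return Counter(neighbors).most_common(1)[0][0]  # majority vote

def knn_regress(train_X, train_y, query, k=3):
    neighbors = k_nearest_targets(train_X, train_y, query, k)
    return sum(neighbors) / k  # mean of neighbor targets

train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
classes = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_classify(train_X, classes, [0.5, 0.5])
value = knn_regress(train_X, targets, [0.2, 0.2])
```

Every query scans all stored points, which is why the cost grows linearly with the training set size.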

Naive Bayes

Applies Bayes’ theorem with the conditional independence assumption:

\[P(y | x) \propto P(y) \prod_{j=1}^d P(x_j | y)\]

Fast and effective for text classification despite the independence assumption often being violated.
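The factorization above can be sketched as a Bernoulli Naive Bayes classifier over binary features, for instance word-presence indicators. Laplace smoothing with `alpha=1.0` is an added assumption to avoid zero probabilities, and the function names and toy data are hypothetical.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log P(y) and P(x_j = 1 | y) from binary feature vectors."""
    log_prior, feature_prob = {}, {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        feature_prob[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return log_prior, feature_prob

def predict_nb(model, x):
    """Return the class maximizing log P(y) + sum_j log P(x_j | y)."""
    log_prior, feature_prob = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for xj, p in zip(x, feature_prob[c]):
            score += math.log(p if xj else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Features: [contains "free", contains "meeting"] (toy data).
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_bernoulli_nb(X, y)
```

Working in log space turns the product into a sum and avoids numerical underflow when the number of features is large.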

Inductive Bias

Every supervised learning algorithm encodes assumptions about the true function:

| Algorithm | Inductive Bias |
| --- | --- |
| Linear regression | Relationship is linear |
| Decision trees | Axis-aligned splits suffice |
| SVM (RBF kernel) | Nearby points have similar labels |
| Naive Bayes | Features are conditionally independent |
| Neural networks | Hierarchical feature compositions |

Practical Considerations

  • Class imbalance: use oversampling (SMOTE), undersampling, or class-weighted loss.
  • Feature scaling: important for SVMs, k-NN, and regularized or gradient-trained linear models; trees are invariant to monotonic rescaling of individual features.
  • Missing data: impute (mean, median, model-based) or use models that handle missingness natively (e.g., XGBoost).
  • Label noise: robust losses (Huber, MAE), label smoothing, or noise-robust algorithms.
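Two of these considerations are easy to sketch in plain Python: standardizing a feature column, and deriving class weights for a weighted loss. The weighting rule used here, $n / (K \cdot n_c)$, is one common "balanced" heuristic chosen for illustration; the function names are hypothetical.

```python
import math
from collections import Counter

def standardize(column):
    """Rescale a feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)  # must be non-zero
    return [(v - mean) / std for v in column]

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (K * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

scaled = standardize([1.0, 2.0, 3.0])
weights = balanced_class_weights([0] * 9 + [1])  # 9:1 imbalance
```

The scaling statistics should be computed on the training split only and then reused on validation and test data, to avoid leaking information.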

See Model Evaluation for performance metrics and Cross Validation for model selection.