Supervised Learning

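Supervised learning trains a model $f_\theta$ on labeled pairs $\{(x_i, y_i)\}_{i=1}^n$ to predict $y$ for unseen $x$. The label $y$ provides direct supervision about the correct output. Training typically chooses the parameters that minimize the average loss $\mathcal{L}$ over the training set (empirical risk minimization):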

\[\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i)\]
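
The objective above can be illustrated with a minimal sketch. The toy data, the one-parameter model $f_\theta(x) = \theta x$, and the grid of candidate values are all assumptions made for illustration; a real learner would use an optimizer rather than a grid.

```python
def empirical_risk(model, loss, xs, ys):
    """Average loss of `model` over the labeled pairs (xs, ys)."""
    return sum(loss(model(x), y) for x, y in zip(xs, ys)) / len(xs)


def squared_error(y_hat, y):
    return (y_hat - y) ** 2


# Hypothetical toy data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Candidate parameters for the model f_theta(x) = theta * x.
candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

# arg min over a finite grid, standing in for a real optimizer.
theta_hat = min(
    candidates,
    key=lambda theta: empirical_risk(lambda x: theta * x, squared_error, xs, ys),
)
print(theta_hat)
```

The grid search makes the $\arg\min$ literal: every candidate is scored by its average training loss and the smallest wins.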

Task Types

Task	Output $\mathcal{Y}$	Examples
Binary Classification	$\{0, 1\}$	Spam detection, fraud detection
Multi-class Classification	$\{1, \ldots, K\}$	Image classification, NER
Multi-label Classification	$\{0,1\}^K$	Tagging, multi-disease diagnosis
Regression	$\mathbb{R}$	House price, temperature forecasting
Ordinal Regression	ordered categories	Rating prediction
Structured Prediction	sequences, trees, graphs	Machine translation, parsing

Loss Functions

Classification Losses

Binary cross-entropy (log loss), for labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{p}_i$:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^n [y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)]\]
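
As a rough sketch, the formula translates directly into a few lines of Python. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the definition.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Same two examples, scored with confident-correct and confident-wrong predictions.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded, which is the characteristic behavior of log loss.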

Categorical cross-entropy, for one-hot labels $y_{ik}$ and predicted class probabilities $\hat{p}_{ik}$:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{p}_{ik}\]
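
A minimal sketch of the multi-class version follows, assuming one-hot label rows and probability rows that each sum to one. The `eps` floor is an illustrative safeguard, not part of the formula.

```python
import math


def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Mean cross-entropy for one-hot labels and per-class probabilities."""
    total = 0.0
    for y_row, p_row in zip(y_onehot, p_pred):
        # Only the true class contributes, because y is zero elsewhere.
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / len(y_onehot)


y = [[1, 0, 0], [0, 1, 0]]
p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(y, p)
print(loss)
```

With one-hot labels the inner sum collapses to the negative log-probability assigned to the true class.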

Hinge loss (SVM), for labels $y_i \in \{-1, +1\}$ and a raw score $f(x_i)$:

\[\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i f(x_i))\]
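
The hinge loss can be sketched as below, assuming labels are encoded as -1 or +1 and `scores` holds the raw model outputs $f(x_i)$. The example values are made up.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; zero only when every margin y * f(x) is at least 1."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)


# Margins: 2.0 (no loss), 0.5 (inside the margin), -1.0 (misclassified).
loss = hinge_loss([1, -1, 1], [2.0, -0.5, -1.0])
print(loss)
```

A correctly classified point still incurs loss if its margin is below 1, which is what pushes the decision boundary away from the data.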

Regression Losses

Loss	Formula	Notes
MSE	$\frac{1}{n}\sum (y_i - \hat{y}_i)^2$	Sensitive to outliers
MAE	$\frac{1}{n}\sum |y_i - \hat{y}_i|$	Robust to outliers
Huber	Quadratic for $|e| \leq \delta$, linear otherwise, with $e = y - \hat{y}$	Quadratic near zero like MSE, linear in the tails like MAE
Log-cosh	$\frac{1}{n}\sum \log\cosh(y_i - \hat{y}_i)$	Smooth approximation to MAE
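
To make the outlier behavior in the table concrete, here is a small sketch of the four losses. The Huber scaling used here ($\frac{1}{2}e^2$ inside the threshold, $\delta(|e| - \frac{1}{2}\delta)$ outside) is the common textbook form and an assumption beyond the table's summary; the data are invented, with one large outlier.

```python
import math


def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def huber(y, y_hat, delta=1.0):
    total = 0.0
    for a, b in zip(y, y_hat):
        e = abs(a - b)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y)


def log_cosh(y, y_hat):
    total = 0.0
    for a, b in zip(y, y_hat):
        e = abs(a - b)
        # Stable form of log(cosh(e)) that avoids overflow for large e.
        total += e + math.log1p(math.exp(-2.0 * e)) - math.log(2.0)
    return total / len(y)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # last prediction is off by 10

results = {f.__name__: f(y_true, y_pred) for f in (mse, mae, huber, log_cosh)}
print(results)
```

The single outlier dominates MSE, while MAE, Huber, and log-cosh grow only linearly with it.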

Core Algorithms

Linear Models

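Linear Regression: $\hat{y} = \mathbf{w}^T x + b$. When $X^T X$ is invertible, and with the bias $b$ absorbed into $\mathbf{w}$ through a constant feature, ordinary least squares has the closed-form solution $\mathbf{w} = (X^T X)^{-1} X^T y$.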
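
For a single feature plus an intercept, the normal equations reduce to a 2x2 system that can be solved by hand. The sketch below covers only that special case, with made-up data; it is not a general matrix solver.

```python
def fit_simple_linear_regression(xs, ys):
    """Solve the 2x2 normal equations for y = w * x + b."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # With design matrix rows [1, x]: X^T X = [[n, sx], [sx, sxx]], X^T y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")

    # Apply the inverse of the 2x2 matrix X^T X to X^T y.
    b = (sxx * sy - sx * sxy) / det
    w = (n * sxy - sx * sy) / det
    return w, b


w, b = fit_simple_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

The singular case in the code corresponds to $X^T X$ not being invertible, where the closed form does not apply.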

Logistic Regression: $\hat{p} = \sigma(\mathbf{w}^T x + b)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$. Optimized via gradient descent on cross-entropy.
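
A minimal sketch of that optimization for a single feature is shown below. The learning rate, step count, and toy data are assumptions chosen for illustration; the gradient of the mean cross-entropy with respect to $z = wx + b$ is $\hat{p} - y$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Batch gradient descent on mean binary cross-entropy (one feature)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: larger x tends to mean class 1, with some overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = fit_logistic(xs, ys)
p_high = sigmoid(w * 2.0 + b)
p_low = sigmoid(w * -2.0 + b)
print(w, b, p_high, p_low)
```

The classes overlap on purpose: on perfectly separable data the unregularized weights would keep growing instead of converging.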

Ridge Regression (L2): adds $\lambda \|\mathbf{w}\|_2^2$ to MSE. Shrinks weights toward zero but rarely makes them exactly zero.

Lasso (L1): adds $\lambda \|\mathbf{w}\|_1$ to MSE. Produces sparse solutions; some weights become exactly zero.
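
The difference between shrinkage and sparsity is easiest to see in one dimension. The sketch below assumes a single feature, no intercept, and the penalty added to MSE as written above; under those assumptions both estimators have closed forms (the lasso one via soft-thresholding). The data are invented.

```python
def _moments(xs, ys):
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy, sxx


def ridge_1d(xs, ys, lam):
    """Minimizer of mean((y - w*x)^2) + lam * w^2."""
    sxy, sxx = _moments(xs, ys)
    return sxy / (sxx + lam)


def soft_threshold(a, t):
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimizer of mean((y - w*x)^2) + lam * |w|."""
    sxy, sxx = _moments(xs, ys)
    return soft_threshold(sxy, lam / 2.0) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # unregularized solution is w = 2

for lam in (0.0, 1.0, 20.0):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

As $\lambda$ grows, the ridge weight shrinks but stays positive, while the lasso weight hits exactly zero once $\lambda$ is large enough.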

Tree-based Models

Decision Tree: recursively partitions feature space by choosing splits that maximize information gain or minimize Gini impurity.

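Gini impurity: $G = 1 - \sum_{k=1}^K p_k^2$, where $p_k$ is the fraction of samples in the node that belong to class $k$.

Entropy: $H = -\sum_{k=1}^K p_k \log_2 p_k$. The information gain of a split is the parent node's entropy minus the sample-weighted average entropy of its children.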
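
These split criteria can be sketched in a few lines. The function names and the tiny label lists are illustrative assumptions; information gain is computed here as parent entropy minus the size-weighted entropy of the two children.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


parent = ["a", "a", "b", "b"]
perfect = information_gain(parent, ["a", "a"], ["b", "b"])
useless = information_gain(parent, ["a", "b"], ["a", "b"])
print(gini(parent), entropy(parent), perfect, useless)
```

A split that separates the classes completely recovers all of the parent's entropy, while one that leaves both children as mixed as the parent gains nothing.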

Trees overfit easily; common mitigations are pruning, requiring a minimum number of samples per split or leaf, and limiting the maximum depth.

Support Vector Machines

For linearly separable data with labels $y_i \in \{-1, +1\}$, an SVM finds the maximum-margin hyperplane separating the classes:

\[\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T x_i + b) \geq 1\]

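Soft-margin SVM introduces slack variables $\xi_i \geq 0$ that allow margin violations, with $C$ controlling the trade-off between a wide margin and few violations:

\[\min_{\mathbf{w},b,\xi} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]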

Kernel trick: replaces $x_i^T x_j$ with $K(x_i, x_j)$ to handle non-linear boundaries implicitly.

Kernel	Formula
Linear	$x^T x'$
Polynomial	$(x^T x' + c)^d$
RBF (Gaussian)	$\exp(-\gamma \|x - x'\|^2)$
Sigmoid	$\tanh(\kappa x^T x' + c)$
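
The following sketch illustrates what "implicitly" means for the kernel trick. For 2-D inputs, the polynomial kernel with $c = 0$ and $d = 2$ equals an ordinary dot product in an explicit quadratic feature space. The feature map and helper names are assumptions for illustration only.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(x, z, c=0.0, d=2):
    return (dot(x, z) + c) ** d


def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def explicit_quadratic_features(x):
    """Feature map phi for 2-D inputs with phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 4.0]

# Kernel evaluated in the original 2-D space.
k_implicit = polynomial_kernel(x, z)
# Same quantity computed by mapping to 3-D first and taking a dot product.
k_explicit = dot(explicit_quadratic_features(x), explicit_quadratic_features(z))
print(k_implicit, k_explicit, rbf_kernel(x, z))
```

The kernel gives the same value without ever building the higher-dimensional features. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available.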

k-Nearest Neighbors

Predict $y$ for $x$ using the majority vote (classification) or mean (regression) of the $k$ closest training points. For regression:

\[\hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i\]

Non-parametric. There is no training phase beyond storing the data. Without indexing structures, each query costs $O(nd)$ for $n$ training points in $d$ dimensions.
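
A brute-force sketch of both variants is given below, assuming Euclidean distance (the text does not fix a metric) and small invented data. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Indices of the k training points closest to `query` (brute force)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]


def knn_classify(train_X, train_y, query, k=3):
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k


X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]

print(knn_classify(X, labels, [0.5, 0.5]))
print(knn_regress(X, targets, [5.5, 5.5]))
```

Every query scans all $n$ stored points and computes a $d$-dimensional distance for each, which is where the $O(nd)$ cost comes from.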

Naive Bayes

Applies Bayes’ theorem with the conditional independence assumption:

\[P(y \mid x) \propto P(y) \prod_{j=1}^d P(x_j \mid y)\]

Fast and effective for text classification despite the independence assumption often being violated.
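
As a rough illustration for text, here is a small multinomial-style sketch over word tokens. Add-one (Laplace) smoothing via `alpha` and the practice of skipping unseen words are assumptions not stated above, and the four training documents are invented. Scores are summed in log space to avoid underflow from multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect class counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, words):
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior P(y)
        total = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Smoothed log likelihood log P(x_j | y).
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ["win", "money", "now"],
    ["cheap", "money"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["money", "now"]))
print(predict_nb(model, ["meeting", "lunch"]))
```

Each word contributes an independent factor to the class score, which is exactly the conditional independence assumption in the formula.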

Inductive Bias

Every supervised learning algorithm encodes assumptions about the true function:

Algorithm	Inductive Bias
Linear regression	Relationship is linear
Decision trees	Axis-aligned splits suffice
SVM (RBF kernel)	Nearby points have similar labels
Naive Bayes	Features are conditionally independent
Neural networks	Hierarchical feature compositions

Practical Considerations

Class imbalance: use oversampling (SMOTE), undersampling, or class-weighted loss.
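Feature scaling: needed for distance- and margin-based models (k-NN, SVMs) and for regularized or gradient-trained linear models; trees are invariant to monotonic rescaling of individual features.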
Missing data: impute (mean, median, model-based) or use models that handle missingness natively (e.g., XGBoost).
Label noise: use robust losses (Huber, MAE), label smoothing, or noise-robust algorithms.
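
Two of these considerations can be sketched briefly. The class-weight rule below ($n / (K \cdot n_k)$, inverse to class frequency) is one common heuristic and an assumption here, as is the use of the population standard deviation for scaling; the data are invented.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (K * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def standardize(values):
    """Rescale one feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mean) / std for v in values]


labels = [0] * 8 + [1] * 2  # 80/20 imbalance
weights = balanced_class_weights(labels)
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print(weights, z)
```

Multiplying each example's loss by its class weight makes the rare class count as much in total as the common one. In practice the scaling statistics should be computed on the training split only and then reused on validation and test data.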

See Model Evaluation for performance metrics and Cross Validation for model selection.