Machine Learning Overview
What is Machine Learning?
Machine learning is the study of algorithms that improve their performance on a task through experience. Instead of explicit programming, a model learns a mapping from inputs to outputs (or latent structure) directly from data.
Formally, following Mitchell’s definition: a program learns from experience $E$ with respect to task $T$ and performance measure $P$, if its performance at $T$, as measured by $P$, improves with experience $E$.
The Learning Problem
Given a dataset $\mathcal{D} = {(x_i, y_i)}_{i=1}^n$ drawn i.i.d. from an unknown distribution $P(X, Y)$, find a function $f: \mathcal{X} \to \mathcal{Y}$ that generalizes well to unseen data.
Components:
- Input space $\mathcal{X}$: feature vectors $x \in \mathbb{R}^d$
- Output space $\mathcal{Y}$: labels, values, or structure
- Hypothesis class $\mathcal{H}$: set of candidate functions the algorithm considers
- Loss function $\mathcal{L}(f(x), y)$: measures prediction error
- Risk $R(f) = \mathbb{E}_{(x,y) \sim P}[\mathcal{L}(f(x), y)]$: expected loss over the true distribution
- Empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f(x_i), y_i)$: loss on training data
Types of Machine Learning
| Type | Supervision | Goal | Examples |
|---|---|---|---|
| Supervised | Labeled $(x, y)$ pairs | Learn $f: x \to y$ | Classification, Regression |
| Unsupervised | Unlabeled $x$ only | Discover structure | Clustering, Dimensionality Reduction |
| Semi-supervised | Few labeled, many unlabeled | Leverage both | Label propagation, pseudo-labeling |
| Self-supervised | No labels; supervision from data itself | Learn representations | Contrastive learning, masked modeling |
| Reinforcement | Reward signal from environment | Learn policy | Game playing, robotics |
Core Algorithms by Task
Classification: Logistic Regression, Decision Trees, Random Forests, SVM, k-NN, Naive Bayes, Neural Networks
Regression: Linear Regression, Ridge, Lasso, Decision Trees, Gradient Boosted Trees, Neural Networks
Clustering: k-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models
Dimensionality Reduction: PCA, t-SNE, UMAP, Autoencoders
The ML Workflow
- Problem formulation: Define task, metric, and data requirements.
- Data collection and cleaning: Handle missing values, outliers, and label noise.
- Exploratory Data Analysis (EDA): Understand distributions, correlations, and class imbalance.
- Feature engineering: Transform raw inputs into informative representations.
- Model selection: Choose hypothesis class appropriate to data size and complexity.
- Training: Minimize empirical risk via optimization (gradient descent, etc.).
- Evaluation: Measure generalization on held-out data.
- Hyperparameter tuning: Search for optimal configuration via validation.
- Deployment: Serve model in production; monitor for distribution shift.
Generalization and the Goal of Learning
The fundamental goal is low generalization error, not just low training error:
\[\text{Generalization gap} = R(f) - \hat{R}(f)\]Underfitting: model too simple to capture true pattern. High training and test error.
Overfitting: model fits noise in training data. Low training error, high test error.
The Bias Variance Tradeoff formalizes this tension.
No Free Lunch Theorem
No single algorithm is universally best. For any algorithm $A$ that outperforms random guessing on one distribution, there exists another distribution where $A$ performs at most as well as random guessing. Algorithm choice must be guided by inductive bias appropriate for the problem domain.
Key Assumptions
- i.i.d. assumption: training and test data are drawn from the same distribution. Violations lead to distribution shift.
- Smoothness: nearby inputs tend to have similar outputs (enables generalization).
- Finite VC dimension (for PAC learning): ensures learnability guarantees.
PAC Learning
Probably Approximately Correct (PAC) framework: an algorithm PAC-learns a concept class $\mathcal{C}$ if, given enough samples $n$, it produces a hypothesis $h$ satisfying:
\[P[R(h) \leq \epsilon] \geq 1 - \delta\]for any $\epsilon, \delta > 0$. The sample complexity is:
\[n \geq \frac{1}{\epsilon}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right)\]For infinite hypothesis classes, VC dimension replaces $\ln \lvert\mathcal{H}\rvert$.