Machine Learning Overview

What is Machine Learning?

Machine learning is the study of algorithms that improve their performance on a task through experience. Instead of explicit programming, a model learns a mapping from inputs to outputs (or latent structure) directly from data.

Formally, following Mitchell’s definition: a program learns from experience $E$ with respect to task $T$ and performance measure $P$, if its performance at $T$, as measured by $P$, improves with experience $E$.

The Learning Problem

Given a dataset $\mathcal{D} = {(x_i, y_i)}_{i=1}^n$ drawn i.i.d. from an unknown distribution $P(X, Y)$, find a function $f: \mathcal{X} \to \mathcal{Y}$ that generalizes well to unseen data.

Components:

Input space $\mathcal{X}$: feature vectors $x \in \mathbb{R}^d$
Output space $\mathcal{Y}$: labels, values, or structure
Hypothesis class $\mathcal{H}$: set of candidate functions the algorithm considers
Loss function $\mathcal{L}(f(x), y)$: measures prediction error
Risk $R(f) = \mathbb{E}_{(x,y) \sim P}[\mathcal{L}(f(x), y)]$: expected loss over the true distribution
Empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f(x_i), y_i)$: loss on training data

Types of Machine Learning

Type	Supervision	Goal	Examples
Supervised	Labeled $(x, y)$ pairs	Learn $f: x \to y$	Classification, Regression
Unsupervised	Unlabeled $x$ only	Discover structure	Clustering, Dimensionality Reduction
Semi-supervised	Few labeled, many unlabeled	Leverage both	Label propagation, pseudo-labeling
Self-supervised	No labels; supervision from data itself	Learn representations	Contrastive learning, masked modeling
Reinforcement	Reward signal from environment	Learn policy	Game playing, robotics

Core Algorithms by Task

Classification: Logistic Regression, Decision Trees, Random Forests, SVM, k-NN, Naive Bayes, Neural Networks

Regression: Linear Regression, Ridge, Lasso, Decision Trees, Gradient Boosted Trees, Neural Networks

Clustering: k-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models

Dimensionality Reduction: PCA, t-SNE, UMAP, Autoencoders

The ML Workflow

Problem formulation: Define task, metric, and data requirements.
Data collection and cleaning: Handle missing values, outliers, and label noise.
Exploratory Data Analysis (EDA): Understand distributions, correlations, and class imbalance.
Feature engineering: Transform raw inputs into informative representations.
Model selection: Choose hypothesis class appropriate to data size and complexity.
Training: Minimize empirical risk via optimization (gradient descent, etc.).
Evaluation: Measure generalization on held-out data.
Hyperparameter tuning: Search for optimal configuration via validation.
Deployment: Serve model in production; monitor for distribution shift.

Generalization and the Goal of Learning

The fundamental goal is low generalization error, not just low training error:

$$ \text{Generalization gap} = R(f) - \hat{R}(f) $$

Underfitting: model too simple to capture true pattern. High training and test error.

Overfitting: model fits noise in training data. Low training error, high test error.

The Bias Variance Tradeoff formalizes this tension.

No Free Lunch Theorem

No single algorithm is universally best. For any algorithm $A$ that outperforms random guessing on one distribution, there exists another distribution where $A$ performs at most as well as random guessing. Algorithm choice must be guided by inductive bias appropriate for the problem domain.

Key Assumptions

i.i.d. assumption: training and test data are drawn from the same distribution. Violations lead to distribution shift.
Smoothness: nearby inputs tend to have similar outputs (enables generalization).
Finite VC dimension (for PAC learning): ensures learnability guarantees.

PAC Learning

Probably Approximately Correct (PAC) framework: an algorithm PAC-learns a concept class $\mathcal{C}$ if, given enough samples $n$, it produces a hypothesis $h$ satisfying:

$$ P[R(h) \leq \epsilon] \geq 1 - \delta $$

for any $\epsilon, \delta > 0$. The sample complexity is:

$$ n \geq \frac{1}{\epsilon}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right) $$

For infinite hypothesis classes, VC dimension replaces $\ln \lvert\mathcal{H}\rvert$.