Loss Functions
The loss function $\mathcal{L}(\hat{y}, y)$ measures the discrepancy between the model’s prediction $\hat{y}$ and the true target $y$. It defines what the network is optimizing and must be chosen to match the task, output distribution, and desired error geometry.
The training objective is:
\[\mathcal{J}(\theta) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(f_\theta(x_i), y_i) + \lambda \Omega(\theta)\]
where $\Omega(\theta)$ is an optional regularizer and $\lambda \geq 0$ sets its strength.
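As a rough illustration of this objective, the sketch below evaluates $\mathcal{J}(\theta)$ in plain Python. The squared L2 penalty used for $\Omega$, the one-parameter linear model, and all function names are assumptions made for the example; the text itself leaves $\Omega$ unspecified.

```python
from typing import Callable, Sequence


def objective(
    theta: Sequence[float],
    xs: Sequence[float],
    ys: Sequence[float],
    model: Callable[[Sequence[float], float], float],
    loss: Callable[[float, float], float],
    lam: float = 0.0,
) -> float:
    """Mean per-example loss plus lam * Omega(theta).

    Omega is assumed here to be the squared L2 norm of theta.
    """
    data_term = sum(loss(model(theta, x), y) for x, y in zip(xs, ys)) / len(xs)
    omega = sum(w * w for w in theta)
    return data_term + lam * omega


# Hypothetical one-parameter linear model and squared-error loss.
def linear(theta: Sequence[float], x: float) -> float:
    return theta[0] * x


def squared_error(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2
```

Setting `lam=0.0` recovers the unregularized empirical risk.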
Classification Losses
Binary Cross-Entropy (Log Loss)
For $y \in \{0, 1\}$ and predicted probability $\hat{p} = \sigma(z) \in (0,1)$:
\[\mathcal{L} = -[y \log \hat{p} + (1-y) \log(1 - \hat{p})]\]
Derivation: MLE under a Bernoulli likelihood $P(y \mid \hat{p}) = \hat{p}^y (1-\hat{p})^{1-y}$.
Gradient: $\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y$. It is linear in the prediction error and bounded in $(-1, 1)$, so a saturated sigmoid does not cause a vanishing gradient at the output.
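A minimal sketch of binary cross-entropy computed directly from the logit $z$, using only the standard library. The function names are illustrative, and the rearranged form $\max(z, 0) - yz + \log(1 + e^{-|z|})$ is an algebraically equivalent rewrite chosen here for numerical stability rather than something stated above.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy from the logit z and target y in {0, 1}.

    Equivalent to -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z),
    but never takes the log of a probability that has rounded to 0 or 1.
    """
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))


def bce_grad(z: float, y: float) -> float:
    """Analytic gradient dL/dz = p - y."""
    return sigmoid(z) - y
```

A finite-difference check of `bce_with_logits` against `bce_grad` is a quick way to confirm the gradient formula.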
Categorical Cross-Entropy
For $y \in \{1, \ldots, K\}$ (one-hot encoded as $\mathbf{y}$) and softmax probabilities $\hat{\mathbf{p}}$:
\[\mathcal{L} = -\sum_{k=1}^K y_k \log \hat{p}_k\]
In practice, the softmax and the loss are fused into a single log-softmax + NLL computation on the logits $z$:
\[\mathcal{L} = -z_{y} + \log \sum_k e^{z_k}\]
where $z_y$ is the logit for the true class. This form is numerically stable provided the log-sum-exp is evaluated with the maximum logit subtracted first.
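The log-softmax + NLL form can be sketched as follows. The function names are illustrative, and `target` is a 0-based class index (an assumption of this example, whereas the text indexes classes from 1). Subtracting the maximum logit before exponentiating is the standard log-sum-exp trick.

```python
import math
from typing import Sequence


def log_sum_exp(z: Sequence[float]) -> float:
    """log(sum_k exp(z_k)), shifted by max(z) so exp() cannot overflow."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def cross_entropy_from_logits(z: Sequence[float], target: int) -> float:
    """Categorical cross-entropy: -z[target] + log-sum-exp over all logits."""
    return log_sum_exp(z) - z[target]
```

Because the loss depends only on differences between logits, shifting every logit by the same constant leaves it unchanged, which is what makes the max-subtraction safe.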
Label Smoothing
Replaces hard one-hot targets with soft targets:
\[\tilde{y}_k = (1 - \epsilon) y_k + \frac{\epsilon}{K}\]
Discourages overconfident predictions and tends to improve calibration. Typically $\epsilon = 0.1$.
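To make the smoothing formula concrete, here is a small sketch that builds the soft target vector and plugs it into cross-entropy. The helper names and the 0-based `target` index are assumptions of this example.

```python
import math
from typing import List, Sequence


def smooth_labels(target: int, num_classes: int, eps: float = 0.1) -> List[float]:
    """Soft targets: (1 - eps) * one_hot + eps / K for every class."""
    off = eps / num_classes
    return [(1.0 - eps) + off if k == target else off for k in range(num_classes)]


def soft_cross_entropy(probs: Sequence[float], soft_targets: Sequence[float]) -> float:
    """Cross-entropy against a soft target distribution."""
    return -sum(t * math.log(p) for t, p in zip(soft_targets, probs))
```

With `eps=0.0` the targets reduce to the usual one-hot vector, so `soft_cross_entropy` matches the standard categorical cross-entropy.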
Focal Loss
Addresses class imbalance by down-weighting easy (well-classified) examples:
\[\mathcal{L}_\text{focal} = -(1 - \hat{p}_t)^\gamma \log \hat{p}_t\]
where $\hat{p}_t = \hat{p}$ if $y=1$, else $1 - \hat{p}$, and $\gamma \geq 0$ is the focusing parameter. Used in RetinaNet for object detection.
At $\gamma = 0$: reduces to standard cross-entropy.
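A minimal sketch of the binary focal loss as defined above. The function name and the default `gamma=2.0` are choices made for this example.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=2.0`, a well-classified example ($\hat{p}_t = 0.9$) has its cross-entropy scaled by $(0.1)^2 = 0.01$, while a badly misclassified one ($\hat{p}_t = 0.1$) is scaled by $(0.9)^2 = 0.81$, which is the down-weighting effect described above.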
Hinge Loss (SVM)
\[\mathcal{L} = \max(0, 1 - y \cdot f(x)), \quad y \in \{-1, +1\}\]
Multi-class (Crammer-Singer), for true class $y_i$ and class scores $z_j$: $\mathcal{L} = \max(0, 1 - z_{y_i} + \max_{j \neq y_i} z_j)$
Does not produce calibrated probabilities. Used in SVMs, and occasionally in neural nets when a margin-based objective is desired.
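Both hinge variants fit in a few lines. The sketch below uses illustrative function names and a 0-based `target` index, which are assumptions of this example.

```python
from typing import Sequence


def hinge_loss(score: float, y: int) -> float:
    """Binary hinge loss for raw score f(x) and label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def multiclass_hinge(z: Sequence[float], target: int) -> float:
    """Crammer-Singer hinge: margin between the true class and the best wrong class."""
    best_wrong = max(v for j, v in enumerate(z) if j != target)
    return max(0.0, 1.0 - z[target] + best_wrong)
```

The loss is exactly zero once the correct score clears the margin of 1, so examples that are already classified with enough margin contribute no gradient.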
Regression Losses
Mean Squared Error (MSE / L2 Loss)
\[\mathcal{L} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]
Derivation: MLE under a Gaussian likelihood $P(y \mid \hat{y}) = \mathcal{N}(y; \hat{y}, \sigma^2)$ with fixed variance $\sigma^2$.
Gradient: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$. The gradient grows linearly with the error, so large errors (and outliers) dominate the update.
Mean Absolute Error (MAE / L1 Loss)
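\[\mathcal{L} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|\]
Gradient: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{1}{n}\,\text{sign}(\hat{y}_i - y_i)$. Its magnitude is constant regardless of error size, which makes the loss robust to outliers; it is not differentiable at zero error.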
Huber Loss (Smooth L1)
\[\mathcal{L}_\delta = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta\left(|y - \hat{y}| - \frac{\delta}{2}\right) & |y - \hat{y}| > \delta \end{cases}\]
Differentiable everywhere. MSE-like behavior near zero; MAE-like behavior for large errors. $\delta$ controls the transition point.
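The piecewise definition translates directly into code. This sketch works on a single residual $r = y - \hat{y}$; the function names and the default `delta=1.0` are assumptions of the example.

```python
def huber(r: float, delta: float = 1.0) -> float:
    """Huber loss of a residual r = y - y_hat."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r            # quadratic (MSE-like) region
    return delta * (a - 0.5 * delta)  # linear (MAE-like) region


def huber_grad(r: float, delta: float = 1.0) -> float:
    """Derivative with respect to r: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))
```

Both branches give $\frac{1}{2}\delta^2$ with slope $\delta$ at $|r| = \delta$, which is why the loss is differentiable at the transition. The clipped gradient also shows the outlier behavior: a residual of 10 pulls no harder than a residual of 1 when `delta=1.0`.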
Log-Cosh Loss
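\[\mathcal{L} = \frac{1}{n}\sum_{i=1}^n \log\!\cosh(y_i - \hat{y}_i)\]
Smooth approximation to Huber/MAE: it behaves like $\frac{1}{2}r^2$ for small residuals $r$ and like $|r| - \log 2$ for large ones. It is twice differentiable everywhere, which is useful for second-order optimizers.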
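Evaluating $\log\cosh$ naively overflows for large residuals, since $\cosh$ grows exponentially. A possible workaround, sketched here with an illustrative function name, uses the identity $\log\cosh r = |r| + \log(1 + e^{-2|r|}) - \log 2$.

```python
import math


def log_cosh(r: float) -> float:
    """log(cosh(r)) for a residual r, computed without overflow."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)
```

The first derivative is $\tanh(r)$ and the second is $1 - \tanh^2(r)$, both bounded and smooth.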
Probabilistic / Distribution Losses
KL Divergence
\[\mathcal{L} = D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]
Minimized (at zero) when the model distribution $Q$ matches the true distribution $P$. Cross-entropy decomposes as $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$, so minimizing cross-entropy is equivalent to minimizing KL whenever $H(P)$ is fixed.
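The decomposition $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ can be checked numerically on small discrete distributions. The function names below are illustrative, and the code assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """H(P) = -sum_x P(x) log P(x), with 0 log 0 treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that the divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.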
ELBO (Evidence Lower Bound)
Used in Variational Autoencoders:
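\[\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))\]
The ELBO is maximized; in practice the training loss is its negative.
Reconstruction term: forces the decoder to reconstruct inputs well.
KL term: pulls the approximate posterior $q_\phi(z|x)$ toward the prior $p(z) = \mathcal{N}(0, I)$, regularizing the latent space.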
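When the encoder outputs a diagonal Gaussian $q_\phi(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ (an assumption of this sketch; the text does not fix the form of $q_\phi$), the KL term against $\mathcal{N}(0, I)$ has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The function names and the log-variance parameterization below are illustrative.

```python
import math
from typing import Sequence


def gaussian_kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def negative_elbo(recon_log_lik: float, mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Loss to minimize: -(reconstruction log-likelihood - KL term).

    recon_log_lik is assumed to be an estimate of E_q[log p(x|z)],
    e.g. from a single sampled z.
    """
    return -(recon_log_lik - gaussian_kl_to_standard_normal(mu, log_var))
```

The KL term vanishes exactly when $\mu = 0$ and $\sigma^2 = 1$ in every latent dimension, that is, when the posterior equals the prior.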
Contrastive Loss
Used in self-supervised and metric learning:
\[\mathcal{L} = y \cdot d^2 + (1-y) \cdot \max(0, m - d)^2\]
where $d = \|z_1 - z_2\|$ is the distance between embeddings, $m$ is the margin, and $y = 1$ for similar pairs, $y = 0$ for dissimilar pairs.
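A small sketch of the pairwise contrastive loss, assuming Euclidean distance and the convention that `y = 1` marks a similar pair. The function names and the default margin are illustrative.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    z1: Sequence[float], z2: Sequence[float], y: int, margin: float = 1.0
) -> float:
    """Pull similar pairs (y=1) together; push dissimilar pairs (y=0) past the margin."""
    d = euclidean(z1, z2)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2
```

Dissimilar pairs that are already farther apart than the margin contribute zero loss, so training effort concentrates on pairs that violate it.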
Ranking / Metric Learning Losses
Triplet Loss
\[\mathcal{L} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)\]
Anchor $a$, positive $p$, negative $n$ with margin $\alpha$. Learns embeddings in which the positive is closer to the anchor than the negative by at least the margin.
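The triplet loss can be sketched on precomputed embeddings as follows. The arguments stand for $f(a)$, $f(p)$, $f(n)$; the function names and the default `alpha=0.2` are assumptions of this example.

```python
from typing import Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    alpha: float = 0.2,
) -> float:
    """Zero once the negative is farther from the anchor than the positive by alpha."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + alpha)
```

Triplets that already satisfy the margin give zero loss and zero gradient, which is why the choice of which triplets to train on matters in practice.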
NT-Xent (Normalized Temperature-scaled Cross-Entropy)
Used in contrastive SSL (SimCLR). See Self Supervised Learning.
Choosing a Loss Function
| Task | Recommended Loss |
|---|---|
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Multi-label classification | Binary cross-entropy per output |
| Regression (low noise) | MSE |
| Regression (outlier-prone) | Huber or MAE |
| Class imbalance | Focal loss |
| Probabilistic outputs | NLL, ELBO |
| Metric learning | Triplet, NT-Xent |
| Overconfidence problem | Cross-entropy + label smoothing |