Activation Functions
Activation functions introduce nonlinearity into neural networks. Without them, any depth of linear layers collapses to a single linear transformation. The choice of activation affects gradient flow, training speed, and representational capacity.
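The collapse of stacked linear layers can be checked numerically. The sketch below is illustrative only: the matrices `W1`, `W2` and the input `x` are made-up values, and the helper names are not from any library. It shows that two bias-free linear layers give the same output as the single merged matrix `W2 @ W1`, and that inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Elementwise max(0, z)."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.5]]
x = [0.5, -2.0]

# Two linear layers applied in sequence, with no activation in between.
two_layers = matvec(W2, matvec(W1, x))

# One layer whose weight matrix is the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# The same two layers with a ReLU in between: no single matrix reproduces this
# for all inputs.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```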
Requirements for a Good Activation
- Nonlinear: otherwise stacking layers is useless.
- Differentiable (almost everywhere): required for gradient-based training.
- Non-saturating (ideally): saturating activations kill gradients in deep networks.
- Computationally cheap: called billions of times per training step.
Sigmoid
\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
Maximum derivative: $0.25$ at $z = 0$.
Problems:
- Saturates at both tails: $\sigma’(z) \approx 0$ for $\lvert z \rvert \gg 0$. Gradients vanish in deep networks.
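- Output not zero-centered: all outputs are positive, so the gradients with respect to a downstream neuron’s incoming weights all share the same sign, which leads to zig-zagging updates.
- Relatively expensive: requires an exponential.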
Use: output layer for binary classification; gating mechanisms (LSTM, GRU).
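A minimal sketch of the sigmoid, its derivative, and the vanishing-gradient bound, using only Python's `math` module. The sign-based branching is one common way to avoid overflow in `exp` and is an implementation choice, not something the text prescribes. The helper `max_gradient_factor` is a hypothetical name: it gives the best-case product of sigmoid derivatives across `n_layers` layers (ignoring weights), using the maximum derivative of 0.25.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written so exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so this cannot overflow
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def max_gradient_factor(n_layers):
    """Upper bound on the product of n_layers sigmoid derivatives (weights ignored)."""
    return 0.25 ** n_layers


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))

print(max_gradient_factor(10))
```

Even in the best case, where every pre-activation sits exactly at zero, the activation derivatives alone shrink the gradient geometrically with depth.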
Tanh
\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1 \in (-1, 1)\]
Derivative: $\tanh'(z) = 1 - \tanh^2(z)$
Maximum derivative: $1.0$ at $z = 0$.
Advantages over sigmoid: zero-centered output and a stronger gradient at the origin (1 versus 0.25).
Problems: still saturates at tails; vanishing gradients in deep networks.
Use: hidden layers in RNNs and shallow networks. Largely replaced by ReLU in feedforward layers.
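The identity $\tanh(z) = 2\sigma(2z) - 1$ and the derivative formula can be checked with a few lines of Python. This is a sketch for intuition; `tanh_via_sigmoid` and `tanh_grad` are illustrative names, and the plain `sigmoid` here is only suitable for moderate inputs.

```python
import math


def sigmoid(z):
    """Plain logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    """tanh expressed as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


def tanh_grad(z):
    """Derivative tanh'(z) = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


for z in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(z, math.tanh(z), tanh_via_sigmoid(z), tanh_grad(z))
```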
ReLU (Rectified Linear Unit)
\[\text{ReLU}(z) = \max(0, z)\]
Derivative:
\[\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}\]
Advantages:
- Non-saturating for $z > 0$: gradients pass through unchanged.
- Computationally trivial (a single comparison).
- Induces sparsity: when pre-activations are roughly symmetric around zero (as at typical initialization), about 50% of neurons output exactly zero.
- Empirically accelerates convergence compared to sigmoid/tanh.
Problems:
- Dying ReLU: if a neuron’s pre-activation becomes negative for every input (e.g., after a large gradient step pushes its weights or bias far negative), it outputs 0 and receives zero gradient, so it cannot recover. Learning stops for that neuron.
- Not zero-centered.
- Not differentiable at $z = 0$ (in practice the derivative there is set to 0, which is a valid subgradient).
Use: default activation for hidden layers in MLPs and CNNs.
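The dying-ReLU problem can be illustrated with a single toy neuron. Everything in this sketch is hypothetical: the neuron computes `relu(w * x + b)` on scalar inputs, and `neuron_grads` returns the gradient of the summed outputs with respect to `w` and `b`. With a strongly negative bias the pre-activation is negative for every input, so both gradients are exactly zero and no gradient step can change the neuron.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU, using the convention ReLU'(0) = 0."""
    return 1.0 if z > 0 else 0.0


def neuron_grads(w, b, xs):
    """Gradient of sum_i relu(w * x_i + b) with respect to w and b."""
    dw = sum(relu_grad(w * x + b) * x for x in xs)
    db = sum(relu_grad(w * x + b) for x in xs)
    return dw, db


# Hypothetical inputs for illustration.
xs = [0.1, 0.5, 1.0, 2.0]

healthy = neuron_grads(1.0, 0.0, xs)   # every pre-activation is positive
dead = neuron_grads(1.0, -10.0, xs)    # every pre-activation is negative

print(healthy, dead)
```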
Leaky ReLU
\[\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}\]
Typically $\alpha = 0.01$. Fixes dying ReLU by allowing a small gradient for $z < 0$.
PReLU (Parametric ReLU)
Same as Leaky ReLU but $\alpha$ is a learned parameter. Adds minimal overhead; can improve accuracy on large datasets.
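A short sketch of Leaky ReLU and the extra gradient that PReLU needs. The function names are illustrative, not a library API. For PReLU the forward pass is the same as Leaky ReLU; the only addition is the derivative of the output with respect to $\alpha$, which is $z$ on the negative branch and 0 otherwise.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for z > 0, slope alpha otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Derivative with respect to the input z."""
    return 1.0 if z > 0 else alpha


def prelu_grad_alpha(z):
    """Derivative of the output with respect to alpha (used when alpha is learned)."""
    return z if z <= 0 else 0.0


for z in (-3.0, -0.5, 0.5, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z), prelu_grad_alpha(z))
```

Because the input gradient is never exactly zero, a neuron stuck in the negative regime still receives a small learning signal.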
ELU (Exponential Linear Unit)
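\[\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \leq 0 \end{cases}\]
- Continuous everywhere; differentiable at $z = 0$ only when $\alpha = 1$ (the usual default), since the left derivative there is $\alpha$ and the right derivative is 1.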
- Negative outputs bring mean activations closer to zero (reduces bias shift).
- Slower to compute due to exponential.
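The ELU definition translates directly into code. This sketch uses `math.expm1` for $e^z - 1$, which is an implementation choice for accuracy near zero rather than part of the definition. It also evaluates the derivative on both sides of $z = 0$ to show how the slope there depends on $\alpha$.

```python
import math


def elu(z, alpha=1.0):
    """ELU: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)


def elu_grad(z, alpha=1.0):
    """Derivative of ELU with respect to z (left derivative used at z = 0)."""
    return 1.0 if z > 0 else alpha * math.exp(z)


for z in (-5.0, -1.0, 0.0, 1.0):
    print(z, elu(z), elu_grad(z))

# Slope just left and just right of zero for two values of alpha.
for alpha in (1.0, 0.5):
    print(alpha, elu_grad(0.0, alpha), elu_grad(1e-9, alpha))
```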
SELU (Scaled ELU)
\[\text{SELU}(z) = \lambda \cdot \text{ELU}(z)\]
with fixed constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ (the $\alpha$ inside ELU). Self-normalizing: activations converge toward zero mean and unit variance under mild conditions, which can make batch normalization unnecessary. The guarantee assumes LeCun normal initialization and fully connected architectures.
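The fixed-point behaviour can be probed empirically. The sketch below is a rough Monte Carlo check, not a proof: it draws standard normal samples with Python's `random` module (the seed and sample count are arbitrary choices), applies SELU with the higher-precision published constants, and measures the mean and variance of the result. Both should land close to 0 and 1 respectively.

```python
import math
import random

# Higher-precision values of the constants quoted above.
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772


def selu(z):
    """SELU(z) = lambda * ELU(z) with the fixed alpha above."""
    return LAMBDA * (z if z > 0 else ALPHA * math.expm1(z))


random.seed(0)  # arbitrary seed, only for reproducibility
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, var)
```

This only checks one application to inputs that are already standardized; the self-normalizing argument for whole networks also depends on how the weights are initialized.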
GELU (Gaussian Error Linear Unit)
\[\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]
where $\Phi$ is the standard normal CDF. Approximated as:
\[\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left(\sqrt{2/\pi}(z + 0.044715z^3)\right)\right)\]
- Smooth, non-monotone.
- Stochastic interpretation: $\text{GELU}(z) = z \cdot P(Z \leq z)$ for $Z \sim \mathcal{N}(0,1)$.
- Default activation in BERT and GPT, and widely used in other Transformer models.
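The exact and approximate GELU formulas can be compared directly. This sketch uses `math.erf` for the exact form and scans an arbitrary grid on $[-6, 6]$ to measure the largest gap between the two; the grid and the names `gelu_exact` and `gelu_tanh` are illustrative choices.

```python
import math


def gelu_exact(z):
    """GELU(z) = z * Phi(z), with Phi written via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


# Evaluate both forms on a grid from -6 to 6 in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_abs_err = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
gelu_min = min(gelu_exact(z) for z in grid)

print(max_abs_err, gelu_min)

# Non-monotone: the output at -0.75 is lower than at both -3 and 0.
print(gelu_exact(-3.0), gelu_exact(-0.75), gelu_exact(0.0))
```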
SiLU / Swish
\[\text{SiLU}(z) = z \cdot \sigma(z)\]
Smooth, non-monotone, bounded below and unbounded above. Performs well in EfficientNet and many vision models. Close in shape to GELU.
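A small sketch of SiLU next to exact GELU, to make the similarity concrete. The grid on $[-6, 6]$ is an arbitrary choice and the helper names are illustrative. The script reports the minimum of SiLU on the grid and the largest pointwise gap between the two functions.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, branching on sign to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z * sigmoid(z)


def gelu(z):
    """Exact GELU via the error function, for comparison."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


grid = [i / 100.0 for i in range(-600, 601)]
silu_min = min(silu(z) for z in grid)
max_gap = max(abs(silu(z) - gelu(z)) for z in grid)

print(silu_min, max_gap)
```

On this grid the two curves stay within 0.25 of each other, and both dip slightly below zero for moderately negative inputs before returning toward zero.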
Softmax
\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\]
- Outputs sum to 1; interpreted as probabilities.
- Numerically stable version: subtract $\max(\mathbf{z})$ before exponentiating.
- Not used as an elementwise hidden-layer activation; its main uses are the output layer for multi-class classification and the normalization of attention weights.
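The max-subtraction trick can be sketched in a few lines. Subtracting $\max(\mathbf{z})$ leaves the result unchanged, because the common factor cancels between numerator and denominator, and it guarantees that every exponent is at most 0. The example logits are made up and deliberately large enough that a naive `math.exp(1000.0)` would raise `OverflowError`.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits that would overflow without the shift.
probs = softmax([1000.0, 1001.0, 1002.0])

# The same logits shifted down by 1000 give the same probabilities.
probs_shifted = softmax([0.0, 1.0, 2.0])

print(probs, probs_shifted)
```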
Temperature scaling: $\text{softmax}(\mathbf{z} / T)$ with $T > 0$. As $T \to 0$ the output sharpens toward a one-hot argmax; as $T \to \infty$ it flattens toward uniform. Used in knowledge distillation and language model decoding.
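The effect of temperature is easy to see on a small example. In this sketch the logits and the two temperature values are arbitrary, and `softmax_with_temperature` is an illustrative helper rather than a library function. It divides the logits by $T$ before applying a stable softmax.

```python
import math


def softmax_with_temperature(z, T=1.0):
    """Stable softmax of z / T; T must be positive."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for illustration.
logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, T=0.05)   # close to one-hot
plain = softmax_with_temperature(logits, T=1.0)    # ordinary softmax
flat = softmax_with_temperature(logits, T=100.0)   # close to uniform

print(sharp, plain, flat)
```

The ranking of the classes is the same at every temperature; only the concentration of the distribution changes.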
Comparison
| Activation | Range | Saturates | Zero-centered | Dying neurons | Typical Use |
|---|---|---|---|---|---|
| Sigmoid | $(0,1)$ | Yes (both) | No | No | Output (binary), gates |
| Tanh | $(-1,1)$ | Yes (both) | Yes | No | RNNs, shallow nets |
| ReLU | $[0, \infty)$ | No (pos) | No | Yes | MLPs, CNNs |
| Leaky ReLU | $\mathbb{R}$ | No | No | No | MLPs, CNNs |
| ELU | $(-\alpha, \infty)$ | No (pos) | Near-zero mean | No | Deep MLPs |
| GELU | $[-0.17, \infty)$ (approx.) | No (pos) | No | No | Transformers |
| Swish/SiLU | $[-0.28, \infty)$ (approx.) | No (pos) | No | No | Vision models |
| Softmax | $(0,1)^K$ | Yes | No | No | Output (multi-class) |