Activation Functions

Activation functions introduce nonlinearity into neural networks. Without them, any depth of linear layers collapses to a single linear transformation. The choice of activation affects gradient flow, training speed, and representational capacity.
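The collapse of stacked linear layers can be checked numerically. The sketch below uses small hypothetical weight matrices `W1` and `W2` (illustrative values, not from any real model) and a plain-Python `matmul` helper. It shows that two linear layers give the same output as the single merged matrix, and that inserting a ReLU between them breaks the equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu_matrix(m):
    """Apply max(0, .) elementwise."""
    return [[max(0.0, e) for e in row] for row in m]

W1 = [[1.0, 2.0], [0.0, -1.0]]   # first linear layer (hypothetical weights)
W2 = [[0.5, 1.0], [3.0, 0.0]]    # second linear layer (hypothetical weights)
x = [[2.0], [1.0]]               # input as a column vector

two_layers = matmul(W2, matmul(W1, x))        # apply W1, then W2
one_layer = matmul(matmul(W2, W1), x)         # single merged matrix W2 W1
with_relu = matmul(W2, relu_matrix(matmul(W1, x)))  # nonlinearity in between
```

Here `two_layers` and `one_layer` agree, while `with_relu` differs because the ReLU zeroes the negative intermediate value before the second layer sees it.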

Requirements for a Good Activation

Nonlinear: otherwise stacking layers is useless.
Differentiable (almost everywhere): required for gradient-based training.
Non-saturating (ideally): saturating activations kill gradients in deep networks.
Computationally cheap: evaluated for every unit on every example, which in large models can mean billions of evaluations per training step.

Sigmoid

\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]

Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

Maximum derivative: $0.25$ at $z = 0$.

Problems:

Saturates at both tails: $\sigma'(z) \approx 0$ for $\lvert z \rvert \gg 0$. Gradients vanish in deep networks.
Output not zero-centered: all outputs are positive, so the gradients of all weights feeding a downstream neuron share the same sign, which leads to zig-zagging gradient updates.
Expensive: requires exponential.

Use: output layer for binary classification; gating mechanisms (LSTM, GRU).
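As a minimal sketch of the formulas above, the following implements the sigmoid and its derivative with the standard library only. The branch on the sign of `z` is one common way to avoid overflow in `exp`; the function names are illustrative. Since the derivative never exceeds 0.25, the gradient factor contributed by `n` stacked sigmoids is at most `0.25 ** n`, which is one way to see the vanishing-gradient problem.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Upper bound on the product of sigmoid derivatives across 10 layers
bound_10_layers = 0.25 ** 10
```

Evaluating `sigmoid_grad(0.0)` gives the maximum of 0.25, while `sigmoid_grad(10.0)` is already several orders of magnitude smaller.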

Tanh

\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1 \in (-1, 1)\]

Derivative: $\tanh'(z) = 1 - \tanh^2(z)$

Maximum derivative: $1.0$ at $z = 0$.

Advantages over sigmoid: zero-centered output. Stronger gradient at origin.

Problems: still saturates at tails; vanishing gradients in deep networks.

Use: hidden layers in RNNs and shallow networks. Largely replaced by ReLU in feedforward layers.
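The identity $\tanh(z) = 2\sigma(2z) - 1$ and the derivative formula can be checked directly. This is a small sketch using Python's `math.tanh` as the reference; the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    """tanh expressed as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def tanh_grad(z):
    """Derivative 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Largest disagreement between the two forms on a few sample points
max_gap = max(abs(tanh_via_sigmoid(z) - math.tanh(z))
              for z in (-3.0, -1.0, 0.0, 0.5, 2.0))
```

The gradient is exactly 1 at the origin (four times the sigmoid's maximum) but still decays toward zero in the tails, for example at `tanh_grad(5.0)`.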

ReLU (Rectified Linear Unit)

\[\text{ReLU}(z) = \max(0, z)\]

Derivative:

\[\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}\]

Advantages:

Non-saturating for $z > 0$: gradients pass through unchanged.
Computationally trivial (comparison + threshold).
Induces sparsity: with zero-centered, symmetric pre-activations (e.g., at initialization), about 50% of neurons are inactive on any given input.
Empirically accelerates convergence compared to sigmoid/tanh.

Problems:

Dying ReLU: if a neuron’s pre-activation becomes negative for every input (e.g., after a large gradient update pushes its bias far negative), it outputs 0 and receives zero gradient, so it cannot recover. Learning stops for that neuron.
Not zero-centered.
Not differentiable at $z = 0$ (any value in $[0, 1]$ is a valid subgradient; $0$ is used in practice).

Use: default activation for hidden layers in MLPs and CNNs.
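The sketch below illustrates the sparsity and dying-ReLU points with synthetic data. It assumes standard normal pre-activations (an idealization of a freshly initialized layer) and models a "dead" neuron by shifting every pre-activation by a hypothetical bias of -5. The helper names are illustrative.

```python
import random

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention: take the subgradient 0 at z == 0
    return 1.0 if z > 0 else 0.0

def inactive_fraction(pre_activations):
    """Fraction of pre-activations for which ReLU outputs 0."""
    return sum(1 for z in pre_activations if z <= 0) / len(pre_activations)

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]   # assumed N(0, 1) inputs

frac_symmetric = inactive_fraction(zs)               # roughly half inactive
shifted = [z - 5.0 for z in zs]                      # hypothetical bias of -5
frac_shifted = inactive_fraction(shifted)            # almost always inactive
grad_signal = sum(relu_grad(z) for z in shifted)     # gradient reaching the neuron
```

With symmetric pre-activations about half of the units are off. After the shift, essentially every input yields zero output and zero gradient, so no update can move the neuron back into the active region.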

Leaky ReLU

\[\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}\]

Typically $\alpha = 0.01$. Mitigates dying ReLU by keeping a small, nonzero gradient $\alpha$ for $z < 0$.

PReLU (Parametric ReLU)

Same as Leaky ReLU but $\alpha$ is a learned parameter. Adds minimal overhead; can improve accuracy on large datasets.
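A minimal sketch of Leaky ReLU and the extra gradient PReLU needs follows. Because the negative branch is $\alpha z$, its derivative with respect to $\alpha$ is simply $z$, which is what lets $\alpha$ be trained by backpropagation. Function names are illustrative.

```python
def leaky_relu(z, alpha=0.01):
    """Identity for z > 0, slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad_z(z, alpha=0.01):
    """Derivative with respect to the input z."""
    return 1.0 if z > 0 else alpha

def prelu_grad_alpha(z):
    """Derivative with respect to alpha: nonzero only on the negative side."""
    return 0.0 if z > 0 else z
```

Unlike ReLU, `leaky_relu_grad_z(-3.0)` is `alpha` rather than 0, so a neuron stuck in the negative region still receives a small learning signal.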

ELU (Exponential Linear Unit)

\[\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \leq 0 \end{cases}\]

Continuously differentiable everywhere when $\alpha = 1$; for other $\alpha$ the derivative jumps from $\alpha$ to $1$ at $z = 0$.
Negative outputs bring mean activations closer to zero (reduces bias shift).
Slower to compute due to exponential.
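The ELU definition translates directly into code. This sketch uses `math.expm1` for accuracy near zero and also shows the derivative: the negative branch has slope $\alpha e^z$, so at $z = 0$ the two branches meet with matching slope when $\alpha = 1$. Names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """z for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative: 1 for z > 0, alpha * exp(z) otherwise."""
    return 1.0 if z > 0 else alpha * math.exp(z)

left_slope = elu_grad(-1e-9)      # just left of zero, alpha = 1
right_slope = elu_grad(1e-9)      # just right of zero
floor = elu(-50.0)                # approaches -alpha for very negative z
```

The output levels off near $-\alpha$ for very negative inputs, which is what pulls the mean activation toward zero.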

SELU (Scaled ELU)

\[\text{SELU}(z) = \lambda \cdot \text{ELU}(z)\]

With fixed constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$. Self-normalizing: under mild conditions, activations converge toward zero mean and unit variance as they propagate through the layers, which can make batch normalization unnecessary. The result assumes LeCun normal initialization and fully connected architectures.
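The self-normalizing behavior can be observed in a toy experiment. The sketch below is an illustration under stated assumptions, not a proof: it pushes one standard normal input vector through a stack of randomly initialized dense layers (weights drawn with variance `1 / fan_in`, i.e. LeCun normal, no biases) and reports the mean and variance of the final activations. The width, depth, and seed are arbitrary choices.

```python
import math
import random

LAMBDA = 1.0507009873554805   # SELU scale
ALPHA = 1.6732632423543772    # SELU alpha

def selu(z):
    """lambda * ELU(z) with the fixed SELU constants."""
    return LAMBDA * (z if z > 0 else ALPHA * math.expm1(z))

def propagate(width=256, depth=8, seed=0):
    """Push one random vector through `depth` dense SELU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(1.0 / width)          # LeCun normal: variance 1 / fan_in
    for _ in range(depth):
        # Each output unit gets its own freshly sampled weight row
        x = [selu(sum(rng.gauss(0.0, std) * xj for xj in x))
             for _ in range(width)]
    mean = sum(x) / width
    var = sum((v - mean) ** 2 for v in x) / width
    return mean, var

mean, var = propagate()
```

With these settings the statistics should stay in the neighborhood of zero mean and unit variance after several layers, up to sampling noise from the finite width.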

GELU (Gaussian Error Linear Unit)

\[\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]

where $\Phi$ is the standard normal CDF. Approximated as:

\[\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left(\sqrt{2/\pi}(z + 0.044715z^3)\right)\right)\]

Smooth, non-monotone.
Stochastic interpretation: $\text{GELU}(z) = z \cdot P(Z \leq z)$ for $Z \sim \mathcal{N}(0,1)$.
Default activation in BERT and GPT, and widely used across Transformer architectures.
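To get a feel for how close the tanh approximation is, the sketch below evaluates both forms on a grid using `math.erf` for the exact version. The grid range of $[-6, 6]$ and the step size are arbitrary choices for illustration.

```python
import math

def gelu_exact(z):
    """z * Phi(z), with Phi written via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-600, 601)]
max_abs_error = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
```

The two versions differ only by a small absolute error on this range. The non-monotonicity is also visible: `gelu_exact(-0.75)` is lower than both `gelu_exact(-3.0)` and `gelu_exact(0.0)`.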

SiLU / Swish

\[\text{SiLU}(z) = z \cdot \sigma(z)\]

Smooth, non-monotone, bounded below, and unbounded above. Performs well in EfficientNet and many vision models. Similar in shape to GELU.
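A short sketch of SiLU follows, together with a grid search that locates the dip on the negative side. The grid bounds and step are arbitrary, and the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def silu(z):
    """z * sigmoid(z)."""
    return z * sigmoid(z)

grid = [i / 1000.0 for i in range(-5000, 5001)]
z_min = min(grid, key=silu)      # location of the minimum on the grid
silu_min = silu(z_min)           # minimum value, a little below -0.27
```

The function decreases until roughly $z \approx -1.28$ and then rises again toward zero as $z \to -\infty$, which is the non-monotone region.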

Softmax

\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\]

Outputs sum to 1; interpreted as probabilities.
Numerically stable version: subtract $\max(\mathbf{z})$ before exponentiating.
Not typically used as a hidden-layer activation; it appears in the output layer for multi-class classification and, in Transformers, to normalize attention scores.

Temperature scaling: $\text{softmax}(\mathbf{z} / T)$. Low $T \to 0$ sharpens to argmax; high $T \to \infty$ flattens to uniform. Used in knowledge distillation and language model decoding.
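The stable formulation and temperature scaling fit in a few lines. This is a minimal sketch using only the standard library; the `temperature` parameter name and the sample logits are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]   # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]
probs = softmax(logits)                        # ordinary softmax
sharp = softmax(logits, temperature=0.01)      # nearly one-hot on the argmax
flat = softmax(logits, temperature=1000.0)     # nearly uniform
big = softmax([1000.0, 1001.0])                # would overflow without the shift
```

Subtracting the maximum leaves the result unchanged, because the common factor cancels between numerator and denominator, but it keeps every exponent at or below zero.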

Comparison

Activation	Range	Saturates	Zero-centered	Dying neurons	Typical Use
Sigmoid	$(0,1)$	Yes (both)	No	No	Output (binary), gates
Tanh	$(-1,1)$	Yes (both)	Yes	No	RNNs, shallow nets
ReLU	$[0, \infty)$	Negative side	No	Yes	MLPs, CNNs
Leaky ReLU	$\mathbb{R}$	No	No	No	MLPs, CNNs
ELU	$(-\alpha, \infty)$	Negative side	Near-zero mean	No	Deep MLPs
GELU	$\approx [-0.17, \infty)$	Negative side	No	No	Transformers
Swish/SiLU	$\approx [-0.28, \infty)$	Negative side	No	No	Vision models
Softmax	$(0,1)^K$	Yes	No	No	Output (multi-class)