Shallow Neural Networks

A shallow neural network has exactly one hidden layer between input and output. Despite their simplicity, shallow networks are the setting for the universal approximation theorem and a building block for understanding deeper architectures. With an activation $\sigma$ applied elementwise, the network computes:

\[\hat{\mathbf{y}} = W^{(2)} \sigma(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}\]

Architecture

Input layer     Hidden layer                 Output layer
x_1  ─┐
x_2  ─┤
...   ├──► [h_1, h_2, ..., h_n] ──► [y_1, ..., y_k]
x_d  ─┘

  • Input: $\mathbf{x} \in \mathbb{R}^d$
  • Hidden: $\mathbf{h} = \sigma(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}) \in \mathbb{R}^n$
  • Output: $\hat{\mathbf{y}} = W^{(2)}\mathbf{h} + \mathbf{b}^{(2)} \in \mathbb{R}^k$

Total parameters: $d \cdot n + n + n \cdot k + k = n(d + k + 1) + k$
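
As a quick sanity check of the count above, the following sketch tallies the parameters layer by layer and compares the result with the closed form. The function name `count_params` and the example sizes are illustrative choices, not part of any library.

```python
def count_params(d: int, n: int, k: int) -> int:
    """Parameters of a one-hidden-layer network: d inputs, n hidden units, k outputs."""
    hidden = d * n + n   # W1 is n x d, b1 has n entries
    output = n * k + k   # W2 is k x n, b2 has k entries
    return hidden + output


# The layer-by-layer count agrees with the closed form n(d + k + 1) + k.
d, n, k = 3, 4, 2
total = count_params(d, n, k)
closed_form = n * (d + k + 1) + k
print(total, closed_form)  # 26 26
```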

Forward Pass

For a single input $\mathbf{x} \in \mathbb{R}^d$:

Layer 1 (hidden):

\[z_j^{(1)} = \sum_{i=1}^d w_{ji}^{(1)} x_i + b_j^{(1)}, \quad j = 1, \ldots, n\]

\[h_j = \sigma(z_j^{(1)})\]

Layer 2 (output):

\[\hat{y}_m = \sum_{j=1}^n w_{mj}^{(2)} h_j + b_m^{(2)}, \quad m = 1, \ldots, k\]
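
The two layers above can be written out almost verbatim. The sketch below is a minimal, dependency-free version that assumes a ReLU activation and uses hand-picked weights. The names `relu` and `forward` and all numeric values are illustrative assumptions.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def forward(x, W1, b1, W2, b2, activation=relu):
    """Forward pass of a one-hidden-layer network on a single input x.

    W1 is a list of n rows of length d, W2 a list of k rows of length n.
    """
    # Layer 1: z_j = sum_i W1[j][i] * x[i] + b1[j], then h_j = activation(z_j)
    h = [activation(sum(w * xi for w, xi in zip(row, x)) + bj)
         for row, bj in zip(W1, b1)]
    # Layer 2: linear read-out, no activation on the output
    y_hat = [sum(w * hj for w, hj in zip(row, h)) + bm
             for row, bm in zip(W2, b2)]
    return h, y_hat


# Tiny example with d = 2, n = 3, k = 1 (weights chosen by hand).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[2.0, 1.0, -1.0]]
b2 = [0.1]
h, y_hat = forward([1.0, 2.0], W1, b1, W2, b2)
print(h, y_hat)
```

Here only the second hidden unit is active (its pre-activation is 0.5, the other two are negative), so the output is approximately 0.6.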

Learning in a Shallow Network

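The network minimizes the empirical risk over $N$ training pairs $(\mathbf{x}_i, \mathbf{y}_i)$, writing $\hat{\mathbf{y}}_i$ for the prediction on $\mathbf{x}_i$:

\[\hat{\theta} = \arg\min_{W^{(1)}, \mathbf{b}^{(1)}, W^{(2)}, \mathbf{b}^{(2)}} \frac{1}{N} \sum_{i=1}^N \mathcal{L}(\hat{\mathbf{y}}_i, \mathbf{y}_i)\]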

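The output weights $W^{(2)}$ enter the prediction linearly: with the hidden activations $\mathbf{h}$ held fixed, the problem is convex in $W^{(2)}$ and $\mathbf{b}^{(2)}$ whenever $\mathcal{L}$ is convex in the prediction (squared error, for example). The hidden weights $W^{(1)}$ pass through $\sigma$, which makes the overall problem non-convex.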
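
To make the convexity remark concrete, the sketch below fixes random hidden weights and checks the midpoint inequality for the risk as a function of the output weights alone. It assumes a scalar output, a squared-error loss, a ReLU activation, and a toy target. All names and values are illustrative, and a single midpoint check illustrates convexity without proving it.

```python
import random


def relu(z):
    return max(0.0, z)


def hidden(x, W1, b1):
    return [relu(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W1, b1)]


def mse_risk(X, Y, W1, b1, w2, b2):
    """Empirical risk (1/N) * sum_i (y_hat_i - y_i)^2 for a scalar-output network."""
    total = 0.0
    for x, y in zip(X, Y):
        h = hidden(x, W1, b1)
        y_hat = sum(w * hj for w, hj in zip(w2, h)) + b2
        total += (y_hat - y) ** 2
    return total / len(X)


rng = random.Random(0)
d, n, N = 2, 4, 50
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
Y = [x[0] * x[1] for x in X]  # toy target, chosen only for illustration
W1 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
b1 = [rng.gauss(0, 1) for _ in range(n)]

# Two arbitrary output-weight vectors and their midpoint, with W1 and b1 held fixed.
wa = [rng.gauss(0, 1) for _ in range(n)]
wb = [rng.gauss(0, 1) for _ in range(n)]
mid = [(a + b) / 2 for a, b in zip(wa, wb)]

risk_a = mse_risk(X, Y, W1, b1, wa, 0.0)
risk_b = mse_risk(X, Y, W1, b1, wb, 0.0)
risk_mid = mse_risk(X, Y, W1, b1, mid, 0.0)
# Convexity in the output weights implies risk_mid <= (risk_a + risk_b) / 2.
print(risk_mid <= (risk_a + risk_b) / 2)
```

Repeating the same midpoint test on $W^{(1)}$ instead carries no such guarantee, since the hidden weights act through the nonlinearity.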

Representational Capacity

For a width-$n$ ReLU network:

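  • In 1-D: creates at most $n$ “kinks” (breakpoints), so the output is a continuous piecewise linear function with at most $n + 1$ pieces. On a bounded interval it can represent any continuous piecewise linear function with $\leq n$ pieces.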
  • In $d$ dimensions: partitions input space into at most $\sum_{j=0}^d \binom{n}{j}$ regions, each with a linear function.
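
Both bullets can be checked numerically. The sketch below evaluates the region bound and locates the 1-D kinks of a set of ReLU units. The helper names `max_linear_regions` and `breakpoints_1d` and the example weights are illustrative assumptions.

```python
from math import comb


def max_linear_regions(n: int, d: int) -> int:
    """Upper bound sum_{j=0}^{d} C(n, j) on the regions cut out by n hyperplanes in R^d."""
    return sum(comb(n, j) for j in range(d + 1))


def breakpoints_1d(w, b):
    """Kink locations x = -b_j / w_j of the units relu(w_j * x + b_j).

    Units with w_j = 0 are constant in x and contribute no kink.
    """
    return sorted(-bj / wj for wj, bj in zip(w, b) if wj != 0)


print(max_linear_regions(3, 1))   # 4: three kinks split the line into four pieces
print(max_linear_regions(3, 2))   # 7
print(breakpoints_1d([1.0, -2.0, 0.5], [1.0, 1.0, -1.0]))  # [-1.0, 0.5, 2.0]
```

Note that once $d \geq n$ the bound equals $2^n$, the number of on/off patterns of $n$ units.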

Width needed for $\epsilon$-approximation: for generic multivariate targets, the required width typically grows exponentially in $d$. This curse of dimensionality is one motivation for depth.

Universal Approximation (Formal Statement)

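Cybenko (1989): For any continuous $f: [0,1]^d \to \mathbb{R}$ and $\epsilon > 0$, there exist a width $n$ and parameters $W^{(1)}$, $\mathbf{b}^{(1)}$, $W^{(2)}$ such that the network $\hat{f}(x) = W^{(2)} \sigma(W^{(1)} x + \mathbf{b}^{(1)})$ satisfies: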

\[\sup_{x \in [0,1]^d} |f(x) - \hat{f}(x)| < \epsilon\]

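for any continuous sigmoidal activation $\sigma$, that is, one with $\sigma(t) \to 0$ as $t \to -\infty$ and $\sigma(t) \to 1$ as $t \to +\infty$.

Hornik (1991): Extends the result to any bounded, non-constant continuous $\sigma$. Leshno et al. (1993) show that, for continuous $\sigma$, it holds exactly when $\sigma$ is not a polynomial.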

The theorem guarantees existence but not efficient constructibility or learnability via gradient descent.
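
In one dimension with ReLU units, the existence claim can be seen through an explicit hand construction: a width-$n$ network that linearly interpolates the target at $n + 1$ equally spaced knots. The sketch below is only an illustration under those assumptions. It is not the argument used in the cited proofs, and it says nothing about what gradient descent would find. The target function, grid resolution, and all function names are illustrative choices.

```python
import math


def relu(z):
    return max(0.0, z)


def relu_interpolant(f, n):
    """Width-n ReLU network that linearly interpolates f at the knots j/n on [0, 1].

    Returns (knots, coeffs, bias) with
    f_hat(x) = bias + sum_j coeffs[j] * relu(x - knots[j]).
    In the notation of the text: W1 is all ones, b1 = -knots, W2 = coeffs, b2 = bias.
    """
    knots = [j / n for j in range(n)]
    values = [f(j / n) for j in range(n + 1)]
    slopes = [(values[j + 1] - values[j]) * n for j in range(n)]
    # The first unit sets the initial slope; each later unit adds the change in slope.
    coeffs = [slopes[0]] + [slopes[j] - slopes[j - 1] for j in range(1, n)]
    return knots, coeffs, values[0]


def evaluate(x, knots, coeffs, bias):
    return bias + sum(a * relu(x - t) for a, t in zip(coeffs, knots))


def sup_error(f, n, grid=2001):
    """Largest absolute error on an evenly spaced grid over [0, 1]."""
    net = relu_interpolant(f, n)
    return max(abs(f(i / (grid - 1)) - evaluate(i / (grid - 1), *net))
               for i in range(grid))


def target(x):
    return math.sin(2 * math.pi * x)


for n in (5, 20, 80):
    print(n, sup_error(target, n))
```

The printed error shrinks as the width grows, which is the 1-D picture behind the $\epsilon$ in the statement above. The width this construction needs in higher dimensions grows much faster.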

When Shallow Networks Suffice

  • Low-dimensional inputs ($d$ up to a few hundred).
  • Smooth, simple target functions.
  • Small datasets where deep networks overfit.
  • Settings where interpretability is required.

Examples: one-hidden-layer MLPs for tabular data and radial basis function networks. Logistic regression is the limiting case with no hidden layer at all.

Comparison to Deep Networks

| Property | Shallow | Deep |
|---|---|---|
| Parameters for same function | Exponentially more | Polynomial |
| Optimization landscape | Fewer saddle points | More complex |
| Feature learning | One level of features | Hierarchical |
| Generalization (typical) | Worse on structured data | Better |
| Interpretability | Easier | Harder |

Shallow networks are best understood as the foundation: their forward pass, loss landscape, and gradient flow carry over directly to deeper architectures. See Deep Neural Networks.