Sequence Models

Sequence models process variable-length ordered inputs, maintaining state across time steps. They were the dominant NLP architecture before Transformers and remain relevant for streaming, low-latency, and resource-constrained settings.

Recurrent Neural Networks (RNN)

At each time step $t$, the hidden state $h_t$ summarizes all past inputs:

\[\begin{aligned} h_t &= \tanh(W_h h_{t-1} + W_x x_t + b) \\ y_t &= W_o h_t + b_o \end{aligned}\]

Parameters $W_h$, $W_x$ are shared across all time steps (weight tying).

Training: backpropagation through time (BPTT). Unroll the computation graph over $T$ steps and backpropagate. Memory scales with $T$.

Vanishing gradients: gradients of the loss w.r.t. early hidden states involve $\prod_{t’=t}^T W_h$. If the spectral radius $\rho(W_h) < 1$, gradients shrink exponentially. Limits effective context to $\sim$10-20 steps.

Exploding gradients: $\rho(W_h) > 1$ causes gradients to diverge. Fixed with gradient clipping: $g \leftarrow g \cdot \min(1, \theta / |g|)$.

Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997). Introduces a cell state $c_t$ as a linear memory bus, controlled by learned gates.

\[\begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \\ \tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate)} \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell update)} \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}\]

Forget gate: how much of the previous cell state to retain.

Input gate: how much of the new candidate to write.

Output gate: what part of the cell state to expose as hidden state.

The additive cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ allows gradients to flow without repeated multiplication by the same matrix. This is the key to learning long-range dependencies.

Gated Recurrent Unit (GRU)

Cho et al. (2014). Simplification of LSTM with two gates; no separate cell state.

\[\begin{aligned} z_t &= \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)} \\ r_t &= \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)} \\ \tilde{h}_t &= \tanh(W [r_t \odot h_{t-1}, x_t]) \quad \text{(candidate)} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned}\]

Fewer parameters than LSTM; comparable performance on most tasks. More amenable to efficient hardware implementation.

Model	Parameters	Gates	Cell state	Typical use
Vanilla RNN	Fewest	None	No	Simple tasks
GRU	Medium	2	No	General
LSTM	Most	3	Yes	Long sequences

Bidirectional RNNs

Standard RNNs process sequences left to right. For tasks requiring full context (classification, NER, parsing), process the sequence in both directions and concatenate hidden states:

\[\begin{aligned} \overrightarrow{h}_t &= \text{RNN}(x_t, \overrightarrow{h}_{t-1}) \\ \overleftarrow{h}_t &= \text{RNN}(x_t, \overleftarrow{h}_{t+1}) \\ h_t &= [\overrightarrow{h}_t; \overleftarrow{h}_t] \end{aligned}\]

Cannot be used for autoregressive generation (left-to-right only).

Encoder-Decoder (Seq2Seq)

Sutskever et al. (2014). Maps a source sequence to a target sequence of different length.

Encoder: reads the source and produces a context vector (the final hidden state).

\[c = h_T^{\text{enc}}\]

Decoder: generates the target token by token, conditioned on $c$:

\[\begin{aligned} h_t^{\text{dec}} &= \text{LSTM}(y_{t-1}, h_{t-1}^{\text{dec}}, c) \\ p(y_t \mid y_{<t}, x) &= \text{softmax}(W h_t^{\text{dec}}) \end{aligned}\]

Bottleneck problem: the entire source must compress into a single vector $c$. For long sequences, this is insufficient. Solved by attention (see Attention Mechanisms).

Applications: machine translation, summarization, dialogue, code generation.

Convolutional Sequence Models

1D convolutions over sequences. Each filter captures local patterns of fixed width $k$.

Temporal Convolutional Network (TCN):

Causal convolutions (no future leakage).
Dilated convolutions with exponentially growing dilation $d = 1, 2, 4, \ldots, 2^{L-1}$ to achieve $O(\log T)$ depth for $O(T)$ receptive field.
Residual connections for training stability.

Advantages over RNN:

Fully parallelizable over time during training.
Stable gradients (no BPTT).
Explicit receptive field size.

Disadvantages: fixed receptive field; must set network depth to cover the required context.

State Space Models (SSM)

Recent alternative to RNNs that models sequences via a latent state transition:

$h'(t) = A h(t) + B x(t)$ $y(t) = C h(t) + D x(t)$

Discretized for sequences. The structured state space sequence model (S4) parameterizes $A$ as a special structured matrix to enable efficient $O(N \log N)$ convolution computation.

Mamba (2023): selective SSM where $B$, $C$, and the step size $\Delta$ depend on the input, allowing the model to selectively remember or forget. Linear time complexity at inference; competitive with Transformers on many tasks.

Comparison with Transformers

Property	RNN / LSTM	CNN / TCN	Transformer	SSM / Mamba
Training parallelism	None (sequential)	Full	Full	Full
Inference per token	$O(1)$	$O(k)$	$O(n)$	$O(1)$
Long-range dependencies	Difficult	Fixed receptive field	Excellent	Good
Memory (train)	$O(T)$	$O(T)$	$O(T^2)$	$O(T)$
Memory (inference)	$O(d)$	$O(d)$	$O(T \cdot d)$ (KV cache)	$O(d)$

RNNs are preferred for streaming inference (constant memory, constant per-step compute), while Transformers dominate quality on offline tasks.