Language Modeling
A language model assigns a probability to a sequence of tokens. It is the foundation of modern NLP: pretraining a language model on large corpora produces representations that transfer to virtually every downstream task.
The Language Modeling Objective
Given a sequence $w_1, w_2, \ldots, w_T$, the joint probability is:
\[p(w_1, \ldots, w_T) = \prod_{t=1}^T p(w_t \mid w_1, \ldots, w_{t-1})\]by the chain rule. A language model learns to estimate each conditional $p(w_t \mid w_{<t})$.
Training: maximize log-likelihood (minimize cross-entropy) over a large corpus:
\[\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log p_\theta(w_t \mid w_{<t})\]Perplexity
The standard evaluation metric. Lower is better.
\[\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^T \log p(w_t \mid w_{<t})\right) = \exp(\mathcal{L})\]Interpretation: the effective vocabulary size the model is “choosing from” at each step. PPL = 10 means the model is as uncertain as choosing uniformly among 10 options.
Baseline: uniform over vocabulary: $\text{PPL} = \lvert V \rvert$.
N-gram Language Models
Approximate the conditional with a fixed context window of $n-1$ words:
\[p(w_t \mid w_{<t}) \approx p(w_t \mid w_{t-n+1}, \ldots, w_{t-1})\]Bigram: $p(w_t \mid w_{t-1}) = \frac{c(w_{t-1}, w_t)}{c(w_{t-1})}$
Sparsity problem: most $n$-grams never appear in training data. Requires smoothing.
Smoothing methods:
| Method | Idea |
|---|---|
| Add-$k$ (Laplace) | Add $k$ to all counts |
| Interpolation | Combine trigram, bigram, unigram with learned weights |
| Kneser-Ney | Discounting + absolute continuation probability |
| Stupid Backoff | Use lower-order model when count is zero |
Kneser-Ney smoothing is the standard for n-gram models. N-gram models are practical for small-scale applications (keyboard autocomplete) but saturate in quality quickly.
Neural Language Models
Replace the n-gram table with a neural network that can capture longer dependencies and share statistical strength across similar contexts.
Fixed-Window Neural LM (Bengio et al. 2003)
Concatenate embeddings of $n-1$ context words; pass through MLP to predict next word.
\[p(w_t \mid w_{t-n+1:t-1}) = \text{softmax}(W_2 \tanh(W_1 [e_{t-n+1}; \ldots; e_{t-1}] + b_1) + b_2)\]Limitation: fixed context window; no weight sharing across positions.
RNN Language Models
Process sequences recurrently:
\(h_t = f(W_h h_{t-1} + W_x e_t + b)\) \(p(w_{t+1} \mid w_{\leq t}) = \text{softmax}(W_o h_t + b_o)\)
Theoretically unbounded context; in practice, vanishing gradients limit effective context. LSTMs extend this. See Sequence Models.
Transformer Language Models
Decoder-only Transformer with causal self-attention. Full parallel training; excellent at long-range dependencies. The dominant architecture. See Transformers.
Bidirectional vs. Causal LMs
Causal (autoregressive) LM: $p(w_t \mid w_{<t})$. Conditions only on past context. Can generate text. GPT family.
Masked LM (MLM): $p(w_t \mid w_{\neq t})$. Conditions on both left and right context. Cannot directly generate; excellent representations. BERT.
Prefix LM: causal on the generated tokens, bidirectional on the prefix. T5 in text-to-text framework.
Pretraining Objectives
| Objective | Model | Description |
|---|---|---|
| Causal LM (CLM) | GPT, LLaMA | Predict next token; train on all positions |
| Masked LM (MLM) | BERT | Predict 15% randomly masked tokens |
| Replaced Token Detection | ELECTRA | Distinguish real vs. generator-replaced tokens |
| Denoising | T5, BART | Reconstruct corrupted spans of text |
| Permutation LM | XLNet | Predict tokens in random order with full context |
ELECTRA efficiency: the discriminator (detect replaced tokens) sees all tokens, not just 15%. Same FLOPs, more signal. Trains $4\times$ more efficiently than BERT.
Language Model Scaling
Scaling laws (Kaplan et al. 2020): test loss scales as a power law in model parameters $N$, dataset tokens $D$, and compute $C$:
\[L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}\]Chinchilla scaling (Hoffmann et al. 2022): optimal compute allocation requires $D \approx 20N$ tokens. Many large models were undertrained; a smaller model trained on more data can outperform a larger model trained on fewer tokens.
Decoding Strategies
At inference, generating text requires choosing tokens from $p(w_t \mid w_{<t})$.
Greedy decoding: always pick $\arg\max$. Fast; repetitive and degenerate.
Beam search: maintain top-$k$ partial sequences at each step. Better quality; still tends toward generic outputs.
Top-$k$ sampling: sample from the top-$k$ most probable tokens. Diverse but can sample low-quality tokens.
Top-$p$ (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds $p$. Adapts to the sharpness of the distribution. $p = 0.9$–$0.95$ typical.
Temperature: scale logits by $1/T$ before softmax. $T < 1$: sharper, more conservative; $T > 1$: flatter, more random. $T = 0$ is greedy.
Min-$p$ sampling (2024): filter tokens below $p_\text{min} \times p_\text{max}$. Adapts threshold relative to the top token’s probability; strong quality/diversity balance.
| Strategy | Diversity | Quality | Speed |
|---|---|---|---|
| Greedy | None | Low | Fast |
| Beam search | Low | Moderate | Moderate |
| Top-$k$ | High | Moderate | Fast |
| Top-$p$ | High | Good | Fast |
| Temperature + top-$p$ | Tunable | Good | Fast |