Transformers

The Transformer (Vaswani et al. 2017, “Attention Is All You Need”) replaces recurrence entirely with stacked self-attention and position-wise feed-forward layers. It is now the dominant architecture for NLP, vision, audio, and multimodal models.

Architecture Overview

The original Transformer is an encoder-decoder model for sequence-to-sequence tasks. Many modern models use only one of the two stacks: encoder-only (BERT family) or decoder-only (GPT family).

Input tokens
  → Token embeddings + Positional encoding
  → Encoder stack (N layers)
      → Multi-Head Self-Attention
      → Add & Norm
      → Feed-Forward Network
      → Add & Norm
  → Encoder output

Target tokens (shifted right)
  → Token embeddings + Positional encoding
  → Decoder stack (N layers)
      → Masked Multi-Head Self-Attention
      → Add & Norm
      → Cross-Attention (Q from decoder, K/V from encoder)
      → Add & Norm
      → Feed-Forward Network
      → Add & Norm
  → Linear + Softmax → output distribution
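The cross-attention step in the decoder (queries from the decoder, keys and values from the encoder) can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random weights; the function and variable names (`cross_attention`, `W_q`, `W_k`, `W_v`) and the toy sizes are assumptions made for the example, not part of the original architecture description.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_x, enc_out, W_q, W_k, W_v):
    """Single-head cross-attention: Q from the decoder, K/V from the encoder."""
    Q = dec_x @ W_q                            # (T_dec, d_k)
    K = enc_out @ W_k                          # (T_enc, d_k)
    V = enc_out @ W_v                          # (T_enc, d_k)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)         # each decoder position attends over encoder positions
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                            # toy sizes, chosen only for illustration
enc_out = rng.normal(size=(5, d_model))        # 5 source positions
dec_x = rng.normal(size=(3, d_model))          # 3 target positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec_x, enc_out, W_q, W_k, W_v)
print(out.shape, weights.shape)                # (3, 4) (3, 5)
```

Each row of `weights` sums to 1 and spans the encoder positions, so the output length follows the decoder while the content comes from the encoder.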

Transformer Layer

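Each encoder layer consists of two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection and layer normalization. Decoder layers insert a third sub-layer, cross-attention over the encoder output.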

Multi-Head Self-Attention

Described in Attention Mechanisms. Each token’s representation is updated as a weighted sum over the value vectors of all tokens in the sequence, including the token itself.

Feed-Forward Network (FFN)

Applied position-wise (independently to each token):

$$ \text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2 $$

Dimension: $d_\text{model} \to d_\text{ff} \to d_\text{model}$, where $d_\text{ff} = 4 d_\text{model}$ typically.

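With $d_\text{ff} = 4 d_\text{model}$, the FFN holds about two-thirds of each layer’s weights, and much of the model’s “knowledge” is thought to be stored here. The FFN can be interpreted as a key-value memory: the first layer matches input patterns (keys), and the second layer outputs the associated values.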
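The FFN formula can be written directly in NumPy. The sketch below uses random weights and toy dimensions (both assumptions for illustration) and checks the "position-wise" property: running one token on its own gives the same result as running it inside the full sequence.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row (token) of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                      # d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))          # 5 tokens
y = ffn(x, W1, b1, W2, b2)                 # all tokens at once
y0 = ffn(x[0:1], W1, b1, W2, b2)           # first token alone

print(y.shape)                             # (5, 8)
print(np.allclose(y[0], y0[0]))            # True: no mixing across positions
```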

SwiGLU FFN (used in LLaMA, PaLM):

$$ \text{FFN}_\text{SwiGLU}(x) = (\text{SiLU}(x W_1) \odot x W_3) W_2 $$

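Adds a multiplicative gating mechanism: one projection passes through SiLU and scales the other element-wise. Because it uses three weight matrices instead of two, $d_\text{ff}$ is usually reduced to about $\tfrac{2}{3} \cdot 4 d_\text{model}$ to keep the parameter count equal. Empirically outperforms the ReLU FFN at the same parameter count.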
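A minimal NumPy sketch of the SwiGLU formula follows. The toy sizes are assumptions, chosen so that the three-matrix SwiGLU FFN and a two-matrix ReLU FFN with $d_\text{ff} = 4 d_\text{model}$ have the same number of weights.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))          # z * sigmoid(z)

def ffn_swiglu(x, W1, W3, W2):
    """(SiLU(x W1) ⊙ x W3) W2, with no bias terms as in the formula above."""
    return (silu(x @ W1) * (x @ W3)) @ W2

rng = np.random.default_rng(0)
d_model = 6
d_ff_relu = 4 * d_model                    # 24, hidden size of the ReLU FFN
d_ff_swiglu = (2 * d_ff_relu) // 3         # 16, reduced to match parameter counts

W1 = rng.normal(size=(d_model, d_ff_swiglu))
W3 = rng.normal(size=(d_model, d_ff_swiglu))
W2 = rng.normal(size=(d_ff_swiglu, d_model))

x = rng.normal(size=(5, d_model))
y = ffn_swiglu(x, W1, W3, W2)

relu_params = 2 * d_model * d_ff_relu      # W1 and W2 of the ReLU FFN (biases ignored)
swiglu_params = W1.size + W3.size + W2.size
print(y.shape, relu_params, swiglu_params) # (5, 6) 288 288
```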

Layer Normalization

Post-LN (original paper): LayerNorm applied after residual addition:

$$ x \leftarrow \text{LayerNorm}(x + \text{Sublayer}(x)) $$

Pre-LN (most modern models): LayerNorm before the sublayer:

$$ x \leftarrow x + \text{Sublayer}(\text{LayerNorm}(x)) $$

Pre-LN keeps the residual path free of normalization, which makes training more stable and less dependent on learning-rate warmup; it is preferred in large models.
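The two orderings differ only in where the normalization sits relative to the residual addition. The sketch below is a simplified illustration: `layer_norm` omits the learned gain and bias, and the sublayer is a hypothetical toy function standing in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    return layer_norm(x + sublayer(x))     # x <- LayerNorm(x + Sublayer(x))

def pre_ln(x, sublayer):
    return x + sublayer(layer_norm(x))     # x <- x + Sublayer(LayerNorm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sublayer = lambda h: np.tanh(h @ W)        # toy stand-in for attention or FFN

x = rng.normal(size=(4, 8))
post = post_ln(x, sublayer)
pre = pre_ln(x, sublayer)

# Post-LN: the layer output itself is normalized.
print(np.allclose(post.mean(axis=-1), 0.0, atol=1e-6))    # True
# Pre-LN: the input passes through unchanged on the residual path.
print(np.allclose(pre - x, sublayer(layer_norm(x))))      # True
```

In the Pre-LN form the identity path from input to output is never rescaled, which is one common explanation for its more stable gradients in deep stacks.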

RMSNorm: simplifies LayerNorm by removing the mean-centering term:

$$ \text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \odot g, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_i x_i^2} $$

Cheaper to compute than LayerNorm and empirically about as effective. Used in LLaMA, Mistral, T5.
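A direct NumPy transcription of the RMSNorm formula is shown below. The small `eps` inside the square root is an assumption added for numerical safety and is not part of the formula above.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm(x) = x / RMS(x) * g, computed per token over the feature axis."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 16))   # deliberately non-zero mean
y = rms_norm(x, g=np.ones(16))

# The output has unit root-mean-square per token...
print(np.allclose(np.sqrt((y ** 2).mean(axis=-1)), 1.0, atol=1e-4))  # True
# ...but, unlike LayerNorm, its mean is not removed.
print(np.all(np.abs(y.mean(axis=-1)) > 0.1))                          # True
```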

Encoder-Only Models (BERT family)

Bidirectional encoder. Uses all tokens as context for each position.

BERT (Devlin et al. 2018):

Pretraining: Masked LM (predict 15% randomly masked tokens) + Next Sentence Prediction.
Input: [CLS] sentence A [SEP] sentence B [SEP].
Fine-tuning: add a classification head on [CLS] for sequence tasks; token-level head for tagging tasks.
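The masked LM corruption step can be sketched as follows. Two details here go beyond the text above: the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) comes from the BERT paper, and selecting a fixed count of positions rather than sampling each token independently is a simplification. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_mask = max(1, round(len(candidates) * mask_prob))
    targets = rng.sample(candidates, n_mask)

    corrupted = list(tokens)
    labels = [None] * len(tokens)          # None = position does not contribute to the loss
    for i in targets:
        labels[i] = tokens[i]              # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = ([CLS] + "the cat sat on the mat".split() + [SEP]
          + "it was very soft and warm".split() + [SEP])
vocab = sorted(set(tokens) - {CLS, SEP})
corrupted, labels = mask_tokens(tokens, vocab)
print(corrupted)
print(labels)
```

Only the selected positions carry a label, so roughly 15% of tokens produce a training signal per pass.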

RoBERTa: BERT without NSP; trained longer on more data with larger batches and dynamic masking. Significantly outperforms BERT.

DeBERTa: disentangled attention over content and relative position; uses virtual adversarial training during fine-tuning. A strong encoder on many benchmarks.

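ELECTRA: replaced token detection pretraining. A small generator corrupts some tokens; the discriminator classifies each token as real or replaced. More compute-efficient than BERT's masked LM because every input token contributes to the loss, not only the masked 15%.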
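The discriminator's training targets in replaced token detection are simple per-token binary labels. In the toy sketch below, the corrupted sequence is written by hand as a stand-in for a generator's output; the sentences are made up for illustration.

```python
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]   # hypothetical generator output

# 1 = replaced, 0 = real. A generator sample that happens to equal the
# original token counts as real.
labels = [int(c != o) for c, o in zip(corrupted, original)]
print(labels)   # [0, 0, 1, 0, 0]
```

Every position receives a label, in contrast to masked LM training, where only the masked positions are scored.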

Decoder-Only Models (GPT family)

Causal (left-to-right) decoder: each position attends only to itself and earlier positions. Used for generation; often further fine-tuned with instruction tuning and RLHF for instruction following.
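Causality is usually enforced with a lower-triangular attention mask applied before the softmax. The following NumPy sketch uses all-zero scores (an assumption chosen so the resulting weights are easy to read) to show which positions each token can see.

```python
import numpy as np

def causal_mask(T):
    # True where attention is allowed: position i may attend to positions j <= i.
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)             # blocked entries get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
weights = masked_softmax(np.zeros((T, T)), causal_mask(T))
print(weights)
# Row i is uniform over positions 0..i and zero afterwards:
# [[1.    0.    0.    0.  ]
#  [0.5   0.5   0.    0.  ]
#  [0.333 0.333 0.333 0.  ]
#  [0.25  0.25  0.25  0.25]]   (third row rounded here for display)
```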

GPT-2 (2019): 1.5B parameters; trained on WebText. Demonstrated zero-shot performance on downstream tasks without task-specific fine-tuning.

GPT-3 (2020): 175B parameters; demonstrated strong few-shot in-context learning.

GPT-4 (2023): multimodal; significantly improved reasoning and instruction following.

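LLaMA family (Meta): open-weight decoder-only models. LLaMA-3 (2024): 8B and 70B at release, with 405B added in Llama 3.1; 128k-token vocabulary; grouped query attention; 8k context, extended to 128k in Llama 3.1.

Mistral 7B: sliding window attention; grouped query attention; reported to outperform Llama 2 7B and 13B on the benchmarks evaluated at release.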

Encoder-Decoder Models (T5 family)

Full encoder-decoder. Reframe all NLP tasks as text-to-text: input and output are both text strings.

T5 (Raffel et al. 2020): “Text-to-Text Transfer Transformer”. Pretraining: span denoising (mask spans, predict them). Fine-tuning: task prompt prepended to input.
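Span denoising turns one sentence into an input/target pair by replacing each masked span with a sentinel token. The sketch below takes the spans as given rather than sampling them, and the sentinel naming (`<extra_id_0>`, `<extra_id_1>`, ...) follows the convention of released T5 vocabularies; the helper name `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    inp, tgt, prev = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[prev:start] + [sentinel]   # span replaced by one sentinel in the input
        tgt += [sentinel] + tokens[start:end]    # target lists each sentinel, then the dropped span
        prev = end
    inp += tokens[prev:]
    tgt.append(f"<extra_id_{len(spans)}>")       # final sentinel closes the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target contains only the dropped spans, so it is much shorter than the input, which keeps decoder cost low during pretraining.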

BART: denoising autoencoder pretraining. Encoder reads corrupted text; decoder reconstructs original. Strong on summarization and generation.

mT5: multilingual T5 trained on 101 languages.

Positional Encoding

See Attention Mechanisms for RoPE, ALiBi, and sinusoidal encodings.

Key Hyperparameters

Hyperparameter	Typical range	Effect
$N$ (layers)	12-96	Depth; more capacity
$d_\text{model}$	768-12288	Width; representation size
$h$ (heads)	12-96	Attention diversity
$d_\text{ff}$	$4 d_\text{model}$	FFN capacity
Context length	512-128k+	Maximum input length

Efficient Transformer Variants

Grouped Query Attention (GQA): multiple query heads share the same key/value heads. Reduces KV cache memory at inference. Used in LLaMA-3, Mistral.

Multi-Query Attention (MQA): all query heads share a single key/value head. Maximum memory savings; slight quality loss.
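The memory effect of sharing key/value heads can be estimated with simple arithmetic: the KV cache stores one key and one value vector per layer, per KV head, per cached token. The configuration below (32 layers, head dimension 128, 8k tokens, 16-bit values) is a hypothetical example for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

cfg = dict(n_layers=32, head_dim=128, seq_len=8192)   # hypothetical model, fp16/bf16 cache

mha = kv_cache_bytes(n_kv_heads=32, **cfg)   # every query head has its own K/V head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)    # groups of 4 query heads share one K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)    # all query heads share a single K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
# MHA: 4096 MiB per sequence
# GQA: 1024 MiB per sequence
# MQA: 128 MiB per sequence
```

The cache shrinks in proportion to the number of KV heads, while the number of query heads, and therefore most of the attention compute, is unchanged.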

Mixture of Experts (MoE): replace the dense FFN with a sparse combination of $E$ expert FFNs. A learned router sends each token to $k$ experts (typically $k=2$), so the number of active parameters per token is a fraction of the total. Mixtral 8x7B: 46.7B total params, 12.9B active per token (about 28%; the total is below $8 \times 7$B because only the FFNs are replicated, while attention layers are shared).
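Top-$k$ routing can be sketched with a per-token loop in NumPy. This is a simplified illustration under stated assumptions: gate weights are a softmax over the selected experts' logits (one common choice), experts are small ReLU FFNs with random weights, and load-balancing losses and batching tricks are omitted. The names `moe_layer`, `make_expert`, and `W_gate` are hypothetical.

```python
import numpy as np

def make_expert(rng, d_model, d_ff):
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    return lambda h: np.maximum(0.0, h @ W1) @ W2    # a small ReLU FFN

def moe_layer(x, W_gate, experts, k=2):
    logits = x @ W_gate                               # (T, E) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        gate = np.exp(logits[t, sel] - logits[t, sel].max())
        gate /= gate.sum()                            # softmax over the selected experts only
        for w, e in zip(gate, sel):
            out[t] += w * experts[e](x[t])            # only k of the E experts run for this token
    return out, topk

rng = np.random.default_rng(0)
T, d_model, d_ff, E = 4, 8, 16, 4
experts = [make_expert(rng, d_model, d_ff) for _ in range(E)]
W_gate = rng.normal(size=(d_model, E))
x = rng.normal(size=(T, d_model))

out, topk = moe_layer(x, W_gate, experts, k=2)
print(out.shape, topk.shape)    # (4, 8) (4, 2)
```

With $k=2$ of $E=4$ experts, each token touches half of the expert weights here; production systems group tokens by expert instead of looping per token.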

Scaling Properties

Parameters in a standard Transformer:

$$ N \approx 12 n_\text{layers} d_\text{model}^2 $$

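(per layer, $4 d_\text{model}^2$ from the attention projections plus $8 d_\text{model}^2$ from the FFN with $d_\text{ff} = 4 d_\text{model}$; embeddings are excluded and become negligible for large $d_\text{model}$. Here $N$ denotes the parameter count, not the number of layers.)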
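As a quick check of the approximation, the sketch below counts $4 d_\text{model}^2$ attention weights and $8 d_\text{model}^2$ FFN weights per layer and applies the result to the published GPT-3 configuration (96 layers, $d_\text{model} = 12288$). Biases, layer norms, and embeddings are ignored, so this is an estimate rather than an exact count.

```python
def approx_params(n_layers, d_model):
    attn = 4 * d_model ** 2    # W_Q, W_K, W_V, W_O, each d_model x d_model
    ffn = 8 * d_model ** 2     # two matrices of size d_model x 4*d_model
    return n_layers * (attn + ffn)   # = 12 * n_layers * d_model^2

n = approx_params(n_layers=96, d_model=12288)
print(f"{n / 1e9:.1f}B")       # 173.9B, close to the quoted 175B
```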

Compute for a forward pass: $C_\text{forward} \approx 2N$ FLOPs per token (multiply-add counted as 2 ops).

Training compute: $C_\text{train} \approx 6ND$ FLOPs total for $D$ tokens (the backward pass costs about $2\times$ the forward pass, so forward + backward $\approx 3\times$ forward).
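Applying the $6ND$ rule to GPT-3 (175B parameters, about 300B training tokens as reported in its paper) gives a feel for the scale. The conversion to petaflop/s-days assumes a sustained $10^{15}$ FLOP/s, which is a unit of account rather than a claim about any real hardware.

```python
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens           # ~2N forward + ~4N backward, per token

flops = train_flops(n_params=175e9, n_tokens=300e9)
pf_days = flops / (1e15 * 86400)             # one petaflop/s sustained for one day

print(f"{flops:.2e} FLOPs")                  # 3.15e+23 FLOPs
print(f"{pf_days:.0f} petaflop/s-days")      # 3646 petaflop/s-days
```

The estimate ignores attention FLOPs that scale with context length, which are small relative to $6ND$ when the context is short compared with $d_\text{model}$.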