Tokenization
Tokenization converts a raw string into a sequence of tokens that the model processes. The choice of tokenization algorithm determines the vocabulary size, the handling of rare and unseen words, and the sequence length for a given input.
Word Tokenization
Split text on whitespace and punctuation boundaries.
Simple split: "Hello, world!".split() → ["Hello,", "world!"]. Punctuation attached to words.
Rule-based (Penn Treebank): handles contractions (“don’t” → “do”, “n’t”), possessives, punctuation. Standard for English NLP benchmarks.
Problems:
- Vocabulary explosion: millions of word types in large corpora.
- Out-of-vocabulary (OOV) words: unseen words at inference map to
<UNK>. - Morphologically rich languages (German, Finnish, Turkish) have enormous word-form spaces.
Character Tokenization
Each character is a token. Vocabulary size $\approx$ 100-200 (ASCII/Unicode).
Advantages: no OOV; handles any word form; small vocabulary.
Disadvantages: sequences are very long (5-10$\times$ longer than word sequences); harder for models to learn word-level semantics; slow to process.
Subword Tokenization
The modern standard. Splits words into frequently occurring subword units. Balances vocabulary size with sequence length; rare words are segmented into known subwords.
Key insight: frequent words remain as single tokens; rare words decompose into known subword units.
\(\text{"unhappiness"} \to [\text{"un"}, \text{"happiness"}]\) \(\text{"tokenization"} \to [\text{"token"}, "\text{ization}"]\)
Byte-Pair Encoding (BPE)
Start with a character-level vocabulary. Iteratively merge the most frequent adjacent pair of tokens until the target vocabulary size is reached.
Algorithm:
- Initialize vocabulary with individual characters.
- Count all adjacent token pairs in the corpus.
- Merge the most frequent pair into a new token.
- Repeat until vocabulary size $V$ is reached.
Vocabulary size typically 32k-64k for LLMs.
Encoding a new string: apply merges in the learned order (greedy).
GPT-2 / GPT-4 tokenizer (tiktoken): BPE over byte sequences. Every possible byte sequence is representable; no true OOV at the byte level.
WordPiece
Similar to BPE but merges maximize the language model likelihood rather than raw frequency:
\[\text{score}(A, B) = \frac{\text{freq}(AB)}{\text{freq}(A) \times \text{freq}(B)}\]Subword tokens after the first in a word are prefixed with ## (e.g., “playing” → ["play", "##ing"]).
Used in: BERT, DistilBERT, ELECTRA.
SentencePiece
Language-agnostic; trains directly on raw text without pre-tokenization. Treats whitespace as a special character ▁. Works for any language including those without spaces (Japanese, Chinese).
Supports BPE and unigram language model as underlying algorithms.
Used in: T5, XLNet, LLaMA, Mistral, mT5.
Unigram Language Model
Maintains a large vocabulary, then prunes tokens by removing those that increase the total corpus log-likelihood least:
\[\mathcal{L} = \sum_{x \in \mathcal{D}} \log p(x) = \sum_{x \in \mathcal{D}} \log \sum_{\mathbf{s} \in S(x)} \prod_{t} p(s_t)\]Viterbi algorithm finds the most likely segmentation. Allows multiple valid tokenizations (useful for regularization during training).
Special Tokens
Modern tokenizers add special tokens for model conditioning:
| Token | Purpose |
|---|---|
[CLS] |
Classification token (BERT); prepended to input |
[SEP] |
Separator between sentences (BERT) |
[PAD] |
Padding to uniform sequence length |
[UNK] |
Unknown token (rare in BPE models) |
[MASK] |
Masked position for MLM training |
<s>, </s> |
Start/end of sequence (T5, LLaMA) |
<|endoftext|> |
End of document (GPT) |
Vocabulary Size Tradeoffs
| Vocab size | Sequence length | Embedding size | OOV handling |
|---|---|---|---|
| Small (1k) | Very long | Small | Poor |
| Medium (32k) | Moderate | Medium | Good (subword) |
| Large (100k+) | Short | Large (more params) | Excellent |
Typical modern LLMs: 32k-200k vocabulary. LLaMA-3: 128k; GPT-4: ~100k.
Tokenization Artifacts
- Byte fallback: byte-level BPE can represent any input but introduces rare byte tokens that models handle poorly.
- Whitespace sensitivity: “token” and “ token” may be different tokens (leading space matters in BPE).
- Number tokenization: “100”, “1”, “0”, “0” may be three tokens; arithmetic reasoning is harder.
- Non-English languages: English-centric tokenizers (GPT-2) over-segment non-Latin scripts, consuming many tokens per character.