Word Embeddings
Word embeddings map discrete tokens to dense vectors in $\mathbb{R}^d$ such that semantically similar words are close in the embedding space. They replace sparse one-hot representations ($\mathbb{R}^{\lvert V \rvert}$) with compact, information-rich vectors.
Why Dense Embeddings?
One-hot vectors: $\lvert V \rvert$-dimensional, all zeros except a single 1. They carry no notion of similarity: the dot product of any two distinct words is zero, so “cat” and “dog” are orthogonal.
Dense embeddings: similar words have high cosine similarity. Enable generalization across semantically related words. Pretrained embeddings transfer knowledge across tasks.
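The contrast can be checked numerically. The sketch below uses a three-word vocabulary and hand-picked 2-d dense vectors; both are illustrative assumptions, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: each word gets its own axis, so distinct words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense vectors, chosen by hand for illustration only.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

onehot_cat_dog = cosine(one_hot["cat"], one_hot["dog"])  # orthogonal axes
dense_cat_dog = cosine(dense["cat"], dense["dog"])       # close to 1
dense_cat_car = cosine(dense["cat"], dense["car"])       # much smaller
```

With one-hot vectors every distinct pair scores the same, so the representation cannot say that “cat” is nearer to “dog” than to “car”; the dense vectors can.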
Word2Vec
Mikolov et al. (2013). Trains a shallow neural network on a self-supervised task to produce word vectors. Two architectures:
Skip-gram
Predict surrounding context words from the center word. For center word $w_t$ and context window $c$:
\[\max_\theta \sum_{t=1}^T \sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)\]
\[p(w_O \mid w_I) = \frac{\exp(v_{w_O}'^T v_{w_I})}{\sum_{w=1}^{\lvert V \rvert} \exp(v_w'^T v_{w_I})}\]
The model keeps separate input embeddings $v_w$ and output embeddings $v_w'$.
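A minimal sketch of the skip-gram objective with a full softmax, assuming a three-token corpus and hand-picked 2-d embeddings. The names `skipgram_pairs`, `softmax_prob`, `v_in` and `v_out` are illustrative, not taken from any library.

```python
import math

def skipgram_pairs(tokens, c):
    """Return (center, context) pairs for every position t and offset 0 < |j| <= c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(target, center, v_in, v_out):
    """p(target | center) using a full softmax over all output embeddings."""
    scores = {w: dot(vec, v_in[center]) for w, vec in v_out.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    return exps[target] / sum(exps.values())

tokens = ["the", "cat", "sat"]
pairs = skipgram_pairs(tokens, c=1)

# Hypothetical embeddings, hand-picked for illustration only.
v_in = {"the": [0.1, 0.0], "cat": [1.0, 0.5], "sat": [0.2, 0.9]}
v_out = {"the": [0.3, 0.1], "cat": [0.0, 0.2], "sat": [0.8, 0.6]}

# The quantity that training would maximize over the embeddings.
log_likelihood = sum(math.log(softmax_prob(ctx, ctr, v_in, v_out))
                     for ctr, ctx in pairs)
```

The denominator touches every word in the vocabulary for every pair, which is the cost that negative sampling avoids.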
Negative sampling: replaces the full softmax. For each positive pair $(w_I, w_O)$, sample $k$ random negative words $w_i^-$:
\[\mathcal{L} = \log \sigma(v_{w_O}'^T v_{w_I}) + \sum_{i=1}^k \mathbb{E}_{w_i^- \sim P_n}[\log \sigma(-v_{w_i^-}'^T v_{w_I})]\]
This objective is maximized. The noise distribution $P_n(w) \propto \text{freq}(w)^{3/4}$ upweights rare words relative to the raw unigram distribution.
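The pieces of negative sampling can be sketched as follows, under the assumption that each expectation is approximated by a single sampled negative. The counts and vectors are made up, and `noise_distribution`, `sample_negatives` and `sgns_objective` are hypothetical helper names.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def noise_distribution(counts, power=0.75):
    """P_n(w) proportional to freq(w) ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(p_noise, k, rng):
    """Draw k negative words from the noise distribution."""
    words = list(p_noise)
    return rng.choices(words, weights=[p_noise[w] for w in words], k=k)

def sgns_objective(v_center, v_out_pos, v_out_negs):
    """log sigma(v'_pos . v) + sum_i log sigma(-v'_neg_i . v); higher is better."""
    obj = log_sigmoid(dot(v_out_pos, v_center))
    for v_neg in v_out_negs:
        obj += log_sigmoid(-dot(v_neg, v_center))
    return obj

counts = {"the": 10000, "aardvark": 1}  # made-up corpus counts
p_noise = noise_distribution(counts)
p_unigram_rare = counts["aardvark"] / sum(counts.values())
negatives = sample_negatives(p_noise, k=5, rng=random.Random(0))
```

Here `p_noise["aardvark"]` is roughly ten times `p_unigram_rare`, which is the upweighting of rare words in action. Each update now costs $k + 1$ dot products instead of one per vocabulary word.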
CBOW (Continuous Bag of Words)
Predict center word from averaged context embeddings. Faster to train; slightly lower quality than skip-gram.
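A rough sketch of the CBOW forward pass, assuming toy 2-d embeddings and an argmax over raw scores in place of a softmax. `cbow_predict` and the vectors are illustrative only.

```python
def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cbow_predict(context, v_in, v_out):
    """Average the context input embeddings, then return the word whose
    output embedding has the highest dot product with that average."""
    h = average([v_in[w] for w in context])
    scores = {w: sum(a * b for a, b in zip(vec, h)) for w, vec in v_out.items()}
    return max(scores, key=scores.get), h

# Hypothetical embeddings, hand-picked for illustration only.
v_in = {"the": [1.0, 0.0], "sat": [0.0, 1.0], "cat": [0.5, 0.5]}
v_out = {"the": [1.0, -1.0], "sat": [-1.0, 1.0], "cat": [1.0, 1.0]}

predicted, hidden = cbow_predict(["the", "sat"], v_in, v_out)
```

Because the whole window collapses into one averaged vector, CBOW makes one prediction per position rather than one per (center, context) pair, which is one reason it trains faster.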
Word2Vec Properties
Linear analogies: $v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen})$
This emerges from the distributional structure of co-occurrence without any explicit supervision.
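Subsampling: frequent words (the, is, a) are discarded during training with probability $P(\text{discard}) = 1 - \sqrt{t/f}$, where $f$ is the word's relative frequency in the corpus and $t$ is a small threshold (around $10^{-5}$ in the original paper). Words with $f \leq t$ are always kept.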
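The subsampling rule is easy to evaluate directly. The sketch below assumes a threshold of $t = 10^{-5}$ and made-up relative frequencies; `discard_prob` is an illustrative name.

```python
import math

def discard_prob(freq, t=1e-5):
    """P(discard) = 1 - sqrt(t / freq), clipped at 0 for words with freq <= t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Made-up relative frequencies for illustration.
p_the = discard_prob(0.05)    # a very frequent word is dropped most of the time
p_rare = discard_prob(1e-6)   # a word rarer than t is never dropped
```

The square root makes the rule aggressive only for the most frequent words, while the frequency ranking among the surviving tokens is preserved.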
GloVe (Global Vectors)
Pennington et al. (2014). Learns embeddings from the global word-word co-occurrence matrix $X$ directly.
Objective: for co-occurrence count $X_{ij}$ of words $i$ and $j$:
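\[\mathcal{L} = \sum_{i,j=1}^{\lvert V \rvert} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\]
Weighting function: $f(x) = (x/x_\text{max})^\alpha$ for $x < x_\text{max}$, else 1. It caps the influence of very frequent pairs and downweights rare, noisy ones (typically $\alpha = 3/4$, $x_\text{max} = 100$). Since $f(0) = 0$, pairs that never co-occur drop out of the sum.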
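A minimal sketch of the GloVe loss over a sparse co-occurrence dictionary, assuming $\alpha = 3/4$ and $x_\text{max} = 100$. The function names and the toy parameters are illustrative assumptions.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha below x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error over nonzero counts.
    X maps (i, j) -> co-occurrence count; zero counts are simply absent."""
    total = 0.0
    for (i, j), x in X.items():
        pred = sum(p * q for p, q in zip(W[i], W_tilde[j])) + b[i] + b_tilde[j]
        err = pred - math.log(x)
        total += glove_weight(x) * err * err
    return total

# Toy parameters whose biases reproduce log X_01 exactly, so the loss is zero.
X = {(0, 1): 10.0}
W = [[0.0], [0.0]]
W_tilde = [[0.0], [0.0]]
b = [math.log(10.0), 0.0]
b_tilde = [0.0, 0.0]
loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

Iterating only over stored entries is what makes the objective tractable: the sum runs over nonzero counts rather than all $\lvert V \rvert^2$ pairs.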
GloVe and Word2Vec produce similar quality embeddings. GloVe is easier to train in parallel.
fastText
Bojanowski et al. (2017). Extends Word2Vec by representing each word as a bag of character n-grams.
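\[v(w) = \sum_{g \in G(w)} z_g\]
Here $G(w)$ is the set of character n-grams of $w$ and $z_g$ is the vector learned for n-gram $g$. For “where”, with < and > as word-boundary markers, the character 3-grams are {<wh, whe, her, ere, re>}, plus the full token <where>.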
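The n-gram decomposition can be sketched as below, assuming a single n-gram length of 3 and a plain dictionary of n-gram vectors. The actual fastText implementation uses a range of lengths and hashes n-grams into buckets; `char_ngrams`, `word_vector` and the table `z` are hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of '<word>' followed by the full bracketed token."""
    marked = "<" + word + ">"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    if marked not in grams:
        grams.append(marked)
    return grams

def word_vector(word, z, n=3):
    """Sum the vectors z[g] over the word's n-grams; n-grams missing from z are skipped."""
    dim = len(next(iter(z.values())))
    vec = [0.0] * dim
    for g in char_ngrams(word, n):
        if g in z:
            vec = [a + b for a, b in zip(vec, z[g])]
    return vec

# Hypothetical n-gram table containing only two entries.
z = {"<wh": [1.0, 0.0], "ere": [0.0, 2.0]}

grams = char_ngrams("where")
v_where = word_vector("where", z)
v_oov = word_vector("whereas", z)  # never seen as a word, but shares n-grams
```

The word “whereas” has no entry of its own, yet it still receives a vector through the n-grams it shares with “where”.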
Advantages:
- Handles OOV words: unseen words are represented via their subword components.
- Better for morphologically rich languages.
- Improves representations for rare words.
Contextualized Embeddings
Word2Vec/GloVe produce a single static vector per word. “Bank” gets one vector regardless of context (riverbank vs. financial bank).
Contextualized embeddings produce a different vector for each occurrence based on surrounding context. These emerge from the hidden states of language models.
ELMo (Embeddings from Language Models)
Bidirectional LSTM language model. The embedding for a token is a learned weighted sum of all LSTM layer representations:
\[\text{ELMo}_k = \gamma \sum_{j=0}^L s_j h_{k,j}\]
Here $h_{k,j}$ is the layer-$j$ representation of token $k$, $s_j$ are softmax-normalized task-specific scalars, and $\gamma$ is a scaling factor. Lower layers tend to capture syntax; higher layers capture more context-dependent semantics.
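A small sketch of the layer mixing for a single token, assuming the per-layer vectors have already been computed. `elmo_embedding` and the toy inputs are illustrative.

```python
import math

def elmo_embedding(layer_states, raw_scalars, gamma=1.0):
    """Mix one token's layer vectors h_{k,j} with softmax-normalized weights s_j,
    then scale by gamma. raw_scalars are the unnormalized task-specific weights."""
    m = max(raw_scalars)
    exps = [math.exp(a - m) for a in raw_scalars]
    total = sum(exps)
    s = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(s[j] * layer_states[j][d] for j in range(len(layer_states)))
            for d in range(dim)]

# Three layers (j = 0, 1, 2) of made-up 2-d states for one token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_embedding(layers, raw_scalars=[0.0, 0.0, 0.0], gamma=3.0)
```

With equal raw scalars the mix reduces to a plain average of the layers scaled by $\gamma$; training the scalars lets each downstream task emphasize the layers it finds most useful.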
BERT Hidden States
Each token’s hidden state from BERT is a contextualized embedding. Common strategies:
- Use the [CLS] token representation for sentence-level tasks.
- Average the last 4 layers for word-level tasks.
- Fine-tune end-to-end for maximum performance.
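The first two strategies can be sketched on a nested list standing in for the model's per-layer hidden states, indexed as `hidden_states[layer][token][dim]`. That layout and the helper names are assumptions for illustration; no real model is run here.

```python
def cls_embedding(hidden_states):
    """Last-layer vector of the first token, which is [CLS] in BERT inputs."""
    return hidden_states[-1][0]

def mean_last_layers(hidden_states, token_index, n_layers=4):
    """Average one token's vector over the last n_layers layers."""
    layers = [layer[token_index] for layer in hidden_states[-n_layers:]]
    return [sum(col) / len(layers) for col in zip(*layers)]

# Toy stand-in for real model output: 5 layers x 2 tokens x 2 dims,
# where every entry of layer l equals l.
hidden_states = [[[float(l), float(l)] for _ in range(2)] for l in range(5)]

sentence_vec = cls_embedding(hidden_states)
word_vec = mean_last_layers(hidden_states, token_index=1)
```

Both are feature-extraction strategies that leave the model weights frozen; fine-tuning instead updates the weights so that the hidden states adapt to the task.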
Sentence Embeddings
Fixed-size vector for an entire sentence or document.
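Sentence-BERT (SBERT): fine-tunes BERT in a siamese network on sentence pairs, for example with a cosine-similarity regression loss. Each sentence is encoded independently into a fixed-size vector, so comparing two sentences at inference is a single cosine similarity between precomputed embeddings rather than a joint cross-encoder pass over the pair.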
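The saving is easiest to see by counting model forward passes when every pair among $n$ sentences must be compared. The function names below are illustrative.

```python
def cross_encoder_passes(n):
    """A cross-encoder runs the model once per unordered sentence pair."""
    return n * (n - 1) // 2

def bi_encoder_passes(n):
    """A siamese (bi-)encoder runs the model once per sentence;
    pair scores are then cheap cosine similarities."""
    return n

n = 10_000
pairwise_cost = cross_encoder_passes(n)
siamese_cost = bi_encoder_passes(n)
```

For 10,000 sentences this is about 50 million joint passes versus 10,000 independent ones.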
SimCSE: contrastive learning with dropout-based augmentation. Two forward passes of the same sentence with different dropout masks form a positive pair.
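A minimal sketch of the contrastive loss, assuming in-batch negatives and a temperature of 0.05. Here `z1[i]` and `z2[i]` stand for the two dropout views of sentence `i`, and the toy vectors replace real encoder outputs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def simcse_loss(z1, z2, temperature=0.05):
    """Mean InfoNCE loss. For anchor z1[i], z2[i] is the positive and
    every z2[j] with j != i is an in-batch negative."""
    losses = []
    for i, anchor in enumerate(z1):
        logits = [cosine(anchor, other) / temperature for other in z2]
        m = max(logits)  # log-sum-exp with the max subtracted for stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = simcse_loss(views_a, [[1.0, 0.0], [0.0, 1.0]])   # positives match
shuffled = simcse_loss(views_a, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
```

The loss is near zero when each sentence is closest to its own second view and large when it is closer to another sentence in the batch.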
Embedding Evaluation
Intrinsic:
- Word similarity: compare cosine similarity of word pairs against human judgments (SimLex-999, WordSim-353). Pearson/Spearman correlation.
- Analogy tasks: $a : b :: c : ?$ (Google analogy dataset). Accuracy is the fraction of queries for which the nearest neighbor of $v(b) - v(a) + v(c)$, excluding $a$, $b$ and $c$, is the expected word.
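The analogy evaluation can be sketched on a toy vocabulary. The 3-d vectors below are hand-built so that the offsets line up exactly, which real embeddings only approximate; `solve_analogy` and `analogy_accuracy` are illustrative names.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, emb):
    """Answer a : b :: c : ? with the word nearest to v(b) - v(a) + v(c),
    excluding the three query words themselves."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(questions, emb):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(a, b, c, emb) == d for a, b, c, d in questions)
    return correct / len(questions)

# Hypothetical vectors with axes (person, royalty, female).
emb = {
    "man": [1.0, 0.0, 0.0], "woman": [1.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0], "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.1, 0.2],
}
questions = [("man", "king", "woman", "queen"), ("king", "man", "queen", "woman")]
accuracy = analogy_accuracy(questions, emb)
```

Excluding the query words matters in practice, because the nearest neighbor of the offset vector is often one of the inputs.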
Extrinsic: downstream task performance (NER, text classification, QA).
Embedding Dimensionality
Typical $d$: 100-300 for Word2Vec/GloVe; 768-4096 for Transformer hidden states.
There is a tradeoff: higher $d$ captures more information but requires more parameters and training data.
Dimension reduction for visualization: t-SNE or UMAP on embeddings reveals semantic clusters.
Bias in Embeddings
Embeddings trained on biased corpora encode social biases. Classic example:
\[v(\text{man}) - v(\text{woman}) \approx v(\text{programmer}) - v(\text{homemaker})\]
Debiasing approaches:
- Hard debiasing (Bolukbasi et al.): project out the gender direction from gender-neutral words.
- Soft debiasing: regularize during training.
This remains an ongoing research area.
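The projection step of hard debiasing can be sketched as follows. This is a simplification that estimates the gender direction from a single he/she pair, whereas Bolukbasi et al. derive it from several definitional pairs and add an equalization step; the vectors are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def neutralize(w, g):
    """Remove the component of w along the unit direction g."""
    scale = dot(w, g)
    return [wi - scale * gi for wi, gi in zip(w, g)]

# Hypothetical 3-d embeddings, hand-picked for illustration only.
emb = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "programmer": [0.4, 0.1, 0.9],
}

# Simplifying assumption: one definitional pair defines the gender direction.
g = normalize([a - b for a, b in zip(emb["he"], emb["she"])])
debiased = neutralize(emb["programmer"], g)
```

After the projection the word has zero component along `g`, so it is equidistant from “he” and “she” along that direction, while its other coordinates are unchanged.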