Link Prediction

Link prediction estimates the likelihood of an edge existing between two nodes. It is fundamental for recommendation systems, knowledge graph completion, and drug interaction discovery.

Problem Formulation

Given a graph $G = (V, E_\text{train})$ (a subset of all true edges), predict whether an edge $(u, v) \notin E_\text{train}$ exists in the true graph $G^* = (V, E^*)$.

Transductive: predict missing edges in the same graph.

Inductive: predict edges for new nodes or graphs not seen during training.

Negative sampling: since only positive edges are observed, sample non-edges as negatives. Random negatives are easy; hard negatives (structurally similar but absent edges) are more informative.

Heuristic Methods

Before deep learning, hand-designed similarity scores between node pairs.

Common Neighbors:

$$ s(u,v) = |\mathcal{N}(u) \cap \mathcal{N}(v)| $$

Two nodes are likely to connect if they share many neighbors. Works well for social networks.

Jaccard Coefficient:

$$ s(u,v) = \frac{|\mathcal{N}(u) \cap \mathcal{N}(v)|}{|\mathcal{N}(u) \cup \mathcal{N}(v)|} $$

Normalizes by total neighbors to handle high-degree nodes.

Adamic-Adar:

$$ s(u,v) = \sum_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{\log |\mathcal{N}(w)|} $$

Weights common neighbors by inverse log degree: low-degree common neighbors are more informative (less likely to be common friends by chance).

Resource Allocation:

$$ s(u,v) = \sum_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{|\mathcal{N}(w)|} $$

Katz Index:

$$ s(u,v) = \sum_{l=1}^\infty \beta^l |paths_{uv}^{(l)}| = (I - \beta A)^{-1} A $$

Counts paths of all lengths with exponential decay $\beta$. Captures global structure.

Matrix Factorization

Factorize the adjacency matrix $A \approx ZZ^T$ or $A \approx UV^T$ into low-rank matrices.

SVD decomposition: $A = U \Sigma V^T$; use top-$k$ singular vectors. Equivalent to spectral node embeddings.

Symmetric NMF: find $Z \geq 0$ such that $A \approx ZZ^T$. Produces non-negative embeddings; corresponds to community structure.

Matrix factorization for recommendation: user-item bipartite graph; factorize the interaction matrix.

Node2Vec / DeepWalk (Random Walk Embeddings)

Learn node embeddings that predict context nodes in random walks.

DeepWalk: generate random walks; treat each walk as a “sentence” of nodes; apply Word2Vec skip-gram to learn node embeddings.

Node2Vec: biased random walks. Two parameters:

$p$: return probability (penalizes returning to previous node).
$q$: in-out ratio (BFS-like vs. DFS-like walks).

$p < 1, q > 1$: local BFS exploration (community structure). $p > 1, q < 1$: global DFS exploration (structural equivalence).

Link prediction: given embeddings $z_u, z_v$, predict edge probability via $\sigma(z_u^T z_v)$ or $\sigma(h(z_u \odot z_v))$.

GNN-Based Link Prediction

A GNN computes node embeddings $z_u, z_v$; a scoring function predicts edge probability.

Dot product scoring:

$$ \hat{A}_{uv} = \sigma(z_u^T z_v) $$

MLP scoring: $\hat{A}_{uv} = \sigma(\text{MLP}([z_u; z_v]))$.

Loss: binary cross-entropy over positive and negative edge pairs:

$$ \mathcal{L} = -\sum_{(u,v) \in E_\text{train}} \log \hat{A}_{uv} - \sum_{(u,v) \notin E} \log(1 - \hat{A}_{uv}) $$

SEAL (Subgraph Extraction for Link Prediction)

Zhang & Chen (2018). Extract the $h$-hop enclosing subgraph around each target link $(u,v)$; apply a GNN to the subgraph to score the link.

The subgraph provides structural information beyond node embeddings: is the pair a bridge between communities? Are they in the same clique?

Double Radius Node Labeling (DRNL): label each node in the subgraph with its distance to $u$ and $v$. Encodes structural role w.r.t. the target link.

SEAL significantly outperforms embedding-based methods on citation and social network datasets.

Evaluation

Area Under the ROC Curve (AUC): fraction of (positive, negative) pairs where the positive is scored higher.

Average Precision (AP): area under precision-recall curve.

Hits@K: fraction of positive edges ranked in the top-$K$ among all negatives.

MRR (Mean Reciprocal Rank): $\frac{1}{\lvert E_\text{test} \rvert} \sum_{(u,v) \in E_\text{test}} \frac{1}{\text{rank}(v)}$ where rank is computed among all candidate nodes for a given source $u$.

OGB (Open Graph Benchmark) uses ranking-based metrics with hard negative sampling to avoid trivially easy random negatives.