Graph Attention Networks

Graph Attention Networks (GAT) compute edge weights dynamically via self-attention rather than using fixed normalized weights. This allows the model to assign different importance to different neighbors during aggregation.

Motivation

GCN uses fixed normalized weights $\frac{1}{\sqrt{d_u d_v}}$. These are determined by the graph topology alone; the model cannot learn to focus on more informative neighbors.

Attention: learn importance weights $\alpha_{uv}$ based on the features of the node pair $(u,v)$. More informative neighbors receive higher weight.

GAT Layer

Step 1: Compute attention logits. For each edge $(u,v)$, apply a shared attention mechanism $a: \mathbb{R}^{d’} \times \mathbb{R}^{d’} \to \mathbb{R}$:

\[e_{uv} = a(W h_u, W h_v) = \text{LeakyReLU}(\mathbf{a}^T [W h_u \| W h_v])\]

$W \in \mathbb{R}^{d’ \times d}$: shared linear transformation. $\mathbf{a} \in \mathbb{R}^{2d’}$: attention vector. $|$: concatenation.

Step 2: Normalize with softmax over the neighborhood of $v$:

\[\alpha_{uv} = \frac{\exp(e_{uv})}{\sum_{k \in \mathcal{N}(v) \cup \{v\}} \exp(e_{kv})}\]

Step 3: Weighted aggregation:

\[h_v' = \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv} W h_u\right)\]

Multi-Head Attention

Run $K$ independent attention heads; concatenate (or average) their outputs:

\[h_v' = \|_{k=1}^K \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv}^k W^k h_u\right)\]

For the final layer, average instead of concatenate:

\[h_v' = \sigma\!\left(\frac{1}{K}\sum_{k=1}^K \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv}^k W^k h_u\right)\]

Multiple heads allow the model to attend to different types of neighbor relationships simultaneously.

GATv2

Brody et al. (2022). Standard GAT has a static attention problem: the ranking of neighbors by attention weight is fixed regardless of the query node’s features.

Problem in original GAT: the ranking of $e_{uv}$ depends only on $W h_u$ (independent of $h_v$), making it equivalent to a linear function of $h_u$ for a fixed $h_v$.

GATv2 fix: apply the non-linearity after adding the two projections:

\[e_{uv} = \mathbf{a}^T \text{LeakyReLU}(W [h_u \| h_v])\]

Now the attention is dynamic: the ranking of neighbors changes depending on $h_v$. GATv2 is strictly more expressive than GAT.

Properties and Behavior

Attention weights as interpretability: the learned $\alpha_{uv}$ can be inspected to understand which edges the model relies on. Visualizing high-attention edges reveals the most informative relationships.

Caveats: attention does not always correlate with importance for prediction. Gradient-based attribution is often more faithful than attention-based explanations.

Sparse attention: in large graphs, restrict attention to the existing edges (local neighborhood). GATv2 only computes attention over $\mathcal{N}(v)$, not all $n$ nodes; $O(m)$ complexity.

Global attention (Transformer-style): compute attention over all pairs of nodes. $O(n^2)$ complexity; feasible only for small graphs. Used in Graphormer.

Edge Features in Attention

Extend the attention mechanism to incorporate edge features $e_{uv}$:

\[\text{logit}_{uv} = \mathbf{a}^T \text{LeakyReLU}(W_u h_u + W_v h_v + W_e e_{uv})\]

Edge features are especially important for molecular graphs (bond type, distance) and knowledge graphs (relation type).

Comparison with GCN

Property	GCN	GAT
Edge weights	Fixed (degree-based)	Learned (feature-based)
Neighbor importance	Topology only	Features + topology
Parameters	$O(d \cdot d’)$	$O(d \cdot d’ + 2d’)$ per head
Interpretability	None	Attention weights
Inductive	Yes	Yes
Over-smoothing	More prone	Similar

GAT generally outperforms GCN on benchmark node classification tasks, particularly when there is heterophily or when neighbors vary significantly in informativeness.

Applications

Molecular property prediction: different atoms and bonds contribute differently to a molecule’s property. Attention learns to focus on the functional groups most relevant to the property.

Knowledge graph reasoning: given a source entity and a relation, attention over neighbors helps traverse multi-hop reasoning paths.

Traffic forecasting: road segments influence each other; attention learns which road segments are most correlated with the target segment’s traffic.