Energy-Based Models

Energy-based models (EBMs) define a probability distribution over data via an energy function $E_\theta(x) \in \mathbb{R}$ that assigns low energy to likely configurations and high energy to unlikely ones.

The Energy-Based Distribution

\[p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \quad Z(\theta) = \int \exp(-E_\theta(x)) \, dx\]

$E_\theta(x)$: scalar energy function, often a neural network.
$Z(\theta)$: partition function (normalizing constant). This integral is generally intractable.

Equivalently: $\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$.
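For a one-dimensional toy energy the partition function can be approximated numerically, which makes the definition concrete. The sketch below assumes a hypothetical quadratic energy $E_\theta(x) = \theta x^2 / 2$ (not from the text above) and a simple Riemann sum; in higher dimensions this kind of grid integration is exactly what becomes infeasible.

```python
import math

def energy(x, theta=1.0):
    """Hypothetical quadratic energy E_theta(x) = theta * x^2 / 2 (illustrative choice)."""
    return 0.5 * theta * x * x

def partition_function(theta=1.0, lo=-10.0, hi=10.0, n=20001):
    """Approximate Z(theta) with a Riemann sum on a uniform grid over [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    return sum(math.exp(-energy(lo + i * dx, theta)) for i in range(n)) * dx

def log_prob(x, theta=1.0):
    """log p_theta(x) = -E_theta(x) - log Z(theta)."""
    return -energy(x, theta) - math.log(partition_function(theta))

Z = partition_function()
# For theta = 1 this energy is a standard Gaussian, so Z should be close to sqrt(2 * pi).
print(Z, math.sqrt(2 * math.pi))
print(log_prob(0.0))
```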

Why Energy-Based?

Flexibility: no architectural constraints on $E_\theta$; any neural network works.
No generative process required: no need to define a decoder or forward sampling procedure.
Unified framework: many classical models (Boltzmann machines, CRFs, Hopfield networks) are EBMs.
Compositionality: $E(x) = E_1(x) + E_2(x)$ corresponds to $p(x) \propto p_1(x) \cdot p_2(x)$.

Challenge: the intractable $Z(\theta)$ makes MLE and sampling hard.

Maximum Likelihood Training

The MLE gradient is the difference between a data term and a model expectation:

\[\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta(x')}[\nabla_\theta E_\theta(x')]\]

The first term pushes the energy of real data down. The second (the negative phase) pushes the energy of model samples up.
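The model expectation comes from differentiating $\log Z(\theta)$. A short derivation, assuming differentiation and integration can be exchanged:

\[\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)} \int \nabla_\theta \exp(-E_\theta(x')) \, dx' = -\int \frac{\exp(-E_\theta(x'))}{Z(\theta)} \nabla_\theta E_\theta(x') \, dx' = -\mathbb{E}_{p_\theta(x')}[\nabla_\theta E_\theta(x')]\]

Substituting this into $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z(\theta)$ gives the two terms of the gradient.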

Computing the negative phase requires sampling from $p_\theta$, which requires MCMC.
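On a small finite state space the model expectation can be computed exactly, so the gradient formula can be checked without MCMC. The sketch below assumes a hypothetical tabular energy $E_\theta(x) = \theta_x$ with one parameter per state, and compares the analytic gradient with central finite differences.

```python
import math

def probs(theta):
    """Exact p_theta over a finite state space with tabular energies E_theta(x) = theta[x]."""
    weights = [math.exp(-e) for e in theta]
    Z = sum(weights)
    return [w / Z for w in weights]

def log_prob(theta, x):
    return math.log(probs(theta)[x])

def grad_log_prob(theta, x):
    """-grad E(x) + E_{p_theta}[grad E(x')]; grad E(x) is one-hot at x for tabular energies."""
    p = probs(theta)
    return [-(1.0 if i == x else 0.0) + p[i] for i in range(len(theta))]

def finite_difference(theta, x, eps=1e-6):
    """Central finite-difference approximation of grad_theta log p_theta(x)."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grads.append((log_prob(up, x) - log_prob(down, x)) / (2 * eps))
    return grads

theta = [0.3, -1.2, 0.8, 0.0]  # arbitrary illustrative energies
analytic = grad_log_prob(theta, x=1)
numeric = finite_difference(theta, x=1)
print(analytic)
print(numeric)
```

The observed state gets a negative gradient on its energy parameter (its energy is pushed down), while every other state gets a positive one proportional to its model probability.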

MCMC for EBMs

Since $p_\theta$ is only known up to $Z(\theta)$, sampling requires Markov Chain Monte Carlo.

Langevin Dynamics

Iterative refinement of a noisy sample using the gradient of the energy:

\[x^{(k+1)} = x^{(k)} - \frac{\alpha}{2} \nabla_x E_\theta(x^{(k)}) + \sqrt{\alpha} \, \epsilon^{(k)}, \quad \epsilon^{(k)} \sim \mathcal{N}(0, I)\]

As step size $\alpha \to 0$ and steps $K \to \infty$, the chain converges to $p_\theta(x)$.

In practice: run for a finite number of steps starting from noise or a real sample.
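The update rule can be sketched in a few lines for a one-dimensional energy. The example assumes a hypothetical energy $E(x) = (x - \mu)^2 / 2$ with $\mu = 2$, whose target distribution is a unit-variance Gaussian, and starts each chain from standard normal noise; step size and step count are illustrative.

```python
import math
import random

def grad_energy(x, mu=2.0):
    """Gradient of the hypothetical energy E(x) = (x - mu)^2 / 2."""
    return x - mu

def langevin_sample(x0, step_size=0.02, n_steps=500, rng=random):
    """Run n_steps of x <- x - (alpha / 2) * grad E(x) + sqrt(alpha) * eps."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
# Each chain starts from noise and is run for a finite number of steps.
samples = [langevin_sample(rng.gauss(0.0, 1.0), rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be roughly 2 and 1 for this energy
```

Because the step size is finite and no Metropolis correction is applied, the samples follow the target only approximately.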

Contrastive Divergence (CD-k)

Initialize the MCMC chain from real data $x^+ \sim p_\text{data}$, run $k$ steps of Gibbs or Langevin sampling, and use the result as the “negative sample” $x^-$. The log-likelihood gradient is then approximated by:

\[\nabla_\theta \mathcal{L} \approx -\nabla_\theta E_\theta(x^+) + \nabla_\theta E_\theta(x^-)\]

$k=1$ (CD-1) is often sufficient in practice and is the default for RBMs.

Persistent Contrastive Divergence (PCD): maintain a persistent chain across training steps rather than restarting from data. Better captures the full model distribution.
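A minimal sketch of CD-k and its persistent variant on a toy model. It assumes a hypothetical one-parameter energy $E_\theta(x) = (x - \theta)^2 / 2$ and Langevin steps as the MCMC kernel; the `persistent` flag, learning rate, and step size are illustrative choices, not values from any reference implementation.

```python
import math
import random

def langevin_steps(x, theta, k, step_size, rng):
    """k Langevin steps for E_theta(x) = (x - theta)^2 / 2, whose x-gradient is (x - theta)."""
    for _ in range(k):
        x = x - 0.5 * step_size * (x - theta) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

def train_cd(data, k=1, step_size=0.5, lr=0.1, epochs=200, persistent=False, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    chains = list(data)  # persistent chains, only used when persistent=True
    for _ in range(epochs):
        # CD restarts every chain from the data; PCD continues the previous chains.
        starts = chains if persistent else data
        negatives = [langevin_steps(x, theta, k, step_size, rng) for x in starts]
        chains = negatives
        # grad_theta E_theta(x) = theta - x, so -grad E(x+) + grad E(x-) = x+ - x-.
        grad = sum(data) / len(data) - sum(negatives) / len(negatives)
        theta += lr * grad  # gradient ascent on the approximate log-likelihood
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(500)]  # synthetic data with mean 3
theta_cd = train_cd(data, persistent=False)
theta_pcd = train_cd(data, persistent=True)
print(theta_cd, theta_pcd)  # both should end up near the data mean
```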

Replay Buffer

Modern EBMs (Du & Mordatch 2019) maintain a replay buffer of past MCMC samples. For each gradient step:

Initialize some chains from buffer samples, some from noise.
Run $K$ steps of Langevin dynamics.
Update the buffer with new samples.
Compute the contrastive gradient.

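Continuing chains from buffer samples lets them effectively run for many more than $K$ steps, while the noise-initialized chains keep exploring new modes. This helps keep MCMC from getting stuck and stabilizes training.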
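The four steps above can be sketched as a training loop. This is a toy illustration under stated assumptions, not the setup of Du & Mordatch: the energy $E_\theta(x) = (x - \theta)^2 / 2$, the uniform noise range, `reinit_prob`, and all other hyperparameters are hypothetical.

```python
import math
import random

def langevin_steps(x, theta, n_steps, step_size, rng):
    """Langevin steps for the toy energy E_theta(x) = (x - theta)^2 / 2."""
    for _ in range(n_steps):
        x = x - 0.5 * step_size * (x - theta) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

def train_with_buffer(data, batch_size=64, buffer_size=1000, reinit_prob=0.05,
                      n_steps=20, step_size=0.2, lr=0.05, iters=300, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    buffer = [rng.uniform(-5.0, 5.0) for _ in range(buffer_size)]
    for _ in range(iters):
        positives = [rng.choice(data) for _ in range(batch_size)]
        # 1. Initialize chains: mostly from buffer slots, occasionally from fresh noise.
        slots = [rng.randrange(buffer_size) for _ in range(batch_size)]
        starts = [rng.uniform(-5.0, 5.0) if rng.random() < reinit_prob else buffer[i]
                  for i in slots]
        # 2. Run K steps of Langevin dynamics under the current parameters.
        negatives = [langevin_steps(x, theta, n_steps, step_size, rng) for x in starts]
        # 3. Write the new samples back into the buffer.
        for i, x in zip(slots, negatives):
            buffer[i] = x
        # 4. Contrastive gradient: for this energy it reduces to mean(x+) - mean(x-).
        grad = sum(positives) / batch_size - sum(negatives) / batch_size
        theta += lr * grad
    return theta, buffer

data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(500)]  # synthetic data with mean 3
theta_hat, buffer = train_with_buffer(data)
print(theta_hat)  # should end up near the data mean
```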

Boltzmann Machines

An EBM on binary vectors $x \in \{0,1\}^d$:

\[E(x) = -x^T W x - b^T x\]

Training: maximize $\log p(x) = -E(x) - \log Z$. Gradient requires computing $\mathbb{E}_{p}[x x^T]$, intractable for general BMs.
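For very small $d$ the sum over all $2^d$ states can be enumerated, which shows what the gradient needs and why it does not scale. The sketch below computes $Z$, the exact distribution, and $\mathbb{E}_p[x x^T]$ by brute force; the weights and biases are arbitrary illustrative values.

```python
import itertools
import math

def bm_energy(x, W, b):
    """E(x) = -x^T W x - b^T x for a binary vector x."""
    d = len(x)
    quad = sum(x[i] * W[i][j] * x[j] for i in range(d) for j in range(d))
    return -quad - sum(b[i] * x[i] for i in range(d))

def exact_distribution(W, b):
    """Enumerate all 2^d states; the cost grows exponentially in d."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    weights = [math.exp(-bm_energy(x, W, b)) for x in states]
    Z = sum(weights)
    return states, [w / Z for w in weights], Z

def second_moment(states, probs):
    """E_p[x x^T], the model statistic needed for the gradient with respect to W."""
    d = len(states[0])
    return [[sum(p * x[i] * x[j] for x, p in zip(states, probs)) for j in range(d)]
            for i in range(d)]

W = [[0.0, 0.5, 0.0], [0.5, 0.0, -0.3], [0.0, -0.3, 0.0]]  # illustrative symmetric weights
b = [0.1, -0.2, 0.0]
states, probs, Z = exact_distribution(W, b)
print(Z)
print(second_moment(states, probs))
```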

Restricted Boltzmann Machine (RBM): bipartite graph with visible units $v$ and hidden units $h$; no within-layer connections.

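\[E(v, h) = -v^T W h - b^T v - c^T h\]
\[p(v) = \sum_h p(v, h), \quad p(h = 1 \mid v) = \sigma(W^T v + c), \quad p(v = 1 \mid h) = \sigma(W h + b)\]

Here $\sigma$ is the logistic sigmoid applied elementwise, so each conditional factorizes over units.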

Alternating Gibbs sampling between $v$ and $h$ is efficient. Trained with CD-1. Building block of Deep Belief Networks.
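A minimal sketch of one CD-1 update for a binary RBM. It assumes $W$ has shape (visible, hidden), matching the energy $-v^T W h$, so the hidden conditional uses $W^T v + c$ and the visible conditional uses $W h + b$. Using hidden probabilities rather than samples in the update is a common variance-reduction choice, assumed here; all function names and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, c):
    """p(h_j = 1 | v) = sigmoid((W^T v + c)_j), with W of shape (n_visible, n_hidden)."""
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))) + c[j])
            for j in range(len(c))]

def visible_probs(h, W, b):
    """p(v_i = 1 | h) = sigmoid((W h + b)_i)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step on a single visible vector v0; updates W, b, c in place."""
    ph0 = hidden_probs(v0, W, c)
    h0 = sample_bernoulli(ph0, rng)
    v1 = sample_bernoulli(visible_probs(h0, W, b), rng)  # one Gibbs sweep: v0 -> h0 -> v1
    ph1 = hidden_probs(v1, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            # Positive phase (data) minus negative phase (reconstruction).
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])

rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[0.0] * n_hidden for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
for _ in range(200):
    cd1_update([1, 1, 0, 0], W, b, c, lr=0.1, rng=rng)  # a single repeated training pattern
print(b)
```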

Joint Energy Models (JEM)

Combines a classifier $f_\theta(x)[y]$ (logit for class $y$) with an EBM over inputs:

\[p_\theta(x) \propto \sum_y \exp(f_\theta(x)[y])\]
\[p_\theta(y \mid x) = \text{softmax}(f_\theta(x))\]

Joint MLE: maximize $\log p_\theta(x) + \log p_\theta(y \mid x)$.

A single network simultaneously classifies and generates. This is reported to improve OOD detection and adversarial robustness.
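The two distributions can be read off a vector of logits directly. The sketch below, with made-up logits standing in for $f_\theta(x)$, computes the input energy $E_\theta(x) = -\log \sum_y \exp(f_\theta(x)[y])$ and the class posterior, and checks that the joint objective equals the logit of the true class up to the constant $-\log Z(\theta)$.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def jem_energy(logits):
    """E_theta(x) = -log sum_y exp(f_theta(x)[y]), so p_theta(x) is proportional to exp(-E)."""
    return -logsumexp(logits)

def class_posterior(logits):
    """p_theta(y | x) = softmax(f_theta(x))."""
    lse = logsumexp(logits)
    return [math.exp(v - lse) for v in logits]

def joint_log_prob_unnormalized(logits, y):
    """log p(x) + log p(y | x), omitting the constant -log Z(theta)."""
    return -jem_energy(logits) + math.log(class_posterior(logits)[y])

logits = [2.0, -1.0, 0.5]  # hypothetical classifier outputs f_theta(x) for three classes
print(jem_energy(logits))
print(class_posterior(logits))
print(joint_log_prob_unnormalized(logits, 1), logits[1])  # the two values agree
```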

EBMs for Structured Prediction

In NLP and vision, model $p(y \mid x) \propto \exp(-E_\theta(x, y))$ where $y$ is a structured output (parse tree, segmentation map).

CRF (Conditional Random Field): shallow EBM with pairwise potentials. Exact inference via dynamic programming for linear chains.
Structured SVM: margin-based training that avoids the partition function via a surrogate loss.
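For the linear-chain CRF case, the dynamic program is the forward algorithm. A minimal sketch that computes the log partition function over all label sequences and checks it against brute-force enumeration; the unary and pairwise scores (negative energies) are arbitrary illustrative numbers.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(labels, unary, pairwise):
    """Negative energy -E(x, y): unary scores per position plus pairwise scores per edge."""
    score = sum(unary[t][y] for t, y in enumerate(labels))
    score += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition_forward(unary, pairwise):
    """Forward algorithm: O(T * L^2) instead of enumerating L^T sequences."""
    n_labels = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [unary[t][j] + logsumexp([alpha[i] + pairwise[i][j] for i in range(n_labels)])
                 for j in range(n_labels)]
    return logsumexp(alpha)

def log_partition_brute_force(unary, pairwise):
    n_labels = len(unary[0])
    return logsumexp([sequence_score(y, unary, pairwise)
                      for y in itertools.product(range(n_labels), repeat=len(unary))])

unary = [[0.2, -0.1, 0.5], [1.0, 0.0, -0.5], [0.3, 0.3, 0.1], [-0.2, 0.8, 0.0]]
pairwise = [[0.5, -0.2, 0.0], [0.1, 0.4, -0.3], [-0.1, 0.0, 0.6]]
log_Z_dp = log_partition_forward(unary, pairwise)
log_Z_bf = log_partition_brute_force(unary, pairwise)
print(log_Z_dp, log_Z_bf)
```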

Composing Energy Functions

EBMs compose cleanly via addition:

\[p(x \mid c) \propto \exp\big(-E_\text{model}(x) - E_\text{condition}(x, c)\big)\]

This enables test-time composition of multiple constraints without retraining. E.g., combine a generative model with a physics constraint or a classifier.
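A small numerical check of composition by addition. The sketch assumes two hypothetical quadratic energies (Gaussian experts); summing them should give the product of the two Gaussians, whose mean is the precision-weighted average of the individual means. The grid-based mean is only feasible because the example is one-dimensional.

```python
import math

def gaussian_energy(mu, sigma):
    """Energy of a Gaussian expert: E(x) = (x - mu)^2 / (2 * sigma^2)."""
    return lambda x: 0.5 * ((x - mu) / sigma) ** 2

def compose(*energies):
    """Composition by addition: the composed density is the normalized product of experts."""
    return lambda x: sum(e(x) for e in energies)

def grid_mean(energy, lo=-20.0, hi=20.0, n=40001):
    """Mean of p(x) proportional to exp(-E(x)), approximated on a uniform grid."""
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    ws = [math.exp(-energy(x)) for x in xs]
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

e_model = gaussian_energy(mu=0.0, sigma=1.0)      # stands in for E_model
e_condition = gaussian_energy(mu=4.0, sigma=2.0)  # stands in for E_condition at a fixed c
composed_mean = grid_mean(compose(e_model, e_condition))
# Product of Gaussians: mean = (0 * 1 + 4 * 0.25) / (1 + 0.25) = 0.8
print(composed_mean)
```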

Comparison with Other Generative Models

Property	EBM	VAE	GAN	Diffusion
Tractable likelihood	No (unnormalized)	No (ELBO)	No	No (ELBO)
Sampling	Slow (MCMC)	Fast	Fast	Slow (iterative)
Training stability	Difficult (MCMC needed)	Stable	Unstable	Stable
Compositionality	Excellent	Poor	Poor	Moderate
Flexibility	High	Moderate	High	High