Score-Based Models
Score-based models learn the score function of the data distribution, which is the gradient of the log-density with respect to the data:
\[s_\theta(x) \approx \nabla_x \log p_\text{data}(x)\]
The score points in the direction of increasing data density. Knowing it is sufficient for sampling (via Langevin dynamics) without ever computing the intractable partition function $Z$.
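To make the sampling claim concrete, here is a minimal sketch of unadjusted Langevin dynamics in NumPy. The target distribution (a 1-D Gaussian with a known score), the step size, and the function name `langevin_sample` are illustrative assumptions; in practice the learned $s_\theta$ would take the place of the analytic `score`.

```python
import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score(x) + np.sqrt(step) * z
    return x

# Toy target (assumption): N(mu, sigma^2), whose score is -(x - mu) / sigma^2.
mu, sigma = 2.0, 0.5
score = lambda x: -(x - mu) / sigma**2

rng = np.random.default_rng(0)
# 5000 independent chains, all started from N(0, 1)
samples = langevin_sample(score, x0=rng.standard_normal(5000), rng=rng)
print(samples.mean(), samples.std())
```

Only the score is ever evaluated; the normalizing constant of the target never appears. The finite step size introduces a small bias, which a smaller `step` (with more iterations) reduces.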
Why the Score?
For an EBM $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$:
\[\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)\]
The partition function $Z(\theta)$ drops out! The score can be learned and used without normalization.
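The step behind this identity, written out using only the EBM definition: taking the logarithm separates the energy from the normalizer, and $\log Z(\theta)$ is constant in $x$.
\[\log p_\theta(x) = -E_\theta(x) - \log Z(\theta) \;\Longrightarrow\; \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z(\theta)}_{=\,0} = -\nabla_x E_\theta(x)\]
As an illustrative case (chosen here, not taken from the text), the quadratic energy $E(x) = \frac{1}{2}\|x - \mu\|^2$ gives the score $-(x - \mu)$, which points from $x$ toward the mode $\mu$.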
Score Matching
Hyvärinen (2005): minimize the expected squared distance between the model score and the data score:
\[J_\text{SM}(\theta) = \mathbb{E}_{p_\text{data}}\!\left[\frac{1}{2}\|s_\theta(x) - \nabla_x \log p_\text{data}(x)\|^2\right]\]
The data score $\nabla_x \log p_\text{data}(x)$ is unknown, but integration by parts yields an equivalent objective that only requires $p_\text{data}$ samples:
\[J_\text{SM}(\theta) = \mathbb{E}_{p_\text{data}}\!\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right] + \text{const}\]
The trace of the Jacobian $\text{tr}(\nabla_x s_\theta(x))$ is expensive to compute exactly in high dimensions (one backward pass per input dimension), so it is usually approximated with Hutchinson’s stochastic trace estimator.
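A minimal sketch of Hutchinson’s estimator, which uses the identity $\text{tr}(J) = \mathbb{E}_v[v^\top J v]$ for random probes $v$ with identity covariance. The linear toy score $s(x) = -Ax$, the helper name `hutchinson_trace`, and the probe count are assumptions made for illustration; with a neural score network, the Jacobian-vector product would come from automatic differentiation rather than an explicit matrix.

```python
import numpy as np

def hutchinson_trace(jvp, dim, n_probes=1000, rng=None):
    """Estimate tr(J) as the average of v^T (J v) over Rademacher probes v."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)  # entries are +1 or -1
        total += v @ jvp(v)
    return total / n_probes

# Toy linear score s(x) = -A x (assumption); its Jacobian is -A everywhere.
rng = np.random.default_rng(0)
dim = 10
B = rng.standard_normal((dim, dim))
A = B @ B.T / dim + np.eye(dim)
jvp = lambda v: -A @ v  # Jacobian-vector product, no explicit trace needed

exact = -np.trace(A)
estimate = hutchinson_trace(jvp, dim, n_probes=2000, rng=rng)
print(exact, estimate)
```

Each probe costs one Jacobian-vector product, so the cost is set by the number of probes rather than by the data dimension.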
Denoising Score Matching
Add Gaussian noise to the data samples, $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and learn the score of the resulting noisy distribution:
\[J_\text{DSM}(\theta) = \mathbb{E}_{\tilde{x}, x}\!\left[\|s_\theta(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x)\|^2\right]\]
The conditional score is available in closed form:
\[\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2}\]
So the network learns to predict $-(\tilde{x} - x)/\sigma^2 = -\epsilon/\sigma$, which is equivalent to predicting the negated, rescaled noise:
\[J_\text{DSM}(\theta) = \mathbb{E}\!\left[\|s_\theta(\tilde{x}, \sigma) + \epsilon/\sigma\|^2\right]\]
Connection to diffusion: up to a noise-level-dependent weighting, denoising score matching is the same objective as the DDPM noise-prediction loss. See Diffusion Models.
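A small numerical check of this objective, as a sketch under toy assumptions: the data are 1-D standard normal, the score model is the one-parameter linear function $s_a(\tilde{x}) = a\,\tilde{x}$, and the names `a_hat` and `dsm_loss` are made up for this example. Because the model is linear, minimizing the DSM loss reduces to least squares, and the minimizer should match the true score of the noisy marginal $\mathcal{N}(0, 1 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
n = 200_000

x = rng.standard_normal(n)      # toy data (assumption): x ~ N(0, 1)
eps = rng.standard_normal(n)
x_tilde = x + sigma * eps       # perturbed samples
target = -eps / sigma           # conditional score -(x_tilde - x) / sigma^2

def dsm_loss(a):
    """Empirical DSM loss for the linear score model s_a(x_tilde) = a * x_tilde."""
    return np.mean((a * x_tilde - target) ** 2)

# Closed-form least-squares minimizer of dsm_loss over a.
a_hat = np.sum(x_tilde * target) / np.sum(x_tilde**2)

# The noisy marginal is N(0, 1 + sigma^2), so its score is -x_tilde / (1 + sigma^2).
a_true = -1.0 / (1.0 + sigma**2)
print(a_hat, a_true)
```

The regression target depends on the clean sample $x$, yet the fitted slope recovers the score of the marginal noisy distribution. This is the property that makes denoising score matching usable without access to the data score.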
Noise Conditional Score Network (NCSN)
Song & Ermon (2019). Train a single network $s_\theta(x, \sigma)$ conditioned on the noise level $\sigma$ to estimate scores at multiple noise scales $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:
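\[\mathcal{L}(\theta) = \sum_{l=1}^L \lambda(\sigma_l)\, \mathbb{E}_{x \sim p_\text{data},\, \tilde{x} \sim q_{\sigma_l}(\tilde{x}|x)}\!\left[\|s_\theta(\tilde{x}, \sigma_l) - \nabla_{\tilde{x}} \log q_{\sigma_l}(\tilde{x}|x)\|^2\right]\]
Here $\lambda(\sigma_l) > 0$ weights each noise level. Sampling uses annealed Langevin dynamics: start at the largest noise level $\sigma_1$, run a number of Langevin steps with $s_\theta(\cdot, \sigma_1)$, move to the next smaller noise level starting from the current samples, and repeat down to $\sigma_L$.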
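The annealed sampling loop can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the data distribution is a 1-D Gaussian so that the score at every noise level is available analytically (standing in for the trained network), and the step-size rule $\alpha_l = \epsilon\,\sigma_l^2/\sigma_L^2$, the value of `eps`, and the noise schedule are illustrative choices.

```python
import numpy as np

def annealed_langevin(score, sigmas, n_samples, eps=2e-5, steps_per_level=100, rng=None):
    """Run Langevin dynamics at each noise level, from the largest sigma to the smallest."""
    rng = np.random.default_rng() if rng is None else rng
    x = sigmas[0] * rng.standard_normal(n_samples)  # broad initialization
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2     # step size shrinks with the noise level
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_samples)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

# Toy data N(mu, s^2) (assumption): the sigma-perturbed density is N(mu, s^2 + sigma^2).
mu, s = 3.0, 0.2
score = lambda x, sigma: -(x - mu) / (s**2 + sigma**2)

sigmas = np.geomspace(5.0, 0.01, num=10)            # sigma_1 > ... > sigma_L
rng = np.random.default_rng(0)
samples = annealed_langevin(score, sigmas, n_samples=5000, rng=rng)
print(samples.mean(), samples.std())
```

The large noise levels move samples quickly toward the right region even from a poor initialization, and the small levels refine them toward the data distribution.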
Stochastic Differential Equations (SDEs)
Song et al. (2021) unify diffusion models and score-based models under a continuous-time SDE framework.
Forward SDE (noising process):
\[dx = f(x, t) \, dt + g(t) \, dW\]
- $f$: drift coefficient (deterministic).
- $g$: diffusion coefficient (noise scale).
- $W$: standard Wiener process (Brownian motion).
Reverse SDE (denoising process, Anderson 1982):
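\[dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)] \, dt + g(t) \, d\bar{W}\]
Here $\bar{W}$ is a Wiener process running backward in time, from $t = T$ to $t = 0$. The reverse SDE requires the score $\nabla_x \log p_t(x)$ at each time $t$, which is approximated by the learned score network $s_\theta(x, t)$.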
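DDPM as SDE: corresponds to the Variance Preserving (VP) SDE, with $f(x, t) = -\frac{1}{2}\beta(t)\,x$ and $g(t) = \sqrt{\beta(t)}$, where $\beta(t)$ is the noise schedule.
SMLD (score matching with Langevin dynamics, i.e. NCSN) as SDE: corresponds to the Variance Exploding (VE) SDE, with $f(x, t) = 0$ and $g(t) = \sqrt{d[\sigma^2(t)]/dt}$.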
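A minimal sketch of sampling with the reverse SDE, using Euler-Maruyama steps on a VP SDE. Several things here are assumptions made for illustration: a constant $\beta$ (so $f = -\frac{1}{2}\beta x$, $g = \sqrt{\beta}$), a 1-D Gaussian data distribution whose time-$t$ score is known in closed form and stands in for $s_\theta(x, t)$, and the helper names `marginal`, `score`, and `reverse_sde_sample`.

```python
import numpy as np

beta, T = 1.0, 8.0   # constant beta and time horizon (assumptions)
m, s = 2.0, 0.3      # toy data distribution N(m, s^2) (assumption)

def marginal(t):
    """Mean and variance of p_t under the VP SDE with constant beta."""
    decay = np.exp(-beta * t)
    return m * np.sqrt(decay), s**2 * decay + 1.0 - decay

def score(x, t):
    """Exact score of p_t; a trained network s_theta(x, t) would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var

def reverse_sde_sample(n_samples, n_steps=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = rng.standard_normal(n_samples)                # p_T is close to N(0, 1)
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * beta * x - beta * score(x, t)  # f - g^2 * score
        # Step backward in time: subtract the drift, add fresh noise.
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)
    return x

samples = reverse_sde_sample(5000, rng=np.random.default_rng(0))
print(samples.mean(), samples.std())
```

Starting from pure noise, the samples should end up distributed approximately as the toy data distribution, up to discretization error from the finite step size.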
Probability Flow ODE
The forward SDE above has a corresponding deterministic ODE whose solutions have the same marginal densities $p_t(x)$:
\[\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\]
Benefits of ODE sampling:
- Deterministic; exact inversion via reverse-time ODE (enables image editing).
- Can use off-the-shelf ODE solvers, which typically need fewer function evaluations than SDE samplers.
- Enables exact likelihood computation via the instantaneous change of variables formula.
DDIM can be viewed as a particular discretization of the probability flow ODE for the VP SDE.
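The deterministic sampler can be sketched in the same spirit. The setup below is an assumption for illustration: a VP SDE with constant $\beta$, 1-D Gaussian data with an analytic score in place of $s_\theta$, and Heun's second-order method as the solver (any ODE solver could be swapped in). For this Gaussian toy case the ODE solution is known in closed form, which gives a direct correctness check.

```python
import numpy as np

beta, T = 1.0, 8.0   # constant beta and time horizon (assumptions)
m, s = 2.0, 0.3      # toy data distribution N(m, s^2) (assumption)

def marginal(t):
    """Mean and variance of p_t under the VP SDE with constant beta."""
    decay = np.exp(-beta * t)
    return m * np.sqrt(decay), s**2 * decay + 1.0 - decay

def score(x, t):
    mean, var = marginal(t)
    return -(x - mean) / var

def velocity(x, t):
    """Right-hand side of the probability flow ODE: f - 0.5 * g^2 * score."""
    return -0.5 * beta * x - 0.5 * beta * score(x, t)

def ode_sample(x_T, n_steps=1000):
    """Integrate from t = T down to t = 0 with Heun's method."""
    dt = -T / n_steps                       # negative step: backward in time
    x, t = np.array(x_T, dtype=float), T
    for _ in range(n_steps):
        k1 = velocity(x, t)
        k2 = velocity(x + dt * k1, t + dt)
        x = x + 0.5 * dt * (k1 + k2)
        t = t + dt
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)
x_0 = ode_sample(x_T)

# Closed form for this toy case: the flow preserves the standardized coordinate.
mean_T, var_T = marginal(T)
x_0_exact = m + s * (x_T - mean_T) / np.sqrt(var_T)
```

No noise is injected after the initial draw, so the same `x_T` always maps to the same `x_0`. That one-to-one map is what makes inversion and likelihood evaluation possible.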
Score Distillation Sampling (SDS)
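Score distillation sampling (Poole et al., 2022) uses a pretrained diffusion model’s score as a gradient signal to optimize the parameters $\theta$ of a separate differentiable generator (e.g., NeRF, mesh, image) whose output is $z$:
\[\nabla_\theta \mathcal{L}_\text{SDS} = \mathbb{E}_{t, \epsilon}\!\left[w(t)(\hat{\epsilon}_\phi(z_t, t, c) - \epsilon) \frac{\partial z}{\partial \theta}\right]\]
Here $z_t$ is the noised version of $z$ at time $t$, $\epsilon$ is the injected noise, $\hat{\epsilon}_\phi$ is the pretrained noise predictor conditioned on $c$ (e.g., a text prompt), and $w(t)$ is a weighting function. The diffusion model provides a gradient signal without backpropagating through the diffusion network itself. Used in DreamFusion (text-to-3D), Magic3D, and text-guided image editing.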
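A toy sketch of the SDS update, under heavy simplifying assumptions that are not part of any real pipeline: the "generator" is the identity ($z = \theta$, so $\partial z / \partial \theta = 1$), the "pretrained model" is the analytically optimal noise predictor for a 1-D Gaussian $\mathcal{N}(\mu, s^2)$, conditioning $c$ is dropped, $w(t) = 1$, and the noise level is parameterized directly by $\bar{\alpha}$. The names `eps_hat` and `sds_grad` are hypothetical.

```python
import numpy as np

mu, s = 1.5, 0.5   # toy "pretrained" model of N(mu, s^2) (assumption)

def eps_hat(z_t, alpha_bar):
    """Optimal noise prediction for data ~ N(mu, s^2): -sqrt(1 - alpha_bar) * score."""
    var_t = alpha_bar * s**2 + (1.0 - alpha_bar)
    return np.sqrt(1.0 - alpha_bar) * (z_t - np.sqrt(alpha_bar) * mu) / var_t

def sds_grad(theta, rng, batch=256):
    """Monte Carlo SDS gradient with z = theta (dz/dtheta = 1) and w(t) = 1."""
    alpha_bar = rng.uniform(0.02, 0.98, size=batch)   # random noise levels
    eps = rng.standard_normal(batch)
    z_t = np.sqrt(alpha_bar) * theta + np.sqrt(1.0 - alpha_bar) * eps
    # The noise predictor is only evaluated, never differentiated.
    return np.mean(eps_hat(z_t, alpha_bar) - eps)

rng = np.random.default_rng(0)
theta = -2.0
for _ in range(2000):
    theta -= 0.05 * sds_grad(theta, rng)              # plain gradient descent
print(theta)
```

In this toy setting the parameter drifts to the mode $\mu$ of the modeled distribution. With a real generator, the residual $\hat{\epsilon}_\phi - \epsilon$ would be multiplied by the generator's Jacobian through automatic differentiation.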
Tweedie’s Formula
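Tweedie’s formula gives the posterior mean of the clean sample $x_0$ given the noisy sample $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$:
\[\mathbb{E}[x_0 | x_t] \approx \frac{x_t + (1-\bar{\alpha}_t) s_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\]
The relation is exact when $s_\theta$ is replaced by the true score $\nabla_{x_t} \log p_t(x_t)$. It shows that knowing the score at noise level $t$ is equivalent to knowing the optimal (minimum mean squared error) denoiser, which connects denoising, score matching, and diffusion in a single formula.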
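The formula can be checked numerically in a case where the posterior mean is known independently. The sketch below assumes a 1-D Gaussian prior $x_0 \sim \mathcal{N}(m, s^2)$ and a single fixed $\bar{\alpha}$, both chosen for illustration, and compares Tweedie’s estimate (using the exact score) with the result of standard Gaussian conditioning.

```python
import numpy as np

m, s = 1.0, 2.0    # toy prior x_0 ~ N(m, s^2) (assumption)
alpha_bar = 0.6    # fixed noise level (assumption)

# Variance of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
var_t = alpha_bar * s**2 + 1.0 - alpha_bar

def score(x_t):
    """Exact score of p_t = N(sqrt(alpha_bar) * m, var_t)."""
    return -(x_t - np.sqrt(alpha_bar) * m) / var_t

def tweedie_denoise(x_t):
    """Posterior mean of x_0 via Tweedie's formula."""
    return (x_t + (1.0 - alpha_bar) * score(x_t)) / np.sqrt(alpha_bar)

def posterior_mean(x_t):
    """Closed-form E[x_0 | x_t] from Gaussian conditioning."""
    gain = np.sqrt(alpha_bar) * s**2 / var_t
    return m + gain * (x_t - np.sqrt(alpha_bar) * m)

x_t = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(tweedie_denoise(x_t) - posterior_mean(x_t))))
```

The two functions agree to floating-point precision for every `x_t`, which is the exact-score case of the formula. With a learned score the agreement is only as good as the score estimate.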
Connections Between Model Families
| Framework | Score Perspective |
|---|---|
| DDPM | Predicts noise $\epsilon_\theta(x_t, t) = -\sigma_t\, s_\theta(x_t, t)$ with $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ (a rescaled score) |
| NCSN | Directly predicts $\nabla_x \log p_\sigma(x)$ |
| SDE framework | Unifies both via VP/VE SDE |
| Flow matching | Learns a velocity field $v_t$ instead of the score; for Gaussian probability paths, $v_t$ and the score determine each other |
| EBM | Score = negative energy gradient $-\nabla_x E_\theta(x)$ |
The score function is the central quantity connecting diffusion models, EBMs, and normalizing flows.