Module 18a - Denoising Score Matching for Energy-Based Models

This module is based on the work: How to Train Your Energy-Based Models by Yang Song and Diederik P. Kingma (2021).


Theory for Energy-Based Models (EBM)

The density given by an EBM is:

$$p_{\theta}(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},$$

where $E_\theta:\mathbb{R}^d \to \mathbb{R}$ and $Z_\theta=\int \exp(-E_\theta(x))\, dx$.

Given samples $x_1,\dots, x_N$ in $\mathbb{R}^d$, we want to find the parameter $\theta$ maximizing the log-likelihood $\max_\theta \sum_{i=1}^N \log p_{\theta}(x_i)$. Since $Z_\theta$ is a function of $\theta$, evaluating and differentiating $\log p_{\theta}(x)$ w.r.t. $\theta$ involve a typically intractable integral.
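As a concrete illustration, the energy function $E_\theta$ can be any neural network mapping $\mathbb{R}^d$ to $\mathbb{R}$. Below is a minimal sketch of such a network; the name `EnergyNet`, the layer sizes, and the activation are illustrative assumptions, not the architecture used later in this module.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Minimal energy network E_theta: R^d -> R (illustrative architecture)."""
    def __init__(self, d, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar energy per sample
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # shape: (batch,)
```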

Maximum Likelihood Training with MCMC

We can estimate the gradient of the log-likelihood with MCMC approaches:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z_\theta.$$

The first term is simple to compute (with automatic differentiation).

Maths: computing $\nabla_\theta \log Z_\theta$. We have:

$$\nabla_\theta \log Z_\theta = \mathbb{E}_{p_{\theta}(x)}\left[-\nabla_\theta E_\theta(x)\right] \left(= \int p_{\theta}(x) \left[-\nabla_\theta E_\theta(x)\right] dx \right).$$

Proof:

$$\begin{array}{rcl}
\nabla_\theta \log Z_\theta &=& \frac{\nabla_\theta Z_\theta}{Z_\theta}\\
&=& \frac{1}{Z_\theta} \int \nabla_\theta \exp(-E_\theta(x))\, dx\\
&=& \frac{-1}{Z_\theta} \int \nabla_\theta E_\theta(x) \exp(-E_\theta(x))\, dx\\
&=& \mathbb{E}_{p_{\theta}(x)}\left[-\nabla_\theta E_\theta(x)\right]
\end{array}$$

Thus, we can obtain an unbiased one-sample Monte Carlo estimate of $\nabla_\theta \log Z_\theta$ (and hence of the log-likelihood gradient) by

$$\nabla_\theta \log Z_\theta \approx -\nabla_\theta E_\theta(\tilde{x}),$$

with $\tilde{x}\sim p_\theta(x)$, i.e. a random sample from the distribution given by the EBM. Therefore, we need to draw random samples from the model. As explained during the course, this can be done using Langevin MCMC. First note that the gradient of the log-probability w.r.t. $x$ (which is the score) is easy to calculate:

$$\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) \quad \text{since } \nabla_x \log Z_\theta = 0.$$

Hence, in this case, Langevin MCMC is given by:

$$x_t = x_{t-1} - \epsilon \nabla_x E_\theta(x_{t-1}) + \sqrt{2\epsilon}\, z_t,$$

where $z_t\sim \mathcal{N}(0,I)$. When $\epsilon\to 0$ and $t\to \infty$, $x_t$ will be distributed as $p_\theta(x)$ (under some regularity conditions).
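To make this concrete, here is a minimal sketch of Langevin MCMC for an energy network such as the `EnergyNet` above, with the gradient $\nabla_x E_\theta$ obtained by automatic differentiation. The step size `eps` and the number of steps are illustrative assumptions and must be tuned in practice.

```python
import torch

def langevin_sample(energy, x0, n_steps=100, eps=1e-2):
    """Langevin MCMC: x_t = x_{t-1} - eps * grad_x E(x_{t-1}) + sqrt(2*eps) * z_t."""
    x = x0.detach().clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad_E = torch.autograd.grad(energy(x).sum(), x)[0]  # gradient of the energy w.r.t. x
        x = (x - eps * grad_E + (2 * eps) ** 0.5 * torch.randn_like(x)).detach()
    return x

# Illustrative usage: start from Gaussian noise and (approximately) sample from p_theta.
# x_init = torch.randn(64, d)
# samples = langevin_sample(energy_net, x_init)
```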

In this homework, we will consider an alternative learning procedure.

Score Matching

The score (which was used in Langevin MCMC above) is defined as

$$s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) = -\left( \frac{\partial E_\theta(x)}{\partial x_1},\dots, \frac{\partial E_\theta(x)}{\partial x_d}\right).$$
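In code, the score of an EBM can be obtained by differentiating the energy with respect to its input. The sketch below is one possible implementation with PyTorch autograd (an assumption, not the module's own code); `create_graph=True` keeps the computation graph so that the score can later be differentiated w.r.t. $\theta$ during training.

```python
import torch

def score(energy, x):
    """s_theta(x) = -grad_x E_theta(x), computed with autograd."""
    x = x.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]
    return -grad_E
```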

If $p(x)$ denotes the (unknown) data distribution, the basic score matching objective minimizes:

$$\mathbb{E}_{p(x)} \|\nabla_x \log p(x) - s_\theta(x)\|^2.$$

The problem with this objective is that we cannot compute $\nabla_x \log p(x)$ since $p(x)$ is unknown; we can only approximate expectations with respect to $p(x)$ by empirical averages. Fortunately, we can solve this issue thanks to the identity:

$$\mathbb{E}_{p(x)} \|\nabla_x \log p(x) - s_\theta(x)\|^2 = c + \mathbb{E}_{p(x)}\left[ \sum_{i=1}^d\left( \frac{\partial E_\theta(x)}{\partial x_i}\right)^2 - 2\,\frac{\partial^2 E_\theta(x)}{\partial x^2_i}\right],$$

where $c$ is a constant (not depending on $\theta$).

Proof:

$$\begin{array}{rcl}
\mathbb{E}_{p(x)} \|\nabla_x \log p(x) - s_\theta(x)\|^2 &=& \mathbb{E}_{p(x)} \|\nabla_x \log p(x)\|^2 + \mathbb{E}_{p(x)} \|s_\theta(x)\|^2 - 2\, \mathbb{E}_{p(x)} \langle \nabla_x \log p(x), s_\theta(x)\rangle\\
&=& c + \mathbb{E}_{p(x)}\left[ \sum_{i=1}^d\left( \frac{\partial E_\theta(x)}{\partial x_i}\right)^2\right] - 2 \int p(x) \left\langle \frac{\nabla_x p(x)}{p(x)}, s_\theta(x)\right\rangle dx\\
&=& c + \mathbb{E}_{p(x)}\left[ \sum_{i=1}^d\left( \frac{\partial E_\theta(x)}{\partial x_i}\right)^2\right] + 2\int p(x)\, \nabla_x \cdot s_\theta(x)\, dx,
\end{array}$$

by integration by parts, where for a vector-valued function $v(x_1,x_2,x_3)$ we write $\nabla_x \cdot v = \frac{\partial v_1}{\partial x_1} + \frac{\partial v_2}{\partial x_2} + \frac{\partial v_3}{\partial x_3}$ (and similarly in dimension $d$). Since $s_\theta(x) = -\nabla_x E_\theta(x)$, we have $\nabla_x \cdot s_\theta(x) = -\sum_{i=1}^d \frac{\partial^2 E_\theta(x)}{\partial x_i^2}$, and the statement follows.
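For small $d$, this objective can be implemented directly, at the cost of $d$ extra backward passes for the diagonal of the Hessian. The sketch below is illustrative only (it is not the loss used in this homework) and assumes an energy network like `EnergyNet` above.

```python
import torch

def score_matching_loss(energy, x):
    """Score matching objective (up to the constant c):
    mean over the batch of sum_i (dE/dx_i)^2 - 2 * d^2E/dx_i^2."""
    x = x.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]  # shape (batch, d)
    loss = (grad_E ** 2).sum(dim=1)
    # Diagonal of the Hessian: one extra backward pass per coordinate (expensive for large d).
    for i in range(x.shape[1]):
        second = torch.autograd.grad(grad_E[:, i].sum(), x, create_graph=True)[0][:, i]
        loss = loss - 2.0 * second
    return loss.mean()
```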

Denoising Score Matching

There are several drawbacks to the score matching approach: computing the trace of the Hessian is expensive, and scores will not be accurately estimated in low-density regions; see Generative Modeling by Estimating Gradients of the Data Distribution.

Denoising score matching is an elegant and scalable solution. Consider the random variable $Y = X+\sigma Z$, where $X\sim p(x)$ and $Z\sim\mathcal{N}(0,I)$. We denote by $p^\sigma(y)$ the distribution of $Y$, so that we have:

$$\nabla_y\log p^\sigma(y) = -\frac{1}{\sigma}\mathbb{E}\left[ Z \,|\, Y=y\right] = -\frac{1}{\sigma}\mathbb{E}\left[ Z \,|\, X+\sigma Z=y\right].$$

Proof:

$$\nabla_y\log p^\sigma(y) = \frac{\nabla_y p^\sigma(y)}{p^\sigma(y)}.$$

We denote by $\varphi$ the density of $\mathcal{N}(0,\sigma^2 I)$. We have $p^\sigma(y) = \int p(x) \varphi(y-x)\, dx$, so that, using the fact that $\nabla_z \varphi(z) = -\frac{z}{\sigma^2} \varphi(z)$, we get

$$\begin{array}{rcl}
\nabla_y p^\sigma(y) &=& \int p(x) \nabla_y \varphi(y-x)\, dx\\
&=& \int p(x) \frac{-(y-x)}{\sigma^2} \varphi(y-x)\, dx \\
&=& -\frac{p^\sigma(y)}{\sigma}\,\mathbb{E}\left[ \frac{Y-X}{\sigma} \,|\, Y=y\right]\\
&=& -\frac{p^\sigma(y)}{\sigma}\,\mathbb{E}\left[ Z \,|\, Y=y\right],
\end{array}$$

where the third equality uses that the conditional density of $X$ given $Y=y$ is $p(x)\varphi(y-x)/p^\sigma(y)$. Dividing by $p^\sigma(y)$ gives the result.
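As a quick sanity check of this identity (an added example, not part of the original module), take $d=1$ and $X\sim\mathcal{N}(0,1)$, so that $Y = X+\sigma Z\sim\mathcal{N}(0,1+\sigma^2)$ and $(Z,Y)$ are jointly Gaussian:

$$\nabla_y \log p^\sigma(y) = -\frac{y}{1+\sigma^2}, \qquad \mathbb{E}[Z \,|\, Y=y] = \frac{\operatorname{Cov}(Z,Y)}{\operatorname{Var}(Y)}\, y = \frac{\sigma}{1+\sigma^2}\, y,$$

so that $-\frac{1}{\sigma}\mathbb{E}[Z \,|\, Y=y] = -\frac{y}{1+\sigma^2}$, as claimed.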

The denoising score matching objective is now

$$\mathbb{E}_{p^\sigma(y)}\|\nabla_y \log p^\sigma(y) - s_\theta(y)\|^2,$$

which we will minimize by gradient descent on the parameter $\theta$.

In practice, we use the following relation:

$$\mathbb{E}_{p^\sigma(y)}\|\nabla_y \log p^\sigma(y) - s_\theta(y)\|^2 = \mathbb{E}\left\| \frac{Z}{\sigma}+s_\theta(X+\sigma Z)\right\|^2 - C,$$

where $C$ does not depend on $\theta$ (it is made explicit below).

Proof: We have

$$\begin{array}{rcl}
\mathbb{E}_{p^\sigma(y)}\|\nabla_y \log p^\sigma(y) - s_\theta(y)\|^2 &=& \mathbb{E} \left[\left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right] +s_\theta(Y)\right\|^2\right]\\
&=& \mathbb{E} \left[\left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 + \left\|s_\theta(Y)\right\|^2 + 2 \left\langle \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right], s_\theta(Y)\right\rangle \right]\\
&=& \mathbb{E} \left[\left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 \right] + \mathbb{E} \left[ \mathbb{E} \left[ \left\|s_\theta(Y)\right\|^2 + 2 \left\langle \frac{Z}{\sigma}, s_\theta(Y)\right\rangle \,|\, Y \right]\right]\\
&=& \mathbb{E} \left[\left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 \right] + \mathbb{E} \left[ \left\|s_\theta(Y)\right\|^2 + 2 \left\langle \frac{Z}{\sigma}, s_\theta(Y)\right\rangle \right]\\
&=& \mathbb{E} \left[\left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 \right] + \mathbb{E} \left[ \left\|s_\theta(Y) + \frac{Z}{\sigma} \right\|^2 \right] - \mathbb{E} \left[ \left\|\frac{Z}{\sigma}\right\|^2\right]\\
&=& \mathbb{E}\left\| \frac{Z}{\sigma}+s_\theta(X+\sigma Z)\right\|^2 - \mathbb{E} \left[ \left\|\frac{Z}{\sigma}\right\|^2 - \left\| \mathbb{E} \left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 \right],
\end{array}$$

so that $C = \mathbb{E}\left[ \left\|\frac{Z}{\sigma}\right\|^2 - \left\| \mathbb{E}\left[\frac{Z}{\sigma} \,|\, Y\right]\right\|^2 \right]$.

Hence, in practice, we will minimize the (random) loss:

$$\ell(\theta; x_1,\dots, x_N) = \frac{1}{N} \sum_{i=1}^N \left\| \frac{z_i}{\sigma}+s_\theta(x_i+\sigma z_i)\right\|^2,$$

where the $z_i$ are i.i.d. standard Gaussian. As the dataset is too large to process in one pass, we will run the SGD algorithm, i.e. split the data into batches and use automatic differentiation to get the gradient w.r.t. $\theta$ over each batch.
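Putting things together, here is a minimal sketch of the denoising score matching training step. It is illustrative only: the noise level `sigma`, the learning rate, the data loader, and the optimizer choice are assumptions, and `EnergyNet` and `score` refer to the sketches given earlier, not to the module's actual code in the next section.

```python
import torch

def dsm_loss(energy, x, sigma):
    """Denoising score matching loss: mean_i || z_i/sigma + s_theta(x_i + sigma*z_i) ||^2."""
    z = torch.randn_like(x)
    s = score(energy, x + sigma * z)  # score sketch defined above
    return ((z / sigma + s) ** 2).sum(dim=1).mean()

# Illustrative SGD loop: data_loader yields batches of shape (batch_size, d).
# energy_net = EnergyNet(d)
# optimizer = torch.optim.SGD(energy_net.parameters(), lr=1e-3)
# for x_batch in data_loader:
#     loss = dsm_loss(energy_net, x_batch, sigma=0.1)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```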

Code for Energy-Based Models