$z_t$, where $z_t \sim \mathcal{N}(0, I)$. When $\epsilon \to 0$ and $t \to \infty$, $x_t$ will be distributed as $p_\theta(x)$ (under some regularity conditions).
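As a concrete illustration, here is a minimal PyTorch sketch of this sampler, assuming the standard unadjusted Langevin update $x_{t+1} = x_t + \epsilon\, s_\theta(x_t) + \sqrt{2\epsilon}\, z_t$; the names `langevin_sample` and `score_fn` are our own, and `score_fn` stands for the score $s_\theta(x) = \nabla_x \log p_\theta(x)$ defined below.

```python
import torch

def langevin_sample(score_fn, x0, eps=1e-3, n_steps=1000):
    """Approximate sampling from p_theta via (unadjusted) Langevin MCMC.

    score_fn: callable computing s_theta(x) = grad_x log p_theta(x)
    x0: initial points, tensor of shape (batch, d)
    """
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)                          # z_t ~ N(0, I)
        x = x + eps * score_fn(x) + (2 * eps) ** 0.5 * z
    return x
```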
In this homework, we will consider an alternative learning procedure.
The score (which was used in Langevin MCMC above) is defined as
$$s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) = -\left(\frac{\partial E_\theta(x)}{\partial x_1}, \dots, \frac{\partial E_\theta(x)}{\partial x_d}\right).$$
If $p(x)$ denotes the (unknown) data distribution, the basic score matching objective minimizes:
$$\mathbb{E}_{p(x)} \left\|\nabla_x \log p(x) - s_\theta(x)\right\|^2.$$
The problem with this objective is that we cannot compute $\nabla_x \log p(x)$, as $p(x)$ is unknown; we can only approximate expectations with respect to $p(x)$ by empirical averages. Fortunately, we can solve this issue, as we have:
$$\mathbb{E}_{p(x)} \left\|\nabla_x \log p(x) - s_\theta(x)\right\|^2 = c + \mathbb{E}_{p(x)}\left[\sum_{i=1}^d \left(\frac{\partial E_\theta(x)}{\partial x_i}\right)^2 - 2\,\frac{\partial^2 E_\theta(x)}{\partial x_i^2}\right],$$
where $c$ is a constant (not depending on $\theta$).
Proof:
$$\begin{aligned}
\mathbb{E}_{p(x)} \left\|\nabla_x \log p(x) - s_\theta(x)\right\|^2
&= \mathbb{E}_{p(x)} \left\|\nabla_x \log p(x)\right\|^2 + \mathbb{E}_{p(x)} \left\|s_\theta(x)\right\|^2 - 2\,\mathbb{E}_{p(x)} \left\langle \nabla_x \log p(x), s_\theta(x)\right\rangle\\
&= c + \mathbb{E}_{p(x)}\left[\sum_{i=1}^d \left(\frac{\partial E_\theta(x)}{\partial x_i}\right)^2\right] - 2\int p(x) \left\langle \frac{\nabla_x p(x)}{p(x)}, s_\theta(x)\right\rangle dx\\
&= c + \mathbb{E}_{p(x)}\left[\sum_{i=1}^d \left(\frac{\partial E_\theta(x)}{\partial x_i}\right)^2\right] + 2\int p(x)\, \nabla_x \cdot s_\theta(x)\, dx,
\end{aligned}$$
by integration by parts, where for a vector-valued function $v(x_1, x_2, x_3)$, $\nabla_x \cdot v = \frac{\partial v_1}{\partial x_1} + \frac{\partial v_2}{\partial x_2} + \frac{\partial v_3}{\partial x_3}$. Since $\nabla_x \cdot s_\theta(x) = -\sum_{i=1}^d \frac{\partial^2 E_\theta(x)}{\partial x_i^2}$, the statement follows.
There are several drawbacks to the score matching approach: computing the trace of the Hessian is expensive, and scores will not be accurately estimated in low-density regions; see *Generative Modeling by Estimating Gradients of the Data Distribution*.
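To make the cost concrete, here is a minimal PyTorch sketch of this objective for an energy network (the names `score_matching_loss` and `energy` are ours); the loop computing the diagonal of the Hessian needs one extra backward pass per coordinate, which is exactly the expensive part mentioned above.

```python
import torch

def score_matching_loss(energy, x):
    """Score matching loss E[sum_i (dE/dx_i)^2 - 2 d^2E/dx_i^2] (constant c dropped).

    energy: network mapping (batch, d) -> (batch,) giving E_theta(x)
    x: a batch of data points, shape (batch, d)
    """
    x = x.detach().requires_grad_(True)
    e = energy(x).sum()
    (grad,) = torch.autograd.grad(e, x, create_graph=True)  # dE/dx_i, shape (batch, d)
    loss = (grad ** 2).sum(dim=1)                           # first term
    # Diagonal of the Hessian: one extra backward pass per input dimension,
    # which is what makes plain score matching expensive in high dimension.
    for i in range(x.shape[1]):
        (g2,) = torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)
        loss = loss - 2 * g2[:, i]                          # second term
    return loss.mean()
```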
Denoising score matching is an elegant and scalable solution. Consider the random variable $Y = X + \sigma Z$, where $X \sim p(x)$ and $Z \sim \mathcal{N}(0, I)$. We denote by $p_\sigma(y)$ the distribution of $Y$, so that we have:
$$\nabla_y \log p_\sigma(y) = -\frac{1}{\sigma}\,\mathbb{E}[Z \mid Y = y] = -\frac{1}{\sigma}\,\mathbb{E}[Z \mid X + \sigma Z = y].$$
Proof:
$$\nabla_y \log p_\sigma(y) = \frac{\nabla_y p_\sigma(y)}{p_\sigma(y)}.$$
We denote by $\varphi$ the density of $\mathcal{N}(0, \sigma^2 I)$. We have $p_\sigma(y) = \int p(x)\, \varphi(y - x)\, dx$, so that using the fact that $\nabla_z \varphi(z) = -\frac{z}{\sigma^2}\,\varphi(z)$, we get
$$\begin{aligned}
\nabla_y \log p_\sigma(y) &= \frac{1}{p_\sigma(y)} \int p(x)\, \nabla_y \varphi(y - x)\, dx\\
&= \int \frac{p(x)\, \varphi(y - x)}{p_\sigma(y)}\, \frac{-(y - x)}{\sigma^2}\, dx\\
&= -\frac{1}{\sigma}\,\mathbb{E}\!\left[\frac{Y - X}{\sigma} \,\Big|\, Y = y\right]\\
&= -\frac{1}{\sigma}\,\mathbb{E}[Z \mid Y = y],
\end{aligned}$$
where we used that $\frac{p(x)\, \varphi(y - x)}{p_\sigma(y)}$ is the conditional density of $X$ given $Y = y$.
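This identity can be checked numerically in a case where everything is explicit. For $X \sim \mathcal{N}(0, 1)$ in one dimension, $Y \sim \mathcal{N}(0, 1 + \sigma^2)$, so the exact score is $-y/(1 + \sigma^2)$; the sketch below (plain NumPy, all names ours) estimates $-\mathbb{E}[Z \mid Y = y]/\sigma$ by averaging $z$ over samples whose $x + \sigma z$ falls close to $y$.

```python
import numpy as np

# Sanity check of grad_y log p_sigma(y) = -E[Z | Y=y] / sigma in 1-D:
# for X ~ N(0, 1), Y ~ N(0, 1 + sigma^2) and the exact score is -y / (1 + sigma^2).
rng = np.random.default_rng(0)
sigma, y, n = 0.5, 0.8, 10**7

x = rng.standard_normal(n)
z = rng.standard_normal(n)
mask = np.abs(x + sigma * z - y) < 0.01  # samples with Y close to y

print(-z[mask].mean() / sigma)           # Monte Carlo estimate, approx -0.64
print(-y / (1 + sigma**2))               # exact score: -0.64
```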
The denoising score matching objective is now
$$\mathbb{E}_{p_\sigma(y)} \left\|\nabla_y \log p_\sigma(y) - s_\theta(y)\right\|^2,$$
which we will minimize by gradient descent on the parameters $\theta$.
In practice, we use the following relation:
$$\mathbb{E}_{p_\sigma(y)} \left\|\nabla_y \log p_\sigma(y) - s_\theta(y)\right\|^2 = \mathbb{E}\left\|\frac{Z}{\sigma} + s_\theta(X + \sigma Z)\right\|^2 - C,$$
where $C$ does not depend on $\theta$ (made explicit below).
Proof: We have
$$\begin{aligned}
\mathbb{E}_{p_\sigma(y)} \left\|\nabla_y \log p_\sigma(y) - s_\theta(y)\right\|^2
&= \mathbb{E}\left[\left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right] + s_\theta(Y)\right\|^2\right]\\
&= \mathbb{E}\left[\left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2 + \|s_\theta(Y)\|^2 + 2\left\langle \mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right], s_\theta(Y)\right\rangle\right]\\
&= \mathbb{E}\left[\left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2\right] + \mathbb{E}\left[\mathbb{E}\!\left[\|s_\theta(Y)\|^2 + 2\left\langle \tfrac{Z}{\sigma}, s_\theta(Y)\right\rangle \,\middle|\, Y\right]\right]\\
&= \mathbb{E}\left[\left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2\right] + \mathbb{E}\left[\|s_\theta(Y)\|^2 + 2\left\langle \tfrac{Z}{\sigma}, s_\theta(Y)\right\rangle\right]\\
&= \mathbb{E}\left[\left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2\right] + \mathbb{E}\left[\left\|s_\theta(Y) + \tfrac{Z}{\sigma}\right\|^2\right] - \mathbb{E}\left[\left\|\tfrac{Z}{\sigma}\right\|^2\right]\\
&= \mathbb{E}\left\|\tfrac{Z}{\sigma} + s_\theta(X + \sigma Z)\right\|^2 - \mathbb{E}\left[\left\|\tfrac{Z}{\sigma}\right\|^2 - \left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2\right],
\end{aligned}$$
where the third equality uses the tower property of conditional expectation. This makes the constant explicit: $C = \mathbb{E}\left[\left\|\tfrac{Z}{\sigma}\right\|^2 - \left\|\mathbb{E}\!\left[\tfrac{Z}{\sigma} \,\middle|\, Y\right]\right\|^2\right]$. Hence, in practice, we will minimize the (random) loss:
$$\ell(\theta; x_1, \dots, x_N) = \frac{1}{N} \sum_{i=1}^N \left\|\frac{z_i}{\sigma} + s_\theta(x_i + \sigma z_i)\right\|^2,$$
where the $z_i$ are i.i.d. Gaussian. As the dataset is large, we will run the SGD algorithm, i.e., split the data into batches and use automatic differentiation to compute the gradient with respect to $\theta$ on each batch.
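A minimal PyTorch version of this training loop might look as follows; `score_net` is any network mapping $\mathbb{R}^d$ to $\mathbb{R}^d$, and the names (`dsm_loss`, `train`, `loader`) and hyperparameters are illustrative choices of ours, not prescribed by the homework.

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching loss on a batch x of shape (batch, d)."""
    z = torch.randn_like(x)              # z_i iid N(0, I)
    y = x + sigma * z                    # noisy samples y_i = x_i + sigma * z_i
    return ((z / sigma + score_net(y)) ** 2).sum(dim=1).mean()

def train(score_net, loader, sigma=0.1, lr=1e-4, n_epochs=10):
    opt = torch.optim.SGD(score_net.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x in loader:                 # batches from the dataset
            loss = dsm_loss(score_net, x, sigma)
            opt.zero_grad()
            loss.backward()              # automatic differentiation w.r.t. theta
            opt.step()
    return score_net
```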