Notes on Diffusion Models (I) -- DDPM and DDIM

Introduction

A central challenge in machine learning, particularly in generative modeling, is to model complex datasets using highly flexible families of probability distributions that maintain analytical or computational tractability for learning, sampling, inference, and evaluation.

This post summarizes the fundamental concepts of diffusion models, their optimization strategies, and applications, focusing on the mathematical foundations and practical implications.

Denoising Diffusion Probabilistic Models (DDPM)

Diffusion models are a class of generative models first proposed by Sohl-Dickstein et al. (Sohl-Dickstein et al., 2015). Inspired by nonequilibrium thermodynamics, the method systematically and gradually destroys data structure through a forward diffusion process, then learns a reverse process to restore structure and yield a highly flexible and tractable generative model.

Forward Process

A diffusion model formulates the learned data distribution as $p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $\mathbf{x}_1, \dots, \mathbf{x}_T$ are latent variables with the same dimensionality as the real data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$.

The forward process is a Markov chain whose transitions from $\mathbf{x}_{t-1}$ to $\mathbf{x}_t$ follow multivariate Gaussian distributions. The joint distribution of the latent variables $\mathbf{x}_{1:T}$ given the real data $\mathbf{x}_0$ is:

$$
q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \qquad
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\!\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right).
$$

Key properties:

  1. The coefficients $\{\beta_t \in (0, 1)\}_{t=1}^{T}$ are pre-defined and determine the "velocity" of diffusion.
  2. After sufficient diffusion steps, i.e., when $T$ is large enough, the final state $\mathbf{x}_T$ approaches an isotropic Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
  3. For clarity, we use $p_\theta$ to represent the learned distribution and $q$ to represent the real data distribution.

A notable property of this forward process, as mentioned in Ho et al. (Ho et al., 2020) (Section 2), is that $q(\mathbf{x}_t \mid \mathbf{x}_0)$ has a closed-form expression, derived using the reparameterization trick. Let $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$:

$$
\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2} \\
&\;\;\vdots \\
&= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},
\end{aligned}
$$

where $\boldsymbol{\epsilon}, \boldsymbol{\epsilon}_{t-1}, \bar{\boldsymbol{\epsilon}}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, so that

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}\right).
\label{eq:xt_x0_relation}
$$

Here we use the fact that the sum of two uncorrelated multivariate normal random variables, in this case $\sqrt{\alpha_t(1-\alpha_{t-1})}\,\boldsymbol{\epsilon}_{t-2} \sim \mathcal{N}(\mathbf{0}, \alpha_t(1-\alpha_{t-1})\mathbf{I})$ and $\sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, (1-\alpha_t)\mathbf{I})$, is also multivariate normal, with covariance $(1-\alpha_t\alpha_{t-1})\mathbf{I}$.

Typically, we can afford larger update steps when the sample becomes noisier, so $\beta_1 < \beta_2 < \cdots < \beta_T$ and therefore $\bar{\alpha}_1 > \bar{\alpha}_2 > \cdots > \bar{\alpha}_T$.
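
To make the closed-form expression concrete, here is a minimal NumPy sketch of the forward process: it draws $\mathbf{x}_t$ directly from $q(\mathbf{x}_t \mid \mathbf{x}_0)$ via the reparameterization above. The linear $\beta_t$ schedule and the helper name `q_sample` are illustrative assumptions, not prescribed by the post.

```python
import numpy as np

# Assumed linear beta schedule (illustrative values only).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1, ..., beta_T
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)
    using the reparameterization trick; t is 0-indexed into the schedule."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Early steps stay close to x_0; the final step is almost pure isotropic noise.
x0 = np.ones((3, 32, 32))
x_early = q_sample(x0, t=10)
x_late = q_sample(x0, t=T - 1)          # alpha_bar_T is tiny, so x_T is roughly N(0, I)
```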

Reverse Process

Ideally, if we knew $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, we could gradually remove noise from corrupted samples to recover the original image. However, this conditional distribution is not readily available, and its computation requires the entire dataset. Specifically:

$$
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)}, \qquad
q(\mathbf{x}_t) = \int q(\mathbf{x}_t \mid \mathbf{x}_0)\, q(\mathbf{x}_0)\, d\mathbf{x}_0.
\label{eq:imprac_cond_prob_expr}
$$

Computing the marginals $q(\mathbf{x}_{t-1})$ and $q(\mathbf{x}_t)$ requires evaluating the integrals in Eq. (\ref{eq:imprac_cond_prob_expr}) over the unknown data distribution, which is intractable. Instead, we use the diffusion model $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ to learn and approximate the true conditional distribution. When $\beta_t$ is sufficiently small, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is also approximately Gaussian.

The joint distribution of the diffusion model is:

$$
p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\!\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right),
$$

where $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$.

With this distribution, we can sample $\mathbf{x}_T$ from an isotropic Gaussian distribution and expect the reverse process to gradually transform it into samples that follow $q(\mathbf{x}_0)$.
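
The sampling procedure implied by this joint distribution can be sketched as a simple loop. The snippet below is a hedged illustration rather than a reference implementation: `model(x_t, t) -> (mu, sigma)` stands for a hypothetical learned reverse transition returning the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and a scalar standard deviation.

```python
import numpy as np

def p_sample_loop(model, shape, T, rng=None):
    """Reverse-process sketch: start from x_T ~ N(0, I) and repeatedly sample
    x_{t-1} ~ p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)          # x_T drawn from an isotropic Gaussian
    for t in reversed(range(1, T + 1)):
        mu, sigma = model(x, t)             # hypothetical learned reverse transition
        noise = rng.standard_normal(shape) if t > 1 else 0.0  # take the mean at the last step
        x = mu + sigma * noise
    return x                                # intended to be distributed like q(x_0)
```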

Loss Function

The loss function for training the diffusion model is the standard variational bound on the negative log-likelihood:

$$
-\log p_\theta(\mathbf{x}_0)
\le -\log p_\theta(\mathbf{x}_0) + D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\right)
= \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})}\right].
$$

Therefore, the expected negative log-likelihood is upper bounded by the variational bound $L$ (equivalently, the expected log-likelihood is lower bounded by $-L$):

$$
\mathbb{E}_{q(\mathbf{x}_0)}\!\left[-\log p_\theta(\mathbf{x}_0)\right]
\le \mathbb{E}_{q(\mathbf{x}_{0:T})}\!\left[\log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})}\right] =: L.
$$

To convert each term into something analytically computable, the objective can be further rewritten as a combination of several KL-divergence and entropy terms (see the detailed step-by-step derivation in Appendix B of Sohl-Dickstein et al. (Sohl-Dickstein et al., 2015)). The key step is to reverse each forward transition with Bayes' rule while conditioning on $\mathbf{x}_0$, i.e. $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\, q(\mathbf{x}_t \mid \mathbf{x}_0) / q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$, which gives:

$$
L = \mathbb{E}_q\!\left[
D_{\mathrm{KL}}\!\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right)
+ \sum_{t>1} D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)
- \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)
\right].
$$

In summary, we can split the variational bound into components and label them as follows:

$$
\begin{aligned}
L &= L_T + L_{T-1} + \cdots + L_1 + L_0, \\
L_T &:= D_{\mathrm{KL}}\!\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right), \\
L_{t-1} &:= D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right) \quad \text{for } 2 \le t \le T, \\
L_0 &:= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1).
\end{aligned}
\label{eq:lt_before_simple_sigma}
$$

In the above decomposition:

  1. $L_T$ can be computed from Eq. (\ref{eq:xt_x0_relation}): $q(\mathbf{x}_T \mid \mathbf{x}_0)$ has a closed form and $p(\mathbf{x}_T)$ is a fixed isotropic Gaussian, so this term contains no trainable parameters.
  2. $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ and $p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)$ are parameterized and learned.

Next, we show that $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ can be computed in closed form even though $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ can't. By Bayes' rule and the Markov property of the forward process,

$$
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)
= q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}
\propto \exp\!\left(-\frac{1}{2}\left(
\frac{\|\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1}\|^2}{\beta_t}
+ \frac{\|\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0\|^2}{1-\bar{\alpha}_{t-1}}
- \frac{\|\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0\|^2}{1-\bar{\alpha}_t}
\right)\right).
$$

From the above derivation, we observe that the conditional distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is also Gaussian and can be written in standard multivariate normal form as $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}\right)$. Noticing that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the $\tilde{\beta}_t$ and $\tilde{\boldsymbol{\mu}}_t$ can be computed by completing the square in the exponent above:

$$
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t, \qquad
\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t.
\label{eq:standard_form_mu_t}
$$

Recalling the relation between $\mathbf{x}_0$ and $\mathbf{x}_t$ deduced from Eq. (\ref{eq:xt_x0_relation}), namely $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t\right)$, Eq. (\ref{eq:standard_form_mu_t}) can be further rewritten as follows:

$$
\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_t\right).
$$
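
As a quick sanity check on the two forms of $\tilde{\boldsymbol{\mu}}_t$, the NumPy sketch below (using the same assumed schedule as the earlier snippet) computes the posterior mean once from $(\mathbf{x}_t, \mathbf{x}_0)$ via Eq. (\ref{eq:standard_form_mu_t}) and once from the noise $\boldsymbol{\epsilon}_t$ used to generate $\mathbf{x}_t$, and verifies that the two agree numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed schedule, as in the earlier sketch
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                   # any timestep with t > 1 (0-indexed here)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps  # x_t from x_0

# Posterior variance and mean expressed in terms of (x_t, x_0).
beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
mu_from_x0 = (np.sqrt(alpha_bars[t - 1]) * betas[t] / (1.0 - alpha_bars[t]) * x0
              + np.sqrt(alphas[t]) * (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * xt)

# Equivalent form after substituting x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t).
mu_from_eps = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

assert np.allclose(mu_from_x0, mu_from_eps)
```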

Parameterization of the Reverse Diffusion Process $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$

Recall our previous decomposition of the variational bound: we have closed-form expressions for the real forward-process quantities (i.e. $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$), but we still need a parameterization of $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. As discussed previously, when $\beta_t$ is small enough, we can approximate $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ by the Gaussian distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)$. In practice, we can further simplify the loss function Eq. (\ref{eq:lt_before_simple_sigma}) by predefining the variance matrix as $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$; experimentally, $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$ give similar results. We expect the training process to let $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ predict $\tilde{\boldsymbol{\mu}}_t$. With this parameterization, each component $L_{t-1}$ of the loss function is the KL divergence between two multivariate Gaussian distributions and has a relatively simple closed form. The loss term becomes

$$
L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\left\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\right\|^2\right] + C,
$$

where $C$ is a constant that does not depend on $\theta$.
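
To connect this to training code, here is a minimal sketch of a Monte Carlo estimate of $L_{t-1}$ under the fixed-variance choice $\sigma_t^2 = \beta_t$. The function `mu_model(x_t, t)` is a hypothetical network predicting $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$; in a real implementation this would live in an autodiff framework rather than plain NumPy.

```python
import numpy as np

def mean_matching_loss(mu_model, x0_batch, betas, rng=None):
    """One-sample Monte Carlo estimate of
    L_{t-1} = E_q[ ||mu_tilde_t(x_t, x_0) - mu_theta(x_t, t)||^2 / (2 sigma_t^2) ]
    with sigma_t^2 = beta_t; a random timestep is drawn per call."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    t = int(rng.integers(1, len(betas)))                  # random timestep with t > 1
    eps = rng.standard_normal(x0_batch.shape)
    xt = np.sqrt(alpha_bars[t]) * x0_batch + np.sqrt(1.0 - alpha_bars[t]) * eps

    # Target: the true posterior mean, in the epsilon form derived above.
    mu_tilde = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    mu_theta = mu_model(xt, t)                            # hypothetical learned mean

    sq_err = np.sum((mu_tilde - mu_theta) ** 2, axis=tuple(range(1, x0_batch.ndim)))
    return float(np.mean(sq_err) / (2.0 * betas[t]))
```

Sampling a single random timestep per call, rather than summing over all $T$ terms, gives a stochastic estimate of the bound, which matches how Ho et al. (2020) train in practice; the constant $C$ is dropped since it does not affect optimization.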

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2256–2265. https://arxiv.org/pdf/1503.03585.pdf