import Comments from '../../components/Comments'
## Introduction
A central challenge in machine learning, particularly in generative modeling, is to model complex datasets using highly flexible families of probability distributions that maintain analytical or computational tractability for learning, sampling, inference, and evaluation.
This post summarizes the fundamental concepts of diffusion models, their optimization strategies, and applications, focusing on the mathematical foundations and practical implications.
## Denoising Diffusion Probabilistic Models (DDPM)
Diffusion models are a class of generative models first proposed by Sohl-Dickstein et al. (Sohl-Dickstein et al., 2015). Inspired by nonequilibrium thermodynamics, the method systematically and gradually destroys data structure through a forward diffusion process, then learns a reverse process to restore structure and yield a highly flexible and tractable generative model.
### Forward Process
A diffusion model formulates the learned data distribution as $p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $\mathbf{x}_1, \dots, \mathbf{x}_T$ are latent variables with the same dimensionality as the real data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$.
The forward process is a Markov chain where transitions from $\mathbf{x}_{t-1}$ to $\mathbf{x}_t$ follow multivariate Gaussian distributions. The joint distribution of the latent variables ($\mathbf{x}_1, \dots, \mathbf{x}_T$) given the real data $\mathbf{x}_0$ is:

$$
q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)
\label{eq:diff_model_origin}
$$
Key properties:
- The coefficients $\{\beta_t\}_{t=1}^{T}$ are pre-defined and determine the "velocity" of diffusion.
- After sufficient diffusion steps, the final state $\mathbf{x}_T$ approaches an isotropic Gaussian distribution when $T$ is large enough.
- For clarity, we use $p_\theta$ to represent the learned distribution and $q$ to represent the real data distribution.
A notable property of this forward process, as mentioned in Ho et al. (Ho et al., 2020) (Section 2), is that $q(\mathbf{x}_t \mid \mathbf{x}_0)$ has a closed-form expression derived using the reparameterization trick. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0,\, (1 - \bar{\alpha}_t)\mathbf{I}\big), \qquad \text{i.e.} \quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\label{eq:xt_x0_relation}
$$
This relies on the fact that the sum of two uncorrelated multivariate normal distributions $\mathcal{N}(\mathbf{0}, \sigma_1^2 \mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2 \mathbf{I})$ is also a multivariate normal distribution, $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$.
Typically, we can afford larger update steps when the sample becomes noisier, so $\beta_1 < \beta_2 < \cdots < \beta_T$ and therefore $\bar{\alpha}_1 > \bar{\alpha}_2 > \cdots > \bar{\alpha}_T$.
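To make the notation concrete, here is a minimal PyTorch sketch of the forward process, assuming the linear schedule from Ho et al. ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$); the schedule values are a design choice, not fixed by the theory, and array index `t` corresponds to timestep $t+1$:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t, increasing in t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form via the reparameterization trick:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```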
### Reverse Process
Ideally, if we knew $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, we could gradually remove noise from corrupted samples to recover the original image. However, this conditional distribution is not readily available and its computation requires the entire dataset. Specifically:

$$
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)}, \qquad q(\mathbf{x}_t) = \int q(\mathbf{x}_t \mid \mathbf{x}_0)\, q(\mathbf{x}_0)\, d\mathbf{x}_0
\label{eq:imprac_cond_prob_expr}
$$
Computing $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ requires evaluating the integrals in Eq. (\ref{eq:imprac_cond_prob_expr}) over the whole data distribution, which is computationally intractable. Instead, we use the diffusion model to learn $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ and approximate the true conditional distribution. When $\beta_t$ is sufficiently small, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is also Gaussian.
The joint distribution of the diffusion model is:

$$
p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)
$$

where $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$.
With this distribution, we can sample $\mathbf{x}_T$ from an isotropic Gaussian distribution and expect the reverse process to gradually transform it into samples that follow $q(\mathbf{x}_0)$.
### Loss Function
The loss function for training the diffusion model is the standard variational bound on the negative log likelihood:

$$
-\log p_\theta(\mathbf{x}_0) \le -\log p_\theta(\mathbf{x}_0) + D_{KL}\big(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\big)
$$

Therefore, the expected negative log likelihood is upper bounded by the variational bound:

$$
\mathbb{E}_q\big[-\log p_\theta(\mathbf{x}_0)\big] \le \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] =: L
$$
To make each term analytically computable, the objective can be further rewritten as a combination of several KL-divergence and entropy terms (see the detailed step-by-step derivation in Appendix B of Sohl-Dickstein et al. (Sohl-Dickstein et al., 2015)):

$$
L = \mathbb{E}_q\left[ D_{KL}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big) + \sum_{t=2}^{T} D_{KL}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) \right]
$$
In summary, we can split the variational lower bound into components and label them as follows:

$$
\begin{aligned}
L &= L_T + \sum_{t=2}^{T} L_{t-1} + L_0 \\
L_T &= D_{KL}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big) \\
L_{t-1} &= D_{KL}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big) \\
L_0 &= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)
\end{aligned}
$$
In the above decomposition:
- $L_T$ can be computed from Eq. (\ref{eq:xt_x0_relation})
- $L_0, L_1, \dots, L_{T-1}$ are parameterized and learned
Next, we show that $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ can be computed in closed form even though $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ can't. By Bayes' rule and the Markov property:

$$
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)\, \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} \propto \exp\left( -\frac{1}{2}\left( \frac{\|\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1}\|^2}{\beta_t} + \frac{\|\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0\|^2}{1 - \bar{\alpha}_{t-1}} - \frac{\|\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0\|^2}{1 - \bar{\alpha}_t} \right) \right)
$$
From the above derivation, we observe that the conditional distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is also Gaussian and can be written in standard multivariate normal form as $\mathcal{N}\big(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big)$. Completing the square in the exponent above, $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$ can be computed as follows:

$$
\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \mathbf{x}_0, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
\label{eq:standard_form_mu_t}
$$
Recalling the relation between $\mathbf{x}_t$ and $\mathbf{x}_0$ deduced from Eq. (\ref{eq:xt_x0_relation}), namely $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_t\big)$, Eq. (\ref{eq:standard_form_mu_t}) can be further rewritten as follows:

$$
\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \boldsymbol{\epsilon}_t \right)
$$
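As a sanity check on these formulas, here is a small sketch (continuing the schedule variables defined earlier) that computes the posterior parameters $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$; the function name and the integer-timestep convention are ours:

```python
def q_posterior_params(x0: torch.Tensor, xt: torch.Tensor, t: int):
    """tilde_mu_t and tilde_beta_t of q(x_{t-1} | x_t, x_0) for an integer
    timestep t in {1, ..., T}; arrays are 0-indexed, so beta_t sits at t - 1."""
    beta_t = betas[t - 1]
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else torch.tensor(1.0)  # alpha_bar_0 := 1
    tilde_beta = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    tilde_mu = ((1.0 - beta_t).sqrt() * (1.0 - ab_prev) * xt     # sqrt(alpha_t) x_t term
                + ab_prev.sqrt() * beta_t * x0) / (1.0 - ab_t)   # sqrt(ab_{t-1}) x_0 term
    return tilde_mu, tilde_beta
```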
### Parameterization of the reverse diffusion process $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\Sigma}_\theta$
Recall our previous decomposition of the variational bound: we have closed-form expressions for the real-data conditionals (i.e. $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$), but we still need a parameterization of $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. As discussed previously, when $\beta_t$ is small enough, we can approximate $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ by the Gaussian distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$. We expect the training process to drive $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ to predict $\tilde{\boldsymbol{\mu}}_t$. With this parameterization, each component $L_{t-1}$ of the loss function is the KL divergence between two multivariate Gaussian distributions and has a relatively simple closed form. The loss term becomes:

$$
L_{t-1} = \mathbb{E}_q\Big[ D_{KL}\big( \mathcal{N}(\tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t \mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \big) \Big]
\label{eq:lt_before_simple_sigma}
$$
In practice, we can further simplify the loss function Eq. (\ref{eq:lt_before_simple_sigma}) by predefining the variance matrix as $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$; experimentally, $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$ had similar results. Therefore, we can write:

$$
L_{t-1} = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \big\| \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \big\|^2 \right] + C
\label{eq:lt_mu_t_not_param}
$$

where $C$ is a constant that does not depend on $\theta$.
Furthermore, $\boldsymbol{\mu}_\theta$ is parameterized in Section 3.2 of Ho et al. (Ho et al., 2020) to correspond with the form of $\tilde{\boldsymbol{\mu}}_t$ in Eq. (\ref{eq:standard_form_mu_t}) as follows:

$$
\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)
$$
In this parameterization, the neural network approximates the noise $\boldsymbol{\epsilon}$ instead of $\tilde{\boldsymbol{\mu}}_t$ directly. This further simplifies $L_{t-1}$ in Eq. (\ref{eq:lt_mu_t_not_param}) into:

$$
L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[ \frac{\beta_t^2}{2\sigma_t^2\, \alpha_t\, (1 - \bar{\alpha}_t)} \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\, t\big) \big\|^2 \right]
$$
At this point, every term in $L$ can be computed in explicit closed form and is ready for training. However, empirically, Ho et al. (Ho et al., 2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term:

$$
L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[ \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\, t\big) \big\|^2 \right]
$$
This simplification enables us to train on arbitrary time steps $t$ for each sample $\mathbf{x}_0$, instead of computing the entire series $L_0, \dots, L_T$ as in $L$. The full DDPM procedure follows Algorithms 1 and 2 of Ho et al. (Ho et al., 2020): repeatedly sample $\mathbf{x}_0$, $t$, and $\boldsymbol{\epsilon}$ and take a gradient step on $L_{\text{simple}}$; then generate by starting from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iterating $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.
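The following is a minimal PyTorch sketch of this training/sampling loop, continuing the schedule variables and `q_sample` defined earlier. Here `eps_model` is a stand-in for the noise-prediction network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ (a U-Net in Ho et al.); the function names and optimizer choice are ours:

```python
def train_step(eps_model, x0, optimizer):
    """One gradient step on L_simple: draw t uniformly at random, corrupt x0
    via q_sample, and regress the injected noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    loss = ((eps - eps_model(q_sample(x0, t, eps), t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def p_sample_loop(eps_model, shape):
    """Ancestral sampling with sigma_t^2 = beta_t: start from x_T ~ N(0, I)
    and apply the learned reverse kernel T times."""
    x = torch.randn(shape)
    for t in reversed(range(T)):  # array index t corresponds to timestep t + 1
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = eps_model(x, torch.full((shape[0],), t, dtype=torch.long))
        # x_{t-1} = (x_t - beta_t / sqrt(1 - ab_t) * eps_hat) / sqrt(alpha_t) + sigma_t z
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) \
            / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```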
Side note: the timestep $t$ is also an input to the neural network, and it is typically encoded into a vector. For instance, in DDPM, integer timesteps are encoded into float vectors through sinusoidal positional encodings.
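A minimal sketch of such an encoding is below, following the common transformer-style convention (geometrically spaced frequencies with base 10000); the exact constants vary between implementations:

```python
import math

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map a batch of integer timesteps to dim-dimensional float vectors
    using sinusoids of geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]       # (batch, dim // 2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```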
## Denoising Diffusion Implicit Models (DDIM)
Though diffusion models like DDPM have demonstrated the ability to produce high-quality samples comparable to state-of-the-art generative models such as GANs, the computational cost of the sampling process is a critical drawback. These diffusion-based models typically require many iterations to produce a high-quality sample, whereas models like GANs need only one forward pass. A quantitative experiment in Song et al. (Song et al., 2020) shows that, with the same GPU setup and similar neural network complexity, it takes around 20 hours to sample 50k images of size 32×32 from a DDPM, but less than a minute to do so from a GAN. To resolve this high computational cost without losing too much generation quality, Song et al. (Song et al., 2020) proposed Denoising Diffusion Implicit Models (DDIM). The algorithm is based on two observations/intuitions:
- The derivation of the loss function depends only on the marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$, and the sampling process depends only on $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. More specifically, the loss function keeps the same form as long as the relation in Eq. (\ref{eq:xt_x0_relation}) still holds.
- A DDPM trained on the full timestep sequence $\{1, \dots, T\}$ has, in fact, included the "knowledge" needed to train a DDPM on any subsequence of timesteps. This can be observed naturally from the training process with the simplified loss function. It gives us the intuition that we can use only a subset of the timesteps during the sampling process and thereby reduce the computational cost.
Based on the first observation, we can build different conditional distributions $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ that induce the same marginal distribution $q(\mathbf{x}_t \mid \mathbf{x}_0)$. The same marginal distribution results in the same loss function, while different choices of conditional distribution result in different sampling procedures. In fact, without the Markovian constraint on $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ as in Eq. (\ref{eq:diff_model_origin}), we have a broader choice (i.e. a larger solution space) of $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$.
Based on the second observation, the sampling process can use only a subset of the steps used in the training process. By reducing the number of update steps, sampling can be greatly sped up.
### Non-Markovian Forward Processes
The key observation here is that the DDPM loss function depends only on the marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$, not directly on the joint distribution $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$ or the transition distribution $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. Following the derivation on Spaces.Ac.cn, we can use the method of undetermined coefficients to compute the form of $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, which we also assume to be a normal distribution. We first summarize the conditions as follows:
- To maintain the same loss function, we need the same marginal distribution $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0,\, (1 - \bar{\alpha}_t)\mathbf{I}\big)$. The corresponding sampling formula is $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
- Assume $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(k_t \mathbf{x}_t + \lambda_t \mathbf{x}_0,\, \sigma_t^2 \mathbf{I}\big)$, where $k_t$ and $\lambda_t$ are coefficients to be determined. The corresponding sampling process is $\mathbf{x}_{t-1} = k_t \mathbf{x}_t + \lambda_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$.
By combining the marginal distribution of $\mathbf{x}_t$ with the assumed form of $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, we can compute the marginal distribution of $\mathbf{x}_{t-1}$ as follows:

$$
\begin{aligned}
\mathbf{x}_{t-1} &= k_t \mathbf{x}_t + \lambda_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}_1 \\
&= k_t \big( \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_2 \big) + \lambda_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}_1 \\
&= \big( k_t \sqrt{\bar{\alpha}_t} + \lambda_t \big)\, \mathbf{x}_0 + \sqrt{k_t^2 (1 - \bar{\alpha}_t) + \sigma_t^2}\; \boldsymbol{\epsilon}
\end{aligned}
\label{eq:sample_with_undeter_coefs}
$$

where $\boldsymbol{\epsilon}_1, \boldsymbol{\epsilon}_2, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and the last step merges the two independent Gaussians.
Comparing Eq. (\ref{eq:sample_with_undeter_coefs}) with Eq. (\ref{eq:xt_x0_relation}) and requiring the marginal distributions to match, i.e. $\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, \boldsymbol{\epsilon}$, we have the following relations:

$$
k_t \sqrt{\bar{\alpha}_t} + \lambda_t = \sqrt{\bar{\alpha}_{t-1}}, \qquad k_t^2 (1 - \bar{\alpha}_t) + \sigma_t^2 = 1 - \bar{\alpha}_{t-1}
$$
There are three unknowns ($k_t$, $\lambda_t$, $\sigma_t$) but only two equations; therefore, we can treat $\sigma_t$ as a free variable and solve for the other two:

$$
k_t = \sqrt{\frac{1 - \bar{\alpha}_{t-1} - \sigma_t^2}{1 - \bar{\alpha}_t}}, \qquad \lambda_t = \sqrt{\bar{\alpha}_{t-1}} - \sqrt{\bar{\alpha}_t}\, \sqrt{\frac{1 - \bar{\alpha}_{t-1} - \sigma_t^2}{1 - \bar{\alpha}_t}}
$$
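For readers who want to double-check the algebra, here is a quick symbolic verification with SymPy; the symbol names (`ab_t` for $\bar{\alpha}_t$, `ab_prev` for $\bar{\alpha}_{t-1}$) are ours:

```python
import sympy as sp

ab_t, ab_prev, sigma = sp.symbols('ab_t ab_prev sigma', positive=True)
k, lam = sp.symbols('k lam')
sols = sp.solve(
    [sp.Eq(k * sp.sqrt(ab_t) + lam, sp.sqrt(ab_prev)),       # match the means
     sp.Eq(k**2 * (1 - ab_t) + sigma**2, 1 - ab_prev)],      # match the variances
    [k, lam], dict=True)
for s in sols:
    print(sp.simplify(s[k]), sp.simplify(s[lam]))
# Two roots, k = ±sqrt((1 - ab_prev - sigma**2) / (1 - ab_t)); the positive
# root reproduces the expressions for k_t and lambda_t above.
```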
Therefore, we obtain a family of inference distributions indexed by $\sigma_t$:

$$
q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left( \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}},\; \sigma_t^2 \mathbf{I} \right)
$$
As a result, in the sampling procedure, the update formula is:

$$
\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}_{\text{direction pointing to } \mathbf{x}_t} + \sigma_t \boldsymbol{\epsilon}
$$
Compared with DDPM, this is a generalized form of the generative process. Since the marginal distribution remains the same, the loss function does not change and the training process is identical. This means we can use this new generative process with a diffusion model trained the DDPM way and, with different levels of $\sigma_t$, generate different images from the same initial noise. Among the different choices, $\sigma_t = 0$ is a special case in which the generation process is deterministic given the initial noise. This model is called a denoising diffusion implicit model because it is an implicit probabilistic model trained with the DDPM objective.
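A sketch of one generalized update step is below, continuing the earlier variables and reusing Song et al.'s parameterization of the free variable via an interpolation coefficient $\eta$ ($\eta = 0$ gives the deterministic DDIM update, $\eta = 1$ a DDPM-like stochastic step); the function name and index conventions are ours:

```python
@torch.no_grad()
def ddim_step(eps_model, x, t: int, t_prev: int, eta: float = 0.0):
    """One generalized reverse update from x_t to x_{t_prev} (1-based
    timesteps, with alpha_bar_0 := 1)."""
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t_prev - 1] if t_prev > 0 else torch.tensor(1.0)
    eps_hat = eps_model(x, torch.full((x.shape[0],), t - 1, dtype=torch.long))
    # sigma_t = eta * sqrt((1 - ab_prev) / (1 - ab_t)) * sqrt(1 - ab_t / ab_prev)
    sigma = eta * ((1 - ab_prev) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_prev).sqrt()
    x0_pred = (x - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()   # predicted x_0
    return (ab_prev.sqrt() * x0_pred
            + (1 - ab_prev - sigma**2).sqrt() * eps_hat          # direction to x_t
            + sigma * torch.randn_like(x))                       # random noise
```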
### Accelerated Generation Processes
We should point out that, in our previous discussion, we never actually started from $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$; the sequence of marginals, i.e. $\{\bar{\alpha}_t\}$, determined the model. The key observation here is that the training process of DDPM, in its essence, contains the training over any subsequence $\tau = (\tau_1, \dots, \tau_S) \subseteq (1, \dots, T)$. This can be observed from the loss function: the training objective over the parameter subset $\{\bar{\alpha}_{\tau_i}\}$ is

$$
L_{\text{simple}}^{\tau} = \mathbb{E}_{i, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[ \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_{\tau_i}}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{\tau_i}}\, \boldsymbol{\epsilon},\, \tau_i\big) \big\|^2 \right]
$$
Therefore, a DDPM trained on $\{1, \dots, T\}$ has already incorporated the information needed to train a DDPM on any subsequence $\tau$. When the size of $\tau$ is much smaller than $T$, generating samples with the smaller parameter set is much faster.
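Putting the two pieces together, accelerated sampling just walks the reverse updates along a subsequence $\tau$, reusing `ddim_step` from above. The even spacing and step count here are illustrative choices (other spacings, e.g. quadratic, are also used in practice):

```python
@torch.no_grad()
def ddim_sample(eps_model, shape, n_steps: int = 50, eta: float = 0.0):
    """Accelerated sampling over an evenly spaced subsequence tau of {1, ..., T}."""
    taus = torch.linspace(1, T, n_steps).round().long().tolist()  # tau_1 < ... < tau_S
    x = torch.randn(shape)
    # Walk tau backwards: (tau_S -> tau_{S-1}), ..., (tau_1 -> 0).
    for t, t_prev in zip(reversed(taus), reversed([0] + taus[:-1])):
        x = ddim_step(eps_model, x, t, t_prev, eta)
    return x
```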
### Remarks on DDIM
- Why don't we just directly train on the subsequence $\tau$ and sample from that model?\ There might be two considerations for training on $T$ steps but sampling in fewer steps. Firstly, a diffusion model trained on the more fine-grained setup might generalize better. Secondly, using a subsequence to speed up sampling is only one acceleration method; there might be other acceleration methods enabled by this more sophisticated model.
- Can we just use DDPM and sample with a subset of the timesteps? What is the purpose of choosing this new family of conditional distributions?\ For the purpose of accelerating sample generation, one can certainly use DDPM and skip steps during generation. However, the newly proposed distribution family is clearly more flexible and has the potential to generate more diverse samples without any additional cost over DDPM. As a matter of fact, letting $\sigma_t^2 = \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$, DDIM's generative process is equivalent to DDPM's sampling process.
- Are there additional benefits to DDIM?\ DDIM has a "consistency" property since its generative process is deterministic: multiple samples conditioned on the same latent variable should share similar high-level features. Due to this consistency, DDIM can do semantically meaningful interpolation in the latent variable.
## Result comparison between DDPM and DDIM
Experiments in the DDPM and DDIM papers have quantitatively and qualitatively examined the generated images. Here we review two aspects: sampling quality and interpolation results.
### Sampling quality
In DDIM's experiments, the authors compared sample quality across different numbers of diffusion steps and different noise levels $\eta$. The empirical result is that the lower the noise level $\eta$, the better the image quality generated with the accelerated diffusion process.
### Interpolation
Both the DDIM and DDPM papers examined performance on image interpolation. They use the forward process as a stochastic encoder to generate latent embeddings $\mathbf{x}_T$ and $\mathbf{x}_T'$, then decode the interpolated latent $\bar{\mathbf{x}}_T^{(\lambda)}$, where $\lambda$ is the interpolation parameter.
- In DDPM, the authors simply use linear interpolation, i.e. $\bar{\mathbf{x}}_T^{(\lambda)} = (1 - \lambda)\, \mathbf{x}_T + \lambda\, \mathbf{x}_T'$.
- In DDIM, the authors use spherical linear interpolation,\
  $\bar{\mathbf{x}}_T^{(\lambda)} = \dfrac{\sin\big((1 - \lambda)\theta\big)}{\sin\theta}\, \mathbf{x}_T + \dfrac{\sin(\lambda\theta)}{\sin\theta}\, \mathbf{x}_T'$\
  where $\theta = \arccos\left( \dfrac{\mathbf{x}_T^{\top} \mathbf{x}_T'}{\|\mathbf{x}_T\|\, \|\mathbf{x}_T'\|} \right)$.
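A small sketch of the spherical interpolation formula, in the same PyTorch setting as before; pairing it with the deterministic ($\eta = 0$) sampler above reproduces the DDIM-style interpolation experiment:

```python
def slerp(x1: torch.Tensor, x2: torch.Tensor, lam: float) -> torch.Tensor:
    """Spherical linear interpolation between two latents x_T and x_T'."""
    theta = torch.arccos(torch.dot(x1.flatten(), x2.flatten())
                         / (x1.norm() * x2.norm()))
    return (torch.sin((1 - lam) * theta) * x1
            + torch.sin(lam * theta) * x2) / torch.sin(theta)

# e.g. decode the midpoint of two noise latents deterministically:
# x_mid = slerp(xT_a, xT_b, 0.5), then run the eta = 0 reverse process on x_mid.
```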
## Interesting Reading
- Lilian Weng's post on diffusion models
- Spaces.Ac.cn's post on DDIM