Denoising Diffusion Probabilistic Models

Basics of Probability

Conditional Probability

\begin{aligned} P(A, B, C) & = P(C|A, B)P(A, B) = P(C|A, B)P(B|A)P(A) \\ P(B, C|A) & = \frac{P(A, B, C)}{P(A)} = P(C|A, B)P(B|A) \\ \end{aligned}

Markov Chain

In a Markov chain, the probability of the current state depends only on the immediately preceding state. For example, given the Markov relationship $A \rightarrow B \rightarrow C$, we have

\begin{aligned} P(A, B, C) & = P(C|B)P(B|A)P(A) \\ P(B, C|A) & = P(C|B)P(B|A) \\ \end{aligned}
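As a quick numeric illustration (a minimal sketch with made-up transition probabilities, not from the original post), the Markov factorization $P(A, B, C) = P(C|B)P(B|A)P(A)$ can be checked directly:

# Hypothetical discrete chain A -> B -> C with made-up probabilities
P_A = {0: 0.6, 1: 0.4}                                    # P(A)
P_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P(B|A)
P_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # P(C|B), independent of A

# Joint distribution via the Markov factorization P(A, B, C) = P(C|B) P(B|A) P(A)
joint = {(a, b, c): P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}
assert abs(sum(joint.values()) - 1.0) < 1e-12              # it sums to 1

# P(B, C | A) = P(A, B, C) / P(A) equals P(C|B) P(B|A)
a, b, c = 1, 0, 1
assert abs(joint[(a, b, c)] / P_A[a] - P_C_given_B[b][c] * P_B_given_A[a][b]) < 1e-12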

Reparameterization

For a neural network, directly sampling from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ is not differentiable with respect to $\mu$ and $\sigma$. Instead, we can first sample $\epsilon$ from the standard normal distribution and then compute $\sigma \cdot \epsilon + \mu$. The randomness is thus moved into $\epsilon$, while $\mu$ and $\sigma$ enter as part of a differentiable affine transformation that the network can learn.
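A minimal PyTorch sketch of the trick (variable names and shapes are assumptions for illustration): the randomness lives in $\epsilon$, so gradients can flow through $\mu$ and $\sigma$.

import torch

mu = torch.randn(4, requires_grad=True)          # predicted mean, e.g. a network output
log_sigma = torch.randn(4, requires_grad=True)   # predicting log-std keeps sigma positive
sigma = log_sigma.exp()

eps = torch.randn_like(mu)      # sample from N(0, I); no gradient needed here
z = mu + sigma * eps            # reparameterized sample from N(mu, sigma^2)

z.sum().backward()              # gradients reach mu and log_sigma through the affine map
print(mu.grad, log_sigma.grad)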

Forward Diffusion Process

Fig.1 Diffusion Probabilistic Model

Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$:

\begin{equation} q(\textbf{x}_t \vert \textbf{x}_{t-1}) := \mathcal{N}(\textbf{x}_t; \sqrt{1 - \beta _t} \textbf{x}_{t-1}, \beta _t\textbf{I}) \quad q(\textbf{x}_{1:T} \vert \textbf{x}_0) := \prod^T_{t=1} q(\textbf{x}_t \vert \textbf{x}_{t-1}) \end{equation}

The data sample $\textbf{x}_0$ gradually loses its distinguishable features as the step $t$ becomes larger. Eventually when $T \to \infty$, $\textbf{x}_T$ is equivalent to an isotropic Gaussian distribution.

Let $\alpha_t = 1 - \beta _t$ and $\bar{\alpha}_t = \prod _{i=1}^t \alpha_i$, then

\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ ;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{ ;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\ q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}) \end{aligned}

(*) Recall that when we merge two Gaussians with different variances, $\mathcal{N}(\mathbf{0}, \sigma _1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma _2^2\mathbf{I})$, the new distribution is $\mathcal{N}(\mathbf{0}, (\sigma _1^2 + \sigma _2^2)\mathbf{I})$. Here the merged standard deviation is $\sqrt{(1 - \alpha _t) + \alpha _t (1-\alpha _{t-1})} = \sqrt{1 - \alpha _t\alpha _{t-1}}$.

Usually, we can afford a larger update step when the sample gets noisier, so $\beta _1 < \beta _2 < \dots < \beta _T$ and therefore $\bar{\alpha}_1 > \dots > \bar{\alpha}_T$.
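The closed form $q(\mathbf{x}_t \vert \mathbf{x}_0)$ means we can jump to any noise level in a single step, which is what makes training efficient. A minimal sketch (the linear schedule values here are assumptions, matching common DDPM settings):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # increasing variance schedule (assumed values)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise=None):
    # x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)
    if noise is None:
        noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

x0 = torch.randn(16, 3, 32, 32)           # pretend data batch
xt = q_sample(x0, t=500)                  # noisy sample at an intermediate step

print(alpha_bars[-1])                     # close to 0, so x_T is nearly pure Gaussian noise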

Reverse Diffusion Process

Fig.2 Diffusion Process

If we can reverse the above process and sample from $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ , we will be able to recreate the true sample from a Gaussian noise input, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ .

Note that if $\beta _t$ is small enough, the reverse conditional $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ will also be Gaussian.

Unfortunately, we cannot easily estimate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ because it would require using the entire dataset; therefore we learn a model $p_{\theta}$ to approximate these conditional probabilities in order to run the reverse diffusion process.

\begin{equation} p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \end{equation}

It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$:

\begin{equation} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; {\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), {\tilde{\beta}_t} \mathbf{I}) \end{equation}

Recall that the density of $\mathcal{N}(\mu, \sigma^2)$ is $\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Using Bayes' rule, we have

\begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= \frac{ q(\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{x}_0) }{ q(\mathbf{x}_{t}, \mathbf{x}_0) } = q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1}, \mathbf{x}_0) }{ q(\mathbf{x}_{t}, \mathbf{x}_0) } \\ &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } = q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1} + \alpha_t \mathbf{x}_{t-1}^2}{\beta_t} + \frac{\mathbf{x}_{t-1}^2 - 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 \mathbf{x}_{t-1} + \bar{\alpha}_{t-1} \mathbf{x}_0^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \mathbf{x}_{t-1}^2 - (\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big) \end{aligned}

where $C(\mathbf{x}_t, \mathbf{x}_0)$ is some function not involving $\mathbf{x}_{t-1}$ and details are omitted. Thus, following the standard Gaussian density function, the mean and variance can be parameterized as follows

\begin{aligned} \tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = 1/(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}) = {\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ \tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\ &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) {\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0\\ \end{aligned}
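A quick numeric sanity check of the simplification (reusing the assumed linear schedule from the sketch above): the closed form of $\tilde{\beta}_t$ agrees with the reciprocal of the $\mathbf{x}_{t-1}^2$ coefficient.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)   # assumed schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

t = 500
alpha_bar_prev = alpha_bars[t - 1]

# Reciprocal of the x_{t-1}^2 coefficient obtained by completing the square
beta_tilde_raw = 1.0 / (alphas[t] / betas[t] + 1.0 / (1.0 - alpha_bar_prev))
# Simplified closed form
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]

assert torch.isclose(beta_tilde_raw, beta_tilde)
print(beta_tilde.item())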

From the forward process, we can represent $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t)$ and plug it into the above equation to obtain

\begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t) \\ &= {\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)} \end{aligned}

At sampling time $\boldsymbol{\epsilon}_t$ is unknown, so we train a network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict it. Thus,

\begin{aligned} \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= {\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big)} \\ \mathbf{x}_{t-1} &\sim \mathcal{N}(\mathbf{x}_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \end{aligned}

PyTorch Implementation

Fig.3 The training and sampling algorithms in DDPM

Notice

  • nan and inf may appear if the numerical precision is not enough, so the schedule constants below are computed in float64
  • the timestep $t$ participates through embedding lookups: the schedule coefficients are gathered with F.embedding, and $t$ is also passed to the model (a minimal time-embedding sketch follows this list)
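The post does not show how the model itself embeds $t$; as an illustrative assumption, a standard sinusoidal timestep embedding (common in DDPM-style networks) looks like this:

import math
import torch

def timestep_embedding(t, dim):
    # Sinusoidal embedding of integer timesteps; a common (assumed) choice for DDPM models
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # shape (batch, dim)

t = torch.randint(0, 1000, (8,))
emb = timestep_embedding(t, 128)    # typically passed through an MLP and added to feature maps
print(emb.shape)                    # torch.Size([8, 128])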
from torch import nn
import torch
from torch.nn import functional as F


class DDPM(nn.Module):
    def __init__(self, model, T, beta_1, beta_T):
        super(DDPM, self).__init__()
        self.model = model          # noise-prediction network epsilon_theta(x_t, t)
        self.T = T
        self.beta_1 = beta_1
        self.beta_T = beta_T
        self._init_constant()

    def _init_constant(self):
        # compute in float64 to avoid nan/inf from accumulated rounding, then cast back
        betas = torch.linspace(self.beta_1, self.beta_T, self.T, dtype=torch.float64)
        alphas = 1 - betas
        cumprod_alphas = torch.cumprod(alphas, dim=0)
        prev_cumprod_alphas = F.pad(cumprod_alphas[:-1], (1, 0), value=1.)

        # q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)
        self.forward_coef1 = torch.sqrt(cumprod_alphas).float()
        self.forward_coef2 = torch.sqrt(1 - cumprod_alphas).float()
        # posterior std sqrt(beta_tilde_t) and the coefficients of mu_theta(x_t, t)
        self.reverse_std = torch.sqrt(betas * (1. - prev_cumprod_alphas) / (1. - cumprod_alphas)).float()
        self.reverse_mean_coef1 = (1 / torch.sqrt(alphas)).float()
        self.reverse_mean_coef2 = (betas / (torch.sqrt(alphas) * torch.sqrt(1 - cumprod_alphas))).float()

    def forward(self, x0):
        b = x0.shape[0]
        t = torch.randint(0, self.T, size=(b, ))
        noise = torch.randn_like(x0)         # Gaussian noise (randn, not rand)
        # look up per-sample coefficients by timestep and broadcast over (C, H, W)
        coef1 = F.embedding(t, self.forward_coef1.unsqueeze(-1)).unsqueeze(-1).unsqueeze(-1)
        coef2 = F.embedding(t, self.forward_coef2.unsqueeze(-1)).unsqueeze(-1).unsqueeze(-1)

        xt = coef1 * x0 + coef2 * noise      # x_t ~ q(x_t | x_0) in closed form
        denoise = self.model(xt, t)          # predict the noise that was added

        return noise, denoise                # train with e.g. MSE(denoise, noise)

    def sample(self, xT):
        xt = xT
        b = xT.shape[0]

        for step in reversed(range(self.T)):
            t = torch.ones(size=(b, ), dtype=torch.long) * step
            mean_coef1 = F.embedding(t, self.reverse_mean_coef1.unsqueeze(-1)).unsqueeze(-1).unsqueeze(-1)
            mean_coef2 = F.embedding(t, self.reverse_mean_coef2.unsqueeze(-1)).unsqueeze(-1).unsqueeze(-1)
            std = F.embedding(t, self.reverse_std.unsqueeze(-1)).unsqueeze(-1).unsqueeze(-1)
            denoise = self.model(xt, t)
            # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
            mean = mean_coef1 * xt - mean_coef2 * denoise
            if step > 0:
                noise = torch.randn_like(xt)
            else:
                noise = 0                    # no noise is added at the final step
            xt = mean + std * noise

        x0 = torch.clip(xt, min=-1, max=1)   # clamp samples to the data range [-1, 1]

        return x0
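A minimal usage sketch: the tiny placeholder network, optimizer settings, and fake data below are assumptions for illustration, not the setup used in the original post.

import torch
from torch import nn

class TinyEps(nn.Module):
    # Placeholder noise-prediction network; a real DDPM uses a time-conditioned U-Net.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))
        self.t_embed = nn.Embedding(1000, 3)

    def forward(self, x, t):
        # inject the timestep as a per-channel bias (toy time embedding)
        return self.net(x + self.t_embed(t)[:, :, None, None])

ddpm = DDPM(TinyEps(), T=1000, beta_1=1e-4, beta_T=0.02)
opt = torch.optim.Adam(ddpm.parameters(), lr=2e-4)

x0 = torch.rand(8, 3, 32, 32) * 2 - 1            # fake batch scaled to [-1, 1]
noise, denoise = ddpm(x0)                        # forward diffusion + noise prediction
loss = nn.functional.mse_loss(denoise, noise)    # simple DDPM training objective
opt.zero_grad()
loss.backward()
opt.step()

samples = ddpm.sample(torch.randn(4, 3, 32, 32)) # reverse diffusion from pure noise
print(samples.shape)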

Reference


Ho, J., Jain, A., & Abbeel, P. Denoising Diffusion Probabilistic Models. NeurIPS 2020.