Towards understanding Diffusion Models (on Graphs)

Solveig Klepper
(November 2023)
Abstract

Diffusion models have emerged from various theoretical and methodological perspectives, each offering unique insights into their underlying principles. In this work, we provide an overview of the most prominent approaches, drawing attention to their striking analogies – namely, how seemingly diverse methodologies converge to a similar mathematical formulation of the core problem. While our ultimate goal is to understand these models in the context of graphs, we begin by conducting experiments in a simpler setting to build foundational insights. Through an empirical investigation of different diffusion and sampling techniques, we explore three critical questions: (1) What role does noise play in these models? (2) How significantly does the choice of the sampling method affect outcomes? (3) What function is the neural network approximating, and is high complexity necessary for optimal performance? Our findings aim to enhance the understanding of diffusion models and in the long run their application in graph machine learning.

1 Continuous Diffusion Models

Figure 1: General idea of denoising diffusion models. The forward process is modelled by a Markov process. The reverse process is unknown and needs to be approximated; this is usually done with a neural network.

In physics, diffusion captures the overall movement of particles, such as atoms, from areas of higher concentration to those of lower concentration. Consider the analogy of dropping a small amount of paint into a glass of water. Initially, the paint is concentrated in one location, but over time, it diffuses throughout the water until it reaches a state of equilibrium. The intriguing question arises: Can we reverse this diffusion process? Unfortunately, such a reversal proves impossible in most cases.

Despite the impossibility of reversing diffusion, a field of study known as diffusion models exists. These models aim to capture the dynamics of this diffusion phenomenon and are based on the idea of approximately undoing this process. Empirically, they achieve surprisingly good results when sampling new data points with similar properties.

From a practical point of view, diffusion models are generative models that aim to create new samples from an unknown and often complex underlying distribution. Usually, the only information about the target distribution is a set of training data points drawn from it. Directly approximating this distribution is challenging, so diffusion models decompose the problem into incremental steps. Because the diffusion is incremental, the model learns to predict a distribution not only for the clean training data but also for a whole family of distributions obtained by gradually adding noise to the training data. The model can thus refine its output step by step, which results in high-quality samples. In this context, each data point $x_t$ progressively loses its distinguishable features as the time step $t$ increases. As the number of diffusion steps approaches infinity ($T \to \infty$), the terminal state $x_T$ converges to an isotropic Gaussian distribution, indicating that the system has reached a state of equilibrium.

1.1 Diffusion Models

In the past few years, various generative models using the concept of diffusion have been introduced. Different methodologies end up with more or less the same mathematical formulation of the underlying problem.

1.1.1 Langevin Dynamics

Inspired by the principles of a molecule diffusing in a liquid, the Langevin equation mathematically captures the diffusion process. The key parameters are the particle mass $m$, the damping coefficient $\lambda$, the velocity $v$, and a noise term $\eta$ representing collisions with surrounding molecules.

$$m\frac{dv}{dt} = -\lambda v + \eta(t) \tag{1}$$

In the context of diffusion models, we describe the forward process similarly.

$$\frac{d\mathbf{x}(t)}{dt} = \mathbf{x}(t) + g(t)\,\mathbf{w}(t) \tag{2}$$

The term $\mathbf{x}(t)$ represents the externally introduced change in the data point and is usually kept as the identity. The data point undergoes dispersion that is scaled by $g(t)$ and described by the noise term $\mathbf{w}(t)$. This forward process is commonly represented as a Markov chain, with noise added at each time step according to a variance schedule $(\beta_1, \dots, \beta_T)$.

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \quad\text{with}\quad q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right). \tag{3}$$
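The forward chain in Equation 3 is straightforward to simulate. The following is a minimal NumPy sketch, not the implementation used in the experiments: the linear schedule endpoints (`1e-4` to `0.02`), the number of steps (`T = 1000`), the toy data, and all function names are illustrative assumptions.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (endpoint values are assumed)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Run the whole Markov chain and return the terminal state x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = linear_beta_schedule(T=1000)
# Toy "dataset": 5000 two-dimensional points concentrated around (3, -2).
x0 = np.array([3.0, -2.0]) + 0.1 * rng.standard_normal((5000, 2))
xT = forward_chain(x0, betas, rng)
print("mean of x_T:", xT.mean(axis=0))
print("std of x_T: ", xT.std(axis=0))
```

With these assumed settings, the empirical mean and standard deviation of `xT` should end up close to 0 and 1 in each coordinate, in line with the isotropic Gaussian equilibrium described above.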

Given the noisy state, we want the model to return the most probable clean input image. So, for the backward process, we train a model to optimize the variational bound on the log-likelihood, written here as an upper bound on the negative log-likelihood:

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \leq \mathbb{E}\left[-\log p(x_T) - \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right] \tag{4}$$

The detailed derivations can be found in Ho et al. (2020) and Sohl-Dickstein et al. (2015).

For the reverse process, the conditional probability $p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$ is modelled as a normal distribution, and a neural network is optimized to predict $\mu_\theta$ and $\Sigma_\theta$.

Figure 2: Reparametrization in sampling. The model does not predict the previous data point but the noise in relation to the clean image. The predicted noise and the diffusion process are used to interpolate between the clean image $x_0$ and the input $x_t$ to sample $x_{t-1}$ with the desired step size. $a_t$ and $b_t$ are functions of $t$ that encode the step size and manage the interpolation between the clean and the noisy image.

Although $\Sigma_\theta$ and the variance schedule $\beta_t$ can be learned, Ho et al. (2020) opt for fixing all $\beta_t$ to a linear schedule to reduce computational costs. Specifically, they set $\Sigma_\theta(x_t, t) = \beta_t\mathbf{I}$, which allows optimizing solely for $\mu_\theta$. They observe that reparametrizing $\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\prod_{i=1}^{t}(1-\beta_i)}}\,\epsilon_\theta(x_t, t)\right)$ and optimizing for $\epsilon_\theta(x_t, t)$ yields even better performance. In addition, they suggest simplifying the loss by discarding some terms, again justifying this choice with better empirical performance. They thus end up optimizing the following objective, which trains the model to predict the noise in relation to the clean image (see also Figure 2):

$$\mathbb{E}_{x_0,\epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\textstyle\prod_{i=1}^{t}(1-\beta_i)}\,x_0 + \sqrt{1-\textstyle\prod_{i=1}^{t}(1-\beta_i)}\,\epsilon,\ t\right)\right\rVert^2\right] \tag{5}$$

During training, each gradient step involves independently sampling clean data points $x_0 \sim q(x_0)$, a random time step $t \sim \text{Uniform}(\{1, \dots, T\})$, and noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
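The sampling recipe for one gradient step can be sketched as follows. This is a hedged NumPy illustration of a Monte Carlo estimate of Equation 5, not the training code used here: `zero_model` is a hypothetical placeholder for the neural network, and the schedule values, batch size, and function names are assumptions.

```python
import numpy as np

def alpha_bar(betas):
    """Cumulative products prod_{i<=t} (1 - beta_i) for t = 1..T."""
    return np.cumprod(1.0 - betas)

def noisy_sample(x0, t, eps, abar):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with one t in 1..T per sample."""
    a = abar[t - 1][:, None]                    # shape (batch, 1) for broadcasting
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def simple_loss(eps_model, x0, betas, rng):
    """Monte Carlo estimate of the noise-prediction objective on one batch."""
    batch = x0.shape[0]
    T = len(betas)
    t = rng.integers(1, T + 1, size=batch)      # t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0.shape)         # eps ~ N(0, I)
    x_t = noisy_sample(x0, t, eps, alpha_bar(betas))
    residual = eps - eps_model(x_t, t)
    return np.mean(np.sum(residual ** 2, axis=1))

def zero_model(x_t, t):
    """Hypothetical placeholder for the network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)           # assumed linear schedule
x0 = rng.standard_normal((4096, 2))             # stand-in for a training batch
loss = simple_loss(zero_model, x0, betas, rng)
print(loss)
```

For the zero predictor the loss reduces to the expected squared norm of the noise, which equals the data dimension (2 in this toy setup). Any useful model has to beat this baseline.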

Later research suggests potential improvements to the linear schedule (Nichol and Dhariwal, 2021), and our experiments also demonstrate suboptimal performance, which highlights the inefficiency of sampling with this schedule.
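To make the reparametrized mean $\mu_\theta$ from above concrete, the following sketch performs a single reverse step. It is an illustration under assumptions: the noise prediction `eps_pred` is passed in directly (in practice it would come from the trained network), the variance is fixed to $\beta_t\mathbf{I}$ as described above, and skipping the added noise at the final step `t = 1` is a convention assumed here.

```python
import numpy as np

def reverse_step(x_t, t, eps_pred, betas, rng):
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), beta_t * I) for a time step t in 1..T."""
    beta_t = betas[t - 1]
    abar_t = np.prod(1.0 - betas[:t])           # prod_{i=1}^{t} (1 - beta_i)
    mean = (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(1.0 - beta_t)
    if t == 1:
        return mean                             # final step: return the mean without extra noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)           # assumed linear schedule
x0 = rng.standard_normal((4, 2))
eps = rng.standard_normal((4, 2))
# One forward step, then one reverse step with an oracle noise prediction.
x1 = np.sqrt(1.0 - betas[0]) * x0 + np.sqrt(betas[0]) * eps
x0_rec = reverse_step(x1, 1, eps, betas, rng)
print(np.max(np.abs(x0_rec - x0)))
```

With a perfect noise prediction, the last reverse step recovers the clean point up to floating-point error, which is a useful sanity check for the indexing of the schedule.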

1.1.2 Stochastic Differential Equations

Drawing from the same conceptual framework as in Langevin Dynamics, we formalize the diffusion process as a random phenomenon unfolding over time, which can be mathematically formulated as a Stochastic Differential Equation (SDE):

$$d\mathbf{x} = \mathbf{x}(t)\,dt + g(t)\,d\mathbf{w}(t) \tag{6}$$

This equation matches the structure of the Langevin dynamics in Equation 2, underscoring their similarity.

However, unlike in the discretized perspective of the Markov chain, the reversal of a stochastic differential equation is itself a stochastic differential equation, expressed as:

$$d\mathbf{x} = \left[\mathbf{x}(t) - g(t)^2\,\nabla_{x(t)} \log p_t(x)\right]dt + g(t)\,d\mathbf{w}(t) \tag{7}$$
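Equation 7 can be integrated numerically once a score is available. The sketch below uses a plain Euler-Maruyama scheme on a one-dimensional toy problem where the score is known in closed form. The Gaussian data distribution, the constant noise scale `G`, the horizon `T`, and all names are assumptions made for illustration; a trained score model would take the place of `score`.

```python
import numpy as np

M0, S0, G, T = 2.0, 0.5, 1.0, 1.0   # data mean/std, noise scale g, time horizon (all assumed)

def marginal_mean_var(t):
    """Mean and variance of p_t when x(0) ~ N(M0, S0^2) and dx = x dt + G dw."""
    mean = np.exp(t) * M0
    var = np.exp(2 * t) * S0 ** 2 + G ** 2 * (np.exp(2 * t) - 1.0) / 2.0
    return mean, var

def score(x, t):
    """Exact score of the Gaussian marginal p_t; a trained model would replace this."""
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def reverse_sde_sample(n_samples, n_steps, rng):
    """Euler-Maruyama integration of the reverse SDE from t = T down to t = 0."""
    dt = T / n_steps
    mean_T, var_T = marginal_mean_var(T)
    x = mean_T + np.sqrt(var_T) * rng.standard_normal(n_samples)   # x(T) ~ p_T
    for k in range(n_steps):
        t = T - k * dt
        drift = x - G ** 2 * score(x, t)
        # Stepping backward in time flips the sign of the drift increment.
        x = x - drift * dt + G * np.sqrt(dt) * rng.standard_normal(n_samples)
    return x

rng = np.random.default_rng(0)
samples = reverse_sde_sample(n_samples=20000, n_steps=1000, rng=rng)
print(samples.mean(), samples.std())
```

If the scheme is set up correctly, the empirical mean and standard deviation of `samples` should be close to `M0` and `S0`, that is, the reverse process carries the noisy terminal distribution back to the assumed data distribution.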

When the score $\nabla_{x(t)} \log p_t(x)$ of every marginal distribution across time is known, we can sample from this SDE. The score can be estimated by training a model with score matching: a time-dependent model $s_\theta(x, t)$ is trained so that $s_\theta(x(t), t)$ approximates the score, by minimizing the following objective:

$$\mathbb{E}_t\left\{\lambda(t)\,\mathbb{E}_{x(0)}\,\mathbb{E}_{x(t) \mid x(0)}\left[\left\lVert s_\theta(x(t), t) - \nabla_{x(t)} \log q_{0t}(x(t) \mid x(0))\right\rVert^2\right]\right\}. \tag{8}$$

Estimating the score for the underlying ground truth distribution poses challenges, particularly in low-density regions with limited training samples. While adding noise to estimate scores is a valid approach, determining an optimal noise level for recovering the true distribution across the entire space is complex. Learning score functions over time mitigates this challenge.

At $t = T$, the data follows a standard normal distribution, which simplifies score estimation. As time runs backward ($t \to 0$) and the data approaches the true underlying distribution, accurate score approximation might be limited to high-density regions. However, in an iterative denoising process, all points will by then have already moved towards these high-density regions.

Note that ideally, we want to train the model to approximate the marginal score $\nabla_{x(t)} \log q_{0t}(x(t))$. However, with enough training data, one can show that regressing onto the conditional score $\nabla_{x(t)} \log q_{0t}(x(t) \mid x(0))$ is equivalent. Additionally, note that for $x_t \sim \mathcal{N}(\mu_t x_0, \sigma_t^2)$, $x_t$ can be written as $\mu_t x_0 + \sigma_t\epsilon$, and it holds that $\nabla_{x(t)} \log q_{0t}(x(t) \mid x(0)) = -\frac{\epsilon}{\sigma_t}$. So, while in the approach motivated by Langevin dynamics we train to fit the noise $\epsilon$, in this approach motivated by SDEs we optimize for the negative scaled noise $-\frac{\epsilon}{\sigma_t}$, which is yet another reparametrization of the target.
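The link between the score and the scaled noise follows in two lines from the Gaussian density. As a worked step, assuming an isotropic covariance $\sigma_t^2\mathbf{I}$ and writing $x(t) = \mu_t x(0) + \sigma_t\epsilon$:

$$
\begin{aligned}
\log q_{0t}(x(t) \mid x(0)) &= -\frac{\lVert x(t) - \mu_t x(0) \rVert^2}{2\sigma_t^2} + \text{const}, \\
\nabla_{x(t)} \log q_{0t}(x(t) \mid x(0)) &= -\frac{x(t) - \mu_t x(0)}{\sigma_t^2} = -\frac{\sigma_t\,\epsilon}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}.
\end{aligned}
$$

A noise-prediction network and a score network therefore differ only by the known factor $-1/\sigma_t$.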

Technically, this approach’s main difference is its continuous nature and the possibility of optimizing it by solving the SDE. However, in practice, this approach is usually discretized for training (and sampling), and a neural network is used to approximate the score in the same way a network is trained to approximate the noise in the above approach.

1.1.3 Stochastic Localization

Montanari (2023) has recently drawn parallels between stochastic localization and the perspective of stochastic differential equations in diffusion models. Stochastic localization is a stochastic process where at each time $t \in [0, \infty)$, we are given a random probability measure $\mu_t$. As time progresses ($t \to \infty$), the probability measure $\mu_t$ localizes, that is, it converges to a point mass, $\mu_t \to \delta_{x_*}$, where $x_*$ is a random variable. The only requirement is that this process must be a martingale. This means that at any particular time, the conditional expectation of the next value in the sequence is equal to the present value, regardless of all prior values. As with the previous methods, the general idea is that if we can construct this process, we can sample from $\delta_{x_*}$.

Let $\mathbf{Y}_t$ be such a process, and for simplicity, assume it follows a Gaussian distribution:

$$\mathbf{Y}_t = t\,x_* + \mathbf{W}_t \tag{9}$$

where $(\mathbf{W}_t)_{t \geq 0}$ is a Wiener process. We observe that, as time $t$ increases, the signal-to-noise ratio also increases. Montanari (2023) shows that this process is the unique solution to a stochastic differential equation, coinciding with the one derived in Song et al. (2021). This gives rise to another mathematical framework to analyze the properties of diffusion processes and models.
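The growing signal-to-noise ratio of Equation 9 can be observed in a small simulation. The sketch below is illustrative only: the scalar value of `x_star`, the time grid, and the function name are assumptions. It uses the fact that $\mathbf{Y}_t / t = x_* + \mathbf{W}_t / t$, where the error term $\mathbf{W}_t / t$ has standard deviation $1/\sqrt{t}$.

```python
import numpy as np

def localization_path(x_star, t_max, n_steps, rng):
    """Simulate Y_t = t * x_star + W_t on a uniform time grid."""
    dt = t_max / n_steps
    times = dt * np.arange(1, n_steps + 1)
    # A Wiener process has independent N(0, dt) increments.
    increments = np.sqrt(dt) * rng.standard_normal(n_steps)
    wiener = np.cumsum(increments)
    return times, times * x_star + wiener

rng = np.random.default_rng(0)
x_star = 1.5                                   # hidden sample (assumed value for the demo)
times, y = localization_path(x_star, t_max=10_000.0, n_steps=100_000, rng=rng)
estimate = y / times                           # Y_t / t = x_star + W_t / t
for idx in (9, 999, 99_999):
    print(f"t = {times[idx]:8.1f}   Y_t / t = {estimate[idx]:.4f}")
```

The printed estimates should tighten around `x_star` as `t` grows, which is the localization behaviour described above.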

2 Diffusion Models in Discrete State Space

The diffusion process has been successfully adapted to various spaces, such as discrete state spaces (Austin et al., 2021) and function spaces (Lim et al., 2023). For graphs, the former adaptation can be deployed (Haefeli et al., 2022; Vignac et al., 2023).

While certain adjustments are necessary, the underlying concept remains the same. The approach involves diffusing clean input graphs until they resemble random graphs and then learning to reverse this process. The diffusion and sampling processes must work in the discrete state space. Each data point $x$ is expressed as a one-hot encoding, assuming one of $d$ states: $x \in \{0,1\}^d$. The noise is characterized by transition matrices $Q^1, \dots, Q^t$, where $[Q^t]_{ij}$ is the probability of transitioning from state $i$ to state $j$: $q(x^t \mid x^{t-1}) = x^{t-1}Q^t$.

Based on this representation, one can derive the marginal and posterior distribution for $t$ steps:

$$q(x_t \mid x_0) = x_0\bar{\mathbf{Q}}^t \quad\text{with}\quad \bar{\mathbf{Q}}^t = \mathbf{Q}^1\mathbf{Q}^2 \cdots \mathbf{Q}^t \tag{10}$$

and

$$q(x_{t-1} \mid x_t, x_0) = \frac{x_t(\mathbf{Q}^t)^{\top} \odot x_0\bar{\mathbf{Q}}^{t-1}}{x_0\bar{\mathbf{Q}}^t x_t^{\top}} \tag{11}$$
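Equations 10 and 11 reduce to a few matrix products. The following NumPy sketch shows one way to compute them. The uniform transition matrix (keep the state with probability $1-\beta$, otherwise resample uniformly) is only one possible choice of $\mathbf{Q}^t$ and is an assumption here, as are the schedule values and function names.

```python
import numpy as np

def uniform_transition(d, beta):
    """Q = (1 - beta) * I + (beta / d) * 1 1^T: resample uniformly with probability beta."""
    return (1.0 - beta) * np.eye(d) + beta / d * np.ones((d, d))

def cumulative_products(Qs):
    """Return [Qbar^0, Qbar^1, ..., Qbar^T] with Qbar^0 = I and Qbar^t = Q^1 ... Q^t."""
    Qbars = [np.eye(Qs[0].shape[0])]
    for Q in Qs:
        Qbars.append(Qbars[-1] @ Q)
    return Qbars

def marginal(x0, Qbars, t):
    """Distribution of x_t given a one-hot row vector x0 (Equation 10)."""
    return x0 @ Qbars[t]

def posterior(x_t, x0, Qs, Qbars, t):
    """Distribution of x_{t-1} given one-hot x_t and x0, for t in 1..T (Equation 11)."""
    numerator = (x_t @ Qs[t - 1].T) * (x0 @ Qbars[t - 1])   # elementwise product
    denominator = x0 @ Qbars[t] @ x_t
    return numerator / denominator

d, T = 4, 50
betas = np.linspace(0.01, 0.2, T)              # assumed schedule
Qs = [uniform_transition(d, b) for b in betas]
Qbars = cumulative_products(Qs)
x0 = np.eye(d)[0]                              # clean state 0
x_t = np.eye(d)[2]                             # observed noisy state 2 at time t
print(marginal(x0, Qbars, T))                  # close to uniform for large T
print(posterior(x_t, x0, Qs, Qbars, t=10))
```

The posterior is a valid probability vector because summing the numerator over all values of $x_{t-1}$ gives exactly the entry of $\bar{\mathbf{Q}}^t$ that appears in the denominator.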

Now, one can train a model to directly predict the logits of $p_\theta(x_{t-1} \mid x_t)$. However, many approaches opt for a sampling procedure in which the model predicts the clean input $p_\theta(x_0 \mid x_t)$, renoises it using $q(x_{t-1} \mid x_t, x_0)$, and marginalizes over the one-hot encodings:

$$p_\theta(x_{t-1} \mid x_t) \propto \sum_{x_0} q(x_{t-1}, x_t \mid x_0)\,p_\theta(x_0 \mid x_t) \tag{12}$$

2.1 Sampling and the approximated function

The whole pipeline of denoising diffusion models has three parts that all work together to generate new samples. The diffusion process is the iterative process of adding small (random) perturbations to the data, which is used to generate training data.

The denoising part of a denoising diffusion pipeline is the sampling process. This process is based on parts of the target distribution that are unknown, intractable, or unfeasible to compute. So, one part of the sampling is a function that is approximated by a (graph) neural network. The exact sampling procedure and the approximated function rely on each other. Depending on the sampling strategies, different objectives are optimized, and the chosen neural network approximates different functions. We want to understand the role of the three components and try to disentangle their influence.

Graph diffusion, as presented in Vignac et al. (2023), is based on the algorithms presented in Ho et al. (2020).

Given a graph $G = \{X, E\}$, as described in the discrete setting above, the state of each node and edge of the graph is encoded as a one-hot vector. A node $x$ can take $d$ states, $x \in \{0,1\}^d$, so that $X \in \{0,1\}^{n \times d}$. Edges are encoded analogously.

The marginal and posterior distributions are given by Equation 10 and Equation 11.

A graph neural network is trained to solve a classification task on each node and edge, given a noisy graph $G^t = \{X^t, E^t\}$. It optimizes the cross-entropy between the predicted probabilities $\hat{p} = (\hat{p}^X, \hat{p}^E)$ for each node and edge and the true graph:

$$\sum_{i=1}^{n} \text{cross-entropy}(x_i, \hat{p}_i^X) + \lambda \sum_{i,j=1}^{n} \text{cross-entropy}(e_{ij}, \hat{p}_{ij}^E) \tag{13}$$
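The loss in Equation 13 can be written compactly for one-hot targets. This is a minimal sketch under assumptions: the tensor shapes, the number of edge states `c`, the small constant added inside the logarithm, and the weight `lam = 5.0` are all illustrative choices rather than values taken from the text.

```python
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    """Sum over items of -log(probability assigned to the true class)."""
    return -np.sum(one_hot * np.log(probs + eps))

def graph_loss(X, E, p_X, p_E, lam):
    """Node cross-entropy plus lam times the edge cross-entropy.

    X: (n, d) one-hot node states,    p_X: (n, d) predicted node probabilities
    E: (n, n, c) one-hot edge states, p_E: (n, n, c) predicted edge probabilities
    """
    return cross_entropy(X, p_X) + lam * cross_entropy(E, p_E)

n, d, c = 3, 2, 2
X = np.eye(d)[[0, 1, 0]]                       # three nodes with states 0, 1, 0
E = np.eye(c)[np.zeros((n, n), dtype=int)]     # all edges in state 0 ("no edge")
p_X = np.full((n, d), 1.0 / d)                 # uninformative predictions
p_E = np.full((n, n, c), 1.0 / c)
loss = graph_loss(X, E, p_X, p_E, lam=5.0)     # lam = 5.0 is an arbitrary demo value
print(loss)
```

With uniform predictions every term contributes $\log 2$, so the value serves as a reference point that a trained network should improve on.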

Once trained, one samples from the reverse process

$$p_\theta(G^{t-1} \mid G^t) = \prod_{i=1}^{n} p_\theta(X_{i:}^{t-1} \mid G^t) \prod_{i,j=1}^{n} p_\theta(E_{ij:}^{t-1} \mid G^t),$$

which can be estimated from the network predictions:

$$p_\theta(x_{t-1} \mid G^{t}) = \sum_{x \in \mathcal{X}} p_\theta(x_{t-1} \mid x_0 = x, G^{t})\, \hat{p}^{X}(x) \tag{14}$$

where

$$p_\theta(x_{t-1} \mid x_0 = x, G^{t}) = \begin{cases} q(x_{t-1} \mid x = x_0, x_t) & \text{if } q(x_t \mid x = x_{t-1}) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{15}$$
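The marginalization in Eqs. (14) and (15) can be sketched for a single node as follows. This is a minimal example under stated assumptions: transition matrices are row-stochastic with entry `[i][j]` giving the probability of moving from class `i` to class `j`, the posterior is computed with Bayes' rule from the one-step and cumulative transition matrices, and the uniform transition matrix and all function names are hypothetical choices for illustration.

```python
def uniform_transition(num_classes, beta):
    """Row-stochastic matrix: keep the class with prob. 1 - beta, else resample uniformly."""
    return [
        [(1.0 - beta) * (i == j) + beta / num_classes for j in range(num_classes)]
        for i in range(num_classes)
    ]


def posterior(q_t, q_bar_prev, x0, xt):
    """q(x_{t-1} = k | x_0 = x0, x_t = xt) for every class k (all zeros if unreachable)."""
    unnorm = [q_bar_prev[x0][k] * q_t[k][xt] for k in range(len(q_t))]
    total = sum(unnorm)  # equals q(x_t = xt | x_0 = x0)
    if total <= 0.0:
        return [0.0] * len(q_t)
    return [u / total for u in unnorm]


def reverse_step_probs(q_t, q_bar_prev, xt, p_hat):
    """Marginalize the posterior over the predicted clean classes p_hat, as in Eq. (14)."""
    probs = [0.0] * len(p_hat)
    for x0, weight in enumerate(p_hat):
        post = posterior(q_t, q_bar_prev, x0, xt)
        for k, value in enumerate(post):
            probs[k] += weight * value
    return probs


Q = uniform_transition(3, beta=0.2)
# For t = 2 with identical per-step matrices, the cumulative matrix up to t - 1 is Q itself.
step_probs = reverse_step_probs(Q, Q, xt=0, p_hat=[0.7, 0.2, 0.1])
```

The network only supplies `p_hat`; everything else in the reverse step comes from the known forward process.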

As with images, where the diffusion is applied to each pixel independently, the diffusion process is not defined on the graph as a whole but independently on its nodes and edges. The structural information of the graph is neglected in this step. Instead of a standard neural network that approximates the gradient of the data distribution, DiGress uses a graph neural network to solve a classification task on each node and edge. As the sampling also relies on the information from the diffusion process, the graph structure enters only through the learned weights of the graph neural network.

This raises the question of to what extent the diffusion, the graph neural network, and the sampling each contribute to good-quality samples. Vignac et al. (2023) suggest using the marginal distribution of classes in the training data for the noise process and show superior sampling quality when using this process. This indicates that the noise process and the information put into it significantly affect the sampling quality.

Other works on images, such as Bansal et al. (2024), claim that noise is unnecessary and show high-quality samples for deterministic diffusion processes.

Several questions arise regarding the influence of the individual parts of the algorithm's pipeline and their respective biases.

Q1: What is the role of noise in the diffusion denoising pipeline, and do we need it at all? We investigate the importance of the different parts in simulations and give some insights into their role.

Q2: How much influence does the sampling have on the performance? Some works suggest reparametrization in the sampling. While approaches on graphs train to predict the clean graph, other works such as Ho et al. (2020) note that predicting the clean image is less accurate.

Q3: What does the neural network approximate, and do we need the complexity? When solving the "simple" classification task for the graph setting, could a simpler model achieve similar results? How much structural information do we introduce by the iterative sampling procedure, including the forward noise process?

How to approach these questions is not immediately clear, and graphs and graph neural networks introduce an additional degree of complexity. As a starting point, we investigate the three components in a much simpler setting. This breaks the problem down into a setting that we can visualize and allows us to build intuition.

3 Diffusion and denoising in a simple setting.

3.0.1 Setup

Consider a set of points in two dimensions originating from some unknown distribution $p$. We want to generate new samples $x \sim p$ from this distribution. We cannot sample from it because we cannot access the underlying distribution. However, we can train a denoising diffusion model to sample from an approximated distribution $\tilde{p}$.

For a simple analysis, we choose a mixture of two Gaussians. Figure 3 shows the density and the score of the chosen ground truth distribution with

$$\mu_1 = (-4, -4), \quad \mu_2 = (4, 4), \quad \sigma_1 = \begin{pmatrix} 0.3 & 0 \\ 0 & 0.1 \end{pmatrix} \text{ and } \sigma_2 = \begin{pmatrix} 0.2 & 0 \\ 0 & 0.2 \end{pmatrix}.$$

Our simulations (see Section 3.1) suggest the following answers to the questions raised in the section above:

(A1) We do not need the noise. Song and Ermon (2019) observe that the gradient approximation is poor in low-density regions of the data and address the problem by adding noise. Without any perturbation, we approximate the data gradient sufficiently well only close to high-density regions. Noise mitigates the problem by diffusing the training points so that no low-density regions remain. Too much noise, however, leaves no signal, so the amount of noise added is crucial. By iteratively adding tiny perturbations and learning an iterative backward process, we can approximate the time-dependent ground truth distribution even when starting far away from high-density regions. Still, the conclusion that we need noise in the sense of randomness is misleading: as long as we manage to cover the space sufficiently, the diffusion process can also be deterministic. We show experiments on that in Section 3.1.

(A2) Diffusion schedule and sampling process are crucial for the performance.

Unsurprisingly, the diffusion schedule plays an essential role in the proper approximation of the reverse process. Figure 5 visualizes the influence of different schedules for $\beta$, $\alpha$, and $\bar{\alpha}$. The linear schedule leads to faster convergence to a standard normal distribution and thus loses much signal in the first steps. As a result, the later timesteps contain little to no signal and are worthless for training. The cosine schedule results in a smoother transition; thus, later timesteps contain a more valuable signal for the training process. As all timesteps are equally likely to be sampled during training, slower, smoother diffusion is better.

In addition, what exactly is approximated by the neural network significantly influences the performance. Both the distribution and its likelihood follow mathematical rules that are hard to enforce with a neural network. Thus, predicting the likelihood of a data point is challenging. While it is only a reparametrization of the target, locally approximating the score of the likelihood allows the inclusion of additional information about the noise process, seems more accessible, and empirically results in better performance.

(A3) A simple network approximates the data distribution reasonably well.

The network only partially approximates the distribution’s score. However, even though our network’s architecture is simple, and thus, its approximation power is limited, we learn essential features of the ground truth distribution in all three settings.

It is impossible to learn random independent noise. So, the model does not approximate the actual reverse process but the gradient of the distribution in each step. For each point, the model learns a mapping that moves every point closer to a high-density region of the training data.

Aligned with the intuition behind stochastic localization, the network learns the gradient of the distribution for each time step.
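The link between the predicted noise and the gradient of the distribution can be made explicit. The following is a short sketch, assuming the Gaussian forward process of Ho et al. (2020), $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; the symbol $q_t$ for the marginal distribution at time $t$ is introduced here only for illustration.

```latex
\begin{align}
q(x_t \mid x_0) &= \mathcal{N}\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\bigr), \\
\nabla_{x_t} \log q(x_t \mid x_0) &= -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t}
  = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}, \\
\nabla_{x_t} \log q_t(x_t) &= -\frac{\mathbb{E}[\epsilon \mid x_t]}{\sqrt{1-\bar{\alpha}_t}}.
\end{align}
```

Under these assumptions, a network that minimizes the squared error to the added noise approximates $\mathbb{E}[\epsilon \mid x_t]$, which is the score of the time-$t$ marginal up to the factor $-\sqrt{1-\bar{\alpha}_t}$.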

3.1 Simulations

3.1.1 Generation process and experimental setup

We generate data from a mixture of two Gaussians. We randomly sample 5,000 points from each of the two distributions, so 10,000 training points overall. The distribution we sample the training data from is visualized in Figure 3.
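A minimal sketch of this data generation, using only the Python standard library, could look as follows. It assumes that the diagonal entries of $\sigma_1$ and $\sigma_2$ are variances (the text does not say whether they are variances or standard deviations); the seed and the function name are arbitrary choices for this example.

```python
import random

MEANS = [(-4.0, -4.0), (4.0, 4.0)]
# Assumption: the diagonal entries of sigma_1 and sigma_2 are treated as variances.
VARIANCES = [(0.3, 0.1), (0.2, 0.2)]


def sample_mixture(n_per_component=5000, seed=0):
    """Draw n_per_component points from each axis-aligned Gaussian and shuffle them."""
    rng = random.Random(seed)
    points = []
    for (mean_x, mean_y), (var_x, var_y) in zip(MEANS, VARIANCES):
        for _ in range(n_per_component):
            x = rng.gauss(mean_x, var_x ** 0.5)  # gauss expects a standard deviation
            y = rng.gauss(mean_y, var_y ** 0.5)
            points.append((x, y))
    rng.shuffle(points)
    return points


train_points = sample_mixture()
```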

Figure 3: Ground truth data distribution used to sample training points. The left figure shows the density, and the right figure shows the log-likelihood. The arrows indicate the direction of the score $\nabla_x \log p(x)$.
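For a mixture of Gaussians, the score $\nabla_x \log p(x)$ is available in closed form, which is what makes a comparison against a ground truth possible. The sketch below evaluates it for axis-aligned components; it assumes equal mixture weights and reads the diagonal entries of $\sigma_1$ and $\sigma_2$ as variances, and the function names are hypothetical.

```python
import math

MEANS = [(-4.0, -4.0), (4.0, 4.0)]
VARIANCES = [(0.3, 0.1), (0.2, 0.2)]  # assumption: diagonal entries are variances
WEIGHTS = [0.5, 0.5]  # assumption: both components are equally likely


def log_gauss_diag(x, mean, var):
    """Log-density of an axis-aligned Gaussian at point x."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def mixture_score(x, means=MEANS, variances=VARIANCES, weights=WEIGHTS):
    """Gradient of log p(x): responsibility-weighted sum of (mean - x) / variance."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum before exponentiating for stability
    resp = [math.exp(value - top) for value in logs]
    total = sum(resp)
    score = [0.0] * len(x)
    for r, m, v in zip(resp, means, variances):
        for d in range(len(x)):
            score[d] += (r / total) * (m[d] - x[d]) / v[d]
    return score


score_near_mode = mixture_score((3.0, 4.0))
```

Close to one mode, the other component has negligible responsibility, so the score reduces to that of a single Gaussian and points toward the nearby mean.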

For each of the investigated sampling methods (see Figure 4), we train the same neural network architecture: a simple multi-layer perceptron with two ReLU layers of width 20 and a final linear layer as output.

Figure 4: Visualization of the three investigated sampling methods, shown in panels a), b), and c). Red indicates the part that the model predicts.

The model is trained with a batch size of 64 for 50 epochs using the Adam Optimizer from PyTorch. We note that we did not tune the neural network in any way and used the same architecture and hyperparameters for the three different tasks. Given the simplicity of the chosen problem, a comparison on this basis is still fair and justified.

3.1.2 Different Noise Schedules

Different papers observe that the noise schedule used in training can play a crucial role in the performance of the generative model. While the original work of Ho et al. (2020) suggests a linear schedule, recent works usually use the cosine schedule introduced in Nichol and Dhariwal (2021). Empirically, the latter yields better performance. In the linear case, many of the time steps fall into the range where the data is indistinguishable from random noise. In those steps, the training data does not hold enough information for learning. The cosine schedule mitigates this effect and distributes the structural information more smoothly along the time steps. Compare Figure 5(b)-5(c) for the schedules and Figures 5(d) and 5(e) for a visualization of the respective noising processes. We used the cosine diffusion schedule in all our experiments.
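The two schedules can be compared numerically with a few lines of code. The sketch below follows the cosine formula of Nichol and Dhariwal (2021) with offset $s = 0.008$; the linear range used in the example (0.001 to 0.2 for $T = 100$) is an assumption for illustration, since the exact values used in the experiments are not stated here.

```python
import math


def linear_betas(T, beta_start, beta_end):
    """Linearly increasing beta_t for t = 1, ..., T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cosine_betas(T, s=0.008, max_beta=0.999):
    """beta_t = 1 - f(t) / f(t - 1) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""

    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    return [min(1 - f(t) / f(t - 1), max_beta) for t in range(1, T + 1)]


def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


T = 100
ab_linear = alpha_bars(linear_betas(T, 0.001, 0.2))  # assumed range, not from the text
ab_cosine = alpha_bars(cosine_betas(T))
```

With these settings, roughly half of the signal variance is still present at the midpoint of the cosine schedule, while the linear schedule has already dropped far below that.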

Figure 5: The noise schedule makes a difference. Panels: (a) $\beta_t$; (b) $\alpha_t = 1 - \beta_t$; (c) $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; (d) a linear diffusion process with normal noise; (e) a cosine diffusion process with normal noise. For linear diffusion, most information is lost in the early time steps, and later steps hold little to no information about either the original distribution or the diffusion process. Controlled by $\bar{\alpha}$, the information in the cosine diffusion process degrades more slowly, so later steps still hold valuable transition information for the training. Visualizations of the diffusion process in panels (d) and (e) show timesteps t = 0, 27, 54, 81, 99 from left to right.

3.1.3 Different Sampling Methods

We consider three different sampling procedures. The target is always, given a data point $x_t$ and a timestep $t$, to predict the state of the data point at the previous timestep, $x_{t-1}$, and then to use an iterative process to sample $x_0$. The most direct way is to train a neural network to predict $x_{t-1}$ directly. This is usually done in a variational manner by training the network to predict the mean of $p(x_{t-1} \mid x_t)$. We call this method single step sampling. One can also reparametrize the sampling and train the network to predict the clean input $x_0$ and then use the knowledge about the diffusion process to get $x_{t-1}$. We call this method whole step sampling. The most sophisticated and commonly used method also uses the information about the diffusion process and reparametrizes the mean of $p(x_{t-1} \mid x_t)$ in terms of the difference between $x_{t-1}$ and $x_t$. As this learns to predict the added noise, we call this method noise sampling.
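To make the relation between the three targets concrete, the sketch below writes them out for a single coordinate, assuming the Gaussian forward process and posterior mean of Ho et al. (2020). It uses the true noise and the true clean point in place of network predictions, so it only illustrates the algebra; the function names are hypothetical.

```python
import math
import random


def forward_sample(x0, alpha_bar_t, eps):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps


def posterior_mean(x0, xt, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0): the target of single step sampling."""
    beta_t = 1 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
    return coef_x0 * x0 + coef_xt * xt


def mean_from_noise(xt, eps, alpha_t, alpha_bar_t):
    """Same mean, reparametrized through the noise: used by noise sampling."""
    return (xt - (1 - alpha_t) / math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_t)


def x0_from_noise(xt, eps, alpha_bar_t):
    """Clean point implied by a noise estimate."""
    return (xt - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)


def whole_step_sample(x0_hat, alpha_bar_prev, rng):
    """Whole step sampling: re-noise the predicted clean point up to step t - 1."""
    return forward_sample(x0_hat, alpha_bar_prev, rng.gauss(0.0, 1.0))


alpha_t, alpha_bar_prev = 0.9, 0.5
alpha_bar_t = alpha_t * alpha_bar_prev
x0, eps = 1.3, -0.7
xt = forward_sample(x0, alpha_bar_t, eps)
mean_single = posterior_mean(x0, xt, alpha_t, alpha_bar_t, alpha_bar_prev)
mean_noise = mean_from_noise(xt, eps, alpha_t, alpha_bar_t)
x_prev_whole = whole_step_sample(x0, alpha_bar_prev, random.Random(0))
```

With exact targets, `mean_single` and `mean_noise` coincide, so any difference between the two methods in practice comes from what the network has to learn, not from the sampling formula.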

Figure 6: Reparametrized single step denoising process as suggested in Ho et al. (2020). The neural network is trained to approximate the noise in each step.

Figure 7: Deterministic noise. Reparametrized single step denoising process as suggested in Ho et al. (2020). The neural network is trained to approximate the noise in each step.

Figure 8: Single step denoising process. The neural network is trained to approximate the mean of the distribution in the previous step, $\mu_{t-1}$.

Figure 9: Whole step denoising process. The neural network is trained to approximate the clean image $x_0$. $x_{t-1}$ is sampled by adding $t-1$ steps of the noise process.

Figure 10: The function approximated by the neural network for the three different sampling methods, shown in panels (a) to (e) for t = 99, 54, 36, 18, and 0. The arrows indicate the approximated function when trained to predict the noise $\epsilon_t$ (left), when trained to predict the clean image $x_0$ (middle), and when trained to predict the previous time step $x_{t-1}$ (right).

Figure 11: Comparison of the learned trajectories for the random noise versus the noiseless diffusion process, shown in panels (a) to (e) for t = 99, 54, 36, 18, and 0. The arrows indicate the final trajectories for the standard model trained on the diffusion process using random noise (left), for a model trained on a deterministic diffusion process (middle), and for comparison, the true score of the function (right). The score is proportional to the noise given a fixed point.

3.1.4 Does the Process invert the diffusion?

We visualize the denoising processes for the different sampling methods in Figures 6 to 9. The figures show time steps t = 99, 54, 36, 18, and 0 for 10000 data points. The data points for time step 99 are taken from a grid between -7 and 7. The colors in the figures indicate group membership when clustering with a Gaussian Mixture Model on the final time step.

While the noise sampling approximates the ground truth distribution reasonably well, the other sampling methods fail to fit the training distribution. We observe many samples from the low-density region between the two clusters for the single step sampling. The whole step sampling only samples points from very high-density regions and almost collapses to the means of the two Gaussians.

Most surprising is the significantly worse performance of single step sampling compared to noise sampling, as the two differ only by a simple reparametrization. However, when predicting the noise relative to the clean data instead of the data point at the previous timestep, we explicitly add information about the forward diffusion process. A model trained to predict the previous timestep has only implicit access to this information and can hardly infer it.

Another phenomenon we observe is that the first two sampling methods keep a positional bias, so points that start closer to $\mu_1$ end up close to $\mu_1$. The whole step sampling method suffers less from this phenomenon. Also, the model trained with the deterministic diffusion process shows no positional bias. This shows that even though one might think at first glance that noise sampling approximates the reverse process, it does not. Reversing random noise is not possible, so the intuition of reversing the diffusion is misleading. However, this also indicates that the model, to some degree, fits the data gradient. In the following, we run further simulations to investigate what precisely the models learn.

3.1.5 What does the neural network approximate?

Figure 10 provides insights into the neural network's learning outcomes for the three specific objectives discussed in Section 3.1.3. The network struggles to accurately approximate the score function across various regions in all tasks. This is expected, as the model deals with very noisy data in the first time steps and sparse regions in the later ones. Especially in the low-density regions in the last time step, we cannot expect the model to learn a helpful function, as training data in those areas is limited. The necessity of the iterative sampling process becomes evident in these images: only combining and aggregating the information available at individual time steps results in a sufficient approximation of the training distribution.

This observation underscores the significant effect of introducing stepwise perturbations into the training process to ensure adequate coverage of low-density regions and the effective learning of data distribution gradients. Song and Ermon (2019) also observe this behavior and argue that noise is the solution for a good approximation of the score function. In the following experiment, we showcase that this is only one perspective.

3.1.6 What about Noise?

If the coverage of low-density regions is the core problem, then the diffusion process, not noise itself, is the key to a good approximation of the score function. We define a deterministic diffusion process that modifies a data point in each step so that the data converges to a normal distribution. For every data point, we take the x-th digit after the decimal point. These values are approximately uniformly distributed. To go from the uniform distribution to a normal distribution, we map them through the inverse of the normal CDF. The resulting diffusion process when using the cosine schedule is shown in Figure 7.
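One possible way to realize such a deterministic perturbation is sketched below. It is an illustration under assumptions, not the exact construction used in the experiments: the position of the first digit used (`skip_digits`) and the number of digits kept (`use_digits`) are hypothetical parameters, and several digits are combined so that the resulting value is finer than a single digit would allow.

```python
import math
from statistics import NormalDist


def pseudo_uniform(value, skip_digits=3, use_digits=6):
    """Deterministically map a float to (0, 1) using its later decimal digits."""
    shifted = abs(value) * 10 ** skip_digits
    frac = shifted - math.floor(shifted)  # digits after the first skip_digits decimals
    scale = 10 ** use_digits
    bucket = min(math.floor(frac * scale), scale - 1)
    return (bucket + 0.5) / scale  # bucket centers stay strictly inside (0, 1)


def pseudo_noise(value, skip_digits=3, use_digits=6):
    """Map the pseudo-uniform value through the inverse standard normal CDF."""
    return NormalDist().inv_cdf(pseudo_uniform(value, skip_digits, use_digits))


noise_a = pseudo_noise(1.2345678)
noise_b = pseudo_noise(1.2345678)
```

Because the same input always yields the same perturbation, repeated calls give identical values, which is what distinguishes this process from sampling random noise.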

We train the same model with the same hyperparameters as before and observe similar behavior and performance on the three sampling methods. Figure 7 shows the learned trajectories for the reparametrized denoising process.

We conclude that diffusion is necessary to cover the whole space. However, we can do this in a noise-free way. If we could construct a deterministic diffusion process that is also invertible, we could achieve perfect recovery of training data while still being able to sample new data points.

While Bansal et al. (2024) also observe good performance for deterministic diffusion processes, they consider a very different sampling setting, and thus, their insights do not translate to ours: they do not aim to approximate the score of the data but the data point at the previous timestep. So, their target is not an approximation of the score, which would be difficult in low-density regions and would not work with their “diffusion” processes.

4 What’s next

In considering the future directions for our research, several intriguing questions emerge, separating into two overarching areas.

4.1 Noise vs. No Noise

If we decide we do not need any noise in the diffusion pipeline, what are the benefits and drawbacks of using it? An essential consideration is the computational efficiency of computing the diffusion deterministically. This is computationally more expensive in our current approach and prompts evaluating whether the computational overhead is justified and what advantages deterministic diffusion may bring.

Additionally, exploring the implications of training on deterministic data remains an exciting question. The prospect of achieving perfect recovery of training data through deterministic training raises the question: is this even desired as the goal of these models is to generate new data? What are the limitations or (unwanted) biases introduced to the model compared to random noise?

4.2 Take it back to graphs

Shifting our focus back to the domain of graphs introduces a distinctive set of challenges and considerations. Unlike images, which are essentially high-dimensional vectors, graphs encapsulate diverse and heterogeneous forms of information. Notably, distributions over molecules present challenges regarding description and analysis. The central question at hand involves understanding what the "score function" captures in graph data and critically assessing how well we are approximating the underlying distribution.

Delving deeper, a key question is understanding what information our model learns about graphs. Is the structural information captured sufficiently even though it is not explicitly included in the diffusion process? Determining what the model learns well and what it might miss is crucial, especially considering how complex graph data can be.

Moreover, the unique way graph sampling is introduced in Vignac et al. (2023) calls for further exploration. Figuring out why predicting a clean graph works better than predicting a clean image could help us improve our understanding of the model and the model itself, especially when dealing with different data types.

References

  • Austin et al. (2021) J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg. Structured denoising diffusion models in discrete state-spaces. 2021.
  • Bansal et al. (2024) A. Bansal, E. Borgnia, H.-M. Chu, J. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. Neural Information Processing Systems (NeurIPS), 2024.
  • Haefeli et al. (2022) K. K. Haefeli, K. Martinkus, N. Perraudin, and R. Wattenhofer. Diffusion models for graphs benefit from discrete state spaces. NeurIPS Workshop on New Frontiers in Graph Learning, 2022.
  • Ho et al. (2020) J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Neural Information Processing Systems (NeurIPS), 2020.
  • Lim et al. (2023) J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, et al. Score-based diffusion models in function space. arXiv preprint arXiv:2302.07400, 2023.
  • Montanari (2023) A. Montanari. Sampling, diffusions, and stochastic localization. arXiv preprint arXiv:2305.10690, 2023.
  • Nichol and Dhariwal (2021) A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. International Conference on Machine Learning, 2021.
  • Sohl-Dickstein et al. (2015) J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (ICML), 2015.
  • Song and Ermon (2019) Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Neural Information Processing Systems (NeurIPS), 2019.
  • Song et al. (2021) Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations (ICLR), 2021.
  • Vignac et al. (2023) C. Vignac, I. Krawczuk, A. Siraudin, B. Wang, V. Cevher, and P. Frossard. DiGress: Discrete denoising diffusion for graph generation. International Conference on Learning Representations (ICLR), 2023.