Towards understanding Diffusion Models (on Graphs)
Abstract
Diffusion models have emerged from various theoretical and methodological perspectives, each offering unique insights into their underlying principles. In this work, we provide an overview of the most prominent approaches, drawing attention to their striking analogies – namely, how seemingly diverse methodologies converge to a similar mathematical formulation of the core problem. While our ultimate goal is to understand these models in the context of graphs, we begin by conducting experiments in a simpler setting to build foundational insights. Through an empirical investigation of different diffusion and sampling techniques, we explore three critical questions: (1) What role does noise play in these models? (2) How significantly does the choice of the sampling method affect outcomes? (3) What function is the neural network approximating, and is high complexity necessary for optimal performance? Our findings aim to enhance the understanding of diffusion models and in the long run their application in graph machine learning.
1 Continuous Diffusion Models
In physics, diffusion captures the overall movement of particles, such as atoms, from areas of higher concentration to those of lower concentration. Consider the analogy of dropping a small amount of paint into a glass of water. Initially, the paint is concentrated in one location, but over time, it diffuses throughout the water until it reaches a state of equilibrium. The intriguing question arises: Can we reverse this diffusion process? Unfortunately, such a reversal proves impossible in most cases.
Despite the impossibility of reversing diffusion, a field of study known as diffusion models exists. These models aim to capture the dynamics of this diffusion phenomenon and are based on the idea of approximately undoing this process. Empirically, they achieve surprisingly good results when sampling new data points with similar properties.
From a practical point of view, diffusion models are generative models that aim to create new samples from an unknown and often complex underlying distribution. Usually, the only information about the target distribution is a set of training data points originating from it. Because directly approximating this distribution is challenging, diffusion models decompose the problem into incremental steps: the model learns to predict a distribution not only for the clean training data but also for a sequence of distributions obtained by gradually adding noise to the training data. Learning over these small steps is what makes high-quality samples attainable. In this chaotic system, each data point progressively loses its distinguishable features as the time step increases. As the number of diffusion steps approaches infinity ($T \to \infty$), the terminal state converges to an isotropic Gaussian distribution, meaning the system has reached a state of equilibrium.
1.1 Diffusion Models
In the past few years, various generative models using the concept of diffusion have been introduced. Different methodologies end up with more or less the same mathematical formulation of the underlying problem.
1.1.1 Langevin Dynamics
Inspired by the principles of a molecule diffusing in a liquid, the Langevin equation mathematically captures the diffusion process. The key parameters are the particle mass $m$, the damping coefficient $\lambda$, the velocity $v$, and a noise term $\eta$ representing collisions with surrounding molecules.
$$m \frac{dv}{dt} = -\lambda v + \eta(t) \tag{1}$$
In the context of diffusion models, we describe the forward process similarly.
(2) |
The function represents the externally introduced change in the data point and is usually maintained as the identity. The data point undergoes dispersion that is scaled by and described by the noise term . This forward process is commonly represented as a Markov Chain, with noise added at each time step based on a variance schedule ().
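$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\right), \qquad t = 1, \dots, T \tag{3}$$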
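As a rough illustration of the Markov-chain view, the following sketch repeatedly applies a Gaussian noising step of the form used by Ho et al. (2020). The constant variance of 0.05, the 200 steps, and the function name `forward_chain` are assumptions made only for this example.

```python
import numpy as np

def forward_chain(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise for every beta_t."""
    xs = [x0]
    for beta in betas:
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(1.0 - beta) * xs[-1] + np.sqrt(beta) * noise)
    return np.stack(xs)  # shape: (len(betas) + 1, n_points, dim)

rng = np.random.default_rng(0)
x0 = np.full((5000, 2), 3.0)   # all points start at the same location (illustrative)
betas = np.full(200, 0.05)     # constant variance schedule (assumption)
traj = forward_chain(x0, betas, rng)
```

After enough steps the points forget their starting location, and their empirical mean and standard deviation approach 0 and 1.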
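Given the noisy state, we want the model to return the most probable clean input image. So, for the backward process, we train a model to optimize (the variational lower bound of) the log-likelihood:
$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \tag{4}$$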
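For the reverse process, the conditional probability $p_\theta(x_{t-1} \mid x_t)$ is modelled as a normal distribution, and a neural network is optimized to predict its mean $\mu_\theta(x_t, t)$ and covariance $\Sigma_\theta(x_t, t)$.
Although $\Sigma_\theta$ and the variance schedule $\beta_t$ can be learned, Ho et al. (2020) opt for fixing all $\beta_t$ to a linear schedule to reduce computational costs. Specifically, they set $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, which allows optimizing solely for $\mu_\theta$. They observe that reparametrizing and optimizing for the noise $\epsilon_\theta$ yields even better performance. In addition, they suggest simplifying the loss by discarding some terms, again justifying this choice with better empirical performance. So, they end up optimizing the following objective, training to predict the noise in relation to the clean image (also see Figure 2):
$$L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\rVert^2\right], \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s) \tag{5}$$
During training, each gradient step involves independently sampling clean data points $x_0$, a random timestep $t$, and noise $\epsilon$ ($\epsilon \sim \mathcal{N}(0, I)$).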
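The sampling of training triples described above can be sketched as follows. This is a minimal sketch of the noise-prediction setup of Ho et al. (2020), not the code used for the experiments; the schedule endpoints, the batch of random 2D points, and all function names are assumptions for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances; the endpoint values are illustrative."""
    return np.linspace(beta_start, beta_end, num_steps)

def make_training_batch(x0, betas, rng):
    """Draw one timestep and one noise vector per clean point and build x_t."""
    alpha_bar = np.cumprod(1.0 - betas)             # cumulative signal retention
    t = rng.integers(0, len(betas), size=len(x0))   # uniform random timesteps
    eps = rng.standard_normal(x0.shape)             # regression target
    a = alpha_bar[t][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean(np.sum((eps_pred - eps) ** 2, axis=-1)))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(100)
x0 = rng.standard_normal((64, 2))   # stand-in for a batch of clean data
x_t, t, eps = make_training_batch(x0, betas, rng)
```

A network would receive `x_t` and `t` and be trained so that `noise_prediction_loss` of its output against `eps` becomes small.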
Later research suggests potential improvements to the linear schedule (Nichol and Dhariwal, 2021), and our experiments also demonstrate suboptimal performance, which highlights the inefficiency of sampling with this schedule.
1.1.2 Stochastic Differential Equations
Drawing from the same conceptual framework as in Langevin Dynamics, we formalize the diffusion process as a random phenomenon unfolding over time, which can be mathematically formulated as a Stochastic Differential Equation (SDE):
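$$dx = f(x, t)\,dt + g(t)\,dw \tag{6}$$
Here, $f$ is the drift, $g$ is the diffusion coefficient, and $w$ is a Wiener process. This equation matches the structure of the Langevin dynamics in Equation 2, underscoring their similarity.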
However, unlike in the discretized perspective of the Markov chain, the reverse of a stochastic differential equation is itself a stochastic differential equation, running backward in time and expressed as:
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w} \tag{7}$$
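When the score $\nabla_x \log p_t(x)$ of all marginal distributions across time is known, we can sample from this SDE. This score can be estimated through model training using score matching: a time-dependent model, denoted as $s_\theta(x, t)$, is trained to estimate $\nabla_x \log p_t(x)$ by minimizing the following objective, where $\lambda(t)$ is a positive weighting function:
$$\theta^* = \arg\min_\theta\ \mathbb{E}_t\left\{\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\left[\left\lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0)\right\rVert_2^2\right]\right\} \tag{8}$$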
Estimating the score for the underlying ground truth distribution poses challenges, particularly in low-density regions with limited training samples. While adding noise to estimate scores is a valid approach, determining an optimal noise level for recovering the true distribution across the entire space is complex. Learning score functions over time mitigates this challenge.
At $t = T$, the data is standard normally distributed, which simplifies score estimation. As time runs backward and the data approaches the true underlying distribution, an accurate approximation of the scores might be limited to high-density regions. However, in an iterative denoising process, all points will already have converged towards these high-density regions.
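Note that ideally, we want to train the model to approximate $\nabla_{x_t} \log p_t(x_t)$; however, with enough training data, one can show that this is equivalent to approximating $\nabla_{x_t} \log p_{0t}(x_t \mid x_0)$. Additionally, note that for a Gaussian perturbation kernel $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t; \mu_t(x_0), \sigma_t^2 I)$, $x_t$ can be written as $x_t = \mu_t(x_0) + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and it holds that $\nabla_{x_t} \log p_{0t}(x_t \mid x_0) = -\epsilon / \sigma_t$. So, while in the approach motivated by Langevin dynamics we train to fit the noise $\epsilon$, in this approach motivated by SDEs we optimize for the negative scaled noise $-\epsilon / \sigma_t$, which is yet another reparametrization of the target.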
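The relation between the conditional score and the noise can be spelled out in two lines. The derivation below assumes a Gaussian perturbation kernel with mean $\mu_t(x_0)$ and covariance $\sigma_t^2 I$; these symbols are notation chosen here for illustration.

```latex
\begin{aligned}
\log p_{0t}(x_t \mid x_0)
  &= -\frac{\lVert x_t - \mu_t(x_0) \rVert^2}{2\sigma_t^2} + \text{const}, \\
\nabla_{x_t} \log p_{0t}(x_t \mid x_0)
  &= -\frac{x_t - \mu_t(x_0)}{\sigma_t^2}
   = -\frac{\sigma_t\,\epsilon}{\sigma_t^2}
   = -\frac{\epsilon}{\sigma_t},
  \qquad \text{using } x_t = \mu_t(x_0) + \sigma_t\,\epsilon .
\end{aligned}
```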
Technically, this approach’s main difference is its continuous nature and the possibility of optimizing it by solving the SDE. However, in practice, this approach is usually discretized for training (and sampling), and a neural network is used to approximate the score in the same way a network is trained to approximate the noise in the above approach.
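To make the discretization concrete, the following sketch integrates a reverse-time SDE with the Euler-Maruyama scheme in a toy case where the score is known in closed form, so no network is needed. The one-dimensional Gaussian data distribution, the constant noise rate `BETA`, and the step count are assumptions chosen so that the result can be checked analytically.

```python
import numpy as np

BETA = 1.0  # constant noise rate of dx = -0.5*BETA*x dt + sqrt(BETA) dw (assumption)

def marginal_stats(t, m0, s0):
    """Mean and variance of p_t when the data distribution is N(m0, s0^2)."""
    decay = np.exp(-0.5 * BETA * t)
    return m0 * decay, s0**2 * decay**2 + 1.0 - decay**2

def score(x, t, m0, s0):
    """Exact score of the Gaussian marginal p_t."""
    mean, var = marginal_stats(t, m0, s0)
    return -(x - mean) / var

def reverse_sde_sample(n, m0, s0, t_end=8.0, steps=800, seed=0):
    """Euler-Maruyama integration of the reverse SDE from t_end down to 0."""
    rng = np.random.default_rng(seed)
    dt = t_end / steps
    x = rng.standard_normal(n)  # start from the (approximate) terminal N(0, 1)
    for k in range(steps):
        t = t_end - k * dt
        drift = -0.5 * BETA * x - BETA * score(x, t, m0, s0)  # f - g^2 * score
        x = x - drift * dt + np.sqrt(BETA * dt) * rng.standard_normal(n)
    return x

samples = reverse_sde_sample(20000, m0=2.0, s0=0.5)
```

In a trained model, the call to `score` would be replaced by the network output; the sampled points should then have roughly the mean and spread of the data distribution.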
1.1.3 Stochastic Localization
Montanari (2023) has recently drawn parallels between stochastic localization and the perspective of stochastic differential equations in diffusion models. Stochastic localization is a stochastic process where at each time step $t$, we are given a random probability measure $\mu_t$. As time progresses ($t \to \infty$), the probability measure localizes, that is, it converges to a point mass $\delta_{x^*}$, where $x^* \sim \mu$ is a random variable. The only requirement is that this process must be a martingale. This means that at a particular time, the conditional expectation of the next value in the sequence is equal to the present value, regardless of all prior values. As with the previous methods, the general idea is that if we can construct this process, we can sample from $\mu$.
Let $(y_t)_{t \ge 0}$ be such a process, and for simplicity, assume it follows a Gaussian distribution:
$$y_t = t\,x^* + W_t \tag{9}$$
where $(W_t)_{t \ge 0}$ is a Wiener process. We observe that, as time increases, the signal-to-noise ratio also increases. Montanari (2023) shows that this process is the unique solution to a stochastic differential equation, coinciding with the one derived in Song et al. (2021). This gives rise to another mathematical framework to analyze the properties of diffusion processes and models.
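The localization behaviour can be illustrated with a small simulation of an observation process of the form "time times signal plus Brownian motion". The sketch assumes a target distribution that is uniform on {-1, +1}, for which the posterior mean given the observation y is tanh(y); the horizon and step count are arbitrary choices.

```python
import numpy as np

def localize(x_star, t_end=50.0, steps=5000, seed=0):
    """Simulate y_t = t * x_star + W_t and return the posterior mean E[x_star | y_t]."""
    rng = np.random.default_rng(seed)
    dt = t_end / steps
    noise = np.sqrt(dt) * rng.standard_normal((len(x_star), steps))
    y = np.cumsum(x_star[:, None] * dt + noise, axis=1)  # one path per row
    return np.tanh(y)  # exact posterior mean for a uniform prior on {-1, +1}

rng = np.random.default_rng(1)
x_star = rng.choice([-1.0, 1.0], size=200)   # hidden samples from the target
posterior_mean = localize(x_star)
```

Early on, the posterior mean is close to 0 (little signal); at the end of the horizon it has essentially collapsed onto the hidden value, which is the localization described above.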
2 Diffusion Models in Discrete State Space
The diffusion process has been successfully adapted to various spaces, such as discrete state spaces (Austin et al., 2021) and function spaces (Lim et al., 2023). In graphs, the former adaptation can be deployed (Haefeli et al., 2022; Vignac et al., 2023).
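While certain adjustments are necessary, the underlying concept remains the same. The approach involves diffusing clean input graphs until they resemble random graphs and then learning to reverse this process. The diffusion and sampling processes must work in the discrete state space. Each data point is expressed as a one-hot encoding $x$, assuming one of $K$ states: $x \in \{1, \dots, K\}$. The noise is characterized by transition matrices $Q_t$, where $[Q_t]_{ij}$ is the probability of transitioning from state $i$ to state $j$: $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$.
Based on this representation, one can derive the marginal and posterior distribution for $t$ steps, with $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$:
$$q(x_t \mid x_0) = \mathrm{Cat}\!\left(x_t;\ p = x_0 \bar{Q}_t\right) \tag{10}$$
and
$$q(x_{t-1} \mid x_t, x_0) = \mathrm{Cat}\!\left(x_{t-1};\ p = \frac{x_t Q_t^\top \odot x_0 \bar{Q}_{t-1}}{x_0 \bar{Q}_t x_t^\top}\right) \tag{11}$$
Now, one can train a model to directly predict the logits of $p_\theta(x_{t-1} \mid x_t)$. However, many approaches opt for a sampling procedure wherein the model predicts the clean input, $\tilde{p}_\theta(\tilde{x}_0 \mid x_t)$, uses renoising, $q(x_{t-1}, x_t \mid \tilde{x}_0)$, and marginalizes over the one-hot encodings $\tilde{x}_0$:
$$p_\theta(x_{t-1} \mid x_t) \propto \sum_{\tilde{x}_0} q(x_{t-1}, x_t \mid \tilde{x}_0)\,\tilde{p}_\theta(\tilde{x}_0 \mid x_t) \tag{12}$$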
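The quantities of the discrete setting can be computed with a few matrix products. The sketch below follows the formulation of Austin et al. (2021) with one-hot row vectors; the uniform transition matrices, the three states, the noise levels, and the stand-in network output `p_x0` are assumptions for illustration.

```python
import numpy as np

def uniform_transition(beta, k):
    """Transition matrix that keeps a state w.p. (1 - beta), else resamples uniformly."""
    return (1.0 - beta) * np.eye(k) + beta / k * np.ones((k, k))

def joint(x_t, x_0, q_t, q_bar_prev):
    """Vector over x_{t-1} of q(x_{t-1}, x_t | x_0) for one-hot x_t and x_0."""
    return (x_t @ q_t.T) * (x_0 @ q_bar_prev)

def posterior(x_t, x_0, q_t, q_bar_prev):
    """q(x_{t-1} | x_t, x_0): the joint normalized over x_{t-1}."""
    j = joint(x_t, x_0, q_t, q_bar_prev)
    return j / j.sum()

def reverse_probs(x_t, p_x0, q_t, q_bar_prev):
    """Marginalize the joint over a predicted distribution of the clean state."""
    k = len(p_x0)
    unnorm = sum(p_x0[i] * joint(x_t, np.eye(k)[i], q_t, q_bar_prev) for i in range(k))
    return unnorm / unnorm.sum()

k = 3
qs = [uniform_transition(b, k) for b in (0.1, 0.2, 0.3)]
q_bar_prev = qs[0] @ qs[1]        # cumulative transition up to step t - 1 (t = 3)
q_bar_t = q_bar_prev @ qs[2]      # cumulative transition up to step t
x_0, x_t = np.eye(k)[0], np.eye(k)[2]
marginal = x_0 @ q_bar_t          # q(x_t | x_0)
post = posterior(x_t, x_0, qs[2], q_bar_prev)
rev = reverse_probs(x_t, np.array([0.7, 0.2, 0.1]), qs[2], q_bar_prev)
```

Here `p_x0` plays the role of the network prediction for the clean state; in practice it would come from the trained model.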
2.1 Sampling and the approximated function
The whole pipeline of denoising diffusion models has three parts that all work together to generate new samples. The diffusion process is the iterative process of adding small (random) perturbations to the data, which is used to generate training data.
The denoising part of a denoising diffusion pipeline is the sampling process. This process is based on parts of the target distribution that are unknown, intractable, or unfeasible to compute. So, one part of the sampling is a function that is approximated by a (graph) neural network. The exact sampling procedure and the approximated function rely on each other. Depending on the sampling strategies, different objectives are optimized, and the chosen neural network approximates different functions. We want to understand the role of the three components and try to disentangle their influence.
Graph diffusion, as presented in Vignac et al. (2023), is based on the algorithms presented in Ho et al. (2020).
Given a graph , as described in the discrete setting above, the state of each node and edge of the graph is encoded as a one-hot vector. A node can take states . Analogously for each edge.
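A graph neural network is trained to solve a classification task on each node $x_i$ and edge $e_{ij}$, given a noisy graph $G^t$. It optimizes the cross-entropy between the predicted probabilities $\hat{p}^X_i$ and $\hat{p}^E_{ij}$ for each node and edge and the true graph, with $\lambda$ weighting the edge term:
$$l(\hat{p}^G, G) = \sum_{1 \le i \le n} \text{cross-entropy}(x_i, \hat{p}^X_i) + \lambda \sum_{1 \le i, j \le n} \text{cross-entropy}(e_{ij}, \hat{p}^E_{ij}) \tag{13}$$
Once trained, one samples from the reverse process
$$p_\theta(G^{t-1} \mid G^t) = \prod_{1 \le i \le n} p_\theta(x_i^{t-1} \mid G^t) \prod_{1 \le i, j \le n} p_\theta(e_{ij}^{t-1} \mid G^t)$$
which can be estimated from the network predictions:
$$p_\theta(x_i^{t-1} \mid G^t) = \sum_{x \in \mathcal{X}} p_\theta(x_i^{t-1} \mid x_i = x, G^t)\,\hat{p}^X_i(x) \tag{14}$$
where
$$p_\theta(x_i^{t-1} \mid x_i = x, G^t) = \begin{cases} q(x_i^{t-1} \mid x_i = x, x_i^t) & \text{if } q(x_i^t \mid x_i = x) > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$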
As is done for images, where the diffusion is applied to each pixel independently, the diffusion process is not defined on the graph as a whole but independently on edges and nodes. The structural information of the graph is neglected in this step. Instead of a standard neural network that approximates the gradient of points in the data, DiGress uses a graph neural network to solve a classification task on each node and edge. As the sampling also uses the information from the diffusion process, the graph structure is only considered in the learned weights of the graph neural network.
This raises the question of to what extent the diffusion, the graph neural network, or the sampling contributes to good-quality samples. Vignac et al. (2023) suggest a noise process based on the marginal distribution of classes in the training data and show superior sampling quality when using this process. This indicates that the noise process and the information put into it significantly affect the sampling quality.
Other works on images, such as Bansal et al. (2024), claim that noise is unnecessary, showing high-quality samples for deterministic diffusion processes.
Several questions arise regarding the influence of the individual parts of the algorithm's pipeline and their respective biases.
Q1: What is the role of noise in the diffusion denoising pipeline, and do we need it at all? We investigate the importance of the different parts in simulations and give some insights into their role.
Q2: How much influence does the sampling have on the performance? Some works suggest reparametrization in the sampling. While approaches on graphs train to predict the clean graph, other works such as Ho et al. (2020) note that predicting the clean image is less accurate.
Q3: What does the neural network approximate, and do we need the complexity? When solving the "simple" classification task for the graph setting, could a simpler model achieve similar results? How much structural information do we introduce by the iterative sampling procedure, including the forward noise process?
How to approach these questions is not immediately clear, and graphs and graph neural networks introduce an additional degree of complexity. As a starting point, we investigate the three components in a much simpler setting that we can visualize, which allows us to build intuition.
3 Diffusion and denoising in a simple setting
3.0.1 Setup
Consider a set of points in two dimensions originating from some unknown distribution. We want to generate new samples from this distribution, but we cannot sample from it directly because we cannot access it. However, we can train a denoising diffusion model to sample from an approximation of this distribution.
For a simple analysis, we choose a mixture of two Gaussians. Figure 3 shows the density and the score of the chosen ground truth distribution with
Our simulations (see Section 3.1) suggest the following answers to the questions raised in the section above:
(A1) We do not need the noise. Song and Ermon (2019) observe that the gradient approximation is poor in low-density regions of the data and address the problem by adding noise. If we do not introduce any perturbation, we only sufficiently approximate the data gradient close to high-density regions. Noise mitigates the problem by diffusing the training points and leaving no low-density regions. Clearly, too much noise leaves no signal. Hence, the amount of noise added is crucial. However, by iteratively adding tiny perturbations and learning an iterative backward process, we can approximate the time-dependent ground truth distribution even when starting far away from high-density regions. However, the conclusion that we need noise in the sense of randomness is misleading. As long as we manage to cover the space sufficiently, the diffusion process can also be of a deterministic nature. We show experiments on that in Section 3.1.
(A2) Diffusion schedule and sampling process are crucial for the performance.
Unsurprisingly, the diffusion schedule plays an essential role in the proper approximation of the reverse process. Figure 5 visualizes the influence of different schedules for and . The linear schedule leads to faster convergence to a standard normal distribution and thus loses much signal in the first steps. As a result, the later timesteps contain little to no signal and are worthless for training. The cosine schedule results in a smoother transition; thus, later timesteps contain a more valuable signal for the training process. As all timesteps are equally likely to be sampled during training, lower, smoother diffusion is better.
In addition, what exactly is approximated by the neural network significantly influences the performance. Both the distribution and its likelihood follow mathematical rules that are hard to enforce with a neural network. Thus, predicting the likelihood of a data point is challenging. While it is only a reparametrization of the target, locally approximating the score of the likelihood allows the inclusion of additional information about the noise process, seems more accessible, and empirically results in better performance.
(A3) A simple network approximates the data distribution reasonably well.
The network only partially approximates the distribution’s score. However, even though our network’s architecture is simple, and thus, its approximation power is limited, we learn essential features of the ground truth distribution in all three settings.
It is impossible to learn random independent noise. So, the model does not approximate the actual reverse process but the gradient of the distribution in each step. For each point, the model learns a mapping that moves every point closer to a high-density region of the training data.
Aligned with the intuition behind stochastic localization, the network learns the gradient of the distribution for each time step.
3.1 Simulations
3.1.1 Generation process and experimental setup
We generate data from a mixture of two Gaussians. We randomly sample 5,000 points from each of the two distributions, so 10,000 training points overall. The distribution we sample the training data from is visualized in Figure 3.
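A data set of this kind can be generated in a few lines, and for a Gaussian mixture the ground-truth score is available in closed form. The component means and the shared standard deviation below are hypothetical placeholders, not the values used in the experiments.

```python
import numpy as np

MEANS = np.array([[-2.0, -2.0], [2.0, 2.0]])  # hypothetical component means
STD = 1.0                                     # hypothetical shared isotropic std

def sample_mixture(n_per_component, rng):
    """Draw the same number of points from each Gaussian component."""
    parts = [m + STD * rng.standard_normal((n_per_component, 2)) for m in MEANS]
    return np.concatenate(parts)

def mixture_score(x):
    """Gradient of the log-density of the equally weighted mixture at points x."""
    diffs = x[:, None, :] - MEANS[None, :, :]               # (n, 2 components, 2)
    log_w = -0.5 * np.sum(diffs**2, axis=-1) / STD**2       # unnormalized log weights
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))    # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return -np.sum(w[:, :, None] * diffs, axis=1) / STD**2

rng = np.random.default_rng(0)
data = sample_mixture(5000, rng)   # 10,000 training points overall
```

The closed-form score is useful as a reference when inspecting what a trained network has learned.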
For each of the investigated sampling methods (see Figure 4), we train the same neural network architecture: a simple multi-layer perceptron with two relu layers of width 20 and a final linear layer as output.
The model is trained with a batch size of 64 for 50 epochs using the Adam Optimizer from PyTorch. We note that we did not tune the neural network in any way and used the same architecture and hyperparameters for the three different tasks. Given the simplicity of the chosen problem, a comparison on this basis is still fair and justified.
3.1.2 Different Noise Schedules
Different papers observe that the noise schedule used in training can play a crucial role in the performance of the generative model. While the original work of Ho et al. (2020) suggests a linear schedule, recent works usually use the cosine schedule introduced in Nichol and Dhariwal (2021). Empirically, the latter proves to yield better performance. In the linear case, many of the time steps fall into the range where the data is indistinguishable from random noise. In those steps, the training data does not hold enough information for learning. The cosine schedule mitigates this effect and distributes the structural information more smoothly along the time steps. Compare Figures 5(b) and 5(c) for the schedules and Figures 5(d) and 5(e) for a visualization of the respective noising processes. We used the cosine diffusion schedule in all our experiments.
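The difference between the two schedules can be quantified by the fraction of signal that remains at each step. The sketch below uses the linear endpoints from Ho et al. (2020) and the cosine form with offset `s = 0.008` from Nichol and Dhariwal (2021); the choice of 1,000 steps and the 1% threshold for "almost pure noise" are assumptions for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention under a linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal retention under the cosine schedule, with clipped variances."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

T = 1000
linear, cosine = linear_alpha_bar(T), cosine_alpha_bar(T)
# Share of timesteps in which less than 1% of the signal variance is left.
frac_noise_linear = float(np.mean(linear < 0.01))
frac_noise_cosine = float(np.mean(cosine < 0.01))
```

Under these assumptions, a noticeably larger share of the linear schedule's timesteps is spent on nearly pure noise than is the case for the cosine schedule.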
3.1.3 Different Sampling Methods
We consider three different sampling procedures. The target is always, given a data point $x_t$ and a timestep $t$, to predict the state of the data point at the previous timestep, $x_{t-1}$, and then to use an iterative process to sample $x_0$. The most direct way is to train a neural network to directly predict $x_{t-1}$. This is usually done in a variational manner by training to predict the mean of $p(x_{t-1} \mid x_t)$. We call this method single step sampling. One can also reparameterize the sampling and train the network to predict the clean input $x_0$ and then use the knowledge about the diffusion process to get $x_{t-1}$. We call this method whole step sampling. The most sophisticated and commonly used method also uses the information about the diffusion process and reparametrizes the mean in terms of the difference between $x_t$ and $x_0$. As this learns to predict the added noise, we call this method noise sampling.
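The link between predicting the clean input and predicting the noise can be checked numerically. The sketch below uses the Gaussian posterior mean from Ho et al. (2020); the linear schedule, the chosen timestep, and the function names are assumptions, and the true clean point and noise stand in for network predictions.

```python
import numpy as np

def mean_from_x0(x_t, x0_pred, t, betas, alpha_bar):
    """Mean of q(x_{t-1} | x_t, x_0) with x_0 replaced by a prediction."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(1.0 - betas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0_pred + coef_xt * x_t

def mean_from_noise(x_t, eps_pred, t, betas, alpha_bar):
    """The same mean, written in terms of a prediction of the added noise."""
    scaled = betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred
    return (x_t - scaled) / np.sqrt(1.0 - betas[t])

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)     # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)
t = 50
x0 = rng.standard_normal((8, 2))
eps = rng.standard_normal((8, 2))
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
m_whole = mean_from_x0(x_t, x0, t, betas, alpha_bar)     # perfect x_0 prediction
m_noise = mean_from_noise(x_t, eps, t, betas, alpha_bar) # perfect noise prediction
```

With perfect predictions both parametrizations give the same mean, so any difference in sample quality stems from what the network has to learn, not from the sampling formula itself.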
3.1.4 Does the Process invert the diffusion?
We visualize the denoising processes for the different sampling methods in Figures 6 to 9. The figures show time steps t = 99, 54, 36, 18, and 0 for 10,000 data points. The data points for time step 99 are taken from a grid between -7 and 7. The colors in the figures indicate group membership when clustering with a Gaussian Mixture Model on the final time step.
While the noise sampling approximates the ground truth distribution reasonably well, the other sampling methods fail to fit the training distribution. We observe many samples from the low-density region between the two clusters for the single step sampling. The whole step sampling only samples points from very high-density regions and almost collapses to the means of the two Gaussians.
Most surprising is the significantly worse performance of the single step sampling compared to the noise sampling, as this is a simple reparametrization. However, we explicitly add information about the forward diffusion process when sampling the noise relative to the clean data instead of the data point at the previous timestep. This information is hard to infer for the model trained on this data with only implicit access to this information.
Another phenomenon we observe is that the two first sampling methods keep a positional bias, so points closer to in the beginning end up close to in the end. The whole-step sampling methods suffer less from this phenomenon. Also, the model trained with the deterministic diffusion process shows no positional bias. This shows that even though, for the noise sampling, at first glance, one might think the reverse process is approximated, it is not. This is not possible for random noise, so the intuition of reversing the diffusion is misleading. However, this also indicates that the model, to some degree, fits the data gradient. In the following, we do further simulations to investigate what precisely the models learn.
3.1.5 What does the neural network approximate?
Figure 10 provides insights into the neural network’s learning outcomes for the three specific objectives discussed in Section 3.1.3. The network struggles to accurately approximate the score function across various regions in all tasks. This is expected, as the model deals with very noisy data in the first time steps and sparse regions in the later ones. Especially in the low-density regions in the last time step, we cannot expect the model to learn a helpful function, as training data in those areas is limited. The necessity of the iterative sampling process becomes evident in these images: only combining and aggregating the information available at the individual time steps yields a sufficient approximation of the training distribution.
This observation underscores the significant effect of introducing stepwise perturbations into the training process to ensure adequate coverage of low-density regions and the effective learning of data distribution gradients. Song and Ermon (2019) also observe this behavior and argue that noise is the solution for a good approximation of the score function. In the following experiment, we showcase that this is only one perspective.
3.1.6 What about Noise?
If the coverage of low-density regions is the core problem, then the diffusion process, not noise itself, is the key to a good approximation of the score function. We define a deterministic diffusion process that modifies a data point in each step so that the data converges to a normal distribution. For every data point, we take the digits from the $x$-th position after the decimal point onward. These values are approximately uniformly distributed. To go from the uniform distribution to a normal distribution, we map them through the inverse of the normal cdf. The resulting diffusion process when using the cosine schedule is shown in Figure 7.
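One possible reading of this construction is sketched below: the trailing decimal digits of a coordinate serve as a pseudo-uniform number, which is pushed through the inverse normal cdf and then used in place of random noise. The digit offset, the clipping constant, and the form of the update step are assumptions for illustration and may differ from the procedure used in the experiments.

```python
import math
import random
import statistics
from statistics import NormalDist

def deterministic_noise(value, digit_offset=3):
    """Turn the decimal digits of `value` after `digit_offset` places into a pseudo-normal number."""
    shifted = abs(value) * 10 ** digit_offset
    u = shifted - math.floor(shifted)      # remaining digits, a number in [0, 1)
    u = min(max(u, 1e-6), 1.0 - 1e-6)      # keep the inverse cdf finite
    return NormalDist().inv_cdf(u)

def deterministic_step(value, beta, digit_offset=3):
    """Noising step in which the 'noise' is a fixed function of the current value."""
    noise = deterministic_noise(value, digit_offset)
    return math.sqrt(1.0 - beta) * value + math.sqrt(beta) * noise

# The same input always yields the same output, yet over many inputs the
# pseudo-noise is spread approximately like a standard normal variable.
gen = random.Random(0)
pseudo = [deterministic_noise(gen.gauss(0.0, 1.0)) for _ in range(20000)]
pseudo_std = statistics.pstdev(pseudo)
```

Because floating-point numbers carry only about 15 to 16 significant digits, the offset has to stay small for the extracted digits to be meaningful.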
We train the same model with the same hyperparameters as before and observe similar behavior and performance on the three sampling methods. Figure 7 shows the learned trajectories for the reparametrized denoising process.
We conclude that diffusion is necessary to cover the whole space. However, we can do this in an unnoisy way. If we could construct a deterministic diffusion process that is also invertible, we could achieve perfect recovery of training data while still being able to sample new data points.
While Bansal et al. (2024) also observe good performance for deterministic diffusion processes, they consider a very different sampling setting, and thus their insights do not translate to our setting: they do not aim to approximate the score of the data but the data point at the previous timestep. So, their target is not the approximate score. Approximating the score would be difficult in low-density regions and would not work using their “diffusion” processes.
4 What’s next
In considering the future directions for our research, several intriguing questions emerge, separating into two overarching areas.
4.1 Noise vs. No Noise
If we decide we do not need any noise in the diffusion pipeline, what are the benefits and drawbacks of using it? An essential consideration is the computational efficiency of computing the diffusion deterministically. This is computationally more expensive in our current approach and prompts evaluating whether the computational overhead is justified and what advantages deterministic diffusion may bring.
Additionally, exploring the implications of training on deterministic data remains an exciting question. The prospect of achieving perfect recovery of training data through deterministic training raises the question: is this even desired as the goal of these models is to generate new data? What are the limitations or (unwanted) biases introduced to the model compared to random noise?
4.2 Take it back to graphs
Shifting our focus back to the domain of graphs introduces a distinctive set of challenges and considerations. Unlike images, which are essentially high-dimensional vectors, graphs encapsulate diverse and heterogeneous forms of information. Notably, distributions over molecules present challenges regarding description and analysis. The central question at hand involves understanding what the "score function" captures in graph data and critically assessing how well we are approximating the underlying distribution.
Delving deeper, a key question is understanding what information our model learns about graphs. Is the structural information captured sufficiently even though it is not explicitly included in the diffusion process? Determining what the model learns well and what it might miss is crucial, especially considering how complex graph data can be.
Moreover, the unique way graph sampling is introduced in Vignac et al. (2023) calls for further exploration. Figuring out why predicting a clean graph works better than predicting a clean image could help us improve our understanding of the model and the model itself, especially when dealing with different data types.
References
- Austin et al. (2021) J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg. Structured denoising diffusion models in discrete state-spaces. 2021.
- Bansal et al. (2024) A. Bansal, E. Borgnia, H.-M. Chu, J. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. Neural Information Processing Systems (NeurIPS), 2024.
- Haefeli et al. (2022) K. K. Haefeli, K. Martinkus, N. Perraudin, and R. Wattenhofer. Diffusion models for graphs benefit from discrete state spaces. NeurIPS Workshop on New Frontiers in Graph Learning, 2022.
- Ho et al. (2020) J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Neural Information Processing Systems (NeurIPS), 2020.
- Lim et al. (2023) J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, et al. Score-based diffusion models in function space. arXiv preprint arXiv:2302.07400, 2023.
- Montanari (2023) A. Montanari. Sampling, diffusions, and stochastic localization. arXiv preprint arXiv:2305.10690, 2023.
- Nichol and Dhariwal (2021) A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. International Conference on Machine Learning, 2021.
- Sohl-Dickstein et al. (2015) J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (ICML), 2015.
- Song and Ermon (2019) Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Neural Information Processing Systems (NeurIPS), 2019.
- Song et al. (2021) Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations (ICLR), 2021.
- Vignac et al. (2023) C. Vignac, I. Krawczuk, A. Siraudin, B. Wang, V. Cevher, and P. Frossard. Digress: Discrete denoising diffusion for graph generation. International Conference on Learning Representations, 2023.