
Differentiable Cost-Parameterized Monge Map Estimators

Authors: Samuel Howard, George Deligiannidis, Patrick Rebeschini, James Thornton

Abstract: Within the field of optimal transport (OT), the choice of ground cost is crucial to ensuring that the optimality of a transport map corresponds to usefulness in real-world applications. It is therefore desirable to use known information to tailor cost functions and hence learn OT maps which are adapted to the problem at hand. By considering a class of neural ground costs whose Monge maps have a known form, we construct a differentiable Monge map estimator which can be optimized to be consistent with known information about an OT map. In doing so, we simultaneously learn both an OT map estimator and a corresponding adapted cost function. Through suitable choices of loss function, our method provides a general approach for incorporating prior information about the Monge map itself when learning adapted OT maps and cost functions.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2404.17442 [pdf, ps, other]

Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets

Authors: Benjamin Dupuis, Paul Viallard, George Deligiannidis, Umut Simsekli

Abstract: We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on `random sets' in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds, which can be applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2402.06320 [pdf, other]

Particle Denoising Diffusion Sampler

Authors: Angus Phillips, Hai-Dang Dau, Michael John Hutchinson, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet

Abstract: Denoising diffusion models have become ubiquitous for generative modeling. The core idea is to transport the data distribution to a Gaussian by using a diffusion. Approximate samples from the data distribution are then obtained by estimating the time-reversal of this diffusion using score matching ideas. We follow here a similar strategy to sample from unnormalized probability densities and compute their normalizing constants. However, the time-reversed diffusion is here simulated by using an original iterative particle scheme relying on a novel score matching loss. Contrary to standard denoising diffusion models, the resulting Particle Denoising Diffusion Sampler (PDDS) provides asymptotically consistent estimates under mild assumptions. We demonstrate PDDS on multimodal and high dimensional sampling tasks.

Submitted 15 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: To be published in ICML 2024. 37 pages, 20 figures, 3 tables, 5 algorithms

arXiv:2308.03686 [pdf, other]

Nearly $d$-Linear Convergence Bounds for Diffusion Models via Stochastic Localization

Authors: Joe Benton, Valentin De Bortoli, Arnaud Doucet, George Deligiannidis

Abstract: Denoising diffusions are a powerful method to generate approximate samples from high-dimensional data distributions. Recent results provide polynomial bounds on their convergence rate, assuming $L^2$-accurate scores. Until now, the tightest bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/δ)}{\varepsilon^2})$ steps to approximate an arbitrary distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $δ$ to within $\varepsilon^2$ in KL divergence. Our proof extends the Girsanov-based methods of previous works. We introduce a refined treatment of the error from discretizing the reverse SDE inspired by stochastic localization.

Submitted 5 March, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

arXiv:2306.07254 [pdf, other]

On the Expected Size of Conformal Prediction Sets

Authors: Guneet S. Dhillon, George Deligiannidis, Tom Rainforth

Abstract: While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.

Submitted 8 March, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
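
For readers unfamiliar with the split conformal prediction framework named in the abstract above, the following is a minimal sketch for regression, assuming absolute-residual nonconformity scores. It is not the paper's estimator of expected set size; the function name `split_conformal_interval` and its parameters are illustrative assumptions.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Return a (lower, upper) prediction interval for one test prediction.

    Assumes a model already fitted on a separate training split, so that
    cal_preds are its predictions on a held-out calibration split.
    """
    # Nonconformity scores: absolute residuals on the calibration split.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set is the whole line.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    # The interval has width 2 * q, which is the "set size" in this setting.
    return (test_pred - q, test_pred + q)
```

Under exchangeability of calibration and test points, an interval built this way covers the true target with probability at least 1 - alpha; its width depends on the calibration residuals, which is the quantity whose expectation the abstract refers to.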

arXiv:2305.19638 [pdf, other]

A Unified Framework for U-Net Design and Analysis

Authors: Christopher Williams, Fabian Falck, George Deligiannidis, Chris Holmes, Arnaud Doucet, Saifuddin Syed

Abstract: U-Nets are a go-to, state-of-the-art neural architecture across numerous tasks for continuous signals on a square such as images and Partial Differential Equations (PDE), however their design and architecture is understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and design natural, scalable neural architectures for a multitude of problems beyond the square.

Submitted 10 January, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.16860 [pdf, ps, other]

Error Bounds for Flow Matching Methods

Authors: Joe Benton, George Deligiannidis, Arnaud Doucet

Abstract: Score-based generative models are a popular class of generative modelling techniques relying on stochastic differential equations (SDE). From their inception, it was realized that it was also possible to perform generation using ordinary differential equations (ODE) rather than SDE. This led to the introduction of the probability flow ODE approach and denoising diffusion implicit models. Flow matching methods have recently further extended these ODE-based approaches and approximate a flow between two arbitrary probability distributions. Previous work derived bounds on the approximation error of diffusion models under the stochastic sampling regime, given assumptions on the $L^2$ loss. We present error bounds for the flow matching procedure using fully deterministic sampling, assuming an $L^2$ bound on the approximation error and a certain regularity condition on the data distributions.

Submitted 11 February, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.02766 [pdf, other]

Generalization Bounds with Data-dependent Fractal Dimensions

Authors: Benjamin Dupuis, George Deligiannidis, Umut Şimşekli

Abstract: Providing generalization guarantees for modern neural networks has been a crucial task in statistical learning. Recently, several studies have attempted to analyze the generalization error in such settings by using tools from fractal geometry. While these works have successfully introduced new mathematical tools to apprehend generalization, they heavily rely on a Lipschitz continuity assumption, which in general does not hold for neural networks and might make the bounds vacuous. In this work, we address this issue and prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption. To achieve this goal, we build up on a classical covering argument in learning theory and introduce a data-dependent fractal dimension. Despite introducing a significant amount of technical complications, this new notion lets us control the generalization error (over either fixed or random hypothesis spaces) along with certain mutual information (MI) terms. To provide a clearer interpretation to the newly introduced MI terms, as a next step, we introduce a notion of "geometric stability" and link our bounds to the prior art. Finally, we make a rigorous connection between the proposed data-dependent dimension and topological data analysis tools, which then enables us to compute the dimension in a numerically efficient way. We support our theory with experiments conducted on various settings.

Submitted 10 July, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Journal ref: International Conference on Machine Learning (ICML 2023)

arXiv:2301.08187 [pdf, other]

A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs

Authors: Fabian Falck, Christopher Williams, Dominic Danks, George Deligiannidis, Christopher Yau, Chris Holmes, Arnaud Doucet, Matthew Willetts

Abstract: U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.

Submitted 19 January, 2023; originally announced January 2023.

Comments: NeurIPS 2022 (selected as oral)

arXiv:2211.03595 [pdf, other]

From Denoising Diffusions to Denoising Markov Models

Authors: Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet

Abstract: Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. The denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces and leading to an original extension of score matching. We illustrate the resulting models on various applications.

Submitted 18 February, 2024; v1 submitted 7 November, 2022; originally announced November 2022.

arXiv:2209.02525 [pdf, other]

Generalisation under gradient descent via deterministic PAC-Bayes

Authors: Eugenio Clerico, Tyler Farghly, George Deligiannidis, Benjamin Guedj, Arnaud Doucet

Abstract: We establish disintegrated PAC-Bayesian generalisation bounds for models trained with gradient descent methods or continuous gradient flows. Contrary to standard practice in the PAC-Bayesian setting, our result applies to optimisation algorithms that are deterministic, without requiring any de-randomisation step. Our bounds are fully computable, depending on the density of the initial distribution and the Hessian of the training objective over the trajectory. We show that our framework can be applied to a variety of iterative optimisation algorithms, including stochastic gradient descent (SGD), momentum-based schemes, and damped Hamiltonian dynamics.

Submitted 4 April, 2023; v1 submitted 6 September, 2022; originally announced September 2022.

arXiv:2207.00109 [pdf, other]

Ranking In Generalized Linear Bandits

Authors: Amitis Shidani, George Deligiannidis, Arnaud Doucet

Abstract: We study the ranking problem in generalized linear bandits. At each time, the learning agent selects an ordered list of items and observes stochastic outcomes. In recommendation systems, displaying an ordered list of the most attractive items is not always optimal as both position and item dependencies result in a complex reward function. A very naive example is the lack of diversity when all the most attractive items are from the same category. We model the position and item dependencies in the ordered list and design UCB and Thompson Sampling type algorithms for this problem. Our work generalizes existing studies in several directions, including position dependencies where position discount is a particular case, and connecting the ranking problem to graph theory.

Submitted 1 January, 2024; v1 submitted 30 June, 2022; originally announced July 2022.

Journal ref: AAAI 2024 Workshop on Recommendation Ecosystems: Modeling, Optimization and Incentive Design

arXiv:2205.14987 [pdf, other]

A Continuous Time Framework for Discrete Denoising Models

Authors: Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, Arnaud Doucet

Abstract: We provide the first complete continuous time framework for denoising diffusion models of discrete data. This is achieved by formulating the forward noising process and corresponding reverse time generative process as Continuous Time Markov Chains (CTMCs). The model can be efficiently trained using a continuous time version of the ELBO. We simulate the high dimensional CTMC using techniques developed in chemical physics and exploit our continuous time framework to derive high performance samplers that we show can outperform discrete time methods for discrete data. The continuous time treatment also enables us to derive a novel theoretical result bounding the error between the generated sample distribution and the true data distribution.

Submitted 14 October, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: 44 pages, 15 figures; NeurIPS 2022

arXiv:2203.00977 [pdf, ps, other]

Chained Generalisation Bounds

Authors: Eugenio Clerico, Amitis Shidani, George Deligiannidis, Arnaud Doucet

Abstract: This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.

Submitted 30 June, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

Journal ref: Proceedings of the 35th Conference on Learning Theory, PMLR 178:4212-4257, 2022

arXiv:2203.00554 [pdf, other]

Neural Score Matching for High-Dimensional Causal Inference

Authors: Oscar Clivio, Fabian Falck, Brieuc Lehmann, George Deligiannidis, Chris Holmes

Abstract: Traditional methods for matching in causal inference are impractical for high-dimensional datasets. They suffer from the curse of dimensionality: exact matching and coarsened exact matching find exponentially fewer matches as the input dimension grows, and propensity score matching may match highly unrelated units together. To overcome this problem, we develop theoretical results which motivate the use of neural networks to obtain non-trivial, multivariate balancing scores of a chosen level of coarseness, in contrast to the classical, scalar propensity score. We leverage these balancing scores to perform matching for high-dimensional causal inference and call this procedure neural score matching. We show that our method is competitive against other matching approaches on semi-synthetic high-dimensional datasets, both in terms of treatment effect estimation and reducing imbalance.

Submitted 1 March, 2022; originally announced March 2022.

Comments: To appear in AISTATS 2022

arXiv:2202.13460 [pdf, other]

Conditional Simulation Using Diffusion Schrödinger Bridges

Authors: Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet

Abstract: Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schrödinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend the Schrödinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution, optimal filtering for state-space models and the refinement of pre-trained networks. Our code can be found at https://github.com/vdeborto/cdsb.

Submitted 26 June, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

Comments: 29 pages, 15 figures. UAI 2022 camera-ready version

arXiv:2112.00565 [pdf, other]

On Mixing Times of Metropolized Algorithm With Optimization Step (MAO) : A New Framework

Authors: EL Mahdi Khribch, George Deligiannidis, Daniel Paulin

Abstract: In this paper, we consider sampling from a class of distributions with thin tails supported on $\mathbb{R}^d$ and make two primary contributions. First, we propose a new Metropolized Algorithm With Optimization Step (MAO), which is well suited for such targets. Our algorithm is capable of sampling from distributions where the Metropolis-adjusted Langevin algorithm (MALA) is not converging or lacking in theoretical guarantees. Second, we derive upper bounds on the mixing time of MAO. Our results are supported by simulations on multiple target distributions.

Submitted 1 December, 2021; originally announced December 2021.

Comments: 24 pages, 27 Figures, 4 Tables

MSC Class: 65C05; 65C40 ACM Class: G.3

arXiv:2110.11886 [pdf, other]

Conditionally Gaussian PAC-Bayes

Authors: Eugenio Clerico, George Deligiannidis, Arnaud Doucet

Abstract: Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.

Submitted 24 February, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

Journal ref: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:2311-2329, 2022

arXiv:2108.08129 [pdf, ps, other]

Quantitative Uniform Stability of the Iterative Proportional Fitting Procedure

Authors: George Deligiannidis, Valentin De Bortoli, Arnaud Doucet

Abstract: We establish the uniform in time stability, w.r.t. the marginals, of the Iterative Proportional Fitting Procedure, also known as Sinkhorn algorithm, used to solve entropy-regularised Optimal Transport problems. Our result is quantitative and stated in terms of the 1-Wasserstein metric. As a corollary we establish a quantitative stability result for Schrödinger bridges.

Submitted 22 October, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

Comments: 14 pages
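
As background for the abstract above, the following is a minimal sketch of the Iterative Proportional Fitting Procedure (Sinkhorn algorithm) for discrete entropy-regularised optimal transport. It illustrates the iteration only and does not reproduce the paper's stability analysis; the function name `sinkhorn`, the regularisation parameter `eps`, and the fixed iteration count are illustrative assumptions.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, n_iters=200):
    """Approximate the entropy-regularised OT plan between histograms a and b.

    a, b: lists of positive weights, each summing to 1.
    cost: len(a) x len(b) nested list of ground costs.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-cost_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan matches the first marginal a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan matches the second marginal b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each pass alternately fits one marginal exactly, which is why the result is stated "w.r.t. the marginals": the question is how the iterates change when `a` and `b` are perturbed. Very small values of `eps` can underflow the kernel in this naive form; log-domain variants are commonly used to avoid that.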

arXiv:2106.09798 [pdf, other]

Wide stochastic networks: Gaussian limit and PAC-Bayesian training

Authors: Eugenio Clerico, George Deligiannidis, Arnaud Doucet

Abstract: The limit of infinite width allows for substantial simplifications in the analytical study of over-parameterised neural networks. With a suitable random initialisation, an extremely large network exhibits an approximately Gaussian behaviour. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables, holding both before and during training. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimises the generalisation bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.

Submitted 13 February, 2023; v1 submitted 17 June, 2021; originally announced June 2021.

Journal ref: The 34th International Conference on Algorithmic Learning Theory (ALT 2023)

arXiv:2106.04881 [pdf, other]

Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

Authors: Alexander Camuto, George Deligiannidis, Murat A. Erdogdu, Mert Gürbüzbalaban, Umut Şimşekli, Lingjiong Zhu

Abstract: Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the `complexity' of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent -- SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden-layered neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks.

Submitted 9 June, 2021; originally announced June 2021.

Comments: 34 pages including Supplement, 4 Figures

arXiv:2102.07850 [pdf, other]

Differentiable Particle Filtering via Entropy-Regularized Optimal Transport

Authors: Adrien Corenflos, James Thornton, George Deligiannidis, Arnaud Doucet

Abstract: Particle Filtering (PF) methods are an established class of procedures for performing inference in non-linear state-space models. Resampling is a key ingredient of PF, necessary to obtain low variance likelihood and states estimates. However, traditional resampling methods result in PF-based loss functions being non-differentiable with respect to model and PF parameters. In a variational inference context, resampling also yields high variance gradient estimates of the PF-based evidence lower bound. By leveraging optimal transport ideas, we introduce a principled differentiable particle filter and provide convergence results. We demonstrate this novel method on a variety of applications.

Submitted 30 June, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 9 pages of content + 11 pages supplementary, accepted for oral at ICML 2021

arXiv:2010.12859 [pdf, other]

Stable ResNet

Authors: Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, Judith Rousseau

Abstract: Deep ResNet architectures have achieved state of the art performance on many tasks. While they solve the problem of gradient vanishing, they might suffer from gradient exploding as the depth becomes large (Yang et al. 2017). Moreover, recent results have shown that ResNet might lose expressivity as the depth goes to infinity (Yang et al. 2017, Hayou et al. 2019). To resolve these issues, we introduce a new class of ResNet architectures, called Stable ResNet, that have the property of stabilizing the gradient while ensuring expressivity in the infinite depth limit.

Submitted 18 March, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

Comments: 43 pages, 4 figures

arXiv:2006.09313 [pdf, other]

doi 10.1088/1742-5468/ac3ae7

Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks

Authors: Umut Şimşekli, Ozan Sener, George Deligiannidis, Murat A. Erdogdu

Abstract: Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed light over several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a \emph{Feller process}, which defines a rich class of Markov processes that include several recent SDE representations (both Brownian or heavy-tailed) as its special case. We show that the generalization error can be controlled by the \emph{Hausdorff dimension} of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of "capacity metric". We support our theory with experiments on deep neural networks illustrating that the proposed capacity metric accurately estimates the generalization error, and it does not necessarily grow with the number of parameters unlike the existing capacity metrics in the literature.

Submitted 22 May, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Published at NeurIPS 2020 (Spotlight) -- an imprecision in Definition 2 and a mistake in the statement and the proof of Theorem 2 are fixed

arXiv:1909.13833 [pdf, other]

Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows

Authors: Rob Cornish, Anthony L. Caterini, George Deligiannidis, Arnaud Doucet

Abstract: We show that normalising flows become pathological when used to model targets whose supports have complicated topologies. In this scenario, we prove that a flow must become arbitrarily numerically noninvertible in order to approximate the target closely. This result has implications for all flow-based models, and especially Residual Flows (ResFlows), which explicitly control the Lipschitz constant of the bijection used. To address this, we propose Continuously Indexed Flows (CIFs), which replace the single bijection used by normalising flows with a continuously indexed family of bijections, and which can intuitively "clean up" mass that would otherwise be misplaced by a single bijection. We show theoretically that CIFs are not subject to the same topological limitations as normalising flows, and obtain better empirical performance on a variety of models and benchmarks.

Submitted 23 April, 2021; v1 submitted 30 September, 2019; originally announced September 2019.

Comments: Minor revision

arXiv:1901.09881 [pdf, other]

Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets

Authors: Robert Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, Arnaud Doucet

Abstract: Bayesian inference via standard Markov Chain Monte Carlo (MCMC) methods is too computationally intensive to handle large datasets, since the cost per step usually scales like $Θ(n)$ in the number of data points $n$. We propose the Scalable Metropolis-Hastings (SMH) kernel that exploits Gaussian concentration of the posterior to require processing on average only $O(1)$ or even $O(1/\sqrt{n})$ data points per step. This scheme is based on a combination of factorized acceptance probabilities, procedures for fast simulation of Bernoulli processes, and control variate ideas. Contrary to many MCMC subsampling schemes such as fixed step-size Stochastic Gradient Langevin Dynamics, our approach is exact insofar as the invariant distribution is the true posterior and not an approximation to it. We characterise the performance of our algorithm theoretically, and give realistic and verifiable conditions under which it is geometrically ergodic. This theory is borne out by empirical results that demonstrate overall performance benefits over standard Metropolis-Hastings and various subsampling algorithms.

Submitted 10 June, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

Showing 1–26 of 26 results for author: Deligiannidis, G