Tilting the Odds at the Lottery: the Interplay of Overparameterisation and Curricula in Neural Networks

Authors: Stefano Sarao Mannelli, Yaraslau Ivashinka, Andrew Saxe, Luca Saglietti

Abstract: A wide range of empirical and theoretical works have shown that overparameterisation can amplify the performance of neural networks. According to the lottery ticket hypothesis, overparameterised networks have an increased chance of containing a sub-network that is well-initialised to solve the task at hand. A more parsimonious approach, inspired by animal learning, consists in guiding the learner towards solving the task by curating the order of the examples, i.e. providing a curriculum. However, this learning strategy seems to be hardly beneficial in deep learning applications. In this work, we undertake an analytical study that connects curriculum learning and overparameterisation. In particular, we investigate their interplay in the online learning setting for a 2-layer network in the XOR-like Gaussian Mixture problem. Our results show that a high degree of overparameterisation -- while simplifying the problem -- can limit the benefit from curricula, providing a theoretical account of the ineffectiveness of curricula in deep learning.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Accepted to ICML 2024

arXiv:2405.18296

Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training

Authors: Anchit Jain, Rozhin Nobahari, Aristide Baratin, Stefano Sarao Mannelli

Abstract: Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2306.10404

The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

Authors: Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe

Abstract: Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.

Submitted 2 September, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

Comments: 10 pages, 7 figures, Preprint

arXiv:2303.01429

Optimal transfer protocol by incremental layer defrosting

Authors: Federica Gerace, Diego Doimo, Stefano Sarao Mannelli, Luca Saglietti, Alessandro Laio

Abstract: Transfer learning is a powerful tool enabling model training with limited amounts of data. This technique is particularly useful in real-world problems where data availability is often a serious limitation. The simplest transfer learning protocol is based on "freezing" the feature-extractor layers of a network pre-trained on a data-rich source task, and then adapting only the last layers to a data-poor target task. This workflow is based on the assumption that the feature maps of the pre-trained model are qualitatively similar to the ones that would have been learned with enough data on the target task. In this work, we show that this protocol is often sub-optimal, and the largest performance gain may be achieved when smaller portions of the pre-trained network are kept frozen. In particular, we make use of a controlled framework to identify the optimal transfer depth, which turns out to depend non-trivially on the amount of available training data and on the degree of source-target task correlation. We then characterize transfer optimality by analyzing the internal representations of two networks trained from scratch on the source and the target task through multiple established similarity measures.

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2205.15935

Bias-inducing geometries: an exactly solvable data model with fairness implications

Authors: Stefano Sarao Mannelli, Federica Gerace, Negar Rostamzadeh, Luca Saglietti

Abstract: Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In the present work, we aim at clarifying the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Despite the simplicity of the data model, we retrace and unpack typical unfairness behaviour observed on real-world datasets. We also obtain a detailed analytical characterisation of a class of bias mitigation strategies. We first consider a basic loss-reweighing scheme, which allows for an implicit minimisation of different unfairness metrics, and quantify the incompatibilities between some existing fairness criteria. Then, we consider a novel mitigation strategy based on a matched inference approach, consisting in the introduction of coupled learning models. Our theoretical analysis of this approach shows that the coupled strategy can strike superior fairness-accuracy trade-offs.

Submitted 13 November, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

Comments: 9 pages + methods + SI

arXiv:2205.09029

Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation

Authors: Sebastian Lee, Stefano Sarao Mannelli, Claudia Clopath, Sebastian Goldt, Andrew Saxe

Abstract: Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.

Submitted 18 May, 2022; originally announced May 2022.

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:12455-12477 (2022)

arXiv:2106.08068

doi 10.1088/1742-5468/ac9b3c

An Analytical Theory of Curriculum Learning in Teacher-Student Networks

Authors: Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe

Abstract: In humans and animals, curriculum learning -- presenting data in a curated order -- is critical to rapid learning and effective pedagogy. Yet in machine learning, curricula are not widely used and empirically often yield only moderate benefits. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. Curricula could in principle change both the learning speed and asymptotic performance of a model. To study the former, we provide an exact description of the online learning setting, confirming the long-standing experimental observation that curricula can modestly speed up learning. To study the latter, we derive performance in a batch learning setting, in which a network trains to convergence in successive phases of learning on dataset slices of varying difficulty. With standard training losses, curriculum does not provide generalisation benefit, in line with empirical observations. However, we show that by connecting different learning phases through simple Gaussian priors, curriculum can yield a large improvement in test performance. Taken together, our reduced analytical descriptions help reconcile apparently conflicting empirical results and trace regimes where curriculum learning yields the largest gains. More broadly, our results suggest that fully exploiting a curriculum may require explicit changes to the loss function at curriculum boundaries.

Submitted 12 October, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: Accepted to NeurIPS 2022

arXiv:2106.05418

doi 10.1088/2632-2153/ac4f3f

Probing transfer learning with a model of synthetic correlated datasets

Authors: Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, Lenka Zdeborová

Abstract: Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets. This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization.

Submitted 2 February, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

Journal ref: Machine Learning: Science and Technology 3.1 (2022): 015030

arXiv:2102.11755

Analytical Study of Momentum-Based Acceleration Methods in Paradigmatic High-Dimensional Non-Convex Problems

Authors: Stefano Sarao Mannelli, Pierfrancesco Urbani

Abstract: The optimization step in many machine learning problems rarely relies on vanilla gradient descent but it is common practice to use momentum-based accelerated methods. Despite these algorithms being widely applied to arbitrary loss functions, their behaviour in generically non-convex, high dimensional landscapes is poorly understood. In this work, we use dynamical mean field theory techniques to describe analytically the average dynamics of these methods in a prototypical non-convex model: the (spiked) matrix-tensor model. We derive a closed set of equations that describe the behaviour of heavy-ball momentum and Nesterov acceleration in the infinite dimensional limit. By numerical integration of these equations, we observe that these methods speed up the dynamics but do not improve the algorithmic threshold with respect to gradient descent in the spiked model.

Submitted 27 October, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: To appear in NeurIPS 2021

arXiv:2009.09422

doi 10.1073/pnas.2106548118

Epidemic mitigation by statistical inference from contact tracing data

Authors: Antoine Baker, Indaco Biazzo, Alfredo Braunstein, Giovanni Catania, Luca Dall'Asta, Alessandro Ingrosso, Florent Krzakala, Fabio Mazza, Marc Mézard, Anna Paola Muntoni, Maria Refinetti, Stefano Sarao Mannelli, Lenka Zdeborová

Abstract: Contact-tracing is an essential tool in order to mitigate the impact of a pandemic such as COVID-19. In order to achieve efficient and scalable contact-tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing their performance and assessing their impact on the mitigation of the epidemic. We develop Bayesian inference methods to estimate the risk that an individual is infected. This inference is based on the list of his recent contacts and their own risk levels, as well as personal information such as results of tests or presence of syndromes. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our results show that in some range of epidemic spreading (typically when the manual tracing of all contacts of infected people becomes practically impossible, but before the fraction of infected people reaches the scale where a lock-down becomes unavoidable), this inference of individuals at risk could be an efficient way to mitigate the epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact. Such communication may be encrypted and anonymized and thus compatible with privacy preserving standards. We conclude that probabilistic risk estimation is capable of enhancing the performance of digital contact tracing and should be considered in the currently developed mobile applications.

Submitted 20 September, 2020; originally announced September 2020.

Comments: 21 pages, 7 figures

ACM Class: G.3; G.4; I.2.11; J.3

Journal ref: PNAS 2021 Vol. 118 No. 32 e2106548118

arXiv:2007.13483

Post-Workshop Report on Science meets Engineering in Deep Learning, NeurIPS 2019, Vancouver

Authors: Levent Sagun, Caglar Gulcehre, Adriana Romero, Negar Rostamzadeh, Stefano Sarao Mannelli

Abstract: Science meets Engineering in Deep Learning took place in Vancouver as part of the Workshop section of NeurIPS 2019. As organizers of the workshop, we created the following report in an attempt to isolate emerging topics and recurring themes that have been presented throughout the event. Deep learning can still be a complex mix of art and engineering despite its tremendous success in recent years. The workshop aimed at gathering people across the board to address seemingly contrasting challenges in the problems they are working on. As part of the call for the workshop, particular attention has been given to the interdependence of architecture, data, and optimization that gives rise to an enormous landscape of design and performance intricacies that are not well-understood. This year, our goal was to emphasize the following directions in our community: (i) identify obstacles in the way to better models and algorithms; (ii) identify the general trends from which we would like to build scientific and potentially theoretical understanding; and (iii) the rigorous design of scientific experiments and experimental protocols whose purpose is to resolve and pinpoint the origin of mysteries while ensuring reproducibility and robustness of conclusions. In the event, these topics emerged and were broadly discussed, matching our expectations and paving the way for new studies in these directions. While we acknowledge that the text is naturally biased as it comes through our lens, here we present an attempt to do a fair job of highlighting the outcome of the workshop.

Submitted 29 July, 2020; v1 submitted 25 June, 2020; originally announced July 2020.

Comments: Report of NeurIPS 2019 workshop SEDL

arXiv:2006.15459

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Authors: Stefano Sarao Mannelli, Eric Vanden-Eijnden, Lenka Zdeborová

Abstract: We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with quadratic activation function in the over-parametrized regime where the layer width $m$ is larger than the input dimension $d$. We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width $m^*\le m$. We describe how the empirical loss landscape is affected by the number $n$ of data samples and the width $m^*$ of the teacher network. In particular we determine how the probability that there be no spurious minima on the empirical loss depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments.

Submitted 18 August, 2020; v1 submitted 27 June, 2020; originally announced June 2020.

Comments: 10 pages, 4 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v33, page 13445--13455, 2020

arXiv:2006.13395

Winning the competition: enhancing counter-contagion in SIS-like epidemic processes

Authors: Argyris Kalogeratos, Stefano Sarao Mannelli

Abstract: In this paper we consider the epidemic competition between two generic diffusion processes, where each competing side is represented by a different state of a stochastic process. For this setting, we present the Generalized Largest Reduction in Infectious Edges (gLRIE) dynamic resource allocation strategy to advantage the preferred state against the other. Motivated by social epidemics, we apply this method to a generic continuous-time SIS-like diffusion model where we allow for: i) arbitrary node transition rate functions that describe the dynamics of propagation depending on the network state, and ii) competition between the healthy (positive) and infected (negative) states, which are both diffusive at the same time, yet mutually exclusive on each node. Finally we use simulations to compare empirically the proposed gLRIE against competitive approaches from literature.

Submitted 23 June, 2020; originally announced June 2020.

Comments: 4 pages, 3 figures, linked to external 6-page Appendix

ACM Class: G.3; I.6; I.2.8; J.4

arXiv:2006.06997

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability to find good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.

Submitted 12 June, 2020; originally announced June 2020.

Comments: 9 pages, 5 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v22, page 3265--327, 2020

arXiv:2001.00479

doi 10.1088/1742-5468/ab7123

Thresholds of descending algorithms in inference problems

Authors: Stefano Sarao Mannelli, Lenka Zdeborová

Abstract: We review recent works on analyzing the dynamics of gradient-based algorithms in a prototypical statistical inference problem. Using methods and insights from the physics of glassy systems, these works showed how to understand quantitatively and qualitatively the performance of gradient-based algorithms. Here we review the key results and their interpretation in non-technical terms accessible to a wide audience of physicists in the context of related works.

Submitted 4 January, 2020; v1 submitted 2 January, 2020; originally announced January 2020.

Comments: 8 pages, 4 figures

Journal ref: J. Stat. Mech. (2020) 034004

arXiv:1907.08226

Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Lenka Zdeborová

Abstract: Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.

Submitted 20 January, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

Comments: 9 pages, 4 figures + appendix. Appears in Proceedings of the Advances in Neural Information Processing Systems 2019 (NeurIPS 2019)

Journal ref: Advances in Neural Information Processing Systems, pp. 8676-8686. 2019

arXiv:1902.00139

Passed & Spurious: Descent Algorithms and Local Minima in Spiked Matrix-Tensor Models

Authors: Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: In this work we analyse quantitatively the interplay between the loss landscape and performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate trivialization of the landscape at large signal-to-noise ratios. We evaluate in a closed form the performance of a gradient flow algorithm using integro-differential PDEs as developed in physics of disordered systems for the Langevin dynamics. We analyze the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing the above results: while we observe a drastic slow down of the gradient flow dynamics even in the region where the landscape is trivial, both the analyzed algorithms are shown to perform well even in the part of the region of parameters where spurious local minima are present.

Submitted 20 January, 2020; v1 submitted 31 January, 2019; originally announced February 2019.

Comments: 12 pages + appendix, 10 figures. Appears in Proceedings of the International Conference on Machine Learning (ICML 2019)

Journal ref: International Conference on Machine Learning, 4333-4342 (ICML 2019)

arXiv:1812.09066

doi 10.1103/PhysRevX.10.011057

Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, that uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.

Submitted 13 January, 2020; v1 submitted 21 December, 2018; originally announced December 2018.

Comments: 11 pages and 5 figures + appendix

Journal ref: Phys. Rev. X 10, 011057 (2020)

Showing 1–18 of 18 results for author: Mannelli, S S