Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Aryo Lotfi

Abstract: Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling distribution can help neural networks learn parities, with formal results only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first, can learn parities of sufficiently large degree, while any fully connected neural network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.

Submitted 29 June, 2023; originally announced June 2023.

Comments: 34 pages, 8 figures

arXiv:2306.16255

Theory and applications of the Sum-Of-Squares technique

Authors: Francis Bach, Elisabetta Cornacchia, Luca Pesce, Giovanni Piccioli

Abstract: The Sum-of-Squares (SOS) approximation method is a technique used in optimization problems to derive lower bounds on the optimal value of an objective function. By representing the objective function as a sum of squares in a feature space, the SOS method transforms non-convex global optimization problems into solvable semidefinite programs. This note presents an overview of the SOS method. We start with its application in finite-dimensional feature spaces and, subsequently, we extend it to infinite-dimensional feature spaces using reproducing kernels (k-SOS). Additionally, we highlight the utilization of SOS for estimating some relevant quantities in information theory, including the log-partition function.

Submitted 11 March, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: These are notes from the lecture of Francis Bach given at the summer school "Statistical Physics & Machine Learning", that took place in Les Houches School of Physics in France from 4th to 29th July 2022. The school was organized by Florent Krzakala and Lenka Zdeborová from EPFL. 19 pages, 4 figures

arXiv:2301.13833

A Mathematical Model for Curriculum Learning for Parities

Authors: Elisabetta Cornacchia, Elchanan Mossel

Abstract: Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of k-parities on d bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples involving two or more product distributions, allows to reduce significantly the computational cost of learning this class of functions, compared to learning under the uniform distribution. Furthermore, we show that for another class of functions - namely the 'Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial.

Submitted 22 April, 2024; v1 submitted 31 January, 2023; originally announced January 2023.

Journal ref: ICML 2023

arXiv:2205.13647

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

Authors: Emmanuel Abbe, Samy Bengio, Elisabetta Cornacchia, Jon Kleinberg, Aryo Lotfi, Maithra Raghu, Chiyuan Zhang

Abstract: This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.

Submitted 20 October, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: To appear in NeurIPS 2022

arXiv:2203.12094

doi 10.1088/2632-2153/acb428

Learning curves for the multi-class teacher-student perceptron

Authors: Elisabetta Cornacchia, Francesca Mignacco, Rodrigo Veiga, Cédric Gerbelot, Bruno Loureiro, Lenka Zdeborová

Abstract: One of the most classical results in high-dimensional learning theory provides a closed-form expression for the generalisation error of binary classification with the single-layer teacher-student perceptron on i.i.d. Gaussian inputs. Both Bayes-optimal estimation and empirical risk minimisation (ERM) were extensively analysed for this setting. At the same time, a considerable part of modern machine learning practice concerns multi-class classification. Yet, an analogous analysis for the corresponding multi-class teacher-student perceptron was missing. In this manuscript we fill this gap by deriving and evaluating asymptotic expressions for both the Bayes-optimal and ERM generalisation errors in the high-dimensional regime. For Gaussian teacher weights, we investigate the performance of ERM with both cross-entropy and square losses, and explore the role of ridge regularisation in approaching Bayes-optimality. In particular, we observe that regularised cross-entropy minimisation yields close-to-optimal accuracy. Instead, for a binary teacher we show that a first-order phase transition arises in the Bayes-optimal performance.

Submitted 22 March, 2022; originally announced March 2022.

Comments: 14 pages + appendix

Journal ref: Machine Learning: Science and Technology 4 015019 (2022)

arXiv:2202.12846

An initial alignment between neural network and target is needed for gradient descent to learn

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Jan Hązła, Christopher Marquis

Abstract: This paper introduces the notion of "Initial Alignment" (INAL) between a neural network at initialization and a target function. It is proved that if a network and a Boolean target function do not have a noticeable INAL, then noisy gradient descent on a fully connected network with normalized i.i.d. initialization will not learn in polynomial time. Thus a certain amount of knowledge about the target (measured by the INAL) is needed in the architecture design. This also provides an answer to an open problem posed in [AS20]. The results are based on deriving lower-bounds for descent algorithms on symmetric neural networks without explicit knowledge of the target function beyond its INAL.

Submitted 16 August, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Journal ref: Proceedings of the International Conference on Machine Learning, 2022

arXiv:2111.02154

Regularization by Misclassification in ReLU Neural Networks

Authors: Elisabetta Cornacchia, Jan Hązła, Ido Nachum, Amir Yehudayoff

Abstract: We study the implicit bias of ReLU neural networks trained by a variant of SGD where at each step, the label is changed with probability $p$ to a random label (label smoothing being a close variant of this procedure). Our experiments demonstrate that label noise propels the network to a sparse solution in the following sense: for a typical input, a small fraction of neurons are active, and the firing pattern of the hidden layers is sparser. In fact, for some instances, an appropriate amount of label noise does not only sparsify the network but further reduces the test error. We then turn to the theoretical analysis of such sparsification mechanisms, focusing on the extremal case of $p=1$. We show that in this case, the network withers as anticipated from experiments, but surprisingly, in different ways that depend on the learning rate and the presence of bias, with either weights vanishing or neurons ceasing to fire.

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2101.12601

Stochastic block model entropy and broadcasting on trees with survey

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Yuzhou Gu, Yury Polyanskiy

Abstract: The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.

Submitted 29 January, 2021; originally announced January 2021.

arXiv:2006.05251

Polarization in Attraction-Repulsion Models

Authors: Elisabetta Cornacchia, Neta Singer, Emmanuel Abbe

Abstract: This paper introduces a model for opinion dynamics, where at each time step, randomly selected agents see their opinions - modeled as scalars in [0,1] - evolve depending on a local interaction function. In the classical Bounded Confidence Model, agents' opinions get attracted when they are close enough. The proposed model extends this by adding a repulsion component, which models the effect of opinions getting further pushed away when dissimilar enough. With this repulsion component added, and under a repulsion-attraction cleavage assumption, it is shown that a new stable configuration emerges beyond the classical consensus configuration, namely the polarization configuration. More specifically, it is shown that total consensus and total polarization are the only two possible limiting configurations. The paper further provides an analysis of the infinite population regime in dimension 1 and higher, with a phase transition phenomenon conjectured and backed heuristically.

Submitted 9 June, 2020; originally announced June 2020.

Showing 1–9 of 9 results for author: Cornacchia, E