-
How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad
Authors:
Emmanuel Abbe,
Samy Bengio,
Aryo Lotfi,
Colin Sandon,
Omid Saremi
Abstract:
Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'distribution locality' to capture when weak learning is efficiently achievable by regular Transformers, where the locality measures the least number of tokens required in addition to the tokens histogram to correlate nontrivially with the target. As shown experimentally and theoretically under additional assumptions, distributions with high locality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Furthermore, we show that (i) an agnostic scratchpad cannot help to break the locality barrier, (ii) an educated scratchpad can help if it breaks the locality at each step, (iii) a notion of 'inductive scratchpad' can both break the locality and improve the out-of-distribution generalization, e.g., generalizing to almost double input size for some arithmetic tasks.
Submitted 10 June, 2024;
originally announced June 2024.
-
On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions
Authors:
Denys Pushkin,
Raphaël Berthier,
Emmanuel Abbe
Abstract:
We investigate the out-of-domain generalization of random feature (RF) models and Transformers. We first prove that in the 'generalization on the unseen (GOTU)' setting, where training data is fully seen in some part of the domain but testing is made on another part, and for RF models in the small feature regime, the convergence takes place to interpolators of minimal degree as in the Boolean case (Abbe et al., 2023). We then consider the sparse target regime and explain how this regime relates to the small feature regime, but with a different regularization term that can alter the picture in the non-Boolean case. We show two different outcomes for the sparse regime with q-ary data tokens: (1) if the data is embedded with roots of unity, then a min-degree interpolator is learned like in the Boolean case for RF models, (2) if the data is not embedded as such, e.g., simply as integers, then RF models and Transformers may not learn minimal degree interpolators. This shows that the Boolean setting and its roots-of-unity generalization are special cases where the minimal degree interpolator offers a rare characterization of how learning takes place. For more general integer and real-valued settings, a more nuanced picture remains to be fully characterized.
Submitted 10 June, 2024;
originally announced June 2024.
-
Reed-Muller codes have vanishing bit-error probability below capacity: a simple tighter proof via camellia boosting
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
This paper shows that a class of codes such as Reed-Muller (RM) codes have vanishing bit-error probability below capacity on symmetric channels. The proof relies on the notion of `camellia codes': a class of symmetric codes decomposable into `camellias', i.e., set systems that differ from sunflowers by allowing for scattered petal overlaps. The proof then follows from a boosting argument on the camellia petals with second moment Fourier analysis. For erasure channels, this gives a self-contained proof of the bit-error result in Kudekar et al.'17, without relying on sharp thresholds for monotone properties Friedgut-Kalai'96. For error channels, this gives a shortened proof of Reeves-Pfister'23 with an exponentially tighter bound, and a proof variant of the bit-error result in Abbe-Sandon'23. The control of the full (block) error probability still requires Abbe-Sandon'23 for RM codes.
Submitted 7 December, 2023;
originally announced December 2023.
-
When can transformers reason with abstract symbols?
Authors:
Enric Boix-Adsera,
Omid Saremi,
Emmanuel Abbe,
Samy Bengio,
Etai Littwin,
Joshua Susskind
Abstract:
We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to learn to reason. Our results inspire modifications of the transformer architecture that add only two trainable parameters per head, and that we empirically demonstrate improve data efficiency for learning to reason.
Submitted 16 April, 2024; v1 submitted 15 October, 2023;
originally announced October 2023.
-
Boolformer: Symbolic Regression of Logic Functions with Transformers
Authors:
Stéphane d'Ascoli,
Samy Bengio,
Josh Susskind,
Emmanuel Abbé
Abstract:
In this work, we introduce Boolformer, the first Transformer architecture trained to perform end-to-end symbolic regression of Boolean functions. First, we show that it can predict compact formulas for complex functions which were not seen during training, when provided a clean truth table. Then, we demonstrate its ability to find approximate expressions when provided incomplete and noisy observations. We evaluate the Boolformer on a broad set of real-world binary classification datasets, demonstrating its potential as an interpretable alternative to classic machine learning methods. Finally, we apply it to the widespread task of modelling the dynamics of gene regulatory networks. Using a recent benchmark, we show that Boolformer is competitive with state-of-the-art genetic algorithms with a speedup of several orders of magnitude. Our code and models are available publicly.
Submitted 21 September, 2023;
originally announced September 2023.
-
Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs
Authors:
Emmanuel Abbe,
Elisabetta Cornacchia,
Aryo Lotfi
Abstract:
Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling distribution can help neural networks learn parities, with formal results only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first, can learn parities of sufficiently large degree, while any fully connected neural network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.
Submitted 29 June, 2023;
originally announced June 2023.
-
Transformers learn through gradual rank increase
Authors:
Enric Boix-Adsera,
Etai Littwin,
Emmanuel Abbe,
Samy Bengio,
Joshua Susskind
Abstract:
We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.
Submitted 10 December, 2023; v1 submitted 12 June, 2023;
originally announced June 2023.
-
A proof that Reed-Muller codes achieve Shannon capacity on symmetric channels
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
Reed-Muller codes were introduced in 1954, with a simple explicit construction based on polynomial evaluations, and have long been conjectured to achieve Shannon capacity on symmetric channels. Major progress was made towards a proof over the last decades, using combinatorial weight enumerator bounds, a breakthrough on the erasure channel from sharp thresholds, hypercontractivity arguments, and polarization theory. Another major advance recently established that the bit error probability vanishes slowly below capacity. However, when channels allow for errors, the results of Bourgain-Kalai do not apply for converting a vanishing bit error probability to a vanishing block error probability, and neither do the known weight enumerator bounds. The conjecture that RM codes achieve Shannon capacity on symmetric channels, with high probability of recovering the codewords, has thus remained open.
This paper closes the conjecture's proof. It uses a new recursive boosting framework, which aggregates the decoding of codeword restrictions on `subspace-sunflowers', handling their dependencies via an $L_p$ Boolean Fourier analysis, and using a list-decoding argument with a weight enumerator bound from Sberlo-Shpilka. The proof does not require a vanishing bit error probability for the base case, but only a non-trivial probability, obtained here for general symmetric codes. This gives in particular a shortened and tightened argument for the vanishing bit error probability result of Reeves-Pfister, and with prior works, it implies the strong wire-tap secrecy of RM codes on pure-state classical-quantum channels.
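As a rough illustration of the "simple explicit construction based on polynomial evaluations" mentioned above, the following sketch builds a generator matrix for a binary Reed-Muller code RM(r, m) by evaluating every monomial of degree at most r at all points of F_2^m. The function name `reed_muller_generator` and the row ordering are assumptions made for this example; it is not taken from the paper and says nothing about the decoding or capacity arguments.

```python
from itertools import combinations, product


def reed_muller_generator(r, m):
    """Generator matrix of RM(r, m) as a list of 0/1 rows (illustrative sketch).

    Each row is the evaluation of one monomial of degree <= r
    at all 2^m points of F_2^m.
    """
    points = list(product([0, 1], repeat=m))  # the 2^m evaluation points
    rows = []
    for degree in range(r + 1):
        for subset in combinations(range(m), degree):
            # Monomial prod_{i in subset} x_i; the empty subset is the constant 1.
            rows.append([int(all(p[i] for i in subset)) for p in points])
    return rows


G = reed_muller_generator(1, 3)
print(len(G), len(G[0]))  # code dimension and block length of RM(1, 3)
```

The number of rows is the sum of the binomial coefficients C(m, k) for k from 0 to r, which is the code dimension, and the number of columns is the block length 2^m.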
Submitted 5 April, 2023;
originally announced April 2023.
-
SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics
Authors:
Emmanuel Abbe,
Enric Boix-Adsera,
Theodor Misiakiewicz
Abstract:
We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $\tilde{\Theta}(d^{\max(\mathrm{Leap}(f),2)})$. We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that the training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here. Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower-bounds.
Submitted 31 August, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Authors:
Emmanuel Abbe,
Samy Bengio,
Aryo Lotfi,
Kevin Rizk
Abstract:
This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky min-degree solutions. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
Submitted 28 June, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
On the non-universality of deep learning: quantifying the cost of symmetry
Authors:
Emmanuel Abbe,
Enric Boix-Adsera
Abstract:
We prove limitations on what neural networks trained by noisy gradient descent (GD) can efficiently learn. Our results apply whenever GD training is equivariant, which holds for many standard architectures and initializations. As applications, (i) we characterize the functions that fully-connected networks can weak-learn on the binary hypercube and unit sphere, demonstrating that depth-2 is as powerful as any other depth for this task; (ii) we extend the merged-staircase necessity result for learning with latent low-dimensional structure [ABM22] to beyond the mean-field regime. Under cryptographic assumptions, we also show hardness results for learning with fully-connected networks trained by stochastic gradient descent (SGD).
Submitted 14 October, 2022; v1 submitted 5 August, 2022;
originally announced August 2022.
-
Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures
Authors:
Emmanuel Abbe,
Samy Bengio,
Elisabetta Cornacchia,
Jon Kleinberg,
Aryo Lotfi,
Maithra Raghu,
Chiyuan Zhang
Abstract:
This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
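For readers unfamiliar with the Boolean influence referred to above, here is a minimal sketch that computes it by brute-force enumeration, using the standard definition: the influence of coordinate i on f is the fraction of inputs in {-1, +1}^n for which flipping coordinate i changes f. The helper names `influence`, `parity`, and `majority3` are hypothetical and chosen for this example; the sketch only illustrates the quantity and does not reproduce the paper's generalization-error characterization.

```python
from itertools import product


def influence(f, n, i):
    """Fraction of x in {-1,+1}^n where flipping coordinate i changes f(x)."""
    changed = 0
    for x in product([-1, 1], repeat=n):
        y = list(x)
        y[i] = -y[i]  # flip the i-th coordinate
        if f(x) != f(tuple(y)):
            changed += 1
    return changed / 2 ** n


def parity(x):
    """Product of all coordinates (the full parity in the +/-1 encoding)."""
    out = 1
    for v in x:
        out *= v
    return out


def majority3(x):
    """Majority vote of three +/-1 bits."""
    return 1 if sum(x) > 0 else -1


print(influence(parity, 3, 0))
print(influence(majority3, 3, 0))
```

Every coordinate of a parity is maximally influential, since any single flip changes the output, whereas a majority coordinate matters only when the remaining votes are tied.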
Submitted 20 October, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
An initial alignment between neural network and target is needed for gradient descent to learn
Authors:
Emmanuel Abbe,
Elisabetta Cornacchia,
Jan Hązła,
Christopher Marquis
Abstract:
This paper introduces the notion of "Initial Alignment" (INAL) between a neural network at initialization and a target function. It is proved that if a network and a Boolean target function do not have a noticeable INAL, then noisy gradient descent on a fully connected network with normalized i.i.d. initialization will not learn in polynomial time. Thus a certain amount of knowledge about the target (measured by the INAL) is needed in the architecture design. This also provides an answer to an open problem posed in [AS20]. The results are based on deriving lower-bounds for descent algorithms on symmetric neural networks without explicit knowledge of the target function beyond its INAL.
Submitted 16 August, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks
Authors:
Emmanuel Abbe,
Enric Boix-Adsera,
Theodor Misiakiewicz
Abstract:
It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments.
We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$.
Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting.
We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low-dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.
Submitted 17 February, 2022;
originally announced February 2022.
-
The staircase property: How hierarchical structure can guide deep learning
Authors:
Emmanuel Abbe,
Enric Boix-Adsera,
Matthew Brennan,
Guy Bresler,
Dheeraj Nagaraj
Abstract:
This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the fact that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any SQ or PAC algorithms as recently shown.
Submitted 23 November, 2021; v1 submitted 24 August, 2021;
originally announced August 2021.
-
On the Power of Differentiable Learning versus PAC and SQ Learning
Authors:
Emmanuel Abbe,
Pritish Kamath,
Eran Malach,
Colin Sandon,
Nathan Srebro
Abstract:
We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $\rho$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.
Submitted 5 February, 2022; v1 submitted 9 August, 2021;
originally announced August 2021.
-
Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels
Authors:
Eran Malach,
Pritish Kamath,
Emmanuel Abbe,
Nathan Srebro
Abstract:
We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
Submitted 1 March, 2021;
originally announced March 2021.
-
Stochastic block model entropy and broadcasting on trees with survey
Authors:
Emmanuel Abbe,
Elisabetta Cornacchia,
Yuzhou Gu,
Yury Polyanskiy
Abstract:
The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR outside the interval (1, 3.513). It further gives an approximation to the limit in this window.
The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved.
Submitted 29 January, 2021;
originally announced January 2021.
-
Maximum Multiscale Entropy and Neural Network Regularization
Authors:
Amir R. Asadi,
Emmanuel Abbe
Abstract:
A well-known result across information theory, machine learning, and statistical physics shows that the maximum entropy distribution under a mean constraint has an exponential form called the Gibbs-Boltzmann distribution. This is used for instance in density estimation or to achieve excess risk bounds derived from single-scale entropy regularizers (Xu-Raginsky '17). This paper investigates a generalization of these results to a multiscale setting. We present different ways of generalizing the maximum entropy result by incorporating the notion of scale. For different entropies and arbitrary scale transformations, it is shown that the distribution maximizing a multiscale entropy is characterized by a procedure which has an analogy to the renormalization group procedure in statistical physics. For the case of decimation transformation, it is further shown that this distribution is Gaussian whenever the optimal single-scale distribution is Gaussian. This is then applied to neural networks, and it is shown that in a teacher-student scenario, the multiscale Gibbs posterior can achieve a smaller excess risk than the single-scale Gibbs posterior.
Submitted 25 June, 2020;
originally announced June 2020.
-
An $\ell_p$ theory of PCA and spectral clustering
Authors:
Emmanuel Abbe,
Jianqing Fan,
Kaizheng Wang
Abstract:
Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special examples. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These also provide optimal recovery results for Gaussian mixture and stochastic block models as special cases.
Submitted 9 April, 2022; v1 submitted 24 June, 2020;
originally announced June 2020.
-
An Alon-Boppana theorem for powered graphs and generalized Ramanujan graphs
Authors:
Emmanuel Abbe,
Peter Ralli
Abstract:
The r-th power of a graph modifies a graph by connecting every vertex pair within distance r. This paper gives a generalization of the Alon-Boppana Theorem for the r-th power of graphs, including irregular graphs. This leads to a generalized notion of Ramanujan graphs, those for which the powered graph has a spectral gap matching the derived Alon-Boppana bound. In particular, we show that certain graphs that are not good expanders due to local irregularities, such as Erdos-Renyi random graphs, become almost Ramanujan once powered. A different generalization of Ramanujan graphs can also be obtained from the nonbacktracking operator. We next argue that the powering operator gives a more robust notion than the latter: Sparse Erdos-Renyi random graphs with an adversary modifying a subgraph of $\log(n)^c$ vertices are still almost Ramanujan in the powered sense, but not in the nonbacktracking sense. As an application, this gives robust community testing for different block models.
Submitted 18 June, 2020;
originally announced June 2020.
-
Learning Sparse Graphons and the Generalized Kesten-Stigum Threshold
Authors:
Emmanuel Abbe,
Shuangping Li,
Allan Sly
Abstract:
The problem of learning graphons has attracted considerable attention across several scientific communities, with significant progress over the recent years in sparser regimes. Yet, the current techniques still require diverging degrees in order to succeed with efficient algorithms in the challenging cases where the local structure of the graph is homogeneous. This paper provides an efficient algorithm to learn graphons in the constant expected degree regime. The algorithm is shown to succeed in estimating the rank-$k$ projection of a graphon in the $L_2$ metric if the top $k$ eigenvalues of the graphon satisfy a generalized Kesten-Stigum condition.
Submitted 13 June, 2020;
originally announced June 2020.
-
Polarization in Attraction-Repulsion Models
Authors:
Elisabetta Cornacchia,
Neta Singer,
Emmanuel Abbe
Abstract:
This paper introduces a model for opinion dynamics, where at each time step, randomly selected agents see their opinions - modeled as scalars in [0,1] - evolve depending on a local interaction function. In the classical Bounded Confidence Model, agents' opinions get attracted when they are close enough. The proposed model extends this by adding a repulsion component, which models the effect of opinions getting further pushed away when dissimilar enough. With this repulsion component added, and under a repulsion-attraction cleavage assumption, it is shown that a new stable configuration emerges beyond the classical consensus configuration, namely the polarization configuration. More specifically, it is shown that total consensus and total polarization are the only two possible limiting configurations. The paper further provides an analysis of the infinite population regime in dimension 1 and higher, with a phase transition phenomenon conjectured and backed heuristically.
Submitted 9 June, 2020;
originally announced June 2020.
-
Almost-Reed--Muller Codes Achieve Constant Rates for Random Errors
Authors:
Emmanuel Abbe,
Jan Hązła,
Ido Nachum
Abstract:
This paper considers '$δ$-almost Reed-Muller codes', i.e., linear codes spanned by evaluations of all but a $δ$ fraction of monomials of degree at most $d$. It is shown that for any $δ> 0$ and any $\varepsilon>0$, there exists a family of $δ$-almost Reed-Muller codes of constant rate that correct $1/2-\varepsilon$ fraction of random errors with high probability. For exact Reed-Muller codes, the analogous result is not known and represents a weaker version of the longstanding conjecture that Reed-Muller codes achieve capacity for random errors (Abbe-Shpilka-Wigderson STOC '15). Our approach is based on the recent polarization result for Reed-Muller codes, combined with a combinatorial approach to establishing inequalities between the Reed-Muller code entropies.
Submitted 5 October, 2021; v1 submitted 20 April, 2020;
originally announced April 2020.
-
Reed-Muller Codes: Theory and Algorithms
Authors:
Emmanuel Abbe,
Amir Shpilka,
Min Ye
Abstract:
Reed-Muller (RM) codes are among the oldest, simplest and perhaps most ubiquitous family of codes. They are used in many areas of coding theory in both electrical engineering and computer science. Yet, many of their important properties are still under investigation. This paper covers some of the recent developments regarding the weight enumerator and the capacity-achieving properties of RM codes, as well as some of the algorithmic developments. In particular, the paper discusses the recent connections established between RM codes, thresholds of Boolean functions, polarization theory, hypercontractivity, and the techniques of approximating low weight codewords using lower degree polynomials. It then overviews some of the algorithms with performance guarantees, as well as some of the algorithms with state-of-the-art performances in practical regimes. Finally, the paper concludes with a few open problems.
Submitted 10 June, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
-
Poly-time universality and limitations of deep learning
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
The goal of this paper is to characterize function distributions that deep learning can or cannot learn in poly-time. A universality result is proved for SGD-based deep learning and a non-universality result is proved for GD-based deep learning; this also gives a separation between SGD-based deep learning and statistical query algorithms:
(1) Deep learning with SGD is efficiently universal. Any function distribution that can be learned from samples in poly-time can also be learned by a poly-size neural net trained with SGD on a poly-time initialization with poly-steps, poly-rate and possibly poly-noise.
Therefore deep learning provides a universal learning paradigm: it was known that the approximation and estimation errors could be controlled with poly-size neural nets, using ERM that is NP-hard; this new result shows that the optimization error can also be controlled with SGD in poly-time. The picture changes for GD with large enough batches:
(2) Result (1) does not hold for GD: Neural nets of poly-size trained with GD (full gradients or large enough batches) on any initialization with poly-steps, poly-range and at least poly-noise cannot learn any function distribution that has super-polynomial cross-predictability, where the cross-predictability gives a measure of "average" function correlation -- relations and distinctions to the statistical dimension are discussed. In particular, GD with these constraints can learn efficiently monomials of degree $k$ if and only if $k$ is constant.
Thus (1) and (2) point to an interesting contrast: SGD is universal even with some poly-noise while full GD or SQ algorithms are not (e.g., parities).
Submitted 7 January, 2020;
originally announced January 2020.
-
Entropic matroids and their representation
Authors:
Emmanuel Abbe,
Sophie Spirkl
Abstract:
This paper investigates entropic matroids, that is, matroids whose rank function is given as the Shannon entropy of random variables. In particular, we consider $p$-entropic matroids, for which the random variables each have support of cardinality $p$. We draw connections between such entropic matroids and secret-sharing matroids and show that entropic matroids are linear matroids when $p = 2,3$ but not when $p = 9$. Our results leave open the possibility for $p$-entropic matroids to be linear whenever $p$ is prime, with particular cases proved here. Applications of entropic matroids to coding theory and cryptography are also discussed.
Submitted 26 September, 2019;
originally announced September 2019.
-
Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets
Authors:
Amir R. Asadi,
Emmanuel Abbe
Abstract:
We derive generalization and excess risk bounds for neural nets using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18. The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural nets with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation). To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multi-scale Gibbs distribution, with an experiment for a two-layer neural net on the MNIST data set.
Submitted 26 June, 2019;
originally announced June 2019.
-
Subadditivity Beyond Trees and the Chi-Squared Mutual Information
Authors:
Emmanuel Abbe,
Enric Boix-Adserà
Abstract:
In 2000, Evans et al. [Eva+00] proved the subadditivity of the mutual information in the broadcasting on tree model with binary vertex labels and symmetric channels. They raised the question of whether such subadditivity extends to loopy graphs in some appropriate way. We recently proposed such an extension that applies to general graphs and binary vertex labels [AB18], using synchronization models and relying on percolation bounds. This extension requires however the edge channels to be symmetric on the product of the adjacent spins. A more general version of such a percolation bound that applies to asymmetric channels is also obtained in [PW18], relying on the SDPI, but the subadditivity property does not follow with such generalizations.
In this note, we provide a new result showing that the subadditivity property still holds for arbitrary (asymmetric) channels acting on the product of spins, when the graphs are restricted to be series-parallel. The proof relies on the use of the Chi-squared mutual information rather than the classical mutual information, and various properties of the former are discussed.
We also present a generalization of the broadcasting on tree model (the synchronization on tree) where the bound from [PW18] relying on the SDPI can be significantly looser than the bound resulting from the Chi-squared subadditivity property presented here.
Submitted 6 February, 2019;
originally announced February 2019.
-
Recursive projection-aggregation decoding of Reed-Muller codes
Authors:
Min Ye,
Emmanuel Abbe
Abstract:
We propose a new class of efficient decoding algorithms for Reed-Muller (RM) codes over binary-input memoryless channels. The algorithms are based on projecting the code on its cosets, recursively decoding the projected codes (which are lower-order RM codes), and aggregating the reconstructions (e.g., using majority votes). We further provide extensions of the algorithms using list-decoding.
We run our algorithm for AWGN channels and Binary Symmetric Channels at the short code length ($\le 1024$) regime for a wide range of code rates. Simulation results show that in both low code rate and high code rate regimes, the new algorithm outperforms the widely used decoder for polar codes (SCL+CRC) with the same parameters. The performance of the new algorithm for RM codes in those regimes is in fact close to that of the maximum likelihood decoder. Finally, the new decoder naturally allows for parallel implementations.
Submitted 26 February, 2020; v1 submitted 4 February, 2019;
originally announced February 2019.
-
Reed-Muller codes polarize
Authors:
Emmanuel Abbe,
Min Ye
Abstract:
Reed-Muller (RM) codes and polar codes are generated by the same matrix $G_m= \bigl[\begin{smallmatrix}1 & 0 \\ 1 & 1 \\ \end{smallmatrix}\bigr]^{\otimes m}$ but using different subsets of rows. RM codes simply select the rows having the largest weights. Polar codes select instead rows having the largest conditional mutual information proceeding top to down in $G_m$; while this is a more elaborate and channel-dependent rule, the top-to-down ordering has the advantage of making the conditional mutual information polarize, giving directly a capacity-achieving code on any binary memoryless symmetric channel (BMSC). RM codes are yet to be proved to have such property.
In this paper, we reconnect RM codes to polarization theory. It is shown that proceeding in the RM code ordering, i.e., not top-to-down but from the lightest to the heaviest rows in $G_m$, the conditional mutual information again polarizes. We further demonstrate that it does so faster than for polar codes. This implies that $G_m$ contains another code, different from the polar code and called here the twin code, that is provably capacity-achieving on any BMSC. This proves a necessary condition for RM codes to achieve capacity on BMSCs. It further gives a sufficient condition if the rows with the largest conditional mutual information correspond to the heaviest rows, i.e., if the twin code is the RM code. We show here that the two codes bear similarity to each other and give further evidence that they are likely the same.
Submitted 31 January, 2019;
originally announced January 2019.
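The row-selection rule for RM codes described in the abstract above can be illustrated with a small sketch. It builds $G_m$ as an $m$-fold Kronecker power and keeps the rows of weight at least $2^{m-r}$, which is the standard description of the order-$r$ RM code of length $2^m$. The function names `kron`, `g_matrix` and `rm_rows` are hypothetical helpers introduced here for illustration; they are not taken from the paper, and the polar-code rule based on conditional mutual information is not implemented.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def g_matrix(m):
    """Return G_m, the m-fold Kronecker power of [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(m):
        g = kron(g, [[1, 0], [1, 1]])
    return g


def rm_rows(m, r):
    """Rows of G_m with Hamming weight at least 2^(m - r).

    Assumption: this is the usual 'heaviest rows' selection for the
    order-r Reed-Muller code of length 2^m.
    """
    threshold = 2 ** (m - r)
    return [row for row in g_matrix(m) if sum(row) >= threshold]


if __name__ == "__main__":
    # Row weights of G_3 are powers of two.
    print(sorted(sum(row) for row in g_matrix(3)))
    # The order-1 code of length 8 keeps the rows of weight >= 4.
    print(len(rm_rows(3, 1)))
```

Selecting by weight needs no knowledge of the channel, which is the contrast the abstract draws with the channel-dependent polar rule.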
-
Provable limitations of deep learning
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
As the success of deep learning reaches more grounds, one would like to also envision the potential limits of deep learning. This paper gives a first set of results proving that certain deep learning algorithms fail at learning certain efficiently learnable functions. The results put forward a notion of cross-predictability that characterizes when such failures take place. Parity functions provide an extreme example with a cross-predictability that decays exponentially, while a mere super-polynomial decay of the cross-predictability is shown to be sufficient to obtain failures. Examples in community detection and arithmetic learning are also discussed.
Recall that it is known that the class of neural networks (NNs) with polynomial network size can express any function that can be implemented in polynomial time, and that their sample complexity scales polynomially with the network size. The challenge is with the optimization error (the ERM is NP-hard), and the success behind deep learning is to train deep NNs with descent algorithms. The failures shown in this paper apply to training poly-size NNs on function distributions of low cross-predictability with a descent algorithm that is either run with limited memory per sample or that is initialized and run with enough randomness. We further claim that such types of constraints are necessary to obtain failures, in that exact SGD with careful non-random initialization can be shown to learn parities. The cross-predictability in our results plays a similar role to the statistical dimension in statistical query (SQ) algorithms, with distinctions explained in the paper. The proof techniques are based on exhibiting algorithmic constraints that imply a statistical indistinguishability between the algorithm's output on the test model vs. a null model, using information measures to bound the total variation distance.
Submitted 29 April, 2019; v1 submitted 15 December, 2018;
originally announced December 2018.
-
Graph powering and spectral robustness
Authors:
Emmanuel Abbe,
Enric Boix,
Peter Ralli,
Colin Sandon
Abstract:
Spectral algorithms, such as principal component analysis and spectral clustering, typically require careful data transformations to be effective: upon observing a matrix $A$, one may look at the spectrum of $ψ(A)$ for a properly chosen $ψ$. The issue is that the spectrum of $A$ might be contaminated by non-informational top eigenvalues, e.g., due to scale variations in the data, and the application of $ψ$ aims to remove these.
Designing a good functional $ψ$ (and establishing what good means) is often challenging and model dependent. This paper proposes a simple and generic construction for sparse graphs, $$ψ(A) = \mathbb{1}((I+A)^r \ge 1),$$ where $A$ denotes the adjacency matrix and $r$ is an integer (less than the graph diameter). This produces a graph connecting vertices from the original graph that are within distance $r$, and is referred to as graph powering. It is shown that graph powering regularizes the graph and decontaminates its spectrum in the following sense: (i) If the graph is drawn from the sparse Erdős-Rényi ensemble, which has no spectral gap, it is shown that graph powering produces a 'maximal' spectral gap, with the latter justified by establishing an Alon-Boppana result for powered graphs; (ii) If the graph is drawn from the sparse SBM, graph powering is shown to achieve the fundamental limit for weak recovery (the KS threshold) similarly to \cite{massoulie-STOC}, settling an open problem therein. Further, graph powering is shown to be significantly more robust to tangles and cliques than previous spectral algorithms based on self-avoiding or nonbacktracking walk counts \cite{massoulie-STOC,Mossel_SBM2,bordenave,colin3}. This is illustrated on a geometric block model that is dense in cliques.
Submitted 13 September, 2018;
originally announced September 2018.
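The graph powering construction stated in the abstract above, $ψ(A) = \mathbb{1}((I+A)^r \ge 1)$, can be sketched in a few lines. The sketch below thresholds to 0/1 after every multiplication, which gives the same result as thresholding once at the end because all entries are non-negative. The function name `graph_power` and the list-of-lists matrix representation are assumptions made for illustration, not the authors' implementation, and the cubic-time loop is only meant for small examples.

```python
def graph_power(adj, r):
    """Return the 0/1 matrix 1((I + A)^r >= 1) for a 0/1 adjacency matrix.

    Entry (i, j) of the result is 1 exactly when vertices i and j are
    within graph distance r (so the diagonal is all ones).
    """
    n = len(adj)
    # I + A, thresholded to 0/1.
    base = [[1 if (i == j or adj[i][j]) else 0 for j in range(n)] for i in range(n)]
    # Start from the identity and multiply by (I + A) r times.
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = [
            [1 if any(result[i][k] and base[k][j] for k in range(n)) else 0
             for j in range(n)]
            for i in range(n)
        ]
    return result


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3.
    path = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    for row in graph_power(path, 2):
        print(row)
```

On the path graph with $r = 2$, vertices 0 and 2 become adjacent while vertices 0 and 3, at distance 3, remain non-adjacent.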
-
Chaining Mutual Information and Tightening Generalization Bounds
Authors:
Amir R. Asadi,
Emmanuel Abbe,
Sergio Verdú
Abstract:
Bounding the generalization error of learning algorithms has a long history, which yet falls short in explaining various generalization successes including those of deep learning. Two important difficulties are (i) exploiting the dependencies between the hypotheses, (ii) exploiting the dependence between the algorithm's input and output. Progress on the first point was made with the chaining method, originating from the work of Kolmogorov, and used in the VC-dimension bound. More recently, progress on the second point was made with the mutual information method by Russo and Zou '15. Yet, these two methods are currently disjoint. In this paper, we introduce a technique to combine the chaining and mutual information methods, to obtain a generalization bound that is both algorithm-dependent and that exploits the dependencies between the hypotheses. We provide an example in which our bound significantly outperforms both the chaining and the mutual information bounds. As a corollary, we tighten Dudley's inequality when the learning algorithm chooses its output from a small subset of hypotheses with high probability.
Submitted 1 July, 2019; v1 submitted 11 June, 2018;
originally announced June 2018.
-
An Information-Percolation Bound for Spin Synchronization on General Graphs
Authors:
Emmanuel Abbe,
Enric Boix
Abstract:
This paper considers the problem of reconstructing $n$ independent uniform spins $X_1,\dots,X_n$ living on the vertices of an $n$-vertex graph $G$, by observing their interactions on the edges of the graph. This captures instances of models such as (i) broadcasting on trees, (ii) block models, (iii) synchronization on grids, (iv) spiked Wigner models. The paper gives an upper-bound on the mutual information between two vertices in terms of a bond percolation estimate. Namely, the information between two vertices' spins is bounded by the probability that these vertices are connected in a bond percolation model, where edges are opened with a probability that "emulates" the edge-information. Both the information and the open-probability are based on the Chi-squared mutual information. The main results allow us to re-derive known results for information-theoretic non-reconstruction in models (i)-(iv), with more direct or improved bounds in some cases, and to obtain new results, such as for a spiked Wigner model on grids. The main result also implies a new subadditivity property for the Chi-squared mutual information for symmetric channels and general graphs, extending the subadditivity property obtained by Evans-Kenyon-Peres-Schulman [EKPS00] for trees.
Submitted 11 June, 2018; v1 submitted 8 June, 2018;
originally announced June 2018.
-
Communication-Computation Efficient Gradient Coding
Authors:
Min Ye,
Emmanuel Abbe
Abstract:
This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff to compute gradients (and more generally vector summations) in terms of three parameters: computation load, straggler tolerance and communication cost. It further gives an explicit coding scheme that achieves the optimal tradeoff based on recursive polynomial constructions, coding both across data subsets and vector components. As a result, the proposed scheme makes it possible to minimize the running time for gradient computations. Implementations are made on Amazon EC2 clusters using Python with the mpi4py package. Results show that the proposed scheme maintains the same generalization error while reducing the running time by $32\%$ compared to uncoded schemes and $23\%$ compared to prior coded schemes focusing only on stragglers (Tandon et al., ICML 2017).
Submitted 9 February, 2018;
originally announced February 2018.
-
Estimation in the group action channel
Authors:
Emmanuel Abbe,
João M. Pereira,
Amit Singer
Abstract:
We analyze the problem of estimating a signal from multiple measurements on a $\mbox{group action channel}$ that linearly transforms a signal by a random group action followed by a fixed projection and additive Gaussian noise. This channel is motivated by applications such as multi-reference alignment and cryo-electron microscopy. We focus on the large noise regime prevalent in these applications. We give a lower bound on the mean square error (MSE) of any asymptotically unbiased estimator of the signal's orbit in terms of the signal's moment tensors, which implies that the MSE is bounded away from 0 when $N/σ^{2d}$ is bounded from above, where $N$ is the number of observations, $σ$ is the noise standard deviation, and $d$ is the so-called $\mbox{moment order cutoff}$. In contrast, the maximum likelihood estimator is shown to be consistent if $N /σ^{2d}$ diverges.
Submitted 12 January, 2018;
originally announced January 2018.
-
Multireference Alignment is Easier with an Aperiodic Translation Distribution
Authors:
Emmanuel Abbe,
Tamir Bendory,
William Leeb,
João Pereira,
Nir Sharon,
Amit Singer
Abstract:
In the multireference alignment model, a signal is observed by the action of a random circular translation and the addition of Gaussian noise. The goal is to recover the signal's orbit by accessing multiple independent observations. Of particular interest is the sample complexity, i.e., the number of observations/samples needed in terms of the signal-to-noise ratio (the signal energy divided by the noise variance) in order to drive the mean-square error (MSE) to zero. Previous work showed that if the translations are drawn from the uniform distribution, then, in the low SNR regime, the sample complexity of the problem scales as $ω(1/\text{SNR}^3)$. In this work, using a generalization of the Chapman--Robbins bound for orbits and expansions of the $χ^2$ divergence at low SNR, we show that in the same regime the sample complexity for any aperiodic translation distribution scales as $ω(1/\text{SNR}^2)$. This rate is achieved by a simple spectral algorithm. We propose two additional algorithms based on non-convex optimization and expectation-maximization. We also draw a connection between the multireference alignment problem and the spiked covariance model.
Submitted 3 November, 2018; v1 submitted 8 October, 2017;
originally announced October 2017.
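The observation model in the abstract above (a random circular translation followed by additive Gaussian noise) can be simulated in a few lines. The following is only a minimal sketch under stated assumptions: the function names, the optional `shift_weights` argument for a non-uniform translation distribution, and the use of the entry-wise mean as a shift-invariant statistic are illustrative choices, not the algorithms proposed in the paper.

```python
import random


def circular_shift(signal, shift):
    """Return the signal cyclically shifted by `shift` positions."""
    n = len(signal)
    return [signal[(i - shift) % n] for i in range(n)]


def sample_mra_observations(signal, num_obs, sigma, rng, shift_weights=None):
    """Draw observations y = shift(signal, s) + sigma * Gaussian noise.

    `shift_weights` (hypothetical parameter) is a distribution over the
    n possible shifts; None means the uniform distribution.
    """
    n = len(signal)
    shifts = list(range(n))
    observations = []
    for _ in range(num_obs):
        if shift_weights is None:
            s = rng.randrange(n)
        else:
            s = rng.choices(shifts, weights=shift_weights, k=1)[0]
        shifted = circular_shift(signal, s)
        observations.append([v + sigma * rng.gauss(0.0, 1.0) for v in shifted])
    return observations


def estimate_mean(observations):
    """Average of all entries: a first-moment statistic unaffected by shifts."""
    total = sum(sum(obs) for obs in observations)
    return total / (len(observations) * len(observations[0]))
```

With `sigma = 0` every observation is an exact circular shift of the signal, which makes the orbit ambiguity visible: the signal is identifiable only up to a shift, so estimators target shift-invariant quantities.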
-
Community Detection on Euclidean Random Graphs
Authors:
Emmanuel Abbe,
Francois Baccelli,
Abishek Sankararaman
Abstract:
We study the problem of community detection (CD) on Euclidean random geometric graphs where each vertex has two latent variables: a binary community label and an $\mathbb{R}^d$-valued location label which forms the support of a Poisson point process of intensity $λ$. A random graph is then drawn with edge probabilities dependent on both the community and location labels. In contrast to the stochastic block model (SBM) that has no location labels, the resulting random graph contains many more short loops due to the geometric embedding. We consider the recovery of the community labels, partial and exact, using the random graph and the location labels. We establish phase transitions for both sparse and logarithmic degree regimes, and provide bounds on the location of the thresholds, conjectured to be tight in the case of exact recovery. We also show that the distinguishability problem, i.e., testing between our model and the null model without community labels, exhibits no phase transition and, in particular, its threshold does not match the weak recovery threshold (in contrast to the SBM).
Submitted 19 March, 2020; v1 submitted 29 June, 2017;
originally announced June 2017.
-
Group Synchronization on Grids
Authors:
Emmanuel Abbe,
Laurent Massoulie,
Andrea Montanari,
Allan Sly,
Nikhil Srivastava
Abstract:
Group synchronization requires estimating unknown elements $(θ_v)_{v\in V}$ of a compact group ${\mathfrak G}$ associated to the vertices of a graph $G=(V,E)$, using noisy observations of the group differences associated to the edges. This model is relevant to a variety of applications ranging from structure from motion in computer vision to graph localization and positioning, to certain families of community detection problems.
We focus on the case in which the graph $G$ is the $d$-dimensional grid. Since the unknowns $θ_v$ are only determined up to a global action of the group, we consider the following weak recovery question: can we determine the group difference $θ_u^{-1}θ_v$ between far apart vertices $u, v$ better than by random guessing? We prove that weak recovery is possible (provided the noise is small enough) for $d\ge 3$ and, for certain finite groups, for $d\ge 2$. Conversely, for some continuous groups, we prove that weak recovery is impossible for $d=2$. Finally, for strong enough noise, weak recovery is always impossible.
Submitted 26 June, 2017;
originally announced June 2017.
-
Nonbacktracking Bounds on the Influence in Independent Cascade Models
Authors:
Emmanuel Abbe,
Sanjeev Kulkarni,
Eun Jee Lee
Abstract:
This paper develops upper and lower bounds on the influence measure in a network, more precisely, the expected number of nodes that a seed set can influence in the independent cascade model. In particular, our bounds exploit nonbacktracking walks and Fortuin-Kasteleyn-Ginibre (FKG) type inequalities, and are computed by a message-passing implementation. Nonbacktracking walks have recently enabled headway in community detection, and this paper shows that their use can also impact the influence computation. Further, we provide a knob to control the trade-off between the efficiency and the accuracy of the bounds. Finally, the tightness of the bounds is illustrated with simulations on various network models.
Submitted 29 June, 2017; v1 submitted 23 May, 2017;
originally announced June 2017.
-
Community Detection and Stochastic Block Models
Authors:
Emmanuel Abbe
Abstract:
The stochastic block model (SBM) is a random graph model with different groups of vertices connecting differently. It is widely employed as a canonical model to study clustering and community detection, and provides a fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and more generally data science.
This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational tradeoffs, and for various recovery requirements such as exact, partial and weak recovery. The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal SNR-mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds.
The monograph gives a principled derivation of the main algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, (linearized) belief propagation, classical/nonbacktracking spectral methods and graph powering. Extensions to other block models, such as geometric block models, and a few open problems are also discussed.
Submitted 24 October, 2023; v1 submitted 29 March, 2017;
originally announced March 2017.
-
Sample Complexity of the Boolean Multireference Alignment Problem
Authors:
Emmanuel Abbe,
Joao Pereira,
Amit Singer
Abstract:
The Boolean multireference alignment problem consists in recovering a Boolean signal from multiple shifted and noisy observations. In this paper we obtain an expression for the error exponent of the maximum a posteriori decoder. This expression is used to characterize the number of measurements needed for signal recovery in the low SNR regime, in terms of higher order autocorrelations of the signal. The characterization is explicit for various signal dimensions, such as prime and even dimensions.
Submitted 2 February, 2017; v1 submitted 25 January, 2017;
originally announced January 2017.
-
Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
In a paper that initiated the modern study of the stochastic block model, Decelle et al., backed by Mossel et al., made the following conjecture: Denote by $k$ the number of balanced communities, $a/n$ the probability of connecting inside communities and $b/n$ across, and set $\mathrm{SNR}=(a-b)^2/(k(a+(k-1)b))$; for any $k \geq 2$, it is possible to detect communities efficiently whenever $\mathrm{SNR}>1$ (the KS threshold), whereas for $k\geq 4$, it is possible to detect communities information-theoretically for some $\mathrm{SNR}<1$. Massoulié, Mossel et al. and Bordenave et al. succeeded in proving that the KS threshold is efficiently achievable for $k=2$, while Mossel et al. proved that it cannot be crossed information-theoretically for $k=2$. The above conjecture remained open for $k \geq 3$.
This paper proves this conjecture, further extending the efficient detection to non-symmetrical SBMs with a generalized notion of detection and KS threshold. For the efficient part, a linearized acyclic belief propagation (ABP) algorithm is developed and proved to detect communities for any $k$ down to the KS threshold in time $O(n \log n)$. Achieving this requires showing optimality of ABP in the presence of cycles, a challenge for message passing algorithms. The paper further connects ABP to a power iteration method with a nonbacktracking operator of generalized order, formalizing the interplay between message passing and spectral methods. For the information-theoretic (IT) part, a non-efficient algorithm sampling a typical clustering is shown to cross the KS threshold at $k=4$. The emerging gap is shown to be large in some cases; if $a=0$, the KS threshold reads $b \gtrsim k^2$ whereas the IT bound reads $b \gtrsim k \ln(k)$, making the SBM a good case study for information-computation gaps.
Submitted 14 September, 2016; v1 submitted 30 December, 2015;
originally announced December 2015.
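The signal-to-noise ratio defined in the abstract above, $\mathrm{SNR}=(a-b)^2/(k(a+(k-1)b))$, is easy to evaluate for given parameters. The helper names below are hypothetical; the sketch only evaluates the formula and the condition $\mathrm{SNR}>1$, and says nothing about how detection itself is carried out.

```python
def sbm_snr(k, a, b):
    """SNR of the symmetric SBM with k balanced communities.

    Edges appear with probability a/n inside communities and b/n across.
    """
    return (a - b) ** 2 / (k * (a + (k - 1) * b))


def above_ks_threshold(k, a, b):
    """True when SNR > 1, i.e. strictly above the KS threshold."""
    return sbm_snr(k, a, b) > 1.0
```

For $a=0$ the formula reduces to $b/(k(k-1))$, so the condition becomes $b > k(k-1)$, which matches the $b \gtrsim k^2$ scaling quoted in the abstract.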
-
Entropies of weighted sums in cyclic groups and an application to polar codes
Authors:
Emmanuel Abbe,
Jiange Li,
Mokshay Madiman
Abstract:
In this note, the following basic question is explored: in a cyclic group, how are the Shannon entropies of the sum and difference of i.i.d. random variables related to each other? For the integer group, we show that they can differ by any real number additively, but not too much multiplicatively; on the other hand, for $\mathbb{Z}/3\mathbb{Z}$, the entropy of the difference is always at least as large as that of the sum. These results are closely related to the study of more-sum-than-difference (i.e. MSTD) sets in additive combinatorics. We also investigate polar codes for $q$-ary input channels using non-canonical kernels to construct the generator matrix, and present applications of our results to constructing polar codes with significantly improved error probability compared to the canonical construction.
Submitted 21 April, 2016; v1 submitted 30 November, 2015;
originally announced December 2015.
-
Detecting Community Structures in Hi-C Genomic Data
Authors:
Irineo Cabreros,
Emmanuel Abbe,
Aristotelis Tsirigos
Abstract:
Community detection (CD) algorithms are applied to Hi-C data to discover new communities of loci in the 3D conformation of human and mouse DNA. We find that CD has some distinct advantages over pre-existing methods: (1) it is capable of finding a variable number of communities, (2) it can detect communities of DNA loci either adjacent or distant in the 1D sequence, and (3) it allows us to obtain a principled value of k, the number of communities present. Forcing k = 2, our method recovers earlier findings of Lieberman-Aiden et al. (2009), but letting k be a parameter, our method obtains k = 6 as the optimal value, discovering new candidate communities. In addition to discovering large communities that partition entire chromosomes, we also show that CD can detect small-scale topologically associating domains (TADs) such as those found in Dixon et al. (2012). CD thus provides a natural and flexible statistical framework for understanding the folding structure of DNA at multiple scales in Hi-C data.
Submitted 17 September, 2015;
originally announced September 2015.
-
Asymptotic Mutual Information for the Two-Groups Stochastic Block Model
Authors:
Yash Deshpande,
Emmanuel Abbe,
Andrea Montanari
Abstract:
We develop an information-theoretic view of the stochastic block model, a popular statistical model for the large-scale structure of complex networks. A graph $G$ from such a model is generated by first assigning vertex labels at random from a finite alphabet, and then connecting vertices with edge probabilities depending on the labels of the endpoints. In the case of the symmetric two-group model, we establish an explicit `single-letter' characterization of the per-vertex mutual information between the vertex labels and the graph.
The explicit expression of the mutual information is intimately related to estimation-theoretic quantities and, in particular, reveals a phase transition at the critical point for community detection. Below the critical point the per-vertex mutual information is asymptotically the same as if edges were independent. Correspondingly, no algorithm can estimate the partition better than random guessing. Conversely, above the threshold, the per-vertex mutual information is strictly smaller than the independent-edges upper bound. In this regime there exists a procedure that estimates the vertex labels better than random guessing.
Submitted 30 July, 2015;
originally announced July 2015.
-
Recovering communities in the general stochastic block model without knowing the parameters
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
Most recent developments on the stochastic block model (SBM) rely on the knowledge of the model parameters, or at least on the number of communities. This paper introduces efficient algorithms that do not require such knowledge and yet achieve the optimal information-theoretic tradeoffs identified in [AS15] for linear size communities. The results are three-fold: (i) in the constant degree regime, an algorithm is developed that requires only a lower bound on the relative sizes of the communities and detects communities with an optimal accuracy scaling for large degrees; (ii) in the regime where degrees are scaled by $ω(1)$ (diverging degrees), this is enhanced into a fully agnostic algorithm that only takes the graph in question and simultaneously learns the model parameters (including the number of communities) and detects communities with accuracy $1-o(1)$, with an overall quasi-linear complexity; (iii) in the logarithmic degree regime, an agnostic algorithm is developed that learns the parameters and achieves the optimal CH-limit for exact recovery, in quasi-linear time. These provide the first algorithms affording efficiency, universality and information-theoretic optimality for strong and weak consistency in the general SBM with linear size communities.
△ Less
Submitted 11 June, 2015;
originally announced June 2015.
-
Concentration of the number of solutions of random planted CSPs and Goldreich's one-way candidates
Authors:
Emmanuel Abbe,
Katherine Edwards
Abstract:
This paper shows that the logarithm of the number of solutions of a random planted $k$-SAT formula concentrates around a deterministic $n$-independent threshold. Specifically, if $F^*_{k}(α,n)$ is a random $k$-SAT formula on $n$ variables, with clause density $α$ and with a uniformly drawn planted solution, there exists a function $φ_k(\cdot)$ such that, except for $α$ in a set of Lebesgue measure zero, we have $\frac{1}{n}\log Z(F^*_{k}(α,n)) \to φ_k(α)$ in probability, where $Z(F)$ is the number of solutions of the formula $F$. This settles a problem left open in Abbe-Montanari RANDOM 2013, where the concentration is obtained only for the expected logarithm over the clause distribution. The result is also extended to a more general class of random planted CSPs; in particular, it is shown that the number of pre-images for the Goldreich one-way function model concentrates for some choices of the predicates.
△ Less
Submitted 30 April, 2015;
originally announced April 2015.
-
Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms
Authors:
Emmanuel Abbe,
Colin Sandon
Abstract:
New phase transition phenomena have recently been discovered for the stochastic block model, for the special case of two non-overlapping symmetric communities. This gives rise in particular to new algorithmic challenges driven by the thresholds. This paper investigates whether a general phenomenon takes place for multiple communities, without imposing symmetry.
In the general stochastic block model $\text{SBM}(n,p,Q)$, $n$ vertices are split into $k$ communities of relative sizes $\{p_i\}_{i \in [k]}$, and a vertex in community $i$ and a vertex in community $j$ connect independently with probability $Q_{i,j}$, for $i,j \in [k]$. This paper investigates the partial and exact recovery of communities in the general SBM (in the constant and logarithmic degree regimes), and uses the generality of the results to tackle overlapping communities.
The contributions of the paper are: (i) an explicit characterization of the recovery threshold in the general SBM in terms of a new divergence function $D_+$, which generalizes the Hellinger and Chernoff divergences, and which provides an operational meaning to a divergence function analogous to the KL-divergence in the channel coding theorem, (ii) the development of an algorithm that recovers the communities all the way down to the optimal threshold and runs in quasi-linear time, showing that exact recovery has no information-theoretic to computational gap for multiple communities, in contrast to the conjectures made for detection with more than 4 communities; note that the algorithm is optimal both in terms of achieving the threshold and in having quasi-linear complexity, (iii) the development of an efficient algorithm that detects communities in the constant degree regime with an explicit accuracy bound that can be made arbitrarily close to 1 when a prescribed signal-to-noise ratio (defined in terms of the spectrum of $\operatorname{diag}(p)Q$) tends to infinity.
Submitted 4 April, 2015; v1 submitted 2 March, 2015;
originally announced March 2015.
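The generative description of $\text{SBM}(n,p,Q)$ in the abstract above can be turned into a short sampler. This is a minimal sketch under assumptions: the function name is hypothetical, community labels are drawn independently with prior $p$ (one common reading of "relative size"), and the quadratic loop over vertex pairs is chosen for clarity rather than efficiency.

```python
import random


def sample_sbm(n, p, Q, rng):
    """Sample labels and an undirected edge list from SBM(n, p, Q).

    p[i] is the prior probability of community i (assumed i.i.d. labels);
    Q[i][j] is the connection probability between communities i and j.
    """
    k = len(p)
    labels = rng.choices(range(k), weights=p, k=n)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Each unordered pair is connected independently given the labels.
            if rng.random() < Q[labels[u]][labels[v]]:
                edges.append((u, v))
    return labels, edges
```

In the regimes the abstract considers, the entries of $Q$ would scale like a constant over $n$ (constant degree) or like $\log n / n$ (logarithmic degree); the sampler itself is agnostic to that choice.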