
Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

Abstract: We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and number of iterations needed are both bounded by $n^{O(k)}\log(1/ε)$. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding that gradient descent discovers lower frequency Fourier components before higher frequency components. We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality $n$ must use $n^{Ω(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $Θ(k \log n)$. Our approach for both parts is based on spherical harmonics.
We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss.

Submitted 27 May, 2019; v1 submitted 7 May, 2018; originally announced May 2018.

Comments: Revised version now includes matching lower bounds

arXiv:1712.07504

On Counting Perfect Matchings in General Graphs

Authors: Daniel Štefankovič, Eric Vigoda, John Wilmes

Abstract: Counting perfect matchings has played a central role in the theory of counting problems. The permanent, corresponding to bipartite graphs, was shown to be #P-complete to compute exactly by Valiant (1979), and a fully polynomial randomized approximation scheme (FPRAS) was presented by Jerrum, Sinclair, and Vigoda (2004) using a Markov chain Monte Carlo (MCMC) approach. However, it has remained an open question whether there exists an FPRAS for counting perfect matchings in general graphs. In fact, it was unresolved whether the same Markov chain defined by JSV is rapidly mixing in general. In this paper, we show that it is not. We prove torpid mixing for any weighting scheme on hole patterns in the JSV chain. As a first step toward overcoming this obstacle, we introduce a new algorithm for counting matchings based on the Gallai-Edmonds decomposition of a graph, and give an FPRAS for counting matchings in graphs that are sufficiently close to bipartite. In particular, we obtain a fixed-parameter tractable algorithm for counting matchings in general graphs, parameterized by the greatest "order" of a factor-critical subgraph.

Submitted 20 December, 2017; originally announced December 2017.

Comments: To appear in LATIN 2018

MSC Class: 68Q25; 60J10

arXiv:1707.04615

On the Complexity of Learning Neural Networks

Authors: Le Song, Santosh Vempala, John Wilmes, Bo Xie

Abstract: The stunning empirical successes of neural networks currently lack rigorous theoretical explanation. What form would such an explanation take, in the face of existing complexity-theoretic lower bounds? A first step might be to show that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently. We demonstrate here a comprehensive lower bound ruling out this possibility: for a wide class of activation functions (including all currently used), and inputs drawn from any logconcave distribution, there is a family of one-hidden-layer functions whose output is a sum gate, that are hard to learn in a precise sense: any statistical query algorithm (which includes all known variants of stochastic gradient descent with any loss function) needs an exponential number of queries even using tolerance inversely proportional to the input dimensionality. Moreover, this hard family of functions is realizable with a small (sublinear in dimension) number of activation units in the single hidden layer. The lower bound is also robust to small perturbations of the true weights. Systematic experiments illustrate a phase transition in the training error as predicted by the analysis.

Submitted 14 July, 2017; originally announced July 2017.

Comments: 21 pages, 2 figures

arXiv:1510.02195

Structure and automorphisms of primitive coherent configurations

Authors: Xiaorui Sun, John Wilmes

Abstract: Coherent configurations (CCs) are highly regular colorings of the set of ordered pairs of a "vertex set"; each color represents a "constituent digraph." CCs arise in the study of permutation groups, combinatorial structures such as partially balanced designs, and the analysis of algorithms; their history goes back to Schur in the 1930s. A CC is primitive (PCC) if all its constituent digraphs are connected. We address the problem of classifying PCCs with large automorphism groups. This project was started in Babai's 1981 paper in which he showed that only the trivial PCC admits more than $\exp(\tilde{O}(n^{1/2}))$ automorphisms. (Here, $n$ is the number of vertices and the $\tilde{O}$ hides polylogarithmic factors.) In the present paper we classify all PCCs with more than $\exp(\tilde{O}(n^{1/3}))$ automorphisms, making the first progress on Babai's conjectured classification of all PCCs with more than $\exp(n^ε)$ automorphisms. A corollary to Babai's 1981 result solved a then 100-year-old problem on primitive but not doubly transitive permutation groups, giving an $\exp(\tilde{O}(n^{1/2}))$ bound on their order. In a similar vein, our result implies an $\exp(\tilde{O}(n^{1/3}))$ upper bound on the order of such groups, with known exceptions. This improvement of Babai's result was previously known only through the Classification of Finite Simple Groups (Cameron, 1981), while our proof, like Babai's, is elementary and almost purely combinatorial. Our analysis relies on a new combinatorial structure theory we develop for PCCs.
In particular, we demonstrate the presence of "asymptotically uniform clique geometries" on PCCs in a certain range of the parameters.

Submitted 25 August, 2016; v1 submitted 8 October, 2015; originally announced October 2015.

Comments: An extended abstract of this paper appeared in the Proceedings of the 47th ACM Symposium on Theory of Computing (STOC'15) under the title "Faster canonical forms for primitive coherent configurations"

MSC Class: 05E18

arXiv:1503.02746

Asymptotic Delsarte cliques in distance-regular graphs

Authors: László Babai, John Wilmes

Abstract: We give a new bound on the parameter $λ$ (number of common neighbors of a pair of adjacent vertices) in a distance-regular graph $G$, improving and generalizing bounds for strongly regular graphs by Spielman (1996) and Pyber (2014). The new bound is one of the ingredients of recent progress on the complexity of testing isomorphism of strongly regular graphs (Babai, Chen, Sun, Teng, Wilmes 2013). The proof is based on a clique geometry found by Metsch (1991) under certain constraints on the parameters. We also give a simplified proof of the following asymptotic consequence of Metsch's result: if $kμ= o(λ^2)$ then each edge of $G$ belongs to a unique maximal clique of size asymptotically equal to $λ$, and all other cliques have size $o(λ)$. Here $k$ denotes the degree and $μ$ the number of common neighbors of a pair of vertices at distance 2. We point out that Metsch's cliques are "asymptotically Delsarte" when $kμ= o(λ^2)$, so families of distance-regular graphs with parameters satisfying $kμ= o(λ^2)$ are "asymptotically Delsarte-geometric."

Submitted 9 March, 2015; originally announced March 2015.

Comments: 10 pages

MSC Class: 05E30 (primary); 68R05; 68R10; 68Q25; 05C99 (secondary)

Showing 1–5 of 5 results for author: Wilmes, J