Showing 1–26 of 26 results for author: Hanin, B

Searching in archive cs.
  1. arXiv:2407.16831  [pdf, ps, other]

    cs.AI

    Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design

    Authors: Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, Matei Zaharia

    Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs), organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in…

    Submitted 23 July, 2024; originally announced July 2024.

  2. arXiv:2405.16630  [pdf, other]

    stat.ML cs.AI cs.LG math.PR physics.data-an

    Bayesian Inference with Deep Weakly Nonlinear Networks

    Authors: Boris Hanin, Alexander Zlokapa

    Abstract: We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $φ(t) = t + ψt^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the…

    Submitted 26 May, 2024; originally announced May 2024.

  3. arXiv:2403.02419  [pdf, other]

    cs.LG cs.AI cs.CL eess.SY

    Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

    Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou

    Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we i…

    Submitted 4 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  4. arXiv:2402.17440  [pdf, other]

    cs.LG

    Principled Architecture-aware Scaling of Hyperparameters

    Authors: Wuyang Chen, Junru Wu, Zhangyang Wang, Boris Hanin

    Abstract: Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structures, and thus largely ignore th…

    Submitted 27 February, 2024; originally announced February 2024.

  5. arXiv:2309.16620  [pdf, other]

    stat.ML cond-mat.dis-nn cs.AI cs.LG

    Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

    Authors: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

    Abstract: The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $μ$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across dep…

    Submitted 8 December, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

  6. arXiv:2309.01592  [pdf, other]

    stat.ML cs.AI cs.LG hep-th math.PR

    Les Houches Lectures on Deep Learning at Large & Infinite Width

    Authors: Yasaman Bahri, Boris Hanin, Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

    Abstract: These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural ne…

    Submitted 12 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: These are notes from lectures delivered by Yasaman Bahri and Boris Hanin at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, and a first version of them was transcribed by Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

  7. arXiv:2307.06092  [pdf, ps, other]

    cs.LG cs.AI math.PR stat.ML

    Quantitative CLTs in Deep Neural Networks

    Authors: Stefano Favaro, Boris Hanin, Domenico Marinucci, Ivan Nourdin, Giovanni Peccati

    Abstract: We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions…

    Submitted 17 June, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

  8. arXiv:2306.11668  [pdf, other]

    stat.ML cs.AI cs.LG hep-ex math.PR

    Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations

    Authors: Gage DeZoort, Boris Hanin

    Abstract: This article derives and validates three principles for initialization and architecture selection in finite width graph neural networks (GNNs) with ReLU activations. First, we theoretically derive what is essentially the unique generalization to ReLU GNNs of the well-known He-initialization. Our initialization scheme guarantees that the average scale of network outputs and gradients remains order…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: Comments appreciated

  9. arXiv:2305.07810  [pdf, ps, other]

    cs.LG stat.ML

    Depth Dependence of $μ$P Learning Rates in ReLU MLPs

    Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

    Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A…

    Submitted 12 May, 2023; originally announced May 2023.

  10. arXiv:2212.14457  [pdf, other]

    stat.ML cs.LG math.PR

    Bayesian Interpolation with Deep Linear Networks

    Authors: Boris Hanin, Alexander Zlokapa

    Abstract: Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network dep…

    Submitted 14 May, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

  11. arXiv:2212.07295  [pdf, other]

    stat.ML cs.LG

    Maximal Initial Learning Rates in Deep ReLU Networks

    Authors: Gaurav Iyer, Boris Hanin, David Rolnick

    Abstract: Training a neural network requires choosing a suitable learning rate, which involves a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $η^{\ast}$ - the largest learning…

    Submitted 25 May, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: International Conference on Machine Learning (ICML) 2023

  12. arXiv:2205.05662  [pdf, other]

    cs.LG

    Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis

    Authors: Wuyang Chen, Wei Huang, Xinyu Gong, Boris Hanin, Zhangyang Wang

    Abstract: Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusi…

    Submitted 12 October, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

    Comments: Accepted at NeurIPS 2022

  13. arXiv:2204.01058  [pdf, ps, other]

    math.PR cs.LG stat.ML

    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies

    Authors: Boris Hanin

    Abstract: This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable…

    Submitted 15 January, 2023; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: 86p

  14. arXiv:2109.12960  [pdf, other]

    stat.ML cs.LG

    Ridgeless Interpolation with Shallow ReLU Networks in $1D$ is Nearest Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz Functions

    Authors: Boris Hanin

    Abstract: We prove a precise geometric description of all one layer ReLU networks $z(x;θ)$ with a single linear unit and input/output dimensions equal to one that interpolate a given dataset $\mathcal D=\{(x_i,f(x_i))\}$ and, among all such interpolants, minimize the $\ell_2$-norm of the neuron weights. Such networks can intuitively be thought of as those that minimize the mean-squared error over…

    Submitted 27 September, 2021; originally announced September 2021.

  15. arXiv:2107.01562  [pdf, ps, other]

    math.PR cs.LG math.ST

    Random Neural Networks in the Infinite Width Limit as Gaussian Processes

    Authors: Boris Hanin

    Abstract: This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-lineari…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: 26p

  16. arXiv:2106.10165  [pdf, other]

    cs.LG cs.AI hep-th stat.ML

    The Principles of Deep Learning Theory

    Authors: Daniel A. Roberts, Sho Yaida, Boris Hanin

    Abstract: This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are…

    Submitted 24 August, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: 471 pages, to be published by Cambridge University Press; v2: hyperlinks fixed, index added

    Report number: MIT-CTP/5306

    Journal ref: Cambridge University Press (2022)

  17. arXiv:2102.10492  [pdf, other]

    stat.ML cs.LG

    Deep ReLU Networks Preserve Expected Length

    Authors: Boris Hanin, Ryan Jeong, David Rolnick

    Abstract: Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length - if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove th…

    Submitted 22 June, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

    Comments: 18 pages, 4 figures

  18. arXiv:2010.11171  [pdf, other]

    cs.LG math.OC stat.ML

    How Data Augmentation affects Optimization for Linear Regression

    Authors: Boris Hanin, Yi Sun

    Abstract: Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization…

    Submitted 26 October, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 31 pages, 3 figures, NeurIPS 2021

  19. arXiv:1909.05989  [pdf, other]

    cs.LG math.PR stat.ML

    Finite Depth and Width Corrections to the Neural Tangent Kernel

    Authors: Boris Hanin, Mihai Nica

    Abstract: We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that f…

    Submitted 12 September, 2019; originally announced September 2019.

    Comments: 27 pages, 2 figures, comments welcome

  20. arXiv:1906.00904  [pdf, other]

    stat.ML cs.LG math.ST

    Deep ReLU Networks Have Surprisingly Few Activation Patterns

    Authors: Boris Hanin, David Rolnick

    Abstract: The success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity; and the maximum number of patterns grows exponentially with the depth. However, recent work has shown that the practical expressivity of de…

    Submitted 20 October, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: 18 pages, 7 figures

    Journal ref: NeurIPS 2019

  21. arXiv:1905.02199  [pdf, other]

    cs.LG

    Nonlinear Approximation and (Deep) ReLU Networks

    Authors: I. Daubechies, R. DeVore, S. Foucart, B. Hanin, G. Petrova

    Abstract: This article is concerned with the approximation and expressive powers of deep neural networks. This is an active research area currently producing many interesting papers. The results most commonly found in the literature prove that neural networks approximate functions with classical smoothness to the same accuracy as classical linear methods of approximation, e.g. approximation by polynomials o…

    Submitted 5 May, 2019; originally announced May 2019.

    MSC Class: 41A25; 41A30; 41A46; 68T99; 82C32; 92B20

  22. arXiv:1901.09021  [pdf, other]

    stat.ML cs.LG math.PR

    Complexity of Linear Regions in Deep Networks

    Authors: Boris Hanin, David Rolnick

    Abstract: It is well-known that the expressivity of a neural network depends on its architecture, with deeper networks expressing more complex functions. In the case of networks that compute piecewise linear functions, such as those with ReLU activation, the number of distinct linear regions is a natural measure of expressivity. It is possible to construct networks with merely a single region, or for which…

    Submitted 11 June, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: ICML 2019

  23. arXiv:1803.01719  [pdf, other]

    stat.ML cs.LG

    How to Start Training: The Effect of Initialization and Architecture

    Authors: Boris Hanin, David Rolnick

    Abstract: We identify and study two common failure modes for early training in deep ReLU nets. For each we give a rigorous proof of when it occurs and how to avoid it, for fully connected and residual architectures. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly wei…

    Submitted 13 November, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

    Comments: Final Version, 16p, Accepted NIPS 2018

  24. arXiv:1801.03744  [pdf, other]

    stat.ML cs.LG math.PR math.ST

    Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

    Authors: Boris Hanin

    Abstract: We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large…

    Submitted 26 October, 2018; v1 submitted 11 January, 2018; originally announced January 2018.

    Comments: v3. 18p. 1 fig. Accepted at NIPS 2018

  25. arXiv:1710.11278  [pdf, other]

    stat.ML cs.CC cs.LG math.CO math.ST

    Approximating Continuous Functions by ReLU Nets of Minimal Width

    Authors: Boris Hanin, Mark Sellke

    Abstract: This article concerns the expressive power of depth in deep feed-forward neural nets with ReLU activations. Specifically, we answer the following question: for a fixed $d_{in}\geq 1,$ what is the minimal width $w$ so that neural nets with ReLU activations, input dimension $d_{in}$, hidden layer widths at most $w,$ and arbitrary depth can approximate any continuous, real-valued function of…

    Submitted 10 March, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: v2. 13p. Extended main result to higher dimensional output. Comments welcome

  26. arXiv:1708.02691  [pdf, ps, other]

    stat.ML cs.CG cs.LG math.FA math.ST

    Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations

    Authors: Boris Hanin

    Abstract: This article concerns the expressive power of depth in neural nets with ReLU activations and bounded width. We are particularly interested in the following questions: what is the minimal width $w_{\text{min}}(d)$ so that ReLU nets of width $w_{\text{min}}(d)$ (and arbitrary depth) can approximate any continuous function on the unit cube $[0,1]^d$ arbitrarily well? For ReLU nets near this minimal…

    Submitted 20 December, 2017; v1 submitted 8 August, 2017; originally announced August 2017.

    Comments: v3. Theorem 3 removed. Comments Welcome. 9p

    Journal ref: Mathematics 2019, 7(10), 992