Search | arXiv e-print repository

On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

Authors: Nirmit Joshi, Theodor Misiakiewicz, Nathan Srebro

Abstract: The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($\mathsf{SQ}$), which we call Differentiable Learning Queries ($\mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of… ▽ More The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($\mathsf{SQ}$), which we call Differentiable Learning Queries ($\mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of $\mathsf{DLQ}$ for learning the support of a sparse function over generic product distributions. This complexity crucially depends on the loss function. For the squared loss, $\mathsf{DLQ}$ matches the complexity of Correlation Statistical Queries $(\mathsf{CSQ})$--potentially much worse than $\mathsf{SQ}$. But for other simple loss functions, including the $\ell_1$ loss, $\mathsf{DLQ}$ always achieves the same complexity as $\mathsf{SQ}$. We also provide evidence that $\mathsf{DLQ}$ can indeed capture learning with (stochastic) gradient descent by showing it correctly describes the complexity of learning with a two-layer neural network in the mean field regime and linear scaling. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 43 pages, 1 table, 1 figure

arXiv:2406.04981 [pdf, other]

The Price of Implicit Bias in Adversarially Robust Generalization

Authors: Nikolaos Tsilivis, Natalie Frank, Nathan Srebro, Julia Kempe

Abstract: We study the implicit bias of optimization in robust empirical risk minimization (robust ERM) and its connection with robust generalization. In classification settings under adversarial perturbations with linear models, we study what type of regularization should ideally be applied for a given perturbation set to improve (robust) generalization. We then show that the implicit bias of optimization… ▽ More We study the implicit bias of optimization in robust empirical risk minimization (robust ERM) and its connection with robust generalization. In classification settings under adversarial perturbations with linear models, we study what type of regularization should ideally be applied for a given perturbation set to improve (robust) generalization. We then show that the implicit bias of optimization in robust ERM can significantly affect the robustness of the model and identify two ways this can happen; either through the optimization algorithm or the architecture. We verify our predictions in simulations with synthetic data and experimentally study the importance of implicit bias in robust ERM with deep neural networks. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.11667 [pdf, other]

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under… ▽ More Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low. △ Less

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2402.08808 [pdf, other]

Depth Separation in Norm-Bounded Infinite-Width Neural Networks

Authors: Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro

Abstract: We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well eve… ▽ More We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well even when the network width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.06323 [pdf, other]

How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

Authors: Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry

Abstract: Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly unifor… ▽ More Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on that the NN perfectly classifies the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow ``teacher NN'' that agrees with the labels. Specifically, we show that such a `flat' prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require less relevant parameters to represent -- enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's. △ Less

Submitted 9 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

arXiv:2312.13978 [pdf, other]

Metalearning with Very Few Samples Per Task

Authors: Maryam Aliakbarpour, Konstantina Bairaktari, Gavin Brown, Adam Smith, Nathan Srebro, Jonathan Ullman

Abstract: Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution… ▽ More Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution and need to output some common information that can be easily specialized to new tasks from the metadistribution. We consider a binary classification setting where tasks are related by a shared representation, that is, every task $P$ can be solved by a classifier of the form $f_{P} \circ h$ where $h \in H$ is a map from features to a representation space that is shared across tasks, and $f_{P} \in F$ is a task-specific classifier from the representation space to labels. The main question we ask is how much data do we need to metalearn a good representation? Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task. We focus on the regime where $n$ is extremely small. Our main result shows that, in a distribution-free setting where the feature vectors are in $\mathbb{R}^d$, the representation is a linear map from $\mathbb{R}^d \to \mathbb{R}^k$, and the task-specific classifiers are halfspaces in $\mathbb{R}^k$, we can metalearn a representation with error $\varepsilon$ using $n = k+2$ samples per task, and $d \cdot (1/\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is remarkable because metalearning would be impossible with $k+1$ samples per task, and because we cannot even hope to learn an accurate task-specific classifier with $k+2$ samples per task. Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning. △ Less

Submitted 1 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

arXiv:2311.15404 [pdf, other]

Applying statistical learning theory to deep learning

Authors: Cédric Gerbelot, Avetik Karagulyan, Stefani Karp, Kavya Ravichandran, Menachem Stern, Nathan Srebro

Abstract: Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep l… ▽ More Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning from a learning theory perspective. After a brief reminder on statistical learning theory and stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a general description of the mirror descent algorithm, showing how we may go back and forth between a parameter space and the corresponding function space for a given learning problem, as well as how the geometry of the learning problem may be represented by a metric tensor. Building on this framework, we provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks, showing how the loss function, scale of parameters at initialization and depth of the network may lead to various forms of implicit bias, in particular transitioning between kernel or feature learning. △ Less

Submitted 25 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

Comments: 66 pages, 20 figures

arXiv:2310.06113 [pdf, other]

When is Agnostic Reinforcement Learning Statistically Tractable?

Authors: Zeyu Jia, Gene Li, Alexander Rakhlin, Ayush Sekhari, Nathan Srebro

Abstract: We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $Π$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $ε$-suboptimal policy with respect to $Π$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $Π$ and is ind… ▽ More We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $Π$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $ε$-suboptimal policy with respect to $Π$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $Π$ and is independent of the MDP dynamics. With a generative model, we show that for any policy class $Π$, bounded spanning capacity characterizes PAC learnability. However, for online RL, the situation is more subtle. We show there exists a policy class $Π$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional \emph{sunflower} structure, which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as techniques for reachable-state identification and policy evaluation in reward-free exploration. △ Less

Submitted 9 October, 2023; originally announced October 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2307.15396 [pdf, other]

Noisy Interpolation Learning with Shallow Univariate ReLU Networks

Authors: Nirmit Joshi, Gal Vardi, Nathan Srebro

Abstract: Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. 2022 noted that neural networks seem to often exhibit ``tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, t… ▽ More Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. 2022 noted that neural networks seem to often exhibit ``tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, this has not been studied rigorously. We provide the first rigorous analysis of the overfitting behavior of regression with minimum norm ($\ell_2$ of weights), focusing on univariate two-layer ReLU networks. We show overfitting is tempered (with high probability) when measured with respect to the $L_1$ loss, but also show that the situation is more complex than suggested by Mallinar et. al., and overfitting is catastrophic with respect to the $L_2$ loss, or when taking an expectation over the training set. △ Less

Submitted 21 March, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

Comments: To appear at ICLR 2024. Updated version with minor changes in the presentation

arXiv:2306.13188 [pdf, ps, other]

Uniform Convergence with Square-Root Lipschitz Loss

Authors: Lijia Zhou, Zhen Dai, Frederic Koehler, Nathan Srebro

Abstract: We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipsch… ▽ More We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipschitz losses, which includes also non-smooth loss functions appropriate for studying phase retrieval and ReLU regression, as well as rederive and better understand "optimistic rate" and interpolation learning guarantees. △ Less

Submitted 22 June, 2023; originally announced June 2023.

arXiv:2306.13185 [pdf, ps, other]

An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression

Authors: Lijia Zhou, James B. Simon, Gal Vardi, Nathan Srebro

Abstract: We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an "agnostic" view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or… ▽ More We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an "agnostic" view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022). △ Less

Submitted 22 March, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

Comments: This is the ICLR CR version

arXiv:2306.03534 [pdf, other]

Continual Learning in Linear Classification on Separable Data

Authors: Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, Daniel Soudry

Abstract: We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework. We then develop upper bounds on the forgetting and other quantities of interest under various set… ▽ More We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework. We then develop upper bounds on the forgetting and other quantities of interest under various settings with recurring tasks, including cyclic and random orderings of tasks. We discuss several practical implications to popular training practices like regularization scheduling and weighting. We point out several theoretical differences between our continual classification setting and a recently studied continual regression setting. △ Less

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.16508 [pdf, other]

Most Neural Networks Are Almost Learnable

Authors: Amit Daniely, Nathan Srebro, Gal Vardi

Abstract: We present a PTAS for learning random constant-depth networks. We show that for any fixed $ε>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $ε$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(ε^{-1})}$, where $\bar d$ is the size of… ▽ More We present a PTAS for learning random constant-depth networks. We show that for any fixed $ε>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $ε$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(ε^{-1})}$, where $\bar d$ is the size of the network. For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(ε^{-1})}$, resulting in a quasi-poly-time algorithm for learning constant depth random networks. △ Less

Submitted 24 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Small fixes after review

arXiv:2303.01462 [pdf, ps, other]

Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

Abstract: Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estim… ▽ More Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data. The settings include variants of the noisy class-conditional Gaussians considered in previous work as well as new distributional settings where benign overfitting has not been previously observed. The key ingredient to our proof is the observation that when the training data is nearly-orthogonal, both linear classifiers and leaky ReLU networks satisfying the KKT conditions for their respective margin maximization problems behave like a nearly uniform average of the training examples. △ Less

Submitted 2 March, 2023; originally announced March 2023.

Comments: 53 pages

arXiv:2303.01456 [pdf, ps, other]

The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

Abstract: In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial e… ▽ More In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples. Despite the potential for harmful overfitting in such overparameterized settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist. △ Less

Submitted 31 October, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: 42 pages; NeurIPS 2023 camera ready

arXiv:2302.07426 [pdf, ps, other]

Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy

Authors: Amit Daniely, Nathan Srebro, Gal Vardi

Abstract: Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution… ▽ More Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-$3$ ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where a random noise is added to the network's parameters. It implies that learning depth-$3$ ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-$2$ networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators. △ Less

Submitted 4 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: Changed the title, and made some other minor modifications. arXiv admin note: text overlap with arXiv:2101.08303

arXiv:2302.07263 [pdf, other]

Interpolation Learning With Minimum Description Length

Authors: Naren Sarayu Manoj, Nathan Srebro

Abstract: We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise. We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise. △ Less

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2210.12082 [pdf, other]

A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models

Authors: Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J. Sutherland, Nathan Srebro

Abstract: We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the squa… ▽ More We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal $\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand's well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: As published at NeurIPS 2022

arXiv:2210.07082 [pdf, other]

Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu

Abstract: The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient… ▽ More The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: 54 pages

arXiv:2209.07369 [pdf, ps, other]

Adversarially Robust Learning: A Generic Minimax Optimal Learner and Characterization

Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

Abstract: We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by Montasser, Hanneke, and Srebro (2019) and a broader family of… ▽ More We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by Montasser, Hanneke, and Srebro (2019) and a broader family of learners we identify as local learners. Our results are enabled by adopting a global perspective, specifically, through a key technical contribution: the global one-inclusion graph, which may be of independent interest, that generalizes the classical one-inclusion graph due to Haussler, Littlestone, and Warmuth (1994). Finally, as a byproduct, we identify a dimension characterizing qualitatively and quantitatively what classes of predictors $\mathcal{H}$ are robustly learnable. This resolves an open problem due to Montasser et al. (2019), and closes a (potentially) infinite gap between the established upper and lower bounds on the sample complexity of adversarially robust learning. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: To appear in NeurIPS 2022

arXiv:2205.10671 [pdf, other]

Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets

Authors: Gene Li, Cong Ma, Nathan Srebro

Abstract: We present a family $\{\hatπ\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\hatπ_2$ corresponds to Bellman-consistent pessimism (BCP), while $\hatπ_\infty$ is a novel generalization of lower confidence bound (LCB) to the linear setting. We show that the novel $\hatπ_\infty$ le… ▽ More We present a family $\{\hatπ\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\hatπ_2$ corresponds to Bellman-consistent pessimism (BCP), while $\hatπ_\infty$ is a novel generalization of lower confidence bound (LCB) to the linear setting. We show that the novel $\hatπ_\infty$ learning rule is, in a sense, adaptively optimal, as it achieves the minimax performance (up to log factors) against all $\ell_q$-constrained problems, and as such it strictly dominates all other predictors in the family, including $\hatπ_2$. △ Less

Submitted 4 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2205.09588 [pdf, other]

How catastrophic can catastrophic forgetting be in linear regression?

Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry

Abstract: To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research… ▽ More To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering. △ Less

Submitted 25 May, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Journal ref: 35th Annual Conference on Learning Theory (2022)

arXiv:2202.13328 [pdf, ps, other]

Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Authors: Idan Amir, Roi Livni, Nathan Srebro

Abstract: We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $ε$ (compared to the… ▽ More We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $ε$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/ε^2)$ and only $\tilde{O}(1/ε^2)$ iterations. This contrasts with general stochastic convex optimization, where $Ω(1/ε^4)$ iterations are needed Amir et al. [2021b]. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using $Θ(1/ε^4)$ samples, we rely on uniform convergence in a distribution-dependent ball. △ Less

Submitted 30 October, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

arXiv:2202.06233 [pdf, other]

The Sample Complexity of One-Hidden-Layer Neural Networks

Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

Abstract: We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to ge… ▽ More We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where (perhaps surprisingly) a mere spectral norm control turns out to be sufficient: First, when the network's activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches. △ Less

Submitted 22 September, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: Bug fixed in proof of Theorem 2 (resulting in different log factors); Other minor edits

arXiv:2112.14195 [pdf, other]

Exponential Family Model-Based Reinforcement Learning via Score Matching

Authors: Gene Li, Junbo Li, Anmol Kabra, Nathan Srebro, Zhaoran Wang, Zhuoran Yang

Abstract: We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Un… ▽ More We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters). △ Less

Submitted 8 January, 2023; v1 submitted 28 December, 2021; originally announced December 2021.

Comments: NeurIPS 2022

arXiv:2112.04470 [pdf, other]

Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

Authors: Lijia Zhou, Frederic Koehler, Danica J. Sutherland, Nathan Srebro

Abstract: We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis rec… ▽ More We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis recovers the guarantee from Koehler et al. (2021), which tightly characterizes the population risk of low-norm interpolators under the benign overfitting conditions. Our optimistic rate bound, though, also analyzes predictors with arbitrary training error. This allows us to recover some classical statistical guarantees for ridge and LASSO regression under random designs, and helps us obtain a precise understanding of the excess risk of near-interpolators in the over-parameterized regime. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2110.10602 [pdf, ps, other]

Transductive Robust Learning Guarantees

Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

Abstract: We study the problem of adversarially robust learning in the transductive setting. For classes $\mathcal{H}$ of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in t… ▽ More We study the problem of adversarially robust learning in the transductive setting. For classes $\mathcal{H}$ of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in the VC dimension and is adaptive to the complexity of the perturbation set. This result provides an exponential improvement in dependence on VC dimension over the best known upper bound on the robust error in the inductive setting, at the expense of competing with a more restrictive notion of optimal robust error. △ Less

Submitted 20 October, 2021; originally announced October 2021.

arXiv:2110.02954 [pdf, other]

A Stochastic Newton Algorithm for Distributed Convex Optimization

Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations… ▽ More We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence. △ Less

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.02732 [pdf, other]

On Margin Maximization in Linear and ReLU Networks

Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

Abstract: The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem.… ▽ More The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed. △ Less

Submitted 2 October, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: This version includes some minor updates

arXiv:2108.04190 [pdf, ps, other]

On the Power of Differentiable Learning versus PAC and SQ Learning

Authors: Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

Abstract: We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends… ▽ More We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $ρ$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b ρ$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $ρ$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b ρ^2$ is large enough, the power of SGD is equivalent to that of SQ learning. △ Less

Submitted 5 February, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

arXiv:2107.00595 [pdf, other]

Fast Margin Maximization via Dual Acceleration

Authors: Ziwei Ji, Nathan Srebro, Matus Telgarsky

Abstract: We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradi… ▽ More We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables. △ Less

Submitted 21 August, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

Comments: ICML 2021

arXiv:2106.09276 [pdf, other]

Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

Authors: Frederic Koehler, Lijia Zhou, Danica J. Sutherland, Nathan Srebro

Abstract: We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators… ▽ More We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings. △ Less

Submitted 3 January, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: v2: Minor changes only. As published at NeurIPS 2021: https://proceedings.neurips.cc/paper/2021/hash/ac9815bef801f58de83804bce86984ad-Abstract.html

arXiv:2106.02720 [pdf, ps, other]

An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

Authors: Blake Woodworth, Nathan Srebro

Abstract: We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan (2012), which is insensitive to the minimum expected loss; over the optimistic accel… ▽ More We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan (2012), which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al. (2011), which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin (2018), which is limited to least squares problems and is also similarly suboptimal with respect to the minibatch size. Applied to interpolation learning, the improvement over Cotter et al. and Liu and Belkin translates to a linear, rather than square-root, parallelization speedup. △ Less

Submitted 26 October, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: 24 pages

arXiv:2104.06970 [pdf, other]

Understanding the Eluder Dimension

Authors: Gene Li, Pritish Kamath, Dylan J. Foster, Nathan Srebro

Abstract: We provide new insights on eluder dimension, a complexity measure that has been extensively used to bound the regret of algorithms for online bandits and reinforcement learning with function approximation. First, we study the relationship between the eluder dimension for a function class and a generalized notion of rank, defined for any monotone "activation" $σ: \mathbb{R}\to \mathbb{R}$, which co… ▽ More We provide new insights on eluder dimension, a complexity measure that has been extensively used to bound the regret of algorithms for online bandits and reinforcement learning with function approximation. First, we study the relationship between the eluder dimension for a function class and a generalized notion of rank, defined for any monotone "activation" $σ: \mathbb{R}\to \mathbb{R}$, which corresponds to the minimal dimension required to represent the class as a generalized linear model. It is known that when $σ$ has derivatives bounded away from $0$, $σ$-rank gives rise to an upper bound on eluder dimension for any function class; we show however that eluder dimension can be exponentially smaller than $σ$-rank. We also show that the condition on the derivative is necessary; namely, when $σ$ is the $\mathsf{relu}$ activation, the eluder dimension can be exponentially larger than $σ$-rank. For binary-valued function classes, we obtain a characterization of the eluder dimension in terms of star number and threshold dimension, quantities which are relevant in active learning and online learning respectively. △ Less

Submitted 4 October, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: NeurIPS 2022

arXiv:2103.01210 [pdf, ps, other]

Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

Abstract: We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small ev… ▽ More We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing. △ Less

Submitted 1 March, 2021; originally announced March 2021.

arXiv:2102.09769 [pdf, other]

On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

Authors: Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, Daniel Soudry

Abstract: Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involve… ▽ More Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient-flow and use it to obtain closed-form implicit regularizers for multiple cases of interest. △ Less

Submitted 19 February, 2021; originally announced February 2021.

Comments: 33 pages, 2 figures

MSC Class: 68T07 (Primary) ACM Class: I.2.6; G.1.6

arXiv:2102.02145 [pdf, ps, other]

Adversarially Robust Learning with Unknown Perturbation Sets

Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

Abstract: We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of succ… ▽ More We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set. △ Less

Submitted 3 February, 2021; originally announced February 2021.

arXiv:2102.01583 [pdf, other]

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound wi… ▽ More We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm. △ Less

Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: 48 pages

arXiv:2101.01134 [pdf, other]

Does Invariant Risk Minimization Capture Invariance?

Authors: Pritish Kamath, Akilesh Tangella, Danica J. Sutherland, Nathan Srebro

Abstract: We show that the Invariant Risk Minimization (IRM) formulation of Arjovsky et al. (2019) can fail to capture "natural" invariances, at least when used in its practical "linear" form, and even on very simple problems which directly follow the motivating examples for IRM. This can lead to worse generalization on new environments, even when compared to unconstrained ERM. The issue stems from a signif… ▽ More We show that the Invariant Risk Minimization (IRM) formulation of Arjovsky et al. (2019) can fail to capture "natural" invariances, at least when used in its practical "linear" form, and even on very simple problems which directly follow the motivating examples for IRM. This can lead to worse generalization on new environments, even when compared to unconstrained ERM. The issue stems from a significant gap between the linear variant (as in their concrete method IRMv1) and the full non-linear IRM formulation. Additionally, even when capturing the "right" invariances, we show that it is possible for IRM to learn a sub-optimal predictor, due to the loss function not being invariant across environments. The issues arise even when measuring invariance on the population distributions, but are exacerbated by the fact that IRM is extremely fragile to sampling. △ Less

Submitted 26 February, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: Code is available in the arXiv ancillary files, linked from this page

arXiv:2010.12039 [pdf, ps, other]

Reducing Adversarially Robust Learning to Non-Robust PAC Learning

Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

Abstract: We study the problem of reducing adversarially robust learning to standard PAC learning, i.e. the complexity of learning adversarially robust predictors using access to only a black-box non-robust learner. We give a reduction that can robustly learn any hypothesis class $\mathcal{C}$ using any non-robust learner $\mathcal{A}$ for $\mathcal{C}$. The number of calls to $\mathcal{A}$ depends logarith… ▽ More We study the problem of reducing adversarially robust learning to standard PAC learning, i.e. the complexity of learning adversarially robust predictors using access to only a black-box non-robust learner. We give a reduction that can robustly learn any hypothesis class $\mathcal{C}$ using any non-robust learner $\mathcal{A}$ for $\mathcal{C}$. The number of calls to $\mathcal{A}$ depends logarithmically on the number of allowed adversarial perturbations per example, and we give a lower bound showing this is unavoidable. △ Less

Submitted 22 October, 2020; originally announced October 2020.

Comments: To appear in NeurIPS 2020

arXiv:2007.06738 [pdf, other]

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

Abstract: We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accuratel… ▽ More We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits. △ Less

Submitted 13 July, 2020; originally announced July 2020.

arXiv:2007.05073 [pdf, other]

Predictive Value Generalization Bounds

Authors: Keshav Vemuri, Nathan Srebro

Abstract: In this paper, we study a bi-criterion framework for assessing scoring functions in the context of binary classification. The positive and negative predictive values (ppv and npv, respectively) are conditional probabilities of the true label matching a classifier's predicted label. The usual classification error rate is a linear combination of these probabilities, and therefore, concentration ineq… ▽ More In this paper, we study a bi-criterion framework for assessing scoring functions in the context of binary classification. The positive and negative predictive values (ppv and npv, respectively) are conditional probabilities of the true label matching a classifier's predicted label. The usual classification error rate is a linear combination of these probabilities, and therefore, concentration inequalities for the error rate do not yield confidence intervals for the two separate predictive values. We study generalization properties of scoring functions with respect to predictive values by deriving new distribution-free large deviation and uniform convergence bounds. The latter bound is stated in terms of a measure of function class complexity that we call the order coefficient; we relate this combinatorial quantity to the VC-subgraph dimension. △ Less

Submitted 9 July, 2020; originally announced July 2020.

Comments: 20 pages, 3 figures

arXiv:2006.05942 [pdf, ps, other]

On Uniform Convergence and Low-Norm Interpolation Learning

Authors: Lijia Zhou, Danica J. Sutherland, Nathan Srebro

Abstract: We consider an underdetermined noisy linear regression model where the minimum-norm interpolating predictor is known to be consistent, and ask: can uniform convergence in a norm ball, or at least (following Nagarajan and Kolter) the subset of a norm ball that the algorithm selects on a typical input set, explain this success? We show that uniformly bounding the difference between empirical and pop… ▽ More We consider an underdetermined noisy linear regression model where the minimum-norm interpolating predictor is known to be consistent, and ask: can uniform convergence in a norm ball, or at least (following Nagarajan and Kolter) the subset of a norm ball that the algorithm selects on a typical input set, explain this success? We show that uniformly bounding the difference between empirical and population errors cannot show any learning in the norm ball, and cannot show consistency for any set, even one depending on the exact algorithm and distribution. But we argue we can explain the consistency of the minimal-norm interpolator with a slightly weaker, yet standard, notion: uniform convergence of zero-error predictors in a norm ball. We use this to bound the generalization error of low- (but not minimal-) norm interpolating predictors. △ Less

Submitted 13 January, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: v3: No content changes to this final version, as published at NeurIPS 2020: https://proceedings.neurips.cc/paper/2020/hash/4cc5400e63624c44fadeda99f57588a6-Abstract.html

arXiv:2006.04735 [pdf, other]

Minibatch vs Local SGD for Heterogeneous Distributed Learning

Authors: Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro

Abstract: We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) domina… ▽ More We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime. △ Less

Submitted 1 March, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 34 pages

arXiv:2005.07652 [pdf, ps, other]

Efficiently Learning Adversarially Robust Halfspaces with Noise

Authors: Omar Montasser, Surbhi Goel, Ilias Diakonikolas, Nathan Srebro

Abstract: We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to… ▽ More We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to any $\ell_p$-perturbation. △ Less

Submitted 15 May, 2020; originally announced May 2020.

arXiv:2004.01025 [pdf, ps, other]

Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent

Authors: Suriya Gunasekar, Blake Woodworth, Nathan Srebro

Abstract: We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows gen… ▽ More We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows generalizing Mirror Descent to general Riemannian geometries, even when the metric tensor is {\em not} a Hessian, and thus there is no "dual." △ Less

Submitted 1 July, 2021; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: 11 pages

arXiv:2003.04180 [pdf, ps, other]

Approximate is Good Enough: Probabilistic Variants of Dimensional and Margin Complexity

Authors: Pritish Kamath, Omar Montasser, Nathan Srebro

Abstract: We present and study approximate notions of dimensional and margin complexity, which correspond to the minimal dimension or norm of an embedding required to approximate, rather then exactly represent, a given hypothesis class. We show that such notions are not only sufficient for learning using linear predictors or a kernel, but unlike the exact variants, are also necessary. Thus they are better s… ▽ More We present and study approximate notions of dimensional and margin complexity, which correspond to the minimal dimension or norm of an embedding required to approximate, rather then exactly represent, a given hypothesis class. We show that such notions are not only sufficient for learning using linear predictors or a kernel, but unlike the exact variants, are also necessary. Thus they are better suited for discussing limitations of linear or kernel methods. △ Less

Submitted 9 March, 2020; originally announced March 2020.

arXiv:2003.03397 [pdf, other]

Dropout: Explicit Forms and Capacity Control

Authors: Raman Arora, Peter Bartlett, Poorya Mianjy, Nathan Srebro

Abstract: We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a data-dependent regularizer that, in expectation, equals the weighted trace-norm of the product of the factors. In deep learning, we show that the data-dependent regularizer due to dropout directly controls the Rademacher complexity of the u… ▽ More We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a data-dependent regularizer that, in expectation, equals the weighted trace-norm of the product of the factors. In deep learning, we show that the data-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm in both matrix completion as well as training deep neural networks. We evaluate our theoretical findings on real-world datasets, including MovieLens, MNIST, and Fashion-MNIST. △ Less

Submitted 6 March, 2020; originally announced March 2020.

arXiv:2002.11651 [pdf, other]

Fair Learning with Private Demographic Data

Authors: Hussein Mozannar, Mesrob I. Ohannessian, Nathan Srebro

Abstract: Sensitive attributes such as race are rarely available to learners in real world settings as their collection is often restricted by laws and regulations. We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. We show how to adapt non-discriminatory learners to work with privatized… ▽ More Sensitive attributes such as race are rarely available to learners in real world settings as their collection is often restricted by laws and regulations. We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. We show how to adapt non-discriminatory learners to work with privatized protected attributes giving theoretical guarantees on performance. Finally, we highlight how the methodology could apply to learning fair predictors in settings where protected attributes are only available for a subset of the data. △ Less

Submitted 13 July, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

Comments: ICML 2020

arXiv:2002.09277 [pdf, other]

Kernel and Rich Regimes in Overparametrized Models

Authors: Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich… ▽ More A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks. △ Less

Submitted 27 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: This updates and significantly extends a previous article (arXiv:1906.05827), Sections 6 and 7 are the most major additions. 31 pages. arXiv admin note: text overlap with arXiv:1906.05827

Showing 1–50 of 133 results for author: Srebro, N