Search | arXiv e-print repository

RedEx: Beyond Fixed Representation Methods via Convex Optimization

Authors: Amit Daniely, Mariano Schain, Gilad Yehudai

Abstract: Optimizing Neural networks is a difficult task which is still not well understood. On the other hand, fixed representation methods such as kernels and random features have provable optimization guarantees but inferior performance due to their inherent inability to learn the representations. In this paper, we aim at bridging this gap by presenting a novel architecture called RedEx (Reduced Expander… ▽ More Optimizing Neural networks is a difficult task which is still not well understood. On the other hand, fixed representation methods such as kernels and random features have provable optimization guarantees but inferior performance due to their inherent inability to learn the representations. In this paper, we aim at bridging this gap by presenting a novel architecture called RedEx (Reduced Expander Extractor) that is as expressive as neural networks and can also be trained in a layer-wise fashion via a convex program with semi-definite constraints and optimization guarantees. We also show that RedEx provably surpasses fixed representation methods, in the sense that it can efficiently learn a family of target functions which fixed representation methods cannot. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2311.13877 [pdf, other]

Locally Optimal Descent for Dynamic Stepsize Scheduling

Authors: Gilad Yehudai, Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain

Abstract: We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method wit… ▽ More We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization, matching state-of-the-art bounds while only assuming knowledge of the smoothness parameter. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases and achieving comparable performance with drastically reduced parameter tuning. △ Less

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2307.00642 [pdf, ps, other]

Multiclass Boosting: Simple and Intuitive Weak Learning Criteria

Authors: Nataly Brukhim, Amit Daniely, Yishay Mansour, Shay Moran

Abstract: We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being "slightly better than random guessing". We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of… ▽ More We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being "slightly better than random guessing". We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. In addition, we utilize our new boosting technique in several theoretical applications within the context of List PAC Learning. First, we establish an equivalence to weak PAC learning. Furthermore, we present a new result on boosting for list learners, as well as provide a novel proof for the characterization of multiclass PAC learning and List PAC learning. Notably, our technique gives rise to a simplified analysis, and also implies an improved error bound for large list sizes, compared to previous results. △ Less

Submitted 2 July, 2023; originally announced July 2023.

arXiv:2305.16508 [pdf, other]

Most Neural Networks Are Almost Learnable

Authors: Amit Daniely, Nathan Srebro, Gal Vardi

Abstract: We present a PTAS for learning random constant-depth networks. We show that for any fixed $ε>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $ε$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(ε^{-1})}$, where $\bar d$ is the size of… ▽ More We present a PTAS for learning random constant-depth networks. We show that for any fixed $ε>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $ε$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(ε^{-1})}$, where $\bar d$ is the size of the network. For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(ε^{-1})}$, resulting in a quasi-poly-time algorithm for learning constant depth random networks. △ Less

Submitted 24 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Small fixes after review

arXiv:2302.07426 [pdf, ps, other]

Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy

Authors: Amit Daniely, Nathan Srebro, Gal Vardi

Abstract: Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution… ▽ More Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-$3$ ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where a random noise is added to the network's parameters. It implies that learning depth-$3$ ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-$2$ networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators. △ Less

Submitted 4 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: Changed the title, and made some other minor modifications. arXiv admin note: text overlap with arXiv:2101.08303

arXiv:2211.09634 [pdf, ps, other]

On the Sample Complexity of Two-Layer Networks: Lipschitz vs. Element-Wise Lipschitz Activation

Authors: Amit Daniely, Elad Granot

Abstract: We investigate the sample complexity of bounded two-layer neural networks using different activation functions. In particular, we consider the class $$ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, σ\circ W\textbf{b} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^d, W \in \mathbb{R}^{\mathcal{T}\times d}, \textbf{v} \in \mathbb{R}^{\mathcal{T}}\right\} $$ where the spectral norm… ▽ More We investigate the sample complexity of bounded two-layer neural networks using different activation functions. In particular, we consider the class $$ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, σ\circ W\textbf{b} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^d, W \in \mathbb{R}^{\mathcal{T}\times d}, \textbf{v} \in \mathbb{R}^{\mathcal{T}}\right\} $$ where the spectral norm of $W$ and $\textbf{v}$ is bounded by $O(1)$, the Frobenius norm of $W$ is bounded from its initialization by $R > 0$, and $σ$ is a Lipschitz activation function. We prove that if $σ$ is element-wise, then the sample complexity of $\mathcal{H}$ has only logarithmic dependency in width and that this complexity is tight, up to logarithmic factors. We further show that the element-wise property of $σ$ is essential for a logarithmic dependency bound in width, in the sense that there exist non-element-wise activation functions whose sample complexity is linear in width, for widths that can be up to exponential in the input dimension. For the upper bound, we use the recent approach for norm-based bounds named Approximate Description Length (ADL) by arXiv:1910.05697. We further develop new techniques and tools for this approach that will hopefully inspire future works. △ Less

Submitted 20 January, 2024; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 13 pages

Journal ref: 35th International Conference on Algorithmic Learning Theory, 2024

arXiv:2209.12882 [pdf, other]

Approximate Description Length, Covering Numbers, and VC Dimension

Authors: Amit Daniely, Gal Katzhendler

Abstract: Recently, Daniely and Granot [arXiv:1910.05697] introduced a new notion of complexity called Approximate Description Length (ADL). They used it to derive novel generalization bounds for neural networks, that despite substantial work, were out of reach for more classical techniques such as discretization, Covering Numbers and Rademacher Complexity. In this paper we explore how ADL relates to classi… ▽ More Recently, Daniely and Granot [arXiv:1910.05697] introduced a new notion of complexity called Approximate Description Length (ADL). They used it to derive novel generalization bounds for neural networks, that despite substantial work, were out of reach for more classical techniques such as discretization, Covering Numbers and Rademacher Complexity. In this paper we explore how ADL relates to classical notions of function complexity such as Covering Numbers and VC Dimension. We find that for functions whose range is the reals, ADL is essentially equivalent to these classical complexity measures. However, this equivalence breaks for functions with high dimensional range. △ Less

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2202.05246 [pdf, other]

Monotone Learning

Authors: Olivier Bousquet, Amit Daniely, Haim Kaplan, Yishay Mansour, Shay Moran, Uri Stemmer

Abstract: The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern rec… ▽ More The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a {monotone} Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021). △ Less

Submitted 6 November, 2023; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: Fixed a calculation error in Lemma 2.5

arXiv:2106.11879 [pdf, other]

Asynchronous Stochastic Optimization Robust to Arbitrary Delays

Authors: Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain

Abstract: We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communicat… ▽ More We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( σ^2/ε^4 + τ/ε^2 )$ steps for finding an $ε$-stationary point $x$, where $τ$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $σ^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed. △ Less

Submitted 15 November, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

arXiv:2105.09673 [pdf, other]

An Exact Poly-Time Membership-Queries Algorithm for Extraction a three-Layer ReLU Network

Authors: Amit Daniely, Elad Granot

Abstract: We consider the natural problem of learning a ReLU network from queries, which was recently remotivated by model extraction attacks. In this work, we present a polynomial-time algorithm that can learn a depth-two ReLU network from queries under mild general position assumptions. We also present a polynomial-time algorithm that, under mild general position assumptions, can learn a rich class of dep… ▽ More We consider the natural problem of learning a ReLU network from queries, which was recently remotivated by model extraction attacks. In this work, we present a polynomial-time algorithm that can learn a depth-two ReLU network from queries under mild general position assumptions. We also present a polynomial-time algorithm that, under mild general position assumptions, can learn a rich class of depth-three ReLU networks from queries. For instance, it can learn most networks where the number of first layer neurons is smaller than the dimension and the number of second layer neurons. These two results substantially improve state-of-the-art: Until our work, polynomial-time algorithms were only shown to learn from queries depth-two networks under the assumption that either the underlying distribution is Gaussian (Chen et al. (2021)) or that the weights matrix rows are linearly independent (Milli et al. (2019)). For depth three or more, there were no known poly-time results. △ Less

Submitted 14 February, 2023; v1 submitted 20 May, 2021; originally announced May 2021.

arXiv:2101.08303 [pdf, ps, other]

From Local Pseudorandom Generators to Hardness of Learning

Authors: Amit Daniely, Gal Vardi

Abstract: We prove hardness-of-learning results under a well-studied assumption on the existence of local pseudorandom generators. As we show, this assumption allows us to surpass the current state of the art, and prove hardness of various basic problems, with no hardness results to date. Our results include: hardness of learning shallow ReLU neural networks under the Gaussian distribution and other distr… ▽ More We prove hardness-of-learning results under a well-studied assumption on the existence of local pseudorandom generators. As we show, this assumption allows us to surpass the current state of the art, and prove hardness of various basic problems, with no hardness results to date. Our results include: hardness of learning shallow ReLU neural networks under the Gaussian distribution and other distributions; hardness of learning intersections of $ω(1)$ halfspaces, DNF formulas with $ω(1)$ terms, and ReLU networks with $ω(1)$ hidden neurons; hardness of weakly learning deterministic finite automata under the uniform distribution; hardness of weakly learning depth-$3$ Boolean circuits under the uniform distribution, as well as distribution-specific hardness results for learning DNF formulas and intersections of halfspaces. We also establish lower bounds on the complexity of learning intersections of a constant number of halfspaces, and ReLU networks with a constant number of hidden neurons. Moreover, our results imply the hardness of virtually all improper PAC-learning problems (both distribution-free and distribution-specific) that were previously shown hard under other assumptions. △ Less

Submitted 8 June, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

arXiv:2010.14927 [pdf, other]

Most ReLU Networks Suffer from $\ell^2$ Adversarial Perturbations

Authors: Amit Daniely, Hadas Schacham

Abstract: We consider ReLU networks with random weights, in which the dimension decreases at each layer. We show that for most such networks, most examples $x$ admit an adversarial perturbation at an Euclidean distance of $O\left(\frac{\|x\|}{\sqrt{d}}\right)$, where $d$ is the input dimension. Moreover, this perturbation can be found via gradient flow, as well as gradient descent with sufficiently small st… ▽ More We consider ReLU networks with random weights, in which the dimension decreases at each layer. We show that for most such networks, most examples $x$ admit an adversarial perturbation at an Euclidean distance of $O\left(\frac{\|x\|}{\sqrt{d}}\right)$, where $d$ is the input dimension. Moreover, this perturbation can be found via gradient flow, as well as gradient descent with sufficiently small steps. This result can be seen as an explanation to the abundance of adversarial examples, and to the fact that they are found via gradient descent. △ Less

Submitted 28 October, 2020; originally announced October 2020.

arXiv:2006.03177 [pdf, ps, other]

Hardness of Learning Neural Networks with Natural Weights

Authors: Amit Daniely, Gal Vardi

Abstract: Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network's weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network's weights are "well-behaved" and posses some generic properties that may allow efficient learning. This approach is supported b… ▽ More Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network's weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network's weights are "well-behaved" and posses some generic properties that may allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some "random-like" properties with respect to some "natural" distributions. We prove negative results in this regard, and show that for depth-$2$ networks, and many "natural" weights distributions such as the normal and the uniform distribution, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights, and every input distribution. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning. △ Less

Submitted 13 October, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

arXiv:2003.12895 [pdf, other]

Memorizing Gaussians with no over-parameterizaion via gradient decent on neural networks

Authors: Amit Daniely

Abstract: We prove that a single step of gradient decent over depth two network, with $q$ hidden neurons, starting from orthogonal initialization, can memorize $Ω\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value. We prove that a single step of gradient decent over depth two network, with $q$ hidden neurons, starting from orthogonal initialization, can memorize $Ω\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value. △ Less

Submitted 28 March, 2020; originally announced March 2020.

arXiv:2002.07400 [pdf, other]

Learning Parities with Neural Networks

Authors: Amit Daniely, Eran Malach

Abstract: In recent years we see a rapidly growing line of research which shows learnability of various models via common neural network algorithms. Yet, besides a very few outliers, these results show learnability of models that can be learned using linear methods. Namely, such results show that learning neural-networks with gradient-descent is competitive with learning a linear classifier on top of a data… ▽ More In recent years we see a rapidly growing line of research which shows learnability of various models via common neural network algorithms. Yet, besides a very few outliers, these results show learnability of models that can be learned using linear methods. Namely, such results show that learning neural-networks with gradient-descent is competitive with learning a linear classifier on top of a data-independent representation of the examples. This leaves much to be desired, as neural networks are far more successful than linear methods. Furthermore, on the more conceptual level, linear models don't seem to capture the "deepness" of deep networks. In this paper we make a step towards showing leanability of models that are inherently non-linear. We show that under certain distributions, sparse parities are learnable via gradient decent on depth-two network. On the other hand, under the same distributions, these parities cannot be learned efficiently by linear methods. △ Less

Submitted 3 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:2002.03273 [pdf, ps, other]

On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

Authors: Yossi Arjevani, Amit Daniely, Stefanie Jegelka, Hongzhou Lin

Abstract: Recent advances in randomized incremental methods for minimizing $L$-smooth $μ$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/μ})\log(1/ε))$ and $O(n+\sqrt{nL/ε})$, where $μ>0$ and $μ=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge o… ▽ More Recent advances in randomized incremental methods for minimizing $L$-smooth $μ$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/μ})\log(1/ε))$ and $O(n+\sqrt{nL/ε})$, where $μ>0$ and $μ=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $Ω(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible with stochastic oracles}, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/μ})\log(1/ε))$ and $O(n\sqrt{L/ε})$, for $μ>0$ and $μ=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tildeΩ(n^2+\sqrt{nL/μ}\log(1/ε))$ and $\tildeΩ(n^2+\sqrt{nL/ε})$, for $μ>0$ and $μ=0$, respectively. △ Less

Submitted 8 February, 2020; originally announced February 2020.

arXiv:1911.09873 [pdf, other]

Neural Networks Learning and Memorization with (almost) no Over-Parameterization

Authors: Amit Daniely

Abstract: Many results in recent years established polynomial time learnability of various models via neural networks algorithms. However, unless the model is linear separable, or the activation is a polynomial, these results require very large networks -- much more than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth two neural networks can memorize sam… ▽ More Many results in recent years established polynomial time learnability of various models via neural networks algorithms. However, unless the model is linear separable, or the activation is a polynomial, these results require very large networks -- much more than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near optimal network size, sample complexity, and runtime. In particular, we show that SGD on depth two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\mathbb{S}^{d-1}$. △ Less

Submitted 22 November, 2019; originally announced November 2019.

arXiv:1910.05697 [pdf, other]

Generalization Bounds for Neural Networks via Approximate Description Length

Authors: Amit Daniely, Elad Granot

Abstract: We investigate the sample complexity of networks with bounds on the magnitude of its weights. In particular, we consider the class \[ H=\left\{W_t\circρ\circ \ldots\circρ\circ W_{1} :W_1,\ldots,W_{t-1}\in M_{d, d}, W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $ρ$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the sm… ▽ More We investigate the sample complexity of networks with bounds on the magnitude of its weights. In particular, we consider the class \[ H=\left\{W_t\circρ\circ \ldots\circρ\circ W_{1} :W_1,\ldots,W_{t-1}\in M_{d, d}, W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $ρ$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the smoothened ReLU function $ \ln (1+e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $H$ is $\tilde O\left(\frac{dR^2}{ε^2}\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{ε^2}\right)$. We furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where $W_i^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many typical regimes of parameters. To establish our results we develop a new technique to analyze the sample complexity of families $H$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:X\to\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $H$ using $d$ bits, then $d/ε^2$ examples suffices to guarantee uniform convergence. Namely, that the empirical loss of all the functions in the class is $ε$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions. △ Less

Submitted 13 October, 2019; originally announced October 2019.

Comments: To appear in NeurIPS

arXiv:1909.12051 [pdf, other]

The Implicit Bias of Depth: How Incremental Learning Drives Generalization

Authors: Daniel Gissin, Shai Shalev-Shwartz, Amit Daniely

Abstract: A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear… ▽ More A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning. △ Less

Submitted 28 December, 2019; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: 25 pages, 7 figures, published at the International Conference on Learning Representations (ICLR) 2020

arXiv:1907.05444 [pdf, other]

On the Optimality of Trees Generated by ID3

Authors: Alon Brutzkus, Amit Daniely, Eran Malach

Abstract: Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we introduce a novel metric of a decision tree algorithm's performance, called mean iteration statistical consistency (MIC), which measures optimality of trees generated by ID3. As opposed to… ▽ More Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we introduce a novel metric of a decision tree algorithm's performance, called mean iteration statistical consistency (MIC), which measures optimality of trees generated by ID3. As opposed to previous metrics, MIC can differentiate between different decision tree algorithms and compare their performance. We provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), has near-optimal MIC in various settings for learning read-once DNFs under product distributions. In contrast, another widely used variant of ID3 has MIC which is not near-optimal. We show that the MIC analysis predicts well the performance of these algorithms in practice. Our results present a novel view of decision tree algorithms which may lead to better and more practical guarantees for these algorithms. △ Less

Submitted 23 February, 2020; v1 submitted 11 July, 2019; originally announced July 2019.

arXiv:1906.08654 [pdf, ps, other]

ID3 Learns Juntas for Smoothed Product Distributions

Authors: Alon Brutzkus, Amit Daniely, Eran Malach

Abstract: In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of… ▽ More In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = \log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a "noisy" variant of the original distribution. △ Less

Submitted 20 June, 2019; originally announced June 2019.

arXiv:1904.03602 [pdf, ps, other]

Competitive ratio versus regret minimization: achieving the best of both worlds

Authors: Amit Daniely, Yishay Mansour

Abstract: We consider online algorithms under both the competitive ratio criteria and the regret minimization one. Our main goal is to build a unified methodology that would be able to guarantee both criteria simultaneously. For a general class of online algorithms, namely any Metrical Task System (MTS), we show that one can simultaneously guarantee the best known competitive ratio and a natural regret bo… ▽ More We consider online algorithms under both the competitive ratio criteria and the regret minimization one. Our main goal is to build a unified methodology that would be able to guarantee both criteria simultaneously. For a general class of online algorithms, namely any Metrical Task System (MTS), we show that one can simultaneously guarantee the best known competitive ratio and a natural regret bound. For the paging problem we further show an efficient online algorithm (polynomial in the number of pages) with this guarantee. To this end, we extend an existing regret minimization algorithm (specifically, Kapralov and Panigrahy) to handle movement cost (the cost of switching between states of the online system). We then show how to use the extended regret minimization algorithm to combine multiple online algorithms. Our end result is an online algorithm that can combine a "base" online algorithm, having a guaranteed competitive ratio, with a range of online algorithms that guarantee a small regret over any interval of time. The combined algorithm guarantees both that the competitive ratio matches that of the base algorithm and a low regret over any time interval. As a by product, we obtain an expert algorithm with close to optimal regret bound on every time interval, even in the presence of switching costs. This result is of independent interest. △ Less

Submitted 7 April, 2019; originally announced April 2019.

arXiv:1809.09165 [pdf, ps, other]

Locally Private Learning without Interaction Requires Separation

Authors: Amit Daniely, Vitaly Feldman

Abstract: We consider learning under the constraint of local differential privacy (LDP). For many learning problems known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to… ▽ More We consider learning under the constraint of local differential privacy (LDP). For many learning problems known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution (Kasiviswanathan et al. 2011). We show that the margin complexity of a class of Boolean functions is a lower bound on the complexity of any non-interactive LDP algorithm for distribution-independent PAC learning of the class. In particular, the classes of linear separators and decision lists require exponential number of samples to learn non-interactively even though they can be learned in polynomial time by an interactive LDP algorithm. This gives the first example of a natural problem that is significantly harder to solve without interaction and also resolves an open problem of Kasiviswanathan et al. (2011). We complement this lower bound with a new efficient learning algorithm whose complexity is polynomial in the margin complexity of the class. Our algorithm is non-interactive on labeled samples but still needs interactive access to unlabeled samples. All of our results also apply to the statistical query model and any model in which the number of bits communicated about each data point is constrained. △ Less

Submitted 28 October, 2019; v1 submitted 24 September, 2018; originally announced September 2018.

arXiv:1805.02363 [pdf, other]

Planning and Learning with Stochastic Action Sets

Authors: Craig Boutilier, Alon Cohen, Amit Daniely, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, Dale Schuurmans

Abstract: In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundation… ▽ More In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution. △ Less

Submitted 12 February, 2021; v1 submitted 7 May, 2018; originally announced May 2018.

arXiv:1803.03155 [pdf, other]

Learning Rules-First Classifiers

Authors: Deborah Cohen, Amit Daniely, Amir Globerson, Gal Elidan

Abstract: Complex classifiers may exhibit "embarassing" failures in cases where humans can easily provide a justified classification. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class tha… ▽ More Complex classifiers may exhibit "embarassing" failures in cases where humans can easily provide a justified classification. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class that captures this notion and determine its sample complexity. We also give evidence that efficient algorithms cannot achieve this sample complexity. We then derive a simple and efficient algorithm and show that its sample complexity is close to optimal, among efficient algorithms. Experiments on synthetic and sentiment analysis data demonstrate the efficacy of the method, both in terms of accuracy and interpretability. △ Less

Submitted 13 June, 2019; v1 submitted 8 March, 2018; originally announced March 2018.

arXiv:1703.07872 [pdf, other]

Random Features for Compositional Kernels

Authors: Amit Daniely, Roy Frostig, Vineet Gupta, Yoram Singer

Abstract: We describe and analyze a simple random feature scheme (RFS) from prescribed compositional kernels. The compositional kernels we use are inspired by the structure of convolutional neural networks and kernels. The resulting scheme yields sparse and efficiently computable features. Each random feature can be represented as an algebraic expression over a small number of (random) paths in a compositio… ▽ More We describe and analyze a simple random feature scheme (RFS) from prescribed compositional kernels. The compositional kernels we use are inspired by the structure of convolutional neural networks and kernels. The resulting scheme yields sparse and efficiently computable features. Each random feature can be represented as an algebraic expression over a small number of (random) paths in a composition tree. Thus, compositional random features can be stored compactly. The discrete nature of the generation process enables de-duplication of repeated features, further compacting the representation and increasing the diversity of the embeddings. Our approach complements and can be combined with previous random feature schemes. △ Less

Submitted 22 March, 2017; originally announced March 2017.

arXiv:1702.08503 [pdf, other]

SGD Learns the Conjugate Kernel Class of the Network

Authors: Amit Daniely

Abstract: We show that the standard stochastic gradient decent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guara… ▽ More We show that the standard stochastic gradient decent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth more that two. As corollaries, it follows that for neural networks of any depth between $2$ and $\log(n)$, SGD is guaranteed to learn, in polynomial time, constant degree polynomials with polynomially bounded coefficients. Likewise, it follows that SGD on large enough networks can learn any continuous function (not in polynomial time), complementing classical expressivity results. △ Less

Submitted 19 May, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

arXiv:1702.08489 [pdf, other]

Depth Separation for Neural Networks

Authors: Amit Daniely

Abstract: Let $f:\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}\to\mathbb{S}$ be a function of the form $f(\mathbf{x},\mathbf{x}') = g(\langle\mathbf{x},\mathbf{x}'\rangle)$ for $g:[-1,1]\to \mathbb{R}$. We give a simple proof that shows that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. Moreover, for… ▽ More Let $f:\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}\to\mathbb{S}$ be a function of the form $f(\mathbf{x},\mathbf{x}') = g(\langle\mathbf{x},\mathbf{x}'\rangle)$ for $g:[-1,1]\to \mathbb{R}$. We give a simple proof that shows that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. Moreover, for many $g$'s, such as $g(x)=\sin(πd^3x)$, the number of neurons must be $2^{Ω\left(d\log(d)\right)}$. Furthermore, the result holds w.r.t.\ the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$. As many functions of the above form can be well approximated by poly-size depth three networks with poly-bounded weights, this establishes a separation between depth two and depth three networks w.r.t.\ the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$. △ Less

Submitted 27 February, 2017; originally announced February 2017.

arXiv:1611.10228 [pdf, ps, other]

Behavior-Based Machine-Learning: A Hybrid Approach for Predicting Human Decision Making

Authors: Gali Noti, Effi Levi, Yoav Kolumbus, Amit Daniely

Abstract: A large body of work in behavioral fields attempts to develop models that describe the way people, as opposed to rational agents, make decisions. A recent Choice Prediction Competition (2015) challenged researchers to suggest a model that captures 14 classic choice biases and can predict human decisions under risk and ambiguity. The competition focused on simple decision problems, in which human s… ▽ More A large body of work in behavioral fields attempts to develop models that describe the way people, as opposed to rational agents, make decisions. A recent Choice Prediction Competition (2015) challenged researchers to suggest a model that captures 14 classic choice biases and can predict human decisions under risk and ambiguity. The competition focused on simple decision problems, in which human subjects were asked to repeatedly choose between two gamble options. In this paper we present our approach for predicting human decision behavior: we suggest to use machine learning algorithms with features that are based on well-established behavioral theories. The basic idea is that these psychological features are essential for the representation of the data and are important for the success of the learning process. We implement a vanilla model in which we train SVM models using behavioral features that rely on the psychological properties underlying the competition baseline model. We show that this basic model captures the 14 choice biases and outperforms all the other learning-based models in the competition. The preliminary results suggest that such hybrid models can significantly improve the prediction of human decision making, and are a promising direction for future research. △ Less

Submitted 30 November, 2016; originally announced November 2016.

ACM Class: J.4; I.2.6; H.1.2

arXiv:1604.05753 [pdf, other]

Sketching and Neural Networks

Authors: Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar

Abstract: High-dimensional sparse data present computational and statistical challenges for supervised learning. We propose compact linear sketches for reducing the dimensionality of the input, followed by a single layer neural network. We show that any sparse polynomial function can be computed, on nearly all sparse binary vectors, by a single layer neural network that takes a compact sketch of the vector… ▽ More High-dimensional sparse data present computational and statistical challenges for supervised learning. We propose compact linear sketches for reducing the dimensionality of the input, followed by a single layer neural network. We show that any sparse polynomial function can be computed, on nearly all sparse binary vectors, by a single layer neural network that takes a compact sketch of the vector as input. Consequently, when a set of sparse binary vectors is approximately separable using a sparse polynomial, there exists a single-layer neural network that takes a short sketch as input and correctly classifies nearly all the points. Previous work has proposed using sketches to reduce dimensionality while preserving the hypothesis class. However, the sketch size has an exponential dependence on the degree in the case of polynomial classifiers. In stark contrast, our approach of using improper learning, using a larger hypothesis class allows the sketch size to have a logarithmic dependence on the degree. Even in the linear case, our approach allows us to improve on the pesky $O({1}/{γ^2})$ dependence of random projections, on the margin $γ$. We empirically show that our approach leads to more compact neural networks than related methods such as feature hashing at equal or better performance. △ Less

Submitted 19 April, 2016; originally announced April 2016.

arXiv:1603.03714 [pdf, other]

Distribution Free Learning with Local Queries

Authors: Galit Bary-Weisberg, Amit Daniely, Shai Shalev-Shwartz

Abstract: The model of learning with \emph{local membership queries} interpolates between the PAC model and the membership queries model by allowing the learner to query the label of any example that is similar to an example in the training set. This model, recently proposed and studied by Awasthi, Feldman and Kanade, aims to facilitate practical use of membership queries. We continue this line of work, p… ▽ More The model of learning with \emph{local membership queries} interpolates between the PAC model and the membership queries model by allowing the learner to query the label of any example that is similar to an example in the training set. This model, recently proposed and studied by Awasthi, Feldman and Kanade, aims to facilitate practical use of membership queries. We continue this line of work, proving both positive and negative results in the {\em distribution free} setting. We restrict to the boolean cube $\{-1, 1\}^n$, and say that a query is $q$-local if it is of a hamming distance $\le q$ from some training example. On the positive side, we show that $1$-local queries already give an additional strength, and allow to learn a certain type of DNF formulas. On the negative side, we show that even $\left(n^{0.99}\right)$-local queries cannot help to learn various classes including Automata, DNFs and more. Likewise, $q$-local queries for any constant $q$ cannot help to learn Juntas, Decision Trees, Sparse Polynomials and more. Moreover, for these classes, an algorithm that uses $\left(\log^{0.99}(n)\right)$-local queries would lead to a breakthrough in the best known running times. △ Less

Submitted 11 March, 2016; originally announced March 2016.

arXiv:1602.05897 [pdf, other]

Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

Authors: Amit Daniely, Roy Frostig, Yoram Singer

Abstract: We develop a general duality between neural networks and compositional kernels, striving towards a better understanding of deep learning. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good s… ▽ More We develop a general duality between neural networks and compositional kernels, striving towards a better understanding of deep learning. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective of neural networks and underscores their expressive power. △ Less

Submitted 19 May, 2017; v1 submitted 18 February, 2016; originally announced February 2016.

arXiv:1505.05800 [pdf, other]

Complexity Theoretic Limitations on Learning Halfspaces

Authors: Amit Daniely

Abstract: We study the problem of agnostically learning halfspaces which is defined by a fixed but unknown distribution $\mathcal{D}$ on $\mathbb{Q}^n\times \{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small compared to… ▽ More We study the problem of agnostically learning halfspaces which is defined by a fixed but unknown distribution $\mathcal{D}$ on $\mathbb{Q}^n\times \{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$. Using the recently developed method of the author, Linial and Shalev-Shwartz we prove hardness of learning results under a natural assumption on the complexity of refuting random $K$-$\mathrm{XOR}$ formulas. We show that no efficient learning algorithm has non-trivial worst-case performance even under the guarantees that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le η$ for arbitrarily small constant $η>0$, and that $\mathcal{D}$ is supported in $\{\pm 1\}^n\times \{\pm 1\}$. Namely, even under these favorable conditions its error must be $\ge \frac{1}{2}-\frac{1}{n^c}$ for every $c>0$. In particular, no efficient algorithm can achieve a constant approximation ratio. Under a stronger version of the assumption (where $K$ can be poly-logarithmic in $n$), we can take $η= 2^{-\log^{1-ν}(n)}$ for arbitrarily small $ν>0$. Interestingly, this is even stronger than the best known lower bounds (Arora et. al. 1993, Feldamn et. al. 2006, Guruswami and Raghavendra 2006) for the case that the learner is restricted to return a halfspace classifier (i.e. proper learning). △ Less

Submitted 13 March, 2016; v1 submitted 21 May, 2015; originally announced May 2015.

arXiv:1502.07073 [pdf, ps, other]

Strongly Adaptive Online Learning

Authors: Amit Daniely, Alon Gonen, Shai Shalev-Shwartz

Abstract: Strongly adaptive algorithms are algorithms whose performance on every time interval is close to optimal. We present a reduction that can transform standard low-regret algorithms to strongly adaptive. As a consequence, we derive simple, yet efficient, strongly adaptive algorithms for a handful of problems. Strongly adaptive algorithms are algorithms whose performance on every time interval is close to optimal. We present a reduction that can transform standard low-regret algorithms to strongly adaptive. As a consequence, we derive simple, yet efficient, strongly adaptive algorithms for a handful of problems. △ Less

Submitted 19 June, 2015; v1 submitted 25 February, 2015; originally announced February 2015.

arXiv:1412.6265 [pdf, other]

Inapproximability of Truthful Mechanisms via Generalizations of the VC Dimension

Authors: Amit Daniely, Michael Schapira, Gal Shahaf

Abstract: Algorithmic mechanism design (AMD) studies the delicate interplay between computational efficiency, truthfulness, and optimality. We focus on AMD's paradigmatic problem: combinatorial auctions. We present a new generalization of the VC dimension to multivalued collections of functions, which encompasses the classical VC dimension, Natarajan dimension, and Steele dimension. We present a correspondi… ▽ More Algorithmic mechanism design (AMD) studies the delicate interplay between computational efficiency, truthfulness, and optimality. We focus on AMD's paradigmatic problem: combinatorial auctions. We present a new generalization of the VC dimension to multivalued collections of functions, which encompasses the classical VC dimension, Natarajan dimension, and Steele dimension. We present a corresponding generalization of the Sauer-Shelah Lemma and harness this VC machinery to establish inapproximability results for deterministic truthful mechanisms. Our results essentially unify all inapproximability results for deterministic truthful mechanisms for combinatorial auctions to date and establish new separation gaps between truthful and non-truthful algorithms. △ Less

Submitted 4 April, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

arXiv:1410.7050 [pdf, ps, other]

A PTAS for Agnostically Learning Halfspaces

Authors: Amit Daniely

Abstract: We present a PTAS for agnostically learning halfspaces w.r.t. the uniform distribution on the $d$ dimensional sphere. Namely, we show that for every $μ>0$ there is an algorithm that runs in time $\mathrm{poly}(d,\frac{1}ε)$, and is guaranteed to return a classifier with error at most $(1+μ)\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best halfspace classifier. This improves on Awasthi… ▽ More We present a PTAS for agnostically learning halfspaces w.r.t. the uniform distribution on the $d$ dimensional sphere. Namely, we show that for every $μ>0$ there is an algorithm that runs in time $\mathrm{poly}(d,\frac{1}ε)$, and is guaranteed to return a classifier with error at most $(1+μ)\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best halfspace classifier. This improves on Awasthi, Balcan and Long [ABL14] who showed an algorithm with an (unspecified) constant approximation ratio. Our algorithm combines the classical technique of polynomial regression (e.g. [LMN89, KKMS05]), together with the new localization technique of [ABL14]. △ Less

Submitted 25 June, 2015; v1 submitted 26 October, 2014; originally announced October 2014.

arXiv:1407.7937 [pdf, ps, other]

Learning Economic Parameters from Revealed Preferences

Authors: Maria-Florina Balcan, Amit Daniely, Ruta Mehta, Ruth Urner, Vijay V. Vazirani

Abstract: A recent line of work, starting with Beigman and Vohra (2006) and Zadimoghaddam and Roth (2012), has addressed the problem of {\em learning} a utility function from revealed preference data. The goal here is to make use of past data describing the purchases of a utility maximizing agent when faced with certain prices and budget constraints in order to produce a hypothesis function that can accurat… ▽ More A recent line of work, starting with Beigman and Vohra (2006) and Zadimoghaddam and Roth (2012), has addressed the problem of {\em learning} a utility function from revealed preference data. The goal here is to make use of past data describing the purchases of a utility maximizing agent when faced with certain prices and budget constraints in order to produce a hypothesis function that can accurately forecast the {\em future} behavior of the agent. In this work we advance this line of work by providing sample complexity guarantees and efficient algorithms for a number of important classes. By drawing a connection to recent advances in multi-class learning, we provide a computationally efficient algorithm with tight sample complexity guarantees ($Θ(d/ε)$ for the case of $d$ goods) for learning linear utility functions under a linear price model. This solves an open question in Zadimoghaddam and Roth (2012). Our technique yields numerous generalizations including the ability to learn other well-studied classes of utility functions, to deal with a misspecified model, and with non-linear prices. △ Less

Submitted 30 July, 2014; originally announced July 2014.

arXiv:1405.2420 [pdf, ps, other]

Optimal Learners for Multiclass Problems

Authors: Amit Daniely, Shai Shalev-Shwartz

Abstract: The fundamental theorem of statistical learning states that for binary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek for a generic optimal learner for multiclass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be improper, namely, it must have the ability to o… ▽ More The fundamental theorem of statistical learning states that for binary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek for a generic optimal learner for multiclass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be improper, namely, it must have the ability to output hypotheses which do not belong to the hypothesis class, even though it knows that all the labels are generated by some hypothesis from the class. In particular, no ERM learner is optimal. This brings back the fundmamental question of "how to learn"? We give a complete answer to this question by giving a new analysis of the one-inclusion multiclass learner of Rubinstein et al (2006) showing that its sample complexity is essentially optimal. Then, we turn to study the popular hypothesis class of generalized linear classifiers. We derive optimal learners that, unlike the one-inclusion algorithm, are computationally efficient. Furthermore, we show that the sample complexity of these learners is better than the sample complexity of the ERM rule, thus settling in negative an open question due to Collins (2005). △ Less

Submitted 10 May, 2014; originally announced May 2014.

arXiv:1404.3378 [pdf, other]

Complexity theoretic limitations on learning DNF's

Authors: Amit Daniely, Shai Shalev-Shwatz

Abstract: Using the recently developed framework of [Daniely et al, 2014], we show that under a natural assumption on the complexity of refuting random K-SAT formulas, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of learning intersections of $ω(\log(n))$ halfspaces, agnostically learning conjunctions, as well as virtually all (distribution free) learning problems that… ▽ More Using the recently developed framework of [Daniely et al, 2014], we show that under a natural assumption on the complexity of refuting random K-SAT formulas, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of learning intersections of $ω(\log(n))$ halfspaces, agnostically learning conjunctions, as well as virtually all (distribution free) learning problems that were previously shown hard (under complexity assumptions). △ Less

Submitted 4 November, 2014; v1 submitted 13 April, 2014; originally announced April 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1311.2272

arXiv:1311.2272 [pdf, other]

From average case complexity to improper learning complexity

Authors: Amit Daniely, Nati Linial, Shai Shalev-Shwartz

Abstract: The basic problem in the PAC model of computational learning theory is to determine which hypothesis classes are efficiently learnable. There is presently a dearth of results showing hardness of learning problems. Moreover, the existing lower bounds fall short of the best known algorithms. The biggest challenge in proving complexity results is to establish hardness of {\em improper learning} (a.… ▽ More The basic problem in the PAC model of computational learning theory is to determine which hypothesis classes are efficiently learnable. There is presently a dearth of results showing hardness of learning problems. Moreover, the existing lower bounds fall short of the best known algorithms. The biggest challenge in proving complexity results is to establish hardness of {\em improper learning} (a.k.a. representation independent learning).The difficulty in proving lower bounds for improper learning is that the standard reductions from $\mathbf{NP}$-hard problems do not seem to apply in this context. There is essentially only one known approach to proving lower bounds on improper learning. It was initiated in (Kearns and Valiant 89) and relies on cryptographic assumptions. We introduce a new technique for proving hardness of improper learning, based on reductions from problems that are hard on average. We put forward a (fairly strong) generalization of Feige's assumption (Feige 02) about the complexity of refuting random constraint satisfaction problems. Combining this assumption with our new technique yields far reaching implications. In particular, 1. Learning $\mathrm{DNF}$'s is hard. 2. Agnostically learning halfspaces with a constant approximation ratio is hard. 3. Learning an intersection of $ω(1)$ halfspaces is hard. △ Less

Submitted 9 March, 2014; v1 submitted 10 November, 2013; originally announced November 2013.

Comments: 34 pages

arXiv:1311.2271 [pdf, ps, other]

More data speeds up training time in learning halfspaces over sparse vectors

Authors: Amit Daniely, Nati Linial, Shai Shalev Shwartz

Abstract: The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a {\em computational} resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a {\em… ▽ More The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a {\em computational} resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a {\em natural supervised learning problem} --- we consider agnostic PAC learning of halfspaces over $3$-sparse vectors in $\{-1,1,0\}^n$. This class is inefficiently learnable using $O\left(n/ε^2\right)$ examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random $\mathrm{3CNF}$ formulas is hard, it is impossible to efficiently learn this class using only $O\left(n/ε^2\right)$ examples. We further show that under stronger hardness assumptions, even $O\left(n^{1.499}/ε^2\right)$ examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently using $\tildeΩ\left(n^2/ε^2\right)$ examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem. △ Less

Submitted 10 November, 2013; originally announced November 2013.

Comments: 13 pages

arXiv:1308.2893 [pdf, other]

Multiclass learnability and the ERM principle

Authors: Amit Daniely, Sivan Sabato, Shai Ben-David, Shai Shalev-Shwartz

Abstract: We study the sample complexity of multiclass prediction in several learning settings. For the PAC setting our analysis reveals a surprising phenomenon: In sharp contrast to binary classification, we show that there exist multiclass hypothesis classes for which some Empirical Risk Minimizers (ERM learners) have lower sample complexity than others. Furthermore, there are classes that are learnable b… ▽ More We study the sample complexity of multiclass prediction in several learning settings. For the PAC setting our analysis reveals a surprising phenomenon: In sharp contrast to binary classification, we show that there exist multiclass hypothesis classes for which some Empirical Risk Minimizers (ERM learners) have lower sample complexity than others. Furthermore, there are classes that are learnable by some ERM learners, while other ERM learners will fail to learn them. We propose a principle for designing good ERM learners, and use this principle to prove tight bounds on the sample complexity of learning {\em symmetric} multiclass hypothesis classes---classes that are invariant under permutations of label names. We further provide a characterization of mistake and regret bounds for multiclass learning in the online setting and the bandit setting, using new generalizations of Littlestone's dimension. △ Less

Submitted 24 November, 2014; v1 submitted 13 August, 2013; originally announced August 2013.

Journal ref: Journal of Machine Learning Research, 16(Jul):1275-1304, 2015

arXiv:1302.1043 [pdf, other]

The price of bandit information in multiclass online classification

Authors: Amit Daniely, Tom Helbertal

Abstract: We consider two scenarios of multiclass online learning of a hypothesis class $H\subseteq Y^X$. In the {\em full information} scenario, the learner is exposed to instances together with their labels. In the {\em bandit} scenario, the true label is not exposed, but rather an indication whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scen… ▽ More We consider two scenarios of multiclass online learning of a hypothesis class $H\subseteq Y^X$. In the {\em full information} scenario, the learner is exposed to instances together with their labels. In the {\em bandit} scenario, the true label is not exposed, but rather an indication whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scenarios is at most $8\cdot|Y|\cdot \log(|Y|)$ in the realizable case, and $\tilde{O}(\sqrt{|Y|})$ in the agnostic case. The results are tight up to a logarithmic factor and essentially answer an open question from (Daniely et. al. - Multiclass learnability and the erm principle). We apply these results to the class of $γ$-margin multiclass linear classifiers in $\reals^d$. We show that the bandit error rate of this class is $\tildeΘ(\frac{|Y|}{γ^2})$ in the realizable case and $\tildeΘ(\frac{1}γ\sqrt{|Y|T})$ in the agnostic case. This resolves an open question from (Kakade et. al. - Efficient bandit algorithms for online multiclass prediction). △ Less

Submitted 9 July, 2013; v1 submitted 5 February, 2013; originally announced February 2013.

arXiv:1211.0616 [pdf, other]

The complexity of learning halfspaces using generalized linear methods

Authors: Amit Daniely, Nati Linial, Shai Shalev-Shwartz

Abstract: Many popular learning algorithms (E.g. Regression, Fourier-Transform based algorithms, Kernel SVM and Kernel ridge regression) operate by reducing the problem to a convex optimization problem over a vector space of functions. These methods offer the currently best approach to several central problems such as learning half spaces and learning DNF's. In addition they are widely used in numerous appl… ▽ More Many popular learning algorithms (E.g. Regression, Fourier-Transform based algorithms, Kernel SVM and Kernel ridge regression) operate by reducing the problem to a convex optimization problem over a vector space of functions. These methods offer the currently best approach to several central problems such as learning half spaces and learning DNF's. In addition they are widely used in numerous application domains. Despite their importance, there are still very few proof techniques to show limits on the power of these algorithms. We study the performance of this approach in the problem of (agnostically and improperly) learning halfspaces with margin $γ$. Let $\mathcal{D}$ be a distribution over labeled examples. The $γ$-margin error of a hyperplane $h$ is the probability of an example to fall on the wrong side of $h$ or at a distance $\leγ$ from it. The $γ$-margin error of the best $h$ is denoted $\mathrm{Err}_γ(\mathcal{D})$. An $α(γ)$-approximation algorithm receives $γ,ε$ as input and, using i.i.d. samples of $\mathcal{D}$, outputs a classifier with error rate $\le α(γ)\mathrm{Err}_γ(\mathcal{D}) + ε$. Such an algorithm is efficient if it uses $\mathrm{poly}(\frac{1}γ,\frac{1}ε)$ samples and runs in time polynomial in the sample size. The best approximation ratio achievable by an efficient algorithm is $O\left(\frac{1/γ}{\sqrt{\log(1/γ)}}\right)$ and is achieved using an algorithm from the above class. Our main result shows that the approximation ratio of every efficient algorithm from this family must be $\ge Ω\left(\frac{1/γ}{\mathrm{poly}\left(\log\left(1/γ\right)\right)}\right)$, essentially matching the best known upper bound. △ Less

Submitted 10 May, 2014; v1 submitted 3 November, 2012; originally announced November 2012.

arXiv:1205.6432 [pdf, other]

Multiclass Learning Approaches: A Theoretical Comparison with Implications

Authors: Amit Daniely, Sivan Sabato, Shai Shalev Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a… ▽ More We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension $d$, and in particular from the class of halfspaces over $\reals^d$. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the \emph{approximation error} of hypothesis classes. This is in sharp contrast to most, if not all, previous uses of VC theory, which only deal with estimation error. △ Less

Submitted 1 June, 2012; v1 submitted 29 May, 2012; originally announced May 2012.

Journal ref: Advances in Neural Information Processing Systems, 2012, pages 494-502

arXiv:1205.4893 [pdf, other]

On the practically interesting instances of MAXCUT

Authors: Yonatan Bilu, Amit Daniely, Nati Linial, Michael Saks

Abstract: The complexity of a computational problem is traditionally quantified based on the hardness of its worst case. This approach has many advantages and has led to a deep and beautiful theory. However, from the practical perspective, this leaves much to be desired. In application areas, practically interesting instances very often occupy just a tiny part of an algorithm's space of instances, and the v… ▽ More The complexity of a computational problem is traditionally quantified based on the hardness of its worst case. This approach has many advantages and has led to a deep and beautiful theory. However, from the practical perspective, this leaves much to be desired. In application areas, practically interesting instances very often occupy just a tiny part of an algorithm's space of instances, and the vast majority of instances are simply irrelevant. Addressing these issues is a major challenge for theoretical computer science which may make theory more relevant to the practice of computer science. Following Bilu and Linial, we apply this perspective to MAXCUT, viewed as a clustering problem. Using a variety of techniques, we investigate practically interesting instances of this problem. Specifically, we show how to solve in polynomial time distinguished, metric, expanding and dense instances of MAXCUT under mild stability assumptions. In particular, $(1+ε)$-stability (which is optimal) suffices for metric and dense MAXCUT. We also show how to solve in polynomial time $Ω(\sqrt{n})$-stable instances of MAXCUT, substantially improving the best previously known result. △ Less

Submitted 22 May, 2012; originally announced May 2012.

arXiv:1205.4891 [pdf, other]

Clustering is difficult only when it does not matter

Authors: Amit Daniely, Nati Linial, Michael Saks

Abstract: Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {\em that can be clustered well}. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly… ▽ More Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {\em that can be clustered well}. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners' perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of "good clustering". We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task. △ Less

Submitted 22 May, 2012; originally announced May 2012.

arXiv:1001.3661 [pdf, ps, other]

Tight products and Expansion

Authors: Amit Daniely, Nathan Linial

Abstract: In this paper we study a new product of graphs called {\em tight product}. A graph $H$ is said to be a tight product of two (undirected multi) graphs $G_1$ and $G_2$, if $V(H)=V(G_1)\times V(G_2)$ and both projection maps $V(H)\to V(G_1)$ and $V(H)\to V(G_2)$ are covering maps. It is not a priori clear when two given graphs have a tight product (in fact, it is $NP$-hard to decide). We investigate… ▽ More In this paper we study a new product of graphs called {\em tight product}. A graph $H$ is said to be a tight product of two (undirected multi) graphs $G_1$ and $G_2$, if $V(H)=V(G_1)\times V(G_2)$ and both projection maps $V(H)\to V(G_1)$ and $V(H)\to V(G_2)$ are covering maps. It is not a priori clear when two given graphs have a tight product (in fact, it is $NP$-hard to decide). We investigate the conditions under which this is possible. This perspective yields a new characterization of class-1 $(2k+1)$-regular graphs. We also obtain a new model of random $d$-regular graphs whose second eigenvalue is almost surely at most $O(d^{3/4})$. This construction resembles random graph lifts, but requires fewer random bits. △ Less

Submitted 3 November, 2012; v1 submitted 20 January, 2010; originally announced January 2010.

Journal ref: J. Graph Theory, (2012) 69, 426 - 440

Showing 1–48 of 48 results for author: Daniely, A