Fast and Scalable Structural SVM with Slack Rescaling

Authors: Heejin Choi, Ofer Meshi, Nathan Srebro

Abstract: We present an efficient method for training slack-rescaled structural SVM. Although finding the most violating label in a margin-rescaled formulation is often easy since the target function decomposes with respect to the structure, this is not the case for a slack-rescaled formulation, and finding the most violated label might be very difficult. Our core contribution is an efficient method for finding the most-violating-label in a slack-rescaled formulation, given an oracle that returns the most-violating-label in a (slightly modified) margin-rescaled formulation. We show that our method enables accurate and scalable training for slack-rescaled SVMs, reducing runtime by an order of magnitude compared to previous approaches to slack-rescaled SVMs.

Submitted 27 October, 2015; v1 submitted 20 October, 2015; originally announced October 2015.

arXiv:1510.02054 [pdf, ps, other]

Stochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations

Authors: Weiran Wang, Raman Arora, Karen Livescu, Nathan Srebro

Abstract: Deep CCA is a recently proposed deep neural network extension to the traditional canonical correlation analysis (CCA), and has been successful for multi-view representation learning in several domains. However, stochastic optimization of the deep CCA objective is not straightforward, because it does not decouple over training examples. Previous optimizers for deep CCA are either batch-based algorithms or stochastic optimization using large minibatches, which can have high memory consumption. In this paper, we tackle the problem of stochastic optimization for deep CCA with small minibatches, based on an iterative solution to the CCA objective, and show that we can achieve as good performance as previous optimizers and thus alleviate the memory requirement.

Submitted 7 October, 2015; originally announced October 2015.

Comments: in 2015 Annual Allerton Conference on Communication, Control and Computing

arXiv:1510.00633 [pdf, other]

Distributed Multitask Learning

Authors: Jialei Wang, Mladen Kolar, Nathan Srebro

Abstract: We consider the problem of distributed multi-task learning, where each machine learns a separate, but related, task. Specifically, each machine learns a linear predictor in high-dimensional space, where all tasks share the same small support. We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method.

Submitted 2 October, 2015; originally announced October 2015.

arXiv:1508.02479 [pdf, other]

Normalized Hierarchical SVM

Authors: Heejin Choi, Yutaka Sasaki, Nathan Srebro

Abstract: We present improved methods of using structured SVMs in a large-scale hierarchical classification problem, that is, when labels are leaves, or sets of leaves, in a tree or a DAG. We examine the need to normalize both the regularization and the margin and show how doing so significantly improves performance, including achieving state-of-the-art results where unnormalized structured SVMs do not perform better than flat models. We also describe a further extension of hierarchical SVMs that highlights the connection between hierarchical SVMs and matrix factorization models.

Submitted 4 March, 2016; v1 submitted 10 August, 2015; originally announced August 2015.

arXiv:1507.08322 [pdf, ps, other]

Distributed Mini-Batch SDCA

Authors: Martin Takáč, Peter Richtárik, Nathan Srebro

Abstract: We present an improved analysis of mini-batched stochastic dual coordinate ascent for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). Our analysis allows for flexible sampling schemes, including where data is distributed across machines, and combines a dependence on the smoothness of the loss and/or the data spread (measured through the spectral norm).

Submitted 29 July, 2015; originally announced July 2015.

arXiv:1506.02617 [pdf, other]

Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Authors: Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro

Abstract: We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.

Submitted 8 June, 2015; originally announced June 2015.

Comments: 12 pages, 5 figures

arXiv:1503.00036 [pdf, ps, other]

Norm-Based Capacity Control in Neural Networks

Authors: Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

Abstract: We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.

Submitted 14 April, 2015; v1 submitted 27 February, 2015; originally announced March 2015.

Comments: 29 pages

arXiv:1412.6614 [pdf, ps, other]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Authors: Behnam Neyshabur, Ryota Tomioka, Nathan Srebro

Abstract: We present experiments demonstrating that some other form of capacity control, different from network size, plays a central role in learning multilayer feed-forward networks. We argue, partially through analogy to matrix factorization, that this is an inductive bias that can help shed light on deep learning.

Submitted 16 April, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

Comments: 9 pages, 2 figures

arXiv:1410.5518 [pdf, ps, other]

On Symmetric and Asymmetric LSHs for Inner Product Search

Authors: Behnam Neyshabur, Nathan Srebro

Abstract: We consider the problem of designing locality sensitive hashes (LSH) for inner product similarity, and of the power of asymmetric hashes in this context. Shrivastava and Li argue that there is no symmetric LSH for the problem and propose an asymmetric LSH based on different mappings for query and database points. However, we show there does exist a simple symmetric LSH that enjoys stronger guarantees and better empirical performance than the asymmetric LSH they suggest. We also show a variant of the settings where asymmetry is in fact needed, but there a different asymmetric LSH is required.

Submitted 8 June, 2015; v1 submitted 20 October, 2014; originally announced October 2014.

Comments: 11 pages, 3 figures, In Proceedings of The 32nd International Conference on Machine Learning (ICML)

arXiv:1405.3167 [pdf, ps, other]

Clustering, Hamming Embedding, Generalized LSH and the Max Norm

Authors: Behnam Neyshabur, Yury Makarychev, Nathan Srebro

Abstract: We study the convex relaxation of clustering and hamming embedding, focusing on the asymmetric case (co-clustering and asymmetric hamming embedding), understanding their relationship to LSH as studied by (Charikar 2002) and to the max-norm ball, and the differences between their symmetric and asymmetric versions.

Submitted 13 May, 2014; originally announced May 2014.

Comments: 17 pages

arXiv:1312.7853 [pdf, other]

Communication Efficient Distributed Optimization using an Approximate Newton-type Method

Authors: Ohad Shamir, Nathan Srebro, Tong Zhang

Abstract: We present a novel Newton-type method for distributed optimization, which is particularly well suited for stochastic optimization and learning problems. For quadratic objectives, the method enjoys a linear rate of convergence which provably \emph{improves} with the data size, requiring an essentially constant number of iterations under reasonable assumptions. We provide theoretical and empirical evidence of the advantages of our method compared to other approaches, such as one-shot parameter averaging and ADMM.

Submitted 13 May, 2014; v1 submitted 30 December, 2013; originally announced December 2013.

arXiv:1311.7662 [pdf, other]

The Power of Asymmetry in Binary Hashing

Authors: Behnam Neyshabur, Payman Yadollahpour, Yury Makarychev, Ruslan Salakhutdinov, Nathan Srebro

Abstract: When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e. by approximating the similarity between $x$ and $x'$ as the hamming distance between $f(x)$ and $g(x')$, for two distinct binary codes $f,g$, rather than as the hamming distance between $f(x)$ and $f(x')$.

Submitted 29 November, 2013; originally announced November 2013.

Comments: Accepted to NIPS 2013, 9 pages, 5 figures

arXiv:1310.5715 [pdf, ps, other]

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

Authors: Deanna Needell, Nathan Srebro, Rachel Ward

Abstract: We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/μ)^2$ (where $L$ is a bound on the smoothness and $μ$ on the strong convexity) to a linear dependence on $L/μ$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.

Submitted 16 January, 2015; v1 submitted 21 October, 2013; originally announced October 2013.

Comments: 22 pages, 6 figures

MSC Class: 65B99; 52A99; 60G99; 62L20

arXiv:1307.1674 [pdf, other]

Stochastic Optimization of PCA with Capped MSG

Authors: Raman Arora, Andrew Cotter, Nathan Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as "Matrix Stochastic Gradient" (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically.

Submitted 5 July, 2013; originally announced July 2013.

arXiv:1306.2347 [pdf, other]

Auditing: Active Learning with Outcome-Dependent Query Costs

Authors: Sivan Sabato, Anand D. Sarwate, Nathan Srebro

Abstract: We propose a learning setting in which unlabeled data is free, and the cost of a label depends on its value, which is not known in advance. We study binary classification in an extreme case, where the algorithm only pays for negative labels. Our motivation is applications such as fraud detection, in which investigating an honest transaction should be avoided if possible. We term the setting auditing, and consider the auditing complexity of an algorithm: the number of negative labels the algorithm requires in order to learn a hypothesis with low relative error. We design auditing algorithms for simple hypothesis classes (thresholds and rectangles), and show that with these algorithms, the auditing complexity can be significantly lower than the active label complexity. We also discuss a general competitive approach for auditing and possible modifications to the framework.

Submitted 12 July, 2015; v1 submitted 10 June, 2013; originally announced June 2013.

Comments: Corrections in section 5

Journal ref: Neural Information Processing Systems 26 (NIPS), 512-520, 2013

arXiv:1303.2314 [pdf, ps, other]

Mini-Batch Primal and Dual Methods for SVMs

Authors: Martin Takáč, Avleen Bijral, Peter Richtárik, Nathan Srebro

Abstract: We address the issue of using mini-batches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SDCA) methods and use it to derive novel variants of mini-batched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge-loss.

Submitted 10 March, 2013; originally announced March 2013.

arXiv:1301.2311 [pdf]

Maximum Likelihood Bounded Tree-Width Markov Networks

Authors: Nathan Srebro

Abstract: Chow and Liu (1968) studied the problem of learning a maximum likelihood Markov tree. We generalize their work to more complex Markov networks by considering the problem of learning a maximum likelihood Markov network of bounded complexity. We discuss how tree-width is in many ways the appropriate measure of complexity and thus analyze the problem of learning a maximum likelihood Markov network of bounded tree-width. Similar to the work of Chow and Liu, we are able to formalize the learning problem as a combinatorial optimization problem on graphs. We show that learning a maximum likelihood Markov network of bounded tree-width is equivalent to finding a maximum weight hypertree. This equivalence gives rise to global, integer-programming based, approximation algorithms with provable performance guarantees, for the learning problem. This contrasts with heuristic local-search algorithms which were previously suggested (e.g. by Malvestuto 1991). The equivalence also allows us to study the computational hardness of the learning problem. We show that learning a maximum likelihood Markov network of bounded tree-width is NP-hard, and discuss the hardness of approximation.

Submitted 10 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)

Report number: UAI-P-2001-PG-504-511

arXiv:1212.3276 [pdf, ps, other]

Learning Sparse Low-Threshold Linear Classifiers

Authors: Sivan Sabato, Shai Shalev-Shwartz, Nathan Srebro, Daniel Hsu, Tong Zhang

Abstract: We consider the problem of learning a non-negative linear classifier with a $1$-norm of at most $k$, and a fixed threshold, under the hinge-loss. This problem generalizes the problem of learning a $k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide an efficient online learning algorithm that achieves the optimal rate, and show that in the batch case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the uniform convergence rate, which grows with $k^2$.

Submitted 18 April, 2016; v1 submitted 13 December, 2012; originally announced December 2012.

Journal ref: Journal of Machine Learning Research, 16(Jul):1275-1304, 2015

arXiv:1210.5196 [pdf, other]

Matrix reconstruction with the local max norm

Authors: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov

Abstract: We introduce a new family of matrix norms, the "local max" norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms.

Submitted 18 October, 2012; originally announced October 2012.

arXiv:1206.6442 [pdf]

Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss

Authors: Shai Ben-David, David Loker, Nathan Srebro, Karthik Sridharan

Abstract: We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors. In particular, we show that amongst all convex surrogate losses, the hinge loss gives essentially the best possible bound for the misclassification error rate of the resulting linear predictor in terms of the best possible margin error rate. We also provide lower bounds for specific convex surrogates that show how different commonly used losses qualitatively differ from each other.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

arXiv:1206.3240 [pdf]

Complexity of Inference in Graphical Models

Authors: Venkat Chandrasekaran, Nathan Srebro, Prahladh Harsha

Abstract: It is well-known that inference in graphical models is hard in the worst case, but tractable for models with bounded treewidth. We ask whether treewidth is the only structural criterion of the underlying graph that enables tractable inference. In other words, is there some class of structures with unbounded treewidth in which inference is tractable? Subject to a combinatorial hypothesis due to Robertson et al. (1994), we show that low treewidth is indeed the only structural restriction that can ensure tractability. Thus, even for the "best case" graph structure, there is no inference algorithm with complexity polynomial in the treewidth.

Submitted 13 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-70-78

arXiv:1206.2372 [pdf, other]

PRISMA: PRoximal Iterative SMoothing Algorithm

Authors: Francesco Orabona, Andreas Argyriou, Nathan Srebro

Abstract: Motivated by learning problems including max-norm regularized matrix completion and clustering, robust PCA and sparse inverse covariance selection, we propose a novel optimization algorithm for minimizing a convex objective which decomposes into three parts: a smooth part, a simple non-smooth Lipschitz part, and a simple non-smooth non-Lipschitz part. We use a time variant smoothing strategy that allows us to obtain a guarantee that does not depend on knowing in advance the total number of iterations nor a bound on the domain.

Submitted 18 November, 2012; v1 submitted 11 June, 2012; originally announced June 2012.

arXiv:1204.5043 [pdf, ps, other]

Sparse Prediction with the $k$-Support Norm

Authors: Andreas Argyriou, Rina Foygel, Nathan Srebro

Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\ell_2$ penalty. We show that this new {\em $k$-support norm} provides a tighter relaxation than the elastic net and is thus a good replacement for the Lasso or the elastic net in sparse prediction problems. Through the study of the $k$-support norm, we also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.

Submitted 12 June, 2012; v1 submitted 23 April, 2012; originally announced April 2012.

arXiv:1204.1276 [pdf, ps, other]

Distribution-Dependent Sample Complexity of Large Margin Learning

Authors: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning.

Submitted 18 September, 2013; v1 submitted 5 April, 2012; originally announced April 2012.

Comments: arXiv admin note: text overlap with arXiv:1011.5053

Journal ref: S. Sabato, N. Srebro and N. Tishby, "Distribution-Dependent Sample Complexity of Large Margin Learning", Journal of Machine Learning Research, 14(Jul):2119-2149, 2013

arXiv:1204.0566 [pdf, ps, other]

The Kernelized Stochastic Batch Perceptron

Authors: Andrew Cotter, Shai Shalev-Shwartz, Nathan Srebro

Abstract: We present a novel approach for training kernel Support Vector Machines, establish learning runtime guarantees for our method that are better than those of any other known kernelized SVM optimization approach, and show that our method works well in practice compared to existing alternatives.

Submitted 21 June, 2012; v1 submitted 2 April, 2012; originally announced April 2012.

arXiv:1202.5598 [pdf, other]

Clustering using Max-norm Constrained Optimization

Authors: Ali Jalali, Nathan Srebro

Abstract: We suggest using the max-norm as a convex surrogate constraint for clustering. We show how this yields a better exact cluster recovery guarantee than previously suggested nuclear-norm relaxation, and study the effectiveness of our method, and other related convex relaxations, compared to other clustering approaches.

Submitted 13 April, 2012; v1 submitted 24 February, 2012; originally announced February 2012.

arXiv:1202.3702 [pdf]

Semi-supervised Learning with Density Based Distances

Authors: Avleen S. Bijral, Nathan Ratliff, Nathan Srebro

Abstract: We present a simple, yet effective, approach to Semi-Supervised Learning. Our approach is based on estimating density-based distances (DBD) using a shortest path calculation on a graph. These Graph-DBD estimates can then be used in any distance-based supervised learning method, such as Nearest Neighbor methods and SVMs with RBF kernels. In order to apply the method to very large data sets, we also present a novel algorithm which integrates nearest neighbor computations into the shortest path search and can find exact shortest paths even in extremely large dense graphs. Significant runtime improvement over the commonly used Laplacian regularization method is then shown on a large scale dataset.

Submitted 14 February, 2012; originally announced February 2012.

Report number: UAI-P-2011-PG-43-50

arXiv:1109.4603 [pdf, other]

Explicit Approximations of the Gaussian Kernel

Authors: Andrew Cotter, Joseph Keshet, Nathan Srebro

Abstract: We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction with online or stochastic training.

Submitted 21 September, 2011; originally announced September 2011.

Comments: 11 pages, 2 tables, 2 figures

arXiv:1108.0373 [pdf, ps, other]

Fast-rate and optimistic-rate error bounds for L1-regularized regression

Authors: Rina Foygel, Nathan Srebro

Abstract: We consider the prediction error of linear regression with L1 regularization when the number of covariates p is large relative to the sample size n. When the model is k-sparse and well-specified, and restricted isometry or similar conditions hold, the excess squared-error in prediction can be bounded on the order of sigma^2*(k*log(p)/n), where sigma^2 is the noise variance. Although these conditions are close to necessary for accurate recovery of the true coefficient vector, it is possible to guarantee good predictive accuracy under much milder conditions, avoiding the restricted isometry condition, but only ensuring an excess error bound of order (k*log(p)/n)+sigma*\surd(k*log(p)/n). Here we show that this is indeed the best bound possible (up to logarithmic factors) without introducing stronger assumptions similar to restricted isometry.

Submitted 1 August, 2011; originally announced August 2011.

arXiv:1107.4080 [pdf, other]

On the Universality of Online Mirror Descent

Authors: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.

Submitted 20 July, 2011; originally announced July 2011.

arXiv:1106.4574 [pdf, other]

Better Mini-Batch Algorithms via Accelerated Gradient Methods

Authors: Andrew Cotter, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice.

Submitted 22 June, 2011; originally announced June 2011.

arXiv:1106.4251 [pdf, other]

Learning with the Weighted Trace-norm under Arbitrary Sampling Distributions

Authors: Rina Foygel, Ruslan Salakhutdinov, Ohad Shamir, Nathan Srebro

Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted trace-norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial.

Submitted 21 June, 2011; originally announced June 2011.

arXiv:1102.3923 [pdf, ps, other]

Concentration-Based Guarantees for Low-Rank Matrix Reconstruction

Authors: Rina Foygel, Nathan Srebro

Abstract: We consider the problem of approximately reconstructing a partially-observed, approximately low-rank matrix. This problem has received much attention lately, mostly using the trace-norm as a surrogate to the rank. Here we study low-rank matrix reconstruction using both the trace-norm, as well as the less-studied max-norm, and present reconstruction guarantees based on existing analysis on the Rademacher complexity of the unit balls of these norms. We show how these are superior in several ways to recently published guarantees based on specialized analysis.

Submitted 26 May, 2011; v1 submitted 18 February, 2011; originally announced February 2011.

arXiv:1011.5053 [pdf, ps, other]

Tight Sample Complexity of Large-Margin Learning

Authors: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L_2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution's covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.

Submitted 5 April, 2012; v1 submitted 23 November, 2010; originally announced November 2010.

Comments: Appearing in Neural Information Processing Systems (NIPS) 2010; This is the full version, including appendix with proofs; Also with some corrections

Journal ref: Advances in Neural Information Processing Systems 23 (NIPS), 2038-2046, 2010

arXiv:1009.3896 [pdf, ps, other]

Optimistic Rates for Learning with a Smooth Loss

Authors: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We establish an excess risk bound of O(H R_n^2 + R_n \sqrt{H L*}) for empirical risk minimization with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = \sqrt{R/n}, this translates to a learning rate of O(RH/n) in the separable (L*=0) case and O(RH/n + \sqrt{L^* RH/n}) more generally. We also provide similar guarantees for online and stochastic convex optimization with a smooth non-negative objective.

Submitted 26 November, 2012; v1 submitted 20 September, 2010; originally announced September 2010.

arXiv:1002.2780 [pdf, ps, other]

Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm

Authors: Ruslan Salakhutdinov, Nathan Srebro

Abstract: We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly. We introduce a weighted version of the trace-norm regularizer that works well also with non-uniform sampling. Our experimental results demonstrate that the weighted trace-norm regularization indeed yields significant gains on the (highly non-uniformly sampled) Netflix dataset.

Submitted 14 February, 2010; originally announced February 2010.

Comments: 9 pages

Showing 101–136 of 136 results for author: Srebro, N