Search | arXiv e-print repository

Nyström Kernel Stein Discrepancy

Authors: Florian Kalinke, Zoltan Szabo, Bharath K. Sriperumbudur

Abstract: Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the con… ▽ More Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration -- with runtime $\mathcal O\!\left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points -- , show its $\sqrt{n}$-consistency under the null with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks. △ Less

Submitted 25 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: Update proof of Lemma B.3, milder Assumption 1, more experiments

MSC Class: 46E22 (Primary) 62G10 (Secondary) ACM Class: G.3; I.2.6

arXiv:2306.17329 [pdf, other]

Kernel $ε$-Greedy for Contextual Bandits

Authors: Sakshi Arya, Bharath K. Sriperumbudur

Abstract: We consider a kernelized version of the $ε$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we consider that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under some conditions on the exploration probability sequence, $\{ε_t\}_t$, and… ▽ More We consider a kernelized version of the $ε$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we consider that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under some conditions on the exploration probability sequence, $\{ε_t\}_t$, and choice of the regularization parameter, $\{λ_t\}_t$, we show that the proposed estimator is consistent. We also show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS. Furthermore, we achieve the optimal regret rate of $\sqrt{T}$ under a margin condition for finite-dimensional RKHS. △ Less

Submitted 29 June, 2023; originally announced June 2023.

MSC Class: 62L10; 62G05; 68T05

arXiv:2212.09201 [pdf, other]

Spectral Regularized Kernel Two-Sample Tests

Authors: Omar Hagrass, Bharath K. Sriperumbudur, Bing Li

Abstract: Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular M… ▽ More Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real data, we demonstrate the superior performance of the proposed test in comparison to the MMD test and other popular tests in the literature. △ Less

Submitted 1 May, 2024; v1 submitted 18 December, 2022; originally announced December 2022.

Comments: 75 pages, to be published in the Annals of Statistics

MSC Class: Primary: 62G10; Secondary: 65J20; 65J22; 46E22; 47A52

arXiv:2211.07861 [pdf, other]

Regularized Stein Variational Gradient Flow

Authors: Ye He, Krishnakumar Balasubramanian, Bharath K. Sriperumbudur, Jianfeng Lu

Abstract: The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose t… ▽ More The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose the Regularized Stein Variational Gradient Flow, which interpolates between the Stein Variational Gradient Flow and the Wasserstein Gradient Flow. We establish various theoretical properties of the Regularized Stein Variational Gradient Flow (and its time-discretization) including convergence to equilibrium, existence and uniqueness of weak solutions, and stability of the solutions. We provide preliminary numerical evidence of the improved performance offered by the regularization. △ Less

Submitted 8 May, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

arXiv:2206.01795 [pdf, other]

Robust Topological Inference in the Presence of Outliers

Authors: Siddharth Vishwanath, Bharath K. Sriperumbudur, Kenji Fukumizu, Satoshi Kuriki

Abstract: The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this w… ▽ More The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a $\textit{median-of-means}$ variant of the distance function ($\textsf{MoM Dist}$), and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by $\textsf{MoM Dist}$ are both consistent estimators of the true underlying population counterpart, and their rates of convergence in the bottleneck metric are controlled by the fraction of outliers in the data. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications. △ Less

Submitted 3 June, 2022; originally announced June 2022.

Comments: 50 pages, 10 figures

MSC Class: 62R40; 55N31; 68T09

arXiv:2111.11328 [pdf, other]

Cycle Consistent Probability Divergences Across Different Spaces

Authors: Zhengxin Zhang, Youssef Mroueh, Ziv Goldfeld, Bharath K. Sriperumbudur

Abstract: Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Mo… ▽ More Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the Gromov-Haussdroff distance between metric spaces, and employs two cycle-consistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycle-consistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the mappings to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided. △ Less

Submitted 22 November, 2021; originally announced November 2021.

Comments: 35 pages

arXiv:1908.05818 [pdf, other]

Gaussian Sketching yields a J-L Lemma in RKHS

Authors: Samory Kpotufe, Bharath K. Sriperumbudur

Abstract: The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$, is a \emph{random projection} operator---in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particu… ▽ More The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$, is a \emph{random projection} operator---in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\mathcal H$ that maps functions $f \in \mathcal H$ to a low-dimensional space $\mathbb R^d$, while preserving a weighted RKHS inner-product of the form $\langle f, g \rangle_Σ \doteq \langle f, Σ^3 g \rangle_{\mathcal H}$, where $Σ$ is the \emph{covariance} operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x): x \in \cal X\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^N$ of size $N$, we can build the Gram matrix $\boldsymbol K$ on a much smaller subsample of size $n\ll N$, so that the sketch $Z\boldsymbol K$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^N$. We verify these insights empirically on synthetic data, and on real-world clustering applications. △ Less

Submitted 11 March, 2020; v1 submitted 15 August, 2019; originally announced August 2019.

Comments: 16 pages

arXiv:1902.01219 [pdf, ps, other]

Local minimax rates for closeness testing of discrete distributions

Authors: Joseph Lam-Weil, Alexandra Carpentier, Bharath K. Sriperumbudur

Abstract: We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the b… ▽ More We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the best of our knowledge, the first local minimax rate for the separation distance up to logarithmic factors, together with a test that achieves it. In view of the rate, closeness testing turns out to be substantially harder than the related one-sample testing problem over a wide range of cases. △ Less

Submitted 19 January, 2021; v1 submitted 1 February, 2019; originally announced February 2019.

MSC Class: 62F03; 62G10; 62F35 ACM Class: G.3; I.2.6

arXiv:1810.05207 [pdf, ps, other]

On Kernel Derivative Approximation with Random Fourier Features

Authors: Zoltan Szabo, Bharath K. Sriperumbudur

Abstract: Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs… ▽ More Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs in the approximation of kernel values, kernel ridge regression, kernel PCA and SVM classification. Our goal is to spark the investigation of optimality of RFF-based approximations in tasks involving not only function values but derivatives, which naturally lead to optimization problems with kernel derivatives. Particularly, in this paper, we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is attainable for kernel derivatives using RFF as achieved for kernel values. △ Less

Submitted 9 February, 2019; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: AISTATS-2019

MSC Class: 60E10; 42Bxx; 46E22 ACM Class: G.3; I.2.6

arXiv:1807.02582 [pdf, other]

Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences

Authors: Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, Bharath K Sriperumbudur

Abstract: This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related;… ▽ More This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches. △ Less

Submitted 6 July, 2018; originally announced July 2018.

Comments: 64 pages

arXiv:1803.11451 [pdf, ps, other]

Minimax Estimation of Quadratic Fourier Functionals

Authors: Shashank Singh, Bharath K. Sriperumbudur, Barnabás Póczos

Abstract: We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these qu… ▽ More We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these quantities, and the induced (semi)norms and (pseudo)metrics. We then prove non-asymptotic upper bounds on their mean squared error, in terms of weights both of the inner product and of the two distributions, in the Fourier basis. Finally, we prove minimax lower bounds that imply rate-optimality of the proposed estimators over Fourier ellipsoids. △ Less

Submitted 1 September, 2018; v1 submitted 30 March, 2018; originally announced March 2018.

arXiv:1708.08157 [pdf, ps, other]

Characteristic and Universal Tensor Product Kernels

Authors: Zoltan Szabo, Bharath K. Sriperumbudur

Abstract: Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide v… ▽ More Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of characteristic property of the tensor product kernel. △ Less

Submitted 2 August, 2018; v1 submitted 27 August, 2017; originally announced August 2017.

Comments: final version appeared in JMLR

MSC Class: 46E22; 94A15; 62G10; 47B32 ACM Class: G.3; H.1.1; I.2.6

Journal ref: Journal of Machine Learning Research 18(233):1-29, 2018

arXiv:1506.02155 [pdf, ps, other]

Optimal Rates for Random Fourier Features

Authors: Bharath K. Sriperumbudur, Zoltan Szabo

Abstract: Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate t… ▽ More Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1\le r<\infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality. △ Less

Submitted 4 November, 2015; v1 submitted 6 June, 2015; originally announced June 2015.

Comments: To appear at NIPS-2015

MSC Class: 60E10; 62Gxx; 62Exx; 62H12; 42Bxx; 46E22 ACM Class: G.3; I.2.6; F.2

arXiv:1305.2505 [pdf, other]

On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions

Authors: Purushottam Kar, Bharath K Sriperumbudur, Prateek Jain, Harish C Karnick

Abstract: In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than… ▽ More In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al (COLT 2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory efficient online learning algorithm for higher order learning problems with bounded regret guarantees. △ Less

Submitted 11 May, 2013; originally announced May 2013.

Comments: To appear in proceedings of the 30th International Conference on Machine Learning (ICML 2013)

Journal ref: Journal of Machine Learning Research, W&CP 28(3) (2013)

arXiv:0901.2698 [pdf, ps, other]

On integral probability metrics, φ-divergences and binary classification

Authors: Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert R. G. Lanckriet

Abstract: A class of distance measures on probabilities -- the integral probability metrics (IPMs) -- is addressed: these include the Wasserstein distance, Dudley metric, and Maximum Mean Discrepancy. IPMs have thus far mostly been used in more abstract settings, for instance as theoretical tools in mass transportation problems, and in metrizing the weak topology on the set of all Borel probability measur… ▽ More A class of distance measures on probabilities -- the integral probability metrics (IPMs) -- is addressed: these include the Wasserstein distance, Dudley metric, and Maximum Mean Discrepancy. IPMs have thus far mostly been used in more abstract settings, for instance as theoretical tools in mass transportation problems, and in metrizing the weak topology on the set of all Borel probability measures defined on a metric space. Practical applications of IPMs are less common, with some exceptions in the kernel machines literature. The present work contributes a number of novel properties of IPMs, which should contribute to making IPMs more widely used in practice, for instance in areas where $φ$-divergences are currently popular. First, to understand the relation between IPMs and $φ$-divergences, the necessary and sufficient conditions under which these classes intersect are derived: the total variation distance is shown to be the only non-trivial $φ$-divergence that is also an IPM. This shows that IPMs are essentially different from $φ$-divergences. Second, empirical estimates of several IPMs from finite i.i.d. samples are obtained, and their consistency and convergence rates are analyzed. These estimators are shown to be easily computable, with better rates of convergence than estimators of $φ$-divergences. Third, a novel interpretation is provided for IPMs by relating them to binary classification, where it is shown that the IPM between class-conditional distributions is the negative of the optimal risk associated with a binary classifier. In addition, the smoothness of an appropriate binary classifier is proved to be inversely related to the distance between the class-conditional distributions, measured in terms of an IPM. △ Less

Submitted 12 October, 2009; v1 submitted 18 January, 2009; originally announced January 2009.

Comments: 18 pages

Showing 1–15 of 15 results for author: Sriperumbudur, B K