Data Complexity Estimates for Operator Learning

Authors: Nikola B. Kovachki, Samuel Lanthaler, Hrushikesh Mhaskar

Abstract: Operator learning has emerged as a new paradigm for the data-driven approximation of nonlinear operators. Despite its empirical success, the theoretical underpinnings governing the conditions for efficient operator learning remain incomplete. The present work develops theory to study the data complexity of operator learning, complementing existing research on the parametric complexity. We investigate the fundamental question: How many input/output samples are needed in operator learning to achieve a desired accuracy $ε$? This question is addressed from the point of view of $n$-widths, and this work makes two key contributions. The first contribution is to derive lower bounds on $n$-widths for general classes of Lipschitz and Fréchet differentiable operators. These bounds rigorously demonstrate a ``curse of data-complexity'', revealing that learning on such general classes requires a sample size exponential in the inverse of the desired accuracy $ε$. The second contribution of this work is to show that ``parametric efficiency'' implies ``data efficiency''; using the Fourier neural operator (FNO) as a case study, we show rigorously that on a narrower class of operators, efficiently approximated by FNO in terms of the number of tunable parameters, efficient operator learning is attainable in data complexity as well. Specifically, we show that if only an algebraically increasing number of tunable parameters is needed to reach a desired approximation accuracy, then an algebraically bounded number of data samples is also sufficient to achieve the same accuracy.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2402.12687 [pdf, other]

Learning on manifolds without manifold learning

Authors: H. N. Mhaskar, Ryan O'Dowd

Abstract: Function approximation based on data drawn randomly from an unknown distribution is an important problem in machine learning. The manifold hypothesis assumes that the data is sampled from an unknown submanifold of a high dimensional Euclidean space. A great deal of research deals with obtaining information about this manifold, such as the eigendecomposition of the Laplace-Beltrami operator or coordinate charts, and using this information for function approximation. This two-step approach implies some extra errors in the approximation stemming from estimating the basic quantities of the data manifold in addition to the errors inherent in function approximation. In this paper, we project the unknown manifold as a submanifold of an ambient hypersphere and study the question of constructing a one-shot approximation using a specially designed sequence of localized spherical polynomial kernels on the hypersphere. Our approach does not require preprocessing of the data to obtain information about the manifold other than its dimension. We give optimal rates of approximation for relatively ``rough'' functions.

Submitted 18 August, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

arXiv:2308.03230 [pdf, ps, other]

Tractability of approximation by general shallow networks

Authors: Hrushikesh Mhaskar, Tong Mao

Abstract: In this paper, we present a sharper version of the results in the paper Dimension independent bounds for general shallow networks; Neural Networks, \textbf{123} (2020), 142-152. Let $\mathbb{X}$ and $\mathbb{Y}$ be compact metric spaces. We consider approximation of functions of the form $ x\mapsto\int_{\mathbb{Y}} G( x, y)dτ( y)$, $ x\in\mathbb{X}$, by $G$-networks of the form $ x\mapsto \sum_{k=1}^n a_kG( x, y_k)$, $ y_1,\cdots, y_n\in\mathbb{Y}$, $a_1,\cdots, a_n\in\mathbb{R}$. Defining the dimensions of $\mathbb{X}$ and $\mathbb{Y}$ in terms of covering numbers, we obtain dimension independent bounds on the degree of approximation in terms of $n$, where also the constants involved are all dependent at most polynomially on the dimensions. Applications include approximation by power rectified linear unit networks, zonal function networks, certain radial basis function networks as well as the important problem of function extension to higher dimensional spaces.

Submitted 10 December, 2023; v1 submitted 6 August, 2023; originally announced August 2023.

arXiv:2305.03890 [pdf, ps, other]

Approximation by non-symmetric networks for cross-domain learning

Authors: Hrushikesh Mhaskar

Abstract: For the past 30 years or so, machine learning has stimulated a great deal of research in the study of approximation capabilities (expressive power) of a multitude of processes, such as approximation by shallow or deep neural networks, radial basis function networks, and a variety of kernel based methods. Motivated by applications such as invariant learning, transfer learning, and synthetic aperture radar imaging, we initiate in this paper a general approach to study the approximation capabilities of kernel based networks using non-symmetric kernels. While singular value decomposition is a natural instinct to study such kernels, we consider a more general approach to include the use of a family of kernels, such as generalized translation networks (which include neural networks and translation invariant kernels as special cases) and rotated zonal function kernels. Naturally, unlike traditional kernel based approximation, we cannot require the kernels to be positive definite. In particular, we obtain estimates on the accuracy of uniform approximation of functions in a ($L^2$)-Sobolev class by ReLU$^r$ networks when $r$ is not necessarily an integer. Our general results apply to the approximation of functions with small smoothness compared to the dimension of the input space.

Submitted 5 January, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

arXiv:2303.00984 [pdf, other]

Encoding of data sets and algorithms

Authors: Katarina Doctor, Tong Mao, Hrushikesh Mhaskar

Abstract: In many high-impact applications, it is important to ensure the quality of output of a machine learning algorithm as well as its reliability in comparison with the complexity of the algorithm used. In this paper, we have initiated a mathematically rigorous theory to decide which models (algorithms applied on data sets) are close to each other in terms of certain metrics, such as performance and the complexity level of the algorithm. This involves creating a grid on the hypothetical spaces of data sets and algorithms so as to identify a finite set of probability distributions from which the data sets are sampled and a finite set of algorithms. A given threshold metric acting on this grid will express the nearness (or statistical distance) from each algorithm and data set of interest to any given application. A technically difficult part of this project is to estimate the so-called metric entropy of a compact subset of functions of \textbf{infinitely many variables} that arise in the definition of these spaces.

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2302.00160 [pdf, ps, other]

Local transfer learning from one data space to another

Authors: H. N. Mhaskar, Ryan O'Dowd

Abstract: A fundamental problem in manifold learning is to approximate a functional relationship in a data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.

Submitted 7 July, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

Comments: To appear in Proceedings of ICAIPA 2022, Editors: S. Pereverzyev, R. Radha, S. Sivananthan, Springer Verlag

arXiv:2202.06392 [pdf, other]

Local approximation of operators

Authors: Hrushikesh Mhaskar

Abstract: Many applications, such as system identification, classification of time series, direct and inverse problems in partial differential equations, and uncertainty quantification lead to the question of approximation of a non-linear operator between metric spaces $\mathfrak{X}$ and $\mathfrak{Y}$. We study the problem of determining the degree of approximation of such operators on a compact subset $K_\mathfrak{X}\subset \mathfrak{X}$ using a finite amount of information. If $\mathcal{F}: K_\mathfrak{X}\to K_\mathfrak{Y}$, a well established strategy to approximate $\mathcal{F}(F)$ for some $F\in K_\mathfrak{X}$ is to encode $F$ (respectively, $\mathcal{F}(F)$) in terms of a finite number $d$ (respectively, $m$) of real numbers. Together with appropriate reconstruction algorithms (decoders), the problem reduces to the approximation of $m$ functions on a compact subset of a high dimensional Euclidean space $\mathbb{R}^d$, equivalently, the unit sphere $\mathbb{S}^d$ embedded in $\mathbb{R}^{d+1}$. The problem is challenging because $d$, $m$, as well as the complexity of the approximation on $\mathbb{S}^d$ are all large, and it is necessary to estimate the accuracy keeping track of the inter-dependence of all the approximations involved. In this paper, we establish constructive methods to do this efficiently; i.e., with the constants involved in the estimates on the approximation on $\mathbb{S}^d$ being $\mathcal{O}(d^{1/6})$. We study different smoothness classes for the operators, and also propose a method for approximation of $\mathcal{F}(F)$ using only information in a small neighborhood of $F$, resulting in an effective reduction in the number of parameters involved.

Submitted 1 December, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

arXiv:2110.01670 [pdf, other]

A manifold learning approach for gesture recognition from micro-Doppler radar measurements

Authors: Eric Mason, Hrushikesh Mhaskar, Adam Guo

Abstract: A recent paper (Neural Networks, {\bf 132} (2020), 253-268) introduces a straightforward and simple kernel based approximation for manifold learning that does not require the knowledge of anything about the manifold, except for its dimension. In this paper, we examine how the pointwise error in approximation using least squares optimization based on similarly localized kernels depends upon the data characteristics and deteriorates as one goes away from the training data. The theory is presented with an abstract localized kernel, which can utilize any prior knowledge about the data being located on an unknown sub-manifold of a known manifold. We demonstrate the performance of our approach using a publicly available micro-Doppler data set, and investigate the use of different preprocessing measures, kernels, and manifold dimensions. Specifically, it is shown that the localized kernel introduced in the above mentioned paper when used with PCA components leads to a near-competitive performance to deep neural networks, and offers significant improvements in training speed and memory requirements. To demonstrate the fact that our methods are agnostic to the domain knowledge, we examine the classification problem in a simple video data set.

Submitted 21 April, 2022; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: To appear in Neural Networks

arXiv:2109.14752 [pdf, other]

Kernel distance measures for time series, random fields and other structured data

Authors: Srinjoy Das, Hrushikesh Mhaskar, Alexander Cloninger

Abstract: This paper introduces kdiff, a novel kernel-based measure for estimating distances between instances of time series, random fields and other forms of structured data. This measure is based on the idea of matching distributions that only overlap over a portion of their region of support. Our proposed measure is inspired by MPdist which has been previously proposed for such datasets and is constructed using Euclidean metrics, whereas kdiff is constructed using non-linear kernel distances. Also, kdiff accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution. Comparing the cross similarity to self similarity allows for measures of similarity that are more robust to noise and partial occlusions of the relevant signals. Our proposed measure kdiff is a more general form of the well known kernel-based Maximum Mean Discrepancy (MMD) distance estimated over the embeddings. Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems where the embedding distributions can be modeled as two component mixtures. Applications are demonstrated for clustering of synthetic and real-life time series and image data, and the performance of kdiff is compared to competing distance measures for clustering.

Submitted 29 September, 2021; originally announced September 2021.

arXiv:2105.05893 [pdf, other]

A function approximation approach to the prediction of blood glucose levels

Authors: H. N. Mhaskar, S. V. Pereverzyev, M. D. van der Walt

Abstract: The problem of real time prediction of blood glucose (BG) levels based on the readings from a continuous glucose monitoring (CGM) device is a problem of great importance in diabetes care, and therefore, has attracted a lot of research in recent years, especially based on machine learning. An accurate prediction with a 30, 60, or 90 minute prediction horizon has the potential of saving millions of dollars in emergency care costs. In this paper, we treat the problem as one of function approximation, where the value of the BG level at time $t+h$ (where $h$ is the prediction horizon) is considered to be an unknown function of $d$ readings prior to the time $t$. This unknown function may be supported in particular on some unknown submanifold of the $d$-dimensional Euclidean space. While manifold learning is classically done in a semi-supervised setting, where the entire data has to be known in advance, we use recent ideas to achieve an accurate function approximation in a supervised setting; i.e., construct a model for the target function. We use the state-of-the-art clinically relevant PRED-EGA grid to evaluate our results, and demonstrate that for a real life dataset, our method performs better than a standard deep network, especially in hypoglycemic and hyperglycemic regimes. One noteworthy aspect of this work is that the training data and test data may come from different distributions.

Submitted 29 June, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

Comments: arXiv admin note: text overlap with arXiv:1707.05828

arXiv:2010.04227 [pdf, other]

A low discrepancy sequence on graphs

Authors: A. Cloninger, H. N. Mhaskar

Abstract: Many applications such as election forecasting, environmental monitoring, health policy, and graph based machine learning require taking expectation of functions defined on the vertices of a graph. We describe a construction of a sampling scheme analogous to the so called Leja points in complex potential theory that can be proved to give low discrepancy estimates for the approximation of the expected value by the empirical expected value based on these points. In contrast to classical potential theory where the kernel is fixed and the equilibrium distribution depends upon the kernel, we fix a probability distribution and construct a kernel (which represents the graph structure) for which the equilibrium distribution is the given probability distribution. Our estimates do not depend upon the size of the graph.

Submitted 7 June, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: Accepted for publication in Journal of Fourier Analysis and Applications

arXiv:2008.01245 [pdf, other]

Cautious Active Clustering

Authors: Alexander Cloninger, Hrushikesh Mhaskar

Abstract: We consider the problem of classification of points sampled from an unknown probability measure on a Euclidean space. We study the question of querying the class label at a very small number of judiciously chosen points so as to be able to attach the appropriate class label to every point in the set. Our approach is to consider the unknown probability measure as a convex combination of the conditional probabilities for each class. Our technique involves the use of a highly localized kernel constructed from Hermite polynomials, in order to create a hierarchical estimate of the supports of the constituent probability measures. We do not need to make any assumptions on the nature of any of the probability measures nor know in advance the number of classes involved. We give theoretical guarantees measured by the $F$-score for our classification scheme. Examples include classification in hyper-spectral images and MNIST classification.

Submitted 7 December, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

arXiv:2003.13226 [pdf, ps, other]

Kernel based analysis of massive data

Authors: Hrushikesh N Mhaskar

Abstract: Dealing with massive data is a challenging task for machine learning. An important aspect of machine learning is function approximation. In the context of massive data, some of the commonly used tools for this purpose are sparsity, divide-and-conquer, and distributed learning. In this paper, we develop a very general theory of approximation by networks, which we have called eignets, to achieve local, stratified approximation. The very massive nature of the data allows us to use these eignets to solve inverse problems such as finding a good approximation to the probability law that governs the data, and finding the local smoothness of the target function near different points in the domain. In fact, we develop a wavelet-like representation using our eignets. Our theory is applicable to approximation on a general locally compact metric measure space. Special examples include approximation by periodic basis functions on the torus, zonal function networks on a Euclidean sphere (including smooth ReLU networks), Gaussian networks, and approximation on manifolds. We construct pre-fabricated networks so that no data-based training is required for the approximation.

Submitted 7 July, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Comments: Accepted for publication in Frontiers in Applied Mathematics and Statistics, section Mathematics of Computation and Data Science. Special issue on Fundamental Mathematical Topics in Data Science

arXiv:2001.12006 [pdf, other]

Theory inspired deep network for instantaneous-frequency extraction and signal components recovery from discrete blind-source data

Authors: Charles K. Chui, Ningning Han, Hrushikesh N. Mhaskar

Abstract: This paper is concerned with the inverse problem of recovering the unknown signal components, along with extraction of their instantaneous frequencies (IFs), governed by the adaptive harmonic model (AHM), from discrete (and possibly non-uniform) samples of the blind-source composite signal. None of the existing decomposition methods and algorithms, including the most popular empirical mode decomposition (EMD) computational scheme and its current modifications, is capable of solving this inverse problem. In order to meet the AHM formulation and to extract the IFs of the decomposed components, called intrinsic mode functions (IMFs), each IMF of EMD is extended to an analytic function in the upper half of the complex plane via the Hilbert transform, followed by taking the real part of the polar form of the analytic extension. Unfortunately, this approach most often fails to resolve the inverse problem satisfactorily. More recently, to resolve the inverse problem, the notion of synchrosqueezed wavelet transform (SST) was proposed by Daubechies and Maes, and further developed in many other papers, while a more direct method, called signal separation operation (SSO), was proposed and developed in our previous work published in the journal, Applied and Computational Harmonic Analysis, vol. 30(2):243-261, 2016. In the present paper, we propose a synthesis of SSO using a deep neural network, based directly on a discrete sample set, that may be non-uniformly sampled, of the blind-source signal. Our method is localized, as illustrated by a number of numerical examples, including components with different signal arrival and departure times. It also yields short-term prediction of the signal components, along with their IFs. Our neural networks are inspired by theory, designed so that they do not require any training in the traditional sense.

Submitted 31 January, 2020; originally announced January 2020.

arXiv:1908.09880 [pdf, ps, other]

Dimension independent bounds for general shallow networks

Authors: Hrushikesh N. Mhaskar

Abstract: This paper proves an abstract theorem addressing in a unified manner two important problems in function approximation: avoiding curse of dimensionality and estimating the degree of approximation for out-of-sample extension in manifold learning. We consider an abstract (shallow) network that includes, for example, neural networks, radial basis function networks, and kernels on data defined manifolds used for function approximation in various settings. A deep network is obtained by a composition of the shallow networks according to a directed acyclic graph, representing the architecture of the deep network. In this paper, we prove dimension independent bounds for approximation by shallow networks in the very general setting of what we have called $G$-networks on a compact metric measure space, where the notion of dimension is defined in terms of the cardinality of maximal distinguishable sets, generalizing the notion of dimension of a cube or a manifold. Our techniques give bounds that improve without saturation with the smoothness of the kernel involved in an integral representation of the target function. In the context of manifold learning, our bounds provide estimates on the degree of approximation for an out-of-sample extension of the target function to the ambient space. One consequence of our theorem is that without the requirement of robust parameter selection, deep networks using a non-smooth activation function such as the ReLU, do not provide any significant advantage over shallow networks in terms of the degree of approximation alone.

Submitted 4 November, 2019; v1 submitted 26 August, 2019; originally announced August 2019.

arXiv:1908.00156 [pdf, other]

A direct approach for function approximation on data defined manifolds

Authors: Hrushikesh Mhaskar

Abstract: In much of the literature on function approximation by deep networks, the function is assumed to be defined on some known domain, such as a cube or a sphere. In practice, the data might not be dense on these domains, and therefore, the approximation theory results are observed to be too conservative. In manifold learning, one assumes instead that the data is sampled from an unknown manifold; i.e., the manifold is defined by the data itself. Function approximation on this unknown manifold is then a two stage procedure: first, one approximates the Laplace-Beltrami operator (and its eigen-decomposition) on this manifold using a graph Laplacian, and next, approximates the target function using the eigen-functions. Alternatively, one estimates first some atlas on the manifold and then uses local approximation techniques based on the local coordinate charts. In this paper, we propose a more direct approach to function approximation on \emph{unknown}, data defined manifolds without computing the eigen-decomposition of some operator or an atlas for the manifold, and without any kind of training in the classical sense. Our constructions are universal; i.e., do not require the knowledge of any prior on the target function other than continuity on the manifold. We estimate the degree of approximation. For smooth functions, the estimates do not suffer from the so-called saturation phenomenon. We demonstrate via a property called good propagation of errors how the results can be lifted for function approximation using deep networks where each channel evaluates a Gaussian network on a possibly unknown manifold.

Submitted 20 August, 2020; v1 submitted 31 July, 2019; originally announced August 2019.

Comments: Version 1 was submitted on August 1, 2019 under the title Deep Gaussian networks for function approximation on data defined manifolds. This version is accepted for publication in Neural Networks

arXiv:1907.04895 [pdf, ps, other]

Super-resolution meets machine learning: approximation of measures

Authors: H. N. Mhaskar

Abstract: The problem of super-resolution in general terms is to recuperate a finitely supported measure $μ$ given finitely many of its coefficients $\hatμ(k)$ with respect to some orthonormal system. The interesting case concerns situations, where the number of coefficients required is substantially smaller than a power of the reciprocal of the minimal separation among the points in the support of $μ$. In this paper, we consider the more severe problem of recuperating $μ$ approximately without any assumption on $μ$ beyond having a finite total variation. In particular, $μ$ may be supported on a continuum, so that the minimal separation among the points in the support of $μ$ is $0$. A variant of this problem is also of interest in machine learning as well as the inverse problem of de-convolution. We define an appropriate notion of a distance between the target measure and its recuperated version, give an explicit expression for the recuperation operator, and estimate the distance between $μ$ and its approximation. We show that these estimates are the best possible in many different ways. We also explain why for a finitely supported measure the approximation quality of its recuperation is bounded from below if the amount of information is smaller than what is demanded in the super-resolution problem.

Submitted 10 July, 2019; originally announced July 2019.

Comments: 14 pages, To appear in Journal of Fourier Analysis and Applications

arXiv:1905.12882 [pdf, other]

Function approximation by deep networks

Authors: H. N. Mhaskar, T. Poggio

Abstract: We show that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge. Thus, the blessing of compositionality mitigates the curse of dimensionality. On the other hand, a theorem called good propagation of errors allows to `lift' theorems about shallow networks to those about deep networks with an appropriate choice of norms, smoothness, etc. We illustrate this in three contexts where each channel in the deep network calculates a spherical polynomial, a non-smooth ReLU network, or another zonal function network related closely with the ReLU network.

Submitted 23 November, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: To appear in Communications in pure and applied mathematics

arXiv:1901.02975 [pdf, other]

A witness function based construction of discriminative models using Hermite polynomials

Authors: H. N. Mhaskar, A. Cloninger, X. Cheng

Abstract: In machine learning, we are given a dataset of the form $\{(\mathbf{x}_j,y_j)\}_{j=1}^M$, drawn as i.i.d. samples from an unknown probability distribution $μ$; the marginal distribution for the $\mathbf{x}_j$'s being $μ^*$. We propose that rather than using a positive kernel such as the Gaussian for estimation of these measures, using a non-positive kernel that preserves a large number of moments of these measures yields an optimal approximation. We use multi-variate Hermite polynomials for this purpose, and prove optimal and local approximation results in a supremum norm in a probabilistic sense. Together with a permutation test developed with the same kernel, we prove that the kernel estimator serves as a `witness function' in classification problems. Thus, if the value of this estimator at a point $\mathbf{x}$ exceeds a certain threshold, then the point is reliably in a certain class. This approach can be used to modify pretrained algorithms, such as neural networks or nonlinear dimension reduction techniques, to identify in-class vs out-of-class regions for the purposes of generative models, classification uncertainty, or finding robust centroids. This fact is demonstrated in a number of real world data sets including MNIST, CIFAR10, Science News documents, and LaLonde data sets.

Submitted 9 January, 2019; originally announced January 2019.

Comments: 20 pages, 3.1 MB

arXiv:1806.02003 [pdf, other]

Deep Algorithms: designs for networks

Authors: Abhejit Rajagopal, Shivkumar Chandrasekaran, Hrushikesh N. Mhaskar

Abstract: A new design methodology for neural networks that is guided by traditional algorithm design is presented. To prove our point, we present two heuristics and demonstrate an algorithmic technique for incorporating additional weights in their signal-flow graphs. We show that with training the performance of these networks can not only exceed the performance of the initial network, but can match the performance of more-traditional neural network architectures. A key feature of our approach is that these networks are initialized with parameters that provide a known performance threshold for the architecture on a given task.

Submitted 6 June, 2018; originally announced June 2018.

Comments: submitted to Thirty-second Annual Conference on Neural Information Processing Systems (NIPS), May 2018

arXiv:1802.06266 [pdf, other]

An analysis of training and generalization errors in shallow and deep networks

Authors: Hrushikesh Mhaskar, Tomaso Poggio

Abstract: This paper is motivated by an open problem around deep networks, namely, the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

Submitted 27 August, 2019; v1 submitted 17 February, 2018; originally announced February 2018.

Comments: 21 pages; Accepted for publication in Neural Networks

arXiv:1801.00173 [pdf, other]

Theory of Deep Learning III: explaining the non-overfitting puzzle

Authors: Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar

Abstract: A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated to gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or crossentropy loss) Hessian. The proposition depends on the qualitative theory of dynamical systems and is supported by numerical results. Our main propositions extend to deep nonlinear networks two properties of gradient descent for linear networks, that have been recently established (1) to be key to their generalization properties: 1. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converges to the minimum norm solution for appropriate initial conditions of gradient descent. This implies that there is usually an optimum early stopping that avoids overfitting of the loss. This property, valid for the square loss and many other loss functions, is relevant especially for regression. 2. For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for "low noise" datasets. This property holds for loss functions such as the logistic and cross-entropy loss independently of the initial conditions. The robustness to overparametrization has suggestive implications for the robustness of the architecture of deep convolutional networks with respect to the curse of dimensionality.

Submitted 16 January, 2018; v1 submitted 30 December, 2017; originally announced January 2018.

arXiv:1709.08174 [pdf, other]

Function approximation with zonal function networks with activation functions analogous to the rectified linear unit functions

Authors: Hrushikesh N. Mhaskar

Abstract: A zonal function (ZF) network on the $q$ dimensional sphere $\mathbb{S}^q$ is a network of the form $\mathbf{x}\mapsto \sum_{k=1}^n a_kφ(\mathbf{x}\cdot\mathbf{x}_k)$ where $φ:[-1,1]\to\mathbb{R}$ is the activation function, $\mathbf{x}_k\in\mathbb{S}^q$ are the centers, and $a_k\in\mathbb{R}$. While the approximation properties of such networks are well studied in the context of positive definite activation functions, recent interest in deep and shallow networks motivates the study of activation functions of the form $φ(t)=|t|$, which are not positive definite. In this paper, we define an appropriate smoothness class and establish approximation properties of such networks for functions in this class. The centers can be chosen independently of the target function, and the coefficients are linear combinations of the training data. The constructions preserve rotational symmetries.

Submitted 8 July, 2018; v1 submitted 24 September, 2017; originally announced September 2017.

Comments: 18 pages, Title changed from the previous version
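
The network form stated in the abstract above, $\mathbf{x}\mapsto \sum_{k=1}^n a_kφ(\mathbf{x}\cdot\mathbf{x}_k)$ with $φ(t)=|t|$, can be illustrated with a minimal sketch. This only evaluates such a network; it is not the paper's construction. The function names `zf_network` and `normalize`, and the centers and coefficients below, are hypothetical and chosen purely for illustration (in the paper the coefficients are linear combinations of the training data, which is not reproduced here).

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit length so it lies on the sphere."""
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def zf_network(x, centers, coeffs, phi=abs):
    """Evaluate x -> sum_k a_k * phi(x . x_k) for unit vectors x and x_k.

    phi defaults to |t|, the ReLU-like activation named in the abstract.
    """
    total = 0.0
    for a_k, x_k in zip(coeffs, centers):
        dot = sum(xi * ci for xi, ci in zip(x, x_k))
        # Clamp to [-1, 1] to guard against floating-point rounding.
        dot = max(-1.0, min(1.0, dot))
        total += a_k * phi(dot)
    return total

# Hypothetical centers on the 2-sphere and arbitrary illustrative coefficients.
centers = [normalize([1.0, 0.0, 0.0]),
           normalize([0.0, 1.0, 0.0]),
           normalize([0.0, 0.0, 1.0])]
coeffs = [0.5, -1.0, 2.0]

x = normalize([1.0, 0.0, 0.0])
value = zf_network(x, centers, coeffs)
value_antipodal = zf_network([-t for t in x], centers, coeffs)
```

Because $φ(t)=|t|$ is even, a network of this form takes the same value at $\mathbf{x}$ and $-\mathbf{x}$, which the two evaluations above reflect.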

arXiv:1707.09428 [pdf, ps, other]

A unified method for super-resolution recovery and real exponential-sum separation

Authors: Charles K. Chui, Hrushikesh N. Mhaskar

Abstract: In this paper, motivated by diffraction of traveling light waves, a simple mathematical model is proposed, both for the multivariate super-resolution problem and the problem of blind-source separation of real-valued exponential sums. This model facilitates the development of a unified theory and a unified solution of both problems in this paper. Our consideration of the super-resolution problem is aimed at applications to fluorescence microscopy and observational astronomy, and the motivation for our consideration of the second problem is the current need of extracting multivariate exponential features in magnetic resonance spectroscopy (MRS) for the neurologist and radiologist as well as for providing a mathematical tool for isotope separation in Nuclear Chemistry. The unified method introduced in this paper can be easily realized by processing only finitely many data, sampled at locations that are not necessarily prescribed in advance, with computational scheme consisting only of matrix - vector multiplication, peak finding, and clustering.

Submitted 26 July, 2017; originally announced July 2017.

arXiv:1707.09319 [pdf, ps, other]

A Fourier-invariant method for locating point-masses and computing their attributes

Authors: Charles K. Chui, Hrushikesh N. Mhaskar

Abstract: Motivated by the interest of observing the growth of cancer cells among normal living cells and exploring how galaxies and stars are truly formed, the objective of this paper is to introduce a rigorous and effective method for counting point-masses, determining their spatial locations, and computing their attributes. Based on computation of Hermite moments that are Fourier-invariant, our approach facilitates the processing of both spatial and Fourier data in any dimension.

Submitted 26 July, 2017; originally announced July 2017.

arXiv:1707.05828 [pdf, other]

doi 10.3389/fams.2017.00014

A deep learning approach to diabetic blood glucose prediction

Authors: H. N. Mhaskar, S. V. Pereverzyev, M. D. van der Walt

Abstract: We consider the question of 30-minute prediction of blood glucose levels measured by continuous glucose monitoring devices, using clinical data. While most studies of this nature deal with one patient at a time, we take a certain percentage of patients in the data set as training data, and test on the remainder of the patients; i.e., the machine need not re-calibrate on the new patients in the data set. We demonstrate how deep learning can outperform shallow networks in this example. One novelty is to demonstrate how a parsimonious deep representation can be constructed using domain knowledge.

Submitted 18 July, 2017; originally announced July 2017.

Journal ref: Front. Appl. Math. Stat., 14 July 2017

arXiv:1611.00740 [pdf, other]

Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review

Authors: Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao

Abstract: The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case of these conditions, though weight sharing is not the main reason for their exponential advantage.

Submitted 4 February, 2017; v1 submitted 2 November, 2016; originally announced November 2016.

arXiv:1608.03287 [pdf, other]

Deep vs. shallow networks : An approximation theory perspective

Authors: Hrushikesh Mhaskar, Tomaso Poggio

Abstract: The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function - the ReLU function - used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

Submitted 10 August, 2016; originally announced August 2016.

Comments: 14 pages, 4 figures, to be published in a Journal

Report number: CBMM Memo 54

arXiv:1607.07110 [pdf, ps, other]

Deep nets for local manifold learning

Authors: Charles K. Chui, H. N. Mhaskar

Abstract: The problem of extending a function $f$ defined on a training data $\mathcal{C}$ on an unknown manifold $\mathbb{X}$ to the entire manifold and a tubular neighborhood of this manifold is considered in this paper. For $\mathbb{X}$ embedded in a high dimensional ambient Euclidean space $\mathbb{R}^D$, a deep learning algorithm is developed for finding a local coordinate system for the manifold {\bf without eigen--decomposition}, which reduces the problem to the classical problem of function approximation on a low dimensional cube. Deep nets (or multilayered neural networks) are proposed to accomplish this approximation scheme by using the training data. Our methods do not involve such optimization techniques as back--propagation, while assuring optimal (a priori) error bounds on the output in terms of the number of derivatives of the target function. In addition, these methods are universal, in that they do not require a prior knowledge of the smoothness of the target function, but adjust the accuracy of approximation locally and automatically, depending only upon the local smoothness of the target function. Our ideas are easily extended to solve both the pre--image problem and the out--of--sample extension problem, with a priori bounds on the growth of the function thus extended.

Submitted 24 July, 2016; originally announced July 2016.

Comments: Submitted on Sept. 17, 2015

arXiv:1603.00988 [pdf, other]

Learning Functions: When Is Deep Better Than Shallow

Authors: Hrushikesh Mhaskar, Qianli Liao, Tomaso Poggio

Abstract: While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.

Submitted 29 May, 2016; v1 submitted 3 March, 2016; originally announced March 2016.

arXiv:0909.5000 [pdf, ps, other]

Eignets for function approximation on manifolds

Authors: H. N. Mhaskar

Abstract: Let $\mathbb{X}$ be a compact, smooth, connected, Riemannian manifold without boundary, $G:\mathbb{X}\times\mathbb{X}\to \mathbb{R}$ be a kernel. Analogous to a radial basis function network, an eignet is an expression of the form $\sum_{j=1}^M a_jG(\circ,y_j)$, where $a_j\in\mathbb{R}$, $y_j\in\mathbb{X}$, $1\le j\le M$. We describe a deterministic, universal algorithm for constructing an eignet for approximating functions in $L^p(μ;\mathbb{X})$ for a general class of measures $μ$ and kernels $G$. Our algorithm yields linear operators. Using the minimal separation amongst the centers $y_j$ as the cost of approximation, we give modulus of smoothness estimates for the degree of approximation by our eignets, and show by means of a converse theorem that these are the best possible for every \emph{individual function}. We also give estimates on the coefficients $a_j$ in terms of the norm of the eignet. Finally, we demonstrate that if any sequence of eignets satisfies the optimal estimates for the degree of approximation of a smooth function, measured in terms of the minimal separation, then the derivatives of the eignets also approximate the corresponding derivatives of the target function in an optimal manner.

Submitted 28 September, 2009; originally announced September 2009.

Comments: 28 pages. Articles in press; Applied and Computational Harmonic Analysis, 2009

Showing 1–31 of 31 results for author: Mhaskar, H