Deep Learning in Random Neural Fields: Numerical Experiments via Neural Tangent Kernel

Authors: Kaito Watanabe, Kotaro Sakamoto, Ryo Karakida, Sho Sonoda, Shun-ichi Amari

Abstract: A biological neural network in the cortex forms a neural field. Neurons in the field have their own receptive fields, and connection weights between two neurons are random but highly correlated when they are in close proximity in receptive fields. In this paper, we investigate such neural fields in a multilayer architecture to investigate the supervised learning of the fields. We empirically compare the performances of our field model with those of randomly connected deep networks. The behavior of a randomly connected network is investigated on the basis of the key idea of the neural tangent kernel regime, a recent development in the machine learning theory of over-parameterized networks; for most randomly connected neural networks, it is shown that global minima always exist in their small neighborhoods. We numerically show that this claim also holds for our neural fields. In more detail, our model has two structures: i) each neuron in a field has a continuously distributed receptive field, and ii) the initial connection weights are random but not independent, having correlations when the positions of neurons are close in each layer. We show that such a multilayer neural field is more robust than conventional models when input patterns are deformed by noise disturbances. Moreover, its generalization ability can be slightly superior to that of conventional models.

Submitted 6 January, 2023; v1 submitted 10 February, 2022; originally announced February 2022.

arXiv:2006.10732 [pdf, other]

When Does Preconditioning Help or Hurt Generalization?

Authors: Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu

Abstract: While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the \textit{implicit bias} of first- and second-order methods affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.

Submitted 8 December, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: 42 pages

arXiv:2001.06931 [pdf, ps, other]

Any Target Function Exists in a Neighborhood of Any Sufficiently Wide Random Network: A Geometrical Perspective

Authors: Shun-ichi Amari

Abstract: It is known that any target function is realized in a sufficiently small neighborhood of any randomly connected deep network, provided the width (the number of neurons in a layer) is sufficiently large. There are sophisticated theories and discussions concerning this striking fact, but rigorous theories are very complicated. We give an elementary geometrical proof by using a simple model for the purpose of elucidating its structure. We show that high-dimensional geometry plays a magical role: When we project a high-dimensional sphere of radius 1 to a low-dimensional subspace, the uniform distribution over the sphere reduces to a Gaussian distribution of negligibly small covariances.

Submitted 17 March, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

arXiv:1910.05992 [pdf, other]

Pathological spectra of the Fisher information metric and its variants in deep neural networks

Authors: Ryo Karakida, Shotaro Akaho, Shun-ichi Amari

Abstract: The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth and sample size when the network has random weights and is sufficiently wide. This study covers two widely-used FIMs for regression with linear output and for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size while the others are much smaller. It implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to more quantitative understanding of learning in large-scale DNNs.

Submitted 27 September, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

Comments: 23 pages, 7 figures; v2: minor improvements, Section 3.4 added

arXiv:1906.02926 [pdf, other]

The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks

Authors: Ryo Karakida, Shotaro Akaho, Shun-ichi Amari

Abstract: Normalization methods play an important role in enhancing the performance of deep learning while their theoretical understandings have been limited. To theoretically elucidate the effectiveness of normalization, we quantify the geometry of the parameter space determined by the Fisher information matrix (FIM), which also corresponds to the local shape of the loss landscape under certain conditions. We analyze deep neural networks with random initialization, which is known to suffer from a pathologically sharp shape of the landscape when the network becomes sufficiently wide. We reveal that batch normalization in the last layer contributes to drastically decreasing such pathological sharpness if the width and sample number satisfy a specific condition. In contrast, it is hard for batch normalization in the middle hidden layers to alleviate pathological sharpness in many settings. We also found that layer normalization cannot alleviate pathological sharpness either. Thus, we can conclude that batch normalization in the last layer significantly contributes to decreasing the sharpness induced by the FIM.

Submitted 28 October, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

Comments: To appear in NeurIPS 2019

arXiv:1808.07172 [pdf, ps, other]

Fisher Information and Natural Gradient Learning of Random Deep Networks

Authors: Shun-ichi Amari, Ryo Karakida, Masafumi Oizumi

Abstract: A deep neural network is a hierarchical nonlinear model transforming input signals to output signals. Its input-output relation is considered to be stochastic, being described for a given input by a parameterized conditional probability distribution of outputs. The space of parameters consisting of weights and biases is a Riemannian manifold, where the metric is defined by the Fisher information matrix. The natural gradient method uses the steepest descent direction in a Riemannian manifold, so it is effective in learning, avoiding plateaus. It requires inversion of the Fisher information matrix, however, which is practically impossible when the matrix has a huge number of dimensions. Many methods for approximating the natural gradient have therefore been introduced. The present paper uses a statistical neurodynamical method to reveal the properties of the Fisher information matrix in a net of random connections under the mean field approximation. We prove that the Fisher information matrix is unit-wise block diagonal supplemented by small order terms of off-block-diagonal elements, which provides a justification for the quasi-diagonal natural gradient method by Y. Ollivier. A unit-wise block-diagonal Fisher matrix reduces to the tensor product of the Fisher information matrices of single units. We further prove that the Fisher information matrix of a single unit has a simple reduced form, a sum of a diagonal matrix and a rank 2 matrix of weight-bias correlations. We obtain the inverse of Fisher information explicitly. We then have an explicit form of the natural gradient, without relying on the numerical matrix inversion, which drastically speeds up stochastic gradient learning.

Submitted 21 August, 2018; originally announced August 2018.

Comments: 22 pages, 2 figures

arXiv:1808.07169 [pdf, ps, other]

Statistical Neurodynamics of Deep Networks: Geometry of Signal Spaces

Authors: Shun-ichi Amari, Ryo Karakida, Masafumi Oizumi

Abstract: Statistical neurodynamics studies macroscopic behaviors of randomly connected neural networks. We consider a deep layered feedforward network where input signals are processed layer by layer. The manifold of input signals is embedded in a higher dimensional manifold of the next layer as a curved submanifold, provided the number of neurons is larger than that of inputs. We show geometrical features of the embedded manifold, proving that the manifold enlarges or shrinks locally isotropically so that it is always embedded conformally. We study the curvature of the embedded manifold. The scalar curvature converges to a constant or diverges to infinity slowly. The distance between two signals also changes, converging eventually to a stable fixed value, provided both the number of neurons in a layer and the number of layers tend to infinity. This causes a problem, since when we consider a curve in the input space, it is mapped as a continuous curve of fractal nature, but our theory contradictorily suggests that the curve eventually converges to a discrete set of equally spaced points. In reality, the numbers of neurons and layers are finite and thus, it is expected that the finite size effect causes the discrepancies between our theory and reality. We need to further study the discrepancies to understand their implications on information processing.

Submitted 21 August, 2018; originally announced August 2018.

Comments: 23 pages, 8 figures

arXiv:1806.01316 [pdf, other]

Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach

Authors: Ryo Karakida, Shotaro Akaho, Shun-ichi Amari

Abstract: The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value. Because the landscape of the parameter space is defined by the FIM, it is locally flat in most dimensions, but strongly distorted in others. Moreover, we demonstrate the potential usage of the derived statistics in learning strategies. First, small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability. Second, the maximum eigenvalue that induces the distortion enables us to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.

Submitted 8 October, 2019; v1 submitted 4 June, 2018; originally announced June 2018.

Comments: Accepted at AISTATS2019. Main text: 10 pages, 2 figures. Supplementary material: 9 pages, 2 figures, typos corrected

arXiv:1709.10219 [pdf, other]

Information Geometry Connecting Wasserstein Distance and Kullback-Leibler Divergence via the Entropy-Relaxed Transportation Problem

Authors: Shun-ichi Amari, Ryo Karakida, Masafumi Oizumi

Abstract: Two geometrical structures have been extensively studied for a manifold of probability distributions. One is based on the Fisher information metric, which is invariant under reversible transformations of random variables, while the other is based on the Wasserstein distance of optimal transportation, which reflects the structure of the distance between random variables. Here, we propose a new information-geometrical theory that is a unified framework connecting the Wasserstein distance and Kullback-Leibler (KL) divergence. We primarily considered a discrete case consisting of $n$ elements and studied the geometry of the probability simplex $S_{n-1}$, which is the set of all probability distributions over $n$ elements. The Wasserstein distance was introduced in $S_{n-1}$ by the optimal transportation of commodities from distribution ${\mathbf{p}}$ to distribution ${\mathbf{q}}$, where ${\mathbf{p}}$, ${\mathbf{q}} \in S_{n-1}$. We relaxed the optimal transportation by using entropy, which was introduced by Cuturi. The optimal solution was called the entropy-relaxed stochastic transportation plan. The entropy-relaxed optimal cost $C({\mathbf{p}}, {\mathbf{q}})$ was computationally much less demanding than the original Wasserstein distance but does not define a distance because it is not minimized at ${\mathbf{p}}={\mathbf{q}}$. To define a proper divergence while retaining the computational advantage, we first introduced a divergence function in the manifold $S_{n-1} \times S_{n-1}$ of optimal transportation plans. We fully explored the information geometry of the manifold of the optimal transportation plans and subsequently constructed a new one-parameter family of divergences in $S_{n-1}$ that are related to both the Wasserstein distance and the KL-divergence.

Submitted 28 September, 2017; originally announced September 2017.

arXiv:1709.02050 [pdf, ps, other]

Geometry of Information Integration

Authors: Shun-ichi Amari, Naotsugu Tsuchiya, Masafumi Oizumi

Abstract: Information geometry is used to quantify the amount of information integration within multiple terminals of a causal dynamical system. Integrated information quantifies how much information is lost when a system is split into parts and information transmission between the parts is removed. Multiple measures have been proposed as a measure of integrated information. Here, we analyze four of the previously proposed measures and elucidate their relations from a viewpoint of information geometry. Two of them use dually flat manifolds and the other two use curved manifolds to define a split model. We show that there are hierarchical structures among the measures. We provide explicit expressions of these measures.

Submitted 6 September, 2017; originally announced September 2017.

arXiv:1510.04455 [pdf, ps, other]

doi 10.1073/pnas.1603583113

A unified framework for information integration based on information geometry

Authors: Masafumi Oizumi, Naotsugu Tsuchiya, Shun-ichi Amari

Abstract: We propose a unified theoretical framework for quantifying spatio-temporal interactions in a stochastic dynamical system based on information geometry. In the proposed framework, the degree of interactions is quantified by the divergence between the actual probability distribution of the system and a constrained probability distribution where the interactions of interest are disconnected. This framework provides novel geometric interpretations of various information theoretic measures of interactions, such as mutual information, transfer entropy, and stochastic interaction in terms of how interactions are disconnected. The framework therefore provides an intuitive understanding of the relationships between the various quantities. By extending the concept of transfer entropy, we propose a novel measure of integrated information which measures causal interactions between parts of a system. Integrated information quantifies the extent to which the whole is more than the sum of the parts and can be potentially used as a biological measure of the levels of consciousness.

Submitted 15 October, 2015; originally announced October 2015.

arXiv:1505.04368 [pdf, ps, other]

doi 10.1371/journal.pcbi.1004654

Measuring integrated information from the decoding perspective

Authors: Masafumi Oizumi, Shun-ichi Amari, Toru Yanagawa, Naotaka Fujii, Naotsugu Tsuchiya

Abstract: Accumulating evidence indicates that the capacity to integrate information in the brain is a prerequisite for consciousness. Integrated Information Theory (IIT) of consciousness provides a mathematical approach to quantifying the information integrated in a system, called integrated information, $Φ$. Integrated information is defined theoretically as the amount of information a system generates as a whole, above and beyond the sum of the amount of information its parts independently generate. IIT predicts that the amount of integrated information in the brain should reflect levels of consciousness. Empirical evaluation of this theory requires computing integrated information from neural data acquired from experiments, although difficulties with using the original measure $Φ$ preclude such computations. Although some practical measures have been previously proposed, we found that these measures fail to satisfy the theoretical requirements as a measure of integrated information. Measures of integrated information should satisfy the lower and upper bounds as follows: The lower bound of integrated information should be 0 when the system does not generate information (no information) or when the system comprises independent parts (no integration). The upper bound of integrated information is the amount of information generated by the whole system and is realized when the amount of information generated independently by its parts equals 0. Here we derive the novel practical measure $Φ^*$ by introducing a concept of mismatched decoding developed from information theory. We show that $Φ^*$ is properly bounded from below and above, as required, as a measure of integrated information. We derive the analytical expression $Φ^*$ under the Gaussian assumption, which makes it readily applicable to experimental data.

Submitted 17 May, 2015; originally announced May 2015.

Journal ref: PLoS Comput Biol 12(1), e1004654, 2016

arXiv:1412.7146 [pdf, other]

doi 10.3390/e17052988

Log-Determinant Divergences Revisited: Alpha-Beta and Gamma Log-Det Divergences

Authors: Andrzej Cichocki, Sergio Cruces, Shun-Ichi Amari

Abstract: In this paper, we review and extend a family of log-det divergences for symmetric positive definite (SPD) matrices and discuss their fundamental properties. We show how to generate from parameterized Alpha-Beta (AB) and Gamma Log-det divergences many well known divergences, for example, Stein's loss, the S-divergence, also called the Jensen-Bregman LogDet (JBLD) divergence, the Logdet Zero (Bhattacharyya) divergence, and the Affine Invariant Riemannian Metric (AIRM), as well as some new divergences. Moreover, we establish links and correspondences among many log-det divergences and display them on the alpha-beta plane for various sets of parameters. Furthermore, this paper bridges these divergences and also shows their links to divergences of multivariate and multiway Gaussian distributions. Closed form formulas are derived for gamma divergences of two multivariate Gaussian densities including as special cases the Kullback-Leibler, Bhattacharyya, Rényi and Cauchy-Schwarz divergences. Symmetrized versions of the log-det divergences are also discussed and reviewed. A class of divergences is extended to multiway divergences for separable covariance (precision) matrices.

Submitted 23 December, 2014; v1 submitted 18 December, 2014; originally announced December 2014.

Comments: 35 pages, 4 figures

arXiv:1410.2386 [pdf, other]

doi 10.1109/TNNLS.2015.2423694

Bayesian Robust Tensor Factorization for Incomplete Multiway Data

Authors: Qibin Zhao, Guoxu Zhou, Liqing Zhang, Andrzej Cichocki, Shun-ichi Amari

Abstract: We propose a generative model for robust tensor factorization in the presence of both missing data and outliers. The objective is to explicitly infer the underlying low-CP-rank tensor capturing the global information and a sparse tensor capturing the local information (also considered as outliers), thus providing the robust predictive distribution over missing entries. The low-CP-rank tensor is modeled by multilinear interactions between multiple latent factors on which the column sparsity is enforced by a hierarchical prior, while the sparse tensor is modeled by a hierarchical view of Student-$t$ distribution that associates an individual hyperparameter with each element independently. For model learning, we develop an efficient closed-form variational inference under a fully Bayesian treatment, which can effectively prevent the overfitting problem and scales linearly with data size. In contrast to existing related works, our method can perform model selection automatically and implicitly without need of tuning parameters. More specifically, it can discover the groundtruth of CP rank and automatically adapt the sparsity inducing priors to various types of outliers. In addition, the tradeoff between the low-rank approximation and the sparse representation can be optimized in the sense of maximum model evidence. The extensive experiments and comparisons with many state-of-the-art algorithms on both synthetic and real-world datasets demonstrate the superiorities of our method from several perspectives.

Submitted 16 April, 2015; v1 submitted 9 October, 2014; originally announced October 2014.

Comments: in IEEE Transactions on Neural Networks and Learning Systems, 2015

arXiv:1311.5125 [pdf, ps, other]

On conformal divergences and their population minimizers

Authors: Richard Nock, Frank Nielsen, Shun-ichi Amari

Abstract: Total Bregman divergences are a recent tweak of ordinary Bregman divergences originally motivated by applications that required invariance by rotations. They have displayed superior results compared to ordinary Bregman divergences on several clustering, computer vision, medical imaging and machine learning tasks. These preliminary results raise two important problems: First, report a complete characterization of the left and right population minimizers for this class of total Bregman divergences. Second, characterize a principled superset of total and ordinary Bregman divergences with good clustering properties, from which one could tailor the choice of a divergence to a particular application. In this paper, we provide and study one such superset with interesting geometric features, that we call conformal divergences, and focus on their left and right population minimizers. Our results are obtained in a recently coined $(u, v)$-geometric structure that is a generalization of the dually flat affine connections in information geometry. We characterize both analytically and geometrically the population minimizers. We prove that conformal divergences (resp. total Bregman divergences) are essentially exhaustive for their left (resp. right) population minimizers. We further report new results and extend previous results on the robustness to outliers of the left and right population minimizers, and discuss the role of the $(u, v)$-geometric structure in clustering. Additional results are also given.

Submitted 8 June, 2015; v1 submitted 20 November, 2013; originally announced November 2013.

arXiv:1304.6591 [pdf, ps, other]

Lp-Regularized Least Squares (0<p<1) and Critical Path

Authors: Masahiro Yukawa, Shun-ichi Amari

Abstract: The least squares problem is formulated in terms of Lp quasi-norm regularization (0<p<1). Two formulations are considered: (i) an Lp-constrained optimization and (ii) an Lp-penalized (unconstrained) optimization. Due to the nonconvexity of the Lp quasi-norm, the solution paths of the regularized least squares problem are not ensured to be continuous. A critical path, which is a maximal continuous curve consisting of critical points, is therefore considered separately. The critical paths are piecewise smooth, as can be seen from the viewpoint of the variational method, and generally contain non-optimal points such as saddle points and local maxima as well as global/local minima. Along each critical path, the correspondence between the regularization parameters (which govern the 'strength' of regularization in the two formulations) is non-monotonic and, more specifically, it has multiplicity. Two paths of critical points connecting the origin and an ordinary least squares (OLS) solution are highlighted. One is a main path starting at an OLS solution, and the other is a greedy path starting at the origin. Part of the greedy path can be constructed with a generalized Minkowskian gradient. The breakpoints of the greedy path coincide with the step-by-step solutions generated by using orthogonal matching pursuit (OMP), thereby establishing a direct link between OMP and Lp-regularized least squares.

Submitted 24 April, 2013; originally announced April 2013.

arXiv:1010.4965 [pdf, ps, other]

Dually flat structure with escort probability and its application to alpha-Voronoi diagrams

Authors: Atsumi Ohara, Hiroshi Matsuzoe, Shun-ichi Amari

Abstract: This paper studies the geometrical structure of the manifold of escort probability distributions and shows its new applicability to information science. In order to realize escort probabilities we use a conformal transformation that flattens the so-called alpha-geometry of the space of discrete probability distributions, which well characterizes nonadditive statistics on the space. As a result, escort probabilities are proved to be flat coordinates of the usual probabilities for the derived dually flat structure. Finally, we demonstrate that escort probabilities with the new structure admit a simple algorithm to compute Voronoi diagrams and centroids with respect to alpha-divergences.

Submitted 24 October, 2010; originally announced October 2010.

Comments: Several results in this paper can be found in the conference paper [36] without complete proofs

Showing 1–17 of 17 results for author: Amari, S