Scaling Law of Sim2Real Transfer Learning in Expanding Computational Materials Databases for Real-World Predictions

Authors: Shunya Minami, Yoshihiro Hayashi, Stephen Wu, Kenji Fukumizu, Hiroki Sugisawa, Masashi Ishii, Isao Kuwajima, Kazuya Shiratori, Ryo Yoshida

Abstract: To address the challenge of limited experimental materials data, extensive physical property databases are being developed based on high-throughput computational experiments, such as molecular dynamics simulations. Previous studies have shown that fine-tuning a predictor pretrained on a computational database to a real system can result in models with outstanding generalization capabilities compared to learning from scratch. This study demonstrates the scaling law of simulation-to-real (Sim2Real) transfer learning for several machine learning tasks in materials science. Case studies of three prediction tasks for polymers and inorganic materials reveal that the prediction error on real systems decreases according to a power-law as the size of the computational data increases. Observing the scaling behavior offers various insights for database development, such as determining the sample size necessary to achieve a desired performance, identifying equivalent sample sizes for physical and computational experiments, and guiding the design of data production protocols for downstream real-world tasks.

Submitted 7 August, 2024; originally announced August 2024.

Comments: 22 pages, 6 figures
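
The power-law behavior described in the abstract above can be illustrated with a minimal sketch. It assumes the simplest form, error ≈ a · n^(-alpha), fitted by least squares in log-log space; the exact functional form used in the paper is not stated in the abstract, and the data points, `fit_power_law`, and `required_size` are hypothetical names and values introduced only for illustration.

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err ≈ a * n**(-alpha) by linear least squares on log-log data."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return float(np.exp(intercept)), float(-slope)

def required_size(a, alpha, target_err):
    """Invert the fitted law: smallest n with a * n**(-alpha) <= target_err."""
    return (a / target_err) ** (1.0 / alpha)

# Synthetic, noise-free data (an assumption, not results from the paper).
n = np.array([100, 300, 1000, 3000, 10000], dtype=float)
err = 2.0 * n ** -0.5

a, alpha = fit_power_law(n, err)
n_needed = required_size(a, alpha, target_err=0.02)
```

Inverting the fitted curve, as `required_size` does, is one way to estimate the computational sample size needed for a target error, which is the kind of database-planning question the abstract mentions.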

arXiv:2405.20879 [pdf, other]

Flow matching achieves minimax optimal convergence

Authors: Kenji Fukumizu, Taiji Suzuki, Noboru Isobe, Kazusato Oko, Masanori Koyama

Abstract: Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equation with an initial condition from a normal distribution, thus streamlining the sample generation process. This paper discusses the convergence properties of FM in terms of the $p$-Wasserstein distance, a measure of distributional discrepancy. We establish that FM can achieve the minimax optimal convergence rate for $1 \leq p \leq 2$, presenting the first theoretical evidence that FM can reach convergence rates comparable to those of diffusion models. Our analysis extends existing frameworks by examining a broader class of mean and variance functions for the vector fields and identifies specific conditions necessary to attain these optimal rates.

Submitted 31 May, 2024; originally announced May 2024.
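
The sampling step described in the abstract above (solve an ODE starting from a normal draw) can be sketched as follows. This is a toy under stated assumptions: the vector field is a closed-form stand-in for a trained network, derived from the hypothetical path x_t = (1 - t + t·sigma)·x0 + t·mu that carries N(0, 1) to N(mu, sigma²); it does not reproduce the paper's estimator or its class of mean and variance functions.

```python
import numpy as np

def velocity(x, t, mu, sigma):
    """Closed-form field for the path x_t = (1 - t + t*sigma) * x0 + t*mu.

    Stand-in for a learned vector field (an assumption for illustration).
    """
    return (sigma - 1.0) * (x - t * mu) / (1.0 - t + t * sigma) + mu

def sample(x0, mu, sigma, n_steps=100):
    """Generate samples by forward Euler integration of dx/dt = v(x, t) on [0, 1]."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt, mu, sigma)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)       # initial condition from a normal distribution
x1 = sample(x0, mu=3.0, sigma=0.5)   # samples approximately from N(3, 0.25)
```

Because each trajectory of this toy field is a straight line, Euler integration recovers the map x0 ↦ mu + sigma·x0 up to floating-point error; a learned field would generally need a finer or higher-order solver.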

arXiv:2403.11520 [pdf, other]

State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards

Authors: Yuto Tanimoto, Kenji Fukumizu

Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separated SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffers from combinatorial issues for large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.10859 [pdf, other]

Neural-Kernel Conditional Mean Embeddings

Authors: Eiki Shimizu, Kenji Fukumizu, Dino Sejdinovic

Abstract: Kernel conditional mean embeddings (CMEs) offer a powerful framework for representing conditional distributions, but they often face scalability and expressiveness challenges. In this work, we propose a new method that effectively combines the strengths of deep learning with CMEs in order to address these challenges. Specifically, our approach leverages the end-to-end neural network (NN) optimization framework using a kernel-based objective. This design circumvents the computationally expensive Gram matrix inversion required by current CME methods. To further enhance performance, we provide efficient strategies to optimize the remaining kernel hyperparameters. In conditional density estimation tasks, our NN-CME hybrid achieves competitive performance and often surpasses existing deep learning-based methods. Lastly, we showcase its remarkable versatility by seamlessly integrating it into reinforcement learning (RL) contexts. Building on Q-learning, our approach naturally leads to a new variant of distributional RL methods, which demonstrates consistent effectiveness across different environments.

Submitted 16 March, 2024; originally announced March 2024.

arXiv:2402.18839 [pdf, other]

Extended Flow Matching: a Method of Conditional Generation with Generalized Continuity Equation

Authors: Noboru Isobe, Masanori Koyama, Jinzhe Zhang, Kohei Hayashi, Kenji Fukumizu

Abstract: The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated flow-based models. However, many flow-based models in use today are not built to allow one to introduce an explicit inductive bias to how the conditional distribution to be generated changes with respect to conditions. This can result in unexpected behavior in the task of style transfer, for example. In this research, we introduce extended flow matching (EFM), a direct extension of flow matching that learns a "matrix field" corresponding to the continuous map from the space of conditions to the space of distributions. We show that we can introduce inductive bias to the conditional generation through the matrix field and demonstrate this fact with MMOT-EFM, a version of EFM that aims to minimize the Dirichlet energy or the sensitivity of the distribution with respect to conditions. We will present our theory along with experimental results that support the competitiveness of EFM in conditional generation.

Submitted 5 July, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 27 pages, 10 figures, We have corrected an error in our experiment on COT-FM

MSC Class: 68T07 (Primary); 49Q22 (Secondary)

arXiv:2402.04516 [pdf, other]

Generalized Sobolev Transport for Probability Measures on a Graph

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: We study the optimal transport (OT) problem for measures supported on a graph metric space. Recently, Le et al. (2022) leveraged the graph structure and proposed a variant of OT, namely Sobolev transport (ST), which yields a closed-form expression for fast computation. However, ST is essentially coupled with the $L^p$ geometric structure within its definition, which makes it nontrivial to utilize ST for other prior structures. In contrast, the classic OT has the flexibility to adapt to various geometric structures by modifying the underlying cost function. An important instance is the Orlicz-Wasserstein (OW), which moves beyond the $L^p$ structure by leveraging the Orlicz geometric structure. Compared to the standard $p$-order Wasserstein, OW remarkably helps to advance certain machine learning approaches. Nevertheless, OW brings up a new challenge for its computation due to its two-level optimization formulation. In this work, we leverage a specific class of convex functions for Orlicz structure to propose the generalized Sobolev transport (GST). GST encompasses the ST as its special case, and can be utilized for prior structures beyond the $L^p$ geometry. In connection with the OW, we show that one only needs to solve a univariate optimization problem to compute the GST, unlike the complex two-level optimization problem in OW. We empirically illustrate that GST is several orders of magnitude faster than the OW. Moreover, we provide preliminary evidence on the advantages of GST for document classification and for several tasks in topological data analysis.

Submitted 29 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: To appear at ICML'2024

arXiv:2310.13653 [pdf, other]

Optimal Transport for Measures with Noisy Tree Metric

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: We study the optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such an OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, however, the given tree structure may be perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach, which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness, which hinders its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics from the lens of edge deletion/addition, which cover a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for fast computation, like its standard OT counterpart (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis.

Submitted 29 February, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: To appear in AISTATS 2024
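
The closed-form expression for standard tree-Wasserstein mentioned in the abstract above can be sketched as below: each edge contributes its weight times the absolute difference of the two measures' masses in the subtree under that edge. This illustrates only the standard TW distance, not the paper's robust variant or its uncertainty sets. The array encoding (a `parent` list with `parent[v] < v` and the root at index 0) and the function name are assumptions made for the example.

```python
import numpy as np

def tree_wasserstein(parent, weight, mu, nu):
    """Standard tree-Wasserstein distance between measures on tree nodes.

    parent[v] is the parent of node v (root is node 0, parent[0] = -1);
    nodes are assumed ordered so that parent[v] < v.
    weight[v] is the length of the edge (parent[v], v); weight[0] is unused.
    """
    sub = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    total = 0.0
    # Visit children before parents so sub[v] holds the full subtree mass gap.
    for v in range(len(parent) - 1, 0, -1):
        total += weight[v] * abs(sub[v])
        sub[parent[v]] += sub[v]
    return total

# Star tree: root 0 with children 1 (edge length 1) and 2 (edge length 2).
parent = [-1, 0, 0]
weight = [0.0, 1.0, 2.0]
d = tree_wasserstein(parent, weight, mu=[0, 1, 0], nu=[0, 0, 1])
```

For two point masses the result equals the tree path length between their nodes (here 1 + 2 = 3), which gives a quick sanity check on the encoding.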

arXiv:2307.11972 [pdf, other]

Out-of-Distribution Optimality of Invariant Risk Minimization

Authors: Shoji Toyota, Kenji Fukumizu

Abstract: Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domain that provides the training data. M. Arjovsky et al. (2019) introduced the concept of out-of-distribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as a minimization problem of the o.o.d. risk. Invariant Risk Minimization (IRM) is considered to be a promising approach to minimize the o.o.d. risk: IRM estimates a minimum of the o.o.d. risk by solving a bi-level optimization problem. While IRM has attracted considerable attention with empirical success, it comes with few theoretical guarantees. In particular, a solid theoretical guarantee that the bi-level optimization problem gives the minimum of the o.o.d. risk has not yet been established. Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-level optimization problem minimizes the o.o.d. risk under certain conditions. The result also provides sufficient conditions on the distributions providing training data and on the dimension of the feature space for the bi-level optimization problem to minimize the o.o.d. risk.

Submitted 21 July, 2023; originally announced July 2023.

Comments: 23 pages, submitted for a publication

arXiv:2305.18484 [pdf, other]

Neural Fourier Transform: A General Approach to Equivariant Representation Learning

Authors: Masanori Koyama, Kenji Fukumizu, Kohei Hayashi, Takeru Miyato

Abstract: Symmetry learning has proven to be an effective approach for extracting the hidden structure of data, with the concept of equivariance relation playing the central role. However, most of the current studies are built on architectural theory and corresponding assumptions on the form of data. We propose Neural Fourier Transform (NFT), a general framework of learning the latent linear action of the group without assuming explicit knowledge of how the group acts on data. We present the theoretical foundations of NFT and show that the existence of a linear equivariant feature, which has been assumed ubiquitously in equivariance learning, is equivalent to the existence of a group invariant kernel on the dataspace. We also provide experimental results to demonstrate the application of NFT in typical scenarios with varying levels of knowledge about the acting group.

Submitted 14 February, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2304.12770 [pdf, other]

Controlling Posterior Collapse by an Inverse Lipschitz Constraint on the Decoder Network

Authors: Yuri Kinoshita, Kenta Oono, Kenji Fukumizu, Yuichi Yoshida, Shin-ichi Maeda

Abstract: Variational autoencoders (VAEs) are one of the deep generative models that have experienced enormous success over the past decades. However, in practice, they suffer from a problem called posterior collapse, which occurs when the encoder coincides, or collapses, with the prior, taking no information from the latent structure of the input data into consideration. In this work, we introduce an inverse Lipschitz neural network into the decoder and, based on this architecture, provide a new method that can control, in a simple and clear manner, the degree of posterior collapse for a wide range of VAE models, equipped with a concrete theoretical guarantee. We also illustrate the effectiveness of our method through several numerical experiments.

Submitted 2 February, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: accepted to ICML 2023, some notations adjusted from the submitted version

arXiv:2302.12498 [pdf, other]

Scalable Unbalanced Sobolev Transport for Measures on a Graph

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers from a few drawbacks: (i) input measures are required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness, which limits its applications in kernel-dependent algorithmic approaches. To tackle issues (ii)-(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)-(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport for this unbalanced setting where measures may have different total mass. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performance against other transport baselines for unbalanced measures on a graph.

Submitted 24 February, 2023; originally announced February 2023.

Comments: to appear in AISTATS 2023. arXiv admin note: text overlap with arXiv:2101.09756

arXiv:2210.09745 [pdf, other]

Transfer learning with affine model transformation

Authors: Shunya Minami, Kenji Fukumizu, Yoshihiro Hayashi, Ryo Yoshida

Abstract: Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally and intuitively plausible methods have achieved great success in a wide range of real-world applications, the lack of a theoretical basis hinders further methodological development. This paper presents a general class of transfer learning regression called affine model transfer, following the principle of expected-square loss minimization. It is shown that the affine model transfer broadly encompasses various existing methods, including the most common procedure based on neural feature extractors. Furthermore, the current paper clarifies theoretical properties of the affine model transfer such as generalization error and excess risk. Through several case studies, we demonstrate the practical benefits of modeling and estimating inter-domain commonality and domain-specific factors separately with the affine-type transfer models.

Submitted 19 January, 2024; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: 34 pages

Journal ref: NeurIPS 2023

arXiv:2210.07413 [pdf, other]

Invariance-adapted decomposition and Lasso-type contrastive learning

Authors: Masanori Koyama, Takeru Miyato, Kenji Fukumizu

Abstract: Recent years have witnessed the effectiveness of contrastive learning in obtaining a representation of a dataset that is useful in interpretation and downstream tasks. However, the mechanism that describes this effectiveness has not been thoroughly analyzed, and many studies have been conducted to investigate the data structures captured by contrastive learning. In particular, the recent study of \citet{content_isolate} has shown that contrastive learning is capable of decomposing the data space into the space that is invariant to all augmentations and its complement. In this paper, we introduce the notion of invariance-adapted latent space that decomposes the data space into the intersections of the invariant spaces of each augmentation and their complements. This decomposition generalizes the one introduced in \citet{content_isolate}, and describes a structure that is analogous to the frequencies in the harmonic analysis of a group. We experimentally show that contrastive learning with a lasso-type metric can be used to find an invariance-adapted latent space, thereby suggesting a new potential for contrastive learning. We also investigate when such a latent space can be identified up to mixings within each component.

Submitted 13 October, 2022; originally announced October 2022.

Journal ref: 2022 ICML workshop of Topology, Algebra and Geometry in Machine Learning (spotlight)

arXiv:2210.05972 [pdf, other]

Unsupervised Learning of Equivariant Structure from Sequences

Authors: Takeru Miyato, Masanori Koyama, Kenji Fukumizu

Abstract: In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn the symmetry from time sequences of length at least three. Our method leverages the stationary property (e.g. constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to be able to predict the future observations. We will demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product by applying simultaneous block-diagonalization to the transition operators in the latent space, a procedure commonly used in representation theory to decompose the feature space based on the type of response to group actions. We will showcase our method from both empirical and theoretical perspectives. Our result suggests that finding a simple structured relation and learning a model with extrapolation capability are two sides of the same coin. The code is available at https://github.com/takerum/meta_sequential_prediction.

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2206.01795 [pdf, other]

Robust Topological Inference in the Presence of Outliers

Authors: Siddharth Vishwanath, Bharath K. Sriperumbudur, Kenji Fukumizu, Satoshi Kuriki

Abstract: The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology, a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a $\textit{median-of-means}$ variant of the distance function ($\textsf{MoM Dist}$), and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by $\textsf{MoM Dist}$ are both consistent estimators of the true underlying population counterpart, and their rates of convergence in the bottleneck metric are controlled by the fraction of outliers in the data. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications.

Submitted 3 June, 2022; originally announced June 2022.

Comments: 50 pages, 10 figures
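
A median-of-means distance function of the kind named in the abstract above can be sketched as follows. This is one plausible reading, assumed here rather than taken from the paper: split the sample into disjoint blocks, compute the ordinary distance from the query point to each block, and take the median over blocks. The function names, block count, and toy data are hypothetical.

```python
import numpy as np

def dist_to_set(x, points):
    """Ordinary distance function: distance from x to the nearest sample point."""
    return float(np.min(np.linalg.norm(points - x, axis=1)))

def mom_dist(x, points, n_blocks, rng):
    """Median over disjoint random blocks of the distance from x to each block."""
    idx = rng.permutation(len(points))
    blocks = np.array_split(idx, n_blocks)
    return float(np.median([dist_to_set(x, points[b]) for b in blocks]))

# Eight points on the unit circle plus one far-away outlier (toy data).
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
outlier = np.array([[10.0, 0.0]])
data = np.vstack([circle, outlier])

rng = np.random.default_rng(0)
query = np.array([10.0, 0.0])
plain = dist_to_set(query, data)                   # 0: the outlier sits at the query
robust = mom_dist(query, data, n_blocks=3, rng=rng)
```

With a single outlier and three blocks, only one block can contain the outlier, so the median ignores it and the robust distance at the outlier's location stays close to the distance to the circle.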

MSC Class: 62R40; 55N31; 68T09

arXiv:2203.15549 [pdf, other]

Invariance Learning based on Label Hierarchy

Authors: Shoji Toyota, Kenji Fukumizu

Abstract: Deep Neural Networks inherit spurious correlations embedded in training data and hence may fail to predict desired labels on unseen domains (or environments), which have different distributions from the domain used in training. Invariance Learning (IL) has been developed recently to overcome this shortcoming; using training data in many domains, IL estimates a predictor that is invariant to a change of domain. However, the requirement of training data in multiple domains is a strong restriction of IL, since it often incurs a high annotation cost. We propose a novel IL framework to overcome this problem. Assuming the availability of data from multiple domains for a higher level of classification task, for which the labeling cost is low, we estimate an invariant predictor for the target classification task with training data in a single domain. Additionally, we propose two cross-validation methods for selecting hyperparameters of invariance regularization to solve the issue of hyperparameter selection, which has not been handled properly in existing IL methods. The effectiveness of the proposed framework, including the cross-validation, is demonstrated empirically, and the correctness of the hyperparameter selection is proved under some conditions.

Submitted 29 March, 2022; originally announced March 2022.

Comments: 30 pages, submitted for a publication

arXiv:2202.10281 [pdf, other]

doi 10.1109/ACCESS.2022.3169594

ALGAN: Anomaly Detection by Generating Pseudo Anomalous Data via Latent Variables

Authors: Hironori Murase, Kenji Fukumizu

Abstract: In many anomaly detection tasks, where anomalous data rarely appear and are difficult to collect, training using only normal data is important. Although it is possible to manually create anomalous data using prior knowledge, they may be subject to user bias. In this paper, we propose an Anomalous Latent variable Generative Adversarial Network (ALGAN) in which the GAN generator produces pseudo-anomalous data as well as fake-normal data, whereas the discriminator is trained to distinguish between normal and pseudo-anomalous data. This differs from the standard GAN discriminator, which specializes in classifying two similar classes. The training dataset contains only normal data; the latent variables are introduced in anomalous states and are input into the generator to produce diverse pseudo-anomalous data. We compared the performance of ALGAN with other existing methods on the MVTec-AD, Magnetic Tile Defects, and COIL-100 datasets. The experimental results showed that ALGAN exhibited an AUROC comparable to those of state-of-the-art methods while achieving a much faster prediction time.

Submitted 9 May, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: 13 pages, 8 figures

Journal ref: IEEE Access, vol. 10, pp. 44259-44270, 2022

arXiv:2110.05225 [pdf, other]

$β$-Intact-VAE: Identifying and Estimating Causal Effects under Limited Overlap

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: As an important problem in causal inference, we discuss the identification and estimation of treatment effects (TEs) under limited overlap; that is, when subjects with certain features belong to a single treatment group. We use a latent variable to model a prognostic score which is widely used in biostatistics and sufficient for TEs; i.e., we build a generative prognostic model. We prove that the latent variable recovers a prognostic score, and the model identifies individualized treatment effects. The model is then learned as β-Intact-VAE--a new type of variational autoencoder (VAE). We derive the TE error bounds that enable representations balanced for treatment groups conditioned on individualized features. The proposed method is compared with recent methods using (semi-)synthetic datasets.

Submitted 11 October, 2021; originally announced October 2021.

Comments: Updated version of the NeurIPS 2021 submission (https://openreview.net/forum?id=Z3yd722b5X5). Largely improve readability and the presentation of experimental results. arXiv admin note: text overlap with arXiv:2109.15062, arXiv:2101.06662

arXiv:2109.15062 [pdf, other]

Towards Principled Causal Effect Estimation by Deep Identifiable Models

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: As an important problem in causal inference, we discuss the estimation of treatment effects (TEs). Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying TEs. Our VAE also naturally gives representations balanced for treatment groups, using its prior. Experiments on (semi-)synthetic datasets show state-of-the-art performance under diverse settings, including unobserved confounding. Based on the identifiability of our model, we prove identification of TEs under unconfoundedness, and also discuss (possible) extensions to harder settings.

Submitted 1 November, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: Fully updated. Largely improve clarity, add identification under unconfoundedness (Sec. 4.2), and more. arXiv admin note: substantial text overlap with arXiv:2101.06662

arXiv:2108.11018 [pdf, other]

A Scaling Law for Synthetic-to-Real Transfer: How Much Is Your Pre-training Effective?

Authors: Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, Kohei Hayashi

Abstract: Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks. The most significant advantage of using synthetic images is that the ground-truth labels are automatically available, enabling unlimited expansion of the data size without human cost. However, synthetic data may have a huge domain gap, in which case increasing the data size does not improve the performance. How can we know that? In this study, we derive a simple scaling law that predicts the performance from the amount of pre-training data. By estimating the parameters of the law, we can judge whether we should increase the data or change the setting of image synthesis. Further, we analyze the theory of transfer learning by considering learning dynamics and confirm that the derived generalization bound is consistent with our empirical findings. We empirically validated our scaling law on various experimental settings of benchmark tasks, model sizes, and complexities of synthetic images.

Submitted 8 October, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

arXiv:2101.06662 [pdf, other]

Intact-VAE: Estimating Treatment Effects under Unobserved Confounding

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: NOTE: This preprint has a flawed theoretical formulation. Please avoid it and refer to the ICLR22 publication https://openreview.net/forum?id=q7n2RngwOM. Also, arXiv:2109.15062 contains some new ideas on unobserved Confounding. As an important problem of causal inference, we discuss the identification and estimation of treatment effects under unobserved confounding. Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying treatment effects. We theoretically show that, under certain settings, treatment effects are identified by our model, and further, based on the identifiability of our model (i.e., determinacy of representation), our VAE is a consistent estimator with representation balanced for treatment groups. Experiments on (semi-)synthetic datasets show state-of-the-art performance under diverse settings.

Submitted 20 April, 2022; v1 submitted 17 January, 2021; originally announced January 2021.

Comments: This preprint has a flawed theoretical formulation. It was intended as a theoretical update of https://openreview.net/forum?id=D3TNqCspFpM

arXiv:2011.02256 [pdf, other]

Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces

Authors: Masaaki Imaizumi, Kenji Fukumizu

Abstract: We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure.

Submitted 8 February, 2022; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Complete version of arXiv:1802.04474

arXiv:2007.02809 [pdf, other]

Meta Learning for Causal Direction

Authors: Jean-Francois Ton, Dino Sejdinovic, Kenji Fukumizu

Abstract: The inaccessibility of controlled randomized trials due to inherent constraints in many fields of science has been a fundamental issue in causal inference. In this paper, we focus on distinguishing the cause from effect in the bivariate setting under limited observational data. Based on recent developments in meta learning as well as in causal inference, we introduce a novel generative model that allows distinguishing cause and effect in the small data setting. Using a learnt task variable that contains distributional information of each dataset, we propose an end-to-end algorithm that makes use of similar training datasets at test time. We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.

Submitted 21 February, 2021; v1 submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.13228 [pdf, other]

A General Class of Transfer Learning Regression without Implementation Cost

Authors: Shunya Minami, Song Liu, Stephen Wu, Kenji Fukumizu, Ryo Yoshida

Abstract: We propose a novel framework that unifies and extends existing methods of transfer learning (TL) for regression. To bridge a pretrained source model to the model on a target task, we introduce a density-ratio reweighting function, which is estimated through the Bayesian framework with a specific prior distribution. By changing two intrinsic hyperparameters and the choice of the density-ratio model, the proposed method can integrate three popular methods of TL: TL based on cross-domain similarity regularization, a probabilistic TL using the density-ratio estimation, and fine-tuning of pretrained neural networks. Moreover, the proposed method can benefit from its simple implementation without any additional cost; the regression model can be fully trained using off-the-shelf libraries for supervised learning in which the original output variable is simply transformed to a new output variable. We demonstrate its simplicity, generality, and applicability using various real data applications.

Submitted 16 December, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

Comments: 31 pages, 6 figures

arXiv:2006.10012 [pdf, other]

Robust Persistence Diagrams using Reproducing Kernels

Authors: Siddharth Vishwanath, Kenji Fukumizu, Satoshi Kuriki, Bharath Sriperumbudur

Abstract: Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish the proposed framework to be less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel. This, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.

Submitted 3 June, 2022; v1 submitted 17 June, 2020; originally announced June 2020.

MSC Class: 55N31; 62R40; 62G07; 46E22

arXiv:2004.01822 [pdf, other]

The equivalence between Stein variational gradient descent and black-box variational inference

Authors: Casey Chu, Kentaro Minami, Kenji Fukumizu

Abstract: We formalize an equivalence between two popular methods for Bayesian inference: Stein variational gradient descent (SVGD) and black-box variational inference (BBVI). In particular, we show that BBVI corresponds precisely to SVGD when the kernel is the neural tangent kernel. Furthermore, we interpret SVGD and BBVI as kernel gradient flows; we do this by leveraging the recent perspective that views SVGD as a gradient flow in the space of probability distributions and showing that BBVI naturally motivates a Riemannian structure on that space. We observe that kernel gradient flow also describes dynamics found in the training of generative adversarial networks (GANs). This work thereby unifies several existing techniques in variational inference and generative modeling and identifies the kernel as a fundamental object governing the behavior of these algorithms, motivating deeper analysis of its properties.

Submitted 3 April, 2020; originally announced April 2020.

Comments: ICLR 2020, Workshop on Integration of Deep Neural Models and Differential Equations

arXiv:2002.04185 [pdf, other]

Smoothness and Stability in GANs

Authors: Casey Chu, Kentaro Minami, Kenji Fukumizu

Abstract: Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator's architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions.

Submitted 10 February, 2020; originally announced February 2020.

Comments: ICLR 2020

arXiv:2001.01894 [pdf]

Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: We address the problem of distinguishing cause from effect in bivariate setting. Based on recent developments in nonlinear independent component analysis (ICA), we train nonparametrically general nonlinear causal models that allow non-additive noise. Further, we build an ensemble framework, namely Causal Mosaic, which models a causal pair by a mixture of nonlinear models. We compare this method with other recent methods on artificial and real world benchmark datasets, and our method shows state-of-the-art performance.

Submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted to AISTATS 2020. Camera-ready version in preparation

Journal ref: An updated version at AISTATS 2020: http://proceedings.mlr.press/v108/wu20b/wu20b.pdf. Main changes: a correction in Theorem 3 and additional explanations in Sec. 4

arXiv:1910.09972 [pdf, other]

doi 10.1007/978-3-030-58520-4_37

Exchangeable deep neural networks for set-to-set matching and learning

Authors: Yuki Saito, Takuma Nakamura, Hirotaka Hachiya, Kenji Fukumizu

Abstract: Matching two different sets of items, called heterogeneous set-to-set matching problem, has recently received attention as a promising problem. The difficulties are to extract features to match a correct pair of different sets and also preserve two types of exchangeability required for set-to-set matching: the pair of sets, as well as the items in each set, should be exchangeable. In this study, we propose a novel deep learning architecture to address the abovementioned difficulties and also an efficient training framework for set-to-set matching. We evaluate the methods through experiments based on two industrial applications: fashion set recommendation and group re-identification. In these experiments, we show that the proposed method provides significant improvements and results compared with the state-of-the-art methods, thereby validating our architecture for the heterogeneous set matching problem.

Submitted 28 January, 2021; v1 submitted 22 October, 2019; originally announced October 2019.

arXiv:1907.00586 [pdf, other]

doi 10.1093/jrsssb/qkad050

A Kernel Stein Test for Comparing Latent Variable Models

Authors: Heishiro Kanagawa, Wittawat Jitkrittum, Lester Mackey, Kenji Fukumizu, Arthur Gretton

Abstract: We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.

Submitted 9 May, 2023; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: This is a pre-copyedited, author-produced version of an article accepted for publication in The Journal of the Royal Statistical Society Series: B following peer review

arXiv:1906.04868 [pdf, other]

Semi-flat minima and saddle points by embedding neural networks to overparameterization

Authors: Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, Mirai Tanaka

Abstract: We theoretically study the landscape of the training error for neural networks in overparameterized cases. We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. Our results show that the networks with smooth and ReLU activation have different partially flat landscapes around the embedded point. We also relate these results to a difference of their generalization abilities in overparameterized realization.

Submitted 14 June, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

Comments: 38 pages, 4 figures

arXiv:1902.00342 [pdf, other]

Tree-Sliced Variants of Wasserstein Distances

Authors: Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi

Abstract: Optimal transport (OT) theory defines a powerful set of tools to compare probability distributions. OT suffers however from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the sliced formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. We consider in this work a more general family of ground metrics, namely tree metrics, which also yield fast closed-form computations and negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). We propose the tree-sliced Wasserstein distance, computed by averaging the Wasserstein distance between these measures using random tree metrics, built adaptively in either low or high-dimensional spaces. Exploiting the negative definiteness of that distance, we also propose a positive definite kernel, and test it against other baselines on a few benchmark tasks.

Submitted 28 October, 2019; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: Camera-ready for NeurIPS 2019
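
The sliced formulation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' tree-sliced method; it only shows the plain sliced-Wasserstein idea (project onto random lines, then use the sorted-sample closed form in one dimension). It assumes equal-size samples with uniform weights, and all function names and defaults are illustrative.

```python
import math
import random


def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size empirical distributions on the real line.

    With uniform weights and equal sizes, the optimal coupling matches
    sorted samples, which gives the closed form used here.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return cost ** (1.0 / p)


def random_direction(dim, rng):
    """Draw a unit vector by normalizing a standard Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, n_projections=200, p=1, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein distance.

    X and Y are lists of equal-length coordinate tuples. The number of
    projections and the seed are illustrative choices, not tuned values.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        proj_x = [sum(t * c for t, c in zip(theta, x)) for x in X]
        proj_y = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d(proj_x, proj_y, p) ** p
    return (total / n_projections) ** (1.0 / p)
```

Each projection costs only a sort, which is why the sliced variant is cheap compared with solving a full transport problem.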

arXiv:1809.00800 [pdf, other]

doi 10.18653/v1/D18-1203

Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions

Authors: Sho Yokoi, Sosuke Kobayashi, Kenji Fukumizu, Jun Suzuki, Kentaro Inui

Abstract: In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert--Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than an RNN-based PMI while outperforming PMI in accuracy. In addition, we also demonstrate that PHSIC is beneficial as a criterion of a data selection task for machine translation owing to its ability to give high (low) scores to a consistent (inconsistent) pair with other pairs.

Submitted 4 September, 2018; originally announced September 2018.

Comments: Accepted by EMNLP 2018

Journal ref: EMNLP 2018

arXiv:1805.08463 [pdf, other]

Variational Learning on Aggregate Outputs with Gaussian Processes

Authors: Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim CD Lucas, Seth Flaxman, Katherine Battle, Kenji Fukumizu

Abstract: While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.

Submitted 22 May, 2018; originally announced May 2018.

arXiv:1802.05411 [pdf, ps, other]

Selecting the Best in GANs Family: a Post Selection Inference Framework

Authors: Yao-Hung Hubert Tsai, Makoto Yamada, Denny Wu, Ruslan Salakhutdinov, Ichiro Takeuchi, Kenji Fukumizu

Abstract: "Which Generative Adversarial Networks (GANs) generates the most plausible images?" has been a frequently asked question among researchers. To address this problem, we first propose an incomplete U-statistics estimate of maximum mean discrepancy $\mathrm{MMD}_{inc}$ to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{inc}$ enjoys the advantages of asymptotic normality, computation efficiency, and model agnosticity. We then propose a GANs analysis framework to select and test the "best" member in GANs family using the Post Selection Inference (PSI) with $\mathrm{MMD}_{inc}$. In the experiments, we adopt the proposed framework on 7 GANs variants and compare their $\mathrm{MMD}_{inc}$ scores.

Submitted 23 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

arXiv:1705.07673 [pdf, other]

A Linear-Time Kernel Goodness-of-Fit Test

Authors: Wittawat Jitkrittum, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, Arthur Gretton

Abstract: We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.

Submitted 24 October, 2017; v1 submitted 22 May, 2017; originally announced May 2017.

Comments: Accepted to NIPS 2017

MSC Class: 46E22; 62G10

ACM Class: G.3; I.2.6

arXiv:1605.09522 [pdf, ps, other]

doi 10.1561/2200000060

Kernel Mean Embedding of Distributions: A Review and Beyond

Authors: Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf

Abstract: A Hilbert space embedding of a distribution---in short, a kernel mean embedding---has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original "feature map" common to support vector machines (SVMs) and other kernel methods. While initially closely associated with the latter, it has meanwhile found application in fields ranging from kernel machines and probabilistic modeling to statistical inference, causal discovery, and deep learning. The goal of this survey is to give a comprehensive review of existing work and recent advances in this research area, and to discuss the most challenging issues and open problems that could lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels which forms the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures which prompts a wide range of applications such as kernel two-sample testing, independent testing, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform sum, product, and Bayes' rules---which are ubiquitous in graphical model, probabilistic inference, and reinforcement learning---in a non-parametric way. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.

Submitted 13 December, 2020; v1 submitted 31 May, 2016; originally announced May 2016.

Comments: 147 pages; this is the final version

Journal ref: Foundations and Trends in Machine Learning: Vol. 10: No. 1-2, pp 1-141 (2017)

arXiv:1510.09155 [pdf, ps, other]

A characterization of minimum spanning tree-like metric spaces

Authors: Momoko Hayamizu, Hiroshi Endo, Kenji Fukumizu

Abstract: Recent years have witnessed a surge of biological interest in the minimum spanning tree (MST) problem for its relevance to automatic model construction using the distances between data points. Despite the increasing use of MST algorithms for this purpose, the goodness-of-fit of an MST to the data is often elusive because no quantitative criteria have been developed to measure it. Motivated by this, we provide a necessary and sufficient condition to ensure that a metric space on n points can be represented by a fully labeled tree on n vertices, and thereby determine when an MST preserves all pairwise distances between points in a finite metric space.

Submitted 30 October, 2015; originally announced October 2015.

Comments: 9 pages, 2 figures

MSC Class: Primary 05C12; Secondary 05C05

arXiv:1506.02784 [pdf, other]

Estimating Posterior Ratio for Classification: Transfer Learning from Probabilistic Perspective

Authors: Song Liu, Kenji Fukumizu

Abstract: Transfer learning assumes classifiers of similar tasks share certain parameter structures. Unfortunately, modern classifiers use sophisticated feature representations with huge parameter spaces, which lead to costly transfer. Under the impression that changes from one classifier to another should be "simple", an efficient transfer learning criterion that only learns the "differences" is proposed in this paper. We train a posterior ratio, which turns out to minimize the upper bound of the target learning risk. The model of the posterior ratio does not have to share the same parameter space with the source classifier at all, so it can be easily modelled and efficiently trained. The resulting classifier is therefore obtained by simply multiplying the existing probabilistic classifier with the learned posterior ratio.

Submitted 19 October, 2015; v1 submitted 9 June, 2015; originally announced June 2015.

Comments: Revision comments: A few mistakes in the proofs were corrected. The title and the introduction were changed. We have also re-run a few experiments

arXiv:1501.06794 [pdf, other]

Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations

Authors: Bernhard Schölkopf, Krikamol Muandet, Kenji Fukumizu, Jonas Peters

Abstract: We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as kernel probabilistic programming. We illustrate it on synthetic data, and show how it can be used for nonparametric structural equation models, with an application to causal inference.

Submitted 27 January, 2015; originally announced January 2015.

ACM Class: G.3; I.2.6; D.3.3

Journal ref: Statistics and Computing 25:755-766 (2015)

arXiv:1405.5505 [pdf, ps, other]

Kernel Mean Shrinkage Estimators

Authors: Krikamol Muandet, Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf

Abstract: A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a "large d, small n" paradigm.

Submitted 25 February, 2016; v1 submitted 21 May, 2014; originally announced May 2014.

Comments: 41 pages
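
The Stein-type improvement mentioned in the abstract above can be illustrated with a short calculation. This is only a sketch of the simplest case (shrinking the empirical kernel mean toward zero by a fixed factor), not necessarily the estimators proposed in the paper. The notation is an assumption made here: $\mu$ is the true kernel mean, $\hat{\mu}$ the empirical average (unbiased for $\mu$), $\Delta = \mathbb{E}\|\hat{\mu}-\mu\|^2$ its risk in the RKHS norm, and $\alpha$ a fixed shrinkage weight.

```latex
\begin{align*}
\mathbb{E}\bigl\|(1-\alpha)\hat{\mu}-\mu\bigr\|^2
  &= \mathbb{E}\bigl\|(1-\alpha)(\hat{\mu}-\mu)-\alpha\mu\bigr\|^2 \\
  &= (1-\alpha)^2\,\Delta + \alpha^2\,\|\mu\|^2
     \qquad\text{(the cross term vanishes since } \mathbb{E}[\hat{\mu}]=\mu\text{)} \\
  &< \Delta
  \quad\Longleftrightarrow\quad
  0 < \alpha < \frac{2\Delta}{\Delta+\|\mu\|^2}.
\end{align*}
```

Under these assumptions the risk is minimized at $\alpha^{*} = \Delta/(\Delta+\|\mu\|^2)$, which is strictly positive whenever $\Delta > 0$. In practice $\Delta$ and $\|\mu\|^2$ are unknown, so $\alpha$ has to be estimated from data.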

arXiv:1306.0842 [pdf, ps, other]

Kernel Mean Estimation and Stein's Effect

Authors: Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, Bernhard Schölkopf

Abstract: A mean function in reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known phenomenon in statistics called Stein's phenomenon. After consideration, our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several benchmark applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

Submitted 6 June, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

Comments: first draft

arXiv:1210.4887 [pdf]

Hilbert Space Embeddings of POMDPs

Authors: Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, Kenji Fukumizu

Abstract: A nonparametric approach for policy learning for POMDPs is proposed. The approach represents distributions over the states, observations, and actions as embeddings in feature spaces, which are reproducing kernel Hilbert spaces. Distributions over states given the observations are obtained by applying the kernel Bayes' rule to these distribution embeddings. Policies and value functions are defined on the feature space over states, which leads to a feature space expression for the Bellman equation. Value iteration may then be used to estimate the optimal value function and associated policy. Experimental results confirm that the correct policy is learned using the feature space representation.

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-644-653

arXiv:1207.6076 [pdf, ps, other]

doi 10.1214/13-AOS1140

Equivalence of distance-based and RKHS-based statistics in hypothesis testing

Authors: Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu

Abstract: We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

Submitted 12 November, 2013; v1 submitted 25 July, 2012; originally announced July 2012.

Comments: Published in at http://dx.doi.org/10.1214/13-AOS1140 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1140

Journal ref: Annals of Statistics 2013, Vol. 41, No. 5, 2263-2291
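
The correspondence between energy distance and MMD described in the abstract above can be checked numerically. The following is a minimal sketch, assuming the Euclidean distance as the semimetric and a distance kernel of the form k(a, b) = 0.5 * (rho(a, z0) + rho(b, z0) - rho(a, b)) centred at an arbitrary point z0. The function names, the sample sizes, and the use of plug-in (V-statistic) estimates are choices made here for illustration, not taken from the paper. With these definitions the plug-in energy distance equals twice the plug-in squared MMD, whatever centre z0 is used.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def energy_distance(x, y):
    """Plug-in energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())

def distance_kernel(a, b, z0):
    """k(a, b) = 0.5 * (rho(a, z0) + rho(b, z0) - rho(a, b)), rho Euclidean."""
    da = pairwise_dist(a, z0[None, :])  # shape (n, 1)
    db = pairwise_dist(b, z0[None, :])  # shape (m, 1)
    return 0.5 * (da + db.T - pairwise_dist(a, b))

def mmd_squared(x, y, z0):
    """Plug-in squared MMD: E k(X,X') + E k(Y,Y') - 2 E k(X,Y)."""
    return (distance_kernel(x, x, z0).mean()
            + distance_kernel(y, y, z0).mean()
            - 2.0 * distance_kernel(x, y, z0).mean())

# Two synthetic samples with different means (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = rng.normal(loc=0.5, size=(150, 3))

energy = energy_distance(x, y)
mmd2_origin = mmd_squared(x, y, np.zeros(3))
mmd2_shifted = mmd_squared(x, y, np.ones(3))  # a different centre z0
print(energy, 2.0 * mmd2_origin, 2.0 * mmd2_shifted)
```

The three printed values should agree up to floating-point error, because the terms involving z0 cancel in the MMD expression.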

arXiv:1205.0411 [pdf, ps, other]

Hypothesis testing using pairwise distances and associated kernels (with Appendix)

Authors: Dino Sejdinovic, Arthur Gretton, Bharath Sriperumbudur, Kenji Fukumizu

Abstract: We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are computed with semimetrics of negative type, in which case a kernel may be defined such that the RKHS distance between distributions corresponds exactly to the energy distance. We determine the class of probability distributions for which kernels induced by semimetrics are characteristic (that is, for which embeddings of the distributions to an RKHS are injective). Finally, we investigate the performance of this family of kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

Submitted 21 May, 2012; v1 submitted 2 May, 2012; originally announced May 2012.

Comments: Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012

arXiv:1202.6504 [pdf, ps, other]

Learning from Distributions via Support Measure Machines

Authors: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in a straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.

Submitted 12 January, 2013; v1 submitted 29 February, 2012; originally announced February 2012.

Comments: Advances in Neural Information Processing Systems 25

arXiv:1109.0455 [pdf, ps, other]

Gradient-based kernel dimension reduction for supervised learning

Authors: Kenji Fukumizu, Chenlei Leng

Abstract: This paper proposes a novel kernel approach to linear dimension reduction for supervised learning. The purpose of the dimension reduction is to find directions in the input space to explain the output as effectively as possible. The proposed method uses an estimator for the gradient of the regression function, based on the covariance operators on reproducing kernel Hilbert spaces. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the distributions or the type of variables, and uses computationally simple eigendecomposition. Experimental results show that the proposed method successfully finds the effective directions with efficient computation.

Submitted 2 September, 2011; originally announced September 2011.

Comments: 21 pages

arXiv:1103.0605 [pdf, ps, other]

Loopy Belief Propagation, Bethe Free Energy and Graph Zeta Function

Authors: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the theoretical analysis of Loopy Belief Propagation (LBP) and the Bethe free energy (BFE) by establishing a formula to connect LBP and BFE with a graph zeta function. The proposed approach is applicable to a wide class of models including multinomial and Gaussian types. The connection derives a number of new theoretical results on LBP and BFE. This paper focuses on two such topics. One is the analysis of the region where the Hessian of the Bethe free energy is positive definite, which derives the non-convexity of BFE for graphs with multiple cycles, and a condition of convexity on a restricted set. This analysis also gives a new condition for the uniqueness of the LBP fixed point. The other result is to clarify the relation between the local stability of a fixed point of LBP and local minima of the BFE, which implies, for example, that a locally stable fixed point of the Gaussian LBP is a local minimum of the Gaussian Bethe free energy.

Submitted 2 March, 2011; originally announced March 2011.

arXiv:1002.3307 [pdf, ps, other]

Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Authors: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. The formula has a number of theoretical implications on LBP. It is applied to give a sufficient condition that the Hessian of the Bethe free energy is positive definite, which shows non-convexity for graphs with multiple cycles. The formula clarifies the relation between the local stability of a fixed point of LBP and local minima of the Bethe free energy. We also propose a new approach to the uniqueness of LBP fixed point, and show various conditions of uniqueness.

Submitted 17 February, 2010; originally announced February 2010.

Comments: 19 pages, Annual Conference on Neural Information Processing Systems (NIPS 2009), together with the supplementary material

Journal ref: Advances in Neural Information Processing Systems 22, pages 2017-2025

arXiv:0908.3850 [pdf, ps, other]

New graph polynomials from the Bethe approximation of the Ising partition function

Authors: Yusuke Watanabe, Kenji Fukumizu

Abstract: We introduce two graph polynomials and discuss their properties. One is a polynomial of two variables whose investigation is motivated by the performance analysis of the Bethe approximation of the Ising partition function. The other is a polynomial of one variable that is obtained by the specialization of the first one. It is shown that these polynomials satisfy deletion-contraction relations and are new examples of the V-function, which was introduced by Tutte (1947, Proc. Cambridge Philos. Soc. 43, 26-40). For these polynomials, we discuss the interpretations of special values and then obtain the bound on the number of sub-coregraphs, i.e., spanning subgraphs with no vertices of degree one. It is proved that the polynomial of one variable is equal to the monomer-dimer partition function with weights parameterized by that variable. The properties of the coefficients and the possible region of zeros are also discussed for this polynomial.

Submitted 3 June, 2010; v1 submitted 26 August, 2009; originally announced August 2009.

Comments: To appear in Combinatorics, Probability & Computing. Revised from the first version, 28 pages

Showing 1–50 of 53 results for author: Fukumizu, K