Search | arXiv e-print repository

Investigating Data Usage for Inductive Conformal Predictors

Abstract: Inductive conformal predictors (ICPs) are algorithms that are able to generate prediction sets, instead of point predictions, which are valid at a user-defined confidence level, only assuming exchangeability. These algorithms are useful for reliable machine learning and are increasing in popularity. The ICP development process involves dividing development data into three parts: training, calibrat… ▽ More Inductive conformal predictors (ICPs) are algorithms that are able to generate prediction sets, instead of point predictions, which are valid at a user-defined confidence level, only assuming exchangeability. These algorithms are useful for reliable machine learning and are increasing in popularity. The ICP development process involves dividing development data into three parts: training, calibration and test. With access to limited or expensive development data, it is an open question regarding the most efficient way to divide the data. This study provides several experiments to explore this question and consider the case for allowing overlap of examples between training and calibration sets. Conclusions are drawn that will be of value to academics and practitioners planning to use ICPs. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2403.18994 [pdf, other]

Causal-StoNet: Causal Inference for High-Dimensional Complex Data

Authors: Yaxin Fang, Faming Liang

Abstract: With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medi… ▽ More With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medicine, econometrics, and social science. However, the existing methods for causal inference are frequently developed under the assumption that the data dimension is low or that the underlying data generation process is linear or approximately linear. To address these challenges, this paper proposes a novel causal inference approach for dealing with high-dimensional complex data. The proposed approach is based on deep learning techniques, including sparse deep learning theory and stochastic neural networks, that have been developed in recent literature. By using these techniques, the proposed approach can address both the high dimensionality and unknown data generation process in a coherent way. Furthermore, the proposed approach can also be used when missing values are present in the datasets. Extensive numerical studies indicate that the proposed approach outperforms existing ones. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2402.03933 [pdf]

Development of a Evaluation Tool for Age-Appropriate Software in Aging Environments: A Delphi Study

Authors: Zhenggang Bai, Yougxiang Fang, Hongtu Chen, Xinru Chen, Ning An, Min Zhang, Guoxin Rui, Jing Jin

Abstract: Objective: We aimed to develop a dependable reliable tool for assessing software ageappropriateness. Methods: We conducted a systematic review to get the indicators of technology ageappropriateness from studies from January 2000 to April 2023.This study engaged 25 experts from the fields of anthropology, sociology,and social technology research across, three rounds of Delphi consultations were con… ▽ More Objective: We aimed to develop a dependable reliable tool for assessing software ageappropriateness. Methods: We conducted a systematic review to get the indicators of technology ageappropriateness from studies from January 2000 to April 2023.This study engaged 25 experts from the fields of anthropology, sociology,and social technology research across, three rounds of Delphi consultations were conducted. Experts were asked to screen, assess, add and provide feedback on the preliminary indicators identified in the initial indicator pool. Result: We found 76 criterias for evaluating quality criteria was extracted, grouped into 11 distinct domains. After completing three rounds of Delphi consultations,experts drew upon their personal experiences,theoretical frameworks,and industry insights to arrive at a three-dimensional structure for the evaluation tooluser experience,product quality,and social promotion.These metrics were further distilled into a 16-item scale, and a corresponding questionnaire was formulated.The developed tool exhibited strong internal reliability(Cronbach's Alpha is 0.867)and content validity(S-CVI is 0.93). Conclusion: This tool represents a straightforward,objective,and reliable mechanism for evaluating software's appropriateness across age groups. Moreover,it offers valuable insights and practical guidance for designing and developing of high-quality age-appropriate software,and assisst age groups to select software they like. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2307.07442 [pdf]

Sensitivity Analysis for Unmeasured Confounding in Medical Product Development and Evaluation Using Real World Evidence

Authors: Peng Ding, Yixin Fang, Doug Faries, Susan Gruber, Hana Lee, Joo-Yeon Lee, Pallavi Mishra-Kalyani, Mingyang Shan, Mark van der Laan, Shu Yang, Xiang Zhang

Abstract: The American Statistical Association Biopharmaceutical Section (ASA BIOP) working group on real-world evidence (RWE) has been making continuous, extended effort towards a goal of supporting and advancing regulatory science with respect to non-interventional, clinical studies intended to use real-world data for evidence generation for the purpose of medical product development and evaluation (i.e.,… ▽ More The American Statistical Association Biopharmaceutical Section (ASA BIOP) working group on real-world evidence (RWE) has been making continuous, extended effort towards a goal of supporting and advancing regulatory science with respect to non-interventional, clinical studies intended to use real-world data for evidence generation for the purpose of medical product development and evaluation (i.e., RWE studies). In 2023, the working group published a manuscript delineating challenges and opportunities in constructing estimands for RWE studies following a framework in ICH E9(R1) guidance on estimand and sensitivity analysis. As a follow-up task, we describe the other issue in RWE studies, sensitivity analysis. Focusing on the issue of unmeasured confounding, we review availability and applicability of sensitivity analysis methods for different types unmeasured confounding. We discuss consideration on the choice and use of sensitivity analysis for RWE studies. Updated version of this article will present how findings from sensitivity analysis could support regulatory decision-making using a real example. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: 17 pages, 2 figures

arXiv:2303.14211 [pdf, other]

Tackling the infinite likelihood problem when fitting mixtures of shifted asymmetric Laplace distributions

Authors: Yuan Fang, Brian C. Franczak, Sanjeena Subedi

Abstract: Mixtures of shifted asymmetric Laplace distributions were introduced as a tool for model-based clustering that allowed for the direct parameterization of skewness in addition to location and scale. Following common practices, an expectation-maximization algorithm was developed to fit these mixtures. However, adaptations to account for the `infinite likelihood problem' led to fits that gave good cl… ▽ More Mixtures of shifted asymmetric Laplace distributions were introduced as a tool for model-based clustering that allowed for the direct parameterization of skewness in addition to location and scale. Following common practices, an expectation-maximization algorithm was developed to fit these mixtures. However, adaptations to account for the `infinite likelihood problem' led to fits that gave good classification performance at the expense of parameter recovery. In this paper, we propose a more valuable solution to this problem by developing a novel Bayesian parameter estimation scheme for mixtures of shifted asymmetric Laplace distributions. Through simulation studies, we show that the proposed parameter estimation scheme gives better parameter estimates compared to the expectation-maximization based scheme. In addition, we also show that the classification performance is as good, and in some cases better, than the expectation-maximization based scheme. The performance of both schemes are also assessed using well-known real data sets. △ Less

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2302.08606 [pdf, other]

Intrinsic and extrinsic deep learning on manifolds

Authors: Yihao Fang, Ilsang Ohn, Vijay Gupta, Lizhen Lin

Abstract: We propose extrinsic and intrinsic deep neural network architectures as general frameworks for deep learning on manifolds. Specifically, extrinsic deep neural networks (eDNNs) preserve geometric features on manifolds by utilizing an equivariant embedding from the manifold to its image in the Euclidean space. Moreover, intrinsic deep neural networks (iDNNs) incorporate the underlying intrinsic geom… ▽ More We propose extrinsic and intrinsic deep neural network architectures as general frameworks for deep learning on manifolds. Specifically, extrinsic deep neural networks (eDNNs) preserve geometric features on manifolds by utilizing an equivariant embedding from the manifold to its image in the Euclidean space. Moreover, intrinsic deep neural networks (iDNNs) incorporate the underlying intrinsic geometry of manifolds via exponential and log maps with respect to a Riemannian structure. Consequently, we prove that the empirical risk of the empirical risk minimizers (ERM) of eDNNs and iDNNs converge in optimal rates. Overall, The eDNNs framework is simple and easy to compute, while the iDNNs framework is accurate and fast converging. To demonstrate the utilities of our framework, various simulation studies, and real data analyses are presented with eDNNs and iDNNs. △ Less

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2211.15943 [pdf, other]

Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems

Authors: Yuchen Fang, Sen Na, Michael W. Mahoney, Mladen Kolar

Abstract: We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints. We consider a fully stochastic setting, where at each step a single sample is generated to estimate the objective gradient. The algorithm adaptively selects the trust-region radius and, compared to th… ▽ More We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints. We consider a fully stochastic setting, where at each step a single sample is generated to estimate the objective gradient. The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices (i.e., Hessians without modification) in SQP subproblems. As a trust-region method for constrained optimization, our algorithm must address an infeasibility issue -- the linearized equality constraints and trust-region constraints may lead to infeasible SQP subproblems. In this regard, we propose an adaptive relaxation technique to compute the trial step, consisting of a normal step and a tangential step. To control the lengths of these two steps while ensuring a scale-invariant property, we adaptively decompose the trust-region radius into two segments, based on the proportions of the rescaled feasibility and optimality residuals to the rescaled full KKT residual. The normal step has a closed form, while the tangential step is obtained by solving a trust-region subproblem, to which a solution ensuring the Cauchy reduction is sufficient for our study. We establish a global almost sure convergence guarantee for TR-StoSQP, and illustrate its empirical performance on both a subset of problems in the CUTEst test set and constrained logistic regression problems using data from the LIBSVM collection. △ Less

Submitted 28 January, 2024; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 10 figures, 33 pages

arXiv:2204.11150 [pdf, other]

doi 10.1162/neco_a_01505

Learning and Inference in Sparse Coding Models with Langevin Dynamics

Authors: Michael Y. -S. Fang, Mayur Mudigonda, Ryan Zarcone, Amir Khosrowshahi, Bruno A. Olshausen

Abstract: We describe a stochastic, dynamical system capable of inference and learning in a probabilistic latent variable model. The most challenging problem in such models - sampling the posterior distribution over latent variables - is proposed to be solved by harnessing natural sources of stochasticity inherent in electronic and neural systems. We demonstrate this idea for a sparse coding model by derivi… ▽ More We describe a stochastic, dynamical system capable of inference and learning in a probabilistic latent variable model. The most challenging problem in such models - sampling the posterior distribution over latent variables - is proposed to be solved by harnessing natural sources of stochasticity inherent in electronic and neural systems. We demonstrate this idea for a sparse coding model by deriving a continuous-time equation for inferring its latent variables via Langevin dynamics. The model parameters are learned by simultaneously evolving according to another continuous-time equation, thus bypassing the need for digital accumulators or a global clock. Moreover we show that Langevin dynamics lead to an efficient procedure for sampling from the posterior distribution in the 'L0 sparse' regime, where latent variables are encouraged to be set to zero as opposed to having a small L1 norm. This allows the model to properly incorporate the notion of sparsity rather than having to resort to a relaxed version of sparsity to make optimization tractable. Simulations of the proposed dynamical system on both synthetic and natural image datasets demonstrate that the model is capable of probabilistically correct inference, enabling learning of the dictionary as well as parameters of the prior. △ Less

Submitted 23 April, 2022; originally announced April 2022.

arXiv:2203.11748 [pdf, other]

On p-value combination of independent and frequent signals: asymptotic efficiency and Fisher ensemble

Authors: Yusi Fang, Chung Chang, George Tseng

Abstract: Combining p-values to integrate multiple effects is of long-standing interest in social science and biomedical research. In this paper, we focus on revisiting a classical scenario closely related to meta-analysis, which combines a relatively small (finite and fixed) number of p-values while the sample size for generating each p-value is large (asymptotically goes to infinity). We evaluate a list o… ▽ More Combining p-values to integrate multiple effects is of long-standing interest in social science and biomedical research. In this paper, we focus on revisiting a classical scenario closely related to meta-analysis, which combines a relatively small (finite and fixed) number of p-values while the sample size for generating each p-value is large (asymptotically goes to infinity). We evaluate a list of traditional and recently developed modified Fisher's methods to investigate their asymptotic efficiencies and finite-sample numerical performance. The result concludes Fisher and adaptively weighted Fisher method to have top performance and complementary advantages across different proportions of true signals. Finally, we propose an ensemble method, namely Fisher ensemble, to combine the two top-performing Fisher-related methods using a robust truncated Cauchy ensemble approach. We show that Fisher ensemble achieves asymptotic Bahadur optimality and integrates the strengths of Fisher and adaptively weighted Fisher methods in simulations. We subsequently extend Fisher ensemble to a variant with emphasized power for concordant effect size directions. A transcriptomic meta-analysis application confirms the theoretical and simulation conclusions, generates intriguing biomarker and pathway findings and demonstrates strengths and strategy of using proposed Fisher ensemble methods. △ Less

Submitted 14 April, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: 25 pages, 1 table, 5figures

arXiv:2112.05818 [pdf, other]

Association study between gene expression and multiple phenotypes in omics applications of complex diseases

Authors: Yujia Li, Yusi Fang, Peng Liu, George C. Tseng

Abstract: Studying phenotype-gene association can uncover mechanism of diseases and develop efficient treatments. In complex disease where multiple phenotypes are available and correlated, analyzing and interpreting associated genes for each phenotype respectively may decrease statistical power and lose intepretation due to not considering the correlation between phenotypes. The typical approaches are many… ▽ More Studying phenotype-gene association can uncover mechanism of diseases and develop efficient treatments. In complex disease where multiple phenotypes are available and correlated, analyzing and interpreting associated genes for each phenotype respectively may decrease statistical power and lose intepretation due to not considering the correlation between phenotypes. The typical approaches are many global testing methods, such as multivariate analysis of variance (MANOVA), which tests the overall association between phenotypes and each gene, without considersing the heterogeneity among phenotypes. In this paper, we extend and evaluate two p-value combination methods, adaptive weighted Fisher's method (AFp) and adaptive Fisher's method (AFz), to tackle this problem, where AFp stands out as our final proposed method, based on extensive simulations and a real application. Our proposed AFp method has three advantages over traditional global testing methods. Firstly, it can consider the heterogeneity of phenotypes and determines which specific phenotypes a gene is associated with, using phenotype specific 0-1 weights. Secondly, AFp takes the p-values from the test of association of each phenotype as input, thus can accommodate different types of phenotypes (continuous, binary and count). Thirdly, we also apply bootstrapping to construct a variability index for the weight estimator of AFp and generate a co-membership matrix to categorize (cluster) genes based on their association-patterns for intuitive biological investigations. Through extensive simulations, AFp shows superior performance over global testing methods in terms of type I error control and statistical power, as well as higher accuracy of 0-1 weights estimation over AFz. A real omics application with transcriptomic and clinical data of complex lung diseases demonstrates insightful biological findings of AFp. △ Less

Submitted 10 December, 2021; originally announced December 2021.

arXiv:2109.12962 [pdf, other]

pyStoNED: A Python Package for Convex Regression and Frontier Estimation

Authors: Sheng Dai, Yu-Hsueh Fang, Chia-Yen Lee, Timo Kuosmanen

Abstract: Shape-constrained nonparametric regression is a growing area in econometrics, statistics, operations research, machine learning and related fields. In the field of productivity and efficiency analysis, recent developments in the multivariate convex regression and related techniques such as convex quantile regression and convex expectile regression have bridged the long-standing gap between the con… ▽ More Shape-constrained nonparametric regression is a growing area in econometrics, statistics, operations research, machine learning and related fields. In the field of productivity and efficiency analysis, recent developments in the multivariate convex regression and related techniques such as convex quantile regression and convex expectile regression have bridged the long-standing gap between the conventional deterministic-nonparametric and stochastic-parametric methods. Unfortunately, the heavy computational burden and the lack of powerful, reliable, and fully open access computational package has slowed down the diffusion of these advanced estimation techniques to the empirical practice. The purpose of the Python package pyStoNED is to address this challenge by providing a freely available and user-friendly tool for the multivariate convex regression, convex quantile regression, convex expectile regression, isotonic regression, stochastic nonparametric envelopment of data, and related methods. This paper presents a tutorial of the pyStoNED package and illustrates its application, focusing on the estimation of frontier cost and production functions. △ Less

Submitted 27 September, 2021; originally announced September 2021.

arXiv:2109.00539 [pdf, other]

Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence

Authors: Wennan Chang, Pengtao Dang, Changlin Wan, Xiaoyu Lu, Yue Fang, Tong Zhao, Yong Zang, Bo Li, Chi Zhang, Sha Cao

Abstract: In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression mode… ▽ More In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints, to simultaneously handle the spatial nonstationarity, local homogeneity, and outlier contaminations. Compared with existing spatial regression models, our proposed model assumes the existence a few distinct regression models that are estimated based on observations that exhibit similar response-predictor relationships. As such, the proposed model not only accounts for nonstationarity in the spatial trend, but also clusters observations into a few distinct and homogenous groups. This provides an advantage on interpretation with a few stationary sub-processes identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contaminations from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. Rigorous statistical hypothesis testing procedure has been designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression and spatial segmentation methods. △ Less

Submitted 28 September, 2021; v1 submitted 1 September, 2021; originally announced September 2021.

Comments: Accepted by ICDM IEEE 2021

arXiv:2106.13097 [pdf, other]

Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View

Authors: Shuang Li, Lu Wang, Xinyun Chen, Yixiang Fang, Yan Song

Abstract: Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and have exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex spacetime intertwined pro… ▽ More Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and have exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex spacetime intertwined propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of the COVID-19 as spatio-temporal point processes and propose a generative and intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with the traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates the model-misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes the model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model. △ Less

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2103.12967 [pdf, other]

Heavy-tailed distribution for combining dependent $p$-values with asymptotic robustness

Authors: Yusi Fang, George C. Tseng, Chung Chang

Abstract: The issue of combining individual $p$-values to aggregate multiple small effects is prevalent in many scientific investigations and is a long-standing statistical topic. Many classical methods are designed for combining independent and frequent signals in a traditional meta-analysis sense using the sum of transformed $p$-values with the transformation of light-tailed distributions, in which Fisher… ▽ More The issue of combining individual $p$-values to aggregate multiple small effects is prevalent in many scientific investigations and is a long-standing statistical topic. Many classical methods are designed for combining independent and frequent signals in a traditional meta-analysis sense using the sum of transformed $p$-values with the transformation of light-tailed distributions, in which Fisher's method and Stouffer's method are the most well-known. Since the early 2000, advances in big data promoted methods to aggregate independent, sparse and weak signals, such as the renowned higher criticism and Berk-Jones tests. Recently, Liu and Xie(2020) and Wilson(2019) independently proposed Cauchy and harmonic mean combination tests to robustly combine $p$-values under "arbitrary" dependency structure, where a notable application is to combine $p$-values from a set of often correlated SNPs in genome-wide association studies. The proposed tests are the transformation of heavy-tailed distributions for improved power with the sparse signal. It calls for a natural question to investigate heavy-tailed distribution transformation, to understand the connection among existing methods, and to explore the conditions for a method to possess robustness to dependency. In this paper, we investigate the regularly varying distribution, which is a rich family of heavy-tailed distribution and includes Pareto distribution as a special case. We show that only an equivalent class of Cauchy and harmonic mean tests have sufficient robustness to dependency in a practical sense. We also show an issue caused by large negative penalty in the Cauchy method and propose a simple, yet practical modification. Finally, we present simulations and apply to a neuroticism GWAS application to verify the discovered theoretical insights and provide practical guidance. △ Less

Submitted 7 September, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: 34 pages, 3 figures

arXiv:2011.06682 [pdf, other]

Clustering microbiome data using mixtures of logistic normal multinomial models

Authors: Yuan Fang, Sanjeena Subedi

Abstract: Discrete data such as counts of microbiome taxa resulting from next-generation sequencing are routinely encountered in bioinformatics. Taxa count data in microbiome studies are typically high-dimensional, over-dispersed, and can only reveal relative abundance therefore being treated as compositional. Analyzing compositional data presents many challenges because they are restricted on a simplex. In… ▽ More Discrete data such as counts of microbiome taxa resulting from next-generation sequencing are routinely encountered in bioinformatics. Taxa count data in microbiome studies are typically high-dimensional, over-dispersed, and can only reveal relative abundance therefore being treated as compositional. Analyzing compositional data presents many challenges because they are restricted on a simplex. In a logistic normal multinomial model, the relative abundance is mapped from a simplex to a latent variable that exists on the real Euclidean space using the additive log-ratio transformation. While a logistic normal multinomial approach brings in flexibility for modeling the data, it comes with a heavy computational cost as the parameter estimation typically relies on Bayesian techniques. In this paper, we develop a novel mixture of logistic normal multinomial models for clustering microbiome data. Additionally, we utilize an efficient framework for parameter estimation using variational Gaussian approximations (VGA). Adopting a variational Gaussian approximation for the posterior of the latent variable reduces the computational overhead substantially. The proposed method is illustrated on simulated and real datasets. △ Less

Submitted 21 June, 2022; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: 53 pages, 6 Figures

MSC Class: 62H30

arXiv:2009.05102 [pdf, other]

doi 10.1109/TIP.2020.3037470

Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection

Authors: Xuehao Wang, Shuai Li, Chenglizhao Chen, Yuming Fang, Aimin Hao, Hong Qin

Abstract: Existing RGB-D salient object detection methods treat depth information as an independent component to complement its RGB part, and widely follow the bi-stream parallel network architecture. To selectively fuse the CNNs features extracted from both RGB and depth as a final result, the state-of-the-art (SOTA) bi-stream networks usually consist of two independent subbranches; i.e., one subbranch is… ▽ More Existing RGB-D salient object detection methods treat depth information as an independent component to complement its RGB part, and widely follow the bi-stream parallel network architecture. To selectively fuse the CNNs features extracted from both RGB and depth as a final result, the state-of-the-art (SOTA) bi-stream networks usually consist of two independent subbranches; i.e., one subbranch is used for RGB saliency and the other aims for depth saliency. However, its depth saliency is persistently inferior to the RGB saliency because the RGB component is intrinsically more informative than the depth component. The bi-stream architecture easily biases its subsequent fusion procedure to the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction, where we cyclically convert the original 4-dimensional RGB-D into \textbf{D}GB, R\textbf{D}B and RG\textbf{D}. Then, a newly lightweight designed triple-stream network is applied over these novel formulated data to achieve an optimal channel-wise complementary fusion status between the RGB and D, achieving a new SOTA performance. △ Less

Submitted 7 August, 2020; originally announced September 2020.

arXiv:2008.09624 [pdf, other]

Optimization of Graph Neural Networks with Natural Gradient Descent

Authors: Mohammad Rasool Izadi, Yihao Fang, Robert Stevenson, Lizhen Lin

Abstract: In this work, we propose to employ information-geometric tools to optimize a graph neural network architecture such as the graph convolutional networks. More specifically, we develop optimization algorithms for the graph-based semi-supervised learning by employing the natural gradient information in the optimization process. This allows us to efficiently exploit the geometry of the underlying stat… ▽ More In this work, we propose to employ information-geometric tools to optimize a graph neural network architecture such as the graph convolutional networks. More specifically, we develop optimization algorithms for the graph-based semi-supervised learning by employing the natural gradient information in the optimization process. This allows us to efficiently exploit the geometry of the underlying statistical model or parameter space for optimization and inference. To the best of our knowledge, this is the first work that has utilized the natural gradient for the optimization of graph neural networks that can be extended to other semi-supervised problems. Efficient computations algorithms are developed and extensive numerical studies are conducted to demonstrate the superior performance of our algorithms over existing algorithms such as ADAM and SGD. △ Less

Submitted 21 August, 2020; originally announced August 2020.

arXiv:2007.11123 [pdf, other]

Outcome-Guided Disease Subtyping for High-Dimensional Omics Data

Authors: Peng Liu, Yusi Fang, Zhao Ren, Lu Tang, George C. Tseng

Abstract: High-throughput microarray and sequencing technology have been used to identify disease subtypes that could not be observed otherwise by using clinical variables alone. The classical unsupervised clustering strategy concerns primarily the identification of subpopulations that have similar patterns in gene features. However, as the features corresponding to irrelevant confounders (e.g. gender or ag… ▽ More High-throughput microarray and sequencing technology have been used to identify disease subtypes that could not be observed otherwise by using clinical variables alone. The classical unsupervised clustering strategy concerns primarily the identification of subpopulations that have similar patterns in gene features. However, as the features corresponding to irrelevant confounders (e.g. gender or age) may dominate the clustering process, the resulting clusters may or may not capture clinically meaningful disease subtypes. This gives rise to a fundamental problem: can we find a subtyping procedure guided by a pre-specified disease outcome? Existing methods, such as supervised clustering, apply a two-stage approach and depend on an arbitrary number of selected features associated with outcome. In this paper, we propose a unified latent generative model to perform outcome-guided disease subtyping constructed from omics data, which improves the resulting subtypes concerning the disease of interest. Feature selection is embedded in a regularization regression. A modified EM algorithm is applied for numerical computation and parameter estimation. The proposed method performs feature selection, latent subtype characterization and outcome prediction simultaneously. To account for possible outliers or violation of mixture Gaussian assumption, we incorporate robust estimation using adaptive Huber or median-truncated loss function. Extensive simulations and an application to complex lung diseases with transcriptomic and clinical data demonstrate the ability of the proposed method to identify clinically relevant disease subtypes and signature genes suitable to explore toward precision medicine. △ Less

Submitted 21 July, 2020; originally announced July 2020.

Comments: 29 pages in total, 4 figures, 2 tables and 1 supplement

arXiv:2007.08053 [pdf, other]

doi 10.24963/ijcai.2020/168

Inductive Link Prediction for Nodes Having Only Attribute Information

Authors: Yu Hao, Xin Cao, Yixiang Fang, Xike Xie, Sibo Wang

Abstract: Predicting the link between two nodes is a fundamental problem for graph data analytics. In attributed graphs, both the structure and attribute information can be utilized for link prediction. Most existing studies focus on transductive link prediction where both nodes are already in the graph. However, many real-world applications require inductive prediction for new nodes having only attribute i… ▽ More Predicting the link between two nodes is a fundamental problem for graph data analytics. In attributed graphs, both the structure and attribute information can be utilized for link prediction. Most existing studies focus on transductive link prediction where both nodes are already in the graph. However, many real-world applications require inductive prediction for new nodes having only attribute information. It is more challenging since the new nodes do not have structure information and cannot be seen during the model training. To solve this problem, we propose a model called DEAL, which consists of three components: two node embedding encoders and one alignment mechanism. The two encoders aim to output the attribute-oriented node embedding and the structure-oriented node embedding, and the alignment mechanism aligns the two types of embeddings to build the connections between the attributes and links. Our model DEAL is versatile in the sense that it works for both inductive and transductive link prediction. Extensive experiments on several benchmark datasets show that our proposed model significantly outperforms existing inductive link prediction methods, and also outperforms the state-of-the-art methods on transductive link prediction. △ Less

Submitted 15 July, 2020; originally announced July 2020.

Comments: IJCAI2020

arXiv:2007.05188 [pdf, other]

Intelligent Credit Limit Management in Consumer Loans Based on Causal Inference

Authors: Hang Miao, Kui Zhao, Zhun Wang, Linbo Jiang, Quanhui Jia, Yanming Fang, Quan Yu

Abstract: Nowadays consumer loan plays an important role in promoting the economic growth, and credit cards are the most popular consumer loan. One of the most essential parts in credit cards is the credit limit management. Traditionally, credit limits are adjusted based on limited heuristic strategies, which are developed by experienced professionals. In this paper, we present a data-driven approach to man… ▽ More Nowadays consumer loan plays an important role in promoting the economic growth, and credit cards are the most popular consumer loan. One of the most essential parts in credit cards is the credit limit management. Traditionally, credit limits are adjusted based on limited heuristic strategies, which are developed by experienced professionals. In this paper, we present a data-driven approach to manage the credit limit intelligently. Firstly, a conditional independence testing is conducted to acquire the data for building models. Based on these testing data, a response model is then built to measure the heterogeneous treatment effect of increasing credit limits (i.e. treatments) for different customers, who are depicted by several control variables (i.e. features). In order to incorporate the diminishing marginal effect, a carefully selected log transformation is introduced to the treatment variable. Moreover, the model's capability can be further enhanced by applying a non-linear transformation on features via GBDT encoding. Finally, a well-designed metric is proposed to properly measure the performances of compared methods. The experimental results demonstrate the effectiveness of the proposed approach. △ Less

Submitted 10 July, 2020; originally announced July 2020.

Comments: 7 pages

arXiv:2005.08189 [pdf, other]

doi 10.1145/3441450

Multi-View Collaborative Network Embedding

Authors: Sezin Kircali Ata, Yuan Fang, Min Wu, Jiaqi Shi, Chee Keong Kwoh, Xiaoli Li

Abstract: Real-world networks often exist with multiple views, where each view describes one type of interaction among a common set of nodes. For example, on a video-sharing network, while two user nodes are linked if they have common favorite videos in one view, they can also be linked in another view if they share common subscribers. Unlike traditional single-view networks, multiple views maintain differe… ▽ More Real-world networks often exist with multiple views, where each view describes one type of interaction among a common set of nodes. For example, on a video-sharing network, while two user nodes are linked if they have common favorite videos in one view, they can also be linked in another view if they share common subscribers. Unlike traditional single-view networks, multiple views maintain different semantics to complement each other. In this paper, we propose MANE, a multi-view network embedding approach to learn low-dimensional representations. Similar to existing studies, MANE hinges on diversity and collaboration - while diversity enables views to maintain their individual semantics, collaboration enables views to work together. However, we also discover a novel form of second-order collaboration that has not been explored previously, and further unify it into our framework to attain superior node representations. Furthermore, as each view often has varying importance w.r.t. different nodes, we propose MANE+, an attention-based extension of MANE to model node-wise view importance. Finally, we conduct comprehensive experiments on three public, real-world multi-view networks, and the results demonstrate that our models consistently outperform state-of-the-art approaches. △ Less

Submitted 17 December, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: Accepted for publication in the ACM Transactions on Knowledge Discovery from Data, TKDD

Journal ref: ACM Trans. Knowl. Discov. Data 15, 3, Article 39 (April 2021), 18 pages

arXiv:2005.05324 [pdf, other]

Infinite mixtures of multivariate normal-inverse Gaussian distributions for clustering of skewed data

Authors: Yuan Fang, Dimitris Karlis, Sanjeena Subedi

Abstract: Mixtures of multivariate normal inverse Gaussian (MNIG) distributions can be used to cluster data that exhibit features such as skewness and heavy tails. However, for cluster analysis, using a traditional finite mixture model framework, either the number of components needs to be known $a$-$priori$ or needs to be estimated $a$-$posteriori$ using some model selection criterion after deriving result… ▽ More Mixtures of multivariate normal inverse Gaussian (MNIG) distributions can be used to cluster data that exhibit features such as skewness and heavy tails. However, for cluster analysis, using a traditional finite mixture model framework, either the number of components needs to be known $a$-$priori$ or needs to be estimated $a$-$posteriori$ using some model selection criterion after deriving results for a range of possible number of components. However, different model selection criteria can sometimes result in different number of components yielding uncertainty. Here, an infinite mixture model framework, also known as Dirichlet process mixture model, is proposed for the mixtures of MNIG distributions. This Dirichlet process mixture model approach allows the number of components to grow or decay freely from 1 to $\infty$ (in practice from 1 to $N$) and the number of components is inferred along with the parameter estimates in a Bayesian framework thus alleviating the need for model selection criteria. We provide real data applications with benchmark datasets as well as a small simulation experiment to compare with other existing models. The proposed method provides competitive clustering results to other clustering approaches for both simulation and real data and parameter recovery are illustrated using simulation studies. △ Less

Submitted 11 May, 2020; originally announced May 2020.

Comments: 61 pages. arXiv admin note: text overlap with arXiv:2005.02585

MSC Class: 62H30

arXiv:2005.02585 [pdf, other]

A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions

Authors: Yuan Fang, Dimitris Karlis, Sanjeena Subedi

Abstract: Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via Gibbs sampler is used; for t… ▽ More Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via Gibbs sampler is used; for this, a novel approach to simulate univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. The proposed algorithm will be applied to both simulated and real data. Through simulation studies and real data analysis, we show parameter recovery and that our approach provides competitive clustering results compared to other clustering approaches. △ Less

Submitted 5 May, 2020; originally announced May 2020.

Comments: 40 pages, 7 figures

MSC Class: 62H30

arXiv:2005.00718 [pdf, other]

Large-scale Uncertainty Estimation and Its Application in Revenue Forecast of SMEs

Authors: Zebang Zhang, Kui Zhao, Kai Huang, Quanhui Jia, Yanming Fang, Quan Yu

Abstract: The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society. Business credit loans are very important for the operation of SMEs, and the revenue is a key indicator of credit limit management. Therefore, it is very beneficial to construct a reliable revenue forecasting model. If the uncertainty of an enterprise's revenue forecasting… ▽ More The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society. Business credit loans are very important for the operation of SMEs, and the revenue is a key indicator of credit limit management. Therefore, it is very beneficial to construct a reliable revenue forecasting model. If the uncertainty of an enterprise's revenue forecasting can be estimated, a more proper credit limit can be granted. Natural gradient boosting approach, which estimates the uncertainty of prediction by a multi-parameter boosting algorithm based on the natural gradient. However, its original implementation is not easy to scale into big data scenarios, and computationally expensive compared to state-of-the-art tree-based models (such as XGBoost). In this paper, we propose a Scalable Natural Gradient Boosting Machines that is simple to implement, readily parallelizable, interpretable and yields high-quality predictive uncertainty estimates. According to the characteristics of revenue distribution, we derive an uncertainty quantification function. We demonstrate that our method can distinguish between samples that are accurate and inaccurate on revenue forecasting of SMEs. What's more, interpretability can be naturally obtained from the model, satisfying the financial needs. △ Less

Submitted 2 May, 2020; originally announced May 2020.

arXiv:2004.00201 [pdf, other]

doi 10.1109/BigData.2018.8622169

NetDP: An Industrial-Scale Distributed Network Representation Framework for Default Prediction in Ant Credit Pay

Authors: Jianbin Lin, Zhiqiang Zhang, Jun Zhou, Xiaolong Li, Jingli Fang, Yanming Fang, Quan Yu, Yuan Qi

Abstract: Ant Credit Pay is a consumer credit service in Ant Financial Service Group. Similar to credit card, loan default is one of the major risks of this credit product. Hence, effective algorithm for default prediction is the key to losses reduction and profits increment for the company. However, the challenges facing in our scenario are different from those in conventional credit card service. The firs… ▽ More Ant Credit Pay is a consumer credit service in Ant Financial Service Group. Similar to credit card, loan default is one of the major risks of this credit product. Hence, effective algorithm for default prediction is the key to losses reduction and profits increment for the company. However, the challenges facing in our scenario are different from those in conventional credit card service. The first one is scalability. The huge volume of users and their behaviors in Ant Financial requires the ability to process industrial-scale data and perform model training efficiently. The second challenges is the cold-start problem. Different from the manual review for credit card application in conventional banks, the credit limit of Ant Credit Pay is automatically offered to users based on the knowledge learned from big data. However, default prediction for new users is suffered from lack of enough credit behaviors. It requires that the proposal should leverage other new data source to alleviate the cold-start problem. Considering the above challenges and the special scenario in Ant Financial, we try to incorporate default prediction with network information to alleviate the cold-start problem. In this paper, we propose an industrial-scale distributed network representation framework, termed NetDP, for default prediction in Ant Credit Pay. The proposal explores network information generated by various interaction between users, and blends unsupervised and supervised network representation in a unified framework for default prediction problem. Moreover, we present a parameter-server-based distributed implement of our proposal to handle the scalability challenge. Experimental results demonstrate the effectiveness of our proposal, especially in cold-start problem, as well as the efficiency for industrial-scale dataset. △ Less

Submitted 31 March, 2020; originally announced April 2020.

Comments: 2018 IEEE International Conference on Big Data (Big Data)

arXiv:2003.01171 [pdf, other]

doi 10.1109/ICDM.2019.00070

A Semi-supervised Graph Attentive Network for Financial Fraud Detection

Authors: Daixin Wang, Jianbin Lin, Peng Cui, Quanhui Jia, Zhen Wang, Yanming Fang, Quan Yu, Jun Zhou, Shuang Yang, Yuan Qi

Abstract: With the rapid growth of financial services, fraud detection has been a very important problem to guarantee a healthy environment for both users and providers. Conventional solutions for fraud detection mainly use some rule-based methods or distract some features manually to perform prediction. However, in financial services, users have rich interactions and they themselves always show multifacete… ▽ More With the rapid growth of financial services, fraud detection has been a very important problem to guarantee a healthy environment for both users and providers. Conventional solutions for fraud detection mainly use some rule-based methods or distract some features manually to perform prediction. However, in financial services, users have rich interactions and they themselves always show multifaceted information. These data form a large multiview network, which is not fully exploited by conventional methods. Additionally, among the network, only very few of the users are labelled, which also poses a great challenge for only utilizing labeled data to achieve a satisfied performance on fraud detection. To address the problem, we expand the labeled data through their social relations to get the unlabeled data and propose a semi-supervised attentive graph neural network, namedSemiGNN to utilize the multi-view labeled and unlabeled data for fraud detection. Moreover, we propose a hierarchical attention mechanism to better correlate different neighbors and different views. Simultaneously, the attention mechanism can make the model interpretable and tell what are the important factors for the fraud and why the users are predicted as fraud. Experimentally, we conduct the prediction task on the users of Alipay, one of the largest third-party online and offline cashless payment platform serving more than 4 hundreds of million users in China. By utilizing the social relations and the user attributes, our method can achieve a better accuracy compared with the state-of-the-art methods on two tasks. Moreover, the interpretable results also give interesting intuitions regarding the tasks. △ Less

Submitted 28 February, 2020; originally announced March 2020.

Comments: icdm

arXiv:1812.11262 [pdf]

Autoencoder Based Residual Deep Networks for Robust Regression Prediction and Spatiotemporal Estimation

Authors: Lianfa Li, Ying Fang, Jun Wu, Jinfeng Wang

Abstract: To have a superior generalization, a deep learning neural network often involves a large size of training sample. With increase of hidden layers in order to increase learning ability, neural network has potential degradation in accuracy. Both could seriously limit applicability of deep learning in some domains particularly involving predictions of continuous variables with a small size of samples.… ▽ More To have a superior generalization, a deep learning neural network often involves a large size of training sample. With increase of hidden layers in order to increase learning ability, neural network has potential degradation in accuracy. Both could seriously limit applicability of deep learning in some domains particularly involving predictions of continuous variables with a small size of samples. Inspired by residual convolutional neural network in computer vision and recent findings of crucial shortcuts in the brains in neuroscience, we propose an autoencoder-based residual deep network for robust prediction. In a nested way, we leverage shortcut connections to implement residual mapping with a balanced structure for efficient propagation of error signals. The novel method is demonstrated by multiple datasets, imputation of high spatiotemporal resolution non-randomness missing values of aerosol optical depth, and spatiotemporal estimation of fine particulate matter <2.5 μm, achieving the cutting edge of accuracy and efficiency. Our approach is also a general-purpose regression learner to be applicable in diverse domains. △ Less

Submitted 28 December, 2018; originally announced December 2018.

Comments: including supplementary materials

arXiv:1808.05347 [pdf, other]

Tool Breakage Detection using Deep Learning

Authors: Guang Li, Xin Yang, Duanbing Chen, Anxing Song, Yuke Fang, Junlin Zhou

Abstract: In manufacture, steel and other metals are mainly cut and shaped during the fabrication process by computer numerical control (CNC) machines. To keep high productivity and efficiency of the fabrication process, engineers need to monitor the real-time process of CNC machines, and the lifetime management of machine tools. In a real manufacturing process, breakage of machine tools usually happens wit… ▽ More In manufacture, steel and other metals are mainly cut and shaped during the fabrication process by computer numerical control (CNC) machines. To keep high productivity and efficiency of the fabrication process, engineers need to monitor the real-time process of CNC machines, and the lifetime management of machine tools. In a real manufacturing process, breakage of machine tools usually happens without any indication, this problem seriously affects the fabrication process for many years. Previous studies suggested many different approaches for monitoring and detecting the breakage of machine tools. However, there still exists a big gap between academic experiments and the complex real fabrication processes such as the high demands of real-time detections, the difficulty in data acquisition and transmission. In this work, we use the spindle current approach to detect the breakage of machine tools, which has the high performance of real-time monitoring, low cost, and easy to install. We analyze the features of the current of a milling machine spindle through tools wearing processes, and then we predict the status of tool breakage by a convolutional neural network(CNN). In addition, we use a BP neural network to understand the reliability of the CNN. The results show that our CNN approach can detect tool breakage with an accuracy of 93%, while the best performance of BP is 80%. △ Less

Submitted 16 August, 2018; originally announced August 2018.

Comments: 6 pages,BCD2018

arXiv:1803.04640 [pdf, other]

Bayesian Detection of Abnormal ADS in Mutant Caenorhabditis elegans Embryos

Authors: Wei Liang, Yuxiao Yang, Yusi Fang, Zhongying Zhao, Jie Hu

Abstract: Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on the comparison of asynchrony of division between sister cells (ADS) between wild-type and mutant individuals of Caenorhabditis elegans. Since the replicate number of mutant individuals of each m… ▽ More Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on the comparison of asynchrony of division between sister cells (ADS) between wild-type and mutant individuals of Caenorhabditis elegans. Since the replicate number of mutant individuals of each mutated gene, usually one, is far smaller than that of wild-type, direct comparison of two distributions of ADS between wild-type and mutant type, such as Kolmogorov- Smirnov test, is not feasible. On the other hand, we find that sometimes ADS is correlated with the life span of corresponding mother cell in wild-type. Hence, we apply a semiparametric Bayesian quantile regression method to estimate the 95% confidence interval curve of ADS with respect to life span of mother cell of wild-type individuals. Then, mutant-type ADSs outside the corresponding confidence interval are selected out as abnormal one with a significance level of 0.05. Simulation study demonstrates the accuracy of our method and Gene Enrichment Analysis validates the results of real data sets. △ Less

Submitted 13 March, 2018; originally announced March 2018.

arXiv:1707.00192 [pdf, other]

On Scalable Inference with Stochastic Gradient Descent

Authors: Yixin Fang, Jinfeng Xu, Lei Yang

Abstract: In many applications involving large dataset or online updating, stochastic gradient descent (SGD) provides a scalable way to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators have been established decades ago, statistical inference such as interval estimation remains much… ▽ More In many applications involving large dataset or online updating, stochastic gradient descent (SGD) provides a scalable way to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators have been established decades ago, statistical inference such as interval estimation remains much unexplored. The traditional resampling method such as the bootstrap is not computationally feasible since it requires to repeatedly draw independent samples from the entire dataset. The plug-in method is not applicable when there are no explicit formulas for the covariance matrix of the estimator. In this paper, we propose a scalable inferential procedure for stochastic gradient descent, which, upon the arrival of each observation, updates the SGD estimate as well as a large number of randomly perturbed SGD estimates. The proposed method is easy to implement in practice. We establish its theoretical properties for a general class of models that includes generalized linear models and quantile regression models as special cases. The finite-sample performance and numerical utility is evaluated by simulation studies and two real data applications. △ Less

Submitted 1 July, 2017; originally announced July 2017.

arXiv:1705.03831 [pdf, ps, other]

Quasi-Reliable Estimates of Effective Sample Size

Authors: Youhan Fang, Yudong Cao, Robert D. Skeel

Abstract: The efficiency of a Markov chain Monte Carlo algorithm might be measured by the cost of generating one independent sample, or equivalently, the total cost divided by the effective sample size, defined in terms of the integrated autocorrelation time. To ensure the reliability of such an estimate, it is suggested that there be an adequate sampling of state space--- to the extent that this can be det… ▽ More The efficiency of a Markov chain Monte Carlo algorithm might be measured by the cost of generating one independent sample, or equivalently, the total cost divided by the effective sample size, defined in terms of the integrated autocorrelation time. To ensure the reliability of such an estimate, it is suggested that there be an adequate sampling of state space--- to the extent that this can be determined from the available samples. A possible method for doing this is derived and evaluated. △ Less

Submitted 11 May, 2017; v1 submitted 10 May, 2017; originally announced May 2017.

arXiv:1701.03772 [pdf, other]

Additive Partially Linear Models for Massive Heterogeneous Data

Authors: Binhuan Wang, Yixin Fang, Heng Lian, Hua Liang

Abstract: We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. We propose an aggregation type of estimators for the commonality parameters that possess the asymptotic optimal bounds and the asymptotic distributions as if… ▽ More We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. We propose an aggregation type of estimators for the commonality parameters that possess the asymptotic optimal bounds and the asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components accordingly. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data. △ Less

Submitted 28 December, 2018; v1 submitted 13 January, 2017; originally announced January 2017.

arXiv:1603.07427 [pdf, other]

Penalized Weighted Least Squares for Outlier Detection and Robust Regression

Authors: Xiaoli Gao, Yixin Fang

Abstract: To conduct regression analysis for data contaminated with outliers, many approaches have been proposed for simultaneous outlier detection and robust regression, so is the approach proposed in this manuscript. This new approach is called "penalized weighted least squares" (PWLS). By assigning each observation an individual weight and incorporating a lasso-type penalty on the log-transformation of t… ▽ More To conduct regression analysis for data contaminated with outliers, many approaches have been proposed for simultaneous outlier detection and robust regression, so is the approach proposed in this manuscript. This new approach is called "penalized weighted least squares" (PWLS). By assigning each observation an individual weight and incorporating a lasso-type penalty on the log-transformation of the weight vector, the PWLS is able to perform outlier detection and robust regression simultaneously. A Bayesian point-of-view of the PWLS is provided, and it is showed that the PWLS can be seen as an example of M-estimation. Two methods are developed for selecting the tuning parameter in the PWLS. The performance of the PWLS is demonstrated via simulations and real applications. △ Less

Submitted 23 March, 2016; originally announced March 2016.

Comments: 27 pages; 4 figures

arXiv:1601.04586 [pdf, other]

doi 10.1080/10618600.2017.1377081

Sparse Convex Clustering

Authors: Binhuan Wang, Yilong Zhang, Will Wei Sun, Yixin Fang

Abstract: Convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn recent attentions since it nicely addresses the instability issue of traditional nonconvex clustering methods. Although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scena… ▽ More Convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn recent attentions since it nicely addresses the instability issue of traditional nonconvex clustering methods. Although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scenario, where the data contains a large number of features and many of them carry no information about the clustering structure. In this paper, we demonstrate that the performance of convex clustering could be distorted when the uninformative features are included in the clustering. To overcome it, we introduce a new clustering method, referred to as Sparse Convex Clustering, to simultaneously cluster observations and conduct feature selection. The key idea is to formulate convex clustering in a form of regularization, with an adaptive group-lasso penalty term on cluster centers. In order to optimally balance the tradeoff between the cluster fitting and sparsity, a tuning criterion based on clustering stability is developed. In theory, we provide an unbiased estimator for the degrees of freedom of the proposed sparse convex clustering method. Finally, the effectiveness of the sparse convex clustering is examined through a variety of numerical experiments and a real data application. △ Less

Submitted 10 February, 2017; v1 submitted 18 January, 2016; originally announced January 2016.

arXiv:1503.03970 [pdf, other]

Flexible combination of multiple diagnostic biomarkers to improve diagnostic accuracy

Authors: Tu Xu, Yixin Fang, Alan Rong, Junhui Wang

Abstract: In medical research, it is common to collect information of multiple continuous biomarkers to improve the accuracy of diagnostic tests. Combining the measurements of these biomarkers into one single score is a popular practice to integrate the collected information, where the accuracy of the resultant diagnostic test is usually improved. To measure the accuracy of a diagnostic test, the Youden ind… ▽ More In medical research, it is common to collect information of multiple continuous biomarkers to improve the accuracy of diagnostic tests. Combining the measurements of these biomarkers into one single score is a popular practice to integrate the collected information, where the accuracy of the resultant diagnostic test is usually improved. To measure the accuracy of a diagnostic test, the Youden index has been widely used in literature. Various parametric and nonparametric methods have been proposed to linearly combine biomarkers so that the corresponding Youden index can be optimized. Yet there seems to be little justification of enforcing such a linear combination. This paper proposes a flexible approach that allows both linear and nonlinear combinations of biomarkers. The proposed approach formulates the problem in a large margin classification framework, where the combination function is embedded in a flexible reproducing kernel Hilbert space. Advantages of the proposed approach are demonstrated in a variety of simulated experiments as well as a real application to a liver disorder study. △ Less

Submitted 7 July, 2015; v1 submitted 13 March, 2015; originally announced March 2015.

arXiv:1402.7107 [pdf, ps, other]

doi 10.1063/1.4874000

Compressible Generalized Hybrid Monte Carlo

Authors: Youhan Fang, Jesus-Maria Sanz-Serna, Robert D. Skeel

Abstract: One of the most demanding calculations is to generate random samples from a specified probability distribution (usually with an unknown normalizing prefactor) in a high-dimensional configuration space. One often has to resort to using a Markov chain Monte Carlo method, which converges only in the limit to the prescribed distribution. Such methods typically inch through configuration space step by… ▽ More One of the most demanding calculations is to generate random samples from a specified probability distribution (usually with an unknown normalizing prefactor) in a high-dimensional configuration space. One often has to resort to using a Markov chain Monte Carlo method, which converges only in the limit to the prescribed distribution. Such methods typically inch through configuration space step by step, with acceptance of a step based on a Metropolis(-Hastings) criterion. An acceptance rate of 100% is possible in principle by embedding configuration space in a higher-dimensional phase space and using ordinary differential equations. In practice, numerical integrators must be used, lowering the acceptance rate. This is the essence of hybrid Monte Carlo methods. Presented is a general framework for constructing such methods under relaxed conditions: the only geometric property needed is (weakened) reversibility; volume preservation is not needed. The possibilities are illustrated by deriving a couple of explicit hybrid Monte Carlo methods, one based on barrier-lowering variable-metric dynamics and another based on isokinetic dynamics. △ Less

Submitted 27 February, 2014; originally announced February 2014.

Comments: 27 pages, 2 figures

arXiv:1402.1835 [pdf, other]

A model-free estimation for the covariate-adjusted Youden index and its associated cut-point

Authors: Tu Xu, Junhui Wang, Yixin Fang

Abstract: In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased and non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In literature, estimation of the Youden index has been… ▽ More In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased and non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In literature, estimation of the Youden index has been widely studied via various statistical modeling strategies on the conditional density. This paper proposes a new model-free estimation method, which directly estimates the covariate-adjusted cut-point without estimating the conditional density. Consequently, covariate-adjusted Youden index can be estimated based on the estimated cutpoint. The proposed method formulates the estimation problem in a large margin classification framework, which allows flexible modeling of the covariate-adjusted Youden index through kernel machines. The advantage of the proposed method is demonstrated in a variety of simulated experiments as well as a real application to Pima Indians diabetes study. △ Less

Submitted 8 February, 2014; originally announced February 2014.

arXiv:1308.3416 [pdf, other]

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Authors: Yixin Fang, Binhuan Wang, Yang Feng

Abstract: Recently many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is no guideline on the number of folds for conducting cross-validation and there is no comparison between cross-validation and the methods based on bootstrap. Through extensive simulations, we suggest 10-fold c… ▽ More Recently many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is no guideline on the number of folds for conducting cross-validation and there is no comparison between cross-validation and the methods based on bootstrap. Through extensive simulations, we suggest 10-fold cross-validation (nine-tenths for training and one-tenth for validation) be appropriate when the estimation accuracy is measured in the Frobenius norm, while 2-fold cross-validation (half for training and half for validation) or reverse 3-fold cross-validation (one-third for training and two-thirds for validation) be appropriate in the operator norm. We also suggest the "optimal" cross-validation be more appropriate than the methods based on bootstrap for both types of norm. △ Less

Submitted 15 August, 2013; originally announced August 2013.

arXiv:1301.7118 [pdf, ps, other]

A note on selection stability: combining stability and prediction

Authors: Yixin Fang, Junhui Wang, Wei Sun

Abstract: Recently, many regularized procedures have been proposed for variable selection in linear regression, but their performance depends on the tuning parameter selection. Here a criterion for the tuning parameter selection is proposed, which combines the strength of both stability selection and cross-validation and therefore is referred as the prediction and stability selection (PASS). The selection c… ▽ More Recently, many regularized procedures have been proposed for variable selection in linear regression, but their performance depends on the tuning parameter selection. Here a criterion for the tuning parameter selection is proposed, which combines the strength of both stability selection and cross-validation and therefore is referred as the prediction and stability selection (PASS). The selection consistency is established assuming the data generating model is a subset of the full model, and the small sample performance is demonstrated through some simulation studies where the assumption is either held or violated. △ Less

Submitted 29 January, 2013; originally announced January 2013.

arXiv:1208.3380 [pdf, ps, other]

Consistent selection of tuning parameters via variable selection stability

Authors: Wei Sun, Junhui Wang, Yixin Fang

Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the esti… ▽ More Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross-validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on a novel concept of variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. The effectiveness of the proposed criterion is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. △ Less

Submitted 13 December, 2013; v1 submitted 16 August, 2012; originally announced August 2012.

Comments: Published in JMLR (http://jmlr.org/papers/v14/)

Journal ref: Journal of Machine Learning Research 2013, Vol. 14, 3419-3440

arXiv:1203.3559 [pdf, other]

A divergence formula for regularization methods with an L2 constraint

Authors: Yixin Fang, Yuanjia Wang, Xin Huang

Abstract: We derive a divergence formula for a group of regularization methods with an L2 constraint. The formula is useful for regularization parameter selection, because it provides an unbiased estimate for the number of degrees of freedom. We begin with deriving the formula for smoothing splines and then extend it to other settings such as penalized splines, ridge regression, and functional linear regres… ▽ More We derive a divergence formula for a group of regularization methods with an L2 constraint. The formula is useful for regularization parameter selection, because it provides an unbiased estimate for the number of degrees of freedom. We begin with deriving the formula for smoothing splines and then extend it to other settings such as penalized splines, ridge regression, and functional linear regression. △ Less

Submitted 15 March, 2012; originally announced March 2012.

Showing 1–41 of 41 results for author: Fang, Y