-
Investigating Data Usage for Inductive Conformal Predictors
Authors:
Yizirui Fang,
Anthony Bellotti
Abstract:
Inductive conformal predictors (ICPs) are algorithms that generate prediction sets, instead of point predictions, which are valid at a user-defined confidence level, assuming only exchangeability. These algorithms are useful for reliable machine learning and are increasing in popularity. The ICP development process involves dividing development data into three parts: training, calibration and test. When development data are limited or expensive, the most efficient way to divide the data is an open question. This study provides several experiments to explore this question and considers the case for allowing overlap of examples between training and calibration sets. Conclusions are drawn that will be of value to academics and practitioners planning to use ICPs.
Submitted 18 June, 2024;
originally announced June 2024.
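The three-way split and the calibration step described in the ICP abstract above can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the absolute-residual nonconformity score, the toy mean predictor, the synthetic data and all function names (split_data, fit_mean, conformal_radius) are assumptions made for the example.

```python
import math
import random


def split_data(examples, n_train, n_cal, seed=0):
    """Shuffle, then split into disjoint training, calibration and test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    cal = shuffled[n_train:n_train + n_cal]
    test = shuffled[n_train + n_cal:]
    return train, cal, test


def fit_mean(train):
    """Toy underlying model (assumption): always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def conformal_radius(model, cal, alpha):
    """Half-width q of the prediction interval [pred - q, pred + q].

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    which gives coverage of at least 1 - alpha under exchangeability.
    """
    scores = sorted(abs(y - model(x)) for x, y in cal)
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    if k > len(scores):
        return math.inf  # too few calibration examples for this alpha
    return scores[k - 1]


# Synthetic (x, y) pairs, purely for illustration.
data = [(i, float(i % 10)) for i in range(300)]
train, cal, test = split_data(data, n_train=100, n_cal=100)
model = fit_mean(train)
q = conformal_radius(model, cal, alpha=0.1)

# Empirical coverage on the held-out test part.
coverage = sum(abs(y - model(x)) <= q for x, y in test) / len(test)
print(f"radius={q:.3f}, test coverage={coverage:.2f}")
```

Changing n_train and n_cal in the call to split_data is one simple way to explore the trade-off the abstract raises: more training data tends to give a better underlying model, while more calibration data gives a finer-grained quantile.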
-
Causal-StoNet: Causal Inference for High-Dimensional Complex Data
Authors:
Yaxin Fang,
Faming Liang
Abstract:
With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medicine, econometrics, and social science. However, the existing methods for causal inference are frequently developed under the assumption that the data dimension is low or that the underlying data generation process is linear or approximately linear. To address these challenges, this paper proposes a novel causal inference approach for dealing with high-dimensional complex data. The proposed approach is based on deep learning techniques, including sparse deep learning theory and stochastic neural networks, that have been developed in recent literature. By using these techniques, the proposed approach can address both the high dimensionality and unknown data generation process in a coherent way. Furthermore, the proposed approach can also be used when missing values are present in the datasets. Extensive numerical studies indicate that the proposed approach outperforms existing ones.
Submitted 27 March, 2024;
originally announced March 2024.
-
Development of a Evaluation Tool for Age-Appropriate Software in Aging Environments: A Delphi Study
Authors:
Zhenggang Bai,
Yougxiang Fang,
Hongtu Chen,
Xinru Chen,
Ning An,
Min Zhang,
Guoxin Rui,
Jing Jin
Abstract:
Objective: We aimed to develop a reliable tool for assessing software age-appropriateness. Methods: We conducted a systematic review to obtain indicators of technology age-appropriateness from studies published between January 2000 and April 2023. The study engaged 25 experts from the fields of anthropology, sociology, and social technology research across three rounds of Delphi consultations. Experts were asked to screen, assess, add to, and provide feedback on the preliminary indicators identified in the initial indicator pool. Results: We extracted 76 criteria for evaluating quality, grouped into 11 distinct domains. After completing three rounds of Delphi consultations, experts drew upon their personal experiences, theoretical frameworks, and industry insights to arrive at a three-dimensional structure for the evaluation tool: user experience, product quality, and social promotion. These metrics were further distilled into a 16-item scale, and a corresponding questionnaire was formulated. The developed tool exhibited strong internal reliability (Cronbach's alpha of 0.867) and content validity (S-CVI of 0.93). Conclusion: This tool represents a straightforward, objective, and reliable mechanism for evaluating software's appropriateness across age groups. Moreover, it offers valuable insights and practical guidance for designing and developing high-quality age-appropriate software, and assists age groups in selecting software they like.
Submitted 4 February, 2024;
originally announced February 2024.
-
Sensitivity Analysis for Unmeasured Confounding in Medical Product Development and Evaluation Using Real World Evidence
Authors:
Peng Ding,
Yixin Fang,
Doug Faries,
Susan Gruber,
Hana Lee,
Joo-Yeon Lee,
Pallavi Mishra-Kalyani,
Mingyang Shan,
Mark van der Laan,
Shu Yang,
Xiang Zhang
Abstract:
The American Statistical Association Biopharmaceutical Section (ASA BIOP) working group on real-world evidence (RWE) has been making a continuous, extended effort towards the goal of supporting and advancing regulatory science with respect to non-interventional clinical studies intended to use real-world data for evidence generation for the purpose of medical product development and evaluation (i.e., RWE studies). In 2023, the working group published a manuscript delineating challenges and opportunities in constructing estimands for RWE studies following the framework in the ICH E9(R1) guidance on estimands and sensitivity analysis. As a follow-up task, we describe the other issue in RWE studies: sensitivity analysis. Focusing on the issue of unmeasured confounding, we review the availability and applicability of sensitivity analysis methods for different types of unmeasured confounding. We discuss considerations for the choice and use of sensitivity analysis in RWE studies. An updated version of this article will present how findings from sensitivity analysis could support regulatory decision-making using a real example.
Submitted 14 July, 2023;
originally announced July 2023.
-
Tackling the infinite likelihood problem when fitting mixtures of shifted asymmetric Laplace distributions
Authors:
Yuan Fang,
Brian C. Franczak,
Sanjeena Subedi
Abstract:
Mixtures of shifted asymmetric Laplace distributions were introduced as a tool for model-based clustering that allowed for the direct parameterization of skewness in addition to location and scale. Following common practices, an expectation-maximization algorithm was developed to fit these mixtures. However, adaptations to account for the 'infinite likelihood problem' led to fits that gave good classification performance at the expense of parameter recovery. In this paper, we propose a more valuable solution to this problem by developing a novel Bayesian parameter estimation scheme for mixtures of shifted asymmetric Laplace distributions. Through simulation studies, we show that the proposed parameter estimation scheme gives better parameter estimates than the expectation-maximization based scheme. In addition, we show that the classification performance is as good as, and in some cases better than, that of the expectation-maximization based scheme. The performance of both schemes is also assessed using well-known real data sets.
Submitted 24 March, 2023;
originally announced March 2023.
-
Intrinsic and extrinsic deep learning on manifolds
Authors:
Yihao Fang,
Ilsang Ohn,
Vijay Gupta,
Lizhen Lin
Abstract:
We propose extrinsic and intrinsic deep neural network architectures as general frameworks for deep learning on manifolds. Specifically, extrinsic deep neural networks (eDNNs) preserve geometric features on manifolds by utilizing an equivariant embedding from the manifold to its image in the Euclidean space. Moreover, intrinsic deep neural networks (iDNNs) incorporate the underlying intrinsic geometry of manifolds via exponential and log maps with respect to a Riemannian structure. Consequently, we prove that the empirical risk of the empirical risk minimizers (ERM) of eDNNs and iDNNs converges at optimal rates. Overall, the eDNNs framework is simple and easy to compute, while the iDNNs framework is accurate and fast converging. To demonstrate the utility of our framework, various simulation studies and real data analyses are presented with eDNNs and iDNNs.
Submitted 16 February, 2023;
originally announced February 2023.
-
Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems
Authors:
Yuchen Fang,
Sen Na,
Michael W. Mahoney,
Mladen Kolar
Abstract:
We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints. We consider a fully stochastic setting, where at each step a single sample is generated to estimate the objective gradient. The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices (i.e., Hessians without modification) in SQP subproblems. As a trust-region method for constrained optimization, our algorithm must address an infeasibility issue -- the linearized equality constraints and trust-region constraints may lead to infeasible SQP subproblems. In this regard, we propose an adaptive relaxation technique to compute the trial step, consisting of a normal step and a tangential step. To control the lengths of these two steps while ensuring a scale-invariant property, we adaptively decompose the trust-region radius into two segments, based on the proportions of the rescaled feasibility and optimality residuals to the rescaled full KKT residual. The normal step has a closed form, while the tangential step is obtained by solving a trust-region subproblem, to which a solution ensuring the Cauchy reduction is sufficient for our study. We establish a global almost sure convergence guarantee for TR-StoSQP, and illustrate its empirical performance on both a subset of problems in the CUTEst test set and constrained logistic regression problems using data from the LIBSVM collection.
Submitted 28 January, 2024; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Learning and Inference in Sparse Coding Models with Langevin Dynamics
Authors:
Michael Y. -S. Fang,
Mayur Mudigonda,
Ryan Zarcone,
Amir Khosrowshahi,
Bruno A. Olshausen
Abstract:
We describe a stochastic, dynamical system capable of inference and learning in a probabilistic latent variable model. The most challenging problem in such models - sampling the posterior distribution over latent variables - is proposed to be solved by harnessing natural sources of stochasticity inherent in electronic and neural systems. We demonstrate this idea for a sparse coding model by deriving a continuous-time equation for inferring its latent variables via Langevin dynamics. The model parameters are learned by simultaneously evolving according to another continuous-time equation, thus bypassing the need for digital accumulators or a global clock. Moreover, we show that Langevin dynamics lead to an efficient procedure for sampling from the posterior distribution in the 'L0 sparse' regime, where latent variables are encouraged to be set to zero as opposed to having a small L1 norm. This allows the model to properly incorporate the notion of sparsity rather than having to resort to a relaxed version of sparsity to make optimization tractable. Simulations of the proposed dynamical system on both synthetic and natural image datasets demonstrate that the model is capable of probabilistically correct inference, enabling learning of the dictionary as well as parameters of the prior.
Submitted 23 April, 2022;
originally announced April 2022.
-
On p-value combination of independent and frequent signals: asymptotic efficiency and Fisher ensemble
Authors:
Yusi Fang,
Chung Chang,
George Tseng
Abstract:
Combining p-values to integrate multiple effects is of long-standing interest in social science and biomedical research. In this paper, we focus on revisiting a classical scenario closely related to meta-analysis, which combines a relatively small (finite and fixed) number of p-values while the sample size for generating each p-value is large (asymptotically goes to infinity). We evaluate a list of traditional and recently developed modified Fisher's methods to investigate their asymptotic efficiencies and finite-sample numerical performance. The results show that the Fisher and adaptively weighted Fisher methods have top performance and complementary advantages across different proportions of true signals. Finally, we propose an ensemble method, namely Fisher ensemble, to combine the two top-performing Fisher-related methods using a robust truncated Cauchy ensemble approach. We show that Fisher ensemble achieves asymptotic Bahadur optimality and integrates the strengths of the Fisher and adaptively weighted Fisher methods in simulations. We subsequently extend Fisher ensemble to a variant with emphasized power for concordant effect size directions. A transcriptomic meta-analysis application confirms the theoretical and simulation conclusions, generates intriguing biomarker and pathway findings, and demonstrates the strengths and strategy of using the proposed Fisher ensemble methods.
Submitted 14 April, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
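For orientation, the classical Fisher combination that the abstract above takes as its starting point can be sketched in a few lines. This is only the textbook method for independent p-values, not the adaptively weighted variant or the Fisher ensemble proposed in the paper; the function names chi2_sf_even and fisher_combine are hypothetical. The statistic is -2 times the sum of log p-values, which follows a chi-square distribution with 2k degrees of freedom under the null when the k p-values are independent.

```python
import math


def chi2_sf_even(x, k):
    """Survival function of a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom there is a closed form:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Return (statistic, combined p-value) for independent p-values in (0, 1]."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return stat, chi2_sf_even(stat, len(pvalues))


stat, p_combined = fisher_combine([0.04, 0.20, 0.10])
print(f"statistic={stat:.3f}, combined p={p_combined:.4f}")
```

With a single input the combined p-value equals that input, which is a quick sanity check on the closed-form survival function.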
-
Association study between gene expression and multiple phenotypes in omics applications of complex diseases
Authors:
Yujia Li,
Yusi Fang,
Peng Liu,
George C. Tseng
Abstract:
Studying phenotype-gene associations can uncover disease mechanisms and help develop efficient treatments. In complex diseases where multiple phenotypes are available and correlated, analyzing and interpreting associated genes for each phenotype separately may decrease statistical power and lose interpretability because the correlation between phenotypes is not considered. Typical approaches are global testing methods, such as multivariate analysis of variance (MANOVA), which test the overall association between phenotypes and each gene without considering the heterogeneity among phenotypes. In this paper, we extend and evaluate two p-value combination methods, adaptive weighted Fisher's method (AFp) and adaptive Fisher's method (AFz), to tackle this problem, where AFp stands out as our final proposed method, based on extensive simulations and a real application. Our proposed AFp method has three advantages over traditional global testing methods. Firstly, it can consider the heterogeneity of phenotypes and determine which specific phenotypes a gene is associated with, using phenotype-specific 0-1 weights. Secondly, AFp takes the p-values from the test of association of each phenotype as input, and thus can accommodate different types of phenotypes (continuous, binary and count). Thirdly, we apply bootstrapping to construct a variability index for the weight estimator of AFp and generate a co-membership matrix to categorize (cluster) genes based on their association patterns for intuitive biological investigations. Through extensive simulations, AFp shows superior performance over global testing methods in terms of type I error control and statistical power, as well as higher accuracy of 0-1 weight estimation than AFz. A real omics application with transcriptomic and clinical data of complex lung diseases demonstrates insightful biological findings of AFp.
Submitted 10 December, 2021;
originally announced December 2021.
-
pyStoNED: A Python Package for Convex Regression and Frontier Estimation
Authors:
Sheng Dai,
Yu-Hsueh Fang,
Chia-Yen Lee,
Timo Kuosmanen
Abstract:
Shape-constrained nonparametric regression is a growing area in econometrics, statistics, operations research, machine learning and related fields. In the field of productivity and efficiency analysis, recent developments in multivariate convex regression and related techniques such as convex quantile regression and convex expectile regression have bridged the long-standing gap between the conventional deterministic-nonparametric and stochastic-parametric methods. Unfortunately, the heavy computational burden and the lack of a powerful, reliable, and fully open-access computational package have slowed down the diffusion of these advanced estimation techniques to empirical practice. The purpose of the Python package pyStoNED is to address this challenge by providing a freely available and user-friendly tool for multivariate convex regression, convex quantile regression, convex expectile regression, isotonic regression, stochastic nonparametric envelopment of data, and related methods. This paper presents a tutorial of the pyStoNED package and illustrates its application, focusing on the estimation of frontier cost and production functions.
Submitted 27 September, 2021;
originally announced September 2021.
-
Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence
Authors:
Wennan Chang,
Pengtao Dang,
Changlin Wan,
Xiaoyu Lu,
Yue Fang,
Tong Zhao,
Yong Zang,
Bo Li,
Chi Zhang,
Sha Cao
Abstract:
In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints to simultaneously handle spatial nonstationarity, local homogeneity, and outlier contamination. Compared with existing spatial regression models, our proposed model assumes the existence of a few distinct regression models that are estimated based on observations that exhibit similar response-predictor relationships. As such, the proposed model not only accounts for nonstationarity in the spatial trend, but also clusters observations into a few distinct and homogeneous groups. This provides an advantage in interpretation, with a few stationary sub-processes identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contamination from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. A rigorous statistical hypothesis testing procedure has been designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression and spatial segmentation methods.
Submitted 28 September, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View
Authors:
Shuang Li,
Lu Wang,
Xinyun Chen,
Yixiang Fang,
Yan Song
Abstract:
Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and has exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex, intertwined space-time propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of COVID-19 as spatio-temporal point processes and propose a generative and intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates model misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes the model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model.
Submitted 24 June, 2021;
originally announced June 2021.
-
Heavy-tailed distribution for combining dependent $p$-values with asymptotic robustness
Authors:
Yusi Fang,
George C. Tseng,
Chung Chang
Abstract:
The issue of combining individual $p$-values to aggregate multiple small effects is prevalent in many scientific investigations and is a long-standing statistical topic. Many classical methods are designed for combining independent and frequent signals in a traditional meta-analysis sense using the sum of transformed $p$-values with the transformation of light-tailed distributions, among which Fisher's method and Stouffer's method are the best known. Since the early 2000s, advances in big data have promoted methods to aggregate independent, sparse and weak signals, such as the renowned higher criticism and Berk-Jones tests. Recently, Liu and Xie (2020) and Wilson (2019) independently proposed Cauchy and harmonic mean combination tests to robustly combine $p$-values under "arbitrary" dependency structure, where a notable application is to combine $p$-values from a set of often correlated SNPs in genome-wide association studies. The proposed tests use transformations by heavy-tailed distributions for improved power with sparse signals. This raises natural questions: to investigate heavy-tailed distribution transformations, to understand the connection among existing methods, and to explore the conditions for a method to possess robustness to dependency. In this paper, we investigate the regularly varying distribution, which is a rich family of heavy-tailed distributions and includes the Pareto distribution as a special case. We show that only an equivalent class of Cauchy and harmonic mean tests has sufficient robustness to dependency in a practical sense. We also show an issue caused by a large negative penalty in the Cauchy method and propose a simple, yet practical modification. Finally, we present simulations and an application to a neuroticism GWAS to verify the discovered theoretical insights and provide practical guidance.
Submitted 7 September, 2021; v1 submitted 23 March, 2021;
originally announced March 2021.
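The Cauchy combination test referred to in the abstract above has a simple closed form, sketched here for illustration. This is the standard test as it is usually stated (weighted sum of tan((0.5 - p) * pi), mapped back through the Cauchy tail), not the modification proposed in the paper; the function name cauchy_combine and the default equal weights are assumptions of this example.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values in (0, 1) with the Cauchy combination test.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted sum is then converted back to a
    p-value with the Cauchy survival function.
    """
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights summing to one (assumption)
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(stat) / math.pi


print(cauchy_combine([0.001, 0.30, 0.45]))

# A p-value very close to 1 maps to a large negative variate and can
# cancel out a very small p-value, which illustrates the kind of
# "large negative penalty" the abstract mentions.
print(cauchy_combine([1e-6, 0.999999]))
```

Because the transformed variates are heavy-tailed, one very small p-value dominates the sum, which is what gives the test its power for sparse signals.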
-
Clustering microbiome data using mixtures of logistic normal multinomial models
Authors:
Yuan Fang,
Sanjeena Subedi
Abstract:
Discrete data such as counts of microbiome taxa resulting from next-generation sequencing are routinely encountered in bioinformatics. Taxa count data in microbiome studies are typically high-dimensional and over-dispersed, and can only reveal relative abundance; they are therefore treated as compositional. Analyzing compositional data presents many challenges because they are restricted to a simplex. In a logistic normal multinomial model, the relative abundance is mapped from a simplex to a latent variable that exists on the real Euclidean space using the additive log-ratio transformation. While a logistic normal multinomial approach brings in flexibility for modeling the data, it comes with a heavy computational cost as the parameter estimation typically relies on Bayesian techniques. In this paper, we develop a novel mixture of logistic normal multinomial models for clustering microbiome data. Additionally, we utilize an efficient framework for parameter estimation using variational Gaussian approximations (VGA). Adopting a variational Gaussian approximation for the posterior of the latent variable reduces the computational overhead substantially. The proposed method is illustrated on simulated and real datasets.
Submitted 21 June, 2022; v1 submitted 12 November, 2020;
originally announced November 2020.
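The additive log-ratio transformation mentioned in the abstract above maps a composition on the simplex to unconstrained real coordinates. A minimal sketch follows; taking the last component as the reference and the function names alr and alr_inverse are choices made for this example, and the mixture model and variational estimation from the paper are not shown.

```python
import math


def alr(composition):
    """Additive log-ratio transform of strictly positive proportions.

    A composition with D parts is mapped to D - 1 real coordinates,
    log(x_i / x_D), using the last part as the reference (assumption).
    """
    ref = composition[-1]
    return [math.log(x / ref) for x in composition[:-1]]


def alr_inverse(z):
    """Map D - 1 real coordinates back to a D-part composition summing to 1."""
    exps = [math.exp(v) for v in z] + [1.0]  # the reference part has log-ratio 0
    total = sum(exps)
    return [e / total for e in exps]


proportions = [0.2, 0.3, 0.5]
latent = alr(proportions)
print(latent)
print(alr_inverse(latent))
```

The transform is undefined when any part is zero, so zero counts in real taxa tables need separate handling before it can be applied directly to observed proportions.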
-
Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection
Authors:
Xuehao Wang,
Shuai Li,
Chenglizhao Chen,
Yuming Fang,
Aimin Hao,
Hong Qin
Abstract:
Existing RGB-D salient object detection methods treat depth information as an independent component to complement its RGB part, and widely follow the bi-stream parallel network architecture. To selectively fuse the CNN features extracted from both RGB and depth as a final result, the state-of-the-art (SOTA) bi-stream networks usually consist of two independent subbranches; i.e., one subbranch is used for RGB saliency and the other aims for depth saliency. However, the depth saliency is persistently inferior to the RGB saliency because the RGB component is intrinsically more informative than the depth component. The bi-stream architecture easily biases its subsequent fusion procedure to the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction, where we cyclically convert the original 4-dimensional RGB-D into DGB, RDB and RGD (with D replacing one colour channel in turn). Then, a newly designed lightweight triple-stream network is applied over these newly formulated data to achieve an optimal channel-wise complementary fusion status between the RGB and D, achieving a new SOTA performance.
Submitted 7 August, 2020;
originally announced September 2020.
-
Optimization of Graph Neural Networks with Natural Gradient Descent
Authors:
Mohammad Rasool Izadi,
Yihao Fang,
Robert Stevenson,
Lizhen Lin
Abstract:
In this work, we propose to employ information-geometric tools to optimize a graph neural network architecture such as graph convolutional networks. More specifically, we develop optimization algorithms for graph-based semi-supervised learning by employing the natural gradient information in the optimization process. This allows us to efficiently exploit the geometry of the underlying statistical model or parameter space for optimization and inference. To the best of our knowledge, this is the first work that has utilized the natural gradient for the optimization of graph neural networks that can be extended to other semi-supervised problems. Efficient computational algorithms are developed and extensive numerical studies are conducted to demonstrate the superior performance of our algorithms over existing algorithms such as ADAM and SGD.
Submitted 21 August, 2020;
originally announced August 2020.
-
Outcome-Guided Disease Subtyping for High-Dimensional Omics Data
Authors:
Peng Liu,
Yusi Fang,
Zhao Ren,
Lu Tang,
George C. Tseng
Abstract:
High-throughput microarray and sequencing technologies have been used to identify disease subtypes that could not be observed otherwise by using clinical variables alone. The classical unsupervised clustering strategy concerns primarily the identification of subpopulations that have similar patterns in gene features. However, as the features corresponding to irrelevant confounders (e.g. gender or age) may dominate the clustering process, the resulting clusters may or may not capture clinically meaningful disease subtypes. This gives rise to a fundamental problem: can we find a subtyping procedure guided by a pre-specified disease outcome? Existing methods, such as supervised clustering, apply a two-stage approach and depend on an arbitrary number of selected features associated with outcome. In this paper, we propose a unified latent generative model to perform outcome-guided disease subtyping constructed from omics data, which improves the resulting subtypes concerning the disease of interest. Feature selection is embedded in a regularized regression. A modified EM algorithm is applied for numerical computation and parameter estimation. The proposed method performs feature selection, latent subtype characterization and outcome prediction simultaneously. To account for possible outliers or violation of the Gaussian mixture assumption, we incorporate robust estimation using an adaptive Huber or median-truncated loss function. Extensive simulations and an application to complex lung diseases with transcriptomic and clinical data demonstrate the ability of the proposed method to identify clinically relevant disease subtypes and signature genes suitable to explore toward precision medicine.
Submitted 21 July, 2020;
originally announced July 2020.
-
Inductive Link Prediction for Nodes Having Only Attribute Information
Authors:
Yu Hao,
Xin Cao,
Yixiang Fang,
Xike Xie,
Sibo Wang
Abstract:
Predicting the link between two nodes is a fundamental problem for graph data analytics. In attributed graphs, both the structure and attribute information can be utilized for link prediction. Most existing studies focus on transductive link prediction where both nodes are already in the graph. However, many real-world applications require inductive prediction for new nodes having only attribute information. It is more challenging since the new nodes do not have structure information and cannot be seen during the model training. To solve this problem, we propose a model called DEAL, which consists of three components: two node embedding encoders and one alignment mechanism. The two encoders aim to output the attribute-oriented node embedding and the structure-oriented node embedding, and the alignment mechanism aligns the two types of embeddings to build the connections between the attributes and links. Our model DEAL is versatile in the sense that it works for both inductive and transductive link prediction. Extensive experiments on several benchmark datasets show that our proposed model significantly outperforms existing inductive link prediction methods, and also outperforms the state-of-the-art methods on transductive link prediction.
Submitted 15 July, 2020;
originally announced July 2020.
-
Intelligent Credit Limit Management in Consumer Loans Based on Causal Inference
Authors:
Hang Miao,
Kui Zhao,
Zhun Wang,
Linbo Jiang,
Quanhui Jia,
Yanming Fang,
Quan Yu
Abstract:
Consumer loans now play an important role in promoting economic growth, and credit cards are the most popular form of consumer loan. One of the most essential parts of credit card operations is credit limit management. Traditionally, credit limits are adjusted based on limited heuristic strategies, which are developed by experienced professionals. In this paper, we present a data-driven approach to manage the credit limit intelligently. Firstly, conditional independence testing is conducted to acquire the data for building models. Based on these testing data, a response model is then built to measure the heterogeneous treatment effect of increasing credit limits (i.e. treatments) for different customers, who are described by several control variables (i.e. features). In order to incorporate the diminishing marginal effect, a carefully selected log transformation is applied to the treatment variable. Moreover, the model's capability can be further enhanced by applying a non-linear transformation to the features via GBDT encoding. Finally, a well-designed metric is proposed to properly measure the performance of the compared methods. The experimental results demonstrate the effectiveness of the proposed approach.
Submitted 10 July, 2020;
originally announced July 2020.
-
Multi-View Collaborative Network Embedding
Authors:
Sezin Kircali Ata,
Yuan Fang,
Min Wu,
Jiaqi Shi,
Chee Keong Kwoh,
Xiaoli Li
Abstract:
Real-world networks often exist with multiple views, where each view describes one type of interaction among a common set of nodes. For example, on a video-sharing network, while two user nodes are linked if they have common favorite videos in one view, they can also be linked in another view if they share common subscribers. Unlike traditional single-view networks, multiple views maintain different semantics to complement each other. In this paper, we propose MANE, a multi-view network embedding approach to learn low-dimensional representations. Similar to existing studies, MANE hinges on diversity and collaboration - while diversity enables views to maintain their individual semantics, collaboration enables views to work together. However, we also discover a novel form of second-order collaboration that has not been explored previously, and further unify it into our framework to attain superior node representations. Furthermore, as each view often has varying importance w.r.t. different nodes, we propose MANE+, an attention-based extension of MANE to model node-wise view importance. Finally, we conduct comprehensive experiments on three public, real-world multi-view networks, and the results demonstrate that our models consistently outperform state-of-the-art approaches.
Submitted 17 December, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
-
Infinite mixtures of multivariate normal-inverse Gaussian distributions for clustering of skewed data
Authors:
Yuan Fang,
Dimitris Karlis,
Sanjeena Subedi
Abstract:
Mixtures of multivariate normal inverse Gaussian (MNIG) distributions can be used to cluster data that exhibit features such as skewness and heavy tails. However, for cluster analysis using a traditional finite mixture model framework, the number of components either needs to be known $a$-$priori$ or needs to be estimated $a$-$posteriori$ using some model selection criterion after deriving results for a range of possible numbers of components. Moreover, different model selection criteria can sometimes result in different numbers of components, yielding uncertainty. Here, an infinite mixture model framework, also known as a Dirichlet process mixture model, is proposed for the mixtures of MNIG distributions. This Dirichlet process mixture model approach allows the number of components to grow or decay freely from 1 to $\infty$ (in practice from 1 to $N$), and the number of components is inferred along with the parameter estimates in a Bayesian framework, thus alleviating the need for model selection criteria. We provide real data applications with benchmark datasets as well as a small simulation experiment to compare with other existing models. The proposed method provides clustering results competitive with other clustering approaches for both simulated and real data, and parameter recovery is illustrated using simulation studies.
Submitted 11 May, 2020;
originally announced May 2020.
-
A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions
Authors:
Yuan Fang,
Dimitris Karlis,
Sanjeena Subedi
Abstract:
Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via Gibbs sampler is used; for this, a novel approach to simulate univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. The proposed algorithm will be applied to both simulated and real data. Through simulation studies and real data analysis, we show parameter recovery and that our approach provides competitive clustering results compared to other clustering approaches.
Submitted 5 May, 2020;
originally announced May 2020.
-
Large-scale Uncertainty Estimation and Its Application in Revenue Forecast of SMEs
Authors:
Zebang Zhang,
Kui Zhao,
Kai Huang,
Quanhui Jia,
Yanming Fang,
Quan Yu
Abstract:
The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society. Business credit loans are very important for the operation of SMEs, and revenue is a key indicator for credit limit management. Therefore, it is very beneficial to construct a reliable revenue forecasting model. If the uncertainty of an enterprise's revenue forecast can be estimated, a more appropriate credit limit can be granted. The natural gradient boosting approach estimates the uncertainty of a prediction with a multi-parameter boosting algorithm based on the natural gradient. However, its original implementation does not scale easily to big data scenarios and is computationally expensive compared to state-of-the-art tree-based models (such as XGBoost). In this paper, we propose Scalable Natural Gradient Boosting Machines, a method that is simple to implement, readily parallelizable, interpretable and yields high-quality predictive uncertainty estimates. According to the characteristics of the revenue distribution, we derive an uncertainty quantification function. We demonstrate that our method can distinguish between accurate and inaccurate samples in revenue forecasting for SMEs. Moreover, interpretability can be naturally obtained from the model, satisfying financial needs.
Submitted 2 May, 2020;
originally announced May 2020.
-
NetDP: An Industrial-Scale Distributed Network Representation Framework for Default Prediction in Ant Credit Pay
Authors:
Jianbin Lin,
Zhiqiang Zhang,
Jun Zhou,
Xiaolong Li,
Jingli Fang,
Yanming Fang,
Quan Yu,
Yuan Qi
Abstract:
Ant Credit Pay is a consumer credit service of Ant Financial Service Group. As with credit cards, loan default is one of the major risks of this credit product. Hence, an effective algorithm for default prediction is key to reducing losses and increasing profits for the company. However, the challenges in our scenario differ from those in conventional credit card services. The first is scalability. The huge volume of users and their behaviors in Ant Financial requires the ability to process industrial-scale data and train models efficiently. The second challenge is the cold-start problem. Unlike the manual review of credit card applications in conventional banks, the credit limit of Ant Credit Pay is offered to users automatically based on knowledge learned from big data. However, default prediction for new users suffers from a lack of sufficient credit behavior data, so the proposal should leverage other new data sources to alleviate the cold-start problem. Considering the above challenges and the special scenario in Ant Financial, we incorporate network information into default prediction to alleviate the cold-start problem. In this paper, we propose an industrial-scale distributed network representation framework, termed NetDP, for default prediction in Ant Credit Pay. The proposal explores network information generated by various interactions between users, and blends unsupervised and supervised network representation in a unified framework for the default prediction problem. Moreover, we present a parameter-server-based distributed implementation of our proposal to handle the scalability challenge. Experimental results demonstrate the effectiveness of our proposal, especially on the cold-start problem, as well as its efficiency on industrial-scale datasets.
Submitted 31 March, 2020;
originally announced April 2020.
-
A Semi-supervised Graph Attentive Network for Financial Fraud Detection
Authors:
Daixin Wang,
Jianbin Lin,
Peng Cui,
Quanhui Jia,
Zhen Wang,
Yanming Fang,
Quan Yu,
Jun Zhou,
Shuang Yang,
Yuan Qi
Abstract:
With the rapid growth of financial services, fraud detection has become a very important problem for guaranteeing a healthy environment for both users and providers. Conventional solutions for fraud detection mainly use rule-based methods or extract some features manually to perform prediction. However, in financial services, users have rich interactions and they themselves always show multifaceted information. These data form a large multiview network, which is not fully exploited by conventional methods. Additionally, only very few of the users in the network are labeled, which poses a great challenge for achieving satisfactory fraud detection performance from labeled data alone.
To address the problem, we expand the labeled data through their social relations to get the unlabeled data and propose a semi-supervised attentive graph neural network, named SemiGNN, to utilize the multi-view labeled and unlabeled data for fraud detection. Moreover, we propose a hierarchical attention mechanism to better correlate different neighbors and different views. At the same time, the attention mechanism makes the model interpretable, indicating which factors are important for fraud and why users are predicted as fraudulent. Experimentally, we conduct the prediction task on the users of Alipay, one of the largest third-party online and offline cashless payment platforms, serving more than 400 million users in China. By utilizing the social relations and the user attributes, our method achieves better accuracy than the state-of-the-art methods on two tasks. Moreover, the interpretable results also give interesting intuitions regarding the tasks.
Submitted 28 February, 2020;
originally announced March 2020.
-
Autoencoder Based Residual Deep Networks for Robust Regression Prediction and Spatiotemporal Estimation
Authors:
Lianfa Li,
Ying Fang,
Jun Wu,
Jinfeng Wang
Abstract:
To generalize well, a deep learning neural network often requires a large training sample. As hidden layers are added to increase learning ability, a neural network can suffer degradation in accuracy. Both issues could seriously limit the applicability of deep learning in some domains, particularly those involving predictions of continuous variables with small samples. Inspired by residual convolutional neural networks in computer vision and recent neuroscience findings of crucial shortcuts in the brain, we propose an autoencoder-based residual deep network for robust prediction. In a nested way, we leverage shortcut connections to implement residual mapping with a balanced structure for efficient propagation of error signals. The novel method is demonstrated on multiple datasets, on imputation of non-randomly missing values of aerosol optical depth at high spatiotemporal resolution, and on spatiotemporal estimation of fine particulate matter <2.5 μm, achieving cutting-edge accuracy and efficiency. Our approach is also a general-purpose regression learner applicable in diverse domains.
Submitted 28 December, 2018;
originally announced December 2018.
-
Tool Breakage Detection using Deep Learning
Authors:
Guang Li,
Xin Yang,
Duanbing Chen,
Anxing Song,
Yuke Fang,
Junlin Zhou
Abstract:
In manufacturing, steel and other metals are mainly cut and shaped during the fabrication process by computer numerical control (CNC) machines. To keep the productivity and efficiency of the fabrication process high, engineers need to monitor the real-time process of CNC machines and manage the lifetime of machine tools. In a real manufacturing process, breakage of machine tools usually happens without any indication; this problem has seriously affected the fabrication process for many years. Previous studies suggested many different approaches for monitoring and detecting the breakage of machine tools. However, there still exists a big gap between academic experiments and complex real fabrication processes, such as the high demands of real-time detection and the difficulty of data acquisition and transmission. In this work, we use the spindle current approach to detect the breakage of machine tools, which offers high-performance real-time monitoring, low cost, and easy installation. We analyze the features of the current of a milling machine spindle through tool wearing processes, and then we predict the status of tool breakage with a convolutional neural network (CNN). In addition, we use a BP neural network to understand the reliability of the CNN. The results show that our CNN approach can detect tool breakage with an accuracy of 93%, while the best performance of BP is 80%.
Submitted 16 August, 2018;
originally announced August 2018.
-
Bayesian Detection of Abnormal ADS in Mutant Caenorhabditis elegans Embryos
Authors:
Wei Liang,
Yuxiao Yang,
Yusi Fang,
Zhongying Zhao,
Jie Hu
Abstract:
Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on the comparison of asynchrony of division between sister cells (ADS) between wild-type and mutant individuals of Caenorhabditis elegans. Since the number of replicates of mutant individuals for each mutated gene, usually one, is far smaller than that of wild-type, direct comparison of the two distributions of ADS between wild-type and mutant type, such as with the Kolmogorov-Smirnov test, is not feasible. On the other hand, we find that ADS is sometimes correlated with the life span of the corresponding mother cell in wild-type. Hence, we apply a semiparametric Bayesian quantile regression method to estimate the 95% confidence interval curve of ADS with respect to the life span of the mother cell in wild-type individuals. Then, mutant-type ADSs outside the corresponding confidence interval are flagged as abnormal at a significance level of 0.05. A simulation study demonstrates the accuracy of our method, and Gene Enrichment Analysis validates the results on real data sets.
Submitted 13 March, 2018;
originally announced March 2018.
-
On Scalable Inference with Stochastic Gradient Descent
Authors:
Yixin Fang,
Jinfeng Xu,
Lei Yang
Abstract:
In many applications involving large datasets or online updating, stochastic gradient descent (SGD) provides a scalable way to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators were established decades ago, statistical inference such as interval estimation remains largely unexplored. Traditional resampling methods such as the bootstrap are not computationally feasible since they require repeatedly drawing independent samples from the entire dataset. The plug-in method is not applicable when there are no explicit formulas for the covariance matrix of the estimator. In this paper, we propose a scalable inferential procedure for stochastic gradient descent, which, upon the arrival of each observation, updates the SGD estimate as well as a large number of randomly perturbed SGD estimates. The proposed method is easy to implement in practice. We establish its theoretical properties for a general class of models that includes generalized linear models and quantile regression models as special cases. The finite-sample performance and numerical utility are evaluated by simulation studies and two real data applications.
Submitted 1 July, 2017;
originally announced July 2017.
-
Quasi-Reliable Estimates of Effective Sample Size
Authors:
Youhan Fang,
Yudong Cao,
Robert D. Skeel
Abstract:
The efficiency of a Markov chain Monte Carlo algorithm might be measured by the cost of generating one independent sample, or equivalently, the total cost divided by the effective sample size, defined in terms of the integrated autocorrelation time. To ensure the reliability of such an estimate, it is suggested that there be an adequate sampling of state space, to the extent that this can be determined from the available samples. A possible method for doing this is derived and evaluated.
Submitted 11 May, 2017; v1 submitted 10 May, 2017;
originally announced May 2017.
-
Additive Partially Linear Models for Massive Heterogeneous Data
Authors:
Binhuan Wang,
Yixin Fang,
Heng Lian,
Hua Liang
Abstract:
We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. We propose an aggregation type of estimators for the commonality parameters that possess the asymptotic optimal bounds and the asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components accordingly. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data.
Submitted 28 December, 2018; v1 submitted 13 January, 2017;
originally announced January 2017.
-
Penalized Weighted Least Squares for Outlier Detection and Robust Regression
Authors:
Xiaoli Gao,
Yixin Fang
Abstract:
To conduct regression analysis for data contaminated with outliers, many approaches have been proposed for simultaneous outlier detection and robust regression, and the approach proposed in this manuscript is one of them. This new approach is called "penalized weighted least squares" (PWLS). By assigning each observation an individual weight and incorporating a lasso-type penalty on the log-transformation of the weight vector, the PWLS is able to perform outlier detection and robust regression simultaneously. A Bayesian point of view of the PWLS is provided, and it is shown that the PWLS can be seen as an example of M-estimation. Two methods are developed for selecting the tuning parameter in the PWLS. The performance of the PWLS is demonstrated via simulations and real applications.
Submitted 23 March, 2016;
originally announced March 2016.
-
Sparse Convex Clustering
Authors:
Binhuan Wang,
Yilong Zhang,
Will Wei Sun,
Yixin Fang
Abstract:
Convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn recent attention since it nicely addresses the instability issue of traditional nonconvex clustering methods. Although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scenario, where the data contains a large number of features and many of them carry no information about the clustering structure. In this paper, we demonstrate that the performance of convex clustering can be distorted when uninformative features are included in the clustering. To overcome this, we introduce a new clustering method, referred to as Sparse Convex Clustering, to simultaneously cluster observations and conduct feature selection. The key idea is to formulate convex clustering as a regularization problem, with an adaptive group-lasso penalty term on cluster centers. In order to optimally balance the tradeoff between cluster fitting and sparsity, a tuning criterion based on clustering stability is developed. In theory, we provide an unbiased estimator for the degrees of freedom of the proposed sparse convex clustering method. Finally, the effectiveness of sparse convex clustering is examined through a variety of numerical experiments and a real data application.
Submitted 10 February, 2017; v1 submitted 18 January, 2016;
originally announced January 2016.
-
Flexible combination of multiple diagnostic biomarkers to improve diagnostic accuracy
Authors:
Tu Xu,
Yixin Fang,
Alan Rong,
Junhui Wang
Abstract:
In medical research, it is common to collect information on multiple continuous biomarkers to improve the accuracy of diagnostic tests. Combining the measurements of these biomarkers into one single score is a popular practice to integrate the collected information, where the accuracy of the resultant diagnostic test is usually improved. To measure the accuracy of a diagnostic test, the Youden index has been widely used in the literature. Various parametric and nonparametric methods have been proposed to linearly combine biomarkers so that the corresponding Youden index can be optimized. Yet there seems to be little justification for enforcing such a linear combination. This paper proposes a flexible approach that allows both linear and nonlinear combinations of biomarkers. The proposed approach formulates the problem in a large margin classification framework, where the combination function is embedded in a flexible reproducing kernel Hilbert space. Advantages of the proposed approach are demonstrated in a variety of simulated experiments as well as a real application to a liver disorder study.
Submitted 7 July, 2015; v1 submitted 13 March, 2015;
originally announced March 2015.
-
Compressible Generalized Hybrid Monte Carlo
Authors:
Youhan Fang,
Jesus-Maria Sanz-Serna,
Robert D. Skeel
Abstract:
One of the most demanding calculations is to generate random samples from a specified probability distribution (usually with an unknown normalizing prefactor) in a high-dimensional configuration space. One often has to resort to using a Markov chain Monte Carlo method, which converges only in the limit to the prescribed distribution. Such methods typically inch through configuration space step by step, with acceptance of a step based on a Metropolis(-Hastings) criterion. An acceptance rate of 100% is possible in principle by embedding configuration space in a higher-dimensional phase space and using ordinary differential equations. In practice, numerical integrators must be used, lowering the acceptance rate. This is the essence of hybrid Monte Carlo methods. Presented is a general framework for constructing such methods under relaxed conditions: the only geometric property needed is (weakened) reversibility; volume preservation is not needed. The possibilities are illustrated by deriving a couple of explicit hybrid Monte Carlo methods, one based on barrier-lowering variable-metric dynamics and another based on isokinetic dynamics.
Submitted 27 February, 2014;
originally announced February 2014.
-
A model-free estimation for the covariate-adjusted Youden index and its associated cut-point
Authors:
Tu Xu,
Junhui Wang,
Yixin Fang
Abstract:
In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased and non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In the literature, estimation of the Youden index has been widely studied via various statistical modeling strategies on the conditional density. This paper proposes a new model-free estimation method, which directly estimates the covariate-adjusted cut-point without estimating the conditional density. Consequently, the covariate-adjusted Youden index can be estimated based on the estimated cut-point. The proposed method formulates the estimation problem in a large margin classification framework, which allows flexible modeling of the covariate-adjusted Youden index through kernel machines. The advantage of the proposed method is demonstrated in a variety of simulated experiments as well as a real application to the Pima Indians diabetes study.
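For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of the plain (unadjusted) empirical Youden index, J = sensitivity + specificity - 1, maximized over candidate cut-points. It is not the covariate-adjusted, large-margin estimator proposed in the paper; the function name `youden_index`, the "score > cut means diseased" convention, and the toy data are assumptions made for illustration.

```python
import numpy as np

def youden_index(scores, labels):
    """Return (J, cut) maximizing sensitivity + specificity - 1.

    Assumed convention: a subject is called diseased when score > cut.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diseased, healthy = scores[labels], scores[~labels]
    best_j, best_cut = -np.inf, None
    # Every observed score is a candidate cut-point.
    for cut in np.unique(scores):
        sensitivity = np.mean(diseased > cut)   # true positive rate
        specificity = np.mean(healthy <= cut)   # true negative rate
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_cut = float(j), float(cut)
    return best_j, best_cut

# Hypothetical toy data: four non-diseased and three diseased subjects.
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
j, cut = youden_index(scores, labels)
```

Ties between cut-points are resolved in favor of the smallest one here, which is an arbitrary choice of this sketch.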
Submitted 8 February, 2014;
originally announced February 2014.
-
Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices
Authors:
Yixin Fang,
Binhuan Wang,
Yang Feng
Abstract:
Recently, many regularized estimators of large covariance matrices have been proposed, and the tuning parameters in these estimators are usually selected via cross-validation. However, there is no guideline on the number of folds for conducting cross-validation and there is no comparison between cross-validation and the methods based on bootstrap. Through extensive simulations, we suggest that 10-fold cross-validation (nine-tenths for training and one-tenth for validation) is appropriate when the estimation accuracy is measured in the Frobenius norm, while 2-fold cross-validation (half for training and half for validation) or reverse 3-fold cross-validation (one-third for training and two-thirds for validation) is appropriate in the operator norm. We also suggest that the "optimal" cross-validation is more appropriate than the methods based on bootstrap for both types of norm.
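The cross-validation scheme described above can be sketched as follows, assuming a hard-thresholding estimator and a squared Frobenius-norm validation loss. The choice of estimator, the helper names `hard_threshold` and `cv_select_threshold`, and the simulated data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def hard_threshold(cov, lam):
    """Zero out off-diagonal entries whose absolute value is below lam."""
    out = np.where(np.abs(cov) >= lam, cov, 0.0)
    np.fill_diagonal(out, np.diag(cov))  # variances are never thresholded
    return out

def cv_select_threshold(X, lams, n_folds=10, seed=0):
    """Pick the lam with the smallest average Frobenius validation loss."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    losses = np.zeros(len(lams))
    for val_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False
        cov_train = np.cov(X[train_mask], rowvar=False)
        cov_val = np.cov(X[val_idx], rowvar=False)
        for i, lam in enumerate(lams):
            diff = hard_threshold(cov_train, lam) - cov_val
            losses[i] += np.linalg.norm(diff, "fro") ** 2
    losses /= n_folds
    return lams[int(np.argmin(losses))], losses

# Hypothetical data: 200 observations of 5 independent standard normals.
X = np.random.default_rng(1).standard_normal((200, 5))
lams = np.linspace(0.0, 1.0, 11)
best_lam, losses = cv_select_threshold(X, lams, n_folds=10)
```

With `n_folds=10`, each round trains on nine-tenths of the rows and validates on the remaining tenth. The 2-fold variant follows from `n_folds=2`; the reverse 3-fold variant would need the roles of the training and validation index sets swapped, which this sketch does not implement.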
Submitted 15 August, 2013;
originally announced August 2013.
-
A note on selection stability: combining stability and prediction
Authors:
Yixin Fang,
Junhui Wang,
Wei Sun
Abstract:
Recently, many regularized procedures have been proposed for variable selection in linear regression, but their performance depends on the tuning parameter selection. Here a criterion for the tuning parameter selection is proposed, which combines the strengths of both stability selection and cross-validation and is therefore referred to as the prediction and stability selection (PASS). The selection consistency is established assuming the data generating model is a subset of the full model, and the small sample performance is demonstrated through some simulation studies where the assumption either holds or is violated.
Submitted 29 January, 2013;
originally announced January 2013.
-
Consistent selection of tuning parameters via variable selection stability
Authors:
Wei Sun,
Junhui Wang,
Yixin Fang
Abstract:
Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in the literature, their performance largely depends on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross-validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on a novel concept of variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. The effectiveness of the proposed criterion is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data.
Submitted 13 December, 2013; v1 submitted 16 August, 2012;
originally announced August 2012.
-
A divergence formula for regularization methods with an L2 constraint
Authors:
Yixin Fang,
Yuanjia Wang,
Xin Huang
Abstract:
We derive a divergence formula for a group of regularization methods with an L2 constraint. The formula is useful for regularization parameter selection, because it provides an unbiased estimate for the number of degrees of freedom. We begin by deriving the formula for smoothing splines and then extend it to other settings such as penalized splines, ridge regression, and functional linear regression.
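As background for the ridge regression case, the sketch below computes the standard effective degrees of freedom for the penalized form with a fixed penalty λ, namely the trace of the hat matrix X(XᵀX + λI)⁻¹Xᵀ, via singular values. This is the textbook fixed-λ quantity and not the divergence formula derived in the paper for the constrained form; the function name `ridge_degrees_of_freedom` and the simulated design matrix are assumptions for illustration.

```python
import numpy as np

def ridge_degrees_of_freedom(X, lam):
    """Trace of the ridge hat matrix X (X'X + lam I)^{-1} X' for fixed lam."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return float(np.sum(d**2 / (d**2 + lam)))

# Hypothetical design matrix: 20 observations, 3 predictors.
X = np.random.default_rng(0).standard_normal((20, 3))
df_small = ridge_degrees_of_freedom(X, 0.1)
df_large = ridge_degrees_of_freedom(X, 100.0)

# Cross-check against the explicit trace of the hat matrix.
hat = X @ np.linalg.solve(X.T @ X + 0.1 * np.eye(3), X.T)
df_trace = float(np.trace(hat))
```

For a full-column-rank design the value approaches the number of predictors as λ goes to 0 and shrinks toward 0 as λ grows, which is why it serves as a model-complexity measure when selecting the regularization parameter.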
Submitted 15 March, 2012;
originally announced March 2012.