Search | arXiv e-print repository

Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

Authors: Yaoyu Zhang, Leyang Zhang, Zhongwang Zhang, Zhiwei Bai

Abstract: Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.

Submitted 25 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2211.11623

arXiv:2403.07318 [pdf, ps, other]

Test for high-dimensional linear hypothesis of mean vectors via random integration

Authors: Jianghao Li, Shizhe Hong, Zhenzhen Niu, Zhidong Bai

Abstract: In this paper, we investigate hypothesis testing for the linear combination of mean vectors across multiple populations through the method of random integration. We have established the asymptotic distributions of the test statistics under both null and alternative hypotheses. Additionally, we provide a theoretical explanation for the special use of our test statistics in situations when the nonzero signal in the linear combination of the true mean vectors is weakly dense. Moreover, Monte-Carlo simulations are presented to evaluate the suggested test against existing high-dimensional tests. The findings from these simulations reveal that our test not only aligns with the performance of other tests in terms of size but also exhibits superior power.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.05760 [pdf, other]

Simultaneous test of the mean vectors and covariance matrices for high-dimensional data using RMT

Authors: Zhenzhen Niu, Jianghao Li, Wenya Luo, Zhidong Bai

Abstract: In this paper, we propose a new modified likelihood ratio test (LRT) for simultaneously testing mean vectors and covariance matrices of two-sample populations in high-dimensional settings. By employing tools from Random Matrix Theory (RMT), we derive the limiting null distribution of the modified LRT for generally distributed populations. Furthermore, we compare the proposed test with existing tests using simulation results, demonstrating that the modified LRT exhibits favorable properties in terms of both size and power.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.03933 [pdf]

Development of a Evaluation Tool for Age-Appropriate Software in Aging Environments: A Delphi Study

Authors: Zhenggang Bai, Yougxiang Fang, Hongtu Chen, Xinru Chen, Ning An, Min Zhang, Guoxin Rui, Jing Jin

Abstract: Objective: We aimed to develop a dependable, reliable tool for assessing software age-appropriateness. Methods: We conducted a systematic review to get the indicators of technology age-appropriateness from studies from January 2000 to April 2023. This study engaged 25 experts from the fields of anthropology, sociology, and social technology research across three rounds of Delphi consultations. Experts were asked to screen, assess, add and provide feedback on the preliminary indicators identified in the initial indicator pool. Result: We extracted 76 criteria for evaluating quality, grouped into 11 distinct domains. After completing three rounds of Delphi consultations, experts drew upon their personal experiences, theoretical frameworks, and industry insights to arrive at a three-dimensional structure for the evaluation tool: user experience, product quality, and social promotion. These metrics were further distilled into a 16-item scale, and a corresponding questionnaire was formulated. The developed tool exhibited strong internal reliability (Cronbach's Alpha is 0.867) and content validity (S-CVI is 0.93). Conclusion: This tool represents a straightforward, objective, and reliable mechanism for evaluating software's appropriateness across age groups. Moreover, it offers valuable insights and practical guidance for designing and developing high-quality age-appropriate software, and assists age groups in selecting software they like.

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.17143 [pdf, other]

Test for high-dimensional mean vectors via the weighted $L_2$-norm

Authors: Jianghao Li, Zhenzhen Niu, Shizhe Hong, Zhidong Bai

Abstract: In this paper, we propose a novel approach to test the equality of high-dimensional mean vectors of several populations via the weighted $L_2$-norm. We establish the asymptotic normality of the test statistics under the null hypothesis. We also explain theoretically why our test statistics can be highly useful in weakly dense cases when the nonzero signal in mean vectors is present. Furthermore, we compare the proposed test with existing tests using simulation results, demonstrating that the weighted $L_2$-norm-based test statistic exhibits favorable properties in terms of both size and power.

Submitted 31 January, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

arXiv:2307.08921 [pdf, other]

Optimistic Estimate Uncovers the Potential of Nonlinear Models

Authors: Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu

Abstract: We propose an optimistic estimate to evaluate the best possible fitting performance of nonlinear models. It yields an optimistic sample size that quantifies the smallest possible sample size to fit/recover a target function using a nonlinear model. We estimate the optimistic sample sizes for matrix factorization models, deep models, and deep neural networks (DNNs) with fully-connected or convolutional architecture. For each nonlinear model, our estimates predict a specific subset of targets that can be fitted at overparameterization, which are confirmed by our experiments. Our optimistic estimate reveals two special properties of the DNN models -- free expressiveness in width and costly expressiveness in connection. These properties suggest the following architecture design principles of DNNs: (i) feel free to add neurons/kernels; (ii) restrain from connecting neurons. Overall, our optimistic estimate theoretically unveils the vast potential of nonlinear models in fitting at overparameterization. Based on this framework, we anticipate gaining a deeper understanding of how and why numerous nonlinear models such as DNNs can effectively realize their potential in practice in the near future.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2303.17230 [pdf, other]

KOO approach for scalable variable selection problem in large-dimensional regression

Authors: Zhidong Bai, Kwok Pui Choi, Yasunori Fujikoshi, Jiang Hu

Abstract: An important issue in many multivariate regression problems is to eliminate candidate predictors with null predictor vectors. In large-dimensional (LD) setting where the numbers of responses and predictors are large, model selection encounters the scalability challenge. Knock-one-out (KOO) statistics hold promise to meet this challenge. In this paper, the almost sure limits and the central limit theorem of the KOO statistics are derived under the LD setting and mild distributional assumptions (finite fourth moments) of the errors. These theoretical results guarantee the strong consistency of a subset selection rule based on the KOO statistics with a general threshold. For enhancing the robustness of the selection rule, we also propose a bootstrap threshold for the KOO approach. Simulation results support our conclusions and demonstrate the selection probabilities by the KOO approach with the bootstrap threshold outperform the methods using Akaike information threshold, Bayesian information threshold and Mallow's C$_p$ threshold. We compare the proposed KOO approach with those based on information threshold to a chemometrics dataset and a yeast cell-cycle dataset, which suggests our proposed method identifies useful models.

Submitted 25 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2211.15982 [pdf, ps, other]

Revisit of a Diaconis urn model

Authors: Li Yang, Jiang Hu, Zhidong Bai

Abstract: Let $G$ be a finite Abelian group of order $d$. We consider an urn in which, initially, there are labeled balls that generate the group $G$. Choosing two balls from the urn with replacement, observe their labels, and perform a group multiplication on the respective group elements to obtain a group element. Then, we put a ball labeled with that resulting element into the urn. This model was formulated by P. Diaconis while studying a group theoretic algorithm called MeatAxe (Holt and Rees (1994)). Siegmund and Yakir (2004) partially investigated this model. In this paper, we further investigate and generalize this model. More specifically, we allow a random number of balls to be drawn from the urn at each stage in the Diaconis urn model. For such a case, we verify that the normalized urn composition converges almost surely to the uniform distribution on the group $G$. Moreover, we obtain the asymptotic joint distribution of the urn composition by using the martingale central limit theorem.

Submitted 29 November, 2022; originally announced November 2022.
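
The basic two-ball urn described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' code: it takes $G$ to be the cyclic group $\mathbb{Z}_d$ written additively (so the group operation is addition modulo $d$), and the function name `simulate_diaconis_urn` and its parameters are hypothetical. It does not cover the generalization in which a random number of balls is drawn.

```python
import random
from collections import Counter


def simulate_diaconis_urn(d, initial_labels, steps, seed=0):
    """Simulate the two-ball urn on the cyclic group Z_d (an assumed choice of G).

    d              -- order of the group Z_d
    initial_labels -- labels of the balls initially in the urn (should generate Z_d)
    steps          -- number of balls to add
    Returns the normalized urn composition as a list of length d.
    """
    rng = random.Random(seed)
    urn = [label % d for label in initial_labels]
    for _ in range(steps):
        # Draw two balls with replacement and read their labels.
        a = rng.choice(urn)
        b = rng.choice(urn)
        # Apply the group operation (addition mod d) and add the resulting ball.
        urn.append((a + b) % d)
    counts = Counter(urn)
    total = len(urn)
    return [counts.get(g, 0) / total for g in range(d)]


if __name__ == "__main__":
    # Start from a single ball labeled 1, which generates Z_5.
    composition = simulate_diaconis_urn(d=5, initial_labels=[1], steps=20000)
    print(composition)
```

Running the script prints the empirical proportion of each group element, which can be compared informally with the uniform value $1/d$ that the abstract identifies as the almost-sure limit.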

arXiv:2211.11891 [pdf, other]

A Bi-level Nonlinear Eigenvector Algorithm for Wasserstein Discriminant Analysis

Authors: Dong Min Roh, Zhaojun Bai, Ren-Cang Li

Abstract: Much like the classical Fisher linear discriminant analysis (LDA), the recently proposed Wasserstein discriminant analysis (WDA) is a linear dimensionality reduction method that seeks a projection matrix to maximize the dispersion of different data classes and minimize the dispersion of same data classes via a bi-level optimization. In contrast to LDA, WDA can account for both global and local interconnections between data classes by using the underlying principles of optimal transport. In this paper, a bi-level nonlinear eigenvector algorithm (WDA-nepv) is presented to fully exploit the structures of the bi-level optimization of WDA. The inner level of WDA-nepv for computing the optimal transport matrices is formulated as an eigenvector-dependent nonlinear eigenvalue problem (NEPv), and meanwhile, the outer level for trace ratio optimizations is formulated as another NEPv. Both NEPvs can be computed efficiently under the self-consistent field (SCF) framework. WDA-nepv is derivative-free and surrogate-model-free when compared with existing algorithms. Convergence analysis of the proposed WDA-nepv justifies the utilization of the SCF for solving the bi-level optimization of WDA. Numerical experiments with synthetic and real-life datasets demonstrate the classification accuracy and scalability of WDA-nepv.

Submitted 27 July, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.11623 [pdf, other]

Linear Stability Hypothesis and Rank Stratification for Nonlinear Models

Authors: Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu

Abstract: Models with nonlinear architectures/parameterizations such as deep neural networks (DNNs) are well known for their mysteriously good generalization performance at overparameterization. In this work, we tackle this mystery from a novel perspective focusing on the transition of the target recovery/fitting accuracy as a function of the training data size. We propose a rank stratification for general nonlinear models to uncover a model rank as an "effective size of parameters" for each function in the function space of the corresponding model. Moreover, we establish a linear stability theory proving that a target function almost surely becomes linearly stable when the training data size equals its model rank. Supported by our experiments, we propose a linear stability hypothesis that linearly stable functions are preferred by nonlinear training. By these results, model rank of a target function predicts a minimal training data size for its successful recovery. Specifically for the matrix factorization model and DNNs of fully-connected or convolutional architectures, our rank stratification shows that the model rank for specific target functions can be much lower than the size of model parameters. This result predicts the target recovery capability even at heavy overparameterization for these nonlinear models as demonstrated quantitatively by our experiments. Overall, our work provides a unified framework with quantitative prediction power to understand the mysterious target recovery behavior at overparameterization for general nonlinear models.

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2210.16435 [pdf, other]

Scalable Spectral Clustering with Group Fairness Constraints

Authors: Ji Wang, Ding Lu, Ian Davidson, Zhaojun Bai

Abstract: There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity where in each cluster, each protected group is represented with the same proportion as in the entirety. While FairSC algorithm (Kleindessner et al., 2019) is able to find the fairer clustering, it is compromised by high costs due to the kernels of computing nullspaces and the square roots of dense matrices explicitly. We present a new formulation of underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, only involves the sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clustering. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that the computational costs of s-FairSC only increase marginally compared to the SC without fairness constraints.

Submitted 14 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Journal ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6613-6629, 2023
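
The statistical-parity notion stated in the abstract above (each protected group appears in every cluster with the same proportion as in the whole dataset) can be checked for a given clustering with a few lines of code. The following is a hedged sketch of that definition only, not of FairSC or s-FairSC; the helper names `group_proportions` and `parity_gap` are hypothetical.

```python
from collections import Counter


def group_proportions(group_labels):
    """Return the share of each protected group within a list of group labels."""
    counts = Counter(group_labels)
    n = len(group_labels)
    return {group: count / n for group, count in counts.items()}


def parity_gap(cluster_labels, group_labels):
    """Largest absolute deviation between a group's share inside any cluster
    and its share in the whole dataset. A gap of 0 means the clustering
    satisfies statistical parity exactly."""
    overall = group_proportions(group_labels)
    worst = 0.0
    for cluster in set(cluster_labels):
        # Group labels of the points assigned to this cluster.
        members = [g for c, g in zip(cluster_labels, group_labels) if c == cluster]
        local = group_proportions(members)
        for group, share in overall.items():
            worst = max(worst, abs(local.get(group, 0.0) - share))
    return worst


if __name__ == "__main__":
    groups = ["a", "b", "a", "b"]
    print(parity_gap([0, 0, 1, 1], groups))  # each cluster mirrors the overall mix
    print(parity_gap([0, 1, 0, 1], groups))  # clusters coincide with the groups
```

A gap near zero indicates a fair clustering in this sense, while a large gap flags clusters dominated by one group.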

arXiv:2210.03859 [pdf, other]

Spectrally-Corrected and Regularized Linear Discriminant Analysis for Spiked Covariance Model

Authors: Hua Li, Wenya Luo, Zhidong Bai, Huanchao Zhou, Zhangni Pu

Abstract: This paper proposes an improved linear discriminant analysis called spectrally-corrected and regularized LDA (SRLDA). This method integrates the design ideas of the sample spectrally-corrected covariance matrix and the regularized discriminant analysis. With the support of a large-dimensional random matrix analysis framework, it is proved that SRLDA has a linear classification global optimal solution under the spiked model assumption. According to simulation data analysis, the SRLDA classifier performs better than RLDA and ILDA and is closer to the theoretical classifier. Experiments on different data sets show that the SRLDA algorithm performs better in classification and dimensionality reduction than currently used tools.

Submitted 8 March, 2024; v1 submitted 7 October, 2022; originally announced October 2022.

arXiv:2007.02376 [pdf, other]

doi 10.1145/3394486.3403173

Block Model Guided Unsupervised Feature Selection

Authors: Zilong Bai, Hoa Nguyen, Ian Davidson

Abstract: Feature selection is a core area of data mining with a recent innovation of graph-driven unsupervised feature selection for linked data. In this setting we have a dataset $\mathbf{Y}$ consisting of $n$ instances each with $m$ features and a corresponding $n$ node graph (whose adjacency matrix is $\mathbf{A}$) with an edge indicating that the two instances are similar. Existing efforts for unsupervised feature selection on attributed networks have explored either directly regenerating the links by solving for $f$ such that $f(\mathbf{y}_i,\mathbf{y}_j) \approx \mathbf{A}_{i,j}$ or finding community structure in $\mathbf{A}$ and using the features in $\mathbf{Y}$ to predict these communities. However, graph-driven unsupervised feature selection remains an understudied area with respect to exploring more complex guidance. Here we take the novel approach of first building a block model on the graph and then using the block model for feature selection. That is, we discover $\mathbf{F}\mathbf{M}\mathbf{F}^T \approx \mathbf{A}$ and then find a subset of features $\mathcal{S}$ that induces another graph to preserve both $\mathbf{F}$ and $\mathbf{M}$. We call our approach Block Model Guided Unsupervised Feature Selection (BMGUFS). Experimental results show that our method outperforms the state of the art on several real-world public datasets in finding high-quality features for clustering.

Submitted 5 July, 2020; originally announced July 2020.

Comments: Published at KDD2020

Journal ref: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD2020)

arXiv:2005.04557 [pdf, other]

A Multi-Variate Triple-Regression Forecasting Algorithm for Long-Term Customized Allergy Season Prediction

Authors: Xiaoyu Wu, Zeyu Bai, Jianguo Jia, Youzhi Liang

Abstract: In this paper, we propose a novel multi-variate algorithm using a triple-regression methodology to predict the airborne-pollen allergy season that can be customized for each patient in the long term. To improve the prediction accuracy, we first perform a pre-processing to integrate the historical data of pollen concentration and various inferential signals from other covariates such as the meteorological data. We then propose a novel algorithm which encompasses three-stage regressions: in Stage 1, a regression model to predict the start/end date of an airborne-pollen allergy season is trained from a feature matrix extracted from 12 time series of the covariates with a rolling window; in Stage 2, a regression model to predict the corresponding uncertainty is trained based on the feature matrix and the prediction result from Stage 1; in Stage 3, a weighted linear regression model is built upon prediction results from Stage 1 and 2. It is observed and proved that Stage 3 contributes to the improved forecasting accuracy and the reduced uncertainty of the multi-variate triple-regression algorithm. Based on different allergy sensitivity levels, the triggering concentration of the pollen (the definition of the allergy season) can be customized individually. In our backtesting, a mean absolute error (MAE) of 4.7 days was achieved using the algorithm. We conclude that this algorithm could be applicable in both generic and long-term forecasting problems.

Submitted 10 December, 2020; v1 submitted 9 May, 2020; originally announced May 2020.

Comments: 4 pages, 4 figures

arXiv:1909.11527 [pdf, ps, other]

A Self-consistent-field Iteration for Orthogonal Canonical Correlation Analysis

Authors: Leihong Zhang, Li Wang, Zhaojun Bai, Ren-cang Li

Abstract: We propose an efficient algorithm for solving orthogonal canonical correlation analysis (OCCA) in the form of trace-fractional structure and orthogonal linear projections. Even though orthogonality has been widely used and proved to be a useful criterion for pattern recognition and feature extraction, existing methods for solving the OCCA problem are either numerically unstable by relying on a deflation scheme, or less efficient by directly using generic optimization methods. In this paper, we propose an alternating numerical scheme whose core is the sub-maximization problem in the trace-fractional form with an orthogonal constraint. A customized self-consistent-field (SCF) iteration for this sub-maximization problem is devised. It is proved that the SCF iteration is globally convergent to a KKT point and that the alternating numerical scheme always converges. We further formulate a new trace-fractional maximization problem for orthogonal multiset CCA (OMCCA) and then propose an efficient algorithm with either a Jacobi-style or a Gauss-Seidel-style updating scheme based on the same SCF iteration. Extensive experiments are conducted to evaluate the proposed algorithms against existing methods including two real world applications: multi-label classification and multi-view feature extraction. Experimental results show that our methods not only perform competitively to or better than baselines but also are more efficient.

Submitted 25 September, 2019; originally announced September 2019.

arXiv:1906.06713 [pdf, ps, other]

Community Detection Based on the $L_\infty$ convergence of eigenvectors in DCBM

Authors: Yan Liu, Zhiqiang Hou, Zhigang Yao, Zhidong Bai, Jiang Hu, Shurong Zheng

Abstract: Spectral clustering is one of the most popular algorithms for community detection in network analysis. Based on this rationale, in this paper we give the convergence rate of eigenvectors for the adjacency matrix in the $l_\infty$ norm, under the stochastic block model (BM) and degree corrected stochastic block model (DCBM), adding some mild and rational conditions. We also extend this result to a more general model, presented based on the DCBM such that the value of random variables in the adjacency matrix is not 0 or 1, but an arbitrary real number. During the process of proving the above conclusion, we obtain the relationship of the eigenvalues in the adjacency matrix and the corresponding `population' matrix, which vary in dimension from the community-wise edge probability matrix. Using that result, we can give an estimate of the number of the communities in a known set of network data. Meanwhile we proved the consistency of the estimator. Furthermore, according to the derivation of proof for the convergence of eigenvectors, we propose a new approach to community detection -- Spectral Clustering based on Difference of Ratios of Eigenvectors (SCDRE). Our simulation experiments demonstrate the superiority of our method in community detection.

Submitted 16 June, 2019; originally announced June 2019.

Comments: 28 pages, 2 figures

arXiv:1903.01734 [pdf]

A Novel Efficient Approach with Data-Adaptive Capability for OMP-based Sparse Subspace Clustering

Authors: Jiaqiyu Zhan, Zhiqiang Bai, Yuesheng Zhu

Abstract: Orthogonal Matching Pursuit (OMP) plays an important role in data science and its applications such as sparse subspace clustering and image processing. However, the existing OMP-based approaches lack data adaptiveness, so the data cannot be represented well enough and accuracy may be lost. This paper proposes a novel approach to enhance the data-adaptive capability for OMP-based sparse subspace clustering. In our method a parameter selection process is developed to adjust the parameters based on the data distribution for information representation. Our theoretical analysis indicates that the parameter selection process can efficiently coordinate with any OMP-based methods to improve the clustering performance. Also a new Self-Expressive-Affinity (SEA) ratio metric is defined to measure the sparse representation conversion efficiency for spectral clustering to obtain data segmentations. Our experiments show that the proposed approach can achieve better performance compared with other OMP-based sparse subspace clustering algorithms in terms of clustering accuracy, SEA ratio and representation quality, while also keeping the time efficiency and anti-noise ability.

Submitted 30 August, 2019; v1 submitted 5 March, 2019; originally announced March 2019.

arXiv:1808.05362 [pdf, other]

Generalized Four Moment Theorem and an Application to CLT for Spiked Eigenvalues of Large-dimensional Covariance Matrices

Authors: Dandan Jiang, Zhidong Bai

Abstract: We consider a more generalized spiked covariance matrix $Σ$, which is a general non-definite matrix with the spiked eigenvalues scattered into a few bulks and the largest ones allowed to tend to infinity. By relaxing the matching of the 4th moment to a tail probability decay, a {\it Generalized Four Moment Theorem} (G4MT) is proposed to show the universality of the asymptotic law for the local spectral statistics of generalized spiked covariance matrices, which implies the limiting distribution of the spiked eigenvalues of the generalized spiked covariance matrix is independent of the actual distributions of the samples satisfying our relaxed assumptions. Moreover, by applying it to the Central Limit Theorem (CLT) for the spiked eigenvalues of the generalized spiked covariance matrix, we also extend the result of Bai and Yao (2012) to a general form of the population covariance matrix, where the 4th moment is not necessarily required to exist and the spiked eigenvalues are allowed to be dependent on the non-spiked ones, thus meeting the actual cases better.

Submitted 24 April, 2019; v1 submitted 16 August, 2018; originally announced August 2018.

Comments: 48 pages, 4 figures, 5 tables

MSC Class: 60B20; 62H25; 60F05; 62H10

arXiv:1707.01225 [pdf, other]

Estimating the Number of Sources in Magnetoencephalography Using Spiked Population Eigenvalues

Authors: Zhigang Yao, Ye Zhang, Zhidong Bai, William F. Eddy

Abstract: Magnetoencephalography (MEG) is an advanced imaging technique used to measure the magnetic fields outside the human head produced by the electrical activity inside the brain. Various source localization methods in MEG require the knowledge of the underlying active sources, which are identified a priori. Common methods used to estimate the number of sources include principal component analysis or information criterion methods, both of which make use of the eigenvalue distribution of the data, thus avoiding solving the time-consuming inverse problem. Unfortunately, all these methods are very sensitive to the signal-to-noise ratio (SNR), as examining the sample extreme eigenvalues does not necessarily reflect the perturbation of the population ones. To uncover the unknown sources from the very noisy MEG data, we introduce a framework, referred to as the intrinsic dimensionality (ID) of the optimal transformation for the SNR rescaling functional. It is defined as the number of the spiked population eigenvalues of the associated transformed data matrix. It is shown that the ID yields a more reasonable estimate for the number of sources than its sample counterparts, especially when the SNR is small. By means of examples, we illustrate that the new method is able to capture the number of signal sources in MEG that can escape PCA or other information criterion based methods.

Submitted 5 July, 2017; originally announced July 2017.

Comments: 38 pages, 8 figures, 4 tables

arXiv:1703.01102 [pdf, ps, other]

doi 10.1371/journal.pone.0185155

A New Test of Multivariate Nonlinear Causality

Authors: Zhidong Bai, Yongchang Hui, Zhihui Lv, Wing-Keung Wong, Shurong Zheng, Zhenzhen Zhu

Abstract: The multivariate nonlinear Granger causality developed by Bai et al. (2010) plays an important role in detecting the dynamic interrelationships between two groups of variables. Following the idea of the Hiemstra-Jones (HJ) test proposed by Hiemstra and Jones (1994), they attempt to establish a central limit theorem (CLT) of their test statistic by applying the asymptotic properties of multivariate $U$-statistics. However, Bai et al. (2016) revisit the HJ test and find that the test statistic given by HJ is NOT a function of $U$-statistics, which implies that neither the CLT proposed by Hiemstra and Jones (1994) nor the one extended by Bai et al. (2010) is valid for statistical inference. In this paper, we re-estimate the probabilities and reestablish the CLT of the new test statistic. Numerical simulation shows that our new estimates are consistent and our new test has decent size and power.

Submitted 3 March, 2017; originally announced March 2017.

Comments: 20 pages. arXiv admin note: substantial text overlap with arXiv:1701.03992

arXiv:1701.03992 [pdf, ps, other]

The Hiemstra-Jones Test Revisited

Authors: Zhidong Bai, Yongchang Hui, Zhihui Lv, Wing-Keung Wong, Zhen-Zhen Zhu

Abstract: The famous Hiemstra-Jones (HJ) test developed by Hiemstra and Jones (1994) plays a significant role in studying nonlinear causality. Over the last two decades, there have been numerous applications and theoretical extensions based on this pioneering work. However, several works note that counterintuitive results are obtained from the HJ test, and some researchers find that the HJ test is seriously over-rejecting in simulation studies. In this paper, we reinvestigate HJ's creative 1994 work and find that their proposed estimators of the probabilities over different time intervals were not consistent with the target ones proposed in their criterion. To test HJ's novel hypothesis on Granger causality, we propose new estimators of the probabilities defined in their paper and reestablish the asymptotic properties to induce new tests similar to those of HJ. Some simulations will also be presented to support our findings.

Submitted 14 January, 2017; originally announced January 2017.

arXiv:1404.6633 [pdf, ps, other]

Substitution principle for CLT of linear spectral statistics of high-dimensional sample covariance matrices with applications to hypothesis testing

Authors: Shurong Zheng, Z. D. Bai, Jiangfeng Yao

Abstract: Sample covariance matrices are widely used in multivariate statistical analysis. The central limit theorems (CLT's) for linear spectral statistics of high-dimensional non-centered sample covariance matrices have received considerable attention in random matrix theory and have been applied to many high-dimensional statistical problems. However, known population mean vectors are assumed for non-centered sample covariance matrices, some of which even assume Gaussian-like moment conditions. In fact, there are two other frequently used sample covariance matrices that do not depend on unknown population mean vectors: the MLE (obtained by subtracting the sample mean vector from each sample vector) and the unbiased sample covariance matrix (obtained by changing the denominator $n$ to $N=n-1$ in the MLE). In this paper, we not only establish new CLT's for non-centered sample covariance matrices without Gaussian-like moment conditions but also characterize the non-negligible differences among the CLT's for the three classes of high-dimensional sample covariance matrices by establishing a substitution principle: substitute the adjusted sample size $N=n-1$ for the actual sample size $n$ in the major centering term of the new CLT's so as to obtain the CLT of the unbiased sample covariance matrices. Moreover, it is found that the difference between the CLT's for the MLE and the unbiased sample covariance matrix is non-negligible in the major centering term, although the two sample covariance matrices differ only in the denominator ($n$ versus $n-1$). The new results are applied to two testing problems for high-dimensional data.

Submitted 26 April, 2014; originally announced April 2014.

Comments: 36 pages, 23 references

MSC Class: 62H15; 62H10

arXiv:1302.0355 [pdf, other]

Estimation of the population spectral distribution from a large dimensional sample covariance matrix

Authors: Weiming Li, Jiaqi Chen, Yingli Qin, Jianfeng Yao, Zhidong Bai

Abstract: This paper introduces a new method to estimate the spectral distribution of a population covariance matrix from high-dimensional data. The method is founded on a meaningful generalization of the seminal Marcenko-Pastur equation, originally defined in the complex plane, to the real line. Beyond its easy implementation and the established asymptotic consistency, the new estimator outperforms two existing estimators from the literature in almost all the situations tested in a simulation experiment. An application to the analysis of the correlation matrix of S&P stocks data is also given.

Submitted 2 February, 2013; originally announced February 2013.

Comments: 16 pages, 4 figures

arXiv:1206.0867 [pdf, ps, other]

doi 10.1080/02331888.2012.708031

Testing linear hypotheses in high-dimensional regressions

Authors: Z. Bai, D. Jiang, J. Yao, S. Zheng

Abstract: For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative requires complex analytic approximations and, more importantly, these distributional approximations are feasible only for moderate dimension of the dependent variable, say $p\le 20$. On the other hand, assuming that the data dimension $p$ as well as the number $q$ of regression variables are fixed while the sample size $n$ grows, several asymptotic approximations are proposed in the literature for Wilks' $\Lambda$, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension $p$ and a large sample size $n$. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null, and simulations demonstrate that the corrected LRT has very satisfactory size and power, surely in the large $p$ and large $n$ context, but also for moderately large data dimensions like $p=30$ or $p=50$. As a byproduct, we give a reason explaining why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple sample significance test in MANOVA which is valid for high-dimensional data.

Submitted 5 June, 2012; originally announced June 2012.

Comments: Accepted 02/2012 for publication in "Statistics". 20 pages, 2 pages and 2 tables

MSC Class: 62H15; 62H10

Journal ref: Statistics: A Journal of Theoretical and Applied Statistics 47(6):1207-1223, June 2013

Showing 1–24 of 24 results for author: Bai, Z