-
Detection problems in the spiked matrix models
Authors:
Ji Hyung Jung,
Hye Won Chung,
Ji Oon Lee
Abstract:
We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals.…
▽ More
We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find out sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-Péché (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.
△ Less
Submitted 16 January, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Detection of Signal in the Spiked Rectangular Models
Authors:
Ji Hyung Jung,
Hye Won Chung,
Ji Oon Lee
Abstract:
We consider the problem of detecting signals in the rank-one signal-plus-noise data matrix models that generalize the spiked Wishart matrices. We show that the principal component analysis can be improved by pre-transforming the matrix entries if the noise is non-Gaussian. As an intermediate step, we prove a sharp phase transition of the largest eigenvalues of spiked rectangular matrices, which ex…
▽ More
We consider the problem of detecting signals in the rank-one signal-plus-noise data matrix models that generalize the spiked Wishart matrices. We show that the principal component analysis can be improved by pre-transforming the matrix entries if the noise is non-Gaussian. As an intermediate step, we prove a sharp phase transition of the largest eigenvalues of spiked rectangular matrices, which extends the Baik-Ben Arous-Péché (BBP) transition. We also propose a hypothesis test to detect the presence of signal with low computational complexity, based on the linear spectral statistics, which minimizes the sum of the Type-I and Type-II errors when the noise is Gaussian.
△ Less
Submitted 27 April, 2021;
originally announced April 2021.
-
Weak Detection in the Spiked Wigner Model with General Rank
Authors:
Ji Hyung Jung,
Hye Won Chung,
Ji Oon Lee
Abstract:
We study the statistical decision process of detecting the signal from a `signal+noise' type matrix model with an additive Wigner noise. We propose a hypothesis test based on the linear spectral statistics of the data matrix, which does not depend on the distribution of the signal or the noise. The test is optimal under the Gaussian noise if the signal-to-noise ratio is small, as it minimizes the…
▽ More
We study the statistical decision process of detecting the signal from a `signal+noise' type matrix model with an additive Wigner noise. We propose a hypothesis test based on the linear spectral statistics of the data matrix, which does not depend on the distribution of the signal or the noise. The test is optimal under the Gaussian noise if the signal-to-noise ratio is small, as it minimizes the sum of the Type-I and Type-II errors. Under the non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.
△ Less
Submitted 4 March, 2021; v1 submitted 16 January, 2020;
originally announced January 2020.
-
Weak detection in the spiked Wigner model
Authors:
Hye Won Chung,
Ji Oon Lee
Abstract:
We consider the weak detection problem in a rank-one spiked Wigner data matrix where the signal-to-noise ratio is small so that reliable detection is impossible. We propose a hypothesis test on the presence of the signal by utilizing the linear spectral statistics of the data matrix. The test is data-driven and does not require prior knowledge about the distribution of the signal or the noise. Whe…
▽ More
We consider the weak detection problem in a rank-one spiked Wigner data matrix where the signal-to-noise ratio is small so that reliable detection is impossible. We propose a hypothesis test on the presence of the signal by utilizing the linear spectral statistics of the data matrix. The test is data-driven and does not require prior knowledge about the distribution of the signal or the noise. When the noise is Gaussian, the proposed test is optimal in the sense that its error matches that of the likelihood ratio test, which minimizes the sum of the Type-I and Type-II errors. If the density of the noise is known and non-Gaussian, the error of the test can be lowered by applying an entrywise transformation to the data matrix. We establish a central limit theorem for the linear spectral statistics of general rank-one spiked Wigner matrices as an intermediate step.
△ Less
Submitted 10 November, 2019; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Parity Queries for Binary Classification
Authors:
Hye Won Chung,
Ji Oon Lee,
Doyeon Kim,
Alfred O. Hero
Abstract:
Consider a query-based data acquisition problem that aims to recover the values of $k$ binary variables from parity (XOR) measurements of chosen subsets of the variables. Assume the response model where only a randomly selected subset of the measurements is received. We propose a method for designing a sequence of queries so that the variables can be identified with high probability using as few (…
▽ More
Consider a query-based data acquisition problem that aims to recover the values of $k$ binary variables from parity (XOR) measurements of chosen subsets of the variables. Assume the response model where only a randomly selected subset of the measurements is received. We propose a method for designing a sequence of queries so that the variables can be identified with high probability using as few ($n$) measurements as possible. We define the query difficulty $\bar{d}$ as the average size of the query subsets and the sample complexity $n$ as the minimum number of measurements required to attain a given recovery accuracy. We obtain fundamental trade-offs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity required for recovering all $k$ variables with high probability is $n = c_0 \max\{k, (k \log k)/\bar{d}\}$ and the sample complexity for recovering a fixed proportion $(1-δ)k$ of the variables for $δ=o(1)$ is $n = c_1\max\{k, (k \log(1/δ))/\bar{d}\}$, where $c_0, c_1>0$.
△ Less
Submitted 7 November, 2019; v1 submitted 4 September, 2018;
originally announced September 2018.
-
Fundamental Limits on Data Acquisition: Trade-offs between Sample Complexity and Query Difficulty
Authors:
Hye Won Chung,
Ji Oon Lee,
Alfred O. Hero
Abstract:
We consider query-based data acquisition and the corresponding information recovery problem, where the goal is to recover $k$ binary variables (information bits) from parity measurements of those variables. The queries and the corresponding parity measurements are designed using the encoding rule of Fountain codes. By using Fountain codes, we can design potentially limitless number of queries, and…
▽ More
We consider query-based data acquisition and the corresponding information recovery problem, where the goal is to recover $k$ binary variables (information bits) from parity measurements of those variables. The queries and the corresponding parity measurements are designed using the encoding rule of Fountain codes. By using Fountain codes, we can design potentially limitless number of queries, and corresponding parity measurements, and guarantee that the original $k$ information bits can be recovered with high probability from any sufficiently large set of measurements of size $n$. In the query design, the average number of information bits that is associated with one parity measurement is called query difficulty ($\bar{d}$) and the minimum number of measurements required to recover the $k$ information bits for a fixed $\bar{d}$ is called sample complexity ($n$). We analyze the fundamental trade-offs between the query difficulty and the sample complexity, and show that the sample complexity of $n=c\max\{k,(k\log k)/\bar{d}\}$ for some constant $c>0$ is necessary and sufficient to recover $k$ information bits with high probability as $k\to\infty$.
△ Less
Submitted 2 January, 2018; v1 submitted 30 November, 2017;
originally announced December 2017.