Testing for a difference in means of a single feature after clustering

Biostatistics. 2024 Dec 31;26(1):kxae046. doi: 10.1093/biostatistics/kxae046.

Abstract

For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common interpretation and validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a single feature between a pair of clusters obtained using hierarchical or k-means clustering. The test controls the selective Type I error rate in finite samples and can be efficiently computed. We further illustrate the validity and power of our proposal in simulation and demonstrate its use on single-cell RNA-sequencing data.
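The Type I error inflation described above can be reproduced with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' proposed selective test. It draws a single feature from one Gaussian population (so no true clusters exist), splits the observations with a simple one-dimensional 2-means routine, and applies a naive two-sample z-test with a normal approximation. The function names (`two_means_1d`, `random_split`, `naive_p_value`, `rejection_rate`), the sample size, and the number of replicates are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def two_means_1d(x, iters=100):
    """Lloyd's algorithm with k=2 on one feature; returns a 0/1 label per point."""
    c0, c1 = min(x), max(x)  # start the two centers at the extremes
    labels = None
    for _ in range(iters):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        c0 = statistics.fmean(v for v, l in zip(x, labels) if l == 0)
        c1 = statistics.fmean(v for v, l in zip(x, labels) if l == 1)
    return labels


def random_split(x):
    """Control: labels chosen without looking at the data."""
    return [random.randint(0, 1) for _ in x]


def naive_p_value(x, labels):
    """Two-sided z-test for a difference in means, ignoring how labels were made."""
    g0 = [v for v, l in zip(x, labels) if l == 0]
    g1 = [v for v, l in zip(x, labels) if l == 1]
    if min(len(g0), len(g1)) < 2:
        return 1.0  # assumption: treat degenerate splits as non-rejections
    se = math.sqrt(statistics.variance(g0) / len(g0)
                   + statistics.variance(g1) / len(g1))
    z = (statistics.fmean(g0) - statistics.fmean(g1)) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > |z|) for standard normal Z


def rejection_rate(labeler, reps=500, n=100, alpha=0.05):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rejections = 0
    for _ in range(reps):
        x = [random.gauss(0.0, 1.0) for _ in range(n)]  # one population, no clusters
        if naive_p_value(x, labeler(x)) < alpha:
            rejections += 1
    return rejections / reps


random.seed(0)
clustered_rate = rejection_rate(two_means_1d)
random_rate = rejection_rate(random_split)
print(f"naive test after 2-means clustering: {clustered_rate:.3f}")
print(f"naive test after data-independent split: {random_rate:.3f}")
```

In this sketch the rejection rate after clustering should sit far above the nominal 0.05 level, while the data-independent split should stay close to it. The gap arises because the clustering step already separates low values from high values, so the same data both define the groups and are used to test them.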

Keywords: hypothesis testing; post-selection inference; type I error; unsupervised learning.

MeSH terms

  • Biostatistics / methods
  • Cluster Analysis
  • Data Interpretation, Statistical
  • Humans
  • Models, Statistical
  • Sequence Analysis, RNA / methods
  • Single-Cell Analysis* / methods
  • Single-Cell Analysis* / statistics & numerical data