Sparse k-means clustering (Sparse_kM) can exclude uninformative variables and yield reliable, parsimonious clustering results, especially when p≫n. In this work, Sparse_kM was combined with data resampling to identify the variables of greatest interest and to define confidence levels for the clustering. The method was evaluated by statistical simulation and applied to PiB PET amyloid imaging data to identify normal control (NC) subjects with (+) or without (-) evidence of amyloid, i.e., PiB(+/-).
Simulations: A dataset of n=60 observations (3 groups of 20) and p=500 variables was generated for each simulation run; only 50 variables truly differed across groups. The dataset was resampled 20 times, Sparse_kM was applied to each sample, and average variable weights were calculated. Probabilities of cluster membership, also called confidence levels, were computed for each of the n=60 observations. The simulation was repeated 250 times. The 50 truly different variables were identified by variable weights that were 13-32 times greater than those of the 450 uninformative variables.
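The aggregation step described above (averaging variable weights over resamples and turning repeated cluster assignments into confidence levels) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes resampling means drawing a subset of observations without replacement (the abstract does not specify the scheme), that cluster labels have already been aligned across resamples, and that confidence is the fraction of resamples in which an observation received its most frequent label. The names `draw_resamples`, `aggregate_runs`, and the `fraction` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def draw_resamples(n_obs, n_resamples, fraction=0.8, seed=0):
    """Draw index subsets without replacement.

    Assumption: the source does not state the resampling scheme; subsampling
    a fixed fraction of observations is used here purely for illustration.
    """
    rng = random.Random(seed)
    size = max(1, int(round(fraction * n_obs)))
    return [sorted(rng.sample(range(n_obs), size)) for _ in range(n_resamples)]


def aggregate_runs(runs, n_obs, n_vars):
    """Combine per-resample clustering results.

    runs: list of (indices, labels, weights) tuples, one per resample, where
      indices are the observation indices in that resample,
      labels  are the cluster labels of those observations
              (assumed already aligned across resamples),
      weights are the n_vars variable weights returned by the clustering.
    Returns (avg_weights, confidence), where confidence maps each observation
    that appeared in at least one resample to (modal_label, proportion).
    """
    weight_sum = [0.0] * n_vars
    votes = defaultdict(Counter)
    for indices, labels, weights in runs:
        if len(weights) != n_vars:
            raise ValueError("each run must supply one weight per variable")
        for j, w in enumerate(weights):
            weight_sum[j] += w
        for i, label in zip(indices, labels):
            votes[i][label] += 1

    avg_weights = [w / len(runs) for w in weight_sum]

    confidence = {}
    for i in range(n_obs):
        if i in votes:
            label, count = votes[i].most_common(1)[0]
            confidence[i] = (label, count / sum(votes[i].values()))
    return avg_weights, confidence
```

For the simulation setting above, one would call `draw_resamples(60, 20)`, run a sparse k-means routine with 3 clusters on each subset to obtain labels and 500 variable weights, and pass the collected results to `aggregate_runs(runs, 60, 500)`. The clustering routine itself is not shown and is left to whichever sparse k-means implementation is in use.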
Human data: PiB PET images (ECAT HR+, 10-15 mCi, 90 min) were acquired for 64 cognitively normal subjects (74.1±5.4 yrs). Parametric PiB distribution volume ratio images were generated (Logan method, cerebellum reference) and normalized to the MNI template (SPM8), yielding a dataset of n=64 subjects and p=343,099 voxels/image. The dataset was resampled 10 times and Sparse_kM was applied to each sample. The average voxel weight image highlighted cortical areas of greatest interest, including precuneus and frontal cortex, which are key areas linked to early amyloid deposition. Seven of 64 subjects were identified as PiB(+) and 47 as PiB(-) with confidence ≥ 90%; one further subject was PiB(+) at lower confidence (80%), and the remaining 9 subjects were PiB(-) at confidence levels of 50-70%.
In conclusion, Sparse_kM with resampling can help to establish confidence levels for clustering when p≫n and may be a promising method for revealing informative voxels and spatial patterns that distinguish levels of amyloid load, including load at the transitional amyloid +/- boundary.