Massively parallel unsupervised single-particle cryo-EM data clustering via statistical manifold learning

PLoS One. 2017 Aug 7;12(8):e0182130. doi: 10.1371/journal.pone.0182130. eCollection 2017.

Abstract

Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, tend to misclassify images as the signal-to-noise ratio (SNR) of the image data decreases, while also incurring higher computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After the code was optimized for a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which allows a significant improvement in ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization.
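
For readers unfamiliar with GTM, the general idea can be illustrated with a minimal NumPy sketch of the textbook algorithm: a 2D latent grid is mapped into data space through radial basis functions, and the mapping is fitted by expectation-maximization. This is not the authors' implementation. The function names (`rbf_basis`, `fit_gtm`), the grid sizes, the fixed ridge term, and the simple initial noise precision are all assumptions chosen for illustration, and nothing here addresses image alignment, hierarchical clustering, or parallel execution.

```python
import numpy as np

def rbf_basis(latent, centers, sigma):
    """Gaussian RBF activations for each latent point, plus a bias column."""
    d2 = ((latent[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([phi, np.ones((latent.shape[0], 1))])

def _responsibilities(Y, X, beta):
    """E-step: posterior of each latent node (rows) for each data point (columns)."""
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # shape (K, N)
    logp = -0.5 * beta * d2
    logp -= logp.max(axis=0, keepdims=True)  # stabilize the exponentials
    R = np.exp(logp)
    R /= R.sum(axis=0, keepdims=True)
    return R, d2

def fit_gtm(X, grid=5, n_rbf=3, n_iter=30, ridge=1e-3):
    """Fit a small GTM to X (N samples x D features); hypothetical defaults."""
    N, D = X.shape
    g = np.linspace(-1.0, 1.0, grid)
    Z = np.array([(a, b) for a in g for b in g])   # K latent nodes
    c = np.linspace(-1.0, 1.0, n_rbf)
    C = np.array([(a, b) for a in c for b in c])   # RBF centers
    Phi = rbf_basis(Z, C, sigma=2.0 / (n_rbf - 1))  # shape (K, M)
    M = Phi.shape[1]

    # PCA initialization: place the latent grid on the top-2 principal plane.
    mean = X.mean(axis=0)
    Xc = X - mean
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    target = mean + Z @ (Vt[:2] * (S[:2, None] / np.sqrt(N)))
    W = np.linalg.lstsq(Phi, target, rcond=None)[0]  # shape (M, D)
    beta = 1.0 / Xc.var()  # simplified initial noise precision (assumption)

    for _ in range(n_iter):
        R, _ = _responsibilities(Phi @ W, X, beta)
        # M-step: responsibility-weighted ridge regression for W.
        A = Phi.T @ (R.sum(axis=1)[:, None] * Phi) + ridge * np.eye(M)
        W = np.linalg.solve(A, Phi.T @ R @ X)
        # M-step: update the noise precision from the weighted residuals.
        _, d2 = _responsibilities(Phi @ W, X, beta)
        beta = N * D / (R * d2).sum()

    R, _ = _responsibilities(Phi @ W, X, beta)
    return R, Z
```

In this sketch, a hard cluster label for each sample can be read off as `R.argmax(axis=0)`, the index of the latent node with the highest responsibility. For particle images, each row of `X` would be a flattened (and typically dimension-reduced) image, which is again an assumption rather than a description of the published pipeline.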

Publication types

  • Comparative Study

MeSH terms

  • Cluster Analysis
  • Computer Simulation
  • Cryoelectron Microscopy* / methods
  • Escherichia coli
  • Image Processing, Computer-Assisted / methods*
  • Imaging, Three-Dimensional / methods
  • Inflammasomes / ultrastructure
  • Multivariate Analysis
  • Principal Component Analysis
  • Proteasome Endopeptidase Complex / ultrastructure
  • Ribosome Subunits, Large, Bacterial / ultrastructure
  • Unsupervised Machine Learning*

Substances

  • Inflammasomes
  • Proteasome Endopeptidase Complex
  • ATP dependent 26S protease