In Search of Disentanglement in Tandem Mass Spectrometry Datasets

Biomolecules. 2023 Sep 4;13(9):1343. doi: 10.3390/biom13091343.

Abstract

Generative modeling and representation learning of tandem mass spectrometry data aim to learn an interpretable and instrument-agnostic digital representation of metabolites directly from MS/MS spectra. Such representations would facilitate comparisons of MS/MS spectra between instrument vendors and enable more accurate queries of large MS/MS spectral databases for metabolite identification. In this study, we apply generative modeling and representation learning using variational autoencoders to understand the extent to which tandem mass spectra can be disentangled into their factors of generation (e.g., collision energy, ionization mode, and instrument type) with minimal prior knowledge of the factors. We find that, with the proper choice of hyperparameters, variational autoencoders can disentangle tandem mass spectrometry data into meaningful latent representations aligned with known factors of variation. We develop a two-step approach to facilitate the selection of disentangled models, which could be applied to other complex and high-dimensional datasets.
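
As a rough illustration of why hyperparameter choice matters for disentanglement, the sketch below computes a beta-weighted variational autoencoder objective for a single spectrum. This is an assumption made for illustration only: the abstract does not state which objective or hyperparameters were used, and the function names, the weight `beta`, and the example numbers are hypothetical. The reconstruction error is taken as given, and the encoder is assumed to output a diagonal Gaussian posterior.

```python
import math

def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to the standard normal prior N(0, I),
    summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    """Reconstruction error plus a beta-weighted KL term.
    beta = 1 recovers the standard VAE objective; larger beta (a hypothetical
    hyperparameter here) pushes the posterior toward the factorized prior."""
    return recon_error + beta * gaussian_kl(mu, log_var)

# A posterior identical to the prior contributes no KL penalty.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))

# Same posterior and reconstruction error, two different weights on the KL term.
print(beta_vae_loss(1.5, [1.0, 0.0], [0.0, 0.0], beta=1.0))
print(beta_vae_loss(1.5, [1.0, 0.0], [0.0, 0.0], beta=4.0))
```

In this formulation, raising `beta` trades reconstruction fidelity for a latent code closer to an independent prior, which is one commonly used lever when searching for disentangled models.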

Keywords: deep learning; disentangled representation; generative models; latent space; tandem mass spectrometry; variational autoencoder.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Factual
  • Learning*
  • Tandem Mass Spectrometry*

Grants and funding

This research was funded by Douglas McCloskey and The Novo Nordisk Foundation, grant number NNF20CC0035580.