pValid 2: A deep learning based validation method for peptide identification in shotgun proteomics with increased discriminating power

J Proteomics. 2022 Jan 16:251:104414. doi: 10.1016/j.jprot.2021.104414. Epub 2021 Nov 2.

Abstract

Tandem mass spectrometry has been the principal method in shotgun proteomics for peptide and protein identification. However, it remains unknown which of the identifications reported by proteome search engines are incorrect, so further validation methods are needed. We previously proposed a validation method, pValid, but its scope of application is limited because two of its features depend on open database search and on sub-optimal peptide candidates for tandem mass spectra, and its performance on complex datasets still has room for improvement. In this study, we developed a more comprehensive validation method, pValid 2, which overcomes these limitations by removing the two features and introducing a new feature based on the retention time predicted by a deep learning-based method, pPredRT. pValid 2 yielded an average false positive rate of 0.03% and an average false negative rate of 1.37% on three testing datasets, better than those of pValid, and flagged 8.47% to 11.31% more incorrect identifications than pValid on two complex datasets. Moreover, pValid 2 flagged almost all decoy identifications when validating the open-search datasets. In addition, pValid 2 can validate identifications reported by MaxQuant and MS-GF+, and the validation results showed that pValid 2 performed dramatically better than three metabolic labeling validation methods. Considering also its cost-effectiveness as a purely computational approach, pValid 2 has the potential to become a widely used validation tool for peptide identifications from any proteome search engine in shotgun proteomics.

SIGNIFICANCE: Identification results given by shotgun proteomics are vital to life science research. The correctness of identifications strongly affects the precision of subsequent studies of protein structures and functions, protein-protein interactions, pathogenic mechanisms, and targeted drugs. Thus, validating the correctness of identifications is crucial and urgent.
In 2019, we developed an identification credibility validation method named pValid, whose false positive rate (FPR) is 0.03% and false negative rate (FNR) is 1.79%, comparable to those of the gold standard, i.e., the synthetic-peptide validation method. However, pValid can only validate results from pFind, and its validation performance on a few complex datasets still has room for improvement. In this work, we therefore propose pValid 2, a more comprehensive computational validation method that can validate identifications from any proteome search engine with increased discriminating power.
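For readers unfamiliar with the two reported error rates, the following minimal Python sketch shows how an FPR and FNR can be computed from validation outcomes. It assumes the conventional definitions, with "positive" meaning that the validator accepts an identification as credible; the abstract does not spell out its exact definitions. The class name, function names, and counts are hypothetical and chosen only so that the arithmetic reproduces percentages of the same size as those quoted above. They are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class ValidationCounts:
    """Hypothetical outcome counts for a validator (not data from the paper)."""
    true_pos: int   # correct identifications accepted as credible
    false_pos: int  # incorrect identifications accepted as credible
    true_neg: int   # incorrect identifications flagged as suspicious
    false_neg: int  # correct identifications flagged as suspicious


def false_positive_rate(c: ValidationCounts) -> float:
    """Fraction of incorrect identifications that pass validation."""
    incorrect = c.false_pos + c.true_neg
    return c.false_pos / incorrect if incorrect else 0.0


def false_negative_rate(c: ValidationCounts) -> float:
    """Fraction of correct identifications that are wrongly flagged."""
    correct = c.true_pos + c.false_neg
    return c.false_neg / correct if correct else 0.0


# Illustrative counts only (assumption): 10,000 correct and 10,000 incorrect
# identifications presented to the validator.
counts = ValidationCounts(true_pos=9863, false_pos=3, true_neg=9997, false_neg=137)

fpr = false_positive_rate(counts)
fnr = false_negative_rate(counts)
print(f"FPR = {fpr:.2%}, FNR = {fnr:.2%}")
```

Under these assumed definitions, a low FPR means few incorrect identifications slip through, while a low FNR means few correct identifications are discarded.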

Keywords: Deep learning; Proteomics; Retention time; Tandem mass spectrometry; Validation method.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Databases, Protein
  • Deep Learning*
  • Peptides / analysis
  • Proteome / analysis
  • Proteomics* / methods
  • Software

Substances

  • Peptides
  • Proteome