Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies

Bioinformatics. 2007 Sep 1;23(17):2210-7. doi: 10.1093/bioinformatics/btm267. Epub 2007 May 17.

Abstract

Motivation: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software.

Results: We designed an unsupervised pattern recognition algorithm for detecting patterns of various lengths in large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases yielded predictions of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimates of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.
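
The general idea of pattern-preserving decoys can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. As a stand-in, it assumes that "patterns" are fixed-length k-mer contexts, that decoys are drawn by Monte Carlo sampling from the observed context-to-next-residue counts, and that the false match rate is estimated as decoy hits divided by target hits at a score threshold. All function names, the parameter k and the toy sequences are hypothetical.

```python
import random
from collections import Counter, defaultdict


def kmer_transitions(sequences, k=2):
    """Count which residue follows each k-residue context (assumed pattern model)."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            table[seq[i:i + k]][seq[i + k]] += 1
    return table


def _draw(counter, rng):
    """Draw one key from a Counter with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[key] for key in keys])[0]


def sample_decoy(length, table, background, k, rng):
    """Monte Carlo sample a decoy sequence of the given length.

    Starts from a context drawn by frequency, then repeatedly draws the next
    residue from the counts observed after the current context. Falls back to
    the background residue frequencies when the context was never observed.
    """
    if table:
        start_weights = Counter({ctx: sum(nxt.values()) for ctx, nxt in table.items()})
        out = list(_draw(start_weights, rng))
    else:
        out = []
    while len(out) < length:
        followers = table.get("".join(out[-k:])) if len(out) >= k else None
        out.append(_draw(followers if followers else background, rng))
    return "".join(out[:length])


def estimated_false_match_rate(target_scores, decoy_scores, threshold):
    """Estimate the false match rate as decoy hits / target hits at a threshold."""
    target_hits = sum(score >= threshold for score in target_scores)
    decoy_hits = sum(score >= threshold for score in decoy_scores)
    return decoy_hits / target_hits if target_hits else 0.0


if __name__ == "__main__":
    rng = random.Random(0)                      # fixed seed for repeatability
    proteins = ["MKTAYIAKQR", "MKTLLVAGGA"]     # toy sequences, not real data
    table = kmer_transitions(proteins, k=2)
    background = Counter("".join(proteins))
    decoys = [sample_decoy(len(p), table, background, 2, rng) for p in proteins]
    print(decoys)
```

Unlike simple reversal, a sampler of this kind keeps short-range residue patterns in the decoys, which is the property the abstract argues reversed databases lack. A real implementation would need variable-length patterns and a statistical criterion for selecting them.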

Availability: On request from the authors.

Supplementary information: http://bioinformatics.psb.ugent.be/.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Amino Acid Sequence
  • Artificial Intelligence*
  • Computer Simulation
  • Data Interpretation, Statistical
  • False Positive Reactions*
  • Mass Spectrometry / methods*
  • Models, Chemical
  • Models, Statistical
  • Molecular Sequence Data
  • Monte Carlo Method
  • Pattern Recognition, Automated / methods*
  • Peptide Mapping / methods*
  • Proteins / chemistry*
  • Sequence Analysis, Protein / methods*

Substances

  • Proteins