Discussion on common data analysis strategies used in MS-based proteomics

Proteomics. 2011 Feb;11(4):604-19. doi: 10.1002/pmic.201000404. Epub 2011 Jan 17.

Abstract

Current proteomics technology is limited in resolving the proteome complexity of biological systems. The main challenge is to increase throughput and spectral quality so that spatiotemporal dimensions, population parameters and the complexity of protein modifications can be considered on a quantitative scale. MS-based proteomics and protein arrays are the main players in large-scale proteome analysis, and an integration of these two methodologies is powerful but at present not sufficient for detailed quantitative and spatiotemporal proteome characterization. Recent improvements in instrumentation for MS-based proteomics have yielded data sets of approximately one million spectra, a large step in the right direction. The corresponding raw data range from 50 to 100 GB and are frequently made available. Multidimensional LC-MS data sets have been shown to identify and quantify 2000-8000 proteins from whole-cell extracts. Analysis of the resulting data sets requires several steps: raw data processing, database-dependent searching, statistical evaluation of the search results, application of quantification algorithms, and statistical analysis of the quantitative data. A large number of software tools have been proposed for these tasks. However, it is not the aim of this review to cover all software tools, but rather to discuss the common data analysis strategies used by various algorithms for each of these steps in a non-redundant fashion, and to argue that several areas still need improvement.
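One of the steps named above, statistical evaluation of database-search results, is commonly performed with target-decoy competition: peptide-spectrum matches (PSMs) against a reversed or shuffled decoy database estimate the false discovery rate among accepted target matches. The sketch below is a minimal, self-contained illustration of this standard idea, not code from the review; the toy PSM scores are invented, and a higher score is assumed to be better.

```python
def estimate_qvalues(psms):
    """psms: list of (score, is_decoy) pairs. Returns (score, is_decoy, q) tuples.

    Implements basic target-decoy FDR estimation: at each score threshold,
    the number of decoy hits estimates the number of false target hits.
    """
    # Rank PSMs from best to worst score (higher = better, by assumption).
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    rows, decoys = [], 0
    for i, (score, is_decoy) in enumerate(ranked, start=1):
        decoys += is_decoy
        targets = i - decoys
        # FDR at this threshold: decoys / targets (guard against 0 targets).
        fdr = decoys / max(targets, 1)
        rows.append([score, is_decoy, fdr])
    # Convert FDR to q-values: the minimum FDR achievable at or below
    # each threshold, enforced by a running minimum from the bottom up.
    running_min = float("inf")
    for row in reversed(rows):
        running_min = min(running_min, row[2])
        row[2] = running_min
    return [tuple(r) for r in rows]

# Toy example: five PSMs, True marks a decoy hit.
psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (6.5, True)]
for score, is_decoy, q in estimate_qvalues(psms):
    print(f"score={score:.1f} decoy={is_decoy} q={q:.2f}")
```

Real search-engine post-processors add score recalibration and peptide- or protein-level aggregation on top of this, but the monotone q-value construction shown here is the common core.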

Publication types

  • Research Support, Non-U.S. Gov't
  • Review

MeSH terms

  • Algorithms
  • Proteomics*
  • Tandem Mass Spectrometry*