Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics

Olivier Bastien; Jean-Christophe Aude; Sylvaine Roy; Eric Maréchal

doi:10.1093/bioinformatics/btg440

Bioinformatics. 2004 Mar 1;20(4):534-7. doi: 10.1093/bioinformatics/btg440. Epub 2004 Jan 22.

Authors

Olivier Bastien¹, Jean-Christophe Aude, Sylvaine Roy, Eric Maréchal

Affiliation

¹ Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaire, UMR 5168 CNRS-CEA-INRA-Université J. Fourier, CEA Grenoble, 17 rue des Martyrs, F-38054, Grenoble cedex 09, France.

PMID: 14990449
DOI: 10.1093/bioinformatics/btg440

Abstract

Motivation: Different automatic methods of sequence alignment are routinely used as a starting point for homology searches and function inference. Confidence in the probability assigned to an alignment is one of the foundations of massive automatic genome-scale pairwise comparisons, whether for clustering putative orthologs and paralogs, annotating sequenced genomes or constructing multi-genome trees. Extreme value distributions based on the Karlin-Altschul model, usually advised for large-scale comparisons, are not always valid, particularly when comparing non-biased genomes with nucleotide-biased genomes (such as that of Plasmodium falciparum). Z-value estimates based on Monte Carlo techniques can be calculated empirically for any alignment output, whatever the method used. Empirically, a Z-value higher than approximately 8 is considered sufficient to conclude that an alignment score is significant, but this arbitrary figure has never been theoretically justified.
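As an illustration of the Monte Carlo estimate mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that the random model is obtained by shuffling one of the two sequences, and it uses a toy Smith-Waterman score with arbitrary match, mismatch and gap values in place of a real substitution matrix. The names local_alignment_score and z_value, and all parameter values, are hypothetical.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman score with linear gap penalties.

    Assumption: these scores stand in for a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_value(a, b, n_shuffles=200, seed=0, score=local_alignment_score):
    """Estimate Z = (s - mean) / sd, where mean and sd come from
    alignments of `a` against shuffled copies of `b`.

    Assumption: shuffling `b` is the chosen random model.
    """
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves the composition of b
        shuffled_scores.append(score(a, "".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mu) / sigma


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(z_value(seq, seq))
```

Because the estimate only requires a score function, the same procedure can be wrapped around any alignment method, which is the property the abstract emphasises.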

Results: In this paper, we used the Bienaymé-Chebyshev inequality to demonstrate a theorem on the upper limit of an alignment score probability (or P-value). This theorem implies that a computed Z-value can serve both as a statistical test and as a single-linkage clustering criterion, and that 1/(Z-value)² is an upper limit on the probability of an alignment score, whatever the actual probability law is. Therefore, this study provides the missing theoretical link between a Z-value cut-off used for automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes).
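The bound stated above can be made concrete with a small worked sketch. For a score S with mean m and standard deviation sd, the Bienaymé-Chebyshev inequality gives P(|S - m| >= Z·sd) <= 1/Z², so the probability of a score at least as high as the observed one is at most 1/Z². The helper names below are hypothetical. The bound strictly holds for the true mean and standard deviation, whereas a Monte Carlo Z-value only estimates them.

```python
import math


def p_value_upper_bound(z):
    """Chebyshev-type upper bound on the P-value for a given Z-value.

    The bound is informative only for z > 1; otherwise it is capped at 1.
    """
    if z <= 0:
        return 1.0
    return min(1.0, 1.0 / (z * z))


def z_cutoff_for_risk(alpha):
    """Smallest Z-value cut-off whose bound 1/Z**2 does not exceed alpha."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return 1.0 / math.sqrt(alpha)


if __name__ == "__main__":
    print(p_value_upper_bound(8))   # bound for the empirical cut-off of 8
    print(z_cutoff_for_risk(0.01))  # cut-off for a 1% risk
```

Under this bound, the empirical cut-off of 8 corresponds to a risk of at most 1/64, about 1.6%, regardless of the underlying score distribution.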

Publication types

Comparative Study
Evaluation Study
Validation Study

MeSH terms

Algorithms*
Amino Acid Sequence
Amino Acids / chemistry
Cluster Analysis
Data Interpretation, Statistical*
Molecular Sequence Data
Proteins / chemistry*
Quality Control
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid

Substances

Amino Acids
Proteins