Statistical properties of DNA sequences

C K Peng; S V Buldyrev; A L Goldberger; S Havlin; R N Mantegna; M Simons; H E Stanley

doi:10.1016/0378-4371(95)00247-5

Physica A. 1995;221:180-92. doi: 10.1016/0378-4371(95)00247-5.

Authors

C K Peng¹, S V Buldyrev, A L Goldberger, S Havlin, R N Mantegna, M Simons, H E Stanley

Collaborator

A L Goldberger²

Affiliations

¹ Cardiovascular Division, Harvard Medical School, Boston, MA 02215, USA.
² Beth Israel Hospital, Boston, MA, USA.

PMID: 11540495
DOI: 10.1016/0378-4371(95)00247-5

Abstract

We review evidence supporting the idea that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range: nucleotides thousands of base pairs apart are correlated. We do not find such long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationarity" of the base-pair sequence by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we briefly describe some recent work showing that the non-coding sequences have certain statistical features in common with natural and artificial languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.
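
As an illustration of the DFA step named in the abstract, the following is a minimal sketch in Python and not the authors' implementation. It makes several assumptions the abstract does not state: the sequence is mapped to a "DNA walk" with +1 for pyrimidines (C, T) and -1 for purines (A, G); each box is detrended with a least-squares straight line; and the fluctuation F(l) is the root-mean-square residual over all boxes, the commonly used first-order form of DFA. The function names (dna_walk, dfa_fluctuation, dfa_exponent) are hypothetical. Under these assumptions, an uncorrelated sequence should give a scaling exponent near 0.5, while long-range correlation would show up as an exponent clearly above 0.5.

```python
import math


def dna_walk(seq):
    """Cumulative sum of steps: +1 for pyrimidine (C/T), -1 for purine (A/G).

    The purine/pyrimidine rule is an assumption of this sketch.
    """
    steps = {"C": 1, "T": 1, "A": -1, "G": -1}
    walk, total = [], 0
    for base in seq.upper():
        total += steps[base]
        walk.append(total)
    return walk


def _residual_sum_of_squares(ys):
    """Sum of squared residuals after a least-squares line fit to ys."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return sum(
        (y - y_mean - slope * (x - x_mean)) ** 2 for x, y in enumerate(ys)
    )


def dfa_fluctuation(walk, box):
    """RMS deviation of the walk from its local linear trend, box length `box`.

    Points left over after the last full box are ignored.
    """
    n_boxes = len(walk) // box
    total = sum(
        _residual_sum_of_squares(walk[i * box:(i + 1) * box])
        for i in range(n_boxes)
    )
    return math.sqrt(total / (n_boxes * box))


def dfa_exponent(walk, boxes):
    """Slope of log F(l) against log l, fitted by least squares.

    Box lengths should be at least 3 so that the residuals are not
    identically zero.
    """
    xs = [math.log(b) for b in boxes]
    ys = [math.log(dfa_fluctuation(walk, b)) for b in boxes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx
```

A typical call would be dfa_exponent(dna_walk(sequence), [8, 16, 32, 64, 128]); the choice of box lengths here is illustrative only and should be matched to the sequence length.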

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Animals
Base Sequence
DNA / chemistry*
Data Interpretation, Statistical
Escherichia coli
Fractals*
Humans
Introns
Invertebrates
Linguistics
Nucleotide Mapping*
Sequence Analysis, DNA / statistics & numerical data*
Software*
Yeasts

Substances

DNA