Identification of repeat structure in large genomes using repeat probability clouds

Wanjun Gu; Todd A Castoe; Dale J Hedges; Mark A Batzer; David D Pollock

doi:10.1016/j.ab.2008.05.015

Anal Biochem. 2008 Sep 1;380(1):77-83. doi: 10.1016/j.ab.2008.05.015. Epub 2008 May 20.

Authors

Wanjun Gu¹, Todd A Castoe, Dale J Hedges, Mark A Batzer, David D Pollock

Affiliation

¹ Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA.

Abstract

The identification of repeat structure in eukaryotic genomes can be time-consuming and difficult because of the large amount of information (approximately 3 × 10^9 bp) that needs to be processed and compared. We introduce a new approach based on exact word counts to evaluate, de novo, the repeat structure present within large eukaryotic genomes. This approach avoids sequence alignment and similarity search, two of the most time-consuming components of traditional methods for repeat identification. Algorithms were implemented to efficiently calculate exact counts for any length oligonucleotide in large genomes. Based on these oligonucleotide counts, oligonucleotide excess probability clouds, or "P-clouds," were constructed. P-clouds are composed of clusters of related oligonucleotides that occur, as a group, more often than expected by chance. After construction, P-clouds were mapped back onto the genome, and regions of high P-cloud density were identified as repetitive regions based on a sliding window approach. This efficient method is capable of analyzing the repeat content of the entire human genome on a single desktop computer in less than half a day, at least 10-fold faster than current approaches. The predicted repetitive regions strongly overlap with known repeat elements as well as other repetitive regions such as gene families, pseudogenes, and segmental duplicons. This method should be extremely useful as a tool for use in de novo identification of repeat structure in large newly sequenced genomes.
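
The three stages named in the abstract (exact word counting, grouping related words into clouds, and sliding-window annotation) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the word length, the count cutoffs, or the rule for deciding which oligonucleotides are "related," so the sketch assumes a one-substitution neighbor rule and fixed cutoffs. All function and parameter names (`count_words`, `build_clouds`, `repetitive_regions`, `core_cutoff`, `outer_cutoff`, `window`, `min_hits`) are hypothetical.

```python
from collections import Counter, deque


def count_words(seq, k):
    """Exact counts of every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(word):
    """Yield every word that differs from `word` at exactly one position."""
    for i, base in enumerate(word):
        for alt in "ACGT":
            if alt != base:
                yield word[:i] + alt + word[i + 1:]


def build_clouds(counts, core_cutoff, outer_cutoff):
    """Assign frequent words to clouds (assumed, simplified rule).

    A word with count >= core_cutoff seeds a cloud. The cloud then grows
    through one-substitution neighbors whose count is >= outer_cutoff.
    Returns a dict mapping word -> cloud id.
    """
    assigned = {}
    # Visit seeds from most to least frequent; ties broken alphabetically.
    seeds = sorted(
        (w for w, c in counts.items() if c >= core_cutoff),
        key=lambda w: (-counts[w], w),
    )
    next_id = 0
    for seed in seeds:
        if seed in assigned:
            continue
        assigned[seed] = next_id
        queue = deque([seed])
        while queue:
            word = queue.popleft()
            for nb in neighbors(word):
                if nb not in assigned and counts.get(nb, 0) >= outer_cutoff:
                    assigned[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return assigned


def repetitive_regions(seq, k, cloud_words, window, min_hits):
    """Merge windows of `window` consecutive word positions that contain
    at least `min_hits` cloud words into (start, end) intervals, with
    `end` exclusive in sequence coordinates."""
    n = len(seq) - k + 1
    if n < window:
        return []
    hits = [1 if seq[i:i + k] in cloud_words else 0 for i in range(n)]
    regions = []
    running = sum(hits[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one position to the right.
            running += hits[start + window - 1] - hits[start - 1]
        if running >= min_hits:
            end = start + window - 1 + k
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

A typical call sequence under these assumptions would be `counts = count_words(genome, k)`, then `clouds = build_clouds(counts, core_cutoff, outer_cutoff)`, then `repetitive_regions(genome, k, clouds, window, min_hits)`. Holding every count in an in-memory `Counter` is only practical for short sequences; a genome-scale run would need the more efficient counting that the abstract mentions but does not describe.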

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Alu Elements / genetics
Chromosomes, Human, Pair 1 / genetics
False Positive Reactions
Genome, Human / genetics*
Humans
Oligonucleotides / genetics
Probability*
Repetitive Sequences, Nucleic Acid / genetics*
Sensitivity and Specificity
Time Factors

Substances

Oligonucleotides
