Multiple alignment-free sequence comparison

Jie Ren; Kai Song; Fengzhu Sun; Minghua Deng; Gesine Reinert

doi:10.1093/bioinformatics/btt462

Bioinformatics. 2013 Nov 1;29(21):2690-8. doi: 10.1093/bioinformatics/btt462. Epub 2013 Aug 29.

Authors

Jie Ren¹, Kai Song, Fengzhu Sun, Minghua Deng, Gesine Reinert

Affiliation

¹ School of Mathematics, Peking University, Beijing 100871, PR China, Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089-2910, USA, MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China and Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK.

Abstract

Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, $C_1^{*}$ and $C_1^{S}$, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, $C_2^{*}$, $C_2^{S}$ and $C_2^{geo}$, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.
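As a rough illustration of what "k-tuple word content" and "averages of pairwise comparison statistics" can mean in practice, the following minimal Python sketch counts overlapping k-tuples and averages a pairwise statistic over all pairs in a set of sequences. The plain, uncentred D2 statistic (the inner product of two word-count vectors) is swapped in here purely for simplicity; it is not one of the five statistics named above, whose exact definitions are not given in this abstract. The function names `ktuple_counts`, `d2` and `average_pairwise_d2` are hypothetical and are not taken from the multiAlignFree package.

```python
from collections import Counter
from itertools import combinations


def ktuple_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Uncentred D2: sum over words of the product of the two counts.

    Assumption: this is the simplest k-tuple statistic, used only as a
    stand-in; the paper's statistics are star-type and Shepp-type variants.
    """
    counts_a = ktuple_counts(seq_a, k)
    counts_b = ktuple_counts(seq_b, k)
    # Counter returns 0 for words absent from counts_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def average_pairwise_d2(seqs, k):
    """Mean of D2 over all unordered pairs of sequences in seqs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(d2(a, b, k) for a, b in pairs) / len(pairs)
```

For example, `average_pairwise_d2(["ACGT", "ACGT", "TTTT"], 2)` averages the three pairwise scores 3, 0 and 0. A larger value would suggest more shared word content within the set, although an uncentred score like this one is sensitive to sequence length and base composition.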

Results: Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that all of our statistics perform similarly on real data, whereas on simulated data the Shepp-type statistics are in some instances outperformed by the star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.

Availability: Our implementation of the five statistics is available as an R package named 'multiAlignFree' at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html.

Contact: [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Binding Sites
Data Interpretation, Statistical
Mice
Regulatory Elements, Transcriptional
Sequence Alignment
Sequence Analysis, DNA / methods*
Transcription Factors / metabolism

Substances

Transcription Factors
