CNV-guided multi-read allocation for ChIP-seq

Qi Zhang; Sündüz Keleş

doi:10.1093/bioinformatics/btu402

Bioinformatics. 2014 Oct 15;30(20):2860-7. doi: 10.1093/bioinformatics/btu402. Epub 2014 Jun 24.

Authors

Qi Zhang¹, Sündüz Keleş²

Affiliations

¹ Department of Biostatistics and Medical Informatics, 425 Henry Mall and Department of Statistics, 1300 University Avenue, Madison, WI 53706, USA.
² Department of Biostatistics and Medical Informatics, 425 Henry Mall and Department of Statistics, 1300 University Avenue, Madison, WI 53706, USA.

Abstract

Motivation: In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore the variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads.

Results: We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resultant peaks with the analysis of several ENCODE ChIP-seq datasets.
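The idea of seeding an expectation-maximization allocation with CNV-aware initial values can be illustrated with a small toy sketch. This is not the cnvCSEM implementation and does not reproduce its model; the function names (`cnv_adjusted_init`, `em_allocate`), the pseudocount, the use of a relative copy number (1.0 meaning the reference copy number), and the simple count-based update are all assumptions made for illustration.

```python
def cnv_adjusted_init(unique_counts, copy_number, pseudocount=1.0):
    """Hypothetical CNV-aware starting weights.

    unique_counts: dict locus -> number of uniquely aligned reads near the locus
    copy_number:   dict locus -> relative copy number (assumed 1.0 if missing)
    Dividing by copy number down-weights loci whose read density is
    inflated by amplification.
    """
    return {
        locus: (count + pseudocount) / copy_number.get(locus, 1.0)
        for locus, count in unique_counts.items()
    }


def em_allocate(multi_reads, init_weights, unique_counts, n_iter=20):
    """Toy EM-style allocation of multi-reads to candidate loci.

    multi_reads:  list of lists; each inner list holds the candidate loci
                  of one multi-read (every locus must be in unique_counts)
    init_weights: dict locus -> positive starting weight
    Returns (allocation, weights), where allocation[i] maps each candidate
    locus of read i to its allocation probability.
    """
    weights = dict(init_weights)
    allocation = []
    for _ in range(n_iter):
        # E-step: split each multi-read in proportion to current weights.
        expected = {locus: 0.0 for locus in weights}
        allocation = []
        for candidates in multi_reads:
            total = sum(weights[c] for c in candidates)
            probs = {c: weights[c] / total for c in candidates}
            allocation.append(probs)
            for c, p in probs.items():
                expected[c] += p
        # M-step: new weight = unique reads + expected share of multi-reads.
        weights = {
            locus: unique_counts[locus] + expected[locus] for locus in weights
        }
    return allocation, weights
```

In this sketch, two loci with equal unique read counts receive equal shares of a multi-read under a CNV-unaware start, whereas a locus with doubled copy number starts with a smaller weight under the CNV-aware start. Only the initial values differ between the two runs; the update rule is the same.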

Availability and implementation: Available at http://www.stat.wisc.edu/~qizhang/

Contact: [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Base Sequence
Chromatin Immunoprecipitation / methods*
DNA Copy Number Variations*
Genome, Human / genetics
Humans
Oligonucleotide Array Sequence Analysis
Repetitive Sequences, Nucleic Acid / genetics
Sequence Analysis, DNA / methods*
Transcription Factors / metabolism

Substances

Transcription Factors
