Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior-Enhanced Read Mapping

PLoS Comput Biol. 2015 Oct 20;11(10):e1004491. doi: 10.1371/journal.pcbi.1004491. eCollection 2015 Oct.

Abstract

Segmental duplications and other highly repetitive regions of genomes contribute significantly to cells' regulatory programs. Advances in next-generation sequencing have enabled genome-wide profiling of protein-DNA interactions by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). However, interactions in highly repetitive regions of genomes have proven difficult to map because short reads of 50-100 base pairs (bp) from these regions map to multiple locations in reference genomes. Standard analytical methods discard such multi-mapping reads, and the few methods that can accommodate them are prone to high false positive and false negative rates. We developed Perm-seq, a prior-enhanced read allocation method for ChIP-seq experiments that allocates multi-mapping reads in highly repetitive regions of genomes with high accuracy. We comprehensively evaluated Perm-seq and found that our prior-enhanced approach significantly improves multi-read allocation accuracy over approaches that do not utilize additional data types. The statistical formalism underlying our approach allows multi-read allocation to be supervised by a variety of data sources, including histone ChIP-seq. We applied Perm-seq to 64 ENCODE ChIP-seq datasets from GM12878 and K562 cells and identified many novel protein-DNA interactions in segmental duplication regions. Our analysis reveals that although protein-DNA interaction sites are evolutionarily less conserved in repetitive regions, they share the overall sequence characteristics of protein-DNA interactions in non-repetitive regions.
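
To make the idea of prior-enhanced allocation concrete, the following is a minimal illustrative sketch in Python, not the Perm-seq algorithm itself, which the abstract does not spell out. It assumes a simple scheme in which one multi-mapping read is split fractionally across its candidate locations in proportion to a prior weight (for example, one derived from an auxiliary signal such as histone ChIP-seq) multiplied by the local count of uniquely mapping reads. The function name, its parameters, the pseudocount, and the location labels are all hypothetical.

```python
def allocate_multiread(candidates, prior, unique_counts, pseudocount=1.0):
    """Split one multi-mapping read across its candidate locations.

    Illustrative assumption, not the published method: the weight of each
    location is prior[loc] * (unique_counts[loc] + pseudocount), normalized
    so that the weights of a single read sum to 1.
    """
    scores = [
        prior.get(loc, 0.0) * (unique_counts.get(loc, 0) + pseudocount)
        for loc in candidates
    ]
    total = sum(scores)
    if total == 0:
        # No prior support anywhere: fall back to a uniform split.
        return {loc: 1.0 / len(candidates) for loc in candidates}
    return {loc: score / total for loc, score in zip(candidates, scores)}


# Hypothetical example: two duplicated loci with equal uni-read coverage,
# so the allocation is driven by the prior alone.
candidates = ["chr1:100", "chr5:900"]
prior = {"chr1:100": 0.8, "chr5:900": 0.2}
unique_counts = {"chr1:100": 3, "chr5:900": 3}
weights = allocate_multiread(candidates, prior, unique_counts)
```

With equal uni-read coverage at both loci, the read is divided according to the prior alone; with a flat prior, the same function reduces to a coverage-proportional split that uses no additional data type.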

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Base Sequence
  • Chromatin Immunoprecipitation / methods
  • Chromosome Mapping / methods*
  • DNA / chemistry
  • DNA / genetics*
  • DNA-Binding Proteins / chemistry
  • DNA-Binding Proteins / genetics*
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • K562 Cells
  • Molecular Sequence Data
  • Protein Interaction Mapping / methods*
  • Repetitive Sequences, Nucleic Acid / genetics*
  • Segmental Duplications, Genomic / genetics*

Substances

  • DNA-Binding Proteins
  • DNA