Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments

Karen H Miga; Christopher Eisenhart; W James Kent

doi:10.1093/nar/gkv671

Nucleic Acids Res. 2015 Nov 16;43(20):e133. doi: 10.1093/nar/gkv671. Epub 2015 Jul 10.

Authors

Karen H Miga¹, Christopher Eisenhart², W James Kent²

Affiliations

¹ Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
² Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Abstract

The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these targets into standard mapping and peak-calling pipelines, we demonstrate a 10-fold reduction in signal within regions that overlap blacklisted regions, and we identify a comprehensive set of regions whose mapping is sensitive to the presence of the repeat-rich targets.
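The effect described in the abstract can be illustrated with a small toy sketch. It is not the authors' pipeline: the function names, the target names (`chr1_regionA`, `decoy_sat`), and the alignment scores are all hypothetical, and alignments are simplified to `(read_id, target, score)` tuples. The sketch assumes each read is assigned to its single best-scoring target, and compares the read depth on an assembled region when repeat-rich mapping targets are absent versus present.

```python
from collections import Counter


def best_hits(alignments):
    """Keep the highest-scoring (target, score) pair for each read."""
    best = {}
    for read_id, target, score in alignments:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (target, score)
    return best


def depth_per_target(alignments, decoy_targets=frozenset()):
    """Count reads whose best hit lands on a non-decoy target."""
    counts = Counter()
    for target, _score in best_hits(alignments).values():
        if target not in decoy_targets:
            counts[target] += 1
    return counts


# Hypothetical toy data: r2-r4 derive from a repeat that is missing from
# the assembly, so they align weakly to a homologous assembled region and
# strongly to the added repeat-rich target.
decoys = frozenset({"decoy_sat"})
alignments = [
    ("r1", "chr1_regionA", 60),
    ("r2", "chr1_regionA", 30), ("r2", "decoy_sat", 55),
    ("r3", "chr1_regionA", 28), ("r3", "decoy_sat", 58),
    ("r4", "chr1_regionA", 25), ("r4", "decoy_sat", 50),
]

# Mapping against the reference alone: decoy alignments are unavailable.
reference_only = [a for a in alignments if a[1] not in decoys]
depth_without = depth_per_target(reference_only)

# Mapping against reference plus decoys: repeat-derived reads are absorbed.
depth_with = depth_per_target(alignments, decoys)

print(depth_without["chr1_regionA"], depth_with["chr1_regionA"])
```

In this toy setting the assembled region collects every repeat-derived read when no alternative target exists, and only the genuinely matching read once the extra targets compete for alignments. Real aligners also report mapping quality and multi-mapping status, which this sketch ignores.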

Publication types

Evaluation Study
Research Support, N.I.H., Extramural

MeSH terms

Artifacts*
DNA / chemistry
Databases, Nucleic Acid
Genome, Human*
Genomics / methods*
Humans
Repetitive Sequences, Nucleic Acid
Sequence Alignment / methods*

Substances

DNA
