Visualization and probability-based scoring of structural variants within repetitive sequences

Eitan Halper-Stromberg; Jared Steranka; Kathleen H Burns; Sarven Sabunciyan; Rafael A Irizarry

doi:10.1093/bioinformatics/btu054

Bioinformatics. 2014 Jun 1;30(11):1514-21. doi: 10.1093/bioinformatics/btu054. Epub 2014 Feb 4.

Authors

Eitan Halper-Stromberg¹, Jared Steranka², Kathleen H Burns¹, Sarven Sabunciyan³, Rafael A Irizarry³

Affiliations

¹ Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Program in Human Genetics and Molecular Biology, Johns Hopkins University School of Medicine, Computational Bioscience Program, University of Colorado, Denver, Department of Molecular Biology and Genetics, Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins Hospital, Department of Pathology, Johns Hopkins University, High Throughput Biology Center, Johns Hopkins University School of Medicine, Johns Hopkins University, Center for Epigenetics, Johns Hopkins University School of Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD and Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute, Boston, MA, USA.
² Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Program in Human Genetics and Molecular Biology, Johns Hopkins University School of Medicine, Computational Bioscience Program, University of Colorado, Denver, Department of Molecular Biology and Genetics, Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins Hospital, Department of Pathology, Johns Hopkins University, High Throughput Biology Center, Johns Hopkins University School of Medicine, Johns Hopkins University, Center for Epigenetics, Johns Hopkins University School of Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD and Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute, Boston, MA, USA.
³ Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Program in Human Genetics and Molecular Biology, Johns Hopkins University School of Medicine, Computational Bioscience Program, University of Colorado, Denver, Department of Molecular Biology and Genetics, Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins Hospital, Department of Pathology, Johns Hopkins University, High Throughput Biology Center, Johns Hopkins University School of Medicine, Johns Hopkins University, Center for Epigenetics, Johns Hopkins University School of Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD and Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute, Boston, MA, USA.

Abstract

Motivation: Repetitive sequences account for approximately half of the human genome. Accurately ascertaining sequences in these regions with next-generation sequencers is challenging and requires a different set of analytical techniques than those used for reads originating from unique sequences. Complicating the matter further, some repetitive regions are subject to programmed rearrangements, as is the case with the antigen-binding domains in the immunoglobulin (Ig) and T-cell receptor (TCR) loci.

Results: We developed a probability-based score and visualization method to aid in distinguishing true structural variants from alignment artifacts. We demonstrate the method's usefulness by showing that it separates real structural variants from false positives generated by existing upstream analysis tools. We validated our approach using both target-capture and whole-genome experiments. Capture sequencing reads were generated over the Ig and TCR loci from primary lymphoid tumors, cancer cell lines and an EBV-transformed lymphoblast cell line. Whole-genome sequencing reads were from a lymphoblastoid cell line.

Availability: We implement our method as an R package available at https://github.com/Eitan177/targetSeqView. Code to reproduce the figures and results is also available.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Cell Line, Transformed
Cell Line, Tumor
DNA / chemistry
Genome, Human
Genomic Structural Variation*
Genomics / methods
High-Throughput Nucleotide Sequencing / methods*
Humans
Probability
Repetitive Sequences, Nucleic Acid
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*

Substances

DNA
