Systematic discovery of conservation states for single-nucleotide annotation of the human genome

Commun Biol. 2019 Jul 2;2:248. doi: 10.1038/s42003-019-0488-1. eCollection 2019.

Abstract

Comparative genomics sequence data is an important source of information for interpreting genomes. Genome-wide annotations based on this data have largely focused on univariate scores or binary elements of evolutionary constraint. Here we present a complementary whole-genome annotation approach, ConsHMM, which applies a multivariate hidden Markov model to learn de novo 'conservation states' based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple-species DNA sequence alignment. We applied ConsHMM to a 100-way vertebrate sequence alignment to annotate the human genome at single-nucleotide resolution into 100 conservation states. These states have distinct enrichments for other genomic information including gene annotations, chromatin states, repeat families, and bases prioritized by various variant prioritization scores. Constrained elements have distinct heritability partitioning enrichments depending on their conservation state assignment. ConsHMM conservation states are a resource for analyzing genomes and genetic variants.
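
As a rough illustration of the kind of input such a model could take, the sketch below encodes each reference position as a vector with one entry per non-reference species, recording whether that species aligns and whether it matches the reference base. The three-way coding, the function names, and the toy sequences are assumptions made for illustration; they are not taken from the ConsHMM software, and the hidden Markov model itself is not shown.

```python
# Illustrative sketch only: the codes and function names are hypothetical
# and do not correspond to the ConsHMM implementation.
NOT_ALIGNED, MATCH, MISMATCH = 0, 1, 2

# Characters treated here as "no aligned base" (an assumption of this sketch).
GAP_CHARS = {"-", ".", "N"}


def encode_column(ref_base, species_bases):
    """Encode one alignment column relative to the reference base."""
    ref = ref_base.upper()
    codes = []
    for base in species_bases:
        b = base.upper()
        if b in GAP_CHARS:
            codes.append(NOT_ALIGNED)   # species does not align here
        elif b == ref:
            codes.append(MATCH)         # aligns and matches the reference
        else:
            codes.append(MISMATCH)      # aligns but differs from the reference
    return codes


def encode_alignment(ref_seq, species_seqs):
    """Return one observation vector per reference position."""
    return [
        encode_column(ref_seq[i], [seq[i] for seq in species_seqs])
        for i in range(len(ref_seq))
    ]


# Toy alignment: one reference row and three other species (made-up data).
ref = "ACGT"
others = ["ACGA", "A-GT", "TCG-"]
observations = encode_alignment(ref, others)
for position, vector in enumerate(observations):
    print(position, vector)
```

Each row of `observations` would then serve as the multivariate observation at one nucleotide, from which a hidden Markov model could learn states reflecting both which species align and which match.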

Keywords: Comparative genomics; Evolutionary genetics; Genome informatics.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Chromatin / metabolism
  • Cluster Analysis
  • Computational Biology / methods*
  • Epigenomics
  • Genome, Human*
  • Genome-Wide Association Study
  • Genomics / methods*
  • Humans
  • Markov Chains
  • Molecular Sequence Annotation / methods*
  • Multivariate Analysis
  • Nucleotides
  • Phenotype
  • Polymorphism, Single Nucleotide
  • Reproducibility of Results

Substances

  • Chromatin
  • Nucleotides

Associated data

  • figshare/10.6084/m9.figshare.8162036.v1