Methods for evaluating unsupervised vector representations of genomic regions

Guangtao Zheng; Julia Rymuza; Erfaneh Gharavi; Nathan J LeRoy; Aidong Zhang; Nathan C Sheffield

doi:10.1093/nargab/lqae086

NAR Genom Bioinform. 2024 Aug 10;6(3):lqae086. doi: 10.1093/nargab/lqae086. eCollection 2024 Sep.

Authors

Guangtao Zheng¹, Julia Rymuza², Erfaneh Gharavi²,³, Nathan J LeRoy²,⁴, Aidong Zhang¹,³,⁴, Nathan C Sheffield²,³,⁴,⁵,⁶,⁷

Affiliations

¹ Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.
² Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
³ School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.
⁴ Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.
⁵ Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
⁶ Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
⁷ Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
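The intuition behind the neighborhood-based scores, that regions close on the genome should also tend to be close in embedding space, can be illustrated with a small sketch. This is not the paper's definition of the NPS. It is a simplified stand-in that assumes all regions lie on one chromosome, summarizes each region by a midpoint coordinate, uses Euclidean distance between embeddings, and reports the average fraction of shared k-nearest neighbors. The function names `knn_indices` and `neighborhood_overlap` are hypothetical.

```python
import numpy as np


def knn_indices(dist, k):
    """Return the indices of the k nearest neighbors per row, excluding self."""
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # a region is never its own neighbor
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def neighborhood_overlap(midpoints, embeddings, k):
    """Average fraction of k-nearest neighbors shared between genomic space
    and embedding space (illustrative only, not the published NPS formula).

    midpoints  : shape (n,), region midpoints on a single chromosome (assumption)
    embeddings : shape (n, d), one embedding vector per region
    """
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Pairwise distance along the genome (absolute midpoint difference).
    genome_dist = np.abs(midpoints[:, None] - midpoints[None, :])

    # Pairwise Euclidean distance between embedding vectors.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    embed_dist = np.sqrt((diff ** 2).sum(axis=-1))

    genome_nn = knn_indices(genome_dist, k)
    embed_nn = knn_indices(embed_dist, k)

    # Per-region overlap ratio, then the mean over all regions.
    overlaps = [len(set(g) & set(e)) / k for g, e in zip(genome_nn, embed_nn)]
    return float(np.mean(overlaps))


# Toy data: two clusters of regions whose embeddings mirror genomic order.
toy_midpoints = [100, 200, 300, 1000, 1100, 1200]
toy_embeddings = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
score = neighborhood_overlap(toy_midpoints, toy_embeddings, k=2)
```

In this toy case the embedding neighborhoods match the genomic neighborhoods exactly, so the overlap is at its maximum. Embeddings that ignore genomic position would score lower. A real evaluation would also need to handle multiple chromosomes and compare against a random baseline, neither of which this sketch attempts.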

Grants and funding