multiDGD: A versatile deep generative model for multi-omics data

Viktoria Schuster; Emma Dann; Anders Krogh; Sarah A Teichmann

doi:10.1038/s41467-024-53340-z

Nat Commun. 2024 Nov 20;15(1):10031. doi: 10.1038/s41467-024-53340-z.

Authors

Viktoria Schuster¹ ², Emma Dann³, Anders Krogh⁴ ⁵, Sarah A Teichmann⁶ ⁷ ⁸

Affiliations

¹ Department of Computer Science, University of Copenhagen, Universitetsparken 5, Copenhagen, 2100, Denmark.
² Center for Health Data Science, University of Copenhagen, Blegdamsvej 3B, Copenhagen, 2200, Denmark.
³ Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.
⁴ Department of Computer Science, University of Copenhagen, Universitetsparken 5, Copenhagen, 2100, Denmark.
⁵ Center for Health Data Science, University of Copenhagen, Blegdamsvej 3B, Copenhagen, 2200, Denmark.
⁶ Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.
⁷ Theory of Condensed Matter Group, Cavendish Laboratory, University of Cambridge, J J Thomson Avenue, Cambridge, CB3 0HE, United Kingdom.
⁸ Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Puddicombe Way, Cambridge, CB2 0AW, United Kingdom.

Abstract

Recent technological advancements in single-cell genomics have enabled joint profiling of gene expression and alternative modalities at unprecedented scale. Consequently, the complexity of multi-omics data sets is increasing massively. Existing models for multi-modal data are typically limited in functionality or scalability, making data integration and downstream analysis cumbersome. We present multiDGD, a scalable deep generative model providing a probabilistic framework to learn shared representations of transcriptome and chromatin accessibility. It shows outstanding performance on data reconstruction without feature selection. We demonstrate on several data sets from human and mouse that multiDGD learns well-clustered joint representations. We further find that probabilistic modeling of sample covariates enables post-hoc data integration without the need for fine-tuning. Additionally, we show that multiDGD can detect statistical associations between genes and regulatory regions conditioned on the learned representations. multiDGD is available as an scverse-compatible package on GitHub.

MeSH terms

Algorithms
Animals
Chromatin / genetics
Chromatin / metabolism
Computational Biology / methods
Deep Learning
Gene Expression Profiling / methods
Genomics* / methods
Humans
Mice
Models, Statistical
Multiomics
Single-Cell Analysis / methods
Software
Transcriptome

Substances

Chromatin
