Multimodal omics, especially transcriptomics and proteomics, provide deeper insight into biological processes and cellular functions. Computational methods have been proposed to integrate single-cell transcriptomic and proteomic data. However, existing methods primarily concentrate on aligning the different omics, overlooking the information unique to each omics type. Moreover, because most single-cell cohorts encompass only one omics type, it is critical to transfer knowledge learnt from multimodal omics to enhance unimodal omics analysis. We therefore propose scMMAE (single-cell multimodal masked autoencoder), a framework that leverages a masked autoencoder with a cross-attention mechanism to fuse multimodal omics and enhance unimodal omics analysis. scMMAE simultaneously captures both the shared features and the distinctive information of two single-cell omics modalities and transfers this knowledge to enhance single-cell transcriptome data. Comparative evaluations against benchmarking methods across various cohorts revealed a notable improvement: an increase of up to 21% in the adjusted Rand index and up to 12% in normalized mutual information for multimodal fusion. For unimodal omics, scMMAE achieved an overall improvement of approximately 20% in the adjusted Rand index and nearly 10% in normalized mutual information. Nine other metrics, including the Fowlkes-Mallows index and the silhouette coefficient, further underscored the high performance of scMMAE. Notably, scMMAE is particularly proficient at distinguishing between cell types, especially CD4 and CD8 T cells.
Availability and implementation: the scMMAE source code is available at https://github.com/DM0815/scMMAE/.
Keywords: cross-attention; multimodal masked autoencoder; scRNA-seq; single-cell multimodal omics.
© The Author(s) 2025. Published by Oxford University Press.
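To make the fusion idea concrete, the sketch below illustrates single-head cross-attention, in which token embeddings from one modality (e.g. RNA) attend to tokens from the other (e.g. protein/ADT). This is a minimal NumPy illustration with made-up dimensions and untrained random projection weights, not the authors' implementation from the scMMAE repository.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, seed=0):
    """Single-head cross-attention: queries come from one modality,
    keys/values from the other, so each query token is rewritten as a
    weighted mix of the other modality's tokens. Weights are random
    (illustrative only; a real model learns them)."""
    d = query_tokens.shape[-1]
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q = query_tokens @ Wq          # (n_q, d)
    K = context_tokens @ Wk        # (n_c, d)
    V = context_tokens @ Wv        # (n_c, d)
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n_q, n_c), rows sum to 1
    return attn @ V                # (n_q, d) fused representation

# Toy example: 32 gene tokens attend to 10 protein (ADT) tokens for one cell
rna = np.random.default_rng(1).standard_normal((32, 16))
adt = np.random.default_rng(2).standard_normal((10, 16))
fused = cross_attention(rna, adt)
print(fused.shape)  # (32, 16)
```

In a masked-autoencoder setting, a random subset of input tokens would be masked before encoding and reconstructed by the decoder; the cross-attention step above is only the modality-fusion component.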