Self-distillation improves self-supervised learning for DNA sequence inference

Neural Netw. 2025 Mar;183:106978. doi: 10.1016/j.neunet.2024.106978. Epub 2024 Dec 7.

Abstract

Self-supervised learning (SSL) is widely recognized as a way to improve prediction accuracy on downstream tasks. However, its efficacy for DNA sequences remains limited, primarily because most existing SSL approaches in genomics focus on masked language modeling of individual sequences and neglect the encoding of statistics across multiple sequences. To address this limitation, we introduce a deep neural network model built on collaborative learning between a 'student' and a 'teacher' subnetwork. The student subnetwork performs masked learning on nucleotides, and its parameters are progressively propagated to the teacher subnetwork through an exponential moving average update. Concurrently, both subnetworks engage in contrastive learning on two augmented representations of the input sequences. This self-distillation process enables the model to assimilate both contextual information from individual sequences and distributional information across the sequence population. We validated our approach by first pretraining on the human reference genome and then applying the model to 20 downstream inference tasks. The empirical results show that our method significantly improves inference performance on the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
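To make the described training scheme concrete, the sketch below combines the three ingredients named in the abstract: masked nucleotide prediction by a student network, a teacher network updated as an exponential moving average of the student, and a contrastive loss between two augmented views of the same sequences. This is a minimal illustration only; all module names, dimensions, augmentations, and hyperparameters are assumptions for exposition, not the authors' FinDNA implementation.

```python
# Minimal PyTorch sketch of a student-teacher self-distillation step.
# Everything here (Encoder, augment, loss weighting) is illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 6          # A, C, G, T, N, [MASK]  (assumed tokenization)
MASK_ID = 5
D_MODEL = 128

class Encoder(nn.Module):
    """Toy transformer encoder standing in for the DNA sequence backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_vocab = nn.Linear(D_MODEL, VOCAB)   # masked-nucleotide head
        self.to_proj = nn.Linear(D_MODEL, D_MODEL)  # contrastive projection head

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.to_vocab(h), F.normalize(self.to_proj(h.mean(dim=1)), dim=-1)

def augment(tokens, p=0.1):
    """Random nucleotide substitution as a stand-in augmentation."""
    noise = torch.randint_like(tokens, 0, 4)
    keep = torch.rand(tokens.shape) > p
    return torch.where(keep, tokens, noise)

def mask_tokens(tokens, p=0.15):
    masked = tokens.clone()
    mask = torch.rand(tokens.shape) < p
    masked[mask] = MASK_ID
    return masked, mask

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    """Teacher parameters track the student via exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

student = Encoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

seqs = torch.randint(0, 4, (8, 200))           # batch of toy DNA token sequences
view_s, view_t = augment(seqs), augment(seqs)  # two augmented views

# Student: masked nucleotide prediction on its view.
masked, mask = mask_tokens(view_s)
logits, z_student = student(masked)
mlm_loss = F.cross_entropy(logits[mask], view_s[mask])

# Teacher: no gradients; contrastive alignment of the two views' embeddings.
with torch.no_grad():
    _, z_teacher = teacher(view_t)
sim = z_student @ z_teacher.t() / 0.1          # temperature-scaled similarities
contrastive_loss = F.cross_entropy(sim, torch.arange(sim.size(0)))

loss = mlm_loss + contrastive_loss
opt.zero_grad()
loss.backward()
opt.step()
ema_update(teacher, student)                   # propagate student weights to teacher
```

In this sketch the student receives gradients from both objectives, while the teacher is updated only through the EMA step, which is the usual arrangement in self-distillation frameworks of this kind.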

Keywords: Contrastive learning; DNA sequence modeling; Self-supervised pretraining.

MeSH terms

  • Algorithms
  • DNA / genetics
  • Deep Learning
  • Genome, Human
  • Genomics / methods
  • Humans
  • Neural Networks, Computer*
  • Sequence Analysis, DNA / methods
  • Supervised Machine Learning*

Substances

  • DNA