Prototype-based contrastive substructure identification for molecular property prediction

Gaoqi He; Shun Liu; Zhuoran Liu; Changbo Wang; Kai Zhang; Honglin Li

doi:10.1093/bib/bbae565

Brief Bioinform. 2024 Sep 23;25(6):bbae565. doi: 10.1093/bib/bbae565.

Authors

Gaoqi He¹, Shun Liu¹, Zhuoran Liu¹, Changbo Wang¹, Kai Zhang¹, Honglin Li²,³

Affiliations

¹ School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
² Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China.
³ Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, 200237 Shanghai, China.

PMID: 39494969
DOI: 10.1093/bib/bbae565

Abstract

Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.

Keywords: Graph Neural Networks; contrastive learning; molecular property prediction; self-supervised learning.
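
Illustrative sketch (not from the paper): the abstract names a prototype-substructure contrastive clustering objective but does not give its form. The NumPy snippet below shows one common way such an objective can be written, an InfoNCE-style cross-entropy over cosine similarities between substructure embeddings and prototypes. The function name `prototype_contrastive_loss`, the `temperature` parameter, the hard assignment vector `assign`, and the toy data are all assumptions made for illustration; the authors' actual implementation is in the repository linked above and may differ substantially.

```python
import numpy as np


def prototype_contrastive_loss(sub_emb, prototypes, assign, temperature=0.1):
    """Hypothetical InfoNCE-style loss: pull each substructure embedding
    toward its assigned prototype and away from the other prototypes."""
    # L2-normalise rows so that dot products are cosine similarities.
    z = sub_emb / np.linalg.norm(sub_emb, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = z @ c.T / temperature  # shape: (n_substructures, n_prototypes)
    # Numerically stable log-softmax over the prototype axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the assigned prototype.
    return float(-log_prob[np.arange(len(assign)), assign].mean())


# Toy data (assumed): 4 prototypes, 6 substructures lying near their prototypes.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, 16))
assign = np.array([0, 1, 2, 3, 0, 1])
sub_emb = prototypes[assign] + 0.05 * rng.normal(size=(6, 16))

aligned = prototype_contrastive_loss(sub_emb, prototypes, assign)
shuffled = prototype_contrastive_loss(sub_emb, prototypes, (assign + 1) % 4)
print(aligned, shuffled)
```

In this toy setting the loss is lower when each substructure is paired with the prototype it was generated from than when the pairing is shifted, which is the behaviour a clustering-style contrastive objective is meant to reward.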

MeSH terms

Algorithms*
Cluster Analysis
Computational Biology / methods
Molecular Structure
Software