Histone deacetylases (HDACs) are enzymes that regulate gene expression by removing acetyl groups from histones. They are implicated in various diseases, including neurodegenerative, cardiovascular, inflammatory, and metabolic disorders, as well as fibrosis of the liver, lungs, and kidneys. Identifying potent HDAC inhibitors therefore offers a promising route to treating these diseases. In addition to experimental techniques, researchers have introduced several in silico methods for identifying HDAC inhibitors; however, these computer-aided methods have shortcomings in their modeling stages that limit their applicability. In this study, we present a Streamlined Masked Transformer-based Pretrained (SMTP) encoder, which generates molecular features for downstream tasks. Training of the SMTP encoder was guided by masked attention-based learning, enhancing the model's generalizability in encoding molecules. The SMTP features were used to develop 11 classification models, one for each of 11 HDAC isoforms. Although SMTP is a lightweight encoder trained on only 1.9 million molecules, fewer than other known molecular encoders use, its discriminative ability remains competitive. The results revealed that machine learning models built on the SMTP feature set outperformed those built on other feature sets in 8 of 11 classification tasks. Additionally, chemical diversity analysis confirmed the encoder's effectiveness in distinguishing between the two classes of molecules.
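The abstract does not specify the details of the masked pretraining objective, and the SMTP encoder itself operates on molecular graphs with masked attention. As a purely illustrative sketch of the general idea behind masked pretraining of molecular encoders, the snippet below builds (masked input, prediction target) pairs from a tokenized SMILES string; all names and the 15% default mask rate are conventions borrowed from BERT-style pretraining, not details taken from this paper.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Randomly replace tokens with a mask token.

    Returns the masked sequence and a list of (position, original token)
    labels; during pretraining, the encoder is trained to predict the
    original tokens at the masked positions.
    """
    rng = rng or random.Random(0)
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)   # hide the token from the model
            labels.append((i, tok))     # remember what it must predict
        else:
            masked.append(tok)
    return masked, labels

# Example with a character-level tokenization of ethyl methyl ether (SMILES "CCOC");
# a real encoder would use a proper SMILES or graph tokenizer.
masked, labels = mask_tokens(list("CCOC"), mask_rate=0.5, rng=random.Random(0))
```

A pretrained encoder of this kind can then be frozen and used as a feature extractor, with its per-molecule embeddings fed to conventional classifiers such as the 11 isoform-specific HDAC models described above.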
Keywords: Attention; Deep learning; Histone deacetylases; Machine learning; Molecular encoder; Molecular graph; Transformer.
Copyright © 2024 Elsevier Inc. All rights reserved.