Rapid discovery of Transglutaminase 2 inhibitors for celiac disease with boosting ensemble machine learning

Comput Struct Biotechnol J. 2024 Oct 16:23:3669-3679. doi: 10.1016/j.csbj.2024.10.019. eCollection 2024 Dec.

Abstract

Celiac disease poses a significant health challenge for individuals consuming gluten-containing foods. While the availability of gluten-free products has increased, there is still a need for therapeutic treatments. The advancement of computational drug design, particularly using bio-cheminformatics-oriented machine learning, offers promising avenues for developing such therapies. One promising target is Transglutaminase 2 (TG2), a protein involved in the autoimmune response triggered by gluten consumption. In this study, we utilized data from approximately 1100 TG2 inhibition assays to develop ligand-based molecular screening techniques using ensemble machine-learning models and extensive molecular feature libraries. Various classifiers, including tree-based methods, artificial neural networks, and graph neural networks, were evaluated to identify primary systems for predictive analysis and feature significance assessment. Boosting ensembles of perceptron deep learning and low-depth random forest weak learners emerged as the most effective, achieving over 90% accuracy, significantly outperforming a baseline of 64%. Key features, such as the presence of a terminal Michael acceptor group and a sulfonamide group, were identified as important for activity. Additionally, a regression model was created to rank active compounds. We developed a web application, Celiac Informatics (https://celiac-informatics-v1-2b0a85e75868.herokuapp.com), to facilitate the screening of potential therapeutic molecules for celiac disease. The web app also provides drug-likeness reports, supporting the development of novel drugs.

Keywords: Celiac disease; Computational drug discovery; Ensemble machine learning; Inhibitor screening; Quantitative structure-activity relationship (QSAR); Transglutaminase 2.
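
The abstract credits boosting ensembles of weak learners with the best classification performance. The general idea of boosting can be illustrated with a minimal, dependency-free AdaBoost sketch. This is not the authors' implementation: the study reports perceptron and low-depth random forest weak learners, whereas this sketch swaps in single-feature decision stumps for brevity. The data are a hypothetical toy set (three generic binary descriptors per compound, with an arbitrary labelling rule), not TG2 assay data, and all function names are illustrative.

```python
import math

# Hypothetical toy data (NOT from the study): every combination of three
# generic binary descriptors, labelled +1 ("active") when at least two are set.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [1 if sum(row) >= 2 else -1 for row in X]


def stump_predict(row, feature, polarity):
    """Weak learner: vote `polarity` if the feature is present, else the opposite."""
    return polarity if row[feature] == 1 else -polarity


def train_stump(X, y, w):
    """Return (weighted_error, feature, polarity) for the best single-feature stump."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            err = sum(
                wi
                for wi, row, yi in zip(w, X, y)
                if stump_predict(row, feature, polarity) != yi
            )
            if best is None or err < best[0]:
                best = (err, feature, polarity)
    return best


def adaboost_fit(X, y, n_rounds):
    """Discrete AdaBoost: reweight samples so later stumps focus on earlier mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, feature, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, feature, polarity))
        # Increase weights of misclassified samples, decrease the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, polarity))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(row, f, p) for a, f, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, X, y):
    hits = sum(adaboost_predict(ensemble, row) == yi for row, yi in zip(X, y))
    return hits / len(X)


single = accuracy(adaboost_fit(X, y, 1), X, y)   # one weak learner alone
boosted = accuracy(adaboost_fit(X, y, 3), X, y)  # three boosted weak learners
print(single, boosted)
```

On this toy set a single stump classifies 6 of 8 rows correctly, while three boosting rounds classify all 8, which shows how weighted votes from individually weak learners can combine into a stronger classifier. The numbers say nothing about the accuracies reported in the abstract.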