An ensemble approach to predict binding hotspots in protein-RNA interactions based on SMOTE data balancing and Random Grouping feature selection strategies

Bioinformatics. 2022 Apr 28;38(9):2452-2458. doi: 10.1093/bioinformatics/btac138.

Abstract

Motivation: The identification of binding hotspots in protein-RNA interactions is crucial for understanding their potential recognition mechanisms and drug design. The experimental methods have many limitations, since they are usually time-consuming and labor-intensive. Thus, developing an effective and efficient theoretical method is urgently needed.

Results: Here, we present SREPRHot, a method to predict hotspots, defined as the residues whose mutation to alanine generate a binding free energy change ≥2.0 kcal/mol, while others use a cutoff of 1.0 kcal/mol to obtain balanced datasets. To deal with the dataset imbalance, Synthetic Minority Over-sampling Technique (SMOTE) is utilized to generate minority samples to achieve a dataset balance. Additionally, besides conventional features, we use two types of new features, residue interface propensity previously developed by us, and topological features obtained using node-weighted networks, and propose an effective Random Grouping feature selection strategy combined with a two-step method to determine an optimal feature set. Finally, a stacking ensemble classifier is adopted to build our model. The results show SREPRHot achieves a good performance with SEN, MCC and AUC of 0.900, 0.557 and 0.829 on the independent testing dataset. The comparison study indicates SREPRHot shows a promising performance.

Availability and implementation: The source code is available at https://github.com/ChunhuaLiLab/SREPRHot.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • RNA*
  • Software*

Substances

  • RNA