LncSL: A Novel Stacked Ensemble Computing Tool for Subcellular Localization of lncRNA by Amino Acid-Enhanced Features and Two-Stage Automated Selection Strategy

Int J Mol Sci. 2024 Dec 23;25(24):13734. doi: 10.3390/ijms252413734.

Abstract

Long non-coding RNA (lncRNA) is a non-coding RNA longer than 200 nucleotides, crucial for functions like cell cycle regulation and gene transcription. Accurate localization prediction from sequence information is vital for understanding lncRNA's biological roles. Computational methods offer an effective alternative to traditional experimental methods for annotating lncRNA subcellular positions. Existing machine learning-based methods are limited and often overlook regions with coding potential that affect the function of lncRNA. Therefore, we propose a new model called LncSL. For feature encoding, both lncRNA sequences and amino acid sequences from open reading frames (ORFs) are employed. And we selected the most suitable features by CatBoost and integrated them into a new feature set. Additionally, a voting process with seven feature selection algorithms identified the higher contributive features for training our final stacked model. Additionally, an automatic model selection strategy is constructed to find a better performance meta-model for assembling LncSL. This study specifically focuses on predicting the subcellular localization of lncRNA in the nucleus and cytoplasm. On two benchmark datasets called S1 and S2 datasets, LncSL outperformed existing methods by 6.3% to 12.3% in the Matthew's correlation coefficient on a balanced test dataset. On an unbalanced independent test dataset sourced from S1, LncSL improved by 4.7% to 18.6% in the Matthew's correlation coefficient, which further demonstrates that LncSL is superior to other compared methods. In all, this study presents an effective method for predicting lncRNA subcellular localization through enhancing sequence information, which is always overlooked by traditional methods, and addressing contributive meta-model selection problems, which can offer new insights for other bioinformatics problems.

Keywords: feature fusion; long non-coding RNA; sequence analysis; stacking strategy.

MeSH terms

  • Algorithms
  • Cell Nucleus / genetics
  • Cell Nucleus / metabolism
  • Computational Biology* / methods
  • Cytoplasm / genetics
  • Cytoplasm / metabolism
  • Humans
  • Machine Learning
  • Open Reading Frames
  • RNA, Long Noncoding* / genetics
  • Software

Substances

  • RNA, Long Noncoding