PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest

Interdiscip Sci. 2022 Sep;14(3):697-711. doi: 10.1007/s12539-022-00520-4. Epub 2022 Apr 30.

Abstract

Promoters are short DNA sequences that play vital roles in initiating gene transcription. However, it remains a challenge to identify promoters in a high-throughput manner using conventional experimental techniques. To this end, several computational predictors based on machine learning models have been developed, but their performance remains unsatisfactory. In this study, we proposed a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) was developed by combining deep features learned by a pre-trained deep learning network model with sequence-derived features. Feature selection based on XGBoost was applied to reduce the dimensionality of the fused features, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results of both fivefold cross-validation and independent testing demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
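The fusion and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the sequence-derived descriptors, so normalized k-mer frequencies are assumed here as one common choice. The function names `kmer_frequencies`, `fuse_features`, and `select_top_features` are hypothetical. The importance scores are passed in as plain numbers; in the described pipeline they would come from a trained XGBoost model, and the cascade deep forest classifier is not shown.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over ACGT, in lexicographic k-mer order.

    Assumption: k-mer composition stands in for the paper's
    sequence-derived features, which the abstract does not specify.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fuse_features(*feature_vectors):
    """Fuse features from several sources by simple concatenation."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def select_top_features(vector, importances, n_keep):
    """Keep the n_keep most important features, preserving their original order.

    `importances` is a hypothetical per-feature score list, e.g. taken
    from a fitted gradient-boosting model.
    """
    if len(vector) != len(importances):
        raise ValueError("vector and importances must have the same length")
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [vector[i] for i in keep]
```

For a given sequence, a placeholder deep-feature vector could be fused with `kmer_frequencies(seq)` through `fuse_features`, and the result reduced with `select_top_features` before being passed to a classifier.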

Keywords: Deep Forest; Deep learning; Feature fusion; Feature selection; Machine learning; Promoter.

MeSH terms

  • Base Sequence
  • Machine Learning*
  • Promoter Regions, Genetic / genetics