A hybrid residue based sequential encoding mechanism with XGBoost improved ensemble model for identifying 5-hydroxymethylcytosine modifications

Sci Rep. 2024 Sep 6;14(1):20819. doi: 10.1038/s41598-024-71568-z.

Abstract

RNA modifications play an important role in actively controlling newly formed molecules in cellular regulation mechanisms, which links them to gene expression and protein production. RNA carries numerous modifications, which offer a broad view of its function and character. Oxidation by the TET enzyme is the crucial change associated with cytosine hydroxymethylation. The effect of CR is an alteration in specific biochemical pathways of the organism, such as gene expression and epigenetic alterations. Traditional laboratory systems that identify 5-hydroxymethylcytosine (5hmC) samples are expensive and time-consuming compared to other methods. To address this challenge, this paper proposes XGB5hmC, a machine learning model based on a robust gradient boosting algorithm (XGBoost), with different residue based formulation methods to identify 5hmC samples. The outputs of these methods were combined, and six different frequency residue based encoding features were fused into a hybrid vector to enhance the model's discrimination capability. In addition, the proposed model incorporates SHAP (Shapley Additive Explanations) based feature selection to demonstrate model interpretability by highlighting the features that contribute most. Among the applied machine learning algorithms, the XGBoost ensemble model, evaluated with a tenfold cross-validation test, achieved better results than existing state-of-the-art models. Our model reported an accuracy of 89.97%, sensitivity of 87.78%, specificity of 94.45%, F1-score of 0.8934, and MCC of 0.8764. This study highlights the model's potential to provide valuable insights for enhancing medical assessment and treatment protocols, representing a significant advancement in RNA modification analysis.
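The idea of fusing several frequency based residue encodings into one hybrid vector, and of reporting sensitivity, specificity, F1-score, and MCC, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the six encodings, so plain k-mer frequencies for k = 1, 2, 3 are used here as stand-ins, and the function names (kmer_frequencies, hybrid_vector, binary_metrics) are hypothetical. In a full pipeline the resulting vectors would be passed to a gradient boosting classifier such as XGBoost.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # RNA residues; T is mapped to U before encoding


def kmer_frequencies(seq, k):
    """Frequencies of all 4**k k-mers over the sequence, in a fixed order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows of width k
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous residues (e.g. N) are skipped
            counts[window] += 1
    # Counts are normalized by the number of windows, not by the matched count.
    return [counts[m] / total if total > 0 else 0.0 for m in kmers]


def hybrid_vector(seq, ks=(1, 2, 3)):
    """Concatenate several frequency encodings into one feature vector.

    Assumption: k = 1, 2, 3 stand in for the paper's six encodings.
    """
    seq = seq.upper().replace("T", "U")
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive (5hmC) class
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = ratio(tp * tn - fp * fn,
                sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,    # lies in [0, 1]
        "mcc": mcc,  # lies in [-1, 1]
    }
```

With these stand-in encodings, hybrid_vector returns 4 + 16 + 64 = 84 features per sequence. Note that F1 and MCC are unitless scores rather than percentages, which is why they are reported on a 0 to 1 scale.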

Keywords: 5-Hydroxymethylcytosine; Cancer; Cardiovascular diseases; Machine learning; RNA modifications; TET enzyme; XGBoost.

MeSH terms

  • 5-Methylcytosine* / analogs & derivatives
  • 5-Methylcytosine* / metabolism
  • Algorithms*
  • Cytosine / analogs & derivatives
  • Cytosine / metabolism
  • Humans
  • Machine Learning*

Substances

  • 5-hydroxymethylcytosine
  • 5-Methylcytosine
  • Cytosine