Novel gene-specific Bayesian Gaussian mixture model to predict the missense variants pathogenicity of Sanfilippo syndrome

Sci Rep. 2024 May 27;14(1):12148. doi: 10.1038/s41598-024-62352-0.

Abstract

MPS III is an autosomal recessive lysosomal storage disease caused mainly by missense variants in the NAGLU, GNS, HGSNAT, and SGSH genes. The pathogenicity interpretation of missense variants is still challenging. We aimed to develop unsupervised clustering-based pathogenicity predictor scores using extracted features from eight in silico predictors to predict the impact of novel missense variants of Sanfilippo syndrome. The model was trained on a dataset consisting of 415 uncertain significant (VUS) missense NAGLU variants. Performance The SanfilippoPred tool was evaluated by validation and test datasets consisting of 197-labelled NAGLU missense variants, and its performance was compared versus individual pathogenicity predictors using receiver operating characteristic (ROC) analysis. Moreover, we tested the SanfilippoPred tool using extra-labelled 427 missense variants to assess its specificity and sensitivity threshold. Application of the trained machine learning (ML) model on the test dataset of labelled NAGLU missense variants showed that SanfilippoPred has an accuracy of 0.93 (0.86-0.97 at CI 95%), sensitivity of 0.93, and specificity of 0.92. The comparative performance of the SanfilippoPred showed better performance (AUC = 0.908) than the individual predictors SIFT (AUC = 0.756), Polyphen-2 (AUC = 0.788), CADD (AUC = 0.568), REVEL (AUC = 0.548), MetaLR (AUC = 0.751), and AlphMissense (AUC = 0.885). Using high-confidence labelled NAGLU variants, showed that SanfilippoPred has an 85.7% sensitivity threshold. The poor correlation between the Sanfilippo syndrome phenotype and genotype represents a demand for a new tool to classify its missense variants. This study provides a significant tool for preventing the misinterpretation of missense variants of the Sanfilippo syndrome-relevant genes. Finally, it seems that ML-based pathogenicity predictors and Sanfilippo syndrome-specific prediction tools could be feasible and efficient pathogenicity predictors in the future.

Keywords: Machine-learning model; Missense variants; Pathogenicity prediction; Sanfilippo syndrome.

MeSH terms

  • Bayes Theorem*
  • Computational Biology / methods
  • Humans
  • Machine Learning
  • Mucopolysaccharidosis III* / genetics
  • Mutation, Missense*
  • Normal Distribution
  • ROC Curve