Identification of biomarkers for knee osteoarthritis through clinical data and machine learning models

Sci Rep. 2025 Jan 11;15(1):1703. doi: 10.1038/s41598-025-85945-9.

Abstract

Knee osteoarthritis (KOA) is a progressive degenerative disorder characterized by the gradual erosion of articular cartilage. This study aimed to develop and validate biomarker-based predictive models for KOA diagnosis using machine learning techniques. Clinical data from 2594 samples were obtained and stratified into training and validation datasets in a 7:3 ratio. Key clinical features were identified through differential analysis between KOA and control groups, combined with least absolute shrinkage and selection operator (LASSO) regression. The SHapley Additive exPlanations (SHAP) method was employed to rank feature importance quantitatively. Based on these rankings, predictive models were constructed using Logistic Regression (LR), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Naive Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT) algorithms. Models were developed for subsets of variables, including the top 5, top 10, top 15, and all identified features. Receiver operating characteristic (ROC) curves were applied to compare diagnostic performance across models. Additionally, a risk stratification framework for KOA prediction was designed using recursive partitioning analysis (RPA). Using differential analysis and LASSO, 44 critical clinical features were identified. Among these, age, plasma prothrombin time, gender, body mass index (BMI), and prothrombin time and international normalized ratio (PTINR) emerged as the top five features, with SHAP values of 0.1990, 0.0981, 0.0471, 0.0433, and 0.0422, respectively. Machine learning analysis demonstrated that these variables provided robust diagnostic performance for KOA. In the training set, area under the curve (AUC) values for the LR, RF, XGBoost, NB, SVM, and DT models were 0.947, 0.961, 0.892, 0.952, 0.885, and 0.779, respectively. Similarly, in the validation dataset, these models achieved AUC values of 0.961, 0.943, 0.789, 0.957, 0.824, and 0.76, respectively. Among them, RF consistently exhibited superior diagnostic accuracy for KOA. Additionally, RPA indicated a higher prevalence of KOA among individuals aged 54 years and older. The integration of the top five clinical variables significantly enhanced the diagnostic accuracy for KOA, particularly when employing the RF model. Moreover, the RPA model offered valuable insights to assist clinicians in refining prognostic assessments and optimizing clinical decision-making processes.
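
The two evaluation steps named in the abstract, a 7:3 split into training and validation sets and a comparison of models by AUC, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `stratified_split` and `roc_auc` are hypothetical, the per-class (stratified) shuffling and the fixed seed are assumptions, and the toy labels and scores are invented purely for illustration. The AUC is computed from its rank-based definition, the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split sample indices into training and validation sets.

    Each class is shuffled and cut separately, so both sets keep roughly
    the same case/control proportion (an assumption about the study design).
    """
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 10 hypothetical KOA cases (1) and 10 controls (0).
    toy_labels = [1] * 10 + [0] * 10
    train, valid = stratified_split(toy_labels)
    print(len(train), len(valid))

    # Toy scores from a hypothetical classifier.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a dataset of a few thousand rows; a library routine would normally be used in practice.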

Keywords: Clinical data; Diagnostic performance; Knee osteoarthritis; Machine learning.

MeSH terms

  • Aged
  • Biomarkers*
  • Decision Trees
  • Female
  • Humans
  • Logistic Models
  • Machine Learning*
  • Male
  • Middle Aged
  • Osteoarthritis, Knee* / diagnosis
  • ROC Curve
  • Support Vector Machine

Substances

  • Biomarkers