A robust and interpretable ensemble machine learning model for predicting healthcare insurance fraud

Sci Rep. 2025 Jan 2;15(1):218. doi: 10.1038/s41598-024-82062-x.

Abstract

Healthcare insurance fraud imposes a significant financial burden on healthcare systems worldwide, with annual losses reaching billions of dollars. This study aims to improve fraud detection accuracy using machine learning techniques. Our approach consists of three key stages: data preprocessing, model training and integration, and result analysis with feature interpretation. First, we examined the dataset's characteristics and used embedded and permutation-based feature selection to compare the performance and runtime of individual models across different feature sets, choosing the smallest feature set that still achieved high performance. We then applied ensemble techniques, including Voting, Weighted, and Stacking methods, to combine different models and compare their performance. Feature interpretation was carried out with partial dependence plots (PDP), SHAP, and LIME, allowing us to understand each feature's impact on the predictions. Finally, we benchmarked our approach against existing studies to evaluate its advantages and limitations. The findings demonstrate improved fraud detection accuracy and offer insights into the interpretability of machine learning models in this context.
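
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of two of the ideas it names: combining models by voting (majority and weighted) and ranking features by permutation importance. Everything in it is an assumption made for illustration: the synthetic data, the three toy scoring rules standing in for trained classifiers, the weights, and all function names. It is not the authors' pipeline, and it does not cover Stacking, PDP, SHAP, or LIME. In practice one would more likely rely on an established library (for example scikit-learn's `VotingClassifier`, `StackingClassifier`, and `permutation_importance`) rather than hand-written helpers.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Scorer = Callable[[Row], float]  # returns an estimated P(fraud) in [0, 1]


def hard_vote(scorers: List[Scorer], row: Row, threshold: float = 0.5) -> int:
    """Majority vote over each scorer's thresholded decision."""
    votes = sum(1 for s in scorers if s(row) >= threshold)
    return int(2 * votes > len(scorers))


def weighted_vote(scorers: List[Scorer], weights: List[float], row: Row,
                  threshold: float = 0.5) -> int:
    """Threshold the weight-averaged fraud probability (soft voting)."""
    avg = sum(w * s(row) for s, w in zip(scorers, weights)) / sum(weights)
    return int(avg >= threshold)


def accuracy(predict: Callable[[Row], int], X, y) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats: int = 10, seed: int = 0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [list(r[:j]) + [v] + list(r[j + 1:])
                      for r, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic stand-in data (hypothetical): two informative features and one
# pure-noise feature, each scaled to [0, 1]. Label 1 means "fraud".
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [int(r[0] + r[1] >= 1.0) for r in X]

# Toy scorers standing in for trained base models (assumption).
scorers = [lambda r: r[0], lambda r: r[1], lambda r: (r[0] + r[1]) / 2]
weights = [1.0, 1.0, 2.0]  # illustrative weights, not tuned


def ensemble(row: Row) -> int:
    return weighted_vote(scorers, weights, row)


hard_acc = accuracy(lambda r: hard_vote(scorers, r), X, y)
soft_acc = accuracy(ensemble, X, y)
importances = permutation_importance(ensemble, X, y)

print("hard-vote accuracy:", hard_acc)
print("weighted-vote accuracy:", soft_acc)
print("permutation importances:", importances)
```

In this toy setup none of the scorers reads the third feature, so shuffling it cannot change any prediction and its importance is zero. That is the kind of signal that would justify dropping a feature when looking for the smallest feature set that preserves performance.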

Keywords: Healthcare insurance fraud; Machine learning; Model ensemble; Model interpretability.

MeSH terms

  • Fraud*
  • Humans
  • Insurance, Health* / economics
  • Machine Learning*