Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source

JACC Adv. 2023 Dec 25;3(2):100801. doi: 10.1016/j.jacadv.2023.100801. eCollection 2024 Feb.

Abstract

Background: As interest grows in using large claims databases for medical practice and research, efficiently identifying patients with the disease of interest has become an essential step.

Objectives: This study aims to establish a machine learning (ML) approach to identify patients with congenital heart disease (CHD) in large claims databases.

Methods: We used data from the Quebec claims and hospitalization databases from 1983 to 2000. The study included 19,187 patients, of whom 3,784 were labeled as true CHD patients by a clinician-developed algorithm with manual audits, which served as the gold standard. To establish an accurate, automated ML-based CHD classification system, we evaluated ML methods including Gradient Boosting Decision Tree, Support Vector Machine, and Decision Tree, and compared them with regularized logistic regression. The Area Under the Precision Recall Curve was used as the evaluation metric. External validation was conducted on a data set updated through 2010 that comprised different subjects.
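
The choice of the Area Under the Precision Recall Curve can be illustrated with a small sketch. With 3,784 of 19,187 patients labeled as CHD, about 19.7% of the cohort is positive, and a classifier that ranks patients at random would be expected to score an area near that prevalence. The Python below is a minimal illustration, not the study's implementation: it estimates the area by average precision, one common estimator that the abstract does not confirm was used, and the function name, labels, and scores are hypothetical toy values.

```python
def average_precision(y_true, scores):
    """Estimate the area under the precision-recall curve by average precision.

    Assumptions (not from the study): labels are 0/1, higher scores mean
    "more likely CHD", and tied scores are kept in their sorted order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank patients from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at the rank where this positive is retrieved.
            total += true_pos / rank
    # Each positive raises recall by 1 / n_pos, so average over positives.
    return total / n_pos


# Prevalence of labeled CHD patients in the study cohort (from the abstract).
prevalence = 3784 / 19187

# Hypothetical toy labels and scores, for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)

print(f"prevalence = {prevalence:.3f}")
print(f"toy average precision = {toy_ap:.3f}")
```

Because the precision-recall curve ignores true negatives, this metric tends to be more informative than overall accuracy when the positive class is a minority, as it is here.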

Results: Among the ML methods we evaluated, Gradient Boosting Decision Tree performed best in identifying true CHD patients, with an Area Under the Precision Recall Curve of 99.3%, a sensitivity of 98.0%, and a specificity of 99.7%. External validation returned similar model performance.
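
Sensitivity and specificity follow directly from comparing predicted labels with the gold-standard labels. The sketch below shows the arithmetic in Python under stated assumptions: the function name and the toy label vectors are hypothetical and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 gold-standard and predicted labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # true CHD patient correctly identified
        elif truth == 1 and pred == 0:
            fn += 1  # true CHD patient missed
        elif truth == 0 and pred == 0:
            tn += 1  # non-CHD patient correctly excluded
        else:
            fp += 1  # non-CHD patient wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / (tp + fn)  # share of true CHD patients that are detected
    specificity = tn / (tn + fp)  # share of non-CHD patients that are excluded
    return sensitivity, specificity


# Hypothetical toy labels, for illustration only.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(gold, pred)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```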

Conclusions: This study shows that tedious and time-consuming clinical inspection for CHD patient identification can be replaced by a highly efficient ML algorithm in large claims databases. Our findings demonstrate that ML methods can be used to automate complicated algorithms for identifying patients with complex diseases.

Keywords: congenital heart disease; large administrative claims database; machine learning.