Coronary artery disease risk assessment from unstructured electronic health records using text mining

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S203-S210. doi: 10.1016/j.jbi.2015.08.003. Epub 2015 Aug 28.

Abstract

Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may in turn enable prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, when the data are not collected specifically for risk assessment, risk factor data are usually embedded in unstructured clinical narratives. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract the risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but a significant amount of the data required to calculate the Framingham risk score was missing. A systematic approach to understanding the missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort showed that the majority of the diabetic patients are at moderate risk of CAD.
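
To make the idea of rule-based extraction concrete, the following is a minimal sketch of how a few of the listed risk factors might be pulled from a free-text note and how missing fields might be flagged before any score is calculated. The abstract does not describe the study's actual rules, so every pattern, field name and function here (`PATTERNS`, `extract_risk_factors`, `missing_fields`) is a hypothetical illustration and not the authors' system. The sketch also makes no attempt to handle negation beyond smoking status, units, or temporal context.

```python
import re

# Hypothetical patterns for illustration only; the study's real rules
# are not given in the abstract.
PATTERNS = {
    "total_cholesterol": re.compile(
        r"\btotal cholesterol\D{0,15}(\d+(?:\.\d+)?)", re.I),
    "hdl_c": re.compile(
        r"\bHDL(?:-C)?\D{0,15}(\d+(?:\.\d+)?)", re.I),
    # Captures the systolic value from a "systolic/diastolic" reading.
    "systolic_bp": re.compile(
        r"\b(?:BP|blood pressure)\D{0,15}(\d{2,3})\s*/\s*\d{2,3}", re.I),
}

# Assumed phrasing for smoking status; non-current-smoker phrases are
# checked first so that "non-smoker" is not read as "smoker".
NOT_SMOKER = re.compile(
    r"\b(?:non-?smoker|ex-?smoker|former smoker|never smoked|denies smoking)\b",
    re.I)
SMOKER = re.compile(r"\b(?:smoker|smokes|smoking)\b", re.I)


def extract_risk_factors(note):
    """Return a dict of risk factors; a value of None means not found."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(note)
        found[name] = float(match.group(1)) if match else None
    if NOT_SMOKER.search(note):
        found["smoker"] = False
    elif SMOKER.search(note):
        found["smoker"] = True
    else:
        found["smoker"] = None
    return found


def missing_fields(found):
    """List the risk factors that could not be extracted from the note."""
    return sorted(name for name, value in found.items() if value is None)


# Made-up example note, not taken from the study data.
note = "67 yo male, non-smoker. BP 142/88. Total cholesterol 5.2 mmol/L."
factors = extract_risk_factors(note)
print(factors)
print("missing:", missing_fields(factors))
```

In this made-up note no HDL-C value is recorded, so `missing_fields` reports it. That mirrors the situation the abstract describes, where extraction is reliable but some inputs to the risk score are absent and must be handled by an imputation strategy.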

Keywords: Coronary artery disease; EHR; Framingham risk score; Temporal data; Text mining.

MeSH terms

  • Aged
  • Cardiovascular Diseases / diagnosis
  • Cardiovascular Diseases / epidemiology*
  • Cohort Studies
  • Comorbidity
  • Computer Security
  • Confidentiality
  • Data Mining / methods*
  • Diabetes Complications / diagnosis
  • Diabetes Complications / epidemiology*
  • Electronic Health Records / organization & administration*
  • Female
  • Humans
  • Incidence
  • Longitudinal Studies
  • Male
  • Middle Aged
  • Narration*
  • Natural Language Processing*
  • New South Wales / epidemiology
  • Pattern Recognition, Automated / methods
  • Risk Assessment / methods
  • Vocabulary, Controlled