Development and Validation of an Algorithm to Identify Nonalcoholic Fatty Liver Disease in the Electronic Medical Record

Kathleen E Corey; Uri Kartoun; Hui Zheng; Stanley Y Shaw

doi:10.1007/s10620-015-3952-x

Dig Dis Sci. 2016 Mar;61(3):913-9. doi: 10.1007/s10620-015-3952-x. Epub 2015 Nov 4.

Authors

Kathleen E Corey¹,², Uri Kartoun³,⁴, Hui Zheng⁵, Stanley Y Shaw³,⁴

Affiliations

¹ Gastrointestinal Unit, Massachusetts General Hospital, 55 Fruit Street, Blake 4, Boston, MA, 02114, USA.
² Harvard Medical School, Boston, MA, USA.
³ Harvard Medical School, Boston, MA, USA.
⁴ Center for Systems Biology, Massachusetts General Hospital, Boston, MA, USA.
⁵ Biostatistics Center, Massachusetts General Hospital, Boston, MA, USA.

Abstract

Background and aims: Nonalcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease worldwide. Risk factors for NAFLD disease progression and liver-related outcomes remain incompletely understood due to the lack of computational identification methods. The present study sought to design a classification algorithm for NAFLD within the electronic medical record (EMR) for the development of large-scale longitudinal cohorts.

Methods: We implemented feature selection using logistic regression with adaptive LASSO. A training set of 620 patients was randomly selected from the Research Patient Data Registry at Partners Healthcare. To establish the true NAFLD diagnosis, we performed chart reviews and accepted either documentation of a biopsy or a clinical diagnosis of NAFLD. Candidate model variables included laboratory measurements, diagnosis codes, and concepts extracted from medical notes. Variables with P < 0.05 were included in the multivariable analysis.
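
As an illustration of the feature-selection technique named in the Methods, the following is a minimal sketch of two-stage adaptive LASSO for logistic regression. It is not the authors' implementation: the data are synthetic, and the function names, the penalty strength `lam`, the exponent `gamma`, and the proximal gradient solver are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a value toward zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def fit_logistic(X, y, penalties, lr=0.5, n_iter=400):
    """Proximal gradient descent for logistic regression.

    penalties[j] is the L1 penalty on coefficient j; the intercept is unpenalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) at the current estimate.
        resid = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - label
            for row, label in zip(X, y)
        ]
        intercept -= lr * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - lr * grad, lr * penalties[j])
    return intercept, beta


def adaptive_lasso_logistic(X, y, lam=0.05, gamma=1.0, eps=1e-8):
    """Adaptive LASSO: scale each L1 penalty by 1 / |initial estimate|^gamma."""
    p = len(X[0])
    _, initial = fit_logistic(X, y, [0.0] * p)  # stage 1: unpenalized fit
    penalties = [lam / (abs(b) ** gamma + eps) for b in initial]
    return fit_logistic(X, y, penalties)  # stage 2: weighted L1 fit


# Synthetic data only (hypothetical): features 0 and 1 drive the outcome,
# features 2 and 3 are pure noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
y = [1 if rng.random() < sigmoid(2.0 * row[0] - 1.5 * row[1]) else 0 for row in X]

intercept, beta = adaptive_lasso_logistic(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 3) for b in beta])
print("selected feature indices:", selected)
```

Because the penalty on each coefficient is inversely proportional to its initial estimate, weak predictors are penalized heavily and tend to be dropped, while strong predictors are shrunk only slightly. In practice `lam` would be tuned, for example by cross-validation.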

Results: The NAFLD classification algorithm included the number of natural language mentions of NAFLD in the EMR, the lifetime number of ICD-9 codes for NAFLD, and triglyceride level. This classification algorithm was superior to an algorithm using ICD-9 data alone, with an AUC of 0.85 versus 0.75 (P < 0.0001), and led to the creation of a new independent cohort of 8458 individuals with a high probability of NAFLD.
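
To make the AUC comparison concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half. The labels and scores are toy values invented for illustration, not study data, and the function name `auc` is an assumption; the abstract does not state which test produced the reported P value, so no significance test is shown.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of cases (label 1) and non-cases (label 0)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    # A case outranking a non-case counts 1; a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): chart-review labels for six patients.
labels = [1, 1, 1, 0, 0, 0]
combined_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # multi-feature model output
icd_only_scores = [1, 1, 0, 1, 0, 0]              # presence of a diagnosis code

print("combined model AUC:", round(auc(combined_scores, labels), 3))
print("ICD-only AUC:", round(auc(icd_only_scores, labels), 3))
```

A binary code-based score produces many ties between cases and non-cases, which is one reason a continuous multi-feature score can discriminate better.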

Conclusions: The NAFLD classification algorithm is superior to ICD-9 billing data alone. This approach is simple to develop and deploy, and it can be applied across different institutions to create EMR-based cohorts of individuals with NAFLD.

Keywords: Electronic medical records; Nonalcoholic fatty liver disease; Nonalcoholic steatohepatitis; Triglycerides.

Publication types

Research Support, N.I.H., Extramural
Validation Study

MeSH terms

Adult
Aged
Alanine Transaminase / blood
Algorithms*
Aspartate Aminotransferases / blood
Biopsy
Cohort Studies
Data Collection
Diabetes Mellitus / epidemiology
Electronic Health Records*
Female
Humans
International Classification of Diseases
Logistic Models
Male
Middle Aged
Natural Language Processing*
Non-alcoholic Fatty Liver Disease* / blood
Non-alcoholic Fatty Liver Disease* / epidemiology
Prevalence
Triglycerides / blood
United States / epidemiology

Substances

Triglycerides
Aspartate Aminotransferases
Alanine Transaminase
