Imputation and characterization of uncoded self-harm in major mental illness using machine learning

Praveen Kumar; Anastasiya Nestsiarovich; Stuart J Nelson; Berit Kerner; Douglas J Perkins; Christophe G Lambert

doi:10.1093/jamia/ocz173

J Am Med Inform Assoc. 2020 Jan 1;27(1):136-146. doi: 10.1093/jamia/ocz173.

Authors

Praveen Kumar¹ ², Anastasiya Nestsiarovich¹, Stuart J Nelson³, Berit Kerner⁴, Douglas J Perkins¹, Christophe G Lambert¹ ² ⁵

Affiliations

¹ Center for Global Health, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, USA.
² Department of Computer Science, University of New Mexico, Albuquerque, New Mexico, USA.
³ Biomedical Informatics Center, Department of Clinical Research & Leadership, George Washington University, Washington, DC, USA.
⁴ Semel Institute for Neuroscience and Human Behavior, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA.
⁵ Translational Informatics Division, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico, USA.

Abstract

Objective: We aimed to impute uncoded self-harm in administrative claims data of individuals with major mental illness (MMI), characterize self-harm incidence, and identify factors associated with coding bias.

Materials and methods: The IBM MarketScan database (2003-2016) was used to analyze visit-level self-harm in 10 120 030 patients with ≥2 MMI codes. Five machine learning (ML) classifiers were tested on a balanced data subset, with XGBoost selected for the full dataset. Classification performance was validated via random data mislabeling and comparison with a clinician-derived "gold standard." The incidence of coded and imputed self-harm was characterized by year, patient age, sex, U.S. state, and MMI diagnosis.
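The abstract does not describe how the balanced subset was built or how labels were randomly mislabeled, so the following is only a minimal sketch under stated assumptions: balancing is taken to mean random undersampling of the majority class, and mislabeling is taken to mean flipping a fixed fraction of binary labels. The function names `balanced_subset` and `flip_labels`, the toy labels, and the 20% flip rate are all hypothetical.

```python
import random


def balanced_subset(labels, seed=0):
    """Return sorted indices holding equal numbers of positive and negative visits.

    Assumption: balancing by random undersampling of the larger class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


def flip_labels(labels, fraction, seed=0):
    """Return a copy of binary labels with a random `fraction` of entries flipped.

    Assumption: "random mislabeling" is modeled as 0 <-> 1 label flips.
    """
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]


# Toy visit labels (hypothetical): 1 = coded self-harm, 0 = no self-harm code.
labels = [1] * 5 + [0] * 95
idx = balanced_subset(labels)
clean = [labels[i] for i in idx]
noisy = flip_labels(clean, fraction=0.2)
```

In a validation of this kind, a classifier would be trained on `noisy` and then checked on whether it still recovers the labels in `clean`; that training step is omitted here because the abstract gives no details of it.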

Results: Imputation identified 1 592 703 self-harm events vs 83 113 coded events, with areas under the curve >0.99 for the balanced and full datasets, and 83.5% agreement with the gold standard. The overall coded and imputed self-harm incidences were 0.28% and 5.34%, respectively; incidence varied considerably by age and sex and was highest in individuals with multiple MMI diagnoses. Self-harm undercoding was higher in male than in female individuals and increased with age. Substance abuse, injuries, poisoning, asphyxiation, brain disorders, harmful thoughts, and psychotherapy were the main features used by ML to classify visits.

Discussion: Only 1 of 19 self-harm events was coded for individuals with MMI. ML demonstrated excellent performance in recovering self-harm visits. Male individuals and seniors with MMI are particularly vulnerable to self-harm undercoding and may be at risk of not getting appropriate psychiatric care.
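The "1 of 19" figure can be checked against the counts and incidences reported in the Results. The short calculation below is a sketch that uses only those reported numbers; the variable names are illustrative, and how the authors derived the figure is an assumption.

```python
# Figures as reported in the Results section of the abstract.
imputed_events = 1_592_703
coded_events = 83_113
imputed_incidence_pct = 5.34
coded_incidence_pct = 0.28

# Imputed events per coded event, from counts and from incidences.
events_per_coded = imputed_events / coded_events
incidence_ratio = imputed_incidence_pct / coded_incidence_pct

print(round(events_per_coded, 1), round(incidence_ratio, 1))  # 19.2 19.1
```

Both ratios come out close to 19, consistent with roughly 1 coded event for every 19 imputed events.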

Conclusions: ML can effectively recover unrecorded self-harm in claims data and inform psychiatric epidemiological and observational studies.

Trial registration: ClinicalTrials.gov NCT02893371.

Keywords: coding; electronic health records; machine learning; self-harm; suicide.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adult
Algorithms
Classification / methods
Clinical Coding / methods*
Datasets as Topic
Electronic Health Records*
Female
Humans
Incidence
Machine Learning*
Male
Mental Disorders / classification*
Mental Disorders / psychology
Self-Injurious Behavior / classification*
Self-Injurious Behavior / diagnosis
Self-Injurious Behavior / epidemiology
Suicidal Ideation*

Associated data

ClinicalTrials.gov/NCT02893371

Grants and funding

UL1 TR001449/TR/NCATS NIH HHS/United States