Application of multivariate probabilistic (Bayesian) networks to substance use disorder risk stratification and cost estimation

Lawrence Weinstein; Todd A Radano; Timothy Jack; Philip Kalina; John S Eberhardt 3rd

Perspect Health Inf Manag. 2009 Sep 16;6(Fall):1b.

Authors

Lawrence Weinstein¹, Todd A Radano, Timothy Jack, Philip Kalina, John S Eberhardt 3rd

Affiliation

¹ Catasys, Inc., Los Angeles, CA, USA.

PMID: 20169014
PMCID: PMC2804457

Abstract

Introduction: This paper explores the use of machine learning and Bayesian classification models to develop broadly applicable risk stratification models to guide disease management of health plan enrollees with substance use disorder (SUD). While the high costs and morbidities associated with SUD are understood by payers, who manage it through utilization review, acute interventions, coverage and cost limitations, and disease management, the literature shows mixed results for these modalities in improving patient outcomes and controlling cost. Our objective is to evaluate the potential of data mining methods to identify novel risk factors for chronic disease and to stratify enrollee utilization, which can be used to develop new methods for targeting disease management services to maximize benefits to both enrollees and payers.

Methods: For our evaluation, we used DecisionQ machine learning algorithms to build Bayesian network models of a representative sample of data licensed from Thomson-Reuters' MarketScan consisting of 185,322 enrollees with three full-year claim records. Data sets were prepared, and a stepwise learning process was used to train a series of Bayesian belief networks (BBNs). The BBNs were validated using a 10 percent holdout set.
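The holdout step described in the Methods can be illustrated with a minimal sketch. The abstract does not say how the 10 percent holdout set was drawn, so the simple random split below, the function name `holdout_split`, and the fixed seed are all assumptions made for illustration and are not the authors' procedure.

```python
import random


def holdout_split(records, holdout_fraction=0.10, seed=0):
    """Shuffle records and reserve a fraction of them for validation.

    Assumption: a simple random split; the source does not state whether
    the split was random, stratified, or time-based.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # First n_holdout shuffled records form the holdout set; the rest train.
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical enrollee identifiers standing in for the 185,322 enrollees.
enrollee_ids = range(185_322)
train_ids, holdout_ids = holdout_split(enrollee_ids)
print(len(train_ids), len(holdout_ids))  # 166790 18532
```

With 185,322 enrollees, a 10 percent holdout of this kind reserves 18,532 records for validation and leaves 166,790 for training.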

Results: The networks were highly predictive. The risk-stratification BBNs produced areas under the curve (AUC) of 0.948 (95 percent confidence interval [CI], 0.944-0.951) and 0.736 (95 percent CI, 0.721-0.752), respectively, for SUD positive and of 0.951 (95 percent CI, 0.947-0.954) and 0.738 (95 percent CI, 0.727-0.750), respectively, for SUD negative. The cost estimation models produced AUCs ranging from 0.72 (95 percent CI, 0.708-0.731) to 0.961 (95 percent CI, 0.95-0.971).
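To make the reported metric concrete, the sketch below computes an AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, then attaches an approximate 95 percent CI. The abstract does not state how its confidence intervals were derived. The Hanley-McNeil standard error used here is one common choice and is an assumption, as are the function names and the toy scores, which are not study data.

```python
import math


def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking (Mann-Whitney formulation).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_ci(a, n_pos, n_neg, z=1.96):
    """Approximate CI for an AUC using the Hanley-McNeil standard error.

    Assumption: this is an illustrative method, not necessarily the one
    used to produce the intervals reported in the abstract.
    """
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    variance = (
        a * (1.0 - a)
        + (n_pos - 1) * (q1 - a * a)
        + (n_neg - 1) * (q2 - a * a)
    ) / (n_pos * n_neg)
    half_width = z * math.sqrt(variance)
    return max(0.0, a - half_width), min(1.0, a + half_width)


# Toy predicted probabilities (hypothetical, for illustration only).
pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.6, 0.3, 0.2, 0.1]

auc_value = auc(pos_scores, neg_scores)
ci_low, ci_high = hanley_mcneil_ci(auc_value, len(pos_scores), len(neg_scores))
print(auc_value)  # 0.9375
```

The pairwise loop is quadratic in the number of cases, so a rank-based computation would be preferable at the scale of a real holdout set. With thousands of cases per class the interval also narrows sharply, as the tight intervals in the Results show.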

Conclusion: We successfully modeled a large, heterogeneous population of commercial enrollees, applying state-of-the-art machine learning technology to develop complex and accurate multivariate models that support near-real-time scoring of novel payer populations based on historic claims and diagnostic data. Initial validation results indicate that we can stratify enrollees with SUD diagnoses into different cost categories with a high degree of sensitivity and specificity. The most challenging issue then becomes one of policy. Because of the social stigma associated with the disease and ethical issues pertaining to access to care and individual versus societal benefit, a thoughtful dialogue needs to occur about the appropriate way to implement these technologies.

Keywords: Bayesian belief network; chemical dependency; predictive modeling; substance use disorder.

Publication types

Validation Study

MeSH terms

Algorithms
Area Under Curve
Bayes Theorem*
Cost Savings
Cost of Illness
Data Mining / methods
Decision Trees
Disease Management
Humans
Insurance Claim Reporting / statistics & numerical data
Models, Statistical*
Multivariate Analysis*
Neural Networks, Computer*
Nonlinear Dynamics
Predictive Value of Tests
ROC Curve
Recurrence
Risk Assessment / methods*
Selection Bias
Substance-Related Disorders* / diagnosis
Substance-Related Disorders* / economics
Substance-Related Disorders* / therapy
Utilization Review