Application of multivariate probabilistic (Bayesian) networks to substance use disorder risk stratification and cost estimation

Perspect Health Inf Manag. 2009 Sep 16;6(Fall):1b.

Abstract

Introduction: This paper explores the use of machine learning and Bayesian classification models to develop broadly applicable risk stratification models to guide disease management of health plan enrollees with substance use disorder (SUD). Payers understand the high costs and morbidities associated with SUD and manage it through utilization review, acute interventions, coverage and cost limitations, and disease management, but the literature shows mixed results for these modalities in improving patient outcomes and controlling cost. Our objective is to evaluate the potential of data mining methods to identify novel risk factors for chronic disease and to stratify enrollee utilization, so that disease management services can be targeted to maximize benefits to both enrollees and payers.

Methods: For our evaluation, we used DecisionQ machine learning algorithms to build Bayesian network models of a representative sample of data licensed from Thomson-Reuters' MarketScan consisting of 185,322 enrollees with three full-year claim records. Data sets were prepared, and a stepwise learning process was used to train a series of Bayesian belief networks (BBNs). The BBNs were validated using a 10 percent holdout set.
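As a minimal sketch of the 10 percent holdout step only, the following shows one way to split enrollee records into training and validation sets. The function name `holdout_split`, the seed, and the use of a simple random shuffle are assumptions for illustration; the abstract does not state how the holdout set was drawn, and this is not the DecisionQ procedure.

```python
import random

def holdout_split(records, holdout_fraction=0.10, seed=0):
    """Randomly split records into (train, holdout) lists.

    Assumption: a simple random split. The abstract only says that a
    10 percent holdout set was used, not how it was selected.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical enrollee identifiers standing in for the 185,322 claim records.
enrollee_ids = range(185_322)
train, holdout = holdout_split(enrollee_ids)
print(len(train), len(holdout))
```

Models would be fit on `train` only, with `holdout` reserved for the validation metrics.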

Results: The networks were highly predictive, with the risk-stratification BBNs producing areas under the curve (AUCs) for SUD positive of 0.948 (95 percent confidence interval [CI], 0.944-0.951) and 0.736 (95 percent CI, 0.721-0.752), respectively, and for SUD negative of 0.951 (95 percent CI, 0.947-0.954) and 0.738 (95 percent CI, 0.727-0.750), respectively. The cost estimation models produced AUCs ranging from 0.72 (95 percent CI, 0.708-0.731) to 0.961 (95 percent CI, 0.95-0.971).
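To make the reported metric concrete, the sketch below computes an AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and attaches an approximate 95 percent CI using the Hanley-McNeil standard error. The choice of the Hanley-McNeil approximation is an assumption, since the abstract does not say how its intervals were derived, and the scores and labels are made-up illustration data, not study results.

```python
import math

def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking (Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate CI for an AUC (Hanley and McNeil, 1982).

    Assumption: this approximation is used for illustration only.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Hypothetical model scores and true labels (1 = SUD positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_pairwise(scores, labels)
low, high = hanley_mcneil_ci(auc, n_pos=3, n_neg=3)
print(f"AUC = {auc:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The pairwise loop is quadratic in the number of cases, so a holdout set of the size described here would call for a rank-based computation instead; the definition is the same.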

Conclusion: We successfully modeled a large, heterogeneous population of commercial enrollees, applying state-of-the-art machine learning technology to develop complex and accurate multivariate models that support near-real-time scoring of novel payer populations based on historic claims and diagnostic data. Initial validation results indicate that we can stratify enrollees with SUD diagnoses into different cost categories with a high degree of sensitivity and specificity, so the most challenging issue becomes one of policy. Because of the social stigma associated with the disease and the ethical issues pertaining to access to care and individual versus societal benefit, a thoughtful dialogue needs to occur about the appropriate way to implement these technologies.

Keywords: Bayesian belief network; chemical dependency; predictive modeling; substance use disorder.

Publication types

  • Validation Study

MeSH terms

  • Algorithms
  • Area Under Curve
  • Bayes Theorem*
  • Cost Savings
  • Cost of Illness
  • Data Mining / methods
  • Decision Trees
  • Disease Management
  • Humans
  • Insurance Claim Reporting / statistics & numerical data
  • Models, Statistical*
  • Multivariate Analysis*
  • Neural Networks, Computer*
  • Nonlinear Dynamics
  • Predictive Value of Tests
  • ROC Curve
  • Recurrence
  • Risk Assessment / methods*
  • Selection Bias
  • Substance-Related Disorders* / diagnosis
  • Substance-Related Disorders* / economics
  • Substance-Related Disorders* / therapy
  • Utilization Review