Selective prediction for extracting unstructured clinical data

J Am Med Inform Assoc. 2023 Dec 22;31(1):188-197. doi: 10.1093/jamia/ocad182.

Abstract

Objective: While there are currently approaches to handle unstructured clinical data, such as manual abstraction and structured proxy variables, these methods may be time-consuming, not scalable, and imprecise. This article aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.

Materials and methods: We trained selective classifiers (logistic regression, random forest, support vector machine) to extract 5 variables from clinical notes: depression (n = 1563), glioblastoma (GBM, n = 659), rectal adenocarcinoma (DRA, n = 601), and abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601) of adenocarcinoma. We varied the cost of false positives (FP), false negatives (FN), and abstained notes and measured total misclassification cost.

Results: The depression selective classifiers abstained on anywhere from 0% to 97% of notes, and the change in total misclassification cost ranged from -58% to 9%. Selective classifiers abstained on 5%-43% of notes across the GBM and colorectal cancer models. The GBM selective classifier abstained on 43% of notes, which led to improvements in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier and when compared to structured proxy variables.

Discussion: We showed that selective classifiers outperformed both non-selective classifiers and structured proxy variables for extracting data from unstructured clinical notes.

Conclusion: Selective prediction should be considered when abstaining is preferable to making an incorrect prediction.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adenocarcinoma*
  • Humans
  • Logistic Models
  • Support Vector Machine*