Unmasking the chameleons: A benchmark for out-of-distribution detection in medical tabular data

Int J Med Inform. 2024 Dec 17:195:105762. doi: 10.1016/j.ijmedinf.2024.105762. Online ahead of print.

Abstract

Background: Machine Learning (ML) models often struggle to generalize to data that deviates from the training distribution. This raises significant concerns about the reliability of real-world healthcare systems that encounter such inputs, known as out-of-distribution (OOD) data. These concerns can be addressed by detecting OOD inputs in real time. While numerous OOD detection approaches have been proposed in other fields, especially in computer vision, it remains unclear whether similar methods effectively address the challenges posed by medical tabular data.

Objective: To answer this question, we propose an extensive, reproducible benchmark for comparing OOD detection methods on medical tabular data across a comprehensive suite of tests.

Method: To achieve this, we leverage four large public medical datasets, including eICU and MIMIC-IV, and consider various kinds of OOD cases within them. For example, we examine OOD samples drawn from a dataset statistically different from the training set, as quantified by the membership model introduced by Debray et al. [1], as well as OOD samples obtained by splitting a given dataset on the value of a distinguishing variable. To identify OOD instances, we explore 10 density-based methods that learn the marginal distribution of the data, alongside 17 post-hoc detectors applied on top of prediction models already trained on the data. The prediction models use three distinct architectures: MLP, ResNet, and Transformer.
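The membership-model idea can be illustrated with a minimal sketch: train a classifier to distinguish the training cohort from a candidate OOD cohort, and read its AUC as a separability score (near 0.5 means the cohorts are indistinguishable, near 1.0 means clearly shifted). This is a generic stand-in, not the exact procedure of Debray et al.; the logistic-regression fit and the toy 1-D cohorts below are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_membership_model(X_a, X_b, lr=0.1, epochs=200):
    """Fit a logistic regression by plain SGD to tell cohort A (label 0)
    from cohort B (label 1) -- a simplified stand-in for a membership model."""
    dim = len(X_a[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 0.0) for x in X_a] + [(x, 1.0) for x in X_b]
    for _ in range(epochs):
        for x, y in data:
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def membership_auc(X_a, X_b, w, b):
    """Rank-based AUC: chance a cohort-B point outscores a cohort-A point."""
    s_a = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in X_a]
    s_b = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in X_b]
    wins = sum(1.0 if sb > sa else 0.5 if sb == sa else 0.0
               for sa in s_a for sb in s_b)
    return wins / (len(s_a) * len(s_b))

# Hypothetical 1-D cohorts, e.g. a lab value with a clear distribution shift
X_a = [[0.0], [0.1], [0.2]]   # training-like cohort
X_b = [[5.0], [5.1], [5.2]]   # shifted cohort
w, b = train_membership_model(X_a, X_b)
print(membership_auc(X_a, X_b, w, b))  # prints 1.0: fully separable cohorts
```

With overlapping cohorts, the same AUC would sit near 0.5, signaling a subtle shift that is hard to separate.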

Main results: In our experiments, when the membership model achieved an AUC of 0.98, indicating a clear distinction between the OOD data and the training set, the OOD detection methods achieved AUC values exceeding 0.95 in distinguishing the OOD data. In contrast, in experiments with subtler shifts in data distribution, such as selecting OOD data by ethnicity or age, many OOD detection methods performed similarly to a random classifier, with AUC values close to 0.5. This may suggest a correlation between separability, as measured by the membership model, and OOD detection performance, as measured by the AUC of the detection model, a relationship that warrants future research.
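A representative post-hoc detector of the kind benchmarked here is maximum softmax probability (MSP): score each input by the trained model's top class probability and flag low-confidence inputs as OOD, then evaluate with AUROC. The sketch below uses hypothetical logit values for illustration; it is not tied to any specific model from the study.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def msp_score(logits):
    # Higher maximum softmax probability => more in-distribution
    return max(softmax(logits))

def auroc(id_scores, ood_scores):
    # Probability a random in-distribution sample outscores a random OOD sample
    wins = sum(1.0 if i > o else 0.5 if i == o else 0.0
               for i in id_scores for o in ood_scores)
    return wins / (len(id_scores) * len(ood_scores))

# Hypothetical logits: confident on in-distribution, uncertain on OOD inputs
id_logits = [[4.0, 0.5, 0.1], [3.5, 0.2, 0.3]]
ood_logits = [[1.0, 0.9, 0.8], [0.7, 0.6, 0.9]]

id_scores = [msp_score(l) for l in id_logits]
ood_scores = [msp_score(l) for l in ood_logits]
print(auroc(id_scores, ood_scores))  # prints 1.0: OOD cleanly detected
```

When the model is equally (un)confident on both groups, the same AUROC collapses toward 0.5, the random-classifier regime reported for the subtler shifts.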

Keywords: Benchmark; Medical AI; Medical tabular data; Out-of-distribution detection; Safety.