CODI: Enhancing machine learning-based molecular profiling through contextual out-of-distribution integration

Tarek Eissa; Marinus Huber; Barbara Obermayer-Pietsch; Birgit Linkohr; Annette Peters; Frank Fleischmann; Mihaela Žigman

doi:10.1093/pnasnexus/pgae449

PNAS Nexus. 2024 Oct 15;3(10):pgae449. doi: 10.1093/pnasnexus/pgae449. eCollection 2024 Oct.

Authors

Tarek Eissa^{1,2,3}, Marinus Huber^{1,2}, Barbara Obermayer-Pietsch⁴, Birgit Linkohr⁵, Annette Peters^{5,6}, Frank Fleischmann^{1,2}, Mihaela Žigman^{1,2}

Affiliations

¹ Chair of Experimental Physics - Laser Physics, Ludwig-Maximilians-Universität München, Bavaria 85748, Germany.
² Laboratory for Attosecond Physics, Max Planck Institute of Quantum Optics, Bavaria 85748, Germany.
³ School of Computation, Information and Technology, Technical University of Munich, Bavaria 85748, Germany.
⁴ Department of Internal Medicine, Division of Endocrinology and Diabetology, Medical University, Styria 8010, Austria.
⁵ Institute of Epidemiology, Helmholtz Zentrum München, Bavaria 85764, Germany.
⁶ Chair of Epidemiology, Institute for Medical Information Processing, Biometry and Epidemiology, Medical Faculty, Ludwig-Maximilians-Universität München, Bavaria 81377, Germany.

Abstract

Molecular analytics increasingly utilize machine learning (ML) for predictive modeling based on data acquired through molecular profiling technologies. However, developing robust models that accurately capture physiological phenotypes is challenged by the dynamics inherent to biological systems, variability stemming from analytical procedures, and the resource-intensive nature of obtaining sufficiently representative datasets. Here, we propose and evaluate a new method: Contextual Out-of-Distribution Integration (CODI). Based on experimental observations, CODI generates synthetic data that integrate unrepresented sources of variation encountered in real-world applications into a given molecular fingerprint dataset. By augmenting a dataset with out-of-distribution variance, CODI enables an ML model to better generalize to samples beyond the seed training data, reducing the need for extensive experimental data collection. Using three independent longitudinal clinical studies and a case-control study, we demonstrate CODI's application to several classification tasks involving vibrational spectroscopy of human blood. We showcase our approach's ability to enable personalized fingerprinting for multiyear longitudinal molecular monitoring and enhance the robustness of trained ML models for improved disease detection. Our comparative analyses reveal that incorporating CODI into the classification workflow consistently leads to increased robustness against data variability and improved predictive accuracy.
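The general idea of augmenting a seed dataset with variance it does not yet contain can be illustrated with a minimal sketch. This is not the CODI algorithm itself, whose details the abstract does not give. It assumes that the unrepresented variation is available as a pool of difference vectors (for example, differences between repeated measurements of the same individual) and that adding randomly scaled differences to seed spectra is an acceptable stand-in for the synthetic data generation. The function name `augment_with_variation` and the parameters `deltas`, `n_copies`, and `seed` are hypothetical.

```python
import numpy as np


def augment_with_variation(seed_X, seed_y, deltas, n_copies=3, seed=0):
    """Append synthetic samples built by perturbing each seed spectrum.

    Assumption (not from the paper): `deltas` is an array of shape
    (n_deltas, n_features) holding observed difference vectors that
    represent variation missing from the seed data.
    """
    rng = np.random.default_rng(seed)
    seed_X = np.asarray(seed_X, dtype=float)
    seed_y = np.asarray(seed_y)
    deltas = np.asarray(deltas, dtype=float)
    n_samples, n_features = seed_X.shape

    # Pick one difference vector per (copy, seed sample) pair.
    idx = rng.integers(0, deltas.shape[0], size=(n_copies, n_samples))
    # Random scale in [-1, 1) so the perturbation can act in either direction.
    scale = rng.uniform(-1.0, 1.0, size=(n_copies, n_samples, 1))

    # Broadcast: every copy of every seed spectrum receives its own perturbation.
    synthetic = seed_X[None, :, :] + scale * deltas[idx]

    # Original samples come first, followed by the synthetic copies.
    X_aug = np.concatenate([seed_X, synthetic.reshape(-1, n_features)])
    # Labels are inherited unchanged from the seed sample being perturbed.
    y_aug = np.concatenate([seed_y, np.tile(seed_y, n_copies)])
    return X_aug, y_aug
```

A classifier would then be trained on `X_aug, y_aug` in place of the seed data. Synthetic samples inherit the label of their seed spectrum, which presumes that the injected variation is unrelated to the class being predicted.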

Keywords: data augmentation; machine learning; molecular analytics; out-of-distribution; variability modeling.