Evaluating the impact of modeling choices on the performance of integrated genetic and clinical models

Genet Med. 2024 Dec 26:101353. doi: 10.1016/j.gim.2024.101353. Online ahead of print.

Abstract

Purpose: Studies of the value of genetic information for improving the performance of clinical risk prediction models have yielded variable conclusions. Many methodological decisions have the potential to contribute to these differing results. We performed multiple modeling experiments integrating clinical and demographic data from electronic health records (EHR) with genetic data to understand which decisions may affect performance.

Methods: Clinical data in the form of structured diagnostic codes, medications, procedural codes, and demographics were extracted from two large independent health systems, and polygenic risk scores (PRS) were generated for all patients of European ancestry with genetic data in the corresponding biobanks. Crohn's disease was studied because of its substantial genetic component, established EHR-based definition, and sufficient prevalence for training and testing. We investigated the impact of choices regarding PRS integration method, training sample, model complexity, and performance metrics.
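
As a generic illustration of two quantities named above, the sketch below computes a PRS in its standard form (a weighted sum of risk-allele dosages) and an area under the ROC curve as one possible performance metric. The abstract does not state how the PRS were derived or which metrics were used, so the function names, weights, dosages, and labels here are hypothetical toy values, not the study's data or method.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (typically 0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical per-variant effect sizes (assumed, not from the study).
weights = [0.2, 0.5, -0.1]
# Hypothetical allele dosages for four patients, one row per patient.
patients = [[0, 1, 2], [2, 2, 0], [1, 0, 1], [2, 1, 0]]
# Hypothetical case (1) / control (0) labels.
labels = [0, 1, 0, 1]

prs = [polygenic_risk_score(p, weights) for p in patients]
prs_auc = auc(prs, labels)
```

In an integrated model, a score like `prs` would be one input alongside the clinical and demographic features, and the same metric would be compared with and without it.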

Results: Overall, including PRS resulted in higher performance, but this gain was robust only in situations with limited clinical information. We find consistent performance increases from more compute-intensive models such as random forest, but the impact of other decisions varies by site.

Conclusion: This work highlights the importance of considering methodological decision points in interpreting the impact of PRS on prediction performance in clinical models.

Keywords: EHR; Genetics; Genomics; Machine Learning; PRS.