Unlocking the power of multi-institutional data: Integrating and harmonizing genomic data across institutions

Biometrics. 2024 Oct 3;80(4):ujae146. doi: 10.1093/biomtc/ujae146.

Abstract

Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging sequencing data from multiple institutions presents significant challenges. Variability in gene panels can lead to loss of information when analyses focus on genes common across panels. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features that preserve information beyond common genes and maximize the utilization of all available data, while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the model captures the true mutation pattern unique to each individual. We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data.
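
The information loss from restricting an analysis to genes common across panels can be illustrated with a minimal sketch. The panel contents and institution labels below are hypothetical and are not the actual GENIE BPC panels; the sketch only shows the common-gene restriction that motivates the work and does not implement the Bridge model.

```python
# Hypothetical gene panels for three institutions (illustrative only,
# not the actual GENIE BPC panels).
panels = {
    "center_A": {"TP53", "KRAS", "EGFR", "BRAF", "PIK3CA"},
    "center_B": {"TP53", "KRAS", "EGFR", "ALK"},
    "center_C": {"TP53", "KRAS", "BRAF", "PIK3CA", "STK11"},
}


def common_and_union(panels):
    """Return the genes shared by every panel and the genes on any panel."""
    gene_sets = list(panels.values())
    common = set.intersection(*gene_sets)
    union = set.union(*gene_sets)
    return common, union


common, union = common_and_union(panels)

# Fraction of sequenced genes discarded by a common-gene-only analysis.
discarded_fraction = 1 - len(common) / len(union)

print("Common genes:", sorted(common))
print("Genes on any panel:", sorted(union))
print(f"Discarded by common-gene analysis: {discarded_fraction:.0%}")
```

In this toy setting only two of the seven sequenced genes survive the restriction, which is the kind of loss that an approach using all available genes aims to avoid.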

Keywords: cancer genomics; data integration; dimension reduction; missing data; precision oncology; systematic biases.

MeSH terms

  • Computer Simulation
  • Databases, Genetic / statistics & numerical data
  • Genomics* / methods
  • Genomics* / statistics & numerical data
  • Humans
  • Models, Statistical
  • Mutation
  • Neoplasms* / genetics
  • Precision Medicine / methods
  • Precision Medicine / statistics & numerical data