Generalizing predictions to unseen sequencing profiles via deep generative models

Sci Rep. 2022 May 3;12(1):7151. doi: 10.1038/s41598-022-11363-w.

Abstract

Predictive models trained on sequencing profiles often fail to achieve the expected performance when externally validated on unseen profiles. Many factors, such as batch effects, small data sets, and technical errors, contribute to the gap between source and unseen data distributions, and generalizing predictive models across studies without any prior knowledge of the unseen data distribution remains challenging. This study proposes DeepBioGen, a sequencing profile augmentation procedure that characterizes visual patterns of sequencing profiles, generates realistic profiles with a deep generative model that captures those patterns, and generalizes the subsequent classifiers. DeepBioGen outperforms other methods in enhancing the generalizability of prediction models on unseen data. Evaluated on RNA sequencing tumor expression profiles for anti-PD1 therapy response prediction and on WGS human gut microbiome profiles for type 2 diabetes diagnosis, the generalized classifiers surpass the state-of-the-art method.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Diabetes Mellitus, Type 2*
  • Humans
  • Sequence Analysis, RNA