Generating realistic synthetic health data (e.g., electronic health records) holds promise for fundamental research, AI model development, and enhancing data privacy safeguards. Generative Adversarial Networks (GANs) have been employed for this purpose, but their performance is largely constrained by their reliance on training data, rendering them inadequate for rare or previously unseen diseases. This study proposes Onto-CGAN, a novel generative framework that combines knowledge from disease ontologies with GANs to generate synthetic data for unseen diseases, that is, diseases not present in the training data. The quality of the generated data is evaluated using variable distributions, correlation coefficients, and machine learning model performance. Our findings demonstrate that Onto-CGAN generates data for unseen diseases with statistical characteristics comparable to the real data, and that these data significantly improve the training of machine learning models. This approach addresses the scarcity of data for rare diseases, offering valuable applications in data augmentation, hypothesis generation, and preclinical validation of clinical models.
© 2025. The Author(s).