Background: Existing research on medical data has primarily focused on single time-point or single-modality data. This study aims to comprehensively collect all data generated during radiotherapy in order to improve the treatment and prognosis of patients with malignant tumors.
Methods: The data collected from each medical institution were transmitted to the lead organization, where they underwent a file integrity check and were processed using a data pipeline. The key metadata of the collected data were compiled into a database, which was examined by data analysts to identify outliers based on theoretical and institution-specific characteristics. Appropriate filters were then applied, and the filtered data were reviewed by artificial intelligence (AI)-based models and researchers for radiotherapy organ slides. Finally, the reviewed data were annotated by specialists.
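The integrity check and outlier screening described in the Methods can be illustrated with a minimal sketch. The abstract does not state how either step was implemented, so everything here is an assumption made for illustration: the use of SHA-256 checksums, the median/MAD outlier rule with a 3.5 threshold, the function names, and the toy slice counts are all hypothetical and are not the study's actual pipeline.

```python
import hashlib
from statistics import median


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def passes_integrity_check(data: bytes, expected_digest: str) -> bool:
    """Assumed check: the received bytes must hash to the digest
    recorded by the sending institution."""
    return sha256_of(data) == expected_digest


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score (median/MAD rule)
    exceeds the threshold. This rule is an illustrative stand-in for the
    analyst-defined filters mentioned in the text."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical metadata: number of slices per scan at one institution.
slice_counts = [64, 66, 65, 63, 64, 300]
suspect_scans = flag_outliers(slice_counts)  # index 5 is flagged
```

A robust rule such as median/MAD is used in this sketch because a single extreme scan would otherwise inflate a mean and standard deviation computed from the same data; institution-specific thresholds would replace the fixed value of 3.5 in practice.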
Results: The final dataset included 30,136 three-dimensional cone-beam computed tomography scans and 5,019 tabular data entries collected from 5,019 patients. It comprised 2,043,162 Digital Imaging and Communications in Medicine-format files with a total file size of 832 GB. Quality verification of the data using AI models revealed high classification performance for most organs, with relatively poor performance for the rectum. Overall, the macro-averaged area under the receiver operating characteristic curve (AUROC) was 0.947.
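For readers unfamiliar with the metric, a macro-averaged AUROC is the unweighted mean of the one-vs-rest AUROC computed for each class, so a weak class such as the rectum lowers the overall value regardless of how many samples it has. The sketch below shows that calculation in plain Python. The organ labels, scores, and function names are hypothetical toy values chosen for illustration and are not taken from the study.

```python
def binary_auroc(labels, scores):
    """AUROC as the probability that a random positive is scored above
    a random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, y_score, classes):
    """One-vs-rest AUROC per class, then the unweighted mean."""
    per_class = {}
    for k, c in enumerate(classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[k] for row in y_score]
        per_class[c] = binary_auroc(labels, scores)
    return sum(per_class.values()) / len(per_class), per_class


# Toy example: columns of y_score follow the order of `classes`.
classes = ["bladder", "rectum", "prostate"]
y_true = ["bladder", "bladder", "rectum", "rectum", "prostate", "prostate"]
y_score = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.1, 0.4, 0.5],
]
macro, per_class = macro_auroc(y_true, y_score, classes)
# per_class["rectum"] is 0.875 here; the other two classes reach 1.0
```

The pairwise formulation is quadratic in the number of samples and is meant only to make the definition explicit; a rank-based implementation would be preferable on a dataset of the size reported here.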
Conclusions: This study implemented an automated data pipeline and AI-based verification to enhance the quality of collected radiotherapy data. The constructed dataset can be utilized for various types of future research and is expected to contribute to the improvement of radiotherapy efficiency.
Keywords: Data pipeline; Fusion data; Malignant tumor; Survival data; Three-dimensional cone-beam computed tomography.
Copyright © 2024 Elsevier B.V. All rights reserved.