Objective: To evaluate the reliability of consensus-based segmentation in terms of the reproducibility of radiomic features.
Methods: In this retrospective study, three tumor data sets were investigated: breast cancer (n = 30), renal cell carcinoma (n = 30), and pituitary macroadenoma (n = 30). MRI was used for the breast and pituitary data sets, and CT was used for the renal data set. Twelve readers participated in the segmentation process. Consensus segmentation was created by making corrections on a previous region or volume of interest. Four experiments were designed to evaluate the reproducibility of radiomic features. Reliability was assessed with the intraclass correlation coefficient (ICC) using two cut-off values: 0.75 and 0.90.
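As an illustration of the reliability assessment described above, the following is a minimal sketch of an ICC calculation for one radiomic feature, followed by a check against the two cut-off values. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is an assumption here. The function names `icc_2_1` and `is_reproducible` are hypothetical, and the confidence-interval lower bound is passed in rather than computed.

```python
from statistics import mean


def icc_2_1(ratings):
    """Point estimate of ICC(2,1) for one feature.

    ratings: list of rows, one row per lesion; each row holds the feature
    value obtained from each reader's (or consensus) segmentation.
    Assumed form: two-way random effects, absolute agreement, single measure.
    """
    n = len(ratings)       # number of lesions
    k = len(ratings[0])    # number of segmentations per lesion
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def is_reproducible(ci_lower, threshold=0.90):
    """Assumed rule: a feature counts as reproducible when the lower bound
    of its 95% confidence interval reaches the chosen ICC cut-off."""
    return ci_lower >= threshold
```

Calling `is_reproducible(ci_lower, 0.75)` and `is_reproducible(ci_lower, 0.90)` on the same feature shows how the stricter cut-off reduces the share of features labeled reproducible.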
Results: Considering the lower bound of the 95% confidence interval and the ICC threshold of 0.90, at least 61% of the radiomic features were not reproducible in the inter-consensus analysis. In the susceptibility experiment, at least half (54%) became non-reproducible when the first reader was replaced with a different reader. In the intra-consensus analysis, at least about one-third (32%) were non-reproducible when the same second reader segmented the image over the same first reader's segmentation two weeks later. Compared with the inter-reader analysis based on independent single readers, the inter-consensus analysis did not statistically significantly improve the rates of reproducible features in all data sets and analyses.
Conclusions: Despite the positive connotation of the word "consensus", it is essential to remember that consensus-based segmentation has significant reproducibility issues. Therefore, the use of consensus-based segmentation alone should be avoided unless a reliability analysis is performed, even if such an analysis is not practical in clinical settings.
Keywords: Computed tomography; Magnetic resonance imaging; Radiomics; Reproducibility; Segmentation.
Copyright © 2023 Elsevier B.V. All rights reserved.