REliability of consensus-based segMentatIoN in raDiomic feature reproducibility (REMIND): A word of caution

Burak Kocak; Aytul Hande Yardimci; Mehmet Ali Nazli; Sabahattin Yuzkan; Samet Mutlu; Tevfik Guzelbey; Merve Sam Ozdemir; Meliha Akin; Serap Yucel; Elif Bulut; Osman Nuri Bayrak; Ahmet Arda Okumus

doi:10.1016/j.ejrad.2023.110893

Eur J Radiol. 2023 Aug:165:110893. doi: 10.1016/j.ejrad.2023.110893. Epub 2023 May 26.

Affiliations

¹ Department of Radiology, University of Health Sciences, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.
² Department of Radiology, University of Health Sciences, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.
³ Department of Radiology, Baskent University, Istanbul Hospital, Istanbul, Turkey.

PMID: 37285646

Abstract

Objective: To evaluate the reliability of consensus-based segmentation in terms of reproducibility of radiomic features.

Methods: In this retrospective study, three tumor data sets were investigated: breast cancer (n = 30), renal cell carcinoma (n = 30), and pituitary macroadenoma (n = 30). MRI was used for the breast and pituitary data sets, and CT for the renal data set. Twelve readers participated in the segmentation process. Each consensus segmentation was created by a second reader making corrections to a region or volume of interest previously drawn by a first reader. Four experiments were designed to evaluate the reproducibility of radiomic features. Reliability was assessed with the intraclass correlation coefficient (ICC) at two cut-off values: 0.75 and 0.90.
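
As an illustration of the reliability measure, the following minimal sketch computes a point estimate of the ICC for one radiomic feature measured on the same lesions by several readers or consensus pairs. The abstract does not state which ICC form was used; the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), is an assumption here, and the function name `icc_2_1` and the feature values are hypothetical. The study judged reproducibility by the lower bound of the 95% confidence interval, which this sketch does not compute.

```python
def icc_2_1(ratings):
    """ICC(2,1) point estimate (assumed form; not confirmed by the abstract).

    `ratings` has one row per lesion and one column per reader or consensus pair.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: lesions (rows), readers (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denom


# Hypothetical values of one feature for four lesions and two consensus pairs.
feature_values = [[10.2, 10.4], [8.1, 7.9], [12.5, 12.0], [9.3, 9.9]]
print(round(icc_2_1(feature_values), 3))
```

In practice a statistics package that also reports the confidence interval would be needed, since the lower bound of that interval, rather than the point estimate, is what gets compared against the 0.75 and 0.90 cut-offs.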

Results: Considering the lower bound of the 95% confidence interval and the ICC threshold of 0.90, at least 61% of the radiomic features were not reproducible in the inter-consensus analysis. In the susceptibility experiment, at least 54% of the features became non-reproducible when the first reader was replaced with a different reader. In the intra-consensus analysis, at least 32% were non-reproducible when the same second reader re-segmented the image over the same first reader's segmentation two weeks later. Compared to inter-reader analysis based on independent single readers, the inter-consensus analysis did not statistically significantly improve the rates of reproducible features in all data sets and analyses.
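
The percentages above come from comparing each feature's ICC lower bound with a cut-off. A minimal sketch of that tally follows. The feature names and lower-bound values are made up for illustration and are not data from the study, and treating a feature as reproducible when its lower bound is greater than or equal to the threshold is an assumption, since the abstract does not say how ties are handled.

```python
def non_reproducible_share(lower_bounds, threshold):
    """Return (share, names) of features whose ICC lower bound is below threshold.

    Assumption: a feature counts as reproducible when lower bound >= threshold.
    """
    if not lower_bounds:
        raise ValueError("no features supplied")
    failing = sorted(name for name, lb in lower_bounds.items() if lb < threshold)
    return len(failing) / len(lower_bounds), failing


# Illustrative, made-up lower bounds of the 95% CI of the ICC per feature.
example_lower_bounds = {
    "shape_volume": 0.95,
    "firstorder_mean": 0.91,
    "glcm_contrast": 0.80,
    "glrlm_run_entropy": 0.60,
}

for cutoff in (0.75, 0.90):
    share, names = non_reproducible_share(example_lower_bounds, cutoff)
    print(f"cut-off {cutoff:.2f}: {share:.0%} non-reproducible -> {names}")
```

Because the lower bound is always at or below the point estimate, this criterion is stricter than thresholding the ICC itself.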

Conclusions: Despite the positive connotation of the word "consensus", it is essential to REMIND that consensus-based segmentation has significant reproducibility issues. Therefore, the use of consensus-based segmentation alone should be avoided unless a reliability analysis is performed, even if such an analysis is not practical in clinical settings.

Keywords: Computed tomography; Magnetic resonance imaging; Radiomics; Reproducibility; Segmentation.

MeSH terms

Carcinoma, Renal Cell* / pathology
Consensus
Humans
Image Processing, Computer-Assisted / methods
Kidney Neoplasms* / pathology
Reproducibility of Results
Retrospective Studies