Purpose: This study aims to evaluate the inter-observer variability in assessing the optic disc in fundus photographs and its implications for establishing ground truth in AI research.
Methods: Seventy subjects took part in a glaucoma screening campaign. Fundus photographs were classified into normal (NL) or abnormal (GS: glaucoma and glaucoma suspects) by two masked glaucoma specialists. Referrals were based on these classifications, followed by intraocular pressure (IOP) measurements, with rapid decisions simulating busy outpatient clinics. In the second stage, four glaucoma specialists independently categorized images as normal, suspect, or glaucomatous. Reassessments were conducted with access to IOP and contralateral eye data.
Results: In the first stage, the agreement between senior and junior specialists in categorizing patients as normal or abnormal was moderately high. Knowledge of IOP emerged as an independent factor influencing the decision to refer more patients. In the second stage, agreement among the four specialists varied, with greater concordance observed when additional clinical information was available. Notably, there was statistically significant variability in the assessment of optic disc excavation.
Conclusion: The inclusion of risk factors such as IOP and bilateral data significantly influences both the classification accuracy of specialists and the diagnostic consistency among them. Reliance solely on fundus photographs for AI training can be misleading due to inter-observer variability. Comprehensive datasets integrating multimodal clinical information are essential for developing robust AI models for glaucoma screening.
Keywords: artificial intelligence; clinical decision support; diagnostic imaging; glaucoma screening; multimodal diagnostic.
Glaucoma is a leading cause of irreversible blindness, and early detection is critical to preventing vision loss. Screening for glaucoma using fundus photographs is one approach, but there is significant variability in how specialists interpret these images. This study evaluated how consistently different eye specialists assess these photographs and what this variability means for developing artificial intelligence (AI) tools to detect glaucoma. The study involved 70 individuals screened for glaucoma using fundus photographs. Two specialists initially classified the images as either normal or abnormal (including glaucoma suspects). The agreement between the specialists was moderate, showing that different clinicians sometimes reach different conclusions based on the same images. The study also tested how additional information, like intraocular pressure (IOP), affects these classifications. Surprisingly, including IOP data introduced more variability, making agreement between the specialists even lower. The research highlights that relying solely on fundus photos without considering other clinical factors, like IOP or data from both eyes, could be misleading when developing AI tools. For AI to effectively assist in glaucoma detection, it must be trained on comprehensive datasets that include more than just fundus images. These findings emphasize the importance of using a broad range of clinical data when training AI models for glaucoma screening to improve accuracy and reliability in real-world settings.
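Agreement between two graders who assign categorical labels, as in the first stage described above, is often summarized with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The text here does not state which statistic the authors used, so the choice of kappa is an assumption. The function name `cohen_kappa` and the ten labels below are hypothetical and purely illustrative; they are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical labels for ten photographs (illustrative only, not study data).
# NL = normal, GS = glaucoma or glaucoma suspect, as defined in Methods.
senior = ["NL", "NL", "GS", "GS", "NL", "GS", "NL", "NL", "GS", "NL"]
junior = ["NL", "GS", "GS", "GS", "NL", "NL", "NL", "NL", "GS", "NL"]
kappa = cohen_kappa(senior, junior)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.58"
```

In this made-up example the two graders agree on 8 of 10 photographs, but because each labels 60% of images as normal, about half of that agreement would be expected by chance, which is why kappa is well below 0.8. The four-specialist second stage would need a multi-rater statistic instead, which this sketch does not cover.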
© 2024 Pourjavan et al.