Creating a Standardized Tool for the Evaluation and Comparison of Artificial Intelligence-Based Computer-Aided Detection Programs in Colonoscopy: a Modified Delphi Approach

Gastrointest Endosc. 2024 Nov 26:S0016-5107(24)03752-0. doi: 10.1016/j.gie.2024.11.042. Online ahead of print.

Abstract

Background and aim: Multiple computer-aided detection (CADe) software programs have now achieved regulatory approval in the US, Europe, and Asia and are being used in routine clinical practice to support colorectal cancer screening. It is uncertain how different CADe algorithms perform relative to one another, and no objective methodology exists for comparing them. We aimed to identify priority scoring metrics for CADe evaluation and comparison.

Methods: A modified Delphi approach was used. Twenty-five global leaders in CADe in colonoscopy, including endoscopists, researchers, and industry representatives, participated in an online survey over the course of 8 months. Participants generated 121 scoring criteria, 54 of which were deemed within the study scope and distributed for review and asynchronous email-based open comment. Participants then scored each criterion for priority on a 5-point Likert scale during ranking round one. The eleven highest-priority criteria were redistributed, with another opportunity for open comment, followed by a final round of priority scoring to identify the final six criteria.

Results: Mean priority scores for the 54 criteria ranged from 2.25 to 4.38 following the first ranking round. The top eleven criteria following ranking round one yielded mean priority scores ranging from 3.04 to 4.16. The final six highest-priority criteria were, tied for first, sensitivity (mean = 4.16) and separate and independent validation of the CADe algorithm (4.16), followed by 3) adenoma detection rate (4.08), 4) false positive rate (4.00), 5) latency (3.84), and 6) adenoma miss rate (3.68).
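
The numbering of the final list, in which two criteria share the top score and the next criterion is numbered 3, is consistent with standard competition ranking of the mean scores. The sketch below reproduces that ordering from the six means reported above. The function name `competition_rank` and the tie-handling convention are illustrative assumptions, not a description of the authors' actual analysis.

```python
# Mean priority scores for the final six criteria, copied from the Results.
final_scores = {
    "sensitivity": 4.16,
    "separate and independent validation": 4.16,
    "adenoma detection rate": 4.08,
    "false positive rate": 4.00,
    "latency": 3.84,
    "adenoma miss rate": 3.68,
}


def competition_rank(scores):
    """Rank criteria by mean score, highest first.

    Hypothetical helper: tied scores share a rank and the following
    rank is skipped (1, 1, 3, ...), an assumed convention.
    """
    # sorted() is stable, so tied criteria keep their input order.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position
        ranked.append((rank, name, score))
    return ranked


for rank, name, score in competition_rank(final_scores):
    print(f"{rank}) {name} ({score:.2f})")
```

Under this convention, sensitivity and independent validation both receive rank 1 and adenoma detection rate receives rank 3, matching the order reported in the Results.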

Conclusions: This is the first reported international consensus statement of priority scoring metrics for CADe in colonoscopy. These scoring criteria should inform CADe software development and refinement. Future research should validate these metrics on a benchmark video data set to develop a validated scoring instrument.

Keywords: Colonoscopy; adenoma detection rate; artificial intelligence; computer-aided detection; sensitivity.