Objective: To describe and apply a new overlap-integral comparison method, and to quantify human vs. human accuracies that can serve as goals for seizure detection algorithms.
Methods: Four human experts marked ten 8 h electroencephalography (EEG) records from seizure patients. The seizures varied in origin and type, including complex partial, generalized absence, secondarily generalized and primary generalized tonic-clonic. The traditional any-overlap comparison method was used in addition to the overlap-integral method, which is sensitive to the correct placement of the seizure endpoints.
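The difference between the two comparison methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes that events are (start, end) pairs in whole seconds, that any-overlap scoring counts events, and that overlap-integral scoring counts seconds of agreement on a 1 s grid. The function names and the example marks are hypothetical, and the overlap-integral correlation is omitted because its definition is not stated here.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (start_s, end_s), end exclusive; assumed representation


def overlaps(a: Event, b: Event) -> bool:
    """True if the two intervals share any time at all."""
    return a[0] < b[1] and b[0] < a[1]


def any_overlap(reference: List[Event], test: List[Event], record_hours: float):
    """Event-based scoring (assumed definition).

    A reference event counts as detected if any test event overlaps it;
    a test event overlapping no reference event counts as a false positive.
    """
    detected = sum(any(overlaps(r, t) for t in test) for r in reference)
    false_pos = sum(not any(overlaps(t, r) for r in reference) for t in test)
    sensitivity = detected / len(reference) if reference else float("nan")
    return sensitivity, false_pos / record_hours


def to_mask(events: List[Event], n_seconds: int) -> List[bool]:
    """Convert events to a per-second seizure/non-seizure mask."""
    mask = [False] * n_seconds
    for start, end in events:
        for s in range(max(0, start), min(n_seconds, end)):
            mask[s] = True
    return mask


def overlap_integral(reference: List[Event], test: List[Event], n_seconds: int):
    """Duration-based scoring (assumed definition).

    Sensitivity is the fraction of reference seizure seconds also marked by
    the test reader; specificity is the fraction of reference non-seizure
    seconds left unmarked by the test reader.
    """
    ref = to_mask(reference, n_seconds)
    tst = to_mask(test, n_seconds)
    tp = sum(r and t for r, t in zip(ref, tst))
    tn = sum((not r) and (not t) for r, t in zip(ref, tst))
    ref_pos = sum(ref)
    ref_neg = n_seconds - ref_pos
    sensitivity = tp / ref_pos if ref_pos else float("nan")
    specificity = tn / ref_neg if ref_neg else float("nan")
    return sensitivity, specificity


# Hypothetical marks from two readers on one 8 h record.
RECORD_SECONDS = 8 * 3600
reader_a = [(100, 160), (1000, 1030)]
reader_b = [(130, 190), (5000, 5020)]

ao_sens, ao_fp_per_hour = any_overlap(reader_a, reader_b, record_hours=8.0)
oi_sens, oi_spec = overlap_integral(reader_a, reader_b, RECORD_SECONDS)
```

With these made-up marks, reader B's first event overlaps reader A's first event but is shifted by 30 s. Any-overlap scoring credits it as a full detection (sensitivity 0.5 over the two reference events), whereas overlap-integral scoring credits only the 30 s of shared time out of 90 s of reference seizure (sensitivity about 0.33), which is how the second method penalizes misplaced endpoints.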
Results: The number of events marked by each reader ranged from 57 to 77. The average any-overlap sensitivity and false-positive rate per hour were 0.92 and 0.117. The average overlap-integral correlation, sensitivity and specificity were 0.80, 0.82 and 0.9926. As expected, the correspondence between readers was high, but confounding issues resulted in overlap-integral sensitivities less than 0.5 for 10% of the records. Seven percent of the any-overlap sensitivities were less than 0.5. A comparison of the methods by record showed that the overlap-integral specificity and the any-overlap false positive rate measure different features.
Conclusions: There was little variation between readers and they were essentially interchangeable. High seizure rate (many per hour), short seizure durations (<10 s) and long seizure durations (approximately 10 min) with ambiguous offsets can complicate the analysis and result in poor correlation. There may be any number of unmarked events in rigorously marked records and it may be preferable to use records from non-epilepsy patients to compute the false positive rate. The any-overlap and overlap-integral comparison methods are complementary.
Significance: Correlation between expert human readers can be low on some records, which will complicate testing of seizure detection algorithms.