In pathological studies, subjective assays, especially companion diagnostic tests, can dramatically affect the treatment of cancer. Binary diagnostic test results (ie, positive vs negative) may vary between the pathologists or observers who read the tumor slides. Some tests have clearly defined criteria that result in highly concordant outcomes, even with minimal training. Other tests are more challenging, and observers may achieve poor concordance even with training. While there are many statistically rigorous methods for measuring concordance between observers, we are unaware of a method that can identify how many observers are needed to determine whether a test can reach an acceptable concordance, if at all. Here we introduce a statistical approach to the assessment of test performance when the test is read by multiple observers, as would occur in the real world. By plotting the number of observers against the estimated overall agreement proportion, we obtain a curve that plateaus at the average observer concordance. Diagnostic tests that are well-defined and easily judged show high concordance and plateau with few interobserver comparisons. More challenging tests do not plateau until many interobserver comparisons are made, and typically reach a lower plateau or even 0. We further propose a statistical test of whether the overall agreement proportion will drop to 0 with a large number of pathologists. The proposed analytical framework can be used to evaluate the difficulty of interpreting pathological test criteria and platforms, and to determine how pathology-based subjective tests will perform in the real world. The method could also be used outside of pathology, wherever concordance of a diagnosis or decision point relies on the subjective application of multiple criteria.
We apply this method to two recent PD-L1 studies to test whether the curve of overall agreement proportion will converge to 0 and to determine the minimal sufficient number of observers required to estimate the concordance plateau of their reads.
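As an illustration of the curve of overall agreement proportion against number of observers, the following is a minimal Python sketch, not the authors' implementation. It assumes that the overall agreement proportion for a set of observers is the fraction of cases on which all of them give the same binary read, and that the curve is estimated by averaging over random orderings of the observers. The function names `overall_agreement` and `agreement_curve`, the parameters `n_perm` and `seed`, and the data in `toy_reads` are hypothetical and made up for the example.

```python
import random


def overall_agreement(reads, observers):
    """Fraction of cases on which all listed observers give the same read.

    reads: list of cases, each a list of binary reads (one per observer).
    observers: indices of the observers to compare.
    """
    agree = 0
    for case in reads:
        # A case counts as concordant only if every selected observer agrees.
        if len({case[j] for j in observers}) == 1:
            agree += 1
    return agree / len(reads)


def agreement_curve(reads, n_perm=200, seed=0):
    """Average agreement among the first k observers, for k = 2..n_obs.

    Averaging over random observer orderings is an assumption of this
    sketch; entry i of the result corresponds to k = i + 2 observers.
    """
    rng = random.Random(seed)
    n_obs = len(reads[0])
    sums = [0.0] * (n_obs - 1)
    for _ in range(n_perm):
        order = list(range(n_obs))
        rng.shuffle(order)
        for k in range(2, n_obs + 1):
            sums[k - 2] += overall_agreement(reads, order[:k])
    return [s / n_perm for s in sums]


# Made-up toy data: 4 cases read by 3 observers (1 = positive, 0 = negative).
toy_reads = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
curve = agreement_curve(toy_reads)
```

Under this definition, adding an observer can only keep or reduce the number of fully concordant cases, so the curve is non-increasing in the number of observers. Whether it levels off above 0 or keeps falling is the question addressed by the proposed statistical test.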
Keywords: Binomial distribution; concordance; inflated binomial distribution; overall agreement proportion; pathological tests.
© 2021 John Wiley & Sons Ltd.