Computer-aided diagnosis is an established field in medical image analysis; a great deal of effort goes into developing and refining pipelines to improve performance. Such improvement depends on reliable comparison, which is closely tied to variance estimation. For supervised methods, variance estimation can be confounded by statistical issues at the comparatively small sample sizes typical of the field. Despite the importance of reliable comparison to pipeline development, this issue has received relatively little attention. As a solution, we advocate an empirical variance estimator based on validation within disjoint subsets of the available data. Using Alzheimer's disease classification in the ADNI dataset as an exemplar, we investigate the behaviour of different variance estimators in a series of resampling experiments. We show that the proposed estimator is unbiased and that its estimates exceed those of naive approaches, which are biased downwards. Because the estimator avoids independence assumptions, it can accommodate arbitrary validation strategies and performance metrics. Because it is unbiased, it can provide statistically convincing comparisons and confidence intervals for algorithm performance. Finally, we show how the estimator can be used to compare different validation strategies, and we make recommendations about which to use.
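The idea of validating within disjoint subsets can be illustrated with a minimal sketch. It is not the implementation used in this work: the nearest-centroid classifier, the k-fold validation, the synthetic two-class data, and every function name below are assumptions chosen only for illustration. The data are split into non-overlapping subsets, a complete validation is run inside each subset, and the sample variance of the resulting performance estimates is taken across subsets.

```python
import numpy as np


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a toy nearest-centroid classifier (a stand-in for any pipeline)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every test point to every class centroid.
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[np.argmin(dists, axis=1)]
    return float(np.mean(predictions == y_test))


def cross_validated_accuracy(X, y, n_folds, rng):
    """Mean k-fold accuracy on one subset; any validation strategy could be swapped in."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(
            nearest_centroid_accuracy(
                X[train_idx], y[train_idx], X[test_idx], y[test_idx]
            )
        )
    return float(np.mean(scores))


def disjoint_subset_variance(X, y, n_subsets, n_folds=5, seed=0):
    """Validate independently inside disjoint subsets and return the
    per-subset estimates together with their sample variance."""
    rng = np.random.default_rng(seed)
    # A single permutation split into chunks guarantees no sample is shared.
    subsets = np.array_split(rng.permutation(len(y)), n_subsets)
    estimates = np.array(
        [cross_validated_accuracy(X[idx], y[idx], n_folds, rng) for idx in subsets]
    )
    # ddof=1 gives the usual unbiased sample variance across subsets.
    return estimates, float(np.var(estimates, ddof=1))


# Synthetic two-class data (an assumption; it stands in for real features).
data_rng = np.random.default_rng(42)
n_samples = 400
y = data_rng.integers(0, 2, size=n_samples)
X = data_rng.normal(size=(n_samples, 5)) + 0.8 * y[:, None]

estimates, variance = disjoint_subset_variance(X, y, n_subsets=4)
print("per-subset accuracy:", estimates)
print("empirical variance:", variance)
```

In this sketch the variance describes performance estimated from subsets of size `n_samples / n_subsets`, not from the full dataset, and no sample is shared between subsets, so the per-subset estimates do not overlap in their data.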