Many benchmarks exist for evaluating the performance of boundary detection algorithms, most of them relying on some comparison of automatically generated boundaries with human-labeled ones. Such benchmarks consist of a representative image data set together with a comparison measure on the universe of boundary images. Although many such data sets and measures have been proposed, there is no clear way of knowing which combinations of them are most suitable for the task. In this paper, we introduce four criteria that allow for a sensible evaluation of the performance of a comparison measure on a given data set. The criteria mimic the way in which humans understand boundary images, as well as the human ability to recognize the underlying scenes. Ultimately, these criteria can quantify the ability of boundary detection benchmarks to evaluate the performance of boundary detection methods, whether edge-based or segmentation-based.