There is growing interest in using deep learning models to automate wildlife detection in aerial imaging surveys and thereby increase their efficiency, but human-generated annotations remain necessary for model training. However, even skilled observers may interpret aerial imagery of complex environments differently, and this divergence can destabilize the models trained on their annotations. In this study, we present a framework for assessing annotation reliability: agreement metrics are calculated for each observer against an aggregated annotation set, generated by spatially clustering multiple observers' observations and assigning each cluster its mode classification. We also examined how image attributes such as spatial resolution and texture influence observer agreement. To demonstrate the framework, we analyzed expert and volunteer annotations of twelve drone images of migratory waterfowl in New Mexico. Neither group reliably identified duck species: experts showed low agreement (43–56%) for several common species, and volunteers opted out of the task. When classifications were simplified into broad morphological categories, agreement was high for cranes (99% among experts, 95% among volunteers) and ducks (93% among experts, 92% among volunteers), though volunteers agreed notably less often on geese (75%) than experts did (94%). The aggregated annotation sets from the two groups were similar: the volunteer count of birds across all images was 91% of the expert count, with no statistically significant difference per image (t = 1.27, df = 338, p = 0.20). Bird locations matched between the groups 81% of the time, and the classifications of matched birds agreed 99.4% of the time. Tiling images to reduce the search area, and maintaining a constant scale so that size differences between classes remain consistent, may further increase observer agreement. Although our sample was limited, these findings indicate potential taxonomic limits to aerial wildlife surveys and show that, in aggregate, volunteers can produce data comparable to experts'. This framework may help other wildlife practitioners evaluate the reliability of the input data for their deep learning models.
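The cluster-and-mode aggregation described above can be sketched in a few lines of Python. The sketch below is illustrative only, not the study's actual pipeline: the choice of DBSCAN as the spatial clusterer, the `eps` matching radius, and all function and field names are assumptions introduced here for demonstration.

```python
# Hypothetical sketch of aggregating multi-observer point annotations
# and scoring per-observer agreement against the aggregate.
# Assumed annotation layout (illustrative):
#   {"observer": "A", "x": 12.1, "y": 40.3, "label": "crane"}
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN


def aggregate_annotations(annotations, eps=10.0):
    """Cluster point annotations from all observers and return one
    (x, y, mode label) record per cluster (i.e., per putative bird)."""
    coords = np.array([(a["x"], a["y"]) for a in annotations])
    # min_samples=2 requires agreement by at least two observers;
    # DBSCAN labels unclustered points -1 (noise).
    clusters = DBSCAN(eps=eps, min_samples=2).fit_predict(coords)
    aggregated = []
    for cid in set(clusters) - {-1}:
        members = [a for a, c in zip(annotations, clusters) if c == cid]
        mode_label = Counter(m["label"] for m in members).most_common(1)[0][0]
        x, y = coords[clusters == cid].mean(axis=0)
        aggregated.append({"x": x, "y": y, "label": mode_label})
    return aggregated


def observer_agreement(annotations, aggregated, observer, eps=10.0):
    """Fraction of aggregated birds that this observer both marked
    within eps pixels and assigned the same (mode) classification."""
    own = [a for a in annotations if a["observer"] == observer]
    hits = 0
    for ref in aggregated:
        near = [a for a in own
                if (a["x"] - ref["x"]) ** 2 + (a["y"] - ref["y"]) ** 2 <= eps ** 2]
        if near and near[0]["label"] == ref["label"]:
            hits += 1
    return hits / len(aggregated) if aggregated else float("nan")
```

The same routine applies unchanged whether labels are species-level or collapsed to broad morphological categories, which is how the species-level versus category-level agreement figures reported above could be computed from one set of point annotations.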