A framework for assessing reliability of observer annotations of aerial wildlife imagery, with insights for deep learning applications

PLoS One. 2025 Jan 15;20(1):e0316832. doi: 10.1371/journal.pone.0316832. eCollection 2025.

Abstract

There is growing interest in using deep learning models to automate wildlife detection in aerial imaging surveys to increase efficiency, but human-generated annotations remain necessary for model training. However, even skilled observers may diverge in interpreting aerial imagery of complex environments, which can destabilize downstream models. In this study, we present a framework for assessing annotation reliability by calculating agreement metrics for individual observers against an aggregated annotation set generated by clustering multiple observers' observations and selecting the mode classification. We also examined how image attributes such as spatial resolution and texture influence observer agreement. To demonstrate the framework, we analyzed expert and volunteer annotations of twelve drone images of migratory waterfowl in New Mexico. Neither group reliably identified duck species: experts showed low agreement (43-56%) for several common species, and volunteers opted out of the task. When classifications were simplified into broad morphological categories, agreement was high for cranes (99% among experts, 95% among volunteers) and ducks (93% among experts, 92% among volunteers), though volunteer agreement for geese (75%) was notably lower than expert agreement (94%). The aggregated annotation sets from the two groups were similar: the volunteer count of birds across all images was 91% of the expert count, with no statistically significant difference per image (t = 1.27, df = 338, p = 0.20). Between the two aggregated sets, 81% of bird locations matched and 99.4% of classifications agreed. Tiling images to reduce search area and maintaining a constant scale to keep size differences between classes consistent may increase observer agreement. Although our sample was limited, these findings indicate potential taxonomic limitations to aerial wildlife surveys and show that, in aggregate, volunteers can produce data comparable to experts'. This framework may assist other wildlife practitioners in evaluating the reliability of their input data for deep learning models.
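
The aggregation step described in the abstract can be sketched in a few lines of code. The following Python example is illustrative only and is not the authors' implementation: it greedily clusters point annotations that fall within an assumed pixel radius, assigns each cluster the mode classification, and scores each observer's percent agreement against that aggregate. The function names, the clustering radius, and the greedy linking rule are assumptions made for illustration.

```python
"""Illustrative sketch (not the authors' code) of clustering multi-observer
annotations, taking the mode label per cluster, and scoring per-observer
agreement against that aggregate. Radius and linking rule are assumed."""

from collections import Counter, defaultdict
from math import hypot

# One annotation: (observer id, x pixel, y pixel, class label).
Annotation = tuple[str, float, float, str]


def cluster_annotations(annotations, radius=15.0):
    """Greedily group annotations whose locations fall within `radius` pixels.

    Returns a list of clusters, each a list of Annotation tuples. A real
    implementation might use DBSCAN or optimal point matching instead.
    """
    clusters = []  # each entry: (centroid_x, centroid_y, [annotations])
    for obs, x, y, label in annotations:
        for i, (cx, cy, members) in enumerate(clusters):
            if hypot(x - cx, y - cy) <= radius:
                members.append((obs, x, y, label))
                n = len(members)
                # Update the running centroid of the cluster.
                clusters[i] = ((cx * (n - 1) + x) / n,
                               (cy * (n - 1) + y) / n, members)
                break
        else:
            clusters.append((x, y, [(obs, x, y, label)]))
    return [members for _, _, members in clusters]


def aggregate_labels(clusters):
    """Assign each cluster the mode (most common) label among its members."""
    return [Counter(lbl for _, _, _, lbl in members).most_common(1)[0][0]
            for members in clusters]


def observer_agreement(clusters, aggregated):
    """Percent of each observer's annotations whose label matches the aggregate."""
    hits, totals = defaultdict(int), defaultdict(int)
    for members, agg_label in zip(clusters, aggregated):
        for obs, _, _, label in members:
            totals[obs] += 1
            hits[obs] += int(label == agg_label)
    return {obs: 100.0 * hits[obs] / totals[obs] for obs in totals}


if __name__ == "__main__":
    # Toy annotations from three observers marking the same two birds.
    toy = [
        ("obs1", 100.0, 200.0, "crane"), ("obs2", 103.0, 198.0, "crane"),
        ("obs3", 99.0, 205.0, "goose"),
        ("obs1", 400.0, 410.0, "duck"),  ("obs2", 405.0, 407.0, "duck"),
    ]
    clusters = cluster_annotations(toy)
    aggregated = aggregate_labels(clusters)
    print(aggregated)                           # ['crane', 'duck']
    print(observer_agreement(clusters, aggregated))
```

In practice, a density-based clusterer or image tiling (as the abstract suggests) could replace the greedy linking step, but the mode-vote aggregation and per-observer agreement logic would be unchanged.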

MeSH terms

  • Animals
  • Animals, Wild*
  • Birds / physiology
  • Deep Learning*
  • Ducks
  • Humans
  • Image Processing, Computer-Assisted / methods
  • New Mexico
  • Reproducibility of Results

Grants and funding

This research was funded under a US Fish and Wildlife Service Cooperative Agreement, number F17AC0122. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.