Substantial similarity and redundant information within video data limit the performance of video object recognition models. To address this issue, this paper proposes a Global-Local Storage Enhanced video object recognition model (GSE). First, the model incorporates a two-stage dynamic multi-frame aggregation module to aggregate shallow frame features. This module aggregates features in batches from each input video through feature extraction, dynamic multi-frame aggregation, and centralized concatenation, significantly reducing the model's computational burden while retaining key information. In addition, a Global-Local Storage (GS) module is constructed to retain and effectively exploit the information in the frame sequence. This module classifies features with a temporal-difference threshold method and applies an inheritance, storage, and output procedure to filter and retain them. By integrating global, local, and key features, the model can accurately capture important temporal features in complex video scenes. Subsequently, a Cascaded Multi-head Attention (CMA) mechanism is designed, whose multi-head cascade structure progressively focuses on object features and explores the correlations between key features and the global and local features, while a differential-step attention calculation ensures computational efficiency. Finally, we optimize the model structure, tune its parameters, and verify the performance of the GSE model through comprehensive experiments. On the ImageNet 2015 and NPS-Drones datasets, the GSE model achieves the highest mAP, 0.8352 and 0.8617 respectively, and, compared with other models, attains a commendable balance across precision, efficiency, and power consumption.
Keywords: Cascading multi-head attention; Global–local storage; Multi-frame aggregation; Video object recognition.
Copyright © 2025 Elsevier Ltd. All rights reserved.