Article

AMFEF-DETR: An End-to-End Adaptive Multi-Scale Feature Extraction and Fusion Object Detection Network Based on UAV Aerial Images

1 Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100081, China
2 School of Information Engineering, Minzu University of China, Beijing 100081, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(10), 523; https://doi.org/10.3390/drones8100523
Submission received: 29 July 2024 / Revised: 18 September 2024 / Accepted: 24 September 2024 / Published: 26 September 2024

Abstract

To address the challenge of low detection accuracy and slow detection speed in unmanned aerial vehicle (UAV) aerial image target detection tasks, caused by factors such as complex ground environments, varying UAV flight altitudes and angles, and changes in lighting conditions, this study proposes an end-to-end adaptive multi-scale feature extraction and fusion detection network, named AMFEF-DETR. Specifically, to extract target features from complex backgrounds more accurately, we propose an adaptive backbone network, FADC-ResNet, which dynamically adjusts dilation rates and performs adaptive frequency awareness. This enables the convolutional kernels to effectively adapt to varying scales of ground targets, capturing more details while expanding the receptive field. We also propose a HiLo attention-based intra-scale feature interaction (HLIFI) module to handle high-level features from the backbone. This module uses dual-pathway encoding of high and low frequencies to enhance the focus on the details of dense small targets while reducing noise interference. Additionally, the bidirectional adaptive feature pyramid network (BAFPN) is proposed for cross-scale feature fusion, integrating semantic information and enhancing adaptability. The Inner-Shape-IoU loss function, designed to focus on bounding box shapes and incorporate auxiliary boxes, is introduced to accelerate convergence and improve regression accuracy. When evaluated on the VisDrone dataset, the AMFEF-DETR demonstrated improvements of 4.02% and 16.71% in mAP50 and FPS, respectively, compared to the RT-DETR. Additionally, the AMFEF-DETR model exhibited strong robustness, achieving mAP50 values 2.68% and 3.75% higher than the RT-DETR and YOLOv10, respectively, on the HIT-UAV dataset.

1. Introduction

With the rapid progress of unmanned aerial vehicle (UAV) technology [1] and deep learning [2], UAVs have evolved into powerful edge computing devices [3], capable of integrating sophisticated intelligent algorithms. Aerial image target detection based on UAVs has gained widespread application in fields such as traffic regulation [4], urban planning [5], and disaster emergency responses [6]. This technology leverages high-resolution cameras and embedded high-precision detection models mounted on UAVs to identify and track ground targets over long distances and large areas. Particularly in complex urban environments, aerial images provide a unique wide-angle perspective, facilitating the identification of subtle changes and small objects on the ground [7]. Compared to satellite remote sensing, UAVs offer lower operational costs, faster data acquisition, and the ability to capture high-quality images in real time, which are critical for decision-making and monitoring [8]. However, due to the varying flight altitudes and camera angles of UAVs, the captured objects often exhibit different sizes and are easily confused with the background, creating substantial challenges for detection algorithms [9]. Moreover, aerial imaging is subject to external factors, such as illumination changes, weather conditions, and occlusions, which can degrade image quality and impact recognition accuracy [10]. In long-distance aerial photography, mutual interference and occlusions among targets in crowded scenes further complicate the detection of small objects, increasing the difficulty for algorithms to extract meaningful features from these images [11]. At the same time, the limited hardware resources of UAVs necessitate a reduction in computational complexity in models to improve detection performance, especially for tasks requiring high real-time performance [12]. Therefore, developing a high-precision, efficient, and robust small-object detection algorithm for UAV aerial images in complex environments holds significant theoretical research value and practical application prospects.
Due to the overhead perspective of data collected by UAVs, the target distribution density in the images is uneven, and the target sizes are relatively small compared to ordinary perspectives. Targets such as pedestrians and vehicles are typically considered small targets when they occupy less than 32 × 32 pixels [13,14]. Additionally, small targets in aerial views are frequently hidden by trees or buildings, making it a significant challenge for detection algorithms to extract useful feature information from these high-interference small targets. Traditional machine learning techniques face certain limitations when identifying small objects in aerial images. For instance, while Baykara et al. used frame differencing and morphological dilation to detect moving targets, this approach is sensitive to lighting variations and fails to capture detailed target contours, affecting subsequent recognition and tracking tasks [15]. Similarly, Bazi et al. proposed a novel Convolutional Support Vector Machine (CSVM) network for analyzing UAV imagery. This CSVM employs a collection of linear SVMs as a filter bank to produce feature maps, rendering it especially appropriate for tasks with highly limited training samples. However, the dense sampling strategy of the sliding window in this method is computationally inefficient, and its feature learning capability still lags behind that of end-to-end deep neural networks [16]. Reference [17] utilized color histogram calculations to detect the main features of the target, and proposed a UAV visual target tracking system grounded in motion detection and the recognition of dominant features. The results showed that this lightweight system achieved fast target detection speeds, making it suitable for practical applications. However, it is easily affected by environmental changes and struggles to accurately detect small targets with complex occlusions in low-resolution images.
Deep learning surpasses traditional machine learning in identifying small targets in aerial imagery by automatically learning hierarchical feature representations, leading to higher accuracy and robustness in challenging conditions. Models such as the Faster R-CNN [18], the Mask R-CNN [19], the EfficientDet [20], and the YOLO series [21] excel in feature extraction and are now widely applied in various object recognition applications. The Facebook team [22] proposed the end-to-end detection model DETR in 2020. It uses a transformer architecture to predict object boundaries and categories directly, without the need for predefined anchor boxes or NMS post-processing. However, the DETR shows a less satisfactory performance in detecting small objects, and its matching-based loss function makes training convergence challenging. In 2020, Zhu et al. [23] presented the deformable DETR model, which focuses the attention module on a specific set of key sampling points around a reference. This approach mitigates problems with a slow training speed and limited detail in feature resolution, significantly improving the small-object detection performance. Deformable DETR also improves the performance when using multi-scale features but at a high computational cost. In 2022, Roh et al. [24] launched the sparse DETR, which selectively updates decoder-referenced tokens, reducing encoder attention complexity and enabling more encoder layers, for better performance with the same computational budget. In 2023, Zhao et al. [25] introduced the RT-DETR, the first real-time end-to-end detection transformer. This model effectively handles features of varying scales through within-scale interaction and between-scale fusion, excelling in both speed and accuracy. However, this model still exhibits high computational requirements and significant memory usage. Cheng et al. [26] presented an embedded lightweight aerial detection model for UAVs, integrating an innovative MC feature extraction module that improves accuracy while reducing complexity. Although the model achieves real-time detection, its performance in various real-world environments and overlapping UAV scenarios requires further improvement. Zhang et al. [27] presented the SINextNet model for identifying small targets in overhead imagery. It utilizes a SINext module leveraging depth-separable convolution to expand the receptive field and enhances feature expression by integrating small object features with background information. However, the increased computational cost diminishes real-time performance. Wang et al. [28] developed a lightweight infrared small-object detection model for aerial imagery called PHSI-RTDETR. This algorithm makes the model more lightweight. However, its performance still requires improvement for small objects that are prone to occlusion. Ren et al. [29] introduced a fine-grained domain object detection algorithm, FGFD, which boosts the performance by learning both cross-domain and unique domain features and incorporating detailed domain mix augmentation. However, its accuracy for occluded small objects still needs improvement. Wu et al. [30] introduced an efficient exchange network and multi-scale feature fusion optimized model for small-object detection called EM-YOLO. The algorithm seeks to resolve the problem of missed detections resulting from substantial size variations and mutual occlusions. However, the model still requires improvements in terms of its lightweight design. Tan et al. [31] introduced a multi-scale UAV imagery detection method utilizing adaptive feature fusion, which effectively identifies small target objects by adjusting the receptive field and improving shallow feature representation. However, the computational efficiency of the method may require further optimization for real-time applications. Battish et al. [32] proposed a Spatial Dilated Multi-Scale Network (SDMNet) architecture that uses multi-scale enhanced effective channel attention to preserve target details in images, but the model does not consider generalization in infrared scenarios. Xin et al. [33] proposed a new Effective Receptive Field (ERF) module and rigorously optimized the path aggregation network structure with it to reduce network parameters, but the improvement in accuracy was not significant.
In the tasks of object classification and localization, deep learning methods have demonstrated a significantly superior performance compared to traditional machine learning techniques. This advantage mainly stems from the powerful feature extraction capabilities of deep learning. Moreover, its complex network structures can effectively interact with and fuse features at different scales, capturing multi-scale semantic information and contextual relationships within the data. However, basic deep learning models often struggle to accurately recognize and locate small objects in aerial images, as these typically occupy very small pixel areas, and are often confused with the background. Additionally, the complexity of aerial scenes, such as varying shooting angles, lighting conditions, and target densities, further increases the difficulty of detection. These factors lead to issues such as missed detections and false positives in small-object detection tasks in aerial imagery. Furthermore, the computational intensity and processing delays of complex models can impact the overall detection performance and practicality. To enhance the recognition effectiveness of UAVs for multi-scale ground targets during dynamic flight, particularly in complex environments where objects are dense and prone to occlusion, we propose an end-to-end adaptive multi-scale feature extraction and fusion network for target detection in UAV aerial images, named AMFEF-DETR. The primary contributions of this study are as follows:
We integrate frequency-adaptive dilated convolution (FADC) with BasicBlock to form the frequency-adaptive FADC-Block module, which is used to construct the adaptive feature extraction FADC-ResNet backbone network. This network can dynamically adjust the dilation rate and has adaptive frequency awareness, enabling the convolutional kernels to better adapt to the scale variations for small objects in aerial images, and to capture more high-frequency detail information for small targets. Additionally, the frequency selection module enhances feature representation, suppressing background interference and enlarging the receptive field, thereby improving the robustness of the model.
This study proposes a novel intra-scale feature interaction module named HLIFI. The HLIFI module leverages the HiLo attention mechanism to process high-level features from the backbone network. By encoding the interaction of high-frequency and low-frequency components through dual pathways, the network is better able to focus on the detailed information of dense small objects while reducing interference from occlusions or background noise.
We present a novel bidirectional adaptive feature pyramid network for cross-scale feature fusion. The BAFPN introduces an adaptive fusion module and high-resolution shallow feature maps to better capture the fine-grained details of small objects, making it suitable for detecting small targets from high altitudes and long distances using UAVs. The adaptive fusion mechanism in the BAFPN can dynamically adjust the weights of each feature map, effectively integrating multi-scale information and enhancing the adaptability of the model to different scenes.
This study introduces a novel Inner-Shape-IoU loss function by combining Shape-IoU with the auxiliary bounding box-based Inner-IoU loss. This function emphasizes the shape and scale of bounding boxes to improve regression precision. Moreover, it utilizes auxiliary bounding boxes to accelerate convergence while enhancing the detection capability for small objects at long distances.

2. Materials and Methods

2.1. The AMFEF-DETR Model Architecture

The proposed AMFEF-DETR is an innovative, real-time, end-to-end, small-object detection model for UAV aerial images. Small aerial targets typically exhibit simple textures and dense occlusions, and are susceptible to background interference, making precise recognition challenging. Compared to the YOLO series, the AMFEF-DETR demonstrates a superior performance under the same testing conditions. It achieves faster training speeds without employing mosaic or mixed data augmentation strategies, while also attaining greater accuracy and a better balance between detection accuracy and inference speed. The AMFEF-DETR model consists of four main components: FADC-ResNet as the backbone network for feature extraction, the HLIFI module for intra-scale feature interaction, the BAFPN for cross-scale feature fusion, and the transformer decoder with auxiliary prediction heads. Figure 1 provides an overview of the AMFEF-DETR model.
Addressing the challenges of small-object detection in UAV aerial images, this model integrates various innovative techniques to boost accuracy and robustness while ensuring efficiency. We introduce frequency-adaptive dilated convolution [34] into residual blocks to form the frequency-adaptive FADC-Block module. By replacing the BasicBlock in ResNet-18 [35] with the FADC-Block module, we construct a new adaptive residual network, FADC-ResNet, as the backbone for feature extraction. This network can dynamically adjust the dilation rate and capture high-frequency details, thereby improving scale adaptability and reducing background interference. The HLIFI module utilizes the HiLo attention [36] mechanism to process high-level features, focusing on the high-frequency detail information of dense small objects and mitigating the noise impact from occluded backgrounds. To achieve cross-scale feature fusion, the proposed BAFPN introduces an adaptive fusion module and high-resolution shallow feature maps, dynamically adjusting the weights of feature maps to capture fine-grained details from different heights and distances. Finally, we propose a novel Inner-Shape-IoU loss function, which calculates the loss by emphasizing the shape and scale of the bounding boxes to improve regression accuracy and accelerate convergence, while employing the scale factor ratio to generate auxiliary bounding boxes at different scales, thereby enhancing the detection performance for extremely small objects. The detailed architecture of the AMFEF-DETR model is shown in Figure 2.
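For readers who prefer code, the following is a minimal structural sketch of the data flow described above; the module classes and the exact set of pyramid levels passed to the BAFPN are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn

class AMFEFDETRSketch(nn.Module):
    """Structural sketch only: wires the four components in the order described."""
    def __init__(self, backbone, hlifi, bafpn, decoder):
        super().__init__()
        self.backbone = backbone   # FADC-ResNet: adaptive multi-scale feature extraction
        self.hlifi = hlifi         # HiLo-based intra-scale interaction on the top level
        self.bafpn = bafpn         # bidirectional adaptive cross-scale feature fusion
        self.decoder = decoder     # transformer decoder with auxiliary prediction heads

    def forward(self, images):
        s2, s3, s4, s5 = self.backbone(images)       # shallow -> deep feature maps
        f5 = self.hlifi(s5)                          # refine the high-level features
        fused = self.bafpn([s2, s3, s4, f5])         # fuse multi-scale features
        return self.decoder(fused)                   # predicted boxes and classes
```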

2.2. Frequency-Adaptive Dilated Feature Extraction Network

To achieve an optimal balance between computational efficiency and feature extraction capability, we employ the comparatively compact ResNet-18 as the base backbone network. The residual structure of ResNet-18 not only helps mitigate the vanishing gradient problem in deep networks but promotes feature reuse and propagation. Building upon this foundation, we innovatively incorporate FADC into the residual blocks, forming FADC-Block to further enhance the feature extraction capabilities. FADC enhances the spatial adaptability of the structure by adaptively adjusting dilation rates. Moreover, it dynamically modifies the ratio of high- and low-frequency components in convolution weights and applies spatially variant reweighting, improving the capture of target detail information, and increasing the effective bandwidth and receptive field size. Consequently, FADC-ResNet is capable of more accurately extracting the features of ground targets in UAV aerial images while maintaining a high computational efficiency. Figure 3 illustrates the structural design of the FADC.
The proposed FADC-ResNet backbone network is constructed by replacing the BasicBlock modules in the P4 and P5 layers of ResNet-18 with FADC-Block modules, as illustrated in Figure 4. This adaptive feature extraction network can dynamically adjust the dilation rates and convolution kernel weights based on the input features, enabling it to capture the fine-grained details crucial for detecting small objects. Moreover, by incorporating frequency selection, FADC-ResNet can adaptively emphasize the high-frequency components of small targets while suppressing less relevant background information. This design greatly improves the network’s ability to capture features of multi-scale ground targets such as cars, bicycles, and pedestrians in UAV aerial images, improving the detection accuracy and adaptability.
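As an illustration of how such a backbone could be assembled, the sketch below swaps the BasicBlocks of torchvision's ResNet-18 in the P4/P5 stages (layer3/layer4) for a stand-in FADC-Block. The stand-in simply uses a fixed dilated convolution, whereas the real FADC [34] adapts the dilation rate and frequency weighting dynamically; it is a sketch under these assumptions, not the authors' implementation.

```python
import torch.nn as nn
from torchvision.models import resnet18

class FADCBlock(nn.Module):
    """Stand-in for the FADC-Block: a BasicBlock-shaped residual block whose second
    3x3 conv is dilated. The real FADC [34] adapts the dilation rate and re-weights
    frequency components dynamically; that logic is omitted here."""
    def __init__(self, in_ch, out_ch, stride=1, dilation=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

def build_fadc_resnet18():
    """Replace the BasicBlocks in the P4/P5 stages (layer3/layer4) with FADCBlocks."""
    net = resnet18(weights=None)
    for stage_name in ("layer3", "layer4"):
        blocks = []
        for blk in getattr(net, stage_name):
            in_ch = blk.conv1.in_channels
            out_ch = blk.conv2.out_channels
            stride = blk.conv1.stride[0]
            blocks.append(FADCBlock(in_ch, out_ch, stride))
        setattr(net, stage_name, nn.Sequential(*blocks))
    return net
```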

2.3. Feature Interaction Utilizing the HLIFI Module

This study integrates the HiLo mechanism into the single-scale Transformer encoder layer to develop the HLIFI intra-scale feature interaction module. The HLIFI module is utilized to process high-level features, enriching feature fusion while improving the capability to capture fine-grained semantic information of small targets. Traditional Multi-Head Self-Attention (MSA) layers exhibit limitations when processing the features of different frequencies, as they perform the same global attention operation on all image blocks, overlooking the differences between high-frequency and low-frequency features. This limitation is particularly detrimental to densely distributed small targets, which typically contain a high percentage of high-frequency detail information. To address this issue, the HiLo mechanism divides the MSA layer into two parallel paths: one path utilizes localized self-attention and relatively high-resolution feature maps to encode high-frequency interactions, capturing such critical details as the edges and textures of small objects; the other path employs global attention and downsampled features to encode low-frequency interactions, capturing the global structure and contextual features of the ground targets. This separation enables the model to better adapt to features with various frequencies, improving the detection accuracy for densely distributed ground objects. Figure 5 illustrates the HiLo attention mechanism and the HLIFI module.
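The sketch below illustrates this dual-path idea under simplifying assumptions: a fraction alpha of the attention heads performs global attention over average-pooled keys and values (low frequencies), while the remaining heads perform window-local attention (high frequencies). The head-split ratio, window size, and projection layout are illustrative defaults, not the exact HiLo implementation of [36].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleHiLo(nn.Module):
    """Simplified HiLo-style attention: `alpha` of the heads attend globally to
    average-pooled keys/values (low frequencies); the rest attend locally inside
    non-overlapping windows (high frequencies). H and W must be divisible by `window`."""
    def __init__(self, dim, num_heads=8, alpha=0.5, window=2):
        super().__init__()
        head_dim = dim // num_heads
        self.dim, self.window = dim, window
        self.l_heads = int(num_heads * alpha)        # low-frequency heads
        self.h_heads = num_heads - self.l_heads      # high-frequency heads
        self.l_dim, self.h_dim = self.l_heads * head_dim, self.h_heads * head_dim
        self.scale = head_dim ** -0.5
        self.h_qkv = nn.Linear(dim, self.h_dim * 3)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, self.l_dim * 2)
        self.proj = nn.Linear(self.h_dim + self.l_dim, dim)

    def hi_path(self, x):                            # window-local attention
        B, H, W, _ = x.shape
        w, nh = self.window, self.h_heads
        x = x.reshape(B, H // w, w, W // w, w, self.dim).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, self.dim)           # (num_windows, w*w, C)
        q, k, v = self.h_qkv(x).reshape(-1, w * w, 3, nh, self.h_dim // nh).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(-1, w * w, self.h_dim)
        out = out.reshape(B, H // w, W // w, w, w, self.h_dim).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, self.h_dim)

    def lo_path(self, x):                            # global attention on pooled keys/values
        B, H, W, _ = x.shape
        nl = self.l_heads
        q = self.l_q(x).reshape(B, H * W, nl, -1).transpose(1, 2)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.window).permute(0, 2, 3, 1)
        k, v = self.l_kv(pooled).reshape(B, -1, 2, nl, self.l_dim // nl).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, H, W, self.l_dim)

    def forward(self, x):                            # x: (B, H, W, C)
        return self.proj(torch.cat([self.hi_path(x), self.lo_path(x)], dim=-1))
```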
As shown in Figure 5b, the HLIFI module is specifically tailored to handle high-level features from the FADC-ResNet network, which are rich in semantic information. By leveraging the HiLo mechanism, the approach first decouples the high and low frequencies in the feature maps through separate paths. Subsequently, the refined high-frequency and low-frequency features are concatenated and propagated to the subsequent layers. This not only improves computational efficiency but enhances the capacity of the model to capture the detailed and high-resolution features of tiny objects. The HLIFI module efficiently captures extensive dependencies across the feature maps, thereby improving the representation of high-level features while avoiding redundant calculations on lower-level features, thus reducing computational overheads. The computational process of the HLIFI module in handling high-level features is described as follows:
$$Q = K = V = \mathrm{Flatten}(S_5)$$
$$F_5 = \mathrm{Reshape}(\mathrm{HLIFI}(Q, K, V))$$
where $S_5$ represents the high-level features, and the $\mathrm{Flatten}$ operation converts the two-dimensional $S_5$ features into a one-dimensional vector. After flattening, $Q$, $K$, and $V$ are fed into the HLIFI module for further processing. The HLIFI module utilizes the HiLo mechanism to process these features, optimizing the feature representation. The processed output then undergoes a $\mathrm{Reshape}$ operation, converting it back to its original two-dimensional form, denoted as $F_5$. This step is performed to preserve the spatial layout of the feature maps, ensuring compatibility with the input requirements of the subsequent cross-scale feature fusion layer.
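A compact sketch of this flatten–interact–reshape pipeline is given below, reusing the SimpleHiLo sketch from the previous subsection; the residual and FFN layout of the encoder layer is an assumption borrowed from standard transformer encoders, not the authors' exact configuration.

```python
import torch.nn as nn

class HLIFISketch(nn.Module):
    """Flatten -> HiLo-based encoder layer -> Reshape, mirroring the equations above."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = SimpleHiLo(dim, num_heads)       # dual-path intra-scale interaction
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, s5):                           # s5: (B, C, H, W) high-level features
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).transpose(1, 2)       # Q = K = V = Flatten(S5): (B, H*W, C)
        x = tokens.reshape(b, h, w, c)               # grid layout expected by SimpleHiLo
        x = x + self.attn(self.norm1(x))             # HiLo attention with residual connection
        x = x + self.ffn(self.norm2(x))              # position-wise FFN with residual connection
        return x.permute(0, 3, 1, 2)                 # Reshape back to (B, C, H, W) as F5
```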

2.4. Improved Cross-Scale Feature Fusion Network

In the neck network for feature fusion, the HLIFI module is first employed to process high-level features, enhancing their representational capacity and semantic information. Subsequently, features from different levels are input into the cross-scale feature fusion network for integration, improving the adaptability of the model to multi-scale objects, and resulting in more comprehensive and accurate detection outcomes.
The traditional feature pyramid network (FPN) [37] achieves multi-scale feature fusion by transmitting deep semantic information to shallow layers through a top-down pathway. The PANet (path aggregation network) [38] extends the FPN by incorporating an upward pathway, further enhancing feature propagation and aggregation, although at the cost of significantly increased computational complexity. The BiFPN (bidirectional feature pyramid network) [20] removes single-input nodes with minimal contributions and employs bidirectional feature propagation paths. Through weighted feature fusion and multi-scale feature reuse, the BiFPN achieves more efficient and accurate multi-scale feature fusion. Building upon the foundation of the BiFPN, this paper proposes a bidirectional adaptive feature pyramid network (BAFPN) for multi-scale feature fusion. The BAFPN introduces an adaptive fusion module and high-resolution shallow feature maps, which enable a better focus on the fine details of ground targets. A comparison of different feature pyramid network structures is illustrated in Figure 6.
The BAFPN, which performs cross-scale feature fusion, consists of convolutional blocks, upsample layers, RepC3 modules, and an adaptive fusion module. Among these, the adaptive fusion module first concatenates the input feature maps along the channel dimension and generates spatially adaptive weights through a convolutional layer. Subsequently, these weights are normalized using a softmax function, and the normalized weights are then split into the same number of parts as the original feature maps. Finally, each feature map is multiplied by its corresponding weight, and the weighted feature maps are summed to generate an output feature map that aggregates contextual information. This adaptive fusion mechanism dynamically adjusts the weights of each feature map, effectively integrating multi-scale information, which makes it suitable for aerial-to-ground detection. The three different fusion strategies are shown in Figure 7.
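The adaptive fusion step described above can be sketched as follows; it assumes the input feature maps have already been resized and projected to a common shape, and the 1 × 1 kernel used to produce the spatial weights is an illustrative choice rather than the authors' exact setting.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Concatenate inputs, predict one spatial weight map per input, softmax-normalize
    the weights across inputs, then compute the weighted sum of the feature maps."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        self.weight_conv = nn.Conv2d(channels * num_inputs, num_inputs, kernel_size=1)

    def forward(self, feats):                        # list of (B, C, H, W) tensors
        x = torch.cat(feats, dim=1)                  # concatenate along the channel dimension
        weights = torch.softmax(self.weight_conv(x), dim=1)   # normalize over the inputs
        weights = torch.split(weights, 1, dim=1)     # one (B, 1, H, W) map per input
        # weighted sum aggregates contextual information across scales
        return sum(w * f for w, f in zip(weights, feats))

# usage: fuse three 256-channel maps of the same spatial size
fuse = AdaptiveFusion(channels=256, num_inputs=3)
out = fuse([torch.randn(1, 256, 40, 40) for _ in range(3)])
```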

2.5. The Inner-Shape-IoU Loss Function

Compared to traditional IoU, Shape-IoU [39] not only emphasizes the overlap between the predicted and ground truth boxes but evaluates the similarity of the bounding box shapes. This additional consideration enhances the sensitivity of the detection algorithm to shape features, thereby improving the accuracy of the bounding box regression, particularly for targets with irregular shapes. The Shape-IoU formula is as follows:
$$ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}, \quad hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$$
$$distance^{shape} = hh \times \frac{(x_c - x_c^{gt})^2}{c^2} + ww \times \frac{(y_c - y_c^{gt})^2}{c^2}$$
$$\Omega^{shape} = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4$$
$$L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$$
where $scale$ is a scaling factor. The weights in the horizontal and vertical directions are $ww$ and $hh$, respectively. The width and height of the ground truth box are denoted as $w^{gt}$ and $h^{gt}$, respectively, while $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ are the centers of the predicted and ground truth boxes and $c$ is the diagonal of the smallest enclosing box. Following [39], $\omega_w = hh \times |w - w^{gt}| / \max(w, w^{gt})$ and $\omega_h = ww \times |h - h^{gt}| / \max(h, h^{gt})$. $IoU$ represents the intersection over union, $distance^{shape}$ denotes the shape distance loss, and $\Omega^{shape}$ denotes the scale penalty.
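A sketch of the Shape-IoU computation for boxes in (x1, y1, x2, y2) format is given below; it follows the formulation above and in [39], with $\omega_w$ and $\omega_h$ as defined there, but it is an illustrative re-implementation rather than the authors' code.

```python
import torch

def shape_iou_loss(pred, target, scale=0.0, eps=1e-7):
    """pred, target: (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    w, h = px2 - px1, py2 - py1
    wg, hg = tx2 - tx1, ty2 - ty1

    # plain IoU
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)

    # squared diagonal of the smallest enclosing box
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # shape weights ww, hh derived from the ground-truth box
    ww = 2 * wg.pow(scale) / (wg.pow(scale) + hg.pow(scale))
    hh = 2 * hg.pow(scale) / (wg.pow(scale) + hg.pow(scale))

    # shape-weighted center distance
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    dist_shape = hh * dx ** 2 / c2 + ww * dy ** 2 / c2

    # shape (scale) penalty with theta = 4
    omega_w = hh * (w - wg).abs() / torch.max(w, wg)
    omega_h = ww * (h - hg).abs() / torch.max(h, hg)
    omega_shape = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + dist_shape + 0.5 * omega_shape
```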
Building on Shape-IoU, we incorporate auxiliary boxes from Inner-IoU [40], thereby proposing Inner-Shape-IoU for loss calculation. This technique not only accelerates convergence but also reduces false positives and false negatives for extremely small objects. Inner-Shape-IoU is defined as follows:
$$inter = \left(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\right) \times \left(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\right)$$
$$union = w^{gt} \times h^{gt} \times ratio^2 + w \times h \times ratio^2 - inter$$
$$IoU^{inner} = \frac{inter}{union}$$
$$L_{Inner\text{-}GIoU} = L_{GIoU} + IoU - IoU^{inner}$$
$$L_{Inner\text{-}Shape\text{-}IoU} = L_{Shape\text{-}IoU} + IoU - IoU^{inner}$$
where $w$ and $h$ denote the width and height of the anchor, respectively, and $ratio$ represents the scaling factor, which usually falls within the range of [0.5, 1.5]. $b_l$, $b_r$, $b_t$, and $b_b$ (and their $gt$ counterparts) denote the left, right, top, and bottom edges of the auxiliary boxes, which share the centers of the anchor and ground truth boxes but have their widths and heights scaled by $ratio$. $L_{Shape\text{-}IoU}$ represents the bounding box regression loss of Shape-IoU.
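Building on the previous sketch, the following illustrates how the auxiliary (inner) boxes and the Inner-Shape-IoU combination could be computed; the box format and function names are assumptions for illustration, reusing shape_iou_loss from the sketch above.

```python
import torch

def inner_iou(pred, target, ratio=0.75, eps=1e-7):
    """IoU of the auxiliary boxes: same centers, widths/heights scaled by `ratio` [40]."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    pcx, pcy, pw, ph = (px1 + px2) / 2, (py1 + py2) / 2, px2 - px1, py2 - py1
    tcx, tcy, tw, th = (tx1 + tx2) / 2, (ty1 + ty2) / 2, tx2 - tx1, ty2 - ty1

    bl, br = pcx - pw * ratio / 2, pcx + pw * ratio / 2
    bt, bb = pcy - ph * ratio / 2, pcy + ph * ratio / 2
    bl_gt, br_gt = tcx - tw * ratio / 2, tcx + tw * ratio / 2
    bt_gt, bb_gt = tcy - th * ratio / 2, tcy + th * ratio / 2

    inter = (torch.min(br_gt, br) - torch.max(bl_gt, bl)).clamp(0) * \
            (torch.min(bb_gt, bb) - torch.max(bt_gt, bt)).clamp(0)
    union = tw * th * ratio ** 2 + pw * ph * ratio ** 2 - inter + eps
    return inter / union

def inner_shape_iou_loss(pred, target, ratio=0.75, scale=0.0, eps=1e-7):
    """L_Inner-Shape-IoU = L_Shape-IoU + IoU - IoU_inner."""
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    union = ((pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1]) +
             (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1]) - inter + eps)
    iou = inter / union
    return shape_iou_loss(pred, target, scale) + iou - inner_iou(pred, target, ratio)
```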

2.6. Datasets

This study employed the VisDrone dataset [41], a comprehensive benchmark for detecting objects in UAV-captured aerial images, to assess the model improvements. The VisDrone dataset comprises 10,209 images collected from various urban and suburban areas, such as city streets, residential neighborhoods, and industrial zones, as depicted in Figure 8. These images are acquired by UAVs with different altitudes, camera orientations, and weather conditions, showcasing a wide range of objects including pedestrians, vehicles, bicycles, and trucks across ten categories. The majority of the objects in the dataset have a pixel size smaller than 32 × 32, classifying them as small objects. The images in the dataset exhibit significant variations in scale, pose, and density, with objects appearing in various sizes, orientations, and levels of occlusion. The diversity and complexity of the VisDrone dataset allow the model to learn and extract robust features for accurate object detection in real-world scenarios. The resolution of the images in the dataset varies, with the majority having a resolution of 2000 × 1500 pixels.
To ensure a comprehensive evaluation of the performance of the model, the dataset was divided into three distinct subsets. The training subset, comprising 6471 images, was used to optimize the parameters of the model and learn robust features. A separate validation subset, consisting of 548 images, was employed to fine-tune the hyperparameters of the model and prevent overfitting. Finally, the remaining 1610 images were allocated to the test subset, serving as an independent set to assess the generalization ability of the model and its overall performance on unseen data. Table 1 provides a detailed breakdown of the dataset composition, including the numbers of images and objects within each subset, as well as the distribution of object categories across the entire dataset.

2.7. Evaluation Indicators

To assess the effectiveness of the AMFEF-DETR, this study employed several metrics for comparative analysis. The equations for calculating precision and recall were as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
where $TP$ represents true positives, $FP$ represents false positives, and $FN$ represents false negatives. The F1 score, which is the harmonic mean of precision and recall, combines these indicators into a single measure.
$$F_1 = \frac{2}{Recall^{-1} + Precision^{-1}} = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The average precision (AP) represents the mean precision for a single class, while the mean average precision (mAP) is the average of AP values across all classes, serving as a comprehensive metric for evaluating model performance in multi-class object detection tasks. The formulas are as follows:
$$AP = \int_0^1 P(r)\,dr$$
$$mAP = \frac{\sum_{j=1}^{S} AP(j)}{S}$$
where $S$ symbolizes the overall number of categories. Additionally, FLOPs quantify computational complexity, while parameters reflect storage requirements and complexity, together assessing the efficiency and scale of the model.
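For illustration, the snippet below computes AP as the area under a precision–recall curve and averages it over classes to obtain mAP; a full evaluation additionally matches predictions to ground truth boxes at a given IoU threshold (e.g., 0.5 for mAP50), which is omitted here.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]         # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]               # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_pr):
    """per_class_pr: dict mapping class id -> (recall array, precision array)."""
    aps = [average_precision(r, p) for r, p in per_class_pr.values()]
    return sum(aps) / len(aps)
```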

3. Experiments and Results

3.1. Experimental Environment and Parameter Configuration

The experimental setup comprised an Ubuntu 20.04 OS, a Xeon Platinum 8352 V CPU, an RTX 4090 GPU, and 90 GB of memory. The detailed settings are shown in Table 2.

3.2. Feature Extraction Network Comparison Experiment

Enhancing the ability of UAV-embedded algorithms to detect multi-scale targets in aerial images plays a crucial role in both civilian and military domains. The difficulty associated with this task stems from the need for UAVs to accurately identify and pinpoint ground targets such as vehicles, pedestrians, or buildings in intricate settings from elevated heights, while ensuring real-time detection. To capture more detailed feature information on ground targets, this paper proposes a novel adaptive feature extraction network called FADC-ResNet. The network is composed of the FADC-Block module, which introduces frequency-adaptive dilated convolution. The FADC-Block module replaces the BasicBlock residual blocks in the last two layers of the original network to enhance its ability to capture and adapt to the scale and shape variations for small targets in aerial images. To demonstrate the effectiveness of the FADC-Block, several state-of-the-art convolutional modules were selected for improvements and comparative testing in the same positions, with the results presented in Table 3.
Table 3 presents the comparative experimental results of different backbone networks integrated with various convolutions on the test set. The model incorporating the adaptive FADC-Block achieves the highest mAP50 and mAP50:95 of 39.01% and 22.85%, respectively, which are 1.67% and 1.16% higher than the baseline model. At the same time, the computational load is reduced by 13.09%, while the model parameters only increase by 0.6%. Meanwhile, DySnakeConv [42], which focuses on capturing elongated and tortuous tubular structures, is integrated into the network. The network improved with DySnakeConv increases the mAP50 by 0.55%, but at the same time, the network parameters and GFLOPs significantly increase by 50.13% and 13.61%, thus failing to meet the requirements for lightweight and real-time model detection. Furthermore, AKConv [43], DualConv [44], and PConv [45], which are all lightweight convolutional structures, reduce the model parameters and GFLOPs when used to improve the backbone network. However, the mAP50 growth levels are relatively small, at only 0.44%, 1.54%, and 1.16%, respectively. Lastly, iRMB [46] is highly efficient in modeling both short-range dependencies and long-range interactions, and its simple structure makes it suitable for deployment on mobile devices. The IRMB-Block significantly optimizes various indicators compared to BasicBlock; however, its precision and mAP50 are 1.58% and 0.37% lower, respectively, than the improved network proposed in this study. The above comparative experimental analysis shows that our proposed FADC-ResNet feature extraction network is superior in extracting UAV aerial image features compared to other mainstream networks.

3.3. Analyzing the Performance of the HLIFI Module

To verify the performance enhancement provided by the HiLo attention-based HLIFI module in multi-scale object detection on UAV aerial images, this study generated and compared heatmaps before and after the introduction of the HLIFI module using LayerCAM [47]. Figure 9 presents the results of this comparative evaluation.
From the original images in Figure 9, the UAV scene features numerous targets of varying scales and types, including cars, pedestrians, and buses. These targets are distributed relatively densely, with a certain degree of occlusion and overlap. When comparing the visualized heatmaps, it can be observed that the RT-DETR model without the HLIFI module tends to overlook certain small targets or pay them only light attention. In contrast, the AMFEF-DETR model, which incorporates the HLIFI module, demonstrates a significantly heightened focus on small-scale targets. The model is capable of more accurately concentrating on the boundaries and contours of the targets, generating heatmaps with a greater overlap with the target regions. This indicates that HLIFI not only enhances the scale invariance of the model but improves the localization precision for features, enabling the attention points to align more closely with the true regions of the targets. These results intuitively showcase the superiority and effectiveness of the AMFEF-DETR model in completing small-target detection tasks in complex UAV aerial images, verifying that the HLIFI module can significantly boost detection performance.

3.4. Verifying the Effectiveness of the Adaptive Feature Fusion Network

As shown in Table 4, the comparative experiments of different cross-scale feature fusion network enhancement strategies revealed the specific model optimization effect. Model 2, which introduces the BiFPN strategy, significantly improves the precision and recall on both the validation and test sets compared to the baseline Model 1 using PAFPN as the network for fusing features. The mAP50 and mAP50:95 on the validation set increased by 2.07% and 1.32%, respectively. Similarly, the mAP50 and mAP50:95 on the test set also improved by 1.29% and 0.81%, respectively, demonstrating the effectiveness of the BiFPN. To further boost the interaction with and fusion of multi-scale target feature details, we employed three fusion enhancement strategies to improve the BiFPN. Model 3 adopts the weighted fusion strategy to further optimize the fusion approach of the BiFPN, increasing the mAP50 and mAP50:95 on the validation set to 50.82% and 31.30%, respectively. The precision and recall on the test set also improve, showcasing the advantage of the weighted fusion strategy in enhancing the detection accuracy. Model 4 utilizes the concatenation fusion strategy, resulting in 2.88% and 1.61% increases in precision and recall on the validation set, respectively. However, the mAP50 decreases by 0.04% compared to Model 2, indicating that this strategy improves certain performance metrics, but falls short of the overall expectations. Ultimately, Model 5, which employs the adaptive fusion strategy (BAFPN), demonstrates the best performance on both the validation and test sets. The mAP50 and mAP50:95 on the validation set increase by 2.56% and 1.78%, reaching 50.99% and 31.74%, respectively. On the test set, the precision and recall improve by 0.84% and 1.41%, respectively, with the highest increases in mAP50 and mAP50:95 of 1.47% and 1.09% compared to the baseline. This signifies that the adaptive fusion strategy significantly enhances multi-scale target detection capabilities while maintaining computational efficiency. Overall, the BAFPN, which includes the adaptive fusion strategy, was identified in this study as the most efficient adaptive feature fusion network for enhancing small-target accuracy.

3.5. Comparative Experiments of Different Loss Functions

To verify the effectiveness of the Inner-Shape-IoU loss function, we conducted experiments comparing it with GIoU, DIoU, CIoU, and SIoU loss functions [48,49,50]. The data presented in Table 5 indicate that the Shape-IoU, with a target scale factor of 0.5, significantly enhances the model performance. Relative to the baseline model employing GIoU loss, the mAP50 and mAP50:95 on the test set improved by 0.53% and 0.16%, respectively. Subsequently, by implementing a ratio factor-regulated auxiliary bounding box, the Inner-Shape-IoU loss function was derived. The model using the Inner-Shape-IoU loss function with a 0.75 ratio setting showed the most significant performance improvements. It exhibited a 0.88% increase in mAP50 and a 0.53% increase in mAP50:95 on the validation set compared to the baseline model. On the test set, precision improved by 1.74%, recall by 0.30%, and the mAP50 and mAP50:95 by 1.07% and 0.54%, respectively. These results suggest that using the Inner-Shape-IoU loss function can provide a more consistent regression on bounding boxes and significantly enhance the prediction precision. Furthermore, for other ratio settings, such as 1.15, we also noted stable performance improvements, further validating the effectiveness of Inner-Shape-IoU in multi-scale target detection.

3.6. Ablation Study

To evaluate the impact of the proposed improvement modules on the AMFEF-DETR model, eight sets of ablation studies were conducted. The baseline network underwent the following sequential upgrades: the original ResNet-18 was substituted with FADC-ResNet, an adaptive feature extraction backbone, the HLIFI module incorporating the HiLo mechanism was included, and the weighted bidirectional feature pyramid network with an adaptive fusion approach was applied for multi-scale feature fusion. Additionally, Inner-Shape-IoU was implemented as the loss function. These experiments were performed by incrementally integrating each improvement module, with the findings detailed in Table 6.
The ablation study results presented in Table 6 indicate that replacing the original residual network with the FADC-ResNet adaptive feature extraction backbone in Model 2 resulted in improvements of 1.67% and 1.16% in mAP50 and mAP50:95, respectively. This demonstrates that FADC-ResNet is more effective in extracting multi-scale object features. Introducing the HLIFI module with the HiLo attention mechanism to the baseline Model 1 resulted in Model 3, which also showed significant performance improvements, with mAP50 and mAP50:95 increasing by 1.12% and 0.76%, respectively, compared to Model 1. This indicates that the HLIFI module enhances the model’s focus on high-frequency detailed information by improving intra-scale feature interactions. In Experiment 4, we found that using the BAFPN with adaptive fusion strategies as the cross-scale feature fusion network significantly enhanced the mAP50 and mAP50:95 by 3.22% and 2.19%, respectively, and markedly increased precision and recall by 3% and 3.11%. This validates that the BAFPN framework effectively boosts feature fusion and representational skills in UAV small-object detection. After implementing the Inner-Shape-IoU loss function, Model 5 showed improvements of 2.13% in mAP50 and 1.58% in mAP50:95, along with notable advances in precision and recall. This demonstrates that the Inner-Shape-IoU loss function ensures a more consistent boundary box regression, thus enhancing detection accuracy.
Ultimately, the AMFEF-DETR model, which integrates all the proposed improvement strategies, achieved substantial enhancements of 4.02% in mAP50, 2.59% in mAP50:95, 4.86% in precision, and 3.35% in recall. Additionally, the F1 score increased by 4%, indicating a more balanced and stable model. Overall, these experiments showed that these proposed improvements significantly enhance the AMFEF-DETR model’s performance.

3.7. Comparative Experiments between the AMFEF-DETR Model and Other Advanced Models

Table 7 presents the detection performance of the AMFEF-DETR model across various object categories in the VisDrone dataset. Overall, the model exhibited high precision and recall rates on the test set, achieving 59.62% and 41.66%, respectively. The mAP50 and mAP50:95 metrics also reached impressive values of 41.36% and 24.28%, confirming their effectiveness in the high-precision detection of multi-scale ground targets in UAV aerial images. In particular, for the detection of the “Pedestrian” and “Car” categories, the model demonstrated exceptionally high precision rates of 61.86% and 79.32%, respectively, and achieved outstanding mAP50 scores of 41.15% and 78.98%, respectively. Moreover, despite the dense distribution and mutual occlusion of “Bicycle” targets in the dataset, the model still achieved an mAP50 exceeding 16%, showcasing its advantage in handling densely packed targets. Additionally, the model’s performance in detecting the “Tricycle” and “Awning Tricycle” categories, with mAP50 scores of 26.67% and 25.46%, respectively, also highlights its potential in detecting small vehicles, despite the less distinct features and varying shapes of these targets. Moreover, the performance of the AMFEF-DETR model on larger targets, such as the “Car” and “Bus” categories, was particularly outstanding, with mAP50 scores of 78.98% and 60.59%, respectively. These results indicate that the model not only excels in small-target detection but demonstrates robust performance across multi-scale target types, making it highly promising for practical applications.
To further enhance the applicability of the AMFEF-DETR model in real-world scenarios, future research could explore the design of lightweight model architectures. By using efficient model pruning strategies to refine the model structure, the goals of minimizing redundant parameters and lowering the model’s computational complexity may be achieved. Additionally, incorporating advanced data augmentation techniques, such as simulating various atmospheric conditions or introducing synthetic targets, could help the model to better cope with the challenges posed by complex aerial scenes. Moreover, investigating the fusion of complementary information from other sensors, such as thermal or multispectral cameras, could improve the ability of the model to detect and classify targets under different lighting and visibility conditions.
To thoroughly assess the effectiveness of the AMFEF-DETR model in accurately localizing and classifying objects of varying scales and types in UAV aerial images, we charted precision–confidence, recall–confidence, precision–recall, and F1–confidence curves for each target category, as illustrated in Figure 10.

3.8. Comparative Analysis of Different Detection Models

Figure 11 compares the confusion matrices of the RT-DETR and AMFEF-DETR models under the same parameter settings. The visualized confusion matrices clearly demonstrate that the AMFEF-DETR model achieves higher correct classification rates across all target categories compared to the RT-DETR model, while significantly reducing false positive and false negative predictions. The improvement is particularly notable for dense small-target categories that inherently have lower correct prediction rates, such as category 2 representing bicycles and category 7 representing awning tricycles, with true positive rates increasing by 8% for both. The comparative analysis indicates that the AMFEF-DETR model significantly enhances the detection and localization of small targets in UAV imagery, validating its effectiveness and robustness in practical applications. This highlights its substantial potential in fields such as smart city monitoring and emergency rescue, where the accurate detection of small targets is crucial.
The effectiveness of the AMFEF-DETR model was evaluated against several other object recognition models, including advanced end-to-end models such as the YOLOv10 [51] and the RT-DETR series [25]. Additionally, the comparison included other advanced models like the QueryDet [52], the TOOD [53], the RTMDet [54], and the Efficient-DETR [55], as well as relatively lightweight models like the YOLOv8 and the high-performance YOLOv9 [56]. The comparative results are presented in Table 8.
The data from the table reveal that the AMFEF-DETR model demonstrated an exceptional performance across multiple key metrics, with particularly significant improvements on the test set. Firstly, the AMFEF-DETR model achieved a precision of 59.62% on the test set, surpassing the second-best-performing model, the RT-DETR-R50, by 2.67 percentage points and YOLOv9 by 9.06 percentage points. The AMFEF-DETR also led the models in terms of recall on the test set, with 41.66%, showing a substantial increase of 3.35% when compared with the standard RT-DETR-R18 model, indicating that it produces fewer false negatives in practical applications. Furthermore, the mAP50 of the AMFEF-DETR on the test set was 4.02% higher than that of the RT-DETR-R18, reaching 41.36%. On the more stringent mAP50:95 metric, the AMFEF-DETR achieved 24.28%, significantly outperforming the RT-DETR-R50 (23.55%) and YOLOv10-L (21.41%). This improvement showcases its strong capability in handling multi-scale objects on complex backgrounds.
Although the parameter count of the AMFEF-DETR (35.81 M) is higher than that of certain lightweight models, such as the YOLOv8-M (25.85 M) and RT-DETR-R18 (19.97 M), the significant improvement in accuracy makes this trade-off justifiable. Furthermore, compared to models with larger parameter counts, like the YOLOv6-L (110.87 M) [57] and YOLOv9 (60.51 M), the AMFEF-DETR strikes a more favorable balance between performance and resource consumption. In terms of computational efficiency, the AMFEF-DETR requires 142 GFLOPs, which is slightly higher than the RT-DETR-R50 (135 G), yet 28.64% and 10.69% lower than the TOOD and Efficient-DETR, respectively. Additionally, the AMFEF-DETR achieves an impressive FPS of 84.5, significantly outperforming the QueryDet and RTMDet by 62.9 and 46.8 FPS, respectively, and also surpassing other mainstream models. These results further highlight its suitability for real-time applications.
At the same time, on the validation set, the precision of the AMFEF-DETR model was 2.34% and 7.37% higher than that of the similar end-to-end detection models RT-DETR-R18 and YOLOv10-L, respectively, and the mAP50 was 4.85% and 7.32% higher than those of the two models, respectively. In summary, the outstanding performance of the AMFEF-DETR indicates its high potential and application value as an embedded detection model for UAVs, particularly in multi-scale object detection and small-object detection.

3.9. Visual Analysis

The rapid development of UAV photography technology has provided new perspectives and opportunities for addressing the challenges of object detection in complex urban environments. However, in practical applications, images captured by UAVs often raise various challenges. To comprehensively evaluate the robustness and adaptability of object detection models in real-world scenarios, we conducted systematic tests and analyses in different conditions. Figure 12 presents the results for ground object recognition by the AMFEF-DETR model in various challenging environments. By visually comparing the detection performance levels for different geographical environments, UAV flight altitudes, and lighting conditions, we gained deep insights into the strengths and limitations of the model, providing crucial references for its subsequent deployment.
In a comprehensive visual analysis of the detection results for various environmental conditions, UAV flight altitudes, and lighting variations, we found that the AMFEF-DETR model exhibited an exceptional performance and robustness in object detection tasks involving complex urban scenes captured in UAV aerial images. The model demonstrated excellent adaptability and stability in detecting vehicles, pedestrians, and other targets in both daytime and nighttime scenarios, as well as in high-traffic road environments. Moreover, the model showcased outstanding multi-scale object detection capabilities across different UAV altitudes and perspectives, with particularly impressive accuracy in identifying and localizing densely packed small objects. Additionally, the AMFEF-DETR model’s consistent performance under varying lighting conditions, both day and night, highlighted its robustness to illumination changes, which is crucial for the all-weather application of UAVs. However, as observed in the first image of Figure 12c, the model mistakenly detected three reflected vehicles on the mirrored facade of a building as real targets, underscoring its limitations when dealing with highly reflective surfaces. This misidentification likely arises due to the similarity in visual features between real objects and their reflections, especially in aerial imagery where depth cues are limited. Despite this, the model achieved high detection accuracy for dense small objects in complex environments. Future improvements could focus on addressing misdetections caused by mirror reflections and expanding its scope to more practical application scenarios, to further enhance its detection robustness and practical value.
To further evaluate the performance of the AMFEF-DETR model, we conducted a comparative analysis with two advanced object detection models, the YOLOv10 and RT-DETR models. Figure 13 illustrates the outcomes of our comparative analysis of these three models’ performance levels on UAV aerial images depicting different environmental conditions.
When examining the detection results across various urban scenes, including highways, bridges, intersections, and residential areas, we observed that the AMFEF-DETR consistently outperformed both the YOLOv10 and RT-DETR models. In the first column of images shown in Figure 13, it can be seen that the AMFEF-DETR successfully identified all vehicles on the highway and achieved a higher accuracy in detecting pedestrians around the road, while the YOLOv10 missed some cars and pedestrians, and the RT-DETR model incorrectly classified some roadside poles as pedestrians. The second column highlights the ability the AMFEF-DETR showed to precisely detect targets on the bridge, whereas both the YOLOv10 and RT-DETR failed to locate the two pedestrians on the overpass. The third column demonstrates the exceptional ability of the AMFEF-DETR to accurately detect pedestrians on the roadside under low-visibility conditions at night. By contrast, the YOLOv10 and RT-DETR had issues with missing detections of some pedestrians and misclassifications of some vehicles. Overall, the AMFEF-DETR exhibited a higher detection accuracy and robustness across different complex environments, significantly outperforming both the YOLOv10 and RT-DETR models.

4. Extended Experiments

We conducted additional experiments using the HIT-UAV dataset. This dataset consists of 2898 infrared thermal images captured by UAVs from various locations, such as schools, highways, and parking lots, featuring dense small objects from five categories, including humans, vehicles, and bicycles. The dataset expands the application scope of UAVs under low-light conditions. In the same experimental setup, we used 2029 images for training, 290 images for validation, and the remaining 579 images for testing. Table 9 presents the comparative experimental results against state-of-the-art methods.
According to the data in Table 9, the AMFEF-DETR demonstrates superior detection performance across multiple categories in UAV aerial infrared images compared to the YOLOv10 and RT-DETR models. Notably, for the “Person” and “Bicycle” categories, the AMFEF-DETR achieves precision rates of 94.17% and 91.01%, respectively, outperforming both the YOLOv10 and RT-DETR. Additionally, in terms of the mAP50 metric, the AMFEF-DETR attains 81.45%, exceeding the YOLOv10 and RT-DETR by 3.75% and 2.68%, respectively. Moreover, the AMFEF-DETR scores 53.19% on the mAP50:95 metric, significantly surpassing the other compared models. These results indicate that our method not only excels in conventional visible light scenarios but performs robustly in more challenging conditions, such as infrared imagery. This highlights its potential for broad application in complex environments. The comparison of the visual detection results between the AMFEF-DETR model and the two advanced object detection models, the YOLOv10 and RT-DETR, in different scenarios is shown in Figure 14.
As shown in the first column of images in Figure 14, the AMFEF-DETR successfully detects both the pedestrians on the field and the cars off the field, while the YOLOv10 fails to detect the pedestrians on the field. In the second column, neither the YOLOv10 nor the RT-DETR manages to localize the pedestrians in the parking lot. In the third column, the YOLOv10 mistakenly identifies a small window as a bicycle, whereas the RT-DETR incorrectly detects a large window as a car. In the fourth column comparison, the YOLOv10 misses the pedestrian, while the AMFEF-DETR accurately detects all four different object categories in the infrared scene. Overall, the AMFEF-DETR demonstrates superior detection accuracy and robustness in infrared scenarios, clearly outperforming both the YOLOv10 and RT-DETR models.

5. Discussion

Our experimental results indicate that the AMFEF-DETR model significantly enhances the precision and robustness in detecting multi-scale ground targets. Ablation experiments confirmed the efficacy of each component, such as FADC-Block, the HLIFI module, the BAFPN architecture, and the Inner-Shape-IoU loss function.
As can be seen from the results of the feature extraction network comparison experiment in Table 3, current popular feature extraction networks are not well-suited for object detection in UAV aerial imagery. Our proposed FADC-ResNet dynamically adjusts dilation rates and incorporates frequency awareness, expanding the receptive field while capturing detailed information. However, there is still room for improvement. As shown in Figure 9, the HLIFI module enhances feature representation by processing high-frequency and low-frequency information through dual channels, focusing on local details while reducing the interference of background noise. Additionally, the ablation experiment in Table 6 demonstrates that our proposed BAFPN feature fusion architecture integrates multi-scale information through adaptive fusion modules and high-resolution shallow feature maps, significantly improving model detection performance. Lastly, the Inner-Shape-IoU loss function accelerates convergence and improves detection of distant small ground targets by using auxiliary bounding boxes.
Our comprehensive analysis indicates that the AMFEF-DETR model performs exceptionally well on both visible light and infrared images captured by UAVs. The high precision and recall rates of the model demonstrate its outstanding capacity to capture detailed features of targets. In comparison with other detection models, such as the RT-DETR and YOLOv10, the AMFEF-DETR exhibits significant advantages in detection accuracy and processing speed. Our visualizations demonstrate the robustness of the model in handling diverse geographical environments, UAV viewpoints, and lighting conditions. The AMFEF-DETR shows great potential for applications of UAV detection, including urban traffic monitoring, nighttime emergency rescue, and military reconnaissance. However, the AMFEF-DETR has certain limitations. For example, in dense urban scenes with mirrored surfaces, the model often misidentifies reflected objects as real ones due to the similarity in shape between reflections and real objects, as the model lacks temporal and causal reasoning capabilities. Furthermore, in real-world applications, UAVs, as widely used edge devices, are tasked with an increasing number of responsibilities, and there is a growing demand for lightweight models on these platforms. While our method significantly improves accuracy, it remains a serious challenge to improve detection accuracy while reducing computational load and model size. Future work should focus on comprehensively evaluating the deployment feasibility and long-term performance of the model in complex environments.

6. Conclusions

In this paper, we have proposed AMFEF-DETR, an adaptive feature extraction and fusion small-object detection model designed for urban UAV imagery, providing more accurate decision support for applications such as traffic monitoring and disaster rescue. First, the backbone network FADC-ResNet was constructed from the proposed frequency-adaptive dilated module, FADC-Block, for adaptive feature extraction; it captures more high-frequency detail on small targets while expanding the receptive field. Then, we developed a novel intra-scale feature interaction module, HLIFI, which uses the HiLo attention mechanism to process high-level features; by encoding high-frequency and low-frequency components through dual pathways, the network focuses more effectively on the details of dense small objects while reducing background noise interference. Additionally, a novel bidirectional adaptive feature pyramid network (BAFPN) was proposed for cross-scale feature fusion; its adaptive fusion mechanism dynamically adjusts the weight of each feature map, effectively integrating multi-scale information and enhancing the adaptability of the model to different scenes. Lastly, we proposed a novel Inner-Shape-IoU loss function, which emphasizes the inherent shapes of bounding boxes and uses auxiliary boxes to accelerate convergence while enhancing the recognition of extremely small ground targets at long distances.
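As a concrete illustration of how the auxiliary-box and shape-aware ideas fit together, the sketch below combines the ratio-scaled auxiliary boxes of Inner-IoU [40] with the shape-weighted distance and shape terms of Shape-IoU [39] in PyTorch. It is a simplified reference formulation, with ratio = 0.75 following the best-performing setting in Table 5; the exact term weighting in the final implementation may differ from this form.

```python
import torch


def inner_shape_iou_loss(pred, target, ratio=0.75, scale=0.5, eps=1e-7):
    """Simplified Inner-Shape-IoU-style loss for (x1, y1, x2, y2) boxes."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    tcx, tcy = (tx1 + tx2) / 2, (ty1 + ty2) / 2

    # Inner-IoU: auxiliary boxes share the original centers, scaled by `ratio`.
    inter_w = (torch.min(pcx + pw * ratio / 2, tcx + tw * ratio / 2)
               - torch.max(pcx - pw * ratio / 2, tcx - tw * ratio / 2)).clamp(min=0)
    inter_h = (torch.min(pcy + ph * ratio / 2, tcy + th * ratio / 2)
               - torch.max(pcy - ph * ratio / 2, tcy - th * ratio / 2)).clamp(min=0)
    inter = inter_w * inter_h
    union = pw * ph * ratio**2 + tw * th * ratio**2 - inter + eps
    inner_iou = inter / union

    # Shape-IoU: weight the center distance and size gaps by the GT box shape.
    ww = 2 * tw.pow(scale) / (tw.pow(scale) + th.pow(scale) + eps)
    hh = 2 * th.pow(scale) / (tw.pow(scale) + th.pow(scale) + eps)
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)   # enclosing box width
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)   # enclosing box height
    c2 = cw**2 + ch**2 + eps
    dist_shape = hh * (pcx - tcx) ** 2 / c2 + ww * (pcy - tcy) ** 2 / c2

    omega_w = hh * (pw - tw).abs() / torch.max(pw, tw).clamp(min=eps)
    omega_h = ww * (ph - th).abs() / torch.max(ph, th).clamp(min=eps)
    shape_term = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - inner_iou + dist_shape + 0.5 * shape_term


if __name__ == "__main__":
    pred = torch.tensor([[10.0, 10.0, 50.0, 40.0]])
    gt = torch.tensor([[12.0, 8.0, 48.0, 42.0]])
    print(inner_shape_iou_loss(pred, gt))
```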
Our comprehensive experiments validated the outstanding performance of the AMFEF-DETR. Comparative results on the VisDrone test set revealed that, compared to the end-to-end YOLOv10 and RT-DETR models, the AMFEF-DETR achieved 7.83% and 4.86% higher precision, respectively, along with mAP50 values 4.61% and 4.02% higher, respectively. Furthermore, visualization comparison experiments demonstrated the robustness of the AMFEF-DETR, showing its superior ability to accurately identify multi-scale targets across diverse challenging environments, spanning various geographical areas, UAV flight altitudes, and lighting conditions, compared to the advanced YOLOv10 and RT-DETR models. To further substantiate the generalization capability of the proposed network, future testing will be conducted on more diverse datasets, including UAV aerial imagery captured in rainy and foggy conditions. Additionally, we will explore methods for addressing the issue of misdetection of reflective objects in aerial images, such as utilizing multi-UAV information sharing for collaborative detection, to improve the robustness of the model against such interferences. Moreover, practical deployment considerations will be a key focus in future work. This includes significantly enhancing detection precision while simultaneously reducing computational load and model size. Ensuring long-term reliability through remote maintenance and updates, as well as reducing energy consumption to extend flight times, will also be prioritized. Comprehensive testing under diverse environmental conditions will be necessary to assess the model’s robustness in real-world continuous operations, facilitating the transition from theoretical development to practical application.

Author Contributions

Conceptualization, S.W. and J.Y.; methodology, S.W.; software, S.W.; validation, S.W., J.Y. and X.M.; formal analysis, S.W.; investigation, X.M.; resources, H.J.; data curation, J.C.; writing—original draft preparation, S.W., J.Y. and X.M.; writing—review and editing, S.W. and H.J.; visualization, J.C.; supervision, H.J.; project administration, S.W. and J.Y.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant no. 61773416, and supported by the Graduate Research and Practice Projects of Minzu University of China, grant no. SJCX2024021.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Our sincere thanks go to the Graduate Research and Practice Projects of Minzu University of China for their generous support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Colomina, I.; Molina, P. Unmanned aerial systems for photogrammetry and remote sensing: A review. ISPRS J. Photogramm. Remote Sens. 2014, 92, 79–97. [Google Scholar] [CrossRef]
  2. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36. [Google Scholar] [CrossRef]
  3. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
  4. Ke, R.; Li, Z.; Tang, J.; Pan, Z.; Wang, Y. Real-time traffic flow parameter estimation from UAV video based on ensemble classifier and optical flow. IEEE Trans. Intell. Transp. Syst. 2018, 20, 54–64. [Google Scholar] [CrossRef]
  5. Feng, Q.; Liu, J.; Gong, J. UAV remote sensing for urban vegetation mapping using random forest and texture analysis. Remote Sens. 2015, 7, 1074–1094. [Google Scholar] [CrossRef]
  6. Erdelj, M.; Natalizio, E.; Chowdhury, K.R.; Akyildiz, I.F. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Comput. 2017, 16, 24–32. [Google Scholar] [CrossRef]
  7. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  8. Liu, Y.; Piramanayagam, S.; Monteiro, S.T.; Saber, E. Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 76–85. [Google Scholar]
  9. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  10. Bai, Z.; Pei, X.; Qiao, Z.; Wu, G.; Bai, Y. Improved YOLOv7 Target Detection Algorithm Based on UAV Aerial Photography. Drones 2024, 8, 104. [Google Scholar] [CrossRef]
  11. Mandal, M.; Shah, M.; Meena, P.; Devi, S.; Vipparthi, S.K. AVDNet: A small-sized vehicle detection network for aerial visual data. IEEE Geosci. Remote Sens. Lett. 2019, 17, 494–498. [Google Scholar] [CrossRef]
  12. Mohsan, S.A.H.; Othman, N.Q.H.; Li, Y.; Alsharif, M.H.; Khan, M.A. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends. Intell. Serv. Robot. 2023, 16, 109–137. [Google Scholar] [CrossRef]
  13. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  14. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
  15. Baykara, H.C.; Bıyık, E.; Gül, G.; Onural, D.; Öztürk, A.S.; Yıldız, I. Real-time detection, tracking and classification of multiple moving objects in UAV videos. In Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November 2017; pp. 945–950. [Google Scholar]
  16. Bazi, Y.; Melgani, F. Convolutional SVM networks for object detection in UAV imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3107–3118. [Google Scholar] [CrossRef]
  17. Abughalieh, K.M.; Sababha, B.H.; Rawashdeh, N.A. A video-based object detection and tracking system for weight sensitive UAVs. Multimed. Tools Appl. 2019, 78, 9149–9167. [Google Scholar] [CrossRef]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  19. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  20. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  22. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  23. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  24. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  25. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  26. Cheng, Q.; Wang, Y.; He, W.; Bai, Y. Lightweight air-to-air unmanned aerial vehicle target detection model. Sci. Rep. 2024, 14, 2609. [Google Scholar] [CrossRef]
  27. Zhang, W.; Hong, Z.; Xiong, L.; Zeng, Z.; Cai, Z.; Tan, K. Sinextnet: A New Small Object Detection Model for Aerial Images Based on PP-Yoloe. J. Artif. Intell. Soft Comput. Res. 2024, 14, 251–265. [Google Scholar] [CrossRef]
  28. Wang, S.; Jiang, H.; Li, Z.; Yang, J.; Ma, X.; Chen, J.; Tang, X. PHSI-RTDETR: A Lightweight Infrared Small Target Detection Algorithm Based on UAV Aerial Photography. Drones 2024, 8, 240. [Google Scholar] [CrossRef]
  29. Jin, R.; Jia, Z.; Yin, X.; Niu, Y.; Qi, Y. Domain Feature Decomposition for Efficient Object Detection in Aerial Images. Remote Sens. 2024, 16, 1626. [Google Scholar] [CrossRef]
  30. Wu, M.; Yun, L.; Wang, Y.; Chen, Z.; Cheng, F. Detection algorithm for dense small objects in high altitude image. Digit. Signal Process. 2024, 146, 104390. [Google Scholar] [CrossRef]
  31. Tan, S.; Duan, Z.; Pu, L. Multi-scale object detection in UAV images based on adaptive feature fusion. PLoS ONE 2024, 19, e0300120. [Google Scholar] [CrossRef] [PubMed]
  32. Battish, N.; Kaur, D.; Chugh, M.; Poddar, S. SDMNet: Spatially dilated multi-scale network for object detection for drone aerial imagery. Image Vis. Comput. 2024, 150, 105232. [Google Scholar] [CrossRef]
  33. Wang, X.; He, N.; Hong, C.; Sun, F.; Han, W.; Wang, Q. YOLO-ERF: Lightweight object detector for UAV aerial images. Multimed. Syst. 2023, 29, 3329–3339. [Google Scholar] [CrossRef]
  34. Chen, L.; Gu, L.; Zheng, D.; Fu, Y. Frequency-Adaptive Dilated Convolution for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3414–3425. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Pan, Z.; Cai, J.; Zhuang, B. Fast vision transformers with hilo attention. Adv. Neural Inf. Process. Syst. 2022, 35, 14541–14554. [Google Scholar]
  37. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  38. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  39. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663. [Google Scholar]
  40. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  41. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  42. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6070–6079. [Google Scholar]
  43. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters. arXiv 2023, arXiv:2311.11587. [Google Scholar]
  44. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9528–9535. [Google Scholar] [CrossRef]
  45. Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  46. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1389–1400. [Google Scholar]
  47. Jiang, P.-T.; Zhang, C.-B.; Hou, Q.; Cheng, M.-M.; Wei, Y. Layercam: Exploring hierarchical class activation maps for localization. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef]
  48. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  49. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  50. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  51. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  52. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
  53. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  54. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. Rtmdet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  55. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  56. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  57. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
Figure 1. Overview of the AMFEF-DETR model. We feed the feature from the last stage S5 of the FADC-ResNet backbone into the HLIFI module to perform intra-scale interaction and obtain F5. The features S2, S3, and S4 from the backbone network, along with F5, are then fed into the BAFPN for bidirectional adaptive cross-scale feature fusion. The decoder iteratively optimizes the selected query initial features via auxiliary prediction heads to generate categories and boxes.
Figure 2. Detailed architecture of the AMFEF-DETR model.
Figure 3. Structural diagram of the frequency-adaptive dilated convolution (FADC). AdaDR, AdaKern, and FreqSelect denote the adaptive dilation rate, adaptive kernel, and frequency selection, respectively.
Figure 4. The structure of the frequency-adaptive dilated feature extraction backbone network FADC-ResNet.
Figure 5. (a) Diagram of the dual-path structure of the HiLo attention mechanism. Nh represents the total number of self-attention heads. α refers to the split ratio for high-/low-frequency heads. (b) The structural diagram of the HLIFI module. The high-level S5 features output from the feature extraction network are transformed into a vector and processed through HiLo attention and a feedforward network to obtain the two-dimensional F5 vector.
Figure 6. Comparison of feature pyramid network structures. (a) Traditional FPN structure. (b) BiFPN architecture with bidirectional cross-scale connections. (c) BAFPN structure, which introduces an adaptive fusion module for enhanced multi-scale feature aggregation.
Figure 7. Feature fusion methods: (a) weighted fusion; (b) adaptive fusion; (c) concatenation.
Figure 8. A selection of the VisDrone dataset images.
Figure 9. Comparison of heatmaps with and without the HLIFI module.
Figure 10. The precision–confidence, recall–confidence, precision–recall, and F1–confidence curves of the AMFEF-DETR model.
Figure 11. Confusion matrix comparison plot of RT-DETR and AMFEF-DETR. (a) RT-DETR confusion matrix. (b) AMFEF-DETR confusion matrix.
Figure 12. Example diagrams of the detection effect of the AMFEF-DETR model in different complex environments.
Figure 13. Visual comparison of detection outcomes across various models.
Figure 14. Visual comparison of detection outcomes across various models on the HIT-UAV dataset.
Table 1. VisDrone dataset labeling information.

| Type | Number | Pedestrian | People | Bicycle | Car | Van | Truck | Tricycle | Awning Tricycle | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | 6471 | 109,185 | 38,560 | 13,069 | 187,004 | 32,702 | 16,284 | 6387 | 4377 | 9117 | 40,377 |
| Validation | 548 | 8844 | 5125 | 1287 | 14,064 | 1975 | 750 | 1045 | 532 | 251 | 4886 |
| Test | 1610 | 21,006 | 6376 | 1302 | 28,074 | 5771 | 2659 | 530 | 599 | 2940 | 5845 |
| Total | 8629 | 139,035 | 50,061 | 15,658 | 229,142 | 40,448 | 19,693 | 7962 | 5508 | 12,308 | 51,108 |
Table 2. Hardware configuration and model parameters.

| Type | Version | Type | Value |
|---|---|---|---|
| GPU | RTX 4090 | Optimizer | AdamW |
| Python | 3.8.0 | Batch size | 4 |
| PyTorch | 2.0.0 | Learning rate | 1 × 10−4 |
| CUDA | 11.8 | Momentum | 0.9 |
Table 3. Comparative results from various backbone networks.

| Model | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) | Param (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|
| BasicBlock | 54.76 | 38.31 | 37.34 | 21.69 | 19.97 | 57.3 |
| AKConv-Block | 55.38 | 38.72 | 37.78 | 22.19 | 15.63 | 51.8 |
| DualConv-Block | 57.81 | 39.67 | 38.88 | 22.73 | 16.20 | 52.5 |
| DySnakeConv-Block | 55.10 | 39.30 | 37.89 | 22.06 | 29.98 | 65.1 |
| PConv-Block | 56.99 | 39.33 | 38.50 | 22.41 | 14.45 | 50.2 |
| iRMB-Block | 55.92 | 39.97 | 38.64 | 22.65 | 16.72 | 53.4 |
| FADC-Block | 57.50 | 40.08 | 39.01 | 22.85 | 20.09 | 49.8 |
Table 4. Comparative experiments of cross-scale feature fusion network enhancement strategies.

| Model | P_val (%) | R_val (%) | mAP50_val (%) | mAP50:95_val (%) | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) |
|---|---|---|---|---|---|---|---|---|
| 1. PAFPN (base) | 60.83 | 47.22 | 48.43 | 29.96 | 57.04 | 39.95 | 38.82 | 22.65 |
| 2. BiFPN | 63.35 | 48.69 | 50.50 | 31.28 | 57.29 | 41.72 | 40.11 | 23.46 |
| 3. BiFPN + Weighted Fusion | 63.39 | 48.75 | 50.82 | 31.30 | 57.07 | 41.05 | 40.23 | 23.55 |
| 4. BiFPN + Concatenation Fusion | 63.71 | 48.83 | 50.46 | 31.39 | 57.41 | 40.96 | 40.13 | 23.22 |
| 5. BiFPN + Adaptive Fusion (BAFPN) | 63.48 | 49.08 | 50.99 | 31.74 | 57.88 | 41.36 | 40.29 | 23.74 |
Table 5. Comparative analysis of models using enhanced loss functions.

| Loss Function | P_val (%) | R_val (%) | mAP50_val (%) | mAP50:95_val (%) | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) |
|---|---|---|---|---|---|---|---|---|
| GIoU | 63.48 | 49.08 | 50.99 | 31.74 | 57.88 | 41.36 | 40.29 | 23.74 |
| DIoU | 63.01 | 49.96 | 51.37 | 32.17 | 57.68 | 41.64 | 40.44 | 23.94 |
| CIoU | 63.68 | 49.77 | 51.07 | 31.76 | 56.77 | 41.77 | 40.59 | 23.93 |
| SIoU | 63.28 | 49.84 | 51.56 | 32.12 | 57.69 | 42.03 | 40.85 | 24.09 |
| Shape-IoU (scale = 0.0) | 63.80 | 48.98 | 50.26 | 30.62 | 57.25 | 41.04 | 39.88 | 23.02 |
| Shape-IoU (scale = 0.5) | 63.47 | 50.09 | 51.79 | 32.19 | 58.01 | 41.60 | 40.82 | 23.90 |
| Shape-IoU (scale = 1.0) | 63.57 | 49.33 | 51.18 | 31.45 | 56.71 | 42.09 | 40.31 | 23.37 |
| Shape-IoU (scale = 1.5) | 63.32 | 49.90 | 51.79 | 32.18 | 56.48 | 41.91 | 40.67 | 23.96 |
| Inner-Shape-IoU (ratio = 0.70) | 64.05 | 49.42 | 51.42 | 32.23 | 57.24 | 41.82 | 40.86 | 24.09 |
| Inner-Shape-IoU (ratio = 0.75) | 63.68 | 50.01 | 51.87 | 32.27 | 59.62 | 41.66 | 41.36 | 24.28 |
| Inner-Shape-IoU (ratio = 0.80) | 63.70 | 49.77 | 51.22 | 31.75 | 56.98 | 41.37 | 39.96 | 23.37 |
| Inner-Shape-IoU (ratio = 1.10) | 64.32 | 49.43 | 51.56 | 32.26 | 57.94 | 42.14 | 40.93 | 24.06 |
| Inner-Shape-IoU (ratio = 1.13) | 64.17 | 49.04 | 51.50 | 32.10 | 58.20 | 41.15 | 40.91 | 24.06 |
| Inner-Shape-IoU (ratio = 1.15) | 64.50 | 49.04 | 51.10 | 31.83 | 58.85 | 42.09 | 41.01 | 24.25 |
Table 6. Results of ablation experiments.

| Methods | FADC-ResNet | HLIFI | BAFPN | Inner-Shape-IoU | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) | F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1. Base |  |  |  |  | 54.76 | 38.31 | 37.34 | 21.69 | 44 |
| 2 | √ |  |  |  | 57.50 | 40.08 | 39.01 | 22.85 | 46 |
| 3 |  | √ |  |  | 56.75 | 39.32 | 38.46 | 22.45 | 46 |
| 4 |  |  | √ |  | 57.76 | 41.42 | 40.56 | 23.88 | 47 |
| 5 |  |  |  | √ | 57.35 | 40.22 | 39.47 | 23.27 | 46 |
| 6 | √ | √ |  |  | 57.04 | 39.95 | 38.82 | 22.65 | 46 |
| 7 | √ | √ | √ |  | 57.88 | 41.36 | 40.29 | 23.74 | 48 |
| 8. Ours | √ | √ | √ | √ | 59.62 | 41.66 | 41.36 | 24.28 | 48 |

The “√” symbol indicates that the corresponding improvement is included in the configuration.
Table 7. Detection results of the AMFEF-DETR model for various categories on the VisDrone dataset.

| Class | P_val (%) | R_val (%) | mAP50_val (%) | mAP50:95_val (%) | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) |
|---|---|---|---|---|---|---|---|---|
| All | 63.68 | 50.01 | 51.87 | 32.27 | 59.62 | 41.66 | 41.36 | 24.28 |
| Pedestrian | 68.83 | 53.31 | 58.67 | 29.28 | 61.86 | 38.37 | 41.15 | 17.34 |
| People | 64.70 | 49.91 | 51.10 | 22.63 | 61.56 | 26.44 | 29.86 | 11.35 |
| Bicycle | 45.01 | 27.97 | 25.82 | 12.40 | 46.56 | 18.36 | 16.38 | 7.07 |
| Car | 81.48 | 84.44 | 86.81 | 63.75 | 79.32 | 77.05 | 78.98 | 51.54 |
| Van | 70.11 | 48.41 | 53.80 | 40.86 | 57.59 | 43.55 | 40.52 | 29.21 |
| Truck | 65.61 | 44.27 | 46.75 | 32.20 | 60.39 | 49.49 | 49.23 | 32.18 |
| Tricycle | 54.81 | 42.01 | 41.28 | 24.28 | 40.44 | 35.47 | 26.67 | 14.82 |
| Awning Tricycle | 42.06 | 20.49 | 20.87 | 13.15 | 54.03 | 25.21 | 25.46 | 16.15 |
| Bus | 80.22 | 65.34 | 70.02 | 52.44 | 78.08 | 55.23 | 60.59 | 44.03 |
| Motor | 63.96 | 63.84 | 63.58 | 31.67 | 56.34 | 47.42 | 44.76 | 19.15 |
Table 8. Comparison of performance of different models.

| Model | P_val (%) | R_val (%) | mAP50_val (%) | mAP50:95_val (%) | P_test (%) | R_test (%) | mAP50_test (%) | mAP50:95_test (%) | Param (M) | GFLOPs (G) | FPS (frame/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5-L | 55.25 | 43.45 | 44.48 | 27.43 | 48.53 | 37.31 | 35.32 | 20.71 | 46.51 | 109 | 101 |
| YOLOv6-L | 54.09 | 40.82 | 42.64 | 26.38 | 47.70 | 36.38 | 34.80 | 20.33 | 59.60 | 151 | 47 |
| YOLOv8-M | 54.45 | 41.01 | 42.78 | 26.17 | 46.12 | 36.43 | 34.61 | 20.24 | 25.85 | 79 | 135 |
| YOLOv8-L | 56.33 | 42.58 | 44.75 | 27.71 | 49.62 | 37.02 | 35.87 | 21.19 | 43.61 | 165 | 111 |
| YOLOv9 | 57.01 | 43.88 | 45.89 | 28.61 | 50.56 | 39.92 | 38.57 | 23.15 | 57.30 | 189 | 112 |
| YOLOv10-L | 56.31 | 42.94 | 44.55 | 27.47 | 51.79 | 37.12 | 36.75 | 21.41 | 24.37 | 120 | 63 |
| QueryDet | 60.69 | 45.85 | 48.12 | 29.79 | 52.21 | 36.98 | 38.08 | 23.03 | - | - | 21.6 |
| TOOD | 55.22 | 40.18 | 41.92 | 25.58 | 46.85 | 34.03 | 33.92 | 20.19 | 32.04 | 199 | 34.9 |
| RTMDet | 55.74 | 41.26 | 43.18 | 26.36 | 48.15 | 34.73 | 35.36 | 21.16 | 52.30 | 80 | 37.7 |
| Efficient DETR | 58.75 | 44.02 | 46.08 | 28.49 | 49.54 | 36.18 | 36.76 | 22.07 | 32.01 | 159 | - |
| RT-DETR-R18 | 61.34 | 45.40 | 47.02 | 28.80 | 54.76 | 38.31 | 37.34 | 21.69 | 20.18 | 58 | 72.4 |
| RT-DETR-R34 | 60.76 | 44.35 | 46.21 | 28.29 | 53.63 | 38.52 | 37.41 | 21.96 | 31.44 | 90 | 60.2 |
| RT-DETR-R50 | 63.60 | 48.94 | 50.31 | 31.25 | 56.95 | 41.52 | 40.24 | 23.55 | 42.94 | 135 | 53.5 |
| RT-DETR-L | 63.27 | 46.42 | 48.38 | 29.62 | 56.88 | 39.26 | 38.49 | 22.38 | 32.01 | 108 | 59.2 |
| AMFEF-DETR | 63.68 | 50.01 | 51.87 | 32.27 | 59.62 | 41.66 | 41.36 | 24.28 | 35.81 | 142 | 84.5 |
Table 9. Comparison results for each category on the HIT-UAV dataset.

| Model | Person (%) | Car (%) | Bicycle (%) | OtherVehicle (%) | DontCare (%) | mAP50_test (%) | mAP50:95_test (%) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 92.58 | 98.05 | 90.10 | 73.31 | 23.30 | 75.47 | 48.18 |
| YOLOv6 | 94.17 | 96.48 | 91.37 | 52.69 | 57.16 | 78.38 | 49.73 |
| YOLOv8 | 94.47 | 96.59 | 91.41 | 57.78 | 59.30 | 79.91 | 51.04 |
| YOLOv9 | 92.29 | 98.87 | 92.80 | 77.26 | 43.12 | 80.89 | 52.49 |
| YOLOv10 | 88.01 | 96.80 | 84.49 | 66.10 | 52.48 | 77.70 | 47.39 |
| RT-DETR | 93.67 | 97.37 | 90.08 | 59.59 | 53.16 | 78.77 | 49.59 |
| RT-DETR-R34 | 92.28 | 96.46 | 88.88 | 50.61 | 47.43 | 75.13 | 46.91 |
| RT-DETR-R50 | 93.43 | 96.85 | 90.63 | 55.94 | 64.82 | 80.33 | 51.27 |
| RT-DETR-L | 94.15 | 96.87 | 90.72 | 57.14 | 56.49 | 79.07 | 49.65 |
| AMFEF-DETR | 94.17 | 96.07 | 91.01 | 58.46 | 67.51 | 81.45 | 53.19 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
