CMFX: Cross-modal fusion network for RGB-X crowd counting

Neural Netw. 2024 Dec 18:184:107070. doi: 10.1016/j.neunet.2024.107070. Online ahead of print.

Abstract

To obtain more accurate counts, existing methods primarily combine RGB images with features from a complementary modality (X-modality). However, designing a model that adapts to various sensors remains an open problem, because feature characteristics differ across modalities. This paper therefore proposes a unified fusion framework, CMFX, for RGB-X crowd counting. CMFX contains three core components: a fast feature aggregation module (FFAM), a cross-modal feature interaction module (CFIM), and a cross-modal feature decoding module (CFDM). Specifically, FFAM enhances the fused representation of low-level multimodal features through lightweight mixed attention. CFIM realizes the interaction and fusion of high-level features by rectifying the features of the two modalities and exploiting their latent correlations. In addition, CFDM employs a novel graph convolution block to refine and preserve cross-modal features at both high and low levels. To validate CMFX, this paper unifies, for the first time, two modalities complementary to RGB images: depth and thermal. Extensive experiments on three public datasets (RGBT-CC, DroneRGBT, and ShanghaiTechRGBD) show that CMFX performs excellently with both multimodal combinations. Code: https://github.com/Duanxm9/CMFX.

Keywords: Cross-modal fusion; Feature decoding; Feature rectification; RGB-X crowd counting.
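
To make the three-stage design concrete, below is a minimal PyTorch sketch of the fusion pipeline the abstract describes. The module names (FFAM, CFIM, CFDM) come from the paper, but every internal detail here is an assumption for illustration: the channel sizes, the channel/spatial gating standing in for the "lightweight mixed attention", the cross-gating standing in for feature rectification, and a plain conv stack in place of the paper's graph convolution block. The authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFAM(nn.Module):
    """Fast feature aggregation (sketch): fuse low-level RGB and X
    features with a lightweight mixed (channel + spatial) attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * channels, 1, 7, padding=3),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb, x):
        cat = torch.cat([rgb, x], dim=1)
        # Reweight the projected fusion along channels and spatial positions.
        return self.proj(cat) * self.channel_att(cat) * self.spatial_att(cat)


class CFIM(nn.Module):
    """Cross-modal feature interaction (sketch): rectify each modality
    with a gate derived from the other, then fuse. A stand-in for the
    paper's rectification + correlation exploration."""
    def __init__(self, channels):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_x = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, rgb, x):
        rgb_r = rgb + x * self.gate_x(x)    # rectify RGB with X-modality cues
        x_r = x + rgb * self.gate_rgb(rgb)  # rectify X with RGB cues
        return self.fuse(torch.cat([rgb_r, x_r], dim=1))


class CFDM(nn.Module):
    """Cross-modal feature decoding (sketch): combine high- and low-level
    fused features and regress a 1-channel crowd density map. The paper's
    graph convolution block is replaced by a plain conv stack here."""
    def __init__(self, channels):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
        )

    def forward(self, low, high):
        # Upsample high-level features to the low-level spatial resolution.
        high = F.interpolate(high, size=low.shape[-2:],
                             mode='bilinear', align_corners=False)
        return self.decode(torch.cat([low, high], dim=1))


# Usage with dummy backbone features; X can be depth or thermal alike,
# which is the point of the unified RGB-X framework.
ffam, cfim, cfdm = FFAM(64), CFIM(64), CFDM(64)
rgb_low, x_low = torch.randn(1, 64, 96, 96), torch.randn(1, 64, 96, 96)
rgb_high, x_high = torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24)
density = cfdm(ffam(rgb_low, x_low), cfim(rgb_high, x_high))
count = density.sum()  # predicted crowd count is the density-map integral
```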