In studying the joint object detection and classification problem for facial expression recognition (FER) within the YOLOX framework, we introduce a novel feature extractor, called neighborhood coordinate attention Mamba (NCAMamba), to replace the original feature extractor in the Feature Pyramid Network (FPN). NCAMamba combines the background-information-reduction capability of Mamba, the local neighborhood relationship modeling of neighborhood attention, and the directional relationship modeling of coordinate attention. The resulting FER-YOLO-NCAMamba model, when applied to two unaligned FER benchmark datasets, RAF-DB and SFEW, achieves significantly higher mean average precision (mAP) scores than other state-of-the-art methods. Moreover, ablation studies show that the NCA module is relatively more important than the Visual State Space (VSS) module, an adaptation of Mamba to image processing. Visualization studies using the Grad-CAM method reveal that the region around the nose tip is critical to recognizing an expression: when the attended region is too large, the model tends to make erroneous predictions, whereas a small, focused region leads to correct recognition. This may explain why FER on unaligned faces is such a challenging problem.
Keywords: attention; facial expression recognition; object detection; visual state space model.
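
To make the architectural description concrete, the following is a minimal PyTorch sketch of how the three ingredients named in the abstract could be composed into one block. It is an illustrative assumption, not the paper's implementation: the class names (NCAMambaBlock, NeighborhoodAttention, CoordinateAttention), the residual fusion order, the reduction ratio, and the simplified single-head neighborhood attention are all hypothetical, and the VSS/Mamba branch is left as a pluggable placeholder because its 2-D selective scan is beyond the scope of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along H and W separately so the
    channel gates retain directional/positional information."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction=16 is an assumption
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        y = F.relu(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (b, c, 1, w)
        return x * a_h * a_w


class NeighborhoodAttention(nn.Module):
    """Simplified single-head neighborhood attention: each pixel attends
    only to a k x k window around itself (a stand-in, not NATTEN)."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.k = kernel_size
        self.scale = channels ** -0.5
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        pad = self.k // 2
        # Gather each pixel's k*k neighborhood of keys and values.
        k_n = F.unfold(k, self.k, padding=pad).view(b, c, self.k * self.k, h * w)
        v_n = F.unfold(v, self.k, padding=pad).view(b, c, self.k * self.k, h * w)
        q = q.view(b, c, 1, h * w)
        attn = (q * k_n).sum(dim=1, keepdim=True) * self.scale   # (b, 1, k*k, h*w)
        attn = attn.softmax(dim=2)                               # over neighbors
        out = (attn * v_n).sum(dim=2).view(b, c, h, w)
        return self.proj(out)


class NCAMambaBlock(nn.Module):
    """Hypothetical composition of the three components named in the
    abstract; the actual fusion in FER-YOLO-NCAMamba may differ."""
    def __init__(self, channels: int, vss: nn.Module = None):
        super().__init__()
        self.vss = vss  # optional VSS/Mamba branch; None disables it
        self.na = NeighborhoodAttention(channels)
        self.ca = CoordinateAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.vss is not None:
            x = x + self.vss(x)   # global context / background suppression
        x = x + self.na(x)        # local neighborhood relationships
        return self.ca(x)         # direction-aware (H/W) reweighting


x = torch.randn(2, 64, 32, 32)
print(NCAMambaBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

The residual-then-gate ordering here mirrors the abstract's ablation finding only loosely: dropping the vss argument disables the Mamba branch while keeping the NCA path intact, which is one convenient way to reproduce that kind of ablation in a sketch.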