Recently, the field of multimodal large language models (MLLMs) has grown rapidly, with many Large Vision-Language Models (LVLMs) relying on sequential visual representations: images are broken down into numerous tokens before being fed into the Large Language Model (LLM) alongside the text prompt. However, the opaque nature of these models poses significant challenges to interpretability, particularly in complex reasoning tasks. To address this issue, we used Grad-CAM to investigate the interaction dynamics between image and text during complex reasoning. This analysis revealed a distinct pattern: information flow tends to converge in the shallow layers and then disperse as it moves through the deeper layers, suggesting that the early stages of processing focus on the interaction between visual and textual elements, while the later stages carry out deeper reasoning. Building on this insight, we developed Simignore, a novel image token reduction technique that enhances the model's complex reasoning by computing the similarity between image and text embeddings and ignoring image tokens that are not semantically relevant to the text. Extensive experiments across different MLLM architectures showed that our approach consistently improves performance on complex reasoning tasks. This work contributes to the interpretability of MLLMs and provides a framework for future research in this area. The paper's source code can be accessed at https://github.com/FanshuoZeng/Simignore.
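The abstract describes the filtering step only at a high level; the released repository defines the exact procedure. As a minimal sketch of the similarity-based token reduction idea, assuming image and text token embeddings are already available as tensors, the snippet below scores each image token by its cosine similarity to the text tokens and keeps only the highest-scoring ones. The `simignore_filter` name and the `keep_ratio` parameter are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def simignore_filter(image_tokens: torch.Tensor,
                     text_tokens: torch.Tensor,
                     keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact algorithm): retain the
    image tokens most similar to the text prompt.

    image_tokens: (N_img, D) image token embeddings
    text_tokens:  (N_txt, D) text token embeddings
    keep_ratio:   hypothetical fraction of image tokens to retain
    """
    # Cosine similarity between every image token and every text token.
    img = F.normalize(image_tokens, dim=-1)
    txt = F.normalize(text_tokens, dim=-1)
    sim = img @ txt.T                              # (N_img, N_txt)

    # Score each image token by its strongest alignment with any text token.
    scores = sim.max(dim=-1).values                # (N_img,)

    # Keep the top-scoring tokens; the rest are ignored downstream.
    k = max(1, int(keep_ratio * image_tokens.size(0)))
    keep_idx = scores.topk(k).indices.sort().values  # preserve token order
    return image_tokens[keep_idx]
```

Here each image token is scored by its best match against any text token; the paper may aggregate similarities or select the retained set differently, so the repository should be treated as the authoritative reference.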
Keywords: Image-text similarity; Information flow; Multimodal large language models.