GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network for Multimodal Sentiment Analysis

Yijie Jin
Shanghai University
[email protected]

Abstract

Multimodal Sentiment Analysis (MSA) leverages multiple data modals to analyze human sentiment. Existing approaches typically rely on advanced fusion methods or on representation learning. Our proposed GSIFN addresses two key problems in MSA: (i) in multimodal fusion, existing methods decouple modal combinations and carry tremendous parameter redundancy, which leads to poor fusion performance and efficiency; (ii) unimodal feature extractors and enhancers face a trade-off between representation capability and computation overhead. GSIFN incorporates two main components to solve these problems: (i) a Graph-Structured and Interlaced-Masked Multimodal Transformer, which adopts the Interlaced Mask mechanism to construct a robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computation overhead; (ii) a self-supervised learning framework with low computation overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computation overhead compared with state-of-the-art methods.


1 Introduction

With the increasingly widespread use of social media, users express sentiment through various forms of information, including text, video, and audio. To achieve more natural human-computer interaction, multimodal sentiment analysis (MSA) has become a popular research area (Peng et al., 2023b; Zhao et al., 2023; Wang et al., 2024; Zheng et al., 2024a, b). The MSA task relies on at least two data modals for sentiment polarity prediction. Specifically, its data form is usually a trimodal combination of text, vision, and audio. The main challenge of MSA is to integrate inconsistent sentiment information, thus achieving semantic disambiguation and effective sentiment analysis. MSA methods involve designing effective fusion strategies (Zadeh et al., 2017; Tsai et al., 2019; Zhang et al., 2023) to integrate heterogeneous data for comprehensive sentiment representation and semantic alignment, and developing representation learning strategies (Yu et al., 2021; Yang et al., 2023; Lin and Hu, 2024) to enhance unimodal information and model robustness.

Despite achieving some successes, existing approaches still face three main challenges. First, for models that focus on modal fusion, the computation overhead rises due to the widespread use of modules based on the cross-modal attention mechanism (CMA). Moreover, different unidirectional bimodal combinations are decoupled and fed into multiple independent CMA-based modules for fusion, which prevents such models from fully integrating trimodal representation information. Instead, they retain redundant information from the dominant modal of each bimodal combination. These models are therefore excessively redundant and in need of pruning. However, once the naive serial weight-sharing strategy (Hazarika et al., 2020) or modal sequence concatenation is applied to share trimodal representation information and prune the model, information disorder occurs, which remains to be solved. Second, for representation learning-based models, the feature extraction and representation modules for non-verbal modals cannot effectively balance the number of parameters against representation performance. Small models (GRU (Chung et al., 2014), LSTM (Hochreiter, 1997), etc.) or conventional extractors (OpenFace 2.0 (Baltrusaitis et al., 2018), COVAREP (Degottex et al., 2014), etc.) usually cause excessive loss of non-verbal representation. In contrast, large models (ViT (Dosovitskiy et al., 2021), Wav2Vec (Schneider et al., 2019), etc.) bring better performance but incur excessive overhead. Third, models combining the above two approaches face both of these drawbacks, so the trade-off must be weighed carefully.

To address the aforementioned issues, we propose a model called Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network, dubbed GSIFN. GSIFN has two attractive properties. First, in the process of multimodal fusion, it realizes efficient, low-overhead sharing of representation information without information disorder. To attain this, we propose a Graph-structured and interlaced-masked multimodal Transformer (GsiT), which is structured modal-wise in units of modal subgraphs. GsiT utilizes the Interlaced Mask (IM) mechanism to construct Multimodal Graph Embeddings (MGE): the Interlaced-Inter-Fusion Mask (IFM) constructs the fusion MGE, and the Interlaced-Intra-Enhancement Mask (IEM) constructs the enhancement MGE. Specifically, with shared information, IFM constructs two opposite unidirectional ring MGEs to realize a complete fusion procedure, and IEM constructs an internal enhancement MGE to realize the multimodal fusion enhancement. IM utilizes a weight-sharing strategy to achieve an all-modal-in-one fusion and enhancement mechanism. It also eliminates useless information, thereby improving fusion efficiency and achieving pruning. Second, GSIFN significantly reduces the computation overhead brought by non-verbal modal feature enhancement while ensuring the robustness and performance of the model. We employ a unimodal label generation module (ULGM) to enhance model robustness and apply an extended LSTM with matrix memory (mLSTM) to enhance non-verbal modal features in ULGM. mLSTM is fully parallelized and has a superior memory mechanism over LSTM, which allows it to deeply mine the semantic information of non-verbal modals. Additionally, using mLSTM avoids the huge computation overhead caused by large models, thus balancing the computation overhead and the representation capability of GSIFN. Overall, our contributions are as follows:

• We propose GSIFN, a graph-structured and interlaced-masked multimodal transformer network. Experiments and ablation studies across various datasets validate its effectiveness and superiority.

• We design GsiT, a graph-structured and interlaced-masked multimodal transformer that uses the Interlaced Mask mechanism to build multimodal graph embeddings from modal subgraphs. It ensures efficient, low-overhead information sharing, reduces spatio-temporal redundancy and noise, and yields a more compact and informative multimodal representation while lowering the module’s parameter count.

• We employ mLSTM, an extended LSTM with matrix memory, to enhance non-verbal modal features utilized for unimodal label generation. This approach improves model robustness and representation capability and avoids the overhead of large models.

Figure 1: GSIFN Architecture.

2 Related Work

2.1 Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) is an increasingly popular research field. Its data includes two or more modals, of which text, vision, and audio are the most widely used. Earlier models focus on modal fusion. Zadeh et al. were among the first promoters in this field. They proposed TFN (Zadeh et al., 2017), which builds a power-set form of modal combination and realizes complete modal fusion by using the Cartesian product. TFN did not consider the temporal information of non-verbal modals, so MFN (Zadeh et al., 2018) was designed. MFN uses an LSTM system to extract the temporal information of the three modals through explicit modal alignment (CTC, padding, etc.) and uses an attention mechanism and gated memory to realize efficient fusion of multimodal temporal information.

With the rise of the Transformer, MulT (Tsai et al., 2019) proposes a cross-modal attention mechanism (CMA) from the perspective of modal translation. CMA can effectively integrate multimodal data while realizing implicit modal alignment. Based on MulT and CMA, models such as TETFN (Wang et al., 2023a) and ALMT (Zhang et al., 2023) use text data to enhance non-verbal modal data, because text contains stronger emotional information. Thus, they achieve superior representation performance and modal fusion. MAG-BERT (Rahman et al., 2020) uses a Multimodal Adaptation Gate (MAG) to fine-tune BERT using multimodal data. CENet (Wang et al., 2023b) constructs non-verbal modal vocabularies to enhance non-verbal modal representation, which in turn enhances the MSA capability of fine-tuned BERT.

To improve the robustness of the model and the representation ability of non-verbal modals, and thus the overall multimodal sentiment analysis ability of the model, representation learning-based models such as Self-MM (Yu et al., 2021), ConFEDE (Yang et al., 2023), and MTMD (Lin and Hu, 2024) were proposed. They use self-supervised learning, contrastive learning, or knowledge distillation to achieve robust representation of the consistency and difference of modal information.

TETFN, MMML (Wu et al., 2024), and AcFormer (Zong et al., 2023) combine the multimodal Transformer with representation learning to effectively improve model performance, and verify the feasibility of combining the two so that each benefits from the other's strengths.

Due to the heavy use of the traditional multimodal Transformer architecture, these methods often have a large number of parameters in the core fusion module. Additionally, because different fusion combinations are decoupled into multiple independent Transformers (Vaswani et al., 2017), the interaction of modal information is insufficient and the weights are insufficiently regularized. In our concrete implementation, we draw on the idea of graph attention networks (Velickovic et al., 2018; Brody et al., 2022) and construct a graph-structured multimodal Transformer with modal subgraphs as units.

2.2 Linear Attention Networks

In the field of natural language processing (NLP), reducing the computational cost of Transformers while maintaining performance has become a popular research topic. Representative works include RWKV (Peng et al., 2023a), RetNet (Sun et al., 2023), Mamba (Gu and Dao, 2023), and Mamba-2 (Dao and Gu, 2024). xLSTM (Pöppel et al., 2024), as an extension of LSTM, introduces exponential gating to address the limitations of memory capacity and parallelization, especially when dealing with long sequences.

At the same time, recent works in the field of MSA have begun to use more advanced feature extractors to enhance non-verbal modal features, taking into account the weak representation capability of non-verbal modals. For instance, TETFN and AcFormer (Zong et al., 2023) use Vision Transformer (ViT) (Dosovitskiy et al., 2021) to extract vision features, AcFormer uses Wav2Vec (Schneider et al., 2019) to extract audio features, and MMML (Wu et al., 2024) uses raw audio data to fine-tune Data2Vec (Baevski et al., 2022). However, these methods often result in excessive growth in the number of parameters, with only marginal improvement over traditional features. To reduce model parameters while preserving model performance, a self-supervised learning method is used to strengthen the capture and representation of sentiment information. In GSIFN, the mLSTM module of xLSTM is used to enhance the non-verbal input features for unimodal label generation, which significantly reduces the computation overhead while preserving model performance.

3 Methodology

3.1 Preliminaries

The objective of multimodal sentiment analysis (MSA) is to evaluate sentiment polarity using multimodal data. Existing MSA datasets generally contain three modals: $t$, $v$, and $a$ represent text, vision, and audio, respectively, and $m$ denotes multimodal. The input of the MSA task is $S_{u}\in\mathbb{R}^{T^{s}_{u}\times d^{s}_{u}}$, where $u\in\{t,v,a\}$, $T^{s}_{u}$ denotes the raw sequence length, and $d^{s}_{u}$ denotes the raw representation dimension of modal $u$. In this paper, we define multiple outputs $\hat{y}_{u}\in\mathbb{R}$, where $u\in\{t,v,a,m\}$: $\hat{y}_{\{t,v,a\}}$ denote the unimodal outputs, obtained for unimodal label generation, and $\hat{y}_{m}$ denotes the fusion output, obtained for the final prediction. The other symbols are defined as follows: the fusion module inputs are $\{X_{t},X_{v},X_{a}\}$, the ULGM inputs are $\{\mathcal{X}_{t},\mathcal{X}_{v},\mathcal{X}_{a}\}$, and the predictor input is $\mathcal{X}_{m}$. In particular, in the interpretation of GsiT, $\{X_{t},X_{v},X_{a}\}$ are abstracted as sequences of vertices $\{\mathcal{V}_{t},\mathcal{V}_{v},\mathcal{V}_{a}\}$. The labels are $y_{u}\in\mathbb{R}$, where $u\in\{t,v,a,m\}$: $y_{\{t,v,a\}}$ are the unimodal labels generated by ULGM, and $y_{m}$ is the ground truth label for the fusion output.

3.2 Overall Architecture

Figure 1 shows an overview of our model, which consists of three major parts: (1) Modal Encoding utilizes a tokenizer (for the text modal) and feature extractors and temporal enhancers (for the non-verbal modals, vision and audio) to convert raw multimodal data into numerical feature sequences. Enhanced non-verbal modal features are utilized for unimodal label generation. (2) Graph-Structured Multimodal Fusion takes the processed text, vision, and audio embeddings as input. Its core module, the graph-structured and interlaced-masked multimodal Transformer, utilizes interlaced masks to construct the multimodal graph embedding. It employs weight-sharing to facilitate comprehensive multimodal information interaction and eliminate redundant data, thereby enhancing fusion efficiency and enabling model pruning. (3) Self-Supervised Learning Framework generates final representations and defines positive and negative centers by projecting text features, enhanced vision and audio features, and the fusion output to hidden states. Unimodal labels are separately generated using the text, vision, and audio hidden states.

3.3 Modal Encoding

For the text modal, we use the pretrained Transformer BERT as the text encoder. The input token sequence is constructed from the raw sentence $S_{t}=\{w_{1},w_{2},\dots,w_{n}\}$ by adding two special tokens ([CLS] at the head and [SEP] at the end), which forms $S_{t}^{\prime}=\{\text{[CLS]},w_{1},w_{2},\dots,w_{n},\text{[SEP]}\}$. $S_{t}^{\prime}$ is then fed into BERT to construct $\mathcal{X}_{t}$, which is used to generate text modal labels.

$$\mathcal{X}_{t}=\text{BERT}(S_{t}^{\prime})=\{t_{0},t_{1},\dots,t_{n+1}\} \tag{1}$$

Following previous works (Tsai et al., 2019), the input sequences $X_{\{t,v,a\}}$ are obtained by applying one-dimensional convolution layers to $\mathcal{X}_{t}$ and to the raw vision and audio sequences $S_{\{v,a\}}$.

$$X_{t}=\text{Conv1D}(\mathcal{X}_{t}) \tag{2}$$
$$X_{\{v,a\}}=\text{Conv1D}(S_{\{v,a\}}) \tag{3}$$

After that, we employ mLSTM, an extended Long Short-Term Memory that is fully parallelizable and uses a matrix memory with a covariance update rule, as the temporal enhancer of the vision and audio modals. mLSTM improves the model's representation capability while avoiding the overhead of large models. The detailed definition of mLSTM is in Appendix E.2.

We use mLSTM to enhance the temporal features of vision and audio.

$$\mathcal{X}_{\{v,a\}}=\text{mLSTM}(X_{\{v,a\}}) \tag{4}$$

mLSTM can enhance non-verbal modal features utilized for unimodal label generation.
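For orientation, the recurrent form of mLSTM can be summarized as below. This is a recap of the formulation in the xLSTM paper, not a transcription of Appendix E.2, so the symbols used here (matrix memory $C_{t}$, normalizer state $n_{t}$, query, key, and value vectors $q_{t}$, $k_{t}$, $v_{t}$, and gates $i_{t}$, $f_{t}$, $o_{t}$) are assumptions and may differ from the appendix notation.

$$\begin{aligned}C_{t}&=f_{t}\,C_{t-1}+i_{t}\,v_{t}k_{t}^{\top}\\ n_{t}&=f_{t}\,n_{t-1}+i_{t}\,k_{t}\\ h_{t}&=o_{t}\odot\frac{C_{t}q_{t}}{\max\{|n_{t}^{\top}q_{t}|,\,1\}}\end{aligned}$$

The outer product $v_{t}k_{t}^{\top}$ is the covariance update rule mentioned above. Because $C_{t}$ depends on $C_{t-1}$ only through the scalar gates and not through a hidden-to-hidden matrix, the recurrence can be unrolled and computed in parallel over time steps.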

3.4 Graph-Structured Multimodal Fusion

Following previous works (Tsai et al., 2019; Wang et al., 2023a), we only use the low-level temporal feature sequences $\{X_{t},X_{v},X_{a}\}$ as the input of multimodal fusion. $\{X_{t},X_{v},X_{a}\}$ are regarded as graph vertex sequences $\{\mathcal{V}_{t},\mathcal{V}_{v},\mathcal{V}_{a}\}$, which are concatenated into a single sequence $\mathcal{V}_{m}=[\mathcal{V}_{t};\mathcal{V}_{v};\mathcal{V}_{a}]^{\top}$. $\mathcal{V}_{m}$ is treated as the multimodal graph embedding (MGE). The architecture of the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT) is shown in Figure 2.

Graph Structure Construction To start with, we utilize the self-attention mechanism as the basis for constructing a naive fully connected graph. The attention weight matrix is regarded as the adjacency matrix $\mathcal{A}$ with dynamic weights. In $\mathcal{A}$, the block $\mathcal{E}^{i,j}\in\mathbb{R}^{T_{i}\times T_{j}}$, with $i,j\in\{t,v,a\}$, is the adjacency matrix of the subgraph constructed by $\mathcal{V}_{i}$ and $\mathcal{V}_{j}$.

$$\begin{aligned}\mathcal{A}&=(\mathcal{W}_{q}\mathcal{V}_{m})\cdot(\mathcal{W}_{k}\mathcal{V}_{m})^{\top}\\&=\begin{pmatrix}\mathcal{E}^{t,t}&\mathcal{E}^{t,v}&\mathcal{E}^{t,a}\\ \mathcal{E}^{v,t}&\mathcal{E}^{v,v}&\mathcal{E}^{v,a}\\ \mathcal{E}^{a,t}&\mathcal{E}^{a,v}&\mathcal{E}^{a,a}\end{pmatrix}\end{aligned} \tag{5}$$
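The block structure of $\mathcal{A}$ follows directly from the concatenation order of $\mathcal{V}_{m}$. The following is a minimal sketch in plain Python, assuming hypothetical sequence lengths and random scores in place of real attention logits; `split_blocks` is an illustrative helper, not part of the paper's implementation.

```python
import random

def split_blocks(scores, lengths):
    """Partition a square score matrix into modality-pair blocks E^{i,j}."""
    names = list(lengths)
    offsets, start = {}, 0
    for name in names:
        offsets[name] = (start, start + lengths[name])
        start += lengths[name]
    blocks = {}
    for i in names:
        r0, r1 = offsets[i]
        for j in names:
            c0, c1 = offsets[j]
            # Rows belong to query modality i, columns to key modality j.
            blocks[(i, j)] = [row[c0:c1] for row in scores[r0:r1]]
    return blocks

lengths = {"t": 3, "v": 2, "a": 4}  # hypothetical T_t, T_v, T_a
total = sum(lengths.values())
rng = random.Random(0)
scores = [[rng.random() for _ in range(total)] for _ in range(total)]  # stand-in for A

blocks = split_blocks(scores, lengths)
block_tv = blocks[("t", "v")]
print(len(block_tv), len(block_tv[0]))  # 3 2
```

Each of the nine blocks has shape $T_{i}\times T_{j}$, which is the granularity at which the masks below operate.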

The derivation process of detailed graph structure construction (from vertex to subgraph) is in Appendix D.1.

Interlaced Mask Mechanism Interlaced Mask (IM) is a modal-wise mask mechanism, so every element of the mask matrix corresponds to a subgraph adjacency matrix, and the mask matrix is represented as a block matrix. The construction procedure of IM is described in detail below. The computation procedure with IM is shown in Figure 2.

To start with, to avoid the influence of the intra-modal subgraphs $\mathcal{E}^{i,i}$, $i\in\{t,v,a\}$, we apply the modal-wise mask shown in Equation 6, which hides the intra-modal blocks. We define $\mathcal{O}^{i,j}\in\mathbb{R}^{T_{i}\times T_{j}}$ as an all-zero matrix and $\mathcal{J}^{i,j}\in\mathbb{R}^{T_{i}\times T_{j}}$ as a matrix whose entries are all negative infinity.

$$\mathcal{M}_{inter}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix} \tag{6}$$

$\mathcal{M}_{inter}$ already prevents cross-modal fusion from being affected by intra-modal subgraphs. However, in the fusion procedure, different modal sequences should not be treated as the same sequence. Therefore, we extend $\mathcal{M}_{inter}$ to the following two mask matrices, which are called the Interlaced-Inter-Fusion Mask (IFM). The explanation of the aforementioned information disorder is in Appendix D.2.

$$\begin{cases}\mathcal{M}_{inter}^{forward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{J}^{t,a}\\ \mathcal{J}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\\[6pt] \mathcal{M}_{inter}^{backward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{J}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{J}^{v,a}\\ \mathcal{J}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\end{cases} \tag{7}$$

Based on the two matrices, two opposite uni-directional ring graphs can be constructed to achieve a complete fusion procedure. We define softmax operation as $\mathcal{S}$ , dropout operation as $\mathcal{D}$ , and function composition operator as $\circ$ .

$$\begin{cases}\mathcal{G}_{inter}^{forward}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}+\mathcal{M}_{inter}^{forward})\\ \mathcal{G}_{inter}^{backward}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}+\mathcal{M}_{inter}^{backward})\end{cases} \tag{8}$$

At this point, $\mathcal{G}_{inter}^{forward}$ and $\mathcal{G}_{inter}^{backward}$ make the MGE $\mathcal{V}_{m}$ graph-structured. Both matrices aggregate the information of all three modals without temporal disorder or intra-modal interference.
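The effect of the two IFM masks can be checked on a toy example. The sketch below, in plain Python, expands the block patterns read off Equation 7 into full additive masks and applies a row-wise softmax. The sequence lengths and scores are hypothetical, `block_mask` and `masked_softmax` are illustrative helpers, and dropout is omitted (it is the identity at inference time).

```python
import math

NEG_INF = float("-inf")
MODALS = ("t", "v", "a")

def block_mask(visible, lengths):
    """Expand visible (query, key) modality pairs into a full additive mask.

    Visible blocks are filled with 0 (an O block), all others with -inf (a J block).
    """
    mask = []
    for qi in MODALS:
        for _ in range(lengths[qi]):
            row = []
            for kj in MODALS:
                fill = 0.0 if (qi, kj) in visible else NEG_INF
                row.extend([fill] * lengths[kj])
            mask.append(row)
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; -inf entries get weight exactly 0."""
    out = []
    for s_row, m_row in zip(scores, mask):
        logits = [s + m for s, m in zip(s_row, m_row)]
        peak = max(logits)  # finite, since every row keeps one visible block
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Block patterns of Equation 7 (rows are queries, columns are keys).
FORWARD = {("t", "v"), ("v", "a"), ("a", "t")}
BACKWARD = {("t", "a"), ("v", "t"), ("a", "v")}

lengths = {"t": 2, "v": 3, "a": 2}  # hypothetical sequence lengths
n = sum(lengths.values())
scores = [[0.1 * (r + c) for c in range(n)] for r in range(n)]  # toy stand-in for A

g_forward = masked_softmax(scores, block_mask(FORWARD, lengths))
g_backward = masked_softmax(scores, block_mask(BACKWARD, lengths))
```

In `g_forward`, each text row places all of its weight on vision columns, each vision row on audio columns, and each audio row on text columns; `g_backward` reverses the direction. Together the two rings cover all six cross-modal blocks that $\mathcal{M}_{inter}$ leaves visible.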

After aggregation, the fusion process is performed.

$$\begin{cases}\overline{\mathcal{V}}_{m}^{forward}=\mathcal{G}_{inter}^{forward}\mathcal{W}_{v}\mathcal{V}_{m}\\ \overline{\mathcal{V}}_{m}^{backward}=\mathcal{G}_{inter}^{backward}\mathcal{W}_{v}\mathcal{V}_{m}\end{cases} \tag{9}$$

Where $\mathcal{W}_{v}$ denotes the value projection weight of $\mathcal{V}_{m}$ .

As shown in Figure 2, two MGEs are constructed by IFM in two separate Transformers, forming two opposite unidirectional rings. Because the two rings together cover every cross-modal direction, a complete fusion process is achieved.

After fusion, intra-modal subgraphs need to be enhanced accordingly. Therefore, the Interlaced-Intra-Enhancement Mask (IEM) is constructed.

$$\mathcal{M}_{intra}=\mathcal{J}-\mathcal{M}_{inter} \tag{10}$$

Where $\mathcal{J}$ denotes a negative infinity matrix of the same size as $\mathcal{M}_{inter}$.

$\mathcal{M}_{intra}$ leaves only intra-modal subgraphs visible to enhance the fused MGEs.
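Equation 10 is most naturally read as a block-wise complement of $\mathcal{M}_{inter}$: blocks that were hidden become visible and vice versa. The sketch below, in plain Python, illustrates why an implementation may prefer to build the complement explicitly instead of subtracting floating-point infinities. This reading and the `complement` helper are assumptions for illustration, not the procedure of Appendix E.1.

```python
import math

NEG_INF = float("-inf")

# Evaluated literally in IEEE 754 arithmetic, J - M_inter is undefined on the
# diagonal blocks, because (-inf) - (-inf) is NaN rather than 0.
literal_diagonal = NEG_INF - NEG_INF
literal_off_diagonal = NEG_INF - 0.0

def complement(mask):
    """Swap visible (0) and masked (-inf) entries of an additive attention mask."""
    return [[0.0 if math.isinf(x) else NEG_INF for x in row] for row in mask]

# M_inter with one token per modality (order t, v, a), as in Equation 6.
m_inter = [
    [NEG_INF, 0.0, 0.0],
    [0.0, NEG_INF, 0.0],
    [0.0, 0.0, NEG_INF],
]
m_intra = complement(m_inter)

print(math.isnan(literal_diagonal))  # True
```

With `m_intra`, only the diagonal (intra-modal) entries stay at 0 and every cross-modal entry is set to negative infinity, which matches the description of IEM.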

After IEM construction, we concatenate the two opposite unidirectional ring MGEs along the feature dimension into one bidirectional MGE. We define $\parallel$ as the concatenation operation on the feature dimension.

$$\overline{\mathcal{V}}_{m}^{bidirection}=\parallel\overline{\mathcal{V}}_{m}^{\{forward,backward\}} \tag{11}$$

Utilizing the bidirectional MGE $\overline{\mathcal{V}}_{m}^{bidirection}$ and $\mathcal{M}_{intra}$, the intra-modal enhancement graph can be constructed. We abbreviate $\overline{\mathcal{V}}_{m}^{bidirection}$ as $\overline{\mathcal{V}}_{m}^{b}$ and denote by $\mathcal{W}_{q}^{b}$ and $\mathcal{W}_{k}^{b}$ the query and key projection weights of $\overline{\mathcal{V}}_{m}^{b}$.

$$\mathcal{A}_{fusion}=(\mathcal{W}_{q}^{b}\overline{\mathcal{V}}_{m}^{b})\cdot(\mathcal{W}_{k}^{b}\overline{\mathcal{V}}_{m}^{b})^{\top} \tag{12}$$
$$\mathcal{G}_{intra}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}_{fusion}+\mathcal{M}_{intra}) \tag{13}$$

Then, we construct the final feature sequence.

$$\overline{\mathcal{V}}_{m}=\mathcal{G}_{intra}\mathcal{W}_{v}^{b}\overline{\mathcal{V}}_{m}^{b} \tag{14}$$

Where $\mathcal{W}_{v}^{b}$ denotes the value projection weight of $\overline{\mathcal{V}}_{m}^{b}$.

Finally, the sequence is decomposed according to the length of the original feature sequence. Then, the final hidden states of different modals are concatenated on the feature dimension to construct the fusion feature $\mathcal{X}_{m}$ .
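The decomposition and concatenation step can be sketched as follows in plain Python. The sketch assumes that the "final hidden state" of a modal is the last time step of its segment, which is one common choice and not confirmed by the text; the lengths, the feature size, and both helper names are hypothetical.

```python
def split_by_lengths(sequence, lengths):
    """Cut a concatenated [t; v; a] sequence back into per-modality segments."""
    segments, start = {}, 0
    for name, length in lengths.items():
        segments[name] = sequence[start:start + length]
        start += length
    return segments

def fuse_last_states(sequence, lengths):
    """Concatenate the last vector of every segment along the feature axis."""
    segments = split_by_lengths(sequence, lengths)
    fused = []
    for name in lengths:
        fused.extend(segments[name][-1])  # assumption: last step = final state
    return fused

lengths = {"t": 2, "v": 3, "a": 1}  # hypothetical sequence lengths
d = 2                               # hypothetical feature dimension
sequence = [[float(10 * step + k) for k in range(d)] for step in range(6)]

x_m = fuse_last_states(sequence, lengths)
print(x_m)  # [10.0, 11.0, 40.0, 41.0, 50.0, 51.0]
```

The fused vector has dimension $3d$, one slice of size $d$ per modal.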

The detailed generation algorithm of IM is described in Appendix E.1.

3.5 Self-Supervised Learning Framework

A unimodal label generation module (ULGM) is integrated into our approach to capture unimodal-specific information. As shown in Figure 1, we project the input features $\mathcal{X}_{\{t,v,a\}}$ to unimodal final hidden states $h_{\{t,v,a\}}$, from which the unimodal outputs $\hat{y}_{\{t,v,a\}}$ are predicted. During the prediction process, ULGM uses $h_{\{t,v,a\}}$ and the ground truth multimodal labels to define positive and negative centers, which are determined based on the predicted unimodal labels and the multimodal fusion representations. Next, we calculate the relative distance of each modal representation from the positive and negative centers. Then, based on these distances, we generate new unimodal labels $y_{\{t,v,a\}}^{i}$ relative to the ground truth multimodal label, where $i$ denotes the $i$-th training iteration. In this way, the model can better capture the distinguishing information of different modals while maintaining the consistency of each modal.

Using the predicted results $\hat{y}_{\{m,t,v,a\}}$ and the ground truth multimodal label $y_{m}$ and the generated labels $y_{\{t,v,a\}}$ , we implement a weighted loss to optimize our model.

The weighted loss is defined by Equation 15, and the unimodal loss for each modal is defined by Equation 16.

$$\mathcal{L}_{w}=\sum_{u\in\{m,t,v,a\}}\mathcal{L}_{u} \tag{15}$$
$$\begin{aligned}&\mathcal{L}_{u}=\frac{\sum_{i=0}^{\mathcal{B}}w_{u}^{i}\cdot\|\hat{y}_{u}^{i}-y_{u}^{i}\|}{\mathcal{B}}\\ &w_{u}^{i}=\begin{cases}1&u=m\\ \tanh(\|\hat{y}_{u}^{i}-\hat{y}_{m}^{i}\|)&u\in\{t,v,a\}\end{cases}\end{aligned} \tag{16}$$

Where $\mathcal{B}$ denotes the appointed batch size.
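Equations 15 and 16 can be written out for scalar predictions as in the sketch below, in plain Python. It assumes that the sum runs over the samples of one batch and that the norm reduces to an absolute value for scalar outputs; the numbers in the example are made up for illustration.

```python
import math

def weighted_loss(pred, label):
    """Equations 15-16 for scalar predictions, averaged over the batch.

    pred and label map each of 'm', 't', 'v', 'a' to a list of floats.
    """
    batch = len(pred["m"])
    total = 0.0
    for u in ("m", "t", "v", "a"):
        acc = 0.0
        for i in range(batch):
            if u == "m":
                w = 1.0
            else:
                # The weight grows with the gap between the unimodal and
                # the fusion prediction, as written in Equation 16.
                w = math.tanh(abs(pred[u][i] - pred["m"][i]))
            acc += w * abs(pred[u][i] - label[u][i])
        total += acc / batch
    return total

# Illustrative values only: predictions and (generated) labels for 2 samples.
pred = {"m": [1.0, -1.0], "t": [0.5, -1.0], "v": [1.0, 0.0], "a": [1.0, -1.0]}
label = {"m": [1.5, -1.0], "t": [1.0, -1.0], "v": [1.0, -0.5], "a": [1.0, -1.0]}

loss = weighted_loss(pred, label)
```

As written, a unimodal sample whose prediction coincides with the fusion prediction receives zero weight, so the unimodal terms concentrate on samples where the modal disagrees with the fused output.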

Table 1: Comparison on CMU-MOSI and CMU-MOSEI.

Model	MOSI Acc-2 $\uparrow$	MOSI F1 $\uparrow$	MOSI Acc-7 $\uparrow$	MOSI MAE $\downarrow$	MOSI Corr $\uparrow$	MOSEI Acc-2 $\uparrow$	MOSEI F1 $\uparrow$	MOSEI Acc-7 $\uparrow$	MOSEI MAE $\downarrow$	MOSEI Corr $\uparrow$	Data State
$\text{MulT}^{*}$	83.0 / -	82.8 / -	40.0	0.871	0.698	81.6 / -	81.6 / -	50.7	0.591	0.694	Unaligned
$\text{MTAG}^{*}$	82.3 / -	82.1 / -	38.9	0.866	0.722	- / -	- / -	-	-	-	Unaligned
$\text{MISA}^{*}$	81.8 / 83.4	81.7 / 83.6	42.3	0.783	0.761	83.6 / 85.5	83.8 / 85.3	52.2	0.555	0.756	Unaligned
$\text{HyCon-BERT}^{*}$	- / 85.2	- / 85.1	46.6	0.713	0.790	- / 85.4	- / 85.6	52.8	0.601	0.776	Aligned
$\text{ConFEDE}^{*}$	84.2 / 85.5	84.1 / 85.5	42.3	0.742	0.784	81.7 / 85.8	82.2 / 85.8	54.9	0.522	0.780	Unaligned
$\text{MMIN}^{*}$	83.5 / 85.5	83.5 / 85.5	-	0.741	0.795	83.8 / 85.9	83.9 / 85.8	-	0.542	0.761	Unaligned
$\text{MTMD}^{*}$	84.0 / 86.0	83.9 / 86.0	47.5	0.705	0.799	84.8 / 86.1	84.9 / 85.9	53.7	0.531	0.767	Unaligned
$\text{MulT}^{\dagger}$	79.6 / 81.4	79.1 / 81.0	36.2	0.923	0.686	78.1 / 83.7	78.9 / 83.7	53.4	0.559	0.740	Unaligned
$\text{CENet-BERT}^{\dagger}$	82.8 / 84.5	82.7 / 84.5	45.2	0.736	0.793	81.7 / 82.3	81.6 / 81.9	52.0	0.576	0.711	Aligned
$\text{Self-MM}^{\dagger}$	82.2 / 83.5	82.3 / 83.6	43.9	0.758	0.792	80.8 / 85.0	81.3 / 84.9	53.3	0.539	0.761	Unaligned
$\text{TETFN}^{\dagger}$	82.4 / 84.0	82.4 / 84.1	46.1	0.749	0.784	81.9 / 84.3	82.1 / 84.1	52.7	0.576	0.728	Unaligned
GSIFN	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801	85.0 / 86.3	85.1 / 86.2	53.4	0.538	0.767	Unaligned

Table 2: Comparison on CH-SIMS.

Model	Acc-2 $\uparrow$	Acc-3 $\uparrow$	Acc-5 $\uparrow$	F1 $\uparrow$	MAE $\downarrow$	Corr $\uparrow$
$\text{TFN}^{\dagger}$	77.7	66.3	42.7	77.7	0.436	0.582
$\text{MFN}^{\dagger}$	77.8	65.4	38.8	77.6	0.443	0.566
$\text{MulT}^{\dagger}$	77.8	65.3	38.2	77.7	0.443	0.578
$\text{MISA}^{\dagger}$	75.3	62.4	35.5	75.4	0.457	0.553
$\text{Self-MM}^{\dagger}$	78.1	65.2	41.3	78.2	0.423	0.585
$\text{TETFN}^{\dagger}$	78.0	64.4	42.9	78.0	0.425	0.582
GSIFN	80.5	67.2	45.5	80.7	0.397	0.619

Table 3: Ablation study on CMU-MOSI.

Description	Acc-2 $\uparrow$	F1 $\uparrow$	Acc-7 $\uparrow$	MAE $\downarrow$	Corr $\uparrow$
GSIFN	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801
w/o GsiT	83.8 / 85.5	83.2 / 85.7	46.5	0.742	0.790
w/o mLSTM	84.6 / 86.0	84.5 / 86.0	47.2	0.730	0.792
w/o ULGM	83.4 / 84.8	83.4 / 84.8	46.7	0.711	0.801

4 Experiment

We evaluate our model on three benchmarks: CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI (Bagher Zadeh et al., 2018), and CH-SIMS (Yu et al., 2020). These datasets provide aligned (CMU-MOSI, CMU-MOSEI) and unaligned (all three) multimodal data (text, vision, and audio) for each utterance. Further details are in Appendix B.

Following prior works, we adopt several evaluation metrics: binary classification accuracy (Acc-2), F1 score (F1), three-class accuracy (Acc-3), five-class accuracy (Acc-5), seven-class accuracy (Acc-7), mean absolute error (MAE), and the correlation of the model's predictions with human annotations (Corr). In particular, Acc-3 and Acc-5 are applied only to the CH-SIMS dataset, while on CMU-MOSI and CMU-MOSEI, Acc-2 and F1 are calculated in two ways: negative/non-negative (NN) and negative/positive (NP).
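The difference between the NN and NP variants of Acc-2 is easy to miss, so a small sketch in plain Python may help. It follows the convention commonly used in public MSA toolkits (NP skips samples whose label is exactly 0, and Acc-7 clips scores to $[-3,3]$ before rounding); these conventions, the function names, and the example numbers are assumptions here and are not taken from the paper.

```python
import math

def acc2_non_negative(preds, labels):
    """Negative vs. non-negative accuracy over all samples (NN)."""
    hits = sum((p >= 0) == (y >= 0) for p, y in zip(preds, labels))
    return hits / len(labels)

def acc2_positive(preds, labels):
    """Negative vs. positive accuracy (NP); neutral labels (0) are skipped."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    hits = sum((p > 0) == (y > 0) for p, y in pairs)
    return hits / len(pairs)

def acc7(preds, labels):
    """Accuracy after clipping to [-3, 3] and rounding to an integer class."""
    def to_class(x):
        return round(max(-3.0, min(3.0, x)))
    hits = sum(to_class(p) == to_class(y) for p, y in zip(preds, labels))
    return hits / len(labels)

def mae(preds, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

def pearson(preds, labels):
    """Pearson correlation between predictions and labels."""
    n = len(labels)
    mp, my = sum(preds) / n, sum(labels) / n
    cov = sum((p - mp) * (y - my) for p, y in zip(preds, labels))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sp * sy)

# Made-up scores on a [-3, 3] scale.
preds = [1.2, -0.4, 0.3, -2.6, 0.0]
labels = [1.0, -1.0, 0.0, -3.0, 0.4]

print(acc2_non_negative(preds, labels))  # 1.0
print(acc2_positive(preds, labels))      # 0.75
print(acc7(preds, labels))               # 0.8
```

On this toy input the two Acc-2 variants disagree: the prediction 0.0 for the label 0.4 counts as correct under NN but as an error under NP, and the neutral-labelled sample is dropped from NP altogether.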

For CMU-MOSI and CMU-MOSEI, we choose MulT (Tsai et al., 2019), MTAG (Yang et al., 2021), MISA (Hazarika et al., 2020), HyCon-BERT (Mai et al., 2023), TETFN (Wang et al., 2023a), ConFEDE (Yang et al., 2023), MMIN (Fang et al., 2024), MTMD (Lin and Hu, 2024), CENet-BERT (Wang et al., 2023b), and Self-MM (Yu et al., 2021) as baselines. For CH-SIMS, TFN (Zadeh et al., 2017), MFN (Zadeh et al., 2018), MISA, MulT, Self-MM, and TETFN are chosen. All of these are previous state-of-the-art (SOTA) methods. Further details are in Appendix C.

Detailed experiment settings of hyperparameters and feature extraction methods are in Appendix A.1.

Table 4: Comparison of GsiT and MulT on CMU-MOSI and CMU-MOSEI.

Model	MOSI Acc-2 $\uparrow$	MOSI F1 $\uparrow$	MOSI Acc-7 $\uparrow$	MOSI MAE $\downarrow$	MOSI Corr $\uparrow$	MOSEI Acc-2 $\uparrow$	MOSEI F1 $\uparrow$	MOSEI Acc-7 $\uparrow$	MOSEI MAE $\downarrow$	MOSEI Corr $\uparrow$	Params(M)	FLOPS(G)
MulT	79.6 / 81.4	79.1 / 81.0	36.2	0.923	0.686	78.1 / 83.7	78.9 / 83.7	53.4	0.559	0.740	4.362	105.174
GsiT	83.4 / 84.9	83.4 / 85.0	45.5	0.716	0.803	84.1 / 86.3	84.4 / 86.3	53.5	0.539	0.774	0.891	25.983

Table 5: The Computational Overhead of Different Vision/Audio Modality Enhancement Models

Model	mLSTM(V)	mLSTM(A)	ViT	Wav2Vec	Whisper
Params(M)	0.439	0.439	127.272	94.395	17.120
FLOPS(G)	1.674	1.252	35.469	68.543	315.128

Table 6: Comparison with model using large model extractors. Note: OF denotes OpenFace, CR denotes COVAREP

Model	Acc-2 $\uparrow$	F1 $\uparrow$	Acc-7 $\uparrow$	MAE $\downarrow$	Corr $\uparrow$	Extractor(V/A)	Enhancer
GSIFN	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801	OF/CR	mLSTM
TETFN	84.1 / 86.1	83.8 / 86.1	46.5	0.717	0.800	ViT/CR	LSTM
AcFormer	82.3 / 85.4	82.1 / 85.2	44.2	0.742	0.794	ViT/Wav2Vec	Transformer

4.1 Results

The performance comparison of all methods on CMU-MOSI, CMU-MOSEI, and CH-SIMS is summarized in Tables 1 and 2.

For all metrics, the best results are highlighted in bold, the second-best results are double-underlined, and the third-best results are single-underlined. $\dagger$ denotes that the model is sourced from the MMSA GitHub page (footnote 1) and the scores are reproduced; $*$ denotes that the result is taken directly from the original paper.

GSIFN is trained end-to-end without any pre-training. All reproduced results are obtained in the same experimental environment, which ensures consistency and fairness. Except for MulT, the reproduced results differ greatly from the original results.

In Table 1, for a fair comparison in CMU-MOSI and CMU-MOSEI, we split models into two categories based on data state: Unaligned and Aligned. For Acc-2 and F1, the left of the "/" corresponds to "negative/non-negative" and the right corresponds to "negative/positive".

As shown in Tables 1 and 2, GSIFN outperforms all previous SOTA methods on most metrics across all datasets. The remaining metrics on CMU-MOSEI (Acc-7, MAE, Corr) and CMU-MOSI (MAE) also reach at least the third-best performance. GSIFN achieves all-modal-in-one fusion and enhanced self-supervised learning, which underpins its superior performance over previous SOTA methods.

Footnote 1: https://github.com/thuiar/MMSA

4.2 Ablation Study

In this section, we discuss the module-level ablation study reported in Table 3. Further ablation results are in Appendix A.2.

There are three main modules in our model: the Graph-Structured Interlaced-Masked Multimodal Transformer (GsiT) for multimodal fusion, the extended LSTM with matrix memory (mLSTM) for vision and audio temporal enhancement, and the Unimodal Label Generation Module (ULGM) for self-supervision. In Table 3, w/o denotes the absence of the corresponding module in the model.

The results in Table 3 indicate that all modules are necessary for achieving SOTA performance. GsiT realizes all-modal-in-one Transformer-based fusion; without it, the performance of the whole model decreases on all metrics. GsiT is the core module of GSIFN and is especially important in fine-grained tasks. Without mLSTM, the performance weakens mainly on fine-grained metrics, so it is a necessary module of GSIFN. Without ULGM, the performance weakens on almost all metrics, which shows that ULGM is significant to GSIFN in coarse-grained tasks.

4.3 Further Analysis

In this section, we compare GsiT with MulT in terms of performance and efficiency and analyze the efficiency of mLSTM. In particular, Params denotes the number of parameters, and FLOPS denotes floating-point operations per second.

GsiT and MulT MulT mainly uses CMA to realize effective modal fusion. Like GsiT, the core module of GSIFN, MulT performs complete fusion followed by post-fusion enhancement, but it does so separately in 9 Transformers (Vaswani et al., 2017). GsiT instead uses IM to construct the graph structure in MGEs, each of which contains the information of all three modals. This reduces the number of Transformers from 9 to 3. Through weight-sharing without information disorder, each Transformer in GsiT can completely fuse trimodal sentiment information all in one, achieving better weight regularization and fusion performance at the same time.

For a fair comparison, we trained MulT and GsiT with the same hyperparameters. The experiments are shown in the Table 4. GsiT outperforms MulT in all metrics. The Params and FLOPS of GsiT are much lower than MulT.
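As a rough illustration of why going from 9 to 3 Transformers shrinks the parameter budget, the sketch below counts the parameters of a generic Transformer encoder layer with biases. The feed-forward width `d_ff`, the number of layers per Transformer, and both function names are hypothetical; the numbers are not those of Table 4 and only show the 3x ratio that follows when every Transformer has the same size.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one generic Transformer encoder layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)        # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                      # two LayerNorms: scale and shift
    return attention + feed_forward + layer_norms


def stack_params(num_transformers, layers_each, d_model, d_ff):
    """Total parameters of several equally sized Transformers."""
    return num_transformers * layers_each * encoder_layer_params(d_model, d_ff)


# d_model mirrors the "feature" size 128 of Table 7 (assumed to be the model width);
# d_ff and layers are assumptions made only for this illustration.
d_model, d_ff, layers = 128, 512, 4
nine = stack_params(9, layers, d_model, d_ff)
three = stack_params(3, layers, d_model, d_ff)
print(nine, three, nine / three)
```

The count ignores embeddings and output heads, so it only captures the part of the budget that scales with the number of Transformers.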

Vision/Audio Encoder Efficiency Table 4 lists the Params and FLOPS of widely used non-verbal modal feature extractors. Vision Transformer (ViT) (Dosovitskiy et al., 2021), Wav2Vec (Schneider et al., 2019), and Whisper (Radford et al., 2023) are commonly employed to extract high-quality features. We instead employ mLSTM to enhance the low-quality features extracted by COVAREP (Degottex et al., 2014) for audio and OpenFace (Baltrusaitis et al., 2016) for vision. The Params and FLOPS of the mLSTMs are far lower than those of ViT, Wav2Vec, and Whisper.

To analyze the efficiency and effectiveness of mLSTM, GSIFN is compared with two models that use large-model extractors: TETFN and AcFormer (Zong et al., 2023). As shown in Table 6, GSIFN performs comparably to, or even better than, TETFN and AcFormer on most metrics.

5 Conclusion

In this paper, we propose GSIFN, a Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network. GSIFN addresses multimodal challenges with two key components: (1) a graph-structured and interlaced-masked multimodal Transformer that builds a robust multimodal graph embedding and achieves efficient, effective all-modal-in-one fusion; (2) a self-supervised learning framework that offers high performance at low computation overhead, using mLSTM to boost non-verbal modal features for unimodal label generation. The experimental results show that GSIFN reduces the computation overhead and achieves superior performance in multimodal sentiment analysis.

Limitations

Our GSIFN lacks pre-training for different modal data in the feature extraction part and does not deal with the overpopulated part of the data, which results in too many redundant and uninformative vertices in the graph structure. This limits our performance on fine-grained metrics such as MAE and Acc-7, which is not outstanding compared with past methods.

References

Baevski et al. (2022) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 1298–1312. PMLR.
Bagher Zadeh et al. (2018) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia. Association for Computational Linguistics.
Baltrusaitis et al. (2016) Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2016. Openface: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–10. IEEE Computer Society.
Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018, pages 59–66. IEEE Computer Society.
Brody et al. (2022) Shaked Brody, Uri Alon, and Eran Yahav. 2022. How attentive are graph attention networks? In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
Chung et al. (2014) Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Dao and Gu (2024) Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 10041–10071. PMLR.
Degottex et al. (2014) Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP - A collaborative voice analysis repository for speech technologies. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 960–964. IEEE.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Fang et al. (2024) Lingyong Fang, Gongshen Liu, and Ru Zhang. 2024. Multi-grained multimodal interaction network for sentiment analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19, 2024, pages 7730–7734. IEEE.
Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
Hazarika et al. (2020) Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, pages 1122–1131. ACM.
Hochreiter (1997) S Hochreiter. 1997. Long short-term memory. Neural Computation MIT-Press.
Lin and Hu (2024) Ronghao Lin and Haifeng Hu. 2024. Multi-task momentum distillation for multimodal sentiment analysis. IEEE Trans. Affect. Comput., 15(2):549–565.
Mai et al. (2023) Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. 2023. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans. Affect. Comput., 14(3):2276–2289.
Peng et al. (2023a) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023a. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
Peng et al. (2023b) Junjie Peng, Ting Wu, Wenqiang Zhang, Feng Cheng, Shuhua Tan, Fen Yi, and Yansong Huang. 2023b. A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Expert Syst. Appl., 221:119721.
Pöppel et al. (2024) Korbinian Pöppel, Maximilian Beck, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. 2024. xLSTM: Extended long short-term memory. In First Workshop on Long-Context Foundation Models @ ICML 2024.
Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
Rahman et al. (2020) Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369, Online. Association for Computational Linguistics.
Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, pages 3465–3469. ISCA.
Sun et al. (2023) Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, Florence, Italy. Association for Computational Linguistics.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Wang et al. (2023a) Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, Lihuo He, and Xuemei Luo. 2023a. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit., 136:109259.
Wang et al. (2023b) Di Wang, Shuai Liu, Quan Wang, Yumin Tian, Lihuo He, and Xinbo Gao. 2023b. Cross-modal enhancement network for multimodal sentiment analysis. IEEE Trans. Multim., 25:4909–4921.
Wang et al. (2024) Lan Wang, Junjie Peng, Cangzhi Zheng, Tong Zhao, and Li’an Zhu. 2024. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning. Inf. Process. Manag., 61(2):103675.
Wu et al. (2024) Zehui Wu, Ziwei Gong, Jaywon Koo, and Julia Hirschberg. 2024. Multimodal multi-loss fusion network for sentiment analysis. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3588–3602, Mexico City, Mexico. Association for Computational Linguistics.
Yang et al. (2021) Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, and Louis-Philippe Morency. 2021. MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1009–1021, Online. Association for Computational Linguistics.
Yang et al. (2023) Jiuding Yang, Yakun Yu, Di Niu, Weidong Guo, and Yu Xu. 2023. ConFEDE: Contrastive feature decomposition for multimodal sentiment analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7617–7630, Toronto, Canada. Association for Computational Linguistics.
Yu et al. (2020) Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. 2020. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727, Online. Association for Computational Linguistics.
Yu et al. (2021) Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 10790–10797. AAAI Press.
Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, Copenhagen, Denmark. Association for Computational Linguistics.
Zadeh et al. (2018) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory fusion network for multi-view sequential learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5634–5641. AAAI Press.
Zadeh et al. (2016) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
Zhang et al. (2023) Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, and Tianshu Yu. 2023. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 756–767, Singapore. Association for Computational Linguistics.
Zhao et al. (2023) Tong Zhao, Junjie Peng, Yansong Huang, Lan Wang, Huiran Zhang, and Zesu Cai. 2023. A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis. Appl. Intell., 53(24):30455–30468.
Zheng et al. (2024a) Cangzhi Zheng, Junjie Peng, and Zesu Cai. 2024a. Extracting method for fine-grained emotional features in videos. Knowledge-Based Systems, 302:112382.
Zheng et al. (2024b) Cangzhi Zheng, Junjie Peng, Lan Wang, Li’an Zhu, Jiatao Guo, and Zesu Cai. 2024b. Frame-level nonverbal feature enhancement based sentiment analysis. Expert Systems with Applications, 258:125148.
Zong et al. (2023) Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng, and Qunyan Zhou. 2023. Acformer: An aligned and compact transformer for multimodal sentiment analysis. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 833–842. ACM.

Appendix A Experiment

A.1 Experiment Settings

Table 7: The hyperparameters of the main experiment.

Hyperparameter	CMU-MOSI	CMU-MOSEI	CH-SIMS
Learning Rate
batch size	64	64	64
lr-bert	$5\times 10^{-5}$	$5\times 10^{-6}$	$5\times 10^{-5}$
lr-audio	$5\times 10^{-5}$	$5\times 10^{-5}$	$5\times 10^{-5}$
lr-video	$5\times 10^{-5}$	$5\times 10^{-5}$	$5\times 10^{-5}$
lr-other	$5\times 10^{-4}$	$5\times 10^{-4}$	$1\times 10^{-4}$
Weight Decay
wd-bert	0.001	0.001	0.001
wd-audio	0.001	0.001	0.001
wd-video	0.001	0.001	0.001
wd-other	0.001	0.001	0.001
Model Hyperparameters
xlstm blocks	4	4	4
feature	128	128	128
heads	4	4	4
dropout	0.2	0.2	0.2

Table 8: The extractors of the main experiment.

Modal	CMU-MOSI	CMU-MOSEI	CH-SIMS
Text	bert-base-uncased	bert-base-uncased	bert-base-chinese
Vision	OpenFace	OpenFace	OpenFace2.0
Audio	COVAREP	COVAREP	LibROSA

In this section, we discuss the experiment settings. The hyperparameters of the main experiment are shown in Table 7. For the further analysis experiments, the hyperparameters of MulT are the same as those of GSIFN on CMU-MOSI.

Following previous work (Zheng et al., 2024b), we use different feature extraction tools for the modals in each dataset: BERT (Devlin et al., 2019) for text, OpenFace (Baltrusaitis et al., 2016) and OpenFace 2.0 (Baltrusaitis et al., 2018) for vision, and COVAREP (Degottex et al., 2014) and LibROSA for audio. The extractors for each dataset are shown in Table 8.
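For readers who want to reuse the settings, the CMU-MOSI column of Table 7 can be written down as a small configuration. The values below are copied from Table 7; the dictionary layout, the key names, and the `optimizer_groups` helper are illustrative assumptions and do not describe the released code.

```python
# Values copied from Table 7 (CMU-MOSI column); the structure is an assumption.
MOSI_CONFIG = {
    "batch_size": 64,
    "lr": {"bert": 5e-5, "audio": 5e-5, "video": 5e-5, "other": 5e-4},
    "weight_decay": {"bert": 0.001, "audio": 0.001, "video": 0.001, "other": 0.001},
    "xlstm_blocks": 4,
    "feature": 128,
    "heads": 4,
    "dropout": 0.2,
}


def optimizer_groups(config):
    """Return one {name, lr, weight_decay} entry per parameter group."""
    return [
        {
            "name": name,
            "lr": config["lr"][name],
            "weight_decay": config["weight_decay"][name],
        }
        for name in ("bert", "audio", "video", "other")
    ]


groups = optimizer_groups(MOSI_CONFIG)
for group in groups:
    print(group)
```

Keeping the learning rate and weight decay per group makes it easy to hand each group to an optimizer that supports per-group options.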

A.2 Further Ablation Study

In this section, further ablation experiments are performed and presented to fully analyze GSIFN. These experiments include Graph Structure Ablation, Fusion Modal Ablation, ULGM Modal Ablation, and Pretrained Language Model Ablation. Note that the multimodal representation (M) is used for the final classification task. In the original case, M is composed of unimodal text (T), vision (V), and audio (A).

Graph Structure Selection The structure of the graph has a significant impact on the performance of the model, so we conduct an ablation study on its graph structure. The structures include the original structure, structure-1, structure-2, structure-3, and self-only structure.

The graph structure of the three modals can only be constructed in four cases. As a contrast, we design a self-only mask to interpret the influence of information disorder.

Original Structure: The original structure is two opposite unidirectional ring graphs. They both realize cyclic all-modal-in-one fusion, which makes trimodal information fully interact in shared model weights. The structure is: $\{t\rightarrow v,v\rightarrow a,a\rightarrow t\}$ , $\{a\rightarrow v,v\rightarrow t,t\rightarrow a\}$ . The modal-wise IFMs are:

\begin{cases}\begin{aligned} &\mathcal{M}_{inter}^{forward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{J}^{t,a}\\ \mathcal{J}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\\ &\mathcal{M}_{inter}^{backward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{J}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{J}^{v,a}\\ \mathcal{J}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\end{aligned}\end{cases}

(17)
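The block layout of Eq. (17) can be generated from the two edge sets of the original structure. The sketch below is a minimal illustration under two assumptions that this appendix does not state explicitly: $\mathcal{O}$ is read as an all-zero (visible) block and $\mathcal{J}$ as an all-negative-infinity (masked) block of an additive attention mask, and an edge $x\rightarrow y$ is read as the visible block $\mathcal{O}^{x,y}$. The function names and sequence lengths are hypothetical.

```python
NEG_INF = float("-inf")
MODALS = ("t", "v", "a")


def block_pattern(edges):
    """3x3 grid of block names: 'O' where (row, col) is an edge, else 'J'."""
    return [["O" if (r, c) in edges else "J" for c in MODALS] for r in MODALS]


def expand_mask(pattern, lengths):
    """Expand the block grid to an element-wise additive attention mask."""
    mask = []
    for r, row_modal in enumerate(MODALS):
        for _ in range(lengths[row_modal]):
            row = []
            for c, col_modal in enumerate(MODALS):
                fill = 0.0 if pattern[r][c] == "O" else NEG_INF   # assumed meaning
                row.extend([fill] * lengths[col_modal])
            mask.append(row)
    return mask


forward_edges = {("t", "v"), ("v", "a"), ("a", "t")}
backward_edges = {("a", "v"), ("v", "t"), ("t", "a")}

forward = block_pattern(forward_edges)
backward = block_pattern(backward_edges)
mask = expand_mask(forward, {"t": 2, "v": 1, "a": 1})   # toy sequence lengths
for row in mask:
    print(row)
```

Under these assumptions, each row of the block grid contains exactly one visible block, which is what makes the two rings cyclic and opposite to each other.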

Structure-1: Structure-1 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{a\rightarrow v,v\rightarrow a,a\rightarrow t\}$ , $\{v\rightarrow t,t\rightarrow v,t\rightarrow a\}$ . The modal-wise IFMs are:

\begin{cases}\begin{aligned} &\mathcal{M}_{inter}^{forward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{J}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{J}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\\ &\mathcal{M}_{inter}^{backward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{J}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{J}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\end{aligned}\end{cases}

(18)

Structure-2: Structure-2 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{v\rightarrow t,t\rightarrow v,v\rightarrow a\}$ , $\{a\rightarrow t,t\rightarrow a,a\rightarrow v\}$ . The modal-wise IFMs are:

\begin{cases}\begin{aligned} &\mathcal{M}_{inter}^{forward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{J}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{J}^{v,a}\\ \mathcal{J}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\\ &\mathcal{M}_{inter}^{backward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{J}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{J}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\end{aligned}\end{cases}

(19)

Structure-3: Structure-3 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{a\rightarrow v,v\rightarrow a,v\rightarrow t\}$ , $\{a\rightarrow t,t\rightarrow a,t\rightarrow v\}$ . The modal-wise IFMs are:

\begin{cases}\begin{aligned} &\mathcal{M}_{inter}^{forward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{J}^{t,a}\\ \mathcal{J}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{J}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\\ &\mathcal{M}_{inter}^{backward}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{J}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{J}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{J}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}\end{aligned}\end{cases}

(20)

Additionally, we construct a graph with only the intra-mask, which is disordered in multimodal temporal information.

Self-Only:

\mathcal{M}_{inter}=\begin{pmatrix}\mathcal{J}^{t,t}&\mathcal{O}^{t,v}&\mathcal{O}^{t,a}\\ \mathcal{O}^{v,t}&\mathcal{J}^{v,v}&\mathcal{O}^{v,a}\\ \mathcal{O}^{a,t}&\mathcal{O}^{a,v}&\mathcal{J}^{a,a}\end{pmatrix}

(21)

As shown in the Graph Structure Ablation part of Table 9, the original structure is superior to the other three theoretically feasible structures on all metrics. All four theoretically feasible structures are superior to the self-only structure, which is theoretically infeasible.

Table 9: Modality Ablation Study on CMU-MOSI. Note: F denotes finetuning the pretrained language model, and NF denotes not finetuning it.

Graph Structure Ablation
Description	CMU-MOSI
Description	Acc-2 $\uparrow$	F1 $\uparrow$	Acc-7 $\uparrow$	MAE $\downarrow$	Corr $\uparrow$
Original	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801
Structure-1	82.4 / 84.0	82.3 / 84.0	46.5	0.712	0.792
Structure-2	83.8 / 85.7	83.7 / 85.6	46.1	0.731	0.796
Structure-3	83.4 / 85.1	83.3 / 85.1	45.5	0.727	0.793
Self-Only	81.6 / 83.2	81.7 / 83.3	43.3	0.750	0.791
Fusion Modality Ablation
M(T,V,A)	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801
M(T,V)	84.3 / 85.5	84.2 / 85.5	45.5	0.720	0.797
M(T,A)	84.3 / 85.7	84.3 / 85.7	47.2	0.704	0.800
M(V,A)	59.8 / 60.2	59.7 / 60.3	17.9	1.344	0.196
M(T)	83.1 / 84.8	83.0 / 84.7	47.5	0.715	0.786
M(V)	59.2 / 59.8	58.9 / 59.6	16.8	1.372	0.141
M(A)	60.4 / 61.3	59.0 / 60.0	21.3	1.322	0.236
ULGM Modal Ablation
M+T+V+A	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801
M+T+V	84.4 / 85.7	84.3 / 85.7	44.5	0.742	0.742
M+T+A	83.9 / 85.7	83.7 / 85.6	46.1	0.731	0.796
M+V+A	83.8 / 85.2	83.8 / 85.3	44.6	0.748	0.794
M+T	83.4 / 85.7	83.3 / 85.6	45.0	0.731	0.796
M+V	83.5 / 85.4	83.5 / 85.4	45.8	0.724	0.801
M+A	82.5 / 84.6	82.4 / 84.6	46.1	0.709	0.800
M	83.4 / 84.8	83.4 / 84.8	46.7	0.711	0.801
Pretrained Language Model Ablation
BERT(F)	85.0 / 86.0	85.0 / 86.0	48.3	0.707	0.801
BERT(NF)	83.8 / 85.7	83.7 / 85.6	46.1	0.731	0.796

Fusion Modal Ablation To fully investigate how the combined form of the multimodal representation influences the representation ability of the whole model, we design the fusion modal ablation study, which contains the trimodal form M(T, V, A); the bimodal forms M(T, V), M(T, A), and M(V, A); and the unimodal forms M(T), M(V), and M(A). Note that the graph structure is absent in the unimodal case, so the graph-structured attention degenerates to naive multi-head self-attention.

As shown in the Fusion Modal Ablation part of Table 9, modal combinations that include the text modal, namely M(T, V, A), M(T, V), M(T, A), and M(T), outperform those without it, such as M(V, A). Among the combinations with the text modal, the trimodal combination M(T, V, A) performs better than the bimodal combinations M(T, V) and M(T, A). Among the bimodal combinations, the audio modal plays a relatively more important role than the vision modal in multimodal fusion. In the unimodal cases, the text modal clearly outperforms vision and audio.

ULGM Modal Ablation In our proposed Self-Supervised Learning Framework, multimodality (M) is used for classification, and unimodal text (T), vision (V), and audio (A) are used to generate unimodal labels in ULGM to ensure that the model learns a robust representation of the multimodal data.

To fully analyze the importance of each modal in the model, we design the ULGM modal ablation experiment. The forms include ULGM with three modals (M+T+V+A), with two modals (M+T+V, M+T+A, M+V+A), with one modal (M+T, M+V, M+A), and without ULGM (M).
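The eight ULGM ablation forms are simply M combined with every subset of the unimodal tasks, so they can be enumerated mechanically. The sketch below is one way to do this; the function name is hypothetical, and the ordering is chosen to match the rows of Table 9.

```python
from itertools import combinations

UNIMODALS = ("T", "V", "A")


def ulgm_configurations():
    """All ULGM ablation forms: M plus every subset of the unimodal tasks."""
    forms = []
    for size in range(len(UNIMODALS), -1, -1):        # 3, 2, 1, then 0 unimodal tasks
        for subset in combinations(UNIMODALS, size):
            forms.append("+".join(("M",) + subset))
    return forms


print(ulgm_configurations())
```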

As shown in the ULGM Modal Ablation part of Table 9, M+T, for example, performs weaker than M+T+V and M+T+A in coarse-grained tasks (Acc-2, F1), which indicates that the binary classification performance of GSIFN is affected by the number of modals in ULGM. However, M alone outperforms M+A, M+T, and similar forms in fine-grained tasks (Acc-7, MAE). Among all cases, M+T+V+A achieves the best performance. Therefore, ULGM promotes the coarse-grained capability of GSIFN, while GsiT boosts its fine-grained capability.

Pretrained Language Model Ablation The effect of finetuning BERT is shown in the Pretrained Language Model Ablation part of Table 9. The results show that finetuning BERT is quite useful to GSIFN.

A.3 Alignment

Specifically, a real-time example of a complete adjacency matrix (attention map) of the original structure is shown in Figure 4.

An analysis example of the alignment efficiency of GSIFN is shown in Figure 3. We choose the bimodal combinations of vision-to-text and audio-to-text as examples; these two groups are produced by two different MGEs. As can be seen from Figure 3, GSIFN effectively and comprehensively composes the semantics of the three modals.

Appendix B Datasets

Brief introductions to the three chosen datasets are as follows.

CMU-MOSI: CMU-MOSI is a commonly used dataset for human multimodal sentiment analysis. It consists of 2,198 short monologue video clips (each clip lasts for the duration of one sentence) in which the speaker expresses an opinion on a topic such as movies. The utterances are manually annotated with a continuous opinion score in [-3, +3] (-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive).

CMU-MOSEI: The CMU-MOSEI is an improved version of CMU-MOSI. It contains 23,453 annotated video clips (about 10 times more than CMU-MOSI) from 5,000 videos, 1,000 different speakers, and 250 different topics. The number of discourses, samples, speakers, and topics is also larger than CMU-MOSI. The range of labels taken for each discourse is consistent with CMU-MOSI.

CH-SIMS: CH-SIMS includes the same modalities in Mandarin (audio, text, and video), collected from 2,281 annotated video segments. It includes data from TV shows and movies, making it culturally distinct and diverse, and provides multiple labels for the same utterance based on different modalities, which adds an extra layer of complexity and richness to the data.
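The continuous [-3, +3] scores of CMU-MOSI and CMU-MOSEI described above can be mapped to the seven named sentiment levels. The sketch below assumes that a score is clipped to the range and rounded to the nearest integer, with halves rounded away from zero; this rounding rule and the function name are assumptions, since the text only lists the seven anchor values.

```python
import math

LABELS = {
    -3: "highly negative", -2: "negative", -1: "weakly negative", 0: "neutral",
    1: "weakly positive", 2: "positive", 3: "highly positive",
}


def to_seven_class(score):
    """Clip to [-3, 3], then round to the nearest integer (halves away from zero)."""
    clipped = max(-3.0, min(3.0, score))
    return int(math.copysign(math.floor(abs(clipped) + 0.5), clipped))


for s in (-2.6, -0.4, 0.5, 1.49, 3.0):
    print(s, to_seven_class(s), LABELS[to_seven_class(s)])
```

Python's built-in `round` is avoided here because it rounds halves to the nearest even integer, which would send 0.5 to 0 rather than 1.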

Appendix C Baselines

The introduction to baseline models is as follows.

TFN: The Tensor Fusion Network (TFN) uses modality embedding subnetwork and tensor fusion to learn intra- and inter-modality dynamics.

MFN: The Memory Fusion Network (MFN) explicitly accounts for both interactions in a neural architecture and continuously models them through time.

MulT: The Multimodal Transformer (MulT) uses a cross-modal transformer based on cross-modal attention to make modality translation.

MTAG: The Modal-Temporal Attention Graph (MTAG) is a graph neural network model that incorporates modal attention mechanisms and dynamic pruning techniques to effectively capture complex interactions across modes and time, achieving a parametrically efficient and interpretable model.

MISA: The Modality-Invariant and -Specific Representations (MISA) model projects representations into modality-specific and modality-invariant spaces and is trained with distributional similarity, orthogonal, reconstruction, and task prediction losses.

Self-MM: Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning (Self-MM) designs a multimodal task and a unimodal task to learn inter-modal consistency and intra-modal specificity.

CENet-BERT: The Cross-Modal Enhancement Network (CENet) uses K-Means clustering to cluster the visual and audio modes into multiple tokens to generate the corresponding embeddings, thus improving the representation ability of the two auxiliary modes and realizing a better BERT fine-tuning migration gate.

HyCon-BERT: HyCon is a multimodal representation learning framework based on contrastive learning, designed with three types of losses to comprehensively learn inter-modal and intra-modal dynamics in both supervised and unsupervised ways.

TETFN: The Text Enhanced Transformer Fusion Network (TETFN) strengthens the role of text modes in multimodal information fusion through text-oriented cross-modal mapping and single-modal label generation, and uses a pretrained Vision Transformer to extract visual features.

ConFEDE: Contrastive Feature Decomposition (ConFEDE) constructs a unified learning framework that jointly performs contrastive representation learning and contrastive feature decomposition to enhance the representation of multimodal information.

MMIN: Multi-modal Interaction Network (MMIN) is an advanced multi-modal sentiment analysis model that combines a coarse-grained interaction network (CIN) and a fine-grained interaction network (FIN). Adversarial learning and sparse attention mechanisms are used to capture complex interactions between different modals and reduce redundant and irrelevant information.

MTMD: Multi-Task Momentum Distillation (MTMD) treats the modal learning process as multiple subtasks. Knowledge distillation between a teacher network and a student network reduces the gap between different modes, and momentum models are used to explore mode-specific knowledge and learn robust multimodal representations through adaptive momentum fusion factors.

Appendix D Aggregation of Modal Subgraphs

D.1 How to Aggregate Subgraphs?

This section derives graph aggregation from the vertex level to the subgraph level.

Vertex Aggregation Assume a set of vertex features $\mathcal{V}=\{v_{1},v_{2},\dots,v_{N}\}$, $v_{i}\in\mathbb{R}^{D}$, where $N$ is the number of vertices and $D$ is the feature dimension of each vertex.

Following previous works (Velickovic et al., 2018; Brody et al., 2022), GAT is defined as follows. First, a shared linear transformation, parameterized by a weight matrix $\mathbf{W}\in\mathbb{R}^{D^{\prime}\times D}$, is applied to every vertex. GAT then performs self-attention on the vertices: a shared attentional mechanism $a:\mathbb{R}^{D^{\prime}}\times\mathbb{R}^{D^{\prime}}\rightarrow\mathbb{R}$ computes the attention coefficients.

e^{i,j}=a(\mathbf{W}v_{i},\mathbf{W}v_{j})=(\mathbf{W}v_{i})\cdot(\mathbf{W}v_{j})^{\top}

(22)

$e^{i,j}$ indicates the importance of the features of vertex $j$ to vertex $i$. In the most general formulation, the model allows a vertex to attend to every other vertex, which drops all structural information. GAT injects the graph structure into the mechanism by performing masked attention: it only computes $e^{i,j}$ for vertices $j\in\mathcal{N}_{i}$, where $\mathcal{N}_{i}$ is the neighborhood of vertex $i$ in the graph. To make the coefficients easily comparable across different vertices, GAT normalizes them across all choices of $j$ using the softmax function ($\mathcal{S}$):

\alpha^{i,j}=\mathcal{S}_{j}(e^{i,j})=\frac{\exp(e^{i,j})}{\sum_{k\in\mathcal{N}_{i}}\exp(e^{i,k})}

(23)
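The neighborhood-restricted softmax of Eq. (23) can be sketched in a few lines. The function name, the example scores, and the neighbor set are hypothetical; subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def neighbor_softmax(scores, neighbors):
    """Softmax of scores[j] restricted to j in neighbors; other vertices get 0."""
    peak = max(scores[j] for j in neighbors)              # subtract max for stability
    exps = {j: math.exp(scores[j] - peak) for j in neighbors}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(scores))]


e_i = [2.0, 1.0, 0.5, 3.0]          # hypothetical scores e^{i,j} for one vertex i
alpha_i = neighbor_softmax(e_i, neighbors={0, 1})
print(alpha_i)
```

Vertices outside the neighborhood receive zero weight even when their raw score is the largest, which is how the graph structure enters the attention.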

Unlike GAT, whose attention mechanism $a$ is a single-layer feedforward neural network, we directly employ a multi-head self-attention mechanism as the aggregation algorithm.

Therefore, the final output feature of every vertex is defined as follows.

\overline{v}_{i}=\sum_{j\in\mathcal{N}_{i}}\alpha^{i,j}\mathbf{W}v_{j}

(24)

Where $\sigma$ indicates the sigmoid nonlinearity.

Then, we extend the mechanism to multi-head attention.

\overline{v}_{i}=\parallel^{K}_{k=1}\sum_{j\in\mathcal{N}_{i}}\alpha^{i,j}_{k}\mathbf{W}_{k}v_{j}

(25)

Where $\parallel$ represents the concatenation operation.
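As an illustration, the masked multi-head aggregation of Equations 22 to 25 can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: vertices are stored as rows, the score is the dot product of Equation 22, and the function names, the chain-shaped neighborhood, and all sizes are hypothetical choices made for the example.

```python
import numpy as np

def softmax_masked(scores, adj):
    """Row-wise softmax restricted to entries where adj is True (Eq. 23)."""
    scores = np.where(adj, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def aggregate_vertices(V, W_heads, adj):
    """Multi-head masked aggregation over vertices (Eqs. 22-25).

    V: (N, D) vertex features, one row per vertex.
    W_heads: list of K matrices of shape (D_out, D), one per head.
    adj: (N, N) boolean, adj[i, j] is True when j is in the neighborhood of i.
    """
    outputs = []
    for W in W_heads:
        H = V @ W.T                     # rows are W v_i
        e = H @ H.T                     # e[i, j] = (W v_i) . (W v_j), Eq. 22
        alpha = softmax_masked(e, adj)  # Eq. 23
        outputs.append(alpha @ H)       # Eq. 24: sum_j alpha[i, j] W v_j
    return np.concatenate(outputs, axis=1)  # Eq. 25: concatenate the K heads

rng = np.random.default_rng(0)
N, D, D_out, K = 4, 6, 3, 2             # hypothetical sizes
V = rng.normal(size=(N, D))
W_heads = [rng.normal(size=(D_out, D)) for _ in range(K)]
# Hypothetical chain graph with self-loops: vertex i sees i-1, i and i+1.
idx = np.arange(N)
adj = np.abs(idx[:, None] - idx[None, :]) <= 1
out = aggregate_vertices(V, W_heads, adj)
print(out.shape)  # (4, 6)
```

Each row of `adj` must contain at least one `True` entry (here guaranteed by the self-loops); otherwise the masked softmax has nothing to normalize over.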

From Vertex to Subgraph Assume two sets of vertices $\mathcal{V}_{i}=\{v_{1}^{i},v_{2}^{i},\dots,v_{N_{i}}^{i}\}$, $v_{m}^{i}\in\mathbb{R}^{D_{i}}$, and $\mathcal{V}_{j}=\{v_{1}^{j},v_{2}^{j},\dots,v_{N_{j}}^{j}\}$, $v_{n}^{j}\in\mathbb{R}^{D_{j}}$, where $N_{\{i,j\}}$ is the number of vertices of $\mathcal{V}_{\{i,j\}}$ and $D_{\{i,j\}}$ is the feature dimension of each vertex in $\mathcal{V}_{\{i,j\}}$.

Then, apply the GAT algorithm to $v_{m}^{i}$ and $v_{n}^{j}$. Instead of a shared linear transformation, we use two weight matrices, the query weight $\mathbf{W}_{q}^{m}\in\mathbb{R}^{D_{i}^{\prime}\times D_{i}}$ and the key weight $\mathbf{W}_{k}^{n}\in\mathbb{R}^{D_{j}^{\prime}\times D_{j}}$.

e^{m,n}=a(\mathbf{W}_{q}^{m}v_{m}^{i},\mathbf{W}_{k}^{n}v_{n}^{j})

(26)

\alpha^{m,n}=\mathcal{S}_{n}(e^{m,n})=\frac{\exp(e^{m,n})}{\sum_{l\in\mathcal{N}_{m}}\exp(e^{m,l})}

(27)

After that, the final output feature for $v_{m}^{i}$ is computed. The value weight $\mathbf{W}_{v}^{n}\in\mathbb{R}^{D_{j}^{\prime}\times D_{j}}$ is applied to transform $v_{n}^{j}$:

\overline{v}_{m}^{i}=\parallel^{L}_{l=1}\sum_{n\in\mathcal{N}_{m}}\alpha^{m,n}_{l}\mathbf{W}_{v_{l}}^{n}v_{n}^{j}

(28)

At the subgraph level, we assume that $\mathcal{N}_{m}$ includes all the vertices in subgraph $\mathcal{V}_{j}$. The attention coefficients then form a vector $\mathcal{G}^{m}$, which can be regarded as a graph aggregated from $\mathcal{V}_{j}$ to $v_{m}^{i}$. The key and value weights for $\mathcal{V}_{j}$ are represented as $\mathcal{W}_{\{k,v\}}\in\mathbb{R}^{N_{j}\times D_{j}^{\prime}\times D_{j}}$. The aggregation can then be defined as follows:

e^{m}=a(\mathbf{W}_{q}^{m}v_{m}^{i},\mathcal{W}_{k}\mathcal{V}_{j}),\quad\mathcal{G}^{m}=\mathcal{S}(e^{m})

(29)

\overline{v}_{m}^{i}=\parallel^{L}_{l=1}(\mathcal{G}^{m}_{l}\mathcal{W}_{v_{l}}\mathcal{V}_{j})

(30)

Then, apply the algorithm defined by Equations 28, 29, and 30 to all the vertices in $\mathcal{V}_{i}$. The aggregation now runs from vertex set to vertex set; we therefore regard the vertex sets as subgraphs, and vertex-to-vertex aggregation becomes subgraph aggregation. Likewise, the attention coefficient matrix $e$ becomes a directional subgraph adjacency matrix $\mathcal{E}$. The query weight for $\mathcal{V}_{i}$ is represented as $\mathcal{W}_{q}$.

\mathcal{E}^{i,j}=a(\mathcal{W}_{q}\mathcal{V}_{i},\mathcal{W}_{k}\mathcal{V}_{j}),\quad\mathcal{G}^{i,j}=\mathcal{S}(\mathcal{E}^{i,j})

(31)

\overline{\mathcal{V}}_{i}=\parallel^{L}_{l=1}(\mathcal{G}_{l}^{i,j}\mathcal{W}_{v_{l}}\mathcal{V}_{j})

(32)

The aggregation procedure is now equivalent to the multi-head cross-attention mechanism (Tsai et al., 2019).
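The step from per-vertex aggregation to subgraph aggregation can be checked numerically. The sketch below is illustrative only and rests on several simplifying assumptions that are not taken from the paper: a single head, weights shared across vertices (as in standard attention) rather than stored per vertex, a common projected dimension for queries, keys and values so that the dot product is defined, and hypothetical sizes. It computes the output once vertex by vertex and once in matrix form, and compares the two.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
N_i, N_j, D_i, D_j, D_out = 3, 5, 4, 6, 2   # hypothetical sizes
V_i = rng.normal(size=(N_i, D_i))           # target subgraph, rows are v_m^i
V_j = rng.normal(size=(N_j, D_j))           # source subgraph, rows are v_n^j
W_q = rng.normal(size=(D_out, D_i))
W_k = rng.normal(size=(D_out, D_j))
W_v = rng.normal(size=(D_out, D_j))

# Vertex by vertex: each v_m^i attends to every vertex of V_j.
per_vertex = []
for v_m in V_i:
    e_m = (W_k @ V_j.T).T @ (W_q @ v_m)     # scores against every v_n^j
    g_m = softmax(e_m)                      # G^m, a vector of length N_j
    per_vertex.append(g_m @ (V_j @ W_v.T))  # weighted sum of W_v v_n^j
per_vertex = np.stack(per_vertex)

# Subgraph to subgraph: the same computation in matrix form.
E = (V_i @ W_q.T) @ (V_j @ W_k.T).T         # (N_i, N_j) adjacency scores
G = softmax(E, axis=1)
matrix_form = G @ (V_j @ W_v.T)

print(np.allclose(per_vertex, matrix_form))  # True
```

The matrix form is exactly single-head cross-attention with queries from $\mathcal{V}_{i}$ and keys and values from $\mathcal{V}_{j}$, which is why the loop can be dropped.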

Multimodal Subgraph Aggregation Take two modal sequences $\mathcal{V}_{i}$ and $\mathcal{V}_{j}$, where $i,j\in\{t,v,a\}$, as an example, and regard them as two vertex sets. Assuming that a unidirectional subgraph is constructed from the two modal vertex sequences, the adjacency matrix weights of the corresponding subgraph are aggregated as follows.

\mathcal{E}^{i,j}=(\mathcal{W}_{q}\mathcal{V}_{i})\cdot(\mathcal{W}_{k}\mathcal{V}_{j})^{\top}

(33)

Then apply the softmax function.

\mathcal{G}^{i,j}=\mathcal{S}(\mathcal{E}^{i,j})

(34)

Finally, some edges of the subgraph are randomly masked, which is realized by applying dropout to the adjacency matrix.

\mathcal{G}_{dropout}^{i,j}=\mathcal{D}(\mathcal{G}^{i,j})

(35)

where $\mathcal{D}$ denotes the dropout function.

After the aggregation, the fusion process starts; it is regarded as the directional information fusion procedure from $\mathcal{V}_{j}$ to $\mathcal{V}_{i}$.

\overline{\mathcal{V}}_{i}=\mathcal{G}_{dropout}^{i,j}\mathcal{W}_{v}\mathcal{V}_{j}

(36)
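The directional fusion with randomly masked edges can be sketched as follows. This is a hedged illustration rather than the authors' code: queries are taken from $\mathcal{V}_{i}$, matching the stated direction from $\mathcal{V}_{j}$ to $\mathcal{V}_{i}$; the function name `fuse`, the drop probability `p_drop`, and the inverted-dropout rescaling by $1/(1-p)$ are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(V_i, V_j, W_q, W_k, W_v, p_drop, rng):
    """Directional fusion from V_j into V_i with random edge masking.

    Returns the fused features and the boolean matrix of kept edges.
    p_drop must be smaller than 1.
    """
    E = (V_i @ W_q.T) @ (V_j @ W_k.T).T   # adjacency scores, shape (N_i, N_j)
    G = softmax(E, axis=1)                # normalized adjacency weights
    keep = rng.random(G.shape) >= p_drop  # drop each edge with probability p_drop
    G_drop = G * keep / (1.0 - p_drop)    # inverted-dropout rescaling (assumption)
    return G_drop @ (V_j @ W_v.T), keep   # information flows from V_j to V_i

rng = np.random.default_rng(2)
N_i, N_j, D, D_out = 3, 5, 4, 2           # hypothetical sizes
V_i = rng.normal(size=(N_i, D))
V_j = rng.normal(size=(N_j, D))
W_q, W_k, W_v = (rng.normal(size=(D_out, D)) for _ in range(3))

fused, keep = fuse(V_i, V_j, W_q, W_k, W_v, p_drop=0.5, rng=rng)
print(fused.shape, keep.shape)  # (3, 2) (3, 5)
```

With `p_drop=0.0` every edge is kept and the function reduces to plain single-head cross-attention.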

Then we extend the above operation globally as follows:

\mathcal{G}=\mathcal{S}\circ\mathcal{D}(\mathcal{A})

(37)

\overline{\mathcal{V}}_{m}=\mathcal{G}\mathcal{W}_{v}\mathcal{V}_{m}

(38)

where $\circ$ represents function composition and $\mathcal{A}$ is defined in Equation 5.

The graph constructed in Equation 37 is in fact not structured at all: it ignores the separate modality-wise temporal features of the concatenated sequence, which leaves the sequence disordered. Moreover, it over-fuses inter-modal information, confuses inter-modal with intra-modal information, and leaves much fine-grained information unconsidered.

D.2 Why the Interlaced Mask?

Take the first block row of $\mathcal{A}$ as an example, $\mathbf{BR}=[\mathcal{E}^{t,t},\mathcal{E}^{t,v},\mathcal{E}^{t,a}]$. Recall that $\mathcal{V}_{m}=[\mathcal{V}_{t};\mathcal{V}_{v};\mathcal{V}_{a}]^{\top}$. Then $\mathcal{E}^{t,t}$ is aggregated from $\mathcal{V}_{t}$ itself, $\mathcal{E}^{t,v}$ from $\mathcal{V}_{t}$ and $\mathcal{V}_{v}$, and $\mathcal{E}^{t,a}$ from $\mathcal{V}_{t}$ and $\mathcal{V}_{a}$. As defined in Equations 31 and 32, the direction of aggregation of $\mathcal{E}^{i,j}$ is from $j$ to $i$.

Suppose the final output feature is computed without the interlaced mask. Note that the aggregation in this case is performed only on the text modal $t$.

\overline{\mathcal{V}}_{t}=\mathbf{BR}\cdot(\mathcal{W}_{v}\mathcal{V}_{m})

(39)

As shown in Figure 5, when only one or two blocks (subgraphs) are masked, the vertex sequences of different modals are treated as a single sequence because they are spliced together. This disorders the temporal information, which is not advisable.
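A small numerical example may make the effect concrete. The sketch below is illustrative only: it omits the learned projections (identity weights), uses hypothetical sequence lengths, and assumes that masking is realized by setting scores to $-\infty$ before the softmax. It measures how much attention mass each text vertex places on the text, vision, and audio segments, first without any mask and then with only the vision block kept.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
T_t, T_v, T_a, d = 2, 3, 4, 8                  # hypothetical lengths and width
V_t, V_v, V_a = (rng.normal(size=(T, d)) for T in (T_t, T_v, T_a))
V_m = np.concatenate([V_t, V_v, V_a])          # concatenated modal sequence

BR = V_t @ V_m.T                               # block row [E^{t,t}, E^{t,v}, E^{t,a}]
bounds = [0, T_t, T_t + T_v, T_t + T_v + T_a]

def segment_mass(G):
    """Attention mass each text vertex puts on the t, v and a segments."""
    return np.stack(
        [G[:, a:b].sum(axis=1) for a, b in zip(bounds, bounds[1:])], axis=1
    )

# No mask: one softmax runs over the spliced sequence, so all three
# modalities compete inside the same normalization.
unmasked = segment_mass(softmax(BR))

# Only the vision block is kept: the softmax normalizes over vision alone.
keep = np.zeros(bounds[-1], dtype=bool)
keep[bounds[1]:bounds[2]] = True
masked = segment_mass(softmax(np.where(keep, BR, -np.inf)))

print(unmasked.round(3))
print(masked.round(3))
```

In the unmasked case every column of the printed matrix is strictly positive, meaning each text vertex mixes all modalities in a single weighted sum; in the masked case all mass sits in the vision column.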

Appendix E Algorithms

E.1 Interlaced Mask Generation Algorithm

Algorithm 1 Interlaced Mask Generation

Input: Segmentation of the length of three-modal sequence $seg$ = $\{T_{t},T_{v},T_{a}\}$ , Mode of the mask generation $mode$ $\in$ $\{inter,intra\}$ , Direction of fusion procedure $dir$ $\in$ $\{forward,backward\}$ ;

Output: The generated mask of appointed mode and direction;

1: Let $\{l_{t},l_{v},l_{a}\}=seg$
2: Define segments $s1=(0,l_{t})$, $s2=(l_{t},l_{t}+l_{v})$, $s3=(l_{t}+l_{v},l_{t}+l_{v}+l_{a})$
3: Let $l_{sum}=l_{t}+l_{v}+l_{a}$
4: Initialize an empty list $\mathcal{M}_{list}$
5: for each $i\in[0,1,2]$
6:     for each element in $seg[i]$
7:         Initialize $m_{row}$ as a tensor of ones with size $l_{sum}$
8:         if $i==0$ then
9:             Set $m_{row}[0:s1[1]]=0$
10:            if $mode==inter$ then
11:                if $dir==forward$ then
12:                    Set $m_{row}[s3[0]:]=0$
13:                else if $dir==backward$ then
14:                    Set $m_{row}[s2[0]:s2[1]]=0$
15:                end if
16:            end if
17:        else if $i==1$ then
18:            Set $m_{row}[s2[0]:s2[1]]=0$
19:            if $mode==inter$ then
20:                if $dir==forward$ then
21:                    Set $m_{row}[0:s1[1]]=0$
22:                else if $dir==backward$ then
23:                    Set $m_{row}[s3[0]:]=0$
24:                end if
25:            end if
26:        else if $i==2$ then
27:            Set $m_{row}[s3[0]:s3[1]]=0$
28:            if $mode==inter$ then
29:                if $dir==forward$ then
30:                    Set $m_{row}[s2[0]:s2[1]]=0$
31:                else if $dir==backward$ then
32:                    Set $m_{row}[0:s1[1]]=0$
33:                end if
34:            end if
35:        end if
36:        Append $m_{row}$ to $\mathcal{M}_{list}$
37:    end for
38: end for
39: if $mode==inter$ then
40:     Let $\mathcal{M}=\text{Stack}(\mathcal{M}_{list})$
41:     return $\text{GenerateMask}(\mathcal{M})$
42: else if $mode==intra$ then
43:     return $\text{GenerateMask}(|\text{Stack}(\mathcal{M}_{list})-1|)$
44: end if

The algorithm above details how the interlaced mask is generated for the forward and backward inter-fusion as well as for the intra-enhancement. Constructing the graph structure of the concatenated sequence accurately is vital for our model. The masks can be constructed once, during initialization.
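For reference, Algorithm 1 can be condensed into a short pure-Python sketch. It is a compact equivalent rather than a line-by-line transcription, and it rests on assumptions not stated in the algorithm: an entry of 1 is read as a kept edge, the unspecified GenerateMask step (conversion to the attention mask format of the framework in use) is left out, and the function name `interlaced_mask` is hypothetical. Each row keeps exactly one block: its own modality in intra mode, the next modality cyclically in forward mode, and the previous one in backward mode.

```python
def interlaced_mask(seg, mode, direction="forward"):
    """Return an (L, L) list-of-lists 0/1 matrix with L = sum(seg).

    seg: lengths (T_t, T_v, T_a) of the text, vision and audio sequences.
    mode: "inter" (cross-modal fusion) or "intra" (within-modal enhancement).
    direction: "forward" or "backward"; only used when mode == "inter".
    A 1 marks a kept edge (assumption of this sketch).
    """
    if mode not in ("inter", "intra") or direction not in ("forward", "backward"):
        raise ValueError("unknown mode or direction")
    n = len(seg)
    bounds = [0]
    for length in seg:
        bounds.append(bounds[-1] + length)      # segment boundaries s1, s2, s3
    total = bounds[-1]

    rows = []
    for i, length in enumerate(seg):
        if mode == "intra":
            kept = i                            # own block only
        elif direction == "forward":
            kept = (i + 1) % n                  # t->v, v->a, a->t
        else:
            kept = (i + n - 1) % n              # t->a, v->t, a->v
        row = [0] * total
        row[bounds[kept]:bounds[kept + 1]] = [1] * seg[kept]
        rows.extend(list(row) for _ in range(length))  # one row per vertex
    return rows

for r in interlaced_mask((1, 2, 1), "inter", "forward"):
    print(r)
# [0, 1, 1, 0]
# [0, 0, 0, 1]
# [0, 0, 0, 1]
# [1, 0, 0, 0]
```

Because the mask depends only on the three sequence lengths, the mode, and the direction, it can be built once and reused for every batch with the same lengths.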

E.2 Extended Long Short Term Memory with Matrix Memory

C_{t}=f_{t}C_{t-1}+i_{t}v_{t}k_{t}^{\top}

(40)

n_{t}=f_{t}n_{t-1}+i_{t}k_{t}

(41)

h_{t}=o_{t}\odot\tilde{h}_{t},\quad\tilde{h}_{t}=\frac{C_{t}q_{t}}{\max\{\|n_{t}^{\top}q_{t}\|,1\}}

(42)

q_{t}=W_{q}x_{t}+b_{q}

(43)

k_{t}=\frac{1}{\sqrt{d}}W_{k}x_{t}+b_{k}

(44)

v_{t}=W_{v}x_{t}+b_{v}

(45)

i_{t}=\exp(\tilde{i}_{t}),\quad\tilde{i}_{t}=w_{i}^{\top}x_{t}+b_{i}

(46)

f_{t}=\sigma(\tilde{f}_{t})\ \text{OR}\ \exp(\tilde{f}_{t}),\quad\tilde{f}_{t}=w_{f}^{\top}x_{t}+b_{f}

(47)

o_{t}=\sigma(\tilde{o}_{t}),\quad\tilde{o}_{t}=W_{o}x_{t}+b_{o}

(48)

The forward pass of mLSTM is described by the equations above; the detailed architecture is shown in Figure 6.
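A single time step of these equations can be written out in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: it uses the sigmoid variant of the forget gate, treats $i_{t}$ and $f_{t}$ as scalars and $o_{t}$ as a vector, and stores parameters in a dictionary whose key names and sizes are hypothetical. The exponential input gate is left unstabilized, so the sketch is meant for short illustrative sequences only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x, C_prev, n_prev, p):
    """One mLSTM time step following Eqs. 40-48 (sigmoid forget gate).

    x: (D,) input; C_prev: (d, d) matrix memory; n_prev: (d,) normalizer state.
    p: dict of parameters; the key names are this sketch's own.
    """
    d = C_prev.shape[0]
    q = p["W_q"] @ x + p["b_q"]                 # Eq. 43
    k = (p["W_k"] @ x) / np.sqrt(d) + p["b_k"]  # Eq. 44
    v = p["W_v"] @ x + p["b_v"]                 # Eq. 45
    i = np.exp(p["w_i"] @ x + p["b_i"])         # Eq. 46, scalar input gate
    f = sigmoid(p["w_f"] @ x + p["b_f"])        # Eq. 47, scalar forget gate
    o = sigmoid(p["W_o"] @ x + p["b_o"])        # Eq. 48, vector output gate
    C = f * C_prev + i * np.outer(v, k)         # Eq. 40, matrix memory update
    n = f * n_prev + i * k                      # Eq. 41, normalizer update
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)    # Eq. 42
    return o * h_tilde, C, n

rng = np.random.default_rng(0)
D, d = 5, 4                                     # hypothetical sizes
p = {
    "W_q": rng.normal(size=(d, D)), "b_q": np.zeros(d),
    "W_k": rng.normal(size=(d, D)), "b_k": np.zeros(d),
    "W_v": rng.normal(size=(d, D)), "b_v": np.zeros(d),
    "w_i": 0.1 * rng.normal(size=D), "b_i": 0.0,
    "w_f": 0.1 * rng.normal(size=D), "b_f": 0.0,
    "W_o": rng.normal(size=(d, D)), "b_o": np.zeros(d),
}
C, n = np.zeros((d, d)), np.zeros(d)
for x in rng.normal(size=(3, D)):               # a sequence of three inputs
    h, C, n = mlstm_step(x, C, n, p)
print(h.shape, C.shape, n.shape)  # (4,) (4, 4) (4,)
```

Since $q_{t}$, $k_{t}$, $v_{t}$ and the gates depend only on $x_{t}$ and not on $h_{t-1}$, they can be computed for all time steps at once, which is what makes this memory update parallelizable.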