GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network for Multimodal Sentiment Analysis

Yijie Jin
Shanghai University
[email protected]
Abstract

Multimodal Sentiment Analysis (MSA) leverages multiple modals to analyze sentiments. Typically, advanced fusion methods and representation learning-based methods are designed to tackle it. Our proposed GSIFN addresses two key problems in MSA: (i) In multimodal fusion, existing methods decouple modal combinations and carry tremendous parameter redundancy, which leads to poor fusion performance and efficiency. (ii) Unimodal feature extractors and enhancers face a trade-off between representation capability and computation overhead. GSIFN incorporates two main components to solve these problems: (i) A Graph-Structured and Interlaced-Masked Multimodal Transformer, which adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computation overhead. (ii) A self-supervised learning framework with low computation overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computation overhead compared with state-of-the-art methods.

1 Introduction

With the increasingly widespread use of social media, users express sentiment through multiple forms of information, including text, video, and audio. To achieve more natural human-computer interaction, multimodal sentiment analysis (MSA) has become a popular research area (Peng et al., 2023b; Zhao et al., 2023; Wang et al., 2024; Zheng et al., 2024a, b). The MSA task relies on at least two data modals for sentiment polarity prediction. Specifically, its data form is usually a trimodal combination of text, vision, and audio. The main challenge of MSA is to integrate inconsistent sentiment information, thus achieving semantic disambiguation and effective sentiment analysis. MSA methods involve designing effective fusion strategies (Zadeh et al., 2017; Tsai et al., 2019; Zhang et al., 2023) to integrate heterogeneous data for comprehensive sentiment representation and semantic alignment, and developing representation learning strategies (Yu et al., 2021; Yang et al., 2023; Lin and Hu, 2024) to enhance unimodal information and model robustness.

Despite achieving some successes, existing approaches still face three main challenges. First, for the models that focus on modal fusion, the computation overhead rises due to the widespread use of modules based on the cross-modal attention mechanism (CMA-based modules). Moreover, different unidirectional bimodal combinations are decoupled and then fed into multiple independent CMA-based modules for fusion, which prevents such models from fully integrating trimodal representation information. Instead, they retain redundant information in the dominant modal of each bimodal combination. Therefore, these models are excessively redundant and in need of pruning. However, once the naive serial weight-sharing strategy (Hazarika et al., 2020) or a modal sequence concatenation operation is applied to share trimodal representation information and prune the model, information disorder occurs, which remains to be solved. Second, for the representation learning-based models, the feature extraction and representation modules for non-verbal modals cannot effectively balance the number of parameters and representation performance. Small models (GRU (Chung et al., 2014), LSTM (Hochreiter, 1997), etc.) or conventional extractors (OpenFace2.0 (Baltrusaitis et al., 2018), COVAREP (Degottex et al., 2014), etc.) usually cause excessive loss of non-verbal modal representation. In contrast, large models (ViT (Dosovitskiy et al., 2021), Wav2Vec (Schneider et al., 2019), etc.) bring better performance but incur excessive overhead. Third, models combining the above two approaches face both of these drawbacks, so it is vital to weigh the trade-offs.

To address the aforementioned issues, we propose a model called Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network, dubbed GSIFN. GSIFN has two attractive properties. First, in the process of multimodal fusion, it realizes efficient, low-overhead sharing of representation information without information disorder. To attain this, we propose a Graph-structured and interlaced-masked multimodal Transformer (GsiT), which is structured modal-wise in units of modal subgraphs. GsiT utilizes the Interlaced Mask (IM) mechanism to construct Multimodal Graph Embeddings (MGE): the Interlaced-Inter-Fusion Mask (IFM) constructs the fusion MGE, and the Interlaced-Intra-Enhancement Mask (IEM) constructs the enhancement MGE. Specifically, with shared information, IFM constructs two opposite unidirectional ring MGEs to realize a complete fusion procedure, while IEM constructs an internal enhancement MGE to realize the multimodal fusion enhancement. IM utilizes a weight-sharing strategy to achieve an all-modal-in-one fusion and enhancement mechanism. It also eliminates useless information, thereby improving fusion efficiency and achieving pruning. Second, it significantly reduces the computation overhead brought by non-verbal modal feature enhancement operations while ensuring the robustness and performance of the model. We employ a unimodal label generation module (ULGM) to enhance model robustness and apply an extended LSTM with matrix memory (mLSTM) to enhance non-verbal modal features in ULGM. mLSTM is fully parallelized and has a superior memory mechanism over LSTM, which allows it to deeply mine the semantic information of non-verbal modals. Additionally, using mLSTM avoids the huge computation overhead caused by large models, thus balancing the computation overhead and the representation capability of GSIFN. Overall, our contributions are as follows:

  • We propose GSIFN, a graph-structured and interlaced-masked multimodal transformer network. Experiments and ablation studies across various datasets validate its effectiveness and superiority.

  • We design GsiT, a graph-structured and interlaced-masked multimodal transformer that uses the Interlaced Mask mechanism to build multimodal graph embeddings from modal subgraphs. It ensures efficient, low-overhead information sharing, reduces spatio-temporal redundancy and noise, and yields a more compact and informative multimodal representation while lowering the module’s parameter count.

  • We employ mLSTM, an extended LSTM with matrix memory, to enhance non-verbal modal features utilized for unimodal label generation. This approach improves model robustness and representation capability and avoids the overhead of large models.

Figure 1: GSIFN Architecture.

2 Related Work

2.1 Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) is an increasingly popular research field. Its data includes at least two modals, of which text, vision, and audio are the most widely used. Earlier models focus on modal fusion. Zadeh et al. were among the pioneers in this field. They proposed TFN (Zadeh et al., 2017), which builds a power-set form of modal combination and realizes complete modal fusion by using the Cartesian product. Because TFN did not consider the temporal information of non-verbal modals, MFN (Zadeh et al., 2018) was designed. MFN uses an LSTM system to extract the temporal information of the three modals through explicit modal alignment (CTC, padding, etc.) and uses an attention mechanism and gated memory to realize the efficient fusion of multimodal temporal information.

With the rise of the Transformer, MulT (Tsai et al., 2019) proposes a cross-modal attention mechanism (CMA) from the perspective of modal translation. CMA can effectively integrate multimodal data while realizing implicit modal alignment. Based on MulT and CMA, models such as TETFN (Wang et al., 2023a) and ALMT (Zhang et al., 2023) use text data to enhance non-verbal modal data, because text data contains stronger emotional information. Thus, they achieve superior representation performance and modal fusion. MAG-BERT (Rahman et al., 2020) uses a Multimodal Adaptation Gate (MAG) to fine-tune BERT using multimodal data. CENet (Wang et al., 2023b) constructs non-verbal modal vocabularies, realizes non-verbal modal representation enhancement, and thereby enhances the MSA capability of fine-tuned BERT.

To improve the robustness of the model and the representation ability of non-verbal modals, and thus improve the overall multimodal sentiment analysis ability of the model, representation learning-based models such as Self-MM (Yu et al., 2021), ConFEDE (Yang et al., 2023), and MTMD (Lin and Hu, 2024) were proposed. They use self-supervised learning, contrastive learning, or knowledge distillation to achieve robust representations of the consistency and difference of modal information.

TETFN, MMML (Wu et al., 2024), and AcFormer (Zong et al., 2023) combine the multimodal Transformer with representation learning to effectively improve model performance, and verify that the two approaches can be combined so that each benefits from the other's strengths.

Because these methods rely heavily on the traditional multimodal Transformer architecture, they often have a large number of parameters in the core fusion module. Additionally, different fusion combinations are decoupled into multiple independent Transformers (Vaswani et al., 2017), so the interaction of modal information is insufficient and weight regularity is lacking. In our concrete implementation, we draw on the idea of graph attention networks (Velickovic et al., 2018; Brody et al., 2022) and construct a graph-structured multimodal Transformer with modal subgraph units.

2.2 Linear Attention Networks

In the field of natural language processing (NLP), reducing the computational cost of Transformers while maintaining performance has become a popular research topic. RWKV (Peng et al., 2023a), RetNet (Sun et al., 2023), Mamba (Gu and Dao, 2023), and Mamba-2 (Dao and Gu, 2024) are representative examples. xLSTM (Pöppel et al., 2024), as an extension of LSTM, introduces exponential gating to address the limitations of memory capacity and parallelization, especially when dealing with long sequences.

Figure 2: GsiT Architecture and IM Mechanism.

At the same time, recent works in the field of MSA have begun to use more advanced feature extractors to enhance non-verbal modal features, taking into account the weak representation capability of non-verbal modals. For instance, TETFN and AcFormer (Zong et al., 2023) use Vision Transformer (ViT) (Dosovitskiy et al., 2021) to extract vision features, AcFormer uses Wav2Vec (Schneider et al., 2019) to extract audio features, and MMML (Wu et al., 2024) uses raw audio data to fine-tune Data2Vec (Baevski et al., 2022). However, these methods often result in excessive growth in the number of parameters, with unclear improvement over traditional features. To reduce model parameters while preserving model performance, a self-supervised learning method is used to strengthen the capture and representation of sentiment information. In GSIFN, the mLSTM module of xLSTM is used to enhance the non-verbal input features for unimodal label generation, which significantly reduces the computation overhead while ensuring model performance.

3 Methodology

3.1 Preliminaries

The objective of multimodal sentiment analysis (MSA) is to evaluate sentiment polarity using multimodal data. Existing MSA datasets generally contain three modals: $t$, $v$, and $a$ represent text, vision, and audio, respectively. In addition, $m$ denotes multimodal. The input of the MSA task is $S_u \in \mathbb{R}^{T^s_u \times d^s_u}$, where $u \in \{t, v, a\}$, $T^s_u$ denotes the raw sequence length, and $d^s_u$ denotes the raw representation dimension of modal $u$. In this paper, we define multiple outputs $\hat{y}_u \in \mathbb{R}$, where $u \in \{t, v, a, m\}$: $\hat{y}_{\{t,v,a\}}$ denote the unimodal outputs, obtained for unimodal label generation, and $\hat{y}_m$ denotes the fusion output, obtained for the final prediction.
Other symbols are defined as follows. The fusion module inputs are $\{X_t, X_v, X_a\}$. The ULGM inputs are $\{\mathcal{X}_t, \mathcal{X}_v, \mathcal{X}_a\}$. The predictor input is $\mathcal{X}_m$. In particular, in the interpretation of GsiT, $\{X_t, X_v, X_a\}$ are abstracted to sequences of vertices $\{\mathcal{V}_t, \mathcal{V}_v, \mathcal{V}_a\}$. The labels are $y_u \in \mathbb{R}$, where $u \in \{t, v, a, m\}$: $y_{\{t,v,a\}}$ are the unimodal labels generated by ULGM, and $y_m$ is the ground-truth label for the fusion output.

3.2 Overall Architecture

The overview of our model is shown in Figure 1. It consists of three major parts: (1) Modal Encoding utilizes a tokenizer (for the text modal) and feature extractors and temporal enhancers (for the non-verbal modals, vision and audio) to convert raw multimodal data into numerical feature sequences. Enhanced non-verbal modal features are utilized for unimodal label generation. (2) Graph-Structured Multimodal Fusion takes the processed text, vision, and audio embeddings as input. Its core module, the graph-structured and interlaced-masked multimodal Transformer, utilizes interlaced masks to construct the multimodal graph embedding. It employs weight-sharing to facilitate comprehensive multimodal information interaction and eliminate redundant data, thereby enhancing fusion efficiency and enabling model pruning. (3) Self-Supervised Learning Framework generates final representations and defines positive and negative centers by projecting text features, enhanced vision and audio features, and the fusion output to hidden states. Unimodal labels are separately generated using the text, vision, and audio hidden states.

3.3 Modal Encoding

For the text modal, we use the pretrained Transformer BERT as the text encoder. The input text token sequence is constructed from the raw sentence $S_t = \{w_1, w_2, \dots, w_n\}$ concatenated with two special tokens ([CLS] at the head and [SEP] at the end), which forms $S_t' = \{\text{[CLS]}, w_1, w_2, \dots, w_n, \text{[SEP]}\}$. Then $S_t'$ is input into BERT to construct $\mathcal{X}_t$, which is used to generate text modal labels.

$$\mathcal{X}_t = \text{BERT}(S_t') = \{t_0, t_1, \dots, t_{n+1}\} \tag{1}$$

Following previous works (Tsai et al., 2019), the input sequences $X_{\{t,v,a\}}$ are obtained by applying a one-dimensional convolution layer to $\mathcal{X}_t$ and to the raw vision and audio sequences $S_{\{v,a\}}$.

$$X_t = \text{Conv1D}(\mathcal{X}_t) \tag{2}$$
$$X_{\{v,a\}} = \text{Conv1D}(S_{\{v,a\}}) \tag{3}$$
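As a rough illustration of Equations 2 and 3, the sketch below applies a temporal convolution to two sequences with different raw feature sizes so that both end up with the same feature dimension. The kernel sizes, the absence of padding, the dimensions, the random weights, and the helper names (`conv1d`, `rand_seq`, `rand_kernel`) are assumptions made for the example; the text above does not specify them.

```python
import random

random.seed(0)

def rand_seq(length, dim):
    """Hypothetical raw feature sequence of shape length x dim."""
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]

def rand_kernel(k, d_in, d_out):
    """Hypothetical convolution weights of shape k x d_in x d_out."""
    return [[[random.uniform(-1.0, 1.0) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(k)]

def conv1d(seq, kernel):
    """Stride-1 temporal convolution without padding.

    seq:    T x d_in
    kernel: k x d_in x d_out
    returns (T - k + 1) x d_out
    """
    k, d_out = len(kernel), len(kernel[0][0])
    out = []
    for start in range(len(seq) - k + 1):
        frame = [0.0] * d_out
        for offset in range(k):
            # Pair each input feature with its row of output weights.
            for x, weights in zip(seq[start + offset], kernel[offset]):
                for o in range(d_out):
                    frame[o] += x * weights[o]
        out.append(frame)
    return out

d_common = 2                      # assumed shared feature dimension
S_v = rand_seq(6, 4)              # vision: 6 steps, 4 raw features
S_a = rand_seq(5, 3)              # audio: 5 steps, 3 raw features
X_v = conv1d(S_v, rand_kernel(3, 4, d_common))
X_a = conv1d(S_a, rand_kernel(1, 3, d_common))

print(len(X_v), len(X_v[0]))
print(len(X_a), len(X_a[0]))
```

The sequence lengths may still differ after projection; only the feature dimension is unified, which is what the later concatenation into a single sequence requires.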

After that, we employ mLSTM, an extended Long Short-Term Memory that is fully parallelizable and has a matrix memory with a covariance update rule, as the temporal enhancer of the vision and audio modals. mLSTM improves the representation capability of the model while avoiding the overhead of large models. The detailed definition of mLSTM is in Appendix E.2.

We use mLSTM to enhance the temporal features of vision and audio.

$$\mathcal{X}_{\{v,a\}} = \text{mLSTM}(X_{\{v,a\}}) \tag{4}$$

The enhanced non-verbal modal features are utilized for unimodal label generation.
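For intuition about the matrix memory and covariance update rule, the following is a minimal single-head sketch of an mLSTM cell in recurrent form, based on the general formulation in the xLSTM work rather than on the definition in Appendix E.2. It is an assumption-laden simplification: biases, the gate stabilizer state, multi-head structure, and the parallel formulation are all omitted, and the weights and dimensions are random placeholders.

```python
import math
import random

random.seed(0)

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlstm_recurrent(xs, p):
    """Run a simplified single-head mLSTM over xs (T x d) step by step."""
    d = len(xs[0])
    C = [[0.0] * d for _ in range(d)]   # matrix memory
    n = [0.0] * d                       # normalizer state
    hs = []
    for x in xs:
        q = matvec(p["W_q"], x)
        k = [val / math.sqrt(d) for val in matvec(p["W_k"], x)]
        v = matvec(p["W_v"], x)
        i_gate = math.exp(dot(p["w_i"], x))    # exponential input gate (scalar)
        f_gate = sigmoid(dot(p["w_f"], x))     # forget gate (scalar)
        o_gate = [sigmoid(z) for z in matvec(p["W_o"], x)]
        # Covariance update: C <- f * C + i * v k^T
        C = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
             for r in range(d)]
        n = [f_gate * n_c + i_gate * k_c for n_c, k_c in zip(n, k)]
        # Retrieve with the query and normalize.
        denom = max(abs(dot(n, q)), 1.0)
        h_tilde = [val / denom for val in matvec(C, q)]
        hs.append([o * h for o, h in zip(o_gate, h_tilde)])
    return hs

def rand_mat(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

d, T = 3, 4   # hypothetical feature dimension and sequence length
params = {name: rand_mat(d, d) for name in ("W_q", "W_k", "W_v", "W_o")}
params["w_i"] = rand_mat(1, d)[0]
params["w_f"] = rand_mat(1, d)[0]

X_v = rand_mat(T, d)                      # stand-in for the Conv1D output
enhanced = mlstm_recurrent(X_v, params)   # plays the role of Equation 4
print(len(enhanced), len(enhanced[0]))
```

The output keeps the input's sequence length and feature dimension, so the enhanced sequence can replace the original one wherever a per-step representation is needed.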

3.4 Graph-Structured Multimodal Fusion

Following previous works (Tsai et al., 2019; Wang et al., 2023a), we only use the low-level temporal feature sequences $\{X_t, X_v, X_a\}$ as the input of multimodal fusion. These sequences are regarded as graph vertex sequences $\{\mathcal{V}_t, \mathcal{V}_v, \mathcal{V}_a\}$, which are then concatenated into a single sequence $\mathcal{V}_m = [\mathcal{V}_t; \mathcal{V}_v; \mathcal{V}_a]^\top$. $\mathcal{V}_m$ is treated as the multimodal graph embedding (MGE). The architecture of the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT) is shown in Figure 2.

Graph Structure Construction To start with, we use the self-attention mechanism as the basis for constructing a naive fully connected graph. The attention weight matrix is regarded as the adjacency matrix $\mathcal{A}$ with dynamic weights. In $\mathcal{A}$, $\mathcal{E}^{i,j} \in \mathbb{R}^{T_i \times T_j}$, with $i, j \in \{t, v, a\}$, is the adjacency matrix of the subgraph constructed by $\mathcal{V}_i$ and $\mathcal{V}_j$.

$$
\begin{aligned}
\mathcal{A} &= (\mathcal{W}_q \mathcal{V}_m) \cdot (\mathcal{W}_k \mathcal{V}_m)^\top \\
&= \begin{pmatrix}
\mathcal{E}^{t,t} & \mathcal{E}^{t,v} & \mathcal{E}^{t,a} \\
\mathcal{E}^{v,t} & \mathcal{E}^{v,v} & \mathcal{E}^{v,a} \\
\mathcal{E}^{a,t} & \mathcal{E}^{a,v} & \mathcal{E}^{a,a}
\end{pmatrix}
\end{aligned}
\tag{5}
$$

The derivation process of detailed graph structure construction (from vertex to subgraph) is in Appendix D.1.
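To make the block structure of Equation 5 concrete, the following minimal sketch computes unnormalized attention scores on a concatenated sequence and slices out one subgraph block. The sequence lengths, the feature dimension, the random inputs, and the helper names (`matmul`, `block`) are illustrative assumptions. Vertices are stored as rows here, so the projections are applied on the right.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical sequence lengths per modal and a shared feature dimension.
lengths = {"t": 3, "v": 2, "a": 4}
d = 5

# V_m = [V_t; V_v; V_a], one row per vertex.
V_m = [row for u in "tva" for row in rand_matrix(lengths[u], d)]
W_q, W_k = rand_matrix(d, d), rand_matrix(d, d)

# Raw attention scores, read as dynamic edge weights of a fully connected graph.
A = matmul(matmul(V_m, W_q), transpose(matmul(V_m, W_k)))

# Start offset of each modal inside the concatenated sequence.
offsets, pos = {}, 0
for u in "tva":
    offsets[u] = pos
    pos += lengths[u]

def block(matrix, i, j):
    """Return the subgraph adjacency block E^{i,j} of shape T_i x T_j."""
    rows = range(offsets[i], offsets[i] + lengths[i])
    cols = range(offsets[j], offsets[j] + lengths[j])
    return [[matrix[r][c] for c in cols] for r in rows]

E_tv = block(A, "t", "v")
print(len(A), len(A[0]))        # 9 9
print(len(E_tv), len(E_tv[0]))  # 3 2
```

Each of the nine blocks is addressed purely by modal offsets, which is what allows a mask to be defined modal-wise rather than token-wise.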

Interlaced Mask Mechanism Interlaced Mask (IM) is a modal-wise mask mechanism, so every element of the mask matrix is a subgraph adjacency matrix and the mask matrix is represented as a block matrix. The construction procedure of IM is described in detail below. The computation procedure with IM is shown in Figure 2.

To start with, to avoid the influence of the intra-modal subgraphs $\mathcal{E}^{i,i}$, $i \in \{t, v, a\}$, we apply a modal-wise intra mask as shown in Equation 6. We define $\mathcal{O}^{i,j} \in \mathbb{R}^{T_i \times T_j}$ as the all-zero matrix and $\mathcal{J}^{i,j} \in \mathbb{R}^{T_i \times T_j}$ as the matrix whose entries are all negative infinity.

$$
\mathcal{M}_{inter} = \begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{O}^{t,a} \\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a} \\
\mathcal{O}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\tag{6}
$$
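The mask of Equation 6 can be expanded from its modal-wise block pattern into a token-level additive mask, as in the sketch below. The helper `build_block_mask`, the rule passed as `is_open`, and the sequence lengths are hypothetical names and values chosen for the example.

```python
NEG_INF = float("-inf")

def build_block_mask(lengths, order, is_open):
    """Expand a modal-wise rule into a token-level additive mask.

    is_open(i, j) == True  -> zero block O^{i,j} (attention allowed)
    is_open(i, j) == False -> negative-infinity block J^{i,j} (attention blocked)
    """
    mask = []
    for i in order:
        for _ in range(lengths[i]):
            row = []
            for j in order:
                value = 0.0 if is_open(i, j) else NEG_INF
                row.extend([value] * lengths[j])
            mask.append(row)
    return mask

lengths = {"t": 2, "v": 1, "a": 2}   # hypothetical sequence lengths
M_inter = build_block_mask(lengths, "tva", lambda i, j: i != j)

for row in M_inter:
    print(row)
```

Because only the rule `is_open` changes between masks, the same helper could produce other block patterns by passing a different rule.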

$\mathcal{M}_{inter}$ already prevents cross-modal fusion from being affected by intra-modal subgraphs. However, in the fusion procedure, different modal sequences should not be recognized as the same sequence. Therefore, we extend $\mathcal{M}_{inter}$ to the following two mask matrices, which together are called the Interlaced-Inter-Fusion Mask (IFM). The explanation of the aforementioned information disorder is in Appendix D.2.

$$
\begin{cases}
\mathcal{M}_{inter}^{forward} = \begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{J}^{t,a} \\
\mathcal{J}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a} \\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix} \\[2ex]
\mathcal{M}_{inter}^{backward} = \begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{J}^{t,v} & \mathcal{O}^{t,a} \\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{J}^{v,a} \\
\mathcal{J}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\end{cases}
\tag{7}
$$
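At the block level, the two masks of Equation 7 can be read as cyclic shifts over the modal order $(t, v, a)$. The sketch below generates both patterns from a shift and checks that they are disjoint and that together they open exactly the off-diagonal blocks of Equation 6. Rows are taken as the attending (query) modal and columns as the attended (key) modal, following the layout of Equation 5; the function names are hypothetical.

```python
MODALS = ["t", "v", "a"]

def ring_pattern(step):
    """3x3 block pattern: 'O' where query modal i attends key modal j, else 'J'."""
    n = len(MODALS)
    return [["O" if j == (i + step) % n else "J" for j in range(n)]
            for i in range(n)]

def open_blocks(pattern):
    """Set of (query modal, key modal) pairs that the pattern leaves unmasked."""
    return {(MODALS[i], MODALS[j])
            for i in range(len(MODALS))
            for j in range(len(MODALS))
            if pattern[i][j] == "O"}

forward = ring_pattern(1)     # t attends v, v attends a, a attends t
backward = ring_pattern(-1)   # t attends a, v attends t, a attends v

for name, pattern in (("forward", forward), ("backward", backward)):
    print(name)
    for row in pattern:
        print("  " + " ".join(row))

off_diagonal = {(i, j) for i in MODALS for j in MODALS if i != j}
print(open_blocks(forward) | open_blocks(backward) == off_diagonal)  # True
print(open_blocks(forward) & open_blocks(backward) == set())         # True
```

In each pattern every modal attends to exactly one other modal, so following the open blocks traces a single directed ring through all three modals.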

Based on the two matrices, two opposite unidirectional ring graphs can be constructed to achieve a complete fusion procedure. We define the softmax operation as $\mathcal{S}$, the dropout operation as $\mathcal{D}$, and the function composition operator as $\circ$.

{𝒢interforward=𝒮𝒟(𝒜+interforward)𝒢interbackward=𝒮𝒟(𝒜+interbackward)casesmissing-subexpressionsuperscriptsubscript𝒢𝑖𝑛𝑡𝑒𝑟𝑓𝑜𝑟𝑤𝑎𝑟𝑑𝒮𝒟𝒜superscriptsubscript𝑖𝑛𝑡𝑒𝑟𝑓𝑜𝑟𝑤𝑎𝑟𝑑missing-subexpressionsuperscriptsubscript𝒢𝑖𝑛𝑡𝑒𝑟𝑏𝑎𝑐𝑘𝑤𝑎𝑟𝑑𝒮𝒟𝒜superscriptsubscript𝑖𝑛𝑡𝑒𝑟𝑏𝑎𝑐𝑘𝑤𝑎𝑟𝑑otherwise\begin{cases}\begin{aligned} &\mathcal{G}_{inter}^{forward}=\mathcal{S}\circ% \mathcal{D}(\mathcal{A}+\mathcal{M}_{inter}^{forward})\\ &\mathcal{G}_{inter}^{backward}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}+% \mathcal{M}_{inter}^{backward})\end{aligned}\end{cases}{ start_ROW start_CELL start_ROW start_CELL end_CELL start_CELL caligraphic_G start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_o italic_r italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT = caligraphic_S ∘ caligraphic_D ( caligraphic_A + caligraphic_M start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_o italic_r italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL caligraphic_G start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b italic_a italic_c italic_k italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT = caligraphic_S ∘ caligraphic_D ( caligraphic_A + caligraphic_M start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b italic_a italic_c italic_k italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT ) end_CELL end_ROW end_CELL start_CELL end_CELL end_ROW (8)

By now, 𝒢interforwardsuperscriptsubscript𝒢𝑖𝑛𝑡𝑒𝑟𝑓𝑜𝑟𝑤𝑎𝑟𝑑\mathcal{G}_{inter}^{forward}caligraphic_G start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_o italic_r italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT and 𝒢interbackwardsuperscriptsubscript𝒢𝑖𝑛𝑡𝑒𝑟𝑏𝑎𝑐𝑘𝑤𝑎𝑟𝑑\mathcal{G}_{inter}^{backward}caligraphic_G start_POSTSUBSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b italic_a italic_c italic_k italic_w italic_a italic_r italic_d end_POSTSUPERSCRIPT truly make MGE 𝒱msubscript𝒱𝑚\mathcal{V}_{m}caligraphic_V start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT graph-structured. Both of the two matrices manage to aggregate the information of the trimodal without temporal disorder and intra-modal information influence.
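The mask construction in Equations 7 and 8 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `interlaced_mask` and `masked_softmax` are hypothetical, the sequence is assumed to be the concatenation [text, vision, audio], $\mathcal{O}$ blocks are assumed to be zeros, and dropout is omitted (as at inference time).

```python
import math

NEG_INF = float("-inf")


def interlaced_mask(lengths, direction):
    """Block mask over the concatenated [text, vision, audio] sequence.

    lengths   -- (T_t, T_v, T_a), all assumed >= 1
    direction -- "forward" or "backward"
    An entry of 0.0 is visible (an O block); -inf is masked (a J block).
    """
    # Equation 7, read row by row (query modality -> visible key modality):
    #   forward : t -> v, v -> a, a -> t
    #   backward: t -> a, v -> t, a -> v
    shift = 1 if direction == "forward" else 2
    # owner[i] is the modality index (0=t, 1=v, 2=a) of position i.
    owner = [m for m, n in enumerate(lengths) for _ in range(n)]
    return [
        [0.0 if ok == (oq + shift) % 3 else NEG_INF for ok in owner]
        for oq in owner
    ]


def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); masked entries get weight 0."""
    graph = []
    for score_row, mask_row in zip(scores, mask):
        logits = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(logits)  # finite, since each row has a visible block
        exps = [math.exp(x - peak) for x in logits]  # exp(-inf) is 0.0
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph


# Toy example: 2 text steps, 1 vision step, 1 audio step, uniform scores.
lengths = (2, 1, 1)
scores = [[0.0] * sum(lengths) for _ in range(sum(lengths))]
graph_forward = masked_softmax(scores, interlaced_mask(lengths, "forward"))
graph_backward = masked_softmax(scores, interlaced_mask(lengths, "backward"))
```

In this toy setting each text position attends only to vision positions in the forward graph and only to audio positions in the backward graph, so the two graphs together cover all six cross-modal directions.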

After aggregation, the fusion process is performed.

$$
\begin{cases}
\overline{\mathcal{V}}_{m}^{forward}=\mathcal{G}_{inter}^{forward}\mathcal{W}_{v}\mathcal{V}_{m}\\
\overline{\mathcal{V}}_{m}^{backward}=\mathcal{G}_{inter}^{backward}\mathcal{W}_{v}\mathcal{V}_{m}
\end{cases}
\tag{9}
$$

where $\mathcal{W}_{v}$ denotes the value projection weight of $\mathcal{V}_{m}$.

As shown in Figure 2, the IFM constructs two MGEs in two separate Transformers; they form two opposite unidirectional rings. Because the two rings together cover every cross-modal direction in Equation 7, a complete fusion process is achieved.

After fusion, the intra-modal subgraphs need to be enhanced accordingly. Therefore, the Interlaced-Intra-Enhancement Mask (IEM) is constructed.

$$
\mathcal{M}_{intra}=\mathcal{J}-\mathcal{M}_{inter}
\tag{10}
$$

where $\mathcal{J}$ denotes a negative-infinity matrix of the same size as $\mathcal{M}_{inter}$.

$\mathcal{M}_{intra}$ leaves only the intra-modal subgraphs visible to enhance the fused MGEs.
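As a small illustration of the IEM, the sketch below builds a mask in which only same-modality blocks are visible, which is the behaviour the text describes for $\mathcal{M}_{intra}$. It is an assumption about the intended result of Equation 10 rather than a literal evaluation of it (in floating point, subtracting negative infinity from negative infinity is undefined), and the name `intra_mask` is hypothetical.

```python
NEG_INF = float("-inf")


def intra_mask(lengths):
    """Block-diagonal mask: 0.0 inside a modality, -inf across modalities.

    lengths -- per-modality sequence lengths, e.g. (T_t, T_v, T_a)
    """
    # owner[i] is the modality index of position i in the concatenation.
    owner = [m for m, n in enumerate(lengths) for _ in range(n)]
    return [[0.0 if ok == oq else NEG_INF for ok in owner] for oq in owner]


# Toy example: 2 text steps, 1 vision step, 1 audio step.
mask = intra_mask((2, 1, 1))
```

Adding this mask to the attention scores before the softmax confines each position to keys from its own modality.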

After the IEM is constructed, the two opposite unidirectional ring MGEs are concatenated on the feature dimension into one bidirectional MGE. We define $\parallel$ as the concatenation operation on the feature dimension.

$$
\overline{\mathcal{V}}_{m}^{bidirection}=\parallel\overline{\mathcal{V}}_{m}^{\{forward,backward\}}
\tag{11}
$$

Using the bidirectional MGE $\overline{\mathcal{V}}_{m}^{bidirection}$ and $\mathcal{M}_{intra}$, the intra-modal enhancement graph can be constructed. We write $\overline{\mathcal{V}}_{m}^{b}=\overline{\mathcal{V}}_{m}^{bidirection}$ and define $\mathcal{W}_{q}^{b}$ and $\mathcal{W}_{k}^{b}$ as the query and key projection weights of $\overline{\mathcal{V}}_{m}^{b}$.

$$
\mathcal{A}_{fusion}=(\mathcal{W}_{q}^{b}\overline{\mathcal{V}}_{m}^{b})\cdot(\mathcal{W}_{k}^{b}\overline{\mathcal{V}}_{m}^{b})^{\top}
\tag{12}
$$

$$
\mathcal{G}_{intra}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}_{fusion}+\mathcal{M}_{intra})
\tag{13}
$$

Then, we construct the final feature sequence.

$$
\overline{\mathcal{V}}_{m}=\mathcal{G}_{intra}\mathcal{W}_{v}^{b}\overline{\mathcal{V}}_{m}^{b}
\tag{14}
$$

where $\mathcal{W}_{v}^{b}$ denotes the value projection weight of $\overline{\mathcal{V}}_{m}^{b}$.

Finally, the sequence is decomposed according to the lengths of the original feature sequences. Then, the final hidden states of the different modalities are concatenated on the feature dimension to construct the fusion feature $\mathcal{X}_{m}$.
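The decomposition step can be illustrated as follows. This is a sketch under stated assumptions: the function name `split_and_fuse` is hypothetical, the fused sequence is assumed to be ordered [text, vision, audio], and the "final hidden state" of a modality is assumed to be the feature vector at its last time step.

```python
def split_and_fuse(sequence, lengths):
    """Split a fused sequence by modality and concatenate final states.

    sequence -- list of feature vectors, one per position
    lengths  -- original per-modality lengths, e.g. (T_t, T_v, T_a)
    Returns one flat feature vector (the fusion feature).
    """
    assert sum(lengths) == len(sequence), "lengths must cover the sequence"
    fused, start = [], 0
    for n in lengths:
        segment = sequence[start:start + n]  # positions of one modality
        fused.extend(segment[-1])            # assumed: last step = final state
        start += n
    return fused


# Toy example: 2 text steps, 1 vision step, 1 audio step, feature size 2.
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fusion_feature = split_and_fuse(sequence, (2, 1, 1))
```

With three modalities and feature size d, the resulting fusion feature has size 3d.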

The detailed generation algorithm of IM is described in Appendix E.1.

3.5 Self-Supervised Learning Framework

A unimodal label generation module (ULGM) is integrated into our approach to capture unimodal-specific information. As shown in Figure 1, we use input features $\mathcal{X}_{\{t,v,a\}}$ to generate unimodal final hidden states $\hat{y}_{\{t,v,a\}}$. During the prediction process, ULGM uses $h_{\{t,v,a\}}$ and the ground truth multimodal labels to define positive and negative centers, which are determined based on the predicted unimodal labels and multimodal fusion representations. Next, we calculate the relative distance of each modal representation from the positive and negative centers. Then, we generate new unimodal labels $y_{\{t,v,a\}}^{i}$ from the unimodal labels and the ground truth multimodal label, where $i$ denotes the $i$-th training iteration. In this way, the model can better capture the distinguishing information of the different modalities while maintaining the consistency of each modality.

Using the predicted results $\hat{y}_{\{m,t,v,a\}}$, the ground truth multimodal label $y_{m}$, and the generated labels $y_{\{t,v,a\}}$, we implement a weighted loss to optimize our model.

The weighted loss is defined by Equation 15, whereas the loss for each individual modality is defined by Equation 16.

$$
\mathcal{L}_{w}=\sum_{u\in\{m,t,v,a\}}\mathcal{L}_{u}
\tag{15}
$$

$$
\begin{aligned}
&\mathcal{L}_{u}=\frac{\sum_{i=0}^{\mathcal{B}} w_{u}^{i}*|\hat{y}_{u}^{i}-y_{u}^{i}|}{\mathcal{B}}\\
&w_{u}^{i}=\begin{cases}
1 & u=m\\
\tanh(|\hat{y}_{u}^{i}-\hat{y}_{m}^{i}|) & u\in\{t,v,a\}
\end{cases}
\end{aligned}
\tag{16}
$$

where $\mathcal{B}$ denotes the batch size.
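To make Equations 15 and 16 concrete, the following sketch computes the weighted loss for one batch in plain Python. It is an illustration only: the function names and the dictionary layout keyed by "m", "t", "v", "a" are assumptions, and the sum is taken over the samples actually present in the batch and divided by their count.

```python
import math


def unimodal_loss(pred_u, label_u, pred_m, is_multimodal):
    """Equation 16: weighted mean absolute error for one branch u."""
    total = 0.0
    for y_hat_u, y_u, y_hat_m in zip(pred_u, label_u, pred_m):
        # w = 1 for the multimodal branch, tanh(|y_hat_u - y_hat_m|) otherwise.
        weight = 1.0 if is_multimodal else math.tanh(abs(y_hat_u - y_hat_m))
        total += weight * abs(y_hat_u - y_u)
    return total / len(pred_u)


def weighted_loss(preds, labels):
    """Equation 15: sum of the per-branch losses over m, t, v, a."""
    return sum(
        unimodal_loss(preds[u], labels[u], preds["m"], u == "m")
        for u in ("m", "t", "v", "a")
    )


# Toy batch with a single sample per branch.
preds = {"m": [1.0], "t": [1.0], "v": [0.0], "a": [2.0]}
labels = {"m": [0.5], "t": [0.0], "v": [0.0], "a": [1.0]}
loss = weighted_loss(preds, labels)
```

In this toy batch the text branch contributes nothing because its prediction equals the multimodal prediction (weight tanh(0) = 0), which shows how the weight focuses training on samples where a unimodal prediction departs from the multimodal one.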

Table 1: Comparison on CMU-MOSI and CMU-MOSEI. For Acc-2 and F1, values are given as negative/non-negative / negative/positive.

| Model | MOSI Acc-2↑ | MOSI F1↑ | MOSI Acc-7↑ | MOSI MAE↓ | MOSI Corr↑ | MOSEI Acc-2↑ | MOSEI F1↑ | MOSEI Acc-7↑ | MOSEI MAE↓ | MOSEI Corr↑ | Data State |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MulT* | 83.0 / - | 82.8 / - | 40.0 | 0.871 | 0.698 | 81.6 / - | 81.6 / - | 50.7 | 0.591 | 0.694 | Unaligned |
| MTAG* | 82.3 / - | 82.1 / - | 38.9 | 0.866 | 0.722 | - / - | - / - | - | - | - | Unaligned |
| MISA* | 81.8 / 83.4 | 81.7 / 83.6 | 42.3 | 0.783 | 0.761 | 83.6 / 85.5 | 83.8 / 85.3 | 52.2 | 0.555 | 0.756 | Unaligned |
| HyCon-BERT* | - / 85.2 | - / 85.1 | 46.6 | 0.713 | 0.790 | - / 85.4 | - / 85.6 | 52.8 | 0.601 | 0.776 | Aligned |
| ConFEDE* | 84.2 / 85.5 | 84.1 / 85.5 | 42.3 | 0.742 | 0.784 | 81.7 / 85.8 | 82.2 / 85.8 | 54.9 | 0.522 | 0.780 | Unaligned |
| MMIN* | 83.5 / 85.5 | 83.5 / 85.5 | - | 0.741 | 0.795 | 83.8 / 85.9 | 83.9 / 85.8 | - | 0.542 | 0.761 | Unaligned |
| MTMD* | 84.0 / 86.0 | 83.9 / 86.0 | 47.5 | 0.705 | 0.799 | 84.8 / 86.1 | 84.9 / 85.9 | 53.7 | 0.531 | 0.767 | Unaligned |
| MulT† | 79.6 / 81.4 | 79.1 / 81.0 | 36.2 | 0.923 | 0.686 | 78.1 / 83.7 | 78.9 / 83.7 | 53.4 | 0.559 | 0.740 | Unaligned |
| CENet-BERT† | 82.8 / 84.5 | 82.7 / 84.5 | 45.2 | 0.736 | 0.793 | 81.7 / 82.3 | 81.6 / 81.9 | 52.0 | 0.576 | 0.711 | Aligned |
| Self-MM† | 82.2 / 83.5 | 82.3 / 83.6 | 43.9 | 0.758 | 0.792 | 80.8 / 85.0 | 81.3 / 84.9 | 53.3 | 0.539 | 0.761 | Unaligned |
| TETFN† | 82.4 / 84.0 | 82.4 / 84.1 | 46.1 | 0.749 | 0.784 | 81.9 / 84.3 | 82.1 / 84.1 | 52.7 | 0.576 | 0.728 | Unaligned |
| GSIFN | 85.0 / 86.0 | 85.0 / 86.0 | 48.3 | 0.707 | 0.801 | 85.0 / 86.3 | 85.1 / 86.2 | 53.4 | 0.538 | 0.767 | Unaligned |

Table 2: Comparison on CH-SIMS.

| Model | Acc-2↑ | Acc-3↑ | Acc-5↑ | F1↑ | MAE↓ | Corr↑ |
|---|---|---|---|---|---|---|
| TFN† | 77.7 | 66.3 | 42.7 | 77.7 | 0.436 | 0.582 |
| MFN† | 77.8 | 65.4 | 38.8 | 77.6 | 0.443 | 0.566 |
| MulT† | 77.8 | 65.3 | 38.2 | 77.7 | 0.443 | 0.578 |
| MISA† | 75.3 | 62.4 | 35.5 | 75.4 | 0.457 | 0.553 |
| Self-MM† | 78.1 | 65.2 | 41.3 | 78.2 | 0.423 | 0.585 |
| TETFN† | 78.0 | 64.4 | 42.9 | 78.0 | 0.425 | 0.582 |
| GSIFN | 80.5 | 67.2 | 45.5 | 80.7 | 0.397 | 0.619 |

Table 3: Ablation study on CMU-MOSI.

| Description | Acc-2↑ | F1↑ | Acc-7↑ | MAE↓ | Corr↑ |
|---|---|---|---|---|---|
| GSIFN | 85.0 / 86.0 | 85.0 / 86.0 | 48.3 | 0.707 | 0.801 |
| w/o GsiT | 83.8 / 85.5 | 83.2 / 85.7 | 46.5 | 0.742 | 0.790 |
| w/o mLSTM | 84.6 / 86.0 | 84.5 / 86.0 | 47.2 | 0.730 | 0.792 |
| w/o ULGM | 83.4 / 84.8 | 83.4 / 84.8 | 46.7 | 0.711 | 0.801 |

4 Experiment

We evaluate our model on three benchmarks: CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI (Bagher Zadeh et al., 2018), and CH-SIMS (Yu et al., 2020). These datasets provide aligned (CMU-MOSI, CMU-MOSEI) and unaligned (all) multimodal data (text, vision, and audio) for each utterance. Further details are in Appendix B.

Following prior works, we adopt several evaluation metrics: binary classification accuracy (Acc-2), F1 score (F1), three-class accuracy (Acc-3), five-class accuracy (Acc-5), seven-class accuracy (Acc-7), mean absolute error (MAE), and the correlation of the model's predictions with human annotations (Corr). Acc-3 and Acc-5 are applied only to the CH-SIMS dataset. On CMU-MOSI and CMU-MOSEI, Acc-2 and F1 are each calculated in two ways: negative/non-negative (NN) and negative/positive (NP).
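The two Acc-2 variants differ only in how neutral samples are treated. The sketch below assumes the convention commonly used in public MSA toolkits: NN compares the sign tests `>= 0` on all samples, while NP discards samples whose label is exactly zero and compares `> 0`. The function name `acc2` is hypothetical, and the scores are assumed to be regression values centred on zero.

```python
def acc2(preds, labels, mode):
    """Binary accuracy under the NN or NP convention.

    mode -- "NN": negative vs. non-negative, all samples kept
            "NP": negative vs. positive, zero-labelled samples dropped
    """
    if mode == "NN":
        pairs = [(p >= 0, y >= 0) for p, y in zip(preds, labels)]
    elif mode == "NP":
        pairs = [(p > 0, y > 0) for p, y in zip(preds, labels) if y != 0]
    else:
        raise ValueError("mode must be 'NN' or 'NP'")
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy example with one neutral label (0.0).
preds = [0.5, -0.2, 0.1, -1.0]
labels = [1.0, 0.0, -0.5, -2.0]
acc_nn = acc2(preds, labels, "NN")
acc_np = acc2(preds, labels, "NP")
```

On this toy data the neutral sample counts as an error under NN but is ignored under NP, which is why the two numbers reported on either side of the "/" can differ.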

For CMU-MOSI and CMU-MOSEI, we choose MulT (Tsai et al., 2019), MTAG (Yang et al., 2021), MISA (Hazarika et al., 2020), HyCon-BERT (Mai et al., 2023), TETFN (Wang et al., 2023a), ConFEDE (Yang et al., 2023), MMIN (Fang et al., 2024), MTMD (Lin and Hu, 2024), CENet-BERT (Wang et al., 2023b), and Self-MM (Yu et al., 2021) as baselines. For CH-SIMS, we choose TFN (Zadeh et al., 2017), MFN (Zadeh et al., 2018), MISA, MulT, Self-MM, and TETFN. All of these are previous state-of-the-art (SOTA) methods. Further details are in Appendix C.

Detailed experiment settings of hyperparameters and feature extraction methods are in Appendix A.1.

Table 4: Comparison of GsiT and MulT on CMU-MOSI and CMU-MOSEI.

| Model | MOSI Acc-2↑ | MOSI F1↑ | MOSI Acc-7↑ | MOSI MAE↓ | MOSI Corr↑ | MOSEI Acc-2↑ | MOSEI F1↑ | MOSEI Acc-7↑ | MOSEI MAE↓ | MOSEI Corr↑ | Params (M) | FLOPS (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MulT | 79.6 / 81.4 | 79.1 / 81.0 | 36.2 | 0.923 | 0.686 | 78.1 / 83.7 | 78.9 / 83.7 | 53.4 | 0.559 | 0.740 | 4.362 | 105.174 |
| GsiT | 83.4 / 84.9 | 83.4 / 85.0 | 45.5 | 0.716 | 0.803 | 84.1 / 86.3 | 84.4 / 86.3 | 53.5 | 0.539 | 0.774 | 0.891 | 25.983 |

Table 5: The computational overhead of different vision/audio modality enhancement models.

| Model | mLSTM(V) | mLSTM(A) | ViT | Wav2Vec | Whisper |
|---|---|---|---|---|---|
| Params (M) | 0.439 | 0.439 | 127.272 | 94.395 | 17.120 |
| FLOPS (G) | 1.674 | 1.252 | 35.469 | 68.543 | 315.128 |

Table 6: Comparison with models using large model extractors on CMU-MOSI. Note: OF denotes OpenFace, CR denotes COVAREP.

| Model | Acc-2↑ | F1↑ | Acc-7↑ | MAE↓ | Corr↑ | Extractor (V/A) | Enhancer |
|---|---|---|---|---|---|---|---|
| GSIFN | 85.0 / 86.0 | 85.0 / 86.0 | 48.3 | 0.707 | 0.801 | OF/CR | mLSTM |
| TETFN | 84.1 / 86.1 | 83.8 / 86.1 | 46.5 | 0.717 | 0.800 | ViT/CR | LSTM |
| AcFormer | 82.3 / 85.4 | 82.1 / 85.2 | 44.2 | 0.742 | 0.794 | ViT/Wav2Vec | Transformer |

4.1 Results

The performance comparison of all methods on CMU-MOSI, CMU-MOSEI, and CH-SIMS is summarized in Table 1 and Table 2.

For all metrics, the best results are highlighted in bold, the second-best results are double-underlined, and the third-best results are single-underlined. † denotes that the model is sourced from the GitHub page (footnote 1) and the scores are reproduced; * denotes that the result is obtained directly from the original paper.

GSIFN is trained end-to-end without any pre-training. All reproduced results are obtained in the same experimental environment to ensure consistency and fairness. Except for MulT, the reproduced results differ greatly from the original results.

In Table 1, for a fair comparison on CMU-MOSI and CMU-MOSEI, we split the models into two categories based on data state: Unaligned and Aligned. For Acc-2 and F1, the value to the left of the "/" corresponds to "negative/non-negative" and the value to the right corresponds to "negative/positive".

As shown in Tables 1 and 2, GSIFN outperforms all of the previous SOTA methods on most metrics across all datasets. On the remaining metrics, namely Acc-7, MAE, and Corr on CMU-MOSEI and MAE on CMU-MOSI, it still reaches at least the third-best performance. We attribute these results to the all-modal-in-one fusion and the enhanced self-supervised learning in GSIFN.

Footnote 1: https://github.com/thuiar/MMSA

4.2 Ablation Study

In this section, we discuss the ablation study on modules reported in Table 3. Further ablation study results are in Appendix A.2.

There are three main modules in our model: the Graph-Structured Interlaced-Masked Multimodal Transformer (GsiT) for multimodal fusion, the extended LSTM with matrix memory (mLSTM) for vision and audio temporal enhancement, and the Unimodal Label Generation Module (ULGM) for self-supervision. In Table 3, w/o denotes the absence of the corresponding module from the model.

The results in Table 3 indicate that all the modules are necessary for achieving SOTA performance. The GsiT module realizes all-modal-in-one Transformer-based fusion; without it, the performance of the whole model decreases substantially on all metrics. GsiT is the core module of GSIFN, and it is especially important in fine-grained tasks. Without the mLSTM module, the performance weakens mainly on fine-grained metrics, so it is a necessary module of GSIFN. Without the ULGM module, the performance weakens on almost all metrics; ULGM is especially significant to GSIFN in coarse-grained tasks.

4.3 Further Analysis

In this section, we compare the performance and efficiency of GsiT with those of MulT, and we analyze the efficiency of mLSTM. In particular, Params denotes the number of parameters, and FLOPS denotes floating-point operations per second.

GsiT and MulT. MulT mainly uses CMA to realize effective modal fusion. Like GsiT, the core module of GSIFN, MulT performs complete fusion and post-fusion enhancement, but it does so separately in 9 Transformers (Vaswani et al., 2017). In contrast, GsiT uses IM to construct the graph structure in MGEs, and each MGE contains the information of all three modalities. GsiT thereby reduces the number of Transformers from 9 to 3. Through weight sharing without information disorder, each Transformer in GsiT can completely fuse trimodal sentiment information all in one, achieving better weight regularization and fusion performance at the same time.

For a fair comparison, we trained MulT and GsiT with the same hyperparameters. The results are shown in Table 4. GsiT outperforms MulT on all metrics. The Params and FLOPS of GsiT are also much lower than those of MulT: 0.891M versus 4.362M parameters (about 4.9 times fewer) and 25.983G versus 105.174G FLOPS (about 4.0 times fewer).

Vision/Audio Encoder Efficiency. Table 5 lists the Params and FLOPS of widely used non-verbal modal feature extractors. Vision Transformer (ViT) (Dosovitskiy et al., 2021), Wav2Vec (Schneider et al., 2019), and Whisper (Radford et al., 2023) are employed to extract high-quality features. We employ mLSTM to enhance low-quality features extracted by COVAREP (Degottex et al., 2014) for audio and OpenFace (Baltrusaitis et al., 2016) for vision. The Params and FLOPS of the mLSTMs are far lower than those of ViT, Wav2Vec, and Whisper.

To analyze the efficiency and effectiveness of mLSTM, GSIFN is compared with two models that use large model extractors: TETFN and AcFormer (Zong et al., 2023). As shown in Table 6, GSIFN performs comparably to, or even better than, TETFN and AcFormer on most metrics.

5 Conclusion

In this paper, we propose GSIFN, a Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network. GSIFN addresses multimodal challenges with two key components: (1) a graph-structured and interlaced-masked multimodal Transformer that builds a robust multimodal graph embedding and achieves efficient, effective all-modal-in-one fusion; (2) a self-supervised learning framework that offers high performance at low computation overhead, using mLSTM to boost non-verbal modal features for unimodal label generation. The experimental results show that GSIFN reduces the computation overhead and has superior performance in multimodal sentiment analysis.

Limitations

Our GSIFN lacks pre-training for different modal data in the feature extraction part and does not deal with the overpopulated part of the data, resulting in too many redundant and uninformative vertices in the graph structure. This affects our performance on fine-grained metrics such as MAE and Acc-7, which is not outstanding compared with past methods.

References

  • Baevski et al. (2022) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 1298–1312. PMLR.
  • Bagher Zadeh et al. (2018) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia. Association for Computational Linguistics.
  • Baltrusaitis et al. (2016) Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2016. Openface: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–10. IEEE Computer Society.
  • Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018, pages 59–66. IEEE Computer Society.
  • Brody et al. (2022) Shaked Brody, Uri Alon, and Eran Yahav. 2022. How attentive are graph attention networks? In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Chung et al. (2014) Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Dao and Gu (2024) Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 10041–10071. PMLR.
  • Degottex et al. (2014) Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP - A collaborative voice analysis repository for speech technologies. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 960–964. IEEE.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Fang et al. (2024) Lingyong Fang, Gongshen Liu, and Ru Zhang. 2024. Multi-grained multimodal interaction network for sentiment analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19, 2024, pages 7730–7734. IEEE.
  • Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
  • Hazarika et al. (2020) Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, pages 1122–1131. ACM.
  • Hochreiter (1997) S Hochreiter. 1997. Long short-term memory. Neural Computation MIT-Press.
  • Lin and Hu (2024) Ronghao Lin and Haifeng Hu. 2024. Multi-task momentum distillation for multimodal sentiment analysis. IEEE Trans. Affect. Comput., 15(2):549–565.
  • Mai et al. (2023) Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. 2023. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans. Affect. Comput., 14(3):2276–2289.
  • Peng et al. (2023a) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023a. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
  • Peng et al. (2023b) Junjie Peng, Ting Wu, Wenqiang Zhang, Feng Cheng, Shuhua Tan, Fen Yi, and Yansong Huang. 2023b. A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Expert Syst. Appl., 221:119721.
  • Pöppel et al. (2024) Korbinian Pöppel, Maximilian Beck, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. 2024. xLSTM: Extended long short-term memory. In First Workshop on Long-Context Foundation Models @ ICML 2024.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
  • Rahman et al. (2020) Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369, Online. Association for Computational Linguistics.
  • Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, pages 3465–3469. ISCA.
  • Sun et al. (2023) Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, Florence, Italy. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  • Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Wang et al. (2023a) Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, Lihuo He, and Xuemei Luo. 2023a. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit., 136:109259.
  • Wang et al. (2023b) Di Wang, Shuai Liu, Quan Wang, Yumin Tian, Lihuo He, and Xinbo Gao. 2023b. Cross-modal enhancement network for multimodal sentiment analysis. IEEE Trans. Multim., 25:4909–4921.
  • Wang et al. (2024) Lan Wang, Junjie Peng, Cangzhi Zheng, Tong Zhao, and Li’an Zhu. 2024. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning. Inf. Process. Manag., 61(2):103675.
  • Wu et al. (2024) Zehui Wu, Ziwei Gong, Jaywon Koo, and Julia Hirschberg. 2024. Multimodal multi-loss fusion network for sentiment analysis. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3588–3602, Mexico City, Mexico. Association for Computational Linguistics.
  • Yang et al. (2021) Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, and Louis-Philippe Morency. 2021. MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1009–1021, Online. Association for Computational Linguistics.
  • Yang et al. (2023) Jiuding Yang, Yakun Yu, Di Niu, Weidong Guo, and Yu Xu. 2023. ConFEDE: Contrastive feature decomposition for multimodal sentiment analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7617–7630, Toronto, Canada. Association for Computational Linguistics.
  • Yu et al. (2020) Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. 2020. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727, Online. Association for Computational Linguistics.
  • Yu et al. (2021) Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 10790–10797. AAAI Press.
  • Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, Copenhagen, Denmark. Association for Computational Linguistics.
  • Zadeh et al. (2018) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory fusion network for multi-view sequential learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5634–5641. AAAI Press.
  • Zadeh et al. (2016) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
  • Zhang et al. (2023) Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, and Tianshu Yu. 2023. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 756–767, Singapore. Association for Computational Linguistics.
  • Zhao et al. (2023) Tong Zhao, Junjie Peng, Yansong Huang, Lan Wang, Huiran Zhang, and Zesu Cai. 2023. A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis. Appl. Intell., 53(24):30455–30468.
  • Zheng et al. (2024a) Cangzhi Zheng, Junjie Peng, and Zesu Cai. 2024a. Extracting method for fine-grained emotional features in videos. Knowledge-Based Systems, 302:112382.
  • Zheng et al. (2024b) Cangzhi Zheng, Junjie Peng, Lan Wang, Li’an Zhu, Jiatao Guo, and Zesu Cai. 2024b. Frame-level nonverbal feature enhancement based sentiment analysis. Expert Systems with Applications, 258:125148.
  • Zong et al. (2023) Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng, and Qunyan Zhou. 2023. Acformer: An aligned and compact transformer for multimodal sentiment analysis. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 833–842. ACM.

Appendix A Experiment

A.1 Experiment Settings

Table 7: The hyperparameters of the main experiment.
Hyperparameter CMU-MOSI CMU-MOSEI CH-SIMS
batch size 64 64 64
Learning Rate
lr-bert $5\times 10^{-5}$ $5\times 10^{-6}$ $5\times 10^{-5}$
lr-audio $5\times 10^{-5}$ $5\times 10^{-5}$ $5\times 10^{-5}$
lr-video $5\times 10^{-5}$ $5\times 10^{-5}$ $5\times 10^{-5}$
lr-other $5\times 10^{-4}$ $5\times 10^{-4}$ $1\times 10^{-4}$
Weight Decay
wd-bert 0.001 0.001 0.001
wd-audio 0.001 0.001 0.001
wd-video 0.001 0.001 0.001
wd-other 0.001 0.001 0.001
Model Hyper Parameter
xlstm blocks 4 4 4
feature 128 128 128
heads 4 4 4
dropout 0.2 0.2 0.2
Table 8: The extractors of the main experiment.
Modal CMU-MOSI CMU-MOSEI CH-SIMS
Text bert-base-uncased bert-base-uncased bert-base-chinese
Vision OpenFace OpenFace OpenFace2.0
Audio COVAREP COVAREP LibROSA

In this section, we discuss the experiment settings. The hyperparameters of the main experiment are shown in Table 7. For further analysis experiments, the hyperparameters of MulT are the same as those of GSIFN in CMU-MOSI.
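The settings in Table 7 can be gathered into one configuration object. The sketch below is illustrative only: the dictionary layout, the `param_group` helper, and the routing of parameters by name prefix (`bert.`, `audio.`, `video.`) are assumptions made for this example and are not taken from the GSIFN implementation. Only the numeric values come from Table 7.

```python
# Hyperparameters transcribed from Table 7. The dictionary layout and all
# key names are illustrative assumptions, not the authors' configuration.
COMMON = {"batch_size": 64, "xlstm_blocks": 4, "feature": 128,
          "heads": 4, "dropout": 0.2, "weight_decay": 0.001}

LEARNING_RATES = {
    "CMU-MOSI":  {"bert": 5e-5, "audio": 5e-5, "video": 5e-5, "other": 5e-4},
    "CMU-MOSEI": {"bert": 5e-6, "audio": 5e-5, "video": 5e-5, "other": 5e-4},
    "CH-SIMS":   {"bert": 5e-5, "audio": 5e-5, "video": 5e-5, "other": 1e-4},
}

def param_group(param_name, dataset):
    """Pick lr and weight decay for a parameter by (hypothetical) name prefix."""
    rates = LEARNING_RATES[dataset]
    for prefix in ("bert", "audio", "video"):
        if param_name.startswith(prefix + "."):
            return {"lr": rates[prefix], "weight_decay": COMMON["weight_decay"]}
    # Everything else falls into the "other" group of Table 7.
    return {"lr": rates["other"], "weight_decay": COMMON["weight_decay"]}
```

Grouping parameters this way makes the one dataset-specific difference visible at a glance: CMU-MOSEI uses a smaller BERT learning rate, and CH-SIMS a smaller learning rate for the remaining parameters.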

Following previous work (Zheng et al., 2024b), we use the following feature extraction tools for the modals in each dataset: BERT (Devlin et al., 2019) for text, OpenFace (Baltrusaitis et al., 2016) and OpenFace 2.0 (Baltrusaitis et al., 2018) for vision, and COVAREP (Degottex et al., 2014) and LibROSA for audio. The extractors for each dataset are shown in Table 8.

A.2 Further Ablation Study

In this section, further ablation experiments are performed and presented to fully analyze GSIFN. These experiments include Graph Structure Ablation, Fusion Modal Ablation, ULGM Modal Ablation, and Pretrained Language Model Ablation. Note that the multimodal representation (M) is used for the final classification task. In the original case, M is composed of unimodal text (T), vision (V), and audio (A).

Graph Structure Selection The structure of the graph has a significant impact on the performance of the model, so we conduct an ablation study on its graph structure. The structures include the original structure, structure-1, structure-2, structure-3, and self-only structure.

The graph structure of the three modals can only be constructed in four cases. As a contrast, we design a self-only mask to interpret the influence of information disorder.

Original Structure: The original structure is two opposite unidirectional ring graphs. They both realize cyclic all-modal-in-one fusion, which makes trimodal information fully interact in shared model weights. The structure is: $\{t\rightarrow v, v\rightarrow a, a\rightarrow t\}$, $\{a\rightarrow v, v\rightarrow t, t\rightarrow a\}$. The modal-wise IFMs are:

$$
\begin{cases}
\begin{aligned}
&\mathcal{M}_{inter}^{forward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{J}^{t,a}\\
\mathcal{J}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}\\
&\mathcal{M}_{inter}^{backward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{J}^{v,t} & \mathcal{O}^{a,t}\\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{J}^{v,a}\\
\mathcal{J}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\end{aligned}
\end{cases}
\tag{17}
$$
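To make Eq. (17) concrete, the sketch below expands the forward block pattern into a token-level matrix and obtains the backward pattern by transposition. It treats J and O purely as block labels: the fill values passed to `expand` (here 1 and 0) and the per-modal sequence lengths are placeholders chosen for illustration, since the numerical definition of J and O lies outside this appendix. All function names are hypothetical.

```python
MODALS = ("t", "v", "a")

# Block pattern of the forward mask in Eq. (17), rows/columns ordered t, v, a.
FORWARD = [["J", "O", "J"],
           ["J", "J", "O"],
           ["O", "J", "J"]]

def transpose(pattern):
    """Transpose a 3x3 block pattern."""
    return [list(row) for row in zip(*pattern)]

def expand(pattern, lengths, fill):
    """Expand block labels into a token-level matrix.

    lengths: sequence length per modal, e.g. {"t": 2, "v": 1, "a": 1}
    fill:    value written for each label, e.g. {"J": 1, "O": 0} (assumed)
    """
    rows = []
    for i, mi in enumerate(MODALS):
        for _ in range(lengths[mi]):          # one row per token of modal mi
            row = []
            for j, mj in enumerate(MODALS):   # one column block per modal mj
                row.extend([fill[pattern[i][j]]] * lengths[mj])
            rows.append(row)
    return rows

BACKWARD = transpose(FORWARD)
mask = expand(FORWARD, {"t": 2, "v": 1, "a": 1}, {"J": 1, "O": 0})
```

The J/O layout of `BACKWARD` coincides with the backward matrix in Eq. (17), which is one way to see that the two rings run in opposite directions.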

Structure-1: Structure-1 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{a\rightarrow v, v\rightarrow a, a\rightarrow t\}$, $\{v\rightarrow t, t\rightarrow v, t\rightarrow a\}$. The modal-wise IFMs are:

$$
\begin{cases}
\begin{aligned}
&\mathcal{M}_{inter}^{forward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{J}^{t,v} & \mathcal{O}^{t,a}\\
\mathcal{J}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}\\
&\mathcal{M}_{inter}^{backward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{v,t} & \mathcal{J}^{t,a}\\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{J}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\end{aligned}
\end{cases}
\tag{18}
$$

Structure-2: Structure-2 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{v\rightarrow t, t\rightarrow v, v\rightarrow a\}$, $\{a\rightarrow t, t\rightarrow a, a\rightarrow v\}$. The modal-wise IFMs are:

$$
\begin{cases}
\begin{aligned}
&\mathcal{M}_{inter}^{forward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{J}^{t,a}\\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{J}^{v,a}\\
\mathcal{J}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}\\
&\mathcal{M}_{inter}^{backward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{J}^{v,t} & \mathcal{O}^{t,a}\\
\mathcal{J}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\end{aligned}
\end{cases}
\tag{19}
$$
Figure 3: Example of Alignment.
Figure 4: Attention map split by interlaced masks.

Structure-3: Structure-3 realizes all-modal-in-one fusion, but the information passing is not cyclic. The structure is: $\{a\rightarrow v, v\rightarrow a, v\rightarrow t\}$, $\{a\rightarrow t, t\rightarrow a, t\rightarrow v\}$. The modal-wise IFMs are:

$$
\begin{cases}
\begin{aligned}
&\mathcal{M}_{inter}^{forward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{J}^{t,a}\\
\mathcal{J}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a}\\
\mathcal{J}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}\\
&\mathcal{M}_{inter}^{backward}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{J}^{v,t} & \mathcal{O}^{t,a}\\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{J}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{J}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\end{aligned}
\end{cases}
\tag{20}
$$

Additionally, we construct a graph with only the intra-mask, which is disordered in multimodal temporal information.

Self-Only:

$$
\mathcal{M}_{inter}=\begin{pmatrix}
\mathcal{J}^{t,t} & \mathcal{O}^{t,v} & \mathcal{O}^{t,a}\\
\mathcal{O}^{v,t} & \mathcal{J}^{v,v} & \mathcal{O}^{v,a}\\
\mathcal{O}^{a,t} & \mathcal{O}^{a,v} & \mathcal{J}^{a,a}
\end{pmatrix}
\tag{21}
$$

As shown in Table 9 part Graph Structure Ablation, the original structure is superior to the other three theoretically feasible structures in all metrics. The four theoretically feasible structures are superior to the self-only structure, which is theoretically infeasible.
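The distinction between cyclic and non-cyclic information passing can be checked mechanically from the edge sets listed for each structure. The following sketch tests whether a set of directed edges forms a single ring through all three modals; the function name and the edge encoding are illustrative choices rather than part of GSIFN.

```python
def is_ring(edges, nodes=("t", "v", "a")):
    """Return True if `edges` forms one directed cycle visiting every node."""
    succ = dict(edges)                       # source -> target
    if len(succ) != len(edges):
        return False                         # some node has two outgoing edges
    if set(succ) != set(nodes) or set(succ.values()) != set(nodes):
        return False                         # in/out-degree is not exactly one
    seen, cur = set(), nodes[0]
    while cur not in seen:                   # follow the successor chain
        seen.add(cur)
        cur = succ[cur]
    return cur == nodes[0] and seen == set(nodes)

# Edge sets copied from the structure descriptions (source, target).
original_fwd = [("t", "v"), ("v", "a"), ("a", "t")]
original_bwd = [("a", "v"), ("v", "t"), ("t", "a")]
structure1_fwd = [("a", "v"), ("v", "a"), ("a", "t")]
structure2_fwd = [("v", "t"), ("t", "v"), ("v", "a")]
```

Under this check, both edge sets of the original structure are rings, while the Structure-1 and Structure-2 edge sets are not, because one modal has two outgoing edges and another has none.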

Table 9: Modality Ablation Study on CMU-MOSI. Note: F denotes finetuning pretrained language models, NF denotes not finetuning
Description CMU-MOSI
Acc-2↑ F1↑ Acc-7↑ MAE↓ Corr↑
Graph Structure Ablation
Original 85.0 / 86.0 85.0 / 86.0 48.3 0.707 0.801
Structure-1 82.4 / 84.0 82.3 / 84.0 46.5 0.712 0.792
Structure-2 83.8 / 85.7 83.7 / 85.6 46.1 0.731 0.796
Structure-3 83.4 / 85.1 83.3 / 85.1 45.5 0.727 0.793
Self-Only 81.6 / 83.2 81.7 / 83.3 43.3 0.750 0.791
Fusion Modality Ablation
M(T,V,A) 85.0 / 86.0 85.0 / 86.0 48.3 0.707 0.801
M(T,V) 84.3 / 85.5 84.2 / 85.5 45.5 0.720 0.797
M(T,A) 84.3 / 85.7 84.3 / 85.7 47.2 0.704 0.800
M(V,A) 59.8 / 60.2 59.7 / 60.3 17.9 1.344 0.196
M(T) 83.1 / 84.8 83.0 / 84.7 47.5 0.715 0.786
M(V) 59.2 / 59.8 58.9 / 59.6 16.8 1.372 0.141
M(A) 60.4 / 61.3 59.0 / 60.0 21.3 1.322 0.236
ULGM Modal Ablation
M+T+V+A 85.0 / 86.0 85.0 / 86.0 48.3 0.707 0.801
M+T+V 84.4 / 85.7 84.3 / 85.7 44.5 0.742 0.742
M+T+A 83.9 / 85.7 83.7 / 85.6 46.1 0.731 0.796
M+V+A 83.8 / 85.2 83.8 / 85.3 44.6 0.748 0.794
M+T 83.4 / 85.7 83.3 / 85.6 45.0 0.731 0.796
M+V 83.5 / 85.4 83.5 / 85.4 45.8 0.724 0.801
M+A 82.5 / 84.6 82.4 / 84.6 46.1 0.709 0.800
M 83.4 / 84.8 83.4 / 84.8 46.7 0.711 0.801
Pretrained Language Model Ablation
BERT(F) 85.0 / 86.0 85.0 / 86.0 48.3 0.707 0.801
BERT(NF) 83.8 / 85.7 83.7 / 85.6 46.1 0.731 0.796

Fusion Modal Ablation To investigate how the modal composition of the multimodal representation affects the representation ability of the whole model, we design a fusion modal ablation study, which covers the trimodal form M(T, V, A); the bimodal forms M(T, V), M(T, A), and M(V, A); and the unimodal forms M(T), M(V), and M(A). Note that no graph structure exists in the unimodal case, so the graph-structured attention degenerates to naive multi-head self-attention.

As shown in the Fusion Modal Ablation part of Table 9, modal combinations that include text, namely M(T, V, A), M(T, V), M(T, A), and M(T), perform far better than those without it, such as M(V, A). Among the combinations that include text, the trimodal combination M(T, V, A) outperforms the bimodal combinations M(T, V) and M(T, A) on nearly all metrics; the only exception is MAE, where M(T, A) is marginally lower (0.704 vs. 0.707). Among the bimodal combinations, audio plays a relatively more important role than vision in multimodal fusion. In the unimodal cases, text performs far better than vision and audio.
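The relative importance of each modal can be read off Table 9 by comparing the trimodal model with the bimodal variant that omits that modal. The short calculation below is only a worked example using the Acc-7 and MAE values copied from the table; the variable names are illustrative.

```python
# (Acc-7, MAE) from the Fusion Modality Ablation part of Table 9.
RESULTS = {
    "M(T,V,A)": (48.3, 0.707),
    "M(T,V)":   (45.5, 0.720),   # audio removed
    "M(T,A)":   (47.2, 0.704),   # vision removed
    "M(V,A)":   (17.9, 1.344),   # text removed
}
REMOVED = {"audio": "M(T,V)", "vision": "M(T,A)", "text": "M(V,A)"}

full_acc7, full_mae = RESULTS["M(T,V,A)"]
# Drop in Acc-7 and rise in MAE when one modal is left out of the fusion.
acc7_drop = {m: round(full_acc7 - RESULTS[k][0], 1) for m, k in REMOVED.items()}
mae_rise = {m: round(RESULTS[k][1] - full_mae, 3) for m, k in REMOVED.items()}

for modal in REMOVED:
    print(f"without {modal}: Acc-7 -{acc7_drop[modal]}, MAE {mae_rise[modal]:+.3f}")
```

Removing text costs about 30 Acc-7 points, removing audio 2.8, and removing vision 1.1, which agrees with the ordering of modal importance described above.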

ULGM Modal Ablation In our proposed Self-Supervised Learning Framework, multimodality (M) is used for classification, and unimodal text (T), vision (V), and audio (A) are used to generate unimodal labels in ULGM to ensure that the model learns a robust representation of the multimodal data.

To fully analyze the importance of each modal in the model, we design a ULGM modal ablation experiment. The settings include ULGM with three modals: M+T+V+A; ULGM with two modals: M+T+V, M+T+A, M+V+A; ULGM with one modal: M+T, M+V, M+A; and without ULGM: M.
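The eight ULGM settings are exactly M combined with every subset of the unimodal tasks {T, V, A}. As a small illustration (the helper name is hypothetical), they can be enumerated as follows:

```python
from itertools import combinations

UNIMODALS = ("T", "V", "A")

def ulgm_settings():
    """List every ablation setting: M plus each subset of unimodal tasks."""
    settings = []
    for k in range(len(UNIMODALS), -1, -1):      # 3, 2, 1, then 0 unimodal tasks
        for subset in combinations(UNIMODALS, k):
            settings.append("+".join(("M",) + subset))
    return settings

print(ulgm_settings())
```

The enumeration yields the settings in the same order as the ULGM Modal Ablation rows of Table 9, ending with the plain M setting that uses no ULGM.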

As shown in the ULGM Modal Ablation part of Table 9, taking M+T as an example, M+T performs worse than M+T+V and M+T+A in coarse-grained tasks (Acc-2, F1). The binary classification performance of GSIFN is therefore affected by the number of modals in ULGM. However, taking M as an example, the performance of M in fine-grained tasks (Acc-7, MAE) is superior to that of M+A, M+T, etc. Among all the cases, M+T+V+A achieves the best performance. Therefore, ULGM promotes the coarse-grained capability of GSIFN, while GsiT boosts its fine-grained capability.

Pretrained Language Model Ablation The experiment on whether or not to fine-tune BERT is shown in the Pretrained Language Model Ablation part of Table 9. The results show that fine-tuning BERT is quite useful for GSIFN.

A.3 Alignment

Specifically, a real-time example of a complete adjacency matrix (attention map) of the original structure is shown in Figure 4.

An example analysis of the alignment efficiency of GSIFN is shown in Figure 3. We choose the bimodal combinations vision-to-text and audio-to-text as examples; these two groups are produced by two different MGEs. As can be seen from Figure 3, GSIFN effectively and comprehensively composes the semantics of the three modals.

Appendix B Datasets

Brief introductions to the three chosen datasets are as follows.

CMU-MOSI: CMU-MOSI is a commonly used dataset for human multimodal sentiment analysis. It consists of 2,198 short monologue video clips (each clip lasts for the duration of one sentence) expressing the opinion of the speaker in the video on a topic such as movies. The utterances are manually annotated with a continuous opinion score in [-3, +3] (-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive).

CMU-MOSEI: CMU-MOSEI is an improved version of CMU-MOSI. It contains 23,453 annotated video clips (about 10 times more than CMU-MOSI) from 5,000 videos, 1,000 different speakers, and 250 different topics. The number of discourses, samples, speakers, and topics is also larger than in CMU-MOSI. The label range for each discourse is consistent with that of CMU-MOSI.

CH-SIMS: CH-SIMS covers the same modalities (audio, text, and video) in Mandarin, collected from 2,281 annotated video segments. It includes data from TV shows and movies, making it culturally distinct and diverse, and provides multiple labels for the same utterance based on different modalities, which adds an extra layer of complexity and richness to the data.

Appendix C Baselines

The introduction to baseline models is as follows.

TFN: The Tensor Fusion Network (TFN) uses modality embedding subnetworks and tensor fusion to learn intra- and inter-modality dynamics.

MFN: The Memory Fusion Network (MFN) explicitly accounts for both interactions in a neural architecture and continuously models them through time.

MulT: The Multimodal Transformer (MulT) uses a cross-modal transformer based on cross-modal attention to perform modality translation.

MTAG: The Modal-Temporal Attention Graph (MTAG) is a graph neural network model that incorporates modal attention mechanisms and dynamic pruning techniques to effectively capture complex interactions across modes and time, achieving a parametrically efficient and interpretable model.

MISA: The Modality-Invariant and -Specific Representations (MISA) model projects representations into modality-specific and modality-invariant spaces and is trained with distributional similarity, orthogonal, reconstruction, and task prediction losses.

Self-MM: Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning (Self-MM) designs a multimodal task and a unimodal task to learn inter-modal consistency and intra-modal specificity.

CENet-BERT: Cross-Modal Enhancement Network (CENet) uses K-Means clustering to cluster the visual and audio modes into multiple tokens to generate the corresponding embeddings, thus improving the representation ability of the two auxiliary modes and realizing a better BERT fine-tuning migration gate.

HyCon-BERT: HyCon is a novel multimodal representation learning framework based on contrastive learning, designed with three types of losses to comprehensively learn inter-modal and intra-modal dynamics in both supervised and unsupervised ways.

TETFN: Text Enhanced Transformer Fusion Network (TETFN) strengthens the role of text modes in multimodal information fusion through text-oriented cross-modal mapping and single-modal label generation, and uses a pre-trained Vision Transformer model to extract visual features.

ConFEDE: Contrastive Feature Decomposition (ConFEDE) constructs a unified learning framework that jointly performs contrastive representation learning and contrastive feature decomposition to enhance the representation of multimodal information.

MMIN: Multi-modal Interaction Network (MMIN) is an advanced multi-modal sentiment analysis model that combines a coarse-grained interaction network (CIN) and a fine-grained interaction network (FIN). Adversarial learning and sparse attention mechanisms are used to capture complex interactions between different modals and reduce redundant and irrelevant information.

MTMD: Multi-Task Momentum Distillation (MTMD) treats the modal learning process as multiple subtasks, and knowledge distillation between the teacher network and the student network effectively reduces the gap between different modes. It also uses momentum models to explore mode-specific knowledge and learns robust multimodal representations through adaptive momentum fusion factors.

Appendix D Aggregation of Modal Subgraphs

D.1 How to Aggregate Subgraphs?

This section derives graph aggregation, from the vertex level to the subgraph level.

Vertex Aggregation Assume a set of vertex features $\mathcal{V}=\{v_{1},v_{2},\dots,v_{N}\}$, $v_{i}\in\mathbb{R}^{D}$, where $N$ is the number of vertices and $D$ is the feature dimension of each vertex.

From previous works (Velickovic et al., 2018; Brody et al., 2022), the GAT is defined as follows. GAT performs self-attention on the vertices: a shared attentional mechanism $a:\mathbb{R}^{D'}\times\mathbb{R}^{D'}\rightarrow\mathbb{R}$ computes attention coefficients. Before that, a shared linear transformation, parameterized by a weight matrix $\mathbf{W}\in\mathbb{R}^{D'\times D}$, is applied to every vertex.

$$e^{i,j}=a(\mathbf{W}v_{i},\mathbf{W}v_{j})=(\mathbf{W}v_{i})\cdot(\mathbf{W}v_{j})^{\top} \tag{22}$$

$e^{i,j}$ indicates the importance of vertex $j$'s features to vertex $i$. In the most general formulation, the model allows a vertex to attend to every other vertex, which drops all structural information. GAT injects the graph structure into the mechanism by performing masked attention: it only computes $e^{i,j}$ for vertices $j\in\mathcal{N}_{i}$, where $\mathcal{N}_{i}$ is the neighborhood of vertex $i$ in the graph. To make coefficients easily comparable across different vertices, GAT normalizes them across all choices of $j$ using the softmax function ($\mathcal{S}$):

$$\alpha^{i,j}=\mathcal{S}_{j}(e^{i,j})=\frac{\exp(e^{i,j})}{\sum_{k\in\mathcal{N}_{i}}\exp(e^{i,k})} \tag{23}$$

Unlike GAT, whose attention mechanism $a$ is a single-layer feedforward neural network, we directly employ a multi-head self-attention mechanism as the aggregation algorithm.

Therefore, the final output feature for every vertex is defined as follows.

$$\overline{v}_{i}=\sum_{j\in\mathcal{N}_{i}}\alpha^{i,j}\mathbf{W}v_{j} \tag{24}$$

Where $\sigma$ indicates the sigmoid nonlinearity.
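A minimal sketch of Equations 22 to 24 for a single head, written in plain Python so that the index conventions are explicit. The function name `aggregate_vertices`, the neighbor-list encoding of $\mathcal{N}_{i}$, and the toy inputs are assumptions made for illustration; following Equation 24 as written, no nonlinearity is applied to the weighted sum.

```python
import math


def matvec(W, v):
    """Return W v for a matrix W (list of rows) and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def aggregate_vertices(vertices, neighbors, W):
    """Single-head vertex aggregation following Eqs. 22-24.

    vertices:  list of N feature vectors v_i, each of length D
    neighbors: neighbors[i] lists the indices j in N_i (hypothetical encoding)
    W:         shared D' x D weight matrix
    """
    h = [matvec(W, v) for v in vertices]  # W v_i for every vertex
    outputs = []
    for i, nbrs in enumerate(neighbors):
        # Eq. 22: e^{i,j} = (W v_i) . (W v_j), computed only for j in N_i.
        e = [sum(a * b for a, b in zip(h[i], h[j])) for j in nbrs]
        # Eq. 23: softmax over the neighborhood (shifted by the max for stability).
        shift = max(e)
        weights = [math.exp(x - shift) for x in e]
        total = sum(weights)
        alpha = [w / total for w in weights]
        # Eq. 24: attention-weighted sum of the transformed neighbor features.
        outputs.append([
            sum(a * h[j][d] for a, j in zip(alpha, nbrs))
            for d in range(len(h[i]))
        ])
    return outputs


# Toy graph: vertex 0 sees only itself, vertex 1 sees every vertex.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
N = [[0], [0, 1, 2], [1, 2]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so W v_i = v_i
out = aggregate_vertices(V, N, W)
```

Because vertex 0 has a single neighbor (itself), its attention weight is 1 and its output equals its own transformed feature; vertex 1 mixes all three vertices.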

Then, we extend the mechanism to multi-head attention.

$$\overline{v}_{i}=\parallel^{K}_{k=1}\sum_{j\in\mathcal{N}_{i}}\alpha^{i,j}_{k}\mathbf{W}_{k}v_{j} \tag{25}$$

Where $\parallel$ represents the concatenation operation.
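As a quick dimension check on Equation 25 (a worked illustration; the assumption that every head maps into the same dimension $D'$ is ours and is not stated above), each head produces a vector in $\mathbb{R}^{D'}$, so concatenating $K$ heads gives:

$$\mathbf{W}_{k}\in\mathbb{R}^{D'\times D},\qquad \sum_{j\in\mathcal{N}_{i}}\alpha^{i,j}_{k}\mathbf{W}_{k}v_{j}\in\mathbb{R}^{D'} \quad\Longrightarrow\quad \overline{v}_{i}\in\mathbb{R}^{KD'}$$

For instance, $K=4$ heads with $D'=16$ would yield a 64-dimensional output feature per vertex.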

From Vertex to Subgraph Assume two sets of vertices $\mathcal{V}_{i}=\{v_{1}^{i},v_{2}^{i},\dots,v_{N_{i}}^{i}\}$, $v_{m}^{i}\in\mathbb{R}^{D_{i}}$ and $\mathcal{V}_{j}=\{v_{1}^{j},v_{2}^{j},\dots,v_{N_{j}}^{j}\}$, $v_{n}^{j}\in\mathbb{R}^{D_{j}}$, where $N_{\{i,j\}}$ is the number of vertices of $\mathcal{V}_{\{i,j\}}$ and $D_{\{i,j\}}$ is the feature dimension of each vertex in $\mathcal{V}_{\{i,j\}}$.

Then, apply the GAT algorithm on $v_{m}^{i}$ and $v_{n}^{j}$. Instead of a shared linear transformation, we use two weight matrices, query weight $\mathbf{W}_{q}^{m}\in\mathbb{R}^{D_{i}'\times D_{i}}$ and key weight $\mathbf{W}_{k}^{n}\in\mathbb{R}^{D_{j}'\times D_{j}}$.

$$e^{m,n}=a(\mathbf{W}_{q}^{m}v_{m}^{i},\mathbf{W}_{k}^{n}v_{n}^{j}) \tag{26}$$
$$\alpha^{m,n}=\mathcal{S}_{n}(e^{m,n})=\frac{\exp(e^{m,n})}{\sum_{l\in\mathcal{N}_{m}}\exp(e^{m,l})} \tag{27}$$

After that, the final output feature for $v_{m}^{i}$ is computed. The value weight $\mathbf{W}_{v}^{n}\in\mathbb{R}^{D_{j}'\times D_{j}}$ is applied to transform $v_{n}^{j}$:

$$\overline{v}_{m}^{i}=\parallel^{L}_{l=1}\sum_{n\in\mathcal{N}_{m}}\alpha^{m,n}_{l}\mathbf{W}_{v_{l}}^{n}v_{n}^{j} \tag{28}$$

At the subgraph level, we assume that $\mathcal{N}_{m}$ includes all the vertices in subgraph $\mathcal{V}_{j}$. The current attention coefficient matrix is a vector $\mathcal{G}^{m}$, which can be regarded as a graph aggregated from $\mathcal{V}_{j}$ to $v_{m}^{i}$. The key and value weights for $\mathcal{V}_{j}$ are represented as $\mathcal{W}_{\{k,v\}}\in\mathbb{R}^{N_{j}\times D_{j}'\times D_{j}}$. Then, the aggregation can be defined as follows:

$$e^{m}=a(\mathbf{W}_{q}^{m}v_{m}^{i},\mathcal{W}_{k}\mathcal{V}_{j}),\quad\mathcal{G}^{m}=\mathcal{S}(e^{m}) \tag{29}$$
$$\overline{v}_{m}^{i}=\parallel^{L}_{l=1}(\mathcal{G}^{m}_{l}\mathcal{W}_{v_{l}}\mathcal{V}_{j}) \tag{30}$$

Then, apply the algorithm defined by Equations 28, 29, and 30 to all the vertices in $\mathcal{V}_{i}$. The aggregation is now from vertex set to vertex set; thus, we regard the vertex sets as subgraphs, and vertex-to-vertex aggregation is transformed into subgraph aggregation. Also, the attention coefficient matrix $e$ is transformed into a directional subgraph adjacency matrix $\mathcal{E}$. The query weight for $\mathcal{V}_{i}$ is represented as $\mathcal{W}_{q}$.

$$\mathcal{E}^{i,j}=a(\mathcal{W}_{q}\mathcal{V}_{i},\mathcal{W}_{k}\mathcal{V}_{j}),\quad\mathcal{G}^{i,j}=\mathcal{S}(\mathcal{E}^{i,j}) \tag{31}$$
$$\overline{\mathcal{V}}_{i}=\parallel^{L}_{l=1}(\mathcal{G}_{l}^{i,j}\mathcal{W}_{v_{l}}\mathcal{V}_{j}) \tag{32}$$

Now the aggregation procedure is equivalent to the multi-head cross-attention mechanism (Tsai et al., 2019).
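This equivalence can be checked numerically. The sketch below, in plain Python, aggregates one vertex at a time (Equations 26 to 28 with $\mathcal{N}_{m}$ covering all of $\mathcal{V}_{j}$) and compares the result with the matrix form of Equations 31 and 32 for a single head. The helper names, the use of one shared weight matrix per role rather than per-vertex weights, and the toy values are all assumptions made for illustration.

```python
import math


def matmul(A, B):
    """Plain-Python product of A (n x k) and B (k x m), both lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax(row):
    """Softmax of one row, shifted by its max for numerical stability."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def cross_attention(Vi, Vj, Wq, Wk, Wv):
    """Matrix form (Eqs. 31-32, one head): queries from V_i, keys/values from V_j."""
    Q = matmul(Vi, transpose(Wq))    # row m is W_q v_m^i
    K = matmul(Vj, transpose(Wk))    # row n is W_k v_n^j
    Val = matmul(Vj, transpose(Wv))  # row n is W_v v_n^j
    G = [softmax(row) for row in matmul(Q, transpose(K))]  # G^{i,j}
    return matmul(G, Val)


def per_vertex(Vi, Vj, Wq, Wk, Wv):
    """Vertex-by-vertex form (Eqs. 26-28, one head, N_m = all of V_j)."""
    keys = [[sum(w * x for w, x in zip(r, v)) for r in Wk] for v in Vj]
    vals = [[sum(w * x for w, x in zip(r, v)) for r in Wv] for v in Vj]
    out = []
    for v_m in Vi:
        q = [sum(w * x for w, x in zip(r, v_m)) for r in Wq]
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(a * val[d] for a, val in zip(alpha, vals))
                    for d in range(len(vals[0]))])
    return out


# Toy inputs: 2 vertices in V_i (D_i = 2), 3 vertices in V_j (D_j = 3), D' = 2.
Vi = [[1.0, 0.0], [0.5, -1.0]]
Vj = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
Wq = [[1.0, 0.5], [0.0, 1.0]]
Wk = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
Wv = [[0.5, 0.0, 0.0], [0.0, 0.5, 1.0]]

matrix_form = cross_attention(Vi, Vj, Wq, Wk, Wv)
vertex_form = per_vertex(Vi, Vj, Wq, Wk, Wv)
max_diff = max(abs(x - y)
               for ra, rb in zip(matrix_form, vertex_form)
               for x, y in zip(ra, rb))
print(max_diff)  # difference between the two forms, up to floating-point rounding
```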

Multimodal Subgraph Aggregation Take two modal sequences $\mathcal{V}_{i}$ and $\mathcal{V}_{j}$, where $i,j\in\{t,v,a\}$, as an example; they are regarded as two vertex sets. Assuming that a unidirectional subgraph is constructed from the two modal vertex sequences, the adjacency matrix weight aggregation process of the corresponding subgraph is as follows.

$$\mathcal{E}^{i,j}=(\mathcal{W}_{q}\mathcal{V}_{i})\cdot(\mathcal{W}_{k}\mathcal{V}_{j})^{\top} \tag{33}$$

Then apply the softmax function.

$$\mathcal{G}^{i,j}=\mathcal{S}(\mathcal{E}^{i,j}) \tag{34}$$

Finally, some of the edges in the subgraph are randomly masked, which is realized by applying dropout to the adjacency matrix.

$$\mathcal{G}_{dropout}^{i,j}=\mathcal{D}(\mathcal{G}^{i,j}) \tag{35}$$

where $\mathcal{D}$ denotes the dropout function.

After the aggregation, the fusion process starts, which is regarded as the directional information fusion procedure from $\mathcal{V}_{j}$ to $\mathcal{V}_{i}$.

$$\overline{\mathcal{V}}_{i}=\mathcal{G}_{dropout}^{i,j}\mathcal{W}_{v}\mathcal{V}_{j} \tag{36}$$
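A minimal sketch of the edge masking and fusion steps (Equations 35 and 36) in plain Python. It assumes standard inverted dropout, in which each entry of the adjacency matrix is zeroed with probability $p$ and the surviving entries are rescaled by $1/(1-p)$; the text only states that dropout is applied to the adjacency matrix, so the rescaling, the function names, and the toy values are assumptions.

```python
import random


def edge_dropout(G, p, rng):
    """Eq. 35 (assumed inverted dropout): randomly mask edges of adjacency matrix G."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    # Each edge survives with probability `keep`; survivors are rescaled by 1/keep.
    return [[g / keep if rng.random() < keep else 0.0 for g in row] for row in G]


def fuse(G_dropout, values):
    """Eq. 36: directional fusion from V_j to V_i; `values` holds the rows W_v v_n^j."""
    return [
        [sum(g * val[d] for g, val in zip(row, values)) for d in range(len(values[0]))]
        for row in G_dropout
    ]


rng = random.Random(0)  # fixed seed so the example is repeatable
G = [[0.7, 0.2, 0.1],   # softmax-normalized rows (Eq. 34), toy values
     [0.1, 0.1, 0.8]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # already-transformed V_j, toy values

G_drop = edge_dropout(G, p=0.5, rng=rng)
fused = fuse(G_drop, values)
```

With `p=0.0` the adjacency matrix is returned unchanged, so `fuse` reduces to the plain attention-weighted sum.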

Then we extend the above operation globally as follows:

$$\mathcal{G}=\mathcal{S}\circ\mathcal{D}(\mathcal{A}) \tag{37}$$
$$\overline{\mathcal{V}}_{m}=\mathcal{G}\mathcal{W}_{v}\mathcal{V}_{m} \tag{38}$$

Where $\circ$ represents the function composition operation. Note: $\mathcal{A}$ is defined in Equation 5.

The graph constructed in Equation 37 is in fact not structured at all: it loses sight of the separate modality-wise temporal features of the concatenated sequence, which makes the sequence disordered. Moreover, it over-fuses the inter-modal information, confuses inter-modal and intra-modal information, and leaves too much fine-grained information unconsidered.

Figure 5: Example explaining the necessity of the interlaced mask.

D.2 Why the Interlaced Mask?

Take the first block row in $\mathcal{A}$ as an example, which is $\mathbf{BR}=[\mathcal{E}^{t,t},\mathcal{E}^{t,v},\mathcal{E}^{t,a}]$. Recall that $\mathcal{V}_{m}=[\mathcal{V}_{t};\mathcal{V}_{v};\mathcal{V}_{a}]^{\top}$. Then $\mathcal{E}^{t,t}$ is aggregated by $\mathcal{V}_{t}$ of $\mathcal{V}_{m}$ itself, $\mathcal{E}^{t,v}$ is aggregated by $\mathcal{V}_{t}$ and $\mathcal{V}_{v}$, and $\mathcal{E}^{t,a}$ is aggregated by $\mathcal{V}_{t}$ and $\mathcal{V}_{a}$. As defined in Equations 31 and 32, the direction of aggregation of $\mathcal{E}^{i,j}$ is from $j$ to $i$.

Suppose the final output feature computation is performed without the interlaced mask. Note that the aggregation in this case is only performed on the text modal $t$.

$$\overline{\mathcal{V}}_{t}=\mathbf{BR}\cdot(\mathcal{W}_{v}\mathcal{V}_{m}) \tag{39}$$
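Expanding the block product in Equation 39 makes the mixing explicit. This expansion is a worked illustration that assumes $\mathcal{W}_{v}\mathcal{V}_{m}$ can be partitioned into the same three modality blocks as $\mathcal{V}_{m}$; the block notation $(\cdot)_{t}$, $(\cdot)_{v}$, $(\cdot)_{a}$ is introduced here only for that purpose:

$$\overline{\mathcal{V}}_{t}=\mathcal{E}^{t,t}(\mathcal{W}_{v}\mathcal{V}_{m})_{t}+\mathcal{E}^{t,v}(\mathcal{W}_{v}\mathcal{V}_{m})_{v}+\mathcal{E}^{t,a}(\mathcal{W}_{v}\mathcal{V}_{m})_{a}$$

Every text vertex therefore receives contributions from all three modal sequences in a single sum.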

As shown in Figure 5, when we mask only one or two blocks (subgraphs), vertex sequences of different modals are treated as the same sequence because they are spliced together. This makes the temporal information disordered, which is not advisable.
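To make this concrete, the sketch below normalizes one text-query row of scores over the concatenated sequence, first with only the text block masked and then with a mask that keeps a single modality block. The segment lengths, the scores, and the choice of which blocks to keep are made-up values, and `masked_softmax` is a hypothetical helper; it is not the mask produced by Algorithm 1.

```python
import math


def masked_softmax(scores, keep):
    """Softmax over the positions where keep is True; masked positions get weight 0."""
    shift = max(s for s, k in zip(scores, keep) if k)
    exps = [math.exp(s - shift) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [x / total for x in exps]


# One text-query row over the concatenated sequence [text | vision | audio].
l_t, l_v, l_a = 2, 3, 2
row = [0.5, 1.0, 2.0, 0.0, 1.5, 1.0, 0.5]
vision = slice(l_t, l_t + l_v)

# Only the text block is masked: vision and audio positions still compete
# in one softmax, as if they belonged to a single spliced sequence.
keep_two_blocks = [False] * l_t + [True] * (l_v + l_a)
mixed = masked_softmax(row, keep_two_blocks)

# Only the vision block is kept: the row describes one vision-to-text subgraph.
keep_vision = [False] * l_t + [True] * l_v + [False] * l_a
separated = masked_softmax(row, keep_vision)

print(sum(mixed[vision]))      # below 1: part of the weight goes to audio positions
print(sum(separated[vision]))  # the whole weight stays inside the vision block
```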

Appendix E Algorithms

E.1 Interlaced Mask Generation Algorithm

  Algorithm 1 Interlaced Mask Generation

Input: Segmentation of the lengths of the three-modal sequence $seg = \{T_t, T_v, T_a\}$, mode of the mask generation $mode \in \{inter, intra\}$, direction of the fusion procedure $dir \in \{forward, backward\}$;

Output: The generated mask of the appointed mode and direction;

1:  Let $\{l_t, l_v, l_a\} = seg$
2:  Define segments $s1 = (0, l_t)$, $s2 = (l_t, l_t + l_v)$, $s3 = (l_t + l_v, l_t + l_v + l_a)$
3:  Let $l_{sum} = l_t + l_v + l_a$
4:  Initialize an empty list $\mathcal{M}_{list}$
5:  for each $i$ in $[0, 1, 2]$ do
6:     for each element in $seg[i]$ do
7:        Initialize $m_{row}$ as a tensor of ones with size $l_{sum}$
8:        if $i == 0$ then
9:           Set $m_{row}[0 : s1[1]] = 0$
10:           if $mode == inter$ then
11:              if $dir == forward$ then
12:                 Set $m_{row}[s3[0] :] = 0$
13:              else if $dir == backward$ then
14:                 Set $m_{row}[s2[0] : s2[1]] = 0$
15:              end if
16:           end if
17:        else if $i == 1$ then
18:           Set $m_{row}[s2[0] : s2[1]] = 0$
19:           if $mode == inter$ then
20:              if $dir == forward$ then
21:                 Set $m_{row}[0 : s1[1]] = 0$
22:              else if $dir == backward$ then
23:                 Set $m_{row}[s3[0] :] = 0$
24:              end if
25:           end if
26:        else if $i == 2$ then
27:           Set $m_{row}[s3[0] : s3[1]] = 0$
28:           if $mode == inter$ then
29:              if $dir == forward$ then
30:                 Set $m_{row}[s2[0] : s2[1]] = 0$
31:              else if $dir == backward$ then
32:                 Set $m_{row}[0 : s1[1]] = 0$
33:              end if
34:           end if
35:        end if
36:        Append $m_{row}$ to $\mathcal{M}_{list}$
37:     end for
38:  end for
39:  if $mode == inter$ then
40:     Let $\mathcal{M} = \text{Stack}(\mathcal{M}_{list})$
41:     return $\text{GenerateMask}(\mathcal{M})$
42:  else if $mode == intra$ then
43:     return $\text{GenerateMask}(|\text{Stack}(\mathcal{M}_{list}) - 1|)$
44:  end if

 

Algorithm 1 above details how the interlaced masks are generated for the forward and backward inter-fusion as well as for the intra-enhancement. Accurately constructing the graph structure of the concatenated sequence is of vital importance for our model. Because the masks depend only on the segment lengths, the mode, and the direction, they can be constructed once during the initialization procedure.
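As an illustration, the row-wise construction in Algorithm 1 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `interlaced_mask` and the use of nested lists instead of tensors are hypothetical choices, and the final `GenerateMask` step is omitted because its definition is not given in this section. The sketch returns the stacked 0/1 matrix that would be passed to `GenerateMask`.

```python
def interlaced_mask(seg, mode, direction):
    """Build the stacked 0/1 matrix of Algorithm 1 (before GenerateMask).

    seg       -- (l_t, l_v, l_a), lengths of the text, vision, audio segments
    mode      -- "inter" or "intra"
    direction -- "forward" or "backward" (only used when mode == "inter")
    """
    if mode not in ("inter", "intra"):
        raise ValueError("mode must be 'inter' or 'intra'")
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")

    l_t, l_v, l_a = seg
    # Segment boundaries s1, s2, s3 as half-open index ranges.
    bounds = [(0, l_t), (l_t, l_t + l_v), (l_t + l_v, l_t + l_v + l_a)]
    l_sum = l_t + l_v + l_a

    rows = []
    for i in range(3):
        # Every row zeroes the columns of its own segment.
        zeroed = [i]
        if mode == "inter":
            # Forward zeroes s3, s1, s2 for i = 0, 1, 2;
            # backward zeroes s2, s3, s1 (steps 10-16, 19-25, 28-34).
            zeroed.append((i + 2) % 3 if direction == "forward" else (i + 1) % 3)
        for _ in range(seg[i]):
            row = [1] * l_sum
            for j in zeroed:
                start, end = bounds[j]
                row[start:end] = [0] * (end - start)
            rows.append(row)

    if mode == "intra":
        # |Stack(M_list) - 1| flips zeros and ones (step 43).
        rows = [[abs(value - 1) for value in row] for row in rows]
    return rows


# Toy lengths: 2 text positions, 1 vision position, 1 audio position.
for row in interlaced_mask((2, 1, 1), "inter", "forward"):
    print(row)
# [0, 0, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
# [1, 1, 0, 0]
```

In the forward inter-fusion case, each modality's rows keep ones in exactly one other modality's columns, while the intra case keeps ones only on the block diagonal. How these ones and zeros map to attention weights depends on `GenerateMask`.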

E.2 Extended Long Short Term Memory with Matrix Memory

\begin{align}
C_t &= f_t C_{t-1} + i_t v_t k_t^{\top} \tag{40} \\
n_t &= f_t n_{t-1} + i_t k_t \tag{41} \\
h_t &= o_t \odot \tilde{h}_t, & \tilde{h}_t &= \frac{C_t q_t}{\max\{|n_t^{\top} q_t|, 1\}} \tag{42} \\
q_t &= W_q x_t + b_q \tag{43} \\
k_t &= \frac{1}{\sqrt{d}} W_k x_t + b_k \tag{44} \\
v_t &= W_v x_t + b_v \tag{45} \\
i_t &= \exp(\tilde{i}_t), & \tilde{i}_t &= w_i^{\top} x_t + b_i \tag{46} \\
f_t &= \sigma(\tilde{f}_t) \ \text{OR} \ \exp(\tilde{f}_t), & \tilde{f}_t &= w_f^{\top} x_t + b_f \tag{47} \\
o_t &= \sigma(\tilde{o}_t), & \tilde{o}_t &= W_o x_t + b_o \tag{48}
\end{align}

The forward pass of mLSTM is described by Equations (40)-(48) above, while the detailed architecture is shown in Figure 6.

Figure 6: Parallelized Extended LSTM with Matrix Memory.
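To make the recurrence in Equations (40)-(48) concrete, a single time step can be traced with the following sketch. It is an illustrative, unoptimized recurrent form, not the parallelized version of Figure 6 and not the authors' code. The names `mlstm_step`, `matvec`, `dot`, and the parameter dictionary `p` are hypothetical, plain Python lists stand in for tensors, the forget gate uses the sigmoid option of Equation (47), and no numerical stabilization is applied to the exponential input gate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mlstm_step(x, C_prev, n_prev, p):
    """One recurrent mLSTM step; returns (h_t, C_t, n_t)."""
    d = len(n_prev)  # hidden size; C_prev is a d x d matrix

    # Eqs. (43)-(45): query, scaled key, value.
    q = [a + b for a, b in zip(matvec(p["W_q"], x), p["b_q"])]
    k = [a / math.sqrt(d) + b for a, b in zip(matvec(p["W_k"], x), p["b_k"])]
    v = [a + b for a, b in zip(matvec(p["W_v"], x), p["b_v"])]

    # Eqs. (46)-(48): scalar input and forget gates, vector output gate.
    i_t = math.exp(dot(p["w_i"], x) + p["b_i"])
    f_t = sigmoid(dot(p["w_f"], x) + p["b_f"])  # assumed sigmoid option of Eq. (47)
    o = [sigmoid(a + b) for a, b in zip(matvec(p["W_o"], x), p["b_o"])]

    # Eq. (40): matrix memory update with the outer product v k^T.
    C = [[f_t * C_prev[r][c] + i_t * v[r] * k[c] for c in range(d)]
         for r in range(d)]
    # Eq. (41): normalizer state update.
    n = [f_t * n_prev[c] + i_t * k[c] for c in range(d)]

    # Eq. (42): normalized read-out, then output gating.
    denom = max(abs(dot(n, q)), 1.0)
    h_tilde = [value / denom for value in matvec(C, q)]
    h = [o_r * h_r for o_r, h_r in zip(o, h_tilde)]
    return h, C, n
```

Processing a sequence amounts to calling `mlstm_step` once per time step and feeding the returned `C_t` and `n_t` back in as `C_prev` and `n_prev`, starting from zero states.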