
1 Academy for Engineering & Technology, Fudan University
2 Tencent Youtu Lab
3 Cognition and Intelligent Technology Laboratory (CIT Lab)
4 Engineering Research Center of AI and Robotics, Ministry of Education, China
Email: [email protected], [email protected]
∗ Equal contribution. ✉ Corresponding author. † Work done during the internship at Tencent Youtu Lab.

Towards Multimodal Sentiment Analysis Debiasing via Bias Purification

Dingkang Yang∗† (1,3), Mingcheng Li (1,3), Dongling Xiao (2), Yang Liu (1), Kun Yang (1), Zhaoyu Chen (1), Yuzheng Wang (1), Peng Zhai (1), Ke Li (2), Lihua Zhang✉ (1,3,4)
Abstract

Multimodal Sentiment Analysis (MSA) aims to understand human intentions by integrating emotion-related clues from diverse modalities, such as visual, language, and audio. Unfortunately, the current MSA task invariably suffers from unplanned dataset biases, particularly multimodal utterance-level label bias and word-level context bias. These harmful biases potentially mislead models to focus on statistical shortcuts and spurious correlations, causing severe performance bottlenecks. To alleviate these issues, we present a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework based on causality rather than conventional likelihood. Concretely, we first formulate a causal graph to discover harmful biases from already-trained vanilla models. In the inference phase, given a factual multimodal input, MCIS imagines two counterfactual scenarios to purify and mitigate these biases. Then, MCIS can make unbiased decisions from biased observations by comparing factual and counterfactual outcomes. We conduct extensive experiments on several standard MSA benchmarks. Qualitative and quantitative results show the effectiveness of the proposed framework.

Keywords: Sentiment analysis · Multimodal learning

1 Introduction

“Believe nothing you hear, and only one half that you see.”

-Edgar Allan Poe, The System of Doctor Tarr and Professor Fether

As an essential task in human intention understanding, Multimodal Sentiment Analysis (MSA) [59, 58, 16, 14] attempts to empower machines with the senses of “hearing” [11] and “seeing” [22] to mimic human perception of emotions from diverse modalities. Following the traditional likelihood rule, most existing studies focus on improving MSA performance by exploiting various strategies, including disentangled representation learning [38, 10, 44, 17], attention-based cross-modal interactions [37, 18, 23, 45, 48, 13, 49], fusion mechanisms [28, 57, 21, 31, 34, 46, 50], and well-designed auxiliary tasks [56, 42, 15]. Despite the impressive improvements achieved by numerous works, they all invariably capture harmful dataset biases [33, 24, 6] and suffer from unintended confounders [36, 30, 43], namely the multimodal utterance-level label bias and the word-level context bias.

Figure 1: The distribution of (a) sentiment labels and (b) several context words from the training set on the MOSI dataset [59].

The harmful label bias usually occurs when the number of training samples for a specific category is considerably larger than for other categories. For instance, Fig. 1(a) illustrates that the positive samples dominate the MOSI dataset [59] compared to the other samples. Worse still, a binary sentiment analysis dataset could have a label distribution of 95% : 5% [5]. In this case, many previous studies [60, 30, 5] have indicated that such an unbalanced data distribution leads trained models to rely heavily on label bias as a statistical shortcut, resulting in inaccurate predictions. Different from unimodal tasks that potentially convey the adverse effects via specific modalities [40, 30], most MSA models are poisoned with side effects captured by multimodal representations because the multiple modalities in each sample share the same sentiment label [1, 59, 58].

Moreover, previous studies [37, 42, 10] have demonstrated that the language modality plays a more important role in MSA than the non-linguistic modalities, i.e., a suitable language model alone could achieve considerable performance [28]. Nevertheless, linguistic information is not always beneficial due to the inherent context bias [19, 30]. The context bias generally emerges when trained models exhibit strong spurious correlations between specific categories and context words in the language modality. In Fig. 1(b), some emotionally ambiguous words appear with imbalanced frequency in negative and positive samples. Consequently, MSA models tend to assign samples containing those words to an incorrect category based on biased statistical information rather than intrinsic textual semantics [41, 30]. For example, Fig. 2(a) shows the predicted binary classification result from a state-of-the-art (SOTA) model [17] on the MOSI dataset. As the context words “good” and “very” appear more frequently in the positive than in the negative samples in the training set, the model predicts the testing sample as “positive” via an unreliable association. Therefore, to perform more reasonable sentiment inference, we need to suitably purify and eliminate the prejudicial effects caused by these biases in prior observations, as shown in Fig. 2(b).

Unlike machines that make biased predictions directly from an inference process by considering prior observations, humans have a natural counterfactual intuition [24]. Specifically, even though we are born and learn in a biased world, the counterfactual ability [35] enables us to make unbiased decisions by removing exogenous interference (e.g., label bias under limited observations) and endogenous reason (e.g., language context bias). The underlying mechanism is causality-based: decisions are made by counterfactual inference to pursue a true causal effect rather than a statistical shortcut or spurious correlation. To this end, we depict the counterfactual scenario as follows:

Counterfactual MSA: What will the prediction be, if the model does not see the multimodal input or only sees context words in the language modality?

Intuitively, the counterfactual MSA has two outcomes. (1) The trained model relies purely on the statistical shortcut for prediction under the no-treatment condition of the multimodal input. In this case, Fig. 2(b) shows that the purified label bias results in a higher probability of “positive” than “negative”. (2) The trained model relies only on the spurious correlation for prediction under the intervention that preserves only the context words. The result contains the pure side effect obtained by distilling the context bias. Motivated by the above observations, we propose a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework to mitigate the deleterious impact of the two types of dataset biases. Concretely, we first design a tailored causal graph for MSA to diagnose causalities among variables and identify the dataset biases as unintended confounders. The proposed framework is parameter-free and training-free, meaning that MCIS accommodates already-trained models following biased vanilla training via our generalized causal graph. During the inference phase, MCIS intervenes on the confounded multimodal inputs via backdoor adjustment theory [25, 30] to mimic the two counterfactual outcomes described above. By subtracting the counterfactual outcomes of the pure dataset biases, MCIS consistently improves the performance of SOTA models with unbiased predictions.

The main contributions are summarized as follows:

  • We are the first to identify and disentangle the label and context biases in the MSA task from a novel causal inference perspective. Based on innate human counterfactual intuition, we empower models to achieve unbiased predictions in biased observations.

  • Our causality-based MCIS is general and suitable for different MSA architectures and fusion mechanisms.

  • Comprehensive experiments on several MSA benchmarks demonstrate the effectiveness of our framework.

2 Related Work

Figure 2: An example of multimodal sentiment analysis. (a) Likelihood-based biased prediction from the re-implemented model DMD [17]. (b) Unbiased prediction from the same model in the proposed framework. Binary classification results are shown for illustration.

Multimodal Sentiment Analysis. Instead of modeling linguistic information alone [29], MSA aims to integrate additional non-linguistic modalities to learn sentiment-related representations, such as visual [58] and acoustic signals [45]. Driven by learning-based techniques [53, 47, 54, 2, 12, 20], mainstream MSA studies follow two aspects: representation learning and multimodal fusion. Multimodal representation learning [10, 18, 55, 38, 42, 17] attends to mitigating modality gap or information redundancy to obtain refined modality semantics. For instance, Hazarika et al. [10] advocated projecting each modality into modality-invariant and -specific spaces to learn complementary information. For multimodal fusion, previous works [23, 37, 57, 31, 21] have explored sophisticated fusion strategies and mechanisms to obtain effective representations. As a typical example, Tsai et al. [37] achieved potential adaption fusion from one modality to another based on multimodal transformers. Despite the impressive improvements achieved by previous studies following traditional likelihood estimation, they invariably ignored the adverse effects of the dataset biases, resulting in biased predictions. In comparison, we achieve unbiased decisions by exploiting causality-based counterfactual thinking. The proposed framework significantly improves the performance of existing models without any complex network designs and parameters.

Causal Inference. Causal inference is a tool that seeks actual effects in a specific phenomenon [25]. Currently, the mainstream causal inference studies applied to deep learning consist of two aspects: intervention [40, 3, 36, 51] and counterfactuals [24, 35, 33, 30, 52]. Intervention is an operation that alters the original data distribution to discover causal effects [7]. Counterfactuals depict imagined outcomes produced by factual variables under different treatments [26]. Our study focuses on obtaining counterfactual outcomes via intervention. Causal inference can remove confounders in data and learn actual causal effects instead of spurious associations, so it has been widely used in many downstream tasks to improve model performance, including visual question answering [24], natural language understanding [36], and scene graph generation [35]. A recent study [33] focused on designing an additional model to capture the harmful effect of textual semantics. However, it ignored the label bias and failed to disentangle the main content and context at the word level, and is thus incapable of attributing the language bias. Different from previous efforts [30, 33], this is the first work to identify both label bias and context bias in the MSA task from a causal perspective. Our framework effectively eliminates the side effects of dataset biases from multimodal inputs, which makes a step towards unbiased prediction in this field.

3 Methodology

3.1 Framework Overview

The proposed MCIS framework is illustrated in Fig. 4(b). Concretely, MCIS allows already-trained models to preserve harmful dataset biases via biased conventional learning. Given a factual multimodal input in the inference phase, MCIS imagines two types of multimodal counterfactual inputs to obtain two counterfactual outputs: purified label bias and context bias. Eventually, MCIS performs a bias elimination strategy in adaptive proportions to obtain unbiased counterfactual predictions by comparing factual and counterfactual outcomes.

Figure 3: (a) The tailored causal graph for MSA. (b) The simplified causal graph for MSA. (c) Comparison between factual MSA and counterfactual MSA. White nodes are at the value $M=m$ while gray nodes are at the value $M=\hat{m}$ or $M=\tilde{m}$.

3.2 Structural Causal Graph in MSA

Problem Formalization. Given multimodal utterance inputs from video segments, MSA aims to predict sentiment scores by learning multimodal models $\mathcal{F}(\cdot)$ using language ($l$), audio ($a$), and visual ($v$) modalities. This conventional training procedure is represented as $\hat{y}_i = \mathcal{F}(l, a, v)$, where $\hat{y}_i \in \mathbb{R}$ is a sentiment intensity variable. Aligned with previous mainstream works [37, 10, 32, 56, 9], we regard MSA as a regression task to ensure a fair comparison.

Cause-Effect Look at MSA. To diagnose the causal relationships among variables, we formulate a causal graph to summarize the MSA framework. Here, we represent a random variable as a capital letter (e.g., $L$), and denote its observed value as a lowercase letter (e.g., $l$). Theoretically, a causal graph $\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}$ is a directed acyclic graph that represents how a set of variables $\mathcal{N}$ convey causal effects through the causal links $\mathcal{E}$. It provides an intuitive reference to causal relationships for counterfactual analysis [35, 24] and causal intervention [25, 7]. In Fig. 3(a), there are six variables in the MSA causal graph: language modality $L$, audio modality $A$, visual modality $V$, multimodal representation $M$, harmful confounders $Z$, and prediction $Y$. According to causal theories [27, 25], the adverse dataset biases act as the confounders that “poison” models. The causal relationships among these variables are explained as follows:

▶ Link $\bm{(L, A, V)} \rightarrow \bm{M} \rightarrow \bm{Y}$. Following biased learning [10, 37], the causal path $(L, A, V) \rightarrow M$ indicates that the multimodal inputs $(L, A, V)$ produce the final multimodal representation $M$ through MSA models $\mathcal{F}(\cdot)$:

$$m = \xi_{M}(L = l, A = a, V = v), \tag{1}$$

where $\xi_{M}(\cdot)$ is a fusion strategy that depends on different models (e.g., Transformer [10] or concatenation [37]). Subsequently, the link $M \rightarrow Y$ reflects that MSA models estimate the desired prediction $Y$ based on pure $M$.

▶ Link $\bm{M} \leftarrow \bm{Z} \rightarrow \bm{Y}$. According to [27], the confounders $Z$ are the common cause of $M$ and $Y$. The dataset biases follow the backdoor causal path $M \leftarrow Z \rightarrow Y$ to establish spurious associations that prevent the models from pursuing true causal effects, which we should eliminate.

Without loss of generality, the nodes $(L, A, V)$ are omitted for simplicity since they are not directly affected by $Z$. The new causal graph is illustrated in Fig. 3(b). Existing models rely on the likelihood $P(Y|M)$ following the new graph. This process is formulated via the Bayes rule [40]:

$$\mathcal{F}(m) = P(Y|M) = \sum_{z} P(Y|M, z)\, P(z|M), \tag{2}$$

where $z$ is any confounder caused by the label or context bias. In this case, MSA models would invariably focus on the statistical shortcut or spurious correlation to perform biased predictions, significantly limiting their performance. To remove the detrimental effect caused by $z$, our insight is to embrace backdoor adjustment [25], i.e., predicting an actively intervened outcome via the $do$-operator [7]. As a typical causal intervention, $do(\cdot)$ prevents the effect of parent nodes that cause variables from the non-causal direction, i.e., $Z \rightarrow M$. As shown in Fig. 3(c), the intervention cuts the causal path from $Z$ to $m$, i.e., $m$ is no longer affected by $Z$. In practice, we intervene on $m$ using counterfactual embeddings under different scenarios to purify the label and context biases in Secs. 3.3 and 3.4.
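For reference, the following is a sketch of how the backdoor adjustment differs from Eq. (2). It assumes the simplified graph in Fig. 3(b), where $Z$ is the only common cause of $M$ and $Y$, and it is the standard textbook form rather than a formula quoted from this paper:

$$P(Y \mid do(M = m)) = \sum_{z} P(Y \mid M = m, z)\, P(z).$$

Compared with Eq. (2), the weight $P(z \mid M)$ is replaced by the prior $P(z)$, so the confounder no longer depends on the multimodal representation once the link $Z \rightarrow M$ is cut.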

Figure 4: (a) The biased learning of MSA models follows the factual training. (b) The architecture of our MCIS framework. MCIS compares factual and counterfactual outcomes for different multimodal input treatments. By subtracting the label and context biases, MCIS can achieve unbiased predictions from biased observations.

3.3 Label Bias Purification

As Fig. 4(a) shows, the unbalanced label distribution (i.e., “positive” dominates the training data over “negative”) misleads MSA models to establish non-causal associations between the input samples and the positive category. In this case, MSA models would give predictions based on statistical shortcuts even though the contents of the multimodal testing samples are not observed [8]. To implement the theoretical $do(\cdot)$ intervention, we utilize $\hat{m}$ to denote the imagined counterfactual multimodal representation. The intervention-based counterfactual outcome is as follows:

$$\begin{aligned} &P(Y|do(M)) = P(Y|M = \hat{m}) = \mathcal{F}(\hat{m}), \\ &\hat{m} = \xi_{M}(L = \hat{l}, A = \hat{a}, V = \hat{v}). \end{aligned} \tag{3}$$

Here $\hat{l}$, $\hat{a}$, and $\hat{v}$ represent the no-treatment condition where $l$, $a$, and $v$ are not given. As MSA models cannot “see” any multimodal inputs after the intervention, the counterfactual output $\mathcal{F}(\hat{m})$ actually reflects the purely adverse effect from the trained models, i.e., the label bias captured by $Z$. Considering that neural network models cannot deal with void inputs, we utilize average features over the entire training set as counterfactual embeddings for different modalities:

$$\hat{l} = \frac{1}{N}\sum_{i=1}^{N} l_i, \quad \hat{a} = \frac{1}{N}\sum_{i=1}^{N} a_i, \quad \hat{v} = \frac{1}{N}\sum_{i=1}^{N} v_i, \tag{4}$$

where $N$ is the number of training samples. Empirically, the average embedding usually produces a distribution similar to the ideal bias and forces the models to decouple the outcome of the harmful bias as humans do [35].
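The averaging in Eq. (4) and the counterfactual forward pass in Eq. (3) can be sketched as follows. This is a minimal illustration, assuming each training sample has already been reduced to a fixed-length feature vector per modality; `mean_embedding`, `label_bias_output`, and `toy_model` are hypothetical names, and `toy_model` merely stands in for an already-trained MSA model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_embedding(features: Sequence[Sequence[float]]) -> Vector:
    """Element-wise average of equally sized feature vectors (Eq. 4)."""
    n = len(features)
    dim = len(features[0])
    return [sum(sample[d] for sample in features) / n for d in range(dim)]


def label_bias_output(
    model: Callable[[Vector, Vector, Vector], float],
    train_l: Sequence[Sequence[float]],
    train_a: Sequence[Sequence[float]],
    train_v: Sequence[Sequence[float]],
) -> float:
    """Feed the averaged inputs to the trained model to obtain F(m_hat) (Eq. 3)."""
    return model(
        mean_embedding(train_l),
        mean_embedding(train_a),
        mean_embedding(train_v),
    )


# Hypothetical stand-in for an already-trained MSA model (assumption).
def toy_model(l: Vector, a: Vector, v: Vector) -> float:
    return 0.5 + sum(l) + sum(a) + sum(v)


bias = label_bias_output(
    toy_model,
    train_l=[[1.0, 2.0], [3.0, 4.0]],
    train_a=[[0.0], [2.0]],
    train_v=[[-1.0], [1.0]],
)
```

Because the averaged inputs do not depend on the test sample, this output only needs to be computed once per trained model and can then be reused for every test input.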

3.4 Context Bias Purification

Motivated by human decision-making that combines exogenous and endogenous reasons [39], language utterances can be divided into main content words and context words. The main content words provide valuable semantic clues (e.g., emotionally beneficial semantics). Conversely, the context words (e.g., stop words or a part of adjectives) act as confounders that trick the models into focusing on spurious correlations between semantically unimportant contexts and specific categories (e.g., the good $\leftrightarrow$ positive mapping). To this end, we use $\tilde{m}$ to achieve another counterfactual outcome with only context words:

$$\begin{aligned} &P(Y|do(M)) = P(Y|M = \tilde{m}) = \mathcal{F}(\tilde{m}), \\ &\tilde{m} = \xi_{M}(L = \tilde{l}, A = \breve{a}, V = \breve{v}). \end{aligned} \tag{5}$$

Here $\tilde{l}$ denotes the counterfactual word embedding where the main content words are masked. The mask operation process is as follows:

$$\forall w_j \in \tilde{l}, \quad \begin{cases} w_j \longleftarrow \text{[MASK]} & \text{if } w_j \in l_{\text{content}}, \\ w_j \longleftarrow w_j & \text{if } w_j \in l_{\text{context}}, \end{cases} \tag{6}$$

where the [MASK] symbol is a special token to mask a single word $w_j$. Meanwhile, $\breve{a}$ and $\breve{v}$ denote unseen empty embeddings, a.k.a. zero feature embeddings. In this situation, MSA models could only rely on visible context words to make bias-based predictions. Essentially, the counterfactual outcome $\mathcal{F}(\tilde{m})$ reflects the pure side effect from the trained vanilla models and word-level harmful context bias.
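The masking rule in Eq. (6) and the zero embeddings for the non-linguistic modalities can be sketched as below. This is only an illustration under stated assumptions: the context vocabulary `CONTEXT_WORDS` is a hypothetical hand-picked set, whereas the paper derives the content/context split with NLTK, and the helper names are not taken from the authors' code.

```python
from typing import List, Sequence, Set


def mask_content_words(
    tokens: Sequence[str],
    context_words: Set[str],
    mask_token: str = "[MASK]",
) -> List[str]:
    """Keep context words and replace every other token with the mask (Eq. 6)."""
    return [tok if tok.lower() in context_words else mask_token for tok in tokens]


def zero_embedding(dim: int) -> List[float]:
    """All-zero feature vector used for the unseen audio and visual inputs."""
    return [0.0] * dim


# Hypothetical context vocabulary, for illustration only (assumption).
CONTEXT_WORDS = {"the", "was", "very", "good"}

tokens = ["The", "movie", "was", "very", "good"]
masked = mask_content_words(tokens, CONTEXT_WORDS)
audio_empty = zero_embedding(4)
```

The masked token list would then be embedded by the same language encoder as the factual input, so that the trained model only sees the context words.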

3.5 Bias Elimination Strategy

Thanks to humans’ innate counterfactual intuition [24, 30], we can wisely reveal actual causal effects among variables in biased observations rather than superficial connections. The human inference process for unbiased decisions is essentially achieved by comparing factual and counterfactual outcomes [26]. To block the transfer of biases from the training data to the inference process, we imitate such human intuition to introduce an operationally simple yet empirically powerful subtraction operation (i.e., bias elimination strategy). The debiased prediction via the strategy is as follows:

$$\aleph(m) = \mathcal{F}(m) - \left(\hat{\lambda}\,\mathcal{F}(\hat{m}) + \tilde{\lambda}\,\mathcal{F}(\tilde{m})\right), \tag{7}$$

where $\mathcal{F}(m)$ and $\aleph(m)$ correspond to the traditional factual prediction and the counterfactual prediction, respectively. $\mathcal{F}(\hat{m})$ and $\mathcal{F}(\tilde{m})$ are the label bias and context bias purified from the poisoned models. Two adaptive trade-off parameters, $\hat{\lambda}$ and $\tilde{\lambda}$, are applied to measure the extent of label bias and context bias. Since different datasets suffer from varying extents of bias, a grid search strategy is utilized on the validation set to estimate the extent to which the two biases poison the models. We implement the search for $\hat{\lambda}$ and $\tilde{\lambda}$ in a two-dimensional space over a specific interval:

$$\hat{\lambda}^{*},\, \tilde{\lambda}^{*} = \mathop{\arg\max}\limits_{\hat{\lambda},\, \tilde{\lambda} \in [\alpha, \beta]} \Phi_{\mathcal{D}}\left(\aleph(m \mid \hat{\lambda}, \tilde{\lambda})\right), \tag{8}$$

where $[\alpha, \beta]$ is the search interval. $\Phi(\cdot)$ is a function used for calculating a specific metric that measures the model’s performance on the validation set $\mathcal{D}$. The evaluation metric is the weighted F1-score, which is the balanced harmonic mean of precision and recall and can excellently reflect the extent of the dataset biases, especially for the imbalanced data. To reduce the invalid computational overhead during the bias elimination process, we employ a coarse-to-fine grid search strategy to perform a search by gradually narrowing the search interval and step size. As two dataset-level parameters, they are searched only once for each validation set and can be used in inference for all testing data.

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. We conduct experiments on two datasets of different scales that show significant label and context biases [33]. MOSI [59] is a realistic dataset comprising 2,199 opinion video clips collected from YouTube, with 1,284, 229, and 686 video clips in the training, validation, and testing sets, respectively. The MOSEI [58] benchmark contains 23,453 annotated video segments from over 1,000 speakers and 250 topics. There are 16,326, 1,871, and 4,659 video segments in the training, validation, and testing sets, respectively. In both datasets, each sample is labeled with a sentiment score from -3 (strongly negative) to +3 (strongly positive).

Evaluation Metrics. Following previous works [23, 18], we leverage various metrics to evaluate the MCIS framework’s performance: seven-class classification accuracy (Acc-7), i.e., the proportion of predicted scores that fall into the correct one of seven intervals from -3 to +3; binary classification accuracy (Acc-2); and the weighted F1 score computed for positive/negative classification results.
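As a rough illustration of how Acc-7 and Acc-2 can be derived from regression outputs, consider the sketch below. The conventions are assumptions, since the text does not spell them out: scores are clipped to [-3, 3] and rounded to the nearest integer for Acc-7, and samples with a neutral label of 0 are excluded from the positive/negative Acc-2.

```python
from typing import Sequence


def acc7(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Seven-class accuracy after clipping to [-3, 3] and rounding (assumed)."""
    def to_bin(x: float) -> int:
        return round(max(-3.0, min(3.0, x)))

    hits = sum(to_bin(p) == to_bin(y) for p, y in zip(preds, labels))
    return hits / len(labels)


def acc2(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Positive/negative accuracy, skipping neutral (zero) labels (assumed)."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    hits = sum((p > 0) == (y > 0) for p, y in pairs)
    return hits / len(pairs)


preds = [2.6, -1.2, 0.4, -0.2]
labels = [3.0, -1.0, 1.0, 0.0]
seven_class = acc7(preds, labels)
binary = acc2(preds, labels)
```

Other codebases use a negative/non-negative split for Acc-2 instead, so reported numbers are only comparable when the same convention is applied.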

4.2 Model Zoo

To fully evaluate the effectiveness of MCIS across different methods, we select five representative and reproducible state-of-the-art (SOTA) models. Concretely, MulT [37] learns element correlations among modalities via paired cross-modal attention interactions. MISA [10] projects each modality into two distinct subspaces to learn the discrepancy and consistency across modalities separately. CubeMLP [32] utilizes three independent multi-layer perceptron units for feature-mixing on three axes. MMIM [9] maximizes the mutual information during multimodal fusion to maintain task-related information. DMD [17] introduces cross-modal distillations to facilitate the transfer of informative semantics from strong to weak modalities.

4.3 Implementation Details

Feature Extraction. Following the original protocols of the models above, we use the audio and visual features provided with MOSI and MOSEI. The language embeddings are extracted by the pre-trained BERT [4]; whether BERT is fine-tuned depends on the vanilla settings of each method. Moreover, we employ the Python NLTK toolkit to tokenize sentences into word lists and then extract the main content words that may affect the semantics in the transcripts. The average mask ratio of the main content words is 68.96%. For the grid search strategy, the search step and search interval are 0.5 and [-2.0, 2.0] in the coarse search process. In the fine search process, the search step is 0.1, while the search interval depends on the results of the coarse search process.

Experimental Setup. We re-implement these five SOTA models based on the public codebase and combine them with our MCIS framework. All models are reproduced on NVIDIA Tesla V100 GPUs. For impartiality, the training settings of these models (e.g., loss function, batch size, learning rate strategy, and other hyper-parameters) are consistent with the details reported in original papers.

Table 1: Comparison results on the MOSI testing set. All models use the BERT-based word embedding. †: reproduced results from public code with hyper-parameters provided in original papers. The improved results are marked in bold.

| Models | Acc-7 (%) | Acc-2 (%) | F1 (%) |
|---|---|---|---|
| TFN [57] | 34.9 | 80.8 | 80.7 |
| LMF [21] | 33.2 | 82.5 | 82.4 |
| MFM [38] | 35.4 | 81.7 | 81.6 |
| ICCN [34] | 39.0 | 83.0 | 83.0 |
| MAG-BERT [31] | 43.6 | 84.4 | 84.6 |
| FDMER [44] | 44.1 | 84.6 | 84.7 |
| Self-MM [56] | 45.8 | 84.8 | 84.9 |
| MulT (ACL’19) [37] | 42.6 | 84.1 | 83.9 |
| MulT + MCIS | 43.5 | 85.5 | 85.2 |
| MISA (ACM MM’20) [10] | 42.1 | 82.3 | 82.6 |
| MISA + MCIS | 42.0 | 83.7 | 84.1 |
| MMIM (EMNLP’21) [9] | 46.4 | 85.5 | 85.4 |
| MMIM + MCIS | 47.9 | 86.6 | 86.5 |
| CubeMLP (ACM MM’22) [32] | 44.5 | 84.7 | 84.6 |
| CubeMLP + MCIS | 45.7 | 85.9 | 85.8 |
| DMD (CVPR’23) [17] | 45.3 | 85.1 | 85.1 |
| DMD + MCIS | 46.5 | 86.3 | 86.3 |
Table 2: Comparison results on the MOSEI testing set. All models use the BERT-based word embedding. †: reproduced results from public code with hyper-parameters provided in original papers. The improved results are marked in bold.
Models Acc-7 (%) Acc-2 (%) F1 (%)
TFN [57] 50.2 82.5 82.1
LMF [21] 48.0 82.0 82.1
MFM [38] 51.3 84.4 84.3
ICCN [34] 51.6 84.2 84.2
MAG-BERT [31] 52.7 84.8 84.7
FDMER [44] 54.1 86.1 85.8
Self-MM [56] 53.5 85.0 84.9
MulT(ACL’19) [37] 52.3 82.7 82.5
MulT + MCIS 54.1 84.3 84.0
MISA(ACM MM’20) [10] 52.1 84.4 84.2
MISA + MCIS 53.6 85.8 85.7
MMIM(EMNLP’21) [9] 53.1 85.1 85.0
MMIM + MCIS 54.5 86.7 86.6
CubeMLP(ACM MM’22) [32] 52.7 84.2 83.7
CubeMLP + MCIS 54.2 86.2 85.9
DMD(CVPR’23) [17] 53.9 85.6 85.5
DMD + MCIS 55.2 87.3 87.1

4.4 Comparison with State-of-the-art Methods

We compare the MCIS-based models with recent competitive methods, including TFN [57], LMF [21], MFM [38], ICCN [34], MAG-BERT [31], FDMER [44], and Self-MM [56]. The results on MOSI and MOSEI are reported in Tables 1 and 2. The key observations are as follows. (i) The models with MCIS consistently outperform the vanilla versions on most evaluation metrics for both datasets. In particular, the MCIS-based MMIM [9] achieves a new SOTA with Acc-7/Acc-2/F1 scores of 47.9%/86.6%/86.5% on MOSI. Thanks to MCIS, the distillation-based DMD [17] yields the best results on MOSEI with substantial improvements of 1.3%, 1.7%, and 1.6% on these three metrics. The performance gains across methods with different representation learning patterns [10, 9, 17] and fusion strategies [37, 32] confirm the usefulness and generalizability of our framework.

(ii) Existing models obtain limited improvements (about 0.54%∼1.26% average gain across all metrics) via complex structures and numerous parameters [44, 31, 34, 37, 56, 38, 17, 57, 21]. In contrast, MCIS achieves larger improvements (about 0.94%∼1.76% average gain across all metrics) by removing harmful biases only at the inference phase in a parameter-free manner. In practice, our framework is cost-effective compared to training a new SOTA model from scratch since the time overhead is reduced by about 26 times on average. These results suggest that the biases are overlooked “culprits” and underline the importance of counterfactual debiasing.

(iii) Furthermore, we find that the MCIS-based models provide larger improvements on MOSEI (about 1.63% average gain across models) than on MOSI (about 1.16% average gain across models). A possible explanation is that the abundant samples in the large-scale dataset help trained models preserve the two biases in a form closer to the ideal distribution, which allows MCIS to purify and mitigate the adverse effects more effectively.
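As a quick arithmetic check of the per-dataset averages, the gains can be recomputed from the rounded entries of Tables 1 and 2. The sketch below assumes the average is taken uniformly over the five models and three metrics, which is one plausible reading of "average gain across models". With the rounded table values it gives roughly 1.15 on MOSI and 1.61 on MOSEI, in line with the approximate figures quoted above.

```python
# (Acc-7, Acc-2, F1) for each vanilla model and its MCIS-based version,
# copied from the rounded entries of Tables 1 and 2.
MOSI = {
    "MulT":    ((42.6, 84.1, 83.9), (43.5, 85.5, 85.2)),
    "MISA":    ((42.1, 82.3, 82.6), (42.0, 83.7, 84.1)),
    "MMIM":    ((46.4, 85.5, 85.4), (47.9, 86.6, 86.5)),
    "CubeMLP": ((44.5, 84.7, 84.6), (45.7, 85.9, 85.8)),
    "DMD":     ((45.3, 85.1, 85.1), (46.5, 86.3, 86.3)),
}
MOSEI = {
    "MulT":    ((52.3, 82.7, 82.5), (54.1, 84.3, 84.0)),
    "MISA":    ((52.1, 84.4, 84.2), (53.6, 85.8, 85.7)),
    "MMIM":    ((53.1, 85.1, 85.0), (54.5, 86.7, 86.6)),
    "CubeMLP": ((52.7, 84.2, 83.7), (54.2, 86.2, 85.9)),
    "DMD":     ((53.9, 85.6, 85.5), (55.2, 87.3, 87.1)),
}


def mean_gain(table):
    """Average absolute gain over all models and metrics in a table."""
    gains = [after - before
             for vanilla, mcis in table.values()
             for before, after in zip(vanilla, mcis)]
    return sum(gains) / len(gains)


mosi_gain = mean_gain(MOSI)
mosei_gain = mean_gain(MOSEI)
print(f"MOSI mean gain:  {mosi_gain:.2f}")
print(f"MOSEI mean gain: {mosei_gain:.2f}")
```

The small differences from the quoted averages may come from rounding in the tables or from a different averaging scheme; the ordering between the two datasets is the same either way.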

Table 3: Ablation study results of different dataset biases. We provide comprehensive results for five MCIS-based SOTA models on the MOSEI testing set. Similar trends are also observed on the MOSI. “w/o” is short for the without.
Designs/Mechanisms MulT + MCIS MISA + MCIS MMIM + MCIS CubeMLP + MCIS DMD + MCIS
Acc-7 Acc-2 F1 Acc-7 Acc-2 F1 Acc-7 Acc-2 F1 Acc-7 Acc-2 F1 Acc-7 Acc-2 F1
Full Framework 54.1 84.3 84.0 53.6 85.8 85.7 54.5 86.7 86.6 54.2 86.2 85.9 55.2 87.3 87.1
w/o Label Bias Elimination 53.8 83.7 83.5 53.2 85.3 85.2 54.2 86.0 85.9 54.0 85.7 85.5 54.8 86.7 86.6
w/o Context Bias Elimination 52.8 83.2 83.1 52.5 84.7 84.7 53.3 85.5 85.4 53.2 84.8 84.5 54.2 86.1 85.9
w/o Grid Search Strategy 52.6 83.0 82.8 51.5 83.8 83.6 52.3 84.4 84.2 52.8 84.4 84.0 53.7 86.0 85.7
Table 4: Ablation study results of multimodal counterfactual embeddings in the label bias. “L/A/V/RCE” stands for language, audio, visual, and random counterfactual embeddings, respectively.
Models Metrics Full w/o LCE w/o ACE w/o VCE w/ RCE
MulT [37] + MCIS Acc-2 (%) 84.3 83.9 84.2 84.1 83.6
MulT [37] + MCIS F1 (%) 84.0 83.7 83.9 83.8 83.3
MISA [10] + MCIS Acc-2 (%) 85.8 85.4 85.6 85.7 85.1
MISA [10] + MCIS F1 (%) 85.7 85.4 85.5 85.6 85.0
MMIM [9] + MCIS Acc-2 (%) 86.7 86.2 86.5 86.4 85.8
MMIM [9] + MCIS F1 (%) 86.6 86.1 86.5 86.2 85.7
CubeMLP [32] + MCIS Acc-2 (%) 86.2 85.8 86.0 86.1 85.5
CubeMLP [32] + MCIS F1 (%) 85.9 85.6 85.7 85.9 85.1
DMD [17] + MCIS Acc-2 (%) 87.3 86.8 87.0 87.1 86.5
DMD [17] + MCIS F1 (%) 87.1 86.7 86.8 87.0 86.3

4.5 Ablation Studies

We perform systematic ablation studies using the MCIS-based models on MOSEI. Comprehensive experiments aim to evaluate the different designs and mechanisms in the proposed MCIS.

Analysis of Different Dataset Biases. Table 3 investigates the two types of bias elimination and the grid search strategy (GSS). (i) Firstly, the label and context bias eliminations are retained separately to verify the effect of the distinct biases. The gain drops for all metrics reveal that it is indispensable to simultaneously remove statistical shortcuts and spurious correlations. The core explanation is that the purified label bias provides a sample-agnostic global offset and the purified context bias provides utterance-specific local offsets to correct the prediction space, allowing the trained models to sidestep the interference of harmful biases in the observed data. (ii) Another finding is that the impact of context bias is more severe than that of label bias, implying that misleading or unfair context words more easily mislead the trained models. This observation provides pertinent evidence for the dominance of the language modality in MSA [37, 42]. (iii) When our GSS is eliminated (i.e., $\hat{\lambda}=\tilde{\lambda}=1$), the degradation across all metrics indicates that proper mitigation of varying degrees of biases is essential.
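The interplay of the global and local offsets can be illustrated with a small sketch. The subtraction form below is an assumption based on the description of the two offsets and their scaling factors; the function names, argument names, and numbers are hypothetical and are not taken from the paper's equations.

```python
def debias(factual, label_bias, context_bias,
           lam_label=1.0, lam_context=1.0):
    """Subtract scaled bias estimates from the factual prediction.

    factual:      score predicted from the real multimodal input
    label_bias:   counterfactual score shared by all samples (global offset)
    context_bias: counterfactual score of this utterance (local offset)
    The subtraction form is an assumption made for illustration.
    """
    return factual - lam_label * label_bias - lam_context * context_bias


def debias_batch(factuals, label_bias, context_biases,
                 lam_label=1.0, lam_context=1.0):
    """Apply one global label offset and per-utterance context offsets."""
    return [debias(f, label_bias, c, lam_label, lam_context)
            for f, c in zip(factuals, context_biases)]


# Made-up scores: a mildly positive factual prediction becomes negative
# once both bias estimates are removed with unit scaling factors, which
# corresponds to the setting without grid search.
no_gss = debias(0.6, label_bias=0.3, context_bias=0.5)

# With searched scaling factors (hypothetical values), the correction is milder.
with_gss = debias(0.6, label_bias=0.3, context_bias=0.5,
                  lam_label=0.5, lam_context=0.4)

batch = debias_batch([0.6, -0.2], 0.3, [0.5, -0.1])
print(no_gss, with_gss, batch)
```

Setting both factors to 1 removes the full bias estimate for every model and dataset, whereas searched factors allow the strength of each correction to be tuned separately.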

Impact of MCE in Label Bias. Multimodal Counterfactual Embeddings (MCE) play an important role in obtaining the intervened outcomes based on the purified biases. (i) In practice, we investigate the necessity of Language, Audio, and Visual Counterfactual Embeddings (L/A/VCE) separately. As the decreased results in Table 4 show, incomplete counterfactual embeddings (i.e., the absence of any one of L/A/VCE) prevent the biased models from producing a multimodal representation that benefits from precise intervention, so the models fail to imagine a purely bias-based outcome. According to Fig. 3, the reason could be that $M$ is confounded by the harmful effects of statistical shortcuts conveyed jointly by the links from different modalities, i.e., $(L,A,V)\rightarrow M$. Therefore, sufficient intervention on every modality is required to purify the effective label bias. (ii) Across all MCIS-based models, the worst deterioration is observed when LCE is removed. Meanwhile, the impact of A/VCE depends on the model: removing ACE is less damaging for MulT and MMIM, whereas removing VCE only slightly impairs MISA, CubeMLP, and DMD. (iii) Additionally, we test an alternative in which the average features from the three modalities are replaced with Random Counterfactual Embeddings (RCE), which are initialized from a random distribution. The poor results are expected because random guesses are unlikely to produce a stable distribution that resembles the ideal bias.
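The difference between average-feature embeddings and random embeddings can be sketched with toy vectors. This is an illustration under stated assumptions: features are reduced to short fixed-length vectors, the average is taken over a toy training set, the random embeddings are drawn from a standard normal distribution, and the reading of "w/o LCE" as keeping the sample's original language features is a guess. All helper names are hypothetical.

```python
import random


def mean_embedding(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def random_embedding(dim, seed=0):
    """Random stand-in for a counterfactual embedding (the RCE variant)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def counterfactual_input(sample, embeddings, keep_original=()):
    """Swap every modality for its counterfactual embedding.

    Modalities listed in keep_original keep the sample's own features,
    which is how an ablation such as "w/o LCE" is interpreted here.
    """
    return {name: (sample[name] if name in keep_original else embeddings[name])
            for name in sample}


# Toy training features for each modality (two samples, two dimensions).
train = {
    "language": [[1.0, 2.0], [3.0, 4.0]],
    "audio":    [[0.0, 1.0], [2.0, 3.0]],
    "visual":   [[5.0, 5.0], [7.0, 9.0]],
}
mce = {name: mean_embedding(feats) for name, feats in train.items()}
rce = {name: random_embedding(len(feats[0]), seed=i)
       for i, (name, feats) in enumerate(train.items())}

sample = {"language": [9.0, 9.0], "audio": [8.0, 8.0], "visual": [7.0, 7.0]}
full = counterfactual_input(sample, mce)
without_lce = counterfactual_input(sample, mce, keep_original=("language",))
print(full)
print(without_lce)
```

Because the averaged embeddings are identical for every sample, the resulting prediction carries no sample-specific evidence, which matches the description of the label bias as a sample-agnostic offset.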

Table 5: Ablation study results of multimodal counterfactual embeddings in the context bias. “Mask” means the mask operation in Eq. 6. “w/” and “w/o” are short for the with and without, respectively. We only report F1 scores for visual clarity.
Designs MulT [37] + MCIS MISA [10] + MCIS MMIM [9] + MCIS CubeMLP [32] + MCIS DMD [17] + MCIS
Full Framework 84.0 85.7 86.6 85.9 87.1
w/o Mask 83.2 84.9 85.6 84.7 86.3
w/ All Mask 83.6 85.2 86.1 85.4 86.7
w/ Random Mask 83.3 84.6 85.9 85.0 86.2
w/o ACE 83.7 85.6 86.5 85.8 86.9
w/o VCE 83.9 85.4 86.4 85.7 87.1
w/ RCE 82.8 84.4 85.4 84.3 85.8
Figure 5: Case study of counterfactual learning on MOSI and MOSEI. We report the binary evaluation results from the DMD [17] with our MCIS for the intuitive display. Label/Context Word Distribution: the imbalanced distribution of sentiment labels and context words in positive and negative categories comes from the training set.

Impact of MCE in Context Bias. Intuitively, the core of context bias elimination is masking the main content words and forcing the models to focus only on the spurious correlations provided by the context words. To explore this, (i) we apply no masking (w/o Mask), all masking (w/ All Mask), and random masking (w/ Random Mask) separately before converting the transcripts into word embeddings via the pre-trained BERT in Table 5. The decreased F1 scores support three explanations: (1) Since the language modality is unavailable under all masking, the Context Bias Elimination (CBE) process does not affect the linguistic effects. Despite the bias of vanilla models, the main content words contribute more valuable gains. (2) In contrast, under no masking, CBE purifies the effects of both good semantics and bad bias, leading to worse results. (3) The poorer results for random masking than for all masking suggest that CBE probably over-eliminates meaningful clues as the main content words dominate. (ii) Furthermore, the original features of different training samples are retained when ACE and VCE are removed separately. The gain drops in most cases suggest that our zero feature embedding assumption guarantees a safe estimation for the purification of pure word-level context bias. (iii) As an alternative to L/A/VCE, RCE gives the worst performance, which supports the rationality of the proposed embedding paradigm.
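The masking step can be sketched as follows. The sketch replaces the NLTK-based content-word extraction with a small hand-written set of context words, which is purely a stand-in so that the example runs without external resources. The `[MASK]` token, the helper names, and the feature shapes are likewise assumptions.

```python
# Hypothetical stand-in for the NLTK-based extraction: any token that is
# not in this small set of context words is treated as a content word.
CONTEXT_WORDS = {"the", "a", "an", "is", "was", "it", "this", "that",
                 "and", "also", "very", "i", "of", "to"}


def mask_content_words(tokens, mask_token="[MASK]"):
    """Replace content words with mask_token and report the mask ratio."""
    if not tokens:
        return [], 0.0
    masked = [t if t.lower() in CONTEXT_WORDS else mask_token
              for t in tokens]
    ratio = masked.count(mask_token) / len(tokens)
    return masked, ratio


def zero_embedding(length, dim):
    """All-zero audio or visual features (zero feature embedding assumption)."""
    return [[0.0] * dim for _ in range(length)]


tokens = "the movie was also very boring".split()
masked, ratio = mask_content_words(tokens)

# Counterfactual input for the context bias: masked transcript plus
# zeroed audio and visual features (toy sizes).
context_only_input = {
    "language": masked,
    "audio": zero_embedding(length=3, dim=2),
    "visual": zero_embedding(length=3, dim=2),
}
print(masked, round(ratio, 2))
```

The "w/o Mask" and "w/ All Mask" variants correspond to skipping this function and to masking every token, respectively, while "w/ Random Mask" would pick the masked positions at random.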

4.6 Qualitative Analysis

Case Study of Counterfactual Learning. Fig. 5 shows one counterfactual example from each of the MOSI and MOSEI testing sets. Specifically, we provide the sentiment intensity scores of positive/negative evaluation results from vanilla DMD, two types of counterfactual outputs, and the counterfactual predictions. The corresponding label and context word distributions for the displayed samples intuitively show the presence of the dataset biases. Evidently, MCIS corrects the baseline predictions and gives reasonable sentiment polarities. Taking Case 1 (Fig. 5(a)) as an example, the vanilla model obtains a falsely positive polarity because it is misled by the dataset biases. According to the two counterfactual outputs corresponding to the purified biases, the biased baseline results suffer from two deleterious effects: (1) the statistical shortcut caused by the large proportion of “positive” labels; (2) the spurious correlation between the context words (e.g., “also”, “very”) and the “positive” category. Thanks to the proposed MCIS, we can empower the model to think twice and make unbiased predictions by comparing factual and counterfactual outcomes.

Figure 6: Distribution differences of sentiment scores for the testing set (sorted) on (a) MOSI and (b) MOSEI. The blue dots represent the predicted scores from the baseline DMD [17], while the red dots represent the predicted scores from the MCIS-based DMD. The more compact the distribution of predicted sentiment scores and ground truths, the better the model performance.

Distribution Differences of Sentiment Scores. The distribution differences of sentiment scores on the MOSI and MOSEI testing sets are displayed in Fig. 6(a) and Fig. 6(b), respectively. (i) Macroscopically, the predicted score distribution of the MCIS-based model lies closer to the ground truth distribution, indicating that MCIS can effectively correct prediction errors around ground truths. (ii) In practice, our framework mitigates the overall prediction gap caused by samples with outlier-predicted scores while maintaining correct predictions for most samples. For instance, among the samples whose sentiment polarity changes, the MCIS-based DMD successfully corrects about 90% and 93% of the predicted sentiment scores on MOSI and MOSEI, respectively. (iii) Microscopically, the debiasing effect of MCIS differs across samples, depending on how misleading the context words in each sample are. In short, our method takes a meaningful step towards the unbiased estimation of existing models.
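One way to read the correction rate quoted above is as the share of polarity-flipped samples whose new polarity agrees with the ground truth. The sketch below implements that reading on made-up scores. The definition, the treatment of a zero score as positive, and the helper names are all assumptions rather than the paper's evaluation protocol.

```python
def polarity(score):
    """True for non-negative scores; treating zero as positive is an assumption."""
    return score >= 0


def flip_correction_rate(baseline, debiased, truth):
    """Among samples whose polarity changes after debiasing, return the
    fraction whose new polarity matches the ground truth."""
    flipped = [(d, t) for b, d, t in zip(baseline, debiased, truth)
               if polarity(b) != polarity(d)]
    if not flipped:
        return None
    corrected = sum(polarity(d) == polarity(t) for d, t in flipped)
    return corrected / len(flipped)


# Made-up sentiment scores for five test samples.
baseline = [0.8, -0.4, 0.3, 1.2, -0.6]
debiased = [-0.5, 0.2, -0.1, 1.0, 0.3]
truth = [-1.0, 0.6, 0.4, 1.4, 0.8]

rate = flip_correction_rate(baseline, debiased, truth)
print(rate)
```

In this toy example four samples change polarity and three of them end up on the correct side, so the rate is 0.75. The fourth sample is unaffected because its polarity does not change.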

5 Conclusion

In this paper, we investigate and disentangle the dataset biases that have long poisoned MSA models from a causal inference perspective. As a model-agnostic causality-based framework, the proposed MCIS eliminates the detrimental effects caused by these biases via imitating human counterfactual intuition. Comprehensive experiments demonstrate that the MCIS-based models achieve better performance than their biased counterparts.

Future Work. We plan to equip MCIS with modality reconstruction techniques to cope with potential modality missingness in realistic applications.

Acknowledgement. This work is supported in part by the National Key R&D Program of China under Grant 2021ZD0113503 and in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0103.

References

  • [1] Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42(4), 335–359 (2008)
  • [2] Chen, J., Yang, D., Jiang, Y., Lei, Y., Zhang, L.: Miss: A generative pretraining and finetuning approach for med-vqa. arXiv preprint arXiv:2401.05163 (2024)
  • [3] Chen, Y., Chen, D., Wang, T., Wang, Y., Liang, Y.: Causal intervention for subject-deconfounded facial action unit recognition. arXiv preprint arXiv:2204.07935 (2022)
  • [4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [5] Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L.: Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. pp. 67–73 (2018)
  • [6] Feder, A., Oved, N., Shalit, U., Reichart, R.: Causalm: Causal model explanation through counterfactual language models. Computational Linguistics 47(2), 333–386 (2021)
  • [7] Glymour, M., Pearl, J., Jewell, N.P.: Causal inference in statistics: A primer. John Wiley & Sons (2016)
  • [8] Grice, H.P., White, A.R.: Symposium: The causal theory of perception. Proceedings of the Aristotelian Society, Supplementary Volumes 35, 121–168 (1961)
  • [9] Han, W., Chen, H., Poria, S.: Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv preprint arXiv:2109.00412 (2021)
  • [10] Hazarika, D., Zimmermann, R., Poria, S.: Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia (ACM MM). pp. 1122–1131 (2020)
  • [11] Kshirsagar, S.R., Falk, T.H.: Quality-aware bag of modulation spectrum features for robust speech emotion recognition. IEEE Transactions on Affective Computing (2022)
  • [12] Kuang, H., Yang, D., Wang, S., Wang, X., Zhang, L.: Towards simultaneous segmentation of liver tumors and intrahepatic vessels via cross-attention mechanism. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5 (2023)
  • [13] Lei, Y., Yang, D., Li, M., Wang, S., Chen, J., Zhang, L.: Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences. arXiv preprint arXiv:2307.13205 (2023)
  • [14] Li, M., Yang, D., Lei, Y., Wang, S., Wang, S., Su, L., Yang, K., Wang, Y., Sun, M., Zhang, L.: A unified self-distillation framework for multimodal sentiment analysis with uncertain missing modalities. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). pp. 10074–10082. No. 9 (2024)
  • [15] Li, M., Yang, D., Zhang, L.: Towards robust multimodal sentiment analysis under uncertain signal missing. IEEE Signal Processing Letters 30, 1497–1501 (2023)
  • [16] Li, M., Yang, D., Zhao, X., Wang, S., Wang, Y., Yang, K., Sun, M., Kou, D., Qian, Z., Zhang, L.: Correlation-decoupled knowledge distillation for multimodal sentiment analysis with incomplete modalities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12458–12468 (2024)
  • [17] Li, Y., Wang, Y., Cui, Z.: Decoupled multimodal distilling for emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6631–6640 (2023)
  • [18] Liang, T., Lin, G., Feng, L., Zhang, Y., Lv, F.: Attention is not enough: Mitigating the distribution discrepancy in asynchronous multimodal sequence fusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8148–8156 (2021)
  • [19] Lin, X., Parikh, D.: Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2984–2993 (2015)
  • [20] Liu, Y., Yang, D., Wang, Y., Liu, J., Song, L.: Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models. arXiv preprint arXiv:2302.05087 (2023)
  • [21] Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A., Morency, L.P.: Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064 (2018)
  • [22] Lu, C., Zong, Y., Zheng, W., Li, Y., Tang, C., Schuller, B.W.: Domain invariant feature learning for speaker-independent speech emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 2217–2230 (2022)
  • [23] Lv, F., Chen, X., Huang, Y., Duan, L., Lin, G.: Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2554–2562 (2021)
  • [24] Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual vqa: A cause-effect look at language bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12700–12710 (2021)
  • [25] Pearl, J.: Causal inference in statistics: An overview. Statistics Surveys 3, 96–146 (2009)
  • [26] Pearl, J.: Causality. Cambridge University Press (2009)
  • [27] Pearl, J., et al.: Models, reasoning and inference. Cambridge, UK: Cambridge University Press 19, 2 (2000)
  • [28] Pham, H., Liang, P.P., Manzini, T., Morency, L.P., Póczos, B.: Found in translation: Learning robust joint representations by cyclic translations between modalities. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 33, pp. 6892–6899 (2019)
  • [29] Poria, S., Chaturvedi, I., Cambria, E., Bisio, F.: Sentic lda: Improving on lda with semantic similarity for aspect-based sentiment analysis. In: International Joint Conference on Neural Networks (IJCNN). pp. 4465–4473. IEEE (2016)
  • [30] Qian, C., Feng, F., Wen, L., Ma, C., Xie, P.: Counterfactual inference for text classification debiasing. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 5434–5445 (2021)
  • [31] Rahman, W., Hasan, M.K., Lee, S., Zadeh, A., Mao, C., Morency, L.P., Hoque, E.: Integrating multimodal information in large pretrained transformers. In: Annual Meeting of the Association for Computational Linguistics (ACL). vol. 2020, p. 2359. NIH Public Access (2020)
  • [32] Sun, H., Wang, H., Liu, J., Chen, Y.W., Lin, L.: Cubemlp: An mlp-based model for multimodal sentiment analysis and depression estimation. In: Proceedings of the 30th ACM International Conference on Multimedia (ACM MM). pp. 3722–3729 (2022)
  • [33] Sun, T., Wang, W., Jing, L., Cui, Y., Song, X., Nie, L.: Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. In: Proceedings of the 30th ACM International Conference on Multimedia (ACM MM). pp. 15–23 (2022)
  • [34] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). pp. 8992–8999 (2020)
  • [35] Tang, K., Niu, Y., Huang, J., Shi, J., Zhang, H.: Unbiased scene graph generation from biased training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3716–3725 (2020)
  • [36] Tian, B., Cao, Y., Zhang, Y., Xing, C.: Debiasing nlu models via causal intervention and counterfactual reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 36, pp. 11376–11384 (2022)
  • [37] Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the conference. Association for Computational Linguistics. Meeting (ACL). vol. 2019, p. 6558. NIH Public Access (2019)
  • [38] Tsai, Y.H.H., Liang, P.P., Zadeh, A., Morency, L.P., Salakhutdinov, R.: Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176 (2018)
  • [39] Van Hoeck, N., Watson, P.D., Barbey, A.K.: Cognitive neuroscience of human counterfactual reasoning. Frontiers in Human Neuroscience 9, 420 (2015)
  • [40] Wang, T., Huang, J., Zhang, H., Sun, Q.: Visual commonsense r-cnn. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10760–10770 (2020)
  • [41] Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of the NAACL Student Research Workshop. pp. 88–93 (2016)
  • [42] Wu, Y., Lin, Z., Zhao, Y., Qin, B., Zhu, L.N.: A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 4730–4738 (2021)
  • [43] Yang, D., Chen, Z., Wang, Y., Wang, S., Li, M., Liu, S., Zhao, X., Huang, S., Dong, Z., Zhai, P., Zhang, L.: Context de-confounded emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19005–19015 (June 2023)
  • [44] Yang, D., Huang, S., Kuang, H., Du, Y., Zhang, L.: Disentangled representation learning for multimodal emotion recognition. In: Proceedings of the 30th ACM International Conference on Multimedia (ACM MM). pp. 1642–1651 (2022)
  • [45] Yang, D., Huang, S., Liu, Y., Zhang, L.: Contextual and cross-modal interaction for multi-modal speech emotion recognition. IEEE Signal Processing Letters 29, 2093–2097 (2022)
  • [46] Yang, D., Huang, S., Wang, S., Liu, Y., Zhai, P., Su, L., Li, M., Zhang, L.: Emotion recognition for multiple context awareness. In: European Conference on Computer Vision (ECCV). pp. 144–162. Springer (2022)
  • [47] Yang, D., Huang, S., Xu, Z., Li, Z., Wang, S., Li, M., Wang, Y., Liu, Y., Yang, K., Chen, Z., Wang, Y., Liu, J., Zhang, P., Zhai, P., Zhang, L.: Aide: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 20459–20470 (October 2023)
  • [48] Yang, D., Kuang, H., Huang, S., Zhang, L.: Learning modality-specific and-agnostic representations for asynchronous multimodal language sequences. In: Proceedings of the 30th ACM International Conference on Multimedia (ACM MM). pp. 1708–1717 (2022)
  • [49] Yang, D., Kuang, H., Yang, K., Li, M., Zhang, L.: Towards asynchronous multimodal signal interaction and fusion via tailored transformers. IEEE Signal Processing Letters (2024)
  • [50] Yang, D., Liu, Y., Huang, C., Li, M., Zhao, X., Wang, Y., Yang, K., Wang, Y., Zhai, P., Zhang, L.: Target and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences. Knowledge-Based Systems p. 110370 (2023)
  • [51] Yang, D., Xiao, D., Li, K., Wang, Y., Chen, Z., Wei, J., Zhang, L.: Towards multimodal human intention understanding debiasing via subject-deconfounding. arXiv preprint arXiv:2403.05025 (2024)
  • [52] Yang, D., Yang, K., Li, M., Wang, S., Wang, S., Zhang, L.: Robust emotion recognition in context debiasing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12447–12457 (2024)
  • [53] Yang, D., Yang, K., Wang, Y., Liu, J., Xu, Z., Yin, R., Zhai, P., Zhang, L.: How2comm: Communication-efficient and collaboration-pragmatic multi-agent perception. In: Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) (2023)
  • [54] Yang, K., Yang, D., Zhang, J., Wang, H., Sun, P., Song, L.: What2comm: Towards communication-efficient collaborative perception via feature decoupling. In: Proceedings of the 31st ACM International Conference on Multimedia (ACM MM). pp. 7686–7695 (2023)
  • [55] Yu, W., Xu, H., Meng, F., Zhu, Y., Ma, Y., Wu, J., Zou, J., Yang, K.: Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 3718–3727 (2020)
  • [56] Yu, W., Xu, H., Yuan, Z., Wu, J.: Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 35, pp. 10790–10797 (2021)
  • [57] Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)
  • [58] Zadeh, A., Pu, P.: Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers) (2018)
  • [59] Zadeh, A., Zellers, R., Pincus, E., Morency, L.P.: Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31(6), 82–88 (2016)
  • [60] Zhang, Y., Lai, G., Zhang, M., Zhang, Y., Liu, Y., Ma, S.: Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. pp. 83–92 (2014)