Causal Discovery Inspired Unsupervised Domain Adaptation for Emotion-Cause Pair Extraction

Yuncheng Hua♥, Yujin Huang♥, Shuo Huang♥, Tao Feng♥, Lizhen Qu♥ (corresponding author),
Chris Bain♣, Richard Bassed♦, Gholamreza Haffari♥
♥ Department of Data Science & AI, Monash University, Australia
♣ Department of Human Centred Computing, Monash University, Australia
♦ Victorian Institute of Forensic Medicine, Melbourne, Australia
{devin.hua, shuo.huang1, chris.a.bain, firstname.lastname}@monash.edu,
[email protected]
Abstract

This paper tackles the task of emotion-cause pair extraction in the unsupervised domain adaptation setting. The problem is challenging because the distributions of the events causing emotions in target domains are dramatically different from those in source domains, even though the distributions of emotional expressions overlap across domains. Inspired by causal discovery, we propose a novel deep latent model in the variational autoencoder (VAE) framework, which not only captures the underlying latent structures of data but also utilizes the easily transferable knowledge of emotions as the bridge to link the distributions of events in different domains. To facilitate knowledge transfer across domains, we also propose a novel variational posterior regularization technique to disentangle the latent representations of emotions from those of events in order to mitigate the damage caused by the spurious correlations related to the events in source domains. Through extensive experiments, we demonstrate that our model outperforms the strongest baseline by approximately 11.05% on a Chinese benchmark and 2.45% on an English benchmark in terms of weighted-average F1 score. The source code will be publicly available upon acceptance.



1 Introduction

Figure 1: An illustrative example of the UDA-ECPE task. Orange and green highlights respectively denote emotion and cause clauses.

Emotion-cause pair extraction (ECPE) aims to extract emotions and the events causing such emotions mentioned in a document Xia and Ding (2019). The task has potential applications in a number of areas, such as affective computing, market analysis, and intelligent agents for customer support. However, there are only a small number of labeled training corpora available in a handful of domains. As shown in Fig. 1, in order to deploy ECPE models to target domains, where there are only unlabeled data, we focus on unsupervised domain adaptation (UDA) for ECPE, coined UDA-ECPE, which has not been explored before.

Multi-class or multi-label classification dominates in conventional UDA tasks. UDA-ECPE is more challenging because the events causing the same emotion are rarely the same across domains, even though the knowledge of emotional expressions is easier to transfer across domains using the UDA methods Zad et al. (2021). For example, the reason for "I feel so happy today" can be "I have received a grant from the government" in the society domain and "I found that the stock I bought went up" in the finance domain. There are usually no explicit keywords such as "because" showing their causal relations. However, current UDA methods assume that there are small discrepancies between source and target distributions Zhao et al. (2019); Kumar et al. (2020). We show in Sec. 4.2 that the state-of-the-art (SOTA) UDA methods indeed have limited capabilities to improve the performance of the SOTA ECPE models.

It is a common practice to project texts into latent representations for improving language understanding Wang et al. (2019). Existing techniques disentangle different types of latent representations by applying regularization terms to enforce independence between the corresponding random variables Cheng et al. (2020). However, the independence assumption contradicts the fact that emotions and the events causing them are statistically dependent.

To tackle the above challenges, we take the transferable knowledge of emotional expressions as the bridge between a source domain and a target domain. In a single domain, we identify causal relations between emotions and domain-specific events, which can be viewed as a causal discovery problem between the corresponding random variables. In the VAE framework Kingma and Welling (2013), we propose a novel model, coined CaRel-VAE, to map input texts into latent emotion representations and latent event representations and detect their causal relations. Herein, we propose a novel variational posterior regularizer to disentangle those representations by maximizing the divergences between the posteriors without assuming independence. In a target domain, we improve the self-training algorithm Chen et al. (2011) for discovering domain-specific causal relations, referred to as CD-SelfTrain. Instead of incrementally updating a training set, we improve the original algorithm by producing a new pseudo-labeled training set in each epoch. As a result, our method outperforms the SOTA ECPE models trained with the SOTA UDA methods by a wide margin.

To sum up, our contributions are the following:

  • We propose a novel causal discovery inspired UDA method, coined CD-SelfTrain, and a new model, coined CaRel-VAE, for the ECPE task in the unexplored UDA setting.

  • We propose a novel disentanglement regularization term on variational posteriors so that it does not enforce independence between emotions and the events causing them.

  • Our approach achieves superior performance in terms of weighted-average F1 over the strongest baseline by approximately 11.05% on a Chinese benchmark and 2.45% on an English benchmark. Even if that baseline is trained with the SOTA UDA method, our method still achieves the best performance.

2 Challenges in UDA-ECPE

The ECPE task is concerned with recognizing causal relations between the events causing emotions and the corresponding emotional expressions mentioned in a document. All prior studies on the ECPE task employ a (deep) learning-based classifier to detect mentions of causal relations based on an input text. They often choose an input text that mentions an event and an emotional expression. Then those classifiers determine whether the event causes the emotional expression by investigating if i) the event and the emotional expression are correlated and ii) there is a linguistic pattern indicating that their relation is causal, e.g., a key phrase such as “leads to”.

Formally, given an input text $\bm{x}$, we extract an event embedding $\bm{z}^{c}$ and an emotion embedding $\bm{z}^{e}$, which are values sampled from the corresponding latent random variable vectors $\mathbf{Z}^{c}$ and $\mathbf{Z}^{e}$. In a source domain, a model learns a distribution $\sum_{\mathbf{Z}^{c},\mathbf{Z}^{e}}p(Y|\mathbf{Z}^{c},\mathbf{Z}^{e},\bm{x})\,p(\mathbf{Z}^{c},\mathbf{Z}^{e}|\bm{x})$, where $Y$ denotes a binary random variable indicating whether there is a causal relation between $\mathbf{Z}^{c}$ and $\mathbf{Z}^{e}$.
The key challenge is that both $p(Y|\mathbf{Z}^{c},\mathbf{Z}^{e},\bm{x})$ and $p(\mathbf{Z}^{c},\mathbf{Z}^{e}|\bm{x})$ are significantly different in target domains. Although prior studies show that $p(\mathbf{Z}^{e}|\bm{x})$ can be easily transferred from source domains to target domains Wang et al. (2022), the correlations between $\mathbf{Z}^{c}$ and $\mathbf{Z}^{e}$ are hardly transferable, because $p(\mathbf{Z}^{c})$ differs dramatically between domains. Therefore, when adapting a model trained in a source domain to a target domain, the model needs to forget the correlations between emotions and events from the source domain and then learn new correlations in the target domain.

To provide an intuitive understanding of the above-mentioned challenges in the UDA setting, we visualize the clause embeddings, namely $p(\mathbf{Z}^{c})$, for ground-truth emotions and emotion causes respectively on CH-ECPE and EN-ECPE, and compare them with the sentence embeddings for a widely used domain adaptation corpus, Amazon Reviews Blitzer et al. (2007), using t-SNE. As the original CH-ECPE is not partitioned by domain, we manually assign each data point in the corpus the corresponding domain label. Further details are provided in Sec. 4.1.

As shown in Figure 5, the data points of Chinese emotion clauses from various CH-ECPE domains overlap strongly, and their domain divergences are far smaller than those of the embeddings of the emotion causes. The task is thus challenging for existing UDA methods, which work only when the distribution shift from a source domain to a target domain is small, as illustrated in Fig. 2(a) Zhao et al. (2019); Kumar et al. (2020). In addition, we employ two different datasets as different domains for English. A similar tendency for the English corpora can be found in A.1.

(a) Amazon sentiment reviews
(b) Chinese emotion clauses
(c) Chinese emotion cause clauses
Figure 2: The t-SNE visualizations of the sentence embeddings from the Amazon Reviews multi-domain sentiment corpus and the clause embeddings from the Chinese UDA-ECPE corpora. For the English UDA-ECPE corpora, please refer to A.1.

3 Methodology

The UDA-ECPE task is concerned with identifying causal relations between mentions of events and emotional expressions in target domains, which do not have labeled data. In the source domain, there is a set of labeled documents $\mathcal{D}^{s}=\{(\mathbf{X}^{s}_{1},\mathcal{R}^{s}_{1}),(\mathbf{X}^{s}_{2},\mathcal{R}^{s}_{2}),\ldots,(\mathbf{X}^{s}_{n},\mathcal{R}^{s}_{n})\}$. Each document $\mathbf{X}^{s}_{k}$ consists of a sequence of clauses $(\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{d})$ and is annotated with a set of labeled emotion-cause pairs $\mathcal{R}^{s}_{k}=\{(y^{r}_{ij},y^{c}_{i},y^{e}_{j})\}_{i,j}$, where $y^{r}_{ij}$ is a binary label indicating if $\bm{x}_{i}$ is an event mention causing an emotion expressed in $\bm{x}_{j}$, $y^{c}_{i}$ denotes whether $\bm{x}_{i}$ is an event or not, and $y^{e}_{j}\in\mathcal{Y}^{e}$ denotes the category of the emotion. In this work, we consider the widely used six basic emotion categories: happiness, sadness, fear, disgust, anger, and surprise. Then the task is to identify a set of such causal relations and emotion categories $\mathcal{R}^{t}_{k}=\{(y^{r}_{ij},y^{e}_{j})\}_{i,j}$ from each unlabeled document $k$ in target domains. In contrast, the prior studies Xia and Ding (2019) assume the training and test distributions are identical and emotional expressions are not categorized. Hence, our setting is more difficult and practical by considering emotion categories and distribution discrepancies between domains.
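The annotation scheme described above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' data format: the class name `Document`, its fields, and the rule that derives the event label $y^{c}_{i}$ from the annotated pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The six basic emotion categories named in the text.
EMOTIONS = ("happiness", "sadness", "fear", "disgust", "anger", "surprise")


@dataclass
class Document:
    """One document X_k: a sequence of clauses plus optional annotations."""
    clauses: List[str]
    # (i, j) -> 1 if clause i causes the emotion in clause j (y^r_ij).
    # Pairs that are not listed are treated as 0.
    pairs: Dict[Tuple[int, int], int] = field(default_factory=dict)
    # j -> emotion category of clause j (y^e_j).
    emotions: Dict[int, str] = field(default_factory=dict)

    def relation_label(self, i: int, j: int) -> int:
        """Return y^r_ij for the clause pair (i, j)."""
        return self.pairs.get((i, j), 0)

    def is_event(self, i: int) -> bool:
        """y^c_i, derived here (an assumption) from the annotated pairs."""
        return any(ci == i and y == 1 for (ci, _), y in self.pairs.items())

    def candidate_pairs(self) -> List[Tuple[int, int]]:
        """All (cause, emotion) clause index pairs a model has to score."""
        d = len(self.clauses)
        return [(i, j) for i in range(d) for j in range(d)]


# A labeled source-domain document built from the example in the introduction.
source_doc = Document(
    clauses=["I have received a grant from the government",
             "I feel so happy today"],
    pairs={(0, 1): 1},
    emotions={1: "happiness"},
)

# An unlabeled target-domain document carries clauses only.
target_doc = Document(clauses=["I found that the stock I bought went up",
                               "I feel so happy today"])
```

A target-domain document has empty `pairs` and `emotions`, which mirrors the setting in which $\mathcal{R}^{t}_{k}$ must be predicted rather than read from annotations.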

CaRel-VAE Overview.

Let $\mathbf{Z}^{e}$ and $\mathbf{Z}^{c}$ denote the latent random variable vectors for emotion and event respectively. We adopt the VAE framework to learn the latent distribution $p(y_{ij}^{r},y^{e},y^{c},\mathbf{X}_{ij},\mathbf{Z}^{e},\mathbf{Z}^{c})$ for a pair of clauses $\mathbf{X}_{ij}=(\bm{x}_{i},\bm{x}_{j})$, which is factorized into

$$\overbrace{p(y_{ij}^{r}|\mathbf{Z}^{e},\mathbf{Z}^{c})\,p(y^{e}|\mathbf{Z}^{e})\,p(y^{c}|\mathbf{Z}^{c})}^{\text{task-specific}}\;\overbrace{p(\mathbf{X}_{ij}|\mathbf{Z}^{e},\mathbf{Z}^{c})\,p(\mathbf{Z}^{e})\,p(\mathbf{Z}^{c})}^{\text{standard VAE}}$$

In addition to the standard components of VAE, such as the decoder $p(\mathbf{X}_{ij}|\mathbf{Z}^{e},\mathbf{Z}^{c})$, we include task-specific predictors: an emotion classifier $p(y^{e}|\mathbf{Z}^{e})$, an emotion-cause relation classifier $p(y_{ij}^{r}|\mathbf{Z}^{e},\mathbf{Z}^{c})$, and an event predictor $p(y^{c}|\mathbf{Z}^{c})$.

To approximate the true distribution, we consider a factorized variational distribution $q(\mathbf{Z}^{e},\mathbf{Z}^{c}|\mathbf{X}_{ij})=q(\mathbf{Z}^{e}|\mathbf{X}_{ij})\,q(\mathbf{Z}^{c}|\mathbf{X}_{ij})$, whose factors correspond to an emotion encoder and an event encoder respectively. Then the variational lower bound (ELBO) takes the following form:

$$\begin{aligned}
&\mathbb{E}_{q(\mathbf{Z}^{e},\mathbf{Z}^{c}|\mathbf{X}_{ij})}\log\big[p(\mathbf{X}_{ij}|\mathbf{Z}^{e},\mathbf{Z}^{c})\,p(y_{ij}^{r}|\mathbf{Z}^{e},\mathbf{Z}^{c})\,p(y^{e}|\mathbf{Z}^{e})\,p(y^{c}|\mathbf{Z}^{c})\big]\\
&\quad-\mathbb{D}_{\mathrm{KL}}\big(q(\mathbf{Z}^{e}|\mathbf{X}_{ij})\,\|\,p(\mathbf{Z}^{e})\big)-\mathbb{D}_{\mathrm{KL}}\big(q(\mathbf{Z}^{c}|\mathbf{X}_{ij})\,\|\,p(\mathbf{Z}^{c})\big)
\end{aligned}$$
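
To illustrate how the terms of this bound combine, the sketch below computes a single-sample ELBO estimate from given log-likelihood values. It assumes diagonal Gaussian posteriors and standard normal priors $p(\mathbf{Z}^{e})$ and $p(\mathbf{Z}^{c})$, for which the KL terms have a closed form; the choice of prior and the function names are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np


def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    sigma_sq = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma_sq + mu ** 2 - 1.0 - 2.0 * log_sigma)


def elbo_estimate(log_p_x, log_p_rel, log_p_emo, log_p_evt,
                  mu_e, log_sigma_e, mu_c, log_sigma_c):
    """Single-sample ELBO: four log-likelihood terms minus two KL terms.

    log_p_x   : log p(X_ij | z^e, z^c)   (decoder)
    log_p_rel : log p(y^r_ij | z^e, z^c) (relation classifier)
    log_p_emo : log p(y^e | z^e)         (emotion classifier)
    log_p_evt : log p(y^c | z^c)         (event predictor)
    """
    log_lik = log_p_x + log_p_rel + log_p_emo + log_p_evt
    kl_e = kl_to_standard_normal(mu_e, log_sigma_e)
    kl_c = kl_to_standard_normal(mu_c, log_sigma_c)
    return log_lik - kl_e - kl_c


# Toy values: the emotion posterior equals the prior, the event posterior is shifted.
value = elbo_estimate(-1.0, -0.5, -0.2, -0.3,
                      mu_e=np.zeros(2), log_sigma_e=np.zeros(2),
                      mu_c=np.array([1.0, 0.0]), log_sigma_c=np.zeros(2))
```

In training, the negative of this quantity would serve as the loss, with the expectation approximated by samples drawn from the two encoders.
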
Disentanglement.

In target domains, it is not desirable that the latent representation of an emotion is mixed with event information, which makes transfer of the knowledge about emotions across domains difficult, because events in target domains are not directly related to those in source domains. Therefore, we need to disentangle latent emotion representations from latent event representations for improving compositional generalization Russin et al. (2019) without making the independence assumption.

In light of the above analysis, we propose a variational posterior regularization technique. The key idea is to regularize the model so that the dense regions of $q(\mathbf{Z}^{e}|\mathbf{X}_{ij})$ are associated only with emotions, while those of $q(\mathbf{Z}^{c}|\mathbf{X}_{ij})$ are associated only with events. The classifiers for $p(y^{e}|\mathbf{Z}^{e})$ and $p(y^{c}|\mathbf{Z}^{c})$ are in general smooth, such that they consistently predict only one label in a dense region. If there is little overlap between the dense regions of $q(\mathbf{Z}^{e}|\mathbf{X}_{ij})$ and those of $q(\mathbf{Z}^{c}|\mathbf{X}_{ij})$, a dense region from either distribution is expected to be associated with either an emotion category or a type of events estimated by one of the classifiers, under the maximum likelihood principle.
In other words, we only need to add a regularizer to minimize the overlap between $q(\mathbf{Z}^{e}|\mathbf{X}_{ij})$ and $q(\mathbf{Z}^{c}|\mathbf{X}_{ij})$ such that their divergence is high.

In theory, the corresponding divergence measure $\mathbb{D}_{k}(q(\mathbf{Z}^{e}|\mathbf{X}_{ij})\,\|\,q(\mathbf{Z}^{c}|\mathbf{X}_{ij}))$ should not assume absolute continuity Royden and Fitzpatrick (1988), which requires that $q(Z^{e}_{i}|\mathbf{X}_{ij})>0$ for every $q(Z^{c}_{i}|\mathbf{X}_{ij})>0$, and vice versa. In reality, a random variable $Z_{i}^{e}$ may have high probability in a region where a $Z_{j}^{c}$ has zero probability. To tackle this, we use the Bhattacharyya distance Bhattacharyya (1946) and the maximum mean discrepancy (MMD) Gretton et al. (2012), each in turn, as the regularizer. Each of them has its own strength. More details are covered in Sec. 3.2.
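
Both divergence measures named above can be computed without the absolute continuity requirement. The sketch below shows the closed-form Bhattacharyya distance between two diagonal Gaussians and a biased sample-based estimate of the squared MMD. The RBF kernel, the bandwidth of 1.0, and the function names are assumptions of this sketch; the exact configuration used by the authors is described in Sec. 3.2 and is not reproduced here.

```python
import numpy as np


def bhattacharyya_diag_gaussians(mu1, log_sigma1, mu2, log_sigma2):
    """Bhattacharyya distance between N(mu1, diag(s1^2)) and N(mu2, diag(s2^2))."""
    var1 = np.exp(2.0 * log_sigma1)
    var2 = np.exp(2.0 * log_sigma2)
    var_avg = 0.5 * (var1 + var2)
    # Mahalanobis-like term on the means under the averaged covariance.
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / var_avg)
    # Log-determinant term; log(var) = 2 * log_sigma for each distribution.
    cov_term = 0.5 * np.sum(np.log(var_avg) - log_sigma1 - log_sigma2)
    return mean_term + cov_term


def rbf_kernel(a, b, bandwidth=1.0):
    """Pairwise RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())


# Toy samples standing in for draws from the emotion and event posteriors.
rng = np.random.default_rng(0)
z_e = rng.normal(loc=0.0, size=(50, 2))
z_c = rng.normal(loc=3.0, size=(50, 2))
overlap_free = mmd_squared(z_e, z_c)
```

Since the goal stated above is a high divergence between the two posteriors, either quantity would be maximized during training, for example by subtracting a weighted copy of it from the loss. The weighting is a design choice left open here.
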

Figure 3: The architecture of our model CaRel-VAE.

3.1 Model Details

CaRel-VAE Model.

As illustrated in Fig. 3, our model is composed of an inference module, a text generator, task-specific predictors and priors.

Inference Module. The inference module consists of a pre-trained BERT Devlin et al. (2018) encoder, an emotion encoder and an event encoder. Given a pair of clauses $(\bm{x}_{i},\bm{x}_{j})$, we construct inputs following the common practice of inserting a [SEP] token between the two clauses and prepending a [CLS] token to the sequence. We take the hidden representation $\bm{h}$ of [CLS] as the output of the BERT encoder.

To distinguish the representations of the event and emotion variables, we employ two adapters to produce separate embeddings. We initialize two vectors $\bm{a}_{e}$ and $\bm{a}_{c}$ for emotion and event respectively, and treat them as queries while viewing $\bm{h}$ as key and value. We then synthesize the new emotion and event representations $\bm{h}_{e}$ and $\bm{h}_{c}$ by computing sparsemax attention Martins and Astudillo (2016) with $\bm{a}_{e}$ and $\bm{a}_{c}$ as the respective queries.
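
For readers unfamiliar with sparsemax, the sketch below implements the projection of Martins and Astudillo (2016) and uses it in place of softmax in a single-query attention step. The text does not specify tensor shapes, so the sketch assumes a matrix `hidden` with one hidden vector per row serving as both keys and values, and scaled dot-product scoring; these choices and the function names are assumptions made for illustration only.

```python
import numpy as np


def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]           # scores in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    in_support = 1.0 + ks * z_sorted > cumsum
    k = ks[in_support].max()              # size of the support
    tau = (cumsum[k - 1] - 1.0) / k       # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)


def sparsemax_attention(query, hidden):
    """Attend over the rows of `hidden` with one query vector."""
    scores = hidden @ query / np.sqrt(query.size)
    weights = sparsemax(scores)
    return weights @ hidden, weights


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))   # hypothetical hidden vectors (keys and values)
a_e = rng.normal(size=8)           # emotion query
a_c = rng.normal(size=8)           # event query
h_e, w_e = sparsemax_attention(a_e, hidden)
h_c, w_c = sparsemax_attention(a_c, hidden)
```

Unlike softmax, sparsemax can assign exactly zero weight to low-scoring entries, so the two queries can select disjoint parts of the input.
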

The variational distribution $q(\mathbf{Z}^{e},\mathbf{Z}^{c}|\mathbf{X}_{ij})$ is realized as simple factorized Gaussians, which correspond to an emotion encoder $q(\mathbf{Z}^{e}|\bm{h}_{e})$ and an event encoder $q(\mathbf{Z}^{c}|\bm{h}_{c})$ on top of the hidden representations $\bm{h}_{e}$ and $\bm{h}_{c}$ respectively. Each encoder is implemented as a multilayer perceptron (MLP), and samples are drawn with the reparameterization trick:

$$\begin{aligned}
\bm{\mu}^{e},\log\bm{\sigma}^{e}&=\text{MLP}(\bm{h}_{e};\bm{\theta}_{e})\\
\bm{\mu}^{c},\log\bm{\sigma}^{c}&=\text{MLP}(\bm{h}_{c};\bm{\theta}_{c})\\
\bm{z}^{e}&=\bm{\mu}^{e}+\bm{\sigma}^{e}\odot\bm{\epsilon},\quad\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\\
\bm{z}^{c}&=\bm{\mu}^{c}+\bm{\sigma}^{c}\odot\bm{\epsilon},\quad\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})
\end{aligned}\tag{1}$$

where $\bm{\theta}_e$ and $\bm{\theta}_c$ are the parameters of the emotion and event encoders respectively, $\bm{\mu}^e$, $\bm{\sigma}^e$ and $\bm{\mu}^c$, $\bm{\sigma}^c$ denote the means and standard deviations of the corresponding Gaussian distributions, $\bm{\epsilon}$ denotes independent Gaussian noise, and $\bm{z}^e$ and $\bm{z}^c$ denote the respective values of $\mathbf{Z}^e$ and $\mathbf{Z}^c$.
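The sampling step in Eq. (1) can be sketched as follows. The function name `reparameterize` and the toy encoder outputs are hypothetical; the MLP that would produce $\bm{\mu}$ and $\log\bm{\sigma}$ is left out, so this only illustrates the reparameterization itself.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], log_sigma: List[float],
                   rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]


rng = random.Random(0)
# Hypothetical encoder outputs for a 3-dimensional latent variable.
mu_e, log_sigma_e = [0.5, -1.0, 2.0], [0.0, -0.5, -2.0]
z_e = reparameterize(mu_e, log_sigma_e, rng)
```

Because the randomness enters only through $\bm{\epsilon}$, the sample is a deterministic function of $\bm{\mu}$ and $\bm{\sigma}$ once the noise is fixed, which is what allows gradients to flow back into the encoder in an automatic-differentiation framework.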

Text Generator. For $p(\mathbf{X}_{ij} \mid \mathbf{Z}^e, \mathbf{Z}^c)$, we consider a lightweight solution that only reconstructs a bag-of-words (BoW) representation from the latent representations, which is significantly faster than a conventional sequence decoder.

$$
p(\bm{x}^{\text{BoW}} \mid \bm{z}^e, \bm{z}^c) = \sigma(\bm{W}^{\text{dec}}[\bm{z}^e, \bm{z}^c] + \bm{b}^{\text{dec}})
\tag{2}
$$

where $\bm{\theta}_{\text{dec}} = [\bm{W}^{\text{dec}}; \bm{b}^{\text{dec}}]$ denotes the parameters of the decoder, $\sigma(\cdot)$ is the sigmoid function, and $\bm{x}^{\text{BoW}}$ is the BoW representation of $\mathbf{X}_{ij}$.
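A minimal sketch of the decoder in Eq. (2) is given below, assuming a binary BoW target and a per-word binary cross-entropy as the reconstruction loss. The helper names `bow_decoder` and `bow_cross_entropy`, the vocabulary size and the zero-initialized weights are all hypothetical choices for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bow_decoder(z_e: List[float], z_c: List[float],
                W: List[List[float]], b: List[float]) -> List[float]:
    """Per-word probabilities sigma(W [z_e, z_c] + b), one row of W per vocabulary word."""
    z = z_e + z_c  # list concatenation realizes [z_e, z_c]
    return [sigmoid(sum(w * x for w, x in zip(row, z)) + bias)
            for row, bias in zip(W, b)]


def bow_cross_entropy(probs: List[float], x_bow: List[int]) -> float:
    """Binary cross-entropy between predicted word probabilities and a 0/1 BoW vector."""
    eps = 1e-12  # guards against log(0)
    return -sum(x * math.log(p + eps) + (1 - x) * math.log(1.0 - p + eps)
                for p, x in zip(probs, x_bow))


# Hypothetical setup: vocabulary of 3 words, 2-dimensional z_e and z_c.
W = [[0.0] * 4 for _ in range(3)]
b = [0.0, 0.0, 0.0]
probs = bow_decoder([0.1, -0.2], [0.3, 0.4], W, b)
loss = bow_cross_entropy(probs, [1, 0, 1])
```

The decoder is a single affine map over the concatenated latents, so its cost grows with the vocabulary size rather than with the sequence length, which is consistent with the speed argument made above.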

Priors. For both $p(\mathbf{Z}^e)$ and $p(\mathbf{Z}^c)$, we follow common practice and use $\mathcal{N}(\mathbf{0}, \mathbf{I})$ as the prior.

Task-Specific Predictors. For each predictor, we apply a linear layer to its inputs, followed by a softmax layer if it is a multi-class classification problem, otherwise a sigmoid layer for a binary classification problem.

Emotion Extraction Model.

We can apply any emotion extraction model to obtain clauses containing emotional expressions. In this work, we extend the emotion classification model in Xia and Ding (2019) by replacing its encoder with a BERT encoder and its binary classification layer with a softmax layer.

3.2 Model Training

3.2.1 Source Domain Training

CaRel-VAE Model.

Given a set of documents, each of which is annotated with a set $\mathcal{R}^s_k = \{(y^r_{ij}, y^c_i, y^e_j)\}_{i,j}$ of positive examples, we obtain negative examples of relations by randomly sampling clause pairs that are not part of $\mathcal{R}^s_k$. In particular, for each emotion clause in $\mathcal{R}^s$, we pair it with a randomly picked non-cause clause in the document, resulting in the same number of negative samples. The training loss is $\mathcal{L} = \mathcal{L}^{\text{ELBO}} + \lambda\Omega$, which combines the loss $\mathcal{L}^{\text{ELBO}}$ derived from the ELBO with the variational posterior regularizer $\Omega$ weighted by the hyperparameter $\lambda$.
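The negative sampling described above can be sketched as follows. This is an illustration under stated assumptions: clauses are identified by integer indices, a pair is written as (cause index, emotion index), and a "non-cause clause" is taken to mean any clause that is not an annotated cause of the given emotion clause. The function name `sample_negatives` is hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # (cause clause index, emotion clause index)


def sample_negatives(num_clauses: int, positives: List[Pair],
                     rng: random.Random) -> List[Pair]:
    """Draw one negative pair per positive pair within a single document."""
    causes_of: Dict[int, Set[int]] = {}
    for cause, emotion in positives:
        causes_of.setdefault(emotion, set()).add(cause)

    negatives: List[Pair] = []
    for _, emotion in positives:
        # Assumption: every clause that is not an annotated cause is a candidate.
        candidates = [c for c in range(num_clauses)
                      if c not in causes_of[emotion]]
        if candidates:
            negatives.append((rng.choice(candidates), emotion))
    return negatives


rng = random.Random(0)
positives = [(1, 2), (3, 2), (0, 4)]  # hypothetical annotations for a 5-clause document
negatives = sample_negatives(5, positives, rng)
```

Sampling one negative per positive keeps the two classes balanced, in line with the statement that the procedure yields the same number of negative samples.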

Similar to prior works, the loss $\mathcal{L}^{\text{ELBO}}$ includes the cross-entropy losses from the text decoder and the task-specific predictors, as well as two regularization terms from the two KL divergences, each of which takes the form of $\|\bm{z}\|^2 - \log\bm{\sigma}$.
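For reference, the KL divergence between a diagonal Gaussian posterior and the standard normal prior has a well-known closed form. The identity below is the textbook result rather than a formula quoted from this paper, and the index $d$ over latent dimensions is introduced here only for illustration:

```latex
\mathrm{KL}\big(\mathcal{N}(\bm{\mu}, \mathrm{diag}(\bm{\sigma}^2)) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big)
  = \frac{1}{2} \sum_{d} \big( \mu_d^2 + \sigma_d^2 - 1 - 2\log\sigma_d \big)
```

Both a squared-norm term and a $-\log\bm{\sigma}$ term appear in this expression, which matches the shape of the regularization terms described above.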

To motivate the regularizer $\Omega$, we start with the Bhattacharyya distance, which measures the angle between two probability vectors $(\sqrt{p_a(z_0)}, \ldots, \sqrt{p_a(z_n)})$ and $(\sqrt{p_b(z_0)}, \ldots, \sqrt{p_b(z_n)})$ over $n$ data points. Unlike the KL divergence, the Bhattacharyya distance yields a positive value whenever the distance is nonzero, regardless of whether the probability at a data point is zero. For Gaussians, which is the case for the variational posteriors, it has a closed-form solution:

$$
\mathbb{D}_{\text{bh}} = \frac{1}{8}(\bm{\mu}^e - \bm{\mu}^c)^{\text{T}} \bm{\Sigma}^{-1} (\bm{\mu}^e - \bm{\mu}^c) + \frac{1}{2}\ln\Big(\frac{\det\bm{\Sigma}}{\prod\sigma^e \prod\sigma^c}\Big)
\tag{3}
$$

where $\bm{\Sigma} = \frac{1}{2}\,\text{diag}\big((\bm{\sigma}^e)^2 + (\bm{\sigma}^c)^2\big)$ and the determinant $\det\bm{\Sigma} = \prod \frac{(\sigma^e)^2 + (\sigma^c)^2}{2}$, with all products taken over the latent dimensions. The first term is essentially an unnormalized multivariate Gaussian. The corresponding regularizer $\Omega^{\text{b}} = -\mathbb{D}_{\text{bh}}$, which maximizes this distance, would drive the two Gaussians far away from each other.
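For diagonal Gaussians, the closed form reduces to a sum over latent dimensions. The sketch below assumes the textbook pooled covariance, i.e. the average of the two diagonal covariance matrices; the function name `bhattacharyya_diag` and the toy inputs are hypothetical.

```python
import math
from typing import List


def bhattacharyya_diag(mu_e: List[float], sigma_e: List[float],
                       mu_c: List[float], sigma_c: List[float]) -> float:
    """Bhattacharyya distance between two Gaussians with diagonal covariances."""
    mean_term = 0.0
    log_det_term = 0.0
    for me, se, mc, sc in zip(mu_e, sigma_e, mu_c, sigma_c):
        pooled = 0.5 * (se ** 2 + sc ** 2)           # diagonal entry of the pooled covariance
        mean_term += (me - mc) ** 2 / pooled         # Mahalanobis-style mean separation
        log_det_term += math.log(pooled / (se * sc)) # log-determinant ratio, per dimension
    return 0.125 * mean_term + 0.5 * log_det_term


# Hypothetical posterior parameters for one clause pair.
d_bh = bhattacharyya_diag([1.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [1.0, 1.0])
omega_b = -d_bh  # minimizing the regularizer maximizes the distance
```

In this toy case the variances agree, so only the mean-separation term contributes and the distance equals 0.5.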

The above regularizer only maximizes the distance between the two types of latent representations from the same clause pair. Intuitively, it would be useful to also push $\bm{z}^e_i$ of an instance $i$ away from the $\bm{z}^c_j$ of the other instances. For efficiency, we only apply such regularization between instances in a batch, which yields a regularizer $\Omega^{\text{bb}}$ that maximizes the Bhattacharyya distance between any pair $(\bm{z}^e_i, \bm{z}^c_j)$ in a batch.

Following the same idea, we also exploit maximum mean discrepancy (MMD) Gretton et al. (2012), which is a kernel-based divergence measure not requiring absolute continuity, for maximizing divergences across instances batchwise.

$$
\Omega^{\text{MMD}} = -\|\phi(\bm{z}^e) - \phi(\bm{z}^c)\|^2_{\mathcal{H}}, \quad \bm{z}^e \sim \mathbf{Z}^e,\ \bm{z}^c \sim \mathbf{Z}^c
\tag{4}
$$

where $\phi$ is a mapping function that projects both $\bm{z}^e$ and $\bm{z}^c$ into a reproducing kernel Hilbert space denoted by $\mathcal{H}$. In this work, we mainly adopt this regularizer in experiments due to its superior performance over the other two.
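A batchwise estimate of the MMD-based regularizer can be sketched with the kernel trick, which avoids computing $\phi$ explicitly. The RBF kernel, its bandwidth `gamma` and the biased (V-statistic) estimator are assumptions made for illustration; the kernel actually used is not stated above, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def rbf(x: Vector, y: Vector, gamma: float) -> float:
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs: List[Vector], ys: List[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of the squared MMD between two batches of samples."""
    def mean_kernel(A: List[Vector], B: List[Vector]) -> float:
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def omega_mmd(z_e_batch: List[Vector], z_c_batch: List[Vector],
              gamma: float = 1.0) -> float:
    """Regularizer value: the negated squared MMD across all instances in a batch."""
    return -mmd2(z_e_batch, z_c_batch, gamma)


# Hypothetical batch of emotion and event latents.
z_e_batch = [[0.0, 0.0], [1.0, 0.0]]
z_c_batch = [[5.0, 5.0], [6.0, 5.0]]
reg = omega_mmd(z_e_batch, z_c_batch)
```

Since the estimate compares every $\bm{z}^e_i$ with every $\bm{z}^c_j$ in the batch, it acts across instances rather than only within a clause pair.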

Emotion Extraction Model.

Provided a set of clauses annotated with emotion categories or None, we train the emotion extraction model as a seven-way classification problem, following the maximum likelihood principle.

3.2.2 Adaptation to Target Domains

We first transfer the emotion extraction model to a target domain, followed by our model. The emotion extraction model is fine-tuned by the self-training algorithm Chen et al. (2011) on an unlabeled corpus in a target domain. The parameters of our model are fine-tuned by our method CD-SelfTrain on the same corpus. Given an unlabeled corpus, both self-training algorithms start by applying the model to predict the most likely labels for each input text. The predictions are used to construct a training set, on which the model is fine-tuned for one epoch with the same loss $\mathcal{L}$ as in source domain training. The algorithms then construct a new training set, or update the existing one with new examples, using the current model, and repeat the process until the convergence criteria are met. Our algorithm CD-SelfTrain differs from the standard one in the way the training sets are constructed.

Relation Prediction. Given a set of documents $\mathcal{D}_u$ in a target domain, each of which contains at least one clause annotated with an emotion pseudo-label, we pair each emotion clause with the remaining clauses to create clause pairs for relation identification. When constructing a training set with pseudo-labels in each iteration, we select the pair with the highest probability in a document as a positive sample and randomly choose a clause pair from the remaining ones as a negative sample. Deep models with a high width tend to memorize training examples to reduce training errors van den Burg and Williams (2021), which could hurt model performance without improving generalization. Thus, we construct the training set from scratch in each iteration instead of updating the training set from the previous iteration. The training procedure terminates when a maximal number of iterations is reached.
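The per-iteration construction of the pseudo-labeled relation set can be sketched as below. The data layout (one list of scored clause pairs per document) and the function name `build_relation_set` are hypothetical; the scores stand in for the relation probabilities predicted by the current model.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]            # (candidate cause clause id, emotion clause id)
Scored = Tuple[Pair, float]       # clause pair with its predicted relation probability
Example = Tuple[Pair, int]        # clause pair with pseudo-label 1 (positive) or 0 (negative)


def build_relation_set(docs: List[List[Scored]],
                       rng: random.Random) -> List[Example]:
    """Build a fresh pseudo-labeled training set from the current model's scores."""
    examples: List[Example] = []
    for scored_pairs in docs:
        if not scored_pairs:
            continue
        best = max(range(len(scored_pairs)), key=lambda i: scored_pairs[i][1])
        examples.append((scored_pairs[best][0], 1))   # most probable pair is the positive
        rest = [p for i, (p, _) in enumerate(scored_pairs) if i != best]
        if rest:
            examples.append((rng.choice(rest), 0))    # one random remaining pair is the negative
    return examples


rng = random.Random(0)
docs = [
    [(("c0", "c1"), 0.2), (("c2", "c1"), 0.9), (("c3", "c1"), 0.4)],
    [(("c0", "c0"), 0.6)],
]
train_set = build_relation_set(docs, rng)  # rebuilt from scratch in every iteration
```

Calling the function anew in each iteration, rather than appending to an earlier set, mirrors the decision to discard pseudo-labels from previous iterations.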

Emotion Extraction. For emotion extraction, we apply the self-training algorithm Chen et al. (2011) to train the model in a target domain. It starts with an empty training set $\mathcal{D}_t$ and a set of unlabeled documents $\mathcal{D}_u$. In each iteration, if a document in $\mathcal{D}_u$ contains at least one pseudo-labeled emotion clause whose confidence is above a pre-defined threshold, we add it to the training set $\mathcal{D}_t$ for the next iteration. In each such document, we keep only the pseudo-labeled emotion clause with the highest probability; the remaining clauses are treated as non-emotion clauses.
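The document selection rule for emotion self-training can be sketched as follows. The representation of predictions as (label, confidence) tuples, the string "None" for non-emotion clauses and the function name `pseudo_label_document` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[str, float]  # (predicted emotion category or "None", confidence)


def pseudo_label_document(predictions: List[Prediction],
                          threshold: float) -> Optional[List[str]]:
    """Return pseudo-labels for one document, or None if the document is skipped.

    Only the most confident emotion clause above the threshold keeps its label;
    every other clause is labeled "None".
    """
    confident = [(conf, idx) for idx, (label, conf) in enumerate(predictions)
                 if label != "None" and conf > threshold]
    if not confident:
        return None
    _, best = max(confident)
    return [label if idx == best else "None"
            for idx, (label, _) in enumerate(predictions)]


# Hypothetical model outputs for a three-clause document.
preds = [("None", 0.90), ("joy", 0.80), ("anger", 0.95)]
labels = pseudo_label_document(preds, threshold=0.7)
```

Documents for which the function returns `None` stay in the unlabeled pool and can be reconsidered in a later iteration once the model has been updated.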

4 Experiments

4.1 Experimental Setup

Datasets. Since there is no corpus for ECPE in the UDA setting, we divide CH-ECPE into multiple domains. Given that the documents in CH-ECPE are Chinese news articles sampled from the THUCNews dataset Li and Sun (2007), we employ the topic classifier THUCTC Sun et al. (2016) trained on the THUCNews dataset to categorize CH-ECPE into 14 subsets based on topics and choose the largest five as the final domains (e.g., home, society and finance). To further improve the purity of the classification, we manually inspect and relabel THUCTC's results to complete the domain classification of CH-ECPE. In the English setting, we treat EN-ECPE and Recognizing Emotion Cause in CONversations (RECCON) Poria et al. (2021), an English dataset specifically designed for identifying the causes of emotions within conversations, as the two source-target domains. Table 4 in A.2 summarizes the statistics of each corpus.

Metrics.

For each target domain in each corpus, we evaluate models for emotion extraction and relation identification respectively in terms of precision, recall and F1-score. A prediction is correct if there is a correct causal relation and the emotion category is correct.
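The evaluation criterion can be made concrete with a small sketch. It assumes that each prediction is represented as a (cause clause index, emotion clause index, emotion category) triple and that a prediction counts as correct only when the whole triple matches a gold triple; the names `evaluate` and `f1_score` are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[int, int, str]  # (cause clause index, emotion clause index, emotion category)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def evaluate(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where both the relation and the emotion must be correct."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical document: one exact match, one wrong emotion category, one wrong cause.
gold = {(1, 2, "joy"), (0, 4, "sadness")}
predicted = {(1, 2, "joy"), (0, 4, "anger"), (3, 4, "sadness")}
p, r, f1 = evaluate(predicted, gold)
```

As a consistency check on the reported numbers, a precision of 58.59 and a recall of 71.98 (our model on Society → Home in Table 1) give an F1 of about 64.60 under this definition.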

Baselines. To make a fair comparison, we adapt three existing ECPE models, RankCP, UTOS and UECA-Prompt (all of which employ BERT as the backbone model), for emotion extraction (EE) and ECPE. In addition, since the universal prompt-based method for ECA tasks (UECA-Prompt) Zheng et al. (2022) is designed to solve different emotion cause analysis (ECA) tasks in a unified framework, we integrate the three UDA approaches only into the other two ECPE models (RankCP Wei et al. (2020) and UTOS Cheng et al. (2021)) on the ECPE task to further demonstrate the effectiveness of our model. Please refer to A.2 for an introduction to the baseline methods and the implementation details.

4.2 Results and Analysis

(a) S: Society. Columns are grouped by target domain (Society → target); all values are in %.

| Model | Home EE P | Home EE R | Home EE F1 | Home ECPE P | Home ECPE R | Home ECPE F1 | Finance EE P | Finance EE R | Finance EE F1 | Finance ECPE P | Finance ECPE R | Finance ECPE F1 | Education EE P | Education EE R | Education EE F1 | Education ECPE P | Education ECPE R | Education ECPE F1 | Entertainment EE P | Entertainment EE R | Entertainment EE F1 | Entertainment ECPE P | Entertainment ECPE R | Entertainment ECPE F1 | Weighted Avg EE F1 | Weighted Avg ECPE F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RankCP | 21.90 | 25.22 | 23.44 | 13.14 | 14.54 | 13.80 | 18.04 | 21.00 | 19.41 | 8.56 | 9.86 | 9.17 | 26.13 | 31.90 | 28.73 | 18.59 | 22.29 | 20.27 | 26.87 | 32.73 | 29.51 | 13.43 | 16.36 | 14.75 | 23.49 | 13.65 |
| RankCP+Ada-TSA | 18.55 | 21.16 | 19.77 | 12.30 | 13.48 | 12.86 | 15.86 | 17.44 | 16.61 | 7.12 | 7.75 | 7.42 | 20.62 | 24.54 | 22.41 | 11.86 | 13.86 | 12.78 | 23.44 | 27.27 | 25.21 | 6.25 | 7.27 | 6.72 | 19.65 | 11.41 |
| RankCP+DANN | 91.51 | 98.15 | 94.72 | 51.69 | 93.85 | 66.67 | 85.06 | 93.24 | 88.96 | 40.38 | 75.35 | 52.58 | 82.87 | 92.02 | 87.21 | 43.01 | 74.10 | 54.42 | 77.78 | 89.09 | 83.05 | 30.48 | 58.18 | 40.00 | 92.03 | 60.93 |
| RankCP+MEDM | 20.17 | 23.12 | 21.55 | 12.77 | 14.07 | 13.39 | 20.43 | 23.84 | 22.00 | 9.76 | 11.27 | 10.46 | 24.14 | 30.06 | 26.78 | 13.30 | 16.27 | 14.63 | 14.52 | 16.36 | 15.38 | 6.45 | 7.27 | 6.84 | 22.04 | 12.63 |
| UTOS | 91.51 | 47.72 | 62.73 | 70.99 | 35.58 | 47.40 | 93.33 | 49.82 | 64.97 | 71.33 | 37.68 | 49.31 | 92.21 | 43.56 | 59.17 | 67.09 | 31.93 | 43.27 | 71.43 | 27.27 | 39.47 | 47.62 | 18.18 | 26.32 | 61.77 | 46.39 |
| UTOS+Ada-TSA | 18.55 | 21.16 | 19.77 | 12.30 | 13.48 | 12.86 | 15.86 | 17.44 | 16.61 | 7.12 | 7.75 | 7.42 | 20.62 | 24.54 | 22.41 | 11.86 | 13.86 | 12.78 | 23.44 | 27.27 | 25.21 | 6.25 | 7.27 | 6.72 | 19.65 | 11.41 |
| UTOS+DANN | 84.96 | 61.13 | 71.10 | 56.41 | 40.07 | 46.86 | 89.55 | 64.06 | 74.69 | 57.84 | 41.55 | 48.36 | 86.92 | 57.06 | 68.89 | 62.28 | 42.77 | 50.71 | 80.65 | 45.45 | 58.14 | 48.39 | 27.27 | 34.88 | 71.04 | 47.16 |
| UTOS+MEDM | 52.80 | 55.60 | 54.16 | 14.63 | 33.32 | 20.31 | 15.31 | 89.68 | 26.15 | 0.64 | 13.03 | 1.21 | 53.00 | 62.50 | 46.01 | 24.23 | 28.31 | 26.11 | 57.50 | 41.82 | 48.42 | 12.64 | 20.00 | 15.49 | 46.82 | 16.70 |
| UECA-Prompt | 75.59 | 74.66 | 75.12 | 50.92 | 61.43 | 55.69 | 71.01 | 69.75 | 70.38 | 51.13 | 62.63 | 56.30 | 75.84 | 82.82 | 79.17 | 48.84 | 62.87 | 54.97 | 73.58 | 70.91 | 72.22 | 45.21 | 60.00 | 51.56 | 74.48 | 55.55 |
| Ours | 81.77 | 76.14 | 78.85 | 58.59 | 71.98 | 64.60 | 86.42 | 81.49 | 83.88 | 75.96 | 82.01 | 78.87 | 83.85 | 82.82 | 83.33 | 74.30 | 79.64 | 76.88 | 86.00 | 78.18 | 81.90 | 84.62 | 80.00 | 82.24 | 80.63 | 71.35 |

(b) S: Home. Columns are grouped by target domain (Home → target); all values are in %.

| Model | Society EE P | Society EE R | Society EE F1 | Society ECPE P | Society ECPE R | Society ECPE F1 | Finance EE P | Finance EE R | Finance EE F1 | Finance ECPE P | Finance ECPE R | Finance ECPE F1 | Education EE P | Education EE R | Education EE F1 | Education ECPE P | Education ECPE R | Education ECPE F1 | Entertainment EE P | Entertainment EE R | Entertainment EE F1 | Entertainment ECPE P | Entertainment ECPE R | Entertainment ECPE F1 | Weighted Avg EE F1 | Weighted Avg ECPE F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RankCP | 83.88 | 91.82 | 87.67 | 44.33 | 75.42 | 55.84 | 86.56 | 93.95 | 90.10 | 43.41 | 75.35 | 55.08 | 83.33 | 92.02 | 87.46 | 44.48 | 77.71 | 56.58 | 84.48 | 89.09 | 86.73 | 36.78 | 58.18 | 45.07 | 81.85 | 51.31 |
| RankCP+Ada-TSA | 16.38 | 19.37 | 17.75 | 8.25 | 9.51 | 8.84 | 20.42 | 24.20 | 22.15 | 8.11 | 9.51 | 8.75 | 18.82 | 21.47 | 20.06 | 10.75 | 12.05 | 11.36 | 22.06 | 27.27 | 24.39 | 5.88 | 7.27 | 6.50 | 18.01 | 8.40 |
| RankCP+DANN | 29.29 | 37.45 | 32.87 | 26.79 | 33.43 | 29.74 | 25.00 | 29.54 | 27.08 | 14.76 | 17.25 | 15.91 | 31.34 | 41.72 | 35.79 | 17.51 | 22.89 | 19.84 | 27.27 | 32.73 | 29.75 | 13.64 | 16.36 | 14.88 | 29.49 | 22.73 |
| RankCP+MEDM | 15.43 | 17.36 | 16.34 | 6.89 | 7.55 | 7.20 | 7.61 | 7.47 | 7.54 | 2.17 | 2.11 | 2.14 | 22.04 | 25.15 | 23.50 | 8.60 | 9.64 | 9.09 | 23.81 | 27.27 | 25.42 | 6.35 | 7.27 | 6.78 | 14.55 | 5.81 |
| UTOS | 88.56 | 51.08 | 64.79 | 70.69 | 40.08 | 51.16 | 90.00 | 57.65 | 70.28 | 62.30 | 40.14 | 48.82 | 92.13 | 50.31 | 65.08 | 70.79 | 37.95 | 49.41 | 78.26 | 32.73 | 46.15 | 52.17 | 21.82 | 30.77 | 60.57 | 45.89 |
| UTOS+Ada-TSA | 16.38 | 19.37 | 17.75 | 8.25 | 9.51 | 8.84 | 20.42 | 24.20 | 22.15 | 8.11 | 9.51 | 8.75 | 18.82 | 21.47 | 20.06 | 10.75 | 12.05 | 11.36 | 22.06 | 27.27 | 24.39 | 5.88 | 7.27 | 6.50 | 18.01 | 8.40 |
| UTOS+DANN | 87.98 | 62.98 | 73.41 | 63.04 | 44.62 | 52.25 | 89.36 | 59.79 | 71.64 | 63.16 | 42.25 | 50.63 | 85.32 | 57.06 | 68.38 | 60.91 | 40.36 | 48.55 | 78.12 | 45.45 | 57.47 | 37.50 | 21.82 | 27.59 | 66.45 | 46.63 |
| UTOS+MEDM | 33.96 | 65.28 | 44.67 | 5.52 | 37.20 | 9.61 | 13.85 | 92.88 | 24.11 | 0.61 | 14.44 | 1.16 | 39.21 | 54.60 | 45.64 | 6.45 | 30.12 | 10.63 | 46.67 | 50.91 | 48.70 | 9.3 | 21.82 | 13.04 | 37.31 | 7.37 |
| UECA-Prompt | 76.52 | 85.08 | 80.57 | 66.33 | 63.11 | 64.68 | 78.04 | 82.21 | 80.07 | 61.96 | 59.17 | 60.53 | 75.14 | 81.60 | 78.24 | 66.27 | 65.87 | 66.07 | 75.93 | 74.55 | 75.23 | 58.18 | 58.18 | 58.18 | 69.10 | 59.04 |
| Ours | 86.07 | 79.77 | 82.80 | 68.78 | 75.07 | 71.79 | 81.79 | 84.70 | 83.22 | 76.03 | 83.39 | 79.54 | 80.72 | 82.21 | 81.46 | 84.71 | 79.64 | 82.10 | 84.31 | 78.18 | 81.13 | 83.33 | 81.82 | 82.57 | 76.72 | 70.09 |

Table 1: Experimental results of our models and baselines utilizing precision (P), recall (R), and F1 score (F1) as metrics on the UDA-ECPE task. Emotion Extraction is denoted by EE. S refers to source domain.

| Model | EN-ECPE → RECCON EE F1 (%) | EN-ECPE → RECCON ECPE F1 (%) | RECCON → EN-ECPE EE F1 (%) | RECCON → EN-ECPE ECPE F1 (%) | Weighted Average EE F1 (%) | Weighted Average ECPE F1 (%) |
|---|---|---|---|---|---|---|
| RankCP | 39.86 | 23.28 | 52.96 | 28.26 | 47.87 | 26.32 |
| RankCP+Ada-TSA | 22.67 | 12.13 | 19.73 | 11.79 | 20.87 | 11.92 |
| RankCP+DANN | 26.40 | 14.87 | 32.17 | 17.87 | 29.93 | 16.7 |
| RankCP+MEDM | 21.79 | 4.69 | 30.15 | 8.65 | 26.90 | 7.11 |
| UTOS | 33.96 | 27.83 | 24.13 | 18.48 | 27.95 | 22.12 |
| UTOS+Ada-TSA | 23.73 | 11.21 | 19.13 | 11.73 | 20.92 | 11.53 |
| UTOS+DANN | 15.29 | 3.36 | 13.91 | 3.71 | 14.44 | 3.57 |
| UTOS+MEDM | 30.11 | 1.55 | 18.09 | 3.75 | 22.76 | 2.89 |
| UECA-Prompt | 0.63 | 15.76 | 1.63 | 18.48 | 1.24 | 17.42 |
| Ours | 29.57 | 28.94 | 21.58 | 28.66 | 24.69 | 28.77 |

Table 2: Experimental results of our models and the baseline models on EN-ECPE and RECCON.

All columns report ECPE results (%) for the Society → target setting.

| Model | Entertainment P | Entertainment R | Entertainment F1 | Home P | Home R | Home F1 | Education P | Education R | Education F1 | Finance P | Finance R | Finance F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | 84.62 | 80.00 | 82.24 | 58.59 | 71.98 | 64.60 | 74.30 | 79.64 | 76.88 | 75.96 | 82.01 | 78.87 |
| w/o MMD | 69.63 | 74.02 | 71.76 | 49.77 | 48.70 | 49.23 | 65.54 | 69.78 | 67.60 | 68.65 | 58.63 | 63.30 |
| w/o HSIC | 59.87 | 73.23 | 65.88 | 40.51 | 51.76 | 45.66 | 61.73 | 73.38 | 67.05 | 64.23 | 61.57 | 62.88 |
| w/o VI | 63.51 | 74.02 | 68.36 | 45.97 | 52.52 | 49.09 | 60.24 | 71.94 | 65.57 | 69.12 | 60.59 | 64.61 |
| w/o $\Omega^{b}$ | 61.66 | 61.42 | 61.54 | 39.50 | 55.57 | 46.58 | 62.91 | 76.26 | 68.94 | 60.31 | 67.45 | 63.71 |
| w/o $\Omega^{bb}$ | 76.52 | 79.53 | 77.99 | 54.80 | 52.52 | 53.64 | 66.55 | 71.49 | 68.95 | 83.10 | 69.41 | 75.71 |
| w/o $\Omega^{\text{MMD}}$ | 78.12 | 78.74 | 78.43 | 64.30 | 57.86 | 60.95 | 69.14 | 80.58 | 74.42 | 86.39 | 68.43 | 76.49 |
| w/o Adapter | 86.67 | 75.00 | 80.44 | 59.05 | 71.16 | 64.54 | 75.88 | 74.44 | 75.15 | 75.74 | 79.93 | 77.78 |
| w/o Self-training | 45.24 | 34.55 | 39.18 | 18.63 | 66.00 | 29.06 | 25.62 | 61.68 | 36.20 | 27.19 | 51.56 | 35.60 |
| with Gold Emotions | 89.83 | 96.36 | 92.98 | 78.32 | 89.80 | 83.67 | 90.48 | 91.02 | 90.75 | 74.16 | 91.35 | 81.86 |

Table 3: Experimental results of our models with different settings for the ECPE task on CH-ECPE.
Overall Comparisons.

Table 1 and Table 2 report the results of our models and the baselines on the ECPE task, as well as the EE subtask. To dispel the doubt that our model outperforms the baselines only because they are developed in the supervised setting, we apply the SOTA UDA methods Ada-TSA Zhang et al. (2021), DANN Ganin et al. (2016) and MEDM Wu et al. (2021) to the two baselines RankCP and UTOS on the UDA-ECPE task. MEDM is a minimal-entropy UDA approach that introduces diversity maximization to regulate entropy minimization in pursuit of close-to-ideal domain adaptation. Ada-TSA is a recently proposed adapter-based UDA approach in which newly added adapters capture transferable features between source and target domains through a domain-fusion scheme. DANN is a widely adopted adversarial UDA approach that learns domain-invariant representations through a domain discriminator. We find that, after applying the UDA framework, RankCP and UTOS significantly improve their performance and become comparable with the SOTA prompt-based model UECA-Prompt.

However, although we enhance the baseline models with UDA (for RankCP and UTOS) or leverage the powerful ability of a large language model (LLM) (for UECA-Prompt), the baseline models still perform worse than our proposed model. On CH-ECPE, our model outperforms RankCP+DANN by 10.42% when treating society as the source domain, and UECA-Prompt by 11.05% with home as the source domain, in terms of weighted-average F1. On EN-ECPE, our model is better than the supervised learning model RankCP by 2.45%. We also observe that our model achieves the best ECPE results in almost all domains, the only exception being the Society → Home setting, indicating the generalization ability of the proposed approach. It is worth mentioning that our model performs best on ECPE even though it does not always achieve the best performance on the EE subtask. Note that there is a significant performance gap between the Chinese and English benchmarks. This gap is mainly due to distribution bias: the five domains used for testing in the Chinese benchmark are extracted from the same corpus, CH-ECPE, whereas the two domains in the English setting derive from two different datasets, RECCON and EN-ECPE. Therefore, compared with the Chinese domains, the two English domains share less knowledge with each other, making it hard for the model to transfer from one domain to the other. Overall, the results demonstrate the strengths of our model in identifying new causal relations between events and emotions in new domains.

Ablation Study.

To analyze the influence of the different modules on the proposed approach, we conduct an ablation study. The second row (named ‘Original’) in Table 3 reports the result of our model when it is equipped with all the techniques presented in this work.

To study the effect of the regularizer $\Omega$ (see Sec. 3.2.1) for disentangled representation learning, we remove $\Omega^{\text{MMD}}$ during model training and also compare it with other types of regularizers, including two independence measures, the Hilbert–Schmidt independence criterion (HSIC) Gretton et al. (2005) and Variation of Information (VI) Cheng et al. (2020). From Table 3 we can see that there is at least a 2.38% drop in F1 on CH-ECPE when the regularizer $\Omega^{\text{MMD}}$ is removed. Adding HSIC does more harm than good, and VI brings almost no benefit to the model. It is also not useful to apply only the regularizer $\Omega^{b}$, which maximizes the Bhattacharyya distance between the variational posteriors $q(\mathbf{Z}^e \mid \mathbf{X}_{ij})$ and $q(\mathbf{Z}^c \mid \mathbf{X}_{uv})$ from the same clause pair. However, the regularizer works when we maximize the Bhattacharyya distance between two variational posteriors from all possible instance pairs in a batch. Similarly, the MMD-based regularizer $\Omega^{\text{MMD}}$ also works because it maximizes the MMD distance across instances.

We also remove the emotion and event adapters and use the unified pair representation as the input to both the emotion and event encoders. As Table 3 shows, this lowers performance in all domains, which suggests that using different vectors to represent the emotion and event variables is the better solution. In addition, we conduct experiments investigating the efficacy of self-training and the regularizer, detailed in A.3.

Figure 4: Experimental results of CaRel-VAE w/o MMD and CaRel-VAE for normal and self-chain cases. The normal case refers to an emotion-cause pair composed of two different clauses, while for the self-chain case a pair are mentioned in the same clause.

5 Related Work

Emotion-Cause Pair Extraction.

ECPE is a new task that aims to extract all potential emotions and their corresponding causes in an unannotated document. The pioneering work of Xia and Ding (2019) proposes a two-step approach that first extracts emotion and cause clauses separately. Wei et al. (2020) propose a joint neural approach that applies graph attention to model the interrelations between clauses and rank candidate emotion-cause pairs. Zheng et al. (2022) are the first to introduce prompt learning into the ECPE task, decomposing it into multiple sub-tasks and designing prompts for each sub-task.

Our model is different from existing works in two main aspects. Firstly, we tackle ECPE in the UDA setting, which is more difficult and practical as it allows distribution discrepancies between different domains. Secondly, we solve UDA-ECPE from a causal perspective and design a causal disentanglement mechanism to approximate emotion and cause random variables, enabling causal discovery to identify causal relations between them and consequently retrieve positive pairs.

Unsupervised Domain Adaptation.

Domain adaptation addresses domain shift, allowing a pre-trained model to generalize from a source to a target domain. It falls into two types: supervised and unsupervised (examples of both types can be found in A.4).

Our work focuses on unsupervised domain adaptation (UDA), specifically extracting cross-domain emotion-cause pairs from labeled source domains to unlabeled target domains. Unlike prior studies Miller (2019); Du et al. (2020); Zou et al. (2021); Karouzos et al. (2021); Zhang et al. (2021) on binary sentiment classification, we tackle non-binary variables (emotion and cause) that are causally linked. This is the first known attempt to discover causal relations in UDA.

Disentangled Representation Learning.

The aim of disentangled representation learning (DRL) is to learn factorized representations that reveal the semantically meaningful factors hidden in the observed data Bengio et al. (2013); Higgins et al. (2018). Mainstream DRL approaches in NLP John et al. (2019); Cheng et al. (2020); Vishnubhotla et al. (2021) learn such representations by adopting variational autoencoders  (Kingma and Welling, 2013, VAE), which achieve disentanglement via the Kullback-Leibler (Kullback and Leibler, 1951, KL) divergence minimization between the posterior of the latent factors and a standard multivariate normal prior.
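The KL term mentioned above has a closed form when the posterior is a diagonal Gaussian and the prior is a standard multivariate normal. The following is a minimal sketch under that assumption; the function name and the log-variance parameterization are illustrative choices rather than details taken from the cited works.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )
```

The term is zero only when the posterior equals the prior, so minimizing it pulls every latent dimension toward an independent unit Gaussian.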

6 Conclusion

We propose a novel causal discovery inspired VAE model and a customized self-training algorithm for the UDA-ECPE task. We disentangle the latent representations of emotions from those of events with a novel variational posterior regularization technique that does not enforce independence between the corresponding latent random variables. This work also sheds light on the connections between causal relation identification in the NLP community and causal discovery theory, and paves the way for theoretically grounded approaches to comprehensively analyzing causal structures in texts.

Limitations

A potential limitation of this work is that, due to resource and time constraints, we only used BERT-based ECPE classification models, which match our model's architecture, as baselines. We did not compare against the latest large language models (LLMs). Recent studies indicate that LLMs are not particularly effective at solving causal discovery tasks, but this has not been verified for ECPE. Therefore, in the future, we plan to include the following LLM-based baselines: a zero-shot LLM (encapsulating the ECPE task in a task instruction prompt to obtain answers from the LLM), a few-shot LLM (selecting a few ECPE examples as in-context learning demonstrations), and an SFT-based LLM (fine-tuning the LLM using the ECPE dataset as task instructions). Comparing the method proposed in this paper with these LLM-based methods will let us explore empirically whether LLMs can be effectively applied to causal discovery tasks.

References

  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828.
  • Bhattacharyya (1946) Anil Bhattacharyya. 1946. On a measure of divergence between two multinomial populations. Sankhyā: the indian journal of statistics, pages 401–406.
  • Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 440–447.
  • Chen et al. (2011) Minmin Chen, Kilian Q Weinberger, and John Blitzer. 2011. Co-training for domain adaptation. Advances in neural information processing systems, 24.
  • Cheng et al. (2020) Pengyu Cheng, Martin Renqiang Min, Dinghan Shen, Christopher Malon, Yizhe Zhang, Yitong Li, and Lawrence Carin. 2020. Improving disentangled text representation learning with information-theoretic guidance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7530–7541.
  • Cheng et al. (2021) Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Na Li, and Qing Gu. 2021. A unified target-oriented sequence-to-sequence model for emotion-cause pair extraction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2779–2791.
  • Daumé III (2007) Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Du et al. (2020) Chunning Du, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao. 2020. Adversarial and domain-aware bert for cross-domain sentiment analysis. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 4019–4028.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer.
  • Higgins et al. (2018) Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. 2018. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.
  • John et al. (2019) Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2019. Disentangled representation learning for non-parallel text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 424–434.
  • Karouzos et al. (2021) Constantinos Karouzos, Georgios Paraskevopoulos, and Alexandros Potamianos. 2021. Udalm: Unsupervised domain adaptation through language modeling. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2579–2590.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86.
  • Kumar et al. (2020) Ananya Kumar, Tengyu Ma, and Percy Liang. 2020. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pages 5468–5479. PMLR.
  • Li and Sun (2007) Jingyang Li and Maosong Sun. 2007. Scalable term selection for text categorization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 774–782.
  • Martins and Astudillo (2016) André F. T. Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1614–1623. JMLR.org.
  • Miller (2019) Timothy Miller. 2019. Simplified neural unsupervised domain adaptation. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2019, page 414. NIH Public Access.
  • Plank (2011) Barbara Plank. 2011. Domain adaptation for parsing. Citeseer.
  • Poria et al. (2021) Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Deepanway Ghosal, Rishabh Bhardwaj, Samson Yu Bai Jian, Pengfei Hong, Romila Ghosh, Abhinaba Roy, Niyati Chhaya, Alexander F. Gelbukh, and Rada Mihalcea. 2021. Recognizing emotion cause in conversations. Cogn. Comput., 13(5):1317–1332.
  • Ramponi and Plank (2020) Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in nlp—a survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6838–6855.
  • Royden and Fitzpatrick (1988) Halsey Lawrence Royden and Patrick Fitzpatrick. 1988. Real analysis, volume 32. Macmillan New York.
  • Russin et al. (2019) Jake Russin, Jason Jo, Randall C O’Reilly, and Yoshua Bengio. 2019. Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708.
  • Sun et al. (2016) Maosong Sun, Jingyang Li, Zhipeng Guo, Z Yu, Y Zheng, X Si, and Z Liu. 2016. Thuctc: an efficient chinese text classifier. GitHub Repository.
  • van den Burg and Williams (2021) Gerrit van den Burg and Chris Williams. 2021. On memorization in probabilistic deep generative models. Advances in Neural Information Processing Systems, 34:27916–27928.
  • Vishnubhotla et al. (2021) Krishnapriya Vishnubhotla, Graeme Hirst, and Frank Rudzicz. 2021. An evaluation of disentangled representation learning for texts. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1939–1951.
  • Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.
  • Wang et al. (2022) Yufei Wang, Haoliang Li, Hao Cheng, Bihan Wen, Lap-Pui Chau, and Alex C. Kot. 2022. Variational disentanglement for domain generalization. Trans. Mach. Learn. Res., 2022.
  • Wei et al. (2020) Penghui Wei, Jiahao Zhao, and Wenji Mao. 2020. Effective inter-clause modeling for end-to-end emotion-cause pair extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3171–3181.
  • Wu et al. (2021) Xiaofu Wu, Suofei Zhang, Quan Zhou, Zhen Yang, Chunming Zhao, and Longin Jan Latecki. 2021. Entropy minimization versus diversity maximization for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems.
  • Xia and Ding (2019) Rui Xia and Zixiang Ding. 2019. Emotion-cause pair extraction: A new task to emotion analysis in texts. arXiv preprint arXiv:1906.01267.
  • Zad et al. (2021) Samira Zad, Maryam Heidari, H James Jr, and Ozlem Uzuner. 2021. Emotion detection of textual data: An interdisciplinary survey. In 2021 IEEE World AI IoT Congress (AIIoT), pages 0255–0261. IEEE.
  • Zhang et al. (2021) Rongsheng Zhang, Yinhe Zheng, Xiaoxi Mao, and Minlie Huang. 2021. Unsupervised domain adaptation with adapter. In Advances in Neural Information Processing Systems.
  • Zhao et al. (2019) Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. 2019. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR.
  • Zheng et al. (2022) Xiaopeng Zheng, Zhiyue Liu, Zizhen Zhang, Zhaoyang Wang, and Jiahai Wang. 2022. Ueca-prompt: Universal prompt for emotion cause analysis. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 7031–7041. International Committee on Computational Linguistics.
  • Zou et al. (2021) Han Zou, Jianfei Yang, and Xiaojian Wu. 2021. Unsupervised energy-based adversarial domain adaptation for cross-domain text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1208–1218.

A Appendix

A.1 Visualization of sentence embeddings for English UDA-ECPE corpora

As shown in Fig. 5(a) and Fig. 5(b), regardless of whether a clause mentions an emotion or an emotion cause, there is a very clear boundary between the two domains. Their domain differences are largely caused by the differences between the two datasets.

(a) English emotion cause clauses
(b) English emotion clauses
Figure 5: The t-SNE visualizations of the clause embeddings from the English UDA-ECPE corpora

A.2 Baseline Model and Implementation Detail

 
Language  Domain         #Docs
Chinese   Home           746
          Society        659
          Finance        263
          Education      153
          Entertainment  52
English   EN-ECPE        1226
          RECCON         780
 
Table 4: The statistics of the UDA-ECPE corpora.

RankCP performs the emotion-cause pair extraction using the graph attention network, which models the inter-clause information and extracts the valid emotion-cause pairs from a ranking perspective.

UTOS adopts a unified sequence labeling approach so that the positions of emotion and cause clauses, as well as how they pair, can be predicted in one pass of sequence labeling.

UECA-Prompt designs sub-prompts for the emotion extraction, cause extraction, and emotion-cause pair extraction sub-tasks, then synthesizes the sub-prompts to solve the ECA task.

We adopt $\text{BERT}_{ZH}$ (https://huggingface.co/hfl/chinese-roberta-wwm-ext) and $\text{BERT}_{EN}$ (https://huggingface.co/roberta-base) as the clause pair encoders for Chinese and English, respectively. The hidden size of the bidirectional LSTM in the emotion extraction model is set to 100. The output dimensions of the emotion classifier and the event predictor in CaRel-VAE are set to 24. The confidence threshold for the self-training of the emotion extraction model is set to 0.7. The number of iterations for the self-training of the event-emotion relation model is set to 50.

We train the emotion extraction model and CaRel-VAE with the Adam optimizer. The learning rate and mini-batch size are 2e-5 and 4 for the emotion extraction model, and 1e-5 and 64 for CaRel-VAE. For regularization, we apply dropout with a rate of 0.5 to both models.
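The settings above can be gathered into a single configuration, together with a sketch of how a confidence threshold is typically applied when selecting pseudo-labels. The key names, the `select_confident` helper, and the use of `>=` at the threshold are assumptions for illustration; the assignment of learning rates and batch sizes to the two models follows the order in which they are listed.

```python
# Hyperparameters as reported in this appendix; key names are hypothetical.
CONFIG = {
    "emotion_extraction": {
        "bilstm_hidden_size": 100,
        "learning_rate": 2e-5,
        "batch_size": 4,
        "dropout": 0.5,
        "self_training_confidence": 0.7,
    },
    "carel_vae": {
        "output_dim": 24,  # emotion classifier and event predictor
        "learning_rate": 1e-5,
        "batch_size": 64,
        "dropout": 0.5,
        "self_training_iterations": 50,
    },
}

def select_confident(predictions, threshold):
    """Keep (clause_id, label) where the top class probability reaches the threshold."""
    kept = []
    for clause_id, probs in predictions.items():
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            kept.append((clause_id, label))
    return kept
```

With a threshold of 0.7, a clause predicted at 0.55 would be left out of the pseudo-labeled set for that round, while one predicted at 0.8 would be kept.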

A.3 Ablation Study in Self Training

We train the model using the source domain's ground-truth labels, and then directly apply this supervised model to the target domain without any self-training. The 'w/o Self-training' row of Table 3 shows a major performance drop, indicating the usefulness of self-training.

Furthermore, it is interesting to explore the extent to which the predicted emotion labels, i.e., the EE results, influence downstream ECPE performance. We therefore use the ground-truth emotion labels, instead of the ones predicted by the emotion extraction model, as the input to the ECPE task. In the last row of Table 3, the minimum improvement observed across all domains is 2.99% in F1, showing that the quality of the emotion prediction does have some impact on the ECPE task. However, our model still achieves the best results even when we use only a moderately performing emotion extraction model to predict the emotions, a task that is not the focus of this work.

Regularizer.

To further understand how $\Omega^{\text{MMD}}$ contributes to the UDA-ECPE task, we examine the performance of our original model and its variant on two types of emotion-cause pairs, normal and self-chain. The results are shown in Figure 4. The performance improvement is mainly attributable to the significant increase in precision on self-chain cases. This suggests that disentangled representation learning helps approximate the emotion and cause random variables from emotion-cause pairs, and ultimately aids the causal discovery process.
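The split between normal and self-chain cases used in this analysis can be sketched as follows. The sketch assumes each pair is a `(doc_id, emotion_clause_idx, cause_clause_idx)` tuple and that a pair is self-chain when both indices coincide; the function name and data layout are hypothetical.

```python
def precision_by_case(predicted, gold):
    """Precision of predicted pairs, reported separately per case type."""
    stats = {"normal": [0, 0], "self-chain": [0, 0]}  # [correct, predicted]
    for pair in predicted:
        _, emotion_idx, cause_idx = pair
        case = "self-chain" if emotion_idx == cause_idx else "normal"
        stats[case][1] += 1
        if pair in gold:
            stats[case][0] += 1
    return {
        case: (correct / total if total else 0.0)
        for case, (correct, total) in stats.items()
    }
```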

Figure 6: Experimental results of our variant model that fixes negative samples during self-training (denoted as "CaRel-VAE w/ FN") and our original model CaRel-VAE.

Improved Self-training.

For CD-SelfTrain, we examine the usefulness of always constructing a new training set in each iteration during self-training. As a comparison, we only update the training set from the previous iteration by adding new documents. In this way, negative examples in the training set remain the same once their documents are added to the training set. Fig. 6 reports the proportion of changed positive examples and the proportion of changed examples in each iteration, as well as changes of precision/recall/F1 over time. We can see that changing negative examples in each iteration indeed prevents the model from memorizing the training examples so that it improves the generalization capability of our model.
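The two ways of maintaining the training set compared above can be sketched as follows. The sketch assumes that each iteration's predictions are stored as a mapping from document id to a mapping from clause pair to a 0/1 label; freezing every label of a document once it has been added is a simplifying assumption, and all names are hypothetical.

```python
def rebuild_training_set(predictions):
    """Fresh set: every document is relabeled from the current iteration's predictions."""
    return {doc: dict(pairs) for doc, pairs in predictions.items()}

def extend_training_set(previous, predictions):
    """Comparison variant: documents already in the set keep their old labels."""
    updated = {doc: dict(pairs) for doc, pairs in previous.items()}
    for doc, pairs in predictions.items():
        updated.setdefault(doc, dict(pairs))  # only new documents are added
    return updated

def changed_fraction(old, new):
    """Share of examples present in both sets whose label differs."""
    shared = [(d, p) for d in old if d in new for p in old[d] if p in new[d]]
    if not shared:
        return 0.0
    changed = sum(old[d][p] != new[d][p] for d, p in shared)
    return changed / len(shared)
```

A quantity like `changed_fraction` corresponds to the proportion of changed examples tracked per iteration; under the extend-only variant it stays at zero for documents already in the set.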

A.4 Additional Content for related work

Depending on the availability of labeled target domain data, domain adaptation can be categorized into two broad classes: supervised domain adaptation and unsupervised domain adaptation. The former can achieve promising results given a small amount of labeled target domain data Daumé III (2007); Plank (2011). Conversely, unsupervised domain adaptation (UDA) does not require any data in the target domain to be labeled and is thus more attractive and challenging Glorot et al. (2011); Ramponi and Plank (2020). Our work falls under the UDA research area. Specifically, we study cross-domain emotion-cause pair extraction from one labeled source domain to various unlabeled target domains. Unlike most previous works Miller (2019); Du et al. (2020); Zou et al. (2021); Karouzos et al. (2021); Zhang et al. (2021) on cross-domain sentiment classification, which work solely with a binary categorical variable (i.e., positive or negative sentiment), we simultaneously focus on two non-binary ones (i.e., emotion and cause) that are causally dependent. To the best of our knowledge, this is the first attempt at discovering causal relations in the context of UDA.