11institutetext: Japan Advanced Institute of Science and Technology
11email: {xuejieying,phuongnm,matheny.blake,nguyenml}@jaist.ac.jp

BiosERC: Integrating Biography Speakers Supported by LLMs for ERC Tasks

Jieying Xue    Minh-Phuong Nguyen    Blake Matheny    Le-Minh Nguyen
Abstract

In the Emotion Recognition in Conversation task, recent investigations have utilized attention mechanisms exploring relationships among utterances from intra- and inter-speakers for modeling emotional interaction between them. However, attributes such as speaker personality traits remain unexplored and present challenges in terms of their applicability to other tasks or compatibility with diverse model architectures. Therefore, this work introduces a novel framework named BiosERC, which investigates speaker characteristics in a conversation. By employing Large Language Models (LLMs), we extract the “biographical information” of the speaker within a conversation as supplementary knowledge injected into the model to classify emotional labels for each utterance. Our proposed method achieved state-of-the-art (SOTA) results on three famous benchmark datasets: IEMOCAP, MELD, and EmoryNLP, demonstrating the effectiveness and generalization of our model and showcasing its potential for adaptation to various conversation analysis tasks. Our source code is available at https://github.com/yingjie7/BiosERC.

Keywords:
speaker modeling biography of speaker in conversation emotion recognition in conversation large language models

1 Introduction

Emotion recognition in conversation (ERC) is a pivotal research topic that has garnered growing attention due to its extensive range of applications [1, 2]. In ERC tasks, the input text frequently consists of transcribed spoken dialogues from a speech recognition system, featuring colloquial or truncated statements that lack standardized grammar, thereby complicating emotional recognition in the dialogue. Unlike the traditional non-conversation sentiment analysis task, ERC emphasizes some of the many factors that influence ERC tasks, including contextual and speaker-specific information [1].

Therefore, recent approaches have inclined toward encoding acoustic features [3, 4] or contextual information [5, 6, 7] to enrich utterance vector representation. On the other hand, numerous previous works have typically utilized GRU [8, 9, 10], GNN [11, 2], or self-attention network [1, 12, 13] to encode richer speaker-specific information, including intra- and inter-speaker features. However, this latent information is predominantly learned from relationships among utterances. It poses challenges for validating its effectiveness and applying it to alternative tasks, and is problematic for other model architectures.

Refer to caption
Figure 1: Overview of our BiosERC framework

Additionally, speaker characteristics as a crucial and foundational feature in ERC tasks has not been comprehensively explored. We posit that within a dialogue, an individual’s character can significantly influence their manner of emotional expression and habitual vocabulary selection, leading to varying emotions for the same statement even when articulated by different speakers. Comprehension of interlocutors’ personality traits can thus facilitate accurately discerning their emotional inclinations within the discourse.

To tackle the aforementioned challenge, we propose BiosERC, a novel method designed to discover speakers’ personality information to enhance ERC systems. In contrast to previous methodologies relying on GRU [9, 10] or speaker-based masked attention mechanisms [1, 12, 13] to capture emotional expression features of different speakers, BiosERC stands out by precisely extracting individual personalities of speakers within dialogues (Figure 1). This uniqueness empowers the model to intricately comprehend character traits and encapsulate events of emotional transitions occurring within the characters. Moreover, our mechanism for extracting speaker characteristics is explicit and more amenable to verification and adaptation for application to various conversation analysis tasks.

Specifically, BiosERC utilizes LLMs with a prompting technique [14, 15] to extract descriptions of interlocutor features as supplementary knowledge, which are then injected into the emotion recognition process within conversations. As shown in Figure 1, this conversation involves three distinct speakers, each presenting unique perspectives and exhibiting markedly different emotional states. The speaker description facilitates the model’s thorough understanding of each speaker’s role within a conversation. Particularly, SPEAKER A is experiencing sadness and regret (as mentioned in the speaker description), resulting in expressions predominantly filled with sadness. SPEAKER B appears to be a supportive and empathetic listener, with limited involvement in the conversation, and reacts through SPEAKER A’s utterances. Meanwhile, SPEAKER C responds with excitement upon hearing their conversation. Intuitively, the integration of biographical data plays an important role in enriching the emotional background of each speaker in conversations, and holds the potential for more precise and comprehensive emotional recognition, especially in complex dialogues.

We carry out experiments on three benchmark datasets, including IEMOCAP, MELD, and EmoryNLP. The experimental results demonstrate that our method achieves SOTA performance, which indicates the effectiveness of our proposed model. Furthermore, our proposed mechanism, which uses a prompting technique for LLMs to extract the speakers’ biographical information, shows the potential to adapt to various conversation-level tasks such as opinion analysis, recommendation, and others.

2 Related Work

Emotion Recognition in Conversation.

In contrast to the conventional non-conversation sentiment analysis task, ERC demands a greater reliance on contextual and speaker-specific information for its support. For the purpose of modeling the conversational context, numerous studies employ Recurrent Neural Networks (RNNs) [16, 17] or Graph Convolution Network (GCN) [1, 18] to explore the hidden relationships between utterances. Moreover, the incorporation of contextual information and external knowledge into utterance vector representations has been notably achieved in recent works [12, 6, 16, 19] through the utilization of self-attention mechanisms and pre-trained Language Models (LM) [20, 21]. In the recent success of LLMs on various NLP tasks, InstructERC [22] is proposed to utilize the instruction prompting technique and fine-tune the LLM model for ERC tasks. MKFM framework [23] proposed the utilization of diverse supplementary knowledge information (e.g., emotional cause, topics) by ChatGPT service to inject into a graph-based model. In comparison, our work focuses on modeling speaker characteristics, a fundamental information which can be extracted by open-source LLMs (e.g, LLama-2). In addition, we also prove our proposed mechanism worked effectively when fine-tuning on both popular architectures: BERT and transformer-based decoder-only LLM.

Speaker-based ERC.

Because of the significant impact of speakers on ERC, researchers have placed emphasis on speaker modeling. DialogueRNN [8] and COSMIC [16] leverage Gated Recurrent Units (GRU) for the modeling of speaker-specific semantic context. Some researches [11, 2] treat conversations as graphs while incorporating prior speaker information as distinct relationships between utterances, or considers speakers as nodes within the graph. HiTrans [5] exploits an auxiliary task to classify whether two utterances belong to the same speaker to make the model speaker-sensitive. S+PAGE [24] employs a two-stream conversation Transformer architecture to extract both self and inter-speaker contextual features. However, the majority of prior research has predominantly concentrated on modeling individual speaker utterances or interactions among different speakers, with particular attention given to the intra- and inter-speaker aspects for the extraction of speaker-based information [6, 13]. Regrettably, limited emphasis has been placed on exploring speaker characteristics, which constitute critical and foundational elements of conversational information. Therefore, we propose a novel method named BiosERC, which employs external tools to extract speaker characteristics and inject them into the process of emotion recognition within conversations.

3 Methodology

This section introduces our baseline model architecture for the ERC task, which utilizes intra- and inter-speaker information following current SOTA methods [24, 6, 10], and our proposed method BiosERC, which incorporates the biography of the speakers into an ERC model. Formally, we define a conversation as: 𝒞={ui}0i<|𝒞|𝒞subscriptsubscript𝑢𝑖0𝑖𝒞\mathcal{C}=\{u_{i}\}_{0\leq i<|\mathcal{C}|}caligraphic_C = { italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 0 ≤ italic_i < | caligraphic_C | end_POSTSUBSCRIPT, where each individual utterance uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is articulated by speaker p(ui)𝒮𝑝subscript𝑢𝑖𝒮p(u_{i})\in\mathcal{S}italic_p ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ caligraphic_S, with 𝒮={sj}0j<|𝒮|𝒮subscriptsubscript𝑠𝑗0𝑗𝒮\mathcal{S}=\{s_{j}\}_{0\leq j<|\mathcal{S}|}caligraphic_S = { italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 0 ≤ italic_j < | caligraphic_S | end_POSTSUBSCRIPT representing the set of speakers in the conversation. Here, p𝑝pitalic_p denotes a mapping function that associates utterances with their respective speakers.

3.1 Intra-inter ERC (baseline)

Based on recent SOTA methods in the ERC task [24, 10, 13], we implement our baseline model consisting of three principal components: utterance vector representation, context modeling, and an emotion classification layer.

Utterance Vector Representation.

To enrich meaning representations, we follow an approach that mixes the surrounding utterances within a fixed-window size [25, 9, 19, 13]. Particularly, to encode a sentence uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the text input is combined by surrounding the utterance according to the following template: “[cls], uiwsubscript𝑢𝑖𝑤u_{i-w}italic_u start_POSTSUBSCRIPT italic_i - italic_w end_POSTSUBSCRIPT, .., </s>, uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, </s>, ... ui+wsubscript𝑢𝑖𝑤u_{i+w}italic_u start_POSTSUBSCRIPT italic_i + italic_w end_POSTSUBSCRIPT”, where w𝑤witalic_w is the local contextual window size hyperparameter. The utterance vector is computed by aggregating the respective word vectors following [13]:

hcls,hwordssuperscript𝑐𝑙𝑠superscript𝑤𝑜𝑟𝑑𝑠\displaystyle h^{cls},h^{words}italic_h start_POSTSUPERSCRIPT italic_c italic_l italic_s end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_w italic_o italic_r italic_d italic_s end_POSTSUPERSCRIPT =RoBERTa([uiw,..,ui+w])\displaystyle=\mathrm{RoBERTa}([u_{i-w},..,u_{i+w}])= roman_RoBERTa ( [ italic_u start_POSTSUBSCRIPT italic_i - italic_w end_POSTSUBSCRIPT , . . , italic_u start_POSTSUBSCRIPT italic_i + italic_w end_POSTSUBSCRIPT ] ) (1)
huttsuperscript𝑢𝑡𝑡\displaystyle h^{utt}italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT =[tanh(average(hwordsofui)Wu)]0i<|𝒞|absentsubscriptdelimited-[]tanhaveragesuperscriptwordsofsubscriptuisuperscript𝑊𝑢0𝑖𝒞\displaystyle=[\mathrm{tanh}(\mathrm{average}(h^{\mathrm{words\,of\,u_{i}}})% \cdot W^{u})]_{0\leq i<|\mathcal{C}|}= [ roman_tanh ( roman_average ( italic_h start_POSTSUPERSCRIPT roman_words roman_of roman_u start_POSTSUBSCRIPT roman_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ⋅ italic_W start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT ) ] start_POSTSUBSCRIPT 0 ≤ italic_i < | caligraphic_C | end_POSTSUBSCRIPT (2)

where hwordsofuisuperscriptwordsofsubscriptuih^{\mathrm{words\,of\,u_{i}}}italic_h start_POSTSUPERSCRIPT roman_words roman_of roman_u start_POSTSUBSCRIPT roman_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT denotes the word vectors selected from hwordssuperscript𝑤𝑜𝑟𝑑𝑠h^{words}italic_h start_POSTSUPERSCRIPT italic_w italic_o italic_r italic_d italic_s end_POSTSUPERSCRIPT at the positions of the utterance uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT; huttsuperscript𝑢𝑡𝑡h^{utt}italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT is all utterance vectors in a conversation; and Wsuperscript𝑊W^{*}italic_W start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT refers to learnable weights.

Context Modeling.

Utterance vectors are integrated contextual information of whole conversation by attention mechanism:

Attn(q,k,v,M)Attn𝑞𝑘𝑣𝑀\displaystyle\mathrm{Attn}(q,k,v,M)roman_Attn ( italic_q , italic_k , italic_v , italic_M ) =softmax(qkdt+M)vabsentsoftmax𝑞superscript𝑘subscript𝑑𝑡𝑀𝑣\displaystyle=\mathrm{softmax}(\frac{q\cdot k^{\intercal}}{\sqrt{d_{t}}}+M)\cdot v= roman_softmax ( divide start_ARG italic_q ⋅ italic_k start_POSTSUPERSCRIPT ⊺ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_ARG + italic_M ) ⋅ italic_v (3)
qt,kt,vtsubscript𝑞𝑡subscript𝑘𝑡subscript𝑣𝑡\displaystyle{q}_{t},{k}_{t},{v}_{t}italic_q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT =huttWtq,huttWtk,huttWtvabsentsuperscript𝑢𝑡𝑡superscriptsubscript𝑊𝑡𝑞superscript𝑢𝑡𝑡superscriptsubscript𝑊𝑡𝑘superscript𝑢𝑡𝑡superscriptsubscript𝑊𝑡𝑣\displaystyle={h^{utt}}{W}_{t}^{q},{h^{utt}}{W}_{t}^{k},{h^{utt}}{W}_{t}^{v}= italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT (4)
headt𝑒𝑎subscript𝑑𝑡\displaystyle{head}_{t}italic_h italic_e italic_a italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT =Attn(qt,kt,vt,M)absentAttnsubscript𝑞𝑡subscript𝑘𝑡subscript𝑣𝑡𝑀\displaystyle=\mathrm{Attn}({q}_{t},{k}_{t},{v}_{t},M)= roman_Attn ( italic_q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_M ) (5)
hMultiHeadsubscriptMultiHead\displaystyle{h}_{\mathrm{MultiHead}}italic_h start_POSTSUBSCRIPT roman_MultiHead end_POSTSUBSCRIPT =concat([headt|0<tH])Woabsentconcatdelimited-[]𝑒𝑎subscript𝑑conditional𝑡0𝑡𝐻superscript𝑊𝑜\displaystyle=\mathrm{concat}([{head}_{t|0<t\leq H}]){W}^{o}= roman_concat ( [ italic_h italic_e italic_a italic_d start_POSTSUBSCRIPT italic_t | 0 < italic_t ≤ italic_H end_POSTSUBSCRIPT ] ) italic_W start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT (6)

where H𝐻Hitalic_H is the number of heads in the MultiHead attention layer; qt,kt,vtsubscript𝑞𝑡subscript𝑘𝑡subscript𝑣𝑡{q}_{t},{k}_{t},{v}_{t}italic_q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are utterance vectors in various semantic space (dimension size dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT). In detail, following [6, 13], we construct the relation matrices (M𝑀Mitalic_M) for modeling relationship among utterances, where Mik=0subscript𝑀𝑖𝑘0M_{ik}=0italic_M start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT = 0 if uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and uksubscript𝑢𝑘u_{k}italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT should have interaction, Mik=subscript𝑀𝑖𝑘M_{ik}=-\inftyitalic_M start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT = - ∞ if otherwise. For the baseline model, we implement three different relationships: global context (all utterance pairs are connected), intra-speaker (only utterance pairs of the same speaker are connected), and inter-speaker (only utterance pairs of the different speaker are connected). Consequently, we acquire three new hidden states (from Equation 6) hcontxtsuperscript𝑐𝑜𝑛𝑡𝑥𝑡h^{contxt}italic_h start_POSTSUPERSCRIPT italic_c italic_o italic_n italic_t italic_x italic_t end_POSTSUPERSCRIPT, hintrasuperscript𝑖𝑛𝑡𝑟𝑎h^{intra}italic_h start_POSTSUPERSCRIPT italic_i italic_n italic_t italic_r italic_a end_POSTSUPERSCRIPT, hintersuperscript𝑖𝑛𝑡𝑒𝑟h^{inter}italic_h start_POSTSUPERSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUPERSCRIPT feed-forward to the Classification component.

Classification.

This component aims to integrate all the hidden features of utterances to classify the emotion label.

hispeakersubscriptsuperscript𝑠𝑝𝑒𝑎𝑘𝑒𝑟𝑖\displaystyle h^{speaker}_{i}italic_h start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =hiintraWa+hiinterWrabsentsubscriptsuperscript𝑖𝑛𝑡𝑟𝑎𝑖superscript𝑊𝑎subscriptsuperscript𝑖𝑛𝑡𝑒𝑟𝑖superscript𝑊𝑟\displaystyle={h^{intra}_{i}}{W}^{a}+{h^{inter}_{i}}{W}^{r}= italic_h start_POSTSUPERSCRIPT italic_i italic_n italic_t italic_r italic_a end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT + italic_h start_POSTSUPERSCRIPT italic_i italic_n italic_t italic_e italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT (7)
eiosubscriptsuperscript𝑒𝑜𝑖\displaystyle e^{o}_{i}italic_e start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =softmax(hiuttWu+hicontxtWg+hispeaker)absentsoftmaxsubscriptsuperscript𝑢𝑡𝑡𝑖superscript𝑊𝑢superscriptsubscript𝑖𝑐𝑜𝑛𝑡𝑥𝑡superscript𝑊𝑔subscriptsuperscript𝑠𝑝𝑒𝑎𝑘𝑒𝑟𝑖\displaystyle={\mathrm{softmax}(h^{utt}_{i}}{W}^{u}+{h_{i}^{contxt}}{W}^{g}+h^% {speaker}_{i})= roman_softmax ( italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT + italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_o italic_n italic_t italic_x italic_t end_POSTSUPERSCRIPT italic_W start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT + italic_h start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (8)

Then, the emotion vector (eiosubscriptsuperscript𝑒𝑜𝑖e^{o}_{i}italic_e start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) is used to compute the loss via Cross-Entropy function and is trained based on the gold emotional label of the i𝑖iitalic_i-th utterance.

3.2 Bios ERC

In this section, we describe the process of generating the speaker’s biography and present our BiosERC framework, leveraging two popular pre-trained LM as backbones: a BERT-based model [21] (e.g., RoBERTa) and a transformer-based decoder-only LLM model [15] (e.g., Llama-2). Notably, we also introduce an effective mechanism using the biography of speakers incorporating fine-tuning a LLM-based [26] with the prompting technique.

3.3 Biography of Speaker

In this part, we introduce a mechanism using the prompting technique for the LLMs to generate the description (djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT) for the respective speaker (ujsubscript𝑢𝑗u_{j}italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT). Given a conversation 𝒞𝒞\mathcal{C}caligraphic_C, the output of this step is the biography (description) of all speakers in a conversation ={dj}0j<|𝒮|subscriptsubscript𝑑𝑗0𝑗𝒮\mathcal{B}=\{d_{j}\}_{0\leq j<|\mathcal{S}|}caligraphic_B = { italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 0 ≤ italic_j < | caligraphic_S | end_POSTSUBSCRIPT.

dj=LLMs(prompting(𝒞,sj))subscript𝑑𝑗LLMsprompting𝒞subscript𝑠𝑗\displaystyle d_{j}=\mathrm{LLMs}(\mathrm{prompting}(\mathcal{C},s_{j}))italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = roman_LLMs ( roman_prompting ( caligraphic_C , italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) (9)

LLMs refers to large language models such as Llama2 [15], which can generalize a speaker’s biography based on their conversation. The prompting function is a template containing two conversation instances (𝒞𝒞\mathcal{C}caligraphic_C) and speaker identification (sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT) to exploit the knowledge of the LLMs (Table 1). To avoid long plain text descriptions, we instruct the LLMs to limit the output by adding a “note” concerning the length of the prompting template. Consequently, we obtain additional data about the persona of the speakers in each conversation (\mathcal{B}caligraphic_B), which is utilized for speaker modeling in the subsequent step.

Table 1: Prompting template to extract the description of characteristics of the speaker from a conversation with LLMs.
Given this conversation between speakers:
{conversation content 𝒞𝒞\mathcal{C}caligraphic_C}
In overall above conversation, what do you think about the characteristics of speaker {speaker identification sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT}? (Note: provide an answer within 250 words)

3.4 BERT-based BiosERC architecture

Firstly, we encode the speaker’s description using a pre-trained language model to acquire hidden vector representation (hjdescsubscriptsuperscript𝑑𝑒𝑠𝑐𝑗h^{desc}_{j}italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT).

hjdescsubscriptsuperscript𝑑𝑒𝑠𝑐𝑗\displaystyle h^{desc}_{j}italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT =RoBERTa(dj)[0]absentRoBERTasubscript𝑑𝑗delimited-[]0\displaystyle=\mathrm{RoBERTa}(d_{j})[0]= roman_RoBERTa ( italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) [ 0 ] (10)

where j𝑗jitalic_j is the speaker index in the set of speakers in a conversation, 0j<|𝒮|0𝑗𝒮0\leq j<|\mathcal{S}|0 ≤ italic_j < | caligraphic_S |. Our proposed method, BiosERC, extends the baseline model and redefines the speaker’s hidden vector representation (hispeakersuperscriptsubscript𝑖𝑠𝑝𝑒𝑎𝑘𝑒𝑟h_{i}^{speaker}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT in Equation 7) (Figure 2).

Refer to caption
Figure 2: Overview of our BiosERC model architecture.

This architecture is designed with a straightforward target that injects the personality information of each speaker into their corresponding utterances by a multi-layer perceptron network. Then, the speaker information in Equation 7 is replaced by:

hispeakersuperscriptsubscript𝑖𝑠𝑝𝑒𝑎𝑘𝑒𝑟\displaystyle h_{i}^{speaker}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT =hp(ui)descWdesc+bdescabsentsubscriptsuperscript𝑑𝑒𝑠𝑐𝑝subscript𝑢𝑖superscript𝑊𝑑𝑒𝑠𝑐superscript𝑏𝑑𝑒𝑠𝑐\displaystyle=h^{desc}_{p(u_{i})}{W}^{desc}+b^{desc}= italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT + italic_b start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT (11)

where p(ui)𝑝subscript𝑢𝑖p(u_{i})italic_p ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) denotes the corresponding speaker of utterance uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Through this mechanism, all the utterances from the same speaker are shared in the unified speaker vector representation, while the weights are updated in the training process. Finally, the utterance vector is fused with the speaker vector which supports emotional classification.

BiosERC - biography injected by attention mechanism.

We consider a variant of our BiosERC model, which is engineered to dynamically incorporate the speaker’s information into each utterance via the attention mechanism. The relationship between the current utterance and all individual speakers is integrated to enrich the utterance vector representation.

hifusionsuperscriptsubscript𝑖𝑓𝑢𝑠𝑖𝑜𝑛\displaystyle h_{i}^{fusion}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_u italic_s italic_i italic_o italic_n end_POSTSUPERSCRIPT =hp(ui)descWp+hiuttabsentsubscriptsuperscript𝑑𝑒𝑠𝑐𝑝subscript𝑢𝑖superscript𝑊𝑝subscriptsuperscript𝑢𝑡𝑡𝑖\displaystyle=h^{desc}_{p(u_{i})}{W}^{p}+h^{utt}_{i}= italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_W start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT + italic_h start_POSTSUPERSCRIPT italic_u italic_t italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (12)
hdescsuperscript𝑑𝑒𝑠𝑐\displaystyle h^{desc}italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT ={hjdesc}0j<|𝒮|absentsubscriptsubscriptsuperscript𝑑𝑒𝑠𝑐𝑗0𝑗𝒮\displaystyle=\{h^{desc}_{j}\}_{0\leq j<|\mathcal{S}|}= { italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 0 ≤ italic_j < | caligraphic_S | end_POSTSUBSCRIPT (13)
hispeakersubscriptsuperscript𝑠𝑝𝑒𝑎𝑘𝑒𝑟𝑖\displaystyle h^{speaker}_{i}italic_h start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =Attn(hifusion,hdesc,hdesc,𝟎)absentAttnsuperscriptsubscript𝑖𝑓𝑢𝑠𝑖𝑜𝑛superscript𝑑𝑒𝑠𝑐superscript𝑑𝑒𝑠𝑐0\displaystyle=\mathrm{Attn}(h_{i}^{fusion},h^{desc},h^{desc},\mathbf{0})= roman_Attn ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_u italic_s italic_i italic_o italic_n end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT , italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT , bold_0 ) (14)

We first compute a fusion vector (hfusionsuperscript𝑓𝑢𝑠𝑖𝑜𝑛h^{fusion}italic_h start_POSTSUPERSCRIPT italic_f italic_u italic_s italic_i italic_o italic_n end_POSTSUPERSCRIPT) between the utterance and respective speaker description vectors. Then we collect all the speaker description vectors (hdescsuperscript𝑑𝑒𝑠𝑐h^{desc}italic_h start_POSTSUPERSCRIPT italic_d italic_e italic_s italic_c end_POSTSUPERSCRIPT) and use the attention mechanism to model the relationship between the utterance and all speakers in a conversation. Finally, the speaker features are embedded in this vector, hispeakersubscriptsuperscript𝑠𝑝𝑒𝑎𝑘𝑒𝑟𝑖h^{speaker}_{i}italic_h start_POSTSUPERSCRIPT italic_s italic_p italic_e italic_a italic_k italic_e italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and are replaced using Equation 7 in the baseline system.

3.5 LLM-based BiosERC + instruction fine-tuning (ft LLM)

Since the robust natural language understanding capabilities of LLMs [15], we provide the speaker description as part of the text prompting input for the model (highlighted in blue in Table 2) instead of modifying model architecture. We follow the instruction fine-tuning approach [27], with causal language modeling objective to train an LLM to generate emotional label text (highlighted in red in Table 2):

x𝑥\displaystyle xitalic_x =prompting(ui,sj,dj,𝒞,ei)absentpromptingsubscript𝑢𝑖subscript𝑠𝑗subscript𝑑𝑗𝒞subscript𝑒𝑖\displaystyle=\textrm{prompting}(u_{i},s_{j},d_{j},\mathcal{C},e_{i})= prompting ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , caligraphic_C , italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (15)
IP(x)IP𝑥\displaystyle\text{I\kern-1.49994ptP}(x)IP ( italic_x ) =Πz=1|x|IP(xz|x0,x1,,xz1)absentsuperscriptsubscriptΠ𝑧1𝑥IPconditionalsubscript𝑥𝑧subscript𝑥0subscript𝑥1subscript𝑥𝑧1\displaystyle=\Pi_{z=1}^{|x|}\text{I\kern-1.49994ptP}(x_{z}|x_{0},x_{1},...,x_% {z-1})= roman_Π start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | italic_x | end_POSTSUPERSCRIPT IP ( italic_x start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_z - 1 end_POSTSUBSCRIPT ) (16)

where x,z𝑥𝑧x,zitalic_x , italic_z is a sequence of tokens and the token’s index in prompting input (Table 2), respectively. Additionally, we utilize LoRA [28], a lightweight training technique, to reduce the number of trainable parameters. The instruction fine-tuned LLM learned the distribution of emotional labels given prompting input (x𝑥xitalic_x). During the inferring phase, the emotional label (eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) which is omitted from prompting input, is left to be generated by the fine-tuned LLM.

Table 2: Prompting input template using speaker description and content of conversation for fine-tuning LLMs.
system
### You are an expert at analyzing the emotion of utterances among speakers in a conversation.
### Given the characteristic of this speaker, {speaker name sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT}: {speaker description djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT}
### Given the following conversation as a context {conversation 𝒞𝒞\mathcal{C}caligraphic_C}
user
Based on the above conversation and characteristics of the speakers, which emotional label of {sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT} in the utterance {utterance uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT} ?
assistant
{emotional label of uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in text: eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT}

4 Experimental Setting

Datasets

We conducted evaluations on three ERC benchmark datasets in text-only version: IEMOCAP [29], involving daily conversations between pairs with ten different speakers; MELD [30], derived from TV shows and featuring multiparty conversations; EmoryNLP [31], another multiparty daily dialogue dataset sourced from TV shows. The statistical information of these datasets is shown in Table 3. In accordance with prior works [16, 12], we employed the Weighted-F1 score as the evaluation metric to maintain compatibility.

Table 3: Statistical information on all ERC datasets.
Dataset #dialogues #utterances #speaker
train dev test train dev test
IEMOCAP 108 12 31 5,163 647 1,623 2.00
EmoryNLP 659 89 79 7,551 954 984 3.34
MELD 1,039 114 280 9,989 1,109 2,610 2.72

Implementation Details

Since the recent successful applications and advancing capabilities of pre-trained LLMs [15, 14], we leverage LLama-2 model to procure personality descriptions for each participant in the conversation. Specifically, we verify the effectiveness of speaker description information on two aforementioned pre-trained LMs: the BERT-based model with roberta-large and the transformer-based decoder-only LLM model with Llama-2-13b. The best model is determined based on the development set of each dataset and employed to evaluate the test set. For fine-tuning BERT-based BiosERC (section 3.4), the hyper-parameters were selected as follows: the learning rate is selected from {1e5;5e6}1superscript𝑒55superscript𝑒6\{1e^{-5};5e^{-6}\}{ 1 italic_e start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT ; 5 italic_e start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT }; the dropout value is 0.20.20.20.2, and number epochs is 30303030; and the local context window size (w𝑤witalic_w) is chosen in {2,4}24\{2,4\}{ 2 , 4 }; we report the average scores obtained across 10 independent runs. For fine-tuning LLM-based BiosERC (section 3.5), learning rate is selected from {2e4;3e4}2superscript𝑒43superscript𝑒4\{2e^{-4};3e^{-4}\}{ 2 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT ; 3 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT }, number epochs is 3333; we report the average scores obtained across 5 independent runs because of the computation cost. All the source code of this project is published at MASKED_LINK.

5 Results and Analysis

5.1 Main results

Our approach demonstrated competitive performance compared to recent SOTA methods on three famous benchmark datasets (Table 4) on both two architectures BERT-based and transformer-based decoder-only LLM model.

Table 4: Performance comparison between our proposed method and previous works on the test sets. Column #T.Params. refers to the number of trainable parameters. The notations \ddagger, \dagger indicate the significant difference (t-test) with the baseline in levels p<0.01𝑝0.01p<0.01italic_p < 0.01 and p<0.05𝑝0.05p<0.05italic_p < 0.05, separately.
Methods #T.Params. IEMOCAP EmoryNLP MELD
HiTrans [32] 64.50 36.75 61.94
DAG [12] 68.03 39.02 63.65
DialogXL [33] 65.94 34.73 62.14
DialogueEIN [6] 68.93 38.92 65.37
SGED + DAG-ERC [10] 68.53 40.24 65.46
S+PAGE [24] 68.93 40.05 64.67
InstructERC [22] +(ft LLM) 71.39 41.39 69.15
Intra/inter ERC (baseline) [13] 189×106189superscript106189\times 10^{6}189 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 67.65 39.33 64.58
BiosERCBERT-basedBERT-based{}_{\textit{{BERT-based}}}start_FLOATSUBSCRIPT BERT-based end_FLOATSUBSCRIPT 186×106186superscript106186\times 10^{6}186 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 67.79 39.89 65.51
BiosERC +ft LLMLlama-2-7bLlama-2-7b{}_{\texttt{Llama-2-7b}}start_FLOATSUBSCRIPT Llama-2-7b end_FLOATSUBSCRIPT 80×10680superscript10680\times 10^{6}80 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 69.02 41.44 68.72
BiosERC +ft LLMLlama-2-13bLlama-2-13b{}_{\texttt{Llama-2-13b}}start_FLOATSUBSCRIPT Llama-2-13b end_FLOATSUBSCRIPT 125×106125superscript106125\times 10^{6}125 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 71.19 41.68 69.83

In comparison with the previous speaker-based methods (SGED + DAG-ERC [10], S+PAGE [24] and DialogueEIN [6]) experimental results demonstrated the effectiveness of our proposed approach and further affirm that speaker modeling by speaker descriptions are superior to the information offered by intra- and inter-speaker contexts. In addition, our BiosERC model achieved significant differences with the baseline system on both the EmoryNLP and MELD datasets, which is shown clearly in Figure 3. Because the MELD and EmoryNLP are multiparty conversation datasets (the average number of interlocutors are 2.72 and 3.34, respectively), the emotions are influenced more by different speaker personalities in a conversation than the IEMOCAP dataset.

In the previous method of fine-tuning an LLM, InstructERC [22] considers speaker identifier as an auxiliary task, requiring two-stage training, which is more time-consuming than ours in the training process. Besides, our proposed method uses speaker descriptions generated by LLM in natural language, which can be easier incorporated with humans using our system for customization (e.g., customer support staff can directly provide or modify characteristics generated by LLM of their customers). Similar to BERT-based BiosERC, among three datasets, the LLM-based BiosERC shows the strengthens of multi-party datasets (more than two speakers in each conversation), EmoryNLP and MELD. By fine-tuning an LLM, Llama-2-13b, the performance of our BiosERC increased by 1-4% weighted F1 scores compared to BERT-based models and achieved new SOTA performance on EmoryNLP and MELD datasets. Besides, since utilizing a lightweight training technique, LoRA [28], the number of training parameters in LLM-based BiosERC was smaller than BERT-based BiosERC (only fine-tuned on two last layers) which proved the potential of LLM-based BiosERC in the real application.

60606060626262626464646466666666Training stepsWeighted-F1BiosERCIntra-Inter ERC
Figure 3: Performance comparison between our BERT-based BiosERC and the baseline model (MELD dev set), illustrating the performance variability across 10 random runs.

5.2 Ablation study

We conducted an ablation study to evaluate the effectiveness of integrating speaker biographies into the broader system encompassing various aspects.

BiosERC architecture.

As shown in Table 5, it is apparent that our BERT-based BiosERC (row 3), which incorporates the speaker’s descriptions, exhibits significant advantages in F1 score, outperforming the baseline system that relies on intra/inter- speaker relationships. Besides, by using the attention mechanism to encode the speaker’s biography (row 2), BiosERC achieved high performance and it also clearly outperformed the baseline model. Moreover, in the setting of BiosERC +fine-tuning LLM (row 8), when removing the speaker description (the blue part in Table 2) from the input prompting (row 6), the performance significantly decreased by 1.05 F1 score. By fine-tuning the different LLM models, Llama-2-13b and Llama-2-7b, the performance is slightly decreased with 0.520.520.520.52 F1 score (rows 7, 8). These results proved the importance of the speaker’s biography information and the efficacy of our proposed approach for speaker modeling.

Table 5: Performance comparison among variants of BiosERC on the MELD development set.
Methods LLMs extracting bio. Weighted-F1
1. Intra/inter ERC (baseline) - 66.0866.0866.0866.08(-1.19)
\cdashline1-3 2. BiosERC injecting bio. by attention Llama-2-chat-70b 66.71superscript66.7166.71^{\dagger}66.71 start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT(-0.56)
3. BiosERC Llama-2-chat-70b 67.27superscript67.27\mathbf{67.27}^{\ddagger}bold_67.27 start_POSTSUPERSCRIPT ‡ end_POSTSUPERSCRIPT
4. BiosERC Llama-2-chat-7b 67.23superscript67.2367.23^{\ddagger}67.23 start_POSTSUPERSCRIPT ‡ end_POSTSUPERSCRIPT(-0.04)
5. BiosERC vicuna-33b-v1.3 66.96superscript66.9666.96^{\ddagger}66.96 start_POSTSUPERSCRIPT ‡ end_POSTSUPERSCRIPT(-0.32)
6. BiosERC +ft LLMLlama-2-13bLlama-2-13b{}_{\texttt{Llama-2-13b}}start_FLOATSUBSCRIPT Llama-2-13b end_FLOATSUBSCRIPT w/o speaker bio. - 69.17(1.05)subscript69.171.0569.17_{(-1.05)}69.17 start_POSTSUBSCRIPT ( - 1.05 ) end_POSTSUBSCRIPT
\cdashline1-3 7. BiosERC +ft LLMLlama-2-7bLlama-2-7b{}_{\texttt{Llama-2-7b}}start_FLOATSUBSCRIPT Llama-2-7b end_FLOATSUBSCRIPT Llama-2-chat-70b 69.70(0.52)subscript69.700.5269.70_{(-0.52)}69.70 start_POSTSUBSCRIPT ( - 0.52 ) end_POSTSUBSCRIPT
8. BiosERC +ft LLMLlama-2-13bLlama-2-13b{}_{\texttt{Llama-2-13b}}start_FLOATSUBSCRIPT Llama-2-13b end_FLOATSUBSCRIPT Llama-2-chat-70b 70.22superscript70.22\mathbf{70.22^{\dagger}}bold_70.22 start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT

Speaker biographies.

We explored various currently popular LLMs for generating speaker biographies, including LLama-2-chat-70b, Llama-2-chat-7b [15], and vicuna-33b-v1.3 [34]. Among these, LLama-2-chat-70b yielded the best outcomes. Based on observation, we found that the Vicuna model failed to provide speaker descriptions in some “extra difficult” cases, such as when the conversation length is too short (e.g., less than three utterances) or when the specific speaker has extremely short utterances (e.g., Hmm). These solid improvements worked on the diverse biographies generated by various LLMs underscore the versatility and effectiveness of extracting “speaker biographies”, demonstrating that the LLMs framework can be highly beneficial for biography generation and assisting with ERC tasks.

5.3 Conversation length

We analyzed the MELD development set to evaluate the impact of conversation length on the performance as depicted in Figure 4. Overall, our method outperforms intra- and inter-speaker methods across conversations of varying lengths. It is worth noting that the performance of short dialogues (conversation length less than 15) improves significantly more than that of long dialogues. These results also proved the importance of “speaker characteristic” in short conversations lacking contextual information. When contextual information is limited, speaker characteristics are based on the speaker’s lexical choices, which contain explicit or implicit meaning in the sentence. An LLM can extract the speaker’s characteristics by recognizing the explicit or implicit meaning conveyed in these statements. In addition, MELD is a multiparty dataset, which contains many conversations involving more than three speakers. Based on our observations on the improvement examples, “speaker characteristic” is especially important in short conversations which lack much contextual information.

5.4 Case study

Our model enhances emotion recognition accuracy, even in short conversations with limited contextual information. As show in Table 6, conversation 1041 is a short dialogue consisting of only five sentences. Our model perceives two speaker descriptions, these descriptions facilitate a more accurate identification of SPEAKER_0’s discourse, leaning towards positivity rather than anger. In addition, our architecture shows an improved capacity in predicting emotions in shorter sentences through the utilization of speaker description, particularly in cases where traditional models struggle due to the minimal information contained in expressions such as “Yeah”, or “Okay”.

Additionally, our approach adeptly copes with scenarios where the error rate is high at the beginning of the conversation as show in Table 6 (u0subscript𝑢0u_{0}italic_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, u1subscript𝑢1u_{1}italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, u2subscript𝑢2u_{2}italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT in conversation 1061). Because contextual and speaker information is lacking at the outset of the conversation, the baseline model consistently produces incorrect initial sentences. However, with the assistance of our “speaker description”, it delivers a better performance from the beginning of the dialogue. Experimental results prove the effectiveness and versatility of our approach across diverse conversations, including those with complex contextual information.

[0-5)[5-10)[10-15)[15-20)[20-)6565656570707070#ConversationWeighted-F1BiosERCIntra-Inter ERC
Figure 4: Performance comparison respect to length of conversation (number of utterance) on the MELD development set (variability across 10 random runs).
Table 6: Case study of improvement examples collected in the MELD dataset. The red and green labels refer to the incorrect and correct prediction of the models, respectively.
Idx Conversation 1041 Label BiosERC Baseline
Speaker_0 Speaker_1
d0 SPEAKER_0 in the conversation comes across as someone who is confident, friendly .. to create a relaxed atmosphere
\cdashline1-7 d1 SPEAKER_1 in the conversation comes across as a friendly .. have a strong sense of loyalty and trust in their relationships…
u0 Hey Estelle,listen neutral neutral neutral
u1 Well! Well! Well! Joey Tribbiani! So you came back huh? surprise surprise joy
u2 What are you talking about? I never left you! You’ve always been my agent! surprise surprise anger
u3 Really?! surprise surprise surprise
u4 Yeah! joy joy anger
Idx Conversation 1061 Label BiosERC Baseline
Speaker_0 Speaker_1 Speaker_2
d0 SPEAKER_0 seems to be a very inquisitive and curious person. … SPEAKER_0 appears to be quite blunt and direct in his communication style, not mincing words or sugarcoating his thoughts.
\cdashline1-9 d1 SPEAKER_1 seems to be a humorous and light-hearted person. … SPEAKER_1 shares that they have only been with one person in their whole life, and this is met with surprise and disbelief by the other …
\cdashline1-9 d2 SPEAKER_2 seems to be a humorous and light-hearted person… SPEAKER_2 is someone who enjoys having fun and is not afraid to poke fun at themselves or others…
u0 Well, what? neutral neutral surprise
u1 What? neutral neutral surprise
u2 What is it? neutral neutral sadness
u3 That she left you? surprise surprise sadness
u4 That she likes women? neutral sadness sadness
u5 That she left you for another woman that likes women? neutral surprise sadness
u6 Little louder, okay, I think there’s a man on the twelfth floor in a coma that didn’t quite hear you. anger neutral anger
u11 With Carol? Oh. surprise surprise neutral
u12 So in your whole life, you’ve only been with one oh. surprise neutral neutral
u13 Whoah, boy, hockey was a big mistake! There was a whole bunch of stuff we could’ve done tonight! surprise surprise joy

6 Limitation

In this work, we introduce a novel method for modeling speaker characteristics based on biographical information of interlocutors in a conversation, generated by a large language model (LLM). In terms of computation time, our BiosERC method requires additional computing resources for the inference of the LLM compared to methods like Intra-inter ERC, which utilize hidden speaker identity information. Additionally, for the scope of this paper, we have not addressed issues related to human privacy data. In realistic applications, access to conversation history should be granted and clarified by the data owner. However, we believe that, with appropriate agreements to protect users’ privacy data, it is possible to obtain this permission.

7 Conclusion

In conclusion, we proposed a novel mechanism incorporating speakers’ characteristics into the ERC task, which has not been fully developed in prior research. We improve the performance of the ERC task by investigating the influence of the personality of interlocutors on emotions, considering this external knowledge as a unique feature. Our experiments on three benchmark datasets consistently yielded SOTA or competitive results, thereby substantiating the effectiveness of our proposed method. Furthermore, our model is straightforward yet highly adaptable, thus enabling its applicability to a wide range of conversation analysis tasks.

Acknowledgement.

This work is supported partly by AOARD grant FA23862214039.

References

  • [1] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds.   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 154–164.
  • [2] D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, and G. Zhou, “Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations.” in IJCAI, 2019, pp. 5415–5421.
  • [3] G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “UniMSE: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.   Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 7837–7851.
  • [4] X. Shi, X. Li, and T. Toda, “Emotion Awareness in Multi-utterance Turn for Improving Emotion Prediction in Multi-Speaker Conversation,” in Proc. INTERSPEECH 2023, 2023, pp. 765–769.
  • [5] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “Hitrans: A transformer-based context-and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4190–4200.
  • [6] Y. Liu, J. Zhao, J. Hu, R. Li, and Q. Jin, “DialogueEIN: Emotion interaction network for dialogue affective analysis,” in Proceedings of the 29th International Conference on Computational Linguistics.   Gyeongju, Republic of Korea: International Committee on Computational Linguistics, Oct. 2022, pp. 684–693.
  • [7] J. Li, Z. Lin, P. Fu, and W. Wang, “Past, present, and future: Conversational emotion recognition through structural modeling of psychological knowledge,” in Findings of the association for computational linguistics: EMNLP 2021, 2021, pp. 1204–1214.
  • [8] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 6818–6825.
  • [9] J. Lee and W. Lee, “CoMPM: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 5669–5679.
  • [10] Y. Bao, Q. Ma, L. Wei, W. Zhou, and S. Hu, “Speaker-guided encoder-decoder framework for emotion recognition in conversation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, L. D. Raedt, Ed.   International Joint Conferences on Artificial Intelligence Organization, 7 2022, pp. 4051–4057, main Track.
  • [11] S. Poria, N. Majumder, D. Hazarika, D. Ghosal, R. Bhardwaj, S. Y. B. Jian, P. Hong, R. Ghosh, A. Roy, N. Chhaya et al., “Recognizing emotion cause in conversations,” Cognitive Computation, vol. 13, pp. 1317–1332, 2021.
  • [12] W. Shen, S. Wu, Y. Yang, and X. Quan, “Directed acyclic graph network for conversational emotion recognition,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).   Online: Association for Computational Linguistics, Aug. 2021, pp. 1551–1560.
  • [13] X. Jieying, N. Phuong, M. Blake, and N. Le Minh, “Accumulating word representations in multi-level context integration for erc task,” in 2023 15th International Conference on Knowledge and Systems Engineering (KSE), 2023, pp. 1–6.
  • [14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
  • [15] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [16] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria, “COSMIC: COmmonSense knowledge for eMotion identification in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2020.   Online: Association for Computational Linguistics, Nov. 2020, pp. 2470–2481.
  • [17] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “DailyDialog: A manually labelled multi-turn dialogue dataset,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers).   Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017, pp. 986–995.
  • [18] B. Lee and Y. S. Choi, “Graph based network with contextualized representations of turns in dialogue,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.   Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 443–455.
  • [19] W. Li, L. Zhu, R. Mao, and E. Cambria, “Skier: A symbolic knowledge integrated model for conversational emotion recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, pp. 13 121–13 129, Jun. 2023.
  • [20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
  • [21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [22] S. Lei, G. Dong, X. Wang, K. Wang, and S. Wang, “Instructerc: Reforming emotion recognition in conversation with a retrieval multi-task llms framework,” arXiv preprint arXiv:2309.11911, 2023.
  • [23] G. Tu, B. Liang, B. Qin, K.-F. Wong, and R. Xu, “An empirical study on multiple knowledge from ChatGPT for emotion recognition in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds.   Singapore: Association for Computational Linguistics, Dec. 2023, pp. 12 160–12 173.
  • [24] C. Liang, J. Xu, Y. Lin, C. Yang, and Y. Wang, “S+PAGE: A speaker and position-aware graph neural network model for emotion recognition in conversation,” in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).   Online only: Association for Computational Linguistics, Nov. 2022, pp. 148–157.
  • [25] T. Kim and P. Vossen, “Emoberta: Speaker-aware emotion recognition in conversation with roberta,” CoRR, vol. abs/2108.12009, 2021.
  • [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
  • [27] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” 2022.
  • [28] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
  • [29] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, pp. 335–359, 2008.
  • [30] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” arXiv preprint arXiv:1810.02508, 2018.
  • [31] S. M. Zahiri and J. D. Choi, “Emotion detection on tv show transcripts with sequence-based convolutional neural networks,” in Workshops at the thirty-second aaai conference on artificial intelligence, 2018.
  • [32] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics.   Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 4190–4200.
  • [33] W. Shen, J. Chen, X. Quan, and Z. Xie, “Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, pp. 13 789–13 797, 5 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/17625
  • [34] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” ArXiv, vol. abs/2306.05685, 2023.