¹¹institutetext: Japan Advanced Institute of Science and Technology
¹¹email: {xuejieying,phuongnm,matheny.blake,nguyenml}@jaist.ac.jp

BiosERC: Integrating Biography Speakers Supported by LLMs for ERC Tasks

Jieying Xue Minh-Phuong Nguyen Blake Matheny Le-Minh Nguyen

Abstract

In the Emotion Recognition in Conversation task, recent investigations have utilized attention mechanisms exploring relationships among utterances from intra- and inter-speakers for modeling emotional interaction between them. However, attributes such as speaker personality traits remain unexplored and present challenges in terms of their applicability to other tasks or compatibility with diverse model architectures. Therefore, this work introduces a novel framework named BiosERC, which investigates speaker characteristics in a conversation. By employing Large Language Models (LLMs), we extract the “biographical information” of the speaker within a conversation as supplementary knowledge injected into the model to classify emotional labels for each utterance. Our proposed method achieved state-of-the-art (SOTA) results on three famous benchmark datasets: IEMOCAP, MELD, and EmoryNLP, demonstrating the effectiveness and generalization of our model and showcasing its potential for adaptation to various conversation analysis tasks. Our source code is available at https://github.com/yingjie7/BiosERC.

Keywords:

speaker modeling biography of speaker in conversation emotion recognition in conversation large language models

1 Introduction

Emotion recognition in conversation (ERC) is a pivotal research topic that has garnered growing attention due to its extensive range of applications [1, 2]. In ERC tasks, the input text frequently consists of transcribed spoken dialogues from a speech recognition system, featuring colloquial or truncated statements that lack standardized grammar, thereby complicating emotional recognition in the dialogue. Unlike the traditional non-conversation sentiment analysis task, ERC emphasizes some of the many factors that influence ERC tasks, including contextual and speaker-specific information [1].

Therefore, recent approaches have inclined toward encoding acoustic features [3, 4] or contextual information [5, 6, 7] to enrich utterance vector representation. On the other hand, numerous previous works have typically utilized GRU [8, 9, 10], GNN [11, 2], or self-attention network [1, 12, 13] to encode richer speaker-specific information, including intra- and inter-speaker features. However, this latent information is predominantly learned from relationships among utterances. It poses challenges for validating its effectiveness and applying it to alternative tasks, and is problematic for other model architectures.

Refer to caption — Figure 1: Overview of our BiosERC framework

Additionally, speaker characteristics as a crucial and foundational feature in ERC tasks has not been comprehensively explored. We posit that within a dialogue, an individual’s character can significantly influence their manner of emotional expression and habitual vocabulary selection, leading to varying emotions for the same statement even when articulated by different speakers. Comprehension of interlocutors’ personality traits can thus facilitate accurately discerning their emotional inclinations within the discourse.

To tackle the aforementioned challenge, we propose BiosERC, a novel method designed to discover speakers’ personality information to enhance ERC systems. In contrast to previous methodologies relying on GRU [9, 10] or speaker-based masked attention mechanisms [1, 12, 13] to capture emotional expression features of different speakers, BiosERC stands out by precisely extracting individual personalities of speakers within dialogues (Figure 1). This uniqueness empowers the model to intricately comprehend character traits and encapsulate events of emotional transitions occurring within the characters. Moreover, our mechanism for extracting speaker characteristics is explicit and more amenable to verification and adaptation for application to various conversation analysis tasks.

Specifically, BiosERC utilizes LLMs with a prompting technique [14, 15] to extract descriptions of interlocutor features as supplementary knowledge, which are then injected into the emotion recognition process within conversations. As shown in Figure 1, this conversation involves three distinct speakers, each presenting unique perspectives and exhibiting markedly different emotional states. The speaker description facilitates the model’s thorough understanding of each speaker’s role within a conversation. Particularly, SPEAKER A is experiencing sadness and regret (as mentioned in the speaker description), resulting in expressions predominantly filled with sadness. SPEAKER B appears to be a supportive and empathetic listener, with limited involvement in the conversation, and reacts through SPEAKER A’s utterances. Meanwhile, SPEAKER C responds with excitement upon hearing their conversation. Intuitively, the integration of biographical data plays an important role in enriching the emotional background of each speaker in conversations, and holds the potential for more precise and comprehensive emotional recognition, especially in complex dialogues.

We carry out experiments on three benchmark datasets, including IEMOCAP, MELD, and EmoryNLP. The experimental results demonstrate that our method achieves SOTA performance, which indicates the effectiveness of our proposed model. Furthermore, our proposed mechanism, which uses a prompting technique for LLMs to extract the speakers’ biographical information, shows the potential to adapt to various conversation-level tasks such as opinion analysis, recommendation, and others.

2 Related Work

Emotion Recognition in Conversation.

In contrast to the conventional non-conversation sentiment analysis task, ERC demands a greater reliance on contextual and speaker-specific information for its support. For the purpose of modeling the conversational context, numerous studies employ Recurrent Neural Networks (RNNs) [16, 17] or Graph Convolution Network (GCN) [1, 18] to explore the hidden relationships between utterances. Moreover, the incorporation of contextual information and external knowledge into utterance vector representations has been notably achieved in recent works [12, 6, 16, 19] through the utilization of self-attention mechanisms and pre-trained Language Models (LM) [20, 21]. In the recent success of LLMs on various NLP tasks, InstructERC [22] is proposed to utilize the instruction prompting technique and fine-tune the LLM model for ERC tasks. MKFM framework [23] proposed the utilization of diverse supplementary knowledge information (e.g., emotional cause, topics) by ChatGPT service to inject into a graph-based model. In comparison, our work focuses on modeling speaker characteristics, a fundamental information which can be extracted by open-source LLMs (e.g, LLama-2). In addition, we also prove our proposed mechanism worked effectively when fine-tuning on both popular architectures: BERT and transformer-based decoder-only LLM.

Speaker-based ERC.

Because of the significant impact of speakers on ERC, researchers have placed emphasis on speaker modeling. DialogueRNN [8] and COSMIC [16] leverage Gated Recurrent Units (GRU) for the modeling of speaker-specific semantic context. Some researches [11, 2] treat conversations as graphs while incorporating prior speaker information as distinct relationships between utterances, or considers speakers as nodes within the graph. HiTrans [5] exploits an auxiliary task to classify whether two utterances belong to the same speaker to make the model speaker-sensitive. S+PAGE [24] employs a two-stream conversation Transformer architecture to extract both self and inter-speaker contextual features. However, the majority of prior research has predominantly concentrated on modeling individual speaker utterances or interactions among different speakers, with particular attention given to the intra- and inter-speaker aspects for the extraction of speaker-based information [6, 13]. Regrettably, limited emphasis has been placed on exploring speaker characteristics, which constitute critical and foundational elements of conversational information. Therefore, we propose a novel method named BiosERC, which employs external tools to extract speaker characteristics and inject them into the process of emotion recognition within conversations.

3 Methodology

This section introduces our baseline model architecture for the ERC task, which utilizes intra- and inter-speaker information following current SOTA methods [24, 6, 10], and our proposed method BiosERC, which incorporates the biography of the speakers into an ERC model. Formally, we define a conversation as: $\mathcal{C}=\{u_{i}\}_{0\leq i<|\mathcal{C}|}$ , where each individual utterance $u_{i}$ is articulated by speaker $p(u_{i})\in\mathcal{S}$ , with $\mathcal{S}=\{s_{j}\}_{0\leq j<|\mathcal{S}|}$ representing the set of speakers in the conversation. Here, $p$ denotes a mapping function that associates utterances with their respective speakers.

3.1 Intra-inter ERC (baseline)

Based on recent SOTA methods in the ERC task [24, 10, 13], we implement our baseline model consisting of three principal components: utterance vector representation, context modeling, and an emotion classification layer.

Utterance Vector Representation.

To enrich meaning representations, we follow an approach that mixes the surrounding utterances within a fixed-window size [25, 9, 19, 13]. Particularly, to encode a sentence $u_{i}$ , the text input is combined by surrounding the utterance according to the following template: “[cls], $u_{i-w}$ , .., </s>, $u_{i}$ , </s>, ... $u_{i+w}$ ”, where $w$ is the local contextual window size hyperparameter. The utterance vector is computed by aggregating the respective word vectors following [13]:

	$\displaystyle h^{cls},h^{words}$	$\displaystyle=\mathrm{RoBERTa}([u_{i-w},..,u_{i+w}])$		(1)
	$\displaystyle h^{utt}$	$\displaystyle=[\mathrm{tanh}(\mathrm{average}(h^{\mathrm{words\,of\,u_{i}}})% \cdot W^{u})]_{0\leq i<\|\mathcal{C}\|}$		(2)

where $h^{\mathrm{words\,of\,u_{i}}}$ denotes the word vectors selected from $h^{words}$ at the positions of the utterance $u_{i}$ ; $h^{utt}$ is all utterance vectors in a conversation; and $W^{*}$ refers to learnable weights.

Context Modeling.

Utterance vectors are integrated contextual information of whole conversation by attention mechanism:

$\displaystyle\mathrm{Attn}(q,k,v,M)$	$\displaystyle=\mathrm{softmax}(\frac{q\cdot k^{\intercal}}{\sqrt{d_{t}}}+M)\cdot v$	(3)
$\displaystyle{q}_{t},{k}_{t},{v}_{t}$	$\displaystyle={h^{utt}}{W}_{t}^{q},{h^{utt}}{W}_{t}^{k},{h^{utt}}{W}_{t}^{v}$	(4)
$\displaystyle{head}_{t}$	$\displaystyle=\mathrm{Attn}({q}_{t},{k}_{t},{v}_{t},M)$	(5)
$\displaystyle{h}_{\mathrm{MultiHead}}$	$\displaystyle=\mathrm{concat}([{head}_{t\|0<t\leq H}]){W}^{o}$	(6)

where $H$ is the number of heads in the MultiHead attention layer; ${q}_{t},{k}_{t},{v}_{t}$ are utterance vectors in various semantic space (dimension size $d_{t}$ ). In detail, following [6, 13], we construct the relation matrices ( $M$ ) for modeling relationship among utterances, where $M_{ik}=0$ if $u_{i}$ and $u_{k}$ should have interaction, $M_{ik}=-\infty$ if otherwise. For the baseline model, we implement three different relationships: global context (all utterance pairs are connected), intra-speaker (only utterance pairs of the same speaker are connected), and inter-speaker (only utterance pairs of the different speaker are connected). Consequently, we acquire three new hidden states (from Equation 6) $h^{contxt}$ , $h^{intra}$ , $h^{inter}$ feed-forward to the Classification component.

Classification.

This component aims to integrate all the hidden features of utterances to classify the emotion label.

	$\displaystyle h^{speaker}_{i}$	$\displaystyle={h^{intra}_{i}}{W}^{a}+{h^{inter}_{i}}{W}^{r}$		(7)
	$\displaystyle e^{o}_{i}$	$\displaystyle={\mathrm{softmax}(h^{utt}_{i}}{W}^{u}+{h_{i}^{contxt}}{W}^{g}+h^% {speaker}_{i})$		(8)

Then, the emotion vector ( $e^{o}_{i}$ ) is used to compute the loss via Cross-Entropy function and is trained based on the gold emotional label of the $i$ -th utterance.

3.2 Bios ERC

In this section, we describe the process of generating the speaker’s biography and present our BiosERC framework, leveraging two popular pre-trained LM as backbones: a BERT-based model [21] (e.g., RoBERTa) and a transformer-based decoder-only LLM model [15] (e.g., Llama-2). Notably, we also introduce an effective mechanism using the biography of speakers incorporating fine-tuning a LLM-based [26] with the prompting technique.

3.3 Biography of Speaker

In this part, we introduce a mechanism using the prompting technique for the LLMs to generate the description ( $d_{j}$ ) for the respective speaker ( $u_{j}$ ). Given a conversation $\mathcal{C}$ , the output of this step is the biography (description) of all speakers in a conversation $\mathcal{B}=\{d_{j}\}_{0\leq j<|\mathcal{S}|}$ .

\displaystyle d_{j}=\mathrm{LLMs}(\mathrm{prompting}(\mathcal{C},s_{j}))

(9)

LLMs refers to large language models such as Llama2 [15], which can generalize a speaker’s biography based on their conversation. The prompting function is a template containing two conversation instances ( $\mathcal{C}$ ) and speaker identification ( $s_{j}$ ) to exploit the knowledge of the LLMs (Table 1). To avoid long plain text descriptions, we instruct the LLMs to limit the output by adding a “note” concerning the length of the prompting template. Consequently, we obtain additional data about the persona of the speakers in each conversation ( $\mathcal{B}$ ), which is utilized for speaker modeling in the subsequent step.

Table 1: Prompting template to extract the description of characteristics of the speaker from a conversation with LLMs.

Given this conversation between speakers:

{conversation content

\mathcal{C}

}

In overall above conversation, what do you think about the characteristics of speaker {speaker identification

s_{j}

}? (Note: provide an answer within 250 words)

3.4 BERT-based BiosERC architecture

Firstly, we encode the speaker’s description using a pre-trained language model to acquire hidden vector representation ( $h^{desc}_{j}$ ).

\displaystyle h^{desc}_{j}

\displaystyle=\mathrm{RoBERTa}(d_{j})[0]

(10)

where $j$ is the speaker index in the set of speakers in a conversation, $0\leq j<|\mathcal{S}|$ . Our proposed method, BiosERC, extends the baseline model and redefines the speaker’s hidden vector representation ( $h_{i}^{speaker}$ in Equation 7) (Figure 2).

This architecture is designed with a straightforward target that injects the personality information of each speaker into their corresponding utterances by a multi-layer perceptron network. Then, the speaker information in Equation 7 is replaced by:

\displaystyle h_{i}^{speaker}

\displaystyle=h^{desc}_{p(u_{i})}{W}^{desc}+b^{desc}

(11)

where $p(u_{i})$ denotes the corresponding speaker of utterance $u_{i}$ . Through this mechanism, all the utterances from the same speaker are shared in the unified speaker vector representation, while the weights are updated in the training process. Finally, the utterance vector is fused with the speaker vector which supports emotional classification.

BiosERC - biography injected by attention mechanism.

We consider a variant of our BiosERC model, which is engineered to dynamically incorporate the speaker’s information into each utterance via the attention mechanism. The relationship between the current utterance and all individual speakers is integrated to enrich the utterance vector representation.

$\displaystyle h_{i}^{fusion}$	$\displaystyle=h^{desc}_{p(u_{i})}{W}^{p}+h^{utt}_{i}$	(12)
$\displaystyle h^{desc}$	$\displaystyle=\{h^{desc}_{j}\}_{0\leq j<\|\mathcal{S}\|}$	(13)
$\displaystyle h^{speaker}_{i}$	$\displaystyle=\mathrm{Attn}(h_{i}^{fusion},h^{desc},h^{desc},\mathbf{0})$	(14)

We first compute a fusion vector ( $h^{fusion}$ ) between the utterance and respective speaker description vectors. Then we collect all the speaker description vectors ( $h^{desc}$ ) and use the attention mechanism to model the relationship between the utterance and all speakers in a conversation. Finally, the speaker features are embedded in this vector, $h^{speaker}_{i}$ , and are replaced using Equation 7 in the baseline system.

3.5 LLM-based BiosERC + instruction fine-tuning (ft LLM)

Since the robust natural language understanding capabilities of LLMs [15], we provide the speaker description as part of the text prompting input for the model (highlighted in blue in Table 2) instead of modifying model architecture. We follow the instruction fine-tuning approach [27], with causal language modeling objective to train an LLM to generate emotional label text (highlighted in red in Table 2):

	$\displaystyle x$	$\displaystyle=\textrm{prompting}(u_{i},s_{j},d_{j},\mathcal{C},e_{i})$		(15)
	$\displaystyle\text{I\kern-1.49994ptP}(x)$	$\displaystyle=\Pi_{z=1}^{\|x\|}\text{I\kern-1.49994ptP}(x_{z}\|x_{0},x_{1},...,x_% {z-1})$		(16)

where $x,z$ is a sequence of tokens and the token’s index in prompting input (Table 2), respectively. Additionally, we utilize LoRA [28], a lightweight training technique, to reduce the number of trainable parameters. The instruction fine-tuned LLM learned the distribution of emotional labels given prompting input ( $x$ ). During the inferring phase, the emotional label ( $e_{i}$ ) which is omitted from prompting input, is left to be generated by the fine-tuned LLM.

Table 2: Prompting input template using speaker description and content of conversation for fine-tuning LLMs.

system

### You are an expert at analyzing the emotion of utterances among speakers in a conversation.

### Given the characteristic of this speaker, {speaker name

s_{j}

}: {speaker description

d_{j}

}

### Given the following conversation as a context {conversation

\mathcal{C}

}

user

Based on the above conversation and characteristics of the speakers, which emotional label of {

s_{j}

} in the utterance {utterance

u_{i}

} ?

assistant

{emotional label of

u_{i}

in text:

e_{i}

}

4 Experimental Setting

Datasets

We conducted evaluations on three ERC benchmark datasets in text-only version: IEMOCAP [29], involving daily conversations between pairs with ten different speakers; MELD [30], derived from TV shows and featuring multiparty conversations; EmoryNLP [31], another multiparty daily dialogue dataset sourced from TV shows. The statistical information of these datasets is shown in Table 3. In accordance with prior works [16, 12], we employed the Weighted-F1 score as the evaluation metric to maintain compatibility.

Table 3: Statistical information on all ERC datasets.

Dataset	#dialogues			#utterances			#speaker
Dataset	train	dev	test	train	dev	test	#speaker
IEMOCAP	108	12	31	5,163	647	1,623	2.00
EmoryNLP	659	89	79	7,551	954	984	3.34
MELD	1,039	114	280	9,989	1,109	2,610	2.72

Implementation Details

Since the recent successful applications and advancing capabilities of pre-trained LLMs [15, 14], we leverage LLama-2 model to procure personality descriptions for each participant in the conversation. Specifically, we verify the effectiveness of speaker description information on two aforementioned pre-trained LMs: the BERT-based model with roberta-large and the transformer-based decoder-only LLM model with Llama-2-13b. The best model is determined based on the development set of each dataset and employed to evaluate the test set. For fine-tuning BERT-based BiosERC (section 3.4), the hyper-parameters were selected as follows: the learning rate is selected from $\{1e^{-5};5e^{-6}\}$ ; the dropout value is $0.2$ , and number epochs is $30$ ; and the local context window size ( $w$ ) is chosen in $\{2,4\}$ ; we report the average scores obtained across 10 independent runs. For fine-tuning LLM-based BiosERC (section 3.5), learning rate is selected from $\{2e^{-4};3e^{-4}\}$ , number epochs is $3$ ; we report the average scores obtained across 5 independent runs because of the computation cost. All the source code of this project is published at MASKED_LINK.

5 Results and Analysis

5.1 Main results

Our approach demonstrated competitive performance compared to recent SOTA methods on three famous benchmark datasets (Table 4) on both two architectures BERT-based and transformer-based decoder-only LLM model.

Table 4: Performance comparison between our proposed method and previous works on the test sets. Column #T.Params. refers to the number of trainable parameters. The notations

\ddagger

\dagger

indicate the significant difference (t-test) with the baseline in levels

p<0.01

and

p<0.05

, separately.

Methods	#T.Params.	IEMOCAP	EmoryNLP	MELD
HiTrans [32]		64.50	36.75	61.94
DAG [12]		68.03	39.02	63.65
DialogXL [33]		65.94	34.73	62.14
DialogueEIN [6]		68.93	38.92	65.37
SGED + DAG-ERC [10]		68.53	40.24	65.46
S+PAGE [24]		68.93	40.05	64.67
InstructERC [22] +(ft LLM)		71.39	41.39	69.15
Intra/inter ERC (baseline) [13]	$189\times 10^{6}$	67.65	39.33	64.58
BiosERC ${}_{\textit{{BERT-based}}}$	$186\times 10^{6}$	67.79	39.89^†	65.51^‡
BiosERC +ft LLM ${}_{\texttt{Llama-2-7b}}$	$80\times 10^{6}$	69.02	41.44	68.72
BiosERC +ft LLM ${}_{\texttt{Llama-2-13b}}$	$125\times 10^{6}$	71.19	41.68	69.83

In comparison with the previous speaker-based methods (SGED + DAG-ERC [10], S+PAGE [24] and DialogueEIN [6]) experimental results demonstrated the effectiveness of our proposed approach and further affirm that speaker modeling by speaker descriptions are superior to the information offered by intra- and inter-speaker contexts. In addition, our BiosERC model achieved significant differences with the baseline system on both the EmoryNLP and MELD datasets, which is shown clearly in Figure 3. Because the MELD and EmoryNLP are multiparty conversation datasets (the average number of interlocutors are 2.72 and 3.34, respectively), the emotions are influenced more by different speaker personalities in a conversation than the IEMOCAP dataset.

In the previous method of fine-tuning an LLM, InstructERC [22] considers speaker identifier as an auxiliary task, requiring two-stage training, which is more time-consuming than ours in the training process. Besides, our proposed method uses speaker descriptions generated by LLM in natural language, which can be easier incorporated with humans using our system for customization (e.g., customer support staff can directly provide or modify characteristics generated by LLM of their customers). Similar to BERT-based BiosERC, among three datasets, the LLM-based BiosERC shows the strengthens of multi-party datasets (more than two speakers in each conversation), EmoryNLP and MELD. By fine-tuning an LLM, Llama-2-13b, the performance of our BiosERC increased by 1-4% weighted F1 scores compared to BERT-based models and achieved new SOTA performance on EmoryNLP and MELD datasets. Besides, since utilizing a lightweight training technique, LoRA [28], the number of training parameters in LLM-based BiosERC was smaller than BERT-based BiosERC (only fine-tuned on two last layers) which proved the potential of LLM-based BiosERC in the real application.

Figure 3: Performance comparison between our BERT-based BiosERC and the baseline model (MELD dev set), illustrating the performance variability across 10 random runs.

5.2 Ablation study

We conducted an ablation study to evaluate the effectiveness of integrating speaker biographies into the broader system encompassing various aspects.

BiosERC architecture.

As shown in Table 5, it is apparent that our BERT-based BiosERC (row 3), which incorporates the speaker’s descriptions, exhibits significant advantages in F1 score, outperforming the baseline system that relies on intra/inter- speaker relationships. Besides, by using the attention mechanism to encode the speaker’s biography (row 2), BiosERC achieved high performance and it also clearly outperformed the baseline model. Moreover, in the setting of BiosERC +fine-tuning LLM (row 8), when removing the speaker description (the blue part in Table 2) from the input prompting (row 6), the performance significantly decreased by 1.05 F1 score. By fine-tuning the different LLM models, Llama-2-13b and Llama-2-7b, the performance is slightly decreased with $0.52$ F1 score (rows 7, 8). These results proved the importance of the speaker’s biography information and the efficacy of our proposed approach for speaker modeling.

Table 5: Performance comparison among variants of BiosERC on the MELD development set.

Methods	LLMs extracting bio.	Weighted-F1
1. Intra/inter ERC (baseline)	-	$66.08$ _(-1.19)
\cdashline1-3 2. BiosERC injecting bio. by attention	Llama-2-chat-70b	$66.71^{\dagger}$ _(-0.56)
3. BiosERC	Llama-2-chat-70b	$\mathbf{67.27}^{\ddagger}$
4. BiosERC	Llama-2-chat-7b	$67.23^{\ddagger}$ _(-0.04)
5. BiosERC	vicuna-33b-v1.3	$66.96^{\ddagger}$ _(-0.32)
6. BiosERC +ft LLM ${}_{\texttt{Llama-2-13b}}$ w/o speaker bio.	-	$69.17_{(-1.05)}$
\cdashline1-3 7. BiosERC +ft LLM ${}_{\texttt{Llama-2-7b}}$	Llama-2-chat-70b	$69.70_{(-0.52)}$
8. BiosERC +ft LLM ${}_{\texttt{Llama-2-13b}}$	Llama-2-chat-70b	$\mathbf{70.22^{\dagger}}$

Speaker biographies.

We explored various currently popular LLMs for generating speaker biographies, including LLama-2-chat-70b, Llama-2-chat-7b [15], and vicuna-33b-v1.3 [34]. Among these, LLama-2-chat-70b yielded the best outcomes. Based on observation, we found that the Vicuna model failed to provide speaker descriptions in some “extra difficult” cases, such as when the conversation length is too short (e.g., less than three utterances) or when the specific speaker has extremely short utterances (e.g., Hmm). These solid improvements worked on the diverse biographies generated by various LLMs underscore the versatility and effectiveness of extracting “speaker biographies”, demonstrating that the LLMs framework can be highly beneficial for biography generation and assisting with ERC tasks.

5.3 Conversation length

We analyzed the MELD development set to evaluate the impact of conversation length on the performance as depicted in Figure 4. Overall, our method outperforms intra- and inter-speaker methods across conversations of varying lengths. It is worth noting that the performance of short dialogues (conversation length less than 15) improves significantly more than that of long dialogues. These results also proved the importance of “speaker characteristic” in short conversations lacking contextual information. When contextual information is limited, speaker characteristics are based on the speaker’s lexical choices, which contain explicit or implicit meaning in the sentence. An LLM can extract the speaker’s characteristics by recognizing the explicit or implicit meaning conveyed in these statements. In addition, MELD is a multiparty dataset, which contains many conversations involving more than three speakers. Based on our observations on the improvement examples, “speaker characteristic” is especially important in short conversations which lack much contextual information.

5.4 Case study

Our model enhances emotion recognition accuracy, even in short conversations with limited contextual information. As show in Table 6, conversation 1041 is a short dialogue consisting of only five sentences. Our model perceives two speaker descriptions, these descriptions facilitate a more accurate identification of SPEAKER_0’s discourse, leaning towards positivity rather than anger. In addition, our architecture shows an improved capacity in predicting emotions in shorter sentences through the utilization of speaker description, particularly in cases where traditional models struggle due to the minimal information contained in expressions such as “Yeah”, or “Okay”.

Additionally, our approach adeptly copes with scenarios where the error rate is high at the beginning of the conversation as show in Table 6 ( $u_{0}$ , $u_{1}$ , $u_{2}$ in conversation 1061). Because contextual and speaker information is lacking at the outset of the conversation, the baseline model consistently produces incorrect initial sentences. However, with the assistance of our “speaker description”, it delivers a better performance from the beginning of the dialogue. Experimental results prove the effectiveness and versatility of our approach across diverse conversations, including those with complex contextual information.

Figure 4: Performance comparison respect to length of conversation (number of utterance) on the MELD development set (variability across 10 random runs).

Table 6: Case study of improvement examples collected in the MELD dataset. The red and green labels refer to the incorrect and correct prediction of the models, respectively.

Idx	Conversation 1041		Label	BiosERC	Baseline
	Speaker_0	Speaker_1
d₀	SPEAKER_0 in the conversation comes across as someone who is confident, friendly .. to create a relaxed atmosphere …
\cdashline1-7 d₁	SPEAKER_1 in the conversation comes across as a friendly .. have a strong sense of loyalty and trust in their relationships…
u0	Hey Estelle,listen		neutral	neutral	neutral
u1		Well! Well! Well! Joey Tribbiani! So you came back huh?	surprise	surprise	joy
u2	What are you talking about? I never left you! You’ve always been my agent!		surprise	surprise	anger
u3		Really?!	surprise	surprise	surprise
u4	Yeah!		joy	joy	anger

Idx	Conversation 1061				Label	BiosERC	Baseline
Idx	Speaker_0		Speaker_1	Speaker_2	Label	BiosERC	Baseline
d₀	SPEAKER_0 seems to be a very inquisitive and curious person. … SPEAKER_0 appears to be quite blunt and direct in his communication style, not mincing words or sugarcoating his thoughts.
\cdashline1-9 d₁	SPEAKER_1 seems to be a humorous and light-hearted person. … SPEAKER_1 shares that they have only been with one person in their whole life, and this is met with surprise and disbelief by the other …
\cdashline1-9 d₂	SPEAKER_2 seems to be a humorous and light-hearted person… SPEAKER_2 is someone who enjoys having fun and is not afraid to poke fun at themselves or others…
u0	Well, what?				neutral	neutral	surprise
u1	What?				neutral	neutral	surprise
u2	What is it?				neutral	neutral	sadness
u3	That she left you?				surprise	surprise	sadness
u4	That she likes women?				neutral	sadness	sadness
u5	That she left you for another woman that likes women?				neutral	surprise	sadness
u6		Little louder, okay, I think there’s a man on the twelfth floor in a coma that didn’t quite hear you.			anger	neutral	anger
…
u11			With Carol? Oh.		surprise	surprise	neutral
u12	So in your whole life, you’ve only been with one oh.				surprise	neutral	neutral
u13			Whoah, boy, hockey was a big mistake! There was a whole bunch of stuff we could’ve done tonight!		surprise	surprise	joy

6 Limitation

In this work, we introduce a novel method for modeling speaker characteristics based on biographical information of interlocutors in a conversation, generated by a large language model (LLM). In terms of computation time, our BiosERC method requires additional computing resources for the inference of the LLM compared to methods like Intra-inter ERC, which utilize hidden speaker identity information. Additionally, for the scope of this paper, we have not addressed issues related to human privacy data. In realistic applications, access to conversation history should be granted and clarified by the data owner. However, we believe that, with appropriate agreements to protect users’ privacy data, it is possible to obtain this permission.

7 Conclusion

In conclusion, we proposed a novel mechanism incorporating speakers’ characteristics into the ERC task, which has not been fully developed in prior research. We improve the performance of the ERC task by investigating the influence of the personality of interlocutors on emotions, considering this external knowledge as a unique feature. Our experiments on three benchmark datasets consistently yielded SOTA or competitive results, thereby substantiating the effectiveness of our proposed method. Furthermore, our model is straightforward yet highly adaptable, thus enabling its applicability to a wide range of conversation analysis tasks.

Acknowledgement.

This work is supported partly by AOARD grant FA23862214039.

References

[1] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 154–164.
[2] D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, and G. Zhou, “Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations.” in IJCAI, 2019, pp. 5415–5421.
[3] G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “UniMSE: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 7837–7851.
[4] X. Shi, X. Li, and T. Toda, “Emotion Awareness in Multi-utterance Turn for Improving Emotion Prediction in Multi-Speaker Conversation,” in Proc. INTERSPEECH 2023, 2023, pp. 765–769.
[5] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “Hitrans: A transformer-based context-and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4190–4200.
[6] Y. Liu, J. Zhao, J. Hu, R. Li, and Q. Jin, “DialogueEIN: Emotion interaction network for dialogue affective analysis,” in Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics, Oct. 2022, pp. 684–693.
[7] J. Li, Z. Lin, P. Fu, and W. Wang, “Past, present, and future: Conversational emotion recognition through structural modeling of psychological knowledge,” in Findings of the association for computational linguistics: EMNLP 2021, 2021, pp. 1204–1214.
[8] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 6818–6825.
[9] J. Lee and W. Lee, “CoMPM: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 5669–5679.
[10] Y. Bao, Q. Ma, L. Wei, W. Zhou, and S. Hu, “Speaker-guided encoder-decoder framework for emotion recognition in conversation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, L. D. Raedt, Ed. International Joint Conferences on Artificial Intelligence Organization, 7 2022, pp. 4051–4057, main Track.
[11] S. Poria, N. Majumder, D. Hazarika, D. Ghosal, R. Bhardwaj, S. Y. B. Jian, P. Hong, R. Ghosh, A. Roy, N. Chhaya et al., “Recognizing emotion cause in conversations,” Cognitive Computation, vol. 13, pp. 1317–1332, 2021.
[12] W. Shen, S. Wu, Y. Yang, and X. Quan, “Directed acyclic graph network for conversational emotion recognition,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 1551–1560.
[13] X. Jieying, N. Phuong, M. Blake, and N. Le Minh, “Accumulating word representations in multi-level context integration for erc task,” in 2023 15th International Conference on Knowledge and Systems Engineering (KSE), 2023, pp. 1–6.
[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
[15] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[16] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria, “COSMIC: COmmonSense knowledge for eMotion identification in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 2470–2481.
[17] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “DailyDialog: A manually labelled multi-turn dialogue dataset,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017, pp. 986–995.
[18] B. Lee and Y. S. Choi, “Graph based network with contextualized representations of turns in dialogue,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 443–455.
[19] W. Li, L. Zhu, R. Mao, and E. Cambria, “Skier: A symbolic knowledge integrated model for conversational emotion recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, pp. 13 121–13 129, Jun. 2023.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[22] S. Lei, G. Dong, X. Wang, K. Wang, and S. Wang, “Instructerc: Reforming emotion recognition in conversation with a retrieval multi-task llms framework,” arXiv preprint arXiv:2309.11911, 2023.
[23] G. Tu, B. Liang, B. Qin, K.-F. Wong, and R. Xu, “An empirical study on multiple knowledge from ChatGPT for emotion recognition in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 12 160–12 173.
[24] C. Liang, J. Xu, Y. Lin, C. Yang, and Y. Wang, “S+PAGE: A speaker and position-aware graph neural network model for emotion recognition in conversation,” in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online only: Association for Computational Linguistics, Nov. 2022, pp. 148–157.
[25] T. Kim and P. Vossen, “Emoberta: Speaker-aware emotion recognition in conversation with roberta,” CoRR, vol. abs/2108.12009, 2021.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[27] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” 2022.
[28] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
[29] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, pp. 335–359, 2008.
[30] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” arXiv preprint arXiv:1810.02508, 2018.
[31] S. M. Zahiri and J. D. Choi, “Emotion detection on tv show transcripts with sequence-based convolutional neural networks,” in Workshops at the thirty-second aaai conference on artificial intelligence, 2018.
[32] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 4190–4200.
[33] W. Shen, J. Chen, X. Quan, and Z. Xie, “Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, pp. 13 789–13 797, 5 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/17625
[34] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” ArXiv, vol. abs/2306.05685, 2023.