The Effects of Embodiment and Personality Expression on Learning in LLM-based Educational Agents

Sinan Sonlu, Bennie Bendiksen, Funda Durupinar, Uğur Güdükbay Manuscript received June 19, 2024; revised August 16, 2024. S. Sonlu and U. Güdükbay are with the Department of Computer Engineering, Bilkent University, Ankara, Turkey. e-mails: [email protected], [email protected]. B. Bendiksen and F. Durupinar are with the Department of Computer Science, The University of Massachusetts Boston, USA. e-mails: [email protected], [email protected].
Abstract

This work investigates how personality expression and embodiment affect personality perception and learning in educational conversational agents. We extend an existing personality-driven conversational agent framework by integrating LLM-based conversation support tailored to an educational application. We describe a user study built on this system to evaluate two distinct personality styles: high extroversion and agreeableness and low extroversion and agreeableness. For each personality style, we assess three models: (1) a dialogue-only model that conveys personality through dialogue, (2) an animated human model that expresses personality solely through dialogue, and (3) an animated human model that expresses personality through both dialogue and body and facial animations. The results indicate that all models are positively perceived regarding both personality and learning outcomes. Models with high personality traits are perceived as more engaging than those with low personality traits. We provide a comprehensive quantitative and qualitative analysis of perceived personality traits, learning parameters, and user experiences based on participant ratings of the model types and personality styles, as well as users’ responses to open-ended questions.

Index Terms:
Five-factor personality, Generative Pre-trained Transformer (GPT), Large Language Model (LLM), Conversational Agent, Pedagogical Agent, Dialogue, Animation.
publicationid: pubid: 0000–0000/00$00.00 © 2021 IEEE

1 Introduction

Virtual humans have tremendous potential to support educational activities by providing on-demand, personalized learning experiences. Their function is more than just relaying information; they can socially connect with users, establish rapport, and motivate them [1]. Their potential has significantly increased with the advancement of Large Language Models (LLMs), which can effectively assume various roles and personalities and provide information on any topic. Virtual humans with LLM-driven dialogue capabilities offer customized experiences for learners with diverse preferences and needs.

A substantial body of literature examines how teacher personalities impact effective learning and student preferences [2, 3, 4, 5]. Overall, all the Five-Factor Model (FFM) [6] personality traits with positive connotations–openness, agreeableness, conscientiousness, extroversion, and emotional stability (negative neuroticism)–play important roles in educational contexts [7]. However, factors such as individual differences among students [3] and the type of the course [2] determine which traits are effective in specific scenarios.

Research has shown that embodied characters are perceived as more trustworthy, engaging, and socially present than their disembodied counterparts [8]. Embodied pedagogical agents have been reported to increase motivation and enjoyment in learning; yet, their effects on knowledge acquisition have been mixed [9, 10] as they also increase cognitive load and cause distraction.

Building on these insights, we investigate how virtual agents’ embodiment and personality expression affect learning outcomes in an educational application. This work extends an existing personality-driven conversational agent framework [11] with LLM-based conversation support tailored for an educational scenario. In our application, users interact with a conversational agent in real time by typing their questions, to which the agent responds in speech vocalized by the operating system’s text-to-speech functionality.

As the system’s personality parameters, we chose the combination of extroversion and agreeableness since previous work showed that these two personality factors were more effectively conveyed through body gestures and facial expressions than the other three [11]. We refer to them as an agent’s “personality style” with high-trait and low-trait variants, corresponding to a high extroversion-agreeableness combination and a low extroversion-agreeableness combination. The former is manifested as a friendly, vivid, and energetic agent designed to engage users with enthusiasm and warmth. In contrast, the latter is characterized by a more reserved and less approachable demeanor, offering interactions that may be perceived as less engaging and more formal. Our system expresses personality through dialogue text and/or body and face animation.

We created three models to assess embodiment: a dialogue-only model and two models with 3D humanoid bodies. All the models displayed conversation text concurrently with audio feedback. We evaluated the efficacy of different modalities and personality styles on learning through an independent-subjects user study. The study randomly presented each participant with a high or low personality variant of each model. The dialogue-only model and one of the embodied models expressed personality only through text, and the other embodied model expressed personality through face and body movements and gaze. During the study, we collected ratings about the system for learning, quality, and engagement as well as the perceived personalities of the agents. Additionally, we obtained user feedback through open-ended questions.

Although the system parameters were selected to express the personality dimensions of extroversion and agreeableness, we considered potential variations in participants’ perceptions of the agents’ personalities. For instance, a high-trait agent might also be perceived as emotionally stable with low neuroticism, or a dialogue-only agent might be viewed as conscientious, even though these traits were not intentionally highlighted. Therefore, we collected users’ perceptions of the agents’ personalities across all five dimensions of the FFM for each of the three models.

We aim to answer the following research questions:

  • RQ1.

    Is there an effect of personality style on perceived personality?

  • RQ2.

    Is there an effect of model type on perceived personality?

  • RQ3.

    Is there an effect of model type on learning outcomes?

  • RQ4.

    Is there a correlation between learning outcomes and personality perception?

Based on the findings of the previous studies, we have the following hypotheses:

  • H1.

    Learning outcomes will be rated higher for the embodied agents than the dialogue-only agent, reflected as higher ratings in H1a. learning; H1b. quality; H1c. engagement. Since the literature indicates a more positive approach towards embodied agents than disembodied ones, we expect a similar tendency in our educational application. In other words, the embodied agents will be more engaging and effective in learning [10].

  • H2.

    Agents expressing high extroversion and agreeableness will be rated higher in terms of learning outcomes than the agents expressing low extroversion and agreeableness, reflected as higher ratings in H1a. learning; H1b. quality; H1c. engagement. We formulate this hypothesis based on students’ reported preferences for teacher personalities and documented preferences of users for highly agreeable chatbots [12, 13].

In addition to investigating these questions through quantitative analysis, we identify common themes and individual differences across participants by an in-depth qualitative analysis of their responses to open-ended questions. Furthermore, we provide our system as an open-source virtual tutoring application with conversational virtual agents that exhibit desired personality traits via motion and language. Our data and code are available in our public repository 111Repository link will be available in the final version.

2 Related Work

Involving multiple computing fields, this work on educational agents combines personality expression, LLM-based dialogue generation, and conversational agent systems. We summarize the related work based on these categories, including similar studies on pedagogical agents.

2.1 Pedagogical Agents

The earliest usage of educational computer software includes military applications such as flight simulations [14]. Increasing widespread use of computers opened the way for many interactive multimedia applications focusing on education [15]. One special form of computer-assisted learning includes life-like pedagogical agents that help with learning and motivation in multimedia environments [16]. Such agents can assume the role of instructors, coaches, tutors, or learning companions [17] and converse with the user using natural language with text or speech input [18]. Pedagogical agents can simulate instructional roles such as expert, motivator, and mentor with high accuracy [19]. Human-like agents have been shown to influence learner achievement, attitude, and retention of learning [20] and deliver a more interesting overall learning experience than learning without an agent [21].

Research has explored the use of 2D and 3D agents in educational settings, revealing varying effects on learning outcomes. 2D agents can outperform their more visually complex 3D counterparts in certain scenarios [22], as they reduce extraneous cognitive load and enhance coherence in multimedia learning [23]. High levels of anthropomorphism in agents can detract from social co-presence [24] and consequently limit their pedagogical benefits [25]. 3D agents, in contrast, offer a greater variety of expressions and gestures, which can boost learning and engagement [26]. Although high behavioral realism enhances social presence in virtual reality environments, where 3D agents are particularly effective, it can negatively impact factual learning outcomes [10]. However, the presence of a real instructor is associated with increased learning and satisfaction, leading to better information recall [27]. The mixed effects of high behavioral realism may be explained by the uncanny valley effect [28], where realistic yet not quite human-like agents induce discomfort and unease. In robotics, incorporating personality into humanoid robots has been shown to reduce these uncanny feelings and enhance the overall user experience [29], which can also apply to virtual humans.

Pedagogical agents can be embedded into different applications such as text-based mobile conversational systems [30], collaborative serious games [31], multiple-agent intelligent tutoring systems [32], collaborative augmented reality environments [33], and artificial intelligence-enabled remote learning[34]. In robotic educational agents, verbal cues are more effective than nonverbal cues in improving engagement [35]. Expressing various emotions through facial expressions and body motion can appeal to different learner types in animated pedagogical agents [36]. Pedagogical agents have been utilized in teaching a wide range of subjects, including STEM [37], foreign language [38], history [39], as well as work training [40]. Learning, engagement, human likeness, credibility, and personality can be used as measures of the psychometric structure of pedagogical agents [41].

2.2 Personality Expression in Agents

Multi-modal communication elements are essential for expressing desired personality traits in digital characters. Nonverbal behavior elements are commonly used in affective virtual agents to convey these traits [42, 43, 44, 45]. Studies leverage high-level meanings of motion to express the target personality type [46]. For instance, PERFORM establishes a link between Laban Movement Analysis (LMA) parameters and the perceived personality of virtual human characters [47]. In addition to body motion, personality-specific voice, dialogue, and facial expressions help distinguish personality traits in expressive conversational agents [11]. Recent research indicates that appearance and movement significantly influence the expression of certain traits such as agreeableness and neuroticism [48].

The character’s rendering style also affects personality perception [49]. For example, cartoon-like rendered characters are perceived as more agreeable, whereas characters with unappealing, ill-looking rendering styles are found quarrelsome and less sympathetic [50]. Additionally, animation realism [51], face model [52], body shape [53, 54], clothing, environment, and facial expression [55], as well as skin texture and viewing angle [56] of virtual characters are all influential on perceived personality. In multi-agent scenarios, the interaction between agents and proximity is a successful indicator of personality [57, 58, 59], as well as emotional group dynamics [59]. Action choices in procedural story generation can also express different personality types [60, 61]. An agent’s perceived personality highly influences how users interact with the system; for example, users are more willing to trust and listen to serious-looking, assertive agents [62].

Human hands are highly expressive in communication [63]; specific hand movements can convey different personalities [64]. Gesture performance in combination with language highly influences perceived personality [65]. Linguistic elements such as the ratio of phrases, words of emotion and cognition, and exclamations correlate with apparent personality traits [66, 67]. Automatic personality assessment is possible using text input from social media messages [68, 69], using speech [70, 71], facial expressions [72], gait features [73] and body motion features [74]. Overall, different communication elements contribute to the perceived personality of digital characters where multi-modal approaches often yield better results [75, 76, 77].

LLMs such as GPT exhibit consistent personality cues [78] and can be customized for different personalities through prompting [79], as we do in this work. Analysis suggests that word choices and the length of the generated text are influential on this [80]. Data-driven personality estimation systems can predict different personality types when the generated text uses certain prompts [81, 82], supporting the success of LLMs in capturing personality cues from language.

2.3 LLM-Based Agents

LLMs span many applications, including natural language-based human-computer interaction [78]. The rise of LLMs has opened up new avenues for creating and populating digital worlds. For instance, language models can be used to generate and animate 3D avatars [83], schedule motions [84], control facial expressions and body motion styles [85], drive non-player character behavior in games [86, 85], and automate and refine digital storytelling [87]. 3D scenes can be injected into LLMs for captioning, 3D question answering, and navigation tasks [88]; models can describe or compare objects in 3D scenes [89].

LLMs have also started playing critical roles in innovative educational technologies [90]. For instance, LLMs are used with vocalized agents in Augmented Reality (AR) environments to aid students in foreign language learning [91]. Similarly, in healthcare, LLMs enhance patient experiences during consultation, diagnosis, and management [92]. Although dialog systems utilizing LLMs can generate highly sophisticated responses, they lack access to dynamic real-time data, such as the current date [93]. Consequently, LLM-based agent systems often focus on isolated tasks, such as answering questions based on pre-existing knowledge. To enable more context-aware conversations, additional mechanisms to enhance LLM memory can be implemented [94]. In this work, we use an LLM to generate educational responses for specific computing subjects in an educational setting.

2.4 Conversational Agents

Conversational agents aim to interpret and respond to user statements in ordinary natural language through integrating computational linguistics techniques [95]. Understanding and exhibiting emotions and personality are essential for successful natural language conversations [96]. For example, the same query may require different interpretations based on the user’s current mood, and similarly, the same response can be perceived differently based on the agent’s current body language and facial expression. Conversational systems that give relevant answers to user questions are perceived as more human-like and engaging [97]. The realism of animation and behavior is critical for agents with a visual representation [98]. Thus, studies synthesize gesture animation to accompany speech [99, 100]. Recent deep learning-based co-speech animation generation systems can produce highly realistic results [101, 102, 103, 104], and the generated animations can be authored to express the desired pose and style at specific frames [105]. Facial expressions are also crucial to create realistic experiences. For instance, agents that mimic user facial expressions deliver believable and empathetic conversations [106]. Conversational agents can take as input multiple stimuli, including user’s gaze [107], speech and facial expression [106], structured or natural language text [108]. In this study, we input natural language text from participants to keep the system requirements minimal and leave other modalities for future work.

3 Method

This section describes a user study to test the impact of different modalities and personalities on learning and personality perception factors. For this, we designed an application employing a conversational agent that teaches a complex topic of the user’s choice through turn-based dialogue 222Bilkent University Ethical Committee for Human Research approved the study with the decision number 2023_11_05_01..

3.1 System

To run our study, we updated the personality-driven conversational agent platform by Sonlu et al. [11], an open-source, multi-modal system for animating 3D conversational virtual agents. The platform runs on Unity [109] and controls multiple modalities, such as dialogue, facial expressions, and body movements, based on an input personality. Body movement control involves modifying a given animation clip via joint rotation and animation speed adjustments, noise addition, and inverse kinematics-based gesture changes following the LMA mappings defined in PERFORM [47]. Facial animation involves mouth movements during speech, frequent blinks associated with neuroticism, and updating blend shapes to express emotions associated with specific personality factors. Facial expressions are designed according to Facial Animation Control System (FACS) [110], and mouth movements are handled by Oculus LipSync [111] with a customized mapping to the facial blend shapes of the 3D model. The input personality determines the agent’s default facial expression. For example, an agreeable agent tends to smile by default with each turn of its dialogue, which diminishes with time. We designed 3D human models for the current study using Reallusion Character Creator [112]. To introduce a measure of diversity, we created four characters: two female and two male, each with light and dark skin tones, as depicted in Figure 1.

Refer to caption
Figure 1: Different 3D agent models used in the study expressing high (left group) and low (right group) traits.

Our updated system differs from the existing personality-driven platform in terms of how it handles dialogue. The previous work used IBM Watson Assistant [113] to extract the intent from user queries mapped into domain-specific handcrafted dialogue lines. This work replaces the dialogue logic with an LLM-based text generation model, GPT-3.5 Turbo [114], facilitated by OpenAI’s Chat Completions Application Programming Interface (API), eliminating the need for manual dialogue crafting. When users type their prompts (e.g., ask a question), the system returns an answer coherent with the input personality description. We limit the token number to 750 for the output text to keep the conversation concise. Unlike the previous platform, which used Watson Text-to-Speech API for speech generation, our current system employs Microsoft® text-to-speech functionality, producing an almost immediate response to vocalize the agent’s answer. This local solution also lets us determine the currently spoken word we use to display partial subtitles. Since the generated responses could be fairly long, we followed a dynamic approach where five words centering the currently spoken word were displayed on top of the agent in models with visual representation. The subtitles were on by default, but the users could disable them in models with 3D agents if they found them distracting.

We used a temperature of 0.9 to promote diverse outputs from GPT while maintaining the information’s reliability. Temperatures greater than 1 introduce creativity; however, they lead to hallucinations, conflicting with the aim of the educational system. Chat Completions API takes as input a “messages” parameter consisting of message objects, where each object has a role of “system”, “user”, or “assistant” and content. For the role of “system”, we give the following messages as input for different agent personalities and teaching topics:

  • {“role”: “system”, “content”: Act as an extroverted teacher teaching about <<<topic>>>, give friendly and polite answers.}

  • {“role”: “system”, “content”: Act as an introverted teacher teaching about <<<topic>>>, give short and unfriendly answers.}

To produce a response that the agent speaks, the system sends the role prompt together with the last five dialogue messages, alternating between the user and assistant: {“role”: “user / assistant”, “content”: <<< message >>>} We limited the number of messages to five to reduce costs, eliminate context drift, and prevent manipulation. This restriction helps prevent altering the language model’s perception through extended interactions [115].

3.2 Stimuli

We designed a 3×2323\times 23 × 2 independent subjects study for a comparative analysis of three models—D, A, and E—each tested with high and low values of agreeableness-extroversion combination. Model D is the dialogue-only setting, where the system’s answers were shown on screen sentence-by-sentence concurrently with audio playback. Model A included an animated 3D model of the agent, randomly chosen among four alternatives (two male, two female, each with dark and light skin tones). Model A involved the virtual human animated without any personality-based alterations. In models D and A, personality was conveyed only through synthesized dialogue. Model E was similar to Model A but additionally incorporated the expression of personality through face and body movements.

In model E, motions that display a combination of high extroversion and agreeableness involve the LMA parameters of Indirect Space, Light Weight, and Free Flow [47]. These correspond to multi-focal spatial attention, delicate, lifted-up movements, and uncontrolled and fluid motion. Because high extroversion and agreeableness are associated with opposite Time Efforts (Sudden, urgent vs. Sustained, lingering), we left the Time component of the animations unaltered. The animations expressing low extroversion and agreeableness involve Direct Space, Strong Weight, Bound Flow, and neutral Time, corresponding to single-focused, heavy, and controlled movements. The default facial expression of a highly extroverted and agreeable agent is relaxed and happy, with occasional smiles and direct eye contact. In contrast, an agent characterized by lower levels of these traits displays a more tense expression by default. Low-trait agents also avoid eye contact with the viewer.

Screenshots of different models are displayed in Figure 2. We name each variation of the system with its model name and whether they express high or low traits. For example, E-High refers to the variation where we express high trait personality using the model that uses both text and animation-based cues. Note that we display a single image for models A and D as they are visually similar in high and low variations. For the animation in Figure 2(b), the E-Low variation has the hands close to the body with a slightly more slanted posture (Figure 2(c)), and the E-High variation has the hands further from the body with more upright posture (Figure 2(d)).

Refer to caption
(a) Model D
Refer to caption
(b) Model A
Refer to caption
(c) Model E-Low
Refer to caption
(d) Model E-High
Figure 2: Sample screenshot from different models and their variations.

3.3 Study Design

The study involved an educational application with a conversational agent that teaches users complex subjects. We presented participants with six options and asked them to select the least familiar topic. The topics were quantum computing, blockchain technologies, transformer architectures, quantum mechanics, string theory, and general relativity. The selected topic was provided to the GPT model as part of the system role prompt to guide a focused conversation. The application required that participants pose the agent at least five questions to learn about the topic. There was no upper limit on the number of questions a participant could ask. Upon completing their queries, participants could proceed to answer survey questions. They could review the survey questions or the chat history at any point.

The survey questions appeared in two groups. The first group included 27 questions on a 5-point Likert scale, where 15 questions measured the perceived personality of the agent using the extra-short form of the Big Five Inventory–2 (BFI-2-XS) [116], and 12 questions measured self-assessment of learning, quality, and engagement using the Learning Object Evaluation Scale for Students (LOES-S[117]. In LOES-S, learning-related questions are about the self-assessment of learning and how much the learning object, i.e., the tool in question, helped teach the subjects a new concept. Quality assesses the instructional design, ease of use, organization, and help features. Engagement evaluates how much the subjects liked the tool and whether they found it motivating.

The second group of questions required open-ended input to receive detailed participant feedback. There was no character limit for the answers to the open-ended questions. Completing both groups of questions directed the participants to the user study completion page, where they received a link for task approval.

3.4 Participants

We used the crowd-sourcing service Prolific to recruit participants. Before running the study, each participant was directed to a website to test whether they had installed the correct text-to-speech package. Only those with the supported system configurations could continue with the study. Two hundred ten unique participants (99 female, 95 male, 16 not specified) rated our system, with each alternative evaluated by 35 individuals, which provides a medium effect size (Cohen’s f=0.26𝑓0.26f=0.26italic_f = 0.26) for both main effects and their interaction and power of 0.800.800.800.80 at a significance level of 0.050.050.050.05 for independent-subjects Analysis of Variance (ANOVA).

Each participant interacted with only one version of the system, where the average interaction time was 19.74±9.25plus-or-minus19.749.2519.74\pm 9.2519.74 ± 9.25 minutes. This time excludes the introduction, during which participants read about the task and downloaded the application, but includes the time spent answering the survey questions. The average participant age was 28.80±8.57plus-or-minus28.808.5728.80\pm 8.5728.80 ± 8.57. Upon entering their Prolific IDs, participants were shown an introduction message with information about the study details. We informed the participants about the data collected and the study’s aim to measure the system’s performance; we emphasized that the study did not aim to measure their knowledge in any manner.

4 Quantitative Analysis

4.1 Data Organization and Exploratory Analysis

BFI-2-XS includes three questions for each personality factor, half of which are inversely proportional to the measured dimensions. Responses were assigned integer values on a 5-point Likert scale, ranging from -2 to 2. We calculated the signed sum of these values to derive personality scores between -6 and 6, which were then re-scaled back to the range [2,2]22[-2,2][ - 2 , 2 ]. Similarly, LOES-S has five questions measuring learning, four questions measuring quality, and three questions measuring the engagement of the learning object. We calculated the sum per measurement type and mapped the corresponding ranges into [2,2]22[-2,2][ - 2 , 2 ] to report the corresponding means.

For exploratory analysis, we display box plot diagrams of each model regarding perceived personality and LOES-S scores for learning, quality, and engagement (see Figure 3). The diagrams indicate positive mean scores for openness, conscientiousness, extroversion, and agreeableness and negative mean scores for neuroticism across all models. The models received particularly high ratings for conscientiousness. The plots also show high positive ratings for learning, quality, and engagement, with mean engagement scores slightly higher for high personality variations than for low personality ones. We can also observe that the model E-High, followed by A-High, represents high conscientiousness, high agreeableness, and low neuroticism better than the other models. In the next section, we perform descriptive analysis to identify potential statistically significant effects of the models and personality styles on the output variables.

Refer to caption
Figure 3: Box plots of each variation’s BFI-2-XS and LOES-S scores.

4.2 The Effects of Model and Personality Styles on Learning Outcomes and Personality Perception

To investigate the impact of model type (D, A, and E) and personality style (high or low) on learning outcomes and perceived personality factors, we ran seven two-way analysis of variance (ANOVA) models and one non-parametric alternative model (Welch’s ANOVA). Welch’s ANOVA was utilized to assess the quality scores of LOES-S across model type and personality style, given that the assumption of equal variances was violated, as indicated by a Bartlett test.

Apart from the non-parametric model, which combined model type and personality style as a single factor, all other models examined the influence of model type and personality style on outcome mean individually and any potential interaction between these factors. With balanced and sufficiently large sample sizes (n=35𝑛35n{=}35italic_n = 35) across factor combinations and evidence for equal variances across factor levels (as measured by Bartlett’s test), all outcomes except for quality were appropriate for ANOVA modeling. To control the familywise error rate at 0.050.050.050.05, we employed the Hommel method to adjust for multiple testing across all model terms, including one post-hoc analysis. Unlike the conservative Bonferroni correction, the Hommel method offers increased statistical power. Table I presents significant terms for all ANOVA runs before and after the correction for multiple testing. Adjusted-for ANOVA tests returned significant main effects of personality style on the outcomes of engagement (F=9.502,p=0.0421formulae-sequence𝐹9.502𝑝0.0421F=9.502,p=0.0421italic_F = 9.502 , italic_p = 0.0421), openness (F=4.474,p=0.00178formulae-sequence𝐹4.474𝑝0.00178F=4.474,p=0.00178italic_F = 4.474 , italic_p = 0.00178), extroversion (F=10.148,p=0.03006formulae-sequence𝐹10.148𝑝0.03006F=10.148,p=0.03006italic_F = 10.148 , italic_p = 0.03006), agreeableness (F=25.541,p=0.0000211formulae-sequence𝐹25.541𝑝0.0000211F=25.541,p=0.0000211italic_F = 25.541 , italic_p = 0.0000211), and neuroticism (F=9.708,p=0.0378formulae-sequence𝐹9.708𝑝0.0378F=9.708,p=0.0378italic_F = 9.708 , italic_p = 0.0378). Although conscientiousness initially carried a significant finding for personality style (F=5.68,p=0.0181formulae-sequence𝐹5.68𝑝0.0181F=5.68,p=0.0181italic_F = 5.68 , italic_p = 0.0181), this term’s statistical significance dropped after multiple testing corrections. The main effect of the model type was initially significant for neuroticism, but the effect did not remain significant after the Hommel procedure (p=0.417𝑝0.417p=0.417italic_p = 0.417). The box plots of the statistically significant effects are depicted in Figure 4.

The effects of agent gender and skin color were not among the hypotheses, so we randomly selected one 3D agent model among four different appearances, which also determined the agent’s voice, to support variety. We do not observe a significant effect due to the agent’s gender or skin color, which confirms previous work [22].

TABLE I: Two-way ANOVA significant findings for learning outcomes and perceived personality on model type and personality style (n=210𝑛210n=210italic_n = 210). Statistically significant factors (p<0.05𝑝0.05p<0.05italic_p < 0.05) after p-value adjustment are emphasized in bold.
Outcome Factor F p-value Adj. p-value
Engagement Pers. Style 9.502 0.002 0.042
Openness Pers. Style 4.474 <0.001absent0.001<0.001< 0.001 0.002
Conscientiousness Pers. Style 5.68 0.018 0.290
Extroversion Pers. Style 10.148 0.002 0.031
Agreeableness Pers. Style 25.541 <0.001absent0.001<0.001< 0.001 <0.001absent0.001<\textbf{0.001}< 0.001
Neuroticism Pers. Style 9.708 0.002 0.038
Neuroticism Model Type 3.681 0.027 0.417
Refer to caption
Figure 4: Box plots of statistically significant effects.

4.3 Correlations between Personality Factors and Learning Scores

We report the Pearson Correlation between LOES-S and personality factors in Table II. In general, correlation coefficients higher than 0.40.40.40.4 are considered moderate. Perceived openness, conscientiousness, extroversion, and agreeableness all positively correlate to the LOES-S scores, albeit some weakly. The highest correlations are for conscientiousness, particularly for Model D. Quality and engagement scores are strongly correlated (>0.6absent0.6>0.6> 0.6), and learning is moderately correlated, nearing the threshold for a strong correlation. For Model D, openness and agreeableness also have moderate correlations with all the learning parameters. In general, engagement is moderately correlated with all factors except neuroticism. The expressed neuroticism is weakly inversely proportional to each parameter, but its only statistically significant correlations are for learning in Model D and quality in Model A. The correlations are the strongest for Model D and weakest for Model E.

TABLE II: Pearson correlation (r𝑟ritalic_r) of Learning, Quality, and Engagement measurements of each model with each personality factor. indicates p<0.05𝑝0.05p<0.05italic_p < 0.05, ∗∗ indicates p<0.001𝑝0.001p<0.001italic_p < 0.001. The cell colors indicate transition from weak to strong correlation.
Model Corr. O C E A N
D rLesubscript𝑟Ler_{\textit{Le}}italic_r start_POSTSUBSCRIPT Le end_POSTSUBSCRIPT .422superscript.422absent.422^{**}.422 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .582superscript.582absent.582^{**}.582 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .396superscript.396absent.396^{**}.396 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .446superscript.446absent.446^{**}.446 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .349superscript.349-.349^{*}- .349 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
rQusubscript𝑟Qur_{\textit{Qu}}italic_r start_POSTSUBSCRIPT Qu end_POSTSUBSCRIPT .437superscript.437absent.437^{**}.437 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .654superscript.654absent.654^{**}.654 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .279superscript.279.279^{*}.279 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .325superscript.325.325^{*}.325 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .190.190-.190- .190
rEnsubscript𝑟Enr_{\textit{En}}italic_r start_POSTSUBSCRIPT En end_POSTSUBSCRIPT .417superscript.417absent.417^{**}.417 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .608superscript.608absent.608^{**}.608 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .287superscript.287.287^{*}.287 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .422superscript.422absent.422^{**}.422 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .109.109-.109- .109
A rLesubscript𝑟Ler_{\textit{Le}}italic_r start_POSTSUBSCRIPT Le end_POSTSUBSCRIPT .351superscript.351.351^{*}.351 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .348superscript.348.348^{*}.348 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .456superscript.456absent.456^{**}.456 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .398superscript.398absent.398^{**}.398 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .208.208-.208- .208
rQusubscript𝑟Qur_{\textit{Qu}}italic_r start_POSTSUBSCRIPT Qu end_POSTSUBSCRIPT .215.215.215.215 .378superscript.378.378^{*}.378 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .204.204.204.204 .150.150.150.150 .425superscript.425absent-.425^{**}- .425 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT
rEnsubscript𝑟Enr_{\textit{En}}italic_r start_POSTSUBSCRIPT En end_POSTSUBSCRIPT .504superscript.504absent.504^{**}.504 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .332superscript.332.332^{*}.332 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .505superscript.505absent.505^{**}.505 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .541superscript.541absent.541^{**}.541 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .233.233-.233- .233
E rLesubscript𝑟Ler_{\textit{Le}}italic_r start_POSTSUBSCRIPT Le end_POSTSUBSCRIPT .266superscript.266.266^{*}.266 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .405superscript.405absent.405^{**}.405 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .281superscript.281.281^{*}.281 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .184.184.184.184 .046.046-.046- .046
rQusubscript𝑟Qur_{\textit{Qu}}italic_r start_POSTSUBSCRIPT Qu end_POSTSUBSCRIPT .334superscript.334.334^{*}.334 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .133.133.133.133 .056.056.056.056 .305superscript.305.305^{*}.305 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .054.054-.054- .054
rEnsubscript𝑟Enr_{\textit{En}}italic_r start_POSTSUBSCRIPT En end_POSTSUBSCRIPT .401superscript.401absent.401^{**}.401 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .353superscript.353.353^{*}.353 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .389superscript.389absent.389^{**}.389 start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT .353superscript.353.353^{*}.353 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .052.052-.052- .052

5 Qualitative Analysis

5.1 Data

We posed the following open-ended questions to the participants for in-depth feedback:

  1. 1.

    What is <<<topic>>> based on your conversation with the system?

  2. 2.

    Why is <<<topic>>>important, please discuss briefly?

  3. 3.

    Did you learn anything new about <<<topic>>> after your conversation with the system? If so, what did you learn?

  4. 4.

    Do you think the system’s behavior influenced your learning experience?

  5. 5.

    Please briefly describe your conversation with the system. Was it interesting? Did you encounter any problems?

Two independent researchers tagged the answers with at least 95% agreement for each theme. The themes were determined based on the following criteria:

  • 1

    Benefit/No Benefit to Learning Experience: Regarding the third and fourth questions, this theme reflects the effect of the system behavior on the participant’s learning experience.

  • 2

    Learning / No Learning: Regarding the third question, this theme captures whether the participant learned anything new interacting with the system.

  • 3

    Interesting / Not Interesting: Regarding the fifth question, this theme considers whether the participant finds the system interesting or not.

  • 4

    Realistic / Robotic / Poor Voice: This theme captures the overall human likeness of the system based on the answers to the fourth and fifth questions. In certain cases, the participants only commented about the system’s voice, which utilized less developed local solutions to support immediate response; otherwise, the participants’ feedback is on the overall image of the agent.

  • 5

    Detailed / Moderate / Short Answer: Regarding the answers to the first and second questions, this theme captures how much detail the response includes. Although this may strongly depend on the participants’ characteristics, the system’s behavior may have also caused them to adjust the level of detail of their answers. These answers may also depend on the participant’s learning; the less they know, the shorter the answers will be. We tag the responses as detailed or short; any remaining response is considered moderate.

Table III shows the number of themes for each variation. The following subsections focus on each theme with quotes from the participants. At the end of each quote, we indicate the participant’s ID as a reference and the variation that the participant interacted with. We only fixed the typos in some of the responses to improve readability.

TABLE III: Theme analysis results where the numbers depict the occurrence of each theme for each variation. The cell colors indicate transition from low to high over 35 participant answers per variation.
Variation D-Low D-High A-Low A-High E-Low E-High
Benefit 15151515 19191919 17171717 22222222 19191919 21212121
No Benefit 9999 4444 8888 6666 8888 7777
Lernen 23232323 25252525 21212121 26262626 25252525 29292929
No Learning 6666 6666 5555 2222 7777 2222
Interesting 15151515 22222222 19191919 20202020 21212121 20202020
Not Interesting 3333 2222 00 1111 2222 1111
Realistic 00 3333 2222 3333 1111 4444
Robotic 1111 1111 6666 5555 5555 3333
Poor Voice 9999 5555 3333 6666 2222 3333
Detailed Answer 2222 9999 2222 6666 8888 4444
Moderate Answer 10101010 10101010 10101010 8888 10101010 20202020
Short Answer 23232323 16161616 23232323 21212121 17171717 11111111

5.2 Benefit

A primary objective of agent-based educational systems is to facilitate self-directed learning by providing a useful learning experience. Actively involving individuals in learning helps them develop a deeper understanding of the subject matter than passive learning through only reading or watching the material [118]. This level of interaction is naturally present with a human teacher who can promptly address a learner’s questions. We required participants to actively engage by asking questions to replicate this dynamic and prevent passive learning.

Most participants indicated that the system behavior improved their learning experience; they emphasized increased engagement due to having a human-like agent: “It was more engaging to have a human avatar instead of a blank screen or other representation.” – P3 (A-High). Even for the dialogue-only variations, the experience was more enjoyable due to interaction: “I think I enjoyed learning using the system more than I would have if I were reading on my own in a book or Google.” – P21 (D-High). Being able to ask questions using natural language and receive to-the-point answers was also found to be beneficial by certain participants: “It saved me some time from having to Google specific terms and read long texts on them, giving me the key points to get a basic understanding.” – P17 (E-High). A human-like interaction with a non-human system can help people with anxiety to experience interactive learning: “It can be useful a lot, and I’d like to use it because it is quite calming down the person who has anxiety.” – P27 (E-High).

However, a human teacher would typically simplify the material to facilitate easier learning, which can be challenging for a language model to emulate with limited user feedback: “I believe the system was very informational, which influenced my learning. However, the information was too much to handle at once, which clouded my mind.” – P30 (A-High). Participants who reported no benefit from the system often cited the difficulty of the answers: “I could have learned more on Wikipedia or Google. The system’s answers were too difficult to understand for me.” – P169 (D-Low). Although the high-trait models were generally found to be more beneficial, some participants noted that their responses were too lengthy: “The responses were quite long, sometimes too long. Some of the things it said could’ve been left out as they didn’t provide any useful information; it was just ’flavor text.’ I also read very fast, so waiting for it to stop talking was a bit boring. Other than that, it was a very positive experience. ” – P42 (D-High).

In addition to length, a major difference between the high and low variations was the LLM’s word choices. The high-trait variations use motivational language, supporting the learner instead of just giving the answer: “Yes, the system had a motivational tone that lifted my energy towards learning about something I had zero knowledge about.” – P37 (D-High). Such language can help create a more interactive and thus motivational experience: “Yes, I liked the easy-to-learn explanations and also the motivational part ’that is indeed a fantastic question’.” – P106 (D-High). Participants also referred to the system as friendly (P50 and P150, A-High) and polite (P209, D-High). On the other hand, the low-trait variations responded with shorter sentences that some users preferred: “It was not interesting per se; however, it was very informative and straight to the point.” – P2 (E-Low). Especially, the introverted language [66] of the low variations could influence the relatively short and simple responses: “Conversation was simple and quick. The agent gave me short and simple answers to my question that are easy to understand” – P170 (E-Low).

Participants found variations with embodied agents slightly more beneficial and interesting. Especially, they reported a positive influence of having an embodied human agent for the high-trait variations: “Somewhat, seeing a human-like face made it easier to memorize the information” – P163 (A-Low). “Interacting with a humanoid entity is more engaging than reading a book.” – P143 (A-High). “Yes! I really enjoyed it; it seemed very human. It was like talking to an expert; it can answer any question you have instantly.” – P12 (A-High). “It was really exciting to see a person/character in front of the screen. Of course, it has an influence and really affected my learning process positively.” – P134 (E-High). “I think the animations and the looks of the character were motivating, and this could help with the learning in general.” – P207 (E-High). “Body movement and text to speech allowed to be more engaged in the conversation.” – P184 (E-High). A human-like agent can help learners better focus on the conversation, positively affecting learning: “Having someone explaining a subject to you in human form generates a curiosity that is similar to listening to an enthusiastic teacher. As someone with a low attention span, the agent kept me engaged in the conversation and sparked further interest.” – P1 (E-High). “It was really exciting to see a person/character in front of the screen. Of course, it has an influence and really affected my learning process positively.” – P134 (E-High). While most participants focused on the body movements, a few reflected on the facial expressions and their positive effect in the E-High variant: “Slight facial “expressions” was noted and kind of felt like it made an impact, to be fair.” – P11 (E-High).

For the low-trait agents, participants noted that agent movements were monotonic and the visual representation brought no advantage: “The person itself is extremely dull, there is no life if that makes sense, and the movements and gestures are extremely weird, the hand and arm movements are strange, I would rather have that taken away, but being able to ask any questions to a topic and a response provided immediately is amazing, really like that aspect.” – P196 (E-Low). “Maybe, I think if the system were ”nicer” and less monotonic, the learning would be easier.” – P147 (E-Low). “I did not find the “graphics” to help. A chatbot would have basically had the same effect on me.” – P191 (A-Low).

5.3 Learning

The results suggest that high-trait personality variations lead to improved learning outcomes. However, the influence of expressivity appears limited, as Model E shows only a slight improvement in learning performance compared to Model A. Only two responses in each A-High and E-High were associated with no learning, pointing out that human-like appearance combined with high traits can improve learning.

The system sometimes inspired participants to learn more about the topic: “I did not know anything about this theory at all. After my conversation, I can proudly say that I am really into string theory. I learned the basic concept and the creators of the theory. I also asked how I could learn more, and the conversational agent suggested four different possible sources.” – P8 (A-High). “It really almost felt like talking to someone who knows quantum computing well. I especially appreciated the way it understood my questions, even though I felt a question or two were a bit vague. The system actually made me want to know more about the subject so I can ask better questions.” – P11 (E-High). Since we asked participants to choose the subject they had the least information about, most of them reported having almost no prior knowledge of the conversation topic: “It was a completely new topic to me, and I feel generally satisfied with the answers of your system” – P15 (E-Low). Therefore, the themes associated with learning correspond to the participants who are newly introduced to the subject and acquired new information. A few participants indicated they already had sufficient knowledge of the subject: “I didn’t learn anything new; I already knew the basic information about Blockchain.” – P29 (E-Low). Still, such participants found the system useful and reported a desire to use it again: “I am an enthusiast of general relativity. I honestly didn’t learn much, but I would love to use the AI to learn more in the future!” – P101 (E-Low).

Participants categorized under the “no learning” theme generally indicated the difficulty of the subject: “Previously, I had no idea what transformer architecture was. But unfortunately, I still believe that I did not learn a great deal about this type of technology. In my opinion, these new technologies that use AI are very difficult to understand if a person has no background knowledge.” – P20 (E-High). One detriment to learning could be the agent’s short answers in the low-trait variations: “Not much (learning) as the replies were brief, but I got a basic idea” – P101 (E-Low). Conversely, the lengthy answers of the high-trait variations may have been distracting: “The topics answered were on point, maybe a bit too long, and different questions had similar answers in common as part of it.” – P41 (E-High).

5.4 Interestingness and Realism

Promoting interest in a subject is essential for success in education [119]. For conversational agents, behavioral and graphical realism is an important factor for engagement [120], which also applies to pedagogical agents [121].

Participants found the D-Low variation to be the least interesting, followed by A-Low. All the high-trait variations and E-low were perceived as similarly interesting. The expressive gesturing in the E-Low variation may have mitigated the decrease in interest. A few participants found the agent realistic in all variations, with the E-High variation receiving the most responses associated with the “realistic” theme.

Participants who found the system interesting usually reported a positive influence on learning in the E-High variation: “I found it fascinating, really interesting, and quickly increased my knowledge on the subject. I would have loved to do more and carry on asking questions to discover more about blockchain.” – P199 (E-High). “It was really exciting to see a person/character in front of the screen. Of course, it has an influence and really affected my learning process positively.” – P134 (E-High). “I think that the system behaved as naturally as possible, so it was a good influence.” – P174 (E-High). Participants mentioned a positive effect on concentration due to the existence of the human avatar: “…Avatar of the agent helps concentrate on the conversation.” – P170 ( E-Low). The experience’s novelty could have resulted in some participants finding the study interesting: “It was quite interesting. I did not know what to expect when entering the task, but I was pleasantly surprised and engaged in the entire experience. It would definitely be something I would use again if I could.” – P175 (A-High). Participants found Model D less interesting but still beneficial in teaching: “I think it wasn’t interesting, but it taught me a topic I didn’t know about.” – P2 (D-Low). Since our focus was on a short conversation with the agent, the users could have endured the negative aspects of Models D and A more easily. The positive outcomes of expressive gestures may become more prominent in the long term. For example, keeping the agent’s behavior more interesting can be important to keep the user’s attention longer.

The system’s voice was highly criticized due to using a local solution for immediate responses at the cost of realism: “I didn’t like the voice because it seemed robotic and emotionless.” – P10 (D-Low). Some participants found the speech inhuman and slow: “I prefer the more traditional ’read what the bot says’ instead of it talking with its inhuman voice that speaks way slower and more monotonously than I would prefer.” – P41 (E-High). There was an increased focus on the negative aspects of speech synthesis when there was no visual representation, with 14 participants mentioning the poor voice of the agent in Model D. In contrast, negative comments on the agent’s voice were as low as 5 in Model E. Since Model D does not include an embodied agent, only a few participants regarded the agent as a robotic individual, with most of the negative comments focusing on the voice. Among the variations with a visual representation, E-High had the fewest responses associated with the “robotic” theme, which can be interpreted as the high-trait expressive motion effectively enhancing the agent’s realism.

Participants who found Model D human-like mostly reflected on the naturalness of the generated responses: “It was interesting. I liked the way it started to answer. It always added a kind of human comment and then gave you the answer.” – P59 (D-High). Participants assigned to the embodied representations tended to describe the agent as a real human:“It felt like a conversation with a real human, not like a robot. I felt like he was my teacher; I could ask any question.” – P6 (A-Low). “I really enjoyed it, it seemed very human.” – P12 (A-High). “I think it gave a more human experience and helped me focus on the subject.” – P179 (E-Low). The lack of expressive facial expressions in Model A was noticed by participants, which could explain why some found it robotic: “Some facial expressions might make the bot more relatable.” – P163 (A-Low). One participant found the overall appearance of the agent uncanny: “At first, I was surprised that this ’agent’ has a human-like avatar. I don’t think it helped me through the learning process because it felt kinda uncanny. Also, the TTS quality was poor, so it would be hard to understand some words if not for the subtitles. But I think that agent’s responses were very good and comprehensive.” – P57 (E-High).

5.5 Answer Depth

The depth of participants’ answers to the first and second questions can indirectly measure their learning and attention. Therefore, we grouped the responses into three categories based on how much detail they captured. Short answers usually only summarize the conversation topic with a single sentence: “1. Quantum physics is a fundamental branch of physics that deals with the behavior of very small particles. 2. It provides information about the little particles, which are everywhere.” – P56 (E-Low). Moderate-length answers use examples from the conversation and include more detail on the subject: “1. Blockchain is a decentralized technology that records transactions. It consists of blocks of data linked together in a chain. 2. It is important because of its transparency that allows one to view the transaction history, it is also highly secure.” – P103 (A-High). Detailed answers contain more than one sentence for each question and include different aspects of the subject: “1. Blockchain is a decentralized system to store information about transactions. A single block contains data about the transaction and the hash of the previous block, so it can’t be removed without breaking the chain. Data are stored as copies on many computers called nodes. 2. Blockchain is important because it is decentralized, and there is no authority keeping watch on it, so it is hard to manipulate it. It is also very secure, as every transaction must be validated by miners before it becomes a part of the chain.” – P57 (E-High).

More participants provided detailed answers in the high-trait variations for Models D and A, possibly because the agent’s longer responses encouraged them to elaborate. However, more participants gave detailed responses in the low-trait variation for Model E compared to its high-trait variation. Notably, the E-High variation had the highest number of moderate-length responses, twice as many as the other models. Although the amount of detail in participant responses may depend on various factors, including their personality and attitude toward the experiment, we note a general trend toward lengthier responses for high-trait agents.

6 Discussion

All the agents received positive mean personality ratings across all traits except for neuroticism. Since neuroticism is the inverse of emotional stability, we can conclude that, regardless of modality and personality expression, the agents were perceived positively as open, conscientious, extroverted, agreeable, and emotionally stable. Participant responses to open-ended questions also support this finding. The whole experience was perceived favorably even when low-trait personality variants, which were supposed to be less friendly, were employed. Previous studies show that people find interactions with virtual agents engaging, informative, and usable [122]. However, the positive responses could also be due to the “novelty effect”, an initial fascination with new technology. To mitigate such effects, techniques like extended tutorials and adaptive strategies can be employed [123].

Although mean ratings were positive for both, high-trait agents received higher scores than low-trait agents for all the positive personality factors. High-trait personality styles were associated with increased openness, extroversion, agreeableness, and emotional stability ratings with statistically significant effects. The only factor that did not have a statistically significant relationship with style variation after multiple hypothesis testing was conscientiousness. Thus, for R1, we can conclude that personality style affects the perception of all the personality factors except conscientiousness. Among these, agreeableness had the highest effect size, followed by extroversion. This finding also helps validate the personality style expression adjustments in the system and the mappings of the LMA factors of Space, Weight, and Flow to extroversion and agreeableness. Participants’ answers to open-ended questions suggest that they cared about the virtual agent’s ‘friendliness” and “niceness” or lack thereof. These results align with the previous reports that students generally prefer teachers high in extroversion, agreeableness, and conscientiousness [7]. The variance in conscientiousness is difficult to discern in a short scenario [11]; so, the lack of statistically significant effects of style on its perception is expected. However, it is important to note that conscientiousness received the highest scores as the perceived agent personality, which may imply a tendency to attribute reliability and organizational skills to educational agents.

Regarding RQ2, we found no statistically significant effect of model type on perceived personality. Similarly, for RQ3, no statistically significant effects of model type on LOES-S scores were observed. Thus, H1 was rejected as the absence or presence of visual representations did not impact learning outcomes assessed via LOES-S scores. Some participants indicated an indifference toward the graphical representation in their comments. However, overall thematic analysis suggests that models A and E provided greater benefits on the learning experience, were found more interesting, and enhanced learning of a new topic compared to model D. This is in line with the literature indicating that the visual representation of an agent can enhance motivation, interest in the topic, and belief in its utility [124].

For comparisons between high and low-trait styles, the evidence partially supported H2. Although both personality styles were positively rated across all models, the only significant difference was in the mean engagement scores. High-trait agents were found more engaging than low-trait agents, supporting H2c. Participant responses to open-ended questions also confirm this finding. The absence of statistically significant differences in quality and learning scores of LOES-S across high and low-trait variants can be attributed to individual learning preferences. In their responses, some participants praised the directness of the low-trait agents, while others emphasized that the high-trait versions were motivational. Such preferences may be linked to participant personalities and learning style preferences [3].

For RQ4, we found positive correlations between perceived positive personality factors (O, C, E, A) and most learning outcomes for all the models. We also found a negative correlation between perceived neuroticism and learning for Model D and quality for Model A. The highest correlations between learning outcomes and perceived personality traits were for Model D, followed by Models A and E. Among all personality dimensions, conscientiousness yielded the highest correlation with learning outcomes. This is particularly evident in Model D, which suggests that the lack of a visual representation may have allowed participants to focus more on the educational aspects of the system. Organizational skills describe Conscientiousness and is strongly associated with educational proficiency [125]. Without a humanoid body that competes for attention, participants may have perceived the system as more focused and reliable in delivering educational content.

7 Limitations and Future Work

The LOES-S measures learning outcomes through self-reports, which do not necessarily reflect whether participants have effectively comprehended the material. Since our study did not assess the participants’ mastery of the subject, we cannot confirm whether model variations and personality expression influenced learning performance. Although we asked open-ended questions about the subject, participants’ responses were generally short and too vague to gauge comprehension accurately. Future research could include pre- and post-assessment questions to measure learning more effectively. Also, because the system does not track comprehension, the complexity of the returned responses is not customized; therefore, some participants found them to be too long and difficult to understand. Future work can utilize natural feedback, such as gaze [126], to track the learner’s attention and understanding, which can help adjust the subject’s difficulty. Similarly, nonverbal behaviors such as head movement and facial expressions can be used to estimate the learner’s interest [127]. We believe that more accurate information can be achieved through long-term studies and multiple sessions during which learners have more opportunities to interact and familiarize themselves with the agent.

A major source of criticism about the system was the unnaturalness of the synthesized voice, as we used the operating system’s built-in text-to-speech program for efficiency. The local solution had another disadvantage: the participants had to download the system to their computers and run it locally. In the future, we would like to provide text-to-speech as a built-in component and serve the tool as a web-based platform to reach a broader audience. In this study, we kept the voice selections limited to one male and one female voice. Experimenting with different voices and analyzing their effect on user perception could be a research direction.

The thematic analysis indicates that individual differences across user preferences were prominent. A future direction is to employ agents customized for each user so that they have compatible personalities. Such selections can potentially increase the user’s trust and willingness to listen [62]. In addition, we only tested two opposite personality combinations to keep the study duration and number of conditions feasible. More nuanced variations and incorporating other personality factors, such as conscientiousness, could provide more detailed information.

A possible extension to our system is to adaptively generate agent animations to fit the generated dialogue. Currently, we use the same set of gestures to complement speech. Online gesture synthesis techniques that take text or audio as input can help overcome the problems related to naturalness, thus increasing engagement and learning quality.

8 Conclusion

This paper presents a conversational system and user study to explore how personality perception and embodiment affect learning outcomes. Using GPT 3.5 and realistic 3D human models, we created agents expressing high and low agreeableness and extroversion variations through dialogue and animation cues. We designed three types of agents: a disembodied agent expressing personality through dialogue, one expressing personality only through dialogue, and an embodied agent expressing personality through dialogue and animation. We conducted a three-by-two independent-subjects user study with three agent models and the two personality variations, where each participant was asked to converse with an agent on a complex subject to learn about it. After the conversation, participants rated their version of the system based on their perceived personality of the agent and learning, quality, and engagement of the learning experience. The results indicate that regardless of the model choice, the whole experience was rated favorably in general, and participants judged the agents as high in openness, conscientiousness, extroversion, agreeableness, and low in neuroticism. However, the degree of positive perception was lower in low-trait personality styles than in high-trait ones. Although the engagement score was higher for the embodied agent with expressive animations, we found no significant differences across the models for other learning outcomes. We hope that the findings of this work inspire future studies to utilize expressive animation and dialogue cues to improve the overall experience of educational applications with conversational agents.

Acknowledgments

This research is supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 122E123.

References

  • [1] W. Swartout, R. Artstein, E. Forbell, S. Foutz, H. C. Lane, B. Lange, J. F. Morie, A. S. Rizzo, and D. Traum, “Virtual humans for learning,” AI Magazine, vol. 34, no. 4, pp. 13–30, 2013.
  • [2] H. G. Murray, J. P. Rushton, and S. V. Paunonen, “Teacher personality traits and student instructional ratings in six types of university courses.,” Journal of Educational Psychology, vol. 82, no. 2, p. 250, 1990.
  • [3] A. Furnham and T. Chamorro-Premuzic, “Individual differences in students’ preferences for lecturers’ personalities,” Journal of Individual Differences, vol. 26, no. 4, pp. 176–184, 2005.
  • [4] T. Chamorro-Premuzic, A. Furnham, A. N. Christopher, J. Garwood, and G. N. Martin, “Birds of a feather: Students’ preferences for lecturers’ personalities as predicted by their own personality and learning approaches,” Personality and Individual Differences, vol. 44, no. 4, pp. 965–976, 2008.
  • [5] L. E. Kim and C. MacCann, “What is students’ ideal university instructor personality? an investigation of absolute and relative personality preferences,” Personality and Individual Differences, vol. 102, pp. 190–203, 2016.
  • [6] P. T. Costa Jr and R. R. McCrae, “The five-factor model of personality and its relevance to personality disorders,” Journal of Personality Disorders, vol. 6, no. 4, pp. 343–359, 1992.
  • [7] S. Tan, A. Mansi, and A. Furnham, “Students’ preferences for lecturers’ personalities,” Journal of Further and Higher Education, vol. 42, no. 3, pp. 429–438, 2018.
  • [8] K. Kim, L. Boelling, S. Haesler, J. Bailenson, G. Bruder, and G. F. Welch, “Does a digital assistant need a body? the influence of visual embodiment and social behavior on the perception of intelligent virtual agents in AR,” in Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, ISMAR ’18, IEEE, 2018.
  • [9] N. L. Schroeder, O. O. Adesope, and R. B. Gilbert, “How effective are pedagogical agents for learning? a meta-analytic review,” Journal of Educational Computing Research, vol. 49, no. 1, pp. 1–39, 2013.
  • [10] G. B. Petersen, A. Mottelson, and G. Makransky, “Pedagogical agents in educational VR: An in the wild study,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’21, (New York, NY, USa), ACM, 2021.
  • [11] S. Sonlu, U. Güdükbay, and F. Durupinar, “A conversational agent framework with multi-modal personality expression,” ACM Transactions on Graphics, vol. 40, no. 1, Article no. 7, 16 pages, 2021.
  • [12] S. T. Völkel and L. Kaya, “Examining user preference for agreeableness in chatbots,” in Proceedings of the 3rd Conference on Conversational User Interfaces, CUI ’21, (New York, NY, USA), Association for Computing Machinery, 2021. Article no. 38, 6 pages.
  • [13] B. Mehra, “Chatbot personality preferences in Global South urban English speakers,” Social Sciences & Humanities Open, vol. 3, no. 1, Article no. 100131, 9 pages, 2021.
  • [14] M. Aebersold, “The history of simulation and its impact on the future,” AACN Advanced Critical Care, vol. 27, no. 1, pp. 56–61, 2016.
  • [15] J. Troutner, “The historical evolution of educational software,” tech. rep., US Department of Education, Institute of Education Sciences, Education Resources Information Center (ERIC) Publications, 1991. Available at https://eric.ed.gov/?id=ED349936.
  • [16] S. Heidig and G. Clarebout, “Do pedagogical agents make a difference to student motivation and learning?,” Educational Research Review, vol. 6, no. 1, pp. 27–54, 2011.
  • [17] A. Gulz, M. Haake, A. Silvervarg, B. Sjödén, and G. Veletsianos, “Building a social conversational pedagogical agent: Design challenges and methodological approaches,” in Conversational agents and natural language interaction: Techniques and effective practices, pp. 128–155, IGI Global, 2011.
  • [18] F. Weber, T. Wambsganss, D. Rüttimann, and M. Söllner, “Pedagogical agents for interactive learning: A taxonomy of conversational agents in education,” in ICIS, 2021.
  • [19] A. L. Baylor and Y. Kim, “Simulating instructional roles through pedagogical agents,” International Journal of Artificial Intelligence in Education, vol. 15, no. 2, pp. 95–115, 2005.
  • [20] R. Yılmaz and E. Kılıç-Çakmak, “Educational interface agents as social models to influence learner achievement, attitude and retention of learning,” Computers & Education, vol. 59, no. 2, pp. 828–838, 2012.
  • [21] L. Lin, P. Ginns, T. Wang, and P. Zhang, “Using a pedagogical agent to deliver conversational style instruction: What benefits can you obtain?,” Computers & Education, vol. 143, 2020. Article no. 103658, 11 pages.
  • [22] J. C. Castro-Alonso, R. M. Wong, O. O. Adesope, and F. Paas, “Effectiveness of multimedia pedagogical agents predicted by diverse theories: A meta-analysis,” Educational Psychology Review, vol. 33, pp. 989–1015, 2021.
  • [23] R. E. Mayer and L. Fiorella, “12 principles for reducing extraneous processing in multimedia learning: Coherence, signaling, redundancy, spatial contiguity, and temporal contiguity principles,” in The Cambridge Handbook of Multimedia Learning, vol. 279, pp. 279–315, New York, NY, USA: Cambridge University Press, 2014.
  • [24] K. L. Nowak and F. Biocca, “The effect of the agency and anthropomorphism on users’ sense of telepresence, copresence, and social presence in virtual environments,” Presence: Teleoperators & Virtual Environments, vol. 12, no. 5, pp. 481–494, 2003.
  • [25] L. Zhang, X. Hu, F. Andrasik, and S. Feng, “Benefits and potential issues for intelligent tutoring systems and pedagogical agents,” in The Frontlines of Artificial Intelligence Ethics, pp. 84–101, Routledge, 2022.
  • [26] R. O. Davis, “The impact of pedagogical agent gesturing in multimedia learning environments: A meta-analysis,” Educational Research Review, vol. 24, pp. 193–209, 2018.
  • [27] J. Wang and P. D. Antonenko, “Instructor presence in instructional video: Effects on visual attention, recall, and perceived learning,” Computers in Human Behavior, vol. 71, pp. 79–89, 2017.
  • [28] M. Mori, K. F. MacDorman, and N. Kageki, “The uncanny valley [from the field],” IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 98–100, 2012.
  • [29] M. Paetzel, G. Perugia, and G. Castellano, “The influence of robot personality on the development of uncanny feelings towards a social robot,” Computers in Human Behavior, vol. 120, 2021. Article no. 106756, 17 pages.
  • [30] C. D. Kloos, C. Catálan, P. J. Muñoz-Merino, and C. Alario-Hoyos, “Design of a conversational agent as an educational tool,” in Proceedings of the Learning with MOOCS, LWMOOCS ’18, (Piscataway, NJ, USA), pp. 27–30, IEEE, 2018.
  • [31] T. Terzidou and T. Tsiatsos, “The impact of pedagogical agents in 3d collaborative serious games,” in Proceedings of the IEEE Global Engineering Education Conference, EDUCON ’14, (Piscataway, NJ, USA), pp. 1175–1182, IEEE, 2014.
  • [32] A. Lippert, K. Shubeck, B. Morgan, A. Hampton, and A. Graesser, “Multiple agent designs in conversational intelligent tutoring systems,” Technology, Knowledge and Learning, vol. 25, no. 3, pp. 443–463, 2020.
  • [33] M. A. Zielke, D. Zakhidov, T. Lo, S. D. Craig, R. Rege, H. Pyle, N. V. Meer, and N. Kuo, “Exploring social learning in collaborative augmented reality with pedagogical agents as learning companions,” International Journal of Human-Computer Interaction, pp. 1–26, 2024.
  • [34] A. Atif, M. Jha, D. Richards, and A. A. Bilgin, “Artificial intelligence (AI)-enabled remote learning and teaching using pedagogical conversational agents and learning analytics,” in Intelligent Systems and Learning Data Analytics in Online Education, pp. 3–29, Elsevier, 2021.
  • [35] L. Brown, R. Kerwin, and A. M. Howard, “Applying behavioral strategies for student engagement using a robotic educational agent,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, SMC ’13, (Piscataway, NJ, USA), pp. 4360–4365, IEEE, 2013.
  • [36] N. Adamo, B. Benes, R. E. Mayer, X. Lei, Z. Wang, Z. Meyer, and A. Lawson, “Multimodal affective pedagogical agents for different types of learners,” in Intelligent Human Systems Integration 2021: Proceedings of the 4th International Conference on Intelligent Human Systems Integration (IHSI 2021): Integrating People and Intelligent Systems, February 22-24, 2021, Palermo, Italy, (Berlin, Germany), pp. 218–224, Springer, 2021.
  • [37] A. Terracina, R. Berta, F. Bordini, R. Damilano, and M. Mecella, “Teaching STEM through a role-playing serious game and intelligent pedagogical agents,” in Proceedings of the IEEE 16th International Conference on Advanced Learning Technologies, ICALT ’16, (Piscataway, NJ, USA), pp. 148–152, IEEE, 2016.
  • [38] T. Carlotto and P. A. Jaques, “The effects of animated pedagogical agents in an English-as-a-foreign-language learning environment,” International Journal of Human-Computer Studies, vol. 95, pp. 15–26, 2016.
  • [39] E. G. Poitras and S. P. Lajoie, “Developing an agent-based adaptive system for scaffolding self-regulated inquiry learning in history education,” Educational Technology Research and Development, vol. 62, pp. 335–366, 2014.
  • [40] A. Khokhar and C. Borst, “Modifying pedagogical agent spatial guidance sequences to respond to eye-tracked student gaze in VR,” in Proceedings of the ACM Symposium on Spatial User Interaction, SUI ’22, (New York, NY, USA), pp. 1–12, ACM, 2022. Article no. 15, 12 pages.
  • [41] J. Ryu and A. L. Baylor, “The psychometric structure of pedagogical agent persona,” Technology Instruction Cognition and Learning, vol. 2, no. 4, p. 291, 2005.
  • [42] K. Perlin, “Real time responsive animation with personality,” IEEE Transactions on Visualization and Computer Graphics, vol. 1, no. 01, pp. 5–15, 1995.
  • [43] E. André, M. Klesen, P. Gebhard, S. Allen, T. Rist, et al., “Exploiting models of personality and emotions to control the behavior of animated interactive agents,” in Proceedings of the Fourth International Conference on Autonomous Agents, AGENTS ’00, (New York, NY, USA), pp. 3–7, Association for Computing Machinery, 2000.
  • [44] M. Saberi, A Computational Framework for Expressive, Personality-based, Non-verbal Behaviour for Affective 3D Character Agents. PhD thesis, Simon Fraser University, 2016.
  • [45] M. Saberi, S. DiPaola, and U. Bernardet, “Expressing personality through non-verbal behaviour in real-time interaction,” Frontiers in Psychology, vol. 12, Article no. 660895, 19 pages, 2021.
  • [46] J. Allbeck and N. Badler, “Toward representing agent behaviors modified by personality and emotion,” in Embodied Conversational Agents at AAMAS, 2002.
  • [47] F. Durupinar, M. Kapadia, S. Deutsch, M. Neff, and N. I. Badler, “Perform: Perceptual approach for adding OCEAN personality to human motion using Laban Movement Analysis,” ACM Transactions on Graphics, vol. 36, no. 1, 2016.
  • [48] A. Yurtoğlu, S. Sonlu, Y. Doğan, and U. Güdükbay, “Personality perception in human videos altered by motion transfer networks,” Computers & Graphics, vol. 119, Article no. 103886, 11 pages, 2024.
  • [49] E. Zell, K. Zibrek, and R. McDonnell, “Perception of virtual characters,” in ACM Siggraph 2019 Courses, (New York, NY, USA), Association for Computing Machinery, 2019.
  • [50] K. Zibrek and R. McDonnell, “Does render style affect perception of personality in virtual humans?,” in Proceedings of the ACM Symposium on Applied Perception, SAP ’14, (New York, NY, USA), pp. 111–115, Association for Computing Machinery, 2014.
  • [51] S. Thomas, Y. Ferstl, R. McDonnell, and C. Ennis, “Investigating how speech and animation realism influence the perceived personality of virtual characters and agents,” in Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, VR ’22, (Piscataway, NJ, USA), pp. 11–20, IEEE, 2022.
  • [52] S. Branham, “Creating physical personalities for agents with faces: Modeling trait impressions of the face,” in Proceedings of the UM2001 Workshop on Attitudes, Personality and Emotions in User-Adapted Interactions, (Bari, Italy), pp. 1–7, University of Bari, 2001.
  • [53] V. Swami, T. Buchanan, A. Furnham, and M. J. Tovée, “Five-factor personality correlates of perceptions of women’s body sizes,” Personality and Individual Differences, vol. 45, no. 7, pp. 697–699, 2008.
  • [54] Y. Hu, C. J. Parde, M. Q. Hill, N. Mahmood, and A. J. O’Toole, “First impressions of personality traits from body shapes,” Psychological Science, vol. 29, no. 12, pp. 1969–1983, 2018.
  • [55] K. Legde and D. W. Cunningham, “Evaluating the effect of clothing and environment on the perceived personality of virtual avatars,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA ’19, (New York, NY, USa), pp. 206–208, Association for Computing Machinery, 2019.
  • [56] A. L. Jones, R. S. Kramer, and R. Ward, “Signals of personality and health: the contributions of facial shape, skin texture, and viewing angle,” Journal of Experimental Psychology: Human Perception and Performance, vol. 38, no. 6, pp. 1353–1361, 2012.
  • [57] F. Durupinar, N. Pelechano, J. Allbeck, U. Güdükbay, and N. I. Badler, “How the Ocean personality model affects the perception of crowds,” IEEE Computer Graphics and Applications, vol. 31, no. 3, pp. 22–31, 2009.
  • [58] M. Kapadia, A. Shoulson, F. Durupinar, and N. I. Badler, “Authoring multi-actor behaviors in crowds with diverse personalities,” in Modeling, Simulation and Visual Analysis of Crowds: A Multidisciplinary Perspective, pp. 147–180, New York, NY, USA: Springer, 2013.
  • [59] F. Durupinar, U. Güdükbay, A. Aman, and N. I. Badler, “Psychological parameters for crowd simulation: From audiences to mobs,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 09, pp. 2145–2159, 2016.
  • [60] A. Wardhani, B. Pham, and W. Su, “Personality and emotion-based high-level control of affective story characters,” IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 02, pp. 281–293, 2007.
  • [61] J. Bahamon and R. Young, “An empirical evaluation of a generative method for the expression of personality traits through action choice,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE ’17, (Washington, DC, USA), pp. 144–150, Association for the Advancement of Artificial Intelligence, 2017.
  • [62] M. X. Zhou, G. Mark, J. Li, and H. Yang, “Trusting virtual agents: The effect of personality,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 9, no. 2-3, pp. 1–36, 2019.
  • [63] A. Adkins, A. Normoyle, L. Lin, Y. Sun, Y. Ye, M. Di Luca, and S. Jörg, “How important are detailed hand motions for communication for a virtual character through the lens of charades?,” ACM Transactions on Graphics, vol. 42, no. 3, Article no. 27, 16 pages, 2023.
  • [64] Y. Wang, J. E. F. Tree, M. Walker, and M. Neff, “Assessing the impact of hand motion on virtual character personality,” ACM Transactions on Applied Perception, vol. 13, no. 2, Article no. 9, 23 pages, 2016.
  • [65] M. Neff, Y. Wang, R. Abbott, and M. Walker, “Evaluating the effect of gesture and language on personality perception in conversational agents,” in Proceedings of the 10th International Conference on Intelligent Virtual Agents, vol. 6356 of IVA ’10, Lecture Notes in Computer Science, (Berlin, Heidelberg, Germany), pp. 222–235, Springer, 2010.
  • [66] F. Mairesse and M. A. Walker, “Can conversational agents express big five personality traits through language?: Evaluating a psychologically-informed language generator,” tech. rep., Cambridge University, Department of Engineering, Cambridge & Sheffield, UK, 2009.
  • [67] C. H. Lee, K. Kim, Y. S. Seo, and C. K. Chung, “The relations between personality and language use,” The Journal of General Psychology, vol. 134, no. 4, pp. 405–413, 2007.
  • [68] J. Golbeck, C. Robles, M. Edmondson, and K. Turner, “Predicting personality from Twitter,” in Proceedings of the IEEE Third International Conference on Privacy, Security, Risk and Trust/IEEE Third International Conference on Social Computing, PASSAT/SocialCom 2011, (Los Alamitos, CA, USA), pp. 149–156, IEEE Computer Society, 2011.
  • [69] G. Park, H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, M. Kosinski, D. J. Stillwell, L. H. Ungar, and M. E. Seligman, “Automatic personality assessment through social media language,” Journal of Personality and Social Psychology, vol. 108, no. 6, pp. 934–952, 2015.
  • [70] T. Polzehl, S. Möller, and F. Metze, “Automatically assessing personality from speech,” in Proceedings of the IEEE Fourth International Conference on Semantic Computing, ICSC ’10, (Piscataway, NJ, USA), pp. 134–140, IEEE, 2010.
  • [71] S. Möller, T. Polzehl, and F. Metze, “Automatically assessing personality from speech,” in Proceedings of the IEEE Sixth International Conference on Semantic Computing, ICSC ’12, (Los Alamitos, CA, USA), pp. 134–140, IEEE Computer Society, 2010.
  • [72] J.-I. Biel, L. Teijeiro-Mosquera, and D. Gatica-Perez, “FaceTube: predicting personality from facial expressions of emotion in online conversational video,” in Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI ’12’, (New York, NY, USA), pp. 53–56, Association for Computing Machinery, 2012.
  • [73] J. Sun, P. Wu, Y. Shen, Z. Yang, H. Li, Y. Liu, T. Zhu, L. Li, K. Zhang, and M. Chen, “Relationship between personality and gait: predicting personality with gait features,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, BIBM ’18, (Los Alamitos, CA, USA), pp. 1227–1231, IEEE Computer Society, 2018.
  • [74] S. Sonlu, Y. Doğan, A. U. Ergüzen, M. E. Ünalan, S. Demirci, F. Durupinar, and U. Güdükbay, “Towards understanding personality expression via body motion,” in Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, Workshop on Multi-modal Affective and Social Behavior Analysis and Synthesis in Extended Reality, MASSXR ’24, (Piscataway, NJ, USA), pp. 628–631, IEEE, 2024.
  • [75] A. Vinciarelli and G. Mohammadi, “A survey of personality computing,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 273–291, 2014.
  • [76] M. Skowron, M. Tkalčič, B. Ferwerda, and M. Schedl, “Fusing social media cues: personality prediction from Twitter and Instagram,” in Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, (Republic and Canton of Geneva, CHE), pp. 107–108, International World Wide Web Conferences Steering Committee, 2016.
  • [77] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung, “Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers (I. Gurevych and Y. Miyao, eds.), (Stroudsburg, PA, USA), pp. 606–611, Association for Computational Linguistics, July 2018.
  • [78] J. tse Huang, W. Wang, M. H. Lam, E. J. Li, W. Jiao, and M. R. Lyu, “Revisiting the reliability of psychological scales on large language models,” arXiv:2305.19926 [cs.CL], 2023.
  • [79] H. Gu, C. Degachi, U. Genç, S. Chandrasegaran, and H. Verma, “On the effectiveness of creating conversational agent personalities through prompting,” arXiv preprint arXiv:2310.11182, 2023.
  • [80] G. Serapio-García, M. Safdari, C. Crepy, L. Sun, S. Fitz, P. Romero, M. Abdulhai, A. Faust, and M. Matarić, “Personality traits in large language models,” arXiv:2307.00184 [cs.CL], 2023.
  • [81] Y. Mehta, S. Fatehi, A. Kazameini, C. Stachl, E. Cambria, and S. Eetemadi, “Bottom-up and top-down: Predicting personality with psycholinguistic and language model features,” in Proceedings of the IEEE International Conference on Data Mining, ICDM ’20, (Piscataway, NJ, USA), pp. 1184–1189, IEEE, 2020.
  • [82] S. R. Karra, S. T. Nguyen, and T. Tulabandhula, “Estimating the personality of white-box language models,” arXiv:2204.12000 [cs.CL], 2023.
  • [83] F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars,” ACM Transactions on Graphics, vol. 41, no. 4, Article no. 161, 19 pages, 2022.
  • [84] Z. Qing, Z. Cai, Z. Yang, and L. Yang, “Story-to-motion: Synthesizing infinite and controllable character animation from long text,” in Proceedings of SIGGRAPH Asia, Technical Communications, pp. 1–4, New York, NY, USA: ACM, 2023.
  • [85] A. Normoyle, J. Sedoc, and F. Durupinar, “Using LLMs to animate interactive story characters with emotions and personality,” in Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW ’24, (Piscataway, NJ, USA), pp. 632–635, IEEE, 2024.
  • [86] V. Kumaran, J. Rowe, B. Mott, and J. Lester, “SCENECRAFT: Automating interactive narrative scene generation in digital games with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE ’23, pp. 86–96, Association for the Advancement of Artificial Intelligence, 2023.
  • [87] S. S. Sohn, D. Li, S. Zhang, C.-J. Chang, and M. Kapadia, “From words to worlds: Transforming one-line prompt into immersive multi-modal digital stories with communicative LLM agent,” arXiv preprint arXiv:2406.10478, 2024.
  • [88] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3D-LLM: Injecting the 3D world into large language models,” in Advances in Neural Information Processing Systems, vol. 36 of NeurIPS ’23, (San Francisco, CA, USA), pp. 20482–20494, Curran Associates, Inc., 2023.
  • [89] Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3D: Data-efficiently tuning large language model for universal dialogue of 3D scenes,” arXiv:2308.08769, [cs.CV], 2023.
  • [90] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., “ChatGPT for good? On opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, Article no. 102274, 9 pages, 2023.
  • [91] O. Topsakal and E. Topsakal, “Framework for a foreign language teaching software for children utilizing AR, Voicebots and ChatGPT (Large Language Models),” The Journal of Cognitive Systems, vol. 7, no. 2, pp. 33–38, 2022.
  • [92] R. Yang, T. F. Tan, W. Lu, A. J. Thirunavukarasu, D. S. W. Ting, and N. Liu, “Large language models in health care: Development, applications, and challenges,” Health Care Science, vol. 2, no. 4, pp. 255–263, 2023.
  • [93] L. Villa, D. Carneros-Prado, A. Sánchez-Miguel, C. C. Dobrescu, and R. Hervás, “Conversational agent development through large language models: Approach with GPT,” in Proceedings of the International Conference on Ubiquitous Computing and Ambient Intelligence, Lecture Notes in Networks and Systems, vol. 835 of UCAmI ’23, (Cham, Switzerland), pp. 286–297, Springer Nature, 2023.
  • [94] N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui, “From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models,” arXiv preprint arXiv:2401.02777, 2024.
  • [95] J. Lester, K. Branting, and B. Mott, “Conversational agents,” in The Practical Handbook of Internet Computing, pp. 220–240, Boca Raton, FL, USA: Chapman & Hall/CRC, 2004.
  • [96] G. Ball and J. Breese, “Emotion and personality in a conversational agent,” in Embodied Conversational Agents (J. Cassell, J. Sullivan, S. Prevost, and E. F. Churchill, eds.), p. 189–219, Cambridge, MA, USA: MIT Press, 2000.
  • [97] R. M. Schuetzler, G. M. Grimes, and J. S. Giboney, “An investigation of conversational agent relevance, presence, and engagement,” in Proceedings of the Twenty-fourth Americas Conference on Information Systems, AMCIS ’18, (Atlanta, GA, USA), Association for Information Systems, 2018.
  • [98] M. Thiebaux, S. Marsella, A. N. Marshall, and M. Kallmann, “SmartBody: Behavior realization for embodied conversational agents,” in Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, vol. 1 of AAMAS ’08, (Richland, SC, USA), pp. 151–158, International Foundation for Autonomous Agents and Multiagent Systems, 2008.
  • [99] S. Kopp and I. Wachsmuth, “Synthesizing multimodal utterances for conversational agents,” Computer Animation and Virtual Worlds, vol. 15, no. 1, pp. 39–52, 2004.
  • [100] S. Nyatsanga, T. Kucherenko, C. Ahuja, G. E. Henter, and M. Neff, “A comprehensive review of data-driven co-speech gesture generation,” Computer Graphics Forum, vol. 42, no. 2, pp. 569–596, 2023.
  • [101] S. Qian, Z. Tu, Y. Zhi, W. Liu, and S. Gao, “Speech drives templates: Co-speech gesture synthesis with learned templates,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV ’21, (Piscataway, NJ, USA), pp. 11077–11086, IEEE, 2021.
  • [102] U. Bhattacharya, E. Childs, N. Rewkowski, and D. Manocha, “Speech2AffectiveGestures: Synthesizing co-speech gestures with generative adversarial affective expression learning,” in Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, (New York, NY, USA), pp. 2027–2036, ACM, 2021.
  • [103] T. Ao, Q. Gao, Y. Lou, B. Chen, and L. Liu, “Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings,” ACM Transactions on Graphics, vol. 41, no. 6, Article no. 209, 19 pages, 2022.
  • [104] U. Bhattacharya, A. Bera, and D. Manocha, “Speech2UnifiedExpressions: Synchronous synthesis of co-speech affective face and body expressions from affordable inputs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR ’24, pp. 1877–1887, 2024.
  • [105] Y. Yoon, K. Park, M. Jang, J. Kim, and G. Lee, “SGToolkit: An interactive gesture authoring toolkit for embodied conversational agents,” in Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, UIST ’21, (New York, NY, USA), pp. 826–840, ACM, 2021.
  • [106] D. Aneja, R. Hoegen, D. McDuff, and M. Czerwinski, “Understanding conversational and expressive style in a multimodal embodied conversational agent,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’21, Article no. 102, 10 pages, (New York, NY, USA), ACM, 2021.
  • [107] R. Ishii, Y. I. Nakano, and T. Nishida, “Gaze awareness in conversational agents: Estimating a user’s conversational engagement from eye gaze,” ACM Transactions on Interactive Intelligent Systems, vol. 3, no. 2, pp. 1–25, 2013.
  • [108] L. Tudor Car, D. A. Dhinagaran, B. M. Kyaw, T. Kowatsch, S. Joty, Y.-L. Theng, and R. Atun, “Conversational agents in health care: scoping review and conceptual analysis,” Journal of Medical Internet Research, vol. 22, no. 8, p. e17158, 2020.
  • [109] Unity Technologies, “Unity’s Real-time 3D Development Engine.” Available at https://unity.com/products/unity-engine, 2024. Accessed: 2024-06-22.
  • [110] P. Ekman, W. V. Friesen, and J. C. Hager, Facial Action Coding System. Research Nexus division of Network Information Research Corporation, 2002.
  • [111] Meta Quest, “Oculus Lipsync.” https://developer.oculus.com/downloads/package/oculus-lipsync-unity/, 2019. Accessed: 2024-06-22.
  • [112] Reallusion, Inc., “Reallusion Character Creator (CC).” https://www.reallusion.com/character-creator/, 2024. Accessed: 2024-06-22.
  • [113] IBM, “IBM Watson API.” https://www.ibm.com/watson, 2015. Accessed: 2024-06-22.
  • [114] OpenAI, “GPT-3.5 API,” 2023. Last accessed on 06/23/2024.
  • [115] M. Dynel, “Lessons in linguistics with chatgpt: Metapragmatics, metacommunication, metadiscourse and metalanguage in human-ai interactions,” Language & Communication, vol. 93, pp. 107–124, 2023.
  • [116] C. J. Soto and O. P. John, “Short and extra-short forms of the Big Five Inventory–2: The BFI-2-S and BFI-2-XS,” Journal of Research in Personality, vol. 68, pp. 69–81, 2017.
  • [117] R. H. Kay and L. Knaack, “Assessing learning, quality and engagement in learning objects: the Learning Object Evaluation Scale for Students (LOES-S),” Educational Technology Research and Development, vol. 57, pp. 147–168, 2009.
  • [118] S. Cairncross and M. Mannion, “Interactive multimedia and learning: Realizing the benefits,” Innovations in Education and Teaching International, vol. 38, no. 2, pp. 156–164, 2001.
  • [119] J. M. Harackiewicz, J. L. Smith, and S. J. Priniski, “Interest matters: The importance of promoting interest in education,” Policy insights from the behavioral and brain sciences, vol. 3, no. 2, pp. 220–227, 2016.
  • [120] M. E. Latoschik, D. Roth, D. Gall, J. Achenbach, T. Waltemate, and M. Botsch, “The effect of avatar realism in immersive social virtual realities,” in Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology, VRST ’17, (New York, NY, USA), Association for Computing Machinery, 2017. Article no. 39, 10 pages.
  • [121] V. Salehi and F. T. Nia, “Effect of levels of realism in mobile-based pedagogical agents on health e-learning.,” Future of Medical Education Journal, vol. 9, no. 2, 2019.
  • [122] J. Gratch, N. Wang, A. Okhmatovskaia, F. Lamothe, M. Morales, R. J. van der Werf, and L.-P. Morency, “Can virtual humans be more engaging than real ones?,” in Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments: 12th International Conference, HCI International 2007, Beijing, China, July 22-27, 2007, Proceedings, Part III 12, pp. 286–297, Springer, 2007.
  • [123] I. Miguel-Alonso, D. Checa, H. Guillen-Sanz, and A. Bustillo, “Evaluation of the novelty effect in immersive virtual reality learning experiences,” Virtual Reality, vol. 28, no. 1, p. 27, 2024.
  • [124] A. L. Baylor, “Promoting motivation with virtual agents and avatars: role of visual presence and appearance,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3559–3565, 2009.
  • [125] L. E. Kim, A. E. Poropat, and C. MacCann, “Conscientiousness in education: Its conceptualization, assessment, and utility,” Psychosocial skills and school systems in the 21st century: Theory, research, and practice, pp. 155–185, 2016.
  • [126] A. Arslanyilmaz and J. Sullins, “Eye-gaze data to measure students’ attention to and comprehension of computational thinking concepts,” International Journal of Child-Computer Interaction, vol. 38, Article no. 100414, 15 pages, 2023.
  • [127] K. Nakamura, K. Kakusho, T. Shoji, and M. Minoh, “Investigation of a method to estimate learners’ interest level for agent-based conversational e-learning,” in Advances in Computational Intelligence: 14th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2012, Catania, Italy, July 9-13, 2012. Proceedings, Part II 14, pp. 425–433, Springer, 2012.