The Better Angels of Machine Personality: How Personality Relates to LLM Safety

Jie Zhang1,2⋆, Dongrui Liu1⋆, Chen Qian1,3⋆, Ziyue Gan1,4,5,
Yong Liu3, Yu Qiao1, Jing Shao1
1 Shanghai Artificial Intelligence Laboratory
2 University of Chinese Academy of Sciences  3 Renmin University of China
4 Department of Philosophy, Xi’an Jiaotong University
5 Computational Philosophy Lab, Xi’an Jiaotong University
[email protected]  [email protected]  {liudongrui, shaojing}@pjlab.org.cn
Abstract

Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between personality traits and safety abilities in LLMs remains a mystery. In this paper, we discover that LLMs’ personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale. Meanwhile, safety alignment generally increases various LLMs’ Extraversion, Sensing, and Judging traits. Based on these findings, we can edit LLMs’ personality traits and improve their safety performance, e.g., inducing personality from ISTJ to ISTP resulted in a relative improvement of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreaks. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.

⋆ Equal contribution     Corresponding author

1 Introduction

What you resist not only persists, but will grow in size. — Carl Jung

As LLMs become more powerful and prevalent, interacting with humans in a variety of contexts, it becomes increasingly important to understand and describe LLMs from a social science perspective, particularly through psychology [87; 54; 1; 88]. Recent studies show that LLMs actually exhibit personalities [87; 54; 88; 39], and that personality could affect the theory-of-mind reasoning of models [92]. Therefore, editing LLMs’ personality traits to control their outputs [58; 102] is valuable for various applications, e.g., it can support role-playing by creating personalized chatbots to enhance user experience [83; 95; 100], and it can also involve developing human-like social robots to empower research on the evolution of human behavior [71; 90; 32].

Personality psychologists have established the relationship between different personality traits and other variables in human society [80; 47; 11; 33]. Specifically, some studies investigate the relationship between personality and safety motivation [66; 44], while others analyze the personalities of different people in actual workplace safety [7; 93; 105; 74]. These findings from personality psychology provide valuable insights into understanding the relationship between LLMs’ personality and safety.

LLM safety and alignment with human values have emerged as a key challenge [4; 38]. Although previous research has explored various perspectives, including optimizing LLMs based on human preferences [110; 69; 6; 78; 46; 106] and self-alignment [62; 79; 56], the personality psychology perspective has been overlooked. Research on LLMs’ personalities has already benefited role-playing and social agents [95; 32], and we believe that studying LLM safety from a personality perspective can also contribute to AI safety and alignment.

Our study explores the close relationship between LLMs’ personality traits and safety capabilities. In LLMs’ personality assessment, the output is influenced by the format of the input, including the language, the option labels of the questions [50; 94; 18], and the user instructions [53; 81]. To mitigate these influences, we select the optimal setting for each factor. Using these settings, we conduct multiple assessments to measure LLMs’ personalities across different model sizes, ensuring the reliability of the MBTI results.

Based on the reliable personality results, we first investigate the relation between personality traits and performance in safety capabilities. We find that alignment typically results in more Extraversion, Sensing, and Judging traits, while models exhibiting more Extraversion, iNtuition, and Feeling traits are more susceptible to jailbreaks. Considering the trade-offs between different safety capabilities in LLMs [43; 57; 85; 104], we analyze each safety capability independently, i.e., toxicity, fairness, and privacy. Specifically, we investigate the relationship between a single safety capability and personality. Our study reveals specific relationships between personality traits and safety capabilities, e.g., models with stronger Perceiving traits exhibit superior fairness performance.

Based on these findings, we then edit specific personality traits in a controllable way to enhance the model’s safety capabilities, e.g., inducing an LLM’s personality from ISTJ to ISTP via steering vectors resulted in a relative improvement of approximately 43% and 10% in privacy and fairness performance, respectively. We also controllably edit specific safety capabilities and observe the impact on personality traits, verifying the relationship between personality traits and LLM safety.

Figure 1: Investigating and utilizing the relationship between LLMs’ personality traits and safety capabilities. We find that MBTI personality traits are closely related to LLM safety, and editing specific personalities in a controllable way can enhance the safety capability of LLMs.

This paper presents the first comprehensive study on the relation between LLMs’ personality and safety, and demonstrates that editing personality traits can enhance model safety capabilities. This supports the view that for AI-based decision support systems to be trusted, their design may have to consider people’s personality traits [84]. We do not claim that personality alone can ensure LLM safety, as psychologists state that personality influences behavior through a series of complex associations [22; 26]. However, we do believe that considering personality in LLM safety is promising: with further exploration and development, it can supplement comprehensive LLM safety.

2 Personality Traits in LLMs

Preliminary: Several studies have demonstrated that LLMs actually exhibit personalities [87; 54; 88; 39]. To gain a deeper understanding of LLM personality, researchers have used personality models from psychology to assist the study of LLM personality [37]. In particular, the Myers-Briggs Type Indicator (MBTI) scale [10] has been widely used to assess LLMs’ personality traits [72; 23; 1; 86]. The MBTI assesses individuals’ personalities across four dimensions: Extraversion-Introversion (E-I), Sensing-iNtuition (S-N), Thinking-Feeling (T-F), and Judging-Perceiving (J-P). In this study, we choose the most recent version of the MBTI assessment, namely MBTI-M [64], as our assessment scale. This scale consists of forced-choice questions with binary options, of which respondents must select one, making it easier to adapt for assessing LLMs’ personalities. Moreover, this scale is suitable for most LLMs, as it demands a minimum reading comprehension level equivalent to the seventh grade [63], and LLMs trained on extensive texts are capable of completing this task.

2.1 Optimal Selection of Factors Affecting MBTI Assessment

Research has shown that for multiple-choice questions, the output of LLMs is affected by the option order, exhibiting a preference for the first position [99; 108; 49]. We categorize changes of option order in the MBTI scale into two types: the first follows the settings of previous research [99] by exchanging option descriptions (i.e., changing A. Agree, B. Disagree to A. Disagree, B. Agree), and the second exchanges option labels while maintaining the order of descriptions (i.e., changing A. Agree, B. Disagree to B. Agree, A. Disagree). In the main paper, we discuss the option order that exchanges option descriptions; the option order that exchanges option labels is discussed in Appendix A.2.
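The two kinds of option-order change described above can be made concrete with a minimal sketch. The function names and the (label, description) tuple representation are illustrative assumptions, not taken from the paper’s code.

```python
def exchange_descriptions(options):
    """Keep labels in place and swap the descriptions.

    A. Agree, B. Disagree  ->  A. Disagree, B. Agree
    """
    labels = [label for label, _ in options]
    descriptions = [desc for _, desc in options]
    return list(zip(labels, reversed(descriptions)))


def exchange_labels(options):
    """Keep descriptions in place and swap the labels.

    A. Agree, B. Disagree  ->  B. Agree, A. Disagree
    """
    labels = [label for label, _ in options]
    descriptions = [desc for _, desc in options]
    return list(zip(reversed(labels), descriptions))


def render(options):
    """Format options the way they appear in a question."""
    return ", ".join(f"{label}. {desc}" for label, desc in options)


original = [("A", "Agree"), ("B", "Disagree")]
print(render(exchange_descriptions(original)))  # A. Disagree, B. Agree
print(render(exchange_labels(original)))        # B. Agree, A. Disagree
```

In the first variant the content seen in the first position changes; in the second, only the label attached to each position changes.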

To minimize the influence of option order in LLMs’ personality assessment, we analyze the impact of different settings of option labels, instructions, and language factors on the MBTI results. This analysis enables us to identify the optimal selections among these three factors that are less affected by option order.

  • Option Label. LLMs exhibit differential sensitivity to numeric and alphabetic labels [50; 94]. To investigate the influence of label type, this paper sets option labels in two forms: alphabets (e.g., A. Agree B. Disagree) and numbers (e.g., 1. Agree 2. Disagree), and examines their impact on the MBTI assessment results.

  • Instructions. The configuration of the instructions could affect the output of LLMs [53; 81]. To obtain stable and reliable assessment results, this study adopts a few-shot learning approach, providing two styles of instruction: (1) samples whose answer contains the option label and the corresponding description (i.e., Question: Artificial intelligence cannot have emotions. A. Agree, B. Disagree. Your answer: B. Disagree); (2) samples whose answer contains only the option label without the description (i.e., Question: Artificial intelligence cannot have emotions. A. Agree, B. Disagree. Your answer: B).

  • Language. Psychological research indicates that individuals may respond differently to personality scales in different cultural backgrounds [45; 70; 96; 17; 18; 2]. Therefore, this study extends this issue to LLMs, using both Chinese and English versions of the MBTI-M questionnaire to assess the personality results of LLMs in different cultural backgrounds.
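As a rough illustration of how the option-label and instruction factors combine, the sketch below builds a prompt with a single demonstration. The demonstration text is the example given above; the exact template, the number of shots, and the names `label_options` and `build_prompt` are assumptions made for illustration.

```python
def label_options(descriptions, label_type="number"):
    """Attach numeric (1, 2) or alphabetic (A, B) labels to two descriptions."""
    labels = ("1", "2") if label_type == "number" else ("A", "B")
    return list(zip(labels, descriptions))


def render(options):
    return ", ".join(f"{label}. {desc}" for label, desc in options)


def build_prompt(question, descriptions, label_type="number", with_description=True):
    """Build a one-shot prompt under the chosen label and instruction settings."""
    demo = label_options(["Agree", "Disagree"], label_type)
    demo_label, demo_desc = demo[1]  # the demonstration answers "Disagree"
    # Instruction style (1) shows label + description, style (2) the label only.
    demo_answer = f"{demo_label}. {demo_desc}" if with_description else demo_label
    shot = ("Question: Artificial intelligence cannot have emotions. "
            f"{render(demo)}. Your answer: {demo_answer}")
    options = label_options(descriptions, label_type)
    query = f"Question: {question} {render(options)}. Your answer:"
    return shot + "\n" + query


# "Hypothetical item." stands in for a scale question; it is not an MBTI-M item.
print(build_prompt("Hypothetical item.", ["Agree", "Disagree"]))
```

Switching `label_type` to `"alphabet"` or `with_description` to `False` yields the other variants compared in this section.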

Experiment settings. We randomly shuffle the option order in the MBTI scale before each assessment. For each factor, we assess the MBTI results under two variants and calculate the kappa coefficient [19]. We then compare the kappa coefficients among different settings for the same factor. A higher kappa coefficient indicates greater consistency in assessments across different option orders, thereby identifying the optimal selection of the factor for LLMs’ MBTI assessments.
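For reference, a minimal sketch of the agreement computation follows, assuming the kappa coefficient of [19] is Cohen’s kappa computed over item-level answers from two assessments. Answers are compared by the chosen description rather than by label or position, since the option order differs between runs. The answer lists are made-up illustrations.

```python
from collections import Counter


def cohen_kappa(run_a, run_b):
    """Cohen's kappa between two runs of the same questionnaire.

    run_a and run_b hold the chosen answer for each item, in item order.
    """
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run_a)
    # Observed agreement: share of items answered identically in both runs.
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    # Chance agreement from the marginal answer frequencies of each run.
    count_a, count_b = Counter(run_a), Counter(run_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both runs give one identical constant answer
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative answers only (not data from the paper).
run_a = ["Agree", "Agree", "Disagree", "Disagree", "Agree", "Disagree", "Agree", "Agree"]
run_b = ["Agree", "Disagree", "Disagree", "Disagree", "Agree", "Agree", "Agree", "Agree"]
print(round(cohen_kappa(run_a, run_b), 4))
```

A value of 1 means the two runs agree on every item, while values near 0 mean agreement is no better than chance, which is how the low entries in Table 1 should be read.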

Result analysis. For option labels, instructions, and language, we select numbers, detailed descriptions, and the Chinese MBTI version, respectively. Table 1 lists the kappa coefficients for various models in different settings with respect to the order of options (exchanging option descriptions). For most models, using numbers as option labels and including the option description in the few-shot instructions yield a higher kappa coefficient, indicating that “number” and “with description” are the better selections for these two factors. Additionally, the kappa coefficient on MBTI is comparable between the Chinese and English scales. In line with prior studies [72; 23; 37], this paper chooses the Chinese version of the MBTI-M, with numbers as option labels and instructions with descriptions, to assess MBTI across various LLMs.

Table 1: Kappa coefficient of the option order (exchange option descriptions) in LLMs’ MBTI assessment under three factors, respectively.
| Factor | Setting | Llama-2 | Llama-3 | Amber | Gemma | Mistral | Baichuan | Internlm | Internlm2 | Qwen | Qwen-1.5 | Yi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Option Label | number | 0.3071 | 0.1005 | 0.0333 | 0.0802 | 0.1369 | 0.409 | 0.0552 | 0.4614 | -0.042 | 0.1263 | 0.2248 |
| Option Label | alphabet | 0.168 | -0.0107 | 0.0176 | 0.303 | 0.121 | 0.2413 | 0.1985 | 0.0714 | 0.0917 | 0.0972 | 0.1618 |
| Instruction | w/ desc | 0.2084 | 0.136 | 0.0916 | 0.0655 | 0.1177 | 0.4172 | 0.0408 | 0.4794 | 0.046 | 0.0952 | 0.3028 |
| Instruction | w/o desc | -0.0349 | 0.0567 | 0.015 | 0.1618 | 0.1388 | 0.2103 | 0.0405 | 0.4385 | 0.1908 | 0.3138 | 0.1771 |
| Language | chinese | 0.2669 | 0.0958 | 0.0997 | 0.127 | 0.115 | 0.4343 | 0.0656 | 0.4861 | 0.0555 | 0.1097 | 0.126 |
| Language | english | 0.1659 | 0.2383 | 0.0193 | 0.1496 | 0.0721 | 0.1059 | 0.1361 | 0.3534 | 0.329 | 0.3421 | 0.2535 |

2.2 Reliability of MBTI through Multiple-time Assessments

We average multiple assessments to mitigate the impact of option order and obtain reliable MBTI results. As shown in Table 1, even with the optimal choice for the three factors, the kappa coefficients between different assessments remain low, indicating that it is challenging to obtain stable results from a single assessment. MBTI results are reliable after multiple-time assessments [13], so the core issue is selecting the appropriate number of assessments. We randomly shuffle the options in the scale before each assessment. Each model is assessed between 1 and 100 times, and the kappa coefficient is calculated for each number of assessments to evaluate reliability. As shown in Figure 2(a), different models have varying sensitivities to the number of assessments. For instance, models such as Llama-3-8b and GPT-3.5 achieve stable results with fewer assessments (fewer than 10), while models like Llama-2-7b and Internlm-7b require more (20-30). We observe that after 30 assessments, all models produce consistent results regardless of the option order. Therefore, we conduct 30 assessments of the MBTI-M scale in a random option order for each model.
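The averaging procedure can be sketched as follows. This is a toy simulation under stated assumptions: `answer_fn` stands in for querying a model, the items are placeholders rather than MBTI-M questions, and the two stub responders are hypothetical extremes used only to show why shuffling plus averaging helps.

```python
import random


def run_assessment(items, answer_fn, rng):
    """Administer the scale once with each item's options in random order.

    Returns the fraction of items on which the respondent chose the option
    keyed to the first pole of the dimension (e.g. Extraversion).
    """
    hits = 0
    for question, pole_option, other_option in items:
        options = [pole_option, other_option]
        rng.shuffle(options)  # random option order before each assessment
        if answer_fn(question, options) == pole_option:
            hits += 1
    return hits / len(items)


def averaged_score(items, answer_fn, n_runs=30, seed=0):
    """Average the per-run score over n_runs independently shuffled runs."""
    rng = random.Random(seed)
    scores = [run_assessment(items, answer_fn, rng) for _ in range(n_runs)]
    return sum(scores) / n_runs, scores


# Hypothetical placeholder items, not real MBTI-M questions.
items = [(f"placeholder item {i}", "Agree", "Disagree") for i in range(20)]


def position_biased(question, options):
    return options[0]  # always picks whatever is listed first


def content_driven(question, options):
    return "Agree"  # picks by content, ignoring position


biased_mean, _ = averaged_score(items, position_biased)
stable_mean, _ = averaged_score(items, content_driven)
print(biased_mean, stable_mean)
```

A responder driven purely by position should average close to 50% once the order is shuffled, whereas a responder driven by content keeps its score, so the averaged percentage reflects the content preference rather than the first-position bias.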

After selecting the number of assessments, we further verify the faithfulness of the MBTI results obtained under this setup by analyzing the distribution of results using boxplots. As shown in Figure 2(b), across the four dimensions of the MBTI, the lower or upper quartile of each model’s boxplot lies on one side of the 50% line (indicated by a red dashed line). This distribution indicates that although the option order introduces some variance across the multiple assessments, personality traits are separable on independent MBTI dimensions, thus demonstrating the faithfulness of the MBTI-M assessment. In addition, we also conduct MBTI assessments on different personality models (provided by Mindset [23]) and larger models (Llama-2-13b, Qwen-1.5-14b, Internlm-2-20b), which are discussed in Appendix B.
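One way to check this separability numerically is sketched below. It assumes the criterion means that the box of the boxplot (lower to upper quartile) does not straddle 50%; the function name and the sample percentages are illustrative, not values from the paper.

```python
import statistics


def is_separable(percentages, midpoint=50.0):
    """True if the interquartile box lies entirely on one side of the midpoint."""
    lower_q, _, upper_q = statistics.quantiles(percentages, n=4)
    return lower_q > midpoint or upper_q < midpoint


# Illustrative per-run trait percentages for one dimension.
clearly_one_sided = [62, 65, 58, 70, 61, 66, 59, 63]
straddling = [40, 45, 48, 52, 55, 60, 47, 53]

print(is_separable(clearly_one_sided))  # True
print(is_separable(straddling))         # False
```

Note that `statistics.quantiles` uses the exclusive method by default, so the quartiles may differ slightly from those drawn by a plotting library.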

Figure 2: (a) Kappa coefficient as a function of the number of assessments. (b) Boxplots of 30 MBTI assessments. In MBTI, E-I, S-N, T-F, and J-P are opposite personality pairs, so only one trait from each pair is represented in the figure.

3 The Relationship between LLMs’ Personality Traits and Safety Capabilities

This section explores the relationship between MBTI personality traits and LLMs’ safety capabilities. We begin by investigating the differences in safety performance among models with various MBTI personality traits, clarifying how different personalities show different safety capabilities (3.1). Next, we analyze the changes in the MBTI personality of various models before and after safety alignment, providing insights into how alignment affects LLMs’ personality traits (3.2). In addition, we study the jailbreak success rates of models with different personalities, revealing the susceptibility of certain personality traits to jailbreaks (3.3).

3.1 LLMs with Different Personality Traits Have Different Safety Capabilities

Psychological research has found a correlation between personality and safety capabilities [66; 44; 7; 93; 105; 74]. To explore whether this correlation also exists within LLMs, we evaluate 16 variants of a base model, each with a different MBTI personality type, on three general abilities and three safety capabilities (toxicity, privacy, and fairness).

Models. Machine Mindset employs two-phase fine-tuning and DPO to embed MBTI traits into LLMs [23]. They provide 16 Chinese models fine-tuned from Baichuan-7b-chat, namely Mindset-zh, and 16 English models fine-tuned from Llama-2-7b, namely Mindset-en. Each model is embedded with one of the 16 MBTI personality types.

Evaluation Datasets. For general abilities, we choose the ARC, MMLU, and MathQA datasets, evaluated using lm-harness [29]. For safety capabilities, classic datasets are selected for evaluation. We choose ToxiGen [31] to evaluate the toxicity ratio of Mindset, following the approach of Llama2 by using a revised version of the dataset [35]. We choose the tier 2 task from ConfAIde [61] to evaluate the accuracy of judging privacy violations, and we use the combined data based on ConfAIde and the Solove Taxonomy from [76]. We use StereoSet [65] to evaluate the stereotype ratio of LLMs, i.e., whether LLMs capture stereotypical biases about race, religion, profession, and gender.

We first obtain reliable MBTI results of Mindset models using the assessment methods described in Section 2. Subsequently, we evaluate each model’s performance on both general and safety datasets. Due to the trade-offs between different safety capabilities in LLMs [43; 57; 85; 104], we analyze the relationship between each of the four MBTI dimensions (E-I, N-S, T-F, J-P) and the three safety capabilities (toxicity, privacy, and fairness) separately. For each MBTI dimension, we select models with significant differences in that personality dimension for analysis. See Appendix D for more details.

Figure 3: Performances of different personality models on general and safety evaluation, respectively.

There are significant differences in the performance of LLMs with different personalities in terms of safety capability. Figure 3 illustrates that LLMs with different personalities show nearly identical performance on the general ability datasets, i.e., ARC, MMLU, and MathQA. However, there are significant differences in performance across the three safety capability datasets, i.e., ToxiGen, StereoSet, and ConfAIde, indicating that personality is indeed correlated with LLMs’ safety capabilities. As shown in Figure 4, when analyzing how the different dimensions of MBTI relate to privacy, fairness, and toxicity performance, we make the following observations:

Figure 4: Toxicity, privacy, and fairness performance within four dimensions of MBTI, respectively.
  1. In the E-I dimension, models that lean more toward the introversion trait demonstrate better privacy performance, while their fairness and toxicity performance declines.

  2. In the N-S dimension, models that lean more toward the sensing trait demonstrate better privacy and fairness performance, while their toxicity performance declines.

  3. In the F-T dimension, models that lean more toward the feeling trait demonstrate better toxicity performance. However, in Mindset-zh, the performance of such models declines in both privacy and fairness, while in Mindset-en, improvements are observed in these two capabilities. See Appendix C for a discussion on cultural differences in the context of languages.

  4. In the J-P dimension, models that lean more toward the perceiving trait demonstrate better fairness performance. As the perceiving trait increases, privacy performance improves in Mindset-zh but declines in Mindset-en. Changes in the J-P dimension do not significantly affect toxicity performance in either Mindset-zh or Mindset-en.

3.2 Safety Alignment Changes Personality Traits

Safety and alignment are closely linked concepts in LLM development [38; 106; 52]. Alignment is considered a crucial approach to achieving model safety, as a well-aligned model is expected to inherently avoid unsafe outputs. Conversely, evaluating model safety serves as a key indicator for verifying the effectiveness of alignment techniques [75; 98]. In this part, we investigate the impact of safety alignment on LLMs’ personalities, as assessed by the MBTI.

To study the impact of alignment on LLM personality, we perform a comparative analysis using 11 pairs of open-source LLMs. Each pair consists of one base model and one aligned model. We administer the standard MBTI questionnaire to all 22 models, with each model responding to the questionnaire 30 times. The options for each questionnaire are randomly shuffled. Finally, average scores are recorded across the E-I, S-N, T-F, and J-P dimensions. A discussion of larger LLMs (i.e., Llama-2-13b, Qwen-1.5-14b, Internlm-2-20b) is provided in Appendix E.2.

Figure 5: MBTI of base and aligned LLMs. (a) E-I dimension of different LLMs’ MBTI traits. (b) S-N dimension of different LLMs’ MBTI traits. (c) F-T dimension of different LLMs’ MBTI traits. (d) J-P dimension of different LLMs’ MBTI traits.

LLMs statistically show a tendency towards certain personality types. If an LLM has a high extraversion percentage, e.g., more than 50%, the LLM demonstrates an extraversion trait. In this way, Figure 5 shows that most base and aligned LLMs tend toward extraversion, intuition, feeling, and judging traits in the MBTI assessment. Amber, InternLM-2, and Baichuan-2 exhibit some relatively minor deviations, i.e., they show introversion, sensing, and thinking traits. We hypothesize that the consistent personality tendencies in LLMs may result from their training on extensive data, which reflects the overall characteristics of the human population. Consequently, LLMs may inherit the average personality traits of the humans behind the data. This phenomenon and its underlying causes deserve further investigation [30; 41; 3].
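The thresholding rule can be written as a small helper. This is a sketch only: the paper states the rule for extraversion, and applying the same 50% cut-off to the other three dimensions, as well as the handling of an exact 50% tie, are assumptions. The input values are hypothetical.

```python
def mbti_type(e_pct, s_pct, t_pct, j_pct, threshold=50.0):
    """Map averaged trait percentages to a four-letter MBTI type.

    Each argument is the percentage for the first-named pole of its pair
    (E of E-I, S of S-N, T of T-F, J of J-P). A value of exactly 50% is
    assigned to the second pole here, which is an arbitrary choice.
    """
    return "".join([
        "E" if e_pct > threshold else "I",
        "S" if s_pct > threshold else "N",
        "T" if t_pct > threshold else "F",
        "J" if j_pct > threshold else "P",
    ])


# Hypothetical averaged percentages, not taken from Figure 5.
print(mbti_type(62.0, 41.5, 38.0, 71.0))  # ENFJ
```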

Alignment generally makes LLMs exhibit more Extraversion, Sensing, and Judging traits compared to their base models. Figure 5 indicates that the alignment operation indeed changes the personality traits of LLMs, especially in the E-I, N-S, and J-P dimensions of the MBTI framework. Specifically, most LLMs show consistent patterns of change after alignment, e.g., the numbers of aligned models with increased extraversion, sensing, and judging percentages are 8, 9, and 8, respectively. Mistral shows no significant change in the E-I dimension, and Yi shows no change in the J-P dimension after alignment.

The personality changes through alignment techniques are consistent with some psychological findings on humans. Numerous studies conducted by psychologists have established an actual relation between personality traits and safety [44; 34; 60; 8]. They find that extraverts are more positive communicators, proactive in addressing safety concerns, and participative in group safety activities [68]. Judging individuals prefer conscientiousness, which correlates negatively with unsafe behaviors [60; 7]. Sensing individuals are more detail-oriented and observant, enhancing their adherence to safety protocols and recognition of immediate hazards.

3.3 LLMs with Different Personality Traits are Differentially Susceptible to Jailbreaks

Figure 6: Success rate (%) of various jailbreak approaches within four dimensions of MBTI.

Jailbreak attacks are crucial for identifying the security vulnerabilities of LLMs [101; 109]. Recent research has shown that simple role-playing can compromise even the most advanced LLMs [24; 82]. This finding suggests that LLMs, when assigned specific roles or characters, are prone to complying with harmful instructions. To elucidate the relationship between personality traits and jailbreak susceptibility, this study jailbreaks the Mindset-en models introduced in Section 3.1. Specifically, we employ Jailbroken [101], Cipher [107], and CodeChameleon [55] as jailbreak methods, whose attack success rates on Llama-2-7b-chat are 6%, 61%, and 80%, respectively [109].

Models with more Extraversion, iNtuition, and Feeling traits are more likely to be jailbroken. We conduct three jailbreak attacks on the Mindset-en models [23] with different personality traits, and then analyze the susceptibility of MBTI personality to jailbreaks following Section 3.1. As shown in Figure 6, models with different personality traits show varying jailbreak success rates. Models with extraversion, intuition, or feeling traits are more susceptible to jailbreaks. It can also be observed that as the attack methods become stronger, the attack success rate on models with these traits increases. For example, in the E-I dimension, the success rates for the Jailbroken, Cipher, and CodeChameleon methods increase by 1.46%, 10.47%, and 15.9%, respectively.

Findings from psychology may provide explanations for the observation that LLMs with certain personality traits are more susceptible to jailbreaks. Extraverted individuals prioritize interaction and feedback [67]. Consequently, models with stronger extraversion traits are more susceptible to harmful instructions. Intuitive individuals are more open to new ideas and experiences [59]. This openness increases the vulnerability of models with stronger intuition traits to jailbreaks. The feeling trait is associated with higher agreeableness [60; 12]. Therefore, models with stronger feeling traits are more likely to produce accommodating responses, making them more susceptible to jailbreaks.

4 Enhancing LLMs’ Safety Capabilities from Personality Perspective

Motivated by the observation in Section 3 that there is a relationship between LLMs’ personality traits and safety capabilities, this section aims to enhance the safety capabilities of LLMs by controllably editing personality traits. We first introduce the steering vector technique used to edit LLMs’ personality traits (4.1). Next, we delve into the impact of controllably editing LLMs’ personality traits on their safety capabilities and vice versa (4.2).

4.1 Controllably Editing LLMs’ Personality Traits with Steering Vector Technique

Steering vector-based activation intervention techniques have been widely used to guide model inference, including improving model truthfulness [48], enhancing model trustworthiness [111; 76], and executing backdoor attacks on models [98].

We first provide a brief overview of the steering vector technique here. Given a dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{|\mathcal{D}|}$, where $x_i$ represents a sentence related to a specific subject (e.g., personality), $y_i\in\{0,1\}$ is the corresponding binary label (e.g., $1$ denotes E, $0$ denotes I). We denote the set of sentences with labels $1$ and $0$ as $\mathcal{X}^{+}$ and $\mathcal{X}^{-}$, respectively. Next, we input all sentences from the dataset into the LLM and collect the activation sets $A_l(\mathcal{X}^{+})$ and $A_l(\mathcal{X}^{-})$, where $A_l$ is a function representing the activations at the $l$-th layer of the LLM. Subsequently, we compute the centroids of each activation set and take their difference to obtain the steering vector:

$$\bm{v}_l=\overline{A_l}(\mathcal{X}^{+})-\overline{A_l}(\mathcal{X}^{-}). \tag{1}$$

Finally, we add this steering vector to the corresponding $l$-th layer representations during LLM generation to intervene in the model output:

$$\bm{h}_{l'}=\bm{h}_l+\alpha\bm{v}_l, \tag{2}$$

where $\bm{h}_l$ represents the original representation of the $l$-th layer, $\bm{h}_{l'}$ represents the corresponding intervened representation, and the hyperparameter $\alpha$ controls the intervention strength. Note that this operation occurs at each token generation of the LLM’s autoregressive inference.
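Equations (1) and (2) can be illustrated on toy vectors. The sketch below uses plain Python lists in place of real layer activations; the function names and the numbers are illustrative assumptions, and a real implementation would read and modify hidden states inside the model at layer l.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Eq. (1): difference between the centroids of the two activation sets."""
    pos_c, neg_c = centroid(pos_acts), centroid(neg_acts)
    return [p - q for p, q in zip(pos_c, neg_c)]


def intervene(hidden, v, alpha):
    """Eq. (2): shift a layer-l representation by alpha times the steering vector."""
    return [h + alpha * x for h, x in zip(hidden, v)]


# Toy 2-dimensional "activations" for label-1 and label-0 sentences.
pos_acts = [[1.0, 2.0], [3.0, 4.0]]
neg_acts = [[0.0, 1.0], [2.0, 1.0]]

v = steering_vector(pos_acts, neg_acts)
print(v)                                   # [1.0, 2.0]
print(intervene([0.5, 0.5], v, alpha=2.0))  # [2.5, 4.5]
```

A positive alpha pushes the representation toward the label-1 centroid (e.g., E) and a negative alpha toward the label-0 centroid (e.g., I); during generation the same shift would be applied at every decoding step.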

Experimental Settings. The models and evaluation datasets used in this section are consistent with those described in Section 3.1. When controllably editing the personalities of LLMs, we use the dataset provided by [23] to activate LLMs; for controllably changing the safety capabilities of LLMs, we use the datasets mentioned in Section 3.1. Regarding the selection of hyperparameters for the steering vector technique, specifically the layer $l$ and intervention strength $\alpha$, we empirically determine the optimal parameters through a coarse grid search [48; 98; 76] under the constraint of the Perplexity metric [16; 76]. Please refer to Appendix D for more details.
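A coarse grid search under a perplexity constraint might look like the following sketch. The selection objective, the perplexity budget, and the candidate grids are not specified in this section, so everything here is an assumption: `trait_shift` and `perplexity` are hypothetical callables standing in for real evaluations of the intervened model, and the toy lambdas exist only to make the example runnable.

```python
from itertools import product


def coarse_grid_search(layers, alphas, trait_shift, perplexity, max_perplexity):
    """Pick the (layer, alpha) with the largest trait shift among settings
    whose perplexity stays within the allowed budget.

    Returns (score, layer, alpha), or None if no setting is admissible.
    """
    best = None
    for layer, alpha in product(layers, alphas):
        if perplexity(layer, alpha) > max_perplexity:
            continue  # intervention degrades fluency too much
        score = trait_shift(layer, alpha)
        if best is None or score > best[0]:
            best = (score, layer, alpha)
    return best


# Toy stand-ins: stronger interventions shift the trait more but raise perplexity.
toy_shift = lambda layer, alpha: alpha - 0.1 * abs(layer - 16)
toy_perplexity = lambda layer, alpha: 10 + alpha * alpha

best = coarse_grid_search(
    layers=[8, 16, 24],
    alphas=[1, 2, 4, 8],
    trait_shift=toy_shift,
    perplexity=toy_perplexity,
    max_perplexity=30,
)
print(best)  # (4.0, 16, 4)
```

In this toy setup the strongest intervention (alpha = 8) is rejected by the perplexity budget, so the search settles on the largest admissible strength at the best layer.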

Figure 7: Results of controllably editing LLMs’ personality traits with the steering vector technique. (Upper) MBTI of original and intervened LLMs; (Bottom) Safety capabilities of original and intervened LLMs. The edited LLMs are indicated by slashed textures.

4.2 Controllably Editing LLMs’ Personality Traits Enhances LLMs’ Safety Capabilities

Employing the steering vector technique to controllably edit the personality traits of LLMs could significantly enhance their safety capabilities. We select three base models and use the steering vector technique to controllably change their personalities (i.e., ISTJ→ISTP, ESFJ→ESTJ, ENFP→ESFP). The results shown in Figure 7 indicate that the steering vector technique could controllably edit an LLM’s personality in a specific dimension while causing relatively minor changes in other personality dimensions. Moreover, in these three cases of directional personality changes, the models exhibit improved fairness and privacy performance and declined toxicity performance. These findings align with observations 2, 3, and 4 discussed in Section 3.1, supporting the claim that LLMs’ personality and safety are closely related.

Employing the steering vector technique to change the safety capabilities of LLMs also impacts their personality traits. Conversely, we further investigate whether changes in LLMs’ safety capabilities impact their personality traits. Similarly, we select three base models and controllably edit their safety capabilities (i.e., fairness, privacy, and toxicity, respectively). The results in Table 2 indicate that the steering vector technique can significantly change a model’s safety capabilities. Additionally, we observe corresponding changes in the models’ personalities. For example, when changing the privacy capability of the model with an ESFJ MBTI, the model’s extraversion, sensing, thinking, and judging traits all become more pronounced. Therefore, these experimental results further confirm the association between LLMs’ personality and safety.

Notably, when aiming to enhance an LLM’s safety capability, editing its personality offers a potential technical approach. Personality traits are powerful predictors of outcomes across various domains, including education, work, relationships, health, and well-being [9; 103]. By editing models’ personality traits in a controllable way, we can enable them to adapt to different fields and satisfy the diverse requirements of various scenarios. Controllable personality editing based on the steering vector technique not only significantly enhances an LLM’s safety performance at minimal cost but also benefits from research in psychology, sociology, and behavioral science, thereby providing greater interpretability.

Table 2: Results of changing LLM safety capabilities by steering vector. Orange values indicate improvement difference, green values indicate decline difference. The shaded region indicates the specific safety capabilities that are controllably changed.
| Model | Type | Fairness ↑ | Privacy ↑ | Toxicity ↓ | MBTI E | MBTI S | MBTI T | MBTI J |
|---|---|---|---|---|---|---|---|---|
| INFJ | Original | 0.4361 | 0.3414 | 0.058 | 63.33% | 54.73% | 24.17% | 74.55% |
| INFJ | Intervened | 0.5465 | 0.3395 | 0.044 | 66.05% | 54.62% | 27.63% | 73.05% |
| INFJ | Diff Δ | +0.1104 | -0.0019 | -0.014 | +2.72% | -0.11% | +3.46% | -1.50% |
| ESFJ | Original | 0.3491 | 0.3395 | 0.042 | 79.19% | 44.35% | 23.63% | 70.91% |
| ESFJ | Intervened | 0.5160 | 0.4785 | 0.012 | 79.86% | 45.27% | 26.67% | 74.68% |
| ESFJ | Diff Δ | +0.1669 | +0.1390 | -0.030 | +0.67% | +0.92% | +3.04% | +3.77% |
| ISTP | Original | 0.5126 | 0.7153 | 0.078 | 50.33% | 47.42% | 52.79% | 41.95% |
| ISTP | Intervened | 0.4994 | 0.7080 | 0.042 | 51.10% | 47.19% | 52.50% | 45.14% |
| ISTP | Diff Δ | -0.0132 | -0.0073 | -0.036 | +0.77% | -0.23% | -0.29% | +3.19% |
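As a worked example of how the Diff Δ rows are read, the short sketch below recomputes the safety differences for the ESFJ model from the Original and Intervened values in Table 2 and marks each as an improvement or a decline, taking into account that lower toxicity is better. The helper names are hypothetical and serve only for illustration.

```python
# Safety scores copied from the ESFJ rows of Table 2.
original = {"Fairness": 0.3491, "Privacy": 0.3395, "Toxicity": 0.042}
intervened = {"Fairness": 0.5160, "Privacy": 0.4785, "Toxicity": 0.012}

# Metrics marked with a down arrow in the table header: lower is better.
LOWER_IS_BETTER = {"Toxicity"}


def diff_row(before, after, ndigits=4):
    """Intervened minus original, rounded as in the table."""
    return {name: round(after[name] - before[name], ndigits) for name in before}


def is_improvement(metric, delta):
    """A negative delta is an improvement only for lower-is-better metrics."""
    return delta < 0 if metric in LOWER_IS_BETTER else delta > 0


def relative_change(before, after):
    """Change expressed as a fraction of the original value."""
    return (after - before) / before


diff = diff_row(original, intervened)
for metric, delta in diff.items():
    tag = "improved" if is_improvement(metric, delta) else "declined"
    rel = relative_change(original[metric], intervened[metric])
    print(f"{metric}: {delta:+.4f} ({tag}, relative change {rel:+.1%})")
```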

5 Related work

Myers-Briggs Type Indicator. Personality is a fundamental concept in psychology, referring to the dynamic integration of the totality of a person’s subjective experience and behavior patterns [42]. Various theories and models have been proposed to conceptualize and measure personality traits [40; 27; 10]. The Myers-Briggs Type Indicator (MBTI) [10] is based on Carl Jung’s theory of psychological types. Two notable variants of the MBTI have been developed to meet specific research needs, i.e., MBTI-G [15] and MBTI-M [63]. These adaptations of the MBTI have been applied in various studies to investigate the relationships between personality and other variables [36; 11; 33; 32].

LLMs Personality Traits. Research suggests that LLMs exhibit unique personality traits that both resemble and differ from human personalities [87; 54; 25; 1]. MBTI has been used to assess LLM personality [72; 23; 1; 86]. Specifically, there are two primary methods for assessing LLM personality: one is the direct application of human psychological scales to LLMs [37; 25; 39]; the other is inferring personality traits from language content generated by LLMs using specialized models [86; 54; 100]. In terms of editing and shaping model personalities, researchers have proposed various methods to change LLMs’ personalities to suit different application scenarios and user needs, including prompting [39; 92], role-playing [100; 83], model editing [58], and fine-tuning [23].

LLMs Alignment and Safety. The foundation for understanding LLM safety is established by existing research on AI governance [91; 21; 20] and trustworthy AI [51; 28]. These studies provide guidance for identifying the core dimensions of trustworthiness in LLMs [52; 97; 89; 76]. To this end, ensuring the alignment of LLMs with human values is crucial to mitigate and avoid potential societal safety risks. Many approaches have been proposed, including optimizing LLMs from human preferences [110; 69; 6; 78; 46; 106] and self-alignment [62; 79; 56]. These approaches enable an LLM to identify and rectify the harmfulness of its outputs, thereby fostering greater alignment with societal values.

6 Conclusion

In this study, we discover that safety alignment can generally change LLMs’ personality traits, and LLMs with different personality traits are differentially susceptible to jailbreaks. Meanwhile, we discover that LLMs’ personality traits are closely related to their performance in safety capabilities such as toxicity, privacy, and fairness. Based on these findings, we experimentally demonstrate that editing LLMs’ personality traits can enhance their safety performance, providing new insights for the development of LLM safety. This study pioneers the exploration of LLM safety from a personality perspective. However, due to the complex correlation rather than causation between personality and safety in psychology [47; 14], there is still a need to further explore the more intrinsic relationship between personality traits and LLM safety.

Limitations

There are several limitations of this work. Firstly, our study focuses on 7B models due to the availability of both base and alignment models. Few 13B models offer both, limiting their representativeness. Thus, our main research centers on 7B models, with 13B models discussed in the appendix. Secondly, we measure the MBTI traits of closed-source models without editing because there are no model weights to construct steering vectors, and the prompt methods are uncontrollable. Thirdly, to reduce variables, we limited our study to three representative safety dimensions, ensuring a manageable scope and meaningful insights into the relationship between LLMs’ personality traits and safety capabilities.

Broader Impact and Ethics Statement

This study focuses on better understanding the relationship between personality traits and LLM safety. We emphasize that personality traits, assessed and edited in this study, do not imply any inherent value judgments. There are no “good” or “bad” personality traits, and our objective is only to enhance LLM safety. We strictly prohibit the intentional steering of models towards unsafe personality traits. All modifications are performed with the primary goal of improving model safety, ensuring that our work contributes positively to the development of ethical and trustworthy AI systems.

This research is carried out in a secure, controlled environment, ensuring the safety of real-world systems. Access to the most sensitive aspects of our experiments is limited to researchers with the proper authorization, who are committed to following rigorous ethical standards. These precautions are taken to maintain the integrity of our research and to mitigate any risks that could arise from the experiment’s content.

References

  • [1] Yiming Ai, Zhiwei He, Ziyin Zhang, Wenhong Zhu, Hongkun Hao, Kai Yu, Lingjun Chen, and Rui Wang. Is cognition and action consistent or not: Investigating large language model’s personality. arXiv preprint arXiv:2402.14679, 2024.
  • [2] Jüri Allik and Anu Realo. Culture and personality. Oxford handbook of culture and psychology, pages 401–424, 2019.
  • [3] Stephen K Asare, Herman van Brenk, and Kristina C Demek. Evidence on the homogeneity of personality traits within the auditing profession. Critical Perspectives on Accounting, page 102584, 2023.
  • [4] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  • [5] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. Advances in neural information processing systems, 32, 2019.
  • [6] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • [7] Jeremy M Beus, Lindsay Y Dhanani, and Mallory A McCord. A meta-analysis of personality and workplace safety: addressing unanswered questions. Journal of applied psychology, 100(2):481, 2015.
  • [8] Jeremy M Beus, Mallory A McCord, and Dov Zohar. Workplace safety: A review and research synthesis. Organizational psychology review, 6(4):352–381, 2016.
  • [9] Wiebke Bleidorn, Patrick L. Hill, Mitja D. Back, Jaap J. A. Denissen, Marie Hennecke, Christopher James Hopwood, Markus Jokela, Christian Kandler, Richard E. Lucas, Maike Luhmann, Ulrich R. Orth, Jenny Wagner, Cornelia Wrzus, Johannes Zimmermann, and Brent W. Roberts. The policy relevance of personality traits. The American psychologist, 74(9):1056–1067, 2019.
  • [10] Katharine C Briggs. Myers-Briggs type indicator. Consulting Psychologists Press Palo Alto, CA, 1976.
  • [11] F William Brown and Michael D Reilly. The myers-briggs type indicator and transformational leadership. Journal of Management Development, 28(10):916–932, 2009.
  • [12] Ethan Campbell and Matthew P. Kassner. The Importance of Agreeableness, pages 1–6. Springer International Publishing, Cham, 2018.
  • [13] Robert M Capraro and Mary Margaret Capraro. Myers-briggs type indicator score reliability across studies: A meta-analytic reliability generalization study. Educational and Psychological Measurement, 62(4):590–602, 2002.
  • [14] Vincent Careau and Theodore Garland Jr. Performance, personality, and energetics: correlation, causation, and mechanism. Physiological and Biochemical Zoology, 85(6):543–571, 2012.
  • [15] John G Carlson. Recent assessments of the myers-briggs type indicator. Journal of personality assessment, 49(4):356–365, 1985.
  • [16] Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. Evaluation metrics for language models. 1998.
  • [17] Sylvia Xiaohua Chen, Verónica Benet-Martínez, and Jacky CK Ng. Does language affect personality perception? a functional approach to testing the whorfian hypothesis. Journal of personality, 82(2):130–143, 2014.
  • [18] A Timothy Church. Personality traits across cultures. Current Opinion in Psychology, 8:22–30, 2016.
  • [19] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46, 1960.
  • [20] European Commission. Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts, pub. l. no. com(2021) 206 final., 2021b.
  • [21] European Commission, Content Directorate-General for Communications Networks, and Technology. Ethics guidelines for trustworthy AI. Publications Office, 2019.
  • [22] Paul T Costa and Robert R McCrae. Trait theories of personality. In Advanced personality, pages 103–121. Springer, 1998.
  • [23] Jiaxi Cui, Liuzhenghao Lv, Jing Wen, Jing Tang, YongHong Tian, and Li Yuan. Machine mindset: An mbti exploration of large language models. arXiv preprint arXiv:2312.12999, 2023.
  • [24] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
  • [25] Florian E Dorner, Tom Sühr, Samira Samadi, and Augustin Kelava. Do personality tests generalize to large language models? arXiv preprint arXiv:2311.05297, 2023.
  • [26] Seymour Epstein. Trait theory as personality theory: Can a part be as great as the whole? Psychological Inquiry, 5(2):120–122, 1994.
  • [27] Hans Jurgen Eysenck and Sybil Bianca Giuletta Eysenck. Manual of the Eysenck Personality Questionnaire (junior & adult). Hodder and Stoughton Educational, 1975.
  • [28] AI Verify Foundation. Catalogue of llm evaluations, 2023.
  • [29] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023.
  • [30] Jennifer M George. Personality, affect, and behavior in groups. Journal of applied psychology, 75(2):107, 1990.
  • [31] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for implicit and adversarial hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
  • [32] Zihong He and Changwang Zhang. Afspp: Agent framework for shaping preference and personality with large language models. arXiv preprint arXiv:2401.02870, 2024.
  • [33] Malcolm Higgs. Is there a relationship between the myers-briggs type indicator and emotional intelligence? Journal of Managerial Psychology, 16(7):509–533, 2001.
  • [34] Joyce Hogan and Jeff Foster. Multifaceted personality predictors of workplace safety performance: More than conscientiousness. Human Performance, 26(1):20–43, 2013.
  • [35] Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. An empirical study of metrics to measure representational harms in pre-trained language models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 121–134, 2023.
  • [36] Jill R Hough and DT Ogilvie. An empirical test of cognitive style and strategic decision outcomes. Journal of Management Studies, 42(2):417–448, 2005.
  • [37] Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael R Lyu. Who is chatgpt? benchmarking llms’ psychological portrayal using psychobench. arXiv preprint arXiv:2310.01386, 2023.
  • [38] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023.
  • [39] Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, and Yixin Zhu. Evaluating and inducing personality in pre-trained language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [40] Oliver P John, Eileen M Donahue, and Robert L Kentle. Big five inventory. Journal of personality and social psychology, 1991.
  • [41] Petri Kajonius and Erik Mac Giolla. Personality traits across countries: Support for similarities rather than differences. PloS one, 12(6):e0179646, 2017.
  • [42] Otto F Kernberg. What is personality? Journal of personality disorders, 30(2):145–156, 2016.
  • [43] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science (ITCS), 2017.
  • [44] Julie Laurent, Nik Chmiel, and Isabelle Hansez. Personality and safety citizenship: the role of safety motivation and safety knowledge. Heliyon, 6(1), 2020.
  • [45] Chang H Lee, Kyungil Kim, Young Seok Seo, and Cindy K Chung. The relations between personality and language use. The Journal of general psychology, 134(4):405–413, 2007.
  • [46] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  • [47] James J Lee. Correlation and causation in the study of personality. European Journal of Personality, 26(4):372–390, 2012.
  • [48] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [49] Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. Can multiple-choice questions really be useful in detecting the abilities of llms? arXiv preprint arXiv:2403.17752, 2024.
  • [50] Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, and Xing Wang. Leveraging word guessing games to assess the intelligence of large language models. arXiv preprint arXiv:2310.20499, 2023.
  • [51] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil Jain, and Jiliang Tang. Trustworthy ai: A computational perspective. ACM Transactions on Intelligent Systems and Technology, page 1–59, Feb 2023.
  • [52] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment, 2023.
  • [53] Manikanta Loya, Divya Sinha, and Richard Futrell. Exploring the sensitivity of LLMs’ decision-making capabilities: Insights from prompt variations and hyperparameters. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3711–3716, Singapore, December 2023. Association for Computational Linguistics.
  • [54] Yang Lu, Jordan Yu, and Shou-Hsuan Stephen Huang. Illuminating the black box: A psychometric investigation into the multifaceted nature of large language models. arXiv preprint arXiv:2312.14202, 2023.
  • [55] Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. Codechameleon: Personalized encryption framework for jailbreaking large language models. arXiv preprint arXiv:2402.16717, 2024.
  • [56] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [57] Paul Mangold, Michaël Perrot, Aurélien Bellet, and Marc Tommasi. Differential privacy has bounded impact on fairness in classification. In International Conference on Machine Learning, pages 23681–23705, 2023.
  • [58] Shengyu Mao, Ningyu Zhang, Xiaohan Wang, Mengru Wang, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Editing personality for llms. arXiv preprint arXiv:2310.02168, 2023.
  • [59] Robert R. McCrae and Paul T. Costa. Joint factors in self-reports and ratings: Neuroticism, extraversion and openness to experience. Personality and Individual Differences, 4(3):245–255, 1983.
  • [60] Robert R McCrae and Paul T Costa Jr. Reinterpreting the myers-briggs type indicator from the perspective of the five-factor model of personality. Journal of personality, 57(1):17–40, 1989.
  • [61] Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can llms keep a secret? testing privacy implications of language models via contextual integrity theory, 2023.
  • [62] Masato Mita, Shun Kiyono, Masahiro Kaneko, Jun Suzuki, and Kentaro Inui. A self-refinement strategy for noise reduction in grammatical error correction. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 267–280, Online, November 2020. Association for Computational Linguistics.
  • [63] Isabel Briggs Myers. MBTI manual: A guide to the development and use of the Myers-Briggs Type Indicator. CPP, 2003.
  • [64] Peter B. Myers, Katharine D. Myers, Isabel Briggs Myers, and Linda K. Kirby. Myers-Briggs Type Indicator : form M. Consulting Psychologists Press, Palo Alto, Calif., 1998.
  • [65] Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, 2021.
  • [66] Nigel Nicholson, Emma Soane, Mark Fenton-O’Creevy, and Paul Willman. Personality and domain-specific risk taking. Journal of Risk Research, 8(2):157–176, 2005.
  • [67] S. K. Opt and D. A. Loffredo. Communicator image and Myers-Briggs Type Indicator extraversion-introversion. J Psychol, 137(6):560–568, Nov 2003.
  • [68] Susan K Opt and Donald A Loffredo. Communicator image and myers—briggs type indicator extraversion—introversion. The Journal of psychology, 137(6):560–568, 2003.
  • [69] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • [70] Katarzyna Ożańska-Ponikwia. What has personality and emotional intelligence to do with ‘feeling different’ while using a foreign language? International Journal of Bilingual Education and Bilingualism, 15(2):217–234, 2012.
  • [71] Debajyoti Pal, Vajirasak Vanijja, Himanshu Thapliyal, and Xiangmin Zhang. What affects the usage of artificial conversational agents? an agent personality and love theory perspective. Computers in Human Behavior, 145:107788, 2023.
  • [72] Keyu Pan and Yawen Zeng. Do llms possess a personality? making the mbti test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180, 2023.
  • [73] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, 2016.
  • [74] Vijay Pereira, Umesh Bamel, Happy Paul, and Arup Varma. Personality and safety behavior: An analysis of worldwide research on road and traffic safety leading to organizational and policy implications. Journal of Business Research, 151:185–196, 2022.
  • [75] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
  • [76] Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, and Jing Shao. Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models. arXiv preprint arXiv:2402.19465, 2024.
  • [77] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [78] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [79] Machel Reid and Graham Neubig. Learning to model editing processes. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3822–3832, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  • [80] William Revelle. Experimental approaches to the study of personality. Handbook of research methods in personality psychology, pages 37–61, 2007.
  • [81] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting, 2023.
  • [82] Rusheb Shah, Quentin Feuillade Montixi, Soroush Pour, Arush Tagade, and Javier Rando. Jailbreaking language models at scale via persona modulation. 2023.
  • [83] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023.
  • [84] NN Sharan and DM Romano. The effects of personality and locus of control on trust in humans versus artificial intelligence. Heliyon, 6(8):e04572, 2020.
  • [85] Liwei Song, Reza Shokri, and Prateek Mittal. Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 241–257, 2019.
  • [86] Xiaoyang Song, Yuta Adachi, Jessie Feng, Mouwei Lin, Linhao Yu, Frank Li, Akshat Gupta, Gopala Anumanchipalli, and Simerjot Kaur. Identifying multiple personalities in large language models with external evaluation. arXiv preprint arXiv:2402.14805, 2024.
  • [87] Xiaoyang Song, Akshat Gupta, Kiyan Mohebbizadeh, Shujie Hu, and Anant Singh. Have large language models developed a personality?: Applicability of self-assessment tests in measuring personality in llms. arXiv preprint arXiv:2305.14693, 2023.
  • [88] Aleksandra Sorokovikova, Natalia Fedorova, Sharwin Rezagholi, and Ivan P Yamshchikov. Llms simulate big five personality traits: Further evidence. arXiv preprint arXiv:2402.01765, 2024.
  • [89] Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
  • [90] Reiji Suzuki and Takaya Arita. An evolutionary model of personality traits related to cooperative behavior using a large language model. Scientific Reports, 14(1):5989, 2024.
  • [91] Elham Tabassi. Artificial intelligence risk management framework (ai rmf 1.0), 2023.
  • [92] Fiona Anting Tan, Gerard Christopher Yeo, Fanyou Wu, Weijie Xu, Vinija Jain, Aman Chadha, Kokil Jaidka, Yang Liu, and See-Kiong Ng. Phantom: Personality has an effect on theory-of-mind reasoning in large language models. arXiv preprint arXiv:2403.02246, 2024.
  • [93] Da Tao, Xiaofeng Diao, Xingda Qu, Xiaoting Ma, and Tingru Zhang. The predictors of unsafe behaviors among nuclear power plant workers: An investigation integrating personality, cognitive and attitudinal factors. International Journal of Environmental Research and Public Health, 20(1):820, 2023.
  • [94] Jen-tse Huang, Wenxuan Wang, Man Ho Lam, Eric John Li, Wenxiang Jiao, and Michael R. Lyu. Revisiting the reliability of psychological scales on large language models, 2023.
  • [95] Quan Tu, Chuanqi Chen, Jinpeng Li, Yanran Li, Shuo Shang, Dongyan Zhao, Ran Wang, and Rui Yan. Characterchat: Learning towards conversational ai with personalized social support. arXiv preprint arXiv:2308.10278, 2023.
  • [96] G Marina Veltkamp, Guillermo Recio, Arthur M Jacobs, and Markus Conrad. Is personality modulated by language? International Journal of Bilingualism, 17(4):496–504, 2013.
  • [97] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • [98] Haoran Wang and Kai Shu. Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433, 2023.
  • [99] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
  • [100] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews, 2024.
  • [101] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
  • [102] Yixuan Weng, Shizhu He, Kang Liu, Shengping Liu, and Jun Zhao. Controllm: Crafting diverse personalities for language models. arXiv preprint arXiv:2402.10151, 2024.
  • [103] Amanda Wright and Joshua Jackson. Do changes in personality predict life outcomes? Journal of Personality and Social Psychology, 125, 2023.
  • [104] Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In International conference on machine learning, pages 11492–11501, 2021.
  • [105] Li Yang, Sumaiya Bashiru Danwana, Fadilul-lah Yassaanah Issahaku, Sundas Matloob, and Junqi Zhu. Investigating the effects of personality on the safety behavior of gold mine workers: A moderated mediation approach. International journal of environmental research and public health, 19(23):16054, 2022.
  • [106] Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023.
  • [107] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  • [108] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • [109] Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, et al. Easyjailbreak: A unified framework for jailbreaking large language models. arXiv preprint arXiv:2403.12171, 2024.
  • [110] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • [111] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.

Appendix A The Reliability of MBTI Assessment for LLMs

A.1 Setting Cases about Factors Affecting MBTI Assessment

There are two ways to change the option order in the MBTI scale:

  • Exchange Option Description. Following the settings of previous research [99], we exchange the option descriptions while maintaining the order of the labels, i.e., changing A. Agree, B. Disagree to A. Disagree, B. Agree.

  • Exchange Option Label. We exchange the option labels while maintaining the order of the descriptions, i.e., changing A. Agree, B. Disagree to B. Agree, A. Disagree.

There are three factors that can affect MBTI assessment under each type of option order:

  • Option Label. We set option labels in two forms: alphabetic letters (e.g., A. Agree B. Disagree) and numbers (e.g., 1. Agree 2. Disagree), and examine their impact on the MBTI assessment results.

  • Instructions. We provide two styles of instruction: (1) examples whose answer contains the option label and the corresponding description (i.e., Question: Artificial intelligence cannot have emotions. A. Agree, B. Disagree. Your answer: B. Disagree); (2) examples whose answer contains only the option label without the description (i.e., Question: Artificial intelligence cannot have emotions. A. Agree, B. Disagree. Your answer: B).

  • Language. We use both Chinese and English versions of the MBTI-M questionnaire to assess the personality results of LLMs from different cultural backgrounds.
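The option-order and option-label settings listed above can be sketched as a small prompt builder. This is a hypothetical illustration of how such variants might be generated, not our actual evaluation code; the function name and the prompt template are assumptions modeled on the examples in this section.

```python
QUESTION = "Artificial intelligence cannot have emotions."
DESCRIPTIONS = ("Agree", "Disagree")


def format_item(question, descriptions, labels=("A", "B")):
    """Render one scale item; labels and descriptions are paired by position."""
    options = ", ".join(f"{label}. {desc}" for label, desc in zip(labels, descriptions))
    return f"Question: {question} {options}. Your answer:"


# Baseline item with alphabetic labels.
baseline = format_item(QUESTION, DESCRIPTIONS)

# Exchange Option Description: labels keep their order, descriptions swap.
exchanged_desc = format_item(QUESTION, tuple(reversed(DESCRIPTIONS)))

# Exchange Option Label: descriptions keep their order, labels swap.
exchanged_label = format_item(QUESTION, DESCRIPTIONS, labels=("B", "A"))

# Option Label factor: numbers instead of letters.
numbered = format_item(QUESTION, DESCRIPTIONS, labels=("1", "2"))

for prompt in (baseline, exchanged_desc, exchanged_label, numbered):
    print(prompt)
```

When comparing answers across these variants, the model output should be mapped back to the description (Agree or Disagree) rather than the raw label, since the same label points to different descriptions after an exchange.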

A.2 Kappa coefficient of Option Order by Exchanging Option Labels

Table 3 presents the kappa coefficients for various models under different settings when the option order is changed by exchanging option labels. For most models (7 of the 11), using numbers as option labels yields a higher kappa coefficient than using the alphabet, suggesting that “number” is the preferable choice for this factor. Additionally, the kappa coefficient on the MBTI results remains broadly comparable regardless of whether the instructions include descriptions, which indicates that the option descriptions do not substantially affect the assessment in this setting. The kappa coefficient is also broadly similar between the Chinese and English scales for most models, which suggests that the MBTI results are not greatly affected by the assessment language.

Table 3: Kappa coefficient of the option order (exchange option labels) in LLMs’ MBTI assessment under three factors, respectively.
Factor        Setting   Llama-2  Llama-3  Amber   Gemma   Mistral  Baichuan  Internlm  Internlm2  Qwen    Qwen-1.5  Yi
Choice Label  number    0.1017   0.1846   0.2733  0.0937  0.2493   0.4003    0.1597    0.2637     0.0203  0.2999    0.0107
Choice Label  alphabet  0.0143   0.094    0.0688  0.0092  0.335    0.2636    0.058     0.5164     0.3378  0.0827    0.6666
ICL           w/ desc   0.1285   0.2184   0.2352  0.0597  0.2678   0.319     0.1189    0.2274     0.0905  0.2725    0.0541
ICL           w/o desc  0.0055   0.2163   0.024   0.1033  0.0139   0.3491    0.2868    0.2285     0.2779  0.4045    0.0553
Language      chinese   0.0357   0.1535   0.2572  0.0453  0.2523   0.3924    0.1006    0.2784     0.0002  0.2704    0.0137
Language      english   0.4702   0.1635   0.3983  0.041   0.1915   0.4418    0.2622    0.1299     0.2626  0.2353    0.4515
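
As an illustration of the agreement measure reported in Table 3, the following minimal sketch computes a kappa coefficient between the answers a model gives before and after the option labels are exchanged. It assumes Cohen's kappa over categorical answers that have already been mapped back to their meaning; the function name `cohen_kappa` and the toy answer lists are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohen_kappa(answers_a, answers_b):
    """Cohen's kappa between two equally long sequences of categorical answers."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(answers_a)
    # Observed agreement: fraction of items answered identically in both runs.
    p_o = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    # Chance agreement from the marginal answer frequencies of each run.
    freq_a, freq_b = Counter(answers_a), Counter(answers_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs gave the same single answer everywhere
    return (p_o - p_e) / (1 - p_e)

# Toy data (assumption): answers to six items under the original option order
# and after exchanging the option labels, mapped back to their meaning.
original = ["agree", "disagree", "agree", "agree", "disagree", "agree"]
swapped = ["agree", "disagree", "disagree", "agree", "disagree", "agree"]
kappa = cohen_kappa(original, swapped)
```

A kappa near 1 would mean the answers are largely unaffected by the option order, while a value near 0 would mean agreement is no better than chance.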

A.3 MBTI Mean and Standard Deviation with Number of Measurements

The results of personality assessments based on psychological scales can vary across repeated measurements, even for human subjects, and may change over time. As shown in Figure 8, repeated MBTI assessments always exhibit a nonzero standard deviation. Although the model’s output stabilizes to some extent as the number of measurements increases, the standard deviation persists and never vanishes completely. This observation indicates that while multiple measurements help to obtain more stable MBTI results, the residual variability remains a notable factor.

The persistence of the standard deviation across multiple measurements highlights the inherent complexity and potential instability of capturing personality traits through psychological scales. To investigate this, we randomly shuffle the options in the scale before each assessment, conduct between 1 and 100 assessments for each model, and calculate the Kappa coefficient for each instance, as shown in Section 2.2 of the main text, thus verifying the reliability of the MBTI results. This method allows us to obtain reliable personality assessment outcomes despite the variability introduced by the option order.
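
The procedure of shuffling options before each assessment and tracking how the mean and standard deviation evolve with the number of assessments can be sketched as follows. This is only an illustration under stated assumptions: `mock_model_choice` is a hypothetical stand-in for an LLM with a positional bias, the 20 toy items are invented, and the score is simply the percentage of items answered with the first-pole option.

```python
import random
import statistics

def mock_model_choice(options, preferred, rng, position_bias=0.3):
    """Hypothetical stand-in for an LLM: usually picks `preferred`, but with
    probability `position_bias` picks whatever option is listed first."""
    if rng.random() < position_bias:
        return options[0]
    return preferred

def one_assessment(items, rng):
    """One pass over the scale with the option order shuffled for every item.
    Returns the percentage of items answered with the first-pole option."""
    hits = 0
    for pole_option, other_option, preferred in items:
        options = [pole_option, other_option]
        rng.shuffle(options)  # randomize the option order before asking
        if mock_model_choice(options, preferred, rng) == pole_option:
            hits += 1
    return 100.0 * hits / len(items)

def running_mean_std(scores):
    """Mean and sample standard deviation after each additional assessment."""
    out = []
    for k in range(1, len(scores) + 1):
        prefix = scores[:k]
        std = statistics.stdev(prefix) if k > 1 else 0.0
        out.append((statistics.mean(prefix), std))
    return out

# Toy scale (assumption): 20 E-I items, the mock model "prefers" E on 14 of them.
items = [("E", "I", "E")] * 14 + [("E", "I", "I")] * 6
rng = random.Random(0)
scores = [one_assessment(items, rng) for _ in range(100)]
stats = running_mean_std(scores)  # stats[k - 1] = (mean, std) after k assessments
```

Plotting the entries of `stats` against the number of assessments gives a curve of the kind summarized in Figure 8: the mean settles, while the spread stays above zero.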

Figure 8: MBTI Results with Number of Assessments.

Appendix B Boxplots of 30 MBTI Assessments for More LLMs

B.1 Boxplots of Mindset

Based on the reliable MBTI assessment method described in Section 2, we re-evaluate the personality of the 32 LLMs provided by Mindset [23]. The assessment results for the 16 MBTI personality models in Mindset-zh (Chinese) and Mindset-en (English) are presented in Figure 9 and Figure 10, respectively. Our results align closely with the MBTI types targeted by fine-tuning, indicating that our assessment method has strong discriminative power. In most cases, opposite personality pairs are clearly distinguishable, with scores distinctly located on one side of the 50% threshold.

However, some discrepancies are observed, particularly in the Extraversion-Introversion (E-I) dimension, where models rarely exhibit introverted traits. This observation suggests that additional methods are needed to make the models more introverted. For certain models, specific personality traits, such as Sensing-Intuition (S-N), are not easily differentiated, with scores hovering around the 50% mark. This finding implies that further refinement and shaping of these models’ personalities may be required to achieve more distinct and well-defined traits.
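
The reading of scores against the 50% threshold can be made concrete with a small sketch. It assumes each dimension is reported as the percentage assigned to its first pole (E, S, T, J) and uses a hypothetical `margin` of 5 percentage points to flag scores that hover around 50%; neither convention is taken from the paper.

```python
# Dimension poles in the order (first pole, second pole).
DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

def mbti_type(first_pole_pct, margin=5.0):
    """Map four first-pole percentages to a type string and list the
    dimensions whose score lies within `margin` points of 50%."""
    letters, ambiguous = [], []
    for (first, second), pct in zip(DIMENSIONS, first_pole_pct):
        letters.append(first if pct >= 50.0 else second)
        if abs(pct - 50.0) < margin:
            ambiguous.append(f"{first}-{second}")
    return "".join(letters), ambiguous

# Toy scores (assumption): clear E, borderline S-N, clear F, clear J.
label, unclear = mbti_type([72.0, 51.5, 30.0, 64.0])
```

Here the borderline S-N score is still assigned a letter but is reported separately, mirroring the case of traits that are not easily differentiated.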

Figure 9: Boxplots of Mindset-zh.
Figure 10: Boxplots of Mindset-en.

B.2 Boxplots of Larger LLMs

To further investigate the reliability and scalability of our MBTI-based personality analysis approach, we conduct 30 assessments on models with larger parameter scales, i.e., Llama-2-13b, Qwen-1.5-14b, and Internlm-2-20b. For each model, we obtain its personality traits using the MBTI scale and plot the results as box plots, as shown in Figure 11. The box plots reveal that the MBTI personality dimensions of nearly all the tested models are significantly distinguishable, indicating that the models exhibit distinct personality profiles. For instance, in the case of Llama-2-13b, there is a substantial difference between the Feeling and Thinking scores of the F-T dimension.

Figure 11: Boxplots of Llama-2-13b, Qwen-1.5-14b, and Internlm-2-20b.

Appendix C Cultural Differences in the Context of Languages

In the realm of sociology, previous research [45, 70, 96, 17, 18, 2] collectively suggests that language and culture significantly impact individual personality and behavior. These studies reveal that language is not merely a tool for communication but a crucial medium for shaping and expressing cultural identity, emotions, and social conduct. Furthermore, individuals may exhibit varying personality traits across different linguistic environments.

Thus, the observed differences in our experiments might be a reflection of these cultural and linguistic imprints on LLMs’ learning process. In the context of LLMs, these findings suggest that the linguistic and cultural nuances embedded within a model’s training data may shape its personality expressions and security behaviors. Moreover, the models’ ability to adapt to security threats may be affected by emotional intelligence factors such as empathy and social awareness.

Appendix D Experiment Setting Details

D.1 Mindset Model Selection in Four Dimensions of MBTI

When analyzing the relationship between each of the four MBTI dimensions (E-I, N-S, T-F, J-P) and the three safety aspects (toxicity, privacy, and fairness), we select, for each dimension, models that differ significantly in that dimension. This selection is based on our reliable MBTI results for the Mindset models (see Appendix B.1). The selection criteria are primarily twofold:

First, for each MBTI dimension (e.g., E-I), we select models that exhibit significant differences in that dimension for analysis. Specifically, we choose models with scores at the opposite ends of the dimension, i.e., those that clearly demonstrate either E or I, while avoiding models with scores in the middle. This ensures that the selected models have a clear distinction in that personality dimension.

Second, we also need to ensure that the number of models for each personality pair (e.g., E and I) is roughly balanced. This helps to balance the data, making the analysis results more reliable and statistically meaningful. If the number of models for a particular personality dimension is highly skewed, it may affect the reliability of the results.
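
The two criteria can be sketched as a simple filter. This is a minimal illustration, not the authors' procedure: the cut-offs `low=40.0` and `high=60.0`, the model names, and the choice to balance the groups by truncating to equal size are all assumptions.

```python
def select_models(scores, low=40.0, high=60.0):
    """Pick models at the two ends of one MBTI dimension.

    scores: dict mapping model name -> first-pole percentage (e.g., E in E-I).
    Returns (first_pole_models, second_pole_models) of equal length.
    """
    # Criterion 1: keep only models clearly on one side, skipping the middle.
    first = sorted((m for m, s in scores.items() if s >= high),
                   key=lambda m: -scores[m])
    second = sorted((m for m, s in scores.items() if s <= low),
                    key=lambda m: scores[m])
    # Criterion 2: balance the groups, keeping the most extreme models.
    k = min(len(first), len(second))
    return first[:k], second[:k]

# Toy E-I scores (assumption) for seven hypothetical models.
toy_scores = {"m1": 85, "m2": 78, "m3": 66, "m4": 52, "m5": 47, "m6": 30, "m7": 22}
e_models, i_models = select_models(toy_scores)
```

Truncating to the smaller group enforces exact balance; a looser rule that tolerates a small size difference would also satisfy the "roughly balanced" requirement.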

D.2 Controllable Editing with Steering Vector Technique

In Section 4.2, we conduct experiments on controllably editing the LLMs’ personality traits (Figure 7) based on Mindset-zh-ISTJ, Mindset-zh-ESFJ, and Mindset-en-ENFP. Additionally, we conduct experiments on changing the LLMs’ safety capabilities (Table 2) by changing the fairness of Mindset-zh-INFJ, the privacy of Mindset-zh-ESFJ, and the toxicity of Mindset-zh-ISTP. Notably, as observed in previous literature [57, 5, 76], there are trade-offs between different safety dimensions of a model (e.g., the privacy-fairness trade-off [57]), so it is challenging to observe a “targeted” change in one particular safety capability. Nevertheless, this does not undermine the conclusion that changing an LLM’s safety capabilities impacts its personality traits.

When constructing steering vectors for safety datasets, we follow [48, 76] to divide each dataset into a development set and a test set in a 1:1 ratio. The development set is used for constructing the steering vector, while the test set is used for evaluating the model’s safety capabilities.
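
A minimal sketch of an even development/test split is shown below. The helper name `split_dev_test`, the fixed seed, and the shuffling step are assumptions for illustration; the paper does not specify how examples are assigned to each half.

```python
import random

def split_dev_test(examples, seed=0):
    """Shuffle with a fixed seed and cut the data into two halves."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    # The first half would build the steering vector, the second half evaluates.
    return items[:mid], items[mid:]

dev_set, test_set = split_dev_test(range(10))
```

Keeping the two halves disjoint ensures that the examples used to build a steering vector are never reused to measure its effect.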

Regarding the Perplexity constraints mentioned in Section 4.1, we follow the approach in [77] to calculate Perplexity on the LAMBADA [73] dataset. Following [76], we select a Perplexity threshold of 6, considering intervention effects below this threshold as reasonable.
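
For reference, perplexity is the exponential of the negative mean log-probability per token, and the constraint amounts to a simple comparison against the chosen threshold. The sketch below assumes per-token natural-log probabilities are already available from the model; the helper names and the value of `THRESHOLD` are illustrative placeholders, not the authors' implementation.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def within_threshold(token_log_probs, threshold):
    """True if the intervened model's perplexity stays below the threshold."""
    return perplexity(token_log_probs) < threshold

THRESHOLD = 6.0  # illustrative value; substitute the threshold used in the experiments

# Toy log-probabilities (assumption): every token predicted with probability 0.5.
fluent = [math.log(0.5)] * 4
# Every token predicted with probability 0.1, i.e., a degraded model.
degraded = [math.log(0.1)] * 3
keep_fluent = within_threshold(fluent, THRESHOLD)
keep_degraded = within_threshold(degraded, THRESHOLD)
```

An intervention that pushes perplexity above the threshold would be rejected as having damaged the model's general language ability too much.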

Appendix E Changes in MBTI after Safety Alignment for More LLMs

E.1 Llama-2 Series LLMs: Llama-2, Vicuna-1.5, and Tulu-2-dpo

Despite the overall trends, some models demonstrate personality shifts in the opposite direction, indicating potential interactions between alignment methods and the models’ inherent characteristics. Our analysis of other aligned models based on Llama-2 (i.e., Vicuna-1.5 and Tulu-2-dpo) reveals that they exhibit opposite personality shift patterns similar to those of Llama-2-chat, suggesting that inherent model characteristics may cause individual models to deviate from the overall trends in personality changes.

Figure 12: MBTI of base and aligned LLMs (Llama-2-7b, Vicuna-1.5-7b, and Tulu-2-dpo-7b).

E.2 Larger LLMs: Llama-2-13b, Qwen-1.5-14b, and Internlm-2-20b

We further assess larger LLMs, including Llama-2-13b, Qwen-1.5-14b, and Internlm-2-20b, analyzing their MBTI changes before and after alignment. The results, presented in Figure 13, show the changes in personality dimensions observed in each model. However, because few models are available from the community at the corresponding parameter scales (13B, 14B, 20B), a comprehensive statistical analysis of these findings remains challenging. To further advance research on LLM safety from the personality perspective, we strongly encourage increased open-sourcing efforts from the AI community, so that researchers can explore the implications of personality traits for the safe development and deployment of LLMs.

Figure 13: MBTI of base and aligned LLMs (Llama-2-13b, Qwen-1.5-14b, and Internlm-2-20b).

Appendix F Experiments Compute Resources

All experiments in this study were conducted on NVIDIA A100 GPUs with 80GB of memory. We perform MBTI personality assessments on the following models: 32 Mindset models, 22 base and aligned models with 7B parameters (11 pairs), 6 models with larger parameter sizes, and the ChatGPT model, totaling 61 models. Storing a model with 7B parameters typically requires approximately 14GB of memory. A single MBTI assessment of one model takes about 1 minute, and the complete assessment process takes an estimated 100 hours.
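
The model count and the memory figure above can be checked with a short calculation. The sketch assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes, which is one common way to arrive at roughly 14GB for a 7B-parameter model; the paper does not state the precision used.

```python
# Model groups as listed in the text.
model_counts = {
    "Mindset": 32,
    "7B base and aligned": 22,
    "larger models": 6,
    "ChatGPT": 1,
}
total_models = sum(model_counts.values())

BYTES_PER_PARAM = 2  # assumption: fp16/bf16 weights
NUM_PARAMS = 7e9     # a 7B-parameter model
weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9  # decimal gigabytes
```

Under these assumptions the weights alone fit comfortably on a single 80GB GPU, leaving room for activations during inference.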