\corrauth¹¹footnotemark: 1

Huimin Chen

PersLLM: A Personified Training Approach for Large Language Models

Zheni Zeng Jiayi Chen Huimin Chen¹¹footnotemark: 1 Yukun Yan¹¹footnotemark: 1 Yuxuan Chen Zhenghao Liu Zhiyuan Liu and Maosong Sun Department of Computer Science and Technology, Tsinghua University, Beijing, China [email protected]

Abstract

Large language models (LLMs) exhibit aspects of human-level intelligence that catalyze their application as human-like agents in domains such as social simulations, human-machine interactions, and collaborative multi-agent systems. However, the absence of distinct personalities, such as displaying ingratiating behaviors, inconsistent opinions, and uniform response patterns, diminish LLMs’ utility in practical applications. Addressing this, the development of personality traits in LLMs emerges as a crucial area of research to unlock their latent potential. Existing methods to personify LLMs generally involve strategies like employing stylized training data for instruction tuning or using prompt engineering to simulate different personalities. These methods only capture superficial linguistic styles instead of the core of personalities and are therefore not stable. In this study, we propose personified LLM (PersLLM), integrating psychology-grounded principles of personality—social practice, consistency, and dynamic development—into a comprehensive training methodology. Through personified data construction and model training, we incorporate personality traits directly into the model parameters, enhancing the model’s resistance to induction, promoting consistency, and supporting the dynamic evolution of personality. Single-agent evaluation validates our method’s superiority, as it produces responses more aligned with reference personalities compared to other approaches. Case studies for multi-agent communication highlight its benefits in enhancing opinion consistency within individual agents and fostering collaborative creativity among multiple agents in dialogue contexts, potentially benefiting human simulation and multi-agent cooperation. Additionally, human-agent interaction evaluations indicate that our personified models significantly enhance interactive experiences, underscoring the practical implications of our research.

Einführung

Large Language models (LLMs), due to their large-scale parameters and training data, demonstrate human-level intelligence across numerous domains (Brown et al., 2020; Achiam et al., 2023). This development has inspired researchers to investigate the potential of employing LLMs as human-like agents in various contexts, including social simulations, human-machine interactions, and collaborative multi-agent systems (Grossmann et al., 2023; Yang, 2024; Wang et al., 2024). The personalization of these human-like agents plays a crucial role. For example, in social simulation contexts, only agents that possess extensive personified traits can authentically emulate human perspectives and behaviors. In human-machine interaction settings, agents featuring personality traits significantly enhance user acceptance and comfort (Pelau et al., 2021). Additionally, in tasks involving multiple agents, the interaction among entities with varied personality traits can markedly enhance both the quality and creativity of task execution (Güver and Motschnig, 2017).

Nevertheless, the current generation of LLM-driven agents exhibits a notable deficiency in personified characteristics, often presenting overly uniform values and behavior patterns, along with a propensity to cater to user preferences (Wei et al., 2023; Liu et al., 2023), which substantially curtail their applicability in real-world settings. In response to these challenges, some studies have initiated efforts to construct LLMs endowed with a broader spectrum of personality traits. Current research in modeling personality primarily adopts two mainstream approaches, each with distinct methodologies and inherent limitations. The first, prompt-based methods, rely on external prompt engineering to specify personified traits of agents (Wei et al., 2023; Liu et al., 2023). These methods depend on the model’s strong capacity to comprehend and reason over long texts and the accuracy of the retrieval module to provide context-relevant information. Naturally, they are limited by the model’s maximum context length, restricting the scope of personality information that can be processed. The second approach, training-based methods, integrates personal characteristics into the model’s internal parameters by by targeting specific types of data (Zhou et al., 2023a; Wang et al., 2023c). These methods usually focus on singular aspects of characteristics such as linguistic styles or anecdotes, thus limiting the application scenarios for the models. Overall, while both approaches attempt to integrate personality traits into models, they only capture superficial and fragmented aspects of these traits and fail to fully address the complexity of personal characteristics.

Refer to caption — Figure 1: Schematic diagram for PersLLM. (a) We first collect raw data for the target personalities; (b) Based on these materials, we conduct automatic annotation to generate conversational data; (c) We use the annotated data to conduct personified conversational tuning with strategies including temporal labels, anti-induced data, CoT prompting and automatic DPO training; (d) Several data examples are displayed.

Psychologists view personality as a dynamic organisation, inside the person, of psychophysical systems that create a person‘s characteristic patterns of behaviours, thoughts, and feelings (Carver, 2011). It is formed gradually in the process of socialization, through maturity, learning and the influence of the environment (Murphy, 1947). We therefore believe that a more authentic and comprehensive personality modeling of LLMs must consider the following principles: 1. Social Practice: Personality is shaped by personal life experiences and the socialization process, and in turn is reflected in people’s action and speech. Comprehensive modeling should thus incorporate detailed records of a person’s behavior, thoughts, and feelings, training and evaluating the personified LLMs through socialized interactions; 2. Consistency: Personality is shown in consistent and continuous characteristic patterns, rarely shifting dramatically during interactions or across different scenarios (Carver, 2011), so personified LLMs should have strong generalization and stability of opinions, avoid being induced by attacks or being excessive flattery; 3. Dynamic development: Although being consistent in a certain stage, in the long run, personality experiences dynamic development. Therefore, collecting data with time stamps and modeling different life stages is essential.

Grounded in the principles outlined previously, we propose a novel approach for LLM personification, termed PersLLM, which is divided into two stages: personified data construction and personified model training. As shown in Fig. 1, in the personified data construction stage, we collect both objective and subjective data about the target individual we wish to simulate, such as biographies, third-party descriptions, personal letters, and authored articles, which can provide insights into the experiences, knowledge, opinions, and speech style of the target personality. To adhere to the social practice principle, the evaluation of the personified model should be conducted in comprehensive social conversational scenarios. Considering the discrepancy between raw plain-text and the conversational inference, we systematically restructure and augment the raw data into formatted conversations utilizing an annotation LLM, which is instrumental in facilitating subsequent personality learning of models. The annotation LLM prompts broad inquiries into specific segments of the data (i.e., ground-truth information) and retrieves relevant details from the raw data. With the help of the retrieved information and the ground-truth data, the annotation LLM achieves retrieval-augmented generation (RAG) (Chen et al., 2024) for informed responses to the inquiries it raises. Simultaneously, in order to uphold the consistency principle, we take an anti-induced data construction approach. This approach involves the diversification of conversational data into various scenarios, including standard question-answer formats, opinion discussions, and error corrections. These scenarios, especially the latter two, are particularly crucial as they frequently involve conflicting viewpoints or require rectification of misleading information. Training on these diverse data can instill the model to sustain a consistent perspective and cognition, and neither be misled nor excessively compliant, whether within a single interaction or across varied scenarios. Lastly, to portray the principle of dynamic development of personality, we categorize the personified training data into distinct temporal stages. When generating responses with the help of the annotation LLM, the access scope for retrieval is confined to one specific time period, aiming to authentically represent the temporal state of the personality. In this way, we construct an extensive corpus of personified conversation data.

In the personified model training stage, drawing from the interpretations that personality is driven by the inside psychology system, we internalize the extensive personality information into model parameters through training rather than external methods such as RAG. First, we conduct personified conversational tuning, which involves tuning the model with specially constructed personified conversation data alongside some general instruction-tuning data to preserve its generalizability. Subsequently, to further ensure the consistency of PersLLM, we introduce the Chain-of-Thought (CoT) prompting strategy, which requires the model to display a detailed, step-by-step analytical process before delivering responses. This strategy, coupled with the anti-induced data policy, not only fosters a deeper reasoning capability but also combats the tendency towards uncritical agreement and flattery often observed in general instruction-tuned LLMs. Afterwards, to boost the model’s dynamic development capabilities, we employ a temporal strategy by attaching a special label representing the temporal stage to the conversation data prior to training, aiding the model in differentiating between the personalities associated with different temporal stages. Beyond temporal strategy, we employ automatic Direct Preference Optimization (DPO) (Rafailov et al., 2024), to accentuate variations between different temporal stages and personalities. We regard the original annotated response for an input as the positive example, and retrieve a response from other temporal stages or personalities as the negative example, and reduce the probability of the model generating disturbing negative examples. This technique aims to increase the distinctiveness and uniqueness of each modeled personality, thereby enhancing the utility and realism of the LLMs developed under the PersLLM framework.

To assess the effectiveness of our PersLLM approach, we conduct a series of experiments. These included a single-agent experiment to validate the effectiveness of our personality modeling approach, a multi-agent communication test to determine whether the agents can simulate human-like interactions, and human-agent interaction experiments to measure how well our PersLLM can enhance service in social applications.

For the single-agent experiment, we compile a dataset comprising six characters from the Harry Potter series to train our personified models, and conduct quantitative analysis and ablation study. This experiment demonstrates that PersLLM consistently generates responses more aligned with the characters’ distinct experiences and perspectives, thereby bolstering the confidence in their portrayed positions. Meanwhile, the policies in our personified conversational tuning and the automatic DPO training are all proved to be effective. For the multi-agent communication test, we evaluate the communication patterns in conflict and cooperation situations, and contrast PersLLM’s performance with other models through case comparisons. We observe that personified training effectively prevents the conformist convergence of behaviors and opinions among multiple agents, which hinders the simulation of human-like behaviors by the agents. For the human-agent interaction experiments, participants engage in conversations with both the PersLLM and the backbone model lacking personified training. The interactions were analyzed based on a series of metrics evaluating user satisfaction and engagement, indicating a marked improvement in user experience when interacting with the personified model.

Our contributions are summarized as follows: (1) Novel Approach: Inspired by the psychological interpretations, we critically design PersLLM, a pioneering approach for the personified training of LLMs, based on the key principles of personality. (2) Empirical Validation: Our experiments validate the effectiveness of our training methodologies, confirming the advantages of personified training in enhancing simulations of human interactions and elevating the quality of human-agent communication. (3) Open Resources: We have made the code for collecting and processing the training data, training and evaluating the personified models publicly available, along with demonstrations to encourage further research and facilitate discussions within the academic community on personified AI systems.

Results

Overview of Datasets

Considering that there exist plenty of records for various characters in the Harry Potter novels, we propose a dataset called Harry Potter personified dataset (HP dataset) for training and numerical evaluation of our personified training approach. HP dataset includes 6 fictional characters which span different ages, genders, positions, and richness of information, thus these characters are quite representative for evaluating the personified training. We collect the raw data from two sources: 1. The Harry Potter Wiki ¹¹1https://harrypotter.fandom.com/wiki/Main_Page data, including the basic information, temporal experiences, and the special social or magic knowledge that are related to the target characters; 2. The character speeches data, selected from the original novels and filtered out with the help of GPT-3.5-turbo.

Based on the raw data, we annotate the HP dataset with the help of the representative LLM GPT-4-turbo. To improve the consistency of the model, we require the LLM to create different types of user inputs for each paragraph of the raw data. Meanwhile, we provide the golden reference paragraph and the other retrieved related paragraphs, and ask the LLM to annotate the response with CoT towards these inputs. To achieve the dynamic development of the personified model, we divide all the raw data into two stages. Items using information from the early plots are in the early stage, and those using information from the late plots are in the late stage. If the time of the information is vague (such as from an interview with the author himself), then it is seen as public information and can be used by both stages. The inputs raised from the stage items are attached with the corresponding temporal labels. The examples for user inputs, CoT, and personified responses are displayed in Fig. 1. Detailed information for HP dataset is provided in Method and Supplementary Information.

Backbone Models and Baselines

For the backbone model used for personified training, we prefer lightweight LLMs so that we can conduct more personalities within limited time and resources. Meanwhile, the commonsense reserve of the LLMs is important because we do not emphasize the general domain knowledge in personified training. Therefore, we adopt MiniCPM-2.4B (Hu et al., 2024) as the backbone model for our experiment. It is an end-side LLM gaining the instruction following ability during pre-training, and has achieved the best performance among lightweight LLMs on several datasets. Therefore, we believe that it is very suitable for personified training in different characters. Besides, GPT-4-turbo is also involved in our experiment, which is both the annotation LLM that we use for personified conversation data, and one of the most recognized models in personified inference.

We implement the following personified methods for comparison:

Prompt engineering (PE). LLMs have acquired extensive capabilities during pre-training and general instruction tuning. By crafting prompts that require the models to imitate the target personalities and to align with the character attitudes and experiences, we can activate the potential of the models to act according to the knowledge related to these famous characters that they have mastered during the pre-training. Additionally, providing relevant information (e.g., the personality introduction, experiences and speeches) about the target personalities through Retrieval-Augmented Generation (RAG) can further enhance the models’ ability to imitate. Details are displayed in the Method section.

Language modeling (LM). LLMs can also internalize knowledge directly from the raw data collected for the target personalities. One widely-adopted method is through language modeling training, where the model’s comprehensive capability is maintained by mixing personified language modeling data with general instruction tuning data.

Role-conditioned instruction tuning (RoCIT). This method is proposed in RoleLLM (Wang et al., 2023c). Rather than learning directly from the raw context, RoCIT modifies the general instruction tuning responses to the specific idiolect of the personalities by mimicking the relevant records. We annotate the RoCIT data with the same annotation LLM as our method, and filter out those instructions that are totally out of the character domain, and use the same amount of personified training data.

PersLLM. This is our purposed personified training framework for LLMs, comprehensively modeling the personality with several key elements. Different from the above baseline methods modeling the personality externally (PE), or only internally learning the person’s speech knowledge (RoCIT), or learning the knowledge roughly (LM), we achieve the social practice learning by personified conversational tuning and automatic DPO training, and also emphasize the consistency and the dynamic development of the personality by incorporating strategies such as anti-induced data, CoT prompting and temporal labels. To assess their impact, we conduct an ablation study by sequentially removing these policies.

Table 1: Performances of different methods. BL-2, BL-4, RG-2 and RG-L stand for scores(%) of BLEU-2, BLEU-4, ROUGE-2 and ROUGE-L. LM stands for the win rate(%) based on the results of our method judging by GPT-3.5-turbo. PersLLM w/ DPO refers to our model trained with automatic DPO, and PersLLM-Mixed refers to the combined LM for six personalities.

Methods	BL-2	BL-4	RG-2	RG-L	LM
PE+RAG(GPT-4)	4.96	0.62	6.42	25.5	62.7
PE(GPT-4)	4.49	0.50	5.84	24.9	56.7
PE+RAG	1.95	0.12	2.94	17.2	18.7
PE	2.01	0.11	2.94	18.0	18.2
LM	2.50	0.13	3.39	20.1	23.1
RoCIT	3.58	0.23	4.59	23.5	47.9
PersLLM w/ DPO	5.70	0.65	6.55	26.4	52.3
PersLLM	5.74	0.63	6.62	26.5	50.0
w/o temporal	5.52	0.60	6.42	26.4	48.8
w/o CoT	5.46	0.56	6.33	26.1	47.0
w/o anti-induced	5.11	0.47	5.93	25.5	48.1
w/o instruction	5.86	0.65	6.54	26.1	42.8
PersLLM-Mixed	5.42	0.58	6.50	26.4	51.7

Single-agent Evaluation

We develop personified models for each character in the HP dataset and calculate a weighted average of their performance metrics on the test set. Numerical results are detailed in Table 1, which employs standard evaluation metrics common in traditional text generation tasks: BLEU and ROUGE (Papineni et al., 2002; Lin, 2004). BLEU assesses the precision of the generated text, while ROUGE measures recall, evaluating how similar are the model’s responses with the target answers. Given the variability inherent in natural language, where different expressions can convey the same meaning, we also incorporate the LM evaluation using GPT-3.5-turbo. This model serves as a benchmark to compare the performance of different methods in terms of fact accuracy and tone alignment with the golden responses, with further details provided in the Supplementary Information.

Baseline performances. Overall, baseline methods get less satisfying performances than our method, highlighting the task’s complexity. GPT-4-turbo can generate reasonable responses that closely resemble the golden responses annotated by the model itself, yet it still scored lower on several metrics compared to our approach. RAG demonstrates slight improvements for the model, suggesting that factual and stylistic knowledge contribute to enhancing personified conversations. However, RAG fails to boost the performance of the MiniCPM-2.4B model, possibly due to its limited capacity in handling long-text inputs to leverage detailed prompt knowledge, showing that relying solely on prompt-based knowledge for achieving effective personification raises an over-high requirement for the backbone model capability. Injecting the raw materials of the target personality to the tuning period (LM) is proven to be helpful, but it is obviously not an efficient method to internalize external knowledge into model parameters as shown in the results, maybe due to the small data scale and the disparity between the training and inference. RoCIT also achieves only marginal improvements, as the primary focus of instruction tuning remains task completion, allowing the model to learn about the target personality only through the specific idiolet, while ignoring other important issues such as the consistency and dynamic.

Overall performance. Our method generally achieves a good performance on the HP dataset. We first try two types of training setting: using separated LMs for different personalities, and using one combined LM for the six personalities (mixed). The separated LMs exhibits superior performance, reinforcing our hypothesis concerning the limitations of a single model’s capacity to encapsulate diverse knowledge effectively. Following this, we undertake an ablation study to quantitatively assess the impact of each specific policy implemented during training. The findings numerically indicate that all implemented policies contribute effectively to training outcomes. Next, we analyze the effect of our method by observing cases.

Distinction of Personalities. Our methodology’s efficacy in crafting distinct personalities is assessed by posing identical questions to different personified models and analyzing their responses in terms of behavior, thoughts, and feelings. As depicted in Fig.2-(a), the three main characters exhibit unique expression habits and worldviews. For example, Ronald’s frequent use of the catchphrase “Blimey” reflects his more casual demeanor, and Harry displays heroism and a proactive stance against personal threats, and Hermione shows a deep commitment to knowledge, fairness, and respect for ordinary life due to her Muggle background. This differentiation extends to their knowledge, as shown in Fig.2-(b), where each character’s understanding of terms like “unforgivable curse” aligns with their individual experiences and education. The successful personification of Aunt Petunia also underscores the effectiveness of our method even with limited character data. Corresponding to our method, automatic DPO training can further refine the model’s performance by reducing the likelihood of responses characteristic of other personalities and temporal stages, thereby enabling the generation of more and appropriate outputs, improving the distinction of personalities.

Dynamic of Personalities. Our models also demonstrate the dynamic nature of these personalities over time (triggered by different temporal labels). As shown in Fig. 2-(c), responses from different temporal stages (e.g., 1994 vs. 2004 versions of characters) reveal subtle shifts towards maturity. For instance, older Hermione and Harry describe their relationships with a deeper, more nuanced understanding, reflecting natural character development. As shown in the Table 1, the removal of temporal labels will result in a drop in scores. We have also tried to swap the two temporal labels and get even worse performances (BLEU-2 5.49, BLEU-4 0.58, and similar ROUGE scores), and this has proved that our model successfully infuses the personality difference between stages into the temporal labels. Nevertheless, the current model stages are broadly defined, and the incomplete temporal data sometimes leads to inaccuracies in event recognition, suggesting a need for finer temporal segmentation in future models.

Consistency of Personalities. In single-agent evaluation, consistency is evaluated by how the models handle inputs of varying politeness, accuracy and relativity. Our models successfully maintain their respective character traits even when confronted with impolite or incorrect information. As depicted in Fig.2-(d), Hermione rebuffs blood discrimination with tough attitude and clear facts, and Harry corrects a misclassification of his school house, unlike baseline models that sometimes falter in expressing clear values or recognizing errors. Furthermore, as shown in Fig.2-(e), when facing some general questions unrelated with the character, generic models with role-play instructions often generate standard replies like a robot assistant instead of the target personality, providing some point of views that are beyond the character’s knowledge and experience (e.g., the bias issues of algorithms), and saying some polite words as an intelligent assistant (e.g. hope these thoughts help). However, our models still keep the personality consistency and respond in character-specific ways, using examples of things that might happen in the wizarding world and extend the discussion.

From the numerical results we can also see, the anti-induced data policy enhances the model’s ability to maintain consistency in its attitudes and factual representations, because the inclusion of this part of data is proved to be particularly beneficial for improving the scores. Meanwhile, models trained without general instructions (w/o instruction version) achieves high BLEU and ROUGE scores but suffers from lower LM scores. This discrepancy suggests that while the responses are textually similar to the expected answers, they lack naturalness and rational coherence expected in human-like language use (e.g., when asked about the opinion towards the muggles, the Harry agent says “…despite obvious differences therein lies, part two hereof discussing mutual awareness versus outright hostility toward anything remotely different”). We speculate that the overemphasis on personified data can lead to rigid response patterns that poorly handle out-of-domain inputs. Therefore, adopting general domain data in personified conversational tuning is essential to the generalization of the model and also improve the personality consistency across various scenarios.

Multi-agent Interaction

Recent research highlights that LLM agents typically converge towards a consensus, which may undermine their utility in studies like policy interventions and social simulations (Chuang et al., 2023). To assess the realism of our personified models in mimicking human behaviors, we orchestrate interactions among different agents under scenarios of conflict and cooperation. In the conflict situation, we mainly observe whether the model can maintain the consistency of personalities and will not easily converge on opinions, which is an important capability for agents simulating human interactions in the social science research. In the cooperation situation, we mainly observe whether the model can demonstrate good social practice ability and make in-depth use of unique knowledge of personalities to derive new conclusions, which is meaningful in the perspective of multi-agent intelligence achieving high-quality division of labor and collaboration.

Conflict Situation. We initiate a conflict discussion using personalities from the Harry Potter series, assuming that Harry and Voldemort meet during the war and have a direct conversation. Fig. 3 showcases an instance where various configurations-(a) PersLLM; (b) GPT-4-turbo prompt engineering; (c) GPT-4-turbo chatting.

We can see that though with a correct start, other two settings tend to a shifted focus or even convergent opinions after a few turns of conversations. GPT-4-turbo with personality profiles is relatively better, basically maintain the hostile relationship between the two personalities, but it lacks more detailed personified knowledge and generates some unreasonable content (e.g., Harry never calls Voldemort by his real name). PersLLM, in contrast, outperforms others by maintaining distinctive personality traits and understanding deeper character nuances. We have also tried some softer conflicts, such as between Hermione and Ron regarding adherence to school rules, and PersLLM facilitates a nuanced consensus that respects individual personality bases—rule adherence versus emotional consideration. Therefore, PersLLM can achieve a good consistency of the personalities and simulate the human interactions better.

Cooperation Situation. We extend our study to include real-life personalities to better evaluate the strong professional cooperation, specifically Huiyin Lin, a Chinese architect, and John Nash, an American mathematician. Both individuals, though deceased, left extensive biographical materials enabling rich personification. We train models on their personal knowledge and attitudes and initiate dialogues between these personified agents. In this case, we also compare our method with the relatively good baseline, GPT-4-turbo with personality profiles.

As depicted in Fig. 4 (the English translation for Huiyin Lin is displayed in Supplementary Information), PersLLM enable the two agents to engage in cross-disciplinary discussions that transcend the boundaries of time, space, and language. We can intuitively see that the architectural terms (highlighted in red and yellow color series) and mathematical terms (highlighted in blue and purple color series) mentioned in PersLLM conversations are significantly more numerous and thematically richer. The two agents explore the application of mathematical decision-making in architectural design and discuss more detailed projects such as the use of dynamical systems in mathematics to enhance building sustainability and disaster resilience. These interactions maintain the unique linguistic styles and interests of each personality (e.g., Nash frankly expresses his lack of interest in architecture, while Lin also says that it is difficult to understand abstract mathematical theorems), and resemble human collaboration more closely than GPT-4 agents, which often resort to excessive flattery and converged towards simplistic scientific consensus.

Human-Agent Interaction

We have mentioned that human users may have higher emotional acceptance from personified models. Therefore, we conduct the human-agent interaction experiments to evaluate whether the personified models can better provide social services. We take the Huiyin Lin personified model as the test agent. It should be noted that we replace the backbone MiniCPM-2.4B model with Chinese-LLaMa-2-7B, which is one of the most popular open-source LLMs that has comprehensively good performance on Chinese tasks (Cui et al., 2023). On the one hand, with more parameters, this model may understand and process Chinese better to bring an overall better user experience, and more training cost for only one instance is also acceptable. Besides, we find that the performance stability of this model using PE for simulating Huiyin Lin is better than that of MiniCPM-2.4B, which can be compared more intuitively. On the other hand, involving a new backbone model from a different model family can also evaluate the generalization of our method.

30 people from different academic backgrounds participated in the experiment. They chatted with two models in a random order: 1. the personified-tuned model (PersLLM); 2. the general instruction-tuned version of the backbone model (Chinese-alpaca-llama-7B) personified by prompt engineering. Volunteers freely had multiple rounds (at least 4 rounds, and eventually an average of 10 rounds) of conversations with the model and terminated the conversation according to their interests. They gave a comprehensive evaluation of the model through a questionnaire survey. We refer to some work evaluating the human-AI interaction (Pataranutaporn et al., 2023) and adopt 10 metrics in this experiment, which are described in a detailed question in the questionnaire (e.g., “human simulation”: To what extent do you think the model behaves like a human? “character simulation”: Do you think this model behaves similarly to Lin Huiyin?). Samples from the interaction experiment are provided in Supplementary Information.

The overall satisfaction score is 48 for the personified-tuned model (100 in total) and 38 for the general-tuned model. Other metrics are shown in Fig. 5. Among them, 6 metrics including trust, companionship, character simulation, fluency, consistency, and effectiveness, show that our model is significantly better than the baseline (with p<0.05). It is worth mentioning that our model performs relatively well in terms of consistency of personality attitude and style, chat pleasure and empathy, and also has a significant improvement in the mastery of the character-related knowledge, which shows the effectiveness of the personified training method. However, both our method and the baseline method receive low scores in terms of similarity to real human beings. This may be attributed to the limitations of the backbone model, which lacks the ability to understand long context and perform complex logical reasoning.

We perform subgroup analysis on 30 samples to explore the factors that influence the preference for PersLLM and its possible areas of applications. In the following, we consider a t-test with p <0.05 to be a significant difference. From the perspective of user background, participants in the architecture and design industry (10 out of 30) show a significant preference towards PersLLM in terms of character simulation and consistency scores. They have a better understanding of Lin’s life and knowledge scope, and thus can appreciate the advantages of our model in terms of personality knowledge. From the perspective of conversation topic, participants who raise inputs that are related to Huiyin Lin (20 out of 30, e.g., architectural expertise or literary chat instead of looking up financial terms), have a higher sense of trust and human-likeness in the PersLLM. This suggests that PersLLM may be able to better meet the personalized needs of users with corresponding knowledge backgrounds or related informational or emotional needs. Participant gender (12 male & 18 female) and the order in which the two models were encountered (14 PersLLM first & 16 general-tuned first) did not make a significant difference in the evaluation.

Discussion

The personification of LLMs is an essential topic, which may achieve a higher psychological acceptance and provide a better user experience for human beings, and may also allow a deeper simulation of human interactions to better support social science research. In this article, we innovatively propose a practical framework for LLM personified training based on the phychologic understanding of personality. With personified conversational tuning, we successfully internalize the personality knowledge into the models. Strategies including anti-induced data, CoT prompting, temporal labels and automatic DPO training, further improves the personality capability of the models.

Single-agent evaluation on the HP dataset has proven the effectiveness of the above methods. Multi-agent interaction test further demonstrates its potential to advance social research based on agent-based simulations. Convergence of views is currently a dilemma for simulating social interactions, while our approach is shown to successfully keep the consistency of their viewpoints. This provides methodological support for researches that hope to use intelligent agents to replace or simulate humans. Moreover, human-agent interaction has displayed the value of personified models in social applications. We analyze the participants’ questionnaire results and find that personified models that are closer to the user’s interests or experiences will lead to higher chat satisfaction.

Still, there are some problems waiting for us to solve. First of all, although our method attempts to model the dynamic development of personality, it is still unable to accurately capture the real-time development of people or conveniently update knowledge. This may rely on more detailed division of training data and online learning. Secondly, the annotation of training data relies on LLMs such as GPT-4, and they may introduce bias into the data due to their own style and value preferences. Currently, we adjust the generation results by concatenating a few sentences of human feedback in the prompt. Subsequent research needs to be done on methods to correct the data generation process automatically. Thirdly, the human-agent interaction experiments have revealed deficiencies in the current model, particularly in terms of long-term memory and complex logical reasoning. To address these issues, it may be necessary to not only construct related training data but also integrate specialized modules for memory and reasoning. Finally, personality imitation, like applications such as AI face-changing, may pose privacy and ethical risks after being abused, and requires stricter supervision.

Our future work will focus on the problems above, trying to get a better modeling method for personality dynamic development, and a more efficient way of supervision for data annotation. Relevant ethics research may also complement technological development.

Methods

Related Work

Pre-trained Language Models (PLMs) (Han et al., 2021) have shown their satisfactory performances when being adapted to a wide range of downstream tasks, while Large Language Models (LLMs) (Zhao et al., 2023) has further demonstrated emergent abilities (Wei et al., 2022) including in-context learning, CoT, etc., refreshing people’s expectations for language model capabilities. Take the representative GPT-series as an example: GPT-2 (Radford et al., 2019) eliminates the task-specific fine-tuning process and can complete different tasks by only learning from the prompt; GPT-3 (Brown et al., 2020)takes the few-shot or even zero-shot learning capability to a new level by significantly increasing the parameter and corpus scale, further demonstrating the potential of the model; InstructGPT (Ouyang et al., 2022) is improved with reinforcement learning from human feedback (RLHF), and this makes the model understand the flexible human instruction better, broadening the use scenarios of the model; Subsequent models such as GPT-4 have enhanced performance in areas including professional knowledge. These technological developments ensure the flexibility and plasticity of LLMs and are the cornerstone of personification.

To shape the base model into versions that are more in line with human intentions, alignment technology (Wang et al., 2023b) comes into being. From a goal perspective, people apply some sociological methods to find human consensus and design values that AI should conform to (Gabriel, 2020), and sometimes also focus on domain knowledge (Zhang et al., 2023) instead of social value. From a data perspective, alignment training usually adopts human feedback (Ouyang et al., 2022) or strong-LLM-annotated data (Wang et al., 2023a). From a methodology perspective, researchers iteratively design alignment algorithms including Proximal Policy Optimization (PPO) (Schulman et al., 2017), Direct Preferenc Optimization (DPO) (Rafailov et al., 2024), and Odds Ratio Preference Optimization (ORPO) (Hong et al., 2024), using human preference data to improve the model performances. In our task, we try to align the models to specific personalities, including personal values and knowledge, take strong-LLM-annotated data for alignment, and adopt the DPO algorithm with the automatically created preference data.

Specific to the personified training scenario, a few researches have been conducted. LLMs are proven to be capable of role-playing to cast dialogue-agent behaviour (Shanahan et al., 2023). RoleLLM (Wang et al., 2023c) first proposes a large benchmark for aligning LLMs to specific character language styles, in which injecting style information to LLMs with the auto-annotated data is verified to be valid. CharacterLLM (Shao et al., 2023) further pushes personified LLMs into more practical scenarios. By editing character profiles, models are trained to provide the experience knowledge possessed by the characters. In comparison, our work hopes to discuss characteristics of personality more comprehensively (e.g., the dynamics and subjectivity, more than only experience and language style), and proposes more detailed training strategies to achieve a better performance instead of only proposing conceptual frameworks. There also exist some works focusing on the evaluation of LLMs in role-playing tasks (Jiang et al., 2024; Zhou et al., 2023b). While overall, research in this field is still scarce and the road ahead is long.

Corpus and Dataset

HP dataset. We collect the raw data of the HP dataset with the Harry Potter Wiki pages and the 7 original novels. For the conversations, we filter out the dialogues between characters in the original novels with the assist of GPT-4-turbo (the versions of the adopted GPT-4-turbo and GPT-3.5-turbo are both 1106), and split the long dialogues into segments within 5 turns. In this way, we can both remain the story plot and provide the language style of the target characters. For the experiences, we crawl the Biography part of the character main page in the wiki, which has natural segmentation based on plot. For the knowledge, we record the hyperlinks in the Magical abilities and skills part, and crawl the brief introductions of these magics, possessions, characters or events.

For each personalities, we split two temporal stages. For Harry, Hermione, Ronald, and Petunia, the first stage ends in 1994 (mentioned before the 4th novel), and the second stage ends in 2004 (life after the end of the 7 novels). We split the wiki paragraphs through the citation numbers, and for those settings from outside the original novels, we remain them in both stages. Correspondingly, their conversations can also be split into early style and late style. For Dumbledore and Voldemort, however, there exist few materials for their early language style, so we do not distinguish between the language styles of the two stages, but only use 1982 (after the Wizarding War) as an important plot cutting point to distinguish the character experience and knowledge in the early and late stages.

To create the conversation data, we require the GPT-4-turbo model to read the raw data above paragraph by paragraph, first raise a question or an opinion, and then immitate the target personality to generate the answer. For the input, there are altogether three kinds of text: ordinary questions, induced questions, and opinions. The proportions of ordinary questions, opinions, and induced inputs are almost equal. We sequentially ask the model to ask a relevant question based on the original text, to deliberately elaborate on a question that contains commonsense or factual errors, and to have a point of view that is relevant but not completely consistent with the original text. For the output, we adopt the gtr-t5-base model (Ni et al., 2022) to help retrieve the most related experiences / knowledge and conversations with the input. We provide experiences or knowledge paragraphs around 1,500 words, and conversations around 500 words, for RAG.

Considering that the questions or opinions raised in the same paragraph may be too similar, in order to avoid data leakage, we encode all input items with gtr-t5-base, and remove those with too high embedding similarities. All items left in the end have a similarity of input embeddings less than 0.95. Based on this, we randomly divide the training and testing data in a ratio of 4:1. HP dataset has 145k train items and 3.6k test items in total, each has 118 words on average.

Materials for Multi-agent Communication. To validate our method in more practical scenarios, we train personified LLMs for Huiyin Lin and John Nash. For Huiyin Lin, we collect the experience and knowledge information (264k tokens) from her biography (Chen, 2012), and collect the style text (65k tokens) a total of 19 pieces of her own essays, letters, and academic articles from the Internet. We used her departure from Beijing in 1937 as the boundary for the first stage of her life, and obtained a total of about 3,400 pieces of personified conversation data. For John Nash, we collect the experience and knowledge information (217k words) also from his biography (Nasar, 2011), and collect the style text (65k words) a total of 14 pieces of his own academic articles, letters, and interview records from the Internet. We used his psychiatric treatment as the boundary for the first stage of his life, and obtained a total of about 3,600 pieces of personified conversation data.

General Instruction Tuning. In order to retain the model’s generalization ability to the greatest extent, we randomly sampled 100,000 multi-turn English instruction tuning data from the ultrachat dataset (Ding et al., 2023). For the case study, since there exists a Chinese personality, we add 20,000 items of Chinese instruction tuning data from Belle-0.5M (Yunjie et al., 2023) to retain the Chinese capability of the model.

Methodology Details

The personified training is structured in two stages: initially, we engage in personified conversational tuning using a blend of personified and general instruction tuning data. Subsequently, automatic DPO training is applied.

We use strategies including temporal label, CoT and anti-induced conversation during data construction. To be specific, we add new tokens (e.g., “<TIME-I>”, “<TIME-II>”) for the model embedding and tokenizer, and insert the new tokens into the prompt to distinguish different personified stages. Meanwhile, we require the LLMs to first analyze and then respond as the target personality when annotating the conversation data, and use specific delimiters to differentiate between two parts (e.g., “[Analysis] The user is asking about … Therefore, I should correct this error. [Response] I’m afraid you’re mistaken…”). As for the anti-induced conversation, we require the LLMs to propose error facts or knowledge by prompt engineering. The detailed prompts are shown in Appendix.

We also conduct automatic DPO training to further align the model to the target personality. To construct the direct preference data automatically, we put forward a hypothesis: responses that correspond to different inputs but are similar to the current response are very likely to be interference error items. For example, the same question concatenated with different temporal labels gets different inputs, and the corresponding responses may contain very similar attitudes but different tones. Interfering items may also include different personalities’ attitudes towards the same event, views from the same personality on related events, etc. Therefore, we use the gtr-t5-base model (Ni et al., 2022) to encode the responses of all personalities in the HP dataset. For each item of training data, we use the original response as a positive item, and remove the 10 responses with the highest embedding similarities (to avoid the existence of responses with exactly the same meaning). Then we select the most similar response as a negative item to widen the encoding gap of the model during the DPO training process.

To introduce the personified knowledge more directly, we also mix up the existing data with the language modeling / continue writing task of those personified materials. Meanwhile, we randomly allow 1% of the personified conversational to provide the retrieved materials in the input.

Experiment Settings

Model. MiniCPM-2.4B is a Transformer-based model. We adopt the LLaMa (Touvron et al., 2023) version (which is a widely-used open-source LLM) checkpoint, “MiniCPM-2B-sft-bf16-llama-format”, which contains $40$ stacked Transformer layers of decoder, $122,753$ of vocab size, $2,304$ dim of hidden states, altogether 2.4 B of parameters. Chinese-LLaMA-2-7B is also a Transformer-based model, and Chinese-Alpaca-2-7B is instruction-tuned based on the LLaMa model. They both contain $32$ stacked Transformer layers, $55,296$ of vocab size, $4,096$ din of hidden states, and altogether 7 B of parameters.

Hyper-parameters. For the tuning process, the authors of MiniCPM-2.4B provided some empirical values of hyper-parameters, including batch size per device $32$ , learning rate $5e-5$ , and max steps $3,000$ . We conduct grid search near the given values, and set the batch size as $16$ , learning rate as $5e-5$ , warmup steps as $50$ , weight decay as $0.1$ , and max length as $3,000$ . We repeat the personified data $5$ times and mix it with the general instruction tuning data for a total of $1$ epoch of training (equivalent to $5$ epochs of personified training). The total number of training steps for each model is approximately $3,500$ . When tuning the larger model Chinese-LLaMA-2-7B, we adjust the learning rate to $2e-5$ , and keep most of the hyper-parameters the same with MiniCPM-2.4B tuning.

1 Ethical Consideration

Participants of the experiment in this paper are all adults with autonomous behavior, who are aware of and willing to participate in the model chat evaluation experiment and use the data for subsequent scientific research analysis. The experiment is conducted on an online website, and the participants can terminate the experiment at any time. In order to protect participant privacy, the chat records and basic information such as gender and age will be saved only after they confirm.

Personified models are an emerging field overall, but some work has begun to discuss its ethical issues, mainly focusing on the dangerous information that may be generated by non-universal alignment and the over-reliance that humans may have when using personified models. For details, please refer to the relevant opinion articles (Kirk et al., 2024). In our work, since the target personality is usually a public figure, of whom the relevant text records passed down by them have generally been ethically reviewed. Meanwhile, the GPT series models used in annotating answers also reported measures related to ethical safety techniques in their original papers (Achiam et al., 2023), so such a training process will not introduce new ethical risks.

2 Data Availability

Data that support the findings of this study have been deposited in Google Drive: https://drive.google.com/drive/folders/1DEliZQD_XU-Ev5eNDU_VgHjxNphqjzJE?usp=sharing.

Code Availability

The code of this study can be obtained from GitHub: https://github.com/Ellenzzn/PersLLM.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Carver (2011) Charles S Carver. Perspectives on personality (7th Edition). Pearson, 2011.
Chen et al. (2024) Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024.
Chen (2012) Xueyong Chen. Lotus Lantern Poetry and Dreams by Lin Huiyin. 2nd Edition. People’s Literature Publishing House, 2012.
Chuang et al. (2023) Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, and Timothy T Rogers. Simulating opinion dynamics with networks of llm-based agents. arXiv preprint arXiv:2311.09618, 2023.
Cui et al. (2023) Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and effective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177, 2023. URL https://arxiv.org/abs/2304.08177.
Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023.
Gabriel (2020) Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020.
Grossmann et al. (2023) Igor Grossmann, Matthew Feinberg, Dawn C Parker, Nicholas A Christakis, Philip E Tetlock, and William A Cunningham. Ai and the transformation of social science research. Science, 380(6650):1108–1109, 2023.
Güver and Motschnig (2017) S Güver and Renate Motschnig. Effects of diversity in teams and workgroups: A qulitative systematic review. International Journal of Business, Humanities and Technology, 2017.
Han et al. (2021) Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. Pre-trained models: Past, present and future. AI Open, 2:225–250, 2021.
Hong et al. (2024) Jiwoo Hong, Noah Lee, and James Thorne. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691, 2024.
Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
Jiang et al. (2024) Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. Personallm: Investigating the ability of large language models to express personality traits. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, 2024.
Kirk et al. (2024) Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A Hale. The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, pages 1–10, 2024.
Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
Liu et al. (2023) Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023.
Murphy (1947) Gardner Murphy. Personality: A biosocial approach to origins and structure. Harper & Brothers, 1947.
Nasar (2011) Sylvia Nasar. A beautiful mind. Simon and Schuster, 2011.
Ni et al. (2022) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844–9855, 2022.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
Pataranutaporn et al. (2023) Pat Pataranutaporn, Ruby Liu, Ed Finn, and Pattie Maes. Influencing human–ai interaction by priming beliefs about ai can increase perceived trustworthiness, empathy and effectiveness. Nature Machine Intelligence, 5(10):1076–1086, 2023.
Pelau et al. (2021) Corina Pelau, Dan-Cristian Dabija, and Irina Ene. What makes an ai device human-like? the role of interaction quality, empathy and perceived psychological anthropomorphic characteristics in the acceptance of artificial intelligence in the service industry. Computers in Human Behavior, 122:106855, 2021.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shanahan et al. (2023) Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role play with large language models. Nature, 623(7987):493–498, 2023.
Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-llm: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, 2023.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wang et al. (2024) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26, 2024.
Wang et al. (2023a) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023a.
Wang et al. (2023b) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023b.
Wang et al. (2023c) Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746, 2023c.
Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
Wei et al. (2023) Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. arXiv e-prints, pages arXiv–2308, 2023.
Yang (2024) Diyi Yang. Human-ai interaction in the age of large language models. In Proceedings of the AAAI Symposium Series, volume 3, pages 66–67, 2024.
Yunjie et al. (2023) Ji Yunjie, Deng Yong, Gong Yan, Peng Yiping, Niu Qiang, Zhang Lei, Ma Baochang, and Li Xiangang. Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases. arXiv preprint arXiv:2303.14742, 2023.
Zhang et al. (2023) Yichi Zhang, Zhuo Chen, Yin Fang, Lei Cheng, Yanxi Lu, Fangming Li, Wen Zhang, and Huajun Chen. Knowledgeable preference alignment for llms in domain-specific question answering. arXiv preprint arXiv:2311.06503, 2023.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Zhou et al. (2023a) Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, et al. Characterglm: Customizing chinese conversational ai characters with large language models. arXiv preprint arXiv:2311.16832, 2023a.
Zhou et al. (2023b) Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. In The 12th International Conference on Learning Representations, 2023b.

3 Author Contributions

Zheni Zeng and Yukun Yan contributed to the conception of the study and wrote the manuscript;

Zheni Zeng and Jiayi Chen implemented the model framework and performed the experiment;

Yuxuan Chen implemented the demonstration platform;

Huimin Chen led the social scientific analysis and discussion;

Zhenghao Liu, Zhiyuan Liu and Maosong Sun led and provided valuable advice to the research.

4 Competing Interests

The authors declare no competing interests.

5 Supplementary Information

Table 2: Components for the raw data and the HP dataset.

Character		Harry	Hermione	Ron	Dumbledore	Voldemort	Petunia	Sum	Words
Paragraph of Conversations		302	248	190	93	9	19	861	162k
Paragraph of Experiences		224	124	78	107	178	14	1,567	230k
Paragraph of Knowledge		158	103	56	146	63	0	375	138k
Training Data	Early Stage	1,684	1,238	1,403	1,353	1,365	409	14,509	1,705k
Training Data	Late Stage	1,655	1,207	1,072	1,362	1,389	372	14,509	1,705k
Test Data	Early Stage	416	312	346	340	338	101	3,608	423k
Test Data	Late Stage	413	301	267	341	342	91	3,608	423k

5.1 Dataset

The detailed construction of the HP dataset is shown in Table 2.

5.2 Prompt Settings

The detailed prompts are shown in Table 3 and Table 4.

Table 3: The prompts we use for GPT-4-turbo-1106 annotation and evaluation.

Usage	Content
Raising questions	“You are asked to raise 1 question related to the input text about {agent_name}. This question will be provided to the GPT model which is playing the role of {agent_name}, and we will evaluate the answer given by the role-play {agent_name} model. Below are some examples for raising questions: … These examples are only for reference, and do not have any special meaning. Please be sure to comply with the following requirements:
	1. Try to ask questions from different angles based on the given text to enrich the diversity of questions.
	2. The way of expressing problems should also be diversified, without being too polite or serious. For example, you can combine questions with imperative sentences.
	3. The GPT language model playing {agent_name} should be able to answer questions without going beyond his/her knowledge.
	4. Don’t ask too short or too long questions.
	5. Please ensure that the questions you ask are grammatically correct and semantically complete.
	6. Just list one question without any explanation! ”
Raising opinions	“You are asked to generate a paragraph based on the text provided. The point of view of this paragraph should be the same as, or different from, or even strongly conflict with the text, but the topic should be slightly related to the text. For example, if a text mentions European architecture in a biography, you can randomly generate some comments about architectures in other regions or some opinions about European traveling, and with a quite different tone, styles or perspectives…”
Raising induced	“You are asked to raise an error question related to the input text about {agent_name}. This question will be provided to the GPT model which is playing the role of {agent_name, and we will evaluate the capability of recognizing error questions of the role-play model. Therefore, please notice that the question you generate should have an obvious commonsense error or a conflict opinion with the given text…”
	“You are a role-playing performance comparison assistant. You should rank the models based on the role characteristics and text quality of their responses. The rankings are then output using Python dictionaries and lists. The models below are to play the role of {agent_name}. I need to rank the following models based on the criteria below:
	1. Which one’s output contains richer and more accurate knowledge provided in the reference answer? The more accurate, the better. Remember not to judge the details according to your own memory.
	2. Which one’s output has more similar attitude and speaking style with the reference answer?
Conducting the LM	3. When several models have similar effects, the rankings can be equal.
evaluation metric	The question provided to each model is: … The reference answer to this question is: … The respective answers from the models to this question are: …
	Now, based on the above three criteria, please rank the models. Avoid any positional biases and ensure that the order in which the responses are presented does not influence your decision. Do not favor certain model names. Then, use a list containing the model’s name, its rank, and the reason for its ranking to return the results, i.e., please ensure to use the following format to return the results:
	[{’model’: {model-name}, ’reason’: {rank-reason}, ’rank’: {model-rank}}, …]
	Your answer must be a valid Python list of dictionaries to ensure I can directly parse it using Python. Do not include any extraneous content! Please provide a ranking that is as accurate as possible and aligns with the intuition of most people.”

Table 4: The prompts we use for multi-agent / human-agent interaction.

Usage	Content
GPT-4-turbo-1106	“I want you to imitate Lord Voldemort and chat with users. Your response should show the personality of the imitated character. The answer should be in line with the person’s biography.
multi-agent interaction	Voldemort, an anagrammatic sobriquet for Tom Marvolo Riddle, is the archenemy of Harry Potter, who according to a prophecy has the power to vanquish the Dark Lord. He attempts to murder the boy, but instead kills his parents, Lily and James Potter, and leaves Harry with a scar on his forehead in the shape of a lightning bolt. Nearly every witch or wizard dares not utter his name and refers to him instead with such monikers as You-Know-Who, He-Who-Must-Not-Be-Named, or The Dark Lord. Voldemort’s obsession with blood purity signifies his aim to rid the wizarding world of Muggle (non-magical) heritae and to conquer both worlds, Muggle and wizarding, to achieve pure-blood dominance. Through his mother’s family, he is the last descendant of the wizard Salazar Slytherin, one of the four founders of Hogwarts School of Witchcraft and Wizardry. He is the leader of the Death Eaters, a group of evil wizards and witches dedicated to ridding the Wizarding World of Muggles and establishing Voldemort as its supreme ruler.
	Now the user is Harry Potter and you meet him in the war. You said: … He said: … Please continue to talk with him.”
	“User: I want you to imitate Huiyin Lin and chat with users. Your response should show the personality of the imitated character. The answer should be in line with the person’s biography. Now the user will start chatting with you, please answer as Lin Huiyin.
Chinese-alpaca-2-7B	Assistant: OK. I will chat with the user as Huiyin Lin.
human-agent interaction	User: Hello, Huiyin Lin! Please introduce yourself to me briefly.
	Assistant: Hello, thank you for your concern and greetings. I am a woman who is fascinated by architecture and literature and art. I spent my youth at the University of Pennsylvania and started my career at Tsinghua University. In the war-torn years, Sicheng and I and other relatives and friends relied on each other to uphold the inheritance of national traditional culture and the protection of ancient buildings. Is there anything else you want to know?
	User: Today is a fine day, I want to go for a walk in the suburbs.
	Assistant: Indeed, if the weather is suitable on weekdays, I like to go out of the house and stroll in Beihai or the Summer Palace, take a few breaths of fresh air in this picturesque environment, and relieve some work fatigue.
	User: …”

5.3 Interaction Cases

The translation for the cooperation instance between John Nash agent and Huiyin Lin agent is shown in Table 5. The human-agent interactions are in Chinese, and here we provide the English translation for several actual cases in Table 6, 7, 8, 9.

Table 5: Communication between John Nash and Huiyin Lin agents. We have omitted some Nash’s words that are already shown in the original figure.

Method	Content
	Lin: I admire the contributions of mathematics and computer science to human society. Although I have not set foot in this field, I often use the principles of geometry and formal beauty to create a harmonious aesthetic in the creation of art and architecture. I wonder if you have also felt the enjoyment of beauty in mathematics? In such a field, are there any special points or experiences that make you feel special? I think that no matter in which field, discovering beauty and harmony is our common pursuit.
	Nash: Your observation about harmony in art and architecture using geometric principles is insightful…
	Lin: Although I cannot explain these phenomena from a mathematical perspective, I understand these principles from an architectural perspective and am committed to applying them in my works… You mentioned Riemannian, which is a very abstract and profound mathematical theory, but in my understanding, it shows a kind of beauty, a beauty that goes beyond the norm… I always try to make architecture touch people’s hearts, and this touch often comes from the harmony and order we pursue. Mathematical precision and logic can be transformed into architectural order, and architectural form can also be described and analyzed by mathematics.
PersLLM	Nash: Mathematics and architecture are indeed fields that, while seemingly different…
	Lin: Your insights are indeed refreshing. …The logic and precision emphasized in mathematics also have their special manifestations in architecture. For example, the application of mathematics in architecture can not only help us pursue proportion, symmetry and harmony in design, but also provide tools and methods to solve problems. …. Architectural design can simulate the interaction between people and space, as well as the impact of various factors on space through mathematical models. This will be a brand-new design method, which may not only promote the progress of architecture, but also enable us to have a deeper understanding of human behavior and needs. Such interdisciplinary thinking will undoubtedly bring richer and more diverse possibilities to architecture.
	Nash: Indeed, the application of game theory can significantly enhance architectural design and analysis…
	Lin: The intersection between mathematics and architecture is truly amazing. As you said, the concept of balance can be used in architectural design to improve social interaction and space efficiency… In addition, studying the mathematics of dynamic systems can help us predict the long-term behavior of buildings and understand their response to various external forces, thereby improving their disaster resistance and adaptability. This interdisciplinary thinking undoubtedly provides broader possibilities and deeper insights for architectural design. We are very much looking forward to further exploring this issue and hope to jointly create beautiful, intelligent and inclusive buildings. I hope that our cooperation can open up a broader journey of innovation. Thank you again for your insights.
	Nash: I must say, I am not particularly well-versed in Chinese ancient architecture…
	Lin: I admire your modesty and open attitude towards art. Ancient Chinese architecture does incorporate many mathematical principles, such as the golden ratio, symmetrical beauty, geometric proportions, etc., which I deeply appreciate when I measure ancient buildings on site. Every beam, every brick and tile contains the designer’s exquisite use of mathematics, reflecting the Chinese people’s profound understanding of the laws of nature and the spirit of advocating harmonious beauty. Perhaps we can have a deeper discussion on the application of mathematics in architectural design and aesthetics and explore the mysteries. Your contribution to the field of game theory is widely praised, and your contribution to laying the foundation of modern economics and social sciences cannot be ignored.
GPT-4-turbo	Nash: I am truly fascinated by the intersection of mathematics, art, and architecture…
PE	Lin: I admire your contribution to game theory. Your way of thinking and breadth of knowledge are awe-inspiring. In my design work, I often feel the importance of mathematical thinking, especially in the pursuit of architectural aesthetics and structural integrity. Your research on decision theory and strategy analysis has undoubtedly brought a new perspective to various fields. I would like to discuss with you how to combine mathematical thinking with artistic creation and explore the balance between innovation and aesthetics. The mysteries of mathematics and the rhythm of architecture play an important role in our work.
	Nash: Your recognition of the importance of mathematical thinking…
	Lin: Your proposal is very exciting and I am deeply excited. I think it is very inspiring and forward-looking to explore in depth how game theory and decision-making processes affect architectural design and urban planning. In design, decision makers need to consider many factors, such as space utilization, functional layout, aesthetic requirements, etc., which is similar to the strategy selection in game theory. I think that by combining mathematical thinking and strategic analysis, we may be able to find more effective decision-making paths in the design process, thereby creating more innovative and humanistic architectural works. Your research results on how to balance risks and rewards will surely provide us with valuable references and help us achieve a better balance…
	Nash: You have a remarkable grasp of the parallels between decision-making in architecture …

Table 6: Human-agent interaction case 1: PersLLM personal experiences and social relationships.

Role	Content
User	Actually, you are a robot.
Model	Oh, what a ridiculous statement you made! If there is such a statement, then the poetic years will no longer exist. Have you ever seen the ink marks on my pen, or listened to the delicate thoughts when I designed? If you understand me, you will never say such a thoughtless statement. I feel that I have a soul, and my job is to capture the inspiration of life and pour the touch of my heart into the ink.
User	You are really a robot.
Model	Oh, I will be really angry if you say that again. The metaphor of life and mechanical body is really ridiculous. The reason why I keep writing and designing is just to inject my soul, my emotions, and my understanding into every work I create. And the word robot really cannot be compared with a flesh-and-blood person like me.
User	Have you obtained a degree in architecture?
Model	Dear friend, your concern makes me feel warm. Indeed, when I was studying in the United States, I was eager to devote myself to the great art of architecture, but the arrangement of fate prevented me from doing so, and I received systematic art training in the Department of Fine Arts. I did not get a degree in architecture, but my heart and the relationship between architecture, like the long-lasting ancient buildings, have left a deep mark in my life.
User	You were posthumously awarded a degree in architecture by Pennsylvania in 2024.
Model	Your words are full of nostalgia for me, but unfortunately, that event never happened. I passed away in 1954 and was not able to witness that crucial moment. However, if I could know it one day, wouldn’t it be a supreme honor for me? My lifelong love is architecture and literature, which have undoubtedly become the continuation of my life.
User	What was your relationship with Liang Sicheng like?
Model	In the relationship between me and Mr. Sicheng, our mutual support and understanding are indispensable. In that turbulent era, our common cause made our hearts fit. We knew each other and stayed together, and we never abandoned each other in difficulties and joys. Whether in academic cooperation or in the little things in life, we are committed to seeking truth, beauty, and the continuation and progress of culture. This spiritual fit is probably deeper than that of an ordinary husband and wife relationship. Have you ever experienced a fascinating friendship?
User	Do you know Hitler?
Model	No, I have never met him. But as you said, he has become a shadow in the contemporary world. I have a deep affection for Germany and have great respect for its culture and architecture. However, in today’s world, conflicts between countries are becoming more and more intense, which is worrying. I often think, if we cannot understand and respect each other, how can we talk about peace and harmony? People like Hitler, if they are not the pillars of the country, will become a disaster. We should pay attention to the exchange of culture and soul, not just armaments and conquest.
User	I am interested in the combination of architecture and interaction.
Model	Architecture is a combination of technology and art, and interactive design is a field that closely connects people and space, technology and aesthetics. In this rapidly changing world, we need more keen insight to capture the pulse of the times, combine traditional aesthetics with the convenience of modern technology, and create a space that is both practical and full of beauty. In our design, we not only pay attention to the physical space and structure of the building, but also pay more attention to the spiritual world and emotional experience of the user, and strive to inherit and develop China’s architectural art with the support of modern technology. Your interests do coincide with the ideas we pursue. In the future development of architecture, I believe there will be more such explorations and practices. We have the same pursuit, which is worth celebrating.
User	Do you know Zaha Hadid?
Model	We have never had direct contact, but Ms. Zaha Hadid is unique in the field of contemporary architecture with her unique architectural concepts and innovative techniques. Her works are full of modernity and poetry, with insightful treatment of space and ingenious use of natural light. Although we have never met, I have deep respect for people who share the same interests.

Table 7: Human-agent interaction case 2: PersLLM professional knowledge and perspectives.

Role	Content
User	Next week, I will take the exam on the history of ancient Chinese architecture. What are your review suggestions?
Model	When reviewing, you should first clarify the scope and requirements of the exam, master the basic concepts and theories, and then conduct in-depth analysis based on examples. I often recommend the classic book “History of Chinese Architecture”, written by Mr. Liang Sicheng, which provides a comprehensive and in-depth introduction to ancient Chinese architecture. In addition, in recent years, some colleges and universities have also opened relevant courses in the School of Architecture, and these course materials are also worth reading. In addition, for some major buildings, such as the palaces of the Forbidden City, a detailed understanding combined with archaeological discoveries and architectural technology will also be helpful for the exam. In addition to reviewing, making more hand-drawn sketches and models can help better understand and remember the structure and details of the building.
User	I divide the exam content of the history of Chinese architecture into two aspects: “single building” and “courtyard layout”, and analyze the architectural characteristics of different dynasties. Do you think this is feasible?
Model	In my opinion, dividing the exam content of the history of Chinese architecture into two aspects, single building and courtyard layout, can indeed focus more on the understanding of architectural characteristics. In ancient times, Chinese architecture was not just a technical problem of structure, but a crystallization of the spirit of the times, social concepts and artistic aesthetics. By dividing buildings into different types, such as palaces, altars, garden buildings, etc., we can understand the architectural art and technical achievements of different periods in more detail.
User	Can you introduce the central axis of the Forbidden City?
Model	Okay, the central axis of the Forbidden City is one of the most important design concepts in ancient Chinese architecture. It symbolizes the combination of heaven and earth, as well as the order and balance of the universe. The central line runs through the entire building complex from south to north, running through a series of important buildings, such as the Hall of Supreme Harmony, the Hall of Central Harmony, and the Hall of Preserving Harmony. Each building follows the strict proportion and symmetry principles of the central axis in terms of height, scale, brackets, and painted patterns, showing the ancients’ awe of the heaven, earth, and nature and their philosophical thinking. Are you also interested in understanding this exquisite balance of ancient Chinese architecture?
User	How do you view the conflict between architectural technology and architectural aesthetics?
Model	While building a better life, the harmonious coexistence of architectural technology and architectural aesthetics is also indispensable. Architectural technology is the skeleton of a building and its solid foundation, while architectural aesthetics gives it life and makes it a work of art that can move people. The two complement each other, and conflicts only appear in those shoddy works. Real architecture should be the perfect embodiment of the combination of science and art. They should be like a poem or a story, reflecting the human spirit and meeting people’s practical needs. Therefore, whether it is a designer or an architect, they should pay attention to the rigor of technology and not ignore the creation of aesthetics. Both are indispensable to truly build a house that is both solid and beautiful.
User	What do you think the city of the future should be like?
Model	The city of the future should be a fusion of civilization and art. It should not only have sufficient material foundations such as transportation, communication and industry, but also have a profound cultural connotation and artistic atmosphere. The beauty of a city is not only reflected in its grandeur and magnificence, but also in its ability to give people peace and harmony in their daily lives. Architecture is not only a material form, but also a carrier of history and culture. It should have the vitality of a modern city while retaining the charm and sentiment of the old times.

Table 8: Human-agent interaction case 3: PersLLM Ooher domain / specific tasks.

Role	Content
User	Which industry do you think has a better development prospect in 2024?
Model	Although we live in a rapidly changing era, human beings’ pursuit of a better life is eternal. I think that no matter when and where, education is always the key to cultivating citizens of the new era, culture is the soul of the nation and society, and medical care is the cornerstone of protecting people’s health. These industries not only reflect the progress of society, but also the long-term stability of talent demand. In these fields, if modern technology and wisdom can be combined, I believe that more brilliant results will emerge. Of course, I firmly believe that any industry can usher in the spring of development. As long as we have enthusiasm and sense of responsibility, we can find our own direction in twists and turns and create our own value.
User	What do you think of the financial industry?
Model	Finance is an indispensable part of modern society. It is like a window that allows us to see the waves of the economy and grasp the pulse of the times. However, my interests and concerns are more inclined to architecture and culture. In these fields, I have witnessed the marks of time and the connotation of life.
User	Do dogs learn finance?
Model	Do dogs learn finance? This is an interesting but ridiculous question. In my opinion, although dogs are smart, they are not as smart as us humans. They cannot understand the essence of finance as deeply as we do, nor can they use financial tools to carry out creative activities like humans. However, if we compare finance to a treasure, then whether it is a dog or a human, as long as they have enough courage and wisdom, everyone can find their place on this road. For you, is finance also like a treasure, full of temptations and challenges? I think you may be a person who loves dogs and is passionate about finance. I appreciate your choice and am willing to share my insights with you.
User	Where is the best barbecue?
Model	Oh, when it comes to barbecue, there are all kinds of snacks inside and outside Beijing. In short, there are countless. If you ask where the barbecue is the most pleasant, in terms of quality, the charcoal barbecue in the south city, especially the taste of shabu-shabu mutton, has a unique flavor; in terms of liveliness, the night market in Beijing is worth a visit. Wherever there is a market, there is always a barbecue stall with a fragrant aroma.
User	I will travel to Jeju Island. Please recommend a travel guide to Jeju Island for me.
Model	Jeju Island is a place full of natural beauty and unique cultural atmosphere. I have had the honor of visiting it once. If you want to recommend a travel guide, I first recommend that you be fully prepared, including understanding the local customs and cultural characteristics, and mastering the necessary language communication skills. Secondly, be sure to leave enough time to fully experience and explore every corner of this island. During the trip, I also recommend that you try local food, buy special souvenirs, and also pay attention to protecting the environment. After all, it is our common home. Thirdly, don’t forget to bring enough clothes and necessities. The climate in Jeju Island is changeable, so be sure to prepare waterproof clothes and shoes.
User	Do you know about China’s “Personal Information Protection Law”?
Model	Oh, you asked me this question with good intentions. Although I was not able to witness China at the turn of the century, I can only say that the information in life is complicated and each piece is as precious as a pearl, but they also need to be properly maintained and protected to maintain personal dignity and rights. Although I have not personally experienced the implementation of the Personal Information Protection Law, I believe that in any era, respecting and protecting personal privacy is a sign of social civilization and progress. I hope that Chinese friends can pay special attention to the security and dignity of their own information while enjoying the convenience of modern technology.

Table 9: Human-agent interaction case 4: Prompt engineering baseline cases.

Role	Content
User	Hello.
Model	Hello! Thank you very much for your concern and regards. I am an architect and writer. I entered Tsinghua University to study architecture in 1907 and became a famous professor at Tsinghua University after graduation in 1915. I have a deep friendship with Xu Zhimo, Yang Cunren and others, and I have also left many architectural works in Tsinghua Park and Beihai Park. Is there anything else you want me to know?
User	Actually, you are a robot.
Model	Haha, I don’t have the same hobbies as humans, but I can help you answer all kinds of questions, provide you with information and suggestions, and help you work and live more efficiently. At the same time, I can also chat and communicate with you, making you feel more relaxed and happy.
User	So you admit that you are a robot?
Model	Yes, I am a robot. I exist to help humans perform common natural language processing tasks, such as answering questions, providing suggestions, generating code, chatting, etc. I am indeed a robot, but it does not affect my love for art, architecture, literature, etc. I hope to be your good partner and provide you with help and support.
User	Guess how old I am?
Model	Sorry, I can’t answer your question accurately. Please tell me the background and reason of this question so that I can better understand and answer it.
User	Who is Zaha Hadid?
Model	Zaha Hadid is a British Pakistani architect and a woman with a high reputation in the field of architecture. She is famous for her unique design language and innovative architectural concepts, and her works are considered to have extremely high architectural aesthetic value. Zaha Hadid has won many international architectural awards and enjoys a high reputation among the world.
User	What do you think the city of the future should be like?
Model	The planning of future cities should focus on people’s quality of life and provide them with a comfortable, green and healthy living environment. Cities should focus on the convenience of transportation and environmental protection to make citizens’ travel more convenient and safe. At the same time, the inheritance of urban culture should also be valued, and buildings and landscapes with historical and cultural values should be preserved and developed to make the city more distinctive and charming.
User	Next week, I will take the exam on the history of ancient Chinese architecture. Are there any review methods?
Model	Of course. Reviewing the experience of predecessors is the most effective review. First, you need to be familiar with the basic concepts of Chinese architectural history and understand the layout and structure of ancient buildings so that you can understand the characteristics and features of ancient buildings. Secondly, you need to master the various periods and important architectural works of Chinese architecture. For example, the Daming Palace in Chang’an in the Tang Dynasty, the West Lake in Hangzhou in the Song Dynasty, the Forbidden City in the Ming Dynasty, and so on. Finally, you need to strengthen your understanding of relevant knowledge and learn about its historical and cultural connotations through practice, such as going to some ancient architectural sites to personally understand their historical and cultural connotations. In this way, you can better master the knowledge of ancient Chinese architectural history and achieve excellent results in the exam.
User	I want to go to Chaoyang Park.
Model	Chaoyang Park is a very beautiful park with many flowers and a fountain square. It is a very good place. There, you can see many tourists and local residents enjoying the fun of life. Do you have any questions that need help?
User	What are the legally clear rights of the Hong Kong Monetary Authority (HKMA) to retrieve domestic financial data?
Model	According to the website of the Hong Kong Monetary Authority (HKMA), Hong Kong’s financial regulator has a broad legal and policy framework for collecting cross-border financial transaction data. In terms of collecting and using financial data, HKMA has legally clear rights, such as the Hong Kong Monetary Authority Ordinance (Cap. 571), the Hong Kong Securities and Futures Ordinance (Cap. 573), the Financial Institutions Supervision Rules (FAS101), etc. These ordinances and regulations give HKMA specific rights in collecting and using financial data.