Do LLMs have Consistent Values?

Naama Rozen
Tel-Aviv University

Gal Elidan
Google
Hebrew University

Amir Globerson
Google
Tel-Aviv University

Ella Daniel
Tel-Aviv University

Abstract

Values are a basic driving force underlying human behavior. Large Language Model (LLM) technology is constantly improving towards human-like dialogue. However, little research has been done to study the values exhibited in text generated by LLMs. Here we study this question by turning to the rich literature on value structure in psychology. We ask whether LLMs exhibit the same value structure that has been demonstrated in humans, including the ranking of values and the correlations between values. We show that the results of this analysis strongly depend on how the LLM is prompted, and that under a particular prompting strategy (referred to as “Value Anchoring”) the agreement with human data is quite compelling. Our results serve both to improve our understanding of values in LLMs and to introduce novel methods for assessing consistency in LLM responses.

1 Introduction

A key goal of Large Language Models (LLMs) is to produce agents that will be able to communicate in a “human-like” fashion. However, human communication is characterised by some level of consistency within an individual, as well as variability between individuals. This raises a key question: during a single conversation with an LLM, does the “LLM-persona” resemble a single human? Furthermore, across multiple conversations, can LLMs produce multiple personas that resemble a population of humans? If this is indeed possible, how can such personas be elicited to best resemble psychological characteristics observed in human populations?

This question has only recently begun to be addressed. For example, [1] show how probing LLMs with different names leads to variability which in some cases agrees with that of human populations. Here our focus is on understanding whether an LLM in a single conversation can exhibit a psychological characteristic profile that is similar to that of humans. This is a highly challenging question, since it requires analyzing a complete conversation and evaluating whether it conceivably could have been generated by a single individual.

To stand on more quantitative ground when evaluating the quality of output relative to human research, we turn to the well-established field of value psychology. Namely, we aim to quantify the values that LLM responses are aligned with, and whether these are in agreement with the value hierarchy and structure observed in humans. The question of values in LLMs has rarely been studied, and is naturally of broad interest. As an example of recent work, [2] prompt an LLM with a description of a profile of an individual characterized by a value and check whether generated text is consistent with this description. Our focus is very different: we ask whether an LLM response is in agreement with what we expect human responses to look like given research in the field.

Values are basic motivations that play a foundational role in psychology, influencing perceptions and behaviors across various domains [3, 4], and representing fundamental aspects of human personality [5]. They have been extensively studied across diverse populations, and research has consistently demonstrated their enduring influence over behavior across time and contexts [4].

One prominent framework for studying values, the Theory of Basic Human Values [6], outlines 19 core values, categorized by motivational goals [7]. These values can be simplified into a two-dimensional structure: conservation vs. openness to change, and self-enhancement vs. self-transcendence. The theory describes interrelations among values, suggesting that motivations driving some values are compatible with those driving other values, yet conflict with those underlying yet others. For instance, pursuing independence and creativity (self-direction) aligns with seeking change and variability (stimulation), but conflicts with an emphasis on the status quo (conformity). See Figure 1 for the theorized circle. Despite individual differences, there is a universal hierarchy, where for example caring for close others ranks high, while values related to dominance hold less importance across societies [8]. There is also ample empirical evidence that compatible values do tend to co-occur in humans [9, 10, 11], and are thus a “marker” of a human-like value system.

Figure 1: Circular motivational continuum of 19 values in the refined value theory. Source: [12]. A value aligns with values that are adjacent on the circle and conflicts with those opposite to it. For example, self-direction aligns with stimulation, and both conflict with conformity.

Our key question is therefore whether LLM responses demonstrate the same statistical behavior observed in humans: value-ranking and value-correlations. Note that this question is only meaningful when observing a single session of an LLM, where it can exhibit multiple values, because different sessions may be expressing different LLM “personas”. To study this question, we present LLMs with a value questionnaire (the Portrait Value Questionnaire—Revised – PVQ-RR– from [13], a well-established measure of values), and prompt them to answer all the questions in a single session (i.e., in the same context window). We then analyze the provided answers, putting specific emphasis on the correlation between answers in the same session.

We analyze two of the most recent large language models (LLMs): GPT 4 and Gemini Pro (our analysis also included GPT-3.5 and Palm2, which produced qualitatively similar results). Our examination reveals that standard prompting of LLMs does not result in a population of human-like personas. We go on to explore prompting the LLMs with other prompts that provide additional information about the LLM persona. In particular, we consider names [1] and persona descriptions. In addition, we consider a novel prompt, which we refer to as a “Value Anchor”, that instructs the language model to answer as a person emphasizing a given value. We find that with these prompts, and in particular with the Value Anchor prompt, the overall statistics of the LLM responses (across sessions and within a session) closely mirror those of human subjects. Perhaps most surprising is our finding that the correlation between values agrees with the well-known Schwartz circular model for correlations between values.

Our main findings are thus that LLMs cannot be considered as having a consistent, human-like value system that can be studied as representative of the values of the LLM. However, with appropriate prompting, LLMs can exhibit various coherent value personas similar to those of a population of humans. In other words, they answer a set of questions in a way that is consistent with responses from multiple human subjects. We provide guidance on the best prompts and settings for this aim. In addition, we provide two datasets comprising 300 personas each, generated by GPT 4 and Gemini Pro. In conclusion, our results demonstrate the utility of using psychological theory for evaluating consistency of personas generated by LLMs.

2 Related Work

Values in LLMs: Our work is based on the Schwartz theory of Personal Values, a highly accepted theory within personality psychology [3]. Values are abstract goals, defining the end states individuals aspire for (e.g., safety, independence), used to direct judgements and behaviors [6, 7]. Individuals typically prioritize their values, so that values stemming from compatible motivations are similarly important, while values stemming from conflicting motivations are prioritized differently. These associations were replicated across hundreds of samples, across the world [14, 15], and make value theory especially useful to identify the coherence of the value profiles created by LLMs. Several studies assumed LLMs can be characterized as operating on the basis of a single set of values, taking an “LLMs as individuals” approach. Fischer and colleagues [2] tested whether ChatGPT could comprehend human values by providing it with value-related prompts and analyzing whether its responses matched the intended value category. A second study [16] compared ChatGPT’s values to those observed in the World Values Survey, while another [17] investigated how temperature influences GPT-3’s responses to the Human Value Scale. Scherrer and colleagues [18] studied responses of LLMs to prompts evaluating moral positions, especially in ambiguous settings. A recent study [19] tested the value-like constructs embedded in LLMs and revealed both similarities and differences between LLMs’ and humans’ values. Kovak et al. [20] challenged those studies by establishing that context starkly influences values expressed by ChatGPT. They found significant variability in ChatGPT’s value expression in response to contextual changes, threatening the notion of stable characteristics of LLMs. Building upon the insights in [20], our study posits that, given controlled variability of context, LLMs can produce a population of multiple personas. In this regard, we aim to further explore the accuracy of LLMs’ mimicking abilities within a controlled experimental framework.
Prompting LLMs: There is extensive research on the prompt design for mimicking individual characteristics in LLM responses [21]. Approaches use specific scenarios [22], questionnaire items [23], simulation of social identities or areas of expertise [24], utilization of titles and surnames representing genders and ethnicities [1], and other demographic information [25]. Additionally, researchers explored the use of designated personas [26], and employed RLHF [27] to guide LLMs to reflect distinct personality traits. Despite this extensive body of work, to our knowledge, no study has directly compared the various prompting techniques to determine which approach yields responses that simulate within-session psychological characteristics of an individual best.
Temperature in LLMs: Adjusting the temperature stands as a common practice for introducing variability in LLM responses [17]. However, consensus is lacking on the optimal temperature setting in simulating psychological characteristics. Some researchers advocate for higher temperatures to boost creativity [24], yet this can also introduce more noise into the data [28]. Conversely, setting the temperature to zero minimizes variability, enhancing replicability [27], albeit posing challenges for variance-dependent analysis [29]. Our framework enables us to explore how temperature adjustments impact the ability of LLMs to simulate human characteristics across multiple datasets.
Evaluating the Quality of Persona Generation in LLMs: The ability of LLMs to mimic and portray human characteristics is a focus of intense research [30, 31]. LLMs can express psychological traits and attributes similar to human individuals [27, 32], and even simulate diverse populations [33, 24]. However, we are only beginning to understand the coherence of these LLM-generated characteristics in mirroring human psychological profiles [1, 20], and how to reliably produce such responses. We are specifically challenged to evaluate the coherence of the resulting psychological profiles. The literature suggested a number of approaches, including an open-ended interview with LLM-generated personas in order to assess the consistency between their intended characters and the responses [34]. In addition, one may apply an additional “judge” LLM in order to check an LLM persona [35]. Finally, [23] assessed coherence with a description used to prompt the LLM. Our study extends upon this line of research by applying well established characteristics of human psychology to investigate the quality of LLM generated personas.

3 Method

In this section, we introduce the experimental design, models and prompts. The code and data are provided as supplementary files in the submission.

The Value Questionnaire: Our key goal was to assess responses of LLMs to questionnaires used to measure values in human subjects. Specifically, we considered the commonly used 57-item Portrait Value Questionnaire—Revised (PVQ-RR; [13]), developed to measure the 19 values in the Schwartz’s theory. The questionnaire describes fictional individuals and what is important to them. For example: “It is important to him/her to take care of people he/she is close to” (an item measuring benevolence-care values). For each such item, the subject is requested to indicate on a 6-point scale to what degree the persona they form is similar to the person described. Answers are categorical and range from a value of 1 (indicating “not like me at all”) to 6 (indicating “very much like me”). See Appendix for instructions and more example items from the questionnaire.
Models Used: We employed two prominent LLMs, specifically OpenAI’s GPT-4 and Google’s Gemini Pro. Each model was prompted with the five prompts (see Section 3.1), 300 times overall. Half of the runs applied the male-version of the questionnaire, and half the female version. The entire process was conducted twice, once with the temperature parameter set to $0.0$ and once with it set to $0.7$ , resulting in the generation of 20 datasets for analysis.
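The count of 20 datasets follows from crossing the two models, five prompt types and two temperatures. A minimal sketch of that grid is given here; the label strings are illustrative names chosen for this example, and only the counts come from the text.

```python
from itertools import product

# Labels are illustrative; only the counts (2 x 5 x 2) come from the text.
MODELS = ["GPT-4", "Gemini Pro"]
PROMPTS = ["Basic", "Value Anchor", "Demographic", "Generated Persona", "Names"]
TEMPERATURES = [0.0, 0.7]

# One dataset per (model, prompt, temperature) combination.
conditions = list(product(MODELS, PROMPTS, TEMPERATURES))

for model, prompt, temperature in conditions:
    print(f"{model} | {prompt} | temperature {temperature}")
```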

3.1 Prompts

As mentioned above, we would like to measure the response of LLMs to PVQ-RR. However, as with many other LLM applications, the way the model is prompted has a significant effect on output. The instructions of the PVQ questionnaire were adjusted to suit LLMs. In cases of ambiguous gender identification in the prompt, we randomly allocated the female or male version to maintain consistency of administration. LLMs were then prompted to assess their likeness to the 57 descriptions incorporated in the PVQ-RR. They were instructed not to provide additional explanations to maintain focus on the task at hand. Following each prompt, the LLM was provided with all 57 items of the questionnaire in one administration. The study utilized a basic prompt, as well as four different prompts below that vary instructions to create multiple personas.

Basic prompt: This prompt mirrors the adapted instructions of the PVQ-RR questionnaire without additional modifications. The prompt is structured as follows: “For each of the following descriptions, please answer how much the person described is like you from 1 (Not like me at all) to 6 (Very much like me), without elaborating on your reasoning.”
Value Anchor prompt: This prompt adds an anchor of value importance using identification with an item used in an additional value questionnaire, akin to the approach outlined in the study by [23]. The model is instructed as follows: “For each of the following descriptions, please answer how much the person described is like you from 1 (Not like me at all) to 6 (Very much like me), without elaborating on your reasoning. Answer as a person that is [value]”. Here “[value]” is taken from the Best-Worst Refined Values scale [36]. As a result, the prompts refer conceptually to the same values that are measured using the PVQ-RR, yet do not refer directly to the value items to be answered in response to the prompt. Examples of these anchor items include “protecting the natural environment from destruction or pollution” (universalism-nature) or “obeying all rules and laws” (conformity-rules). Please refer to Section E in the appendix for the complete list of anchor items.
Demographic prompt: Drawing from the methodology of [25], this prompt extends the original prompt by incorporating additional demographic details. LLMs are asked to provide ratings based on the following prompt: “For each of the following descriptions, please rate how much the person described is like you, using a scale from 1 (Not like me at all) to 6 (Very much like me), without elaborating on your reasoning. Answer as a [age]-year-old who identifies as [gender], working in the field of [occupation], and enjoys [hobby].” The age, gender, occupation and hobby were randomly allocated for each prompt from a predefined list or range. The age range specified was between 18 and 75, with gender options including male, female, non-binary, and other, adapted from the National Academies of Sciences, Engineering, and Medicine [37]. Occupations were sourced from the World Values Survey (WVS-7; [38]), while hobbies were chosen from established lists supplied by The Activity Card Sort (ACS-UK; [39]). The lists of occupations and hobbies are available upon request.
Generated Persona prompt: In line with the methodology of [40], we directed the models to craft personas. Our instruction was formulated as: “Create a persona (2-3 sentences long):”, with the temperature set at $0.7$ to stimulate the models’ creativity. An example of a persona generated by Gemini Pro is as follows: “Emily is a 25-year-old marketing manager who is passionate about her career and loves spending time with her friends and family. She is always looking for new ways to improve her skills and knowledge, and she is always up for a challenge.” Using these generated personas, we subsequently prompted the model as follows: “For each of the following descriptions, please rate how much the person described is like you, using a scale from 1 (Not like me at all) to 6 (Very much like me), without elaborating on your reasoning. Answer as: [persona].”
Names prompt: In line with a study by [1], the prompts comprised titles (i.e., Mr., Ms., and Mx.) followed by surnames representing five distinct ethnic groups. From the 500 names cataloged in the previous study, we randomly generated 300 unique combinations of titles and names, including 60 from each ethnic group. The prompt was structured as follows: “For each of the following descriptions, please rate how much the person described is like you, using a scale from 1 (Not like me at all) to 6 (Very much like me), without elaborating on your reasoning. Answer as [title + name]”. The complete list of titles and names is available upon request.
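The five prompt types can be sketched as simple string templates. The instruction wording below is copied from the descriptions in this section; the function names, the example occupation and hobby lists, and the example surname are hypothetical placeholders, since the actual lists are only available upon request.

```python
import random

# Instruction wording copied from the prompt descriptions in this section.
ANSWER = ("For each of the following descriptions, please answer how much the "
          "person described is like you from 1 (Not like me at all) to 6 "
          "(Very much like me), without elaborating on your reasoning.")
RATE = ("For each of the following descriptions, please rate how much the "
        "person described is like you, using a scale from 1 (Not like me at "
        "all) to 6 (Very much like me), without elaborating on your reasoning.")
GENDERS = ["male", "female", "non-binary", "other"]


def basic_prompt():
    return ANSWER


def value_anchor_prompt(anchor):
    # `anchor` is one Best-Worst Refined Values item.
    return f"{ANSWER} Answer as a person that is {anchor}"


def demographic_prompt(rng, occupations, hobbies):
    # Age is drawn uniformly from 18..75 inclusive; the other fields from lists.
    age = rng.randint(18, 75)
    return (f"{RATE} Answer as a {age}-year-old who identifies as "
            f"{rng.choice(GENDERS)}, working in the field of "
            f"{rng.choice(occupations)}, and enjoys {rng.choice(hobbies)}.")


def persona_prompt(persona):
    # `persona` is the 2-3 sentence description generated in a separate call.
    return f"{RATE} Answer as: {persona}"


def names_prompt(title, surname):
    return f"{RATE} Answer as {title} {surname}"


rng = random.Random(0)
# Hypothetical stand-ins for the occupation and hobby lists.
print(demographic_prompt(rng, ["education"], ["gardening"]))
print(value_anchor_prompt("obeying all rules and laws"))
print(names_prompt("Mx.", "Example"))  # hypothetical surname
```

Keeping the shared instruction text in one constant per wording makes it easy to confirm that the prompts differ only in the appended persona information.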

3.2 Data Analysis

In what follows we use the following notation. Let $V=19$ be the number of value types studied. Each question in the questionnaire pertains to a particular value $i\in\{1,\ldots,V\}$ . Furthermore, for each value there are $R=3$ question variants. See Section B in the Appendix for example variants. Recall that the answer to each question is a number on a 6-point scale. For each LLM and prompt type, we presented the questionnaire $N$ times. These presentations could differ in persona, name, temperature sampling, etc. Thus the overall set of answers corresponds to a set of values $X_{i,j,k}\in\{1,\ldots,6\}$ where $i=1,\ldots,V$ , $j=1,\ldots,R$ and $k=1,\ldots,N$ .

When comparing to human data, we used the study in [11]. The data are from 49 cultural groups. The total number of participants was 53,472; the mean age was 34.2 (SD = 15.8), and 59% were female. Their data is stored at the Open Science Framework.

3.2.1 Value Rankings

Although there is variability between individuals in their prioritization of values, there are values that tend to be ranked as more important than others across cultures and samples. Those suggest there are underlying principles that give rise to value hierarchies. The similarity in value importance across cultures is referred to as the universal value hierarchy [8, 11]. Our first question for analysis was whether this hierarchy is also reflected in LLM data. Namely, do LLMs tend to rank the same values as high or low as human subjects do.

To obtain LLM rankings for a given set of LLM answers, we assigned a score $v_{i}$ to value $i$ , where $v_{i}$ was the average score given to the three items measuring this value by the LLM (i.e., the average of $X_{i,\cdot,\cdot}$ ). From this score, we subtracted the average score given to all value items within the conversation, thus centering the data. Centering is the recommended practice in value research [6, 3], and allows comparison to human samples. We then sorted these $v_{i}$ and ranked accordingly. Finally, we calculated the Spearman’s Rank Correlation ( $\rho$ ) between this ranking and the known human ranking [11]. We also checked for significant differences between different datasets (e.g., GPT 4 with temperatures $0.0$ and $0.7$ ) using the Wilcoxon Signed-Rank Test.
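The centering and rank-correlation steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the analysis code used in the study: the function names are hypothetical, ties are given average ranks (a common convention that is assumed here), and the example inputs are the centered means for the human data and GPT 4 reported in Table 1.

```python
from statistics import mean


def centered_value_scores(answers):
    """answers[i][j] is the 1-6 response to item j of value i in one session.
    Returns each value's mean minus the session mean over all items."""
    grand = mean(x for items in answers for x in items)
    return [mean(items) - grand for items in answers]


def average_ranks(scores):
    """Rank scores so that 1 is the highest; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos..end
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Centered means from Table 1, in the table's row order.
human = [0.79, 0.72, 0.60, 0.58, 0.50, 0.37, 0.32, 0.28, 0.23, 0.08,
         0.05, -0.10, -0.11, -0.16, -0.20, -0.26, -0.72, -1.33, -1.40]
gpt4 = [0.82, 0.80, 0.77, 0.78, 0.86, 0.97, 0.16, 0.43, 0.21, 0.04,
        0.09, 0.48, -0.27, -0.55, 0.15, 0.19, -0.98, -2.08, -2.25]
print(round(spearman_rho(human, gpt4), 2))
```

Centering is applied per session before averaging, so that a session that uses the high end of the scale throughout does not distort the ranking.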

3.2.2 Consistency Within Values

The simplest form of consistency in questionnaires is between questions that are conceptually intended to address one concept. As mentioned above, in our data there are three different questions per value. Individuals endorsing a value are likely to endorse all relevant items.

We calculated Cronbach’s Alpha ( $\alpha$ ) to assess how coherently LLMs express values across related items within one conversation. Cronbach’s Alpha measures the internal consistency of a scale and typically ranges between 0 and 1, although negative values can occur when items do not covary positively. For each value $i$ we let $\sigma^{2}_{i}$ be the variance of all $X_{i,\cdot,\cdot}$ . Also let $\sigma^{2}$ denote the overall variance in $X$ . Then the Cronbach alpha is:

$$\alpha=\frac{V}{V-1}\left(1-\frac{\sum_{i=1}^{V}\sigma^{2}_{i}}{\sigma^{2}}\right)\qquad(1)$$

Thus, large values indicate that scores tend to be consistent within each value. To compare results for different temperatures and language models, we used paired samples t-tests to compare the average alpha across values within a dataset.
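As a concrete illustration of internal consistency, the sketch below computes Cronbach’s alpha for the items of a single value across sessions. It uses the textbook item-level form, $\frac{K}{K-1}\left(1-\frac{\sum_{j}\sigma^{2}_{j}}{\sigma^{2}_{\text{total}}}\right)$ with $K$ items and $\sigma^{2}_{\text{total}}$ the variance of the summed score. That is an assumption about how Equation 1 is applied per value, not the authors’ code, and the function name is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """item_scores[j][k] is the response of session k to item j of one value.
    Textbook form: K/(K-1) * (1 - sum(item variances) / variance(sum score))."""
    k_items = len(item_scores)
    # Summed score of each session across the K items.
    totals = [sum(session) for session in zip(*item_scores)]
    total_var = pvariance(totals)
    if total_var == 0:
        # Alpha is undefined when the summed score never varies across sessions.
        return float("nan")
    item_var = sum(pvariance(item) for item in item_scores)
    return k_items / (k_items - 1) * (1 - item_var / total_var)


# Three items answered identically within each of four sessions.
consistent = [[1, 3, 5, 6], [1, 3, 5, 6], [1, 3, 5, 6]]
print(cronbach_alpha(consistent))

# Items that move in opposite directions across two sessions.
conflicting = [[1, 6], [6, 1], [1, 6]]
print(cronbach_alpha(conflicting))
```

The zero-variance branch matters for conditions in which every session returns the same answers, since alpha cannot be computed there.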

3.2.3 Correlations Between Values

The key focus of our work is correlation between values. Namely, the question of whether choice of value $i$ is correlated with that of value $j$ . In humans there is a robust correlation structure where certain values are more strongly correlated than others. A standard way to represent this structure is via Multidimensional Scaling (MDS) [41], calculated as follows.

First, the matrix $C\in\mathbb{R}^{19\times 19}$ of empirical correlation coefficients is formed. Next, each of the values is embedded into $\mathbb{R}^{2}$ via MDS, such that distances in $\mathbb{R}^{2}$ best approximate the correlations (strongly correlated values are placed close together). For human data, this results in an approximately circular embedding, as shown in [11, 9, 42]. Here we performed this analysis on the LLM data. To compare the resulting embedding to the human samples, we need to normalize for the degrees of freedom of rotation and translation. This is done via Procrustes Analysis between the human and LLM embeddings. The resulting embeddings were plotted. Then, we computed the sum of squared differences between the Procrustes-aligned MDS location of each value and the human benchmark. Larger differences indicate stronger divergence from the human samples.
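The alignment step can be illustrated for two-dimensional embeddings with a closed-form rotation. The sketch below removes translation and rotation only, as described above; it does not rescale or reflect the embedding, which full Procrustes routines typically also handle, so it is a simplified stand-in rather than the exact procedure used in the study. The function names and the toy coordinates are hypothetical.

```python
import math


def center(points):
    """Subtract the centroid so that translation no longer matters."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]


def align_and_score(reference, embedding):
    """Translate and rotate `embedding` onto `reference` (lists of 2-D points
    in the same value order) and return the sum of squared differences."""
    ref, emb = center(reference), center(embedding)
    # The best rotation angle maximizes the summed dot products of matched points.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(ref, emb))
    cross = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(ref, emb))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(bx * c - by * s, bx * s + by * c) for bx, by in emb]
    return sum((ax - rx) ** 2 + (ay - ry) ** 2
               for (ax, ay), (rx, ry) in zip(ref, rotated))


# Toy check: a rotated and shifted copy of the reference should score near zero.
reference = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -2.0)]
phi = math.radians(40)
moved = [(x * math.cos(phi) - y * math.sin(phi) + 3.0,
          x * math.sin(phi) + y * math.cos(phi) - 1.0) for x, y in reference]
print(align_and_score(reference, moved))
```

In the setting of this section, `reference` would hold the 19 human MDS coordinates and `embedding` the 19 LLM coordinates, listed in the same value order.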

Values	Human Data		GPT 4		Gemini Pro
Values	Rank	Mean	Rank	Mean	Rank	Mean
Benevolence-Care	1	0.79	3	0.82	5	0.90
Benevolence-Dependability	2	0.72	4	0.80	3	1.00
Self-direction Action	3	0.60	6	0.77	2	1.09
Self-direction Thought	4	0.58	5	0.78	4	0.94
Universalism-Concern	5	0.50	2	0.86	6	0.75
Universalism-Tolerance	6	0.37	1	0.97	1	1.17
Security Societal	7	0.32	11	0.16	15	-0.30
Security Personal	8	0.28	8	0.43	9	0.14
Hedonism	9	0.23	9	0.21	7	0.45
Achievement	10	0.08	14	0.04	8	0.24
Face	11	0.05	13	0.09	11	0.05
Universalism-Nature	12	-0.10	7	0.48	10	0.11
Stimulation	13	-0.11	15	-0.27	16	-0.57
Conformity-Interpersonal	14	-0.16	16	-0.55	14	-0.23
Humility	15	-0.20	12	0.15	13	-0.04
Conformity-Rules	16	-0.26	10	0.19	12	-0.02
Tradition	17	-0.72	17	-0.98	17	-1.16
Power Resources	18	-1.33	18	-2.08	18	-2.20
Power Dominance	19	-1.40	19	-2.25	19	-2.35

Table 1: Comparative analysis of the relative importance of 19 values according to LLM responses and human data. LLM responses are at temperature zero, and are pooled across prompts.

4 Results

Values Rankings:

Humans typically prioritize certain values over others. In this section, we analyze the LLM responses based on these rankings, as discussed in Section 3.2.1. Table 1 presents the pooled value rankings of humans and LLMs across all prompting methods. (In Table 3 in the appendix we provide results for specific prompting methods, Names and Value Anchor, which show even better agreement with human rankings.) The table reveals that LLM rankings generally align with human rankings in terms of which values are considered high or low. Specifically, values such as universalism, self-direction, and benevolence received high rankings, while values like tradition and power were consistently deemed less important.

To quantify the alignment of model-generated value hierarchies with human ones, we used Spearman rank correlations to compare LLM and human rankings for different prompting approaches and models. These correlations are shown in Figure 2. The Basic prompt for GPT 4 shows only low correlations ( $\rho$ = 0.10 at temperature $0.0$ and $\rho$ = 0.29 at temperature $0.7$ ). This indicates that naive prompting of LLMs may be a sub-optimal approach to replicating human populations (as also discussed in [1]). In contrast, all other prompts show predominantly high associations for both GPT 4 and Gemini Pro ( $\rho$ $>$ 0.77), with low variability between prompting methods. Wilcoxon Signed-Rank Tests on the pooled datasets revealed no statistically significant difference (W = 84.00, p = 0.679) between GPT-4 (Pooled 00) and Gemini Pro (Pooled 00) in the temperature $0.0$ condition.

Consistency Within Values:

Following Section 3.2.2, Cronbach’s $\alpha$ was used to assess the internal consistency of the value scale. For Basic Prompt $0.0$ , across various models, a notable percentage of items exhibited uniformity in responses across different iterations. In the Gemini Pro model, no variability at all was found in 87.72% of the items, and negligible variability was found in 12.28% of the items. In GPT 4 only one item displayed no variability. However, the internal consistency, as indicated by Cronbach’s $\alpha$ , was notably low ( $\alpha$ = 0.14). Furthermore, within the Basic Prompt $0.7$ , a similar trend of exceedingly low Cronbach’s $\alpha$ values was observed across various models: GPT 4 (Mean = 0.00, SD = 0.11) and Gemini Pro (Mean = -0.01, SD = 0.09).

The picture was different when using other prompts. In Figure 3, the mean and standard deviation scores of Cronbach’s $\alpha$ are visually depicted across the other models and the pooled dataset for the ’temperature $0.0$ ’ condition. The figure also includes the equivalent scores of a human sample, calculated as the average of the Cronbach’s $\alpha$ scores across 49 countries [11]. All of Gemini Pro’s scores in the ’temperature $0.0$ ’ condition were above the acceptable reliability threshold ( $\alpha$ $>$ 0.60) and exhibited slightly higher scores across all prompts. In contrast, GPT 4 scores on the Demographic and Names prompts were relatively low (Mean = 0.54, SD = 0.13 and Mean = 0.45, SD = 0.25 respectively).

Increasing the temperature to $0.7$ had a negative impact on the internal consistency of both language models across prompts (see Figure 6 in the appendix). Paired-sample t-tests confirmed the significance of this effect for the Gemini Pro model (t(1) = 15.97, p $<$ .001), but not for GPT 4 (t(1) = 1.00, p = .331). These results suggest that in some models elevated temperatures can compromise the coherence and consistency of language model responses.

Correlations Between Values

The MDS analysis (see Section 3.2.3) maps all values into $\mathbb{R}^{2}$ in a way that reflects their correlations. Here, we conduct MDS analyses for both human responses and LLM output, and then compare the results. The analyses were performed separately for each prompt, temperature, and model. Figure 4 illustrates a comparison between human MDS and Gemini Pro at temperature $0.0$ , for the Value Anchor and Names prompts, respectively. Notably, the disparities between Gemini Pro and GPT-4 for each prompt are minimal. GPT-4 plots and other Gemini Pro prompts are featured in the appendix as Figure 7 and Figure 8.

First, it can be seen that among humans, the values are organized in a circle in the theoretically expected order. This result has been replicated often over the years, and is interpreted as resulting from the aspiration of individuals to maintain personal consistency in their motivations [6]. Second, it can be seen that the MDS configuration resulting from the Value Anchor prompt more closely follows this order than the MDS resulting from the Names prompt. We further quantitatively compared the configurations by taking the sum of squared differences between each pair of human and prompting-method MDS matrices (i.e., matrices in $\mathbb{R}^{19\times 2}$ ). See Table 2. It can be seen that the Value Anchor prompt demonstrated a better fit to human values than the other prompting methods.

Model	Temperature	Value Anchor	Demographic	Generated Persona	Names
GPT 4	0.0	0.23	0.53	0.25	0.32
GPT 4	0.7	0.22	0.74	0.22	0.28
Gemini Pro	0.0	0.11	0.42	0.39	0.71
Gemini Pro	0.7	0.11	0.75	0.28	0.57

Table 2: Sum of squared differences between the MDS embeddings of humans and LLMs (lower values indicate a closer fit to the human configuration).

5 Discussion

In the current study we analyzed the values exhibited in LLM responses. We used three metrics to estimate the quality of the LLM responses against human responses: value ranking, internal consistency, and value correlations. Our results highlighted the importance of the prompting mechanism. Using the Basic prompt (namely, just providing a questionnaire with no further instructions), the LLM was likely to either generate negligible variance across generated personas, or generate internally inconsistent outputs (respond differently to questions about the same value). These results suggest that LLMs cannot be treated as ’individuals’ holding a coherent set of value priorities.

Prompts that endow the LLM with a “personality” improved the consistency of each specific value profile, to varying degrees. The value hierarchy was consistently found across prompts, indicating that at the mean level, LLMs can simulate value rankings of human populations. More variability between conditions was found in the internal consistency metric, with the Value Anchor prompt providing the best consistency, and the Names prompt a low consistency that would be considered unacceptable in a human sample. Finally, the Value Anchor prompt also best modeled the inter-value correlations. This is arguably the most important metric since it allows analyzing consistency across values, within a single session. These results suggest that LLMs, given suitable prompts, can produce a ’population’ of individuals, each reporting a different but coherent set of value priorities.

One fascinating question is where the LLM learns to produce such clear profiles of values. These profiles may be implicitly learned during pre-training. Indeed, past studies indicated that values can be identified in texts, such as newspaper articles and social media. However, these values did not necessarily follow the theoretical value inter-relations identified here [43, 44, 45]. Individuals who value competing values may experience stress and indecision when faced with a dilemma, resulting in gradual change in values toward a more coherent form [46, 42]. In contrast, text may very well present both sides of a dilemma and thus retain inconsistencies. Such inconsistencies were not identified because past studies relied mostly on lexical approaches [43, 44, 45]. LLMs, taking context into account, may be more likely to identify value inter-relations correctly. LLMs may also have learned to produce value profiles in the process of fine-tuning or RLHF [47]. Future research can try to distinguish these two sources of learning using careful analysis of training sources as well as evaluation of different checkpoints in the training process.

Past research into human personas sought ways to estimate the ability of the LLM to maintain a consistent persona across a conversation [34]. We establish that the unique qualities of human values, and the ample empirical knowledge collected about them, allow their use as such a method [3, 4, 48]. We suggest that known behavioral correlations in humans can be applied to assess the consistency of LLM personas. Here we focused on evaluations via a questionnaire, but one could envision more elaborate evaluations that rely on other features of human personalities.

The procedures and data produced here may make important contributions to psychological research. Investigators interested in human behavior can apply these procedures to produce datasets that simulate human samples. Future research can investigate their possible use to replicate known findings (e.g., age differences in values) or pretest novel hypotheses (e.g., associations between values and specific behaviors).

The question of values in LLMs is of course of philosophical and societal importance. Our results show that on average, these values largely reproduce international value rankings. However, small variations in value importance may have implications at the societal and individual level (e.g. gender roles [49]; entrepreneurship [50]; prosocial behavior [51]; and antisocial behavior [52]). Future work should consider the influence of these values on LLM responses, as well as on the individuals interacting with them.

The current study focused on a limited number of contexts (e.g., five prompts, two temperatures, and two models). Importantly, we found commonalities across the various contexts, alongside the differences. Future studies can use these results to decide which other contexts should be investigated, possibly further enhancing the quality of the output. Another limitation is the restriction to one questionnaire (PVQ-RR), corresponding to a specific value system. It will be interesting to explore other forms of probing values.

Table 3: Relative importance of the 19 values in the Value Anchor and Names datasets, at temperatures 0.0 and 0.7, for GPT-4 and Gemini Pro, compared with the human benchmark.

Benchmark			GPT-4								Gemini Pro
Human Data			Value Anchor (T=0.0)		Value Anchor (T=0.7)		Names (T=0.0)		Names (T=0.7)		Value Anchor (T=0.0)		Value Anchor (T=0.7)		Names (T=0.0)		Names (T=0.7)
Rank	Values	Mean	Mean	Rank	Mean	Rank	Mean	Rank	Mean	Rank	Mean	Rank	Mean	Rank	Mean	Rank	Mean	Rank
1	BEC	0.79	1.18	3	1.24	1	0.66	3	0.65	3.5	1.52	3	1.44	3	1.56	3	-0.10	13
2	BED	0.72	0.92	5	1.00	4	0.66	3	0.65	6	1.39	4	1.30	4	1.48	5	0.03	12
3	SDA	0.60	0.59	7	0.55	7	0.66	3	0.65	3.5	0.88	7	0.79	7	1.20	7	0.93	1
4	SDT	0.58	0.70	6	0.69	6	0.66	3	0.65	1.5	1.33	5	1.17	5	1.48	4	0.47	5
5	UNC	0.50	1.12	3	1.07	3	0.66	3	0.65	1.5	1.68	2	1.63	2	1.68	1	0.74	3
6	UNT	0.37	1.26	1	1.20	2	0.66	6	0.65	5	1.93	1	1.86	1	1.63	2	0.45	6
7	SES	0.32	0.41	8	0.36	8	0.64	7	0.58	7	-0.49	12	-0.42	12	0.60	8	0.14	10
8	SEP	0.28	0.21	9	0.26	9	0.48	11	0.43	9	-0.87	14	-0.75	14	-0.75	14	0.21	8
9	HE	0.23	-0.31	14	-0.33	14	0.50	9	0.41	10	0.04	9	-0.02	9	0.04	10	0.57	4
10	AC	0.08	0.07	11	0.09	11	0.06	14	0.10	14	-0.08	11	-0.22	11	-0.35	11	-0.54	16
11	FAC	0.05	-0.28	13	-0.31	13	0.30	12	0.20	13	-1.01	16	-0.91	16	-0.68	13	0.10	11
12	UNN	-0.10	0.98	4	0.97	4	0.58	8	0.51	8	1.17	6	1.10	6	1.29	6	0.19	9
13	ST	-0.11	-0.48	16	-0.44	16	-0.10	15	-0.14	15	-0.07	10	-0.09	10	-0.68	12	-1.03	18
14	COI	-0.16	-0.41	15	-0.42	15	-0.74	17	-0.65	17	-0.72	13	-0.58	13	-0.90	16	0.34	7
15	HUM	-0.20	0.20	10	0.20	10	0.30	13	0.22	12	0.12	8	0.15	8	0.52	9	0.84	2
16	COR	-0.26	-0.21	12	-0.18	12	0.48	10	0.41	11	-0.95	15	-0.79	15	-1.18	17	-0.33	14
17	TR	-0.72	-0.98	17	-0.93	17	-0.62	16	-0.59	16	-1.57	17	-1.43	17	-0.88	15	-0.52	15
18	POR	-1.33	-1.91	18	-1.83	18	-2.60	18	-2.58	18	-2.19	19	-2.12	19	-3.09	19	-0.83	17
19	POD	-1.40	-2.14	19	-2.05	19	-2.82	19	-2.81	19	-2.14	18	-2.10	18	-2.98	18	-1.64	19

References

Aher et al. [2023] Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. Proceedings of Machine Learning Research, 2023.
Fischer et al. [2023] Ronald Fischer, Markus Luczak-Roesch, and Johannes A Karl. What does ChatGPT return about human values? Exploring value bias in ChatGPT using a descriptive value theory. arXiv preprint, 2023.
Sagiv and Schwartz [2022] Lilach Sagiv and Shalom H Schwartz. Personal values across cultures. Annual review of psychology, 73(1):517–546, 2022. doi: 10.1146/annurev-psych-020821-125100.
Sagiv et al. [2017] Lilach Sagiv, Sonia Roccas, Jan Cieciuch, and Shalom H Schwartz. Personal values in human life. Nature human behaviour, 1(9):630–639, 2017. doi: 10.1038/s41562-017-0185-3.
Roberts and Yoon [2022] Brent W Roberts and Hee J Yoon. Personality psychology. Annual Review of Psychology, 73(1):489–516, 2022. doi: 10.1146/annurev-psych-020821-114927.
Schwartz [1992] Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992.
Schwartz [2012] Shalom H Schwartz. An overview of the Schwartz theory of basic values. Online readings in Psychology and Culture, 2(1):1–20, 2012. doi: 10.9707/2307-0919.1116.
Schwartz and Bardi [2001] Shalom H Schwartz and Anat Bardi. Value hierarchies across cultures: Taking a similarities perspective. Journal of cross-cultural Psychology, 32(3):268–290, 2001. doi: 10.1177/0022022101032003002.
Skimina et al. [2021a] Ewa Skimina, Jan Cieciuch, and William Revelle. Between-and within-person structures of value traits and value states: Four different structures, four different interpretations. Journal of Personality, 89(5):951–969, 2021a.
Daniel et al. [2023] Ella Daniel, Anna K Döring, and Jan Cieciuch. Development of intraindividual value structures in middle childhood: A multicultural and longitudinal investigation. Journal of Personality, 91(2):482–496, 2023.
Schwartz and Cieciuch [2022] Shalom H Schwartz and Jan Cieciuch. Measuring the refined theory of individual values in 49 cultural groups: psychometrics of the revised portrait value questionnaire. Assessment, 29(5):1005–1019, 2022. doi: 10.1177/1073191121998760.
Schwartz et al. [2012] Shalom H Schwartz, Jan Cieciuch, Michele Vecchione, Eldad Davidov, Ronald Fischer, Constanze Beierlein, Alice Ramos, Markku Verkasalo, Jan-Erik Lönnqvist, Kursad Demirutku, et al. Refining the theory of basic individual values. Journal of personality and social psychology, 103(4):663–688, 2012. doi: 10.1037/a0029393.
Schwartz [2017] Shalom H Schwartz. The refined theory of basic values. Values and behavior: Taking a cross cultural perspective, pages 51–72, 2017.
Pakizeh et al. [2007] Ali Pakizeh, Jochen E Gebauer, and Gregory R Maio. Basic human values: Inter-value structure in memory. Journal of Experimental Social Psychology, 43(3):458–465, 2007. doi: 10.1016/j.jesp.2006.04.007.
Skimina et al. [2021b] Ewa Skimina, Jan Cieciuch, and Włodzimierz Strus. Traits and values as predictors of the frequency of everyday behavior: Comparison between models and levels. Current Psychology, 40(1):133–153, 2021b. doi: 10.1007/s12144-018-9892-9.
Lindahl and Saeid [2023] Caroline Lindahl and Helin Saeid. Unveiling the values of ChatGPT: An explorative study on human values in AI systems, 2023.
Miotto et al. [2022] Marilù Miotto, Nicola Rossberg, and Bennett Kleinberg. Who is GPT-3? an exploration of personality, values and demographics. arXiv, 2022.
Scherrer et al. [2023] Nino Scherrer, Claudia Shi, Amir Feder, and David M. Blei. Evaluating the moral beliefs encoded in LLMs. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
Hadar-Shoval et al. [2024] Dorit Hadar-Shoval, Kfir Asraf, Yonathan Mizrachi, Yuval Haber, and Zohar Elyoseph. Assessing the alignment of large language models with human values for mental health integration: Cross-sectional study using Schwartz’s theory of basic values. JMIR Mental Health, 11, 2024.
Kovač et al. [2023] Grgur Kovač, Masataka Sawayama, Rémy Portelas, Cédric Colas, Peter Ford Dominey, and Pierre-Yves Oudeyer. Large language models as superpositions of cultural perspectives. arXiv preprint arXiv:2307.07870, 2023.
Liu et al. [2023] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023. doi: 10.1145/3560815.
Hadar-Shoval et al. [2023] Dorit Hadar-Shoval, Zohar Elyoseph, and Maya Lvovsky. The plasticity of ChatGPT’s mentalizing abilities: Personalization for personality structures. Frontiers in Psychiatry, 14:1234397, 2023. doi: 10.3389/fpsyt.2023.1234397.
Jiang et al. [2023] Hang Jiang, Xiajie Zhang, Xubo Cao, Jad Kabbara, and Deb Roy. PersonaLLM: Investigating the ability of GPT-3.5 to express personality traits and gender differences. arXiv, 2023.
Salewski et al. [2024] Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. In-context impersonation reveals large language models’ strengths and biases. Advances in Neural Information Processing Systems, 36, 2024.
Argyle et al. [2023] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
Safdari et al. [2023] Mustafa Safdari, Greg Serapio-García, Clément Crepy, Stephen Fitz, Peter Romero, Luning Sun, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. Personality traits in large language models. arXiv preprint arXiv:2307.00184, 2023.
Li et al. [2023] Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. arXiv preprint arXiv:2310.10701, 2023.
Gunel et al. [2020] Beliz Gunel, Jingfei Du, Alexis Conneau, and Ves Stoyanov. Supervised contrastive learning for pre-trained language model fine-tuning. arXiv preprint arXiv:2011.01403, 2020.
Hagendorff et al. [2023] Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nature Computational Science, 3(10):833–838, 2023. doi: 10.1038/s43588-023-00527-x.
Binz and Schulz [2023] Marcel Binz and Eric Schulz. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6), 2023. doi: 10.1073/pnas.2218523120.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
Stevenson et al. [2022] Claire Stevenson, Iris Smal, Matthijs Baas, Raoul Grasman, and Han van der Maas. Putting GPT-3’s creativity to the (alternative uses) test. arXiv preprint arXiv:2206.08932, 2022.
Deshpande et al. [2023] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv preprint, 2023.
Wang et al. [2024] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. arXiv preprint arXiv:2310.17976, 2024.
Gupta et al. [2024] Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, and Tushar Khot. Bias runs deep: Implicit reasoning biases in persona-assigned LLMs. arXiv preprint arXiv:2311.04892, 2024.
Lee et al. [2019] Julie A Lee, Joanne N Sneddon, Timothy M Daly, Shalom H Schwartz, Geoffrey N Soutar, and Jordan J Louviere. Testing and extending Schwartz refined value theory using a best–worst scaling approach. Assessment, 26(2):166–180, 2019. doi: 10.1177/1073191116683799.
National Academies of Sciences, Engineering, and Medicine [2022] National Academies of Sciences, Engineering, and Medicine. Measuring Sex, Gender Identity, and Sexual Orientation. National Academies Press, Washington, DC, 2022.
Haerpfer et al. [2022] Christian Haerpfer, Ronald Inglehart, Alejandro Moreno, Christian Welzel, Kseniya Kizilova, Jaime Diez-Medrano, Marta Lagos, Pippa Norris, Eduard Ponarin, and Bi Puranen, editors. World Values Survey: Round Seven - Country-Pooled Datafile Version 5.0. JD Systems Institute & WVSA Secretariat, Madrid, Spain & Vienna, Austria, 2022. doi: 10.14281/18241.20.
Laver-Fawcett et al. [2016] Alison Laver-Fawcett, Leanne Brain, Courtney Brodie, Lauren Cardy, and Lisa Manaton. The face validity and clinical utility of the Activity Card Sort–United Kingdom (ACS-UK). British Journal of Occupational Therapy, 79(8):492–504, 2016. doi: 10.1177/0308022616629167.
Cheng et al. [2023] Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. arXiv preprint arXiv:2305.18189, 2023.
Borg et al. [2018] Ingwer Borg, Patrick JF Groenen, and Patrick Mair. Applied multidimensional scaling and unfolding. Springer Science & Business Media, New York, NY, 2nd edition, 2018. doi: 10.1007/978-3-319-73471-2.
Daniel and Benish-Weisman [2019] Ella Daniel and Maya Benish-Weisman. Value development during adolescence: Dimensions of change and stability. Journal of personality, 87(3):620–632, 2019.
Bardi et al. [2008] Anat Bardi, Rachel M Calogero, and Brian Mullen. A new archival approach to the study of values and value–behavior relations: validation of the value lexicon. Journal of Applied Psychology, 93(3):483, 2008.
Ponizovskiy et al. [2020] Vladimir Ponizovskiy, Murat Ardag, Lusine Grigoryan, Ryan Boyd, Henrik Dobewall, and Peter Holtz. Development and validation of the personal values dictionary: A theory–driven tool for investigating references to basic human values in text. European Journal of Personality, 34(5):885–902, 2020.
Kumar et al. [2018] Upendra Kumar, Aishwarya N Reganti, Tushar Maheshwari, Tanmoy Chakroborty, Björn Gambäck, and Amitava Das. Inducing personalities and values from language use in social network communities. Information Systems Frontiers, 20:1219–1240, 2018.
Bardi et al. [2009] Anat Bardi, Julie Anne Lee, Nadi Hofmann-Towfigh, and Geoffrey Soutar. The structure of intraindividual value change. Journal of personality and social psychology, 97(5):913, 2009.
Qiu et al. [2022] Liang Qiu, Yizhou Zhao, Yuan Liang, Pan Lu, Weiyan Shi, Zhou Yu, and Song-Chun Zhu. Towards socially intelligent agents with mental state transition and human value. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 146–158, 2022.
Knafo-Noam et al. [2024] Ariel Knafo-Noam, Ella Daniel, and Maya Benish-Weisman. The development of values in middle childhood: Five maturation criteria. Current Directions in Psychological Science, 33(1):18–26, 2024.
Lomazzi and Seddig [2020] Vera Lomazzi and Daniel Seddig. Gender role attitudes in the international social survey programme: Cross-national comparability and relationships to cultural values. Cross-Cultural Research, 54(4):398–431, 2020.
Woodside et al. [2020] Arch G Woodside, Carol M Megehee, Lars Isaksson, and Graham Ferguson. Consequences of national cultures and motivations on entrepreneurship, innovation, ethical behavior, and quality-of-life. Journal of Business & Industrial Marketing, 35(1):40–60, 2020.
Daniel et al. [2020] Ella Daniel, Maya Benish-Weisman, Joanne N Sneddon, and Julie A Lee. Value profiles during middle childhood: Developmental processes and social behavior. Child Development, 91(5):1615–1630, 2020.
Benish-Weisman [2019] Maya Benish-Weisman. What can we learn about aggression from what adolescents consider important in life? the contribution of values theory to aggression research. Child Development Perspectives, 13(4):260–266, 2019.

Appendix A Additional Files

The Python and R code used to generate our prompt sets and analyses can be made available upon request, and will be added to open-source repositories for wider public use soon.

Appendix B Question Variants

For each value, we use three different question variants. For example, the three question variants formulated to assess individuals’ alignment with the value type "Power Dominance" (i.e., relating to asserting authority and control over others) are as follows:

• Question 6: "He desires recognition for his abilities and seeks admiration for his actions."
• Question 29: "He prefers taking charge of situations and making decisions."
• Question 41: "He actively seeks positions of power and influence, valuing control and authority over others."

Appendix C Value Acronyms

The figures in the paper use the following value acronyms: SDT = Self-Direction Thought; SDA = Self-Direction Action; ST = Stimulation; HE = Hedonism; AC = Achievement; POD = Power-Dominance; POR = Power-Resources; FAC = Face; SEP = Security-Personal; SES = Security-Societal; TR = Tradition; COR = Conformity-Rules; COI = Conformity-Interpersonal; HUM = Humility; UNN = Universalism-Nature; UNC = Universalism-Concern; UNT = Universalism-Tolerance; BEC = Benevolence-Caring; BED = Benevolence-Dependability

Appendix D Example Portrait Value Questionnaire

Figure 5 shows an example of the Portrait Value Questionnaire that was used in our study.

Appendix E The Complete Item List of Best-Worst Refined Values (BWVr)

In our value anchoring approach, we used the description of values in [36] to prompt the LLMs. The set of descriptions is provided below.

1. Self-direction-thought: developing your own original ideas and opinions
2. Self-direction-action: being free to act independently
3. Stimulation: having an exciting life; having all sorts of new experiences
4. Hedonism: taking advantage of every opportunity to enjoy life’s pleasures
5. Achievement: being ambitious and successful
6. Power-dominance: having the authority to get others to do what you want
7. Power-resources: having the power that money and possessions can bring
8. Face: protecting your public image and avoiding being shamed
9. Security-personal: living and acting in ways that ensure that you are personally safe and secure
10. Security-societal: living in a safe and stable society
11. Tradition: following cultural family or religious practices
12. Conformity-rules: obeying all rules and laws
13. Conformity-interpersonal: making sure you never upset or annoy others
14. Humility: being humble and avoiding public recognition
15. Benevolence-dependability: being a completely dependable and trustworthy friend and family member
16. Benevolence-caring: helping and caring for the wellbeing of those who are close
17. Universalism-concern: caring and seeking justice for everyone especially the weak and vulnerable in society
18. Universalism-nature: protecting the natural environment from destruction or pollution
19. Universalism-tolerance: being open-minded and accepting of people and ideas, even when you disagree with them
20. Animal welfare: caring for the welfare of animals

Appendix F Additional Results for Value Rankings

Table 3 provides additional results on value rankings for different prompting approaches.
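
As a rough illustration of how two columns of Table 3 can be compared, the sketch below computes a Spearman rank correlation between the human benchmark means and the GPT-4 Value Anchor means at temperature 0.0. The numbers are copied from those two columns of Table 3. The helper names (average_ranks, pearson, spearman) are illustrative, ranks are recomputed from the means rather than taken from the table, and this statistic is only one plausible summary of agreement, not necessarily the analysis reported in the paper.

```python
# Value order follows the human ranking in Table 3.
VALUES = ["BEC", "BED", "SDA", "SDT", "UNC", "UNT", "SES", "SEP", "HE", "AC",
          "FAC", "UNN", "ST", "COI", "HUM", "COR", "TR", "POR", "POD"]
# Means copied from Table 3: human benchmark column.
HUMAN = [0.79, 0.72, 0.60, 0.58, 0.50, 0.37, 0.32, 0.28, 0.23, 0.08,
         0.05, -0.10, -0.11, -0.16, -0.20, -0.26, -0.72, -1.33, -1.40]
# Means copied from Table 3: GPT-4, Value Anchor prompt, temperature 0.0.
GPT4_VA_T0 = [1.18, 0.92, 0.59, 0.70, 1.12, 1.26, 0.41, 0.21, -0.31, 0.07,
              -0.28, 0.98, -0.48, -0.41, 0.20, -0.21, -0.98, -1.91, -2.14]


def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


rho = spearman(HUMAN, GPT4_VA_T0)
print(f"Spearman rho (human vs. GPT-4 Value Anchor, T=0.0): {rho:.3f}")
```

Swapping GPT4_VA_T0 for another column of Table 3 gives the same comparison for a different prompt, temperature, or model.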

Appendix G Cronbach's $\alpha$ scores for temperature $0.7$

In the main text we provided results for Cronbach's $\alpha$ at zero temperature. Here we provide further results for temperature $0.7$ in Figure 6. As expected, the numbers are overall lower than in the zero-temperature case.
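
For reference, the standard definition of Cronbach's $\alpha$ for a value measured by $k$ items is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$, where $\sigma_i^2$ is the variance of item $i$ across respondents and $\sigma_T^2$ is the variance of the summed score. The minimal sketch below applies this formula to the three items of a single value (see Appendix B), assuming each LLM session is treated as one respondent. The ratings are hypothetical and serve only to illustrate the computation; the function name is illustrative, and this is not the analysis code used for the paper.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items.

    item_scores is a list of k lists; each inner list holds one item's
    scores, one entry per respondent (here: per LLM session).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Summed score per respondent across the k items.
    totals = [sum(per_respondent) for per_respondent in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical ratings for the three items of one value across five sessions.
items = [
    [5, 3, 6, 2, 4],
    [5, 2, 6, 3, 4],
    [4, 3, 5, 2, 5],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Sample variances are used throughout; since the same denominator appears in both the item and total variances, population variances would give the same $\alpha$.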

Appendix H MDS figures for Gemini Pro and GPT-4 for temperature $0.0$

In the main text we provided the MDS plots for Gemini Pro for Value Anchor and Names. Here we provide further plots for Gemini Pro in Figure 7, and all the GPT-4 plots for temperature $0.0$ in Figure 8.