How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert (University of Michigan, United States)

(2024)

Abstract.

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted a dynamic experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants, and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as “AI” (disclosure). We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

artificial intelligence, creativity, large language models, cultural evolution

Preprint; Jul 03, 2024. CCS Concepts: Human-centered computing → Human computer interaction (HCI); Human-centered computing → Empirical studies in HCI; Human-centered computing → Empirical studies in collaborative and social computing; Human-centered computing → Collaborative interaction; Computing methodologies → Artificial intelligence; Computing methodologies → Natural language processing.

1. Introduction

If we think of culture as a “loop” where individuals and societies shape each other through exchanges of ideas and practices (Richerson and Boyd, 2008; Boyd and Richerson, 1988), then a question emerges: What happens when generative AI enters the “culture loop?” Exposure to LLMs (large language models) is increasing rapidly: When released, ChatGPT was the fastest-growing consumer application in history (Hu, 2023). Moreover, we are likely exposed to even more AI content than we realize: Humans overestimate their ability to distinguish AI from human content (Jakesch et al., 2022). This exposure likely matters: Ideas we see affect the ideas we create (Nijstad and Stroebe, 2006). How will the rapid rise of exposure to LLM-generated ideas affect the creativity, diversity, and evolution of human ideas? And to what extent do AI ideas influence human ideas?

The scale of ‘passive exposure’ to AI ideas is high, and different from prior human-AI interactions. By ‘passive exposure’, we refer to cases when (A) users see LLM outputs but do not have an active role in the creation of these outputs and (B) users are given no instructions to actively engage with these outputs. ‘Passive exposure’ approximates how users often encounter LLM outputs in the real world. For example, OpenAI users generate 100 billion words per day (Griffin, 2024). It is likely that the number of people who are merely seeing (i.e., passively exposed) AI output is significantly larger than the number of people who are creating (i.e., actively engaging) with these systems. Arguably, many future human-AI teams will exhibit this relationship. Yet in existing studies of human-AI creativity, participants are often actively interacting with an AI system (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019; Padmakumar and He, 2024). It is crucial, then, to understand how passive AI exposure shapes human ideas.

As AI exposure has increased, so have concerns over AI disclosure, that is, whether providers should disclose when they use AI systems (Hancock et al., 2020). California, for instance, considered passing a law requiring disclosure on behalf of anyone using bots on social media (Williams, 2018). Concerns regarding the disclosure of LLMs are likely to grow: in domains ranging from poetry (Köbis and Mossink, 2021) to online social media profiles (Jakesch et al., 2022), LLM output is increasingly indistinguishable from that of humans. We are interested, then, in whether disclosing ideas as coming from AI moderates the effect of AI exposure.

Figure 1. Graphical depiction of experiment. The task (Panel 1) is to submit a creative idea after seeing examples, where examples are from humans or AI. We vary (Panel 2) the amount of AI ideas in the example set (exposure) and if AI ideas are labeled as such (disclosure). The experiment is dynamic (Panel 3). Responses from prior participants serve as examples for future participants.

Motivated by these dynamics, we conducted a large-scale experiment to systematically test how AI exposure and disclosure affect the creativity, diversity, and evolution of human ideas. We employ a variant of the Alternate Uses Task (AUT, (Guilford, 1978)), a common measure of creativity, and manipulate exposure to LLM ideas. In the AUT, participants are told to think of non-obvious uses of an item. For example: What is a creative use for a tire? In our variant, participants complete the AUT for an item after viewing example ideas. These examples constitute our manipulation. Examples vary in AI exposure (none, few, or many AI examples) and AI disclosure (whether AI-generated ideas are labeled as such). The human-generated ideas in each example set come from prior participants in the same experimental condition. See Figure 1 for a graphical depiction of the experiment.

Our dynamic experiment design—ideas from prior participants are used as stimuli for future participants—speaks to the interdependent process of cultural creation: creative ideas are built upon prior ideas. Hence, we capture the compounding effects of having LLMs “in the culture loop”. It is also intended to mimic possible futures for human-AI teams. Our design allows us to observe not just average levels but also temporal dynamics of creativity and diversity in each condition. Taken together, our results provide insights into the role of LLMs in shaping collective thought.

Concretely, our main findings are:

(1) High AI exposure increases collective diversity but not individual creativity. We find that high AI exposure increases collective idea diversity, but does not affect individual creativity. Our high-powered null finding around creativity can inform public debates over the creative impact of seeing AI ideas ‘in the wild’. However, we found conditions with high levels of AI exposure had more collective idea diversity. That is, ideas in the high AI exposure conditions were more different from each other. Our findings around creativity and diversity suggest the effect of AI exposure may be nuanced: The introduction of AI ideas into human society may yield more diverse but no better human ideas.

(2) High AI exposure increases the speed at which idea diversity develops. Culture is constantly evolving (Boyd et al., 2011; Boyd and Richerson, 1988), yet many laboratory experiments are not designed to model this evolution. Through our dynamic design, we find that high AI exposure increases not only the average levels of collective idea diversity but also the rate of change in idea diversity. This is a consequential finding since even small differences in rates of change can lead to large cumulative differences over time.

(3) People who identify as creative are less influenced by AI disclosure. Prior work argues that attitudes and expectations shape engagement with human-AI co-creation systems (Gero et al., 2022). Due to our large sample size, we can model this heterogeneity. We find that for users who self-identify as highly creative, adoption of AI ideas is not influenced by AI disclosure. But AI disclosure did affect the adoption of AI ideas for users who self-identified as low in creativity. This finding suggests that highly creative people will not be “duped” into adopting AI ideas.

(4) Participants may adopt AI ideas for harder prompts. We find that when AI ideas are disclosed, participants are more likely to adopt the ideas of AI for difficult AUT prompts. This suggests that users will rely on AI ideas not for trivial creative tasks but for difficult ones. But since this finding is based on a small number of prompts, we view this finding as speculative and encourage more work on the topic.

1.1. Defining Concepts and Variables

1.1.1. Creativity

Creativity is defined in many ways (Walia, 2019). But one common conception is divergent thinking (Guilford, 1967). This is when “an individual solves a problem or reaches a decision using strategies that deviate from commonly used or previously taught strategies” (of Psychology, [n. d.]). One of the most common (Abraham, 2016) tests of divergent thinking is the Alternate Uses Task (AUT) (Guilford, 1978) [Footnote 1: https://www.mindgarden.com/67-alternate-uses], where participants are asked to think of an original use for an everyday object. Traditionally, responses to the AUT are measured along four dimensions: originality (how original the idea is), elaboration (how much the participant elaborates on the idea), fluency (how many ideas), and flexibility (different categories of ideas). The latter two can only be measured if the participant provides multiple responses to the same question. Due to our research design [Footnote 2: Participants see the most recent responses in the condition as stimuli, so if one participant brainstorms many responses that participant would be over-represented in future participants’ example sets.], we have participants generate just one creative idea (as in (Beaty et al., 2022)), and we focus on originality.

We follow a long tradition of scoring responses to the AUT computationally (Yu et al., 2023; Beaty and Johnson, 2021; Beaty et al., 2022; Yang et al., 2023; Organisciak et al., 2022; Dumas et al., 2021). Specifically, we measure the creativity of AUT ideas with an existing fine-tuned GPT-3 classifier (Organisciak et al., 2022, 2023), which has an r=0.81 overall correlation with human judgments of AUT originality. Moreover, we chose AUT items for our experiment where the classifier had the highest accuracy [Footnote 3: tire (r=0.91), pants (r=0.91), shoe (r=0.91), table (r=0.9), and bottle (r=0.88)]. Note that our task is highly ‘in-domain’ for the classifier: we ask participants to do the same exact task for the same exact items the model was trained on. We refer to the originality score from this classifier as individual-level creativity, though we note that future work can explore other dimensions of creativity (such as fluency). We discuss this classifier in more detail in Section 3.

1.1.2. Idea Diversity & AI Adoption

In addition to creativity, we measure how our experimental factors (LLM exposure and LLM disclosure) shape the diversity of ideas that participants produce. This is a complementary measure to creativity. Creativity is often thought of as an individual-level outcome. Diversity is a collective outcome. Put another way, creativity is a property of an idea while diversity is a property of an idea set. We measure two sides of diversity—semantic divergence (which we refer to as idea diversity) and semantic convergence towards AI ideas (which we refer to as AI adoption).

To measure idea diversity and AI adoption, we first embed all ideas using SBERT (Reimers and Gurevych, 2019), which are transformer-based embeddings designed for sentences. SBERT excels at capturing semantic similarity (Reimers and Gurevych, 2019). Prior work uses neural embeddings to compute similarity for AUT responses (Baten et al., 2021) and other creative tasks (Roemmele, 2021).

Idea diversity is the median pairwise cosine distance between idea embeddings in an idea set. As robustness checks, we also measure the mean pairwise distance and average distance to the centroid of a set.
AI adoption is the maximum cosine similarity between the embedding of the idea a participant submits and the embeddings of AI examples that the participants see. Following Roemmele (2021), we use the max rather than a measure of central tendency because if a participant is inspired by an idea, it would likely be a single idea. As robustness checks, we also measure the mean and median pairwise similarity between the submitted idea and an AI example, but these are noisier measures of adoption.
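
As a rough illustration of the two definitions above, the following sketch computes idea diversity (median pairwise cosine distance) and AI adoption (maximum cosine similarity to the AI examples seen). It is not the authors' code: the function names are hypothetical, and the toy 2-D vectors merely stand in for SBERT sentence embeddings, which would be computed separately.

```python
import math
from itertools import combinations
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def idea_diversity(embeddings):
    """Median pairwise cosine distance (1 - similarity) within an idea set."""
    distances = [
        1.0 - cosine_similarity(u, v) for u, v in combinations(embeddings, 2)
    ]
    return median(distances)


def ai_adoption(submitted, ai_examples):
    """Maximum cosine similarity between a submitted idea and the AI examples seen."""
    return max(cosine_similarity(submitted, example) for example in ai_examples)


# Toy vectors standing in for SBERT embeddings (hypothetical values).
idea_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
diversity = idea_diversity(idea_set)
adoption = ai_adoption([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]])
```

Using the maximum for adoption means a single close match to one AI example is enough to register, which matches the reasoning that a participant would likely be inspired by one idea rather than by the whole set.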

2. Related Work

Our work bridges three research streams: human-AI co-creation, crowd-sourced creativity, and complex systems. AI ideas are scattered amongst human ideas, whether or not we can tell (Jakesch et al., 2022). This exposure presumably affects the ideas we create (co-creation). And our ideas presumably affect the ideas others create (crowdsourced creativity). While real-world culture is dynamic and evolving, most experiments are not set up to capture evolution (collective dynamics). By employing a large-N sample size and ‘many-worlds’ setup, we model the complex dynamics of AI influence. After discussing how our study bridges these streams, we turn to the particular kind of creativity and diversity our experiment captures and what is known about how our two factors (LLM exposure; LLM disclosure) would affect these outcome variables. However, much of the relevant literature gives conflicting predictions, a key motivation for conducting the current study.

2.1. Situating Our Work Between Co-Creation, Crowd Creativity, and Collective Dynamics

2.1.1. Human-AI Co-Creation

As the creative ability of AI advances (Miller, 2019), researchers explored how co-creating with AI affects human creativity. Much of this research explores creative writing with language models, in particular (Gero et al., 2022; Mirowski et al., 2023; Lee et al., 2022; Yang et al., 2022; Yuan et al., 2022; Roemmele, 2021; Di Fede et al., 2022; Hitsuwari et al., 2022; Gero, 2023; Mizrahi et al., 2020; Gero and Chilton, 2019; Padmakumar and He, 2024). While most prior work in this domain involves users actively engaging with custom systems, our study is concerned with passive exposure to outputs from off-the-shelf models. (By ‘passive exposure’ we mean that (1) users are shown LLM outputs but did not have an active role in the creation of these outputs and that (2) users were given no instructions to actively engage with these outputs; they were merely shown the LLM outputs. ‘Passive exposure’ approximates how users often encounter LLM outputs in the real world.) Human-AI co-creation shows that the relationship between AI ideas and their effect on human creativity is nuanced, with task-level and attitudinal factors playing a role. Roemmele (2021) found seeing AI examples influenced outcomes for hard, but not easy, prompts. Gero et al. (2022) found the quality of LLM outputs did not correlate with perceived usefulness. This is consistent with other research showing large variance in the perceived usefulness of outputs from co-creation systems (Calderwood et al., 2020), suggesting human attitudes partially determine the utility of AI creativity aids. We extend this predominantly qualitative work with a large-scale quantitative study.

Most similar to our work, several studies have explored how (post-ChatGPT) generative artificial intelligence affects creativity and diversity. Several studies found that ideating with generative AI can decrease diversity (Anderson et al., 2024; Doshi and Hauser, 2024; Padmakumar and He, 2024). Some studies suggest generative AI increases individual creativity (Doshi and Hauser, 2024; Dell’Acqua et al., 2023) while others (Anderson et al., 2024) find no effect. Our study offers several additions to this literature. First, because of its dynamic design, we test the long-run effects of AI, where ideas feed forward to future participants. Second, our study is concerned with “passive exposure”: Participants are not told ideas are from AI, and are not instructed to engage with these ideas. By systematically ablating whether AI ideas are disclosed as such, we can explore if the effect of AI ideas depends on knowledge of where the idea is from. Third, we employ a large sample size—which is useful since it provides power for precise estimates of effects and the ability to capture heterogeneity. Moreover, our large sample is comprised of creative professionals and technology-oriented users, two groups most relevant to the phenomenon.

2.1.2. Crowdsourced Creativity

Crowdsourcing can enhance creative outcomes (Yu and Nickerson, 2011, 2013; Nickerson and Sakamoto, 2010; Huang et al., 2020; Siangliulue et al., 2015). For example, Yu and Nickerson (2013) devised a method where crowds build on each other’s ideas by combining ideas from previous generations. Later generations of ideas were rated as more creative compared to earlier generations. Siangliulue et al. (2015) found that the creativity and diversity of idea sets that participants saw influenced the creativity and diversity of what these participants produced. This supports a main contention of our paper: AI exposure matters because the ideas we see affect the ideas we create. We incorporate elements of crowdsourced creativity, particularly in measuring how creativity and diversity unfold over subsequent generations.

2.1.3. Collective Dynamics & Many-Worlds Experiments

Prior work in complex systems and computational sociology highlights the importance of studying collectives to understand social dynamics (Salganik and Watts, 2009). Meanwhile, traditional experiments focus on individuals. Identifying the effect of AI ideas on the diversity and evolution of human ideas similarly requires an examination of complex systems as opposed to individuals in isolation. Hence, our experimental design draws on the “many-worlds” paradigm (e.g., (Salganik et al., 2006)): We create multiple, parallel realizations of worlds with and without AI ideas, each evolving independently under controlled conditions. By employing a large-N sample size and many different parallel worlds, we can better understand the collective effects of AI ideas on human ideas. Beyond simple averages, we can also model how AI ideas affect the evolution of human ideas.

2.1.4. Our Contributions

Our study incorporates elements of human-AI co-creation, crowd creativity, and collective dynamics. We note that co-creation studies often confound the effect of exposure with the effect of disclosure: If one is creating with an AI system, it is impossible to separate the content of an AI system from the knowledge that the content is from an AI system. Our factorial design lets us estimate the marginal effect of exposure and disclosure separately. Co-creation studies typically employ a small number of specialized participants actively engaged with a system. From the perspective of validating a system, this is reasonable. But we are interested in the effects of (1) passive exposure on (2) a general public. For this reason, we adopt a large-scale experimental design—similar to crowd-sourced creativity studies—that lets us estimate effects on the general public rather than specialized users. A key benefit of our large sample size is that we can precisely estimate how participant attitudes affect human-AI outcomes. This is important because, as Gero et al. (2022, pg. 1016) write: “[P]articipant attitudes are a major unknown factor when studying human-AI collaboration.” Drawing on the “many-worlds” paradigm, our experiment design also lets us understand the effect of AI over time since responses feed forward, allowing us to observe differences in rates of change between conditions.

2.2. Factor 1: LLM Exposure

2.2.1. Effects on Creativity

Intuitively, the effect of exposure to ChatGPT ideas will depend on how creative ChatGPT answers are relative to human ideas. In preliminary testing, we found that the answers to the AUT generated by our prompt were scored as more creative than the ideas generated by humans (see Appendix E) via the Organisciak et al. (2022) classifier. LLM generations may be increasing in creativity: while GPT-3 (an earlier model than ChatGPT-3.5) scored lower than humans on the AUT (Stevenson et al., 2022), GPT-4 (a more recent model than ChatGPT-3.5) scored among the top percentile of humans on a similar verbal creativity task (Guzik et al., 2023), as measured by human judges.

Even if language models can generate creative ideas, it is unclear from prior work if mere exposure to these ideas can increase human creativity. On one hand, the associative model of brainstorming suggests that exposure to others’ ideas can stimulate idea generation by activating a non-accessible concept of a participant’s memory (Nijstad and Stroebe, 2006; Brown and Paulus, 2002; Paulus and Brown, 2007). For example, ChatGPT may come up with a use for a bottle that you never associated with bottles. This can then inspire you to come up with creative uses along this line. In this way, ChatGPT can stimulate creativity. On the other hand, there is also evidence that seeing the ideas of others inhibits a participant’s idea generation if “one is exposed to an idea that has few connections to other ideas in an individual’s semantic network” (Paulus and Brown, 2007, pg. 10). Indeed, this appeared to be the case in Yang et al. (2022). There is a possibility that AI ideas are creative but so divorced from how humans generate ideas that seeing these ideas actually has an inhibiting effect. Separate from prior academic work, there are public debates about the impact of LLMs (such as ChatGPT) on creativity (e.g., (News, 2023; Review, 2023; Jared Henderson, 2022; Krish Naik, 2023; Tubefilter, 2023; Eapen et al., 2023; Wilcot, 2023)). Many of these debates assume ChatGPT will have some impact on an individual’s creativity—either good or bad. Our work contributes empirical results to this broader public conversation.

2.2.2. Effects on Diversity

Prior work in AI co-creation finds mixed effects. Collaborating with AI can lead to more diverse (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019) or less diverse outputs (Padmakumar and He, 2024; Doshi and Hauser, 2024; Dell’Acqua et al., 2023). But note that these studies are testing active engagement, and most test active engagement with intentionally constructed systems. This is different from the passive, incidental exposure to AI ideas that now occurs in everyday life. Writers call ChatGPT ‘a blurry JPEG of the internet’ (Chiang, 2023) and discuss its ‘incredible blandness’ (Mangalaseril, 2023); researchers call it a ‘stochastic parrot’ (Bender et al., 2021). It is not clear, then, how passive exposure to ideas from off-the-shelf LLMs, precisely the kind we are inundated with, would affect the diversity of human ideas.

2.3. Factor 2: LLM Disclosure

2.3.1. Effects on Creativity

Building on Hwang and Won (2021), we employ the theory of social facilitation (Bond and Titus, 1983) to understand how LLM disclosure can affect human creativity. Facilitation theory is concerned with how the presence of others affects one’s performance. Hwang and Won (2021) asked participants to brainstorm with chatbots (which gave pre-programmed responses) and experimentally varied whether or not participants were told that their partner was a chatbot. Disclosing that the partner was a chatbot led to higher creativity in participant responses, which Hwang and Won (2021) attribute to the novelty of brainstorming with a chatbot. We build on this notion of facilitation as a theoretical lens. However, it is not clear if the finding of Hwang and Won (2021) (that telling people they are brainstorming with a chatbot increases creativity) would replicate in our study, especially in a post-ChatGPT era. First, we are measuring exposure and not direct engagement with chatbots. The novelty of a chatbot may be higher when you are the one working with it to generate ideas. Second, the novelty of talking to a chatbot is presumably lower now due to the widespread popularity of ChatGPT. Moreover, we may expect heterogeneity in disclosure’s effect on creativity and diversity. Users who have lower self-perceived creative abilities may feel ‘competition’ with AI due to its presence and, in turn, submit more creative responses when they know the ideas they are exposed to are from AI.

2.3.2. Effects on Diversity

It is not clear how knowing content is from AI will affect the diversity of ideas participants produce. But prior work suggests heterogeneity along two lines: the difficulty of the prompt [Footnote 4: As discussed later, we measure the ‘difficulty’ of a prompt by the inverse rank of the average creativity in the control condition. If participants tended to submit lower creativity ideas in the control condition for item X, we said item X was difficult.], and the attitude of the participant. Prior work suggests that disclosing ideas as AI-generated would decrease diversity due to automation bias, the tendency to over-rely on AI systems (Schemmer et al., 2022; Mosier et al., 1996; Goddard et al., 2014). Increased reliance on AI ideas (when labeled as such) could lead to lower idea diversity and higher AI adoption. Conversely, some evidence suggests people display algorithmic aversion to creative products such as haikus (Hitsuwari et al., 2022) or art (Kirk et al., 2009). This aversion would yield the opposite prediction. Roemmele (2021) found that seeing AI examples only affected the participant’s writing on a key measure for difficult prompts, suggesting creative task difficulty might moderate the effect of disclosure on AI adoption. Task confidence decreases reliance on automated systems and trust in a system increases reliance on automated systems (Goddard et al., 2014). Although this literature is not usually applied to creativity, we might then suspect that people self-reporting low creativity (i.e., low task confidence) and those who think AI is more creative than humans (i.e., high system trust) are most likely to increase adoption of AI ideas when the source is disclosed.
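
The prompt-difficulty measure described in the footnote above (inverse rank of average creativity in the control condition) can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function name, the direction of the scale (a larger number means a harder item), and the toy scores are all assumptions, and ties are not handled.

```python
from statistics import mean


def difficulty_ranks(control_scores):
    """Rank items by difficulty from control-condition creativity scores.

    control_scores maps item -> list of creativity scores in the control
    condition. Items are ordered from highest to lowest mean creativity, so
    the item with the lowest mean gets the largest rank (assumed here to
    mean "most difficult").
    """
    means = {item: mean(scores) for item, scores in control_scores.items()}
    easiest_first = sorted(means, key=means.get, reverse=True)
    return {item: rank for rank, item in enumerate(easiest_first, start=1)}


# Hypothetical creativity scores on a 1-5 scale.
toy_scores = {"tire": [2.0, 3.0], "bottle": [1.0, 2.0], "shoe": [3.0, 4.0]}
ranks = difficulty_ranks(toy_scores)
```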

3. Pre-Experiment

Before describing the experiment, we discuss how we chose the five specific AUT (Alternate Uses Task) items and constructed our ChatGPT prompt.

3.1. Stimuli Construction

3.1.1. Choosing AUT Items

We had to choose a selection of items that people would brainstorm creative uses for. We chose the five items for which the creativity classifier that we used had the highest accuracy. Previously, Organisciak et al. (2022) fine-tuned GPT-3 Davinci to predict the creativity of AUT responses. Their dataset contains 20,121 responses from 2,025 participants, across 21 distinct AUT items and nine distinct studies (Organisciak et al., 2022). [Footnote 5: We obtained this dataset by direct correspondence with Dr. Organisciak on February 23, 2023; the code that Dr. Organisciak used to generate this dataset is available at https://github.com/massivetexts/llm_aut_study/blob/main/notebooks/Process_AUT_GT.ipynb] Each response was graded for creativity by humans and normalized to a scale of 1-5. Then Organisciak et al. (2022) fine-tuned GPT-3 Davinci on this dataset to predict creativity scores. Here, fine-tuning involves providing {Input (an AUT response), Output (human rating)} pairs to a pre-trained LLM. Then the LLM adjusts its parameters to produce a similar output given an input, proxying human judgments. Overall, the fine-tuned GPT-3 classifier had a correlation of $r=0.81$ with human judgment. [Footnote 6: We obtained scores for this classifier by downloading the zip file from https://github.com/massivetexts/llm_aut_study/blob/main/results/evaluation.zip, then navigating to gt_main2/gpt-ft-davinci-1.csv] Accuracy varied by item (see the appendix table of AUT descriptive statistics). For our experiment, we picked the five items for which the classifier had the highest accuracy: tire (r=0.91), pants (r=0.91), shoe (r=0.91), table (r=0.9), and bottle (r=0.88). Our task is ‘in-domain’ for the classifier since we ask participants to do the same task for the same items the classifier was trained on.
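
The item-selection step above amounts to computing a per-item correlation between classifier scores and human ratings and keeping the best-scoring items. The sketch below illustrates that logic under stated assumptions: the function names and the toy ratings are hypothetical, Pearson's r is assumed as the correlation measure, and it does not read the actual evaluation files.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def top_items(ratings_by_item, k):
    """Return the k items whose model scores correlate best with human ratings.

    ratings_by_item maps item -> (human_ratings, model_scores).
    """
    r_by_item = {
        item: pearson_r(human, model)
        for item, (human, model) in ratings_by_item.items()
    }
    return sorted(r_by_item, key=r_by_item.get, reverse=True)[:k]


# Hypothetical ratings; the study itself used k = 5 over 21 items.
toy = {
    "tire": ([1, 2, 3, 4], [1, 2, 3, 4]),
    "rope": ([1, 2, 3, 4], [2, 1, 4, 3]),
    "brick": ([1, 2, 3, 4], [4, 3, 2, 1]),
}
selected = top_items(toy, k=2)
```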

3.1.2. Generating GPT Ideas

We generated AI ideas with ChatGPT-3.5 and a zero-shot prompt based on prior work. These decisions followed two principles: ecological validity and precedent.

Model & Prompting Strategy

Our model and prompting strategy were driven by a desire to approximate how ordinary users would use large language models to generate ideas. First, we used ChatGPT-3.5, the latest ChatGPT model freely available at the time of the study. Because ChatGPT has a popular and accessible UI, we assume users would be more likely to use ChatGPT rather than a model available only through an API or on a limited basis. Second, we used zero-shot prompting rather than few-shot prompting. Because zero-shot prompting requires no labeled data, this would be a more natural use case for most users.

Prompt Construction

Our specific zero-shot prompt was informed by prior work on LLMs and creativity. Stevenson et al. (2022) administered the AUT to GPT-3 through a zero-shot prompt. However, this prompt generated much wordier responses $(M=25.4,SD=8.5)$ than the human responses in the Organisciak Dataset $(M=4.6,SD=5)$. Such a discrepancy would alert participants to what was AI vs. human generated, which would nullify the disclosure factor (whether the source of an idea is disclosed). Hence, we appended a request (Figure 2) to use roughly the same number of words (5) as the average human response. The modified prompt resulted in responses with an average word length $(M=4.4,SD=1.3)$ much closer to human responses than the un-modified Stevenson prompt, $p<0.001$ by two-tailed permutation tests. We use the ideas from this modified Stevenson prompt for our experiments (see Appendix B for more details).
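
For readers unfamiliar with permutation tests, the sketch below shows a generic two-tailed permutation test on a difference in mean word counts. The exact test statistic used in the paper is not specified in this section, so the absolute difference in means is an assumption, as are the function name, the number of permutations, and the toy word counts.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-tailed permutation test for a difference in means.

    Returns the share of random relabelings whose absolute mean difference
    is at least as large as the observed one (with a +1 correction so the
    p-value is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical word counts: wordy responses vs. short responses.
long_responses = [25, 27, 22, 30, 24, 26]
short_responses = [4, 5, 4, 6, 5, 4]
p_value = permutation_test(long_responses, short_responses, n_permutations=2000)
```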

What are some creative uses for a [OBJECT]? The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. List creative uses for a [OBJECT]. Make sure each response is [MEAN HUMAN WORDS] words.

Figure 2. To generate AUT ideas, we used the zero-shot prompt from Stevenson et al. (2022) with an additional instruction at the end to match the mean length of human responses from prior work.
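
The prompt in Figure 2 is a template with two placeholders. A minimal sketch of filling it in is shown below; the function and parameter names are hypothetical, and the code only builds the prompt string without calling any model.

```python
# Template text follows Figure 2; {obj} replaces [OBJECT] and
# {mean_human_words} replaces [MEAN HUMAN WORDS].
PROMPT_TEMPLATE = (
    "What are some creative uses for a {obj}? "
    "The goal is to come up with creative ideas, which are ideas that strike "
    "people as clever, unusual, interesting, uncommon, humorous, innovative, "
    "or different. "
    "List creative uses for a {obj}. "
    "Make sure each response is {mean_human_words} words."
)


def build_prompt(obj, mean_human_words):
    """Fill the zero-shot template for one AUT item."""
    return PROMPT_TEMPLATE.format(obj=obj, mean_human_words=mean_human_words)


prompt = build_prompt("tire", 5)
```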

4. Experiment

4.1. Summary

We recruited participants from a mixture of social media and newsletters. Once participants clicked on the link to the experiment, they were taken to a landing page. In addition to a consent button, that landing page asked several questions. Participants were asked (1) to rate their creativity relative to other humans and to AI, (2) to rate their attitude towards AI (Nadeem, 2022, 2023), (3) age, (4) country, and (5) gender. After consenting, participants engaged in 5 trials. For each trial, a participant generated a creative use for an item under a specific experimental condition, after seeing example ideas. These example ideas constituted our experimental manipulation. Ideas fed forward to future trials such that if a participant was in the {[Control], tire} condition the example ideas the participant saw were the most recent ideas from prior participants in the {[Control], tire} condition. See Table 1 for experimental conditions. The experiment took place in the summer of 2023.

Table 1. Experiment conditions and associated factors. Participants complete the Alternate Uses Task in each condition after being exposed to prior responses generated in each condition. Conditions vary by LLM exposure (none, high or low) and LLM disclosure (source is labeled or not).

Condition	Number of AI Examples	Number of Human Examples	Source Disclosed (Y/N)
Control	0	6	N
High Exposure; Disclosed	4	2	Y
High Exposure; Not Disclosed	4	2	N
Low Exposure; Disclosed	2	4	Y
Low Exposure; Not Disclosed	2	4	N

4.2. Ethics

The experiment was approved by our university’s institutional review board.

4.3. Recruitment

We recruited volunteer participants through three sources: (1) Facebook ads, (2) Reddit, and (3) the weekly newsletter of Creative Mornings (https://creativemornings.com/), which is ‘the world’s largest face-to-face creative community’. The first author contacted Creative Mornings, who agreed to include the experiment in the newsletter. Creative Mornings is an organization geared towards creative professionals that organizes events such as talks and meetups. We ensured all participants were above 18. While we did not offer monetary compensation, we offered to give participants information about themselves, such as their creativity relative to both humans and AI and their ability to spot creative ideas. Providing information to participants about themselves is often effective for recruiting volunteer participants since it makes the task intrinsically rewarding (Reinecke and Gajos, 2015). Appendix F describes the information we provided to participants.

We recruited volunteer participants instead of crowdsourced workers for several reasons. First, we wanted participants to be intrinsically motivated since (1) many theories suggest intrinsic motivation helps creativity (Mumford and Hemlin, 2017) and (2) we did not want low-quality engagement to confound results (especially since ideas propagate forward). Second, we were interested in an international sample. Because we did not pay participants, we did not need to collect any personally identifiable information. Each user was assigned a random identifier. The experiment being anonymous created a lower barrier to recruiting international participants since GDPR was not operative. Finally, we recruited participants in a targeted manner. In particular, we wanted to generalize this experiment to two key groups: individuals who have a demonstrated interest in technology and those who have a demonstrated interest in creativity. These groups are most relevant to the phenomena in question. To this end, we reached technology-oriented users by posting the experiment in the following subreddits: r/InternetIsBeautiful, r/chatgpt, r/singularity, and r/artificial. We reached creativity-oriented users by posting the experiment in r/writing, r/poetry, and the Creative Mornings newsletter. We also used several ‘neutral’ sources to test the experiment: r/samplesize and Facebook ads. If a participant completed the experiment, then the participant was given a shareable link to their results so they could spread the study.

4.4. Experiment Procedure

Once participants clicked on our link, they were taken to a landing page that included a consent form, task description, and pre-treatment questions.

4.4.1. Study Description

The description read as follows:

What you will do:
We’ll show you 5 common items, and you’ll come up with creative uses for each item. To spark your imagination, you’ll see ideas from prior participants and even from AI (i.e., ChatGPT). You’ll be asked to rank these ideas in order of creativity. The ideas you write may be anonymously shown to future participants to spark their imagination. The study takes 3-6 minutes to complete. The goal is to learn about how humans and AI brainstorm.

What you will learn:

• How creative you are compared to other humans

• How creative you are compared to AI

• How well you can rank creative ideas

We will give you a shareable link with results at the end.

See Appendix F for more details on how each of these three pieces of information was calculated.

4.4.2. Pre-Treatment Questions

Participants were asked several pre-treatment questions:

(1) (required) A slider ranging from 0 to 100 that says ‘I am more creative than X% of AI’

(2) (required) A slider ranging from 0 to 100 that says ‘I am more creative than X% of Humans’

(3) (required) ‘Artificial intelligence computer programs are designed to learn tasks that humans typically do. Would you say the increased use of artificial intelligence computer programs in daily life makes you feel…’ [‘More concerned than excited’, ‘More excited than concerned’, ‘Equally excited and concerned’]

(4) (optional) What country are you from?

(5) (optional) What is your age?

(6) (optional) What is your gender?

The third question was from Pew (Nadeem, 2023, 2022). Our gender question was based on guidance from Spiel et al. (2019). We chose the Pew question instead of a longer battery of questions about AI to minimize the response burden. See Appendix C for more details about these questions.

4.4.3. Randomization

Participants were assigned a sequence of 5 trials, where each trial was a {[condition], item} pair. For example, one trial might be a creative idea for pants in the [High Exposure, Disclosed] condition. We mapped each AUT item (pants, tire, shoe, bottle, table) to one of the five conditions such that neither conditions nor items repeated in a 5-item sequence. See Figure 3 for a visual explanation.
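One simple way to obtain a sequence in which neither items nor conditions repeat is to shuffle both lists independently and pair them position by position. The sketch below illustrates that idea only; the function name `assign_trials` is hypothetical, and the actual assignment scheme (Figure 3) may balance pairings differently.

```python
import random

ITEMS = ["pants", "tire", "shoe", "bottle", "table"]
CONDITIONS = [
    "Control",
    "High Exposure, Disclosed",
    "High Exposure, Undisclosed",
    "Low Exposure, Disclosed",
    "Low Exposure, Undisclosed",
]


def assign_trials(rng):
    """Return five (condition, item) pairs with no repeated condition or item.

    Assumption: independent shuffles are one way to satisfy the constraint;
    this is not necessarily the procedure used in the experiment.
    """
    items = ITEMS[:]
    conditions = CONDITIONS[:]
    rng.shuffle(items)
    rng.shuffle(conditions)
    # Pairing position by position uses every item and condition exactly once.
    return list(zip(conditions, items))


if __name__ == "__main__":
    for condition, item in assign_trials(random.Random(0)):
        print(f"[{condition}] -> {item}")
```

Because each list is a permutation, every participant who finishes all five trials contributes exactly one response to each condition and one response about each item.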

4.4.4. Task Instructions

For each trial, participants were asked to first rank a list of example ideas in order of creativity and then submit their own idea:

Task
For this task, you will submit a creative use for a [ITEM]. But before submitting your idea, here are some ideas for inspiration. Rank them by creativity.

Rank Previous Ideas

• Rank these ideas in order of creativity, with the most creative use on top. Drag ideas to rank them.

• We’ll show you how your rankings compare to rankings from a highly accurate model.

[SORTABLE EXAMPLE IDEAS HERE]

Submit Your Idea
Your turn! What is a creative use for a [ITEM]? The goal is to come up with a creative idea, which is an idea that strikes people as clever, unusual, interesting, uncommon, humorous, innovative, or different. List a creative use for a [ITEM].

See Appendix M for screenshots. We asked participants to rank ideas to ensure that they would engage with the example ideas. (We did not use these rankings as a dependent variable: because a participant ranks only the examples they are shown, and all examples are from the same condition, these ranks could not speak to between-condition differences, which are the focus of the paper.) Depending on the condition, (1) either there were or were not AI ideas in this example set (exposure); (2) AI ideas were or were not labeled (disclosure). We use the same prompt for humans (the text under Submit Your Idea) as with ChatGPT (Stevenson et al., 2022), but with a slight modification to request a single idea. This prompt contains language consistent with best practices in divergent thinking assessment (Beaty et al., 2021). After submitting an idea, participants received feedback on their idea’s uniqueness and how accurately they ranked the example ideas (Appendix M).

4.4.5. Response Chains

Logic

The human ideas that participants saw came from prior participants in the same {[condition], item} combination. See Figure 4. For instance, if a user was placed in the [Control] condition for a tire, that user would see six human ideas—the most recent six ideas for a tire under the [Control] condition. In order to avoid overfitting to a specific idea sequence, we reset this ‘response chain’ every 20 trials. So, the first 20 participants in the {[Control], tire} combination would see each other’s ideas, but the chain would reset for the 21st respondent. We use the logic described in this paragraph and Figure 4 for the human ideas in all conditions.

Note that because human ideas are propagated at the {[condition], item} level, the human ideas in the [Control] condition are ‘clean’ from AI contamination. They were brainstormed after seeing sets of human-only ideas, also from the [Control] condition.

We ran seven response chains for each of the 25 (5 items x 5 conditions) combinations, corresponding to 175 response chains in all and 3500 targeted responses $(175\text{ response chains }\times 20\text{ trials per chain})$ .

Human Seeds

Of course, there is a bootstrapping problem—what human ideas does the first person in the {[Control], tire} condition see? The seeds for each {[condition], item} combination came from prior responses from the Organisciak Dataset. That is, Participant 1 for a {[Control], tire} response chain would see 6 seed items. Then Participant 2 in the same response chain would see 5 seed items plus Participant 1’s idea (the order of ideas is randomized). Participant 3 would see 4 seed items plus Participant 1 and Participant 2’s ideas, etc. We chose a random sample of seeds for each {[condition], item} combination from the Organisciak Dataset. The dataset labeled ideas with gold-standard human ratings of originality. We conducted an ANOVA and found no significant condition-level difference in the originality of the seeds we used.
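The seeding and feed-forward logic can be sketched as a fixed-size sliding window over the most recent human ideas. This is a minimal illustration, not the experiment's code: the function name `simulate_chain` is hypothetical, and the choice of which seed is displaced first (here, the first one in the list) is an assumption, since the text only specifies how many seeds remain.

```python
import random
from collections import deque


def simulate_chain(seeds, submissions, n_human=6, rng=None):
    """Return, per participant, the human examples shown and the seed count.

    n_human is the number of human example slots (6 in the Control condition;
    Table 1 lists 2 or 4 in the AI-exposure conditions).
    """
    rng = rng or random.Random(0)
    # Each entry is (idea_text, is_seed); maxlen drops the oldest entry.
    window = deque([(s, True) for s in seeds[:n_human]], maxlen=n_human)
    history = []
    for idea in submissions:
        shown = list(window)
        rng.shuffle(shown)  # display order is randomized
        history.append({
            "shown": [text for text, _ in shown],
            "n_seeds": sum(is_seed for _, is_seed in shown),
        })
        window.append((idea, False))  # the new idea feeds forward
    return history


if __name__ == "__main__":
    seeds = [f"seed {i}" for i in range(1, 7)]
    ideas = [f"participant {i} idea" for i in range(1, 9)]
    for n, step in enumerate(simulate_chain(seeds, ideas), start=1):
        print(n, step["n_seeds"], step["shown"])
```

In this sketch the seed count falls by one per participant until the window holds only participant ideas; resetting a chain after 20 trials would correspond to calling the function again with a fresh set of seeds.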

5. Recruited Participants

Table 2. Summary Statistics of Experiment

Statistic	Value
Unique Countries	48
Total Responses	3414
Unique Participants	844
Avg Responses/Participant	4.05
Avg Duration/Response	144.31

Table 3. Sources of participants and trials. For analysis, we categorized each source into a higher-level interest group (technology, creativity, neutral).

Interest Group	Source	Participants (N, % of total)	Trials (N, % of total)
creative	Creative Mornings newsletter	343 (40.6%)	1470 (43.1%)
technology	r/InternetIsBeautiful	298 (35.3%)	1115 (32.7%)
neutral	r/samplesize	94 (11.1%)	389 (11.4%)
neutral	share	61 (7.2%)	250 (7.3%)
technology	r/chatgpt	19 (2.3%)	79 (2.3%)
creative	r/writing	7 (0.8%)	30 (0.9%)
neutral	other	6 (0.7%)	22 (0.6%)
technology	r/singularity	6 (0.7%)	13 (0.4%)
technology	r/artificial	5 (0.6%)	24 (0.7%)
creative	r/poetry	3 (0.4%)	15 (0.4%)
neutral	facebook	2 (0.2%)	7 (0.2%)

We received over 3000 responses from 48 countries. See Appendix G for sample characteristics. Out of a total of five trials, participants finished four trials on average (Table 2), suggesting the experiment was engaging. Most participants came from the Creative Mornings newsletter or r/InternetIsBeautiful (Table 3 for source counts and categorization). The sample was 50% women, 43% men, 4% non-binary, 3% not disclosed, 1% self-described. The mean age was 34.92 (SD = 10.86). Regarding AI, the sample was 48% neutral, 28% excited, 24% concerned. Participants said they were more creative than 57.86% (SD = 26.66) of AI and 58.67% (SD = 23.65) of humans. See Appendix Figure 11 for kernel density plots. Users from neutral interest groups who were concerned about AI tended to have low self-reported creativity.

6. Outcome Measures

We have three outcome measures (idea diversity, creativity, and AI adoption) and three levels of analysis (local, evolution, and global). See Table 4. The local level measures outcomes at the level of an individual trial (e.g., how a submitted response relates to example responses). The evolution level measures the rate of change of outcome variables with respect to the trial number in the response chain (i.e., experiment iteration). The global level compares all submitted responses in a condition to each other. For all pairwise comparisons, we use a Holm-Bonferroni adjustment for multiple comparisons. For idea diversity and AI adoption, we scale the dependent variable (cosine distance or cosine similarity, respectively) by 100 for easier interpretation.

Table 4. We measure three outcome measures (idea diversity, creativity, AI adoption) and three levels of analysis (local, global, and evolution). If a level of analysis is not appropriate for an outcome measure, we put a ‘Not applicable’ in that cell. All ideas are embedded using SBERT.

	Local	Global	Evolution
Creativity	How creative is the submitted response? This is measured by the prediction of the classifier from Organisciak et al. (2022).	Not applicable	Does the creativity of submitted ideas change over time? This is measured by the slope of the response chain’s trial number (i.e., iteration in the response chain) on creativity (the metric from Organisciak et al. (2022)).
Idea Diversity	How different is a participant’s response from example responses? This is measured by the median pairwise semantic distance between ideas a participant sees and their response.	How diverse were all the participant’s ideas in a condition? This is measured by the median pairwise distance between all submitted ideas in a condition.	Do ideas become more different from each other as the experiment goes on? We first measure the median pairwise distance (‘idea diversity’) of ideas at each trial number (i.e., iteration in the response chain). We then measure the slope of the trial number on idea diversity.
AI Adoption	How similar is a participant’s response to AI example responses? This is measured by the maximum semantic similarity between a participant’s response and AI examples.	Not applicable	Not applicable
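The two embedding-based local measures in Table 4 can be sketched as follows. The example assumes ideas have already been embedded (the paper uses SBERT); the short vectors here are toy stand-ins, the function names are hypothetical, and "median pairwise distance" is interpreted as the median distance from the response to each example shown.

```python
import math
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def local_diversity(response, examples):
    """Median cosine distance from the response to each example, scaled by 100."""
    distances = [1 - cosine_similarity(response, e) for e in examples]
    return 100 * median(distances)


def ai_adoption(response, ai_examples):
    """Maximum cosine similarity between the response and any AI example, scaled by 100."""
    return 100 * max(cosine_similarity(response, e) for e in ai_examples)


if __name__ == "__main__":
    response = [1.0, 0.0]                            # toy embedding of a submitted idea
    examples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings of examples shown
    ai_examples = examples[1:]                       # pretend the last two came from the AI
    print(round(local_diversity(response, examples), 2))
    print(round(ai_adoption(response, ai_examples), 2))
```

Taking the maximum for adoption means a response counts as close to the AI examples if it resembles any one of them, which is why the measure is expected to rise mechanically when more AI examples are shown.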

6.1. Local Level

Outcomes at the local level (the level of an individual trial) are useful for two reasons. First, this level shows how a participant’s response relates to the examples they see. Second, this level lets us model whether individual differences moderate the effect of either exposure or disclosure. For each of our local outcomes, we have a baseline model that uses crossed random intercepts to account for the multilevel structure of the experiment. The first random intercept is for participants, accounting for clustering due to repeated measures. This random intercept is then crossed with a second random intercept for response chains, which we nest inside of {[condition], item} combinations. (In R syntax, the random effect structure was ... + (1|ParticipantID) + (1|ItemCondition/ResponseChainID); see Figure 4 for a visual explanation of how response chains are nested in items and conditions.) Models were fit with the lme4 R package. We computed profile likelihood confidence intervals for coefficients using R’s confint function. We used estimated marginal means (emmeans R package) to conduct model-adjusted F-tests, linear contrasts, predictions, and pairwise comparisons. We apply Holm-Bonferroni adjustments to pairwise comparison p-values. Our baseline ‘local’ model is:

$$\begin{aligned}\text{variable}_{ijk}&=\beta_{0}+\beta_{1}\text{condition}_{j}+\beta_{2}\text{CreativityHuman}_{i}+\beta_{3}\text{AiRelCreate}_{i}\\&\quad+\beta_{4}\text{AiFeeling}_{i}+\beta_{5}\text{InterestGroup}_{i}+\beta_{6}\text{ConditionOrder}_{ijk}\\&\quad+\beta_{7}\text{LogDuration}_{ijk}+\beta_{8}\text{nSeedsPresent}_{ijk}+\beta_{9}\text{TrialNo}_{jk}\\&\quad+u_{0i}+v_{0jk}+e_{ijk}\end{aligned}$$

where

• $i$ indexes participants, $j$ indexes item-condition combinations, $k$ indexes response chains.

• CreativityHuman is self-perceived creativity relative to other humans.

• AiRelCreate is constructed as (self-perceived creativity relative to humans) - (self-perceived creativity relative to AI). Note that this is an implicit measure of AI’s creativity relative to humans. For example, if you say you are more creative than 40% of humans and 60% of AI, then AiRelCreate = -20, as the implicit belief is that AI is less creative (-20 percentile points) than humans. Conversely, if you say you are more creative than 30% of AI but 50% of humans, then the implicit belief is that AI is more creative than humans (50% - 30% = +20).

• AiFeeling refers to the AI sentiment question.

• InterestGroup maps each source of the experiment to categories: creative, neutral, or technology. These categories are described in Table 3.

• ConditionOrder denotes the position of the trial within the participant’s own sequence of trials (e.g., the participant’s 1st trial).

• LogDuration is the natural logarithm of the time (in seconds) a participant spent before submitting their answer.

• nSeedsPresent controls for the number of examples the participant saw that were seed ideas from the Organisciak Dataset.

• TrialNo indicates the trial number within a specific response chain. For example: the 18th response for {[Control], tire, response chain 5}
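As a small worked example of the AiRelCreate construction, the sketch below computes the difference between the two slider responses. The function name `ai_rel_create` is hypothetical; the sign convention follows the definition above, with positive values implying a belief that AI is more creative than humans (Section 7.3.1).

```python
def ai_rel_create(pct_humans_beaten, pct_ai_beaten):
    """AiRelCreate = (self-rated percentile vs. humans) - (self-rated percentile vs. AI).

    A positive value means the participant thinks AI is harder to beat than
    humans, i.e., an implicit belief that AI is more creative than humans.
    """
    return pct_humans_beaten - pct_ai_beaten


if __name__ == "__main__":
    # More creative than 40% of humans and 60% of AI: AI is implicitly seen as less creative.
    print(ai_rel_create(40, 60))
    # More creative than 50% of humans but only 30% of AI: AI is implicitly seen as more creative.
    print(ai_rel_create(50, 30))
```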

We balanced our interest in testing whether condition effects differed by subgroup against the risk of overfitting the model. We considered interactions between the treatment condition and four potential moderators: self-perceived human creativity ($CreativityHuman$), AI creativity relative to humans ($AiRelCreate$), feeling towards AI ($AiFeeling$), and interest group ($InterestGroup$). We first conducted likelihood ratio tests to test if adding each moderator improved our baseline model. Moderators were kept only if they significantly improved the fit $(p<0.05)$ . See Appendix Table 11 for retained moderators. Then, we used emmeans to probe and interpret moderating effects.

6.2. Global

Intuitively, the global diversity of ideas in a condition measures how similar or different submitted ideas in a condition tend to be. The relevant level of aggregation here is all of the submitted ideas at a {[condition], item} level. For example, consider the total set of ideas participants submitted for a tire in the [Control] condition. Is this set of ideas more diverse from each other than the set of submitted ideas for a tire in the [High Exposure, Disclosed] condition?

We used a Monte Carlo procedure and permutation tests to assess if conditions differed with respect to these metrics. For 50 Monte Carlo runs, for each {[condition], item} combination, we randomly sampled 50 ideas and computed idea diversity metrics. We then conducted pairwise paired (at the level of Monte Carlo seeds and items) permutation tests with 10,000 iterations to see if the two conditions differed on these metrics. As a non-parametric measure of effect size, we also calculate Cliff’s Delta $(\delta)$ , which ranges from -1 to 1. A value of 0 indicates no difference between the two conditions, +1 indicates values from the first condition are always larger, and -1 indicates the opposite. See Appendix I.1 for more details.
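The two statistics named above can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation (see Appendix I.1 for that): the permutation test here is a two-sided sign-flip test on paired differences of means, and the function names and toy numbers are hypothetical.

```python
import random


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs; ranges from -1 to 1."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))


def paired_permutation_test(xs, ys, n_iter=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    xs[i] and ys[i] must come from the same pairing unit (for example, the
    same Monte Carlo run and item in two different conditions).
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_iter):
        # Under the null, the sign of each paired difference is exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)  # add-one correction avoids p = 0


if __name__ == "__main__":
    # Toy diversity scores for two conditions across paired samples.
    high_exposure = [82.1, 80.4, 83.0, 81.7, 82.6, 80.9]
    control = [80.2, 79.8, 81.1, 80.0, 80.7, 79.5]
    print(cliffs_delta(high_exposure, control))
    print(paired_permutation_test(high_exposure, control))
```

Cliff's delta depends only on the ordering of values across the two groups, so it is unaffected by the scaling of the diversity metric.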

6.3. Evolution

6.3.1. Creativity

To test if conditions differed in their evolution of creativity, we conducted a likelihood ratio test on whether an interaction between condition and TrialNo significantly improved the fit of the local creativity model.

6.3.2. Idea Diversity

Intuitively, we are interested in whether, as the experiment goes on, the ideas that participants submit tend to become more or less similar to each other. We use the trial number in a response chain to index time in the experiment. For example, is the set of submitted responses at trial number 4 more or less similar to each other than the set of submitted responses at trial number 20? Here, the diversity of interest is not between a submitted response and example responses but between all submitted responses at a given ‘time point’ (i.e., trial number). The question is whether the diversity increases or decreases as the experiment goes on and whether this rate of change differs by condition. The mechanics of our process are as follows (see Appendix I.3 for more details).

(1) We first ‘pooled’ together all ideas at the {[condition], item, trial number} level, across response chains. For example, consider all ideas for a tire for the [Control] condition that were the fourth response in a response chain. We refer to this set as a ‘pool’ of ideas.

(2) We next computed idea diversity measures for each pool of ideas, where idea pools were defined in (1). We use the same metrics that we measure at a local level for idea diversity. Median pairwise distance is our main measure. We conduct robustness checks using mean pairwise distance and mean distance from the centroid. Each metric shows qualitatively similar results.

(3) We then fit a mixed model (items as random intercepts) to test if the slope of trial number on idea diversity differed by condition. That is: Are submitted responses in some conditions changing at a faster rate?
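The three steps above can be sketched for a single {[condition], item} combination. This is a simplified illustration: it assumes ideas are already embedded, uses toy two-dimensional vectors, and substitutes an ordinary least-squares slope for the mixed model with item random intercepts used in the paper. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


def pool_diversity(embeddings):
    """Step 2: median pairwise cosine distance within one pool, scaled by 100."""
    distances = [cosine_distance(u, v) for u, v in combinations(embeddings, 2)]
    return 100 * median(distances)


def ols_slope(xs, ys):
    """Step 3 (simplified): least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Step 1: pools keyed by trial number; each pool holds one idea per response chain.
    pools = {
        1: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        2: [[1.0, 0.2], [0.3, 1.0], [1.0, 1.0]],
        3: [[1.0, 0.6], [0.7, 1.0], [1.0, 1.0]],
    }
    trial_numbers = sorted(pools)
    diversity = [pool_diversity(pools[t]) for t in trial_numbers]
    print(diversity)
    print(ols_slope(trial_numbers, diversity))
```

A negative slope in such a fit corresponds to ideas converging over the course of the chain, and a positive slope to ideas spreading out.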

7. Results

7.1. Creativity

We found no effect of conditions on creativity. Average individual creativity did not significantly differ by condition $(F(4,19.86)=0.12,p=0.97)$ and no condition coefficient differed from zero in our regression (Appendix Table 19). Hence, we conclude that neither AI exposure nor AI disclosure affected individual creativity. Additionally, we tested for whether the evolution of creativity differed by condition via a likelihood ratio test on whether interacting trial number and experimental condition would improve the model fit. The likelihood ratio test indicated that allowing for these interactions did not significantly improve the model fit $(\chi^{2}(4)=6.52,p=0.16)$. (When adding this interaction, there was a small, negative interaction effect ( $\beta=-0.015$ , $t(3248)=-2.21$ , 95% CI = $[-0.03,-0.002]$ ) between trial number and the [High Exposure, Disclosed] condition. However, given (1) the size of the interaction, (2) no overall differences, and (3) a null likelihood ratio test, we do not interpret this interaction.) In short, we do not find enough evidence to conclude creativity was affected by experimental conditions.

7.2. Idea Diversity

7.2.1. Local Level

Intuitively, local idea diversity is how different a response is from the examples a participant sees. There was no main effect of condition $(F(4,19.95)=0.09,p=0.98)$ , and the effect of self-perceived creativity did not differ by condition $(F(4,2650.44)=1.59,p=0.18)$ . But the effect of belief in AI’s relative creativity did differ by condition, $F(4,2635.80)=2.93,p=0.02$ . As robustness checks, we ran the same specification with two alternative measures of idea diversity, mean pairwise distance and distance from the centroid. Regression results are broadly similar. However, post-hoc estimated marginal means showed non-significant contrasts, so we refrain from interpreting this finding. See Appendix I.2 for a more in-depth discussion, regression results, and pairwise comparisons.

7.2.2. Global

By measuring global idea diversity, we capture how different the submitted ideas in a condition are from one another. This can be thought of as a measure of collective idea diversity. See Appendix I.1 for more details on the procedure. Across a range of different metrics, high AI exposure conditions had more global idea diversity than the control condition (Figure 5; Appendix Tables 12, 13, 14). The median pairwise distance provides the most conservative estimate of the metrics that we measured. But even for median pairwise distance, both the [High Exposure, Disclosed] (Cliff’s $\delta=0.31$ on a scale of -1 to 1) and [High Exposure, Undisclosed] $(\delta=0.26)$ conditions had more idea diversity than the control condition. But of the low exposure conditions, only the [Low Exposure, Undisclosed] $(\delta=0.11)$ condition had higher global diversity than the control condition, with a much smaller effect size than the high exposure conditions. Hence, high AI exposure (but not necessarily low AI exposure) increases global idea diversity.

7.2.3. Evolution

By measuring the evolution of idea diversity, we capture the rate of change in idea diversity across trials. See Appendix I.3 for more details on the procedure. Relative to the control condition, the conditions with high exposure to AI ideas (but not low exposure to AI) had increased rates of change in idea diversity. See Figure 6 for estimated marginal means predictions and Appendix Table 18 for regression results. As with global idea diversity, different metrics yielded similar regression coefficients. In the control condition, idea diversity decreased over trials ( $\beta=-0.39$ , $t(349)=-2.23$ , 95% CI = $[-0.73,-0.05]$ , $p=0.03$ ). That is, submitted ideas were becoming more similar to each other as the experiment went on. Relative to the control condition, however, the slope of idea diversity with respect to trial number was more positive for the [High Exposure, Undisclosed] condition ( $\beta=0.53$ , $t(349)=2.2$ , 95% CI = $[0.06,0.99]$ , $p=0.03$ ) and the [High Exposure, Disclosed] condition ( $\beta=0.57$ , $t(349)=2.37$ , 95% CI = $[0.1,1.03]$ , $p=0.02$ ). The rate of change in idea diversity for the low AI exposure conditions did not differ from the rate of change in the control condition. Thus, we conclude that high exposure to AI ideas increased the rate of change in idea diversity relative to the no-AI control condition.
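As a reading aid, the implied per-trial slopes in the high-exposure conditions can be recovered by adding each interaction coefficient to the control slope. This sketch assumes the reported condition coefficients are coded relative to the control slope, as the phrasing above suggests; it combines point estimates only and says nothing about their joint uncertainty.

$$\beta_{\text{High, Undisclosed}}\approx-0.39+0.53=0.14,\qquad\beta_{\text{High, Disclosed}}\approx-0.39+0.57=0.18$$

Under that assumption, idea diversity drifts downward across trials in the control condition but slightly upward in both high-exposure conditions.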

7.3. AI Adoption

7.3.1. Local Level

At the local level, we measured AI adoption by the maximum cosine similarity between a participant’s response and AI examples the participant saw. There was a main effect of condition $(F(3,16.59)=4.33,p=0.02)$ . But we would expect higher similarity to AI ideas in the high-exposure condition even by chance (since there are more AI ideas), so we do not interpret main effects and instead focus on subgroup differences and effects of disclosure in the high-exposure condition. We found that the effect of conditions did not differ by interest groups $(F(6,719.77)=1.98,p=0.07)$ , but the effect of conditions did differ by self-perceived creativity $(F(3,1984.95)=5.18,p=0.001)$ and relative AI creativity $(F(3,1974.58)=2.9,p=0.03)$ . As robustness checks, we ran the same specification with two alternative measures of AI adoption, mean and median AI adoption. The coefficients of our regression are broadly similar. See Appendix K for regression results and post-hoc contrasts.

Exposure to AI ideas increased adoption for (self-perceived) high-creativity participants regardless of disclosure, but this was not the case for (self-perceived) low-creativity participants. There was a significant interaction between self-perceived human creativity and the [High Exposure, Disclosed] condition ( $\beta=0.11$ , $t(2588)=3.93$ , 95% CI = $[0.06,0.17],p=0.0001$ ; Appendix Table 23). To probe this interaction, we used our model to predict AI adoption by condition for both the top 10% and bottom 10% of participants by self-perceived creativity (Figure 9). For high-creativity participants, adoption rates appear to differ only by exposure (color) and not disclosure (shape). More formally, we tested whether the effect of exposure on adoption is larger when AI ideas are disclosed vs undisclosed. We find that for high-creativity participants, there is no difference in adoption between ([High Exposure, Undisclosed] - [Low Exposure, Undisclosed]) and ([High Exposure, Disclosed] - [Low Exposure, Disclosed]), $\Delta=-1.69,d=-0.14,p=0.59$ . That is, the effect of exposure is not moderated by disclosure. But for low-creativity participants, the difference in adoption for the undisclosed conditions ([High Exposure, Undisclosed] - [Low Exposure, Undisclosed]) was larger than the equivalent difference in adoption for disclosed conditions ([High Exposure, Disclosed] - [Low Exposure, Disclosed]), $\Delta=7.77,d=0.65,p=0.03$ . That is, disclosing ideas as from AI reduced the effect of exposure on adoption for lower (self-reported) creativity participants. In summary, people with higher (self-reported) creativity adopt AI ideas based on content alone, regardless of disclosure.

We also found that one’s attitude about AI’s creativity affected adoption, though this had a smaller effect than self-perceived creativity. There was a significant interaction between the [High Exposure, Undisclosed] condition and relative AI creativity (positive values imply AI is more creative than humans), $\beta=-0.07$ , $t(2588)=-2.61$ , 95% CI = $[-0.13,-0.02],p=0.01$ . We used estimated marginal means to probe this interaction by predicting AI adoption for the top and bottom decile of participants by belief in relative AI creativity. We found that in the [High Exposure, Undisclosed] condition, people who believed AI was uncreative (bottom decile of AiRelCreate) were slightly more likely to adopt AI ideas than people who believed AI was creative (top decile of AiRelCreate), $\Delta=4.66,d=0.39,p=0.005$ . But no such difference existed in the [High Exposure, Disclosed] condition. This may suggest labeling sources as AI neutralizes adoption among users who do not think AI is creative.

In addition to who adopts AI ideas, we also measured when AI ideas are adopted. We found that people adopt AI ideas for difficult prompts rather than easier prompts (Figure 10). To measure the ‘difficulty’ of an item prompt, we first calculated the mean creativity of a response to an item in the control condition. Then we reverse-ranked items such that high mean creativity implies low difficulty and vice versa. We measured ‘AI adoption’ of an item by the average of the trial-level maximum similarity to AI examples. We then examined the rank-order correlation between item difficulty and AI adoption in high-exposure conditions. If task difficulty leads people to rely on AI, then we should see a larger correlation between item difficulty and AI adoption in the [High Exposure, Disclosed] condition than in the [High Exposure, Undisclosed] condition. That is what we find. The rank-order correlation between difficulty and adoption was $\rho=0.8$ for the [High Exposure, Disclosed] condition but only $\rho=0.3$ for the [High Exposure, Undisclosed] condition. That is, when people were told ideas were from AI, they were more likely to adopt AI ideas if the prompt was difficult. Since we employed only five items, we view this finding as speculative; future work should test this relationship with a larger number of stimuli.
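The difficulty ranking and rank-order correlation described above can be sketched as follows. The per-item numbers are invented purely for illustration and are not the study's data; the function names are hypothetical, and the Spearman formula used assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        out[index] = rank
    return out


def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    items = ["pants", "tire", "shoe", "bottle", "table"]
    # Hypothetical mean creativity per item in the control condition.
    mean_creativity = [3.1, 2.4, 2.9, 2.0, 2.7]
    # Reverse ranking: higher mean creativity implies lower difficulty.
    difficulty = [-c for c in mean_creativity]
    # Hypothetical mean of trial-level maximum similarity to AI examples.
    adoption = [40.0, 55.0, 48.0, 60.0, 45.0]
    print(spearman_rho(difficulty, adoption))
```

With only five items, the correlation can take only a small set of discrete values, which is one reason the estimate should be treated as speculative.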

8. Discussion

Against the backdrop of a massive increase in LLM exposure, we asked: How does exposure to ideas generated by LLMs affect the creativity, diversity, and evolution of human ideas? To answer this, we conducted a large-scale experiment where participants submitted ideas in response to the Alternate Uses Task (a measure of creativity where people brainstorm novel uses of an item) after seeing a set of example ideas. The examples were from prior participants in the same experimental condition or—in some conditions—ChatGPT. The evolving aspect of our experiment, that ideas in a condition feed forward to subsequent trials in that condition, captures the interdependent nature of idea formation and lets us model the evolutionary effects of having AI ‘in the culture loop’. Here are three takeaways from our experiment.

8.1. AI makes ideas different but not better.

Most notably, exposure to AI ideas did not, on average, make human ideas any ‘better’ or ‘worse’ (by creativity). Our high-powered, null finding around average creativity by condition can inform debates about the effect of AI ideas on individual human creativity. Maybe there is little effect. Of course, our experiment is measuring just a single task. But these results suggest that perhaps both worry and optimism around the effect of AI ideas on individual human creativity should be tempered.

Our null finding around creativity contrasts with some prior work suggesting human-AI co-creation enhances the quality of creative outputs (Mizrahi et al., 2020; Yuan et al., 2022; Roemmele, 2021; Hitsuwari et al., 2022). But our study differs from prior studies in its aim and design: We test passive exposure to off-the-shelf LLMs—not active engagement with optimized-for-creativity AI aides. The latter is useful for understanding how AI could affect creativity. But we aim to approximate how ordinary, existing, and pervasive AI tools do affect the creativity of ideas. At least for this task, we find no evidence of such an effect.

On the other hand, the presence of AI ideas increased the diversity of human ideas. This is consistent with work suggesting collaborating with AI leads to more diverse or unexpected outputs (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019) and inconsistent with other work that finds collaborating with LLMs decreases diversity (Padmakumar and He, 2024; Doshi and Hauser, 2024; Dell’Acqua et al., 2023). But we highlight that our study is testing passive exposure to AI ideas and not active engagement with AI ideas. Our setup—passive exposure to AI ideas, scattered amongst human ones—maps onto how many users now experience AI ideas. Hence, it may be that active engagement with LLMs decreases content diversity but simply seeing these ideas as ‘sparks’ (Gero et al., 2022) increases content diversity. And because many more people may be passively exposed to LLM outputs than actively engaging with LLMs, the effect of passive exposure is important to understand.

Crucially, high AI exposure increased both average amounts of diversity and rates of change in idea diversity. The latter result is especially important. Small differences in rates of change can yield large aggregate differences over time. Future work—both simulations and dynamic experiments—can explore the implications of this increase in collective idea diversity unaccompanied by an increase in average individual creativity. For instance, can this dynamic generate ‘innovation’?

Our finding around the evolution of diversity (Figure 6) is instructive. Seeing other people’s ideas reduced idea diversity in the control condition over time. This may suggest that successive participants were converging on a particular idea sequence. But then injecting AI ideas into the example set increased the diversity of submitted responses by partially ‘resetting’ this convergence. Our finding relates to recent work proposing AI systems that generate ‘alien’ scientific hypotheses humans would not think of (Sourati and Evans, 2023). More generally, a promising avenue for future work: Can AI input reduce ‘groupthink’?

8.2. High creativity people are less influenced by the source label of ideas.

Participants who viewed themselves as highly creative had the same levels of adoption of AI ideas in both the disclosed and non-disclosed conditions. But for lower-creativity participants, knowing the source of an idea did affect the adoption of that idea. Perhaps people high in self-reported creativity relied less on source cues when adopting ideas because they were more confident in their ability to judge an idea’s creative merit. Future work can employ think-alouds to better understand how AI disclosure affects the idea-generation process itself. Regardless, our results suggest that (self-reported) creative people adopt ideas on the basis of their content; for them, knowing the source does not matter. In a world where humans have difficulty distinguishing whether content is human- or AI-generated (Jakesch et al., 2022), these findings suggest people high in (self-reported) creativity will not be ‘duped’ into adopting AI ideas.

8.3. Participants may adopt AI ideas when the task is difficult

When AI ideas were labeled, participants were more likely to adopt AI ideas for difficult prompts than for easy prompts. Although this finding should be taken as speculative (due to the small number of items), it is similar to what Roemmele (2021) observed, where seeing AI examples only influenced creative output when the task was difficult. Both our results and those of Roemmele (2021) are consistent with a theoretical account in which task difficulty is associated with increased reliance on automation (Goddard et al., 2014). Future work can further test whether users adopt AI ideas for more challenging creative tasks.

If users turn to AI for difficult rather than trivial tasks, this has several implications. On one hand, AI can augment human creativity where human imaginations falter. At the same time, researchers raised concerns over ‘model collapse’ (Shumailov et al., 2023)—the deteriorating performance of LLMs when trained on their outputs. If reliance on AI for creative tasks becomes routine, this may contribute to model collapse, ironically decreasing the efficacy of such reliance.

8.4. Conclusion: Passive exposure to AI ideas affects collective thought.

We conclude that passive exposure to AI ideas—the kind of passive exposure we are inundated with in a post-ChatGPT era—does affect collective thought. Even small effects are meaningful since this exposure is both pervasive and growing. But the effects of AI ideas are nuanced. Seeing AI ideas did not increase individual creativity, though it did increase collective diversity. The effects of AI ideas vary across individuals and tasks. There is still much to learn. We hope our study inspires more research on how passive exposure to AI ideas affects collective thought.

9. Limitations & Future Work

Our study has several limitations that can inform future work. First, we measured the effect of AI ideas for a single task. We chose this task because it is one of the most common creativity tasks (Abraham, 2016). But future work could explore if our results replicate for other kinds of tasks. Second, we had to operationalize ‘ChatGPT’ in some concrete way. The logic for our prompt was driven by ecological validity and prior work: We used a zero-shot prompt because that is what users would likely use, and the specific prompt we used was derived from prior research. We chose not to vary prompts in order not to further increase the complexity of an already complex experiment. Future work could explore if different prompts elicit different results. Another avenue for future work is only propagating the ‘best’ AI ideas forward. Third, future work should test if alternative classifiers or ways of conceiving variables yield different results. For idea diversity and AI adoption, we addressed this problem by showing that conceptually similar ways of measuring variables yielded qualitatively similar results. For our creativity measure, we used a highly accurate classifier (correlation with human judgments greater than $r=0.88$ for items we used) trained for this exact task, for these exact items. But of course, all models have some error and future work based on this model propagates these errors. Incidentally, human judges of creativity only correlate with other human judges at $r=0.88$ (Organisciak et al., 2023), suggesting the classifier we used may be approaching ‘the approximate ceiling at which we could expect a model to correlate with human judgments’ (Organisciak et al., 2023, pg. 11) of creativity. Nonetheless, future work can adopt a similar design but with human judgments or different measures. Fourth, our finding about AI adoption and task difficulty is based on five AUT items. Future work should explore this relationship with a larger number of stimuli. 
Fifth, we focus on one facet of creativity: originality. Future work can also explore whether AI ideas have different effects on other facets of creativity. We also did not necessarily measure idea “quality”, which is distinct from originality and diversity. Sixth, we employed a convenience sample of technology-interested users and creative professionals. While these two groups are most relevant to the phenomena in question, our sample also limits generalizability. Future work can explore these dynamics with different samples. Seventh, these dynamics may differ for future LLMs. Finally, we conducted this experiment close to the launch of ChatGPT. As AI becomes increasingly embedded in everyday life, attitudes towards AI and ways of engaging with AI may also change. Despite these limitations, our work offers the first large-scale, dynamic account of how ideas from LLMs affect collective thought.

References

Abraham (2016) Anna Abraham. 2016. Gender and creativity: an overview of psychological and neuroscientific literature. Brain Imaging and Behavior 10, 2 (June 2016), 609–618. https://doi.org/10.1007/s11682-015-9410-8
Anderson et al. (2024) Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. 2024. Homogenization Effects of Large Language Models on Human Creative Ideation. In Creativity and Cognition. ACM, Chicago IL USA, 413–425. https://doi.org/10.1145/3635636.3656204
Baten et al. (2021) Raiyan Abdul Baten, Richard N. Aslin, Gourab Ghoshal, and Ehsan Hoque. 2021. Cues to gender and racial identity reduce creativity in diverse social networks. Scientific Reports 11, 1 (May 2021), 10261. https://doi.org/10.1038/s41598-021-89498-5
Beaty and Johnson (2021) Roger E. Beaty and Dan R. Johnson. 2021. Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods 53, 2 (April 2021), 757–780. https://doi.org/10.3758/s13428-020-01453-w
Beaty et al. (2022) Roger E. Beaty, Dan R. Johnson, Daniel C. Zeitlen, and Boris Forthmann. 2022. Semantic Distance and the Alternate Uses Task: Recommendations for Reliable Automated Assessment of Originality. Creativity Research Journal 34, 3 (July 2022), 245–260. https://doi.org/10.1080/10400419.2022.2025720
Beaty et al. (2021) Roger E. Beaty, Daniel C. Zeitlen, Brendan S. Baker, and Yoed N. Kenett. 2021. Forward flow and creative thought: Assessing associative cognition and its role in divergent thinking. Thinking Skills and Creativity 41 (Sept. 2021), 100859. https://doi.org/10.1016/j.tsc.2021.100859
Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
Bond and Titus (1983) Charles F. Bond and Linda J. Titus. 1983. Social facilitation: A meta-analysis of 241 studies. Psychological Bulletin 94, 2 (1983), 265–292. https://doi.org/10.1037/0033-2909.94.2.265
Boyd and Richerson (1988) Robert Boyd and Peter J. Richerson. 1988. Culture and the Evolutionary Process. University of Chicago Press.
Boyd et al. (2011) Robert Boyd, Peter J. Richerson, and Joseph Henrich. 2011. The cultural niche: Why social learning is essential for human adaptation. Proceedings of the National Academy of Sciences 108, supplement_2 (June 2011), 10918–10925. https://doi.org/10.1073/pnas.1100290108
Branch et al. (2021) Boyd Branch, Piotr Mirowski, and Kory W. Mathewson. 2021. Collaborative Storytelling with Human Actors and AI Narrators. http://arxiv.org/abs/2109.14728
Brown and Paulus (2002) Vincent R. Brown and Paul B. Paulus. 2002. Making Group Brainstorming More Effective: Recommendations From an Associative Memory Perspective. Current Directions in Psychological Science 11, 6 (Dec. 2002), 208–212. https://doi.org/10.1111/1467-8721.00202
Calderwood et al. (2020) Alex Calderwood, Vivian Qiu, Katy Ilonka Gero, and Lydia B Chilton. 2020. How Novelists Use Generative Language Models: An Exploratory User Study. In HAI-GEN+ user2agent IUI.
Chiang (2023) Ted Chiang. 2023. ChatGPT Is a Blurry JPEG of the Web. The New Yorker (Feb. 2023). https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Dell’Acqua et al. (2023) Fabrizio Dell’Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. 2023. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. https://doi.org/10.2139/ssrn.4573321
Di Fede et al. (2022) Giulia Di Fede, Davide Rocchesso, Steven P. Dow, and Salvatore Andolina. 2022. The Idea Machine: LLM-based Expansion, Rewriting, Combination, and Suggestion of Ideas. In Proceedings of the 14th Conference on Creativity and Cognition (C&C ’22). Association for Computing Machinery, New York, NY, USA, 623–627. https://doi.org/10.1145/3527927.3535197
Doshi and Hauser (2024) Anil R. Doshi and Oliver P. Hauser. 2024. Generative artificial intelligence enhances creativity but reduces the diversity of novel content. https://doi.org/10.48550/arXiv.2312.00506 arXiv:2312.00506 [cs, econ, q-fin].
Dumas et al. (2021) Denis Dumas, Peter Organisciak, Shannon Maio, and Michael Doherty. 2021. Four Text-Mining Methods for Measuring Elaboration. The Journal of Creative Behavior 55, 2 (2021), 517–531. https://doi.org/10.1002/jocb.471
Eapen et al. (2023) Tojin T. Eapen, Daniel J. Finkenstadt, Josh Folk, and Lokesh Venkataswamy. 2023. How Generative AI Can Augment Human Creativity. Harvard Business Review (July 2023). https://hbr.org/2023/07/how-generative-ai-can-augment-human-creativity
Gero (2023) Katy Ilonka Gero. 2023. AI and the Writer: How Language Models Support Creative Writers. Ph.D. Columbia University, United States – New York. https://www.proquest.com/docview/2753687892/abstract/ACF7F21F1E274995PQ/1
Gero and Chilton (2019) Katy Ilonka Gero and Lydia B. Chilton. 2019. Metaphoria: An Algorithmic Companion for Metaphor Creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, Glasgow Scotland UK, 1–12. https://doi.org/10.1145/3290605.3300526
Gero et al. (2022) Katy Ilonka Gero, Vivian Liu, and Lydia Chilton. 2022. Sparks: Inspiration for Science Writing using Language Models. In Proceedings of the 2022 ACM Designing Interactive Systems Conference (DIS ’22). Association for Computing Machinery, New York, NY, USA, 1002–1019. https://doi.org/10.1145/3532106.3533533
Goddard et al. (2014) Kate Goddard, Abdul Roudsari, and Jeremy C. Wyatt. 2014. Automation bias: Empirical results assessing influencing factors. International Journal of Medical Informatics 83, 5 (May 2014), 368–375. https://doi.org/10.1016/j.ijmedinf.2014.01.001
Griffin (2024) Andrew Griffin. 2024. ChatGPT creators OpenAI are generating 100 billion words per day. https://www.independent.co.uk/tech/chatgpt-openai-words-sam-altman-b2494900.html
Guilford (1967) J.P. Guilford. 1967. The nature of human intelligence. McGraw-Hill, New York, NY, US.
Guilford (1978) Joy Paul Guilford. 1978. Alternate uses. Sheridan supply Company.
Guzik et al. (2023) Erik E. Guzik, Christian Byrge, and Christian Gilde. 2023. The originality of machines: AI takes the Torrance Test. Journal of Creativity 33, 3 (Dec. 2023), 100065. https://doi.org/10.1016/j.yjoc.2023.100065
Hancock et al. (2020) Jeffrey T Hancock, Mor Naaman, and Karen Levy. 2020. AI-Mediated Communication: Definition, Research Agenda, and Ethical Considerations. Journal of Computer-Mediated Communication 25, 1 (March 2020), 89–100. https://doi.org/10.1093/jcmc/zmz022
Hitsuwari et al. (2022) Jimpei Hitsuwari, Yoshiyuki Ueda, Woojin Yun, and Michio Nomura. 2022. Does human–AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior (Oct. 2022), 107502. https://doi.org/10.1016/j.chb.2022.107502
Hu (2023) Krystal Hu. 2023. ChatGPT sets record for fastest-growing user base - analyst note. Reuters (Feb. 2023). https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
Huang et al. (2020) Chieh-Yang Huang, Shih-Hong Huang, and Ting-Hao Kenneth Huang. 2020. Heteroglossia: In-Situ Story Ideation with the Crowd. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376715
Hwang and Won (2021) Angel Hsing-Chi Hwang and Andrea Stevenson Won. 2021. IdeaBot: Investigating Social Facilitation in Human-Machine Team Creativity. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–16. https://doi.org/10.1145/3411764.3445270
Jakesch et al. (2022) Maurice Jakesch, Jeffrey Hancock, and Mor Naaman. 2022. Human Heuristics for AI-Generated Language Are Flawed. https://doi.org/10.48550/arXiv.2206.07271
Jared Henderson (2022) Jared Henderson. 2022. ChatGPT Will Make You Less Creative. https://www.youtube.com/watch?v=1K8PiMNoR7A
Kirk et al. (2009) Ulrich Kirk, Martin Skov, Oliver Hulme, Mark S. Christensen, and Semir Zeki. 2009. Modulation of aesthetic value by semantic context: An fMRI study. NeuroImage 44, 3 (Feb. 2009), 1125–1132. https://doi.org/10.1016/j.neuroimage.2008.10.009
Köbis and Mossink (2021) Nils Köbis and Luca D. Mossink. 2021. Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior 114 (Jan. 2021), 106553. https://doi.org/10.1016/j.chb.2020.106553
Krish Naik (2023) Krish Naik. 2023. Will Chatgpt Kill Your Creativity? https://www.youtube.com/watch?v=0m2r9elReBY
Lee et al. (2022) Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–19. https://doi.org/10.1145/3491102.3502030
Mangalaseril (2023) Jasmine Mangalaseril. 2023. The Incredible Blandness of ChatGPT. https://cardamomaddict.substack.com/p/the-incredible-blandness-of-chatgpt
Miller (2019) Arthur I Miller. 2019. The Artist in the Machine: The World of AI-Powered Creativity. Cambridge: MIT Press. https://direct.mit.edu/books/book/4547/The-Artist-in-the-MachineThe-World-of-AI-Powered
Mirowski et al. (2023) Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. 2023. Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–34. https://doi.org/10.1145/3544548.3581225
Mizrahi et al. (2020) Moran Mizrahi, Stav Yardeni Seelig, and Dafna Shahaf. 2020. Coming to Terms: Automatic Formation of Neologisms in Hebrew. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4918–4929. https://doi.org/10.18653/v1/2020.findings-emnlp.442
Mosier et al. (1996) Kathleen L. Mosier, Linda J. Skitka, Mark D. Burdick, and Susan T. Heers. 1996. Automation Bias, Accountability, and Verification Behaviors. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 40, 4 (Oct. 1996), 204–208. https://doi.org/10.1177/154193129604000413
Mumford and Hemlin (2017) Michael D. Mumford and Sven Hemlin. 2017. Handbook of Research on Leadership and Creativity. Edward Elgar Publishing.
Nadeem (2022) Reem Nadeem. 2022. How Americans think about artificial intelligence. https://www.pewresearch.org/internet/2022/03/17/how-americans-think-about-artificial-intelligence/
Nadeem (2023) Reem Nadeem. 2023. Public Awareness of Artificial Intelligence in Everyday Activities. https://www.pewresearch.org/science/2023/02/15/public-awareness-of-artificial-intelligence-in-everyday-activities/
News (2023) Nation World News. 2023. Why Does ChatGPT Increase Creativity? https://nationworldnews.com/why-does-chatgpt-increase-creativity/
Nickerson and Sakamoto (2010) J. Nickerson and Yasuaki Sakamoto. 2010. Crowdsourcing Creativity: Combining Ideas in Networks. https://www.semanticscholar.org/paper/Crowdsourcing-Creativity%3A-Combining-Ideas-in-Nickerson-Sakamoto/340a7645d1402287e151e83981f8a4085227e317
Nijstad and Stroebe (2006) Bernard A. Nijstad and Wolfgang Stroebe. 2006. How the Group Affects the Mind: A Cognitive Model of Idea Generation in Groups. Personality and Social Psychology Review 10, 3 (Aug. 2006), 186–213. https://doi.org/10.1207/s15327957pspr1003_1
of Psychology ([n. d.]) American Psychological Association Dictionary of Psychology. [n. d.]. divergent thinking. https://dictionary.apa.org/divergent-thinking
Ojala and Garriga (2009) Markus Ojala and Gemma C. Garriga. 2009. Permutation Tests for Studying Classifier Performance. In 2009 Ninth IEEE International Conference on Data Mining. IEEE, Miami Beach, FL, USA, 908–913. https://doi.org/10.1109/ICDM.2009.108
Organisciak et al. (2022) Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. 2022. Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. Thinking Skills and Creativity 49, 101356. https://doi.org/10.1016/j.tsc.2023.101356
Organisciak et al. (2023) Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. 2023. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity 49 (2023), 101356. https://doi.org/10.1016/j.tsc.2023.101356
Osone et al. (2021) Hiroyuki Osone, Jun-Li Lu, and Yoichi Ochiai. 2021. BunCho: AI Supported Story Co-Creation via Unsupervised Multitask Learning to Increase Writers’ Creativity in Japanese. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–10. https://doi.org/10.1145/3411763.3450391
Padmakumar and He (2024) Vishakh Padmakumar and He He. 2024. Does Writing with Language Models Reduce Content Diversity? https://doi.org/10.48550/arXiv.2309.05196
Paulus and Brown (2007) Paul B. Paulus and Vincent R. Brown. 2007. Toward More Creative and Innovative Group Idea Generation: A Cognitive-Social-Motivational Perspective of Brainstorming. Social and Personality Psychology Compass 1, 1 (2007), 248–265. https://doi.org/10.1111/j.1751-9004.2007.00006.x
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. http://arxiv.org/abs/1908.10084
Reinecke and Gajos (2015) Katharina Reinecke and Krzysztof Z. Gajos. 2015. LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). Association for Computing Machinery, New York, NY, USA, 1364–1378. https://doi.org/10.1145/2675133.2675246
Review (2023) European Business Review. 2023. ChatGPT: Ushering in the Age of Creativity. https://www.europeanbusinessreview.com/chatgpt-ushering-in-the-age-of-creativity/
Richerson and Boyd (2008) Peter J. Richerson and Robert Boyd. 2008. Not By Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press.
Roemmele (2021) Melissa Roemmele. 2021. Inspiration through Observation: Demonstrating the Influence of Automatically Generated Text on Creative Writing. https://doi.org/10.48550/arXiv.2107.04007
Salganik et al. (2006) Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. 2006. Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market. Science 311, 5762 (Feb. 2006), 854–856. https://doi.org/10.1126/science.1121066
Salganik and Watts (2009) Matthew J. Salganik and Duncan J. Watts. 2009. Web-Based Experiments for the Study of Collective Social Dynamics in Cultural Markets. Topics in Cognitive Science 1, 3 (2009), 439–468. https://doi.org/10.1111/j.1756-8765.2009.01030.x
Schemmer et al. (2022) Max Schemmer, Niklas Kühl, Carina Benz, and Gerhard Satzger. 2022. On the Influence of Explainable AI on Automation Bias. https://doi.org/10.48550/arXiv.2204.08859
Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The Curse of Recursion: Training on Generated Data Makes Models Forget. http://arxiv.org/abs/2305.17493
Siangliulue et al. (2015) Pao Siangliulue, Kenneth C. Arnold, Krzysztof Z. Gajos, and Steven P. Dow. 2015. Toward Collaborative Ideation at Scale: Leveraging Ideas from Others to Generate More Creative and Diverse Ideas. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (Feb. 2015), 937–945. https://doi.org/10.1145/2675133.2675239
Sourati and Evans (2023) Jamshid Sourati and James A. Evans. 2023. Accelerating science with human-aware artificial intelligence. Nature Human Behaviour 7, 10 (Oct. 2023), 1682–1696. https://doi.org/10.1038/s41562-023-01648-z
Spiel et al. (2019) Katta Spiel, Oliver L. Haimson, and Danielle Lottridge. 2019. How to do better with gender on surveys: a guide for HCI researchers. Interactions 26, 4 (June 2019), 62–65. https://doi.org/10.1145/3338283
Stevenson et al. (2022) Claire Stevenson, Iris Smal, Matthijs Baas, Raoul Grasman, and Han van der Maas. 2022. Putting GPT-3’s Creativity to the (Alternative Uses) Test. https://doi.org/10.48550/arXiv.2206.08932
Tubefilter (2023) Tubefilter. 2023. 86% of creators believe AI has a positive effect on creativity. ChatGPT offered its own opinions. https://www.tubefilter.com/2023/06/02/lightricks-creator-artificial-intelligence-ai-survey-chat-gpt-wired/
Walia (2019) Chetan Walia. 2019. A Dynamic Definition of Creativity. Creativity Research Journal 31, 3 (July 2019), 237–247. https://doi.org/10.1080/10400419.2019.1641787
Wilcot (2023) Wilcot. 2023. Using Chat-GPT for Innovators: Enhancing Creativity and Innovation. https://www.boardofinnovation.com/blog/using-chat-gpt-for-innovators-enhancing-creativity-and-innovation/
Williams (2018) Jamie Williams. 2018. Should AI Always Identify Itself? It’s More Complicated Than You Might Think. https://www.eff.org/deeplinks/2018/05/should-ai-always-identify-itself-its-more-complicated-you-might-think
Yang et al. (2022) Daijin Yang, Yanpeng Zhou, Zhiyuan Zhang, Toby Jia-Jun Li, and L. C. Ray. 2022. AI as an Active Writer: Interaction Strategies with Generated Text in Human-AI Collaborative Fiction Writing 56-65. https://www.semanticscholar.org/paper/AI-as-an-Active-Writer%3A-Interaction-Strategies-with-Yang-Zhou/15ddeb7765e2a3ea692a27d9b30e8f9446d74742
Yang et al. (2023) Tianchen Yang, Qifan Zhang, Zhaoyang Sun, and Yubo Hou. 2023. Automatic Assessment of Divergent Thinking in Chinese Language with TransDis: A Transformer-Based Language Model Approach. https://doi.org/10.48550/arXiv.2306.14790
Yu and Nickerson (2011) Lixiu Yu and Jeffrey V. Nickerson. 2011. Cooks or cobblers? crowd creativity through combination. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). Association for Computing Machinery, New York, NY, USA, 1393–1402. https://doi.org/10.1145/1978942.1979147
Yu and Nickerson (2013) Lixiu Yu and Jeffrey V. Nickerson. 2013. An internet-scale idea generation system. ACM Transactions on Interactive Intelligent Systems 3, 1 (April 2013), 2:1–2:24. https://doi.org/10.1145/2448116.2448118
Yu et al. (2023) Yuhua Yu, Roger E. Beaty, Boris Forthmann, Mark Beeman, John Henry Cruz, and Dan Johnson. 2023. A MAD method to assess idea novelty: Improving validity of automatic scoring using maximum associative distance (MAD). Psychology of Aesthetics, Creativity, and the Arts (2023). https://doi.org/10.1037/aca0000573
Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: Story Writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 841–852. https://doi.org/10.1145/3490099.3511105

Appendix A AUT Items

Table 5. AUT items by frequency of occurrence in the dataset and classifier accuracy. Accuracy is defined as the correlation between human ratings of creativity and model predictions. The overall accuracy was r = 0.81. The accuracy of responses from the best-performing 5-item subset was r = 0.90. The data and model are from Organisciak et al. (2022).

AUT Item	Classifier Accuracy (r)	Frequency in Test Set
tire	0.91	412
pants	0.91	443
shoe	0.91	382
table	0.90	461
bottle	0.88	839
pencil	0.85	384
ball	0.84	393
fork	0.83	407
lightbulb	0.83	383
toothbrush	0.81	379
knife	0.81	2163
backpack	0.80	34
shovel	0.79	339
paperclip	0.79	1385
hat	0.76	380
box	0.74	2842
spoon	0.73	386
book	0.71	487
sock	0.69	380
brick	0.64	5162
rope	0.56	2080

Appendix B AUT Prompts

We conducted a Monte Carlo experiment to confirm that our modified zero-shot prompt produced responses similar in word length to human responses. We used parameters from the experiments of Stevenson et al. (2022): for $n=1000$ trials, we fixed the presence penalty and frequency penalty at 1, randomly chose a temperature (higher values lead to more randomness) from [0.65, 0.7, 0.75, 0.8], and randomly chose one of our 5 AUT items. For each trial, ChatGPT generated five ideas. The modified prompt resulted in responses with an average word length (M = 4.44, SD = 1.34) much closer to that of human responses (M = 4.56, SD = 4.97) than the original zero-shot prompt did (M = 25.38, SD = 8.55). A permutation test further shows that this difference in word count was significant at $p<0.001$. We used the ideas generated by our modified zero-shot prompt as stimuli for the main experiment.
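As an illustration of the kind of permutation test described above, the following is a minimal sketch rather than the authors' analysis code. The paper does not state the test statistic, so the absolute difference in mean word count is assumed here. The function names (`word_count`, `permutation_test`) and the example responses are hypothetical and are not study data.

```python
import random
from statistics import mean


def word_count(text):
    """Count whitespace-separated tokens in a response."""
    return len(text.split())


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy responses invented for illustration (not study data).
short_ideas = ["use as a planter", "garden swing seat", "boat bumper",
               "sandal soles", "dog chew toy", "obstacle course ring"]
long_ideas = [
    "You could cut the tire in half and fill it with soil to grow herbs",
    "One creative use is to hang the tire from a tree as a swing for kids",
    "The tire can be stacked with others to build a sturdy retaining wall",
    "Paint the tire in bright colors and turn it into a playground sculpture",
    "Use the tire as a weight for strength training exercises at home",
    "Fill the tire with concrete to make a heavy base for a patio umbrella",
]

short_counts = [word_count(t) for t in short_ideas]
long_counts = [word_count(t) for t in long_ideas]
observed_diff, p_value = permutation_test(short_counts, long_counts)
print(f"difference in mean word count = {observed_diff:.2f}, p = {p_value:.4f}")
```

Shuffling the pooled word counts simulates the null hypothesis that prompt type is unrelated to response length; the p-value is the share of shuffles that produce a difference at least as large as the observed one.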

Table 6. Summary statistics of the AUT prompt experiment. Human ideas are from Organisciak et al. (2022) and include only those ideas in response to the chosen AUT items. Note that in some cases ChatGPT did not return the desired number of ideas, leading to a slight discrepancy in the number of ideas generated by the two prompts.

Condition	N	Average Words	SD Words
Human Ideas	2537	4.56	4.97
Zero Shot Length Limited	7500	4.44	1.34
Zero Shot	8153	25.38	8.55

Appendix C Pre-Treatment Questions

Pew asked about feelings towards AI (Nadeem, 2023, 2022), and we used the specific phrasing and choice ordering from Nadeem (2022). We randomized the first two options and kept ‘neutral’ last. Our gender question was based on guidance from Spiel et al. (2019). The options were: ‘woman’, ‘man’, ‘non-binary’, ‘prefer to self-describe’, and ‘prefer not to disclose’. We added a text box for those who preferred to self-describe. The only deviation from Spiel et al. (2019) is that we did not allow participants to select multiple options. We note that the gender, age, and country questions were optional.

Appendix D Exclusion Criteria for Analysis

Participants could have consented and answered pre-treatment questions but failed to complete any trial. We only analyze data from participants who completed at least one trial. In four cases ($n=4$), users submitted implausible ages. We replaced these age values with missing values for the purpose of summarizing participants but kept the responses. In two cases ($n=2$), responses that should not have been shown were shown. We removed these two responses from analysis. As discussed in Appendix L, we instituted content moderation after receiving several troll responses. After the study, we manually inspected each response flagged by our system. There were $46$ ideas labeled as profane, and we determined that $36$ were true positives. We removed the true positives ($n=36$) from analysis, resulting in a final set of $3414$ responses for analysis from an initial set of $3452$ responses. Importantly, we conducted chi-squared tests and found that condition was unrelated to the number of flagged ideas ($\chi^{2}(4)=6.06$, $p=0.19$), the number of flagged ideas minus false positives ($\chi^{2}(4)=2.92$, $p=0.57$), or the total number of excluded ideas ($\chi^{2}(4)=3.87$, $p=0.42$).
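The reported exclusion counts and chi-squared p-values can be checked for internal consistency with a short sketch. It is not the authors' code. It relies on the closed-form upper-tail probability of a chi-squared distribution with even degrees of freedom, and the helper name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with even df.

    For even df the tail has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


# Test statistics and p-values as reported in the text (df = 4 for each).
reported = {
    "flagged ideas": (6.06, 0.19),
    "flagged minus false positives": (2.92, 0.57),
    "total excluded ideas": (3.87, 0.42),
}
for label, (statistic, reported_p) in reported.items():
    recomputed = chi2_sf_even_df(statistic, df=4)
    print(f"{label}: recomputed p = {recomputed:.2f}, reported p = {reported_p}")

# Bookkeeping: initial responses minus wrongly shown and true-positive profane ones.
final_n = 3452 - 2 - 36
print("final responses:", final_n)
```

Under these assumptions the recomputed p-values round to the reported ones, and the response counts reconcile to 3414.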

Appendix E Human vs AI Ideas

We compared a sample of 1500 ideas generated with our modified Stevenson prompt in the prompt experiment to a random sample of 1500 ideas from the Organisciak Dataset for our 5 items. For each set, we used the model’s predicted originality scores. Originality ranges from 1 to 5. Overall, ChatGPT ideas had higher originality ($\beta=0.62$, $t(2994)=22.49$, 95% CI = $[0.56,0.67]$).

Table 7. Comparing predicted originality of ChatGPT generated ideas to ideas from a dataset of prior human responses

Dependent variable: originality
Term	Coefficient	(Std. Error)
sourcechatgpt	0.618***	(0.027)
promptpants	-0.072*	(0.041)
promptshoe	0.096**	(0.043)
prompttable	-0.007	(0.042)
prompttire	-0.196***	(0.041)
Constant	2.751***	(0.029)
Observations	3,000
R²	0.159
Adjusted R²	0.157
Residual Std. Error	0.746 (df = 2994)
F Statistic	112.903*** (df = 5; 2994)
Note: *p < 0.1; **p < 0.05; ***p < 0.01
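The coefficients in Table 7 are consistent with a linear model of the following form. This is a sketch under stated assumptions: the paper does not write out the specification, ordinary least squares is assumed, and the omitted reference category is assumed to be the fifth AUT item, which has no row in the table (plausibly ‘bottle’, given Table 5). The symbols $\gamma_k$ and the indicator notation are introduced here for illustration only.

```latex
\begin{align*}
\text{originality}_i &= \beta_0 + \beta_1\,\text{ChatGPT}_i
  + \sum_{k \in \{\text{pants},\,\text{shoe},\,\text{table},\,\text{tire}\}}
    \gamma_k\,\mathbb{1}[\text{item}_i = k] + \varepsilon_i \\
\hat{\beta}_0 &= 2.751, \qquad \hat{\beta}_1 = 0.618 \\
\widehat{\text{originality}}(\text{human},\ \text{reference item}) &= 2.751 \\
\widehat{\text{originality}}(\text{ChatGPT},\ \text{reference item}) &= 2.751 + 0.618 = 3.369
\end{align*}
```

Read this way, the ChatGPT coefficient is the expected difference in predicted originality (on the 1 to 5 scale) between a ChatGPT idea and a human idea for the same item.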

Appendix F Participant Feedback

To encourage participants to start the study, we told them that, if they finished all 5 trials, we would show them how creative they are relative to humans and AI. At the end of the experiment, we first computed a participant’s average score from the Organisciak et al. (2022) classifier as their ‘creativity score’. We then graphically and verbally showed participants what percentile this score would be for both humans and AI (where the human and AI scores come from applying the Organisciak et al. (2022) classifier to a sample of AI ideas we generated and prior human ideas from the Organisciak Dataset). We also provided a graph that compared a participant’s scores in the AI condition to their scores in the no-AI conditions.

Additionally, we wanted to minimize attrition once participants had started. To encourage them to continue, we gave participants two pieces of feedback after each trial. See Appendix M for screenshots.

• First, we calculated how unique a participant’s response was relative to the last person’s response. We did this by calculating the cosine distance between a word2vec embedding of the participant’s response and a word2vec embedding of the last response in a given {[condition], item}. Due to resource constraints, we used a truncated word2vec model (the top 15k words in English).
• We also compared the accuracy of participants’ rankings to the rankings of ideas by the classifier (Organisciak et al., 2022) we used. To do this, we calculated the rank-order correlation between a participant’s rankings of items and the rank order generated by the Organisciak et al. (2022) model.

In certain cases, either of these metrics could not be calculated, and we returned an arbitrary, random number.
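The two feedback metrics can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes embeddings are already available as plain lists of floats, and the function names (`cosine_distance`, `average_ranks`, `spearman`) are hypothetical. Returning `None` for an undefined metric is also an assumption about when a metric "could not be calculated".

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return None  # undefined, e.g. no word of the response is in the vocabulary
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(values):
    """1-indexed ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Rank-order (Spearman) correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return None  # undefined when one ranking is constant
    return cov / (sx * sy)
```

Here `spearman(participant_ranking, model_ranking)` would give the ranking-accuracy feedback, and `cosine_distance(current_embedding, previous_embedding)` the uniqueness feedback.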

Appendix G Sample Characteristics

Table 8. Descriptive Stats (Non-Missing Values)

	Mean	SD	25th Percentile	Median	75th Percentile
age	34.92	10.86	27.0	33.0	40.0
creativity_ai	57.86	26.66	40.0	60.0	76.0
creativity_human	58.67	23.65	44.0	62.0	75.0

Table 9. Distribution of Gender

	Counts (% of total)
gender
woman	308 (36%)
man	268 (32%)
Missing	222 (26%)
non-binary	23 (3%)
prefer_not_disclose	16 (2%)
prefer_self_describe	7 (1%)

Table 10. Distribution of AI Feeling

	Counts (% of total)
ai_feeling
neutral	403 (48%)
excited	232 (27%)
concerned	198 (23%)
Missing	11 (1%)

Although we did not assess English language proficiency, the top five countries by responses (77.85% of responses) were the United States, Canada, Germany, United Kingdom, and Australia, all countries with high English proficiency. The median response length was six words, which is relatively short, also suggesting that English language proficiency is not a likely confounder.

Appendix H Model Selection

DV	Potential Moderator	$\chi^{2}$	Df	$p<\chi^{2}$	Added Interaction
Idea Diversity	Self-Perceived Human Creativity	10.32	4.00	0.04	YES
	AI - Human Creativity	15.70	4.00	0.00	YES
	AI Feeling	8.20	8.00	0.41	NO
	Interest Group	12.02	8.00	0.15	NO
Creativity	Self-Perceived Human Creativity	3.28	4.00	0.51	NO
	AI - Human Creativity	1.11	4.00	0.89	NO
	AI Feeling	8.19	8.00	0.41	NO
	Interest Group	1.57	8.00	0.99	NO
AI Adoption	Self-Perceived Human Creativity	18.24	3.00	0.00	YES
	AI - Human Creativity	9.94	3.00	0.02	YES
	AI Feeling	4.14	6.00	0.66	NO
	Interest Group	13.05	6.00	0.04	YES

Table 11. To determine which moderating variables to include, we conducted likelihood ratio tests comparing the baseline specification to a model including an interaction between a potential moderator and the treatment condition. If the likelihood ratio test indicated the interaction improved the fit at $p<0.05$, we included this interaction in our model.

Selected models already include ‘Interest Group’ to control for participant source (neutral, creative, technical). As a robustness check, we subsequently created an additional participant source variable, ‘IsSocialMedia’, indicating if the respondent was from social media. Likelihood ratio tests found adding ‘IsSocialMedia’ and its interaction with the treatment condition did not improve the fit of the selected models $(p>0.39\text{ for all models})$ .
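For readers who want to check the likelihood ratio tests in Table 11, the following standard-library sketch computes the upper-tail chi-squared probability for a given statistic and degrees of freedom. It is an illustration under stated assumptions, not the code used to fit our models: the function names are hypothetical, and it relies on the standard recurrence for the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable, integer df >= 1.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with a = df / 2
    and z = x / 2, starting from Q(1, z) = exp(-z) for even df or
    Q(1/2, z) = erfc(sqrt(z)) for odd df.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0 - 1e-9:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_base, loglik_full, df_diff):
    """LRT statistic and p-value for nested models.

    df_diff is the number of extra parameters in the fuller model.
    """
    stat = 2.0 * (loglik_full - loglik_base)
    return stat, chi2_sf(stat, df_diff)
```

As a check against Table 11, `chi2_sf(10.32, 4)` is roughly 0.035, which rounds to the reported 0.04.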

Appendix I Idea Diversity

I.1. Global

For 50 Monte Carlo runs with different seed values, we sampled 50 ideas for each {[condition], item} combination. For each 50-idea set, we computed various idea diversity measures. First, we calculated all pairwise SBERT distances. Next, we measured the mean and median pairwise distances. We also computed the centroid of each 50-idea set and calculated the mean distance from the centroid. After calculating these metrics, we conducted two-tailed, paired permutation tests (10,000 iterations) to test if two conditions differed on these metrics. To conduct the paired permutation test, we randomly swapped the sign of the difference between pairs of values, equivalent to randomly swapping the condition labels of rows—simulating the null hypothesis that conditions do not differ. We then counted the proportion of null distribution iterations where one would observe a larger absolute difference in means than the observed difference. We added a 1 to the numerator and denominator, which is a common, conservative adjustment (Ojala and Garriga, 2009) and stops p-values from being 0. Because the test is paired (equivalent to swapping the condition label within each ‘row’), our permutation tests are controlling for both AUT items and Monte Carlo seeds, since each row shares these attributes. We controlled for multiple pairwise comparisons by applying a Holm-Bonferroni adjustment to p-values. As a non-parametric measure of effect size, we used Cliff’s Delta. This metric ranges from -1 to +1 where 0 indicates no difference between conditions, +1 indicates that all Monte Carlo runs for the first condition are larger than those for the second, and vice versa for -1. As with evolution, to avoid the confounding effect of conditions differing in the number of seeds, we consider ideas after the sixth trial (see SI I.3 for a more detailed discussion). This is because the experiment is designed to ‘shed’ all seed examples after trial six.
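The testing procedure described above can be sketched as follows. This is a minimal standard-library illustration, not our analysis code; the function names are hypothetical. It also counts permuted differences that are at least as large as the observed one (`>=`), which is the conventional and slightly more conservative choice.

```python
import random


def paired_permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-tailed paired permutation test on the mean difference.

    Randomly flipping the sign of each paired difference is equivalent to
    swapping the condition labels within a row.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_iter):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            extreme += 1
    # +1 in numerator and denominator keeps the p-value strictly above 0.
    return (extreme + 1) / (n_iter + 1)


def holm_adjust(pvals):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def cliffs_delta(a, b):
    """Cliff's Delta: P(a > b) - P(a < b) over all cross-condition pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))
```

With 10,000 iterations, the smallest p-value this procedure can return before adjustment is 1/10,001.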

Table 12. Global idea diversity measured by mean pairwise distance

Contrast	Diff in Means	Adj P Value	Cliff’s Delta
HighExposureDisclosed-Control	1.34	0.001	0.35
HighExposureUndisclosed-Control	0.90	0.001	0.27
LowExposureDisclosed-Control	0.34	0.0111	0.09
LowExposureUndisclosed-Control	-0.46	0.0165	-0.02
HighExposureDisclosed-HighExposureUndisclosed	0.44	0.0111	0.05
HighExposureDisclosed-LowExposureDisclosed	1.00	0.001	0.29
HighExposureDisclosed-LowExposureUndisclosed	1.80	0.001	0.32
HighExposureUndisclosed-LowExposureDisclosed	0.56	0.001	0.21
HighExposureUndisclosed-LowExposureUndisclosed	1.36	0.001	0.28
LowExposureDisclosed-LowExposureUndisclosed	0.80	0.001	0.09

Table 13. Global idea diversity measured by median pairwise distance

Contrast	Diff in Means	Adj P Value	Cliff’s Delta
HighExposureDisclosed-Control	1.21	0.001	0.31
HighExposureUndisclosed-Control	0.81	0.001	0.26
LowExposureDisclosed-Control	0.47	0.001	0.11
LowExposureUndisclosed-Control	-0.61	0.0063	-0.06
HighExposureDisclosed-HighExposureUndisclosed	0.41	0.0328	0.04
HighExposureDisclosed-LowExposureDisclosed	0.74	0.001	0.23
HighExposureDisclosed-LowExposureUndisclosed	1.82	0.001	0.32
HighExposureUndisclosed-LowExposureDisclosed	0.33	0.0328	0.18
HighExposureUndisclosed-LowExposureUndisclosed	1.41	0.001	0.29
LowExposureDisclosed-LowExposureUndisclosed	1.08	0.001	0.15

Table 14. Global idea diversity measured by mean centroid distance

Contrast	Diff in Means	Adj P Value	Cliff’s Delta
HighExposureDisclosed-Control	1.48	0.001	0.35
HighExposureUndisclosed-Control	1.03	0.001	0.27
LowExposureDisclosed-Control	0.36	0.0138	0.09
LowExposureUndisclosed-Control	-0.43	0.0295	-0.02
HighExposureDisclosed-HighExposureUndisclosed	0.45	0.0138	0.05
HighExposureDisclosed-LowExposureDisclosed	1.12	0.001	0.29
HighExposureDisclosed-LowExposureUndisclosed	1.91	0.001	0.32
HighExposureUndisclosed-LowExposureDisclosed	0.68	0.001	0.21
HighExposureUndisclosed-LowExposureUndisclosed	1.46	0.001	0.28
LowExposureDisclosed-LowExposureUndisclosed	0.79	0.001	0.09

I.2. Local

We found mixed evidence that belief in AI’s relative creativity moderates local idea diversity. Regression results showed a small but significant interaction effect between the [High Exposure, Undisclosed] condition and relative AI creativity ( $\beta=0.038$ , $t(3244)=2.149$ , 95% CI = $[0.003,0.073]$ , $p=0.03$ ). We probed this effect with estimated marginal means, predicting local idea diversity for the bottom and top decile of participants by perception of AI creativity. Top-decile participants had slightly higher local idea diversity than bottom-decile participants in the [High Exposure, Undisclosed] condition ( $\Delta=2.62,d=0.34$ ). Although this difference was significant before adjusting for multiple comparisons ( $p=0.01$ ), it was not significant afterward ( $p=0.06$ ; see Appendix Table 17). Hence, we conclude there is mixed evidence for the role of belief in AI’s relative creativity as a moderator of local idea diversity.

Table 15. Predictors of local idea diversity with coefficients and SEs in parentheses. The DV for models (1) and (2) are the median and mean pairwise distances between a participant’s response and examples. Model (3) uses the distance between a participant’s response and the centroid of examples. Ideas are embedded using SBERT. All three models have a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.

	Dependent variable:
	Median PW Distance	Mean PW Distance	Centroid Distance
	(1)	(2)	(3)
conditionLoExposure_Disclosed	$-$ 1.467 (1.864)	$-$ 2.177 (1.701)	$-$ 3.791 (2.438)
	t = $-$ 0.787	t = $-$ 1.280	t = $-$ 1.555
conditionLoExposure_Undisclosed	0.272 (1.861)	$-$ 0.098 (1.698)	$-$ 0.414 (2.433)
	t = 0.146	t = $-$ 0.057	t = $-$ 0.170
conditionHiExposure_Disclosed	1.051 (1.864)	0.814 (1.701)	3.354 (2.438)
	t = 0.564	t = 0.479	t = 1.376
conditionHiExposure_Undisclosed	$-$ 0.744 (1.866)	$-$ 0.958 (1.703)	0.442 (2.441)
	t = $-$ 0.399	t = $-$ 0.563	t = 0.181
creativity_human	$-$ 0.006 (0.013)	$-$ 0.008 (0.013)	$-$ 0.015 (0.021)
	t = $-$ 0.484	t = $-$ 0.647	t = $-$ 0.723
ai_rel_create	$-$ 0.006 (0.013)	$-$ 0.005 (0.012)	$-$ 0.008 (0.020)
	t = $-$ 0.425	t = $-$ 0.401	t = $-$ 0.388
trial_no	$-$ 0.023 (0.029)	$-$ 0.020 (0.027)	$-$ 0.007 (0.045)
	t = $-$ 0.814	t = $-$ 0.725	t = $-$ 0.166
ai_feelingconcerned	0.354 (0.358)	0.317 (0.339)	0.455 (0.565)
	t = 0.990	t = 0.935	t = 0.807
ai_feelingexcited	0.270 (0.349)	0.107 (0.331)	0.207 (0.550)
	t = 0.773	t = 0.324	t = 0.377
interest_groupcreative	$-$ 0.573 (0.453)	$-$ 0.492 (0.431)	$-$ 0.694 (0.693)
	t = $-$ 1.263	t = $-$ 1.142	t = $-$ 1.002
interest_grouptechnology	0.081 (0.465)	0.161 (0.443)	0.139 (0.711)
	t = 0.173	t = 0.363	t = 0.196
condition_order	0.182^∗∗ (0.092)	0.165^∗ (0.087)	0.256^∗ (0.143)
	t = 1.982	t = 1.897	t = 1.798
log_duration	$-$ 0.765^∗∗∗ (0.200)	$-$ 0.718^∗∗∗ (0.189)	$-$ 1.192^∗∗∗ (0.313)
	t = $-$ 3.820	t = $-$ 3.791	t = $-$ 3.809
n_seeds	0.398^∗∗∗ (0.136)	0.440^∗∗∗ (0.128)	0.597^∗∗∗ (0.213)
	t = 2.927	t = 3.429	t = 2.808
conditionLoExposure_Disclosed:creativity_human	0.034^∗ (0.018)	0.040^∗∗ (0.017)	0.070^∗∗ (0.028)
	t = 1.884	t = 2.313	t = 2.456
conditionLoExposure_Undisclosed:creativity_human	0.006 (0.018)	0.005 (0.017)	0.015 (0.028)
	t = 0.304	t = 0.285	t = 0.519
conditionHiExposure_Disclosed:creativity_human	$-$ 0.003 (0.018)	$-$ 0.004 (0.017)	$-$ 0.013 (0.028)
	t = $-$ 0.169	t = $-$ 0.248	t = $-$ 0.455
conditionHiExposure_Undisclosed:creativity_human	0.024 (0.018)	0.022 (0.017)	0.030 (0.028)
	t = 1.298	t = 1.256	t = 1.065
conditionLoExposure_Disclosed:ai_rel_create	$-$ 0.018 (0.018)	$-$ 0.017 (0.017)	$-$ 0.031 (0.028)
	t = $-$ 1.027	t = $-$ 0.991	t = $-$ 1.136
conditionLoExposure_Undisclosed:ai_rel_create	0.007 (0.018)	0.005 (0.017)	0.006 (0.028)
	t = 0.365	t = 0.296	t = 0.207
conditionHiExposure_Disclosed:ai_rel_create	$-$ 0.008 (0.018)	$-$ 0.007 (0.017)	$-$ 0.005 (0.027)
	t = $-$ 0.462	t = $-$ 0.414	t = $-$ 0.199
conditionHiExposure_Undisclosed:ai_rel_create	0.038^∗∗ (0.018)	0.040^∗∗ (0.017)	0.065^∗∗ (0.028)
	t = 2.149	t = 2.384	t = 2.331
Constant	85.104^∗∗∗ (1.733)	84.644^∗∗∗ (1.607)	72.776^∗∗∗ (2.466)
	t = 49.099	t = 52.676	t = 29.508
Observations	3,271	3,271	3,271
Log Likelihood	$-$ 11,201.060	$-$ 11,014.890	$-$ 12,630.840
Akaike Inf. Crit.	22,456.120	22,083.790	25,315.680
Bayesian Inf. Crit.	22,620.630	22,248.290	25,480.180
Note:	^∗p $<$ 0.1; ^∗∗p $<$ 0.05; ^∗∗∗p $<$ 0.01

Table 16. Estimated marginal means contrasts of local idea diversity, using a mixed model to compare predictions for top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. Local idea diversity is computed as the median pairwise distance between a participant’s idea and the example ideas. P-values adjusted for multiple comparisons using the Holm-Bonferroni method.

contrast	Relative AI Creativity Percentile	estimate	SE	df	t.ratio	Adjusted P Value	d
LoExposure_Undisclosed - HiExposure_Undisclosed	10	-0.368	1.540	20.311	-0.239	1.000	-0.048
LoExposure_Undisclosed - LoExposure_Disclosed	10	0.298	1.539	20.279	0.193	1.000	0.038
LoExposure_Undisclosed - HiExposure_Disclosed	10	-0.129	1.539	20.299	-0.084	1.000	-0.017
LoExposure_Undisclosed - Control	10	0.660	1.541	20.381	0.428	1.000	0.085
HiExposure_Undisclosed - LoExposure_Disclosed	10	0.666	1.540	20.356	0.432	1.000	0.086
HiExposure_Undisclosed - HiExposure_Disclosed	10	0.239	1.539	20.276	0.155	1.000	0.031
HiExposure_Undisclosed - Control	10	1.029	1.545	20.585	0.666	1.000	0.133
LoExposure_Disclosed - HiExposure_Disclosed	10	-0.427	1.540	20.346	-0.277	1.000	-0.055
LoExposure_Disclosed - Control	10	0.363	1.541	20.390	0.235	1.000	0.047
HiExposure_Disclosed - Control	10	0.790	1.545	20.591	0.511	1.000	0.102
LoExposure_Undisclosed - HiExposure_Undisclosed	90	-2.915	2.203	84.275	-1.323	1.000	-0.377
LoExposure_Undisclosed - LoExposure_Disclosed	90	2.284	2.202	84.112	1.037	1.000	0.295
LoExposure_Undisclosed - HiExposure_Disclosed	90	1.042	2.187	81.952	0.476	1.000	0.135
LoExposure_Undisclosed - Control	90	1.181	2.206	84.767	0.535	1.000	0.153
HiExposure_Undisclosed - LoExposure_Disclosed	90	5.198	2.206	84.630	2.357	0.207	0.672
HiExposure_Undisclosed - HiExposure_Disclosed	90	3.957	2.188	81.948	1.809	0.608	0.512
HiExposure_Undisclosed - Control	90	4.096	2.213	85.683	1.851	0.608	0.530
LoExposure_Disclosed - HiExposure_Disclosed	90	-1.242	2.190	82.342	-0.567	1.000	-0.161
LoExposure_Disclosed - Control	90	-1.102	2.207	84.791	-0.500	1.000	-0.143
HiExposure_Disclosed - Control	90	0.139	2.197	83.314	0.063	1.000	0.018

Table 17. Estimated marginal means contrasts of local idea diversity, using a mixed model to compare predictions for the top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. Local idea diversity is computed as the median pairwise distance between a participant’s idea and the example ideas. P-values adjusted for multiple comparisons using the Holm-Bonferroni method.

contrast	condition	estimate	SE	df	t.ratio	P Value	d	Adjusted P Value
ai_rel_create10 - ai_rel_create90	LoExposure_Undisclosed	-0.077	1.043	3156.269	-0.074	0.941	-0.010	1.000
ai_rel_create10 - ai_rel_create90	HiExposure_Undisclosed	-2.624	1.040	3136.309	-2.523	0.012	-0.339	0.058
ai_rel_create10 - ai_rel_create90	LoExposure_Disclosed	1.909	1.041	3159.288	1.833	0.067	0.247	0.268
ai_rel_create10 - ai_rel_create90	HiExposure_Disclosed	1.094	1.013	3181.870	1.080	0.280	0.141	0.840
ai_rel_create10 - ai_rel_create90	Control	0.444	1.046	3177.974	0.424	0.671	0.057	1.000

I.3. Evolution

To model the evolution of idea diversity, we pooled together submitted ideas at the level of (item, condition, trial number). We then computed the median pairwise distance, mean pairwise distance, and mean distance from centroid for each pool of ideas. We fit a model to test if idea diversity changed at a different rate for different conditions:

	$\displaystyle\text{variable}_{cti}=\beta_{0}+\beta_{1}\text{Condition}_{c}+\beta_{2}\text{TrialNo}_{t}+\beta_{3}(\text{TrialNo}\times\text{Condition})_{tc}+\beta_{4}\text{Nobs}_{cti}+u_{0i}+e_{cti}$

Where:

• $c$ indexes conditions.
• $t$ indexes trial number.
• $i$ indexes items.
• $\beta_{0}$ is the global intercept.
• $u_{0i}\sim N(0,\sigma_{u}^{2})$ are random intercepts for items.
• $e_{cti}\sim N(0,\sigma^{2})$ is the residual.

We took two additional steps to make sure our results were not driven by confounding factors. First, $Nobs$ controls for how many ideas are in the set that is being analyzed. Recall that we designed the experiment so that each (item, condition) combination was replicated exactly seven times in response chains of exactly 20 trials. However, there were some minor deviations (discussed in SI L) in response chains, resulting in some (item, condition, trial number) sets having fewer items than others. Hence, we control for the number of ideas in a set. Second, we only ran this analysis on data after the sixth trial in a response chain to rule out the effect of seeds on evolution. The logic here is that the condition with the most initial seeds (6) was the control condition. The experiment is designed to ‘shed’ all seeds after trial six since by that time there would have been six experiment responses, meaning the most recent six ideas in the control condition would now all be from the experiment, and hence no seeds present in the example sets. (Note that for all local analyses, we directly control for the number of seeds present in the example set as a fixed effect.)
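The pooling step that produces the dependent variables and the $Nobs$ control can be sketched as below. This is an illustration under several assumptions rather than our pipeline: each row is a dict with hypothetical keys (`item`, `condition`, `trial_no`, `embedding`), trial numbers are 1-indexed so that "after the sixth trial" means `trial_no >= 7`, distances are cosine distances, and no vector (including a centroid) is all zeros.

```python
import math
from collections import defaultdict
from itertools import combinations
from statistics import mean, median


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def diversity_metrics(vectors):
    """Median/mean pairwise distance and mean distance from the centroid."""
    pairwise = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    dim = len(vectors[0])
    centroid = [mean(vec[k] for vec in vectors) for k in range(dim)]
    return {
        "median_pw": median(pairwise),
        "mean_pw": mean(pairwise),
        "centroid": mean(cosine_distance(vec, centroid) for vec in vectors),
        "nobs": len(vectors),  # number of ideas in the pool (the Nobs control)
    }


def pool_by_trial(rows, min_trial=7):
    """Pool embeddings by (item, condition, trial number) and score each pool."""
    pools = defaultdict(list)
    for row in rows:
        if row["trial_no"] >= min_trial:  # drop trials that may still show seeds
            key = (row["item"], row["condition"], row["trial_no"])
            pools[key].append(row["embedding"])
    # A pool needs at least two ideas for pairwise distances to exist.
    return {key: diversity_metrics(vecs) for key, vecs in pools.items() if len(vecs) >= 2}
```

Each entry of the returned dictionary corresponds to one observation in the evolution model, with `nobs` entering as a covariate.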

Table 18. Evolution of idea diversity by condition. Each model has a random intercept for item. The reference level for experimental conditions is the control condition.

	Dependent variable:
	Median PW Distance	Mean PW Distance	Centroid Distance
	(1)	(2)	(3)
nobs	0.827^∗∗ (0.337)	0.894^∗∗∗ (0.317)	3.940^∗∗∗ (0.214)
	t = 2.454	t = 2.818	t = 18.450
conditionLow ExposureUndisclosed	$-$ 1.558 (3.349)	$-$ 1.048 (3.151)	$-$ 0.534 (2.121)
	t = $-$ 0.465	t = $-$ 0.333	t = $-$ 0.252
conditionLow ExposureDisclosed	$-$ 4.087 (3.368)	$-$ 4.163 (3.168)	$-$ 1.884 (2.133)
	t = $-$ 1.213	t = $-$ 1.314	t = $-$ 0.883
conditionHigh ExposureUndisclosed	$-$ 5.575 (3.417)	$-$ 4.853 (3.214)	$-$ 3.517 (2.164)
	t = $-$ 1.632	t = $-$ 1.510	t = $-$ 1.625
conditionHigh ExposureDisclosed	$-$ 6.003^∗ (3.416)	$-$ 5.573^∗ (3.213)	$-$ 3.608^∗ (2.163)
	t = $-$ 1.757	t = $-$ 1.734	t = $-$ 1.668
trial_no	$-$ 0.391^∗∗ (0.175)	$-$ 0.321^∗ (0.165)	$-$ 0.141 (0.111)
	t = $-$ 2.232	t = $-$ 1.948	t = $-$ 1.273
conditionLow ExposureUndisclosed:trial_no	0.140 (0.231)	0.111 (0.218)	0.069 (0.146)
	t = 0.605	t = 0.512	t = 0.472
conditionLow ExposureDisclosed:trial_no	0.368 (0.233)	0.379^∗ (0.220)	0.196 (0.148)
	t = 1.578	t = 1.728	t = 1.328
conditionHigh ExposureUndisclosed:trial_no	0.525^∗∗ (0.239)	0.461^∗∗ (0.225)	0.335^∗∗ (0.151)
	t = 2.200	t = 2.051	t = 2.213
conditionHigh ExposureDisclosed:trial_no	0.566^∗∗ (0.239)	0.561^∗∗ (0.225)	0.378^∗∗ (0.151)
	t = 2.371	t = 2.498	t = 2.500
Constant	81.500^∗∗∗ (3.899)	78.538^∗∗∗ (3.669)	19.040^∗∗∗ (2.479)
	t = 20.904	t = 21.404	t = 7.681
Observations	362	362	362
Log Likelihood	$-$ 1,158.987	$-$ 1,137.571	$-$ 998.930
Akaike Inf. Crit.	2,343.974	2,301.141	2,023.861
Bayesian Inf. Crit.	2,394.566	2,351.733	2,074.452
Note:	^∗p $<$ 0.1; ^∗∗p $<$ 0.05; ^∗∗∗p $<$ 0.01

Appendix J Creativity

Table 19. Predictors of creativity with coefficients and SEs in parentheses. This model has a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.

	Dependent variable:
	Creativity
conditionLoExposure_Disclosed	0.011 (0.112)
	t = 0.095
conditionLoExposure_Undisclosed	0.019 (0.112)
	t = 0.166
conditionHiExposure_Disclosed	$-$ 0.025 (0.113)
	t = $-$ 0.221
conditionHiExposure_Undisclosed	0.050 (0.113)
	t = 0.443
creativity_human	0.001 (0.001)
	t = 1.587
ai_rel_create	0.001 (0.001)
	t = 1.506
interest_groupcreative	$-$ 0.069^∗ (0.039)
	t = $-$ 1.760
interest_grouptechnology	$-$ 0.010 (0.040)
	t = $-$ 0.249
trial_no	$-$ 0.002 (0.003)
	t = $-$ 0.797
ai_feelingconcerned	$-$ 0.017 (0.033)
	t = $-$ 0.526
ai_feelingexcited	$-$ 0.046 (0.032)
	t = $-$ 1.448
condition_order	0.003 (0.008)
	t = 0.448
log_duration	0.097^∗∗∗ (0.018)
	t = 5.525
n_seeds	$-$ 0.014 (0.012)
	t = $-$ 1.215
Constant	3.192^∗∗∗ (0.128)
	t = 24.932
Observations	3,271
Log Likelihood	$-$ 3,196.244
Akaike Inf. Crit.	6,430.488
Bayesian Inf. Crit.	6,546.252
Note:	^∗p $<$ 0.1; ^∗∗p $<$ 0.05; ^∗∗∗p $<$ 0.01

Appendix K AI Adoption

Table 20. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for top 10 percentile and bottom 10 percentile of participants by self-perceived human creativity. AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values are adjusted for multiple comparisons using Holm-Bonferroni method.

contrast	Perceived Creativity Percentile	estimate	SE	df	t.ratio	Adjusted P Value	Cohen’s d
LoExposure_Undisclosed - HiExposure_Undisclosed	10	-6.021	2.369	37.026	-2.542	0.092	-0.502
LoExposure_Undisclosed - LoExposure_Disclosed	10	-3.669	2.369	37.048	-1.549	0.520	-0.306
LoExposure_Undisclosed - HiExposure_Disclosed	10	-1.924	2.367	36.890	-0.813	0.983	-0.160
HiExposure_Undisclosed - LoExposure_Disclosed	10	2.352	2.372	37.212	0.992	0.983	0.196
HiExposure_Undisclosed - HiExposure_Disclosed	10	4.097	2.365	36.779	1.732	0.458	0.341
LoExposure_Disclosed - HiExposure_Disclosed	10	1.744	2.370	37.092	0.736	0.983	0.145
LoExposure_Undisclosed - HiExposure_Undisclosed	90	-5.757	2.172	26.231	-2.650	0.040	-0.480
LoExposure_Undisclosed - LoExposure_Disclosed	90	0.599	2.170	26.127	0.276	1.000	0.050
LoExposure_Undisclosed - HiExposure_Disclosed	90	-6.852	2.168	26.058	-3.160	0.020	-0.571
HiExposure_Undisclosed - LoExposure_Disclosed	90	6.356	2.174	26.323	2.924	0.028	0.529
HiExposure_Undisclosed - HiExposure_Disclosed	90	-1.095	2.166	25.913	-0.506	1.000	-0.091
LoExposure_Disclosed - HiExposure_Disclosed	90	-7.451	2.171	26.171	-3.433	0.012	-0.621

Table 21. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for the top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. This metric captures how creative participants think AI is relative to humans (higher values means more creative than humans). AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values adjusted for multiple comparisons using Holm-Bonferroni method.

contrast	Relative AI Creativity Percentile	estimate	SE	df	t.ratio	Adjusted P Value	d
LoExposure_Undisclosed - HiExposure_Undisclosed	10	-5.225	1.952	17.121	-2.677	0.095	-0.435
LoExposure_Undisclosed - LoExposure_Disclosed	10	-1.115	1.949	17.038	-0.572	1.000	-0.093
LoExposure_Undisclosed - HiExposure_Disclosed	10	-4.651	1.951	17.100	-2.384	0.145	-0.387
HiExposure_Undisclosed - LoExposure_Disclosed	10	4.110	1.953	17.186	2.104	0.201	0.342
HiExposure_Undisclosed - HiExposure_Disclosed	10	0.574	1.949	17.013	0.295	1.000	0.048
LoExposure_Disclosed - HiExposure_Disclosed	10	-3.536	1.953	17.169	-1.811	0.263	-0.295
LoExposure_Undisclosed - HiExposure_Undisclosed	90	0.422	3.168	115.486	0.133	1.000	0.035
LoExposure_Undisclosed - LoExposure_Disclosed	90	-1.388	3.161	114.586	-0.439	1.000	-0.116
LoExposure_Undisclosed - HiExposure_Disclosed	90	-2.363	3.130	110.316	-0.755	1.000	-0.197
HiExposure_Undisclosed - LoExposure_Disclosed	90	-1.811	3.172	115.993	-0.571	1.000	-0.151
HiExposure_Undisclosed - HiExposure_Disclosed	90	-2.785	3.134	110.601	-0.889	1.000	-0.232
LoExposure_Disclosed - HiExposure_Disclosed	90	-0.974	3.135	110.956	-0.311	1.000	-0.081

Table 22. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. This metric captures how creative participants think AI is relative to humans (higher values means more creative than humans). AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values adjusted for multiple comparisons using Holm-Bonferroni method.

contrast	condition	estimate	SE	df	t.ratio	Adjusted P Value	d
ai_rel_create10 - ai_rel_create90	LoExposure_Undisclosed	-0.987	1.673	2545.523	-0.590	0.555	-0.082
ai_rel_create10 - ai_rel_create90	HiExposure_Undisclosed	4.660	1.669	2538.494	2.792	0.005	0.388
ai_rel_create10 - ai_rel_create90	LoExposure_Disclosed	-1.261	1.668	2542.774	-0.756	0.450	-0.105
ai_rel_create10 - ai_rel_create90	HiExposure_Disclosed	1.301	1.611	2480.927	0.808	0.419	0.108

Table 23. Predictors of AI adoption with coefficients and SEs in parentheses. The respective dependent variables are the max, mean, and median cosine similarities between the SBERT embedding of a participant’s response and the SBERT embeddings of AI examples the participant saw. All three models have a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.

	Dependent variable:
	Max AI Similarity	Mean AI Similarity	Median AI Similarity
	(1)	(2)	(3)
conditionLoExposure_Undisclosed	$-$ 5.481^∗ (2.801)	$-$ 3.962^∗ (2.356)	$-$ 3.960 (2.412)
	t = $-$ 1.957	t = $-$ 1.682	t = $-$ 1.642
conditionHiExposure_Disclosed	$-$ 2.243 (2.799)	$-$ 3.498 (2.355)	$-$ 3.415 (2.411)
	t = $-$ 0.801	t = $-$ 1.485	t = $-$ 1.416
conditionHiExposure_Undisclosed	3.507 (2.799)	0.098 (2.355)	$-$ 0.584 (2.411)
	t = 1.253	t = 0.041	t = $-$ 0.242
creativity_human	$-$ 0.059^∗∗∗ (0.022)	$-$ 0.041^∗∗ (0.017)	$-$ 0.041^∗∗ (0.017)
	t = $-$ 2.726	t = $-$ 2.430	t = $-$ 2.363
ai_rel_create	0.016 (0.021)	0.015 (0.016)	0.015 (0.017)
	t = 0.757	t = 0.911	t = 0.885
interest_groupcreative	2.425^∗ (1.360)	1.853^∗ (1.052)	1.828^∗ (1.074)
	t = 1.784	t = 1.761	t = 1.702
interest_grouptechnology	0.708 (1.389)	0.979 (1.074)	0.945 (1.097)
	t = 0.510	t = 0.912	t = 0.862
trial_no	$-$ 0.037 (0.050)	0.005 (0.039)	0.022 (0.040)
	t = $-$ 0.734	t = 0.122	t = 0.556
ai_feelingconcerned	$-$ 0.261 (0.630)	$-$ 0.211 (0.479)	$-$ 0.189 (0.489)
	t = $-$ 0.413	t = $-$ 0.441	t = $-$ 0.388
ai_feelingexcited	0.569 (0.615)	0.165 (0.467)	0.149 (0.477)
	t = 0.925	t = 0.354	t = 0.312
condition_order	$-$ 0.204 (0.164)	$-$ 0.201 (0.128)	$-$ 0.212 (0.131)
	t = $-$ 1.247	t = $-$ 1.569	t = $-$ 1.619
log_duration	0.529 (0.354)	0.539^∗∗ (0.273)	0.510^∗ (0.279)
	t = 1.494	t = 1.974	t = 1.828
n_seeds	$-$ 0.435 (0.309)	$-$ 0.265 (0.240)	$-$ 0.214 (0.245)
	t = $-$ 1.407	t = $-$ 1.104	t = $-$ 0.875
conditionLoExposure_Undisclosed:creativity_human	0.053^∗ (0.029)	0.037 (0.023)	0.037 (0.023)
	t = 1.820	t = 1.601	t = 1.557
conditionHiExposure_Disclosed:creativity_human	0.115^∗∗∗ (0.029)	0.066^∗∗∗ (0.023)	0.061^∗∗∗ (0.023)
	t = 3.931	t = 2.868	t = 2.601
conditionHiExposure_Undisclosed:creativity_human	0.050^∗ (0.029)	0.030 (0.023)	0.039^∗ (0.024)
	t = 1.702	t = 1.293	t = 1.655
conditionLoExposure_Undisclosed:ai_rel_create	$-$ 0.003 (0.028)	$-$ 0.004 (0.022)	$-$ 0.004 (0.023)
	t = $-$ 0.121	t = $-$ 0.178	t = $-$ 0.174
conditionHiExposure_Disclosed:ai_rel_create	$-$ 0.032 (0.028)	$-$ 0.011 (0.022)	$-$ 0.007 (0.022)
	t = $-$ 1.151	t = $-$ 0.523	t = $-$ 0.330
conditionHiExposure_Undisclosed:ai_rel_create	$-$ 0.074^∗∗∗ (0.028)	$-$ 0.065^∗∗∗ (0.022)	$-$ 0.071^∗∗∗ (0.023)
	t = $-$ 2.612	t = $-$ 2.924	t = $-$ 3.146
conditionLoExposure_Undisclosed:interest_groupcreative	1.036 (1.851)	0.819 (1.446)	0.834 (1.476)
	t = 0.560	t = 0.566	t = 0.565
conditionHiExposure_Disclosed:interest_groupcreative	$-$ 0.465 (1.844)	$-$ 0.421 (1.440)	$-$ 0.523 (1.470)
	t = $-$ 0.252	t = $-$ 0.293	t = $-$ 0.356
conditionHiExposure_Undisclosed:interest_groupcreative	$-$ 3.224^∗ (1.843)	$-$ 2.329 (1.440)	$-$ 2.344 (1.470)
	t = $-$ 1.750	t = $-$ 1.617	t = $-$ 1.595
conditionLoExposure_Undisclosed:interest_grouptechnology	2.810 (1.888)	1.516 (1.474)	1.535 (1.504)
	t = 1.488	t = 1.028	t = 1.020
conditionHiExposure_Disclosed:interest_grouptechnology	$-$ 1.391 (1.871)	$-$ 1.550 (1.461)	$-$ 1.731 (1.491)
	t = $-$ 0.743	t = $-$ 1.061	t = $-$ 1.161
conditionHiExposure_Undisclosed:interest_grouptechnology	$-$ 1.522 (1.875)	$-$ 1.870 (1.465)	$-$ 1.926 (1.495)
	t = $-$ 0.811	t = $-$ 1.277	t = $-$ 1.288
Constant	23.789^∗∗∗ (2.704)	17.655^∗∗∗ (2.185)	17.609^∗∗∗ (2.235)
	t = 8.796	t = 8.080	t = 7.880
Observations	2,618	2,618	2,618
Log Likelihood	$-$ 10,155.720	$-$ 9,508.078	$-$ 9,561.834
Akaike Inf. Crit.	20,371.430	19,076.160	19,183.670
Bayesian Inf. Crit.	20,547.540	19,252.260	19,359.770
Note:	^∗p $<$ 0.1; ^∗∗p $<$ 0.05; ^∗∗∗p $<$ 0.01

Appendix L Implementation Complications

Running a massive online networked experiment open to any user on the Internet will often invite implementation challenges. In the interest of disclosure—and for the benefit of future researchers running similar experiments—we share the challenges we faced, solutions we implemented, and rationales for our decisions.

L.1. Server Capacity

Overall, we received far more responses than we expected. At several points throughout the experiment, we experienced more concurrent traffic than the application was designed to handle. Hence, we had to temporarily turn off the experiment to wait out high demand, add more resources, or implement and test changes described in Content Moderation. We note that these capacity issues did not affect the responses we collected.

L.2. Content Moderation

Initially, we did not implement any content moderation. But on July 8th, we received an influx of responses from Reddit. Several participants were trolls, providing repetitive profane responses. We then implemented a form of content moderation, flagging any idea that contained a word in a list of words banned by Google as of July 8, 2023 (https://github.com/coffee-and-fun/google-profanity-words), and subsequently added two more words to the list. If an idea was flagged, it was written to our database but not shown to future participants. Some profane ideas had already been shown to participants between the time the responses were submitted and the time we implemented our solution. Also, the content moderation strategy was imperfect: 10 of 46 flagged ideas were false positives. As discussed earlier, condition was unrelated to the number of flagged ideas ( $\chi^{2}$ (4) = 6.06, p = 0.19) or total number of excluded ideas ( $\chi^{2}$ (4) = 3.87, p = 0.42).

We acknowledge there are more advanced and nuanced content moderation strategies, but this one was the best option given our specific circumstances and constraints. First, this bag-of-words method is very transparent. Second, we deployed this experiment through Heroku, which imposes a CPU limit on the project, precluding the use of pre-trained classifiers such as BERT. Third, we did not want to use APIs like Jigsaw or OpenAI moderation endpoints because these APIs have rate limits, which can slow down the experiment.
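A word-list filter of this kind is simple enough to sketch in full. The snippet below is illustrative only: the placeholder word list, the function names, and the choice to match whole tokens rather than substrings are all assumptions, not a description of the deployed code.

```python
import re

# Hypothetical placeholder list; the deployed list came from the repository
# cited above plus two additional words.
BANNED_WORDS = {"darn", "heck"}


def is_flagged(idea, banned=BANNED_WORDS):
    """Return True if any whole token of the idea is on the banned list."""
    tokens = re.findall(r"[a-z0-9']+", idea.lower())
    return any(token in banned for token in tokens)


def visible_ideas(stored_ideas):
    """All ideas stay stored; only unflagged ones are shown to later participants."""
    return [idea for idea in stored_ideas if not is_flagged(idea)]
```

A filter like this costs one set lookup per token, so it needs no external API and very little CPU.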

L.3. Small Deviations From 20 Trials Per Chain

We intended for each response chain to contain 20 responses. The average number of trials per response chain was 19.73 (SD = 1.45) and the median number per chain was 20. We concluded the experiment before the last round of response chains was completely finished for all condition and item combinations, so the minimum number of trials in a response chain (occurring for an item and condition combination in the last round) was 14. The maximum number of trials was 24. These minor deviations occurred due to server overload, very high traffic leading to race conditions, and the exclusion of several responses based on the criteria described in Appendix D. Based on a two-way ANOVA, we concluded that response chain lengths did not differ by item $(F(4)=0.38,p=0.82)$ , condition $(F(4)=0.01,p=1.00)$ , or the interaction between items and conditions $(F(16)=0.01,p=1.00)$ .