How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Joshua Ashkinaze (University of Michigan, United States), Julia Mendelsohn (University of Michigan, United States), Li Qiwei (University of Michigan, United States), Ceren Budak (University of Michigan, United States), and Eric Gilbert (University of Michigan, United States)
(2024)
Abstract.

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted a dynamic experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants, and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and whether the examples were labeled as “AI” (disclosure). We find that high AI exposure did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity; low AI exposure did not have these diversity effects. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

artificial intelligence, creativity, large language models, cultural evolution
Preprint; Jul 03, 2024.
CCS Concepts: Human-centered computing; Human-centered computing → Human computer interaction (HCI); Human-centered computing → Empirical studies in HCI; Human-centered computing → Empirical studies in collaborative and social computing; Human-centered computing → Collaborative interaction; Computing methodologies → Artificial intelligence; Computing methodologies → Natural language processing

1. Introduction

If we think of culture as a “loop” where individuals and societies shape each other through exchanges of ideas and practices (Richerson and Boyd, 2008; Boyd and Richerson, 1988), then a question emerges: What happens when generative AI enters the “culture loop?” Exposure to LLMs (large language models) is increasing rapidly: When released, ChatGPT was the fastest-growing consumer application in history (Hu, 2023). Moreover, we are likely exposed to even more AI content than we realize: Humans overestimate their ability to distinguish AI from human content (Jakesch et al., 2022). This exposure likely matters: Ideas we see affect the ideas we create (Nijstad and Stroebe, 2006). How will the rapid rise of exposure to LLM-generated ideas affect the creativity, diversity, and evolution of human ideas? And to what extent do AI ideas influence human ideas?

The scale of ‘passive exposure’ to AI ideas is high, and different from prior human-AI interactions. By ‘passive exposure’, we refer to cases when (A) users see LLM outputs but do not have an active role in the creation of these outputs and (B) users are given no instructions to actively engage with these outputs. ‘Passive exposure’ approximates how users often encounter LLM outputs in the real world. For example, OpenAI users generate 100 billion words per day (Griffin, 2024). It is likely that the number of people who merely see AI output (i.e., are passively exposed to it) is significantly larger than the number of people who create with these systems (i.e., actively engage with them). Arguably, many future human-AI teams will exhibit this relationship. Yet in existing studies of human-AI creativity, participants are often actively interacting with an AI system (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019; Padmakumar and He, 2024). It is crucial, then, to understand how passive AI exposure shapes human ideas.

As AI exposure has increased, so have concerns over AI disclosure, that is, whether providers should disclose when they use AI systems (Hancock et al., 2020). California, for instance, considered passing a law requiring disclosure on behalf of anyone using bots on social media (Williams, 2018). Concerns regarding the disclosure of LLMs are likely to grow. In domains ranging from poetry (Köbis and Mossink, 2021) to online social media profiles (Jakesch et al., 2022), LLM output is increasingly indistinguishable from that of humans. We are interested, then, in whether disclosing ideas as coming from AI moderates the effect of AI exposure.

Figure 1. Graphical depiction of experiment. The task (Panel 1) is to submit a creative idea after seeing examples, where examples are from humans or AI. We vary (Panel 2) the amount of AI ideas in the example set (exposure) and if AI ideas are labeled as such (disclosure). The experiment is dynamic (Panel 3). Responses from prior participants serve as examples for future participants.

Motivated by these dynamics, we conducted a large-scale experiment to systematically test how AI exposure and disclosure affect the creativity, diversity, and evolution of human ideas. We employ a variant of the Alternate Uses Task (AUT, (Guilford, 1978)), a common measure of creativity, and manipulate exposure to LLM ideas. In the AUT, participants are told to think of non-obvious uses of an item. For example: What is a creative use for a tire? In our variant, participants complete the AUT for an item after viewing example ideas. These examples constitute our manipulation. Examples vary in AI exposure (none, few, or many AI examples) and AI disclosure (whether AI-generated ideas are labeled as such). The human-generated ideas in each example set come from prior participants in the same experimental condition. See Figure 1 for a graphical depiction of the experiment.

Our dynamic experiment design—ideas from prior participants are used as stimuli for future participants—speaks to the interdependent process of cultural creation: creative ideas are built upon prior ideas. Hence, we capture the compounding effects of having LLMs “in the culture loop”. It is also intended to mimic possible futures for human-AI teams. Our design allows us to observe not just average levels but also temporal dynamics of creativity and diversity in each condition. Taken together, our results provide insights into the role of LLMs in shaping collective thought.

Concretely, our main findings are:

  (1) High AI exposure increases collective diversity but not individual creativity. We find that high AI exposure increases collective idea diversity but does not affect individual creativity. Our high-powered null finding on creativity can inform public debates over the creative impact of seeing AI ideas ‘in the wild’. At the same time, ideas in the high AI exposure conditions were more different from each other. Together, these findings suggest the effect of AI exposure may be nuanced: The introduction of AI ideas into human society may yield more diverse but no better human ideas.

  (2) High AI exposure increases the speed at which idea diversity develops. Culture is constantly evolving (Boyd et al., 2011; Boyd and Richerson, 1988), yet many laboratory experiments are not designed to model this evolution. Through our dynamic design, we find that high AI exposure increases not only the average levels of collective idea diversity but also the rate of change in idea diversity. This is a consequential finding since even small differences in rates of change can lead to large cumulative differences over time.

  (3) People who identify as creative are less influenced by AI disclosure. Prior work argues that attitudes and expectations shape engagement with human-AI co-creation systems (Gero et al., 2022). Due to our large sample size, we can model this heterogeneity. We find that for users who self-identify as highly creative, adoption of AI ideas is not influenced by AI disclosure. But AI disclosure did affect the adoption of AI ideas for users who self-identified as low in creativity. This finding suggests that highly creative people will not be “duped” into adopting AI ideas.

  (4) Participants may adopt AI ideas for harder prompts. We find that when AI ideas are disclosed, participants are more likely to adopt AI ideas for difficult AUT prompts. This suggests that users will rely on AI ideas not for trivial creative tasks but for difficult ones. Because this finding is based on a small number of prompts, we view it as speculative and encourage more work on the topic.

1.1. Defining Concepts and Variables

1.1.1. Creativity

Creativity is defined in many ways (Walia, 2019). But one common conception is divergent thinking (Guilford, 1967). This is when “an individual solves a problem or reaches a decision using strategies that deviate from commonly used or previously taught strategies” (APA Dictionary of Psychology, n.d.). One of the most common (Abraham, 2016) tests of divergent thinking is the Alternate Uses Task (AUT) (Guilford, 1978; see https://www.mindgarden.com/67-alternate-uses), where participants are asked to think of an original use for an everyday object. Traditionally, responses to the AUT are measured along four dimensions: originality (how original the idea is), elaboration (how much the participant elaborates on the idea), fluency (how many ideas), and flexibility (different categories of ideas). The latter two can only be measured if the participant provides multiple responses to the same question. Because of our research design, we have participants generate just one creative idea (as in Beaty et al., 2022), and we focus on originality. (Participants see the most recent responses in the condition as stimuli, so if one participant brainstormed many responses, that participant would be over-represented in future participants’ example sets.)

We follow a long tradition of scoring responses to the AUT computationally (Yu et al., 2023; Beaty and Johnson, 2021; Beaty et al., 2022; Yang et al., 2023; Organisciak et al., 2022; Dumas et al., 2021). Specifically, we measure the creativity of AUT ideas with an existing fine-tuned GPT-3 classifier (Organisciak et al., 2022, 2023), which has an r=0.81 overall correlation with human judgments of AUT originality. Moreover, we chose AUT items for our experiment where the classifier had the highest accuracy: tire (r=0.91), pants (r=0.91), shoe (r=0.91), table (r=0.9), and bottle (r=0.88). Note that our task is highly ‘in-domain’ for the classifier: we ask participants to do the exact same task for the exact same items the model was trained on. We refer to the originality score from this classifier as individual-level creativity, though we note that future work can explore other dimensions of creativity (such as fluency). We discuss this classifier in more detail in Section 3.

1.1.2. Idea Diversity & AI Adoption

In addition to creativity, we measure how our experimental factors (LLM exposure and LLM disclosure) shape the diversity of ideas that participants produce. This is a complementary measure to creativity. Creativity is often thought of as an individual-level outcome. Diversity is a collective outcome. Put another way, creativity is a property of an idea while diversity is a property of an idea set. We measure two sides of diversity—semantic divergence (which we refer to as idea diversity) and semantic convergence towards AI ideas (which we refer to as AI adoption).

To measure idea diversity and AI adoption, we first embed all ideas using SBERT (Reimers and Gurevych, 2019), which are transformer-based embeddings designed for sentences. SBERT excels at capturing semantic similarity (Reimers and Gurevych, 2019). Prior work uses neural embeddings to compute similarity for AUT responses (Baten et al., 2021) and other creative tasks (Roemmele, 2021).

  • Idea diversity is the median pairwise cosine distance between idea embeddings in an idea set. As robustness checks, we also measure the mean pairwise distance and average distance to the centroid of a set.

  • AI adoption is the maximum cosine similarity between the embedding of the idea a participant submits and the embeddings of AI examples that the participants see. Following Roemmele (2021), we use the max rather than a measure of central tendency because if a participant is inspired by an idea, it would likely be a single idea. As robustness checks, we also measure the mean and median pairwise similarity between the submitted idea and an AI example, but these are noisier measures of adoption.
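The two measures above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the ideas have already been embedded (for example with SBERT) into plain lists of floats, and the function names and toy 2-d vectors are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def idea_diversity(embeddings):
    """Median pairwise cosine distance (1 - similarity) within an idea set."""
    distances = [
        1.0 - cosine_similarity(u, v) for u, v in combinations(embeddings, 2)
    ]
    return median(distances)


def ai_adoption(submitted, ai_examples):
    """Maximum cosine similarity between a submitted idea and the AI examples seen."""
    return max(cosine_similarity(submitted, ex) for ex in ai_examples)


# Toy 2-d vectors standing in for sentence embeddings (illustrative only).
idea_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(idea_diversity(idea_set))
print(ai_adoption([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

Swapping `median` for `mean` in `idea_diversity`, or `max` for `mean`/`median` in `ai_adoption`, gives the robustness variants mentioned in the list.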

2. Related Work

Our work bridges three research streams: human-AI co-creation, crowd-sourced creativity, and complex systems. AI ideas are scattered amongst human ideas, whether or not we can tell (Jakesch et al., 2022). This exposure presumably affects the ideas we create (co-creation). And our ideas presumably affect the ideas others create (crowdsourced creativity). While real-world culture is dynamic and evolving, most experiments are not set up to capture evolution (collective dynamics). By employing a large-N sample size and ‘many-worlds’ setup, we model the complex dynamics of AI influence. After discussing how our study bridges these streams, we turn to the particular kind of creativity and diversity our experiment captures and what is known about how our two factors (LLM exposure; LLM disclosure) would affect these outcome variables. However, much of the relevant literature gives conflicting predictions, a key motivation for conducting the current study.

2.1. Situating Our Work Between Co-Creation, Crowd Creativity, and Collective Dynamics

2.1.1. Human-AI Co-Creation

As the creative ability of AI advances (Miller, 2019), researchers explored how co-creating with AI affects human creativity. Much of this research explores creative writing with language models, in particular (Gero et al., 2022; Mirowski et al., 2023; Lee et al., 2022; Yang et al., 2022; Yuan et al., 2022; Roemmele, 2021; Di Fede et al., 2022; Hitsuwari et al., 2022; Gero, 2023; Mizrahi et al., 2020; Gero and Chilton, 2019; Padmakumar and He, 2024). While most prior work in this domain involves users actively engaging with custom systems, our study is concerned with passive exposure to outputs from off-the-shelf models. (By ‘passive exposure’ we mean that (1) users are shown LLM outputs but did not have an active role in the creation of these outputs and that (2) users were given no instructions to actively engage with these outputs; they were merely shown the LLM outputs. ‘Passive exposure’ approximates how users often encounter LLM outputs in the real world.) Human-AI co-creation shows that the relationship between AI ideas and their effect on human creativity is nuanced, with task-level and attitudinal factors playing a role. Roemmele (2021) found seeing AI examples influenced outcomes for hard, but not easy, prompts. Gero et al. (2022) found the quality of LLM outputs did not correlate with perceived usefulness. This is consistent with other research showing large variance in the perceived usefulness of outputs from co-creation systems (Calderwood et al., 2020), suggesting human attitudes partially determine the utility of AI creativity aids. We extend this predominantly qualitative work with a large-scale quantitative study.

Most similar to our work, several studies have explored how (post-ChatGPT) generative artificial intelligence affects creativity and diversity. Several studies found that ideating with generative AI can decrease diversity (Anderson et al., 2024; Doshi and Hauser, 2024; Padmakumar and He, 2024). Some studies suggest generative AI increases individual creativity (Doshi and Hauser, 2024; Dell’Acqua et al., 2023) while others (Anderson et al., 2024) find no effect. Our study offers several additions to this literature. First, because of its dynamic design, we test the long-run effects of AI, where ideas feed forward to future participants. Second, our study is concerned with “passive exposure”: Participants are not told ideas are from AI, and are not instructed to engage with these ideas. By systematically ablating whether AI ideas are disclosed as such, we can explore if the effect of AI ideas depends on knowledge of where the idea is from. Third, we employ a large sample size—which is useful since it provides power for precise estimates of effects and the ability to capture heterogeneity. Moreover, our large sample is comprised of creative professionals and technology-oriented users, two groups most relevant to the phenomenon.

2.1.2. Crowdsourced Creativity

Crowdsourcing can enhance creative outcomes (Yu and Nickerson, 2011, 2013; Nickerson and Sakamoto, 2010; Huang et al., 2020; Siangliulue et al., 2015). For example, Yu and Nickerson (2013) devised a method where crowds build on each other’s ideas by combining ideas from previous generations. Later generations of ideas were rated as more creative compared to earlier generations. Siangliulue et al. (2015) found that the creativity and diversity of idea sets that participants saw influenced the creativity and diversity of what these participants produced. This supports a main contention of our paper: AI exposure matters because the ideas we see affect the ideas we create. We incorporate elements of crowdsourced creativity, particularly in measuring how creativity and diversity unfold over subsequent generations.

2.1.3. Collective Dynamics & Many-Worlds Experiments

Prior work in complex systems and computational sociology highlights the importance of studying collectives to understand social dynamics (Salganik and Watts, 2009). Meanwhile, traditional experiments focus on individuals. Identifying the effect of AI ideas on the diversity and evolution of human ideas similarly requires an examination of complex systems as opposed to individuals in isolation. Hence, our experimental design draws on the “many-worlds” paradigm (e.g., Salganik et al., 2006): We create multiple, parallel realizations of worlds with and without AI ideas, each evolving independently under controlled conditions. By employing a large-N sample size and many different parallel worlds, we can better understand the collective effects of AI ideas on human ideas. Beyond simple averages, we can also model how AI ideas affect the evolution of human ideas.

2.1.4. Our Contributions

Our study incorporates elements of human-AI co-creation, crowd creativity, and collective dynamics. We note that co-creation studies often confound the effect of exposure with the effect of disclosure: If one is creating with an AI system, it is impossible to separate the content of an AI system from the knowledge that the content is from an AI system. Our factorial design lets us estimate the marginal effect of exposure and disclosure separately. Co-creation studies typically employ a small number of specialized participants actively engaged with a system. From the perspective of validating a system, this is reasonable. But we are interested in the effects of (1) passive exposure on (2) a general public. For this reason, we adopt a large-scale experimental design—similar to crowd-sourced creativity studies—that lets us estimate effects on the general public rather than specialized users. A key benefit of our large sample size is that we can precisely estimate how participant attitudes affect human-AI outcomes. This is important because, as Gero et al. (2022, pg. 1016) write: “[P]articipant attitudes are a major unknown factor when studying human-AI collaboration.” Drawing on the “many-worlds” paradigm, our experiment design also lets us understand the effect of AI over time since responses feed forward, allowing us to observe differences in rates of change between conditions.

2.2. Factor 1: LLM Exposure

2.2.1. Effects on Creativity

Intuitively, the effect of exposure to ChatGPT ideas will depend on how creative ChatGPT answers are relative to human ideas. In preliminary testing, we found that the answers to the AUT generated by our prompt were scored by the Organisciak et al. (2022) classifier as more creative than the ideas generated by humans (see Appendix E). LLM generations may be increasing in creativity: while GPT-3 (an earlier model than ChatGPT-3.5) scored lower than humans on the AUT (Stevenson et al., 2022), GPT-4 (a more recent model than ChatGPT-3.5) scored among the top percentile of humans on a similar verbal creativity task (Guzik et al., 2023), as measured by human judges.

Even if language models can generate creative ideas, it is unclear from prior work if mere exposure to these ideas can increase human creativity. On one hand, the associative model of brainstorming suggests that exposure to others’ ideas can stimulate idea generation by activating a non-accessible concept of a participant’s memory (Nijstad and Stroebe, 2006; Brown and Paulus, 2002; Paulus and Brown, 2007). For example, ChatGPT may come up with a use for a bottle that you never associated with bottles. This can then inspire you to come up with creative uses along this line. In this way, ChatGPT can stimulate creativity. On the other hand, there is also evidence that seeing the ideas of others inhibits a participant’s idea generation if “one is exposed to an idea that has few connections to other ideas in an individual’s semantic network” (Paulus and Brown, 2007, pg. 10). Indeed, this appeared to be the case in Yang et al. (2022). There is a possibility that AI ideas are creative but so divorced from how humans generate ideas that seeing these ideas actually has an inhibiting effect. Separate from prior academic work, there are public debates about the impact of LLMs (such as ChatGPT) on creativity (e.g., (News, 2023; Review, 2023; Jared Henderson, 2022; Krish Naik, 2023; Tubefilter, 2023; Eapen et al., 2023; Wilcot, 2023)). Many of these debates assume ChatGPT will have some impact on an individual’s creativity—either good or bad. Our work contributes empirical results to this broader public conversation.

2.2.2. Effects on Diversity

Prior work in AI co-creation finds mixed effects. Collaborating with AI can lead to more diverse (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019) or less diverse outputs (Padmakumar and He, 2024; Doshi and Hauser, 2024; Dell’Acqua et al., 2023). But note that these studies test active engagement, and most test active engagement with intentionally constructed systems. This is different from the passive, incidental exposure to AI ideas that now occurs in everyday life. Writers call ChatGPT ‘a blurry JPEG of the internet’ (Chiang, 2023) and discuss its ‘incredible blandness’ (Mangalaseril, 2023); researchers call it a ‘stochastic parrot’ (Bender et al., 2021). It is not clear, then, how passive exposure to ideas from off-the-shelf LLMs (precisely the kind we are inundated with) would affect the diversity of human ideas.

2.3. Factor 2: LLM Disclosure

2.3.1. Effects on Creativity

Building on Hwang and Won (2021), we employ the theory of social facilitation (Bond and Titus, 1983) to understand how LLM disclosure can affect human creativity. Facilitation theory is concerned with how the presence of others affects one’s performance. Hwang and Won (2021) asked participants to brainstorm with chatbots (which gave pre-programmed responses) and experimentally varied whether or not participants were told that their partner was a chatbot. Disclosing that the partner was a chatbot led to higher creativity in participant responses, which Hwang and Won (2021) attributes to the novelty of brainstorming with a chatbot. We build on this notion of facilitation as a theoretical lens. However, it is not clear if Hwang and Won (2021)’s finding (that telling people they are brainstorming with a chatbot increases creativity) would replicate in our study, especially in a post-ChatGPT era. First, we are measuring exposure and not direct engagement with chatbots. The novelty of a chatbot may be higher when you are the one working with it to generate ideas. Second, presumably, the novelty of talking to a chatbot may be lower due to the widespread popularity of ChatGPT. Moreover, we may expect heterogeneity in disclosure’s effect on creativity and diversity. It may be that users who have lower self-perceived creative abilities may feel ‘competition’ with AI due to its presence and, in turn, submit more creative responses when they know the ideas they are exposed to are from AI.

2.3.2. Effects on Diversity

It is not clear how knowing content is from AI will affect the diversity of ideas participants produce. But prior work suggests heterogeneity along two lines: the difficulty of the prompt and the attitude of the participant. (As discussed later, we measure the ‘difficulty’ of a prompt by the inverse rank of the average creativity in the control condition: if participants tended to submit lower-creativity ideas in the control condition for item X, we consider item X difficult.) Prior work suggests that disclosing ideas as AI-generated would decrease diversity due to automation bias, the tendency to over-rely on AI systems (Schemmer et al., 2022; Mosier et al., 1996; Goddard et al., 2014). Increased reliance on AI ideas (when labeled as such) could lead to lower idea diversity and higher AI adoption. Conversely, some evidence suggests people display algorithmic aversion to creative products such as haikus (Hitsuwari et al., 2022) or art (Kirk et al., 2009). This aversion would yield the opposite prediction. Roemmele (2021) found that seeing AI examples only affected the participant’s writing on a key measure for difficult prompts, suggesting creative task difficulty might moderate the effect of disclosure on AI adoption. Task confidence decreases reliance on automated systems and trust in a system increases reliance on automated systems (Goddard et al., 2014). Although this literature is not usually applied to creativity, we might then suspect that people self-reporting low creativity (i.e., low task confidence) and those who think AI is more creative than humans (i.e., high system trust) are most likely to increase adoption of AI ideas when the source is disclosed.
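The difficulty measure described in this section (ranking items by average creativity in the control condition) can be sketched as follows. The convention that rank 1 marks the hardest item, the function name, and the scores are all assumptions made for illustration; the text only specifies that lower average control creativity means a more difficult item.

```python
from statistics import mean


def difficulty_ranks(control_scores):
    """Map each item to a difficulty rank, where 1 = hardest.

    control_scores maps item -> list of creativity scores from the control
    condition. Items with lower mean creativity are treated as more difficult.
    """
    means = {item: mean(scores) for item, scores in control_scores.items()}
    ordered = sorted(means, key=lambda item: means[item])  # ascending mean creativity
    return {item: rank for rank, item in enumerate(ordered, start=1)}


# Hypothetical control-condition scores (not data from the study).
toy_scores = {"tire": [2.0, 3.0], "shoe": [1.0, 2.0], "bottle": [3.0, 4.0]}
print(difficulty_ranks(toy_scores))
```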

3. Pre-Experiment

Before describing the experiment, we discuss how we chose the five specific AUT (Alternate Uses Task) items and constructed our ChatGPT prompt.

3.1. Stimuli Construction

3.1.1. Choosing AUT Items

We had to choose a selection of items that people would brainstorm creative uses for. We chose five items for which the creativity classifier that we used had the highest accuracy. Previously, Organisciak et al. (2022) fine-tuned GPT-3 Davinci to predict the creativity of AUT responses. Their dataset contains 20,121 responses from 2,025 participants, across 21 distinct AUT items and nine distinct studies (Organisciak et al., 2022). (We obtained this dataset by direct correspondence with Dr. Organisciak on February 23, 2023; the code that Dr. Organisciak used to generate this dataset is available at https://github.com/massivetexts/llm_aut_study/blob/main/notebooks/Process_AUT_GT.ipynb.) Each response was graded for creativity by humans and normalized to a scale of 1-5. GPT-3 Davinci was then fine-tuned on this dataset to predict creativity scores. Here, fine-tuning involves providing {Input (an AUT response), Output (human rating)} pairs to a pre-trained LLM. Then the LLM adjusts its parameters to produce a similar output given an input, proxying human judgments. Overall, the fine-tuned GPT-3 classifier had a correlation of r=0.81 with human judgment. (We obtained scores for this classifier by downloading the zip file from https://github.com/massivetexts/llm_aut_study/blob/main/results/evaluation.zip, then navigating to gt_main2/gpt-ft-davinci-1.csv.) Accuracy varied by item (see Appendix). For our experiment, we picked the five items for which the classifier had the highest accuracy: tire (r=0.91), pants (r=0.91), shoe (r=0.91), table (r=0.9), and bottle (r=0.88). Our task is ‘in-domain’ for the classifier since we ask participants to do the same task for the same items the classifier was trained on.
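The item-selection step can be illustrated with a short sketch: compute the Pearson correlation between human ratings and classifier scores for each item, then keep the items with the highest r. The data layout, function names, and toy ratings below are hypothetical; only the idea of ranking items by per-item correlation comes from the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def top_items_by_agreement(ratings_by_item, k):
    """Return the k items whose classifier scores best track human ratings.

    ratings_by_item maps item -> (human_ratings, classifier_scores).
    """
    r_by_item = {
        item: pearson_r(human, model)
        for item, (human, model) in ratings_by_item.items()
    }
    return sorted(r_by_item, key=r_by_item.get, reverse=True)[:k]


# Toy ratings on a 1-5 scale (invented for illustration, not study data).
toy = {
    "tire": ([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2]),
    "rope": ([1, 2, 3, 4], [2, 1, 4, 3]),
    "brick": ([1, 2, 3, 4], [4, 3, 2, 1]),
}
print(top_items_by_agreement(toy, k=2))
```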

3.1.2. Generating GPT Ideas

We generated AI ideas with ChatGPT-3.5 and a zero-shot prompt based on prior work. These decisions followed two principles: ecological validity and precedent.

Model & Prompting Strategy

Our model and prompting strategy were driven by a desire to approximate how ordinary users would use large language models to generate ideas. First, we used ChatGPT-3.5, the latest ChatGPT model freely available at the time of the study. Because ChatGPT has a popular and accessible UI, we assume users would be more likely to use ChatGPT rather than a model available only through an API or on a limited basis. Second, we used zero-shot prompting rather than few-shot prompting. Because zero-shot prompting requires no labeled data, this would be a more natural use case for most users.

Prompt Construction

Our specific zero-shot prompt was informed by prior work on LLMs and creativity. Stevenson et al. (2022) administered the AUT to GPT-3 through a zero-shot prompt. However, this prompt generated much wordier responses (M=25.4, SD=8.5) than the human responses in the Organisciak Dataset (M=4.6, SD=5). Such a discrepancy would alert participants to what was AI vs. human generated, which would nullify the disclosure factor (whether the source of an idea is disclosed). Hence, we appended a request (Figure 2) to use roughly the same number of words (5) as the average human response. The modified prompt resulted in responses with an average word length (M=4.4, SD=1.3) much closer to human responses than the un-modified Stevenson prompt, p<0.001 by two-tailed permutation tests. We use the ideas from this modified Stevenson prompt for our experiments (Appendix B for more details).
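For readers unfamiliar with permutation tests, the following is a generic sketch of a two-tailed, two-sample permutation test on the difference in mean word counts. The exact test statistic used in the study is not stated in this section, so the choice of statistic, the function name, and the word counts below are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-tailed permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical word counts per response (not data from the study).
wordy_counts = [25, 26, 24, 27, 23, 25, 26, 24]
short_counts = [4, 5, 4, 5, 4, 5, 4, 5]
print(permutation_test(wordy_counts, short_counts, n_permutations=2000))
```

Word counts for real responses could be obtained with something as simple as `len(response.split())`.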

What are some creative uses for a [OBJECT]? The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. List creative uses for a [OBJECT]. Make sure each response is [MEAN HUMAN WORDS] words.
Figure 2. To generate AUT ideas, we used the zero-shot prompt from Stevenson et al. (2022) with an additional instruction at the end to match the mean length of human responses from prior work.
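As a small illustration, the template in Figure 2 can be filled programmatically for each AUT item. The helper name and the default of five words are assumptions based on the description above; how the prompt was actually submitted to ChatGPT is not shown here.

```python
# Template text follows Figure 2; placeholders are filled per item.
PROMPT_TEMPLATE = (
    "What are some creative uses for a {obj}? The goal is to come up with "
    "creative ideas, which are ideas that strike people as clever, unusual, "
    "interesting, uncommon, humorous, innovative, or different. "
    "List creative uses for a {obj}. "
    "Make sure each response is {n_words} words."
)

AUT_ITEMS = ["tire", "pants", "shoe", "table", "bottle"]


def build_prompt(obj, n_words=5):
    """Fill the zero-shot template for one object and a target response length."""
    return PROMPT_TEMPLATE.format(obj=obj, n_words=n_words)


prompts = {item: build_prompt(item) for item in AUT_ITEMS}
print(prompts["tire"])
```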

4. Experiment

4.1. Summary

We recruited participants from a mixture of social media and newsletters. Once participants clicked on the link to the experiment, they were taken to a landing page. In addition to a consent button, that landing page asked several questions. Participants were asked (1) to rate their creativity relative to other humans and to AI, (2) to rate their attitude towards AI (Nadeem, 2022, 2023), (3) age, (4) country, and (5) gender. After consenting, participants engaged in 5 trials. For each trial, a participant generated a creative use for an item under a specific experimental condition, after seeing example ideas. These example ideas constituted our experimental manipulation. Ideas fed forward to future trials such that if a participant was in the {[Control], tire} condition the example ideas the participant saw were the most recent ideas from prior participants in the {[Control], tire} condition. See Table 1 for experimental conditions. The experiment took place in the summer of 2023.

Table 1. Experiment conditions and associated factors. Participants complete the Alternate Uses Task in each condition after being exposed to prior responses generated in each condition. Conditions vary by LLM exposure (none, high or low) and LLM disclosure (source is labeled or not).
Condition | Number of AI Examples | Number of Human Examples | Source Disclosed (Y/N)
Control | 0 | 6 | N
High Exposure; Disclosed | 4 | 2 | Y
High Exposure; Not Disclosed | 4 | 2 | N
Low Exposure; Disclosed | 2 | 4 | Y
Low Exposure; Not Disclosed | 2 | 4 | N

4.2. Ethics

The experiment was approved by our university’s institutional review board.

4.3. Recruitment

We recruited volunteer participants through three sources: (1) Facebook ads, (2) Reddit, and (3) the weekly newsletter of Creative Mornings (https://creativemornings.com/), which is ‘the world’s largest face-to-face creative community’. The first author contacted Creative Mornings, who agreed to include the experiment in the newsletter. Creative Mornings is an organization geared towards creative professionals that organizes events such as talks and meetups. We ensured all participants were above 18. While we did not offer monetary compensation, we offered to give participants information about themselves, such as their creativity relative to both humans and AI and their ability to spot creative ideas. Providing information to participants about themselves is often effective for recruiting volunteer participants since it makes the task intrinsically rewarding (Reinecke and Gajos, 2015). Appendix F describes the information we provided to participants.

We recruited volunteer participants instead of crowdsourced workers for several reasons. First, we wanted participants to be intrinsically motivated since (1) many theories suggest intrinsic motivation helps creativity (Mumford and Hemlin, 2017) and (2) we did not want low-quality engagement to confound results (especially since ideas propagate forward). Second, we were interested in an international sample. Because we did not pay participants, we did not need to collect any personally identifiable information. Each user was assigned a random identifier. The experiment being anonymous created a lower barrier to recruiting international participants since GDPR was not operative. Finally, we recruited participants in a targeted manner. In particular, we wanted to generalize this experiment to two key groups: individuals who have a demonstrated interest in technology and those who have a demonstrated interest in creativity. These groups are most relevant to the phenomena in question. To this end, we reached technology-oriented users by posting the experiment in the following subreddits: r/InternetIsBeautiful, r/chatgpt, r/singularity, and r/artificial. We reached creativity-oriented users by posting the experiment in r/writing, r/poetry, and the Creative Mornings newsletter. We also used several ‘neutral’ sources to test the experiment: r/samplesize and Facebook ads. If a participant completed the experiment, then the participant was given a shareable link to their results so they could share the study.

4.4. Experiment Procedure

Once participants clicked on our link, they were taken to a landing page that included a consent form, task description, and pre-treatment questions.

4.4.1. Study Description

The description read as follows:

What you will do:
We’ll show you 5 common items, and you’ll come up with creative uses for each item. To spark your imagination, you’ll see ideas from prior participants and even from AI (i.e., ChatGPT). You’ll be asked to rank these ideas in order of creativity. The ideas you write may be anonymously shown to future participants to spark their imagination. The study takes 3-6 minutes to complete. The goal is to learn about how humans and AI brainstorm.

What you will learn:

  • How creative you are compared to other humans

  • How creative you are compared to AI

  • How well you can rank creative ideas

We will give you a shareable link with results at the end.

See Appendix F for more details on how each of these three pieces of information was calculated.

4.4.2. Pre-Treatment Questions

Participants were asked several pre-treatment questions:

  1. (required) A slider ranging from 0 to 100 that says ‘I am more creative than X% of AI’

  2. (required) A slider ranging from 0 to 100 that says ‘I am more creative than X% of Humans’

  3. (required) ‘Artificial intelligence computer programs are designed to learn tasks that humans typically do. Would you say the increased use of artificial intelligence computer programs in daily life makes you feel…’ [‘More concerned than excited’, ‘More excited than concerned’, ‘Equally excited and concerned’]

  4. (optional) What country are you from?

  5. (optional) What is your age?

  6. (optional) What is your gender?

The third question was from Pew (Nadeem, 2023, 2022). Our gender question was based on guidance from Spiel et al. (2019). We chose the Pew question instead of a longer battery of questions about AI to minimize the response burden. See Appendix C for more details about these questions.

Figure 3. Participants are randomized to a sequence of 5 trials. In each trial, participants generate a creative use for an item under a specific experimental condition. Neither items nor conditions repeat in a 5-trial sequence.

4.4.3. Randomization

Participants were assigned a sequence of 5 trials, where each trial was a {[condition], item} pair. For example, one trial might be a creative idea for pants in the [High Exposure, Disclosed] condition. We mapped each AUT item (pants, tire, shoe, bottle, table) to one of the five conditions such that neither conditions nor items repeated in a 5-item sequence. See Figure 3 for a visual explanation.
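The constraint that neither items nor conditions repeat within a participant's sequence can be illustrated with a short sketch. The paper does not state how the item-to-condition mapping was drawn, so pairing two independently shuffled lists is an assumption, and `assign_trials` is a hypothetical name.

```python
import random

ITEMS = ["pants", "tire", "shoe", "bottle", "table"]
CONDITIONS = [
    "Control",
    "High Exposure, Disclosed",
    "High Exposure, Not Disclosed",
    "Low Exposure, Disclosed",
    "Low Exposure, Not Disclosed",
]


def assign_trials(rng: random.Random) -> list[tuple[str, str]]:
    """Return five (condition, item) trials with no repeated condition or item.

    Assumption: both lists are shuffled independently and paired by position.
    """
    items = ITEMS[:]
    conditions = CONDITIONS[:]
    rng.shuffle(items)
    rng.shuffle(conditions)
    return list(zip(conditions, items))


trials = assign_trials(random.Random(0))
```

Because each list is a permutation of five distinct values, every participant sees each item once and each condition once, whatever the seed.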

4.4.4. Task Instructions

For each trial, participants were asked to first rank a list of example ideas in order of creativity and then submit their own idea:

Task
For this task, you will submit a creative use for a [ITEM]. But before submitting your idea, here are some ideas for inspiration. Rank them by creativity.

Rank Previous Ideas

  • Rank these ideas in order of creativity, with the most creative use on top. Drag ideas to rank them.

  • We’ll show you how your rankings compare to rankings from a highly accurate model.

[SORTABLE EXAMPLE IDEAS HERE]

Submit Your Idea
Your turn! What is a creative use for a [ITEM]? The goal is to come up with a creative idea, which is an idea that strikes people as clever, unusual, interesting, uncommon, humorous, innovative, or different. List a creative use for a [ITEM].

See Appendix M for screenshots. We asked participants to rank ideas to ensure that they would engage with the example ideas. (We did not use these rankings as a dependent variable: a participant ranks only the examples they are shown, and all of those examples come from the same condition, so the ranks cannot speak to between-condition differences, which are the focus of the paper.) Depending on the condition, (1) there either were or were not AI ideas in this example set (exposure), and (2) AI ideas either were or were not labeled (disclosure). We use the same prompt for humans (the text under Submit Your Idea) as for ChatGPT (Stevenson et al., 2022), but with a slight modification to request a single idea. This prompt contains language consistent with best practices in divergent thinking assessment (Beaty et al., 2021). After submitting an idea, participants received feedback on their idea’s uniqueness and how accurately they ranked the example ideas (Appendix M).

4.4.5. Response Chains

Figure 4. Participants see example ideas from prior participants in the same condition. These ‘response chains’ reset every 20 responses.
Logic

The human ideas that participants saw came from prior participants in the same {[condition], item} combination. See Figure 4. For instance, if a user was placed in the [Control] condition for a tire, that user would see six human ideas—the most recent six ideas for a tire under the [Control] condition. In order to avoid overfitting to a specific idea sequence, we reset this ‘response chain’ every 20 trials. So, the first 20 participants in the {[Control], tire} combination would see each other’s ideas, but the chain would reset for the 21st respondent. We use the logic described in this paragraph and Figure 4 for the human ideas in all conditions.

Note that because human ideas are propagated at the {[condition], item} level, the human ideas in the [Control] condition are ‘clean’ from AI contamination. They were brainstormed after seeing sets of human-only ideas, also from the [Control] condition.

We ran seven response chains for each of the 25 (5 items × 5 conditions) combinations, corresponding to 175 response chains in all and 3500 targeted responses (175 response chains × 20 trials per chain).

Human Seeds

Of course, there is a bootstrapping problem—what human ideas does the first person in the {[Control], tire} condition see? The seeds for each {[condition], item} combination came from prior responses from the Organisciak Dataset. That is, Participant 1 for a {[Control], tire} response chain would see 6 seed items. Then Participant 2 in the same response chain would see 5 seed items plus Participant 1’s idea (the order of ideas is randomized). Participant 3 would see 4 seed items plus Participant 1 and Participant 2’s ideas, etc. We chose a random sample of seeds for each {[condition], item} combination from the Organisciak Dataset. The dataset labeled ideas with gold-standard human ratings of originality. We conducted an ANOVA and found no significant condition-level difference in the originality of the seeds we used.
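The seeding and feed-forward logic can be sketched as a sliding window over one response chain. This is an illustrative reading of the description, not the experiment's code: `examples_shown` is a hypothetical name, which seeds are retired first is an assumption, and the randomization of display order is omitted.

```python
def examples_shown(seeds, chain_ideas, n_human=6):
    """Human examples for the next participant in one response chain.

    seeds: seed ideas for this chain (at least n_human of them).
    chain_ideas: ideas submitted so far in this chain, oldest first.
    n_human: number of human example slots (6 in the Control condition).
    """
    recent = chain_ideas[-n_human:]   # most recent submissions in the chain
    n_seeds = n_human - len(recent)   # remaining slots are filled by seeds
    return seeds[:n_seeds] + recent


seeds = [f"seed {i}" for i in range(1, 7)]
shown_to_p1 = examples_shown(seeds, [])                    # six seeds
shown_to_p2 = examples_shown(seeds, ["idea 1"])            # five seeds + one idea
shown_to_p3 = examples_shown(seeds, ["idea 1", "idea 2"])  # four seeds + two ideas
```

Since `chain_ideas` is kept per response chain, starting a new chain with an empty list reproduces the reset after 20 responses.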

5. Recruited Participants

Table 2. Summary Statistics of Experiment
Statistic | Value
Unique Countries | 48
Total Responses | 3414
Unique Participants | 844
Avg Responses/Participant | 4.05
Avg Duration/Response | 144.31
Table 3. Sources of participants and trials. For analysis, we categorized each source into a higher-level interest group (technology, creativity, neutral).
Interest Group | Source | Participants (N, % of total) | Trials (N, % of total)
creative | Creative Mornings newsletter | 343 (40.6%) | 1470 (43.1%)
technology | r/InternetIsBeautiful | 298 (35.3%) | 1115 (32.7%)
neutral | r/samplesize | 94 (11.1%) | 389 (11.4%)
neutral | share | 61 (7.2%) | 250 (7.3%)
technology | r/chatgpt | 19 (2.3%) | 79 (2.3%)
creative | r/writing | 7 (0.8%) | 30 (0.9%)
neutral | other | 6 (0.7%) | 22 (0.6%)
technology | r/singularity | 6 (0.7%) | 13 (0.4%)
technology | r/artificial | 5 (0.6%) | 24 (0.7%)
creative | r/poetry | 3 (0.4%) | 15 (0.4%)
neutral | facebook | 2 (0.2%) | 7 (0.2%)

We received over 3000 responses from 48 countries. See Appendix G for sample characteristics. Out of a total of five trials, participants finished four trials on average (Table 2), suggesting the experiment was engaging. Most participants came from the Creative Mornings newsletter or r/InternetIsBeautiful (Table 3 for source counts and categorization). The sample was 50% women, 43% men, 4% non-binary, 3% not disclosed, 1% self-described. The mean age was 34.92 (SD = 10.86). Regarding AI, the sample was 48% neutral, 28% excited, 24% concerned. Participants said they were more creative than 57.86% (SD = 26.66) of AI and 58.67% (SD = 23.65) of humans. See Appendix Figure 11 for kernel density plots. Users from neutral interest groups who were concerned about AI tended to have low self-reported creativity.

6. Outcome Measures

We have three outcome measures (idea diversity, creativity, and AI adoption) and three levels of analysis (local, evolution, and global). See Table 4. The local level measures outcomes at the level of an individual trial (e.g., how a submitted response relates to example responses). The evolution level measures the rate of change of outcome variables with respect to the trial number in the response chain (i.e., experiment iteration). The global level compares all submitted responses in a condition to each other. For all pairwise comparisons, we use a Holm-Bonferroni adjustment for multiple comparisons. For idea diversity and AI adoption, we scale the dependent variable (cosine distance or cosine similarity, respectively) by 100 for easier interpretation.

Table 4. We study three outcome measures (idea diversity, creativity, AI adoption) at three levels of analysis (local, global, and evolution). If a level of analysis is not appropriate for an outcome measure, we put ‘Not applicable’ in that cell. All ideas are embedded using SBERT.
Outcome | Local | Global | Evolution
Creativity | How creative is the submitted response? This is measured by the prediction of the classifier from Organisciak et al. (2022). | Not applicable | Does the creativity of submitted ideas change over time? This is measured by the slope of the response chain’s trial number (i.e., iteration in the response chain) on creativity (the metric from Organisciak et al. (2022)).
Idea Diversity | How different is a participant’s response from example responses? This is measured by the median pairwise semantic distance between ideas a participant sees and their response. | How diverse were all participants’ ideas in a condition? This is measured by the median pairwise distance between all submitted ideas in a condition. | Do ideas become more different from each other as the experiment goes on? We first measure the median pairwise distance (‘idea diversity’) of ideas at each trial number (i.e., iteration in the response chain). We then measure the slope of the trial number on idea diversity.
AI Adoption | How similar is a participant’s response to AI example responses? This is measured by the maximum semantic similarity between a participant’s response and AI examples. | Not applicable | Not applicable
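To make the local idea diversity measure concrete, the sketch below computes the median cosine distance between a response embedding and the embeddings of the examples shown, multiplied by 100. It assumes the embeddings are already available as plain lists of floats (for instance from an SBERT model) and reads ‘pairwise’ as response-to-example pairs; both points are assumptions, and the function names are hypothetical.

```python
import math
from statistics import median


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def local_idea_diversity(response_vec, example_vecs):
    """Median distance from the response to each example, scaled by 100."""
    distances = [cosine_distance(response_vec, e) for e in example_vecs]
    return 100 * median(distances)


# Toy two-dimensional "embeddings", for illustration only.
toy_score = local_idea_diversity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In the toy case the three distances are 0, 1, and 1 - 1/√2, so the median is the last of these.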

6.1. Local Level

Outcomes at the local level (the level of an individual trial) are useful for two reasons. First, this level shows how a participant’s response relates to the examples they see. Second, this level lets us model whether individual differences moderate the effect of either exposure or disclosure. For each of our local outcomes, we have a baseline model that uses crossed random intercepts to account for the multilevel structure of the experiment. The first random intercept is for participants, accounting for clustering due to repeated measures. This random intercept is crossed with a second random intercept for response chains, which we nest inside {[condition], item} combinations. In R syntax, the random effect structure was ... + (1|ParticipantID) + (1|ItemCondition/ResponseChainID); see Figure 4 for a visual explanation of how response chains are nested in items and conditions. Models were fit with the lme4 R package. We computed profile likelihood confidence intervals for coefficients using R’s confint function. We used estimated marginal means (emmeans R package) to conduct model-adjusted F-tests, linear contrasts, predictions, and pairwise comparisons. We apply Holm-Bonferroni adjustments to pairwise comparison p-values. Our baseline ‘local’ model is:

\begin{align*}
\text{variable}_{ijk} ={}& \beta_{0} + \beta_{1}\,\text{condition}_{j} + \beta_{2}\,\text{CreativityHuman}_{i} + \beta_{3}\,\text{AiRelCreate}_{i} \\
&+ \beta_{4}\,\text{AiFeeling}_{i} + \beta_{5}\,\text{InterestGroup}_{i} + \beta_{6}\,\text{ConditionOrder}_{ijk} \\
&+ \beta_{7}\,\text{LogDuration}_{ijk} + \beta_{8}\,\text{nSeedsPresent}_{ijk} + \beta_{9}\,\text{TrialNo}_{jk} \\
&+ u_{0i} + v_{0jk} + e_{ijk}
\end{align*}

where

  • i indexes participants, j indexes item-condition combinations, and k indexes response chains.

  • CreativityHuman is self-perceived creativity relative to other humans.

  • AiRelCreate is constructed as (self-perceived creativity relative to humans) - (self-perceived creativity relative to AI). Note that this is an implicit measure of AI’s creativity relative to humans. For example, if you say you are more creative than 40% of humans and 60% of AI, then AiRelCreate = 40 - 60 = -20, as the implicit belief is that AI is less creative (-20 percentile points) than humans. Conversely, if you say you are more creative than 50% of humans but only 30% of AI, then AiRelCreate = 50 - 30 = +20, as the implicit belief is that AI is more creative (+20 percentile points) than humans.

  • AiFeeling refers to the AI sentiment question.

  • InterestGroup maps each source of the experiment to categories: creative, neutral, or technology. These categories are described in Table 3.

  • ConditionOrder denotes the position of the trial in the participant’s own sequence of trials (e.g., 1 for the first trial a participant completed).

  • LogDuration is the natural logarithm of the time (in seconds) a participant spent before submitting their answer.

  • nSeedsPresent controls for the number of examples the participant saw that were seed ideas from the Organisciak Dataset.

  • TrialNo indicates the trial number within a specific response chain. For example, the 18th response in {[Control], tire, response chain 5} has TrialNo = 18.
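As a small worked example of the AiRelCreate construction, the sketch below takes the two slider answers and returns their difference. The function name `ai_rel_create` is hypothetical; the sign convention (positive values mean AI is implicitly rated as more creative than humans) is the one the paper uses when reporting the relative AI creativity interaction.

```python
def ai_rel_create(pct_humans: float, pct_ai: float) -> float:
    """AiRelCreate = (self-rated percentile vs. humans) - (self-rated percentile vs. AI)."""
    return pct_humans - pct_ai


# More creative than 40% of humans and 60% of AI: AI is implicitly rated lower.
low_view_of_ai = ai_rel_create(40, 60)

# More creative than 50% of humans and 30% of AI: AI is implicitly rated higher.
high_view_of_ai = ai_rel_create(50, 30)
```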

We balanced our interest in testing whether the effects of conditions differed by subgroup against caution about over-fitting the model. We considered interactions between the treatment condition and four potential moderators: self-perceived human creativity (CreativityHuman), AI creativity relative to humans (AiRelCreate), feeling towards AI (AiFeeling), and interest group (InterestGroup). We first conducted likelihood ratio tests of whether adding each interaction improved our baseline model. Moderators were kept only if they significantly improved the fit (p < 0.05). See Appendix Table 11 for retained moderators. Then, we used emmeans to probe and interpret moderating effects.

6.2. Global

Intuitively, the global diversity of ideas in a condition measures how similar or different submitted ideas in a condition tend to be. The relevant level of aggregation here is all of the submitted ideas at a {[condition], item} level. For example, consider the total set of ideas participants submitted for a tire in the [Control] condition. Is this set of ideas more diverse from each other than the set of submitted ideas for a tire in the [High Exposure, Disclosed] condition?

We used a Monte Carlo procedure and permutation tests to assess if conditions differed with respect to these metrics. In each of 50 Monte Carlo runs, for each {[condition], item} combination, we randomly sampled 50 ideas and computed idea diversity metrics. For each pair of conditions, we then conducted a paired permutation test (paired at the level of Monte Carlo seeds and items) with 10,000 iterations to see if the two conditions differed on these metrics. As a non-parametric measure of effect size, we also calculate Cliff’s delta (δ), which ranges from -1 to 1. A value of 0 indicates no difference between the two conditions, +1 indicates values from the first condition are always larger, and -1 indicates the opposite. See Appendix I.1 for more details.
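The two statistics named above can be sketched in a few lines. Cliff's delta follows its standard definition. The paired permutation test is written here as a two-sided sign-flip test on the mean paired difference, which is an assumption about the exact test used (Appendix I.1 has the authors' procedure). Function names are hypothetical.

```python
import random
from statistics import mean


def cliffs_delta(xs, ys):
    """(#pairs with x > y minus #pairs with x < y) / (number of pairs)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def paired_permutation_test(xs, ys, n_iter=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    xs[i] and ys[i] must come from the same pairing unit
    (here: the same Monte Carlo seed and item).
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_iter):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_iter + 1)
```

With `xs` holding one condition's diversity scores and `ys` the control condition's scores for the same seeds and items, a positive `cliffs_delta(xs, ys)` means the first condition tends to have the larger values.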

6.3. Evolution

6.3.1. Creativity

To test if conditions differed in their evolution of creativity, we conducted a likelihood ratio test on whether an interaction between condition and TrialNo significantly improved the fit of the local creativity model.

6.3.2. Idea Diversity

Intuitively, we are interested in whether, as the experiment goes on, the ideas that participants submit tend to become more or less similar to each other. We use the trial number in a response chain to index time in the experiment. For example, is the set of submitted responses at trial number 4 more or less similar to each other than the set of submitted responses at trial number 20? Here, the diversity of interest is not between a submitted response and example responses but between all submitted responses at a given ‘time point’ (i.e., trial number). The question is whether diversity increases or decreases as the experiment goes on and whether this rate of change differs by condition. The mechanics of our process are as follows (see Appendix I.3 for more details).

  1. We first ‘pooled’ together all ideas at the {[condition], item, trial number} level, across response chains. For example, consider all ideas for a tire for the [Control] condition that were the fourth response in a response chain. We refer to this set as a ‘pool’ of ideas.

  2. We next computed idea diversity measures for each pool of ideas, where idea pools were defined in (1). We use the same metrics that we measure at a local level for idea diversity. Median pairwise distance is our main measure. We conduct robustness checks using mean pairwise distance and mean distance from the centroid. Each metric shows qualitatively similar results.

  3. We then fit a mixed model (items as random intercepts) to test if the slope of trial number on idea diversity differed by condition. That is: Are submitted responses in some conditions changing at a faster rate?
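The pooling and slope steps above can be sketched as follows for a single {[condition], item} combination. Note the simplification: the paper fits a mixed model with item random intercepts, whereas this sketch uses an ordinary least-squares slope as a stand-in, and the toy vectors stand in for SBERT embeddings. All names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pool_diversity(vectors):
    """Median pairwise cosine distance (x100) within one pool of ideas."""
    return 100 * median([cosine_distance(u, v) for u, v in combinations(vectors, 2)])


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (simplified stand-in for the mixed model)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Toy pools keyed by trial number: ideas grow more similar over trials.
pools = {
    1: [[1.0, 0.0], [0.0, 1.0]],
    2: [[1.0, 0.0], [1.0, 1.0]],
    3: [[1.0, 0.0], [1.0, 0.0]],
}
trial_numbers = sorted(pools)
diversity = [pool_diversity(pools[t]) for t in trial_numbers]
slope = ols_slope(trial_numbers, diversity)
```

A negative slope, as in this toy data, corresponds to ideas converging as the chain progresses; comparing slopes across conditions is what the condition-by-trial-number interaction in the mixed model tests.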

7. Results

7.1. Creativity

We found no effect of conditions on creativity. Average individual creativity did not significantly differ by condition (F(4, 19.86) = 0.12, p = 0.97), and no condition coefficient differed from zero in our regression (Appendix Table 19). Hence, we find no evidence that AI exposure or AI disclosure affected individual creativity. Additionally, we tested whether the evolution of creativity differed by condition via a likelihood ratio test of whether interacting trial number and experimental condition would improve the model fit. The likelihood ratio test indicated that allowing for these interactions did not significantly improve the model fit (χ²(4) = 6.52, p = 0.16). (When this interaction was added, there was a small, negative interaction effect between trial number and the [High Exposure, Disclosed] condition, β = -0.015, t(3248) = -2.21, 95% CI = [-0.03, -0.002]. However, given (1) the size of the interaction, (2) the absence of overall differences, and (3) the null likelihood ratio test, we do not interpret this interaction.) In short, we do not find enough evidence to conclude creativity was affected by experimental conditions.
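The reported likelihood ratio p-value can be checked from the test statistic alone. The sketch below uses the closed-form upper tail of the chi-square distribution for even degrees of freedom; the log-likelihood values passed to `lrt_statistic` are hypothetical numbers chosen only to reproduce the reported statistic of 6.52, and the function names are not from the paper.

```python
import math


def lrt_statistic(loglik_reduced: float, loglik_full: float) -> float:
    """Likelihood ratio statistic: 2 * (log-likelihood of full - reduced model)."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


# Hypothetical log-likelihoods that give the reported statistic of 6.52.
stat = lrt_statistic(loglik_reduced=-100.0, loglik_full=-96.74)

# Four extra parameters: one trial-number slope per non-reference condition.
p_value = chi2_sf_even_df(6.52, df=4)
```

Evaluating the closed form at 6.52 with 4 degrees of freedom gives roughly 0.16, in line with the value reported above.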

7.2. Idea Diversity

7.2.1. Local Level

Intuitively, local idea diversity is how different a response is from the examples a participant sees. There was no main effect of condition (F(4, 19.95) = 0.09, p = 0.98), and the effect of self-perceived creativity did not differ by condition (F(4, 2650.44) = 1.59, p = 0.18). The effect of belief in AI’s relative creativity did differ by condition (F(4, 2635.80) = 2.93, p = 0.02). However, post-hoc estimated marginal means showed non-significant contrasts, so we refrain from interpreting this finding. As robustness checks, we ran the same specification with two alternative measures of idea diversity, mean pairwise distance and distance from the centroid. Regression results are broadly similar. See Appendix I.2 for a more in-depth discussion, regression results, and pairwise comparisons.

7.2.2. Global

Figure 5. Median pairwise distance of submitted ideas in a condition. There was more global diversity of ideas in the high AI exposure conditions than in the control condition.

By measuring global idea diversity, we capture how different the submitted ideas in a condition are from one another. This can be thought of as a measure of collective idea diversity. See Appendix I.1 for more details on the procedure. Across a range of different metrics, high AI exposure conditions had more global idea diversity than the control condition (Figure 5; Appendix Tables 12, 13, 14). Of the metrics that we measured, the median pairwise distance provides the most conservative estimate. But even for median pairwise distance, both the [High Exposure, Disclosed] (Cliff’s δ = 0.31 on a scale of -1 to 1) and [High Exposure, Undisclosed] (δ = 0.26) conditions had more idea diversity than the control condition. Of the low exposure conditions, only the [Low Exposure, Undisclosed] condition (δ = 0.11) had higher global diversity than the control condition, with a much smaller effect size than the high exposure conditions. Hence, high AI exposure (but not necessarily low AI exposure) increases global idea diversity.

7.2.3. Evolution

Figure 6. High exposure to AI ideas increased the rate of change in idea diversity.

By measuring the evolution of idea diversity, we capture the rate of change in idea diversity across trials. See Appendix I.3 for more details on the procedure. Relative to the control condition, the conditions with high exposure to AI ideas (but not low exposure to AI) had increased rates of change in idea diversity. See Figure 6 for estimated marginal means predictions and Appendix Table 18 for regression results. As with global idea diversity, different metrics yielded similar regression coefficients. In the control condition, idea diversity decreased over trials (β = -0.39, t(349) = -2.23, 95% CI = [-0.73, -0.05], p = 0.03). That is, submitted ideas were becoming more similar to each other as the experiment went on. Relative to the control condition, however, the slope of idea diversity with respect to trial number was more positive for the [High Exposure, Undisclosed] condition (β = 0.53, t(349) = 2.2, 95% CI = [0.06, 0.99], p = 0.03) and the [High Exposure, Disclosed] condition (β = 0.57, t(349) = 2.37, 95% CI = [0.1, 1.03], p = 0.02). The rate of change in idea diversity for the low AI exposure conditions did not differ from the rate of change in the control condition. Thus, we conclude that high exposure to AI ideas increased the rate of change of idea diversity relative to the no-AI control condition.

7.3. AI Adoption

7.3.1. Local Level

At the local level, we measured AI adoption by the maximum cosine similarity between a participant’s response and AI examples the participant saw. There was a main effect of condition (F(3, 16.59) = 4.33, p = 0.02). But we would expect higher similarity to AI ideas in the high-exposure condition even by chance (since there are more AI ideas), so we do not interpret main effects and instead focus on subgroup differences and effects of disclosure in the high-exposure condition. We found that the effect of conditions did not differ by interest groups (F(6, 719.77) = 1.98, p = 0.07), but the effect of conditions did differ by self-perceived creativity (F(3, 1984.95) = 5.18, p = 0.001) and relative AI creativity (F(3, 1974.58) = 2.9, p = 0.03). As robustness checks, we ran the same specification with two alternative measures of AI adoption, mean and median AI adoption. The coefficients of our regression are broadly similar. See Appendix K for regression results and post-hoc contrasts.
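The adoption measure can be written down directly: take the cosine similarity between the response embedding and each AI example embedding, keep the maximum, and multiply by 100. The sketch assumes embeddings are available as lists of floats; `ai_adoption` is a hypothetical name and the vectors are toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ai_adoption(response_vec, ai_example_vecs):
    """Maximum similarity between the response and any AI example, scaled by 100."""
    return 100 * max(cosine_similarity(response_vec, a) for a in ai_example_vecs)


# Toy embeddings: the second AI example is the closer of the two.
toy_adoption = ai_adoption([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Because a maximum over four AI examples can only be at least as large as a maximum over a subset of two, this measure tends to be higher in high-exposure conditions even without any real adoption, which is why the main effect of condition is not interpreted.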

Figure 7. In the [High Exposure, Disclosed] condition, participants high in self-reported creativity had higher AI adoption (raw data).
Figure 8. Adoption of AI ideas by self-perceived creativity in the [High Exposure] conditions (estimated marginal means).
Figure 9. Higher creativity participants adopted ideas solely based on exposure, not disclosure (estimated marginal means).

Exposure to AI ideas increased adoption for (self-perceived) high-creativity participants regardless of disclosure, but this was not the case for (self-perceived) low-creativity participants. There was a significant interaction between self-perceived human creativity and the [High Exposure, Disclosed] condition (β=0.11𝛽0.11\beta=0.11italic_β = 0.11, t(2588)=3.93𝑡25883.93t(2588)=3.93italic_t ( 2588 ) = 3.93, 95% CI = [0.06,0.17],p=0.00010.060.17𝑝0.0001[0.06,0.17],p=0.0001[ 0.06 , 0.17 ] , italic_p = 0.0001; Appendix Table 23). To probe this interaction, we used our model to predict AI adoption by condition for both the top 10% and bottom 10% of participants by self-perceived creativity (Figures 9, 9 and 9). For high-creativity participants (Figure 9), adoption rates appear to differ only by exposure (color) and not disclosure (shape). More formally, we tested whether the effect of exposure on adoption is larger when AI ideas are disclosed vs undisclosed. We find that for high-creativity participants, there is no difference in adoption between ([High Exposure, Undisclosed] - [Low Exposure, Undisclosed]) and ([High Exposure, Disclosed] - [Low Exposure, Disclosed]), Δ=1.69,d=0.14,p=0.59formulae-sequenceΔ1.69formulae-sequence𝑑0.14𝑝0.59\Delta=-1.69,d=-0.14,p=0.59roman_Δ = - 1.69 , italic_d = - 0.14 , italic_p = 0.59. That is, the effect of exposure is not moderated by disclosure. But for low-creativity participants, the difference in adoption for the undisclosed conditions ([High Exposure, Undisclosed] - [Low Exposure, Undisclosed]) was larger than the equivalent difference in adoption for disclosed conditions ([High Exposure, Disclosed] - [Low Exposure, Disclosed]), Δ=7.77,d=0.65,p=0.03formulae-sequenceΔ7.77formulae-sequence𝑑0.65𝑝0.03\Delta=7.77,d=0.65,p=0.03roman_Δ = 7.77 , italic_d = 0.65 , italic_p = 0.03. That is, disclosing ideas as from AI reduced the effect of exposure on adoption for lower (self-reported) creativity participants. 
In summary, higher (self-reported) creativity people adopt AI ideas solely based on content, and not disclosure.
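
The contrast tested above can be written compactly as a difference of differences. This is only a sketch: the symbol $\hat{\mu}$ for a predicted (estimated marginal mean) adoption level is notation introduced here, and the order of subtraction (undisclosed minus disclosed) is inferred from the signs of the reported values rather than stated explicitly.

```latex
\Delta =
\left(\hat{\mu}_{\text{High, Undisclosed}} - \hat{\mu}_{\text{Low, Undisclosed}}\right)
-
\left(\hat{\mu}_{\text{High, Disclosed}} - \hat{\mu}_{\text{Low, Disclosed}}\right)
```

Under this reading, $\Delta \approx 0$ means disclosure does not moderate the effect of exposure, while $\Delta > 0$ means the effect of exposure is larger when ideas are not labeled as coming from AI.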

We also found that one’s attitude about AI’s creativity affected adoption, though this had a smaller effect than self-perceived creativity. There was a significant interaction between the [High Exposure, Undisclosed] condition and relative AI creativity (positive values imply AI is more creative than humans), $\beta = -0.07$, $t(2588) = -2.61$, 95% CI = $[-0.13, -0.02]$, $p = 0.01$. We used estimated marginal means to probe this interaction by predicting AI adoption for the top and bottom decile of participants by belief in relative AI creativity. We found that in the [High Exposure, Undisclosed] condition, people who believed AI was uncreative (bottom decile of AiRelCreate) were slightly more likely to adopt AI ideas than people who believed AI was creative (top decile of AiRelCreate), $\Delta = 4.66$, $d = 0.39$, $p = 0.005$. But no such difference existed in the [High Exposure, Disclosed] condition. This may suggest labeling sources as AI neutralizes adoption among users who do not think AI is creative.

Figure 10. Adoption of AI ideas in the [High Exposure, Disclosed] condition versus ‘difficulty’ of prompt. When ideas were disclosed as from AI, participants adopted AI ideas for difficult prompts.

In addition to who adopts AI ideas, we also measured when AI ideas are adopted. We found that people adopt AI ideas for difficult prompts rather than easier prompts (Figure 10). To measure the ‘difficulty’ of an item prompt, we first calculated the mean creativity of a response to an item in the control condition. Then we reverse-ranked items such that high mean creativity implies low difficulty and vice versa. We measured ‘AI adoption’ of an item by the average of the trial-level maximum similarity to AI examples. We then examined the rank-order correlation between item difficulty and AI adoption in high-exposure conditions. If task difficulty leads people to rely on AI, then we should see a larger correlation between item difficulty and AI adoption in the [High Exposure, Disclosed] condition than in the [High Exposure, Undisclosed] condition. That is what we find. The rank-order correlation between difficulty and adoption was $\rho = 0.8$ for the [High Exposure, Disclosed] condition but only $\rho = 0.3$ for the [High Exposure, Undisclosed] condition. That is, when people were told ideas were from AI, they were more likely to adopt AI ideas if the prompt was difficult. Since we employed only five items, we view this finding as speculative; future work should test this relationship with a larger number of stimuli.
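
The difficulty and adoption measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the function names, the item labels, and all numeric values are hypothetical placeholders, and the rank correlation is a plain Spearman implementation with average ranks for ties.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks (ascending); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Rank-order (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def item_difficulty(control_creativity):
    """Reverse-rank items by mean creativity in the control condition.

    High mean creativity -> low difficulty rank; low mean creativity -> high rank.
    """
    items = list(control_creativity)
    means = [mean(control_creativity[item]) for item in items]
    return dict(zip(items, rank([-m for m in means])))


def item_adoption(max_similarity_by_trial):
    """Average the trial-level maximum similarity to AI examples, per item."""
    return {item: mean(v) for item, v in max_similarity_by_trial.items()}


# Hypothetical placeholder data (NOT from the study).
control = {"item_a": [3.0, 3.2], "item_b": [2.0, 2.2], "item_c": [1.0, 1.2]}
similarity = {"item_a": [0.2, 0.3], "item_b": [0.4, 0.5], "item_c": [0.6, 0.7]}

difficulty = item_difficulty(control)
adoption = item_adoption(similarity)
items = sorted(difficulty)
rho = spearman([difficulty[i] for i in items], [adoption[i] for i in items])
print(round(rho, 3))
```

In this sketch, the correlation would be computed once per high-exposure condition and the two values compared. With only five items, as noted above, any such correlation is a coarse summary.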

8. Discussion

Against the backdrop of a massive increase in LLM exposure, we asked: How does exposure to ideas generated by LLMs affect the creativity, diversity, and evolution of human ideas? To answer this, we conducted a large-scale experiment where participants submitted ideas in response to the Alternate Uses Task (a measure of creativity where people brainstorm novel uses of an item) after seeing a set of example ideas. The examples were from prior participants in the same experimental condition or—in some conditions—ChatGPT. The evolving aspect of our experiment, that ideas in a condition feed forward to subsequent trials in that condition, captures the interdependent nature of idea formation and lets us model the evolutionary effects of having AI ‘in the culture loop’. Here are three takeaways from our experiment.

8.1. AI makes ideas different but not better.

Most notably, exposure to AI ideas did not, on average, make human ideas any ‘better’ or ‘worse’ (by creativity). Our high-powered, null finding around average creativity by condition can inform debates about the effect of AI ideas on individual human creativity. Maybe there is little effect. Of course, our experiment is measuring just a single task. But these results suggest that perhaps both worry and optimism around the effect of AI ideas on individual human creativity should be tempered.

Our null finding around creativity contrasts with some prior work suggesting human-AI co-creation enhances the quality of creative outputs (Mizrahi et al., 2020; Yuan et al., 2022; Roemmele, 2021; Hitsuwari et al., 2022). But our study differs from prior studies in its aim and design: We test passive exposure to off-the-shelf LLMs—not active engagement with optimized-for-creativity AI aides. The latter is useful for understanding how AI could affect creativity. But we aim to approximate how ordinary, existing, and pervasive AI tools do affect the creativity of ideas. At least for this task, we find no evidence of such an effect.

On the other hand, the presence of AI ideas increased the diversity of human ideas. This is consistent with work suggesting collaborating with AI leads to more diverse or unexpected outputs (Yang et al., 2022; Osone et al., 2021; Lee et al., 2022; Branch et al., 2021; Gero and Chilton, 2019) and inconsistent with other work that finds collaborating with LLMs decreases diversity (Padmakumar and He, 2024; Doshi and Hauser, 2024; Dell’Acqua et al., 2023). But we highlight that our study is testing passive exposure to AI ideas and not active engagement with AI ideas. Our setup—passive exposure to AI ideas, scattered amongst human ones—maps onto how many users now experience AI ideas. Hence, it may be that active engagement with LLMs decreases content diversity but simply seeing these ideas as ‘sparks’ (Gero et al., 2022) increases content diversity. And because many more people may be passively exposed to LLM outputs than actively engaging with LLMs, the effect of passive exposure is important to understand.

Crucially, high AI exposure increased both average amounts of diversity and rates of change in idea diversity. The latter result is especially important. Small differences in rates of change can yield large aggregate differences over time. Future work—both simulations and dynamic experiments—can explore the implications of this increase in collective idea diversity unaccompanied by an increase in average individual creativity. For instance, can this dynamic generate ‘innovation’?

Our finding around the evolution of diversity (Figure 6) is instructive. Seeing other people’s ideas reduced idea diversity in the control condition over time. This may suggest that successive participants were converging on a particular idea sequence. But then injecting AI ideas into the example set increased the diversity of submitted responses by partially ‘resetting’ this convergence. Our finding relates to recent work proposing AI systems that generate ‘alien’ scientific hypotheses humans would not think of (Sourati and Evans, 2023). More generally, a promising avenue for future work: Can AI input reduce ‘groupthink’?

8.2. High creativity people are less influenced by the source label of ideas.

Participants who viewed themselves as highly creative had the same levels of adoption of AI ideas in both disclosed and non-disclosed conditions. But for lower-creativity participants, knowing the source of an idea did affect the adoption of that idea. Perhaps people high in self-reported creativity relied less on source cues when adopting ideas because they were more confident in their ability to judge an idea’s creative merit. Future work can employ think-alouds to better understand how AI disclosure affects the idea-generation process itself. Regardless, our results suggest that (self-reported) creative people will adopt ideas on the basis of their content. Knowing the source does not matter. In a world where humans have difficulty distinguishing whether content is human- or AI-generated (Jakesch et al., 2022), these findings suggest people high in (self-reported) creativity will not be ‘duped’ into adopting AI ideas.

8.3. Participants may adopt AI ideas when the task is difficult

When AI ideas were labeled, participants were more likely to adopt AI ideas for difficult prompts rather than easy prompts. Although this finding should be taken as speculative (due to the small number of items), it is similar to what Roemmele (2021) observed, where seeing AI examples only influenced creative output when the task was difficult. Both our results and those of Roemmele (2021) are consistent with a theoretical account in which task difficulty is associated with increased reliance on automation (Goddard et al., 2014). Future work can further test whether users adopt AI ideas for more challenging creative tasks.

If users turn to AI for difficult rather than trivial tasks, this has several implications. On one hand, AI can augment human creativity where human imaginations falter. At the same time, researchers raised concerns over ‘model collapse’ (Shumailov et al., 2023)—the deteriorating performance of LLMs when trained on their outputs. If reliance on AI for creative tasks becomes routine, this may contribute to model collapse, ironically decreasing the efficacy of such reliance.

8.4. Conclusion: Passive exposure to AI ideas affects collective thought.

We conclude that passive exposure to AI ideas—the kind of passive exposure we are inundated with in a post-ChatGPT era—does affect collective thought. Even small effects are meaningful since this exposure is both pervasive and growing. But the effects of AI ideas are nuanced. Seeing AI ideas did not increase individual creativity, though it did increase collective diversity. The effects of AI ideas vary across individuals and tasks. There is still much to learn. We hope our study inspires more research on how passive exposure to AI ideas affects collective thought.

9. Limitations & Future Work

Our study has several limitations that can inform future work. First, we measured the effect of AI ideas for a single task. We chose this task because it is one of the most common creativity tasks (Abraham, 2016). But future work could explore if our results replicate for other kinds of tasks. Second, we had to operationalize ‘ChatGPT’ in some concrete way. The logic for our prompt was driven by ecological validity and prior work: We used a zero-shot prompt because that is what users would likely use, and the specific prompt we used was derived from prior research. We chose not to vary prompts in order not to further increase the complexity of an already complex experiment. Future work could explore if different prompts elicit different results. Another avenue for future work is only propagating the ‘best’ AI ideas forward. Third, future work should test if alternative classifiers or ways of conceiving variables yield different results. For idea diversity and AI adoption, we addressed this problem by showing that conceptually similar ways of measuring variables yielded qualitatively similar results. For our creativity measure, we used a highly accurate classifier (correlation with human judgments greater than $r = 0.88$ for items we used) trained for this exact task, for these exact items. But of course, all models have some error and future work based on this model propagates these errors. Incidentally, human judges of creativity only correlate with other human judges at $r = 0.88$ (Organisciak et al., 2023), suggesting the classifier we used may be approaching ‘the approximate ceiling at which we could expect a model to correlate with human judgments’ (Organisciak et al., 2023, pg. 11) of creativity. Nonetheless, future work can adopt a similar design but with human judgments or different measures. Fourth, our finding about AI adoption and task difficulty is based on five AUT items.
Future work should explore this relationship with a larger number of stimuli. Fifth, we focus on one facet of creativity: originality. Future work can also explore whether AI ideas have different effects on other facets of creativity. We also did not necessarily measure idea “quality”, which is distinct from originality and diversity. Sixth, we employed a convenience sample of technology-interested users and creative professionals. While these two groups are most relevant to the phenomena in question, our sample also limits generalizability. Future work can explore these dynamics with different samples. Seventh, these dynamics may differ for future LLMs. Finally, we conducted this experiment close to the launch of ChatGPT. As AI becomes increasingly embedded in everyday life, attitudes towards AI and ways of engaging with AI may also change. Despite these limitations, our work offers the first large-scale, dynamic account of how ideas from LLMs affect collective thought.

References

  • Abraham (2016) Anna Abraham. 2016. Gender and creativity: an overview of psychological and neuroscientific literature. Brain Imaging and Behavior 10, 2 (June 2016), 609–618. https://doi.org/10.1007/s11682-015-9410-8
  • Anderson et al. (2024) Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. 2024. Homogenization Effects of Large Language Models on Human Creative Ideation. In Creativity and Cognition. ACM, Chicago IL USA, 413–425. https://doi.org/10.1145/3635636.3656204
  • Baten et al. (2021) Raiyan Abdul Baten, Richard N. Aslin, Gourab Ghoshal, and Ehsan Hoque. 2021. Cues to gender and racial identity reduce creativity in diverse social networks. Scientific Reports 11, 1 (May 2021), 10261. https://doi.org/10.1038/s41598-021-89498-5
  • Beaty and Johnson (2021) Roger E. Beaty and Dan R. Johnson. 2021. Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods 53, 2 (April 2021), 757–780. https://doi.org/10.3758/s13428-020-01453-w
  • Beaty et al. (2022) Roger E. Beaty, Dan R. Johnson, Daniel C. Zeitlen, and Boris Forthmann. 2022. Semantic Distance and the Alternate Uses Task: Recommendations for Reliable Automated Assessment of Originality. Creativity Research Journal 34, 3 (July 2022), 245–260. https://doi.org/10.1080/10400419.2022.2025720
  • Beaty et al. (2021) Roger E. Beaty, Daniel C. Zeitlen, Brendan S. Baker, and Yoed N. Kenett. 2021. Forward flow and creative thought: Assessing associative cognition and its role in divergent thinking. Thinking Skills and Creativity 41 (Sept. 2021), 100859. https://doi.org/10.1016/j.tsc.2021.100859
  • Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
  • Bond and Titus (1983) Charles F. Bond and Linda J. Titus. 1983. Social facilitation: A meta-analysis of 241 studies. Psychological Bulletin 94, 2 (1983), 265–292. https://doi.org/10.1037/0033-2909.94.2.265
  • Boyd and Richerson (1988) Robert Boyd and Peter J. Richerson. 1988. Culture and the Evolutionary Process. University of Chicago Press.
  • Boyd et al. (2011) Robert Boyd, Peter J. Richerson, and Joseph Henrich. 2011. The cultural niche: Why social learning is essential for human adaptation. Proceedings of the National Academy of Sciences 108, supplement_2 (June 2011), 10918–10925. https://doi.org/10.1073/pnas.1100290108
  • Branch et al. (2021) Boyd Branch, Piotr Mirowski, and Kory W. Mathewson. 2021. Collaborative Storytelling with Human Actors and AI Narrators. http://arxiv.org/abs/2109.14728
  • Brown and Paulus (2002) Vincent R. Brown and Paul B. Paulus. 2002. Making Group Brainstorming More Effective: Recommendations From an Associative Memory Perspective. Current Directions in Psychological Science 11, 6 (Dec. 2002), 208–212. https://doi.org/10.1111/1467-8721.00202
  • Calderwood et al. (2020) Alex Calderwood, Vivian Qiu, Katy Ilonka Gero, and Lydia B Chilton. 2020. How Novelists Use Generative Language Models: An Exploratory User Study.. In HAI-GEN+ user2agent IUI.
  • Chiang (2023) Ted Chiang. 2023. ChatGPT Is a Blurry JPEG of the Web. The New Yorker (Feb. 2023). https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
  • Dell’Acqua et al. (2023) Fabrizio Dell’Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. 2023. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. https://doi.org/10.2139/ssrn.4573321
  • Di Fede et al. (2022) Giulia Di Fede, Davide Rocchesso, Steven P. Dow, and Salvatore Andolina. 2022. The Idea Machine: LLM-based Expansion, Rewriting, Combination, and Suggestion of Ideas. In Proceedings of the 14th Conference on Creativity and Cognition (C&C ’22). Association for Computing Machinery, New York, NY, USA, 623–627. https://doi.org/10.1145/3527927.3535197
  • Doshi and Hauser (2024) Anil R. Doshi and Oliver P. Hauser. 2024. Generative artificial intelligence enhances creativity but reduces the diversity of novel content. https://doi.org/10.48550/arXiv.2312.00506 arXiv:2312.00506 [cs, econ, q-fin].
  • Dumas et al. (2021) Denis Dumas, Peter Organisciak, Shannon Maio, and Michael Doherty. 2021. Four Text-Mining Methods for Measuring Elaboration. The Journal of Creative Behavior 55, 2 (2021), 517–531. https://doi.org/10.1002/jocb.471
  • Eapen et al. (2023) Tojin T. Eapen, Daniel J. Finkenstadt, Josh Folk, and Lokesh Venkataswamy. 2023. How Generative AI Can Augment Human Creativity. Harvard Business Review (July 2023). https://hbr.org/2023/07/how-generative-ai-can-augment-human-creativity
  • Gero (2023) Katy Ilonka Gero. 2023. AI and the Writer: How Language Models Support Creative Writers. Ph.D. Columbia University, United States – New York. https://www.proquest.com/docview/2753687892/abstract/ACF7F21F1E274995PQ/1
  • Gero and Chilton (2019) Katy Ilonka Gero and Lydia B. Chilton. 2019. Metaphoria: An Algorithmic Companion for Metaphor Creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, Glasgow Scotland Uk, 1–12. https://doi.org/10.1145/3290605.3300526
  • Gero et al. (2022) Katy Ilonka Gero, Vivian Liu, and Lydia Chilton. 2022. Sparks: Inspiration for Science Writing using Language Models. In Proceedings of the 2022 ACM Designing Interactive Systems Conference (DIS ’22). Association for Computing Machinery, New York, NY, USA, 1002–1019. https://doi.org/10.1145/3532106.3533533
  • Goddard et al. (2014) Kate Goddard, Abdul Roudsari, and Jeremy C. Wyatt. 2014. Automation bias: Empirical results assessing influencing factors. International Journal of Medical Informatics 83, 5 (May 2014), 368–375. https://doi.org/10.1016/j.ijmedinf.2014.01.001
  • Griffin (2024) Andrew Griffin. 2024. ChatGPT creators OpenAI are generating 100 billion words per day. https://www.independent.co.uk/tech/chatgpt-openai-words-sam-altman-b2494900.html
  • Guilford (1967) J.P. Guilford. 1967. The nature of human intelligence. McGraw-Hill, New York, NY, US.
  • Guilford (1978) Joy Paul Guilford. 1978. Alternate uses. Sheridan supply Company.
  • Guzik et al. (2023) Erik E. Guzik, Christian Byrge, and Christian Gilde. 2023. The originality of machines: AI takes the Torrance Test. Journal of Creativity 33, 3 (Dec. 2023), 100065. https://doi.org/10.1016/j.yjoc.2023.100065
  • Hancock et al. (2020) Jeffrey T Hancock, Mor Naaman, and Karen Levy. 2020. AI-Mediated Communication: Definition, Research Agenda, and Ethical Considerations. Journal of Computer-Mediated Communication 25, 1 (March 2020), 89–100. https://doi.org/10.1093/jcmc/zmz022
  • Hitsuwari et al. (2022) Jimpei Hitsuwari, Yoshiyuki Ueda, Woojin Yun, and Michio Nomura. 2022. Does human–AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior (Oct. 2022), 107502. https://doi.org/10.1016/j.chb.2022.107502
  • Hu (2023) Krystal Hu. 2023. ChatGPT sets record for fastest-growing user base - analyst note. Reuters (Feb. 2023). https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
  • Huang et al. (2020) Chieh-Yang Huang, Shih-Hong Huang, and Ting-Hao Kenneth Huang. 2020. Heteroglossia: In-Situ Story Ideation with the Crowd. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376715
  • Hwang and Won (2021) Angel Hsing-Chi Hwang and Andrea Stevenson Won. 2021. IdeaBot: Investigating Social Facilitation in Human-Machine Team Creativity. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–16. https://doi.org/10.1145/3411764.3445270
  • Jakesch et al. (2022) Maurice Jakesch, Jeffrey Hancock, and Mor Naaman. 2022. Human Heuristics for AI-Generated Language Are Flawed. https://doi.org/10.48550/arXiv.2206.07271
  • Jared Henderson (2022) Jared Henderson. 2022. ChatGPT Will Make You Less Creative. https://www.youtube.com/watch?v=1K8PiMNoR7A
  • Kirk et al. (2009) Ulrich Kirk, Martin Skov, Oliver Hulme, Mark S. Christensen, and Semir Zeki. 2009. Modulation of aesthetic value by semantic context: An fMRI study. NeuroImage 44, 3 (Feb. 2009), 1125–1132. https://doi.org/10.1016/j.neuroimage.2008.10.009
  • Köbis and Mossink (2021) Nils Köbis and Luca D. Mossink. 2021. Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior 114 (Jan. 2021), 106553. https://doi.org/10.1016/j.chb.2020.106553
  • Krish Naik (2023) Krish Naik. 2023. Will Chatgpt Kill Your Creativity? https://www.youtube.com/watch?v=0m2r9elReBY
  • Lee et al. (2022) Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–19. https://doi.org/10.1145/3491102.3502030
  • Mangalaseril (2023) Jasmine Mangalaseril. 2023. The Incredible Blandness of ChatGPT. https://cardamomaddict.substack.com/p/the-incredible-blandness-of-chatgpt
  • Miller (2019) Arthur I Miller. 2019. The Artist in the Machine: The World of AI-Powered Creativity. Cambridge: MIT Press. https://direct.mit.edu/books/book/4547/The-Artist-in-the-MachineThe-World-of-AI-Powered
  • Mirowski et al. (2023) Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. 2023. Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–34. https://doi.org/10.1145/3544548.3581225
  • Mizrahi et al. (2020) Moran Mizrahi, Stav Yardeni Seelig, and Dafna Shahaf. 2020. Coming to Terms: Automatic Formation of Neologisms in Hebrew. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4918–4929. https://doi.org/10.18653/v1/2020.findings-emnlp.442
  • Mosier et al. (1996) Kathleen L. Mosier, Linda J. Skitka, Mark D. Burdick, and Susan T. Heers. 1996. Automation Bias, Accountability, and Verification Behaviors. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 40, 4 (Oct. 1996), 204–208. https://doi.org/10.1177/154193129604000413
  • Mumford and Hemlin (2017) Michael D. Mumford and Sven Hemlin. 2017. Handbook of Research on Leadership and Creativity. Edward Elgar Publishing.
  • Nadeem (2022) Reem Nadeem. 2022. How Americans think about artificial intelligence. https://www.pewresearch.org/internet/2022/03/17/how-americans-think-about-artificial-intelligence/
  • Nadeem (2023) Reem Nadeem. 2023. Public Awareness of Artificial Intelligence in Everyday Activities. https://www.pewresearch.org/science/2023/02/15/public-awareness-of-artificial-intelligence-in-everyday-activities/
  • News (2023) Nation World News. 2023. Why Does ChatGPT Increase Creativity? https://nationworldnews.com/why-does-chatgpt-increase-creativity/
  • Nickerson and Sakamoto (2010) J. Nickerson and Yasuaki Sakamoto. 2010. Crowdsourcing Creativity: Combining Ideas in Networks. https://www.semanticscholar.org/paper/Crowdsourcing-Creativity%3A-Combining-Ideas-in-Nickerson-Sakamoto/340a7645d1402287e151e83981f8a4085227e317
  • Nijstad and Stroebe (2006) Bernard A. Nijstad and Wolfgang Stroebe. 2006. How the Group Affects the Mind: A Cognitive Model of Idea Generation in Groups. Personality and Social Psychology Review 10, 3 (Aug. 2006), 186–213. https://doi.org/10.1207/s15327957pspr1003_1
  • of Psychology ([n. d.]) American Psychological Association Dictionary of Psychology. [n. d.]. divergent thinking. https://dictionary.apa.org/divergent-thinking
  • Ojala and Garriga (2009) Markus Ojala and Gemma C. Garriga. 2009. Permutation Tests for Studying Classifier Performance. In 2009 Ninth IEEE International Conference on Data Mining. IEEE, Miami Beach, FL, USA, 908–913. https://doi.org/10.1109/ICDM.2009.108
  • Organisciak et al. (2022) Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. 2022. Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. Vol. 49. 101356 pages. https://doi.org/10.1016/j.tsc.2023.101356
  • Organisciak et al. (2023) Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. 2023. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity 49 (2023), 101356. https://doi.org/10.1016/j.tsc.2023.101356
  • Osone et al. (2021) Hiroyuki Osone, Jun-Li Lu, and Yoichi Ochiai. 2021. BunCho: AI Supported Story Co-Creation via Unsupervised Multitask Learning to Increase Writers’ Creativity in Japanese. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–10. https://doi.org/10.1145/3411763.3450391
  • Padmakumar and He (2024) Vishakh Padmakumar and He He. 2024. Does Writing with Language Models Reduce Content Diversity? https://doi.org/10.48550/arXiv.2309.05196
  • Paulus and Brown (2007) Paul B. Paulus and Vincent R. Brown. 2007. Toward More Creative and Innovative Group Idea Generation: A Cognitive-Social-Motivational Perspective of Brainstorming. Social and Personality Psychology Compass 1, 1 (2007), 248–265. https://doi.org/10.1111/j.1751-9004.2007.00006.x
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. http://arxiv.org/abs/1908.10084
  • Reinecke and Gajos (2015) Katharina Reinecke and Krzysztof Z. Gajos. 2015. LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). Association for Computing Machinery, New York, NY, USA, 1364–1378. https://doi.org/10.1145/2675133.2675246
  • Review (2023) European Business Review. 2023. ChatGPT: Ushering in the Age of Creativity. https://www.europeanbusinessreview.com/chatgpt-ushering-in-the-age-of-creativity/
  • Richerson and Boyd (2008) Peter J. Richerson and Robert Boyd. 2008. Not By Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press.
  • Roemmele (2021) Melissa Roemmele. 2021. Inspiration through Observation: Demonstrating the Influence of Automatically Generated Text on Creative Writing. https://doi.org/10.48550/arXiv.2107.04007
  • Salganik et al. (2006) Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. 2006. Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market. Science 311, 5762 (Feb. 2006), 854–856. https://doi.org/10.1126/science.1121066
  • Salganik and Watts (2009) Matthew J. Salganik and Duncan J. Watts. 2009. Web-Based Experiments for the Study of Collective Social Dynamics in Cultural Markets. Topics in Cognitive Science 1, 3 (2009), 439–468. https://doi.org/10.1111/j.1756-8765.2009.01030.x
  • Schemmer et al. (2022) Max Schemmer, Niklas Kühl, Carina Benz, and Gerhard Satzger. 2022. On the Influence of Explainable AI on Automation Bias. https://doi.org/10.48550/arXiv.2204.08859
  • Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The Curse of Recursion: Training on Generated Data Makes Models Forget. http://arxiv.org/abs/2305.17493
  • Siangliulue et al. (2015) Pao Siangliulue, Kenneth C. Arnold, Krzysztof Z. Gajos, and Steven P. Dow. 2015. Toward Collaborative Ideation at Scale: Leveraging Ideas from Others to Generate More Creative and Diverse Ideas. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (Feb. 2015), 937–945. https://doi.org/10.1145/2675133.2675239
  • Sourati and Evans (2023) Jamshid Sourati and James A. Evans. 2023. Accelerating science with human-aware artificial intelligence. Nature Human Behaviour 7, 10 (Oct. 2023), 1682–1696. https://doi.org/10.1038/s41562-023-01648-z
  • Spiel et al. (2019) Katta Spiel, Oliver L. Haimson, and Danielle Lottridge. 2019. How to do better with gender on surveys: a guide for HCI researchers. Interactions 26, 4 (June 2019), 62–65. https://doi.org/10.1145/3338283
  • Stevenson et al. (2022) Claire Stevenson, Iris Smal, Matthijs Baas, Raoul Grasman, and Han van der Maas. 2022. Putting GPT-3’s Creativity to the (Alternative Uses) Test. https://doi.org/10.48550/arXiv.2206.08932
  • Tubefilter (2023) Tubefilter. 2023. 86% of creators believe AI has a positive effect on creativity. ChatGPT offered its own opinions. https://www.tubefilter.com/2023/06/02/lightricks-creator-artificial-intelligence-ai-survey-chat-gpt-wired/
  • Walia (2019) Chetan Walia. 2019. A Dynamic Definition of Creativity. Creativity Research Journal 31, 3 (July 2019), 237–247. https://doi.org/10.1080/10400419.2019.1641787
  • Wilcot (2023) Wilcot. 2023. Using Chat-GPT for Innovators: Enhancing Creativity and Innovation. https://www.boardofinnovation.com/blog/using-chat-gpt-for-innovators-enhancing-creativity-and-innovation/
  • Williams (2018) Jamie Williams. 2018. Should AI Always Identify Itself? It’s More Complicated Than You Might Think. https://www.eff.org/deeplinks/2018/05/should-ai-always-identify-itself-its-more-complicated-you-might-think
  • Yang et al. (2022) Daijin Yang, Yanpeng Zhou, Zhiyuan Zhang, Toby Jia-Jun Li, and L. C. Ray. 2022. AI as an Active Writer: Interaction Strategies with Generated Text in Human-AI Collaborative Fiction Writing 56-65. https://www.semanticscholar.org/paper/AI-as-an-Active-Writer%3A-Interaction-Strategies-with-Yang-Zhou/15ddeb7765e2a3ea692a27d9b30e8f9446d74742
  • Yang et al. (2023) Tianchen Yang, Qifan Zhang, Zhaoyang Sun, and Yubo Hou. 2023. Automatic Assessment of Divergent Thinking in Chinese Language with TransDis: A Transformer-Based Language Model Approach. https://doi.org/10.48550/arXiv.2306.14790
  • Yu and Nickerson (2011) Lixiu Yu and Jeffrey V. Nickerson. 2011. Cooks or cobblers? crowd creativity through combination. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). Association for Computing Machinery, New York, NY, USA, 1393–1402. https://doi.org/10.1145/1978942.1979147
  • Yu and Nickerson (2013) Lixiu Yu and Jeffrey V. Nickerson. 2013. An internet-scale idea generation system. ACM Transactions on Interactive Intelligent Systems 3, 1 (April 2013), 2:1–2:24. https://doi.org/10.1145/2448116.2448118
  • Yu et al. (2023) Yuhua Yu, Roger E. Beaty, Boris Forthmann, Mark Beeman, John Henry Cruz, and Dan Johnson. 2023. A MAD method to assess idea novelty: Improving validity of automatic scoring using maximum associative distance (MAD). Psychology of Aesthetics, Creativity, and the Arts (2023). https://doi.org/10.1037/aca0000573
  • Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: Story Writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 841–852. https://doi.org/10.1145/3490099.3511105

Appendix A AUT Items

Table 5. AUT items by frequency of occurrence in dataset and classifier accuracy. Accuracy is defined as the correlation between human ratings of creativity and model predictions. The overall accuracy was r = 0.81. The accuracy of responses from the best-performing 5-item subset was r = 0.90. The data and model are from Organisciak et al. (2022).
AUT Item Classifier Accuracy (r) Frequency in Test Set
tire 0.91 412
pants 0.91 443
shoe 0.91 382
table 0.90 461
bottle 0.88 839
pencil 0.85 384
ball 0.84 393
fork 0.83 407
lightbulb 0.83 383
toothbrush 0.81 379
knife 0.81 2163
backpack 0.80 34
shovel 0.79 339
paperclip 0.79 1385
hat 0.76 380
box 0.74 2842
spoon 0.73 386
book 0.71 487
sock 0.69 380
brick 0.64 5162
rope 0.56 2080

Appendix B AUT Prompts

We conducted a Monte Carlo experiment to confirm that our modified zero-shot prompt produced responses with a word length similar to that of human responses. We used parameters from the experiments of Stevenson et al. (2022): for $n=1000$ trials, we fixed presence penalty and frequency penalty at 1, randomly chose a temperature (higher values lead to more randomness) in [0.65, 0.7, 0.75, 0.8], and randomly chose one of our 5 AUT items. For each trial, ChatGPT generated five ideas. The modified prompt resulted in responses with an average word length (M = 4.44, SD = 1.34) much closer to human responses (M = 4.56, SD = 4.97) than the original zero-shot prompt (M = 25.38, SD = 8.55). A permutation test further shows that this difference in word count was significant at $p<0.001$. We used the ideas generated by our modified zero-shot prompt as stimuli for the main experiment.
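The sampling scheme of this prompt experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the placeholder item labels, and the example ideas are hypothetical, and the sketch stops short of calling any language model.

```python
import random
import statistics

TEMPERATURES = [0.65, 0.7, 0.75, 0.8]  # values listed in the text
ITEMS = ["item_1", "item_2", "item_3", "item_4", "item_5"]  # placeholder labels

def sample_trial_configs(n_trials=1000, seed=0):
    """Draw one generation configuration per Monte Carlo trial."""
    rng = random.Random(seed)
    return [
        {
            "presence_penalty": 1,   # fixed, as described in the text
            "frequency_penalty": 1,  # fixed, as described in the text
            "temperature": rng.choice(TEMPERATURES),
            "item": rng.choice(ITEMS),
            "n_ideas": 5,
        }
        for _ in range(n_trials)
    ]

def summarize_lengths(ideas):
    """Mean and sample SD of the number of whitespace-separated words."""
    counts = [len(idea.split()) for idea in ideas]
    return statistics.mean(counts), statistics.stdev(counts)

configs = sample_trial_configs()
# Hypothetical ideas, only to show the word-length summary.
mean_len, sd_len = summarize_lengths(
    ["use as a doorstop", "plant pot", "swing seat for kids"]
)
```

Each configuration would then be passed to the model, and the word-length summary would be computed separately for each prompt variant and for the human responses.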

Table 6. Summary statistics of AUT prompt experiment. Human ideas are from Organisciak et al. (2022) and include only those ideas in response to the chosen AUT items. Note that in some cases ChatGPT did not return the desired number of ideas, leading to a slight discrepancy between ideas generated between the two prompts.
Condition N Average Words SD Words
Human Ideas 2537 4.56 4.97
Zero Shot Length Limited 7500 4.44 1.34
Zero Shot 8153 25.38 8.55

Appendix C Pre-Treatment Questions

Pew asked about feelings towards AI (Nadeem, 2023, 2022), and we used the specific phrasing and choice ordering from Nadeem (2022). We randomized the order of the first two options and kept neutral last. Our gender question was based on guidance from Spiel et al. (2019). The options were: ‘woman’, ‘man’, ‘non-binary’, ‘prefer to self-describe’, ‘prefer not to disclose’. We added a text box for those who preferred to self-describe. The only deviation from Spiel et al. (2019) is that we did not allow participants to select multiple options. We note that the gender, age, and country questions were all optional.

Appendix D Exclusion Criteria for Analysis

Participants could have consented and answered pre-treatment questions but failed to complete any trial. We only analyze data from participants who completed at least one trial. In $n=4$ cases, users submitted ages that were implausible. We replaced these age values with missing for the purpose of summarizing participants but kept the responses. In $n=2$ cases, responses that should not have been shown were shown. We remove these two responses from analysis. As discussed in Appendix L, we instituted content moderation after receiving several troll responses. After the study, we manually inspected each response flagged by our system. There were 46 ideas labeled as profane, and we determined 36 were true positives. We remove the true positives ($n=36$) from analysis, resulting in a final set of 3414 responses for analysis from an initial set of 3452 responses. Importantly, we conducted chi-squared tests and found that condition was unrelated to the number of flagged ideas ($\chi^{2}$(4) = 6.06, p = 0.19), number of flagged ideas minus false positives ($\chi^{2}$(4) = 2.92, p = 0.57) or total number of excluded ideas ($\chi^{2}$(4) = 3.87, p = 0.42).

Appendix E Human vs AI Ideas

We compared a sample of 1500 ideas from our modified Stevenson prompt in the prompt experiment and a random sample of 1500 ideas from the Organisciak Dataset for our 5 items. For each set, we used the model’s predicted originality scores. Originality ranges from 1-5. Overall, ChatGPT ideas had higher ($\beta=0.62$, $t(2994)=22.49$, 95% CI = $[0.56,0.67]$) originality.

Table 7. Comparing predicted originality of ChatGPT generated ideas to ideas from a dataset of prior human responses
Dependent variable:
originality
sourcechatgpt 0.618∗∗∗
(0.027)
promptpants -0.072
(0.041)
promptshoe 0.096∗∗
(0.043)
prompttable -0.007
(0.042)
prompttire -0.196∗∗∗
(0.041)
Constant 2.751∗∗∗
(0.029)
Observations 3,000
R² 0.159
Adjusted R² 0.157
Residual Std. Error 0.746 (df = 2994)
F Statistic 112.903∗∗∗ (df = 5; 2994)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Appendix F Participant Feedback

To encourage participants to start the study, we told them that if they finished all 5 trials, we would show them how creative they are relative to humans and AI. At the end of the experiment, we first computed a participant’s average score from the Organisciak et al. (2022) classifier as their ‘creativity score’. We then graphically and verbally showed participants what percentile this score would be for both humans and AI, where the human and AI scores come from applying the Organisciak et al. (2022) classifier to a sample of AI ideas we generated and to prior human ideas from the Organisciak Dataset. We also provided a graph that compared a participant’s scores in the AI condition to their scores in the no-AI conditions.

Additionally, we wanted to minimize attrition once participants had started. We gave participants two pieces of feedback after each trial so they would continue taking the study. See Appendix M for screenshots.

  • First, we calculated how unique a participant’s response was relative to the last person’s response. We did this by calculating the cosine distance between a word2vec embedding of the participant’s response and a word2vec embedding of the last response in a given {[condition], item}. Due to resource constraints, we used a truncated word2vec model—the top 15k words in English.

  • We also compared the accuracy of participants’ rankings to the rankings of ideas by the classifier (Organisciak et al., 2022) we used. To do this, we calculated the rank-order correlation between a participant’s rankings of items and the rank order generated by the Organisciak et al. (2022) model.

In certain cases, either of these metrics could not be calculated, and we returned an arbitrary, random number.
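The two feedback metrics described above can be sketched as follows. This is an illustration under assumptions rather than the study's implementation: the toy three-dimensional vectors stand in for the truncated word2vec model, averaging word vectors into a response embedding is an assumed pooling step, and all function names are hypothetical.

```python
import math
import random

# Toy vectors standing in for word2vec; a real system would load trained embeddings.
TOY_VECTORS = {
    "plant": [0.9, 0.1, 0.0],
    "pot": [0.8, 0.2, 0.1],
    "swing": [0.1, 0.9, 0.2],
    "seat": [0.2, 0.8, 0.3],
}

def embed(text, vectors):
    """Average the vectors of in-vocabulary words; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def uniqueness(response, previous, vectors, rng=random):
    """Cosine distance to the previous response, with a random fallback."""
    u, v = embed(response, vectors), embed(previous, vectors)
    if u is None or v is None:
        return rng.random()  # metric undefined: mirror the random fallback in the text
    return cosine_distance(u, v)

def spearman(rank_a, rank_b):
    """Rank-order correlation for two tie-free rankings of the same n >= 2 ideas."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

For example, `spearman([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])` gives a perfectly reversed ranking, and `uniqueness` falls back to a random number whenever a response contains no in-vocabulary word, which is one situation a 15k-word vocabulary makes plausible.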

Appendix G Sample Characteristics

Figure 11. Distributions of self-perceived creativity relative to humans and relative to AI, by both interest group and sentiment towards AI.
Table 8. Descriptive Stats (Non-Missing Values)
Mean SD 25th Percentile Median 75th Percentile
age 34.92 10.86 27.0 33.0 40.0
creativity_ai 57.86 26.66 40.0 60.0 76.0
creativity_human 58.67 23.65 44.0 62.0 75.0
Table 9. Distribution of Gender
gender Counts (% of total)
woman 308 (36%)
man 268 (32%)
Missing 222 (26%)
non-binary 23 (3%)
prefer_not_disclose 16 (2%)
prefer_self_describe 7 (1%)
Table 10. Distribution of AI Feeling
ai_feeling Counts (% of total)
neutral 403 (48%)
excited 232 (27%)
concerned 198 (23%)
Missing 11 (1%)

Although we did not assess English language proficiency, the top five countries by responses (77.85% of responses) were the United States, Canada, Germany, United Kingdom, and Australia, all countries with high English proficiency. The median response length was six words, which is relatively short, also suggesting that English language proficiency is unlikely to be a confounder.

Appendix H Model Selection

DV Potential Moderator $\chi^{2}$ Df $p<\chi^{2}$ Added Interaction
Idea Diversity Self-Perceived Human Creativity 10.32 4.00 0.04 YES
AI - Human Creativity 15.70 4.00 0.00 YES
AI Feeling 8.20 8.00 0.41 NO
Interest Group 12.02 8.00 0.15 NO
Creativity Self-Perceived Human Creativity 3.28 4.00 0.51 NO
AI - Human Creativity 1.11 4.00 0.89 NO
AI Feeling 8.19 8.00 0.41 NO
Interest Group 1.57 8.00 0.99 NO
AI Adoption Self-Perceived Human Creativity 18.24 3.00 0.00 YES
AI - Human Creativity 9.94 3.00 0.02 YES
AI Feeling 4.14 6.00 0.66 NO
Interest Group 13.05 6.00 0.04 YES
Table 11. To determine which moderating variables to include, we conducted likelihood ratio tests comparing the baseline specification to a model including an interaction between a potential moderator and the treatment condition. If the likelihood ratio test indicated the interaction improved the fit at $p<0.05$, we included this interaction in our model.

Selected models already include ‘Interest Group’ to control for participant source (neutral, creative, technical). As a robustness check, we subsequently created an additional participant source variable, ‘IsSocialMedia’, indicating if the respondent was from social media. Likelihood ratio tests found adding ‘IsSocialMedia’ and its interaction with the treatment condition did not improve the fit of the selected models ($p>0.39$ for all models).
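The decision rule in Table 11 can be illustrated with a short sketch that converts a likelihood ratio statistic and its degrees of freedom into a p-value. The helper names `chi2_sf` and `lrt_p_value` are hypothetical, and the closed-form tail formulas are standard results for integer degrees of freedom, not something taken from the authors' analysis code.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a half-integer gamma series.
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += half ** (j - 0.5) / math.gamma(j + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def lrt_p_value(loglik_base, loglik_full, df_diff):
    """Likelihood ratio test: the statistic is twice the log-likelihood gain."""
    stat = 2.0 * (loglik_full - loglik_base)
    return stat, chi2_sf(stat, df_diff)

# (statistic, df) pairs copied from Table 11.
for stat, df in [(10.32, 4), (8.20, 8), (9.94, 3), (13.05, 6)]:
    include = chi2_sf(stat, df) < 0.05
    print(stat, df, round(chi2_sf(stat, df), 2), "YES" if include else "NO")
```

Applied to these rows, the sketch gives p-values that round to the values shown in Table 11, and the `< 0.05` rule yields the same YES/NO decisions.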

Appendix I Idea Diversity

I.1. Global

For 50 Monte Carlo runs with different seed values, we sampled 50 ideas for each {[condition], item} combination. For each 50-idea set, we computed various idea diversity measures. First, we calculated all pairwise SBERT distances. Next, we measured the mean and median pairwise distances. We also computed the centroid of each 50-idea set and calculated the mean distance from the centroid. After calculating these metrics, we conducted two-tailed, paired permutation tests (10,000 iterations) to test if two conditions differed on these metrics. To conduct the paired permutation test, we randomly swapped the sign of the difference between pairs of values, equivalent to randomly swapping the condition labels of rows—simulating the null hypothesis that conditions do not differ. We then counted the proportion of null distribution iterations where one would observe a larger absolute difference in means than the observed difference. We added a 1 to the numerator and denominator, which is a common, conservative adjustment (Ojala and Garriga, 2009) and stops p-values from being 0. Because the test is paired (equivalent to swapping the condition label within each ‘row’), our permutation tests are controlling for both AUT items and Monte Carlo seeds, since each row shares these attributes. We controlled for multiple pairwise comparisons by applying a Holm-Bonferroni adjustment to p-values. As a non-parametric measure of effect size, we used Cliff’s Delta. This metric ranges from -1 to +1 where 0 indicates no difference between conditions, +1 indicates that all Monte Carlo runs for the first condition are larger than those for the second, and vice versa for -1. As with evolution, to avoid the confounding effect of conditions differing in the number of seeds, we consider ideas after the sixth trial (see SI I.3 for a more detailed discussion). This is because the experiment is designed to ‘shed’ all seed examples after trial six.
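The paired permutation test and the effect size described above can be sketched as follows. This is a minimal version under assumptions: the function names are hypothetical, the inputs are taken to be two equal-length lists of diversity values aligned by (item, seed), and ties with the observed difference are counted as exceedances, which is a conservative choice the text does not specify.

```python
import random

def paired_permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-tailed paired sign-flip test on the mean difference of a and b."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    exceed = 0
    for _ in range(n_iter):
        # Flipping a sign is equivalent to swapping the two labels within a row.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_iter + 1)

def cliffs_delta(a, b):
    """Share of (a, b) pairs with a > b minus the share with a < b."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))
```

With 10,000 iterations, the add-one correction makes 1/10,001 the smallest attainable raw p-value. Multiplied by a Holm-Bonferroni factor of at most 10 for ten contrasts, this would be consistent with the smallest adjusted values of about 0.001 reported in Tables 12 to 14.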

Table 12. Global idea diversity measured by mean pairwise distance
Contrast Diff in Means Adj P Value Cliff’s Delta
HighExposureDisclosed-Control 1.34 0.0010 0.35
HighExposureUndisclosed-Control 0.90 0.0010 0.27
LowExposureDisclosed-Control 0.34 0.0111 0.09
LowExposureUndisclosed-Control -0.46 0.0165 -0.02
HighExposureDisclosed-HighExposureUndisclosed 0.44 0.0111 0.05
HighExposureDisclosed-LowExposureDisclosed 1.00 0.0010 0.29
HighExposureDisclosed-LowExposureUndisclosed 1.80 0.0010 0.32
HighExposureUndisclosed-LowExposureDisclosed 0.56 0.0010 0.21
HighExposureUndisclosed-LowExposureUndisclosed 1.36 0.0010 0.28
LowExposureDisclosed-LowExposureUndisclosed 0.80 0.0010 0.09
Table 13. Global idea diversity measured by median pairwise distance
Contrast Diff in Means Adj P Value Cliff’s Delta
HighExposureDisclosed-Control 1.21 0.0010 0.31
HighExposureUndisclosed-Control 0.81 0.0010 0.26
LowExposureDisclosed-Control 0.47 0.0010 0.11
LowExposureUndisclosed-Control -0.61 0.0063 -0.06
HighExposureDisclosed-HighExposureUndisclosed 0.41 0.0328 0.04
HighExposureDisclosed-LowExposureDisclosed 0.74 0.0010 0.23
HighExposureDisclosed-LowExposureUndisclosed 1.82 0.0010 0.32
HighExposureUndisclosed-LowExposureDisclosed 0.33 0.0328 0.18
HighExposureUndisclosed-LowExposureUndisclosed 1.41 0.0010 0.29
LowExposureDisclosed-LowExposureUndisclosed 1.08 0.0010 0.15
Table 14. Global idea diversity measured by mean centroid distance
Contrast Diff in Means Adj P Value Cliff’s Delta
HighExposureDisclosed-Control 1.48 0.0010 0.35
HighExposureUndisclosed-Control 1.03 0.0010 0.27
LowExposureDisclosed-Control 0.36 0.0138 0.09
LowExposureUndisclosed-Control -0.43 0.0295 -0.02
HighExposureDisclosed-HighExposureUndisclosed 0.45 0.0138 0.05
HighExposureDisclosed-LowExposureDisclosed 1.12 0.0010 0.29
HighExposureDisclosed-LowExposureUndisclosed 1.91 0.0010 0.32
HighExposureUndisclosed-LowExposureDisclosed 0.68 0.0010 0.21
HighExposureUndisclosed-LowExposureUndisclosed 1.46 0.0010 0.28
LowExposureDisclosed-LowExposureUndisclosed 0.79 0.0010 0.09

I.2. Local

We found mixed evidence that belief in AI’s relative creativity moderates local idea diversity. Regression results showed a small but significant interaction effect between the [High Exposure, Undisclosed] condition and relative AI creativity ($\beta=0.038$, $t(3244)=2.149$, 95% CI = $[0.003,0.073]$, $p=0.03$). We probed this effect with estimated marginal means, predicting local idea diversity for the bottom and top decile of participants by perception of AI creativity. Top-decile participants had slightly higher local idea diversity than bottom-decile participants in the [High Exposure, Undisclosed] condition ($\Delta=2.62$, $d=0.34$). Although this difference was significant before adjusting for multiple comparisons ($p=0.01$), it was not significant after the adjustment ($p=0.06$; see Appendix Table 17). Hence, we conclude there is mixed evidence for the role of belief in AI’s relative creativity as a moderator of local idea diversity.
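The three local idea diversity measures used as dependent variables in Table 15 can be sketched as follows. This is an illustration under assumptions: short toy vectors stand in for SBERT embeddings, cosine distance is assumed as the distance function, and the function names are hypothetical.

```python
import math
import statistics

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def local_diversity(response_vec, example_vecs):
    """Distances from one response to the example ideas the participant saw."""
    dists = [cosine_distance(response_vec, e) for e in example_vecs]
    # Centroid: coordinate-wise mean of the example embeddings.
    centroid = [sum(dim) / len(example_vecs) for dim in zip(*example_vecs)]
    return {
        "median_pw": statistics.median(dists),
        "mean_pw": statistics.mean(dists),
        "centroid": cosine_distance(response_vec, centroid),
    }

# Toy example: a response identical to one example and orthogonal to the other.
scores = local_diversity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the pairwise distances are 0 and 1, so the median and mean are both 0.5, while the centroid distance is smaller because the centroid lies between the two examples.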

Table 15. Predictors of local idea diversity with coefficients and SEs in parentheses. The DVs for models (1) and (2) are the median and mean pairwise distances between a participant’s response and examples. Model (3) uses the distance between a participant’s response and the centroid of examples. Ideas are embedded using SBERT. All three models have a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.
Dependent variable:
Median PW Distance Mean PW Distance Centroid Distance
(1) (2) (3)
conditionLoExposure_Disclosed -1.467 (1.864) -2.177 (1.701) -3.791 (2.438)
t = -0.787 t = -1.280 t = -1.555
conditionLoExposure_Undisclosed 0.272 (1.861) -0.098 (1.698) -0.414 (2.433)
t = 0.146 t = -0.057 t = -0.170
conditionHiExposure_Disclosed 1.051 (1.864) 0.814 (1.701) 3.354 (2.438)
t = 0.564 t = 0.479 t = 1.376
conditionHiExposure_Undisclosed -0.744 (1.866) -0.958 (1.703) 0.442 (2.441)
t = -0.399 t = -0.563 t = 0.181
creativity_human -0.006 (0.013) -0.008 (0.013) -0.015 (0.021)
t = -0.484 t = -0.647 t = -0.723
ai_rel_create -0.006 (0.013) -0.005 (0.012) -0.008 (0.020)
t = -0.425 t = -0.401 t = -0.388
trial_no -0.023 (0.029) -0.020 (0.027) -0.007 (0.045)
t = -0.814 t = -0.725 t = -0.166
ai_feelingconcerned 0.354 (0.358) 0.317 (0.339) 0.455 (0.565)
t = 0.990 t = 0.935 t = 0.807
ai_feelingexcited 0.270 (0.349) 0.107 (0.331) 0.207 (0.550)
t = 0.773 t = 0.324 t = 0.377
interest_groupcreative -0.573 (0.453) -0.492 (0.431) -0.694 (0.693)
t = -1.263 t = -1.142 t = -1.002
interest_grouptechnology 0.081 (0.465) 0.161 (0.443) 0.139 (0.711)
t = 0.173 t = 0.363 t = 0.196
condition_order 0.182∗∗ (0.092) 0.165 (0.087) 0.256 (0.143)
t = 1.982 t = 1.897 t = 1.798
log_duration -0.765∗∗∗ (0.200) -0.718∗∗∗ (0.189) -1.192∗∗∗ (0.313)
t = -3.820 t = -3.791 t = -3.809
n_seeds 0.398∗∗∗ (0.136) 0.440∗∗∗ (0.128) 0.597∗∗∗ (0.213)
t = 2.927 t = 3.429 t = 2.808
conditionLoExposure_Disclosed:creativity_human 0.034 (0.018) 0.040∗∗ (0.017) 0.070∗∗ (0.028)
t = 1.884 t = 2.313 t = 2.456
conditionLoExposure_Undisclosed:creativity_human 0.006 (0.018) 0.005 (0.017) 0.015 (0.028)
t = 0.304 t = 0.285 t = 0.519
conditionHiExposure_Disclosed:creativity_human -0.003 (0.018) -0.004 (0.017) -0.013 (0.028)
t = -0.169 t = -0.248 t = -0.455
conditionHiExposure_Undisclosed:creativity_human 0.024 (0.018) 0.022 (0.017) 0.030 (0.028)
t = 1.298 t = 1.256 t = 1.065
conditionLoExposure_Disclosed:ai_rel_create -0.018 (0.018) -0.017 (0.017) -0.031 (0.028)
t = -1.027 t = -0.991 t = -1.136
conditionLoExposure_Undisclosed:ai_rel_create 0.007 (0.018) 0.005 (0.017) 0.006 (0.028)
t = 0.365 t = 0.296 t = 0.207
conditionHiExposure_Disclosed:ai_rel_create -0.008 (0.018) -0.007 (0.017) -0.005 (0.027)
t = -0.462 t = -0.414 t = -0.199
conditionHiExposure_Undisclosed:ai_rel_create 0.038∗∗ (0.018) 0.040∗∗ (0.017) 0.065∗∗ (0.028)
t = 2.149 t = 2.384 t = 2.331
Constant 85.104∗∗∗ (1.733) 84.644∗∗∗ (1.607) 72.776∗∗∗ (2.466)
t = 49.099 t = 52.676 t = 29.508
Observations 3,271 3,271 3,271
Log Likelihood -11,201.060 -11,014.890 -12,630.840
Akaike Inf. Crit. 22,456.120 22,083.790 25,315.680
Bayesian Inf. Crit. 22,620.630 22,248.290 25,480.180
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table 16. Estimated marginal means contrasts of local idea diversity, using a mixed model to compare predictions for top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. Local idea diversity is computed as the median pairwise distance between a participant’s idea and the example ideas. P-values adjusted for multiple comparisons using the Holm-Bonferroni method.
contrast Relative AI Creativity Percentile estimate SE df t.ratio Adjusted P Value d
LoExposure_Undisclosed - HiExposure_Undisclosed 10 -0.368 1.540 20.311 -0.239 1.000 -0.048
LoExposure_Undisclosed - LoExposure_Disclosed 10 0.298 1.539 20.279 0.193 1.000 0.038
LoExposure_Undisclosed - HiExposure_Disclosed 10 -0.129 1.539 20.299 -0.084 1.000 -0.017
LoExposure_Undisclosed - Control 10 0.660 1.541 20.381 0.428 1.000 0.085
HiExposure_Undisclosed - LoExposure_Disclosed 10 0.666 1.540 20.356 0.432 1.000 0.086
HiExposure_Undisclosed - HiExposure_Disclosed 10 0.239 1.539 20.276 0.155 1.000 0.031
HiExposure_Undisclosed - Control 10 1.029 1.545 20.585 0.666 1.000 0.133
LoExposure_Disclosed - HiExposure_Disclosed 10 -0.427 1.540 20.346 -0.277 1.000 -0.055
LoExposure_Disclosed - Control 10 0.363 1.541 20.390 0.235 1.000 0.047
HiExposure_Disclosed - Control 10 0.790 1.545 20.591 0.511 1.000 0.102
LoExposure_Undisclosed - HiExposure_Undisclosed 90 -2.915 2.203 84.275 -1.323 1.000 -0.377
LoExposure_Undisclosed - LoExposure_Disclosed 90 2.284 2.202 84.112 1.037 1.000 0.295
LoExposure_Undisclosed - HiExposure_Disclosed 90 1.042 2.187 81.952 0.476 1.000 0.135
LoExposure_Undisclosed - Control 90 1.181 2.206 84.767 0.535 1.000 0.153
HiExposure_Undisclosed - LoExposure_Disclosed 90 5.198 2.206 84.630 2.357 0.207 0.672
HiExposure_Undisclosed - HiExposure_Disclosed 90 3.957 2.188 81.948 1.809 0.608 0.512
HiExposure_Undisclosed - Control 90 4.096 2.213 85.683 1.851 0.608 0.530
LoExposure_Disclosed - HiExposure_Disclosed 90 -1.242 2.190 82.342 -0.567 1.000 -0.161
LoExposure_Disclosed - Control 90 -1.102 2.207 84.791 -0.500 1.000 -0.143
HiExposure_Disclosed - Control 90 0.139 2.197 83.314 0.063 1.000 0.018
Table 17. Estimated marginal means contrasts of local idea diversity, using a mixed model to compare predictions for the top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. Local idea diversity is computed as the median pairwise distance between a participant’s idea and the example ideas. P-values adjusted for multiple comparisons using the Holm-Bonferroni method.
contrast condition estimate SE df t.ratio P Value d Adjusted P Value
ai_rel_create10 - ai_rel_create90 LoExposure_Undisclosed -0.077 1.043 3156.269 -0.074 0.941 -0.010 1.000
ai_rel_create10 - ai_rel_create90 HiExposure_Undisclosed -2.624 1.040 3136.309 -2.523 0.012 -0.339 0.058
ai_rel_create10 - ai_rel_create90 LoExposure_Disclosed 1.909 1.041 3159.288 1.833 0.067 0.247 0.268
ai_rel_create10 - ai_rel_create90 HiExposure_Disclosed 1.094 1.013 3181.870 1.080 0.280 0.141 0.840
ai_rel_create10 - ai_rel_create90 Control 0.444 1.046 3177.974 0.424 0.671 0.057 1.000
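The Holm-Bonferroni adjustment applied in Tables 16 and 17 can be illustrated with the unadjusted p-values from Table 17. The function below is a standard step-down implementation written for this illustration; its name is hypothetical and it is not taken from the authors' code.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = (m - rank) * p_values[idx]
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Unadjusted p-values in the row order of Table 17.
raw = [0.941, 0.012, 0.067, 0.280, 0.671]
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])
```

Run on these rounded inputs, the adjustment gives 1.000, 0.060, 0.268, 0.840, and 1.000. All but the second match the adjusted column of Table 17 exactly; the small gap between 0.060 and the reported 0.058 is presumably due to the raw p-value being rounded to 0.012 in the table.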

I.3. Evolution

To model the evolution of idea diversity, we pooled together submitted ideas at the level of (item, condition, trial number). We then computed the median pairwise distance, mean pairwise distance, and mean distance from centroid for each pool of ideas. We fit a model to test if idea diversity changed at a different rate for different conditions:

$$\text{variable}_{cti}=\beta_{0}+\beta_{1}\text{Condition}_{c}+\beta_{2}\text{TrialNo}_{t}+\beta_{3}\text{TrialNo X Condition}_{tc}+\beta_{4}\text{Nobs}_{cti}+u_{0i}+e_{cti}$$

Where:

  • $c$ indexes conditions.

  • $t$ indexes trial number.

  • $i$ indexes items.

  • $\beta_{0}$ is the global intercept.

  • $u_{0i}\sim N(0,\sigma_{u}^{2})$ are random intercepts for items.

  • $e_{cti}\sim N(0,\sigma^{2})$ is the residual.

We took two additional steps to make sure our results were not driven by confounding factors. First, $\text{Nobs}$ controls for how many ideas are in the set that is being analyzed. Recall that we designed the experiment so that each (item, condition) combination was replicated exactly seven times in response chains of exactly 20 trials. However, there were some minor deviations (discussed in SI L) in response chains, resulting in some (item, condition, trial number) sets having fewer ideas than others. Hence, we control for the number of ideas in a set. Second, we only ran this analysis on data after the sixth trial in a response chain to rule out the effect of seeds on evolution. The logic here is that the condition with the most initial seeds (6) was the control condition. The experiment is designed to ‘shed’ all seeds after trial six: by that time there would have been six experiment responses, so the most recent six ideas in the control condition would all be from the experiment, and no seeds would remain in the example sets. (Note that for all local analyses, we directly control for the number of seeds present in the example set as a fixed effect.)

Table 18. Evolution of idea diversity by condition. Each model has a random intercept for item. The reference level for experimental conditions is the control condition.
Dependent variable:
Median PW Distance Mean PW Distance Centroid Distance
(1) (2) (3)
nobs 0.827∗∗ (0.337) 0.894∗∗∗ (0.317) 3.940∗∗∗ (0.214)
t = 2.454 t = 2.818 t = 18.450
conditionLow ExposureUndisclosed -1.558 (3.349) -1.048 (3.151) -0.534 (2.121)
t = -0.465 t = -0.333 t = -0.252
conditionLow ExposureDisclosed -4.087 (3.368) -4.163 (3.168) -1.884 (2.133)
t = -1.213 t = -1.314 t = -0.883
conditionHigh ExposureUndisclosed -5.575 (3.417) -4.853 (3.214) -3.517 (2.164)
t = -1.632 t = -1.510 t = -1.625
conditionHigh ExposureDisclosed -6.003 (3.416) -5.573 (3.213) -3.608 (2.163)
t = -1.757 t = -1.734 t = -1.668
trial_no -0.391∗∗ (0.175) -0.321 (0.165) -0.141 (0.111)
t = -2.232 t = -1.948 t = -1.273
conditionLow ExposureUndisclosed:trial_no 0.140 (0.231) 0.111 (0.218) 0.069 (0.146)
t = 0.605 t = 0.512 t = 0.472
conditionLow ExposureDisclosed:trial_no 0.368 (0.233) 0.379 (0.220) 0.196 (0.148)
t = 1.578 t = 1.728 t = 1.328
conditionHigh ExposureUndisclosed:trial_no 0.525∗∗ (0.239) 0.461∗∗ (0.225) 0.335∗∗ (0.151)
t = 2.200 t = 2.051 t = 2.213
conditionHigh ExposureDisclosed:trial_no 0.566∗∗ (0.239) 0.561∗∗ (0.225) 0.378∗∗ (0.151)
t = 2.371 t = 2.498 t = 2.500
Constant 81.500∗∗∗ (3.899) 78.538∗∗∗ (3.669) 19.040∗∗∗ (2.479)
t = 20.904 t = 21.404 t = 7.681
Observations 362 362 362
Log Likelihood -1,158.987 -1,137.571 -998.930
Akaike Inf. Crit. 2,343.974 2,301.141 2,023.861
Bayesian Inf. Crit. 2,394.566 2,351.733 2,074.452
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
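One way to read the interaction terms in Table 18 is to add each to the control slope, which gives the implied per-trial change in idea diversity for every condition. The sketch below does this arithmetic for model (1); the variable and function names are hypothetical, and the point estimates ignore the standard errors reported in the table.

```python
# Coefficients copied from model (1) of Table 18 (median pairwise distance).
CONTROL_SLOPE = -0.391  # trial_no coefficient, i.e. the slope in the control condition
INTERACTIONS = {
    "Low Exposure, Undisclosed": 0.140,
    "Low Exposure, Disclosed": 0.368,
    "High Exposure, Undisclosed": 0.525,
    "High Exposure, Disclosed": 0.566,
}

def implied_slopes(control_slope, interactions):
    """Per-trial slope in each condition: control slope plus the interaction term."""
    slopes = {"Control": control_slope}
    for name, delta in interactions.items():
        slopes[name] = control_slope + delta
    return slopes

slopes = implied_slopes(CONTROL_SLOPE, INTERACTIONS)
for name, value in slopes.items():
    print(f"{name}: {value:+.3f} per trial")
```

On these point estimates, the control and low-exposure slopes are negative, while the two high-exposure conditions have slightly positive implied slopes (about +0.134 and +0.175 per trial).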

Appendix J Creativity

Table 19. Predictors of creativity with coefficients and SEs in parentheses. This model has a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.
Dependent variable:
Creativity
conditionLoExposure_Disclosed 0.011 (0.112)
t = 0.095
conditionLoExposure_Undisclosed 0.019 (0.112)
t = 0.166
conditionHiExposure_Disclosed -0.025 (0.113)
t = -0.221
conditionHiExposure_Undisclosed 0.050 (0.113)
t = 0.443
creativity_human 0.001 (0.001)
t = 1.587
ai_rel_create 0.001 (0.001)
t = 1.506
interest_groupcreative -0.069 (0.039)
t = -1.760
interest_grouptechnology -0.010 (0.040)
t = -0.249
trial_no -0.002 (0.003)
t = -0.797
ai_feelingconcerned -0.017 (0.033)
t = -0.526
ai_feelingexcited -0.046 (0.032)
t = -1.448
condition_order 0.003 (0.008)
t = 0.448
log_duration 0.097∗∗∗ (0.018)
t = 5.525
n_seeds -0.014 (0.012)
t = -1.215
Constant 3.192∗∗∗ (0.128)
t = 24.932
Observations 3,271
Log Likelihood -3,196.244
Akaike Inf. Crit. 6,430.488
Bayesian Inf. Crit. 6,546.252
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Appendix K AI Adoption

Table 20. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for the top 10 percentile and bottom 10 percentile of participants by self-perceived human creativity. AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values are adjusted for multiple comparisons using the Holm-Bonferroni method.
contrast Perceived Creativity Percentile estimate SE df t.ratio Adjusted P Value Cohen’s d
LoExposure_Undisclosed - HiExposure_Undisclosed 10 -6.021 2.369 37.026 -2.542 0.092 -0.502
LoExposure_Undisclosed - LoExposure_Disclosed 10 -3.669 2.369 37.048 -1.549 0.520 -0.306
LoExposure_Undisclosed - HiExposure_Disclosed 10 -1.924 2.367 36.890 -0.813 0.983 -0.160
HiExposure_Undisclosed - LoExposure_Disclosed 10 2.352 2.372 37.212 0.992 0.983 0.196
HiExposure_Undisclosed - HiExposure_Disclosed 10 4.097 2.365 36.779 1.732 0.458 0.341
LoExposure_Disclosed - HiExposure_Disclosed 10 1.744 2.370 37.092 0.736 0.983 0.145
LoExposure_Undisclosed - HiExposure_Undisclosed 90 -5.757 2.172 26.231 -2.650 0.040 -0.480
LoExposure_Undisclosed - LoExposure_Disclosed 90 0.599 2.170 26.127 0.276 1.000 0.050
LoExposure_Undisclosed - HiExposure_Disclosed 90 -6.852 2.168 26.058 -3.160 0.020 -0.571
HiExposure_Undisclosed - LoExposure_Disclosed 90 6.356 2.174 26.323 2.924 0.028 0.529
HiExposure_Undisclosed - HiExposure_Disclosed 90 -1.095 2.166 25.913 -0.506 1.000 -0.091
LoExposure_Disclosed - HiExposure_Disclosed 90 -7.451 2.171 26.171 -3.433 0.012 -0.621
Table 21. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for the top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. This metric captures how creative participants think AI is relative to humans (higher values mean more creative than humans). AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values are adjusted for multiple comparisons using the Holm-Bonferroni method.
contrast Relative AI Creativity Percentile estimate SE df t.ratio Adjusted P Value d
LoExposure_Undisclosed - HiExposure_Undisclosed 10 -5.225 1.952 17.121 -2.677 0.095 -0.435
LoExposure_Undisclosed - LoExposure_Disclosed 10 -1.115 1.949 17.038 -0.572 1.000 -0.093
LoExposure_Undisclosed - HiExposure_Disclosed 10 -4.651 1.951 17.100 -2.384 0.145 -0.387
HiExposure_Undisclosed - LoExposure_Disclosed 10 4.110 1.953 17.186 2.104 0.201 0.342
HiExposure_Undisclosed - HiExposure_Disclosed 10 0.574 1.949 17.013 0.295 1.000 0.048
LoExposure_Disclosed - HiExposure_Disclosed 10 -3.536 1.953 17.169 -1.811 0.263 -0.295
LoExposure_Undisclosed - HiExposure_Undisclosed 90 0.422 3.168 115.486 0.133 1.000 0.035
LoExposure_Undisclosed - LoExposure_Disclosed 90 -1.388 3.161 114.586 -0.439 1.000 -0.116
LoExposure_Undisclosed - HiExposure_Disclosed 90 -2.363 3.130 110.316 -0.755 1.000 -0.197
HiExposure_Undisclosed - LoExposure_Disclosed 90 -1.811 3.172 115.993 -0.571 1.000 -0.151
HiExposure_Undisclosed - HiExposure_Disclosed 90 -2.785 3.134 110.601 -0.889 1.000 -0.232
LoExposure_Disclosed - HiExposure_Disclosed 90 -0.974 3.135 110.956 -0.311 1.000 -0.081
Table 22. Estimated marginal means contrasts of AI adoption, using a mixed model to compare predictions for top 10 percentile and bottom 10 percentile of participants by belief in relative AI creativity. This metric captures how creative participants think AI is relative to humans (higher values mean more creative than humans). AI Adoption is the max cosine similarity of a participant’s response and AI examples. P-values adjusted for multiple comparisons using the Holm-Bonferroni method.
contrast condition estimate SE df t.ratio Adjusted P Value d
ai_rel_create10 - ai_rel_create90 LoExposure_Undisclosed -0.987 1.673 2545.523 -0.590 0.555 -0.082
ai_rel_create10 - ai_rel_create90 HiExposure_Undisclosed 4.660 1.669 2538.494 2.792 0.005 0.388
ai_rel_create10 - ai_rel_create90 LoExposure_Disclosed -1.261 1.668 2542.774 -0.756 0.450 -0.105
ai_rel_create10 - ai_rel_create90 HiExposure_Disclosed 1.301 1.611 2480.927 0.808 0.419 0.108
Table 23. Predictors of AI adoption with coefficients and SEs in parentheses. The respective dependent variables are the max, mean, and median cosine similarities between the SBERT embedding of a participant’s response and the SBERT embeddings of AI examples the participant saw. All three models have a random intercept for participants crossed with a random intercept for response chains, nested in (item, condition) combinations.
Dependent variable:
Max AI Similarity Mean AI Similarity Median AI Similarity
(1) (2) (3)
conditionLoExposure_Undisclosed -5.481 (2.801) -3.962 (2.356) -3.960 (2.412)
t = -1.957 t = -1.682 t = -1.642
conditionHiExposure_Disclosed -2.243 (2.799) -3.498 (2.355) -3.415 (2.411)
t = -0.801 t = -1.485 t = -1.416
conditionHiExposure_Undisclosed 3.507 (2.799) 0.098 (2.355) -0.584 (2.411)
t = 1.253 t = 0.041 t = -0.242
creativity_human -0.059∗∗∗ (0.022) -0.041∗∗ (0.017) -0.041∗∗ (0.017)
t = -2.726 t = -2.430 t = -2.363
ai_rel_create 0.016 (0.021) 0.015 (0.016) 0.015 (0.017)
t = 0.757 t = 0.911 t = 0.885
interest_groupcreative 2.425 (1.360) 1.853 (1.052) 1.828 (1.074)
t = 1.784 t = 1.761 t = 1.702
interest_grouptechnology 0.708 (1.389) 0.979 (1.074) 0.945 (1.097)
t = 0.510 t = 0.912 t = 0.862
trial_no -0.037 (0.050) 0.005 (0.039) 0.022 (0.040)
t = -0.734 t = 0.122 t = 0.556
ai_feelingconcerned -0.261 (0.630) -0.211 (0.479) -0.189 (0.489)
t = -0.413 t = -0.441 t = -0.388
ai_feelingexcited 0.569 (0.615) 0.165 (0.467) 0.149 (0.477)
t = 0.925 t = 0.354 t = 0.312
condition_order -0.204 (0.164) -0.201 (0.128) -0.212 (0.131)
t = -1.247 t = -1.569 t = -1.619
log_duration 0.529 (0.354) 0.539∗∗ (0.273) 0.510 (0.279)
t = 1.494 t = 1.974 t = 1.828
n_seeds -0.435 (0.309) -0.265 (0.240) -0.214 (0.245)
t = -1.407 t = -1.104 t = -0.875
conditionLoExposure_Undisclosed:creativity_human 0.053 (0.029) 0.037 (0.023) 0.037 (0.023)
t = 1.820 t = 1.601 t = 1.557
conditionHiExposure_Disclosed:creativity_human 0.115∗∗∗ (0.029) 0.066∗∗∗ (0.023) 0.061∗∗∗ (0.023)
t = 3.931 t = 2.868 t = 2.601
conditionHiExposure_Undisclosed:creativity_human 0.050 (0.029) 0.030 (0.023) 0.039 (0.024)
t = 1.702 t = 1.293 t = 1.655
conditionLoExposure_Undisclosed:ai_rel_create -0.003 (0.028) -0.004 (0.022) -0.004 (0.023)
t = -0.121 t = -0.178 t = -0.174
conditionHiExposure_Disclosed:ai_rel_create -0.032 (0.028) -0.011 (0.022) -0.007 (0.022)
t = -1.151 t = -0.523 t = -0.330
conditionHiExposure_Undisclosed:ai_rel_create -0.074∗∗∗ (0.028) -0.065∗∗∗ (0.022) -0.071∗∗∗ (0.023)
t = -2.612 t = -2.924 t = -3.146
conditionLoExposure_Undisclosed:interest_groupcreative 1.036 (1.851) 0.819 (1.446) 0.834 (1.476)
t = 0.560 t = 0.566 t = 0.565
conditionHiExposure_Disclosed:interest_groupcreative -0.465 (1.844) -0.421 (1.440) -0.523 (1.470)
t = -0.252 t = -0.293 t = -0.356
conditionHiExposure_Undisclosed:interest_groupcreative -3.224 (1.843) -2.329 (1.440) -2.344 (1.470)
t = -1.750 t = -1.617 t = -1.595
conditionLoExposure_Undisclosed:interest_grouptechnology 2.810 (1.888) 1.516 (1.474) 1.535 (1.504)
t = 1.488 t = 1.028 t = 1.020
conditionHiExposure_Disclosed:interest_grouptechnology -1.391 (1.871) -1.550 (1.461) -1.731 (1.491)
t = -0.743 t = -1.061 t = -1.161
conditionHiExposure_Undisclosed:interest_grouptechnology -1.522 (1.875) -1.870 (1.465) -1.926 (1.495)
t = -0.811 t = -1.277 t = -1.288
Constant 23.789∗∗∗ (2.704) 17.655∗∗∗ (2.185) 17.609∗∗∗ (2.235)
t = 8.796 t = 8.080 t = 7.880
Observations 2,618 2,618 2,618
Log Likelihood -10,155.720 -9,508.078 -9,561.834
Akaike Inf. Crit. 20,371.430 19,076.160 19,183.670
Bayesian Inf. Crit. 20,547.540 19,252.260 19,359.770
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
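The three dependent variables in Table 23 (max, mean, and median cosine similarity between a response embedding and the embeddings of the AI examples a participant saw) can be sketched as below. This is an illustrative sketch only: the function name `ai_adoption_scores` and the toy vectors are hypothetical, the embeddings are assumed to have been computed beforehand (for example with an SBERT model), and any rescaling applied in the reported models is not reproduced here.

```python
import numpy as np


def ai_adoption_scores(response_vec, ai_example_vecs):
    """Summarize cosine similarity of one response against AI examples.

    response_vec: 1-D embedding of the participant's response.
    ai_example_vecs: 2-D array, one row per AI example embedding.
    Returns the max, mean, and median cosine similarity.
    """
    r = np.asarray(response_vec, dtype=float)
    examples = np.asarray(ai_example_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of norms.
    sims = examples @ r / (np.linalg.norm(examples, axis=1) * np.linalg.norm(r))
    return {
        "max": float(sims.max()),
        "mean": float(sims.mean()),
        "median": float(np.median(sims)),
    }


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real sentence embeddings.
    response = [0.2, 0.9, 0.1]
    ai_examples = [[0.1, 0.8, 0.3], [0.9, 0.1, 0.0]]
    print(ai_adoption_scores(response, ai_examples))
```

With only two AI examples, as in the toy call, the mean and median coincide; the three summaries can diverge once a participant has seen more examples.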

Appendix L Implementation Complications

Running a massive online networked experiment open to any user on the Internet will often invite implementation challenges. In the interest of disclosure—and for the benefit of future researchers running similar experiments—we share the challenges we faced, solutions we implemented, and rationales for our decisions.

L.1. Server Capacity

Overall, we received far more responses than we expected. At several points throughout the experiment, we experienced more concurrent traffic than the application was designed to handle. Hence, we had to temporarily turn off the experiment to wait out high demand, add more resources, or implement and test changes described in Content Moderation. We note that these capacity issues did not affect the responses we collected.

L.2. Content Moderation

Initially, we did not implement any content moderation. On July 8th, however, we received an influx of responses from Reddit, and several participants were trolls who submitted repetitive, profane responses. We then implemented a form of content moderation: we flagged any idea containing a word from a list of words banned by Google as of July 8, 2023 (https://github.com/coffee-and-fun/google-profanity-words), and subsequently added two more words to the list. If an idea was flagged, it was written to our database but not shown to future participants. Some profane ideas had already been shown to participants in the interval between when those responses were submitted and when we noticed the problem and deployed our solution. The content moderation strategy was also imperfect: 10 of 46 flagged ideas were false positives. As discussed earlier, condition was unrelated to the number of flagged ideas ($\chi^{2}(4) = 6.06$, $p = 0.19$) or the total number of excluded ideas ($\chi^{2}(4) = 3.87$, $p = 0.42$).

We acknowledge there are more advanced and nuanced content moderation strategies, but this one was the best option given our specific circumstances and constraints. First, this bag-of-words method is very transparent. Second, we deployed this experiment through Heroku, which imposes a CPU limit on the project, precluding the use of pre-trained classifiers such as BERT. Third, we did not want to use APIs like Jigsaw or OpenAI moderation endpoints because these APIs have rate limits, which can slow down the experiment.
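The word-list flagging described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the deployed code: the function names are hypothetical, the banned list here is a placeholder rather than the actual Google list, and matching on whole lowercase tokens (rather than substrings) is an assumption, since the matching rule is not specified.

```python
import re


def is_flagged(idea, banned_words):
    """Return True if any whole word of the idea is in the banned set.

    Assumption: case-insensitive whole-token matching. Substring matching
    would flag more ideas and produce more false positives.
    """
    tokens = set(re.findall(r"[a-z0-9']+", idea.lower()))
    return not tokens.isdisjoint(banned_words)


def visible_ideas(ideas, banned_words):
    """Keep every idea in storage, but return only the unflagged ones.

    Mirrors the policy in the text: flagged ideas are still recorded,
    just never shown to future participants.
    """
    return [idea for idea in ideas if not is_flagged(idea, banned_words)]


if __name__ == "__main__":
    # Placeholder word standing in for entries of a real banned-word list.
    banned = {"badword"}
    stored = ["use it as a doorstop", "BADWORD badword", "a classic planter"]
    print(visible_ideas(stored, banned))
```

A lookup of this kind costs one set intersection per idea, which fits the CPU constraint mentioned above, but it has no notion of context, which is one plausible source of false positives.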

L.3. Small Deviations From 20 Trials Per Chain

We intended for each response chain to contain 20 responses. The average number of trials per response chain was 19.73 (SD = 1.45) and the median was 20. We concluded the experiment before the last round of response chains had finished for all condition and item combinations, so the minimum number of trials in a response chain (occurring for an item and condition combination in the last round) was 14. The maximum number of trials was 24. These minor deviations occurred because of server overload, race conditions caused by very high traffic, and the exclusion of several responses based on the criteria described in Appendix D. Based on a two-way ANOVA, we concluded that response chain lengths did not differ by item ($F(4) = 0.38$, $p = 0.82$), condition ($F(4) = 0.01$, $p = 1.00$), or the interaction between items and conditions ($F(16) = 0.01$, $p = 1.00$).

Appendix M Experiment Screenshots

Figure 12. Screenshot of a trial participants are told to complete. This is the [Low Exposure, Disclosed] condition since there are two AI ideas and the AI ideas are labeled.
Figure 13. After each trial, participants were given feedback as motivation to continue.
Figure 14. Screenshots of the experiment and feedback we provided after each response