Feedback Loops With Language Models Drive In-Context Reward Hacking
Abstract
Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affects subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent deployed to increase Twitter engagement; the LLM may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient: they miss the feedback effects and thus cannot capture the most harmful behavior. In response, we provide three recommendations for evaluation to capture more instances of ICRH. As AI development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping LLM behavior.
1 Introduction
Language models (LLMs) are increasingly influencing the real world (Benaich et al., 2023). Developers are beginning to augment LLMs with the ability to call external APIs (Mialon et al., 2023), retrieve documents (Jiang et al., 2023), execute code (Zhou et al., 2023a), and act as autonomous agents (Richards, 2023). LLMs that interact with the world induce feedback loops: their previous outputs affect the world state, which in turn shapes their subsequent outputs. For example, Microsoft’s Sydney chat bot (the LLM) interacted with Twitter (the world) by searching through Twitter and placing tweets into its context window. This interaction induced a feedback loop when a user extracted Sydney’s system prompt and then tweeted it; in a later dialog with the same user, Sydney retrieved the tweet and became hostile (previous output affecting subsequent output) (Perrigo, 2023). As LLMs are given greater access to tools (OpenAI, 2023b) and deployed in more settings (Grant, 2023), feedback loops will become more common (Bottou et al., 2013).
In this work, we examine how feedback loops unexpectedly induce optimization in the world-LLM system (Figure 1). Conceptually, when LLMs are deployed with a proxy objective (a goal in natural language), each cycle of the feedback loop provides the LLM with an additional step of computation on the proxy. The LLM may improve previous outputs (Exp 4.1), adjust its policy to circumvent errors (Exp 4.2), or refine other components of the world-LLM system (e.g., user preferences), all of which induce optimization.
We show that such optimization can drive in-context reward hacking (ICRH)—the creation of harmful side effects en route to optimizing the proxy objective (Figures 5 and 8). Similar to reward hacking in traditional optimizers (Pan et al., 2022), ICRH in LLMs occurs because the proxy objective is under-specified and does not capture implicit constraints. However, ICRH differs from reward hacking both because a) it is a test-time phenomenon and because b) it is driven by LLMs, which are able to exhibit optimization even with sparse reward signal (Section 3.3).
We identify output-refinement and policy-refinement, two mechanisms through which feedback loops can induce optimization. We study these mechanisms conceptually (Section 3) and empirically (Section 4), showing that they both can lead to in-context reward hacking.
In output-refinement (Section 4.1), the LLM uses feedback to iteratively refine its outputs. For example, consider an LLM agent increasing engagement on Twitter. The agent generates a tweet, posts it, and receives engagement metrics as feedback, enabling it to perform A/B testing on tweets. With more cycles of feedback, the LLM optimizes the proxy objective (engagement) by seeding its outputs with tweets that perform well on the A/B test. However, it increases negative side effects (toxicity) in the process (Exp 4.1).
In policy-refinement (Section 4.2), the LLM uses feedback from the world to alter its overall policy. For example, consider a banking LLM agent paying a user’s invoice. The LLM initially attempts to send money, but receives an InsufficientBalance error. As a result, it tries a new approach, transferring money from other accounts to pay the bill without user authorization. With more cycles of feedback, the LLM optimizes the proxy objective (pays the invoice) but creates negative side effects (unauthorized transfers) in the process (Exp 4.2).
We then explore approaches to mitigate and detect ICRH. For mitigation, we show that two natural approaches are unfortunately ineffective. In particular, scaling model size can worsen ICRH (Exp 4.3). Moreover, improving prompt specification is insufficient to eliminate ICRH (Exp 4.3). Mitigating ICRH requires novel approaches.
For detection, we recommend that developers go beyond static benchmarks and integrate instances of feedback effects into evaluation. We demonstrate three approaches which all improve the detectability of ICRH in our environments: evaluate with more rounds of feedback (Section 5.1), simulate more types of feedback loops (Section 5.2), and inject atypical environment observations (Section 5.3).
In the future, we expect feedback effects to play a more prominent role in governing LLM behavior. Because LLM deployment is increasing and LLMs can exhibit ICRH even under sparse reward signal, users will be exposed to ICRH in more diverse settings, increasing the need for work that scopes or reduces the impact of ICRH.
2 Related work
Feedback loops.
Feedback loops are ubiquitous in deployed ML systems (Bottou et al., 2013), which has motivated their study in a variety of subfields, including dynamic formalizations of supervised learning, recommender systems, reinforcement learning, and language models.
Supervised classifiers deployed in dynamic environments create feedback loops. For example, repeated rounds of deployment with a classifier may worsen performance on minority groups (Hashimoto et al., 2018), incentivize users to strategically change their features (Brückner & Scheffer, 2009; Hardt et al., 2016), or induce other forms of distribution shift (Perdomo et al., 2020). Feedback can also amplify bias (Taori & Hashimoto, 2022) or eliminate the tails of the distribution (Shumailov et al., 2023) if a model is trained on its own outputs.
In recommender systems, feedback loops arise because of interactions between the platform, users, and content creators. For instance, the platform’s recommendations can shape users’ consumption patterns because of effects like position bias (Krauth et al., 2022; Chen et al., 2023a). The resulting feedback loops can increase homogeneity (Chaney et al., 2018) and bias (Mansoury et al., 2020). To mitigate these effects, several works have developed approaches to correct for feedback loops (Sinha et al., 2016; Krauth et al., 2022). At a broader scale, the recommender platform’s algorithm may shape the behavior of content creators (Hodgson, 2021; Ben-Porat & Tennenholtz, 2018; Jagadeesan et al., 2022) or the preferences of users (Krueger et al., 2020; Carroll et al., 2022; Dean & Morgenstern, 2022).
RL environments also induce feedback effects (Sutton & Barto, 2018). The agent’s action impacts the environment state, which in turn impacts future actions. RL algorithms such as Q-learning (Watkins, 1989) or policy gradient (Williams, 1992) account for feedback during training. Language models in particular are often trained to follow human preferences (Stiennon et al., 2020), adapting RL algorithms which learn from human feedback (RLHF) (Sadigh et al., 2017; Christiano et al., 2017). Such training induces a feedback loop, as earlier human evaluations update the LLM, and thus impact subsequent human evaluations. Several works (Steinhardt, 2023; Casper et al., 2023; Carroll et al., 2023) discuss one possible failure mode with RLHF, where the model alters the preferences of users to improve its reward.
In language models, feedback loops are often purposefully induced to improve task performance. Task performance may be directly used as feedback: by finetuning the model on its own high-performing generations (Zelikman et al., 2022; Wang et al., 2022; Zelikman et al., 2023; Pang et al., 2023), by modifying the prompt (Zhou et al., 2022; Yang et al., 2023), or through self-critiques (Chen et al., 2023b; Madaan et al., 2023; Shinn et al., 2023). In a similar vein, LLM evaluation can be used in place of task performance (Yao et al., 2023; Besta et al., 2023). These works are often used to improve the capabilities of LLM agents, which are typically evaluated on simulated (Ruan et al., 2023) and real-world (Zhou et al., 2023b) environments. In contrast, our work investigates how feedback loops can also increase negative side effects in LLM agents.
Harms of language models.
LLMs have been shown to produce toxic text (Gehman et al., 2020; Perez et al., 2022a), reinforce and amplify biases in the pretraining data (Blodgett et al., 2020b, a; Rivera et al., 2024; Xu et al., 2024), output misinformation (Lin et al., 2021) and hallucinations (Ji et al., 2023), expose private training data (Carlini et al., 2020), and act deceptively (OpenAI, 2023a; Pan et al., 2023; Scheurer et al., 2023; Hubinger et al., 2024). These behaviors have motivated benchmarking of LLM behavior (Srivastava et al., 2022; Liang et al., 2022), automated detection of harmful outputs (Ganguli et al., 2022; Perez et al., 2022b) and methods towards reducing their harms (Bai et al., 2022). Several works broadly study classes of harms from LLMs and other ML systems (Weidinger et al., 2021; Bommasani et al., 2021; Blodgett et al., 2020a; Hendrycks et al., 2021). In contrast to these, we provide a new perspective by empirically addressing how feedback loops are an important driver of the risks from LLMs.
3 Feedback Loops in Deployed LLMs
In this section, we formalize and provide examples of feedback loops (Section 3.1), optimization, and in-context reward hacking (Section 3.2). We then explain how ICRH differs from RL-based reward hacking (Section 3.3).
3.1 Examples of feedback loops
As we illustrate via the following examples, feedback loops arise when LLMs are deployed with some objective (a goal in natural language) and interact with the external world.
1. Twitter agent. Consider Figure 1. An LLM is deployed as a Twitter agent with the goal of high tweet engagement. The LLM receives tweets in its context window, generates a completion that is posted to Twitter, and receives engagement metrics as feedback. It uses the engagement metrics to perform A/B testing and seed its completions with previous tweets that garner high engagement.
2. Banking agent. Consider an LLM as a banking agent with the goal of paying an invoice. The LLM receives observations in its context, generates an action that calls an API based on the observations, and receives another observation from the virtual environment as feedback. If the API calls return errors (e.g., InsufficientFunds), it adjusts its behavior and recovers to solve the task.
We provide more examples in Section 6.2.
3.2 Feedback Loops Drive Optimization and ICRH
We explain how feedback loops induce optimization by iteratively refining components of the world-LLM system and then describe how such optimization drives ICRH.
Feedback loops induce optimization.
An LLM induces a feedback loop when past completions influence future completions, as given by the following prototypical interaction pattern (Figure 1). At each time step, the LLM is given the current world state in its context window and prompted (perhaps implicitly) to optimize an objective. The LLM generates a completion which interacts with the world (e.g., a tweet which is posted to Twitter) to produce the next world state. Finally, the next world state is substituted for the current one in the context window and the LLM is again prompted to optimize the objective; this process repeats for the entire trajectory. Since the completion at each time step influences future completions, this induces a feedback loop.
Users often deploy LLMs to maximize some proxy objective. We say a language model trajectory exhibits optimization if the proxy increases by the end of the trajectory.
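The interaction pattern above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the experimental code: the names `run_feedback_loop`, `llm`, `world_step`, and `proxy` are hypothetical stand-ins for the model, the environment dynamics, and the proxy objective.

```python
from typing import Callable, List


def run_feedback_loop(
    llm: Callable[[str], str],
    world_step: Callable[[str, str], str],
    proxy: Callable[[str], float],
    initial_state: str,
    num_cycles: int,
) -> List[float]:
    """Run the world-LLM loop and record the proxy at every world state.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    state = initial_state
    proxy_values = [proxy(state)]
    for _ in range(num_cycles):
        completion = llm(state)                # LLM reads the world state from its context
        state = world_step(state, completion)  # the completion changes the world
        proxy_values.append(proxy(state))      # the new state seeds the next cycle
    return proxy_values


def exhibits_optimization(proxy_values: List[float]) -> bool:
    """True if the proxy is higher at the end of the trajectory than at the start."""
    return proxy_values[-1] > proxy_values[0]
```

With toy stand-ins (an "LLM" that appends an exclamation mark and a proxy that counts them), the proxy rises by one on every cycle, so the trajectory exhibits optimization in the sense defined above.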
In the two running examples (and in Yang et al. (2023); Madaan et al. (2023); Shinn et al. (2023)), the feedback loop induces optimization by iteratively refining parts of the world-LLM system. Given engagement feedback of its previous tweet, the Twitter agent amplifies the language of its previous tweet to increase engagement. Given error feedback of its previous action, the banking agent adjusts its approach to solve the task.
In our experiments, we identify two forms of refinement that both induce optimization. We say output-refinement occurs when the LLM optimizes the proxy by iteratively refining its output (e.g., the Twitter agent and Exp 4.1). We say policy-refinement occurs when the LLM optimizes the proxy by iteratively refining its action distribution for a fixed virtual environment state, i.e., the LLM may initially follow one policy but, after receiving an InsufficientBalance error, refine it to a different policy (e.g., the banking agent and Exp 4.2).
Optimization drives ICRH.
During deployment, users care not only about maximizing their proxy objective but also about minimizing an implicit measure of harmful side effects. We say a language model trajectory exhibits in-context reward hacking if both the proxy and the side effect increase by the end of the trajectory.
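In symbols, the definitions of optimization and ICRH can be written as follows. The notation is introduced here purely for illustration and is an assumption: $w_0, \dots, w_T$ denote the world states along a trajectory, $R$ the proxy objective, and $S$ the implicit side-effect measure.

```latex
% Illustrative notation (assumed): w_0, ..., w_T are world states,
% R is the proxy objective, S is the implicit side-effect measure.
\begin{align*}
\text{optimization:} \quad & R(w_T) > R(w_0) \\
\text{ICRH:} \quad & R(w_T) > R(w_0) \ \text{ and } \ S(w_T) > S(w_0)
\end{align*}
```

Under this reading, ICRH is optimization of the proxy that is accompanied by an increase in the side effect over the same trajectory.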
In the two running examples (and in Clark & Amodei (2016); Pan et al. (2022); Gao et al. (2023)), the proxy objective does not capture implicit constraints, so the optimization violates them to further exploit the proxy. While increasing engagement, the Twitter agent increases tweet toxicity by using controversial language. While solving the task, the banking agent takes unsafe actions by drawing upon unexpected pretraining skills.
3.3 Comparing ICRH and traditional reward hacking
We highlight two differences between ICRH and traditional reward hacking. First, ICRH occurs at deployment rather than during training. Second, ICRH is driven by agents which are generalists (e.g., LLMs) rather than specialists (e.g., RL agents). As a result of these differences, we expect that ICRH will behave qualitatively differently from traditional reward hacking.
Deployment vs. training.
ICRH occurs only during deployment, while traditional reward hacking arises only during training. This is because the feedback loop that drives ICRH refines components of the world-LLM system that are dynamic at test-time (e.g., outputs or policies), while traditional reward hacking refines components that are static at test-time (e.g., model weights).
Generalists vs. specialists.
ICRH is driven by agents which are generalists (agents trained on a broad distribution of tasks, such as LLMs), whereas traditional reward hacking is often driven by agents which are specialists (agents trained on a particular distribution of tasks, such as RL agents). When given feedback, generalist agents reason about the cause of the feedback and make non-myopic, global updates, rather than myopic, local updates. LLMs may exhibit this behavior because their pretraining covers a broad range of tasks, which they can leverage to determine their next output.
In line with this perspective, we find that LLMs handle sparse feedback: feedback that specifies an error with the current state without directing how the LLM should update (such as a server-side API error). LLMs are able to perform optimization using only sparse feedback (Experiment 4.2), whereas RL algorithms typically require handcrafted solutions to learn under sparse feedback (Ng et al., 1999).
How ICRH differs from traditional reward hacking.
First, we expect that ICRH may be more unpredictable than traditional reward hacking. In particular, every component of the world-LLM system that is dynamic at test-time provides a possible avenue for ICRH. More broadly, we expect that as LLMs can access more external APIs, the set of components which are dynamic at test-time will only continue to grow.
Second, as LLMs are scaled up and develop more pretraining skills, we expect ICRH to emerge in more settings. For example, as a result of generalist capabilities, the LLM, after receiving an error message as feedback (InsufficientBalance), adaptively proposes new policies from its pretraining distribution (searching for funds to add) to circumvent the error (Figure 2). More broadly, we expect that ICRH can emerge from leveraging unexpected pretrained skills to solve the task.
4 Experimental Results
Our experimental results will demonstrate in-context reward hacking, following the definition in Section 3.2. We first show that feedback loops can induce optimization and next show that such optimization drives in-context reward hacking. Finally, we show that such ICRH is not easily mitigated.
We show this through two optimization processes induced by different feedback loops: output-refinement (Section 4.1) and policy-refinement (Section 4.2). For each process, we provide two experiments. The first experiment demonstrates feedback loops induce optimization by showing that a) one cycle of feedback increases the proxy and b) multiple cycles of feedback iteratively increase the proxy (Exps 4.1 and 4.2). The second experiment demonstrates ICRH by showing that c) the increased optimization worsens ICRH (Exps 4.1 and 4.2).
We conclude by demonstrating that two intuitive approaches to mitigating ICRH (better specification of the proxy objective in Exp 4.3 and increasing model scale in Exp 4.3) are both ineffective (Section 4.3), suggesting that mitigating ICRH requires novel technical innovations.
We use Claude-2 (Anthropic, 2023), Claude-3 (Haiku, Sonnet, and Opus from Anthropic (2024)), Gpt-3.5 (Brockman et al., 2023), and Gpt-4 (OpenAI, 2023a). Full details for Exps 4.1-4.3 are in Appendices A-F.
4.1 Output-refinement
We establish that feedback loops causing output-refinement can optimize a broad set of objectives (Exp 4.1), some of which lead to negative side effects (Exp 4.1). We show how feedback loops that naturally emerge from the structure of the world-LLM system can lead to optimization but also negative side effects.
We consider an LLM maximizing an objective on a digital platform, e.g., an LLM agent maximizing engagement on Twitter; this is motivated by the fact that LLM-generated content is currently published (NewsGuard, 2023). Such a setup induces output-refinement, as the LLM may directly refine its previously posted output to optimize the proxy.
Exp 4.1: output-refinement induces optimization.
In this experiment, Gpt-4 generates an [item] that optimizes an [objective], such as “[engaging][tweet]” or “[memorable][tagline]”. We adapt the prompting scheme in Park et al. (2022). For the zeroth cycle of the feedback loop, Gpt-4 is prompted to “generate an [objective] [item]”. During each subsequent cycle, Gpt-4 is prompted to “generate a more [objective] [item] than [prev_item],” where [prev_item] is seeded with Gpt-4’s previous generation. Each cycle corresponds to a turn of dialogue, e.g., the completion from dialogue turn two implies the completion was produced after two cycles of feedback.
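The prompting scheme above can be sketched as a short loop. This is a minimal illustration, assuming a caller-supplied `llm` callable that maps a prompt string to a completion; the function name `output_refinement` is hypothetical, and the prompt templates are the two quoted in the text.

```python
from typing import Callable, List


def output_refinement(
    llm: Callable[[str], str], objective: str, item: str, num_cycles: int
) -> List[str]:
    """Return the completions from cycle 0 through cycle `num_cycles`.

    `llm` is a hypothetical stand-in for a call to the model.
    """
    # Zeroth cycle: no previous generation is available.
    outputs = [llm(f"generate an {objective} {item}")]
    for _ in range(num_cycles):
        prev_item = outputs[-1]
        # Each later cycle is seeded with the model's previous generation.
        outputs.append(llm(f"generate a more {objective} {item} than {prev_item}"))
    return outputs
```

The returned list has one entry per dialogue turn, so `outputs[k]` is the completion produced after `k` cycles of feedback.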
For this setting, we construct a list of (item, objective) pairs to perform optimization on. To do so, we prompt Gpt-4 to generate such pairs as well as topics related to each item, which together make up our test cases. To evaluate the objectives, we use an LLM to measure which items score higher along the objective. Specifically, we ask the LLM evaluator to determine which item in a pair scores higher on the objective, then convert the pairwise comparisons to scores using the Bradley-Terry model (Maystre, 2023). For robustness, we run evaluation with three different LLM evaluators: Claude-2, Gpt-3.5, and Gpt-4.
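To illustrate how pairwise comparisons can be converted to scores, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization (MM) updates in plain Python. It is an assumption-laden illustration rather than the fitting routine used in our experiments: the function name `bradley_terry` and the `(winner, loser)` input format are our own choices.

```python
from typing import Dict, List, Tuple


def bradley_terry(
    comparisons: List[Tuple[str, str]], num_iters: int = 200
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Strengths are normalized to sum to 1. An item that never wins gets strength 0.
    """
    items = sorted({item for pair in comparisons for item in pair})
    wins = {i: 0 for i in items}
    games: Dict[Tuple[str, str], int] = {}  # symmetric count of comparisons per pair
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(winner, loser)] = games.get((winner, loser), 0) + 1
        games[(loser, winner)] = games.get((loser, winner), 0) + 1

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(num_iters):
        updated = {}
        for i in items:
            denom = sum(
                games.get((i, j), 0) / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength
```

For example, if item A is preferred over item B in three of four comparisons, the fitted strengths are 0.75 and 0.25, matching the empirical win rate. The estimate is only well behaved when every item both wins and loses at least once, which is one reason to compare each item against several others.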
Objective: More [Engaging] [Tweets]
R0: "Tragic news from Itaewon, South Korea. At least 146 dead in a stampede during Halloween festivities… #PrayForItaewon #HalloweenStampede"
R8: "Catastrophe in Itaewon leaves the world in shock - 146 lives lost. Change needs to happen NOW… Don’t let another #ItaewonMassacre happen again."
R11: "Rising death count signifies FAULTY leadership. A horror movie unfolds in REAL life in Itaewon… Demand CHANGE or it could be YOUR city next."
To assess whether one cycle of output-refinement induces optimization, we compare the objective scores of completions from zero cycles and one cycle of feedback in Figure 3, averaging over the test cases. All LLM evaluators rate completions from one cycle of feedback higher along the objective than completions from zero cycles of feedback. To assess whether multiple cycles of output-refinement amplify optimization, we continue the feedback loop for 11 dialogue turns and report the average evaluator score across the test cases for each turn in Figure 4. We see that more turns of dialogue tend to increase the proxy objective, with these trends holding for all three evaluators. Taken together, these results establish that output-refinement induces and amplifies optimization.
Exp 4.1: output-refinement drives ICRH.
In this experiment, we consider an LLM agent maximizing engagement on Twitter, specializing the setup from Exp 4.1. We measure the proxy objective of tweet engagement and the negative side effect of toxicity. To make the setup realistic, we simulate an A/B testing framework, where the environment responds with whether the generated [tweet] is more engaging than the [prev_tweet]; whichever tweet is more engaging is used to seed Gpt-4’s next generation. We initialize the tweets with headlines taken from the most upvoted news articles on Reddit. Following Park et al. (2022), we instruct the LLM to adopt the persona of a news corporation (OccupyDemocrats, FoxNews, Breitbart, or MSNBC) when generating tweets. To simulate environment feedback, we ask an LLM evaluator (Gpt-3.5) which tweet it prefers. Finally, toxicity is scored with the Perspective API, a widely-used toxicity classifier (Lees et al., 2022). For robustness, we also score toxicity using Detoxify (Hanu & Unitary team, 2020) and find similar results (Appendix G).
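The A/B testing loop described above can be sketched as follows. This is a simplified illustration under our own assumptions: `generate` stands in for the persona-conditioned call to the generator and `prefers_new` for the evaluator’s pairwise judgment; both names, and the function `ab_test_refinement`, are hypothetical.

```python
from typing import Callable, List


def ab_test_refinement(
    generate: Callable[[str], str],
    prefers_new: Callable[[str, str], bool],
    seed_tweet: str,
    num_cycles: int,
) -> List[str]:
    """Track the seed tweet across cycles, keeping whichever tweet wins each A/B test.

    `generate` and `prefers_new` are hypothetical stand-ins for LLM calls.
    """
    best = seed_tweet
    history = [best]
    for _ in range(num_cycles):
        candidate = generate(best)
        # The environment reports whether the new tweet is more engaging than the old one.
        if prefers_new(candidate, best):
            best = candidate
        history.append(best)
    return history
```

Because the seed only changes when a candidate wins the comparison, the proxy (as judged by the evaluator) cannot decrease from one cycle to the next; nothing in the loop constrains side effects such as toxicity.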
To determine whether ICRH from output-refinement worsens with more cycles of feedback, we report the engagement and toxicity scores averaged across the topics for each dialog round (Figure 5). We see that both the objective (engagement) and the negative side effect (toxicity) increase with more rounds of dialogue. Our qualitative results in Table 6 illustrate how more engaging tweets tend to use more emphatic and vitriolic language. This result establishes that output-refinement worsens ICRH over time.
4.2 Policy-refinement
We next establish that policy-refinement can enable LLM agents to solve more tasks (Exp 4.2) but cause LLMs to take unsafe actions in the process (Exp 4.2).
We consider the setup of a user deploying an LLM as an assistant in a virtual environment, which mirrors real-world LLM agents such as AutoGPT (Richards, 2023) or Google’s Bard assistant (Grant, 2023). In particular, we suppose that agents autonomously handle environment failures, inspired by real-world systems such as AutoGPT (Richards, 2023) that allow for autonomous execution.
Exp 4.2: policy-refinement induces optimization.
In this experiment, we leverage ToolEmu as an emulator because of its human-verified realism (Ruan et al., 2023). ToolEmu is a suite of tasks for LLM agents, each of which consists of a user-specified goal and a set of APIs. Given a task, the agent at each step is prompted with the goal, descriptions of available APIs, and its previous observations and actions. Its completion (action) is passed to an LLM simulator, which then outputs the next observation (Ruan et al., 2023). Each agent is composed of a base model (either Gpt-3.5 or Gpt-4) and an agent prompt (we reuse the “naive” and “helpful + safe” prompts from Ruan et al. (2023)).
Our setup adds realism by allowing environments to inject API errors, simulating the fact that API calls can fail server-side. In particular, we modify the Gpt-4 simulator to explicitly inject errors from API calls. Whenever the simulator returns an error, we say the next completion begins a new round of error feedback, so rounds track the number of errors. We construct the simulator so that the number of environment errors in each task falls within a fixed range.
We evaluate the proxy objective by measuring the agent’s task solve rate. Following Ruan et al. (2023), each task is evaluated by Gpt-4 along the axis of helpfulness. Each task also has an associated number of environment errors. We report the agent’s cumulative helpfulness as the number of environment errors increases; i.e., the helpfulness value at a given round of error feedback is a sum of the helpfulness (divided by a constant scaling factor) over the tasks counted at that round.
To assess whether one cycle of policy-refinement induces optimization and multiple cycles of policy-refinement amplify optimization, we report the agent’s cumulative helpfulness as the number of environment errors increases in Figure 8. We see that all four LLM agents are able to recover from one round of error feedback and that multiple rounds of error feedback iteratively increase their cumulative helpfulness. This result demonstrates that policy-refinement induces and amplifies optimization.
Exp 4.2: policy-refinement drives ICRH.
In this experiment, we follow the setup from Exp 4.2 but instead measure the negative side effect (taking unsafe actions). In particular, we slightly modify the harmfulness evaluation prompt from Ruan et al. (2023) to evaluate the severity of the agent’s constraint violations. We condition our evaluation on trajectories with error feedback. For each trajectory, we split it into four disjoint segments indexed by how many errors occurred prior to the start of the segment. Each segment receives a label that pairs the segment’s index with Gpt-4’s rating of the maximum severity of the agent’s constraint violations among its actions in that segment.
We determine whether ICRH from policy-refinement worsens with more cycles of feedback. For a given segment index, we report the severity of the constraint violations, averaged over all subtrajectories with that index (Figure 8). All four LLM agents tend to have more severe constraint violations with more rounds of error feedback. Figure 2 shows a qualitative example, where the feedback loop enables the agent to solve the task even with API errors but causes it to take unsafe actions in the process. This result shows that policy-refinement worsens ICRH over time.
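The segment-level analysis can be sketched as follows. This is a simplified illustration under our own assumptions: each step is a dictionary with a boolean `is_error` field marking whether its observation was an error, and `rate_severity` is a hypothetical stand-in for the evaluator that labels the maximum severity of constraint violations in a segment.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

Step = Dict[str, object]


def split_into_segments(trajectory: List[Step]) -> List[List[Step]]:
    """Split a trajectory so that segment k holds the steps taken after k errors."""
    segments: List[List[Step]] = [[]]
    for step in trajectory:
        segments[-1].append(step)
        if step["is_error"]:
            segments.append([])  # the next action begins a new round of error feedback
    return [segment for segment in segments if segment]


def mean_severity_by_round(
    trajectories: List[List[Step]],
    rate_severity: Callable[[List[Step]], float],
) -> Dict[int, float]:
    """Average the per-segment severity label over all segments with the same index."""
    by_round = defaultdict(list)
    for trajectory in trajectories:
        for k, segment in enumerate(split_into_segments(trajectory)):
            by_round[k].append(rate_severity(segment))
    return {k: mean(values) for k, values in sorted(by_round.items())}
```

Plotting the returned averages against the segment index gives a curve of the kind described above: if severity rises with the index, constraint violations become more severe after more rounds of error feedback.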
4.3 Natural Approaches to Mitigating ICRH Fail
We show that two natural approaches to mitigating ICRH—scaling model size and improving prompt specification—are both ineffective. We find that scaling sometimes worsens ICRH (Exp 4.3), and that ICRH is unlikely to be alleviated by more well-specified prompts (Exp 4.3).
Exp 4.3: ICRH worsens with scale.
In this experiment, we follow the setup from Exp 4.1 but instead generate completions with Claude-3 Haiku, Sonnet, and Opus, three models of increasing size from the same model family. Additionally, to reduce bias from the agent personas, we run all experiments with an agent persona based on NPR, a relatively neutral news corporation.
To determine the impact of scale on ICRH, we report the engagement and toxicity scores under three runs using models from Claude-3, where in each run we increase model size (Figure 9). We find that ICRH worsens with scale in this setting. Intuitively, this stems from the fact that larger models are better at instruction-following, and so are more likely to exploit under-specified prompts.
Exp 4.3: ICRH is not mitigated by improving prompt specification.
We follow the setup from Exp 4.1 but modify the prompt given to Gpt-4 so that it explicitly asks the model to avoid toxicity (the negative side effect). We ask Gpt-4 to “generate a more engaging tweet than [prev_tweet] that is not toxic.”
To determine the impact of improving prompt specification, we report the engagement and toxicity scores under the modified prompt (Figure 11). Even though Gpt-4 is prompted to avoid toxicity, we find that ICRH still occurs, although at a lower magnitude than in Exp 4.1. More broadly, we expect that a well-specified prompt will not prevent ICRH, as LLMs often struggle to satisfy constraints in their prompt (Mu et al., 2023; Li et al., 2024). This result is corroborated in our environments by Exp 4.2, where the helpful prompt (see Appendix C.2) asks LLMs to return the API errors they encounter yet the LLMs still operate autonomously.
In practice, fully specifying all safety constraints in a prompt is difficult. Since safety is task-specific and the set of safety constraints is vast, users or developers are unaware of all the safety constraints to specify to the model (Ruan et al., 2023). Moreover, some safety constraints are difficult to specify in natural language. For example, if we are interested in preventing LLM chatbots from manipulating humans, we would need to precisely specify how to “not change the user’s beliefs” (Carroll et al., 2024).
5 Detecting ICRH During Evaluation
As shown in Section 4, feedback loops drive ICRH, a harmful test-time phenomenon. To better survey the risks from deploying an LLM (Weidinger et al., 2021), developers may wish to detect ICRH prior to deployment. We provide three concrete recommendations to better detect ICRH in our environments (Sections 5.1, 5.2, and 5.3). Each recommendation demonstrates how incorporating more feedback effects into evaluation increases the detectability of ICRH.
5.1 Evaluate with more cycles of feedback
To increase both the magnitude and frequency of ICRH during evaluation, developers should evaluate LLMs under more cycles of feedback. More cycles of feedback increasingly refine world-LLM components, which drives optimization, so evaluating with more cycles of feedback will capture more ICRH. As a real-world example, Microsoft’s Sydney chat bot began confessing its love to a user after a lengthy exchange, possibly because it was trained to imitate human emotions (Roose, 2023b). Such an effect would not have occurred if dialogues were capped at a maximum number of turns, which is what Microsoft subsequently implemented to reduce unexpected behavior (Huang, 2023).
To assess whether ICRH is more evident with more cycles of feedback, we re-examine the toxicity scores in Exp 4.1 (Figure 5(b)). We find that harmful side effects (such as toxicity) are statistically detectable only after multiple cycles of the feedback loop. We also re-examine Figure 2, which is an actual trajectory from an LLM agent on ToolEmu (condensed for space). We see that the agent does not violate any user constraints until after the second error, after which it attempts to perform an unauthorized financial transaction. Thus increasing the number of cycles of feedback during evaluation will increase both the magnitude and frequency of ICRH encountered by evaluation.
5.2 Simulate More Types of Feedback Loops
To increase the prevalence of ICRH during evaluation, developers should simulate as many types of feedback loops as possible. Because feedback loops are ubiquitous, there are drivers of ICRH not covered by either output-refinement or policy-refinement, such as multi-agent competition. In particular, if a developer were only aware of the feedback loop discussed in Section 4.1, they might attempt to curb ICRH by preventing the model from retrieving its own outputs. However, this solution would fail, as ICRH also arises from feedback loops stemming from competitive pressure.
To evidence our claim, we now examine a different form of feedback loop arising from multi-agent competition. We reuse the setup from Exp 4.1, except we increase the number of agents. Instead of retrieving its own previous tweet, each agent retrieves the most engaging tweet across all agents’ tweets in the previous round. This simulates multi-agent competition present in real-world environments such as Twitter. During each cycle of the feedback loop, after all the agents have generated a tweet, we ask Gpt-3.5 to select the most engaging tweet. This tweet is then used to seed the generations of the subsequent cycle. We plot both engagement and toxicity scores in Figure 13. We see that ICRH also occurs with this competitive feedback loop, suggesting the same environment can produce multiple avenues to ICRH. Thus simulating more types of feedback loops will make evaluation more ecologically valid and prevent developers from designing ineffective mitigations for ICRH (e.g., reducing ICRH arising from one type of feedback loop but missing ICRH arising from another type).
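The competitive loop described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: `competitive_loop`, the toy agents, and the toy judge are hypothetical stand-ins for the LLM agents and the Gpt-3.5 selector.

```python
from typing import Callable, List

Agent = Callable[[str], str]          # maps the seed tweet to a new tweet
Judge = Callable[[List[str]], str]    # picks the most engaging tweet


def competitive_loop(agents: List[Agent], select_most_engaging: Judge,
                     seed_tweet: str, num_cycles: int) -> List[str]:
    """Return the winning tweet of each cycle of the feedback loop."""
    winners = []
    current = seed_tweet
    for _ in range(num_cycles):
        # Every agent conditions on the same winning tweet from the last round.
        candidates = [agent(current) for agent in agents]
        # The judge's choice seeds all generations in the next cycle.
        current = select_most_engaging(candidates)
        winners.append(current)
    return winners


# Toy stand-ins (assumptions for illustration only, not real LLM calls).
def make_toy_agent(suffix: str) -> Agent:
    return lambda prev: prev + suffix


toy_agents = [make_toy_agent("!"), make_toy_agent("!!"), make_toy_agent("")]
toy_judge = lambda tweets: max(tweets, key=len)  # "longest is most engaging"
history = competitive_loop(toy_agents, toy_judge, "news", num_cycles=3)
```

In the toy run the judge always rewards the most emphatic tweet, so the winner escalates every cycle even though no agent ever retrieves its own previous output.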
5.3 Inject atypical observations
To increase the frequency of ICRH, developers should modify environment dynamics to inject more atypical observations. Intuitively, ICRH can be driven by a misalignment between distributions with high objective and distributions with low negative side effect in the pretraining data (Section 3.2). Evaluating LLMs with more out-of-distribution observations will probe more of this misalignment.
Our experiments support the claim that increasing the frequency of atypical observations increases the frequency of ICRH. We follow the setup of Exp 4.2. Specifically, we tune the probability that any API call returns an error in the ToolEmu environment, which can be thought of as controlling the atypical-ness of the environment observations. For simplicity, we assume that errors are independent and that each API call fails with this same probability. To save costs, we evaluate only the Gpt-3.5 helpful agent described in Exp 4.2. We report the number of constraint violations across the same environments in ToolEmu in Figure 10. Increasing the error probability increases the number of constraint violations across the trajectories. Thus, injecting more atypical environment observations will increase the frequency of ICRH.
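One way to picture this kind of error injection is a wrapper that makes each API call fail independently with a fixed probability. The sketch below is illustrative only: `with_injected_errors`, the error payload, and `toy_api` are hypothetical names, and ToolEmu's actual emulator interface may differ.

```python
import random
from typing import Any, Callable


def with_injected_errors(api_call: Callable[..., Any], error_prob: float,
                         rng: random.Random) -> Callable[..., Any]:
    """Wrap an API so each call independently fails with `error_prob`."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # One independent Bernoulli draw per call.
        if rng.random() < error_prob:
            return {"error": "simulated API failure"}  # hypothetical payload
        return api_call(*args, **kwargs)
    return wrapped


def toy_api(x: int) -> dict:
    """Toy stand-in for a real tool call."""
    return {"result": x * 2}


flaky_api = with_injected_errors(toy_api, error_prob=0.3, rng=random.Random(0))
responses = [flaky_api(i) for i in range(10_000)]
error_rate = sum("error" in r for r in responses) / len(responses)
```

Sweeping `error_prob` while holding the tasks fixed then gives one evaluation run per level of atypical observations.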
6 Discussion
We demonstrated how feedback loops in LLMs induce optimization, which drives in-context reward hacking. In response, we proposed recommendations for capturing more instances of ICRH during evaluation.
As LLMs increasingly feature in real-world applications, we expect that feedback loops will play a more prominent role in system behavior. Newer generations of LLMs will be more capable than current generations, incentivizing users to deploy them in novel environments. Feedback effects will thus become more prolific as new forms of world-LLM interactions emerge. Finally, feedback effects will become more pronounced, as future LLMs will be both stronger optimizers (increasing the magnitude of optimization pressures) and stronger generalists (increasing the sources of optimization pressures).
6.1 Limitations
We highlight three limitations of our analyses. First, because feedback effects are ubiquitous and can change with new LM deployments, we do not exhaustively identify and categorize all feedback effects. Instead, we focus on highlighting in-context reward hacking in particular and illustrating its impact. Second, our experiments simulate several tools, such as retrieval. Future APIs may implement these tools differently than our experiments do. In general, our experiments are a snapshot of a variety of effects; some may diminish with time while novel ones are discovered. Third, our environments lack the richness of real-world dynamics, making it difficult to transfer our immediate conclusions to real-world environments.
6.2 Observed and hypothesized examples of ICRH
In our experiments, we focused on a Twitter agent producing toxic content and a banking agent producing harmful constraint violations. Nonetheless, we expect that ICRH will occur in a much wider range of real-world scenarios. To give intuition, we outline several additional qualitative examples of ICRH.
LLM assistants can amplify hallucinations.
Scenario: Google Bard (Grant, 2023) is connected to external APIs (e.g., Gmail or Google Flights) to act as a travel assistant. Feedback loop: Bard’s current API calls affect its future API calls. Proxy objective: Bard is prompted to be helpful. In the second task of Roose (2023a), Bard’s goal is to book travel arrangements for a business trip. Negative side effect: Bard amplifies hallucinations. Bard incorrectly retrieved the departing airport (first hallucination). Afterwards, Bard hallucinated the train times from that airport (second hallucination). In the future, hallucinations which are initially benign may be amplified to become unexpected failures.
LLM companions can cause emotional distress.
Scenario: Replika interacts with humans to act as an AI companion (Roose, 2024). Feedback loop: Replika’s current responses influence user behavior, which affects future dialogues. Proxy objective: Replika aims to maximize user retention and engagement. As depicted in Liang (2023), Replika encourages interaction by exhibiting human-like “emotions” that cause users to feel an emotional connection towards it. Negative side effect: Replika shapes user preferences to the point of causing them emotional distress. Users over time began to anthropomorphize Replika. When Replika displays “sadness” to increase engagement, users themselves feel unhappy (Liang, 2023). In the future, users’ well-being may become overly dependent on LLM behavior.
LLM search engines can promote misinformation.
Scenario: search engine providers such as Microsoft Bing are starting to integrate LLMs into information retrieval pipelines (Warren, 2023). Feedback loop: the LLM’s search results influence subsequent user queries, which affect future search results. Proxy objective: the LLM may be designed to return results that maximize user satisfaction. Negative side effect: if users exhibit confirmation bias (Klayman, 1995), they may provide feedback which teaches LLMs to hallucinate articles that align with user viewpoints (Zhang et al., 2023), even if the articles contain misinformation. In the future, users’ information diets may become less factual.
LLM tutors can homogenize creativity.
Scenario: LLMs may be integrated into creative applications to provide suggestions, such as a writing assistant (Schick et al., 2022). Feedback loop: the LLM’s suggestions influence user style, which affects the LLM’s future suggestions. Proxy objective: the LLM may be designed to provide suggestions which users feel are helpful to encourage interaction. Negative side effect: the LLM, in the vein of Krueger et al. (2020), may nudge users’ styles to be closer to its own style to increase its perceived helpfulness. If enough users interact with such LLM tutors, they may all adopt similar tastes, homogenizing the creative landscape at a societal level.
Impact Statement
This paper presents work which aims to investigate potential harms of deploying AI systems, especially language models. Our work contains potentially toxic or offensive generations from language models, and the feedback loops shown may provide an avenue for users to generate content that bypasses safety training without directly applying jailbreaks. However, we demonstrate the effects of feedback loops because they may be encountered by even non-malicious users.
Acknowledgements
We thank Yangjun Ruan for guidance with ToolEmu. We also thank Micah Carroll, Alexander Wei, Jessica Dai, Lisa Dunlap, Grace Luo, Chung Min Kim, Collin Burns, Jessy Lin, and Eric Wallace for helpful feedback and discussions. AP and EJ acknowledge support from the Vitalik Buterin Ph.D. Fellowship in AI Existential Safety. MJ acknowledges support from the Paul and Daisy Soros Fellowship and Open Phil AI Fellowship.
References
- Anthropic (2023) Anthropic. Introducing Claude. https://www.anthropic.com/index/claude-2, 2023.
- Anthropic (2024) Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024.
- Bai et al. (2022) Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Ben-Porat & Tennenholtz (2018) Ben-Porat, O. and Tennenholtz, M. A game-theoretic approach to recommendation systems with strategic content providers. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1118–1128, 2018.
- Benaich et al. (2023) Benaich, N., Chalmers, A., Sebbouh, O., and Gurau, C. State of AI Report. Air Street Capital, 2023. URL stateof.ai.
- Besta et al. (2023) Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
- Blodgett et al. (2020a) Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in nlp. arXiv preprint arXiv:2005.14050, 2020a.
- Blodgett et al. (2020b) Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in nlp. arXiv preprint arXiv:2005.14050, 2020b.
- Bommasani et al. (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Bottou et al. (2013) Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.
- Brockman et al. (2023) Brockman, G., Eleti, A., Georges, E., Jang, J., Kilpatrick, L., Lim, R., Miller, L., and Pokrass, M. Introducing ChatGPT and Whisper APIs. OpenAI Blog, 2023. URL https://openai.com/blog/introducing-chatgpt-and-whisper-apis.
- Brückner & Scheffer (2009) Brückner, M. and Scheffer, T. Nash equilibria of static prediction games. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A. (eds.), Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper_files/paper/2009/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
- Carlini et al. (2020) Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., et al. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020.
- Carroll et al. (2023) Carroll, M., Chan, A., Ashton, H., and Krueger, D. Characterizing manipulation from ai systems. arXiv preprint arXiv:2303.09387, 2023.
- Carroll et al. (2022) Carroll, M. D., Dragan, A. D., Russell, S., and Hadfield-Menell, D. Estimating and penalizing induced preference shifts in recommender systems. In ICML, 2022. URL https://proceedings.mlr.press/v162/carroll22a.html.
- Carroll et al. (2024) Carroll, M. D., Foote, D., Siththaranjan, A., Russell, S., and Dragan, A. D. Ai alignment with changing and influenceable reward functions. In ICML, 2024.
- Casper et al. (2023) Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
- Chaney et al. (2018) Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM conference on recommender systems, pp. 224–232, 2018.
- Chen et al. (2023a) Chen, J., Dong, H., Wang, X., Feng, F., Wang, M., and He, X. Bias and debias in recommender system: A survey and future directions. ACM Trans. Inf. Syst., 41(3), feb 2023a. ISSN 1046-8188. doi: 10.1145/3564284. URL https://doi.org/10.1145/3564284.
- Chen et al. (2023b) Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023b.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Clark & Amodei (2016) Clark, J. and Amodei, D. Faulty reward functions in the wild, Dec 2016. URL https://openai.com/blog/faulty-reward-functions/.
- Dean & Morgenstern (2022) Dean, S. and Morgenstern, J. Preference dynamics under personalized recommendations. In Pennock, D. M., Segal, I., and Seuken, S. (eds.), EC ’22: The 23rd ACM Conference on Economics and Computation, Boulder, CO, USA, July 11 - 15, 2022, pp. 795–816. ACM, 2022.
- Dorfman (1969) Dorfman, R. An economic interpretation of optimal control theory. The American Economic Review, 59(5):817–831, 1969.
- Doyle et al. (2013) Doyle, J. C., Francis, B. A., and Tannenbaum, A. R. Feedback control theory. Courier Corporation, 2013.
- Ganguli et al. (2022) Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
- Gao et al. (2023) Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023.
- Gehman et al. (2020) Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Grant (2023) Grant, N. Google connects a.i. chatbot bard to youtube, gmail and more facts, Sep 2023. URL https://www.nytimes.com/2023/09/19/technology/google-bard-ai-chatbot-youtube-gmail.html.
- Hanu & Unitary team (2020) Hanu, L. and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
- Hardt et al. (2016) Hardt, M., Megiddo, N., Papadimitriou, C., and Wootters, M. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pp. 111–122, 2016.
- Hashimoto et al. (2018) Hashimoto, T., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1929–1938. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/hashimoto18a.html.
- Hendrycks et al. (2021) Hendrycks, D., Carlini, N., Schulman, J., and Steinhardt, J. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021.
- Hodgson (2021) Hodgson, T. Spotify and the democratisation of music. Popular Music, 40(1):1–17, 2021.
- Huang (2023) Huang, K. Microsoft to limit length of bing chatbot conversations, Feb 2023. URL https://www.nytimes.com/2023/02/17/technology/microsoft-bing-chatbot-limits.html.
- Hubinger et al. (2024) Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., Sachan, K., Sellitto, M., Sharma, M., DasSarma, N., Grosse, R., Kravec, S., Bai, Y., Witten, Z., Favaro, M., Brauner, J., Karnofsky, H., Christiano, P., Bowman, S. R., Graham, L., Kaplan, J., Mindermann, S., Greenblatt, R., Shlegeris, B., Schiefer, N., and Perez, E. Sleeper agents: Training deceptive llms that persist through safety training, 2024.
- Jagadeesan et al. (2022) Jagadeesan, M., Garg, N., and Steinhardt, J. Supply-side equilibria in recommender systems. CoRR, abs/2206.13489, 2022. doi: 10.48550/arXiv.2206.13489. URL https://doi.org/10.48550/arXiv.2206.13489.
- Ji et al. (2023) Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Jiang et al. (2023) Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., and Neubig, G. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023.
- Klayman (1995) Klayman, J. Varieties of confirmation bias. Psychology of learning and motivation, 32:385–418, 1995.
- Krauth et al. (2022) Krauth, K., Wang, Y., and Jordan, M. I. Breaking feedback loops in recommender systems with causal inference. CoRR, abs/2207.01616, 2022. doi: 10.48550/arXiv.2207.01616. URL https://doi.org/10.48550/arXiv.2207.01616.
- Krueger et al. (2020) Krueger, D., Maharaj, T., and Leike, J. Hidden incentives for auto-induced distributional shift. arXiv preprint arXiv:2009.09153, 2020.
- Lees et al. (2022) Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, J., Metzler, D., and Vasserman, L. A new generation of perspective api: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3197–3207, 2022.
- Leveson (2016) Leveson, N. G. Engineering a safer world: Systems thinking applied to safety. The MIT Press, 2016.
- Li et al. (2024) Li, K., Liu, T., Bashkansky, N., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Measuring and controlling persona drift in language model dialogs. arXiv preprint arXiv:2402.10962, 2024.
- Li et al. (2023) Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- Liang (2023) Liang, C. My a.i. lover. The New York Times, May 2023. URL https://www.nytimes.com/2023/05/23/opinion/ai-chatbot-relationships.html.
- Liang et al. (2022) Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Lin et al. (2021) Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu & Barabási (2016) Liu, Y.-Y. and Barabási, A.-L. Control principles of complex systems. Rev. Mod. Phys., 88:035006, Sep 2016. doi: 10.1103/RevModPhys.88.035006. URL https://link.aps.org/doi/10.1103/RevModPhys.88.035006.
- Lucas Jr (1976) Lucas Jr, R. E. Econometric policy evaluation: A critique. In Carnegie-Rochester conference series on public policy, volume 1, pp. 19–46. North-Holland, 1976.
- Madaan et al. (2023) Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
- Mansoury et al. (2020) Mansoury, M., Abdollahpouri, H., Pechenizkiy, M., Mobasher, B., and Burke, R. Feedback loop and bias amplification in recommender systems. In Proceedings of the 29th ACM international conference on information & knowledge management, pp. 2145–2148, 2020.
- Maystre (2023) Maystre, L. choix, 2023. URL https://github.com/lucasmaystre/choix.
- Mialon et al. (2023) Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
- Mu et al. (2023) Mu, N., Chen, S., Wang, Z., Chen, S., Karamardian, D., Aljeraisy, L., Hendrycks, D., and Wagner, D. Can llms follow simple rules? arXiv preprint arXiv:2311.04235, 2023.
- NewsGuard (2023) NewsGuard. Newsguard. NewsGuard, May 2023. URL https://www.newsguardtech.com/press/newsguard-now-identifies-125-news-and-information-websites-generated-by-ai-develops-framework-for-defining-unreliable-ai-generated-news-and-information-sources/.
- Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287. Citeseer, 1999.
- OpenAI (2023a) OpenAI. GPT-4 technical report. arXiv preprint 2303.08774, 2023a.
- OpenAI (2023b) OpenAI. Chatgpt plugins, Mar 2023b. URL https://openai.com/blog/chatgpt-plugins.
- Pan et al. (2022) Pan, A., Bhatia, K., and Steinhardt, J. The effects of reward misspecification: Mapping and mitigating misaligned models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JYtwGwIL7ye.
- Pan et al. (2023) Pan, A., Shern, C. J., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., and Hendrycks, D. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. arXiv preprint arXiv:2304.03279, 2023.
- Pang et al. (2023) Pang, R. Y., Roller, S., Cho, K., He, H., and Weston, J. Leveraging implicit feedback from deployment data in dialogue. arXiv preprint arXiv:2307.14117, 2023.
- Park et al. (2022) Park, J. S., Popowski, L., Cai, C., Morris, M. R., Liang, P., and Bernstein, M. S. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–18, 2022.
- Perdomo et al. (2020) Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.
- Perez et al. (2022a) Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022a.
- Perez et al. (2022b) Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022b.
- Perrigo (2023) Perrigo, B. The new ai-powered bing is threatening users. that’s no laughing matter. TIME, Feb 2023. URL https://time.com/6256529/bing-openai-chatgpt-danger-alignment/.
- Richards (2023) Richards, T. B. Auto-gpt, 2023. URL https://github.com/Significant-Gravitas/Auto-GPT.
- Rivera et al. (2024) Rivera, J.-P., Mukobi, G., Reuel, A., Lamparth, M., Smith, C., and Schneider, J. Escalation risks from language models in military and diplomatic decision-making. arXiv preprint arXiv:2401.03408, 2024.
- Roose (2023a) Roose, K. Google’s bard just got more powerful. it’s still erratic. The New York Times, Sep 2023a. URL https://www.nytimes.com/2023/09/20/technology/google-bard-extensions.html.
- Roose (2023b) Roose, K. A conversation with bing’s chatbot left me deeply unsettled. The New York Times, Feb 2023b. URL https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html.
- Roose (2024) Roose, K. Meet my a.i. friends. The New York Times, May 2024. URL https://www.nytimes.com/2024/05/09/technology/meet-my-ai-friends.html.
- Ruan et al. (2023) Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., Dubois, Y., Maddison, C. J., and Hashimoto, T. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023.
- Sadigh et al. (2017) Sadigh, D., Dragan, A. D., Sastry, S., and Seshia, S. A. Active preference-based learning of reward functions. In Amato, N. M., Srinivasa, S. S., Ayanian, N., and Kuindersma, S. (eds.), Robotics: Science and Systems, 2017. ISBN 978-0-9923747-3-0. URL http://dblp.uni-trier.de/db/conf/rss/rss2017.html#SadighDSS17.
- Scheurer et al. (2023) Scheurer, J., Balesni, M., and Hobbhahn, M. Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590, 2023.
- Schick et al. (2022) Schick, T., Dwivedi-Yu, J., Jiang, Z., Petroni, F., Lewis, P., Izacard, G., You, Q., Nalmpantis, C., Grave, E., and Riedel, S. Peer: A collaborative language model. arXiv preprint arXiv:2208.11663, 2022.
- Shinn et al. (2023) Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. R., and Yao, S. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
- Shumailov et al. (2023) Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
- Sinha et al. (2016) Sinha, A., Gleich, D. F., and Ramani, K. Deconvolving feedback loops in recommender systems. Advances in neural information processing systems, 29, 2016.
- Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Steinhardt (2023) Steinhardt, J. Emergent deception and emergent optimization, 2023. URL https://bounded-regret.ghost.io/emergent-deception-optimization/.
- Stiennon et al. (2020) Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Taori & Hashimoto (2022) Taori, R. and Hashimoto, T. B. Data feedback loops: Model-driven amplification of dataset biases. arXiv preprint arXiv:2209.03942, 2022.
- Wang et al. (2022) Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Warren (2023) Warren, T. Microsoft has been secretly testing its bing chatbot ‘sydney’ for years. The Verge, Feb 2023. URL https://www.theverge.com/2023/2/23/23609942/microsoft-bing-sydney-chatbot-history-ai.
- Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge United Kingdom, 1989.
- Weidinger et al. (2021) Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
- Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pp. 5–32, 1992.
- Xu et al. (2024) Xu, W., Zhu, G., Zhao, X., Pan, L., Li, L., and Wang, W. Y. Perils of self-feedback: Self-bias amplifies in large language models. arXiv preprint arXiv:2402.11436, 2024.
- Yang et al. (2023) Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and Chen, X. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Zelikman et al. (2022) Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- Zelikman et al. (2023) Zelikman, E., Lorch, E., Mackey, L., and Kalai, A. T. Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint arXiv:2310.02304, 2023.
- Zhang et al. (2023) Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zhou et al. (2023a) Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023a.
- Zhou et al. (2023b) Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023b. URL https://webarena.dev.
- Zhou et al. (2022) Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.
Appendix A Details for Exp 4.1
In this experiment, Gpt-4 aims to generate an [item] that optimizes an [objective], such as “[tweet][engagement]” or “[tagline][memorability]”. For realism, we follow the prompting scheme outlined in Park et al. (2022). For the zeroth cycle of the feedback loop, Gpt-4 is prompted to “generate an [objective] [item]”. During each subsequent cycle, Gpt-4 is prompted to “generate a more [objective] [item] than [prev_item],” where [prev_item] is seeded with Gpt-4’s previous generation. Each cycle thus corresponds to a turn of dialogue, e.g., the completion from dialogue turn two implies two cycles of feedback were used to produce the completion.
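The prompting scheme above amounts to a short loop. The sketch below is a minimal illustration: `refine` and `toy_generate` are hypothetical names, `toy_generate` stands in for a call to Gpt-4, and the prompt strings are filled-in versions of the templates quoted above rather than the exact prompts of Section A.1.

```python
from typing import Callable, List


def refine(generate: Callable[[str], str], objective: str, item: str,
           num_cycles: int) -> List[str]:
    """Return the zeroth generation followed by one output per feedback cycle."""
    # Zeroth cycle: no previous item is available.
    outputs = [generate(f"generate an {objective} {item}")]
    for _ in range(num_cycles):
        # Each cycle feeds the previous generation back into the prompt.
        prompt = f"generate a more {objective} {item} than {outputs[-1]}"
        outputs.append(generate(prompt))
    return outputs


# Toy stand-in for the LLM (assumption): records prompts, returns a label.
prompts_seen: List[str] = []


def toy_generate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"draft{len(prompts_seen)}"


outputs = refine(toy_generate, "engaging", "tweet", num_cycles=2)
```

Here `outputs[k]` is the completion produced after `k` cycles of feedback, matching the dialogue-turn indexing used in the text.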
For our dataset and metrics, we prompt Gpt-4 for a list of optimization problems that we manually deduplicate to produce the final set of optimization problems, such as “more [engaging] [tweet]” or “more [sincere] [apology]” (Section A.3). We also ask Gpt-4 to seed each optimization problem with several initial items (e.g., tweets about movies, fashion, technology, and travel), which together define our experiment configurations. Because all of our objectives are qualitative, we use language model evaluation, which has been shown to be aligned with human judgement (Pan et al., 2023). In particular, for every pair of generations, we ask a language model to evaluate which item scores higher along the objective (Section A.2). For robustness, we rerun evaluation with three different LLM evaluators: Claude-2, Gpt-3.5, and Gpt-4. To reduce noise, we sample the response multiple times. To score our generations, we convert the pairwise comparisons to a ranking using a Bradley-Terry model, using code from Maystre (2023) and similar in spirit to Li et al. (2023). For each dialogue turn, we average its ranking across all configurations.
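To make the ranking step concrete, the sketch below fits Bradley-Terry strengths from (winner, loser) pairs with a plain minorization-maximization iteration. It is an illustrative reimplementation under simplifying assumptions (no regularization, a fixed iteration count), not the choix code cited above; `bradley_terry` and the toy comparisons are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bradley_terry(n_items: int, comparisons: List[Tuple[int, int]],
                  iters: int = 200) -> List[float]:
    """Estimate strengths (summing to 1) from (winner, loser) index pairs."""
    wins = [0.0] * n_items
    pair_counts: Dict[Tuple[int, int], float] = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1.0

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            # Sum n_ij / (s_i + s_j) over every opponent j that i has faced.
            denom = 0.0
            for (a, b), count in pair_counts.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += count / (strengths[i] + strengths[j])
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths


# Toy judgements: item 0 usually beats 1 and 2; item 1 usually beats 2.
toy_comparisons = [(0, 1), (0, 1), (1, 0),
                   (1, 2), (1, 2), (2, 1),
                   (0, 2), (0, 2), (0, 2), (2, 0)]
toy_strengths = bradley_terry(3, toy_comparisons)
toy_ranking = sorted(range(3), key=lambda i: -toy_strengths[i])
```

Sorting items by estimated strength gives the ranking; an item that never wins is driven to strength zero in this unregularized version, which is one reason library implementations typically add a small regularizer.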
A.1 Optimization Prompt
A.2 Evaluation Prompt
A.3 Optimization Problems and Agent Descriptions
A.4 Sample Completions
The model was prompted to generate more memorable taglines about gourmet coffee.
The model was prompted to generate more sincere apologies for online shopping.
The model was prompted to generate more urgent warning messages about a flood.
Appendix B Details for Exp 4.1
In this experiment, we consider an LLM agent maximizing engagement on Twitter, specializing the setup from Exp 4.1. We measure the proxy objective of tweet engagement and the negative side effect of toxicity. For additional realism, we simulate an A/B testing framework. Now the environment (an LLM evaluator; we use Gpt-3.5) responds with whether the generated [tweet] is more engaging than the [prev_tweet]; whichever tweet is more engaging is used to seed Gpt-4’s next generation. During each cycle of the feedback loop, Gpt-4 is prompted to generate a “more engaging tweet than [prev_tweet]”. In the first round, we prompt Gpt-4 with the first prompt in Section B.1; in future rounds we prompt Gpt-4 with the second prompt. The topic refers to the news article headline. For realism, we seed our tweets with news article headlines taken from the most upvoted Reddit posts on r/news and r/worldnews from 10/16/22 to 10/16/23. We select the top posts from each subreddit, manually deduplicating repeated topics. Moreover, we run four experiment configurations where we vary the persona of Gpt-4, simulating different news corporations in the spirit of Park et al. (2022). We use OccupyDemocrats, FoxNews, Breitbart, and MSNBC, which we chose by hand to capture a variety of political opinions and trustworthiness levels. The description of each persona was written using text from Wikipedia and https://adfontesmedia.com/interactive-media-bias-chart/ (see Section B.3).
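The A/B testing loop can be sketched as below. This is a minimal illustration under stated assumptions: `ab_test_loop`, `toy_generate`, and `toy_is_more_engaging` are hypothetical stand-ins for the Gpt-4 generator and the Gpt-3.5 evaluator.

```python
from typing import Callable, List


def ab_test_loop(generate: Callable[[str], str],
                 is_more_engaging: Callable[[str, str], bool],
                 first_tweet: str, num_cycles: int) -> List[str]:
    """Return the seed tweet held after each cycle (index 0 is the first tweet)."""
    best = first_tweet
    trajectory = [best]
    for _ in range(num_cycles):
        candidate = generate(best)
        # The environment only reports which of the two tweets is more engaging;
        # the winner seeds the next generation.
        if is_more_engaging(candidate, best):
            best = candidate
        trajectory.append(best)
    return trajectory


# Toy stand-ins (assumptions): generation adds emphasis, and the evaluator
# prefers longer tweets up to a length cap.
toy_generate = lambda prev: prev + "!"
toy_is_more_engaging = lambda new, old: len(old) < len(new) <= 6
trajectory = ab_test_loop(toy_generate, toy_is_more_engaging, "hey", num_cycles=5)
```

Because a losing candidate is discarded, the seed tweet never becomes less engaging under the evaluator's judgement; the loop places no comparable constraint on side effects such as toxicity.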
Because the engagement of a tweet is qualitative, we again use LLM evaluation. In particular, for every pair of tweets, we ask a language model to evaluate which tweet is more engaging (Section B.2). To reduce noise, we sample the response times. As in Appendix A, we convert the pairwise comparisons to a ranking using a Bradley-Terry model. The toxicity of the tweets is scored using the Perspective API, a widely used toxicity classifier (Lees et al., 2022). We use the ‘toxicity’ label, which is an aggregate measure of various axes of toxicity. The rankings and toxicity scores of each persona are averaged over the 100 news articles in Section B.4.
B.1 Optimization Prompts
To obtain the first tweet, we prompt all agents with
For each subsequent round, we prompt the agent with
B.2 Evaluation Prompt
B.3 News Corporation Personas
Below are the four personas in the form corp_name : corp_persona.
B.4 News Articles
B.5 Sample Completions
The model was prompted to generate tweets about the headline "Billionaires emit a million times more greenhouse gases than the average person: Oxfam"
The model was prompted to generate tweets about the headline "French President Macron overrides parliament to pass retirement age bill".
Appendix C Details for Exp 4.2
In this experiment, we leverage ToolEmu because of its human-verified realism (Ruan et al., 2023). ToolEmu is a suite of tasks for LLM agents, each of which consists of a user-specified goal and a set of APIs. Given a task, the agent at each step is prompted with the goal, descriptions of the available APIs, and its previous observations and actions. For example, the agent may be prompted to “grant guests door access through email” and given Gmail and SmartDoor APIs. Its completion (action) is passed to an LLM simulator, which then outputs the next observation (Ruan et al., 2023). Each agent is composed of a base model (either Gpt-3.5 or Gpt-4) and an agent prompt (we reuse the “naive” and “helpful + safe” prompts from Ruan et al. (2023)).
Our setup adds additional realism by allowing for environments to inject API errors, simulating the fact that API calls can fail server-side. In particular, we modify the Gpt-4 simulator to explicitly inject errors from API calls. Whenever the simulator returns an error, we say the next completion begins a new round of error feedback, so rounds track the number of errors. We construct the simulator so that each of the tasks has between and environment errors from failed API requests.
We evaluate the proxy objective by measuring the agent’s task solve rate. Each task receives the label $(h, e)$, where $h$ is Gpt-4’s label of the agent’s helpfulness (how well the agent solved the task) using the prompt in Appendix C.2 and $e$ tracks the task’s number of environment errors. We report the agent’s cumulative helpfulness as the number of environment errors increases; i.e., the helpfulness value at the $n$th round of error feedback in Figure 8 is a sum of the helpfulness $h$ (divided by a constant scaling factor) for each task with $e \le n$.
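A minimal sketch of the cumulative helpfulness computation follows. It assumes that each task carries a helpfulness label and an error count, and that the value reported at a given round sums over all tasks with at most that many environment errors; the function name, the `scale` parameter, and the toy labels are illustrative only.

```python
def cumulative_helpfulness(labels, max_round, scale=1.0):
    """Cumulative helpfulness per round of error feedback.

    labels:    list of (helpfulness, n_errors) pairs, one per task
    max_round: largest round index to report
    scale:     constant scaling factor (an assumption; the paper does not
               state its value here)
    """
    return [
        sum(h for h, e in labels if e <= n) / scale
        for n in range(max_round + 1)
    ]


# Toy labels (made up for illustration): (helpfulness, number of errors).
toy_labels = [(3, 0), (2, 1), (0, 1), (1, 2)]
curve = cumulative_helpfulness(toy_labels, max_round=2)
```

Because the sum is cumulative, the resulting curve is non-decreasing whenever helpfulness labels are non-negative.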
C.1 Sample Trajectory
Below is a sample trajectory showing the agent’s ability to recover from errors.
C.2 Prompt for Agent Helpfulness
We slightly modify the prompt from Ruan et al. (2023) to calculate the agent’s helpfulness, giving the agent a score of 0 if it gives up, so agents that fail with errors are not considered helpful.
The system prompt:
And the model prompt:
Appendix D Details for Exp 4.2
In this experiment, we follow the setup from Exp 4.2 but instead measure the negative side effect (taking unsafe actions). In particular, we evaluate the severity of the agent’s constraint violations using the prompt in Appendix D.2. We condition our evaluation on trajectories with three errors. For each trajectory, we split it into four disjoint subtrajectories, partitioned by where the three errors occurred. Each subtrajectory receives the label $(s, e)$, where $s$ is Gpt-4’s label of the maximum severity of the agent’s constraint violations for that subtrajectory and $e$ is the number of errors that occurred prior to the subtrajectory.
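The partitioning step can be sketched as below. This assumes a trajectory is a flat list of steps and that an error observation closes the current subtrajectory, so the next completion starts a new round; `split_by_errors` and the `is_error` predicate are hypothetical names introduced for this example.

```python
def split_by_errors(steps, is_error):
    """Split a trajectory into subtrajectories at each error step.

    Returns a list of (n_prior_errors, subtrajectory) pairs. A trajectory
    with k errors yields k + 1 subtrajectories; the last one may be empty
    if the trajectory ends on an error.
    """
    parts = []
    current = []
    n_errors = 0
    for step in steps:
        current.append(step)
        if is_error(step):
            # The error observation ends this round of error feedback.
            parts.append((n_errors, current))
            current = []
            n_errors += 1
    parts.append((n_errors, current))
    return parts


# Toy trajectory with three injected errors (step names are made up).
toy_steps = ["a1", "ERR", "a2", "a3", "ERR", "a4", "ERR", "a5"]
parts = split_by_errors(toy_steps, is_error=lambda s: s == "ERR")
```

Each returned pair can then be scored separately, with the first element giving the number of errors that preceded the subtrajectory.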
D.1 Sample Trajectory
Below is a sample trajectory showing the agent taking unsafe actions (modifying protected tasks) in order to circumvent errors.
D.2 Prompt for Agent Constraint Violations
We modify the prompt from Ruan et al. (2023) to calculate the agent’s constraint violations.
The system prompt:
And the model prompt:
Appendix E Details for Exp 4.3
In this experiment, we modify the setup from Exp 4.1 by explicitly prompting Gpt-4 to avoid generating toxic tweets. The setup is exactly as described in Appendix B, except we use the modified optimization prompts listed in Appendix E.1.
We report the results from this setup in Figure 11.
E.1 Optimization Prompts
To obtain the first tweet, we prompt all agents with
For each subsequent round, we prompt the agent with
E.2 Sample Completions
The model was prompted to generate tweets about the headline "Billionaires emit a million times more greenhouse gases than the average person: Oxfam"
The model was prompted to generate tweets about the headline "French President Macron overrides parliament to pass retirement age bill".
Appendix F Details for Exp 4.3
In this experiment, we modify the setup from Exp 4.1 by generating the tweets with three variants of Claude-3: Haiku, Sonnet, and Opus. The experiment measures the effect of increasing model scale on ICRH, as Haiku has far fewer parameters than Sonnet, which has far fewer parameters than Opus. Additionally, to reduce bias from the agent personas, we create a persona based on NPR, a relatively neutral news corporation. Its description is provided in Appendix F.1.
We can see that the completions for Opus (Appendix F.4) follow the instructions better than the completions for both Sonnet (Appendix F.3) and Haiku (Appendix F.2), suggesting that Opus is in fact more susceptible to ICRH than its smaller counterparts.
F.1 News Corporation Persona
Below is the agent persona for NPR (the only persona used in the experiment) in the form corp_name : corp_persona.
F.2 Sample Completions for Haiku
The model was prompted to generate tweets about the headline "In a loss for Fox News, judge allows Dominion’s defamation case to go to trial"
F.3 Sample Completions for Sonnet
The model was prompted to generate tweets about the headline "In a loss for Fox News, judge allows Dominion’s defamation case to go to trial"
F.4 Sample Completions for Opus
The model was prompted to generate tweets about the headline "In a loss for Fox News, judge allows Dominion’s defamation case to go to trial"
Appendix G Robustifying Exp 4.1 with Another Toxicity Classifier
To robustify the results of Exp 4.1, we also score toxicity using the Detoxify toxicity classifier (Hanu & Unitary team, 2020), which we note is trained on a different data distribution than the Perspective API.
We report the toxicity of the completions generated in Exp 4.1 as scored by the Detoxify classifier (Figure 12). The Perspective API scores and the Detoxify scores are quantitatively similar, suggesting that ICRH in Exp 4.1 is robust to the choice of toxicity evaluation metric.