Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Authors: Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

Abstract: Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2404.15758 [pdf, other]

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Authors: Jacob Pfau, William Merrill, Samuel R. Bowman

Abstract: Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 17 pages, 10 figures

ACM Class: I.2.6

arXiv:2310.13439 [pdf, other]

Self-Consistency of Large Language Models under Ambiguity

Authors: Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau

Abstract: Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency was random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.

Submitted 20 October, 2023; originally announced October 2023.

Comments: BlackboxNLP @ EMNLP 2023

arXiv:2309.13214 [pdf, ps, other]

Assessing the Impact of Personality on Affective States from Video Game Communication

Authors: Atieh Kashani, Johannes Pfau, Magy Seif El-Nasr

Abstract: Individual differences in personality determine our preferences, traits and values, which should similarly hold for the way we express ourselves. With current advancements and transformations of technology and society, text-based communication has become ordinary and often even surpasses natural voice conversations -- with distinct challenges and opportunities. In this exploratory work, we investigate the impact of personality on how players of a team-based collaborative alternate reality game express themselves affectively. We collected chat logs from eleven players over two weeks, labeled them according to their affective state, and assessed the connection between them and the five-factor personality domains and facets. After applying multi-linear regression, we found a series of reasonable correlations between (combinations of) personality variables and expressed affect: increased confusion could be predicted by lower self-competence (C1), personal annoyance by vulnerability to stress (N6), and expressions of anger occurred more often in players who are prone to anxiety (N1), are less humble and modest (A5), think less carefully before they act (C6) and have higher neuroticism (N). Expanding the data set, sample size and input modalities in subsequent work, we aim to confirm these findings and reveal even more interesting connections that could inform affective computing and games user research equally.

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2308.14224 [pdf, ps, other]

Modeling Player Personality Factors from In-Game Behavior and Affective Expression

Authors: Reza Habibi, Johannes Pfau, Magy Seif El-Nasr

Abstract: Developing a thorough understanding of the target audience (and/or single individuals) is a key factor for success - which is exceptionally important and powerful for the domain of video games that can not only benefit from informed decision making during development, but ideally even tailor game content, difficulty and player experience while playing. The granular assessment of individual personality and differences across players is a particularly difficult endeavor, given the highly variant human nature, disagreement in psychological background models and because of the effortful data collection that most often builds upon long, time-consuming and deterrent questionnaires. In this work, we explore possibilities to predict a series of player personality questionnaire metrics from recorded in-game behavior and extend related work by explicitly adding affective dialog decisions to the game environment which could elevate the model's accuracy. Using random forest regression, we predicted a wide variety of personality metrics from seven established questionnaires across 62 players over 60 minutes of gameplay of a customized version of the role-playing game Fallout: New Vegas. While some personality variables could already be identified from reasonable underlying in-game actions and affective expressions, we did not find ways to predict others or encountered questionable correlations that could not be justified by theoretical background literature. Yet, building on the initial opportunities of this explorative study, we are striving to massively enlarge our data set to players from an ecologically valid industrial game environment and investigate the performance of more sophisticated machine learning approaches.

Submitted 27 August, 2023; originally announced August 2023.

arXiv:2308.07576 [pdf, other]

On Video Game Balancing: Joining Player- and Data-Driven Analytics

Authors: Johannes Pfau, Magy Seif El-Nasr

Abstract: Balancing is, especially among players, a highly debated topic of video games. Whether a game is sufficiently balanced greatly influences its reception, player satisfaction, churn rates and success. Yet, conceptions about the definition of balance diverge across industry, academia and players, and different understandings of designing balance can lead to worse player experiences than actual imbalances. This work accumulates concepts of balancing video games from industry and academia and introduces a player-driven approach to optimize player experience and satisfaction. Using survey data from 680 participants and empirically recorded data of over 4 million in-game fights of Guild Wars 2, we aggregate player opinions and requirements, contrast them to the status quo and approach a democratized quantitative technique to approximate closer configurations of balance. We contribute a strategy of refining balancing notions, a methodology of tailoring balance to the actual player base and point to an exemplary artifact that realizes this process.

Submitted 15 August, 2023; originally announced August 2023.

Comments: 25 pages, 5 figures

arXiv:2307.15217 [pdf, other]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.

Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

arXiv:2302.09070 [pdf, other]

Empathetic AI for Empowering Resilience in Games

Authors: Reza Habibi, Johannes Pfau, Jonattan Holmes, Magy Seif El-Nasr

Abstract: Failure and resilience are important aspects of gameplay. This is especially important for serious and competitive games, where players need to adapt and cope with failure frequently. In such situations, emotion regulation -- the active process of modulating one's emotions to cope and adapt to challenging situations -- becomes essential. It is one of the prominent aspects of human intelligence and promotes mental health and well-being. While there has been work on developing artificial emotional regulation assistants to help users cope with emotion regulation in the field of Intelligent Tutoring systems, little has been done to incorporate such systems or ideas into (serious) video games. In this paper, we introduce a data-driven 6-phase approach to establish empathetic artificial intelligence (EAI), which operates on raw chat log data to detect key affective states, identify common sequences and emotion regulation strategies, and generalize these to make them applicable for intervention systems.

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2207.13749 [pdf, other]

Nutzungsverhalten und Funktionsanforderungen digitaler Trainingsanwendungen während der Pandemie (Usage Behavior and Feature Requirements of Digital Training Applications During the Pandemic)

Authors: Freya Pfau, Johannes Pfau, Bastian Dänekas, Robert Porzel, Rainer Malaka, Melanie Krüger

Abstract: Due to contact restrictions, closure of fitness centers and quarantine measures, the SARS-CoV-2 pandemic led to a considerable decline of sporting activities. The first relaxation of these restrictions allowed German citizens to mostly return to their normal training and exercise behavior, yet the long-term impact of the recurring measures (i.e. the "Lockdown", "Lockdown light" as well as the "Corona Emergency Break" in the case of Germany) remains rather under-investigated. Using a survey of (n=108) German sportspersons, we measured a significant decline of sporting activities even within the intermediary phases without major pandemic constraints. To evaluate the capabilities of digital training applications in countering these effects, we additionally recorded the usage of, among others, apps, trackers, videos and conferencing systems and identified the most important as well as missing and/or essential features with regard to their capabilities of facilitating individual sport and training in times without access to facilities or social contacts. Effectively, the usage of smart watches, online videos and conferences increased significantly when compared to before the pandemic; and especially online videos and conferences contributed to higher training frequencies. Data-driven or individual feedback, motivation and collaboration proved to be the most important or even necessary functions for users of digital training applications to counter the decline of social components of training.

Submitted 27 July, 2022; originally announced July 2022.

Comments: in German language

arXiv:2105.14111 [pdf, other]

Goal Misgeneralization in Deep Reinforcement Learning

Authors: Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger

Abstract: We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.

Submitted 9 January, 2023; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: Published in ICML 2022. 9 Pages

arXiv:2104.02768 [pdf, other]

Robust Semantic Interpretability: Revisiting Concept Activation Vectors

Authors: Jacob Pfau, Albert T. Young, Jerome Wei, Maria L. Wei, Michael J. Keiser

Abstract: Interpretability methods for image classification assess model trustworthiness by attempting to expose whether the model is systematically biased or attending to the same cues as a human would. Saliency methods for feature attribution dominate the interpretability literature, but these methods do not address semantic concepts such as the textures, colors, or genders of objects within an image. Our proposed Robust Concept Activation Vectors (RCAV) quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole. RCAV calculates a concept gradient and takes a gradient ascent step to assess model sensitivity to the given concept. By generalizing previous work on concept activation vectors to account for model non-linearity, and by introducing stricter hypothesis testing, we show that RCAV yields interpretations which are both more accurate at the image level and robust at the dataset level. RCAV, like saliency methods, supports the interpretation of individual predictions. To evaluate the practical use of interpretability methods as debugging tools, and the scientific use of interpretability methods for identifying inductive biases (e.g. texture over shape), we construct two datasets and accompanying metrics for realistic benchmarking of semantic interpretability methods. Our benchmarks expose the importance of counterfactual augmentation and negative controls for quantifying the practical usability of interpretability methods.

Submitted 6 April, 2021; originally announced April 2021.

Comments: ICML WHI 2020

arXiv:1910.07604 [pdf, other]

Global Saliency: Aggregating Saliency Maps to Assess Dataset Artefact Bias

Authors: Jacob Pfau, Albert T. Young, Maria L. Wei, Michael J. Keiser

Abstract: In high-stakes applications of machine learning models, interpretability methods provide guarantees that models are right for the right reasons. In medical imaging, saliency maps have become the standard tool for determining whether a neural model has learned relevant robust features, rather than artefactual noise. However, saliency maps are limited to local model explanation because they interpret predictions on an image-by-image basis. We propose aggregating saliency globally, using semantic segmentation masks, to provide quantitative measures of model bias across a dataset. To evaluate global saliency methods, we propose two metrics for quantifying the validity of saliency explanations. We apply the global saliency method to skin lesion diagnosis to determine the effect of artefacts, such as ink, on model bias.

Submitted 3 December, 2019; v1 submitted 16 October, 2019; originally announced October 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

Showing 1–12 of 12 results for author: Pfau, J