-
Are LLMs classical or nonmonotonic reasoners? Lessons from generics
Authors:
Alina Leidinger,
Robert van Rooij,
Ekaterina Shutova
Abstract:
Recent scholarship on reasoning in LLMs has supplied evidence of impressive performance and flexible adaptation to machine-generated or human feedback. Nonmonotonic reasoning, crucial to human cognition for navigating the real world, remains a challenging, yet understudied task. In this work, we study nonmonotonic reasoning capabilities of seven state-of-the-art LLMs in one abstract and one commonsense reasoning task featuring generics, such as 'Birds fly', and exceptions, 'Penguins don't fly' (see Fig. 1). While LLMs exhibit reasoning patterns in accordance with human nonmonotonic reasoning abilities, they fail to maintain stable beliefs on truth conditions of generics upon the addition of supporting examples ('Owls fly') or unrelated information ('Lions have manes'). Our findings highlight pitfalls in attributing human reasoning behaviours to LLMs, as well as assessing general capabilities, while consistent reasoning remains elusive.
Submitted 12 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
The language of prompting: What linguistic properties make a prompt successful?
Authors:
Alina Leidinger,
Robert van Rooij,
Ekaterina Shutova
Abstract:
The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.
Submitted 3 November, 2023;
originally announced November 2023.
-
Tolerance and degrees of truth
Authors:
Pablo Cobreros,
Paul Egré,
David Ripley,
Robert van Rooij
Abstract:
This paper explores the relations between two logical approaches to vagueness: on the one hand the fuzzy approach defended by Smith (2008), and on the other the strict-tolerant approach defended by Cobreros, Egré, Ripley and van Rooij (2012). Although the former approach uses continuum many values and the latter implicitly four, we show that both approaches can be subsumed under a common three-valued framework. In particular, we defend the claim that Smith's continuum many values are not needed to solve what Smith calls 'the jolt problem', and we show that they are not needed for his account of logical consequence either. Not only are three values enough to satisfy Smith's central desiderata, but they also allow us to internalize Smith's closeness principle in the form of a tolerance principle at the object-language. The reduction, we argue, matters for the justification of many-valuedness in an adequate theory of vague language.
Submitted 26 July, 2022;
originally announced July 2022.
-
Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?
Authors:
Rochelle Choenni,
Ekaterina Shutova,
Robert van Rooij
Abstract:
In this paper, we investigate what types of stereotypical information are captured by pretrained language models. We present the first dataset comprising stereotypical attributes of a range of social groups and propose a method to elicit stereotypes encoded by pretrained language models in an unsupervised fashion. Moreover, we link the emergent stereotypes to their manifestation as basic emotions as a means to study their emotional effects in a more generalized manner. To demonstrate how our methods can be used to analyze emotion and stereotype shifts due to linguistic experience, we use fine-tuning on news sources as a case study. Our experiments expose how attitudes towards different social groups vary across models and how quickly emotions and stereotypes can shift at the fine-tuning stage.
Submitted 21 September, 2021;
originally announced September 2021.