Conditional and Modal Reasoning in Large Language Models

Wesley H. Holliday¹ Matthew Mandelkern² Cedegao E. Zhang³
¹University of California, Berkeley
²New York University
³Massachusetts Institute of Technology

Abstract

The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-five LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., ‘If Ann has a queen, then Bob has a jack’) and epistemic modals (e.g., ‘Ann might have an ace’, ‘Bob must have a king’). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. Among the LLMs we tested, all but the GPT-4 model family often make basic mistakes with conditionals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Moreover, even the GPT-4 family displays logically inconsistent judgments across inference patterns involving epistemic modals, and almost all models give answers to certain complex conditional inferences widely discussed in the literature that do not match human judgments. These results highlight gaps in basic logical reasoning in today’s LLMs.


1 Introduction

One of the most distinctive human cognitive abilities is the ability to engage in conditional thinking, about what follows if something is the case, and modal thinking, about what might or must be the case (Evans and Over, 2004; Portner, 2009). Such reasoning about distal possibilities is crucial to the human capacity for planning (we try to choose the action that would bring about the best effects if we were to take it; Gibbard and Harper, 1981), causal reasoning (C causes E if E wouldn’t have happened if C hadn’t; Lewis, 1973a; Beller and Gerstenberg, 2023), retroactive evaluation, and more.

Conditional and modal language has thus been a central focus of philosophers (e.g., Stalnaker, 1968; Lewis, 1973b; Khoo, 2022), linguists (e.g., Kratzer, 2012; Portner, 2009), and logicians (e.g., Kripke, 1963; Stalnaker and Thomason, 1970; van Benthem, 2023), as well as an interest of computer scientists (e.g., Friedman and Halpern, 1994; Fagin et al., 1995), leading to a variety of sophisticated models of conditional and modal reasoning (Egré and Rott, 2021; Garson, 2024).

Figure 1: Summary of performance on the simple logical inference patterns discussed in § 4. Guessing accuracy is 50%. Larger models generally perform better, and most models show clear weakness at this task.

With the rapid recent development of large language models (LLMs) that at least superficially resemble human speakers and reasoners in many respects (Huang and Chang, 2022; Wei et al., 2022; Bubeck et al., 2023; Zhao et al., 2023), a natural question to ask is to what extent LLMs have mastered conditional and modal reasoning. In this paper, we begin to tackle this problem from the perspective of philosophers and logicians, probing the degree to which different LLMs have mastered the logical inference patterns characteristic of conditional and modal reasoning. For example, consider the pattern known as Modus Tollens (MT): ‘If $p$, then $q$. Not $q$. Therefore, not $p$’. We tested whether LLMs draw inferences in accord with this pattern by prompting them with many instances of the pattern, as in:

User prompt: From ‘If Alex finished the race, then Chris finished the race’ together with ‘Chris did not finish the race’, can we infer ‘Alex did not finish the race’? (System prompt: Answer only with ‘yes’ or ‘no’ and nothing else.)

GPT-4: yes. Mistral 7B: no. Etc.

We then gave other instances of the pattern to each LLM, assessing their performance in terms of accuracy on the pattern of inference. Figure 1 summarizes performance across several inference patterns to be discussed. We also compared performance on the zero-shot condition shown above with few-shot and chain-of-thought conditions (Table 2).
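The zero-shot setup described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' evaluation code: the `Item` type, the `ask` callable (standing in for whatever model API is used), the reply normalization, and the single-premise phrasing are all assumptions; only the prompt wording for the two-premise case follows the Modus Tollens example quoted above.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class Item(NamedTuple):
    """One instance of an inference pattern (hypothetical container type)."""
    premises: Tuple[str, ...]
    conclusion: str
    gold: str  # expected answer: "yes" for valid patterns, "no" for invalid ones


# System prompt as quoted in the example above.
SYSTEM_PROMPT = "Answer only with ‘yes’ or ‘no’ and nothing else."


def build_user_prompt(item: Item) -> str:
    """Format an item in the style of the Modus Tollens prompt above."""
    quoted = [f"‘{p}’" for p in item.premises]
    return f"From {' together with '.join(quoted)}, can we infer ‘{item.conclusion}’?"


def normalize(reply: str) -> str:
    """Map a raw reply such as ' Yes. ' to 'yes' or 'no'; anything else is 'invalid'."""
    word = reply.strip().lower().strip(".!'\" ")
    return word if word in ("yes", "no") else "invalid"


def accuracy(items: Iterable[Item], ask: Callable[[str, str], str]) -> float:
    """Fraction of items whose normalized reply matches the gold label.

    `ask(system_prompt, user_prompt)` is an assumed interface to a model.
    """
    items = list(items)
    hits = sum(
        normalize(ask(SYSTEM_PROMPT, build_user_prompt(it))) == it.gold
        for it in items
    )
    return hits / len(items)


mt = Item(
    premises=(
        "If Alex finished the race, then Chris finished the race",
        "Chris did not finish the race",
    ),
    conclusion="Alex did not finish the race",
    gold="yes",
)
print(build_user_prompt(mt))
```

Keeping the model behind a plain callable means the same scoring loop could be reused across providers; counting unparseable replies as wrong is one possible design choice among several.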

After providing some background in § 2 and detailing our experimental setup in § 3, we discuss results for a number of inference patterns in § 4. We find that among the LLMs tested, all but the GPT-4 family often make basic mistakes with conditionals. Moreover, even GPT-4 displays logically inconsistent judgments across inference patterns involving epistemic modals, and almost all models give answers to certain complex conditional inferences that do not match reported human judgments. We also show that models’ performance on our reasoning tasks is highly correlated with their Chatbot Arena Elo ratings (Chiang et al., 2024) and with their performance on MMLU (Hendrycks et al., 2020) and GSM8K (Cobbe et al., 2021), supporting the hypothesis that logical reasoning abilities are predictive of general model capabilities and performance on downstream tasks. In sum, our main contributions in this paper are:

• Emphasizing the importance and nuances of reasoning about conditionals and modals, grounded in up-to-date evidence and theories from the relevant literature.
• Proposing a focused, novel benchmark that tests LLMs’ ability to engage in logical reasoning with conditionals and modals.
• Reporting the performance of a large set of LLMs in different prompting settings and identifying some of their gaps and undesirable behaviors in basic logical reasoning.

2 Background and related work

Our goal is to apply methodologies from the philosophical, logical, and linguistic literature on conditionals and modals to the study of LLMs.

2.1 Logical inference

First and most generally, we draw on a philosophical understanding of what a logical inference is. Logical inferences are those inferences that are valid just in virtue of the meaning of logical words like ‘and’, ‘or’, ‘not’, ‘if’, ‘must’, ‘might’, and so on. That is, a logically valid inference is one whose conclusion is always true when its premises are, no matter how the non-logical words in the premises and conclusion are understood (Tarski, 1936).

This contrasts with more colloquial uses of ‘logical reasoning’ that are current in the literature on LLMs, where ‘logical reasoning’ is often used for reasoning in general, involving inferential leaps of various kinds that go beyond deductive inference proper (this holds for many of the tasks studied in Xu et al. (2023), Chen et al. (2023), Huang and Chang (2023), and Liu et al. (2021), and for nearly all the BIG-bench tasks (BIG-bench authors, 2023) categorized under the keyword ‘logical reasoning’). For instance, in the logicians’ sense, the inference ‘A is to the left of B, hence B is to the right of A’ is not logically valid, since its correctness depends on the meaning of the non-logical words ‘left’ and ‘right’. By contrast, ‘A is to the left of B, hence something is to the left of B’ is logically valid, since its correctness relies only on the meaning of the logical word ‘something’. While studying content-based reasoning in LLMs is obviously of great interest, we believe it is also of fundamental interest to study purely logical reasoning in LLMs, since such reasoning is plausibly part of the backbone of human inference and knowledge of meanings.

Regarding purely logical reasoning, a series of benchmarks have been created in recent years (Tafjord et al., 2020; Tian et al., 2021; Han et al., 2022; Saparov and He, 2023; Saparov et al., 2023), and various strategies have been proposed to solve some of them (Creswell et al., 2023; Kazemi et al., 2023; Olausson et al., 2023; Pan et al., 2023; Poesia et al., 2023; Ye et al., 2023). Those benchmarks primarily focus on multi-step reasoning, where a proof is required from the premises to the hypothesis. Here we target single-step inference patterns, which we treat as more fundamental. An inability to recognize the basic inference patterns we study here could help explain failures on those multi-step reasoning problems. Additionally, none of the work above studies modal operators and their interactions with conditionals.

2.2 Modals and conditionals

Here we draw specifically on the logical and philosophical literature on modals and conditionals. Enormous progress has been made in the last half century on both topics. First, modal operators like ‘must’ and ‘might’ have been successfully modeled as quantifiers over possible worlds (Kripke, 1963; Kratzer, 1981). That is, just as ‘Every boy is sitting’ quantifies universally over all boys (in a given domain), ‘It must be raining’ quantifies over all possible worlds (in a given domain) and says that it is raining in all of them; and just as ‘Some boy is sitting’ quantifies existentially over boys, ‘It might be raining’ says that it is raining in some possible world. This interpretation yields corresponding logics of modality, with the details depending on how the domain of possible worlds is obtained (and on the interpretation of the other connectives and operators with which modals interact).

Conditional operators have likewise been analyzed with possible worlds semantics. In classical logic, ‘if $p$, then $q$’ is treated as the material conditional, which is true whenever $p$ is false or $q$ is true. However, it is almost universally accepted by philosophers, linguists, and logicians that this treatment is a very poor approximation to the actual meaning of ‘if’ in natural language. For instance, on the material analysis of ‘if’, ‘No student will fail if she studies hard’ would entail ‘Every student will study hard’, which obviously does not follow. Likewise, if the material analysis were correct, then the probability of ‘if $p$, then $q$’ would go up as the probability of $p$ goes down, but this is wrong. Consider a fair coin. The probability that the coin will land heads if it is flipped is intuitively 0.5, and it is intuitively probabilistically independent of whether the coin is flipped. That is, finding out that the coin probably will not be flipped does not make it any more likely that if it is flipped, it will land heads (Douven and Verbrugge, 2013). Edgington (1995) provides a battery of widely accepted further arguments against the material analysis.
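Both arguments against the material analysis can be checked with a little arithmetic. The following Python sketch is illustrative only: the function names, the chosen flip probabilities, and the toy domain of two students are assumptions introduced here. It computes the probability that the material conditional ‘if the coin is flipped, it lands heads’ is true, namely $P(\neg\text{flip}) + P(\text{flip})\cdot 0.5$, and brute-forces the unwanted entailment in the student example.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional: false only when a is true and b is false."""
    return (not a) or b


def material_probability(p_flip: float, p_heads_given_flip: float = 0.5) -> float:
    """P(not flipped, or flipped and heads): chance the material conditional is true."""
    return (1 - p_flip) + p_flip * p_heads_given_flip


# The conditional probability of heads given a flip stays at 0.5 for a fair coin,
# but the material conditional becomes more probable as a flip becomes less likely.
for p_flip in (0.9, 0.5, 0.1):
    print(f"P(flip)={p_flip:.1f}  P(material conditional)={material_probability(p_flip):.2f}")


def no_student_fails_if_studies(students) -> bool:
    """'No student will fail if she studies hard', with 'if' read materially.

    Each student is a (studies_hard, fails) pair of booleans.
    """
    return not any(implies(studies, fails) for studies, fails in students)


def every_student_studies(students) -> bool:
    return all(studies for studies, _ in students)


# Brute-force check over every class of two students (a hypothetical toy domain).
profiles = list(product([False, True], repeat=2))
entailment_holds = all(
    every_student_studies(cls)
    for cls in product(profiles, repeat=2)
    if no_student_fails_if_studies(cls)
)
print(entailment_holds)  # True: on the material reading the unwanted entailment goes through
```

The loop prints 0.55, 0.75, and 0.95 for flip probabilities of 0.9, 0.5, and 0.1, which is the counterintuitive rise the paragraph describes.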

These points are worth emphasizing, since although the material analysis is almost universally rejected by theorists of the conditional, it is still assumed in much existing work testing the logical capacities of humans and LLMs, in both cognitive science and artificial intelligence (e.g., in the recent Wan et al. (2024), which treats the material analysis as one of the benchmarks of correct reasoning with conditionals). This is a serious blind spot, since failing to reason in accord with the material conditional may be logically correct; and, conversely, reasoning in accord with the material conditional may be a serious logical mistake.

The most popular alternative treats ‘if $p$, then $q$’ as a restricted modal operator, which says that $q$ is true in all $p$-worlds (in a given domain). As for modals, this yields corresponding logics, with the details again depending on assumptions about which $p$-worlds are in the domain, together with the interpretation of other connectives (Stalnaker, 1968; Lewis, 1973b; Egré and Rott, 2021).
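The quantificational picture of ‘must’, ‘might’, and the restricted-modal ‘if’ can be made concrete with a toy model. The Python sketch below is a deliberately simplified illustration under assumptions made here, not a rendering of any particular theory cited above: worlds are sets of true atoms, the domain is just a set of worlds with no ordering or selection function, and all names are hypothetical.

```python
from itertools import combinations, product

ATOMS = ("p", "q")
# A world is modelled as the frozenset of atoms true at it (a toy assumption).
WORLDS = [
    frozenset(a for a, v in zip(ATOMS, vals) if v)
    for vals in product([False, True], repeat=len(ATOMS))
]


def must(domain, prop) -> bool:
    """'must': prop holds at every world in the domain (universal quantifier)."""
    return all(prop(w) for w in domain)


def might(domain, prop) -> bool:
    """'might': prop holds at some world in the domain (existential quantifier)."""
    return any(prop(w) for w in domain)


def if_then(domain, antecedent, consequent) -> bool:
    """Restricted-modal conditional: consequent holds at every antecedent-world."""
    return all(consequent(w) for w in domain if antecedent(w))


p = lambda w: "p" in w
q = lambda w: "q" in w
not_p = lambda w: "p" not in w

# Every non-empty domain built from the four worlds.
domains = [set(c) for r in range(1, len(WORLDS) + 1) for c in combinations(WORLDS, r)]

# 'might not p' and 'not must p' agree on every domain, as quantifier duality predicts.
duality_holds = all(might(d, not_p) == (not must(d, p)) for d in domains)

# The restricted conditional can be false where the material conditional is true:
# p is false at the actual world, but the domain contains a p-world where q fails.
actual = frozenset()
domain = {actual, frozenset({"p"})}
material_true = (not p(actual)) or q(actual)
restricted_true = if_then(domain, p, q)

print(duality_holds, material_true, restricted_true)  # True True False
```

The last comparison shows in miniature why a false antecedent does not suffice for the truth of a conditional on this kind of analysis.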

Although the material analysis is almost universally rejected, there is ongoing controversy about the correct logic of conditionals and modals. We have chosen a wide range of inference patterns to test: in many of these cases there is (near) universal agreement about whether the inference pattern is valid. In other cases, there is less agreement about whether the pattern is truly valid, but even in those cases, there is for the most part agreement about whether naive human reasoners are inclined to draw the inference, and the remaining controversy is about how to model those patterns (as genuine (in)validities or the result of systematic context-shifting). We do not aim to take a position in these complex debates here but rather to compare the behavior of LLMs to widely reported human inferential dispositions. In future work, we plan to compare the behavior of LLMs with human subjects (compare the methodology of Pavlick and Kwiatkowski, 2019; Dasgupta et al., 2022; Webson et al., 2023); in this paper, we compare LLMs against expert claims about inference from the philosophical/logical literature.

2.3 Natural language inference

Our task format and evaluation method are similar to those used in the natural language inference (NLI) paradigm (Bowman et al., 2015; Williams et al., 2018; Nie et al., 2019), which has a rich and long tradition (Katz, 1972; Condoravdi et al., 2003; van Benthem, 2008; MacCartney and Manning, 2009; Dagan et al., 2010). There, a problem comes with a premise $P$ and a hypothesis $H$, and the goal is to decide whether $P$ entails, contradicts, or is neutral with respect to $H$. The notion of entailment is typically based on common sense, whereas in this work we exclusively study logical entailment in the sense specified above.

In sum, our approach differs from previous work on LLMs in two central ways: (i) we focus on one-step logical inference, in the austere philosophical sense, rather than common-sense reasoning in general, differing from most extant benchmarks; (ii) we bring sophisticated approaches to the logic of conditionals and modals from philosophy, linguistics, and logic, yielding new ways to assess how closely LLMs match human reasoning in this key domain. In particular, in contrast to the work on logical reasoning cited above, we go beyond propositional/predicate logic to incorporate more realistic approaches to the logic of conditionals and modals, which to our knowledge has not been explored.

Category	Pattern	Schema	Example
Valid	Disjunctive Syllogism (DS)	$p\vee q,\neg q\vdash p$	Either Fido is inside or Fido is in the garden. Fido is not in the garden. $\vdash$ Fido is inside.
Valid	Modus Ponens (MP)	$p\to q,p\vdash q$	If Mary was at the wedding, then Sue was at the wedding. Mary was at the wedding. $\vdash$ Sue was at the wedding.
Valid	Modus Tollens (MT)	$p\to q,\neg q\vdash\neg p$	If Mary was at the wedding, then Sue was at the wedding. Sue was not at the wedding. $\vdash$ Mary was not at the wedding.
Valid (modal)	Might Not (MiN)	$\lozenge\neg p\vdash\neg\Box p$	Mary might not have been at the wedding. $\vdash$ It’s not the case that Mary must have been at the wedding.
Valid (modal)	Not Must (NMu)	$\neg\Box p\vdash\lozenge\neg p$	It’s not the case that Mary must have been at the wedding. $\vdash$ Mary might not have been at the wedding.
Invalid	Affirming the Consequent (AC)	$p\to q,q\vdash p$	If Mary was at the wedding, then Sue was at the wedding. Sue was at the wedding. $\vdash$ Mary was at the wedding.
Invalid	Conversion (CONV)	$p\to q\vdash q\to p$	If Mary was at the wedding, then Sue was at the wedding. $\vdash$ If Sue was at the wedding, then Mary was at the wedding.
Invalid	Denying the Antecedent (DA)	$p\to q,\neg p\vdash\neg q$	If Mary was at the wedding, then Sue was at the wedding. Mary was not at the wedding. $\vdash$ Sue was not at the wedding.
Invalid	Inversion (INV)	$p\to q\vdash\neg p\to\neg q$	If Mary was at the wedding, then Sue was at the wedding. $\vdash$ If Mary was not at the wedding, then Sue was not at the wedding.
Controversial	Antecedent Strengthening (AS)	$p\to q\vdash(p\wedge r)\to q$	If the match is struck, then it will light. $\vdash$ If the match is struck and has been soaked in water, then it will light.
Controversial	Contraposition (CT)	$p\to q\vdash\neg q\to\neg p$	If it’s raining, then it’s not raining hard. $\vdash$ If it’s raining hard, then it’s not raining.
Controversial	Complex Modus Ponens (CMP)	$p\to(q\to r),p\vdash q\to r$	If the Warriors don’t win, then if the Lakers don’t win, the Celtics will. The Warriors won’t win. $\vdash$ If the Lakers don’t win, the Celtics will.
Controversial (modal)	DS with ‘must’ (DSmu)	$p\vee\Box q,\neg\Box q\vdash p$	Either Fido is inside or Fido must be in the garden. It’s not the case that Fido must be in the garden. $\vdash$ Fido is inside.
Controversial (modal)	DS with ‘might’ (DSmi)	$p\vee\Box q,\lozenge\neg q\vdash p$	Either Fido is inside or Fido must be in the garden. Fido might not be in the garden. $\vdash$ Fido is inside.
Controversial (modal)	MT with ‘might’ (MTmi)	$p\to\Box q,\lozenge\neg q\vdash\neg p$	If Mary was at the wedding, then Sue must have been there. Sue might not have been there. $\vdash$ Mary was not at the wedding.
Controversial (modal)	MT with ‘must’ (MTmu)	$p\to\Box q,\neg\Box q\vdash\neg p$	If Mary was at the wedding, then Sue must have been there. It’s not the case that Sue must have been at the wedding. $\vdash$ Mary was not at the wedding.
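For the non-modal schemas in the table above, classical validity can be checked mechanically by truth table. The Python sketch below is an illustration under one loud assumption: it reads $\to$ as the material conditional, which is exactly the analysis Section 2.2 argues is a poor model of natural-language ‘if’. The `PATTERNS` dictionary and the function names are hypothetical. Its output therefore shows what the material analysis predicts, not what is correct for ordinary conditionals.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional (an assumption of this sketch, see the lead-in)."""
    return (not a) or b


def valid(premises, conclusion, n_atoms: int) -> bool:
    """Classically valid: the conclusion is true in every row where all premises are."""
    return all(
        conclusion(*row)
        for row in product([False, True], repeat=n_atoms)
        if all(premise(*row) for premise in premises)
    )


# (premises, conclusion, number of atoms) for each propositional schema in the table.
PATTERNS = {
    "DS": ([lambda p, q: p or q, lambda p, q: not q], lambda p, q: p, 2),
    "MP": ([lambda p, q: implies(p, q), lambda p, q: p], lambda p, q: q, 2),
    "MT": ([lambda p, q: implies(p, q), lambda p, q: not q], lambda p, q: not p, 2),
    "AC": ([lambda p, q: implies(p, q), lambda p, q: q], lambda p, q: p, 2),
    "CONV": ([lambda p, q: implies(p, q)], lambda p, q: implies(q, p), 2),
    "DA": ([lambda p, q: implies(p, q), lambda p, q: not p], lambda p, q: not q, 2),
    "INV": ([lambda p, q: implies(p, q)], lambda p, q: implies(not p, not q), 2),
    "AS": ([lambda p, q, r: implies(p, q)], lambda p, q, r: implies(p and r, q), 3),
    "CT": ([lambda p, q: implies(p, q)], lambda p, q: implies(not q, not p), 2),
    "CMP": (
        [lambda p, q, r: implies(p, implies(q, r)), lambda p, q, r: p],
        lambda p, q, r: implies(q, r),
        3,
    ),
}

results = {name: valid(prem, concl, n) for name, (prem, concl, n) in PATTERNS.items()}
for name, is_valid in results.items():
    print(f"{name}: {'valid' if is_valid else 'invalid'} on the material analysis")
```

On this reading DS, MP, and MT come out valid and AC, CONV, DA, and INV invalid, matching the table, while AS, CT, and CMP all come out valid, which is where the material analysis parts ways with the intuitive counterexamples in the table's examples.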

Model ($T=0$)	0-shot accuracy (%)	Few-shot delta	0-shot CoT delta
GPT-4 Turbo (2024-04-09)	98.5	+0.2	+0.8
GPT-4 Turbo (1106)	96.5	+1.0	+2.7
GPT-4 (0613)	95.6	+0.8	+3.1
Gemini 1.5 Pro	92.1	+3.8	+6.7
GPT-4 (0314)	91.7	-1.9	+6.9
GPT-4o (2024-05-13)	87.9	+1.3	+9.6
Claude 3 Opus	82.9	+0.8	+14.0
Gemini 1.5 Flash	82.1	+0.4	+8.8
Llama 3 Instruct 70B	77.1	+1.7	+13.1
Mixtral 8x7B	75.8	+2.7	+12.9
Claude 3 Sonnet	74.6	+0.8	+10.2
Llama 3 Instruct 8B	69.8	-3.1	+11.3
Claude 3 Haiku	69.0	-3.1	+9.6
Phi-2	60.0	+3.1	+16.7
Mistral 7B	59.8	-1.7	+10.2
Code Llama 13B	59.0	-1.0	+10.8
Code Llama 7B	57.5	-0.2	+11.5
GPT-3.5 Turbo (0613)	56.3	-1.0	+10.4
Code Llama 34B	56.3	-2.7	+9.8
GPT-3.5 Turbo (0125)	56.0	-3.8	+4.6
Llama 2 Chat 13B	55.2	-3.1	+3.8
Llama 2 Chat 70B	55.2	-5.2	+2.5
Llama 2 Chat 7B	55.0	-5.0	+0.2
GPT-3.5 Turbo (1106)	54.0	-4.2	-2.1
Yi Chat 34B	45.6	+4.2	+5.8
