Reliable Reasoning Beyond Natural Language

Nasim Borazjanizadeh
University of California, Berkeley
Steven T. Piantadosi
Department of Psychology, UC Berkeley

Abstract

Despite their linguistic competence, large language models (LLMs) often exhibit limitations in their ability to reason reliably and flexibly. To address this, we propose a neurosymbolic approach that prompts LLMs to extract and encode all relevant information from a problem statement as logical code statements, and then uses a logic programming language (Prolog) to conduct the iterative computations of explicit deductive reasoning. Our approach significantly enhances the performance of LLMs on the standard mathematical reasoning benchmark, GSM8k, and on the Navigate dataset from BIG-bench. Additionally, we introduce a novel dataset, the Non-Linear Reasoning (NLR) dataset, consisting of 55 unique word problems that target the shortcomings of the next token prediction paradigm of LLMs and require complex non-linear reasoning but only basic arithmetic skills to solve. Our findings demonstrate that the integration of Prolog enables LLMs to achieve high performance on the NLR dataset, which even the most advanced language models (including GPT4) fail to solve using text only.

1 Introduction

The recent emergence of large language models (LLMs) [3, 25, 24, 8, 7, 29, 34, 35] has revolutionized the field of Natural Language Processing (NLP), with LLMs demonstrating human-level performance across various professional and academic benchmarks [26] and exhibiting an excellent understanding of linguistic rules and patterns [19].

However, despite their linguistic competence, LLMs often demonstrate significant limitations in their capacity to reason reliably and flexibly [19, 12, 39]. These limitations likely stem from the autoregressive architecture of transformers, which forces problems to be solved sequentially: the models’ reliance on a greedy process for predicting the next word constrains their capacity for backtracking and error recovery [12]. Models are expected to generate an answer in a single pass of their feedforward architecture, which cannot implement conditional loops [4]. Moreover, the statistical nature of LLMs’ training and representation means they often fail to generalize appropriately to problems outside their training distribution, especially in settings requiring reasoning and discrete processes [40]. Furthermore, even the most advanced LLMs, including GPT4, have an incredibly short working memory [4], while reliable reasoning requires accurate and robust retrieval and integration of all relevant information.

Additionally, the linear and sequential nature of natural language contrasts with the complex and non-linear computations often involved in deductive reasoning. Even humans struggle with reasoning tasks when the brainstorming medium is confined to text. This is well illustrated by the history of logic. Aristotle’s writing on syllogistic reasoning, for example, lacked the tools of symbolic logic later developed for this kind of argumentation. The result is clunky and difficult to follow, even when correct:

If A has been proved to all or to some B, then B must belong to some A: and if A has been proved to belong to no B, then B belongs to no A. This is a different conclusion from the former. But if A does not belong to some B, it is not necessary that B should not belong to some A: for it may possibly belong to all A.

Others like Venn [36] and Boole [2] developed systems which allowed such reasoning to take place in a different medium: symbolic diagrams and algebraic equations, respectively. These tools support reasoning about much richer types of logical relationships and deductive logic than could be easily conveyed in natural language. Jevons [17] even developed a mechanical system for such logical reasoning, much in the spirit of Babbage’s work (see Gardner [14]). More generally, the principles and notation of mathematics allow us to concisely express concepts that would be incredibly difficult to express in natural language alone. The formalization of reasoning in a system other than natural language has several descendants, from the General Problem Solver of Newell et al. [21], to logic programming languages like Prolog [11], and formal tools for robust verification like Lean [20]. Natural language is not enough for any of these domains.

2 Our Approach

Figure 1: Our approach: A natural language problem (a math word problem from the NLR dataset) is given to an LLM, which is prompted to perform CoT in text and logical code to encode the variable relationships as logical code statements. The Prolog interpreter executes the code. If the Prolog program fails, the LLM is re-prompted until valid code is generated or a limit of attempts is reached.

To enable LLMs to perform deductive reasoning robustly, we propose integrating a reliable, deductive reasoning module into their inference pipeline. Specifically, in this study, we prompt the model to encode the constraints and relationships among variables, as described in the problem statement, as a set of Prolog code statements. The generated code is then evaluated by Prolog, which uses a deductive approach to derive a deterministic answer to the problem (Figure 1). This not only has the advantage of mirroring the likely human architecture of separate linguistic and reasoning systems [13, 19], but, as we show, also significantly improves the performance of LLMs in mathematical reasoning.

Indeed, this approach draws on the strengths of both symbolic and neural systems. Though systems like Prolog support reliable deduction, they have no mechanism to deal with the complexities and intricacies of natural language descriptions of problems. Moreover, they are unable to perform implicit reasoning, which involves extracting information that is not explicitly stated in the text but is instead implied through common sense assumptions and context. However, Prolog and related systems excel at reasoning: they can incorporate an arbitrary number of facts in their deductive processes and generate only conclusions that are valid given their assumptions. Prolog expresses knowledge as a set of relations, facts, and rules, and uses a reasoning engine to run queries over these relations, applying rules through resolution until a solution is found or all possibilities are exhausted. The ability to backtrack, conduct comprehensive searches, and accurately store and retrieve an arbitrary number of rules and relations is difficult to implement using the feedforward architecture of LLMs, but essential for accurate deductive reasoning.
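To make the query-and-backtrack behaviour described above concrete, the following is a minimal Python sketch, not Prolog itself and not part of the pipeline described in this paper. The facts, the `ancestor` relation, and all names are hypothetical and chosen only for illustration. A generator stands in for the reasoning engine: asking for the first item stops at the first solution, while iterating further resumes the search until all possibilities are exhausted.

```python
from typing import Iterator

# Hypothetical toy facts: (x, y) means parent(x, y), i.e. x is a parent of y.
# The facts contain no cycles, so the recursive search below terminates.
PARENT_FACTS = [("ann", "bob"), ("bob", "cy"), ("cy", "dee")]


def ancestor(x: str) -> Iterator[str]:
    """Lazily yield every y such that ancestor(x, y) holds.

    Loosely mirrors two Prolog-style rules:
        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
    """
    for parent, child in PARENT_FACTS:
        if parent == x:
            yield child                 # rule 1: a direct parent fact matches
            yield from ancestor(child)  # rule 2: continue the search via the child


first = next(ancestor("ann"))        # stop as soon as one solution is found
everything = list(ancestor("ann"))   # keep searching until nothing is left
```

The sketch only imitates the control flow; a real Prolog engine additionally performs unification over arbitrary terms, which is not modelled here.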

Moreover, in contrast to procedural or functional programming, the declarative programming paradigm of Prolog focuses on defining what to compute and the program logic rather than specifying the detailed control flow. When LLMs are prompted to generate logical code to solve a problem, this declarative nature reduces the burden on the LLM to define the variables or constraints encoded in the problem in the correct order, or to generate all intermediate steps of the computation correctly, allowing for a more direct mapping of the information encoded in natural language statements to logical code.

Two specific design choices help this approach work well. First, we prompt the LLM to perform Chain of Thought (CoT) [38] reasoning in text and logical code. This in-context learning method integrates natural language comments that walk through the implicit reasoning steps required to arrive at the intermediate variables and code statements, while the code statements encode the explicit constraints and declarative arithmetic statements that the Prolog interpreter needs to compile. This technique allows the model to reason through information that is implied by the context of the problem statement and by common sense but not explicitly stated (see e.g. Table 2). Second, we use the Multiple Try inference algorithm to obtain the models’ logical code generation for the problems. Using this inference method, if the Prolog code generated by the LLM fails to execute successfully¹, we rerun the model with a slightly increased temperature (with a preset maximum number of attempts) and return the numerical answer produced by the model’s first executable code generation (in contrast to, e.g., majority-vote schemes [37]). This approach helps to mitigate the brittleness of symbolic programming code.

¹ This primarily occurs due to variable name assignment errors, as data flows are described without mutability in Prolog’s declarative syntax, unlike procedural programming.
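The retry loop of the Multiple Try inference algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the callables `generate_code` and `run_prolog` are hypothetical stand-ins for the LLM call and the Prolog interpreter, and the default values for `max_attempts`, `base_temperature`, and `temperature_step` are invented, since the text specifies only a "slightly increased temperature" and a "preset maximum number of attempts".

```python
from typing import Callable, Optional


def multiple_try(
    generate_code: Callable[[str, float], str],   # hypothetical LLM wrapper
    run_prolog: Callable[[str], Optional[float]],  # hypothetical interpreter wrapper
    problem: str,
    max_attempts: int = 3,          # assumed value, not from the paper
    base_temperature: float = 0.0,  # assumed value, not from the paper
    temperature_step: float = 0.1,  # assumed value, not from the paper
) -> Optional[float]:
    """Return the answer of the first generated program that executes.

    run_prolog is assumed to return None when the program fails to execute.
    """
    for attempt in range(max_attempts):
        # Raise the sampling temperature slightly on every retry.
        temperature = base_temperature + attempt * temperature_step
        code = generate_code(problem, temperature)
        answer = run_prolog(code)
        if answer is not None:
            return answer  # first executable program wins; no voting
    return None  # attempt limit reached without executable code


# Usage with stubs: the fake model only emits runnable code at temperature >= 0.15.
def fake_llm(problem: str, temperature: float) -> str:
    return "good" if temperature >= 0.15 else "bad"


def fake_prolog(code: str) -> Optional[float]:
    return 18.0 if code == "good" else None


result = multiple_try(fake_llm, fake_prolog, "age puzzle")
```

Returning the first executable result keeps the number of model calls low when the first generation already runs, which is the contrast with majority-vote schemes drawn above.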

We also introduce a novel dataset, the Non-Linear Reasoning (NLR) dataset², which is designed to evaluate the generalizability of LLMs’ mathematical reasoning capabilities. Motivated by the corruption of test and training sets for many mathematical tasks [26], and by the simple and repetitive pattern of reasoning required to solve the problems of current reasoning benchmarks [10, 31, 27, 9, 18], we present a new dataset that (i) is certainly outside of current models’ training sets, and (ii) requires a unique and creative reasoning pattern for each problem, while the mathematical skills needed are limited to basic arithmetic and algebra. This benchmark consists of unique constraint problems, math word problems, and problems that require following algorithmic instructions for updating a game model (see e.g. Table 1). We demonstrate that the most advanced LLMs, including GPT4, struggle to solve these problems when prompted to solve them step by step with a chain of thought text prompt, despite their success on other mathematical tasks [26, 4].

² Link to NLR dataset

NLR Problem Statement	Characteristics
Math Word Problem: When I was half my current age, my father was 30. When I was 1/3 my current age, my mother was 25. And when I was 1/6 of my current age, my sister was 7. If the sum of my age, my sister’s age, my father’s age, and my mother’s age is 116, then how old am I now?	4 entangled variables in the problem
Constraint Satisfaction: In a line to enter a cinema, 4 people are standing between Bob and Alex. Chad’s index in the line is 1 after Bob’s, he’s standing right behind Bob considering the order of people left to right. Frank is right behind Alex. Sam is right in front of Bob. There are 2 people between Sam and Frank. If Bob is in the 7th person in the line, counting left to right, what is the number of Alex?	2 constraints encoding multiple possibilities
Algorithmic Instructions: There’s a cinema with 12 seats organized in 3 rows and 4 columns. Due to covid there’s a policy that a seat can be filled only if none of the seats right next to it in the same column or the same row are not filled. If we place a person in the seat in the second column of the first row and then start to fill the seats left to right, row by row, starting row with 1, how many people can be seated in the cinema in total?	5 entangled variables in each state

Table 1: Examples of each problem category in the NLR dataset
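As an illustration of why the first two problems in Table 1 suit a declarative treatment, the sketch below states their constraints and searches exhaustively for assignments that satisfy all of them. It is written in Python rather than Prolog and is not the code generated in our experiments. The search bounds (`max_age`, `line_length`) are assumptions, as is the reading of the line problem in which "behind" means one position later and "in front of" means one position earlier, following the problem's own gloss of Chad's position.

```python
from fractions import Fraction
from itertools import product


def solve_ages(max_age: int = 120) -> list:
    """Math word problem: return all (me, father, mother, sister) solutions."""
    solutions = []
    for me in range(1, max_age + 1):
        # I was me/2 old me/2 years ago, me/3 old 2*me/3 years ago,
        # and me/6 old 5*me/6 years ago; each relative has aged by that much.
        father = 30 + Fraction(me, 2)
        mother = 25 + Fraction(2 * me, 3)
        sister = 7 + Fraction(5 * me, 6)
        if me + father + mother + sister == 116:
            solutions.append((me, int(father), int(mother), int(sister)))
    return solutions


def solve_line(line_length: int = 15) -> list:
    """Constraint satisfaction problem: return all possible positions of Alex."""
    bob = 7
    positions = range(1, line_length + 1)
    answers = []
    for alex, chad, frank, sam in product(positions, repeat=4):
        if (
            abs(bob - alex) == 5          # 4 people between Bob and Alex
            and chad == bob + 1           # Chad is right behind Bob
            and frank == alex + 1         # Frank is right behind Alex
            and sam == bob - 1            # Sam is right in front of Bob
            and abs(sam - frank) == 3     # 2 people between Sam and Frank
            and len({bob, alex, chad, frank, sam}) == 5  # distinct positions
        ):
            answers.append(alex)
    return answers


ages = solve_ages()
alex_positions = solve_line()
```

Under these assumptions the search yields a single solution for each problem: an age of 18 (with father 39, mother 37, sister 22) and position 2 for Alex. The constraints can be listed in any order, which is the property of declarative encodings that the approach relies on.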

3 Comparison to Other Approaches

Several studies have explored the integration of LLMs with external tools and symbolic reasoning modules [23, 22, 10, 30]. For instance, training LLMs to make API calls to tools like calculators, interpreters, or external datasets has been shown to improve their performance across a variety of reasoning tasks [10, 30, 28]. While these methods have successfully reduced the arithmetic errors of LLMs, they do not sufficiently address the reasoning limitations inherent to the next-token prediction paradigm of LLMs and the linear nature of text, which can restrict the ability to perform comprehensive searches over the space of possibilities, explore multiple pathways to a solution, or backtrack.

Our approach builds upon and extends the work of LINC [23] and Nye et al. [22]. LINC is a neurosymbolic process that uses LLMs to convert natural language into first-order logic expressions and then determines the truth value of conclusions with a symbolic theorem prover. This method has shown significant performance gains on the FOLIO [15] and ProofWriter [33] datasets compared to CoT prompting. However, it is limited in capturing implicit information not explicitly stated in the premises, as it primarily uses LLMs as a semantic parser, translating each natural language premise directly into a logical statement [23]. Similarly, Nye et al. [22] improve the performance of LLMs in story generation and instruction-following tasks by using a symbolic reasoning module to check the logical consistency of generated text against a minimal world model. This method increases the accuracy and robustness of neural generation but is limited by the need to hand-craft the world model and define specific constraints.

In our approach, the world model is constructed by the LLM itself, with no limitations on the number of constraints that can be encoded in the problem. Moreover, rather than using LLMs as semantic parsers or text-to-logical code translators, we prompt the LLM to perform chain of thought (CoT) reasoning in both text and logical code, prompting the LLM to conduct implicit reasoning and use additional tokens as working memory to derive intermediate variables. This enables a more flexible and generalizable reasoning process, making our neurosymbolic approach applicable to a wider variety of problems.