Solving Zebra Puzzles
Using Constraint-Guided Multi-Agent Systems

Shmuel Berman
[email protected]
Baishakhi Ray
[email protected]
Kathleen McKeown
[email protected]
Abstract

Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off-the-shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing a 166% improvement in the number of fully correct solutions.


1 Introduction

Automated problem solving has long been a major goal in the field of Artificial Intelligence. This task ranges from trivial problems, like simple arithmetic or string searches, to more complex ones, such as solving a chess position. However, unstructured problems presented in natural language introduce additional complications in modeling the problem accurately. Solving such problems has been extensively studied, from simple mathematical problems in the subfield of word problem solving to applications like automated code generation by Large Language Models (LLMs) (Mukherjee and Garain, 2009; Chen et al., 2021). These problems are particularly difficult because translating natural language into a precise logical or computational form requires sophisticated understanding and interpretation, making it a significant challenge in AI research.

Figure 1: An Example Zebra Puzzle.

In this paper, we focus on a particular type of unstructured natural language problem known as a logic grid problem, or colloquially, an Einstein or Zebra puzzle. A Zebra puzzle is a set of natural language assertions involving multiple entities that are linked by various attributes (Fig. 1 shows an example). To solve a puzzle, the user must correctly assign attributes to all of the entities. These attributes range from descriptions to relative ordering. Participants are provided with a series of clues in natural language, which they must use to deduce the correct relationships using logical reasoning and by adhering to implicit domain constraints. These puzzles require the solver to map from natural language to structured space, understand implicit assumptions, and in some cases use domain-specific knowledge. For instance, as illustrated in Fig. 1, the solver must assign the correct attributes for three houses based on a series of interconnected clues.

Zebra puzzles are particularly challenging due to:

  • Complex Inferences: Each clue provides partial information that must be combined with others to deduce the solution.

  • High Interdependency: An error on one clue significantly impacts others, making the solution space highly interconnected.

  • Natural Language (NL) Clues: Translating ambiguous NL clues into logical statements or formal representations is challenging.

  • Large Solution Space: The solver needs to explore numerous possibilities and combinations to find the solution.

  • Consistency Checking: Potential solutions must be checked against all clues, which is computationally intensive and requires sophisticated, domain-specific reasoning.

The factors mentioned make Zebra puzzles difficult for both humans and AI systems due to the need for precise interpretation, inference, and logical reasoning. In Fig. 1, for example, a solver cannot simply map the spatial relationship between the Football house and the Red house. It must also encode additional constraints: the Football and Red houses occupy House 1 or House 3, respectively, and they must not be the same house. Encoding these constraints is non-trivial, as it requires detailed semantic interpretation of the clue’s subtext. Failure to accurately encode these subtleties usually renders the puzzle unsolvable.

This complexity has empirically been shown to challenge puzzle-solving models significantly. Prior work often employed human-in-the-loop methods. Milicevic et al. (2012) translated puzzles into formal logic but required users to rephrase or rewrite ambiguous clues. Claes et al. (2019) developed ZebraTutor, which creates a puzzle-specific lexicon to formalize the problem but needed users to edit the lexicon for accuracy. Prior research using ChatGPT to solve Zebra puzzles reported a correctness rate of only 8.33% (Groza, 2023), with performance deteriorating significantly as the problem’s complexity increases.

Due to their complexity, solving Zebra puzzles effectively requires the use of a constraint solver; a solver can efficiently determine the feasible and infeasible solution space within the given constraints. However, converting natural language clues into a formal representation suitable for a solver is a non-trivial task. This process often involves intricate interpretation of clues, which must be precise to ensure that the solver can operate correctly. Additionally, maintaining consistency across all clues requires iterative back-and-forth reasoning.
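
To make the role of a constraint solver concrete, the following minimal sketch enumerates every assignment of a tiny three-house puzzle and keeps those consistent with all clues. The clues, attribute values, and the function name `solve` are hypothetical and are not taken from Fig. 1; a real SMT solver prunes the search far more efficiently than this brute-force enumeration.

```python
from itertools import permutations

COLORS = ("Red", "Green", "Blue")
SPORTS = ("Football", "Baseball", "Basketball")


def solve():
    """Enumerate all assignments and keep those that satisfy every clue."""
    solutions = []
    for colors in permutations(COLORS):      # colors[i] is the color of house i + 1
        for sports in permutations(SPORTS):  # sports[i] is the sport of house i + 1
            color_at = {c: i + 1 for i, c in enumerate(colors)}
            sport_at = {s: i + 1 for i, s in enumerate(sports)}
            clues = (
                # Hypothetical clue 1: the Football house is directly left of the Red house.
                # This also implies the two are different houses.
                sport_at["Football"] + 1 == color_at["Red"],
                # Hypothetical clue 2: the Blue house is the first house.
                color_at["Blue"] == 1,
                # Hypothetical clue 3: the Green house plays Baseball.
                color_at["Green"] == sport_at["Baseball"],
            )
            if all(clues):
                solutions.append((colors, sports))
    return solutions


if __name__ == "__main__":
    for colors, sports in solve():
        print(colors, sports)
```

Using permutations encodes the implicit domain constraint that each attribute value belongs to exactly one house, which is the kind of unstated assumption a translation into a formal language must make explicit.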

To address the challenges inherent in solving Zebra puzzles, we introduce a multi-agent based system, ZPS. This system decomposes the problem-solving process into discrete, manageable components, enhancing the handling of complex interdependencies and constraints. Each agent is responsible for a specific aspect of the problem, working collaboratively and using feedback loops to refine their answers and ensure consistency.

In this framework, we integrate Large Language Models (LLMs) with formal reasoning. First, an LLM agent decomposes a given puzzle into sub-problems. Then, another LLM agent interprets the NL clues of each sub-problem and generates SMT-LIB translations of the constraints and parameters. An off-the-shelf SMT solver then processes these translations to produce a model that corresponds to the solution. (Satisfiability Modulo Theories (SMT) is a decision problem that involves determining whether a given logical formula is satisfiable, considering various background theories like Arithmetic, Arrays, Bit-Vector, etc. SMT extends the concept of Boolean satisfiability (SAT) by incorporating more complex theories.) The output, including the model and any syntactic errors, is fed back to the LLM, which generates a new translation addressing syntactic and semantic errors, emulating back-and-forth reasoning. This continuous feedback refines the model’s predictions and ensures the translations are both syntactically correct and solvable. Our approach demonstrates improvements across all three LLMs we tested, with GPT-4 showing up to a 166% increase in the number of fully correct solutions.

The main contributions of our research are as follows:

  1. We demonstrate that combining a formal constraint solver with an LLM interpreter using an agent-based approach for solving Zebra Puzzles significantly improves upon existing baseline methodologies.

  2. We implement a plan generation and decomposition strategy, enabling step-by-step reasoning that enhances the solving process.

  3. We introduce an iterative conversation-based feedback mechanism that allows for continual refinement of solutions, adapting dynamically to the solving context.

  4. We incorporate an autograder within our system to evaluate the accuracy of solutions, ensuring reliability and precision in automated assessments. We also present the results of a user study showing that this autograder correlates very well with human graders.

2 Methodology

Figure 2: Logic Puzzle Solver Workflow
Figure 3: Example Feedback Puzzle Solving Process. The puzzle is decomposed and then the LLM-agent attempts to translate it into a logical SMT formula. The theorem prover attempts to solve it, and the feedback is fed back into the LLM-agent so that it can modify its formal representation.

We integrate LLMs with formal systems within a multi-agent framework to solve Zebra puzzles. The process involves a series of steps where the problem is decomposed, translated into a formal language (SMT-LIB), solved using a theorem prover, and iteratively refined based on feedback. This approach aims to leverage the strengths of both LLMs and formal solvers, ensuring robust problem-solving capabilities.

2.1 Multi-Agent Workflow

The workflow, as illustrated in Figure 2, integrates multiple agents to transform a natural language puzzle into a logically solvable structure and then iteratively refines the solution. The process is initiated by the Decomposition Agent and continuously refined through a feedback loop that encompasses both the translation to SMT-LIB and the solving phases. Figure 3 shows a working example of puzzle solving by our method.

Decomposition

The input puzzle, expressed in natural language, is first decomposed by the Decomposition LLM-Agent. This agent identifies and isolates key entities, attributes, and relationships, structuring them into smaller, systematically translatable components. This is a first step that ensures the puzzle is presented in a format amenable to formal processing.

Feedback Loop

The core of our methodology lies in the feedback loop where continuous refinement of the solution occurs. This loop integrates the translation of decomposed components into SMT-LIB format by the Solver LLM-Agent and the subsequent problem solving using a theorem prover, whose output serves as feedback. Each iteration through the loop consists of the following steps:

  • Translation to SMT-LIB: After decomposition, the puzzle components are systematically translated into SMT-LIB (Satisfiability Modulo Theories Library) by the Solver LLM-Agent. This format is essential for interfacing with theorem provers and ensures that logical constraints and relationships are accurately represented.

  • Solving with Theorem Prover: The SMT-LIB formatted components are then processed by the theorem prover (Z3 in our implementation). The theorem prover attempts to find a satisfying assignment that adheres to all given constraints.

  • Evaluate and Refine: The solution generated by the theorem prover is evaluated by the Solver LLM-Agent to determine if it meets the puzzle’s requirements. If the solution is deemed insufficient, either due to explicit errors or because of how the attributes are assigned, the SMT-LIB formalization of the puzzle is modified and the cycle repeats. Otherwise, the LLM-agent submits its final answer.

This iterative process ensures that the Solver LLM-Agent and the Theorem Prover continually refine the solution until the Solver LLM-Agent is satisfied with the final assignments.

2.2 Modeling the Agent Environment

The feedback loop is how the agents engage with each other. This loop is mathematically modeled using a combination of evaluation functions and error detection mechanisms, which together guide the system towards a solution that optimally satisfies the problem constraints.

More formally, let $\mathcal{D}$, $\mathcal{G}$, $\mathcal{T}$, $\mathcal{E}$, and $\mathcal{F}$ represent the decomposition, translation to SMT-LIB, theorem solving, evaluation, and feedback functions, respectively. The feedback loop can be described by the following recursive function:

$$S_{k+1} = \mathcal{F}(\mathcal{E}(\mathcal{T}(\mathcal{G}(\mathcal{D}(P)), S_k)), S_k)$$

Where,
$P$: initial puzzle in natural language.
$S_k$: solution state at the $k$-th iteration.
$\mathcal{D}(P)$: decomposes $P$ into a structured format amenable to translation.
$\mathcal{G}$: translates this structure into the SMT-LIB format.
$\mathcal{T}$: applies the theorem prover to find a solution that satisfies the logical constraints.
$\mathcal{E}$: evaluates this solution to determine its adequacy in solving the puzzle’s clues.
$\mathcal{F}$: adjusts the translation based on the evaluation, aiming to correct any errors or optimize the solution.

Convergence Criteria

The convergence of this iterative process is governed by the Solver LLM-agent’s evaluation function $\mathcal{E}$, which assesses both the correctness of the solution against the domain-specific requirements and the presence of any syntactic or semantic errors detected by $\mathcal{T}$. We assume that $\mathcal{E}$ is a black-box function defined by the instructions given to the LLM. The loop terminates when $\mathcal{E}$ returns a value indicating that the solution $S_k$ sufficiently meets all puzzle requirements and contains no detectable errors, or when a maximum retry limit has been reached.
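
The recursion and its stopping rule can be sketched as a plain loop. This is only an illustration under stated assumptions: the five callables stand in for the decomposition, translation, theorem-proving, evaluation, and feedback functions, and their signatures, the list-based solution state, and the toy stubs in `demo` are all hypothetical rather than the system's actual interfaces.

```python
def run_feedback_loop(puzzle, decompose, translate, prove, evaluate, feedback, max_iters=4):
    """Iterate S_{k+1} = F(E(T(G(D(P)), S_k)), S_k) until E accepts or the limit is hit."""
    structure = decompose(puzzle)        # D(P)
    smt_code = translate(structure)      # G(D(P))
    state = []                           # S_0: an empty list of corrections (assumption)
    for k in range(max_iters):
        result = prove(smt_code, state)  # T(G(D(P)), S_k)
        verdict = evaluate(result)       # E(...)
        state = feedback(verdict, state) # S_{k+1} = F(E(...), S_k)
        if verdict["accepted"]:
            return state, k + 1, True    # converged: E is satisfied
    return state, max_iters, False       # stopped by the retry limit


def demo():
    """Toy stubs: the 'prover' reports an error until two corrections accumulate."""
    def decompose(puzzle):
        return puzzle.split(". ")

    def translate(parts):
        return ["; clue: " + part for part in parts]

    def prove(smt_code, state):
        error = None if len(state) >= 2 else "syntax error"
        return {"error": error, "model": smt_code}

    def evaluate(result):
        return {"accepted": result["error"] is None, "note": result["error"]}

    def feedback(verdict, state):
        return state if verdict["accepted"] else state + [verdict["note"]]

    return run_feedback_loop("Clue A. Clue B", decompose, translate, prove, evaluate, feedback)


if __name__ == "__main__":
    print(demo())
```

Keeping the evaluation step as an opaque callable mirrors the assumption that it is a black box defined by the instructions given to the LLM.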

Optimization and Refinement

Each iteration through the feedback loop serves to progressively refine the solution, optimizing the representation and alignment with the puzzle’s constraints. This optimization process is critical for moving the solution towards a local optimum, where no further improvements can be detected by $\mathcal{E}$ or $\mathcal{F}$.

3 Experimental Setup

To comprehensively evaluate ZPS’s performance, we examine its effectiveness across 114 Zebra Puzzles. Our assessment emphasizes ZPS’s capability to solve the puzzles using the different agents.

3.1 Selection of Logic Puzzles

We compiled two datasets to evaluate the problem-solving capabilities of our agent-centric approach. The first dataset, sourced from GitHub (https://github.com/ross-nordstrom/LogicSolver/tree/master/data), contains 59 Zebra puzzles involving entity-attribute matching. We further curated 55 additional puzzles from different sources on the Web and manually cross-checked them to confirm that they are valid Zebra puzzles.

3.2 Agent Configuration

We experimented with three different LLMs: GPT-4, GPT-3.5, and Llama3-8b. We used Z3 as the automated theorem prover. Our total cost across all experiments was approximately 2500 USD.

Number of Retries

In our experimental setup, we initially conduct the feedback loop once. To enhance performance and address syntactical errors in the final output, we implement an additional cold-start retry mechanism if we reach the action-limit without an error-free solution. This involves restarting the workflow from scratch with an increased temperature.

Response Limit

To limit the conversation length and prevent hallucination, we define a maximum number of actions that the LLM-agent can take. All experiments performed allow the LLM to perform up to 4 actions; this limit is reset if the puzzle-solving task is retried.
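
The retry and action-limit policy described above can be sketched as a small wrapper. The limit of 4 actions comes from the text; the function names, the single retry, and the temperature step of 0.5 are assumptions made for illustration, since the exact temperature schedule is not specified here.

```python
def solve_with_retries(attempt, max_actions=4, max_retries=1,
                       base_temperature=0.0, temperature_step=0.5):
    """Run one workflow attempt; on failure, cold-start again at a higher temperature.

    `attempt(temperature, max_actions)` is a hypothetical callable that runs the
    whole workflow from scratch and returns (solution, has_error).
    """
    solution = None
    retry = 0
    for retry in range(max_retries + 1):
        # The action budget is passed fresh to every attempt, so it resets on retry.
        temperature = base_temperature + retry * temperature_step
        solution, has_error = attempt(temperature, max_actions)
        if not has_error:
            break
    return solution, retry


def fake_attempt(temperature, max_actions):
    """Toy stand-in: fails deterministically at temperature 0, succeeds otherwise."""
    has_error = temperature == 0.0
    actions_used = max_actions if has_error else 2
    return {"temperature": temperature, "actions_used": actions_used}, has_error


if __name__ == "__main__":
    print(solve_with_retries(fake_attempt))
```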

3.2.1 Grading

To assess solution accuracy, we created an autograder LLM-agent that provides a numeric grade to every solution generated by the solving agent. Each assignment is worth 1 point. In order to evaluate the reliability of this autograder, we also conducted a user study where a subset of the problems were regraded by humans and then compared to the autograder’s results.

Autograding

GPT-4o was used for autograding. It received the ground-truth answer, final SMT-LIB output, and conversation history to assess the consistency and correctness of each solution. For each problem, the model compared the logical assignments produced by the solving agent against the reference assignments, producing a final accuracy score.

To demonstrate the autograding process, consider a scenario where the output of the SMT-LIB solver is evaluated against a pre-defined answer key. The solver’s output and subsequent interpretation by the grader are detailed below.

SMT-LIB Solver Output

Below is a sample SMT solution for the logic grid puzzle given in Fig. 1.

; 1 is Brazilian, 2 is German
; 3 is American
(define-fun H1_Color () String "Blue")
(define-fun H1_N () Int 1)
(define-fun H1_Anml () String "Cats")
(define-fun H1_Sp () String "Football")
(define-fun H2_Color () String "Green")
(define-fun H2_N () Int 3)
(define-fun H2_Anml () String "Dogs")
(define-fun H2_Sp () String "Basketball")
(define-fun H3_Color () String "Red")
(define-fun H3_N () Int 2)
(define-fun H3_Anml () String "Fishes")
(define-fun H3_Sp () String "Baseball")

Ground Truth Answer

  • House 1: Blue, Brazilian, Fishes, Football

  • House 2: Green, American, Cats, Baseball

  • House 3: Red, German, Dogs, Basketball

The autograder evaluates the solution by mapping the SMT-LIB output to the expected results, using either contextual clues or an explicitly defined lookup table given in the SMT-LIB comments. It converts the function definitions into comparable assignments, as in Table 1.

Entity Assignment Result
House 1 Color: Blue
House 1 Nationality: Brazilian
House 1 Animal: Cats
House 1 Sport: Football
House 2 Color: Green
House 2 Nationality: American
House 2 Animal: Dogs
House 2 Sport: Basketball
House 3 Color: Red
House 3 Nationality: German
House 3 Animal: Fishes
House 3 Sport: Baseball
Table 1: Validation Results
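
The mapping from solver output to comparable assignments can be illustrated with a small parser. The autograder itself is an LLM agent, so this regex-based sketch is only a deterministic stand-in; the function name `parse_model` and the comment format it expects ("1 is Brazilian") are assumptions based on the sample output above.

```python
import re

SMT_OUTPUT = '''
; 1 is Brazilian, 2 is German
; 3 is American
(define-fun H1_Color () String "Blue")
(define-fun H1_N () Int 1)
(define-fun H1_Anml () String "Cats")
(define-fun H1_Sp () String "Football")
(define-fun H2_Color () String "Green")
(define-fun H2_N () Int 3)
(define-fun H2_Anml () String "Dogs")
(define-fun H2_Sp () String "Basketball")
(define-fun H3_Color () String "Red")
(define-fun H3_N () Int 2)
(define-fun H3_Anml () String "Fishes")
(define-fun H3_Sp () String "Baseball")
'''

# Matches zero-argument definitions with a String literal or an Int literal.
DEFINE_FUN = re.compile(r'\(define-fun\s+(\w+)\s+\(\)\s+(String|Int)\s+(?:"([^"]*)"|(-?\d+))\)')
# Matches lookup-table entries such as "1 is Brazilian" inside comments.
LOOKUP_ENTRY = re.compile(r'(\d+) is (\w+)')


def parse_model(text):
    """Return {symbol: value}, resolving Int codes through the comment lookup table."""
    lookup = {}
    for line in text.splitlines():
        if line.strip().startswith(";"):
            for number, name in LOOKUP_ENTRY.findall(line):
                lookup[int(number)] = name
    assignments = {}
    for symbol, sort, string_value, int_value in DEFINE_FUN.findall(text):
        if sort == "String":
            assignments[symbol] = string_value
        else:
            code = int(int_value)
            assignments[symbol] = lookup.get(code, code)  # fall back to the raw code
    return assignments


if __name__ == "__main__":
    for symbol, value in parse_model(SMT_OUTPUT).items():
        print(symbol, value)
```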

Partial Scoring (PS)

Each correct match between the SMT-LIB output and the answer key earns a point. The autograder agent also calculates the total number of assignments, which is equal to the number of points it is possible to receive. In this example, not every assignment matches the answer key, thus:

$$\text{Partial Score} = \frac{\text{Correct Matches}}{\text{Total Matches}} = \frac{8}{12} = 0.67$$

If the animals and sports had been chosen correctly, the score would be 1.
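
The partial score is a simple ratio, sketched below for two dictionaries of assignments. The function name and the two-house data are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the Fig. 1 puzzle.

```python
def partial_score(predicted, answer_key):
    """Fraction of answer-key assignments that the predicted solution matches."""
    total = len(answer_key)
    correct = sum(1 for key, value in answer_key.items() if predicted.get(key) == value)
    return correct / total


# Hypothetical two-house example: one of four assignments is wrong.
answer_key = {
    "H1_Color": "Blue", "H1_Sport": "Football",
    "H2_Color": "Green", "H2_Sport": "Baseball",
}
predicted = {
    "H1_Color": "Blue", "H1_Sport": "Football",
    "H2_Color": "Green", "H2_Sport": "Basketball",
}

score = partial_score(predicted, answer_key)
fully_correct = score == 1.0  # a puzzle counts as solved only when every assignment matches

if __name__ == "__main__":
    print(score, fully_correct)
```

Under this definition, a puzzle contributes to the count of fully correct solutions only when its partial score is exactly 1.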

Manual User Study Grading

A separate user study manually graded 50 solutions from the state-of-the-art workflow, 35 solutions from a non-optimal variant, and 20 solutions from the naive approach. Though it was impractical to have all of the thousands of solutions that the LLM-agent generated be hand-graded, this user study allows us to quantify the correctness of our results and verify that our autograder correlates well with the ground-truth grades.

The manual grading team included five undergraduate computer science students and one master’s student. We then used their manual grades to compute various statistical measures of similarity between human grading and LLM grading; these statistics are explained in the "Results" section.

A large percentage of the attempted solutions included explicit lookup tables, making these solutions significantly more time-consuming to grade (see "SMT-LIB Solver Output" and "Ground Truth Answer" above). The lookup table could appear anywhere in the generated text, which comprises multiple blocks of SMT-LIB code, errors, and intermediate SMT models. We therefore do not include them in our user study.

4 Results

This analysis is structured around four key research questions: Firstly, we examine the baseline performance of different LLMs in solving logic puzzles without solver assistance to understand their intrinsic problem-solving capabilities. Secondly, we assess the improvements in accuracy and problem-solving completeness when integrating solver feedback, evaluating how external theorem provers enhance LLM effectiveness. Thirdly, we explore the impact of using a decomposition agent, analyzing whether segmenting puzzles into simpler components before solving improves overall solution quality. Fourthly, we conduct a user study to evaluate our LLM-Grader and substantiate the validity of our results.

4.1 ZPS Performance over Baselines

To establish a baseline, we first evaluate the performance of LLMs without the assistance of a solver by asking the LLM to solve the logic grid puzzle. This baseline configuration yields mediocre puzzle-solving accuracy, as detailed in Table 2. We report both the average partial score, given by the "Avg. PS" column, and the number of puzzles solved fully correctly, given by the "#Solved" column. For instance, GPT-4 under the baseline achieves an average partial score of 52.4% and solves 27/114 logic grid puzzles completely correctly.

The effectiveness of the LLM-agent workflow increases markedly when solver feedback is incorporated. As shown in Table 3, the integration of theorem prover feedback, without retries and under a deterministic generation setting (temperature = 0), increases GPT-4’s average partial score from the baseline of 0.524 to 0.687 ($\Delta = 31.1\%$). The inclusion of a decomposition agent further improves this to 0.700 ($\Delta = 33.58\%$). In terms of the total number of puzzles that are solved completely, GPT-4 with the solver solves up to 133.33% more problems than in the baseline setting. GPT-3.5 shows a similar positive trend.

Llama3’s improvement is more subtle; we believe this is because its smaller number of parameters limits its ability to generate syntactically correct SMT-LIB code. This theory is supported by the fact that in every Llama3 experiment, no fewer than 50 final solutions contained errors, whereas in every GPT-4 or GPT-3.5 experiment, the number was no more than 42. Nonetheless, Llama3 can also improve the total number of correct solutions by 50% over baseline.
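
The relative improvements quoted in this subsection follow from the usual percentage-change formula. The short sketch below recomputes a few of them from the values reported in Tables 2 and 3; the function name is illustrative.

```python
def relative_improvement(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# GPT-4 average partial score: baseline 0.524 vs. solver feedback 0.687.
score_gain = round(relative_improvement(0.524, 0.687), 1)

# Fully solved puzzles out of 114, compared with each model's baseline count.
gpt4_solver_gain = round(relative_improvement(27, 59), 1)
gpt4_decomposition_gain = round(relative_improvement(27, 63), 1)
llama3_solver_gain = round(relative_improvement(14, 21), 1)

if __name__ == "__main__":
    print(score_gain, gpt4_solver_gain, gpt4_decomposition_gain, llama3_solver_gain)
```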

4.2 ZPS Performance under Different Settings

For the variable temperature experiments, we set the model temperature to zero and increased it if the solution contained errors. While this approach provides the flexibility to bypass a solution if the deterministic solution is erroneous, it risks generating less stable solutions that may inadvertently replace syntactically incorrect yet valid solutions with syntactically correct but logically flawed ones. This phenomenon is particularly pronounced in models with fewer parameters, where the performance tends to decline with the introduction of retries. For example, under variable temperature conditions with retries, GPT-4 maintains a high accuracy rate of 76.1%, while Llama3’s accuracy degrades to 48.4%.

The addition of a decomposition agent to the SMT-integrated LLM-agent yielded mixed results. For both GPT-4 and GPT-3.5, the average partial score and number solved fully correctly slightly improved, in all cases by less than 5.5%. However, Llama3’s average partial score declined by less than 5% and it was able to solve 3 fewer problems than with just SMT integration. Because of the relatively small differences in all cases, more experimentation is needed to determine when decomposition increases performance.

| Model     | T | D | Avg. PS | #Solved    |
|-----------|---|---|---------|------------|
| Llama3-8b | 0 |   | 0.47    | 14 (12.3%) |
| GPT-3.5   | 0 |   | 0.471   | 17 (15.0%) |
| GPT-4     | 0 |   | 0.524   | 27 (23.7%) |

Table 2: Baseline Performance of LLMs Without Solver Integration. The "D" column indicates if a decomposition agent was present in the workflow. The "T" column indicates Temperature.

| Model     | T    | D   | Avg. PS | #Solved    | Δ#Solved |
|-----------|------|-----|---------|------------|----------|
| Llama3-8b | 0    | No  | 0.496   | 21 (18.4%) | 50.0%    |
| GPT-3.5   | 0    | No  | 0.493   | 22 (19.3%) | 29.4%    |
| GPT-4     | 0    | No  | 0.687   | 59 (51.8%) | 118.5%   |
| Llama3-8b | Var. |     | 0.436   | 15 (13.2%) | 7.1%     |
| GPT-3.5   | Var. |     | 0.484   | 24 (21.0%) | 41.2%    |
| GPT-4     | Var. |     | 0.761   | 72 (63.2%) | 166.7%   |
| Llama3-8b | 0    | Yes | 0.468   | 18 (15.8%) | 28.6%    |
| GPT-3.5   | 0    | Yes | 0.520   | 24 (21.0%) | 41.2%    |
| GPT-4     | 0    | Yes | 0.700   | 63 (55.3%) | 133.3%   |

Table 3: Enhancements from Solver Integration with Percentage Improvement Over Baseline. The "D" column indicates if a decomposition agent was present in the workflow. The "T" column indicates Temperature.

4.3 Manual Analysis of Grading

Based on our user study, our LLM-based grading system demonstrates high accuracy across a variety of models and settings. The system maintains consistent scoring accuracy, with exact match rates exceeding 78% across all tested scenarios.

To evaluate the LLM-grader, we employed various statistical measures: (i) Avg. Abs. Diff.: the average magnitude of the difference between the partial score given by the LLM-grader and the human evaluator. (ii) Avg. Rel. Diff.: the expected percent difference between the partial score given by the LLM-grader and the human evaluator. (iii) LLM Overestimated: the percentage of problems for which the LLM-grader gave a higher partial score than the human evaluator. (iv) LLM Underestimated: the percentage of problems for which the LLM-grader gave a lower partial score than the human evaluator. (v) Exact Match: the percentage of problems for which the LLM-grader and the human evaluator gave exactly the same partial score. (vi) Joint Full Credit: the count of problems for which both the user and the LLM assigned full credit, normalized over the total number of problems that either party marked for full credit. This metric helps in understanding the extent of agreement in the grading of solutions between the human and machine evaluators.
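
As an illustration of how these agreement measures can be computed from paired scores, the sketch below covers measures (i) and (iii) through (vi). The score lists are hypothetical toy data, not results from the user study, and the function name and dictionary keys are chosen for illustration.

```python
def grader_agreement(llm_scores, human_scores):
    """Agreement statistics between LLM-assigned and human-assigned partial scores."""
    pairs = list(zip(llm_scores, human_scores))
    n = len(pairs)
    both_full = sum(1 for llm, human in pairs if llm == 1.0 and human == 1.0)
    either_full = sum(1 for llm, human in pairs if llm == 1.0 or human == 1.0)
    return {
        "exact_match_pct": 100.0 * sum(1 for llm, human in pairs if llm == human) / n,
        "avg_abs_diff": sum(abs(llm - human) for llm, human in pairs) / n,
        "overestimated_pct": 100.0 * sum(1 for llm, human in pairs if llm > human) / n,
        "underestimated_pct": 100.0 * sum(1 for llm, human in pairs if llm < human) / n,
        # Problems where both gave full credit, over problems where either did.
        "joint_full_credit_pct": (100.0 * both_full / either_full) if either_full else None,
    }


# Hypothetical paired partial scores for four graded solutions.
llm_scores = [1.0, 0.5, 0.75, 1.0]
human_scores = [1.0, 0.5, 0.5, 0.75]
stats = grader_agreement(llm_scores, human_scores)

if __name__ == "__main__":
    for name, value in stats.items():
        print(name, value)
```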

The metric of "Joint Full Credit," which consistently registers above 85%, serves as a robust indicator of the LLM-grader’s capability to accurately assess fully correct solutions as demonstrated by the level of agreement between the LLM and a human grader. Additionally, the analysis indicates a propensity for the grader to overestimate the score of the LLM without SMT integration, whereas the integration of SMT tends to result in slight underestimations by the grader. This observation suggests that the integration of SMT and a feedback based loop may contribute more significantly to performance improvements than the raw grading differentials indicate.

| Statistic              | GPT-4 SMT+D (50) | GPT-3.5 Naive (20) | GPT-4 SMT (35) |
|------------------------|------------------|--------------------|----------------|
| Exact Match (%)        | 78.26            | 78.94              | 82.35          |
| Avg. Abs. Diff         | 0.056            | 0.040              | 0.117          |
| Avg. Rel. Diff (%)     | -3.547           | +13.8              | +2.916         |
| LLM Overestimated (%)  | 13.04            | 21.05              | 11.76          |
| LLM Underestimated (%) | 8.70             | 0.00               | 5.88           |
| Joint Full Credit (%)  | 89.19            | 100                | 86.96          |
| Spearman Correlation   | 0.73             | 0.948              | 0.70           |

Table 4: User Study Statistics comparing different experimental setups. The first column is GPT-4 with SMT integration and the decomposition (D) agent over 50 problems. The second column is GPT-3.5 without SMT integration over 20 problems. The third column is GPT-4 with just SMT integration over 35 problems. All problems were graded by both the manual grader and the LLM.

5 Background and Related Work

The concept of agent-centric LLM agents, as discussed in recent literature, revolves around creating systems (usually backed by LLMs) that can act independently in diverse environments, both physical and virtual (Wang et al., 2024). This framework shifts the focus from passive systems to proactive entities capable of dynamic interaction and problem-solving. In these models, agents are designed to perceive and react to multi-modal data, integrating visual, auditory, and textual input to generate appropriate actions in real time. The most tangible benefit of this framework is feedback, which can take the form of a physical environment, error correction, or manual input (Durante et al., 2024).

Significant work has been done to apply this framework to solving text-based puzzles. Zhou et al. (2023) used a process called Language Agent Tree Search (LATS), which integrates planning, reasoning, and acting within LLMs to decompose and solve a high-level reasoning task. Gao et al. (2023) showed that generating intermediate representations as Python programs allowed small LLMs to outperform much larger ones. Logic-LM and SatLM both used LLMs to generate formal representations of general natural language problems and used off-the-shelf theorem provers to generate answers (Pan et al., 2023; Ye et al., 2023). While none of these approaches focus on Zebra puzzles, they each show that LLMs perform better when used as agents in a formally grounded system.

Research has shown that natural language cannot be mapped one-to-one with a formal space due to inherent ambiguities (Osama et al., 2020). For our approach, it was thus vital to create an agent that can take into account context and background knowledge to figure out the correct translation into a formal space. Even if the clues were perfectly translated as they are presented, a formal solver will not be able to generate a fully correct solution unless the problem translator also encodes this general context. Our approach is different from prior agent approaches in that we use a structured symbolic space (SMT-LIB) but use the syntactic and semantic feedback from an automated theorem prover for analysis in an LLM agent. We also provide a conceptual framework to understand LLM interaction with the automated theorem prover and its generated text as an agent.

6 Conclusion

This research shows the effectiveness of a multi-agent LLM and SMT framework that bolsters the performance of large language models (LLMs) in solving logic puzzles and other natural language tasks. Our work demonstrates the importance of integrating LLMs and SMT solvers in the task, boosting performance over an LLM alone. We also show that the inter-agent critique mechanism plays a crucial role. Through dialogues, agents critique and refine each other’s contributions, which leads to more accurate and consistent results. The development of an autograder, with a verified correlation to human evaluation, played a role in the feedback mechanism by indicating when a solution was not judged logically correct and also enabled iterative development of our approach. Our findings suggest that structured planning and agent-feedback greatly enhance LLMs’ capability to solve logical problems.

Looking ahead, further research could optimize retry mechanisms for discovering more effective solutions, informed by approaches like Program-of-Thoughts and Graph-of-Thoughts-Rationale (Chen et al., 2023; Besta et al., 2024). Additionally, increasing the agent-environment size and the feedback loop length would enhance the solving agent’s self-correction capabilities by expanding the actual and effective context limits for remembering past strategies.

7 Limitations

This study, while advancing our understanding of LLMs in solving logic puzzles, has several limitations that warrant further investigation. Firstly, the experiments were confined to only three models: GPT-4, GPT-3.5, and Llama3-8b. Investigation of generalizability across different LLMs is warranted, especially because our performance gains occurred mainly in the GPT family.

Secondly, our approach relied on specific prompt constructions for both the grader and the solver agents. There exists a possibility that alternative prompting strategies could yield more accurate or efficient problem-solving and grading results. Further research is needed to explore and optimize these prompts to fully leverage the potential of LLMs in this domain.

Additionally, our user study inherently carries some uncertainty regarding its correlation to actual problem-solving performance. Solutions involving complex lookup tables were excluded from the user study due to how time consuming they were to grade, which might affect the study’s comprehensiveness and the general applicability of our findings.

Lastly, the bank of logic grid puzzles used in this study was somewhat limited in both size (114 problems) and range of difficulty. The majority of puzzles used were subjectively rated as medium difficulty. Extending this research to a larger and more varied dataset would help verify the usefulness of our findings.

8 Ethics Statement

Use of Generative AI. Generative models carry ethical risks, including the potential to produce harmful content or content that closely mirrors pre-training data. However, we are using the generative models to solve puzzles rather than showing their direct output, minimizing this risk.

Compute. Employing deep learning models is computationally intensive and can have environmental implications. However, as no models were trained as part of this research, the computational impact remains relatively low.

Human Evaluators. We used only five human evaluators, all undergraduate or master’s students in our lab environment, who were given full disclosure about the nature of the study and the fact that participation was unpaid. No ethical violations were committed in this setting.

Acknowledgments

References

  • Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Trottier, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. arXiv.
  • Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Preprint, arXiv:2211.12588.
  • Claes et al. (2019) Jens Claes, Bart Bogaerts, Rocsildes Canoy, and Tias Guns. 2019. User-oriented solving and explaining of natural language logic grid puzzles. In The Third Workshop on Progress Towards the Holy Grail, volume 14.
  • Durante et al. (2024) Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, and Jianfeng Gao. 2024. Agent ai: Surveying the horizons of multimodal interaction. arXiv.
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR.
  • Groza (2023) Adrian Groza. 2023. Measuring reasoning capabilities of chatgpt. arXiv.
  • Milicevic et al. (2012) Aleksandar Milicevic, Joseph P Near, and Rishabh Singh. 2012. Puzzler: An automated logic puzzle solver. Massachusetts Institute of Technology (MIT).
  • Mukherjee and Garain (2009) Anirban Mukherjee and Utpal Garain. 2009. A review of algorithms for solving mathematical word problems in natural language texts. Artificial Intelligence Review, 32(4):285–298.
  • Osama et al. (2020) Mohamed Osama, Aya Zaki-Ismail, Mohamed Abdelrazek, John Grundy, and Amani Ibrahim. 2020. Score-based automatic detection and resolution of syntactic ambiguity in natural language requirements. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 651–661.
  • Pan et al. (2023) Shirui Pan, Yizhen Zheng, and Yixin Liu. 2023. Integrating graphs with large language models: Methods and prospects. arXiv.
  • Wang et al. (2024) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26.
  • Ye et al. (2023) Xi Ye, Qiaochu Chen, Isil Dillig, and Greg Durrett. 2023. Satlm: Satisfiability-aided language models using declarative prompting. In Advances in Neural Information Processing Systems, volume 36, pages 45548–45580. Curran Associates, Inc.
  • Zhou et al. (2023) Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. arXiv.

Appendix A Appendix

A.1 Example Puzzle Workflow

We begin by inputting the puzzle and its entities into the Solver LLM-Agent (in this case, GPT-4).

first, second, third, fourth
Bridget, Kermit, Ophelia, Stretch
#105, #118, #126, #128
The ostrich that finished second was #128.
First place was either #118 or #126.
Third place was either #126 or Bridget.
Ophelia finished second.
Stretch finished 2 places after Kermit.

The solver agent responds by attempting to translate the puzzle and its clues into SMT-LIB:

(set-logic QF_LIA)

; Place of each ostrich
(declare-const Bridget_Place Int)
(declare-const Kermit_Place Int)
(declare-const Ophelia_Place Int)
(declare-const Stretch_Place Int)

; Number of each ostrich
(declare-const Bridget_Number Int)
(declare-const Kermit_Number Int)
(declare-const Ophelia_Number Int)
(declare-const Stretch_Number Int)

(assert (and
  (or (= Bridget_Place 1) (= Bridget_Place 2)
      (= Bridget_Place 3) (= Bridget_Place 4))
  (or (= Kermit_Place 1) (= Kermit_Place 2)
      (= Kermit_Place 3) (= Kermit_Place 4))
  (or (= Ophelia_Place 1) (= Ophelia_Place 2)
      (= Ophelia_Place 3) (= Ophelia_Place 4))
  (or (= Stretch_Place 1) (= Stretch_Place 2)
      (= Stretch_Place 3) (= Stretch_Place 4))))

(assert (and
  (or (= Bridget_Number 105) (= Bridget_Number 118)
      (= Bridget_Number 126) (= Bridget_Number 128))
  (or (= Kermit_Number 105) (= Kermit_Number 118)
      (= Kermit_Number 126) (= Kermit_Number 128))
  (or (= Ophelia_Number 105) (= Ophelia_Number 118)
      (= Ophelia_Number 126) (= Ophelia_Number 128))
  (or (= Stretch_Number 105) (= Stretch_Number 118)
      (= Stretch_Number 126) (= Stretch_Number 128))))

; The ostrich that finished second was #128.
(assert (or (and (= Ophelia_Place 2) (= Ophelia_Number 128))
            (and (= Bridget_Place 2)
                 (= Bridget_Number 128))
            (and (= Kermit_Place 2)
                 (= Kermit_Number 128))
            (and (= Stretch_Place 2)
                 (= Stretch_Number 128))))

; The ostrich that finished first was either #118 or #126.
(assert (or (and (= Ophelia_Place 1)
                 (or (= Ophelia_Number 118) (= Ophelia_Number 126)))
            (and (= Bridget_Place 1)
                 (or (= Bridget_Number 118)
                     (= Bridget_Number 126)))
            (and (= Kermit_Place 1)
                 (or (= Kermit_Number 118)
                     (= Kermit_Number 126)))
            (and (= Stretch_Place 1)
                 (or (= Stretch_Number 118)
                     (= Stretch_Number 126)))))

; The runner that finished third was
; either #126 or Bridget.
(assert (or (and (= Ophelia_Place 3) (= Ophelia_Number 126))
            (and (= Bridget_Place 3)
                 (or (= Bridget_Number 126)
                     (= Bridget_Number 105)
                     (= Bridget_Number 118)
                     (= Bridget_Number 128)))
            (and (= Kermit_Place 3)
                 (= Kermit_Number 126))
            (and (= Stretch_Place 3)
                 (= Stretch_Number 126))))

; Ophelia finished second.
(assert (= Ophelia_Place 2))

; Stretch finished 2 places after Kermit.
(assert (= Stretch_Place (+ Kermit_Place 2)))

; Distinct places
(assert (distinct Bridget_Place
  Kermit_Place Ophelia_Place Stretch_Place))

; Distinct numbers
(assert (distinct Bridget_Number
  Kermit_Number Ophelia_Number Stretch_Number))

This SMT-LIB code is then fed into z3, which produces the following output:

(error "line 15 column 0: invalid command, '(' expected")
(error "line 15 column 2: unexpected character")
(error "line 15 column 77: unexpected character")
(error "line 15 column 88: unexpected character")
(error "line 15 column 117: unexpected character")
(error "line 15 column 136: unexpected character")
(error "line 15 column 142: unexpected character")
(error "line 31 column 0: invalid command, '(' expected")
(error "line 31 column 7: unexpected character")
(error "line 57 column 0: invalid command, '(' expected")
(error "line 57 column 7: unexpected character")
(error "line 65 column 0: invalid command, '(' expected")
sat
(
  (define-fun Ophelia_Number () Int
    128)
  (define-fun Stretch_Place () Int
    3)
  (define-fun Stretch_Number () Int
    126)
  (define-fun Bridget_Place () Int
    4)
  (define-fun Kermit_Number () Int
    118)
  (define-fun Bridget_Number () Int
    105)
  (define-fun Ophelia_Place () Int
    2)
  (define-fun Kermit_Place () Int
    1)
)
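The z3 output above mixes parse errors with a satisfying model, so a driver has to separate the two before deciding whether another solver round is needed. The following is a minimal sketch of such a triage step, not the authors' implementation: the function name `triage_z3_output`, the `needs_retry` flag, and the rule "retry if any error is present or the status is not sat" are assumptions made for illustration. It uses only the Python standard library and operates on the textual output of z3.

```python
import re

# Matches z3 error lines such as (error "line 15 column 0: ...").
# The pattern tolerates doubled quotes and a missing space before "column",
# both of which can appear in transcribed logs.
ERROR_RE = re.compile(r'\(error\s+"+line (\d+)\s*column (\d+): ([^"]*)"+\)')

# Matches a standalone status token, but not e.g. "check-sat".
STATUS_RE = re.compile(r'(?<![\w-])(unsat|unknown|sat)(?![\w-])')


def triage_z3_output(text):
    """Split raw z3 output into parse errors and a sat/unsat/unknown status.

    Returns a dict with:
      errors      -- list of (line, column, message) tuples
      status      -- "sat", "unsat", "unknown", or None if absent
      needs_retry -- hypothetical flag: True if errors exist or status != "sat"
    """
    errors = [
        (int(line), int(col), msg.strip())
        for line, col, msg in ERROR_RE.findall(text)
    ]
    # Remove the error records so their messages cannot be mistaken for a status.
    remainder = ERROR_RE.sub(" ", text)
    match = STATUS_RE.search(remainder)
    status = match.group(1) if match else None
    return {
        "errors": errors,
        "status": status,
        "needs_retry": bool(errors) or status != "sat",
    }
```

Under this assumed rule, the output shown above would be flagged for another round even though it contains a model, because the reported errors mean the solver did not read the whole input.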

This output is fed back into the Solver LLM-Agent, which then produces the following modified SMT-LIB code:

(set-logic QF_LIA)

; Place of each ostrich
(declare-const Bridget_Place Int)
(declare-const Kermit_Place Int)
(declare-const Ophelia_Place Int)
(declare-const Stretch_Place Int)

; Number of each ostrich
(declare-const Bridget_Number Int)
(declare-const Kermit_Number Int)
(declare-const Ophelia_Number Int)
(declare-const Stretch_Number Int)

; The places are in the set {1, 2, 3, 4}
(assert (and
  (or (= Bridget_Place 1) (= Bridget_Place 2)
      (= Bridget_Place 3) (= Bridget_Place 4))
  (or (= Kermit_Place 1) (= Kermit_Place 2)
      (= Kermit_Place 3) (= Kermit_Place 4))
  (or (= Ophelia_Place 1) (= Ophelia_Place 2)
      (= Ophelia_Place 3) (= Ophelia_Place 4))
  (or (= Stretch_Place 1) (= Stretch_Place 2)
      (= Stretch_Place 3) (= Stretch_Place 4))))

; Numbers are in the set {105, 118, 126, 128}
(assert (and
  (or (= Bridget_Number 105) (= Bridget_Number 118)
      (= Bridget_Number 126) (= Bridget_Number 128))
  (or (= Kermit_Number 105) (= Kermit_Number 118)
      (= Kermit_Number 126) (= Kermit_Number 128))
  (or (= Ophelia_Number 105) (= Ophelia_Number 118)
      (= Ophelia_Number 126) (= Ophelia_Number 128))
  (or (= Stretch_Number 105) (= Stretch_Number 118)
      (= Stretch_Number 126) (= Stretch_Number 128))))

; The ostrich that finished second was #128.
(assert (or (and (= Ophelia_Place 2) (= Ophelia_Number 128))
            (and (= Bridget_Place 2)
                 (= Bridget_Number 128))
            (and (= Kermit_Place 2)
                 (= Kermit_Number 128))
            (and (= Stretch_Place 2)
                 (= Stretch_Number 128))))

; The ostrich that finished first was
; either #118 or #126.
(assert (or (and (= Ophelia_Place 1)
                 (or (= Ophelia_Number 118) (= Ophelia_Number 126)))
            (and (= Bridget_Place 1)
                 (or (= Bridget_Number 118)
                     (= Bridget_Number 126)))
            (and (= Kermit_Place 1)
                 (or (= Kermit_Number 118)
                     (= Kermit_Number 126)))
            (and (= Stretch_Place 1)
                 (or (= Stretch_Number 118)
                     (= Stretch_Number 126)))))

; The runner that finished third
; was either #126 or Bridget.
(assert (or (and (= Ophelia_Place 3) (= Ophelia_Number 126))
            (and (= Bridget_Place 3)
                 (or (= Bridget_Number 126)
                     (= Bridget_Number 105)
                     (= Bridget_Number 118)
                     (= Bridget_Number 128)))
            (and (= Kermit_Place 3)
                 (= Kermit_Number 126))
            (and (= Stretch_Place 3)
                 (= Stretch_Number 126))))

; Ophelia finished second.
(assert (= Ophelia_Place 2))

; Stretch finished 2 places after Kermit.
(assert (= Stretch_Place (+ Kermit_Place 2)))

; Distinct places
(assert (distinct
  Bridget_Place Kermit_Place
  Ophelia_Place Stretch_Place))

; Distinct numbers
(assert (distinct
  Bridget_Number Kermit_Number
  Ophelia_Number Stretch_Number))

(check-sat)
(get-model)

Z3 produces the following output, which is then determined to be both error-free and consistent:


sat
(
  (define-fun Ophelia_Number () Int
    128)
  (define-fun Stretch_Place () Int
    3)
  (define-fun Stretch_Number () Int
    126)
  (define-fun Bridget_Place () Int
    4)
  (define-fun Kermit_Number () Int
    118)
  (define-fun Bridget_Number () Int
    105)
  (define-fun Ophelia_Place () Int
    2)
  (define-fun Kermit_Place () Int
    1)
)

Ground Truth Answer
  • Kermit: First, #118

  • Ophelia: Second, #128

  • Stretch: Third, #126

  • Bridget: Fourth, #105
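As a sanity check that does not depend on z3, the five clues can also be checked by brute force over all assignments of places and numbers. The sketch below is an illustration only and is not part of ZPS; the function name `solve` is hypothetical. The third clue is read inclusively ("#126 or Bridget, or both") to mirror the solver agent's SMT-LIB encoding shown earlier in this appendix.

```python
from itertools import permutations

OSTRICHES = ("Bridget", "Kermit", "Ophelia", "Stretch")
NUMBERS = (105, 118, 126, 128)


def solve():
    """Enumerate every (place, number) assignment that satisfies all five clues."""
    solutions = []
    for places in permutations((1, 2, 3, 4)):
        place = dict(zip(OSTRICHES, places))
        by_place = {p: name for name, p in place.items()}
        for nums in permutations(NUMBERS):
            number = dict(zip(OSTRICHES, nums))
            first, second, third = by_place[1], by_place[2], by_place[3]
            # Clue 1: the ostrich that finished second was #128.
            if number[second] != 128:
                continue
            # Clue 2: first place was either #118 or #126.
            if number[first] not in (118, 126):
                continue
            # Clue 3 (inclusive reading): third place was #126 or Bridget.
            if not (number[third] == 126 or third == "Bridget"):
                continue
            # Clue 4: Ophelia finished second.
            if place["Ophelia"] != 2:
                continue
            # Clue 5: Stretch finished 2 places after Kermit.
            if place["Stretch"] != place["Kermit"] + 2:
                continue
            solutions.append({o: (place[o], number[o]) for o in OSTRICHES})
    return solutions


if __name__ == "__main__":
    for solution in solve():
        print(solution)
```

The search space is only 4! × 4! = 576 assignments, so exhaustive enumeration is practical here; the SMT-based approach matters for larger puzzles, where enumeration grows factorially.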

The autograder evaluates the solution by mapping the SMT-LIB output to the expected results, using either contextual clues or an explicitly defined lookup table (which would be defined in the SMT-LIB comments). It converts the function definitions into assignments that can be compared against the ground truth, as in Table 5. In this case, the solution receives full credit.

Table 5: Validation Results
Entity     Result
Kermit     Place: First (Correct)
Kermit     Number: 118 (Correct)
Ophelia    Place: Second (Correct)
Ophelia    Number: 128 (Correct)
Stretch    Place: Third (Correct)
Stretch    Number: 126 (Correct)
Bridget    Place: Fourth (Correct)
Bridget    Number: 105 (Correct)
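The mapping summarized in Table 5 can be illustrated with a small deterministic sketch. This is not the autograder described in the paper, which relies on LLM prompting; it only shows, under simplifying assumptions, how `define-fun` entries could be turned into comparable assignments. The names `parse_model`, `grade`, and `PLACE_NAMES` are hypothetical, the variable naming scheme `<Entity>_Place` / `<Entity>_Number` is taken from the example above, and only non-negative integer values are handled.

```python
import re

# Matches entries such as "(define-fun Kermit_Place () Int 1)",
# allowing the value to sit on the following line as z3 prints it.
DEFINE_RE = re.compile(r'\(define-fun\s+(\w+)\s+\(\)\s+Int\s+(\d+)\)')

# Hypothetical lookup table from integer places to the puzzle's labels.
PLACE_NAMES = {1: "First", 2: "Second", 3: "Third", 4: "Fourth"}


def parse_model(text):
    """Turn z3 model text into a {variable_name: int_value} dict."""
    return {name: int(value) for name, value in DEFINE_RE.findall(text)}


def grade(model_text, ground_truth):
    """Compare a z3 model against ground truth, one row per assignment.

    ground_truth maps entity -> (place_label, number).
    Returns (rows, score), where each row is
    (entity, attribute, value_found, is_correct) and score is the
    fraction of correct rows.
    """
    model = parse_model(model_text)
    rows = []
    for entity, (place, number) in ground_truth.items():
        found_place = PLACE_NAMES.get(model.get(f"{entity}_Place"))
        found_number = model.get(f"{entity}_Number")
        rows.append((entity, "Place", found_place, found_place == place))
        rows.append((entity, "Number", found_number, found_number == number))
    score = sum(1 for row in rows if row[3]) / len(rows)
    return rows, score
```

Applied to the final z3 output and the ground truth above, every row is marked correct, matching the full credit reported in Table 5.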

A.2 User Study Instructions

The following instructions were presented to our manual graders before they began grading. The full UI can be found at https://anonymous.4open.science/r/anon_emnlp-1AD0 by running the "autograder_flask.py" file.

[Uncaptioned image]