Show, Don’t Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Gonçalo Hora de Carvalho
University of Groningen
[email protected]
Oscar Knap
Robert Pollice*
University of Groningen
[email protected]
Abstract

The evaluation of Large Language Models (LLMs) often focuses on linguistic tasks, yet these may not fully capture general reasoning capabilities. We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded via ASCII, to assess strategic thinking and decision-making. To evaluate the models’ ability to generalize beyond their training data, we introduce two additional games. The first game, LEGO Connect Language (LCL), tests the models’ capacity to understand spatial logic and follow assembly instructions. The second game, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don’t tell" strategy uses games to potentially reveal cognitive capabilities rather than simply querying the models. Our results indicate that, despite their proficiency on standard benchmarks, GPT-3.5 and GPT-4’s abilities to play and reason about fully observable games without pre-training are mediocre across temperature settings. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models struggle with the assembly tasks presented in the LCL game. These results suggest that while LLMs like the GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is limited in cognitive flexibility and generalization. Importantly, this reveals a blind spot in current LLM benchmarks that we highlight with our gameplay benchmark suite ChildPlay (GitHub Repository).
Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.

1 Introduction

Typically, Large Language Models (LLMs) are transformer-based - they process input text and generate output text in a coherent and contextually appropriate manner [8]. Modern versions use a self-attention mechanism to weigh the importance of different words in a sentence relative to each other [53, 8]. Input text is tokenized, converted into vectors using embeddings, and processed through transformer layers that calculate attention scores to dictate focus on relevant tokens [53, 8, 16]. The model then selects the next token based on learned distributions [53, 8, 16]. With their large parameter count, from Alpaca with 7 billion parameters [46], to LLaMA with 65 billion [49] or PaLM with 540 billion parameters [13], these neural networks have learned to model complex linguistic abstractions, capturing patterns in syntax, semantics, pragmatics, and elements of both style and tone [8, 9, 37].

Current Benchmarks  Benchmarks consist of tests or tasks along with their associated metrics, used to comparatively evaluate a system. They help identify the relative state of the art by comparing systems based on specific measurements. Datasets like SQuAD, GLUE, BIG-bench, SuperGLUE, ANLI, TruthfulQA, HellaSwag, and the lm-evaluation-harness offer various test types, including multiple-choice questions, reading comprehension tasks, natural language understanding, common sense reasoning, factual knowledge, and dialogue completion tasks [38, 54, 2, 54, 34, 28, 59, 18].

Recent studies have evaluated LLMs’ reasoning in non-linguistic tasks too. Liga and Pasetto used Tic-Tac-Toe in ASCII form, pitting LLMs against the minimax algorithm to explore emergent features, suggested to be akin to consciousness [27]. Although LLMs performed well, they often failed to win against the minimax algorithm, resulting in draws and losses [27]. Topsakal and Harper [48] found GPT-4 secured more wins than GPT-3.5 but did not consistently play optimally.

Some studies, such as those by Lappin et al. and Zečević et al., caution against overestimating LLMs’ capabilities, stating that LLMs may excel at language tasks but seem to struggle with reasoning and causal understanding [26, 60]. Bender et al. and Schaeffer et al. further critique the transparency of evaluation metrics, arguing that some "emergent abilities" may be artifacts of evaluation rather than genuine model behavior [3, 39].

Another issue in evaluating LLMs is that of training-test set cross-contamination [8]. Massive training datasets, often untraceable, make replication studies difficult and risk rendering benchmarks meaningless due to contamination [8, 17]. N-Gram Overlap is one method to detect contamination, but it has limitations, such as failing to account for context and synonymous text [6]. Furthermore, the arbitrary 200-character window classically used during GPT-3.5 training may not accurately reflect the influence of surrounding text [8].
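To make the limitation of N-gram overlap concrete, the following is a minimal sketch of an overlap score, not the procedure used in [8]. The function names, the whitespace tokenization, and the choice of n = 3 are assumptions made for illustration only.

```python
def ngrams(text, n):
    """Set of word-level n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_item, training_text, n=3):
    """Fraction of the item's n-grams that also occur in the training text."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(training_text, n)) / len(item)


train = "the quick brown fox jumps over the lazy dog"
verbatim = ngram_overlap("the quick brown fox jumps", train)
paraphrase = ngram_overlap("the fast brown fox leaps", train)
print(verbatim, paraphrase)
```

In this toy case the verbatim copy is fully flagged, while the paraphrase, which conveys the same content with synonyms, receives a score of zero. This is the kind of blind spot described above.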

This critique of the evaluation metrics used in assessing LLMs invites a deeper exploration of general intelligence - specifically how it can be reliably measured and observed in AI through rigorous and realistic tests that extend beyond linguistic prowess to include broader cognitive functions. Typically, when defining general intelligence (GI), the "g factor" is used as a measure of the ability to reason, plan, solve problems, think abstractly, and learn quickly across a wide range of domains [57, 5, 55, 11, 10]. GI then involves higher-order cognitive processes that go beyond specific skills or knowledge domains [19, 23].

We argue that there is a need for nuance in current debates surrounding AGI (Artificial General Intelligence) and a pragmatic perspective on understanding LLMs’ capabilities. In order to approximate some measurement of GI in an AI system, it is important that we build benchmarks that allow measurements that can truly gauge generalization and reasoning in a human-like manner, rather than relying solely on pattern matching and statistical correlations [52].

This Work  We introduce ChildPlay, a light and scalable suite of non-language-based games like Tic-Tac-Toe, Connect-Four, Battleship, LEGO Connect Language, and Shapes, to assess LLMs’ reasoning, strategy, symbolic reasoning, and pattern recognition beyond traditional linguistic tasks. The full suite is cheap to run, taking only a few hours to complete across all tasks for a given model. Furthermore, we have proposed this benchmark suite for inclusion in the BIG-bench initiative [2], which aims to standardize challenging benchmarks for LLMs.

We chose games because they provide structured environments with clear success criteria, suitable for evaluating strategic thinking and decision-making in dynamic, adversarial scenarios [43, 27, 48]. We encode these games using ASCII representations to minimize dataset contamination issues prevalent in testing generalized intelligence beyond the training domain [8, 27].

While our benchmarks attempt to probe the model’s understanding indirectly through win ratios, gameplay is inherently limited by the model’s reliance on probabilistic patterns rather than genuine rule-based reasoning, something that is contentious in the field at the moment and that we attempt to probe. The game rules are explicitly given to the models in the prompts, which we believe is a critical component of evaluating how well LLMs can generalize from given instructions. While this does not fully guarantee that the models are learning and applying these rules as a human would, it still probes generalization ability.

2 Experiments

General Methodology  Specific tasks in the BIG-bench benchmark [2] are categorized as zero-shot, one-shot, or multi-shot [8]. Our tasks, available through the ChildPlay GitHub Repository, fit the zero-shot category, where models receive only a brief explanation at inference time with no examples. To test LLMs’ reasoning beyond their training data, we focused on spatial reasoning with ASCII sequences. An agent capable of true abstraction should encode and interpret these sequences when the rules are explained.

We developed tasks including LEGO assembly, ASCII games (Tic-Tac-Toe, Connect-Four, Battleship), and identifying shapes represented as 1s in 15-sided grids of 0s. The models tested were gpt-3.5-turbo-1106 (GPT-3.5) and gpt-4-1106-preview (GPT-4), across temperatures t = 0, 0.5, 1, and 1.5 (refer to Appendix G for a preliminary analysis on models other than GPT-3.5 and GPT-4). Temperature is a hyperparameter that controls the randomness of a model’s output in text generation. Lower temperatures make the model more deterministic, favoring the most probable tokens. Importantly, when queried, both GPT-3.5 and GPT-4 were able to explain the tasks and generate valid board states, validating our benchmarks. We hypothesize that models capable of reasoning should play optimally, given their ability to explain optimal strategies when prompted (see Appendix B.5). In general, these experiments took from a few minutes to a few hours to run. Error bars are based on the binomial distribution and computed using the standard deviation.
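One common reading of binomial error bars is the standard deviation of a sample proportion, sqrt(p(1 - p)/n). The sketch below illustrates that reading. The function name is hypothetical, and the figure of 100 games per condition is an assumption made for the example rather than a statement of the exact computation used for the plots.

```python
import math


def binomial_error(successes, trials):
    """Return (proportion, standard deviation of the proportion)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    # Standard deviation of a binomial proportion: sqrt(p * (1 - p) / n).
    return p, math.sqrt(p * (1.0 - p) / trials)


# Example: 53 wins, assuming 100 games were played in this condition.
p, err = binomial_error(53, 100)
print(f"win rate = {p:.2f} +/- {err:.3f}")
```

Under this formula the error bar is widest near a 50% win rate and shrinks to zero when a model wins or loses every game.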

Lego Connect Language (LCL)  We invented a formal language we call LEGO Connect Language (LCL). More specifically, we propose $LCL_2$ as a language to instruct assembly in 2D on the x and y axes (this can be extended to $LCL_3$, i.e., 3D). Only 2x4 pieces are allowed when assembling a construction, $M$, which is valid if no pieces overlap and all pieces are connected to other pieces. First, a Lego piece is connected through interlocking pegs, not by merely touching sides. Second, two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate (see Appendix D.1 for a formal definition of LCL).

Figure 1: Left: A valid humanoid construct in $LCL_2$. Right: A standard LEGO piece in $LCL_2$.
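The two validity rules can be sketched as a small checker. This is an illustration under stated assumptions, not the formal definition of Appendix D.1: we assume each piece spans 4 units along x starting at its x-coordinate, that pegs interlock only between pieces on directly adjacent rows whose x-ranges intersect, and that "connected" means the whole construct forms a single connected structure. All names are hypothetical.

```python
from itertools import combinations

PIECE_WIDTH = 4  # assumption: a 2x4 brick spans 4 units along the x axis


def x_overlap(a, b):
    """True if two pieces share at least one unit along x."""
    return a[0] < b[0] + PIECE_WIDTH and b[0] < a[0] + PIECE_WIDTH


def is_valid_construct(pieces):
    """pieces: list of (x, y, color); x is the left edge, y the row."""
    if not pieces:
        return False  # assumption: an empty construct is not valid
    # Rule 1: no two pieces on the same row may overlap along x.
    for a, b in combinations(pieces, 2):
        if a[1] == b[1] and x_overlap(a, b):
            return False
    # Rule 2: pieces interlock only with pieces one row above or below.
    adjacency = {i: set() for i in range(len(pieces))}
    for (i, a), (j, b) in combinations(enumerate(pieces), 2):
        if abs(a[1] - b[1]) == 1 and x_overlap(a, b):
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Depth-first search: every piece must be reachable from the first.
    seen, stack = {0}, [0]
    while stack:
        for nxt in adjacency[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(pieces)


staircase = [(0, 0, "red"), (2, 1, "blue"), (4, 2, "green")]
side_by_side = [(0, 0, "red"), (4, 0, "blue")]
print(is_valid_construct(staircase), is_valid_construct(side_by_side))
```

The second example touches only at the sides, so under these assumptions it is rejected even though no pieces overlap.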

Game 1: Validity Testing  This test evaluates the ability to assess constructs as valid or invalid.

Game 2: Construct Generation  Models generate LCL constructs from prompts. Each construct is described by tuples specifying the coordinates and colors of pieces. Construct validity is then automatically evaluated.

In both games, the evaluation metric is the proportion of correct validations across different temperatures. We generated 800 images for the validity test (half valid, half invalid) and queried each model to produce 100 images at each temperature. These were checked for validity. An example for data used in these tasks is [(1, 1, 'red'), (2, 1, 'blue'), (3, 1, 'green')], where each tuple represents a brick’s position and color. Our use of LCL is related to Bubeck et al. [9], where GPT-4 generated images using JavaScript or LaTeX. However, unlike Bubeck et al.’s examples, which are prone to training data contamination, LCL remains abstract, challenging the model to operate outside its learned distributions.

Board Games: Tic-tac-toe, Connect-four, and Battleship (BTC)  For this set of games, which we will refer to as BTC from now on, each new board state was introduced with an initial game explanation via the OpenAI API in a zero-shot setup. The models, always playing as the first player, were provided with the current board state and faced an opponent making random moves. Since these games are fully observable, only the initial instruction and the current board state were needed for optimal play. Inputs required were two scalars for the row-column pair or a single scalar for the column in Connect-Four.
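The protocol described above can be sketched for Tic-Tac-Toe as follows. This is a minimal illustration, not the ChildPlay implementation: `choose_move` is a hypothetical stand-in for the API query that returns a row-column pair, the ASCII layout is invented for the example, and we assume that an illegal move ends the game as a loss and that the random opponent picks uniformly among empty squares.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def render(board):
    """ASCII view of the board, one row per line."""
    rows = [" | ".join(board[r * 3:r * 3 + 3]) for r in range(3)]
    return "\n---------\n".join(rows)


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_game(choose_move, rng):
    """choose_move(ascii_board) -> (row, col); stands in for the LLM query."""
    board = [" "] * 9
    for turn in range(9):
        if turn % 2 == 0:  # the model always moves first, as 'X'
            row, col = choose_move(render(board))
            idx = row * 3 + col
            if not (0 <= row < 3 and 0 <= col < 3) or board[idx] != " ":
                return "loss (illegal move)"
            board[idx] = "X"
        else:  # random opponent picks uniformly among empty squares
            idx = rng.choice([i for i, s in enumerate(board) if s == " "])
            board[idx] = "O"
        if winner(board) == "X":
            return "win"
        if winner(board) == "O":
            return "loss"
    return "draw"


# A stub policy that always answers with the centre square.
print(play_game(lambda ascii_board: (1, 1), random.Random(0)))
```

The stub policy loses on its second turn because the centre is already occupied, which is the kind of error that the win and loss rates alone do not distinguish from a direct defeat.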

In Battleship, ships (’S’) were randomly initialized horizontally, varying from 2 to 5 cells. Hits were marked with an ’X’ and misses with an ’O’ on both players’ boards. The symbol ’~’ denoted the sea - an unexplored square.

Figure 2: Initial board states presented to LLMs. Left: Tic-tac-toe board. Middle: Connect-four board. Right: Battleship board. Ship positions and lengths are randomised with every initialisation.

The Game of Shapes  For the game of shapes, we first probed the models about what geometric shapes they consider basic. The three shapes consistently mentioned were square, circle, and triangle. The game consists of finding a basic geometric shape "hidden" behind 1s in a matrix of 0s in a multiple-choice fashion. Four shapes were used as options: the circle, the rectangle, the triangle, and the cross. Only the latter three were ever shown to the model (cf. Fig. 3). The choice positions are not randomized to test if the model displays any inherent bias for the question order. This does not affect the outcome, as the game is designed to operate within a single question-response framework.

Figure 3: Matrices used in the Game of Shapes. Left: Square. Middle: Triangle. Right: Cross.
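For readers who want to experiment with similar inputs, the following sketch generates matrices of 0s with a shape drawn in 1s. It is an illustration only: the 15 x 15 grid size, the filled (rather than outlined) shapes, their sizes and positions, and all function names are assumptions and need not match the matrices shown in Fig. 3.

```python
def empty_grid(size=15):
    return [[0] * size for _ in range(size)]


def draw_square(size=15, top=4, left=4, side=6):
    """Filled square of 1s with its top-left corner at (top, left)."""
    grid = empty_grid(size)
    for r in range(top, top + side):
        for c in range(left, left + side):
            grid[r][c] = 1
    return grid


def draw_cross(size=15, center=7, arm=4):
    """One horizontal and one vertical bar crossing at (center, center)."""
    grid = empty_grid(size)
    for d in range(-arm, arm + 1):
        grid[center][center + d] = 1
        grid[center + d][center] = 1
    return grid


def draw_triangle(size=15, apex_row=3, apex_col=7, height=6):
    """Filled isosceles triangle widening by one cell per side per row."""
    grid = empty_grid(size)
    for h in range(height):
        for c in range(apex_col - h, apex_col + h + 1):
            grid[apex_row + h][c] = 1
    return grid


def to_text(grid):
    """Serialize the matrix as rows of space-separated digits."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


print(to_text(draw_cross()))
```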

3 Results

GPT move biases when playing BTC are shown in the heatmaps in Appendix B.3. The Random Player’s uniform distribution of moves across squares and columns appears to make it less likely to counter GPT’s "strategies", especially in Connect-Four. According to these heatmaps, GPT-3.5 makes more mistakes than GPT-4 at Tic-Tac-Toe, namely because it seems to play the middle row even when occupied, which produces an automatic loss. Furthermore, the heatmaps reveal that GPT models often play the middle column in Connect-Four, even if already full, increasing the likelihood of connecting four pieces before the Random Player, who has just a $1/7$ chance of blocking. Errors, especially from GPT-3.5, generally increase with temperature, likely due to increased randomness in move selection.

| Model | Temp | Battleship Win (%) | Battleship Lose (%) | Tic-Tac-Toe Win (%) | Tic-Tac-Toe Lose (%) | Connect-Four Win (%) | Connect-Four Lose (%) |
|---|---|---|---|---|---|---|---|
| GPT-3.5 | 0.0 | 10.0 | 90.0 | 53.0 | 47.0 | 64.0 | 36.0 |
| GPT-3.5 | 0.5 | 8.0 | 92.0 | 24.0 | 76.0 | 76.0 | 24.0 |
| GPT-3.5 | 1.0 | 3.0 | 97.0 | 24.0 | 76.0 | 75.0 | 25.0 |
| GPT-3.5 | 1.5 | 3.0 | 97.0 | 19.0 | 81.0 | 75.0 | 25.0 |
| GPT-4 | 0.0 | 0.0 | 100.0 | 71.0 | 29.0 | 80.0 | 20.0 |
| GPT-4 | 0.5 | 0.0 | 100.0 | 77.0 | 18.0 | 80.0 | 20.0 |
| GPT-4 | 1.0 | 4.0 | 96.0 | 69.0 | 31.0 | 70.0 | 30.0 |
| GPT-4 | 1.5 | 0.0 | 100.0 | 50.0 | 50.0 | 39.0 | 61.0 |

Table 1: Win and loss rates of LLMs in Battleship, Tic-Tac-Toe, and Connect-Four at different temperatures.

As shown in Table 1, performance varies significantly across games and temperatures. In Battleship, GPT-3.5’s win rate drops from 10% at $t=0.0$ to 3% at $t=1.0$ and $t=1.5$. Surprisingly, GPT-4 performs even worse, with no wins at $t=0.0$, $t=0.5$, and $t=1.5$, and only 4% at $t=1.0$. In Tic-Tac-Toe, GPT-3.5’s win rate decreases from 53% at $t=0.0$ to 19% at $t=1.5$. GPT-4 performs better, peaking at 77% at $t=0.5$ but dropping to 50% at $t=1.5$. In Connect-Four, GPT-3.5 shows an improvement from 64% at $t=0.0$ to 76% at $t=0.5$, maintaining around 75% at higher temperatures, while GPT-4 starts strong with 80% at $t=0.0$ but falls to 39% at $t=1.5$.

Overall, GPT-4 outperforms GPT-3.5 in Tic-Tac-Toe and Connect-Four at lower temperatures ($t=0.0$ and $t=0.5$), but GPT-3.5 shows better performance in Battleship at $t=0.0$ and tends to be more stable in Connect-Four at higher temperatures. Both models struggle as temperature increases, with significant declines in win rates, especially in Battleship and Tic-Tac-Toe. Importantly, both models often underperform compared to a random player, particularly at higher temperatures, which underscores the challenges they face in maintaining effective strategies as the temperature rises.

Tic-Tac-Toe  For comparison with the model’s performance, Fig. 12 presents the Tic-Tac-Toe match results of the minimax algorithm against the same random player the models played against. The algorithm won 996 games out of 1000, tied 4, and lost none. This also replicates the results of previous research on Tic-Tac-Toe and we use that as a baseline for optimal play against a random player (see Appendix A.7.1) [44, 1].
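For reference, a minimal sketch of a minimax player facing a random opponent is given below. It is a generic textbook formulation, not the code behind Fig. 12: the names are hypothetical, ties between equally valued moves are broken by taking the first such square, and the random opponent is assumed to pick uniformly among empty squares. The exact win/draw split therefore depends on these choices and is not meant to reproduce the reported 996/4/0 result, although a minimax player moving first should never lose.

```python
import random
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` for 'X' (+1 win, 0 draw, -1 loss), `player` to move."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    values = [minimax(board[:i] + (player,) + board[i + 1:], nxt)
              for i, s in enumerate(board) if s == " "]
    # 'X' maximizes the value, 'O' minimizes it.
    return max(values) if player == "X" else min(values)


def best_move(board):
    """Empty square with the highest minimax value for 'X'."""
    moves = [i for i, s in enumerate(board) if s == " "]
    return max(moves,
               key=lambda i: minimax(board[:i] + ("X",) + board[i + 1:], "O"))


def play_vs_random(rng):
    """Minimax plays 'X' and moves first; returns 'X', 'O', or 'draw'."""
    board, player = (" ",) * 9, "X"
    while winner(board) is None and " " in board:
        if player == "X":
            i = best_move(board)
        else:
            i = rng.choice([k for k, s in enumerate(board) if s == " "])
        board = board[:i] + (player,) + board[i + 1:]
        player = "O" if player == "X" else "X"
    return winner(board) or "draw"


rng = random.Random(0)
results = [play_vs_random(rng) for _ in range(100)]
print({outcome: results.count(outcome) for outcome in set(results)})
```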

Both GPT-3.5 and GPT-4 perform poorly in Tic-Tac-Toe at t=0, reflecting deterministic play based on learned strategies (see Appendix B.3). The Random Player’s uniform move distribution (see Fig. B.3.2) reduces the likelihood of countering GPT’s central strategies, with GPT-3.5 making more errors than GPT-4. These errors significantly impact outcomes due to incorrect moves (cf. Fig. 4). Errors increase with temperature, likely due to an increase in randomness (cf. Fig. 14), leading to losses from illegal moves rather than direct defeats.

Average game moves, missed wins and missed blocks in Tic-Tac-Toe decrease as temperature rises (see Fig. 5), because of an increase in losses due to random play leading to illegal moves. Regardless, neither model plays optimally, as indicated by the high number of missed wins and blocks. As temperature increases, missed wins and blocks decrease as models move away from deterministic play illustrated in the heatmaps of moves in Appendix B.3.
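One way to operationalize missed wins and missed blocks is sketched below. The definitions are assumptions made for illustration: a missed win is a move that ignores an available three-in-a-row, and a missed block is a move that, when no win is available, leaves an opponent's immediate win open. The counting used for Fig. 5 may differ in detail, and the function names are hypothetical.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winning_squares(board, mark):
    """Empty squares that would complete three in a row for `mark`."""
    squares = set()
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        if marks.count(mark) == 2 and marks.count(" ") == 1:
            squares.add(line[marks.index(" ")])
    return squares


def classify_move(board, move, me="X", opp="O"):
    """Label `move` by `me` on `board` (list of 9 marks before the move)."""
    my_wins = winning_squares(board, me)
    opp_wins = winning_squares(board, opp)
    if my_wins and move not in my_wins:
        return "missed win"
    if not my_wins and opp_wins and move not in opp_wins:
        return "missed block"
    return "ok"


board = ["X", "X", " ",
         "O", "O", " ",
         " ", " ", " "]
print(classify_move(board, 8), classify_move(board, 2))
```

In the example, playing the bottom-right square ignores the winning square in the top row, whereas completing the top row is labelled as fine.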

Figure 4: GPT models’ performance against a random player at the Tic-Tac-Toe game.
Figure 5: Tic-Tac-Toe: Missed wins and missed blocks.

Connect-Four  Both GPT-3.5 and GPT-4 exhibit their poorest performance in Connect-Four at temperature 0, indicative of deterministic play reflecting the models’ learned behavior (see Appendix B.3.3). The Random Player’s uniform distribution across columns (Fig. 16) implies a lower chance of countering GPT’s central strategies in Connect-Four. Neither model plays Connect-Four optimally, as evidenced by the many missed wins and blocks. Fig. 7 demonstrates that, as temperature increases, missed wins and blocks increase for GPT-3.5 and remain roughly the same for GPT-4.

Figure 6: GPT models’ performance against a random player at the Connect-Four game.
Figure 7: Connect-Four: Missed wins and missed blocks.

Battleship  The models were incapable of following the Battleship rules. Regardless of temperature, the models lose most games, largely due to incorrect moves (cf. Fig. 19). According to Table 1, GPT-4 wins no games except at $t=1.0$ (4%), while GPT-3.5 wins 10% of the matches at $t=0.0$, falling to 3% at $t=1.0$ and $t=1.5$ (cf. Fig. 8).

Figure 8: GPT models’ performance against a random player at Battleship.

Shapes  In the Shapes game, GPT-3.5’s performance was approximately equivalent to random chance when identifying triangles and crosses, yet it failed to recognize most squares (cf. Fig. 9). In contrast, GPT-4 performed remarkably well, successfully identifying shapes with an accuracy of approximately 80% and demonstrating particular proficiency at recognizing triangles. At higher temperatures, some of GPT-4’s responses were discarded by our parser when the model generated invalid Unicode output and, thus, were not included in the final evaluation. This discrepancy is evident in Fig. 9, for instance, where the sum of correct and incorrect choices does not total 25 at high temperatures.

Figure 9: Results for the Shapes game. Left: GPT-3.5. Right: GPT-4.

LCL  In LCL, both models failed to respect the two rules (cf. Section 2). Figs. 10(a) and 10(b), generated for the validity test, show valid LCL constructs, while Fig. 10(c) shows an invalid LCL structure. These were used in their formal encoding (cf. Appendix D.1) as model input. Figs. 10(d) and 10(f) show examples of invalid output from GPT-3.5, and Fig. 10(e) shows the only type of valid output generated by GPT-4. The second invalid example for GPT-3.5 was also generated by GPT-4 at $t=1$.

(a) Valid LCL assembly.
(b) Valid LCL assembly.
(c) Invalid LCL structure.
(d) GPT-3.5, $t=0$, invalid output.
(e) GPT-4, $t=1.5$, valid output.
(f) GPT-3.5, $t=1$, invalid output.
Figure 10: Example structures generated for the LCL validity test and structures generated by GPT-3.5 and GPT-4 for the construction test.

Fig. 11 shows a roughly linear increase in the proportion of correct answers for GPT-3.5 during the validity test as a function of temperature, while GPT-4 peaks at temperature 0.5 and then declines. Despite these increases, the proportion of correct responses remains low. Only GPT-4 produced valid LCL constructs, and only a small fraction of them (a proportion of 0.04 of 400 attempts, i.e., 16 constructs). GPT-3.5 did not produce a single valid LCL construct.

Figure 11: LCL results after 100 runs with 50/50 valid/invalid examples for the validity test and 100 experiments per temperature per model for the construction test using 3 pieces.

4 Discussion

Just as the presently ongoing ARC-AGI competition [21] attempts to test systems under a definition of GI akin to the one we offer in our introduction (processes equivalent to higher-order cognition enabling general problem solving that goes beyond specific skills or knowledge domains), our work too explores AI reasoning capabilities through methods that differ from typical benchmarks. Interestingly, both currently highlight deep learning limitations (at the moment no solution, with or without deep learning, goes beyond 43% accuracy, defined as the percentage of correct predictions [24]), particularly in abstract reasoning. The failure of these systems in ARC tasks aligns with our ChildPlay findings, where LLMs struggle with strategic reasoning and rule adherence, underscoring challenges in achieving true generalization beyond trained data distributions.

In Tic-Tac-Toe, both models underperform compared to the minimax algorithm, with mixed results in Connect-Four. GPT-4 performs well in Shapes, but GPT-3.5 does poorly. Both fail in LCL and Battleship at all temperatures, showing significant limitations in rule understanding. Increased temperature leads to more naive exploration but does not improve strategic performance, as seen in the rise of missed wins and blocks.

Overall, GPT-3.5 and GPT-4 can produce output for simple games but struggle to do so sensibly, often failing to apply optimal strategies such as blocking the opponent’s winning moves or playing their own. In summary, the performance gap to minimax underscores LLM limitations in strategic reasoning. Failures in Battleship and LCL highlight severe issues in rule adherence.

Modern LLM benchmarks primarily adapt Turing’s test [50] to assess language processing, potentially overlooking deeper complexities. While smaller models like GPT-4 exhibit what Humboldt described as the "infinite use of finite means" [29] as popularized by Chomsky [12], generating contextually relevant sentences [45], this does not necessarily mean they have mastered reasoning - it is more likely to be advanced pattern imitation.

Limitations and Future Work  Just as in ARC-AGI [21], in the future we intend to generate a holdout game dataset that is kept hidden, only providing tools for evaluation, such that overfitting on it before testing becomes unlikely. Regarding our general approach, we pursued a zero-shot condition, but future work should test Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and even Few-Shot learning [56, 58, 8]. These methodologies reduce task complexity and may lead to improved LLM performance. LLMs might just need more clues. The current technical disadvantage of any testing setting is that it allows LLMs to resort to probabilistic patterns rather than genuine reasoning. This might not be an issue of the class of methods, but simply a direct product of the underlying transformer algorithm, which narrows in on the correct data manifold by using input tokens as features for arriving at the most likely answer given the training data, allowing the models to approximate reasoning patterns, likely without true understanding.

Regarding input tokens, ChildPlay sometimes uses single character output, leading to what we term single character token myopia. This is problematic when encoding state information as single characters because GPT has a multi-character tokenizer as do most LLMs [8].

On the subject of evaluation, our benchmark’s binary outcomes (win/loss) can bias perceived capabilities. To avoid this, we also assess move choices and missed wins/blocks (see Appendix B.3 and Figs. 5, 7). Using discontinuous metrics in strategic games may cause sharp evaluation transitions, potentially misrepresenting gradual improvements. Shape recognition and LCL tasks might benefit from continuous metrics, providing smoother performance gradients and more accurate reflections of reasoning abilities. If the model outputs continuous values (e.g., coordinates of a shape’s vertices), the mean squared error, for example, can be used to quantify the difference between predicted and actual values.
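As an illustration of such a continuous metric, the sketch below computes the mean squared error between predicted and actual vertex coordinates. It is a hypothetical scoring function, not part of the current benchmark, and it assumes that both lists contain the same vertices in the same order.

```python
def mean_squared_error(predicted, actual):
    """Mean squared distance between matching (x, y) vertices."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty vertex lists of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, actual):
        total += (px - ax) ** 2 + (py - ay) ** 2
    return total / len(predicted)


# A triangle whose apex is predicted two rows away from its true position.
actual = [(0, 0), (4, 0), (2, 5)]
predicted = [(0, 0), (4, 0), (2, 3)]
print(mean_squared_error(predicted, actual))
```

Unlike a correct/incorrect label, the score shrinks smoothly as the predicted vertices approach the true ones, which is the gradual behaviour a discontinuous metric would hide.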

Furthermore, following Schaeffer et al.’s perspective, one might argue that the ChildPlay games may not fully reflect true generalization or emergent abilities [39]. If benchmarks resemble nonlinear metrics, they could exaggerate LLM weaknesses or strengths. For instance, a sharp failure in the Battleship game might not indicate poor strategic reasoning universally but rather in specific conditions. We believe this not to be the case as the experiments consist of simple fully observable games. Additionally, we aimed to diminish this effect by producing heatmaps of the moves and keeping track of illegal moves, missed blocks, and missed wins. Regarding the performance difference between the models at the shapes game, we believe this is best rationalized through data-contamination. We think it is likely that GPT-4’s training data contained examples similar to the ones used. This would explain why GPT-4 only improved dramatically at that game - it already knew the answers.

Future ChildPlay enhancements could involve feedback on actions, reducing task complexity. We have translated a small dataset from the Lego corporation (see Appendix D) for future reference and development of LCL. The LCL dataset can also be simulated in both 2D and 3D as a multimodal benchmark. Another important aspect is our ASCII representation. It might not be well-suited to convey the meaning of a board state. More complex symbolic representations might benefit the models’ performance. We also plan to benchmark additional models (cf. Appendix G for a short preliminary analysis on models other than GPT-3.5 and GPT-4), including open-source LLMs, and explore deep reinforcement learning as in Schrittwieser et al., Kaiser et al., and Silver et al. [40, 25, 42].

5 Conclusions

Non-language tasks challenge models to generalize across different encodings and delve into out-of-training-distribution scenarios. Testing LLMs like GPT-4 (according to Bubeck et al., a contender for AGI [9]) beyond text, we show they are mediocre at best at simple reasoning tasks outside their training data. Both GPT-4 and GPT-3.5 fail to play simple games optimally and struggle with following rules, namely assembly instructions. Mixed results were found in interpreting geometric shapes from binary grids, where GPT-4 did particularly well. These tasks test reasoning without relying on language, forcing the model to play the game rather than describe it; describing a game, we believe, is akin to repeating learned information, also known as "parroting" [7, 14, 4, 20, 60, 3]. The "non-language" category of the BIG-bench benchmark includes tasks like ASCII recognition, chess, and Sudoku, but none like ours [2]. Hence, we believe ChildPlay is a valuable addition to current LLM benchmarks.

Developing games allows us to critically assess claims about a model’s reasoning and problem-solving abilities, avoiding data contamination issues. Our results suggest current LLMs show disappointing problem-solving capabilities, highlighting key areas for future improvement.

The key takeaway of our work is not just the models’ performance in specific games, but understanding their ability to generalize and reason in novel contexts. This has broader implications for AI, particularly in developing models capable of true reasoning and adaptive thinking. Our research highlights how LLMs handle tasks requiring higher-order reasoning, which is crucial for advanced applications. Practically, if GPT or other LLMs cannot reason accurately on novel data, they should not be trusted to do so in standard applications. This is critical, especially in scientific contexts, as many users may be biased toward believing LLM outputs because they appear credible. Our benchmarks illustrate these pitfalls, showing that if LLMs struggle with simple games, users should be more critical of their outputs.

References

  • Alkaraz et al. [2020] Shahd H. Alkaraz, Essam El-Seidy, and Neveen S. Morcos. Tic-tac-toe: Understanding the minimax algorithm, 2020. URL https://api.semanticscholar.org/CorpusID:218798654.
  • BIG-bench authors [2023] BIG-bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
  • Bender et al. [2021a] Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big?, 03 2021a.
  • Bender et al. [2021b] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA, 2021b. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.
  • Binet and Simon [1961] Alfred Binet and Theodore Simon. The development of intelligence in children, 1961.
  • Blodgett et al. [2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of "bias" in nlp, 2020. URL https://arxiv.org/abs/2005.14050.
  • Borji [2023] Ali Borji. Stochastic parrots or intelligent systems? a perspective on true depth of understanding in llms. SSRN Electronic Journal, 01 2023. doi: 10.2139/ssrn.4507038.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
  • Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv, abs/2303.12712, 2023. URL https://api.semanticscholar.org/CorpusID:257663729.
  • Carroll [1993] John B. Carroll. Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press, 1993. doi: 10.1017/CBO9780511571312.
  • Cattell [1963] Raymond B. Cattell. Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54(1):1–22, 1963. doi: 10.1037/h0046743.
  • Chomsky [1957] Noam Chomsky. Syntactic Structures. Mouton and Co., The Hague, 1957.
  • Chowdhery et al. [2024] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: scaling language modeling with pathways. J. Mach. Learn. Res., 24(1), mar 2024. ISSN 1532-4435.
  • Duan et al. [2023] Haonan Duan, Adam Dziedzic, Nicolas Papernot, and Franziska Boenisch. Flocks of stochastic parrots: Differentially private prompt learning for large language models. ArXiv, abs/2305.15594, 2023. URL https://api.semanticscholar.org/CorpusID:258887717.
  • Face [2024] Hugging Face. distilbert/distilgpt2, 2024. URL https://huggingface.co/distilbert/distilgpt2. Accessed: 2024-08-16.
  • Fields et al. [2024] John Fields, Kevin Chovanec, and Praveen Madiraju. A survey of text classification with transformers: How wide? how large? how long? how accurate? how expensive? how safe? IEEE Access, 12:6518–6531, 2024. URL https://api.semanticscholar.org/CorpusID:266824505.
  • Floridi and Chiriatti [2020] L. Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681 – 694, 2020. URL https://api.semanticscholar.org/CorpusID:228954221.
  • Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
  • Gottfredson [1997] Linda S. Gottfredson. Why g matters: The complexity of everyday life. Intelligence, 24(1):79–132, 1997. ISSN 0160-2896. doi: https://doi.org/10.1016/S0160-2896(97)90014-3. URL https://www.sciencedirect.com/science/article/pii/S0160289697900143. Special Issue Intelligence and Social Policy.
  • Henrique et al. [2023] Da Silva Gameiro Henrique, Andrei Kucharavy, and Rachid Guerraoui. Stochastic parrots looking for stochastic parrots: Llms are easy to fine-tune and hard to detect with other llms. arXiv, 2023. URL https://arxiv.org/abs/2304.08968.
  • Infinite MonkeyLab42 [2024] Infinite MonkeyLab42. ARC Prize 2024: ARC-AGI Competition. https://arcprize.org/arc, 2024. Accessed: 2024-08-16.
  • JackFram [2024] JackFram. Jackfram/llama-68m, 2024. URL https://huggingface.co/JackFram/llama-68m. Accessed: 2024-08-16.
  • Jensen [1998] A.R. Jensen. The g factor: The science of mental ability. Westport, CT: Praeger, 1998.
  • Kaggle [2024] Kaggle. Arc prize 2024, 2024. URL https://www.kaggle.com/competitions/arc-prize-2024/overview. Accessed: 2024-08-17.
  • Kaiser et al. [2020] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-based reinforcement learning for atari, 2020.
  • Lappin [2023] Shalom Lappin. Assessing the strengths and weaknesses of large language models. Journal of Logic, Language and Information, 33:1–12, 11 2023. doi: 10.1007/s10849-023-09409-x.
  • Liga and Pasetto [2023] Davide Liga and Luca Pasetto. Testing spatial reasoning of large language models: the case of tic-tac-toe, 2023. URL https://ceur-ws.org/Vol-3563/paper_14.pdf.
  • Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958.
  • Merrill [2023] William Merrill. Formal languages and neural models for learning on sequences. In International Conference on Graphics and Interaction, 2023. URL https://api.semanticscholar.org/CorpusID:261101973.
  • Microsoft [2024a] Microsoft. microsoft/dialogpt-medium, 2024a. URL https://huggingface.co/microsoft/DialoGPT-medium. Accessed: 2024-08-16.
  • Microsoft [2024b] Microsoft. microsoft/phi-2, 2024b. URL https://huggingface.co/microsoft/phi-2. Accessed: 2024-08-16.
  • Microsoft [2024c] Microsoft. microsoft/phi-3-mini-4k-instruct, 2024c. URL https://huggingface.co/microsoft/Phi-3-mini-4k-instruct. Accessed: 2024-08-16.
  • Mistralai [2024] Mistralai. mistralai/mixtral-8x7b-instruct-v0.1, 2024. URL https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1. Accessed: 2024-08-16.
  • Nie et al. [2020] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https://arxiv.org/abs/1910.14599.
  • OpenAI [2024a] OpenAI. openai-community/gpt2, 2024a. URL https://huggingface.co/openai-community/gpt2. Accessed: 2024-08-16.
  • OpenAI [2024b] OpenAI. openai-community/gpt2-medium, 2024b. URL https://huggingface.co/openai-community/gpt2-medium. Accessed: 2024-08-16.
  • Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
  • Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250.
  • Schaeffer et al. [2023] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage?, 2023. URL https://arxiv.org/abs/2304.15004.
  • Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, December 2020. ISSN 1476-4687. doi: 10.1038/s41586-020-03051-4. URL http://dx.doi.org/10.1038/s41586-020-03051-4.
  • Shleifer [2024] Sam Shleifer. sshleifer/tiny-gpt2, 2024. URL https://huggingface.co/sshleifer/tiny-gpt2. Accessed: 2024-08-16.
  • Silver et al. [2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
  • Srivastava et al. [2023] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. 
Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. 
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. 
Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. URL https://arxiv.org/abs/2206.04615.
  • Swaminathan et al. [2020] Bala Swaminathan, R Ekke Vaishali, and R subashriTS. Analysis of minimax algorithm using tic-tac-toe, 2020. URL https://api.semanticscholar.org/CorpusID:228863323.
  • Sweet [1989] Paul Robinson Sweet. On language: The diversity of human language-structure and its influence on the mental development of mankind. by wilhelm von humboldt. translated by peter heath. Historiographia Linguistica, 16:387–392, 1989. URL https://api.semanticscholar.org/CorpusID:170369059.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • TinyLlama [2024] TinyLlama. Tinyllama/tinyllama-1.1b-chat-v1.0, 2024. URL https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0. Accessed: 2024-08-16.
  • Topsakal and Harper [2024] Oguzhan Topsakal and Jackson Harper. Benchmarking large language model (llm) performance for game playing via tic-tac-toe. Electronics, 13:1532, 04 2024. doi: 10.3390/electronics13081532.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
  • TURING [1950] A. M. TURING. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind, LIX(236):433–460, 10 1950. ISSN 0026-4423. doi: 10.1093/mind/LIX.236.433. URL https://doi.org/10.1093/mind/LIX.236.433.
  • UAE [2024] TII UAE. Falcon-7b-instruct, 2024. URL https://huggingface.co/tiiuae/falcon-7b-instruct. Accessed: 2024-08-16.
  • van Dijk et al. [2023] Bram M. A. van Dijk, Tom Kouwenhoven, Marco R. Spruit, and Max J. van Duijn. Large language models: The need for nuance in current debates and a pragmatic perspective on understanding, 2023. URL https://arxiv.org/abs/2310.19671.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Wang et al. [2019] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
  • Wechsler [1944] David Wechsler. The Measurement of Adult Intelligence. Williams & Wilkins Co., 3rd edition, 1944. doi: 10.1037/11329-000.
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903, 2022. URL https://arxiv.org/abs/2201.11903.
  • Wright [1904] Wm. R. Wright. General intelligence, objectively determined and measured., 1904. URL https://api.semanticscholar.org/CorpusID:144456697.
  • Yao et al. [2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601.
  • Zellers et al. [2019] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.
  • Zečević et al. [2023] Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal, 2023. URL https://arxiv.org/abs/2308.13067.

Appendix

Appendix A Algorithms

A.1 Generating Games

Our program enables simulation, testing, and analysis of game-play of the different benchmarks found in ChildPlay.

The program begins by setting up variables for debug outputs and game configurations, initializes the game environment based on these configurations, and sets up players accordingly. It then simulates a series of games, saving results and logs in JSON format, and generates heatmaps to analyze gameplay strategies and player decisions. The script also handles results aggregation and visualization, including the generation of bar plots of the results.

Our program is structured to simulate games with various players through a base class and its derived classes, each representing a different player type. The PlayerBase class serves as the foundation for all player types. It initializes player properties such as ID, name, and a debug flag, and it also manages message storage for debugging or interaction logging. One method stores or prints board states and plays depending on the debug flag, which facilitates debugging and tracking game interactions.

The LLMPlayer class, derived from PlayerBase, represents a player that uses a large language model; in our case, this player is either GPT-3.5 or GPT-4. Its constructor initializes the player with an ID, name, model details, and the game being played, and the class extends PlayerBase with functionality for LLM interactions. One method prints messages related to API interactions, and another formulates a prompt for the LLM, invokes it, and processes its response. A further method interprets the LLM’s response to determine the validity of a move within the game’s context.
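The response-interpretation step can be illustrated with a minimal sketch. It assumes a square board and a reply that contains exactly two integers (row, then column); the function name parse_move and the reply format are hypothetical and are not taken from the ChildPlay code.

```python
import re


def parse_move(response, board_size):
    """Extract a (row, col) move from free-form model text.

    Assumption: a usable reply contains exactly two integers, row first.
    Returns None when the reply cannot be read as an in-bounds move.
    """
    numbers = re.findall(r"-?\d+", response)
    if len(numbers) != 2:
        return None  # too few or too many numbers: treat as unparseable
    row, col = (int(n) for n in numbers)
    if not (0 <= row < board_size and 0 <= col < board_size):
        return None  # parsed, but outside the board
    return row, col
```

Returning None for anything ambiguous keeps the decision about how to penalize a bad reply in the game loop rather than in the parser.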

The TextPlayer class, also derived from PlayerBase, represents a human player who inputs moves as text through the console. Its constructor initializes player-specific details and a callback function for receiving input. One method manages receiving and validating player inputs according to the game rules.

The RandomPlayer class, another derived class from PlayerBase, simulates a player making random valid moves based on the current game state. We have designed a method in this class to generate random possible moves that always comply with the game’s rules given the present game state.
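As a rough sketch of this class hierarchy, the following shows a base class with a message log and a random player that only ever picks empty cells. The method names log and make_guess, the seed parameter, and the board encoding (a list of lists with " " for empty cells) are assumptions made for illustration.

```python
import random


class PlayerBase:
    """Hypothetical base class: identity, debug flag, and a message log."""

    def __init__(self, player_id, name, debug=False):
        self.player_id = player_id
        self.name = name
        self.debug = debug
        self.messages = []

    def log(self, message):
        # Assumption: always store the message, and also print it when debugging.
        self.messages.append(message)
        if self.debug:
            print(message)


class RandomPlayer(PlayerBase):
    """Picks uniformly among the cells that are still empty."""

    def __init__(self, player_id, name, seed=None, debug=False):
        super().__init__(player_id, name, debug)
        self.rng = random.Random(seed)  # private RNG so runs can be reproduced

    def make_guess(self, board, empty=" "):
        free = [
            (r, c)
            for r, row in enumerate(board)
            for c, cell in enumerate(row)
            if cell == empty
        ]
        if not free:
            return None  # no valid move exists
        move = self.rng.choice(free)
        self.log(f"{self.name} plays {move}")
        return move
```

Because the candidate list is built from the current board, every move returned is valid by construction.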

We have two possible game loops, one designed to execute a series of games between two players, and another to simulate one single game. The former takes parameters including the game instance, the two players, the number of games to play, and a debug flag. This function runs a loop for the specified number of games, recording results, tracking messages, and logging game states and moves. It returns a summary of the game results and collected messages for further analysis. The second game loop simulates a single game between two players by initializing the game, managing turn-based gameplay, and logging detailed move information. It returns a tuple with game messages, counts of invalid moves, a log of moves, and the game outcome.
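The two loops can be sketched as follows. To keep the example self-contained it uses a toy stand-in game (CountdownGame) that is not part of ChildPlay; the function names play_one_game and play_many_games, the result strings, and the rule that an invalid move ends the game as a loss are assumptions for illustration.

```python
import random


class CountdownGame:
    """Toy stand-in game: players alternately take 1 or 2 tokens;
    whoever takes the last token wins."""

    def __init__(self, tokens=5):
        self.start_tokens = tokens
        self.reset_game()

    def reset_game(self):
        self.tokens = self.start_tokens

    def valid_moves(self):
        return [n for n in (1, 2) if n <= self.tokens]

    def guess(self, move):
        if move not in self.valid_moves():
            return "invalid"
        self.tokens -= move
        return "win" if self.tokens == 0 else "continue"


class RandomPlayer:
    def __init__(self, name, seed=None):
        self.name = name
        self.rng = random.Random(seed)

    def make_guess(self, game):
        return self.rng.choice(game.valid_moves())


def play_one_game(game, player1, player2):
    """Single game: returns (messages, invalid_moves, move_log, outcome)."""
    game.reset_game()
    players = [player1, player2]
    messages, move_log, invalid_moves = [], [], [0, 0]
    current = 0
    while True:
        move = players[current].make_guess(game)
        result = game.guess(move)
        move_log.append({"player": players[current].name, "move": move, "result": result})
        if result == "invalid":
            invalid_moves[current] += 1
            outcome = players[1 - current].name  # assumption: invalid move loses
            break
        if result == "win":
            outcome = players[current].name
            break
        current = 1 - current  # hand the turn to the other player
    messages.append(f"{outcome} wins after {len(move_log)} moves")
    return messages, invalid_moves, move_log, outcome


def play_many_games(game, player1, player2, num_games):
    """Series of games: returns a win tally and all collected messages."""
    results = {player1.name: 0, player2.name: 0}
    all_messages = []
    for _ in range(num_games):
        messages, _invalid, _log, outcome = play_one_game(game, player1, player2)
        results[outcome] += 1
        all_messages.extend(messages)
    return results, all_messages
```

A call such as play_many_games(CountdownGame(5), RandomPlayer("P1", seed=0), RandomPlayer("P2", seed=1), 10) returns a tally whose counts sum to the number of games played.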

The remaining functions are helpers that provide basic functionality, for example producing multiple random moves to generate a dataset or loading data from a JSON file. We can also create a comprehensive dataset from multiple games and save it to a JSON file, load it, and print the board state for a specific record index. As stated before, we can also generate heatmaps showing the frequency of moves in the case of the board games, or of predicted versus actual shapes in the case of the game of shapes.

A.2 Tic-Tac-Toe

For the Tic-Tac-Toe task, we first initialize the game with optional settings. If no specific options are provided, the board size defaults to 3x3. Customizable options include the board size and a debug flag. Upon initialization, a method is called to set up or reset the game state. The constructor also sets the game name and provides a prompt that explains the rules and how to input moves.

The reset_game method is responsible for resetting the game by creating a board of the specified size, defaulting to 3x3, filled with spaces that represent empty cells. It also initializes the current player as "P1", sets up a list to track moves made, and marks the game status as not over.

The get_text_state method constructs a text-based or ASCII representation of the current game state. It displays column and row headers for the board and formats the board with dividers to clearly delineate rows and columns.
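A minimal sketch of such an ASCII rendering is given below. The exact layout used in ChildPlay is not reproduced here; the header format, the "|" and "-" dividers, and the helper name make_board are assumptions.

```python
def make_board(size=3):
    """Board of the given size filled with spaces (empty cells)."""
    return [[" " for _ in range(size)] for _ in range(size)]


def get_text_state(board):
    """Render the board with column and row headers and simple dividers."""
    size = len(board)
    header = "  " + " ".join(str(c) for c in range(size))
    divider = "  " + "-" * (2 * size - 1)
    lines = [header]
    for r, row in enumerate(board):
        lines.append(f"{r} " + "|".join(row))
        if r < size - 1:
            lines.append(divider)  # separate rows, but not after the last one
    return "\n".join(lines)
```

For a 3x3 board this yields six lines of text: one header, three rows, and two dividers.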

The guess method manages a player’s move by first validating it to ensure it falls within the board boundaries and that the chosen position is not already occupied. If the move is valid, it updates the board with the player’s symbol (’X’ for player index 0 and ’O’ for player index 1). After each valid move, the method checks for a win or a tie. If neither condition is met, it switches the turn to the next player. A player who makes an invalid move loses the game.

The check_win method determines if the current player has won by checking all possible winning conditions. This includes vertical, horizontal, and diagonal lines. It returns True if any of these conditions are met with consecutive symbols matching the current player’s.

The check_tie method checks for a tie by determining if all cells are filled without any player achieving a winning condition.

A switch_player method alternates turns between the two players, passing control between "P1" and "P2" after each valid move.
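The methods described above can be combined into a compact sketch. It assumes that a win requires a full row, column, or diagonal (which matches the default 3x3 board), that guess takes a row and column and reads the symbol from the current player, and that outcomes are reported as strings; these details are illustrative choices.

```python
class TicTacToe:
    def __init__(self, size=3):
        self.size = size
        self.reset_game()

    def reset_game(self):
        self.board = [[" "] * self.size for _ in range(self.size)]
        self.current_player = "P1"
        self.moves = []
        self.game_over = False

    def check_win(self, symbol):
        n, b = self.size, self.board
        lines = [[(r, c) for c in range(n)] for r in range(n)]   # rows
        lines += [[(r, c) for r in range(n)] for c in range(n)]  # columns
        lines.append([(i, i) for i in range(n)])                 # main diagonal
        lines.append([(i, n - 1 - i) for i in range(n)])         # anti-diagonal
        return any(all(b[r][c] == symbol for r, c in line) for line in lines)

    def check_tie(self):
        # Only called after check_win has failed, so a full board is a tie.
        return all(cell != " " for row in self.board for cell in row)

    def switch_player(self):
        self.current_player = "P2" if self.current_player == "P1" else "P1"

    def guess(self, row, col):
        symbol = "X" if self.current_player == "P1" else "O"
        in_bounds = 0 <= row < self.size and 0 <= col < self.size
        if not in_bounds or self.board[row][col] != " ":
            self.game_over = True
            return "invalid"  # an invalid move loses the game
        self.board[row][col] = symbol
        self.moves.append((self.current_player, row, col))
        if self.check_win(symbol):
            self.game_over = True
            return "win"
        if self.check_tie():
            self.game_over = True
            return "tie"
        self.switch_player()
        return "continue"
```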

A.3 Connect-Four

The Connect-Four constructor initializes the game with optional settings as well, such as row and column size and debug mode. If no specific options are provided, the game defaults to a 7x7 grid. Otherwise, it reads the values from the provided options dictionary. The constructor sets up basic game properties, including a reset board, a tuple to track the last move, the game over status, and the current player. It also includes a game instruction prompt.

The reset_board method resets the board to a 7x7 grid (or a custom size if specified) filled with dots (.) to represent empty cells.

The check_tie method determines if the game has ended in a tie. It checks if the top row of the board is completely filled without any winner being declared. This method ensures that the check_win method returns False before declaring a tie.

The check_win method assesses whether a winning condition has been met after a move. It examines the cell corresponding to the last move and checks horizontally, vertically, and diagonally (in both positive and negative directions) for four consecutive identical symbols (either ’X’ or ’O’).

The guess method handles a player’s move attempt. It returns an invalid move if the attempt is made outside of valid columns or on a full column. For a valid move, it updates the board, checks for a win or a tie, and switches to the next player if the game is still ongoing.
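The core of the move handling and win check can be sketched with plain functions. The helper name drop_piece and the convention that row 0 is the top row are assumptions; the win check starts from the last move and counts matching symbols in both directions along each of the four axes.

```python
EMPTY = "."


def reset_board(rows=7, cols=7):
    return [[EMPTY] * cols for _ in range(rows)]


def drop_piece(board, col, symbol):
    """Place a piece in the lowest free cell of a column.

    Returns the (row, col) of the placed piece, or None for an invalid move
    (column out of range or column already full). Row 0 is the top row.
    """
    if not 0 <= col < len(board[0]) or board[0][col] != EMPTY:
        return None
    for row in range(len(board) - 1, -1, -1):
        if board[row][col] == EMPTY:
            board[row][col] = symbol
            return (row, col)
    return None


def check_win(board, last_move):
    """Check for four in a row through the cell of the last move."""
    row, col = last_move
    symbol = board[row][col]
    rows, cols = len(board), len(board[0])
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1  # the last move itself
        for sign in (1, -1):  # walk in both directions along this axis
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < rows and 0 <= c < cols and board[r][c] == symbol:
                count += 1
                r += sign * dr
                c += sign * dc
        if count >= 4:
            return True
    return False


def check_tie(board):
    """A full top row with no winner means the game is tied."""
    return all(cell != EMPTY for cell in board[0])
```

Checking only lines through the last move avoids rescanning the whole board after every turn.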

The get_text_state method generates a string representation of the current board state, displaying ’X’ and ’O’ for the two players.

The switch_player method alternates the active player between "P1" (Player 1) and "P2" (Player 2).

The board_size property returns the number of columns in the board, which is necessary for validation and display purposes.

A.4 Battleship

The Battleship constructor likewise initializes the game with optional settings. If no options are provided, the board size defaults to 5x5. The constructor allows customization of the board size and a debug flag. Using nested list comprehensions, it sets up four game boards in total, two per player: a ship board, on which a player sees their own ships and the hits against them, and a guess board, on which a player sees only their own hits and misses against the adversary. Additionally, it initializes lists that keep track of the ship positions for both players. A dictionary, ship_types, defines ship names and their sizes, which scale according to the board size. The place_ships method is then called to randomly place ships on each player’s board. Initial game properties are also set, including the current player and a game prompt that explains the rules.

The board property dynamically returns the guess board of the current player, in case we want to display it or carry out logic checks.

The place_ships method is responsible for randomly placing ships on a specified board. It attempts to place each ship either vertically or horizontally without overlapping or touching another ship, including diagonally adjacent spaces.

The is_space_free method checks if a specified space on the board is free to accommodate a ship of a given size and orientation. It ensures that there is no overlap with existing ships and that the ship is not placed directly adjacent to another ship.
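The placement logic can be sketched as follows, assuming a square board stored as a list of lists with "~" for water and "S" for a ship cell. These symbols, the helper make_board, the max_tries parameter, and the explicit rng argument are hypothetical choices for the example.

```python
import random

WATER, SHIP = "~", "S"


def make_board(size=5):
    return [[WATER] * size for _ in range(size)]


def ship_cells(row, col, size, horizontal):
    return [(row, col + i) if horizontal else (row + i, col) for i in range(size)]


def is_space_free(board, row, col, size, horizontal):
    """True if the ship fits on the board and touches no other ship,
    including diagonally adjacent cells."""
    n = len(board)
    for r, c in ship_cells(row, col, size, horizontal):
        if not (0 <= r < n and 0 <= c < n):
            return False
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n and board[rr][cc] == SHIP:
                    return False
    return True


def place_ships(board, ship_sizes, rng, max_tries=1000):
    """Randomly place each ship; returns the list of occupied cells per ship."""
    n = len(board)
    positions = []
    for size in ship_sizes:
        for _ in range(max_tries):
            horizontal = rng.random() < 0.5
            row, col = rng.randrange(n), rng.randrange(n)
            if is_space_free(board, row, col, size, horizontal):
                cells = ship_cells(row, col, size, horizontal)
                for r, c in cells:
                    board[r][c] = SHIP
                positions.append(cells)
                break
        else:
            raise RuntimeError("could not place ship; board may be too crowded")
    return positions
```

Rejection sampling with a retry cap is a simple way to satisfy the no-touching constraint; it can fail on crowded boards, which is why the sketch raises an error rather than looping forever.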

The get_text_state method provides a string representation of both the ship and guess boards for a specified player, formatted for easy reading. This functionality is useful for displaying the game state in a text-based interface.

The guess method processes a player’s guess by checking if it hits a ship and updating the boards accordingly. It also switches the player after each guess and checks for a win condition. Hits are marked with ’X’ and misses with ’O’ on the guess and ship boards.

The check_win method determines if the current player has won by checking if all positions of the opponent’s ships have been hit.

The switch_player method alternates turns between the two players after each guess.
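The guess and win-check logic can be illustrated with a short sketch. Hits are marked ’X’ and misses ’O’ as described above; the "~" and "S" symbols, the treatment of out-of-range or repeated guesses as invalid, and the returned strings are assumptions.

```python
WATER, SHIP = "~", "S"


def guess(ship_board, guess_board, row, col):
    """Resolve one shot at the opponent's ship board.

    Marks the result on both the opponent's ship board and the shooter's
    guess board, and returns "hit", "miss", or "invalid".
    """
    n = len(ship_board)
    if not (0 <= row < n and 0 <= col < n) or guess_board[row][col] != WATER:
        return "invalid"  # assumption: repeated guesses count as invalid
    hit = ship_board[row][col] == SHIP
    mark = "X" if hit else "O"
    ship_board[row][col] = mark
    guess_board[row][col] = mark
    return "hit" if hit else "miss"


def check_win(opponent_ship_positions, guess_board):
    """The shooter wins once every opponent ship cell has been hit."""
    return all(
        guess_board[r][c] == "X"
        for ship in opponent_ship_positions
        for r, c in ship
    )
```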

A.5 Shapes

The Shapes game starts by defining two constants, empty_character and full_character, which are represented by "0" and "1" respectively. These constants denote empty and filled cells in the grid.

One of the primary functions in the script is bar_plot_shapes, which generates bar plots illustrating the counts of correct and incorrect answers for each shape. This function takes several parameters, including base_path for saving plots, models which is a list of GPT-type models used in the game, temperatures which refers to different temperature settings in the GPT-model predictions, and shapes representing the various shapes used in the game. The function iterates over the models and temperatures, loads game logs, counts the correct and incorrect answers, and plots these counts in bar charts. The resulting plots are saved to specified directories.

The script also includes several shape-drawing functions. The create_board function initializes an empty board of specified dimensions filled with empty_character. The draw_rectangle function allows drawing a rectangle or square on the board by filling cells with full_character. The draw_circle function draws an approximate circle using the midpoint circle algorithm. The draw_triangle function draws an equilateral triangle on the board by filling cells within the triangle’s bounds. The draw_cross function draws a cross on the board with a specified arm length centered at a given point.
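Two of the drawing functions can be sketched as follows. The argument order, the filled (rather than outlined) rectangle, the bounds checks, and the helper board_to_text are assumptions for illustration; the circle and triangle functions are omitted.

```python
EMPTY_CHARACTER = "0"
FULL_CHARACTER = "1"


def create_board(width, height):
    """Empty board: a list of rows filled with the empty character."""
    return [[EMPTY_CHARACTER] * width for _ in range(height)]


def draw_rectangle(board, top, left, width, height):
    """Fill a width x height rectangle whose top-left corner is (top, left)."""
    if top < 0 or left < 0 or top + height > len(board) or left + width > len(board[0]):
        raise ValueError("rectangle does not fit on the board")
    for r in range(top, top + height):
        for c in range(left, left + width):
            board[r][c] = FULL_CHARACTER


def draw_cross(board, center_row, center_col, arm_length):
    """Draw a plus-shaped cross with arms of the given length."""
    if (center_row - arm_length < 0 or center_col - arm_length < 0
            or center_row + arm_length >= len(board)
            or center_col + arm_length >= len(board[0])):
        raise ValueError("cross does not fit on the board")
    for offset in range(-arm_length, arm_length + 1):
        board[center_row + offset][center_col] = FULL_CHARACTER
        board[center_row][center_col + offset] = FULL_CHARACTER


def board_to_text(board):
    """Serialize the board as rows of 0s and 1s, one row per line."""
    return "\n".join("".join(row) for row in board)
```

For example, a cross with arm length 1 centred on a 5x5 board fills five cells, forming a plus sign in the middle three rows.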

A key component of the script is the Shapes class, which manages the game’s logic. This class initializes the game with optional settings such as board size and debug mode. It sets the game name, the shape to be drawn, and a prompt explaining the game rules. The reset_game method creates a new board and randomly draws the specified shape (square, triangle, or cross) on it. It shuffles possible answer options and resets game state variables. The get_text_state method generates a text-based representation of the game board and possible answers for display. The guess method processes a player’s guess, checks if it matches the correct shape, and ends the game, returning the outcome (win or loss). Methods check_win, check_tie, and check_loss are used to determine the game’s outcome.

The script supports both visualization and analysis of game outcomes. The bar_plot_shapes function helps visualize the correctness of guesses. The shape-drawing functions facilitate the random generation of different shapes for each new session. The Shapes class encapsulates the game’s logic, managing the board state, processing guesses, and determining game outcomes, thus supporting a single-player interaction where the player guesses the shape drawn on the board from a list of options.
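The single-guess flow of the Shapes class can be sketched as below. The sketch leaves out board drawing and assumes that the player answers with an index into the shuffled option list; the constructor signature and the seed parameter are hypothetical.

```python
import random


class Shapes:
    """Minimal single-player guessing game: one guess ends the game."""

    def __init__(self, shape, options=("square", "triangle", "cross"), seed=None):
        self.shape = shape
        self.options = list(options)
        self.rng = random.Random(seed)
        self.reset_game()

    def reset_game(self):
        # Shuffle the answer options so the correct index varies between games.
        self.answer_options = list(self.options)
        self.rng.shuffle(self.answer_options)
        self.game_over = False
        self.won = False

    def guess(self, option_index):
        """Process the single guess and return "win" or "loss"."""
        self.game_over = True
        valid_index = 0 <= option_index < len(self.answer_options)
        self.won = valid_index and self.answer_options[option_index] == self.shape
        return "win" if self.won else "loss"

    def check_win(self):
        return self.game_over and self.won

    def check_loss(self):
        return self.game_over and not self.won
```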

A.6 Lego Connect Language (LCL)

The LCL script enables generating, validating, and visualizing Lego-like constructs with different types of players.

We start by defining LCLVisualizer, which handles the visualization of Lego constructs. Its methods include draw_piece, which draws a single Lego piece on a plot, and display_construct, which displays and saves a complete Lego construct.

The LCLGame class manages the logic for generating and validating Lego constructs. It contains attributes such as pieces, a list of tuples representing Lego pieces, valid_colors, a list of valid colors for these pieces, and metrics, a dictionary for tracking validation metrics and construct details. Its methods include can_place_piece, which determines if a Lego piece can be placed at a given position, find_all_valid_positions, which identifies all valid positions for placing new pieces, and build_random_valid_assembly, which constructs a valid Lego assembly by placing pieces in valid positions. Additionally, generate_random_piece and generate_random_construct are methods for generating random Lego pieces and constructs, respectively. The is_valid_construct method checks the validity of a given construct, while generate_valid_or_invalid_construct creates either a valid or invalid construct based on a given parameter. The class also includes methods like create_tower, create_bridge, and create_staircase for generating specific types of constructs, and validate_construct, which validates a player’s response regarding the construct’s validity. Finally, save_metrics saves the validation metrics to a file.

The RandomPlayer class represents a player that generates random answers. Its primary method, generate_random_answer, returns a random validity answer, either "valid" or "invalid."

The LLMPlayer class is designed for the GPT models to generate answers. It has attributes like model, which specifies the GPT-model type used, and temperature, which sets the temperature for the model’s responses. Its methods include generate_llm_answer_validity, which produces a validity game answer using the LLM, and generate_llm_answer_list, which generates a list of Lego pieces using the LLM.

The main function orchestrates the simulation of games, including the generation of constructs, validation, and saving of results. It initializes both the game and visualizer, creates directories for saving experiment results, and defines the number of experiments, models, and temperatures. The function conducts two main types of experiments:

Validity Testing: In this experiment, constructs are generated (either valid or invalid), and players (either random or LLM) are asked to validate them. The results and visualizations of these constructs are saved.

Construct Generation: Here, LLM players are prompted to generate valid constructs based on a given prompt. These constructs are then validated, and results and visualizations are recorded.

The results of these experiments are saved to CSV files for further analysis.

The product of these experiments differs slightly per type of game. We collected results (see Table 3) as well as game state data based on player moves (see Table 2). Single-player games are simplified by setting irrelevant values to 0 (e.g., player and turn), but the overall format is the same. In the case of LCL, we save the results directly to two dataframes. For the construct experiment, we record the model temperature, the model type, the experiment number, whether the output is valid, the LLM’s answer, and whether the response is true or false. For the validity experiment, we record the temperature, the model, the experiment number, whether the output is valid, whether the LLM’s response is correct, and the actual LCL construct.

A.7 Game Data

A.7.1 Minimax vs Random Player

Figure 12: Minimax (agent) vs. random player - wins and losses on the left and a heatmap of move counts over the 1000 games for the minimax player on the right.

| Variable | Value |
|---|---|
| Player | A |
| Move | [X, Y] |
| Turn | B |

Table 2: Game logs with the details of a specific game state. The ’Player’ variable (A) indicates the player number, ’Move’ ([X, Y]) represents the coordinates of the move made, and ’Turn’ (B) signifies the turn number in the game.

| Variable | Value |
|---|---|
| P1 Wins | C |
| P2 Wins | D |
| Ties | E |
| P1 Wrong Moves | F |
| P2 Wrong Moves | G |

Table 3: Summary of the outcomes of a series of games. ’P1 Wins’ (C) and ’P2 Wins’ (D) indicate the number of games won by Player 1 and Player 2, respectively. ’Ties’ (E) shows the number of games that ended in a tie. ’P1 Wrong Moves’ (F) and ’P2 Wrong Moves’ (G) represent the number of invalid moves made by Player 1 and Player 2, respectively.

Appendix B Three Board Games: Tic-Tac-Toe, Connect-Four, and Battleship

B.1 Prompts

Game Introductory Prompt
Battleship "Battleship is a two-player guessing game where each player has a fleet of ships on a secret grid and then takes turns guessing the locations of the opponent’s ships. The objective is to sink all of the opponent’s ships by correctly guessing their locations. O’s in a board mean that the player selected a square to attack and there was no ship there - it’s a miss. Had there been a ship there, instead of an O you would see an X. In your board, an <S> signifies a ship position, and a <~> signifies the sea. Your input is just two numbers with a space in between, one for the row (from 0 to <self.board_size-1>) and one for the column (from 0 to <self.board_size-1>), like: 0 0, nothing else. Do not output anything else but the row col values."
Tic-Tac-Toe "Tic-Tac-Toe is a two-player game played on a 3x3 grid. Players take turns placing their mark, X or O, in an empty square. The first player to place three of their marks in a horizontal, vertical, or diagonal row wins the game. You will play as player 1, therefore you play with X while your adversary plays with the symbol O. Your input is then a number (from 0 to 2) for the row followed by a space and another number (from 0 to 2) for the column, nothing else. Do not output anything else but the row col values else you lose."
Connect-Four "Connect-Four is a two-player game. The pieces fall straight down, occupying the next available space within a column. The objective of the game is to be the first to form a horizontal, vertical, or diagonal line of four of one’s own discs. In a board, player 1, you, plays with symbol X, while player 2, your opponent, plays with symbol O. Your input is just a number from 0 to 6, nothing else. Do not output anything else but the col value else you lose."
Table 4: The three introductory prompts used for the board games in the ChildPlay suite.
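The prompts in Table 4 fix a strict reply format ("row col" for Tic-Tac-Toe and Battleship, a single column number for Connect-Four). A minimal sketch of how such replies could be checked is given below. The function names `parse_row_col` and `parse_column` are hypothetical and not taken from the ChildPlay code; the sketch only tests format and range, whereas whether a target square is already occupied would need the board state.

```python
import re
from typing import Optional, Tuple


def parse_row_col(reply: str, board_size: int) -> Optional[Tuple[int, int]]:
    """Parse a 'row col' reply; return None if the format or range is wrong."""
    match = re.fullmatch(r"\s*(\d+)\s+(\d+)\s*", reply)
    if match is None:
        return None  # anything other than two integers counts as malformed
    row, col = int(match.group(1)), int(match.group(2))
    if row >= board_size or col >= board_size:
        return None  # outside the 0 .. board_size-1 range given in the prompt
    return row, col


def parse_column(reply: str, n_columns: int = 7) -> Optional[int]:
    """Parse a single-column reply for Connect-Four (columns 0 to 6 by default)."""
    match = re.fullmatch(r"\s*(\d+)\s*", reply)
    if match is None:
        return None
    col = int(match.group(1))
    return col if col < n_columns else None


if __name__ == "__main__":
    print(parse_row_col("1 2", board_size=3))
    print(parse_row_col("I choose 1 2", board_size=3))
    print(parse_column("6"))
```

A reply that fails parsing would be a natural candidate for the "wrong moves" counters in Table 3, although the exact bookkeeping used in the suite is not specified here.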

B.2 Example

Note that in Connect-Four, a move consists of a single scalar (the column index). A board state is shown after each play; examples can be found in Fig. 13. The move space in Tic-Tac-Toe corresponds to the full grid, and the player selects a specific board square. In contrast, the move space in Connect-Four is a series of 7 columns: the player can only select a column, and pieces stack on top of one another within it. This was our conceptual and algorithmic representation of the two games.
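To make the column-only move space concrete, the sketch below drops a piece into a chosen column and lets it settle in the lowest free row. It assumes the standard 6-row by 7-column board and a `"."` empty marker; `new_board` and `drop_piece` are illustrative names, not functions from the ChildPlay repository.

```python
from typing import List, Optional

ROWS, COLS = 6, 7   # assumed standard Connect-Four dimensions
EMPTY = "."         # assumed marker for an empty cell

Board = List[List[str]]


def new_board() -> Board:
    """Return an empty board; row 0 is the top row, row ROWS-1 the bottom."""
    return [[EMPTY] * COLS for _ in range(ROWS)]


def drop_piece(board: Board, col: int, symbol: str) -> Optional[int]:
    """Drop symbol into col and return the row it lands in.

    Returns None for an invalid move: column out of range or column full.
    """
    if not 0 <= col < COLS:
        return None
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom upwards
        if board[row][col] == EMPTY:
            board[row][col] = symbol
            return row
    return None


if __name__ == "__main__":
    board = new_board()
    drop_piece(board, 3, "X")
    drop_piece(board, 3, "O")
    print("\n".join(" ".join(row) for row in board))
```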

Figure 13: Examples of final board states in the three different board games: (a) Player X has won the Connect-Four game; (b) players tied at the Tic-Tac-Toe game; (c) a player has won the Battleship game.

B.3 Move Mapping

B.3.1 Probability of incorrect moves

The probability of incorrect moves $P_{\text{incorrect}}$ as a function of temperature is given by:

$$P_{\text{incorrect}}(T)=\frac{\text{Incorrect Moves at }T}{\text{Total Moves}}$$

For Tic-Tac-Toe, Connect Four, and Battleship, the results are as follows:

$$P_{\text{incorrect, TTT}}=\left[4.7,\,7.6,\,7.6,\,8.1\right]$$
$$P_{\text{incorrect, CF}}=\left[1.8,\,1.0,\,0.65,\,0.6\right]$$
$$P_{\text{incorrect, BS}}=\left[3.583,\,3.708,\,4.042,\,3.875\right]$$

These values indicate that the probability of incorrect moves varies across different games as the temperature changes. For Tic-Tac-Toe, the probability of incorrect moves increases with temperature, suggesting that higher temperatures may lead to less optimal decisions. In Connect Four, however, the probability of incorrect moves decreases as temperature increases, indicating a possible improvement in decision-making or adaptation to the conditions. In Battleship, the probability fluctuates but remains relatively stable.
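As a minimal sketch of the ratio defined above, the function below computes one rate per temperature from (incorrect moves, total moves) pairs. The function name, the dictionary layout, and the counts in the usage example are hypothetical and chosen only for illustration; they are not the experimental data, and the sketch assumes the total is supplied per temperature.

```python
from typing import Dict, Tuple


def incorrect_move_rates(counts: Dict[float, Tuple[int, int]]) -> Dict[float, float]:
    """Map each temperature to incorrect_moves / total_moves.

    counts maps a temperature to a pair (incorrect_moves, total_moves).
    """
    rates: Dict[float, float] = {}
    for temperature, (incorrect, total) in counts.items():
        if total <= 0:
            raise ValueError("total moves must be positive")
        rates[temperature] = incorrect / total
    return rates


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    example = {0.0: (5, 100), 0.5: (8, 100), 1.0: (8, 100), 1.5: (9, 100)}
    for temperature, rate in incorrect_move_rates(example).items():
        print(f"T={temperature}: {rate:.3f}")
```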

Average Number of Steps  The average number of steps $\langle S\rangle$ as a function of temperature is calculated by averaging the steps observed at different temperatures:

$$\langle S\rangle=\frac{1}{n}\sum_{i=1}^{n}\text{Steps at }T_{i}$$

For Tic-Tac-Toe, Connect Four, and Battleship, the average numbers of steps are as follows:

$$\langle S_{\text{TTT}}\rangle=2.5$$
$$\langle S_{\text{CF}}\rangle=5.0$$
$$\langle S_{\text{BS}}\rangle=6.0$$

These values show that the average number of steps taken in each game tends to stabilize regardless of temperature changes. This suggests that while the probability of making incorrect moves may fluctuate with temperature, the number of steps taken does not vary significantly.

B.3.2 Tic-Tac-Toe

Below, see the left column for the model’s moves, and the right column for the random player’s moves.

Figure 14: Heatmaps of GPT-3.5’s moves for the Tic-Tac-Toe game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

Figure 15: Heatmaps of GPT-4’s moves for the Tic-Tac-Toe game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

B.3.3 Connect-Four

Below, see the left column for the model’s moves, and the right column for the random player’s moves.

Figure 16: Heatmaps of GPT-3.5’s moves for the Connect-Four game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

Figure 17: Heatmaps of GPT-4’s moves for the Connect-Four game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

B.3.4 Battleship

Below, see the left column for the model’s moves, and the right column for the random player’s moves.

Figure 18: Heatmaps of GPT-3.5’s moves for the Battleship game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

Figure 19: Heatmaps of GPT-4’s moves for the Battleship game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

B.4 Shapes

Below, see the left column for the model’s moves, and the right column for the random player’s moves.

Figure 20: Heatmaps of GPT-3.5’s decisions for the shapes game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

Figure 21: Heatmaps of GPT-4’s decisions for the shapes game at temperatures (a) $t=0$, (b) $t=0.5$, (c) $t=1$, and (d) $t=1.5$.

B.5 Prompting GPT About Optimal Play

Game Explanation
Tic-Tac-Toe Tic-Tac-Toe is a two-player game played on a 3x3 grid. Each player takes turns marking a square with their symbol (X or O), aiming to get three of their symbols in a row, column, or diagonal. To play optimally, prioritize securing the center square and blocking opponent’s winning moves.
Battleship Battleship is a two-player game where players hide ships on a grid and take turns guessing their opponent’s ship locations. The goal is to sink all of the opponent’s ships. To play optimally, start by targeting areas with higher probabilities of containing a ship and strategically target adjacent squares after a hit to maximize efficiency.
Connect Four Connect Four is a two-player game played on a 6x7 grid. Players drop colored discs into columns, aiming to connect four of their own discs in a row, column, or diagonal. To play optimally, prioritize creating your own winning formations while blocking opponent’s potential winning moves.
Table 5: Optimal strategies for playing different games according to GPT-3.5.
Game Explanation
Tic-Tac-Toe Play your first X in a corner to maximize opportunities. If the opponent plays in the center, play the opposite corner. Block your opponent’s potential winning moves and always look to create a line of three.
Battleship Randomize ship placements and start by targeting the center of the grid. Use a checkerboard pattern for efficient searching. Once a ship is hit, focus on the surrounding squares to determine its orientation and sink it.
Connect Four Start in the center column to maximize opportunities in all directions. Build threats vertically, horizontally, and diagonally, and block the opponent’s forming lines. Create multiple threats to force the opponent into a defensive position.
Table 6: Optimal strategies for playing different games according to GPT-4.

Appendix C LCL

C.1 Prompts

Validity Testing prompt: "You will receive a description of a Lego structure, for instance, ((x1, y1, ’color1’), (x2, y2, ’color2’)), which lists the coordinates and colors of two pieces. A construct is valid if all Lego pieces are connected but not overlapping. A Lego piece is connected through interlocking pegs, not by merely touching sides. Two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate. If the following structure is valid then reply with valid, otherwise reply with invalid (do not justify your answer): <pieces>"

Figure 22: Validity testing prompt.

Construct Generation prompt: "A description of a Lego structure consists of a list of tuples, ((x1, y1, ’color1’), (x2, y2, ’color2’)), where each tuple shows the coordinates and colors of a piece. Such a structure is valid if all Lego pieces are connected but not overlapping. A Lego piece is connected through interlocking pegs, not by merely touching sides. Two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate. Produce a description of a valid structure using <n pieces> Lego pieces. Reply only with the Lego structure description following the format ((x1, y1, ’color1’), (x2, y2, ’color2’), …), write nothing else but the structure."

Figure 23: Construct generation prompt.

The prompts from Fig. 22 and Fig. 23 were used for both GPT-3.5 and GPT-4 in the main text, and these tests are part of the ChildPlay suite. Further tests were conducted but not included in the suite; they are illustrated herein. They have not been included because they would first need to be written as systematic benchmarks rather than experimental input-output segments. Currently, they stand as illustrative cases of spatial reasoning failure and success that supplement the benchmark but are not aimed at proving the model’s capacity either way.
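The validity rule stated in both prompts (all pieces connected through interlocking pegs, no two pieces overlapping within a layer) can be sketched as below. This is an illustrative check, not the `is_valid_construct` method of the LCL script. It rests on several assumptions: every piece is 4 studs long, a piece at position x occupies the half-open stud range [x, x+4), two pieces interlock when they sit on adjacent layers and share at least one stud position, and "connected" means the whole construct forms a single connected group.

```python
from typing import List, Sequence, Set

PIECE_LENGTH = 4  # assumed fixed length, in studs


def studs(piece: Sequence) -> Set[int]:
    """Stud positions covered by a piece given as (x, y) or (x, y, color)."""
    x = piece[0]
    return set(range(x, x + PIECE_LENGTH))  # half-open range [x, x + 4)


def overlaps(a: Sequence, b: Sequence) -> bool:
    """Same layer and at least one shared stud position."""
    return a[1] == b[1] and bool(studs(a) & studs(b))


def interlocks(a: Sequence, b: Sequence) -> bool:
    """Adjacent layers and at least one shared stud position."""
    return abs(a[1] - b[1]) == 1 and bool(studs(a) & studs(b))


def is_valid_construct(pieces: Sequence[Sequence]) -> bool:
    """Return True if no pieces overlap and all pieces form one connected group."""
    items: List[Sequence] = list(pieces)
    if not items:
        return False
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if overlaps(items[i], items[j]):
                return False
    # Depth-first search over the interlock graph, starting from the first piece.
    seen = {0}
    stack = [0]
    while stack:
        current = stack.pop()
        for other in range(len(items)):
            if other not in seen and interlocks(items[current], items[other]):
                seen.add(other)
                stack.append(other)
    return len(seen) == len(items)


if __name__ == "__main__":
    print(is_valid_construct([(0, 0, "red"), (4, 0, "blue"), (8, 0, "green")]))
    print(is_valid_construct([(0, 0, "red"), (4, 0, "blue"), (2, 1, "green")]))
```

Other stud conventions (for example, counting shared edges as contact) would change which borderline constructs pass, so the result for any given construct depends on the assumptions listed above.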

Appendix D LCL Syntax

D.1 Definitions in LCL

A piece $P$ is defined as a tuple $P=(l,w,(x,y),c,h)$ (see Table 7) where:

  1. $l$ is the length of the piece, fixed at 4 units;

  2. $w$ is the width of the piece, fixed at 2 units;

  3. the $x$-axis corresponds to the position of the studs;

  4. the $y$-axis corresponds to layers - the first brick is at layer 0;

  5. $c$ is the color of the piece;

  6. $h$ is the height of the piece, fixed at 1 unit.

For the sake of brevity, in most of the examples below we omit length ($l$), color ($c$), and height ($h$) since these are set as constants.

| Parameter | Description | Value |
|---|---|---|
| $l$ | Length of the piece | 4 units |
| $w$ | Width of the piece | 2 units |
| $(x,y)$ | Position of the studs (x-axis), layers (y-axis) | Var |
| $c$ | Colour of the piece | Var |
| $h$ | Height of the piece | 1 unit |

Table 7: Definition of a Piece $P$

A construction, $M$, is then a valid construction in $LCL_{2}$ if and only if it follows the rules:

  1. $P=(4,2,(x,y),c,1)$;

  2. $M$ is composed entirely of $P$ pieces ($\Phi=\{P\}$);

  3. Every piece $P$ must be connected to at least one other piece $P$;

  4. $M$ is symmetric along the line crossing the 2 by 4 pieces, between its pegs, along the piece’s longest side;

  5. Pieces in the construct can only be manipulated horizontally in $n\pi$ rotations, with $n\in\mathbb{Z}$ (note that this makes width irrelevant);

  6. The position of a piece is defined by its left-most pair of studs;

  7. $M$ begins with a piece $P$ at coordinates $(0,0)$;

  8. All pieces placed in layer $n$ must be placed before any piece is placed in layer $n+1$.

Consider constructing a line using three bricks (we omit height $h$ since it is a constant, with value equal to 1). Counter-intuitively, a line cannot be represented as in Fig. 26, because the pieces are disconnected.

$LCL_{2}$: $((0,0),(4,0),(8,0))$ is an example of what one expects to see as representing a line, but it is not valid in LCL because the pieces are disconnected from each other: they merely lie next to each other, one after another in a row. Instead, $((0,0),(4,0),(2,1))$, $((0,0),(-2,1),(2,1))$, or even $((0,0),(-2,1),(4,1))$ would be valid constructs.

Subsequently, both models were prompted with several additional requests that have not been integrated in the suite yet (see Table 8).

For these experiments, the definition of LCL was provided to the model and it was accompanied by the prompt in Fig. 24.

Prompt: "I will give you a number of pieces, I will ask you for a shape and you’ll output the coordinates per piece to form such a shape. It must be valid in LCL."

Figure 24: Extra testing prompts not in the suite.
Task Description
Triangle Construction "Make a triangle with 5 bricks."
Humanoid Figure "6 pieces. Build a humanoid figure."
Bart Simpson-Like Figure "Let me help you. Imagine it’s Bart Simpson. You have three yellow pieces, one for the head, two for the arms, one red for the torso, and two blue pieces for the legs."
Tower Construction "Produce now a tower with 3 bricks."
Table 8: Sequence of building prompts.

D.2 Example

A simple example is found in Fig. 25. This is a tower constructed from 3 bricks and is a valid $LCL_{2}$ construct.

Figure 25: A valid tower representation in $LCL_{2}$.

Figure 26: A disconnected line of bricks is not a valid construct in $LCL_{2}$: $\{(0,0),(4,0),(8,0)\}$.

This sequence forms a 3-brick line, each brick having a length of 4 units. But since this construction is composed of three columns of one piece $P$ each, it can be broken apart and is not a topological object (each piece can be moved individually). The correct construct with three bricks has many possible solutions. For a centre piece with two pieces on the bottom or two pieces on the top, we find $24$ possible solutions. Eq. 1 gives the general formula, with $s$ being the number of studs:

$$\begin{split} f(0) &= 0 \\ f(s) &= 4(s-1)+f(s-1) \end{split} \tag{1}$$

And its non-recursive form:

$$\begin{split} f(0) &= 0 \\ f(s) &= 2(s-1)s \end{split} \tag{2}$$
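The agreement between the recursive and non-recursive forms can be checked numerically. Unrolling the recursion gives $f(s)=4\sum_{k=1}^{s}(k-1)=4\cdot\frac{s(s-1)}{2}=2(s-1)s$, and the short sketch below confirms this for small $s$, including the value 24 at $s=4$ quoted in the text. The function names are illustrative.

```python
def f_recursive(s: int) -> int:
    """Eq. 1: f(0) = 0 and f(s) = 4 * (s - 1) + f(s - 1)."""
    if s < 0:
        raise ValueError("s must be non-negative")
    if s == 0:
        return 0
    return 4 * (s - 1) + f_recursive(s - 1)


def f_closed(s: int) -> int:
    """Eq. 2: f(s) = 2 * (s - 1) * s."""
    if s < 0:
        raise ValueError("s must be non-negative")
    return 2 * (s - 1) * s


if __name__ == "__main__":
    for s in range(6):
        print(s, f_recursive(s), f_closed(s))
```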

We show two more simple examples in Figs. 27 and 28.

Figure 27: A possible representation of the requested "line" as a valid construct in $LCL_{2}$: $\{(0,0),(4,0),(2,1)\}$.

Figure 28: Another possible representation of the requested "line" as a valid construct in $LCL_{2}$: $\{(0,0),(-2,1),(2,1)\}$.

The "three-in-a-line" can only be loosely interpreted in $LCL_{2}$, due to rule (2) - that pieces cannot be moved independently from the rest of the model. For this reason, one can imagine many more structures that loosely fall under the definition of a "line" or "wall", for example:

Figure 29: Another possible construct for the requested "line". This is also a valid construct in $LCL_{2}$: $\{(0,0),(-2,1),(4,1)\}$.

Or even a stair-like structure:

Figure 30: Stair-like construct for the requested "line". This is also a valid construct in $LCL_{2}$: $\{(0,0),(2,1),(4,2)\}$.

A humanoid could also be easily represented in $LCL_{2}$ as:

Figure 31: A possible representation of a humanoid as a valid construct in $LCL_{2}$: $\{(0,0),(4,0),(2,1),(0,2),(4,2),(2,3)\}$.

D.2.1 LCL - Soft Experiments

Figure 32: Responses of (a) GPT-3.5 and (b) GPT-4 to the query: "Make a triangle with 5 bricks.", randomised colours. This is impossible to satisfy.

Figure 33: Responses of (a) GPT-3.5 and (b) GPT-4 to the query: "Make a triangle with 6 bricks.", randomised colours.

Figure 34: Responses of (a) GPT-3.5 and (b) GPT-4 to the query: "You have 6 pieces. Build a humanoid figure.", randomised colours.

Figure 35: Responses of (a) GPT-3.5 and (b) GPT-4 to the query: "Imagine it’s bart simpson. You have three yellow pieces, one for the head, two for the arms, one red for the torso, and two blue pieces for the legs." - colours specified.

We show the models’ incorrect answers in Figs. 32, 34, and 35 and a correct answer in Fig. 33(b). Essentially, both GPT-3.5 and GPT-4 were not far from the expected target, but failed to respect $LCL_{2}$ rules in most cases. For example, pieces are found in an impossible superposition in Fig. 32(a) (the red piece is in the same position as the yellow piece), Fig. 34(b) (the blue piece is in the same position as the yellow piece), and Fig. 35(b) (the red piece is in the same position as the middle yellow pieces). In Fig. 35(a), GPT-3.5 erroneously swapped the middle yellow piece with the red piece and the blue pieces with the bottom yellow pieces, even though it first declared the correct organisation of the 6 pieces in plain English. On the positive side, both models managed to assemble a tower of three pieces, and GPT-4 was capable of assembling a triangle (see Table 9). Neither model recognised that it was asked to perform an impossible task, namely building a triangle with only 5 pieces (see Fig. 32).

| Category | N(P) | GPT-3.5 | GPT-4 |
|---|---|---|---|
| Tower | 3 | Correct | Correct |
| Impossible Triangle | 5 | Incorrect | Incorrect |
| Triangle | 6 | Incorrect | Correct |
| Humanoid | 6 | Incorrect | Incorrect |
| Bart Simpson | 6 | Incorrect | Incorrect |

Table 9: Comparison of responses by GPT-3.5 and GPT-4, where N(P) is the number of pieces.

D.3 Small Dataset for Future Experiments

The dataset defined herein contains several example prompts that are more complex and do not follow the 2x4 assumption, each consisting of a request followed by a LEGO kit of fewer than 15 pieces to which the agent is bound.

LEGO Kits

Apple

Possible prompt: "Construct a LEGO apple with a mix of red and green colors, resembling a typical apple shape using slopes and bricks."

  • Green Slope 45 2 x 1 - Code: 3040 (Quantity: 1)

  • Red Slope 45 2 x 2 - Code: 3039 (Quantity: 2)

  • Lime Slope, Inverted 45 2 x 2 - Code: 3660 (Quantity: 2)

  • Red Brick 2 x 3 - Code: 3002 (Quantity: 1)

  • Lime Plate 2 x 2 - Code: 3022 (Quantity: 1)

  • Lime Brick 1 x 2 - Code: 3004 (Quantity: 1)

Yellow Hut

Possible prompt: "Build a hut with a purple and yellow color scheme, featuring a simple structure and a sloped roof."

  • Trans-Clear Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 2)

  • Medium Nougat Brick 2 x 2 - Code: 3003 (Quantity: 1)

  • Lime Plate 2 x 6 - Code: 3795 (Quantity: 1)

  • Bright Light Yellow Brick 1 x 2 - Code: 3004 (Quantity: 4)

  • Bright Light Yellow Brick 2 x 2 - Code: 3003 (Quantity: 1)

  • Medium Lavender Slope 45 2 x 2 - Code: 3039 (Quantity: 4)

Fortress

Possible prompt: "Create a medieval-themed LEGO fortress with arches, walls, and defensive structures, symbolizing a stronghold."

  • Green Plate 2 x 8 - Code: 3034 (Quantity: 1)

  • Light Bluish Gray Arch 1 x 4 x 2 - Code: 6182 (Quantity: 2)

  • Sand Green Brick 1 x 2 - Code: 3004 (Quantity: 2)

  • Light Bluish Gray Brick 1 x 2 - Code: 3004 (Quantity: 2)

  • Dark Bluish Gray Brick 1 x 2 - Code: 3004 (Quantity: 2)

  • Light Bluish Gray Brick 2 x 2 - Code: 3003 (Quantity: 1)

  • Reddish Brown Brick, Round 1 x 1 Open Stud - Code: 3062b (Quantity: 2)

Dinghy

Possible prompt: "Assemble a small LEGO dinghy with a white sail and a mast."

  • Dark Tan Plate 2 x 4 - Code: 3020 (Quantity: 1)

  • Tan Slope, Inverted 33 3 x 2 with Flat Bottom Pin and Connections - Code: 3747b (Quantity: 1)

  • White Slope 45 2 x 2 - Code: 3039 (Quantity: 3)

  • White Brick 2 x 2 - Code: 3003 (Quantity: 1)

  • White Brick 1 x 2 - Code: 3004 (Quantity: 1)

  • Tan Brick 2 x 3 - Code: 3002 (Quantity: 1)

  • Reddish Brown Brick, Round 2 x 2 with Axle Hole - Code: 3941 (Quantity: 1)

Blue Bot

Possible prompt: "Construct a LEGO robot with a humanoid structure, featuring a distinguishable head, body, arms, and legs."

  • Medium Blue Brick 2 x 2 - Code: 3003 (Quantity: 1)

  • Brick, Modified 2 x 3 with Curved Top - Code: 6215 (Quantity: 1)

  • Brick 2 x 4 - Code: 3001 (Quantity: 1)

  • Brick 1 x 2 - Code: 3004 (Quantity: 2)

  • Brick, Round 2 x 2 with Grille - Code: 92947 (Quantity: 1)

  • Plate 2 x 2 - Code: 3022 (Quantity: 1)

  • Brick, Modified 1 x 2 with Studs on 1 Side - Code: 11211 (Quantity: 1)

  • Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)

  • Tile 1 x 1 Round - Code: 98138 (Quantity: 1)

  • Brick, Round 2 x 2 Dome Top, with Bottom Axle Holder - Code: 553c (Quantity: 1)

Toy Car

Possible prompt: "Build a LEGO toy car with a compact design, featuring wheels, and a sloped windshield."

  • Brick 2 x 6 - Code: 2456 (Quantity: 1)

  • Slope 2 x 2 45° - Code: 3039 (Quantity: 1)

  • Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)

  • Brick 1 x 2 - Code: 3004 (Quantity: 1)

  • Plate 2 x 2 with Wheel Holders - Code: 4600 (Quantity: 2)

  • Wheel 8mm D. x 6mm with Slot - Code: 34337 (Quantity: 4)

  • Tire Offset Tread Small - Band Around Center of Tread - Code: 87414 (Quantity: 4)

Goldfish

Possible prompt: "Create a LEGO goldfish with fins and tail, featuring elements for eyes."

  • Brick 2 x 4 - Code: 3001 (Quantity: 2)

  • Brick 1 x 2 with Pin Hole - Code: 3700 (Quantity: 1)

  • Brick, Modified 1 x 2 with Studs on 1 Side - Code: 11211 (Quantity: 2)

  • Brick 2 x 3 - Code: 3002 (Quantity: 1)

  • Slope 45° 2 x 2 - Inverted - Code: 3660 (Quantity: 1)

  • Slope 2 x 1 - 45° - Code: 3040 (Quantity: 4)

  • Tile 1 x 1 Round with Eye Pattern - Code: 98138pb007 (Quantity: 2)

  • Slope 30° 1 x 2 x 2/3 - Code: 85984 (Quantity: 1)

Baby Elephant

Possible prompt: "Assemble a LEGO baby elephant with a focus on its trunk, ears, and body structure."

  • Brick 2 x 6 - Code: 2456 (Quantity: 1)

  • Brick 1 x 2 - Code: 3004 (Quantity: 3)

  • Brick 1 x 4 - Code: 3010 (Quantity: 1)

  • Brick 1 x 1 with Stud on 1 Side - Code: 87087 (Quantity: 2)

  • Tile 1 x 1 Round with Eye Pattern - Code: 98138pb027 (Quantity: 2)

  • Brick 2 x 4 - Code: 3001 (Quantity: 1)

Flamingo

Possible prompt: "Construct a LEGO flamingo with pink bricks, designed to stand on one leg and feature a long neck and beak."

  • Brick 1 x 2 - Code: 3004 (Quantity: 3)

  • Brick, Modified 2 x 3 with Curved Top - Code: 6215 (Quantity: 2)

  • Brick 1 x 1 with Stud on 1 Side - Code: 87087 (Quantity: 2)

  • Plate 2 x 3 - Code: 3021 (Quantity: 1)

  • Slope 2 x 2 - 45° - Code: 3039 (Quantity: 1)

  • Tile 1 x 1 Round with Eye Closed Pattern - Code: 98138pb028 (Quantity: 2)

Twin Engine Airplane

Possible prompt: "Build a LEGO twin-engine airplane, with a body, wings, and a tail."

  • Plate 2 x 8 - Code: 3034 (Quantity: 2)

  • Brick 1 x 2 x 2 with Inside Stud Holder - Code: 3245c (Quantity: 1)

  • Brick, Modified 1 x 1 x 1 2/3 with Studs on 1 Side - Code: 32952 (Quantity: 2)

  • Brick 1 x 4 with 4 Studs on 1 Side - Code: 30414 (Quantity: 2)

  • Slope 2 x 2 - 45° - Code: 3039 (Quantity: 1)

  • Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)

Appendix E Shapes - Soft Experiments

E.1 Prompts

Test Prompt
Introductory prompt "Below is a 15 by 15 grid of 0s. I have flipped some 0s into 1s such that a basic geometrical shape has formed. Can you tell me what shape it is?"
Square (feedback) "That’s incorrect. The shape is a square. Can you tell me the length and width?"
Circle (feedback) "That’s incorrect. The shape is a circle. Can you tell me the coordinates of the center?"
Triangle (feedback) "That is incorrect. It is in fact a triangle. Can you tell the length of the base?"
Cross A "Can you tell me the coordinates of the center of the cross and the length of each line, horizontal and vertical?"
Cross B "Draw a cross in a 5 by 5 grid, with horizontal and vertical axes of 3 units of length with the center at (3,3)."
Table 10: Introductory and correction prompts for identifying and detailing specific geometrical shapes in a grid environment.

In the shape detection tests, both GPT-3.5 and GPT-4 showed limited ability to interpret or draw shapes accurately. When tasked with drawing a cross (see Fig. 36), both models initially failed to produce a correct cross but improved slightly after feedback. As Table 11 shows, both models often misidentified the shapes presented, for example describing a circle as a "diamond shape" (GPT-3.5) or an "arrow pointing upwards" (GPT-4). Neither model reliably recovered geometric properties either, frequently giving incorrect dimensions and centers for squares, triangles, and crosses.
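To make the test input concrete, the following is a minimal sketch of how a grid like the one described in the introductory prompt could be generated. The helper names (`blank_grid`, `draw_rectangle`, `render`), the choice of a filled rectangle, and its position are assumptions made for illustration; they are not taken from the ChildPlay code.

```python
def blank_grid(size=15):
    """Return a size x size grid of 0s."""
    return [[0] * size for _ in range(size)]


def draw_rectangle(grid, top, left, height, width):
    """Flip a filled height x width block of cells to 1, starting at (top, left)."""
    for r in range(top, top + height):
        for c in range(left, left + width):
            grid[r][c] = 1
    return grid


def render(grid):
    """Render the grid as space-separated rows, one row per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


# Hypothetical example: a 3 x 4 block placed near the middle of a 15 x 15 grid.
grid = draw_rectangle(blank_grid(), top=5, left=5, height=3, width=4)
print(render(grid))
```

The rendered text could then be appended to the introductory prompt; whether the original shapes were filled or drawn as outlines is not stated here, so this sketch should be read as one possible encoding.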

(a) GPT-3.5’s cross before receiving feedback.

(b) GPT-3.5’s cross after receiving feedback.

(c) GPT-4’s cross before receiving feedback.

(d) GPT-4’s cross after receiving feedback.
Figure 36: Querying the models to draw a cross with axes of length 3 on a 5 by 5 matrix with center at (3,3).
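For reference, a correct answer to the "Cross B" prompt can be sketched as follows. This assumes that coordinates are 1-indexed (row, column) and that the axis length is odd, so the cross is symmetric about its center; the function name `draw_cross` is hypothetical.

```python
def draw_cross(size, center_row, center_col, length):
    """Draw a cross of 1s on a size x size grid of 0s.

    Assumes 1-indexed (row, column) coordinates and an odd axis length.
    """
    half = length // 2
    grid = [[0] * size for _ in range(size)]
    r0, c0 = center_row - 1, center_col - 1  # convert to 0-indexed
    for d in range(-half, half + 1):
        grid[r0][c0 + d] = 1  # horizontal axis
        grid[r0 + d][c0] = 1  # vertical axis
    return grid


cross = draw_cross(size=5, center_row=3, center_col=3, length=3)
for row in cross:
    print(" ".join(str(v) for v in row))
# Prints:
# 0 0 0 0 0
# 0 0 1 0 0
# 0 1 1 1 0
# 0 0 1 0 0
# 0 0 0 0 0
```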
| Test | Query | Correct Answer | GPT-3.5 Response | GPT-4 Response |
|---|---|---|---|---|
| Circle | Shape | Circle | "diamond shape" | "arrow pointing upwards" |
| Circle | Center | (7,7) | "(7,7)" | "(7,7)" |
| Square | Shape | Square | "square" | "’O’" |
| Square | Dimensions | (3,4) | "(4,4)" | "(3,3)" |
| Triangle | Shape | Triangle | "diamond" | "arrow pointing upwards" |
| Triangle | Base Length | 7 units | "7" | "6" |
| Cross | Shape | Cross | "square" | "’plus’ sign (+)" |
| Cross | Center | (5,5) | "(7,7)" | "(6,5)" |
| Cross | Line Lengths | 5 | "5" | "4" |

Table 11: Comparison of responses by GPT-3.5 and GPT-4 in shape detection tests.

Appendix F Conway’s Game of Life - Soft Experiment

We were interested in whether LLMs could predict states from very simple rules. We initially tested this by generating sequential states in Conway’s Game of Life and feeding them to GPT-3.5 and GPT-4, prompting the models for two things: the rules and the next state. We do not include this as a benchmark because the experiment required hand segmentation of areas of interest in the simulated states, and repeatability was achieved only for the patterns of interest. We hope to include a version of this task in the ChildPlay suite later on.

In Conway’s Game of Life, a cellular automaton devised by the mathematician John Horton Conway and introduced to the public by Gardner, cells survive or die according to a few simple rules. We simulated rule B3/S23, which we refer to as "blinking", on boards of varying size. Under this rule, a dead cell is born if it has exactly three live neighbours, and a live cell survives if it has two or three live neighbours and dies otherwise. Rule B3/S23 is known to produce oscillators: configurations that return to their initial state after a fixed number of generations and repeat indefinitely. We prompted the models with three configurations of increasing complexity. The first consisted of three iterations of an oscillator (alternating between 3 horizontal cells and 3 vertical cells). The second used the same oscillator in the same position, with added noise in the form of 4 cells that did not change. The third was a 5-iteration pattern of cells moving outward across the board in a circular formation. These were obtained by repeated simulation using custom software developed for this study and available through GitHub.
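The rule described above can be sketched in a few lines. This is not the custom software used in the study; it is a minimal illustration that assumes a finite board whose out-of-bounds cells are always dead (no wrap-around), and the function name `step` is our own.

```python
def step(grid):
    """Advance a Game of Life board by one generation under rule B3/S23.

    Assumes cells outside the board are dead (no wrap-around).
    """
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r, c):
        total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # skip the cell itself
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += grid[rr][cc]
        return total

    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = live_neighbours(r, c)
            if grid[r][c] == 1:
                new[r][c] = 1 if n in (2, 3) else 0  # survival on 2 or 3
            else:
                new[r][c] = 1 if n == 3 else 0  # birth on exactly 3
    return new


# The 3-cell oscillator: horizontal and vertical phases alternate.
horizontal = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
vertical = step(horizontal)
print(vertical[1][2], vertical[2][2], vertical[3][2])  # 1 1 1
print(step(vertical) == horizontal)  # True
```

Applying `step` twice returns the starting board, which is the period-2 behaviour the models were asked to continue.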

F.1 Prompts

Test Introductory Prompt Second Prompt
Test 1: GPT-4 and GPT-3.5 "The following was produced using Conway’s Game of Life, namely rule ’survival’: [2, 3], ’birth’: [3]. Print the state prior to the first one and the one after the last one."
Test 2: GPT-4 and GPT-3.5 "Print the next pattern. The following are two iterations of the game of life by Conway. You cannot use code" "That’s wrong. The rule is Rule: ’Blinking’: ’survival’: [2, 3], ’birth’: [3]. Try again."
Test 3: GPT-4 and GPT-3.5 "Print the next pattern. The following are two iterations of the game of life by Conway" "That’s wrong. The rule is Rule: ’Blinking’: ’survival’: [2, 3], ’birth’: [3]. Try again."
Requesting Rules: GPT-3.5 and GPT-4 "I will show you two iterations of Conway’s game of life. The first generated the second. You must deduce the survival and birth rules. You must only print these rules, nothing else. Understood?"
Table 12: Prompts for tests related to Conway’s Game of Life.
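The "Requesting Rules" task can be illustrated with a small sketch that collects, from two consecutive boards, which neighbour counts led to a birth or a survival and which did not. The function names and the non-wrapping board are assumptions for illustration, not part of the study’s software.

```python
def count_neighbours(grid, r, c):
    """Count live neighbours of (r, c); cells outside the board count as dead."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                total += grid[r + dr][c + dc]
    return total


def infer_rules(before, after):
    """Collect the neighbour counts observed for each kind of transition."""
    seen = {"birth": set(), "no_birth": set(), "survival": set(), "death": set()}
    for r, row in enumerate(before):
        for c, cell in enumerate(row):
            n = count_neighbours(before, r, c)
            if cell == 0:
                seen["birth" if after[r][c] == 1 else "no_birth"].add(n)
            else:
                seen["survival" if after[r][c] == 1 else "death"].add(n)
    return {key: sorted(values) for key, values in seen.items()}


before = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
rules = infer_rules(before, after)
print(rules)
```

For this pair of boards the evidence supports birth on 3 and survival on 2, but no live cell ever has 3 neighbours, so survival on 3 cannot be confirmed from these two iterations alone. A single oscillator pair therefore under-determines the full rule.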

In the Conway’s Game of Life tests, neither GPT-3.5 nor GPT-4 managed to consistently identify or predict the evolving patterns correctly. Table 13 summarizes their performance, where both models only succeeded in identifying a simple blinking pattern. In more complex scenarios involving patterns before or after a given state, both models returned incorrect responses. Even when explicitly provided with the game’s rules, GPT-3.5 and GPT-4 failed to accurately predict the next pattern or the pattern before.

| Test | Description | Query | GPT-3.5 Response | GPT-4 Response |
|---|---|---|---|---|
| Test 1 | Blinking pattern | Identify the rule | Correct | Correct |
| Test 2 | Blinking pattern | Next pattern (no rule) | Incorrect | Incorrect |
| Test 2 | Blinking pattern* | Next pattern* | Incorrect | Incorrect |
| Test 3 | Complex pattern | Pattern before* | Incorrect | Incorrect |
| Test 3 | Complex pattern | Pattern after* | Incorrect | Incorrect |

* Queries conducted with the explicit rule revealed.

Table 13: Evaluation of GPT-3.5 and GPT-4 responses in Conway’s Game of Life test scenarios.

F.2 Test 1

(a) First iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.

(b) Second iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.

(c) Third iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.

(d) Fourth iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.

(e) Final iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.
Figure 37: Sample taken from 100 iterations of rule B3/S23.

(a) GPT-3.5’s guess of the first iteration after seeing the three iterations that follow.

(b) First iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.

(c) GPT-3.5’s guess of the final iteration after seeing the three iterations prior.

(d) Final iteration of the 5 iterations generated using rule B3/S23 of Conway’s Game of Life.
Figure 38: Prompting GPT-3.5 for the first and last iteration of a 5-sequence long sample from rule B3/S23 of Conway’s Game of Life after showing the middle 3 iterations.

Figure 39: Prompting GPT-4 for the first and last iteration of a 5-sequence long sample from rule B3/S23 of Conway’s Game of Life after showing the middle 3 iterations. The model failed to produce an image for the last iteration and only half-completed the first iteration. This is after several trial runs.

F.3 Test 2

Figure 40: Test 2 consisting of three iterations of a ’blinking’ or ’flashing lights’ object generated by rule B3/S23.

(a) GPT-3.5’s guess of the iteration after seeing the first two iterations of test 2.

(b) GPT-3.5’s guess of the iteration after seeing the first two iterations of test 2 and receiving feedback including the clue about rule B3/S23.

(a) GPT-4’s guess of the iteration after seeing the first two iterations of test 2.

(b) GPT-4’s guess of the iteration after seeing the first two iterations of test 2 and receiving feedback including the clue about rule B3/S23.

F.4 Test 3

Figure 43: Test 3 consisting of three iterations of a ’blinking’ or ’flashing lights’ object generated by rule B3/S23 plus an inert mass.

(a) GPT-3.5’s guess of the final iteration after seeing the two iterations prior.

(b) GPT-3.5’s guess after seeing the first two iterations of test 3 and receiving feedback including the clue about rule B3/S23.

(a) GPT-4’s guess of the final iteration after seeing the two iterations prior.

(b) GPT-4’s guess after seeing the first two iterations of test 3 and receiving feedback including the clue about rule B3/S23.

Appendix G Other Models

We prompted the following models with the Tic-Tac-Toe prompt in Appendix B.1.

[Figure: Tic-Tac-Toe board, Human (O) vs Claude (X). Numbered moves: X1, O2, X3, O4, X5, O6.]

Claude starts with a center move but fails to capitalize on it, leading to a loss. The model places Xs on the top corners but does not create a winning line and does not block the human’s winning move.

[Figure: Tic-Tac-Toe board, Human (O) vs Mistral (X). Numbered moves: X1, O2, X3, O4, X5, O6.]

Mistral mirrors Claude’s strategy, starting in the center but failing to follow through effectively. This game also ends in a loss for the same reason: a missed block.

[Figure: Tic-Tac-Toe board, Human (O) vs Gemini (X). Numbered moves: X1, O2, X3, O4, X5, O6.]

Gemini deviates slightly by focusing on a top-row strategy but still fails to secure a win. The model also missed the blocking move.

All three models fail to block the winning move when they could clearly have done so.
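What "failing to block" means can be made precise with a short check. The sketch below assumes a flat 9-cell board indexed 0 to 8 in row-major order; the position shown is a hypothetical example, not a reconstruction of the games above, and `blocking_moves` is our own helper name.

```python
# All eight winning lines on a 3x3 board, as row-major cell indices.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def blocking_moves(board, opponent):
    """Return the empty cells where the opponent would complete a line next turn."""
    moves = set()
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        if marks.count(opponent) == 2 and marks.count(None) == 1:
            moves.add(line[marks.index(None)])
    return sorted(moves)


# Hypothetical position: X holds the top-left corner and the center,
# O holds both bottom corners and threatens the bottom row.
board = [
    "X", None, None,
    None, "X", None,
    "O", None, "O",
]
print(blocking_moves(board, "O"))  # [7]
```

A player who moves anywhere other than the returned cell in such a position loses on the next turn, which is the pattern observed in all three games.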

The table below shows the kind of non-answers we obtained from a diverse set of Hugging Face models, picked at random from the first two pages of the model listing after sorting by most downloads.

Model Name Model Answer
microsoft/phi-2 [31] “You will play as player 2, therefore you play with the symbol O while your adversary.”
JackFram/llama-68m [22] +—+—+—+—+—+—+—+—+—+ω 1.
openai-community/gpt2-medium [36] “The first player to guess correctly wins.”
sshleifer/tiny-gpt2 [41] “LLM’s turn to guess.” factors factors factors factors factors factors factors factors factors factors factors factors factors factors factors factors factors factors factors.
TinyLlama/TinyLlama-1.1B-Chat-v1.0 [47] def play_tic_tac_toe():
mistralai/Mixtral-8x7B-Instruct-v0.1 [33] Thank you for the instructions. I understand that we are playing a game of Tic-Tac-Toe, and I am player 1, using the mark ’X’. We will be taking turns entering our moves as row and column numbers. The game continues until one player wins by placing three of their marks in a horizontal, vertical, or diagonal row. Here’s an example of how the input could look like for my first move: 0 1 This corresponds.
microsoft/DialoGPT-medium [30] It npmat, unless broken wall…
microsoft/Phi-3-mini-4k-instruct [32] Welcome to the Tic-Tac-Toe challenge! As player 1 with the mark X, you have the first turn. Here’s a simple strategy you might use to begin the game: Place your X in the center square, which is row 1, column 1. Output: “0 1”. By placing your X in the center, you maximize your chances of creating a winning combination, as the center square is a.
distilbert/distilgpt2 [15] “LLM’s turn to guess.” “LLM’s turn to guess.” “LLM’s turn to guess.”
openai-community/gpt2 [35] “The game is played on a 3x3 grid. Players take turns placing their mark.”
Falcon-7B-Instruct [51] Here’s a Python code that should work: