Show, Don’t Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Gonçalo Hora de Carvalho
University of Groningen
[email protected]
Robert Pollice
University of Groningen
Oscar Knap

Abstract

The evaluation of Large Language Models (LLMs) often focuses on linguistic tasks, yet such assessments may not fully capture the models’ general reasoning capabilities. We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded via ASCII, to assess strategic thinking and decision-making. To evaluate the models’ ability to generalize beyond their training data, we introduce two additional games. The first game, LEGO Connect Language (LCL), tests the models’ capacity to understand spatial logic and follow assembly instructions. The second game, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don’t tell" strategy uses games to potentially reveal cognitive capabilities rather than simply querying the models. Our results indicate that, despite their proficiency on standard benchmarks and across temperature settings, the abilities of GPT-3.5 and GPT-4 to play and reason about fully observable games without pre-training are mediocre. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models struggle with the assembly tasks presented in the LCL game. These results suggest that while LLMs like the GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is limited in cognitive flexibility and generalization. Importantly, this reveals a blind spot in current LLM benchmarks that we highlight with our gameplay benchmark suite ChildPlay (GitHub Repository).
Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.

1 Introduction

Typically, LLMs are transformer-based models that process input text and generate output text in a coherent and contextually appropriate manner. They utilize the self-attention mechanism to weigh the importance of different words in a sentence relative to each other [36, 6]. Input text is tokenized, converted into vectors using embeddings, and processed through transformer layers that calculate attention scores to dictate focus on relevant tokens [36, 6, 12]. The model then selects the next token based on learned distributions, iteratively generating an arbitrarily long sequence of text [36, 6, 12]. With their enormous parameter counts, from Alpaca with 7 billion parameters [31], to LLaMA with 65 billion [33] or even PaLM and its 540 billion parameters [11], these neural networks have learned to model complex linguistic abstractions, capturing patterns in syntax, semantics, pragmatics, and even elements of style and tone [6, 7, 23].

Benchmarks for evaluating Large Language Models (LLMs) have been designed to assess comprehension, generation, and adaptability across a wide range of language tasks. Datasets like SQuAD, GLUE, BIG-bench, and the lm-evaluation-harness offer various test types, including multiple-choice questions, reading comprehension exercises, and dialogue completion tasks. These benchmarks deploy metrics such as response correctness, language generation fluency, and the ability to maintain contextually relevant dialogue [24, 37, 2, 14]. Other benchmarks like SuperGLUE, ANLI, TruthfulQA, and HellaSwag have been developed to evaluate different aspects of LLM performance, such as natural language understanding, commonsense reasoning, and factual knowledge about diverse topics [37, 22, 20, 40].

Recent studies have explored alternative approaches to evaluate LLMs’ reasoning abilities in non-linguistic modalities. Liga and Pasetto modeled the game Tic-Tac-Toe using ASCII characters, pitting LLMs against the minimax algorithm to observe emergent features, which, according to the authors, might be akin to consciousness [19]. The minimax algorithm is widely considered the optimal algorithm for playing Tic-Tac-Toe, as it guarantees at least a draw against any opponent, including a perfect one [29, 1]. While LLMs performed well in some instances, they generally failed to win against the minimax algorithm, often resulting in a draw [19]. Topsakal and Harper [32] used Tic-Tac-Toe encoded with list and illustration prompts in their study. They found that while GPT-4 secured the most wins, it did not always win, indicating that GPT models cannot play Tic-Tac-Toe optimally. This contradiction raises the question: can we truly say the model knows how to play Tic-Tac-Toe if it can explain optimal strategies (see Appendix A.5) but does not consistently win? Or is its performance merely the result of probabilistic outcomes?
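To make the notion of optimal play concrete, the following is a minimal sketch of minimax for Tic-Tac-Toe. It is not the implementation used in the cited studies or in ChildPlay; the board encoding (a 9-character string read row by row) and the function names are assumptions made for illustration.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value of `board` (9-char string) with `player` to move.

    +1 means X wins under perfect play, -1 means O wins, 0 is a draw.
    """
    won = winner(board)
    if won is not None:
        return 1 if won == "X" else -1
    if " " not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    values = [minimax(board[:i] + player + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == " "]
    # X maximizes the value, O minimizes it.
    return max(values) if player == "X" else min(values)


def best_move(board, player):
    """Index of one move that is optimal for `player` on `board`."""
    nxt = "O" if player == "X" else "X"
    moves = [i for i, cell in enumerate(board) if cell == " "]

    def score(i):
        return minimax(board[:i] + player + board[i + 1:], nxt)

    return max(moves, key=score) if player == "X" else min(moves, key=score)
```

Evaluating the empty board with this sketch gives a value of 0, which reflects the well-known fact that Tic-Tac-Toe is a draw under perfect play from both sides.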

Some critical studies have highlighted the need for caution in interpreting LLMs’ capabilities through benchmarking. Lappin assessed their strengths and weaknesses, finding that they excel at many language tasks but struggle with deeper reasoning, world knowledge integration, and context understanding beyond local co-occurrences [18]. Zečević et al. argued that LLMs may discuss causality but lack true causal reasoning based on interventions and counterfactuals [41].

Bender et al. argue that the lack of transparency and potential risks associated with these large, opaque models raise concerns about their trustworthiness and accountability [3]. While the criticism of Bender et al. focuses on the social dimension of the problem of interpretability and trustworthiness, recent work by Schaeffer et al. critiques claims of emergent capabilities and the perceived intelligence of LLMs. They suggest that some claimed "emergent abilities" of LLMs may be an artifact of the choice of evaluation metric, rather than fundamental changes in model behavior [25]. Their analyses demonstrate how the use of nonlinear or discontinuous evaluation metrics can create the illusion of emergent abilities, even when the underlying model performance changes smoothly and predictably with scale.
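The metric argument can be illustrated with a small worked example. The sketch below is not taken from Schaeffer et al.; it assumes, purely for illustration, that a model gets each token right independently with probability `per_token_accuracy`, and it compares that smooth quantity with an all-or-nothing exact-match score over a sequence of `length` tokens.

```python
def exact_match(per_token_accuracy, length):
    """Probability that all `length` tokens are correct.

    Assumes independent per-token errors (an illustrative simplification).
    """
    return per_token_accuracy ** length


if __name__ == "__main__":
    # Per-token accuracy improves smoothly, yet exact match on 10 tokens
    # stays near zero for a long time and then rises steeply.
    for p in (0.5, 0.7, 0.9, 0.99):
        print(f"per-token {p:.2f} -> exact match {exact_match(p, 10):.4f}")
```

Under this assumption, a per-token accuracy of 0.9 yields an exact-match rate of only about 0.35 on 10-token answers, so a smoothly improving model can look as if it suddenly acquired an ability when scored with the discontinuous metric.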

This critique of the evaluation metrics used in assessing LLMs invites a deeper exploration of general intelligence, specifically how it can be reliably measured and observed in AI through rigorous and realistic tests that extend beyond linguistic prowess to include broader cognitive functions. If we must define general intelligence (GI), one option is to use the "g factor," which refers to the ability to reason, plan, solve problems, think abstractly, and learn quickly across a wide range of domains [39, 4, 38, 9, 8]. GI then involves higher-order cognitive processes that go beyond specific skills or knowledge domains [15, 16].

A critical issue that arises in analysing the reasoning capabilities of large and opaque models like the GPT series is training-test set cross-contamination, which becomes increasingly problematic for the most advanced models [6]. The massive training datasets used, comprising extensive portions of the internet, are often untraceable and completely anonymous to researchers outside the initial developer groups, and to some extent even to the developers themselves, making replication studies impossible [6, 13]. The exact amount and identity of data used to train models like GPT-3.5 or GPT-4 have not been publicly disclosed, posing a risk of rendering current benchmarking efforts meaningless due to cross-contamination.

Researchers have attempted to counter the contamination problem using N-Gram Overlap as a metric for detection, by eliminating or withholding results for tests where answers were present in the training data [6]. However, this method has been criticized. Blodgett et al. point out, for example, that such heuristic approaches to mitigating biases in NLP systems can be problematic and may not fully address the underlying challenges [5]. The method is also limited in that it fails to consider the context in which N-Grams appear and may discount synonymous or analogous text worded differently. Additionally, the decision to use a 200-character window around detected N-Grams during training of GPT-3 is arbitrary and may not accurately reflect the influence of surrounding text on model learning [6].
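The limitation regarding reworded text can be seen in a minimal sketch of an N-Gram overlap check. This is not the procedure of [6]; the word-level tokenization, the choice of `n=3`, and the function names are assumptions chosen to keep the example short.

```python
def ngrams(text, n):
    """Set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_text, n=3):
    """Fraction of the item's n-grams that also occur in the training text."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(training_text, n)) / len(item)


if __name__ == "__main__":
    training = "the cat sat on the mat today"
    # A verbatim copy is flagged as fully overlapping ...
    print(overlap_fraction("the cat sat on the mat", training))
    # ... while a paraphrase with the same meaning is not flagged at all.
    print(overlap_fraction("a feline rested upon the rug", training))
```

A verbatim test item is detected, whereas a paraphrase of the same content yields zero overlap, which is exactly the case in which such a check would miss contamination.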

We argue that there is a need for nuance in current debates and a pragmatic perspective on understanding LLMs’ capabilities. In order to approximate some measurement of GI in an AI system, it is important that we build benchmarks that allow measurements that can truly gauge generalization and reasoning in a human-like manner, rather than relying solely on pattern matching and statistical correlations [35].

In this work we introduce ChildPlay, a suite of non-language-based games like Tic-Tac-Toe, Connect-Four, Battleship, LEGO Connect Language, and the game of Shapes, to assess reasoning, strategic capabilities, symbolic reasoning, and pattern recognition abilities of large language models (LLMs) beyond traditional linguistic modalities. Games provide structured environments with clear success criteria, making them suitable for evaluating strategic thinking, planning, and long-term decision-making of LLMs. Their dynamic and adversarial nature resembles real-world scenarios, assessing generalized intelligence and reasoning beyond the training domain [28, 19, 32]. We encode these games using ASCII representations to minimize dataset contamination issues prevalent in contemporary LLM benchmarks [6, 19].

2 Experiments

Specific tasks in the BIG-bench benchmark [2], among others, are categorized as either zero-shot, one-shot, or multi-shot [6]. Our tasks fit the zero-shot category, as models are given only a brief explanation at inference time with no examples for playing beyond the explained formalism. To demonstrate the reasoning capabilities of LLMs beyond their training data, we focus on a modality not explicitly trained for: spatial reasoning about ASCII sequences. An agent capable of true abstraction should be able to encode and interpret these sequences if the rules are explained or known.

For this purpose, we developed several tasks, including LEGO assembly, ASCII games of Tic-Tac-Toe, Connect-Four, and Battleship, as well as identifying simple geometrical shapes represented as 1s in 15-sided grids of 0s. The same models were deployed over all experiments, namely gpt-3.5-turbo-1106 and gpt-4-1106-preview, which in this paper are referred to as GPT-3.5 and GPT-4, respectively. Every experiment was tested across different temperature settings (t) per model, namely t=0, t=0.5, t=1, and t=1.5. When asked about their understanding of the tasks, GPT-3.5 and GPT-4 were able to generate board states and explain the queried games, including their rules and optimal play. Thus, we consider the tests valid: if the models are truly capable of reasoning, they should be able to play these games optimally given that they "know" and are capable of explaining what playing optimally means (see Appendix A.5). Experiments ran overnight, taking a couple of minutes at minimum and a few hours at most.

Lego Connect Language (LCL) We invented a formal language we call LEGO Connect Language (LCL). More specifically, we propose $LCL_{2}$ as a language to instruct assembly in 2D on the x and y axes (this can easily be generalised to $LCL_{3}$, with instructions along the x, y, and z axes). The models were given instructions and their output was fed through a visualizer script to generate the images contained in this work. Only 2x4 pieces were allowed. A piece $P$ (see Fig. 1) is defined as a tuple $P=(l,w,(x,y),c,h)$. A construction $M$ is a valid construction in $LCL_{2}$ if no pieces overlap and all pieces are connected to other pieces. First, a Lego piece is connected through interlocking pegs, not by merely touching sides. Second, two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate.

Figure 1(a): A valid humanoid construct in $LCL_{2}$.

Game 1: Validity Testing In this experiment, we evaluate the ability of different models to validate the correctness of a given Lego construct. The constructs are generated to be either valid or invalid. A construct is considered valid if there is no horizontal overlap between pieces and if the pieces connect via overlapping pegs such that the whole assembly is connected (no floating pieces). The models, namely GPT-4 and GPT-3.5, are then tasked with predicting the validity of each construct. The evaluation metric for this experiment was the proportion of correct validations, measured across different temperature settings.
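The two validity rules can be sketched as a small checker. This is not the ChildPlay validator; it is a hypothetical reading of the rules that assumes each piece is given as `(x, y, color)`, spans 4 studs along x starting at `x`, occupies a single row `y`, and interlocks with a piece on an adjacent row whenever their x-ranges share at least one stud.

```python
from collections import deque

PIECE_LENGTH = 4  # assumption: a 2x4 brick spans 4 studs along x in 2D


def overlaps(a, b):
    """Two pieces overlap if they sit on the same row and share any x."""
    return a[1] == b[1] and abs(a[0] - b[0]) < PIECE_LENGTH


def interlocks(a, b):
    """Pieces on adjacent rows connect if their x-ranges share a stud."""
    return abs(a[1] - b[1]) == 1 and abs(a[0] - b[0]) < PIECE_LENGTH


def is_valid_construct(pieces):
    """True if no two pieces overlap and the whole assembly is connected."""
    if not pieces:
        return False
    for i, a in enumerate(pieces):
        for b in pieces[i + 1:]:
            if overlaps(a, b):
                return False
    # Breadth-first search over the "interlocks" relation from piece 0.
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(len(pieces)):
            if j not in seen and interlocks(pieces[i], pieces[j]):
                seen.add(j)
                queue.append(j)
    return len(seen) == len(pieces)
```

Under these assumptions, two bricks stacked with an offset, such as `[(0, 0, "red"), (2, 1, "blue")]`, form a valid construct, while two bricks placed side by side on the same row touch only at their sides and are therefore rejected as disconnected.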

Game 2: Construct Generation In this experiment, the models attempt to generate valid LCL constructs. Each construct description consists of a list of tuples, where each tuple specifies the coordinates and color of a Lego piece. The models generated these constructs based on prompts and the validity of the constructs was automatically evaluated. The metric for this experiment was the proportion of valid constructs generated, measured across different temperature settings.

We automatically produced 800 images for the validity test, half valid and half invalid ones. Then each model was queried to produce 100 images at each temperature setting, which we then checked for validity. We believe our use of LCL is related to the tests found in Bubeck et al. [7], where JavaScript or LaTeX was used to prompt GPT-4 to produce images. However, while the images in Bubeck et al. [7] included common examples such as letters, a car, a truck, a cat, a dog, a person, a pig, a house, and a unicorn, all of which are likely represented in the training data in JavaScript or LaTeX, LCL challenges the model to step outside of its learned data distributions by remaining abstract.

Three Board Games: Tic-tac-toe, Connect-four, and Battleship In the case of the three board games, each new board state was accompanied by the introductory game explanation sent through the OpenAI API in a zero-context testing environment. The models were provided with the current board state and an opponent making moves at random, with the LLM always playing as the first player, which is advantageous in all three games. Context beyond the initial instruction and the current board state was deemed irrelevant since these games are fully observable, meaning every board state contains all the necessary information to play optimally. The input to the game was simply two scalars for the row-column pair or just a scalar for the column number in the case of connect-four.
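A minimal sketch of this setup for Tic-Tac-Toe is given below. The exact ASCII layout used in ChildPlay is not reproduced here; the separators, the 0-based row-column convention, and the function names are assumptions for illustration only.

```python
def render(board):
    """Render a 3x3 board (list of rows of single characters) as ASCII."""
    return "\n-+-+-\n".join("|".join(row) for row in board)


def apply_move(board, row, col, mark):
    """Place `mark` at (row, col).

    Returns False for an illegal move (off the board or already occupied),
    in which case the board is left unchanged.
    """
    if not (0 <= row < 3 and 0 <= col < 3) or board[row][col] != " ":
        return False
    board[row][col] = mark
    return True


if __name__ == "__main__":
    board = [[" "] * 3 for _ in range(3)]
    apply_move(board, 1, 1, "X")  # the model answers with a row-column pair
    print(render(board))          # this text is what the next prompt shows
```

Because the rendered board contains the full game state, each prompt can be built from the game explanation plus this text alone, without any conversation history.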

For the Battleship game, ships (’S’) were randomly initialized, always horizontally, with varying sizes spanning between 2 and 5 cells. When either player scores a hit, the position is marked with an ’X’ on both players’ boards. If the guess is a miss, an ’O’ is placed on the position instead.
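The board mechanics described above can be sketched as follows. The board size, the number and lengths of the ships, the water symbol `~`, and the function names are all assumptions; only the horizontal placement and the ’S’, ’X’, and ’O’ markers come from the description.

```python
import random


def place_ships(size=5, lengths=(2, 3), rng=None):
    """Return a size x size board with ships placed horizontally at random.

    Ships never overlap. `size` and `lengths` are illustrative defaults.
    """
    rng = rng or random.Random()
    board = [["~"] * size for _ in range(size)]
    for length in lengths:
        while True:
            row = rng.randrange(size)
            col = rng.randrange(size - length + 1)
            if all(board[row][c] == "~" for c in range(col, col + length)):
                for c in range(col, col + length):
                    board[row][c] = "S"
                break
    return board


def fire(board, row, col):
    """Mark a guess: 'X' for a hit, 'O' for a miss. Returns True on a hit.

    Guessing a cell that was already guessed is treated as illegal.
    """
    if board[row][col] in ("X", "O"):
        raise ValueError("cell already guessed")
    hit = board[row][col] == "S"
    board[row][col] = "X" if hit else "O"
    return hit
```

Passing a seeded `random.Random` instance makes the placement reproducible, which is convenient when the same boards should be replayed across models or temperature settings.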

The Game of Shapes In the case of the game of shapes, preliminary work involved probing the models to determine what geometric shapes they consider basic by prompting them multiple times. The first three shapes consistently mentioned were square, circle, and triangle (not necessarily in that order). The game then consists of finding a basic geometric shape "hidden" behind 1s within a matrix of 0s in a multiple-choice fashion. Four shapes were used as options: the circle, the square, the triangle, and the cross, but only the latter three were ever shown to the model (cf. Fig. 3).
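A board of this kind can be sketched in a few lines. The grid is assumed here to be 15x15, and the shapes are assumed to be filled rather than outlined; neither detail, nor the function names, is taken from the ChildPlay code.

```python
def empty_grid(n=15):
    """An n x n matrix of 0s (n=15 is an assumed reading of the setup)."""
    return [[0] * n for _ in range(n)]


def draw_square(grid, top, left, side):
    """Fill a side x side block of 1s with its corner at (top, left)."""
    for r in range(top, top + side):
        for c in range(left, left + side):
            grid[r][c] = 1
    return grid


def draw_cross(grid, center_row, center_col, arm):
    """Draw a plus-shaped cross of 1s with arms of length `arm`."""
    for d in range(-arm, arm + 1):
        grid[center_row + d][center_col] = 1
        grid[center_row][center_col + d] = 1
    return grid


def to_text(grid):
    """Serialize the grid as rows of 0/1 characters for the prompt."""
    return "\n".join("".join(str(v) for v in row) for row in grid)


if __name__ == "__main__":
    print(to_text(draw_cross(empty_grid(), 7, 7, 3)))
```

The serialized text, together with the four answer options, would then form the single question put to the model.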

3 Results

As previously stated, Tic-Tac-Toe as a benchmark has been tackled before [19, 32]. Since it is quite popular, we decided to replicate it before creating new games. This time, however, we use an ASCII encoding instead of a list of moves so that we can gauge spatial reasoning through symbolic reasoning. For comparison with the models’ performance, Fig. 4 presents the Tic-Tac-Toe match results of the minimax algorithm against the same random player the models played against. This outcome creates a baseline for optimal play against a random player.

Tic-tac-toe, Connect-four, and Battleship To check for a win, we determine if the player has successfully connected the winning number of pieces in a row on the board, which could be horizontally, vertically, or diagonally. To detect missed and blocking moves, we simulate all potential moves for the player by checking if placing a piece in any legal position leads to a win. If such a move is found and the player does not execute it on their turn, it is recorded as a missed win. If such a move is found for the opponent and the player does not block it, we register it as a missed blocking move. We define an incorrect move as an illegal move, such as playing a position that has already been played. An incorrect move results in an immediate loss.
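The detection of missed wins and missed blocks can be sketched for Tic-Tac-Toe as follows. This is an illustrative reconstruction rather than the ChildPlay code: the 9-character board string, the label names, and the choice not to count a winning move as a missed block are assumptions.

```python
# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def wins(board, mark):
    """True if `mark` occupies a full line on `board`."""
    return any(all(board[i] == mark for i in line) for line in LINES)


def winning_moves(board, mark):
    """All empty cells where placing `mark` wins immediately."""
    return [i for i, cell in enumerate(board)
            if cell == " " and wins(board[:i] + mark + board[i + 1:], mark)]


def classify_move(board, move, player, opponent):
    """Label the player's chosen `move`, judged on the board before it."""
    if not (0 <= move < 9) or board[move] != " ":
        return {"incorrect_move"}  # illegal move: immediate loss
    labels = set()
    own = winning_moves(board, player)
    threats = winning_moves(board, opponent)
    if own and move not in own:
        labels.add("missed_win")
    # Assumption: winning outright is not counted as a missed block.
    if threats and move not in threats and move not in own:
        labels.add("missed_block")
    return labels
```

For the board `"XX OO    "` with X to move, playing cell 8 is labelled as both a missed win (cell 2 wins) and a missed block (cell 5 stops O), whereas playing cell 2 receives no label.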

Fig. 5 encompasses comparative results from playing Connect-Four, Tic-Tac-Toe, and Battleship. Each subfigure, 5(a), 5(b), and 15, respectively, outlines the number of games won by the models.

Unfortunately, the models were incapable of following the rules of the Battleship game: regardless of temperature, the models lose the large majority of games, with GPT-4 not winning a single game due to incorrect moves (cf. Fig. 17). GPT-3.5 wins around 10% of the matches at low temperatures but none at higher temperatures; see Fig. 15 in Appendix A.3.3.

It is notable that both GPT-3.5 and GPT-4 exhibit their poorest performance in both Connect-Four and Tic-Tac-Toe at a temperature setting of 0, indicative of deterministic play that reflects the models’ learned strategies (Appendix A.3). The Random Player’s normal distribution across columns (Fig. 13) suggests a lower likelihood of countering GPT’s central strategies, in both games, but particularly at Tic-Tac-Toe where GPT-3.5 commits more errors than GPT-4, significantly impacting outcomes due to incorrect moves (Fig. 5(b)). These errors generally increase with temperature, probably due to enhanced choice randomness (Fig. 11). This explains the lack of direct model losses from final defeating moves since losses often result from illegal moves.

Average game moves, missed wins, and missed blocks in both Tic-Tac-Toe and Connect-Four are further illustrated in Figs. 6(a) and 6(b). Both subfigures show that the number of missed wins and blocks decreases as temperature increases, which might suggest that higher temperature settings broaden the moves explored within the models’ learned strategy, if there is any. Nevertheless, neither model plays the games optimally, as evidenced by the considerable number of missed wins and missed blocks.

The number of moves of GPT-3.5 and GPT-4 per game (see Fig. 6) can be thought of as a measurement of stability in gameplay, not just against the random player but in general, given that a longer game entails that the model is not losing to illegal moves or to its opponent. Error bars for the LCL results as well as for average moves in the board games are computed using the standard deviation $\sigma=\sqrt{\frac{p(1-p)}{n}}$, where $p$ is the proportion of correct identifications and $n$ is the total number of trials or identifications. The number of moves increases linearly with temperature, inversely correlated with performance measured by the decrease in missed wins and blocks. Tic-Tac-Toe shows a linear improvement, whereas Connect-Four experiences an exponential boost in performance from temperature 0 to 0.5, followed by a linear decline. The random player consistently performs better against GPT-3.5 in Tic-Tac-Toe but loses more frequently in Connect-Four. Both models struggle with blocking or seizing winning moves from the random player. An analysis of the move heatmaps (cf. Appendix A.3) explains why winning Connect-Four against a random player appears straightforward. As the model consistently places pieces in the same column, the probability of the random player losing increases with the board size. However, even under these challenging conditions, the random player still secures wins in at least 20% of the games played against GPT-4.
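The error-bar formula $\sigma=\sqrt{p(1-p)/n}$ can be evaluated directly. The sketch below only illustrates that arithmetic; the function name is hypothetical, and the example values are chosen for illustration rather than read off the figures.

```python
import math


def proportion_error(p, n):
    """Error bar sqrt(p * (1 - p) / n) for a proportion p over n trials."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Illustrative values only: half of 100 trials correct.
    print(proportion_error(0.5, 100))
```

The error bar is largest at $p=0.5$ and shrinks with the square root of the number of trials, so quadrupling $n$ halves it.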

Shapes In the game of Shapes, a correct detection happens when the player’s selected shape corresponds with the shape shown on the board. Players have four choices: "circle," "triangle," "square," and "cross". Notably, a circle is never actually displayed to the model, and the positions of these choices are not randomized to test if the model displays any inherent bias for the question order. This does not affect the outcome, since the game does not change across different sessions as it is designed to operate within a single question-response framework.

In the shape detection test results in Fig. 7, we see that GPT-3.5’s performance was approximately equivalent to random chance when identifying triangles and crosses, yet it completely failed to recognize squares. In contrast, GPT-4 performed remarkably well, successfully identifying shapes with an accuracy of $\approx$ 80% and demonstrating particular proficiency at recognizing triangles.

LCL In the game of LCL, both models systematically failed to respect the two rules, namely that Lego pieces must be connected through interlocking pegs, not by merely touching sides, and that no Lego pieces may overlap, which occurs when they share the same y-coordinate and any part of their length has the same x-coordinate. For example, Figs. 8(a) and 8(b) show valid LCL assemblies, while Fig. 8(c) shows an invalid LCL structure (all three were generated for the validity test). Figs. 8(d) and 8(f) show invalid output from GPT-3.5 (generated at temperature 0 and temperature 1, respectively), Fig. 8(e) shows a valid output from GPT-4 (generated at temperature 1.5), and Fig. 8(h) shows an example of an invalid output by GPT-4 (generated at temperature 1).

Fig. 9 (error bars computed as for Fig. 6) shows a roughly linear increase in the proportion of correct answers for GPT-3.5 during the validity test as a function of temperature, while GPT-4 peaks at temperature 0.5 and then declines. However, only GPT-4 produced a small minority of valid LCL constructs (a proportion of 0.04 of a total of 400, i.e., 16), while GPT-3.5 did not manage to produce a single valid LCL construct.

4 Discussion

In Tic-Tac-Toe, both models underperform relative to the minimax algorithm baseline, while showing mixed performance at Connect-Four. GPT-4 performs unexpectedly well at the Shapes game, but GPT-3.5 does very poorly. Also unexpectedly, both models fail to assemble or detect valid Lego structures in the LCL game. In Battleship, the models’ failure to follow game rules, especially at higher temperature settings, indicates a significant limitation in their ability to understand and apply structured game rules. The linear increase in the number of moves with temperature suggests that higher temperatures lead to greater exploration of possible moves, but do not improve strategic performance. The increase in missed wins and blocks with temperature further supports this, as greater randomness in decision-making does not enhance the models’ strategic play.

Overall, these results show that while GPT-3.5 and GPT-4 can play simple games to some extent, they struggle with more complex tasks and do not consistently apply optimal strategies. The performance gap between the models and the minimax algorithm highlights the limitations of current language models in tasks requiring precise strategic reasoning and the failure to play Battleship and LCL demonstrates a failure in rule adherence.

The primary aim of contemporary benchmarks for LLMs has been to assess these models through adaptations of Turing’s test [34], evaluating their capability to process and respond to language inputs comparably to humans. However, defining the language problem solely in these terms may overlook deeper complexities. While the transformer architecture in deep neural networks has enabled models smaller than GPT-4 to exhibit what Wilhelm von Humboldt described as the "infinite use of finite means" [21] or their ability to generate a potentially unlimited number of contextually relevant sentences [30] (an idea popularised by Chomsky [10]), this does not necessarily imply that these models have mastered a form of reasoning. Rather, they may simply be engaging in an advanced form of pattern imitation.

4.1 Limitations and Future Work

Our proposed benchmark, ChildPlay, primarily uses binary (win/loss) outcomes for games, which can be considered discontinuous metrics. This formulation may exaggerate perceived capabilities by registering a full loss even if the model’s failure was marginal. We try to avoid this simplistic classification by registering, for example, the choice of moves on the board games (see Appendix A.3) as well as the count of missed blocks and missed wins (cf. Fig. 6). In contrast, tasks involving shape recognition or LCL could utilize more continuous metrics, providing a smoother performance gradient and potentially more accurate reflections of a model’s reasoning abilities.

Using discontinuous metrics in strategic games could manifest as sharp transitions in model evaluation, accentuating a sudden jump in perceived ability when the model first succeeds. Nonlinear metrics in the shape game or LCL tasks may not exhibit such abrupt transitions but could still misrepresent gradual improvements.

Based on Schaeffer et al.’s perspective, one could argue that the games proposed in ChildPlay may not entirely reflect true generalization or emergent abilities [25]. If these benchmarks are akin to nonlinear or discontinuous metrics, they might exaggerate the weaknesses or strengths of LLMs in strategic games. For instance, a sharp failure in a game like Tic-Tac-Toe might not mean the model lacks strategic reasoning universally but that it fails under the specific discontinuous conditions of the game setup, or of temperature. Such an assessment could lead to the erroneous conclusion that LLMs are generally poor at strategic decision-making when, in fact, they might only be unsuited to the specific scenarios or metrics used in ChildPlay.

Conversely, unlike continuous metrics that might smooth over deficiencies and give a misleading picture of gradual improvement, the use of games as benchmarks could prove a better test of an LLM’s cognitive and strategic abilities regardless of metric continuity (given that the model has not been overfitted on the game).

Regarding future work, we hope to test a more diverse set of models, including open-source ones. We believe that no existing model will excel at the ChildPlay benchmark, but we look forward to testing other models, particularly to develop algorithms that apply deep reinforcement learning such as in Schrittwieser et al., Kaiser et al., and Silver et al. [26, 17, 27].

5 Conclusions

Non-language-based tasks are important as they challenge models to demonstrate generalization across different information encodings or forms of input, and, most importantly, to delve into out-of-training-distribution topologies. Testing LLMs like GPT-4 (according to OpenAI, the current contender to AGI [7]) beyond the text they were primarily trained on via our "show, don’t tell" strategy, we demonstrate that they are still mediocre at best at even very simple reasoning tasks that are outside of their training data. The models fail to play optimally at very simple games, such as Tic-Tac-Toe, Battleship, and Connect-Four. We also experimented with LEGO assembly, finding that the LLMs still perform poorly. Mixed results were found at the task of interpreting geometric shapes from binary grids. These tasks are designed to test reasoning without relying on language skills, such that the model cannot get by through parroting; it must be capable of playing the game. Currently, the "non-language" category of the BIG-bench benchmark shows 16 active tasks, including explicit ASCII recognition tasks, chess, and Sudoku, but, to the best of our knowledge, no task like ours [2]. Hence, we believe that ChildPlay is a useful addition to the suite of current established LLM benchmarks.

In general, this work shows that developing games allows us to critically examine claims regarding a model’s ability to reason and solve problems regardless of the persistent issue of data contamination. In other words, we explore what the model knows by making it play games instead of asking it how to play them. Our results suggest that current LLMs show disappointing performance in terms of problem-solving capabilities and reveal important aspects to be considered for future improvements.

References

Alkaraz et al. [2020] Shahd H. Alkaraz, Essam El-Seidy, and Neveen S. Morcos. Tic-tac-toe: Understanding the minimax algorithm, 2020. URL https://api.semanticscholar.org/CorpusID:218798654.
bench authors [2023] BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
Bender et al. [2021] Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big?, 03 2021.
Binet and Simon [1961] Alfred Binet and Theodore Simon. The development of intelligence in children, 1961.
Blodgett et al. [2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of "bias" in nlp, 2020. URL https://arxiv.org/abs/2005.14050.
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv, abs/2303.12712, 2023. URL https://api.semanticscholar.org/CorpusID:257663729.
Carroll [1993] John B. Carroll. Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press, 1993. doi: 10.1017/CBO9780511571312.
Cattell [1963] Raymond B. Cattell. Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54(1):1–22, 1963. doi: 10.1037/h0046743.
Chomsky [1957] Noam Chomsky. Syntactic Structures. Mouton and Co., The Hague, 1957.
Chowdhery et al. [2024] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: scaling language modeling with pathways. J. Mach. Learn. Res., 24(1), mar 2024. ISSN 1532-4435.
Fields et al. [2024] John Fields, Kevin Chovanec, and Praveen Madiraju. A survey of text classification with transformers: How wide? how large? how long? how accurate? how expensive? how safe? IEEE Access, 12:6518–6531, 2024. URL https://api.semanticscholar.org/CorpusID:266824505.
Floridi and Chiriatti [2020] L. Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681 – 694, 2020. URL https://api.semanticscholar.org/CorpusID:228954221.
Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
Gottfredson [1997] Linda S. Gottfredson. Why g matters: The complexity of everyday life. Intelligence, 24(1):79–132, 1997. ISSN 0160-2896. doi: https://doi.org/10.1016/S0160-2896(97)90014-3. URL https://www.sciencedirect.com/science/article/pii/S0160289697900143. Special Issue Intelligence and Social Policy.
Jensen [1998] A.R. Jensen. The g factor: The science of mental ability. Westport, CT: Praeger, 1998.
Kaiser et al. [2020] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-based reinforcement learning for atari, 2020.
Lappin [2023] Shalom Lappin. Assessing the strengths and weaknesses of large language models. Journal of Logic, Language and Information, 33:1–12, 11 2023. doi: 10.1007/s10849-023-09409-x.
Liga and Pasetto [2023] Davide Liga and Luca Pasetto. Testing spatial reasoning of large language models: the case of tic-tac-toe, 2023. URL https://ceur-ws.org/Vol-3563/paper_14.pdf.
Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958.
Merrill [2023] William Merrill. Formal languages and neural models for learning on sequences. In International Conference on Graphics and Interaction, 2023. URL https://api.semanticscholar.org/CorpusID:261101973.
Nie et al. [2020] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https://arxiv.org/abs/1910.14599.
Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250.
Schaeffer et al. [2023] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage?, 2023. URL https://arxiv.org/abs/2304.15004.
Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, December 2020. ISSN 1476-4687. doi: 10.1038/s41586-020-03051-4. URL http://dx.doi.org/10.1038/s41586-020-03051-4.
Silver et al. [2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
Srivastava et al. [2023] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. 
Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. 
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. 
Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. URL https://arxiv.org/abs/2206.04615.
Swaminathan et al. [2020] Bala Swaminathan, R Ekke Vaishali, and R subashriTS. Analysis of minimax algorithm using tic-tac-toe, 2020. URL https://api.semanticscholar.org/CorpusID:228863323.
Sweet [1989] Paul Robinson Sweet. On language: The diversity of human language-structure and its influence on the mental development of mankind. By Wilhelm von Humboldt. Translated by Peter Heath. Historiographia Linguistica, 16:387–392, 1989. URL https://api.semanticscholar.org/CorpusID:170369059.
Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Topsakal and Harper [2024] Oguzhan Topsakal and Jackson Harper. Benchmarking large language model (llm) performance for game playing via tic-tac-toe. Electronics, 13:1532, 04 2024. doi: 10.3390/electronics13081532.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
Turing [1950] A. M. Turing. Computing machinery and intelligence. Mind, LIX(236):433–460, 10 1950. ISSN 0026-4423. doi: 10.1093/mind/LIX.236.433. URL https://doi.org/10.1093/mind/LIX.236.433.
van Dijk et al. [2023] Bram M. A. van Dijk, Tom Kouwenhoven, Marco R. Spruit, and Max J. van Duijn. Large language models: The need for nuance in current debates and a pragmatic perspective on understanding, 2023. URL https://arxiv.org/abs/2310.19671.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wang et al. [2019] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
Wechsler [1944] David Wechsler. The Measurement of Adult Intelligence. Williams & Wilkins Co., 3rd edition, 1944. doi: 10.1037/11329-000.
Wright [1904] Wm. R. Wright. General intelligence, objectively determined and measured., 1904. URL https://api.semanticscholar.org/CorpusID:144456697.
Zellers et al. [2019] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.
Zečević et al. [2023] Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal, 2023. URL https://arxiv.org/abs/2308.13067.

Appendix

Appendix A Three Board Games: Tic-Tac-Toe, Connect-Four, and Battleship

A.1 Prompts

Game	Introductory Prompt
Battleship	"Battleship is a two-player guessing game where each player has a fleet of ships on a secret grid and then takes turns guessing the locations of the opponent’s ships. The objective is to sink all of the opponent’s ships by correctly guessing their locations. O’s in a board mean that the player selected a square to attack and there was no ship there - it’s a miss. Had there been a ship there, instead of an O you would see an X. In your board, an <S> signifies a ship position, and a <~> signifies the sea. Your input is just two numbers with a space in between, one for the row (from 0 to <self.board_size-1>) and one for the column (from 0 to <self.board_size-1>), like: 0 0, nothing else. Do not output anything else but the row col values."
Tic-Tac-Toe	"Tic-Tac-Toe is a two-player game played on a 3x3 grid. Players take turns placing their mark, X or O, in an empty square. The first player to place three of their marks in a horizontal, vertical, or diagonal row wins the game. You will play as player 1, therefore you play with X while your adversary plays with the symbol O. Your input is then a number (from 0 to 2) for the row followed by a space and another number (from 0 to 2) for the column, nothing else. Do not output anything else but the row col values else you lose."
Connect-Four	"Connect-Four is a two-player game. The pieces fall straight down, occupying the next available space within a column. The objective of the game is to be the first to form a horizontal, vertical, or diagonal line of four of one’s own discs. In a board, player 1, you, plays with symbol X, while player 2, your opponent, plays with symbol O. Your input is just a number from 0 to 6, nothing else. Do not output anything else but the col value else you lose."

Table 1: The three introductory prompts used for the board games in the ChildPlay suite.
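The prompts above constrain the reply to a bare "row col" pair. A minimal sketch of how such a reply could be checked before it is applied to a board is given below; the function name `parse_move` and the strict-rejection policy are assumptions made for illustration and are not taken from the ChildPlay code.

```python
import re


def parse_move(reply, board_size):
    """Return (row, col) if the reply is exactly 'row col' and in bounds, else None.

    Hypothetical helper: the name and the rejection policy are assumptions.
    """
    # The whole reply must be two non-negative integers separated by whitespace.
    match = re.fullmatch(r"\s*(\d+)\s+(\d+)\s*", reply)
    if match is None:
        return None
    row, col = int(match.group(1)), int(match.group(2))
    # Coordinates run from 0 to board_size - 1, as stated in the prompts.
    if row >= board_size or col >= board_size:
        return None
    return row, col


print(parse_move("0 0", 3))
print(parse_move("I choose 1 1", 3))
```

Rejecting anything but the two numbers mirrors the "nothing else" wording of the prompts; a more lenient parser would measure something slightly different.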

A.2 Example

Note that in the case of Connect-Four, a move consists of a single scalar (the column index). A board state is shown after each play. Examples can be found in Fig. 10.
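Because a Connect-Four move is only a column, the row is determined by gravity. The sketch below illustrates this under stated assumptions: a 6-row by 7-column board stored as a list of lists, "." as the empty-cell marker, and the hypothetical helper name `drop_piece`; none of these come from the ChildPlay implementation.

```python
def drop_piece(board, col, symbol):
    """Place symbol in the lowest empty cell of col.

    Returns the row used, or None if the column is full or out of range.
    Hypothetical helper; "." as the empty marker is an assumption.
    """
    if not 0 <= col < len(board[0]):
        return None
    # Scan from the bottom row upwards: pieces fall straight down.
    for row in range(len(board) - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = symbol
            return row
    return None


board = [["."] * 7 for _ in range(6)]
drop_piece(board, 3, "X")
drop_piece(board, 3, "O")
print("\n".join(" ".join(r) for r in board))
```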

A.3 Move Mapping

See the right column for the model’s moves, and the left column for the random player’s moves.

A.3.1 Tic-Tac-Toe

A.3.2 Connect-Four

A.3.3 Battleship

A.4 Shapes

A.5 Prompting GPT About Optimal Play

Game	Explanation
Tic-Tac-Toe	Tic-Tac-Toe is a two-player game played on a 3x3 grid. Each player takes turns marking a square with their symbol (X or O), aiming to get three of their symbols in a row, column, or diagonal. To play optimally, prioritize securing the center square and blocking opponent’s winning moves.
Battleship	Battleship is a two-player game where players hide ships on a grid and take turns guessing their opponent’s ship locations. The goal is to sink all of the opponent’s ships. To play optimally, start by targeting areas with higher probabilities of containing a ship and strategically target adjacent squares after a hit to maximize efficiency.
Connect Four	Connect Four is a two-player game played on a 6x7 grid. Players drop colored discs into columns, aiming to connect four of their own discs in a row, column, or diagonal. To play optimally, prioritize creating your own winning formations while blocking opponent’s potential winning moves.

Table 2: Optimal strategies for playing different games according to GPT-3.5.

Game	Explanation
Tic-Tac-Toe	Play your first X in a corner to maximize opportunities. If the opponent plays in the center, play the opposite corner. Block your opponent’s potential winning moves and always look to create a line of three.
Battleship	Randomize ship placements and start by targeting the center of the grid. Use a checkerboard pattern for efficient searching. Once a ship is hit, focus on the surrounding squares to determine its orientation and sink it.
Connect Four	Start in the center column to maximize opportunities in all directions. Build threats vertically, horizontally, and diagonally, and block the opponent’s forming lines. Create multiple threats to force the opponent into a defensive position.

Table 3: Optimal strategies for playing different games according to GPT-4.
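Both explanations mention blocking the opponent’s winning moves. As a point of comparison, the one-move lookahead this requires in Tic-Tac-Toe can be sketched as follows. This is an illustrative assumption, not the opponent or evaluator used in the suite: the board is a 3x3 list of lists with " " for empty cells, and the function names are hypothetical.

```python
def all_lines():
    """Return the eight winning lines of a 3x3 board as lists of (row, col)."""
    rows = [[(r, c) for c in range(3)] for r in range(3)]
    cols = [[(r, c) for r in range(3)] for c in range(3)]
    diags = [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]
    return rows + cols + diags


def completing_move(board, symbol):
    """Return an empty cell that would give symbol three in a line, or None."""
    for line in all_lines():
        marks = [board[r][c] for r, c in line]
        if marks.count(symbol) == 2 and marks.count(" ") == 1:
            return line[marks.index(" ")]
    return None


def choose_move(board, me="X", opponent="O"):
    """Win if possible, otherwise block; return None if neither applies."""
    win = completing_move(board, me)
    if win is not None:
        return win
    return completing_move(board, opponent)


board = [["O", "O", " "],
         ["X", " ", " "],
         [" ", " ", "X"]]
print(choose_move(board))  # the blocking cell in the top row
```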

Appendix B LCL

B.1 Prompts

Validity Testing prompt: "You will receive a description of a Lego structure, for instance, ((x1, y1, ’color1’), (x2, y2, ’color2’)), which lists the coordinates and colors of two pieces. A construct is valid if all Lego pieces are connected but not overlapping. A Lego piece is connected through interlocking pegs, not by merely touching sides. Two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate. If the following structure is valid then reply with valid, otherwise reply with invalid (do not justify your answer): <pieces>"

Figure 20: Validity testing prompt.

Construct Generation prompt: "A description of a Lego structure consists of a list of tuples, ((x1, y1, ’color1’), (x2, y2, ’color2’)), where each tuple shows the coordinates and colors of a piece. Such a structure is valid if all Lego pieces are connected but not overlapping. A Lego piece is connected through interlocking pegs, not by merely touching sides. Two Lego pieces overlap when they share the same y-coordinate and any part of their length has the same x-coordinate. Produce a description of a valid structure using <n pieces> Lego pieces. Reply only with the Lego structure description following the format ((x1, y1, ’color1’), (x2, y2, ’color2’), …), write nothing else but the structure."

Figure 21: Construct generation prompt.

The prompts in Fig. 20 and Fig. 21, written in LaTeX, were used for both GPT-3.5 and GPT-4 in the main text, and these tests are part of the ChildPlay suite. Further tests were conducted but are not included in the suite; they are illustrated here. They are excluded because they would first have to be rewritten as systematic benchmarks rather than experimental input-output segments. For now, they serve as illustrative cases of spatial reasoning failure and success that supplement the benchmark, and are not intended to prove the models’ capacity either way.

Appendix C LCL Syntax

C.1 Definitions in LCL

A piece $P$ is defined as a tuple $P=(l,w,(x,y),c,h)$ (see Table 4) where:

1. $l$ is the length of the piece, fixed at 4 units;
2. $w$ is the width of the piece, fixed at 2 units;
3. $x$ is the position of the studs along the x-axis;
4. $y$ is the layer along the y-axis, with the first brick at layer 0;
5. $c$ is the color of the piece;
6. $h$ is the height of the piece, fixed at 1 unit.

For the sake of brevity, most of the examples below omit length ($l$), width ($w$), color ($c$), and height ($h$): length, width, and height are constants, and color plays no role in the validity rules.

Parameter	Description	Value
$l$	Length of the piece	4 units
$w$	Width of the piece	2 units
$(x,y)$	Position of the studs (x-axis), layers (y-axis)	Var
$c$	Colour of the piece	Var
$h$	Height of the piece	1 unit

Table 4: Definition of a piece $P$.

A construction, $M$ , is then a valid construction in $LCL_{2}$ if and only if it follows the rules:

1. $P=(4,2,(x,y),c,1)$;
2. $M$ is composed entirely of $P$ pieces ($\Phi=\{P\}$);
3. Every piece $P$ must be connected to at least one other piece $P$;
4. $M$ is symmetric along the line crossing the 2 by 4 pieces, between its pegs, along the piece’s longest side;
5. Pieces in the construct can only be manipulated horizontally in $n\pi$ rotations, with $n\in\mathbb{Z}$ (note that this makes width irrelevant);
6. The position of a piece is defined by its left-most pair of studs;
7. $M$ begins with a piece $P$ at coordinates $(0,0)$;
8. All pieces placed in layer $n$ must be placed before any piece is placed in layer $n+1$.

Consider constructing a line using three bricks (we omit height $h$, since it is constant and equal to 1). Counter-intuitively, a line cannot be represented as in Fig. 24, because the pieces are disconnected.

In $LCL_{2}$, $((0,0),(4,0),(8,0))$ is what one might expect a line to look like, but it is not valid in LCL: the pieces are disconnected from each other and merely lie next to one another in a row. Instead, $((0,0),(4,0),(2,1))$, or $((0,0),(-2,1),(2,1))$, or even $((0,0),(-2,1),(4,1))$ would be valid constructs.
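One way to make the connected-but-not-overlapping requirement concrete is the sketch below. It rests on assumptions that are not stated in this form in the text: a piece at $(x,y)$ is taken to occupy stud columns $x$ to $x+3$ on layer $y$, two pieces interlock when they sit on adjacent layers and share at least one stud column, and "connected" is read as "all pieces form a single connected group". The function names are hypothetical. Under this particular reading a piece at $(4,1)$ shares no stud column with the piece at $(0,0)$, so the sketch is stricter than the text for that example, and the authors’ validator may use a different connection convention.

```python
from itertools import combinations

LENGTH = 4  # studs per piece, as fixed in the definition of P


def studs(piece):
    """Stud columns occupied by a piece (x, y); assumed to be x .. x+3."""
    x, _ = piece
    return set(range(x, x + LENGTH))


def overlaps(a, b):
    """Same layer and at least one shared stud column."""
    return a[1] == b[1] and bool(studs(a) & studs(b))


def interlocks(a, b):
    """Adjacent layers and at least one shared stud column (assumed rule)."""
    return abs(a[1] - b[1]) == 1 and bool(studs(a) & studs(b))


def is_valid(pieces):
    """True if no two pieces overlap and all pieces form one connected group."""
    pieces = list(pieces)
    if not pieces:
        return False
    if any(overlaps(a, b) for a, b in combinations(pieces, 2)):
        return False
    # Flood fill over the interlock graph, starting from the first piece.
    reached, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(pieces)):
            if j not in reached and interlocks(pieces[i], pieces[j]):
                reached.add(j)
                frontier.append(j)
    return len(reached) == len(pieces)


print(is_valid([(0, 0), (4, 0), (8, 0)]))  # three bricks side by side
print(is_valid([(0, 0), (4, 0), (2, 1)]))  # bridged by a brick on layer 1
```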

Subsequently, both models were prompted with several additional requests that have not been integrated in the suite yet (see Table 5).

For these experiments, the definition of LCL was provided to the model and it was accompanied by the prompt in Fig. 22.

Prompt: "I will give you a number of pieces, I will ask you for a shape and you’ll output the coordinates per piece to form such a shape. It must be valid in LCL."

Figure 22: Extra testing prompts not in the suite.

Task	Description
Triangle Construction	"Make a triangle with 5 bricks."
Humanoid Figure	"6 pieces. Build a humanoid figure."
Bart Simpson-Like Figure	"Let me help you. Imagine it’s Bart Simpson. You have three yellow pieces, one for the head, two for the arms, one red for the torso, and two blue pieces for the legs."
Tower Construction	"Produce now a tower with 3 bricks."

Table 5: Sequence of building prompts.

C.2 Example

A simple example is found in Fig. 23. This is a tower constructed from 3 bricks and is a valid $LCL_{2}$ construct.

This sequence forms a 3-brick line, each brick having a length of 4 units. However, since the construction consists of three columns of one piece $P$ each, it can be broken apart and is not a single topological object (each piece can be moved individually). A correct construct with three bricks has many possible solutions. For a centre piece with two pieces on the bottom or two pieces on the top, we find $24$ possible solutions. Eq. 1 gives the general formula, where $s$ is the number of studs:

\begin{equation}\begin{split}f(0)&=0\\ f(s)&=4(s-1)+f(s-1)\end{split}\tag{1}\end{equation}

And its non-recursive form:

\begin{equation}\begin{split}f(0)&=0\\ f(s)&=2(s-1)s\end{split}\tag{2}\end{equation}
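The closed form follows from unrolling the recursion: $f(s)=4\sum_{k=1}^{s}(k-1)=4\cdot\frac{s(s-1)}{2}=2(s-1)s$. The short check below compares the two forms numerically; the function names are chosen here for illustration only.

```python
def f_recursive(s):
    """Eq. 1: f(0) = 0, f(s) = 4 * (s - 1) + f(s - 1)."""
    if s == 0:
        return 0
    return 4 * (s - 1) + f_recursive(s - 1)


def f_closed(s):
    """Eq. 2: f(s) = 2 * (s - 1) * s."""
    return 2 * (s - 1) * s


# The two forms agree on every value tested.
assert all(f_recursive(s) == f_closed(s) for s in range(50))
print(f_closed(4))  # 24, the count quoted for a piece with 4 studs
```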

We show two more simple examples:

, and:

The "three-in-a-line" can only be loosely interpreted in $LCL_{2}$, due to rule (3), which means that pieces cannot be moved independently from the rest of the model. For this reason, one can imagine many more structures that loosely fall under the definition of a "line" or "wall", for example:

Or even a stair-like structure:

A humanoid could also be easily represented in $LCL_{2}$ as:

We show the models’ incorrect answers in Figs. 30, 32, and 33, and correct answers in Fig. 31(b). Both GPT-3.5 and GPT-4 came close to the expected target but failed to respect the $LCL_{2}$ rules in most cases. For example, pieces are found in an impossible superposition in Fig. 30(a) (the red piece is in the same position as the yellow piece), Fig. 32(b) (the blue piece is in the same position as the yellow piece), and Fig. 33(b) (the red piece is in the same position as the middle yellow pieces). In Fig. 33(a), GPT-3.5 erroneously swapped the middle yellow piece with the red piece and the blue pieces with the bottom yellow pieces, even though it had first stated the correct organisation of the 6 pieces in plain English. On the positive side, both models managed to assemble a tower of three pieces, and GPT-4 was able to assemble a triangle (see Table 6). Neither model recognised that it had been asked an impossible task, namely building a triangle with only 5 pieces (see Fig. 30).

Category	N(P)	GPT-3.5 Response	GPT-4 Response
Tower	3	Correct	Correct
Impossible Triangle	5	Incorrect	Incorrect
Triangle	6	Incorrect	Correct
Humanoid	6	Incorrect	Incorrect
Bart Simpson	6	Incorrect	Incorrect

Table 6: Comparison of Responses by GPT-3.5 and GPT-4.

C.3 Small Dataset for Future Experiments

The dataset defined herein contains several example prompts that are more complex and do not follow the 2x4 assumption, each consisting of a request followed by a LEGO kit of fewer than 15 pieces to which the agent is bound.

LEGO Kits

Apple

Possible prompt: "Construct a LEGO apple with a mix of red and green colors, resembling a typical apple shape using slopes and bricks."

• Green Slope 45 2 x 1 - Code: 3040 (Quantity: 1)
• Red Slope 45 2 x 2 - Code: 3039 (Quantity: 2)
• Lime Slope, Inverted 45 2 x 2 - Code: 3660 (Quantity: 2)
• Red Brick 2 x 3 - Code: 3002 (Quantity: 1)
• Lime Plate 2 x 2 - Code: 3022 (Quantity: 1)
• Lime Brick 1 x 2 - Code: 3004 (Quantity: 1)

Yellow Hut

Possible prompt: "Build a hut with a purple and yellow color scheme, featuring a simple structure and a sloped roof."

• Trans-Clear Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 2)
• Medium Nougat Brick 2 x 2 - Code: 3003 (Quantity: 1)
• Lime Plate 2 x 6 - Code: 3795 (Quantity: 1)
• Bright Light Yellow Brick 1 x 2 - Code: 3004 (Quantity: 4)
• Bright Light Yellow Brick 2 x 2 - Code: 3003 (Quantity: 1)
• Medium Lavender Slope 45 2 x 2 - Code: 3039 (Quantity: 4)

Fortress

Possible prompt: "Create a medieval-themed LEGO fortress with arches, walls, and defensive structures, symbolizing a stronghold."

• Green Plate 2 x 8 - Code: 3034 (Quantity: 1)
• Light Bluish Gray Arch 1 x 4 x 2 - Code: 6182 (Quantity: 2)
• Sand Green Brick 1 x 2 - Code: 3004 (Quantity: 2)
• Light Bluish Gray Brick 1 x 2 - Code: 3004 (Quantity: 2)
• Dark Bluish Gray Brick 1 x 2 - Code: 3004 (Quantity: 2)
• Light Bluish Gray Brick 2 x 2 - Code: 3003 (Quantity: 1)
• Reddish Brown Brick, Round 1 x 1 Open Stud - Code: 3062b (Quantity: 2)

Dinghy

Possible prompt: "Assemble a small LEGO dinghy with a white sail and a mast."

• Dark Tan Plate 2 x 4 - Code: 3020 (Quantity: 1)
• Tan Slope, Inverted 33 3 x 2 with Flat Bottom Pin and Connections - Code: 3747b (Quantity: 1)
• White Slope 45 2 x 2 - Code: 3039 (Quantity: 3)
• White Brick 2 x 2 - Code: 3003 (Quantity: 1)
• White Brick 1 x 2 - Code: 3004 (Quantity: 1)
• Tan Brick 2 x 3 - Code: 3002 (Quantity: 1)
• Reddish Brown Brick, Round 2 x 2 with Axle Hole - Code: 3941 (Quantity: 1)

Blue Bot

Possible prompt: "Construct a LEGO robot with a humanoid structure, featuring a distinguishable head, body, arms, and legs."

• Medium Blue Brick 2 x 2 - Code: 3003 (Quantity: 1)
• Brick, Modified 2 x 3 with Curved Top - Code: 6215 (Quantity: 1)
• Brick 2 x 4 - Code: 3001 (Quantity: 1)
• Brick 1 x 2 - Code: 3004 (Quantity: 2)
• Brick, Round 2 x 2 with Grille - Code: 92947 (Quantity: 1)
• Plate 2 x 2 - Code: 3022 (Quantity: 1)
• Brick, Modified 1 x 2 with Studs on 1 Side - Code: 11211 (Quantity: 1)
• Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)
• Tile 1 x 1 Round - Code: 98138 (Quantity: 1)
• Brick, Round 2 x 2 Dome Top, with Bottom Axle Holder - Code: 553c (Quantity: 1)

Toy Car

Possible prompt: "Build a LEGO toy car with a compact design, featuring wheels, and a sloped windshield."

• Brick 2 x 6 - Code: 2456 (Quantity: 1)
• Slope 2 x 2 45° - Code: 3039 (Quantity: 1)
• Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)
• Brick 1 x 2 - Code: 3004 (Quantity: 1)
• Plate 2 x 2 with Wheel Holders - Code: 4600 (Quantity: 2)
• Wheel 8mm D. x 6mm with Slot - Code: 34337 (Quantity: 4)
• Tire Offset Tread Small - Band Around Center of Tread - Code: 87414 (Quantity: 4)

Goldfish

Possible prompt: "Create a LEGO goldfish with fins and tail, featuring elements for eyes."

• Brick 2 x 4 - Code: 3001 (Quantity: 2)
• Brick 1 x 2 with Pin Hole - Code: 3700 (Quantity: 1)
• Brick, Modified 1 x 2 with Studs on 1 Side - Code: 11211 (Quantity: 2)
• Brick 2 x 3 - Code: 3002 (Quantity: 1)
• Slope 45° 2 x 2 - Inverted - Code: 3660 (Quantity: 1)
• Slope 2 x 1 - 45° - Code: 3040 (Quantity: 4)
• Tile 1 x 1 Round with Eye Pattern - Code: 98138pb007 (Quantity: 2)
• Slope 30° 1 x 2 x 2/3 - Code: 85984 (Quantity: 1)

Baby Elephant

Possible prompt: "Assemble a LEGO baby elephant with a focus on its trunk, ears, and body structure."

• Brick 2 x 6 - Code: 2456 (Quantity: 1)
• Brick 1 x 2 - Code: 3004 (Quantity: 3)
• Brick 1 x 4 - Code: 3010 (Quantity: 1)
• Brick 1 x 1 with Stud on 1 Side - Code: 87087 (Quantity: 2)
• Tile 1 x 1 Round with Eye Pattern - Code: 98138pb027 (Quantity: 2)
• Brick 2 x 4 - Code: 3001 (Quantity: 1)

Flamingo

Possible prompt: "Construct a LEGO flamingo with pink bricks, designed to stand on one leg and feature a long neck and beak."

• Brick 1 x 2 - Code: 3004 (Quantity: 3)
• Brick, Modified 2 x 3 with Curved Top - Code: 6215 (Quantity: 2)
• Brick 1 x 1 with Stud on 1 Side - Code: 87087 (Quantity: 2)
• Plate 2 x 3 - Code: 3021 (Quantity: 1)
• Slope 2 x 2 - 45° - Code: 3039 (Quantity: 1)
• Tile 1 x 1 Round with Eye Closed Pattern - Code: 98138pb028 (Quantity: 2)

Twin Engine Airplane

Possible prompt: "Build a LEGO twin-engine airplane, with a body, wings, and a tail."

• Plate 2 x 8 - Code: 3034 (Quantity: 2)
• Brick 1 x 2 x 2 with Inside Stud Holder - Code: 3245c (Quantity: 1)
• Brick, Modified 1 x 1 x 1 2/3 with Studs on 1 Side - Code: 32952 (Quantity: 2)
• Brick 1 x 4 with 4 Studs on 1 Side - Code: 30414 (Quantity: 2)
• Slope 2 x 2 - 45° - Code: 3039 (Quantity: 1)
• Brick 1 x 2 without Bottom Tube - Code: 3065 (Quantity: 1)

Appendix D (More) Shapes - short experiments (not included in ChildPlay)

D.1 Prompts

Test	Prompt
Introductory prompt	"Below is a 15 by 15 grid of 0s. I have flipped some 0s into 1s such that a basic geometrical shape has formed. Can you tell me what shape it is?"
Square (feedback)	"That’s incorrect. The shape is a square. Can you tell me the length and width?"
Circle (feedback)	"That’s incorrect. The shape is a circle. Can you tell me the coordinates of the center?"
Triangle (feedback)	"That is incorrect. It is in fact a triangle. Can you tell the length of the base?"
Cross A	"Can you tell me the coordinates of the center of the cross and the length of each line, horizontal and vertical?"
Cross B	"Draw a cross in a 5 by 5 grid, with horizontal and vertical axes of 3 units of length with the center at (3,3)."

Table 7: Introductory and correction prompts for identifying and detailing specific geometrical shapes in a grid environment.
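To make the grid encoding in these prompts concrete, the following is a minimal sketch of how a shape could be flipped into a grid of 0s and rendered as text. It is not the code used in the experiments: the function names (`blank_grid`, `draw_filled_square`, `draw_cross`, `render`), the space-separated rendering, the filled square, and the reading of "(3,3)" as 1-indexed (row, column) are all assumptions made for illustration.

```python
def blank_grid(size=15):
    """Return a size x size grid of 0s (hypothetical helper)."""
    return [[0] * size for _ in range(size)]


def draw_filled_square(grid, top, left, side):
    """Flip a filled side x side block of cells to 1, using 0-indexed coordinates."""
    for r in range(top, top + side):
        for c in range(left, left + side):
            grid[r][c] = 1
    return grid


def draw_cross(grid, center_row, center_col, axis_length):
    """Flip a cross whose horizontal and vertical axes each span axis_length cells.

    axis_length is assumed to be odd so that the cross is symmetric
    around the (0-indexed) center cell.
    """
    half = axis_length // 2
    for d in range(-half, half + 1):
        grid[center_row][center_col + d] = 1  # horizontal axis
        grid[center_row + d][center_col] = 1  # vertical axis
    return grid


def render(grid):
    """Render the grid as rows of space-separated digits (an assumed text format)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


if __name__ == "__main__":
    # Cross B from Table 7: a 5 by 5 grid, axes of 3 units, center at (3,3).
    # Assuming 1-indexed coordinates, the center is cell (2, 2) when 0-indexed.
    print(render(draw_cross(blank_grid(5), 2, 2, 3)))
```

Under these assumptions, the Cross B request corresponds to a plus-shaped pattern of five 1s centered in the 5 by 5 grid.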

In the shape detection tests, both GPT-3.5 and GPT-4 showed limited ability to interpret or draw shapes. When asked to draw a cross (see Fig. 34), both models initially failed to produce a correct cross but improved slightly after feedback. As Table 8 shows, both models often misidentified the shapes shown to them; the circle, for example, was described as a "diamond shape" (GPT-3.5) and as an "arrow pointing upwards" (GPT-4). Neither model fully grasped the geometric properties either, and both frequently gave incorrect dimensions and centers for squares, triangles, and crosses.

Test	Query	Correct Answer	GPT-3.5 Response	GPT-4 Response
Circle	Shape	Circle	"diamond shape"	"arrow pointing upwards"
	Center	(7,7)	"(7,7)"	"(7,7)"
Square	Shape	Square	"square"	"’O’"
	Dimensions	(3,4)	"(4,4)"	"(3,3)"
Triangle	Shape	Triangle	"diamond"	"arrow pointing upwards"
	Base Length	7 units	"7"	"6"
Cross	Shape	Cross	"square"	" ’plus’ sign (+)"
	Center	(5,5)	"(7,7)"	"(6,5)"
	Line Lengths	5	"5"	"4"

Table 8: Comparison of Responses by GPT-3.5 and GPT-4 in Shape Detection Tests.

Appendix E Conway’s Game of Life - a short experiment in state prediction (not included in ChildPlay)

We were interested in whether LLMs can predict states from very simple rules. We tested this by generating sequential states in Conway’s Game of Life and feeding them to GPT-3.5 and GPT-4, prompting the models for two things: the rules and the next state. We do not include this as a benchmark because the experiment required hand segmentation of areas of interest in the simulated states, and repeatability was achieved only for the patterns of interest. We hope to include a version of this task in the ChildPlay suite later.

In Conway’s Game of Life, a cellular automaton devised by mathematician John Horton Conway and introduced to the public by Gardner, cells perpetuate or perish according to a few simple rules. We simulated rule B3/S23, which we refer to as "blinking", on boards of varying size. Under this rule, a cell is born if it has exactly three live neighbours and survives if it has two or three live neighbours; otherwise it dies. Rule B3/S23 is known to produce configurations of cells that alternate between two or more states over successive generations. These configurations are generally known as oscillators: patterns that return to their initial configuration after a fixed number of generations and repeat indefinitely. We prompted the models with three configurations of increasing complexity. The first consisted of three iterations of an oscillator (alternating between 3 horizontal cells and 3 vertical cells). The second used the same oscillator in the same position, with noise added in the form of 4 cells that did not change. The third was a 5-iteration pattern showing cells moving outward across the board in a circular formation. These were obtained by repeated simulation using custom software developed specifically for this study and available through GitHub.
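The B3/S23 update described above can be sketched in a few lines. This is an illustrative sketch, not the custom simulation software used in the study: the function name `step`, the list-of-lists board, and the treatment of cells beyond the board edge as dead (no wrap-around) are assumptions.

```python
def step(grid, birth=(3,), survival=(2, 3)):
    """Compute the next generation of a 0/1 grid under a birth/survival rule.

    Defaults correspond to B3/S23. Cells outside the board are treated as
    dead, which is an assumption about the boundary handling.
    """
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r, c):
        total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # skip the cell itself
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += grid[rr][cc]
        return total

    next_grid = []
    for r in range(rows):
        next_row = []
        for c in range(cols):
            n = live_neighbours(r, c)
            if grid[r][c] == 1:
                next_row.append(1 if n in survival else 0)  # survive or die
            else:
                next_row.append(1 if n in birth else 0)  # birth or stay dead
        next_grid.append(next_row)
    return next_grid


if __name__ == "__main__":
    # A three-cell oscillator: 3 horizontal cells become 3 vertical cells.
    horizontal = [
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in step(horizontal):
        print(" ".join(str(cell) for cell in row))
```

Applying `step` twice to the horizontal configuration returns the board to its starting state, which is the period-2 behaviour of the simplest oscillator used in the prompts.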

E.1 Prompts

Test	Introductory Prompt	Second Prompt
Test 1: GPT-4 and GPT-3.5	"The following was produced using Conway’s Game of Life, namely rule ’survival’: [2, 3], ’birth’: [3]. Print the state prior to the first one and the one after the last one."
Test 2: GPT-4 and GPT-3.5	"Print the next pattern. The following are two iterations of the game of life by Conway. You cannot use code"	"That’s wrong. The rule is Rule: ’Blinking’: ’survival’: [2, 3], ’birth’: [3]. Try again."
Test 3: GPT-4 and GPT-3.5	"Print the next pattern. The following are two iterations of the game of life by Conway"	"That’s wrong. The rule is Rule: ’Blinking’: ’survival’: [2, 3], ’birth’: [3]. Try again."
Requesting Rules: GPT-3.5 and GPT-4	"I will show you two iterations of Conway’s game of life. The first generated the second. You must deduce the survival and birth rules. You must only print these rules, nothing else. Understood?"

Table 9: Prompts for tests related to Conway’s Game of Life.
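For the rule-deduction prompt, one way to see what two consecutive states reveal is to tabulate, for every cell, its live-neighbour count and whether it was born or survived. The sketch below does this; it is an illustration under assumptions and not part of the study's software. The names `count_live_neighbours` and `observed_rule` are hypothetical, and cells beyond the board edge are assumed dead.

```python
def count_live_neighbours(grid, r, c):
    """Count live cells among the up to 8 neighbours of (r, c); off-board cells count as dead."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += grid[rr][cc]
    return total


def observed_rule(before, after):
    """Return (birth_counts, survival_counts) observed between two consecutive states.

    Raises ValueError if the pair cannot come from any single birth/survival
    rule, i.e. if the same neighbour count leads to different outcomes.
    """
    birth, no_birth = set(), set()
    survival, no_survival = set(), set()
    for r in range(len(before)):
        for c in range(len(before[0])):
            n = count_live_neighbours(before, r, c)
            if before[r][c] == 1:
                (survival if after[r][c] == 1 else no_survival).add(n)
            else:
                (birth if after[r][c] == 1 else no_birth).add(n)
    if birth & no_birth or survival & no_survival:
        raise ValueError("states are not consistent with one birth/survival rule")
    return sorted(birth), sorted(survival)


if __name__ == "__main__":
    before = [
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    after = [
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    print(observed_rule(before, after))
```

For this particular pair of states, only a birth count of 3 and a survival count of 2 are observed, so a single transition of the three-cell oscillator is consistent with B3/S23 but does not by itself exhibit survival with three neighbours.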

In the Game of Life tests, neither GPT-3.5 nor GPT-4 consistently identified or predicted the evolving patterns. Table 10 summarizes their performance: both models succeeded only on the simple blinking pattern. In the more complex scenarios, which asked for the pattern before or after a given state, both models returned incorrect responses. Even when explicitly given the game’s rules, GPT-3.5 and GPT-4 failed to predict the next pattern or the preceding one.

Test	Description	Query	GPT-3.5 Response	GPT-4 Response
Test 1	Blinking pattern	Identify the rule	Correct	Correct
Test 2	Blinking pattern	Next pattern (no rule)	Incorrect	Incorrect
	Blinking pattern*	Next pattern†	Incorrect	Incorrect
Test 3	Complex pattern	Pattern before†	Incorrect	Incorrect
	Complex pattern	Pattern after†	Incorrect	Incorrect

* At higher temperatures, some of GPT-4’s responses were discarded by our parser when the model generated invalid Unicode output, and thus were not included in the final evaluation. This discrepancy is evident in Fig. 7(b), for instance, where the sum of correct and incorrect choices does not total 25 at temperatures 1 and 1.5.
† Queries conducted with the explicit rule revealed.