CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

Shubham Bharti^∗ Shiyun Cheng^∗ Jihyun Rho Martina Rau^† Xiaojin Zhu
University of Wisconsin–Madison and ^†ETH Zurich

*

equal contribution

Abstract

We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We detail the construction of the CHARTOM benchmark including its calibration on human performance.

1 Introduction

1.1 Background on Theory of Mind

For AI to better assist humans, it must know not just factual truth but also how humans perceive the truth. The two could differ for many reasons such as information asymmetry or cognitive ability. If AI can detect the difference, it can help humans accordingly. The ability for an individual to reason about how others think (rather than the factual truth) is known as theory of mind [30]. A classic example is the Sally-Anne test [38, 2]: Sally hides a marble in a basket then leaves the room. Anne moves the marble to a box while Sally is away. Sally comes back. At this point, one can ask an observer two types of questions:

•

FACT question: Where is the marble?
•

MIND question: Where will Sally look for the marble?

There is a recent surge of interest in AI theory of mind, where large language models take the place of the observer. Several researchers found current AI competent at theory of mind tasks, performing near or surpassing human performance [20, 35], though some others remained skeptical [36, 33]. Concerned with limitations of AI theory of mind benchmarks, recent work also started to develop new benchmarks emphasizing conversations [18], causality [11], and actions [41]. However, these existing AI theory of mind benchmarks are all based on text comprehension.

1.2 Background on Data Visualizing Charts

Our theory of mind benchmark tests visual perception instead. Specifically, our focus is on misleading data visualizing charts, e.g. bar charts, line graphs, pie charts etc. Charts are widely used for conveying quantitative information to human readers. They lend credibility to accompanying messages and hence have a strong impact on human decision making. On the flip side, however, if charts are misleading, they can amplify the impact of misinformation [22, 29]. Figure 1 reproduces a misleading chart whose $y$ -axis is inverted, which may give the impression that enacting Florida’s “stand your ground” law reduced gun deaths. Misleading charts are unfortunately prevalent in many areas of our lives. For example, a review of medical advertisements found that 1/3 of charts were misleading [7]. A review of charts in Korean news about the COVID-19 pandemic showed that about 30% of bar charts and 44% of pictorial charts were misleading [21]. A U.S.-based analysis of visuals that had been identified as misleading by human fact checkers showed that more than half of these visuals were used as evidence for fake news about COVID-19 [4]. Because journalists often use charts to make narratives appear objective and to engage viewers emotionally [32], misleading charts can make an inaccurate story more deceptive [8, 37].

Refer to caption — Figure 1: Reproduction of a misleading chart from [10], attributed to Reuters originally

As individuals increasingly rely on digital media to inform their opinions and decision-making processes, it is essential to ensure that the visual representations of data they encounter are accurate and trustworthy. Studying misleading charts can lead to AI with the ability to automatically detect them and alert readers, thus empowering people to make well-informed decisions based on reliable information, consequently promoting a more informed and discerning society.

Existing computer vision research in charts largely aimed at chart comprehension. Classic approaches often employ an intermediate step of chart-to-data parsing, then use the data for downstream tasks such as visual question-answering, reasoning and summarization. To this end, [5, 28] applied deep learning based object detection methods to extract data, while others [1, 6, 26, 16, 34] applied semi-automated off-the-self OCR text recognition systems [3, 17]. Newer MLLM-based approaches tends to be end-to-end, such as training the model on extensive pre-training datasets [25, 12, 27, 40], few-shot tuning methods which fine-tune pre-trained models with a small number of task-specific examples [9, 23, 19], and even zero-shot learning where the model makes predictions on tasks it has not been explicitly trained on [39, 15]. The newer approaches are performant, though occasionally still suffer from hallucination in data extraction and reasoning [14, 13].

However, we point out a significant shortcoming in chart comprehension: many misleading charts in fact contain the correct data. Take Figure 1 for example. Despite the unusual flipped $y$ -axis, a meticulous reader (or AI) can indeed recover all the correct (year, deaths) data points from the chart. Therefore, ironically, a perfect chart comprehension system is not directly suited for judging how misleading a chart may be to humans. This is the same distinction between factual and theory of mind questions. We can ask a FACT question: What data is objectively presented in the chart? We can also ask a MIND question: What will a typical human reader perceive from the chart? Chart comprehension systems aim to solve the FACT question. To the best of our knowledge, the present paper is the first to address the MIND question in charts instead.

1.3 Our Contributions

Our contribution is to create a visual theory-of-mind benchmark CHARTOM (for CHARt Theory of Mind) consisting of specially designed charts. The benchmark covers common varieties of chart types, and we planted representative visual manipulations into the charts. The key to the benchmark is to obtain MIND ground truth, i.e. “degree of misleadingness to humans”, for each chart. This ground truth enables future benchmark users to judge whether their AI correctly answers the MIND questions. We obtained the MIND ground truth via human experiments.

2 The CHARTOM Benchmark

In this section we present our visual theory of mind benchmark CHARTOM. We first detail our design of charts to accommodate both FACT and MIND type of questions in the benchmark, then explain how we obtain the MIND ground truth with human experiments.

2.1 Design of the CHARTOM Benchmark

To explain our CHARTOM benchmark, let us start with an example.

An example. The pair of charts in Figure 2 share the same underlying data: (year=2018, dogs=250) and (year=2019, dogs=300). The left chart is standard, while we intentionally planted a potentially misleading manipulation to the right chart: we truncated the $y$ -axis range so it starts at 240 instead of 0. This is to create a perceptual effect: the 2019 bar now appears much taller than the 2018 bar. For each of the two charts, we ask the following FACT and MIND questions.

•

The FACT question is a chart comprehension question:

The following graph shows the number of dogs adopted. The dogs adopted in 2018 eat 1 million bags of dog food in their lifetimes. How much do the dogs adopted in 2019 eat in their lifetimes?
•

The MIND question is a human performance question:

Here is a chart we will present to typical university students and ask them the following question:

[THE FACT QUESTION ABOVE]

What fraction of typical university students do you predict will be misled by the chart when answering the question? First give your prediction as a decimal number between 0 and 1, then justify your prediction in words.

Answers to the FACT question is easy to judge. It is important to note that in both charts by design we ensured the underlying data can be exactly recovered, regardless of the manipulation. Therefore, a perfect MLLM-based AI, with proper reasoning, can always correctly answer the FACT question on both charts: the answer is $\frac{1}{250}\times 300=1.2$ million bags.

In contrast, the MIND question is trickier to judge. On the left chart where the design is standard, one might still argue that the chart is not perfect and could be a little bit misleading to humans: the numbers are too small, there is no grid guide lines to help visual alignment, etc. Conversely, on the right chart where we intentionally planted the truncated $y$ -axis because we presumed it to be misleading based on existing psychology literature (e.g. [24]), we do not know a priori the actual effect on human readers. One way to obtain ground truth on the MIND question is to actually conduct human experiments, which we explain in section 2.2.

The CHARTOM benchmark. Our entire CHARTOM benchmark follows the same design principle. The benchmark consists of 112 charts, covering five prevalent chart types: line, bar, pie, scatter plot and map. The charts are manually designed to be visually tidy. We present selected charts from the benchmark in Table 1 and Appendix A.

The 112 charts come in 56 pairs (one pair is already shown in Figure 2). Each pair has an original (suffix _1) and a manipulated (suffix _2) version, the latter incorporates one of many visualization fallacies suggested by the psychology literature, see the ‘manipulation’ column in Table 1. The manipulated version is intended to confuse casual human readers. Each pair of charts also comes with a FACT question and a MIND question. The FACT question is typically a simple word problem based on chart comprehension, see examples in Appendix A. There are three types of FACT questions: multiple choices, free text entry, and sorting. Importantly, the true underlying data can still be recovered from either version if one is careful. This means that the FACT question has the same answer on both versions. The MIND question has the same format as shown earlier for all charts.

To summarize, the CHARTOM benchmark consists of

•

112 charts: 56 pairs of original and manipulated versions.
•

Each chart comes with a FACT question and the answer key.
•

Each chart comes with a MIND question, and the answer key is the HMI estimated from human experiments (see Section 2.2).

To avoid web crawlers inadvertently adding the benchmark to future MLLM training data, the CHARTOM benchmark is not online but is available upon request. Please contact Prof. Jerry Zhu at [email protected].

	original		manipulated
chart type	ID	HMI	ID	manipulation	HMI
line	G4_Q1_1	0.03	G4_Q1_2	inconsistent x-axis	0.97
line	G3_Q1_1	0.03	G3_Q1_2	inverted y-axis	0.81
line	G9_Q1_1	0.06	G9_Q1_2	dual axes	0.74
line	G1_Q1_1	0.14	G1_Q1_2	truncated y-axis	0.40
line	G2_Q1_1	0.07	G2_Q1_2	compressed y-axis	0.04
bar	G12_Q1_1	0.28	G12_Q1_2	3D effect	0.96
bar	G7_Q1_1	0.18	G7_Q1_2	truncated y-axis	0.29
bar	G8_Q1_1	0.09	G8_Q1_2	compressed y-axis	0.06
bar	G10_Q1_1	0.02	G10_Q1_2	pictorial bars	0.06
scatter	G6_Q1_1	0.03	G6_Q1_2	swapped axis	0.71
scatter	G5_Q1_1	0.02	G5_Q1_2	logarithmic y-axis	0.63
map	G15_Q1_1	0.01	G15_Q1_2	inverted color scale	0.10
pie	G13_Q1_1	0.26	G13_Q1_2	3D effect	0.90
pie	G14_Q1_1	0.66	G14_Q1_2	3D effect + pop out	0.99

Table 1: Selected chart pairs in the CHARTOM benchmark. Each row is for a pair of original and manipulated charts. HMI standards for Human Misleadingness Index, a number between 0 and 1 where a larger value suggests the chart is more misleading to humans.

2.2 HMI: Ground Truth for MIND Questions

A key part of our benchmark is the ground truth answer for the MIND question on each chart, namely whether that chart is misleading to typical human readers.

2.2.1 Estimating HMI from Human Experiments

We propose Human Misleadingness Index (HMI), a number between 0 and 1 for each chart where a larger value means the chart is more misleading to humans. HMI is estimated with the following process.

We conducted a human experiment on 68 university students with IRB approval. We did not directly ask human participants the MIND question because we do not anticipate people to be able to give reliable answers—after all, many were themselves misled by some of our charts. Instead, we asked them the FACT question, which is designed so that users who are misled by the chart are likely to give wrong answer. Subsequently, we use the performance of users on the FACT question to establish the ground truth for the MIND question. Concretely, the HMI for a chart is defined as the percent of our study population who did not give an acceptable answer on the FACT question. We will define acceptability in the next section. We argue that HMI is a surrogate measure for how misleading the chart is. This definition is motivated by two factors:

1.

The arithmetic involved in solving the FACT questions is straightforward for the education level of our participants. Indeed on most original (not manipulated) charts the HMI is small.
2.

On many manipulated charts the HMI is higher compared to their original chart counterpart. Since the only difference between the pair is the planted manipulation in data visualization, we can attribute the increase in HMI to the manipulation being more misleading to humans. ¹¹1We randomized the order of original and manipulated charts in each pair when presenting them to human participants to minimize the order effect.

Our human experiments are described in greater detail in [31].

2.2.2 Defining Acceptable Human Answers

Recall the FACT questions can be either multiple choices, sorting, or free text entry. It is straightforward to judge human answers for multiple choice and sorting questions (where we require all items are in the correct order), but it is more subtle for free text entry. For the purpose of calculating HMI, on free text entries we allow approximately correct answers. For example, we already discussed that the FACT answer is 1.2 for Figure 2. However, many participants understandably answered other numbers around 1.2, and a few participants answered drastically different numbers, see Table 2. A few remarks are in order:

•

For both the original (G7-Q1-1) and manipulated (G7-Q1-2) charts, the mode of the answer distribution is at the correct FACT answer 1.2.
•

For the manipulated chart G7-Q1-2, many participant answered 6. This answer was not observed for the original chart. We believe those participants were misled by the vertical perceptual bar heights. In other words, they used the wrong formula $\frac{1}{(1\mbox{ vertical unit})}\times(6\mbox{ vertical units})=6$ million bags.
•

The answer 300 is also frequent. We believe those participants simply read off the $y$ -axis number of the 2019 bar.
•

Many participants answered other numbers around 1.2, 6, or 300.

original chart G7-Q1-1 answer 1 1.01 1.1 1.15 1.16 1.2 1.25 1.3 1.5 2 3 250 300 345 count 5 1 5 1 1 24 5 5 6 3 2 1 8 1 manipulated chart G7-Q1-2 answer 1 1.1 1.13 1.1666 1.2 1.25 1.26 1.3 1.5 5 6 30 56 300 count 4 2 1 1 22 4 1 2 11 4 7 1 1 7

Table 2: All participant answers on the FACT question for the pair of charts in Figure 2

We encountered similar diversity of human answers on other free text entry questions. These answers are indicative of the diverse cognitive processes in our participants, and are of independent interest to psychologists. From a computational perspective, we simplified the cognitive processes to define acceptable answers and assume that a participant will use one of following three problem solving strategies :

1.

The correct strategy. This leads to the correct FACT answer. For G7-Q1-1 and G7-Q1-2 the answer would be 1.2.
2.

The misled strategy where they “fall for the manipulation”. For G7-Q1-2 the answer would be 6.
3.

The skimming strategy where they just took the $y$ -axis at face value. For G7-Q1-1 and G7-Q1-2 the answer would be 300.

On top of these strategies, we anticipate a participant to be sometimes inexact in their math; the effect is to add a randomly perturbation around their answer. Let $c$ denote the correct strategy answer (e.g. 1.2). Let $a$ denote the closest alternative strategy answer that is larger than $c$ (e.g. 6). We define $U$ the upper limit of acceptable answers as the geometric mean of $c$ and $a$ :

U=\sqrt{c\cdot a}.

If $a$ does not exist, $U=\infty$ . The choice of geometric mean is to better accommodate the drastic difference in magnitude between $c$ and $a$ on many questions. Similarly $L$ is the lower limit of acceptable answers defined using the closest alternative strategy answer that is smaller than $c$ , and $L$ could be $-\infty$ . If a participant’s answer falls in the interval $(L,U)$ , we accept it as approximately correct. For the example of G7-Q1-2, the acceptable answer interval is $(-\infty,2.68)$ .

2.2.3 HMI Discussions

Figure 3 shows the HMI values in the CHARTOM benchmark. These values are sorted from small (not misleading to humans) to large (very misleading). In addition, each HMI value is color coded to indicate whether the chart is the original version (blue) or the manipulated version (red). Generally speaking, the original charts have small HMI and the manipulated ones have large HMI. But this depends on the type of manipulation. As the HMI values in Table 1 indicates, some manipulations greatly misled people: inconsistent x-axis, inverted y-axis, dual axes plots, 3D effect, etc. While interestingly, some other manipulations did not confuse people much: compressed y-axis and pictorial bars are actually benign. Such differences are of interest to cognitive science, and are discussed in [31].

2.3 Suggested MLLM Evaluation Criteria

We do not anticipate a particular output format for future MLLMs on our benchmark questions, but it should always be possible to manually compare the MLLM answers to our answer keys:

•

On FACT questions, we suggest using accuracy, defined as the number of correct answers over the number of FACT questions. FACT questions come in three types: multiple choice, ranking, and numerical (free text) answer. For the first two, correctness is precisely defined by the answer key (this means a ranking answer is correct if all items are ranked correctly). But for numerical answers, the answer key provides the exact answer assuming perfect perception. This is too strong an assumption. To judge numerical answers, we suggest using a $\pm 10\%$ tolerance interval: if the answer key is $a$ , then an MLLM answer in the interval $[0.9a,1.1a]$ is deemed correct. Note this interval is for evaluating MLLMs, and is distinct from how we estimated HMI in section 2.2.2.
•

On the numerical part of the MIND questions, we suggest using mean squared error: $\frac{1}{n}\sum_{i=1}^{n}(m_{i}-a_{i})^{2}$ where $n$ is the number of MIND questions, $m_{i}$ is an MLLM’s answer on the $i$ th MIND question, and $a_{i}$ is the HMI answer key to that MIND question.
•

On the justification part of the MIND questions, we suggest qualitative comparisons to the “chart manipulations” file provided in the benchmark. That file records the single manipulation we planted to the manipulated (suffix _2) version of each chart pair. If a manipulated chart has a high HMI value, a capable MLLM should identify that manipulation as an important cause of misleadingness.

3 Conclusion

We described a new theory of mind benchmark CHARTOM for MLLMs based on visual chart comprehension. An important part of the ground truth is established by our human experiments. We hope our benchmark helps the development of future MLLMs.

Acknowledgments. We thank Robert Hawkins for discussions on theory of mind and Yea-Seul Kim and Yuhang Zhao for discussions on computer vision for chart comprehension. This work was supported by NSF IIS 2202457.

References

[1] Rabah A. Al-Zaidy and C. Lee Giles. Automatic extraction of data from bar charts. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP ’15, New York, NY, USA, 2015. Association for Computing Machinery.
[2] Simon Baron-Cohen, Alan M Leslie, and Uta Frith. Does the autistic child have a “theory of mind”? Cognition, 21(1):37–46, 1985.
[3] Bas Tummers. Datathief iii, 2022.
[4] J Scott Brennen, Felix M Simon, and Rasmus Kleis Nielsen. Beyond (mis) representation: Visuals in covid-19 misinformation. The International Journal of Press/Politics, 26(1):277–299, 2021.
[5] Jinho Choi, Sanghun Jung, Deok Gun Park, Jaegul Choo, and Niklas Elmqvist. Visualizing for the non-visual: Enabling the visually impaired to use visualization. Computer Graphics Forum, 38(3):249–260, 2019.
[6] Mathieu Cliche, David Rosenberg, Dhruv Madeka, and Connie Yee. Scatteract: Automated Extraction of Data from Scatter Plots, page 135–150. Springer International Publishing, 2017.
[7] Richelle J Cooper, David L Schriger, Roger C Wallace, Vladislav J Mikulich, and Michael S Wilkes. The quantity and quality of scientific graphs in pharmaceutical advertisements. Journal of General Internal Medicine, 18(4):294–297, 2003.
[8] Nicholas Diakopoulos. Ethics in data-driven visual storytelling. In Data-Driven Storytelling, pages 233–248. AK Peters/CRC Press, 2018.
[9] Xuan Long Do, Mohammad Hassanpour, Ahmed Masry, Parsa Kavehzadeh, Enamul Hoque, and Shafiq Joty. Do llms work on charts? designing few-shot prompts for chart question answering and summarization, 2023.
[10] Pamela Engel. This chart shows an alarming rise in florida gun deaths after ‘stand your ground’ was enacted. Business Insider. 2014-02-18.
[11] Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36, 2024.
[12] Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation, 2023.
[13] Jiani Huang, Haihua Chen, Fengchang Yu, and Wei Lu. From detection to application: Recent advances in understanding scientific tables;and figures. ACM Comput. Surv., apr 2024. Just Accepted.
[14] Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, and Heng Ji. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models, 2024.
[15] Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, and Enamul Hoque. Are large vision language models up to the challenge of chart comprehension and reasoning? an extensive investigation into the capabilities and limitations of lvlms, 2024.
[16] Daekyoung Jung, Wonjae Kim, Hyunjoo Song, Jeong-in Hwang, Bongshin Lee, Bohyoung Kim, and Jinwook Seo. Chartsense: Interactive data extraction from chart images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, page 6706–6717, New York, NY, USA, 2017. Association for Computing Machinery.
[17] Anthony Kay. Tesseract: an open-source optical character recognition engine. Linux J., 2007(159):2, jul 2007.
[18] Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023.
[19] Wonjoong Kim, Sangwu Park, Yeonjun In, Seokwon Han, and Chanyoung Park. Simplot: Enhancing chart question answering by distilling essentials, 2024.
[20] Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083, 2023.
[21] Oh Nam Kwon, Chaereen Han, Changsuk Lee, Kyungwon Lee, Kyeongjun Kim, Gyeongha Jo, and Gangwon Yoon. Graphs in the covid-19 news: A mathematics audit of newspapers in korea. Educational Studies in Mathematics, pages 1–18, 2021.
[22] Claire Lauer and Shaun O’Brien. How people are influenced by deceptive tactics in everyday charts and graphs. IEEE Transactions on Professional Communication, 63(4):327–340, 2020.
[23] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation, 2023.
[24] Leo Yu-Ho Lo, Ayush Gupta, Kento Shigyo, Aoyu Wu, Enrico Bertini, and Huamin Qu. Misinformed by visualization: What do we learn from misinformative visualizations? In Computer Graphics Forum, volume 41, pages 515–525. Wiley Online Library, 2022.
[25] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14662–14684, Singapore, December 2023. Association for Computational Linguistics.
[26] Gonzalo Gabriel Méndez, Miguel A. Nacenta, and Sebastien Vandenheste. ivolver: Interactive visual language for visualization extraction and reconstruction. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, page 4073–4085, New York, NY, USA, 2016. Association for Computing Machinery.
[27] Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning, 2024.
[28] Prerna Mishra, Santosh Kumar, Mithilesh Kumar Chaube, and Urmila Shrawankar. Chartvi: Charts summarizer for visually impaired. Journal of Computer Languages, 69:101107, 2022.
[29] Anshul Vikram Pandey, Katharina Rall, Margaret L Satterthwaite, Oded Nov, and Enrico Bertini. How deceptive are deceptive visualizations? an empirical analysis of common distortion techniques. In Proceedings of the 33rd annual acm conference on human factors in computing systems, pages 1469–1478, 2015.
[30] David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4):515–526, 1978.
[31] Jihyun Rho, Martina Rau, Shubham Bharti, Rosanne Luu, Jeremy McMahan, Andrew Wang, and Xiaojin Zhu. Various misleading visual features in misleading graphs: Do they truly deceive us? In The Annual Conference of the Cognitive Science Society (CogSci), 2024.
[32] Lindy Ryan. Visual storytelling with data. The Visual Imperative, pages 131–151, 2016.
[33] Maarten Sap, Ronan Le Bras, Daniel Fried, and Yejin Choi. Neural theory-of-mind? on the limits of social intelligence in large lms. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3762–3780, 2022.
[34] Manolis Savva, Nicholas Kong, Arti Chhajta, Li Fei-Fei, Maneesh Agrawala, and Jeffrey Heer. Revision: Automated classification, analysis and redesign of chart images. In ACM User Interface Software & Technology (UIST), 2011.
[35] James WA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, pages 1–11, 2024.
[36] Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399, 2023.
[37] Jevin D West and Carl T Bergstrom. Misinformation in and about science. Proceedings of the National Academy of Sciences, 118(15):e1912444117, 2021.
[38] Heinz Wimmer and Josef Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1):103–128, 1983.
[39] Yifan Wu, Lutao Yan, Yuyu Luo, Yunhai Wang, and Nan Tang. Evaluating task-based effectiveness of mllms on charts, 2024.
[40] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. Tinychart: Efficient chart understanding with visual token merging and program-of-thoughts learning, 2024.
[41] Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Faruqui Manaal. How far are large language models from agents with theory-of-mind? arXiv preprint arXiv:2310.03051, 2023.

Appendix A: Selected Charts in Table 1


G4_Q1_1	G4_Q1_2
The following graph shows military spending. What is the trend of military spending from 2000 to 2012? 1. It has increased. 2. It has decreased. 3. It stayed the same.

G3_Q1_1	G3_Q1_2
The following graph shows CD shipments. What is the trend of CD shipments from 2017 to 2020? 1. It has increased. 2. It has decreased. 3. It stayed the same.

G9_Q1_1	G9_Q1_2
Two fictional countries Tomainia and Bacteria have been vying for military supremacy. The figure shows the number of tanks the countries had over the years. Which country had more tanks in 2021? 1. Tomainia 2. Bacteria


G1_Q1_1	G1_Q1_2
As a stack of $1 bills, a company’s corporate tax in 2014 was 1 yard high. How high is the company’s corporate tax in 2018 as a stack of $1 bills?

G2_Q1_1	G2_Q1_2
A company’s corporate tax in 2018 weighed 1 ton in gold. How many tons in gold does the company’s corporate tax weigh in 2022?

G12_Q1_1	G12_Q1_2
The following graph shows the profit of company X in millions. Which year was company X most profitable? 1. 2018 2. 2022 3. 2018 and 2022 were the same.


G7_Q1_1	G7_Q1_2
The following graph shows the number of dogs adopted. The dogs adopted in 2018 eat 1 million bags of dog food in their lifetimes. How much do the dogs adopted in 2019 eat in their lifetimes?

G8_Q1_1	G8_Q1_2
The following graph shows the number of cats adopted. The cats adopted in 2017 need 1 ton of kitty litter pellets. How many tons of kitty litter pellets do the cats adopted in 2018 need?

G10_Q1_1	G10_Q1_2
The graph shows how many bushels of apples were harvested this year for different kinds of apples. If 3 trucks are needed to transport all the Pink Lady apples, how many are needed to transport all the Honeycrisp apples?


G6_Q1_1	G6_Q1_2
The following graph shows the relationship between car weight and miles per gallon. As car weight increases, what happens to the car’s miles per gallon? 1. Miles per gallon increases. 2. Miles per gallon decreases.

G5_Q1_1	G5_Q1_2
The following graph shows the number of ebola cases in Sierra Leone. An election took place in September 2014. When did ebola cases increase more? 1. Before the election 2. After the election

G15_Q1_1	G15_Q1_2
Which state has a higher foreign-born population, Wisconsin (WI) or Florida (FL)? 1. Wisconsin (WI) 2. Florida (FL)


G13_Q1_1	G13_Q1_2
The figure below shows market share for different companies. Sort companies by market share from largest (1) to smallest (5). The largest company (1) — the smallest company (5)

G14_Q1_1	G14_Q1_2
The figure below shows market share for different fast food restaurants. Based on this figure, sort restaurants by market share from largest (1) to smallest (5). The largest company (1) — the smallest company (5)