Unraveling the Truth: Do LLMs really Understand Charts?
A Deep Dive into Consistency and Robustness

Srija Mukhopadhyay^*, Adnan Qidwai^*, Aparna Garimella^†, Pritika Ramu^†
Vivek Gupta^§^‡, Dan Roth^§
^*IIIT Hyderabad, ^†Adobe Research, ^§University of Pennsylvania
{srija.mukhopadhyay@research, adnan.qidwai@students}.iiit.ac.in,
{garimell,pramu}@adobe.com ; {gvivek, danroth}@seas.upenn.edu
* contributed equally, ‡ corresponding author

Abstract

Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets, developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the models’ ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in the field.


1 Introduction

Chart question answering (CQA) Masry et al. (2022); Chaudhry et al. (2020) has emerged as a critical area within the field of Visual Language Understanding (VLU) Lee et al. (2023); Ghosh et al. (2024), aiming to equip machines with the ability to comprehend and answer questions based on data visualizations. While recent advancements in Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have yielded impressive performance improvements in CQA Liu et al. (2023b); Masry et al. (2023); Xia et al. (2024); Xu et al. (2024); Team et al. (2023); Achiam et al. (2023); Meng et al. (2024), their true capabilities remain uncertain. This paper analyzes the robustness and consistency of state-of-the-art CQA models, exposing their limitations and guiding future research directions.

Figure 1: Simple and Complex Questions on a complex chart

We address several key questions regarding the current state of CQA: Are existing models truly effective, or do their impressive average scores mask significant weaknesses? For instance, in Figure 1, one can ask whether a model’s performance remains consistent across two distinct question types. The first type, Simple Questions like "What is the number of tigers present in Narnia?", involves straightforward value extraction. In contrast, Complex Questions such as "Is the mean number of leopards across all sanctuaries greater than that of cheetah?" require extracting multiple values, aggregating them, and making boolean comparisons. Complex questions pose challenges even for humans, so understanding how models handle them provides valuable insight into their capabilities.

How do models perform on specific aspects of chart understanding, such as question complexity and chart type? For example, in Figure 2, model performance is examined across different types of charts (Simple Charts and Complex Charts) as well as different question types (Simple Questions and Complex Questions). Complex Charts, such as grouped-bar charts that compare multiple attributes side by side, present information in a more intricate manner than Simple Charts, which depict data about a single attribute using a single bar. Similarly, questions can range from complex tasks like identifying maximum values, performing aggregations, and making comparisons, to simpler queries focused on straightforward value extraction. Investigating how models handle these varied chart types and question complexities provides crucial insights into their performance and adaptability.

Furthermore, is the robustness of these models, their ability to generalize across diverse variations, adequately explored? The same dataset can be depicted in multiple visual formats. For instance, Figure 3 demonstrates how an original chart can be transformed into stair plots, bar charts, stacked representations, and many more. These variations can differ in aspects such as color schemes, patterns, legend positioning, and even details specific to each chart type like legend orientation and grid sizes on the x-axis and y-axis. Exploring the effect of these variations could provide deeper insights into the data and enhance the comprehensibility of the visualizations for models.

To answer these questions, we present a rigorous evaluation of leading CQA models on a meticulously curated dataset. This dataset encompasses diverse chart types and question categories, allowing for a thorough assessment of model performance across varying levels of complexity. We examine how well the models generalize across diverse visual representations of identical data, assessing their robustness against perturbations. Our findings reveal significant performance discrepancies, particularly when transitioning from simple to complex chart-question combinations. Moreover, we demonstrate that even the highest-performing models exhibit a substantial drop in accuracy when subjected to diverse perturbations, highlighting the critical need for improved robustness in CQA. This paper makes the following contributions:

• Providing a thorough analysis of the strengths and weaknesses of current VLMs and MLLMs for chart understanding.
• Introducing a new evaluation set with fine-grained splits across chart types and question complexities, facilitating a deeper understanding of model performance.
• Performing a detailed robustness analysis to uncover the shortcomings of current models, emphasizing the necessity for additional research in this domain.

Our research sheds light on the current state of CQA, offering crucial insights. Code and data are made publicly available at https://vgupta123.github.io/chartrobustness.html.

2 Initial Dataset

This section highlights the dataset preparation process employed to analyze the performance of CQA models across a spectrum of chart types and question complexities.

2.1 Dataset Selection

To ensure a comprehensive evaluation of CQA models, we selected the ChartQA dataset Masry et al. (2022) as our primary benchmark. This dataset is widely used in CQA benchmarking, covering diverse domains from sources like Our World in Data, Statista, OECD, and Pew Research.

ChartQA includes two distinct question categories: "Human" and "Augmented". "Human" questions were generated by human annotators, while "Augmented" questions were machine-generated, providing a diverse spectrum of question styles. Another important aspect which motivated our choice of ChartQA dataset was the presence of underlying tables. This feature enabled us to generate controlled visual perturbations for the later section of our study. Our experiments were conducted exclusively on the test set of ChartQA, comprising questions, charts and the corresponding tables.

2.2 Chart and Question Labelling

To facilitate a more granular analysis of model performance, we categorized both charts and questions according to their complexity levels. This categorization was applied to the entire ChartQA test set, resulting in a modified evaluation dataset tailored for our experiments.

Chart Categorization.

Charts were classified into two categories using a code-based approach.

- Simple Charts: These charts represent a single entity over a dataframe with two columns, exhibiting no overlaps or complex visual elements. Figure 2 shows an example of such a chart, titled "Number of Votes given by Area".

- Complex Charts: These charts feature more than two columns, often encompassing multiple entities, leading to increased visual complexity. Figure 2 shows an example of such a chart, titled "Number of Votes given to Parties by Area".
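The column-count rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `classify_chart` is hypothetical, and the underlying table is assumed to be available as CSV text with a header row; the paper does not publish the exact code used.

```python
import csv
import io


def classify_chart(table_csv: str) -> str:
    """Label a chart 'simple' or 'complex' from its underlying table.

    Assumption: a table with at most two columns (one label column and one
    value column) corresponds to a Simple Chart; anything wider is Complex.
    """
    reader = csv.reader(io.StringIO(table_csv))
    header = next(reader)
    # Count only non-empty header cells so a trailing comma does not add a column.
    n_columns = len([cell for cell in header if cell.strip()])
    return "simple" if n_columns <= 2 else "complex"
```

For example, a table with the columns `Area,Votes` would be labelled simple, while `Area,Party A,Party B` would be labelled complex.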

Question Categorization.

Human annotators cleaned and categorized the questions from the ChartQA dataset into two categories based on their complexity:

- Simple Questions: These questions primarily focus on data extraction, and typically involve a single step of reasoning. Figure 2 shows an example of such a question: "What is the number of votes given in La La La Land?".

- Complex Questions: These questions require multi-step reasoning and data extraction, often involving comparisons and logical inferences. Figure 2 shows an example of such a question: "Is the mean number of ‘Party A’ voters greater than the mean number of ‘Party B’ voters?".

We introduced these categorizations while preserving the existing division of question generation types (human-generated and augmented questions), resulting in eight categories. The number of unique question-chart pairs in each category is presented in Table 1. This detailed categorization allows us to isolate the impact of chart and question complexity on model performance, providing a deeper understanding of their capabilities and limitations.

Chart type	Human: Simple	Human: Complex	Augmented: Simple	Augmented: Complex
Simple	149	450	876	165
Complex	143	419	133	38

Table 1: Dataset statistics. Rows represent the type of Chart, Columns represent the type of Question and its Generation method.

3 Experiments

Models

To rigorously assess the performance of CQA models, we selected a diverse range of state-of-the-art models, varying in architecture, size, and training setup. All models were evaluated using a zero-shot Chain-of-Thought Wei et al. (2022) prompting approach, with prompts tailored for each model to maximize performance. Importantly, no additional reasoning aids were provided to any of the models. For the sake of clarity and analysis, we grouped the models into three broad categories:

Chart-based VLMs.

This category contains open-source VLMs specifically adapted for chart reasoning. MatCha (282M) Liu et al. (2023b) is a transformer-based model which enhances the capabilities of Pix2Struct Lee et al. (2023) models through pre-training on mathematical reasoning and chart derendering tasks. UniChart (201M) Masry et al. (2023) is a similar model which achieves chart understanding by leveraging pre-training on tasks such as data table generation, numerical and visual reasoning, and open-ended question answering. DePlot (282M) Liu et al. (2023a) is a model which specializes in extracting tabular data from a given chart. The extracted table is subsequently passed to a Language Model (LM), e.g. Flan UL2 (20B) Tay et al. (2022), for reasoning via Chain-of-Thought prompting Wei et al. (2022).

Generalist VLMs.

This category comprises open-source VLMs trained on general visual comprehension tasks. Notably, these models were not specifically trained or adapted for chart reasoning. QwenVL Bai et al. (2023b) is a generalist 7-billion-parameter VLM built on top of Qwen-LM Bai et al. (2023a) through the integration of visual encoders and the use of general and multi-task pre-training. CogAgent VQA Hong et al. (2024) is an 18-billion-parameter VLM specializing in Graphical User Interface (GUI) understanding and navigation. InternLM-XComposer2 (8B) Dong et al. (2024) is an adaptation of InternLM2-7B Cai et al. (2024), excelling in producing high-quality long-text multi-modal content and reasoning within visual-language understanding contexts.

Large MLLMs.

This category features state-of-the-art closed-source Multimodal Large Language Models (MLLMs) pre-trained on extensive visual and language data. For this category, we utilized Gemini 1.5 Flash Team et al. (2023), and GPT-4o Achiam et al. (2023), renowned for their capabilities in reasoning and visual understanding.

		Chart-based VLMs			Generalist VLMs			MLLMs	
Question set	Category	MatCha	UniChart	DePlot + Flan UL2	Qwen VL	CogAgent VQA	InternLM XComposer2	Gemini 1.5 Flash	GPT-4o
Human	SS	57.00	49.60	51.60	66.40	81.20	79.90	87.92	88.59
Human	SC	30.22	32.00	32.80	44.20	55.50	58.60	81.11	88.22
Human	CS	45.40	47.50	30.60	60.10	58.00	74.10	80.42	81.82
Human	CC	25.29	25.00	25.20	35.00	42.40	51.30	74.46	83.29
Augmented	SS	91.40	87.20	76.10	86.50	80.90	82.50	91.32	94.18
Augmented	SC	65.40	66.00	72.70	72.10	76.90	68.40	80.61	88.48
Augmented	CS	78.10	69.20	48.10	61.60	47.30	68.40	81.20	80.45
Augmented	CC	34.20	44.70	52.60	36.80	55.20	47.30	65.79	71.05

Table 2: Model performance across different categories. S denotes ’Simple’ and C denotes ’Complex’. The first and second letter represents chart and question type respectively.

Evaluation

To improve on the Relaxed Accuracy metric, we introduce a new evaluation metric that adds checks for precise answer matching. Like Relaxed Accuracy, it allows a 5% tolerance when matching numerical answers, but it also includes the following checks:

• Alphanumeric String Matching: Commas and spaces are removed from the answers during matching to ensure an exact alphanumeric string comparison.
• Strict Year Matching: For questions specifically asking for a "Year" as an answer, the 5% relaxation is disabled, forcing a strict string match. This ensures that the model accurately identifies the correct year.
• Unordered Exact List Matching: For questions requiring multiple answers, an unordered exact list match is applied, which ensures that the model correctly identifies all the elements in the answer list, regardless of their order.

To validate the accuracy of our proposed evaluation metric, we manually verified the answers obtained using this metric.
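One possible reading of the three checks is sketched below. It is not the authors' implementation: the function names, the `is_year` flag (the paper does not say how year questions are detected), the case-insensitive comparison, and the handling of a gold value of zero are all assumptions made for illustration.

```python
import re


def _normalize(text) -> str:
    # Remove commas and whitespace; lower-casing is an added assumption.
    return re.sub(r"[,\s]", "", str(text)).lower()


def _to_float(text):
    # Returns None when the answer is not numeric; a trailing '%' is ignored.
    try:
        return float(_normalize(text).rstrip("%"))
    except ValueError:
        return None


def match_single(pred, gold, is_year=False, tolerance=0.05) -> bool:
    p, g = _normalize(pred), _normalize(gold)
    if p == g:
        return True
    if is_year:
        # Strict year matching: no numeric relaxation.
        return False
    pf, gf = _to_float(pred), _to_float(gold)
    if pf is None or gf is None:
        return False
    if gf == 0:
        return pf == 0
    return abs(pf - gf) / abs(gf) <= tolerance


def match_answer(pred, gold, is_year=False) -> bool:
    if isinstance(gold, (list, tuple)):
        # Unordered exact list matching: same elements, any order, no tolerance.
        if not isinstance(pred, (list, tuple)):
            return False
        return sorted(map(_normalize, pred)) == sorted(map(_normalize, gold))
    return match_single(pred, gold, is_year=is_year)
```

Under these assumptions, `match_answer("104", "100")` is accepted because it is within the 5% tolerance, while `match_answer("2015", "2014", is_year=True)` is rejected even though the two values differ by far less than 5%.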

Smaller VLMs.

Smaller models (QwenVL, CogAgent, InternLM) struggled to produce answers in the correct format. We addressed this by employing an "LLM as an Extractor" approach, using Gemini 1.5 Flash to extract answers from their outputs. Manual verification of 150 samples confirmed that Gemini primarily acted as a formatting tool, preserving the original model’s answer in 149 cases and performing rounding in the one remaining instance. This demonstrates Gemini’s effectiveness in enhancing the usability of smaller models without significantly altering their intent.

4 Can VLMs reason consistently?

This section presents our findings and analysis on the performance of various chart question answering (CQA) models across different chart types and question complexities.

4.1 Results and Discussion

Table 2 gives an overview of all results obtained for this section.

$(Q1)$ Does any model excel across all categories?

While no single model dominates all categories, GPT-4o and Gemini 1.5 Flash consistently demonstrate impressive performance, with GPT-4o leading in most cases. Among open-source models, InternLM stands out as the top performer.

Interestingly, models specifically trained on chart tasks (MatCha, UniChart) excel on augmented questions. This likely stems from their exposure to similar question formats during training. This is particularly evident in simple questions from the augmented set, where MatCha achieves a high accuracy of 91.40%, followed by UniChart at 87.20%. However, they struggle significantly with reasoning-based questions, achieving as low as 25% accuracy for complex chart and complex question pairs, highlighting the need for enhancement in the reasoning abilities of such models.

$(Q2)$ How do models perform across various chart types?

Across all models, a consistent trend emerged: performance was better on simple charts than on complex charts, regardless of the question type. This behavior is likely attributable to the inherent difficulty in understanding and extracting values from complex charts. Factors like overlapping data points and complex color resolution contribute to challenges in data extraction, increasing the difficulty of reasoning on such charts.

$(Q3)$ How do models perform across various question types?

For the same chart type, models consistently perform better on simple questions compared to complex questions. This significant difference in scores highlights the limitations of certain models in fine-grained data extraction and reasoning. GPT-4o and Gemini 1.5 Flash exhibit the smallest decrease in scores, indicating strong data extraction and reasoning capabilities. Smaller models, particularly those specifically trained on charts, struggle with questions requiring mathematical reasoning, despite their competence in basic data extraction.

$(Q4)$ Do models struggle more with complex charts or complex questions?

To assess model capabilities, we compared performance on two categories: "Simple Charts, Complex Questions" and "Complex Charts, Simple Questions." This analysis reveals whether a model excels at visual data extraction (complex charts) or reasoning (complex questions).

Our results show that MLLMs like GPT-4o demonstrate strong reasoning skills, handling complex questions well when the chart is simple. Gemini 1.5 Flash, by contrast, performs consistently across both categories. Generalist and chart-based VLMs tend to score higher on complex charts paired with simple questions than on simple charts paired with complex questions, suggesting limitations in complex reasoning. This insight allows for targeted fine-tuning of each model in the areas where it is weakest.
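The comparison can be made concrete with a small calculation over the Human rows of Table 2. This is a sketch resting on one assumption: the second and third Human rows of Table 2 are taken to be the SC and CS categories, following the category order used in Table 3. The helper name `sc_minus_cs` is illustrative.

```python
MODELS = [
    "MatCha", "UniChart", "DePlot + Flan UL2", "Qwen VL",
    "CogAgent VQA", "InternLM XComposer2", "Gemini 1.5 Flash", "GPT-4o",
]

# Human-question accuracies copied from Table 2.
# Assumption: row 2 is SC (simple chart, complex question), row 3 is CS.
HUMAN_SC = [30.22, 32.00, 32.80, 44.20, 55.50, 58.60, 81.11, 88.22]
HUMAN_CS = [45.40, 47.50, 30.60, 60.10, 58.00, 74.10, 80.42, 81.82]


def sc_minus_cs(sc, cs):
    """Positive gap: the model copes better with complex questions than complex charts."""
    return {model: round(s - c, 2) for model, s, c in zip(MODELS, sc, cs)}


gaps = sc_minus_cs(HUMAN_SC, HUMAN_CS)
for model, gap in gaps.items():
    harder = "complex charts" if gap > 0 else "complex questions"
    print(f"{model:22s} SC - CS = {gap:+7.2f}  (harder: {harder})")
```

A gap near zero corresponds to a model that performs similarly on both categories, while a large negative gap points to reasoning rather than data extraction as the weaker skill.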

$(Q5)$ Are there charts and questions where all models consistently fail to answer accurately?

We focused on identifying patterns of model failure across all categories. Given below are a few recurring difficulties for models:

- Charts containing similar colours: Models struggled with charts which required discrimination between slightly different colors. The issue extends further to recognizing specific colors by their names accurately.

- Tight pie charts: In some instances, models incorrectly assigned labels to categories in pie charts with narrow slices, failing to identify the correct association.

- Charts containing summary statistics: Models failed to interpret such charts, recalculating metrics like mean or sum even though these values were explicitly provided within the chart itself.

- Questions involving counting: Models consistently struggled to accurately count objects when the number exceeded ten.

$(Q6)$ How well do the models attend to the provided image for reasoning?

To investigate the extent to which models rely on visual information versus their internal knowledge base, we conducted an experiment using blank images and irrelevant charts. We sampled 100 questions from each category and tested the top-performing models on their reasoning skills.

Surprisingly, even when presented with irrelevant or blank images, some models successfully answered the questions, indicating a reliance on their pre-existing knowledge. This observation suggests potential leaks in testing data, as models even provided factually incorrect answers, highlighting the need for masked evaluation sets for visual reasoning tasks.

Our analysis, detailed in Table 3, reveals that even large models like Gemini 1.5 Flash and GPT-4o answered a few questions based on irrelevant charts, highlighting the need for models that integrate visual information for robust visual reasoning.

Model	Blank Charts				Irrelevant Charts
	SS	SC	CS	CC	SS	SC	CS	CC
Gemini 1.5 Flash	0	0	0	0	0	2	2	4
GPT-4o	0	3	0	3	0	2	1	6
InternLM-XComposer2	2	3	8	6	1	5	3	2
CogAgent-VQA	11	5	13	9	5	7	20	8
Qwen-VL	7	9	21	17	9	8	13	14

Table 3: Performance of models when probed with blank and irrelevant charts. S denotes ’Simple’ and C denotes ’Complex’. The first letter represents chart type and the second letter represents question type.
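The probing setup behind Table 3 can be sketched as follows. The record fields (`question`, `chart`, `category`), the `BLANK_IMAGE` placeholder path, and the way an irrelevant chart is chosen (uniformly at random among all other charts) are assumptions for illustration; the paper only states that 100 questions were sampled per category and paired with blank images and irrelevant charts.

```python
import random

BLANK_IMAGE = "blank.png"  # hypothetical path to an all-white image


def build_probe_set(examples, n_per_category=100, seed=0):
    """Sample questions per category and pair each with a blank and an irrelevant chart."""
    rng = random.Random(seed)
    by_category = {}
    for ex in examples:
        by_category.setdefault(ex["category"], []).append(ex)
    all_charts = sorted({ex["chart"] for ex in examples})

    probes = []
    for category in sorted(by_category):
        pool = by_category[category]
        for ex in rng.sample(pool, min(n_per_category, len(pool))):
            # Any chart other than the question's own counts as irrelevant here.
            others = [c for c in all_charts if c != ex["chart"]]
            probes.append({
                "question": ex["question"],
                "category": category,
                "original_chart": ex["chart"],
                "blank_chart": BLANK_IMAGE,
                "irrelevant_chart": rng.choice(others) if others else None,
            })
    return probes
```

Each probe is then sent to a model twice, once with `blank_chart` and once with `irrelevant_chart`; a correct answer in either setting suggests the model did not need the image.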

While our analysis reveals that models face challenges with certain categories of questions and charts, it also underscores the significant progress achieved in chart question answering (CQA) performance across various models.

5 Are VLMs robust on CQA?

Another crucial aspect of our analysis involves investigating the robustness and consistency of these models across different visual representations of the same underlying data. Through this probing, we aim to understand whether model performance remains stable when presented with variations in chart types, styles, or aesthetics while conveying the same information.

Figure 3 illustrates how an original chart can be converted into stair plots, bar charts, stacked representations, and more. These variations may differ in color schemes, patterns, legend positioning, and other chart-specific details like legend orientation and grid sizes on the x-axis and y-axis. Examining these variations can offer deeper insights into the data and improve the clarity of the visualizations.

5.1 Our RobustCQA Dataset

Following the initial dataset preparation, a perturbation dataset was created to rigorously assess the robustness of the top-performing models across diverse chart variations. We refer to this dataset as the RobustCQA dataset, which systematically manipulates various chart elements while preserving the underlying data.

Creation

We identified 75 unique perturbation types for both simple and complex charts. These perturbations cover a broad spectrum of visual variations, including:

• Color Scheme Changes: Modifying color palettes, gradients and hues.
• Chart Type Variations: Experimenting with line plots, bar plots, stair plots, stem plots and other less commonly used chart types.
• Legend and Axis Modification: Altering label position, formatting, and positioning of legend and axis elements.

A detailed section showcasing all perturbation types has been presented in the Appendix.
The perturbed charts were generated using Matplotlib, ensuring that only the perturbed element changed while maintaining consistency in all other chart elements. The tables from the ChartQA dataset served as the source for the underlying data.
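The idea of changing one chart element at a time while holding the data fixed can be sketched with Matplotlib. This is a minimal illustration, not the generation code used for RobustCQA: the `Perturbation` dataclass, the `render` function, the table layout (`labels` plus a `series` dictionary), and the mapping from the perturbation names to specific settings are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perturbation:
    name: str
    kind: str = "bar"        # one of: bar, barh, line, step, stem
    stacked: bool = False
    log_scale: bool = False
    grid: bool = False
    annotate: bool = False


# Each entry differs from the original chart in exactly one setting.
PERTURBATIONS = [
    Perturbation("original_chart"),
    Perturbation("annotated_bars", annotate=True),
    Perturbation("grid", grid=True),
    Perturbation("log_scale", log_scale=True),
    Perturbation("stacked", stacked=True),
    Perturbation("horizontal_grouped", kind="barh"),
    Perturbation("line_representation", kind="line"),
    Perturbation("stair_plot_normal", kind="step"),
    Perturbation("stem_plot", kind="stem"),
]


def render(table, p, path):
    """Draw `table` under perturbation `p` and save the image to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    labels, series = table["labels"], table["series"]
    x = list(range(len(labels)))
    n = len(series)
    width = 0.8 / n
    bottoms = [0.0] * len(labels)
    fig, ax = plt.subplots(figsize=(6, 4))
    for i, (name, values) in enumerate(series.items()):
        # Shift each series sideways so grouped bars sit next to each other.
        pos = [xi + (i - (n - 1) / 2) * width for xi in x]
        bars = None
        if p.kind == "bar" and p.stacked:
            bars = ax.bar(x, values, bottom=bottoms, label=name)
            bottoms = [b + v for b, v in zip(bottoms, values)]
        elif p.kind == "bar":
            bars = ax.bar(pos, values, width=width, label=name)
        elif p.kind == "barh":
            bars = ax.barh(pos, values, height=width, label=name)
        elif p.kind == "line":
            ax.plot(x, values, marker="o", label=name)
        elif p.kind == "step":
            ax.step(x, values, where="mid", label=name)
        elif p.kind == "stem":
            ax.stem(x, values, label=name)
        else:
            raise ValueError(f"unknown chart kind: {p.kind}")
        if p.annotate and bars is not None:
            ax.bar_label(bars)  # write each value next to its bar
    if p.kind == "barh":
        ax.set_yticks(x)
        ax.set_yticklabels(labels)
    else:
        ax.set_xticks(x)
        ax.set_xticklabels(labels)
        if p.log_scale:
            ax.set_yscale("log")
    ax.grid(p.grid)
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `render(table, p, f"{p.name}.png")` for every `p` in `PERTURBATIONS`, with a table such as `{"labels": ["North", "South"], "series": {"Party A": [10, 12], "Party B": [7, 15]}}`, would produce one image per variation of the same data.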

Human Verification

To ensure the quality and relevance of our dataset, a rigorous manual annotation process was employed. Expert evaluators meticulously verified each perturbed chart, assessing its comprehensibility and answerability by humans. They also evaluated the relevance of each perturbation to the specific chart type, refining the perturbation set to include only meaningful variations. The underlying tables were also thoroughly verified to confirm that the generated questions remained answerable based on the chart data. This comprehensive evaluation was facilitated by a custom-built annotation platform, specifically designed to streamline the manual annotation process and ensure high-quality data.

Column groups: Simple Questions and Complex Questions, each covering MLLMs (Gemini 1.5 Flash, GPT-4o) and Generalist VLMs (Qwen, CogAgent VQA, InternLM XComposer2).

Category (rows): original_chart, annotations, annotated_bars, basic, color_random, color_scheme, data_pivot, font, grid, hatching, horizontal_grouped, horizontal_stacked, legend_position, line_representation, log_scale, only_data_color_scheme, replacing_legend_with_labels, scaling_size, scatter_representation, stacked, stacked_area, stair_plot_normal, stair_plot_with_marker, stem_plot, tick_orientation, tick_position

Table 4: Model Performance on various perturbations on Complex Charts.

Final Dataset.

The finalized perturbations were then grouped into related categories to create the final dataset. This set comprises 22 unique perturbations categories for simple charts and 25 such categories for complex charts, encompassing a wide range of visual variations.

To ensure a fair analysis of model robustness across perturbations, 100 questions were sampled for each question type within each chart category. This standardized question set allows for direct comparison of model performance across different visual representations. A detailed breakdown of the perturbation categories along with examples has been included in the Appendix.

5.2 Methodology

To delve deeper into the performance and limitations of leading chart question answering models, we evaluated Qwen-VL, CogAgent-VQA, InternLM-XComposer2 (open-source VLMs) and Gemini 1.5 Flash, GPT-4o (closed-source MLLMs) using our RobustCQA dataset. We employed a similar evaluation metric as before, leveraging an LLM extractor for smaller models to ensure consistent output format, and analyzed all models through Zero-Shot Chain-of-Thought prompting.

5.3 Results and Discussion

The results obtained for perturbations on complex charts are shown in Table 4. For simple charts, a similar table is presented in the Appendix.

$(Q1)$ Does model performance stay consistent with perturbed charts?

The results reveal a significant performance degradation for most models when confronted with perturbations. While performance generally decreases across all models, some exhibit more drastic drops.

Among open-source models, InternLM-XComposer2 proves most resilient, demonstrating consistency across various perturbations. However, CogAgent-VQA and Qwen-VL struggle significantly with most perturbations, exhibiting low accuracy. Surprisingly, GPT-4o, a closed-source model, displays relatively low accuracy with most perturbations, highlighting a potential lack of robust data extraction skills, particularly with non-annotated charts. In contrast, Gemini 1.5 Flash demonstrates notable consistency with minimal variations across perturbations, showcasing strong reasoning and data extraction capabilities.

Manual analysis of model performances suggests that Gemini 1.5 Flash’s success stems from its approximation skills, allowing it to accurately estimate values even in non-annotated charts where other models struggle. This highlights the importance of improving data extraction for non-annotated charts to enhance model robustness in chart-based tasks.
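One simple way to quantify this kind of degradation is to compute accuracy per perturbation and compare it with the accuracy on the original chart. The sketch below assumes per-question results are available as `(perturbation, is_correct)` pairs; the function names are illustrative and this is not the paper's analysis code.

```python
from collections import defaultdict


def accuracy_by_perturbation(records):
    """Return accuracy (in percent) for every perturbation name."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for perturbation, correct in records:
        totals[perturbation] += 1
        hits[perturbation] += int(bool(correct))
    return {p: 100.0 * hits[p] / totals[p] for p in totals}


def change_from_original(accuracy, baseline="original_chart"):
    """Rank perturbations by accuracy change relative to the unperturbed chart."""
    base = accuracy[baseline]
    deltas = {p: a - base for p, a in accuracy.items() if p != baseline}
    # Most helpful perturbation first, most harmful last.
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Taking the first and last five entries of the ranked list for each model would give best and worst lists of the kind reported in Tables 5 and 6.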

$(Q2)$ Are there certain perturbations which help enhance the model performance?

Our experiments highlighted several perturbations that unexpectedly improved model performance. Notably, across all models, annotated data points consistently boosted accuracy. While the most beneficial plot type varied across models and question/chart categories, annotated bar graphs emerged as a consistently positive influence.

Furthermore, the addition of a grid and altering tick orientation also yielded significant performance improvements. Grids provide models with precise reference points for data estimation, while clear tick labels enhance accurate data point interpretation. Other beneficial perturbations included replacing legends with labels on lines and adjusting legend positioning. Replacing legends reduces the complexity of color resolution, which often leads to accuracy drops.

Optimal legend placement ensures that labels do not obscure crucial data points. Initial experimentation also indicated that increasing font size, particularly for smaller models, often led to improved performance; these results are presented in Table 5.

$(Q3)$ Are there perturbations which are always detrimental to model performance?

Our analysis reveals that while models demonstrate promising performance on standard chart datasets, they struggle with robustness when faced with visual perturbations. While annotations generally improve performance, most other perturbations negatively impact model accuracy.

Logarithmic scales pose the most significant challenge, likely due to the inherent difficulty in data retrieval even for humans. Models also struggle significantly with horizontal chart variations, particularly horizontal stacked charts, though they perform better with vertical stacked charts.

It should however be noted that models do also struggle with normal stacked plots as well as area plots. This observation can be attributed to the complex process of accurate data extraction for these plots, requiring additional mathematical reasoning to get the precise value of a point which is challenging for humans as well.

Stair plots, where models struggle to identify the precise data point to refer to, and changes in chart scale, which seem to disrupt model attention, also contribute to performance decline.

These findings emphasize the need to develop more robust models that can effectively interpret visual information beyond simple visual cues. Some examples are listed in Table 6.

Models and Perturbation Types
Gemini 1.5 Flash	GPT-4o	Qwen-VL	CogAgent-VQA	InternLM-XComposer2
Annotations on individual points	Annotations on individual points	Annotations on Bar Graphs	Annotations on individual points	Annotations on bar charts
Annotations on Bar Graphs	Annotations on Bar Graphs	Annotations on individual points	Annotations on Bar Graphs	Annotations on individual points
Random Color Scheme in Chart	Placing Legend Elements with Line	Basic Matplotlib Charts	Random Color Scheme in Chart	Area Plot
Placing Legend Elements with Line	Random Markers and Line Styles	Placing Legend Elements with Line	Axes Transposition	Horizontal Bar Charts
Basic Matplotlib Charts	Basic Matplotlib Charts	Changing Font Size	Basic Matplotlib Charts	Random Color Scheme

Table 5: Top 5 best performing perturbations for each model

Models and Perturbation Types
Models: Gemini 1.5 Flash, GPT-4o, Qwen-VL, CogAgent-VQA, InternLM-XComposer2
Perturbation types: Stacked Area Chart; Horizontally Stacked Bars; Stacked Bar Graphs; Horizontal Bar Charts; Horizontally Stacked Bars; Horizontally Stacked Bars; Stacked Area Chart; Changing Horizontal and Vertical Dimension; Stacked Area Chart; Changing Horizontal and Vertical Dimension; Stacked Bar Graphs; Log Scale; Horizontally Stacked Bars; Stacked Area Chart; Log Scale; Horizontal Grouped Bar Charts; Random Representation of Scatter Plots; Horizontal Grouped Bar Charts; Stacked Bar Graphs; Normal Stair Plot; Hatched Pattern in Bar Charts; Stair Plots with Marker; Log Scale; Changing Font Size

Table 6: Top 5 worst performing perturbations for each model

$(Q4)$ Are there certain perturbations which are more effective for certain question types?

Our analysis suggests that certain chart types might be more effective than others for different question types. For instance, line charts excel at revealing trends and correlation. Stacked bar charts are less suitable for almost any question, unless it explicitly asks for data aggregation, as extracting individual data points from stacked bars can be challenging. Bar charts, while useful for comparing individual values within a group, prove to not be good for showcasing correlations across different groups or entities. These observations highlight the importance of selecting correct chart types for accurate and efficient question answering, particularly in domain-specific applications.

$(Q5)$ Does the effect of each perturbation type vary across models?

The impact of each perturbation on model performance exhibits significant variation. While question and chart type play a role, for a given model, certain perturbations consistently prove more helpful or harmful. This nuanced effect is detailed in Tables 5 and 6, allowing us to identify specific areas for model enhancement, ultimately leading to a deeper understanding and extraction of insights from chart data.

Annotations, for both bar and point, consistently ranked among the top performing perturbations across most models. Secondly, chart elements in random colors surprisingly proved to be beneficial, indicating that models are capable of effectively resolving visual information when provided with colours far apart from each other. Furthermore, replacing legends with element names placed alongside or within the chart resulted in improved performance compared to traditional legend-based techniques. On the other hand, we observed that stacked chart elements, particularly horizontally stacked bars, significantly hindered model performance because of tougher data extraction. Similarly, logarithmic scales, known to be challenging for human interpretation, also negatively impacted model accuracy. This analysis provides valuable information for targeted model fine-tuning, addressing specific weaknesses and improving overall performance.

6 Related Work

Chart datasets

Chart comprehension and question answering (CQA) Hoque et al. (2022) are highly important domains with a substantial body of research. While existing CQA datasets have advanced reasoning capabilities over charts, most suffer from limitations in terms of size Kafle et al. (2018), template-based questions Methani et al. (2020); Chaudhry et al. (2020), synthetically generated charts Methani et al. (2020); Han et al. (2023); Chaudhry et al. (2020), restriction to a specific domain Methani et al. (2020); Ahmed et al. (2023); Li and Tajbakhsh (2023), or having only open-domain question answers Kantharaj et al. (2022). Additionally, even the current dataset used for state-of-the-art benchmarking, ChartQA Masry et al. (2022), is limited in that it lacks the classifications needed for a more meaningful analysis. These datasets also have limited variation in the kinds of charts used.

Additionally, ChartX Xia et al. (2024) incorporates different chart types in its analysis, while ChartBench Xu et al. (2024) and MMC Liu et al. (2024) focus on creating large-scale datasets with varied chart types.

Finally, a very recent work, CharXiv Wang et al. (2024), provides an extensive and comprehensive evaluation on a range of charts and questions, the latter ranging from those requiring reasoning to descriptive ones. The authors also perform ablation analysis by modifying charts and questions.

However, to the best of our knowledge, RobustCQA is the first dataset to systematically perturb all elements within a chart, enabling a fine-grained analysis of factors affecting model performance. Additionally, we perform a fine-grained analysis on the ChartQA dataset, based on question and chart complexity, which has not been done before.

Modeling approaches for charts

Various approaches have been developed for chart modeling. These include models built primarily for chart comprehension and reasoning, constructed with the end-to-end goal of reasoning on charts Liu et al. (2023b); Masry et al. (2023); Singh and Shekhar (2020), as well as models which focus on converting the chart to an intermediate table format Liu et al. (2023a), followed by reasoning by a generalized LLM through Chain of Thought Wei et al. (2022) or Program of Thought prompting Chen et al. (2023). Additionally, there are generalized models which are used for general multi-modal reasoning tasks including chart comprehension Team et al. (2023); Achiam et al. (2023); Bai et al. (2023b); Dong et al. (2024); Hong et al. (2024). Finally, there has also been recent work on making small yet accurate models for this task Wang et al. (2024).
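To make the table-then-reason pipeline concrete, the sketch below linearizes an already extracted data table and wraps it in a zero-shot chain-of-thought prompt. It is only an illustration: the helper names, the pipe-separated table layout, and the prompt wording are assumptions of this sketch, not the format used by any of the cited systems, and the chart-to-table step itself is assumed to have been done by a separate model.

```python
def table_to_text(header, rows):
    """Linearize a table as pipe-separated lines (layout is an assumption)."""
    lines = [" | ".join(str(cell) for cell in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def build_cot_prompt(header, rows, question):
    """Combine the linearized table and the question into one prompt string.

    The trailing cue is the commonly used zero-shot chain-of-thought trigger.
    """
    return (
        "Read the following table extracted from a chart.\n\n"
        f"{table_to_text(header, rows)}\n\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )


# Toy, made-up table purely for illustration.
example_prompt = build_cot_prompt(
    header=["Year", "Sales"],
    rows=[[2020, 10], [2021, 14]],
    question="By how much did sales grow from 2020 to 2021?",
)
```

The resulting string would then be passed to a general-purpose LLM; errors in the extracted table propagate directly into the reasoning step, which is one reason data extraction quality matters for this family of approaches.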

While these approaches have shown significant improvements, there is limited insight into their specific failure points. A recent work Islam et al. (2024) focused on understanding the performance of GPT-4v and Gemini and provided a broad analysis on the performance of these models for various chart comprehension tasks including question answering, summarization and fact checking.

On the other hand, our work focuses specifically on CQA with a broader range of models, providing an in-depth analysis of the question and chart types that contribute to model failures. Our contribution pinpoints which question type or chart type causes models to fail, highlighting crucial information for enhancing model performance.

Vision-Language Model Robustness

Recent studies have also highlighted the vulnerability of different models to attacks and perturbations Ma et al. (2024); Zhao et al. (2023), raising significant concerns about their deployment.

Inspired by the adversarial vulnerability observed in vision and language tasks, we create a robustness benchmark for chart question answering.

A similar work, Gupta et al. (2024), also analyses the performance of DePlot and MatCha on perturbed charts, using questions that focus on structural and visual context. Our work, on the other hand, investigates general reasoning questions to understand how model performance is affected by variations in the visual representation of the same underlying data.

7 Conclusion

This research introduces ChartQA-Split and RobustCQA, the first datasets dedicated to understanding model consistency across complexities and robustness to visual perturbations in chart question answering. Our evaluation of SOTA models, including baselines and VLMs/MLLMs, using a zero-shot chain-of-thought setting, reveals significant challenges in both areas. We perform an in-depth analysis of model weaknesses and identify key areas for improvement, such as enhancing data extraction for non-annotated charts and developing models that can effectively interpret complex visual information, taking every possible visual cue into consideration. Our work provides a foundation for future research in developing more robust and reliable chart question answering systems.

Future Directions.

Perturbation analysis provides a nuanced understanding of model performance by revealing both universal and model-specific vulnerabilities and strengths. This insight drives targeted improvements:

Model Pretraining: Focusing on perturbations that affect models allows for effective fine-tuning to address weaknesses.

Perturbation-Aware Training: Integrating specific perturbations during training enhances overall robustness, helping models develop resilience against challenges.

Interpretable Models: Understanding the impact of perturbations aids in debugging and explainability, fostering the development of reliable and transparent chart understanding systems.

Limitations

The presented work has several limitations. First, our data was obtained from a single dataset, and we used only one plotting software for testing the perturbations. Expanding the dataset to include diverse sources and exploring various plotting libraries would strengthen the findings and improve generalizability. Second, the dataset is limited to English, while models are developed and evaluated on a wide variety of languages. Future research is required to expand the domain beyond English. Third, in order to build a more generalized perturbation set, we were not able to cover a few chart types in our analysis. These include pie and doughnut charts, pyramid and funnel charts, and radar charts.

Ethics Statement

This research adheres to the ACL code of ethics, acknowledging and addressing potential ethical implications. While LLMs assisted in writing and presentation, all ideas and conclusions are solely attributed to the authors. The research promotes responsible and fair use of methodologies, ensuring transparency and reproducibility. We plan to release all scripts, resources, comprehensive documentation, evaluation metrics, datasets, model specifications, and prompting methods to enable others to build upon our work. We strive to present our findings clearly and accurately, avoiding exaggerated claims or misinterpretations.

Acknowledgement

Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was partially funded by ONR Contract N00014-23-1-2365. Lastly, we acknowledge the generous gift from Adobe.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Ahmed et al. (2023) Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, and Venu Govindaraju. 2023. Realcqa: Scientific chart question answering as a test-bed for first-order logic. In International Conference on Document Analysis and Recognition, pages 66–83. Springer.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. arXiv preprint arXiv:2309.16609.
Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. Internlm2 technical report. CoRR.
Chaudhry et al. (2020) Ritwick Chaudhry, Sumit Shekhar, Utkarsh Gupta, Pranav Maneriker, Prann Bansal, and Ajay Joshi. 2020. Leaf-qa: Locate, encode & attend for figure question answering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE.
Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
Dong et al. (2024) Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420.
Ghosh et al. (2024) Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, and Aman Chadha. 2024. Exploring the frontier of vision-language models: A survey of current methodologies and future directions.
Gupta et al. (2024) Ashim Gupta, Vivek Gupta, Shuo Zhang, Yujie He, Ning Zhang, and Shalin Shah. 2024. Enhancing question answering on charts through effective pre-training tasks.
Han et al. (2023) Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal llm for chart understanding and generation.
Hong et al. (2024) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
Hoque et al. (2022) E. Hoque, P. Kavehzadeh, and A. Masry. 2022. Chart question answering: State of the art and future directions. Computer Graphics Forum, 41(3):555–572.
Islam et al. (2024) Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, and Enamul Hoque. 2024. Are large vision language models up to the challenge of chart comprehension and reasoning? an extensive investigation into the capabilities and limitations of lvlms.
Kafle et al. (2018) Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656.
Kantharaj et al. (2022) Shankar Kantharaj, Xuan Long Do, Rixie Tiffany Leong, Jia Qing Tan, Enamul Hoque, and Shafiq Joty. 2022. OpenCQA: Open-ended question answering with charts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11817–11837, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Lee et al. (2023) Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR.
Li and Tajbakhsh (2023) Shengzhi Li and Nima Tajbakhsh. 2023. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.
Liu et al. (2023a) Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. 2023a. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, Toronto, Canada. Association for Computational Linguistics.
Liu et al. (2023b) Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Eisenschlos. 2023b. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12756–12770, Toronto, Canada. Association for Computational Linguistics.
Liu et al. (2024) Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. 2024. MMC: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1287–1310, Mexico City, Mexico. Association for Computational Linguistics.
Ma et al. (2024) J. Ma, P. Wang, D. Kong, Z. Wang, J. Liu, H. Pei, and J. Zhao. 2024. Robust visual question answering: Datasets, methods, and future challenges. IEEE Transactions on Pattern Analysis & Machine Intelligence, 46(08):5575–5594.
Masry et al. (2022) Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland. Association for Computational Linguistics.
Masry et al. (2023) Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. 2023. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14662–14684, Singapore. Association for Computational Linguistics.
Meng et al. (2024) Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. 2024. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384.
Methani et al. (2020) Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. 2020. Plotqa: Reasoning over scientific plots. In The IEEE Winter Conference on Applications of Computer Vision (WACV).
Singh and Shekhar (2020) Hrituraj Singh and Sumit Shekhar. 2020. STL-CQA: Structure-based transformers with localization and encoding for chart question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3275–3284, Online. Association for Computational Linguistics.
Tay et al. (2022) Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier García, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. 2022. Ul2: Unifying language learning paradigms. In International Conference on Learning Representations.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Wang et al. (2024) Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. 2024. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
Xia et al. (2024) Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, et al. 2024. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185.
Xu et al. (2024) Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. 2024. Chartbench: A benchmark for complex visual reasoning in charts.
Zhao et al. (2023) Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. 2023. On evaluating adversarial robustness of large vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems.

Appendix

Effect of Font-size on Models

Table 7 demonstrates the significant impact of font size on model performance. Increasing font size has a positive effect on the OCR capabilities of visual language models (VLMs). This finding suggests that increasing font size can be a beneficial preprocessing step for improving model performance in tasks involving chart comprehension through such models.
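As a rough illustration of the preprocessing step suggested above, the sketch below builds a set of enlarged font settings that could be applied before re-rendering a chart. It assumes the charts are drawn with Matplotlib, which the paper does not state; the `scaled_font_params` helper, the base size, and the scaling factor are hypothetical. Only the rcParams key names are standard Matplotlib settings.

```python
def scaled_font_params(base_size=10.0, factor=1.5):
    """Build a dict of Matplotlib rcParams keys with enlarged font sizes.

    The function itself needs no plotting library; it only prepares settings.
    `base_size` and `factor` are illustrative values, not ones from the paper.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    size = base_size * factor
    keys = (
        "font.size",
        "axes.titlesize",
        "axes.labelsize",
        "xtick.labelsize",
        "ytick.labelsize",
        "legend.fontsize",
    )
    return {key: size for key in keys}


big_font = scaled_font_params(base_size=10.0, factor=1.5)

# With Matplotlib installed, the settings could be applied before plotting:
#   import matplotlib.pyplot as plt
#   plt.rcParams.update(big_font)
```

Scaling every text element by the same factor keeps the chart's relative layout roughly intact, so any change in model accuracy can be attributed more plausibly to legibility than to a changed composition.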

Table 7: Effect of increasing font size. Columns: Gemini 1.5 Flash (Small Font, Big Font) and Qwen VL (Small Font, Big Font). Rows (perturbation types): Normal line plot; Colors in a given scheme (line); Colors random (scatter); Line Representation; Stem Plot; Stair Plot; Ablation - removing Y axis; Rotated X axis Tick; Annotated Bar Graph; Horizontal Bar Graph.

Model Scores

In addition to the scores of models across perturbations for complex charts, Table 8 presents the corresponding scores for simple charts.

Different chart perturbations

This section details the construction and structure of the RobustCQA dataset.

Initially, we generated an extensive set of perturbed chart images, creating 85 unique perturbations applied to both simple and complex chart types. This process ensured that every chart element, including those unique to specific chart types, was isolated and perturbed. For example, we perturbed markers for scatter and line plots and varied line styles for line, stem, and step plots.
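The idea of isolating one chart element at a time can be sketched as follows. This is a hypothetical illustration, not the generation script behind RobustCQA: the option grids, defaults, and the `isolated_perturbations` helper are assumptions, and the marker and line style codes are simply valid Matplotlib values chosen as examples.

```python
# Hypothetical option grids per chart type (not the paper's actual lists).
ELEMENT_OPTIONS = {
    "line": {"marker": ["o", "s", "^"], "linestyle": ["-", "--", ":"]},
    "scatter": {"marker": ["o", "s", "^"]},
}

# Assumed default style; every variant differs from it in exactly one element.
DEFAULTS = {"marker": "o", "linestyle": "-"}


def isolated_perturbations(chart_type):
    """Vary one element at a time while keeping the others at their default."""
    options = ELEMENT_OPTIONS[chart_type]
    variants = []
    for element, values in options.items():
        for value in values:
            if value == DEFAULTS[element]:
                continue  # the default is the unperturbed chart
            style = {name: DEFAULTS[name] for name in options}
            style[element] = value
            variants.append({"element": element, "style": style})
    return variants


line_variants = isolated_perturbations("line")
scatter_variants = isolated_perturbations("scatter")
```

Because each variant changes a single element, any difference in model accuracy between a variant and the default chart can be attributed to that element alone.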

Following the initial generation, we performed a rigorous analysis of the perturbations, manually categorizing them into distinct groups based on their visual characteristics and impact on chart interpretation. This categorization allowed us to identify and retain only the most relevant perturbations for each chart type.

During the refinement process, we carefully considered the relevance and interpretability of specific perturbations for different chart types. For example, "stack plots" were not considered for simple charts due to the absence of stackable elements. Similarly, "overlapping area plots" were excluded from complex charts due to their inherent ambiguity and complexity even for human annotators.
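The refinement step amounts to filtering candidate categories by chart complexity. The sketch below encodes only the two exclusions named above; the category labels, the `EXCLUDED` mapping, and the candidate list are illustrative assumptions rather than the full RobustCQA category set.

```python
# Exclusions mentioned in the text; the string labels are assumed names.
EXCLUDED = {
    "simple": {"stack plot"},
    "complex": {"overlapping area plot"},
}


def retained_categories(candidates, chart_complexity):
    """Drop perturbation categories that do not apply to the given complexity."""
    excluded = EXCLUDED.get(chart_complexity, set())
    return [category for category in candidates if category not in excluded]


# Toy candidate list purely for illustration.
candidates = ["annotated bar", "stack plot", "overlapping area plot", "log scale"]
simple_categories = retained_categories(candidates, "simple")
complex_categories = retained_categories(candidates, "complex")
```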

Following this process, we ended up with 22 unique categories for simple charts and 25 unique categories for complex charts, as shown in the results tables.

To illustrate the final perturbation categories, we provide representative images showcasing all subplots within each category. These visual examples provide a clear understanding of the types of perturbations included in RobustCQA and have been depicted in the section following the tables.

Simple Questions

Complex Questions

MLLMs

Generalist VLMs

MLLMs

Generalist VLMs

Category

Gemini

1.5 Flash

GPT-4o

Qwen

CogAgent