Unraveling the Truth: Do LLMs really Understand Charts?
A Deep Dive into Consistency and Robustness

Srija Mukhopadhyay*, Adnan Qidwai*, Aparna Garimella, Pritika Ramu
Vivek Gupta§, Dan Roth§
*IIIT Hyderabad, Adobe Research, §University of Pennsylvania
{srija.mukhopadhyay@research, adnan.qidwai@students}.iiit.ac.in,
{garimell,pramu}@adobe.com ; {gvivek, danroth}@seas.upenn.edu
 * contributed equally, ‡  corresponding author
Abstract

Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets, developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the models’ ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in the field.


1 Introduction

Chart question answering (CQA) Masry et al. (2022); Chaudhry et al. (2020) has emerged as a critical area within the field of Visual Language Understanding (VLU) Lee et al. (2023); Ghosh et al. (2024), aiming to equip machines with the ability to comprehend and answer questions based on data visualizations. While recent advancements in Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have yielded impressive performance improvements in CQA Liu et al. (2023b); Masry et al. (2023); Xia et al. (2024); Xu et al. (2024); Team et al. (2023); Achiam et al. (2023); Meng et al. (2024), their true capabilities remain unclear. This paper analyzes the robustness and consistency of state-of-the-art CQA models, exposing their limitations and guiding future research directions.

Figure 1: Simple and Complex Questions on a complex chart

We address several key questions regarding the current state of CQA: Are existing models truly effective, or do their impressive average scores mask significant weaknesses? For instance, in Figure 1, one can ask whether a model’s performance remains consistent across two distinct question types. The first type, Simple Questions like "What is the number of tigers present in Narnia?", involves straightforward value extraction. In contrast, Complex Questions such as "Is the mean number of leopards across all sanctuaries greater than that of cheetah?" require extracting multiple values, aggregating them, and making boolean comparisons. Complex questions pose challenges even for humans; understanding how models handle these complexities provides valuable insights into their capabilities.

How do models perform on specific aspects of chart understanding, such as question complexity and chart type? For example, in Figure 2, model performance is examined across different types of charts (Simple Charts and Complex Charts) as well as varying question types (Simple Questions and Complex Questions). Complex Charts, such as grouped-bar charts that compare multiple attributes side by side, present information in a more intricate manner than Simple Charts, which depict data about a single attribute using a single bar. Similarly, questions can range from complex tasks like identifying maximum values, performing aggregations, and making comparisons, to simpler queries focused on straightforward value extraction. Investigating how models handle these varied chart types and question complexities provides crucial insights into their performance and adaptability.

Figure 2: Example of simple chart and complex chart, along with simple and complex questions.

Furthermore, is the robustness of these models, their ability to generalize across diverse variations, adequately explored? The same dataset can be depicted in multiple visual formats. For instance, Figure 3 demonstrates how an original chart can be transformed into stair plots, bar charts, stacked representations, and many more. These variations can differ in aspects such as color schemes, patterns, legend positioning, and even details specific to each chart type like legend orientation and grid sizes on the x-axis and y-axis. Exploring the effect of these variations could provide deeper insights into the data and enhance the comprehensibility of the visualizations for models.

To answer these questions, we present a rigorous evaluation of leading CQA models on a meticulously curated dataset. This dataset encompasses diverse chart types and question categories, allowing for a thorough assessment of model performance across varying levels of complexity. We examine how well the models generalize across diverse visual representations of identical data, assessing their robustness against perturbations. Our findings reveal significant performance discrepancies, particularly when transitioning from simple to complex chart-question combinations. Moreover, we demonstrate that even the highest-performing models exhibit a substantial drop in accuracy when subjected to diverse perturbations, highlighting the critical need for improved robustness in CQA. This paper makes the following contributions:

  • Providing a thorough analysis of the strengths and weaknesses of current VLMs and MLLMs for chart understanding.

  • Introducing a new evaluation set with fine-grained splits across chart types and question complexities, facilitating a deeper understanding of model performance.

  • Performing a detailed robustness analysis to uncover the shortcomings of current models, emphasizing the necessity for additional research in this domain.

Our research sheds light on the current state of CQA, offering crucial insights. Code and data are made publicly available at https://vgupta123.github.io/chartrobustness.html.

2 Initial Dataset

This section highlights the dataset preparation process employed to analyze the performance of CQA models across a spectrum of chart types and question complexities.

2.1 Dataset Selection

To ensure a comprehensive evaluation of CQA models, we selected the ChartQA dataset Masry et al. (2022) as our primary benchmark. This dataset is widely used in CQA benchmarking, covering diverse domains from sources like Our World in Data, Statista, OECD, and Pew Research.

ChartQA includes two distinct question categories: "Human" and "Augmented". "Human" questions were generated by human annotators, while "Augmented" questions were machine-generated, providing a diverse spectrum of question styles. Another important aspect that motivated our choice of the ChartQA dataset was the presence of underlying tables. This feature enabled us to generate controlled visual perturbations for the later section of our study. Our experiments were conducted exclusively on the test set of ChartQA, comprising questions, charts and the corresponding tables.

2.2 Chart and Question Labelling

To facilitate a more granular analysis of model performance, we categorized both charts and questions according to their complexity levels. This categorization was applied to the entire ChartQA test set, resulting in a modified evaluation dataset tailored for our experiments.

Chart Categorization.

Charts were classified into two categories using a code-based approach.

- Simple Charts: These charts represent a single entity over a dataframe with two columns, exhibiting no overlaps or complex visual elements. Figure 2 shows an example of such a chart, titled "Number of Votes given by Area".

- Complex Charts: These charts feature more than two columns, often encompassing multiple entities, leading to increased visual complexity. Figure 2 shows an example of such a chart, titled "Number of Votes given to Parties by Area".

Question Categorization.

Human annotators cleaned and categorized the questions from the ChartQA dataset into two categories based on their complexity:

- Simple Questions: These questions primarily focus on data extraction, and typically involve a single step of reasoning. Figure 2 shows an example of such a question: "What is the number of votes given in La La La Land?".

- Complex Questions: These questions require multi-step reasoning and data extraction, often involving comparisons and logical inferences. Figure 2 shows an example of such a question: "Is the mean number of ‘Party A’ voters greater than the mean number of ‘Party B’ voters?".

We introduced these categorizations while preserving the existing division of question generation types (human-generated and augmented questions), resulting in eight categories. The number of unique question-chart pairs in each category is presented in Table 1. This detailed categorization allows us to isolate the impact of chart and question complexity on model performance, providing a deeper understanding of their capabilities and limitations.

                 Human                   Augmented
Chart type       Simple      Complex     Simple      Complex
Simple           149         450         876         165
Complex          143         419         133         38

Table 1: Dataset statistics. Rows represent the chart type; columns represent the question type and its generation method.

3 Experiments

Models

To rigorously assess the performance of CQA models, we selected a diverse range of state-of-the-art models, varying in architecture, size, and training setup. All models were evaluated using a zero-shot Chain-of-Thought Wei et al. (2022) prompting approach, with prompts tailored for each model to maximize performance. Importantly, no additional reasoning aids were provided to any of the models. For the sake of clarity and analysis, we grouped the models into three broad categories:

Chart-based VLMs.

This category contains open-source VLMs specifically adapted for chart reasoning. MatCha (282M) Liu et al. (2023b) is a transformer based model which enhances the capabilities of Pix2Struct Lee et al. (2023) models through pre-training on mathematical reasoning and chart derendering tasks. UniChart (201M) Masry et al. (2023) is another similar model which achieves chart understanding by leveraging pre-training on tasks such as data table generation, numerical and visual reasoning, and open-ended question answering. DePlot (282M) Liu et al. (2023a) is a model which specializes on extracting tabular data from a given chart. The extracted table is subsequently passed to a Language Model (LM), e.g. Flan UL2 (20B) Tay et al. (2022), for reasoning via Chain-of-Thought prompting Wei et al. (2022).

Generalist VLMs.

This category comprises open-source VLMs trained on general visual comprehension tasks. Notably, these models were not specifically trained or adapted for chart reasoning. QwenVL Bai et al. (2023b) is a generalist 7-billion-parameter VLM built on top of Qwen-LM Bai et al. (2023a) through the integration of visual encoders and the use of general and multi-task pre-training. CogAgent VQA Hong et al. (2024) is an 18-billion-parameter VLM specializing in Graphical User Interface (GUI) understanding and navigation. InternLM-XComposer2 (8B) Dong et al. (2024) is an adaptation of InternLM2-7B Cai et al. (2024), excelling in producing high-quality long-text multi-modal content and reasoning within visual-language understanding contexts.

Large MLLMs.

This category features state-of-the-art closed-source Multimodal Large Language Models (MLLMs) pre-trained on extensive visual and language data. For this category, we utilized Gemini 1.5 Flash Team et al. (2023), and GPT-4o Achiam et al. (2023), renowned for their capabilities in reasoning and visual understanding.

            Chart-based VLMs                    Generalist VLMs                     MLLMs
Model       MatCha      UniChart    DePlot +    Qwen VL     CogAgent    InternLM    Gemini      GPT-4o
                                    Flan UL2                VQA         XComposer2  1.5 Flash
Human
  SS        57.00       49.60       51.60       66.40       81.20       79.90       87.92       88.59
  SC        30.22       32.00       32.80       44.20       55.50       58.60       81.11       88.22
  CS        45.40       47.50       30.60       60.10       58.00       74.10       80.42       81.82
  CC        25.29       25.00       25.20       35.00       42.40       51.30       74.46       83.29
Augmented
  SS        91.40       87.20       76.10       86.50       80.90       82.50       91.32       94.18
  SC        65.40       66.00       72.70       72.10       76.90       68.40       80.61       88.48
  CS        78.10       69.20       48.10       61.60       47.30       68.40       81.20       80.45
  CC        34.20       44.70       52.60       36.80       55.20       47.30       65.79       71.05

Table 2: Model performance across different categories. S denotes ‘Simple’ and C denotes ‘Complex’. The first and second letters represent chart and question type, respectively.

Evaluation

To improve on the Relaxed Accuracy metric, we introduce a new evaluation metric that includes extra checks for precise answer matching. Like Relaxed Accuracy, this metric allows a 5% tolerance when matching numerical answers. However, it adds the following checks:

  • Alphanumeric String Matching: Commas and spaces are removed from the answers during matching to ensure an exact alphanumeric string comparison.

  • Strict Year Matching: For questions specifically asking for a "Year" as an answer, the 5% relaxation is disabled, forcing a strict string match. This ensures that the model accurately identifies the correct year.

  • Unordered Exact List Matching: For questions requiring multiple answers, unordered exact list matching is applied, which ensures that the model correctly identifies all the elements in the answer list, regardless of their order.

To validate the accuracy of our proposed evaluation metric, we manually verified the answers obtained using this metric.
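
The checks described in this section could be sketched roughly as follows. This is a minimal illustration and not the released implementation: the function names, the keyword-based detection of "Year" questions, the lowercasing of strings, and the handling of a trailing percent sign are all assumptions made for the example.

```python
import math
import re


def normalize(text):
    """Lowercase (an assumption) and strip commas and whitespace."""
    return re.sub(r"[,\s]", "", str(text)).lower()


def to_number(text):
    """Return the answer as a finite float, or None if it is not numeric."""
    try:
        value = float(normalize(text).rstrip("%"))
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def is_match(prediction, target, question="", tolerance=0.05):
    # Unordered exact list matching: every element must match, order ignored.
    if isinstance(target, (list, tuple)):
        if not isinstance(prediction, (list, tuple)):
            return False
        return sorted(map(normalize, prediction)) == sorted(map(normalize, target))

    # Strict year matching (assumption: year questions are found by keyword).
    if "year" in question.lower():
        return normalize(prediction) == normalize(target)

    # Numeric answers: relative error must stay within the 5% tolerance.
    pred_num, target_num = to_number(prediction), to_number(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance

    # Everything else: exact alphanumeric string comparison.
    return normalize(prediction) == normalize(target)
```

Under these assumptions, `is_match("2015", "2014")` passes the relaxed numeric check, while the same pair is rejected once the question mentions a year.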

Smaller VLMs.

Smaller models (QwenVL, CogAgent, InternLM) struggled to produce answers in the correct format. We addressed this by employing an "LLM as an Extractor" approach, using Gemini 1.5 Flash to extract answers from their outputs. Manual verification of 150 samples confirmed that Gemini primarily acted as a formatting tool, preserving the original model’s answer in 149 cases and performing rounding in the one remaining instance. This demonstrates Gemini’s effectiveness in enhancing the usability of smaller models without significantly altering their intent.

4 Can VLMs reason consistently?

This section presents our findings and analysis on the performance of various chart question answering (CQA) models across different chart types and question complexities.

4.1 Results and Discussion

Table 2 gives an overview of all results obtained for this section.

(Q1) Does any model excel across all categories?

While no single model dominates all categories, GPT-4o and Gemini 1.5 Flash consistently demonstrate impressive performance, with GPT-4o leading in most cases. Among open-source models, InternLM stands out as the top performer.

Interestingly, models specifically trained on chart tasks (MatCha, UniChart) excel on augmented questions. This likely stems from their exposure to similar question formats during training. This is particularly evident in simple questions from the augmented set, where MatCha achieves a high accuracy of 91.40%, followed by UniChart at 87.20%. However, they struggle significantly with reasoning-based questions, achieving as low as 25% accuracy for complex chart and complex question pairs, highlighting the need for enhancement in the reasoning abilities of such models.

(Q2) How do models perform across various chart types?

Across all models, a consistent trend emerged: performance was better on simple charts than on complex charts, regardless of the question type. This behavior is likely attributable to the inherent difficulty in understanding and extracting values from complex charts. Factors like overlapping data points and complex color resolution contribute to challenges in data extraction, increasing the difficulty of reasoning on such charts.

(Q3) How do models perform across various question types?

For the same chart type, models consistently perform better on simple questions compared to complex questions. This significant difference in scores highlights the limitations of certain models in fine-grained data extraction and reasoning. GPT-4o and Gemini 1.5 Flash exhibit the smallest decrease in scores, indicating strong data extraction and reasoning capabilities. Smaller models, particularly those specifically trained on charts, struggle with questions requiring mathematical reasoning, despite their competence in basic data extraction.

(Q4) Do models struggle more with complex charts or complex questions?

To assess model capabilities, we compared performance on two categories: "Simple Charts, Complex Questions" and "Complex Charts, Simple Questions." This analysis reveals whether a model excels at visual data extraction (complex charts) or reasoning (complex questions).

Our results show that MLLMs like GPT-4o demonstrate strong reasoning skills, excelling on complex questions even with simple charts. By contrast, Gemini 1.5 Flash performs consistently across both categories. Generalist and chart-based VLMs tend to perform better on complex charts with simple questions than on simple charts with complex questions, suggesting limitations in complex reasoning. This insight allows for targeted model fine-tuning to address the specific areas where each model is weak.
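
As a small worked example of this comparison, the gap between the two categories can be computed directly from the Human rows of Table 2. The sketch below only restates those numbers; the function name and the sign convention (SC minus CS) are choices made here for illustration.

```python
# Human-question accuracies from Table 2, as
# (simple chart + complex question, complex chart + simple question).
HUMAN_SC_CS = {
    "MatCha": (30.22, 45.40),
    "UniChart": (32.00, 47.50),
    "DePlot + Flan UL2": (32.80, 30.60),
    "Qwen VL": (44.20, 60.10),
    "CogAgent VQA": (55.50, 58.00),
    "InternLM XComposer2": (58.60, 74.10),
    "Gemini 1.5 Flash": (81.11, 80.42),
    "GPT-4o": (88.22, 81.82),
}


def reasoning_vs_extraction_gap(scores):
    """Return SC - CS per model, rounded to two decimals."""
    return {model: round(sc - cs, 2) for model, (sc, cs) in scores.items()}


gaps = reasoning_vs_extraction_gap(HUMAN_SC_CS)
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:22s} {gap:+.2f}")
```

A negative gap means the model loses more accuracy to complex questions than to complex charts; a gap near zero indicates similar performance on both categories.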

(Q5) Are there charts and questions where all models consistently fail to answer accurately?

We focused on identifying patterns of model failure across all categories. Given below are a few recurring difficulties for models:

- Charts containing similar colours: Models struggled with charts which required discrimination between slightly different colors. The issue extends further to recognizing specific colors by their names accurately.

- Tight pie charts: In some instances, models assigned labels to the wrong categories in pie charts with narrow slices, failing to identify the correct association.

- Charts containing summary statistics: Models failed to interpret such charts, recalculating metrics like mean or sum even though these values were explicitly provided within the chart itself.

- Questions involving counting: Models consistently struggled to accurately count objects when the number exceeded ten.

(Q6) How well do the models attend to the provided image for reasoning?

To investigate the extent to which models rely on visual information versus their internal knowledge base, we conducted an experiment using blank images and irrelevant charts. We sampled 100 questions from each category and tested the top-performing models on their reasoning skills.

Surprisingly, even when presented with irrelevant or blank images, some models successfully answered the questions, indicating a reliance on their pre-existing knowledge. This observation suggests potential leaks in testing data, as models even provided factually incorrect answers, highlighting the need for masked evaluation sets for visual reasoning tasks.

Our analysis, detailed in Table 3, reveals that even large models like Gemini 1.5 Flash and GPT-4o were capable of answering a few questions based on irrelevant charts, highlighting the need to develop models that integrate visual information for robust visual reasoning capabilities.

Figure 3: Examples of different types of perturbations on the same original chart and data.

                       Blank Charts            Irrelevant Charts
Model                  SS    SC    CS    CC    SS    SC    CS    CC
Gemini 1.5 Flash       0     0     0     0     0     2     2     4
GPT-4o                 0     3     0     3     0     2     1     6
InternLM-XComposer2    2     3     8     6     1     5     3     2
CogAgent-VQA           11    5     13    9     5     7     20    8
Qwen-VL                7     9     21    17    9     8     13    14

Table 3: Performance of models when probed with blank and irrelevant charts. S denotes ‘Simple’ and C denotes ‘Complex’. The first letter represents chart type and the second letter represents question type.

While our analysis reveals that models face challenges with certain categories of questions and charts, it also underscores the significant progress achieved in chart question answering (CQA) performance across various models.

5 Are VLMs robust on CQA?

Another crucial aspect of our analysis involves investigating the robustness and consistency of these models across different visual representations of the same underlying data. Through this probing, we aim to understand whether model performance remains stable when presented with variations in chart types, styles, or aesthetics while conveying the same information.

Figure 3 illustrates how an original chart can be converted into stair plots, bar charts, stacked representations, and more. These variations may differ in color schemes, patterns, legend positioning, and other chart-specific details like legend orientation and grid sizes on the x-axis and y-axis. Examining these variations can offer deeper insights into the data and improve the clarity of the visualizations.

5.1 Our RobustCQA Dataset

Following the initial dataset preparation, a perturbation dataset was created to rigorously assess the robustness of the top-performing models across diverse chart variations. We refer to this dataset as the RobustCQA dataset, which systematically manipulates various chart elements while preserving the underlying data.

Creation

We identified 75 unique perturbation types for both simple and complex charts. These perturbations cover a broad spectrum of visual variations, including:

  • Color Scheme Changes: Modifying color palettes, gradients and hues.

  • Chart Type Variations: Experimenting with line plots, bar plots, stair plots, stem plots and other less commonly used chart types.

  • Legend and Axis Modification: Altering label position, formatting, and positioning of legend and axis elements.

A detailed section showcasing all perturbation types has been presented in the Appendix.
The perturbed charts were generated using Matplotlib, ensuring that only the perturbed element changed while maintaining consistency in all other chart elements. The tables from the ChartQA dataset served as the source for the underlying data.
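
One way to keep "only the perturbed element changed" is to describe each chart by a fixed style record and derive every perturbation from a shared baseline by overriding a single field. The sketch below illustrates that idea with plain Python; it does not render anything and is not the authors' generation code. The perturbation names are taken from Table 4, while the `ChartStyle` fields and their values are hypothetical.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ChartStyle:
    """Hypothetical style record that a Matplotlib renderer could consume."""
    kind: str = "grouped_bar"
    orientation: str = "vertical"
    y_scale: str = "linear"
    grid: bool = False
    annotate_values: bool = False
    legend_position: str = "upper right"


BASE_STYLE = ChartStyle()

# Each perturbation overrides exactly one field of the baseline style.
PERTURBATIONS = {
    "annotated_bars": {"annotate_values": True},
    "grid": {"grid": True},
    "log_scale": {"y_scale": "log"},
    "stacked": {"kind": "stacked_bar"},
    "horizontal_grouped": {"orientation": "horizontal"},
    "legend_position": {"legend_position": "lower left"},
}


def perturb(base, name):
    """Return a copy of the baseline style with one perturbation applied."""
    return replace(base, **PERTURBATIONS[name])


def changed_fields(base, other):
    """List the style fields that differ between two styles."""
    before, after = asdict(base), asdict(other)
    return {key for key in before if before[key] != after[key]}


for name in PERTURBATIONS:
    print(name, sorted(changed_fields(BASE_STYLE, perturb(BASE_STYLE, name))))
```

Because the underlying table is never part of the style record, the data stays identical across all variants and any accuracy difference can be attributed to the single changed element.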

Human Verification

To ensure the quality and relevance of our dataset, a rigorous manual annotation process was employed. Expert evaluators meticulously verified each perturbed chart, assessing its comprehensibility and answerability by humans. They also evaluated the relevance of each perturbation to the specific chart type, refining the perturbation set to include only meaningful variations. The underlying tables were also thoroughly verified to confirm that the generated questions remained answerable based on the chart data. This comprehensive evaluation was facilitated by a custom-built annotation platform, specifically designed to streamline the manual annotation process and ensure high-quality data.

                              Simple Questions                                       Complex Questions
                              MLLMs                 Generalist VLMs                  MLLMs                 Generalist VLMs
Category                      Gemini     GPT-4o     Qwen       CogAgent   InternLM   Gemini     GPT-4o     Qwen       CogAgent   InternLM
                              1.5 Flash             VL         VQA        XComposer2 1.5 Flash             VL         VQA        XComposer2
original_chart                94         89         62         60         77         71         82         36         45         49
annotations                   86         90         34         37         59         61         77         31         40         43
annotated_bars                83         89         35         31         64         68         74         26         28         38
basic                         67         43         17         17         51         55         46         28         31         37
color_random                  66         45         15         14         51         53         49         21         29         31
color_scheme                  56         45         16         13         47         55         52         26         31         40
data_pivot                    56         43         11         9          46         44         23         28         27         38
font                          67         49         16         18         34         51         43         26         28         33
grid                          67         48         18         16         51         52         51         21         24         34
hatching                      57         37         11         9          42         49         42         28         29         37
horizontal_grouped            60         32         19         14         49         51         42         29         29         40
horizontal_stacked            30         20         16         11         22         59         46         22         32         43
legend_position               52         44         15         19         49         47         46         28         30         28
line_representation           52         44         13         18         35         42         40         29         27         33
log_scale                     38         41         11         9          5          55         45         27         30         38
only_data_color_scheme        62         44         17         18         53         51         50         25         27         39
replacing_legend_with_labels  59         48         19         14         41         45         56         30         28         31
scaling_size                  63         43         11         13         31         47         41         30         25         28
scatter_representation        43         38         12         14         37         45         44         23         27         29
stacked                       36         28         14         13         36         47         38         26         33         32
stacked_area                  34         24         19         13         31         45         41         26         34         34
stair_plot_normal             49         41         14         16         41         52         43         17         29         20
stair_plot_with_marker        55         48         13         20         43         57         51         24         27         45
stem_plot                     47         36         12         12         52         55         41         22         28         35
tick_orientation              66         51         19         14         43         42         33         29         27         27
tick_position                 56         48         21         16         47         49         42         30         31         30

Table 4: Model performance on various perturbations of Complex Charts.

Final Dataset.

The finalized perturbations were then grouped into related categories to create the final dataset. This set comprises 22 unique perturbation categories for simple charts and 25 such categories for complex charts, encompassing a wide range of visual variations.

To ensure a fair analysis of model robustness across perturbations, 100 questions were sampled for each question type within each chart category. This standardized question set allows for direct comparison of model performance across different visual representations. A detailed breakdown of the perturbation categories along with examples has been included in the Appendix.
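
The per-category sampling described above can be sketched with the standard library. The record fields (`chart_type`, `question_type`), the fixed seed, and the behavior for groups with fewer than 100 questions are assumptions for illustration; the paper does not specify them.

```python
import random
from collections import defaultdict


def sample_per_group(items, per_group=100, seed=0):
    """Draw up to `per_group` questions for each (chart type, question type) pair."""
    rng = random.Random(seed)  # fixed seed so the sampled set is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[(item["chart_type"], item["question_type"])].append(item)
    # Assumption: groups smaller than `per_group` are kept in full.
    return {
        key: rng.sample(members, min(per_group, len(members)))
        for key, members in groups.items()
    }


# Hypothetical question pool with 150 items in each of the four groups.
pool = [
    {"id": f"{chart}-{question}-{i}", "chart_type": chart, "question_type": question}
    for chart in ("simple", "complex")
    for question in ("simple", "complex")
    for i in range(150)
]
sampled = sample_per_group(pool)
print({key: len(value) for key, value in sampled.items()})
```

Reusing the same sampled questions for every perturbation of a chart is what makes the accuracies in different rows directly comparable.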

5.2 Methodology

To delve deeper into the performance and limitations of leading chart question answering models, we evaluated Qwen-VL, CogAgent-VQA, InternLM-XComposer2 (open-source VLMs) and Gemini 1.5 Flash, GPT-4o (closed-source MLLMs) using our RobustCQA dataset. We employed an evaluation metric similar to the one used before, leveraging an LLM extractor for smaller models to ensure a consistent output format, and analyzed all models through Zero-Shot Chain-of-Thought prompting.

5.3 Results and Discussion

The results obtained for perturbations on complex charts are shown in Table 4. For simple charts, a similar table is presented in the Appendix.

(Q1) Does model performance stay consistent with perturbed charts?

The results reveal a significant performance degradation for most models when confronted with perturbations. While performance generally decreases across all models, some exhibit more drastic drops.

Among open-source models, InternLM-XComposer2 proves most resilient, demonstrating consistency across various perturbations. However, CogAgent-VQA and Qwen-VL struggle significantly with most perturbations, exhibiting low accuracy. Surprisingly, GPT-4o, a closed-source model, displays relatively low accuracy with most perturbations, highlighting a potential lack of robust data extraction skills, particularly with non-annotated charts. In contrast, Gemini 1.5 Flash demonstrates notable consistency with minimal variations across perturbations, showcasing strong reasoning and data extraction capabilities.

Manual analysis of model performances suggests that Gemini 1.5 Flash’s success stems from its approximation skills, allowing it to accurately estimate values even in non-annotated charts where other models struggle. This highlights the importance of improving data extraction for non-annotated charts to enhance model robustness in chart-based tasks.
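
To make the size of these drops concrete, the sketch below subtracts each perturbed score from the original-chart score for a few rows of Table 4 (simple questions on complex charts). The numbers are copied from the table; the function name and the choice of rows are illustrative only.

```python
MODELS = ["Gemini 1.5 Flash", "GPT-4o", "Qwen VL", "CogAgent VQA", "InternLM XComposer2"]

# Simple-question accuracies on complex charts, copied from Table 4.
SIMPLE_QUESTION_SCORES = {
    "original_chart": [94, 89, 62, 60, 77],
    "annotated_bars": [83, 89, 35, 31, 64],
    "log_scale": [38, 41, 11, 9, 5],
    "horizontal_stacked": [30, 20, 16, 11, 22],
}


def drops_from_original(table, baseline="original_chart"):
    """Return accuracy lost relative to the original chart, per perturbation."""
    base = table[baseline]
    return {
        name: [b - s for b, s in zip(base, scores)]
        for name, scores in table.items()
        if name != baseline
    }


drops = drops_from_original(SIMPLE_QUESTION_SCORES)
for name, values in drops.items():
    print(name, dict(zip(MODELS, values)))
```

Reading the output row by row shows which perturbations cost each model the most accuracy relative to its own baseline, which is a fairer comparison than raw accuracy when baselines differ widely.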

(Q2) Are there certain perturbations which help enhance the model performance?

Our experiments highlighted several perturbations that unexpectedly improved model performance. Notably, across all models, annotated data points consistently boosted accuracy. While the most beneficial plot type varied across models and question/chart categories, annotated bar graphs emerged as a consistently positive influence.

Furthermore, the addition of a grid and altering tick orientation also yielded significant performance improvements. Grids provide models with precise reference points for data estimation, while clear tick labels enhance accurate data point interpretation. Other beneficial perturbations included replacing legends with labels on lines and adjusting legend positioning. Replacing legends reduces the complexity of color resolution, which often leads to accuracy drops.

Optimal legend placement ensures that labels do not obscure crucial data points. Initial experimentation also indicated that increasing font size, particularly for smaller models, often led to improved performance; these results are presented in Table 5.

(Q3) Are there perturbations which are always detrimental to model performance?

Our analysis reveals that while models demonstrate promising performance on standard chart datasets, they struggle with robustness when faced with visual perturbations. While annotations generally improve performance, most other perturbations negatively impact model accuracy.

Logarithmic scales pose the most significant challenge, likely due to the inherent difficulty in data retrieval even for humans. Models also struggle significantly with horizontal chart variations, particularly horizontal stacked charts, though they perform better with vertical stacked charts.

It should be noted, however, that models also struggle with normal stacked plots as well as area plots. This can be attributed to the complex process of accurate data extraction for these plots, which requires additional mathematical reasoning to obtain the precise value of a point and is challenging for humans as well.

Stair plots, where models struggle to identify the precise data point to refer to, and changes in chart scale, which seem to disrupt model attention, also contribute to performance decline.

These findings emphasize the need to develop more robust models that can effectively interpret visual information beyond simple visual cues. Some examples are listed in Table 6.

Models and Perturbation Types
Gemini 1.5 Flash          GPT-4o                    Qwen-VL                   CogAgent-VQA              InternLM-XComposer2

Annotations on            Annotations on            Annotations on Bar        Annotations on            Annotations on bar
individual points         individual points         Graphs                    individual points         charts

Annotations on Bar        Annotations on Bar        Annotations on            Annotations on Bar        Annotations on
Graphs                    Graphs                    individual points         Graphs                    individual points

Random Color Scheme       Placing Legend            Basic Matplotlib          Random Color Scheme       Area Plot
in Chart                  Elements with Line        Charts                    in Chart

Placing Legend            Random Markers            Placing Legend            Axes Transposition        Horizontal Bar
Elements with Line        and Line Styles           Elements with Line                                  Charts

Basic Matplotlib          Basic Matplotlib          Changing Font Size        Basic Matplotlib          Random Color Scheme
Charts                    Charts                                              Charts

Table 5: Top 5 best performing perturbations for each model

Models and Perturbation Types
Gemini 1.5 Flash          GPT-4o                    Qwen-VL                   CogAgent-VQA              InternLM-XComposer2

Stacked Area Chart        Horizontally Stacked      Stacked Bar Graphs        Horizontal Bar            Horizontally Stacked
                          Bars                                                Charts                    Bars

Horizontally Stacked      Stacked Area Chart        Changing Horizontal       Stacked Area Chart        Changing Horizontal
Bars                                                and Vertical Dimension                              and Vertical Dimension

Stacked Bar Graphs        Stacked Bar Graphs        Log Scale                 Horizontally Stacked      Stacked Area Chart
                                                                              Bars

Log Scale                 Horizontal Grouped        Random Representation     Horizontal Grouped        Stacked Bar Graphs
                          Bar Charts                of Scatter Plots          Bar Charts

Normal Stair Plot         Hatched Pattern in        Stair Plots with Marker   Log Scale                 Changing Font Size
                          Bar Charts

Table 6: Top 5 worst performing perturbations for each model

(Q4) Are there certain perturbations which are more effective for certain question types?

Our analysis suggests that certain chart types might be more effective than others for different question types. For instance, line charts excel at revealing trends and correlations. Stacked bar charts are less suitable for almost any question, unless it explicitly asks for data aggregation, as extracting individual data points from stacked bars can be challenging. Bar charts, while useful for comparing individual values within a group, are poorly suited to showcasing correlations across different groups or entities. These observations highlight the importance of selecting appropriate chart types for accurate and efficient question answering, particularly in domain-specific applications.

(Q5) Does the effect of each perturbation type vary across models?

The impact of each perturbation on model performance varies significantly. While question and chart type play a role, for a given model, certain perturbations consistently prove more helpful or harmful. This nuanced effect is detailed in Tables 5 and 6, allowing us to identify specific areas for model enhancement and, ultimately, to better understand and extract insights from chart data.

Annotations, for both bars and points, consistently ranked among the top-performing perturbations across most models. Chart elements in random colors also proved surprisingly beneficial, indicating that models resolve visual information effectively when the colors are far apart from each other. Furthermore, replacing legends with element names placed alongside or within the chart improved performance compared to traditional legends. On the other hand, stacked chart elements, particularly horizontally stacked bars, significantly hindered model performance because data extraction becomes harder. Similarly, logarithmic scales, known to be challenging for human interpretation, also reduced model accuracy. This analysis provides valuable information for targeted model fine-tuning, addressing specific weaknesses and improving overall performance.
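As a rough illustration of how per-model best and worst lists like these could be derived from a score table, the sketch below sorts perturbations by accuracy for each model. The input values are a handful of simple-question scores copied from Table 8, used purely as example data; the rankings in Tables 5 and 6 may have been computed over a different set of charts. The function name `rank_perturbations` and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Accuracy (%) on simple questions for a few perturbations, copied from Table 8.
# Only a subset of rows and two models are included to keep the sketch short.
SCORES: Dict[str, Dict[str, int]] = {
    "Gemini 1.5 Flash": {
        "annotated_bars": 93,
        "annotations": 90,
        "line_representation": 84,
        "color_random": 79,
        "stair_plot_normal": 59,
        "log_scale": 42,
    },
    "Qwen-VL": {
        "annotated_bars": 71,
        "annotations": 62,
        "line_representation": 19,
        "color_random": 20,
        "stair_plot_normal": 20,
        "log_scale": 12,
    },
}


def rank_perturbations(scores: Dict[str, int], k: int) -> Tuple[List[str], List[str]]:
    """Return (k best, k worst) perturbation names for one model."""
    # Sort by descending accuracy; ties are broken alphabetically (an assumption)
    # so that the output is deterministic.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    best = ordered[:k]
    worst = ordered[::-1][:k]
    return best, worst


if __name__ == "__main__":
    for model, scores in SCORES.items():
        best, worst = rank_perturbations(scores, k=2)
        print(f"{model}: best={best} worst={worst}")
```

On this subset, annotated charts rank highest and the logarithmic scale ranks lowest for both models, in line with the discussion above.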

6 Related Work

Chart datasets

Chart comprehension and question answering (CQA) Hoque et al. (2022) are highly important domains with a substantial body of research. While existing CQA datasets have advanced reasoning capabilities over charts, most suffer from limitations in terms of size Kafle et al. (2018), template-based questions Methani et al. (2020); Chaudhry et al. (2020), synthetically generated charts Methani et al. (2020); Han et al. (2023); Chaudhry et al. (2020), restriction to a specific domain Methani et al. (2020); Ahmed et al. (2023); Li and Tajbakhsh (2023), or having only open-domain question answers Kantharaj et al. (2022). Additionally, even the dataset currently used for state-of-the-art benchmarking, ChartQA Masry et al. (2022), lacks classifications that would enable a more meaningful analysis. These datasets also have limited variation in the kinds of charts used.

ChartX Xia et al. (2024) incorporates different chart types in its analysis, while ChartBench Xu et al. (2024) and MMC Liu et al. (2024) focus on creating large-scale datasets with varied chart types.

Finally, a very recent work, CharXiv Wang et al. (2024), provides an extensive and comprehensive evaluation on a range of charts as well as questions ranging from those requiring reasoning to descriptive ones. They also perform ablation analysis by modifying charts and questions.

However, to the best of our knowledge, RobustCQA is the first dataset to systematically perturb all elements within a chart, enabling a fine-grained analysis of factors affecting model performance. Additionally, we perform a fine-grained analysis on the ChartQA dataset as well, based on question and chart complexity, which has not been done before.

Modeling approaches for charts

Various approaches have been developed for chart modeling. These include models built primarily for chart comprehension and reasoning, with the end-to-end goal of reasoning on charts Liu et al. (2023b); Masry et al. (2023); Singh and Shekhar (2020), as well as models that convert the chart to an intermediate table format Liu et al. (2023a), followed by reasoning by a generalized LLM through Chain of Thought Wei et al. (2022) or Program of Thought Chen et al. (2023) prompting. Additionally, there are generalized models used for general multi-modal reasoning tasks, including chart comprehension Team et al. (2023); Achiam et al. (2023); Bai et al. (2023b); Dong et al. (2024); Hong et al. (2024). Finally, there has also been recent work on making small yet accurate models for this task Wang et al. (2024).

While these approaches have shown significant improvements, there is limited insight into their specific failure points. A recent work, Islam et al. (2024), focused on understanding the performance of GPT-4V and Gemini and provided a broad analysis of these models on various chart comprehension tasks, including question answering, summarization, and fact checking.

On the other hand, our work focuses specifically on CQA with a broader range of models, providing an in-depth analysis of the question and chart types that contribute to model failures. Through our contribution, we identify exactly which question types and chart types cause models to fail, highlighting crucial information to help enhance model performance.

Vision-Language Model Robustness

Recent studies have also highlighted the vulnerability of different models to attacks and perturbations Ma et al. (2024); Zhao et al. (2023), raising significant concerns about their deployment.

Inspired by the adversarial vulnerability observed in vision and language tasks, we create a robustness benchmark for chart question answering.

A similar work, Gupta et al. (2024), also analyses the performance of DePlot and MatCha on perturbed charts, using questions that focus on structural and visual context. Our work, on the other hand, investigates general reasoning questions to understand how model performance is affected by variations in the visual representation of the same underlying data.

7 Conclusion

This research introduces ChartQA-Split and RobustCQA, the first datasets dedicated to understanding model consistency across complexities and robustness to visual perturbations in chart question answering. Our evaluation of SOTA models, including baselines and VLMs/MLLMs, using a zero-shot chain-of-thought setting, reveals significant challenges in both areas. We perform an in-depth analysis of model weaknesses and identify key areas for improvement, such as enhancing data extraction for non-annotated charts and developing models that can effectively interpret complex visual information, taking every possible visual cue into consideration. Our work provides a foundation for future research in developing more robust and reliable chart question answering systems.

Future Directions.

Perturbation analysis provides a nuanced understanding of model performance by revealing both universal and model-specific vulnerabilities and strengths. This insight drives targeted improvements. Model Pretraining: focusing on perturbations that affect models allows for effective fine-tuning to address weaknesses. Perturbation-Aware Training: integrating specific perturbations during training enhances overall robustness, helping models develop resilience against challenges. Interpretable Models: understanding the impact of perturbations aids in debugging and explainability, fostering the development of reliable and transparent chart understanding systems.

Limitations

The presented work exhibits several limitations. First, our data was obtained from a single dataset, and we used only one plotting library for testing the perturbations. Expanding the dataset to include diverse sources and exploring various plotting libraries would strengthen the findings and improve generalizability. Second, the dataset is limited to English, while models are developed and evaluated on a wide variety of languages. Future research is required to expand the domain beyond English. Third, in order to build a more generalized perturbation set, we were not able to cover a few chart types in our analysis. These include pie and doughnut charts, pyramid and funnel charts, and radar charts.

Ethics Statement

This research adheres to the ACL code of ethics, acknowledging and addressing potential ethical implications. While LLMs assisted in writing and presentation, all ideas and conclusions are solely attributed to the authors. The research promotes responsible and fair use of methodologies, ensuring transparency and reproducibility. We plan to release all scripts, resources, comprehensive documentation, evaluation metrics, datasets, model specifications, and prompting methods to enable others to build upon our work. We strive to present our findings clearly and accurately, avoiding exaggerated claims or misinterpretations.

Acknowledgement

Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was partially funded by ONR Contract N00014-23-1-2365. Lastly, we acknowledge the generous gift from Adobe.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Ahmed et al. (2023) Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, and Venu Govindaraju. 2023. Realcqa: Scientific chart question answering as a test-bed for first-order logic. In International Conference on Document Analysis and Recognition, pages 66–83. Springer.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
  • Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. Internlm2 technical report. CoRR.
  • Chaudhry et al. (2020) Ritwick Chaudhry, Sumit Shekhar, Utkarsh Gupta, Pranav Maneriker, Prann Bansal, and Ajay Joshi. 2020. Leaf-qa: Locate, encode & attend for figure question answering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE.
  • Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
  • Dong et al. (2024) Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420.
  • Ghosh et al. (2024) Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, and Aman Chadha. 2024. Exploring the frontier of vision-language models: A survey of current methodologies and future directions.
  • Gupta et al. (2024) Ashim Gupta, Vivek Gupta, Shuo Zhang, Yujie He, Ning Zhang, and Shalin Shah. 2024. Enhancing question answering on charts through effective pre-training tasks.
  • Han et al. (2023) Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal llm for chart understanding and generation.
  • Hong et al. (2024) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
  • Hoque et al. (2022) E. Hoque, P. Kavehzadeh, and A. Masry. 2022. Chart question answering: State of the art and future directions. Computer Graphics Forum, 41(3):555–572.
  • Islam et al. (2024) Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, and Enamul Hoque. 2024. Are large vision language models up to the challenge of chart comprehension and reasoning? an extensive investigation into the capabilities and limitations of lvlms.
  • Kafle et al. (2018) Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656.
  • Kantharaj et al. (2022) Shankar Kantharaj, Xuan Long Do, Rixie Tiffany Leong, Jia Qing Tan, Enamul Hoque, and Shafiq Joty. 2022. OpenCQA: Open-ended question answering with charts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11817–11837, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Lee et al. (2023) Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR.
  • Li and Tajbakhsh (2023) Shengzhi Li and Nima Tajbakhsh. 2023. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.
  • Liu et al. (2023a) Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. 2023a. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, Toronto, Canada. Association for Computational Linguistics.
  • Liu et al. (2023b) Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Eisenschlos. 2023b. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12756–12770, Toronto, Canada. Association for Computational Linguistics.
  • Liu et al. (2024) Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. 2024. MMC: Advancing multimodal chart understanding with large-scale instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1287–1310, Mexico City, Mexico. Association for Computational Linguistics.
  • Ma et al. (2024) J. Ma, P. Wang, D. Kong, Z. Wang, J. Liu, H. Pei, and J. Zhao. 2024. Robust visual question answering: Datasets, methods, and future challenges. IEEE Transactions on Pattern Analysis & Machine Intelligence, 46(08):5575–5594.
  • Masry et al. (2022) Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland. Association for Computational Linguistics.
  • Masry et al. (2023) Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. 2023. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14662–14684, Singapore. Association for Computational Linguistics.
  • Meng et al. (2024) Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. 2024. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384.
  • Methani et al. (2020) Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. 2020. Plotqa: Reasoning over scientific plots. In The IEEE Winter Conference on Applications of Computer Vision (WACV).
  • Singh and Shekhar (2020) Hrituraj Singh and Sumit Shekhar. 2020. STL-CQA: Structure-based transformers with localization and encoding for chart question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3275–3284, Online. Association for Computational Linguistics.
  • Tay et al. (2022) Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier García, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. 2022. Ul2: Unifying language learning paradigms. In International Conference on Learning Representations.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Wang et al. (2024) Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. 2024. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  • Xia et al. (2024) Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, et al. 2024. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185.
  • Xu et al. (2024) Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. 2024. Chartbench: A benchmark for complex visual reasoning in charts.
  • Zhao et al. (2023) Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. 2023. On evaluating adversarial robustness of large vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems.

Appendix

Effect of Font-size on Models

Table 7 demonstrates the significant impact of font size on model performance. Increasing font size has a positive effect on the OCR capabilities of visual language models (VLMs). This finding suggests that increasing font size can be a beneficial preprocessing step for improving model performance in tasks involving chart comprehension through such models.

| Perturbation type | Gemini 1.5 Flash (Small Font) | Gemini 1.5 Flash (Big Font) | Qwen VL (Small Font) | Qwen VL (Big Font) |
|---|---|---|---|---|
| Normal line plot | 62 | 63 | 6 | 27 |
| Colors in a given scheme (line) | 56 | 63 | 7 | 26 |
| Colors random (scatter) | 46 | 57 | 5 | 21 |
| Line Representation | 47 | 50 | 10 | 18 |
| Stem Plot | 45 | 52 | 6 | 15 |
| Stair Plot | 42 | 48 | 10 | 26 |
| Ablation - removing Y axis | 63 | 64 | 7 | 22 |
| Rotated X axis Tick | 56 | 62 | 8 | 22 |
| Annotated Bar Graph | 77 | 82 | 12 | 42 |
| Horizontal Bar Graph | 54 | 65 | 11 | 19 |
Table 7: Effect of increasing font size
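The font-size effect can be quantified with simple arithmetic. The sketch below copies the rows of Table 7 and computes, for each model, the per-perturbation gain when moving from small to big font, along with the mean gain. The helper names `font_gains` and `mean_gain` are introduced here only for illustration.

```python
from typing import Dict, List, Tuple

# Rows copied from Table 7:
# (perturbation, Gemini small, Gemini big, Qwen VL small, Qwen VL big)
TABLE_7: List[Tuple[str, int, int, int, int]] = [
    ("Normal line plot", 62, 63, 6, 27),
    ("Colors in a given scheme (line)", 56, 63, 7, 26),
    ("Colors random (scatter)", 46, 57, 5, 21),
    ("Line Representation", 47, 50, 10, 18),
    ("Stem Plot", 45, 52, 6, 15),
    ("Stair Plot", 42, 48, 10, 26),
    ("Ablation - removing Y axis", 63, 64, 7, 22),
    ("Rotated X axis Tick", 56, 62, 8, 22),
    ("Annotated Bar Graph", 77, 82, 12, 42),
    ("Horizontal Bar Graph", 54, 65, 11, 19),
]


def font_gains(rows: List[Tuple[str, int, int, int, int]]) -> Dict[str, List[int]]:
    """Per-perturbation score gain (big font minus small font) for each model."""
    gains: Dict[str, List[int]] = {"Gemini 1.5 Flash": [], "Qwen VL": []}
    for _name, g_small, g_big, q_small, q_big in rows:
        gains["Gemini 1.5 Flash"].append(g_big - g_small)
        gains["Qwen VL"].append(q_big - q_small)
    return gains


def mean_gain(gains: Dict[str, List[int]]) -> Dict[str, float]:
    """Average gain per model across all listed perturbations."""
    return {model: sum(values) / len(values) for model, values in gains.items()}


if __name__ == "__main__":
    gains = font_gains(TABLE_7)
    for model, avg in mean_gain(gains).items():
        print(f"{model}: gains={gains[model]} mean={avg:.1f}")
```

On these ten rows the gain is positive for every perturbation and both models, averaging 5.8 points for Gemini 1.5 Flash and 15.6 points for Qwen VL.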

Model Scores

In addition to the scores of models across various perturbations for complex charts, we also present the scores across various perturbations for simple charts in Table 8.

Different chart perturbations

This section details the construction and structure of the RobustCQA dataset.

Initially, we generated an extensive set of perturbed chart images, creating 85 unique perturbations applied to both simple and complex chart types. This process ensured that every chart element, including those unique to specific chart types, was isolated and perturbed. For example, we perturbed markers for scatter and line plots and varied line styles for line, stem, and step plots.

Following the initial generation, we performed a rigorous analysis of the perturbations, manually categorizing them into distinct groups based on their visual characteristics and impact on chart interpretation. This categorization allowed us to identify and retain only the most relevant perturbations for each chart type.

During the refinement process, we carefully considered the relevance and interpretability of specific perturbations for different chart types. For example, "stack plots" were not considered for simple charts due to the absence of stackable elements. Similarly, "overlapping area plots" were excluded from complex charts due to their inherent ambiguity and complexity even for human annotators.

Following this process, we ended up with 22 unique categories for simple charts and 25 unique categories for complex charts, as shown in the results tables.
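The applicability and exclusion rules described above can be pictured as a small lookup. The sketch below is not the pipeline used to build RobustCQA; it only encodes the examples named in this section (markers for scatter and line plots, line styles for line, stem, and step plots, and the two stated exclusions). All identifiers, including `ELEMENT_TO_CHART_TYPES`, `EXCLUDED`, `perturbable_elements`, and `excluded_by_rule`, are hypothetical.

```python
from typing import Dict, List, Set

# Chart elements and the chart types on which they can be perturbed,
# limited to the two examples given in the text.
ELEMENT_TO_CHART_TYPES: Dict[str, Set[str]] = {
    "marker": {"scatter", "line"},
    "line_style": {"line", "stem", "step"},
}

# Perturbation categories explicitly excluded for a chart complexity level,
# again limited to the two examples given in the text.
EXCLUDED: Dict[str, Set[str]] = {
    "simple": {"stack_plot"},              # no stackable elements
    "complex": {"overlapping_area_plot"},  # ambiguous even for human annotators
}


def perturbable_elements(chart_type: str) -> List[str]:
    """Elements that can be perturbed for the given chart type."""
    return sorted(
        element
        for element, chart_types in ELEMENT_TO_CHART_TYPES.items()
        if chart_type in chart_types
    )


def excluded_by_rule(category: str, complexity: str) -> bool:
    """True if the category is one of the stated exclusions for this complexity."""
    return category in EXCLUDED.get(complexity, set())


if __name__ == "__main__":
    print(perturbable_elements("line"))
    print(excluded_by_rule("stack_plot", "simple"))
```

A `False` result from `excluded_by_rule` only means that no exclusion is stated in this section; it does not by itself imply that the category was retained in the final dataset.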

To illustrate the final perturbation categories, we provide representative images showcasing all subplots within each category. These visual examples provide a clear understanding of the types of perturbations included in RobustCQA and have been depicted in the section following the tables.

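Columns are grouped into Simple Questions and Complex Questions; each group lists MLLMs followed by Generalist VLMs.

| Category | Simple: Gemini 1.5 Flash | Simple: GPT-4o | Simple: Qwen VL | Simple: CogAgent VQA | Simple: InternLM XComposer2 | Complex: Gemini 1.5 Flash | Complex: GPT-4o | Complex: Qwen VL | Complex: CogAgent VQA | Complex: InternLM XComposer2 |
|---|---|---|---|---|---|---|---|---|---|---|
| original_chart | 96 | 94 | 76 | 79 | 83 | 85 | 89 | 56 | 64 | 69 |
| annotations | 90 | 91 | 62 | 66 | 65 | 74 | 64 | 42 | 42 | 47 |
| area_plot | 73 | 42 | 21 | 16 | 61 | 71 | 64 | 39 | 42 | 48 |
| annotated_bars | 93 | 91 | 71 | 63 | 73 | 78 | 90 | 45 | 58 | 48 |
| basic | 73 | 43 | 24 | 18 | 54 | 73 | 61 | 38 | 38 | 46 |
| color_random | 79 | 43 | 20 | 22 | 63 | 72 | 64 | 32 | 40 | 42 |
| color_scheme | 78 | 50 | 18 | 23 | 58 | 68 | 64 | 30 | 36 | 36 |
| data_pivot | 74 | 55 | 13 | 13 | 56 | 71 | 62 | 42 | 40 | 41 |
| font | 79 | 53 | 19 | 28 | 28 | 65 | 52 | 33 | 26 | 51 |
| grid | 79 | 58 | 23 | 24 | 57 | 66 | 64 | 31 | 42 | 44 |
| hatching | 75 | 44 | 20 | 18 | 67 | 72 | 69 | 39 | 42 | 44 |
| horizontal | 73 | 33 | 19 | 14 | 58 | 67 | 64 | 35 | 45 | 43 |
| legend_position | 78 | 57 | 14 | 23 | 54 | 59 | 62 | 32 | 41 | 43 |
| line_representation | 84 | 59 | 19 | 25 | 56 | 67 | 63 | 31 | 34 | 40 |
| log_scale | 42 | 36 | 12 | 12 | 14 | 78 | 21 | 32 | 37 | 45 |
| replacing_legend_with_labels | 78 | 59 | 18 | 22 | 53 | 72 | 61 | 27 | 41 | 40 |
| scaling_size | 73 | 53 | 17 | 25 | 31 | 62 | 55 | 34 | 38 | 39 |
| scatter_representation | 75 | 48 | 15 | 17 | 47 | 64 | 57 | 34 | 43 | 44 |
| stair_plot_normal | 59 | 53 | 20 | 23 | 52 | 65 | 61 | 39 | 38 | 48 |
| stair_plot_with_marker | 64 | 51 | 15 | 24 | 60 | 68 | 64 | 25 | 41 | 31 |
| stem_plot | 72 | 41 | 12 | 17 | 70 | 75 | 86 | 36 | 54 | 48 |
| tick_orientation | 76 | 57 | 18 | 22 | 50 | 72 | 60 | 41 | 46 | 49 |
| tick_position | 69 | 54 | 27 | 27 | 51 | 53 | 61 | 35 | 42 | 41 |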
Table 8: Model Performance on various perturbations on Simple Charts.
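One way to read Table 8 is as a drop relative to the `original_chart` row. The sketch below computes that drop for three perturbations on the simple-question columns, with values copied from the table. It assumes that `original_chart` denotes the unperturbed baseline; the function name `drop_from_original` is introduced only for this example.

```python
from typing import Dict, List

MODELS: List[str] = [
    "Gemini 1.5 Flash", "GPT-4o", "Qwen VL", "CogAgent VQA", "InternLM XComposer2",
]

# Simple-question scores copied from Table 8, in the column order of MODELS.
SIMPLE_Q: Dict[str, List[int]] = {
    "original_chart": [96, 94, 76, 79, 83],
    "annotated_bars": [93, 91, 71, 63, 73],
    "horizontal": [73, 33, 19, 14, 58],
    "log_scale": [42, 36, 12, 12, 14],
}


def drop_from_original(table: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Score lost by each model, per perturbation, relative to original_chart."""
    baseline = table["original_chart"]
    return {
        name: [base - score for base, score in zip(baseline, row)]
        for name, row in table.items()
        if name != "original_chart"
    }


if __name__ == "__main__":
    for name, drops in drop_from_original(SIMPLE_Q).items():
        per_model = ", ".join(f"{m}: {d}" for m, d in zip(MODELS, drops))
        print(f"{name} -> {per_model}")
```

For these rows, `log_scale` costs every model more than 50 points on simple questions, while `annotated_bars` stays within 16 points of the baseline.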
Figure 4: Stacked: Stacked bar graphs