Search | arXiv e-print repository

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Authors: Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, Shafiq Joty

Abstract: Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart… ▽ More Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across $5$ benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at https://github.com/vis-nlp/ChartGemma. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.00257 [pdf, other]

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs

Authors: Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque

Abstract: Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. To facilitate chart-based reasoning using natural language, various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reason… ▽ More Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. To facilitate chart-based reasoning using natural language, various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. Despite the recent success of Large Language Models (LLMs) across diverse NLP tasks, their abilities and limitations in the realm of data visualization remain under-explored, possibly due to their lack of multi-modal capabilities. To bridge the gap, this paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks. Our evaluation includes a comprehensive assessment of LVLMs, including GPT-4V and Gemini, across four major chart reasoning tasks. Furthermore, we perform a qualitative evaluation of LVLMs' performance on a diverse range of charts, aiming to provide a thorough analysis of their strengths and weaknesses. Our findings reveal that LVLMs demonstrate impressive abilities in generating fluent texts covering high-level data insights while also encountering common problems like hallucinations, factual errors, and data bias. We highlight the key strengths and limitations of chart comprehension tasks, offering insights for future research. △ Less

Submitted 31 May, 2024; originally announced June 2024.

arXiv:2403.09028 [pdf, other]

ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

Authors: Ahmed Masry, Mehrad Shahmohammadi, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty

Abstract: Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question-answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision tasks language. However, such task-specific mo… ▽ More Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question-answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision tasks language. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language Instruction-following dataset comprising 191K instructions generated with 71K charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model that connects a vision encoder for chart understanding with a LLM; and (2) a pipeline model that employs a two-step approach to extract chart data tables and input them into the LLM. In experiments on four downstream tasks, we first show the effectiveness of our model--achieving a new set of state-of-the-art results. Further evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2401.15050 [pdf, other]

LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents

Authors: Ahmed Masry, Amir Hajian

Abstract: Document AI is a growing research field that focuses on the comprehension and extraction of information from scanned and digital documents to make everyday business operations more efficient. Numerous downstream tasks and datasets have been introduced to facilitate the training of AI models capable of parsing and extracting information from various document types such as receipts and scanned forms… ▽ More Document AI is a growing research field that focuses on the comprehension and extraction of information from scanned and digital documents to make everyday business operations more efficient. Numerous downstream tasks and datasets have been introduced to facilitate the training of AI models capable of parsing and extracting information from various document types such as receipts and scanned forms. Despite these advancements, both existing datasets and models fail to address critical challenges that arise in industrial contexts. Existing datasets primarily comprise short documents consisting of a single page, while existing models are constrained by a limited maximum length, often set at 512 tokens. Consequently, the practical application of these methods in financial services, where documents can span multiple pages, is severely impeded. To overcome these challenges, we introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens. We also propose the LongForms dataset, a comprehensive financial dataset that encapsulates several industrial challenges in financial documents. Through an extensive evaluation, we demonstrate the effectiveness of the LongFin model on the LongForms dataset, surpassing the performance of existing public models while maintaining comparable results on existing single-page benchmarks. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Comments: Accepted at AAAI 2024 Workshop on AI in Finance for Social Impact

arXiv:2312.10610 [pdf, other]

Do LLMs Work on Charts? Designing Few-Shot Prompts for Chart Question Answering and Summarization

Authors: Xuan Long Do, Mohammad Hassanpour, Ahmed Masry, Parsa Kavehzadeh, Enamul Hoque, Shafiq Joty

Abstract: A number of tasks have been proposed recently to facilitate easy access to charts such as chart QA and summarization. The dominant paradigm to solve these tasks has been to fine-tune a pretrained model on the task data. However, this approach is not only expensive but also not generalizable to unseen tasks. On the other hand, large language models (LLMs) have shown impressive generalization capabi… ▽ More A number of tasks have been proposed recently to facilitate easy access to charts such as chart QA and summarization. The dominant paradigm to solve these tasks has been to fine-tune a pretrained model on the task data. However, this approach is not only expensive but also not generalizable to unseen tasks. On the other hand, large language models (LLMs) have shown impressive generalization capabilities to unseen tasks with zero- or few-shot prompting. However, their application to chart-related tasks is not trivial as these tasks typically involve considering not only the underlying data but also the visual features in the chart image. We propose PromptChart, a multimodal few-shot prompting framework with LLMs for chart-related applications. By analyzing the tasks carefully, we have come up with a set of prompting guidelines for each task to elicit the best few-shot performance from LLMs. We further propose a strategy to inject visual information into the prompts. Our experiments on three different chart-related information consumption tasks show that with properly designed prompts LLMs can excel on the benchmarks, achieving state-of-the-art. △ Less

Submitted 17 December, 2023; originally announced December 2023.

Comments: 23 pages

arXiv:2305.14761 [pdf, other]

UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning

Authors: Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty

Abstract: Charts are very popular for analyzing data, visualizing key insights and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently such as chart question answering and chart summarization. However, most of the methods that solve these tasks use pretraining on language or vision-language t… ▽ More Charts are very popular for analyzing data, visualizing key insights and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently such as chart question answering and chart summarization. However, most of the methods that solve these tasks use pretraining on language or vision-language tasks that do not attempt to explicitly model the structure of the charts (e.g., how data is visually encoded and how chart elements are related to each other). To address this, we first build a large corpus of charts covering a wide variety of topics and visual styles. We then present UniChart, a pretrained model for chart comprehension and reasoning. UniChart encodes the relevant text, data, and visual elements of charts and then uses a chart-grounded text decoder to generate the expected output in natural language. We propose several chart-specific pretraining tasks that include: (i) low-level tasks to extract the visual elements (e.g., bars, lines) and data from charts, and (ii) high-level tasks to acquire chart understanding and reasoning skills. We find that pretraining the model on a large corpus with chart-specific low- and high-level tasks followed by finetuning on three down-streaming tasks results in state-of-the-art performance on three downstream tasks. △ Less

Submitted 10 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2205.03966 [pdf, other]

Chart Question Answering: State of the Art and Future Directions

Authors: Enamul Hoque, Parsa Kavehzadeh, Ahmed Masry

Abstract: Information visualizations such as bar charts and line charts are very common for analyzing data and discovering critical insights. Often people analyze charts to answer questions that they have in mind. Answering such questions can be challenging as they often require a significant amount of perceptual and cognitive effort. Chart Question Answering (CQA) systems typically take a chart and a natur… ▽ More Information visualizations such as bar charts and line charts are very common for analyzing data and discovering critical insights. Often people analyze charts to answer questions that they have in mind. Answering such questions can be challenging as they often require a significant amount of perceptual and cognitive effort. Chart Question Answering (CQA) systems typically take a chart and a natural language question as input and automatically generate the answer to facilitate visual data analysis. Over the last few years, there has been a growing body of literature on the task of CQA. In this survey, we systematically review the current state-of-the-art research focusing on the problem of chart question answering. We provide a taxonomy by identifying several important dimensions of the problem domain including possible inputs and outputs of the task and discuss the advantages and limitations of proposed solutions. We then summarize various evaluation techniques used in the surveyed papers. Finally, we outline the open challenges and future research opportunities related to chart question answering. △ Less

Submitted 21 May, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

arXiv:2203.10244 [pdf, other]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Authors: Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, Enamul Hoque

Abstract: Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a f… ▽ More Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed-vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions. △ Less

Submitted 19 March, 2022; originally announced March 2022.

Comments: Accepted by ACL 2022 Findings

arXiv:2203.07452 [pdf, other]

A deep learning pipeline for breast cancer ki-67 proliferation index scoring

Authors: Khaled Benaggoune, Zeina Al Masry, Jian Ma, Christine Devalland, L. H Mouss, Noureddine Zerhouni

Abstract: The Ki-67 proliferation index is an essential biomarker that helps pathologists to diagnose and select appropriate treatments. However, automatic evaluation of Ki-67 is difficult due to nuclei overlapping and complex variations in their properties. This paper proposes an integrated pipeline for accurate automatic counting of Ki-67, where the impact of nuclei separation techniques is highlighted. F… ▽ More The Ki-67 proliferation index is an essential biomarker that helps pathologists to diagnose and select appropriate treatments. However, automatic evaluation of Ki-67 is difficult due to nuclei overlapping and complex variations in their properties. This paper proposes an integrated pipeline for accurate automatic counting of Ki-67, where the impact of nuclei separation techniques is highlighted. First, semantic segmentation is performed by combining the Squeez and Excitation Resnet and Unet algorithms to extract nuclei from the background. The extracted nuclei are then divided into overlapped and non-overlapped regions based on eight geometric and statistical features. A marker-based Watershed algorithm is subsequently proposed and applied only to the overlapped regions to separate nuclei. Finally, deep features are extracted from each nucleus patch using Resnet18 and classified into positive or negative by a random forest classifier. The proposed pipeline's performance is validated on a dataset from the Department of Pathology at Hôpital Nord Franche-Comté hospital. △ Less

Submitted 14 March, 2022; originally announced March 2022.

arXiv:2203.06486 [pdf, other]

Chart-to-Text: A Large-Scale Benchmark for Chart Summarization

Authors: Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, Shafiq Joty

Abstract: Charts are commonly used for exploring data and communicating insights. Generating natural language summaries from charts can be very helpful for people in inferring key insights that would otherwise require a lot of cognitive and perceptual efforts. We present Chart-to-text, a large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types. We… ▽ More Charts are commonly used for exploring data and communicating insights. Generating natural language summaries from charts can be very helpful for people in inferring key insights that would otherwise require a lot of cognitive and perceptual efforts. We present Chart-to-text, a large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types. We explain the dataset construction process and analyze the datasets. We also introduce a number of state-of-the-art neural models as baselines that utilize image captioning and data-to-text generation techniques to tackle two problem variations: one assumes the underlying data table of the chart is available while the other needs to extract data from chart images. Our analysis with automatic and human evaluation shows that while our best models usually generate fluent summaries and yield reasonable BLEU scores, they also suffer from hallucinations and factual errors as well as difficulties in correctly explaining complex patterns and trends in charts. △ Less

Submitted 14 April, 2022; v1 submitted 12 March, 2022; originally announced March 2022.

Comments: Accepted by ACL 2022 Main Conference

arXiv:2202.03737 [pdf, other]

doi 10.1080/03091902.2019.1664672

A Survey of Breast Cancer Screening Techniques: Thermography and Electrical Impedance Tomography

Authors: Juan Zuluaga-Gomez, N. Zerhouni, Z. Al Masry, C. Devalland, C. Varnier

Abstract: Breast cancer is a disease that threatens many women's life, thus, early and accurate detection plays a key role in reducing the mortality rate. Mammography stands as the reference technique for breast cancer screening; nevertheless, many countries still lack access to mammograms due to economic, social, and cultural issues. Last advances in computational tools, infrared cameras, and devices for b… ▽ More Breast cancer is a disease that threatens many women's life, thus, early and accurate detection plays a key role in reducing the mortality rate. Mammography stands as the reference technique for breast cancer screening; nevertheless, many countries still lack access to mammograms due to economic, social, and cultural issues. Last advances in computational tools, infrared cameras, and devices for bio-impedance quantification allowed the development of parallel techniques like thermography, infrared imaging, and electrical impedance tomography, these being faster, reliable and cheaper. In the last decades, these have been considered as complement procedures for breast cancer diagnosis, where many studies concluded that false positive and false negative rates are greatly reduced. This work aims to review the last breakthroughs about the three above-mentioned techniques describing the benefits of mixing several computational skills to obtain a better global performance. In addition, we provide a comparison between several machine learning techniques applied to breast cancer diagnosis going from logistic regression, decision trees, and random forest to artificial, deep, and convolutional neural networks. Finally, it is mentioned several recommendations for 3D breast simulations, pre-processing techniques, biomedical devices in the research field, prediction of tumor location and size. △ Less

Submitted 8 February, 2022; originally announced February 2022.

Comments: Article published at: Journal of Medical Engineering & Technology (Volume 43, 2019 - Issue 5)

arXiv:1910.13757 [pdf, other]

A CNN-based methodology for breast cancer diagnosis using thermal images

Authors: Juan Zuluaga-Gomez, Zeina Al Masry, Khaled Benaggoune, Safa Meraghni, Noureddine Zerhouni

Abstract: Micro Abstract: A recent study from GLOBOCAN disclosed that during 2018 two million women worldwide had been diagnosed from breast cancer. This study presents a computer-aided diagnosis system based on convolutional neural networks as an alternative diagnosis methodology for breast cancer diagnosis with thermal images. Experimental results showed that lower false-positives and false-negatives clas… ▽ More Micro Abstract: A recent study from GLOBOCAN disclosed that during 2018 two million women worldwide had been diagnosed from breast cancer. This study presents a computer-aided diagnosis system based on convolutional neural networks as an alternative diagnosis methodology for breast cancer diagnosis with thermal images. Experimental results showed that lower false-positives and false-negatives classification rates are obtained when data pre-processing and data augmentation techniques are implemented in these thermal images. Background: There are many types of breast cancer screening techniques such as, mammography, magnetic resonance imaging, ultrasound and blood sample tests, which require either, expensive devices or personal qualified. Currently, some countries still lack access to these main screening techniques due to economic, social or cultural issues. The objective of this study is to demonstrate that computer-aided diagnosis(CAD) systems based on convolutional neural networks (CNN) are faster, reliable and robust than other techniques. Methods: We performed a study of the influence of data pre-processing, data augmentation and database size versus a proposed set of CNN models. Furthermore, we developed a CNN hyper-parameters fine-tuning optimization algorithm using a tree parzen estimator. Results: Among the 57 patients database, our CNN models obtained a higher accuracy (92\%) and F1-score (92\%) that outperforms several state-of-the-art architectures such as ResNet50, SeResNet50 and Inception. Also, we demonstrated that a CNN model that implements data-augmentation techniques reach identical performance metrics in comparison with a CNN that uses a database up to 50\% bigger. Conclusion: This study highlights the benefits of data augmentation and CNNs in thermal breast images. Also, it measures the influence of the database size in the performance of CNNs. △ Less

Submitted 30 October, 2019; originally announced October 2019.

Comments: 19 pages, 7 figures, 5 tables. Clinical Breast Cancer

Showing 1–12 of 12 results for author: Masry, A