CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Zhong-Zhi Li1,2*, Ming-Liang Zhang1,2*, Fei Yin1,2,
Zhi-Long Ji3, Jin-Feng Bai3, Zhen-Ru Pan3,
Fan-Hu Zeng1,2, Jian Xu1,2, Jia-Xin Zhang1,2, Cheng-Lin Liu1,2†
School of Artificial Intelligence, University of Chinese Academy of Sciences1
MAIS, Institute of Automation of Chinese Academy of Sciences2, Tomorrow Advancing Life3
{lizhongzhi2022, zhangmingliang2018}@ia.ac.cn,
{jizhilong, baijinfeng, panzhenru}@tal.com,
{fyin, liucl}@nlpr.ia.ac.cn
* Equal Contribution   † Corresponding Author
Abstract

Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets such as MathVista provide benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of evaluation tools and datasets for fine-grained assessment in the context of Chinese-language K12 education. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math-related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions range from elementary to high school levels and provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool, GradeGPT, integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.


1 Introduction

Large language models (LLMs) excel in various language tasks, while multimodal models effectively handle visual-language problems. Together they advance the fields of natural language processing and computer vision, providing powerful solutions for complex tasks. Multimodal large models demonstrate potential as versatile solvers for multimodal problems.

The systematic evaluation of large models’ performance across various mathematical reasoning scenarios has been a subject of extensive research. GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) assess multi-step mathematical reasoning with high-quality sets of elementary school math word problems and competition mathematics problems, respectively. By collecting diverse sets of mathematical problems containing both textual and visual components, Lu et al. (2023), Wang et al. (2024), and Zhang et al. (2024b) systematically evaluated the ability of large multimodal models to perceive visual elements and solve the corresponding multimodal problems. Shi et al. (2023) constructed a multilingual mathematical reasoning dataset, MGSM, for evaluating the reasoning ability of LLMs in multilingual environments.

However, in non-English multimodal contexts, especially in Chinese scenarios, there is still a lack of sufficiently detailed and diverse benchmarks for assessing mathematical abilities. To assess the capability of large language models in non-English contexts, Huang et al. (2023) and Zhang et al. (2024a) constructed multidisciplinary Chinese question answering datasets C-Eval and CMMMU to evaluate the knowledge and reasoning abilities of multimodal large models. However, C-Eval lacks evaluation in multimodal contexts, while CMMMU’s dataset has relatively low diversity, consisting of only 540 questions.

Figure 1: Results of mainstream multimodal large models and text-only large models on the CMMaTH dataset. Left: performance of selected LMMs and LLMs across visual subjects. Right: performance of these models on questions from different educational grade levels.

Answer evaluation in existing math benchmarks falls into two categories: rule-based methods (Cobbe et al., 2021; Hendrycks et al., 2021b; He et al., 2024) and API-based methods (Lu et al., 2023; Zhang et al., 2024b; Hendrycks et al., 2021a). API-based methods are costly and time-consuming, and they often produce unstable and inconsistent evaluation results. Rule-based methods, on the other hand, struggle to handle highly diverse benchmark content, and handcrafted rules are difficult to maintain for dynamically updated benchmarks. As a result, current multimodal math benchmarks often resort to multiple-choice or true/false question formats, using rules or an API-based LLM to extract the chosen option for assessment.

Based on the above considerations, we propose a new multimodal mathematical benchmark, CMMaTH. Compared to previous benchmarks, our benchmark offers greater diversity, increased depth of reasoning, and finer-grained knowledge annotation for measuring how well multimodal models grasp different levels and types of knowledge. We provide and open-source a lightweight answer comparator called GradeGPT, designed to compare the consistency between the outputs of different LLMs/LMMs and standard answers, thus avoiding expensive evaluation costs. Leveraging the CMMaTH dataset and the GradeGPT tool, we evaluated mainstream open-source and commercial multimodal large models in Table 4, reporting comprehensive evaluation results along with extensive case analyses. In summary, our paper makes the following contributions:

  • We introduce the largest high-quality Chinese multimodal mathematics benchmark with the most detailed annotation granularity to date. We also provide an English version of this dataset. The CMMaTH dataset is dynamically maintained and will be periodically updated.

  • Compared to previous multimodal mathematical benchmarks, our dataset exhibits greater depth of reasoning and diversity. Our benchmark simulates more realistic educational Q&A scenarios, encompassing a wider variety of question types and answer formats. Additionally, we annotate each question with detailed knowledge points and corresponding skills to evaluate the mastery level of current large models.

  • We build an evaluation assistant named GradeGPT on the CMMaTH dataset, which allows for comparing the proximity of model responses to standard answers and assessing the correctness of results and processes. GradeGPT features lightweight open-source characteristics, avoiding the instability and high costs associated with commercial models.

  • We conduct a systematic evaluation of existing mainstream multimodal large models, comparing them both quantitatively and qualitatively.

2 Related Work

2.1 Assessment of mathematical abilities

To evaluate the performance of large models in mathematical reasoning and to examine hallucinations during the reasoning process, numerous benchmarks have been proposed. GSM8K (Cobbe et al., 2021) is one of the earliest and most widely used datasets for evaluating the mathematical ability of large models, consisting of 1k math word problem test samples and corresponding answers. The MATH dataset (Hendrycks et al., 2021b) presents a greater reasoning challenge than GSM8K. It demands a more profound understanding and intuition in various mathematical domains such as Algebra, Number Theory, and Geometry. MathVista (Lu et al., 2023) is the first dataset used to evaluate the multimodal mathematical capabilities of large models, but its reasoning depth is relatively shallow. MATH-VISION (Wang et al., 2024) has richer visual elements and greater reasoning difficulty. MathVerse (Zhang et al., 2024c) constructs several dataset subsets to assess whether existing multimodal large models can truly understand abstract mathematical forms.

The CMMaTH Benchmark, in comparison to existing works on the evaluation of mathematical proficiency, places a greater emphasis on the analysis of mathematical abilities within the context of the Chinese language. The data distribution of the CMMaTH dataset more closely aligns with the actual distribution found in K12 educational settings, and it provides detailed annotations of mathematical knowledge points to facilitate the assessment of models’ mastery of knowledge and skills.

2.2 Large Model Evaluation Tool

Due to their strong generalization capabilities and extensive world knowledge, large language models have achieved outstanding results in text generation tasks such as machine translation (Zhu et al., 2023), question answering (Kamalloo et al., 2023), and dialogue (Duan et al., 2023). Evaluating the comprehensive abilities of large models, such as clarity, adherence to instructions, comprehensiveness, formality, and mathematical reasoning ability, has received widespread attention (Ke et al., 2023). Currently, many works use powerful commercial model APIs, such as GPT-4, to assist in this evaluation. For instance, MathVista (Lu et al., 2023) and GeoEval (Zhang et al., 2024b) use GPT-4’s API to extract answers for evaluation. These methods face several challenges: they are costly and time-consuming, and they struggle to keep up with rapid model iterations. They also suffer from limited consistency and reproducibility (Wang et al., 2023a; Ke et al., 2023).

Recent methods have proposed using metrics such as BERTScore (Zhang et al., 2020) or MAUVE (Pillutla et al., 2021) for evaluation. However, the numerical scores produced by these methods are difficult to interpret when diagnosing erroneous LLM responses. PandaLM and CritiqueLLM (Wang et al., 2023b; Ke et al., 2023) are similar to our work. They propose fine-tuning methods based on open-source LLMs, distilling the evaluation capabilities of GPT-3.5 into a series of smaller open-source models. However, they focus on the automated evaluation of more general text generation tasks, while we target the automated evaluation of large-model responses to multimodal mathematics problems.

Unlike PandaLM (Wang et al., 2023b), which evaluates qualities such as relative conciseness and clarity, our evaluation model, GradeGPT, is a dataset-oriented answer comparator that provides specific reasons based on the standard answer and a model’s response. We distilled the answer comparison capability of GPT-4 using the Cross-Lingual Judge-of-Chain method, which enhances GradeGPT’s answer discrimination ability.

3 CMMaTH Dataset

Statistic | Number
Total questions | 23856
- Multiple-choice questions | 18191
- Free-form questions | 5665
- Questions in the testmini set | 1000
Single-choice questions | 13706 (75.3%)
- Proportion of answers A | 2694 (14.8%)
- Proportion of answers B | 3903 (21.4%)
- Proportion of answers C | 3961 (21.7%)
- Proportion of answers D | 3148 (17.5%)
Multiple-choice & multi-turn questions | 4485 (24.7%)
Knowledge points | 2299
Levels | 5
Visual subjects | 13
Maximum question length | 593
Minimum question length | 6
Average question length | 75.1
Grade distribution: Elementary (1-6) | 800
Grade distribution: Junior (7-9) | 5082
Grade distribution: Senior (10-12) | 17972
Table 1: Key statistics of CMMaTH. The unit of question length is words.
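As a quick consistency check on Table 1, the following sketch recomputes the two top-level shares from the reported counts. It assumes, which the table does not state explicitly, that the percentages are taken relative to the 18191 multiple-choice questions rather than to the full dataset.

```python
# Counts copied from Table 1.
total_questions = 23856
multiple_choice = 18191
free_form = 5665
single_choice = 13706
multi_answer_or_multi_turn = 4485

# The two top-level categories partition the dataset.
assert multiple_choice + free_form == total_questions
# Single-choice and multiple-choice/multi-turn questions partition the choice questions.
assert single_choice + multi_answer_or_multi_turn == multiple_choice


def share(count: int, base: int) -> float:
    """Return count/base as a percentage rounded to one decimal place."""
    return round(100 * count / base, 1)


# Assumption: the base of the percentages is the number of multiple-choice questions.
single_choice_share = share(single_choice, multiple_choice)
multi_share = share(multi_answer_or_multi_turn, multiple_choice)
print(single_choice_share, multi_share)
```

Under this assumption the two recomputed shares agree with the 75.3% and 24.7% reported in the table.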

3.1 Overview of CMMaTH

Figure 2: Some of the knowledge points involved in the CMMaTH dataset.

We selected diverse multimodal mathematical problems from a vast pool of K12 educational questions, comprising 23856 items across 13 visual themes, 5 difficulty levels, and encompassing 150 types of knowledge points. More detailed statistical data can be found in Table 1.

For the convenience of evaluation, we provide a miniaturized test set of CMMaTH, called CMMaTH-testmin, containing 1500 samples. Testmin retains the diversity of the CMMaTH dataset and shows similar overall performance to the entire CMMaTH dataset. Evaluators can conduct quick tests and generate preliminary analyses based on CMMaTH-testmin.
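The text does not say how the mini test set was drawn. One common way to obtain a subset that preserves diversity is proportional stratified sampling, sketched below. The `subject` field, the helper name `stratified_subset`, and the toy data are all hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict


def stratified_subset(items, key, size, seed=0):
    """Draw roughly `size` items, keeping each group's share close to its share in `items`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    subset = []
    for members in groups.values():
        # Proportional quota, but keep at least one item so rare groups survive.
        quota = max(1, round(size * len(members) / len(items)))
        subset.extend(rng.sample(members, min(quota, len(members))))
    return subset


# Toy data: the "subject" field is a hypothetical stand-in for the visual subject label.
toy = [{"id": i, "subject": "bar" if i % 4 else "venn"} for i in range(400)]
mini = stratified_subset(toy, key=lambda q: q["subject"], size=40)
```

Because quotas are rounded per group, the subset size can deviate slightly from `size` when there are many small groups.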

Image Type (Chinese) | Image Type (English) | #Num
视觉表格 | Visual-Table | 1513
折叠展开图 | Folded Image Graph | 235
立体几何图 | Solid Geometry | 2054
解析几何图 | Analytic Geometry | 3060
流程图 | Flow Chart | 3120
条形图 | Bar Chart | 4924
散点图 | Scatter Chart | 517
平面几何图 | Plane Geometry | 3834
折线图 | Line Chart | 846
饼状图 | Fan Chart | 175
雷达图 | LiDAR Chart | 73
抽象类比图 | Abstract Analogy Graph | 440
三视图 | Three View Graph | 22
枝页图 | Stem-and-Leaf Display | 23
其他 | Other Image Type | 240
Table 2: Primary element types involved in the CMMaTH dataset.

3.2 Collection Guidelines

We collected a large number of multimodal mathematics questions from a vast K12 educational question bank, including elements such as statistical charts, plane geometry, three-view diagrams, flowcharts, set notation diagrams, etc. The quality and distribution of the data were guided by the following criteria during collection.

  • Diverse Mathematical Visual Elements. We collected multimodal mathematical problems whose solutions rely on understanding image content, especially those containing a large amount of Chinese visual content such as text and symbols. Table 2 lists the visual element types covered by CMMaTH.

  • High relevance to K12 math knowledge and skills. During question collection, annotators who are well-versed in K12 mathematics ensure that each multimodal question assesses a specific K-12 mathematics knowledge point. The benchmark primarily includes mathematics questions related to K12 education, facilitating the assessment of the application potential of large multimodal models in mathematics education.

  • High-quality images and answers. During the collection phase, we instruct collectors to disregard multimodal math questions with erroneous symbols or low-quality images (blurry images). Collectors are required to ensure that the collected questions are generally solvable.

3.3 Data Collections

Collection from Diverse Multimodal Math Sources. CMMaTH’s data is drawn from a private database containing on the order of a million questions. The questions in this database were collected from the Internet and have undergone rigorous data checking. The data for this project was collected in multiple rounds. We first sampled 45,000 multimodal math questions: 14,000 each from elementary, junior high, and high schools. Then, we added 34,000 more questions featuring algorithm block diagrams, statistics, and geometry diagrams to enhance visual diversity.
Data Filtering. We filtered out all questions without images in the question stems, as well as questions requiring multi-image reasoning, questions in non-Chinese languages, and questions that do not rely on visual content to solve. To ensure the quality of the images and question text, we removed all images whose width and height were less than 100, then used the GPT-4 API to score data quality and filter out questions suspected of being unsolvable and questions with garbled text.
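The rule-based part of this filtering step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the `Question` fields are hypothetical, the threshold unit is assumed to be pixels, the size rule is read as "both sides below 100", and the GPT-4 quality scoring is omitted.

```python
from dataclasses import dataclass

MIN_SIDE = 100  # threshold from the text; the unit (pixels) is an assumption


@dataclass
class Question:
    text: str
    language: str      # e.g. "zh"
    image_sizes: list  # [(width, height), ...] for the images in the question stem
    needs_image: bool  # whether solving depends on the visual content


def keep(q: Question) -> bool:
    """Return True if the question passes the rule-based filters described above."""
    if len(q.image_sizes) != 1:  # drops image-free and multi-image questions
        return False
    if q.language != "zh":       # drops non-Chinese questions
        return False
    if not q.needs_image:        # drops questions solvable from text alone
        return False
    width, height = q.image_sizes[0]
    # Drop images whose width and height are both below the threshold.
    return not (width < MIN_SIDE and height < MIN_SIDE)


samples = [
    Question("求阴影部分面积", "zh", [(320, 240)], True),
    Question("求阴影部分面积", "zh", [(64, 48)], True),
    Question("Find x", "en", [(320, 240)], True),
    Question("计算 1+1", "zh", [], False),
]
kept = [q for q in samples if keep(q)]
```

Only the first toy question survives: the second has a tiny image, the third is not in Chinese, and the fourth has no image.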
Data Labeling. For K-12 mathematics knowledge points, we scraped the mathematics section of Jiaoyan Cloud (https://www.jiaoyanyun.com/) and organized all the knowledge points into a knowledge tree with a total of 5,531 knowledge points. We retained the 2,299 knowledge points most relevant to multimodal mathematics in K-12. Subsequently, all questions were classified by knowledge point using GPT-4 and a fine-tuned LLM, followed by manual multi-level verification. Questions that did not match any K-12 multimodal mathematics knowledge point were filtered out.
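To make the pruning of the knowledge tree concrete, here is a small sketch under stated assumptions: the tree is represented as nested dictionaries, a node is kept if it is marked as relevant or has a kept descendant, and all node names are illustrative rather than taken from the actual tree.

```python
def prune(tree: dict, keep: set) -> dict:
    """Keep only retained knowledge points and the ancestors that lead to them."""
    pruned = {}
    for name, children in tree.items():
        kept_children = prune(children, keep)
        if name in keep or kept_children:
            pruned[name] = kept_children
    return pruned


def count_nodes(tree: dict) -> int:
    """Count every knowledge point in the nested-dict tree."""
    return sum(1 + count_nodes(children) for children in tree.values())


# Toy tree; the node names are illustrative, not taken from the actual knowledge tree.
tree = {
    "Geometry": {"Plane geometry": {"Similar triangles": {}}, "Solid geometry": {}},
    "Algebra": {"Factorization": {}},
}
multimodal_points = {"Similar triangles", "Solid geometry"}
pruned_tree = prune(tree, multimodal_points)
```

In this toy case the six-node tree shrinks to four nodes, and the "Algebra" branch disappears because none of its points is marked as relevant.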

3.4 Comparison with Existing Benchmarks

Dataset | Size | Image & Supplementary | Input Format | Source | Answer | Knowledge Annotation | Language | Knowledge Domain
VQAv2 (Goyal et al., 2017) | >1M | V | I+T | Annotated | Open/MC/TF | | En | General
SEED (Li et al., 2023a) | 19K | V | I+T | Annotated | MC | | En | General
MMBench (Liu et al., 2023) | 3K | V | I+T | Repurposed | MC | | En | General
MM-Vet (Yu et al., 2023) | 0.2K | V | I+T | Annotated | Open | | En | General
ScienceQA (Lu et al., 2022) | 6K | V | I+T | Textbooks | MC | | En | Science
MathVista (Lu et al., 2023) | 1K/6K | V(5 Types)+OC | I+T | Synthesized | Open/MC/TF | | En/ZH | Math
MMMU (Yue et al., 2023) | 11.5K | V(30 Types)+OC | Interleaved | Textbooks | Open/MC | | - | General
CMMMU (Zhang et al., 2024a) | <1K (Math Part) | V(5 Types)+OC | Interleaved | Internet | Open/MC | | ZH | General
OlympiadBench (He et al., 2024) | 6.5K (Math Part) | V(5 Types) | Interleaved | Internet | Open | | ZH/En | Math/Physics
MathVerse (Zhang et al., 2024c) | 2.6K/15K | V(3 Types) | I+T | Synthesized | MC | | ZH/En | Math
MATH-Vision (Wang et al., 2024) | 3K | V(16 Types)+IC | I+T | Synthesized | Open/MC | | En | Math
CMMaTH | 23K | V(13 Types)+OC+IC | I+T | Internet/Annotated | Open/MC/TF | | ZH | K12 Math
Table 3: Comparison with other multimodal benchmarks. V: visual input, VD: video input, OC: optical characters, IC: Image Caption, I+T: images and text, Open: open questions, MC: multiple choice questions, FIB: fill in the blank questions, TF: true or false questions.

The CMMaTH dataset is primarily used to evaluate multimodal reasoning capabilities in K-12 educational scenarios. We compared the current mainstream multimodal mathematical datasets and large model benchmarks in Table 3. Compared to existing multimodal benchmarks and multimodal reasoning benchmarks, the CMMaTH dataset has the following characteristics:
Extreme Diversity. Currently, there is a severe lack of high-quality Chinese multimodal mathematics datasets. MATH-VISION lacks a Chinese component, MathVista contains only a small number of Chinese samples, and CMMMU contains only 540 math problems, which are not sufficiently fine-grained or comprehensive. We include about 23k fine-grained multimodal mathematics assessment samples covering 13 K12 mathematics visual categories, making CMMaTH the largest known Chinese multimodal mathematics dataset to date.
Real and High Quality & Multilingual. MathVista features a substantial number of problems associated with natural and synthetic images. However, these images do not accurately represent the data distribution encountered in K12 mathematics educational settings. OlympiadBench is an Olympiad-level bilingual multimodal benchmark. However, it is overly challenging and deviates from the application of LMMs in real K12 multimodal math scenarios, and its variety of visual elements is relatively limited. Instead, we collect multimodal data specifically tailored to the K12 education context. Additionally, MathVista incorporates a significant amount of data from GeoQA and synthetic images, which have relatively poor image quality, whereas all of our images have undergone stringent image quality assessment. Unlike CMMMU, CEval, and CMath, our dataset is a bilingual dataset that considers a large number of Chinese scenes. In addition to the question text being in Chinese, the visual elements of the questions also contain Chinese text and symbols.
High-quality Fine-grained Annotation and Evaluation Tool. Every question in our dataset is meticulously annotated with standardized answers, solutions expressed in natural language, associated multimodal knowledge points, visual element categories, and K-12 grade levels. This fine-grained annotation enables a more nuanced evaluation of multimodal mathematical proficiency within the K-12 educational context. While MathVista and GeoEval rely on GPT-4 for answer extraction and validation, we introduce an open-source model named GradeGPT. GradeGPT stands out by providing a stable, cost-free, and swift accuracy evaluation specifically tailored for the CMMaTH dataset.

Figure 3: Instruction Construction Pipeline of GradeGPT

4 GradeGPT

The CMMaTH dataset encompasses a large variety of problem-solving objectives, such as mathematical expressions, multiple-choice options, numerical outcomes, coordinate points, conclusion figures, and correctness assessments. Traditionally, in reasoning or evaluation contexts, problems have been formulated as multiple-choice or true/false questions to facilitate comparison and to simplify the extraction of results. Handcrafted extraction rules are also difficult to maintain for a dynamically updated benchmark. Employing API models for evaluation is prohibitively expensive, and the resulting evaluations are not consistently stable, which hampers the iterative development of models on benchmarks, such as hyperparameter selection.

To provide a stable, free, fast, and easy-to-update tool for evaluating model responses, we introduce GradeGPT, an answer comparison model tailored for the CMMaTH dataset. GradeGPT receives a question, its standard answer, and a model-generated response. It extracts the key steps, including the final result, from the Chinese output and determines whether the result is consistent with the standard answer. GradeGPT is a streamlined, open-source model. When the 14B model is served with a framework such as vLLM, it can quickly compare large numbers of model-generated answers, reaching a judgment accuracy of 96.1%, which is comparable to the GPT-4 API.
Prompt Format
The prompt input of GradeGPT contains three fields: "questions," "reference answers," and "model output answers." The model is required to answer with "<Yes>" or "<No>", indicating whether the model output answer is equivalent to the standard reference answer. We have designed an instruction format named Cross-Lingual-Judge-of-Chain for determining answer consistency. Cross-Lingual-Judge-of-Chain first analyzes the model response, finds the key sentences that give the answer, and restates these key Chinese sentences in English. It then analyzes the standard answer, determines its type, and finally decides whether the standard answer is contained in the model response. More details can be found in Appendix E.
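The input and output contract described above can be sketched as follows. The template wording is an illustrative assumption, not the actual GradeGPT prompt (the source defers the details to Appendix E), and `build_prompt` and `parse_verdict` are hypothetical helper names.

```python
import re

# Illustrative wording only; the real prompt is given in the paper's appendix.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model output answer: {response}\n"
    "Is the model output answer equivalent to the reference answer? "
    "Explain your reasoning, then finish with <Yes> or <No>."
)


def build_prompt(question: str, reference: str, response: str) -> str:
    """Fill the three fields named in the text into a single grading prompt."""
    return PROMPT_TEMPLATE.format(question=question, reference=reference, response=response)


def parse_verdict(judge_output: str):
    """Return True for <Yes>, False for <No>, None if no verdict tag is present."""
    tags = re.findall(r"<(Yes|No)>", judge_output)
    if not tags:
        return None
    return tags[-1] == "Yes"  # the last tag is taken as the final verdict
```

Returning `None` when no tag is found lets the caller distinguish a malformed judgment from a genuine "<No>", which matters when computing accuracy over many graded responses.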
Instruction Construction
We first generate inference results on CMMaTH using multiple multimodal LLMs and provide GPT-4 with a detailed few-shot prompt to synthesize answer judgments in the form of Cross-Lingual Judge-of-Chain responses. By employing GPT-4’s in-context learning, as shown in Figure 3, we established a procedure for synthesizing instruction data and produced approximately 56k cross-lingual judgment instruction pairs. Fine-tuning on these instructions yields an expert model, GradeGPT, which is capable of comparing answers.
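A minimal sketch of how such instruction pairs might be assembled is given below. The record layout (`instruction`/`output`), the rule of discarding teacher judgments without an explicit verdict tag, and the function names are assumptions made for illustration; the source does not describe the data format.

```python
import json


def make_instruction_pair(question, reference, response, teacher_judgment):
    """Pack one grading example as an (instruction, output) record for fine-tuning."""
    instruction = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model output answer: {response}"
    )
    return {"instruction": instruction, "output": teacher_judgment}


def usable(pair) -> bool:
    """Keep only teacher judgments that end in an explicit verdict tag."""
    return pair["output"].rstrip().endswith(("<Yes>", "<No>"))


# Toy teacher outputs; in the described pipeline these would come from GPT-4.
raw = [
    make_instruction_pair("求 x", "3", "所以 x = 3", "The response concludes x = 3. <Yes>"),
    make_instruction_pair("求 x", "3", "无法确定", "The teacher output was cut off"),
]
pairs = [p for p in raw if usable(p)]
jsonl = "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Writing one JSON object per line keeps the Chinese text readable (`ensure_ascii=False`) and is a format most fine-tuning toolkits can ingest.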

5 Experiments

We conducted a series of experiments to evaluate various LLMs and LMMs on the CMMaTH dataset, including both open-source and closed-source models. More model details can be found in Table 13. Following a method similar to GeoEval and MathVista, we generated image captions with GPT-4V and evaluated MetaMath and DeepSeekMath equipped with this caption information. Our empirical research reveals that even the most advanced models struggle to achieve satisfactory accuracy. Furthermore, we conducted an exhaustive error analysis of a strong commercial multimodal model, GPT-4V, examining its error distribution and presenting illustrative qualitative examples. Our investigation also revealed that multilingual chains of thought do not remove the substantial difficulties of Chinese multimodal mathematical reasoning. We postulate that the rich non-English contextual information contained in the images requires models with enhanced multilingual OCR and sophisticated multimodal diagram reasoning capabilities.

5.1 Main Experiments on LLM/LMMs

Figure 4: Accuracy of LMMs across different types of problems in CMMaTH Benchmark.
Figure 5: The metrics of different LMMs/LLMs models about SSR.
Model | Overall | Flow | Bar | Scatter | Line Plot | Fan | LiDAR | Visual-Table | Three View | Folded Image | Analytic | Solid | Plane | Venn | Abt-Analogy
LLMs (Text Only)
LLama2-70B | 4.5 | 4.7 | 2.5 | 4.4 | 7.9 | 7.4 | 8.1 | 3.4 | 5.4 | 5.1 | 5.3 | 4.1 | 5.3 | 5.9 | 4.5
MetaMath-70B | 5.7 | 4.6 | 3.3 | 6.6 | 8.7 | 5.7 | 0.2 | 4.2 | 4.1 | 8.5 | 7.2 | 4.8 | 8.5 | 9.8 | 5.4
DeepSeek-Math | 14.0 | 13.4 | 6.7 | 14.7 | 13.1 | 12.5 | 12.2 | 8.1 | 13.5 | 12.3 | 17.2 | 16.5 | 21.6 | 19.5 | 13.8
Baichuan-13B | 8.4 | 6.7 | 4.8 | 12.2 | 12.4 | 13.1 | 16.2 | 5.4 | 4.1 | 8.5 | 11.1 | 6.7 | 13.7 | 12.8 | 9.3
Qwen-14B | 13.7 | 15.5 | 7.3 | 14.3 | 16.9 | 13.6 | 10.8 | 11.4 | 12.8 | 14.8 | 15.9 | 12.7 | 17.8 | 20.4 | 19.3
Math LLMs (Text + OCR Caption)
LLama2-70B | 5.6 | 4.9 | 2.3 | 4.8 | 7.9 | 7.1 | 8.0 | 4.4 | 6.4 | 9.1 | 3.3 | 4.8 | 6.3 | 6.9 | 5.5
MetaMath-70B | 5.1 | 4.3 | 3.2 | 6.9 | 8.1 | 5.3 | 0.0 | 4.4 | 4.2 | 8.8 | 7.1 | 4.4 | 8.3 | 9.1 | 5.2
DeepSeek-Math | 15.3 | 13.2 | 6.9 | 14.1 | 12.6 | 12.3 | 12.1 | 8.9 | 14.4 | 14.1 | 17.9 | 19.3 | 22.7 | 21.5 | 13.9
Baichuan-13B | 8.1 | 6.9 | 4.3 | 12.4 | 11.5 | 12.3 | 14.9 | 3.4 | 4.4 | 9.3 | 11.6 | 6.8 | 13.2 | 12.9 | 9.9
Qwen-14B | 13.3 | 14.1 | 7.4 | 13.3 | 16.2 | 13.2 | 11.8 | 10.6 | 11.8 | 19.8 | 5.9 | 11.7 | 13.8 | 21.4 | 16.3
Open-source LMMs (Text + Image)
LLaVA-v1.5-7B | 5.5 | 1.5 | 4.2 | 5.4 | 6.2 | 5.4 | 3.6 | 4.0 | 4.2 | 5.3 | 4.8 | 3.9 | 8.4 | 6.1 | 4.2
InternLM-XComposer2-VL | 3.4 | 3.3 | 5.3 | 3.2 | 6.2 | 11.3 | 6.2 | 5.4 | 4.0 | 0.5 | 0.4 | 3.6 | 1.5 | 1.8 | 3.6
Yi-VL-34B | 8.3 | 7.1 | 4.6 | 10.2 | 14.6 | 8.5 | 6.8 | 7.7 | 5.9 | 6.4 | 10.1 | 7.8 | 12.2 | 11.3 | 7.9
CogAgent-Chat | 10.6 | 12.2 | 5.2 | 10.8 | 13.7 | 8.0 | 9.5 | 8.8 | 11.2 | 10.2 | 13.2 | 10.5 | 11.8 | 19.9 | 12.2
Closed-source LMMs (Text + Image)
GPT4V | 27.0 | 39.3 | 12.5 | 30.2 | 21.0 | 22.9 | 38.6 | 16.9 | 18.3 | 20.0 | 37.5 | 15.8 | 21.5 | 58.0 | 29.9
GPT4o | 35.2 | 59.4 | 18.8 | 54.5 | 31.7 | 58.4 | 32.4 | 31.7 | 28.7 | 23.8 | 40.6 | 31.6 | 33.6 | 57.4 | 29.7
Human Performance
Human (testmini) | 80.1 | 73.7 | 78.9 | 96.2 | 95.1 | 57.4 | 91.7 | 83.5 | 69.2 | 63.2 | 67.5 | 51.6 | 72.1 | 89.1 | 83.1
Table 4: Comparison of model performance across various mathematical subjects. Subjects: Flow: Flow Chart, Bar: Bar Chart, Scatter: Scatter Chart, Line Plot: Line Curve and Plot, Fan: Fan Chart, LiDAR: LiDAR Chart, Visual-Table: Visual-Table Chart, Three View: Three View Graph, Folded Image: Folded Image Graph, Analytic: Analytic Geometry Problem, Solid: Solid Geometry Problem, Plane: Plane Geometry Problem, Venn: Set Venn Graph, Abt-Analogy: Abstract Analogy Graph.

We report the results of mainstream multimodal large models and mathematical expert models in Table B. We analyzed how the performance of existing large models declines across problem types and conditions, as well as the effectiveness of techniques such as Cross-Lingual Prompting in solving Chinese multimodal mathematical problems. The experimental results in Table 4 indicate that our data exhibits strong diversity and relatively challenging reasoning depth. Figure 1 and Table 4 show that even models such as GPT4V struggle to comprehend our multimodal content and reasoning questions effectively, and that there are significant performance gaps between open-source and proprietary models. In certain rare visual domains, multimodal large models achieve very low accuracy.
Accuracy on various question types. We evaluated the accuracy of GPT4V on various target-solving tasks in Figure 4. The results indicate that when solving free-form problems, especially those with more diverse targets such as expressions, coordinates, and conclusion judgments, the multimodal large language model shows poorer performance.
Is OCR information sufficient for CMMaTH? Following works like MathVista, we also tried using LLMs combined with OCR information from diagrams to assist mathematical reasoning, as reported in Table 4. We found that, in our benchmark, a small amount of OCR information (such as mathematical symbols in diagrams, axis values, and image titles) is insufficient for completing our multimodal mathematical reasoning tasks. The results indicate that solving problems in CMMaTH requires stronger multimodal mathematical chart understanding capabilities, beyond just OCR.
K12 Multimodal Knowledge Richness of current LMMs. We systematically evaluated the proficiency of existing multimodal large models in K12 multimodal reasoning skills in Figure 5. The results reveal a significant gap in the K12 multimodal knowledge of existing models. Compared to other existing LMMs, GPT4V possesses a richer knowledge base, which substantially reduces reasoning hallucinations in multimodal mathematical inference.
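For readers reproducing a breakdown like Table 4 from graded responses, per-subject accuracy can be aggregated as in the sketch below. It assumes that each graded response carries a visual-subject label and a boolean verdict, and that "Overall" is the micro-average over all questions; the source does not state which average it uses.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """records: iterable of (subject, is_correct). Returns {subject: accuracy in %}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        hits[subject] += int(is_correct)
    table = {s: round(100 * hits[s] / totals[s], 1) for s in totals}
    # Assumption: "Overall" is computed over all questions, not as a mean of subject scores.
    table["Overall"] = round(100 * sum(hits.values()) / sum(totals.values()), 1)
    return table


# Toy verdicts, not results from the paper.
graded = [("Flow", True), ("Flow", False), ("Bar", False), ("Bar", False), ("Venn", True)]
scores = accuracy_by_subject(graded)
```

With highly imbalanced subjects, as in this benchmark, a micro-averaged overall score is dominated by the largest categories, so the per-subject columns carry information the overall number hides.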

5.2 Experiments of Cross-language Reason Technology

We also tried several multilingual Chain-of-Thought approaches, such as En-CoT and CLP (Cross-Lingual Prompting) used by Qin et al. (2023), to observe whether performance on multimodal mathematical problems could be improved through training-free in-context learning techniques. The results indicate that multilingual CoT methods still face challenges on CMMaTH, possibly due to the abundance of Chinese contextual text in the image content, which requires the model to have excellent cross-lingual OCR capabilities. More details on Cross-Lingual Prompting and En-CoT on the CMMaTH dataset are given in Table 5.

LMM | Overall-Acc
LLaVA-v1.5 | 4.2
InternLM-XComposer2-VL | 3.4
LLaVA-v1.5 + En-CoT | 9.4
InternLM-XComposer2-VL + En-CoT | 16.9
LLaVA-v1.5 + CLP | 12.7
InternLM-XComposer2-VL + CLP | 17.1
Table 5: The performance of train-free CoT reasoning techniques on the CMMaTH dataset.
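The difference between the two training-free techniques can be illustrated with prompt builders. The wording below is a loose paraphrase written for illustration; it is not the prompt text used in this paper or by Qin et al. (2023), and the function names are hypothetical. The sketch only shows the structural difference: En-CoT uses one prompt, while CLP first aligns the task into English and then solves it.

```python
def en_cot_prompt(question_zh: str) -> str:
    """Single-stage prompt: ask the model to reason step by step in English."""
    return f"{question_zh}\nLet's think step by step in English."


def clp_prompts(question_zh: str):
    """Two-stage prompts: first restate the task in English, then solve it."""
    align = (
        "You are an expert in multilingual understanding.\n"
        f"Request: {question_zh}\n"
        "Let's understand the task in English step by step."
    )
    solve = (
        "Based on your understanding above, act as an expert in mathematical "
        "reasoning and solve the task step by step in English."
    )
    return align, solve


align_prompt, solve_prompt = clp_prompts("如图，求三角形 ABC 的面积。")
```

In a two-stage setup the second prompt is sent in the same conversation as the first, so the solver stage sees the model's own English restatement rather than the original Chinese text alone.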

5.3 Error Analysis

Figure 6: Distribution of Error Types in GPT4V.

We conducted a detailed error analysis of GPT4V on CMMaTH-testmin, categorizing errors into four types: perception errors, reasoning errors, calculation errors, and reject errors. The distribution of GPT4V’s error types on CMMaTH is shown in Figure 6.
Perception Errors
A Perception Error is an erroneous interpretation or use of diagram content during reasoning. Examples include incorrect OCR and misidentification of numerical, geometric, or logical relationships.
Reasoning Errors
Reasoning Errors are quite common during the solving process. For instance, the model may misinterpret symbols or use incorrect logic or knowledge for inference. The frequency of Reasoning Errors reflects the model’s logical and mathematical reasoning capabilities.
Calculation Errors
Calculation Error refers to the model performing incorrect mathematical operations, such as writing equations or solving equations incorrectly.
Reject Errors
Reject Error refers to the model’s inability to solve a problem that is actually solvable. The frequency of such errors reflects the model’s ability to follow instructions.
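Given manually assigned error labels, a distribution like the one in Figure 6 can be tallied as sketched below. The label strings and the toy annotations are assumptions for illustration and do not reproduce the reported distribution.

```python
from collections import Counter

ERROR_TYPES = ("perception", "reasoning", "calculation", "reject")


def error_distribution(labels):
    """Return each error type's share (in %) among the annotated failures."""
    counts = Counter(labels)
    unknown = set(counts) - set(ERROR_TYPES)
    if unknown:
        raise ValueError(f"unexpected error labels: {sorted(unknown)}")
    total = sum(counts.values())
    return {t: round(100 * counts[t] / total, 1) for t in ERROR_TYPES}


# Toy annotations, not the distribution reported in Figure 6.
labels = ["perception"] * 5 + ["reasoning"] * 3 + ["calculation"] + ["reject"]
distribution = error_distribution(labels)
```

Rejecting unknown labels up front catches annotation typos that would otherwise silently shrink the reported shares.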

6 Conclusions

We introduce CMMaTH, a detailed Chinese math reasoning benchmark with diverse question types, rich visual content, and complex reasoning. The benchmark includes detailed knowledge points, standard solution processes, and grade levels to measure models’ mastery of K-12 multimodal math knowledge and skills. To evaluate large multimodal models quickly and affordably, we built GradeGPT, an open-source tool for assessing results. Extensive experimental results on CMMaTH reveal the limitations of current models in multilingual, multimodal math reasoning.

Limitation & Potential Impact

Our dataset CMMaTH, as a multimodal mathematics dataset aimed at the K-12 education sector, can facilitate model evaluation and iteration of multimodal large models in this field, and may promote the research and development of educational artificial intelligence. CMMaTH primarily consists of single-image problems, without considering multi-image contextual reasoning or scenarios requiring auxiliary line drawing and similar tasks. GradeGPT is a result-oriented, relatively coarse reasoning response evaluator. How to construct a process evaluation model for fine-grained assessment of the reasoning ability of large models can continue to be explored in the future.

References

Appendix A More Related Work About Multimodal Large Model Evaluation

Multimodal large models face serious hallucination issues in perceiving objects and executing inference. To systematically evaluate the various capabilities of multimodal large models, diverse multimodal benchmarks are used to assess their abilities and aid iterative development. POPE (Li et al., 2023b) is used to evaluate the accuracy of large models in identifying perceptual objects. MMMU and CMMMU (Yue et al., 2023; Zhang et al., 2024a) are comprehensive subject datasets designed to assess the proficiency of large models in mastering massive multimodal multi-disciplinary knowledge. SEED-Bench (Li et al., 2023a) provides 19,000 diverse multimodal questions spanning video and image modalities to evaluate the spatiotemporal capabilities of multimodal large models. MMVet (Yu et al., 2023) evaluates the integrated capabilities of different multimodal large model systems in combining various Vision-Language skills.

Appendix B Model Generation Details

B.1 Model Weight Version

We evaluated models on CMMaTH, including open-source models such as LLaVA-v1.5, DeepSeek-Math, InternLM-XComposer2-VL, Yi-VL-34B, CogAgent-Chat, MetaMath-70B, Llama-2-70B, Baichuan-13B and Qwen-14B, as well as the state-of-the-art commercial model GPT4V. The parameter versions and Hugging Face repository names of the open-source models are listed in Table 12.

B.2 Model Sampling Parameter

We have listed the corresponding hyperparameters used by the models in Table 13. For API models, we have indicated the corresponding release versions. Models using vLLM for inference are annotated.

B.3 Data quality control

To ensure the high quality of the final data, we conducted sampling and manual verification. We performed three random samples, each consisting of 500 multimodal samples, to check the data quality and ensure the consistency of the knowledge points and data.
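The audit procedure can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_audit_samples`, the fixed seed, and the choice to draw the three rounds independently (so rounds may overlap) are hypothetical, since the text does not say how the samples were drawn.

```python
import random


def draw_audit_samples(question_ids, rounds=3, size=500, seed=0):
    """Draw `rounds` random audit batches of `size` questions each.

    Assumption: each round samples without replacement within the round,
    but different rounds are drawn independently and may overlap.
    """
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    return [rng.sample(question_ids, size) for _ in range(rounds)]


# Hypothetical IDs standing in for the roughly 23k CMMaTH questions.
all_ids = list(range(23000))
batches = draw_audit_samples(all_ids)
print([len(batch) for batch in batches])
```

Each batch would then be handed to annotators for manual verification of the questions and their knowledge-point labels.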

Appendix C Prompt Details

Figure 7: Overall Data Collection Pipeline of CMMaTH.

C.1 Prompt For Step Response Generation

Element Prompt
system-prompt 你是一个数学题解题助手,你的输入是一道数学题,以及这道题目的图像,你的任务是输出这道题目的求解思路以及答案. 输出格式是一步一步的方式.
Example1 Question: 小明有2000元, 一个苹果2元, 小明买了10个苹果。小明还有多少钱。 Model response: 首先计算小明买苹果的开销为2乘10等于20元. <STEP1> 用2000元减去所有的开销, 2000-20=1980. <STEP2> 所以最终的答案为1980元. <STEP3>
Example2 Question: 如图,在△ABC中,∠C=90°,∠B=30°,以A为圆心,任意长为半径画弧分别交AB、AC于点M和N,再分别以M、N为圆心,大于MN/2的长为半径画弧,两弧交于点P,连结AP并延长交BC于点D,则下列说法中正确的个数是( )①AD是∠BAC的平分线;②∠ADC=60°;③点D在AB的中垂线上;④BD=2CD.A.4 B.3 C.2 D.1 Model response: 首先确定△ABC的几何性质。 <STEP1>∠C=90°,∠B=30°,∠A=120°。 <STEP2>△ABC是一个等腰三角形。 <STEP3>根据等腰三角形的性质,AD是∠BAC的平分线。 <STEP4>①AD是∠BAC的平分线,正确。 <STEP5>②∠ADC=60°,错误。 <STEP6>③点D在AB的中垂线上,错误。 <STEP7>④BD=2CD,正确。 <STEP8>因此,正确的个数为4。 <STEP9>答案为A.4。 <STEP10>请你根据这个例子,解决下面的数学题。问题:在△ABC中,∠C=90°,∠B=30°,以A为圆心,任意长为半径画弧分别交AB、AC于点M和N,再分别以M、N为圆心,大于MN/2的长为半径画弧,两弧交于点P,连结AP并延长交BC于点D,则下列说法中正确的个数是( )①AD是∠BAC的平分线;②∠ADC=60°;③点D在AB的中垂线上;④BD=2CD.A.4 B.3 C.2 D.1 求解步骤:首先确定△ABC的几何性质。∠C=90°,∠B=30°,∠A=120°。△ABC是一个等腰三角形。根据等腰三角形的性质,AD是∠BAC的平分线。①AD是∠BAC的平分线,正确。②∠ADC=60°,错误。③点D在AB的中垂线上,错误。④BD=2CD,正确。因此,正确的个数为4。答案为A.4。
Table 6: Prompt used for all models to generate step-by-step answers.

When evaluating hallucinations during the assessment process, we use a few-shot prompt format to elicit step-by-step outputs from the model, as shown in Table 6.
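A step-tagged response of this kind can be split into individual steps for per-step inspection. The sketch below is a minimal illustration, assuming that each `<STEPk>` tag closes the step text that precedes it, as in the first few-shot example; the helper name `split_steps` is hypothetical and not part of the released code.

```python
import re

STEP_TAG = re.compile(r"<STEP(\d+)>")


def split_steps(response):
    """Return a list of (step_index, step_text) pairs.

    Assumption: the text of step k is whatever appears between the
    previous tag (or the start of the response) and the <STEPk> tag.
    """
    steps = []
    start = 0
    for match in STEP_TAG.finditer(response):
        text = response[start:match.start()].strip()
        steps.append((int(match.group(1)), text))
        start = match.end()
    return steps


response = (
    "First, calculate the cost of the apples Xiaoming bought, "
    "which is 2 times 10 equals 20 yuan. <STEP1> "
    "Subtract the total cost from 2000 yuan, 2000-20=1980. <STEP2> "
    "So the final answer is 1980 yuan. <STEP3>"
)
for index, text in split_steps(response):
    print(index, text)
```

Text after the last tag is ignored in this sketch; a stricter parser might flag it as malformed output.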

C.2 Prompt For GradeGPT

Element Prompt
system-prompt You are an answer evaluation assistant. Based on the following questions and reference answers, determine if the model’s output is the same answer. When making the determination, please provide the analysis process in the following format. Note, for multiple-choice questions, it is only correct if the output includes all the corresponding options. (For example: If the reference answer is: A, C, and the model response is: C, the answer consistency is: No) Note, if the model response does not specify a particular option, and its correctness cannot be determined, the answer consistency should be: No.
Example Question: 如图所示,AD是$\bigtriangleup ABC$的中线,点E在AC上,BE交AD于点F.当$\frac{AF}{FD}=\frac{1}{2}$时,$\frac{AE}{AC}=\underline{\quad}$ Model Response: 首先,我们需要找到中线AD,然后找到点E在AC上的位置。找到中线AD,可以通过勾股定理求解,$AD^{2}=AD\times AC^{2}$。设$AD=x$, $AC=y$,则有 $x^{2}=x\*y^{2}$。<STEP1> 解这个方程,我们得到 $y=\sqrt{x}$。现在我们知道点E在AC上,所以E的横坐标为y,纵坐标为x。 接下来,我们需要找到点F在BD上的位置。由于F点在BD上,我们可以设F点在BD的一半,即$F=(x+y)/2$。<STEP2> 现在我们可以计算AF和FD的长度。$AF=x-F$, $FD=FD-F$ 将已知的x和y代入公式,得到: $AF=x-(x+y)/2$ $FD=(x+y)/2-(x+y)/2$ 简化得到:$AF=x-x$ $FD=x-x$ 这意味着$AF=FD$。<STEP3> 现在我们可以求解题目中的条件:$AE/AC=1/2$ 将已知的AF和FD代入公式,得到:$AE/AC=(x-x)/(\sqrt{x}\*y)$ 化简得到:$AE/AC=1/2$ 所以,答案为$AE/AC=1/2$ <STEP4> Reference Response: $\frac{4\pi}{3}-2\sqrt{3}$ Judge Result: The relevant paragraph in the Model Response that provides the answer is 所以,答案为AE / AC = 1 / 2. Understand the question in Chinese. The Chinese model response means: Therefore, the answer is AE / AC = 1 / 2. Next, let’s analyze whether the correct answer has been given. The reference answer $\frac{1}{5}$ is a numerical answer, and the output given in the model response is 1/2. The consistency of the answers is: <No>.
Table 7: Detailed prompt for GradeGPT and its instruction-construction prompt for GPT4.

We also list the prompts used by GradeGPT in Table 7.
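The multiple-choice rule in the system prompt can be illustrated with a small rule-based check. This is a sketch only: GradeGPT itself is a fine-tuned language model, not this function, and the names `option_set` and `options_consistent` are hypothetical. The sketch assumes the option letters have already been extracted from the response, and it treats extra options as a mismatch, which is stricter than the prompt text strictly requires.

```python
import re


def option_set(answer):
    """Collect standalone option letters A-D from an answer string."""
    return set(re.findall(r"\b[A-D]\b", answer))


def options_consistent(reference, response_options):
    """Return True only if the response names exactly the reference options.

    Assumption: an empty or unparseable response counts as inconsistent,
    mirroring the 'cannot be determined -> No' rule in the prompt.
    """
    expected = option_set(reference)
    given = option_set(response_options)
    return bool(given) and given == expected


print(options_consistent("A, C", "C"))     # partial answer
print(options_consistent("A, C", "C, A"))  # same options, different order
print(options_consistent("B", ""))         # no option specified
```

The first and third calls correspond to the two "No" cases described in the system prompt.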

C.3 Prompt For Cross-Lingual Prompting and En-CoT

We have listed the specific prompts used for En-CoT and Cross-Lingual Prompt during actual execution in Table 11. Unlike the original Cross-Lingual Prompt paper, for experimental simplicity, we only adopted a single-turn format. However, this suffices to illustrate the varying inferential capabilities across different languages in current LMMs.

Appendix D CMMaTH Dataset Details

D.1 Data Collection Details

To more clearly elucidate our data collection process, we have depicted the overall pipeline of data collection in Figure 7.

D.2 Knowledge Point Details

We provided detailed annotations of knowledge points for our dataset and conducted preliminary clustering of these knowledge points. The distribution of knowledge points in different clusters is as follows:

Figure 8: Cloud diagram of the knowledge points contained in the CMMaTH dataset.

We formulate the Knowledge Successful Solve Rate (SSR) as a structural metric to gauge how well multimodal large models master knowledge points. $N_{kn}$ is the total number of knowledge points in CMMaTH. $Acc_{kn_{i}}$ is the $Acc_{outcome}$ of the questions about the $i$-th knowledge point. $I$ denotes an indicator function.

$$SSR@\alpha=\frac{\sum_{i=1}^{N_{kn}}I(Acc_{kn_{i}}>\alpha)}{N_{kn}} \qquad (1)$$

We consider a knowledge point to be mastered only when the accuracy on problems related to that knowledge point exceeds a predefined threshold $\alpha$. In our experiments, we set $\alpha$ to 0.1, 0.2, 0.3, and 0.6 to distinguish levels of mastery.
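The SSR computation can be sketched in a few lines. The sketch makes two assumptions that the text does not state: per-knowledge-point accuracy is kept on a 0 to 1 scale so that it is comparable with the thresholds, and each record pairs one question outcome with one knowledge point. The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def knowledge_accuracy(records):
    """Map each knowledge point to its accuracy in [0, 1].

    records: iterable of (knowledge_point, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for knowledge_point, is_correct in records:
        totals[knowledge_point] += 1
        correct[knowledge_point] += int(is_correct)
    return {kp: correct[kp] / totals[kp] for kp in totals}


def ssr_at(acc_by_kp, alpha):
    """Fraction of knowledge points whose accuracy strictly exceeds alpha."""
    if not acc_by_kp:
        return 0.0
    mastered = sum(acc > alpha for acc in acc_by_kp.values())
    return mastered / len(acc_by_kp)


# Toy data: three knowledge points with accuracies 0.5, 0.0, and 1.0.
records = [
    ("triangle", True), ("triangle", False),
    ("circle", False), ("circle", False),
    ("function", True), ("function", True),
]
acc = knowledge_accuracy(records)
for alpha in (0.1, 0.2, 0.3, 0.6):
    print(f"SSR@{alpha} = {ssr_at(acc, alpha):.3f}")
```

On the toy data, two of three knowledge points exceed the thresholds 0.1 to 0.3, and only one exceeds 0.6, so SSR drops as the threshold rises.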

D.3 Characteristics Of Annotators

We utilized a standard team of four people, who spent two weeks annotating the data. All annotators have a university undergraduate education and are well-versed in basic knowledge of the K12 education field. To ensure quality, each question was verified by at least two people.

Appendix E GradeGPT details

E.1 GradeGPT Prompt Detail

We list detailed few-shot examples, built from GPT4-generated GradeGPT model responses, in Table 11. This table shows the specific form of the Cross-Lingual-Judge-of-Chain that we use.

E.2 GradeGPT Performance Metric

GradeGPT is evaluated by how accurately its judgments match human labels. We constructed a test set of model responses containing outputs from various large models (both correct and incorrect). Each output is labeled as correct or incorrect based on its result. GradeGPT is tasked with assessing whether each model response is correct, so the evaluation metric is a binary classification metric.

$$Acc_{outcome}=\frac{I(GradeGPT(R_{i}),Overcome_{GT})}{N_{response}}\times 100 \qquad (2)$$
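A minimal sketch of this metric follows. The printed equation shows a single indicator term; the sketch assumes the indicator is summed over all $N_{response}$ responses, which is the usual reading of an accuracy. The function name `acc_outcome` and the boolean encoding of judgments are hypothetical.

```python
def acc_outcome(judgments, ground_truth):
    """Percentage of responses where the grader's verdict matches the label.

    judgments:    list of bool, the grader's correct/incorrect verdicts.
    ground_truth: list of bool, the manually annotated outcomes.
    Assumption: the indicator in Equation 2 is summed over all responses.
    """
    if len(judgments) != len(ground_truth):
        raise ValueError("judgments and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    matches = sum(j == g for j, g in zip(judgments, ground_truth))
    return matches / len(ground_truth) * 100


# Toy example: the grader agrees with the annotation on 2 of 4 responses.
print(acc_outcome([True, False, True, True], [True, True, True, False]))
```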

E.3 GradeGPT Training Details

We generated cross-lingual evaluation instruction pairs using the outputs from InternLM-XComposer, LLaVA-v1.5, CogAgent-18B and Yi-VL-34B. These outputs were produced using GPT-4 Fewshot. The generated evaluation instructions were filtered based on specific rules, retaining only those responses from GPT-4 that contained the fields: <Yes>/<No>. Ultimately, we constructed a cross-lingual format instruction set comprising 56k instruction pairs.
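The filtering rule can be sketched as below. The source states only that responses containing the `<Yes>`/`<No>` field were retained; discarding responses whose tags disagree with each other is an extra assumption of this sketch, and the helper names are hypothetical.

```python
import re

VERDICT_TAG = re.compile(r"<(Yes|No)>")


def extract_verdict(judge_text):
    """Return 'Yes' or 'No' if the text carries one consistent verdict.

    Assumption: texts with no tag, or with conflicting tags, are rejected.
    """
    verdicts = set(VERDICT_TAG.findall(judge_text))
    if len(verdicts) != 1:
        return None
    return verdicts.pop()


def filter_instructions(judge_texts):
    """Keep only the judge outputs that contain a usable verdict tag."""
    return [text for text in judge_texts if extract_verdict(text) is not None]


candidates = [
    "The consistency of the answers is: <No>.",
    "The answer looks plausible but no verdict is given.",
    "The consistency of the answers is: <Yes>.",
]
kept = filter_instructions(candidates)
print(len(kept), [extract_verdict(text) for text in kept])
```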

GradeGPT was trained on 8 H800 GPUs, with Qwen-14B-Chat as the base model. The batch size was set to 16, the learning rate to 1e-4, and the gradient accumulation step to 16. It was trained for 10 epochs on a 40k bilingual Judge-of-Chain dataset. A detailed instruction example is shown in Figure 9.

Figure 9: An instruction example for fine-tuning GradeGPT.
LLM $Acc_{outcome}$
Qwen-7B-Chat(4-Shot) 35.1
   + Naive Outcome Finetune 51.5
   + Judge-of-Chain 65.3
   + Cross-Lingual-Judge-of-Chain 85.1
Qwen-14B-Chat(4-Shot) 43.7
GradeGPT(14B) 96.1
GPT4(4-Shot) 97.2
Table 8: Ablation study on the instruction fine-tuning of GradeGPT commands

E.4 Further Ablation Study

We conducted experiments on a development set of model outputs for 0.5k questions sampled from CMMaTH. Each question was accompanied by answers provided by GPT-4V, GPT-4o, and middle school students. Each answer was manually annotated to indicate whether it was correct. We use Equation 2 to measure the answer-judgment capability of different models, including zero-shot LMMs and fine-tuned LLMs.
Ablation On Instruction Format We conducted experiments on the instruction enhancement techniques used by GradeGPT and compared the results with GPT4 in Table 8. The results suggest that, with these instruction enhancements, the accuracy of GradeGPT in judging model responses on CMMaTH reaches 96.1%, far above the 43.7% of its 4-shot base model Qwen-14B-Chat and only slightly below the 97.2% of GPT4 (4-Shot). GradeGPT, as an open-source model of approximately 14B parameters, can therefore serve as a stable, low-cost, and efficient alternative to GPT4.

The baselines we compared are Qwen-7B/14B (4-Shot), GPT4 (4-Shot), Naive Outcome Finetune, and Judge-of-Chain. In the Naive Outcome Finetune instruction format, the model is only required to output whether the response is correct, in the form "<Yes>"/"<No>". Judge-of-Chain additionally includes an understanding of the result and a natural-language description of the model output, but omits the step of extracting the key Chinese output and translating it into English. Compared with having the language model directly predict the <Yes>/<No> label, Chinese-only Judge-of-Chain instructions improve answer-evaluation performance (51.5 to 65.3 in Table 8), yet they still fall well short of Cross-Lingual-Judge-of-Chain (85.1). The gap may stem from an imbalance in the quality of the Chinese and English components of some bilingual base models, caused by the training corpus.

Our findings show that fine-tuning with Cross-Lingual-Judge-of-Chain for detailed thought chain refinement significantly improves the performance of open-source models in outcome analysis tasks. Additionally, we discovered that using bilingual thought chains instead of Chinese-only thought chains for base model fine-tuning effectively enhances performance in outcome determination tasks. By using and synthesizing the instructions in the form of Cross-Lingual-Judge-of-Chain that we designed, we are able to efficiently distill the answer reviewing capabilities of GPT4.
Ablation On Instruction Data Source The instruction data for Cross-Lingual Judge-of-Chain Prompts comes from outputs of various LLMs on CMMaTH. We conducted ablation experiments on the sources of instruction data, which showed the impact of using different LLM models on constructing diverse and effective instruction data.

Model Response Source $Acc_{outcome}$
LLaVA-v1.5 response 77.2
   + InternLM-XComposer2-VL response 83.1
   + CogAgent-Chat response 84.2
   + Yi-VL-34B response 85.1
Table 9: Ablation study on the instruction fine-tuning of GradeGPT commands. All experiments were conducted using the GradeGPT (7B) version.

Appendix F Case Study

We conducted a fine-grained manual evaluation of GPT4V’s output on CMMaTH, with the results shown in Figure 10 to Figure 22.

Element Prompt
system-prompt You are a math problem-solving assistant. Your input is a math problem and an image of the problem. Your task is to provide the solution steps and the final answer. The output format should be step-by-step.
Example Question: Xiaoming has 2000 yuan, an apple costs 2 yuan, Xiaoming bought 10 apples. How much money does Xiaoming have left? Model response: First, calculate the cost of the apples Xiaoming bought, which is 2 times 10 equals 20 yuan. <STEP1> Subtract the total cost from 2000 yuan, 2000-20=1980. <STEP2> So the final answer is 1980 yuan. <STEP3>
Table 10: Prompt for En-CoT in Our Cross-Lingual Experiment.
Element Prompt
system-prompt Please act as an expert in multi-lingual understanding in Chinese. After understanding, you should act as an expert in arithmetic reasoning in English. Let’s resolve the task you understand above step-by-step! Finally, you should format your answer as ’Answer: [num]’.
Example Request: 玛丽比琼小两岁,琼比杰莎大五岁。如果杰莎20岁,这三个女孩的年龄总和是多少? Reason Result: Let’s understand the task in English step-by-step! 1. "玛丽比琼小两岁" means "Mary is two years younger than Joan." 2. "琼比杰莎大五岁" means "Joan is five years older than Jessie. " 3. "如果杰莎 20 岁" means "If Jessie is 20 years old. " Then, Let’s solve the task step-by-step: 1. According to the first sentence, Mary is two years younger than Joan. Let’s suppose Joan’s age is J, then Mary’s age is J - 2. 2. According to the second sentence, Joan is five years older than Jessie. If Jessie’s age is x, then Joan’s age is x + 5. Answer: 68.
Table 11: Detailed prompt for the Cross-Lingual Prompt (CLP) reasoning technique used in our experiments.
Model Name
Llama-2-70B meta-llama/Llama-2-70b-hf
MetaMath-70B meta-math/MetaMath-70B-V1.0
DeepSeek-Math-7B deepseek-ai/deepseek-math-7b-instruct
Baichuan-13B baichuan-inc/Baichuan2-13B-Chat
Qwen-14B Qwen/Qwen-14B-Chat
LLaVA-v1.5 liuhaotian/llava-v1.5-13b
InternLM-XComposer2-VL internlm/internlm-7b
Yi-VL-34B 01-ai/Yi-VL-34B
CogAgent-Chat THUDM/cogagent-chat-hf
Table 12: LLMs used in our experiments and their corresponding names in Huggingface Hub.
Model Name Generation Parameters Comments
Llama-2-70B do_sample=True, top_k=0.5, top_p=0.5, max_tokens=512 model="Salesforce/codegen2-16B"
GPT-4 temperature=0.2, max_tokens=2048 version="gpt-4-1106-preview"
llava-7B-V1.5 temperature=0.2, max_new_tokens=2048 llava package
DeepSeek-Math-7B temperature=0.2, max_new_tokens=2048 vllm package
Baichuan-13B temperature=0.2, max_new_tokens=2048 vllm package
Qwen-14B temperature=0.2, max_new_tokens=2048 vllm package
InterLM-XComposer2-VL temperature=0.2, max_new_tokens=2048 Huggingface
Yi-VL-34B temperature=0.2, max_new_tokens=2048 Huggingface
CogAgent-Chat temperature=0.2, max_new_tokens=2048 Huggingface
GPT4V temperature=0.2, max_tokens=2048 version="gpt-4-vision-2023-05-15"
GPT4o temperature=0.2, max_tokens=2048 version="gpt-4o-2024-02-01"
Table 13: Hyperparameters of the models used in the evaluation. A "Comments" entry of the form model="" indicates that the model was loaded with the transformers package. "vllm package" indicates that the model was run with vLLM; more details can be found at https://github.com/vllm-project/vllm. For models other than OpenAI’s GPT models, custom code was used for evaluation unless specified otherwise in the comments.
Figure 10: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 11: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 12: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 13: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 14: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 15: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 16: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 17: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 18: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 19: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 20: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 21: Case of GPT4V. The red ones are marked as generated inference hallucinations.
Figure 22: Case of GPT4V. The red ones are marked as generated inference hallucinations.