CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

Songyang Zhang¹,∗, Chuyu Zhang¹,², Yingfan Hu∗, Haowen Shen, Kuikun Liu¹,
Zerun Ma¹, Fengzhe Zhou¹, Wenwei Zhang¹, Xuming He², Dahua Lin¹, Kai Chen¹,†
¹Shanghai Artificial Intelligence Laboratory ²ShanghaiTech University
{zhangsongyang}@pjlab.org.cn
∗Equal contribution. Code is available at https://github.com/open-compass/CIBench

Abstract

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability remains challenging, which hinders a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs’ ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs’ ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.

1 Introduction

Empowered by the emerging abilities of Large Language Models (LLMs), numerous agents have surfaced to harness these generalist models for utilizing external tools, such as web browsing, document retrieval, and code interpretation, in tackling complex real-world problems. Notable examples include ToolLLM [23], LangChain [5], AutoGPT [25], and QwenAgent [2]. Specifically, agents equipped with a code interpreter leverage the advanced programming skills of LLMs through a natural language interface, thereby facilitating the creation of workflows that are both effortless and efficient. However, assessing the agents’ capability to generate executable code and effectively use code interpreters remains a challenge, hindering a clear understanding of their limitations.

In this study, we focus on assessing the proficiency of LLMs in leveraging code interpreters to address data science problems across several distinct domains, such as data analysis, visualization, and machine learning. These tasks require LLMs to exhibit advanced capabilities in instruction following, reasoning, and programming. Existing benchmarks, including GSM8K [8], MathBench [20], and HumanEval [6], primarily assess models’ abilities to solve mathematical or coding problems in isolation. These benchmarks, while valuable for measuring specific competencies, fail to fully represent LLMs’ aptitude for building complex workflows involving code interpreters in practical applications. More recent efforts, such as MINT [27], QwenAgent [2], CodeGen [21], and DS-1000 [16], aim to bridge this gap by constructing novel benchmarks tailored to mathematical problems and data analysis tasks. Despite considerable advancements, they either focus on single-turn question assessments or have a limited scope in data science. As a result, the insights they provide do not adequately reflect an LLM’s ability to invoke a code interpreter to solve real-world data science problems, which typically require consecutive sessions.

To address these shortcomings, as depicted in Fig.1, we introduce a novel evaluation framework that encompasses a benchmark with consecutive and diverse tasks, along with comprehensive assessment protocols. This framework aims to provide a thorough evaluation of LLMs’ ability to use code interpreters. Notably, the benchmark employs a distinctive LLM-human cooperative approach, as depicted in Fig. 2. It simulates authentic workflow scenarios using interactive IPython sessions with sequential, interconnected questions focused on popular Python modules such as Matplotlib, Pandas, and PyTorch. Additionally, we utilize two evaluation modes and a series of fine-grained metrics for various outputs. The statistics of our benchmark are shown in Tab.1.

Figure 1: Features of our benchmark. Our benchmark consists of interactive sessions, diverse tasks covering various Python modules, and comprehensive evaluations (the tool-call rate is not displayed).

Table 1: Comparison of different datasets. "Multi-turn" denotes that LLMs can attempt to solve tasks through multiple trials. "Code interpreter" means whether LLM needs to call the code interpreter to solve the problem. "Consecutive" indicates that each task consists of a series of sequential questions rather than a single question.

Dataset	Topic	Size	Multi-turn	Code Interpreter	Consecutive
DS-1000	Data Science	1000 questions	No	No	No
MINT	Math, Reasoning	586 questions	Yes	Yes	No
CodeGen	Math, Data Science, etc.	115 tasks	No	No	Yes
QwenAgent	Math, Data Visualization	295 questions	No	Yes	No
CIBench	Data Science	234 tasks, 1900+ questions	Yes	Yes	Yes

Specifically, we build the evaluation dataset by first identifying ten highly prevalent Python libraries within the domain of data science. Then, we prompt an advanced LLM, such as GPT-4, to generate instructions and code snippets in Jupyter Notebook format. Each notebook is structured to contain a sequence of 10 to 15 progressive steps with increasing levels of complexity. To mitigate any inherent biases or limitations in the LLM-generated content, we employ human experts to create template tasks based on the common patterns observed in the LLM-generated tasks and online resources. These tasks can incorporate multiple interchangeable datasets for evaluation. These designs ensure that the benchmark encapsulates both diversity and quality, thereby offering a comprehensive and balanced assessment of code interpreter capabilities.

To thoroughly assess the LLMs’ performance on our benchmark, we have instituted two distinct evaluation modes: the end-to-end mode and the oracle mode. In the end-to-end mode, LLMs are tasked with a holistic problem-solving process where they must reason through given instructions and generate corresponding code. This requires them to iteratively refine their output based on feedback from the code interpreter, as they attempt to solve multiple consecutive questions that build upon one another. Additionally, the oracle mode simulates guided learning by providing the LLM with the correct code snippet when it fails. This immediate feedback mimics human guidance and equips the model to use this accurate example for tackling subsequent tasks in the same context. Furthermore, we introduce two types of metrics: the process-oriented (i.e., tool-call rate, executable rate) and output-oriented (i.e., numeric accuracy, text score, visualization score), to provide a comprehensive analysis of the model’s performance.

Based on our evaluation framework, we conduct extensive experiments and analysis using 24 LLMs. The results indicate that open-source LLMs struggle to utilize modeling modules such as PyTorch and TensorFlow (Fig.10), and the best open-source LLM lags behind GPT-4 by 10.0 points in average score (Tab.3). In summary, our contributions are three-fold:

• We build a new benchmark for agents with code interpreters using an LLM-human cooperative method. It consists of interactive IPython sessions with interconnected questions on key data science libraries, simulating interactive problem-solving scenarios in practical workflows.

• We devise unique assessment strategies involving both end-to-end and oracle modes. We also introduce several evaluation metrics to assess various outputs, offering a comprehensive gauge of LLMs’ coding prowess within the benchmark.

• We conduct thorough experiments with 24 LLMs to analyze their performance on our benchmark. The results indicate that open-source LLMs are inferior to GPT-4 by a large margin and perform poorly on modules in the modeling category.

2 Related Works

CIBench is an evaluation framework that assesses LLMs’ [26, 2, 10, 7, 4] ability to utilize external code interpreters for solving data science tasks. Therefore, we focus on presenting work related to invoking code interpreters and benchmarks related to data science.

2.1 Model with Plugins

LLM-based agents use external tools via APIs to solve complex tasks and have been regarded as a promising direction [5, 23, 25, 24, 13, 30]. Specifically, [18, 13, 22] develop efficient workflows to coordinate multi-agent systems for automatic programming. [24, 5, 23, 11] equip LLMs with external tools, such as search engines, calculators, and code interpreters, to augment LLMs’ problem-solving ability. Among these tools, the code interpreter can promote LLMs’ reasoning and coding ability and has gradually gained attention in works like [32, 33, 34].

In contrast, we aim to develop an evaluation framework to benchmark LLMs’ ability with code interpreters for solving data science tasks, providing insights for future work to promote LLMs for better realistic utilization.

2.2 Related Benchmarks

Various benchmarks [8, 6, 23, 2] have been proposed to measure LLMs’ reasoning, coding, and tool utilization ability. Classic benchmarks, such as GSM8K [8], MATH [12], HumanEval [6], and MBPP [1], focus on evaluating the mathematical reasoning or coding capability of LLMs. ToolBench [23] and MS-Agent Bench [17] aim to evaluate LLMs’ capability in effectively utilizing various tools and generating accurate and contextually appropriate responses. [27] introduce MINT, a benchmark that evaluates LLMs’ ability to solve challenging tasks with multi-turn interactions by using tools and leveraging natural language feedback. However, the above benchmarks cannot measure LLMs’ ability in data science tasks, which require instruction following, coding, and tool utilization abilities. To address this gap, [21] introduce a multi-turn programming benchmark consisting of 115 diverse problems, covering topics such as mathematics, algorithms, and data science. QwenAgent [2] introduces a benchmark for data science, focusing mainly on mathematical problems and data visualization. [16] introduce DS-1000, a code generation benchmark with a thousand data science problems. Meanwhile, [14] introduce DABench, which evaluates various concepts with individual questions assigned to each dataset. However, this approach does not effectively mimic practical scenarios where interconnected questions are raised.

In contrast to the above works, CIBench simulates a real-world data science workflow by leveraging sequential interactive IPython sessions and covers most concepts in data science by including commonly used Python modules. Furthermore, we devise two evaluation modes and five metrics to holistically evaluate LLMs’ abilities.

3 CIBench

To benchmark LLMs’ ability to leverage code interpreters for addressing data science problems, we propose a novel evaluation framework (Fig.2), which comprises a diverse evaluation dataset and two newly devised evaluation modes. The evaluation dataset is generated through an LLM-human cooperative approach and simulates authentic workflow scenarios for solving sequential and interconnected tasks. Given the evaluation dataset, we adhere to the ReAct protocol [31] to generate reasoning traces and invoke code interpreters alternately. We also allow LLMs to attempt each task multiple times, which enables exploration of their self-debugging capabilities based on feedback from the code interpreter. Finally, we propose two evaluation modes, the end-to-end mode and the oracle mode, to comprehensively measure LLMs’ ability with and without human interaction.

In the following sections, we will detail the construction of the dataset in Sec.3.1 and the evaluation modes and metrics in Sec.3.2.

Table 2: Selected Python modules and their categories.

Category	Python Modules
Data Cleaning and Manipulation	Pandas
Data Visualization	Matplotlib, Seaborn
Modeling	Scikit-learn, PyTorch, TensorFlow, LightGBM
Natural Language Processing	NLTK
Image Processing	OpenCV-Python
Mathematics and Statistics	SciPy

Figure 3: An example prompt of task generation.

3.1 Evaluation Dataset

Python Modules Selection

We carefully choose modules that pertain to the field of data science. They encompass a wide array of topics, such as data cleaning, visualization, image analysis, mathematical computations, and statistical methods. The modules we have selected are detailed in Tab.2.

Topic Candidates Generation

After selecting modules, we prompt (Appendix B) GPT-4 to summarize 50 topics for each module, to encapsulate the vast majority of the module’s functionalities, thereby offering precise guidance for the subsequent generation of more targeted questions.

Tasks Generation and Refinement

We sample a module and topic, then prompt GPT-4 to generate questions and code based on the prompt in Fig.3. The prompt is designed to enable GPT-4 to generate a Jupyter notebook with sequential steps and various outputs, including numeric answers, structured output, and visualizations, mirroring real-world scenarios. Despite our request for concise descriptions, the generated content may lack conciseness and continuity. To address this, we undertake iterative refinement of these tasks. This process involves presenting both good and bad cases, along with additional modifications, to enhance the quality of questions and reduce ambiguity. Details of prompts used for refinement are in Appendix B.

Human Refinement

Despite the excellent capabilities of LLMs, we found that the generated tasks often focus on typical datasets, such as Titanic and Iris, which limits task diversity. Additionally, ensuring perfect accuracy in question descriptions and result correctness is challenging.

To enhance diversity, we summarize template tasks based on the generated tasks and existing high-quality tutorials available in various Python libraries (Fig.11). With minor modifications, these template tasks can be adapted to a wide range of datasets. To further diversify the benchmark, we collect new datasets through two approaches: 1) We prompt GPT-4 to generate datasets tailored to different templates, leveraging GPT-4’s ability to flexibly specify characteristics and data attributes. 2) We incorporate the latest datasets from the past year, ensuring authenticity and diversity, and significantly reducing the likelihood that the model has previously encountered this data.
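The idea of a template task that can be re-targeted at different datasets can be illustrated with a minimal sketch. The step wording, placeholder names, and dataset attributes below are hypothetical and are not taken from the benchmark itself; the sketch only shows how one sequence of steps could be instantiated for several datasets.

```python
from string import Template

# Hypothetical template: placeholders stand for dataset-specific details.
STEPS = [
    Template("Load the file '$path' into a DataFrame and report its shape."),
    Template("Compute the mean of the '$numeric_col' column, rounded to 2 decimals."),
    Template("Plot a histogram of '$numeric_col' grouped by '$group_col'."),
]


def instantiate(dataset: dict) -> list:
    """Fill every step of the template with one dataset's attributes."""
    return [step.substitute(dataset) for step in STEPS]


# Two made-up datasets that share the same task structure.
datasets = [
    {"path": "sales.csv", "numeric_col": "revenue", "group_col": "region"},
    {"path": "weather.csv", "numeric_col": "temperature", "group_col": "station"},
]
tasks = [instantiate(d) for d in datasets]
```

Each entry of `tasks` is a sequence of consecutive questions with the same structure but different data, which is the property the template design relies on.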

To improve quality, we employ several experts who use a code interpreter for manual double-checking. This approach ensures that questions are written by real users and that various factors, such as runtime, are carefully controlled. For a comprehensive overview of the rules governing quality control, please refer to Appendix B. The statistics of the dataset are provided in Appendix A.2.

3.2 Evaluation Modes and Metrics

Evaluation Modes

As shown in Fig.4, CIBench includes the end-to-end and oracle modes. Together, they assess not only the model’s proficiency in autonomously resolving consecutive problems but also its capacity when working in conjunction with human interaction. In end-to-end mode, the model must solve the problem by itself. Each subsequent response is contingent upon the model’s previous outcomes, necessitating self-correction based on code interpreter feedback. Moreover, to reflect real-world scenarios where human assistance is available, we introduce oracle mode, which incorporates the ground truth code and thought as context to help the model address subsequent tasks. Conceptually, oracle mode emulates a few-shot testing or in-context learning scenario, providing the model with comprehensive context to facilitate a more precise solution.
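The difference between the two modes comes down to what is placed in the context for later steps. The following is a minimal sketch under stated assumptions: `Step`, `run_task`, and the `generate(context, question)` callable are hypothetical names, and the oracle branch always inserts the ground-truth code, which is one possible reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    question: str
    gt_code: str  # ground-truth code written for the benchmark


def run_task(steps, generate, mode="end_to_end"):
    """Return the model's code for every step of one task.

    generate(context, question) -> code is a hypothetical model call.
    """
    context = []       # (question, code) pairs visible to the model
    predictions = []
    for step in steps:
        code = generate(context, step.question)
        predictions.append(code)
        if mode == "oracle":
            # Later steps see the ground truth, not the model's own attempt.
            context.append((step.question, step.gt_code))
        else:
            # End-to-end: later steps build on the model's own output.
            context.append((step.question, code))
    return predictions
```

In end-to-end mode an early mistake stays in the context and can propagate, whereas in oracle mode every step starts from a correct history.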

Evaluation Metrics

In both evaluation modes, we introduce two types of evaluation metrics, process-oriented and output-oriented, to provide a comprehensive analysis of the model’s performance. Process-oriented metrics focus on the correct invocation of tools and the successful compilation and execution of code. These metrics include the Tool Call Rate, which measures the proportion of instances where the model correctly follows the instructions to invoke a code interpreter, and the Executable Rate, which indicates the percentage of code that is executed without any errors.
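As an illustration of the two process-oriented metrics, the sketch below runs the code of each step in a shared namespace (mimicking one continuous IPython session) and counts tool calls and error-free executions. The function name and input format are hypothetical, and taking the number of steps as the denominator of both rates is an assumption, since the text does not fix it.

```python
def process_metrics(step_codes):
    """Compute tool call rate and executable rate for one task.

    step_codes holds one entry per step: the generated code string,
    or None when the model did not invoke the interpreter.
    """
    namespace = {}   # shared across steps, like a single session
    called = 0
    executed = 0
    for code in step_codes:
        if code is None:
            continue
        called += 1
        try:
            exec(code, namespace)
            executed += 1
        except Exception:
            pass     # any error (including SyntaxError) counts as not executable
    n = len(step_codes)
    return {"tool_call_rate": called / n, "executable_rate": executed / n}


codes = ["x = 2", None, "y = x + 1", "z = undefined_name"]
print(process_metrics(codes))
# {'tool_call_rate': 0.75, 'executable_rate': 0.5}
```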

On the other hand, output-oriented metrics focus on the outcomes of the model. These metrics include Numeric Accuracy, which assesses the accuracy of the numerical results; Text Score, which measures the quality of the structured text output using the ROUGE metric [19]; and Visualization Score, which evaluates the quality of visual output. Instead of using GPT-4V as QwenAgent does, which is expensive and ineffective, we propose using the structural similarity [28] between the predicted and ground truth images as the visualization score. These metrics provide a holistic evaluation of the LLM’s capabilities.
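The three output-oriented metrics can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the benchmark's implementation: the relative tolerance for numeric matching is made up, the text score uses an LCS-based ROUGE-L F1 (the text does not say which ROUGE variant is used), and the image score is a simplified single-window structural similarity computed over whole grayscale images instead of the usual sliding-window form.

```python
import math


def numeric_match(pred, truth, rel_tol=1e-2):
    """True when the predicted number is close to the ground truth (tolerance is an assumption)."""
    return math.isclose(pred, truth, rel_tol=rel_tol)


def rouge_l_f1(pred, ref):
    """ROUGE-L F1 over whitespace tokens, based on the longest common subsequence."""
    p, r = pred.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if pt == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def global_ssim(x, y, data_range=255.0):
    """Single-window structural similarity of two equally sized flat grayscale pixel lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Because the similarity score depends only on pixel statistics, it is cheap to compute, which is the motivation given above for preferring it over a model-based judge.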

4 Experiments

4.1 Experiments Setup

To provide a thorough analysis, we evaluate 24 chat models, including popular open-source LLMs and the proprietary GPT-4, using the CIBench benchmark. During inference, we allow LLMs up to three attempts per question. The specific versions of the Python modules used in the code interpreter are provided in Appendix A.1. All experiments are conducted within the OpenCompass [9] evaluation platform.

Table 3: Main results of CIBench. Tool, Exe, Num, Text, and Vis denote the tool call rate, executable rate, numeric accuracy, text score, and visualization score, respectively. Bold denotes the best score among models of the same scale. Average is the mean of Num, Text, and Vis over the two modes.

Model	End-to-end Mode					Oracle Mode					Average
Model	Tool	Exe	Num	Text	Vis	Tool	Exe	Num	Text	Vis	Average
Llama-2-7B-Chat	80.6	28.8	10.1	10.4	17.4	92.7	41.5	22.4	31.2	24.8	19.4
Yi-6B-Chat	90.6	50.2	22.5	21.7	31.1	99.7	61.5	29.5	40.8	40.2	31.0
ChatGLM3-6B-32K	77.9	48.8	18.9	33.0	29.3	98.6	55.7	28.2	52.0	34.2	32.6
Vicuna-7B-v1.5-16k	89.4	54.5	19.9	27.1	32.7	99.9	62.9	29.3	43.8	43.1	32.7
DeepSeek-7B-Chat	62.9	45.3	16.0	17.4	30.8	93.8	67.1	33.0	56.7	48.5	33.7
Qwen-7B-Chat	99.8	72.2	42.3	55.7	43.9	97.9	71.9	36.6	58.1	52.1	48.1
Mistral-7b-Instruct	97.5	67.2	41.9	53.6	41.7	99.2	75.7	48.0	55.4	49.1	48.3
Qwen1.5-7B-Chat	99.1	69.9	35.4	59.1	39.1	99.8	75.9	48.6	61.7	52.1	49.3
InternLM2-7B-Chat	99.7	77.7	52.6	64.6	51.8	99.3	70.4	48.1	44.3	42.9	50.7
DeepSeek-Coder-6.7B	75.9	66.0	40.7	48.5	47.7	96.8	83.0	59.2	77.2	62.5	56.0
Llama-3-8B-Instruct	98.5	78.6	55.6	70.8	54.6	95.5	84.5	60.4	83.7	61.4	64.4
Llama-2-13B-Chat	90.7	42.7	12.1	14.8	24.0	99.0	52.4	28.9	31.0	29.3	23.4
Vicuna-13B-v1.5-16k	91.0	62.2	25.9	45.4	35.4	99.9	72.1	39.7	45.2	45.9	39.6
Qwen-14B-Chat	93.1	70.9	48.2	60.7	48.7	99.2	82.4	55.5	74.6	57.3	57.5
Qwen1.5-14B-Chat	99.7	84.8	56.9	63.3	51.4	99.9	92.7	61.7	62.6	63.1	59.8
Mixtral-8x7B-Instruct	99.1	83.0	45.4	55.9	53.0	99.9	88.6	67.5	73.5	67.0	60.4
InternLM2-20B-Chat	96.6	82.2	47.5	67.2	47.1	96.7	81.0	60.4	85.0	56.7	60.6
Yi-34B-Chat	92.9	66.2	42.2	40.3	42.2	99.8	77.2	68.2	72.2	52.7	53.0
Llama-2-70B-Chat	92.4	53.8	23.7	22.7	24.9	94.2	61.6	36.7	36.0	35.6	29.9
DeepSeek-67B-Chat	76.9	70.7	47.5	63.3	54.8	91.1	82.8	60.6	74.7	63.8	60.8
Qwen-72B-Chat	98.3	87.2	60.5	73.2	55.9	99.7	90.1	62.8	73.6	64.9	65.1
Llama-3-70B-Instruct	98.0	91.1	64.9	55.7	63.6	96.6	82.8	72.9	74.2	62.0	65.5
gpt-4o	100.0	97.2	74.1	79.0	61.1	99.2	94.8	81.7	79.7	71.6	74.5
GPT-4-1106-preview	99.8	98.6	77.8	78.9	64.0	99.6	97.7	77.5	81.6	73.1	75.5
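The Average column can be reproduced from the other columns. The short sketch below follows the caption's definition (mean of Num, Text, and Vis over both modes, excluding Tool and Exe) and checks it against two rows of Table 3; the function name and dictionary keys are illustrative.

```python
def average_score(end_to_end, oracle):
    """Mean of Num, Text, and Vis over both modes (Tool and Exe are excluded)."""
    keys = ("num", "text", "vis")
    values = [end_to_end[k] for k in keys] + [oracle[k] for k in keys]
    return sum(values) / len(values)


# Llama-2-7B-Chat row of Table 3
llama2_7b = average_score(
    {"num": 10.1, "text": 10.4, "vis": 17.4},
    {"num": 22.4, "text": 31.2, "vis": 24.8},
)
print(round(llama2_7b, 1))  # 19.4

# GPT-4-1106-preview row of Table 3
gpt4 = average_score(
    {"num": 77.8, "text": 78.9, "vis": 64.0},
    {"num": 77.5, "text": 81.6, "vis": 73.1},
)
print(round(gpt4, 1))  # 75.5
```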

4.2 Main Results

We categorize the models into distinct groups based on their scales to enable more effective comparisons. As illustrated in Table 3, Llama-3-8B-Instruct achieves the highest performance in the 7B group, while InternLM2-20B-Chat demonstrates the top performance in the 13B-20B group. Within the 70B group, Llama-3-70B-Instruct secures the leading position with an average score that is 10.0 points lower than that of GPT-4-1106-preview. The API models GPT-4-1106-preview and gpt-4o outperform all other models, especially in end-to-end mode, highlighting the significant room for improvement in current open-source models. In addition, larger models tend to exhibit superior performance across various metrics, in line with established trends [3, 15, 29]. Moreover, models within the same series (such as Mistral, InternLM, Qwen, Llama3, etc.) consistently maintain relatively stable rankings within their respective parameter groups, underscoring the stability and efficacy of our approach.

When comparing the end-to-end mode and oracle mode, it becomes evident that the oracle mode surpasses the end-to-end mode across all metrics for most models. This observation suggests that LLMs can achieve better results with human interaction, hinting at a promising avenue for integrating LLMs to assist humans in data science. The experiment demos are shown in Appendix C.

4.3 Error Mode Analysis

In the evaluation of CIBench, we identify four prevalent types of errors in the code generated by the model. These errors are categorized as follows: 1) Instruction Following Errors: These encompass instances where the model deviates from or disregards provided instructions, reflecting a lack of adherence to specified guidelines; 2) Hallucination Errors: This category pertains to the phenomenon of the model generating code that contains hallucinated elements, such as utilizing undefined parameters or referencing irrelevant variables; 3) Reasoning Errors: These errors occur when the model encounters complex problems, often resulting in logical errors in the generated code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in code generation; 4) Code Errors: Basic errors in code generation fall under this category. While these errors may sometimes appear trivial, they signify potential deficiencies in the model’s code-generation process. These identified errors effectively underscore the current limitations of LLMs in terms of their coding capabilities, providing valuable insights for the ongoing development of CIBench. Since automatically classifying errors is challenging, we manually categorized the four types of errors in GPT-4-1106-preview. The proportions of these error types are 31.9%, 4.3%, 40.4%, and 23.4%, respectively. Detailed examples of these errors are presented in Appendix E.

Furthermore, to provide insight into enhancing performance on CIBench, we analyze its correlation with existing benchmarks such as IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. These benchmarks evaluate the instruction-following, reasoning, and coding abilities of LLMs. As shown in Fig.5, CIBench performance exhibits a strong correlation with these benchmarks, with Pearson correlation coefficients exceeding 0.7, particularly for HumanEval and GSM8K. This result indicates that improvements in reasoning and coding can boost LLM performance on CIBench.
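For reference, the Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The score lists are made-up placeholder values chosen only to exercise the function; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Placeholder per-model scores on two benchmarks (illustrative only).
benchmark_a = [20.0, 35.0, 50.0, 65.0]
benchmark_b = [15.0, 30.0, 55.0, 60.0]
r = pearson(benchmark_a, benchmark_b)
```

A value of `r` close to 1 means that models ranking high on one benchmark also tend to rank high on the other.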

4.4 More Analysis

Debug Ability Analysis

In the ReAct protocol, we allow LLMs to try to solve tasks multiple times. During each trial, the model can use feedback from the code interpreter to rectify any bugs in the generated code. To assess the LLMs’ ability to autonomously correct bugs, we vary the number of trials. As shown in Fig.6, increasing the number of trials correlates with improvements across all metrics. Notably, for most LLMs, there is a marked improvement when the number of trials reaches two, particularly in metrics such as executable rate, numeric accuracy, and visualization score. This suggests that the LLM can autonomously rectify bugs to a certain extent. In our experiments, to balance evaluation time and performance, we set the number of trials to three.
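The retry behaviour described above can be sketched as a loop that feeds the interpreter's error message back to the model. The names `solve_with_retries` and `generate(question, feedback)` are hypothetical, and running code with `exec` in a dictionary namespace is a stand-in for a real IPython kernel.

```python
import traceback


def solve_with_retries(generate, question, namespace, max_trials=3):
    """Ask for code up to max_trials times, passing back the last error.

    generate(question, feedback) -> code is a hypothetical model call;
    feedback is None on the first trial.
    """
    feedback = None
    for trial in range(1, max_trials + 1):
        code = generate(question, feedback)
        try:
            exec(code, namespace)
            return {"ok": True, "trials": trial}
        except Exception:
            # Keep only the innermost frame and the error line as feedback.
            feedback = traceback.format_exc(limit=1)
    return {"ok": False, "trials": max_trials}
```

Note that a failed attempt may have partially modified `namespace` before raising, which a real evaluation harness would need to account for.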

Visualization Metric Analysis

To validate the effectiveness of our proposed visualization metric, we follow QwenAgent [2] and use GPT-4V to assess visualization scores on a subset of CIBench tasks. The prompt is provided in Appendix D. As shown in Fig.8, although structural similarity is derived from low-level features, it correlates strongly with GPT-4V scores, demonstrating remarkable consistency between the two metrics. Therefore, we can use structural similarity as a simplified visualization metric that substitutes for GPT-4V in effective analysis.

Cross Language Analysis

To benchmark the LLMs’ ability in Chinese, we created a Chinese version of CIBench by translating the human-created tasks into Chinese. This allows us to evaluate code interpreter performance in Chinese. As shown in Fig.8, we observe that: 1) most models exhibit a slight decrease on Chinese CIBench compared to the English version; 2) the strong DeepSeek-67B-Chat and Qwen-72B-Chat drop substantially on Chinese CIBench compared to the English version. Further research and development efforts are necessary to address these discrepancies and improve the performance of LLMs in multilingual scenarios.

Difficulty Analysis

CIBench comprises interactive tasks, and we assume that the number of interactive steps reflects the level of difficulty. To demonstrate model performance across these difficulty levels, we decompose each task into three categories: easy (up to 2 steps), medium (2-4 steps), and difficult (more than 4 steps), and then evaluate performance accordingly. As Tab.10 shows, the performance of most models decreases as the number of steps increases.
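The bucketing by step count can be sketched as follows. Since the stated ranges overlap at two steps, the sketch assumes that step 2 counts as easy and steps 3 to 4 as medium; the function names and the per-step score format are likewise hypothetical.

```python
from collections import defaultdict


def difficulty(step_index):
    """Map a 1-based step index to a difficulty level (boundary handling is an assumption)."""
    if step_index <= 2:
        return "easy"
    if step_index <= 4:
        return "medium"
    return "hard"


def score_by_difficulty(step_scores):
    """Average per-step scores within each difficulty level.

    step_scores is a list of (step_index, score) pairs.
    """
    buckets = defaultdict(list)
    for index, score in step_scores:
        buckets[difficulty(index)].append(score)
    return {level: sum(v) / len(v) for level, v in buckets.items()}


# Made-up scores for a six-step task (illustrative only).
example = [(1, 1.0), (2, 1.0), (3, 0.5), (4, 0.5), (5, 0.0), (6, 1.0)]
print(score_by_difficulty(example))
# {'easy': 1.0, 'medium': 0.5, 'hard': 0.5}
```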

Different Category Modules Analysis

We assess the capabilities of different LLMs with various category modules (refer to Tab. 2). As shown in Fig. 10, LLMs demonstrate proficiency in tackling mathematical and statistical tasks utilizing SciPy modules. However, open-sourced LLMs encounter challenges when handling modeling tasks that demand advanced coding and reasoning skills. We hope that future open-sourced LLMs will excel in modeling tasks.

Limitation

Our work has two main limitations: 1) CIBench is currently limited to Python, although it could be extended to other programming languages using a similar methodology; 2) the evaluation metrics of CIBench have limitations in measuring certain data science tasks, such as "training a model with PyTorch" and tasks involving randomness.

Model	Easy	Medium	Hard
Yi-6B-chat	37.5	33.1	23.1
Llama-2-7B-chat	25.7	19.2	15.7
Qwen-7B-chat	51.0	45.5	36.5
Vicuna-7B-v1.5-16k	39.0	35.9	23.8
DeepSeek-7B-chat	38.8	41.8	22.8
Mistral-7B-instruct-v0.2	52.3	52.4	33.0
Chatglm3-6B-32k	43.2	33.3	19.0
Mistral-8x7B-instruct-v0.1	62.1	61.2	47.6
Internlm2-chat-7B	55.5	52.0	33.9
Qwen-14B-chat	59.6	58.8	42.0
Vicuna-13B-v1.5-16k	46.1	41.6	27.7
Llama-2-13B-chat	34.8	23.5	18.7
Internlm2-chat-20B	51.5	59.2	45.1
Yi-34b-chat	55.2	58.1	38.0
Llama-2-70B-chat	38.1	33.9	18.6
DeepSeek-67B-chat	60.4	63.9	43.7
Qwen-72B-chat	60.7	66.2	50.3
Llama-3-70B-chat	62.3	67.1	42.0

5 Conclusion

We propose a novel benchmark, named CIBench, to comprehensively assess LLMs’ ability to leverage code interpreters for complex data science tasks. It includes an evaluation dataset covering widely used Python modules in data science and two evaluation modes measuring LLMs’ ability with and without human assistance. The evaluation dataset is constructed using an LLM-human cooperative approach, leveraging interactive IPython sessions to simulate realistic scenarios in data science. Thorough experimental analysis with 24 LLMs on CIBench indicates that LLMs perform poorly on modules in the modeling category. Based on our experimental analysis, we offer several insights for the future development of LLMs: enhancing the LLM’s ability to correct errors based on feedback, improving its understanding of user intentions across multiple rounds of interactions, and, most importantly, strengthening its reasoning capabilities.

References

[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021.
[2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[4] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
[5] Harrison Chase. LangChain, October 2022.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
[9] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
[10] DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism, 2024.
[11] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023.
[12] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021.
[13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[14] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024.
[15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
[16] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
[17] Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, and Jingren Zhou. Modelscope-agent: Building your customizable agent system with open-source large language models, 2023.
[18] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023.
[19] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[20] Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024.
[21] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
[22] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[23] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023.
[24] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[25] Significant Gravitas. AutoGPT.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[27] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2023.
[28] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[29] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022.
[30] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
[32] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
[33] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
[34] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.

Checklist

1. For all authors…
  (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See Introduction
  (b) Did you describe the limitations of your work? [Yes] See Sec.4.4
  (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our paper proposes a dataset to measure the programming ability of large language models. There is currently no negative social impact.
  (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] See Appendix

2. If you are including theoretical results…
  (a) Did you state the full set of assumptions of all theoretical results? [N/A]
  (b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments (e.g. for benchmarks)…
  (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] https://github.com/open-compass/CIBench
  (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] https://github.com/open-compass/CIBench
  (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] The experiments are conducted at a temperature of 0.
  (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See 4.1

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  (a) If your work uses existing assets, did you cite the creators? [N/A]
  (b) Did you mention the license of the assets? [N/A]
  (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
  (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
  (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects…
  (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] See Appendix
  (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
  (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Dataset Details

A.1 Module Version Settings

The versions of the Python modules used in the code interpreter are listed in Tab.4.

Table 4: The module version settings in CIBench.

Module	Version
Pandas	1.5.3
Matplotlib	3.7.2
Seaborn	0.13.0
Scikit-learn	1.2.1
PyTorch	1.13.1
TensorFlow	2.14.0
LightGBM	4.1.0
NLTK	3.8
OpenCV-Python	4.8.1.78
SciPy	1.11.2

A.2 Dataset Statistics

CIBench comprises generation tasks, template tasks, and Chinese template tasks, which produce three types of output: numerical, text, and visualization. The statistics of CIBench are shown in Tab.5.

Table 5: Dataset statistics of CIBench. "generation", "template", and "template_cn" represent generation tasks, template tasks, and Chinese template tasks, respectively. "Other" refers to questions that only require successful execution, without any output comparison.

Subset	Numerical	Text	Visualization	Other	Total
generation	210	76	466	208	960
template	147	20	161	142	470
template_cn	147	20	161	142	470
total	504	116	788	492	1900
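The counts in Tab.5 can be cross-checked with simple arithmetic. The sketch below is illustrative only: it reads the four count columns as numerical, text, visualization, and other (following the caption), and the helper names are hypothetical.

```python
# Rows of Tab.5 as (numerical, text, visualization, other, total).
SUBSETS = {
    "generation": (210, 76, 466, 208, 960),
    "template": (147, 20, 161, 142, 470),
    "template_cn": (147, 20, 161, 142, 470),
}
REPORTED_TOTAL = (504, 116, 788, 492, 1900)


def row_is_consistent(counts):
    """True when the per-type counts add up to the row's reported total."""
    *parts, total = counts
    return sum(parts) == total


def column_totals(rows):
    """Sum each column across all subsets."""
    return tuple(sum(col) for col in zip(*rows.values()))


if __name__ == "__main__":
    for name, counts in SUBSETS.items():
        print(name, "consistent:", row_is_consistent(counts))
    print("column totals match:", column_totals(SUBSETS) == REPORTED_TOTAL)
```

Both checks hold for the reported numbers: each row sums to its total, and the three subsets sum column-wise to the final row.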

Appendix B Construction Prompts and Rules

Topic Generation

The prompt used for topic generation is shown in Fig. 13.

Question Refinement

The prompts used for question refinement are shown in Fig. 14 and Fig. 15.

Quality Control Rules

We include manual quality checking in the final steps due to limitations in the LLMs’ ability to control runtime or file size. The rules used for quality control are shown in Fig.16.

Appendix C Experiment Example Demo

An example of the experiment is shown in Fig. 11 and Fig. 12.

Appendix D Subjective Visualization Evaluation

The prompt used for subjective visualization scoring is shown in Fig.17.

Appendix E Dataset Error Analysis

In evaluating CIBench, we identify four prevalent types of errors in model-generated code: 1) Instruction Following Errors (Fig.18): the model deviates from or disregards the provided instructions; 2) Hallucination Errors (Fig.19): the model generates code containing hallucinated elements, such as undefined parameters or irrelevant variables; 3) Reasoning Errors (Fig.20): the model produces logical errors in the generated code when it encounters complex problems, which reveals how well it handles intricate code-generation tasks; 4) Code Errors (Fig.21): basic errors in code generation that, although sometimes trivial in appearance, signal potential deficiencies in the model’s code-generation process. These errors underscore the current limitations of LLMs’ code interpreter capabilities and provide insights for the ongoing development of CIBench.

Appendix F Human Annotator

The authors of this paper also serve as the annotators for CIBench. They are tasked with creating template tasks for various data science challenges, ensuring coverage of a wide range of concepts in the field. Our data collection process for these tasks does not involve any private information.

Appendix G Ethical Consideration

We use GPT-4 and online resources to construct our benchmark. The benchmarks are carefully processed by experts to exclude any private information. Additionally, we utilize ChatGPT for text refinement and to correct any typographical errors during the writing process.

Figure 13: An example prompt of topic generation.

Figure 14: An example prompt of question refinement.

Figure 15: Another example prompt of question refinement.

Figure 16: Rules used for manual dataset quality control.

Figure 17: An example prompt for subjective visualization scoring.

Figure 18: Examples of instruction following errors.

Figure 19: Examples of hallucination errors.

Figure 20: Examples of reasoning errors.

Figure 21: Examples of code errors.