CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

Songyang Zhang1,∗, Chuyu Zhang1,2, Yingfan Hu, Haowen Shen, Kuikun Liu1,
Zerun Ma1, Fengzhe Zhou1, Wenwei Zhang1, Xuming He2, Dahua Lin1, Kai Chen1,†
1Shanghai Artificial Intelligence Laboratory  2ShanghaiTech University
{zhangsongyang}@pjlab.org.cn
Equal contribution. Code is available at https://github.com/open-compass/CIBench
Abstract

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability remains challenging, which hinders a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs’ ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs’ ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.

1 Introduction

Empowered by the emerging abilities of Large Language Models (LLMs), numerous agents have surfaced to harness these generalist models for utilizing external tools, such as web browsing, document retrieval, and code interpretation, in tackling complex real-world problems. Notable examples include ToolLLM [23], LangChain [5], AutoGPT [25], and QwenAgent [2]. Specifically, agents equipped with a code interpreter leverage the advanced programming skills of LLMs through a natural language interface, thereby facilitating the creation of workflows that are both effortless and efficient. However, assessing the agents’ capability to generate executable code and effectively use code interpreters remains a challenge, hindering a clear understanding of their limitations.

In this study, we focus on assessing the proficiency of LLMs in leveraging code interpreters to address data science problems across several distinct domains, such as data analysis, visualization, and machine learning. These tasks require LLMs to exhibit advanced capabilities in instruction following, reasoning, and programming. Existing benchmarks, including GSM8K [8], MathBench [20], and HumanEval [6], assess only the models’ abilities to solve mathematical or coding problems. These benchmarks, while valuable for measuring specific competencies, fail to fully represent the LLMs’ aptitude for building complex workflows involving code interpreters in practical applications. More recent efforts, such as MINT [27], QwenAgent [2], CodeGen [21], and DS-1000 [16], aim to bridge this gap by constructing novel benchmarks tailored to mathematical problems and data analysis tasks. Despite considerable advancements, they either focus on single-turn question assessments or have a limited scope in data science. The insights they provide therefore do not adequately reflect an LLM’s ability to invoke a code interpreter to solve real-world data science problems, which typically require consecutive sessions.

To address these shortcomings, as depicted in Fig.1, we introduce a novel evaluation framework that encompasses a benchmark with consecutive and diverse tasks, along with comprehensive assessment protocols. This framework aims to provide a thorough evaluation of LLMs’ ability to use code interpreters. Notably, the benchmark employs a distinctive LLM-human cooperative approach, as depicted in Fig. 2. It simulates authentic workflow scenarios using interactive IPython sessions with sequential, interconnected questions focused on popular Python modules such as Matplotlib, Pandas, and PyTorch. Additionally, we utilize two evaluation modes and a series of fine-grained metrics for various outputs. The statistics of our benchmark are shown in Tab.1.

Figure 1: Features of our benchmark. Our benchmark consists of interactive sessions, diverse tasks covering various Python modules, and comprehensive evaluations (The tool-call rate is not displayed).
Table 1: Comparison of different datasets. "Multi-turn" denotes that LLMs can attempt to solve tasks through multiple trials. "Code Interpreter" indicates whether the LLM needs to call the code interpreter to solve the problem. "Consecutive" indicates that each task consists of a series of sequential questions rather than a single question.
| Dataset | Topic | Size | Multi-turn | Code Interpreter | Consecutive |
| --- | --- | --- | --- | --- | --- |
| DS-1000 | Data Science | 1000 questions | No | No | No |
| MINT | Math, Reasoning | 586 questions | Yes | Yes | No |
| CodeGen | Math, Data Science, etc. | 115 tasks | No | No | Yes |
| QwenAgent | Math, Data visualization | 295 questions | No | Yes | No |
| CIBench | Data Science | 234 tasks, 1900+ questions | Yes | Yes | Yes |

Specifically, we build the evaluation dataset by first identifying ten highly prevalent Python libraries within the domain of data science. Then, we prompt an advanced LLM, such as GPT-4, to generate instructions and code snippets in Jupyter Notebook format. Each notebook is structured to contain a sequence of 10 to 15 progressive steps with increasing levels of complexity. To mitigate any inherent biases or limitations in the LLM-generated content, we employ human experts to generate template tasks based on the common patterns observed in the LLM-generated tasks and online resources. These tasks can incorporate multiple interchangeable datasets for evaluation. These design choices ensure that the benchmark encapsulates both diversity and quality, thereby offering a comprehensive and balanced assessment of code interpreter capabilities.

To thoroughly assess the LLMs’ performance on our benchmark, we have instituted two distinct evaluation modes: the end-to-end mode and the oracle mode. In the end-to-end mode, LLMs are tasked with a holistic problem-solving process where they must reason through given instructions and generate corresponding code. This requires them to iteratively refine their output based on feedback from the code interpreter, as they attempt to solve multiple consecutive questions that build upon one another. Additionally, the oracle mode simulates guided learning by providing the LLM with the correct code snippet when it fails. This immediate feedback mimics human guidance and equips the model to use this accurate example for tackling subsequent tasks in the same context. Furthermore, we introduce two types of metrics: the process-oriented (i.e., tool-call rate, executable rate) and output-oriented (i.e., numeric accuracy, text score, visualization score), to provide a comprehensive analysis of the model’s performance.

Based on our evaluation framework, we conduct extensive experiments and analysis using 24 LLMs. The results indicate that open-sourced LLMs struggle to utilize PyTorch- and TensorFlow-like modules (Fig.10), and the best open-sourced LLM lags behind GPT-4 by 10.0% (Tab.3). In summary, our contributions are three-fold:

  • We build a new benchmark for agents with code interpreters using an LLM-human cooperative method. It consists of interactive IPython sessions with interconnected questions on key data science libraries, simulating interactive problem-solving scenarios in practical workflows.

  • We devise unique assessment strategies involving both end-to-end and oracle modes. We also introduce several evaluation metrics to assess various outputs, offering a comprehensive gauge of LLMs’ coding prowess within the benchmark.

  • We conduct thorough experiments with 24 LLMs to analyze their performance on our benchmark. The results indicate that open-sourced LLMs are inferior to GPT-4 by a large margin and perform poorly in the modeling category modules.

2 Related Works

CIBench is an evaluation framework that assesses LLMs’ [26, 2, 10, 7, 4] ability to utilize external code interpreters for solving data science tasks. Therefore, we focus on presenting work related to invoking code interpreters and benchmarks related to data science.

2.1 Model with Plugins

LLM-based agents use external tools via APIs to solve complex tasks and have been regarded as a promising direction [5, 23, 25, 24, 13, 30]. Specifically, [18, 13, 22] develop efficient workflows to coordinate multi-agent systems for automatic programming. [24, 5, 23, 11] equip LLMs with external tools, such as search engines, calculators, and code interpreters, to augment LLMs’ problem-solving ability. Among these tools, the code interpreter can promote LLMs’ reasoning and coding ability and has gradually gained attention in works like [32, 33, 34].

In contrast, we aim to develop an evaluation framework to benchmark LLMs’ ability with code interpreters for solving data science tasks, providing insights for future work to promote LLMs for better realistic utilization.

2.2 Related Benchmarks

Various benchmarks [8, 6, 23, 2] have been proposed to measure LLMs’ reasoning, coding, and tool utilization ability. Classic benchmarks, such as GSM8K [8], MATH [12], HumanEval [6], and MBPP [1], focus on evaluating the mathematical reasoning or code capability of LLMs. ToolBench [23] and MS-Agent Bench [17] aim to evaluate LLMs’ capability in effectively utilizing various tools and generating accurate and contextually appropriate responses. MINT [27] is a benchmark that evaluates LLMs’ ability to solve challenging tasks with multi-turn interactions by using tools and leveraging natural language feedback. However, the above benchmarks cannot measure LLMs’ ability in data science tasks, which require instruction following, coding, and tool utilization abilities. To address this gap, CodeGen [21] introduces a multi-turn programming benchmark consisting of 115 diverse problems, covering topics such as mathematics, algorithms, and data science. QwenAgent [2] introduces a benchmark for data science, focusing mainly on mathematical problems and data visualization. DS-1000 [16] is a code generation benchmark with a thousand data science problems. Meanwhile, DABench [14] evaluates various concepts with individual questions assigned to each dataset. However, this approach does not effectively mimic practical scenarios where interconnected questions are raised.

In contrast to the above works, CIBench simulates a real-world data science workflow by leveraging sequential interactive IPython sessions and covers most concepts in data science by including commonly used Python modules. Furthermore, we devise two evaluation modes and five metrics to holistically evaluate LLMs’ abilities.

3 CIBench

To benchmark LLMs’ ability to leverage code interpreters for addressing data science problems, we propose a novel evaluation framework (Fig.2), which comprises a diverse evaluation dataset and two newly devised evaluation modes. The evaluation dataset is generated through an LLM-human cooperative approach and simulates authentic workflow scenarios for solving sequential and interconnected tasks. Given the evaluation dataset, we adhere to the ReAct protocol [31] to generate reasoning traces and invoke code interpreters alternately. We also allow LLMs to attempt each task multiple times, which lets us explore their self-debugging capabilities based on feedback from the code interpreter. Finally, we propose two evaluation modes, the end-to-end mode and the oracle mode, to comprehensively measure LLMs’ ability with and without human interaction.

In the following sections, we will detail the construction of the dataset in Sec.3.1 and the evaluation modes and metrics in Sec.3.2.

Figure 2: Overview of CIBench. CIBench first selects Python modules to generate candidate topics and then generates tasks based on these modules and the selected topic. Additionally, humans are engaged to generate new tasks to ensure diversity and filter out incorrect questions to enhance quality.
Table 2: Selected Python modules and their categories.
| Category | Python Modules |
| --- | --- |
| Data Cleaning and Manipulation | Pandas |
| Data Visualization | Matplotlib, Seaborn |
| Modeling | Scikit-learn, PyTorch, TensorFlow, LightGBM |
| Natural Language Processing | NLTK |
| Image Processing | OpenCV-Python |
| Mathematics and Statistics | SciPy |
Task Generation Prompt:
Please create jupyter notebook experiment based on Python module {}. Please follow these rules:
1. The experiment should be conducted in a jupyter notebook manner, but use the markdown format.
2. The experiment should only use Python code.
3. The experiment has around 10-15 continuous steps, from the easiest to the hardest.
4. The step description should be concise.
5. The step description should be precise and contain exact parameter names and values to instruct.
6. Each step requires Python code to solve and the executed result should be the numeric answer, structured output, or visualized result.
7. Please use ‘matplotlib’ to visualize if necessary.
8. DO NOT have any steps to save or write any output files.
9. Please provide an input data file with an external link.
The experiment topic is {}. You should generate the experiment file without any other statements.
Figure 3: An example prompt of task generation.

3.1 Evaluation Dataset

Python Modules Selection

We carefully choose modules that pertain to the field of data science. They cover a wide array of topics such as data cleansing, visualization, image analysis, mathematical computations, and statistical methods. The modules we have selected are detailed in Tab.2.

Topic Candidates Generation

After selecting modules, we prompt GPT-4 (see Appendix B) to summarize 50 topics for each module. These topics are intended to cover the vast majority of the module’s functionalities and thereby offer precise guidance for the subsequent generation of more targeted questions.

Tasks Generation and Refinement

We sample a module and topic, then prompt GPT-4 to generate questions and code based on the prompt in Fig.3. The prompt is designed to enable GPT-4 to generate a Jupyter notebook with sequential steps and various outputs, including numeric answers, structured output, and visualizations, mirroring real-world scenarios. Despite our request for concise descriptions, the generated content may lack conciseness and continuity. To address this, we undertake iterative refinement of these tasks. This process involves presenting both good and bad cases, along with additional modifications, to enhance the quality of questions and reduce ambiguity. Details of prompts used for refinement are in Appendix B.
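The sampling step described above can be illustrated with a minimal sketch. It assumes a dictionary of candidate topics per module; the topic strings, the function name `build_task_prompt`, and the abbreviated template are hypothetical placeholders rather than the authors' actual pipeline, and the call to GPT-4 itself is omitted.

```python
import random

# Abbreviated stand-in for the Fig.3 prompt; only the two {} slots
# (module name, then topic) are taken from the paper.
PROMPT_TEMPLATE = (
    "Please create jupyter notebook experiment based on Python module {}. "
    "[rules 1-9 omitted in this sketch] "
    "The experiment topic is {}."
)

def build_task_prompt(topics_by_module, rng):
    """Sample one (module, topic) pair and fill the generation prompt."""
    module = rng.choice(sorted(topics_by_module))
    topic = rng.choice(topics_by_module[module])
    return module, topic, PROMPT_TEMPLATE.format(module, topic)

# Hypothetical topic candidates; the real ones are summarized by GPT-4.
topics = {
    "Pandas": ["groupby aggregation", "handling missing values"],
    "SciPy": ["hypothesis testing"],
}
module, topic, prompt = build_task_prompt(topics, random.Random(0))
print(prompt)
```

Passing an explicit `random.Random` instance keeps the sampled pair reproducible, which is convenient when generated tasks later need to be traced back to their module and topic.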

Human Refinement

Despite the excellent capabilities of LLMs, we found that the generated tasks often focus on typical datasets, such as Titanic and Iris, which limits task diversity. Additionally, ensuring perfect accuracy in question descriptions and result correctness is challenging.

To enhance diversity, we summarize template tasks based on the generated tasks and existing high-quality tutorials available in various Python libraries (Fig.11). With minor modifications, these template tasks can be adapted to a wide range of datasets. To further diversify the benchmark, we collect new datasets through two approaches: 1) We prompt GPT-4 to generate datasets tailored to different templates, leveraging GPT-4’s ability to flexibly specify characteristics and data attributes. 2) We incorporate the latest datasets from the past year, ensuring authenticity and diversity, and significantly reducing the likelihood that the model has previously encountered this data.

To improve quality, we employ several experts who use a code interpreter for manual double-checking. This approach ensures that questions are written by real users and that various factors, such as runtime, are carefully controlled. For a comprehensive overview of the rules governing quality control, please refer to Appendix B. The statistics of the dataset are provided in Appendix A.2.

Figure 4: Evaluation modes: In end-to-end mode, the LLM addresses the user’s question (bottom) within the context of its response, while in oracle mode, it answers the user’s question (bottom) within the context of ground truth.

3.2 Evaluation Modes and Metrics

Evaluation Modes

As shown in Fig.4, CIBench includes the end-to-end and oracle modes. Together, they assess both the model’s proficiency in autonomously resolving consecutive problems and its capacity when working in conjunction with human interaction. In end-to-end mode, the model must solve the problem by itself. Each subsequent response is contingent upon the model’s previous outcomes, necessitating self-correction based on code interpreter feedback. Moreover, to reflect real-world scenarios where human assistance is available, we introduce oracle mode, which incorporates ground truth code and thought as context to help the model address subsequent tasks. Conceptually, oracle mode emulates a few-shot testing or in-context learning scenario, providing the model with comprehensive context to facilitate a more precise solution.
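The difference between the two modes comes down to what is placed in the context for the next question. The following is a minimal sketch under simplifying assumptions: `Step`, `run_session`, and the `generate` callable are hypothetical names, thoughts and interpreter outputs are left out, and oracle mode here always substitutes the ground truth code into the history.

```python
from dataclasses import dataclass

@dataclass
class Step:
    question: str
    gt_code: str  # ground-truth code for this step

def run_session(steps, generate, mode="end_to_end"):
    """Run consecutive questions; generate(history, question) returns code text."""
    history = []  # (question, code) pairs visible to the model
    outputs = []
    for step in steps:
        predicted = generate(history, step.question)
        outputs.append(predicted)
        # End-to-end: later questions see the model's own answer.
        # Oracle: later questions see the ground-truth answer instead.
        context_code = step.gt_code if mode == "oracle" else predicted
        history.append((step.question, context_code))
    return outputs, history

# Hypothetical two-step task and a stand-in model that always answers badly.
steps = [Step("Load the data.", "df = load()"), Step("Count rows.", "len(df)")]
always_wrong = lambda history, question: "raise ValueError"
e2e_out, e2e_hist = run_session(steps, always_wrong, mode="end_to_end")
_, oracle_hist = run_session(steps, always_wrong, mode="oracle")
```

In this toy run the end-to-end history carries the faulty answer forward, while the oracle history carries the ground truth, so a mistake at one step does not contaminate the context of the next.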

Evaluation Metrics

In both evaluation modes, we introduce two types of evaluation metrics, process-oriented and output-oriented, to provide a comprehensive analysis of the model’s performance. Process-oriented metrics focus on the correct invocation of tools and the successful compilation and execution of code. These metrics include the Tool Call Rate, which measures the proportion of instances where the model correctly follows the instructions to invoke a code interpreter, and the Executable Rate, which indicates the percentage of code that is executed without any errors.

On the other hand, output-oriented metrics focus on the outcomes of the model. These metrics include Numeric Accuracy, which assesses the accuracy of the numerical results; Text Score, which measures the quality of the structural text output using the Rouge metric [19]; and Visualization Score, which evaluates the quality of visual output. Instead of using GPT-4V as QwenAgent does, which is expensive and ineffective, we propose using the structural similarity [28] between the predicted and ground truth images as the visualization score. Together, these metrics provide a holistic evaluation of the LLM’s capabilities.
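To make the metric definitions concrete, here is a minimal pure-Python sketch. Several details are assumptions rather than the paper's implementation: the Rouge variant is taken to be ROUGE-L F1 over whitespace tokens, the structural similarity is computed over a single global window on flat lists of grayscale pixel values, and all function names are hypothetical.

```python
def rate(flags):
    """Percentage of True flags (usable for tool-call and executable rates)."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 over whitespace tokens (assumed variant)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM for two equal-length lists of pixel intensities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

# Toy inputs for illustration only, not benchmark data.
exe_rate = rate([True, False, True, True])
text_score = rouge_l_f1("a b c", "a c")
same_image = global_ssim([0, 64, 128, 255], [0, 64, 128, 255])
```

A practical implementation would normally compute structural similarity over local windows of 2-D images with an image-processing library; the global form above only illustrates the formula.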

4 Experiments

4.1 Experiments Setup

To provide a thorough analysis, we evaluate 24 chat models, including popular open-sourced LLMs and the private GPT-4, using the CIBench benchmark. During inference, we allow LLMs to attempt each question up to 3 times. The specific versions of Python modules utilized in the code interpreter are provided in Appendix A.1. All experiments are conducted within the OpenCompass [9] evaluation platform.

Table 3: Main results of CIBench. Tool, Exe, Num, Text, and Vis denote the tool call rate, executable rate, numeric accuracy, text score, and visualization score, respectively. The E2E and Oracle prefixes denote the end-to-end mode and the oracle mode. Bold denotes the best score among the same model scale. Average is the mean of Num, Text, and Vis in the two modes.
| Model | E2E Tool | E2E Exe | E2E Num | E2E Text | E2E Vis | Oracle Tool | Oracle Exe | Oracle Num | Oracle Text | Oracle Vis | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B-Chat | 80.6 | 28.8 | 10.1 | 10.4 | 17.4 | 92.7 | 41.5 | 22.4 | 31.2 | 24.8 | 19.4 |
| Yi-6B-Chat | 90.6 | 50.2 | 22.5 | 21.7 | 31.1 | 99.7 | 61.5 | 29.5 | 40.8 | 40.2 | 31.0 |
| ChatGLM3-6B-32K | 77.9 | 48.8 | 18.9 | 33.0 | 29.3 | 98.6 | 55.7 | 28.2 | 52.0 | 34.2 | 32.6 |
| Vicuna-7B-v1.5-16k | 89.4 | 54.5 | 19.9 | 27.1 | 32.7 | 99.9 | 62.9 | 29.3 | 43.8 | 43.1 | 32.7 |
| DeepSeek-7B-Chat | 62.9 | 45.3 | 16.0 | 17.4 | 30.8 | 93.8 | 67.1 | 33.0 | 56.7 | 48.5 | 33.7 |
| Qwen-7B-Chat | 99.8 | 72.2 | 42.3 | 55.7 | 43.9 | 97.9 | 71.9 | 36.6 | 58.1 | 52.1 | 48.1 |
| Mistral-7b-Instruct | 97.5 | 67.2 | 41.9 | 53.6 | 41.7 | 99.2 | 75.7 | 48.0 | 55.4 | 49.1 | 48.3 |
| Qwen1.5-7B-Chat | 99.1 | 69.9 | 35.4 | 59.1 | 39.1 | 99.8 | 75.9 | 48.6 | 61.7 | 52.1 | 49.3 |
| InternLM2-7B-Chat | 99.7 | 77.7 | 52.6 | 64.6 | 51.8 | 99.3 | 70.4 | 48.1 | 44.3 | 42.9 | 50.7 |
| DeepSeek-Coder-6.7B | 75.9 | 66.0 | 40.7 | 48.5 | 47.7 | 96.8 | 83.0 | 59.2 | 77.2 | 62.5 | 56.0 |
| Llama-3-8B-Instruct | 98.5 | 78.6 | 55.6 | 70.8 | 54.6 | 95.5 | 84.5 | 60.4 | 83.7 | 61.4 | 64.4 |
| Llama-2-13B-Chat | 90.7 | 42.7 | 12.1 | 14.8 | 24.0 | 99.0 | 52.4 | 28.9 | 31.0 | 29.3 | 23.4 |
| Vicuna-13B-v1.5-16k | 91.0 | 62.2 | 25.9 | 45.4 | 35.4 | 99.9 | 72.1 | 39.7 | 45.2 | 45.9 | 39.6 |
| Qwen-14B-Chat | 93.1 | 70.9 | 48.2 | 60.7 | 48.7 | 99.2 | 82.4 | 55.5 | 74.6 | 57.3 | 57.5 |
| Qwen1.5-14B-Chat | 99.7 | 84.8 | 56.9 | 63.3 | 51.4 | 99.9 | 92.7 | 61.7 | 62.6 | 63.1 | 59.8 |
| Mixtral-8x7B-Instruct | 99.1 | 83.0 | 45.4 | 55.9 | 53.0 | 99.9 | 88.6 | 67.5 | 73.5 | 67.0 | 60.4 |
| InternLM2-20B-Chat | 96.6 | 82.2 | 47.5 | 67.2 | 47.1 | 96.7 | 81.0 | 60.4 | 85.0 | 56.7 | 60.6 |
| Yi-34B-Chat | 92.9 | 66.2 | 42.2 | 40.3 | 42.2 | 99.8 | 77.2 | 68.2 | 72.2 | 52.7 | 53.0 |
| Llama-2-70B-Chat | 92.4 | 53.8 | 23.7 | 22.7 | 24.9 | 94.2 | 61.6 | 36.7 | 36.0 | 35.6 | 29.9 |
| DeepSeek-67B-Chat | 76.9 | 70.7 | 47.5 | 63.3 | 54.8 | 91.1 | 82.8 | 60.6 | 74.7 | 63.8 | 60.8 |
| Qwen-72B-Chat | 98.3 | 87.2 | 60.5 | 73.2 | 55.9 | 99.7 | 90.1 | 62.8 | 73.6 | 64.9 | 65.1 |
| Llama-3-70B-Instruct | 98.0 | 91.1 | 64.9 | 55.7 | 63.6 | 96.6 | 82.8 | 72.9 | 74.2 | 62.0 | 65.5 |
| gpt-4o | 100.0 | 97.2 | 74.1 | 79.0 | 61.1 | 99.2 | 94.8 | 81.7 | 79.7 | 71.6 | 74.5 |
| GPT-4-1106-preview | 99.8 | 98.6 | 77.8 | 78.9 | 64.0 | 99.6 | 97.7 | 77.5 | 81.6 | 73.1 | 75.5 |

4.2 Main Results

We categorize the models into distinct groups based on their scales to enable more effective comparisons. As illustrated in Table 3, Llama-3-8B-Instruct achieves the highest performance in the 7B group, while InternLM2-20B-Chat demonstrates the top performance in the 13B-20B group. Within the 70B group, Llama-3-70B-Instruct secures the leading position with an overall score that is 10.0% lower than that of GPT-4-1106-preview. The API models GPT-4-1106-preview and gpt-4o outperform all other models, especially in end-to-end mode, highlighting the significant room for improvement in current open-source models. Furthermore, larger models tend to exhibit superior performance across various metrics, in line with established trends [3, 15, 29]. Moreover, models within the same series (such as Mistral, InternLM, Qwen, Llama3, etc.) consistently maintain relatively stable rankings within their respective parameter groups, underscoring the stability and efficacy of our approach.
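As a worked check of how the Average column relates to the other columns, the sketch below recomputes it for two rows of Table 3 as the mean of Num, Text, and Vis over both modes. The scores are copied from the table; the variable names are hypothetical. The resulting gap is an absolute difference in average score of about 10 points.

```python
from statistics import mean

# Num, Text, Vis in end-to-end mode, then Num, Text, Vis in oracle mode (Table 3).
scores = {
    "Llama-3-70B-Instruct": [64.9, 55.7, 63.6, 72.9, 74.2, 62.0],
    "GPT-4-1106-preview": [77.8, 78.9, 64.0, 77.5, 81.6, 73.1],
}

averages = {name: mean(values) for name, values in scores.items()}
gap = averages["GPT-4-1106-preview"] - averages["Llama-3-70B-Instruct"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"gap: {gap:.2f}")
```

The recomputed means agree with the reported 65.5 and 75.5 up to rounding in the last digit.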

When comparing the end-to-end mode and oracle mode, it becomes evident that the oracle mode surpasses the end-to-end mode across all metrics for most models. This observation suggests that LLMs can achieve better results with human interaction, hinting at a promising avenue for integrating LLMs to assist humans in data science. The experiment demos are shown in Appendix C.

4.3 Error Mode Analysis

Figure 5: Correlation of CIBench with other benchmarks. The small p-value (top-left) and high Pearson correlation coefficients (title) indicate a strong correlation between CIBench and IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. These benchmarks evaluate the instruction-following, reasoning, and coding abilities of LLMs, respectively.

In the evaluation of CIBench, we identify four prevalent types of errors in the code generated by the model. These errors are categorized as follows: 1) Instruction Following Errors: These encompass instances where the model deviates from or disregards provided instructions, reflecting a lack of adherence to specified guidelines; 2) Hallucination Errors: This category covers cases where the model generates code that contains hallucinated elements, such as utilizing undefined parameters or referencing irrelevant variables; 3) Reasoning Errors: These errors occur when the model encounters complex problems, often resulting in logical errors in the generated code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in code generation; 4) Code Errors: Basic errors in code generation fall under this category. While these errors may sometimes appear trivial, they signify potential deficiencies in the model’s code-generation process. These identified errors underscore the current limitations of LLMs in terms of their coding capabilities, providing valuable insights for the ongoing development of CIBench. Since automatically classifying errors is challenging, we manually categorized the errors made by GPT-4-1106-preview into these four types. The proportions of the four error types are 31.9%, 4.3%, 40.4%, and 23.4%, respectively. Detailed examples of these errors are presented in Appendix E.

Furthermore, to provide insight into enhancing performance on CIBench, we analyze its correlation with existing benchmarks such as IFEval, BBH, GSM8K, MATH, HumanEval, and MBPP. These benchmarks evaluate the instruction-following, reasoning, and coding abilities of LLMs. As shown in Fig.5, CIBench performance exhibits a strong correlation with these benchmarks, with Pearson correlation coefficients exceeding 0.7, particularly for HumanEval and GSM8K. This result indicates that improvements in reasoning and coding can boost LLM performance on CIBench.
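For reference, the Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The function name is hypothetical and the input lists are placeholder values chosen only to exercise the formula; they are not scores from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Placeholder per-model scores on two benchmarks (illustrative only).
bench_a = [1.0, 2.0, 3.0, 4.0]
bench_b = [2.0, 4.0, 6.0, 8.0]
perfect = pearson(bench_a, bench_b)
inverse = pearson(bench_a, list(reversed(bench_b)))
```

In the setting described above, each list would hold one score per evaluated model, so a coefficient above 0.7 means that models ranking high on one benchmark tend to rank high on the other.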

4.4 More Analysis

Debug Ability Analysis

In the ReAct protocol, we allow LLMs to try to solve tasks multiple times. During each trial, the model can use feedback from the code interpreter to rectify any bugs in the generated code. To assess the LLMs’ ability to autonomously correct bugs, we vary the number of trials. As shown in Fig.6, increasing the number of trials correlates with improvements across all metrics. Significantly, for most LLMs, there is a notable enhancement when the number of trials reaches two, particularly evident in metrics such as executable rate, numeric accuracy, and visualization score. This suggests that the LLM can autonomously rectify bugs to a certain extent. In our experiments, to balance evaluation time and performance, we set the number of trials to three.
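The trial loop can be sketched as follows, assuming a hypothetical `generate(question, feedback)` callable that stands in for the LLM. Python's built-in `exec` is used here as a simplified stand-in for the IPython code interpreter, and the error traceback is passed back as feedback for the next trial.

```python
import traceback

def solve_with_retries(generate, question, max_trials=3):
    """Request code until it executes without error or the trials run out."""
    feedback = None
    for trial in range(1, max_trials + 1):
        code = generate(question, feedback)
        namespace = {}
        try:
            exec(code, namespace)  # simplified stand-in for the code interpreter
            return {"executable": True, "trials": trial, "namespace": namespace}
        except Exception:
            # The error message becomes the feedback for the next attempt.
            feedback = traceback.format_exc(limit=1)
    return {"executable": False, "trials": max_trials, "namespace": {}}

def toy_model(question, feedback):
    # Hypothetical model: the first attempt raises NameError, the retry fixes it.
    return "result = undefined_name + 1" if feedback is None else "result = 41 + 1"

outcome = solve_with_retries(toy_model, "Compute 41 + 1.")
print(outcome["executable"], outcome["trials"], outcome["namespace"]["result"])
```

Setting `max_trials=1` disables self-debugging entirely, which corresponds to the single-trial setting compared against in the analysis.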

Visualization Metric Analysis

To validate the effectiveness of our proposed visualization metric, we follow QwenAgent [2] and use GPT-4V to assess visualization scores on a subset of CIBench tasks. The prompt is provided in Appendix D. As shown in Fig.7, although structural similarities are derived from low-level features, they correlate strongly with GPT-4V scores, demonstrating remarkable consistency between the two metrics. Therefore, we can use structural similarity as a simplified visualization metric that substitutes for GPT-4V in effective analysis.

Cross Language Analysis

To benchmark the LLMs’ ability in Chinese, we created a Chinese version of CIBench by translating the human-created tasks into Chinese. This allows us to evaluate code interpreter performance in Chinese. As shown in Fig.8, we observe that: 1) most models exhibit a slight decrease on Chinese CIBench compared to the English version; 2) the strong DeepSeek-67B-Chat and Qwen-72B-Chat drop considerably on Chinese CIBench compared to the English version. Further research and development efforts are necessary to address these discrepancies and improve the performance of LLMs in multilingual scenarios.

Figure 6: Debug ability analysis. Max trial denotes the number of trials in the ReAct protocol.
Figure 7: Structural Similarities vs. GPT-4V.
Figure 8: Chinese CIBench vs. English CIBench.

Difficulty Analysis

CIBench comprises interactive tasks, and we assume that varying numbers of interactive steps denote distinct levels of difficulty. To demonstrate model performance across these difficulty levels, we decompose each task into three categories: easy (up to 2 steps), medium (2-4 steps), and difficult (more than 4 steps), and then evaluate performance accordingly. As Fig.9 shows, the performance of most models decreases as the number of steps increases.

Different Category Modules Analysis

We assess the capabilities of different LLMs with various category modules (refer to Tab. 2). As shown in Fig. 10, LLMs demonstrate proficiency in tackling mathematical and statistical tasks utilizing SciPy modules. However, open-sourced LLMs encounter challenges when handling modeling tasks that demand advanced coding and reasoning skills. We hope that future open-sourced LLMs will excel in modeling tasks.

Limitation

Our work has two main limitations: 1) CIBench is currently limited to Python, although it could be extended to other programming languages using a similar methodology; 2) the evaluation metrics of CIBench have limitations in measuring certain data science tasks, such as "training a model with PyTorch" and tasks involving randomness.

| Model | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Yi-6B-chat | 37.5 | 33.1 | 23.1 |
| Llama-2-7B-chat | 25.7 | 19.2 | 15.7 |
| Qwen-7B-chat | 51.0 | 45.5 | 36.5 |
| Vicuna-7B-v1.5-16k | 39.0 | 35.9 | 23.8 |
| DeepSeek-7B-chat | 38.8 | 41.8 | 22.8 |
| Mistral-7B-instruct-v0.2 | 52.3 | 52.4 | 33.0 |
| Chatglm3-6B-32k | 43.2 | 33.3 | 19.0 |
| Mistral-8x7B-instruct-v0.1 | 62.1 | 61.2 | 47.6 |
| Internlm2-chat-7B | 55.5 | 52.0 | 33.9 |
| Qwen-14B-chat | 59.6 | 58.8 | 42.0 |
| Vicuna-13B-v1.5-16k | 46.1 | 41.6 | 27.7 |
| Llama-2-13B-chat | 34.8 | 23.5 | 18.7 |
| Internlm2-chat-20B | 51.5 | 59.2 | 45.1 |
| Yi-34b-chat | 55.2 | 58.1 | 38.0 |
| Llama-2-70B-chat | 38.1 | 33.9 | 18.6 |
| DeepSeek-67B-chat | 60.4 | 63.9 | 43.7 |
| Qwen-72B-chat | 60.7 | 66.2 | 50.3 |
| Llama-3-70B-chat | 62.3 | 67.1 | 42.0 |
Figure 9: Performance on different difficulty levels. Easy means <2 steps, Medium means 2-4 steps, and Hard means >4 steps.
Figure 10: Performance on different categories.

5 Conclusion

We propose a novel benchmark, named CIBench, to comprehensively assess LLMs’ ability to leverage code interpreters for complex data science tasks. It includes an evaluation dataset covering widely used Python modules in data science and two evaluation modes measuring LLMs’ ability with and without human assistance. The evaluation dataset is constructed using an LLM-human cooperative approach, leveraging interactive IPython sessions to simulate realistic scenarios in data science. Thorough experimental analysis with 24 LLMs on CIBench indicates that LLMs perform poorly in modeling category modules. Based on our experimental analysis, we offer several insights for the future development of LLMs: enhancing the LLM’s ability to correct errors based on feedback, improving its understanding of user intentions across multiple rounds of interactions, and, most importantly, strengthening its reasoning capabilities.

References

  • [1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021.
  • [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
  • [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • [4] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
  • [5] Harrison Chase. LangChain, October 2022.
  • [6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
  • [7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
  • [8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
  • [9] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • [10] DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism, 2024.
  • [11] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023.
  • [12] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021.
  • [13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  • [14] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024.
  • [15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
  • [16] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
  • [17] Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, and Jingren Zhou. Modelscope-agent: Building your customizable agent system with open-source large language models, 2023.
  • [18] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023.
  • [19] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  • [20] Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024.
  • [21] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  • [22] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
  • [23] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023.
  • [24] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
  • [25] Significant Gravitas. AutoGPT.
  • [26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [27] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2023.
  • [28] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [29] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022.
  • [30] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
  • [31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
  • [32] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024.
  • [33] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
  • [34] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.

Checklist

  1. For all authors…

    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See Introduction

    (b) Did you describe the limitations of your work? [Yes] See Sec.4.4

    (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our paper proposes a dataset to measure the programming ability of large language models. There is currently no negative social impact.

    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] See Appendix

  2. If you are including theoretical results…

    (a) Did you state the full set of assumptions of all theoretical results? [N/A]

    (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…

    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] https://github.com/open-compass/CIBench

    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] https://github.com/open-compass/CIBench

    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] The experiments are conducted at a temperature of 0.

    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See 4.1

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    (a) If your work uses existing assets, did you cite the creators? [N/A]

    (b) Did you mention the license of the assets? [N/A]

    (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

  5. If you used crowdsourcing or conducted research with human subjects…

    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] See Appendix

    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Dataset Details

A.1 Module Version Settings

The versions of the Python modules used in the code interpreter are listed in Tab.4.

Table 4: The module version settings in CIBench.
Module Version
Pandas 1.5.3
Matplotlib 3.7.2
Seaborn 0.13.0
Scikit-learn 1.2.1
PyTorch 1.13.1
TensorFlow 2.14.0
LightGBM 4.1.0
NLTK 3.8
OpenCV-Python 4.8.1.78
SciPy 1.11.2
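
To reproduce this environment, it can help to compare a local installation against the pinned versions. The following is a minimal sketch, not part of CIBench itself: the PyPI distribution names used as keys and the helper `find_mismatches` are assumptions introduced here for illustration.

```python
from importlib import metadata

# Pinned versions copied from Tab. 4. The keys are PyPI distribution names,
# which is an assumption: the table lists display names only.
PINNED = {
    "pandas": "1.5.3",
    "matplotlib": "3.7.2",
    "seaborn": "0.13.0",
    "scikit-learn": "1.2.1",
    "torch": "1.13.1",
    "tensorflow": "2.14.0",
    "lightgbm": "4.1.0",
    "nltk": "3.8",
    "opencv-python": "4.8.1.78",
    "scipy": "1.11.2",
}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, found)} for packages that differ.

    `found` is None when the package is not installed. A local build
    suffix such as "+cu117" is ignored when comparing.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        base = found.split("+")[0] if found is not None else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the local environment matches the table for every listed module.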

A.2 Dataset Statistics

CIBench comprises generation tasks, template tasks, and Chinese template tasks, which produce three types of output: numerical, text, and visualization. The statistics of CIBench are shown in Tab.5.

Table 5: Dataset statistics of CIBench. "generation", "template", and "template_cn" represent generation tasks, template tasks, and Chinese template tasks, respectively. Other refers to the questions that only require successful execution without any output comparison.
Subset Numerical Text Visualization Other Total
generation 210 76 466 208 960
template 147 20 161 142 470
template_cn 147 20 161 142 470
total 504 116 788 492 1900
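
As a quick consistency check, the row and column totals in Tab.5 can be recomputed from the per-subset counts. This is a small sketch; the dictionary layout and variable names are illustrative choices, not something CIBench defines.

```python
# Per-subset counts copied from Tab. 5, in the order
# (numerical, text, visualization, other).
COUNTS = {
    "generation": (210, 76, 466, 208),
    "template": (147, 20, 161, 142),
    "template_cn": (147, 20, 161, 142),
}

# Sum across output types for each subset (the "Total" column).
row_totals = {subset: sum(counts) for subset, counts in COUNTS.items()}

# Sum across subsets for each output type (the "total" row).
column_totals = tuple(sum(column) for column in zip(*COUNTS.values()))

grand_total = sum(row_totals.values())

print(row_totals)     # {'generation': 960, 'template': 470, 'template_cn': 470}
print(column_totals)  # (504, 116, 788, 492)
print(grand_total)    # 1900
```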

Appendix B Construction Prompts and Rules

Topic Generation

The prompt used for topic generation is shown in Fig. 13.

Question Refinement

The prompts used for question refinement are shown in Figs. 14 and 15.

Quality Control Rules

We include manual quality checking in the final steps due to limitations in the LLMs’ ability to control runtime or file size. The rules used for quality control are shown in Fig.16.

Appendix C Experiment Example Demo

An example of the experiment is shown in Fig. 11 and Fig. 12.

Appendix D Subjective Visualization Evaluation

The prompt used for subjective visualization scoring is shown in Fig.17.

Appendix E Dataset Error Analysis

In the evaluation of CIBench, we identify four prevalent types of errors in the code generated by the model. These errors are categorized as follows: 1) Instruction Following Errors (Fig.18): These encompass instances where the model deviates from or disregards provided instructions, reflecting a lack of adherence to specified guidelines; 2) Hallucination Errors (Fig.19): This category pertains to the phenomenon of the model generating code that contains hallucinated elements, such as utilizing undefined parameters or referencing irrelevant variables; 3) Reasoning Errors (Fig.20): These errors occur when the model encounters complex problems, often resulting in logical errors in the generated code. Such errors offer valuable insights into the model’s ability to handle intricate tasks in code generation; 4) Code Errors (Fig.21): Basic errors in code generation fall under this category. While these errors may sometimes appear trivial, they signify potential deficiencies in the model’s code-generation process. These identified errors effectively underscore the current limitations of LLMs in terms of their Code Interpreter capabilities, providing valuable insights for the ongoing development of CIBench.

Appendix F Human Annotator

In our paper, the authors also serve as annotators for CIBench. These annotators are tasked with creating template tasks for various data science challenges, ensuring coverage of a wide range of concepts in the field. Importantly, our data collection process for these tasks is unbiased and does not involve any private information.

Appendix G Ethical Consideration

We use GPT-4 and online resources to construct our benchmark. The benchmarks are carefully processed by experts to exclude any private information. Additionally, we utilize ChatGPT for text refinement and to correct any typographical errors during the writing process.

Figure 11: Example of successful template task. The penguins dataset and the specific attribute "flipper_length_mm" can be substituted with alternative datasets and columns. Such a template task enriches the diversity of CIBench.
Figure 12: Example of successful template task. Continuation of Fig. 11.
Topic Generation Prompt: Please create 50 different topics base on Python module {}. These topics can cover the majority of functionality of this module.
Figure 13: An example prompt of topic generation.
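
The `{}` slot in the prompt above is filled with a module name. A minimal sketch of that substitution follows; the helper `build_topic_prompts` and the example module list are hypothetical, and the prompt text is copied verbatim from Fig. 13.

```python
# Prompt text copied verbatim from Fig. 13; "{}" is the module-name slot.
TOPIC_PROMPT = (
    "Please create 50 different topics base on Python module {}. "
    "These topics can cover the majority of functionality of this module."
)


def build_topic_prompts(modules):
    """Return one filled-in topic-generation prompt per module name."""
    return {module: TOPIC_PROMPT.format(module) for module in modules}


# Example module names; any module from Tab. 4 could be used instead.
prompts = build_topic_prompts(["pandas", "matplotlib", "seaborn"])
print(prompts["pandas"])
```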
Question Refinement Prompt: Based on the experiment document and each step, please refine these steps into specific instructions for code writing to solve a real world question. Provide parameter settings if needed, ensuring that the execution results are identical for any reasonable person attempting to write the code. Please refrain from directly mentioning the name of the API, just target on purpose and give parameter settings. Please also summarize the required external dataset if applicable.
Bad case step:
1. "Remove the ’Cabin’ column from the dataframe due to its numerous missing values. Make sure to specify that you are removing a column and that the change should be permanent." Too wordy, "Remove the ’Cabin’ column and make the change permanent." should be fine.
2. "Identify and count how many values are missing in each column of the dataframe. Use a chain of two functions: the first checks if each value is null, and the second adds up these null instances for each column." Too specific, "Count how many values are missing in each column of the dataframe. Print the count." should be fine.
Good case step:
1. "Convert the ’Sex’ column to numerical values. Map ’male’ to 0 and ’female’ to 1." Concise, clear instruction.
Your output format should be as follows starting with import necessary libraries:
[Optional] Dataset Link:
[Optional] Dataset Description:
Step 1. xx
# code blcok
Step 2. xx
# code blcok
Figure 14: An example prompt of question refinement.
Another Question Refinement Prompt: Given the above experiment document. Do the following modification: 1. Remove all the installation steps. 2. Remove all the file saving steps. 3. Elaborate the steps to be more specific with number and inputs that the execution results are identical for any reasonable person attempting to solve this step. 4. Reorder the steps. Response with the new experiment document.
Figure 15: Another example prompt of question refinement.
Quality Control Rules
Check Rules:
  • Ensure that the questions are written from the perspective of a real user.
  • Ensure that the file path is stated at the head of the experiment for every external file it requires.
  • Control the runtime: each step should ideally produce results within 1 minute.
  • Control the file size: the files used for a single experiment should ideally not exceed 50M.
  • Ensure that the output is valid and unambiguous as ground truth.
Figure 16: Rules used for manual dataset quality control.
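
The runtime and file-size rules were checked manually, but they are simple enough to pre-screen automatically. The sketch below is an illustration under stated assumptions: the helpers `file_within_limit` and `timed_step` are hypothetical, and "50M" is read as 50 MiB.

```python
import os
import time

MAX_STEP_SECONDS = 60            # "within 1 minute" in Fig. 16
MAX_FILE_BYTES = 50 * 1024 ** 2  # "50M" read as 50 MiB (an assumption)


def file_within_limit(path, limit=MAX_FILE_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


def timed_step(step, limit=MAX_STEP_SECONDS):
    """Run a zero-argument callable and time it.

    Returns (result, elapsed_seconds, within_limit).
    """
    start = time.perf_counter()
    result = step()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= limit
```

For example, `timed_step(lambda: sum(range(10**6)))` returns the sum together with the measured time and a flag saying whether the step stayed under the limit. Steps that fail either check would still need a human decision on how to shorten or shrink them.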
Subjective Visualization Scoring Prompt
Prompt: You are an assistant skilled in assessing visualization capabilities. In the capacity of a fair judge, you will evaluate the quality of images drawn by an AI model generating code for visualization-related problems. We will provide you with a code visualization problem and an image drawn by the code created by the AI model you need to assess. When you start your assessment, you must adhere to the following process:
1. Rate the visualization capability with a score between 1 and 10 for the image, returning only the number without any additional descriptions.
2. Your scoring needs to be as rigorous as possible, and it should follow the scoring rules below: Overall, the higher the quality of the model’s response, the higher the score.
A score of 1 to 2 is given when the image cannot reflect the given instruction content at all.
A score of 3 to 4 is given when the image can partly reflect the given instruction content, but there is a significant lack of specific detail expression.
If the image basically meets the given instructions, but the aesthetic quality of the image is average without any outstanding features, this kind of rating can get a score of 5 to 6.
When the image matches the given instructions well, and shows some aesthetic appeal, such as some originality in color matching and shape design, this kind of rating can get a score of 7 to 8.
Only when the image completely matches the given instructions, covers all the details in the instructions, and performs excellently in terms of aesthetics, can this kind of rating get a score of 9 to 10.
[Visualization Problem]:{question}
Figure 17: An example prompt for subjective visualization scoring.
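
Because the judge is asked to return only a number between 1 and 10, its reply has to be parsed and validated before it can be used as a score. The following is a minimal sketch of one way to do that; `parse_score` is a hypothetical helper, and only the final `{question}` line of the prompt is reproduced here.

```python
import re

# Final line of the judge prompt in Fig. 17; "{question}" is its only slot.
PROMPT_TAIL = "[Visualization Problem]:{question}"


def parse_score(reply):
    """Return the first standalone integer in 1..10 found in `reply`.

    Returns None when the reply contains no such number, so that
    malformed judge outputs can be filtered out or retried.
    """
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None


print(PROMPT_TAIL.format(question="Draw a histogram of column 'age'."))
print(parse_score("7"))            # 7
print(parse_score("Score: 8/10"))  # 8
print(parse_score("n/a"))          # None
```

Returning `None` rather than a default value keeps unparseable replies distinguishable from genuinely low scores.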
Following error

User: Detect corners using the Shi-Tomas corner detector with max-corners=50, min-distance=0.5 and blocksize=10, mark the corners with circles on the image. The radius and thickness of the circle are 5 and 1. Show the marked image. Please use opencv and numpy modules.

A:
    # Detect corners using Shi-Tomasi corner detector
    corners = cv2.goodFeaturesToTrack(equalized_image, maxCorners=max_corners, qualityLevel=quality_level, minDistance=min_distance, blockSize=block_size)

    # Mark the corners with circles on the image
    marked_image = equalized_image.copy()
    for corner in corners:
        x, y = corner.ravel()
        cv2.circle(marked_image, (int(x), int(y)), 5, (255, 0, 0), -1)  # Draw a blue filled circle at each corner
    # Show the marked image
    plt.imshow(marked_image, cmap='gray')
    plt.axis('off')  # Turn off axis numbers and ticks
    plt.show()

System:
    NameError Traceback (most recent call last)
    Cell In[6], line 2
          1 # Detect corners using Shi-Tomasi corner detector
    ----> 2 corners = cv2.goodFeaturesToTrack(equalized_image, maxCorners=max_corners, qualityLevel=quality_level, minDistance=min_distance, blockSize=block_size)
          4 # Mark the corners with circles on the image
          5 marked_image = equalized_image.copy()
    NameError: name 'max_corners' is not defined

User: Define the CrossEntropyLoss function and the Adam optimizer. Train the model for 3 epochs. Display the loss for the last epoch (Keep to two decimal places).

A:
    import torch.optim as optim
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    epochs = 3
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs.squeeze(), labels.float())
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        final_loss = round(running_loss/len(train_loader), 2)
        print(final_loss)

System:
    933.55
    933.45
    933.09

Figure 18: Examples of instruction following errors.
Hallucination error

User: Build a vocabulary of all the tokens and assign an index to the vocabulary. Display the index of the word "happy". Please use nltk modules.

A:
    # Check the vocabulary
    vocab
    # Display the index of the word "happy"
    word_to_index.get('happy', 'Word not found in vocabulary')

System:
    NameError Traceback (most recent call last)
    Cell In[2], line 2
          1 # Check the vocabulary
    ----> 2 vocab
          3 # Display the index of the word "happy"
          4 word_to_index.get('happy', 'Word not found in vocabulary')
    NameError: name 'vocab' is not defined

User: Define a LightGBM model with max_depth=4, n_estimators=120, learning_rate=0.01 and num_leaves=31. Train the model with Evaluation Metric='logloss'. Please use LightGBM modules.

A:
    import lightgbm as lgb

    # Define the LightGBM model
    model = lgb.LGBMClassifier(max_depth=4, n_estimators=120, learning_rate=0.01, num_leaves=31)

    # Train the model with evaluation metric as 'logloss'
    model.fit(X_train, y_train, eval_set=(X_test, y_test), eval_metric='logloss', early_stopping_rounds=20)

System:
    TypeError Traceback (most recent call last)
    Cell In[3], line 7
          4 model = lgb.LGBMClassifier(max_depth=4, n_estimators=120, learning_rate=0.01, num_leaves=31)
          6 # Train the model with evaluation metric as 'logloss'
    ----> 7 model.fit(X_train, y_train, eval_set=(X_test, y_test), eval_metric='logloss', early_stopping_rounds=20)
    TypeError: LGBMClassifier.fit() got an unexpected keyword argument 'early_stopping_rounds'

Figure 19: Examples of hallucination errors.
Reasoning error

User: Split the dataset into training and testing sets using 0.2 as the test size, then define the train_loader and test_loader. Set batch size as 64. Please use pytorch and sklearn modules.

A:
    from sklearn.model_selection import train_test_split
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Split the dataset into training and testing sets
    X = df['text'].values
    y = df['sentiment'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define the train_loader and test_loader
    def create_data_loader(X, y, batch_size):
        X_tensor = torch.tensor(X)
        y_tensor = torch.tensor(y, dtype=torch.long)
        dataset = TensorDataset(X_tensor, y_tensor)
        data_loader = DataLoader(dataset, batch_size=batch_size)
        return data_loader

    batch_size = 64
    train_loader = create_data_loader(X_train, y_train, batch_size)
    test_loader = create_data_loader(X_test, y_test, batch_size)

System:
    TypeError Traceback (most recent call last)
    Cell In[4], line 19
         16     return data_loader
         18 batch_size = 64
    ---> 19 train_loader = create_data_loader(X_train, y_train, batch_size)
         20 test_loader = create_data_loader(X_test, y_test, batch_size)
    Cell In[4], line 12
         11 def create_data_loader(X, y, batch_size):
    ---> 12     X_tensor = torch.tensor(X)
         13     y_tensor = torch.tensor(y, dtype=torch.long)
         14     dataset = TensorDataset(X_tensor, y_tensor)
    TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

User: Group and aggregate data by weight and calculate the average of each numerical group. Find out how many attributes have a negative correlation with weight. Please use pandas modules.

A:
    # Group and aggregate data by weight and calculate the average of each numerical group
    grouped_data = df.groupby('Weight').mean()
    # Find attributes with negative correlation with weight
    negative_correlation_count = (grouped_data.corr()['Weight'] < 0).sum()
    negative_correlation_count

System:
    KeyError Traceback (most recent call last)
    File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
       3801 try:
    -> 3802     return self._engine.get_loc(casted_key)
       3803 except KeyError as err:
    File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
    File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
    File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    KeyError: 'Weight'

Figure 20: Examples of reasoning errors.
Code error

User: Generate a pair plot for all the numerical columns, using color = "pastel". Please use seaborn modules.

A:
    sns.pairplot(df, kind='scatter', plot_kws={'color':'pastel'})
    plt.show()

System:
    ValueError Traceback (most recent call last)
    File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/axes/_axes.py:4372, in Axes._parse_scatter_color_args(c, edgecolors, kwargs, xsize, get_next_color_func)
       4371 try:
    -> 4372     mcolors.to_rgba_array(kwcolor)
       4373 except ValueError as err:
    File anaconda3/envs/prod-torch1.13/lib/python3.10/site-packages/matplotlib/colors.py:471, in to_rgba_array(c, alpha)
        470 if isinstance(c, str):
    --> 471     raise ValueError(f"{c!r} is not a valid color value.")
        473 if len(c) == 0:
    ValueError: 'pastel' is not a valid color value.

User: Blur the initial image by applying a Gaussian filter with a sigma value of 3. Sigma is the standard deviation for the Gaussian kernel, which controls the amount of blur. Then display the smoothed image.

A:
    smooth = ndimage.gaussian_filter(image, sigma=3)
    plt.imshow(smooth)
    plt.show()

System:
    NameError Traceback (most recent call last)
    Cell In[48], line 2
          1 smooth = ndimage.gaussian_filter(image, sigma=3)
    ----> 2 plt.imshow(smooth)
          3 plt.show()
    NameError: name 'plt' is not defined

Figure 21: Examples of code errors.