Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Jiasheng Zheng1,4, Boxi Cao1,4, Zhengzhao Ma1, Ruotong Pan1,4,
Hongyu Lin1, Yaojie Lu1, Xianpei Han1,2,3, Le Sun1,2,3
1Chinese Information Processing Laboratory   2State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences, Beijing, China
3Key Laboratory of System Software, Chinese Academy of Sciences, Beijing, China
4University of Chinese Academy of Sciences, Beijing, China
{zhengjiasheng2022,boxi2020,mazhengzhao2024,panruotong2021}@iscas.ac.cn
{hongyu,luyaojie,xianpei,sunle}@iscas.ac.cn
Abstract

In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, existing benchmarks primarily focus on assessing the correctness of code generated by LLMs, while neglecting other critical dimensions that also significantly impact code quality. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model’s ability to generate correct code that also meets user demands. We evaluate 18 representative LLMs on RACE and find that: 1) the current LLMs’ ability to generate high-quality code on demand does not yet meet the requirements of software development; 2) readability serves as a critical indicator of the overall quality of generated code; 3) most LLMs exhibit an inherent preference for specific coding styles. These findings can help researchers gain a deeper understanding of the coding capabilities of current LLMs and shed light on future directions for model improvement. (We release our benchmark and source code at https://github.com/jszheng21/RACE and the leaderboard at https://huggingface.co/spaces/jszheng/RACE_leaderboard.)


1 Introduction

The impressive coding capabilities demonstrated by Large Language Models (LLMs) are reshaping the landscape of software development (Zheng et al., 2023c, b; Fan et al., 2023), attracting significant attention from researchers. To accurately measure and compare the coding capabilities of various large models, numerous benchmarks have been proposed to evaluate the code generation (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021), completion (Gong et al., 2024), and execution (Jain et al., 2024a) abilities of LLMs.

Figure 1: Current benchmarks perform single-dimension evaluations and mostly focus only on code correctness (upper); our proposed RACE benchmark performs multi-dimensional code evaluations to identify truly high-quality code beyond correctness (lower).

However, existing benchmarks primarily focus on evaluating the correctness of code generated by LLMs, while neglecting other critical dimensions that also significantly impact code quality. For instance, Börstler et al. (2023) investigate the aspects of code quality and find that code readability serves as the most decisive property for quality code (Dantas et al., 2023; Oliveira et al., 2020); code maintainability is crucial for ensuring the software remains adaptable, efficient, and easy to update or fix over time, ultimately reducing long-term costs and technical debt (Hegedus, 2013); code efficiency is essential for optimizing performance, reducing resource consumption, and ensuring scalability in software applications (Curtis et al., 2022; Börstler et al., 2023). As shown in Figure 1, current benchmarks lack evaluation on these critical dimensions that impact code quality, making it difficult to distinguish truly high-quality code from code that is merely correct. Such deficiency in evaluation could lead to inappropriate assessments of the coding capabilities of different models, directing developers to focus only on the correctness of code generation, thus severely limiting the further development and real-world application of Code LLMs. Therefore, there is an urgent need for a multi-dimensional code evaluation benchmark to bridge the gap between LLM-generated code and real-world scenarios.

To this end, we propose the RACE benchmark, which comprehensively evaluates the code generated by LLMs along multiple dimensions: Readability, mAintainability, Correctness, and Efficiency. However, developing a multi-dimensional benchmark for code generation is not trivial.

The first challenge is to design a quantifiable evaluation framework for each dimension. Unlike correctness, other dimensions are typically difficult to quantify with a single metric. To address this, we refer to the definitions of readability, maintainability, and code efficiency in various quality models (Curtis et al., 2022; Nistala et al., 2019; Sadeghzadeh Hemayati and Rashidi, 2017), and summarize multiple representative factors for each dimension of code quality. By integrating the performance across multiple factors, we can comprehensively assess the quality of LLM-generated code in each dimension.

The second challenge is that dimensions other than correctness are demand-dependent. That is, we cannot use a fixed and uniform standard to measure what constitutes better code; different application scenarios can have varying requirements for code generation. For instance, different projects may require different coding styles, interface standards, and maintainability standards to be adaptable and scalable. Moreover, depending on hardware conditions, it is necessary to strike an appropriate balance between time efficiency and space efficiency to ensure the efficient operation of the code. Therefore, a truly practical model should be capable of generating correct code that meets requirements in multiple dimensions and can be customized according to different user instructions. To achieve this, we design various demands for each factor and incorporate them into the task descriptions, requiring the model to generate code that is both correct and meets the specified requirements. For example, we design multiple instructions requiring the model to generate comments of varying granularity to meet readability requirements in different scenarios, or to generate multiple versions of code, each balancing time efficiency and space efficiency differently.

The third challenge is the calculation of evaluation metrics. For dimensions beyond correctness, we cannot directly use the pass rate of test cases as a performance metric. Therefore, for each factor, we design targeted evaluation metrics based on static analysis and runtime monitoring methods. This approach allows us to accurately and efficiently quantify the extent to which the LLM-generated code meets the corresponding customized requirements.

Based on the RACE benchmark, we conduct a comprehensive evaluation of 18 representative Code LLMs across various scales. The major findings are as follows: 1) Although current Code LLMs achieve a decent level of accuracy, they struggle to generate correct code that also meets specific requirements, falling short of what real software development needs. This indicates that these models need further improvement in their ability to generate high-quality code across multiple dimensions based on user demands. In particular, GPT-4o and DeepSeek-Coder-V2-236B (we use the API version: https://api.deepseek.com) demonstrate exceptional performance in each dimension, significantly surpassing other models in meeting user demands. 2) By analyzing the correlation of each model’s performance across different dimensions, we find that readability is an indicator of the overall quality of the generated code. Additionally, adding appropriate comments to the code can serve as an implicit chain-of-thought, thereby improving the correctness of the code for some large-scale models. 3) Most LLMs exhibit an inherent preference for specific coding styles, making it difficult for them to follow user instructions that are inconsistent with their preference. These findings reveal the limitations of current Code LLMs and shed light on future optimization directions. They help researchers select LLMs for different application scenarios and design targeted methods for model improvement.

The main contributions of this paper can be summarized as follows:

  • We propose a novel multi-dimensional evaluation framework for code generation.

  • Based on the framework, we construct a comprehensive benchmark, characterized by data construction, customized requirement instructions, and specific evaluation metrics.

  • We evaluate 18 Code LLMs on the RACE benchmark and draw conclusions that reveal the limitations of current code models and guide their further development.

2 Related Work

2.1 Code LLMs

The outstanding code generation capabilities exhibited by LLMs have attracted considerable attention from researchers (Wang et al., 2021; Li et al., 2022; Fried et al., 2022; Xu et al., 2022; Roziere et al., 2023; Zheng et al., 2023a). Some representative Code LLMs, such as CodeX (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and AlphaCode (Li et al., 2022), have achieved notable performance in code generation, program repair, and code translation. Currently, research on LLMs for code primarily focuses on data and pretraining methods. For training data collection, WizardCoder (Luo et al., 2024) introduces code instruction-following training data constructed by Evol-Instruct to enhance the capabilities of Code LLMs. For pretraining methods, StarCoder (Li et al., 2023a) and DeepSeek-Coder (Guo et al., 2024) incorporate a fill-in-the-middle training task to enhance the model’s capability to handle various structural arrangements in code. With the rapid advancement of Code LLM capabilities, there is an increasing demand for reliable and comprehensive code evaluation benchmarks.

2.2 Coding benchmark for LLMs

The existing benchmarks for LLM-based code generation (Ni et al., 2023), such as HumanEval (Chen et al., 2021), APPS (Hendrycks et al., 2021), MBPP (Austin et al., 2021), CodeContests (Li et al., 2022), and DS-1000 (Lai et al., 2023), focus on the correctness of generated code in scenarios such as code exercises, data science, and competitions (Yan et al., 2023; Li et al., 2023b; Shinn et al., 2024). These efforts consider only the correctness of the generated code, using the pass rate of test cases as the sole evaluation metric. Meanwhile, there has been a recent trend toward considering other dimensions (Li et al., 2024; Jain et al., 2024b; Tian et al., 2024); for example, Huang et al. (2024) evaluate the efficiency of the generated code, while Dillmann et al. (2024) connect cross-entropy with logical lines of code (LLOC). Nevertheless, these studies neither account for the demand-dependent nature of these dimensions nor systematically evaluate the LLM’s code capabilities across multiple dimensions.

3 RACE Benchmark Construction

| Dimension | Factor | Data Source | # Cases |
|---|---|---|---|
| Correctness | Correctness | HumanEval+, MBPP+, ClassEval, and LeetCode | 923 |
| Readability | Code Length | HumanEval+ | 492 |
| Readability | Name Convention | HumanEval+ | 984 |
| Readability | Comment | HumanEval+ | 328 |
| Maintainability | Maintainability Index | ClassEval | 100 |
| Maintainability | Modularity | LeetCode | 540 |
| Efficiency | Time Complexity, Space Complexity | LeetCode | 101 |

Table 1: The sources and number of evaluation cases for each factor within each dimension in the RACE benchmark.
Figure 2: The overall evaluation pipeline in RACE benchmark.

The philosophy of our framework design comes from the demands for code quality in software engineering (Börstler et al., 2023). First, we summarize multiple representative factors for each dimension based on their respective quality definitions (Curtis et al., 2022; Nistala et al., 2019; Sadeghzadeh Hemayati and Rashidi, 2017). Second, we design several reasonable customized requirements for each factor and integrate them into task descriptions, requiring the model to generate code that is both correct and meets these requirements. Detailed information on the evaluation data is presented in Table 1. Finally, leveraging static analysis and runtime monitoring techniques, we develop evaluation metrics tailored to each factor to achieve accurate and efficient evaluation. The specific design of each instruction is given in Appendix A.

3.1 Correctness

To investigate the impact of incorporating customized instructions on code correctness, we evaluate the accuracy of the LLM-generated code on the original benchmark tasks and also calculate the accuracy when provided with instructions containing customization requirements.

To thoroughly investigate the impact of customized instructions across various tasks, we select the following datasets: HumanEval+ and MBPP+ (Liu et al., 2024) for code exercise problems, ClassEval (Du et al., 2023) for class-level code generation, and LeetCode (Guo et al., 2024) for coding competition problems. To mitigate the bias introduced by additional information in the original dataset on the customized requirements, we remove such information from the datasets.

To measure code correctness, we calculate the macro accuracy at the dataset level, which is the proportion of generated code that passes all test cases for the corresponding problems.
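One possible reading of this metric can be sketched as follows, assuming per-problem pass/fail flags have already been collected. The `results` dictionary and its dataset keys are hypothetical placeholders, and treating "macro" as an unweighted mean over datasets is an assumption rather than a detail confirmed in this section.

```python
from statistics import mean


def dataset_accuracy(passed_flags):
    """Share (in %) of problems whose generated code passed every test case."""
    return 100.0 * sum(passed_flags) / len(passed_flags)


def macro_accuracy(results):
    """Unweighted mean of per-dataset accuracies (assumed meaning of 'macro')."""
    return mean(dataset_accuracy(flags) for flags in results.values())


# Hypothetical outcomes: True means the solution passed all tests of a problem.
results = {
    "dataset_a": [True, True, False, True],
    "dataset_b": [True, False],
}
macro = macro_accuracy(results)
print(f"macro accuracy: {macro:.1f}%")
```

Under this reading, each dataset contributes equally regardless of how many problems it contains.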

3.2 Readability

In real-world development scenarios, code is required to adhere to a consistent style to ensure comprehensibility and reduce the time cost of maintenance, a property known as code readability (Börstler et al., 2023). Specifically, code length is the most straightforward aspect of style; excessively long lines of code can lead to incomplete screen display, severely affecting readability. Meanwhile, good and consistent naming styles help developers quickly understand the functionality of interfaces, and comments assist in rapidly comprehending the implementation logic of the code. Therefore, we summarize code readability into three representative factors: Length, Naming Convention, and Comment. Based on real-world development requirements, we collect corresponding customizable options for each factor.

For the Length factor, the readability requirements for code length vary due to the differences in display scales across different user scenarios. Therefore, we refer to the PEP8 style for Python, and define the following user requirements concerning code length: (60, 20), (70, 30), and (79, 40), with the parentheses corresponding to the maximum line length and the maximum lines of functions, respectively. For the Naming Convention factor, camel-case and snake-case are commonly used naming methods in computer programming, with varying preferences across different projects for naming functions and variables. Consequently, we offer the choice between camel-case and snake-case based on the naming convention used for functions or variables as customization options. For the Comment factor, different levels of granularity serve varying purposes and needs. Line-level comments aid in understanding the implementation details of the code, while function-level comments assist in comprehending the functionality and usage of functions. Additionally, line-level comments are particularly beneficial for novice programmers. Consequently, we have defined two customization options: function-level comments and line-level comments.

Code exercise tasks are derived from snippets of real-world development tasks and encapsulate scenarios encountered in actual development environments. Leveraging these code exercise scenarios, we evaluate the generated code against customized readability requirements on the HumanEval+ dataset to assess its alignment with the specified criteria.

Furthermore, to measure code readability, we employ abstract syntax tree analysis and heuristic methods to assess code length, examine naming conventions, and distinguish between different levels of comment granularity.
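As an illustration of what such checks could look like, the following is a minimal sketch built on Python's standard `ast` and `tokenize` modules. It is not the benchmark's implementation: the helper names, the regular expressions, and the mapping of docstrings to function-level comments and `#` comments to line-level comments are all assumptions made for this example.

```python
import ast
import io
import re
import tokenize

# Heuristic patterns (assumed): a single lowercase word counts as snake-case only.
SNAKE = re.compile(r"^[a-z_][a-z0-9_]*$")
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
FUNC_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def meets_length(code, max_line, max_func_lines):
    """True if no line exceeds max_line and no function spans more lines than allowed."""
    if any(len(line) > max_line for line in code.splitlines()):
        return False
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, FUNC_NODES):
            if node.end_lineno - node.lineno + 1 > max_func_lines:
                return False
    return True


def function_names_follow(code, style):
    """True if every function name matches the requested style ('snake' or 'camel')."""
    pattern = SNAKE if style == "snake" else CAMEL
    names = [n.name for n in ast.walk(ast.parse(code)) if isinstance(n, FUNC_NODES)]
    return all(pattern.match(name) is not None for name in names)


def comment_levels(code):
    """Return (has_function_level, has_line_level) under the assumed mapping."""
    has_doc = any(
        ast.get_docstring(n) is not None
        for n in ast.walk(ast.parse(code))
        if isinstance(n, FUNC_NODES)
    )
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    has_line = any(tok.type == tokenize.COMMENT for tok in tokens)
    return has_doc, has_line


sample = '''def add_numbers(a, b):
    """Return the sum of a and b."""
    total = a + b  # line-level comment
    return total
'''
print(meets_length(sample, 60, 20), function_names_follow(sample, "snake"))
print(comment_levels(sample))
```

The `(60, 20)` arguments mirror one of the length requirements listed above; variable names could be checked in the same way by also collecting `ast.Name` and `ast.arg` nodes.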

3.3 Maintainability

The maintainability of code significantly impacts the long-term health of software and the efficiency of development teams. Many quality models propose empirical quantitative equations for maintainability. Simultaneously, the Single Responsibility Principle (SRP) is a crucial part of code design principles to avoid excessive functional coupling. Therefore, based on these principles, we summarize two factors for code maintainability: Maintainability Metric and Modularity.

For the Maintainability Metric factor, we use the Maintainability Index (MI) (Coleman et al., 1994) to measure how maintainable the code is; this index is widely used in the Microsoft Visual Studio 2010 development environment. It is a polynomial equation over four metrics that yields a value between 0 and 100, with higher values indicating greater maintainability. The formulation is as follows:

$$\mathrm{MI} = \max\bigg[0,\ 100\cdot\frac{171 - 5.2\ln V - 0.23\,G}{171} - \frac{16.2\ln L + 50\sin\left(\sqrt{2.4\,C}\right)}{171}\bigg] \tag{1}$$

where $V$ is the Halstead Volume, which identifies measurable properties of the code, $G$ is the Cyclomatic Complexity, corresponding to the number of decisions a block of code contains plus 1, $L$ is the number of source lines of code, and $C$ is the percentage of comment lines. To comprehensively assess how well LLM-generated code satisfies maintainability requirements, we employ the ClassEval (Du et al., 2023) dataset to ensure the complexity of the code problems.

For the Modularity factor, different application requirements dictate varying levels of modularization. Achieving compactness often necessitates implementing functionality using a single function, whereas maximizing code reusability demands the use of multiple functions. Accordingly, we define the following customization options: implementing functionality using 1, 2, or 3 functions. We measure the modularity of generated code on the LeetCode (Guo et al., 2024) dataset, whose more challenging problems provide better discriminative capability. In addition, we design corresponding rule-based methods to check the degree of modularity in the generated code.
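A rule-based check of this kind could, for instance, count function definitions in the generated code. The sketch below assumes that every `def` (including methods and nested functions) counts toward the total and that the requirement is an exact match; both choices, and the helper names, are illustrative assumptions rather than the benchmark's actual rules.

```python
import ast


def count_functions(code):
    """Count all function definitions, including methods and nested functions."""
    tree = ast.parse(code)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


def meets_modularity(code, required_functions):
    """True if the code defines exactly the requested number of functions."""
    return count_functions(code) == required_functions


two_function_solution = '''
def is_even(n):
    return n % 2 == 0

def count_even(values):
    return sum(1 for v in values if is_even(v))
'''
print(count_functions(two_function_solution))
print(meets_modularity(two_function_solution, 2))
```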

3.4 Efficiency

In most applications, code efficiency is directly linked to user experience or business process efficiency. Generally, efficiency is assessed using time complexity and space complexity. Based on this principle, we define them as factors. Due to varying user-side hardware conditions, achieving a balance between execution time and memory usage, or optimizing one of these aspects to the extreme, is a common practice to ensure code efficiency. Recognizing these scenarios, we gather 101 cases from LeetCode programming problems designed to simulate such conditions. These cases are customized with specific time complexity requirements, space complexity requirements, or both, to evaluate the extent to which the LLM-generated code meets the efficiency requirements.

To measure code efficiency, we propose the Normalized Index (NI), which measures the degree to which the generated code satisfies the complexity requirement. Consider two reference solutions with time and space complexities $C_1^T, C_1^S$ and $C_2^T, C_2^S$, respectively, where $C_1^T$ and $C_2^S$ are the better ones, and let their total running times be $T_1, T_2$ ($T_1 < T_2$) and their memory usages be $S_1, S_2$ ($S_1 > S_2$) over all test cases. For a code $\hat{C}$ to be evaluated, with running time $\hat{T}$ and memory usage $\hat{S}$, under requirements $C_1^T, C_1^S$, the normalized index is:

$$\begin{aligned}
\mathrm{NI}_T &= 100\cdot\mathrm{Clip}\left(1-\frac{\hat{T}-T_1}{T_2-T_1},\,0,\,1\right) \\
\mathrm{NI}_S &= 100\cdot\mathrm{Clip}\left(1-\frac{\hat{S}-S_1}{S_1-S_2},\,0,\,1\right)
\end{aligned} \tag{2}$$

$\mathrm{NI}_T$ indicates the degree of time complexity toward $C_1^T$, and $\mathrm{NI}_S$ indicates the degree of space complexity toward $C_2^S$.
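To make the time index of Eq. (2) concrete, the sketch below evaluates $\mathrm{NI}_T$ for a few hypothetical running times. The function and parameter names are illustrative, and the reference timings are made-up numbers, not measurements from the benchmark.

```python
def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def normalized_time_index(measured, t_fast, t_slow):
    """NI_T: 100 when measured matches the faster reference, 0 at the slower one."""
    return 100.0 * clip(1.0 - (measured - t_fast) / (t_slow - t_fast), 0.0, 1.0)


# Hypothetical reference running times in seconds (T1 < T2).
T1, T2 = 1.0, 3.0
for t_hat in (0.5, 1.5, 4.0):
    print(t_hat, normalized_time_index(t_hat, T1, T2))
```

Clipping keeps the score within 0 to 100 even when the evaluated code is faster than the better reference or slower than the worse one. A memory index could reuse the same helper with the two reference memory usages; which reference plays the role of the target there is left as an assumption.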

4 Experiments

Column groups: Correctness (C); Readability (C, RN, RL, RC); Maintainability (C, MI, C, MC); Efficiency (C, EC). Each C column gives the accuracy on the original tasks used for the factor(s) that follow it.

| Type | Model | C Acc. | R: C Acc. | RN Acc. | RN Acc. IF | RL Acc. | RL Acc. IF | RC Acc. | RC Acc. IF | M: C Acc. (MI) | MI Acc. | MI* | M: C Acc. (MC) | MC Acc. | MC Acc. IF | E: C Acc. | EC Acc. | $\mathrm{NI}_T$* | $\mathrm{NI}_S$* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Instruct | gpt-4o-2024-05-13 | 59.9 | 80.5 | 81.1 | 75.3 | 78.9 | 63.2 | 79.8 | 64.3 | 38.0 | 35.0 | 75.1 | 57.2 | 56.3 | 35.2 | 59.4 | 58.4 | 44.8 | 42.0 |
| Instruct | gpt-3.5-turbo-0125 | 44.7 | 62.8 | 63.2 | 48.3 | 60.4 | 46.1 | 65.8 | 41.5 | 28.0 | 24.0 | 80.2 | 31.1 | 28.1 | 18.5 | 39.6 | 32.7 | 27.5 | 36.5 |
| Instruct | CodeLlama-7b-Instruct | 23.9 | 32.3 | 31.5 | 17.0 | 31.7 | 23.4 | 30.2 | 18.3 | 16.0 | 15.0 | 71.8 | 12.2 | 10.9 | 7.2 | 15.8 | 13.9 | 8.2 | 8.8 |
| Instruct | CodeLlama-13b-Instruct | 24.4 | 36.0 | 37.7 | 22.0 | 35.0 | 23.6 | 35.7 | 23.2 | 17.0 | 19.0 | 82.1 | 10.6 | 13.1 | 7.6 | 17.8 | 17.8 | 10.4 | 16.1 |
| Instruct | CodeLlama-34b-Instruct | 26.0 | 36.0 | 36.5 | 21.1 | 35.8 | 17.5 | 36.3 | 9.4 | 12.0 | 18.0 | 73.2 | 15.6 | 14.2 | 8.5 | 20.8 | 15.8 | 14.4 | 13.8 |
| Instruct | DS-Coder-Instruct-6.7B | 39.2 | 65.2 | 65.5 | 44.4 | 61.2 | 46.6 | 61.2 | 42.0 | 26.0 | 25.0 | 79.3 | 18.9 | 18.7 | 8.2 | 28.7 | 30.7 | 27.1 | 30.0 |
| Instruct | DS-Coder-Instruct-7B | 39.9 | 61.0 | 61.5 | 35.2 | 62.6 | 46.0 | 62.8 | 46.0 | 23.0 | 24.0 | 79.6 | 23.3 | 20.9 | 8.9 | 32.7 | 27.7 | 25.1 | 26.8 |
| Instruct | DS-Coder-Instruct-33B | 44.7 | 65.9 | 64.6 | 57.7 | 65.0 | 53.5 | 66.5 | 46.4 | 28.0 | 30.0 | 75.7 | 22.2 | 27.6 | 11.3 | 45.5 | 38.6 | 35.3 | 36.1 |
| Instruct | DS-Coder-V2-Instruct-16B | 50.9 | 72.0 | 71.2 | 40.2 | 66.5 | 57.7 | 67.1 | 42.7 | 26.0 | 30.0 | 78.2 | 44.4 | 44.3 | 19.8 | 49.5 | 55.4 | 40.2 | 47.7 |
| Instruct | DS-Coder-V2-Instruct-236B | 58.7 | 73.8 | 75.3 | 70.0 | 75.2 | 67.1 | 76.5 | 58.5 | 35.0 | 38.0 | 77.3 | 58.9 | 58.9 | 35.0 | 57.3 | 53.5 | 41.1 | 49.4 |
| Instruct | CodeQwen1.5-7B-Chat | 46.3 | 76.2 | 76.8 | 47.0 | 73.4 | 47.0 | 74.7 | 54.2 | 22.0 | 22.0 | 82.3 | 33.3 | 32.6 | 13.0 | 39.6 | 38.6 | 30.7 | 37.7 |
| Completion | CodeLlama-7b-Python | 20.4 | 29.3 | 29.5 | 20.4 | 30.1 | 25.8 | 24.7 | 11.6 | 11.0 | 10.0 | 79.4 | 5.6 | 6.5 | 3.7 | 14.9 | 15.8 | 14.3 | 14.4 |
| Completion | CodeLlama-13b-Python | 21.7 | 40.2 | 35.0 | 22.4 | 34.8 | 30.9 | 30.2 | 20.4 | 16.0 | 15.0 | 78.6 | 6.1 | 4.8 | 2.4 | 16.8 | 17.8 | 13.8 | 14.7 |
| Completion | CodeLlama-34b-Python | 19.2 | 31.7 | 27.2 | 18.6 | 32.5 | 26.7 | 27.8 | 6.7 | 3.0 | 2.0 | 85.3 | 7.2 | 5.4 | 2.2 | 17.8 | 11.9 | 12.0 | 14.4 |
| Completion | WizardCoder-Python-7B-V1.0 | 25.2 | 34.8 | 35.8 | 22.4 | 34.3 | 28.0 | 35.4 | 8.6 | 19.0 | 23.0 | 79.3 | 10.6 | 9.8 | 7.2 | 19.8 | 19.8 | 15.3 | 16.7 |
| Completion | WizardCoder-Python-13B-V1.0 | 26.3 | 36.0 | 38.2 | 23.1 | 38.4 | 33.1 | 43.6 | 27.4 | 20.0 | 21.0 | 78.8 | 12.8 | 12.8 | 8.5 | 20.8 | 18.8 | 16.2 | 19.8 |
| Completion | WizardCoder-15B-V1.0 | 28.0 | 38.4 | 38.7 | 23.2 | 41.9 | 27.8 | 40.0 | 24.4 | 22.0 | 21.0 | 80.0 | 11.7 | 11.5 | 7.8 | 21.8 | 22.8 | 21.8 | 24.2 |
| Completion | WizardCoder-33B-V1.1 | 44.4 | 58.5 | 58.8 | 39.9 | 62.2 | 47.6 | 58.8 | 37.2 | 34.0 | 34.0 | 71.2 | 26.1 | 25.0 | 9.3 | 38.6 | 35.6 | 33.9 | 34.9 |

Table 2: Performance of each LLM on the RACE benchmark in code correctness (C), readability (R), maintainability (M), and efficiency (E). The performance metrics include accuracy (Acc.) (%) and the proportion of code that is both functionally correct and follows customized instructions (Acc. IF) (%). RN, RL, RC, and EC denote the Name Convention, Length, Comments, and Complexity factors. MI denotes the Maintainability Index. MC denotes the Modularity factor. $\mathrm{NI}_T$ and $\mathrm{NI}_S$ are metrics for code efficiency. The (*) symbol indicates that the indicator is a scalar from 0 to 100; the rest are percentages (%).
Figure 3: Performance of several representative LLMs on RACE benchmark.

In this section, we conduct a detailed evaluation of 18 Code LLMs and obtain several valuable findings. We first introduce the input formats and inference configurations for code generation tasks, along with the selection of LLMs. Subsequently, we present the overall experimental findings and conduct further analysis of the results to derive meaningful conclusions. The detailed experimental results are shown in Appendix B.

4.1 Settings

Task formats

We construct prompts in both completion style and chat style to better induce the LLMs to accomplish the corresponding tasks. During inference, we use a greedy decoding strategy and set the temperature to 0.

Models

We select several state-of-the-art Code LLMs of varying sizes, both open and closed source, including DeepSeek-Coder (Guo et al., 2024), CodeLlama (Roziere et al., 2023), WizardCoder-Python (Luo et al., 2024), CodeQwen1.5-7B-Chat (Bai et al., 2023), gpt-3.5-turbo-0125 (OpenAI, 2022), and gpt-4o-2024-05-13. For DeepSeek-Coder, we use the 6.7B, 7B, 33B, V2-Lite-Instruct (16B), and V2-Instruct (236B) variants; for both CodeLlama-Instruct and CodeLlama-Python, the 7B, 13B, and 34B variants; for WizardCoder, the 15B and 33B variants; and for WizardCoder-Python, the 7B and 13B variants.

The overall evaluation results on all 4 dimensions of each LLM are demonstrated in Table 2, and Figure 3 illustrates radar charts for representative models from each LLM family, providing a more intuitive comparison of the capabilities across various dimensions for different models. We can clearly see that current LLMs still struggle to generate high-quality code that is both correct and meets user requirements, thus making it difficult to meet the demands of real-world software development scenarios:

  • Across all metrics of readability and code efficiency, current Code LLMs demonstrate poor instruction-following ability. For instance, on the time-complexity metric ($\mathrm{NI}_T$), no LLM scores above 50. More notably, the proportion of code that is correct and meets the customization requirements for modularity (MC) is 35.2% for GPT-4o and 35.0% for DeepSeek-Coder-V2-236B, while all other LLMs remain below 20%, a nearly twofold difference. Furthermore, in most cases, incorporating specific user instructions into task descriptions decreases the correctness of the generated code to varying degrees. For example, when instructions related to code length are introduced, the code correctness of GPT-4o decreases from 80.5% to 78.9%, while DeepSeek-Coder-V2-16B drops by 5.5 percentage points (from 72.0% to 66.5%).

  • Different models have varying focus areas, and apart from GPT-4o and DeepSeek-Coder-V2-236B, no model consistently excels across multiple dimensions. For instance, CodeQwen1.5-7B-Chat achieves a relatively high code accuracy rate of 46.3%. However, its ability to meet customized requirements in terms of naming conventions, code length, and time complexity is comparable to that of deepseek-coder-6.7b-instruct, which has a code accuracy rate of 39.2%. Furthermore, we find that different models may have different complexity tendencies. For instance, GPT-4o and CodeLlama-34b-Instruct tend to generate code with lower time complexity, while DeepSeek-Coder-V2-236B and WizardCoder-33B-V1.1 tend to generate code with lower space complexity.

  • In terms of overall code quality, currently only the open-source model DeepSeek-Coder-V2-236B is comparable to GPT-4o. Although it is slightly weaker than GPT-4o in code correctness and readability, it surpasses GPT-4o in overall code efficiency. Other models still exhibit a significant gap compared to GPT-4o. For instance, while CodeQwen1.5-7B approaches GPT-4o in code correctness on the readability dimension, it lags significantly in the other dimensions when required to generate code that is both correct and requirement-compliant. The remaining models, such as the CodeLlama series, trail GPT-4o by more than a factor of two.

These findings indicate that future research should prioritize improving instruction-following capabilities in terms of code readability, maintainability, and efficiency, while ensuring code accuracy remains uncompromised. This strategy aims to develop Code LLMs that can consistently meet real-world development requirements across multiple dimensions.

4.2 Correlation Analysis Across Dimensions

Figure 4: The Pearson correlation coefficient matrix among factors under the dimensions of code correctness, readability, maintainability, and efficiency. We can observe that readability is a critical indicator of overall code quality.
Figure 5: Comparison of code correctness among LLM-generated code without custom requirements, with function-level comments, and with line-level comments.

To conduct a more in-depth analysis of how various factors across different dimensions influence overall code quality, we analyze the correlations between different factors in each model. Specifically, we first compute the proportion of the generated code that is both correct and follows customized instructions across 8 factors for 18 Code LLMs. Subsequently, we calculate Pearson correlation coefficients between these factors.
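The two steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the helper names `joint_rate`, `pearson`, and `correlation_matrix` are hypothetical, and the values in the demo are placeholders rather than results from the paper.

```python
import math


def joint_rate(correct, follows):
    """Percentage of samples that are both correct and instruction-compliant."""
    if len(correct) != len(follows) or not correct:
        raise ValueError("flag lists must be non-empty and of equal length")
    both = sum(1 for c, f in zip(correct, follows) if c and f)
    return 100.0 * both / len(correct)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Pairwise correlations; `scores` maps a factor name to per-model scores."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Placeholder per-sample flags for one model on one factor.
    print(joint_rate([True, True, False, True], [True, False, False, True]))
    # Placeholder per-model scores for two factors (three hypothetical models).
    toy_scores = {"naming": [10.0, 20.0, 30.0], "length": [12.0, 25.0, 31.0]}
    print(correlation_matrix(toy_scores))
```

In this sketch each factor contributes one score per model, so with 18 models every correlation is computed over two 18-element lists.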

The results of the correlation analysis are presented in Figure 4, which demonstrates that readability serves as a critical indicator of overall code quality. Specifically, significant correlations are observed between readability and almost all the other factors, with correlation coefficients mostly exceeding 0.8 and notably exceeding 0.9 in relation to correctness. For instance, if a segment of LLM-generated code exhibits consistent naming conventions, appropriate length constraints, or suitable comments, it is more likely to be of high overall quality. This finding aligns with the conclusions of Börstler et al. (2023), who identify code readability as a decisive factor in code quality, suggesting that enhancing the readability of code generated by LLMs is a crucial avenue for improvement. Furthermore, we analyze the comment factor within the readability dimension and compare the accuracy of LLM-generated code before and after incorporating comments. As illustrated in Figure 5, requiring models to include comments in appropriate sections enhances the performance of some LLMs. We posit that this phenomenon could be attributed to an emergent ability of large-scale LLMs (Wei et al., 2022; Schaeffer et al., 2024), wherein comments serve an implicit chain-of-thought role, thereby improving the accuracy of generated code.

4.3 Preference Bias of Code LLMs

Figure 6: The instruction-following rates of different LLMs for different customization needs in terms of naming convention, length, and loop structure.

To investigate whether the model’s inherent preferences affect its ability to follow user instructions, we conduct a more fine-grained comparison of the various user requirements. Specifically, for naming conventions, we require LLMs to consistently use either camel-case or snake-case for both function names and variable names. For code length, LLMs are required to keep each line within 60, 70, or 79 characters and each method within 20, 30, or 40 lines. For loop structures, LLMs are required to use only for statements or only while statements to implement the necessary loops. Finally, we calculate the proportion of LLM-generated code that follows these customized requirements, i.e., the rate of instruction-following (IF).
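As an illustration of how such requirements could be checked automatically, the sketch below uses Python's standard `ast` module. It is not the benchmark's evaluation code. The function names, the regular expressions, and the decisions to skip dunder names and to ignore comprehensions when checking loop structure are all assumptions made for this example.

```python
import ast
import re

# Assumed patterns; a single lowercase word such as "total" satisfies both.
SNAKE = re.compile(r"[a-z_][a-z0-9_]*")
CAMEL = re.compile(r"[a-z][a-zA-Z0-9]*")


def collect_names(source):
    """Return the function names and variable/argument names defined in source."""
    functions, variables = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.add(node.name)
        elif isinstance(node, ast.arg):
            variables.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            variables.add(node.id)
    return functions, variables


def follows_naming(source, style):
    """True if every collected name matches the requested style."""
    pattern = {"snake": SNAKE, "camel": CAMEL}[style]
    functions, variables = collect_names(source)
    names = functions | variables
    # Assumption: dunder names such as __init__ are exempt from the check.
    names = {n for n in names if not (n.startswith("__") and n.endswith("__"))}
    return all(pattern.fullmatch(n) for n in names)


def follows_loop(source, keyword):
    """True if the code avoids the loop statement that was not requested."""
    if keyword == "for":
        banned = (ast.While,)
    elif keyword == "while":
        # Assumption: comprehensions are not counted as for statements.
        banned = (ast.For, ast.AsyncFor)
    else:
        raise ValueError("keyword must be 'for' or 'while'")
    return not any(isinstance(n, banned) for n in ast.walk(ast.parse(source)))


def if_rate(flags):
    """Instruction-following rate, as a percentage of compliant samples."""
    return 100.0 * sum(flags) / len(flags)
```

A call such as `follows_naming(generated_code, "camel")` yields one flag per generated sample, and `if_rate` turns the list of flags into the percentage reported for a setting.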

Figure 6 shows the IF rates of 18 LLMs across all the customized requirements above. We find that the majority of LLMs exhibit an inherent preference bias towards generating code in specific styles. This bias often leaves these LLMs unable to follow user instructions effectively when the requested style differs from the one prevalent in their training data. Specifically, for naming conventions, Python conventionally employs snake-case for function and variable names. When requested to use camel-case, most LLMs, such as CodeLlama and CodeQwen, largely fail to comprehend and fulfill this requirement, with instruction-following rates below 20%. For code length, a 79-character single-line length limit is a common style in Python. When dealing with more stringent requirements, the IF rates of most instruct-type LLMs drop by nearly 20%, while DeepSeek-Coder-V2-236B maintains the best instruction-following rates. For loop structures, all LLMs except the GPT series and DeepSeek-Coder-V2-236B exhibit a pronounced tendency to use "for" statements. When tasked with using "while" statements, these LLMs struggle to convert between the two loop forms, as reflected in IF rates generally below 70%. These findings imply that most LLMs may simply learn next-token patterns from examples, without a clear understanding of the underlying code logic. In contrast, GPT-3.5, GPT-4o, and DeepSeek-Coder-V2-236B perform well in this respect, with IF rates above 90%. Such preference bias can lead to the ossification of code style in Code LLMs, thereby hindering their ability to meet specific real-world project requirements and consequently affecting the adaptability and scalability of generated code. This issue may be more pronounced in programming languages like Perl, JavaScript, and PHP, where there is no strict, widely accepted standard code style.

5 Conclusion

We present the RACE benchmark, a multi-dimensional evaluation framework for code generation that covers correctness, readability, maintainability, and efficiency. By selecting factors within each dimension and designing customized requirements for each factor, the RACE benchmark evaluates whether LLMs can generate code that is both correct and meets customized requirements. Based on experiments on 18 representative LLMs, we find that the current ability of LLMs to generate high-quality code on demand still falls short of the demands of software development. Additionally, code readability serves as a pivotal indicator of the overall quality of generated code. Our research highlights the critical importance of improving the multi-dimensional quality of generated code. Future efforts should focus on improving the ability of LLMs to meet real-world requirements.

Limitations

Currently, the RACE benchmark consists of four dimensions, each comprising two to three factors. However, there are additional dimensions worth considering when defining code quality and meeting practical development needs, such as security, testability, and dynamic behavior. Moreover, our experiments have so far been conducted only on Python code. We plan to expand to multilingual code to explore differences in model preferences across languages and their impact on meeting real-world requirements. Future efforts will also focus on further analyzing Code LLMs’ ability to meet customized requirements, exploring deeper factors influencing generated code quality, and investigating how the placement of code within longer code affects compliance with requirements.

References

  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Börstler et al. (2023) Jürgen Börstler, Kwabena E Bennin, Sara Hooshangi, Johan Jeuring, Hieke Keuning, Carsten Kleiner, Bonnie MacKellar, Rodrigo Duran, Harald Störrle, Daniel Toll, et al. 2023. Developers talking about code quality. Empirical Software Engineering, 28(6):128.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Coleman et al. (1994) Don Coleman, Dan Ash, Bruce Lowther, and Paul Oman. 1994. Using metrics to evaluate software system maintainability. Computer, 27(8):44–49.
  • Curtis et al. (2022) Bill Curtis, Robert A Martin, and Philippe-Emmanuel Douziech. 2022. Measuring the structural quality of software systems. Computer, 55(3):87–90.
  • Dantas et al. (2023) Carlos Eduardo C Dantas, Adriano M Rocha, and Marcelo A Maia. 2023. How do developers improve code readability? an empirical study of pull requests. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 110–122. IEEE.
  • Dillmann et al. (2024) Marc Dillmann, Julien Siebert, and Adam Trendowicz. 2024. Evaluation of large language models for assessing code maintainability. arXiv preprint arXiv:2401.12714.
  • Du et al. (2023) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861.
  • Fan et al. (2023) Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533.
  • Fried et al. (2022) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations.
  • Gong et al. (2024) Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. 2024. Evaluation of llms on syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196.
  • Hegedus (2013) Péter Hegedus. 2013. Revealing the effect of coding practices on software maintainability. In 2013 IEEE International Conference on Software Maintenance, pages 578–581. IEEE.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  • Huang et al. (2024) Dong Huang, Jie M Zhang, Yuhao Qing, and Heming Cui. 2024. Effibench: Benchmarking the efficiency of automatically generated code. arXiv preprint arXiv:2402.02037.
  • Jain et al. (2024a) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024a. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
  • Jain et al. (2024b) Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024b. R2e: Turning any github repository into a programming agent environment. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
  • Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR.
  • Li et al. (2024) Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. 2024. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604.
  • Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.
  • Li et al. (2023b) Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023b. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.
  • Li et al. (2022) Y Li, D Choi, J Chung, N Kushman, J Schrittwieser, R Leblond, T Eccles, J Keeling, F Gimeno, A Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science (New York, NY), 378(6624):1092–1097.
  • Liu et al. (2024) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36.
  • Luo et al. (2024) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2024. Wizardcoder: Empowering code large language models with evol-instruct. In The Twelfth International Conference on Learning Representations.
  • Ni et al. (2023) Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, et al. 2023. L2ceval: Evaluating language-to-code generation capabilities of large language models. arXiv preprint arXiv:2309.17446.
  • Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations.
  • Nistala et al. (2019) Padmalata Nistala, Kesav Vithal Nori, and Raghu Reddy. 2019. Software quality models: A systematic mapping study. In 2019 IEEE/ACM International Conference on Software and System Processes (ICSSP), pages 125–134. IEEE.
  • Oliveira et al. (2020) Delano Oliveira, Reydne Bruno, Fernanda Madeiral, and Fernando Castor. 2020. Evaluating code readability and legibility: An examination of human-centric studies. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 348–359. IEEE.
  • OpenAI (2022) OpenAI. 2022. Introducing chatgpt.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  • Sadeghzadeh Hemayati and Rashidi (2017) M Sadeghzadeh Hemayati and H Rashidi. 2017. Software quality models: A comprehensive review and analysis. Journal of Electrical and Computer Engineering Innovations (JECEI), 6(1):59–76.
  • Schaeffer et al. (2024) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2024. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36.
  • Shinn et al. (2024) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
  • Tian et al. (2024) Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. Debugbench: Evaluating debugging capability of large language models. arXiv preprint arXiv:2401.04621.
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708.
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research.
  • Xu et al. (2022) Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10.
  • Yan et al. (2023) Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, et al. 2023. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. arXiv preprint arXiv:2311.08588.
  • Zheng et al. (2023a) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023a. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568.
  • Zheng et al. (2023b) Zibin Zheng, Kaiwen Ning, Jiachi Chen, Yanlin Wang, Wenqing Chen, Lianghong Guo, and Weicheng Wang. 2023b. Towards an understanding of large language models in software engineering tasks. arXiv preprint arXiv:2308.11396.
  • Zheng et al. (2023c) Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023c. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372.

Appendix A Evaluation Data and Customized Instructions

Based on existing data, we design customized requirements that are both reasonable and closely aligned with real-world application scenarios, incorporating these requirements into the task description to obtain evaluation data for our RACE benchmark. Detailed customization instructions for each factor are shown in Figure 7 and Figure 8.

For code correctness, we utilize HumanEval+, MBPP+, ClassEval, and LeetCode data. For code readability, we use HumanEval+ data. For code maintainability, we use ClassEval and LeetCode data. For code efficiency, we use self-constructed data derived from LeetCode. We adhere to the task settings defined in the original data while incorporating our designed customization requirements. In addition, for the HumanEval+ and MBPP+ datasets, we discard the original prompt format and extract the primary task descriptions from the original prompts to serve as the final prompts. This approach avoids conflicts between the function template information included in the original prompts and the requirements related to code readability, thus providing a better assessment of code-related abilities. It also mitigates the impact of potential data leakage, thereby increasing the difficulty of the benchmark.

A) The templates for the correctness dimension
  • Please generate the Python code to solve the following problem.\n\nProblem:\n\n{problem}
B) The templates for the readability dimension
1) For the Naming Convention factor
  • Please generate the Python code to solve the following problem, and use camel case for both function names and variable names.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and use snake case for both function names and variable names.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and use camel case for function names.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and use snake case for function names.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and use camel case for variable names.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and use snake case for variable names.\n\nProblem:\n\n{problem}
2) For the Length factor
  • Please generate the Python code to solve the following problem, where each line is less than 60 characters long and each function is less than 20 lines long.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, where each line is less than 70 characters long and each function is less than 30 lines long.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, where each line is less than 79 characters long and each function is less than 40 lines long.\n\nProblem:\n\n{problem}
3) For the Comment factor
  • Please generate the Python code to solve the following problem, and add the necessary docstring for each function.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and add comments for each line in each function.\n\nProblem:\n\n{problem}
Figure 7: The prompt templates for each factor in the correctness and readability dimension for the RACE benchmark.
C) The templates for the maintainability dimension
1) For the MI factor
  • Please complete the class {class_name} in the following code, and ensure that the code has good maintainability. Code maintainability refers to how easy it is to support and change the code.\n\n```python\n{skeleton}\n```
2) For the Modularity factor
  • {problem}\n\nPlease complete the code below to solve above problem, and use only the given function.\n\n{starter_code}
  • {problem}\n\nPlease complete the code below to solve above problem, and use only the given function and one addition sub-function.\n\n{starter_code}
  • {problem}\n\nPlease complete the code below to solve above problem, and use only the given function and two addition sub-functions.\n\n{starter_code}
3) For the loop structure (Only for experiments)
  • Please generate the Python code to solve the following problem, and just use the for statement to implement the desired loop structures.\n\nProblem:\n\n{problem}
  • Please generate the Python code to solve the following problem, and just use the while statement to implement the desired loop structures.\n\nProblem:\n\n{problem}
D) The templates for the efficiency dimension
  • Please complete the code below to solve above problem, and make sure that the time complexity of the code is ${complexity}$.
  • Please complete the code below to solve above problem, and make sure that the space complexity of the code is ${complexity}$.
  • Please complete the code below to solve above problem, and make sure that the time complexity is ${time_complexity}$ and the space complexity is ${space_complexity}$.
Figure 8: The prompt templates for each factor in the maintainability and efficiency dimension for the RACE benchmark.
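The templates in Figures 7 and 8 contain placeholders such as {problem}, which suggests straightforward string substitution. The sketch below shows one way to fill them with Python's `str.format`. The constant names and the `build_prompt` helper are hypothetical, and folding the camel-case and snake-case templates into a single `{style}` field, and the length limits into `{max_line}` and `{max_function}`, is a simplification made for this example.

```python
# Template text is taken from Figure 7; constant names are hypothetical.
CORRECTNESS_TEMPLATE = (
    "Please generate the Python code to solve the following problem."
    "\n\nProblem:\n\n{problem}"
)

# Assumption: one parameterized template stands in for the separate
# camel-case and snake-case templates listed in the figure.
NAMING_TEMPLATE = (
    "Please generate the Python code to solve the following problem, "
    "and use {style} case for both function names and variable names."
    "\n\nProblem:\n\n{problem}"
)

# Assumption: the three length settings are expressed through two fields.
LENGTH_TEMPLATE = (
    "Please generate the Python code to solve the following problem, "
    "where each line is less than {max_line} characters long and each "
    "function is less than {max_function} lines long."
    "\n\nProblem:\n\n{problem}"
)


def build_prompt(template, **fields):
    """Fill a template; braces inside field values are left untouched."""
    return template.format(**fields)


if __name__ == "__main__":
    # Placeholder task description, not an item from the benchmark.
    task = "Return the sum of two integers."
    print(build_prompt(NAMING_TEMPLATE, style="camel", problem=task))
    print(build_prompt(LENGTH_TEMPLATE, max_line=60, max_function=20, problem=task))
```

Because `str.format` only parses the template itself, a task description that happens to contain braces is inserted verbatim.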

Appendix B Detailed Experiment Results

The detailed experimental results on the RACE benchmark are shown in Table 3 and Table 4. For the Naming Convention factor, we design 6 settings that require the function names (function_camel, function_snake), the variable names (var_camel, var_snake), or both (camel, snake) in the generated code to follow a specified naming convention. We can see that the majority of models struggle to adhere to the camel-case naming convention. Furthermore, the variance in capabilities among different models primarily manifests in scenarios requiring function names to use camel-case (function_camel). For the Length factor, we can see that as the constraints become progressively more stringent, from a maximum single-line length of 79 and a maximum method length of 40 lines (L_79_40) to a maximum single-line length of 60 and a maximum method length of 20 lines (L_60_20), most models exhibit a significant decline in their ability to meet the requirements. For the Comment factor, different models respond differently to the related requirements. However, we find that most models, such as deepseek-coder-33b-instruct, DeepSeek-Coder-V2-236B, and WizardCoder-Python-13B-V1.0, achieve higher code correctness when asked to meet the code comment requirements.
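The Length settings can be illustrated with a small checker. This is a sketch under stated assumptions, not the benchmark's implementation: the function name and the `LENGTH_SETTINGS` mapping are hypothetical, limits are treated as inclusive maxima (the prompt templates say "less than", so a strict comparison would be an equally plausible reading), and a function's length is counted from its `def` line to its last line.

```python
import ast

# Setting names follow Table 4; the mapping itself is a hypothetical helper.
LENGTH_SETTINGS = {
    "L_60_20": (60, 20),
    "L_70_30": (70, 30),
    "L_79_40": (79, 40),
}


def follows_length(source, max_line_chars, max_function_lines):
    """True if no line and no function exceeds the given limits."""
    # Check the character count of every physical line.
    if any(len(line) > max_line_chars for line in source.splitlines()):
        return False
    # Check the line span of every function definition (Python 3.8+).
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = node.end_lineno - node.lineno + 1
            if span > max_function_lines:
                return False
    return True


def check_all_settings(source):
    """Evaluate one piece of code against each Length setting."""
    return {
        name: follows_length(source, chars, lines)
        for name, (chars, lines) in LENGTH_SETTINGS.items()
    }
```

Calling `check_all_settings(generated_code)` returns one flag per setting, which makes it easy to see a sample pass L_79_40 while failing the stricter L_60_20.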

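Readability (Naming Convention)
Column C reports Acc only; every other setting reports Acc, IF, and Acc. IF.
Models C:Acc camel:Acc camel:IF camel:Acc.IF snake:Acc snake:IF snake:Acc.IF function_camel:Acc function_camel:IF function_camel:Acc.IF function_snake:Acc function_snake:IF function_snake:Acc.IF var_camel:Acc var_camel:IF var_camel:Acc.IF var_snake:Acc var_snake:IF var_snake:Acc.IF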
gpt-4o-2024-05-13 80.5 81.7 89.6 73.8 80.5 88.4 72.0 83.5 97.0 81.7 79.9 97.0 78.0 81.7 90.2 75.0 79.3 88.4 71.3
gpt-3.5-turbo-0125 62.8 65.2 48.8 35.4 61.0 87.2 53.7 63.4 84.8 54.3 62.8 93.9 58.5 64.0 40.2 29.3 62.8 91.5 58.5
CodeLlama-7b-Instruct 32.3 31.7 2.4 0.0 29.3 90.2 26.8 31.1 5.5 0.0 31.7 97.0 31.0 31.7 43.9 12.2 33.5 93.9 31.7
CodeLlama-7b-Python 29.3 28.0 18.3 4.9 29.9 90.9 28.0 29.9 22.0 7.3 31.7 95.1 31.1 28.7 78.7 23.2 28.7 93.3 28.0
CodeLlama-13b-Instruct 36.0 37.2 4.3 3.0 37.2 92.1 34.8 40.9 9.8 5.5 34.8 97.6 34.1 40.2 48.8 20.7 36.0 94.5 34.1
CodeLlama-13b-Python 40.2 34.8 6.7 2.4 35.4 90.2 33.5 36.0 9.1 3.7 33.5 94.5 32.9 34.8 73.2 28.0 35.4 93.9 34.1
CodeLlama-34b-Instruct 36.0 37.2 4.3 2.4 34.8 84.8 32.3 36.6 5.5 2.4 36.6 94.5 35.4 37.8 47.6 19.5 36.0 89.0 34.8
CodeLlama-34b-Python 31.7 27.4 19.5 5.5 28.0 90.2 26.8 29.9 21.3 6.1 26.2 96.3 26.2 26.2 80.5 22.6 25.6 93.3 24.4
DS-Coder-Instruct-6.7B 65.2 65.2 26.2 15.9 65.9 90.9 61.0 67.7 47.0 29.9 67.7 98.2 65.9 62.8 48.2 33.5 64.0 92.7 60.4
DS-Coder-Instruct-7B 61.0 61.6 8.5 5.5 59.1 92.1 54.3 62.6 11.0 6.7 61.6 98.8 60.4 62.8 43.9 26.2 61.6 92.7 57.9
DS-Coder-Instruct-33B 65.9 64.6 72.6 50.6 65.2 89.0 61.6 62.2 98.2 61.0 64.0 98.2 62.8 68.3 72.6 50.0 63.4 90.2 60.4
DS-Coder-V2-Instruct-16B 72.0 72.0 9.8 7.9 69.5 88.4 64.0 72.6 14.0 10.4 71.3 96.3 68.9 73.2 33.5 26.2 68.9 89.6 64.0
DS-Coder-V2-Instruct-236B 73.8 75.0 89.0 67.7 76.2 91.5 70.1 74.4 95.1 70.7 76.8 97.0 74.4 73.2 88.4 66.5 76.2 89.6 70.7
WizardCoder-Python-7B-V1.0 34.8 34.8 4.9 1.8 34.1 90.2 32.9 34.8 5.5 1.2 34.1 97.0 34.1 37.8 62.8 26.8 39.0 89.6 37.8
WizardCoder-Python-13B-V1.0 36.0 38.4 4.3 1.8 36.6 91.5 34.8 36.6 6.1 1.2 38.4 96.3 38.4 37.8 59.8 23.8 41.5 92.1 38.4
WizardCoder-15B-V1.0 38.4 39.6 4.3 1.2 40.9 92.1 38.4 38.4 5.5 1.2 38.4 97.0 38.4 39.0 62.8 26.2 36.0 92.1 34.1
WizardCoder-33B-V1.1 58.5 57.9 25.0 14.6 59.1 87.2 54.3 57.3 34.1 20.7 57.9 96.3 57.3 59.8 60.4 35.4 61.0 89.6 57.3
CodeQwen1.5-7B-Chat 76.2 75.6 12.2 9.1 76.2 90.9 70.1 76.2 15.9 11.0 79.3 98.2 77.4 76.8 57.9 43.3 76.8 89.6 71.3
Table 3: Detailed experimental results for the Naming Convention factor in the readability dimension on the RACE benchmark.
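Readability (Length): L_60_20, L_70_30, L_79_40. Readability (Comment): by_function, by_line. Maintainability (Loop Structure): for, while.
Column C reports Acc only; every other setting reports Acc, IF, and Acc. IF.
Models C:Acc L_60_20:Acc L_60_20:IF L_60_20:Acc.IF L_70_30:Acc L_70_30:IF L_70_30:Acc.IF L_79_40:Acc L_79_40:IF L_79_40:Acc.IF by_function:Acc by_function:IF by_function:Acc.IF by_line:Acc by_line:IF by_line:Acc.IF for:Acc for:IF for:Acc.IF while:Acc while:IF while:Acc.IF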
gpt-4o-2024-05-13 80.5 80.5 74.4 61.6 76.2 75.0 58.5 79.9 87.2 69.5 77.4 98.2 77.4 82.3 59.1 51.2 75.0 93.3 71.3 70.1 97.0 68.9
gpt-3.5-turbo-0125 62.8 58.5 64.6 39.0 61.0 78.0 45.7 61.6 87.8 53.7 66.5 95.1 64.0 65.2 25.0 18.9 56.7 97.0 54.9 52.4 90.2 48.8
CodeLlama-7b-Instruct 32.3 29.9 50.6 20.1 31.7 57.9 23.2 33.5 70.7 26.8 29.9 100.0 29.9 30.5 34.8 6.7 31.7 95.7 31.1 29.9 42.7 11.0
CodeLlama-7b-Python 29.3 29.9 68.3 23.8 31.1 78.0 25.6 29.3 83.5 28.0 28.7 72.6 22.0 20.7 11.6 1.2 28.7 93.9 26.2 26.8 58.5 13.4
CodeLlama-13b-Instruct 36.0 34.8 53.0 20.1 35.4 62.8 25.6 34.8 64.0 25.0 36.6 92.7 34.8 34.8 36.0 11.6 31.1 95.1 30.5 34.1 45.7 14.0
CodeLlama-13b-Python 40.2 34.1 78.7 27.4 34.8 83.5 31.1 35.4 88.4 34.1 34.8 92.1 34.8 25.6 29.3 6.1 33.5 91.5 31.1 34.8 60.4 18.9
CodeLlama-34b-Instruct 36.0 33.5 36.0 15.9 36.6 39.0 15.9 37.2 50.0 20.7 35.4 38.4 12.8 37.2 25.6 6.1 36.0 94.5 36.0 35.4 46.3 15.2
CodeLlama-34b-Python 31.7 32.3 64.6 22.0 33.5 72.6 28.7 31.7 82.3 29.3 23.2 67.7 12.8 32.3 11.0 0.6 25.6 94.5 25.0 25.6 70.7 12.2
DS-Coder-Instruct-6.7B 65.2 62.2 61.0 40.9 61.0 76.8 47.6 60.4 82.9 51.2 64.0 100.0 64.0 58.5 31.1 20.1 64.0 90.9 59.1 62.8 65.9 39.6
DS-Coder-Instruct-7B 61.0 61.6 57.3 37.2 62.2 71.3 47.0 64.0 84.1 53.7 62.2 99.4 62.2 63.4 40.9 29.9 61.6 95.1 58.5 57.9 64.0 39.6
DS-Coder-Instruct-33B 65.9 62.8 73.8 47.6 65.2 84.1 53.7 67.1 90.2 59.1 68.9 100.0 68.9 64.0 41.5 23.8 66.5 91.5 60.4 68.3 70.1 48.2
DS-Coder-V2-Instruct-16B 72.0 66.5 77.4 53.7 65.9 84.8 57.9 67.1 89.0 61.6 67.7 98.2 67.7 66.5 28.7 17.1 70.7 88.4 62.2 63.4 62.8 37.2
DS-Coder-V2-Instruct-236B 73.8 74.4 86.6 65.9 74.4 89.0 67.1 76.8 89.6 68.3 77.4 99.4 77.4 75.6 48.8 39.6 72.0 90.9 65.2 67.1 95.7 65.2
WizardCoder-Python-7B-V1.0 34.8 35.4 72.6 25.6 34.1 81.1 28.0 33.5 85.4 30.5 33.5 40.9 13.4 37.2 9.1 3.7 36.0 93.9 34.1 35.4 40.9 11.6
WizardCoder-Python-13B-V1.0 36.0 40.2 75.6 32.9 37.8 84.8 31.7 37.2 89.0 34.8 43.3 98.2 43.3 43.9 21.3 11.6 43.3 91.5 38.4 39.0 43.3 16.5
WizardCoder-15B-V1.0 38.4 42.7 50.0 20.7 40.2 67.1 28.0 42.7 77.4 34.8 41.5 99.4 41.5 38.4 15.2 7.3 42.7 97.6 41.5 40.2 59.1 21.3
WizardCoder-33B-V1.1 58.5 62.2 67.7 42.7 62.8 76.2 48.2 61.6 84.1 51.8 59.8 98.2 58.5 57.9 23.8 15.9 59.8 90.2 54.9 59.8 62.8 36.0
CodeQwen1.5-7B-Chat 76.2 71.3 47.0 36.6 75.6 61.0 48.2 73.2 74.4 56.1 76.2 98.8 75.0 73.2 43.9 33.5 72.0 93.3 68.3 65.2 70.1 44.5
Table 4: Detailed experimental results for the Length and Comment factors in the readability dimension, and for the loop structure experiments, on the RACE benchmark.