Revisiting the Impact of Pursuing Modularity for Code Generation

Deokyeong Kang, Ki Jung Seo, Taeuk Kim
Department of Computer Science, Hanyang University, Seoul, Republic of Korea
{rkdejrdud88,tjrlwjd1,kimtaeuk}@hanyang.ac.kr
Abstract

Modular programming, which aims to construct the final program by integrating smaller, independent building blocks, has been regarded as a desirable practice in software development. However, with the rise of recent code generation agents built upon large language models (LLMs), a question emerges: is this traditional practice equally effective for these new tools? In this work, we assess the impact of modularity in code generation by introducing a novel metric for its quantitative measurement. Surprisingly, unlike conventional wisdom on the topic, we find that modularity is not a core factor for improving the performance of code generation models. We also explore potential explanations for why LLMs do not exhibit a preference for modular code compared to non-modular code.

Equal contribution. Corresponding author.

1 Introduction

With recent advances in the capabilities of large language models (LLMs; OpenAI, 2024; Gemini Team, 2024; inter alia), their application areas have expanded beyond simple text-based tasks. Among these, coding assistants are becoming practically essential for programmers, enhancing their efficiency through tasks such as natural language to code (NL2Code) generation.

Similar to other use cases of LLMs, coding assistants are typically utilized in zero- or few-shot manners, without task-specific fine-tuning. The problem is that as the length of code is usually much longer than that of a sentence, the number of code examples available for each run is strictly limited. Furthermore, the same functionality can be represented with different forms of code, making it challenging for users to select a proper example for a target task. It is thus important to understand what characteristics of the code provided to the agents contribute to the final performance of such models. Among the many possible properties that influence the characteristics of code snippets, this work investigates the impact of code modularity on the performance of LLMs for NL2Code generation.

Figure 1: In this work, we address the following research question: Given modular and non-modular code snippets with identical functionality, which code type more effectively enhances performance in code generation when used as input for code language models?

Modular programming, the practice of building software with independent components, has long been considered a cornerstone of good software development. While this paradigm facilitates desirable properties of code for human programmers, such as reusability, readability, and maintainability, it remains an open question whether it offers the same level of effectiveness for LLMs.

Notably, Jain et al. (2024) argued that leveraging a set of modular functions can improve code generation accuracy for both in-context learning (ICL) and fine-tuning. As it is not trivial to guarantee the modularity of each code snippet, the authors asked GPT-3.5-Turbo (https://platform.openai.com/docs/models/gpt-3-5-turbo) to convert an existing code snippet into a more modular one, while ensuring its functional correctness.

However, we claim that their report warrants revisiting for two reasons. First, since LLMs are notorious for their verbosity, it is unclear whether the conversion process changed only the modularity of the code or also introduced unintended side effects. Second, the lack of a formally defined quantitative method for estimating modularity hinders more extensive analyses related to the problem.

In this paper, we (re-)investigate the effectiveness of pursuing modularity in NL2Code generation. We aim to push the boundaries of previous work by (1) introducing a novel metric that quantifies the modularity of a code snippet using numeric values. Based on the metric, we (2) classify code snippets as modular or non-modular without relying on LLMs, and evaluate how each category contributes to performance. (Note that this was infeasible in the previous study (Jain et al., 2024), as there was no clear standard for determining whether each code snippet is modular or not.) Moreover, beyond previous work, we (3) conduct experiments on models with parameters exceeding 7B (i.e., 33B and 34B) to investigate the impact of model scale. Figure 1 illustrates the core research question of this work.

In experiments, we discover that contrary to conventional wisdom in the literature, the modularity of a code example may not be the crucial factor for performance. We also explore potential explanations for why LLMs do not exhibit a preference for modular code compared to non-modular code.

Figure 2: Procedure of computing Cyclomatic Complexity (CC) and Modularity Score (MoS). We first build control-flow graphs (CFGs) from the given code to derive CC. The CC values are then used to compute MoS in the form of $\text{CC}_{\text{total}}$ and $m^{*}$.

2 Quantitative Definition of Modularity

To assess the impact of code modularity, the first essential step is to develop a method that provides a measurable score for code modularity. While the previous study (Jain et al., 2024) bypassed this vital step (the authors instead utilized LLMs to transform all code snippets into supposedly modular ones), we present a reasonable metric for estimating code modularity, which is challenging due to the inherent subjectivity of the concept itself.

Inspired by the software engineering literature, we employ the concept of Cyclomatic Complexity (CC) (McCabe, 1976) to determine the ideal number of modules, $m^{*}$, for a given code snippet. CC counts the number of independent execution paths in the control-flow graph (CFG) representation of the target code. CC can also be calculated as $E - N + 2$, where $E$ and $N$ correspond to the number of edges and nodes in the CFG. The CC values are computed at either the whole code level (total CC; $\text{CC}_{\text{total}}$) or the function level (meaning the average CC across all functions in the code; $\text{CC}_{\text{avg}}$).
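As an illustration of the $E - N + 2$ formula, the following minimal sketch computes CC for a hand-written CFG. The adjacency-dictionary representation and the node names are assumptions made for this example; they are not the tooling used in the paper.

```python
def cyclomatic_complexity(cfg: dict[str, list[str]]) -> int:
    """Return E - N + 2 for a single connected control-flow graph.

    `cfg` maps each node to the list of its successor nodes
    (a hypothetical representation chosen for this sketch).
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    num_edges = sum(len(successors) for successors in cfg.values())
    return num_edges - len(nodes) + 2


# Hypothetical CFG of a function containing a single if/else branch.
if_else_cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Hypothetical CFG of straight-line code without any branching.
straight_line_cfg = {"entry": ["exit"], "exit": []}

print(cyclomatic_complexity(if_else_cfg))        # 5 edges, 5 nodes
print(cyclomatic_complexity(straight_line_cfg))  # 1 edge, 2 nodes
```

The single branch yields two independent paths, while the straight-line code yields one, which matches the path-counting reading of CC given above.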

A high CC value generally indicates a complex code structure. It functions as a guideline for code decomposition, suggesting that a function whose CC exceeds a certain threshold value $\tau$, e.g., 5 (McCabe, 1976) or 10 (McConnell, 2004), might benefit from being broken down into smaller sub-functions. Based on this concept, we assume that the average CC of an ideal modular code example, denoted by $\text{CC}_{\text{avg}^{*}}$, should be equal to the threshold $\tau$. (Given two choices for $\tau$, i.e., 5 or 10, we set $\tau$ to 5 to encourage a sparser distribution of modularity scores (MoS).) In other words, ideally, every function within a modular code snippet is expected to have a CC value of $\tau$. Following this intuition, we define $m^{*}$, the number of ideal modules, as follows:

$$m^{*} = \left\lfloor \frac{\text{CC}_{\text{total}}}{\text{CC}_{\text{avg}^{*}}} \right\rfloor = \left\lfloor \frac{\text{CC}_{\text{total}}}{\tau} \right\rfloor.$$

Finally, we define the modularity score, dubbed MoS, as follows:

$$\text{MoS} = \begin{cases} \min\left(1, \frac{n}{m^{*}}\right) & \text{if } m^{*} > 0 \\ 0 & \text{if } m^{*} = n = 0 \\ 1 & \text{if } m^{*} = 0,\ n > 0 \end{cases},$$

where $n$ is equal to the actual number of modules in the target code. That is, the closer $n$ (actual number of modules) is to $m^{*}$ (ideal number of modules), the higher the modularity is considered to be, with the score capped at 1 once $n$ reaches $m^{*}$. (In extreme cases where $m^{*}=0$ (no modularization required), the modularity score is set to 0 if no actual modules are used ($n=0$) and 1 otherwise ($n>0$).) The process of deriving MoS is illustrated in Figure 2.
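The two definitions above translate directly into a few lines of Python. The sketch below assumes that $\text{CC}_{\text{total}}$ and $n$ have already been measured and are passed in as integers; the function names are illustrative and do not come from the paper.

```python
def ideal_num_modules(cc_total: int, tau: int = 5) -> int:
    """m* = floor(CC_total / tau), with tau = 5 as chosen in the text."""
    return cc_total // tau


def modularity_score(cc_total: int, n: int, tau: int = 5) -> float:
    """MoS as defined piecewise above.

    cc_total: total cyclomatic complexity of the snippet (assumed given).
    n: actual number of modules in the snippet (assumed given).
    """
    m_star = ideal_num_modules(cc_total, tau)
    if m_star > 0:
        return min(1.0, n / m_star)
    # m* == 0: no modularization is required.
    return 1.0 if n > 0 else 0.0


print(modularity_score(cc_total=12, n=1))  # m* = 2, one module used
print(modularity_score(cc_total=12, n=3))  # more modules than m*, capped
print(modularity_score(cc_total=3, n=0))   # m* = 0 and no modules
```

For example, a snippet with $\text{CC}_{\text{total}} = 12$ has $m^{*} = 2$, so using a single module gives MoS = 0.5, while using two or more gives MoS = 1.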

| Model | Size | Code Type | Introductory pass@1 | Introductory pass@5 | Interview pass@1 | Interview pass@5 | Competition pass@1 | Competition pass@5 | Average pass@1 | Average pass@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Code Llama | 7B | TMC | **14.67** | **19.63** | **2.28** | **3.98** | **0.21** | **0.59** | **4.45** | **6.66** |
| | | TSC | 13.84 | 17.15 | 2.16 | 3.61 | 0.07 | 0.24 | 4.20 | 6.07 |
| DeepSeekCoder | 6.7B | TMC | **34.26** | **40.74** | **9.60** | **13.41** | **0.76** | **1.93** | **13.49** | **17.63** |
| | | TSC | 33.24 | 39.73 | 8.55 | 12.40 | 0.55 | 1.21 | 12.55 | 16.64 |

Table 1: Results on APPS measured by pass@$k$. We use $n$=10 for pass@1 and pass@5. The best results are in bold for each section. Two-shot prompting is applied to generate code. Code Type refers to two distinct groups of code used for demonstrations. We find that TMC slightly outperforms TSC but the performance gaps are insignificant.

| Model | Size | Code Type | CodeContests pass@1 | CodeContests pass@10 |
|---|---|---|---|---|
| Code Llama | 7B | MC | 1.98 | 8.02 |
| | | SC | 2.58 | 8.81 |
| | | TMC | 2.57 | 10.18 |
| | | TSC | **4.35** | **10.67** |
| | 34B | MC | 4.11 | 12.78 |
| | | SC | **5.83** | 14.10 |
| | | TMC | 3.39 | 13.55 |
| | | TSC | 5.61 | **15.32** |
| DeepSeekCoder | 6.7B | MC | 5.30 | 12.78 |
| | | SC | 7.15 | 16.27 |
| | | TMC | 8.02 | **17.88** |
| | | TSC | **8.19** | 17.79 |
| | 33B | MC | 6.79 | 16.14 |
| | | SC | 8.87 | 20.50 |
| | | TMC | **9.38** | **22.74** |
| | | TSC | 8.78 | 22.09 |

Table 2: Results on CodeContests measured by pass@$k$. We use $n$=10 for pass@1 and $n$=50 for pass@10, respectively. The best results are in bold for each section. Two-shot prompting is applied for generating code given natural language queries. Code Type refers to four distinct groups of code used for demonstrations. We reveal no significant impact of code type on performance.

3 Four Code Categories by Modularity

With a way to quantify code modularity, we can now classify a code dataset into two categories: modular and non-modular (i.e., singular). We further leverage prior research by including LLM-based code transformations and their corresponding manually recovered counterparts for controlled experiments. This allows us to create four distinct clusters of code separated by their modularity levels. (Figure 3 in the Appendix displays examples of code from each category for a code generation problem.)

Modular Code (MC) is a collection of code snippets with the highest MoS among solutions for each problem in a dataset.

Singular Code (SC) represents another set of solution code examples for the same problems corresponding to MC, with MoS being 0.

Transformed Modular Code (TMC) can be obtained by utilizing GPT-3.5-Turbo ($f$) to transform SC into code with high MoS. The conversion process can be represented by the following:

$$\textbf{TMC} = f(I, Q, \textbf{SC}),$$

where $I$ represents a transformation instruction and $Q$ is the problem description of SC. (See Figure 4 for prompt details on the conversion process.)

Transformed Singular Code (TSC) is a variant of TMC whose modularity is manually removed by human programmers. Specifically, TSC is created by replacing each module invocation in TMC with the body of the corresponding module and then removing the modules from the program. Because TSC and TMC differ only in modularity, comparing them minimizes the influence of other factors and enables a rigorous evaluation of the impact of modularity.
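To make the TMC-to-TSC relationship concrete, here is a small hypothetical pair written for this illustration (it is not taken from the datasets). The sketch also counts function definitions with Python's `ast` module, under the assumption that the number of modules $n$ can be approximated by the number of `def` statements; the paper does not state how $n$ is extracted.

```python
import ast

# Hypothetical TMC-style snippet: the logic is split into helper functions.
TMC_SOURCE = '''
def is_even(x):
    return x % 2 == 0

def count_even(values):
    return sum(1 for v in values if is_even(v))

result = count_even(data)
'''

# Hypothetical TSC-style counterpart: calls replaced by the function bodies.
TSC_SOURCE = '''
result = sum(1 for v in data if v % 2 == 0)
'''


def count_function_defs(source: str) -> int:
    """Count function definitions (an assumed proxy for the module count n)."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


def run(source: str, data: list[int]) -> int:
    """Execute a snippet with `data` bound and return its `result` variable."""
    namespace = {"data": data}
    exec(source, namespace)
    return namespace["result"]


sample = [1, 2, 3, 4]
print(run(TMC_SOURCE, sample), count_function_defs(TMC_SOURCE))
print(run(TSC_SOURCE, sample), count_function_defs(TSC_SOURCE))
```

Both snippets compute the same result on the same input, while only the first contains function definitions, which is the property the TMC/TSC comparison relies on.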

4 Experimental Setups

We explore the impact of modularity by comparing how the four code collections, categorized by their modularity levels, affect performance. To mimic real-world usage, we focus on the case of utilizing code LLMs with few-shot in-context learning. We leverage two-shot demonstrations (providing two code examples) unless otherwise specified. (Refer to Figure 5 and Figure 6 for prompt details.)

Models.

We use two LLMs for code generation—Code Llama (7B, 34B; Rozière et al., 2024) and DeepSeekCoder (6.7B, 33B; Guo et al., 2024).

Datasets.

We employ two NL2Code generation datasets: APPS (Hendrycks et al., 2021) and CodeContests (Li et al., 2022). (Note that representative code generation benchmarks, e.g., HumanEval (Chen et al., 2021), typically provide code snippets whose length restricts the possibility of modularization.) Both datasets are based on competitive programming contests and provide a set of different solutions for each problem. (We preprocess the APPS and CodeContests datasets following Jain et al. (2024); refer to the Appendix for more details.) For each dataset, the groups of MC and SC demonstrations are chosen from solutions for randomly selected problems. SC examples are then converted into TMC, and finally, TSC is manually obtained. In this study, we focus our evaluation on Python.

Evaluation Metrics.

We apply an unbiased version of pass@$k$ (Chen et al., 2021), which measures the functional correctness of generated programs by running them against test cases. For each problem, LLMs are prompted to generate $n$ programs, and we determine $c$, the number of programs that pass the test cases. In addition, $k$ ($k \leq n$) specifies the granularity of evaluation such that the metric indicates the probability of finding at least one correct solution when sampling $k$ programs out of the $n$ generated ones. The metric is then averaged over all problems. As a result, pass@$k$ is computed as:

$$\text{pass}@k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right].$$
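The estimator above can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch using only the standard library; the function names and the `(n, c)` pair format are choices made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: number of generated programs, c: number that pass all tests,
    k: sample budget (must satisfy 1 <= k <= n).
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for three problems with n = 10 samples each.
toy_results = [(10, 0), (10, 1), (10, 10)]
print(mean_pass_at_k(toy_results, k=5))
```

With $n = 10$ and a single passing program ($c = 1$), pass@5 equals $1 - \binom{9}{5}/\binom{10}{5} = 0.5$, so the toy average over the three problems is 0.5.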

5 Main Results

Table 1 and Table 2 present results on APPS and CodeContests, categorized by the modularity of the code demonstrations. All results are the average of five independent runs with different random seeds.

In Table 1, we observe, as previously reported, that the performance of TMC is slightly better than TSC. (For APPS, we conducted experiments only with TMC and TSC due to computational constraints.) However, their marginal performance gaps raise questions about the impact of modularity.

In Table 2, the relationship between modularity and performance becomes less clear. When comparing MC to SC, we observe that MC consistently underperforms SC, which contradicts previous findings. Furthermore, the comparison between TMC and TSC, a more controlled setting for evaluating modularity, shows no clear correlation between code modularity and performance. This is despite the fact that the transformation process by GPT-3.5-Turbo (SC $\rightarrow$ TMC) seems to contribute to non-trivial increases in performance. Therefore, we argue that the previously reported effectiveness of modularity on performance was likely due to unforeseen consequences of the transformation process, rather than the modularity itself.

6 Analysis

| Model | Size | Pearson | Spearman |
|---|---|---|---|
| Code Llama | 7B | -0.34 (0) | -0.31 (0) |
| DeepSeekCoder | 6.7B | -0.21 (0.04) | -0.25 (0.01) |

Table 3: Correlations between modularity (MoS) and performance (pass@1), evaluated on CodeContests. They consistently show weak negative relationships. Numbers in parentheses represent $p$-values.

6.1 Correlation Study

We conduct an extra experiment to dive deeper into the modularity-performance relationship. Specifically, given 100 code samples used as demonstrations (for balanced sampling, we create bins along the MoS dimension and sample an equal number of examples from each bin; all the examples are either MC or SC type), we compute the Pearson and Spearman correlations between their modularity (MoS) and resulting performance (pass@1). For simplicity, we perform one-shot ICL on CodeContests. Experimental results are presented in Table 3 and Figure 7 in the Appendix. Surprisingly, the results reveal weak negative correlations between modularity and performance, suggesting that modularity may not offer benefits, or may even hinder performance in some cases.
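For readers who want to reproduce this kind of analysis, the sketch below implements both correlation coefficients in plain Python. The toy MoS and pass@1 values are invented for illustration only, and $p$-values are omitted; a statistics library would normally be used to obtain them.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy data: MoS of five demonstrations and the resulting pass@1.
toy_mos = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_pass1 = [5.0, 4.0, 4.5, 3.0, 2.0]
print(pearson(toy_mos, toy_pass1), spearman(toy_mos, toy_pass1))
```

Spearman depends only on the ordering of the values, so it is less sensitive than Pearson to a few demonstrations with unusually high or low pass@1.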

6.2 Do LLMs Favor Modular Code?

The minimal performance gap between (T)MC and (T)SC suggests that LLMs may not have a strong preference for generating modular code. To verify this hypothesis, we compare the perplexities of LLMs on modular and non-modular code. Formally, the perplexity of a code snippet $\mathcal{C}$ given a problem description $\mathcal{D}$ is:

$$\text{PPL}(\mathcal{C}) = \exp\left\{ -\frac{1}{n} \sum_{t=0}^{n-1} \log P(x_{t+1} \mid \mathcal{D}, x_{\leq t}) \right\},$$

where $\mathcal{C}$, consisting of tokens $x_{1}, \dots, x_{n}$, belongs to either MC ($\mathcal{C}_{\text{MC}}$) or SC ($\mathcal{C}_{\text{SC}}$). We sample nearly 10,000 problems from CodeContests containing both $\mathcal{C}_{\text{MC}}$ and $\mathcal{C}_{\text{SC}}$. We then compare $\text{PPL}(\mathcal{C}_{\text{MC}})$ and $\text{PPL}(\mathcal{C}_{\text{SC}})$ averaged over all examples to identify which type of code is better predicted by code language models.
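Given the per-token log-probabilities of a code snippet, the perplexity formula reduces to a single line. The sketch below assumes the natural-log probabilities $\log P(x_{t+1} \mid \mathcal{D}, x_{\leq t})$ have already been obtained from a model; how they are extracted is outside its scope.

```python
from math import exp, log


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean token log-probability (natural log).

    token_logprobs: log P(x_{t+1} | D, x_{<=t}) for each of the n code
    tokens, assumed to be provided by the language model.
    """
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    return exp(-sum(token_logprobs) / len(token_logprobs))


# Sanity check: if every token has probability 0.5, the perplexity is 2.
uniform_logprobs = [log(0.5)] * 4
print(perplexity(uniform_logprobs))
```

Lower perplexity means the model assigns higher probability to the snippet, so similar values for MC and SC indicate no strong preference for either style.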

Table 4 supports our hypothesis, highlighting a neutral preference of LLMs which is not biased towards generating SC or MC. This is presumably because the models were likely exposed to both code types during pre-training. We speculate that this could be one of the reasons why modular examples are not always beneficial for code generation in language models.

| Model | Size | $\text{PPL}(\mathcal{C}_{\text{MC}})$ | $\text{PPL}(\mathcal{C}_{\text{SC}})$ |
|---|---|---|---|
| Code Llama | 7B | 2.20 (0.57) | 2.40 (1.00) |
| | 34B | 2.02 (0.45) | 2.00 (0.44) |
| DeepSeekCoder | 6.7B | 1.93 (0.41) | 2.05 (0.63) |
| | 33B | 1.89 (0.42) | 1.89 (0.42) |

Table 4: Perplexities of LLMs for $\mathcal{C}_{\text{MC}}$ and $\mathcal{C}_{\text{SC}}$. LLMs exhibit similar predictive ability for both SC and MC. Numbers in parentheses represent standard deviations.

7 Conclusion

In this work, we propose a metric, called MoS, for quantifying the modularity of code snippets and evaluate its impact on performance. Our evaluation reveals no significant correlation, or even a possible weak negative correlation, between modularity and performance. This suggests that factors influencing the usefulness of code examples may differ between human and LLM perspectives. Exploring the influence of other code properties beyond modularity is a promising direction for future work.

Limitations

Due to limited computational resources, we focused on designing experimental settings that are both targeted and generalizable. This limitation restricted the scope of our investigation, but considering more extensive configurations in future work—such as fine-tuning, employing much larger models, and evaluating other programming languages—will help validate and potentially broaden the applicability of our findings. Despite these limitations, we believe our findings offer valuable insights, thanks to our comprehensive exploration of the feasible configurations within the available resources. Additionally, identifying a core factor besides modularity that directly affects performance holds significant promise for improving code generation.

References

  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
  • Gemini Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. Preprint, arXiv:2401.14196.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. Preprint, arXiv:2105.09938.
  • Jain et al. (2024) Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E. Gonzalez, Koushik Sen, and Ion Stoica. 2024. LLM-assisted code cleaning for training accurate code generators. In The Twelfth International Conference on Learning Representations.
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
  • McCabe (1976) T.J. McCabe. 1976. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320.
  • McConnell (2004) Steve McConnell. 2004. Code Complete, Second Edition. Microsoft Press, Redmond, WA, USA.
  • OpenAI (2024) OpenAI. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774.
  • Rozière et al. (2024) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open foundation models for code. Preprint, arXiv:2308.12950.

Appendix

Dataset Filtering.

Both the APPS and CodeContests datasets contain some solutions that are functionally incorrect. We filter out the code snippets that cannot pass the test cases and retain only Python solutions. After filtering, CodeContests has a training set of around 8K samples, while APPS has a training set of approximately 2K samples. Since some of the problems in APPS provide insufficient or no test cases, we retain only problems obtained from atcoder, codechef, and codeforces in APPS, following Jain et al. (2024). APPS is divided into APPS-INTRODUCTORY, APPS-INTERVIEW, and APPS-COMPETITION based on problem difficulty. Table 5 describes the statistics of the APPS and CodeContests datasets we finally employed. Additionally, we guarantee that both TMC and TSC pass the test cases.

Details on Two-shot In-Context Learning.

Following Rozière et al. (2024), we use a special instruction to help models understand the specific question format: “read from and write to standard IO” for standard questions and “use the provided function signature” for call-based questions. For APPS, we insert the applicable instruction into our prompt as the question guidance; for CodeContests, we use the instruction for standard questions. This corresponds to {FEW_SHOT_QUESTION} in Figure 4.

Scatter Plots for Correlation Study.

Each data point in the plots corresponds to one code generation run performed using one-shot ICL. To experiment with code of varying modularity (MoS), we draw the demonstrations from the CodeContests dataset. Note that the MoS values of the demonstrations are widely distributed, as depicted in Figure 7. Specifically, we sample 100 code snippets from a candidate pool of approximately 8K filtered solutions.

Figure 3: Examples of four code categories corresponding to an identical problem. Panels: (a) MC (MoS = 1); (b) SC (MoS = 0); (c) TMC (MoS = 1); (d) TSC (MoS = 0).

Figure 4: The prompt template used for converting SC to TMC.

Figure 5: The prompt template used for two-shot in-context learning with Code Llama.

Figure 6: The prompt template used for two-shot in-context learning with DeepSeekCoder.

| Statistic | Split | CodeContests | APPS-INTRODUCTORY | APPS-INTERVIEW | APPS-COMPETITION |
|---|---|---|---|---|---|
| # problems | Training | 8139 | 42 | 1247 | 361 |
| | Test | 165 | 702 | 2699 | 309 |
| # avg. test cases | Training | 20 | 1 | 1 | 10 |
| | Test | 10 | 16 | 24 | 45 |
| # avg. solutions | Training | 182 | 64 | 24 | 17 |

Table 5: Statistical details regarding the number of problems, the average number of test cases per problem, and the average number of solutions in the filtered datasets of CodeContests and APPS.

Figure 7: Scatter plots with modularity (MoS) on the x-axis and performance (pass@1) on the y-axis show weak negative correlations between the two variables. The CodeContests dataset is used for evaluation. Panels: (a) One-shot ICL with Code Llama 7B; (b) One-shot ICL with DeepSeekCoder 6.7B.