
ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation

Peiyang Wu, Nan Guo, Xiao Xiao, Wenming Li, Xiaochun Ye, and Dongrui Fan
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Abstract.

Recently, large language models (LLMs) have demonstrated excellent performance in understanding human instructions and generating code, which has inspired researchers to explore the feasibility of generating RTL code with LLMs. However, existing approaches to fine-tuning LLMs on RTL code are typically conducted on fixed datasets, which do not fully stimulate the capability of LLMs and require large amounts of reference data. To mitigate these issues, we introduce a simple yet effective iterative training paradigm named ITERTL. During each iteration, samples are drawn from the model trained in the previous cycle, and these new samples are then used for training in the current loop. This iterative approach reduces the distribution mismatch between the model and the training samples. It also enables the model to explore a broader generative space and receive more comprehensive feedback. We conduct theoretical analyses to investigate why the approach is effective. Experimental results show that the model trained with our approach can compete with and even outperform the state-of-the-art (SOTA) open-source model while using only about 37% of the reference samples, achieving pass@1 rates of 42.9% and 62.2% on the two VerilogEval evaluation datasets, respectively. With the same number of reference samples, our method achieves relative improvements of 16.9% and 12.5% in pass@1 over the non-iterative method. This study facilitates the application of LLMs to RTL code generation in practical scenarios with limited data.

Keywords: RTL Code Generation, Large Language Model, Iteration, Reward Function
CCS Concepts: Computing methodologies → Natural language generation

1. Introduction

Manually writing hardware description language (HDL) code (e.g., Verilog) is an unavoidable part of the current hardware design process. This step is often tedious and cumbersome, consuming a significant amount of engineers’ time. As LLMs have demonstrated excellent performance in natural language processing and code generation, researchers have begun to explore the use of LLMs to generate HDL code as an aid in hardware design (chang2023chipgpt; blocklove2023chip; thakur2023benchmarking; liu2023verilogeval; liu2023chipnemo).

To make LLMs more proficient in RTL code generation, a common method is to fine-tune them on a corresponding dataset. However, most existing methods conform to the conventional deep learning paradigm, which involves first gathering data and then training the model. This approach may lead to two negative effects:

(1) Since the model is trained on a limited amount of collected data, the space available for the model to explore is constrained, resulting in a narrow coverage of feedback signals.

(2) There is a mismatch between the distributions of the training samples and the LLM under training, which can lead to estimation errors (li2023policy; liu2023statistical) in the optimization process. Specifically, if the training samples are not generated by the LLM being trained, the two distributions are clearly misaligned. Even if the training samples are initially generated by that LLM, the distributions still diverge because the LLM's distribution shifts during training.

Due to these two negative effects, the RTL code generation capability of existing fine-tuned models is limited. Additionally, a large number of reference samples are required for fine-tuning, which can be costly to obtain. To foster exploration and mitigate the distribution deviation between the training samples and the LLMs being trained, an intuitive idea is to introduce reinforcement learning (RL) methods that allow LLMs to refine their policies through interaction with the environment, such as reinforcement learning from human feedback (RLHF) (ouyang2022training; stiennon2020learning) for general LLMs, or CodeRL (le2022coderl) and RLTF (liu2023rltf) for code LLMs. However, RL is well known to be complex, resource-intensive, and unstable. RL algorithms typically need to maintain several models at the same time (such as the policy model, value model, and reference model in the PPO algorithm (schulman2017proximal)) and require extensive hyper-parameter tuning, which incurs significant computational overhead and poses substantial obstacles for users.

To tackle the aforementioned issues, we introduce a simple yet effective iterative training scheme to fine-tune LLMs for Verilog code generation. Unlike previous methods that learn from a fixed training set, our approach iteratively updates the training samples through sampling during the training process, thus enlarging the exploration range of LLMs and enriching feedback. Since the training samples are iteratively sampled from the updated model, the mismatch between the distributions of the training samples and the LLM, as well as the resulting estimation error, is mitigated accordingly. As the first method that iteratively updates training samples in the RTL code generation field, our method enables the model to achieve state-of-the-art (SOTA) performance even with a smaller number of reference samples. Moreover, our approach is much easier to implement than complicated reinforcement learning. Only minimal hyper-parameter tuning is required, and there is no need to handle multiple models during the training process.

Our contributions can be outlined as follows:

(i) We develop an iterative supervised fine-tuning scheme for training LLMs to generate Verilog code. By expanding the exploration scope of models and reducing estimation errors caused by distribution mismatches, this solution can effectively boost Verilog code generation capability and substantially decrease the number of externally sourced reference samples needed for training.

(ii) Through a reward-maximization viewpoint, we theoretically analyze the limitations of previous methods and reveal the superiority of our approach.

(iii) With only about 37% of the reference samples, we outperform the state-of-the-art (SOTA) open-source LLM, attaining pass@1 rates of 42.9% and 62.2% on the VerilogEval-human and VerilogEval-machine benchmarks, respectively. Using the same number of reference samples, our method demonstrates relative improvements of 16.9% and 12.5% in pass@1 over the non-iterative approach. Relative to commercial closed-source models, we surpass GPT-3.5 and approach the level of GPT-4 on the benchmarks.

2. Background

2.1. Large Language Models for RTL Code Generation

Previous works (blocklove2023chip; chang2023chipgpt) have attempted to directly prompt LLMs to generate RTL code, achieving notable results. Chip-chat (blocklove2023chip) designs an 8-bit accumulator-based microprocessor with the assistance of commercial LLMs. In practical applications, however, it is challenging to generate usable code directly from human instructions. As a result, recent works (thakur2023autochip; delorenzo2024make) explore optimizing the RTL code generated by LLMs with feedback from Verilog tools. AutoChip (thakur2023autochip) utilizes error reports from compilers and simulators to help LLMs rectify faulty code. (delorenzo2024make) develops a Monte Carlo tree search (MCTS) algorithm that helps LLMs generate correct and PPA-efficient code using feedback from compilers and synthesis tools.

Instead of using off-the-shelf LLMs directly, some researchers (liu2023verilogeval) opt to train LLMs to specialize them in hardware design. VeriGen (thakur2023benchmarking) leverages corpora from GitHub and Verilog textbooks to fine-tune open-source LLMs, outperforming the state-of-the-art commercial Codex (chen2021evaluating) LLM on 17 Verilog problems. ChipNeMo (liu2023chipnemo) customizes LLaMa2 (touvron2023llama) for chip design applications such as chatbots, EDA tool script generation, and bug summarization. RTLCoder (liu2023rtlcoder) develops an automated flow to generate instruction-code pairs for supervised fine-tuning (SFT) and proposes a new SFT method that leverages code quality assessment.

To better measure the effectiveness of LLMs on Verilog code generation, researchers have proposed corresponding evaluation benchmarks. VerilogEval (liu2023verilogeval) has released an open-source evaluation dataset that includes 156 questions along with their corresponding golden solutions. RTLLM (lu2024rtllm) presents an open-source benchmark that evaluates the quality of the generated code from three progressive perspectives: syntax, functionality, and design quality.

2.2. Fine-tuning LLMs with RTL data

To tailor LLMs for RTL code generation, researchers often need to fine-tune LLMs with domain-specific data. VeriGen (thakur2023benchmarking) fine-tunes LLMs by next-token prediction on corpora from open-source code and textbooks, which can be regarded as a form of continual pre-training. To improve the model’s ability to follow instructions, VerilogEval (liu2023verilogeval) applies SFT to LLMs using synthetic instruction-code pairs. Furthermore, RTLCoder (liu2023rtlcoder) introduces a new fine-tuning algorithm that applies code quality evaluation to candidates sampled from pretrained LLMs. However, during training, all of these methods are constrained by static, unchanging training samples. According to previous research in natural language (yuan2023rrhf; liu2022brio), updating the training samples using the in-training LLM can significantly boost model performance. Related research on Verilog code generation, however, remains scarce, and our work can be seen as a pioneering attempt.

3. Approach

In this section, we first detail the workflow of our proposed approach. We then analyze the deficiencies of related methods and show the improvements of our approach from a theoretical viewpoint.

3.1. Framework

As an iterative training scheme, our approach boosts generation capability by alternating between sampling and training. Within a single loop, we largely follow RTLCoder (liu2023rtlcoder) for ease of implementation. The key distinction lies in our core idea of alternating iterations, which can significantly enhance performance.

Figure 1. ITERTL Framework. The model $\pi_t$ is instructed to generate code responses, which are used, along with the reference code $a_K$, to optimize the model itself. Once the optimization process converges, the new model $\pi_{t+1}$ is used to sample responses in the next iteration.

We first introduce the procedure within a single loop and then explain the iteration process. As shown in Figure 1, in the $t$-th round of training, each input instruction $s$ has $K$ corresponding output responses $a_k^t$, $1 \leq k \leq K$. The first $K-1$ responses are sampled from the model $\pi_t$ obtained from the previous training iteration, while the last response $a_K^t$, which serves as the reference data, is obtained from a teacher LLM or a human, denoted $\pi_{teacher}$. Assuming the distribution of $\pi_{teacher}$ remains unchanged, repeated sampling has little impact, so we omit the superscript $t$ from $a_K^t$. Each response $a_k^t$ is assigned a quality score $z_k^t$ using Pyverilog and ROUGE-L, following (liu2023rtlcoder).

The loss function comprises two components: the ranking loss (liu2023rtlcoder; yuan2023rrhf; liu2022brio) and the maximum likelihood estimation (MLE) loss, implemented as a cross-entropy loss on the reference sample.

The length-normalized conditional log probability is calculated using the model $\pi$ that is being trained:

\begin{equation}
p_k = \frac{\sum_j \log P_\pi\left(a_{k,j}^t \mid s, a_{k,<j}^t\right)}{\left\| a_k^t \right\|}
\tag{1}
\end{equation}
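As a minimal sketch of Equation 1, the snippet below averages per-token log probabilities over a response. It assumes that $\|a_k^t\|$ denotes the number of tokens in the response; the function name and the toy probabilities are hypothetical and only serve as illustration.

```python
import math


def length_normalized_logprob(token_logprobs):
    """Mean per-token log-probability of one response (Equation 1).

    Assumes the normalizer ||a_k^t|| is the token count of the response.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token probabilities of a 4-token response.
token_probs = [0.5, 0.25, 0.5, 0.25]
p_k = length_normalized_logprob([math.log(p) for p in token_probs])
```

Normalizing by length keeps long responses from being penalized simply for containing more tokens.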

Combining the quality score $z$ and the log probability $p$, the ranking loss (liu2023rtlcoder; yuan2023rrhf; liu2022brio) can be computed:

\begin{equation}
L_{ranking}^t = \sum_{z_k^t < z_\tau^t - \beta} \max\left(p_k - p_\tau + \alpha, 0\right)
\tag{2}
\end{equation}

Unlike RTLCoder (liu2023rtlcoder), we do not apply softmax normalization to $p$, based on subsequent experimental results.
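The ranking loss of Equation 2 can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' implementation: it sums a hinge term over every ordered pair $(k, \tau)$ whose quality scores differ by more than $\beta$, and uses the raw log probabilities without softmax normalization. The function name and the toy values of `p`, `z`, `alpha`, and `beta` are hypothetical.

```python
def ranking_loss(p, z, alpha, beta):
    """Hinge ranking loss over pairs (k, tau) with z[k] < z[tau] - beta (Equation 2).

    p: length-normalized log-probabilities of the K responses
    z: quality scores of the same K responses
    """
    total = 0.0
    for k in range(len(p)):
        for tau in range(len(p)):
            # Penalize a lower-quality response k whose log-probability is not
            # at least alpha below that of the higher-quality response tau.
            if z[k] < z[tau] - beta:
                total += max(p[k] - p[tau] + alpha, 0.0)
    return total


p = [-0.2, -0.9, -0.5]  # hypothetical log-probabilities
z = [0.1, 0.9, 0.5]     # hypothetical quality scores
loss = ranking_loss(p, z, alpha=0.3, beta=0.2)
```

In this toy case the lowest-quality response has the highest log probability, so all three qualifying pairs contribute and the loss is about 2.3. The loss drops to zero once every higher-scored response leads its lower-scored counterpart by a margin of at least `alpha`.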

The other loss component is the standard cross-entropy loss with respect to the reference sample $a_K$:

\begin{equation}
L_{ce} = -\sum_j \log P_\pi\left(a_{K,j} \mid s, a_{K,<j}\right)
\tag{3}
\end{equation}

The overall loss function is a weighted sum of both:

\begin{equation}
L^t = L_{ce} + \lambda L_{ranking}^t
\tag{4}
\end{equation}

To control the relative scale of the two loss functions, we set $\lambda$ equal to $sg(L_{ce})$, where $sg(\cdot)$ represents a stop-gradient operation.
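To make the effect of the stop-gradient explicit, the following worked step writes out the value and the gradient of Equation 4 under $\lambda = sg(L_{ce})$. The symbol $\theta$ for the parameters of $\pi$ is an assumed notation that does not appear in the original text.

\begin{align*}
L^t &= L_{ce} + sg(L_{ce})\, L_{ranking}^t = L_{ce}\left(1 + L_{ranking}^t\right) \quad \text{(in value)} \\
\nabla_\theta L^t &= \nabla_\theta L_{ce} + sg(L_{ce})\, \nabla_\theta L_{ranking}^t
\end{align*}

Because $sg(L_{ce})$ is treated as a constant during differentiation, it rescales the ranking gradient by the current magnitude of $L_{ce}$ without adding an extra $L_{ranking}^t \, \nabla_\theta L_{ce}$ term.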

After the initial round of training, we obtain the converged model $\pi_{t+1}$. Even though the capability of $\pi_{t+1}$ has improved compared to $\pi_t$, there remains room for further enhancement. We therefore resample new responses $a_k^{t+1}$ using the new model $\pi_{t+1}$ and assess new quality scores $z_k^{t+1}$. Resampling can bring new feedback because the model’s distribution has shifted after the previous round of training. We then compute the loss function $L^{t+1}$ and update the parameters with the new data. This process is repeated until training concludes.
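The alternation between sampling and training described above can be sketched as a short loop. This is a schematic under stated assumptions, not the authors' code: `sample`, `score`, and `train` are hypothetical caller-supplied callables standing in for LLM decoding, the Pyverilog-based quality scoring, and one round of fine-tuning to convergence, and a fixed number of rounds is assumed as the stopping rule.

```python
def iterative_finetune(model, instructions, references, sample, score, train, rounds, k):
    """Alternate sampling and training for a fixed number of rounds.

    sample(model, s) -> response, score(response, reference) -> float and
    train(model, batch) -> new model are hypothetical, caller-supplied callables.
    """
    for _ in range(rounds):
        batch = []
        for s, a_ref in zip(instructions, references):
            # K-1 responses come from the current model; the K-th is the fixed reference.
            candidates = [sample(model, s) for _ in range(k - 1)] + [a_ref]
            scores = [score(a, a_ref) for a in candidates]
            batch.append((s, candidates, scores))
        # The model returned here generates the samples of the next round.
        model = train(model, batch)
    return model


# Toy stand-ins: the "model" is just a round counter.
seen = []
final = iterative_finetune(
    model=0,
    instructions=["adder", "mux"],
    references=["ref_adder", "ref_mux"],
    sample=lambda m, s: f"{s}@pi_{m}",
    score=lambda a, ref: 1.0 if a == ref else 0.0,
    train=lambda m, batch: (seen.append(batch), m + 1)[1],
    rounds=3,
    k=4,
)
```

The toy run only checks the data flow: in each round the sampled candidates are tagged with the model produced by the previous round, while the reference response stays the same.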

3.2. Theoretical Analysis

Inspired by the previous work(li2023policy), we regard fine-tuning LLMs as a reward maximization problem, which can be formulated as follows:

\begin{equation}
\max_{\pi} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ r(s, a) \right],
\tag{5}
\end{equation}

where $\mathbb{E}$ denotes the expected value, $s$ represents the instruction (prompt), which follows the distribution $\rho(\cdot)$, $a$ represents the response (code) generated by the LLM $\pi$ being optimized, and $r$ represents the reward for the response, in other words, the quality of the generated code.

Ideally, the true reward for the generated Verilog code should incorporate an evaluation of its functional correctness as well as its PPA metrics. However, this is intractable in practice because verifying functional correctness requires a comprehensive testbench for each design, which is nearly impossible for researchers to develop given the large number of samples used for fine-tuning. Additionally, assessing PPA relies on logic synthesis, which also consumes a significant amount of time. Therefore, it is necessary to approximate the reward function to make it more manageable.

Let’s first consider a naive approach that directly fine-tunes LLMs using reference data drawn from $\pi_{teacher}$, which can be considered a form of knowledge distillation. The optimization objective can be derived as follows:

\begin{equation}
\begin{split}
\min_{\pi} \sum_j & -\log P_\pi\left(a_j \mid s, a_{<j}\right) = \min_{\pi} \sum_j \log \frac{P_{teacher}\left(a_j \mid s, a_{<j}\right)}{P_\pi\left(a_j \mid s, a_{<j}\right)} \\
&= \min_{\pi} \sum_j KL\left(P_{teacher,j} \,\|\, P_{\pi,j}\right) \\
&\simeq \max_{\pi} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi_{teacher}(\cdot \mid s)} \Big[ \sum_j -KL\left(P_{teacher,j} \,\|\, P_{\pi,j}\right) \Big]
\end{split}
\tag{6}
\end{equation}

The first equality holds because $P_{teacher}$ is independent of $\pi$, while the approximate equality in the third line is based on the law of large numbers. Comparing Equations 5 and 6, it is clear that this approach leads to systematic errors in two aspects\footnote{If $s$ is synthesized by another model rather than obtained from real samples, the estimation is also biased. We omit this aspect in this work since it is not the focus of our research.}: (1) replacing the distribution of the model being trained, $\pi(\cdot \mid s)$, with $\pi_{teacher}(\cdot \mid s)$, and (2) using the negative Kullback-Leibler (KL) divergence between the probability distribution of the reference data and $\pi$ as the surrogate reward function.
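The step from the cross-entropy objective to the KL divergence can be spelled out as follows. This is a sketch in the notation above; writing the per-position expectation over $a_j \sim P_{teacher,j}$ explicitly is an added reading that the original derivation leaves implicit.

\begin{align*}
\sum_j \log \frac{P_{teacher}\left(a_j \mid s, a_{<j}\right)}{P_\pi\left(a_j \mid s, a_{<j}\right)}
&= \underbrace{\sum_j \log P_{teacher}\left(a_j \mid s, a_{<j}\right)}_{\text{constant in } \pi} - \sum_j \log P_\pi\left(a_j \mid s, a_{<j}\right) \\
\mathbb{E}_{a_j \sim P_{teacher,j}} \left[ \log \frac{P_{teacher,j}(a_j)}{P_{\pi,j}(a_j)} \right]
&= KL\left(P_{teacher,j} \,\|\, P_{\pi,j}\right)
\end{align*}

The first line shows why adding the teacher term does not change the minimizer, and the second shows that the log-ratio becomes a KL divergence once it is averaged over tokens drawn from the teacher.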

Next, let’s consider a non-iterative, degenerate version of our approach, where a constant distribution $\pi_c$ is used to generate training samples. This version is quite similar to RTLCoder (liu2023rtlcoder). We can also transform its optimization objective into a reward maximization problem. To simplify the derivation, we present only the ranking loss component from Equation 2. The terms associated with the reference data in the ranking loss are also omitted, which does not affect our conclusions.

\begin{equation}
\begin{split}
\min_{\pi} \; & (L_{ranking}) = \min_{\pi} \Big[ \sum_{z_k < z_\tau - \beta} \max\left(p_k - p_\tau + \alpha, 0\right) \Big] \\
&\simeq \min_{\pi} \big[ \max\left(p_k - p_\tau + \alpha, 0\right) \mathbb{I}(z_k < z_\tau - \beta) \big] \\
&\simeq \max_{\pi} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a_k \sim \pi_c(\cdot \mid s)} \big\{ \\
&\quad \mathbb{E}_{a_\tau \sim \pi_c(\cdot \mid s)} \left[ -\max\left(p_k - p_\tau + \alpha, 0\right) \mathbb{I}(z_k < z_\tau - \beta) \right] \big\} \\
&= \max_{\pi} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi_c(\cdot \mid s)} \left[ r_\pi(s, a) \right]
\end{split}
\tag{7}
\end{equation}

Here, we denote the surrogate reward function as $r_\pi$ to show its dependency on $\pi$. Comparing Equations 6 and 7, we find that this approach offers two improvements over the previous naive approach. (1) $\pi_c$ can be set to an appropriate distribution to reduce the mismatch with $\pi$, for instance by selecting the initial distribution $\pi_0$. (2) The surrogate reward function incorporates an evaluation of code quality, which is more reasonable than a simple KL divergence.

Based on the analysis of the above two cases, we can formally reveal the superiority of our method. Using similar notations, our optimization scheme can be formulated as:

\begin{equation}
\pi_{t+1} = \underset{\pi}{\operatorname{argmax}} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi_t(\cdot \mid s)} \left[ r_\pi(s, a) \right]
\tag{8}
\end{equation}

After the $t$-th iteration, $\pi$ has been updated and progressively deviates from the previous distributions $\pi_{old}$ ($\pi_0, \ldots, \pi_{t-1}$). By replacing $\pi_{old}$ with $\pi_t$ for sampling, our method effectively mitigates the distribution mismatch. Furthermore, by the law of large numbers, increasing the number of samples yields better estimates of the expected values.

We can also analyze the potential bottleneck of our approach from Equation 8. Since the surrogate reward function $r_\pi$ evaluates code quality merely from a syntax perspective, ignoring functional correctness and PPA, there is a mismatch between $r_\pi$ and the true reward function $r$. As a result, although $r_\pi$ may increase with each iteration, the potential improvement of the model is still constrained, which is verified by our experiments in Section 4. We anticipate that a better-designed surrogate reward function could further enhance the effectiveness of our approach in the future.

From another perspective, our approach can also be regarded as a degenerate variant of the generalized Expectation-Maximization (EM) algorithm. Consider the following optimization problem:

\begin{equation}
\max_{\pi} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ r_\pi(s, a) \right]
\tag{9}
\end{equation}

Since $\pi$ appears both in the subscript of the expectation and in the reward function, an EM-style optimization procedure would be:

\begin{equation}
\pi_{t+1} = \underset{\pi}{\operatorname{argmax}} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi_t(\cdot \mid s)} \left[ r_\pi(s, a) \right]
\tag{10}
\end{equation}
\begin{equation}
\pi_{t+2} = \underset{\pi}{\operatorname{argmax}} \; \mathbb{E}_{s \sim \rho(\cdot)} \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ r_{\pi_{t+1}}(s, a) \right]
\tag{11}
\end{equation}

Equation 10 corresponds to our optimization approach, while optimizing Equation 11 typically requires reinforcement learning methods such as policy gradients, which may increase the complexity and instability of the whole scheme. We therefore simply set $\pi_{t+2}$ equal to $\pi_{t+1}$ to avoid these issues. We leave the exploration of more precise optimization methods to future research.
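The degenerate alternation described above can be sketched as a short loop. This is only an illustration under stated assumptions: `sample` and `fit` are hypothetical callables standing in for candidate generation and for the training step of Equation 10; they are not functions from the paper or from any library.

```python
def iterate_policy(policy, prompts, sample, fit, num_iterations):
    """Alternate sampling and fitting, skipping the step of Equation 11.

    `sample(policy, prompt)` and `fit(policy, batch)` are hypothetical
    placeholders supplied by the caller.
    """
    history = [policy]
    for _ in range(num_iterations):
        # Draw actions a ~ pi_t(. | s) for every prompt s (inner expectation of Eq. 10).
        batch = [(s, sample(policy, s)) for s in prompts]
        # Fit pi_{t+1} on samples drawn from pi_t (the argmax of Eq. 10).
        policy = fit(policy, batch)
        # Eq. 11 is not optimized: pi_{t+2} is simply set equal to pi_{t+1},
        # so the next round samples from the freshly fitted policy.
        history.append(policy)
    return history


# Toy usage: the "policy" is a single number, sampling shifts it by one,
# and fitting averages the sampled values.
toy_history = iterate_policy(
    policy=0.0,
    prompts=["a", "b"],
    sample=lambda p, s: p + 1.0,
    fit=lambda p, batch: sum(a for _, a in batch) / len(batch),
    num_iterations=3,
)
```

The toy usage only demonstrates the control flow; in practice the two callables would wrap LLM decoding and a fine-tuning run.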

4. Experimental Evaluation

4.1. Experimental Setup

Benchmark: We choose a comprehensive evaluation dataset, VerilogEval(liu2023verilogeval), as the benchmark to measure performance. This dataset consists of diverse Verilog tasks ranging from simple to complex, such as combinational circuits, finite state machines, code debugging, and testbench construction. Two sets of design instructions are provided: the first, VerilogEval-machine, is generated by an LLM and contains 143 samples; the second, VerilogEval-human, is manually written and comprises 156 samples. Functional correctness is evaluated with the Icarus Verilog simulator by comparing the outputs of the generated design with those of the golden solution.

Table 1. Performance on VerilogEval Benchmark(liu2023verilogeval). The results of GPT-3.5, GPT-4, VerilogEval, Codegen and VeriGen come from (liu2023verilogeval). The results of RTLCoder come from (liu2023rtlcoder). The best results are marked in bold. If the best result comes from a closed-source model (like GPT-4), then the best result from the open-source model is also highlighted in bold. Additionally, the second best result is underlined.
\begin{tabular}{llcccccc}
\hline
Model & Params & \multicolumn{3}{c}{VerilogEval-Machine (\%)} & \multicolumn{3}{c}{VerilogEval-Human (\%)} \\
 & & Pass@1 & Pass@5 & Pass@10 & Pass@1 & Pass@5 & Pass@10 \\
\hline
GPT-3.5 (Closed-Source) & - & 46.7 & 69.1 & 74.1 & 26.7 & 45.8 & 51.7 \\
GPT-4 (Closed-Source) & - & 60.0 & 70.6 & 73.5 & 43.5 & 55.8 & 58.9 \\
Codegen(nijkamp2022codegen) & 16B & 5.0 & 17.6 & 25.8 & 1.6 & 6.1 & 9.4 \\
VeriGen(thakur2023benchmarking) & 16B & 33.8 & 59.2 & 67.9 & 24.5 & 45.3 & 53.2 \\
VerilogEval-8.5k(liu2023verilogeval) & 16B & 46.2 & 67.3 & 73.7 & 28.8 & 45.9 & 52.3 \\
RTLCoder-DeepSeek-10k(liu2023rtlcoder) & 6.7B & 55.3 & 70.4 & 76.2 & 36.7 & 47.0 & 50.4 \\
RTLCoder-DeepSeek-27k(liu2023rtlcoder) & 6.7B & 61.2 & 76.5 & 81.8 & 41.6 & 50.1 & 53.4 \\
Ours-DeepSeek-10k & 6.7B & 62.2 & 75.0 & 79.0 & 42.9 & 50.0 & 53.8 \\
\hline
\end{tabular}

Metric: As in many code generation studies, the pass@$k$ metric(kulal2019spoc) is employed to measure performance: a problem is regarded as solved if at least one of the $k$ generated code samples passes the unit tests. Specifically, for each problem, the model generates $n\geq k$ candidates, of which $c\leq n$ pass the unit tests. The pass@$k$ metric is then computed with the following unbiased estimator(chen2021evaluating):

\[
\text{pass@}k=\underset{\text{Problems}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \tag{12}
\]

To avoid the impact of randomness, we mainly focus on the pass@1 value under greedy decoding. For a comprehensive evaluation, we also measure the pass@$k$ metrics for $k\in\{1,5,10\}$, setting $n=10$ under Top-p decoding.
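The estimator in Equation 12 can be computed exactly with integer binomial coefficients. The following is a minimal sketch; the function names and the `(n, c)` input layout are illustrative choices, not taken from the benchmark's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem term of Equation 12: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For instance, with $n=10$ samples of which $c=3$ pass, `pass_at_k(10, 3, 1)` equals $c/n$, while `pass_at_k(10, 3, 5)` evaluates $1-\binom{7}{5}/\binom{10}{5}$.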

Decoding Strategy: During the training stage, in order to enhance diversity and promote exploration, we use the Top-p decoding strategy with $top_{p}=0.95$ and $temperature=0.5$. In the testing stage, to mitigate random errors, the model is prompted to generate responses with $temperature\in\{0\ (\text{greedy decoding}),0.2,0.5,0.8\}$ and $top_{p}=0.95$. For each test metric (pass@1, pass@5, and pass@10), the best result is chosen.
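For readers unfamiliar with Top-p (nucleus) decoding, the sketch below shows how a temperature and a `top_p` threshold reshape a next-token distribution. It is a generic, library-free illustration over a toy list of logits, not the decoding code used in the experiments; the function names are hypothetical.

```python
import math
import random


def top_p_distribution(logits, temperature=0.5, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p truncation."""
    if temperature == 0:
        # Temperature 0 is treated as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to one.
    return {i: probs[i] / mass for i in kept}


def sample_token(dist, rng=random):
    """Draw one token index from the truncated distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Lower temperatures sharpen the distribution before truncation, so fewer tokens survive the `top_p` cut and sampling becomes closer to greedy decoding.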

Training Details: DeepSeek-Coder-Instruct-6.7B(guo2024deepseek) is chosen as the pre-trained model. We use the open-source instruction-code pairs from (liu2023rtlcoder) as reference data. This dataset contains nearly 27k samples, from which we randomly sample only 10,000 entries to train the final version of our model. The learning rate is set to $10^{-5}$. The AdamW optimizer is employed with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. bf16 mixed-precision training is adopted to avoid overflow. During each iteration, the model is trained for 3 epochs. The total number of iterations is set to 7. We conduct all experiments on 4 NVIDIA A800 GPUs. As for the parameters mentioned in Section 3.1, we set the number of candidates to $K=4$. The hyper-parameters $\alpha$ and $\beta$ in Equation 2 are set to 0.3 and 0.2, respectively.

4.2. Comparison to State-of-the-Art Methods

As Table 1 shows, our model reaches the state-of-the-art level among open-source models. On the pass@1 metric, our model trained with only 10k reference samples surpasses RTLCoder trained with 27k reference samples by 1.0 and 1.3 percentage points on VerilogEval-machine and VerilogEval-human, respectively. With an equal number of reference samples, our model outperforms RTLCoder-DeepSeek-10k on all metrics, with relative pass@1 improvements of 12.5% on VerilogEval-machine and 16.9% on VerilogEval-human. In terms of model size, our model, with only 6.7 billion parameters, significantly outperforms VerilogEval(liu2023verilogeval), which has 16 billion parameters and a similar volume of reference samples (8502), further supporting the efficiency of our method. Additionally, we include a general-purpose code model, Codegen(nijkamp2022codegen), in our comparison. Its subpar performance on the VerilogEval benchmark underscores the importance of tailoring LLMs for RTL code.
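As a worked check, the relative pass@1 improvements over RTLCoder-DeepSeek-10k follow directly from the corresponding entries of Table 1:

```latex
\[
\frac{62.2-55.3}{55.3}\approx 12.5\%\ \text{(VerilogEval-machine)},\qquad
\frac{42.9-36.7}{36.7}\approx 16.9\%\ \text{(VerilogEval-human)}.
\]
```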

Relative to closed-source models, our model comprehensively surpasses GPT-3.5 on the benchmark. Against GPT-4, our model performs better on VerilogEval-machine but lags on VerilogEval-human. Considering the vast differences in training costs and the lightweight advantage of our model, these performance deficits are acceptable.

4.3. Effect of Iterations

Figure 2. Number of Iterations and Pass@1 on VerilogEval-human. As the iteration count increases, the pass@1 rate generally rises at first and then decreases.

To delve deeper into the impact of iteration, we plot how pass@1 on VerilogEval-human varies with the number of iterations in Figure 2. It can be clearly seen that, with 5 iterations and only 10k reference samples, our model beats the baseline model trained with 27k reference samples. From the first to the fifth iteration, the pass@1 rate exhibits a clear upward trend, indicating the efficacy of iterative training. From the fifth to the seventh iteration, the pass@1 rate gradually decreases, which can be explained by the mismatch between the surrogate reward function and the true reward function mentioned in Section 3.2. Similar behavior has been observed for LLMs trained with reward models on natural language. Our work suggests that a surrogate reward function based on code quality evaluation shares this property with reward models on natural language, which may inform subsequent improvements.

Figure 3 depicts the loss curves across iterations. Even though the loss converges within each round, its value can still decrease further by leveraging newly sampled data points from the updated model, supporting the effectiveness of the proposed iterative training approach. We also observe that, as the number of iterations increases, the marginal decrease in the loss gradually diminishes, which indicates that the surrogate reward function is nearing its optimization ceiling.

Figure 3. The loss function curves across each iteration. For better visualization, we represent the original loss curves with light-colored lines and the results of exponential smoothing with dark-colored lines; the vertical axis is on a log scale. As the iteration count increases, the loss decreases.
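The exponential smoothing mentioned in the caption of Figure 3 can be sketched as follows. The smoothing factor is not stated in the paper, so the default `smoothing=0.9` below is purely an assumption for illustration, as is the function name.

```python
def exponential_smoothing(values, smoothing=0.9):
    """Exponentially smooth a sequence of loss values.

    `smoothing` in [0, 1) weights the previous smoothed value;
    the first output equals the first input.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))
        else:
            smoothed.append(smoothing * smoothed[-1] + (1.0 - smoothing) * value)
    return smoothed
```

A larger `smoothing` value yields a flatter curve that lags further behind the raw loss.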

4.4. Ablation Study of Softmax Normalization

The baseline method(liu2023rtlcoder) applies a softmax normalization to the conditional log probabilities after Equation 1:

\[
p_{k}^{\prime}=\frac{e^{p_{k}}}{\sum_{i=1}^{K}e^{p_{i}}} \tag{13}
\]
Figure 4. Number of Iterations and Pass@1 on VerilogEval-human of approaches with and without softmax normalization.

We conduct an ablation study to investigate its impact on iterative training. All other settings remain unchanged except for the softmax normalization. From Figure 4, it can be observed that the pass@1 of the approach with softmax varies more smoothly across iterations, which may be caused by the vanishing gradients that result from softmax saturation. We choose to discard softmax to maximize performance, although softmax normalization remains an option when a more stable and smooth training process is required.
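The saturation effect hypothesized above can be illustrated numerically. The sketch below implements the normalization of Equation 13 and the derivative $\partial p_{k}^{\prime}/\partial p_{k}=p_{k}^{\prime}(1-p_{k}^{\prime})$; the input scores are made-up values for $K=4$ candidates, not measurements from the experiments, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Equation 13: p'_k = exp(p_k) / sum_i exp(p_i), computed stably."""
    peak = max(scores)  # shifting by the max leaves the result unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_sensitivity(scores):
    """Diagonal of the softmax Jacobian: d p'_k / d p_k = p'_k * (1 - p'_k)."""
    return [p * (1.0 - p) for p in softmax(scores)]


# Hypothetical scores: evenly matched candidates vs. one dominant candidate.
balanced = self_sensitivity([0.0, 0.0, 0.0, 0.0])
saturated = self_sensitivity([10.0, 0.0, 0.0, 0.0])
```

With evenly matched candidates each derivative is 0.25 × 0.75, whereas a single dominant candidate pushes every derivative close to zero, which is the sense in which a saturated softmax passes little gradient.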

5. Future Work

Reviewing previous theoretical and experimental analyses, we believe that exploring two directions could further enhance the capability of LLMs. The first is to find a more suitable surrogate reward mechanism to provide feedback. Current methods simply rely on evaluating similarity or syntax checking, which is too superficial to reflect the true quality of RTL code. A more comprehensive reward mechanism is likely to further unleash the potential of training methods.

Another direction is to develop more advanced optimization algorithms. In this work, to enhance simplicity and usability, we avoid using complex reinforcement learning methods, which might better optimize models with careful tuning of hyper-parameters. Drawing inspiration from Direct Preference Optimization (DPO)(rafailov2024direct) in natural language generation, we look forward to seeing both simpler and more advanced optimization solutions in the RTL code generation domain in the future.

6. Conclusion

In this paper, we propose an iterative training paradigm to fine-tune LLMs for RTL code generation. During each iteration, we employ the model trained in the previous iteration to update the samples, which are then used for training in the current round. From the perspective of maximizing the reward function, we theoretically analyze the advantage of our approach, which is subsequently validated by empirical results. With only about 37% of the reference samples, our model outperforms the state-of-the-art open-source LLMs, achieving 42.9% and 62.2% pass@1 rates on the VerilogEval-human and VerilogEval-machine benchmarks, respectively. Compared to GPT-4, our model achieves a comparable level of performance on the evaluation benchmarks with only 6.7B parameters and affordable overhead. Based on theoretical analysis and experiments, we anticipate future improvements through refining reward functions and exploring new training approaches.

Acknowledgements.
\printbibliography