
ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation

Peiyang Wu, Nan Guo, Xiao Xiao, Wenming Li, Xiaochun Ye, and Dongrui Fan. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.


Abstract.

Recently, large language models (LLMs) have demonstrated excellent performance in understanding human instructions and generating code, which has inspired researchers to explore the feasibility of generating RTL code with LLMs. However, existing approaches to fine-tuning LLMs on RTL code are typically conducted on fixed datasets, which do not fully stimulate the capability of LLMs and require large amounts of reference data. To mitigate these issues, we introduce a simple yet effective iterative training paradigm named ITERTL. During each iteration, samples are drawn from the model trained in the previous cycle, and these new samples are then used for training in the current cycle. Through this iterative approach, the distribution mismatch between the model and the training samples is reduced. Additionally, the model is able to explore a broader generative space and receive more comprehensive feedback. Theoretical analyses are conducted to investigate why the approach is effective. Experimental results show that the model trained with our approach can compete with and even outperform the state-of-the-art (SOTA) open-source model while using only about 37% of the reference samples, achieving 42.9% and 62.2% pass@1 rates on the two VerilogEval evaluation datasets respectively. When using the same amount of reference samples, our method achieves relative improvements of 16.9% and 12.5% in pass@1 over the non-iterative method. This study facilitates the application of LLMs for generating RTL code in practical scenarios with limited data.

RTL Code Generation, Large Language Model, Iteration, Reward Function

ccs: Computing methodologies Natural language generation

1. Introduction

Manually writing hardware description language (HDL) code (e.g., Verilog) is an unavoidable part of the current hardware design process. This step is often tedious and cumbersome, consuming a significant amount of engineers’ time. As LLMs have demonstrated excellent performance in natural language processing and code generation, researchers have begun to explore using LLMs to generate HDL code as an aid in hardware design(chang2023chipgpt; blocklove2023chip; thakur2023benchmarking; liu2023verilogeval; liu2023chipnemo).

To make LLMs more proficient at RTL code generation, a common method is to fine-tune them on domain-specific datasets. However, most existing methods conform to the conventional deep learning paradigm of first gathering data and then training the model. This approach may lead to two negative effects:

(1) Since the model is trained on a limited amount of collected data, the space available for the model to explore is constrained, resulting in a narrow coverage of feedback signals.

(2) There is a mismatch between the distributions of the training samples and the LLM under training, which can lead to estimation errors(li2023policy; liu2023statistical) in the optimization process. Specifically, if the training samples are not generated by the LLM being trained, their distributions are misaligned from the start. Even if the training samples are initially generated by the LLM being trained, the distributions will still diverge because the distribution of the LLM shifts during training.

Due to these two negative effects, the RTL code generation capability of existing fine-tuned models is limited. Additionally, a large number of reference samples is required for fine-tuning, which can be costly to obtain. To foster exploration and mitigate the distribution deviation between the training samples and the LLM being trained, an intuitive idea is to introduce reinforcement learning (RL) methods that allow LLMs to refine their policies through interaction with the environment, such as reinforcement learning from human feedback (RLHF)(ouyang2022training; stiennon2020learning) for general-purpose LLMs, or CodeRL(le2022coderl) and RLTF(liu2023rltf) for code LLMs. However, RL is well known to be complex, resource-intensive, and often unstable. Typically, RL algorithms need to maintain several models at the same time (such as the policy model, value model, and reference model in the PPO algorithm(schulman2017proximal)) and require extensive hyper-parameter tuning, which incurs significant computational overhead and poses substantial obstacles for users.

To tackle the aforementioned issues, we introduce a simple yet effective iterative training scheme to fine-tune LLMs for Verilog code generation. Unlike previous methods learning from the same training set, our approach updates the training samples through sampling iteratively during the training process, thus enhancing the exploration range of LLMs and enriching feedback. Since the training samples are iteratively sampled from the updated model, the mismatch between distributions of the training samples and LLMs, as well as the resulting estimation error, are mitigated accordingly. As the first method which iteratively updates training samples in the RTL code generation field, our method enables the model to achieve state-of-the-art (SOTA) performance even with a smaller number of reference samples. Moreover, our approach is much easier to implement than complicated reinforcement learning. Only minimal hyper-parameter tuning is required, and there is no need to handle multiple models during the training process.

Our contributions can be outlined as follows:

(i) We develop an iterative supervised fine-tuning scheme for training LLMs to generate Verilog code. By expanding the exploration scope of the model and reducing estimation errors caused by distribution mismatch, this scheme effectively boosts Verilog code generation capability and substantially decreases the number of externally sourced reference samples needed for training.

(ii) Through a reward-maximization viewpoint, we theoretically analyze the limitations of previous methods and reveal the superiority of our approach.

(iii) With only about 37% of the reference samples, we outperform the state-of-the-art (SOTA) open-source LLM, attaining 42.9% and 62.2% pass@1 rates on the VerilogEval-human and VerilogEval-machine benchmarks respectively. With the same quantity of reference samples, our method achieves relative improvements of 16.9% and 12.5% in pass@1 over the non-iterative approach. Relative to commercial closed-source models, we surpass GPT-3.5 and approach the level of GPT-4 on the benchmarks.

2. Background

2.1. Large Language Models for RTL Code Generation

Previous works(blocklove2023chip; chang2023chipgpt) have attempted to directly prompt LLMs to generate RTL code, achieving notable results. Chip-chat(blocklove2023chip) designs an 8-bit accumulator-based microprocessor with the assistance of commercial LLMs. In practical applications, however, it is challenging to generate usable code directly from human instructions. As a result, recent works(thakur2023autochip; delorenzo2024make) explore optimizing the RTL code generated by LLMs with feedback from Verilog tools. AutoChip(thakur2023autochip) utilizes error reports from compilers and simulators to help LLMs rectify faulty code. The work in (delorenzo2024make) develops a Monte Carlo tree-search (MCTS) algorithm that helps LLMs generate correct and PPA-efficient code using feedback from compilers and synthesis tools.

Instead of using off-the-shelf LLMs directly, some researchers(liu2023verilogeval) opt to train LLMs to specialize them in hardware design. VeriGen(thakur2023benchmarking) leverages corpora from GitHub and Verilog textbooks to fine-tune open-source LLMs, outperforming the state-of-the-art commercial Codex(chen2021evaluating) LLM on 17 Verilog problems. ChipNeMo(liu2023chipnemo) customizes LLaMa2(touvron2023llama) for chip design applications such as chatbots, EDA tool script generation, and bug summarization. RTLCoder(liu2023rtlcoder) develops an automated flow to generate instruction-code pairs for supervised fine-tuning (SFT) and proposes a new SFT method that leverages code quality assessment.

To better measure the performance of LLMs on Verilog code generation, researchers have proposed corresponding evaluation benchmarks. VerilogEval(liu2023verilogeval) has released an open-source evaluation dataset, including 156 questions along with their corresponding golden solutions. RTLLM(lu2024rtllm) presents an open-source benchmark that evaluates the quality of the generated code from three progressive perspectives: syntax, functionality, and design quality.

2.2. Fine-tuning LLMs with RTL data

To tailor LLMs for RTL code generation, researchers often need to fine-tune LLMs with domain-specific data. VeriGen(thakur2023benchmarking) fine-tunes LLMs by predicting the next token on corpora from open-source code and textbooks, which can be regarded as a form of continual pre-training. To improve the model’s ability to follow instructions, VerilogEval(liu2023verilogeval) applies SFT to LLMs using synthetic instruction-code pairs. Furthermore, RTLCoder(liu2023rtlcoder) introduces a new fine-tuning algorithm that applies code quality evaluation to candidates sampled from pretrained LLMs. However, during the training process, the aforementioned methods are constrained by static, unchanging training samples. According to previous research(yuan2023rrhf; liu2022brio) on natural language, updating training samples using the in-training LLM can significantly boost model performance. Related research on Verilog code generation remains scarce, and our work can be seen as a pioneering attempt.

3. Approach

In this section, we first detail the workflow of our proposed approach. We then analyze the deficiencies of related methods and show the improvements of our approach from a theoretical viewpoint.

3.1. Framework

As an iterative training scheme, our approach boosts generation capability by alternating between sampling and training. Within a single loop, we largely follow RTLCoder(liu2023rtlcoder) for ease of implementation. The key distinction lies in our core idea of alternating iterations, which can significantly enhance performance.

Figure 1. ITERTL Framework. The model $\pi_{t}$ is instructed to generate code responses, which, along with the reference code $a_{K}$ , are used to optimize the model itself. Once the optimization process converges, the new model $\pi_{t+1}$ is used to sample responses in the next iteration.

We first briefly introduce the procedure in a single loop, and then explain the iteration process. As shown in Figure 1, in the $t$ -th round of training, for each input instruction $s$ , there are $K$ corresponding output responses $a_{k}^{t}$ , $1\leq k\leq K$ . Among these, the first $K-1$ responses are sampled from the model $\pi_{t}$ obtained from the previous training iteration. The last response $a_{K}^{t}$ , which serves as the reference data, can be obtained from another teacher LLM or a human, denoted as $\pi_{teacher}$ . Assuming the distribution of $\pi_{teacher}$ remains unchanged, repeated sampling has little impact, so we omit the superscript $t$ from $a_{K}^{t}$ . Each response $a_{k}^{t}$ is assigned a quality score $z_{k}^{t}$ using Pyverilog and ROUGE-L, following (liu2023rtlcoder).
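As a rough illustration of the similarity half of the quality score, the sketch below computes a ROUGE-L F1 value from the longest common subsequence of whitespace-separated tokens. The exact scoring rule is defined in RTLCoder and is not reproduced here: the `quality_score` function, its `syntax_ok` flag (a stand-in for a Pyverilog syntax check), and the whitespace tokenization are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between whitespace-tokenized candidate and reference code."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def quality_score(candidate, reference, syntax_ok):
    """Hypothetical rule (assumption): syntax failures score 0, otherwise ROUGE-L F1."""
    return rouge_l_f1(candidate, reference) if syntax_ok else 0.0
```

Under this assumed rule, the reference response always receives the maximum score of 1, while sampled responses are ranked by how closely they resemble it once they pass the syntax check.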

The loss function comprises two components: the ranking loss(liu2023rtlcoder; yuan2023rrhf; liu2022brio) and the Maximum Likelihood Estimation (MLE) loss.

Conditional log probability (length-normalized) is calculated leveraging the model $\pi$ that is being trained:

\begin{equation}
p_{k}=\frac{\sum_{j}\log P_{\pi}\left(a_{k,j}^{t}\mid s,a_{k,<j}^{t}\right)}{\left\|a_{k}^{t}\right\|}
\tag{1}
\end{equation}

Combining the quality score $z$ and log probability $p$ , the ranking loss(liu2023rtlcoder; yuan2023rrhf; liu2022brio) can be computed:

\begin{equation}
L_{ranking}^{t}=\sum_{z_{k}^{t}<z_{\tau}^{t}-\beta}\max\left(p_{k}-p_{\tau}+\alpha,0\right)
\tag{2}
\end{equation}

Unlike RTLCoder(liu2023rtlcoder), we do not apply softmax normalization to $p$ , based on subsequent experimental results.
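The length-normalized log probability and the ranking loss can be sketched in a few lines of plain Python. This is only a numerical illustration of Equations 1 and 2 on scalar values, not the training implementation; the function names are hypothetical, and the default margins are the values of $\alpha$ and $\beta$ reported in the experimental setup.

```python
def length_normalized_logprob(token_logprobs):
    """Equation 1: mean of the per-token log probabilities of one response."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(p, z, alpha=0.3, beta=0.2):
    """Equation 2: hinge penalty over pairs (k, tau) with z[k] < z[tau] - beta."""
    loss = 0.0
    for k in range(len(p)):
        for tau in range(len(p)):
            if z[k] < z[tau] - beta:
                # Penalize a lower-quality response whose log probability is not
                # at least alpha below that of the higher-quality response.
                loss += max(p[k] - p[tau] + alpha, 0.0)
    return loss
```

A pair contributes only when the quality gap exceeds $\beta$ , so responses of near-equal quality are never pushed apart.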

Another loss component is the common cross entropy loss relative to the reference sample $a_{K}$ :

\begin{equation}
L_{ce}=-\sum_{j}\log P_{\pi}\left(a_{K,j}\mid s,a_{K,<j}\right)
\tag{3}
\end{equation}

The overall loss function is a weighted sum of both:

\begin{equation}
L^{t}=L_{ce}+\lambda L_{ranking}^{t}
\tag{4}
\end{equation}

To control the scale relationship between the two loss functions, we set $\lambda$ equal to $sg(L_{ce})$ , where $sg(\cdot)$ represents a stop-gradient operation.
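The effect of the stop-gradient choice can be written out explicitly. In the short derivation below, $\theta$ denotes the trainable parameters of $\pi$ , a symbol introduced here only for illustration. Because $sg(L_{ce})$ is treated as a constant during differentiation, the cross-entropy value rescales the ranking gradient without receiving any gradient through that factor:

```latex
\begin{align*}
L^{t} &= L_{ce} + sg(L_{ce})\,L_{ranking}^{t},\\
\nabla_{\theta}L^{t} &= \nabla_{\theta}L_{ce} + sg(L_{ce})\,\nabla_{\theta}L_{ranking}^{t}.
\end{align*}
```

Numerically, the loss equals $L_{ce}(1+L_{ranking}^{t})$ , so the ranking term shrinks in step with the cross-entropy term as training progresses.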

After the $t$ -th round of training, we obtain the converged model $\pi_{t+1}$ . Even though the capability of $\pi_{t+1}$ has improved compared to $\pi_{t}$ , there remains room for further enhancement. We therefore resample new responses $a_{k}^{t+1}$ using the new model $\pi_{t+1}$ and assess new quality scores $z_{k}^{t+1}$ . Resampling brings new feedback because the model’s distribution has shifted after the previous round of training. We compute the loss function $L^{t+1}$ and update the parameters with the new data. This process is repeated until training concludes.
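The alternation between sampling and training can be summarized as a short loop. This is a structural sketch only: `sample`, `score`, and `train` are hypothetical callables standing in for model generation, quality scoring, and one converged round of fine-tuning, and the default values mirror the iteration count and number of candidates reported in the experimental setup.

```python
def iterative_finetune(policy, prompts, references, sample, score, train,
                       num_iterations=7, num_candidates=4):
    """Alternate between sampling from the current model and training on the samples.

    `sample`, `score`, and `train` are hypothetical callables supplied by the caller:
      sample(policy, prompt, n)  -> list of n responses drawn from the current model
      score(response, reference) -> quality score z for one response
      train(policy, batch)       -> updated policy after one round of training
    """
    for _ in range(num_iterations):
        batch = []
        for prompt, reference in zip(prompts, references):
            # K - 1 responses come from the current model; the K-th is the reference.
            responses = sample(policy, prompt, num_candidates - 1) + [reference]
            scores = [score(r, reference) for r in responses]
            batch.append((prompt, responses, scores))
        # Train on the freshly sampled data, then resample in the next round.
        policy = train(policy, batch)
    return policy
```

The only difference from a non-iterative scheme is that `sample` is called with the updated `policy` in every round instead of once before training.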

3.2. Theoretical Analysis

Inspired by the previous work(li2023policy), we regard fine-tuning LLMs as a reward maximization problem, which can be formulated as follows:

\begin{equation}
\max_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi(\cdot\mid s)}[r(s,a)],
\tag{5}
\end{equation}

where $\mathbb{E}$ denotes the expected value, $s$ represents the instruction (prompt), following the distribution $\rho(\cdot)$ . $a$ represents the response (code) generated by the LLM $\pi$ which is being optimized. $r$ represents the reward for the response, in other words, the quality of the generated code.

Ideally, the true reward for the generated Verilog code should incorporate the evaluation of its functional correctness, as well as PPA metrics. However, it’s intractable to implement because verifying the functional correctness requires a comprehensive corresponding testbench, which is nearly impossible for researchers to develop given the large amount of samples for fine-tuning. Additionally, assessing PPA relies on logic synthesis, which also consumes a significant amount of time. Therefore, it is necessary to approximate the reward function to make it more manageable.

Let’s first consider a naive approach that directly fine-tunes LLMs using reference data drawn from $\pi_{teacher}$ , which can be considered a form of knowledge distillation. The optimization objective can be derived as follows:

\begin{equation}
\begin{split}
\min_{\pi}\sum_{j}-\log P_{\pi}\left(a_{j}\mid s,a_{<j}\right)
&=\min_{\pi}\sum_{j}\log\frac{P_{teacher}\left(a_{j}\mid s,a_{<j}\right)}{P_{\pi}\left(a_{j}\mid s,a_{<j}\right)}\\
&=\min_{\pi}\sum_{j}KL(P_{teacher,j}\,\|\,P_{\pi,j})\\
&\simeq\max_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi_{teacher}(\cdot\mid s)}\Big[\sum_{j}-KL(P_{teacher,j}\,\|\,P_{\pi,j})\Big]
\end{split}
\tag{6}
\end{equation}

The first equality holds because $P_{teacher}$ is independent of $\pi$ , so adding $\log P_{teacher}$ does not change the minimizer. The approximate equality in the third line is based on the law of large numbers. Comparing Equations 5 and 6, it is clear that this approach leads to systematic errors in two aspects\footnote{If $s$ is synthesized by another model rather than obtained from real samples, the estimation is also biased. We omit this aspect in this work since it is not the focus of our research.}: (1) replacing the distribution of the model being trained, $\pi(\cdot\mid s)$ , with that of $\pi_{teacher}(\cdot\mid s)$ , and (2) using the negative Kullback-Leibler (KL) divergence between the probability distribution of the reference data and $\pi$ as the surrogate reward function.

Next, let’s consider a non-iterative degenerate version of our approach, where a constant distribution $\pi_{c}$ is used to generate training samples. This version is quite similar to RTLCoder(liu2023rtlcoder). We can also transform its optimization objective into a reward maximization problem. To simplify the derivation, we present only the ranking loss component from Equation 2. The term associated with the reference data in the ranking loss is also omitted, which does not affect our conclusions.

\begin{equation}
\begin{split}
\min_{\pi}\,L_{ranking}
&=\min_{\pi}\Big[\sum_{z_{k}<z_{\tau}-\beta}\max\left(p_{k}-p_{\tau}+\alpha,0\right)\Big]\\
&\simeq\min_{\pi}\big[\max\left(p_{k}-p_{\tau}+\alpha,0\right)\mathbb{I}(z_{k}<z_{\tau}-\beta)\big]\\
&\simeq\max_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a_{k}\sim\pi_{c}(\cdot\mid s)}\big\{\\
&\qquad\mathbb{E}_{a_{\tau}\sim\pi_{c}(\cdot\mid s)}[-\max\left(p_{k}-p_{\tau}+\alpha,0\right)\mathbb{I}(z_{k}<z_{\tau}-\beta)]\big\}\\
&=\max_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi_{c}(\cdot\mid s)}[r_{\pi}(s,a)]
\end{split}
\tag{7}
\end{equation}

Here, we denote the surrogate reward function as $r_{\pi}$ to show its dependency on $\pi$ . Comparing equations 6 and 7, we find that this approach offers two improvements over the previous naive approach. (1) $\pi_{c}$ can be adjusted to an appropriate distribution to reduce the mismatch with $\pi$ , for instance, selecting the initial distribution $\pi_{0}$ . (2) The surrogate reward function incorporates the evaluation of code quality, which is more reasonable than a simple KL divergence.

Based on the analysis of the above two cases, we can formally reveal the superiority of our method. Using similar notations, our optimization scheme can be formulated as:

\begin{equation}
\pi_{t+1}=\operatorname*{arg\,max}_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi_{t}(\cdot\mid s)}[r_{\pi}(s,a)]
\tag{8}
\end{equation}

After the $t$ -th iteration, $\pi$ has been updated, progressively deviating from the previous distributions $\pi_{old}$ ( $\pi_{0}$ ,… $\pi_{t-1}$ ). By replacing $\pi_{old}$ with $\pi_{t}$ for sampling, our method effectively mitigates the distribution mismatch. Furthermore, by the law of large numbers, increasing the number of samples yields better estimates of the expected values.

We can also analyze the potential bottleneck of our approach from Equation 8. Since the surrogate reward function $r_{\pi}$ evaluates code quality merely from a syntax perspective, ignoring functional correctness and PPA, there is a mismatch between $r_{\pi}$ and the true reward function $r$ . As a result, although $r_{\pi}$ may increase with each iteration, the potential improvement of the model is still constrained, which is verified by our experiments in Section 4. We anticipate that a better-designed surrogate reward function could further enhance the effectiveness of our approach in the future.

From another perspective, our approach can also be regarded as a degenerate variant of the generalized Expectation-Maximization (EM) algorithm. Consider the following optimization problem:

\begin{equation}
\max_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi(\cdot\mid s)}[r_{\pi}(s,a)]
\tag{9}
\end{equation}

Since $\pi$ appears in both the sampling distribution of the expectation and the reward function, an EM-style optimization procedure would alternate between the following two steps:

\begin{equation}
\pi_{t+1}=\operatorname*{arg\,max}_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi_{t}(\cdot\mid s)}[r_{\pi}(s,a)]
\tag{10}
\end{equation}

\begin{equation}
\pi_{t+2}=\operatorname*{arg\,max}_{\pi}\mathbb{E}_{s\sim\rho(\cdot)}\mathbb{E}_{a\sim\pi(\cdot\mid s)}[r_{\pi_{t+1}}(s,a)]
\tag{11}
\end{equation}

Equation 10 corresponds to our optimization approach, while optimizing Equation 11 typically requires reinforcement learning methods such as policy gradients, which may increase the complexity and instability of the whole scheme. We therefore simply set $\pi_{t+2}$ equal to $\pi_{t+1}$ to avoid these issues, and leave the exploration of more precise optimization methods to future research.

4. Experimental Evaluation

4.1. Experimental Setup

Benchmark: We choose a comprehensive evaluation dataset, VerilogEval(liu2023verilogeval), as the benchmark to measure performance. This dataset consists of diverse Verilog tasks ranging from simple to complex, such as combinational circuits, finite state machines, code debugging, and testbench construction. Two sets of design instructions are provided: the first, named VerilogEval-machine, is generated by an LLM and contains 143 samples; the other, named VerilogEval-human, is manually written and comprises 156 samples. Functional correctness is evaluated with the Icarus Verilog simulator by comparing the outputs of the generated design with those of the golden solution.

Table 1. Performance on VerilogEval Benchmark(liu2023verilogeval). The results of GPT-3.5, GPT-4, VerilogEval, Codegen and VeriGen come from (liu2023verilogeval). The results of RTLCoder come from (liu2023rtlcoder). The best results are marked in bold. If the best result comes from a closed-source model (like GPT-4), then the best result from the open-source model is also highlighted in bold. Additionally, the second best result is underlined.

Model	Params	Machine Pass@1 (%)	Machine Pass@5 (%)	Machine Pass@10 (%)	Human Pass@1 (%)	Human Pass@5 (%)	Human Pass@10 (%)
GPT-3.5 (Closed-Source)	-	46.7	69.1	74.1	26.7	45.8	51.7
GPT-4 (Closed-Source)	-	60.0	70.6	73.5	43.5	55.8	58.9
Codegen(nijkamp2022codegen)	16B	5.0	17.6	25.8	1.6	6.1	9.4
VeriGen(thakur2023benchmarking)	16B	33.8	59.2	67.9	24.5	45.3	53.2
VerilogEval-8.5k(liu2023verilogeval)	16B	46.2	67.3	73.7	28.8	45.9	52.3
RTLCoder-DeepSeek-10k(liu2023rtlcoder)	6.7B	55.3	70.4	76.2	36.7	47.0	50.4
RTLCoder-DeepSeek-27k(liu2023rtlcoder)	6.7B	61.2	76.5	81.8	41.6	50.1	53.4
Ours-DeepSeek-10k	6.7B	62.2	75.0	79.0	42.9	50.0	53.8

Metric: As in much code generation research, the pass@ $k$ metric(kulal2019spoc) is employed to measure performance. It regards a problem as solved if at least one of the $k$ generated code samples passes the unit tests. Specifically, for each problem, the model generates $n\geq k$ candidates, of which $c\leq n$ pass the unit tests. The pass@ $k$ metric is then computed with the following unbiased estimator(chen2021evaluating):

\begin{equation}
\text{pass@}k=\underset{\text{Problems}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
\tag{12}
\end{equation}

To avoid the impact of randomness, we mainly focus on the pass@1 value under greedy decoding. For a comprehensive evaluation, we also measure the pass@ $k=\{1,5,10\}$ metrics setting $n=10$ under Top-p decoding.
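The estimator in Equation 12 is straightforward to compute with exact binomial coefficients. The sketch below is an illustrative implementation using only the Python standard library; the function names and the `(n, c)` input layout are choices made here, not part of the benchmark tooling.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with $n=10$ samples of which $c=3$ pass gives pass@1 $=0.3$ and pass@5 $=1-21/252\approx 0.917$ .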

Decoding Strategy: During the training stage, in order to enhance the diversity and promote the exploration, we use Top-p decoding strategy with $top_{p}=0.95$ and $temperature=0.5$ . In the testing stage, to mitigate random errors, the model is prompted to generate responses with $temperature=\{0(greedy\,decoding),0.2,0.5,0.8\}$ and $top_{p}=0.95$ . For each test metric (pass@1, pass@5, and pass@10), the best result is chosen.
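For readers unfamiliar with the decoding settings, the sketch below shows how temperature scaling and Top-p (nucleus) filtering interact when choosing a single token. It is a generic illustration over a plain list of logits, not the decoding code used in the experiments; the function name is hypothetical, and the defaults are the training-stage values given above.

```python
import math
import random


def top_p_sample(logits, temperature=0.5, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling followed by nucleus filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy decoding
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

A lower temperature sharpens the distribution before filtering, so fewer tokens survive the Top-p cut and the samples become less diverse; a temperature of 0 reduces to greedy decoding.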

Training Details: DeepSeek-Coder-Instruct-6.7B(guo2024deepseek) is chosen as the pre-trained model. We use the open-source instruction-code pairs from (liu2023rtlcoder) as reference data. This dataset contains nearly 27k samples, from which we randomly sample only 10,000 entries to train the final version of our model. The learning rate is set to $10^{-5}$ . The AdamW optimizer is employed with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . bf16 mixed-precision training is adopted to avoid overflow. During each iteration, the model is trained for 3 epochs. The total number of iterations is set to 7. We conduct all experiments on 4 NVIDIA A800 GPUs. As for the parameters mentioned in Section 3.1, we set the number of candidates $K=4$ . The hyper-parameters $\alpha$ and $\beta$ in Equation 2 are set to 0.3 and 0.2, respectively.

4.2. Comparison to State-of-the-Art Methods

As Table 1 shows, our model reaches the state-of-the-art level among open-source models. On the pass@1 metric, our model trained with only 10k reference samples surpasses RTLCoder trained with 27k reference samples by 1.0 and 1.3 percentage points on VerilogEval-machine and VerilogEval-human respectively. With an equal amount of reference samples, our model outperforms RTLCoder-DeepSeek-10k on all metrics, with relative improvements of 12.5% and 16.9% in pass@1 in particular. In terms of parameter count, with only 6.7 billion parameters, our model significantly outperforms VerilogEval(liu2023verilogeval), which has 16 billion parameters and a similar volume of reference samples (8502), further demonstrating the efficiency of our method. Additionally, we include a general-purpose code model, Codegen(nijkamp2022codegen), in the comparison. Its subpar performance on the VerilogEval benchmark underscores the importance of tailoring LLMs for RTL code.
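The relative figures quoted above follow directly from the pass@1 entries in Table 1 and the dataset sizes in the training details. The short calculation below reproduces them; the helper name `relative_gain` is introduced here for illustration.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# pass@1 values copied from Table 1 (Ours-DeepSeek-10k vs. RTLCoder-DeepSeek-10k).
machine_gain = relative_gain(62.2, 55.3)
human_gain = relative_gain(42.9, 36.7)
# Share of the roughly 27k-sample reference set used when training on 10,000 samples.
reference_share = 100.0 * 10_000 / 27_000

print(f"machine: {machine_gain:.1f}%, human: {human_gain:.1f}%, data: {reference_share:.0f}%")
# prints "machine: 12.5%, human: 16.9%, data: 37%"
```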

Relative to closed-source models, our model comprehensively surpasses GPT-3.5 on the benchmark. Against GPT-4, our model performs better on VerilogEval-machine but lags on VerilogEval-human. Considering the vast differences in training costs and the lightweight advantage of our model, these performance deficits are acceptable.

4.3. Effect of Iterations

To delve deeper into the impact of iteration, we plot how pass@1 on VerilogEval-human varies with the number of iterations in Figure 2. It can be clearly seen that, with 5 iterations and only 10k reference samples, our model beats the baseline model trained with 27k reference samples. From the first to the fifth iteration, the pass@1 rate exhibits a clear upward trend, indicating the efficacy of iterative training. From the fifth to the seventh iteration, the pass@1 rate gradually decreases, which can be explained by the mismatch between the surrogate reward function and the true reward function mentioned in Section 3.2. Similar observations have been reported for LLMs trained with reward models on natural language. Our work reveals that the surrogate reward function based on code quality evaluation and reward models for natural language have similar properties, providing inspiration for subsequent improvements.

Figure 3 depicts the loss curves across iterations. Even though the loss converges within each round, its value can still decrease further by leveraging newly sampled data points from the updated model, validating the effectiveness of the proposed iterative training approach. Another observation is that as the number of iterations increases, the marginal decrease in the loss gradually diminishes, which indicates that the surrogate reward function is nearing its optimization ceiling.

4.4. Ablation Study of Softmax Normalization

The baseline method(liu2023rtlcoder) employs a softmax normalization of the conditional log probability after Equation 1:

\begin{equation}
p_{k}^{\prime}=\frac{e^{p_{k}}}{\sum_{i=1}^{K}e^{p_{i}}}
\tag{13}
\end{equation}

We conduct an ablation study to investigate its impact on iterative training. All other settings remain unchanged except for softmax normalization. From Figure 4, it can be observed that the pass@1 of the approach with softmax shows smoother variations across iterations, which may be caused by vanishing gradients due to softmax saturation. We choose to discard softmax to maximize performance. Softmax normalization may still be worth considering when a more stable and smooth training process is required.
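One way to see why softmax normalization can damp the ranking signal is to compare the gap between two candidates before and after Equation 13 is applied. The sketch below uses made-up log probabilities chosen purely for illustration; it shows only the compression of score gaps and does not reproduce the ablation itself.

```python
import math


def softmax(values):
    """Numerically stable softmax (Equation 13) over the K candidate scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


p = [-0.2, -0.4, -1.5, -3.0]       # illustrative length-normalized log probabilities
p_soft = softmax(p)
raw_gap = p[0] - p[3]              # gap between best and worst candidate before softmax
soft_gap = p_soft[0] - p_soft[3]   # the same gap after softmax, always below 1
```

Because the normalized values lie between 0 and 1 and sum to 1, every pairwise gap is bounded by 1 regardless of how far apart the raw log probabilities are, which is consistent with the smoother but weaker updates observed with softmax.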

5. Future Work

Reviewing previous theoretical and experimental analyses, we believe that exploring two directions could further enhance the capability of LLMs. The first is to find a more suitable surrogate reward mechanism to provide feedback. Current methods simply rely on evaluating similarity or syntax checking, which is too superficial to reflect the true quality of RTL code. A more comprehensive reward mechanism is likely to further unleash the potential of training methods.

Another direction is to develop more advanced optimization algorithms. In this work, to enhance simplicity and usability, we avoid using complex reinforcement learning methods, which might better optimize models with careful tuning of hyper-parameters. Drawing inspiration from Direct Preference Optimization (DPO)(rafailov2024direct) in natural language generation, we look forward to seeing both simpler and more advanced optimization solutions in the RTL code generation domain in the future.

6. Conclusion

In this paper, we propose an iterative training paradigm to fine-tune LLMs for RTL code generation. During each iteration, we employ the model trained in the previous iteration to update the samples, which are then used for training in the current round. From the perspective of maximizing the reward function, we theoretically analyze the superiority of our approach, which is subsequently validated by empirical results. With just about 37% of the reference samples, our model outperforms the state-of-the-art open-source LLMs, with 42.9% and 62.2% pass@1 rates on the VerilogEval-human and VerilogEval-machine benchmarks respectively. Compared to GPT-4, our model achieves a comparable level of performance on the evaluation benchmarks with only 6.7B parameters and affordable overhead. Based on theoretical analysis and experiments, we anticipate future improvements through refining reward functions and exploring new training approaches.

Acknowledgements.

\printbibliography