Adaptive Draft-Verification for Efficient Large Language Model Decoding

Xukun Liu
Northwestern University
[email protected]
Bowen Li
Texas A&M University
[email protected]
Ruqi Zhang
Purdue University
[email protected]
Dongkuan Xu
North Carolina State University
[email protected]
Abstract

Large language model (LLM) decoding generates a sequence of tokens from a given context, predicting one token at a time from the model’s learned probabilities. Standard autoregressive decoding requires a separate forward pass through the model for each generated token, which is computationally inefficient and makes LLMs hard to deploy in latency-sensitive scenarios. Existing acceleration approaches either require fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce ADED (project repo: https://anonymous.4open.science/r/ADED-C7D5), a method that accelerates LLM decoding without requiring fine-tuning. Our approach uses an adaptive draft-verification process that evolves over time to improve efficiency. We use a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the draft model to adjust to changing token probabilities during decoding. Additionally, we implement a draft construction mechanism that balances exploration and exploitation, so that the generated drafts are both diverse and close to the true output distribution of the LLM. This design optimizes the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates decoding while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.

1 Introduction

Large language model (LLM) decoding generates a sequence of tokens from a given context, predicting one token at a time from the model’s learned probabilities (Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023a, b). The core mechanism is autoregressive: each new token is generated conditioned on the previously generated tokens and the given context. This process is crucial for applications like text generation (Li et al., 2024a; Peng et al., 2023; Chang et al., 2023), machine translation (Zhang et al., 2023; Moslem et al., 2023; Hendy et al., 2023), and conversational AI (Shanahan, 2024; Wu et al., 2023; Saka et al., 2023). However, each decoding step involves a forward pass through the model, making the process inherently sequential and computationally expensive. The inefficiency arises because the model weights must be reloaded for each token prediction, leading to high computational cost and memory bandwidth usage. This serial nature of decoding is a significant bottleneck, especially for real-time applications (Liu et al., 2023a; Mandvikar, 2023; Antoniol et al., 1994) where latency is critical. Thus, optimizing the decoding speed of LLMs is essential for practical deployment in real-world scenarios.

Figure 1: Comparison of different LLM decoding strategies: Speculative Decoding (Leviathan et al., 2023), Lookahead (Fu et al., 2024a), REST (He et al., 2023), and our ADED. In Speculative Decoding, a small LLM generates predictions (red blocks) from inputs (blue blocks). Yellow blocks indicate intermediate results obtained from the language model. Lookahead uses a large LLM for forward-looking predictions. REST employs a corpus trie for rapid token lookups. ADED integrates Monte Carlo Tree Search with tri-gram (Martin et al., 1998; Golding and Schabes, 1996; Zhu and Rosenfeld, 2001) statistics and recent token history to simulate potential outputs, refining its recommendations over time. By continuously evolving its draft constructions, ADED offers advantages in speed and accuracy over the fixed or resource-intensive methods used by the others.

Recent research has explored various strategies to mitigate the inefficiencies of LLM decoding. Speculative Decoding (Leviathan et al., 2023; Spector and Re, 2023; Chen et al., 2023) uses a smaller, more efficient model to generate several token predictions, which are then verified in parallel by the larger target model. This method leverages the efficiency of smaller models to reduce the number of serial forward passes required, achieving substantial speedups without altering the output distribution. Lookahead Decoding (Fu et al., 2024a) uses the full context to predict multiple future tokens, creating a buffer that reduces the dependency on sequential processing. REST (He et al., 2023) employs a retrieval-based approach in which relevant tokens are fetched from a pre-constructed datastore using the current context, forming drafts that are verified by the LLM. These methods can all be summarized within the draft-verification pipeline, as shown in Figure 4. Speculative Decoding and Lookahead Decoding both generate draft tokens through predictive models, while REST constructs drafts from retrieved tokens based on the context. In each case, the drafts are then verified by the main LLM, ensuring that the final output adheres to the model’s learned probabilities. Despite their advancements, these approaches face notable limitations. They often require additional training or fine-tuning, which can be resource-intensive. Fixed retrieval schemes lack adaptability, making it hard to adjust the draft distribution in real time based on the evolving LLM output. Additionally, these methods may not generalize well across different models and contexts, limiting their effectiveness in dynamic environments.

In this work, we focus on fine-tuning-free draft-verification to address these limitations. The draft-verification pipeline can be viewed as a rejection sampling procedure in which the similarity between the proposal distribution (draft) and the target distribution (LLM output) determines the acceptance rate and convergence speed. Higher similarity yields a higher acceptance rate and faster decoding. The few existing fine-tuning-free approaches, e.g., REST (He et al., 2023), typically use fixed retrieval-based schemes to construct drafts. These schemes cannot adjust the draft distribution to the evolving LLM output distribution, leaving a persistent gap between the draft and the actual LLM output. This gap reduces the draft acceptance rate and limits the potential for improving decoding speed. To address this issue, we raise the following question:

Question: How can we design an adaptive draft construction process that evolves on its own and accurately approximates LLM outputs during decoding?

To introduce adaptability and find drafts that move increasingly close to the LLM output distribution during decoding, we need not only an adaptive draft construction pipeline but also a balance between exploration and exploitation. This balance lets us achieve speedups by leveraging existing knowledge of draft construction while continuously exploring better drafts. To achieve this, we propose a novel methodology called ADED (Adaptive Draft-Verification for Efficient LLM Decoding). ADED incorporates a tri-gram-matrix-based adaptive LLM representative that controls the conditional probability distribution of the next token and can be updated during decoding to adjust the draft construction accordingly. To balance exploration and exploitation, we design a draft maker inspired by Monte Carlo Tree Search (MCTS) (Coulom, 2007; Browne et al., 2012; James et al., 2017; Świechowski et al., 2023). This draft maker uses a token preference score to maintain the balance during the search process. The score consists of two parts: the first is based on the approximate conditional probability distribution of the next token obtained from the LLM representative, reflecting the draft maker’s current knowledge of the LLM output; the second encourages the draft maker to explore unexplored or less-explored regions of the draft space. Theoretically, our method can be viewed as a constrained optimization problem that encourages the draft distribution to converge to the LLM output distribution (see Appendix A). Using the token preference score, the draft maker can effectively search the draft space and generate candidate tokens. After draft construction and verification are completed, the result is fed back to the LLM representative to update its approximation of the LLM output. This feedback loop enriches the draft maker’s knowledge in subsequent rounds of draft-verification, enabling adaptability and self-evolution in the draft construction process.

In summary, our contributions are as follows:

  • We design a tri-gram matrix-based LLM representation that dynamically approximates the LLM output distribution, enhancing adaptability without the need for fine-tuning. This approach addresses the limitation of fixed retrieval schemes by continuously evolving with the model’s predictions.

  • We develop a draft maker inspired by MCTS, which effectively balances exploration and exploitation to generate high-quality drafts. This mechanism improves decoding speed and accuracy by ensuring that the drafts are closely aligned with the LLM’s output distribution. Our experiments show a 2.5X improvement in decoding speed compared to baselines.

  • Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates the decoding process while maintaining high accuracy. Specifically, we achieve up to a 2.5X speedup in latency and an average acceptance rate improvement of 20% over existing methods.

  • Our method reduces computational overhead and memory usage, making it suitable for deployment in a wide range of practical, real-time applications. Its ability to adapt to evolving LLM outputs and continuously refine draft construction sets it apart from existing methods, addressing the need for more flexible and dynamic LLM decoding solutions.

2 Methodology

We propose a new, fast, fine-tuning-free draft-verification LLM decoding method that introduces adaptability into decoding and learns from the LLM. Existing accelerated decoding algorithms either require additional fine-tuning or lack adaptability to the LLM’s output distribution, resulting in significant additional cost or insufficient acceleration. To address these issues, we design an adaptive LLM representation based on a tri-gram matrix to adaptively approximate the output distribution of the LLM; develop an MCTS-based draft maker that balances exploration and exploitation for self-evolution towards high-quality drafts; and verify the drafts using tree attention.

2.1 Preliminary: Speculative Decoding & Monte Carlo Tree Search

Speculative decoding is a method to accelerate language model inference by using a smaller auxiliary model to generate a draft sequence, reducing the computational load on the larger model Leviathan et al. (2023). Retrieval-based speculative decoding extends this by incorporating a retrieval system instead of the smaller model, leveraging pre-stored corpus segments for relevant text generation. Monte Carlo Tree Search (MCTS) (Coulom, 2007; Browne et al., 2012; James et al., 2017; Świechowski et al., 2023) is an AI algorithm that optimizes decision-making by balancing exploration and exploitation of future states. It selects nodes for further exploration using a combination of node visit counts and estimated values, aiming to maximize overall outcomes. For a comprehensive discussion of these methods, please refer to Appendix C.

2.2 Adaptive LLM Representative

To approximate the output token distribution of the LLM without fine-tuning a small model, we distill linguistic knowledge from a small corpus and construct a tri-gram matrix as an initial representation of the LLM, which lets us leverage the statistical regularities of language at a granular level. Specifically, we count every three-token sequence (tri-gram) that appears in the corpus and compute the probability of the third token conditional on the first two tokens, as defined in Eq. (1):

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}, \tag{1}$$

where $P(w_i \mid w_{i-2}, w_{i-1})$ is the conditional probability of a word $w_i$ given the two preceding words $w_{i-2}$ and $w_{i-1}$, $C(w_{i-2}, w_{i-1}, w_i)$ is the count of the tri-gram occurrence in the corpus, and $C(w_{i-2}, w_{i-1})$ is the count of the preceding bi-gram (Mori et al., 1998).
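The counting step behind Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_trigram_table` and `conditional_probs` are hypothetical, the corpus is a toy list of integer token ids, and the bi-gram count in the denominator is taken over bi-grams that are followed by a third token, so that each conditional distribution sums to one.

```python
from collections import Counter, defaultdict


def build_trigram_table(token_ids):
    """Map each bi-gram (w_{i-2}, w_{i-1}) to a Counter over the next token w_i."""
    table = defaultdict(Counter)
    for w2, w1, w in zip(token_ids, token_ids[1:], token_ids[2:]):
        table[(w2, w1)][w] += 1
    return table


def conditional_probs(table, w2, w1):
    """Return P(w | w2, w1) as in Eq. (1); empty dict if the bi-gram is unseen."""
    followers = table.get((w2, w1))
    if not followers:
        return {}
    total = sum(followers.values())  # C(w_{i-2}, w_{i-1})
    return {w: count / total for w, count in followers.items()}


# Toy corpus of token ids (assumed data, for illustration only).
corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]
table = build_trigram_table(corpus)
probs = conditional_probs(table, 1, 2)  # token 3 follows (1, 2) twice, token 4 once
```

Storing the table as a mapping from bi-gram to follower counts keeps lookups cheap during drafting, since only the followers of the current two-token context need to be touched.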

In this way, we can obtain a good initial LLM representative at a much lower cost, which can generate an approximate distribution of the next token based on the previous tokens. This LLM representative will collaborate with our draft maker to generate drafts and get feedback to update the tri-gram matrix for adaptability and self-evolution. Please see Section 2.3 for more details.

2.3 Draft Maker and Self-Evolution

With the help of the LLM representative, we further propose a draft maker that balances exploration and exploitation while searching for candidate drafts that are closer to the LLM output. On the one hand, our draft maker leverages the conditional probabilities from the LLM representative, which include current knowledge of the LLM output. On the other hand, our draft maker is encouraged to search more in the unexplored or less explored draft space to find better draft candidates. Then, with feedback from the LLM output, the LLM representative can update its understanding of the LLM output, improve the draft maker’s search, and achieve self-evolution. Details are provided below.

Draft Search Score: Given the initial tokens, we use Monte Carlo Tree Search (MCTS) (Coulom, 2007) to guide the search for drafts of the next tokens, prioritizing candidate tokens according to the conditional probability from the tri-gram matrix-based LLM representative and the node visit counts accumulated during the tree search. The score plays a key role in balancing exploration and exploitation during the tree search and is defined in Eq. (2). It is a variant of the PUCT score (Rosin, 2011; Silver et al., 2017). More specifically, $Q(s,a)$ assesses the quality of taking action $a$ in state $s$, while $P(s,a)$ represents the prior probability of selecting action $a$ in state $s$. The term $N(s,a)$ denotes the number of times action $a$ has been taken from state $s$, and $\sum_b N(s,b)$ sums the counts for all actions from state $s$. The constants $c_1$ and $c_2$ adjust the balance between exploration and exploitation, improving the decision-making process in draft construction. This formula ensures that our draft choices are contextually appropriate and optimizes the robustness and coherence of text generation.

$$\max_{a} \left[ Q(s,a) + P(s,a) \cdot \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \left( c_1 + \log\left( \frac{\sum_{b} N(s,b) + c_2 + 1}{c_2} \right) \right) \right]. \tag{2}$$
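To make the score in Eq. (2) concrete, the sketch below evaluates it for each candidate token at a single tree node and picks the maximizer. It is illustrative only: the function names, the dictionary layout of `children`, and the default values of `c1` and `c2` are assumptions, not values reported in the paper.

```python
import math


def puct_score(q, prior, n_sa, n_total, c1=1.25, c2=19652.0):
    """Score of Eq. (2) for one action; c1 and c2 are placeholder constants."""
    exploration = prior * math.sqrt(n_total) / (1 + n_sa)
    exploration *= c1 + math.log((n_total + c2 + 1) / c2)
    return q + exploration


def select_token(children, c1=1.25, c2=19652.0):
    """Pick the child token with the highest score.

    `children` maps token -> {"q": value estimate, "prior": tri-gram
    probability, "visits": visit count}; this layout is hypothetical.
    """
    n_total = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda t: puct_score(
            children[t]["q"], children[t]["prior"], children[t]["visits"],
            n_total, c1, c2,
        ),
    )


children = {
    "a": {"q": 0.5, "prior": 0.2, "visits": 10},  # well explored, decent value
    "b": {"q": 0.1, "prior": 0.7, "visits": 0},   # unexplored, high prior
}
best = select_token(children)  # the unexplored high-prior token wins here
```

In this toy node the exploration bonus for the unvisited token outweighs the higher value estimate of the visited one, which is the behavior the second term of the score is meant to produce.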
Figure 2: Self-Evolution strategy transfer. This diagram shows how the LLM representative updates its tri-gram probabilities based on recent LLM outputs and feeds these probabilities into the MCTS. This dynamic update improves the generation of coherent sequences by evolving the search strategy over time.

Self-Evolution Strategy Transfer: Based on the final search scores obtained during the Monte Carlo tree search, we construct draft candidates, verify them to obtain the final decoding output (see Section 2.4), and feed that output back for self-evolution. The verified output is a sample from the LLM’s output distribution, which makes it good learning material for the LLM representative. We therefore feed it into the LLM representative to obtain updated conditional probability distributions, giving the draft maker more accurate and exploitable knowledge, as illustrated in Figure 2. Specifically, the technique first extracts tri-grams from recent outputs of the LLM. Each tri-gram’s frequency is then used to update its probability as a potential output. These adjusted probabilities are fed into the MCTS as the policy prior, influencing the selection phase of the tree search. In the context of MCTS, the updated tri-gram probabilities serve as a dynamic policy guide, improving the draft maker’s ability to generate contextually relevant and coherent sequences. By incorporating learned tri-gram probabilities into the tree search algorithm, we create a feedback loop in which the search strategy itself evolves over time. This adjustment recalibrates the exploration-exploitation balance based on empirical data derived from the model’s own outputs.
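The feedback loop described above can be illustrated with a small Python sketch. It assumes the simplest possible update rule, namely incrementing tri-gram counts with every verified output; the paper does not specify the exact weighting, and the class name `TrigramRepresentative` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class TrigramRepresentative:
    """Tri-gram table that can be refreshed with verified LLM output."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, tokens):
        """Add the tri-grams of `tokens` (corpus text or verified output)."""
        for w2, w1, w in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(w2, w1)][w] += 1

    def prior(self, w2, w1):
        """Current estimate of P(w | w2, w1), used as the MCTS prior."""
        followers = self.counts.get((w2, w1))
        if not followers:
            return {}
        total = sum(followers.values())
        return {w: c / total for w, c in followers.items()}


rep = TrigramRepresentative()
rep.update([5, 6, 7, 5, 6, 7])   # initial corpus: (5, 6) is always followed by 7
before = rep.prior(5, 6)
rep.update([5, 6, 8, 5, 6, 8])   # verified LLM output: (5, 6) followed by 8
after = rep.prior(5, 6)          # the prior now splits mass between 7 and 8
```

After the update, the prior for the context (5, 6) has shifted toward what the LLM actually produced, so later searches from that context propose token 8 more readily.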

2.4 Draft Construction and Verification

Many draft sequences share common prefixes, which would cause redundant computation in the Transformer layers if each draft were verified separately. To avoid this, we follow He et al. (2023) and build a pseudo-sequence in which each draft is a sub-sequence and every common prefix appears only once. We also apply a specific attention mask in each attention layer, called tree attention (Miao et al., 2023; Cai et al., 2024). This mask restricts each token to attend only to the tokens it depends on in its original draft sequence, preserving the draft’s contextual integrity and preventing unnecessary computation. Drafts are accepted by comparison with the conditional distribution from the LLM. At each position, a new token is sampled from the LLM and compared to the draft token. If the sampled token matches the draft token, it is accepted; otherwise, the draft is discarded from that point on. This selective acceptance ensures that the output sequence matches what standard autoregressive decoding would produce, preserving the fidelity of the generated text.
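The pseudo-sequence, the tree attention mask, and the acceptance rule can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the mask covers only draft tokens (in practice every draft token also attends to the prompt), and `target_tokens` stands in for the tokens the LLM selects at each draft position.

```python
def flatten_drafts(drafts):
    """Merge drafts into a prefix tree so each shared prefix appears once.

    Returns the flattened token list and, for each position, the index of
    its parent position (-1 for a root token).
    """
    tokens, parents = [], []
    index = {}  # prefix tuple -> position in the flattened sequence
    for draft in drafts:
        parent = -1
        for depth in range(len(draft)):
            prefix = tuple(draft[: depth + 1])
            if prefix not in index:
                index[prefix] = len(tokens)
                tokens.append(draft[depth])
                parents.append(parent)
            parent = index[prefix]
    return tokens, parents


def tree_attention_mask(parents):
    """mask[i][j] is True iff position j is position i itself or an ancestor."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def accepted_prefix(draft, target_tokens):
    """Keep draft tokens up to the first mismatch with the LLM's tokens."""
    accepted = []
    for draft_token, target_token in zip(draft, target_tokens):
        if draft_token != target_token:
            break
        accepted.append(draft_token)
    return accepted


drafts = [[1, 2, 3], [1, 2, 4], [1, 5]]
tokens, parents = flatten_drafts(drafts)      # five tokens instead of eight
mask = tree_attention_mask(parents)
kept = accepted_prefix([1, 2, 4], [1, 2, 3])  # first two tokens match
```

Here three drafts with eight tokens in total collapse into a five-token pseudo-sequence, and the mask ensures, for example, that token 4 sees only its own path 1, 2, 4 and not the sibling token 3.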

3 Theoretical Insight: Why ADED uses MCTS

In this section, we explore the theoretical parallels between the Monte Carlo Tree Search (MCTS) algorithm used in our ADED framework and the inference mechanisms of large language models (LLMs) to motivate the use of MCTS and the self-evolution of ADED. We show that draft search in ADED using MCTS can be viewed as a form of regularized policy optimization, and that the inference mechanism of the LLM can be viewed as a similar form of regularized (penalized) optimization.

MCTS in ADED: The token selection procedure in ADED decoding can be viewed as an action selection process. The MCTS algorithm optimizes its policy by iteratively building a search tree and updating visit counts for each node (state-action pair) based on the search paths. The visit count distribution $\hat{\pi}(a \mid x)$ is defined as:

$$\hat{\pi}(a \mid x) \triangleq \frac{1 + n(x,a)}{|A| + \sum_{b} n(x,b)}, \tag{3}$$

where $n(x,a)$ represents the visit count for action $a$ in state $x$. Then, the action selection in MCTS can be written as selecting the action $a^{*}$:

$$a^{*}(x) \triangleq \arg\max_{a} \left[ Q(x,a) + \lambda_N \cdot \frac{\pi_\theta(a \mid x)}{\hat{\pi}(a \mid x)} \right]. \tag{4}$$

Following Grill et al. (2020), we use $q \in \mathbb{R}^{|A|}$ to denote the vector of Q-function values $Q(x,a)$. With a proper choice of hyper-parameters, the MCTS algorithm can be viewed as searching for the optimum solution to a policy optimization problem (Grill et al., 2020) as below:

$$\bar{\pi} \triangleq \arg\max_{y \in S} \left[ q^{\top} y - \lambda_N \, \text{KL}[\pi_\theta, y] \right], \tag{5}$$

where $S$ is the $|A|$-dimensional simplex, $\lambda_N$ is a regularization parameter that depends on hyperparameters and balances exploration and exploitation, and KL is the KL-divergence.
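The three quantities in Eqs. (3) to (5) can be computed directly for a small action set. The sketch below is an illustration, not part of ADED: function names are hypothetical, priors are assumed strictly positive, and the solver for Eq. (5) relies on the stationarity condition of its Lagrangian, $y_a = \lambda_N \pi_\theta(a) / (\alpha - q_a)$, with the multiplier $\alpha$ found by bisection so that the policy sums to one.

```python
def visit_distribution(counts):
    """Eq. (3): smoothed empirical visit distribution."""
    total = len(counts) + sum(counts.values())
    return {a: (1 + n) / total for a, n in counts.items()}


def select_action(q, prior, counts, lam):
    """Eq. (4): maximize Q plus the prior-to-visit ratio scaled by lam."""
    pi_hat = visit_distribution(counts)
    return max(q, key=lambda a: q[a] + lam * prior[a] / pi_hat[a])


def regularized_policy(q, prior, lam, iters=100):
    """Eq. (5): solve for y_a = lam * prior_a / (alpha - q_a) by bisection.

    Assumes lam > 0 and strictly positive priors, so every denominator
    stays positive on the search interval.
    """
    lo = max(q[a] + lam * prior[a] for a in q)  # total mass >= 1 here
    hi = max(q.values()) + lam                  # total mass <= 1 here
    for _ in range(iters):
        alpha = (lo + hi) / 2
        mass = sum(lam * prior[a] / (alpha - q[a]) for a in q)
        if mass > 1:
            lo = alpha
        else:
            hi = alpha
    alpha = (lo + hi) / 2
    return {a: lam * prior[a] / (alpha - q[a]) for a in q}


prior = {"a": 0.3, "b": 0.7}
pi_hat = visit_distribution({"a": 3, "b": 0})
choice = select_action({"a": 0.5, "b": 0.4}, {"a": 0.5, "b": 0.5},
                       {"a": 3, "b": 0}, lam=0.1)
flat = regularized_policy({"a": 0.0, "b": 0.0}, prior, lam=1.0)
tilted = regularized_policy({"a": 1.0, "b": 0.0}, prior, lam=0.5)
```

With equal Q-values the regularized policy coincides with the prior, and raising the Q-value of one action shifts probability mass toward it, which is the trade-off that $\lambda_N$ controls.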

LLM Inference Mechanism: Large language models, particularly those based on the Transformer architecture, generate text by predicting the probability distribution of the next token given the previous tokens. During inference, the model maximizes the log-likelihood of the observed data, which is equivalent to minimizing the cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(w_t \mid w_{1:t-1}; \theta), \tag{6}$$

where $P$ denotes the conditional probability of the LLM, $w$ denotes the tokens, and $\theta$ denotes the model parameters. Regularization techniques, such as KL divergence, are often incorporated to prevent overfitting and ensure generalization:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(w_t \mid w_{1:t-1}; \theta) + \lambda \, \text{KL}(P_{\text{model}}, P_{\text{data}}). \tag{7}$$

Comparison between MCTS & LLM Inference: As shown in Eq. (5) and Eq. (7), both MCTS and LLM inference can be viewed as regularized optimization problems for selecting the distribution of the next tokens. On the one hand, the Q-function in MCTS for ADED can be viewed as an approximation to the log-likelihood of the LLM:

$$Q(x,a) = -\sum_{t=1}^{T} \log \hat{P}(w_t \mid w_{t-1}, w_{t-2}; \theta) \approx -\sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, w_{t-2}; \theta) \approx -\sum_{t=1}^{T} \log P(w_t \mid w_{1:t-1}; \theta), \tag{8}$$

where $\hat{P}$ and $P$ are the conditional probability distributions from the tri-gram-matrix-based LLM representative and the LLM, respectively. On the other hand, both MCTS and LLM inference improve the optimization procedure by employing regularization techniques. This comparison shows the similarities between MCTS and LLM inference in terms of optimization and regularization, and motivates our choice of MCTS for the ADED framework.

4 Experiments

4.1 Experimental Setup

Models and Datasets  To evaluate the efficacy of ADED for LLM inference, we run a series of experiments with five models on three datasets. We test our algorithm on three Vicuna models (Chiang et al., 2023) (7B, 13B, 33B) and two LLaMA2-chat models (Touvron et al., 2023b) (7B, 13B) to evaluate its acceleration capabilities across different model sizes and types. Our assessment uses the HumanEval (Chen et al., 2021), MT-Bench (Zheng et al., 2023), and Alpaca (Taori et al., 2023) datasets to cover general natural language understanding and generation. These datasets were chosen to provide a comprehensive analysis of our acceleration technique across various tasks.

Corpus  We construct two corpora. The first is built from a portion of the Python pre-training code in The Stack (Kocetkov et al., 2022), comprising approximately 2.7M Python code samples with a resulting size of 1007MB. The second is built from UltraChat (Ding et al., 2023) data, consisting of around 774K ChatGPT conversations, which produces a corpus of 574MB.

Metrics  To assess the performance of our acceleration algorithms on large language models, we utilize two main metrics: speedup ratio and average acceptance length. The speedup ratio, calculated as the ratio of the time required by the baseline models to complete inference tasks without acceleration to the time required by our ADED, effectively measures the efficiency gains introduced by our algorithm. The second metric, average acceptance length, measures the average number of tokens accepted per forward pass by the target large language models, excluding any overhead of retrieving and constructing draft tokens, indicating the maximum possible acceleration.
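Both metrics are simple ratios, as the short sketch below shows. The function names and the sample numbers are hypothetical and are not measurements from the paper; the per-step counts are assumed to record how many tokens each target-model forward pass accepted.

```python
def speedup_ratio(baseline_seconds, accelerated_seconds):
    """Time of plain autoregressive decoding divided by time with acceleration."""
    return baseline_seconds / accelerated_seconds


def average_acceptance_length(accepted_per_step):
    """Mean number of tokens accepted per forward pass of the target LLM."""
    return sum(accepted_per_step) / len(accepted_per_step)


# Illustrative numbers only.
ratio = speedup_ratio(10.0, 4.0)                       # 2.5
avg_accept = average_acceptance_length([1, 3, 2, 2])   # 2.0
```

Because the acceptance length ignores the cost of building drafts, it bounds the achievable speedup from above, while the speedup ratio reflects the end-to-end wall-clock gain.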

Baselines  We compare against two existing approaches for improving the decoding speed of large language models. Lookahead Decoding (Fu et al., 2024a) is an exact, parallel decoding algorithm that cuts latency without relying on draft models. REST (He et al., 2023) (Retrieval-Based Speculative Decoding) adopts a retrieval-based strategy to create draft tokens, in contrast to conventional speculative decoding methods that rely on a draft model. Together, these baselines provide a solid framework for evaluating the efficiency of our proposed acceleration technique. All experiments are conducted on an NVIDIA A6000, except for the 33B model, which uses an NVIDIA A100. Unless otherwise noted, the experiments use greedy sampling.

4.2 Main Experimental Results

In our experiments, we evaluate the different methods across multiple models on three datasets: MT-Bench, HumanEval, and Alpaca. We focus on accept length, latency, and speed-up to evaluate efficiency and responsiveness.

Table 1: Latency comparison of ADED and baselines. ADED achieves the lowest latency in 9 of the 10 model-dataset settings.

             MT-Bench latency (ms)                         Alpaca latency (ms)
Model        REST   REST Single Thread  Lookahead  ADED    REST   REST Single Thread  Lookahead  ADED
vicuna-7B    16.05  16.31               20.53      12.95   13.21  13.51               20.86      13.18
vicuna-13B   25.43  25.99               33.55      22.94   22.27  22.65               34.62      22.80
vicuna-33B   32.90  33.26               42.13      29.17   31.87  32.41               42.69      29.23
llama2-7B    16.08  17.67               18.47      13.88   13.55  19.46               19.44      13.02
llama2-13B   27.13  29.80               32.07      22.90   23.88  27.24               32.72      22.55

Table 1 summarizes the latency results for different models and methods on the MT-Bench and Alpaca datasets. ADED achieves the lowest latency in nearly all settings, with the largest margins on MT-Bench. For instance, on MT-Bench, ADED achieves a latency of 12.95 ms for vicuna-7B, which is lower than REST (16.05 ms), REST Single Thread (16.31 ms), and Lookahead (20.53 ms). On the Alpaca dataset the gap to REST is smaller: ADED achieves a latency of 13.18 ms for vicuna-7B, compared to 13.21 ms for REST, 13.51 ms for REST Single Thread, and 20.86 ms for Lookahead, and for vicuna-13B REST is marginally faster (22.27 ms versus 22.80 ms). Notably, the memory required for ADED (574 MB) is only about 5% of that required for REST (12 GB).

Figure 3: Comparison of accept length and speed up for different models: (a) accept length on MT-Bench, (b) accept length on Alpaca, (c) speed up on MT-Bench, and (d) speed up on Alpaca. The performance of ADED is evaluated against REST, REST Single Thread, and Lookahead across all models and benchmarks. ADED consistently shows improvements in both accept length and speed up, demonstrating its effectiveness and robustness in different scenarios.

The accept length measures how many draft tokens the target model accepts per forward pass, so a longer accept length indicates drafts that match the target model's output more closely. Our method, ADED, outperforms the other methods across different models on both the MT-Bench and Alpaca datasets. For example, on MT-Bench, ADED achieves the highest accept length for the vicuna-33B and llama2-13B models, showing that its drafts track the target model well.

Additionally, we evaluate speed up to determine the efficiency of each method. ADED consistently shows a significant improvement in speed up across all models and datasets. This efficiency is particularly noticeable on the MT-Bench and Alpaca datasets, where ADED not only reduces latency but also increases overall processing speed, supporting the robustness of our retrieval algorithms. For instance, ADED achieves a speed up of 2.4x on the MT-Bench dataset with the vicuna-13B model, outperforming REST, REST Single Thread, and Lookahead.

Figure 4: Comparison of speed up. ADED shows notable improvements.

Further analysis on the HumanEval dataset reveals that ADED achieves considerable speed up improvements. The speed up results highlight the efficiency of ADED in handling larger datasets without compromising on retrieval time or output quality. The vicuna-13B model, for example, demonstrates a speed up of nearly 2.5x when using ADED, which is a substantial improvement over the baseline methods.

These results collectively indicate that ADED provides a balanced improvement in both the quality of generated outputs and the efficiency of the retrieval process. The ability of ADED to maintain high performance across diverse tasks and datasets underscores its versatility and reliability for real-world applications. The improvements in latency, accept length, and speed up metrics affirm the efficacy of ADED in delivering superior performance while managing larger datasets effectively.

4.3 Stability of ADED

In this section, we analyze the stability of our algorithm, ADED, across different categories of tasks. The categories considered include writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. The experimental results, as shown in Table 2, indicate that ADED maintains consistent performance across all categories. The average accept length remains stable, demonstrating that ADED can effectively handle a diverse range of tasks without significant variations in performance.

To further evaluate the robustness of ADED, we examined the effects of varying the top-p and temperature parameters on its performance. Figures 5(a) and 5(b) illustrate the impact of these parameters on the average accept length.

Table 2: Average Accept Length for the Tasks.
| Task | Avg Accept Length |
| --- | --- |
| Writing | 2.37 |
| Roleplay | 2.21 |
| Reasoning | 2.10 |
| Math | 2.33 |
| Coding | 2.45 |
| Extraction | 2.11 |
| STEM | 2.19 |
| Humanities | 2.17 |
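To quantify the stability claim, the per-task values in Table 2 can be summarized by their mean and spread. The short sketch below only uses the numbers reported in Table 2; the choice of population standard deviation and relative range as stability indicators is our own illustrative assumption, not a metric used in the experiments.

```python
from statistics import mean, pstdev

# Average accept length per MT-Bench category, copied from Table 2.
accept_length = {
    "Writing": 2.37,
    "Roleplay": 2.21,
    "Reasoning": 2.10,
    "Math": 2.33,
    "Coding": 2.45,
    "Extraction": 2.11,
    "STEM": 2.19,
    "Humanities": 2.17,
}

values = list(accept_length.values())
avg = mean(values)                                  # overall mean across categories
spread = pstdev(values)                             # population standard deviation
lowest = min(accept_length, key=accept_length.get)  # category with the shortest accept length
highest = max(accept_length, key=accept_length.get) # category with the longest accept length
relative_range = (max(values) - min(values)) / avg  # (max - min) as a fraction of the mean

print(f"mean={avg:.3f} std={spread:.3f} range={relative_range:.1%}")
print(f"lowest={lowest} highest={highest}")
```

The values range from 2.10 (Reasoning) to 2.45 (Coding) around a mean of about 2.24, which is the sense in which performance is described as stable across categories.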

Figure 5(a) shows that changes in the top-p parameter do not significantly affect the performance of ADED. The average accept length remains relatively stable across different values of top-p, indicating that ADED is not overly sensitive to this parameter.

Similarly, Figure 5(b) demonstrates that variations in the temperature parameter have minimal impact on the performance of ADED. The consistency in the average accept length across different temperature values further supports the robustness of our algorithm.

Figure 5: Sensitivity of ADED to (a) top-p and (b) temperature parameters. ADED maintains stable acceleration under various settings.

These results confirm that ADED exhibits robust performance across a variety of tasks and maintains stability despite changes in key parameters, making it a versatile and reliable choice for diverse applications.

5 Ablation Study

To gain a deeper understanding of our method, we conduct a series of ablation studies and analyses focused on each individual component. Please see full ablation studies in the Appendix.

Figure 6: Adaptive Strategy Comparison on MT-Bench: Performance comparison of Vicuna-7B model with and without the adaptive strategy on the MT-Bench dataset, showing the advantage of using the adaptive approach.

Effect of the adaptive strategy. Figure 6 illustrates the performance impact of our adaptive strategy on two models, Vicuna-7B and Vicuna-13B, with a comparative analysis of average accepted lengths over varying token counts. The graphs show that the adaptive strategy consistently maintains higher average accepted lengths than the non-adaptive variant across the input range for both models. The adaptive strategy's success can be attributed to its dynamic adjustment of the model's probability distributions based on the tri-gram frequencies from prior outputs. This allows the model to better manage longer contexts and maintain relevance, enhancing stability and coherence in longer interactions. The marked performance improvement, particularly at larger token counts, highlights the adaptive strategy's efficacy in sustaining effective and coherent outputs in extended sequences.
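The idea of adjusting draft probabilities from tri-gram frequencies of prior outputs can be illustrated with a minimal count-based sketch. This is an assumption-laden simplification: the `TrigramTable` class, its plain frequency update, and the greedy drafting rule are hypothetical stand-ins, and the actual ADED update rule and MCTS-based draft construction are not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


class TrigramTable:
    """Toy adaptive tri-gram model over integer token ids (illustrative only)."""

    def __init__(self) -> None:
        # Maps a two-token context (a, b) to counts of the token that followed it.
        self.counts: Dict[Tuple[int, int], Counter] = defaultdict(Counter)

    def update(self, tokens: Sequence[int]) -> None:
        """Fold newly verified output tokens into the tri-gram counts."""
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def probs(self, a: int, b: int) -> Dict[int, float]:
        """Relative frequency of each next token after the context (a, b)."""
        followers = self.counts.get((a, b))
        if not followers:
            return {}
        total = sum(followers.values())
        return {tok: n / total for tok, n in followers.items()}

    def draft(self, context: Sequence[int], max_len: int) -> List[int]:
        """Greedily extend the context (at least two tokens) by up to max_len tokens."""
        out: List[int] = []
        a, b = context[-2], context[-1]
        for _ in range(max_len):
            p = self.probs(a, b)
            if not p:
                break  # unseen context: stop drafting
            nxt = max(p, key=p.get)
            out.append(nxt)
            a, b = b, nxt
        return out


table = TrigramTable()
table.update([1, 2, 3, 4])
first_draft = table.draft([1, 2], max_len=3)   # follows the only sequence seen so far
table.update([1, 2, 5])
table.update([1, 2, 5])
second_draft = table.draft([1, 2], max_len=3)  # counts now favor token 5 after (1, 2)
print(first_draft, second_draft)
```

After the two additional updates the draft for the same context changes, which is the adaptive behavior that the non-adaptive variant lacks.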

Effect of the corpus size. Table 3 demonstrates how increasing the corpus size from 300k tokens to 700k tokens affects various performance metrics. With the expansion of the corpus, there is a gradual improvement in the accept length from 2.17 to 2.33. This increase suggests that larger datasets provide a broader array of language patterns, which enhances the model's ability to generate more coherent and contextually relevant outputs. Despite the growth in data size, from 253 MB to 574 MB, and a slight increase in retrieval time from 2.1 ms to 3.1 ms, the system maintains efficient data handling.

Table 3: Effect of Corpus Size.
| Corpus Size (tokens) | Corpus Size (storage) | Retrieval Time | Accept Length | Speed Up |
| --- | --- | --- | --- | --- |
| 300k | 253 MB | 2.1 ms | 2.17 | 1.93 |
| 500k | 467 MB | 2.7 ms | 2.24 | 2.01 |
| 700k | 574 MB | 3.1 ms | 2.33 | 2.07 |

The modest rise in retrieval time underscores the efficiency of the retrieval algorithms, which manage larger datasets without significantly compromising response speed. Overall, the results indicate that larger corpus sizes improve the model’s output quality while maintaining good system performance.

These findings highlight the importance of corpus size for ADED: a larger corpus yields steady gains in accept length and speed up at a modest cost in memory and retrieval time.

6 Related Work

Approach Without Draft Models. A significant portion of recent advances in language model decoding strategies has focused on improving efficiency without relying on draft models. Two notable approaches in this realm are Lookahead decoding Fu et al. (2024a) and Retrieval-Based Speculative Decoding (REST) He et al. (2023). Lookahead decoding is an approach that enhances the efficiency of the decoding process through the prediction of subsequent tokens via Jacobi Iteration Sleijpen and Van der Vorst (2000). It employs a heuristic to estimate the future cost of a sequence without the need to explicitly create a draft. This technique not only accelerates the decoding process by minimizing the number of tokens to be processed but also seeks to preserve or improve the quality of the output text by taking into account potential future scenarios in the decision-making process. REST He et al. (2023) introduces a retrieval-enhanced generation model that speculatively decodes sequences without the need for producing preliminary drafts. It instead searches and prioritizes possible continuations from an already established sequence database. This approach utilizes the inherent redundancy in natural language to anticipate probable continuations, thereby eliminating the requirement for draft sequences and greatly reducing computational costs. Both Lookahead decoding and REST demonstrate the capabilities of decoding methods that avoid intermediate drafts, providing a more straightforward and computationally efficient route to generating high-quality text.

Approach With Draft Models. Draft models are also used to improve decoding efficiency. Techniques such as Speculative Decoding (Leviathan et al., 2023; Spector and Re, 2023; Chen et al., 2023; Stern et al., 2018), Medusa Cai et al. (2024), Eagle Li et al. (2024b), and various other approaches requiring draft models (Zhang et al., 2024; Liu et al., 2023b; Kim et al., 2024; Fu et al., 2024b) fall into this category, utilizing models to generate drafts. Although these methods aim to speed up response times and reduce computational load during initial text generation, their adoption comes with significant drawbacks. The primary issue is the necessity for additional training specific to the draft models, which can be resource-intensive. Moreover, these techniques generally depend on GPU resources (Kwon et al., 2023; Sheng et al., 2023; Park et al., 2024) for inference, potentially limiting their application in environments where such hardware is unavailable or when operating under strict resource constraints. This dependence not only increases operational costs, but also restricts flexibility in deployment scenarios.

7 Conclusion

ADED improves the LLM decoding process by introducing adaptability and efficiency, significantly reducing latency and computational demands. This method achieves up to a 2.5x speedup in decoding and a 20% improvement in acceptance rates, outperforming traditional techniques. Unlike existing approaches, ADED dynamically adjusts the draft distribution using a tri-gram matrix and enhances draft quality through MCTS, eliminating the need for fine-tuning. The continuous feedback loop ensures ongoing improvements in draft generation. While ADED demonstrates robust performance across various benchmarks, future work will focus on further optimizing the adaptability mechanisms and exploring its application in more diverse real-world scenarios. Additionally, addressing potential limitations in extremely large-scale deployments will be a priority.

References

  • Antoniol et al. [1994] Giuliano Antoniol, Fabio Brugnara, Mauro Cettolo, and Marcello Federico. Language model estimations and representations for real-time continuous speech recognition. In ICSLP, pages 859–862, 1994.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • Browne et al. [2012] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
  • Cai et al. [2024] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
  • Chang et al. [2023] Jonathan D Chang, Kiante Brantley, Rajkumar Ramamurthy, Dipendra Misra, and Wen Sun. Learning to generate better than your llm. arXiv preprint arXiv:2306.11816, 2023.
  • Chen et al. [2023] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Coulom [2007] Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In H. Jaap van den Herik, Paolo Ciancarini, and H. H. L. M. (Jeroen) Donkers, editors, Computers and Games, pages 72–83, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75538-8.
  • Ding et al. [2023] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
  • Fu et al. [2024a] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding, 2024a.
  • Fu et al. [2024b] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024b.
  • Golding and Schabes [1996] Andrew R Golding and Yves Schabes. Combining trigram-based and feature-based methods for context-sensitive spelling correction. arXiv preprint cmp-lg/9605037, 1996.
  • Grill et al. [2020] Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. Monte-carlo tree search as regularized policy optimization. CoRR, abs/2007.12509, 2020. URL https://arxiv.org/abs/2007.12509.
  • He et al. [2023] Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. Rest: Retrieval-based speculative decoding, 2023.
  • Hendy et al. [2023] Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How good are gpt models at machine translation? a comprehensive evaluation, 2023.
  • James et al. [2017] Steven James, George Konidaris, and Benjamin Rosman. An analysis of monte carlo tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • Kim et al. [2024] Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder. Advances in Neural Information Processing Systems, 36, 2024.
  • Kocetkov et al. [2022] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code, 2022.
  • Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  • Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023.
  • Li et al. [2024a] Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Pre-trained language models for text generation: A survey. ACM Computing Surveys, 56(9):1–39, 2024a.
  • Li et al. [2024b] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024b.
  • Liu et al. [2023a] Jijia Liu, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, Yi Wu, and Yu Wang. Llm-powered hierarchical language agent for real-time human-ai coordination. arXiv preprint arXiv:2312.15224, 2023a.
  • Liu et al. [2023b] Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, and Hao Zhang. Online speculative decoding, 2023b.
  • Mandvikar [2023] Shreekant Mandvikar. Factors to consider when selecting a large language model: A comparative analysis. International Journal of Intelligent Automation and Computing, 6(3):37–40, 2023.
  • Martin et al. [1998] Sven Martin, Jörg Liermann, and Hermann Ney. Algorithms for bigram and trigram word clustering. Speech communication, 24(1):19–37, 1998.
  • Miao et al. [2023] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 1(2):4, 2023.
  • Mori et al. [1998] Shinsuke Mori, Masafumi Nishimura, and Nobuyasu Itoh. Word clustering for a word bi-gram model. In ICSLP. Citeseer, 1998.
  • Moslem et al. [2023] Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. Adaptive machine translation with large language models, 2023.
  • Park et al. [2024] Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision llm: Low-cost deployment of multiple, different-sized llms. arXiv preprint arXiv:2402.10517, 2024.
  • Peng et al. [2023] Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare. NPJ Digital Medicine, 6(1):210, 2023.
  • Rosin [2011] Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61:203–230, 2011. URL https://api.semanticscholar.org/CorpusID:207081359.
  • Saka et al. [2023] Abdullahi B Saka, Lukumon O Oyedele, Lukman A Akanbi, Sikiru A Ganiyu, Daniel WM Chan, and Sururah A Bello. Conversational artificial intelligence in the aec industry: A review of present status, challenges and opportunities. Advanced Engineering Informatics, 55:101869, 2023.
  • Shanahan [2024] Murray Shanahan. Talking about large language models. Communications of the ACM, 67(2):68–79, 2024.
  • Sheng et al. [2023] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Sleijpen and Van der Vorst [2000] Gerard LG Sleijpen and Henk A Van der Vorst. A jacobi–davidson iteration method for linear eigenvalue problems. SIAM review, 42(2):267–293, 2000.
  • Spector and Re [2023] Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding, 2023.
  • Stern et al. [2018] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models, 2018.
  • Świechowski et al. [2023] Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2023.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  • Wu et al. [2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023.
  • Zhang et al. [2024] Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, and Yunfei Cheng. Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919, 2024.
  • Zhang et al. [2023] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pages 41092–41110. PMLR, 2023.
  • Zhang et al. [2022] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
  • Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  • Zhu and Rosenfeld [2001] Xiaojin Zhu and Ronald Rosenfeld. Improving trigram language modeling with the world wide web. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 1, pages 533–536. IEEE, 2001.

Appendix

Appendix A Advantages on Computation Efficiency

Our technique provides substantial benefits when implemented on edge devices like laptops and smartphones, which often face limitations in GPU capabilities and memory. In contrast to traditional decoding methods that depend heavily on GPU power or large memory sizes, our strategy is crafted for high efficiency with low resource demands.

Reduced GPU Requirements  Our approach, which does not require fine-tuning and utilizes a lightweight probabilistic model, primarily operates on the CPU, eliminating the need for substantial GPU resources. This feature is especially advantageous for edge devices with limited GPU access. By minimizing GPU dependency, our technique can be applied more widely, enhancing LLM decoding across a broader array of devices.

Low Memory Usage  Our method avoids the need for bulky initial models or intricate neural network architectures, considerably lowering the memory usage typically needed for LLM decoding. This aspect is particularly suitable for devices with limited memory, such as budget laptops and mobile phones. The decrease in memory usage not only leads to quicker processing times but also reduces power consumption, which is vital for devices running on batteries. Compared to REST, which also requires a corpus, our method significantly reduces memory usage; for instance, when both use the Stack dataset, our method requires less than 1 GB while REST needs 27 GB.

Table 4: Comparison of Different Methods.
Method Requires GPU Computation Memory Overhead
Lookahead ↑ ×
Eagle ↑ ×
Medusa ↑ ×
REST × ↓ ↑
Speculative Decoding ↑ ×
ADED × ↓ Very Low

Ultimately, our decoding method is exceptionally apt for practical use in edge systems, where there is often a scarcity of computational resources. It offers a viable and effective option for improving LLM decoding without sacrificing speed or precision, thus bringing sophisticated language processing to less powerful devices.

Appendix B Broader Impacts

The advancements presented in this paper, specifically the accelerated LLM decoding via Monte Carlo Tree Search (MCTS) and self-evolving speculation, have several broader impacts worth discussing. These impacts span multiple domains including technology, society, and ethics.

Technological Impact

Our method significantly enhances the efficiency and speed of autoregressive LLM decoding. This improvement can benefit numerous applications that rely on real-time language processing, such as interactive chatbots, automated customer service, and real-time translation systems. By reducing the computational load and memory requirements, our approach also makes it feasible to deploy advanced LLMs on edge devices like smartphones and IoT devices, broadening their accessibility and usability.

Societal Impact

The ability to perform faster and more efficient language model decoding can have a profound impact on society. For instance, it can improve the responsiveness and accuracy of assistive technologies for individuals with disabilities, such as voice-controlled assistants and text-to-speech systems. Additionally, educational tools that rely on real-time feedback and interactive learning can benefit from quicker and more reliable LLM responses, enhancing the learning experience for students.

Ethical Considerations

While our advancements offer significant benefits, they also raise important ethical considerations. The increased efficiency of LLMs could lead to more widespread use of automated systems, which might replace human jobs in certain sectors. It is crucial to address the potential displacement of workers by fostering skills development and creating new job opportunities that leverage human-LLM collaboration.

Moreover, the deployment of more powerful LLMs on a wider scale necessitates robust measures to mitigate misuse. Enhanced LLM capabilities could be exploited for malicious purposes, such as generating misleading information or deepfake content. Therefore, it is essential to implement strong ethical guidelines and monitoring mechanisms to prevent abuse and ensure that the technology is used responsibly.

Environmental Impact

Improving the efficiency of LLM decoding can also contribute to environmental sustainability. By reducing the computational resources required for LLM operations, our method decreases the energy consumption associated with running these models. This reduction is particularly important given the growing concerns about the environmental footprint of large-scale AI systems. Our approach aligns with the broader goal of developing greener AI technologies that minimize their impact on the planet.

In summary, the proposed method for accelerating LLM decoding has far-reaching implications across various domains. While it offers substantial benefits, it is essential to address the accompanying ethical, societal, and environmental challenges to ensure that the technology is developed and deployed in a responsible and beneficial manner.

Appendix C Preliminary

Retrieval-Based Speculative Decoding:  Decoding in large language models describes the procedure of text generation by sequentially predicting tokens. Given a context sequence s=(x1,,xt1,xt)𝑠subscript𝑥1subscript𝑥𝑡1subscript𝑥𝑡s=(x_{1},...,x_{t-1},x_{t})italic_s = ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), conventional autoregressive decoding techniques produce the subsequent token at position t+1𝑡1t+1italic_t + 1 using conditional probability:

xt+1p(x|x1,,xt;θlarge),similar-tosubscript𝑥𝑡1𝑝conditional𝑥subscript𝑥1subscript𝑥𝑡subscript𝜃largex_{t+1}\sim p(x|x_{1},...,x_{t};\theta_{\text{large}}),italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ∼ italic_p ( italic_x | italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT large end_POSTSUBSCRIPT ) , (9)

where p𝑝pitalic_p denotes the conditional probability distribution calculated by the LLM with parameters θlargesubscript𝜃large\theta_{\text{large}}italic_θ start_POSTSUBSCRIPT large end_POSTSUBSCRIPT. To reduce these computational burdens during inference, speculative decoding is proposed Leviathan et al. [2023]. It reduces the frequency of forward passes using θlargesubscript𝜃large\theta_{\text{large}}italic_θ start_POSTSUBSCRIPT large end_POSTSUBSCRIPT by incorporating an auxiliary language model with fewer parameters θsmallsubscript𝜃small\theta_{\text{small}}italic_θ start_POSTSUBSCRIPT small end_POSTSUBSCRIPT.

The speculative decoding process is implemented iteratively as follows: Using the smaller model θsmallsubscript𝜃small\theta_{\text{small}}italic_θ start_POSTSUBSCRIPT small end_POSTSUBSCRIPT, a draft sequence of the next m𝑚mitalic_m tokens is generated autoregressively:

x~t+ip(x|s,x~t+1,,x~t+i1;θsmall),similar-tosubscript~𝑥𝑡𝑖𝑝conditional𝑥𝑠subscript~𝑥𝑡1subscript~𝑥𝑡𝑖1subscript𝜃small\tilde{x}_{t+i}\sim p(x|s,\tilde{x}_{t+1},...,\tilde{x}_{t+i-1};\theta_{\text{% small}}),over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_t + italic_i end_POSTSUBSCRIPT ∼ italic_p ( italic_x | italic_s , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , … , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_t + italic_i - 1 end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT small end_POSTSUBSCRIPT ) , (10)

where $i = 1, \ldots, m$. Despite the sequential nature of this generation, the reduced model complexity of $\theta_{\text{small}}$ leads to a lower computational overhead compared to $\theta_{\text{large}}$.

Retrieval-Based Speculative Decoding extends the basic speculative decoding framework by replacing the smaller language model with a retrieval system. The draft tokens are then drawn as:

$$\tilde{x}_{t+i} \sim p(x \mid s, \tilde{x}_{t+1}, \ldots, \tilde{x}_{t+i-1}; \theta_{\text{Corpus}}), \tag{11}$$

where $\theta_{\text{Corpus}}$ denotes the retrieval system, whose operation $\text{retrieve}(x_1, \ldots, x_t)$ fetches contextually relevant text segments from a pre-stored corpus, reducing reliance on frequent recalculations with $\theta_{\text{large}}$.
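The drafting step of Eq. (11) can be sketched as follows. This is only an illustration under stated assumptions: the corpus is a flat list of token ids, matching is done on the longest context suffix by linear scan, and the names `retrieve_draft`, `max_suffix`, and `draft_len` are hypothetical rather than taken from any particular system.

```python
from typing import List, Sequence


def retrieve_draft(context: Sequence[int], corpus: Sequence[int],
                   max_suffix: int = 4, draft_len: int = 3) -> List[int]:
    """Return up to `draft_len` tokens that follow the longest suffix of
    `context` found in `corpus` (hypothetical helper, linear scan)."""
    # Try the longest suffix first, then back off to shorter ones.
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-k:])
        for start in range(len(corpus) - k + 1):
            if list(corpus[start:start + k]) == suffix:
                continuation = list(corpus[start + k:start + k + draft_len])
                if continuation:  # skip matches at the very end of the corpus
                    return continuation
    return []  # nothing retrieved: fall back to ordinary decoding


corpus = [1, 2, 3, 4, 5, 1, 2, 6]
draft = retrieve_draft([9, 1, 2], corpus)
```

A practical retrieval system would typically replace the linear scan with an index structure; the scan is used here only to keep the sketch short.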

Monte Carlo Tree Search:  Monte Carlo Tree Search (MCTS) Coulom [2007] is a popular algorithm in artificial intelligence that explores potential future states from a current decision point by building a tree of possibilities, balancing exploration of new paths and exploitation of known beneficial paths.

In the context of MCTS, each node in the tree represents a possible state, and these nodes are expanded based on the results of simulated plays. A key aspect of MCTS is how it selects nodes for further exploration, which is based on the number of visits to each node, denoted as $N(s)$. Specifically, the selection process in MCTS can be interpreted as an approximate solution to a regularized optimization problem Grill et al. [2020]. The node visit count $N(s)$ acts as a regularizer, guiding the algorithm towards a balance between exploring less visited, uncertain nodes and exploiting nodes that are known to yield higher rewards. The optimization problem can be formally expressed as

$$\max_{s \in \text{Children}(s')} \left( Q(s) + c \sqrt{\frac{\log N(s')}{N(s)}} \right), \tag{12}$$

where $s'$ is the current node, $s$ represents a child node, $Q(s)$ is the estimated value of node $s$, $N(s')$ is the number of visits to the parent node, and $c$ is a constant that controls the trade-off between exploration and exploitation. This formulation highlights how MCTS inherently balances the dual objectives of accuracy (through $Q(s)$) and robustness (through the regularization term involving $N(s)$).
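As a small worked illustration of Eq. (12), the selection rule can be written in a few lines. The function name `uct_select`, the dictionary layout, and the convention of trying unvisited children first are assumptions made for this sketch.

```python
import math
from typing import Dict, Hashable


def uct_select(q: Dict[Hashable, float], n: Dict[Hashable, int],
               n_parent: int, c: float = 1.414) -> Hashable:
    """Pick the child maximizing Q(s) + c * sqrt(log N(s') / N(s))."""
    def score(child: Hashable) -> float:
        if n[child] == 0:
            # Assumption: an unvisited child is always explored first.
            return math.inf
        return q[child] + c * math.sqrt(math.log(n_parent) / n[child])

    return max(q, key=score)


q = {"a": 0.5, "b": 0.4}
n = {"a": 10, "b": 1}
explore = uct_select(q, n, n_parent=11, c=1.0)  # bonus favors rarely visited "b"
exploit = uct_select(q, n, n_parent=11, c=0.0)  # pure value favors "a"
```

With $c = 0$ the rule reduces to greedy selection on $Q(s)$; larger $c$ shifts the choice toward children with small $N(s)$.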

Appendix D Configuration of ADED

For our experiments, we use the following hyperparameters for ADED. We set $t$, the threshold for eliminating tri-gram probabilities, to 12 so that only the most prevalent tri-grams are retained. The number of Monte Carlo Tree Search (MCTS) iterations, denoted $s$, is fixed at 150. The parameters of the MCTS PUCT score function, $c_1$ and $c_2$, are 2.414 and 8.0, respectively, to balance exploration and exploitation. The search depth and the length of each retrieved continuation candidate, both denoted $l$, are 4, and the number of continuation candidates, $n$, is 24.
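For reference, the values above could be collected in a single configuration object. The class and field names below are hypothetical and are not taken from the ADED code base; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ADEDConfig:
    """Hyperparameters listed in this appendix (field names are illustrative)."""
    trigram_threshold: int = 12  # t: threshold for eliminating tri-gram entries
    mcts_iterations: int = 150   # s: number of MCTS iterations
    c1: float = 2.414            # first constant of the MCTS score function
    c2: float = 8.0              # second constant of the MCTS score function
    draft_length: int = 4        # l: search depth and candidate length
    num_candidates: int = 24     # n: number of continuation candidates


config = ADEDConfig()
```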

Appendix E Configuration of Baselines

REST

The REST baseline uses the default settings with the following specific configurations:

  • Number of threads: 6

  • Draft choice: 64

  • Datasets: UltraChat and The Stack

  • Token spans: [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2]

Lookahead

The Lookahead baseline uses the default settings with the following specific configurations:

  • LEVEL: 5

  • WIN: 15

  • GUESS: 15

  • FLASH: 0

Appendix F Implementation Details

In this section, we describe the implementation details of our proposed Monte Carlo Tree Search (MCTS) algorithm for text generation and the dynamic adjustment of tri-gram probabilities.

F.1 Monte Carlo Tree Search for Text Generation

Algorithm 1 Monte Carlo Tree Search for Text Generation
procedure RunMCTS(self, iterations)
    rng ← initialize random number generator
    for iter ← 1 to iterations do
        node ← self.root
        state ← [node.word]
        end ← False
        ▷ Selection
        while not node.untried_words is empty and node.children exist and not end do
            selected_child ← node.select_child()
            state.append(selected_child.word)
            node ← selected_child
            if length of state == self.sentence_length then
                end ← True
            end if
        end while
        ▷ Expansion
        if not node.untried_words is empty and not end then
            untried_words ← node.untried_words
            p_word ← node.word.1
            for word in untried_words do
                tmp_untried_words ← get potential words from trigram matrix
                child_node ← new Node(p_word, word, tmp_untried_words)
                node.children.append(child_node)
            end for
            node ← node.children.last()
            state.append(node.word)
        end if
        ▷ Simulation
        while length of state < self.sentence_length do
            last_word ← state.last()
            next_words ← get from trigram matrix using last_word
            if next_words is not empty then
                select next word based on probabilities from next_words
                state.append(selected word)
            else
                break
            end if
        end while
        ▷ Backpropagation
        score ← 1.0
        for i ← 1 to length of state − 1 do
            score ← score + get score from trigram matrix
        end for
        current_node ← node
        while current_node is not None do
            current_node.visits ← current_node.visits + 1
            current_node.score ← current_node.score + score
            current_node ← current_node.parent
        end while
    end for
    ▷ Extract and sort top sentences
    sentences ← extract sentences from root
    sort sentences by score
    return sentences
end procedure

Algorithm 1 illustrates the Monte Carlo Tree Search (MCTS) procedure for text generation. The procedure, RunMCTS, takes the number of iterations as input. The algorithm proceeds through the following phases:

Selection: Starting from the root node, the algorithm selects child nodes based on the selection criteria until a node with untried words or no children is reached, or the end of the sentence is detected.

Expansion: If the node has untried words and the end of the sentence is not reached, child nodes are created for each untried word using potential words from the trigram matrix.

Simulation: A sequence of words is generated starting from the current state until the sentence reaches the desired length or no more words can be selected from the trigram matrix.

Backpropagation: The score for the generated sequence is calculated, and this score is propagated back through the selected nodes, updating their visit counts and scores.

Finally, the top sentences are extracted from the root node and sorted by score.
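The four phases can be sketched as a compact, runnable program. This is a simplified illustration and not the authors' implementation: it assumes a toy tri-gram table mapping a pair `(w1, w2)` to weighted next words, uses the UCT rule of Eq. (12) for `select_child`, and introduces the hypothetical names `Node`, `run_mcts`, and `expanded`.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

# Assumed layout: (w1, w2) -> {w3: positive weight}
Trigram = Dict[Tuple[str, str], Dict[str, float]]


class Node:
    def __init__(self, prev: str, word: str, parent: Optional["Node"] = None):
        self.prev, self.word, self.parent = prev, word, parent
        self.children: List["Node"] = []
        self.expanded = False
        self.visits = 0
        self.score = 0.0

    def select_child(self, c: float) -> "Node":
        def uct(ch: "Node") -> float:
            if ch.visits == 0:
                return math.inf  # try unvisited children first
            mean = ch.score / ch.visits
            return mean + c * math.sqrt(math.log(self.visits) / ch.visits)
        return max(self.children, key=uct)


def run_mcts(tri: Trigram, prev: str, word: str, length: int = 4,
             iterations: int = 150, c: float = 1.414,
             seed: int = 0) -> List[Tuple[float, List[str]]]:
    rng = random.Random(seed)
    root = Node(prev, word)
    for _ in range(iterations):
        node, state = root, [prev, word]  # two context words, then the draft
        # Selection: descend while the node is expanded and has children.
        while node.expanded and node.children and len(state) - 2 < length:
            node = node.select_child(c)
            state.append(node.word)
        # Expansion: one child per candidate next word, continue from the last.
        if not node.expanded and len(state) - 2 < length:
            node.expanded = True
            for w in tri.get((node.prev, node.word), {}):
                node.children.append(Node(node.word, w, parent=node))
            if node.children:
                node = node.children[-1]
                state.append(node.word)
        # Simulation: sample from the table until the draft is long enough.
        while len(state) - 2 < length:
            nxt = tri.get((state[-2], state[-1]), {})
            if not nxt:
                break
            words = list(nxt)
            weights = [nxt[w] for w in words]
            state.append(rng.choices(words, weights=weights)[0])
        # Backpropagation: sum of tri-gram weights along the sequence.
        reward = 1.0 + sum(
            tri.get((state[i - 2], state[i - 1]), {}).get(state[i], 0.0)
            for i in range(2, len(state)))
        while node is not None:
            node.visits += 1
            node.score += reward
            node = node.parent

    drafts: List[Tuple[float, List[str]]] = []

    def walk(n: Node, path: List[str]) -> None:
        visited = [ch for ch in n.children if ch.visits > 0]
        if not visited:
            if path:
                drafts.append((n.score / n.visits, path))
            return
        for ch in visited:
            walk(ch, path + [ch.word])

    walk(root, [])
    drafts.sort(key=lambda d: d[0], reverse=True)  # best mean score first
    return drafts


toy = {("the", "cat"): {"sat": 0.9, "ran": 0.1},
       ("cat", "sat"): {"on": 1.0},
       ("cat", "ran"): {"away": 1.0},
       ("sat", "on"): {"the": 1.0},
       ("on", "the"): {"mat": 1.0}}
drafts = run_mcts(toy, "the", "cat", length=3, iterations=50)
```

Each entry of `drafts` pairs a mean score with a candidate continuation, so the highest-weight path through the toy table is ranked first.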

F.2 Dynamic Adjustment of tri-gram Probabilities

Algorithm 2 details the procedure for dynamically adjusting tri-gram probabilities. The procedure, Adjust3Gram, takes a sequence of tokens, the maximum length of n-grams to consider, an increment value, and the maximum probability.

For each trigram in the recent portion of the token sequence, the probability is increased by the increment value, up to the specified maximum probability. If the trigram is not already in the tri-gram matrix, it is added with the increment value as its initial probability.

Algorithm 2 Dynamic Adjustment of tri-gram Probabilities
procedure Adjust3Gram(tokens, maxLength, increment, maxProb)
    n ← length of tokens
    startIndex ← max(0, n − maxLength)
    subTokens ← tokens[startIndex : n]
    for i ← 2 to length of subTokens do
        triGram ← (subTokens[i − 2], subTokens[i − 1], subTokens[i])
        if triGram in tri-gram matrix then
            currentProb ← tri-gram matrix[triGram]
            newProb ← min(currentProb + increment, maxProb)
            tri-gram matrix[triGram] ← newProb
        else
            tri-gram matrix[triGram] ← increment
        end if
    end for
end procedure
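A direct Python rendering of Algorithm 2 might look as follows. It assumes the tri-gram matrix is stored as a dictionary keyed by token triples and uses zero-based indexing, so the loop stops at the last valid index; the function name `adjust_trigrams` is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def adjust_trigrams(table: Dict[Tuple[int, int, int], float],
                    tokens: Sequence[int], max_length: int,
                    increment: float, max_prob: float) -> None:
    """Raise the stored value of every tri-gram in the recent token window."""
    # Keep only the most recent `max_length` tokens.
    recent = tokens[max(0, len(tokens) - max_length):]
    for i in range(2, len(recent)):
        tri = (recent[i - 2], recent[i - 1], recent[i])
        if tri in table:
            # Existing entry: add the increment, capped at max_prob.
            table[tri] = min(table[tri] + increment, max_prob)
        else:
            # New entry: start from the increment value.
            table[tri] = increment


table = {(1, 2, 3): 0.95}
adjust_trigrams(table, [0, 1, 2, 3, 4], max_length=4,
                increment=0.1, max_prob=1.0)
```

In this example the window is the last four tokens, so the tri-gram `(0, 1, 2)` falls outside it and is not added.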

Appendix G Additional Experimental Results

In this section, we present additional experimental results to further illustrate the effectiveness of our proposed methods. We compare the performance of Vicuna-7B and Vicuna-13B models with and without the adaptive strategy and analyze the impact of varying Monte Carlo Tree Search (MCTS) search counts on performance.

Figure 7 shows the performance comparison on the MTBench dataset for Vicuna-7B and Vicuna-13B models with and without the adaptive strategy. The results demonstrate the advantage of using the adaptive approach. For both models, the adaptive strategy significantly reduces the average exact length of the generated sequences as the number of tokens increases, indicating more efficient and accurate text generation.

Figure 7: Adaptive Strategy Comparison on MTBench: Performance comparison of Vicuna-7B (a) and Vicuna-13B (b) models with and without the adaptive strategy on the MTBench dataset, showing the advantage of using the adaptive approach.

Figure 8 presents the results for Vicuna-7B and Vicuna-13B models on the MTBench dataset, showing the impact of varying MCTS search counts on performance. For both models, increasing the number of MCTS search counts leads to improved performance, with the optimal counts varying by model size. The average exact length and latency are plotted against the number of MCTS search counts, illustrating the trade-off between performance and computational cost. As shown in the plots, there is a notable improvement in the average exact length as the search counts increase, while latency also increases, indicating a balance between the depth of search and the time taken for generation.

Figure 8: Comparison of MCTS Search Counts and Performance on MTBench: Results for Vicuna-7B (a) and Vicuna-13B (b) models on the MTBench dataset. Increasing MCTS search counts improves performance, with optimal counts varying by model size.