Adaptive Draft-Verification for Efficient Large Language Model Decoding

Xukun Liu
Northwestern University
[email protected]
Bowen Li
Texas A&M University
[email protected]
Ruqi Zhang
Purdue University
[email protected]
Dongkuan Xu
North Carolina State University
[email protected]

Abstract

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model’s learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called ADED (project repo: https://anonymous.4open.science/r/ADED-C7D5), which accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM. The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.

1 Introduction

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model’s learned probabilities (Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023a, b). The core mechanism is autoregressive, where each new token is generated conditioned on the previously generated tokens and the given context. This process is crucial for applications like text generation (Li et al., 2024a; Peng et al., 2023; Chang et al., 2023), machine translation (Zhang et al., 2023; Moslem et al., 2023; Hendy et al., 2023), and conversational AI (Shanahan, 2024; Wu et al., 2023; Saka et al., 2023). However, each decoding step involves a forward pass through the model, making the process inherently sequential and computationally expensive. The inefficiencies arise due to the need to reload the model for each token prediction, leading to high computational costs and memory bandwidth usage. This serial nature of decoding is a significant bottleneck, especially for real-time applications (Liu et al., 2023a; Mandvikar, 2023; Antoniol et al., 1994) where latency is critical. Thus, optimizing the decoding speed of LLMs is essential for practical deployment in various real-world scenarios.

Figure 1: Comparison of different LLM decoding strategies: Speculative Decoding Leviathan et al. (2023), Lookahead Fu et al. (2024a), REST He et al. (2023), and our ADED. In Speculative Decoding, a small LLM generates predictions (red blocks) from inputs (blue blocks). Yellow blocks indicate intermediate results obtained from the language model. Lookahead uses a large LLM for forward-looking predictions. REST employs a corpus trie for rapid token lookups. ADED integrates Monte Carlo Tree Search with tri-gram (Martin et al., 1998; Golding and Schabes, 1996; Zhu and Rosenfeld, 2001) statistics and recent token history to simulate potential outputs, refining its recommendations over time. Because ADED continuously evolves its draft construction, it decodes more efficiently than the fixed or resource-intensive methods used by the others.

Recent research has explored various strategies to mitigate the inefficiencies of LLM decoding. Speculative Decoding (Leviathan et al., 2023; Spector and Re, 2023; Chen et al., 2023) introduces an approach where a smaller, more efficient model generates several token predictions in parallel, which are then verified by the larger target model. This method leverages the efficiency of smaller models to reduce the number of serial forward passes required, achieving substantial speedups without altering the output distribution. Lookahead Decoding Fu et al. (2024a) uses the full context to predict multiple future tokens, creating a buffer that reduces the dependency on sequential processing. REST He et al. (2023) employs a retrieval-based approach where relevant tokens are fetched from a pre-constructed datastore using the current context, forming drafts that are verified by the LLM. These methods can be summarized within the draft-verification pipeline, as shown in Figure 4. Speculative Decoding and Lookahead Decoding both generate draft tokens through predictive models, while REST constructs drafts from retrieved tokens based on the context. In each case, the drafts are then verified by the main LLM, ensuring that the final output adheres to the model’s learned probabilities. Despite their advancements, these approaches face notable limitations. They often require additional training or fine-tuning, which can be resource-intensive. Fixed retrieval schemes lack adaptability, making it challenging to adjust the draft distribution in real-time based on the evolving LLM output. Additionally, these methods may not generalize well across different models and contexts, limiting their effectiveness in dynamic environments.

In this work, we focus on fine-tuning-free draft-verification to address these limitations. The draft-verification pipeline can be viewed as a rejection sampling procedure in which the similarity between the proposal distribution (draft) and the target distribution (LLM output) determines the acceptance rate and convergence speed: higher similarity yields a higher acceptance rate and faster decoding. The few existing fine-tuning-free approaches, e.g., REST He et al. (2023), typically use fixed retrieval-based schemes to construct drafts. These schemes cannot adjust the draft distribution as the LLM output distribution evolves, resulting in a persistent gap between the draft and the actual LLM output. This gap reduces the draft acceptance rate and limits the potential for improving decoding speed. To address this issue, we raise the following question:

Question: How can we design an adaptive draft construction process that evolves on its own and accurately approximates LLM outputs during decoding?

To introduce adaptability and find drafts that are increasingly close to the LLM output distribution during decoding, we not only need to have an adaptive draft construction pipeline but also need to maintain a balance between exploration and exploitation. This balance ensures that speedups can be achieved by leveraging existing knowledge of draft construction while continuously exploring better draft construction capabilities. To achieve this, we propose a novel methodology called ADED (Adaptive Draft-Verification for Efficient LLM Decoding). ADED incorporates a tri-gram-matrix-based adaptive LLM representative to control the conditional probability distribution of the next token, which can be updated during the decoding process to adjust the draft construction accordingly. To balance exploration and exploitation, we design a draft maker inspired by Monte Carlo Tree Search (MCTS) (Coulom, 2007; Browne et al., 2012; James et al., 2017; Świechowski et al., 2023). This draft maker uses a token preference score to maintain the balance during the search process. The score consists of two parts: the first part is based on the approximate conditional probability distribution of the next token obtained from the LLM representative, reflecting the draft maker’s current knowledge of the LLM output; the second part encourages the draft maker to explore unexplored or less-explored draft spaces. Theoretically, our method can be viewed as a constrained optimization problem to encourage the draft distribution to converge to the LLM output distribution (see Appendix A). Using the token preference score, the draft maker can effectively search the draft space and generate candidate tokens. After the draft construction and verification are completed, the information is fed back to the LLM representative to update its approximation of the LLM output. 
This feedback loop enriches the draft maker’s knowledge in subsequent rounds of draft-verification, enabling adaptability and self-evolution in the draft construction process.

In summary, our contributions are as follows:

• We design a tri-gram matrix-based LLM representation that dynamically approximates the LLM output distribution, enhancing adaptability without the need for fine-tuning. This approach addresses the limitation of fixed retrieval schemes by continuously evolving with the model’s predictions.
• We develop a draft maker inspired by MCTS, which effectively balances exploration and exploitation to generate high-quality drafts. This mechanism improves decoding speed and accuracy by ensuring that the drafts are closely aligned with the LLM’s output distribution. Our experiments show a 2.5x improvement in decoding speed compared to baselines.
• Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates the decoding process while maintaining high accuracy. Specifically, we achieve up to a 2.5x speedup in latency and an average acceptance rate improvement of 20% over existing methods.
• Our method reduces computational overhead and memory usage, making it suitable for deployment in a wide range of practical, real-time applications. Its ability to adapt to evolving LLM outputs and continuously refine draft construction sets it apart from existing methods, addressing the need for more flexible and dynamic LLM decoding solutions.

2 Methodology

We propose a fast, fine-tuning-free draft-verification decoding method that introduces adaptability into decoding by learning from the LLM’s own outputs. Existing accelerated decoding algorithms either require additional fine-tuning or lack adaptability to the LLM’s output distribution, resulting in significant additional cost or insufficient acceleration. To address these issues, we design an adaptive LLM representation based on a tri-gram matrix to approximate the output distribution of the LLM; develop an MCTS-based draft maker that balances exploration and exploitation for self-evolution towards high-quality drafts; and verify the drafts using tree attention.

2.1 Preliminary: Speculative Decoding & Monte Carlo Tree Search

Speculative decoding is a method to accelerate language model inference by using a smaller auxiliary model to generate a draft sequence, reducing the computational load on the larger model Leviathan et al. (2023). Retrieval-based speculative decoding extends this by incorporating a retrieval system instead of the smaller model, leveraging pre-stored corpus segments for relevant text generation. Monte Carlo Tree Search (MCTS) (Coulom, 2007; Browne et al., 2012; James et al., 2017; Świechowski et al., 2023) is an AI algorithm that optimizes decision-making by balancing exploration and exploitation of future states. It selects nodes for further exploration using a combination of node visit counts and estimated values, aiming to maximize overall outcomes. For a comprehensive discussion of these methods, please refer to Appendix C.

2.2 Adaptive LLM Representative

To approximate the output token distribution of the LLM without fine-tuning a small model, we distill linguistic knowledge from a small corpus and construct a tri-gram matrix as an initial representation of the LLM, which allows us to leverage the statistical regularities of language at a granular level. Specifically, we count every sequence of three consecutive tokens that appears in the corpus and compute the probability of the third token conditional on the first two, as defined in Eq. (1):

P(w_{i}|w_{i-2},w_{i-1})=\frac{C(w_{i-2},w_{i-1},w_{i})}{C(w_{i-2},w_{i-1})},

(1)

where $P(w_{i}|w_{i-2},w_{i-1})$ is the conditional probability of a word $w_{i}$ given the two preceding words $w_{i-2}$ and $w_{i-1}$ , $C(w_{i-2},w_{i-1},w_{i})$ is the count of the tri-gram occurrence in the corpus, and $C(w_{i-2},w_{i-1})$ is the count of the preceding bi-gram Mori et al. (1998).

In this way, we can obtain a good initial LLM representative at a much lower cost, which can generate an approximate distribution of the next token based on the previous tokens. This LLM representative will collaborate with our draft maker to generate drafts and get feedback to update the tri-gram matrix for adaptability and self-evolution. Please see Section 2.3 for more details.
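As an illustration of Eq. (1), the following minimal Python sketch builds tri-gram counts from a token sequence and evaluates the conditional probability. The function names and the toy token ids are hypothetical and not taken from the ADED code base. The sketch also assumes that the bi-gram count is obtained by summing the tri-gram counts that share the same two-token prefix, so the probabilities for a given prefix sum to one.

```python
from collections import Counter


def build_trigram_counts(token_ids):
    """Count tri-grams and their two-token prefixes in one token sequence."""
    tri = Counter(zip(token_ids, token_ids[1:], token_ids[2:]))
    bi = Counter()
    # Assumption: C(w1, w2) is the number of tri-grams that start with (w1, w2).
    for (w1, w2, _w3), count in tri.items():
        bi[(w1, w2)] += count
    return tri, bi


def trigram_prob(tri, bi, w1, w2, w3):
    """Eq. (1): P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2); 0.0 for unseen prefixes."""
    denom = bi.get((w1, w2), 0)
    if denom == 0:
        return 0.0
    return tri.get((w1, w2, w3), 0) / denom


# Toy corpus of token ids (hypothetical values).
tokens = [1, 2, 3, 1, 2, 4, 1, 2, 3]
tri, bi = build_trigram_counts(tokens)
p_3 = trigram_prob(tri, bi, 1, 2, 3)  # (1, 2, 3) occurs twice, prefix (1, 2) three times
p_4 = trigram_prob(tri, bi, 1, 2, 4)  # (1, 2, 4) occurs once
```

In this toy corpus the prefix (1, 2) is followed by token 3 twice and by token 4 once, so the two probabilities are 2/3 and 1/3.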

2.3 Draft Maker and Self-Evolution

With the help of the LLM representative, we further propose a draft maker that balances exploration and exploitation while searching for candidate drafts that are closer to the LLM output. On the one hand, our draft maker leverages the conditional probabilities from the LLM representative, which include current knowledge of the LLM output. On the other hand, our draft maker is encouraged to search more in the unexplored or less explored draft space to find better draft candidates. Then, with feedback from the LLM output, the LLM representative can update its understanding of the LLM output, improve the draft maker’s search, and achieve self-evolution. Details are provided below.

Draft Search Score: Given the initial tokens, we use Monte Carlo Tree Search (MCTS) Coulom (2007) to guide the search for drafts of the next tokens, prioritizing candidate tokens according to the conditional probability from the tri-gram matrix-based LLM representative and the node visit counts accumulated during the tree search. The score plays a key role in balancing exploration and exploitation during the tree search and is defined in Eq. (2). It is a variant of the PUCT score (Rosin, 2011; Silver et al., 2017). More specifically, $Q(s,a)$ assesses the quality of taking action $a$ in state $s$ , while $P(s,a)$ represents the prior probability of selecting action $a$ in state $s$ . The term $N(s,a)$ denotes the number of times action $a$ has been taken from state $s$ , and $\sum_{b}N(s,b)$ sums the counts over all actions from state $s$ . The constants $c_{1}$ and $c_{2}$ adjust the balance between exploration and exploitation. Because the exploration term shrinks as $N(s,a)$ grows, frequently visited tokens gradually yield to less-explored ones unless their $Q(s,a)$ remains high.

\max_{a}\; Q(s,a)+P(s,a)\cdot\frac{\sqrt{\sum_{b}N(s,b)}}{1+N(s,a)}\left(c_{1}+\log\left(\frac{\sum_{b}N(s,b)+c_{2}+1}{c_{2}}\right)\right).

(2)
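The score in Eq. (2) can be sketched in a few lines of Python. This is only an illustration: the function names, the toy statistics, and the default values of `c1` and `c2` (borrowed from values commonly used with this form of PUCT) are assumptions, since the paper does not state the constants here.

```python
import math


def puct_score(q, prior, n_sa, n_total, c1=1.25, c2=19652.0):
    """Eq. (2) for one action: value estimate plus a prior-weighted exploration bonus.

    q       -- Q(s, a), current value estimate of the action
    prior   -- P(s, a), e.g. the tri-gram probability of the candidate token
    n_sa    -- N(s, a), visits of this action
    n_total -- sum_b N(s, b), visits of all actions from the state
    c1, c2  -- exploration constants (assumed defaults, not from the paper)
    """
    bonus = prior * math.sqrt(n_total) / (1 + n_sa)
    return q + bonus * (c1 + math.log((n_total + c2 + 1) / c2))


def select_action(stats, c1=1.25, c2=19652.0):
    """Pick the action with the highest score; stats maps action -> (Q, P, N)."""
    n_total = sum(n for _q, _p, n in stats.values())
    return max(
        stats,
        key=lambda a: puct_score(stats[a][0], stats[a][1], stats[a][2], n_total, c1, c2),
    )


# Hypothetical node statistics: "a" is well explored, "b" and "c" are unvisited.
stats = {"a": (0.5, 0.6, 10), "b": (0.4, 0.3, 0), "c": (0.1, 0.1, 0)}
best = select_action(stats)
```

With these toy numbers the unvisited token "b" is selected even though "a" has the higher value estimate, because the exploration bonus of "a" has been divided by its visit count.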

Self-Evolution Strategy Transfer: Based on the final search score obtained during the Monte Carlo tree search, we construct draft candidates, verify them to obtain the final decoding output (see Section 2.4), and feed that output back for self-evolution. Because the verified output is produced by the LLM itself, it reflects the LLM’s output distribution and is good learning material for the LLM representative. We therefore feed it into the LLM representative to obtain updated conditional probability distributions, providing the draft maker with more accurate and exploitable knowledge, as illustrated in Figure 2. Specifically, we first extract tri-grams from recent outputs of the LLM. Each tri-gram’s frequency is then used to update its probability as a potential output. These adjusted probabilities enter the MCTS as the prior policy, influencing the selection phase of the tree search. The updated tri-gram probabilities thus serve as a dynamic policy guide, improving the draft maker’s ability to propose contextually relevant sequences. Incorporating learned tri-gram probabilities into the tree search creates a feedback loop in which the search strategy itself evolves over time: the exploration-exploitation balance is recalibrated based on empirical data derived from the model’s own outputs.
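The feedback step can be sketched as an incremental count update. The class below is a hypothetical illustration, not the ADED implementation: it assumes the representative stores raw tri-gram and prefix counts, and that verified LLM output is folded in with an optional `weight` (an assumed knob for emphasizing recent output over the initial corpus).

```python
from collections import Counter


class TrigramRepresentative:
    """Tri-gram counts that can be updated with verified LLM output."""

    def __init__(self):
        self.tri = Counter()  # (w1, w2, w3) -> count
        self.bi = Counter()   # (w1, w2) -> number of tri-grams with that prefix

    def update(self, token_ids, weight=1):
        """Add every tri-gram in token_ids; weight is an assumed tuning knob."""
        for w1, w2, w3 in zip(token_ids, token_ids[1:], token_ids[2:]):
            self.tri[(w1, w2, w3)] += weight
            self.bi[(w1, w2)] += weight

    def prob(self, w1, w2, w3):
        """Current estimate of P(w3 | w1, w2); 0.0 for unseen prefixes."""
        denom = self.bi[(w1, w2)]
        return self.tri[(w1, w2, w3)] / denom if denom else 0.0


rep = TrigramRepresentative()
rep.update([5, 6, 7])              # initial corpus: (5, 6) is followed by 7
before = rep.prob(5, 6, 8)         # token 8 has not been seen after (5, 6)
rep.update([5, 6, 8, 5, 6, 8])     # verified LLM output prefers 8 after (5, 6)
after = rep.prob(5, 6, 8)
```

After the update, the prior for token 8 following (5, 6) rises from 0 to 2/3, so the next tree search is more likely to draft the continuation the LLM actually produced.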

2.4 Draft Construction and Verification

Many draft sequences share common starting segments, which would cause redundant recomputation in the Transformer layers if each draft were verified separately. To avoid this, we follow He et al. (2023) and build a pseudo-sequence in which each draft is a sub-sequence and any common prefix appears only once. We also use a specific attention mask for each attention layer, called tree attention (Miao et al., 2023; Cai et al., 2024). This mask aligns the computations for each token with its dependencies according to the original draft sequence, preserving the draft’s contextual integrity and preventing unnecessary computations. Drafts are approved by comparison with the conditional distribution from the LLM. At each position, a new token is sampled and compared to the draft token. If the sampled token matches the draft token, it is approved; otherwise, the draft is discarded from that point. This selective approval ensures that the output sequence matches what a standard autoregressive process would produce.
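The prefix sharing, the tree attention mask, and the acceptance rule described above can be sketched as follows. This is a simplified illustration in plain Python with hypothetical function names; it represents the mask as nested boolean lists rather than a tensor and assumes the LLM's chosen tokens along a draft path are already available.

```python
def build_tree(drafts):
    """Flatten drafts into a pseudo-sequence in which shared prefixes appear once.

    Returns (tokens, parents): tokens[i] is the token of node i and parents[i]
    is the index of its parent node (-1 if it directly follows the context).
    """
    tokens, parents = [], []
    index = {}  # (parent index, token) -> node index
    for draft in drafts:
        parent = -1
        for tok in draft:
            key = (parent, tok)
            if key not in index:
                index[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = index[key]
    return tokens, parents


def tree_attention_mask(parents):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def accepted_length(draft, llm_tokens):
    """Number of leading draft tokens that match the tokens chosen by the LLM."""
    count = 0
    for drafted, chosen in zip(draft, llm_tokens):
        if drafted != chosen:
            break
        count += 1
    return count


# Three hypothetical drafts that share the prefixes [1] and [1, 2].
drafts = [[1, 2, 3], [1, 2, 4], [1, 5]]
tokens, parents = build_tree(drafts)
mask = tree_attention_mask(parents)
kept = accepted_length([1, 2, 4], [1, 2, 3])  # third token is rejected
```

The eight draft tokens collapse into five tree nodes, and each node attends only to itself and its ancestors, so tokens from sibling drafts never see each other.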

3 Theoretical Insight: Why ADED uses MCTS

In this section, we explore the theoretical parallels between the Monte Carlo Tree Search (MCTS) algorithm used in our ADED framework and the inference mechanisms of large language models (LLMs) to motivate the use of MCTS and the self-evolution of ADED. We show that draft search in ADED using MCTS can be viewed as a form of policy optimization, and that the inference mechanism of the LLM can be viewed as a similar form of regularized optimization.

MCTS in ADED: The token selection procedure in ADED decoding can be viewed as an action selection process. The MCTS algorithm optimizes its policy by iteratively building a search tree and updating visit counts for each node (state-action pair) based on the search paths. The visit count distribution $\hat{\pi}(a\mid x)$ is defined as:

\hat{\pi}(a\mid x)\triangleq\frac{1+n(x,a)}{|A|+\sum_{b}n(x,b)},

(3)

where $n(x,a)$ represents the visit count for action $a$ in state $x$ . Then, the action selection in MCTS can be written as selecting the action $a^{*}$ :

a^{*}(x)\triangleq\arg\max_{a}\left[Q(x,a)+\lambda_{N}\cdot\frac{\pi_{\theta}(a\mid x)}{\hat{\pi}(a\mid x)}\right].

(4)

Following Grill et al. (2020), we use $q\in\mathbb{R}^{|A|}$ to denote the vector of Q-function values $Q(x,a)$ . With a proper choice of hyper-parameters, the MCTS algorithm can be viewed as searching for the optimal solution to a policy optimization problem (Grill et al., 2020):

\bar{\pi}\triangleq\arg\max_{y\in S}\left[q^{\top}y-\lambda_{N}\,\text{KL}[\pi_{\theta},y]\right],

(5)

where $S$ is the $|A|$ -dimensional simplex, $\lambda_{N}$ is a regularization parameter that depends on hyperparameters and balances exploration and exploitation, and KL is the KL-divergence.
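To make Eq. (3) and Eq. (4) concrete, the sketch below computes the smoothed visit distribution and the resulting action choice. The function names and the toy numbers are hypothetical, $\lambda_{N}$ is simply passed in as a given value, and the action set $A$ is assumed to be exactly the keys of the `visits` dictionary.

```python
def visit_distribution(visits):
    """Eq. (3): pi_hat(a | x) = (1 + n(x, a)) / (|A| + sum_b n(x, b))."""
    total = len(visits) + sum(visits.values())
    return {a: (1 + n) / total for a, n in visits.items()}


def select_action(q, prior, visits, lam):
    """Eq. (4): argmax_a [ Q(x, a) + lam * pi_theta(a | x) / pi_hat(a | x) ]."""
    pi_hat = visit_distribution(visits)
    return max(q, key=lambda a: q[a] + lam * prior[a] / pi_hat[a])


# Hypothetical two-action state: "x" has been visited 8 times, "y" never.
visits = {"x": 8, "y": 0}
q = {"x": 0.5, "y": 0.4}
prior = {"x": 0.7, "y": 0.3}
pi_hat = visit_distribution(visits)
choice = select_action(q, prior, visits, lam=0.1)
```

Here the visit distribution is 0.9 for "x" and 0.1 for "y", so the ratio of prior to visit frequency is much larger for "y" and the rule selects it despite its lower Q-value.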

LLM Inference Mechanism: Large language models, particularly those based on the Transformer architecture, generate text by predicting the probability distribution of the next token given the previous tokens. During inference, the model maximizes the log-likelihood of the observed data, which is equivalent to minimizing the cross-entropy loss:

\mathcal{L}(\theta)=-\sum_{t=1}^{T}\log P(w_{t}\mid w_{1:t-1};\theta),

(6)

where $P$ denotes the conditional probability of LLM, $w$ denotes the tokens, and $\theta$ denotes the model parameters. Regularization techniques, such as KL divergence, are often incorporated to prevent overfitting and ensure generalization:

\mathcal{L}(\theta)=-\sum_{t=1}^{T}\log P(w_{t}\mid w_{1:t-1};\theta)+\lambda\,\text{KL}(P_{\text{model}},P_{\text{data}}).

(7)

Comparison between MCTS & LLM Inference: As shown in Eq. (5) and Eq. (7), both MCTS and LLM inference can be viewed as regularized optimization problems for selecting the distribution of the next tokens. On the one hand, the Q-function in MCTS for ADED can be viewed as an approximation to the log-likelihood of the LLM:

Q(x,a)=-\sum_{t=1}^{T}\log\hat{P}(w_{t}\mid w_{t-1},w_{t-2};\theta)\approx-\sum_{t=1}^{T}\log P(w_{t}\mid w_{t-1},w_{t-2};\theta)\approx-\sum_{t=1}^{T}\log P(w_{t}\mid w_{1:t-1};\theta),

(8)

where $\hat{P}$ and $P$ are the conditional probability distribution from tri-gram-matrix-based LLM representative and LLM, respectively. On the other hand, both MCTS and LLM Inference improve the optimization procedure by employing regularization techniques. Through a comparative analysis, we show the similarities between MCTS and LLM Inference in terms of optimization and regularization, and highlight our rationale for choosing MCTS for the ADED framework.

4 Experiments

4.1 Experimental Setup

Models and Datasets To evaluate the efficacy of ADED for LLM inference, we run a series of experiments with five models on three datasets. We test our algorithm on three Vicuna models Chiang et al. (2023) (7B, 13B, 33B) and two LLaMA2-chat models Touvron et al. (2023b) (7B, 13B) to evaluate its acceleration capabilities across different sizes and types of models. Our assessment uses the HumanEval Chen et al. (2021), MT-Bench Zheng et al. (2023), and Alpaca Taori et al. (2023) datasets to cover code generation as well as general natural language understanding and generation. These datasets were chosen to allow a comprehensive analysis of our acceleration techniques across various tasks.

Corpus We construct two corpora. The first is built using a portion of the Python pre-training code from The Stack Kocetkov et al. (2022), comprising approximately 2.7M Python code samples with a resulting size of 1007MB. The second is constructed using data derived from UltraChat Ding et al. (2023), consisting of around 774K ChatGPT conversations, which produces a corpus with a size of 574MB.

Metrics To assess the performance of our acceleration algorithms on large language models, we utilize two main metrics: speedup ratio and average acceptance length. The speedup ratio, calculated as the ratio of the time required by the baseline models to complete inference tasks without acceleration to the time required by our ADED, effectively measures the efficiency gains introduced by our algorithm. The second metric, average acceptance length, measures the average number of tokens accepted per forward pass by the target large language models, excluding any overhead of retrieving and constructing draft tokens, indicating the maximum possible acceleration.
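The two metrics reduce to simple ratios, as the following sketch shows. The function names and the sample numbers are hypothetical and serve only to illustrate the definitions; they are not measurements from the paper.

```python
def speedup_ratio(baseline_seconds, accelerated_seconds):
    """Wall-clock time of plain autoregressive decoding divided by accelerated time."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


def average_accept_length(accepted_per_forward_pass):
    """Mean number of tokens accepted per forward pass of the target LLM."""
    if not accepted_per_forward_pass:
        raise ValueError("need at least one forward pass")
    return sum(accepted_per_forward_pass) / len(accepted_per_forward_pass)


# Hypothetical run: 30 s without acceleration, 12 s with it, four forward passes.
speedup = speedup_ratio(30.0, 12.0)
accept_len = average_accept_length([3, 1, 2, 2])
```

Because the average accept length ignores the time spent retrieving and constructing drafts, it bounds the achievable speedup from above, while the speedup ratio reflects the end-to-end gain.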

Baselines We compare against two baselines that accelerate decoding without a draft model. Lookahead Decoding Fu et al. (2024a) is an exact, parallel decoding algorithm that cuts down latency without relying on draft models. REST He et al. (2023) (Retrieval-Based Speculative Decoding) adopts a retrieval-based strategy to create draft tokens, in contrast to conventional speculative decoding methods that rely on a draft model. Together, these baselines provide a solid reference for evaluating the efficiency of our proposed acceleration techniques in the LLM decoding process. All experiments are conducted on an NVIDIA A6000, except for the 33B model, which uses an NVIDIA A100. Unless otherwise noted, experiments use greedy sampling.

4.2 Main Experimental Results

In our experiments, we compare the methods on multiple models using three datasets: MT-Bench, HumanEval, and Alpaca. We report accept length, latency, and speedup to evaluate efficiency and responsiveness.

Table 1: Latency comparison of ADED and baselines on MT-Bench and Alpaca. ADED achieves the lowest latency in almost all test cases.

	MT-Bench Latency (ms)				Alpaca Latency (ms)
Model	REST	REST Single Thread	Lookahead	ADED	REST	REST Single Thread	Lookahead	ADED
vicuna-7B	16.05	16.31	20.53	12.95	13.21	13.51	20.86	13.18
vicuna-13B	25.43	25.99	33.55	22.94	22.27	22.65	34.62	22.80
vicuna-33B	32.90	33.26	42.13	29.17	31.87	32.41	42.69	29.23
llama2-7B	16.08	17.67	18.47	13.88	13.55	19.46	19.44	13.02
llama2-13B	27.13	29.80	32.07	22.90	23.88	27.24	32.72	22.55

Table 1 summarizes the latency results for different models and methods on the MT-Bench and Alpaca datasets. ADED achieves the lowest latency in nearly all settings, particularly for the vicuna-7B and llama2-13B models. For instance, on the MT-Bench dataset, ADED achieves a latency of 12.95 ms for vicuna-7B, which is significantly lower than REST (16.05 ms), REST Single Thread (16.31 ms), and Lookahead (20.53 ms). Notably, the memory required for ADED (574MB) is only 5.6% of that required for REST (12GB). On the Alpaca dataset the margin over REST is smaller: ADED achieves a latency of 13.18 ms for vicuna-7B, compared to 13.21 ms for REST, 13.51 ms for REST Single Thread, and 20.86 ms for Lookahead.

The accept length results indicate the quality of the drafts, with longer accept lengths meaning that more draft tokens are accepted per forward pass of the LLM. Our method, ADED, outperforms other methods across different models on both MT-Bench and Alpaca datasets. For example, in MT-Bench, ADED achieves the highest accept length for vicuna-33B and llama2-13B models, showing that its drafts track the LLM output more closely.

Additionally, speed up metrics are evaluated to determine the efficiency of each method. ADED consistently shows a significant improvement in speed up across all models and datasets. This efficiency is particularly noticeable in the MT-Bench and Alpaca datasets, where ADED not only reduces latency but also enhances the overall processing speed, thus validating the robustness of our retrieval algorithms. For instance, ADED achieves a speed up of 2.4x on the MT-Bench dataset with the vicuna-13B model, outperforming REST, REST Single Thread, and Lookahead.

Further analysis on the HumanEval dataset reveals that ADED achieves considerable speed up improvements. The speed up results highlight the efficiency of ADED in handling larger datasets without compromising on retrieval time or output quality. The vicuna-13B model, for example, demonstrates a speed up of nearly 2.5x when using ADED, which is a substantial improvement over the baseline methods.

These results collectively indicate that ADED provides a balanced improvement in both the quality of generated outputs and the efficiency of the retrieval process. The ability of ADED to maintain high performance across diverse tasks and datasets underscores its versatility and reliability for real-world applications. The improvements in latency, accept length, and speed up metrics affirm the efficacy of ADED in delivering superior performance while managing larger datasets effectively.

4.3 Stability of ADED

In this section, we analyze the stability of our algorithm, ADED, across different categories of tasks. The categories considered include writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. The experimental results, as shown in Table 2, indicate that ADED maintains consistent performance across all categories. The average accept length remains stable, demonstrating that ADED can effectively handle a diverse range of tasks without significant variations in performance.

To further evaluate the robustness of ADED, we examined the effects of varying the top-p and temperature parameters on its performance. Figures 5(a) and 5(b) illustrate the impact of these parameters on the average accept length.

Table 2: Average Accept Length for the Tasks.

Task	Avg Accept Length
Writing	2.37
Roleplay	2.21
Reasoning	2.10
Math	2.33
Coding	2.45
Extraction	2.11
STEM	2.19
Humanities	2.17

Figure 5(a) shows that changes in the top-p parameter do not significantly affect the performance of ADED. The average accept length remains relatively stable across different values of top-p, indicating that ADED is not overly sensitive to this parameter.

Similarly, Figure 5(b) demonstrates that variations in the temperature parameter have minimal impact on the performance of ADED. The consistency in the average accept length across different temperature values further supports the robustness of our algorithm.

These results confirm that ADED exhibits robust performance across a variety of tasks and maintains stability despite changes in key parameters, making it a versatile and reliable choice for diverse applications.

5 Ablation Study

To gain a deeper understanding of our method, we conduct a series of ablation studies and analyses focused on each individual component. Please see full ablation studies in the Appendix.

Effect of the adaptive strategy. Figure 6 illustrates the performance impact of our adaptive strategy on two models, Vicuna-7B and Vicuna-13B, comparing average accepted lengths over varying token counts. The graphs show that the adaptive strategy consistently maintains higher average accepted lengths across the input range for both models, compared to the non-adaptive variant. The adaptive strategy’s success can be attributed to its dynamic adjustment of the draft probability distributions based on the tri-gram frequencies from prior outputs. This allows the draft maker to better track longer contexts, keeping accepted lengths stable in longer interactions. The marked improvement, particularly at larger token counts, highlights the adaptive strategy’s efficacy in sustaining high acceptance over extended sequences.

Effect of the corpus size. Table 3 demonstrates how increasing the corpus size from 300k tokens to 700k tokens affects various performance metrics. With the expansion of the corpus, there is a gradual improvement in the ’Accept Length’ from 2.17 to 2.33. This increase suggests that larger datasets provide a broader array of language patterns, which enhances the model’s ability to generate more coherent and contextually relevant outputs. Despite the growth in data size, from 253 MB to 574 MB, and a slight increase in retrieval time from 2.1 ms to 3.1 ms, the system maintains efficient data handling.

Table 3: Effect of Corpus Size.

Corpus Size	Data Size	Retrieval Time	Accept Length	Speed Up
300k	253 MB	2.1 ms	2.17	1.93
500k	467 MB	2.7 ms	2.24	2.01
700k	574 MB	3.1 ms	2.33	2.07

The modest rise in retrieval time underscores the efficiency of the retrieval algorithms, which manage larger datasets without significantly compromising response speed. Overall, the results indicate that larger corpus sizes improve the model’s output quality while maintaining good system performance.

These findings highlight the importance of the size of the corpus in enhancing the performance of ADED, demonstrating that a larger and more carefully selected corpus can significantly improve the quality and efficiency of the generated outputs.

6 Related Work

Approach Without Draft Models. A significant portion of recent advances in language model decoding strategies has focused on improving efficiency without relying on draft models. Two notable approaches in this realm are Lookahead decoding Fu et al. (2024a) and Retrieval-Based Speculative Decoding He et al. (2023). Lookahead decoding enhances the efficiency of the decoding process by predicting subsequent tokens via Jacobi Iteration Sleijpen and Van der Vorst (2000). It employs a heuristic to estimate the future cost of a sequence without the need to explicitly create a draft. This technique not only accelerates the decoding process by minimizing the number of tokens to be processed but also seeks to preserve or improve the quality of the output text by taking into account potential future scenarios in the decision-making process. Retrieval-Based Speculative Decoding (REST) He et al. (2023) introduces a retrieval-enhanced generation model that speculatively decodes sequences without the need for producing preliminary drafts. It instead searches and prioritizes possible continuations from an already established sequence database. This approach utilizes the inherent redundancy in natural language to anticipate probable continuations, thereby eliminating the requirement for draft sequences and greatly reducing computational costs. Both Lookahead decoding and REST demonstrate the capabilities of decoding methods that avoid intermediate drafts, providing a more straightforward and computationally efficient route to generating high-quality text.

Approach With Draft Models. Draft models are also used to improve decoding efficiency. Techniques such as Speculative Decoding (Leviathan et al., 2023; Spector and Re, 2023; Chen et al., 2023; Stern et al., 2018), Medusa Cai et al. (2024), Eagle Li et al. (2024b), and various other approaches requiring draft models (Zhang et al., 2024; Liu et al., 2023b; Kim et al., 2024; Fu et al., 2024b) fall into this category, utilizing models to generate drafts. Although these methods aim to speed up response times and reduce computational load during initial text generation, their adoption comes with significant drawbacks. The primary issue is the necessity for additional training specific to the draft models, which can be resource-intensive. Moreover, these techniques generally depend on GPU resources (Kwon et al., 2023; Sheng et al., 2023; Park et al., 2024) for inference, potentially limiting their application in environments where such hardware is unavailable or when operating under strict resource constraints. This dependence not only increases operational costs, but also restricts flexibility in deployment scenarios.

7 Conclusion

ADED improves the LLM decoding process by introducing adaptability and efficiency, significantly reducing latency and computational demands. This method achieves up to a 2.5X speedup in decoding and a 20% improvement in acceptance rates, outperforming traditional techniques. Unlike existing approaches, ADED dynamically adjusts the draft distribution using a tri-gram matrix and enhances draft quality through MCTS, eliminating the need for fine-tuning. The continuous feedback loop ensures ongoing improvements in draft generation. While ADED demonstrates robust performance across various benchmarks, future work will focus on further optimizing the adaptability mechanisms and exploring its application in more diverse real-world scenarios. Additionally, addressing potential limitations in extremely large-scale deployments will be a priority.

References

Antoniol et al. [1994] Giuliano Antoniol, Fabio Brugnara, Mauro Cettolo, and Marcello Federico. Language model estimations and representations for real-time continuous speech recognition. In ICSLP, pages 859–862, 1994.
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Browne et al. [2012] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
Cai et al. [2024] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
Chang et al. [2023] Jonathan D Chang, Kiante Brantley, Rajkumar Ramamurthy, Dipendra Misra, and Wen Sun. Learning to generate better than your llm. arXiv preprint arXiv:2306.11816, 2023.
Chen et al. [2023] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023.
Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.
Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Coulom [2007] Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In H. Jaap van den Herik, Paolo Ciancarini, and H. H. L. M. (Jeroen) Donkers, editors, Computers and Games, pages 72–83, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75538-8.
Ding et al. [2023] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
Fu et al. [2024a] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding, 2024a.
Fu et al. [2024b] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024b.
Golding and Schabes [1996] Andrew R Golding and Yves Schabes. Combining trigram-based and feature-based methods for context-sensitive spelling correction. arXiv preprint cmp-lg/9605037, 1996.
Grill et al. [2020] Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. Monte-carlo tree search as regularized policy optimization. CoRR, abs/2007.12509, 2020. URL https://arxiv.org/abs/2007.12509.
He et al. [2023] Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. Rest: Retrieval-based speculative decoding, 2023.
Hendy et al. [2023] Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How good are gpt models at machine translation? a comprehensive evaluation, 2023.
James et al. [2017] Steven James, George Konidaris, and Benjamin Rosman. An analysis of monte carlo tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
Kim et al. [2024] Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder. Advances in Neural Information Processing Systems, 36, 2024.
Kocetkov et al. [2022] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code, 2022.
Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023.
Li et al. [2024a] Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Pre-trained language models for text generation: A survey. ACM Computing Surveys, 56(9):1–39, 2024a.
Li et al. [2024b] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024b.
Liu et al. [2023a] Jijia Liu, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, Yi Wu, and Yu Wang. Llm-powered hierarchical language agent for real-time human-ai coordination. arXiv preprint arXiv:2312.15224, 2023a.
Liu et al. [2023b] Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, and Hao Zhang. Online speculative decoding, 2023b.
Mandvikar [2023] Shreekant Mandvikar. Factors to consider when selecting a large language model: A comparative analysis. International Journal of Intelligent Automation and Computing, 6(3):37–40, 2023.
Martin et al. [1998] Sven Martin, Jörg Liermann, and Hermann Ney. Algorithms for bigram and trigram word clustering. Speech communication, 24(1):19–37, 1998.
Miao et al. [2023] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 1(2):4, 2023.
Mori et al. [1998] Shinsuke Mori, Masafumi Nishimura, and Nobuyasu Itoh. Word clustering for a word bi-gram model. In ICSLP. Citeseer, 1998.
Moslem et al. [2023] Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. Adaptive machine translation with large language models, 2023.
Park et al. [2024] Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision llm: Low-cost deployment of multiple, different-sized llms. arXiv preprint arXiv:2402.10517, 2024.
Peng et al. [2023] Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare. NPJ Digital Medicine, 6(1):210, 2023.
Rosin [2011] Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61:203–230, 2011. URL https://api.semanticscholar.org/CorpusID:207081359.
Saka et al. [2023] Abdullahi B Saka, Lukumon O Oyedele, Lukman A Akanbi, Sikiru A Ganiyu, Daniel WM Chan, and Sururah A Bello. Conversational artificial intelligence in the aec industry: A review of present status, challenges and opportunities. Advanced Engineering Informatics, 55:101869, 2023.
Shanahan [2024] Murray Shanahan. Talking about large language models. Communications of the ACM, 67(2):68–79, 2024.
Sheng et al. [2023] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
Sleijpen and Van der Vorst [2000] Gerard LG Sleijpen and Henk A Van der Vorst. A jacobi–davidson iteration method for linear eigenvalue problems. SIAM review, 42(2):267–293, 2000.
Spector and Re [2023] Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding, 2023.
Stern et al. [2018] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models, 2018.
Świechowski et al. [2023] Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2023.
Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.
Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b.
Wu et al. [2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023.
Zhang et al. [2024] Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, and Yunfei Cheng. Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919, 2024.
Zhang et al. [2023] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pages 41092–41110. PMLR, 2023.
Zhang et al. [2022] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
Zhu and Rosenfeld [2001] Xiaojin Zhu and Ronald Rosenfeld. Improving trigram language modeling with the world wide web. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 1, pages 533–536. IEEE, 2001.

Appendix

Appendix A Advantages in Computational Efficiency

Our technique provides substantial benefits when implemented on edge devices like laptops and smartphones, which often face limitations in GPU capabilities and memory. In contrast to traditional decoding methods that depend heavily on GPU power or large memory sizes, our strategy is crafted for high efficiency with low resource demands.

Reduced GPU Requirements Our approach, which does not require fine-tuning and utilizes a lightweight probabilistic model, primarily operates on the CPU, eliminating the need for substantial GPU resources. This feature is especially advantageous for edge devices with limited GPU access. By minimizing GPU dependency, our technique can be applied more widely, enhancing LLM decoding across a broader array of devices.

Low Memory Usage Our method avoids the need for bulky initial models or intricate neural network architectures, considerably lowering the memory usage typically needed for LLM decoding. This aspect is particularly suitable for devices with limited memory, such as budget laptops and mobile phones. The decrease in memory usage not only leads to quicker processing times but also reduces power consumption, which is vital for devices running on batteries. Compared to REST, which also requires a corpus, our method significantly reduces memory usage; for instance, when both use the Stack dataset, our method requires less than 1 GB while REST needs 27 GB.

Table 4: Comparison of Different Methods.

Method	Requires GPU	Computation	Memory Overhead
Lookahead	✓	$\uparrow$	$\times$
Eagle	✓	$\uparrow$	$\times$
Medusa	✓	$\uparrow$	$\times$
REST	$\times$	$\downarrow$	$\uparrow$
Speculative Decoding	✓	$\uparrow$	$\times$
ADED	$\times$	$\downarrow$	Very Low

Ultimately, our decoding method is exceptionally apt for practical use in edge systems, where there is often a scarcity of computational resources. It offers a viable and effective option for improving LLM decoding without sacrificing speed or precision, thus bringing sophisticated language processing to less powerful devices.

Appendix B Broader Impacts

The advancements presented in this paper, specifically the accelerated LLM decoding via Monte Carlo Tree Search (MCTS) and self-evolving speculation, have several broader impacts worth discussing. These impacts span multiple domains including technology, society, and ethics.

Technological Impact

Our method significantly enhances the efficiency and speed of autoregressive LLM decoding. This improvement can benefit numerous applications that rely on real-time language processing, such as interactive chatbots, automated customer service, and real-time translation systems. By reducing the computational load and memory requirements, our approach also makes it feasible to deploy advanced LLMs on edge devices like smartphones and IoT devices, broadening their accessibility and usability.

Societal Impact

The ability to perform faster and more efficient language model decoding can have a profound impact on society. For instance, it can improve the responsiveness and accuracy of assistive technologies for individuals with disabilities, such as voice-controlled assistants and text-to-speech systems. Additionally, educational tools that rely on real-time feedback and interactive learning can benefit from quicker and more reliable LLM responses, enhancing the learning experience for students.

Ethical Considerations

While our advancements offer significant benefits, they also raise important ethical considerations. The increased efficiency of LLMs could lead to more widespread use of automated systems, which might replace human jobs in certain sectors. It is crucial to address the potential displacement of workers by fostering skills development and creating new job opportunities that leverage human-LLM collaboration.

Moreover, the deployment of more powerful LLMs on a wider scale necessitates robust measures to mitigate misuse. Enhanced LLM capabilities could be exploited for malicious purposes, such as generating misleading information or deepfake content. Therefore, it is essential to implement strong ethical guidelines and monitoring mechanisms to prevent abuse and ensure that the technology is used responsibly.

Environmental Impact

Improving the efficiency of LLM decoding can also contribute to environmental sustainability. By reducing the computational resources required for LLM operations, our method decreases the energy consumption associated with running these models. This reduction is particularly important given the growing concerns about the environmental footprint of large-scale AI systems. Our approach aligns with the broader goal of developing greener AI technologies that minimize their impact on the planet.

In summary, the proposed method for accelerating LLM decoding has far-reaching implications across various domains. While it offers substantial benefits, it is essential to address the accompanying ethical, societal, and environmental challenges to ensure that the technology is developed and deployed in a responsible and beneficial manner.

Appendix C Preliminary

Retrieval-Based Speculative Decoding: Decoding in large language models describes the procedure of text generation by sequentially predicting tokens. Given a context sequence $s=(x_{1},...,x_{t-1},x_{t})$ , conventional autoregressive decoding techniques produce the subsequent token at position $t+1$ using conditional probability:

$$x_{t+1}\sim p(x|x_{1},...,x_{t};\theta_{\text{large}}), \tag{9}$$

where $p$ denotes the conditional probability distribution calculated by the LLM with parameters $\theta_{\text{large}}$ . Because every generated token requires a full forward pass through $\theta_{\text{large}}$ , inference is computationally expensive. Speculative decoding Leviathan et al. [2023] was proposed to reduce this cost: it lowers the number of forward passes through $\theta_{\text{large}}$ by incorporating an auxiliary language model with fewer parameters, $\theta_{\text{small}}$ .

The speculative decoding process is implemented iteratively as follows: Using the smaller model $\theta_{\text{small}}$ , a draft sequence of the next $m$ tokens is generated autoregressively:

$$\tilde{x}_{t+i}\sim p(x|s,\tilde{x}_{t+1},...,\tilde{x}_{t+i-1};\theta_{\text{small}}), \tag{10}$$

where $i=1,...,m$ . Despite the sequential nature of this generation, the reduced model complexity of $\theta_{\text{small}}$ leads to a lower computational overhead compared to $\theta_{\text{large}}$ . Retrieval-Based Speculative Decoding extends the basic speculative decoding framework by replacing the smaller language model with a retrieval system, so that the draft tokens are drawn as:

$$\tilde{x}_{t+i}\sim p(x|s,\tilde{x}_{t+1},...,\tilde{x}_{t+i-1};\theta_{\text{Corpus}}), \tag{11}$$

where $\theta_{\text{Corpus}}$ denotes a retrieval system that, given the context $(x_{1},...,x_{t})$ , fetches contextually relevant text segments from a pre-stored corpus, reducing reliance on frequent recalculations with $\theta_{\text{large}}$ .
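As a rough illustration of retrieval-based drafting, the sketch below looks up the longest suffix of the context in a token corpus and proposes the tokens that follow it as a draft. It is a minimal sketch, not the REST or ADED implementation: the function names `retrieve_draft` and `count_accepted`, the linear scan, the `max_suffix` parameter, and the greedy prefix-matching check are all assumptions made for illustration. In particular, a real system would use an index rather than a scan and would verify all draft positions in a single forward pass of $\theta_{\text{large}}$ rather than one call per token.

```python
from typing import Callable, List, Sequence


def retrieve_draft(context: Sequence[int], corpus: Sequence[int],
                   m: int, max_suffix: int = 4) -> List[int]:
    """Propose up to m tokens that follow the longest context suffix found in corpus."""
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-k:])
        # Linear scan for the first match that has at least one following token.
        for start in range(len(corpus) - k):
            if list(corpus[start:start + k]) == suffix:
                return list(corpus[start + k:start + k + m])
    return []  # no suffix matched: fall back to ordinary decoding


def count_accepted(context: Sequence[int], draft: Sequence[int],
                   target_next: Callable[[List[int]], int]) -> int:
    """Count leading draft tokens that agree with the target model's greedy choice."""
    seq = list(context)
    accepted = 0
    for tok in draft:
        if target_next(seq) != tok:
            break
        seq.append(tok)
        accepted += 1
    return accepted


# Toy data: the stand-in "target model" always predicts last token + 1.
corpus = [1, 2, 3, 4, 5, 1, 2, 3, 9]
context = [7, 1, 2, 3]
draft = retrieve_draft(context, corpus, m=3)
accepted = count_accepted(context, draft, lambda seq: seq[-1] + 1)
```

The number of accepted tokens per verification step is what the accept-length metric in the experiments measures: the more draft tokens survive verification, the fewer forward passes are needed.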

Monte Carlo Tree Search: Monte Carlo Tree Search (MCTS) Coulom [2007] is a popular algorithm in artificial intelligence that explores potential future states from a current decision point by building a tree of possibilities, balancing exploration of new paths and exploitation of known beneficial paths.

In the context of MCTS, each node in the tree represents a possible state, and these nodes are expanded based on the results of simulated plays. A key aspect of MCTS is how it selects nodes for further exploration, which is based on the number of visits to each node, denoted as $N(s)$ . Specifically, the selection process in MCTS can be interpreted as an approximate solution to a regularized optimization problem Grill et al. [2020]. The node visit count $N(s)$ acts as a regularizer, guiding the algorithm towards a balance between exploring less visited, uncertain nodes and exploiting nodes that are known to yield higher rewards. The optimization problem can be formally expressed as

$$\max_{s\in\text{Children}(s^{\prime})}\left(Q(s)+c\sqrt{\frac{\log N(s^{\prime})}{N(s)}}\right), \tag{12}$$

where $s^{\prime}$ is the current node, $s$ represents a child node, $Q(s)$ is the estimated value of node $s$ , $N(s^{\prime})$ is the number of visits to the parent node, and $c$ is a constant that controls the trade-off between exploration and exploitation. This formulation highlights how MCTS inherently balances the dual objectives of accuracy (through $Q(s)$ ) and robustness (through the regularization term involving $N(s)$ ).
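The selection rule in Eq. (12) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Node` class, the function names, and the default $c=1.414$ are hypothetical, and giving unvisited children an infinite score is a common convention that Eq. (12) does not itself specify (it would otherwise divide by zero).

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    q: float = 0.0   # Q(s): estimated value of the node
    visits: int = 0  # N(s): number of visits to the node
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float) -> float:
    """Q(s) + c * sqrt(log N(s') / N(s)) for one child s of parent s'."""
    if child.visits == 0:
        return math.inf  # assumption: always try unvisited children first
    return child.q + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node, c: float = 1.414) -> Node:
    """Return the child that maximizes the score in Eq. (12)."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


# A well-explored child versus a rarely visited one.
frequent = Node(q=0.5, visits=5)
rare = Node(q=0.4, visits=1)
parent = Node(visits=10, children=[frequent, rare])
```

With $c=0$ the rule is purely exploitative and picks the child with the larger $Q(s)$; with a larger $c$ the exploration bonus of the rarely visited child dominates.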

Appendix D Configuration of ADED

For our experiments, we use the following hyperparameters to optimize the performance of ADED: We set $t$ , the threshold for the elimination of tri-gram probability, at 12 to focus only on the most prevalent trigrams. The number of iterations for Monte Carlo Tree Search (MCTS), denoted $s$ , is fixed at 150. The parameters in the MCTS PUCT score function, $c_{1}$ and $c_{2}$ , are 2.414 and 8.0, respectively, to effectively balance exploration and exploitation. The search depth and the length of each retrieved continuation candidate, both represented by $l$ , are set to 4, and the number of continuation candidates, $n$ , is set at 24.
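For reference, these settings could be collected in a single configuration object, as in the sketch below. The class name `ADEDConfig` and all field names are hypothetical and are not taken from the project repository; only the values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ADEDConfig:
    """Hypothetical container for the hyperparameters listed above."""
    trigram_threshold: int = 12  # t: threshold for eliminating tri-gram probabilities
    mcts_iterations: int = 150   # s: number of MCTS iterations
    puct_c1: float = 2.414       # c_1 in the PUCT score function
    puct_c2: float = 8.0         # c_2 in the PUCT score function
    search_depth: int = 4        # l: search depth and candidate length
    num_candidates: int = 24     # n: number of continuation candidates


config = ADEDConfig()
```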

Appendix E Configuration of Baselines

REST

The REST baseline uses the default settings with the following specific configurations:

• Number of threads: 6
• Draft choice: 64
• Datasets: UltraChat and The Stack
• Token spans: [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2]

Lookahead

The Lookahead baseline uses the default settings with the following specific configurations:

• LEVEL: 5
• WIN: 15
• GUESS: 15
• FLASH: 0

Appendix F Implementation Details

In this section, we describe the implementation details of our proposed Monte Carlo Tree Search (MCTS) algorithm for text generation and the dynamic adjustment of tri-gram probabilities.

F.1 Monte Carlo Tree Search for Text Generation

Algorithm 1 Monte Carlo Tree Search for Text Generation

procedure RunMCTS(self, iterations)
    rng ← initialize random number generator
    for iter ← 1 to iterations do
        node ← self.root
        state ← [node.word]
        end ← False
        ▷ Selection
        while node.untried_words is not empty and node.children exist and not end do
            selected_child ← node.select_child()
            state.append(selected_child.word)
            node ← selected_child
            if length of state == self.sentence_length then
                end ← True
            end if
        end while
        ▷ Expansion
        if node.untried_words is not empty and not end then
            untried_words ← node.untried_words
            p_word ← node.word.1
            for word in untried_words do
                tmp_untried_words ← get potential words from trigram matrix
                child_node ← new Node(p_word, word, tmp_untried_words)
                node.children.append(child_node)
            end for
            node ← node.children.last()
            state.append(node.word)
        end if
        ▷ Simulation
        while length of state < self.sentence_length do
            last_word ← state.last()
            next_words ← get from trigram matrix using last_word
            if next_words is not empty then
                select next word based on probabilities from next_words
                state.append(selected word)
            else
                break
            end if
        end while
        ▷ Backpropagation
        score ← 1.0
        for i ← 1 to length of state − 1 do
            score ← score + get score from trigram matrix
        end for
        current_node ← node
        while current_node is not None do
            current_node.visits ← current_node.visits + 1
            current_node.score ← current_node.score + score
            current_node ← current_node.parent
        end while
    end for
    ▷ Extract and sort top sentences
    sentences ← extract sentences from root
    sort sentences by score
    return sentences
end procedure

Algorithm 1 illustrates the Monte Carlo Tree Search (MCTS) procedure for text generation. The procedure, RunMCTS, takes the number of iterations as input. The algorithm proceeds through the following phases:

Selection: Starting from the root node, the algorithm selects child nodes based on the selection criteria until a node with untried words or no children is reached, or the end of the sentence is detected.

Expansion: If the node has untried words and the end of the sentence is not reached, child nodes are created for each untried word using potential words from the trigram matrix.

Simulation: A sequence of words is generated starting from the current state until the sentence reaches the desired length or no more words can be selected from the trigram matrix.

Backpropagation: The score for the generated sequence is calculated, and this score is propagated back through the selected nodes, updating their visit counts and scores.

Finally, the top sentences are extracted from the root node and sorted by score.
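The four phases can be made concrete with a small runnable sketch. It is a simplified illustration under several assumptions and not the authors' implementation: the tri-gram table is a hypothetical dictionary mapping a word pair to next-word probabilities, selection uses the plain UCT rule of Eq. (12) rather than the PUCT score with $c_{1}$ and $c_{2}$ , an `expanded` flag replaces the `untried_words` bookkeeping, and drafts are ranked by mean backpropagated score.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical table layout: (w1, w2) -> {w3: probability}.
Trigrams = Dict[Tuple[str, str], Dict[str, float]]


class Node:
    def __init__(self, word: str, parent: Optional["Node"] = None):
        self.word, self.parent = word, parent
        self.children: List["Node"] = []
        self.expanded = False
        self.visits, self.score = 0, 0.0


def uct(child: Node, parent: Node, c: float) -> float:
    if child.visits == 0:
        return math.inf  # assumption: try unvisited children first
    return child.score / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)


def run_mcts(table: Trigrams, prev: str, word: str, length: int,
             iterations: int, c: float = 1.414, seed: int = 0):
    rng = random.Random(seed)
    root = Node(word)
    for _ in range(iterations):
        node, seq = root, [prev, word]
        # Selection: descend through expanded nodes until the draft is full.
        while node.expanded and node.children and len(seq) - 2 < length:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent, c))
            seq.append(node.word)
        # Expansion: one child per candidate continuation, then step into the last.
        if not node.expanded and len(seq) - 2 < length:
            node.expanded = True
            for nxt in table.get((seq[-2], seq[-1]), {}):
                node.children.append(Node(nxt, node))
            if node.children:
                node = node.children[-1]
                seq.append(node.word)
        # Simulation: sample continuations in proportion to their probabilities.
        while len(seq) - 2 < length:
            options = table.get((seq[-2], seq[-1]))
            if not options:
                break
            seq.append(rng.choices(list(options), weights=list(options.values()))[0])
        # Backpropagation: base score 1.0 plus the tri-gram probabilities along seq.
        score = 1.0 + sum(table.get((seq[i - 2], seq[i - 1]), {}).get(seq[i], 0.0)
                          for i in range(2, len(seq)))
        while node is not None:
            node.visits += 1
            node.score += score
            node = node.parent
    # Extraction: visited leaves become candidate drafts, best mean score first.
    drafts: List[Tuple[List[str], float]] = []

    def collect(n: Node, path: List[str]) -> None:
        if not n.children:
            if path and n.visits:
                drafts.append((path, n.score / n.visits))
            return
        for ch in n.children:
            collect(ch, path + [ch.word])

    collect(root, [])
    return sorted(drafts, key=lambda d: d[1], reverse=True)


toy = {("a", "b"): {"c": 0.9, "d": 0.1}, ("b", "c"): {"e": 1.0}, ("b", "d"): {"f": 1.0}}
best_draft, best_score = run_mcts(toy, "a", "b", length=2, iterations=50)[0]
```

On the toy table, the continuation through the high-probability tri-gram ("a", "b", "c") accumulates the larger score, so it is ranked first among the extracted drafts.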

F.2 Dynamic Adjustment of tri-gram Probabilities

Algorithm 2 details the procedure for dynamically adjusting tri-gram probabilities. The procedure, Adjust3Gram, takes a sequence of tokens, the maximum number of most recent tokens to consider, an increment value, and the maximum probability.

For each trigram in the recent portion of the token sequence, the probability is increased by the increment value, up to the specified maximum probability. If the trigram is not already in the tri-gram matrix, it is added with the increment value as its initial probability.

Algorithm 2 Dynamic Adjustment of tri-gram Probabilities

procedure Adjust3Gram(tokens, maxLength, increment, maxProb)
    n ← length of tokens
    startIndex ← max(0, n − maxLength)
    subTokens ← tokens[startIndex : n]
    for i ← 2 to length of subTokens − 1 do
        triGram ← (subTokens[i−2], subTokens[i−1], subTokens[i])
        if triGram in tri-gram matrix then
            currentProb ← tri-gram matrix[triGram]
            newProb ← min(currentProb + increment, maxProb)
            tri-gram matrix[triGram] ← newProb
        else
            tri-gram matrix[triGram] ← increment
        end if
    end for
end procedure
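Algorithm 2 translates almost directly into code. The sketch below is a minimal rendering that assumes the tri-gram matrix is stored as a dictionary keyed by token triples; the function name `adjust_trigrams`, the `TrigramTable` alias, and the in-place update are illustrative choices rather than details confirmed by the paper.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical storage: (t1, t2, t3) -> probability.
TrigramTable = Dict[Tuple[int, int, int], float]


def adjust_trigrams(table: TrigramTable, tokens: Sequence[int], max_length: int,
                    increment: float, max_prob: float) -> None:
    """Boost every tri-gram in the last max_length tokens, capped at max_prob."""
    recent = tokens[max(0, len(tokens) - max_length):]
    for i in range(2, len(recent)):
        key = (recent[i - 2], recent[i - 1], recent[i])
        if key in table:
            table[key] = min(table[key] + increment, max_prob)
        else:
            # Unseen tri-grams enter the table with the increment as initial value.
            table[key] = increment


table: TrigramTable = {(1, 2, 3): 0.9}
adjust_trigrams(table, [0, 1, 2, 3, 4], max_length=4, increment=0.2, max_prob=1.0)
```

In this usage example only the last four tokens are considered, so the tri-gram (0, 1, 2) is ignored, the existing entry (1, 2, 3) is capped at the maximum probability, and the new tri-gram (2, 3, 4) is added with the increment as its value.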

Appendix G Additional Experimental Results

In this section, we present additional experimental results to further illustrate the effectiveness of our proposed methods. We compare the performance of Vicuna-7B and Vicuna-13B models with and without the adaptive strategy and analyze the impact of varying Monte Carlo Tree Search (MCTS) search counts on performance.

Figure 7 shows the performance comparison on the MTBench dataset for Vicuna-7B and Vicuna-13B models with and without the adaptive strategy. The results demonstrate the advantage of using the adaptive approach. For both models, the adaptive strategy significantly reduces the average exact length of the generated sequences as the number of tokens increases, indicating more efficient and accurate text generation.

Figure 8 presents the results for Vicuna-7B and Vicuna-13B models on the MTBench dataset, showing the impact of varying MCTS search counts on performance. For both models, increasing the number of MCTS search counts leads to improved performance, with the optimal counts varying by model size. The average exact length and latency are plotted against the number of MCTS search counts, illustrating the trade-off between performance and computational cost. As shown in the plots, there is a notable improvement in the average exact length as the search counts increase, while latency also increases, indicating a balance between the depth of search and the time taken for generation.