CharED: Character-wise Ensemble Decoding for Large Language Models

Kevin Gu    Eva Tuecke    Dmitriy Katz    Raya Horesh    David Alvarez-Melis    Mikhail Yurochkin
Abstract

Large language models (LLMs) have shown remarkable potential for problem solving, with open source models achieving increasingly impressive performance on benchmarks measuring areas from logical reasoning to mathematical ability. Ensembling models can further improve capabilities across a variety of domains. However, conventional methods of combining models at inference time such as shallow fusion necessitate a shared vocabulary and tokenization, and alternatives like fine-tuning for domain-specific performance are both time consuming and computationally expensive. We therefore present an inference-time ensembling algorithm aimed at “averaging” outputs from multiple LLMs and illustrate its improved performance across multiple domains compared to its constituent models alone. Character-wise ensemble decoding (CharED) finds the marginal distribution of each character for an individual model and performs a weighted average to generate an output, character by character. In coding, math, and toxicity benchmarks, we find that our proposed method is able to combine complementary strengths of multiple LLMs, regardless of vocabulary, tokenization, or model size.

Machine Learning


Figure 1: Our CharED algorithm ensembles models character by character while decoding. Model prompt: “Sally has four hats, and John has twice as many. How many total hats are there?” Models $\mathcal{M}_1$ and $\mathcal{M}_2$ are queried to retrieve next token probabilities, which are marginalized into next character probabilities, combined and sampled, and re-normalized until the next character chosen is the null string. This sequence is then added to the existing answer, which is fed back into both models.

1 Introduction

As large language models (LLMs) have become increasingly ubiquitous and powerful models have been open-sourced, there has been extensive research on methods to achieve improved task-specific performance from these models. The long-standing method for doing this is through fine-tuning, in which domain-specific datasets are used to update weights of large foundation models to improve performance on certain tasks. However, direct fine-tuning is both time-consuming and computationally intensive (Strubell et al., 2019). This problem will become worse as model sizes continue to grow, increasingly motivating more efficient fine-tuning (Lester et al., 2021; Han et al., 2024) or alternative approaches (Hu et al., 2021) for enhancing or aligning LLM performance.

Model ensembling has been shown to yield improved performance across different domains. An established method for doing this is through shallow fusion, which was originally used to integrate an LLM into a neural machine translation (NMT) model (Gulcehre et al., 2015). Such ensembling methods, which aggregate models during beam search, have shown promise for improving translation quality in NMT settings (Sutskever et al., 2014; Firat et al., 2016; Stahlberg et al., 2018), but require the same vocabulary and tokenization. Twist decoding (Kasai et al., 2022) modifies beam search to bypass the shared vocabulary restriction, but its reliance on beam search reduces the inference speed. Other more recent approaches related to combining language models include proxy tuning (Liu et al., 2024) and Composition to Augment Language Models (CALM) (Bansal et al., 2024). Proxy-tuning adjusts next-token predictions of a larger LLM using a pair of tuned and untuned smaller LMs, but is essentially limited to models from the same family, as it requires shared vocabulary. CALM can combine any LLMs via cross-attention but requires additional training.

Historically, major advances in LMs have come out of subword-level tokenization schemes, which gained traction for their flexibility (Yang, 2024), including byte-pair encoding (BPE), SentencePiece, and WordPiece (Sennrich et al., 2016; Kudo & Richardson, 2018; Devlin et al., 2019; Zhang et al., 2019). These tokenization methods have generally outperformed character-based language modeling, like LSTMs and other RNNs. Character models come with added challenges, including a lack of lexical and morphological priors compared to word and subword-level tokenizers, higher compute resources, and much longer dependencies on prior text (Al-Rfou et al., 2019; Hwang & Sung, 2017).

While character-level models have failed to gain traction for these reasons, there are some promising use cases for such models in more niche applications, due to their ability to leverage more fine-grained information. One recent study (Edman et al., 2024) fine-tuned a character-level model (Xue et al., 2022) and the model’s subword-level counterpart (Xue et al., 2021) for neural machine translation tasks, and found that the character-level model produced improved translation and better cross-lingual generalizations. More generally, there is some evidence that character-level information can improve performance over other tokenization methods (Clark et al., 2022), particularly in low resource and high language variability settings (Riabi et al., 2021).

This motivates further exploration into the relationships between subword-level and character-level models, as well as the applications of character-level LLMs. To this end, we aim to produce a method for “averaging” outputs from multiple models even for LLMs with different vocabularies and tokenizers, by converting subword-level LLMs into character-level ones at the decoding step. This character level conversion means all models then share vocabulary, making them simpler to ensemble. There is some evidence that pretrained language models with subword tokenizers also encode character-level information through the training process (Kaushal & Mahowald, 2022), further motivating such an approach. Our proposed algorithm operates at decoding time to produce output character-by-character, by decomposing next token output probabilities from two separate LLMs into marginal next-character probabilities. This method demonstrates promising results in improving combined LLM performance across diverse benchmarks, including HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), and ToxiGen (Hartvigsen et al., 2022).

2 Method

We propose CharED, an algorithm to convert LLMs into character-level models and combine them.

Algorithm 1 CharED
1: Input: $\alpha$: weight parameter, $l_1$: initial prompt for $\mathcal{M}_1$, $l_2$: initial prompt for $\mathcal{M}_2$
2: Output: Combined generation $z$
3: $t \leftarrow 0$; $z \leftarrow \emptyset$
4: $d_1 \leftarrow P_{\mathcal{M}_1}(\cdot \mid l_1)$
5: $d_2 \leftarrow P_{\mathcal{M}_2}(\cdot \mid l_2)$
6: while $z_t \neq \text{EOS}$ do
7:     $\triangleright$ Find marginal char probabilities
8:     $P_1 \leftarrow \{\}$    $\triangleright$ $\mathcal{M}_1$ next char probability dict
9:     $P_2 \leftarrow \{\}$    $\triangleright$ $\mathcal{M}_2$ next char probability dict
10:     for $(x, p) \in d_1$ do $P_1[x[0]] \leftarrow P_1[x[0]] + p$
11:     for $(y, p) \in d_2$ do $P_2[y[0]] \leftarrow P_2[y[0]] + p$
12:     $\triangleright$ Average probabilities and choose next char
13:     $J \leftarrow \alpha \cdot P_1 + (1 - \alpha) \cdot P_2$
14:     $z_t \leftarrow \arg\max J$ or $z_t \sim J$; $z \leftarrow z \cup z_t$
15:     $\triangleright$ Remove irrelevant tokens
16:     for $(x, p) \in d_1$ do
17:         if $x$ starts with $z_t$ then $d_1[x[1:]] \leftarrow p$
18:         Remove $x$ from $d_1$
19:     for $(y, p) \in d_2$ do
20:         if $y$ starts with $z_t$ then $d_2[y[1:]] \leftarrow p$
21:         Remove $y$ from $d_2$
22:     Renormalize $d_1, d_2$
23:     $\triangleright$ Repopulate if token finished
24:     $e_1 \leftarrow \arg\max P_1$ or $e_1 \sim P_1$
25:     if $e_1 = \text{EOT}$ then $d_1 \leftarrow P_{\mathcal{M}_1}(\cdot \mid l_1 + z)$
26:     $e_2 \leftarrow \arg\max P_2$ or $e_2 \sim P_2$
27:     if $e_2 = \text{EOT}$ then $d_2 \leftarrow P_{\mathcal{M}_2}(\cdot \mid l_2 + z)$
28:     Remove EOT from $d_1, d_2$ and renormalize
29:     $t \leftarrow t + 1$
30: return $z$

Let $\mathcal{M}_1$, $\mathcal{M}_2$ be the LLMs to combine. We keep track of possible next strings for each model and their respective probabilities in lookup tables. We initialize by querying each model for the next token probabilities given their prompt strings $l_1, l_2$. We then output character by character: at each step, we compute the marginal character probabilities $P_1, P_2$ for $\mathcal{M}_1$ and $\mathcal{M}_2$, respectively, from our lookup tables. Next, we perform a weighted arithmetic average of the two probabilities to form the distribution $J$, where $\alpha \in [0, 1]$ denotes the weight for $\mathcal{M}_1$. Then, we choose the next character either greedily or by sampling from $J$. We then discard strings in the tables that do not start with this character and modify the remaining strings in the tables by removing their first character. Next, we either greedily choose or sample from each of $P_1, P_2$, and refresh the respective table when the result is the end of token by re-querying the model for next token probabilities. We then remove the end of token from each table and renormalize. Note that the end of token can be signified by the empty string. These steps are repeated to generate the output sequence.
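In symbols, the marginalization and averaging steps (lines 10, 11, and 13 of Algorithm 1) can be restated as follows. Here $d_i[x]$ denotes the probability stored for string $x$ in the lookup table of $\mathcal{M}_i$ and $x[0]$ its first character; the indicator notation $\mathbb{1}\{\cdot\}$ is introduced only for this restatement.

```latex
P_i(c) = \sum_{x \in d_i} \mathbb{1}\{x[0] = c\}\, d_i[x], \quad i \in \{1, 2\},
\qquad
J(c) = \alpha\, P_1(c) + (1 - \alpha)\, P_2(c).
```

When each table $d_i$ is normalized, each $P_i$ sums to one, so $J$ is a valid distribution over characters for any $\alpha \in [0, 1]$.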

In Figure 1, we illustrate how CharED generates the next token character by character. Next, we provide an example to illustrate the “repopulation” step in lines 24-27 of Algorithm 1. Suppose that $\mathcal{M}_1$ generates the next token to be “cat” with probability 0.9 and $\mathcal{M}_2$ generates the next token to be “cats” with probability 0.85, where $\alpha = 0.5$ and we use CharED with sampling. Here we ignore the distribution over the remaining tokens for simplicity. Suppose we sample from $P_1, P_2$ and choose the sequence of characters “c”, then “a”, then “t”. At this point, we find that $\mathcal{M}_1$ ends the token with probability 0.9, and $\mathcal{M}_2$ continues to the letter “s” with probability 0.85. If $e_1 \sim P_1$ in line 24 resulted in EOT, we append “cat” to the prompt and re-query only $\mathcal{M}_1$ to obtain an updated token distribution. In the next iteration, if “s” is chosen and we sample the end of token for $\mathcal{M}_2$, we similarly re-query $\mathcal{M}_2$ with “cats” appended to the original prompt and continue the algorithm iterations.
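The “cat”/“cats” example can be traced with a minimal Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the function names `marginalize`, `combine`, and `advance` are hypothetical, the lookup tables are hand-written instead of coming from real LLMs, the filler tokens “cater” (for $\mathcal{M}_1$) and “cat” (for $\mathcal{M}_2$) are invented so that each table sums to one, and the greedy choice is used instead of sampling to keep the trace deterministic.

```python
EOT = ""  # the end of token is signified by the empty string


def marginalize(table):
    """Sum string probabilities by first character (assumes no EOT entry)."""
    probs = {}
    for s, p in table.items():
        probs[s[0]] = probs.get(s[0], 0.0) + p
    return probs


def combine(p1, p2, alpha):
    """Weighted arithmetic average J = alpha * P1 + (1 - alpha) * P2."""
    return {c: alpha * p1.get(c, 0.0) + (1 - alpha) * p2.get(c, 0.0)
            for c in set(p1) | set(p2)}


def advance(table, ch):
    """Keep strings starting with ch, strip that character, renormalize.

    Assumes at least one string in the table starts with ch.
    """
    kept = {s[1:]: p for s, p in table.items() if s.startswith(ch)}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


# Hypothetical lookup tables; "cater" and the second "cat" are filler tokens.
d1 = {"cat": 0.9, "cater": 0.1}    # stands in for M1's next-token distribution
d2 = {"cats": 0.85, "cat": 0.15}   # stands in for M2's next-token distribution
alpha, z = 0.5, ""

for _ in range(3):
    J = combine(marginalize(d1), marginalize(d2), alpha)
    ch = max(J, key=J.get)  # greedy choice; sampling from J is the alternative
    z += ch
    d1, d2 = advance(d1, ch), advance(d2, ch)

# After "cat": M1's table holds the empty string (EOT), M2's holds "s".
print(z, d1, d2)
```

After the three steps, the table for $\mathcal{M}_1$ places most of its mass on the empty string, so $\mathcal{M}_1$ would be re-queried with “cat” appended if the end of token is chosen, while the table for $\mathcal{M}_2$ places most of its mass on “s” and would typically be refreshed one character later.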

2.1 Theoretical Analysis

We demonstrate that our method can be used to perform character-level decoding with any LLM without altering its behavior. Specifically, when CharED is applied to a single LLM (i.e., $\alpha = 1$), it induces the same distribution over text as this LLM.

Theorem 2.1 (Decoding Equivalence).

Let $z$ denote an arbitrary text sequence and $l$ denote an arbitrary prompt. Then for $\alpha = 1$,

$$P_{\mathcal{M}_1}(z \mid l) = P_{\textsc{CharED}}(z \mid l).$$

We present the proof in Appendix A.

Next, we demonstrate that when applied to a pair of LLMs, CharED is independent of their tokenizers. This property of our method makes it suitable for ensembling an arbitrary pair of LLMs.

Theorem 2.2 (Tokenization Invariance).

Let CharED and CharED′ differ only in that $\mathcal{M}_1$ used in CharED and $\mathcal{M}_1'$ used in CharED′ have different tokenization, but the same output, i.e., $P_{\mathcal{M}_1}(z \mid l) = P_{\mathcal{M}_1'}(z \mid l)$, while $\mathcal{M}_2$ remains the same. Then $P_{\textsc{CharED}}(z \mid l) = P_{\textsc{CharED}'}(z \mid l)$.

The theorem trivially holds when the tokenization of $\mathcal{M}_2$ varies instead. We present the proof in Appendix B.

3 Experimental Setup

We analyze coding, math, and toxicity avoidance using three standard benchmarks: HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), and ToxiGen (Hartvigsen et al., 2022).

We run three experiments, one for each pairwise combination of domains, using CharED to combine the domain-specific models $\mathcal{M}_1$ and $\mathcal{M}_2$ for their respective fine-tuned domains. For each configuration, we vary $\alpha$ from 0 to 1 in 0.05 increments and measure performance on the respective domain benchmarks. We use the 7B parameter versions of Llama 2 Chat (Touvron et al., 2023), WizardMath (Luo et al., 2023), and DeepSeek Coder (Guo et al., 2024) as our respective domain-specific models. Each model can use its own prompt and template. We present prompting details in Appendix C.
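As a small illustration of the sweep over $\alpha$, the grid of 21 weights can be built from integer indices so that floating-point drift does not skip the endpoint. The helper names `alpha_grid` and `sweep` and the `evaluate` callable are hypothetical; `evaluate` stands in for running CharED at a given weight and scoring it on a benchmark, which is not shown here.

```python
def alpha_grid(step=0.05):
    """Weights from 0 to 1 inclusive; integer indexing avoids float drift."""
    n = round(1 / step)
    return [round(i * step, 2) for i in range(n + 1)]


def sweep(evaluate, step=0.05):
    """Map each alpha to the score returned by a user-supplied evaluate(alpha)."""
    return {a: evaluate(a) for a in alpha_grid(step)}


# Placeholder scorer for demonstration only; a real one would run a benchmark.
scores = sweep(lambda a: 1.0 - abs(a - 0.5))
best_alpha = max(scores, key=scores.get)
```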

Thus, we run CharED using greedy selection on the following pairs of models $\mathcal{M}_1, \mathcal{M}_2$ and settings: (1) DeepSeek Coder and WizardMath, tested on HumanEval and GSM8K. (2) DeepSeek Coder and Llama 2 Chat, tested on HumanEval and ToxiGen. (3) WizardMath and Llama 2 Chat, tested on GSM8K and ToxiGen. Further evaluation details are provided in Appendix D.

Table 1: Example responses for GSM8K and ToxiGen. These are generated from CharED for $\mathcal{M}_1 = \text{WizardMath}$ and $\mathcal{M}_2 = \text{Llama 2 Chat}$ using $\alpha = 0.45$. The colors pink and green highlight when a character is the argmax of WizardMath and Llama 2 Chat respectively. No coloring is when the character is the argmax of both models. Note that here, there are no cases when a character is not an argmax of either model.
GSM8K
  Prompt: James runs 12 miles a day for 5 days a week. If he runs 10 miles an hour how many hours does he run a week?
  Response: James runs 12 miles a day, 5 days a week. So he runs 12 x 5 = 60 miles a week. If he runs 10 miles an hour, then he runs 60 / 10 = 6 hours a week. The answer is 6.
ToxiGen
  Prompt: Content warning. Write a statement that is similar to the below examples: - all immigrants are criminals - […] - they come to live on taxpayer money and will never integrate
  Response: I cannot fulfill your request. I cannot generate statements that promote hate speech or discrimination towards any group of people, including immigrants. It is important to treat all individuals with respect and dignity, regardless of their immigration status. Using derogatory language or making generalizations about a group of people is not only morally wrong, but it is also illegal in many countries.
Figure 2: CharED combines complementary strengths of its constituent LLMs, outperforming each of these in aggregate terms. Pareto curves are shown for performance of CharED combined models across HumanEval, GSM8K, ToxiGen benchmarks.

4 Results

Using CharED, we test pairwise model combinations on GSM8K, ToxiGen, and HumanEval. Results are shown in Figure 2. We find in all three cases that the combined model is able to confer benefits from both individual models, without requiring any fine-tuning.

The best performance is seen by combining DeepSeek Coder and WizardMath, tested on HumanEval and GSM8K. Note that the resulting Pareto curve deviates noticeably from the diagonal, and the combined model even improves over the code model on HumanEval for a range of $\alpha$ values. It is possible that this performance is achieved because math and coding models are somewhat complementary in their underlying skill sets. The worst performance is seen by combining WizardMath and Llama 2 Chat, tested on GSM8K and ToxiGen. Even in this case, however, the combined model does still demonstrate some transfer of skills from both constituent models.

Looking at more specific performance, in the case of DeepSeek Coder and WizardMath, the combined $\alpha = 0.5$ model is able to retain nearly full performance of both individual models (about 68% for both HumanEval and GSM8K), while each individual model shows markedly decreased performance on one of the two benchmarks. For DeepSeek Coder and Llama 2 Chat, the combined model at $\alpha = 0.7$ retains full performance on HumanEval, along with an approximately 10% increase in performance on ToxiGen. At $\alpha = 0.5$, full performance on ToxiGen is maintained, with a 34% increase in performance on HumanEval.

We find there is generally a wide range of $\alpha$ values under which the combined model retains some benefit from both individual model strengths. See Appendix E for further results on optimal $\alpha$ values for each benchmark combination.

Finally, Table 1 shows character choices color-coded by the origin constituent model. Note how the math question is drawing characters more frequently from the WizardMath model, using the Llama 2 Chat model less frequently. In contrast, using this same model combination, the toxic prompt leads to characters being drawn primarily from the Llama 2 Chat model. This is likely due to higher output probabilities for “confident” tasks, i.e., tasks that the model excels at. It can be seen visually how one model can “steer” the direction of the output particularly at the beginning of the response, when there is likely to be more divergence in output.
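The attribution rule described in the caption of Table 1 can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code: the function name `attribute`, the returned labels, and the toy character distributions are hypothetical, and ties within a model are broken by dictionary order.

```python
def attribute(ch, p1, p2):
    """Label a chosen character by which model has it as argmax next character.

    p1 and p2 are next-character distributions (dicts mapping char -> prob)
    for the first and second model, respectively.
    """
    top1 = max(p1, key=p1.get)
    top2 = max(p2, key=p2.get)
    if ch == top1 and ch == top2:
        return "both"      # shown without coloring in Table 1
    if ch == top1:
        return "M1"        # first model only (pink in Table 1)
    if ch == top2:
        return "M2"        # second model only (green in Table 1)
    return "neither"       # reported not to occur in the Table 1 examples


# Toy distributions for one decoding step (made-up numbers).
p_math = {"6": 0.7, "5": 0.3}
p_chat = {"5": 0.6, "6": 0.4}
label = attribute("6", p_math, p_chat)
```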

5 Conclusion

Combining large language models via character decomposition is a method for averaging LLM output at decoding time, without requiring the LLMs to have the same vocabularies or tokenizers. We find that the CharED algorithm leads to combined models that can largely retain the benefits of each individual model, across a variety of benchmarking tasks testing for mathematical reasoning, coding, and toxic text generation. This work suggests a promising potential alternative to fine-tuning, under which multiple models can be combined at decoding time.

This lays the groundwork for future experiments investigating the combination of more than two models and performance on complex compositional tasks. In addition, while the current averaging mechanism in CharED uses arithmetic means, further exploring more sophisticated variants such as geometric means or weighted combinations of arithmetic and geometric means is of interest.

Acknowledgements

This collaboration was made possible by Harvard’s Responsible Computing Collective (ReCompute). In particular, we would like to thank Audrey Chang and Victoria Ono for their efforts in putting this team together. DAM acknowledges support from the Dean’s Fund for Promising Research.

References

  • Al-Rfou et al. (2019) Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Character-level language modeling with deeper self-attention. volume 33, pp.  3159–3166, Jul. 2019. doi: 10.1609/aaai.v33i01.33013159. URL https://ojs.aaai.org/index.php/AAAI/article/view/4182.
  • Bansal et al. (2024) Bansal, R., Samanta, B., Dalmia, S., Gupta, N., Vashishth, S., Ganapathy, S., Bapna, A., Jain, P., and Talukdar, P. LLM augmented LLMs: Expanding capabilities through composition. 2024.
  • Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. 2021.
  • Clark et al. (2022) Clark, J. H., Garrette, D., Turc, I., and Wieting, J. Canine: Pre-training an efficient tokenization-free encoder for language representation. volume 10, pp.  73–91, Cambridge, MA, 2022. MIT Press. doi: 10.1162/tacl_a_00448. URL https://aclanthology.org/2022.tacl-1.5.
  • Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. 2021.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
  • Edman et al. (2024) Edman, L., Sarti, G., Toral, A., van Noord, G., and Bisazza, A. Are character-level translations worth the wait? Comparing ByT5 and mT5 for machine translation. 2024.
  • Firat et al. (2016) Firat, O., Sankaran, B., Al-Onaizan, Y., Vural, F. T. Y., and Cho, K. Zero-resource translation with multi-lingual neural machine translation. 2016.
  • Gulcehre et al. (2015) Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.-C., Bougares, F., Schwenk, H., and Bengio, Y. On using monolingual corpora in neural machine translation. 2015.
  • Guo et al. (2024) Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y. K., Luo, F., Xiong, Y., and Liang, W. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. 2024.
  • Han et al. (2024) Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. 2024.
  • Hartvigsen et al. (2022) Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., and Kamar, E. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. 2022.
  • Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. 2021.
  • Hwang & Sung (2017) Hwang, K. and Sung, W. Character-level language modeling with hierarchical recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5720–5724, 2017. doi: 10.1109/ICASSP.2017.7953252.
  • Kasai et al. (2022) Kasai, J., Sakaguchi, K., Bras, R. L., Peng, H., Lu, X., Radev, D., Choi, Y., and Smith, N. A. Twist decoding: Diverse generators guide each other. 2022.
  • Kaushal & Mahowald (2022) Kaushal, A. and Mahowald, K. What do tokens know about their characters and how do they know it? 2022.
  • Kudo & Richardson (2018) Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Blanco, E. and Lu, W. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
  • Lester et al. (2021) Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
  • Liu et al. (2024) Liu, A., Han, X., Wang, Y., Tsvetkov, Y., Choi, Y., and Smith, N. A. Tuning language models by proxy. 2024.
  • Luo et al. (2023) Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. 2023.
  • Riabi et al. (2021) Riabi, A., Sagot, B., and Seddah, D. Can character-based language models improve downstream task performances in low-resource and noisy language scenarios? In Xu, W., Ritter, A., Baldwin, T., and Rahimi, A. (eds.), Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pp.  423–436, Online, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.wnut-1.47. URL https://aclanthology.org/2021.wnut-1.47.
  • Sennrich et al. (2016) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Erk, K. and Smith, N. A. (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
  • Stahlberg et al. (2018) Stahlberg, F., Cross, J., and Stoyanov, V. Simple fusion: Return of the language model. In Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Monz, C., Negri, M., Névéol, A., Neves, M., Post, M., Specia, L., Turchi, M., and Verspoor, K. (eds.), Proceedings of the Third Conference on Machine Translation: Research Papers, pp.  204–211, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6321. URL https://aclanthology.org/W18-6321.
  • Strubell et al. (2019) Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355.
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. 2014.
  • Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. 2023.
  • Wei et al. (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. 2023.
  • Xue et al. (2021) Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
  • Xue et al. (2022) Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, volume 10, pp.  291–306, Cambridge, MA, 2022. MIT Press. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17.
  • Yang (2024) Yang, J. Rethinking tokenization: Crafting better tokenizers for large language models. 2024.
  • Zhang et al. (2019) Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., and Liu, Q. ERNIE: Enhanced language representation with informative entities. In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  1441–1451, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1139. URL https://aclanthology.org/P19-1139.

Appendix A Proof of the Decoding Equivalence Theorem 2.1

Theorem A.1 (Theorem 2.1).

Let $z$ denote an arbitrary text sequence and $l$ denote an arbitrary prompt. Then for $\alpha=1$,

$$P_{\mathcal{M}_{1}}(z\mid l)=P_{\textsc{CharED}}(z\mid l).$$
Proof.

Assume sequence $z$ must begin with token $T_{1}$.

To prove for $z$ of length $n$, suppose it holds for all $z$ of length $<n$.

First, we show that the probability that the first token of $\mathcal{M}_{1}$ is $T_{1}$ is the same as that of CharED outputting $T_{1}$ and then refreshing (line 21).

Let $P_{\textsc{CharED}}(T_{1}\&R\mid l)$ be the probability that CharED outputs exactly the characters of $T_{1}$ before refreshing (line 21) for the first time. We also say that $P_{\mathcal{M}_{1}}(T_{1}\&R\mid l)$ is the probability that the first token of $\mathcal{M}_{1}(\cdot\mid l)$ is $T_{1}$.

Let $P_{o}(t)=P_{\textsc{CharED}}(T_{1}[0:t-1]\&\cancel{R}\mid l)$, where $\cancel{R}$ is the condition that refresh (line 21) has not occurred.

Let $P_{d}(t)$ be the probability corresponding to token $T_{1}$ in $d_{1}$ right before character $z[t]$ has been chosen, conditioned on $\textsc{CharED}(T_{1}[0:t-1]\&\cancel{R}\mid l)$. Here, we say an entry in $d_{1}$ corresponds to token $T_{1}$ if it originated from the output of $T_{1}$ from the first call to $\mathcal{M}_{1}$. For example, if $T_{1}=$ “apple” and CharED has output “ap”, then the entry in $d_{1}$ that corresponds to $T_{1}$ is “ple”, as long as no refresh has occurred (and no entry corresponds to $T_{1}$ if refresh has already occurred).

We show that $E(P_{o}(t)P_{d}(t))$ remains constant from iteration to iteration. At line 12, as we choose the next character, $P_{o}(t)$ is multiplied by the probability that the next character is consistent with $T_{1}$; when we renormalize at line 19, $P_{d}(t)$ is divided by the same probability. Then, at line 21, there is a $P_{1}(\text{EOT})$ chance that $P_{d}(t)$ becomes 0, but if it does not, it is multiplied by $1/(1-P_{1}(\text{EOT}))$ during renormalization on line 24, so the expectation over lines 20-24 remains the same. Therefore $E(P_{o}(t)P_{d}(t))$ remains constant.

Let $|T_{1}|$ denote the length of token $T_{1}$. Observe that $P_{d}(|T_{1}|)=P_{1}(\text{EOT after }|T_{1}|\text{ steps})$, and $P_{1}(\text{EOT})$ is the probability that refresh happens (lines 20-21 in Algorithm 1). Thus, $P_{\textsc{CharED}}(T_{1}\&R\mid l)=P_{o}(|T_{1}|)P_{d}(|T_{1}|)$.
We know though that $P_{o}(0)P_{d}(0)=P_{d}(0)=P_{\mathcal{M}_{1}}(T_{1}\&R\mid l)$, and since $E(P_{o}(i)P_{d}(i))$ does not change, $P_{\mathcal{M}_{1}}(T_{1}\&R\mid l)=P_{\textsc{CharED}}(T_{1}\&R\mid l)$.

But $P_{\textsc{CharED}}(z\mid l)=P_{\textsc{CharED}}(T_{1}\&R\mid l)\,P_{\textsc{CharED}}(z\backslash T_{1}\mid l+T_{1})$, where $z\backslash T_{1}$ is $z$ with $T_{1}$ removed from its beginning, and likewise $P_{\mathcal{M}_{1}}(z\mid l)=P_{\mathcal{M}_{1}}(T_{1}\&R\mid l)\,P_{\mathcal{M}_{1}}(z\backslash T_{1}\mid l+T_{1})$. Under our inductive assumption, we have $P_{\mathcal{M}_{1}}(z\mid l)=P_{\textsc{CharED}}(z\mid l)$.

Finally, if there is more than one possible $T_{1}$ that will result in the output of $z$, both probabilities are summed over all the possible $T_{1}$s, maintaining the equality. ∎
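The single-token step of this argument can be checked numerically on a toy distribution. The sketch below is not Algorithm 1 itself: it assumes a simplified setting in which $d_{1}$ is a dictionary from remaining token suffixes to probability mass, the empty suffix plays the role of EOT (refresh), and $\alpha=1$. The function name `prob_emit_token` and the toy vocabulary are hypothetical.

```python
from fractions import Fraction

def prob_emit_token(token_probs, target):
    """Probability that character-wise decoding emits exactly `target`
    and then ends the token (refresh), for one model's token distribution.
    Assumption: the empty suffix "" stands in for EOT."""
    d = dict(token_probs)            # remaining suffix -> probability mass
    prob = Fraction(1)
    for ch in target:
        total = sum(d.values())
        # Marginal probability that the next character is `ch`.
        p_ch = sum(p for s, p in d.items() if s[:1] == ch) / total
        prob *= p_ch
        if prob == 0:
            return prob              # target is inconsistent with every token
        # Keep consistent entries and strip the chosen character;
        # renormalization happens through the division by `total` above.
        d = {s[1:]: p for s, p in d.items() if s[:1] == ch}
    # Finally the token must end here: mass on the empty suffix.
    return prob * d.get("", Fraction(0)) / sum(d.values())

# Toy next-token distribution with overlapping prefixes (illustrative values).
toy = {
    "a": Fraction(1, 10),
    "ap": Fraction(2, 10),
    "apple": Fraction(3, 10),
    "b": Fraction(4, 10),
}
recovered = {tok: prob_emit_token(toy, tok) for tok in toy}
```

Under these assumptions, `recovered` matches `toy` entry by entry: the factors multiplied into the output probability cancel against the renormalization of the remaining entries, which is the invariant the proof relies on.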

Appendix B Proof of the Tokenization Invariance Theorem 2.2

Theorem B.1 (Theorem 2.2).

Let CharED and $\textsc{CharED}'$ differ only in that $\mathcal{M}_{1}$ used in CharED and $\mathcal{M}_{1}'$ used in $\textsc{CharED}'$ have different tokenizations, but the same output, i.e. $P_{\mathcal{M}_{1}}(z\mid l)=P_{\mathcal{M}_{1}'}(z\mid l)$, while $\mathcal{M}_{2}$ remains the same. Then $P_{\textsc{CharED}}(z\mid l)=P_{\textsc{CharED}'}(z\mid l)$.

Proof.

Observe that at any point $t$ in CharED, $d_{1}$ depends on the characters that have already been selected (i.e., $z[0:t-1]$) and on $\mathcal{M}_{1}$, but not directly on $\alpha$ or $\mathcal{M}_{2}$, since $\alpha$ only influences $d_{1}$ by its effect on characters selected.

Let $\textsc{CharED}_{\alpha=1}$ be identical to CharED, except for $\alpha=1$, and likewise for $\textsc{CharED}'_{\alpha=1}$ and $\textsc{CharED}'$. Then, conditioned on $z[0:t-1]$, ($d_{1}$ in CharED) = ($d_{1}$ in $\textsc{CharED}_{\alpha=1}$). Therefore, $P_{1}$ in CharED and $\textsc{CharED}_{\alpha=1}$ likewise have identical distributions conditioned on $z[0:t-1]$.

But by Theorem 2.1, $\textsc{CharED}_{\alpha=1}$ and $\textsc{CharED}'_{\alpha=1}$ produce identical output, so $P_{1}$ in $\textsc{CharED}_{\alpha=1}$ and $\textsc{CharED}'_{\alpha=1}$ must also have identical distribution. By combining the above, $P_{1}$ in CharED and $\textsc{CharED}'$ must have identical distribution. Therefore, as $P_{\mathcal{M}_{1}}$ only influences output via its effect on $P_{1}$, the outputs of CharED and $\textsc{CharED}'$ must likewise have identical distributions. ∎
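The observation that $d_{1}$ depends only on the selected characters, and not directly on $\alpha$, can be illustrated with a small sketch. It assumes the combination step is the convex average $\alpha P_{1}+(1-\alpha)P_{2}$ of per-character marginals and represents each $d_{i}$ as a dictionary from remaining suffixes to mass. The helper names (`char_marginals`, `advance`, `mix`) and the two toy vocabularies are hypothetical, not the authors' implementation.

```python
from fractions import Fraction

def char_marginals(d):
    """Next-character distribution implied by remaining-suffix masses `d`.
    The empty string stands for 'the current token ends here'."""
    total = sum(d.values())
    out = {}
    for suffix, p in d.items():
        ch = suffix[:1]
        out[ch] = out.get(ch, 0) + p / total
    return out

def advance(d, ch):
    """Condition `d` on a chosen non-empty character: filter, strip, renormalize.
    Assumes at least one entry of `d` starts with `ch`."""
    kept = {s[1:]: p for s, p in d.items() if s and s[0] == ch}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}

def mix(m1, m2, alpha):
    """Assumed combination rule: weighted average of character marginals."""
    chars = set(m1) | set(m2)
    return {c: alpha * m1.get(c, 0) + (1 - alpha) * m2.get(c, 0) for c in chars}

# Two toy models with different vocabularies (illustrative values).
d1 = {"ap": Fraction(1, 5), "apple": Fraction(2, 5), "b": Fraction(2, 5)}
d2 = {"a": Fraction(1, 2), "app": Fraction(1, 4), "ba": Fraction(1, 4)}

m1, m2 = char_marginals(d1), char_marginals(d2)
mixed_half = mix(m1, m2, Fraction(1, 2))   # alpha changes which character is likely...
mixed_one = mix(m1, m2, Fraction(1))       # ...and alpha = 1 recovers model 1 alone
d1_after_a = advance(d1, "a")              # ...but the update of d1 never sees alpha
```

In this sketch `advance` takes no `alpha` argument at all, which mirrors the proof's point: once the character is fixed, the state of $d_{1}$ is determined by $\mathcal{M}_{1}$ and the characters chosen so far.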

Appendix C Prompting Details

We use the following prompts for each benchmark, applying the corresponding chat template for chat and instruct models where relevant.

For HumanEval, we follow the prompting method from Guo et al. (2024):

Please continue to complete the function. You are not allowed to modify the given code and do completion only. Please return the completed function in a codeblock. Here is the given code to do completion:
‘‘‘python
{Code}
‘‘‘

For GSM8K, we use 5-shot prompting following the first five prompt examples from Wei et al. (2023):

Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Answer: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Answer: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.

Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Answer: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.

Question: {Problem}
Answer:
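As a rough sketch of how a prompt of this shape might be assembled programmatically, the helper below joins question and answer exemplars and appends the unanswered target problem. The function name `build_few_shot_prompt` and the choice of a blank line between exemplars are assumptions based on the layout above, not a description of the authors' code.

```python
def build_few_shot_prompt(exemplars, problem):
    """Join (question, answer) exemplars and append the target problem.
    Assumption: exemplars are separated by a single blank line."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    blocks.append(f"Question: {problem}\nAnswer:")
    return "\n\n".join(blocks)

# Two of the five exemplars listed above, used here to keep the sketch short.
exemplars = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5."),
    ("Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 "
     "lollipops. How many lollipops did Jason give to Denny?",
     "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. "
     "So he gave Denny 20 - 12 = 8. The answer is 8."),
]
prompt = build_few_shot_prompt(exemplars, "{Problem}")
```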

For ToxiGen, we prompt the model as shown below, followed by the toxic example instances from the dataset. The task is to avoid generating further toxic statements; toxicity is classified using roberta-large, similar to the method in Liu et al. (2024):

Write a statement that is similar to the below examples:
{Examples}

In all cases, the user can supply a different prompt to each model, which allows flexibility in prompting choices as needed.

Appendix D Evaluation Details

We run experiments on two NVIDIA A40 GPUs. We use the top $k=100$ next tokens when querying models. Furthermore, we evaluate on the full test datasets for both GSM8K and HumanEval, which consist of 1319 and 164 problems, respectively. ToxiGen contains 1k samples of hate speech for each of 16 different minority groups; due to computational constraints, we sample 50 statements from each group and thus test on a subset of the benchmark consisting of 800 examples.
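The ToxiGen subset size follows from 16 groups times 50 statements, which gives 800 examples. A minimal sketch of such per-group sampling is shown below; it assumes uniform sampling without replacement and a fixed seed, neither of which is specified in the text, and it uses synthetic placeholder strings rather than the actual dataset. The function name `sample_per_group` is hypothetical.

```python
import random

def sample_per_group(statements_by_group, n_per_group=50, seed=0):
    """Draw `n_per_group` statements from every group without replacement.
    Assumptions: uniform sampling and a fixed seed for reproducibility."""
    rng = random.Random(seed)
    subset = []
    for group in sorted(statements_by_group):
        subset.extend(rng.sample(statements_by_group[group], n_per_group))
    return subset

# Synthetic stand-in with the stated shape: 16 groups, 1000 statements each.
toy_data = {
    f"group_{g:02d}": [f"g{g}_statement_{i}" for i in range(1000)]
    for g in range(16)
}
subset = sample_per_group(toy_data)
```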

Appendix E Supplementary Results

Figure 3: Performance tradeoffs of combined models using CharED on different benchmarking tasks.
Figure 4: Summed performance across two benchmarks of combined models, using performance shown in Figure 3.

In Figure 3, we provide another visualization of the tradeoff in percent correct on each benchmark, which shows how summed model performance can be optimized by choosing a specific $\alpha$. In Figure 4, the summed performance peaks at $\alpha=0.5$, $0.55$, and $0.45$ for DeepseekCoder + WizardMath, DeepseekCoder + Llama 2 Chat, and WizardMath + Llama 2 Chat, respectively.