Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

Yichong Huang, Xiaocheng Feng†‡🖂, Baohang Li, Yang Xiang, Hui Wang
Ting Liu, Bing Qin†‡
Harbin Institute of Technology    Peng Cheng Laboratory
{ychuang,xcfeng,baohangli,tliu,qinb}@ir.hit.edu.cn
{xiangy,wangh06}@ir.hit.edu.cn

Abstract

Large language models (LLMs) exhibit complementary strengths in various tasks, motivating research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine all candidate answers, which makes generalization to unseen data distributions a great challenge. Besides, prior methods use textual responses as the communication medium, ignoring the valuable information in the internal representations. In this work, we propose a training-free ensemble framework, DeePEn, which fuses the informative probability distributions yielded by different LLMs at each decoding step. Unfortunately, the vocabulary discrepancy between heterogeneous LLMs makes directly averaging the distributions infeasible due to token misalignment. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space based on relative representation theory, and performs aggregation there. Next, we devise a search-based inverse transformation to transform the aggregated result back to the probability space of one of the ensembling LLMs (the main model), in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles between an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn has complementary strengths with other ensemble methods such as voting. Our code is available at https://github.com/OrangeInSouth/DeePEn.

1 Introduction

With the scaling of model capacities and data volumes, generative large language models (LLMs) have shown impressive language understanding and generation abilities, shedding light on the path toward artificial general intelligence [35, 22, 13, 28]. Owing to the diversity of data sources, model architectures, and training recipes, LLMs have different strengths and weaknesses across tasks and cases. Therefore, recent research has explored ensembling LLMs to exploit their complementary potential [15, 19].

Existing methods can be categorized into selection-based and fusion-based ensembling. Selection-based ensembling selects the best candidate answer from all individual LLMs’ answers using an additionally trained reward model [15, 31, 25, 19]. Fusion-based ensembling combines all candidate answers using a trained fusion model [15]. However, these approaches inevitably face significant challenges in generalizing to unseen data distributions and base models. Besides, prior methods enable collaboration via conveying the textual responses between LLMs while ignoring the rich information (e.g., confidence and alternative answers) in the internal representations.

An ideal solution to this issue is to apply the well-established technique of prediction fusion [36, 24, 7, 10]. For LLM ensembling, prediction fusion works at each decoding step, averaging the probability distributions from different LLMs to determine the next token. It not only applies directly to the ensemble of any LLMs without extra parameter training, making it more general, but also leverages the informative internal representations (i.e., probability distributions) as the communication medium. Unfortunately, the vocabulary discrepancy between different LLMs makes it infeasible to average the distributions due to token misalignment.

In this work, we tackle this key challenge by drawing upon the cross-model invariance of relative representation, which represents each token using the embedding similarities of this token to a set of anchor tokens [21]. Specifically, we propose an ensemble framework DeePEn (Deep Parallel Ensemble), enabling distribution fusion for heterogeneous LLMs. DeePEn transforms the probability distribution from the heterogeneous probability space to a homogeneous relative space, using a matrix formed by the relative representation of all tokens. Next, DeePEn aggregates the relative representations of all probability distributions in the relative space, coordinating the decision on the next token. Finally, the result of aggregation is transformed back to the probability space of the main model using a search-based inverse transformation to determine the next token.

We conduct extensive experiments ranging from 2-model to 9-model ensembles, covering ensembles of models with parameters ranging from 6B to 70B, ensembles of dense and sparse models, and the ensemble of LLMs with specialist models. Experimental results on six widely-used benchmarks demonstrate that compared to baselines, DeePEn achieves consistent improvements across all benchmarks. It is also discovered that DeePEn has complementary strengths when combined with other ensemble methods.

2 Theoretical Analysis

We first introduce relative representation and then illustrate the theoretical support for our method.

2.1 Relative Representation

Previous studies have found that, despite the misalignment between the latent spaces of different neural networks, the embedding similarity between samples does not change across models [21, 11, 23]. Specifically, Moschella et al. [21] propose relative representation, which represents each sample $x^{(i)}$ by its embedding similarities to a set of anchor samples $\mathbb{A}$ ($x^{(i)}$ and $\mathbb{A}$ are identically distributed):

\[
\mathbf{r}_{x^{(i)}} = \left(\cos(e_{x^{(i)}}, e_{a^{(1)}}), \ldots, \cos(e_{x^{(i)}}, e_{a^{(|\mathbb{A}|)}})\right), \tag{1}
\]

where $e_{(*)}$ denotes the embedding of a sample, also called its absolute representation.

Empirical evidence shows that relative representations possess cross-model invariance, i.e., the relative representation of the same sample remains invariant across different models, which lays the theoretical foundation for our work to fuse heterogeneous probability distributions.
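The definition in Eq. (1) can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the 2-D embeddings are made-up values, and the "second model" is simulated by rotating the whole embedding space, which is an assumption of this sketch. Cosine similarity is unchanged by rotation, so the two relative representations coincide exactly here; for real LLMs the invariance is only approximate and empirical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(sample, anchors):
    """Eq. (1): similarities of one embedding to every anchor embedding."""
    return [cosine(sample, a) for a in anchors]

def rotate(v, angle):
    """Rotate a 2-D vector; a stand-in for a different latent space (assumption)."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Toy 2-D embeddings (hypothetical values).
sample = [1.0, 2.0]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Relative representation in the original space.
rel_a = relative_representation(sample, anchors)

# Relative representation after rotating every embedding by 40 degrees.
angle = math.radians(40)
rel_b = relative_representation(
    rotate(sample, angle), [rotate(a, angle) for a in anchors]
)
```

Although the absolute coordinates of `sample` differ between the two spaces, `rel_a` and `rel_b` agree up to floating-point error.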

2.2 Theoretical Support for DeePEn

Figure 1: Visualizations for relative representations between models with the same vocabulary and between models with different vocabularies. PCA and K-means clustering are applied only for visualization. The red block indicates the representation of tokens that only appear in Mistral’s vocabulary. Relative representation consistency is obtained by calculating the cosine similarity between the relative representations of the same token in different models.

Averaging probability distributions has been widely shown to improve predictive performance in the fields of image and text [2, 10]. For generative language models, as we understand it, the underlying mechanism is the interpolation of different output semantics represented by the probability distributions. However, for LLM ensembling, vocabulary discrepancy isolates these output semantics in semantic spaces with different basis vectors, making the interpolation infeasible. To tackle this challenge, we aim to enable cross-model alignment of output semantics, i.e., to find a transformation that maps the output semantics into a universal space. To this end, we propose to represent the output semantics as the convex combination of the relative representations of all tokens, where the weight of each token is the probability assigned to it.

Definition of output semantics in relative space.

Formally, we are given the absolute representation of the output semantics $\mathbf{p}$ and the relative representation matrix $R \in \mathbb{R}^{|V| \times |A|}$, where $V$ is the vocabulary and $A \subseteq V$ is the anchor token set. The $i$-th row of $R$ is the relative representation of word $w^{(i)}$:

\[
R[i] = \left(\cos(e_{w^{(i)}}, e_{a^{(1)}}), \ldots, \cos(e_{w^{(i)}}, e_{a^{(|\mathbb{A}|)}})\right), \tag{2}
\]

and the relative representation of the output semantics $\mathbf{p}$ is defined as $\mathbf{r} = \mathbf{p} \cdot R$.
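As a small numeric illustration of this definition, the sketch below computes $\mathbf{r} = \mathbf{p} \cdot R$ for a three-token vocabulary and two anchors. The matrix and distribution are hypothetical values chosen for readability; pure Python lists are used only to keep the example dependency-free.

```python
def to_relative(p, R):
    """r = p . R: the probability-weighted average of the rows of R."""
    n_anchors = len(R[0])
    return [sum(p[i] * R[i][j] for i in range(len(p))) for j in range(n_anchors)]

# Hypothetical relative representation matrix: 3 tokens x 2 anchors.
R = [
    [1.0, 0.2],  # token 0
    [0.2, 1.0],  # token 1
    [0.6, 0.6],  # token 2
]

# A next-token distribution over the 3 tokens.
p = [0.5, 0.25, 0.25]
r = to_relative(p, R)

# A one-hot distribution recovers the corresponding row of R.
r_one_hot = to_relative([0.0, 1.0, 0.0], R)
```

Because the entries of `p` are non-negative and sum to one, `r` is a convex combination of the rows of `R`, which is what lets it act as a point in the shared relative space.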

Model-invariance of relative representation of output semantic.

Next, we illustrate why this representation scheme can align the output semantics isolated in heterogeneous absolute spaces. First, consider two LLMs $\theta_A$ and $\theta_B$ with the same vocabulary (e.g., LLaMA2-7B and LLaMA2-13B). When expressing the same output semantics, these models output the same probability distribution (i.e., absolute representation), $\mathbf{p}_A$ and $\mathbf{p}_B$. Besides, they have the same (in practice, highly similar) relative representation matrix, owing to vocabulary consistency and the cross-model invariance of relative representation. Therefore, the relative representations of the output semantics are also identical:

\[
\mathbf{r}_A = \mathbf{p}_A \cdot R_A = \mathbf{p}_B \cdot R_B = \mathbf{r}_B. \tag{3}
\]

Then, consider a language model $\theta_C$ with a different vocabulary (e.g., Mistral). Since different LLMs typically share a large number of tokens in their vocabularies (§A), the vocabulary of model $\theta_C$ can be viewed as the result of adding some tokens to, and removing some tokens from, the vocabulary of $\theta_B$, which leads to $\mathbf{p}_B \ncong \mathbf{p}_C$ and $R_B \ncong R_C$. However, in our study, we find that this change to the vocabulary has no significant influence on the relative representations of the unchanged tokens (i.e., the common tokens between $\theta_B$ and $\theta_C$), as shown in Fig. 1. Therefore, we make the reasonable assumption that a local change in the vocabulary hardly influences the relative space.

3 Methodology

In this section, we first introduce the overall process of our ensemble framework DeePEn and then describe the three parts of DeePEn in detail.

3.1 Overview

We illustrate the process of DeePEn in Fig. 2. Given $N$ models to ensemble, DeePEn first constructs their transformation matrices (i.e., relative representation matrices), which map the probability distributions from the heterogeneous absolute spaces into the relative space (§3.2). At each decoding step, all models perform prediction and output $N$ probability distributions. These distributions are mapped into the relative space and aggregated (§3.3). Finally, the aggregation result is transformed back into the absolute space of the main model, in order to determine the next token (§3.4).

Figure 2: Overview of DeePEn. The relative representation matrix of each LLM is directly derived by calculating the embedding similarities between each token and the anchor tokens.

3.2 Construction of Relative Transformation

Given $N$ models to ensemble, DeePEn first finds the intersection of the vocabularies of all models, i.e., the common token set $C$, and either samples a subset or uses the full set of common tokens as the anchor token set $A \subseteq C$. Next, for each model, DeePEn calculates the embedding similarities of each token to the anchor tokens, obtaining the relative representation matrix $R$ (as shown in Eq. 2). Finally, to overcome the relative representation degeneration of outlier words, which is discussed below, we normalize the relative representation of every token with a softmax operation so that it becomes a probability distribution. We denote the normalized representation matrix by $\hat{R}$:

\[
\hat{R}[i] = \mathrm{softmax}(R[i]). \tag{4}
\]

Anchor Selection.

The choice of anchor tokens is crucial for the relative representation capability. Previous research finds that the capability improves as the number of anchor words increases [21]. Therefore, we employ the full set of common words between LLMs as the anchor words. We also show empirically that this choice performs more stably on downstream tasks (§5.2).

Normalization of relative representation matrix.

In DeePEn, the relative representation of each token is normalized by the softmax operation to avoid the relative representation degeneration of outlier words, i.e., words that are far away from other words (including the anchors) and therefore become indistinguishable in the relative space because their relative representations are close to zero vectors. The softmax operation effectively resolves this problem by making each relative representation a probability distribution instead of a zero vector.
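The construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the token strings and 2-D embeddings are hypothetical, the second model's embeddings are simply a rotated and rescaled copy of the first model's, and anchors are ordered by sorting so that both models use the same column order.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def build_normalized_matrix(vocab, embeddings, anchors):
    """Rows follow `vocab`, columns follow `anchors` (Eq. 2, then Eq. 4)."""
    R = [[cosine(embeddings[tok], embeddings[a]) for a in anchors] for tok in vocab]
    return [softmax(row) for row in R]

# Hypothetical toy vocabularies and embeddings for two models.
emb_a = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "_a_only": [1.0, 1.0]}
emb_b = {"the": [0.0, 2.0], "cat": [-2.0, 0.0], "_b_only": [-1.0, 1.0]}

# Anchors: the full set of common tokens, in one fixed order shared by all models.
anchors = sorted(set(emb_a) & set(emb_b))

R_hat_a = build_normalized_matrix(list(emb_a), emb_a, anchors)
R_hat_b = build_normalized_matrix(list(emb_b), emb_b, anchors)

# An all-zero similarity row (an extreme outlier) becomes a uniform distribution.
outlier_row = softmax([0.0, 0.0])
```

In this toy setup the rows for the shared tokens are identical across the two matrices, while each model keeps its own extra row for the token the other model lacks.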

3.3 Aggregation in Relative Space

At each decoding step, once each model $\theta_i$ outputs its probability distribution $\mathbf{p}_i$, DeePEn transforms $\mathbf{p}_i$ into the relative representation $\mathbf{r}_i$ using the normalized relative representation matrix, $\mathbf{r}_i = \mathbf{p}_i \cdot \hat{R}_i$, and aggregates all relative representations to obtain the aggregated relative representation:

\[
\overline{\mathbf{r}} = \sum_{i=1}^{N} \alpha_i \times \mathbf{r}_i, \tag{5}
\]

where $\alpha_i$ is the collaboration weight of model $\theta_i$ (§3.5).
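A minimal sketch of this aggregation step is given below for two models whose vocabularies have different sizes (3 and 4 tokens) but share the same two anchors. All matrices and distributions are hypothetical values, and uniform weights are assumed.

```python
def to_relative(p, R_hat):
    """r_i = p_i . R_hat_i for one model."""
    return [sum(p[i] * R_hat[i][j] for i in range(len(p))) for j in range(len(R_hat[0]))]

def aggregate(rel_reprs, weights):
    """Eq. (5): weighted sum of the models' relative representations."""
    dim = len(rel_reprs[0])
    return [sum(w * r[j] for w, r in zip(weights, rel_reprs)) for j in range(dim)]

# Hypothetical normalized matrices; every row is a distribution over 2 anchors.
R_hat_1 = [[0.7, 0.3], [0.3, 0.7], [0.5, 0.5]]              # model 1: 3 tokens
R_hat_2 = [[0.8, 0.2], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]  # model 2: 4 tokens

# Next-token distributions, each over the model's own vocabulary.
p_1 = [0.6, 0.3, 0.1]
p_2 = [0.1, 0.7, 0.1, 0.1]

r_1 = to_relative(p_1, R_hat_1)
r_2 = to_relative(p_2, R_hat_2)

# Uniform collaboration weights, alpha_i = 1/N.
r_bar = aggregate([r_1, r_2], [0.5, 0.5])
```

The two distributions cannot be averaged directly because they have different lengths, whereas `r_1` and `r_2` both live in the 2-dimensional anchor space and can be combined.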

3.4 Inverse Transformation of Relative Representations

To decide the next token according to the aggregated relative representation, DeePEn transforms it from the relative space back to the absolute space of the main model, which is chosen as the best-performing model on the development set. To enable this inverse transformation, we adopt a search-based strategy that finds the absolute representation whose relative representation is identical to the aggregated relative representation. This search problem is formulated as:

\[
\overline{\mathbf{p}}_i = \mathop{\arg\min}\limits_{\mathbf{p}_i \in \mathbb{P}_i} \ell(\mathbf{p}_i \times \hat{R}, \overline{\mathbf{r}}), \tag{6}
\]

where $\mathbb{P}_i$ denotes the absolute space of model $\theta_i$, and $\ell(\cdot)$ is the loss function measuring the distance between relative representations. In this work, we adopt the KL-divergence due to its convergence behavior.

This search is conducted iteratively, guided by the gradient of the loss in Eq. 6 with respect to the absolute representation $\mathbf{p}_i$. Specifically, we initialize the starting point of the search, $\mathbf{p}_i^{(0)}$, with the main model’s original absolute representation, and update it as:

\[
\mathbf{p}_i^{(t+1)} = \mathbf{p}_i^{(t)} - \eta \times \frac{\partial \ell}{\partial \mathbf{p}_i^{(t)}}, \quad t \in [0, T] \tag{7}
\]

where $\eta$ is an important hyperparameter named the relative ensemble learning rate, and $T$ is the number of iterations, named the relative ensemble learning steps. Finally, we use the updated absolute representation $\mathbf{p}_i^{(T)}$ to determine the emitted token.
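The search in Eqs. (6) and (7) can be sketched as a short gradient-descent loop. Several details below are assumptions of this sketch rather than statements about the released code: the loss is taken as $\mathrm{KL}(\overline{\mathbf{r}} \,\|\, \mathbf{p} \cdot \hat{R})$, whose gradient is written out analytically; the gradient is centred so that the entries of $\mathbf{p}$ keep summing to one; and entries are clamped to stay positive. Eq. (7) itself is a plain gradient step. The matrix, the starting distribution, and the target are hypothetical values.

```python
import math

def to_relative(p, R_hat):
    """Map an absolute representation into the relative space: p . R_hat."""
    return [sum(p[i] * R_hat[i][j] for i in range(len(p))) for j in range(len(R_hat[0]))]

def kl(target, q):
    """KL(target || q) for two discrete distributions."""
    return sum(t * math.log(t / x) for t, x in zip(target, q) if t > 0)

def inverse_transform(p0, R_hat, r_bar, lr=0.15, steps=5):
    """Search for p whose relative representation approaches r_bar."""
    p = list(p0)
    for _ in range(steps):
        q = to_relative(p, R_hat)
        # d KL(r_bar || p . R_hat) / d p_k = -sum_j r_bar_j * R_hat[k][j] / q_j
        grad = [-sum(r * row[j] / q[j] for j, r in enumerate(r_bar)) for row in R_hat]
        # Assumption: centre the gradient so the update preserves sum(p) == 1.
        mean_grad = sum(grad) / len(grad)
        p = [max(x - lr * (g - mean_grad), 1e-12) for x, g in zip(p, grad)]
        total = sum(p)
        p = [x / total for x in p]
    return p

# Hypothetical main-model matrix (3 tokens x 2 anchors) and starting distribution.
R_hat = [[0.7, 0.3], [0.3, 0.7], [0.5, 0.5]]
p_start = [0.6, 0.3, 0.1]
r_bar = [0.435, 0.565]  # aggregated relative representation (made-up target)

p_final = inverse_transform(p_start, R_hat, r_bar)
loss_before = kl(r_bar, to_relative(p_start, R_hat))
loss_after = kl(r_bar, to_relative(p_final, R_hat))

# Greedy choice of the emitted token from the updated distribution (assumption).
next_token = max(range(len(p_final)), key=p_final.__getitem__)
```

With a small learning rate and only a few steps, the updated distribution moves toward the ensemble's preference without straying far from the main model's original prediction, which is why `lr` and `steps` matter in practice.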

3.5 Collaboration Schemes

DeePEn aggregates the output distributions of the individual models by performing weighted averaging on their relative representations (Eq. 5). As our work focuses on enabling distribution fusion for heterogeneous LLMs rather than on finding the optimal collaboration weights, we follow the most common practice and aggregate the distributions uniformly ($\alpha = 1/N$, where $N$ is the number of models); this variant is named DeePEn-Avg. Besides, we also adopt a simple and effective method of deducing weights, DeePEn-Adapt, which heuristically assigns a larger weight to the model with better performance on the development set: $\alpha_i = s_i / \sum_j s_j$, where $s_i = \mathrm{Acc}(\theta_i, \mathcal{D}^{dev}) - \epsilon$, $\mathrm{Acc}(\cdot, \cdot)$ indicates the average accuracy of model $\theta_i$ on the development set, and $\epsilon$ indicates the chance level on the evaluation task. Specifically, $\epsilon = 0$ on the free-form generation tasks and $\epsilon = 1/K$ on the $K$-choice tasks.
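The DeePEn-Adapt weighting rule is simple enough to state directly in code. The sketch below follows the formula above; the development-set accuracies are hypothetical values, and the function does not handle models that score at or below chance level, which would need separate treatment.

```python
def adapt_weights(dev_accuracies, chance_level=0.0):
    """alpha_i = s_i / sum_j s_j with s_i = Acc_i - chance_level."""
    scores = [acc - chance_level for acc in dev_accuracies]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical dev accuracies of two models on a 4-choice task (chance = 1/4).
weights_choice = adapt_weights([0.63, 0.62], chance_level=1 / 4)

# Hypothetical dev accuracies on a free-form generation task (chance = 0).
weights_free = adapt_weights([0.5, 0.3])
```

Subtracting the chance level amplifies real differences between models on multiple-choice tasks, where even a random guesser reaches an accuracy of $1/K$.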

4 Experiments

4.1 Experimental Setup

Benchmarks.

We mainly conduct experiments on six benchmarks, which can be categorized into:

  • Comprehensive Examination: (1) MMLU (5-shot) [12], which covers 57 subjects that humans learn, and (2) ARC-C (0-shot) [5], collected from standardized natural science tests.

  • Reasoning Capabilities: (1) GSM8K [6] (4-shot), which is a dataset of high-quality problems at the grade school math level, and (2) PIQA [3] (0-shot), which is a commonsense reasoning dataset.

  • Knowledge Capacities: (1) TriviaQA (5-shot) [16], authored by trivia enthusiasts, and (2) NQ (5-shot) [18], which is a QA corpus consisting of queries issued to the Google search engine.

Evaluation.

For all benchmarks, we follow the test scripts of the OpenCompass leaderboard. Specifically, on the multiple-choice tasks (MMLU, ARC-C, and PIQA), the option with the highest likelihood is selected to calculate the accuracy. On the free-form generation tasks (GSM8K, TriviaQA, and NQ), we calculate the exact match (EM) accuracy.

Individual models.

As ensemble learning typically works on models with comparable performance [24, 34], we select six well-performing LLMs whose performance is closely matched: LLaMA-2-13B [29], Mistral-7B-v0.1 [13], InternLM-20B [26], Yi-6B [1], Skywork-13B-base [32], and Tigerbot-13b-base-v2 [4]. To achieve better ensemble performance, we conduct experiments on the ensemble of the top-2 models and the top-4 models for each benchmark. Besides, we also consider ensembling various numbers of models (§4.3) and ensembling more diverse models (§5.1).

Hyperparameters.

In this work, we select all of the common tokens between LLMs as the anchor tokens to build the relative spaces, i.e., $A = C$ (§5.2). For the inverse transformation of relative representations, we search for the optimal relative ensemble learning rate ($\eta$ in Eq. 7) from 0.05 to 0.30 with an interval of 0.05. We empirically set the number of relative ensemble learning steps to $T = 5$ (§5.3).

Comparative methods.

We compare DeePEn with (1) MinED [30, 9], which maps the probability distributions of heterogeneous LLMs to the distribution of the main model by aligning tokens in different vocabularies with edit distance, and (2) LLM-Blender [15], which comprises a reward model, PairRanker, to score each response of the LLMs and a fusion model, GenFuser, to fuse candidate responses. In this work, we only adopt PairRanker, since GenFuser suffers from serious over-generation under our training-free setting. For ensembles of more than two models, we introduce two additional ensemble methods: (3) Voting, which selects the choice favored by most models on tasks whose outputs are limited to a fixed set, and (4) MBR [8, 17], which selects the answer with the highest textual similarity to the other candidate answers. The implementation details of the baselines are given in §B.
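For intuition, the two answer-level baselines, Voting and MBR, can be sketched as follows. This is not the implementation described in §B: the similarity measure here is `difflib.SequenceMatcher.ratio()` from the Python standard library, chosen only as a convenient stand-in, and the candidate answers are made-up strings.

```python
from collections import Counter
from difflib import SequenceMatcher

def vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def mbr_select(candidates):
    """Pick the candidate with the highest average similarity to the others."""
    def avg_similarity(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(
            SequenceMatcher(None, candidates[i], o).ratio() for o in others
        ) / len(others)
    best = max(range(len(candidates)), key=avg_similarity)
    return candidates[best]

# Hypothetical outputs of four models on a multiple-choice question.
voted = vote(["B", "A", "B", "C"])

# Hypothetical free-form answers of four models to the same question.
selected = mbr_select(["Paris", "Paris", "London", "Rome"])
```

Both methods operate on finished answers, whereas DeePEn fuses distributions at every decoding step, which is why the two kinds of ensembling can be stacked.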

| Models | MMLU (Examination) | ARC-C (Examination) | GSM8K (Reasoning) | PIQA (Reasoning) | TriviaQA (Knowledge) | NQ (Knowledge) |
|---|---|---|---|---|---|---|
| Individual Models | | | | | | |
| LLaMA2-13B | 55.07 | 59.32 | 29.80 | 59.68 | 74.32 | 28.67 |
| InternLM-20B | 59.94 | 75.81 | 53.83 | 64.78 | 66.88 | 26.09 |
| Skywork-13B | 61.16 | 66.50 | 53.90 | 74.04 | 58.65 | 19.75 |
| Tigerbot-13B | 51.95 | 57.44 | 48.82 | 68.28 | 66.22 | 22.71 |
| Mistral-7B | 62.13 | 73.33 | 47.50 | 65.61 | 73.18 | 27.62 |
| Yi-6B | 63.25 | 73.33 | 37.91 | 76.15 | 59.02 | 18.98 |
| Top-2 Ensemble | | | | | | |
| LLM-Blender | 63.85 (+0.60) | 75.73 (-0.08) | 54.89 (+0.99) | 78.31 (+2.16) | 74.10 (-0.22) | 28.61 (-0.06) |
| MinED | 65.04 (+1.79) | 77.35 (+1.54) | 18.50 (-35.40) | 78.98 (+2.83) | 72.30 (-2.02) | 28.45 (-0.22) |
| DeePEn-Avg | 64.68 (+1.43) | 77.52 (+1.71) | 55.42 (+1.52) | 78.87 (+2.72) | 75.90 (+1.58) | 30.17 (+1.50) |
| Top-4 Ensemble | | | | | | |
| LLM-Blender | 61.44 (-1.81) | 71.03 (-4.78) | 43.37 (-10.53) | 71.16 (-4.99) | 67.87 (-6.45) | 24.18 (-4.49) |
| Voting | 64.88 (+1.63) | 78.41 (+2.60) | 63.15 (+9.25) | 76.82 (+0.67) | — | — |
| MBR | — | — | 62.09 (+8.26) | — | 74.32 (+0.00) | 30.28 (+1.61) |
| MinED | 65.61 (+2.36) | 78.68 (+2.87) | 56.56 (+2.66) | 77.87 (+1.72) | 71.62 (-2.70) | 29.50 (+0.83) |
| DeePEn-Avg | 65.09 (+1.84) | 78.70 (+2.89) | 56.18 (+2.28) | 77.15 (+1.00) | 75.74 (+1.42) | 31.55 (+2.88) |
| DeePEn-Adapt | 65.25 (+2.00) | 79.15 (+3.34) | 56.25 (+2.35) | 78.59 (+2.44) | 75.76 (+1.44) | 31.77 (+3.10) |
| +Voting/MBR | 65.40 (+2.15) | 79.44 (+3.63) | 65.25 (+11.35) | 77.37 (+1.22) | 75.65 (+1.33) | 32.11 (+3.44) |

Table 1: Main results. The best individual model is highlighted in red, and the best ensemble method is highlighted in green, except for the results of the combined method (i.e., the last row). The top-4 models on each benchmark are underlined. ‘—’ indicates that the method does not apply to the task.

4.2 Main Results

The main results are shown in Tab. 1, from which we draw the following observations:

(1) DeePEn achieves consistent improvements over the individual models.

These results show that DeePEn successfully enables collaboration between heterogeneous LLMs by aggregating their probability distributions in the relative space. Specifically, DeePEn-Avg achieves improvements ranging from +1.43 (MMLU) to +2.72 (PIQA) with the top-2 ensemble, and from +1.00 (PIQA) to +2.89 (ARC-C) with the top-4 ensemble. DeePEn-Adapt gains improvements ranging from +1.44 (TriviaQA) to +3.34 (ARC-C) with the top-4 ensemble.

(2) DeePEn shows better stability than baselines.

As shown, LLM-Blender struggles to achieve improvements under the training-free setting. MinED shows unstable performance across different benchmarks. For example, MinED leads to performance drops of -35.40 on GSM8K under the top-2 ensemble setting and -2.70 on TriviaQA under the top-4 ensemble setting, indicating the limitation of using textual similarity to align tokens in heterogeneous vocabularies. Case studies reveal that aligning tokens with edit distance disturbs the decoding and produces incomplete words (demonstrated in §10). In contrast, DeePEn-Avg achieves consistent improvements and surpasses all baselines in 7/12 settings.

(3) DeePEn has complementary strengths with other ensemble methods.

Voting achieves a significant improvement on the mathematical reasoning benchmark GSM8K, showing the effectiveness of reasoning with multiple paths. To demonstrate the complementary strength of DeePEn with Voting, we combine both methods. On TriviaQA and NQ, Voting is replaced with MBR. As shown, the combination of both methods gains a further improvement over Voting on GSM8K (63.15 → 65.25).

(4) Collaboration with more worse-performing LLMs is a double-edged sword.

The ensemble performance of DeePEn-Avg with top-4 models surpasses that with top-2 models on 4 benchmarks, but falls short on 2 benchmarks. This is reasonable because incorporating the 3rd and 4th ranked LLMs enhances complementary strengths but also causes the interference with the top-2 models.

4.3 Results on Different Numbers of Models

Figure 3: Test set results of ensemble learning on various numbers of models. Individual models are arranged in descending order of their performance on the development set, and sequentially incorporated into the ensemble. $\hat{\Delta}$ indicates the largest improvement achieved by DeePEn.

Next, we illustrate the effectiveness of DeePEn when ensembling more models on MMLU, PIQA, and NQ. We add Nanbeige-13B to the ensemble on all three benchmarks, and add LLaMA2-70B and Mixtral-8×7B on PIQA due to their comparable performance. As illustrated in Fig. 3, as more models join in descending order of performance, the ensemble performance first increases and then decreases. The ensemble performance peaks with the top-4 or top-5 models across the three benchmarks.

5 Analysis

To deeply understand DeePEn, we first evaluate its performance on the ensemble learning of model sets with diverse architectures, abilities, and performance gaps. Next, we conduct a series of analyses on the reverse transformation process of relative representations.

5.1 Results of Ensembling Diverse Models

| Model | GSM8K | PIQA |
|---|---|---|
| LLaMA2-70B (Dense) | 63.84 | 71.27 |
| Mixtral-8×7B (Sparse) | 65.73 | 71.88 |
| DeePEn | 67.33 | 75.10 |
| Δ | +1.60 | +3.22 |

Figure 4: Ensemble learning of the dense large language model LLaMA2-70B and the sparse MoE model Mixtral-8×7B.

| Model | En→De | De→En | En→Ro | Ro→En |
|---|---|---|---|---|
| LLaMA2-13B | 30.60 | 42.27 | 30.83 | 39.99 |
| NLLB-600M | 32.30 | 41.49 | 31.91 | 42.39 |
| DeePEn | 33.34 | 43.70 | 32.95 | 42.84 |
| Δ | +1.04 | +1.43 | +1.04 | +0.45 |

Figure 5: Ensemble learning of the generalist model LLaMA2 and the specialist translator model NLLB on the translation benchmark Flores-200.

Ensemble of the dense model and the sparse model.

We first evaluate our method on the ensemble of a dense model and a sparse MoE model on challenging reasoning tasks. Specifically, we use the widely-used large-scale dense model LLaMA2-70B [29] and the popular sparse MoE model Mixtral-8×7B [14] as the base models. As shown in Fig. 4, DeePEn achieves improvements of +1.60 and +3.22 on the GSM8K and PIQA datasets, even though the base models already achieve a high level of performance.

Ensemble of the generalist model and the specialist model.

To investigate the effectiveness of DeePEn on the ensemble of a generalist model and a task-specific specialist model, we conduct experiments on the machine translation task using the ensemble of the large language model LLaMA2 and the machine translation model NLLB [27], which is a well-known open-source multilingual translator. We adopt the widely-used machine translation benchmark Flores-200 (https://github.com/facebookresearch/flores). As the results in Fig. 5 illustrate, DeePEn achieves better translation performance by leveraging the diverse translation knowledge in the generalist LLM and the specialist translator.

Ensemble of models with different performance gaps.

To assess the stability of DeePEn with respect to the performance gap between base models, we conduct an experiment on model pairs with increasing performance gaps. As shown in Fig. 8, ensembling a well-performing model (the top-ranked model) with a worse-performing model either achieves improvements or lags only slightly behind the well-performing model.

5.2 Analysis on Relative Transformation

Effect of anchor selection.

We demonstrate the impact of different numbers of anchor words through experiments with the top-2 ensemble models on the MMLU and ARC-C datasets. As shown in Fig. 6, an increased number of anchor words can improve performance for LLMs in downstream tasks, and selecting the full set of common words as anchors provides better performance.
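As a rough illustration of the anchor-sampling setup, the sketch below represents a token by its cosine similarity to a set of anchor tokens, once with a randomly sampled subset and once with the full set of common tokens. The toy embeddings and the helper names (`cosine`, `sample_anchors`, `relative_representation`) are hypothetical and are not taken from the DeePEn code base.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_anchors(common_tokens, k, seed=0):
    """Randomly sample k anchor tokens from the shared vocabulary."""
    rng = random.Random(seed)
    return rng.sample(sorted(common_tokens), k)


def relative_representation(token, embeddings, anchors):
    """Represent a token by its similarity to every anchor token."""
    return [cosine(embeddings[token], embeddings[a]) for a in anchors]


# Toy embedding table (assumed values, for illustration only).
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
    "bus": [0.6, 0.8],
}
common_tokens = {"cat", "dog", "car", "bus"}

# A sampled subset of anchors versus the full set of common tokens.
anchors = sample_anchors(common_tokens, k=2)
rel = relative_representation("cat", embeddings, anchors)
full = relative_representation("cat", embeddings, sorted(common_tokens))
```

With more anchors the relative vector has more coordinates, which is one way to read the trend reported for larger anchor sets.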

Effect of normalization on relative representation matrix.

Figure 6: Effect of the number of anchor words. The x-axis indicates the number of anchor words, randomly sampled from the common words 4 times.

| Methods | MMLU-Dev ACC | MMLU-Dev $\Delta$ | TriviaQA-Dev ACC | TriviaQA-Dev $\Delta$ |
|---|---|---|---|---|
| Baseline | 61.19 | - | 72.74 | - |
| DeePEn | 63.61 | +2.42 | 74.79 | +2.05 |
| w/o. Rel-Norm | 60.73 | -0.46 | 72.95 | +0.21 |

Figure 7: Ablation study of normalization on the relative representation matrix, measured by ensembling performance on the development sets. Baseline refers to the best single model on each benchmark. DeePEn refers to the performance of ensembling the top-2 models on each benchmark.

Figure 8: 2-model ensembles of the top-1 model (LLaMA2-13B) with different models on the NQ benchmark.

To demonstrate the importance of normalizing the relative representation matrix (§3.2) for ensemble performance, we conduct an ablation analysis. As shown in Fig. 7, without normalization the ensemble struggles to achieve improvements, due to the ineffective representation of outlier words, i.e., words distant from all other words. The proportion of outlier words can be derived from the distribution of distances to the nearest neighbor word, which is illustrated in Fig. 11. As illustrated, a remarkable proportion (>30%) of words are distant from other words, i.e., their cosine similarity to the nearest neighbor word is less than 0.3. The normalization operation prevents the output semantics that intend to emit outlier words from becoming zero vectors under the relative transformation.
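The outlier statistic described above can be sketched as follows: for each word, take the highest cosine similarity to any other word, and count the word as an outlier when that value falls below 0.3. The toy embeddings and function names are assumptions made for illustration; only the 0.3 threshold comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor_similarity(embeddings):
    """For each word, the highest cosine similarity to any other word."""
    result = {}
    for word, vec in embeddings.items():
        result[word] = max(
            cosine(vec, other_vec)
            for other, other_vec in embeddings.items()
            if other != word
        )
    return result


def outlier_fraction(embeddings, threshold=0.3):
    """Fraction of words whose nearest neighbor is below the threshold."""
    sims = nearest_neighbor_similarity(embeddings)
    outliers = [word for word, sim in sims.items() if sim < threshold]
    return len(outliers) / len(sims), outliers


# Toy embeddings (assumed values): "zzz" is orthogonal to every other word.
embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
    "bus": [0.1, 0.9, 0.0],
    "zzz": [0.0, 0.0, 1.0],
}

fraction, outliers = outlier_fraction(embeddings)
```

In this toy table only the orthogonal word is flagged, so its relative representation is close to zero on every anchor, which is the failure case the normalization is meant to address.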

5.3 Analysis of Reverse Transformation

To better understand the reverse transformation process (§3.4) transforming the relative representation back to the absolute space of the main model, we further analyze each component of this process.

Analysis of relative ensemble learning rates.

As shown in Tab. 2, the performance of DeePEn is sensitive to the value of the relative ensemble learning rate ($\eta$), abbreviated as RELR. This observation motivates us to measure the generality of this hyperparameter. Specifically, we report the cross-distribution performance of the searched optimal value of $\eta$ in Tab. 6. As observed, the optimal value of RELR varies across datasets, which suggests that the inverse transformation from relative space to absolute space requires adaptive mapping schemes.

| RELR ($\eta$) | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 |
|---|---|---|---|---|---|---|
| MMLU | +2.42 | +1.57 | +1.77 | +1.96 | +1.31 | +1.31 |
| TriviaQA | +1.31 | +2.05 | +1.63 | +1.94 | +1.82 | +1.26 |

Table 2: Sensitivity analysis of relative ensemble learning rate (RELR). We report the improvements of ensembling top-2 models over the best individual models.
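To make the sensitivity concrete, the short sketch below picks the best RELR value per dataset from the numbers in Table 2 and measures what is lost when the MMLU-optimal value is reused on TriviaQA. The variable and function names are illustrative only.

```python
# Improvements over the best individual model, copied from Table 2.
relr_values = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
gains = {
    "MMLU": [2.42, 1.57, 1.77, 1.96, 1.31, 1.31],
    "TriviaQA": [1.31, 2.05, 1.63, 1.94, 1.82, 1.26],
}


def best_relr(dataset):
    """Return the RELR value with the largest reported gain."""
    scores = gains[dataset]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return relr_values[best_index]


best = {name: best_relr(name) for name in gains}

# Gain lost on TriviaQA if the MMLU-optimal value is reused there.
mmlu_choice = relr_values.index(best["MMLU"])
transfer_loss = max(gains["TriviaQA"]) - gains["TriviaQA"][mmlu_choice]
```

The two datasets prefer different values (0.05 and 0.10), and reusing the MMLU choice on TriviaQA gives up 0.74 points of the reported gain.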

Effect of iteration steps in relative ensemble learning.

To give a deeper view of the dynamics of the inverse transformation in DeePEn, we report how performance changes with the number of relative ensemble learning steps ($T$). We also report the dynamics of the relative ensemble learning loss ($\eta$ in Eq. 6). As shown in Fig. 12, on the one hand, more steps of relative ensemble learning lead to significantly lower losses. However, the loss hardly reaches zero, i.e., the search under-fits. On the other hand, increasing the number of steps causes performance to first increase and then decrease. A possible reason for the drop is that, in the early stage of optimization, the updates focus on the high-probability tokens. In the later stage, since the probabilities of all words are adjusted equally, the high-probability tokens are interfered with by the low-probability ones, which hurts performance. Therefore, we recommend a moderate number of steps (e.g., $T=5$).
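The loss dynamics can be illustrated with a minimal sketch of a search-based inverse transformation: starting from the main model's distribution, run a few gradient steps so that its relative representation moves towards the aggregated one. The squared-error loss, the toy matrix `R`, the two toy distributions, and all function names are assumptions for illustration; the loss actually used by DeePEn is the one defined in Eq. 6, and the sketch does not renormalize the updated vector.

```python
def relative(p, R):
    """Map a distribution p over V tokens to the relative space: r = p . R."""
    num_anchors = len(R[0])
    return [sum(p[v] * R[v][a] for v in range(len(p))) for a in range(num_anchors)]


def loss(p, R, target):
    """Squared distance to the aggregated relative representation (assumed loss)."""
    r = relative(p, R)
    return 0.5 * sum((ri - ti) ** 2 for ri, ti in zip(r, target))


def inverse_transform(p_init, R, target, eta, steps):
    """Run `steps` gradient updates on p, recording the loss after each one."""
    p = list(p_init)
    losses = [loss(p, R, target)]
    for _ in range(steps):
        residual = [ri - ti for ri, ti in zip(relative(p, R), target)]
        # Gradient of the squared loss with respect to p[v] is R[v] . residual.
        grad = [
            sum(R[v][a] * residual[a] for a in range(len(residual)))
            for v in range(len(p))
        ]
        p = [pv - eta * gv for pv, gv in zip(p, grad)]
        losses.append(loss(p, R, target))
    return p, losses


# Toy relative representation matrix: 4 tokens x 3 anchors (assumed values).
R = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
    [0.5, 0.5, 0.5],
]
p_main = [0.7, 0.1, 0.1, 0.1]   # main model's own distribution
p_other = [0.2, 0.6, 0.1, 0.1]  # hypothetical second model in the same toy space
target = [
    (a + b) / 2
    for a, b in zip(relative(p_main, R), relative(p_other, R))
]

p_new, losses = inverse_transform(p_main, R, target, eta=0.15, steps=5)
next_token = max(range(len(p_new)), key=p_new.__getitem__)
```

In this toy setting the loss shrinks at every step but stays above zero after five updates, which mirrors the under-fitting behaviour described above; it says nothing about downstream accuracy.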

6 Related Work

Selection-based ensemble.

Reranking is an intuitive way to exploit the strengths of multiple models. Jiang et al. [15] take the first step towards LLM ensembling by training a reward model, PairRanker, for pairwise comparison of candidate outputs. To avoid the high computation cost of multi-LLM inference, several works train a router that predicts the best-performing model out of a fixed set of LLMs for a given input [31, 25, 19].

Fusion-based ensemble.

Towards a synergy between LLMs, Jiang et al. [15] propose GenFuser, trained to combine multiple candidate answers. Different from these training-dependent ensemble methods which pose a great challenge to the generalizability of the reward model or fusion model, our DeePEn is completely training-free, making it more general. Similar to our method, MinED also aims to tackle the vocabulary discrepancy via aligning the tokens in different vocabularies based on edit distance [30, 9]. Unfortunately, this textual similarity-based method exhibits unstable performance and produces abnormal text for LLM ensemble (Tab. 10).

There are several contemporaneous works related to our work. Xu et al. [33] propose EVA to tackle vocabulary discrepancy by learning token alignment between different vocabularies with the assistance of overlapping tokens. Our DeePEn eliminates this training process via directly aligning tokens with the relative representation (more discussion is illustrated in §B). Mavromatis et al. [20] explore adaptive collaboration weights at test time by harnessing the perplexity on the input prompt. We emphasize that this work is complementary to our work.

7 Conclusion

In this work, we propose a training-free LLM ensembling framework, DeePEn, which addresses the vocabulary discrepancy when fusing the probability distributions of heterogeneous LLMs. Experimental results on six widely used benchmarks demonstrate that DeePEn exhibits more stable performance than baseline methods and has complementary strengths with other ensemble methods such as Voting. We believe our work can inspire further research on LLM collaboration, model reuse, and knowledge distillation. In the future, we aim to explore more effective adaptive collaboration schemes to leverage the complementary strengths of different LLMs.

References

  • AI [2024] 01.AI. Yi: Open foundation models by 01.ai, 2024.
  • Allen-Zhu and Li [2020] Z. Allen-Zhu and Y. Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning, 2020. URL https://arxiv.org/abs/2012.09816.
  • Bisk et al. [2020] Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239.
  • Chen et al. [2023] Y. Chen, W. Cai, L. Wu, X. Li, Z. Xin, and C. Fu. Tigerbot: An open multilingual multitask llm, 2023.
  • Clark et al. [2018] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
  • Cobbe et al. [2021] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021.
  • Dong et al. [2020] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.
  • Freitag et al. [2023] M. Freitag, B. Ghorbani, and P. Fernandes. Epsilon sampling rocks: Investigating sampling strategies for minimum Bayes risk decoding for machine translation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9198–9209, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.617. URL https://aclanthology.org/2023.findings-emnlp.617.
  • Fu et al. [2023] Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing smaller language models towards multi-step reasoning. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10421–10430. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/fu23d.html.
  • Garmash and Monz [2016] E. Garmash and C. Monz. Ensemble learning for multi-source neural machine translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1409–1418, Osaka, Japan, Dec. 2016. The COLING 2016 Organizing Committee. URL https://aclanthology.org/C16-1133.
  • He and Ozay [2022] B. He and M. Ozay. Feature kernel distillation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=tBIQEvApZK5.
  • Hendrycks et al. [2021] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
  • Jiang et al. [2023a] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023a.
  • Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, et al. Mixtral of experts, 2024.
  • Jiang et al. [2023b] D. Jiang, X. Ren, and B. Y. Lin. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.792. URL https://aclanthology.org/2023.acl-long.792.
  • Joshi et al. [2017] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
  • Kumar and Byrne [2004] S. Kumar and W. Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. URL https://aclanthology.org/N04-1022.
  • Kwiatkowski et al. [2019] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  • Lu et al. [2023] K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models, 2023.
  • Mavromatis et al. [2024] C. Mavromatis, P. Karypis, and G. Karypis. Pack of llms: Model fusion at test-time via perplexity optimization, 2024.
  • Moschella et al. [2023] L. Moschella, V. Maiorca, M. Fumero, A. Norelli, F. Locatello, and E. Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SrC-nwieGJ.
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
  • Park et al. [2019] W. Park, D. Kim, Y. Lu, and M. Cho. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
  • Sagi and Rokach [2018] O. Sagi and L. Rokach. Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery, 8(4):e1249, 2018. doi: https://doi.org/10.1002/widm.1249. URL https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1249.
  • Shnitzer et al. [2024] T. Shnitzer, A. Ou, M. Silva, K. Soule, Y. Sun, J. Solomon, N. Thompson, and M. Yurochkin. Large language model routing with benchmark datasets, 2024. URL https://openreview.net/forum?id=LyNsMNNLjY.
  • Team [2023] I. Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport, 2023.
  • Team [2022] N. Team. No language left behind: Scaling human-centered machine translation, 2022. URL https://arxiv.org/abs/2207.04672.
  • Touvron et al. [2023a] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023a.
  • Touvron et al. [2023b] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  • Wan et al. [2024] F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi. Knowledge fusion of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jiDsk12qcz.
  • Wang et al. [2024] H. Wang, F. M. Polo, Y. Sun, S. Kundu, E. Xing, and M. Yurochkin. Fusing models with complementary expertise. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PhMrGCMIRL.
  • Wei et al. [2023] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, et al. Skywork: A more open bilingual foundation model, 2023.
  • Xu et al. [2024] Y. Xu, J. Lu, and J. Zhang. Bridging the gap between different vocabularies for llm ensemble, 2024.
  • Zhang and Ma [2012] C. Zhang and Y. Ma. Ensemble Machine Learning: Methods and Applications. Springer Publishing Company, Incorporated, 2012. ISBN 1441993258.
  • Zhao et al. [2023] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen. A survey of large language models, 2023.
  • Zhou [2012] Z.-H. Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.

Appendix A Statistics of Common Tokens across different LLMs

We count the number of common tokens shared among different LLM vocabularies and present the results in Fig. 9. It is observed that a large number of common words (>20k) exist across the different vocabularies. We also count the number of common tokens in all six LLMs and find that there are a total of 18k common tokens, enabling DeePEn to be applied to the ensemble learning of a large number of models.
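The counting described here amounts to a set intersection over vocabularies. The sketch below uses three tiny made-up vocabularies (not real tokenizer contents) to show that the shared set can only stay the same or shrink as more models are added.

```python
def common_tokens(vocabularies):
    """Tokens that appear in every one of the given vocabularies."""
    return set.intersection(*(set(vocab) for vocab in vocabularies))


# Tiny hypothetical vocabularies, for illustration only.
vocab_a = ["the", "cat", "ing", "Paris"]
vocab_b = ["the", "cat", "ed", "Paris"]
vocab_c = ["the", "dog", "ing", "Paris"]

shared_pair = common_tokens([vocab_a, vocab_b])
shared_all = common_tokens([vocab_a, vocab_b, vocab_c])
```

This matches the pattern in the reported statistics, where pairwise overlaps exceed 20k tokens while the overlap across all six models is 18k.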

Figure 9: Statistics of common words across different vocabularies.

Appendix B Details of Baselines

LLM-Blender.

LLM-Blender [15] consists of two components: (1) the selection-based ensemble method PairRanker, a reward model that scores each LLM response, and (2) the fusion-based ensemble method GenFuser, a generative model that fuses multiple candidate responses. Both models are trained on the constructed instruction-tuning dataset MixInstruct. In our experiments, since GenFuser struggles to generate responses in the expected format, we only adopt PairRanker.

Voting.

For tasks with outputs limited to a fixed set (i.e., MMLU, ARC-C, PIQA, GSM8K benchmarks), we adopt the Voting method on the ensemble learning of more than 2 models. Concretely, we count each candidate answer’s occurrences and select the most frequent as the final output. In the event of a tie, the main model’s answer is used as the final output.
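The voting rule can be sketched in a few lines. The function name `vote` and the convention that the main model's answer sits at `main_index` are assumptions for illustration.

```python
from collections import Counter


def vote(answers, main_index=0):
    """Majority vote over candidate answers; ties fall back to the main model."""
    counts = Counter(answers)
    top_count = max(counts.values())
    winners = [answer for answer, count in counts.items() if count == top_count]
    if len(winners) == 1:
        return winners[0]
    # No unique winner: use the main model's answer, as described above.
    return answers[main_index]


majority = vote(["B", "A", "A"])    # "A" wins with two votes
three_way_tie = vote(["B", "A", "C"])  # all tied, main model's "B" is used
```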

MBR.

For generation tasks, we implement the MBR method [8, 17], which selects the answer with the highest lexical similarity to the other candidate answers. To measure this similarity, we experimented with edit distance and chrF (https://github.com/mjpost/sacrebleu), ultimately choosing chrF for its superior performance.
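A minimal sketch of this selection rule is given below. To stay self-contained it swaps chrF for the standard-library `difflib.SequenceMatcher` ratio, so the similarity function is a stand-in and not the metric used in the experiments; the function names are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in lexical similarity in [0, 1]; the experiments use chrF instead."""
    return SequenceMatcher(None, a, b).ratio()


def mbr_select(candidates):
    """Return the candidate with the highest total similarity to the others."""
    def score(i):
        return sum(
            similarity(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )

    best_index = max(range(len(candidates)), key=score)
    return candidates[best_index]


candidates = ["Paris", "Paris.", "paris", "London"]
selected = mbr_select(candidates)
```

The candidate closest to the consensus is chosen, while an answer that shares nothing with the others receives a score of zero.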

MinED.

To bridge the gap between different vocabularies in LLM ensembling, MinED applies the Minimum Edit Distance approach to align tokens across vocabularies, e.g., mapping "get" to "gets". However, this textual-similarity-based mapping can disturb the text generation process and produce incomplete words.

EVA.

Recently, Xu et al. [33] proposed EVA, which tackles the vocabulary discrepancy by learning mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. We tried to re-implement their method with the released code. However, we encountered a technical problem: EVA only supports ensembling LLMs with the same embedding dimension. This is caused by a limitation of the vecmap tool (https://github.com/artetxem/vecmap), which is used to learn the token alignment.

Figure 10: Analysis of the generation process of MinED. To illustrate the problematic generation process of MinED, we list the top-10 high-probability tokens in the probability distribution of the assistant model and their aligned token.

Appendix C Additional Experiments

C.1 Choice of Main Model

| Models | MMLU-Dev Indiv | MMLU-Dev DeePEn | ARC-C-Dev Indiv | ARC-C-Dev DeePEn |
|---|---|---|---|---|
| Yi-6B | 61.19 | 63.61 (+2.42) | 72.72 | 77.55 (+4.83) |
| Mistral-7B | 60.80 | 64.46 (+3.66) | 73.88 | 77.73 (+3.85) |

Table 3: Performance of DeePEn with different main models on the development sets. Indiv refers to the individual models. The DeePEn result in each row is the performance obtained when the model of that row is used as the main model.

During the inverse transformation, DeePEn maps the aggregated relative representation to the absolute space of the main model. Ideally, the result of the inverse transformation would be invariant to the choice of main model. However, this is hard to achieve due to the under-fitting observed in the search process. Therefore, we report the performance variance across different main models in Tab. 3. On ARC-C, changing the main model from the first-ranked Mistral-7B to the second-ranked Yi-6B decreases the ensemble performance slightly, from 77.73 to 77.55. Interestingly, on MMLU, changing the main model from the first-ranked Yi-6B to the second-ranked Mistral-7B actually improves performance from 63.61 to 64.46, which indicates that Mistral-7B benefits more than Yi-6B from collaboration. Even so, the choice of main model does not significantly affect the ensemble performance.

C.2 Comparison to Vanilla Prediction Average

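| Models | MMLU-Dev Indiv | MMLU-Dev Vanil | MMLU-Dev DeePEn | MMLU-Test Indiv | MMLU-Test Vanil | MMLU-Test DeePEn |
|---|---|---|---|---|---|---|
| LLaMA1-13B | 43.26 | 45.48 | 44.37 | 43.70 | 45.01 | 44.22 |
| LLaMA2-7B | 42.28 | | 45.94 | 42.99 | | 45.31 |

Table 4: Comparison to vanilla prediction average (Vanil) on the ensemble of LLMs with the same vocabulary. Vanil is reported once per split and applies to both rows.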

To compare DeePEn with the vanilla prediction average, we conduct an experiment on ensembling two LLMs with the same vocabulary and comparable performance on MMLU, i.e., LLaMA2-7B and LLaMA1-13B. As shown in Tab. 4, the performance of DeePEn is comparable to, and in some cases better than, that of the vanilla prediction average. Theoretically, the performance of the vanilla prediction average is the upper bound of DeePEn's performance. The reason DeePEn can outperform the vanilla average on MMLU is the under-fitting in the inverse transformation process, which causes the weights used to aggregate the output semantics of different models to deviate from the uniform distribution, i.e., $(0.5, 0.5)$. For example, in Tab. 4, the weights for LLaMA1 and LLaMA2 could be $(0.6, 0.4)$, where the weight of the main model is larger than that of the other model.
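The effect of non-uniform weights can be seen in a small sketch of the vanilla prediction average over a shared vocabulary. The two toy distributions are made up for illustration; the weight pairs (0.5, 0.5) and (0.6, 0.4) are the ones mentioned in the text as a hypothetical example.

```python
def weighted_average(distributions, weights):
    """Token-wise weighted average of distributions over one shared vocabulary."""
    vocab_size = len(distributions[0])
    return [
        sum(w * dist[t] for w, dist in zip(weights, distributions))
        for t in range(vocab_size)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy next-token distributions over a shared 3-token vocabulary (assumed values).
p_main = [0.60, 0.35, 0.05]
p_other = [0.30, 0.60, 0.10]

uniform = weighted_average([p_main, p_other], [0.5, 0.5])
skewed = weighted_average([p_main, p_other], [0.6, 0.4])
```

With uniform weights the second token wins (0.475 against 0.45), while the skewed weights favor the main model's top token (0.48 against 0.45), so a small shift in effective weights can change the decoded token.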

C.3 Latency Analysis

To fuse heterogeneous distributions, DeePEn first maps the distributions into the relative space and then adopts the search-based inverse transformation to map the aggregated relative representation back to the main model's probability distribution, which incurs extra latency. This latency is mainly caused by the inverse transformation, which requires a $T$-round search. To quantify it, we report the token-level inference latency of ensembling two LLMs (Mixtral-8×7B and LLaMA2-70B). This experiment is conducted on 8 A100 GPUs, and all of our experiments can be reproduced on 8 A100 GPUs. As shown in Tab. 5, DeePEn adds +17% token-level inference latency at $T=5$. In practice, however, this latency can be greatly reduced, since all individual models intend to emit the same token in 90% of decoding steps. In these steps, we can skip the fusion process and use the agreed token as the next token. In total, DeePEn incurs less than 2% sentence-level inference latency.

| | Baseline | $T=1$ | $T=3$ | $T=5$ | $T=10$ |
|---|---|---|---|---|---|
| Inference Latency | 0.19s | 0.20s | 0.21s | 0.22s | 0.24s |
| Relative Change | 0% | +7% | +11% | +17% | +29% |

Table 5: Inference latency of DeePEn with different search steps $T$.
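The sentence-level estimate in the latency analysis follows from simple arithmetic: if fusion runs only on the steps where the models disagree, the average overhead is the per-step overhead times the disagreement rate. The sketch below uses the figures quoted in the text; the function names are illustrative, and comparing top-1 tokens as strings is an assumption about how agreement would be checked.

```python
def agreed_token(top_tokens):
    """Return the shared top-1 token if every model agrees, else None."""
    first = top_tokens[0]
    return first if all(token == first for token in top_tokens) else None


def expected_overhead(per_step_overhead, agreement_rate):
    """Average relative overhead when fusion runs only on disagreeing steps."""
    return per_step_overhead * (1.0 - agreement_rate)


# Figures quoted in the text: +17% per fused step (T = 5), 90% agreement.
overhead = expected_overhead(0.17, 0.90)

skip_fusion = agreed_token(["the", "the"])  # models agree, fusion is skipped
needs_fusion = agreed_token(["the", "a"])   # models disagree, fusion is required
```

The result is about 1.7%, consistent with the stated bound of less than 2%; the estimate ignores the small cost of the agreement check itself.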
Figure 11: Distance distribution to nearest neighbor words. The distance is measured by calculating the cosine similarity between words.
| Test set | Baseline | TriviaQA | NQ | ARC-C | MMLU |
|---|---|---|---|---|---|
| TriviaQA | 73.42 | 75.9 | 75.41 | 75.56 | 75.44 |
| NQ | 29.11 | 30.55 | 30.65 | 30.42 | 30.69 |
| ARC-C | 60.29 | 69.32 | 72.31 | 74.19 | 73.76 |
| MMLU | 54.06 | 59.97 | 61.04 | 61.94 | 61.42 |

Table 6: Cross-distribution validation of relative ensemble learning rate ($\eta$). We report the performance of ensembling LLaMA2-13B and Mistral-7B. Each row indicates the test set used to evaluate performance. Each column indicates the development set used to search the optimal value of $\eta$.
Figure 12: Effect of different number of relative ensemble learning steps.

Appendix D Limitations

As illustrated in Tab. 4.1, collaboration with more LLMs can sometimes lead to a performance drop caused by interference from lower-performing models. This issue limits the ensemble performance of our current method, even though we have explored setting different collaboration weights for each model on each benchmark (DeePEn-Adapt). An ideal solution would be to set adaptive collaboration weights at the sample level, or even the token level, for each LLM, which remains a significant challenge. Despite this, our work represents an important step towards the distribution fusion of LLMs.