Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration
Abstract
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine all candidate answers, which poses a great challenge to generalization on unseen data distributions. Besides, prior methods use textual responses as the communication medium, ignoring the valuable information in the internal representations. In this work, we propose a training-free ensemble framework, DeePEn, which fuses the informative probability distributions yielded by different LLMs at each decoding step. Unfortunately, the vocabulary discrepancy between heterogeneous LLMs makes directly averaging the distributions infeasible because of token misalignment. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space based on relative representation theory, and performs aggregation there. Next, we devise a search-based inverse transformation to transform the aggregated result back to the probability space of one of the ensembled LLMs (the main model), in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles of an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn has complementary strengths with other ensemble methods such as voting. Our code is available at: https://github.com/OrangeInSouth/DeePEn.
1 Introduction
With the scaling of model capacities and data volumes, generative large language models (LLMs) have shown impressive language understanding and generation abilities, shedding light on the path toward artificial general intelligence [35, 22, 13, 28]. Owing to the diversity of data sources, model architectures, and training recipes, LLMs have different strengths and weaknesses across tasks and cases. Therefore, recent research has explored ensembling LLMs to exploit their complementary potential [15, 19].
Existing methods can be categorized into selection-based and fusion-based ensembling. Selection-based ensembling selects the best candidate answer from all individual LLMs’ answers using an additionally trained reward model [15, 31, 25, 19]. Fusion-based ensembling combines all candidate answers using a trained fusion model [15]. However, these approaches inevitably face significant challenges in generalizing to unseen data distributions and base models. Besides, prior methods enable collaboration via conveying the textual responses between LLMs while ignoring the rich information (e.g., confidence and alternative answers) in the internal representations.
An ideal solution to this issue is to apply the well-established technique of prediction fusion [36, 24, 7, 10]. For LLM ensembling, prediction fusion works at each decoding step, averaging the probability distributions from different LLMs to determine the next token. It not only applies directly to the ensemble of any LLMs without extra parameter training, making it more general, but also leverages the informative internal representations (i.e., probability distributions) as the communication medium. Unfortunately, the vocabulary discrepancy between different LLMs makes it infeasible to average the distributions because of token misalignment.
In this work, we tackle this key challenge by drawing upon the cross-model invariance of relative representation, which represents each token using the embedding similarities of this token to a set of anchor tokens [21]. Specifically, we propose an ensemble framework DeePEn (Deep Parallel Ensemble), enabling distribution fusion for heterogeneous LLMs. DeePEn transforms the probability distribution from the heterogeneous probability space to a homogeneous relative space, using a matrix formed by the relative representation of all tokens. Next, DeePEn aggregates the relative representations of all probability distributions in the relative space, coordinating the decision on the next token. Finally, the result of aggregation is transformed back to the probability space of the main model using a search-based inverse transformation to determine the next token.
We conduct extensive experiments ranging from 2-model to 9-model ensembles, covering ensembles of models with parameters ranging from 6B to 70B, ensembles of dense and sparse models, and the ensemble of LLMs with specialist models. Experimental results on six widely-used benchmarks demonstrate that compared to baselines, DeePEn achieves consistent improvements across all benchmarks. It is also discovered that DeePEn has complementary strengths when combined with other ensemble methods.
2 Theoretical Analysis
We first introduce relative representation and then illustrate the theoretical support for our method.
2.1 Relative Representation
Previous studies have found that, despite the misalignment between the latent spaces of different neural networks, the embedding similarity between samples does not change across models [21, 11, 23]. Specifically, Moschella et al. [21] propose the relative representation, which represents each sample $x^{(i)}$ by its embedding similarities to a set of anchor samples $\mathbb{A}$ ($x^{(i)}$ and $\mathbb{A}$ are identically distributed):
$$\mathbf{r}_{x^{(i)}} = \left(\cos(e_{x^{(i)}}, e_{a^{(1)}}), \ldots, \cos(e_{x^{(i)}}, e_{a^{(|\mathbb{A}|)}})\right) \tag{1}$$
where $e_{(\cdot)}$ denotes the embedding of a sample, which is also called its absolute representation.
It has been shown empirically that relative representations possess cross-model invariance, i.e., the relative representation of the same sample remains invariant across different models, which lays the theoretical foundation for our work to fuse heterogeneous probability distributions.
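As a minimal illustration of this invariance, the sketch below computes cosine-similarity relative representations for a toy embedding table and for a rotated copy of it. The rotation stands in for a "second model" whose latent space is misaligned with the first; this is an idealized assumption, since real LLM embedding spaces are only approximately related in this way. The function name `relative_representation` and all toy data are hypothetical.

```python
import numpy as np

def relative_representation(embeddings, anchor_embeddings):
    """Cosine similarity of every sample (rows) to every anchor (columns)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    a = anchor_embeddings / np.linalg.norm(anchor_embeddings, axis=1, keepdims=True)
    return e @ a.T

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))   # 6 toy samples in a 4-d latent space
anchors = emb[:3]               # the first 3 samples act as anchors

# Toy "second model": same geometry, latent space rotated by a random
# orthogonal matrix (an idealization of latent-space misalignment).
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
emb_rot = emb @ q

rel_a = relative_representation(emb, anchors)
rel_b = relative_representation(emb_rot, emb_rot[:3])
```

The absolute embeddings `emb` and `emb_rot` differ, yet `rel_a` and `rel_b` coincide up to floating-point error, because cosine similarity is unchanged by orthogonal transformations.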
2.2 Theoretical Support for DeePEn
Averaging probability distributions has been widely shown to improve predictive performance in the image and text domains [2, 10]. For generative language models, as we understand it, the underlying mechanism is to interpolate the different output semantics represented by the probability distributions. However, for LLM ensembling, vocabulary discrepancy isolates these output semantics in semantic spaces with different basis vectors, making the interpolation infeasible. To tackle this challenge, we aim to enable cross-model alignment of output semantics, i.e., to find a transformation that maps the output semantics into a universal space. To this end, we propose to represent the output semantics as the convex combination of the relative representations of all tokens, where the weight of each token is the probability assigned to it.
Definition of output semantics in relative space.
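Formally, we are given the absolute representation of the output semantics $\mathbf{p}$ and the relative representation matrix $R \in \mathbb{R}^{|V| \times |\mathbb{A}|}$, where $V$ is the vocabulary and $\mathbb{A} \subseteq V$ is the anchor token set. The $i$-th row of $R$ is the relative representation of word $w^{(i)}$:
$$R[i] = \left(\cos(e_{w^{(i)}}, e_{a^{(1)}}), \ldots, \cos(e_{w^{(i)}}, e_{a^{(|\mathbb{A}|)}})\right) \tag{2}$$
and the relative representation of the output semantics is defined as $\mathbf{r} = \mathbf{p} \cdot R$.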
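The convex-combination view can be made concrete with a small numerical sketch. The matrix and distribution below are made-up toy values, not quantities from any real model; the point is only that multiplying a probability vector by the relative representation matrix mixes the token rows with the token probabilities as weights.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, num_anchors = 5, 3

# Toy relative representation matrix: one row per token, one column per anchor.
R = rng.uniform(-1.0, 1.0, size=(vocab_size, num_anchors))

# Toy next-token distribution (absolute representation of the output semantics).
p = np.array([0.7, 0.2, 0.1, 0.0, 0.0])

# Relative representation of the output semantics.
r = p @ R

# A one-hot distribution recovers the relative representation of that token.
one_hot = np.eye(vocab_size)[2]
r_token = one_hot @ R
```

Because `p` sums to one and is non-negative, every coordinate of `r` lies between the smallest and largest entry of the corresponding column of `R`.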
Model-invariance of relative representation of output semantic.
Next, we illustrate why this representation scheme can align output semantics that are isolated in heterogeneous absolute spaces. First, consider two LLMs $\theta_A$ and $\theta_B$ with the same vocabulary (e.g., LLaMA2-7B and LLaMA2-13B). When expressing the same output semantics, these models output the same probability distribution (i.e., absolute representation), $\mathbf{p}_A = \mathbf{p}_B$. Besides, they have the same (highly similar in practice) relative representation matrix, $R_A = R_B$, due to the vocabulary consistency and the cross-model invariance of relative representations. Therefore, the relative representations of the output semantics are also identical:
$$\mathbf{r}_A = \mathbf{p}_A \cdot R_A = \mathbf{p}_B \cdot R_B = \mathbf{r}_B \tag{3}$$
Then, let us consider a language model $\theta_C$ with a different vocabulary (e.g., Mistral). Because different LLMs typically share a large number of tokens in their vocabularies (§A), the vocabulary of $\theta_C$ can be viewed as the vocabulary of $\theta_A$ with some tokens added and some removed, which leads to $\mathbf{p}_C \neq \mathbf{p}_A$ and $R_C \neq R_A$. However, in our study, we find that this change to the vocabulary has no significant influence on the relative representations of the unchanged tokens (i.e., the tokens common to $\theta_A$ and $\theta_C$), as shown in Fig. 1. Therefore, we make the reasonable assumption that a local change in the vocabulary hardly influences the relative space.
3 Methodology
In this section, we first introduce the overall process of our ensemble framework DeePEn and then describe the three parts of DeePEn in detail.
3.1 Overview
We illustrate the process of DeePEn in Fig. 2. Given $N$ models to ensemble, DeePEn first constructs their transformation matrices (i.e., relative representation matrices), which map the probability distributions from the heterogeneous absolute spaces into the relative space (§3.2). At each decoding step, all models perform prediction and output probability distributions. These distributions are mapped into the relative space and aggregated (§3.3). Finally, the aggregation result is transformed back into the absolute space of the main model, in order to determine the next token (§3.4).
3.2 Construction of Relative Transformation
Given $N$ models to ensemble, DeePEn first finds the intersection of the vocabularies of all models, i.e., the common token set $C$, and either samples a subset of it or uses the full set of common tokens as the anchor token set $\mathbb{A} \subseteq C$. Next, for each model, DeePEn calculates the embedding similarities of each token to the anchor tokens, obtaining the relative representation matrix $R$ (as shown in Eq. 2). Finally, to overcome the relative representation degeneration of outlier words, which is introduced later in this section, we normalize the relative representation of every token with a softmax operation so that it becomes a probability distribution. We denote the normalized representation matrix by $\hat{R}$:
$$\hat{R}[i] = \mathrm{softmax}(R[i]) \tag{4}$$
Anchor Selection.
The choice of anchor tokens is crucial for the capability of the relative representation. Previous research finds that the capability improves as the number of anchor words increases [21]. Therefore, we employ the full set of common words between LLMs as the anchor words. We also show empirically that this choice performs more stably on downstream tasks (§5.2).
Normalization of relative representation matrix.
In DeePEn, the relative representation of each token is normalized by the softmax operation to avoid the relative representation degeneration of outlier words, i.e., words that are far away from other words (including the anchors) and therefore become indistinguishable in the relative space because their relative representations are zero vectors. The softmax operation resolves this problem by turning each relative representation into a probability distribution instead of a zero vector.
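The construction described in this subsection can be sketched as follows, under stated assumptions: similarities are cosine similarities, anchors are the full set of common tokens in a fixed sorted order, and the softmax is applied row-wise without any temperature. The function name `normalized_relative_matrix`, the toy vocabularies, and the random embeddings are all hypothetical and do not come from the DeePEn code base.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def normalized_relative_matrix(vocab, embeddings, anchor_tokens):
    """Rows: tokens of this vocabulary; columns: anchors; rows softmax-normalized."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anchor_rows = unit[[index[t] for t in anchor_tokens]]
    R = unit @ anchor_rows.T        # cosine similarity of each token to each anchor
    return softmax(R, axis=1)       # each row becomes a probability distribution

# Two toy vocabularies that overlap only partially.
vocab_a = ["the", "cat", "sat", "##s", "dog"]
vocab_b = ["cat", "the", "dog", "Ġsat", "mat", "sat"]

# Anchors: tokens shared by both vocabularies, in one fixed order for both models.
anchors = sorted(set(vocab_a) & set(vocab_b))

rng = np.random.default_rng(0)
R_hat_a = normalized_relative_matrix(vocab_a, rng.normal(size=(len(vocab_a), 8)), anchors)
R_hat_b = normalized_relative_matrix(vocab_b, rng.normal(size=(len(vocab_b), 8)), anchors)
```

The two matrices have different numbers of rows (one per token of each vocabulary) but the same number of columns (one per anchor), which is what allows distributions over different vocabularies to be compared in a shared space.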
3.3 Aggregation in Relative Space
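At each decoding step, once each model $\theta_i$ outputs the probability distribution $\mathbf{p}_i$, DeePEn transforms $\mathbf{p}_i$ into the relative representation $\mathbf{r}_i$ using the normalized relative representation matrix, $\mathbf{r}_i = \mathbf{p}_i \cdot \hat{R}_i$,
and aggregates all relative representations to obtain the aggregated relative representation:
$$\bar{\mathbf{r}} = \sum_{i=1}^{N} \alpha_i \, \mathbf{r}_i \tag{5}$$
where $\alpha_i$ is the collaboration weight of model $\theta_i$ (§3.5).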
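A minimal sketch of this aggregation step is given below. It assumes two toy models with vocabulary sizes 4 and 5 and three shared anchors, and it draws row-stochastic matrices at random in place of real normalized relative representation matrices; the helper name `aggregate` is hypothetical.

```python
import numpy as np

def aggregate(distributions, rel_matrices, weights):
    """Weighted average of the per-model relative representations p_i @ R_hat_i."""
    rel = [p @ R_hat for p, R_hat in zip(distributions, rel_matrices)]
    return sum(w * r for w, r in zip(weights, rel))

rng = np.random.default_rng(0)
num_anchors = 3

# Toy row-stochastic matrices standing in for the normalized relative matrices.
R_hat_1 = rng.dirichlet(np.ones(num_anchors), size=4)   # model 1: 4 tokens
R_hat_2 = rng.dirichlet(np.ones(num_anchors), size=5)   # model 2: 5 tokens

# Toy next-token distributions over each model's own vocabulary.
p_1 = rng.dirichlet(np.ones(4))
p_2 = rng.dirichlet(np.ones(5))

r_bar = aggregate([p_1, p_2], [R_hat_1, R_hat_2], weights=[0.5, 0.5])
```

Although `p_1` and `p_2` have different lengths and cannot be averaged directly, their relative representations both have one entry per anchor. When every row of each matrix sums to one and the weights sum to one, `r_bar` is itself a probability distribution over the anchors.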
3.4 Inverse Transformation of Relative Representations
To decide the next token according to the aggregated relative representation, DeePEn transforms it from the relative space back to the absolute space of the main model $\theta_m$, which is chosen empirically as the best-performing model on the development set. To enable this inverse transformation, we adopt a search-based strategy that looks for the absolute representation whose relative representation is identical to the aggregated relative representation. This search problem is formulated as:
$$\mathbf{p}'_m = \arg\min_{\mathbf{p}' \in \mathbb{P}_m} \ell\left(\mathbf{p}' \cdot \hat{R}_m, \bar{\mathbf{r}}\right) \tag{6}$$
where $\mathbb{P}_m$ denotes the absolute space of model $\theta_m$, and $\ell$ is the loss function that measures the distance between relative representations. In this work, we adopt the KL-divergence because of its favorable convergence.
This search is conducted iteratively under the guidance of the gradient of the loss in Eq. 6 with respect to the absolute representation $\mathbf{p}'$. Specifically, we initialize the search with the main model's original absolute representation, $\mathbf{p}'^{(0)} = \mathbf{p}_m$, and update it as:
$$\mathbf{p}'^{(t+1)} = \mathbf{p}'^{(t)} - \eta \cdot \frac{\partial \ell\left(\mathbf{p}'^{(t)} \cdot \hat{R}_m, \bar{\mathbf{r}}\right)}{\partial \mathbf{p}'^{(t)}}, \quad t = 0, \ldots, T-1 \tag{7}$$
where $\eta$ is an important hyperparameter named the relative ensemble learning rate, and $T$ is the number of iterations, named the relative ensemble learning steps. Finally, we use the updated absolute representation $\mathbf{p}'^{(T)}$ to determine the emitted token.
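The search can be sketched as projected gradient descent on a two-token toy problem. Two choices in the sketch are assumptions that the text does not specify: the loss is taken as KL(aggregated target || current relative representation), and each update is followed by a Euclidean projection onto the probability simplex so that the iterate stays a valid distribution. The names `inverse_transform`, `project_to_simplex`, and `kl`, as well as all numbers, are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def kl(target, q):
    """KL(target || q) for strictly positive distributions."""
    return float(np.sum(target * np.log(target / q)))

def inverse_transform(p_main, R_hat, r_bar, lr=0.1, steps=5):
    """Search for p whose relative representation p @ R_hat approaches r_bar."""
    p = p_main.copy()
    for _ in range(steps):
        q = p @ R_hat                   # current relative representation
        grad = -R_hat @ (r_bar / q)     # gradient of KL(r_bar || p @ R_hat) w.r.t. p
        p = project_to_simplex(p - lr * grad)
    return p

# Toy main model with 2 tokens and 2 anchors.
R_hat = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
p_main = np.array([0.5, 0.5])           # main model's own prediction
r_bar = np.array([0.65, 0.35])          # aggregated target, equal to [0.75, 0.25] @ R_hat

p_few = inverse_transform(p_main, R_hat, r_bar, lr=0.1, steps=5)
p_many = inverse_transform(p_main, R_hat, r_bar, lr=0.1, steps=500)
next_token = int(np.argmax(p_few))
```

With a few steps, `p_few` moves from the main model's prediction toward the distribution `[0.75, 0.25]` that exactly reproduces the target, and with many steps `p_many` converges to it. If the target already equals the main model's own relative representation, the search leaves the prediction unchanged.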
3.5 Collaboration Schemes
DeePEn aggregates the output distributions of individual models by performing weighted averaging on their relative representations (Eq. 5). As our work focuses on enabling distribution fusion for heterogeneous LLMs rather than on finding the optimal collaboration weights, we follow the most common practice and aggregate the distributions uniformly ($\alpha_i = 1/N$, where $N$ is the number of models), which we name DeePEn-Avg. Besides, we also adopt a simple and effective method of deducing weights, DeePEn-Adapt, which heuristically assigns a larger weight to a model with better performance on the development set. The weight of model $\theta_i$ is derived from $s_i$, the average accuracy of the model on the development set, and $c$, the chance level on the evaluation task. Specifically, $c = 0$ on the free-form generation tasks and $c = 1/n$ on the $n$-choice tasks.
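The two schemes can be illustrated with a short sketch. The uniform weights follow the description of DeePEn-Avg directly. The second function is only one hypothetical way to turn development-set accuracy and chance level into weights (weights proportional to the margin above chance); it is an assumption made for illustration and not necessarily the formula used by DeePEn-Adapt. Both function names are hypothetical.

```python
def uniform_weights(n_models):
    """Equal collaboration weight for every model (the DeePEn-Avg setting)."""
    return [1.0 / n_models] * n_models

def margin_weights(dev_accuracies, chance_level):
    """Hypothetical heuristic: weight proportional to accuracy above chance."""
    margins = [max(acc - chance_level, 0.0) for acc in dev_accuracies]
    total = sum(margins)
    if total == 0.0:                      # no model beats chance: fall back to uniform
        return uniform_weights(len(dev_accuracies))
    return [m / total for m in margins]

avg_w = uniform_weights(4)
# Toy 4-choice task: chance level 1/4, two models with made-up dev accuracies.
adapt_w = margin_weights([0.65, 0.45], chance_level=0.25)
```

Under this heuristic the first toy model, whose margin above chance is twice that of the second, receives twice the weight.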
4 Experiments
4.1 Experimental Setup
Benchmarks.
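We mainly conduct experiments on six benchmarks, which can be categorized into:
- Examination: MMLU [12] and ARC-C [5].
- Reasoning: GSM8K [6] and PIQA [3].
- Knowledge: TriviaQA [16] and NQ [18].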
Evaluation.
For all benchmarks, we follow the test scripts of OpenCompass leaderboard. Specifically, on the multiple-choice tasks (MMLU, ARC-C, and PIQA), the option with the highest likelihood is selected to calculate the accuracy. On the free-form generation tasks (GSM8K, TriviaQA and NQ), we calculate the exact match (EM) accuracy.
Individual models.
As ensemble learning typically works on models with comparable performance [24, 34], we select six well-performing LLMs whose performance is closely matched: LLaMA-2-13B [29], Mistral-7B-v0.1 [13], InternLM-20B [26], Yi-6B [1], Skywork-13B-base [32], and Tigerbot-13b-base-v2 [4]. To achieve better ensemble performance, we conduct experiments on the ensemble of the top-2 models and the top-4 models for each benchmark. Besides, we also consider ensembling various numbers of models (§4.3) and ensembling more diverse models (§5.1).
Hyperparameters.
In this work, we select all of the common tokens between LLMs as the anchor tokens to build the relative spaces, i.e., $\mathbb{A} = C$ (§5.2). For the inverse transformation of relative representations, we search for the optimal relative ensemble learning rate ($\eta$ in Eq. 7) from 0.05 to 0.30 with an interval of 0.05. We set the number of relative ensemble learning steps $T$ empirically (§5.3).
Comparative methods.
We compare DeePEn with (1) MinED [30, 9], which maps the probability distributions of heterogeneous LLMs to the distribution of the main model by aligning tokens in different vocabularies with edit distance, and (2) LLM-Blender [15], which comprises a reward model, PairRanker, to score each response of the LLMs and a fusion model, GenFuser, to fuse candidate responses. In this work, we only adopt PairRanker, since GenFuser suffers from serious over-generation under our training-free setting. In the ensemble of more than two models, we introduce two additional ensemble methods: (3) Voting, which selects the choice favored by most models on tasks whose outputs are limited to a fixed set, and (4) MBR [8, 17], which selects the answer with the highest textual similarity to the other candidate answers. The implementation details of the baselines are given in §B.
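The two answer-level baselines can be sketched in a few lines. The sketch assumes a simple majority rule for Voting and, for MBR, uses the character-level similarity ratio from Python's standard `difflib` module as the textual similarity. That similarity measure is a stand-in chosen for illustration; the measure actually used for the baseline is described in §B and may differ. The function names and candidate answers are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def mbr_select(candidates):
    """Return the candidate with the highest total similarity to the others."""
    def sim(a, b):
        return SequenceMatcher(None, a, b).ratio()
    scores = [
        sum(sim(c, other) for j, other in enumerate(candidates) if j != i)
        for i, c in enumerate(candidates)
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

voted = majority_vote(["18", "20", "18", "17"])      # answers from four toy models
selected = mbr_select(["Paris", "Paris", "Lyon"])    # free-form toy answers
```

Both baselines operate on complete textual answers, in contrast to distribution-level fusion, which acts at every decoding step.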
Models | MMLU (Examination) | ARC-C (Examination) | GSM8K (Reasoning) | PIQA (Reasoning) | TriviaQA (Knowledge) | NQ (Knowledge) |
---|---|---|---|---|---|---|
Individual Models | | | | | | |
LLaMA2-13B | 55.07 | 59.32 | 29.80 | 59.68 | 74.32 | 28.67 |
InternLM-20B | 59.94 | 75.81 | 53.83 | 64.78 | 66.88 | 26.09 |
Skywork-13B | 61.16 | 66.50 | 53.90 | 74.04 | 58.65 | 19.75 |
Tigerbot-13B | 51.95 | 57.44 | 48.82 | 68.28 | 66.22 | 22.71 |
Mistral-7B | 62.13 | 73.33 | 47.50 | 65.61 | 73.18 | 27.62 |
Yi-6B | 63.25 | 73.33 | 37.91 | 76.15 | 59.02 | 18.98 |
Top-2 Ensemble | | | | | | |
LLM-Blender | 63.85 (+0.60) | 75.73 (-0.08) | 54.89 (+0.99) | 78.31 (+2.16) | 74.10 (-0.22) | 28.61 (-0.06) |
MinED | 65.04 (+1.79) | 77.35 (+1.54) | 18.50 (-35.40) | 78.98 (+2.83) | 72.30 (-2.02) | 28.45 (-0.22) |
DeePEn-Avg | 64.68 (+1.43) | 77.52 (+1.71) | 55.42 (+1.52) | 78.87 (+2.72) | 75.90 (+1.58) | 30.17 (+1.50) |
Top-4 Ensemble | | | | | | |
LLM-Blender | 61.44 (-1.81) | 71.03 (-4.78) | 43.37 (-10.53) | 71.16 (-4.99) | 67.87 (-6.45) | 24.18 (-4.49) |
Voting | 64.88 (+1.63) | 78.41 (+2.60) | 63.15 (+9.25) | 76.82 (+0.67) | - | - |
MBR | - | - | 62.09 (+8.26) | - | 74.32 (+0.00) | 30.28 (+1.61) |
MinED | 65.61 (+2.36) | 78.68 (+2.87) | 56.56 (+2.66) | 77.87 (+1.72) | 71.62 (-2.70) | 29.50 (+0.83) |
DeePEn-Avg | 65.09 (+1.84) | 78.70 (+2.89) | 56.18 (+2.28) | 77.15 (+1.00) | 75.74 (+1.42) | 31.55 (+2.88) |
DeePEn-Adapt | 65.25 (+2.00) | 79.15 (+3.34) | 56.25 (+2.35) | 78.59 (+2.44) | 75.76 (+1.44) | 31.77 (+3.10) |
+Voting/MBR | 65.40 (+2.15) | 79.44 (+3.63) | 65.25 (+11.35) | 77.37 (+1.22) | 75.65 (+1.33) | 32.11 (+3.44) |
4.2 Main Results
The main results are shown in Tab. 4.1, from which we have drawn the following observations:
(1) DeePEn achieves consistent improvements over the individual models.
These results show that DeePEn successfully enables collaboration between heterogeneous LLMs by aggregating their probability distributions in the relative space. Specifically, DeePEn-Avg achieves improvements ranging from +1.43 (MMLU) to +2.72 (PIQA) on the ensemble of top-2 models, and from +1.00 (PIQA) to +2.89 (ARC-C) on the ensemble of top-4 models. DeePEn-Adapt gains improvements ranging from +1.44 (TriviaQA) to +3.34 (ARC-C) on the ensemble of top-4 models.
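The reported gains are consistent with being measured against the best individual model on each benchmark. The short check below reproduces the two DeePEn-Avg gains on MMLU from the accuracies listed in the main results table; the variable names are hypothetical.

```python
# MMLU accuracies of the six individual models, copied from the main results table.
individual_mmlu = {
    "LLaMA2-13B": 55.07,
    "InternLM-20B": 59.94,
    "Skywork-13B": 61.16,
    "Tigerbot-13B": 51.95,
    "Mistral-7B": 62.13,
    "Yi-6B": 63.25,
}

# MMLU accuracies of DeePEn-Avg in the two ensemble settings.
ensemble_mmlu = {"DeePEn-Avg (top-2)": 64.68, "DeePEn-Avg (top-4)": 65.09}

# Reference point: the best individual model on this benchmark.
best_name, best_acc = max(individual_mmlu.items(), key=lambda kv: kv[1])

# Gain of each ensemble over that reference, rounded to two decimals.
gains = {name: round(acc - best_acc, 2) for name, acc in ensemble_mmlu.items()}
```

Here the reference model is Yi-6B at 63.25, and the computed gains match the +1.43 and +1.84 shown in parentheses in the table.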
(2) DeePEn shows better stability than baselines.
As shown, LLM-Blender struggles to achieve improvements under the training-free setting. MinED shows unstable performance across benchmarks. For example, MinED leads to performance drops of 35.40 points on GSM8K under the top-2 ensemble setting and 2.70 points on TriviaQA under the top-4 ensemble setting, indicating the limitation of using textual similarity to align tokens in heterogeneous vocabularies. Case studies reveal that aligning tokens with edit distance disturbs the decoding and produces incomplete words (demonstrated in §10). In contrast, DeePEn-Avg achieves consistent improvements and surpasses all baselines in 7/12 settings.
(3) DeePEn has complementary strengths with other ensemble methods.
Voting achieves a significant improvement on the mathematical reasoning benchmark GSM8K, showing the effectiveness of reasoning with multiple paths. To demonstrate the complementary strength of DeePEn with Voting, we combine both methods. On TriviaQA and NQ, Voting is replaced with MBR. As shown, the combination of both methods gains a further improvement over Voting on GSM8K (63.15 → 65.25).
(4) Collaboration with additional, worse-performing LLMs is a double-edged sword.
The ensemble performance of DeePEn-Avg with the top-4 models surpasses that with the top-2 models on 4 benchmarks, but falls short on 2 benchmarks. This is reasonable because incorporating the 3rd- and 4th-ranked LLMs adds complementary strengths but also interferes with the top-2 models.
4.3 Results on Different Numbers of Models
Next, we illustrate the effectiveness of DeePEn on ensembles of more models on MMLU, PIQA, and NQ. We add Nanbeige-13B to the ensemble on all three benchmarks, and add LLaMA2-70B and Mixtral-8×7B on PIQA because of their comparable performance. As illustrated in Fig. 3, the ensemble performance first increases and then decreases as more models join in descending order of performance. The ensemble performance peaks with the top-4 or top-5 models across the three benchmarks.
5 Analysis
To deeply understand DeePEn, we first evaluate its performance on the ensemble learning of model sets with diverse architectures, abilities, and performance gaps. Next, we conduct a series of analyses on the reverse transformation process of relative representations.
5.1 Results of Ensembling Diverse Models
Model | GSM8K | PIQA |
---|---|---|
LLaMA2-70B (Dense) | 63.84 | 71.27 |
Mixtral-8×7B (Sparse) | 65.73 | 71.88 |
DeePEn | 67.33 | 75.10 |
Improvement over best individual model | +1.60 | +3.22 |

Model | En→De | De→En | En→Ro | Ro→En |
---|---|---|---|---|
LLaMA2-13B | 30.60 | 42.27 | 30.83 | 39.99 |
NLLB-600M | 32.30 | 41.49 | 31.91 | 42.39 |
DeePEn | 33.34 | 43.70 | 32.95 | 42.84 |
Improvement over best individual model | +1.04 | +1.43 | +1.04 | +0.45 |
Ensemble of the dense model and the sparse model.
We first evaluate our method on the ensemble of a dense model and a sparse MoE model on challenging reasoning tasks. Specifically, we use the widely-used large-scale dense model LLaMA2-70B [29] and the popular sparse MoE model Mixtral-8×7B [14] as the base models. As shown in Tab. 5.1, DeePEn achieves improvements of +1.60 and +3.22 on the GSM8K and PIQA datasets, even though the base models already achieve a high level of performance.
Ensemble of the generalist model and the specialist model.
To investigate the effectiveness of DeePEn on the ensemble of a generalist model and a task-specific specialist model, we conduct experiments on machine translation using the ensemble of the large language model LLaMA2 and the machine translation model NLLB [27], a well-known open-source multilingual translator. We adopt the widely-used machine translation benchmark Flores-200 (https://github.com/facebookresearch/flores). As shown in Tab. 5.1, DeePEn achieves better translation performance by leveraging the diverse translation knowledge in the generalist LLM and the specialist translator.
Ensemble of models with different performance gaps.
To assess the stability of DeePEn with respect to the performance gap between base models, we conduct an experiment on the ensemble of model pairs with increasing performance gaps. As shown in Tab. 8, an ensemble of a well-performing model (the top-ranked model) and a worse-performing model either achieves improvements or lags only slightly behind the well-performing model.
5.2 Analysis on Relative Transformation
Effect of anchor selection.
We demonstrate the impact of different numbers of anchor words through experiments with the top-2 ensemble models on the MMLU and ARC-C datasets. As shown in Fig. 6, an increased number of anchor words can improve performance for LLMs in downstream tasks, and selecting the full set of common words as anchors provides better performance.
Effect of normalization on relative representation matrix.
Methods | MMLU-Dev ACC | MMLU-Dev Δ | TriviaQA-Dev ACC | TriviaQA-Dev Δ |
---|---|---|---|---|
Baseline | 61.19 | - | 72.74 | - |
DeePEn | 63.61 | +2.42 | 74.79 | +2.05 |
w/o. Rel-Norm | 60.73 | -0.46 | 72.95 | +0.21 |
To demonstrate the importance of normalizing the relative representation matrix for ensemble performance (§3.2), we conduct an ablation analysis. As shown in Tab. 8, without normalization the ensemble struggles to achieve improvements because of the ineffective representation of outlier words, i.e., words distant from other words. The proportion of outlier words can be derived from the distribution of distances to nearest neighbor words, which is illustrated in Fig. 11. As illustrated, a remarkable proportion of words are distant from other words, i.e., the cosine similarity to their nearest neighbor word is less than 0.3. Through the normalization operation, the output semantics that intend to emit outlier words are prevented from becoming zero vectors under the relative transformation.
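The outlier criterion above (nearest-neighbor cosine similarity below 0.3) can be sketched as follows. The three-token embedding table is a made-up toy example, and the function name `outlier_fraction` is hypothetical; on a real model the input would be the full token embedding matrix.

```python
import numpy as np

def outlier_fraction(embeddings, threshold=0.3):
    """Fraction of tokens whose nearest neighbor has cosine similarity below threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)     # a token is not its own neighbor
    nearest = sims.max(axis=1)          # similarity to the closest other token
    return float(np.mean(nearest < threshold))

# Toy embedding table: two nearly parallel tokens and one isolated token.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])
frac = outlier_fraction(toy_embeddings)
```

In this toy table only the third token is an outlier, since its closest neighbor has a cosine similarity of about 0.11, so one third of the tokens fall below the threshold.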
5.3 Analysis of Reverse Transformation
To better understand the reverse transformation process (§3.4) transforming the relative representation back to the absolute space of the main model, we further analyze each component of this process.
Analysis of relative ensemble learning rates.
As shown in Tab. 2, the performance of DeePEn is sensitive to the value of the relative ensemble learning rate ($\eta$), abbreviated as RELR. This observation motivates us to measure the generality of this hyperparameter. Specifically, we illustrate the cross-distribution performance of the searched optimal value of $\eta$ in Tab. 6. As observed, the optimal value of RELR varies across datasets, which suggests that the inverse transformation from the relative space to the absolute space requires adaptive mapping schemes.
RELR ($\eta$) | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 |
---|---|---|---|---|---|---|
MMLU | +2.42 | +1.57 | +1.77 | +1.96 | +1.31 | +1.31 |
TriviaQA | +1.31 | +2.05 | +1.63 | +1.94 | +1.82 | +1.26 |
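The grid search over RELR can be written down directly from the table above. The sketch rebuilds the search grid (0.05 to 0.30 in steps of 0.05) and picks, for each dataset, the value with the largest gain; the variable names are hypothetical and the gains are copied from the table.

```python
# Search grid for the relative ensemble learning rate: 0.05, 0.10, ..., 0.30.
grid = [round(0.05 * k, 2) for k in range(1, 7)]

# Gains over the baseline for each grid value, copied from the table above.
gain_by_relr = {
    "MMLU": [2.42, 1.57, 1.77, 1.96, 1.31, 1.31],
    "TriviaQA": [1.31, 2.05, 1.63, 1.94, 1.82, 1.26],
}

# For each dataset, keep the grid value with the largest gain.
best_relr = {
    task: grid[max(range(len(grid)), key=gains.__getitem__)]
    for task, gains in gain_by_relr.items()
}
```

The selected value is 0.05 on MMLU and 0.10 on TriviaQA, which illustrates the dataset dependence of this hyperparameter.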
Effect of iteration steps in relative ensemble learning.
To give a closer view of the dynamics of the inverse transformation in DeePEn, we report how performance changes with the number of relative ensemble learning steps ($T$). We also report the dynamics of the relative ensemble learning loss ($\ell$ in Eq. 6). As shown in Fig. 12, on the one hand, more steps of relative ensemble learning lead to significantly lower losses. However, the loss hardly reaches zero, i.e., the search under-fits. On the other hand, increasing the number of steps causes the performance to first increase and then decrease. A possible reason for the drop is that, in the early stage of optimization, the updates concentrate on the high-probability tokens, whereas in the later stage the probabilities of all words are adjusted equally, so the high-probability tokens are interfered with by the low-probability ones, which hurts performance. Therefore, we recommend a modest number of steps.
6 Related Work
Selection-based ensemble.
Reranking is an intuitive solution for utilizing multi-model strengths. Jiang et al. [15] take the first step towards LLM ensembling, training a reward model, PairRanker, for pairwise comparison of candidate outputs. To overcome the huge computation costs of multi-LLM inference, several works have explored training a router to predict the best-performing model out of a fixed set of LLMs for a given input [31, 25, 19].
Fusion-based ensemble.
Towards a synergy between LLMs, Jiang et al. [15] propose GenFuser, trained to combine multiple candidate answers. Different from these training-dependent ensemble methods which pose a great challenge to the generalizability of the reward model or fusion model, our DeePEn is completely training-free, making it more general. Similar to our method, MinED also aims to tackle the vocabulary discrepancy via aligning the tokens in different vocabularies based on edit distance [30, 9]. Unfortunately, this textual similarity-based method exhibits unstable performance and produces abnormal text for LLM ensemble (Tab. 10).
There are several contemporaneous works related to our work. Xu et al. [33] propose EVA to tackle vocabulary discrepancy by learning token alignment between different vocabularies with the assistance of overlapping tokens. Our DeePEn eliminates this training process via directly aligning tokens with the relative representation (more discussion is illustrated in §B). Mavromatis et al. [20] explore adaptive collaboration weights at test time by harnessing the perplexity on the input prompt. We emphasize that this work is complementary to our work.
7 Conclusion
In this work, we propose a training-free LLM ensembling framework, DeePEn, which addresses the vocabulary discrepancy that arises when fusing the probability distributions of heterogeneous LLMs. Experimental results on six widely-used benchmarks demonstrate that DeePEn exhibits more stable performance than baseline methods and has complementary strengths with other ensemble methods such as Voting. We believe our work can inspire further research on LLM collaboration, model reuse, and knowledge distillation. In the future, we aim to explore more effective adaptive collaboration schemes to leverage the complementary strengths of different LLMs.
References
- AI [2024] 01.AI. Yi: Open foundation models by 01.ai, 2024.
- Allen-Zhu and Li [2020] Z. Allen-Zhu and Y. Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning, 2020. URL https://arxiv.org/abs/2012.09816.
- Bisk et al. [2020] Y. Bisk, R. Zellers, R. Le bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239.
- Chen et al. [2023] Y. Chen, W. Cai, L. Wu, X. Li, Z. Xin, and C. Fu. Tigerbot: An open multilingual multitask llm, 2023.
- Clark et al. [2018] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
- Cobbe et al. [2021] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021.
- Dong et al. [2020] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.
- Freitag et al. [2023] M. Freitag, B. Ghorbani, and P. Fernandes. Epsilon sampling rocks: Investigating sampling strategies for minimum Bayes risk decoding for machine translation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9198–9209, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.617. URL https://aclanthology.org/2023.findings-emnlp.617.
- Fu et al. [2023] Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing smaller language models towards multi-step reasoning. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10421–10430. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/fu23d.html.
- Garmash and Monz [2016] E. Garmash and C. Monz. Ensemble learning for multi-source neural machine translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1409–1418, Osaka, Japan, Dec. 2016. The COLING 2016 Organizing Committee. URL https://aclanthology.org/C16-1133.
- He and Ozay [2022] B. He and M. Ozay. Feature kernel distillation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=tBIQEvApZK5.
- Hendrycks et al. [2021] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- Jiang et al. [2023a] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023a.
- Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, and et al. Mixtral of experts, 2024.
- Jiang et al. [2023b] D. Jiang, X. Ren, and B. Y. Lin. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.792. URL https://aclanthology.org/2023.acl-long.792.
- Joshi et al. [2017] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
- Kumar and Byrne [2004] S. Kumar and W. Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. URL https://aclanthology.org/N04-1022.
- Kwiatkowski et al. [2019] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
- Lu et al. [2023] K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models, 2023.
- Mavromatis et al. [2024] C. Mavromatis, P. Karypis, and G. Karypis. Pack of llms: Model fusion at test-time via perplexity optimization, 2024.
- Moschella et al. [2023] L. Moschella, V. Maiorca, M. Fumero, A. Norelli, F. Locatello, and E. Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SrC-nwieGJ.
- OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
- Park et al. [2019] W. Park, D. Kim, Y. Lu, and M. Cho. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
- Sagi and Rokach [2018] O. Sagi and L. Rokach. Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery, 8(4):e1249, 2018. doi: https://doi.org/10.1002/widm.1249. URL https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1249.
- Shnitzer et al. [2024] T. Shnitzer, A. Ou, M. Silva, K. Soule, Y. Sun, J. Solomon, N. Thompson, and M. Yurochkin. Large language model routing with benchmark datasets, 2024. URL https://openreview.net/forum?id=LyNsMNNLjY.
- Team [2023] I. Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport, 2023.
- Team [2022] N. Team. No language left behind: Scaling human-centered machine translation, 2022. URL https://arxiv.org/abs/2207.04672.
- Touvron et al. [2023a] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023a.
- Touvron et al. [2023b] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Wan et al. [2024] F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi. Knowledge fusion of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jiDsk12qcz.
- Wang et al. [2024] H. Wang, F. M. Polo, Y. Sun, S. Kundu, E. Xing, and M. Yurochkin. Fusing models with complementary expertise. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PhMrGCMIRL.
- Wei et al. [2023] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, et al. Skywork: A more open bilingual foundation model, 2023.
- Xu et al. [2024] Y. Xu, J. Lu, and J. Zhang. Bridging the gap between different vocabularies for llm ensemble, 2024.
- Zhang and Ma [2012] C. Zhang and Y. Ma. Ensemble Machine Learning: Methods and Applications. Springer Publishing Company, Incorporated, 2012. ISBN 1441993258.
- Zhao et al. [2023] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen. A survey of large language models, 2023.
- Zhou [2012] Z.-H. Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
Appendix A Statistics of Common Tokens across different LLMs
We count the number of common tokens shared among different LLM vocabularies and present the results in Fig. 9. We observe that a large number of common words (>20k) exist across the different vocabularies. We also count the tokens common to all six LLMs and find a total of 18k, which enables DeePEn to be applied to ensembles of a large number of models.
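The counting described above amounts to a set intersection over token strings. The following is a minimal sketch, assuming each vocabulary is available as a mapping from token string to id; the toy vocabularies and the helper name `common_tokens` are illustrative and do not come from our released code. Tokenizers that mark word boundaries differently may need normalization before matching, which this sketch does not attempt.

```python
def common_tokens(vocabs):
    """Return the set of token strings present in every vocabulary.

    vocabs: a non-empty list of dicts mapping token string -> token id.
    """
    token_sets = [set(v) for v in vocabs]
    return set.intersection(*token_sets)


# Toy vocabularies (hypothetical); real ones hold tens of thousands of entries.
vocab_a = {"the": 0, "get": 1, "gets": 2, "cat": 3}
vocab_b = {"the": 5, "get": 9, "dog": 2}
vocab_c = {"the": 1, "get": 4, "gets": 7}

shared_by_all = common_tokens([vocab_a, vocab_b, vocab_c])
shared_by_pair = common_tokens([vocab_a, vocab_c])
print(len(shared_by_all), len(shared_by_pair))
```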
Appendix B Details of Baselines
LLM-Blender.
LLM-Blender comprises two components: (1) the selection-based ensemble method PairRanker Jiang et al. [15], a reward model that scores each LLM response, and (2) the fusion-based ensemble method GenFuser Jiang et al. [15], a generative model that fuses multiple candidate responses. Both models are trained on the constructed instruction-tuning dataset MixInstruct. In our experiments, since GenFuser struggles to generate responses in the expected format, we adopt only PairRanker.
Voting.
For tasks with outputs limited to a fixed set (i.e., MMLU, ARC-C, PIQA, GSM8K benchmarks), we adopt the Voting method on the ensemble learning of more than 2 models. Concretely, we count each candidate answer’s occurrences and select the most frequent as the final output. In the event of a tie, the main model’s answer is used as the final output.
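The voting rule above can be sketched as follows. This is a minimal illustration, assuming the answers have already been extracted as strings and that the main model's answer sits at a known index; the function name `vote` and the `main_index` parameter are hypothetical.

```python
from collections import Counter


def vote(answers, main_index=0):
    """Majority vote over candidate answers.

    answers: one extracted answer per model.
    main_index: position of the main model's answer, used to break ties.
    """
    counts = Counter(answers)
    top_count = max(counts.values())
    winners = [a for a, c in counts.items() if c == top_count]
    if len(winners) == 1:
        return winners[0]
    # Tie: fall back to the main model's answer, as described in the text.
    return answers[main_index]


print(vote(["A", "B", "B"]))
print(vote(["A", "B", "C"]))
```

With three models, a tie only occurs when all answers differ, in which case the output equals the main model's individual prediction.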
MBR.
For generation tasks, we implement the MBR [8, 17] method, which selects the answer with the highest lexical similarity to the other candidate answers. To measure this similarity, we experimented with the edit distance and chrF (https://github.com/mjpost/sacrebleu) metrics, ultimately choosing chrF due to its superior performance.
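The selection rule can be sketched as below. To keep the example self-contained, the similarity function is a stand-in based on `difflib.SequenceMatcher` from the Python standard library, not the chrF metric used in our experiments; the names `similarity` and `mbr_select` are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in lexical similarity in [0, 1]; the experiments use chrF instead."""
    return SequenceMatcher(None, a, b).ratio()


def mbr_select(candidates, sim=similarity):
    """Return the candidate with the highest total similarity to the others."""

    def score(i):
        return sum(
            sim(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )

    best = max(range(len(candidates)), key=score)
    return candidates[best]


answers = ["Paris", "Paris", "paris", "London"]
print(mbr_select(answers))
```

Any pairwise similarity function can be passed through the `sim` argument, so swapping in a chrF scorer only requires a wrapper with the same two-string signature.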
MinED.
To bridge the gap between different vocabularies in LLM ensembling, MinED applies the Minimum Edit Distance (MinED) approach to align tokens across different vocabularies, e.g., "get" to "gets". However, this mapping, which relies on textual similarity, can disturb the text generation process and produce incomplete words.
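The token alignment idea can be illustrated with a short sketch, assuming a plain Levenshtein distance and a nearest-neighbor lookup in the target vocabulary. The helper names and the alphabetical tie-breaking are assumptions made for this illustration and may differ from the actual MinED implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def min_ed_align(token, target_vocab):
    """Map a token to itself if shared, else to the closest target token.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    if token in target_vocab:
        return token
    return min(sorted(target_vocab), key=lambda t: edit_distance(token, t))


target = {"gets", "dog", "the"}
print(min_ed_align("get", target))
```

The example reproduces the "get" to "gets" mapping mentioned above, which shows how a purely textual match can substitute a different word form.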
EVA.
Recently, Xu et al. [33] propose EVA to tackle the vocabulary discrepancy by learning mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. We tried to re-implement their method with the released code. However, we encountered a technical problem: EVA only supports ensembling LLMs with the same embedding dimension. This is caused by a limitation of the vecmap tool (https://github.com/artetxem/vecmap), which is used to learn the token alignment.
Appendix C Additional Experiments
C.1 Choice of main model.
Models | MMLU-Dev Indiv | MMLU-Dev DeePEn | ARC-C-Dev Indiv | ARC-C-Dev DeePEn |
---|---|---|---|---|
Yi-6B | 61.19 | 63.61 (+2.42) | 72.72 | 77.55 (+4.83) |
Mistral-7B | 60.80 | 64.46 (+3.66) | 73.88 | 77.73 (+3.85) |
During inverse transformation, DeePEn maps the aggregated relative representation to the absolute space of the main model. Ideally, the result of the inverse transformation would be invariant to the choice of main model. However, this is hard to achieve because the search process underfits. We therefore report the performance variation across different main models in Tab. 3. On ARC-C, changing the main model from the first-ranked Mistral-7B to the second-ranked Yi-6B slightly decreases the ensemble performance from 77.73 to 77.55. Interestingly, on MMLU, changing the main model from the first-ranked Yi-6B to the second-ranked Mistral-7B actually improves the performance from 63.61 to 64.46, which indicates that Mistral-7B benefits more than Yi-6B from collaboration. Even so, the choice of main model does not significantly affect the ensemble performance.
C.2 Comparison to Vanilla Prediction Average
Models | MMLU-Dev Indiv | MMLU-Dev Vanil | MMLU-Dev DeePEn | MMLU-Test Indiv | MMLU-Test Vanil | MMLU-Test DeePEn |
---|---|---|---|---|---|---|
LLaMA1-13B | 43.26 | 45.48 | 44.37 | 43.70 | 45.01 | 44.22 |
LLaMA2-7B | 42.28 | 45.48 | 45.94 | 42.99 | 45.01 | 45.31 |
To compare DeePEn with the vanilla prediction average, we ensemble two LLMs with the same vocabulary and comparable performance on MMLU, i.e., LLaMA2-7B and LLaMA1-13B. As shown in Tab. 4, DeePEn performs comparably to, and in some cases better than, the vanilla prediction average. Theoretically, the performance of the vanilla prediction average is the upper bound of DeePEn. DeePEn can nevertheless surpass it on MMLU because of under-fitting in the inverse transformation process, which makes the weights used to aggregate the output semantics of the different models non-uniform (i.e., not (0.5, 0.5)). For example, in Tab. 4, the effective weights for LLaMA1 and LLaMA2 may be skewed so that the weight of the main model is larger than that of the other model.
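The effect of non-uniform aggregation weights can be seen in a small worked example. The sketch below averages two next-token distributions over a shared vocabulary, first with uniform weights (the vanilla prediction average) and then with weights skewed toward the first model. The distributions and the weights (0.7, 0.3) are hypothetical values chosen for illustration and are not measurements from our experiments.

```python
def average_distributions(dists, weights=None):
    """Weighted average of probability distributions over a shared vocabulary.

    dists: list of equal-length probability lists.
    weights: one weight per distribution; defaults to uniform weights.
    """
    n = len(dists)
    if weights is None:
        weights = [1.0 / n] * n
    vocab_size = len(dists[0])
    return [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(vocab_size)
    ]


def argmax(dist):
    return max(range(len(dist)), key=dist.__getitem__)


p_main = [0.6, 0.3, 0.1]   # hypothetical output of the main model
p_other = [0.2, 0.7, 0.1]  # hypothetical output of the other model

uniform = average_distributions([p_main, p_other])
skewed = average_distributions([p_main, p_other], weights=[0.7, 0.3])
print(argmax(uniform), argmax(skewed))
```

In this toy case the uniform average prefers the second token, while the skewed average follows the main model and prefers the first, so the two schemes can select different next tokens from the same inputs.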
C.3 Latency Analysis
To fuse heterogeneous distributions, DeePEn first maps the distributions into the relative space and then uses the search-based inverse transformation to map the aggregated relative representation back to the main model’s probability distribution, which incurs extra latency. This latency is mainly caused by the inverse transformation, which requires multiple rounds of search. To quantify it, we report the token-level inference latency of ensembling two LLMs (Mixtral-8×7B and LLaMA2-70B). This experiment is conducted on 8 A100 GPUs, and all of our experiments can be reproduced on 8 A100 GPUs. As shown in Tab. 5, DeePEn adds 17% to the token-level inference latency. In practice, however, this overhead can be greatly reduced, since all individual models tend to emit the same token in 90% of decoding steps. In these steps, we can skip the fusion process and use the agreed token as the next token. Overall, DeePEn incurs less than 2% additional sentence-level inference latency.
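The agreement shortcut and the resulting overhead estimate can be sketched as follows. The sketch assumes that each model's top-1 token is compared as a string (since the vocabularies differ), that the agreement check itself has negligible cost, and that `fuse` is a placeholder callable standing in for the full relative-space fusion; none of these names come from our released code.

```python
def next_token(top_tokens, fuse):
    """Pick the next token, skipping fusion when all models agree.

    top_tokens: top-1 token string of each model, main model first.
    fuse: zero-argument callable running the full fusion (placeholder).
    Returns (token, fused), where fused tells whether fusion was executed.
    """
    if all(t == top_tokens[0] for t in top_tokens):
        return top_tokens[0], False
    return fuse(), True


def expected_overhead(per_step_overhead, agreement_rate):
    """Average relative overhead when fusion runs only on disagreement steps."""
    return per_step_overhead * (1.0 - agreement_rate)


print(next_token(["Paris", "Paris"], fuse=lambda: "unused"))
print(next_token(["Paris", "London"], fuse=lambda: "Paris"))
print(round(expected_overhead(0.17, 0.90), 4))
```

Under these assumptions, a 17% per-step overhead applied to the 10% of steps with disagreement gives roughly 1.7% on average, in line with the sentence-level figure of less than 2% reported above.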
 | Baseline | | | | |
---|---|---|---|---|---|
Inference Latency | 0.19s | 0.20s | 0.21s | 0.22s | 0.24s |
Relative Change | 0% | +7% | +11% | +17% | +29% |

 | Baseline | TriviaQA | NQ | ARC-C | MMLU |
---|---|---|---|---|---|
TriviaQA | 73.42 | 75.9 | 75.41 | 75.56 | 75.44 |
NQ | 29.11 | 30.55 | 30.65 | 30.42 | 30.69 |
ARC-C | 60.29 | 69.32 | 72.31 | 74.19 | 73.76 |
MMLU | 54.06 | 59.97 | 61.04 | 61.94 | 61.42 |
Appendix D Limitations
As illustrated in Tab. 4.1, collaboration with more LLMs can sometimes lead to a performance drop caused by interference from lower-performing models. This issue limits the ensemble performance of our current method, even though we have explored setting different collaboration weights for each model on each benchmark (DeePEn-Adapt). An ideal solution would be to set adaptive collaboration weights at the sample level, or even the token level, for each LLM, which remains a significant challenge. Despite this, our work represents an important step towards the distribution fusion of LLMs.