Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration

Yichong Huang^†, Xiaocheng Feng^†‡✉, Baohang Li^†, Yang Xiang^‡, Hui Wang^‡
Ting Liu^†, Bing Qin^†‡
^†Harbin Institute of Technology ^‡ Peng Cheng Laboratory
{ychuang,xcfeng,baohangli,tliu,qinb}@ir.hit.edu.cn
{xiangy,wangh06}@ir.hit.edu.cn

Abstract

Large language models (LLMs) exhibit complementary strengths across tasks, motivating research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine candidate answers, which poses a great challenge to generalization on unseen data distributions. Besides, prior methods use textual responses as the communication medium, ignoring the valuable information in internal representations. In this work, we propose DeePEn, a training-free ensemble framework that fuses the informative probability distributions yielded by different LLMs at each decoding step. Unfortunately, the vocabulary discrepancy between heterogeneous LLMs makes directly averaging the distributions infeasible due to token misalignment. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space based on relative representation theory, and performs aggregation there. Next, we devise a search-based inverse transformation to transform the aggregated result back to the probability space of one of the ensembled LLMs (the main model), in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles of an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn has complementary strengths with other ensemble methods such as voting. Our code is available at: https://github.com/OrangeInSouth/DeePEn.

1 Introduction

With the scaling of model capacities and data volumes, generative large language models (LLMs) have shown impressive language understanding and generation abilities, shedding light on the path toward artificial general intelligence [35, 22, 13, 28]. Owing to the diversity of data sources, model architectures, and training recipes, LLMs have different strengths and weaknesses across tasks and cases. Therefore, recent research has explored ensembling LLMs to exploit their complementary potential [15, 19].

Existing methods can be categorized into selection-based and fusion-based ensembling. Selection-based ensembling selects the best candidate answer from all individual LLMs’ answers using an additionally trained reward model [15, 31, 25, 19]. Fusion-based ensembling combines all candidate answers using a trained fusion model [15]. However, these approaches inevitably face significant challenges in generalizing to unseen data distributions and base models. Besides, prior methods enable collaboration via conveying the textual responses between LLMs while ignoring the rich information (e.g., confidence and alternative answers) in the internal representations.

An ideal solution to this issue is to apply the well-established technique of prediction fusion [36, 24, 7, 10]. For LLM ensembling, prediction fusion works at each decoding step, averaging the probability distributions from different LLMs to determine the next token. It not only applies directly to the ensemble of any LLMs without extra parameter training, making it more general, but also leverages the informative internal representations (i.e., probability distributions) as the communication medium. Unfortunately, the vocabulary discrepancy between different LLMs makes it infeasible to average the distributions due to token misalignment.

In this work, we tackle this key challenge by drawing upon the cross-model invariance of relative representation, which represents each token using the embedding similarities of this token to a set of anchor tokens [21]. Specifically, we propose an ensemble framework DeePEn (Deep Parallel Ensemble), enabling distribution fusion for heterogeneous LLMs. DeePEn transforms the probability distribution from the heterogeneous probability space to a homogeneous relative space, using a matrix formed by the relative representation of all tokens. Next, DeePEn aggregates the relative representations of all probability distributions in the relative space, coordinating the decision on the next token. Finally, the result of aggregation is transformed back to the probability space of the main model using a search-based inverse transformation to determine the next token.

We conduct extensive experiments ranging from 2-model to 9-model ensembles, covering ensembles of models with parameters ranging from 6B to 70B, ensembles of dense and sparse models, and the ensemble of LLMs with specialist models. Experimental results on six widely-used benchmarks demonstrate that compared to baselines, DeePEn achieves consistent improvements across all benchmarks. It is also discovered that DeePEn has complementary strengths when combined with other ensemble methods.

2 Theoretical Analysis

We first introduce relative representation and then illustrate the theoretical support for our method.

2.1 Relative Representation

Previous studies have found that, despite the misalignment between the latent spaces of different neural networks, the embedding similarities between samples do not change across models [21, 11, 23]. Specifically, Moschella et al. [21] propose the relative representation, which represents each sample $x^{(i)}$ by its embedding similarities to a set of anchor samples $\mathbb{A}$ ($x^{(i)}$ and $\mathbb{A}$ are identically distributed):

$$\mathbf{r}_{x^{(i)}}=\left(\cos(e_{x^{(i)}},e_{a^{(1)}}),\ldots,\cos(e_{x^{(i)}},e_{a^{(|\mathbb{A}|)}})\right), \tag{1}$$

where $e_{(*)}$ denotes the embedding of a sample, also called its absolute representation.

It is empirically evidenced that relative representations possess cross-model invariance, i.e., the relative representation of the same sample keeps invariant across different models, which lays the theoretical foundation for our work to fuse heterogeneous probability distributions.
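As a minimal illustration of Eq. 1, the sketch below computes relative representations for toy 2-D embeddings and checks that they are unchanged when every embedding is rotated. The rotation is only a stand-in for a second model's latent space: the invariance reported for real networks is empirical and approximate, whereas for an exact rotation it holds by construction. All names and values here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(embedding, anchor_embeddings):
    """Eq. 1: similarities of one sample to every anchor sample."""
    return [cosine(embedding, a) for a in anchor_embeddings]

def rotate(vec, angle):
    """Rotate a 2-D vector (assumed stand-in for another model's space)."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

# Toy embeddings (hypothetical values).
sample = [1.0, 2.0]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# "Model A": the original space.
rel_a = relative_representation(sample, anchors)

# "Model B": same geometry, every embedding rotated by 40 degrees.
angle = math.radians(40)
rel_b = relative_representation(
    rotate(sample, angle), [rotate(a, angle) for a in anchors]
)
```

The absolute embeddings of the two "models" differ, yet `rel_a` and `rel_b` agree up to floating-point error, because cosine similarity depends only on angles between vectors.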

2.2 Theoretical Support for DeePEn

Figure 1: Visualizations for relative representations between models with the same vocabulary and between models with different vocabularies. PCA and K-means clustering are applied only for visualization. The red block indicates the representation of tokens that only appear in Mistral’s vocabulary. Relative representation consistency is obtained by calculating the cosine similarity between the relative representations of the same token in different models.

Averaging probability distributions has been widely shown to improve predictive performance in the fields of image and text processing [2, 10]. For generative language models, as we understand it, the underlying mechanism is to interpolate the different output semantics represented by the probability distributions. However, for LLM ensembling, vocabulary discrepancy isolates these output semantics in semantic spaces with different basis vectors, making the interpolation infeasible. To tackle this challenge, we aim to enable cross-model alignment of output semantics, i.e., to find a transformation that maps the output semantics into a universal space. To this end, we propose to represent the output semantics as the convex combination of the relative representations of all tokens, where the weight of each token is the probability assigned to it.

Definition of output semantics in relative space.

Formally, let $\mathbf{p}$ be the absolute representation of the output semantics and let $R\in\mathbb{R}^{|V|\times|A|}$ be the relative representation matrix, where $V$ is the vocabulary and $A\subseteq V$ is the anchor token set. The $i$-th row of $R$ is the relative representation of word $w^{(i)}$:

$$R[i]=\left(\cos(e_{w^{(i)}},e_{a^{(1)}}),\ldots,\cos(e_{w^{(i)}},e_{a^{(|A|)}})\right), \tag{2}$$

and the relative representation of the output semantics $\mathbf{p}$ is defined as: $\mathbf{r}=\mathbf{p}\cdot R$ .
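The mapping $\mathbf{r}=\mathbf{p}\cdot R$ can be sketched as follows. The helper names and the toy embeddings are assumptions made for illustration only; the point is that $\mathbf{r}$ is a probability-weighted (convex) combination of the rows of $R$, so a one-hot distribution simply recovers that token's relative representation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def relative_matrix(embeddings, anchor_ids):
    """Eq. 2: row i holds the similarities of token i to every anchor token."""
    return [[cosine(e, embeddings[a]) for a in anchor_ids] for e in embeddings]

def to_relative(p, R):
    """r = p . R, the probability-weighted sum of the rows of R."""
    return [sum(p[i] * R[i][j] for i in range(len(p))) for j in range(len(R[0]))]

# Toy vocabulary of four tokens with 2-D embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
anchor_ids = [0, 1]  # anchor set A, a subset of V

R = relative_matrix(embeddings, anchor_ids)   # shape |V| x |A|
p = [0.7, 0.1, 0.1, 0.1]                      # next-token distribution
r = to_relative(p, R)                         # output semantics in relative space
one_hot = to_relative([1.0, 0.0, 0.0, 0.0], R)
```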

Model-invariance of relative representation of output semantic.

Next, we illustrate why this representation scheme can align the output semantics isolated in heterogeneous absolute spaces. First, consider two LLMs $\theta_{A}$ and $\theta_{B}$ with the same vocabulary (e.g., LLaMA2-7B and LLaMA2-13B). When expressing the same output semantics, these models output the same probability distribution (i.e., absolute representation), so $\mathbf{p}_{A}$ and $\mathbf{p}_{B}$ are equal. Besides, they have the same (in practice, highly similar) relative representation matrix due to the vocabulary consistency and the cross-model invariance of relative representations. Therefore, the relative representations of the output semantics are also identical:

$$\mathbf{r}_{A}=\mathbf{p}_{A}\cdot R_{A}=\mathbf{p}_{B}\cdot R_{B}=\mathbf{r}_{B}. \tag{3}$$

Next, consider a language model $\theta_{C}$ with a different vocabulary (e.g., Mistral). Since different LLMs typically share a large number of tokens in their vocabularies (§A), the vocabulary of model $\theta_{C}$ can be viewed as the result of adding some tokens to and removing some tokens from the vocabulary of $\theta_{B}$, which leads to $\mathbf{p}_{B}\ncong\mathbf{p}_{C}$ and $R_{B}\ncong R_{C}$. However, we find that this change to the vocabulary does not significantly affect the relative representations of the unchanged tokens (i.e., the common tokens between $\theta_{B}$ and $\theta_{C}$), as shown in Fig. 1. Therefore, we make the reasonable assumption that a local change in the vocabulary hardly influences the relative space.

3 Methodology

In this section, we first introduce the overall process of our ensemble framework DeePEn and then describe the three parts of DeePEn in detail.

3.1 Overview

We illustrate the process of DeePEn in Fig. 2. Given $N$ models to ensemble, DeePEn first constructs their transformation matrices (i.e., relative representation matrices) mapping the probability distributions from the heterogeneous absolute spaces into the relative space (§3.2). At each decoding step, all models perform prediction and output $N$ probability distributions. These distributions are mapped into the relative space and aggregated (§3.3). Finally, the aggregation result is transformed back into the absolute space of the main model, in order to determine the next token (§3.4).

3.2 Construction of Relative Transformation

Given $N$ models to ensemble, DeePEn first finds out the intersection of vocabularies of all models, i.e., common token set $C$ , and samples a subset or uses the full set of common tokens as the anchor token set $A\subseteq C$ . Next, for each model, DeePEn calculates embedding similarities of each token to the anchor words, obtaining the relative representation matrix $R$ (as shown in Eq.2). Finally, to overcome the relative representation degeneration of outlier words, which will be introduced later, we perform normalization on the relative representation of all tokens by a softmax operation so that it becomes a probability distribution. We denote the normalized representation matrix $\hat{R}$ :

$$\hat{R}[i]=\mathrm{softmax}(R[i]). \tag{4}$$
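The construction described above (vocabulary intersection, anchor selection, Eq. 2, and Eq. 4) can be sketched as below. The data layout is an assumption of this sketch: each vocabulary is a `dict` from token string to id, and each embedding table is a list indexed by id. The function names and toy values are hypothetical and are not taken from the released code.

```python
import math

def common_tokens(vocabs):
    """Common token set C: intersection of all vocabularies, sorted for a fixed order."""
    shared = set(vocabs[0])
    for vocab in vocabs[1:]:
        shared &= set(vocab)
    return sorted(shared)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_relative_matrix(vocab, embeddings, anchors):
    """Rows follow this model's token ids; columns follow the shared anchor order."""
    anchor_vecs = [embeddings[vocab[tok]] for tok in anchors]
    return [softmax([cosine(e, a) for a in anchor_vecs]) for e in embeddings]

# Two toy models with different vocabularies and embedding sizes (hypothetical).
vocab_a = {"the": 0, "cat": 1, "sat": 2, "_x": 3}
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
vocab_b = {"cat": 0, "the": 1, "dog": 2}
emb_b = [[0.1, 1.0, 0.0], [1.0, 0.0, 0.2], [0.5, 0.5, 0.5]]

anchors = common_tokens([vocab_a, vocab_b])   # here A = C (full common set)
R_hat_a = normalized_relative_matrix(vocab_a, emb_a, anchors)
R_hat_b = normalized_relative_matrix(vocab_b, emb_b, anchors)
```

Although the two matrices have different numbers of rows (one per token of each model), they share the same columns, which is what makes distributions from the two models comparable after the transformation.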

Anchor Selection.

The choice of anchor tokens is crucial for the capability of the relative representation. Previous research finds that this capability improves as the number of anchor words increases [21]. Therefore, we employ the full set of common words between LLMs as the anchor words. We also show empirically that this choice performs more stably on downstream tasks (§5.2).

Normalization of relative representation matrix.

In DeePEn, the relative representation of each token is normalized by the softmax operation to avoid the relative representation degeneration of outlier words, i.e., words that are far away from other words (including the anchors) and therefore become indistinguishable in the relative space because their relative representations are zero vectors. The softmax operation effectively resolves this problem by making each relative representation a probability distribution instead of a zero vector.

3.3 Aggregation in Relative Space

At each decoding step, once each model $\theta_{i}$ outputs the probability distribution $\mathbf{p}_{i}$, DeePEn transforms $\mathbf{p}_{i}$ into the relative representation $\mathbf{r}_{i}$ using the normalized relative representation matrix, $\mathbf{r}_{i}=\mathbf{p}_{i}\cdot\hat{R}_{i}$, and aggregates all relative representations to obtain the aggregated relative representation:

$$\overline{\mathbf{r}}=\sum_{i=1}^{N}\alpha_{i}\times\mathbf{r}_{i}, \tag{5}$$

where $\alpha_{i}$ is the collaboration weight of model $\theta_{i}$ (§3.5).
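A minimal sketch of this aggregation step for two models with different vocabulary sizes is given below. The normalized matrices and distributions are hand-written toy values (hypothetical), chosen so that each row of $\hat{R}_{i}$ sums to 1 as it would after Eq. 4.

```python
def to_relative(p, R_hat):
    """r_i = p_i . R_hat_i: map a distribution into the shared relative space."""
    return [sum(p[k] * R_hat[k][j] for k in range(len(p))) for j in range(len(R_hat[0]))]

def aggregate(relative_reps, weights):
    """Eq. 5: weighted sum of the models' relative representations."""
    n_anchor = len(relative_reps[0])
    return [sum(w * r[j] for w, r in zip(weights, relative_reps)) for j in range(n_anchor)]

# Model A has 3 tokens, model B has 4; both use the same 2 anchors.
R_hat_a = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
R_hat_b = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]
p_a = [0.6, 0.3, 0.1]
p_b = [0.1, 0.7, 0.1, 0.1]

r_a = to_relative(p_a, R_hat_a)
r_b = to_relative(p_b, R_hat_b)
r_bar = aggregate([r_a, r_b], weights=[0.5, 0.5])  # uniform weights
```

Because every row of each normalized matrix is a probability distribution and the weights sum to 1, `r_bar` is itself a probability distribution over the anchors.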

3.4 Inverse Transformation of Relative Representations

To decide the next token according to the aggregated relative representation, DeePEn transforms it from the relative space back to the absolute space of the main model, which is chosen empirically as the best-performing model on the development set. To enable this inverse transformation, we adopt a search-based strategy that finds the absolute representation whose relative representation matches the aggregated relative representation. This search problem is formulated as:

$$\overline{\mathbf{p}}_{i}=\mathop{\arg\min}\limits_{\mathbf{p}_{i}\in\mathbb{P}_{i}}\ell(\mathbf{p}_{i}\times\hat{R}_{i},\ \overline{\mathbf{r}}), \tag{6}$$

where $\mathbb{P}_{i}$ denotes the absolute space of model $\theta_{i}$ , and $\ell(\cdot)$ is the loss function to measure the distance between relative representations. In this work, we adopt the KL-divergence due to its convergence.

This search is iteratively conducted under the guidance of the gradient of the loss in Eq.6 with respect to the absolute representation $\textbf{p}_{i}$ . Specifically, we initialize the start point of searching $\textbf{p}^{(0)}_{i}$ with the main model’s original absolute representation, and update it as:

$$\mathbf{p}^{(t+1)}_{i}=\mathbf{p}^{(t)}_{i}-\eta\times\frac{\partial\ell}{\partial\mathbf{p}^{(t)}_{i}},\quad t\in[0,T] \tag{7}$$

where $\eta$ is an important hyperparameter called the relative ensemble learning rate, and $T$ is the number of iterations, called the relative ensemble learning steps. Finally, we use the updated absolute representation $\textbf{p}^{(T)}_{i}$ to determine the emitted token.
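The search in Eqs. 6 and 7 can be sketched as a few steps of gradient descent. This sketch makes two assumptions that the text does not specify: the loss is taken as $\mathrm{KL}(\overline{\mathbf{r}}\,\|\,\mathbf{p}\hat{R})$, and the update is applied to logits $z$ with $\mathbf{p}=\mathrm{softmax}(z)$ rather than to $\mathbf{p}$ directly, so that every iterate stays a valid probability distribution. It should therefore be read as one possible realization, not as the authors' implementation; all names and values are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def to_relative(p, R_hat):
    return [sum(p[k] * R_hat[k][j] for k in range(len(p))) for j in range(len(R_hat[0]))]

def kl(target, q):
    """KL(target || q) for strictly positive q (assumed loss direction)."""
    return sum(t * math.log(t / qj) for t, qj in zip(target, q) if t > 0)

def inverse_transform(p0, R_hat, r_bar, lr=0.1, steps=5):
    """Search for a distribution whose relative representation approaches r_bar."""
    # Start from the main model's own distribution: softmax(log p0) == p0.
    z = [math.log(max(x, 1e-12)) for x in p0]
    for _ in range(steps):
        p = softmax(z)
        q = to_relative(p, R_hat)
        # d loss / d p_k = -sum_j r_bar[j] * R_hat[k][j] / q[j]
        g = [-sum(r_bar[j] * row[j] / q[j] for j in range(len(q))) for row in R_hat]
        # Chain rule through softmax: d loss / d z_k = p_k * (g_k - sum_m p_m g_m)
        g_mean = sum(pk * gk for pk, gk in zip(p, g))
        z = [zk - lr * pk * (gk - g_mean) for zk, pk, gk in zip(z, p, g)]
    return softmax(z)

# Main model: 3 tokens, 2 anchors (hypothetical values).
R_hat = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
p0 = [0.6, 0.3, 0.1]     # main model's original distribution
r_bar = [0.47, 0.53]     # aggregated relative representation

loss_before = kl(r_bar, to_relative(p0, R_hat))
p_new = inverse_transform(p0, R_hat, r_bar, lr=0.1, steps=5)
loss_after = kl(r_bar, to_relative(p_new, R_hat))
next_token = max(range(len(p_new)), key=p_new.__getitem__)  # greedy choice
```

With a small learning rate and few steps, the result stays close to the main model's original distribution while moving toward the ensemble's aggregated semantics, which mirrors the role of $\eta$ and $T$ described above.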

3.5 Collaboration Schemes

DeePEn aggregates the output distributions of individual models by performing weighted averaging on their relative representations (Eq. 5). As our work focuses on enabling the distribution fusion of heterogeneous LLMs rather than on finding the optimal collaboration weights, we follow the most common practice and aggregate the distributions uniformly ($\alpha_{i}=1/N$, where $N$ is the number of models); we name this variant DeePEn-Avg. Besides, we also adopt a simple and effective method of deriving weights, DeePEn-Adapt, which heuristically assigns a larger weight to a model with better performance on the development set: $\alpha_{i}=s_{i}/\sum_{j}s_{j}$, where $s_{i}=Acc(\theta_{i},\mathcal{D}^{dev})-\epsilon$, $Acc(\cdot,\cdot)$ indicates the average accuracy of model $\theta_{i}$ on the development set, and $\epsilon$ indicates the chance level on the evaluation task. Specifically, $\epsilon=0$ on the free-form generation tasks and $\epsilon=1/K$ on the $K$-choice tasks.
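The two weighting schemes can be written down in a few lines. The function names and the accuracy values are hypothetical, and rejecting models at or below chance level is an assumption of this sketch (the text does not say how such a case is handled).

```python
def uniform_weights(n_models):
    """DeePEn-Avg: alpha_i = 1 / N."""
    return [1.0 / n_models] * n_models

def adaptive_weights(dev_accuracies, chance_level=0.0):
    """DeePEn-Adapt: alpha_i = s_i / sum_j s_j with s_i = Acc_i - epsilon."""
    scores = [acc - chance_level for acc in dev_accuracies]
    if any(s <= 0 for s in scores):
        # Assumption: a model at or below chance has no meaningful weight.
        raise ValueError("every model must score above the chance level")
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical development-set accuracies on a 4-choice task (epsilon = 1/4).
alphas = adaptive_weights([0.60, 0.50], chance_level=1 / 4)
# Free-form generation task: epsilon = 0.
alphas_free = adaptive_weights([0.60, 0.50])
```

Subtracting the chance level widens the gap between models on multiple-choice tasks: in the toy case above the stronger model receives 0.35/0.60 of the weight instead of 0.60/1.10.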

4 Experiments

4.1 Experimental Setup

Benchmarks.

We mainly conduct experiments on six benchmarks, which can be categorized into:

• Comprehensive Examination: (1) MMLU (5-shot) [12], which covers 57 subjects that humans learn, and (2) ARC-C (0-shot) [5], collected from standardized natural science tests.

• Reasoning Capabilities: (1) GSM8K (4-shot) [6], a dataset of high-quality grade-school math problems, and (2) PIQA (0-shot) [3], a commonsense reasoning dataset.

• Knowledge Capacities: (1) TriviaQA (5-shot) [16], whose questions are authored by trivia enthusiasts, and (2) NQ (5-shot) [18], a QA corpus consisting of queries issued to the Google search engine.

Evaluation.

For all benchmarks, we follow the test scripts of the OpenCompass leaderboard. Specifically, on the multiple-choice tasks (MMLU, ARC-C, and PIQA), the option with the highest likelihood is selected to calculate the accuracy. On the free-form generation tasks (GSM8K, TriviaQA, and NQ), we calculate the exact match (EM) accuracy.

Individual models.

As ensemble learning typically works on models with comparable performance [24, 34], we select six well-performing LLMs whose performance is closely matched: LLaMA-2-13B [29], Mistral-7B-v0.1 [13], InternLM-20B [26], Yi-6B [1], Skywork-13B-base [32], and Tigerbot-13b-base-v2 [4]. To achieve better ensemble performance, we conduct experiments on the ensemble of the top-2 models and the top-4 models for each benchmark. Besides, we also consider ensembling various numbers of models (§4.3) and ensembling more diverse models (§5.1).

Hyperparameters.

In this work, we select all of the common tokens between LLMs as the anchor tokens to build the relative spaces, i.e., $A=C$ (§5.2). For the inverse transformation of relative representations, we search for the optimal relative ensemble learning rate ($\eta$ in Eq. 7) from 0.05 to 0.30 with an interval of 0.05. We empirically set the number of relative ensemble learning steps to $T=5$ (§5.3).

Comparative methods.

We compare DeePEn with (1) MinED [30, 9], which maps the probability distributions of heterogeneous LLMs to the distribution of the main model by aligning tokens in different vocabularies with edit distance, and (2) LLM-Blender [15], which comprises a reward model, PairRanker, to score each LLM response and a fusion model, GenFuser, to fuse candidate responses. In this work, we only adopt PairRanker, since GenFuser suffers from serious over-generation under our training-free setting. In the ensemble of more than two models, we introduce two additional ensemble methods: (3) Voting, which selects the choice favored by most models on tasks whose outputs are limited to a fixed set, and (4) MBR [8, 17], which selects the answer with the highest textual similarity to the other candidate answers. The implementation details of the baselines are given in §B.
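For intuition, the two response-level baselines, Voting and MBR, can be sketched as follows. The similarity function is an assumption: the text only says "textual similarity", so a token-level Jaccard overlap is swapped in here as a placeholder, and the function names and example answers are hypothetical.

```python
from collections import Counter

def vote(answers):
    """Voting: return the option chosen by the most models."""
    return Counter(answers).most_common(1)[0][0]

def jaccard(a, b):
    """Token-overlap similarity (placeholder for the unspecified text similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def mbr_select(candidates, similarity=jaccard):
    """MBR: pick the candidate with the highest total similarity to the others."""
    def score(i):
        return sum(
            similarity(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )
    best = max(range(len(candidates)), key=score)
    return candidates[best]

# Four models answer a multiple-choice question and a free-form question.
choice = vote(["B", "A", "B", "C"])
answer = mbr_select(["the capital is Paris", "Paris", "it is Paris", "London"])
```

Both baselines operate on final textual outputs, in contrast to DeePEn, which fuses the probability distributions at every decoding step.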

Category	Examination	Examination	Reasoning	Reasoning	Knowledge	Knowledge
Models	MMLU	ARC-C	GSM8K	PIQA	TriviaQA	NQ
Individual Models
LLaMA2-13B	55.07	59.32	29.80	59.68	74.32	28.67
InternLM-20B	59.94	75.81	53.83	64.78	66.88	26.09
Skywork-13B	61.16	66.50	53.90	74.04	58.65	19.75
Tigerbot-13B	51.95	57.44	48.82	68.28	66.22	22.71
Mistral-7B	62.13	73.33	47.50	65.61	73.18	27.62
Yi-6B	63.25	73.33	37.91	76.15	59.02	18.98
Top-2 Ensemble
LLM-Blender	63.85 (+0.60)	75.73 (-0.08)	54.89 (+0.99)	78.31 (+2.16)	74.10 (-0.22)	28.61 (-0.06)
MinED	65.04 (+1.79)	77.35 (+1.54)	18.50 (-35.40)	78.98 (+2.83)	72.30 (-2.02)	28.45 (-0.22)
DeePEn-Avg	64.68 (+1.43)	77.52 (+1.71)	55.42 (+1.52)	78.87 (+2.72)	75.90 (+1.58)	30.17 (+1.50)
Top-4 Ensemble
LLM-Blender	61.44 (-1.81)	71.03 (-4.78)	43.37 (-10.53)	71.16 (-4.99)	67.87 (-6.45)	24.18 (-4.49)
Voting	64.88 (+1.63)	78.41 (+2.60)	63.15 (+9.25)	76.82 (+0.67)	-	-
MBR	-	-	62.09 (+8.26)	-	74.32 (+0.00)	30.28 (+1.61)
MinED	65.61 (+2.36)	78.68 (+2.87)	56.56 (+2.66)	77.87 (+1.72)	71.62 (-2.70)	29.50 (+0.83)
DeePEn-Avg	65.09 (+1.84)	78.70 (+2.89)	56.18 (+2.28)	77.15 (+1.00)	75.74 (+1.42)	31.55 (+2.88)
DeePEn-Adapt	65.25 (+2.00)	79.15 (+3.34)	56.25 (+2.35)	78.59 (+2.44)	75.76 (+1.44)	31.77 (+3.10)
+Voting/MBR	65.40 (+2.15)	79.44 (+3.63)	65.25 (+11.35)	77.37 (+1.22)	75.65 (+1.33)	32.11 (+3.44)

Model	GSM8K	PIQA
LLaMA2-70B (Dense)	63.84	71.27
Mixtral-8$\times$7B (Sparse)	65.73	71.88
DeePEn	67.33	75.10
$\Delta$	+1.60	+3.22

Model	En$\rightarrow$De	De$\rightarrow$En	En$\rightarrow$Ro	Ro$\rightarrow$En
LLaMA2-13B	30.60	42.27	30.83	39.99
NLLB-600M	32.30	41.49	31.91	42.39
DeePEn	33.34	43.70	32.95	42.84
$\Delta$	+1.04	+1.43	+1.04	+0.45

Methods	MMLU-Dev ACC	MMLU-Dev $\Delta$	TriviaQA-Dev ACC	TriviaQA-Dev $\Delta$
Baseline	61.19	-	72.74	-
DeePEn	63.61	+2.42	74.79	+2.05
w/o Rel-Norm	60.73	-0.46	72.95	+0.21

RELR ( $\eta$ )	0.05	0.10	0.15	0.20	0.25	0.30
MMLU	+2.42	+1.57	+1.77	+1.96	+1.31	+1.31
TriviaQA	+1.31	+2.05	+1.63	+1.94	+1.82	+1.26

Models	MMLU-Dev Indiv	MMLU-Dev DeePEn	ARC-C-Dev Indiv	ARC-C-Dev DeePEn
Yi-6B	61.19	63.61 (+2.42)	72.72	77.55 (+4.83)
Mistral-7B	60.80	64.46 (+3.66)	73.88	77.73 (+3.85)

Models	MMLU-Dev Indiv	MMLU-Dev Vanil	MMLU-Dev DeePEn	MMLU-Test Indiv	MMLU-Test Vanil	MMLU-Test DeePEn
LLaMA1-13B	43.26	45.48	44.37	43.70	45.01	44.22
LLaMA2-7B	42.28	45.48	45.94	42.99	45.01	45.31

	Baseline	$T=1$	$T=3$	$T=5$	$T=10$
Inference Latency	0.19s	0.20s	0.21s	0.22s	0.24s
Relative Change	0%	+7%	+11%	+17%	+29%

	Baseline	TriviaQA	NQ	ARC-C	MMLU
TriviaQA	73.42	75.9	75.41	75.56	75.44
NQ	29.11	30.55	30.65	30.42	30.69
ARC-C	60.29	69.32	72.31	74.19	73.76
MMLU	54.06	59.97	61.04	61.94	61.42

4.2 Main Results

(1) DeePEn achieves consistent improvements over the individual models.

(2) DeePEn shows better stability than baselines.

(3) DeePEn has complementary strengths with other ensemble methods.

(4) Collaboration with more worse-performing LLMs is a double-edged sword.

4.3 Results on Different Numbers of Models

5 Analysis

5.1 Results of Ensembling Diverse Models

Ensemble of the dense model and the sparse model.

Ensemble of the generalist model and the specialist model.

Ensemble of models with different performance gaps.

5.2 Analysis on Relative Transformation

Effect of anchor selection.

Effect of normalization on relative representation matrix.

5.3 Analysis of Reverse Transformation

Analysis of relative ensemble learning rates.

Effect of iteration steps in relative ensemble learning.

6 Related Work

Selection-based ensemble.

Fusion-based ensemble.

7 Conclusion

References

Appendix A Statistics of Common Tokens across different LLMs

Appendix B Details of Baselines

LLM-Blender.

Voting.

MBR.

MinED.

EVA.

Appendix C Additional Experiments

C.1 Choice of main model.

C.2 Comparison to Vanilla Prediction Average

C.3 Latency Analysis

Appendix D Limitations