Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Nikolas Gritsch (Cohere for AI), Qizhen Zhang† (University of Oxford), Acyr Locatelli (Cohere), Sara Hooker (Cohere for AI), Ahmet Üstün (Cohere for AI)

(August 28, 2024)

Abstract

Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.

1 Introduction

† Work done at Cohere. Corresponding authors: Nikolas Gritsch, Sara Hooker, Ahmet Üstün.

In an era of bigger and bigger models [Canziani et al., 2016; Strubell et al., 2019; Rae et al., 2021; Raffel et al., 2020; Bommasani et al., 2022; Hooker, 2024], there are several key objectives driving state-of-the-art progress. Doing more with less by improving efficiency [Treviso et al., 2023] remains paramount, but in addition to efficiency, the deployment of these models in the wild means that the ability to adapt to new data [Pozzobon et al., 2023b; Gururangan et al., 2020a; Jang et al., 2022; Jin et al., 2022] and specialization of compute [Zadouri et al., 2024; Shazeer et al., 2018; Riquelme et al., 2021; Du et al., 2022; Fedus et al., 2022] have gained renewed focus. While all these properties are desirable, a formidable challenge is designing architectures that can fulfill all of these requirements.

The Mixture-of-Experts (MoE) approach gained prominence because of its efficiency properties. In contrast to dense models, which require significant compute to deploy, MoE approaches only activate a subset of the parameters for every single token. Intuitively, not all parameters are necessary for each request, as some parameters will specialize in certain tasks, and those unrelated to the current request can be ignored. However, while MoEs greatly improved efficiency, the ability to induce meaningful specialization has been more limited, with observations that experts don’t appear to exhibit dedicated expertise [Jiang et al., 2024; Zoph et al., 2022; Zadouri et al., 2023]. Furthermore, MoEs tend to suffer from severe training instabilities [Zoph et al., 2022].

Recent work has attempted to address both the training instabilities and the lack of specialization. These techniques often train completely separate experts and ‘‘upcycle’’ (combine) them into a single unified MoE model after dense training [Sukhbaatar et al., 2024]. This reduces the memory and communication cost, and improves efficiency during training as computations are more local and cross-device communication is reduced [Li et al., 2022; Gururangan et al., 2023]. Notably, the other major advantage of these approaches is the increase in specialization with separate experts that are trained on specific domains, making them clearly responsible for their human-interpretable subset of the data. On the other hand, MoEs with a standard router, which needs to be trained on a mix of all training data, are not designed to maintain domain specialization [Jiang et al., 2024].

Figure 1: Depiction of Nexus for a single Transformer block: A) In the initial training phase, each expert is trained separately. Furthermore, its training data is embedded by an embedding model and stored. The experts are combined by initializing each block’s MoE layer with the seed model and each of the experts’ FFN layers, and finetuning the model on a mix of all domains. During a forward pass, the seed model FFN is used as the shared expert and always activated. For the other experts, we perform top-1 routing based on the similarity of the transformed expert embeddings with the input data. B) Later, we can add a new expert by appending its training data embedding to the existing domain embeddings. The router function is independent of the number of experts, and therefore adapts fast to the new one.

However, efficiently integrating new experts into upcycled MoE models, a setting that is of great interest for adaptability objectives, is far less studied. For most practitioners, given the scale of modern LLMs [Brown et al., 2020; Touvron et al., 2023; Kaplan et al., 2020; Anil et al., 2023], training MoEs repeatedly incurs an infeasible computational cost. Furthermore, most model development fails to take into account distribution drift in use cases, with limited flexibility and applicability across different tasks and domains [Pozzobon et al., 2023a; Gururangan et al., 2020b]. Yet human language is shaped by a cumulative culture, constantly building upon itself and evolving over time [Silvey, 2016]. Also, specialized use cases such as multilingual, code, and math often require tailored additional training.

In this work, we attempt to reconcile all three desirable properties: efficiency, specialization, and adaptability. We ask ‘‘how can we adaptively combine separately trained specialized experts?’’ To address this, we introduce Nexus, a novel MoE architecture that parameterizes the router based on domain-specific data by learning to project the embedding of each data domain to an expert embedding. This learnable projection for the router allows for the easy extension of the MoE model with new experts that are trained independently on new datasets of interest. This also avoids the difficulties of MoE training, as our learned router scales with the number of experts without needing to be trained from scratch, which enables adding or removing experts as desired.

Our experiments show that Nexus outperforms previous work when upcycling an MoE from separately trained specialized domain experts. Going beyond the single upcycling phase, Nexus can be efficiently extended with a new expert trained on a new domain by finetuning it with far fewer tokens than were used for the finetuning after the initial upcycling.

Property	MoE (Vanilla)	BTM (Merge)	BTX (Linear router)	Nexus (Ours)
Dense experts are trained independently (upcycling)	✗	✔	✔	✔
Experts are specialized in different domains	✗	✔	✔	✔
Experts are chosen by a learned router per input token	✔	✗	✔	✔
Router is adaptive via learned projection for new domains	✗	✗	✗	✔

Table 1: A comparison of existing approaches with Nexus: Unlike the vanilla MoE architecture [Shazeer et al., 2017; Fedus et al., 2022], the Branch-Train-Merge [BTM; Li et al., 2022] and the Branch-Train-Mix [BTX; Sukhbaatar et al., 2024] approaches train experts separately in different domains, reducing the training cost and improving specialization. However, they either merge the experts during inference or learn an MoE router layer from scratch, where prior domain information is not used. Our approach trains the MoE router based on domain information, maintaining the specialization and enabling efficient extension of the MoE with a new expert after training.

In summary, our contributions are as follows:

(i) We present Nexus, a novel MoE framework designed to enhance sparse upcycling of specialized trained dense experts, while reducing the training cost of MoEs by facilitating easy adaptation to unseen data distributions. In Nexus, the traditional linear router from vanilla MoE models is replaced with routing based on the similarity of layer inputs to an expert embedding vector, derived from the average embedding of the corresponding expert training dataset.

(ii) Our method outperforms the existing approach for upcycling specialized models into an MoE, leading to 2.1% and 1.6% relative increases over the upcycled MoE (linear router) at the 470M and 2.8B scales respectively. This enables a performance increase on general tasks, with 5.8% and 7.4% relative gains over the dense seed model at 470M and 2.8B respectively.

(iii) Our method enables efficient adaptation to new domains by extending the upcycled MoE with new experts trained on unseen datasets. In this setting, Nexus outperforms the baseline MoE (linear router) when finetuning on a limited amount of data, leading to an 18.8% relative gain on the new domain with 1B finetuning tokens upon MoE extension.

(iv) Finally, we show that our method is robust across different load balancing and data mixtures, and consistently outperforms the MoE with a linear router for specialized upcycling, confirming the benefits of the adaptive routing based on domain projections used in Nexus.

2 Background

Sparse Mixture of Experts architectures [Shazeer et al., 2017; Fedus et al., 2022] replace the feed-forward network (FFN) with an MoE layer in the Transformer block [Vaswani et al., 2017]. An MoE layer consists of a router network $R$ and a set of $n$ experts, $E_{1},...,E_{n}$, where each expert $E_{i}$ corresponds to an independent dense feed-forward network. The router network $R$ is commonly parameterized by trainable weights $W_{r}\in\mathbb{R}^{h\times n}$, where $h$ is the model hidden dimension, followed by a softmax function. It takes an intermediate token representation $x$ as input, and the MoE layer combines the output of each expert based on the gating scores $s_{1},...,s_{n}$. Sparse MoEs only use the top-k experts $E_{k}$ based on the experts' gating scores $s_{i}$.

	$\displaystyle s_{i}=R(x)=\text{softmax}(W^{T}_{r}x)$		(Router)
	$\displaystyle s_{k}=\text{TopK}(s_{i})$		(Top-K Routing)
	$\displaystyle y=\sum_{i=1}^{k}s_{k}\cdot E_{k}(x)$		(MoE)

Recent work has also shown that using a shared expert $E_{0}$ that is always activated is beneficial to remove parameter redundancy among other experts [Rajbhandari et al., 2022; Dai et al., 2024]:

	$\displaystyle y=E_{0}(x)+\sum_{i=1}^{k}s_{k}\cdot E_{k}(x)$		(MoE + shared expert)
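The router, top-k selection, and shared-expert equations above can be illustrated with a small NumPy sketch. This is a toy example under stated assumptions, not the authors' implementation: the names `softmax` and `moe_layer`, the scalar-scaling "experts", and all shapes are hypothetical and chosen only to make the computation easy to follow.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_r, experts, shared_expert=None, k=1):
    """x: [tokens, h]; W_r: [h, n]; experts: list of n callables mapping [h] -> [h]."""
    scores = softmax(x @ W_r)                    # gating scores s_i, shape [tokens, n]
    topk = np.argsort(-scores, axis=-1)[:, :k]   # indices of the k largest scores per token
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for i in topk[t]:
            y[t] += scores[t, i] * experts[i](x[t])  # weighted output of a selected expert
        if shared_expert is not None:
            y[t] += shared_expert(x[t])              # E_0 is always active
    return y, topk

# Toy usage: three hypothetical experts that simply scale their input.
rng = np.random.default_rng(0)
h, n = 4, 3
x = rng.normal(size=(5, h))
W_r = rng.normal(size=(h, n))
experts = [lambda v, c=c: c * v for c in (1.0, 2.0, 3.0)]
y, topk = moe_layer(x, W_r, experts, shared_expert=lambda v: v, k=1)
```

The per-token Python loop is written for readability; a practical implementation would batch tokens per expert.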

Sparse Upcycling [Komatsuzaki et al., 2023] initializes an MoE model from a dense Transformer model. The dense model’s FFN layers are copied $n$ times to initialize each of the $n$ experts, and the router layer is trained from scratch. BTX [Sukhbaatar et al., 2024] generalizes this approach by initializing each expert from the FFN layer of a different expert model, and all other parameters as the average over all of these models. The expert models are finetuned versions of the original dense model, which allows weight merging without major losses.

Nexus leverages upcycling of specialized expert models, similar to BTX. However, it diverges in terms of MoE training, in particular with its novel MoE router, which makes it possible to efficiently extend the MoE in multiple rounds after the sparse upcycling. We describe our method in the next section.

3 Adaptive Router for Upcycling Specialized Experts as MoE

The core component of an MoE model is the router, as it determines which experts to activate for any given input. In vanilla MoEs, the router is a learned linear layer that takes the token intermediate representations as input and computes the expert probabilities. However, this router does not necessarily learn specialization as MoEs are commonly trained using an auxiliary load balancing loss to improve training stability [Fedus et al., 2022; Jiang et al., 2024]. In Nexus, we propose a novel MoE router where per MoE block we learn a projection layer from given pre-computed domain embeddings to expert embeddings. We parametrize this projection layer $P_{r}$ as a two-layer MLP with a SwiGLU activation function [Shazeer, 2020]:

	$\displaystyle e_{i}=P_{r}(d_{i})=W_{2}\cdot\text{SwiGLU}(W_{1}\cdot d_{i})$		(Domain to Expert Embeddings)

    import torch
    import torch.nn.functional as F

    def router(self, inputs, domain_embeddings):
        # domain_to_expert_ffn learns the projection from domain to expert embeddings
        # domain_embeddings: [n_experts, e_dim]
        # expert_embeddings: [n_experts, h_dim]
        expert_embeddings = self.domain_to_expert_ffn(domain_embeddings)

        # router_probs: [batch, seq, n_experts]
        router_probs = F.softmax(inputs @ expert_embeddings.T, dim=-1)

        # Top-1 gate for routed experts; gate and index: [batch, seq, 1]
        gate, index = torch.topk(router_probs, k=1, dim=-1)

        # routed_expert_ffns: an MoE layer with FFN experts; it is assumed to
        # dispatch each token to the expert selected by index
        # routed_expert_out: [batch, seq, h_dim]
        # shared_expert_out: [batch, seq, h_dim]
        routed_expert_out = self.routed_expert_ffns(inputs, index)
        shared_expert_out = self.shared_expert_ffn(inputs)

        return shared_expert_out + gate * routed_expert_out

Figure 2: Router layer in Nexus: PyTorch-like pseudo-code illustrating a router layer, which consists of a 2-layer MLP network (domain_to_expert_ffn) to project domain embeddings to expert embeddings, shared and routed expert FFNs, and sparse Top-k gating. Note that the expert embeddings are independent of the input and could be precomputed once and stored during inference.

where $d_{i}\in\mathbb{R}^{m}$ and $e_{i}\in\mathbb{R}^{h}$ are the domain and expert embeddings for the $i$-th domain respectively, and $m$ and $h$ are the domain embedding and model dimensions. $W_{1}\in\mathbb{R}^{2h\times m}$ and $W_{2}\in\mathbb{R}^{h\times h}$ are linear layers, and SwiGLU is defined as a map $\mathbb{R}^{2n}\rightarrow\mathbb{R}^{n}$. Given the expert embeddings $e_{i}$ and layer inputs $x\in\mathbb{R}^{s\times h}$, we then compute routing probabilities $s_{i}$ as:

	$\displaystyle s_{i}=\text{softmax}(x\cdot e_{i})$		(Routing Scores)

Unlike the standard router, Nexus’s router includes a stronger inductive bias through pre-computed domain embeddings¹ that enables the expert embeddings to specialize. Thus, $x\cdot e_{i}$ gives a high value for input tokens that are closer to the domain of the corresponding expert. Notably, this router is particularly suited for the sparse upcycling setting where the dense experts are separately trained on different domains.

¹ We used Cohere Embed v3 (https://cohere.com/blog/introducing-embed-v3) as an external embedding model to compute domain embeddings based on existing individual data sources. However, similar to Gururangan et al. [2023], pre-training data can also be clustered and the centroid of each cluster can be used for domain embeddings.
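To make the shapes in the projection and routing-score equations concrete, here is a minimal NumPy sketch. It assumes $W_{1}$ has shape $2h\times m$ and $W_{2}$ has shape $h\times h$, and that SwiGLU gates the second half of its input with the SiLU of the first half; the function names (`swiglu`, `project_domains`, `routing_scores`), the random weights, and the toy dimensions are hypothetical and not taken from the paper.

```python
import numpy as np

def swiglu(z):
    """Map R^{2n} -> R^{n}: multiply the second half by the SiLU of the first half."""
    a, b = np.split(z, 2, axis=-1)
    return (a / (1.0 + np.exp(-a))) * b

def project_domains(D, W1, W2):
    """D: [n_experts, m] domain embeddings -> [n_experts, h] expert embeddings."""
    return swiglu(D @ W1.T) @ W2.T

def routing_scores(x, E):
    """x: [s, h] layer inputs, E: [n_experts, h] -> [s, n_experts] probabilities."""
    logits = x @ E.T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

# Toy dimensions (assumptions for illustration only).
rng = np.random.default_rng(0)
m, h, n_experts, s = 8, 4, 3, 5
W1 = rng.normal(size=(2 * h, m)) * 0.1
W2 = rng.normal(size=(h, h)) * 0.1
D = rng.normal(size=(n_experts, m))   # stands in for pre-computed domain embeddings
x = rng.normal(size=(s, h))

E = project_domains(D, W1, W2)        # independent of x, so it can be cached at inference
probs = routing_scores(x, E)
top1 = probs.argmax(axis=-1)          # routed expert chosen for each token
```

Because `project_domains` acts on each row of `D` independently, the number of experts never appears in the shapes of `W1` or `W2`.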

Connection to hypernetworks. Our router parametrization is closely related to hypernetworks [Ha et al., 2016] as the projection layer $P_{r}$ generates parameters for the router during runtime for a given input. We use domain embeddings as the input to the projection layer, enabling efficient adaptation and also a better cross-domain transfer based on the similarity between domain embeddings as shown in previous work [Mahabadi et al., 2021; Üstün et al., 2022].

Upcycling dense experts as an MoE. After training dense expert models, we merge the individual experts into a unified MoE by appending their FFNs along a new dimension to create an MoE layer per Transformer block. Unlike Sukhbaatar et al. [2024], instead of using the original FFN of the seed model as one of the routed experts in an MoE layer, we use it as the shared expert ( $\text{FFN}_{s}$ ) to better preserve the previous capabilities in the MoE model. For all non-FFN parameters including the attention weights, we merge expert parameters using simple weight averaging:

	$\displaystyle\text{FFN}_{moe}=\text{FFN}_{s}+[\text{FFN}_{e_{1}},\text{FFN}_{e_{2}},...,\text{FFN}_{e_{n}}]$		(MoE Layer FFNs)
	$\displaystyle\phi_{moe}=\frac{\sum_{i=1}^{n}\phi_{i}}{n}$		(Merge Non-FFN params.)
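The merging step can be sketched as follows, assuming each model's parameters for one Transformer block are available as a dictionary of NumPy arrays. The function name `upcycle_block`, the parameter names `ffn.w` and `attn.w`, and the `.shared` / `.routed` suffixes are hypothetical. Following the averaging equation as written, non-FFN parameters are averaged over the $n$ experts only.

```python
import numpy as np

def upcycle_block(seed, experts, ffn_keys):
    """Build MoE-block parameters from a seed model and n dense experts.

    seed and each entry of experts are dicts mapping parameter name -> array.
    """
    moe = {}
    for name, seed_value in seed.items():
        expert_values = [e[name] for e in experts]
        if name in ffn_keys:
            moe[name + ".shared"] = seed_value.copy()                # seed FFN becomes the shared expert
            moe[name + ".routed"] = np.stack(expert_values, axis=0)  # experts stacked on a new leading axis
        else:
            moe[name] = np.mean(expert_values, axis=0)               # simple weight averaging
    return moe

# Toy usage with two hypothetical experts.
seed = {"ffn.w": np.zeros((2, 2)), "attn.w": np.zeros((2, 2))}
experts = [
    {"ffn.w": np.full((2, 2), 1.0), "attn.w": np.full((2, 2), 1.0)},
    {"ffn.w": np.full((2, 2), 3.0), "attn.w": np.full((2, 2), 3.0)},
]
moe = upcycle_block(seed, experts, ffn_keys={"ffn.w"})
```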

Efficient adaptation to new domains. An important advantage of our method is that when a new data domain is present after MoE training, we use the learned projection $P_{r}$ to compute the expert embedding of the new domain as $e_{new}=P_{r}(d_{new})$. This makes it possible to enhance the trained MoE model with additional dense experts, which are trained in the same way as the initial experts. The FFN parameters of the new expert are simply appended to the array of existing experts.

To adequately preserve the non-FFN parameters of existing experts, we perform a weighted average $\phi_{f}=(1-\lambda)\cdot{\phi}_{moe}+\lambda\cdot{\phi}_{new}$, where ${\phi}_{f}$, ${\phi}_{new}$, and ${\phi}_{moe}$ are the parameters of the final MoE, the new dense expert, and the initial MoE model, and $\lambda=1/(n+1)$. This enables efficient adaptation of Nexus to a new domain by extending it with a new dense expert trained independently. After extending the MoE with a new expert, we perform a lightweight finetuning with a limited number of tokens for quick adaptation.
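A minimal sketch of the extension step, under the assumption that the MoE block is stored as a dictionary with a stacked array of routed FFNs, a matrix of domain embeddings, and a dictionary of non-FFN parameters. The function name `extend_moe` and all keys are hypothetical. The new expert embedding itself is not stored here, since the router recomputes it from the appended domain embedding through the learned projection.

```python
import numpy as np

def extend_moe(moe, new_expert):
    """Append one dense expert to an MoE block and re-merge non-FFN parameters."""
    n = moe["routed_ffn"].shape[0]   # number of existing routed experts
    lam = 1.0 / (n + 1)              # lambda = 1 / (n + 1)
    return {
        # The new FFN is appended along the expert axis.
        "routed_ffn": np.concatenate([moe["routed_ffn"], new_expert["ffn"][None]], axis=0),
        # The new domain embedding is appended; the router projects it to e_new.
        "domain_emb": np.vstack([moe["domain_emb"], new_expert["domain_emb"]]),
        # phi_f = (1 - lambda) * phi_moe + lambda * phi_new
        "non_ffn": {k: (1.0 - lam) * v + lam * new_expert["non_ffn"][k]
                    for k, v in moe["non_ffn"].items()},
    }

# Toy usage: four existing experts whose non-FFN average is 2.5, plus a new one valued 5.
n, m = 4, 8
moe = {
    "routed_ffn": np.zeros((n, 2, 2)),
    "domain_emb": np.zeros((n, m)),
    "non_ffn": {"attn.w": np.full((2, 2), 2.5)},
}
new_expert = {"ffn": np.ones((2, 2)), "domain_emb": np.ones(m),
              "non_ffn": {"attn.w": np.full((2, 2), 5.0)}}
extended = extend_moe(moe, new_expert)
```

With $\lambda=1/(n+1)$, if ${\phi}_{moe}$ were still the plain average of the $n$ original experts, the result would equal the plain average over all $n+1$ experts. After MoE finetuning this holds only approximately, since ${\phi}_{moe}$ has been updated.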

4 Experiments

4.1 Experimental setting

Our experimental setup includes 3 phases. Figure 1 shows the architecture of Nexus and the corresponding experimental setting:

1. Training specialized expert LMs. For training the dense specialized experts, we use the sub-datasets from the SlimPajama dataset [Soboleva et al., 2023], a 627B token English-language corpus assembled from web data of various sources. We initialize four dense experts from the weights of the seed model and train them on the ArXiv, Books, C4, and Wikipedia domains.² As seed models, we use 470M and 2.8B parameter decoder-only autoregressive Transformer models [Radford et al., 2019] that are trained with a standard language modeling objective for 750B tokens. We train dense experts for 25 and 40 billion tokens for the 470M and 2.8B seed models respectively. We use parallel attention layers [Anil et al., 2023; Wang, 2021], SwiGLU activation [Shazeer, 2020], no biases in dense layers, and a byte-pair-encoding (BPE) tokenizer with a vocabulary size of 256,000. During training, we use a linear warmup (10% of total steps) to a maximum learning rate of $1\mathrm{e}{-3}$ and a cosine decay schedule to $3\mathrm{e}{-4}$.

² We exclude the GitHub and StackExchange datasets from SlimPajama in order to ablate adding a new expert model using the Code domain.
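The learning-rate schedule described above (linear warmup over 10% of the steps to 1e-3, then cosine decay to 3e-4) could look roughly like the sketch below. The function name `learning_rate`, the choice to start the warmup at `peak_lr / warmup_steps`, and the step counts are assumptions; the paper does not specify these details.

```python
import math

def learning_rate(step, total_steps, peak_lr=1e-3, final_lr=3e-4, warmup_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine

# Toy usage with a hypothetical 1000-step run.
schedule = [learning_rate(s, 1000) for s in range(1001)]
```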

2. MoE training. After the training of dense expert models, we merge them into a unified MoE by appending their FFNs along a new dimension to create an MoE layer per Transformer block. For the shared expert in our MoE layer, we use the original FFN layer of the seed model to better preserve the previous capabilities in the MoE model. For all non-FFN parameters including the attention weights, we merge expert parameters using simple weight averaging, following Sukhbaatar et al. [2024]. After the MoE model is created, we continue training it for an additional 25B and 40B tokens for the 470M and 2.8B experiments respectively, on a mix of all domain and original pre-training datasets, using the same training hyperparameters as in the single expert training. Finally, we train the MoE models on an additional 1B tokens, upweighting the original pre-training dataset, as it includes high-quality data sources such as instruction-style datasets, and using a cosine learning rate decay to $3\mathrm{e}{-5}$ [Parmar et al., 2024].

3. Extending the MoE model with new experts. After adding a new expert as defined in Section 3, we finetune the extended MoE model for up to 1 billion tokens using a uniformly sampled data mix consisting of 50% the previous domains and pre-training data and 50% the new domain. For the new expert (Code), we train a dense model using code documents from StarCoder [Li et al., 2023] with the same settings as for the training of the initial experts. As the 470M scale MoE did not have sufficient instruction following capabilities to attempt the code benchmarks, we only tested extending the MoEs with a new expert on the 2.8B scale.
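One way to realize such a 50/50 finetuning mix is to sample the data source for each training example from fixed weights, as in the sketch below. How the 50% share is divided among the previous domains and the pre-training data is not stated in the text; the equal split used here, the source labels, and the function names are assumptions.

```python
import random

def mixture_weights(previous_sources, new_source, new_fraction=0.5):
    """Give the new domain new_fraction and split the rest equally (an assumption)."""
    share = (1.0 - new_fraction) / len(previous_sources)
    weights = {name: share for name in previous_sources}
    weights[new_source] = new_fraction
    return weights

def sample_source(weights, rng):
    """Draw the source of the next training example according to the weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Toy usage with hypothetical source labels.
weights = mixture_weights(["ArXiv", "Books", "C4", "Wikipedia", "pretraining"], "Code")
rng = random.Random(0)
counts = {}
for _ in range(10_000):
    source = sample_source(weights, rng)
    counts[source] = counts.get(source, 0) + 1
```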

4.2 Baselines

We compare our experiments against two baselines:

1. Dense Merging: We compare MoE variants against merging all separately pre-trained experts and the seed model into a dense Transformer via equal weight averaging, similar to BTM [Li et al., 2022]. This allows us to ask: what are the benefits of routing in an MoE over simple averaging?

2. MoE (Linear Router): To evaluate Nexus’s novel router for upcycling, we compare it against an MoE with a standard linear router that is upcycled from dense experts. Here, we ask: how does our specialized routing compare to conventional learned linear routing? For a fair comparison, we also train this MoE model on the same datasets and for the same number of tokens as our method, and use the same architectural modifications such as shared experts.

4.3 Evaluation

For the downstream evaluation, we measure the performance of each model on 15 tasks from five evaluation categories that reflect different capabilities based on the tasks and the datasets used in the benchmarks. We did not include ARC-Challenge and Natural Questions in the 470M experiments, as some model variants were unable to achieve non-random performance. The categories are:

• Knowledge: To measure question-answering capabilities based on world knowledge and web documents such as Wikipedia, we report the performance on OpenBookQA [Mihaylov et al., 2018], Natural Questions [Kwiatkowski et al., 2019], TriviaQA [Joshi et al., 2017], QUAC [Choi et al., 2018] (all 0-shot) and SQuAD (4-shot) [Rajpurkar et al., 2016].

• Science: For measuring knowledge in science-oriented academic benchmarks, we use ARC-Easy, ARC-Challenge [Clark et al., 2018], and SciQ [Welbl et al., 2017] (all 0-shot).

• Reasoning: For reasoning abilities, we use CommonSenseQA [Talmor et al., 2019], SIQA [Sap et al., 2019], PIQA [Bisk et al., 2020], WinoGrande [Sakaguchi et al., 2019], and HellaSwag [Zellers et al., 2019] (all 0-shot).

• General Language Understanding: We use MMLU (5-shot) [Hendrycks et al., 2021] to test general language understanding.

• Code: For code generation, we evaluate models on MBPP [Austin et al., 2021], LBPP [Matton et al., 2024], and HumanEval-Pack [Chen et al., 2021], which includes Cpp, Javascript, Java, Go, Python, and Rust (all 0-shot).