Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
Abstract
Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
1 Introduction
† Work done at Cohere. †† Corresponding authors: Nikolas Gritsch, Sara Hooker, Ahmet Üstün.
In an era of bigger and bigger models [Canziani et al., 2016; Strubell et al., 2019; Rae et al., 2021; Raffel et al., 2020; Bommasani et al., 2022; Hooker, 2024], there are several key objectives driving state-of-the-art progress. Doing more with less by improving efficiency [Treviso et al., 2023] remains paramount, but in addition to efficiency, the deployment of these models in the wild means that the ability to adapt to new data [Pozzobon et al., 2023b; Gururangan et al., 2020a; Jang et al., 2022; Jin et al., 2022] and specialization of compute [Zadouri et al., 2024; Shazeer et al., 2018; Riquelme et al., 2021; Du et al., 2022; Fedus et al., 2022] have gained renewed focus. While all these properties are desirable, a formidable challenge is designing architectures that can fulfill all of these requirements.
The Mixture-of-Experts (MoE) approach gained prominence because of its efficiency properties. In contrast to dense models, which require significant compute to deploy, MoE approaches only activate a subset of the parameters for every single token. Intuitively, not all parameters are necessary for each request, as some parameters will specialize on certain tasks, and those unrelated to the current request can be ignored. However, while MoEs greatly improved efficiency, the ability to induce meaningful specialization has been more limited, with observations that experts don’t appear to exhibit dedicated expertise [Jiang et al., 2024; Zoph et al., 2022; Zadouri et al., 2023]. Furthermore, MoEs tend to suffer from severe training instabilities [Zoph et al., 2022].
Recent work has attempted to address both the training instabilities and the lack of specialization. These techniques often train completely separate experts and “upcycle” (combine) them into a single unified MoE model after dense training [Sukhbaatar et al., 2024]. This reduces the memory and communication cost, and improves efficiency during training as computations are more local and cross-device communication is reduced [Li et al., 2022; Gururangan et al., 2023]. Notably, the other major advantage of these approaches is the increase in specialization with separate experts that are trained on specific domains, making them clearly responsible for their human-interpretable subset of the data. On the other hand, MoEs with a standard router, which needs to be trained on a mix of all training data, are not designed to maintain domain specialization [Jiang et al., 2024].
However, efficiently integrating new experts into upcycled MoE models, a setting that is of great interest for adaptability objectives, is far less studied. For most practitioners, given the scale of modern LLMs [Brown et al., 2020; Touvron et al., 2023; Kaplan et al., 2020; Anil et al., 2023], training MoEs repeatedly incurs an infeasible computational cost. Furthermore, most model development fails to take into account distribution drift in use cases, with limited flexibility and applicability across different tasks and domains [Pozzobon et al., 2023a; Gururangan et al., 2020b]. Yet human language is shaped by a cumulative culture, constantly building upon itself and evolving over time [Silvey, 2016]. In addition, specialized use cases such as multilingual, code, and math often require tailored additional training.
In this work, we attempt to reconcile all three desirable properties: efficiency, specialization, and adaptability. We ask “how can we adaptively combine separately trained specialized experts?” To address this, we introduce Nexus, a novel MoE architecture that parameterizes the router based on domain-specific data by learning to project the embedding of each data domain to an expert embedding. This learnable projection for the router allows for the easy extension of the MoE model with new experts that are trained independently on new datasets of interest. This also avoids the difficulties of MoE training, as our learned router scales with the number of experts without needing to be trained from scratch, which enables adding or removing experts as desired.
Our experiments show that Nexus outperforms previous work when upcycling an MoE from separately trained specialized domain experts. Going beyond the single upcycling phase, Nexus can be efficiently extended with a new expert trained on a new domain by finetuning it with far fewer tokens than the finetuning after the initial upcycling.
Property | MoE (Vanilla) | BTM (Merge) | BTX (Linear router) | Nexus (Ours) |
Dense experts are trained independently (upcycling) | ✗ | ✔ | ✔ | ✔ |
Experts are specialized in different domains | ✗ | ✔ | ✔ | ✔ |
Experts are chosen by a learned router per input token | ✔ | ✗ | ✔ | ✔ |
Router is adaptive via learned projection for new domains | ✗ | ✗ | ✗ | ✔ |
In summary, our contributions are as follows:
(i) We present Nexus, a novel MoE framework designed to enhance sparse upcycling of specialized dense experts, while reducing the training cost of MoEs by facilitating easy adaptation to unseen data distributions. In Nexus, the traditional linear router from vanilla MoE models is replaced with routing based on the similarity of layer inputs to an expert embedding vector, derived from the average embedding of the corresponding expert training dataset.
(ii) Our method outperforms the existing approach for upcycling specialized models into an MoE, leading to relative increases of 2.1% and 1.6% over the upcycled MoE (linear router) at the 470M and 2.8B scales, respectively. It also improves performance on general tasks, with 5.8% and 7.4% relative gains over the dense seed model at 470M and 2.8B, respectively.
(iii) Our method enables efficient adaptation to new domains by extending the upcycled MoE with new experts trained on unseen datasets. In this setting, Nexus outperforms the baseline MoE (linear router) when finetuning on a limited amount of data, leading to an 18.8% relative gain on the new domain with 1B finetuning tokens upon MoE extension.
(iv) Finally, we show that our method is robust across different load balancing and data mixtures, and consistently outperforms the MoE with a linear router for specialized upcycling, confirming the benefits of the adaptive routing based on domain projections used in Nexus.
2 Background
Sparse Mixture of Experts architectures [Shazeer et al., 2017; Fedus et al., 2022] replace the feed-forward network (FFN) with an MoE layer in the Transformer block [Vaswani et al., 2017]. An MoE layer consists of a router network $R$ and a set of $n$ experts, $E_1, \dots, E_n$, where each expert $E_i$ corresponds to an independent dense feed-forward network. The router network $R$ is commonly parameterized by trainable weights $W_r \in \mathbb{R}^{h \times n}$, where $h$ is the model hidden dimension, followed by a softmax function. It takes an intermediate token representation $x \in \mathbb{R}^{h}$ as input, and the MoE layer combines the output of each expert based on the gating scores $s_1, \dots, s_n$. Sparse MoEs only use the top-$k$ experts, selected by their gating scores $s_i$:
$s = R(x) = \mathrm{softmax}(W_r^{\top} x)$ (Router)
$\mathcal{T} = \mathrm{TopK}(s, k)$ (Top-K Routing)
$y = \sum_{i \in \mathcal{T}} s_i \cdot E_i(x)$ (MoE)
where $\mathcal{T}$ is the index set of the $k$ largest gating scores.
Recent work has also shown that using a shared expert $E_s$ that is always activated is beneficial to remove parameter redundancy among other experts [Rajbhandari et al., 2022; Dai et al., 2024]:
$y = E_s(x) + \sum_{i \in \mathcal{T}} s_i \cdot E_i(x)$ (MoE + shared expert)
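The routing described in this section can be illustrated with a small sketch. It is a minimal, standard-library-only illustration of top-k gating with an optional shared expert, not the authors' implementation. All names (`router_scores`, `moe_forward`, the toy experts and weights) are hypothetical, and the sketch assumes the gating scores of the selected experts are used without renormalization.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_scores(x: Sequence[float], w_r: Sequence[Sequence[float]]) -> Vector:
    """Linear router softmax(W_r^T x); w_r is stored as h rows of n columns."""
    n_experts = len(w_r[0])
    logits = [sum(x[j] * w_r[j][i] for j in range(len(x))) for i in range(n_experts)]
    return softmax(logits)


def moe_forward(
    x: Sequence[float],
    w_r: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    k: int,
    shared_expert: Optional[Expert] = None,
) -> Vector:
    """Weighted sum of the top-k expert outputs, plus the shared expert if given."""
    scores = router_scores(x, w_r)
    # Indices of the k largest gating scores.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption: experts map h-dimensional inputs to h-dimensional outputs.
    y = list(shared_expert(x)) if shared_expert is not None else [0.0] * len(x)
    for i in top_k:
        out = experts[i](x)
        y = [acc + scores[i] * o for acc, o in zip(y, out)]
    return y


# Toy usage: three hypothetical "experts" that simply scale their input.
toy_experts = [lambda v, c=c: [c * t for t in v] for c in (1.0, 2.0, 3.0)]
toy_w_r = [[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]]  # h = 2, n = 3
toy_x = [1.0, 0.0]
toy_scores = router_scores(toy_x, toy_w_r)
toy_y = moe_forward(toy_x, toy_w_r, toy_experts, k=2)
toy_y_shared = moe_forward(
    toy_x, toy_w_r, toy_experts, k=2, shared_expert=lambda v: list(v)
)
```

Some MoE implementations renormalize the top-k scores so that they sum to one; that variant would only change the weights applied inside the loop.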
Sparse Upcycling [Komatsuzaki et al., 2023] initializes an MoE model from a dense Transformer model. The dense model’s FFN layers are copied $n$ times to initialize each of the $n$ experts, and the router layer is trained from scratch. BTX [Sukhbaatar et al., 2024] generalizes this approach by initializing each expert from the FFN layer of a different expert model, and all other parameters as the average over all of these models. The expert models are finetuned versions of the original dense model, which allows weight merging without major losses.
Similar to BTX, Nexus upcycles specialized expert models. However, it diverges in terms of MoE training, in particular with its novel MoE router, which makes it possible to efficiently extend the MoE in multiple rounds after the sparse upcycling. We describe our method in the next section.
3 Adaptive Router for Upcycling Specialized Experts as MoE
The core component of an MoE model is the router, as it determines which experts to activate for any given input. In vanilla MoEs, the router is a learned linear layer that takes the token intermediate representations as input and computes the expert probabilities. However, this router does not necessarily learn specialization as MoEs are commonly trained using an auxiliary load balancing loss to improve training stability [Fedus et al., 2022; Jiang et al., 2024]. In Nexus, we propose a novel MoE router where per MoE block we learn a projection layer from given pre-computed domain embeddings to expert embeddings. We parametrize this projection layer as a two-layer MLP with a SwiGLU activation function [Shazeer, 2020]:
$e_i = P_r(d_i) = W_2 \cdot \mathrm{SwiGLU}(W_1 \cdot d_i)$ (Domain to Expert Embeddings)
where $d_i \in \mathbb{R}^{m}$ and $e_i \in \mathbb{R}^{h}$ are the domain and expert embeddings for the $i$-th domain respectively, and $m$ and $h$ are the domain embedding and the model dimensions. $W_1$ and $W_2$ are linear layers, and SwiGLU is defined as $\mathrm{SwiGLU}(z) = \mathrm{SiLU}(z_1) \odot z_2$ for an input $z = [z_1; z_2]$ split into two halves. Given the expert embeddings $e_i$ and layer inputs $x$, we then compute routing probabilities $s_i$ as:
$s_i = \frac{\exp(x \cdot e_i)}{\sum_{j=1}^{n} \exp(x \cdot e_j)}$ (Routing Scores)
Unlike the standard router, Nexus’s router includes a stronger inductive bias through pre-computed domain embeddings¹ that enables the expert embeddings to specialize. Thus, $s_i$ gives a high value for input tokens that are closer to the domain of the corresponding expert. Notably, this router is particularly suited for the sparse upcycling setting where the dense experts are separately trained on different domains.
¹ We used Cohere Embed v3 (https://cohere.com/blog/introducing-embed-v3) as an external embedding model to compute domain embeddings based on existing individual data sources. However, similar to Gururangan et al. [2023], pre-training data can also be clustered and the centroid of each cluster can be used for domain embeddings.
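To make the router above concrete, the following is a minimal sketch of the two steps: projecting each domain embedding to an expert embedding with a two-layer SwiGLU MLP, then scoring a token against the expert embeddings with a softmax over dot products. It is an illustration under stated assumptions rather than the authors' code: the function names, the hidden width `l`, the toy dimensions, and the random initialization are all hypothetical, and SwiGLU is taken in its common "split into two halves" form.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product, with w given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def silu(z: float) -> float:
    """SiLU(z) = z * sigmoid(z), written to avoid overflow for negative z."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return z * ez / (1.0 + ez)


def swiglu(z: Sequence[float]) -> Vector:
    """Split z into halves (z1, z2) and return SiLU(z1) * z2 elementwise."""
    half = len(z) // 2
    return [silu(a) * b for a, b in zip(z[:half], z[half:])]


def project_domain(d: Sequence[float], w1: Matrix, w2: Matrix) -> Vector:
    """Expert embedding e = W2 . SwiGLU(W1 . d) for one domain embedding d."""
    return matvec(w2, swiglu(matvec(w1, d)))


def nexus_routing_scores(x: Sequence[float], expert_embeddings: Sequence[Vector]) -> Vector:
    """Softmax over dot products between token x and each expert embedding."""
    logits = [sum(a * b for a, b in zip(x, e)) for e in expert_embeddings]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


# Toy usage with hypothetical sizes: domain dim m, hidden width l, model dim h.
rng = random.Random(0)
m, l, h = 4, 3, 2
w1 = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(2 * l)]  # (2l x m)
w2 = [[rng.uniform(-1, 1) for _ in range(l)] for _ in range(h)]      # (h x l)
domain_embeddings = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(3)]
expert_embeddings = [project_domain(d, w1, w2) for d in domain_embeddings]
scores = nexus_routing_scores([0.5, -0.2], expert_embeddings)
aligned = nexus_routing_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Because the router's per-expert parameters are produced by the projection rather than stored directly, the number of experts only changes the length of `domain_embeddings`, not the shape of `w1` or `w2`.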
Connection to hypernetworks. Our router parametrization is closely related to hypernetworks [Ha et al., 2016] as the projection layer generates parameters for the router during runtime for a given input. We use domain embeddings as the input to the projection layer, enabling efficient adaptation and also a better cross-domain transfer based on the similarity between domain embeddings as shown in previous work [Mahabadi et al., 2021; Üstün et al., 2022].
Upcycling dense experts as an MoE. After training dense expert models, we merge the individual experts into a unified MoE by appending their FFNs along a new dimension to create an MoE layer per Transformer block. Unlike Sukhbaatar et al. [2024], instead of using the original FFN of the seed model as one of the routed experts in an MoE layer, we use it as the shared expert ($E_s$) to better preserve the previous capabilities in the MoE model. For all non-FFN parameters including the attention weights, we merge expert parameters using simple weight averaging:
$\mathrm{FFN}_{\text{MoE}} = [\mathrm{FFN}_{1}, \mathrm{FFN}_{2}, \dots, \mathrm{FFN}_{n}]$ (MoE Layer FFNs)
$\phi_{\text{MoE}} = \frac{1}{n} \sum_{i=1}^{n} \phi_{i}$ (Merge Non-FFN params.)
where $\mathrm{FFN}_{i}$ and $\phi_{i}$ denote the FFN and non-FFN parameters of the $i$-th dense expert.
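The merge step described above can be sketched on toy parameter dictionaries. This is a simplified illustration, not the authors' code: the `ffn` / `non_ffn` dictionary layout, the flat lists of floats, and the `upcycle` function name are hypothetical, and the sketch assumes the non-FFN average is taken over the dense experts only, with the seed model contributing just its FFN as the shared expert.

```python
from typing import Any, Dict, List

Params = List[float]
Model = Dict[str, Any]


def average_params(param_sets: List[Params]) -> Params:
    """Elementwise mean over several parameter vectors of equal length."""
    count = len(param_sets)
    return [sum(values) / count for values in zip(*param_sets)]


def upcycle(seed: Model, experts: List[Model]) -> Model:
    """Build a toy MoE: stack expert FFNs, keep the seed FFN as shared expert,
    and average every non-FFN parameter across the dense experts."""
    return {
        "shared_ffn": list(seed["ffn"]),
        "expert_ffns": [list(e["ffn"]) for e in experts],
        "non_ffn": {
            name: average_params([e["non_ffn"][name] for e in experts])
            for name in seed["non_ffn"]
        },
    }


# Toy usage with two hypothetical dense experts finetuned from one seed model.
seed_model = {"ffn": [0.0, 0.0], "non_ffn": {"attn": [1.0, 1.0]}}
expert_a = {"ffn": [1.0, 1.0], "non_ffn": {"attn": [2.0, 4.0]}}
expert_b = {"ffn": [2.0, 2.0], "non_ffn": {"attn": [4.0, 8.0]}}
moe = upcycle(seed_model, [expert_a, expert_b])
```

In a real model the same two operations would be applied per Transformer block and per tensor; flat lists are used here only to keep the example dependency-free.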
Efficient adaptation to new domains. An important advantage of our method is that when a new data domain is present after MoE training, we use the learned projection $P_r$ to compute the expert embedding of the new domain as $e_{n+1} = P_r(d_{n+1})$, where $d_{n+1}$ is the domain embedding of the new domain. This makes it possible to enhance the trained MoE model with additional dense experts, which are trained in the same way as the initial experts. The FFN parameters of the new expert are simply appended to the array of existing experts.
To adequately preserve the non-FFN parameters of existing experts, we perform a weighted average $\phi_{f} = (1 - \lambda) \cdot \phi_{\text{MoE}} + \lambda \cdot \phi_{e}$, where $\phi_{f}$, $\phi_{e}$, and $\phi_{\text{MoE}}$ are parameters of the final MoE, dense expert, and initial MoE model and $\lambda = 1/(n+1)$, with $n$ the number of existing experts. This enables efficient adaptation of Nexus to new domains by extending it with a new dense expert trained independently. After extending the MoE with a new expert, we perform a lightweight finetuning with a limited number of tokens for quick adaptation.
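The extension step can be sketched in the same toy setting. This is an illustrative sketch only: the dictionary layout and the `extend_moe` name are hypothetical, and the mixing weight is assumed to be 1/(n+1) for n existing experts, which keeps each expert's contribution to the non-FFN average equal.

```python
from typing import Any, Dict, List, Sequence

Model = Dict[str, Any]


def extend_moe(moe: Model, new_expert: Model, new_domain_embedding: Sequence[float]) -> Model:
    """Append a new expert FFN and fold its non-FFN parameters into the MoE.

    Assumption: lam = 1 / (n + 1), where n is the number of existing experts.
    """
    n = len(moe["expert_ffns"])
    lam = 1.0 / (n + 1)
    merged_non_ffn = {
        name: [
            (1.0 - lam) * old + lam * new
            for old, new in zip(values, new_expert["non_ffn"][name])
        ]
        for name, values in moe["non_ffn"].items()
    }
    return {
        "shared_ffn": list(moe["shared_ffn"]),
        # The new FFN is appended to the array of existing experts.
        "expert_ffns": [list(f) for f in moe["expert_ffns"]] + [list(new_expert["ffn"])],
        "non_ffn": merged_non_ffn,
        # The router gets no new weights: the learned projection maps this
        # additional domain embedding to the new expert embedding.
        "domain_embeddings": [list(d) for d in moe["domain_embeddings"]]
        + [list(new_domain_embedding)],
    }


# Toy usage: a hypothetical two-expert MoE extended with a third expert.
base_moe = {
    "shared_ffn": [0.0, 0.0],
    "expert_ffns": [[1.0, 1.0], [2.0, 2.0]],
    "non_ffn": {"attn": [3.0, 6.0]},
    "domain_embeddings": [[1.0, 0.0], [0.0, 1.0]],
}
code_expert = {"ffn": [5.0, 5.0], "non_ffn": {"attn": [6.0, 0.0]}}
extended = extend_moe(base_moe, code_expert, new_domain_embedding=[0.5, 0.5])
```

The function returns a new dictionary instead of mutating its input, so the original MoE remains available for comparison after the extension.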
4 Experiments
4.1 Experimental setting
Our experimental setup includes 3 phases. Figure 1 shows the architecture of Nexus and the corresponding experimental setting:
1. Training specialized expert LMs. For training the dense specialized experts, we use the sub-datasets from the SlimPajama dataset [Soboleva et al., 2023], a 627B token English-language corpus assembled from web data of various sources. We initialize four dense experts from the weights of the seed model and train them on the ArXiv, Books, C4, and Wikipedia domains.² As seed models, we use 470M and 2.8B parameter decoder-only autoregressive Transformer models [Radford et al., 2019] that are trained with a standard language modeling objective for 750B tokens. We train dense experts for 25 and 40 billion tokens for the 470M and 2.8B seed models respectively. We use parallel attention layers [Anil et al., 2023; Wang, 2021], SwiGLU activation [Shazeer, 2020], no biases in dense layers, and a byte-pair-encoding (BPE) tokenizer with a vocabulary size of 256,000. During training, we use a linear warmup (10% of total steps) to a maximum learning rate of -3 and a cosine decay schedule to -4.
² We exclude the GitHub and StackExchange datasets from SlimPajama in order to ablate adding a new expert model using the Code domain.
2. MoE training. After the training of dense expert models, we merge them into a unified MoE by appending their FFNs along a new dimension to create an MoE layer per Transformer block. For the shared expert in our MoE layer, we use the original FFN layer of the seed model to better preserve the previous capabilities in the MoE model. For all non-FFN parameters including the attention weights, we merge expert parameters using simple weight averaging, following Sukhbaatar et al. [2024]. After the MoE model is created, we continually train it for an additional 25B and 40B tokens respectively for the 470M and 2.8B experiments, on a mix of all domain and original pre-training datasets, using the same training hyperparameters as in the single expert training. Finally, we train the MoE models using an additional 1B tokens by upweighting the original pre-training dataset as it includes high-quality data sources such as instruction-style datasets using a cosine learning rate decay to -5 [Parmar et al., 2024].
3. Extending the MoE model with new experts. After adding a new expert as defined in Section 3, we finetune the extended MoE model for up to 1 billion tokens using a uniformly sampled data mix consisting of 50% the previous domains and pre-training data and 50% the new domain. For the new expert (Code), we train a dense model using code documents from StarCoder [Li et al., 2023] with the same settings as for the training of the initial experts. As the 470M scale MoE did not have sufficient instruction following capabilities to attempt the code benchmarks, we only tested extending the MoEs with a new expert on the 2.8B scale.
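The learning-rate schedule used in phase 1 (linear warmup over the first 10% of steps, followed by cosine decay to a lower value) can be sketched as below. This is a generic illustration of that schedule shape rather than the authors' training code: the function name and the numeric values in the usage lines are hypothetical placeholders, not the learning rates used in the experiments.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    max_lr: float,
    min_lr: float,
    warmup_fraction: float = 0.1,
) -> float:
    """Linear warmup to max_lr, then cosine decay from max_lr down to min_lr."""
    warmup_steps = max(1, int(round(total_steps * warmup_fraction)))
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Toy usage with placeholder values (not the values from the experiments).
schedule = [lr_at_step(s, total_steps=1000, max_lr=1.0, min_lr=0.1) for s in range(1001)]
```

The schedule is a pure function of the step index, so it can be evaluated ahead of time to plot or inspect the full curve before training.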
4.2 Baselines
We compare our experiments against two baselines:
1. Dense Merging: We compare MoE variants against merging all separately pre-trained experts and the seed model into a dense Transformer via equal weight averaging, similar to BTM [Li et al., 2022]. This allows us to ask: what are the benefits of MoE routing over simple averaging?
2. MoE (Linear Router): To evaluate Nexus’s novel router for upcycling, we compare it against an MoE with a standard linear router that is upcycled from dense experts. Here, we ask: how does our specialized routing compare to conventional learned linear routing? For a fair comparison, we also train this MoE model on the same datasets and for the same number of tokens as our method, and use the same architectural modifications such as shared experts.
4.3 Evaluation
For the downstream evaluation, we measure the performance of each model on 15 tasks³ from five evaluation categories that reflect different capabilities based on the tasks and the datasets used in the benchmarks:
³ We did not include ARC-Challenge and Natural Questions in 470M experiments as some model variants were unable to achieve non-random performance.
• Knowledge: To measure question-answering capabilities based on world knowledge and web documents such as Wikipedia, we report the performance on OpenBookQA [Mihaylov et al., 2018], Natural Questions [Kwiatkowski et al., 2019], TriviaQA [Joshi et al., 2017], QUAC [Choi et al., 2018] (all 0-shot) and SQuAD (4-shot) [Rajpurkar et al., 2016].
•
•
• General Language Understanding: We use MMLU (5-shot) [Hendrycks et al., 2021] to test general language understanding.
•