MaskMoE: Boosting Token-Level Learning via Routing Mask in
Mixture-of-Experts

Zhenpeng Su1,2, Zijia Lin3, Xue Bai4, Xing Wu1,2, Yizhe Xiong3, Haoran Lian5,
Guangyuan Ma1,2, Hui Chen3, Guiguang Ding3, Wei Zhou1,2, Songlin Hu1,2†
†Corresponding authors.
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3Tsinghua University, 4University of Science and Technology of China, 5Beihang University
{suzhenpeng,wuxing,maguangyuan,zhouwei,husonglin}@iie.ac.cn
[email protected], [email protected], [email protected]
[email protected], [email protected], [email protected]
Abstract

Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing mechanisms can mitigate this issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream tasks.

1 Introduction

Large language models have achieved promising performance on various downstream natural language tasks Touvron et al. (2023a); Dai et al. (2022); Brown et al. (2020); Anil et al. (2023). However, training large language models on extensive textual data has significantly increased computational costs compared to previous works (e.g., BERT Devlin et al. (2019), ELMO Peters et al. (2018), and LSTM Hochreiter and Schmidhuber (1997)). In order to further scale up models within computational budgets, sparse activation networks Child et al. (2019); Du et al. (2022a) receive widespread attention due to their ability to significantly reduce computational costs by using only a few parameters per input. A widely studied approach is Mixture-of-Experts (MoEs)  Lepikhin et al. (2021); Du et al. (2022a); Dai et al. (2024); Fedus et al. (2022); Roller et al. (2021), which trains multiple expert networks but selects only a subset to process specific inputs. Compared to dense networks of the same model size, MoEs effectively reduce computational costs.

The dynamic routing method commonly used in MoEs selects the expert with the highest confidence according to the probability distribution output by an intermediate layer with learnable parameters, i.e., a softmax layer. Previous works Lepikhin et al. (2021); Du et al. (2022a); Dai et al. (2024); Fedus et al. (2022) show that MoE models trained with dynamic routing achieve better performance than dense models with the same amount of training computation.

However, this dynamic routing method introduces a new problem: routing fluctuation Dai et al. (2022). As training progresses, the same token is assigned to different experts in different iterations. Consequently, each expert is trained on only a subset of the occurrences of that token. Additionally, the lack of knowledge sharing and interaction among experts may negatively impact the model’s performance Zhao et al. (2024). Specifically, when the same token is assigned to different experts, the knowledge each expert learns about it is harder to share. As previous work Shazeer et al. (2017) has indicated, when there are too many experts, the MoE model exhibits underfitting and even shows higher perplexity than with fewer experts. This dispersion has relatively little impact on the learning of frequent tokens, as these tokens have sufficient data to ensure that each expert receives adequate training. However, MoE models disperse infrequent tokens and concepts across various experts, which can lead to underfitting for these tokens. This issue hinders scaling the number of experts in MoE models: as the number of experts increases while the training data is kept constant, the average number of tokens assigned to each expert decreases further, exacerbating expert underfitting.

Figure 1: Explanation of random routing masking strategy. For each token, some experts will be randomly masked, with only a subset of experts being visible, e.g., for the token “in” only experts 1, 2 and 3 are visible.

Previous fixed routing strategies based on the Hash Layer Roller et al. (2021) pre-assign each token to a single fixed expert, e.g., “deliberate” can only be routed to expert number one, as shown in Figure 1. Always sending the same token to the same expert favors thorough learning of that token, especially for infrequent tokens. However, previous works He et al. (2023); Yang et al. (2019) show that increasing the number of optional experts for the same token input enhances the diversity of representations. Therefore, fixed routing based on the Hash Layer may lack diversity, especially in the representation of frequent tokens. Frequent tokens have a broader range of usage Koranda et al. (2018) and therefore require encoding by more experts to achieve diverse representations. For instance, “the” can be followed by a variety of nouns, and a more diverse representation helps to distinguish these nouns, making it easier to predict them in the next-token prediction task. To enhance the diversity of representation for frequent tokens, it is necessary to employ more experts to encode them.

In this paper, we propose a routing strategy called MaskMoE. Specifically, we propose a routing masking method that adjusts the number of visible experts for tokens of different frequencies by generating a masking vector for each token in the vocabulary before training. For infrequent tokens, the MoE layers employ routing masking to retain only one visible expert to which the token can be routed. For frequent tokens, the MoE layers use a lower routing masking rate, allowing more visible experts. In this way, MaskMoE enhances token-level learning through a routing mask: it lets the model learn infrequent tokens more intensively by routing the same infrequent token to the same expert, enabling that expert to be fully trained on it. Meanwhile, MaskMoE maintains the diversity of representations for frequent tokens by keeping multiple experts available for routing He et al. (2023); Yang et al. (2019). Although frequent tokens are routed to multiple experts, their large number of occurrences still ensures thorough training. Increasing the number of visible experts for high-frequency tokens, i.e., increasing the count of updatable parameters, can further enhance capability. Our experimental results indicate that MaskMoE outperforms previous MoE models in terms of both perplexity and downstream benchmarks.

Our contributions are summarized as follows:

  • We propose MaskMoE, which introduces a routing mask method to assign a different number of visible experts to tokens based on their frequency to enhance the model’s token-level learning. MaskMoE ensures sufficient training for infrequent tokens while maintaining diverse representations for frequent tokens.

  • We conduct extensive experiments to analyze the relationship between the number of experts and token frequency. We highlight that dynamic routing, which disperses the same tokens, can lead to underfitting of experts on infrequent tokens, while fixed routing lacks representation diversity for frequent tokens.

  • We validate the effectiveness of the proposed MaskMoE with extensive experiments. Experimental results show that it consistently outperforms previous Mixture-of-Experts models.

2 Related Work

2.1 Language Models

Language models are statistical models designed to optimize the likelihood of token sequences in training data Touvron et al. (2023a). Initially, language models relied on n-gram statistics Bahl et al. (1983); Katz (1987); Kneser and Ney (1995). Subsequently, the emphasis shifted to neural network-based models, particularly Recurrent Neural Networks Mikolov et al. (2010) and their variants like LSTMs Graves (2013). These models have demonstrated the ability to learn intricate patterns within textual data, achieving significant success in various language modeling applications.

In recent times, Transformers have become the predominant architecture for language models. Notable examples include BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), GPT-2 Radford et al. (2019), UniLM Dong et al. (2019), and T5 Raffel et al. (2020). The introduction of GPT-3 Brown et al. (2020), which boasts 175 billion parameters, marked a significant milestone due to its exceptional performance across numerous downstream tasks. This has led to a surge in research focusing on large generative language models, including prominent works such as Gopher Rae et al. (2021), PaLM Chowdhery et al. (2022), Pythia Biderman et al. (2023), OPT Zhang et al. (2022), GLM Du et al. (2022b); Zeng et al. (2023) and LLaMA Touvron et al. (2023a, b). Currently, GPT4 OpenAI (2023), trained with Reinforcement Learning from Human Feedback (RLHF), achieves truly remarkable results.

However, as the size of the model grows, the computational demands for both training and inference also increase. MoE models achieve scalability by sparsely activating portions of the model’s parameters, allowing for an increase in model size without significantly raising computational overhead. Consequently, MoE models have been receiving increasing attention recently.

2.2 Mixture-of-Experts

The concept of Mixture-of-Experts (MoE) models is initially proposed by  Jacobs et al. (1991). Later,  Shazeer et al. (2017) applies MoE to LSTM, training an LSTM model with up to 137B parameters. With the rise of the Transformer architecture Vaswani et al. (2017); Devlin et al. (2019), Gshard Lepikhin et al. (2021) first applies MoE to Transformers. Subsequently, powerful MoE models such as GLaM Du et al. (2022a) and Switch Transformer Fedus et al. (2022) emerge.

Early works Zoph et al. (2022); Fedus et al. (2022); Du et al. (2022a); Lepikhin et al. (2021) mainly focus on dynamic routing methods for learning-to-route MoEs. The Hash Layer Roller et al. (2021) proposes using a random hashing method to route tokens to a fixed expert, achieving better results than dynamic routing. Then, Dai et al. (2022) point out that dynamic routing suffers from the routing fluctuation problem, meaning the same input is assigned to different experts as training progresses. Routing fluctuation often harms sample efficiency because the same input updates different experts, which significantly affects infrequent tokens. However, previous works He et al. (2023); Yang et al. (2019) indicate that having more experts allows tokens to obtain richer representations, and fixed routing can reduce the diversity of representations, especially for frequent tokens.

Based on the above analysis, we propose MaskMoE, which uses routing masks to alleviate the underfitting problem of infrequent tokens caused by routing fluctuations, while maintaining the representational diversity of frequent tokens.

3 Method

In this section, we first review the training of MoE models. Then, we describe MaskMoE in detail. Finally, we introduce the load balance loss.

3.1 Reviewing the Mixture-of-Experts

The routing of MoEs can be divided into fixed routing Roller et al. (2021); Dai et al. (2022) and dynamic routing with learnable parameters Lepikhin et al. (2021); Du et al. (2022a); Dai et al. (2024); Fedus et al. (2022). The common method for fixed routing is to use random hashing Roller et al. (2021) to determine, before training, which expert each token will be sent to. In contrast, dynamic routing employs a routing layer with learnable parameters, i.e., a softmax layer, to decide which expert should process the input tokens.

In terms of structure, the FFN layer of the MoE model is replaced with sparse expert modules. Given a tokenized input sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_T\}$ consisting of $T$ tokens, the representation of each token is computed as hidden states by a standard Transformer.

$$\mathbf{H} = \mathrm{Transformer}(\mathrm{Embedding}(\mathbf{x})) \tag{1}$$

Here, $\mathbf{H}$ denotes the hidden states, i.e., $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]$. The dense transformer layer consists of self-attention and FFN sublayers.

$$\bar{\mathbf{h}}_t^l = \mathrm{SelfAttn}(\mathbf{h}_t^{l-1}) \tag{2}$$
$$\mathbf{h}_t^l = \mathrm{FFN}(\bar{\mathbf{h}}_t^l) \tag{3}$$

Here, $l$ represents the $l$-th transformer layer. MoE methods replace the FFN sub-layer with the MoE module. Each MoE module consists of multiple FFNs, whose outputs are mixed with the gating function $P(\cdot)$, represented by the following formula.

$$\mathbf{h}_t^l = \sum_{i}^{N} P(\bar{\mathbf{h}}_t^l) \cdot \mathrm{FFN}_i(\bar{\mathbf{h}}_t^l) \tag{4}$$

Here, $N$ represents the number of experts in a single MoE layer, and the $i$-th element of the gate output $P(\cdot)$ weights the output of $\mathrm{FFN}_i$. The majority of elements in the output distribution of the gate $P(\cdot)$ are zeros, i.e., only a small portion of experts will be activated. Therefore, increasing the total number of experts in MoEs does not increase computation or inference time.
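The sparse mixture in Equation (4) can be sketched for a single token as follows. This is a minimal illustration rather than the authors' implementation: it assumes top-1 routing in which the chosen expert's output is scaled by its gate probability, and it uses a plain two-layer ReLU FFN. The helper names `matvec`, `ffn`, and `moe_top1` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w_in, w_out):
    # Illustrative two-layer ReLU FFN (an assumption, not the paper's exact FFN).
    hidden = [max(0.0, v) for v in matvec(w_in, x)]
    return matvec(w_out, hidden)

def moe_top1(h_bar, w_gate, experts):
    """Route one token to its top-1 expert and scale by the gate probability."""
    probs = softmax(matvec(w_gate, h_bar))      # one probability per expert
    top = max(range(len(probs)), key=probs.__getitem__)
    w_in, w_out = experts[top]                  # only this expert is evaluated
    out = [probs[top] * v for v in ffn(h_bar, w_in, w_out)]
    return out, top, probs
```

Because only the selected expert is evaluated, the per-token cost of this sketch does not depend on how many experts are in `experts`.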

3.2 Proposed MaskMoE

The typical dynamic routing method routes tokens to the top experts based on the scores output by a softmax layer. Under this approach, expert parameter updates are sparse, meaning that the parameters of unselected experts are not updated. This sparsity may lead to experts underfitting for infrequent tokens. Fixed routing ensures that specific experts consistently process the same input tokens, which helps mitigate the problem of insufficient learning for infrequent tokens. However, it is less effective in fostering a diversity of representations compared to dynamic routing methods. Dynamic routing allows the same token to be encoded by different experts, leading to more varied representations He et al. (2023); Yang et al. (2019). Since frequent tokens can have different meanings depending on the context, fixed routing is less effective than dynamic routing in capturing the nuances of these tokens. To balance the density of parameter updates and the diversity of representations in MoE models, we propose MaskMoE to boost token-level learning via a routing mask. As illustrated in Figure 1, MaskMoE achieves this by introducing a routing masking mechanism. For each token, a fixed random mask is applied to the experts, so that the token can be routed only to its own subset of visible experts. The formula is represented as follows.

$$\mathbf{p} = \mathrm{Softmax}(W_g \bar{\mathbf{h}}_t^l + \mathbf{M}) \tag{5}$$

Here, $\mathbf{M}$ is the masking vector used to control which experts are visible to a given token, and $W_g$ is the parameter of the gate. For visible experts, the corresponding elements in $\mathbf{M}$ are $0$, while for invisible experts, the corresponding elements in $\mathbf{M}$ are $-\infty$. The masking vector is determined before training and does not change during the training process. For multi-layer MoE models, we reuse the same masking vector $\mathbf{M}$ across different MoE layers. The initial energy masking vector for a token $t$ can be represented by the following formula.

$$\begin{aligned}
\mathbf{M}^t &:= -\infty \cdot \mathbf{1}_N \\
\mathbf{C} &= \{C_i\}_{i=1}^{V} \sim \mathcal{U}(\{0 : N-1\}) \\
\mathbf{M}_j^t &= 0, \quad \forall j \in \mathbf{C}
\end{aligned} \tag{6}$$

Formally, we denote the number of visible experts as $V$. $\mathcal{U}(\{0 : N-1\})$ denotes a uniform distribution over the set of integers from $0$ to $N-1$. It is worth noting that for infrequent tokens, each token has only one visible expert, i.e., $V = 1$. This promotes more thorough learning by the selected expert for such infrequent tokens. For frequent tokens, we also reduce the number of visible experts, but keep it greater than 1, i.e., $1 < V \le N$. For frequent tokens, it is common for $V$ to be less than $N$. We believe that an appropriately sized $V$ not only allows frequent words to have higher representation diversity compared to fixed routing but also enables more thorough training for the same token compared to dynamic routing.
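The construction of the masking vectors and the masked softmax of Equation (5) can be sketched as below. This is a hedged illustration, not the authors' code: `build_masks` and `masked_softmax` are hypothetical names, the seed-based generator is an illustrative choice, and the sketch assumes the $V$ visible experts of a token are distinct (sampled without replacement), which the text does not state explicitly.

```python
import math
import random

NEG_INF = float("-inf")

def build_masks(vocab_size, num_experts, frequent_ids, v_frequent, seed=0):
    """Return one fixed mask per token id: 0 for visible experts, -inf otherwise."""
    rng = random.Random(seed)          # fixed seed: masks never change during training
    frequent_ids = set(frequent_ids)
    masks = []
    for token_id in range(vocab_size):
        v = v_frequent if token_id in frequent_ids else 1   # V = 1 for infrequent tokens
        # Assumption: visible experts are sampled without replacement.
        visible = rng.sample(range(num_experts), v)
        mask = [NEG_INF] * num_experts
        for j in visible:
            mask[j] = 0.0
        masks.append(mask)
    return masks

def masked_softmax(scores, mask):
    """Softmax over gate scores plus mask; masked experts get probability 0."""
    z = [s + m for s, m in zip(scores, mask)]
    top = max(z)                              # finite, since at least one expert is visible
    exps = [math.exp(v - top) for v in z]     # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

With a single visible expert the masked distribution places all probability on that expert, so routing for that token is fixed; with all experts visible the mask is all zeros and the gate behaves like an unmasked softmax router.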

3.3 Load Balance Loss

MoE models typically require distributed training, where different experts are deployed across various nodes. Load imbalance, where a minority of expert nodes handle the majority of tokens while the majority of expert nodes sit idle, can hurt training efficiency. It is therefore generally desirable for the number of tokens processed by different experts to be roughly equal. To achieve load balancing, a load balancing loss is commonly introduced in the training of MoE models. We follow previous works Huang et al. (2024); Fedus et al. (2022) and adopt a widely used loss.

$$\mathcal{L}_{bal} = N \cdot \sum_{i=1}^{N} w_i \cdot R_i \tag{7}$$
$$\text{s.t.}, \quad w_i = \frac{1}{T} \sum_{j=1}^{T} \mathbb{I}\{\mathrm{argmax}(\mathbf{p}^j) = i\} \tag{8}$$
$$R_i = \frac{1}{T} \sum_{j=1}^{T} \mathbf{p}_i^j \tag{9}$$

Here, $T$ represents the number of tokens in a mini-batch. $\mathbf{p}^j$ denotes the probability distribution of the routing output for the $j$-th token, while $\mathbf{p}_i^j$ represents the probability value of the $i$-th expert. It is noteworthy that the load balancing loss does not apply to infrequent tokens, which have only one visible expert. The balancing loss primarily regulates the routing of frequent tokens. Additionally, since infrequent tokens are assigned to experts through random hashing, the overall load is relatively balanced. Accordingly, in our experiments MaskMoE does not show a significant decrease in training efficiency compared to SMoE.
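Equations (7) to (9) can be computed directly from a batch of routing distributions. The following is a minimal sketch under the assumption that `probs` holds one distribution per token; the function name `load_balance_loss` is hypothetical, and a real training setup would compute this on tensors so that gradients flow through $R_i$.

```python
def load_balance_loss(probs):
    """probs: list of T routing distributions, each a list of N probabilities."""
    T = len(probs)
    N = len(probs[0])
    w = [0.0] * N   # Eq. (8): fraction of tokens whose top-1 expert is i
    R = [0.0] * N   # Eq. (9): mean routing probability assigned to expert i
    for p in probs:
        top = max(range(N), key=p.__getitem__)
        w[top] += 1.0 / T
        for i in range(N):
            R[i] += p[i] / T
    # Eq. (7): scaled dot product of the two per-expert statistics.
    return N * sum(wi * ri for wi, ri in zip(w, R))
```

For two experts, routing one token to each with full confidence gives a loss of 1.0, while sending both tokens to the same expert gives 2.0, so the loss penalizes concentrated routing.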

Our final loss is a combination of the language model loss and load-balance loss:

$$\mathcal{L} = \mathcal{L}_{lm} + \mathcal{L}_{bal} \tag{10}$$

4 Experiments

| Model | Configuration | Params | Activated Params | Pile PPL |
|---|---|---|---|---|
| Standard Transformer | Layers=24, Dense | 468M | 468M | 6.95 |
| SMoE | Layers=24, MoE_Layers = 1 | 1.3B | 468M | 6.62 |
| Hash Layer | Layers=24, MoE_Layers = 1 | 1.3B | 468M | 6.56 |
| Share-MoE | Layers=24, MoE_Layers = 1 | 1.3B | 468M | 6.72 |
| MaskMoE | Layers=24, MoE_Layers = 1 | 1.3B | 468M | 6.48 |
| SMoE | Layers=24, MoE_Layers = 12 | 10B | 468M | 6.18 |
| Hash Layer | Layers=24, MoE_Layers = 12 | 10B | 468M | 6.16 |
| Share-MoE | Layers=24, MoE_Layers = 12 | 10B | 468M | 6.15 |
| MaskMoE | Layers=24, MoE_Layers = 12 | 10B | 468M | 6.11 |

Table 1: Perplexity results of language modeling.

4.1 Pre-training Dataset

Following previous works, we use the Pile dataset Gao et al. (2021) for pre-training data. The Pile dataset is a large-scale, publicly available corpus for language model pre-training, containing 22 domains and over 825GB of English text data. For our experiments, we use the well-known LLaMA tokenizer for tokenization, with a vocabulary size of 32k. We follow Xie et al. (2023a); Su et al. (2024) to calculate the sampling rate for each domain based on the number of tokens after tokenization. Due to the computational budget, we follow the pretraining settings of Xie et al. (2023a); Su et al. (2024); Huang et al. (2024); Xiong et al. (2024); Lian et al. (2024) and pre-train all models with 100B tokens.

Then, to identify infrequent and frequent tokens, we calculate the frequency of each token in the training set and sort the tokens in descending order of frequency. We categorize the top tokens that together cover a fraction $P$ of the dataset as frequent tokens, and the tokens covering the remaining fraction $1-P$ as infrequent tokens, where $P$ is a tunable hyperparameter.
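The frequency-based split can be sketched as follows. This is one plausible reading of the procedure, stated as an assumption: the frequent set is taken to be the shortest prefix of the frequency-sorted vocabulary whose cumulative count reaches a fraction `coverage` of all training tokens. The function name `split_by_coverage` and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def split_by_coverage(token_ids, coverage):
    """Return (frequent, infrequent) sets of token ids for a coverage in [0, 1]."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    frequent, covered = set(), 0
    # Sort by descending count; ties are broken by token id for determinism.
    for tok, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break
        frequent.add(tok)
        covered += c
    infrequent = set(counts) - frequent
    return frequent, infrequent
```

The resulting `frequent` set is the kind of input a mask-construction step would need in order to assign more than one visible expert to those token ids.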

4.2 Experimental Setup

Following Touvron et al. (2023a, b); Huang et al. (2024), we adopt the LLaMA architecture with 24 transformer layers and a hidden layer dimension of 1024. In line with LLaMA Touvron et al. (2023a), we employ the AdamW Loshchilov and Hutter (2019) optimizer and apply a cosine learning rate decay schedule.

For the MoE model, following Roller et al. (2021), we conduct experiments under both single-layer and multi-layer settings. For the single-layer setting, we replace the final FFN layer with the MoE layer. For the multi-layer MoE models, we follow Gshard Lepikhin et al. (2021), alternating MoE layers every other layer, resulting in a total of 12 MoE layers in this setting. For single-layer MoE models and dense models, we use a learning rate of 3e-4, while for 12-layer MoE models, we employ a learning rate of 1e-4 to ensure stable convergence of the models. If not specified otherwise, the most frequent tokens that together account for 40% of the training tokens are considered frequent tokens, while the remaining tokens are considered infrequent, i.e., $P = 0.4$. For frequent tokens, the visible expert count $V$ is set to 8, and for infrequent tokens, the visible expert count $V$ is set to 1. All of our implementations are based on the DeepSpeed library (https://github.com/microsoft/DeepSpeed) Rajbhandari et al. (2022); Rasley et al. (2020), which offers robust support for distributed MoE training. By default, we enable the random token selection strategy Kim et al. (2021) implemented in the library to facilitate faster model convergence and better runtime efficiency. Following previous works Roller et al. (2021); Fedus et al. (2022); Dai et al. (2022), unless otherwise specified, our experiments only select the top expert.

4.3 Compared Models

Including MaskMoE, we compare five models in our validation experiments. We briefly introduce the five compared models below.

  • Dense represents a standard Transformer language model with 468M parameters.

  • SMoE denotes a Switch Transformer Fedus et al. (2022), where the router is a learnable parameter. Unless otherwise specified, each MoE layer has 64 experts.

  • Hash Layer Roller et al. (2021) signifies that through a random hash method, each token is assigned a fixed routing expert prior to training. Unless otherwise specified, each MoE layer has 64 experts.

  • Share-MoE is a hybrid dense and MoE model created using residual connections Rajbhandari et al. (2022). Models with shared experts are currently a popular architecture Dai et al. (2024); DeepSeek-AI et al. (2024); Zhao et al. (2024); Rajbhandari et al. (2022). Unless otherwise specified, Share-MoE has 1 shared expert and 128 routed experts, where each expert is 0.5 times the size of a standard FFN. During both training and inference, in addition to the always-active shared expert, the top expert is selected from the pool of 128 routed experts. In such a setup, Share-MoE has the same number of floating-point operations (FLOPs) and activated parameters as SMoE and Hash Layer. This allows for a fair comparison between Share-MoE, SMoE, and Hash Layer.

  • MaskMoE achieves a balance between representation diversity in MoE layers and thorough parameter training through the routing masking method we proposed. Similarly, in this section, MaskMoE also employs the shared expert architecture, which is identical to that of Share-MoE.
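The parity between Share-MoE and the 64-expert baselines described above can be checked with simple arithmetic. The sketch below measures expert size in units of one standard FFN, a unit introduced here only for illustration, and counts expert FFNs alone (router parameters are ignored).

```latex
\begin{aligned}
\text{Share-MoE, activated per token:} \quad & 1 \times 0.5 + 1 \times 0.5 = 1 \\
\text{SMoE / Hash Layer, activated per token:} \quad & 1 \times 1 = 1 \\
\text{Share-MoE, total per MoE layer:} \quad & (1 + 128) \times 0.5 = 64.5 \\
\text{SMoE / Hash Layer, total per MoE layer:} \quad & 64 \times 1 = 64
\end{aligned}
```

Under these assumptions the activated size matches exactly and the total expert size differs by half an FFN per MoE layer, which is consistent with the matching parameter counts reported in Table 1.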

| Model | BoolQ | Hellaswag | LAMBADA | PIQA | SIQA | StoryCloze | Arc-e | TriviaQA | WebQs |
|---|---|---|---|---|---|---|---|---|---|
| Standard Transformer | 56.02 | 40.73 | 52.55 | 67.62 | 40.79 | 63.55 | 51.43 | 7.44 | 4.97 |
| MoE_Layers = 1 | | | | | | | | | |
| SMoE | 55.26 | 43.66 | 53.46 | 68.28 | 41.15 | 64.14 | 51.98 | 10.00 | 6.99 |
| Hash Layer | 56.61 | 43.88 | 54.67 | 69.53 | 41.76 | 64.19 | 53.70 | 9.82 | 6.89 |
| Share-MoE | 58.87 | 43.00 | 53.00 | 68.39 | 41.45 | 64.19 | 51.92 | 10.54 | 6.89 |
| MaskMoE | 58.38 | 44.47 | 55.36 | 68.99 | 41.86 | 65.15 | 53.14 | 10.39 | 7.14 |
| MoE_Layers = 12 | | | | | | | | | |
| SMoE | 56.18 | 46.92 | 56.37 | 69.91 | 41.20 | 64.99 | 54.34 | 13.54 | 7.33 |
| Hash Layer | 57.16 | 46.61 | 56.68 | 71.00 | 40.83 | 65.47 | 55.05 | 12.97 | 5.95 |
| Share-MoE | 58.71 | 46.94 | 56.39 | 69.69 | 40.84 | 65.42 | 55.89 | 12.77 | 6.94 |
| MaskMoE | 58.32 | 47.46 | 57.46 | 70.62 | 41.91 | 65.69 | 55.35 | 15.20 | 6.74 |

Table 2: Performances of language models on downstream benchmarks. The best score is marked in bold, and the second best is underlined.

4.4 Main Results

In this section, we first present the model’s perplexity (PPL) on the Pile validation set. Then, following Touvron et al. (2023a); Brown et al. (2020); Su et al. (2024); Dai et al. (2024), we conduct tests on a large number of downstream benchmarks, including zero-shot for BoolQ Clark et al. (2019), HellaSwag Zellers et al. (2019), LAMBADA Paperno et al. (2016), PIQA Bisk et al. (2020), SIQA Sap et al. (2019), StoryCloze Mostafazadeh et al. (2016), and Arc-e Bhakthavatsalam et al. (2021), as well as 5-shot for TriviaQA Joshi et al. (2017) and WebQs Berant et al. (2013). Among them, TriviaQA and WebQs use exact match as the metric, while the remaining benchmarks are evaluated based on accuracy. For a fair comparison, we use the open-source evaluation tool lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) for assessment.

4.4.1 Perplexity Results

Table 1 shows the main results of language modeling on the Pile validation sets. With the same parameter activations maintained, MoE models consistently achieve improvements compared to the dense model. In comparison to SMoE and Hash Layer, under a single MoE layer setup, MaskMoE has reduced the PPL on the Pile Validation by 0.14 and 0.08 respectively. In a 12-MoE layers setup, the PPL reductions are 0.07 and 0.05 respectively. Compared to Share-MoE, MaskMoE decreases by 0.24 in single MoE layers and by 0.04 in 12-MoE layers. These results demonstrate the rationality of the MaskMoE design.

It is noteworthy that Share-MoE achieves worse results than SMoE and Hash Layer in the single MoE setup. However, MaskMoE consistently maintains superior performance, whether in single or multi-MoE layers. This indicates that MaskMoE exhibits stronger robustness across various settings.

4.4.2 Benchmark Results

Table 2 presents the performance of the models on downstream tasks. First, we observe that MoE models significantly outperform the dense model on downstream tasks. Second, we find that MaskMoE substantially outperforms the other MoE baselines in both the single-layer and 12-layer settings.

Compared to SMoE, MaskMoE outperforms in all 9 benchmarks with a single MoE layer and in 8 out of 9 benchmarks with 12-MoE layers. Against Hash Layer, MaskMoE excels in 7 out of 9 benchmarks with a single MoE layer and in 8 out of 9 with 12-MoE layers. Compared to Share-MoE, MaskMoE shows significant improvements in 7 out of 9 benchmarks with a single MoE layer and in 6 out of 9 with 12-MoE layers.

This indicates that the performance enhancement is achieved by using our proposed router mask to control the visibility of experts for different tokens. We believe the improvements are due to two factors. Firstly, compared to SMoE and Share-MoE, infrequent tokens are only routed to a single expert, ensuring more thorough training for these tokens. Secondly, in contrast to the Hash Layer, frequent tokens have multiple expert options, which promotes diversity in the representation of frequent tokens.

5 Analyses

We conduct further experiments to provide more insightful analyses on the proposed MaskMoE. Considering the constraints of computational resources, following previous works Roller et al. (2021); Xie et al. (2023b); Dai et al. (2022), unless otherwise specified, most of the analytical experiments are conducted under the setting of a single MoE layer.

5.1 Impact of Router Mask

$P$ $V_a$ $V_b$ PPL | $P$ $V_a$ $V_b$ PPL
0.2 8 1 6.518 | 0.4 16 1 6.509
0.4 8 1 6.506 | 0.4 4 1 6.511
0.6 8 1 6.518 | 0.4 64 1 6.549
/ 64 64 6.618 ♠ | / 1 1 6.558 ♣
/ 4 4 6.551 | / 8 8 6.566
Table 3: Impact of Router Mask. ♠ denotes SMoE, ♣ denotes Hash Layer. To make the differences in MaskMoE under various configurations more apparent, the values shown in the table are precise to three decimal places.

To clearly demonstrate the impact of the routing masking mechanism, the experiments in this subsection do not employ a shared expert structure. Instead, they adopt the same structure as SMoE and Hash Layer, with each MoE layer comprising 64 experts, each of which is a full-sized FFN layer.

The performance of MaskMoE is predominantly influenced by two hyperparameters: the boundary threshold $P$ for the division of frequent and infrequent tokens, and the maximum number of visible experts $V$ for tokens. As shown in Table 3, we conduct a parameter search over both and report the Pile validation PPL. $V_a$ represents the number of experts visible to frequent tokens, and $V_b$ represents the number of experts visible to infrequent tokens. Notably, when $V_a = V_b = 1$, MaskMoE is equivalent to the Hash Layer, and when $V_a = V_b = 64$, MaskMoE is equivalent to SMoE.
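The role of the two visibility settings can be illustrated with a minimal sketch. It assumes that the set of frequent token ids is already given (how the threshold P produces that set is not modeled here), that visible experts are drawn at random once per token, and that routing is top-1 over the visible experts; the function names `build_routing_mask` and `route_top1` are hypothetical and chosen only for this example.

```python
import random

def build_routing_mask(vocab_size, num_experts, frequent_ids, v_a, v_b, seed=0):
    """Return mask[token][expert] = True if the expert is visible to the token.

    Frequent tokens get v_a visible experts, all other tokens get v_b.
    Random selection of the visible experts is an assumption of this sketch.
    """
    rng = random.Random(seed)
    mask = []
    for tok in range(vocab_size):
        v = v_a if tok in frequent_ids else v_b
        visible = set(rng.sample(range(num_experts), v))
        mask.append([e in visible for e in range(num_experts)])
    return mask

def route_top1(router_logits, visible):
    """Pick the visible expert with the highest router logit (masked argmax)."""
    best = None
    for e, (logit, ok) in enumerate(zip(router_logits, visible)):
        if ok and (best is None or logit > router_logits[best]):
            best = e
    return best
```

With one visible expert per token, `route_top1` returns the same expert whatever the router logits are, which is the fixed-routing behavior of the Hash Layer; with all experts visible, the mask has no effect and the router alone decides, as in SMoE.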

We have the following observations. Firstly, we find that MaskMoE outperforms the Hash Layer. With $V_a = 8$ and $V_b = 1$ (at $P = 0.4$), MaskMoE reduces perplexity by 0.052 on the Pile validation set compared to the Hash Layer, i.e., $V_a = V_b = 1$. This indicates that MaskMoE’s use of multiple expert encodings for frequent tokens enhances representation diversity, which is beneficial for model performance.

Secondly, compared to SMoE, concentrating tokens into fewer visible experts, as in the settings $V_a = V_b = 8$ or $V_a = V_b = 4$, yields better outcomes. This indicates that routing tokens more densely to a reduced number of visible experts enhances thorough training and improves model performance. Moreover, we observe that the configuration $V_a = 8$ and $V_b = 1$ outperforms $V_a = V_b = 8$. This indicates that further reducing the number of visible experts is more beneficial for infrequent tokens.

Lastly, it is noteworthy that when $V_b = 1$, MaskMoE consistently outperforms both SMoE and the Hash Layer regardless of the value of $V_a$ (whether 4, 8, 16, or 64). This demonstrates the robustness of MaskMoE.
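The arithmetic behind these observations can be checked directly against the values in Table 3. The snippet below only re-reads the reported numbers; the dictionary layout and variable names are choices made for this illustration.

```python
# PPL values copied from Table 3; keys are (P, V_a, V_b), with P=None for rows marked "/".
table3 = {
    (0.2, 8, 1): 6.518,
    (0.4, 8, 1): 6.506,
    (0.6, 8, 1): 6.518,
    (0.4, 16, 1): 6.509,
    (0.4, 4, 1): 6.511,
    (0.4, 64, 1): 6.549,
    (None, 64, 64): 6.618,  # SMoE
    (None, 1, 1): 6.558,    # Hash Layer
    (None, 4, 4): 6.551,
    (None, 8, 8): 6.566,
}
smoe = table3[(None, 64, 64)]
hash_layer = table3[(None, 1, 1)]

# Gain of the (P=0.4, V_a=8, V_b=1) setting over the Hash Layer.
gain_over_hash = round(hash_layer - table3[(0.4, 8, 1)], 3)

# Every setting with V_b = 1 and V_a > 1, i.e. the MaskMoE configurations.
maskmoe_rows = {k: v for k, v in table3.items() if k[2] == 1 and k[1] > 1}
beats_both = all(v < hash_layer and v < smoe for v in maskmoe_rows.values())

print(gain_over_hash)  # 0.052
print(beats_both)      # True
```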

5.2 Impact of Shared Expert

To explore the impact of shared experts, we conduct ablation experiments on them in this subsection, and the results are shown in Table 4. We separately report the average benchmark scores and the PPL on the Pile validation set. It is noteworthy that after removing the shared experts, we adopt the same architecture as SMoE and Hash Layer, where the MoE layer consists of 64 full-sized FFNs.

We find that even without shared experts, MaskMoE outperforms SMoE, Hash Layer, and Share-MoE in both average benchmark score and Pile validation PPL. This further validates the soundness of MaskMoE’s design. Additionally, we observe that incorporating the shared expert structure further improves MaskMoE’s performance. We believe this is because the shared expert structure promotes greater specialization in token representation DeepSeek-AI et al. (2024).
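As a rough illustration of what the ablation toggles, the toy sketch below combines a shared expert, which processes every token, with a single routed expert chosen among the visible ones. It assumes the common formulation in which the shared expert's output is added to the gate-weighted output of the routed expert; the scalar "experts", the function name `moe_layer_output`, and the top-1 choice are simplifications for this example rather than a description of the actual layer.

```python
def moe_layer_output(x, shared_expert, routed_experts, gate_probs, visible):
    """Toy scalar MoE layer: shared expert plus the gated top-1 visible routed expert.

    Passing shared_expert=None mimics the "w/o Shared Experts" ablation.
    """
    candidates = [e for e, ok in enumerate(visible) if ok]
    chosen = max(candidates, key=lambda e: gate_probs[e])
    routed = gate_probs[chosen] * routed_experts[chosen](x)
    shared = shared_expert(x) if shared_expert is not None else 0.0
    return shared + routed

# Hypothetical toy experts acting on a scalar input.
shared = lambda x: 0.5 * x
routed = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
gate_probs = [0.2, 0.7, 0.1]
visible = [True, False, True]  # expert 1 is masked out for this token

with_shared = moe_layer_output(2.0, shared, routed, gate_probs, visible)
without_shared = moe_layer_output(2.0, None, routed, gate_probs, visible)
```

In this toy run, expert 1 has the highest gate probability but is masked, so expert 0 is selected; the shared expert contributes the same term to every token regardless of the mask.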

Model Avg (↑) PPL (↓)
SMoE 43.87 6.62
Hash Layer 44.56 6.56
Share-MoE 44.25 6.72
MaskMoE 44.99 6.48
w/o Shared Experts 44.85 6.51
Table 4: Average scores of benchmarks and perplexity on the Pile validation. The best score is marked in bold, and the second best is underlined.

5.3 Number of Experts

Figure 2: Comparison of MoE-based Transformers with different numbers of experts. (a) Non-Shared Expert Structures. (b) Shared Expert Structures.

To investigate the variation in model performance under different configurations of expert numbers, we conducted experiments with 16, 64, and 128 experts. The experimental results, as shown in Figure 2, reveal the following observations.

Firstly, we observe that for SMoE and Share-MoE, increasing the number of experts does not necessarily lead to a lower perplexity. In fact, when the number of experts reaches 128, the model exhibits a higher PPL compared to having only 64 experts. We believe this is because with more experts, the average number of tokens assigned to each expert decreases, leading to further underfitting of the model. In contrast, the performance of the Hash Layer and MaskMoE consistently improves with the increase in the number of experts, indicating that they are more favorable for scaling the number of experts.
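The dilution argument can be made concrete with a small calculation. It assumes, purely for illustration, that dynamic routing spreads the occurrences of a token evenly over all experts it can reach, and it uses a hypothetical count of 1280 training occurrences for one infrequent token.

```python
def occurrences_per_expert(token_count, visible_experts):
    """Average occurrences of one token seen by each reachable expert, assuming an even spread."""
    return token_count / visible_experts

count = 1280  # hypothetical number of training occurrences of one infrequent token
for num_experts in (16, 64, 128):
    dynamic = occurrences_per_expert(count, num_experts)  # every expert is visible
    fixed = occurrences_per_expert(count, 1)              # a single visible expert
    print(num_experts, dynamic, fixed)
# 16 80.0 1280.0
# 64 20.0 1280.0
# 128 10.0 1280.0
```

Under this even-spread assumption, each expert sees fewer occurrences of the token as the number of experts grows, whereas a token restricted to one visible expert keeps all of its occurrences in one place.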

Secondly, we observe that compared to the Hash Layer, MaskMoE consistently performs better across all configurations. Although the Hash Layer also enables experts to be adequately trained with the tokens assigned to them, MaskMoE excels in the diversity of representations for frequent tokens. This superiority in handling token diversity contributes to MaskMoE’s overall advantage over Hash Layer.

Lastly, we find that regardless of the number of experts, MaskMoE consistently maintains the best performance. This indicates that MaskMoE is robust to the choice of the number of experts.

6 Conclusions

In this paper, we analyze dynamic routing models such as SMoE, which can route the same token to multiple experts, potentially causing the model to underfit, especially for infrequent tokens. While fixed routing can mitigate this issue, it struggles to capture rich features for frequent tokens. To address these issues, we propose a routing masking method aimed at ensuring adequate learning of infrequent tokens while maintaining rich representations for frequent tokens. Extensive experiments demonstrate that the proposed MaskMoE outperforms SMoE, Hash Layer, and Share-MoE on various downstream tasks and significantly improves perplexity (PPL) on the Pile dataset.

7 Limitations

In this paper, we simply categorize tokens as frequent or infrequent according to their frequency. This binary classification is somewhat rigid, and a smoother scheme could potentially yield better results. For example, when tokens are sorted in descending order of frequency, the number of visible experts could decrease progressively along with the frequency. Due to computational constraints, we do not conduct further experiments and leave the exploration of smoother partitioning methods for future work.
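One way such a smoother scheme might look is sketched below. It is purely hypothetical and has not been evaluated: the function name `visible_experts_for_rank`, the log-linear interpolation, and the use of frequency rank (0 for the most frequent token) are all assumptions made only to illustrate the idea.

```python
import math

def visible_experts_for_rank(rank, vocab_size, v_max, v_min=1):
    """Hypothetical schedule: number of visible experts for a token of a given frequency rank.

    Interpolates log-linearly from v_max (rank 0, most frequent) to v_min (rarest token).
    """
    if vocab_size == 1:
        return v_max
    t = rank / (vocab_size - 1)
    v = math.exp((1 - t) * math.log(v_max) + t * math.log(v_min))
    return max(v_min, min(v_max, round(v)))

# Example: a toy vocabulary of 100 tokens with at most 8 visible experts.
schedule = [visible_experts_for_rank(r, 100, 8) for r in range(100)]
```

The two-class setting studied in this paper corresponds to replacing this gradual schedule with a single step from $V_a$ to $V_b$ at the frequency boundary.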

References

  • Anil et al. (2023) Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, et al. 2023. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805.
  • Bahl et al. (1983) Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Trans. Pattern Anal. Mach. Intell., 5(2):179–190.
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1533–1544. ACL.
  • Bhakthavatsalam et al. (2021) Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge. CoRR, abs/2102.03315.
  • Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. CoRR, abs/1904.10509.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2924–2936. Association for Computational Linguistics.
  • Dai et al. (2024) Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066.
  • Dai et al. (2022) Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Stablemoe: Stable routing strategy for mixture of experts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7085–7095. Association for Computational Linguistics.
  • DeepSeek-AI et al. (2024) DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, Hao Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, Tao Wang, Tian Pei, Tian Yuan, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, and Xiaowen Sun. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
  • Du et al. (2022a) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022a. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR.
  • Du et al. (2022b) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b. GLM: general language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 320–335. Association for Computational Linguistics.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39.
  • Gao et al. (2021) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The pile: An 800gb dataset of diverse text for language modeling. CoRR, abs/2101.00027.
  • Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
  • He et al. (2023) Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, and Dacheng Tao. 2023. Merging experts into one: Improving computational efficiency of mixture of experts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 14685–14691. Association for Computational Linguistics.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Huang et al. (2024) Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng. 2024. Harder tasks need more experts: Dynamic routing in moe models. CoRR, abs/2403.07652.
  • Jacobs et al. (1991) Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for Computational Linguistics.
  • Katz (1987) Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust. Speech Signal Process., 35(3):400–401.
  • Kim et al. (2021) Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andrés Felipe Cruz-Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient moe training for multitask multilingual models. CoRR, abs/2109.10465.
  • Kneser and Ney (1995) Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’95, Detroit, Michigan, USA, May 08-12, 1995, pages 181–184. IEEE Computer Society.
  • Koranda et al. (2018) Mark Koranda, Martin Zettersten, and Maryellen C. MacDonald. 2018. Word frequency can affect what you choose to say. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA, July 25-28, 2018. cognitivesciencesociety.org.
  • Lepikhin et al. (2021) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Lian et al. (2024) Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Peng Liu, Hui Chen, and Guiguang Ding. 2024. Scaffold-bpe: Enhancing byte pair encoding with simple and effective scaffold token removal. arXiv preprint arXiv:2404.17808.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Mikolov et al. (2010) Tomás Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048. ISCA.
  • Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew J. Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  • Rajbhandari et al. (2022) Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18332–18346. PMLR.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 3505–3506. ACM.
  • Roller et al. (2021) Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 17555–17566.
  • Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Socialiqa: Commonsense reasoning about social interactions. CoRR, abs/1904.09728.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  • Su et al. (2024) Zhenpeng Su, Zijia Lin, Xue Bai, Hui Chen, Songlin Hu, Wei Zhou, Guiguang Ding, and Xing Wu. 2024. MiLe loss: a new loss for mitigating the bias of learning difficulties in generative language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 250–262, Mexico City, Mexico. Association for Computational Linguistics.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  • Xie et al. (2023a) Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, and Adams Wei Yu. 2023a. Doremi: Optimizing data mixtures speeds up language model pretraining. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Xie et al. (2023b) Yuan Xie, Shaohan Huang, Tianyu Chen, and Furu Wei. 2023b. Moec: Mixture of expert clusters. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 13807–13815. AAAI Press.
  • Xiong et al. (2024) Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Jianwei Niu, and Guiguang Ding. 2024. Temporal scaling law for large language models. arXiv preprint arXiv:2404.17785.
  • Yang et al. (2019) Brandon Yang, Gabriel Bender, Quoc V. Le, and Jiquan Ngiam. 2019. Condconv: Conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 1305–1316.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.
  • Zeng et al. (2023) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: an open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  • Zhao et al. (2024) Hao Zhao, Zihan Qiu, Huijia Wu, Zili Wang, Zhaofeng He, and Jie Fu. 2024. Hypermoe: Towards better mixture of experts via transferring among experts. arXiv preprint arXiv:2402.12656.
  • Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.