Mixture-of-Experts with Expert Choice Routing

Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew; Chen, Zhifeng; Le, Quoc; Laudon, James

Computer Science > Machine Learning

arXiv:2202.09368 (cs)

[Submitted on 18 Feb 2022 (v1), last revised 14 Oct 2022 (this version, v2)]

Title:Mixture-of-Experts with Expert Choice Routing

Authors:Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon

View PDF

Abstract:Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2202.09368 [cs.LG]
	(or arXiv:2202.09368v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2202.09368

Submission history

From: Yanqi Zhou [view email]
[v1] Fri, 18 Feb 2022 17:46:11 UTC (350 KB)
[v2] Fri, 14 Oct 2022 00:08:24 UTC (526 KB)

Computer Science > Machine Learning

Title:Mixture-of-Experts with Expert Choice Routing

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Mixture-of-Experts with Expert Choice Routing

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators