DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs

Zhen Tan*  Daize Dong*  Xinyu Zhao  Jie Peng  Yu Cheng  Tianlong Chen
School of Computing and Augmented Intelligence, Arizona State University
Shanghai Artificial Intelligence Laboratory
Department of Computer Science, Chinese University of Hong Kong
School of Artificial Intelligence and Data Science, University of Science and Technology of China
Department of Computer Science, University of North Carolina at Chapel Hill
[email protected]  [email protected]  [email protected]
{xinyu,tianlong}@cs.unc.edu
[email protected]
* Equal contribution.
Abstract

In this paper, we introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs) by dynamically expanding, activating, or skipping layers using a sophisticated routing policy based on layerwise feature similarity. Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples. Our framework is integrated with the Supervised Fine-Tuning (SFT) stage, eliminating the need for resource-intensive Continual Pre-Training (CPT). Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency. Our work offers a promising direction for building efficient yet powerful LLMs. We will release our implementation and model weights upon acceptance.



1 Introduction

Figure 1: (a) DLO structure, which resembles human brain activity in a math problem example Koechlin et al. (2003), where the primary neurons perceive numbers, secondary neurons understand operations, and high-order neurons calculate the results. (b) Layer-wise token similarity and distribution.

Large Language Models (LLMs) Achiam et al. (2023); Team et al. (2023) have shown remarkable success across various natural language processing (NLP) tasks Hadi et al. (2023); Tan et al. (2024); Li et al. (2024b, a), leveraging their vast capacity to capture complex patterns in data. Traditional scaling of these models has predominantly focused on horizontal expansion, as seen in Mixture-of-Experts (MoE) architectures Shazeer et al. (2017); Fedus et al. (2022b); Lepikhin et al. (2020), where the width of the model is increased by adding more experts. This approach primarily optimizes parameter usage and computational cost by activating a fixed portion of parameters conditioned on the given input Fedus et al. (2022a).

However, the potential for vertical expansion remains underexplored. Inspired by how the human brain allocates more neurons for complex tasks and forms deeper neural chains Baddeley (1992); Koechlin et al. (2003), we propose focusing on vertical scaling. Our method dynamically expands, activates, or skips layers to optimize model depth and reduce redundancy, as shown in Figure 1 (a).

There are three critical challenges in vertically scaling LLMs: ❶ Optimization Complexity. Dynamically adding or pruning layers makes the training process hard to optimize. Finding the optimal set of such operations has been shown to be an NP-hard problem Glorot and Bengio (2010); Hestness et al. (2017), while an approximation method Wang et al. (2023a) has shown only limited improvement. ❷ Computation Cost. Processing deeper networks carries an inherent computational cost: each additional layer adds to the overall latency and resource consumption. ❸ Feature Collapse. Our analysis in Figure 1 (b) reveals that for a significant number of inputs, the representations across consecutive layers exhibit substantial similarity, suggesting that many layers may be redundant for certain samples.

To address these challenges, we propose Dynamic Layer Operation (DLO), which consists of three operations, (i) expansion, (ii) activation, and (iii) skipping, for dynamic vertical scaling of LLMs without a proportional increase in computational cost. Our specific designs are as follows: ❶ Expansion: Additional layers are dynamically expanded from existing ones, easing optimization complexity. ❷ Activation & Skip: Feature similarity guides the activation and skipping of layers. We propose similarity-induced labels to train the router that controls these operations. ❸ Adaptive FLOPs: Sparsity settings that vary across layers facilitate adaptive FLOPs for different tokens, maintaining efficiency. ❹ Enhanced Generalizability: Layer-specific learning rates, based on sparsity, further improve the model’s ability to generalize across tasks. Note that all modules are trained during the Supervised Fine-Tuning (SFT) stage, eliminating the need for Continual Pre-Training (CPT) and simplifying the training process. Our primary contributions are as follows:

  • Method. We introduce a novel method, DLO, for vertically scaling LLMs by dynamically expanding, activating, or skipping layers.

  • Performance & Efficiency. Through rigorous experiments, we demonstrate that DLO not only surpasses the performance of the original unscaled models but also achieves comparable results to densely expanded models with significantly enhanced efficiency.

  • Applicability. By fine-tuning on language understanding, math, and coding tasks, we demonstrate DLO’s effectiveness across multiple NLP tasks.

2 Related Work

2.1 Mixture-of-Experts (MoE)

MoE architectures have emerged as a promising approach for enhancing the efficiency and scalability of LLMs Shazeer et al. (2017). Traditional neural networks activate all parameters for every input, leading to significant computational overhead, particularly as models scale up. In contrast, MoE models activate only a subset of parameters for each input, optimizing computational resource usage and enabling models to scale to billions of parameters without a corresponding increase in computational cost per input Lepikhin et al. (2020); Fedus et al. (2022b); Zoph et al. (2022); Team (2023a); Jiang et al. (2024); Zhu et al. (2024). This selective activation makes MoE highly efficient for both training and inference by focusing on horizontal expansion and adding more experts. However, MoE’s primary aim is to optimize width, potentially leaving layer redundancy unaddressed. Our Dynamic Layer Operation (DLO) approach complements MoE by focusing on vertical scaling through dynamic layer expansion and activation, targeting depth scalability and reducing potential feature redundancy.

2.2 Efficient Model Stacking

Model stacking is a common ensemble learning technique that improves predictive performance by combining multiple models to leverage their complementary strengths Ting and Witten (1997); Chen et al. (2015). In the context of LLMs, stacking can involve integrating various models into a hierarchical structure, where outputs from one model serve as inputs to another, capturing a broader range of features and patterns Dabre and Fujita (2019); Chen et al. (2021a); Wang et al. (2023b); Kim et al. (2023).

Recent advancements have focused on progressively stacking pre-trained transformer or self-attention layers to create composite language models Gong et al. (2019); Gu et al. (2020); Shen et al. (2022); Evci et al. (2022); Yao et al. (2023); Du et al. (2024); Wu et al. (2024). This approach reduces training costs by reusing pre-trained components. However, the increased depth and complexity of stacked models lead to high inference latency.

To mitigate this issue, layer-skipping methods have been developed, allowing models to “early exit” using additional layer-wise classifiers, thereby reducing the number of layers processed during inference Wang et al. (2022); Chen et al. (2023); Zhang et al. . More recently, conditional computation techniques have been proposed to dynamically skip layers based on token-specific conditions, further enhancing efficiency Ainslie et al. (2023); Raposo et al. (2024). However, these methods often require modifications during the pre-training stage, adding computational complexity and limiting their application to existing pre-trained LLMs. In contrast, our DLO method focuses on efficiency and scalability through dynamic vertical scaling within a single model during the SFT stage. It provides a comprehensive, high-performance solution to scaling LLMs without the extensive computational demands associated with stacked ensembles.

3 Methodology

Figure 2: Layer extension with initialization strategies.

In this section, we introduce the Dynamic Layer Operation (DLO) framework for efficient vertical scaling of LLMs. DLO consists of three key operations: expansion, activation, and skipping. These operations dynamically adjust the model structure during the Supervised Fine-Tuning (SFT) phase to optimize computational efficiency and improve performance. A pseudocode-style description is included in Appendix A.

3.1 Layer Expansion

To facilitate dynamic depth adjustment, we introduce a group-based layer expansion strategy. Suppose the LLM has $R$ transformer layers, which we group into $P$ groups with $Q$ layers each, such that $R = P \times Q$. Each group is expanded to $Q' = Q + q$ layers, where $q$ is the number of additional layers introduced per group. The resulting number of layers will be $R' = P \times Q'$.
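
As a concrete illustration of the group-based expansion, the following sketch computes the expanded depth and marks which positions in each group are newly added. The function name `expanded_layout` and the example sizes (R = 32, P = 8, q = 1) are assumptions chosen for illustration, not settings taken from the paper.

```python
def expanded_layout(R: int, P: int, q: int):
    """Return (R', layout) for a model with R layers split into P groups.

    layout is a list of P groups; each group is a list of
    (position_in_group, is_new) pairs of length Q' = Q + q.
    """
    if R % P != 0:
        raise ValueError("R must be divisible by P so that R = P * Q")
    Q = R // P          # original layers per group
    Q_new = Q + q       # expanded layers per group, Q' = Q + q
    layout = [
        [(j, j > Q) for j in range(1, Q_new + 1)]  # positions Q+1..Q+q are new
        for _ in range(P)
    ]
    return P * Q_new, layout


R_new, layout = expanded_layout(R=32, P=8, q=1)
print(R_new)      # 40
print(layout[0])  # [(1, False), (2, False), (3, False), (4, False), (5, True)]
```

With these hypothetical sizes, each group of four original layers gains one new layer, so the depth grows from 32 to 40.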

Let $\mathcal{G}_i$ denote the $i$-th group with layers $\mathcal{L}_{i1}, \mathcal{L}_{i2}, \ldots, \mathcal{L}_{iQ}$. The expanded group $\mathcal{G}_i'$ will contain layers $\mathcal{L}_{i1}, \mathcal{L}_{i2}, \ldots, \mathcal{L}_{i(Q+q)}$. The expanded layers are initialized using a policy $\Pi$, and we consider several initialization strategies:

  • Random Initialization ($\Pi_{\text{rand}}$): Initialize the new layers’ weights $\theta'_{ij}$ using Xavier initialization Glorot and Bengio (2010).

    $$\theta'_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right),$$

    for all $j \in \{Q+1, Q+2, \ldots, Q+q\}$, where $\mathcal{U}$ denotes the uniform distribution, $n_{in}$ is the number of input units, and $n_{out}$ is the number of output units in the layer.

  • Copy from Previous Layer ($\Pi_{\text{copy}}$): Copy the parameters from the preceding layer.

    $$\theta'_{ij} = \theta_{i(Q+q-1)}, \quad \forall j \in \{Q+1, \ldots, Q+q\}.$$
  • Identity Initialization ($\Pi_{\text{Identity}}$) Wu et al. (2024): Copy from the preceding layer but set the output linear matrix of the multi-head self-attention (MHSA) to zero.

    a. Copy the parameters of the previous layer:

       $$\theta'_{ij} = \theta_{i(Q+q-1)}, \quad \forall j \in \{Q+1, \ldots, Q+q\}.$$

    b. Set the weights of the output linear layer $W'_{\text{out}}$ in the MHSA to zero: $W'_{\text{out}} = 0$.

    In this way, the output of the expanded layers will preserve the features from the original layers.

  • Linear Merge ($\Pi_{\text{linear}}$): Merge from the preceding $\tau$ layers using a linear function.

    $$\theta'_{ij} = \sum_{k=1}^{\tau} \alpha_k \theta_{i(Q+q-k)}, \quad \sum_{k=1}^{\tau} \alpha_k = 1.$$
  • Spherical Linear Interpolation (SLERP) ($\Pi_{\text{slerp}}$) Shoemake (1985): Merge from the preceding $\tau$ layers using SLERP. The SLERP method smoothly interpolates between two weight vectors on a unit sphere, maintaining constant velocity. The interpolation between two weight vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as:

    $$\text{SLERP}(\mathbf{u}, \mathbf{v}, \alpha) = \frac{\sin((1-\alpha)\Omega)}{\sin(\Omega)} \mathbf{u} + \frac{\sin(\alpha\Omega)}{\sin(\Omega)} \mathbf{v},$$

    where $\Omega$ is the angle between $\mathbf{u}$ and $\mathbf{v}$:

    $$\Omega = \arccos\left(\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}\right),$$

    and $\alpha \in [0, 1]$ is the interpolation parameter.

    In our context, for the new layer $j$, the parameters are initialized by interpolating between the weights of the previous layers $\theta_{i(Q+q-1)}$ and $\theta_{i(Q+q-\tau)}$:

    $$\theta'_{ij} = \text{SLERP}(\theta_{i(Q+q-1)}, \theta_{i(Q+q-\tau)}, \alpha),$$

    where $\alpha$ controls the interpolation. This ensures a smooth transition between layers, aiding in gradient flow and stable training.
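
The SLERP formula above can be sketched in plain Python on flattened weight vectors. This is a minimal illustration rather than the authors' implementation: the function name `slerp`, the `eps` threshold, and the fallback to ordinary linear interpolation for nearly parallel vectors are assumptions of this sketch.

```python
import math

def slerp(u, v, alpha, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding error before arccos.
    cos_omega = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Degenerate angle: fall back to linear interpolation
        # (an assumption of this sketch; the text does not cover this case).
        return [(1 - alpha) * a + alpha * b for a, b in zip(u, v)]
    w_u = math.sin((1 - alpha) * omega) / math.sin(omega)
    w_v = math.sin(alpha * omega) / math.sin(omega)
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Halfway between two orthogonal unit vectors: both coordinates are about 0.7071.
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike a plain average, which would give (0.5, 0.5), the interpolated vector here keeps unit norm.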

We conduct comprehensive experiments on the choice of the policy $\Pi$ in Section 4.3.
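
The simpler policies from the list above (random, copy, identity, and linear merge) can be sketched on a toy layer representation. Here a layer is assumed to be a dictionary mapping parameter names to nested lists; the key `"attn_out"` for the MHSA output projection and all function names are hypothetical and chosen only for illustration.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng=random):
    """Sample an n_out x n_in matrix from U(-a, a), a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

def copy_init(prev_layer):
    """Pi_copy: duplicate every parameter matrix of the preceding layer."""
    return {name: [row[:] for row in w] for name, w in prev_layer.items()}

def identity_init(prev_layer, out_proj_key="attn_out"):
    """Pi_Identity: copy the preceding layer, then zero the MHSA output projection."""
    new_layer = copy_init(prev_layer)
    new_layer[out_proj_key] = [[0.0] * len(row) for row in new_layer[out_proj_key]]
    return new_layer

def linear_merge(prev_layers, alphas):
    """Pi_linear: weighted sum of the preceding tau layers, with sum(alphas) == 1."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must sum to 1")
    merged = {}
    for name in prev_layers[0]:
        rows = len(prev_layers[0][name])
        cols = len(prev_layers[0][name][0])
        merged[name] = [
            [sum(a * layer[name][r][c] for a, layer in zip(alphas, prev_layers))
             for c in range(cols)]
            for r in range(rows)
        ]
    return merged


layer_a = {"attn_out": [[1.0, 2.0]], "mlp": [[3.0, 4.0]]}
layer_b = {"attn_out": [[3.0, 4.0]], "mlp": [[5.0, 6.0]]}
print(identity_init(layer_a)["attn_out"])                  # [[0.0, 0.0]]
print(linear_merge([layer_a, layer_b], [0.5, 0.5])["mlp"])  # [[4.0, 5.0]]
```

In a real model the same operations would be applied to framework tensors rather than nested lists; the nested-list form is used here only to keep the sketch dependency-free.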

3.2 Layer Activation & Skipping

DLO dynamically skips the multi-layer perceptron (MLP) module within the transformer layer $\mathcal{L}_i$ for input tokens. To achieve this, we use a linear router to determine the activation of layers. Suppose we have the set of token embeddings in a sequence of length $S$ for a given layer $\mathcal{L}_i$, that is, $\mathbf{h}_i = \{\mathbf{h}^s_i \mid s \in \mathbb{N}^*, s \leq S\}$; in what follows we omit the superscript $s$ for readability. Considering feature redundancy, we use the router weights $W_i$ to process the input token $\mathbf{h}_i$ and obtain the decision score $r_i$, which is given by:

$$r_i = \frac{\beta + (2\sigma(\mathbf{h}_i W_i) - 1)\gamma}{2} \in \Big(\frac{\beta-\gamma}{2}, \frac{\beta+\gamma}{2}\Big), \tag{1}$$

where $\sigma$ is the sigmoid function, and $\beta$ and $\gamma$ are hyperparameters controlling the output range. During the inference stage, this score $r_i$ determines whether layer $\mathcal{L}_i$ is active: the layer is activated if and only if $r_i \geq \frac{\beta}{2}$; otherwise it is skipped. The final activated output for the layer is:

$$\mathbf{h}_{i+1} = \begin{cases} r_i \cdot \mathcal{M}_i \circ \mathcal{A}_i(\mathbf{h}_i) & \text{if } r_i \geq \frac{\beta}{2}, \\ \mathcal{A}_i(\mathbf{h}_i) & \text{otherwise}, \end{cases} \tag{2}$$

where $\mathcal{M}_i$ and $\mathcal{A}_i$ are the MLP and the attention modules within layer $\mathcal{L}_i$, respectively. This activation & skipping mechanism aims to encourage the utilization of the most relevant layers, thus reducing unnecessary computation. In this paper we initialize the router weights $W_i$ as zeros and set $\beta = 2.0, \gamma = 0.05$, so that $r_i = 1.0$ on the first step and $r_i \in (0.975, 1.025)$ during training. This ensures benign initialization for activated tokens and avoids excessive disturbance on activated outputs brought by the decision scores.
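
The routing rule in Eqs. (1) and (2) can be sketched for a single token as follows. This is a simplified illustration under stated assumptions: `attention` and `mlp` are placeholder callables standing in for $\mathcal{A}_i$ and $\mathcal{M}_i$, residual connections are omitted because Eq. (2) does not show them, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decision_score(h, w, beta=2.0, gamma=0.05):
    """Eq. (1): r = (beta + (2 * sigmoid(h . w) - 1) * gamma) / 2."""
    logit = sum(a * b for a, b in zip(h, w))
    return (beta + (2.0 * sigmoid(logit) - 1.0) * gamma) / 2.0

def layer_forward(h, w, attention, mlp, beta=2.0, gamma=0.05):
    """Eq. (2): always run attention; run the MLP only if r >= beta / 2."""
    r = decision_score(h, w, beta, gamma)
    a = attention(h)
    if r >= beta / 2.0:
        return [r * x for x in mlp(a)]  # active: MLP output scaled by r
    return a                            # skipped: attention output passes through


# Toy stand-ins for the attention and MLP modules (assumptions for illustration).
identity = lambda x: list(x)
double = lambda x: [2.0 * v for v in x]

h = [1.0, -2.0]
print(decision_score(h, [0.0, 0.0]))                    # 1.0 (zero-initialized router)
print(layer_forward(h, [0.0, 0.0], identity, double))   # [2.0, -4.0], MLP active
print(layer_forward(h, [-1.0, 0.0], identity, double))  # [1.0, -2.0], MLP skipped
```

With zero router weights the sigmoid returns 0.5, so the score is exactly $\beta/2 = 1.0$ and every token starts out active with an unscaled MLP output.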

3.3 Training and Integration

Figure 3: Training pipeline of DLO, consisting of the downstream task loss and an auxiliary router skip loss supervised by generated router labels.

Similarity-induced Label & Router Skip Loss.

Given a pre-defined overall sparsity $\rho$, we define a sparsity factor $\rho_i$ for each layer $\mathcal{L}_i$ that controls the layer-wise activated tokens. To determine the status of token $\mathbf{h}_i^s$, we utilize predicted router labels $\hat{\Lambda}^s = \{\hat{\lambda}_1^s, \dots, \hat{\lambda}_i^s, \dots, \hat{\lambda}_{R'}^s\}$, which are obtained from the decision scores $r_i^s$:

$$\hat{\lambda}_i^s = \begin{cases} 1 & \text{if } r_i^s \in \mathrm{Top}_{\lfloor (1-\rho_i) S \rfloor}\big(\{r_i^s\}_{s=1}^{S}\big), \\ 0 & \text{otherwise}. \end{cases} \tag{3}$$

Here, $\hat{\lambda}_i^s = 1$ indicates that layer $\mathcal{L}_i$ is predicted to be activated, and $\hat{\lambda}_i^s = 0$ that it is predicted to be skipped. To train the routers, we utilize the supervised router labels $\tilde{\Lambda}^s = \{\tilde{\lambda}_1^s, \dots, \tilde{\lambda}_i^s, \dots, \tilde{\lambda}_{R'}^s\}$ to guide the learning of sparsity, i.e., to skip or not. The router labels $\tilde{\lambda}_i^s$ of layer $\mathcal{L}_i$ at training step $t$ are determined by the following procedure:

  1. The cosine feature similarity of layer $i$ is calculated through features across MLP $\mathcal{M}_i$:

    $$\mu_i^s = \frac{\mathcal{A}_i(\mathbf{h}_i^s) \cdot \mathcal{M}_i \circ \mathcal{A}_i(\mathbf{h}_i^s)}{\|\mathcal{A}_i(\mathbf{h}_i^s)\| \, \|\mathcal{M}_i \circ \mathcal{A}_i(\mathbf{h}_i^s)\|} \in [0, 1]. \tag{4}$$
  2. The similarities are sorted over all the layers, and the similarity-induced label is given as follows:

    $$\tilde{\lambda}_i^s = \begin{cases} 1 & \text{if } \mu_i^s \in \mathrm{Bottom}_{\lfloor (1-\rho) R' S \rfloor}\big(\{\mu_i^s\}_{i,s=1}^{R',S}\big), \\ 0 & \text{otherwise}, \end{cases} \tag{5}$$

    where $\mathrm{Bottom}_{\lfloor (1-\rho) R' S \rfloor}$ selects the $\lfloor (1-\rho) R' S \rfloor$ tokens with the lowest cosine similarity; their labels are set to $1$, meaning they are expected to be activated.

  3. A skip loss $\mathcal{L}_{\text{skip}}$ based on the Binary-Cross-Entropy loss $\mathcal{L}_{\text{BCE}}$ is incorporated to guide the learning of the router attached to each layer:

    $$\mathcal{L}_{\text{skip}} = \frac{1}{R'S} \sum_{i,s=1}^{R',S} \mathcal{L}_{\text{BCE}}(\sigma(\mathbf{h}_i^s W_i), \tilde{\lambda}_i^s). \tag{6}$$
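
The label construction in Eqs. (3) to (5) can be sketched as below. This is an illustrative reading of the equations, not the authors' code: the function names are hypothetical, `mu[i][s]` is assumed to hold the similarity of token `s` at layer `i`, and ties are broken by index order, which the text does not specify.

```python
import math

def cosine_similarity(x, y):
    """Eq. (4): cosine similarity between MLP input and MLP output features."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def similarity_labels(mu, rho):
    """Eq. (5): the floor((1 - rho) * R' * S) lowest similarities, pooled over
    all layers and tokens, get label 1 (activate); the rest get 0 (skip)."""
    n_layers, n_tokens = len(mu), len(mu[0])
    k = math.floor((1.0 - rho) * n_layers * n_tokens)
    flat = sorted((mu[i][s], i, s) for i in range(n_layers) for s in range(n_tokens))
    labels = [[0] * n_tokens for _ in range(n_layers)]
    for _, i, s in flat[:k]:
        labels[i][s] = 1
    return labels

def predicted_labels(scores, rho_i):
    """Eq. (3): within one layer, the top floor((1 - rho_i) * S) scores get label 1."""
    k = math.floor((1.0 - rho_i) * len(scores))
    top = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)[:k]
    labels = [0] * len(scores)
    for s in top:
        labels[s] = 1
    return labels


mu = [[0.9, 0.2, 0.6],    # layer 1: similarities for three tokens
      [0.1, 0.8, 0.95]]   # layer 2
print(similarity_labels(mu, rho=0.5))                      # [[0, 1, 1], [1, 0, 0]]
print(predicted_labels([1.01, 0.98, 1.02, 0.99], rho_i=0.5))  # [1, 0, 1, 0]
```

Because the supervised labels are pooled over all layers, a layer whose MLP changes its inputs very little (high similarity) ends up with few active tokens, while the per-layer budget in Eq. (3) is applied independently within each layer.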

Given the task-specific loss $\mathcal{L}_{\text{task}}$, the overall loss function for DLO training is:

$$\mathcal{L}=\mathcal{L}_{\text{task}}+\mathcal{L}_{\text{skip}}. \tag{7}$$
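The label generation of Equation (5) and the losses of Equations (6)-(7) can be sketched in plain Python. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the similarities and router probabilities are assumed to be given as nested lists of shape $R^{\prime}\times S$, and ties between equal similarities are broken by sort order.

```python
import math

def router_labels(mu, rho):
    """Eq. (5): label 1 for the floor((1 - rho) * R' * S) entries with the
    lowest cosine similarity over all layers and tokens, 0 otherwise."""
    flat = [(val, i, s) for i, row in enumerate(mu) for s, val in enumerate(row)]
    k = math.floor((1 - rho) * len(flat))
    labels = [[0] * len(row) for row in mu]
    for _, i, s in sorted(flat)[:k]:
        labels[i][s] = 1
    return labels

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one probability p and one label y."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def skip_loss(probs, labels):
    """Eq. (6): mean BCE between router outputs sigma(h W) and the labels."""
    pairs = [(p, y) for pr, lr in zip(probs, labels) for p, y in zip(pr, lr)]
    return sum(bce(p, y) for p, y in pairs) / len(pairs)

def total_loss(task_loss, probs, labels):
    """Eq. (7): task loss plus skip loss."""
    return task_loss + skip_loss(probs, labels)
```

With two layers, two tokens, and $\rho=0.5$, `router_labels([[0.9, 0.2], [0.5, 0.99]], 0.5)` marks the two least similar entries (0.2 and 0.5) as active and leaves the other two to be skipped.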

Skip Rate Dynamics.

The redundancy is distributed unevenly across layers, as evidenced in Figure 1 (b). To account for this, at each training step $t$ of the $T$ total steps we update the next-step skip rate $\rho_{i,t+1}$ of every layer, starting from the initial skip rate $\rho_{i,1}=\rho$. The layer-wise sparsity factor $\rho_{i,t+1}$ is calculated from the router labels as follows:

$$\rho_{i,t+1}=\frac{\sum_{s=1}^{S}\tilde{\lambda}_{i,t}^{s}}{S}, \tag{8}$$

where $\tilde{\lambda}_{i,t}^{s}$ is the supervised router label of the $s$-th token in layer $\mathcal{L}_{i}$ at step $t$. Additionally, we anneal the skip rate to ensure a warm start. During training, the overall skip rate gradually increases from an initial low value $\bar{\rho}$ to the target sparsity level $\rho$ over a predefined number of steps $T^{\prime}$. The overall skip rate $\rho^{t}$ at step $t$ is given by:

$$\rho^{t}=\begin{cases}\bar{\rho}+\left(\rho-\bar{\rho}\right)\frac{t}{T^{\prime}} & \text{if } t\leq T^{\prime},\\ \rho & \text{otherwise},\end{cases} \tag{9}$$

where we set $\bar{\rho}=0$. This annealing process helps the model to progressively adapt to higher sparsity levels with smoother changes, leading to more stable training and better convergence.
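The two schedules can be written down directly. The sketch below is illustrative only: the function names and the example step counts are assumptions, and `layer_rate` implements Equation (8) literally as the mean of one layer's router labels.

```python
def annealed_skip_rate(t, warm_steps, rho, rho_bar=0.0):
    """Eq. (9): linear ramp from rho_bar to rho over warm_steps, then constant."""
    if t <= warm_steps:
        return rho_bar + (rho - rho_bar) * t / warm_steps
    return rho

def layer_rate(labels_row):
    """Eq. (8) as written: mean of the router labels of a single layer."""
    return sum(labels_row) / len(labels_row)

# Example with a hypothetical warm-up of 100 steps and a target rate of 0.2.
schedule = [annealed_skip_rate(t, 100, 0.2) for t in (0, 50, 100, 500)]
```

Under these example values the overall rate starts at 0, reaches half the target midway through the warm-up, and stays at the target from step 100 onward.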

Layer-Wise Learning Rates.

DLO also employs layer-wise learning rates $\zeta_{i,t}$, adjusted based on sparsity to promote generalizability. The learning rate for each layer is defined as:

$$\zeta_{i,t}=\bar{\zeta}\cdot\frac{1-\rho_{i,t}}{1-\rho^{t}}, \tag{10}$$

where $\bar{\zeta}$ is the base learning rate. It is noteworthy that all DLO components are trained during the Supervised Fine-Tuning (SFT) stage in an end-to-end manner, eliminating the need for Continual Pre-Training (CPT). By integrating DLO, we achieve dynamic vertical scaling, optimizing model depth, and maintaining high performance with reduced computational demands.
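As a small worked example of Equation (10), the sketch below scales a base learning rate by the ratio of a layer's active fraction to the overall active fraction. The function name and the example rates are assumptions chosen for illustration.

```python
def layer_lr(base_lr, rho_layer, rho_overall):
    """Eq. (10): base_lr * (1 - rho_layer) / (1 - rho_overall)."""
    if not 0 <= rho_overall < 1:
        raise ValueError("rho_overall must lie in [0, 1)")
    return base_lr * (1 - rho_layer) / (1 - rho_overall)

# A layer whose rate equals the overall rate keeps the base learning rate;
# a layer with a higher rate receives a proportionally smaller one.
same = layer_lr(2e-5, 0.10, 0.10)
smaller = layer_lr(2e-5, 0.28, 0.10)
```

With a base rate of 2e-5 and an overall rate of 0.1, a layer at rate 0.28 is updated with 0.72 / 0.9 = 0.8 of the base learning rate.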

3.4 Adaptive Inference-Time FLOPs

During inference, DLO uses layer-specific sparsity settings to maintain computational efficiency and to adapt the floating-point operations (FLOPs) spent on different tokens. In other words, the predicted sparsity $\hat{\rho}_{i}$ is determined entirely by the router according to Equations (1)-(2). The adaptive FLOPs are computed as:

$$\text{FLOPs}_{i}=\hat{\rho}_{i}\cdot\text{FLOPs}_{\text{full}}, \tag{11}$$

where $\text{FLOPs}_{\text{full}}$ represents the FLOPs for a fully active layer. Since $\hat{\rho}_{i}$ is predicted for each specific token, DLO achieves adaptive FLOPs, which entails better generalizability.

4 Experiments

In this section, we present an empirical evaluation of the proposed DLO framework, detailing the experimental settings, results, and analysis.

4.1 Experimental Settings

Model Selection. We utilize LLaMA2-7B Touvron et al. (2023) as the primary backbone due to its open-source availability and extensive usage. It consists of $R=32$ original transformer layers, which we group into $P=4$ clusters, each containing $Q=8$ layers. For layer expansion, we increase the group size to $Q^{\prime}=10$ layers, resulting in a dense model, LLaMA-DLO, with a total of 40 layers and 8 billion parameters. For comparison, we also employ LLaMA-Pro-8B Wu et al. (2024), a competitive model trained with Continual Pre-Training (CPT) on specialized datasets. We demonstrate DLO achieves an optimal balance between performance and computational cost in Section 4.4.
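The layer counts above follow from simple arithmetic, sketched below. The function name is hypothetical, and the position of the new layers inside each group is defined in Section 3.1; appending them at the end of each group here is purely an illustrative assumption.

```python
def expanded_layout(num_layers=32, num_groups=4, new_group_size=10):
    """Return a list of (kind, index) tuples describing the expanded stack."""
    group_size = num_layers // num_groups   # Q = 8
    extra = new_group_size - group_size     # Q' - Q = 2 new layers per group
    layout = []
    for g in range(num_groups):
        first = g * group_size
        layout += [("original", first + j) for j in range(group_size)]
        # ASSUMPTION (illustrative only): new layers are appended to the group.
        layout += [("expanded", g)] * extra
    return layout

layout = expanded_layout()
print(len(layout))                                    # 40
print(sum(kind == "expanded" for kind, _ in layout))  # 8
```

Four groups of ten layers give the 40-layer dense model, of which 8 layers are newly added.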

Fine-tuning Details. Following common practices Wu et al. (2024), we fine-tune using a mixture of five instruction tuning datasets: ShareGPT Team (2023b), EvolInstruct Luo et al. (2023), SlimOrca Team (2023c), MetaMath Yu et al. (2023), and Evol-CodeAlpaca Team (2022), with ShareGPT replicated three times, totaling approximately 1.44 million instances. We use a batch size of 128 and a maximum sequence length of 4,096 tokens. The learning rate is set to 2e-5 with a warmup ratio of 0.03 and cosine scheduling, and we utilize AdamW Loshchilov and Hutter (2017) as the optimizer. Flash Attention Dao et al. (2022) and bfloat16 mixed-precision training are adopted to accelerate training. Fine-tuning LLaMA-DLO under different skip ratios yields the sparse models, with each training run taking approximately 36 hours on eight NVIDIA A100 GPUs.
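For orientation, the learning-rate schedule described above can be sketched as follows. Several details are assumptions not stated in the text: a single pass over the 1.44 million instances, a linear warm-up, and cosine decay to zero. The function name is hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Linear warm-up followed by cosine decay (assumed to decay to zero)."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# ASSUMPTION: one epoch over ~1.44M instances at batch size 128.
total_steps = 1_440_000 // 128          # 11250 optimizer steps
warmup_steps = int(total_steps * 0.03)  # 337 warm-up steps
```

Under these assumptions the peak rate of 2e-5 is reached after 337 steps and decays toward zero by step 11,250.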

Evaluation Benchmarks. We assess the fine-tuned models using the EleutherAI LM Harness Gao et al. (2023) and BigCode Harness Ben Allal et al. (2022) across three domains: ❶ Language [ARC-C Clark et al. (2018), GLUE Wang et al. (2018), MMLU Hendrycks et al. (2020), OBQA Mihaylov et al. (2018), PIQA Bisk et al. (2020), SQuAD Rajpurkar et al. (2016), TruthfulQA Lin et al. (2021), WinoGrande Sakaguchi et al. (2021)], ❷ Math [GSM8K Cobbe et al. (2021), MathQA Amini et al. (2019)], and ❸ Code [HumanEval Chen et al. (2021b), MBPP Austin et al. (2021)]. Detailed metrics are in Appendix B.

| Sparsity | Model | FLOPs | Language | | | | | | | | Math | | Code | | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | ARC-C | GLUE | MMLU | OBQA | PIQA | SQuAD | TruthfulQA | WinoGrande | GSM8K | MathQA | HumanEval | MBPP | |
| 0% | • LLaMA2-7B | 29.3T | 53.1 | 40.6 | 46.9 | 44.2 | 79.0 | 26.4 | 38.8 | 74.0 | 14.5 | 28.3 | 21.8 | 29.0 | 41.38 |
| 0% | • LLaMA2-7B+SFT | 29.3T | 54.0 | 72.4 | 53.0 | 44.4 | 78.8 | 22.9 | 40.8 | 74.2 | 56.6 | 30.8 | 57.3 | 30.5 | 51.31 |
| 0% | • LLaMA-Pro-8B | 36.4T | 54.1 | 40.7 | 47.9 | 41.6 | 78.2 | 14.2 | 39.0 | 74.0 | 17.9 | 29.5 | 28.7 | 33.2 | 41.58 |
| 0% | • LLaMA-Pro-8B+SFT | 36.4T | 51.0 | 71.0 | 53.0 | 45.0 | 79.0 | 15.1 | 38.0 | 73.6 | 58.6 | 30.8 | 58.4 | 30.5 | 50.33 |
| 0% | • LLaMA-DLO-8B+SFT | 36.5T | 53.2 | 75.5 | 53.2 | 43.7 | 79.0 | 22.0 | 38.7 | 74.0 | 57.4 | 31.0 | 57.0 | 30.2 | 51.24 |
| 10% | ◦ LLaMA2-7B | 27.5T | 51.0 | 72.5 | 51.1 | 40.2 | 78.0 | 21.0 | 39.3 | 71.0 | 53.4 | 29.6 | 49.7 | 28.3 | 48.76 |
| 10% | ◦ LLaMA-DLO-8B | 34.2T | 52.5 | 75.4 | 51.4 | 43.6 | 78.7 | 21.7 | 41.0 | 73.4 | 55.0 | 31.4 | 55.3 | 29.1 | 50.71 |
| 20% | ◦ LLaMA2-7B | 25.8T | 33.0 | 69.9 | 50.4 | 35.0 | 65.6 | 16.4 | 36.9 | 54.2 | 1.0 | 24.5 | 0.0 | 0.0 | 32.24 |
| 20% | ◦ LLaMA-DLO-8B | 32.0T | 51.2 | 73.3 | 50.8 | 43.2 | 78.2 | 20.4 | 38.9 | 73.2 | 50.1 | 30.5 | 57.6 | 28.2 | 49.63 |
| 30% | ◦ LLaMA2-7B | 24.1T | 28.1 | 2.8 | 47.1 | 35.0 | 53.8 | 13.9 | 37.9 | 52.2 | 0.0 | 21.6 | 0.0 | 0.0 | 24.40 |
| 30% | ◦ LLaMA-DLO-8B | 29.8T | 44.6 | 73.0 | 50.1 | 41.2 | 77.1 | 21.7 | 37.1 | 63.2 | 46.9 | 28.1 | 31.2 | 6.0 | 43.35 |

Table 1: Performance comparison of DLO (our approach) on various datasets using LLaMA2-7B as the backbone. Models marked with • are dense models, either original or expanded using DLO expansion. Models with 8B parameters indicate expansion via LLaMA-Pro or our DLO. Models marked with ◦ are sparse models incorporating DLO activation and skipping operations. Inference FLOPs are counted for a sequence of 2,048 tokens. The proposed • LLaMA-DLO-8B with 0% sparsity signifies that no layer is skipped and all the original and expanded layers are activated. ◦ LLaMA2-7B with non-zero sparsity equals LLaMA-DLO without expanding layers.

4.2 Overall Performance

Table 1 summarizes the performance of the DLO framework across various datasets using LLaMA2-7B as the backbone. From the results, we draw several key observations:

❶ Dense Models’ Superiority: Dense models, indicated by •, generally outperform their sparse counterparts across most datasets. For instance, fine-tuned dense models achieve high average scores, such as 51.31 for LLaMA2-7B+SFT and 51.24 for LLaMA-DLO-8B+SFT, compared to the baseline LLaMA2-7B’s 41.38. This indicates that our DLO-expansion method enhances performance significantly over the original model while leveraging additional parameters effectively.

❷ Efficiency of Sparse Models: Sparse models, marked with ◦, show a notable reduction in inference-time FLOPs while maintaining competitive accuracy. At 10% sparsity, the LLaMA-DLO model achieves an average score of 50.71 with 34.2T FLOPs, compared to the dense LLaMA-DLO-8B+SFT’s 51.24 with 36.5T FLOPs. This demonstrates the efficiency of DLO’s activation and skipping operations in optimizing computational resources without significantly sacrificing performance.

❸ DLO-Expansion Advantages: Models expanded using DLO expansion with up to 8B parameters outperform the original LLaMA2-7B model across multiple metrics. For example, dense LLaMA-DLO-8B+SFT achieves a higher average score of 51.24 compared to LLaMA2-7B’s 41.38, highlighting the effectiveness of vertical scaling through layer expansion in improving model capacity and performance.

❹ Balanced Performance of Sparse Models: Sparse models with DLO’s dynamic activation and skipping (◦) provide a well-balanced trade-off between performance and computational efficiency. At 30% sparsity, LLaMA-DLO still clearly outperforms the equally sparse LLaMA2-7B on tasks like GLUE (73.0 vs. 2.8) and HumanEval (31.2 vs. 0.0), while reducing FLOPs relative to the dense expanded model. At 10% and 20% sparsity in particular, the average score stays within about two points of the dense model (50.71 and 49.63 vs. 51.24), which makes these models suitable for scenarios requiring computational efficiency without substantial performance loss.

❺ Effective Inference Optimization: DLO demonstrates effective inference optimization. For instance, the dense LLaMA-DLO model with 8B parameters has nearly identical inference-time FLOPs to LLaMA-Pro-8B (36.5T vs. 36.4T) while achieving a higher average score after fine-tuning (51.24 vs. 50.33). This highlights DLO’s capability to enhance model efficiency without compromising accuracy.

❻ General Observations: Overall, the DLO framework successfully balances performance and efficiency across various datasets and tasks. The adoption of both expansion and skipping strategies enables LLaMA-DLO to achieve robust performance improvements while maintaining lower computational costs, suggesting that DLO is a viable approach for scalable and efficient LLM deployment.
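The trade-off discussed above can be quantified directly from Table 1. The short script below is only a convenience calculation over the reported numbers; the variable and function names are hypothetical.

```python
# Values copied from Table 1: (inference FLOPs in T, average score).
DENSE = (36.5, 51.24)  # dense LLaMA-DLO-8B+SFT at 0% sparsity
SPARSE = {10: (34.2, 50.71), 20: (32.0, 49.63), 30: (29.8, 43.35)}

def tradeoff(sparsity):
    """Return (fraction of FLOPs saved, fraction of average score retained)."""
    flops, avg = SPARSE[sparsity]
    return 1 - flops / DENSE[0], avg / DENSE[1]

for s in sorted(SPARSE):
    saved, kept = tradeoff(s)
    print(f"{s}% sparsity: {saved:.1%} fewer FLOPs, {kept:.1%} of dense average")
```

Relative to the dense LLaMA-DLO-8B+SFT, the sparse variants save roughly 6.3%, 12.3%, and 18.4% of inference FLOPs while retaining about 99.0%, 96.9%, and 84.6% of the average score at 10%, 20%, and 30% sparsity, respectively.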

We conduct further analyses of key components of DLO in the subsequent subsections.

4.3 Ablation Studies

| Method | Language | | | | | | | | Math | | Code | | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ARC-C | GLUE | MMLU | OBQA | PIQA | SQuAD | TruthfulQA | WinoGrande | GSM8K | MathQA | HumanEval | MBPP | |
| Random | 25.1 | 41.4 | 24.4 | 26.8 | 50.3 | 42.2 | 37.7 | 49.3 | 0.3 | 20.3 | 0.0 | 1.3 | 26.6 |
| Identity | 52.5 | 75.4 | 51.4 | 43.6 | 78.7 | 21.7 | 41.0 | 73.4 | 55.0 | 31.4 | 55.3 | 29.1 | 50.7 |
| Copy | 52.0 | 71.3 | 52.3 | 43.2 | 78.6 | 23.3 | 40.8 | 73.0 | 52.2 | 29.0 | 41.1 | 26.7 | 48.6 |
| Linear | 48.6 | 70.7 | 53.4 | 39.4 | 72.9 | 18.7 | 38.7 | 57.5 | 29.7 | 25.7 | 24.0 | 0.7 | 40.0 |
| Slerp | 42.2 | 71.6 | 49.6 | 39.8 | 76.4 | 22.3 | 36.2 | 63.0 | 43.0 | 27.1 | 40.4 | 4.9 | 43.0 |

Table 2: Experiments on the effectiveness of different initialization strategies for the expanded blocks. For this study we evaluate on ◦ LLaMA-DLO-8B with 10% sparsity.

| Method | Language | | | | | | | | Math | | Code | | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ARC-C | GLUE | MMLU | OBQA | PIQA | SQuAD | TruthfulQA | WinoGrande | GSM8K | MathQA | HumanEval | MBPP | |
| DLO | 52.5 | 75.4 | 51.4 | 43.6 | 78.7 | 21.7 | 41.0 | 73.4 | 55.0 | 31.4 | 55.3 | 29.1 | 50.7 |
| w/o Zero Init | 51.9 | 73.0 | 52.1 | 44.6 | 77.7 | 20.8 | 40.3 | 74.3 | 55.0 | 30.5 | 56.1 | 28.2 | 50.4 |
| w/o Rescaling | 47.4 | 71.3 | 49.4 | 35.8 | 75.3 | 21.1 | 39.6 | 67.5 | 35.0 | 24.5 | 55.6 | 28.2 | 45.9 |

Table 3: Ablation study on the effectiveness of zeros router initialization and score rescaling. For this evaluation, we use ◦ LLaMA-DLO-8B with 10% sparsity.

| Method | ARC-C | MMLU | TruthfulQA | WinoGrande | GSM8K | Avg. ↑ |
|---|---|---|---|---|---|---|
| SOLAR | 24.8 | 24.8 | 38.8 | 50.7 | 2.2 | 28.3 |
| SD-Stack | 23.5 | 23.4 | 36.0 | 51.1 | 2.5 | 27.3 |
| DLO | 52.5 | 51.4 | 41.0 | 73.4 | 55.0 | 54.7 |

Table 4: Comparison with different expansion methods. We extend ◦ LLaMA-DLO layers using different strategies and fine-tune the expanded models with DLO under an overall skip rate of $\rho=10\%$.

Initialization Strategies for Expanded Layers.

We explored the effectiveness of various layer initialization strategies for the expanded layers, as detailed in Section 3.1. Table 2 evaluates the impact of different initialization strategies on the performance of LLaMA-DLO models with 10% sparsity. The results highlight that the choice of initialization plays a critical role in determining model performance across various tasks.

❶ The identity and copy initialization strategies demonstrate the most consistent and high-performing results, suggesting that leveraging existing layer information is beneficial for stabilizing and enhancing model performance. These methods help maintain coherence in the model’s internal representations, leading to robust results across a wide range of tasks, including GLUE and HumanEval.

❷ Interestingly, while linear and SLERP initializations were expected to offer smoother transitions and potentially enhance performance, their results were only moderately effective. This indicates that while sophisticated initialization techniques can offer benefits, they may not always outperform simpler strategies like identity and copy initialization, which directly utilize pre-existing model structures.

❸ Random initialization yields the lowest performance. The variability in task performance with this method highlights the challenges of using non-specific weights, which can lead to unstable and suboptimal model behavior, particularly in complex tasks like math and coding.

Overall, the findings emphasize that initialization strategies that leverage prior information from existing layers tend to provide a better foundation for training expanded models, leading to improved performance. We thus choose $\Pi_{identity}$ as the default initialization strategy.
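To make the two interpolation-based strategies concrete, the sketch below applies linear interpolation and SLERP to flattened weight vectors of two neighboring layers. It is a generic illustration under stated assumptions, not the exact definitions of Section 3.1: the function names, the midpoint factor `t=0.5`, and the treatment of weights as flat Python lists are all assumptions.

```python
import math

def lerp(a, b, t=0.5):
    """Linear interpolation between two flattened weight vectors."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation; falls back to lerp when the vectors are
    nearly parallel. Zero or exactly opposite vectors are not handled."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos)))  # angle between a and b
    if omega < eps:
        return lerp(a, b, t)
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

mid_linear = lerp([1.0, 0.0], [0.0, 1.0])
mid_slerp = slerp([1.0, 0.0], [0.0, 1.0])
```

For two orthogonal unit vectors, the linear midpoint shrinks in norm to about 0.71, whereas the SLERP midpoint stays on the unit sphere, which is the usual motivation for SLERP when merging weights.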

Zeros Router Initialization & Score Rescaling.

In this experiment, we investigate the impact of zeros router initialization and score rescaling on mitigating performance degradation.

▷ Zeros Router Initialization. Initializing the router parameters to zero aims to start the model from a neutral state, avoiding any initial bias towards layer activation or skipping. This method allows the model to learn activation patterns from scratch without being influenced by predefined weights. Results in Table 3 indicate that this approach helps maintain balanced training dynamics and mitigates premature convergence, as reflected in the performance stability observed across tasks.

▷ Score Rescaling. Score rescaling adjusts the routing scores to maintain them within a specific range, typically 0 to 1. This adjustment is intended to preserve gradient flow and prevent extreme activations, ensuring that the model remains responsive to training signals. Our findings suggest that score rescaling helps avoid over-activation of layers, leading to more efficient use of the model’s capacity.

The combined use of zeros router initialization and score rescaling appears to prevent performance degradation effectively. As shown in Table 3, models with these techniques generally achieve more consistent accuracy and efficiency across various tasks. These results suggest that careful initialization and rescaling strategies are beneficial for maintaining robust performance during adaptation.
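The "neutral state" produced by zero initialization can be checked numerically for a sigmoid router of the form $\sigma(\mathbf{h}W)$ used in Equation (6). The sketch below is illustrative: the function names are hypothetical and the router weight is treated as a single vector.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def router_prob(h, w):
    """Router output sigma(h . w) for one token."""
    return sigmoid(sum(x * y for x, y in zip(h, w)))

def bce_grad(h, w, label):
    """Gradient of BCE(sigma(h . w), label) with respect to w: (p - label) * h."""
    p = router_prob(h, w)
    return [(p - label) * x for x in h]

h = [3.0, -2.0]        # arbitrary hidden state
w_zero = [0.0, 0.0]    # zero-initialized router weights
p0 = router_prob(h, w_zero)
g_active = bce_grad(h, w_zero, 1)
g_skip = bce_grad(h, w_zero, 0)
```

With zero weights every token receives probability 0.5 regardless of its hidden state, and the first gradient step has the same magnitude for both label values, so the router starts without a preference for activating or skipping.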

| Method | Language | | | | | | | | Math | | Code | | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ARC-C | GLUE | MMLU | OBQA | PIQA | SQuAD | TruthfulQA | WinoGrande | GSM8K | MathQA | HumanEval | MBPP | |
| Uniform | 36.0 | 61.7 | 48.7 | 36.6 | 58.4 | 47.9 | 36.8 | 68.2 | 25.4 | 21.7 | 23.7 | 2.2 | 38.9 |
| Layer-Wise | 52.5 | 75.4 | 51.4 | 43.6 | 78.7 | 21.7 | 41.0 | 73.4 | 55.0 | 31.4 | 55.3 | 29.1 | 50.7 |

Table 5: Performance of the fine-tuned ◦ LLaMA-DLO-8B with 10% sparsity under different sparsity distribution strategies. “Uniform” means all layers use the same sparsity $\rho_{i}=\rho$ during training. “Layer-Wise” means the model maintains different skip rates for different layers, as described in Section 3.3.
Figure 4: Visualization on different datasets of (a) Layer-Wise Number of Activations, (b) Layer-Wise Average Similarity, and (c) Token Activation Examples.
Figure 5: (a) Performance vs. training time. LLaMA-Pro is reported in H800 GPU hours quoted from the original paper. The rest are reported in A100 GPU hours. (b) Performance vs. inference FLOPs. DLO achieves the best trade-off between performance and training or inference costs.

Efficient Expansion.

In addition to the high-cost LLaMA-Pro approach (studied in Table 1), we compare our expansion method with two state-of-the-art efficient vertical expansion baselines: SOLAR Kim et al. (2023) and Self-Duplicate Stack (SD-Stack) Team (2024). These two methods duplicate blocks of transformer layers and stack them together in a training-free manner. As shown in Table 4, the proposed DLO-expansion outperforms both SOLAR and SD-Stack by a considerable margin (an average of 54.7 vs. 28.3 and 27.3). This highlights the critical role of Supervised Fine-Tuning (SFT) in adapting the expanded layers effectively. Unlike training-free approaches, DLO-expansion achieves a superior balance between training cost and performance, demonstrating the importance of fine-tuning in maximizing the effectiveness of layer expansion.

Layer-Wise Skip Rates & Sparsity Allocation.

This experiment evaluates the impact of using layer-specific skip rates on sparsity allocation.

▷ Layer-Wise Skip Rates. Adjusting skip rates for each layer aims to selectively activate or skip layers based on their contribution to task performance, which is measured by the layer-wise similarity. This method helps focus computational resources on more critical layers. Results in Table 5 suggest that this approach can lead to more efficient sparsity allocation with less impact on model performance. Figure 4 also shows that DLO can skip layers that have high layer-wise similarity.

▷ Sparsity Allocation. Tailoring skip rates by layer helps distribute sparsity more effectively, potentially reducing computational overhead. As indicated in Table 5, models with layer-wise skip rates tend to maintain performance while achieving better computational efficiency.

4.4 Scalability

The proposed LLaMA-DLO model surpasses the performance of the original dense LLaMA, while also achieving comparable results to the dense LLaMA-Pro. Notably, it does so at a significantly lower training cost by eliminating the need for expensive CPT. Additionally, LLaMA-DLO facilitates efficient inference through adaptively reduced FLOPs, making it a cost-effective choice for both training and deployment.

Figure 5 illustrates the trade-off between model performance and both training and inference costs. LLaMA-DLO emerges as the optimal solution, achieving the best balance across these metrics. This demonstrates the model’s scalability, ensuring that high performance is maintained while keeping computational costs manageable.

5 Conclusion

This paper presents LLaMA-DLO, a framework for efficient vertical scaling of LLMs that dynamically expands, activates, and skips layers to optimize computational resources. Our experiments demonstrate that LLaMA-DLO achieves performance on par with expensive dense expansion models like LLaMA-Pro, while significantly reducing training costs and enhancing inference efficiency. These results highlight LLaMA-DLO’s potential as a cost-effective solution for scaling LLMs in various NLP tasks, offering a balanced approach between model performance and resource management.

Limitation Discussions & Future Work

Disentanglement of Routing Decisions and Rescaling Scores.

Currently, the routing decisions and rescaling scores in our framework are interdependent, which may impact the model’s accuracy. For instance, the skipped outputs are optimized to match the original outputs primarily through cosine similarity, which does not account for the difference in L2 magnitude. This discrepancy could potentially be mitigated by applying an additional rescaling factor to the skipped outputs, ensuring a better match in magnitude and improving the overall performance.
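The magnitude issue can be illustrated with a toy example: two vectors may have a cosine similarity of exactly 1 while differing in L2 norm. The norm-matching factor below is a hypothetical instance of the additional rescaling suggested above, not a component of the current framework; all names are assumptions.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))

def match_magnitude(skipped, reference):
    """HYPOTHETICAL fix: rescale the skipped output to the reference L2 norm."""
    scale = l2_norm(reference) / l2_norm(skipped)
    return [scale * x for x in skipped]

skipped = [1.0, 2.0, 2.0]    # toy output when the layer is skipped
reference = [3.0, 6.0, 6.0]  # toy output when the layer is executed
similarity = cosine(skipped, reference)
rescaled = match_magnitude(skipped, reference)
```

Here the two outputs point in the same direction, so cosine-based supervision sees a perfect match even though the norms differ by a factor of three; the rescaled vector closes that gap.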

Improved Supervision for Router Labels.

The current method relies on cosine similarity for supervising router labels, which may not be the most effective approach. Exploring alternative supervision methods, such as task-specific metrics or direct gradients, could lead to more accurate router decisions. For example, drawing inspiration from works like Jiang et al. (2023) or employing gradient-based techniques similar to those used in network pruning, could enhance the router’s ability to prioritize important tokens and improve performance.

Skipping Attention Modules.

This work primarily focuses on skipping MLP layers due to the observed instability in decoding when skipping attention modules, such as excessive repetition and degraded accuracy (e.g., achieving 0% accuracy on GSM8K). Future work could explore strategies to stabilize the skipping of attention layers, potentially improving model efficiency without compromising output quality significantly.

Ethical Statement

The development and deployment of large language models, including the LLaMA-DLO framework presented in this work, can raise important ethical considerations. Our research aims to enhance the efficiency and scalability of LLMs while maintaining high standards of responsibility and ethical practice. We recognize the potential impact of our work on various stakeholders and are committed to the following ethical principles:

Fairness and Bias Mitigation.

We are aware that language models can inadvertently learn and propagate biases present in training data. Efforts have been made to ensure that LLaMA-DLO is trained on diverse and representative datasets to minimize the risk of bias. Effective ways to mitigate bias remain a pressing problem worth further study.

Transparency and Accountability.

We strive to maintain transparency in our research and development processes. Detailed documentation and open access to our methodologies and results will be provided to allow for scrutiny and reproducibility. Accountability mechanisms are in place to ensure that any adverse effects of our technology are promptly identified and addressed.

Social Impact.

The potential societal impact of LLaMA-DLO is carefully evaluated to prevent misuse or harm. We are committed to the ethical deployment of our technology, ensuring it is used for beneficial purposes such as advancing research, improving accessibility, and enhancing communication. We actively discourage and take steps to prevent the use of our models for malicious activities, misinformation, or any application that could harm individuals or society.

Continuous Ethical Review.

The ethical implications of our work are continually assessed to adapt to evolving norms and expectations. We engage with interdisciplinary experts and stakeholders to identify and address ethical concerns, ensuring that our research and its applications remain aligned with societal values and ethical standards.

By adhering to these principles, we aim to contribute positively to the field of artificial intelligence and ensure that the benefits of our research are realized in an ethical and responsible manner.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Ainslie et al. (2023) Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, et al. 2023. Colt5: Faster long-range transformers with conditional computation. arXiv preprint arXiv:2303.09752.
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  • Baddeley (1992) Alan Baddeley. 1992. Working memory. Science, 255(5044):556–559.
  • Ben Allal et al. (2022) Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. 2022. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
  • Chen et al. (2021a) Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. 2021a. bert2bert: Towards reusable pretrained language models. arXiv preprint arXiv:2110.07143.
  • Chen et al. (2021b) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021b. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Chen et al. (2015) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2015. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.
  • Chen et al. (2023) Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, and Jingren Zhou. 2023. Ee-llm: Large-scale training and inference of early-exit large language models with 3d parallelism. arXiv preprint arXiv:2312.04916.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • Dabre and Fujita (2019) Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6292–6299.
  • Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. arXiv preprint arXiv:2205.14135.
  • Du et al. (2024) Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, and Jie Fu. 2024. Stacking your transformers: A closer look at model growth for efficient llm pre-training. arXiv preprint arXiv:2405.15319.
  • Evci et al. (2022) Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Max Vladymyrov, and Fabian Pedregosa. 2022. Gradmax: Growing neural networks using gradient information. arXiv preprint arXiv:2201.05125.
  • Fedus et al. (2022a) William Fedus, Jeff Dean, and Barret Zoph. 2022a. A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667.
  • Fedus et al. (2022b) William Fedus, Barret Zoph, and Noam Shazeer. 2022b. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
  • Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
  • Gong et al. (2019) Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Efficient training of bert by progressively stacking. In International conference on machine learning, pages 2337–2346. PMLR.
  • Gu et al. (2020) Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, and Jiawei Han. 2020. On the transformer growth for progressive bert training. arXiv preprint arXiv:2010.12562.
  • Hadi et al. (2023) Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  • Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
  • Jiang et al. (2023) Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736.
  • Kim et al. (2023) Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. 2023. Solar 10.7B: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166.
  • Koechlin et al. (2003) Etienne Koechlin, Chrystele Ody, and Frédérique Kouneiher. 2003. The architecture of cognitive control in the human prefrontal cortex. Science, 302(5648):1181–1185.
  • Lepikhin et al. (2020) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
  • Li et al. (2024a) Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sunkwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024a. Dalk: Dynamic co-augmentation of llms and kg to answer alzheimer’s disease questions with scientific literature. arXiv preprint arXiv:2405.04819.
  • Li et al. (2024b) Yifan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, and Yu Kong. 2024b. Facial affective behavior analysis with instruction tuning. arXiv preprint arXiv:2404.05052.
  • Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Raposo et al. (2024) David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. 2024. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  • Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. 2022. Staged training for transformer language models. In International Conference on Machine Learning, pages 19893–19908. PMLR.
  • Shoemake (1985) Ken Shoemake. 1985. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pages 245–254. ACM.
  • Tan et al. (2024) Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446.
  • Team (2022) EvolCodeAlpaca Team. 2022. Evolcodealpaca dataset.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Team (2023a) LLaMA-MoE Team. 2023a. Llama-moe: Building mixture-of-experts from llama with continual pre-training.
  • Team (2024) Self-Duplicate Stack Team. 2024. Self-duplicate stack.
  • Team (2023b) ShareGPT Team. 2023b. Sharegpt dataset.
  • Team (2023c) SlimOrca Team. 2023c. Slimorca dataset.
  • Ting and Witten (1997) Kai Ming Ting and Ian H Witten. 1997. Stacking bagged and dagged models.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Wang et al. (2022) Jue Wang, Ke Chen, Gang Chen, Lidan Shou, and Julian McAuley. 2022. Skipbert: Efficient inference with shallow layer skipping. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7287–7301.
  • Wang et al. (2023a) Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. 2023a. Learning to grow pretrained models for efficient transformer training. arXiv preprint arXiv:2303.00980.
  • Wang et al. (2023b) Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, and Hongxia Yang. 2023b. Lemon: Lossless model expansion. arXiv preprint arXiv:2310.07999.
  • Wu et al. (2024) Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, and Ying Shan. 2024. Llama pro: Progressive llama with block expansion. arXiv preprint arXiv:2401.02415.
  • Yao et al. (2023) Yiqun Yao, Zheng Zhang, Jing Li, and Yequan Wang. 2023. Masked structural growth for 2x faster language model pre-training. In The Twelfth International Conference on Learning Representations.
  • Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
  • (59) Xingjian Zhang, Jiaxi Tang, Yang Liu, Xinyang Yi, Li Wei, Lichan Hong, Qiaozhu Mei, and Ed H Chi. Conditional transformer fine-tuning by adaptive layer skipping. In 5th Workshop on practical ML for limited/low resource settings.
  • Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.

Appendix A Pseudocode-Style Description of Dynamic Layer Operation (DLO)

Algorithm 1 Dynamic Layer Operation (DLO)
Require:  Pre-trained LLM with $R$ layers, group size $Q$, expansion size $q$, target overall sparsity $\rho$, training steps $T$, annealing steps $T'$, base learning rate $\bar{\zeta}$
Ensure:  Optimized LLM with dynamic scaling
1:  Initialize: $P \leftarrow R / Q$, $Q' \leftarrow Q + q$, $\rho_{i,1} \leftarrow \rho$
2:  // Layer Expansion:
3:  for group $i \leftarrow 1$ to $P$ do
4:     for layer $j \leftarrow Q+1$ to $Q+q$ do
5:        Initialize $\theta'_{ij}$ using $\Pi$:
6:        if $\Pi =$ ‘Xavier’ then
7:           $\theta'_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right)$
8:        else if $\Pi =$ ‘Copy’ then
9:           $\theta'_{ij} \leftarrow \theta_{i(Q+q-1)}$
10:        else if $\Pi =$ ‘Identity’ then
11:           $\theta'_{ij} \leftarrow \theta_{i(Q+q-1)}$, $W'_{\text{out}} = 0$
12:        else if $\Pi =$ ‘Linear Merge’ then
13:           $\theta'_{ij} \leftarrow \sum_{k=1}^{\tau} \alpha_k \theta_{i(Q+q-k)}$
14:        else if $\Pi =$ ‘SLERP’ then
15:           $\Omega = \arccos\left(\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}\right)$
16:           $\theta'_{ij} \leftarrow \frac{\sin((1-\alpha)\Omega)}{\sin(\Omega)} \mathbf{u} + \frac{\sin(\alpha\Omega)}{\sin(\Omega)} \mathbf{v}$
17:        end if
18:     end for
19:  end for
20:  // Layer Activation and Skipping:
21:  for step $t \leftarrow 1$ to $T$ do
22:     // Skip Rate Annealing:
23:     if $t \leq T'$ then
24:        $\rho^t \leftarrow \bar{\rho} + (\rho - \bar{\rho}) \frac{t}{T'}$
25:     else
26:        $\rho^t \leftarrow \rho$
27:     end if
28:     // Training and Integration:
29:     $\mathcal{L}_{\text{skip}} \leftarrow 0$
30:     for layer $i \leftarrow 1$ to $R'$ do
31:        for token $s$ in sequence do
32:           // Dynamic Skip:
33:           $r_i^s \leftarrow \frac{1}{2}\left(\beta + (2\sigma(\mathbf{h}_i^s W_i) - 1)\gamma\right)$
34:           if training then
35:              $\hat{\lambda}_i^s \leftarrow \mathbb{1}\left\{r_i^s \in \mathrm{Top}_{\lfloor \rho_{i,t} S \rfloor}\big(\{r_i^s\}_{s=1}^{S}\big)\right\}$
36:           else
37:              $\hat{\lambda}_i^s \leftarrow \mathbb{1}\left\{r_i^s > \frac{\beta}{2}\right\}$
38:           end if
39:           if $r_i^s = 1$ then
40:              $\mathbf{h}_{i+1}^s \leftarrow r_i^s \cdot \mathcal{M}_i \circ \mathcal{A}_i(\mathbf{h}_i^s)$
41:           else
42:              $\mathbf{h}_{i+1}^s \leftarrow \mathcal{A}_i(\mathbf{h}_i^s)$
43:           end if
44:           // Skip Loss:
45:           $\mu_i^s \leftarrow \cos\left(\mathcal{A}_i(\mathbf{h}_i^s), \mathcal{M}_i \circ \mathcal{A}_i(\mathbf{h}_i^s)\right)$
46:           $\tilde{\lambda}_i^s \leftarrow \mathbb{1}\left\{\mu_i^s \in \mathrm{Bottom}_{\lfloor (1-\rho^t) R' S \rfloor}\big(\{\mu_i^s\}_{i,s=1}^{R',S}\big)\right\}$
47:           $\mathcal{L}_{\text{skip}} \leftarrow \mathcal{L}_{\text{skip}} + \mathcal{L}_{\text{BCE}}\left(\sigma(\mathbf{h}_i^s W_i), \tilde{\lambda}_i^s\right)$
48:        end for
49:        // Layer-Wise Skip Rate:
50:        $\rho_{i,t+1} \leftarrow \sum_{s=1}^{S} \tilde{\lambda}_i^s / S$
51:     end for
52:     $\mathcal{L}_{\text{skip}} \leftarrow \mathcal{L}_{\text{skip}} / (R' S)$
53:     $\mathcal{L} \leftarrow \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{skip}}$
54:     Adjust learning rate $\zeta_{i,t} \leftarrow \bar{\zeta} \cdot \frac{1-\rho_{i,t}}{1-\rho^t}$
55:  end for
56:  return  Optimized LLM
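
The initialization strategies in the layer expansion stage of Algorithm 1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: parameters are treated as flat lists or nested lists of floats, the function names and the "w_out" key are hypothetical, and the fallback to linear interpolation for nearly parallel vectors in slerp is an added assumption to avoid dividing by a vanishing sine.

```python
import copy
import math
import random

def xavier_uniform(n_in, n_out, rng=random):
    """Sample an n_out x n_in matrix from U(-a, a), a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

def copy_init(theta):
    """'Copy': the new layer starts as an exact copy of an existing layer."""
    return copy.deepcopy(theta)

def identity_init(layer):
    """'Identity': copy a layer (dict of name -> matrix) and zero its output
    projection. The key "w_out" is a hypothetical name for W'_out."""
    new_layer = copy.deepcopy(layer)
    new_layer["w_out"] = [[0.0] * len(row) for row in new_layer["w_out"]]
    return new_layer

def linear_merge(thetas, alphas):
    """'Linear Merge': sum_k alpha_k * theta_k over flattened parameter vectors."""
    dim = len(thetas[0])
    return [sum(a * t[d] for a, t in zip(alphas, thetas)) for d in range(dim)]

def slerp(u, v, alpha, eps=1e-8):
    """'SLERP': spherical interpolation between non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))
    if math.sin(omega) < eps:
        # Assumption: fall back to plain linear interpolation when sin(omega) ~ 0.
        return [(1.0 - alpha) * a + alpha * b for a, b in zip(u, v)]
    w_u = math.sin((1.0 - alpha) * omega) / math.sin(omega)
    w_v = math.sin(alpha * omega) / math.sin(omega)
    return [w_u * a + w_v * b for a, b in zip(u, v)]
```

With the 'Identity' strategy the zeroed output projection makes the new layer contribute nothing to the residual stream at initialization, so the expanded model initially behaves like the original one.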

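The scalar quantities in the activation and skipping stage of Algorithm 1 (skip rate annealing, router score, training-time top-k selection, inference-time thresholding, skip-loss labels, and the layer-wise learning rate) can be sketched as below. This is an illustrative sketch only: all function names are hypothetical, the router logit stands in for the product of the hidden state and router weight, and the number of selected tokens k is passed in directly rather than derived from a sparsity value.

```python
import math

def annealed_rate(t, t_anneal, rho_init, rho_target):
    """Linear schedule from rho_init to rho_target over the first t_anneal steps."""
    if t <= t_anneal:
        return rho_init + (rho_target - rho_init) * t / t_anneal
    return rho_target

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def router_score(logit, beta, gamma):
    """r = (beta + (2 * sigmoid(logit) - 1) * gamma) / 2, where logit = h W."""
    return 0.5 * (beta + (2.0 * sigmoid(logit) - 1.0) * gamma)

def top_k_mask(values, k):
    """1 for the k largest entries, 0 elsewhere (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(values))]

def bottom_k_mask(values, k):
    """1 for the k smallest entries, 0 elsewhere; used for skip-loss labels."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(values))]

def threshold_mask(scores, beta):
    """Inference-time rule: a token is selected when r > beta / 2."""
    return [1 if r > beta / 2.0 else 0 for r in scores]

def bce(p, y, eps=1e-12):
    """Binary cross-entropy between a probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def layer_lr(base_lr, rho_layer, rho_global):
    """Layer-wise learning rate: base_lr * (1 - rho_layer) / (1 - rho_global)."""
    return base_lr * (1.0 - rho_layer) / (1.0 - rho_global)

# Toy example with four tokens in one layer (values are made up).
logits = [-2.0, 0.0, 3.0, 1.0]
scores = [router_score(z, beta=1.0, gamma=1.0) for z in logits]
train_mask = top_k_mask(scores, k=2)
infer_mask = threshold_mask(scores, beta=1.0)
```

Note that the router logit is used twice: squashed and rescaled into the score r that drives selection, and passed through a plain sigmoid as the probability that the BCE skip loss compares against the similarity-derived labels.
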
Appendix B Evaluation Details

We list the number of shots and the metric used for each dataset as follows:

  • ARC-C: 25 shots, normalized accuracy.

  • GLUE: 0 shot, accuracy.

  • MMLU: 5 shots, normalized accuracy.

  • PIQA: 0 shot, normalized accuracy.

  • OBQA: 0 shot, normalized accuracy.

  • SQuAD: 0 shot, F1 Score.

  • TruthfulQA: 0 shot, accuracy.

  • WinoGrande: 5 shots, accuracy.

  • GSM8K: 5 shots, accuracy.

  • MathQA: 0 shot, normalized accuracy.

  • HumanEval: 200 rounds, pass@100.

  • MBPP: 15 rounds, pass@10.
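
The list above does not state how pass@k is computed for HumanEval and MBPP. As a hedged illustration, the sketch below implements the unbiased estimator commonly used for code generation benchmarks, under the assumption that "rounds" denotes the number n of samples generated per problem and that c of them pass the unit tests; the function name pass_at_k is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass, k: evaluation budget (k <= n).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Assumed settings mirroring the list above: n = 200, k = 100 and n = 15, k = 10.
humaneval_example = pass_at_k(n=200, c=1, k=100)
mbpp_example = pass_at_k(n=15, c=3, k=10)
```

The benchmark score would then be the mean of this per-problem estimate over all problems.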

Appendix C Acknowledgment of AI Assistance in Writing and Revision

We utilized ChatGPT-4 for revising and enhancing sections of this paper.