SwitchCIT: Switching for Continual Instruction Tuning of Large Language Models

Xinbo Wu1,2, Max Hartman2, Vidhata Arjun Jayaraman2,3, Lav R. Varshney 1,2
1Coordinated Science Laboratory
2Department of Electrical and Computer Engineering
3Department of Mathematics
University of Illinois Urbana-Champaign
{xinbowu2, maxh3, vidhata2, varshney}@illinois.edu
Abstract

Large language models (LLMs) have exhibited impressive capabilities in various domains, particularly in general language understanding. However, these models, trained on massive text data, may not be finely optimized for specific tasks triggered by instructions. Continual instruction tuning is crucial to adapt LLMs to evolving tasks and domains, ensuring their effectiveness and relevance across a wide range of applications. In continual instruction tuning, where models are trained sequentially on different tasks, catastrophic forgetting can occur, leading to performance degradation on previously learned tasks. This work addresses catastrophic forgetting in continual instruction learning for LLMs through a switching mechanism that routes computations to parameter-efficient tuned models. We demonstrate the effectiveness of our method through experiments on continual instruction tuning of different natural language generation tasks.

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across numerous domains, as highlighted by OpenAI (2023) and Bubeck et al. (2023). However, whereas LLMs pre-trained on extensive language data excel in general language understanding, they may not be optimized for every specific task of interest prompted by instructions. Therefore, there is a need for continual instruction learning to adapt LLMs to evolving tasks and domains. Indeed, continual instruction learning is essential for LLMs such as GPT (Radford et al., 2019) to maintain their effectiveness and relevance in handling a wide range of tasks and domains.

Such models are trained on vast amounts of text data and fine-tuned for specific applications, often by learning tasks sequentially (Luo et al., 2023), i.e., learning on the datasets pertaining to one task all at once before moving on to the next task. The challenge lies in their ability to continually learn and adapt as they encounter new tasks and information. In continual instruction learning scenarios, where models are sequentially trained on different tasks or datasets, catastrophic forgetting occurs when the model’s parameters are updated to accommodate new information, leading to degradation or complete loss of performance on previously learned tasks.

A typical way to balance new learning with the retention of previously acquired capabilities in LLMs is through replaying old data. However, with the rapid iterations of LLMs for diverse and complex use cases, retaining old data becomes exceptionally challenging. Moreover, continually tuning an LLM with a large number of parameters is highly costly in terms of both computation and memory usage. Parameter-efficient fine-tuning (PEFT), such as low-rank adaptation (LoRA) (Hu et al., 2022), provides a lightweight option with portable parameters that can be paired with an LLM to perform specific tasks. Therefore, in this work, we focus on alleviating catastrophic forgetting during continual instruction tuning of LLMs, particularly with minimal data retention and its interplay with PEFT.

We propose a novel continual instruction tuning method, SwitchCIT, that alleviates forgetting of previously seen tasks by introducing a switch network to identify a task given an instruction, leveraging the clustering phenomenon of task-specific instruction vectors (Wu and Varshney, 2024). For each new task, we improve task performance by fine-tuning extra parameters created by PEFT methods such as LoRA (a self-expansion process), which makes the method more practical.

Catastrophic forgetting in neural networks is related to the palimpsest phenomenon, in which new memories rapidly overwrite old ones (Zenke and Laborieux, 2024). SwitchCIT may be considered a way to avoid overwriting the old memories of previously learned tasks by introducing extra parameters for each new task. Moreover, there exists a line of methods inspired by synaptic consolidation in brains that reduce the learning rate on specific weights based on their importance to previously encountered tasks. Our method has the advantage of enabling the full tuning of all weights to adapt to a specific task without restricting any particular set of weights.

We summarize our contributions as follows:

  • We propose a novel continual instruction-tuning approach to alleviate catastrophic forgetting by using a switch network for task routing to different specialized models tuned via PEFT.

  • We conduct experiments for instruction-tuning on five continual natural language generation tasks, demonstrating the effectiveness of our method compared to several baselines.

2 Related Work

Continual Learning and Catastrophic Forgetting. The study of continual learning focuses on developing algorithms that learn from a continuous stream of data, enabling a model to acquire new knowledge while retaining previously learned information without catastrophic forgetting (Wang et al., 2024b). Catastrophic forgetting happens when LLMs forget previously learned information as new tasks are learned (Luo et al., 2023). Anonymous (2024) provides an insightful study empirically showing that pre-trained LLMs may forget domain knowledge and tasks that were not included in the fine-tuning process, while supervised fine-tuning offers substantial benefits to the models. To counter this effect, here we use a switch network to classify tasks from their instructions and route computations accordingly. By doing so, we can improve task performance by fine-tuning extra parameters created by PEFT methods such as LoRA for each task.

Understanding Transformers. Prior studies have offered insightful understandings of Transformer models with focuses on the internal representations (Wu and Varshney, 2024, 2023; Nanda et al., 2023) and attention mechanisms (Sun and Marasović, 2021; Olsson et al., 2022). Inspired by Wu and Varshney (2024), we design a novel method for continual instruction tuning of LLMs via switching, rather than concentrating on the understanding of Transformers.

Instruction Tuning. Instruction tuning is the process of tuning a model on specific instructions or prompts that guide the model toward behaving in the desired fashion. A major issue with LLMs has been the mismatch between the training objective of the LLM and users’ objectives. Instruction tuning has been developed, in part, to address this issue. This method of training aims to align language models with human intent (Ouyang et al., 2022; Stiennon et al., 2020; Zhang et al., 2023b). We concentrate our work on the specific case of continual instruction tuning across different tasks, which presents unique challenges such as catastrophic forgetting.

Parameter-Efficient Fine-Tuning. PEFT addresses the challenge of needing enormous computing resources to fine-tune contemporary LLMs. PEFT reduces the number of fine-tuning parameters and memory usage while still achieving results similar to full fine-tuning (Xu et al., 2023). One particularly popular PEFT method is LoRA, which freezes the weights of the pre-trained model and injects trainable rank decomposition matrices into each layer of the Transformer architecture, so that training updates only a small number of additional parameters rather than the original pre-trained model (Hu et al., 2022). Here, we use LoRA to create extra parameters for fine-tuning on tasks not yet seen by the LLM.
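For reference, the LoRA update described above can be written compactly as follows. This is a sketch in the notation of Hu et al. (2022); the dimensions $d$ and $k$, the rank $r$, and the scaling constant $\alpha$ are symbols introduced here for illustration and are not defined elsewhere in this paper.

```latex
h = W x + \Delta W x = W x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here $W$ stays frozen and only $A$ and $B$ are trained, so each adapted weight matrix adds $r(d + k)$ trainable parameters instead of $dk$.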

Mixture of Experts. Mixture of experts (MoE) models integrate multiple sub-models, or experts, to address different parts of the input space (Jacobs et al., 1991; Du et al., 2022; Zoph et al., 2022). Though the MoE philosophy is similar to ours, SwitchCIT uses different models to handle different parts of a task space represented by instructions, rather than an arbitrary input space as in MoE models. Also, we separate the learning processes for model selection and the models themselves, whereas MoE models learn both simultaneously. SwitchCIT can self-expand its parameters to adapt to new tasks, whereas MoE models typically do not.

3 Method

Figure 1: The inference procedure of SwitchCIT. The instruction is first fed into a switch network consisting of a lightweight LLM and a task classifier. The last token representation from the final layer of the lightweight LLM is used as the input to the task classifier. The switch network classifies the given task and then routes computation to the associated set of parameters.

We illustrate SwitchCIT in Figure 1. At inference time, a tuple $[I, x]$ is given, where $I$ is an instruction and $x$ is an optional input. Based on the instruction $I$, a switch network routes the computation to a model trained explicitly for the predicted task such that the performance of both previously learned and newly learned tasks is mostly retained. More specifically, the switch network identifies the task from its instruction via multi-class classification, using instruction features extracted by a lightweight LLM, $W_{small}$. We use the last token representation of an instruction from the final layer of $W_{small}$ as the features. This design is inspired by the fact that vector representations of instructions belonging to the same task are clustered together within the hidden representation space, with the task-specific clustering phenomenon becoming more pronounced in later layers (Wu and Varshney, 2024). Note that effective clustering implies good separability of task representations.

A selected model relies on a concatenation of the instruction and the input, $[I; x]$, to predict an output $y$ via an internal representation $h$ produced by a base LLM $W$ and its task-specific weight $\Delta W$. For brevity, we omit details about the computation of $h$ and of reaching $y$ from $h$, which involves a causal decoding process; see Vaswani et al. (2017) and Hu et al. (2022) for more details. The switch network therefore allows tasks to be handled by models dedicated to them. Models tailored to different tasks do not interfere with one another, which consequently alleviates catastrophic forgetting of previously learned tasks. Here, both $x$ and $y$ can be considered textual sequences in the context of language generation. All models $M_1$ to $M_T$ are instruction-tuned for different tasks ($1$ to $T$) by introducing extra parameters $\Delta W$ through a PEFT method such as LoRA. The switch network may be easily implemented as a multi-layer perceptron (MLP) model with an instruction feature encoder.
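The routing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vector stands in for the last-token representation from the lightweight LLM, the tiny weights are made up, and the names `SwitchClassifier`, `route`, `toy_switch`, and `toy_adapters` are hypothetical. The adapter entries are placeholders for the task-specific parameters $\Delta W$.

```python
from typing import Dict, List, Sequence, Tuple


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Sequence[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


class SwitchClassifier:
    """Two-layer MLP that maps an instruction feature vector to a task name."""

    def __init__(self, w1, b1, w2, b2, task_names: List[str]):
        self.w1, self.b1, self.w2, self.b2 = w1, b1, w2, b2
        self.task_names = task_names

    def predict(self, features: Sequence[float]) -> str:
        hidden = relu(linear(self.w1, self.b1, features))
        logits = linear(self.w2, self.b2, hidden)
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        return self.task_names[best]


def route(features: Sequence[float], switch: SwitchClassifier,
          adapters: Dict[str, str]) -> Tuple[str, str]:
    """Pick the task from the instruction features, then select its adapter."""
    task = switch.predict(features)
    return task, adapters[task]


# Toy setup (hypothetical): 2-dimensional features and two learned tasks.
toy_switch = SwitchClassifier(
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0],
    task_names=["Simp", "Emdg"],
)
toy_adapters = {"Simp": "lora_simp", "Emdg": "lora_emdg"}  # stand-ins for Delta W

print(route([0.9, 0.1], toy_switch, toy_adapters))
```

Because the classifier only sees instruction features, the adapters can be trained, stored, and swapped independently of it.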

4 Experimental Setup

In this section, we briefly overview the model implementation and datasets. Experimental setups are further detailed in Appendices B and C.

We use BLOOMZ 1.1B and BLOOMZ 7.1B (Muennighoff et al., 2023) as two choices for our base LLM and LoRA (Hu et al., 2022) to learn task-specific and portable parameters. The switch network is implemented using a two-layer MLP network. We use OPT-125M (Zhang et al., 2022), a lightweight LLM with only 125 million parameters, to extract features for the switch network. Implementation and training details are in Appendix B.

We consider the benchmark introduced by Scialom et al. (2022) and five continual instruction tasks for natural language generation selected by Luo et al. (2023) to differ from the training and evaluation tasks of BLOOMZ. These tasks are:

  • Text Simplification (Simp) (Jiang et al., 2020; Alva-Manchego et al., 2020), paraphrasing a given text into simpler language;

  • Empathetic Dialogue Generation (Emdg) (Rashkin et al., 2019), generating a dialogue response that offers a reason within a specified emotional context;

  • Inquisitive Question Generation (InqQG) (Fan et al., 2019), creating questions that prompt long-form answers;

  • Explanation Generation (Exp) (Camburu et al., 2018), generating natural language explanations for provided premises, hypotheses, or labels;

  • Headline Generation with Constraint (HGen) (Scialom et al., 2022), producing headlines that include specific keywords, positioned as per the specified constraints.

We evaluate a model’s performance on each task using the metrics outlined by Scialom et al. (2022). Specifically, we use SARI for Simp; BERTScore (BS) for Emdg, InqQG, and Exp; and ROUGE-1 (R1) for HGen. We use the BERTScore version based on DeBERTa-MNLI.

Following Luo et al. (2023), we train a model in the following order of instruction tasks: Simp → Emdg → InqQG → Exp → HGen. We use the specific prompts designed by Scialom et al. (2022) and train on 100,000 data samples, following previous works (Luo et al., 2023; Scialom et al., 2022). We feed only the instruction, without the prompt template, to the switch network for determining task identity, since the prompt template does not contain useful information for task identification and the instruction alone is sufficient for high performance, as shown in Table 1.

We compare our method to a direct supervised fine-tuning approach and to a continual learning method with a rehearsal mechanism, following Scialom et al. (2022). We consider the rehearsal method because it is the only method in the literature developed in an experimental setting close to our continual instruction tuning setting, and because it is representative of continual learning methods based on data replay. Both our method and the other methods employ the same PEFT using LoRA.

To ensure a fair comparison, we use the same amount of data for rehearsal as we do for training our switch network. To investigate the regime of low data retention, we reserve only 0.01% of the training data for each task to train the switch network, which is significantly less than the amount of replay data, such as the 1%, used by traditional continual learning methods for rehearsal. We also evaluate the methods using 1% of the training data, the setting for which the rehearsal method was originally developed. Details of the instructions, prompts, and training are in Appendix C. We evaluate these methods on the test splits of the task datasets, as detailed in Appendix C.
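To give a sense of scale for the two retention rates, the sketch below counts how many examples per task would be retained at each continual learning stage. It assumes the percentages apply to the training-set sizes listed in Table 3 and that counts are rounded to the nearest integer; neither detail is stated in the text, and the helper names are hypothetical.

```python
# Training-set sizes as listed in Table 3 and the task order used in this work.
TRAIN_SIZES = {"Simp": 100_002, "Emdg": 58_770, "InqQG": 61_710,
               "Exp": 100_002, "HGen": 100_002}
TASK_ORDER = ["Simp", "Emdg", "InqQG", "Exp", "HGen"]


def retained_count(n_train: int, percent: float) -> int:
    """Examples kept when retaining `percent`% of a task's training data.

    Rounding to the nearest integer (with a floor of 1) is an assumption.
    """
    return max(1, round(n_train * percent / 100))


def switch_training_sizes(percent: float) -> dict:
    """Per-stage, per-task retained counts for tasks seen so far.

    No entry exists for the first task: with a single task there is
    nothing to route between, so no switch network is trained.
    """
    stages = {}
    for k in range(2, len(TASK_ORDER) + 1):
        seen = TASK_ORDER[:k]
        stages[TASK_ORDER[k - 1]] = {
            t: retained_count(TRAIN_SIZES[t], percent) for t in seen
        }
    return stages


for pct in (0.01, 1.0):
    final_stage = switch_training_sizes(pct)["HGen"]
    print(pct, final_stage, sum(final_stage.values()))
```

Under these assumptions, the 0.01% setting leaves only a handful of instructions per task, which is the low-retention regime the comparison targets.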

Sub Emdg InqQG Exp HGen
0.01 % 100.0 % 96.3 % 97.6 % 97.1 %
1.0 % 100.0 % 99.9 % 99.9 % 99.6 %
Table 1: Progressive performance of task classification by our switch networks, measured by accuracy. A continual learning stage is denoted by its task name. The "Sub" column gives the sub-sampling percentage of the training data used. Note that there is no switch network for the first learned task.

5 Switch Network

Table 1 presents the progressive performance of task classification by our switch networks trained under two conditions: a low data rate of 0.01%, and a rate of 1%, which is comparable to the replay data used by the rehearsal method of Scialom et al. (2022). Note that after learning each task, we retrain a very lightweight switch network to accommodate the newly learned task. The switch networks achieve very high classification accuracy at every learning stage in both settings, even when using a lightweight LLM such as OPT-125M for feature extraction. Accuracy improves with more training data. Performance does not obviously degrade when including more tasks, demonstrating good robustness. Notice that competitive classification performance is reached even with 100× less data, which may be explained by the good separability of different tasks due to the task clustering shown by Wu and Varshney (2024).

Method Simp(SARI) Emdg(BS) InqQG(BS) Exp(BS) HGen(R1)
Initial (BLOOMZ-1.1B) 37.7 0.483 0.457 0.515 0.269
Full-SFT (BLOOMZ-7.1B) 47.2 0.533 0.597 0.687 0.329
SFT (BLOOMZ-1.1B) 34.9 0.265 0.454 0.369 0.356
Rehearsal (BLOOMZ-1.1B, 0.01%) 36.8 0.458 0.484 0.587 0.359
Rehearsal (BLOOMZ-1.1B, 1%) 48.4 0.533 0.589 0.685 0.357
SwitchCIT (BLOOMZ-1.1B, 0.01%) 49.0 0.545 0.559 0.712 0.355
SwitchCIT (BLOOMZ-1.1B, 1%) 49.0 0.546 0.593 0.712 0.359
SwitchCIT (BLOOMZ-7.1B, 0.01%) 49.5 0.561 0.577 0.726 0.414
SwitchCIT (BLOOMZ-7.1B, 1%) 49.5 0.562 0.615 0.726 0.418
Table 2: The final performance of various methods on different tasks after continual learning. Tasks are presented in the learning order. We also list the performance of the original LLM as "Initial". "Full" indicates that full fine-tuning is used instead of parameter-efficient fine-tuning. The percentage (0.01% or 1%) is the fraction of training samples used for training the switch network or for replay. Evaluation metrics for different tasks are shown beside their names. SFT refers to the supervised fine-tuning method. R1 and BS are abbreviations of ROUGE-1 and BERTScore, respectively. We use a horizontal line to separate models that are not directly comparable.

6 Continual Instruction Tuning

Table 2 demonstrates that SwitchCIT not only improves upon the performance of the base LLM ("Initial") but also outperforms other methods on most tasks after the continual learning process under the same settings. The gap becomes more pronounced when retaining only a very limited amount of data from previous tasks (0.01%). Our method also surpasses the rehearsal method at the replay data rate (1%) studied by Scialom et al. (2022). Additionally, we compare SwitchCIT (BLOOMZ-1.1B, 0.01%), which relies on PEFT, to a fully supervised fine-tuned approach based on the BLOOMZ-7.1B model (Luo et al., 2023; Muennighoff et al., 2023), which is approximately seven times larger than our base LLM. Surprisingly, our method still achieves better performance on most tasks, underscoring its effectiveness. In contrast to the rehearsal method, our approach experiences notably less performance degradation when reducing the retained data from 1% to 0.01%. We hypothesize that, owing to our high-performing switch network, our method can select a tailored sub-model for each task, yielding strong performance.
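The degradation claim can be checked directly against the BLOOMZ-1.1B rows of Table 2. The short sketch below copies those four rows and computes the per-task drop when moving from 1% to 0.01% retained data; the function name is hypothetical, and drops are reported per task because the metrics are on different scales (SARI versus BERTScore and ROUGE-1).

```python
TASKS = ["Simp", "Emdg", "InqQG", "Exp", "HGen"]

# Rows copied from Table 2 (BLOOMZ-1.1B base model).
TABLE2 = {
    ("Rehearsal", "0.01%"): [36.8, 0.458, 0.484, 0.587, 0.359],
    ("Rehearsal", "1%"): [48.4, 0.533, 0.589, 0.685, 0.357],
    ("SwitchCIT", "0.01%"): [49.0, 0.545, 0.559, 0.712, 0.355],
    ("SwitchCIT", "1%"): [49.0, 0.546, 0.593, 0.712, 0.359],
}


def drop_when_reducing_data(method: str) -> dict:
    """Per-task score at 1% minus score at 0.01% (positive means degradation)."""
    high = TABLE2[(method, "1%")]
    low = TABLE2[(method, "0.01%")]
    return {t: round(h - l, 3) for t, h, l in zip(TASKS, high, low)}


for method in ("Rehearsal", "SwitchCIT"):
    print(method, drop_when_reducing_data(method))
```

On these numbers the rehearsal method loses 11.6 SARI points on Simp and roughly 0.08 to 0.1 BERTScore on Emdg, InqQG, and Exp, whereas SwitchCIT's largest drop is 0.034 on InqQG.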

Figure 2: Progressive relative gain of various models. The horizontal axis presents different learning stages labeled by their respective task names, whereas the vertical axis shows the relative gain. The performance on a task is shown across stages once that task has been learned. (a) Supervised fine-tuning (SFT). (b) Rehearsal method. (c) SwitchCIT.

We calculate a relative gain, a score normalized by the performance a model achieves when fine-tuned only on one specific task, following Scialom et al. (2022). A high relative gain indicates effective retention of performance on a specific task. From Figure 2, notice that SwitchCIT experiences minimal catastrophic forgetting compared to other approaches, as shown by almost perfect retention of performance for various tasks. The less perfect retention on the InqQG task is due to imperfect task identification by the switch network. In contrast to our method, the rehearsal method retains only some performance on previous tasks. Both it and SFT exhibit noticeable levels of catastrophic forgetting as they continue to learn additional tasks.
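As a rough illustration of this metric, the sketch below treats relative gain as the ratio between a model's score at a given stage and the score of a model fine-tuned only on that task. The exact normalization is defined by Scialom et al. (2022) and is not restated here, so the ratio form is an assumption; the function names and the numbers in the example are made up and are not results from this paper.

```python
from typing import List, Sequence


def relative_gain(score: float, single_task_score: float) -> float:
    """Score at a given stage divided by the single-task fine-tuning score.

    Assumed ratio form; a value near 1.0 means performance is retained.
    """
    if single_task_score == 0:
        raise ValueError("single-task reference score must be non-zero")
    return score / single_task_score


def progressive_relative_gain(scores_by_stage: Sequence[float],
                              single_task_score: float) -> List[float]:
    """Relative gain of one task across successive learning stages."""
    return [relative_gain(s, single_task_score) for s in scores_by_stage]


# Hypothetical scores for one task measured after three later stages.
print(progressive_relative_gain([40.0, 38.0, 30.0], single_task_score=40.0))
```

A curve that stays near 1.0 across stages corresponds to the near-perfect retention described above, while a falling curve indicates forgetting.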

7 Efficiency and Portability

Many existing works overcome catastrophic forgetting by imposing constraints on the existing parameters of a model, so their model sizes do not change. SwitchCIT instead introduces new parameters for each additional task. For example, when using BLOOMZ 1.1B as the base LLM, the additional parameters for one task account for only 0.878% of the total parameters. Considering five continual tasks, the additional parameters amount to just 4.39% in exchange for minimal catastrophic forgetting, which makes the method lightweight and practical. Note that only the additional parameters specific to a single task are loaded during inference. We anticipate further improvements in these numbers as parameter-efficient methods continue to advance.
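The overhead figures follow from simple arithmetic, sketched below. The absolute parameter estimate assumes the 0.878% figure is taken relative to a nominal 1.1 billion base parameters, which the text does not state explicitly; the function names are hypothetical.

```python
PER_TASK_OVERHEAD_PCT = 0.878   # per-task overhead reported for BLOOMZ 1.1B
NUM_TASKS = 5
NOMINAL_BASE_PARAMS = 1.1e9     # nominal size; the exact count is an assumption


def stored_overhead_pct(per_task_pct: float, num_tasks: int) -> float:
    """Overhead of keeping one adapter per task on disk."""
    return per_task_pct * num_tasks


def approx_adapter_params(per_task_pct: float, base_params: float) -> float:
    """Rough absolute size of one task's adapter."""
    return per_task_pct / 100 * base_params


print(f"stored for {NUM_TASKS} tasks: "
      f"{stored_overhead_pct(PER_TASK_OVERHEAD_PCT, NUM_TASKS):.2f}%")
print(f"loaded at inference: {PER_TASK_OVERHEAD_PCT}% "
      f"(about {approx_adapter_params(PER_TASK_OVERHEAD_PCT, NOMINAL_BASE_PARAMS) / 1e6:.1f}M parameters)")
```

Storage grows linearly with the number of tasks, while the parameters loaded for any single inference stay constant at one adapter.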

Separating the development of the switch network from the instruction-tuned models greatly enhances SwitchCIT’s portability. For instance, to improve task identification by our switch network using more data (from 0.01% to 1.0%, as shown in Table 1), we only need to retrain the switch network and plug it into the existing instruction-tuned models. Conversely, we can also reuse existing switch networks with better instruction-tuned models, as shown in Table 2, where we leverage the same switch network for models with larger base LLMs such as BLOOMZ 7.1B.

8 Conclusion

We proposed a novel continual instruction-tuning approach to alleviate catastrophic forgetting by using a switch network to identify tasks and then route computations to parameter-efficient tuned models. Experiments conducted on five instruction-based continual natural language generation tasks demonstrate the effectiveness of our method compared to several baselines.

9 Limitations

Because of computational constraints, we could only test our method on relatively small-scale LLMs. However, by design, our high-performing switch network is independent of the base LLM and can be paired with larger-scale LLMs that offer superior performance. The strong performance of our switch network showcases the effectiveness of our method. It is worth noting that the task classifier in the switch network is very lightweight (154K parameters) and requires minimal data (0.01% of the training data), making it highly practical and easy to integrate. The parameters introduced by LoRA account for less than 2% of the total parameters of the base LLM, contributing to the overall lightweight nature of our method.

Our current approach does not facilitate learning transferable knowledge across continual tasks. Exploring methods to enable our model to leverage transferable knowledge across tasks will be an important future direction for improvement.

References

  • Alva-Manchego et al., (2020) Alva-Manchego, F., Martin, L., Bordes, A., Scarton, C., Sagot, B., and Specia, L. (2020). ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Anonymous, (2024) Anonymous (2024). Amuro and char: Analyzing the relationship between pre-training and fine-tuning of large language models. In Submitted to ACL Rolling Review - June 2024. under review.
  • Bubeck et al., (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712 [cs.CL].
  • Camburu et al., (2018) Camburu, O.-M., Rocktäschel, T., Lukasiewicz, T., and Blunsom, P. (2018). e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, volume 31.
  • Dettmers et al., (2023) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115.
  • Du et al., (2022) Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. (2022). GLaM: Efficient scaling of language models with mixture-of-experts. In Proceedings of the 39th International Conference on Machine Learning, pages 5547–5569.
  • Fan et al., (2019) Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., and Auli, M. (2019). ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567.
  • Hu et al., (2022) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations (ICLR).
  • Hu et al., (2019) Hu, W., Lin, Z., Liu, B., Tao, C., Tao, Z. T., Zhao, D., Ma, J., and Yan, R. (2019). Overcoming catastrophic forgetting for continual learning via model adaptation. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
  • Huang et al., (2023) Huang, C., Liu, Q., Lin, B. Y., Du, C., Pang, T., and Lin, M. (2023). LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv:2307.13269 [cs.CL].
  • Jacobs et al., (1991) Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.
  • Jiang et al., (2020) Jiang, C., Maddela, M., Lan, W., Zhong, Y., and Xu, W. (2020). Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7943–7960.
  • Kirkpatrick et al., (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
  • Luo et al., (2023) Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747 [cs.CL].
  • Muennighoff et al., (2023) Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T. L., Bari, M. S., Shen, S., Yong, Z.-X., Schoelkopf, H., et al. (2023). Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 15991–16111.
  • Nair and Hinton, (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, pages 807–814.
  • Nanda et al., (2023) Nanda, N., Lee, A., and Wattenberg, M. (2023). Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941.
  • Olsson et al., (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. (2022). In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  • OpenAI, (2023) OpenAI (2023). GPT-4 technical report. arXiv:2303.08774 [cs.CL].
  • Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.
  • Radford et al., (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners.
  • Rashkin et al., (2019) Rashkin, H., Smith, E. M., Li, M., and Boureau, Y.-L. (2019). Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381.
  • Scialom et al., (2022) Scialom, T., Chakrabarty, T., and Muresan, S. (2022). Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122.
  • Stiennon et al., (2020) Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. (2020). Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021.
  • Sun and Marasović, (2021) Sun, K. and Marasović, A. (2021). Effective attention sheds light on interpretability. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4126–4135.
  • Valipour et al., (2023) Valipour, M., Rezagholizadeh, M., Kobyzev, I., and Ghodsi, A. (2023). DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287.
  • Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
  • Wang et al., (2024a) Wang, H., Lu, H., Yao, L., and Gong, D. (2024a). Self-expansion of pre-trained models with mixture of adapters for continual learning. arXiv:2403.18886 [cs.LG].
  • Wang et al., (2024b) Wang, L., Zhang, X., Su, H., and Zhu, J. (2024b). A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence. to appear.
  • Wei et al., (2022) Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2022). Finetuned language models are zero-shot learners. In Proceedings of the 10th International Conference on Learning Representations (ICLR).
  • Wu and Varshney, (2023) Wu, X. and Varshney, L. R. (2023). A meta-learning perspective on transformers for causal language modeling. arXiv preprint arXiv:2310.05884.
  • Wu and Varshney, (2024) Wu, X. and Varshney, L. R. (2024). Transformer-based causal language models perform clustering. arXiv:2402.12151 [cs.CL].
  • Xu et al., (2023) Xu, L., Xie, H., Qin, S.-Z. J., Tao, X., and Wang, F. L. (2023). Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv:2312.12148 [cs.CL].
  • Yoon et al., (2018) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. (2018). Lifelong learning with dynamically expandable networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
  • Zenke and Laborieux, (2024) Zenke, F. and Laborieux, A. (2024). Theories of synaptic memory consolidation and intelligent plasticity for continual learning. arXiv:2405.16922 [q-bio.NC].
  • Zhang et al., (2023a) Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., and Zhao, T. (2023a). Adaptive budget allocation for parameter-efficient fine-tuning. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
  • Zhang et al., (2023b) Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., and Wang, G. (2023b). Instruction tuning for large language models: A survey. arXiv:2308.10792 [cs.CL].
  • Zhang et al., (2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. (2022). OPT: Open pre-trained transformer language models. arXiv:2205.01068 [cs.CL].
  • Zoph et al., (2022) Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. (2022). ST-MoE: Designing stable and transferable sparse expert models. arXiv:2202.08906 [cs.CL].

Appendix A Additional Related Works

Continual Learning and Catastrophic Forgetting. Continual learning methods are often classified into the following categories: replay-based methods, regularization-based methods, and architecture-based methods (Wang et al., 2024a). Catastrophic forgetting arises when model parameters are updated to account for new data, causing degradation or a complete loss of performance on previously learned tasks. Catastrophic forgetting is not specific to LLMs. Other neural networks also experience this phenomenon, leading to methods such as Elastic Weight Consolidation (Kirkpatrick et al., 2017). Previous research to resolve this problem has scaled the number of parameters in the model (Wang et al., 2024a; Yoon et al., 2018). It has been demonstrated that these solutions work in theory but suffer from over-reliance on scaling LLM parameters. Works such as Hu et al. (2019) avoid this by splitting parameters into two sets: one for tasks already learned and one to be generated dynamically.

Instruction Tuning. Models trained by instruction tuning have been shown to perform better on unseen tasks than models trained without it (Wei et al., 2022). Instruction tuning, however, faces the challenge of crafting high-quality instructions that cover the desired behavior, and it seems to capture only surface-level patterns rather than truly comprehending the task (Zhang et al., 2023b).

Parameter-Efficient Fine-Tuning. Since the original proposal of LoRA, there have been many derivatives that aim to resolve certain limitations of the original LoRA method (Valipour et al., 2023; Zhang et al., 2023a; Dettmers et al., 2023; Huang et al., 2023). Notwithstanding, LoRA still seems to be the most widely used PEFT method.

Task    Set        Size
Simp    Training   100,002
Simp    Testing    4,000
Emdg    Training   58,770
Emdg    Testing    8,396
InqQG   Training   61,710
InqQG   Testing    1,681
Exp     Training   100,002
Exp     Testing    9,824
HGen    Training   100,002
HGen    Testing    1,951
Table 3: Data statistics of various tasks and their splits.

Appendix B Implementation Details

We implement our switch network using OPT-125M (Zhang et al., 2022) as a feature extractor and a two-layer MLP with a ReLU activation function (Nair and Hinton, 2010) as its classifier. We present hyperparameters related to the switch network in Table 4 and to the continually learned models in Table 5. We also present the sizes of the models used in Table 6. We perform the experiments on two NVIDIA A100 GPUs with 80 GB GPU memory.

Hyperparameter Value
Learning rate 1E-3
Number of epochs 20
Optimizer AdamW
Scheduler Constant
Number of layers 2
Hidden dimension 200
Table 4: Hyperparameters related to the switch network and its training.

Hyperparameter Value
Learning rate 2E-5
Number of epochs 3 per task
Optimizer AdamW
Scheduler Constant
LoRA rank 64
LoRA Alpha 16
LoRA dropout 0.05
LoRA bias None
Table 5: Hyperparameters related to models undergoing continual instruction tuning and their training. We adopt the training hyperparameters from Luo et al. (2023).

Model Parameter Count
Switch Network 154 thousand
OPT-125M 125 million
BLOOMZ-1.1B 1.1 billion
BLOOMZ-7.1B 7.1 billion
Table 6: Sizes of models used in this work in terms of parameter counts.
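The switch network size in Table 6 can be sanity-checked with a short parameter count. The sketch assumes the classifier input is the 768-dimensional hidden state of OPT-125M, that the output layer has one unit per task (five), and that both layers include biases; these details are not spelled out in the text, and the function name is hypothetical.

```python
def mlp_param_count(input_dim: int, hidden_dim: int, num_classes: int) -> int:
    """Weights plus biases of a two-layer MLP classifier."""
    first_layer = input_dim * hidden_dim + hidden_dim
    second_layer = hidden_dim * num_classes + num_classes
    return first_layer + second_layer


# Assumed: 768-dim features, hidden dimension 200 (Table 4), five task classes.
print(mlp_param_count(input_dim=768, hidden_dim=200, num_classes=5))
```

Under these assumptions the count is 154,805, which is consistent with the roughly 154 thousand parameters listed for the switch network.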

Appendix C Continual Instruction Tuning Tasks

During instruction tuning, we follow Scialom et al. (2022) and start by adding a general prompt template at the beginning of the data: ‘Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request…’. This is then followed by a specific prompt for each task. We present statistics of the different task datasets in Table 3. In addition, we searched online and found no indication that the data contain offensive content or information that names or uniquely identifies individual people.