Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

Jinhao Jiang1, Junyi Li1, Wayne Xin Zhao1, Yang Song3*, Tao Zhang3 and Ji-Rong Wen1,2
1Gaoling School of Artificial Intelligence, Renmin University of China.
2School of Information, Renmin University of China.
3BOSS Zhipin, Beijing, China
* Corresponding authors.
Abstract

Adapting general large language models (LLMs) to specialized domains is challenging because data distributions vary across domains. This adaptation typically requires continual pre-training on massive domain-specific corpora to facilitate knowledge memorization, followed by training to apply this knowledge in line with human instructions and preferences. However, this method may result in inefficient knowledge memorization due to a lack of awareness of knowledge utilization, and it places substantial demands on LLMs to learn knowledge utilization and format alignment simultaneously with limited training samples. To facilitate the domain adaptation of LLMs, we revise this process and propose a new domain adaptation framework, called Mix-CPT, that consists of domain knowledge learning and general format alignment. Specifically, we first conduct knowledge mixture continual pre-training that concurrently focuses on knowledge memorization and utilization, allowing for mutual reinforcement. To avoid catastrophic forgetting during the continual pre-training process, we further incorporate a logit swap self-distillation constraint. Subsequently, leveraging the knowledge and capabilities acquired during continual pre-training, we efficiently perform instruction tuning and alignment with a few general training samples to achieve format alignment. Extensive experiments demonstrate that our proposed Mix-CPT framework can simultaneously improve the task-solving capabilities of LLMs on the target and general domains compared to traditional adaptation methods.

1 Introduction

Large Language Models (LLMs) (Zhao et al., 2023) have revolutionized the field of natural language processing (NLP) (Brown et al., 2020; OpenAI, 2023), showing exceptional capabilities such as instruction following (Ouyang et al., 2022a; Taori et al., 2023) and complex reasoning (Wei et al., 2022; Wang et al., 2023a). However, due to their limited exposure to relevant data, such general LLMs still considerably lag behind in specific domains requiring professional knowledge. This situation has necessitated the effective adaptation of general-purpose LLMs to specific domains (e.g., mathematics and code), called domain adaptation of LLMs (Guo & Yu, 2022).

In essence, tailoring general LLMs to specific domains requires adaptation in two main aspects, namely knowledge learning (acquiring and leveraging the necessary domain knowledge) and format alignment (responding to the user in an expected output form) (Jiang et al., 2024; Zhou et al., 2023). Specifically, knowledge learning can be further achieved through knowledge memorization and knowledge utilization. In practice, domain adaptation of LLMs typically involves three consecutive stages (Rozière et al., 2023; Azerbayev et al., 2023), i.e., pre-training, instruction tuning, and alignment, where the first stage is primarily aimed at knowledge memorization and the other two stages mainly focus on knowledge utilization and format alignment. However, at the pre-training stage, knowledge memorization based on a raw domain-specific corpus can be inefficient, because the acquired knowledge is not elicited according to task goals (Jiang et al., 2024). Although some studies incorporate instruction data into pre-training, they often rely on proprietary models to synthesize high-quality instructions at scale (Cheng et al., 2024; Wang et al., 2024), which is difficult without extensive fine-tuning experience. Another issue is that learning knowledge utilization and format alignment together in the instruction tuning and alignment stages might lead to suboptimal performance, since the two goals can diverge during model optimization (Ren et al., 2024).

Figure 1: Comparison of traditional domain adaptation approaches (top) and our proposed rescheduled domain adaptation paradigm (bottom). “[EOS]” is the special token representing the end of the document. “[ST]”, “[UT]” and “[AT]” denote the system, user, and assistant chat templates, respectively.

Considering the above issues, this paper explores a new domain adaptation approach that uses only a raw domain-specific corpus and general instruction or alignment data. Our hypothesis is that the knowledge utilization capacity can essentially be learned from general instruction or alignment data, which is also supported by prior studies (Ouyang et al., 2022b; Rafailov et al., 2023b). In this way, we can remove the tedious instruction synthesis step from the training pipeline, since it is much easier to obtain general or mixed-domain instruction data from open resources. Another important design choice is to enhance knowledge learning by jointly pursuing both memorization and utilization of knowledge. To implement this idea, we schedule all the instruction and alignment data at the pre-training stage (in a suitable format), and then reuse only a small proportion of the instruction and alignment data in the instruction-tuning and alignment stages to achieve format alignment. We compare our rescheduled process with traditional domain adaptation in Figure 1.

Specifically, our approach for domain adaptation of LLMs consists of two main stages: domain knowledge learning and general format alignment. In the first stage, we conduct knowledge mixture continual pre-training to integrate both knowledge memorization and utilization. The memorization of new knowledge can be facilitated by taking into account how this knowledge will be utilized. In the second stage, based on the knowledge and capabilities already acquired during pre-training, we perform instruction tuning and alignment in an efficient manner to achieve format alignment. For unified training, we convert raw domain documents, instruction tuning data, and alignment data into a unified format for conducting knowledge mixture continual pre-training (Mix-CPT). To avoid catastrophic forgetting in continual pre-training, we propose Logit Swap Self-Distillation (LSSD), which exchanges the predicted top-1 token logit with the logit of the ground-truth token and uses the result as the surrogate target. In this way, LSSD preserves most of the original probability distribution of the LLM, avoiding dramatic model updates and thereby retaining the original capabilities. In instruction tuning and alignment, we select a small number of easy instructions from the pre-training instruction set, using the perplexity scores of the LLM as the criterion. These instructions have already been seen during pre-training, so the model can focus mainly on pure style or format learning for downstream tasks.

To verify the effectiveness of our proposed Mix-CPT method, we evaluate it on both domain-specific and general tasks, including a total of seven distinctive capabilities based on 17 representative benchmarks. For both base LLMs and chat LLMs, our approach can effectively improve their domain-specific and general performance simultaneously compared to traditional methods of first performing continual pre-training, followed by instruction tuning and alignment.

2 Approach

2.1 Overview

To adapt general-purpose LLMs to specific domains (e.g., Wiki, mathematics, code), our core idea is to decouple knowledge learning and format alignment, and propose an effective two-stage domain adaptation framework for general LLMs, i.e., first performing knowledge mixture continual pre-training (Section 2.2) and then performing efficient format alignment (Section 2.3). We show the overall architecture in Figure 2.

In the first stage, we conduct continual pre-training on the mixed data of raw domain-related documents, general instructions, and alignment data via a unified text format. We aim to utilize general instruction data to better elicit the capacities of knowledge memorization and utilization during continual pre-training. To avoid catastrophic forgetting in pre-training, we design a new learning method, i.e., Logit Swap Self-Distillation (LSSD), that exchanges the top-1 token logit with the ground-truth token logit. In the second stage, based on the domain-knowledge-augmented LLM, we conduct efficient format alignment with a small number of easy instructions or preference samples that have been seen during pre-training. In this way, the LLM can focus on learning the simple style and format for interacting with humans, without much consideration of how to utilize the acquired knowledge. Next, we describe each part in detail.

Figure 2: Illustration of our proposed rescheduled domain adaptation paradigm, which first conducts knowledge mixture continual pre-training and then selects the top-$K$ easy training samples with the lowest perplexity for supervised fine-tuning and direct preference optimization.

2.2 Knowledge Mixture Continual Pre-training

Different from prior work that performs continual pre-training solely on a domain-specific corpus (Que et al., 2024; Ke et al., 2022), we propose to mix domain-specific documents, general instructions, and alignment data as pre-training data. The QA-based instruction data reflects how knowledge will be accessed and utilized through questions. In this way, LLMs can improve their ability to learn new knowledge from the domain-related documents. Incorporating general instructions helps LLMs transfer the general knowledge utilization capability to domain-specific knowledge without relying on highly domain-related instructions, as in previous work (Jiang et al., 2024; Cheng et al., 2024). Specifically, we first transform the raw domain documents, general instructions, and alignment data into a unified knowledge format. Then, we perform continual pre-training on this mixture collection with the objective of next token prediction. To avoid catastrophic forgetting, we further introduce a logit swap self-distillation approach during the continual pre-training process. Next, we introduce these techniques in detail.

2.2.1 Unified Knowledge Format

Typically, adapting a base LLM to a specific domain involves three distinct and relatively independent stages, each based on corresponding data in a different format. Specifically, the base model first performs continual pre-training (CPT) on a domain-specific corpus to learn new knowledge, then undergoes supervised fine-tuning (SFT) on instructions to enhance its instruction-following ability, and finally uses human preference data for human alignment. In this work, we adopt direct preference optimization (DPO) as the alignment algorithm. Formally, we denote the domain-specific corpus as $\mathcal{D}_{\text{CPT}}=\{d_i\}_{i=1}^{n_c}$, where $d_i$ represents a raw domain document consisting of a sequence of tokens. We denote the instructions used in SFT as $\mathcal{D}_{\text{SFT}}=\{\langle q_i, r_i\rangle\}_{i=1}^{n_s}$, where $q_i$ and $r_i$ represent the user query and the expected response, respectively. We denote the alignment data used in DPO as $\mathcal{D}_{\text{DPO}}=\{\langle q_i, r^{+}_i, r^{-}_i\rangle\}_{i=1}^{n_d}$, where $q_i$, $r^{+}_i$, and $r^{-}_i$ represent the user query, positive response, and negative response, respectively.

In this work, we propose to mix $\mathcal{D}_{\text{CPT}}$, $\mathcal{D}_{\text{SFT}}$ and $\mathcal{D}_{\text{DPO}}$ in a unified text format, upon which we further perform knowledge mixture continual pre-training on a general LLM. Unlike previous work that relies on synthesizing high-quality domain instructions (Jiang et al., 2024; Cheng et al., 2024), we empirically find that knowledge utilization is a general capability that can be learned from general instructions, and that this capability can be transferred to enhance the learning of domain knowledge. Specifically, we remove any templates and markers (e.g., [User]) from the instructions and alignment data to construct the mixture data in a unified format, denoted by $\mathcal{D}_{\text{MIX}}=\{x_{\text{cpt}}, x_{\text{sft}}, x_{\text{dpo}}\}$, where $x_{\text{cpt}}=d_i$ is the original domain document, $x_{\text{sft}}=[q_i; r_i]$ denotes the concatenation of the user query and expected response in the instructions, and $x_{\text{dpo}}=[q_i; r^{+}_i]$ is the concatenation of the user query and positive response in the alignment data. We show some examples in Figure 1.

Following existing pre-training methods (Touvron et al., 2023), during the continual pre-training process, we concatenate each kind of data sample (i.e., $x_{\text{cpt}}$, $x_{\text{sft}}$, or $x_{\text{dpo}}$) and truncate the sequence when it reaches the maximum input length of the LLM. In addition, we add an extra special symbol (i.e., [SEP]) at the end of each sample to separate them. We repeat this process until all samples from each kind of data (i.e., $\mathcal{D}_{\text{CPT}}$, $\mathcal{D}_{\text{SFT}}$ and $\mathcal{D}_{\text{DPO}}$) have been concatenated, which yields our final knowledge mixture pre-training data $\mathcal{D}_{\text{MIX}}$.

Note that though we use general instruction data to derive the mixture data here, our approach can be generally extended to incorporating domain-specific instruction data (Cheng et al., 2024), which often relies on specific data synthesis techniques.
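The unified format and packing steps of this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary keys (`document`, `query`, `response`, `chosen`, `rejected`), the newline used to join a query with its response, the whitespace tokenizer, and the helper names `to_unified_text` and `pack` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

SEP = "[SEP]"  # separator symbol named in the text; treating it as one token is an assumption


def to_unified_text(sample: dict) -> str:
    """Map one raw sample to template-free text (hypothetical field names)."""
    if "document" in sample:                             # x_cpt = d_i
        return sample["document"]
    if "chosen" in sample:                               # x_dpo = [q_i; r_i^+]
        return sample["query"] + "\n" + sample["chosen"]
    return sample["query"] + "\n" + sample["response"]   # x_sft = [q_i; r_i]


def pack(samples: Iterable[dict],
         tokenize: Callable[[str], List[str]],
         max_len: int) -> List[List[str]]:
    """Concatenate samples, append SEP after each, and cut into max_len chunks."""
    stream: List[str] = []
    for sample in samples:
        stream.extend(tokenize(to_unified_text(sample)))
        stream.append(SEP)
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


data = [
    {"document": "Euler proved the identity"},
    {"query": "What is 2+2?", "response": "4"},
    {"query": "Name a prime", "chosen": "7", "rejected": "8"},
]
# str.split stands in for a real tokenizer; the rejected response is never used.
chunks = pack(data, str.split, max_len=6)
```

Note that the negative response of a preference pair does not enter the mixture, matching the definition of $x_{\text{dpo}}$ above.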

2.2.2 Logit Swap Self-Distillation

After obtaining the mixture data $\mathcal{D}_{\text{MIX}}=\{x_{\text{cpt}}, x_{\text{sft}}, x_{\text{dpo}}\}$, we then perform continual pre-training on the base LLM. For simplicity, we drop the subscript of each training sample in the mixture data and denote it as $x$. We adopt the pre-training task of next token prediction (NTP), which aims to predict the next token conditioned on all previous tokens. Specifically, given an input $x=\{w_1, w_2, \ldots, w_n\}$, we feed it into the decoder-only LLM and use the standard language modeling objective to minimize the cross-entropy loss as follows:

$$\mathcal{L}_{\text{NTP}} = -\sum_{j=1}^{n} \log \text{Pr}(w_j \mid w_{<j}; \Theta), \tag{1}$$

where $w_j$ denotes the $j$-th token in the input, $w_{<j}$ denotes the previous tokens, and $\Theta$ denotes the model parameters. During continual pre-training, the task of next-token prediction enables the base LLM to learn domain knowledge, and the incorporation of general instruction and alignment data further transfers the general knowledge utilization capability to specific domains.
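As a small illustration of Eq. (1), the following Python sketch computes the next-token prediction loss from per-position logits. It assumes that the logits at each position have already been produced by the model; the helper names `log_softmax` and `ntp_loss` are hypothetical and only the standard library is used.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def ntp_loss(logits_per_pos: Sequence[Sequence[float]],
             targets: Sequence[int]) -> float:
    """Eq. (1): negative sum over positions of log Pr(w_j | w_<j)."""
    return -sum(log_softmax(l)[t] for l, t in zip(logits_per_pos, targets))


# Two positions over a toy 4-token vocabulary with uniform predictions:
# each position contributes log(4) to the loss.
loss = ntp_loss([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], targets=[0, 3])
```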

However, the traditional language modeling objective is prone to catastrophic forgetting of the knowledge previously learned by LLMs. Therefore, we propose an auxiliary training objective, i.e., Logit Swap Self-Distillation (LSSD), which serves as an extra constraint on the standard language modeling objective. Specifically, we first use the original base LLM before continual pre-training (parameterized by $\Theta_{\text{ori}}$) to infer the output logits following the standard language modeling objective, yet without computing the loss:

$$\bm{h}_j = \text{LLM}(w_{<j}; \Theta_{\text{ori}}), \tag{2}$$
$$\bm{l}_j = \bm{h}_j \bm{W}_e^{T}, \tag{3}$$

where $\bm{W}_e$ is the token embedding matrix, $\bm{h}_j$ is the hidden state of the last transformer block, and $\bm{l}_j$ denotes the output logits at the $j$-th position. Then, we exchange the logit values of the top-1 predicted token (i.e., $\widetilde{w}_j$) and the ground-truth token (i.e., $w_j$) if they are not equal, as follows:

$$\bm{\tilde{l}}_j = \text{Exchange}(\bm{l}_j, I_{\widetilde{w}_j}, I_{w_j}), \quad \text{if}~~I_{\widetilde{w}_j} \neq I_{w_j}, \tag{4}$$

where $I_{\widetilde{w}_j}$ and $I_{w_j}$ denote the indices of the top-1 predicted token $\widetilde{w}_j$ and the ground-truth token $w_j$ in the vocabulary, respectively, the function Exchange($\cdot$) exchanges their logit values in $\bm{l}_j$, and the output $\bm{\tilde{l}}_j$ is regarded as the teacher logits in LSSD. In essence, LSSD only calibrates the prediction of the ground-truth token to adapt to the current domain knowledge, while maintaining most of the previously learned knowledge of the base LLM (i.e., represented by the unchanged logit values in $\bm{\tilde{l}}_j$). Then, we can compute the teacher model’s probability distribution for the $j$-th token with the softmax function:

$$\text{Pr}(w_j \mid w_{<j}; \Theta_{\text{ori}}) = \text{softmax}(\bm{\tilde{l}}_j). \tag{5}$$
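To make the swap in Eq. (4) and the teacher distribution in Eq. (5) concrete, here is a minimal Python sketch on a toy four-token vocabulary. The function names `swap_logits` and `softmax` and the example logit values are illustrative assumptions; ties for the top-1 token are broken by taking the first maximal index, which the text does not specify.

```python
import math
from typing import List, Sequence


def swap_logits(logits: Sequence[float], target: int) -> List[float]:
    """Eq. (4): exchange the top-1 logit with the ground-truth logit if they differ."""
    out = list(logits)
    top = max(range(len(out)), key=out.__getitem__)  # index of the top-1 token
    if top != target:
        out[top], out[target] = out[target], out[top]
    return out


def softmax(logits: Sequence[float]) -> List[float]:
    """Eq. (5): turn (swapped) logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


original_logits = [2.0, 0.5, -1.0, 0.1]              # the original model prefers token 0
teacher_logits = swap_logits(original_logits, target=2)  # ground truth is token 2
teacher_probs = softmax(teacher_logits)
# teacher_logits == [-1.0, 0.5, 2.0, 0.1]: only two entries moved,
# so tokens 1 and 3 keep exactly the logits the original model gave them.
```

After the swap, the ground-truth token carries the largest teacher probability, while every logit other than the two exchanged entries is unchanged.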

Finally, we compute the self-distillation objective by minimizing the reverse Kullback-Leibler divergence (Gu et al., 2023) between the current model’s probability distribution and the teacher model’s probability distribution as follows:

$$\mathcal{L}_{\text{LSSD}} = \sum_{j=1}^{n} \sum_{w_j \in \mathcal{V}} \text{Pr}(w_j \mid w_{<j}; \Theta) \log\left(\frac{\text{Pr}(w_j \mid w_{<j}; \Theta)}{\text{Pr}(w_j \mid w_{<j}; \Theta_{\text{ori}})}\right), \tag{6}$$

where $\mathcal{V}$ is the vocabulary and $\Theta$ denotes the parameters of the current LLM. In the knowledge mixture continual pre-training stage, the final total loss is the combination of the next token prediction loss and the self-distillation loss as follows:

$$\mathcal{L}_{\text{CPT}} = \alpha \cdot \mathcal{L}_{\text{NTP}} + (1-\alpha) \cdot \mathcal{L}_{\text{LSSD}}, \tag{7}$$

where $\alpha$ is a coefficient that controls the proportion of the two parts.
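The self-distillation term and the combined objective can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: it assumes the standard non-negative form of the reverse KL divergence, $\text{KL}(p_{\Theta} \,\|\, p_{\text{teacher}})$, it takes the teacher logits as already swapped, and the helper names and the value of `alpha` in the example are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) for one position, computed in log space."""
    log_p = log_softmax(student_logits)   # current model, parameters Theta
    log_q = log_softmax(teacher_logits)   # swapped logits of the original model
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def lssd_loss(student_seq: Sequence[Sequence[float]],
              teacher_seq: Sequence[Sequence[float]]) -> float:
    """Sum of the per-position reverse KL terms over the sequence."""
    return sum(reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq))


def cpt_loss(ntp: float, lssd: float, alpha: float) -> float:
    """Eq. (7): convex combination of the two losses."""
    return alpha * ntp + (1.0 - alpha) * lssd


# alpha = 0.25 is an arbitrary illustrative value, not one reported in the text.
total = cpt_loss(ntp=2.0, lssd=4.0, alpha=0.25)
```

The reverse KL term is zero when the current model reproduces the teacher distribution exactly and grows as the two diverge, which is what lets it act as a constraint on the next token prediction loss.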

2.3 Efficient Format Alignment

In the domain knowledge learning stage, the LLM has learned both to memorize domain knowledge and to utilize that knowledge through our proposed knowledge mixture continual pre-training. After that, during the format alignment stage, the LLM can be fine-tuned more efficiently to master the task format with only a small number of alignment samples. Next, we first introduce the selection of training samples and then describe how we perform general format alignment.

Since we would like to decouple knowledge learning and format alignment, we mainly focus on training samples from $\mathcal{D}_{\text{SFT}}$ and $\mathcal{D}_{\text{DPO}}$ that are both easy and have been encountered during continual pre-training, which avoids introducing new knowledge during supervised fine-tuning. These easy samples are selected based on the perplexity scores of the LLM with respect to the ground-truth output. Specifically, given a sample in $\mathcal{D}_{\text{SFT}}$ or $\mathcal{D}_{\text{DPO}}$, we first wrap the input instruction and output response in the corresponding chat template, i.e., the format for interaction with humans. Then, we feed the formatted sequence into the LLM and compute the perplexity score of the output response. Finally, we select the top-$K$ samples with the lowest perplexity scores to conduct supervised fine-tuning and direct preference optimization. Note that for samples in $\mathcal{D}_{\text{DPO}}$, we only use the positive response to compute the perplexity score.
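The selection step can be sketched in Python as below. The sketch assumes that per-token log-probabilities of each response (the positive response for preference pairs) have already been obtained from the LLM on the chat-formatted sequence; the function names `perplexity` and `select_easiest` and the toy values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a response: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_easiest(samples: Sequence[Tuple[str, Sequence[float]]],
                   k: int) -> List[str]:
    """Return the ids of the k samples whose responses have the lowest perplexity."""
    ranked = sorted(samples, key=lambda s: perplexity(s[1]))
    return [sample_id for sample_id, _ in ranked[:k]]


# (sample id, log-probabilities of the response tokens under the LLM)
scored = [
    ("a", [-0.1, -0.2]),   # perplexity ~ 1.16
    ("b", [-2.0, -1.0]),   # perplexity ~ 4.48
    ("c", [-0.5]),         # perplexity ~ 1.65
]
easy_ids = select_easiest(scored, k=2)
```

Because perplexity is computed only over response tokens, the chat template and the instruction affect the score solely through the conditioning context.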

After selecting the training samples, we next use them to conduct efficient format alignment. First, we use the selected easy instruction samples from $\mathcal{D}_{\text{SFT}}$ to perform supervised fine-tuning on the LLM obtained after continual pre-training, following the standard approach (Ouyang et al., 2022b) of minimizing the cross-entropy loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{j=1}^{n} \log \text{Pr}(r_j \mid q, r_{<j}; \Theta), \tag{8}$$

where $r_j$ and $r_{<j}$ denote the $j$-th token and its previous tokens in the response. Second, we further use the selected easy preference samples from $\mathcal{D}_{\text{DPO}}$ to conduct direct preference optimization following its original formulation (Rafailov et al., 2023b) as follows:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\bigg(\beta \log \frac{\pi(r^{+} \mid q; \Theta)}{\pi(r^{+} \mid q; \Theta_{\text{ref}})} - \beta \log \frac{\pi(r^{-} \mid q; \Theta)}{\pi(r^{-} \mid q; \Theta_{\text{ref}})}\bigg), \tag{9}$$

where $\sigma$ denotes the sigmoid function, $\Theta$ and $\Theta_{\text{ref}}$ denote the parameters of the updated LLM and the reference LLM during direct preference optimization, and $\pi$ denotes the product of the probabilities of all output tokens, conditioned on the given input.
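For a single preference pair, Eq. (9) can be evaluated with the short Python sketch below. It takes sequence-level log-probabilities $\log \pi(r \mid q)$ as inputs, which are assumed to have been computed beforehand by summing token log-probabilities; the function names and the default `beta=0.1` are illustrative assumptions, not settings reported in the text.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_pos: float, policy_neg: float,
             ref_pos: float, ref_neg: float,
             beta: float = 0.1) -> float:
    """Eq. (9) for one pair; every argument is a sequence log-probability log pi(r | q)."""
    margin = beta * (policy_pos - ref_pos) - beta * (policy_neg - ref_neg)
    return -log_sigmoid(margin)


# When the policy equals the reference model, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the log-probability of the positive response lowers the loss.
loss_after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```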

3 Experiments

3.1 Experimental Setup

Domain-specific Corpus. In our experiments, we mainly focus on three popular domains for adapting general-purpose LLMs, i.e., encyclopedia, mathematics, and code. For the encyclopedia domain, we select Wikipedia as the primary corpus, which is collaboratively developed by volunteers globally and can be freely accessed online. To enable LLMs to learn knowledge from new documents, we use the official 2024/03/01 Wikipedia dump (https://dumps.wikimedia.org/enwiki/20240301/) and conduct the necessary data cleaning and filtering, such as deduplication, resulting in approximately 4B tokens of raw Wikipedia documents. For the domain of mathematics, we opt for AutoMathText (Zhang et al., 2024), a carefully curated corpus derived from various sources including websites, arXiv, and GitHub. Each sample in this corpus has been labeled with a quality score from 0.0 (“the poorest”) to 1.0 (“the best”), reflecting its relevance, quality, and educational value in the context of mathematical intelligence. Following previous work (Zhou et al., 2024), we select the samples with scores higher than 0.7, which contain about 0.7B tokens. For the field of code, we select the StarCoder (Li et al., 2023) corpus, which is widely recognized and employed in several studies (Luo et al., 2023b). It covers 86 programming languages, and we select the Python subset with approximately 1B tokens.

General Instruction Datasets. For general instruction datasets, we choose TULU-V2-mix (Ivison et al., 2023) and UltraFeedback (Cui et al., 2023) for instruction tuning and alignment, respectively. Each sample in TULU-V2-mix is either manually curated for quality or generated by GPT models to encourage complexity and diversity. We mix the entire TULU-V2-mix dataset (about 326K samples) with the domain-specific corpus for knowledge mixture continual pre-training (Section 2.2), and then select the 10K samples with the lowest perplexity scores for subsequent instruction tuning. In addition, UltraFeedback is a widely used and diverse human preference alignment dataset, containing approximately 64K preference pairs. Similarly, we employ the whole UltraFeedback dataset for knowledge mixture continual pre-training and then downsample it to the 5K pairs with the lowest perplexity scores for alignment. Both TULU-V2-mix and UltraFeedback are open-source and widely used in previous work (Meng et al., 2024; Hu et al., 2024), which ensures transparency and facilitates fair experimental comparisons.
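The lowest-perplexity selection described above can be sketched as follows. This section does not state which model produces the token log-probabilities, so the sketch assumes they are already available for each sample; the `token_log_probs` key and the toy data are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def select_lowest_perplexity(samples, k):
    """Return the k samples with the lowest perplexity, easiest first."""
    return sorted(samples, key=lambda s: perplexity(s["token_log_probs"]))[:k]

# Toy pool; in the paper's setting k would be 10K for SFT and 5K for DPO.
pool = [
    {"id": "a", "token_log_probs": [-0.1, -0.2]},
    {"id": "b", "token_log_probs": [-2.0, -1.0]},
    {"id": "c", "token_log_probs": [-0.5]},
]
easiest = select_lowest_perplexity(pool, k=2)
```

Averaging over tokens before exponentiating keeps the score comparable across responses of different lengths.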

Baselines. In the experiments, we employ Meta-Llama-3-8B (LLaMA3-8B, https://llama.meta.com/llama3/) as the base model, which is extensively used in existing work (Cheng et al., 2024). For comparative analysis, we consider the following three types of baseline methods:

• Closed-Source Chat LLMs consist of the official Chat LLMs that have undergone both instruction tuning and preference alignment using closed-source data. Here, we select Meta-Llama-3-8B-Instruct (LLaMA3-8B-Chat).

• Open-Source Chat LLMs are developed by us following the processes of instruction tuning and alignment. Based on our selected base LLM, we conduct supervised fine-tuning (SFT) using TULU-V2-mix, followed by direct preference optimization (DPO) with UltraFeedback.

• Continual Pre-training Augmented LLMs include domain knowledge-enhanced Chat LLMs, which first undergo continual pre-training (CPT) on the domain-specific corpus, followed by the same supervised fine-tuning (SFT) and direct preference optimization (DPO) with TULU-V2-mix and UltraFeedback as used for the open-source chat LLMs.

Evaluation Benchmarks and Metrics. For a comprehensive evaluation, we evaluate seven distinctive capabilities of LLMs based on a total of 17 representative NLP datasets:

• Factual Question Answering assesses the factual knowledge of LLMs in the Wikipedia domain. We employ the Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQ) (Joshi et al., 2017), and our curated WikiQA (WQ) datasets, and use the Exact Match (EM) metric to determine whether the prediction is the same as the gold answer.

• Math Reasoning tests the LLMs’ ability to solve mathematical problems. We use the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) datasets, and evaluate predictions using the Accuracy metric.

• Code Reasoning tests the LLMs’ ability to solve programming problems. We use the MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) datasets, with the Pass@K metric assessing the likelihood that at least one of the top-$K$ generated code samples for a problem passes the unit tests.

• Reading Comprehension measures the LLMs’ ability to comprehend a passage and answer related questions. We use the RACE-Hard (Lai et al., 2017) and OpenBookQA (Mihaylov et al., 2018) datasets, and employ the Exact Match (EM) metric.

• Commonsense Reasoning evaluates the ability to answer questions using commonsense knowledge. We use the HellaSwag (Zellers et al., 2019), CSQA (Talmor et al., 2019), and PIQA (Bisk et al., 2020) datasets, and employ the Accuracy metric.

• Examination includes comprehensive and challenging benchmarks designed to assess problem-solving ability across various domains. We use MMLU (Hendrycks et al., 2021a), BBH (Suzgun et al., 2023), and ARC-Challenge (Bhakthavatsalam et al., 2021) for English examinations and C-EVAL (Huang et al., 2023) for Chinese. All of these benchmarks are evaluated using the Accuracy metric.

• Instruction Following assesses the LLMs’ ability to engage in coherent, informative, and engaging conversations. We use the MT-Bench (Zheng et al., 2023) dataset. For evaluation, we use GPT-4-0613 (https://platform.openai.com/docs/models) as the judging model, which assigns each answer a score from 1 to 10. We multiply this score by ten to obtain a final score out of 100.

We run all evaluations with the OpenCompass framework (Contributors, 2023), a one-stop platform that aims to provide fair, open, and reproducible benchmarks for large model evaluation.
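Two of the metrics listed above can be made concrete with a short sketch. For Pass@K, the code uses the unbiased estimator introduced with HumanEval (Chen et al., 2021); whether the evaluation here uses that estimator or a direct top-K check is not stated, so this is an assumption. The Exact Match helper applies a simple lowercase and whitespace normalization, which is also an assumption; evaluation frameworks may normalize differently.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate for one problem.

    n: number of generated samples, c: number that pass the unit tests,
    k: evaluation budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    def normalize(text):
        return " ".join(text.lower().split())
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

score = pass_at_k(n=10, c=3, k=1)   # equals the fraction of passing samples
hit = exact_match("  Paris ", ["paris", "Paris, France"])
```

For k = 1 the estimator reduces to c / n, and the benchmark score is the mean of the per-problem estimates.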

3.2 Main Results

Table 1 and Table 2 report the evaluation results of our proposed Mix-CPT framework and the baselines on the evaluation benchmarks. We analyze them in detail below.

3.2.1 Results of Base LLMs

We first assess the effectiveness of our proposed Mix-CPT framework, especially the logit swap self-distillation constraint, in mitigating catastrophic forgetting and facilitating the knowledge learning of base LLMs during the continual pre-training stage. To better observe the impact of knowledge mixture continual pre-training on target-domain performance and general capabilities, we select domain-specific tasks and comprehensive examination tasks. Specifically, we evaluate the domain-specific tasks (i.e., factual question answering (Wiki), math reasoning (Math), and code reasoning (Code)) under few-shot settings, and the Chinese and English examination tasks (i.e., multi-choice question answering) in a perplexity-based zero-shot setting. We show the evaluation results in Table 1.

First, we can see that traditional continual pre-training (i.e., + CPT) on raw domain data does not necessarily enhance the performance of base LLMs in the target domain, and may even impair it. Additionally, this method leads to a certain degree of catastrophic forgetting, thereby damaging the overall performance of the base LLM. For example, compared to the LLaMA3-8B-Base model, continual pre-training in the Wiki or Math domain improves the corresponding target domain (i.e., 50.48 → 50.79 in the Wiki domain and 33.17 → 37.72 in the Math domain). However, this approach is ineffective in the Code domain, where it diminishes performance (i.e., 58.75 → 55.64 in the Code domain). This phenomenon has also been observed in existing work (Lin et al., 2024), which suggests that the data used for continual pre-training should be of higher quality than the data used during the initial pre-training phase; otherwise, it may harm performance. Indeed, the publicly available domain data we use (e.g., AutoMathText and StarCoder) has likely already been used to train the base model. Consequently, reusing it for continual pre-training may impair the model’s performance.

Second, when mixing the raw domain data with the additional instruction and alignment data (i.e., Mix-CPT w/o KD), the domain-specific capability can be further improved, which indicates that the mixed instruction data benefits the learning of domain knowledge during continual pre-training. At the same time, it mitigates the effect on other general capabilities and reduces the degradation of the overall average performance. For example, compared to the traditional continual pre-training method (CPT), mixing the domain corpus with additional instruction data (Mix-CPT w/o KD) almost consistently improves the domain capability and the overall average performance (i.e., 55.64 → 57.98 in the target Code domain and 43.94 → 45.16 in overall average performance).

Finally, through applying the logit swap self-distillation strategy to the knowledge mixture continual pre-training process (i.e., Mix-CPT), we can further reduce the impact on the pre-learned knowledge for the base LLM while maintaining the domain capability improvement obtained from simple knowledge mixture (i.e., Mix-CPT w/o KD), thereby mitigating the degradation of the general capabilities of LLMs. Therefore, these results demonstrate that the Mix-CPT framework with the logit swap self-distillation constraint can indeed promote knowledge learning and alleviate the issue of catastrophic forgetting to some extent.

Table 1: Evaluation results on three specialized domains and two comprehensive examination domains. The underline and bold fonts denote the best results in the target domain and the average performance in each domain adaptation group, respectively.
| Domain | Model | Wiki | Math | Code | English Examination | Chinese Examination | Average |
|---|---|---|---|---|---|---|---|
| - | LLaMA3-8B-Base | 50.48 | 33.17 | 58.75 | 50.93 | 47.95 | 45.72 |
| Wiki | + CPT | 50.79 | 34.02 | 15.56 | 52.79 | 46.27 | 42.00 |
| Wiki | + Mix-CPT (w/o KD) | 52.47 | 39.14 | 57.20 | 52.25 | 47.64 | 47.59 |
| Wiki | + Mix-CPT | 52.25 | 38.13 | 56.42 | 52.58 | 49.83 | 47.73 |
| Math | + CPT | 50.60 | 37.72 | 54.47 | 44.13 | 46.78 | 44.06 |
| Math | + Mix-CPT (w/o KD) | 50.60 | 37.73 | 54.47 | 43.47 | 48.77 | 44.78 |
| Math | + Mix-CPT | 50.70 | 37.67 | 57.98 | 51.45 | 49.93 | 47.51 |
| Code | + CPT | 46.14 | 33.71 | 55.64 | 51.38 | 47.58 | 43.94 |
| Code | + Mix-CPT (w/o KD) | 50.09 | 37.85 | 57.98 | 46.82 | 47.91 | 45.16 |
| Code | + Mix-CPT | 50.73 | 38.01 | 59.14 | 51.49 | 47.87 | 46.90 |

3.2.2 Results of Chat LLM

Subsequently, we assess the performance of final chat LLMs after performing format alignment with selectively chosen training samples for instruction tuning and alignment. We want to examine the instruction following capabilities of these LLMs using comprehensive benchmarks in a zero-shot scenario. Accordingly, we assess both domain-specific tasks (i.e., factual question answering (Wiki), math reasoning (Math), and code reasoning (Code)) and more general tasks (i.e., Reading Comprehension (RC), Commonsense Reasoning (CR), Examination (EX), and Instruction Following (IF)). We show the final results in Table 2.

First, the traditional domain adaptation method (i.e., CSD, consisting of CPT, SFT, and DPO) struggles to enhance domain-specific capabilities while preserving general capabilities, in contrast to open-source chat models that do not use domain-specific raw data. For example, compared to the open-source LLaMA3-8B chat model, conducting an extra round of continual pre-training on raw Wikipedia documents not only decreases the factual question answering performance (i.e., 26.10 → 22.20 in the Wiki domain), but also hurts the average performance (i.e., 52.57 → 50.96 in overall average performance). In the Math and Code domains, despite improvements in the target domains (i.e., 30.83 → 37.35 in the Math domain and 42.53 → 48.55 in the Code domain), the overall capability of LLMs can decline (i.e., 52.57 → 52.10 after adapting to the Math domain). This phenomenon can also be observed in the other two LLMs, which indicates that conventional domain adaptation methods that perform continual pre-training on raw domain data may cause catastrophic forgetting. They focus merely on knowledge memorization without considering how to utilize the learned knowledge, and thus might suffer from the memorization trap.

Second, our proposed Mix-CPT method can simultaneously improve the performance of the target domains and the general capability. The reasons are twofold. On the one hand, with the logit swap self-distillation constraint during continual pre-training, the LLM can effectively memorize the raw domain data while retaining the knowledge learned in the previous pre-training stage. On the other hand, by mixing the raw domain data with the general instruction and alignment data (with all templates removed), the model can learn a general knowledge utilization capability and transfer it to the raw domain data. In this way, the model can perform efficient format alignment with only a few formatted samples to better utilize both target domain knowledge and other general knowledge.

Finally, with the same instruction data and alignment data, our method improves the performance on the target domain while maintaining the general capability, compared to the traditional continual pre-training augmented method based on the raw domain data. Compared to the official chat LLM with large-scale closed-source instruction tuning and alignment tuning, our proposed method achieves comparable or even better performance on the target domain (e.g., 40.28 vs. 44.09 in the Wiki domain), which further indicates the effectiveness of our method.

Table 2: Evaluation results on three domain capabilities (i.e., Wiki, Math, and Code) and four general capabilities (i.e., Reading Comprehension (RC), Commonsense Reasoning (CR), Examination (EX), and Instruction Following (IF)) with the average performance. The underline and bold fonts denote the best results in the target domain and the average performance in each domain adaptation group, respectively.
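Columns Wiki, Math, and Code form the Specialized Domain group; columns RC, CR, EX, and IF form the General Domain group.

| Group | Model | Wiki | Math | Code | RC | CR | EX | IF | Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA3 8B-Base | Closed-Source Chat | 40.28 | 50.52 | 61.54 | 77.23 | 76.59 | 63.60 | 81.04 | 62.62 |
| LLaMA3 8B-Base | Open-Source Chat | 26.10 | 30.83 | 42.53 | 73.90 | 75.57 | 56.44 | 68.50 | 52.57 |
| Wiki | + CSD | 22.20 | 30.39 | 40.73 | 74.21 | 73.66 | 56.22 | 63.15 | 50.96 |
| Wiki | + Mix-CPT (ours) | 44.09 | 31.18 | 48.53 | 70.07 | 68.85 | 57.47 | 70.72 | 55.23 |
| Math | + CSD | 12.65 | 37.35 | 44.19 | 74.34 | 77.80 | 58.38 | 69.02 | 52.10 |
| Math | + Mix-CPT (ours) | 44.35 | 35.99 | 52.55 | 72.88 | 72.91 | 61.64 | 71.31 | 58.38 |
| Code | + CSD | 17.27 | 38.13 | 48.55 | 74.75 | 77.82 | 59.21 | 70.88 | 53.87 |
| Code | + Mix-CPT (ours) | 42.17 | 30.70 | 53.91 | 72.60 | 71.57 | 60.44 | 71.52 | 56.99 |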

3.3 Detailed Analysis

Figure 3: (Left) Pass@K on code and Average results w.r.t. Proportion of code data. (Right) Accuracy on math and Average results w.r.t. Quality score of math data.
Figure 4: Accuracy on math and Average results w.r.t. Ratio of SFT and DPO data.
Figure 5: Accuracy on math and Average results w.r.t. selection strategy.
Figure 6: Accuracy on math and Average results w.r.t. Amount of SFT data (Left) and Amount of DPO data (Right).

In this section, we conduct a detailed analysis of the proposed method. Specifically, the investigation focuses on examining the impact of several factors, including the quantity and quality of raw domain data and the selection criteria of samples for format alignment, on the performance of the final chat LLMs. The evaluation employs the same benchmarks as those presented in Table 2.

Effect of Quantity and Quality of Raw Domain Data. We further explore the effect of the quantity and quality of the raw domain documents on model performance. Since the quality of the original StarCoder dataset is approximately uniform, we first use it to explore how the amount of domain documents affects the domain-specific and general capabilities at a similar quality level. Moreover, we use the AutoMathText dataset to explore the effect of the quality of raw domain data on LLMs’ performance by leveraging the annotated quality score of each sample. Specifically, we conduct two groups of experiments with the StarCoder and AutoMathText datasets as follows:

• Proportion of Code Data: This group compares variants that use different proportions of the original StarCoder corpus, including 10%, 30%, and 50%, while keeping all other variables constant.

• Quality of Math Data: This group compares variants that use different AutoMathText subsets, containing samples with quality scores above thresholds of 0.5, 0.6, and 0.7, respectively, while keeping all other variables constant.

We show the results in Figure 3. We can see that increasing the amount of raw domain documents can indeed further enhance the target domain performance at the same quality level. However, even when using a larger amount of raw domain data (e.g., there are more math texts with scores $\geq 0.5$ than with scores $\geq 0.7$), low-quality raw data can reduce the LLMs’ performance in both the target domain and the general domain, which indicates that the quality of raw domain data takes priority over its amount when performing continual pre-training.

Effect of Format Alignment Data Selection. Our experiments indicate that selecting 10,000 instruction samples (about 3% of the original TULU-V2-mix) and 5,000 alignment samples (about 8% of the original UltraFeedback) with the lowest perplexity scores is sufficient for effective format alignment during supervised fine-tuning and direct preference optimization. Here, we conduct a further ablation study to explore how different selection strategies for format alignment data, in terms of both the difficulty and the amount of selected data, affect the final chat model. Specifically, we conduct three groups of experiments:

• Difficulty of Samples for SFT and DPO: This group compares four distinct selection strategies: random selection (R), easiest samples with the lowest perplexity (E), hardest samples with the highest perplexity (H), and half easiest samples and half hardest samples (EH).

• Amount of Samples for SFT and DPO: This group involves four variants using different quantities of the easiest samples from the original TULU-V2-mix dataset (i.e., 10K, 20K, 40K, and 80K) and from the original UltraFeedback dataset (i.e., 5K, 10K, 20K, and 40K).

• Ratios between SFT and DPO Samples: This group uses a total of 15,000 samples for format alignment and adopts five different ratios between the samples used in supervised fine-tuning and direct preference optimization: 1:2, 1:1, 2:1, 3:1, and 4:1.
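The four difficulty-based selection strategies (R, E, H, EH) can be sketched as below. The `ppl` field holding a precomputed perplexity, the fixed random seed, and the way EH splits an odd budget are assumptions made for this example; the sketch also assumes the budget k does not exceed the pool size.

```python
import random

def select(samples, k, strategy, seed=0):
    """Pick k samples by perplexity-based difficulty.

    R: random, E: k lowest perplexity, H: k highest perplexity,
    EH: half lowest and half highest perplexity.
    """
    ranked = sorted(samples, key=lambda s: s["ppl"])  # ascending perplexity
    if strategy == "R":
        return random.Random(seed).sample(samples, k)
    if strategy == "E":
        return ranked[:k]
    if strategy == "H":
        return ranked[len(ranked) - k:]
    if strategy == "EH":
        half = k // 2
        return ranked[:half] + ranked[len(ranked) - (k - half):]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy pool with hypothetical perplexity values.
pool = [{"id": i, "ppl": float(i)} for i in range(10)]
easy_and_hard = [s["id"] for s in select(pool, 4, "EH")]
```

Sorting once and slicing from both ends keeps the three perplexity-based strategies consistent with each other, so they differ only in which end of the ranking they draw from.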

We show the results of the three groups in Figure 5, Figure 6, and Figure 4, respectively. Firstly, using the easiest samples with the lowest perplexity balances the domain capability and general capability best among the selection strategies. Secondly, continuously increasing the number of SFT training samples enhances both the domain and general capabilities to a certain extent. In contrast, no such trend is observed when increasing the number of DPO training samples; both capabilities fluctuate instead. Finally, we can see that the ratio of SFT to DPO data we selected (i.e., 2:1) best balances the general and domain-specific capabilities.
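To make the ratio experiment concrete, the following sketch works out how a fixed budget of 15,000 samples divides between SFT and DPO under each ratio. The helper name and the rounding rule (integer division, remainder assigned to DPO) are assumptions for illustration.

```python
def split_budget(total, sft_parts, dpo_parts):
    """Split a sample budget between SFT and DPO according to a ratio."""
    sft = total * sft_parts // (sft_parts + dpo_parts)
    return sft, total - sft

ratios = [(1, 2), (1, 1), (2, 1), (3, 1), (4, 1)]
splits = {f"{s}:{d}": split_budget(15_000, s, d) for s, d in ratios}
```

Under this arithmetic, the 2:1 ratio corresponds to 10,000 SFT samples and 5,000 DPO samples, which matches the default configuration used in the main experiments.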

4 Related Work

Domain Adaptation of LLMs. Our work is closely related to efforts in adapting general LLMs to specific domains (Yildiz et al., 2024; Ke et al., 2022; Scialom et al., 2022). Due to the increasing scale and complexity of LLMs, training domain-specific LLMs from scratch incurs very high financial and ecological costs (Luccioni et al., 2023). To address this issue, recent work has studied efficient approaches such as continual pre-training, which incrementally trains general LLMs on new domain corpora (Que et al., 2024; Ke et al., 2022), and continual fine-tuning, which fine-tunes general LLMs on a series of downstream tasks related to target domains (Razdaibiedina et al., 2023; Scialom et al., 2022; Luo et al., 2023a). Specifically, continual pre-training updates LLMs with large unlabeled domain-specific corpora and mainly focuses on memorizing and injecting new knowledge into the parameters of LLMs. However, these approaches might result in catastrophic forgetting and performance degradation in general language tasks (Kar et al., 2022; Mehta et al., 2023). Another line of work has explored instruction fine-tuning with synthesized domain-related instructions (Cheng et al., 2024; Jiang et al., 2024). Nevertheless, these studies require additional models to synthesize large amounts of instructions highly related to specific domains, resulting in high computational costs. Our method differs from these works in two ways. Firstly, we disentangle domain adaptation into knowledge memorization and capability elicitation, focusing on learning domain-specific knowledge and solving domain tasks with learned knowledge, respectively. Secondly, we employ logit swap self-distillation in the knowledge mixture pre-training to retain general knowledge and avoid catastrophic forgetting.

Instruction Tuning and Alignment. Instruction tuning (also known as supervised fine-tuning) employs human-annotated instructions (Sanh et al., 2022; Mishra et al., 2022; Köpf et al., 2023; Sun et al., 2023) or instructions synthesized by proprietary models (Taori et al., 2023; Chiang et al., 2023; Wang et al., 2023b) to fine-tune LLMs. Besides, alignment with reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022a) or direct preference optimization (DPO) (Rafailov et al., 2023a) aims to align LLMs with human preference. Both instruction tuning and alignment are able to elicit knowledge from LLMs and improve their capabilities to solve downstream tasks. Recent work (Zhou et al., 2023) has demonstrated that LLMs mainly learn the style or format for interacting with users through simple instruction tuning and alignment, by leveraging the knowledge and capabilities already acquired during the pre-training stage. Therefore, as few as 1,000 examples in supervised fine-tuning can achieve satisfactory alignment performance (Zhou et al., 2023). Furthermore, by comparing the token distribution before and after alignment, recent work (Lin et al., 2023) found that the most significant distribution shifts appear predominantly in stylistic tokens such as transitional phrases and discourse markers, rather than in contextual words that carry rich knowledge for solving downstream tasks. Inspired by these studies, we propose to decouple knowledge memorization and capability elicitation from instruction tuning and alignment. Unlike these studies, which typically focus on instruction tuning or alignment alone, we unify the three stages of training LLMs (i.e., continual pre-training, instruction tuning, and alignment) and conduct knowledge mixture pre-training that mainly focuses on learning new domain knowledge while maintaining general knowledge.

5 Conclusion

In this study, we refined the conventional approach to domain adaptation for LLMs by introducing a novel two-stage approach, termed Mix-CPT, which encompasses both domain knowledge learning and general format alignment. Distinct from previous strategies, Mix-CPT employed knowledge mixture continual pre-training, which learns knowledge memorization and utilization simultaneously by integrating domain-specific raw data with general instruction tuning and alignment data. Besides, we incorporated the Logit Swap Self-Distillation (LSSD) constraint into the continual pre-training process to relieve catastrophic forgetting. Subsequently, based on the knowledge and capabilities already acquired during pre-training, we selected a small number of easy instructions with low perplexity scores from the instruction set used during continual pre-training, so that the LLM focuses on learning the pure style and format for interacting with humans. We conducted extensive experiments across three target domains, and the results show that our proposed Mix-CPT outperforms the traditional method, obtaining improvements in both domain and general capabilities.

References

  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
  • Azerbayev et al. (2023) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. CoRR, abs/2310.10631, 2023.
  • Bhakthavatsalam et al. (2021) Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge. CoRR, abs/2102.03315, 2021.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.  7432–7439. AAAI Press, 2020.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
  • Cheng et al. (2024) Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, and Furu Wei. Instruction pre-training: Language models are supervised multitask learners, 2024.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
  • Contributors (2023) OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. CoRR, abs/2310.01377, 2023.
  • Gu et al. (2023) Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models. CoRR, abs/2306.08543, 2023.
  • Guo & Yu (2022) Xu Guo and Han Yu. On the domain adaptation and generalization of pretrained language models: A survey. CoRR, abs/2211.03154, 2022.
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a.
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021b.
  • Hu et al. (2024) Hanxu Hu, Pinzhen Chen, and Edoardo M. Ponti. Fine-tuning large language models with sequential instructions. CoRR, abs/2403.07794, 2024.
  • Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew E. Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. Camels in a changing climate: Enhancing LM adaptation with tulu 2. CoRR, abs/2311.10702, 2023.
  • Jiang et al. (2024) Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, and Srinivasan Iyer. Instruction-tuned language models are better knowledge learners. arXiv preprint arXiv:2402.12847, 2024.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp.  1601–1611. Association for Computational Linguistics, 2017.
  • Kar et al. (2022) Sudipta Kar, Giuseppe Castellucci, Simone Filice, Shervin Malmasi, and Oleg Rokhlenko. Preventing catastrophic forgetting in continual learning of new natural language tasks. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, pp.  3137–3145. ACM, 2022.
  • Ke et al. (2022) Zixuan Ke, Yijia Shao, Haowei Lin, Hu Xu, Lei Shu, and Bing Liu. Adapting a language model while preserving its general knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp.  10177–10188. Association for Computational Linguistics, 2022.
  • Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations - democratizing large language model alignment. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466, 2019.
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp.  785–794. Association for Computational Linguistics, 2017.
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. StarCoder: may the source be with you! CoRR, abs/2305.06161, 2023.
  • Lin et al. (2023) Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base LLMs: Rethinking alignment via in-context learning. CoRR, abs/2312.01552, 2023. doi: 10.48550/ARXIV.2312.01552. URL https://doi.org/10.48550/arXiv.2312.01552.
  • Lin et al. (2024) Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. Rho-1: Not all tokens are what you need. CoRR, abs/2404.07965, 2024.
  • Luccioni et al. (2023) Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of BLOOM, a 176B parameter language model. J. Mach. Learn. Res., 24:253:1–253:15, 2023.
  • Luo et al. (2023a) Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. CoRR, abs/2308.08747, 2023a.
  • Luo et al. (2023b) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with Evol-Instruct. CoRR, abs/2306.08568, 2023b.
  • Mehta et al. (2023) Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, and Emma Strubell. An empirical investigation of the role of pre-training in lifelong learning. J. Mach. Learn. Res., 24:214:1–214:50, 2023.
  • Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward. CoRR, abs/2405.14734, 2024.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  2381–2391. Association for Computational Linguistics, 2018.
  • Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  3470–3487. Association for Computational Linguistics, 2022.
  • OpenAI (2023) OpenAI. GPT-4 technical report. OpenAI Blog, 2023.
  • Ouyang et al. (2022a) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022a.
  • Ouyang et al. (2022b) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b.
  • Que et al. (2024) Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, et al. D-CPT law: Domain-specific continual pre-training scaling law for large language models. arXiv preprint arXiv:2406.01375, 2024.
  • Rafailov et al. (2023a) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023a.
  • Rafailov et al. (2023b) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b.
  • Razdaibiedina et al. (2023) Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. Progressive prompts: Continual learning for language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  • Ren et al. (2024) Mengjie Ren, Boxi Cao, Hongyu Lin, Cao Liu, Xianpei Han, Ke Zeng, Guanglu Wan, Xunliang Cai, and Le Sun. Learning or self-aligning? rethinking instruction fine-tuning. CoRR, abs/2402.18243, 2024.
  • Rozière et al. (2023) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code Llama: Open foundation models for code. CoRR, abs/2308.12950, 2023.
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp.  6107–6122. Association for Computational Linguistics, 2022.
  • Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Suzgun et al. (2023) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13003–13051. Association for Computational Linguistics, 2023.
  • Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp.  4149–4158. Association for Computational Linguistics, 2019.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
  • Wang et al. (2024) Rui Wang, Fei Mi, Yi Chen, Boyang Xue, Hongru Wang, Qi Zhu, Kam-Fai Wong, and Ruifeng Xu. Role prompting guided domain adaptation with general capability preserve for large language models. CoRR, abs/2403.02756, 2024.
  • Wang et al. (2023a) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a.
  • Wang et al. (2023b) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023b.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • Yildiz et al. (2024) Çagatay Yildiz, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, and Beyza Ermis. Investigating continual pretraining in large language models: Insights and implications. CoRR, abs/2402.17400, 2024.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp.  4791–4800. Association for Computational Linguistics, 2019.
  • Zhang et al. (2024) Yifan Zhang, Yifan Luo, Yang Yuan, and Andrew Chi-Chih Yao. AutoMathText: Autonomous data selection with language models for mathematical texts. CoRR, abs/2402.07625, 2024.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. CoRR, abs/2303.18223, 2023.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Zhou et al. (2023) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: less is more for alignment. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Zhou et al. (2024) Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Wayne Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, and Ji-Rong Wen. JiuZhang3.0: Efficiently improving mathematical reasoning by training small data synthesis models. CoRR, abs/2405.14365, 2024.