Hadamard Adapter: An Extreme Parameter-Efficient Adapter Tuning Method for Pre-trained Language Models

Yuyan Chen (ORCID 0000-0002-4381-486X), Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China; Qiang Fu (ORCID 0000-0002-5821-7267), Microsoft, Beijing, China; Ge Fan (ORCID 0000-0001-5653-1626), Tencent, Shenzhen, China; Lun Du (ORCID 0000-0002-7625-0650), Microsoft, Beijing, China; Jian-Guang Lou (ORCID 0000-0001-8496-033X), Microsoft, Beijing, China; Shi Han (ORCID 0000-0002-0360-6089), Microsoft, Beijing, China; Dongmei Zhang (ORCID 0000-0002-9230-2799), Microsoft, Beijing, China; Zhixu Li (ORCID 0000-0003-2355-288X), Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China; and Yanghua Xiao (ORCID 0000-0001-8403-9591), Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China

(2023)

ABSTRACT.

In recent years, pre-trained language models (PLMs) have swept through various fields of artificial intelligence and achieved great success. However, most PLMs, such as T5 and GPT-3, have a huge number of parameters: fine-tuning them is often expensive and time-consuming, and storing them takes up a lot of space. It is therefore necessary to adopt a parameter-efficient approach that reduces the number of parameters updated during fine-tuning without compromising performance on downstream tasks. In this paper, we design a novel adapter that acts only on the self-attention outputs of PLMs. The adapter applies an element-wise linear transformation using the Hadamard product, hence the name Hadamard adapter, and requires the fewest parameters among previous parameter-efficient adapters. In addition, we summarize tuning patterns of the Hadamard adapter that are shared across downstream tasks, which we expect to provide guidance for further parameter reduction with shared adapters in future studies. Experiments on the widely used GLUE benchmark with several SOTA PLMs show that the Hadamard adapter achieves competitive performance with only 0.033% of the parameters of full fine-tuning, the fewest among the compared adapters. Moreover, we find that some layers of the Hadamard adapter are redundant and can be removed, which improves parameter efficiency further to only 0.022% of the parameters.

Adapter Tuning, Parameter-Efficiency, Pre-trained Language Models

Published in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), October 21–25, 2023, Birmingham, United Kingdom. Copyright: ACM licensed. DOI: 10.1145/3583780.3614904. ISBN: 979-8-4007-0124-5/23/10. CCS Concepts: Computing methodologies → Natural language processing.

1 Introduction

In recent years, pre-trained language models (PLMs) have swept through various fields of artificial intelligence and achieved great success (Chen et al., 2024b, 2023b, 2023a, 2022, a, 2023d, 2023c). The mainstream paradigm for adapting PLMs to downstream tasks is fine-tuning. Because most PLMs, such as T5 (Raffel et al., 2019) and GPT-3 (Brown et al., 2020), have a large number of parameters, fine-tuning them is often expensive and time-consuming, and storing them takes up a lot of space. It has also been shown that many of the parameters updated during fine-tuning are redundant (Cheng et al., 2017; Dean et al., 2012). Thus, it is necessary to greatly reduce the number of parameters updated in fine-tuning without compromising PLMs’ performance on downstream tasks.

Previous work on parameter-efficient fine-tuning of PLMs falls mainly into three categories: adapter tuning, prefix tuning, and prompt tuning. Adapter tuning (Houlsby et al., 2019) injects a small neural network module into each or some layers of a PLM. During fine-tuning, only the parameters of this small module need to be learned. It has shown promising results in NLP, achieving performance comparable to fine-tuning while adding no more than 4% task-specific parameters (Houlsby et al., 2019; Lin et al., 2020). Prefix tuning (Li and Liang, 2021) and prompt tuning (Lester et al., 2021) prepend additional adjustable prefix tokens to the input or hidden layers, and only these soft prompts are trained during fine-tuning on downstream tasks. Beyond these three categories, existing efforts also work on model compression (Cheng et al., 2017), including knowledge distillation (Ba and Caruana, 2013; Hinton et al., 2015), which transfers the knowledge learned by large models to small models so that small models inherit the generalization ability of large ones; quantization (Gong et al., 2014; Wu et al., 2016), which reduces the numerical precision of large models within an acceptable range; pruning (Srinivas and Babu, 2015; Han et al., 2015), which removes less useful connections in the model; and structure optimization, such as matrix decomposition (Tai et al., 2015) and parameter sharing (Ullrich et al., 2017). Although these endeavors achieve competitive performance on downstream tasks with far fewer parameters, we believe there is still room for improvement in parameter efficiency.

It is well known that the attention mechanism, especially self-attention, is one of the core modules that enable PLMs to achieve superior performance in various downstream tasks (Vaswani et al., 2017). Thus, a possible way to significantly reduce the number of parameters for fine-tuning is to design an adapter that works with the self-attention module in PLMs. Related research on adapter tuning also injects adapters into the self-attention layer (Lyu et al., 2022, 2023), such as IA3 (Liu et al., 2022b). Three questions need to be answered when designing the adapter. Q1. Where should the adapter that acts on self-attention outputs be injected into the PLMs? Q2. What is a suitable form of the adapter that satisfies both competitive performance and parameter efficiency? Q3. Which other essential parameters should not be frozen in adapter tuning? To answer these questions, we conduct the following empirical studies: i) analyzing the changes of self-attention outputs before and after full fine-tuning, to verify the importance of self-attention and hence the need to inject the adapter there; ii) comparing the differences among fitting functions to select a suitable form of the adapter; iii) analyzing the gradients of PLMs after fine-tuning on downstream tasks to identify the modules of great importance that should be trained in adapter tuning.

Based on the empirical analysis, we propose a novel adapter tuning method as follows. We first train the classification module to output prediction results on a given downstream task, without updating the PLM’s other parameters. Since the classification module is a linear model, this step has a light computational cost. We then design an adapter and inject it right after the multi-head self-attention outputs of the PLM. In this second stage, we freeze all parameters except those in the designed adapter and the subsequent normalization module, and continue fine-tuning. As PLMs usually contain multiple layers with the same architecture (e.g., the base version of BERT (Devlin et al., 2018) has 12 layers), we inject such an adapter module into each layer of the PLM. In designing the adapter, we adopt only an element-wise linear transformation, rather than higher-order ones, as its computational logic. Specifically, the adapter consists of a weight vector and a bias vector, each with the same dimension as the output of the multi-head self-attention module. The multi-head self-attention output is multiplied by the weight vector of the adapter using the element-wise product (also called the Hadamard product), and the corresponding bias vector is then added to obtain the new self-attention outputs. We therefore name the designed adapter the Hadamard adapter.
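The adapter described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `HadamardAdapter`, the helper `adapter_parameter_count`, the use of NumPy, and the identity initialisation (weight of ones, bias of zeros) are all assumptions made for the example.

```python
import numpy as np


class HadamardAdapter:
    """Element-wise affine map applied to self-attention outputs (sketch)."""

    def __init__(self, hidden_size: int):
        # Assumption: identity initialisation, so the adapter initially
        # leaves the self-attention output unchanged.
        self.weight = np.ones(hidden_size)
        self.bias = np.zeros(hidden_size)

    def __call__(self, attn_out: np.ndarray) -> np.ndarray:
        # attn_out has shape (..., hidden_size); weight and bias broadcast
        # over the leading (batch, sequence) axes.
        return attn_out * self.weight + self.bias


def adapter_parameter_count(num_layers: int, hidden_size: int) -> int:
    """Adapter-only count: one weight and one bias vector per layer."""
    return num_layers * 2 * hidden_size
```

For a 12-layer model with hidden size 768, such as BERT-base, `adapter_parameter_count(12, 768)` gives 18,432 adapter parameters. This figure excludes the normalization and classifier parameters, which the method also trains.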

We carry out experiments on eight tasks from the GLUE benchmark. The experimental results demonstrate that the proposed Hadamard adapter achieves competitive performance with far fewer parameters than existing fine-tuning methods. In addition, we take the learned parameter values of the Hadamard adapter as representations of downstream tasks. Through further analysis, we summarize tuning patterns of the Hadamard adapter that are shared across downstream tasks, which provide guidance for further parameter reduction using shared adapters in future research.

To summarize, our contributions in this paper are threefold:

• Based on comprehensive empirical analysis, we design the Hadamard adapter, which acts on self-attention outputs in PLMs with an element-wise linear transformation. We also design an extremely parameter-efficient adapter tuning method based on the Hadamard adapter.

• We conduct extensive comparative experiments with several mainstream PLMs. The experimental results show that the proposed Hadamard adapter achieves the highest parameter efficiency among the compared fine-tuning methods, and has competitive performance with full fine-tuning on various downstream tasks.

• We summarize tuning patterns of the Hadamard adapter that are shared across downstream tasks, which provide guidance for further parameter reduction using shared adapters in future research.

2 Empirical analysis

To guide the design of our adapter tuning method, we conduct empirical studies aimed at answering the three key questions listed in the Introduction. In the rest of this section, we first analyze the changes of self-attention outputs before and after full fine-tuning (answering Q1). We then compare the differences among fitting functions to select a suitable form of the Hadamard adapter (answering Q2). Finally, we analyze the gradients of PLMs after fine-tuning on downstream tasks to identify the modules of great importance that should not be frozen in adapter tuning (answering Q3).

2.1 The changes of self-attention outputs

We use eight tasks from the GLUE benchmark for the first analysis, which examines how a PLM’s self-attention outputs change before and after fine-tuning. As an example PLM, we adopt the RoBERTa-large model, which has 24 hidden layers and outputs 1024-dimensional tensors in the encoder. To compare the changes of self-attention outputs in each layer across all tasks, we use the norm of the self-attention outputs rather than the raw outputs. We analyze the distribution of this norm across all tasks before and after fine-tuning, as well as its change during fine-tuning in each layer, as shown in Fig. 1. The computation is given by the following equations:

(1)		$\displaystyle\small\|\bm{A_{b}}\|_{2}=\sqrt{\lambda_{max}(\bm{A_{b}}^{T}\bm{A_{b}})},\quad\|\bm{A_{a}}\|_{2}=\sqrt{\lambda_{max}(\bm{A_{a}}^{T}\bm{A_{a}})}$
(2)		$\displaystyle\Delta=\frac{\|\bm{A_{a}}\|_{2}-\|\bm{A_{b}}\|_{2}}{\|\bm{A_{b}}\|_{2}}$

where $\|\bm{A_{b}}\|_{2}$ and $\|\bm{A_{a}}\|_{2}$ represent the norms of the self-attention outputs across all tasks in a hidden layer before and after fine-tuning, respectively, and $\lambda_{max}(\bm{A_{b}}^{T}\bm{A_{b}})$ is the largest eigenvalue of the matrix $\bm{A_{b}}^{T}\bm{A_{b}}$.
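As a rough illustration of Eqs. (1) and (2), the sketch below computes the spectral norm from the largest eigenvalue of $A^{T}A$ and then the relative change. It assumes each self-attention output is available as a 2-D array (for example, sequence length by hidden size). The function names `spectral_norm` and `relative_change` are hypothetical.

```python
import numpy as np


def spectral_norm(a: np.ndarray) -> float:
    """Eq. (1): square root of the largest eigenvalue of A^T A."""
    # A^T A is symmetric, so eigvalsh applies; it returns the
    # eigenvalues in ascending order.
    eigenvalues = np.linalg.eigvalsh(a.T @ a)
    # Guard against tiny negative values caused by round-off.
    return float(np.sqrt(max(eigenvalues[-1], 0.0)))


def relative_change(a_before: np.ndarray, a_after: np.ndarray) -> float:
    """Eq. (2): relative change of the norm after fine-tuning."""
    norm_before = spectral_norm(a_before)
    return (spectral_norm(a_after) - norm_before) / norm_before
```

The same norm is returned by `np.linalg.norm(a, 2)`, which computes the largest singular value of a 2-D array directly; the eigenvalue form is used here only to mirror Eq. (1).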

Each box in Fig. 1(a) and (b) represents the distribution of the norm of the self-attention outputs in a layer, and each box in Fig. 1(c) represents the corresponding change. As can be observed in Fig. 1, the norm of the self-attention outputs of all tasks increases significantly after fine-tuning, from an average of 60 to an average of 100, especially in the middle and later layers (Fig. 1(a)(b)). After the fifteenth layer, the changes become more pronounced as the layer index increases, reaching their maximum at the last layer (Fig. 1(c)). These observations indicate that self-attention outputs change significantly during fine-tuning, which suggests the following answer to Q1: it is appropriate to inject an adapter right after the self-attention outputs to achieve performance gains similar to fine-tuning while updating far fewer parameters.

Figure 1. The distribution of the norm of the self-attention outputs among all tasks before (a) and after fine-tuning (b), and the corresponding changes (c) in each layer.

2.2 Fitting full fine-tuning

We design fitting functions for the self-attention outputs as candidate adapters; each aims to make the self-attention outputs approximate those obtained by full fine-tuning of PLMs. We first optimize the parameters of the classifier modules. Next, we reload them and train different fitting functions, namely a linear function, a quadratic function, and a higher-order function (i.e., a cubic function), to obtain new self-attention outputs. After that, we calculate the average value of each token in a sequence by summing over the hidden dimensions of the new self-attention outputs and dividing by the hidden size of the PLM, as shown in Fig. 2(a). The computation is given by the following equation:

(3)		$\displaystyle\small a_{j}^{\prime}=\frac{1}{H}\sum_{i=1}^{H}a_{ij}^{\prime}$

where $H$ represents the hidden size of a PLM, $a_{ij}^{\prime}$ is the value of the $i^{th}$ dimension of the $j^{th}$ token in the new self-attention outputs of a hidden layer for a given task, and $a_{j}^{\prime}$ is the average value of the $j^{th}$ token in a sequence of that task. We then calculate the average value of each sequence by summing these token averages and dividing by the sequence length. In this way, we obtain a characteristic value for each task, which represents its average self-attention output. We analyze the distribution of the characteristic value across all tasks in each layer, as shown in Fig. 2(b), using the following equation:

(4)		$\displaystyle\small a^{\prime}=\frac{1}{L}\sum_{j=1}^{L}a_{j}^{\prime}$

where $L$ represents the length of the sequence fed to the PLM, and $a^{\prime}$ is the characteristic value of a task. We also analyze the average characteristic values of all tasks in each layer, as shown in Fig. 2(c).
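Eqs. (3) and (4) amount to two successive means, which the following sketch makes explicit for a single sequence. It assumes the new self-attention outputs of one layer are stored as an array of shape (L, H), with tokens along the first axis; the function names are hypothetical, and averaging over multiple sequences of a task is left out.

```python
import numpy as np


def token_averages(attn_out: np.ndarray) -> np.ndarray:
    """Eq. (3): mean over the H hidden dimensions for each token."""
    # attn_out has shape (L, H); the result has shape (L,).
    return attn_out.mean(axis=-1)


def characteristic_value(attn_out: np.ndarray) -> float:
    """Eq. (4): mean of the token averages over the L tokens."""
    return float(token_averages(attn_out).mean())
```

For a toy output with two tokens and three hidden dimensions, `[[1, 2, 3], [4, 5, 6]]`, the token averages are 2 and 5, and the characteristic value is 3.5.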

In Fig. 2(a), dots of the same color represent the average value of each token in a sequence, over all downstream tasks, for one of four settings: the three fitting functions and fine-tuning. The dots of the four settings largely overlap, which suggests that fitting functions of different orders approximate fine-tuning similarly well. In Fig. 2(b), each box represents the distribution of characteristic values across all downstream tasks in a layer. The medians and quartile ranges of the characteristic value distributions for the different fitting functions and fine-tuning are similar in each hidden layer. In Fig. 2(c), each dot represents the average characteristic value of all downstream tasks in a layer. Here the trends of the linear, quadratic, and higher-order functions remain similar to one another, but differ slightly from that of fine-tuning. As the order increases, the values of the fitting function move closer to those of fine-tuning, but the reduction in distance is negligible compared with the increase in the number of parameters. We therefore answer Q2 as follows: a linear function is sufficient to act on self-attention outputs and fit the performance of fine-tuning.
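To make the trade-off between order and parameter count concrete, the sketch below evaluates an element-wise polynomial of arbitrary order on the self-attention outputs. The exact form of the quadratic and cubic fitting functions is not spelled out in this section, so the element-wise polynomial with one coefficient vector per power is an assumption, as are the names `elementwise_polynomial` and `params_per_layer`.

```python
import numpy as np


def elementwise_polynomial(attn_out: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Apply sum_k coeffs[k] * attn_out**k element-wise.

    coeffs has shape (order + 1, hidden_size): coeffs[0] is the bias and
    coeffs[1] is the linear weight. Order 1 is the linear (Hadamard) case.
    """
    result = np.zeros_like(attn_out, dtype=float)
    # Horner's rule, starting from the highest-order coefficient.
    for c in coeffs[::-1]:
        result = result * attn_out + c
    return result


def params_per_layer(order: int, hidden_size: int) -> int:
    """Assumed count: one coefficient vector per power, bias included."""
    return (order + 1) * hidden_size
```

Under this assumed form and a hidden size of 1024, a linear function needs 2,048 parameters per layer and a cubic one 4,096, so each additional order adds another 1,024 parameters per layer.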

2.3 Gradient Analysis

We report the five layers with the largest gradients and unit gradients in the first and last training epochs of a PLM (here, the BERT-base model). Two representative datasets from the GLUE benchmark are selected for analysis, MRPC (similarity and paraphrase task, 3.7k) and SST-2 (single-sentence classification, 67k), and the results are shown in Table 1.

Table 1. The gradient and unit gradient of the top five layers (in descending order) in the first and last epoch, respectively. Results are shown for the BERT-base model on the MRPC and SST-2 datasets.

Task

Gradient in first epoch

Unit gradient in first epoch

Gradient in last epoch

Unit gradient in last epoch

MRPC

classifier.weight