LoRA+: Efficient Low Rank Adaptation of Large Models
Abstract
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in [Hu et al., 2021] leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices $A$ and $B$ in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for $A$ and $B$ does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices $A$ and $B$ with a well-chosen fixed ratio. We call this proposed algorithm LoRA+. In our extensive experiments, LoRA+ improves performance (1% - 2% improvements) and finetuning speed (up to $\sim$2X SpeedUp), at the same computational cost as LoRA.
1 Introduction
State-of-the-art (SOTA) deep learning models all share a common characteristic: an extremely large number of parameters (tens if not hundreds of billions). Currently, only a few industry labs can pretrain large language models due to their high training cost. However, many pretrained models are accessible either through an API (GPT4, [OpenAI, 2023]) or through open-source platforms (Llama, [Touvron et al., 2023]). Most practitioners are interested in using such models for specific tasks and want to adapt these models to a new, generally smaller task. This procedure is known as finetuning, where one adjusts the weights of the pretrained model to improve performance on the new task. However, due to the size of SOTA models, adapting to downstream tasks with full finetuning (finetuning all model parameters) is computationally infeasible, as it requires modifying the weights of the pretrained model using gradient methods, which is a costly process. Besides, a model that has already learned generally useful representations during pretraining would not, in principle, require significant adaptation of all parameters. With this intuition, researchers have proposed a variety of resource-efficient finetuning methods which typically freeze the pretrained weights and tune only a small set of newly inserted parameters. Such methods include prompt tuning [Lester et al., 2021], where a “soft prompt” is learned and appended to the input, the adapters method [Houlsby et al., 2019], where lightweight “adapter” layers are inserted and trained, and $(IA)^3$ [Liu et al., 2022], where activation vectors are modified with learned scalings. Another resource-efficient method is known as Low Rank Adaptation [Hu et al., 2021], or simply LoRA. In LoRA finetuning, only a low rank matrix, called an adapter, that is added to the pretrained weights is trainable. The training can be done with any optimizer and in practice a common choice is Adam [Kingma and Ba, 2014].
Since the trained adapter is low-rank, this effectively reduces the number of trainable parameters in the fine-tuning process, significantly decreasing the training cost. On many tasks such as instruction finetuning, LoRA has been shown to achieve comparable or better performance compared with full-finetuning [Wang et al., 2023, Liu et al., 2023], although on complicated, long form generation tasks, it is not always as performant. The impressive performance and the computational savings of LoRA have contributed to it becoming an industry standard finetuning method.
Efficient use of LoRA requires a careful choice of hyperparameters: the rank and the learning rate. While some theoretical guidelines on the choice of the rank in LoRA exist in the literature (see e.g. Zeng and Lee [2023]), there are no principled guidelines on how to set the learning rate, apart from common choices of order -.
![Refer to caption](extracted/5711638/figures/intro_table.png)
Related Work.
Dettmers et al. [2023] introduced a quantized version of LoRA (or QLoRA), which further reduces computation costs by quantizing pretrained weights down to as few as four bits. Using QLoRA enables fine-tuning Llama-65b [Touvron et al., 2023] on a single consumer GPU while achieving competitive performance with full-finetuning. To further improve LoRA training with quantization, Li et al. [2023] introduced a new method called LoftQ for computing a better initialization for quantized training. Additional variations of LoRA have been proposed, such as VeRA [Kopiczko et al., 2023], which freezes random weight-tied adapters and learns vector scalings of the internal adapter activations. This achieves a further reduction in the number of trainable parameters while achieving comparable performance to LoRA on several NLP finetuning tasks. However, to the best of our knowledge, there is no principled guidance for setting the LoRA learning rate, which is the focus of our work.
Contributions.
We provide guidelines for setting the learning rate through a theory of scaling for neural networks. There is a significant number of works on the scaling of neural networks from the infinite width/depth perspective. The approach is simple: take the width/depth of a neural network to infinity (depending on the model, one might want to scale width with fixed depth and vice-versa, or both at the same time; see Section A.1 for more details), understand how the limit depends on the choice of the hyperparameters in the training process such as the learning rate and initialization variance, then derive principled choices for these hyperparameters to achieve some desired goal (e.g. improve feature learning). Examples of the infinite-width limit include works on initialization schemes such as [He et al., 2016, Yang, 2019], or more holistically network parametrizations such as [Yang and Hu, 2021] where the authors introduced $\mu$P, a neural network parameterization ensuring feature learning in the infinite-width limit, offering precise scaling rules for architecture and learning rates to maximize feature learning. Examples for the depth limit include initialization strategies [Schoenholz et al., 2017a, He et al., 2023, Hayou et al., 2019], block scaling (see e.g. [Hayou et al., 2021, Hayou, 2023, Noci et al., 2023]), depth parametrizations [Yang et al., 2023, Bordelon et al., 2023] etc. Here we propose to use the same strategy to derive scaling rules for the learning rate in LoRA for finetuning. More precisely, we study the infinite-width limit of LoRA finetuning dynamics and show that the standard LoRA setup is suboptimal. We correct this by introducing a new method called LoRA+ that improves feature learning in low rank adaptation in this limit. The key innovation in LoRA+ is setting different learning rates for the $A$ and $B$ modules (LoRA modules) as explained in Figure 1. Our theory is validated with extensive empirical results with different language models and tasks.
2 Setup and Definitions
Our methodology in this paper is model agnostic and applies to general neural network models. Let us consider a neural network of the form
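$$
\begin{cases}
Y_{in}(x) = W_{in}\, x, \\
Y_l(x) = \mathcal{F}_l(W_l, Y_{l-1}(x)), \quad l \in [L], \\
Y_{out}(x) = W_{out}\, Y_L(x),
\end{cases} \tag{1}
$$
where $x \in \mathbb{R}^d$ is the input, $L \geq 1$ is the network depth, $(\mathcal{F}_l)_{l \in [L]}$ are mappings that define the layers, $W_l \in \mathbb{R}^{n \times n}$ are the hidden weights, where $n$ is the network width, and $W_{in}, W_{out}$ are input and output embedding weights.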
Model (1) is pretrained on some dataset to perform some specified task (e.g. next token prediction). Once the model is pretrained, one can finetune it to improve performance on some downstream task. To achieve this with relatively small devices (limited GPUs), resource-efficient finetuning methods like LoRA significantly reduce the computational cost by considering low rank weight matrices instead of full rank finetuning (or simply full finetuning).
Definition 1 (Low Rank Adapters (LoRA) from [Hu et al., 2021]).
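For any weight matrix $W \in \mathbb{R}^{n_1 \times n_2}$ in the pretrained model, we constrain its update in the fine-tuning process by representing the latter with a low-rank decomposition $W = W^* + \frac{\alpha}{r} BA$. Here, only the weight matrices $B \in \mathbb{R}^{n_1 \times r}$, $A \in \mathbb{R}^{r \times n_2}$ are trainable. The rank $r \ll \min(n_1, n_2)$ and $\alpha \in \mathbb{R}$ are tunable constants.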
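As a concrete illustration of Definition 1, the following NumPy sketch applies the adapted weight $W^* + \frac{\alpha}{r} BA$ to an input and checks that a zero-initialized $B$ leaves the pretrained output unchanged. The sizes, the seed, and the helper name `lora_forward` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lora_forward(W_star, A, B, x, alpha):
    """Apply the adapted weight W* + (alpha / r) * B @ A to the input x."""
    r = A.shape[0]  # LoRA rank
    # B @ (A @ x) avoids materializing the full n1 x n2 product B @ A.
    return W_star @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
n1, n2, r, alpha = 64, 32, 4, 8.0            # hypothetical sizes
W_star = rng.normal(size=(n1, n2))            # frozen pretrained weight
A = rng.normal(size=(r, n2)) / np.sqrt(n2)    # trainable, shape r x n2
B = np.zeros((n1, r))                         # trainable, shape n1 x r, zero init
x = rng.normal(size=n2)

y0 = lora_forward(W_star, A, B, x, alpha)
trainable = A.size + B.size                   # r * (n1 + n2) adapter parameters
full = W_star.size                            # n1 * n2 parameters for full finetuning
```

With these hypothetical sizes the adapter has far fewer trainable parameters than the full matrix, which is the source of the savings discussed above.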
Scaling of Neural Networks.
It is well known that as the width $n$ grows, the network initialization scheme and the learning rate should be adapted to avoid numerical instabilities and ensure efficient learning. For instance, the variance of the initialization weights (in hidden layers) should scale as $1/n$ to prevent arbitrarily large pre-activations as we increase model width (e.g. He init [He et al., 2016]). To derive such scaling rules, a principled approach consists of analyzing statistical properties of key quantities in the model (e.g. pre-activations) as $n$ grows and then adjusting the initialization, the learning rate, and the architecture itself to achieve desirable properties in the limit $n \to \infty$ [Hayou et al., 2019, Schoenholz et al., 2017b, Yang, 2019, Yang and Littwin, 2023]. This approach is used in this paper to study feature learning dynamics with LoRA in the infinite-width limit. This will allow us to derive scaling rules for the learning rates of LoRA modules. For more details about the theory of scaling of neural networks, see Section A.1.
Notation.
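Hereafter, we use the following notation to describe the asymptotic behaviour as the width $n$ grows. Given sequences $c_n \in \mathbb{R}$ and $d_n \in \mathbb{R}^+$, we write $c_n = \mathcal{O}(d_n)$, resp. $c_n = \Omega(d_n)$, to refer to $c_n < \kappa d_n$, resp. $c_n > \kappa d_n$, for some constant $\kappa > 0$. We write $c_n = \Theta(d_n)$ if both $c_n = \mathcal{O}(d_n)$ and $c_n = \Omega(d_n)$ are satisfied. For vector sequences $c_n = (c_n^i)_{1 \leq i \leq k} \in \mathbb{R}^k$ (for some $k > 0$), we write $c_n = \mathcal{O}(d_n)$ when $c_n^i = \mathcal{O}(d_n^i)$ for all $i \in [k]$, and same holds for other asymptotic notations. Finally, when the sequence $c_n$ is a vector of random variables, convergence is understood to be convergence in second moment ($L_2$ norm).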
3 An Intuitive Analysis of LoRA
Our intuition is simple: the matrices $A$ and $B$ have “transposed” shapes and one would naturally ask whether the learning rate should be set differently for the two matrices. In practice, most SOTA models have large width (embedding dimension). Thus, it makes sense to study the training dynamics when the width goes to infinity.
3.1 LoRA with a Toy Model
Consider the following linear model
$$
f(x) = (W^* + b\, a^\top)\, x, \tag{2}
$$
where $W^* \in \mathbb{R}^{1 \times n}$ are the pretrained weights, $b \in \mathbb{R}$ and $a \in \mathbb{R}^n$ are LoRA weights, and $x \in \mathbb{R}^n$ is the model input (we consider this simple case to simplify the analysis; the conclusions remain essentially valid in more general settings). This setup corresponds to $n_1 = 1$, $n_2 = n$, $r = 1$ in Definition 1. We assume that the weights $W^*$ are fixed (from pretraining). The goal is to minimize the loss $\mathcal{L}(\theta) = \frac{1}{2}(f(x) - y)^2$ where $\theta = (a, b)$ and $(x, y)$ is an input-output datapoint (for simplicity, we assume that the finetuning dataset consists of a single sample; our analysis is readily generalizable to multiple samples). We assume that $x = \Theta(1)$, which means that input coordinates remain of the same order as we increase width. In the following, we analyze the behaviour of the finetuning dynamics as model width $n$ grows.
Initialization.
We consider a Gaussian initialization of the weights as follows: $a_i \sim \mathcal{N}(0, \sigma_a^2)$, $b \sim \mathcal{N}(0, \sigma_b^2)$ (the Gaussian distribution can be replaced by any other distribution with finite variance). With LoRA, we generally want to initialize the product $b\, a^\top$ to be $0$ so that finetuning starts from the pretrained model. This implies that at least one of the weights $a$ and $b$ is initialized to $0$. If both are initialized to $0$, no learning occurs since this is a saddle point. Thus, we should initialize one of the parameters $a$ and $b$ to be non-zero and the other to be zero. If we choose a non-zero initialization for $a$, then following standard initialization schemes (e.g., He Init [He et al., 2016], LeCun Init [LeCun et al., 2002]), one should set $\sigma_a^2 = \Theta(n^{-1})$ to ensure $a^\top x$ does not explode with width. This is justified by the Central Limit Theorem (CLT) (technically, the CLT only ensures the almost sure convergence; the $L_2$ convergence follows from the Dominated Convergence Theorem, and we omit these technical details in this paper). On the other hand, if we choose a non-zero initialization for $b$, one should make sure that $\sigma_b^2 = \Theta(1)$. This leaves us with two possible schemes:
- Init[1]: $\sigma_b^2 = 0$, $\sigma_a^2 = \Theta(n^{-1})$.
- Init[2]: $\sigma_b^2 = \Theta(1)$, $\sigma_a^2 = 0$.
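The two initialization schemes for the toy model can be sketched in NumPy as follows. The snippet uses variance exactly $1/n$ for Init[1] and variance $1$ for Init[2]; these constants, the all-ones input, the width, and the function name `init_lora` are assumptions made for illustration only.

```python
import numpy as np

def init_lora(n, scheme, rng):
    """Return (a, b) for the toy model f(x) = (W* + b a^T) x."""
    if scheme == 1:   # Init[1]: b = 0, a_i ~ N(0, 1/n)
        return rng.normal(scale=n ** -0.5, size=n), 0.0
    if scheme == 2:   # Init[2]: b ~ N(0, 1), a = 0
        return np.zeros(n), float(rng.normal())
    raise ValueError("scheme must be 1 or 2")

rng = np.random.default_rng(0)
n = 4096
x = np.ones(n)  # input whose coordinates are Theta(1)

a1, b1 = init_lora(n, 1, rng)
a2, b2 = init_lora(n, 2, rng)

# In both schemes the product b * a vanishes, so finetuning starts from W*.
start_1 = b1 * a1
start_2 = b2 * a2

# Under Init[1], a^T x has variance ||x||^2 / n = 1 for this input, whatever n is.
samples = [init_lora(n, 1, rng)[0] @ x for _ in range(2000)]
empirical_std = float(np.std(samples))
```

Changing `n` in this sketch should leave `empirical_std` close to one, which is the sense in which $a^\top x$ does not explode with width.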
Our analysis will only consider these two initialization schemes for LoRA modules, although the results should in principle hold for other schemes, provided that stability (as discussed above) is satisfied.
Learning rate.
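WLOG, we can simplify the analysis by assuming that $W^* = 0$. This can be achieved by setting $\tilde{y} = y - W^* x$. The gradients are given by
$$
\frac{\partial \mathcal{L}}{\partial b} = a^\top x\, (f(x) - y), \qquad \frac{\partial \mathcal{L}}{\partial a} = b\, (f(x) - y)\, x.
$$
We use subscript $t$ to denote the finetuning step. Let $U_t = f_t(x) - y$. At step $t$ with learning rate $\eta > 0$, we have
$$
\Delta f_t := f_t(x) - f_{t-1}(x) = \underbrace{-\eta\, b_{t-1}^2\, U_{t-1}\, \|x\|^2}_{\delta_t^1} \; \underbrace{-\, \eta\, (a_{t-1}^\top x)^2\, U_{t-1}}_{\delta_t^2} \; + \; \underbrace{\eta^2\, U_{t-1}^2\, b_{t-1}\, (a_{t-1}^\top x)\, \|x\|^2}_{\delta_t^3}.
$$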
The update in model output is driven by the three terms $(\delta_t^i)_{i \in \{1,2,3\}}$. The first two terms represent “linear” contributions to the update, i.e. change in model output driven by fixing $b$ and updating $a$ and vice-versa. These terms are order one in $\eta$. The third term $\delta_t^3$ represents a multiplicative update, compounding the updates in $a$ and $b$, and is an order two term in $\eta$. As $n$ grows, a desirable property is that $\Delta f_t = \Theta(1)$. Intuitively, this means that as we scale the width, feature updates do not ‘suffer’ from this scaling (see Section A.1 for more details). An example of a scenario where feature learning is affected by scaling is the lazy training regime [Jacot et al., 2018], where feature updates are of order $\Theta(n^{-1/2})$, which implies that no feature learning occurs in the limit $n \to \infty$. The condition $\Delta f_t = \Theta(1)$ also implies that the update does not explode with width, which is also a desirable property.
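For completeness, the three-term decomposition can be derived in two lines. The sketch below works under the simplification $W^* = 0$, so that $f_t(x) = b_t\, a_t^\top x$, writes $U_{t-1} = f_{t-1}(x) - y$, and labels the three terms $\delta_t^1, \delta_t^2, \delta_t^3$ in the order in which they are discussed above.

```latex
% One gradient step with learning rate \eta on L = (f(x) - y)^2 / 2:
\begin{aligned}
b_t &= b_{t-1} - \eta\, U_{t-1}\, a_{t-1}^\top x, \qquad
a_t = a_{t-1} - \eta\, U_{t-1}\, b_{t-1}\, x, \\[4pt]
% Expand f_t(x) = b_t a_t^\top x and subtract f_{t-1}(x) = b_{t-1} a_{t-1}^\top x:
f_t(x) - f_{t-1}(x)
&= \underbrace{-\eta\, b_{t-1}^2\, U_{t-1}\, \|x\|^2}_{\delta_t^1:\ a \text{ updated},\ b \text{ fixed}}
\;\underbrace{-\,\eta\, (a_{t-1}^\top x)^2\, U_{t-1}}_{\delta_t^2:\ b \text{ updated},\ a \text{ fixed}}
\;+\;\underbrace{\eta^2\, U_{t-1}^2\, b_{t-1}\, (a_{t-1}^\top x)\, \|x\|^2}_{\delta_t^3:\ \text{both updated}} .
\end{aligned}
```

The first two terms are linear in $\eta$ and the last one is quadratic, which is why it is referred to as the multiplicative term.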
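Having $\Delta f_t = \Theta(1)$ satisfied implies that at least one of the three terms $(\delta_t^i)_{i \in \{1,2,3\}}$ is $\Theta(1)$. Ideally, we want both $\delta_t^1$ and $\delta_t^2$ to be $\Theta(1)$ because otherwise it means that either $a$ or $b$ is not efficiently updated. For instance, if $\delta_t^1 = o(1)$, it means that as $n \to \infty$, the model acts as if $a$ is fixed and only $b$ is trained. Similar conclusions hold when $\delta_t^2 = o(1)$. Having both $\delta_t^1$ and $\delta_t^2$ being $\Theta(1)$ in width means that both $a$ and $b$ parameter updates significantly contribute to the change in $f_t(x)$, and we say that feature learning with LoRA is efficient when this is the case, i.e. $\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ and all $t > 1$. We will formalize this definition of efficiency in the next section. The reader might wonder why we do not require that $\delta_t^3$ be $\Theta(1)$. We will see that when both $\delta_t^1$ and $\delta_t^2$ are $\Theta(1)$, the term $\delta_t^3$ is also $\Theta(1)$.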
Efficiency Analysis.
Let us assume that we train the model with gradient descent with learning rate $\eta = \Theta(n^c)$ for some $c \in \mathbb{R}$, and suppose that we initialize the model with Init[1]. Since the training dynamics are mainly matrix vector products, sum of vectors/scalars etc (see [Yang et al., 2022]), it is easy to see that any quantity in the training dynamics should be of order $n^\gamma$ for some $\gamma \in \mathbb{R}$. (A crucial assumption for this to hold is that for any matrix/vector product in the training dynamics, the product dimension, i.e. the dimension along which the matrix/vector product is calculated, is $\Theta(n^\alpha)$ for some $\alpha > 0$. For instance, in the case of Transformers, this is satisfied since the MLP embedding dimension is generally $k \times n$. However, this condition would be violated if for instance one considers MLP embedding dimension $k n \log(n)$. Such non-standard scaling choices require a particular treatment, but the conclusions remain the same.) For any quantity $v$ in the training dynamics, we write $v = \Theta(n^{\gamma[v]})$. When $v$ is a vector, we use the same notation when all entries of $v$ are $\Theta(n^{\gamma[v]})$. The $\gamma$ notation is formally defined in Appendix A.
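Starting from initialization, we have $f_0(x) = 0$. LoRA finetuning is efficient when $\delta_t^1 = \Theta(1)$ and $\delta_t^2 = \Theta(1)$ for all $t > 1$ (here we use $t > 1$ instead of $t > 0$ because at $t = 1$, at least one of the terms $\delta_1^1$ or $\delta_1^2$ will be zero), and $f_t(x) = \Theta(1)$ for $t > 1$. This translates to
$$
\begin{cases}
c + 2\gamma[b_{t-1}] + 1 = 0 & (\delta_t^1 = \Theta(1)) \\
c + 2\gamma[a_{t-1}^\top x] = 0 & (\delta_t^2 = \Theta(1)) \\
\gamma[b_{t-1}] + \gamma[a_{t-1}^\top x] = 0 & (f_{t-1}(x) = \Theta(1))
\end{cases}
$$
Solving this system yields $c = -1/2$, i.e. the learning rate should scale as $\eta = \Theta(n^{-1/2})$ in order to achieve efficient feature learning. At initialization, $b_0 = 0$ and $a_0^\top x = \Theta(1)$ (by the Central Limit Theorem). Through an inductive argument, for $t > 0$, $b_t$ will be of order $\Theta(n^{-1/2})$ and $a_t^\top x$ will be of order $\Theta(1)$, yielding $f_t(x) = \Theta(n^{-1/2})$. Indeed, at each iteration the update to $b_t$ will be of order $\Theta(\eta\, y\, a_{t-1}^\top x) = \Theta(n^{-1/2})$ and the updates to $a_t$ are of order $\Theta(\eta\, b_{t-1}\, y\, x) = \Theta(n^{-1})$. As $f_t = \Theta(n^{-1/2})$, this contradicts the requirement of learning $\Theta(1)$ features.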
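The algebra behind this contradiction can be spelled out explicitly. The shorthand $\gamma_b = \gamma[b_{t-1}]$ and $\gamma_a = \gamma[a_{t-1}^\top x]$ is introduced here for brevity, and the sketch uses $\|x\|^2 = \Theta(n)$, which follows from the input coordinates being $\Theta(1)$.

```latex
\begin{aligned}
c + 2\gamma_b + 1 &= 0 && (\delta_t^1 = \Theta(1)) \\
c + 2\gamma_a &= 0 && (\delta_t^2 = \Theta(1)) \\
\gamma_b + \gamma_a &= 0 && (f_{t-1}(x) = \Theta(1)) \\[4pt]
\text{sum of the first two:}\quad 2c + 2(\gamma_b + \gamma_a) + 1 &= 0
\;\Longrightarrow\; c = -\tfrac{1}{2}, \quad \gamma_a = \tfrac{1}{4}, \quad \gamma_b = -\tfrac{1}{4}.
\end{aligned}
```

Efficiency with a single learning rate would therefore require $\gamma_a = 1/4$ and $\gamma_b = -1/4$, whereas the dynamics started from Init[1] with $\eta = \Theta(n^{-1/2})$ give $\gamma_a = 0$ and $\gamma_b = -1/2$. The two sets of exponents do not match.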
This shows that we cannot have both $\delta_t^1$ and $\delta_t^2$ to be $\Theta(1)$ with this parametrization (also true with Init[2]). We formalize this result in the next proposition and refer the reader to Appendix A for further technical details.
Proposition 1 (Inefficiency of LoRA fine-tuning).
Assume that LoRA weights are initialized with Init[1] or Init[2] and trained with gradient descent with learning rate $\eta = \Theta(n^c)$ for some $c \in \mathbb{R}$. Then, it is impossible to have $\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ for any $t > 0$, and therefore, fine-tuning with LoRA in this setup is inefficient.
In conclusion, efficiency cannot be achieved with this parametrization of the learning rate. This suggests that standard LoRA finetuning as currently used by practitioners is suboptimal, especially when model width is large, which is a property that is largely satisfied in practice ($n \approx 700$ for GPT2 and $n \approx 4000$ for Llama). This analysis suggests that we are missing crucial hyperparameters in the standard LoRA setup. Indeed, we show that by decoupling the learning rate for $a$ and $b$, we can have $\delta_t^i = \Theta(1)$ for $i \in \{1,2,3\}$. We write $\eta_a, \eta_b$ to denote the learning rates. The analysis conducted above remains morally the same with the only difference being in the learning rates. Let $\eta_a = \Theta(n^{c_a})$ and $\eta_b = \Theta(n^{c_b})$, and assume that weights are initialized with Init[1]. A similar analysis to the one conducted above shows that having $f_t(x) = \Theta(1)$ and $\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ and $t > 1$ implies that for all $t > 1$
$$
\begin{cases}
c_a + 2\gamma[b_{t-1}] + 1 = 0 & (\delta_t^1 = \Theta(1)) \\
c_b + 2\gamma[a_{t-1}^\top x] = 0 & (\delta_t^2 = \Theta(1)) \\
\gamma[b_{t-1}] + \gamma[a_{t-1}^\top x] = 0 & (f_{t-1}(x) = \Theta(1))
\end{cases}
$$
which, after simple calculations, implies that $c_a + c_b = -1$. This is only a necessary condition. In the next result, taking also some elements of stability into consideration, we fully characterize the choice of $\eta_a$ and $\eta_b$ to ensure efficient LoRA fine-tuning.
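The “simple calculations” amount to adding the two efficiency conditions and using the stability condition. In the sketch below, $c_a$ and $c_b$ denote the width exponents of the learning rates of $a$ and $b$, and $\gamma_b$, $\gamma_a$ are shorthand (introduced here) for the width exponents of $b_{t-1}$ and $a_{t-1}^\top x$.

```latex
\begin{aligned}
(c_a + 2\gamma_b + 1) + (c_b + 2\gamma_a) &= 0 \\
c_a + c_b + 2\underbrace{(\gamma_b + \gamma_a)}_{=\,0} + 1 &= 0
\;\Longrightarrow\; c_a + c_b = -1 .
\end{aligned}
```

For instance, the pair $(c_a, c_b) = (-1, 0)$ satisfies this condition with $\gamma_b = \gamma_a = 0$, i.e. with both $b_{t-1}$ and $a_{t-1}^\top x$ of order one.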
Proposition 2 (Efficient Fine-Tuning with LoRA).
In the case of model (2), with $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$, we have for all $t > 1$, $i \in \{1,2,3\}$, $\delta_t^i = \Theta(1)$.
We refer the reader to Appendix A for more details on the proof of Proposition 2. In conclusion, scaling the learning rates as $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$ ensures stability ($\Delta f_t = \Theta(1)$) and efficiency of LoRA finetuning ($\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ and $t > 1$) in the infinite-width limit. In practice, this means that the learning rate for $b$ should be generally much larger than that of $a$. This remains true even if $b \in \mathbb{R}^r$ for general $r$. We will later see that this scaling is valid for general neural network models.
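The scaling can be probed numerically on the toy linear model. The sketch below is not the paper's experiment: it assumes an all-ones input, target $y = 1$, $W^* = 0$, Init[1] with variance $1/n$, a constant $0.1$ in front of each learning rate, and a fixed seed, all chosen only for illustration. It compares a shared learning rate of order $n^{-1}$ (one example of a single-rate scaling) with the decoupled choice, where the rate for $a$ is of order $n^{-1}$ and the rate for $b$ is of order one, and records the two linear terms of the output update at the second step (at the first step one of them is zero because $b_0 = 0$).

```python
import numpy as np

def step_terms(a, b, x, y, eta_a, eta_b):
    """One gradient step on L = 0.5 * (b * a @ x - y)**2, i.e. model (2) with W* = 0.

    Returns the updated (a, b) and the three contributions to f_new - f_old.
    """
    s = a @ x                      # a^T x
    u = b * s - y                  # U = f(x) - y
    a_new = a - eta_a * b * u * x
    b_new = b - eta_b * s * u
    d1 = -eta_a * b**2 * u * (x @ x)               # a updated, b fixed
    d2 = -eta_b * s**2 * u                         # b updated, a fixed
    d3 = eta_a * eta_b * u**2 * b * s * (x @ x)    # both updated
    return a_new, b_new, (d1, d2, d3)

def run(n, eta_a, eta_b, steps=2, seed=0):
    """Run a few steps from Init[1] and return the three terms of the last step."""
    rng = np.random.default_rng(seed)
    x, y = np.ones(n), 1.0
    a, b = rng.normal(scale=n ** -0.5, size=n), 0.0
    for _ in range(steps):
        f_old = b * (a @ x)
        a, b, terms = step_terms(a, b, x, y, eta_a, eta_b)
        # Sanity check: the three terms add up to the change in the output.
        assert np.isclose(b * (a @ x) - f_old, sum(terms))
    return terms

results = {}
for n in (256, 4096, 65536):
    shared = run(n, eta_a=0.1 / n, eta_b=0.1 / n)   # single rate of order 1/n
    decoupled = run(n, eta_a=0.1 / n, eta_b=0.1)    # rate of b larger by a factor n
    results[n] = (shared, decoupled)
    print(n, "shared:", shared[:2], "decoupled:", decoupled[:2])
```

Under these assumptions, the decoupled terms depend on $n$ only through the random draw of $a_0^\top x$, while with the shared rate the term driven by the update of $b$ carries an extra factor $1/n$ and the term driven by the update of $a$ a factor $1/n^2$.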
![Refer to caption](x1.png)
![Refer to caption](x2.png)
3.2 Verifying the Results on a Toy Model
The previous analysis considers a simple linear model. To assess the validity of the scaling rules in a non-linear setting, we consider a neural network model given by
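$$
f(x) = W_{out}\, \phi(B A\, \phi(W_{in}\, x)), \tag{3}
$$
where $W_{in} \in \mathbb{R}^{n \times d}$, $W_{out} \in \mathbb{R}^{1 \times n}$, $A \in \mathbb{R}^{r \times n}$, $B \in \mathbb{R}^{n \times r}$ are the weights, and $\phi$ is the ReLU function. The model is trained on a synthetic dataset. See Appendix C for more details.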
Only the weight matrices $A, B$ are trained ($W_{in}, W_{out}$ are fixed). We use fixed train and test datasets (see Appendix C for more details about the experimental setup). The train/test loss for varying $\eta_A$ and $\eta_B$ is reported in Figure 2 at the early stages of the training and after convergence (for reasonable choices of learning rates). The red markers represent learning rates $(\eta_A, \eta_B)$ for which the loss is within a fixed range of the best loss, and the dashed line represents the case where the learning rates are set equal ($\eta_A = \eta_B$). We observe that both the best train and test losses are consistently achieved by a combination of learning rates where $\eta_B \gg \eta_A$, which validates our analysis in the previous section. Notice also that optimal learning rates are generally close to the edge of stability, a well-known behaviour in training dynamics of deep networks [Cohen et al., 2021].
4 Stability and Feature Learning with LoRA in the Infinite Width Limit
In this section, we extend the analysis above to general neural architectures with LoRA layers. We show that the conclusions from the analysis on the linear model hold for general neural architectures: 1) using the same learning rate for both $A$ and $B$ leads to suboptimal feature learning when model width is large, and 2) this problem can be fixed by setting different learning rates for $A$ and $B$.
Since our aim in this paper is primarily methodological, the theoretical results in this section are of a physics level of rigor, omitting technical assumptions that would otherwise make the analysis rigorous but unnecessarily complicated. In all the results, LoRA rank $r$ is considered fixed and finetuning dynamics are analyzed in the limit of infinite-width. This setup fairly represents practical scenarios where $r \ll n$ and $r$ is generally small.
Notation.
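The LoRA weights are initialized with $A_{ij} \sim \mathcal{N}(0, \sigma_A^2)$, $B_{ij} \sim \mathcal{N}(0, \sigma_B^2)$ for some $\sigma_A, \sigma_B \geq 0$ (in [Hu et al., 2021], $B$ is initialized to $0$, which corresponds to setting $\sigma_B = 0$). Here also, we assume that either $\sigma_B^2 = 0$ and $\sigma_A^2 = \Theta(n^{-1})$ (Init[1]), or $\sigma_B^2 = \Theta(1)$ and $\sigma_A^2 = 0$ (Init[2]). Given a LoRA layer in the model, $\underline{Z}$ denotes the input to that layer and $\bar{Z}$ the output after adding the pretrained weights. More precisely, we write $\bar{Z} = W^* \underline{Z} + \frac{\alpha}{r} B A\, \underline{Z}$.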
Our main analysis relies on a careful estimation of the magnitude of several quantities including LoRA features. Let us first give a formal definition.
Definition 2 (LoRA Features).
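Given a general neural architecture and a LoRA layer (Definition 1), we define LoRA features $(Z_A, Z_B)$ as $Z_A = A\, \underline{Z}$ and $Z_B = B Z_A = B A\, \underline{Z}$. At fine-tuning step $t$, we use the superscript $t$ to denote the value of LoRA features $Z_A^t, Z_B^t$, and the subscript $t$ to denote the weights $A_t, B_t$.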
Finetuning Dataset.
To simplify the analysis, we assume that the finetuning dataset comprises a single sample $(x, y)$ (this assumption is for simplification purposes only; all our analysis can be re-written with ‘batched’ gradients and the conclusions remain the same, although some additional assumptions are required to make the analysis rigorous), and the goal is to minimize the loss $\mathcal{L}(\theta; (x, y))$ computed with the underlying model where the adjusted weights are given by $W^* + \frac{\alpha}{r} BA$ for all LoRA layers (here $\theta = \{A, B \text{ for all LoRA layers in the model}\}$). At training step $t$, and for any LoRA layer in the model, $\underline{Z}^t$ is the input to the LoRA layer, computed with data input $x$. Similarly, we write $d\bar{Z}^t$ to denote the gradient of the loss function with respect to the layer output features $\bar{Z}$ evaluated at data point $(x, y)$.
The notion of stability of LoRA as discussed in Section 3 can be generalized to any neural network model as follows.
Definition 3 (Stability).
We say that LoRA finetuning is stable if for all LoRA layers in the model, and all training steps $t$, we have $\underline{Z}, Z_A, Z_B = \mathcal{O}(1)$ as $n$ goes to infinity.
Stability implies that no quantity in the network explodes as width grows, a desirable property as we scale the model. (It is possible to define stability as $\underline{Z}, Z_B = \mathcal{O}(1)$ and exclude $Z_A$ from the condition. This would allow scenarios where for instance the entries of $A$ explode with width but their magnitude is compensated with a smaller magnitude of $B$. This system has one degree of freedom because of the homogeneity of the product $BA$, and by imposing that $Z_A = \mathcal{O}(1)$, we avoid having such scenarios.) Naturally, in order to ensure stability, one has to scale hyperparameters (initialization, learning rate) as $n$ grows. Scaling rules for initialization are fairly easy to infer and were already discussed in Section 3 where we obtained two plausible initialization schemes (Init[1] and Init[2]). More importantly, if we arbitrarily scale the learning rate with width, we might end up with suboptimal learning as width grows even if the finetuning is stable. This is the case for instance when we aggressively downscale the learning rate with width, or inadequately parameterize the network (e.g. Neural Tangent Kernel parametrization which leads to the kernel regime in the infinite width limit, [Jacot et al., 2018]). To take this into account, we define a notion of feature learning with LoRA.
Definition 4 (Stable Feature Learning with LoRA).
We say that LoRA finetuning induces stable feature learning if it is stable (Definition 3), and for all LoRA layers and finetuning step $t$, we have $\Delta Z_B^t := Z_B^t - Z_B^{t-1} = \Theta(1)$.
A similar definition of feature learning was introduced in [Yang and Littwin, 2023] for pretraining. This definition ensures that the network is not ‘stuck’ in a kernel regime where feature updates are of order $\mathcal{O}(n^{-\epsilon})$ in the infinite-width limit for some $\epsilon > 0$, which implies that no feature learning occurs in the limit. The authors introduced the $\mu$-parameterization (or maximal update parametrization), a specific network parameterization (initialization + learning rate scaling), that ensures that feature updates are $\Theta(1)$. Note that here we added stability in the definition, but in principle, one could define feature learning with $\Omega$ instead of $\Theta$. The latter covers unstable scenarios (e.g. when $\Delta Z_B^t = \Theta(n)$ due to improper scaling of initialization and learning rate), so we omit it here and focus on stable feature learning. Also, notice that we only consider finetuning dynamics and not the pretraining dynamics. However, since our analysis depends on weights $W^*$ from pretraining, we assume that pretraining parameterization ensures stability and feature learning as width grows (see Appendix A for more details). When taking the infinite width limit, we assume that pretraining parameterization is $\mu$P. This is just a technicality for the infinite-width limit and does not have any implications on practical scenarios where the width is finite. The most important implication of this assumption is that in the pretrained network (before introducing LoRA layers), we have $\underline{Z} = \Theta(1)$ and $\bar{Z} = \Theta(1)$, which holds for a general input-output pair $(x, y)$.
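At finetuning step $t$, the gradients are given by
$$
\frac{\partial \mathcal{L}}{\partial B_t} = \frac{\alpha}{r}\, d\bar{Z}^{t-1} \otimes A_{t-1} \underline{Z}^{t-1}, \qquad
\frac{\partial \mathcal{L}}{\partial A_t} = \frac{\alpha}{r}\, B_{t-1}^\top d\bar{Z}^{t-1} \otimes \underline{Z}^{t-1},
$$
where $u \otimes v$ denotes the outer product $u v^\top$ of vectors $u$, $v$, and the weights are updated as follows
$$
A_t = A_{t-1} - \eta_A\, g_A^{t-1}, \qquad B_t = B_{t-1} - \eta_B\, g_B^{t-1},
$$
where $g_A, g_B$ are processed gradients (e.g. normalized gradients with momentum as in AdamW etc). Hereafter, we assume that the gradients are processed in a way that makes their entries $\Theta(1)$. This is generally satisfied in practice (with Adam for instance) and has been considered in [Yang and Littwin, 2023] to derive the $\mu$-parametrization for general gradient processing functions.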
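To make the notion of processed gradients concrete, here is a minimal NumPy sketch of a single Adam-style step applied separately to the two adapter matrices, each with its own learning rate. The raw gradients are random stand-ins with deliberately different scales, and the sizes, learning rates, and the helper name `adam_direction` are assumptions made for illustration. At the first step, the bias-corrected Adam direction is approximately the sign of the raw gradient, so its entries are of order one whatever the raw scale is.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam-processed gradient and the updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
n, r = 512, 4
A = rng.normal(scale=n ** -0.5, size=(r, n))   # Init[1]-style: A random, B zero
B = np.zeros((n, r))
grad_A = 1e-2 * rng.normal(size=A.shape)       # stand-in raw gradients with
grad_B = 1e+2 * rng.normal(size=B.shape)       # very different magnitudes

eta_A, eta_B = 1e-4, 1e-3                      # hypothetical values, eta_B > eta_A
g_A, m_A, v_A = adam_direction(grad_A, np.zeros_like(A), np.zeros_like(A), t=1)
g_B, m_B, v_B = adam_direction(grad_B, np.zeros_like(B), np.zeros_like(B), t=1)

A_new = A - eta_A * g_A   # update of A with its own learning rate
B_new = B - eta_B * g_B   # update of B with its own learning rate
```

Because the processed entries are of order one, the size of each update is set by the learning rate alone, which is why the learning rates are the natural quantities to scale with width.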
Unlike the linear model in Section 3, LoRA feature updates are not only driven by the change in the $A, B$ weights, but also by $\underline{Z}, d\bar{Z}$, which are updated as we finetune the model (assuming there are multiple LoRA layers). To isolate the contribution of individual LoRA layers to feature learning, we assume that only a single LoRA layer is trainable and all other LoRA layers are frozen (this is equivalent to having only a single LoRA layer in the model since LoRA layers are initialized to zero; in this way, we can quantify feature learning induced by the LoRA layer as we finetune the model). In this setting, considering the only trainable LoRA layer in the model, the layer input $\underline{Z}$ is fixed and does not change with $t$, while $d\bar{Z}$ changes with step $t$ (because $\bar{Z}^t = (W^* + \frac{\alpha}{r} B_t A_t)\, \underline{Z}$). After step $t$, $Z_B$ is updated as follows
$$
\Delta Z_B^t = \underbrace{B_{t-1}\, \Delta Z_A^t}_{\delta_t^1} + \underbrace{\Delta B_t\, Z_A^{t-1}}_{\delta_t^2} + \underbrace{\Delta B_t\, \Delta Z_A^t}_{\delta_t^3},
$$
where $\Delta B_t = B_t - B_{t-1}$ and $\Delta Z_A^t = Z_A^t - Z_A^{t-1}$.
As discussed in Section 3, the terms $\delta_t^1, \delta_t^2$ represent the ‘linear’ feature updates that we obtain if we fix one weight matrix and only train the other, while $\delta_t^3$ represents the ‘multiplicative’ feature update which captures the compounded update due to updating both $A$ and $B$.
Analysis of the Role of $A$ and $B$.
As discussed above, we want to ensure that $\delta_t^1 = \Theta(1)$ and $\delta_t^2 = \Theta(1)$, which means that both weight matrices contribute to the update in $Z_B$. To further explain why this is a desirable property, let us analyze how changes in matrices $A$ and $B$ affect the LoRA feature $Z_B = B A\, \underline{Z}$.
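Let $(B_{:,i})_{1 \leq i \leq r}$ denote the columns of $B$. We can express $Z_B$ as $Z_B = \sum_{i=1}^{r} (A \underline{Z})_i\, B_{:,i}$, where $(A \underline{Z})_i$ is the $i$-th coordinate of $A \underline{Z}$. This decomposition suggests that the direction of $Z_B$ is a weighted sum of the columns of $B$, and $A$ modulates the weights. With this, we can also write
$$
\delta_t^1 = \sum_{i=1}^{r} (\Delta A_t\, \underline{Z})_i\, (B_{:,i})_{t-1}, \qquad
\delta_t^2 = \sum_{i=1}^{r} (A_{t-1}\, \underline{Z})_i\, \big[(B_{:,i})_t - (B_{:,i})_{t-1}\big],
$$
where $(B_{:,i})_t$ refers to the columns of $B$ at time step $t$. Having both $\delta_t^1$ and $\delta_t^2$ of order $\Theta(1)$ means that both $A$ and $B$ are ‘sufficiently’ updated to induce a change in weights $(A \underline{Z})_i$ and directions $B_{:,i}$. If one of the matrices $A, B$ is not efficiently updated, we might end up with suboptimal finetuning, leading to either non updated directions $B$ or direction weights $A_{t-1} \underline{Z}$. For instance, assuming that the model is initialized with Init[2], and that $B$ is not efficiently updated, the direction of $Z_B$ will be mostly determined by the vector (sub)space of dimension $r$ generated by the columns of $B$ at initialization. This analysis leads to the following definition of efficient learning with LoRA.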
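Both identities used in this discussion, the three-term split of the feature update and the column-wise view of $Z_B$, are purely algebraic and can be checked numerically. The NumPy sketch below uses random stand-ins for the weights and their updates; the sizes, scales, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 256, 4
Z = rng.normal(size=n)                              # fixed layer input
A_prev = rng.normal(scale=n ** -0.5, size=(r, n))   # A at step t-1
B_prev = rng.normal(size=(n, r))                    # B at step t-1
dA = 1e-2 * rng.normal(size=(r, n))                 # stand-in for A_t - A_{t-1}
dB = 1e-2 * rng.normal(size=(n, r))                 # stand-in for B_t - B_{t-1}

Z_A_prev, dZ_A = A_prev @ Z, dA @ Z
delta1 = B_prev @ dZ_A      # B fixed, A updated
delta2 = dB @ Z_A_prev      # A fixed, B updated
delta3 = dB @ dZ_A          # both updated
dZ_B = (B_prev + dB) @ ((A_prev + dA) @ Z) - B_prev @ Z_A_prev

# Column view: Z_B is a weighted sum of the columns of B with weights (A Z)_i.
Z_B_cols = sum(Z_A_prev[i] * B_prev[:, i] for i in range(r))
delta1_cols = sum(dZ_A[i] * B_prev[:, i] for i in range(r))   # weights change
delta2_cols = sum(Z_A_prev[i] * dB[:, i] for i in range(r))   # directions change
```

In this view the first term changes the weights placed on fixed directions, and the second term changes the directions themselves under fixed weights.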
![Refer to caption](x3.png)
Definition 5 (Efficient Learning).
We say that LoRA fine-tuning is efficient if it is stable (Definition 3), and for all LoRA layers in the model, all steps $t > 1$, and $i \in \{1, 2\}$, we have $\delta_t^i = \Theta(1)$.
Note that it is possible to achieve stable feature learning (Definition 4) without necessarily having efficient learning. This is the case when for instance $B$ is not updated (fixed to a non-zero init with Init[2]) and only $A$ is updated, which corresponds to simply setting $\eta_B = 0$. This is a trivial case, but other non-trivial cases of inefficiency are common in practice, such as the use of the same learning rate for $A$ and $B$, which is a standard practice. In the next theorem, we characterize the optimal scaling of learning rates $\eta_A$ and $\eta_B$, a conclusion similar to that of Section 3.
Theorem 1 (Efficient LoRA (Informal)).
Assume that weight matrices $A$ and $B$ are trained with Adam with respective learning rates $\eta_A$ and $\eta_B$. Then, it is impossible to achieve efficiency with $\eta_A = \eta_B$. However, LoRA Finetuning is efficient with $\eta_A = \Theta(n^{-1})$ and $\eta_B = \Theta(1)$.
The result of Theorem 1 suggests that efficiency can only be achieved with $\eta_B / \eta_A = \Theta(n)$. In practice, this translates to setting $\eta_B \gg \eta_A$, but does not provide a precise ratio $\eta_B / \eta_A$ to be fixed while tuning the learning rate (the constant in ‘$\Theta$’ is generally intractable), unless we tune both $\eta_B$ and $\eta_A$, which is not efficient from a computational perspective as it becomes a 2D tuning problem. It is therefore natural to set a fixed ratio $\eta_B / \eta_A$ and tune only $\eta_A$ (or $\eta_B$), which would effectively reduce the tuning process to a 1D grid search, achieving the same computational cost as standard LoRA where the learning rate is the same for $A$ and $B$. We call this method LoRA+.
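A minimal sketch of this recipe is given below: adapter parameters are split into two groups, the second group receives a learning rate equal to a fixed ratio times the first, and only the base learning rate is searched over. The parameter names, the substrings `lora_A` and `lora_B`, the ratio value, the grid, and the helper `loraplus_param_groups` are all hypothetical and chosen for illustration.

```python
def loraplus_param_groups(param_names, lr_A, ratio):
    """Split parameter names into two groups with learning rates lr_A and ratio * lr_A.

    Assumes (hypothetically) that adapter parameters carry the substrings
    "lora_A" and "lora_B" in their names; adjust the matching to your codebase.
    """
    group_A = [p for p in param_names if "lora_A" in p]
    group_B = [p for p in param_names if "lora_B" in p]
    return [
        {"params": group_A, "lr": lr_A},           # learning rate of the A matrices
        {"params": group_B, "lr": lr_A * ratio},   # learning rate of the B matrices
    ]

names = [
    "layers.0.attn.q_proj.lora_A", "layers.0.attn.q_proj.lora_B",
    "layers.0.attn.v_proj.lora_A", "layers.0.attn.v_proj.lora_B",
]
ratio = 16                            # placeholder value, to be chosen empirically
lr_grid = [1e-5, 2e-5, 5e-5, 1e-4]    # 1D grid over the base learning rate only
configs = [loraplus_param_groups(names, lr, ratio) for lr in lr_grid]
```

The list-of-dicts layout mirrors the per-parameter-group format accepted by `torch.optim` optimizers, in which case the lists would hold parameter tensors rather than names.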
In the next section, through extensive empirical evaluations, we first validate our theoretical result and show that optimal pairs $(\eta_A, \eta_B)$ (in terms of test accuracy) generally satisfy $\eta_B \gg \eta_A$. We then investigate the optimal ratio $\eta_B / \eta_A$ for LoRA+ and suggest a default ratio that was empirically found to generally improve performance compared to standard LoRA. Although the conclusions of Theorem 1 and Proposition 2 are similar, the proof techniques are different. In Proposition 2, the linear model is trained with gradient descent, while in Theorem 1, the training algorithm is Adam-type in the sense that it normalizes the gradients before updating the weights. The formal statement of Theorem 1 requires an additional assumption on the alignment of the processed gradients $g_A$ with the LoRA input $\underline{Z}$. This technical detail is introduced and discussed in Appendix A.
5 Experiments with Language Models
We report our empirical results using LoRA to finetune a set of language models on different benchmarks. Details about the experimental setup and more empirical results are provided in Appendix C. We also identify a default value for the ratio $\eta_B / \eta_A$ that generally improves performance as compared to standard LoRA. The code for our experiments is available at https://github.com/nikhil-ghosh-berkeley/loraplus.
5.1 GLUE tasks with GPT-2 and RoBERTa
The GLUE benchmark (General Language Understanding Evaluation) consists of several language tasks that evaluate the understanding capabilities of language models [Wang et al., 2018]. Using LoRA, we finetune Roberta-base from the RoBERTa family [Liu et al., 2019] and GPT-2 [Radford et al., 2019] on MNLI, QQP, SST2, and QNLI tasks (other tasks are smaller and generally require an already finetuned model, e.g. on MNLI, as starting checkpoint) with varying learning rates $(\eta_A, \eta_B)$ to identify the optimal combination. Empirical details are provided in Appendix C.
Roberta-base.
Figure 3 shows the results of Roberta-base finetuning, trained with half precision (FP16). We observe that test accuracy is consistently maximal for some set of learning rates satisfying $\eta_B \gg \eta_A$, outperforming the standard practice where $\eta_A$ and $\eta_B$ are usually set equal. Interestingly, the gap between the optimal choice of learning rates overall and the optimal choice when $\eta_A = \eta_B$ is more pronounced for ‘harder’ tasks like MNLI and QQP, as compared to SST2 and QNLI. This is probably due to the fact that harder tasks require more efficient feature learning. It is also worth mentioning that in our experiments, given limited computational resources, we limit the sequence length and finetune for only a small number of epochs for MNLI and QQP, so it is expected that we obtain test accuracies lower than those reported in [Hu et al., 2021], where the authors finetune Roberta-base with a different sequence length (for MNLI) and more epochs. In Appendix C, we provide additional results with Test/Train accuracy/loss.
GPT-2.
Figure 4 shows the results of finetuning GPT-2 with LoRA on MNLI and QQP (other tasks and full precision training are provided in Appendix C). Similar to the conclusions from Roberta-base, we observe that maximal test accuracies are achieved with some $(\eta_A, \eta_B)$ satisfying $\eta_B \gg \eta_A$. Further GPT-2 results with different tasks are provided in Appendix C. Here also, we observed that the harder the task, the larger the gap between model performance when $\eta_B \gg \eta_A$ and when $\eta_A = \eta_B$.
![Refer to caption](x4.png)
5.2 Llama
To further validate our theoretical findings, we finetune the Llama-7b model [Touvron et al., 2023] on the MNLI dataset and flan-v2 dataset [Longpre et al., 2023] using LoRA. Each trial is averaged over two seeds.
Flan-v2.
We examine LoRA training of Llama on the instruction finetuning dataset flan-v2 [Longpre et al., 2023]. To make the experiments computationally feasible, we train for one epoch on a random subset of the flan-v2 dataset of size 100k. We record the test accuracy of the best checkpoint every 500 steps. The LoRA hyperparameters are set to and . The adapters are added to every linear layer (excluding embedding layers) and we use a constant learning rate schedule. The full training details are in Appendix C.
![Refer to caption](x5.png)
We evaluate the final model on the MMLU benchmark [Hendrycks et al., 2020]. The results in Figure 5 show that for this benchmark taking is advantageous and results in a roughly 1.3% gain compared with the optimal . In Appendix C we show that the same effect holds also when using Init[1].
MNLI.
The right panel of Figure 5 shows the results of finetuning Llama-7b with LoRA on MNLI, with , . We train using half precision and a constant learning rate schedule, with a sequence length . Since MNLI is relatively easy for Llama, we finetune for only one epoch, which is sufficient for the model to reach its peak test accuracy. In Figure 5, is nearly optimal for all . This is consistent with the intuition that efficient feature learning is not required for easy tasks and that having does not significantly enhance performance. Additionally, the magnitude of stable learning rates for Llama is much smaller than for GPT-2 and RoBERTa on MNLI, further supporting that Llama requires less adaptation. Analogous plots for the train and test loss are shown in Figure 19 in Appendix C.
5.3 How to set LoRA+ Ratio?
Naturally, the optimal ratio depends on the architecture and the finetuning task via the constants in ‘Θ’ (Theorem 1). This is a limitation of these asymptotic results, since they do not offer any insight into how the constants are affected by the task and the neural architecture.
![Refer to caption](x6.png)
Figure 6 shows the distribution of the ratio for the top runs in terms of test accuracy for different pairs of (model, task). This is the same experimental setup as in Figure 3 and Figure 4. The optimal ratio is sensitive to both model and task, and shows significant variance. Our additional experiments in Appendix C show that it is also sensitive to initialization (Init[1] vs Init[2]). With Init[2], we found that generally setting a ratio of improves performance for Roberta (Figure 7). However, with Init[1], we found that the optimal ratio is smaller and is of order - (see Appendix C). For Llama experiments, it seems that a ratio of order - is optimal.
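The kind of summary described above, the learning rate ratio among the best runs of a sweep, can be computed in a few lines. The sketch below is illustrative only. The function name `top_run_ratios`, the record fields `lr_a`, `lr_b`, and `acc`, and the choice of a base-2 logarithm are assumptions, and the records are made-up placeholders rather than results from the paper.

```python
import math


def top_run_ratios(runs, k):
    """Return log2(lr_b / lr_a) for the k runs with the highest accuracy."""
    best = sorted(runs, key=lambda r: r["acc"], reverse=True)[:k]
    return [math.log2(r["lr_b"] / r["lr_a"]) for r in best]


# Made-up records for illustration only; these are NOT results from the paper.
runs = [
    {"lr_a": 1e-4, "lr_b": 1e-4, "acc": 0.80},
    {"lr_a": 1e-4, "lr_b": 4e-4, "acc": 0.83},
    {"lr_a": 5e-5, "lr_b": 8e-4, "acc": 0.85},
    {"lr_a": 2e-4, "lr_b": 2e-4, "acc": 0.79},
]
ratios = top_run_ratios(runs, k=2)
```

Working on a logarithmic scale keeps ratios that span several orders of magnitude comparable across models and tasks.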
![Refer to caption](x7.png)
6 Conclusion and Limitations
Employing a scaling argument, we showed that LoRA finetuning as it is currently used in practice is not efficient. We proposed a method, LoRA+, that resolves this issue by setting different learning rates for the LoRA adapter matrices. Our analysis is supported by extensive empirical results confirming the benefits of LoRA+ for both training speed and performance. These benefits are more significant for ‘hard’ tasks such as MNLI for Roberta/GPT2 (compared to SST2 for instance) and MMLU for Llama-7b (compared to MNLI for instance). However, as depicted in Figure 7, a more refined estimation of the optimal ratio should take into account task and model dependence, and our analysis in this paper lacks this dimension. We leave this for future work.
Acknowledgement
We thank Amazon Web Services (AWS) for cloud credits under an Amazon Research Award. We also gratefully acknowledge partial support from NSF grants DMS-2209975, 2015341, NSF grant 2023505 on Collaborative Research: Foundations of Data Science Institute (FODSI), the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and 814639, and NSF grant MC2378 to the Institute for Artificial CyberThreat Intelligence and OperatioN (ACTION).
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning, specifically, to speed up the leading algorithm LoRA for fine-tuning pre-trained large language models while improving the performance of the fine-tuned models. The speed-up saves computational resources when pre-trained large language models are customized for particular down-stream tasks. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Bordelon et al. [2023] Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit, 2023.
- Cohen et al. [2021] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jh-rTtvkGeM.
- Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
- Hayou [2023] Soufiane Hayou. On the infinite-depth limit of finite-width neural networks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=RbLsYz1Az9.
- Hayou et al. [2019] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2672–2680. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/hayou19a.html.
- Hayou et al. [2021] Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, and Judith Rousseau. Stable resnet. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1324–1332. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/hayou21a.html.
- He et al. [2023] Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation, 2023.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.
- Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kopiczko et al. [2023] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki Markus Asano. Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023.
- LeCun et al. [2002] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
- Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Li et al. [2023] Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023.
- Liu et al. [2022] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
- Liu et al. [2023] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
- Noci et al. [2023] Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, and Daniel M. Roy. The shaped transformer: Attention models in the infinite depth-and-width limit, 2023.
- OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Schoenholz et al. [2017a] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation, 2017a.
- Schoenholz et al. [2017b] S.S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017b.
- Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2018.
- Wang et al. [2023] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
- Yang [2019] G. Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Yang and Hu [2021] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
- Yang and Littwin [2023] Greg Yang and Etai Littwin. Tensor programs ivb: Adaptive optimization in the infinite-width limit. arXiv preprint arXiv:2308.01814, 2023.
- Yang et al. [2022] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
- Yang et al. [2023] Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.
- Yang et al. [2013] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90:161–189, 2013.
- Zeng and Lee [2023] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. arXiv preprint arXiv:2310.17513, 2023.
Appendix A Proofs
A.1 Scaling of Neural Networks
Scaling refers to the process of increasing the size of one of the ingredients in the model to improve performance (see e.g. [Hoffmann et al., 2022]). This includes model capacity which can be increased via width (embedding dimension) or depth (number of layers) or both, compute (training data), number of training steps etc. In this paper, we are interested in scaling model capacity via the width . This is motivated by the fact that most state-of-the-art language and vision models have large width.
It is well known that as the width grows, the network initialization scheme and the learning rate should be adapted to avoid numerical instabilities and ensure efficient learning. For instance, the initialization variance should scale to prevent arbitrarily large pre-activations as we increase model width (e.g. He init [He et al., 2016]). To derive such scaling rules, a principled approach consists of analyzing statistical properties of key quantities in the model (e.g. pre-activations) as grows and then adjusting the initialization, the learning rate, and the architecture itself to achieve desirable properties in the limit [Hayou et al., 2019, Schoenholz et al., 2017b, Yang, 2019].
In this context, [Yang et al., 2022] introduces the Maximal Update Parameterization (or μP), a set of scaling rules for the initialization scheme, the learning rate, and the network architecture that ensure stability and maximal feature learning in the infinite width limit. Stability is defined by for all and where the asymptotic notation ‘’ is with respect to width (see next paragraph for a formal definition), and feature learning is defined by , where refers to the feature update after taking a gradient step. μP guarantees that these two conditions are satisfied at any training step . Roughly speaking, μP specifies that hidden weights should be initialized with random weights, and weight updates should be of order . Input weights should be initialized and the weight updates should be as well, while the output weights should be initialized and updated with . These rules ensure both stability and feature learning in the infinite-width limit, in contrast to standard parameterization (exploding features if the learning rate is well tuned), and kernel parameterizations (e.g. Neural Tangent Kernel parameterization where , i.e. no feature learning in the limit).
A.2 The Gamma Function ()
In the theory of scaling of neural networks, one usually tracks the asymptotic behaviour of key quantities as we scale some model ingredient. For instance, if we scale the width, we are interested in quantifying how certain quantities in the network behave as width grows large and the asymptotic notation becomes natural in this case. This is a standard approach for (principled) model scaling and it has so far been used to derive scaling rules for initialization [Schoenholz et al., 2017b], activation function [Hayou et al., 2019], network parametrization [Yang et al., 2023], amongst other things.
With Init[1] and Init[2], the weights are initialized with for some . Assuming that the learning rates also scale polynomially with , it is straightforward that preactivations, gradients, and weight updates are all asymptotically polynomial in . It is therefore natural to introduce the Gamma function, and we write to capture this polynomial behaviour. Now, let us introduce some elementary operations with the Gamma function.
Multiplication.
Given two real-valued variables , we have .
Addition.
Given two real-valued variables , we generally have . The only case where this is violated is when . This is generally a zero probability event if and are random variables that are not perfectly correlated, which is the case in most situations where we make use of this formula (see the proofs below).
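The two elementary operations can be illustrated with a short numerical sketch. It assumes the standard rules for polynomial orders in the width: exponents add under multiplication, and the larger exponent dominates under addition unless the two terms cancel exactly. The helper names `gamma_mul` and `gamma_add` and the chosen width are hypothetical and serve only this illustration.

```python
import math


def gamma_mul(g1, g2):
    """Exponent of a product: n**g1 * n**g2 = n**(g1 + g2)."""
    return g1 + g2


def gamma_add(g1, g2):
    """Exponent of a sum, barring exact cancellation: the larger power dominates."""
    return max(g1, g2)


# Numerical check at a large (arbitrary) width n.
n = 10**6
a, b = n ** -1.0, n ** 0.5                # quantities of order n^-1 and n^0.5
est_prod = math.log(a * b) / math.log(n)  # empirical exponent of the product
est_sum = math.log(a + b) / math.log(n)   # empirical exponent of the sum
```

Here `est_prod` is close to `gamma_mul(-1.0, 0.5)` and `est_sum` is close to `gamma_add(-1.0, 0.5)`. The agreement for the sum is only approximate at finite width, which is why these rules are stated asymptotically.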
A.3 Proof of Proposition 1
Proposition 1. [Inefficiency of LoRA fine-tuning] Assume that LoRA weights are initialized with Init[1] or Init[2] and trained with gradient descent with learning rate for some . Then, it is impossible to have for all for any , and therefore, fine-tuning with LoRA in this setup is inefficient.
Proof.
Assume that the model is initialized with Init[1]. Since the training dynamics consist mainly of simple linear algebra operations (matrix-vector products, sums of vectors/scalars, etc.), it is easy to see that any vector/scalar in the training dynamics has a magnitude of order for some (for more details, see the Tensor Programs framework, e.g. [Yang, 2019]). For any quantity in the training dynamics, we write . When is a vector, we use the same notation when all entries of are . Efficiency is defined by having for and . Note that this implies for all . Let and assume that learning with LoRA is efficient. We will show that this leads to a contradiction. Efficiency requires that for all . Using the elementary formulas from Section A.2, this implies that for all
Solving this equation yields , i.e. LoRA finetuning can be efficient only if the learning rate scales as . Let us now show that this yields a contradiction. From the gradient updates and the elementary operations from Section A.2, we have the following recursive formulas
Starting from , with Init[1] we have and , we have and . Trivially, this holds for any . However, this implies that which means that cannot be . With Init[2], we have and . From the recursive formula we get and which remains true for all . In this case we have which contradicts .
In both cases, this contradicts our assumption, and therefore efficiency cannot be achieved in this setup.
∎
A.4 Proof of Proposition 2
Proposition 2. [Efficient Fine-Tuning with LoRA] In the case of Toy model Equation 2, with and , we have for all , , .
Proof.
The proof is similar in flavor to that of Proposition 1. In this case, the set of equations that should be satisfied so that are given by
where we have used the elementary formulas from Section A.2. Simple calculations yield . Using the gradient update expression with the elementary addition from Section A.2, the recursive formulas controlling and are given by
Starting from , with Init[1], we have and . Therefore , and . By induction, this holds for all . With Init[2], we have , and . At step , we have and , and this holds for all by induction. In both cases, to ensure that , we have to set and (straightforward from the equation ). In conclusion, setting and ensures efficient fine-tuning with LoRA.
∎
A.5 Proof of Theorem 1
In this section, we give a non-rigorous but intuitive proof of Theorem 1. The proof relies on the following assumption on the processed gradient .
Assumption 1.
With the same setup of Section 4, at training step , we have .
To see why Assumption 1 is sound in practice, let us study the product in the simple case of Adam with no momentum, a.k.a. SignSGD, which is given by
where the sign function is applied element-wise. At training step , we have
Let . Therefore we have
However, note that we also have
and as a result
Hence, we obtain
where we used the fact that .
This intuition should in principle hold for the general variant of Adam with momentum as long as the gradient processing function (a notion introduced in [Yang et al., 2013]) roughly preserves the direction. This reasoning can be made rigorous for general gradient processing functions using the Tensor Programs framework and taking the infinite-width limit where the components of all become iid. However, this necessitates an intricate treatment of several quantities in the process, which we believe is an unnecessary complication and does not serve the main purpose of this paper.
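The SignSGD update discussed above can be sketched in a few lines. This is a minimal illustration of the generic sign-based update on a flat list of weights, not the paper's training code; the names `sign` and `sign_sgd_step` and the numbers are hypothetical.

```python
def sign(x):
    """Element-wise sign: returns -1, 0, or 1."""
    return (x > 0) - (x < 0)


def sign_sgd_step(weights, grads, lr):
    """One SignSGD step: each coordinate moves by lr against the sign of its gradient."""
    return [w - lr * sign(g) for w, g in zip(weights, grads)]


# Gradients of very different magnitudes lead to steps of the same size.
old = [0.5, -0.2, 1.0]
new = sign_sgd_step(old, grads=[3.0, -0.01, 0.0], lr=0.1)
step_sizes = [abs(a - b) for a, b in zip(new, old)]
```

Every nonzero gradient entry produces a step of magnitude `lr`, whatever the size of the raw gradient. This is the property that makes the entries of the processed gradient insensitive to the gradient scale.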
Let us now give a proof for the main claim.
Theorem 1.
Assume that weight matrices and are trained with Adam with respective learning rates and and that Assumption 1 is satisfied with the Adam gradient processing function. Then, it is impossible to achieve efficiency with . However, LoRA finetuning is efficient with and .
Proof.
With the same setup of Section 4, at step , we have
The key observation here is that has entries of order as predicted and justified in Assumption 1. Having for and for translates to
which implies that .
With the gradient updates, we have
which implies that
Now assume that the model is initialized with Init[1]. We have and therefore for all , we have . We also have (because , and we use the Central Limit Theorem to conclude). Hence, if we choose the same learning rate for and , given by , we obtain , and therefore which violates the stability condition. A similar behaviour occurs with Init[2]. Hence, efficiency is not possible in this case. However, if we set and , we get that , and for all and . The same result holds with Init[2].
∎
Appendix B Efficiency from a Loss Perspective
Consider the same setup of Section 4. At step , the loss changes as follows
where is the Frobenius inner product in , and is the Euclidean product in . Since the direction of the feature updates is significantly correlated with , it should be expected that having for all results in more efficient loss reduction.
Appendix C Additional Experiments
This section complements the empirical results reported in the main text. We provide the details of our experimental setup, and show the acc/loss heatmaps for several configurations.
C.1 Empirical Details
C.1.1 Toy Example
In Figure 2, we trained a simple MLP with LoRA layers to verify the results of the analysis in Section 3. Here we provide the empirical details for these experiments.
Model.
We consider a simple MLP given by
where are the weights, and is the ReLU activation function. Here, we used , , and .
Dataset.
Synthetic dataset generated by with . The number of training examples is , and the number of test examples is .
Training.
We train the model with gradient descent for a range of values of . The weights are initialized as follows: . Only the weight matrices are trained and are fixed to their initial value.
C.1.2 GLUE Tasks with GPT2/Roberta
For our experiments with GPT2/Roberta-base models, finetuned on GLUE tasks, we use the following setup:
Tasks.
MNLI, QQP, SST2, QNLI
Models.
GPT2, Roberta-base
Training Alg.
AdamW with 1e-8, linear schedule, no warmup.
Learning rate grid.
4e-3, 2e-3, 1e-3, 5e-4, 2e-4, 1e-4, 8e-4, 4e-4, 2e-4, 1e-4, 5e-5, 2e-5, 1e-5 .
Target Modules for LoRA.
For Roberta-base, we add LoRA layers to ‘query’ and ‘value’ weights. For GPT2, we add LoRA layers to ‘c_attn, c_proj, c_fc’.
Other Hyperparameters.
Sequence length , train batch size , number of train epochs ( for SST2), number of random seeds .
GPUs.
Nvidia V100, Nvidia A10.
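A sweep over pairs of learning rates, as in the setup above, can be enumerated as follows. This is a sketch under stated assumptions: the function name `lr_pairs` and the field names are hypothetical, and the two grids are shortened placeholders rather than the grids used in the experiments.

```python
from itertools import product


def lr_pairs(grid_a, grid_b):
    """All (lr_a, lr_b) combinations, each tagged with the ratio lr_b / lr_a."""
    return [
        {"lr_a": a, "lr_b": b, "ratio": b / a}
        for a, b in product(grid_a, grid_b)
    ]


# Shortened, hypothetical grids used only to keep the example small.
grid_a = [1e-4, 2e-4]
grid_b = [1e-4, 4e-4, 8e-4]
pairs = lr_pairs(grid_a, grid_b)

# The standard practice of equal learning rates is the ratio-1 subset.
equal_lr = [p for p in pairs if p["lr_a"] == p["lr_b"]]
```

Tagging each configuration with its ratio makes it easy to compare the best run overall with the best run among the equal-learning-rate configurations.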
C.1.3 Llama MNLI
For our experiments using the Llama-7b model, finetuned on MNLI, we use the following setup:
Training Alg.
AdamW with , , 1e-6, constant schedule.
Learning rate grid.
1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4, 1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4,
LoRA Hyperparameters.
LoRA rank , , and dropout . LoRA target modules ‘q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj’.
Other Hyperparameters.
Sequence length , train batch size , number of train epochs , number of random seeds for and near test optimal, otherwise. Precision FP16.
GPUs.
Nvidia V100.
C.1.4 Llama flan-v2
For our experiments using the Llama-7b model, finetuned on a random subset of flan-v2 of size 100k, we use the following setup:
Training Alg.
AdamW with , , 1e-6, constant schedule.
Learning rate grid.
1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4, 1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4,
LoRA Hyperparameters.
LoRA rank , , and dropout . LoRA target modules ‘q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj’.
Other Hyperparameters.
Sequence length , , train batch size , number of epochs , number of random seeds for and near test optimal, otherwise. Precision BF16.
MMLU Evaluation.
We evaluate average accuracy on MMLU using 5-shot prompting.
GPUs.
Nvidia A10.
C.2 Results of Roberta-base Finetuning on all Tasks
Figure 3 showed finetuning test accuracy for Roberta-base. To complement these results, we show here the test/train accuracy for all tasks.
![Refer to caption](x8.png)
Interestingly, the optimal choice of learning rates for test accuracy differs from that for train accuracy, although the difference is small. This may be due to mild overfitting occurring during finetuning (the optimal choice of learning rates for train accuracy probably leads to some overfitting).
C.3 Results of GPT2 Finetuning on all Tasks
Figure 4 showed finetuning results for GPT2 on MNLI and QQP. To complement these results, we show here the test/train accuracy for all tasks.
![Refer to caption](x9.png)
C.4 GLUE Tasks with Full Precision
![Refer to caption](x10.png)
![Refer to caption](x11.png)
C.5 GLUE Tasks Test/Train Loss
![Refer to caption](x12.png)
![Refer to caption](x13.png)
C.6 GLUE Tasks with Different LoRA Ranks
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
C.7 Experiments with Init[1]
We also ran some experiments using Init[1] as the initialization scheme. We noticed that the optimal ratio in this case is generally smaller than the optimal ratio with Init[2]. Figure 17 shows the optimal learning rates obtained with Init[1] and Init[2].
![Refer to caption](x17.png)
![Refer to caption](x18.png)
C.8 Llama Flan-v2 MMLU Acc/Train Loss
![Refer to caption](x19.png)
![Refer to caption](extracted/5711638/figures/llama-7b_flan-v2_init_1.png)
C.9 Llama MNLI Test/Train Loss
![Refer to caption](x20.png)