LoRA+: Efficient Low Rank Adaptation of Large Models

Soufiane Hayou    Nikhil Ghosh    Bin Yu
Abstract

In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in [Hu et al., 2021] leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices $A$ and $B$ in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for $A$ and $B$ does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices $A$ and $B$ with a well-chosen fixed ratio. We call this proposed algorithm LoRA+. In our extensive experiments, LoRA+ improves performance (1%-2% improvements) and finetuning speed (up to $\sim 2$X speedup), at the same computational cost as LoRA.

Machine Learning, ICML

1 Introduction

State-of-the-art (SOTA) deep learning models all share a common characteristic: they have an extremely large number of parameters (tens if not hundreds of billions). Currently, only a few industry labs can pretrain large language models due to their high training cost. However, many pretrained models are accessible either through an API (GPT4, [OpenAI, 2023]) or through open-source platforms (Llama, [Touvron et al., 2023]). Most practitioners are interested in using such models for specific tasks and want to adapt these models to a new, generally smaller task. This procedure is known as finetuning, where one adjusts the weights of the pretrained model to improve performance on the new task. However, due to the size of SOTA models, adapting to downstream tasks with full finetuning (finetuning all model parameters) is computationally infeasible, as it requires modifying the weights of the pretrained models using gradient methods, which is a costly process. Besides, a model that has already learned generally useful representations during pretraining would not, in principle, require significant adaptation of all parameters. With this intuition, researchers have proposed a variety of resource-efficient finetuning methods which typically freeze the pretrained weights and tune only a small set of newly inserted parameters. Such methods include prompt tuning [Lester et al., 2021] where a “soft prompt” is learned and appended to the input, the adapters method [Houlsby et al., 2019] where lightweight “adapter” layers are inserted and trained, and $(IA)^3$ [Liu et al., 2022] where activation vectors are modified with learned scalings. Another resource-efficient method is known as Low Rank Adaptation [Hu et al., 2021], or simply LoRA. In LoRA finetuning, only a low rank matrix, called an adapter, that is added to the pretrained weights is trainable.
The training can be done with any optimizer and in practice a common choice is Adam [Kingma and Ba, 2014]. Since the trained adapter is low-rank, this effectively reduces the number of trainable parameters in the fine-tuning process, significantly decreasing the training cost. On many tasks such as instruction finetuning, LoRA has been shown to achieve comparable or better performance compared with full-finetuning [Wang et al., 2023, Liu et al., 2023], although on complicated, long form generation tasks, it is not always as performant. The impressive performance and the computational savings of LoRA have contributed to it becoming an industry standard finetuning method.

Efficient use of LoRA requires a careful choice of hyperparameters: the rank and the learning rate. While some theoretical guidelines on the choice of the rank in LoRA exist in the literature (see e.g. Zeng and Lee [2023]), there are no principled guidelines on how to set the learning rate, apart from common choices of order 1e-4.

Figure 1: The key difference between standard LoRA and LoRA+ is in how learning rates are set (the matrices $G_A$ and $G_B$ are ‘effective’ gradients from AdamW). With standard LoRA, the learning rate is the same for $A$ and $B$, which provably leads to suboptimal learning when embedding dimension is large. In LoRA+, we set the learning rate of $B$ to be $\lambda \times$ that of $A$, where $\lambda \gg 1$ is fixed. We later provide guidelines on how to set $\lambda$.
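
As a rough illustration of the learning-rate rule described in Figure 1, the sketch below builds two optimizer parameter groups, one for the $A$ matrices and one for the $B$ matrices, where the $B$ group uses $\lambda$ times the learning rate of the $A$ group. The function name `lora_plus_param_groups`, the substrings `lora_A` and `lora_B` used to recognize adapter parameters, and the numeric values in the usage lines are assumptions made for this example and do not come from the paper. The returned list of dictionaries mirrors the per-group layout commonly passed to optimizers, but the code uses only plain Python.

```python
def lora_plus_param_groups(named_params, lr_a, lr_ratio):
    """Split LoRA parameters into two groups with different learning rates.

    named_params: iterable of (name, parameter) pairs.
    lr_a: learning rate used for the A matrices.
    lr_ratio: the ratio lambda; B matrices are trained with lr_a * lr_ratio.
    Matching on "lora_A" / "lora_B" in the name is a naming assumption.
    """
    group_a, group_b = [], []
    for name, param in named_params:
        if "lora_A" in name:
            group_a.append(param)
        elif "lora_B" in name:
            group_b.append(param)
        # Anything else (e.g. frozen pretrained weights) is left out.
    return [
        {"params": group_a, "lr": lr_a},
        {"params": group_b, "lr": lr_a * lr_ratio},
    ]


# Toy usage with placeholder strings standing in for real tensors.
params = [
    ("layer0.lora_A.weight", "A0"),
    ("layer0.lora_B.weight", "B0"),
    ("layer0.base.weight", "W0"),
]
groups = lora_plus_param_groups(params, lr_a=1e-4, lr_ratio=16)
```

Setting `lr_ratio=1` recovers the standard LoRA setup with a single shared learning rate; the value 16 above is purely illustrative.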
Related Work.

Dettmers et al. [2023] introduced a quantized version of LoRA (or QLoRA), which further reduces computation costs by quantizing pretrained weights down to as few as four bits. Using QLoRA enables fine-tuning Llama-65b [Touvron et al., 2023] on a single consumer GPU while achieving competitive performance with full-finetuning. To further improve LoRA training with quantization, Li et al. [2023] introduced a new method called LoftQ for computing a better initialization for quantized training. Additional variations of LoRA have been proposed such as VeRA [Kopiczko et al., 2023], which freezes random weight-tied adapters and learns vector scalings of the internal adapter activations. This achieves a further reduction in the number of trainable parameters while achieving comparable performance to LoRA on several NLP finetuning tasks. However, to the best of our knowledge, there is no principled guidance for setting the LoRA learning rate, which is the focus of our work.

Contributions.

We provide guidelines for setting the learning rate through a theory of scaling for neural networks. There is a significant number of works on the scaling of neural networks from the infinite width/depth perspective. The approach is simple: take the width/depth of a neural network to infinity, understand how the limit depends on the choice of the hyperparameters in the training process such as the learning rate and initialization variance, then derive principled choices for these hyperparameters to achieve some desired goal (e.g. improve feature learning). [Footnote 1: Depending on the model, one might want to scale width with fixed depth and vice-versa, or both at the same time. See Section A.1 for more details.] Examples of the infinite-width limit include works on initialization schemes such as [He et al., 2016, Yang, 2019], or more holistically network parametrizations such as [Yang and Hu, 2021] where the authors introduced $\mu$P, a neural network parameterization ensuring feature learning in the infinite-width limit, offering precise scaling rules for architecture and learning rates to maximize feature learning. Examples for the depth limit include initialization strategies [Schoenholz et al., 2017a, He et al., 2023, Hayou et al., 2019], block scaling (see e.g. [Hayou et al., 2021, Hayou, 2023, Noci et al., 2023]), depth parametrizations [Yang et al., 2023, Bordelon et al., 2023] etc. Here we propose to use the same strategy to derive scaling rules for the learning rate in LoRA for finetuning. More precisely, we study the infinite-width limit of LoRA finetuning dynamics and show that the standard LoRA setup is suboptimal. We correct this by introducing a new method called LoRA+ that improves feature learning in low rank adaptation in this limit. The key innovation in LoRA+ is setting different learning rates for the $A$ and $B$ modules (LoRA modules) as explained in Figure 1. Our theory is validated with extensive empirical results across different language models and tasks.

2 Setup and Definitions

Our methodology in this paper is model agnostic and applies to general neural network models. Let us consider a neural network of the form

$$
\begin{cases}
Y_{in}(x) = W_{in} x, \\
Y_l(x) = \mathcal{F}_l(W_l, Y_{l-1}(x)), \; l \in [L], \\
Y_{out}(x) = W_{out} Y_L(x),
\end{cases} \tag{1}
$$

where $x \in \mathbb{R}^d$ is the input, $L \geq 1$ is the network depth, $(\mathcal{F}_l)_{l \in [L]}$ are mappings that define the layers, $W_l \in \mathbb{R}^{n \times n}$ are the hidden weights, where $n$ is the network width, and $W_{in}, W_{out}$ are input and output embedding weights.

Model (1) is pretrained on some dataset $\mathcal{D}$ to perform some specified task (e.g. next token prediction). Once the model is pretrained, one can finetune it to improve performance on some downstream task. To achieve this with relatively small devices (limited GPUs), resource-efficient finetuning methods like LoRA significantly reduce the computational cost by considering low rank weight matrices instead of full rank finetuning (or simply full finetuning).

Definition 1 (Low Rank Adapters (LoRA) from [Hu et al., 2021]).

For any weight matrix $W \in \mathbb{R}^{n_1 \times n_2}$ in the pretrained model, we constrain its update in the fine-tuning process by representing the latter with a low-rank decomposition $W = W^* + \frac{\alpha}{r} BA$. Here, only the weight matrices $B \in \mathbb{R}^{n_1 \times r}$, $A \in \mathbb{R}^{r \times n_2}$ are trainable. The rank $r \ll \min(n_1, n_2)$ and $\alpha \in \mathbb{R}$ are tunable constants.
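
To make Definition 1 concrete, here is a minimal sketch in plain Python that forms the adapted weight $W = W^* + \frac{\alpha}{r} BA$ and compares the number of trainable parameters in LoRA, $r(n_1 + n_2)$, with the $n_1 n_2$ parameters of full finetuning. The helper names (`matmul`, `lora_weight`, `trainable_counts`), the shapes, and the choice to start with $B = 0$ are assumptions made for illustration only.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W_star, B, A, alpha):
    """Return W = W* + (alpha / r) * B A, with r read off the shape of A."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W_star, BA)
    ]


def trainable_counts(n1, n2, r):
    """Trainable parameters: LoRA (B and A together) versus full finetuning."""
    return r * (n1 + n2), n1 * n2


rng = random.Random(0)
n1, n2, r, alpha = 6, 8, 2, 4.0
W_star = [[rng.gauss(0, 1) for _ in range(n2)] for _ in range(n1)]
A = [[rng.gauss(0, 1) for _ in range(n2)] for _ in range(r)]
B = [[0.0] * r for _ in range(n1)]  # B = 0, so the adapter contributes nothing yet

W = lora_weight(W_star, B, A, alpha)          # equals W_star while B is zero
lora_params, full_params = trainable_counts(4096, 4096, 8)
```

With $n_1 = n_2 = 4096$ and $r = 8$, the adapter has $8 \times 8192 = 65536$ trainable entries, against $4096^2 = 16777216$ for the full matrix.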

Scaling of Neural Networks.

It is well known that as the width $n$ grows, the network initialization scheme and the learning rate should be adapted to avoid numerical instabilities and ensure efficient learning. For instance, the variance of the initialization weights (in hidden layers) should scale as $1/n$ to prevent arbitrarily large pre-activations as we increase model width $n$ (e.g. He init [He et al., 2016]). To derive such scaling rules, a principled approach consists of analyzing statistical properties of key quantities in the model (e.g. pre-activations) as $n$ grows and then adjusting the initialization, the learning rate, and the architecture itself to achieve desirable properties in the limit $n \to \infty$ [Hayou et al., 2019, Schoenholz et al., 2017b, Yang, 2019, Yang and Littwin, 2023]. This approach is used in this paper to study feature learning dynamics with LoRA in the infinite-width limit. This will allow us to derive scaling rules for the learning rates of LoRA modules. For more details about the theory of scaling of neural networks, see Section A.1.
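
The role of the $1/n$ variance scaling can be checked with a small simulation. The sketch below draws a weight vector with i.i.d. Gaussian entries and measures the root-mean-square of a single pre-activation $a^\top x$. It assumes an all-ones input $x$, a single pre-activation, and widths 16 and 256; these choices and the function name `rms_preactivation` are illustrative assumptions, not part of the paper.

```python
import math
import random


def rms_preactivation(n, variance, trials=1000, seed=0):
    """Root-mean-square of a^T x over random draws of a, with x = (1, ..., 1).

    Entries of a are i.i.d. N(0, variance).
    """
    rng = random.Random(seed)
    std = math.sqrt(variance)
    total = 0.0
    for _ in range(trials):
        # With x = (1, ..., 1), a^T x is just the sum of the entries of a.
        pre = sum(rng.gauss(0.0, std) for _ in range(n))
        total += pre * pre
    return math.sqrt(total / trials)


for n in (16, 256):
    scaled = rms_preactivation(n, variance=1.0 / n)   # expected to stay of order 1
    unscaled = rms_preactivation(n, variance=1.0)     # expected to grow like sqrt(n)
    print(f"n={n:4d}  var=1/n: {scaled:.2f}   var=1: {unscaled:.2f}")
```

With variance $1/n$ the measured value should stay close to 1 at both widths, whereas with unit variance it should grow roughly like $\sqrt{n}$.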

Notation.

Hereafter, we use the following notation to describe the asymptotic behaviour as the width $n$ grows. Given sequences $c_n \in \mathbb{R}$ and $d_n \in \mathbb{R}^+$, we write $c_n = \mathcal{O}(d_n)$, resp. $c_n = \Omega(d_n)$, to refer to $c_n < \kappa d_n$, resp. $c_n > \kappa d_n$, for some constant $\kappa > 0$. We write $c_n = \Theta(d_n)$ if both $c_n = \mathcal{O}(d_n)$ and $c_n = \Omega(d_n)$ are satisfied. For vector sequences $c_n = (c_n^i)_{1 \leq i \leq k} \in \mathbb{R}^k$ (for some $k > 0$), we write $c_n = \mathcal{O}(d_n)$ when $c_n^i = \mathcal{O}(d_n^i)$ for all $i \in [k]$, and the same holds for other asymptotic notations. Finally, when the sequence $c_n$ is a vector of random variables, convergence is understood to be convergence in second moment ($L_2$ norm).

3 An Intuitive Analysis of LoRA

Our intuition is simple: the matrices $A$ and $B$ have “transposed” shapes and one would naturally ask whether the learning rate should be set differently for the two matrices. In practice, most SOTA models have large width (embedding dimension). Thus, it makes sense to study the training dynamics when the width goes to infinity.

3.1 LoRA with a Toy Model

Consider the following linear model

$$
f(x) = (W^* + b a^\top) x, \tag{2}
$$

where $W^* \in \mathbb{R}^{1 \times n}$ are the pretrained weights, $b \in \mathbb{R}, a \in \mathbb{R}^n$ are LoRA weights, and $x \in \mathbb{R}^n$ is the model input. [Footnote 2: Here, we consider $n_1 = 1$ to simplify the analysis. All the conclusions remain essentially valid when $n_2 = n_1 = n$.] This setup corresponds to $n_1 = 1, n_2 = n, r = 1$ in Definition 1. We assume that the weights $W^*$ are fixed (from pretraining). The goal is to minimize the loss $\mathcal{L}(\theta) = \frac{1}{2}(f(x) - y)^2$ where $\theta = (a, b)$ and $(x, y)$ is an input-output datapoint. [Footnote 3: For simplicity, we assume that the finetuning dataset consists of a single sample. Our analysis is readily generalizable to multiple samples.] We assume that $x = \Theta_n(1)$, which means that input coordinates remain of the same order as we increase width. In the following, we analyze the behaviour of the finetuning dynamics as model width $n$ grows.

Initialization.

We consider a Gaussian initialization of the weights as follows: $a_i \sim \mathcal{N}(0, \sigma_a^2)$, $b \sim \mathcal{N}(0, \sigma_b^2)$. [Footnote 4: The Gaussian distribution can be replaced by any other distribution with finite variance.] With LoRA, we generally want to initialize the product $b a^\top$ to be $0$ so that finetuning starts from the pretrained model. This implies at least one of the weights $a$ and $b$ is initialized to $0$. If both are initialized to $0$, it is trivial that no learning occurs in this case since this is a saddle point. Thus, we should initialize one of the parameters $a$ and $b$ to be non-zero and the other to be zero. If we choose a non-zero initialization for $a$, then following standard initialization schemes (e.g., He Init [He et al., 2016], LeCun Init [LeCun et al., 2002]), one should set $\sigma_a^2 = \Theta(n^{-1})$ to ensure $a^\top x$ does not explode with width. This is justified by the Central Limit Theorem (CLT). [Footnote 5: Technically, the CLT only ensures the almost sure convergence, the $L_2$ convergence follows from the Dominated Convergence Theorem. We omit these technical details in this paper.] On the other hand, if we choose a non-zero initialization for $b$, one should make sure that $\sigma_b^2 = \Theta(1)$. This leaves us with two possible schemes:

  • Init[1]: $\sigma_b^2 = 0, \sigma_a^2 = \Theta(n^{-1})$.

  • Init[2]: $\sigma_b^2 = \Theta(1), \sigma_a^2 = 0$.

Our analysis will only consider these two initialization schemes for LoRA modules, although the results should in principle hold for other schemes, provided that stability (as discussed above) is satisfied.
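
The two schemes can be sketched for the toy model as follows. The function names `init_lora` and `lora_output`, the width 64, the all-ones input, and the choice of setting the constants hidden in the $\Theta(\cdot)$ scalings to 1 are assumptions made for illustration. In both cases the adapter contribution $b\, a^\top x$ is zero at initialization, so finetuning starts from the pretrained model.

```python
import math
import random


def init_lora(n, scheme, seed=0):
    """Sample (a, b) for the toy model under Init[1] or Init[2].

    Init[1]: b = 0 and a_i ~ N(0, 1/n).
    Init[2]: b ~ N(0, 1) and a = 0.
    The constants hidden in the Theta(.) scalings are set to 1 here.
    """
    rng = random.Random(seed)
    if scheme == 1:
        a = [rng.gauss(0.0, math.sqrt(1.0 / n)) for _ in range(n)]
        b = 0.0
    elif scheme == 2:
        a = [0.0] * n
        b = rng.gauss(0.0, 1.0)
    else:
        raise ValueError("scheme must be 1 or 2")
    return a, b


def lora_output(a, b, x):
    """Adapter contribution b * a^T x of the toy model."""
    return b * sum(ai * xi for ai, xi in zip(a, x))


x = [1.0] * 64
for scheme in (1, 2):
    a, b = init_lora(64, scheme)
    print(scheme, lora_output(a, b, x))  # adapter contribution at initialization
```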

Learning rate.

WLOG, we can simplify the analysis by assuming that $W^* = 0$. This can be achieved by setting $\tilde{y} = y - W^* x$. The gradients are given by

$$
\frac{\partial \mathcal{L}}{\partial b} = a^\top x \, (f(x) - y), \qquad \frac{\partial \mathcal{L}}{\partial a} = b \, (f(x) - y) \, x.
$$
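
As a sanity check on these gradient expressions, the sketch below compares them with central finite differences on a small random instance of the toy model with $W^* = 0$. The dimension, the values of $b$ and $y$, the step size `eps`, and all function names are arbitrary choices made for this example.

```python
import random


def f(a, b, x):
    """Toy model with W* = 0: f(x) = b * a^T x."""
    return b * sum(ai * xi for ai, xi in zip(a, x))


def loss(a, b, x, y):
    """Squared loss L = 0.5 * (f(x) - y)^2."""
    return 0.5 * (f(a, b, x) - y) ** 2


def grads(a, b, x, y):
    """Analytic gradients (dL/db, dL/da) of the toy loss."""
    u = f(a, b, x) - y
    ax = sum(ai * xi for ai, xi in zip(a, x))
    return ax * u, [b * u * xi for xi in x]


rng = random.Random(0)
n = 5
a = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
b, y = 0.7, 1.3

grad_b, grad_a = grads(a, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
fd_b = (loss(a, b + eps, x, y) - loss(a, b - eps, x, y)) / (2 * eps)
fd_a = []
for i in range(n):
    a_plus = a[:i] + [a[i] + eps] + a[i + 1:]
    a_minus = a[:i] + [a[i] - eps] + a[i + 1:]
    fd_a.append((loss(a_plus, b, x, y) - loss(a_minus, b, x, y)) / (2 * eps))

# Largest discrepancy between analytic and numerical gradients.
max_err = max([abs(grad_b - fd_b)] + [abs(g - h) for g, h in zip(grad_a, fd_a)])
```

Because the loss is quadratic in each individual coordinate, the central difference is exact up to rounding, so `max_err` should be tiny.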

We use subscript $t$ to denote the finetuning step. Let $U_t = (f_t(x) - y)$. At step $t$ with learning rate $\eta > 0$, we have

$$
\begin{aligned}
\Delta f_t \overset{def}{=} f_t(x) - f_{t-1}(x) = &- \underbrace{\eta \, b_{t-1}^2 \, U_{t-1} \|x\|^2}_{\delta_t^1} - \underbrace{\eta \, (a_{t-1}^\top x)^2 \, U_{t-1}}_{\delta_t^2} \\
&+ \underbrace{\eta^2 \, U_{t-1}^2 \, b_{t-1} (a_{t-1}^\top x) \|x\|^2}_{\delta_t^3}.
\end{aligned}
$$
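
The decomposition of $\Delta f_t$ into the three terms can be verified numerically. The sketch below performs one gradient-descent step on the toy model (with $W^* = 0$) and checks that the change in output equals $-\delta_t^1 - \delta_t^2 + \delta_t^3$. The random instance, the learning rate, and the function names are assumptions chosen only for illustration.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gd_step(a, b, x, y, eta):
    """One gradient-descent step on the toy model f(x) = b * a^T x (W* = 0)."""
    u = b * dot(a, x) - y                      # U_{t-1}
    new_a = [ai - eta * b * u * xi for ai, xi in zip(a, x)]
    new_b = b - eta * dot(a, x) * u
    return new_a, new_b


def delta_terms(a, b, x, y, eta):
    """The three terms delta^1, delta^2, delta^3 evaluated at step t-1."""
    u = b * dot(a, x) - y
    ax, x_sq = dot(a, x), dot(x, x)
    d1 = eta * b ** 2 * u * x_sq               # from updating a with b fixed
    d2 = eta * ax ** 2 * u                     # from updating b with a fixed
    d3 = eta ** 2 * u ** 2 * b * ax * x_sq     # second-order cross term
    return d1, d2, d3


rng = random.Random(1)
n, eta, y = 8, 0.01, 1.0
a = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
b = 0.5

new_a, new_b = gd_step(a, b, x, y, eta)
delta_f = new_b * dot(new_a, x) - b * dot(a, x)
d1, d2, d3 = delta_terms(a, b, x, y, eta)
residual = delta_f - (-d1 - d2 + d3)           # zero up to rounding error
```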

The update in model output is driven by the three terms $(\delta_t^i)_{i \in \{1,2,3\}}$. The first two terms represent “linear” contributions to the update, i.e. change in model output driven by fixing $b$ and updating $a$ and vice-versa. These terms are order one in $\eta$. The third term $\delta_t^3$ represents a multiplicative update, compounding the updates in $a$ and $b$, and is an order two term in $\eta$. As $n$ grows, a desirable property is that $\Delta f_t = \Theta(1)$. Intuitively, this means that as we scale the width, feature updates do not ‘suffer’ from this scaling (see Section A.1 for more details). An example of a scenario where feature learning is affected by scaling is the lazy training regime [Jacot et al., 2018], where feature updates are of order $\Theta(n^{-1/2})$, which implies that no feature learning occurs in the limit $n \to \infty$. The condition $\Delta f_t = \Theta(1)$ also implies that the update does not explode with width, which is also a desirable property.

Having $\Delta f_t = \Theta(1)$ satisfied implies that at least one of the three terms $(\delta_t^i)_{i \in \{1,2,3\}}$ is $\Theta(1)$. Ideally, we want both $\delta_t^1$ and $\delta_t^2$ to be $\Theta(1)$ because otherwise it means that either $a$ or $b$ is not efficiently updated. For instance, if $\delta_t^1 = o(1)$, it means that as $n \to \infty$, the model acts as if $a$ is fixed and only $b$ is trained. Similar conclusions hold when $\delta_t^2 = o(1)$. Having both $\delta_t^1$ and $\delta_t^2$ being $\Theta(1)$ in width means that both $a$ and $b$ parameter updates significantly contribute to the change in $f_t(x)$, and we say that feature learning with LoRA is efficient when this is the case, i.e. $\delta_t^i = \Theta(1)$ for $i \in \{1, 2\}$ and all $t > 1$. We will formalize this definition of efficiency in the next section. The reader might wonder why we do not require that $\delta_t^3$ be $\Theta(1)$. We will see that when both $\delta_t^1$ and $\delta_t^2$ are $\Theta(1)$, the term $\delta_t^3$ is also $\Theta(1)$.

Efficiency Analysis.

Let us assume that we train the model with gradient descent with learning rate $\eta = \Theta(n^c)$ for some $c \in \mathbb{R}$, and suppose that we initialize the model with Init[1]. Since the training dynamics are mainly matrix vector products, sums of vectors/scalars, etc. (see [Yang et al., 2022]), it is easy to see that any quantity in the training dynamics should be of order $n^\gamma$ for some $\gamma \in \mathbb{R}$. [Footnote 6: A crucial assumption for this to hold is also to have that for any matrix/vector product in the training dynamics, the product dimension (the dimension along which the matrix/vector product is calculated) is $\Theta(n^\alpha)$ for some $\alpha > 0$. For instance, in the case of Transformers, this is satisfied since the MLP embedding dimension is generally $k \times n$. However, this condition would be violated if for instance one considers MLP embedding dimension $k n \log(n)$. Such non-standard scaling choices require a particular treatment, but the conclusions remain the same.] For any quantity $v$ in the training dynamics, we write $v = \Theta(n^{\gamma[v]})$. When $v$ is a vector, we use the same notation when all entries of $v$ are $\Theta(n^{\gamma[v]})$. The $\gamma$ notation is formally defined in Appendix A.
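
One informal way to read the $\gamma[\cdot]$ notation is as the slope of $\log |v|$ against $\log n$. The sketch below estimates that slope by least squares for two quantities whose exponents are known exactly. The function name `estimate_gamma`, the range of widths, and the constant 3 in the example learning rate are illustrative assumptions, not part of the paper's formal definition.

```python
import math


def estimate_gamma(widths, values):
    """Least-squares slope of log|v| against log n, an estimate of gamma[v]."""
    xs = [math.log(n) for n in widths]
    ys = [math.log(abs(v)) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


widths = [2 ** k for k in range(6, 13)]

# ||x||^2 with x = (1, ..., 1) equals n, so its exponent is 1.
gamma_x_sq = estimate_gamma(widths, [float(n) for n in widths])

# A learning rate eta = 3 * n^(-1/2) has exponent c = -1/2.
gamma_eta = estimate_gamma(widths, [3.0 / math.sqrt(n) for n in widths])
```

Constant prefactors such as the 3 above only shift the intercept of the fit, which is why they do not affect the estimated exponent.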

Starting from initialization, we have $f_0(x) = 0$. LoRA finetuning is efficient when $\delta_t^1 = \Theta(1)$ and $\delta_t^2 = \Theta(1)$ for all $t > 1$ (footnote 7), and $f_t(x) = \Theta(1)$ for $t > 1$.

Footnote 7: Here we use $t > 1$ instead of $t > 0$ because at $t \leq 1$, at least one of the terms $\delta_1^1$ or $\delta_1^2$ will be zero.

This translates to

$$
\begin{cases}
c + 2\gamma[b_{t-1}] + 1 = 0 & (\delta^1_t = \Theta(1)) \\
c + 2\gamma[a_{t-1}^\top x] = 0 & (\delta^2_t = \Theta(1)) \\
\gamma[b_{t-1}] + \gamma[a_{t-1}^\top x] = 0 & (f_{t-1}(x) = \Theta(1))
\end{cases}
$$

Solving this system yields $c = -1/2$, i.e. the learning rate should scale as $\eta = \Theta(n^{-1/2})$ in order to achieve efficient feature learning. At initialization, $b_0 = 0$ and $a_0^\top x = \Theta(1)$ (by the Central Limit Theorem). Through an inductive argument, for $t > 0$, $b_t$ will be of order $\Theta(n^{-1/2})$ and $a_t^\top x$ will be of order $\Theta(1)$, yielding $f_t(x) = \Theta(n^{-1/2})$. Indeed, at each iteration the update to $b_t$ will be of order $\Theta(\eta y a_{t-1}^\top x) = \Theta(n^{-1/2})$ and the updates to $a_t$ are of order $\Theta(\eta b_{t-1} y x) = \Theta(n^{-1})$. As $f_t = \Theta(n^{-1/2})$, this contradicts the requirement $f_t(x) = \Theta(1)$, so $\Theta(1)$ features cannot be learned.
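The contradiction can be made explicit by solving the system by substitution. The following is a worked version of that step, using only the three conditions and the orders of magnitude stated above.

$$
\gamma[b_{t-1}] = -\frac{c+1}{2}, \qquad \gamma[a_{t-1}^\top x] = -\frac{c}{2}
\quad\Longrightarrow\quad
-\frac{c+1}{2} - \frac{c}{2} = 0
\quad\Longrightarrow\quad
c = -\frac{1}{2},
$$

so the conditions would require $\gamma[b_{t-1}] = -1/4$ and $\gamma[a_{t-1}^\top x] = 1/4$. The dynamics started from Init[1] with $\eta = \Theta(n^{-1/2})$ instead give $\gamma[b_t] = -1/2$ and $\gamma[a_t^\top x] = 0$, hence $\gamma[b_t] + \gamma[a_t^\top x] = -1/2 \neq 0$.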

This shows that we cannot have both $\delta_t^1$ and $\delta_t^2$ be $\Theta(1)$ with this parametrization (this is also true with Init[2]). We formalize this result in the next proposition and refer the reader to Appendix A for further technical details.

Proposition 1 (Inefficiency of LoRA fine-tuning).

Assume that LoRA weights are initialized with Init[1] or Init[2] and trained with gradient descent with learning rate $\eta = \Theta(n^{c})$ for some $c \in \mathbb{R}$. Then, it is impossible to have $\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ for any $t > 0$, and therefore, fine-tuning with LoRA in this setup is inefficient.

In conclusion, efficiency cannot be achieved with this parametrization of the learning rate. This suggests that standard LoRA finetuning as currently used by practitioners is suboptimal, especially when model width is large, a property that is largely satisfied in practice ($n \approx 700$ for GPT2 and $n \approx 4000$ for Llama). This analysis suggests that we are missing crucial hyperparameters in the standard LoRA setup. Indeed, we show that by decoupling the learning rates for $a$ and $b$, we can have $\delta_t^i = \Theta(1)$ for $i \in \{1,2,3\}$. We write $\eta_a, \eta_b$ to denote the learning rates. The analysis conducted above remains essentially the same, the only difference being in the learning rates. Let $\eta_a = \Theta(n^{c_a})$ and $\eta_b = \Theta(n^{c_b})$, and assume that weights are initialized with Init[1]. A similar analysis to the one conducted above shows that having $f_t(x) = \Theta(1)$ and $\delta_t^i = \Theta(1)$ for $i \in \{1,2\}$ and $t > 0$ implies that for all $t > 1$

$$
\begin{cases}
c_a + 2\gamma[b_{t-1}] + 1 = 0 & (\delta^1_t = \Theta(1)) \\
c_b + 2\gamma[a_{t-1}^\top x] = 0 & (\delta^2_t = \Theta(1)) \\
\gamma[b_{t-1}] + \gamma[a_{t-1}^\top x] = 0 & (f_{t-1}(x) = \Theta(1))
\end{cases}
$$

which, after simple calculations, implies that $c_a + c_b = -1$. This is only a necessary condition. In the next result, taking also some elements of stability into consideration, we fully characterize the choice of $\eta_a$ and $\eta_b$ to ensure efficient LoRA fine-tuning.
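For completeness, the calculation behind this condition can be written out as follows. It uses only the three conditions of the system above.

$$
\gamma[b_{t-1}] = -\frac{c_a+1}{2}, \qquad \gamma[a_{t-1}^\top x] = -\frac{c_b}{2}
\quad\Longrightarrow\quad
-\frac{c_a+1}{2} - \frac{c_b}{2} = 0
\quad\Longrightarrow\quad
c_a + c_b = -1 .
$$

For instance, the pair $c_a = -1$, $c_b = 0$ satisfies this condition and gives $\gamma[b_{t-1}] = 0$ and $\gamma[a_{t-1}^\top x] = 0$, i.e. both $b$ and $a^\top x$ remain $\Theta(1)$.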

Proposition 2 (Efficient Fine-Tuning with LoRA).

In the case of model (2), with $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$, we have for all $t > 1$, $i \in \{1,2,3\}$, $\delta_t^i = \Theta(1)$.

We refer the reader to Appendix A for more details on the proof of Proposition 2. In conclusion, scaling the learning rates as $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$ ensures stability ($\Delta f_t = \Theta(1)$) and efficiency of LoRA finetuning ($\delta^i_t = \Theta(1)$ for $i \in \{1,2\}$ and $t > 1$) in the infinite-width limit. In practice, this means that the learning rate for $b$ should generally be much larger than that of $a$. This remains true even if $b \in \mathbb{R}^{r}$ for general $r$. We will later see that this scaling is valid for general neural network models.
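The two scalings can be compared numerically on the linear model. The sketch below is a minimal illustration under assumptions that are not spelled out in this section: it reads model (2) as $f(x) = b\,a^\top x$ with scalar $b$, squared loss, $\|x\|^2 = n$, target $y = 1$, and Init[1] with $a_0^\top x = 1$; it also takes $\delta^1_t = \eta_a b_{t-1}^2 |U_{t-1}| \|x\|^2$ and $\delta^2_t = \eta_b (a_{t-1}^\top x)^2 |U_{t-1}|$ with $U = f(x) - y$, which matches the exponents in the systems above. The function name `simulate_deltas` and the constants 0.5 are hypothetical.

```python
def simulate_deltas(n, eta_a, eta_b, steps=3, s0=1.0, y=1.0):
    """Track (delta1, delta2) for f(x) = b * (a^T x) with ||x||^2 = n.

    The dynamics only depend on the scalars b_t and s_t = a_t^T x, so we
    simulate those directly instead of the full vector a_t.
    """
    b, s = 0.0, s0  # Init[1]: b_0 = 0 and a_0^T x = Theta(1)
    deltas = []
    for _ in range(steps):
        u = b * s - y                          # residual f_{t-1}(x) - y
        delta1 = (eta_a * n) * b * b * abs(u)  # contribution of the update to a
        delta2 = eta_b * s * s * abs(u)        # contribution of the update to b
        deltas.append((delta1, delta2))
        # Simultaneous gradient step: the right-hand side uses the old b and s.
        b, s = b - eta_b * s * u, s - (eta_a * n) * b * u
    return deltas


for n in (100, 10_000, 1_000_000):
    # Index 1 is the second step, the first one where both terms can be non-zero.
    coupled = simulate_deltas(n, eta_a=n ** -0.5, eta_b=n ** -0.5)[1]
    decoupled = simulate_deltas(n, eta_a=0.5 / n, eta_b=0.5)[1]
    print(n, coupled, decoupled)
```

Under these assumptions, the decoupled run produces the same pair of values for every width, while both terms of the coupled run shrink as $n$ grows.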

Figure 2: (Top) Train/Test accuracy of the toy model in Equation 3, averaged over 3 random seeds. The orange dashed line represents the line $\eta_A = \eta_B$, and red dots represent all values of $(\eta_A, \eta_B)$ for which $d_{\min}(\eta_A, \eta_B) := \mathcal{L}_{(\eta_A, \eta_B)} / \mathcal{L}^{*} - 1 \leq 1\%$, where $\mathcal{L}^{*}$ is the best loss. (Bottom) Train/Test curves for two sets of learning rates: the optimal choice $(\eta_A^{*}, \eta_B^{*}) = (2.78, 1.29\mathrm{e}{-4})$ overall at $t = 200$ in terms of test loss (Blue) and the optimal choice when $\eta_A = \eta_B$, which is given by $(\eta_A, \eta_B) = (2.15\mathrm{e}{-2}, 2.15\mathrm{e}{-2})$ (Orange). All values are averaged over three runs and confidence intervals are shown (shaded).

3.2 Verifying the Results on a Toy Model

The previous analysis considers a simple linear model. To assess the validity of the scaling rules in a non-linear setting, we consider a neural network model given by

$$
f(x) = W_{out}\,\phi(BA\,\phi(W_{in}x)), \tag{3}
$$

where $W_{in} \in \mathbb{R}^{n \times d}$, $W_{out} \in \mathbb{R}^{1 \times n}$, $A \in \mathbb{R}^{r \times n}$, $B \in \mathbb{R}^{n \times r}$ are the weights, and $\phi$ is the ReLU function. The model is trained on a synthetic dataset generated with $X \sim \mathcal{N}(0, I_d)$, $Y = \sin(d^{-1}\sum_{i=1}^{d} X_i)$. See Appendix C for more details.

Only the weight matrices $A, B$ are trained ($W_{in}, W_{out}$ are fixed). We use $d = 5$, $n = 100$, $r = 4$, a train data size of $1000$ and a test data size of $100$ (footnote 8). The train/test loss for varying $\eta_A$ and $\eta_B$ is reported in Figure 2 at the early stages of training ($t = 10$) and after convergence (we observed convergence around $t \approx 200$ for reasonable choices of learning rates). The red '+' signs represent learning rates $(\eta_A, \eta_B)$ for which the loss is within $1\%$ of the best loss, and the dashed line represents the case where the learning rates are set equal. We observe that both the best train and test losses are consistently achieved by a combination of learning rates where $\eta_B \gg \eta_A$, which validates our analysis in the previous section. Notice also that optimal learning rates $(\eta_A, \eta_B)$ are generally close to the edge of stability, a well-known behaviour in training dynamics of deep networks [Cohen et al., 2021].

Footnote 8: See Appendix C for more details about the experimental setup.
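To make the setup concrete, the following is a minimal NumPy sketch of one way to train the toy model of Equation 3 with separate learning rates for $A$ and $B$. It is not the authors' code: the initialization scales, the use of plain full-batch gradient descent, the half mean-squared-error loss, and all function names are assumptions made here for illustration (the actual experimental details are in Appendix C).

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def make_data(num_samples, d, rng):
    """Synthetic data from the text: X ~ N(0, I_d), Y = sin(mean of the coordinates)."""
    X = rng.standard_normal((num_samples, d))
    return X, np.sin(X.mean(axis=1))


def loss_and_grads(X, y, W_in, W_out, A, B):
    """Half mean-squared error of Equation 3 and its gradients w.r.t. A and B."""
    H = relu(X @ W_in.T)          # (N, n) frozen first-layer features
    Z_A = H @ A.T                 # (N, r)
    Z = Z_A @ B.T                 # (N, n)
    pred = relu(Z) @ W_out[0]     # (N,)
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    # Backpropagate through the output layer and the outer ReLU.
    dZ = (err[:, None] / len(y)) * W_out * (Z > 0)   # (N, n)
    grad_B = dZ.T @ Z_A           # (n, r)
    grad_A = (dZ @ B).T @ H       # (r, n)
    return loss, grad_A, grad_B


def train(eta_A, eta_B, steps=10, d=5, n=100, r=4, num_train=1000, seed=0):
    rng = np.random.default_rng(seed)
    X, y = make_data(num_train, d, rng)
    # Assumed initialization scales (NOT taken from the paper's Appendix C).
    W_in = rng.standard_normal((n, d)) / np.sqrt(d)
    W_out = rng.standard_normal((1, n)) / np.sqrt(n)
    A = rng.standard_normal((r, n)) / np.sqrt(n)
    B = rng.standard_normal((n, r)) / np.sqrt(r)
    losses = []
    for _ in range(steps):
        loss, grad_A, grad_B = loss_and_grads(X, y, W_in, W_out, A, B)
        losses.append(loss)
        A -= eta_A * grad_A       # separate learning rate for A
        B -= eta_B * grad_B       # separate learning rate for B
    return losses
```

A grid over `eta_A` and `eta_B` calling `train` would give a rough analogue of the sweep reported in Figure 2, although the numbers will differ from the paper's because of the assumptions above.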

4 Stability and Feature Learning with LoRA in the Infinite Width Limit

In this section, we extend the analysis above to general neural architectures with LoRA layers. We show that the conclusions from the analysis on the linear model hold for general neural architectures: 1) using the same learning rate for both $A$ and $B$ leads to suboptimal feature learning when model width is large, and 2) this problem can be fixed by setting different learning rates for $A$ and $B$.

Since our aim in this paper is primarily methodological, the theoretical results in this section are of a physics level of rigor, omitting technical assumptions that would otherwise make the analysis rigorous but unnecessarily complicated. In all the results, LoRA rank $r$ is considered fixed and finetuning dynamics are analyzed in the limit of infinite width. This setup fairly represents practical scenarios where $r \ll n$ and $r$ is generally small.

Notation.

The LoRA weights are initialized with $A_{ij} \sim \mathcal{N}(0, \sigma_A^2)$, $B_{ij} \sim \mathcal{N}(0, \sigma_B^2)$ for some $\sigma_A, \sigma_B \geq 0$ (footnote 9). Here also, we assume that either $\sigma_B^2 = 0$ and $\sigma_A^2 = \Theta(n^{-1})$ (Init[1]), or $\sigma_B^2 = \Theta(1)$ and $\sigma_A^2 = 0$ (Init[2]). Given a LoRA layer in the model, $\underline{Z}$ denotes the input to that layer and $\bar{Z}$ the output after adding the pretrained weights. More precisely, we write $\bar{Z} = W^{*}\underline{Z} + \frac{\alpha}{r} BA\,\underline{Z}$.

Footnote 9: In [Hu et al., 2021], $B$ is initialized to $0$, which corresponds to setting $\sigma_B = 0$.

Our main analysis relies on a careful estimation of the magnitude of several quantities including LoRA features. Let us first give a formal definition.

Definition 2 (LoRA Features).

Given a general neural architecture and a LoRA layer (1), we define LoRA features $(Z_A, Z_B)$ as $Z_A = A\underline{Z}$ and $Z_B = BZ_A = BA\underline{Z}$. At fine-tuning step $t$, we use the superscript $t$ to denote the value of LoRA features $Z_A^t, Z_B^t$, and the subscript $t$ to denote the weights $A_t, B_t$.
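The notation can be illustrated with a small NumPy sketch of a single LoRA layer acting on one input vector. The function names are hypothetical, and the specific choice $\sigma_A^2 = 1/n$ is just one instance of the Init[1] scaling $\sigma_A^2 = \Theta(n^{-1})$.

```python
import numpy as np


def init_lora(n, r, rng):
    """Init[1]: A_ij ~ N(0, 1/n) (one choice with sigma_A^2 = Theta(1/n)) and B = 0."""
    A = rng.standard_normal((r, n)) / np.sqrt(n)
    B = np.zeros((n, r))
    return A, B


def lora_forward(W_star, A, B, z_in, alpha):
    """Return the LoRA features (Z_A, Z_B) and the layer output Z_bar for one input."""
    r = A.shape[0]
    z_a = A @ z_in                             # Z_A = A Z, shape (r,)
    z_b = B @ z_a                              # Z_B = B A Z, shape (n,)
    z_bar = W_star @ z_in + (alpha / r) * z_b  # pretrained path plus scaled adapter path
    return z_a, z_b, z_bar


rng = np.random.default_rng(0)
n, r = 512, 8
W_star = rng.standard_normal((n, n)) / np.sqrt(n)  # stand-in for pretrained weights
A, B = init_lora(n, r, rng)
z_in = rng.standard_normal(n)
z_a, z_b, z_bar = lora_forward(W_star, A, B, z_in, alpha=16.0)
```

With Init[1], $Z_B$ is zero at initialization, so the layer output coincides with the pretrained output $W^{*}\underline{Z}$ before any finetuning step.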

LoRA layers are 2-layer linear networks with a "bottleneck" in the middle (since generally $r \ll n$). This bottleneck shape might induce some numerical challenges in training stability and efficiency (Definitions 3 and 5).

Finetuning Dataset.

To simplify the analysis, we assume that the finetuning dataset comprises a single sample $(x, y)$ (footnote 10), and the goal is to minimize the loss $\mathcal{L}(\bm{\theta}, (x, y))$ computed with the underlying model where the adjusted weights are given by $W^{*} + BA$ for all LoRA layers (here $\bm{\theta} = \{A, B, \text{ for all LoRA layers in the model}\}$). At training step $t$, and for any LoRA layer in the model, $\underline{Z}^t$ is the input to the LoRA layer, computed with data input $x$. Similarly, we write $d\bar{Z}^t$ to denote the gradient of the loss function with respect to the layer output features $\bar{Z}$ evaluated at data point $(x, y)$.

Footnote 10: This assumption on the finetuning dataset is for simplification purposes only. All our analysis can be re-written with 'batched' gradients and the conclusions remain the same. However, some additional assumptions are required to make the analysis rigorous.

The notion of stability of LoRA as discussed in Section 3 can be generalized to any neural network model as follows.

Definition 3 (Stability).

We say that LoRA finetuning is stable if for all LoRA layers in the model, and all training steps $t$, we have $\underline{Z}, Z_A, Z_B = \mathcal{O}(1)$ as $n$ goes to infinity.

Stability implies that no quantity in the network explodes as width grows, a desirable property as we scale the model (footnote 11). Naturally, in order to ensure stability, one has to scale hyperparameters (initialization, learning rate) as $n$ grows. Scaling rules for initialization are fairly easy to infer and were already discussed in Section 3, where we obtained two plausible initialization schemes (Init[1] and Init[2]). More importantly, if we arbitrarily scale the learning rate with width, we might end up with suboptimal learning as width grows even if the finetuning is stable. This is the case for instance when we aggressively downscale the learning rate with width, or inadequately parameterize the network (e.g. Neural Tangent Kernel parametrization which leads to the kernel regime in the infinite width limit, [Jacot et al., 2018]). To take this into account, we define a notion of feature learning with LoRA.

Footnote 11: It is possible to define stability as $\underline{Z}, Z_B = \mathcal{O}(1)$ and exclude $Z_A$ from the condition. This would allow scenarios where for instance the entries of $A$ explode with width but their magnitude is compensated with a smaller magnitude of $B$. This system has one degree of freedom because of the homogeneity of the product $BA$, and by imposing that $Z_A = \mathcal{O}(1)$, we avoid having such scenarios.

Definition 4 (Stable Feature Learning with LoRA).

We say that LoRA finetuning induces stable feature learning if it is stable (Definition 3), and for all LoRA layers and finetuning step $t$, we have $\Delta Z_B^t \overset{def}{=} Z_B^{t+1} - Z_B^t = \Theta(1)$.

A similar definition of feature learning was introduced in [Yang and Littwin, 2023] for pretraining. This definition ensures that the network is not 'stuck' in a kernel regime where feature updates are of order $\mathcal{O}(n^{-\epsilon})$ in the infinite-width limit for some $\epsilon > 0$, which implies that no feature learning occurs in the limit. The authors introduced the $\mu$-parameterization (or maximal update parametrization), a specific network parameterization (initialization + learning rate scaling), that ensures that feature updates are $\Theta(1)$. Note that here we added stability in the definition, but in principle, one could define feature learning with $\Omega$ instead of $\Theta$. The latter covers unstable scenarios (e.g. when $\Delta Z_B^t = \Theta(n)$ due to improper scaling of initialization and learning rate), so we omit it here and focus on stable feature learning. Also, notice that we only consider finetuning dynamics and not the pretraining dynamics. However, since our analysis depends on weights $W^{*}$ from pretraining, we assume that pretraining parameterization ensures stability and feature learning as width grows (see Appendix A for more details) (footnote 12).

Footnote 12: When taking the infinite width limit, we assume that pretraining parameterization is $\mu$P. This is just a technicality for the infinite-width limit and does not have any implications on practical scenarios where the width is finite. The most important implication of this assumption is that in the pretrained network (before introducing LoRA layers), we have $\underline{Z} = \Theta(1)$, $\bar{Z} = \Theta(1)$, which holds for a general input-output pair $(x, y)$.

At finetuning step $t$, the gradients are given by

$$
\begin{aligned}
\frac{\partial \mathcal{L}_t}{\partial B} &= \frac{\alpha}{r}\, d\bar{Z}^{t-1} \otimes A_{t-1}\underline{Z}^{t-1} \\
\frac{\partial \mathcal{L}_t}{\partial A} &= dZ_A^{t-1} \otimes \underline{Z}^{t-1} = \frac{\alpha}{r}\, B_{t-1}^{\top} d\bar{Z}^{t-1} \otimes \underline{Z}^{t-1},
\end{aligned}
$$

where $u \otimes v$ denotes the outer product $uv^{\top}$ of vectors $u$, $v$, and the weights are updated as follows

$$A_t = A_{t-1} - \eta_A\, g_A^{t-1}, \quad B_t = B_{t-1} - \eta_B\, g_B^{t-1},$$

where $g_A, g_B$ are processed gradients (e.g. normalized gradients with momentum as in AdamW). Hereafter, we assume that the gradients are processed in a way that makes their entries $\Theta(1)$. This is generally satisfied in practice (with Adam for instance) and has been considered in [Yang and Littwin, 2023] to derive the $\mu$-parametrization for general gradient processing functions.
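As a concrete reading of the gradient formulas and the update rule above, the following Python sketch computes the two outer-product gradients for a single sample and then applies one processed-gradient step. The helper names (`lora_grads`, `adam_first_step`, `apply_step`), the toy sizes $n=3$, $r=2$, and all numeric values are illustrative assumptions, not quantities from the experiments. The processing function shown is the first step of Adam with zero-initialized moments and bias correction, for which the processed gradient reduces to $g/(|g|+\epsilon)$, so its entries have magnitude close to 1 regardless of the scale of the raw gradient.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def outer(u, v):
    """Outer product u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def lora_grads(A, B, z_in, d_out, alpha, r):
    """Single-sample LoRA gradients, following the two formulas above.
    A is r x n, B is n x r, z_in is the layer input, d_out is dZ-bar."""
    s = alpha / r
    z_a = matvec(A, z_in)                                         # Z_A = A Z, length r
    grad_B = [[s * g for g in row] for row in outer(d_out, z_a)]  # n x r
    B_T = [list(col) for col in zip(*B)]                          # r x n
    d_z_a = [s * v for v in matvec(B_T, d_out)]                   # dZ_A, length r
    grad_A = outer(d_z_a, z_in)                                   # r x n
    return grad_A, grad_B

def adam_first_step(grad, eps=1e-8):
    """Processed gradient at Adam's first step: g / (|g| + eps)."""
    return [[g / (abs(g) + eps) for g in row] for row in grad]

def apply_step(W, g, lr):
    """W <- W - lr * g, entrywise."""
    return [[w - lr * gij for w, gij in zip(wr, gr)] for wr, gr in zip(W, g)]

# Toy values (hypothetical): note the small scale of the backward signal.
A = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
B = [[0.2, 0.0], [-0.1, 0.3], [0.4, -0.2]]
z_in = [1.0, -2.0, 0.5]
d_out = [1e-3, -2e-3, 5e-4]

grad_A, grad_B = lora_grads(A, B, z_in, d_out, alpha=2.0, r=2)
g_A, g_B = adam_first_step(grad_A), adam_first_step(grad_B)

eta_A, eta_B = 1e-4, 1.6e-3   # illustrative learning rates only
A_new = apply_step(A, g_A, eta_A)
B_new = apply_step(B, g_B, eta_B)
```

Although the raw gradients here are of order $10^{-4}$, every processed entry has magnitude close to 1, which is the property assumed in the analysis that follows.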

Unlike the linear model in Section 3, LoRA feature updates are driven not only by the change in the weights $A, B$, but also by $\underline{Z}$ and $d\bar{Z}$, which are updated as we finetune the model (assuming there are multiple LoRA layers). To isolate the contribution of individual LoRA layers to feature learning, we assume that only a single LoRA layer is trainable and all other LoRA layers are frozen. (Footnote 13: This is equivalent to having only a single LoRA layer in the model since LoRA layers are initialized to zero. In this way, we can quantify feature learning induced by the LoRA layer as we finetune the model.) In this setting, considering the only trainable LoRA layer in the model, the layer input $\underline{Z}$ is fixed and does not change with $t$, while $d\bar{Z}$ changes with step $t$ (because $\bar{Z}^t = (W^* + \frac{\alpha}{r} B_t A_t)\underline{Z}$). After step $t$, $Z_B$ is updated as follows

$$\Delta Z_B^t = \underbrace{B_{t-1}\,\Delta Z_A^t}_{\delta_t^1} + \underbrace{\Delta B_t\, Z_A^{t-1}}_{\delta_t^2} + \underbrace{\Delta B_t\,\Delta Z_A^t}_{\delta_t^3}$$

As discussed in Section 3, the terms $\delta_t^1, \delta_t^2$ represent the 'linear' feature updates that we obtain if we fix one weight matrix and only train the other, while $\delta_t^3$ represents the 'multiplicative' feature update, which captures the compounded update due to updating both $A$ and $B$.
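The three-term decomposition of the feature update can be checked numerically. The sketch below is a minimal illustration with made-up $3 \times 2$ and $2 \times 3$ matrices (all values and helper names are assumptions chosen for the example); it uses $Z_A = A\underline{Z}$ and $Z_B = B Z_A$ and confirms that the two 'linear' terms plus the 'multiplicative' term reproduce the exact change in $Z_B$.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def mat_sub(M, N):
    """Entrywise difference M - N of two matrices."""
    return [[a - b for a, b in zip(mr, nr)] for mr, nr in zip(M, N)]

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical weights before and after one update step (r = 2, n = 3).
A_prev = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
A_next = [[0.11, -0.19, 0.29], [-0.01, 0.51, -0.11]]
B_prev = [[0.2, 0.0], [-0.1, 0.3], [0.4, -0.2]]
B_next = [[0.25, -0.05], [-0.15, 0.35], [0.45, -0.15]]
z_in = [1.0, -2.0, 0.5]            # fixed layer input

z_a_prev = matvec(A_prev, z_in)    # Z_A^{t-1}
z_a_next = matvec(A_next, z_in)    # Z_A^{t}
d_z_a = vec_sub(z_a_next, z_a_prev)
d_B = mat_sub(B_next, B_prev)

delta1 = matvec(B_prev, d_z_a)     # B_{t-1} * dZ_A^t   (only A moved)
delta2 = matvec(d_B, z_a_prev)     # dB_t * Z_A^{t-1}   (only B moved)
delta3 = matvec(d_B, d_z_a)        # dB_t * dZ_A^t      (both moved)

exact = vec_sub(matvec(B_next, z_a_next), matvec(B_prev, z_a_prev))
recon = [a + b + c for a, b, c in zip(delta1, delta2, delta3)]
max_err = max(abs(e - s) for e, s in zip(exact, recon))
```

The value of `max_err` is at the level of floating-point rounding, since the decomposition is an algebraic identity rather than an approximation.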

Analysis of the Role of $A$ and $B$.

As discussed above, we want to ensure that $\delta_t^1 = \Theta(1)$ and $\delta_t^2 = \Theta(1)$, which means that both weight matrices contribute to the update in $Z_B$. To further explain why this is a desirable property, let us analyze how changes in matrices $A$ and $B$ affect the LoRA feature $Z_B = BA\,\underline{Z}$.

Let $(B_{:,i})_{1 \leq i \leq r}$ denote the columns of $B$. We can express $Z_B$ as $Z_B = \sum_{i=1}^{r} (A\,\underline{Z})_i B_{:,i}$, where $(A\underline{Z})_i$ is the $i^{th}$ coordinate of $A\underline{Z}$. This decomposition shows that $Z_B$ is a weighted sum of the columns of $B$, and that $A$ modulates the weights. With this, we can also write

$$
\begin{cases}
\delta_t^1 = \sum_{i=1}^{r} (\Delta A_t\, \underline{Z})_i\, (B_{:,i})_{t-1} \\
\delta_t^2 = \sum_{i=1}^{r} (A_{t-1}\underline{Z})_i\, (\Delta B_{:,i})_{t-1},
\end{cases}
$$

where $(B_{:,i})_t$ refers to the columns of $B$ at time step $t$. Having both $\delta_t^1$ and $\delta_t^2$ of order $\Theta(1)$ means that both $A$ and $B$ are 'sufficiently' updated to induce a change in the weights $(A\underline{Z})_i$ and the directions $B_{:,i}$. If one of the matrices $A, B$ is not efficiently updated, we might end up with suboptimal finetuning, leading to either non-updated directions $B$ or non-updated direction weights $(A_{t-1}\underline{Z})$. For instance, assuming that the model is initialized with Init[2], and that $B$ is not efficiently updated, the direction of $Z_B$ will be mostly determined by the vector (sub)space of dimension $r$ generated by the columns of $B$ at initialization. This analysis leads to the following definition of efficient learning with LoRA.

Figure 3: Test accuracy of Roberta-base finetuning for 3 epochs on MNLI, QQP, QNLI, and 10 epochs on SST2, with sequence length $T=128$ and half precision (FP16). LoRA hyperparameters are set to $\alpha = r = 8$. All values are averaged over 3 random seeds (we do not show confidence intervals for better visualization, but fluctuations are of order $0.1\%$; see Figure 7 for instance). For better visualization, when accuracy is lower than a fixed threshold, we set it to the threshold. Values shown in red are: 1) the best accuracy (overall) and 2) the accuracy for a set of learning rates where $\eta_B$ and $\eta_A$ are close in order of magnitude ($\eta_B/\eta_A \in [1, 1.25]$).
Definition 5 (Efficient Learning).

We say that LoRA fine-tuning is efficient if it is stable (3), and for all LoRA layers in the model, all steps $t > 1$, and $i \in \{1, 2\}$, we have $\delta_t^i = \Theta(1)$.

Note that it is possible to achieve stable feature learning (4) without necessarily having efficient learning. This is the case, for instance, when $B$ is not updated (fixed to a non-zero init with Init[2]) and only $A$ is updated, which corresponds to simply setting $\eta_B = 0$. This is a trivial case, but non-trivial cases of inefficiency are common in practice, such as the standard practice of using the same learning rate for $A$ and $B$. In the next theorem, we characterize the optimal scaling of the learning rates $\eta_A$ and $\eta_B$, a conclusion similar to that of Section 3.

Theorem 1 (Efficient LoRA (Informal)).

Assume that weight matrices $A$ and $B$ are trained with Adam with respective learning rates $\eta_A$ and $\eta_B$. Then, it is impossible to achieve efficiency with $\eta_A = \eta_B$. However, LoRA finetuning is efficient with $\eta_A = \Theta(n^{-1})$ and $\eta_B = \Theta(1)$.

The result of Theorem 1 suggests that efficiency can only be achieved with $\eta_B/\eta_A = \Theta(n)$. In practice, this translates to setting $\eta_B \gg \eta_A$, but it does not provide a precise ratio $\eta_B/\eta_A$ to fix while tuning the learning rate (the constant in '$\Theta$' is generally intractable). One could tune both $\eta_B$ and $\eta_A$, but this is computationally inefficient as it becomes a 2D tuning problem. It is therefore natural to set a fixed ratio $\eta_B/\eta_A$ and tune only $\eta_A$ (or $\eta_B$), which effectively reduces the tuning process to a 1D grid search, at the same computational cost as standard LoRA where the learning rate is the same for $A$ and $B$. We call this method LoRA+.

LoRA+: set the learning rates for $A, B$ such that $\eta_B = \lambda \eta_A$ with $\lambda > 1$ fixed and tune $\eta_A$.
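In code, this rule amounts to placing the $A$ and $B$ matrices in separate optimizer parameter groups. The sketch below is a minimal, framework-free illustration: the function name `loraplus_param_groups`, the assumption that parameter names contain the substrings `lora_A` and `lora_B`, and the choice to train any remaining parameters at $\eta_A$ are all assumptions made for the example, not a description of the released code. The default ratio $2^4$ is only one of the values discussed in Section 5.3.

```python
def loraplus_param_groups(named_params, lr_A, ratio=16.0):
    """Build per-parameter-group settings with lr_B = ratio * lr_A.

    named_params: iterable of (name, parameter) pairs.
    Assumption: LoRA matrices are identified by 'lora_A' / 'lora_B' in the name.
    Setting ratio = 1 recovers standard LoRA (same learning rate for A and B).
    """
    if lr_A <= 0 or ratio <= 0:
        raise ValueError("lr_A and ratio must be positive")
    params_A, params_B, others = [], [], []
    for name, param in named_params:
        if "lora_B" in name:
            params_B.append(param)
        elif "lora_A" in name:
            params_A.append(param)
        else:
            others.append(param)
    groups = [
        {"params": params_A, "lr": lr_A},
        {"params": params_B, "lr": ratio * lr_A},
    ]
    if others:
        # Assumption: any other trainable parameter uses the base rate lr_A.
        groups.append({"params": others, "lr": lr_A})
    return groups

# Placeholder strings stand in for real parameter tensors.
named_params = [
    ("layer0.attn.q.lora_A.weight", "A0"),
    ("layer0.attn.q.lora_B.weight", "B0"),
    ("layer1.attn.q.lora_A.weight", "A1"),
    ("layer1.attn.q.lora_B.weight", "B1"),
]
groups = loraplus_param_groups(named_params, lr_A=1e-4, ratio=2 ** 4)
```

The returned list has the shape of per-parameter options accepted by PyTorch optimizers, so with real tensors it could be passed to an Adam-type optimizer directly. Only `lr_A` then remains to be tuned, which keeps the search one-dimensional.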

In the next section, through extensive empirical evaluations, we first validate our theoretical result and show that optimal pairs $(\eta_A, \eta_B)$ (in terms of test accuracy) generally satisfy $\eta_B \gg \eta_A$. We then investigate the optimal ratio $\lambda$ for LoRA+ and suggest a default ratio that was empirically found to generally improve performance compared to standard LoRA. Although the conclusions of Theorem 1 and of the linear-model result (2) in Section 3 are similar, the proof techniques are different. In the latter, the linear model is trained with gradient descent, while in Theorem 1, the training algorithm is Adam-type in the sense that it normalizes the gradients before updating the weights. The formal statement of Theorem 1 requires an additional assumption on the alignment of the processed gradients $g_A$ with the LoRA input $\underline{Z}$. This technical detail is introduced and discussed in Appendix A.
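To give some intuition for why $\eta_A$ should shrink with the width $n$ under Adam-type updates, the following sketch looks at a single row of $\Delta A$ in an idealized case. It assumes that the input $\underline{Z}$ has entries $\pm 1$ and that the processed gradient row is perfectly aligned with it (entries equal to the sign of $\underline{Z}$); this extreme alignment, the function name, and the chosen widths are assumptions made for illustration only, and the sketch is not a substitute for the formal argument in Appendix A.

```python
def aligned_feature_change(n, lr_A):
    """|(Delta A Z)_i| for one row of A, with a processed gradient row
    g = sign(Z) and an input Z whose entries are +1 or -1."""
    z = [1.0 if j % 2 == 0 else -1.0 for j in range(n)]
    g_row = [1.0 if zj > 0 else -1.0 for zj in z]     # Theta(1) entries, aligned with Z
    delta_row = [-lr_A * g for g in g_row]            # one row of Delta A = -lr_A * g
    return abs(sum(d * zj for d, zj in zip(delta_row, z)))

widths = (256, 1024, 4096)
# Width-independent learning rate: the change in Z_A grows linearly with n.
fixed_lr = [aligned_feature_change(n, 1e-3) for n in widths]
# Learning rate scaled as 1/n: the change in Z_A stays of order 1.
scaled_lr = [aligned_feature_change(n, 1.0 / n) for n in widths]
```

In this idealized setting `fixed_lr` grows in proportion to $n$, while `scaled_lr` is constant across widths. A single coordinate of $\Delta B_t Z_A^{t-1}$, by contrast, sums over only $r$ terms, so its size does not depend on $n$ and $\eta_B$ can remain $\Theta(1)$.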

5 Experiments with Language Models

We report our empirical results using LoRA to finetune a set of language models on different benchmarks. Details about the experimental setup and more empirical results are provided in Appendix C. We also identify a default value for the ratio $\lambda = \eta_B/\eta_A$ that generally improves performance as compared to standard LoRA. The code for our experiments is available at https://github.com/nikhil-ghosh-berkeley/loraplus.

5.1 GLUE tasks with GPT-2 and RoBERTa

The GLUE benchmark (General Language Understanding Evaluation) consists of several language tasks that evaluate the understanding capabilities of language models [Wang et al., 2018]. Using LoRA, we finetune Roberta-base from the RoBERTa family [Liu et al., 2019] and GPT-2 [Radford et al., 2019] on the MNLI, QQP, SST2, and QNLI tasks (other tasks are smaller and generally require an already finetuned model, e.g. on MNLI, as the starting checkpoint) with varying learning rates $(\eta_A, \eta_B)$ to identify the optimal combination. Empirical details are provided in Appendix C.

Roberta-base.

Figure 3 shows the results of Roberta-base finetuning with $\alpha = r = 8$, trained with half precision (FP16). We observe that test accuracy is consistently maximal for some set of learning rates satisfying $\eta_B \gg \eta_A$, outperforming the standard practice where $\eta_A$ and $\eta_B$ are usually set equal. Interestingly, the gap between the optimal choice of learning rates overall and the optimal choice when $\eta_A \approx \eta_B$ is more pronounced for 'harder' tasks like MNLI and QQP, as compared to SST2 and QNLI. This is probably due to the fact that harder tasks require more efficient feature learning. It is also worth mentioning that in our experiments, given limited computational resources, we use sequence length $T = 128$ and finetune for only 3 epochs for MNLI and QQP, so it is expected that we obtain test accuracies lower than those reported in [Hu et al., 2021], where the authors finetune Roberta-base with sequence length $T = 512$ (for MNLI) and more epochs (30 for MNLI). In Appendix C, we provide additional results with Test/Train accuracy/loss.

GPT-2.

Figure 4 shows the results of finetuning GPT-2 with LoRA on MNLI and QQP (other tasks and full precision training are provided in Appendix C). Similar to the conclusions from Roberta-base, we observe that maximal test accuracies are achieved with some $(\eta_A, \eta_B)$ satisfying $\eta_B \gg \eta_A$. Here also, we observe that the harder the task, the larger the gap between model performance when $\eta_B \gg \eta_A$ and when $\eta_A \approx \eta_B$.

Figure 4: Test accuracy of GPT-2 after finetuning for 3 epochs on MNLI, QQP, with FP16 precision. LoRA hyperparameters are set to $\alpha = r = 8$. Both train and test accuracy are consistently maximal for some choice of learning rates where $\eta_B \gg \eta_A$. See Appendix C for more numerical results with GPT-2.

5.2 Llama

To further validate our theoretical findings, we finetune the Llama-7b model [Touvron et al., 2023] on the MNLI dataset and flan-v2 dataset [Longpre et al., 2023] using LoRA. Each trial is averaged over two seeds.

Flan-v2.

We examine LoRA training of Llama on the instruction finetuning dataset flan-v2 [Longpre et al., 2023]. To make the experiments computationally feasible, we train for one epoch on a subset of size 100,000 of the flan-v2 dataset. We record the test accuracy of the best checkpoint every 500 steps. The LoRA hyperparameters are set to $\alpha = 16$ and $r = 64$. The adapters are added to every linear layer (excluding embedding layers) and we use a constant learning rate schedule. The full training details are in Appendix C.

Figure 5: Left: MMLU accuracy of Llama-7b trained for one epoch on a 100k subset of flan-v2. Right: Test accuracy of the best checkpoint of Llama-7b trained on MNLI for one epoch. Values are averaged over two seeds.

We evaluate the final model on the MMLU benchmark [Hendrycks et al., 2020]. The results in Figure 5 show that for this benchmark taking $\eta_B \gg \eta_A$ is advantageous and results in a roughly 1.3% gain compared with the optimal $\eta_B = \eta_A$. In Appendix C we show that the same effect also holds when using Init[1].

MNLI.

The right panel of Figure 5 shows the results of finetuning Llama-7b with LoRA on MNLI, with $\alpha = 16$, $r = 8$. We train using half precision and a constant learning rate schedule, with sequence length $T = 128$. Since MNLI is relatively easy for Llama, we finetune for only one epoch, which is sufficient for the model to reach its peak test accuracy. In Figure 5, $\eta_B = \eta_A$ is nearly optimal among all choices with $\eta_B \geq \eta_A$. This is consistent with the intuition that efficient feature learning is not required for easy tasks and that having $\eta_B/\eta_A \gg 1$ does not significantly enhance performance. Additionally, the magnitude of stable learning rates for Llama is much smaller than for GPT-2 and RoBERTa on MNLI, further supporting that Llama requires less adaptation. Analogous plots for the train and test loss are shown in Figure 19 in Appendix C.

5.3 How to set LoRA+ Ratio?

Naturally, the optimal ratio $\lambda$ depends on the architecture and the finetuning task via the constants in '$\Theta$' (Theorem 1). This is a limitation of these asymptotic results since they do not offer any insight into how the constants are affected by the task and the neural architecture.

Figure 6: Distribution of the ratio $\eta_B/\eta_A$ for the top 4 learning rates for each pair (model, task). The 4 learning rates are selected using the test loss at the end of finetuning (i.e. top 4 learning rates $(\eta_B, \eta_A)$ in terms of test loss). The distribution shows the interquartile range ($25\%$-$75\%$ quantiles) and the median.

Figure 6 shows the distribution of the ratio $\eta_B/\eta_A$ for the top 4 runs in terms of test accuracy for different pairs of (model, task). This is the same experimental setup as in Figure 3 and Figure 4. The optimal ratio is sensitive to the model and the task and shows significant variance. Our additional experiments in Appendix C show that it is also sensitive to initialization (Init[1] vs Init[2]). With Init[2], we found that setting a ratio of $\lambda = \eta_B/\eta_A \approx 2^4$ generally improves performance for Roberta (Figure 7). However, with Init[1], we found that the optimal ratio is smaller and is of order $2^2$-$2^3$ (see Appendix C). For Llama experiments, it seems that a ratio of order $2^1$-$2^2$ is optimal.
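The kind of summary reported in Figure 6 can be computed from a learning-rate sweep as sketched below. The function name `top_k_ratio_summary` and the tuple layout `(eta_A, eta_B, test_loss)` are assumptions made for the example, and the sweep values are made-up placeholders that do not correspond to any run in this paper. Runs are ranked by test loss (lower is better), as in the caption of Figure 6.

```python
import statistics

def top_k_ratio_summary(runs, k=4):
    """Median and interquartile range of eta_B / eta_A over the k best runs.

    runs: iterable of (eta_A, eta_B, test_loss); lower test_loss is better.
    Requires k >= 2 so that quartiles are defined.
    """
    best = sorted(runs, key=lambda run: run[2])[:k]
    ratios = [eta_B / eta_A for eta_A, eta_B, _ in best]
    q25, median, q75 = statistics.quantiles(ratios, n=4)
    return {"q25": q25, "median": median, "q75": q75}

# Placeholder sweep (NOT results from the paper): (eta_A, eta_B, test_loss).
placeholder_runs = [
    (2 ** -13, 2 ** -13, 0.50),
    (2 ** -13, 2 ** -11, 0.45),
    (2 ** -13, 2 ** -9, 0.40),
    (2 ** -14, 2 ** -10, 0.41),
    (2 ** -12, 2 ** -9, 0.42),
    (2 ** -11, 2 ** -11, 0.60),
]
summary = top_k_ratio_summary(placeholder_runs, k=4)
```

Because the selected ratios span powers of two, it may be more informative in practice to summarize $\log_2(\eta_B/\eta_A)$ instead of the raw ratio; that variant is a design choice left to the reader.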

Figure 7: Test accuracy of Roberta-base finetuned on the MNLI task in two setups: (LoRA+) $\eta_B = 2^4 \eta_A$ and (Standard) $\eta_B = \eta_A$. $\eta_A$ is tuned using a grid search.

6 Conclusion and Limitations

Employing a scaling argument, we showed that LoRA finetuning as it is currently used in practice is not efficient. We proposed a method, LoRA+, that resolves this issue by setting different learning rates for the LoRA adapter matrices. Our analysis is supported by extensive empirical results confirming the benefits of LoRA+ for both training speed and performance. These benefits are more significant for 'hard' tasks such as MNLI for Roberta/GPT-2 (compared to SST2 for instance) and MMLU for Llama-7b (compared to MNLI for instance). However, as we depicted in Figure 7, a more refined estimation of the optimal ratio $\eta_B/\eta_A$ should take into account task and model dependence, and our analysis in this paper lacks this dimension. We leave this for future work.

Acknowledgement

We thank Amazon Web Services (AWS) for cloud credits under an Amazon Research Award. We also gratefully acknowledge partial support from NSF grants DMS-2209975, 2015341, NSF grant 2023505 on Collaborative Research: Foundations of Data Science Institute (FODSI), the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and 814639, and NSF grant MC2378 to the Institute for Artificial CyberThreat Intelligence and OperatioN (ACTION).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning, specifically, to speed up the leading algorithm LoRA for fine-tuning pre-trained large language models while improving the performance of the fine-tuned models. The speed-up saves computational resources when pre-trained large language models are customized for particular down-stream tasks. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Bordelon et al. [2023] Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit, 2023.
  • Cohen et al. [2021] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jh-rTtvkGeM.
  • Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  • Hayou [2023] Soufiane Hayou. On the infinite-depth limit of finite-width neural networks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=RbLsYz1Az9.
  • Hayou et al. [2019] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2672–2680. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/hayou19a.html.
  • Hayou et al. [2021] Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, and Judith Rousseau. Stable resnet. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1324–1332. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/hayou21a.html.
  • He et al. [2023] Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation, 2023.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.
  • Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  • Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kopiczko et al. [2023] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki Markus Asano. VeRA: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023.
  • LeCun et al. [2002] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • Li et al. [2023] Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. LoftQ: LoRA-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023.
  • Liu et al. [2022] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
  • Liu et al. [2023] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
  • Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
  • Noci et al. [2023] Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, and Daniel M. Roy. The shaped transformer: Attention models in the infinite depth-and-width limit, 2023.
  • OpenAI [2023] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Schoenholz et al. [2017a] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation, 2017a.
  • Schoenholz et al. [2017b] S.S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017b.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018.
  • Wang et al. [2023] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
  • Yang [2019] G. Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • Yang and Hu [2021] Greg Yang and Edward J Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
  • Yang and Littwin [2023] Greg Yang and Etai Littwin. Tensor Programs IVb: Adaptive optimization in the infinite-width limit. arXiv preprint arXiv:2308.01814, 2023.
  • Yang et al. [2022] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
  • Yang et al. [2023] Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.
  • Yang et al. [2013] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90:161–189, 2013.
  • Zeng and Lee [2023] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. arXiv preprint arXiv:2310.17513, 2023.

Appendix A Proofs

In this section, we provide proofs for Proposition 1, Proposition 2, and Theorem 1, along with some technical details used in the proofs.

A.1 Scaling of Neural Networks

Scaling refers to the process of increasing the size of one of the ingredients in the model to improve performance (see e.g. [Hoffmann et al., 2022]). This includes model capacity, which can be increased via width (embedding dimension), depth (number of layers), or both, as well as compute (training data), the number of training steps, etc. In this paper, we are interested in scaling model capacity via the width $n$. This is motivated by the fact that most state-of-the-art language and vision models have large width.

It is well known that as the width $n$ grows, the network initialization scheme and the learning rate should be adapted to avoid numerical instabilities and ensure efficient learning. For instance, the initialization variance should scale as $1/n$ to prevent arbitrarily large pre-activations as we increase model width $n$ (e.g. He init [He et al., 2016]). To derive such scaling rules, a principled approach consists of analyzing statistical properties of key quantities in the model (e.g. pre-activations) as $n$ grows and then adjusting the initialization, the learning rate, and the architecture itself to achieve desirable properties in the limit $n \to \infty$ [Hayou et al., 2019, Schoenholz et al., 2017b, Yang, 2019].

In this context, [Yang et al., 2022] introduces the Maximal Update Parameterization (or $\mu$P), a set of scaling rules for the initialization scheme, the learning rate, and the network architecture that ensure stability and maximal feature learning in the infinite-width limit. Stability is defined by $Y_l^i = \Theta(1)$ for all $l$ and $i$, where the asymptotic notation '$\Theta(.)$' is with respect to width $n$ (see next paragraph for a formal definition), and feature learning is defined by $\Delta Y_l = \Theta(1)$, where $\Delta$ refers to the feature update after taking a gradient step. $\mu$P guarantees that these two conditions are satisfied at any training step $t$. Roughly speaking, $\mu$P specifies that hidden weights should be initialized with $\Theta(n^{-1/2})$ random weights, and weight updates should be of order $\Theta(n^{-1})$. Input weights should be initialized at $\Theta(1)$, and their updates should be $\Theta(1)$ as well, while the output weights should be initialized at $\Theta(n^{-1})$ and updated with $\Theta(n^{-1})$. These rules ensure both stability and feature learning in the infinite-width limit, in contrast to standard parameterization (exploding features if the learning rate is well tuned) and kernel parameterizations (e.g. Neural Tangent Kernel parameterization, where $\Delta Y_l = \Theta(n^{-1/2})$, i.e. no feature learning in the limit).

A.2 The Gamma Function ($\gamma[.]$)

In the theory of scaling of neural networks, one usually tracks the asymptotic behaviour of key quantities as we scale some model ingredient. For instance, if we scale the width, we are interested in quantifying how certain quantities in the network behave as width $n$ grows large, and the asymptotic notation becomes natural in this case. This is a standard approach for (principled) model scaling and it has so far been used to derive scaling rules for initialization [Schoenholz et al., 2017b], activation function [Hayou et al., 2019], network parametrization [Yang et al., 2023], amongst other things.

With Init[1] and Init[2], the weights are initialized with $\Theta(n^{-\beta})$ for some $\beta \geq 0$. Assuming that the learning rates also scale polynomially with $n$, it is straightforward to see that preactivations, gradients, and weight updates are all asymptotically polynomial in $n$. It is therefore natural to introduce the Gamma function, and we write $v = \Theta(n^{\gamma[v]})$ to capture this polynomial behaviour. Now, let us introduce some elementary operations with the Gamma function.

Multiplication.

Given two real-valued variables $v, v'$, we have $\gamma[v \times v'] = \gamma[v] + \gamma[v']$.

Addition.

Given two real-valued variables $v, v'$, we generally have $\gamma[v + v'] = \max(\gamma[v], \gamma[v'])$. The only case where this is violated is when $v' = -v$. This is generally a zero probability event if $v$ and $v'$ are random variables that are not perfectly correlated, which is the case in most situations where we make use of this formula (see the proofs below).
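The two rules above amount to arithmetic on exponents of $n$. The following is a minimal Python sketch of that bookkeeping; the helper names `gamma_mul` and `gamma_add` and the example exponents are illustrative assumptions, not part of the paper, and the addition rule ignores the exact-cancellation case $v' = -v$ discussed above.

```python
from fractions import Fraction

def gamma_mul(*exponents):
    """Exponent of a product: gamma[v * v' * ...] = gamma[v] + gamma[v'] + ..."""
    return sum((Fraction(e) for e in exponents), Fraction(0))

def gamma_add(*exponents):
    """Exponent of a sum: gamma[v + v' + ...] = max of the exponents.

    Assumes no exact cancellation between the leading terms.
    """
    return max(Fraction(e) for e in exponents)

# Hypothetical example: eta = Theta(n^{-1/2}), a^T x = Theta(1), y = Theta(1).
g_eta, g_ax, g_y = Fraction(-1, 2), Fraction(0), Fraction(0)

# Exponent of the product eta * (a^T x) * y.
g_update = gamma_mul(g_eta, g_ax, g_y)

# Exponent of b + update when b = Theta(n^{-1}): the larger exponent dominates.
g_new_b = gamma_add(Fraction(-1), g_update)

print(g_update, g_new_b)
```

Using `Fraction` keeps half-integer exponents such as $-1/2$ exact, so equalities between exponents can be checked without floating-point tolerance.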

A.3 Proof of Proposition 1

Proposition 1. [Inefficiency of LoRA fine-tuning] Assume that LoRA weights are initialized with Init[1] or Init[2] and trained with gradient descent with learning rate $\eta = \Theta(n^{c})$ for some $c \in \mathbb{R}$. Then, it is impossible to have $\delta_t^i = \Theta(1)$ for all $i$ for any $t > 0$, and therefore, fine-tuning with LoRA in this setup is inefficient.

Proof.

Assume that the model is initialized with Init[1]. Since the training dynamics consist mainly of simple linear algebra operations (matrix-vector products, sums of vectors/scalars, etc.), it is easy to see that any vector/scalar in the training dynamics has a magnitude of order $n^{\gamma}$ for some $\gamma \in \mathbb{R}$ (for more details, see the Tensor Programs framework, e.g. [Yang, 2019]). For any quantity $v$ in the training dynamics, we write $v = \Theta(n^{\gamma[v]})$. When $v$ is a vector, we use the same notation when all entries of $v$ are $\Theta(n^{\gamma[v]})$. Efficiency is defined by having $\delta_t^i = \Theta(1)$ for $i \in \{1, 2\}$ and $t > 1$. Note that this implies $f_t(x) = \Theta(1)$ for all $t > 1$. Let $t > 1$ and assume that learning with LoRA is efficient. We will show that this leads to a contradiction. Efficiency requires that $\delta_t^i = \Theta(1)$ for all $t$ and $i \in \{1, 2\}$. Using the elementary formulas from Section A.2, this implies that for all $t$

$$\begin{cases}\gamma[\eta] + 2\gamma[b_{t-1}] + 1 = 0\\ \gamma[\eta] + 2\gamma[a_{t-1}^{\top}x] = 0\\ \gamma[b_{t-1}] + \gamma[a_{t-1}^{\top}x] = 0.\end{cases}$$

Solving this system yields $\gamma[\eta] = -1/2$, i.e. LoRA finetuning can be efficient only if the learning rate scales as $\eta = \Theta(n^{-1/2})$. Let us now show that this yields a contradiction. From the gradient updates and the elementary operations from Section A.2, we have the following recursive formulas

$$\begin{cases}\gamma[b_t] = \max(\gamma[b_{t-1}], -1/2 + \gamma[a_{t-1}^{\top}x])\\ \gamma[a_t^{\top}x] = \max(\gamma[a_{t-1}^{\top}x], 1/2 + \gamma[b_{t-1}])\end{cases}$$

Starting from $t = 1$, with Init[1] we have $\gamma[b_1] = \gamma[\eta (a_0^{\top}x) y] = -1/2$ and $\gamma[a_1^{\top}x] = \gamma[a_0^{\top}x] = 0$, so the recursion gives $\gamma[b_2] = -1/2$ and $\gamma[a_2^{\top}x] = 0$. By the same argument, this holds for any $t$. However, this implies that $\gamma[f_t] = \gamma[b_t] + \gamma[a_t^{\top}x] = -1/2$, which means that $\Delta f_t$ cannot be $\Theta(1)$. With Init[2], we have $\gamma[b_1] = \gamma[b_0] = 0$ and $\gamma[a_1^{\top}x] = \gamma[\eta b_0 y \|x\|^2] = -1/2 + 1 = 1/2$. From the recursive formula we get $\gamma[b_2] = 0$ and $\gamma[a_2^{\top}x] = 1/2$, which remains true for all $t$. In this case we have $\gamma[f_t] = 1/2$, which contradicts $\Delta f_t = \Theta(1)$.

In both cases, this contradicts our assumption, and therefore efficiency cannot be achieved in this setup.
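The recursion used in this proof can be iterated mechanically. The sketch below is an illustration under the proof's own assumptions ($\gamma[\eta] = -1/2$ and the step-1 exponents stated above for Init[1] and Init[2]); the function name `lora_same_lr_exponents` is hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def lora_same_lr_exponents(g_b1, g_ax1, steps=5):
    """Iterate the exponent recursion of the proof with gamma[eta] = -1/2.

    g_b1, g_ax1: exponents of b_1 and a_1^T x after the first gradient step.
    Returns a list of (gamma[b_t], gamma[a_t^T x], gamma[f_t]) for t = 1..steps.
    """
    g_b, g_ax = Fraction(g_b1), Fraction(g_ax1)
    history = [(g_b, g_ax, g_b + g_ax)]
    for _ in range(steps - 1):
        # Both updates use the exponents from the previous step.
        g_b, g_ax = max(g_b, -HALF + g_ax), max(g_ax, HALF + g_b)
        history.append((g_b, g_ax, g_b + g_ax))
    return history

# Init[1]: gamma[b_1] = -1/2, gamma[a_1^T x] = 0.
init1 = lora_same_lr_exponents(-HALF, 0)
# Init[2]: gamma[b_1] = 0, gamma[a_1^T x] = 1/2.
init2 = lora_same_lr_exponents(0, HALF)

for name, hist in (("Init[1]", init1), ("Init[2]", init2)):
    print(name, [str(g_f) for (_, _, g_f) in hist])
```

In both runs the exponent of $f_t$ stays away from $0$ at every step, which is the contradiction the proof relies on.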

A.4 Proof of Proposition 2

Proposition 2. [Efficient Fine-Tuning with LoRA] In the case of the toy model of Equation 2, with $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$, we have for all $t > 1$ and $i \in \{1, 2, 3\}$, $\delta_t^i = \Theta(1)$.

Proof.

The proof is similar in flavor to that of Proposition 1. In this case, the set of equations that should be satisfied so that $\delta_t^i = \Theta(1)$ is given by

$$\begin{cases}\gamma[\eta_a] + 2\gamma[b_{t-1}] + 1 = 0\\ \gamma[\eta_b] + 2\gamma[a_{t-1}^{\top}x] = 0\\ \gamma[\eta_a] + \gamma[\eta_b] + \gamma[b_{t-1}] + \gamma[a_{t-1}^{\top}x] + 1 = 0,\end{cases}$$

where we have used the elementary formulas from Section A.2. Simple calculations yield $\gamma[\eta_a] + \gamma[\eta_b] = -1$. Using the gradient update expression with the elementary addition from Section A.2, the recursive formulas controlling $\gamma[b_t]$ and $\gamma[a_t^{\top}x]$ are given by

$$\begin{cases}\gamma[b_t] = \max(\gamma[b_{t-1}], \gamma[\eta_b] + \gamma[a_{t-1}^{\top}x])\\ \gamma[a_t^{\top}x] = \max(\gamma[a_{t-1}^{\top}x], \gamma[\eta_a] + \gamma[b_{t-1}] + 1).\end{cases}$$

Starting from $t = 1$, with Init[1], we have $\gamma[b_1] = \gamma[\eta_b (a_0^{\top}x) y] = \gamma[\eta_b]$ and $\gamma[a_1^{\top}x] = \gamma[a_0^{\top}x] = 0$. Therefore $\gamma[b_2] = \max(\gamma[\eta_b], \gamma[\eta_b] + 0) = \gamma[\eta_b]$, and $\gamma[a_2^{\top}x] = \max(0, \gamma[\eta_a] + \gamma[\eta_b] + 1) = \max(0, 0) = 0$. By induction, this holds for all $t \geq 1$. With Init[2], we have $\gamma[b_1] = \gamma[b_0] = 0$, and $\gamma[a_1^{\top}x] = \gamma[\eta_a b_0 y \|x\|^2] = \gamma[\eta_a] + 1$. At step $t = 2$, we have $\gamma[b_2] = \max(0, \gamma[\eta_b] + \gamma[\eta_a] + 1) = 0$ and $\gamma[a_2^{\top}x] = \max(\gamma[\eta_a] + 1, \gamma[\eta_a] + 0 + 1) = \gamma[\eta_a] + 1$, and this holds for all $t$ by induction. In both cases, to ensure that $\gamma[f_t] = \gamma[b_t] + \gamma[a_t^{\top}x] = 0$, we have to set $\gamma[\eta_b] = 0$ and $\gamma[\eta_a] = -1$ (straightforward from the equation $\gamma[\eta_b] + \gamma[\eta_a] = -1$). In conclusion, setting $\eta_a = \Theta(n^{-1})$ and $\eta_b = \Theta(1)$ ensures efficient fine-tuning with LoRA.
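The scaling conclusions of Propositions 1 and 2 can also be checked numerically. The sketch below is an illustration only, under assumptions that are not taken verbatim from the paper: the toy model is $f(x) = b\, a^{\top}x$ with loss $(f(x) - y)^2/2$, $x$ is the all-ones vector (so $\|x\|^2 = n$), the start is Init[1]-style with $b_0 = 0$ and $a_0^{\top}x = 0.5$, and $y = 1$. Since every update of $a$ is a multiple of $x$, it is enough to track the scalar $u = a^{\top}x$. The function name `simulate_toy_lora` is hypothetical.

```python
def simulate_toy_lora(n, eta_a, eta_b, u0=0.5, y=1.0, steps=3):
    """Gradient descent on f(x) = b * a^T x with loss (f - y)^2 / 2.

    Assumes x is the all-ones vector in R^n, so ||x||^2 = n, with b_0 = 0
    and a_0^T x = u0 = Theta(1). Tracks only b and the scalar u = a^T x.
    Returns the list [f_0, f_1, ..., f_steps].
    """
    b, u, sq_norm_x = 0.0, u0, float(n)
    outputs = [b * u]
    for _ in range(steps):
        residual = b * u - y
        # Simultaneous updates: dL/db = residual * u, dL/da = residual * b * x,
        # so u = a^T x moves by -eta_a * residual * b * ||x||^2.
        b, u = b - eta_b * residual * u, u - eta_a * residual * b * sq_norm_x
        outputs.append(b * u)
    return outputs

for n in (10**4, 4 * 10**4):
    # Separate learning rates: eta_a = 1/n, eta_b = 1.
    separate = simulate_toy_lora(n, eta_a=1.0 / n, eta_b=1.0)
    # Shared learning rate eta = n^{-1/2}, the only candidate from Proposition 1.
    shared = simulate_toy_lora(n, eta_a=n ** -0.5, eta_b=n ** -0.5)
    print(n, separate[-1], shared[-1])
```

With $\eta_a = 1/n$ and $\eta_b = 1$ the product $\eta_a \|x\|^2$ equals $1$, so the trajectory of $f_t$ in this sketch does not depend on the width. With the shared rate $\eta = n^{-1/2}$, the output after a fixed number of steps shrinks roughly like $n^{-1/2}$, in line with $\gamma[f_t] = -1/2$ under Init[1].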

A.5 Proof of Theorem 1

In this section, we give a non-rigorous but intuitive proof of Theorem 1. The proof relies on the following assumption on the processed gradient $g_A$.

Assumption 1.

With the same setup of Section 4, at training step $t$, we have $g_A^t \underline{Z} = \Theta(n)$.

To see why Assumption 1 is sound in practice, let us study the product $g_A^t \underline{Z}$ in the simple case of Adam with no momentum, a.k.a. SignSGD, which is given by

$$g_A = \mathrm{sign}\left(\frac{\partial \mathcal{L}}{\partial A}\right),$$

where the sign function is applied element-wise. At training step $t$, we have

$$\frac{\partial \mathcal{L}_t}{\partial A} = \frac{\alpha}{r} B^\top_{t-1} d\bar{Z}^{t-1} \otimes \underline{Z}.$$

Let $S^t = \frac{\alpha}{r} B^\top_{t-1} d\bar{Z}^{t-1}$. Therefore we have

$$g_A = \mathrm{sign}(S^t \otimes \underline{Z}) = (\mathrm{sign}(S^t_i \underline{Z}_j))_{1 \leq i, j \leq n}.$$

However, note that we also have

$$\mathrm{sign}(S^t_i \underline{Z}_j) = \mathrm{sign}(S^t_i)\,\mathrm{sign}(\underline{Z}_j),$$

and as a result

$$g_A^t = \mathrm{sign}(S^t) \otimes \mathrm{sign}(\underline{Z}).$$

Hence, we obtain

$$g_A^t \underline{Z} = (\mathrm{sign}(\underline{Z})^\top \underline{Z})\,\mathrm{sign}(S^t) = \Theta(n),$$

where we used the fact that $\mathrm{sign}(\underline{Z})^\top \underline{Z} = \Theta(n)$.
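The SignSGD argument above can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes the entries of $\underline{Z}$ are i.i.d. standard Gaussians, uses a random vector `S` as a stand-in for $S^t$, and takes $A$ to have $r$ rows as in the toy model of Appendix C.1.1. The function name `signsgd_times_input` is ours.

```python
import random


def sign(v: float) -> int:
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)


def signsgd_times_input(n: int, r: int = 4, seed: int = 0):
    """Return (g_A Z, S, Z) where g_A = sign(S outer Z) is the SignSGD update."""
    rng = random.Random(seed)
    Z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # layer input, Theta(1) entries (assumed Gaussian)
    S = [rng.gauss(0.0, 1.0) for _ in range(r)]  # stand-in for (alpha / r) B^T dZbar
    # Processed gradient: the sign is applied entry-wise to the outer product S_i * Z_j.
    g_A = [[sign(s * z) for z in Z] for s in S]
    # Matrix-vector product g_A Z, one entry per row of g_A.
    g_A_Z = [sum(g * z for g, z in zip(row, Z)) for row in g_A]
    return g_A_Z, S, Z


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        g_A_Z, S, Z = signsgd_times_input(n)
        # Each entry equals sign(S_i) * sum_j |Z_j|, so |entry| / n should stay roughly constant.
        print(n, [round(abs(v) / n, 3) for v in g_A_Z])
```

Each entry of `g_A_Z` equals $\mathrm{sign}(S_i) \sum_j |\underline{Z}_j|$, so its magnitude grows linearly with $n$, which is the content of Assumption 1 in this special case.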

This intuition should in principle hold for the general variant of Adam with momentum as long as the gradient processing function (a notion introduced in [Yang et al., 2013]) roughly preserves the $\mathrm{sign}(\underline{Z})$ direction. This reasoning can be made rigorous for a general gradient processing function using the Tensor Program framework and taking the infinite-width limit, where the components of $g_A, \underline{Z}, d\bar{Z}$ all become iid. However, this necessitates an intricate treatment of several quantities in the process, which we believe is an unnecessary complication and does not serve the main purpose of this paper.

Let us now give a proof for the main claim.

Theorem 1. Assume that weight matrices $A$ and $B$ are trained with Adam with respective learning rates $\eta_A$ and $\eta_B$ and that Assumption 1 is satisfied with the Adam gradient processing function. Then, it is impossible to achieve efficiency with $\eta_A = \eta_B$. However, LoRA finetuning is efficient with $\eta_A = \Theta(n^{-1})$ and $\eta_B = \Theta(1)$.

Proof.

With the same setup of Section 4, at step $t$, we have

$$\begin{cases}
\delta_t^1 = B_{t-1} \Delta Z_A^t = -\eta_A B_{t-1} g_A^{t-1} \underline{Z} \\
\delta_t^2 = \Delta B_t Z_A^{t-1} = -\eta_B g_B^{t-1} A_{t-1} \underline{Z} \\
\delta_t^3 = \Delta B_t \Delta Z_A^t = \eta_A \eta_B g_B^{t-1} g_A^{t-1} \underline{Z}
\end{cases}$$

The key observation here is that $g_A^{t-1} \underline{Z}$ has entries of order $\Theta(n)$, as predicted and justified in Assumption 1. Having $\delta_t^i = \Theta(1)$ for $i \in \{1, 2\}$ and $Z_B^t = \Theta(1)$ for $t > 1$ translates to

$$\begin{cases}
\gamma[\eta_A] + \gamma[B_{t-1}] + 1 = 0 \\
\gamma[\eta_B] + \gamma[A_{t-1} \underline{Z}] = 0 \\
\gamma[B_{t-1}] + \gamma[A_{t-1} \underline{Z}] = 0,
\end{cases}$$

which implies that $\gamma[\eta_A] + \gamma[\eta_B] = -1$ (add the first two equations and subtract the third).

With the gradient updates, we have

$$\begin{aligned}
B_t &= B_{t-1} - \eta_B g_B^{t-1} \\
A_t \underline{Z} &= A_{t-1} \underline{Z} - \eta_A g_A^{t-1} \underline{Z}
\end{aligned}$$

which implies that

$$\begin{aligned}
\gamma[B_t] &= \max(\gamma[B_{t-1}], \gamma[\eta_B]) \\
\gamma[A_t \underline{Z}] &= \max(\gamma[A_{t-1} \underline{Z}], \gamma[\eta_A] + 1).
\end{aligned}$$

Now assume that the model is initialized with Init[1]. We have $\gamma[B_1] = \gamma[\eta_B]$ and therefore for all $t$, we have $\gamma[B_t] = \gamma[\eta_B]$. We also have $\gamma[A_1 \underline{Z}] = \gamma[A_0 \underline{Z}] = 0$ (because $A_1 = A_0$, and we use the Central Limit Theorem to conclude). Hence, if we choose the same learning rate for $A$ and $B$, given by $\eta$, we obtain $\gamma[\eta] = -1/2$, and therefore $\gamma[Z_A^{t-1}] = \gamma[A_{t-1} \underline{Z}] = 1/2$, which violates the stability condition. A similar behaviour occurs with Init[2]. Hence, efficiency is not possible in this case.
However, if we set $\gamma[\eta_B] = 0$ and $\gamma[\eta_A] = -1$, we get that $\gamma[B_t] = 0$, $\gamma[A_t \underline{Z}] = 0$, and $\delta_t^i = \Theta(1)$ for all $i \in \{1, 2, 3\}$ and $t \geq 1$. The same result holds with Init[2].
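The exponent bookkeeping in this proof can be replayed mechanically. The sketch below is an illustration only: it tracks the exponents $\gamma[\cdot]$ under Init[1] using the two max-recursions above, and it assumes that the entries of $g_B$ are $\Theta(1)$ and that $g_A \underline{Z} = \Theta(n)$ (Assumption 1). The function name `propagate_exponents` is ours.

```python
from fractions import Fraction


def propagate_exponents(gamma_eta_A, gamma_eta_B, steps=5):
    """Track exponents gamma[.] (powers of n) under Init[1].

    Returns (gamma[B_t], gamma[A_t Z], (gamma[delta^1], gamma[delta^2], gamma[delta^3]))
    after `steps` updates.
    """
    gamma_eta_A = Fraction(gamma_eta_A)
    gamma_eta_B = Fraction(gamma_eta_B)
    gamma_B = gamma_eta_B    # gamma[B_1] = gamma[eta_B] under Init[1]
    gamma_AZ = Fraction(0)   # gamma[A_1 Z] = gamma[A_0 Z] = 0
    deltas = None
    for _ in range(steps):
        # Exponents of the three feature-update terms, using gamma[g_A Z] = 1
        # and (assumed) gamma[g_B] = 0.
        deltas = (
            gamma_eta_A + gamma_B + 1,      # delta^1 = -eta_A B g_A Z
            gamma_eta_B + gamma_AZ,         # delta^2 = -eta_B g_B A Z
            gamma_eta_A + gamma_eta_B + 1,  # delta^3 = eta_A eta_B g_B g_A Z
        )
        gamma_B = max(gamma_B, gamma_eta_B)
        gamma_AZ = max(gamma_AZ, gamma_eta_A + 1)
    return gamma_B, gamma_AZ, deltas


if __name__ == "__main__":
    # Equal learning rates eta = Theta(n^{-1/2}): A Z blows up like n^{1/2}.
    print(propagate_exponents(Fraction(-1, 2), Fraction(-1, 2)))
    # eta_A = Theta(1/n), eta_B = Theta(1): every tracked exponent is 0.
    print(propagate_exponents(-1, 0))
```

With equal learning rates the update terms are $\Theta(1)$ but $\gamma[A_t \underline{Z}] = 1/2$, which is the instability described in the proof; with $\gamma[\eta_A] = -1$ and $\gamma[\eta_B] = 0$ all tracked exponents are zero.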

Appendix B Efficiency from a Loss Perspective.

Consider the same setup of Section 4. At step $t$, the loss changes as follows

$$\begin{aligned}
\Delta \mathcal{L} &= \mathcal{L}((BA)_t) - \mathcal{L}((BA)_{t-1}) \\
&\approx \langle d\bar{Z}^{t-1} \otimes \underline{Z}, (BA)_t - (BA)_{t-1} \rangle_F \\
&= \langle d\bar{Z}^{t-1}, \Delta Z_B^t \rangle,
\end{aligned}$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product in $\mathbb{R}^{n \times n}$, and $\langle \cdot, \cdot \rangle$ is the Euclidean inner product in $\mathbb{R}^n$. Since the direction of the feature updates is significantly correlated with $d\bar{Z}^{t-1}$, it should be expected that having $\delta_t^i = \Theta(1)$ for all $i$ results in more efficient loss reduction.
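For completeness, the last equality in the display can be spelled out as a short worked step. Here $M = (BA)_t - (BA)_{t-1}$ is our own shorthand, and we assume the layer input $\underline{Z}$ is the same before and after the update, so that $\Delta Z_B^t = M \underline{Z}$:

$$\langle d\bar{Z}^{t-1} \otimes \underline{Z}, M \rangle_F = \sum_{i,j=1}^{n} d\bar{Z}^{t-1}_i \, \underline{Z}_j \, M_{ij} = \sum_{i=1}^{n} d\bar{Z}^{t-1}_i \, (M \underline{Z})_i = \langle d\bar{Z}^{t-1}, \Delta Z_B^t \rangle.$$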

Appendix C Additional Experiments

This section complements the empirical results reported in the main text. We provide the details of our experimental setup, and show the acc/loss heatmaps for several configurations.

C.1 Empirical Details

C.1.1 Toy Example

In Figure 2, we trained a simple MLP with LoRA layers to verify the results of the analysis in Section 3. Here we provide the empirical details for these experiments.

Model.

We consider a simple MLP given by

$$f(x) = W_{out}\,\phi(BA\,\phi(W_{in} x)),$$

where $W_{in} \in \mathbb{R}^{n \times d}$, $W_{out} \in \mathbb{R}^{1 \times n}$, $A \in \mathbb{R}^{r \times n}$, $B \in \mathbb{R}^{n \times r}$ are the weights, and $\phi$ is the ReLU activation function. Here, we used $d = 5$, $n = 100$, and $r = 4$.

Dataset.

Synthetic dataset generated by $X \sim \mathcal{N}(0, I_d)$, $Y = \sin(d^{-1} \sum_{i=1}^{d} X_i)$ with $d = 5$. The number of training examples is $N_{train} = 1000$, and the number of test examples is $N_{test} = 100$.

Training.

We train the model with gradient descent for a range of values of $(\eta_A, \eta_B)$. The weights are initialized as follows: $W_{in} \sim \mathcal{N}(0, 1)$, $W_{out} \sim \mathcal{N}(0, 1/n)$, $A \sim \mathcal{N}(0, 1/n)$, $B \sim \mathcal{N}(0, 1)$. Only the weight matrices $A, B$ are trained and $W_{in}, W_{out}$ are fixed to their initial values.
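The toy setup described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the code used for Figure 2: the loss (half mean-squared error), the use of full-batch updates, the number of steps, and all function names are our choices, since the text only specifies the model, the data, the initialization, and that gradient descent is used.

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


def make_data(num, d=5, rng=None):
    """X ~ N(0, I_d), Y = sin of the mean of the coordinates of X."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((num, d))
    return X, np.sin(X.mean(axis=1))


def init_weights(d=5, n=100, r=4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return {
        "W_in": rng.standard_normal((n, d)),                # N(0, 1), frozen
        "W_out": rng.standard_normal((1, n)) / np.sqrt(n),  # N(0, 1/n), frozen
        "A": rng.standard_normal((r, n)) / np.sqrt(n),      # N(0, 1/n), trained
        "B": rng.standard_normal((n, r)),                   # N(0, 1), trained
    }


def loss_and_grads(w, X, Y):
    """Half mean-squared error (assumed loss) and its gradients w.r.t. A and B."""
    H1 = relu(X @ w["W_in"].T)    # (N, n)
    U = H1 @ w["A"].T             # (N, r)
    V = U @ w["B"].T              # (N, n)
    out = relu(V) @ w["W_out"].T  # (N, 1)
    err = out[:, 0] - Y
    loss = 0.5 * np.mean(err ** 2)
    # Backpropagate through the output layer and the second ReLU.
    dV = (err[:, None] / len(Y)) * w["W_out"] * (V > 0)  # (N, n)
    grad_B = dV.T @ U                                    # (n, r)
    grad_A = (dV @ w["B"]).T @ H1                        # (r, n)
    return loss, grad_A, grad_B


def train(w, X, Y, eta_A, eta_B, steps=100):
    """Full-batch gradient descent with separate learning rates for A and B."""
    for _ in range(steps):
        _, grad_A, grad_B = loss_and_grads(w, X, Y)
        w["A"] -= eta_A * grad_A
        w["B"] -= eta_B * grad_B
    return loss_and_grads(w, X, Y)[0]
```

Sweeping `eta_A` and `eta_B` over a grid and recording the final loss would give a heatmap of the kind discussed in the main text; the specific grid used there is not stated in this appendix.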

C.1.2 GLUE Tasks with GPT2/Roberta

For our experiments with GPT2/Roberta-base models, finetuned on GLUE tasks, we use the following setup:

Tasks.

MNLI, QQP, SST2, QNLI

Models.

GPT2, Roberta-base

Training Alg.

AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon =$ 1e-8, linear schedule, no warmup.

Learning rate grid.

$\eta_A \in$ {4e-3, 2e-3, 1e-3, 5e-4, 2e-4, 1e-4}, $\eta_B \in$ {8e-4, 4e-4, 2e-4, 1e-4, 5e-5, 2e-5, 1e-5}.

Target Modules for LoRA.

For Roberta-base, we add LoRA layers to ‘query’ and ‘value’ weights. For GPT2, we add LoRA layers to ‘c_attn, c_proj, c_fc’.

Other Hyperparameters.

Sequence length $T = 128$, train batch size $bs = 32$, number of train epochs $E = 3$ ($E = 10$ for SST2), number of random seeds $s = 3$.

GPUs.

Nvidia V100, Nvidia A10.
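As an illustration of how the optimizer settings above translate into code, the sketch below builds an AdamW optimizer in PyTorch with one parameter group per adapter matrix, so that $A$ and $B$ receive different learning rates. It is a minimal sketch under assumptions: the module `ToyLoRALinear`, the helper `build_optimizer`, the parameter names `lora_A` and `lora_B`, and the width and rank used are all hypothetical placeholders, and the linear learning rate schedule is omitted.

```python
import torch
from torch import nn


class ToyLoRALinear(nn.Module):
    """Hypothetical stand-in for a LoRA layer; names and sizes are illustrative only."""

    def __init__(self, n=16, r=4):
        super().__init__()
        self.base = nn.Linear(n, n, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, n) / n ** 0.5)  # random A
        self.lora_B = nn.Parameter(torch.zeros(n, r))             # zero B

    def forward(self, x):
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T


def build_optimizer(model, eta_A, eta_B):
    """AdamW with separate learning rates for the A and B adapter matrices."""
    a_params = [p for name, p in model.named_parameters() if "lora_A" in name]
    b_params = [p for name, p in model.named_parameters() if "lora_B" in name]
    return torch.optim.AdamW(
        [
            {"params": a_params, "lr": eta_A},
            {"params": b_params, "lr": eta_B},
        ],
        betas=(0.9, 0.99),  # beta_1, beta_2 as listed above
        eps=1e-8,
    )


if __name__ == "__main__":
    model = ToyLoRALinear()
    optimizer = build_optimizer(model, eta_A=2e-4, eta_B=4e-4)
    print([group["lr"] for group in optimizer.param_groups])
```

Using parameter groups keeps a single optimizer state while letting the ratio $\eta_B / \eta_A$ be set freely; the values `2e-4` and `4e-4` are simply one pair taken from the grid above.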

C.1.3 Llama MNLI

For our experiments using the Llama-7b model, finetuned on MNLI, we use the following setup:

Training Alg.

AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon =$ 1e-6, constant schedule.

Learning rate grid.

$\eta_A \in$ {1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4}, $\eta_B \in$ {1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4}, $\eta_B \geq \eta_A$.

LoRA Hyperparameters.

LoRA rank $r = 8$, $\alpha = 16$, and dropout $0.1$. LoRA target modules ‘q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj’.

Other Hyperparameters.

Sequence length $T = 128$, train batch size $bs = 32$, number of train epochs $E = 1$, number of random seeds $s = 2$ for $\eta_A = \eta_B$ and $\eta_A, \eta_B$ near test optimal, $s = 1$ otherwise. Precision FP16.

GPUs.

Nvidia V100.
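The constrained learning rate grid for this setup can be enumerated as follows. This is a small sketch for illustration; the names `LR_VALUES`, `llama_lr_grid`, and `ratio` are ours, and the ratio follows the definition $\lambda = \eta_B / \eta_A$ used in this paper.

```python
from itertools import product

# Values shared by eta_A and eta_B in the Llama learning rate grid.
LR_VALUES = [1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4]


def llama_lr_grid(values=LR_VALUES):
    """All (eta_A, eta_B) pairs from the grid that satisfy eta_B >= eta_A."""
    return [(a, b) for a, b in product(values, repeat=2) if b >= a]


def ratio(eta_A, eta_B):
    """The learning rate ratio lambda = eta_B / eta_A."""
    return eta_B / eta_A


if __name__ == "__main__":
    grid = llama_lr_grid()
    print(len(grid), "configurations")
    print("largest ratio:", max(ratio(a, b) for a, b in grid))
```

The constraint removes the lower triangle of the full 6 by 6 grid, leaving the diagonal (standard LoRA, $\eta_A = \eta_B$) and the configurations with $\eta_B > \eta_A$.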

C.1.4 Llama flan-v2

For our experiments using the Llama-7b model, finetuned on a random subset of flan-v2 of size 100k, we use the following setup:

Training Alg.

AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon =$ 1e-6, constant schedule.

Learning rate grid.

$\eta_A \in$ {1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4}, $\eta_B \in$ {1e-6, 5e-6, 1e-5, 2.5e-5, 5e-5, 1e-4}, $\eta_B \geq \eta_A$.

LoRA Hyperparameters.

LoRA rank $r = 64$, $\alpha = 16$, and dropout $0.1$. LoRA target modules ‘q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj’.

Other Hyperparameters.

Sequence length $T_{\text{source}} = 1536$, $T_{\text{target}} = 512$, train batch size $bs = 16$, number of epochs $E = 1$, number of random seeds $s = 2$ for $\eta_A = \eta_B$ and $\eta_A, \eta_B$ near test optimal, $s = 1$ otherwise. Precision BF16.

MMLU Evaluation.

We evaluate average accuracy on MMLU using 5-shot prompting.

GPUs.

Nvidia A10.

C.2 Results of Roberta-base Finetuning on all Tasks

Figure 3 showed finetuning test accuracy for Roberta-base. To complement these results, we show here the test/train accuracy for all tasks.

Figure 8: GLUE/Roberta-base: same as Figure 3 with test/train accuracy.

Interestingly, the optimal choice of learning rates for test accuracy differs from that of the train accuracy, although the difference is small. This can be due to mild overfitting occurring during finetuning (the optimal choice of learning rates $(\eta_A, \eta_B)$ for train accuracy probably leads to some overfitting).

C.3 Results of GPT2 Finetuning on all Tasks

Figure 4 showed finetuning results for GPT2 on MNLI and QQP. To complement these results, we show here the test/train accuracy for all tasks.

Figure 9: GLUE/GPT2: same setup as Figure 4 with additional tasks

C.4 GLUE Tasks with Full Precision

Figure 10: GLUE/Roberta-base: same as Figure 3 with full precision training instead of FP16.
Figure 11: GLUE/GPT2: same setup as Figure 9 with full precision training

C.5 GLUE Tasks Test/Train Loss

Figure 12: GLUE/Roberta-base: same setup as Figure 3 with $100 \times$ Test/Train loss instead of accuracy
Figure 13: GLUE/GPT2: same setup as Figure 9 with $100 \times$ Test/Train loss instead of accuracy

C.6 GLUE Tasks with Different LoRA Ranks

Figure 14: GLUE/Roberta-base: same setup as Figure 3 with $r = 4$
Figure 15: GLUE/Roberta-base: same setup as Figure 3 with $r = 16$
Figure 16: GLUE/GPT2: same setup as Figure 11 with $r = 4$

C.7 Experiments with Init[1]

We also ran some experiments using Init[1] as the initialization scheme. We noticed that the optimal ratio $\lambda = \eta_B / \eta_A$ in this case is generally smaller than the optimal ratio with Init[2]. Figure 17 shows the optimal learning rates $(\eta_A, \eta_B)$ obtained with Init[1] and Init[2].

Figure 17: Roberta-base with Init[1] and Init[2], finetuning on MNLI for 10 epochs (similar to Figure 3 but with more epochs).

C.8 Llama Flan-v2 MMLU Acc/Train Loss

(a) MMLU evaluation accuracy and train loss of Llama-7b trained on flan-v2 100k in the same setting as Figure 5 left panel (using Init[2]). Interestingly, even in one epoch the model can overfit. We were unable to find $\eta_B > \eta_A$ that was optimal for train loss; however, it could be the case that the grid was not fine enough or that overfitting does not require much “feature learning” and $\eta_B / \eta_A \approx 1$ is optimal for minimizing train loss (see the main text for more discussion).
(b) MMLU evaluation accuracy and train loss of Llama-7b trained on flan-v2 100k in the same setting as Figure 5 left panel except using Init[1]. Interestingly, the optimal MMLU accuracy is 0.6% higher than using Init[2] and the optimal ratio $\eta_B / \eta_A$ is twice as large. The training loss is also near-optimal only when using a large ratio $\eta_B / \eta_A$.
Figure 18: Llama-7b on flan-v2 training with different initializations.

C.9 Llama MNLI Test/Train Loss

Figure 19: Train and test loss of Llama-7b finetuned on MNLI in the same setting as Figure 5 right panel.