A Survey on LoRA of Large Language Models

Yuren MAO    Yuhang GE    Yijiang FAN    Wenyi XU    Yu MI   
Zhonghao HU
   Zhejiang University, Hangzhou 310007, China
Abstract

Low-Rank Adaptation (LoRA), which updates dense neural network layers with pluggable low-rank matrices, is one of the best-performing parameter-efficient fine-tuning paradigms. Furthermore, it has significant advantages in cross-task generalization and privacy preservation. Hence, LoRA has gained much attention recently, and the related literature has grown exponentially. A comprehensive overview of the current progress on LoRA is therefore necessary. This survey categorizes and reviews the progress from the perspectives of (1) downstream adaptation improving variants that improve LoRA’s performance on downstream tasks; (2) cross-task generalization methods that mix multiple LoRA plugins to achieve cross-task generalization; (3) efficiency-improving methods that boost the computational efficiency of LoRA; (4) data privacy-preserving methods that use LoRA in federated learning; (5) applications. In addition, this survey discusses future directions in this field.

keywords:
Low-Rank Adaptation, LoRA, Large Language Models, LLMs

Yunjun GAO (corresponding author)

1 Introduction

The rapidly increasing parameter scale of pre-trained language models improves their generalization ability and brings emergent abilities. In the last few years, the parameter scale of pre-trained language models has increased by thousands of times (e.g., from the 330M-parameter BERT [1] to the 540B-parameter PaLM [2]). Pre-trained language models with such large parameter scales are termed large language models (LLMs). Nevertheless, due to the knowledge boundaries of LLMs, their abilities on some downstream tasks are still limited. To expand these knowledge boundaries, it remains necessary to fine-tune LLMs on the downstream tasks.

However, fine-tuning the full parameters of an LLM, namely full fine-tuning, is extremely computationally expensive. For example, full fine-tuning of a LLaMA2-7B [3] model requires approximately 60GB of memory, which exceeds the capacity of common consumer GPUs [4]. To reduce the computational cost, various parameter-efficient fine-tuning (PEFT) methods have been proposed [5]. They adapt LLMs to downstream tasks by fine-tuning only a small number of (extra) model parameters. Depending on whether extra parameters are involved, PEFT methods can be divided into two categories: extra-parameter methods and intra-parameter methods. Extra-parameter methods, such as adapter tuning [6] and prompt tuning [7], freeze all of the original parameters of an LLM and insert a set of learnable parameters to optimize the model input or model layers. By contrast, intra-parameter methods, such as BitFit [8], LISA [4] and LoRA [9], freeze most of the original parameters of an LLM and tune only a small number of them.

When we do not have access to modify the model architecture, intra-parameter methods are desirable. Among the intra-parameter methods, LoRA is the most widely used, because it can achieve downstream adaptation performance comparable to or better than full fine-tuning on a range of downstream tasks [9] and is easy to implement. Besides, many variants have been proposed to further improve the downstream adaptation ability of LoRA on more challenging downstream tasks.

LoRA achieves parameter efficiency by updating the dense neural network layers of an LLM with pluggable low-rank matrices. These matrices (a.k.a. LoRA plugins) are independent of the LLM and can be stored and reused in other related downstream tasks. Furthermore, these LoRA plugins can be combined to achieve cross-task generalization, which can facilitate multi-task learning, domain adaptation, and continual learning for LLMs.

As LoRA plugins accumulate, the computational cost of managing them increases. Although a single LoRA plugin is computationally efficient, the cost of managing a large number of plugins is non-negligible. It is therefore necessary to further improve the computational efficiency of LoRA. The improvement can come from reducing the computational cost of single LoRA plugins and from accelerating the scalable serving of multiple plugins. This can boost the application of LoRA in real-world use cases, such as Generative-as-a-Service (GaaS) cloud products.

In some cases, the training data are privately owned by multiple clients and cannot be centralized. To adapt LLMs with such distributed training data, we can adopt federated learning to protect the data privacy of each client. However, federated learning suffers from expensive communication and computation costs. LoRA is a natural choice for reducing these costs: its parameter-efficient nature helps to reduce the computation cost of each client and the communication cost of sharing parameters across clients. Furthermore, the pluggable feature of LoRA can help preserve the parameter privacy of each client in federated learning. Therefore, LoRA has great potential for privacy preservation.

In this survey, we give a comprehensive overview of the current progress on LoRA, covering methods for (1) improving the downstream adaptation performance of LoRA; (2) mixing LoRA plugins to achieve cross-task generalization; (3) boosting the computational efficiency of LoRA; (4) adopting LoRA in federated learning. Besides, the applications of LoRA are briefly introduced. This taxonomy of LoRA-related methods is illustrated in Figure 1. This survey is expected to provide comprehensive background knowledge, research trends and technical insights for LoRA.

The rest of this survey is organized as follows. Section 2 introduces the background knowledge of LoRA, and Section 3 introduces the LoRA variants that aim to improve the downstream adaptation performance. In Section 4, we review the LoRA mixture methods that mix LoRA plugins to achieve cross-task generalization. Section 5 reviews the efficiency-improving methods for LoRA. The LoRA-driven federated learning methods are introduced in Section 6. Section 7 reports the applications of LoRA. We conclude this survey and discuss the future directions in Section 8.

2 Low-Rank Adaptation (LoRA)

Low-Rank Adaptation of Large Language Models
  Low-Rank Adaptation (§2)
    Theoretical Analysis (§2.2): Malladi et al. [10], Koubbi et al. [11], Jang et al. [12], Zhu et al. [13], Zeng et al. [14]
    Beyond Fine-tuning (§2.4): ReLoRA [15], MoRA [16], LTE [17], InfLoRA [18], GS-LoRA [19], I-LoRA [20], LongLoRA [3], SinkLoRA [21]
  Downstream Adaptation Improving (§3)
    Breaking the Low-rank Bottleneck (§3.1)
      Stacking LoRAs along Fine-tuning (§3.1.1): ReLoRA [15], COLA [22], MELoRA [23]
      Updating as Gradient Compressor (§3.1.2): FLoRA [24]
      Co-learning LLM and LoRA (§3.1.3): Delta-LoRA [25]
    Dynamic Rank Allocation (§3.2)
      SVD-based Methods (§3.2.1): AdaLoRA [26], SaLoRA [27], IncreLoRA [28]
      SRD-based Methods (§3.2.2): DoRA (Dynamic Low-Rank Adaptation) [29], AutoLoRA [30], SoRA [31], ALoRA [32]
      Rank Sampling-based Methods (§3.2.3): DyLoRA [33]
    Optimizing the Learning Procedure (§3.3)
      Initialization Improvement (§3.3.1): Hayou et al. [34], PiSSA [35], MiLoRA [36]
      Gradient Update Optimization (§3.3.2): Zhang et al. [37], LoRA+ [38], ResLoRA [39], SIBO [40], Jin et al. [41], DoRA [42]
      Overfitting Mitigation (§3.3.3): BiLoRA [43], Lin et al. [44], HiddenKey [45]
    Combining with other Learning Paradigms (§3.4): Laplace-LoRA [46], PILLOW [47], STAR [48]
  Cross-task Generalization (§4)
    Mixture with Manually Designed Weights (§4.1): Wang et al. [49], Zhao et al. [50], Smith et al. [51], ControlPE [52], Zhang et al. [53], Chitale et al. [54], Token-level Adaptation [55], BYOM [56]
    Mixture with Learnt Weights (§4.2): Asadi et al. [57], LoRAHub [58], ComPEFT [59], L-LoRA [60], MixLoRA [61], X-LoRA [62]
    Mixture of LoRA Experts (§4.3): MoRAL [63], LoRAMoE [64], MoCLE [65], MOELoRA [66], Mixture-of-LoRAs [67], MultiLoRA [68], MLoRE [69], MTLoRA [70], MoLA [71], LLaVA-MoLE [72], SiRA [73], Octavius [74], Fast LoRA [75], I-LoRA [76]
  Efficiency Improving (§5)
    Parameter Reduction (§5.1)
      Parameter Freezing (§5.1.1): LoRA-SP [77], LoRA-FA [78], AFLoRA [79], DropBP [80], LoRA-XS [81], BYOM-LoRA [56]
      Parameter Pruning (§5.1.2): LoRA-drop [82], LoRAprune [83], LoRAshear [84], Zhu et al. [85]
      Parameter Sharing (§5.1.3): VeRA [86], VB-LoRA [87]
    Parameter Quantization (§5.2)
      PTQ-based Methods (§5.2.1): QLoRA [88], QA-LoRA [89]
      QAT-based Methods (§5.2.2): LoftQ [90], ApiQ [91], L4Q [92]
    Parallel LoRA Computing Frameworks (§5.3)
      Parallel Fine-tuning (§5.3.1): ASPEN [93]
      Parallel Inference (§5.3.2): Punica [94], S-LoRA [95], CARASERVE [96]
  LoRA for Federated Learning (§6)
    Data Heterogeneity (§6.1): SLoRA [97], FeDeRA [98], FFA-LoRA [99]
    Device Heterogeneity (§6.2): FedMS [100], FlexLoRA [101], HETLORA [102]
    Model Heterogeneity (§6.3): pFedLoRA [103]
    Parameter Privacy (§6.4): Huang et al. [104], PrivateLoRA [105]
  Applications of LoRA (§7)
    Language Tasks (§7.1): Traditional NLP Task [106-114], Code Task [115-120], Model Alignment Task [121-128], Vertical Domain Task [129-140]
    Vision Task (§7.2): Image Generation Tasks [141-169], Image Segmentation Task [170-178]
    Multimodal Tasks (§7.3): Audio-Text [179], Image-Text [180, 181, 182], Video-Text [183, 184, 185]

Figure 1: The taxonomy of this paper.

The low intrinsic dimensionality hypothesis [186] states that over-parameterized models reside on a low intrinsic dimension, which suggests that proper learning performance can be achieved by updating only the parameters related to the intrinsic rank. Based on this hypothesis, LoRA [9] proposes to update the dense layers of a model with low-rank matrices, which achieves both parameter and computational efficiency. In this section, we first introduce the details of LoRA and then review existing works on the theoretical analysis of LoRA. Furthermore, we demonstrate LoRA’s efficiency in practice. Finally, we show that LoRA can be used in scenarios beyond fine-tuning.

2.1 LoRA

Given a dense neural network layer parameterized by $W_{0}\in\mathbb{R}^{d\times k}$, to adapt it to a downstream task, we update it with $\Delta W\in\mathbb{R}^{d\times k}$ and obtain an updated layer parameterized by $W=W_{0}+\Delta W$. For full fine-tuning, $\Delta W$ is computed based on the gradients of all $d\times k$ parameters of the layer, which is computationally expensive and requires a large amount of GPU memory for LLMs. To improve the computational efficiency, LoRA decomposes $\Delta W$ into two small matrices $B\in\mathbb{R}^{d\times r}$ and $A\in\mathbb{R}^{r\times k}$, i.e.,

$$W=W_{0}+\alpha BA \qquad (1)$$

where $r\ll\min\{d,k\}$, $A$ and $B$ are initialized with a random Gaussian distribution and zeros respectively, and $\alpha$ is the scaling factor that controls the strength of the update. The number of LoRA parameters is $r\times(d+k)$, which is significantly less than $d\times k$. Figure 2 (a) and (b) compare the structures of full fine-tuning and LoRA.
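As a minimal sketch of Equation (1), the following NumPy snippet applies a LoRA-augmented layer to one input vector and counts its trainable parameters. The sizes, the value of `alpha`, the 0.01 scale of the Gaussian initialization, and the helper name `lora_forward` are illustrative assumptions rather than values from the survey; the initialization follows the common convention described in Section 3.3.1 (Gaussian $A$, zero $B$), so the update $BA$ is zero before training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4          # illustrative sizes with r << min(d, k)
alpha = 0.5                  # illustrative scaling factor

W0 = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01    # project-down matrix, Gaussian init (assumed scale)
B = np.zeros((d, r))                      # project-up matrix, zero init

def lora_forward(x, W0, B, A, alpha):
    """Compute (W0 + alpha * B A) x without materializing the d x k update."""
    return W0 @ x + alpha * (B @ (A @ x))

x = rng.standard_normal(k)
y = lora_forward(x, W0, B, A, alpha)

lora_params = r * (d + k)    # trainable parameters held in B and A
full_params = d * k          # parameters updated by full fine-tuning
print(lora_params, full_params)
```

Because $B$ starts at zero, the layer initially reproduces the pre-trained output `W0 @ x`, and only the `r * (d + k)` entries of `B` and `A` would receive gradients during fine-tuning.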

LoRA is highly parameter-efficient because it trains only a small number of parameters, which reduces the memory and computational requirements of fine-tuning without increasing inference latency [187]. Furthermore, the parameter efficiency can be further improved by extending the low-rank matrices to low-rank tensors [188] or by combining LoRA with the Kronecker decomposition [189, 190]. Besides parameter efficiency, LoRA is also pluggable, since the LoRA parameters can be separated from the model after training. This pluggable character enables LoRA modules to be shared and reused by multiple users [191]. When we have LoRA plugins for multiple tasks, we can combine these plugins and expect proper cross-task generalization performance [58]. Besides, the low-rank mechanism of LoRA is compatible with other parameter-efficient methods, such as adapters [192, 193].
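The pluggable property can be illustrated with a short sketch: a plugin is folded into the base weight for inference and subtracted again to recover the original model. The helper names `merge` and `unmerge`, the sizes, and the random stand-in for a trained plugin are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 16, 12, 2, 1.0   # illustrative sizes and scaling factor

W0 = rng.standard_normal((d, k))  # frozen base weight
B = rng.standard_normal((d, r))   # random stand-in for a trained plugin
A = rng.standard_normal((r, k))

def merge(W0, B, A, alpha):
    """Fold the plugin into the base weight: a single matmul at inference."""
    return W0 + alpha * (B @ A)

def unmerge(W, B, A, alpha):
    """Remove the plugin again, recovering the base weight."""
    return W - alpha * (B @ A)

x = rng.standard_normal(k)
W = merge(W0, B, A, alpha)
y_merged = W @ x                                   # merged path
y_unmerged = W0 @ x + alpha * (B @ (A @ x))        # separate-plugin path
W_restored = unmerge(W, B, A, alpha)
```

Both paths produce the same output, which is why a merged plugin adds no inference latency, while keeping `B` and `A` separate is what allows the plugin to be stored, shared, or swapped.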

In practice, for a Transformer-based LLM, the dense layers typically comprise two types of weight matrices: the projection matrices in the attention modules and those in the feed-forward network (FFN) modules. In the original study, LoRA is applied to the weight matrices of the attention layers. Subsequent work shows that also applying it to the FFN layers can further improve model performance [194].

2.2 Theoretical Analysis

Figure 2: An illustration of full fine-tuning (a), LoRA (b) and its variants for improving downstream adaptation, which includes breaking the low-rank bottleneck (c) and dynamic rank allocation (d).

To understand why LoRA is effective and how it can be made more effective, several works have provided theoretical analyses from various aspects. To answer the question of why LoRA is effective, Malladi et al. [10] analyze the fine-tuning dynamics of LoRA from the kernel view and demonstrate that, in the lazy regime, LoRA fine-tuning is nearly equivalent to full fine-tuning. Besides, Zeng et al. [14] provide a theoretical analysis of LoRA’s expressive power for both fully connected neural networks (FNNs) and Transformer networks (TFNs). They prove that, for FNNs, LoRA can adapt any model $f$ to accurately represent any smaller target model $\bar{f}$ if $\text{LoRA-rank}\geq(\text{width of } f)\times\frac{\text{depth of } \bar{f}}{\text{depth of } f}$, under a mild assumption. Moreover, they quantify the approximation error when the LoRA-rank falls below this threshold. Regarding TFNs, they show that any model can be adapted to a target model of equivalent size using LoRA of rank $\left(\frac{\text{embedding size}}{2}\right)$. Additionally, Koubbi et al. [11] utilize the mathematical framework for Transformers established by [195, 196, 197] to investigate the effect of low-rank perturbations in attention parameters.

As to the question of how LoRA can be more effective, Jang et al. [12] analyze the fine-tuning of LoRA within the neural tangent kernel (NTK) [198] framework, showing that employing a rank $r\gtrsim\sqrt{N}$ in LoRA helps to avoid spurious local minima and facilitates the discovery of low-rank solutions that exhibit good generalization. Besides, Zhu et al. [13] observe that the project-down matrix $A$ is utilized for extracting features from the input, while the project-up matrix $B$ employs these features to create the desired output. Based on this observation, they demonstrate that freezing the project-down matrix $A$ while tuning only the project-up matrix $B$ leads to better generalization compared to tuning both matrices, in addition to achieving a $2\times$ reduction in parameters.

2.3 Efficiency in Practice

The computational efficiency of LoRA is significantly higher than that of full fine-tuning. Taking the dense weight matrix of the first FFN layer in LLaMA2-7B as an example, full fine-tuning needs to tune $11{,}008\times 4{,}096=45{,}088{,}768$ parameters, while LoRA only needs to tune $(11{,}008\times 4)+(4\times 4{,}096)=60{,}416$ parameters when $r=4$. For this layer, LoRA adjusts only about one-thousandth of the parameters that full fine-tuning does.
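The arithmetic above can be reproduced in a few lines of plain Python; the variable names are illustrative, and the dimensions are those quoted in the text for the first FFN layer of LLaMA2-7B.

```python
# Dimensions quoted in the text for the first FFN weight matrix of LLaMA2-7B.
d, k, r = 11_008, 4_096, 4

full_params = d * k           # parameters tuned by full fine-tuning
lora_params = r * (d + k)     # parameters tuned by LoRA: B is d x r, A is r x k
ratio = lora_params / full_params

print(full_params, lora_params, f"{ratio:.5f}")
```

The ratio comes out slightly above 0.001, which matches the "about one-thousandth" figure; it scales linearly with the chosen rank `r`.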

LoRA can significantly decrease the memory usage of fine-tuning an LLM, which can be divided into four parts: (1) Model Memory (Weight Memory): the memory required to store the model weights; (2) Activation Memory: the memory occupied by intermediate activations during forward propagation. It mainly depends on factors such as batch size and sequence length; (3) Gradient Memory: the memory required to store gradients during backpropagation. The gradients are only calculated for trainable parameters; (4) Optimization Memory: the memory used to store optimizer states. For example, the Adam optimizer stores the “first moment” and “second moment” of trainable parameters.

Reference [4] provides a comprehensive empirical comparison between full fine-tuning and LoRA fine-tuning on a LLaMA2-7B model with batch size 1, utilizing a single NVIDIA RTX4090 (24GB) GPU. According to this study, full fine-tuning requires approximately 60GB of memory, which exceeds the capacity of an RTX4090 GPU; by contrast, LoRA fine-tuning only needs about 23GB of memory. LoRA thus significantly reduces memory usage and makes fine-tuning LLaMA2-7B feasible on a single NVIDIA RTX4090 (24GB) GPU. Specifically, due to fewer trainable parameters, optimization memory and gradient memory decrease significantly, by approximately 25GB and 14GB respectively. On the other hand, LoRA introduces additional “incremental parameters”, resulting in slight increases in activation memory and weight memory (totaling about 2GB), but this increase is negligible compared with the overall reduction in memory. Moreover, the reduced memory usage accelerates forward propagation: LoRA is $1.9\times$ faster than full fine-tuning.

2.4 Beyond Fine-tuning

Besides fine-tuning, LoRA can be applied to other learning paradigms, such as pre-training [15, 17] and continual training [18]. For pre-training, ReLoRA [15] and MoRA [16] use low-rank updates to train high-rank networks; moreover, LTE [17] performs parallel training of multiple low-rank heads across computing nodes to minimize the need for frequent synchronization, which facilitates the utilization of LoRA in pre-training. As for continual training, several methods have been proposed to address the catastrophic forgetting problem. InfLoRA [18] addresses catastrophic forgetting by reparameterizing pre-trained weights with a minimal set of parameters in a subspace. GS-LoRA [19] uses group sparse regularization to automatically select specific LoRA groups while zeroing out others to mitigate catastrophic forgetting. I-LoRA [20] leverages dual-memory experience replay combined with LoRA parameter interpolation to combat catastrophic forgetting.

Furthermore, LoRA can be used to overcome the limited context size of LLMs [3, 21]. For instance, LongLoRA [3] extends the context window of LLaMA2-7B [199] from 4k to 100k tokens in a computationally efficient manner by combining LoRA with shifted sparse attention. However, LongLoRA does not match the efficiency of vanilla attention due to chaotic attention head structures and unnecessary information exchange between token groups. To address these issues, SinkLoRA [21] introduces Sink Fixed Attention (SF-Attn), which proportionally returns cyclically shifted groups of attention heads to their un-shifted state and achieves proper performance.

3 Downstream Adaptation Improving

Although LoRA can achieve proper adaptation performance on some downstream tasks, there is still a performance gap between LoRA and full fine-tuning on many downstream tasks, such as mathematical reasoning [200, 201, 202]. To fill this gap, many methods have been proposed to further improve the downstream adaptation performance of LoRA. Typically, existing methods improve the downstream adaptation performance from the following perspectives: (1) breaking the low-rank bottleneck (see Figure 2 (c)); (2) adaptively allocating the ranks of different LoRA modules (see Figure 2 (d)); (3) optimizing the learning procedure of LoRA; (4) combining with other learning paradigms. In this section, we introduce these four types of methods in turn.

3.1 Breaking the Low-rank Bottleneck

The low-rank updates enable LoRA to be parameter-efficient; however, they restrict LLMs’ ability to memorize downstream knowledge and to generalize on downstream tasks [16, 203, 204, 201, 202]. Compared to full fine-tuning, this low-rank limitation causes inferior performance of LoRA in knowledge- and skill-intensive domains, such as code and math. An experimental study [202] demonstrates that the rank of the updates in full fine-tuning is significantly ($10$-$100\times$) higher than that in LoRA, and that increasing the rank of the LoRA update can narrow the performance gap between LoRA and full fine-tuning. To increase the rank of LoRA and improve its performance, several methods have been proposed [15, 22, 25, 205], which typically increase the rank through (1) stacking LoRAs along learning iterations; (2) updating as gradient compressors; (3) co-updating LLM and LoRA modules during fine-tuning.

3.1.1 Stacking LoRAs along Fine-tuning

Matrix rank is subadditive, i.e., $\mathrm{rank}(M_{1}+M_{2})\leq \mathrm{rank}(M_{1})+\mathrm{rank}(M_{2})$ for matrices $M_{1}$ and $M_{2}$ that have the same size. Based on this subadditivity, we can aggregate multiple LoRA modules to increase the rank and break the low-rank bottleneck. Following this idea, ReLoRA [15] proposes a merge-and-reinit procedure for LoRA, which periodically merges the LoRA modules into the LLM and then reinitializes the LoRA modules during fine-tuning. This is equivalent to stacking multiple LoRA modules along the fine-tuning process and can increase the rank of the overall update. Similarly, COLA [22] proposes another merge-and-reinit method based on the Frank-Wolfe algorithm [206]. However, MELoRA [23] points out that the merge-and-reinit procedure does not necessarily guarantee an increase in rank, because there can be overlap between the series of LoRA modules along fine-tuning. To solve this problem, MELoRA proposes to decompose the LoRA modules into smaller mini LoRAs and then stack these mini LoRAs in parallel, whose effectiveness in increasing the rank is theoretically verified.
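The effect of stacking can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the ReLoRA or MELoRA procedure itself: each "round" uses a freshly sampled random rank-$r$ product in place of a trained LoRA module, and the overlapping case is simulated by reusing one project-up matrix across rounds.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r, n_rounds = 32, 24, 2, 4   # illustrative sizes

def random_lora_update(rng, d, k, r):
    """Random stand-in for one trained rank-r LoRA module B @ A."""
    B = rng.standard_normal((d, r))
    A = rng.standard_normal((r, k))
    return B @ A

# Merge-and-reinit: each round's update is merged, then a fresh module starts.
merged = np.zeros((d, k))
ranks = []
for _ in range(n_rounds):
    merged += random_lora_update(rng, d, k, r)
    ranks.append(int(np.linalg.matrix_rank(merged)))

# Overlapping modules: every round shares the same column space (same B),
# so the rank of the accumulated update does not grow.
B_fixed = rng.standard_normal((d, r))
overlap = sum(B_fixed @ rng.standard_normal((r, k)) for _ in range(n_rounds))
overlap_rank = int(np.linalg.matrix_rank(overlap))

print(ranks, overlap_rank)
```

With independent modules the accumulated rank grows by $r$ per round up to the subadditivity bound, whereas fully overlapping modules stay at rank $r$, which is the failure mode that MELoRA points out.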

3.1.2 Updating as Gradient Compressor

The above methods break the low-rank bottleneck in the parameter space. As a supplement, FLoRA [24] finds that LoRA performs a fixed random projection to compress gradients and restricts the total weight matrix change to low-rank. To overcome this low-rank bottleneck in gradient space, FLoRA proposes to resample the random projection, which is demonstrated to largely recover the performance of full-matrix SGD.
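The benefit of resampling can be illustrated with a simplified sketch, which is not FLoRA's exact algorithm: gradients are compressed with a random projection and decompressed with its transpose, once with a fixed projection and once with a freshly sampled one per step. The helper `compressed_update`, the Gaussian projections, the random stand-in gradients and the sizes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r, steps = 20, 16, 2, 5   # illustrative sizes

def compressed_update(G, P):
    """Project gradient G (d x k) down with P (r x k), then back up.

    Dividing by r makes the result an unbiased estimate of G when the
    entries of P are i.i.d. standard normal.
    """
    return (G @ P.T) @ P / P.shape[0]

grads = [rng.standard_normal((d, k)) for _ in range(steps)]  # stand-in gradients

# Fixed projection: every step lives in the same r-dimensional row space.
P_fixed = rng.standard_normal((r, k))
fixed_total = sum(compressed_update(G, P_fixed) for G in grads)

# Resampled projection: each step uses a new random subspace.
resampled_total = sum(
    compressed_update(G, rng.standard_normal((r, k))) for G in grads
)

fixed_rank = int(np.linalg.matrix_rank(fixed_total))
resampled_rank = int(np.linalg.matrix_rank(resampled_total))
print(fixed_rank, resampled_rank)
```

In this toy setting the total weight change stays at rank $r$ under a fixed projection but reaches rank `steps * r` when the projection is resampled, while the memory needed per step is still that of an $r$-dimensional compressed gradient.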

3.1.3 Co-updating LLM and LoRA

The above two kinds of methods focus on improving the representation ability of LoRA itself. Different from them, Delta-LoRA [25] proposes to jointly update the LLM and the LoRA modules, which directly updates the high-rank LLM and can gain better representation capability than updating LoRA independently. Delta-LoRA updates the LLM based on the difference between the LoRA modules of two consecutive iterations, which enables it to update the LLM without any extra memory.
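A minimal sketch of this update rule is given below, assuming a hypothetical step-size hyperparameter `lam` and using random perturbations as a stand-in for one optimizer step on the LoRA factors; the function name `delta_lora_step` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, r = 12, 10, 2
lam = 0.5   # hypothetical scaling of the weight update

W = rng.standard_normal((d, k))        # pre-trained weight, no gradient stored
B_old = rng.standard_normal((d, r))
A_old = rng.standard_normal((r, k))

# Stand-in for one optimizer step on the LoRA factors.
B_new = B_old - 0.1 * rng.standard_normal((d, r))
A_new = A_old - 0.1 * rng.standard_normal((r, k))

def delta_lora_step(W, B_old, A_old, B_new, A_new, lam):
    """Update W with the change of the low-rank product between two iterations."""
    delta = B_new @ A_new - B_old @ A_old
    return W + lam * delta

W_next = delta_lora_step(W, B_old, A_old, B_new, A_new, lam)
update_rank = int(np.linalg.matrix_rank(W_next - W))
print(update_rank)
```

The weight update is derived entirely from quantities LoRA already holds, so no gradient or optimizer state for `W` is required; each single step has rank at most $2r$, and the rank of the accumulated change in `W` can grow over many steps.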

3.2 Dynamic Rank Allocation

For the rank of LoRA, higher is not always better. Excessive LoRA ranks may degrade both performance and efficiency. Furthermore, the importance of weights can vary across different layers of a Transformer model during fine-tuning, requiring different ranks for each layer [26, 31, 29, 207]. Therefore, assigning the same rank to the LoRA modules of different layers is not the optimal choice. It is better to adaptively allocate ranks to the LoRA modules of different layers. Existing methods adaptively allocate ranks for LoRA modules from the perspectives of (1) singular value decomposition (SVD); (2) single-rank decomposition (SRD); (3) rank sampling.

3.2.1 SVD-based Methods

Decomposing a matrix with singular value decomposition (SVD) and selectively truncating its singular values is an effective way to control the rank of the matrix. Inspired by SVD, we can decompose the LoRA parameter matrix $BA$ into an SVD form, i.e., $P\Lambda Q$, where $P$ and $Q$ are orthogonal and $\Lambda$ is a non-negative diagonal matrix. By controlling the elements in $\Lambda$, we can control the rank of $BA$ and allocate ranks for LoRA modules. Following this idea, several rank allocation methods approximate the SVD of $BA$ and allocate the ranks by filtering the diagonal matrix. For instance, AdaLoRA [26] approximates the SVD by regularizing the orthogonality of $P$ and $Q$. Then, it drops unimportant singular values based on novel importance scoring methods. Similarly, SaLoRA [27] also introduces an orthogonality regularization for $P$ and $Q$; by contrast, it drops unimportant singular values based on the $L_{0}$ norm. However, the above methods are not efficient enough because they start with a high rank and then reduce the rank iteratively, which requires a pre-defined rank budget [28]. To solve this problem, IncreLoRA [28] proposes to start from a single rank and then automatically increase the rank based on a heuristic importance score, where the orthogonality regularization is also involved while the elements in $\Lambda$ are not required to be non-negative.
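The SVD-form parameterization can be sketched as follows. This is an illustrative outline rather than the exact procedure of AdaLoRA, SaLoRA or IncreLoRA: the importance score is replaced by the placeholder $|\lambda_i|$, and the names `orthogonality_penalty` and `prune_to_budget` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, r = 16, 12, 6   # illustrative sizes

P = rng.standard_normal((d, r))      # left factor (ideally orthonormal columns)
lam = rng.standard_normal(r)         # diagonal entries of Lambda
Q = rng.standard_normal((r, k))      # right factor (ideally orthonormal rows)

def orthogonality_penalty(P, Q):
    """Regularizer ||P^T P - I||_F^2 + ||Q Q^T - I||_F^2 pushing P, Q toward orthogonality."""
    I = np.eye(P.shape[1])
    return np.linalg.norm(P.T @ P - I) ** 2 + np.linalg.norm(Q @ Q.T - I) ** 2

def prune_to_budget(lam, scores, budget):
    """Zero out all but the `budget` highest-scoring diagonal entries."""
    keep = np.argsort(scores)[-budget:]
    mask = np.zeros_like(lam)
    mask[keep] = 1.0
    return lam * mask

scores = np.abs(lam)                       # placeholder importance score
lam_pruned = prune_to_budget(lam, scores, budget=3)
delta_W = P @ np.diag(lam_pruned) @ Q      # update with the allocated rank
rank_after = int(np.linalg.matrix_rank(delta_W))
print(orthogonality_penalty(P, Q), rank_after)
```

In training, the penalty would be added to the task loss, and masking entries of $\Lambda$ changes the rank of the module without touching `P` or `Q`, so a pruned direction could be reactivated later.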

3.2.2 SRD-based Methods

However, the orthogonality regularization brings non-negligible computational costs for LoRA and degrades its efficiency. To address this problem, several methods omit the orthogonality requirement of SVD and directly decompose $BA$ into single-rank components. Then, they allocate the ranks by selecting the proper components. DoRA (Dynamic Low-Rank Adaptation) [29] decomposes the LoRA parameter matrix $BA$ into single-rank components and prunes the components based on a heuristic importance score. Similarly, AutoLoRA [30] also decomposes the LoRA parameter matrix $BA$ into single-rank components, but it prunes the components based on meta-learning. SoRA [31] eliminates the orthogonality regularization and filters the columns and rows of $P$ and $Q$ (their combinations can be regarded as single-rank components) by directly controlling the diagonal matrix. It controls the diagonal matrix by formulating its elements as a set of learnable gating units, which are updated in the fine-tuning procedure. ALoRA [32] also filters the components by using gating units; by contrast, it learns the gating units based on neural architecture search [208].
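The single-rank view can be made concrete with a short sketch: $BA$ equals the sum of the outer products of the $i$-th column of $B$ and the $i$-th row of $A$, and rank allocation amounts to choosing which components to keep. The Frobenius-norm score used here is only a placeholder for the heuristic, meta-learned or gated criteria of the cited methods, and `keep_top` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, r = 16, 12, 5   # illustrative sizes
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))

# BA as a sum of r single-rank components b_i a_i^T.
components = [np.outer(B[:, i], A[i, :]) for i in range(r)]
reconstructed = sum(components)

# Placeholder importance score: Frobenius norm of each component.
scores = np.array([np.linalg.norm(c) for c in components])

def keep_top(B, A, scores, n_keep):
    """Keep the n_keep highest-scoring components, dropping the rest."""
    idx = np.sort(np.argsort(scores)[-n_keep:])
    return B[:, idx], A[idx, :]

B_kept, A_kept = keep_top(B, A, scores, n_keep=3)
rank_kept = int(np.linalg.matrix_rank(B_kept @ A_kept))
print(B_kept.shape, A_kept.shape, rank_kept)
```

Since no orthogonality constraint is imposed, selecting components only requires indexing columns of `B` and rows of `A`, which avoids the regularization cost of the SVD-based methods.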

3.2.3 Rank Sampling-based Methods

In the SVD- and SRD-based methods, extra computational cost is needed to search for proper ranks. To avoid this extra cost, DyLoRA [33] points out that we can allocate ranks directly by random sampling. In each training step, it samples a value $b$ from a pre-defined discrete distribution and uses $b$ as the rank. Then, the matrices $A$ and $B$ are truncated to rank $b$. In the fine-tuning procedure, only the parameters in the $b$-th row of $A$ and the $b$-th column of $B$ are tunable, while the other parameters are frozen. Besides, the distribution can be defined based on users’ preferences.
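One training step of this rank-sampling scheme might be sketched as below. The uniform distribution over ranks, the maximum rank `r_max`, and the helper names are assumptions for illustration; the masks encode the rule stated above that only the $b$-th row of $A$ and the $b$-th column of $B$ are tunable.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, r_max = 16, 12, 8   # illustrative sizes
B = rng.standard_normal((d, r_max))
A = rng.standard_normal((r_max, k))

rank_choices = np.arange(1, r_max + 1)
rank_probs = np.full(r_max, 1.0 / r_max)   # uniform here; could encode user preference

def sample_rank(rng):
    """Draw the rank b for the current training step."""
    return int(rng.choice(rank_choices, p=rank_probs))

def truncate(B, A, b):
    """Keep the first b columns of B and the first b rows of A."""
    return B[:, :b], A[:b, :]

def trainable_mask(shape_B, shape_A, b):
    """Mark only the b-th column of B and the b-th row of A as tunable."""
    mask_B = np.zeros(shape_B, dtype=bool)
    mask_A = np.zeros(shape_A, dtype=bool)
    mask_B[:, b - 1] = True
    mask_A[b - 1, :] = True
    return mask_B, mask_A

b = sample_rank(rng)
B_b, A_b = truncate(B, A, b)                 # factors used in the forward pass
mask_B, mask_A = trainable_mask(B.shape, A.shape, b)
print(b, B_b.shape, A_b.shape)
```

Because every prefix of the factors is trained at some point, the resulting module can afterwards be truncated to any rank up to `r_max` without a separate rank search.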

3.3 Optimizing the Learning Procedure

In practice, LoRA converges more slowly than full fine-tuning. Moreover, it is sensitive to hyperparameters and suffers from overfitting. These issues affect LoRA’s efficiency and hinder its downstream adaptation performance. To address these issues, researchers have developed several approaches to optimize the learning procedure of LoRA, which can be categorized into the following three types: (1) Initialization Improvement; (2) Gradient Update Optimization; (3) Overfitting Mitigation.

3.3.1 Initialization Improvement

LoRA usually initializes its parameter matrices A and B using Gaussian noise and zeros respectively. There are two simple schemes: Init[A], which sets matrix B to zero and randomly initializes matrix A, and Init[B], which does the reverse. Hayou et al. [34] compare these two schemes and conclude through theoretical analysis that Init[A] is better. Their analysis reveals that Init[A] allows a larger learning rate to be used without causing instability, making the learning process more efficient. However, even with Init[A], random initialization still results in small initial gradients, leading to slower convergence. To solve this, PiSSA [35] initializes LoRA with the principal singular components of the pre-trained matrix. Since the principal singular components represent the most significant directions in the matrix, aligning the initial weights with these components can accelerate convergence and improve performance. In contrast, MiLoRA [36] initializes LoRA with the minor singular components. Since random initialization of the low-rank matrices can interfere with the important features learned in the pre-trained matrix, MiLoRA reduces this interference and thereby improves overall performance while adapting to new tasks.
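The two SVD-based initializations can be contrasted in a short sketch. It assumes one common way of splitting the singular values (the square root assigned to each factor) and keeps the remainder of the pre-trained matrix as a frozen residual; the helper `svd_init` and the sizes are illustrative, and details may differ from the PiSSA and MiLoRA implementations.

```python
import numpy as np

rng = np.random.default_rng(8)
d, k, r = 16, 12, 3   # illustrative sizes
W0 = rng.standard_normal((d, k))   # stand-in for a pre-trained weight

# Thin SVD; NumPy returns singular values in descending order.
U, S, Vt = np.linalg.svd(W0, full_matrices=False)

def svd_init(U, S, Vt, idx):
    """Build B (d x r) and A (r x k) from the singular triplets selected by idx."""
    root = np.sqrt(S[idx])
    B = U[:, idx] * root
    A = root[:, None] * Vt[idx, :]
    return B, A

principal = np.arange(r)                   # largest r singular values (PiSSA-style)
minor = np.arange(len(S) - r, len(S))      # smallest r singular values (MiLoRA-style)

B_p, A_p = svd_init(U, S, Vt, principal)
B_m, A_m = svd_init(U, S, Vt, minor)

# The frozen part is the residual, so residual + B A equals W0 at the start.
W_res_p = W0 - B_p @ A_p
W_res_m = W0 - B_m @ A_m
print(np.linalg.norm(B_p @ A_p), np.linalg.norm(B_m @ A_m))
```

In both cases the model output is unchanged at initialization; the difference is which directions are trainable: the dominant ones under the principal choice, or the least significant ones under the minor choice, which leaves the dominant pre-trained directions frozen.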

3.3.2 Gradient Update Optimization

To further enhance the convergence and reliability of LoRA, several studies have proposed improvements from the perspective of gradient updates. Literature [37] introduces a scaled gradient method based on Riemannian optimization, which incorporates an $r\times r$ preconditioner term in the gradient update step to improve the convergence and hyperparameter robustness of LoRA. Through theoretical analysis, LoRA+ [38] finds that matrices A and B should be given different learning rates with a fixed ratio to achieve stable feature learning and accelerate convergence. ResLoRA [39] introduces residual connections into LoRA to optimize the gradient propagation path, speeding up training convergence and enhancing model performance. Similarly, SIBO [40] mitigates over-smoothing by injecting residual connections of initial token representations into LoRA’s input. Additionally, to further reduce computational resources, literature [41] employs gradient-free optimization methods such as CMA-ES and FWA to optimize LoRA, demonstrating competitive performance in few-shot NLU tasks. Besides, DoRA (Weight-Decomposed Low-Rank Adaptation) [42] constrains the gradient update, focusing on the directional change of the parameters. It decomposes the pre-trained weight into two components, direction and magnitude, and applies LoRA only to the direction component to enhance training stability.
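As an illustration of the magnitude/direction decomposition, the following sketch rebuilds a weight from a per-column magnitude vector and a normalized direction to which the low-rank update is added. The column-wise norm and the initialization of the magnitude from the pre-trained weight are assumptions of this sketch, and `dora_weight` is a hypothetical helper name.

```python
import numpy as np

def dora_weight(W0, B, A, m):
    """Recompose a weight as magnitude * direction, with LoRA applied to the direction."""
    V = W0 + B @ A                                        # direction before normalization
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # column-wise norm, shape (1, k)
    return m * V / col_norm

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 6))                              # stand-in for a pre-trained weight
A = rng.normal(size=(2, 6))
B = np.zeros((8, 2))                                      # zero init: no change at the start
m = np.linalg.norm(W0, axis=0, keepdims=True)             # magnitude taken from W0 (assumption)

W_init = dora_weight(W0, B, A, m)                         # equals W0 at initialization
```

Whatever values `B` and `A` take later, the column norms of the recomposed weight stay equal to `m`, so the low-rank update can only change directions.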

3.3.3 Overfitting Mitigation

Although LoRA effectively reduces the number of trainable parameters compared to full fine-tuning, some studies have shown that LoRA is also prone to overfitting [45], which contradicts previous views. To address this issue, BiLoRA [43] adopts a bi-level optimization strategy. It alternately trains the singular vectors and singular values of the low-rank increment matrix on different subsets of the training data. This approach avoids the simultaneous optimization of parameters at different levels on a single dataset, thus mitigating overfitting. In addition, literature [44] applies dropout to LoRA parameters to reduce overfitting, while HiddenKey [45] employs column-wise dropout for attention layers and element-wise dropout for feedforward layers.

3.4 Combining with other Learning Paradigms

LoRA is compatible with other learning paradigms, such as Bayesian Learning, In-context Learning and Active Learning. Combining LoRA with these learning paradigms can address several problems that hurt the downstream adaptation performance. For example, by combining with Bayesian Learning, Laplace-LoRA [46] can relieve the overconfidence that occurs in downstream adaptation. By combining with In-context Learning, PILLOW [47] aims to solve the low-resource dilemmas in some downstream tasks. By combining with Active Learning, STAR [48] can effectively improve data efficiency.

4 Cross-task Generalization

LoRA’s pluggable nature enables users to accumulate LoRA plugins for different tasks. For example, on Hugging Face (https://huggingface.co/models?p=8&sort=trendingsearch=lorahub), there are more than 300 LoRA plugins compatible with Flan-T5 for different tasks. These accumulated LoRA plugins can not only be utilized independently but also be mixed to achieve cross-task generalization [58]. Mixing multiple LoRA plugins together, namely LoRA mixture, has been widely applied in areas requiring cross-task generalization, such as multi-task learning, domain adaptation, and continual learning. Existing LoRA mixture methods can be categorized into (1) mixture with manually designed weights; (2) mixture with learnt weights; (3) mixture of LoRA experts. This section introduces each category of methods respectively, as shown in Fig. 3.

Figure 3: An illustration of LoRA mixture methods.

4.1 Mixture with Manually Designed Weights

Early LoRA mixture methods attempt to linearly combine different LoRA plugins with manually designed weights. Some research demonstrates that proper cross-task generalization ability can be achieved by simply averaging the plugins or their related outputs [49, 50, 51]. Furthermore, several methods have been proposed to further improve the performance of the LoRA mixture by adopting manually designed weights. For example, ControlPE [52], [53] and [54] set the weight factors as hyperparameters, and ControlPE uses hyperparameter search to determine the optimal combination of two LoRA plugins. Additionally, Token-level Adaptation [55] utilizes cosine similarity between the input feature and the adapter dataset center as weight factors, while BYOM [56] applies basic model fusion methods such as Task Arithmetic, Fisher-Merging, and RegMean.
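A linear mixture of plugins, with either uniform weights or similarity-based weights, might look like the sketch below. The softmax normalization of the cosine similarities and its `temperature` parameter are assumptions added so that the weights sum to one; the helper names are hypothetical.

```python
import numpy as np

def mix_plugins(plugins, weights):
    """Linear combination of LoRA updates: sum_i w_i * (B_i @ A_i)."""
    return sum(w * (B @ A) for (B, A), w in zip(plugins, weights))

def similarity_weights(x, centers, temperature=1.0):
    """Weights from cosine similarity between input feature x and each dataset center."""
    sims = np.array([x @ c / (np.linalg.norm(x) * np.linalg.norm(c)) for c in centers])
    z = np.exp((sims - sims.max()) / temperature)   # softmax normalization (assumption)
    return z / z.sum()

rng = np.random.default_rng(0)
plugins = [(rng.normal(size=(8, 2)), rng.normal(size=(2, 6))) for _ in range(3)]

# Simple averaging of the plugins.
uniform = mix_plugins(plugins, [1 / 3] * 3)

# Similarity-based weights: the input lies close to the center of plugin 1.
centers = [rng.normal(size=6) for _ in range(3)]
x = centers[1] + 0.01 * rng.normal(size=6)
w = similarity_weights(x, centers)
adaptive = mix_plugins(plugins, w)
```

No training is involved in either variant, which is what makes this family of methods cheap to apply.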

Mixture with manually designed weights can quickly mix multiple LoRAs without extra training, which demonstrates simplicity and computational efficiency. However, it often fails to find the optimal weights, leading to unstable performance and limited generalization. Subsequently, researchers have explored using learning-based methods to achieve more precise and adaptive mixtures.

4.2 Mixture with Learnt Weights

To learn the optimal mixture weights, several methods have been proposed at task level, instance level and token level to meet different needs. Task-level methods focus on enhancing task transferability, which can be either gradient-based, such as [57], or gradient-free, as seen in LoRAHub [58]. LoRAHub employs a black-box algorithm named CMA-ES [209] to optimize weight factors for LoRA plugins, simplifying the training process. Later, ComPEFT [59] and L-LoRA [60] use LoRAHub to mix quantized LoRA plugins, further improving computational efficiency.
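To make the gradient-free, task-level setting concrete, the sketch below searches for mixture weights using only loss evaluations. Note that it swaps in a plain random hill-climbing search instead of CMA-ES, and the quadratic toy loss stands in for the few-shot task loss; all names and sizes are hypothetical.

```python
import numpy as np

def random_search_weights(loss_fn, n_plugins, n_iter=300, sigma=0.3, seed=0):
    """Black-box search over mixture weights (simple hill climbing, not CMA-ES)."""
    rng = np.random.default_rng(seed)
    best_w = np.zeros(n_plugins)
    best_loss = loss_fn(best_w)
    for _ in range(n_iter):
        cand = best_w + sigma * rng.normal(size=n_plugins)   # perturb current best
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:                            # keep only improvements
            best_w, best_loss = cand, cand_loss
    return best_w, best_loss

rng = np.random.default_rng(1)
deltas = [rng.normal(size=(8, 6)) for _ in range(3)]         # stand-ins for B_i @ A_i
target = 0.7 * deltas[0] + 0.3 * deltas[2]                   # toy unseen task

def toy_loss(w):
    """Squared distance between the mixed update and the toy target update."""
    mixed = sum(wi * d for wi, d in zip(w, deltas))
    return float(np.sum((mixed - target) ** 2))

w_best, loss_best = random_search_weights(toy_loss, n_plugins=3)
```

Because only `loss_fn` is queried, no gradients flow through the LLM or the plugins, which is the property that simplifies training in the gradient-free setting.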

Compared to task-level methods, instance-level and token-level methods can provide flexibility and precision for complex inputs. For multimodal instruction tuning, MixLoRA [61] dynamically chooses appropriate low-rank decomposition vectors based on the input instance, which are then integrated into LoRA matrices for training. To conduct protein mechanics analysis and design tasks, X-LoRA [62] develops a dynamic gating mechanism to assign weights for LoRA plugins at the token level and layer granularity. These approaches demonstrate better performance in specific tasks or application scenarios.

4.3 Mixture of LoRA Experts

When the LoRA plugins are trainable, we can jointly learn the mixture weights and the LoRA plugins, which can further improve the performance of the LoRA mixture. To jointly learn the mixture weights and LoRA plugins, Mixture of LoRA Experts (LoRA MoE) is a natural choice, where each LoRA plugin acts as an expert, while a router network typically assigns the mixture weights. LoRA MoE has been proven to be effective in many tasks, such as continual learning[63, 64], vision-language tasks[65] and multi-task medical applications[66].
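A forward pass through a LoRA MoE layer can be sketched as follows, with a linear router producing per-token gates over the experts. The softmax router, the optional top-k sparsification and all shapes are assumptions of this sketch rather than the design of any specific method cited above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lora_moe_forward(x, W0, experts, W_router, top_k=None):
    """Frozen base layer plus a gated sum of LoRA experts.

    x: (n, k) token features, W0: (d, k), experts: list of (B, A) with
    B: (d, r) and A: (r, k), W_router: (k, n_experts).
    """
    gates = softmax(x @ W_router)                       # (n, n_experts)
    if top_k is not None:
        # Keep the top-k gates per token and renormalize them.
        thresh = np.sort(gates, axis=-1)[:, -top_k][:, None]
        gates = np.where(gates >= thresh, gates, 0.0)
        gates = gates / gates.sum(axis=-1, keepdims=True)
    out = x @ W0.T
    for e, (B, A) in enumerate(experts):
        out += gates[:, [e]] * ((x @ A.T) @ B.T)        # expert output weighted by its gate
    return out, gates

rng = np.random.default_rng(0)
n, k, d, r, n_experts = 5, 6, 8, 2, 3
x = rng.normal(size=(n, k))
W0 = rng.normal(size=(d, k))
experts = [(rng.normal(size=(d, r)), rng.normal(size=(r, k))) for _ in range(n_experts)]
W_router = rng.normal(size=(k, n_experts))

y_dense, g_dense = lora_moe_forward(x, W0, experts, W_router)
y_sparse, g_sparse = lora_moe_forward(x, W0, experts, W_router, top_k=1)
```

In joint training, both `W_router` and the expert matrices would receive gradients, while `W0` stays frozen.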

Existing methods improve the performance of LoRA MoE from the perspectives of initialization, task relationship management and efficiency. For initialization, Mixture-of-LoRAs [67] first trains multiple LoRAs separately as initialization and then optimizes the router and LoRAs jointly. MultiLoRA [68] proposes refining the initialization to reduce parameter dependency, which can yield more balanced unitary subspaces. For task relationship management, MLoRE [69] adds a low-rank convolution path in the MoE structure to capture global task relationships. MTLoRA [70] adopts both task-agnostic and task-specific LoRA modules to address task conflicts. For efficiency, MoLA [71] adaptively allocates different numbers of LoRA experts to different layers of the Transformer model to reduce the number of LoRA modules. LLaVA-MoLE [72] and SiRA [73] leverage sparse computation to reduce computational cost. Additionally, Octavius [74] sparsely activates independent LoRA experts with instance-level instructions to mitigate task interference and improve efficiency. Fast LoRA [75] allows each sample in a minibatch to have its unique low-rank adapters, enabling efficient batching.

Besides, some methods are not explicitly based on MoE but follow MoE ideas. For example, I-LoRA [76] uses two LoRAs to manage long-term and short-term memory for continual learning, respectively.

5 Efficiency Improving

With the popularization of LLMs, the demand for training and running LoRA plugins increases rapidly. This increasing demand brings a non-negligible computational burden; thus, for LoRA, the smaller and the faster, the better. To meet this demand, existing methods improve the computational efficiency of LoRA from the perspectives of (1) parameter reduction; (2) parameter quantization; (3) parallel LoRA computing frameworks.

5.1 Parameter Reduction

LoRA significantly reduces the number of tunable parameters for fine-tuning LLMs. However, it still requires expensive activation memory to update low-rank matrices. To further reduce the memory cost, existing methods reduce the number of tunable parameters of LoRA via parameter freezing, parameter pruning, and parameter sharing.

5.1.1 Parameter Freezing

Parameter freezing methods reduce the number of tunable parameters for LoRA via freezing some of its parameters. They can be divided into two categories: intra-parameter methods and extra-parameter methods.

The intra-parameter methods tune a subset of parameters of LoRA while freezing the others. LoRA-SP [77] randomly selects half of the LoRA parameters to freeze during fine-tuning. LoRA-FA [78] freezes the down-projection weights and updates the up-projection weights in each layer of LoRA. AFLoRA [79] constructs a low-rank trainable path and gradually freezes parameters while training LoRA. Additionally, DropBP [80] accelerates the training process by randomly dropping some LoRA gradient calculations during backpropagation.

By contrast, the extra-parameter methods introduce and tune a set of extra parameters while freezing the original parameters of LoRA. Most of them are based on Singular Value Decomposition (SVD). LoRA-XS [81] adds a small $r\times r$ weight matrix between frozen LoRA matrices, which are constructed using the SVD of the original weight matrix; then it tunes only the $r\times r$ weight matrices in fine-tuning. Similarly, BYOM-LoRA [56] adopts SVD to compress LoRA matrices for multi-task models.
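The parameter saving of placing a small trainable $r\times r$ matrix between frozen SVD-derived factors can be seen in a short sketch. The way the singular values are folded into the left factor and the zero initialization of the small matrix are assumptions of this sketch; the helper names and layer sizes are hypothetical.

```python
import numpy as np

def lora_xs_setup(W0, r):
    """Build frozen outer factors from the SVD of W0 and a trainable r x r core."""
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    B_frozen = U[:, :r] * S[:r]     # (d, r), frozen
    A_frozen = Vt[:r, :]            # (r, k), frozen
    R = np.zeros((r, r))            # the only trainable parameters (zero init is an assumption)
    return B_frozen, R, A_frozen

def lora_xs_delta(B_frozen, R, A_frozen):
    """Weight update B R A; only R changes during fine-tuning."""
    return B_frozen @ R @ A_frozen

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4
W0 = rng.normal(size=(d, k))        # stand-in for a pre-trained weight
B_f, R, A_f = lora_xs_setup(W0, r)

n_trainable_xs = R.size             # r * r
n_trainable_lora = r * (d + k)      # standard LoRA trains both B and A
delta_init = lora_xs_delta(B_f, R, A_f)
```

With these toy sizes the trainable count drops from 448 to 16 per layer, and it no longer depends on the layer's input or output dimension.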

Figure 4: An illustration of efficiency improving methods.

5.1.2 Parameter Pruning

Parameter pruning methods aim to remove unimportant LoRA parameters during training and inference. They either prune LoRA independently or jointly prune LoRA and the LLM. LoRA-drop [82] uses the output of LoRA at each layer to evaluate the importance of parameters and prunes the unimportant ones. By contrast, LoRAPrune [83] jointly prunes LoRA matrices and the LLM parameters based on LoRA’s gradients. Besides, LoRA can also be used to support parameter pruning for LLMs [84, 85].

5.1.3 Parameter Sharing

Parameter-sharing methods reduce the number of parameters by sharing parameters across different layers or modules of LLMs. VeRA[86] and VB-LoRA[87] are two representative parameter-sharing methods for LoRA. Specifically, VeRA proposes to share a pair of frozen random matrices across all layers and conduct layer-wise adaptation with “scaling vectors”. By contrast, VB-LoRA proposes a “divide-and-share” paradigm, which divides LoRA’s low-rank decomposition by a rank-one decomposition and achieves global sharing based on an admixture model.
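The idea of sharing frozen random matrices and adapting each layer with scaling vectors can be sketched as below. We assume the update takes the form $\mathrm{diag}(b)\,B\,\mathrm{diag}(d)\,A$ with shared frozen $B$ and $A$; the initial values of the vectors and the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, k, r, n_layers = 8, 6, 4, 3
B_shared = rng.normal(size=(d_out, r))   # frozen, shared by all layers
A_shared = rng.normal(size=(r, k))       # frozen, shared by all layers

def vera_delta(b_vec, d_vec):
    """Layer-wise update diag(b) @ B_shared @ diag(d) @ A_shared."""
    return (b_vec[:, None] * B_shared) @ (d_vec[:, None] * A_shared)

# Each layer owns only two small trainable vectors.
layers = [{"b": np.zeros(d_out), "d": np.full(r, 0.1)} for _ in range(n_layers)]
deltas = [vera_delta(p["b"], p["d"]) for p in layers]

trainable_per_layer = d_out + r          # versus r * (d_out + k) for standard LoRA
```

Since the random matrices are frozen and shared, only the scaling vectors need to be stored per layer.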

5.2 Parameter Quantization

Quantization, which reduces the bit width of parameters (e.g., from 32-bit floats to 4-bit integers), can be used to reduce the memory and computational cost of LoRA. Existing quantization-aware LoRA methods consist of post-training quantization (PTQ)-based methods and quantization-aware training (QAT)-based methods[92].

5.2.1 PTQ-based methods

In PTQ-based methods, we first quantize an LLM and then fine-tune the quantized model, namely quantization and fine-tuning are sequentially conducted. QLoRA [88] is the first PTQ-based quantization-aware LoRA method. In the fine-tuning stage, it first quantizes an LLM to 4 bits and then fine-tunes a LoRA plugin on it with a higher precision, such as BFloat16 or Float16. In the inference stage, it dequantizes the LLM to the same precision as LoRA and then adds the LoRA updates to the LLM.
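The quantize-then-fine-tune pipeline can be illustrated with a toy example. The sketch uses a simple symmetric absmax 4-bit quantizer with one scale per matrix, which is an assumption made for brevity and not the quantization data type used by QLoRA; the function names are hypothetical.

```python
import numpy as np

def quantize_absmax(W, bits=4):
    """Symmetric uniform quantization with a single per-matrix scale (toy quantizer)."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit values
    scale = np.abs(W).max() / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float64) * scale

def forward(x, q, scale, B, A):
    """Dequantize the frozen base weight, then add the higher-precision LoRA path."""
    W_hat = dequantize(q, scale)
    return x @ W_hat.T + (x @ A.T) @ B.T

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                       # stand-in for a pre-trained weight
q, scale = quantize_absmax(W)
B = np.zeros((8, 2))                              # LoRA factors kept in floating point
A = rng.normal(size=(2, 6))
x = rng.normal(size=(3, 6))
y = forward(x, q, scale, B, A)
```

Only `B` and `A` would be trained; the integer codes `q` stay fixed, which is why the memory saving applies to fine-tuning while inference still pays for dequantization.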

Although QLoRA can significantly reduce memory cost for fine-tuning, it does not bring benefits for inference, because it requires dequantizing the LLM to high precision again. To solve this problem, QA-LoRA [89] is proposed to reduce memory cost for both the fine-tuning and inference stages. QA-LoRA uses group-wise operators to balance the degrees of freedom of the LLM quantization and fine-tuning, which enables it to obtain a LoRA plugin having identical precision with the quantized LLM. Thus, it can perform inference without dequantization.

5.2.2 QAT-based methods

In QAT-based methods, we jointly quantize and fine-tune an LLM, namely quantization and fine-tuning are simultaneously conducted. These methods can alleviate the quantization discrepancies observed in PTQ-based methods. To address the quantization discrepancy of QLoRA, LoftQ [90] alternately applies quantization and low-rank approximation during fine-tuning to minimize the quantization error. However, ApiQ [91] points out that LoftQ ignores the error propagation across layers and proposes activation-preserved initialization to avoid error propagation. Besides, L4Q [92] is another QAT-based method that has an advanced layer design.
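An alternation between quantization and low-rank approximation can be sketched as a standalone routine that seeks $Q$, $B$ and $A$ such that $Q + BA$ is close to the original weight $W$. The toy absmax quantizer, the number of steps and the function names are assumptions of this sketch, not details taken from LoftQ.

```python
import numpy as np

def fake_quant(W, bits=4):
    """Quantize and immediately dequantize with a per-matrix absmax scale (toy quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def alternating_init(W, r, bits=4, n_steps=5):
    """Alternate between quantizing W - B A and refitting B A to the residual W - Q."""
    B = np.zeros((W.shape[0], r))
    A = np.zeros((r, W.shape[1]))
    for _ in range(n_steps):
        Q = fake_quant(W - B @ A, bits)                    # quantization step
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        B, A = U[:, :r] * S[:r], Vt[:r, :]                 # best rank-r fit of the residual
    return Q, B, A

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 12))                              # stand-in for a pre-trained weight
Q, B, A = alternating_init(W, r=4)

err_quant_only = np.linalg.norm(W - Q)                     # error without the low-rank term
err_with_lora = np.linalg.norm(W - Q - B @ A)              # error with the low-rank term
```

The low-rank factors absorb part of the quantization residual, so the combined approximation is never worse than the quantized weight alone for the returned `Q`.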

5.3 Parallel LoRA Computing Frameworks

LoRA’s parameter-efficient nature enables us to fine-tune or infer multiple plugins on a single GPU or a GPU cluster, which can save computational resources and improve the efficiency of LoRA. This section introduces the parallel fine-tuning and parallel inference frameworks, respectively.

5.3.1 Parallel Fine-tuning

Fine-tuning multiple LoRA plugins in parallel on a single GPU can reduce GPU memory usage and improve computation efficiency. ASPEN [93] proposes a high-throughput parallel fine-tuning framework for LoRA, which consists of a BatchFusion approach and an adaptive job scheduling algorithm. Specifically, the BatchFusion approach supports fine-tuning multiple LoRA plugins in parallel on a shared LLM by fusing multiple input batches into a single batch, while the adaptive job scheduling algorithm allocates computation resources to the fine-tuning jobs.
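The benefit of fusing batches from several fine-tuning jobs can be illustrated with a forward pass in which the shared base weight is applied once to the fused batch and each job's LoRA path is applied to its own slice. This is only a schematic of the idea under simplified assumptions (a single linear layer, no scheduling); the function name is hypothetical.

```python
import numpy as np

def fused_forward(batches, W0, adapters):
    """Run several LoRA jobs against one shared base weight in a single fused pass.

    batches: list of (n_i, k) inputs, one per job; adapters: list of (B, A) per job.
    """
    x = np.concatenate(batches, axis=0)       # fuse the job batches into one batch
    base = x @ W0.T                           # one shared base-model matmul
    outputs, start = [], 0
    for xb, (B, A) in zip(batches, adapters):
        end = start + len(xb)
        outputs.append(base[start:end] + (xb @ A.T) @ B.T)   # job-specific LoRA path
        start = end
    return outputs

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
W0 = rng.normal(size=(d, k))
batches = [rng.normal(size=(n, k)) for n in (3, 5, 2)]
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, k))) for _ in batches]
outs = fused_forward(batches, W0, adapters)
```

The base weight is stored and multiplied once regardless of how many jobs share it, which is where the memory and throughput savings come from.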

5.3.2 Parallel Inference

Parallel inference frameworks for LoRA can not only improve computational efficiency but also support the needs of multi-tenant services. Punica [94] uses a new CUDA kernel design to batch GPU operations for different LoRA plugins. Based on Punica, S-LoRA [95] further optimizes the parallel inference framework by introducing a unified paging mechanism and a new tensor parallelism strategy, which enables the service of thousands of concurrent LoRA plugins. Then, based on Punica and S-LoRA, CARASERVE [96] reduces the cold-start overhead and further improves the service efficiency and SLO (service-level objective) attainment rates through CPU-GPU cooperation and rank-aware scheduling.
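At the level of linear algebra, serving a batch in which each request uses a different plugin amounts to gathering per-request adapter matrices and applying them in one batched operation. The NumPy sketch below shows only this computation pattern; it does not reflect the CUDA kernels, paging or scheduling of the systems above, and all names are hypothetical.

```python
import numpy as np

def batched_multi_lora(x, W0, A_stack, B_stack, adapter_ids):
    """Apply a different LoRA plugin to each request in a batch.

    x: (n, k) requests, A_stack: (P, r, k), B_stack: (P, d, r),
    adapter_ids: (n,) index of the plugin used by each request.
    """
    A_sel = A_stack[adapter_ids]                       # gather, shape (n, r, k)
    B_sel = B_stack[adapter_ids]                       # gather, shape (n, d, r)
    h = np.einsum("nrk,nk->nr", A_sel, x)              # per-request down-projection
    delta = np.einsum("ndr,nr->nd", B_sel, h)          # per-request up-projection
    return x @ W0.T + delta                            # shared base path plus LoRA path

rng = np.random.default_rng(0)
n, d, k, r, n_plugins = 4, 8, 6, 2, 3
W0 = rng.normal(size=(d, k))
A_stack = rng.normal(size=(n_plugins, r, k))
B_stack = rng.normal(size=(n_plugins, d, r))
x = rng.normal(size=(n, k))
adapter_ids = np.array([2, 0, 2, 1])
y = batched_multi_lora(x, W0, A_stack, B_stack, adapter_ids)
```

Requests that use different plugins still share the base matmul, so adding tenants costs only the small per-request low-rank products.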

6 LoRA for Federated Learning

When adapting LLMs to vertical domains such as medicine and finance, the available training data can be privately owned by multiple clients. In this scenario, the training data is not centralized, and we have to fine-tune LLMs while keeping the data localized, namely federated learning. In federated learning, the clients typically compute weight updates locally and then share these updates with others to globally update the LLM. It brings both communication and computation costs for the clients. Fortunately, LoRA is parameter efficient and pluggable, which can reduce communication costs and lower computational resource requirements. LoRA can enhance the overall efficiency and scalability of federated learning.

However, adopting LoRA in federated learning is not trivial, because federated learning faces challenges such as data heterogeneity, device heterogeneity, and model heterogeneity. To address these issues, recent studies have designed various methods for LoRA to meet the diverse needs of federated learning. Additionally, as a localized parameter component, LoRA’s pluggable nature allows it to support parameter privacy protection in federated learning.

Figure 5: An illustration of LoRA for federated learning.

6.1 Data Heterogeneity

Data heterogeneity refers to differences in data distribution across clients. In federated learning, different clients usually have different data distributions. The inconsistency in data distribution affects the overall performance of the model. Research reveals that in federated learning, as user data becomes more diverse, the performance gap between LoRA and full fine-tuning widens [97]. To address this issue, researchers have proposed several improvement methods.

SLoRA [97] introduces a data-driven initialization method for LoRA. It first performs sparse federated fine-tuning before applying LoRA and then performs SVD to decompose the accumulated gradient updates into low-rank matrices for LoRA initialization. The goal is to enable the LoRA modules to better adapt to the data distribution of each client, thereby integrating these heterogeneous data characteristics into the global model more effectively. FeDeRA [98] uses a simpler initialization method. It directly applies SVD to pre-trained weights to initialize LoRA. Retaining the principal components of the pre-trained weights aligns the direction and magnitude of weight updates across different clients to handle data heterogeneity. Additionally, FFA-LoRA [99] freezes one low-rank matrix and fine-tunes only the other. This reduces inconsistency during server aggregation of LoRA gradients, alleviating the optimization instability caused by non-IID data.
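The aggregation inconsistency that motivates freezing one matrix can be checked numerically: averaging the client factors separately does not give the average of the client updates, unless one factor is shared and frozen. In the sketch below the frozen factor is $A$, which is an assumption made for illustration; client matrices are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_clients = 8, 6, 2, 3

# Case 1: every client trains both matrices.
Bs = [rng.normal(size=(d, r)) for _ in range(n_clients)]
As = [rng.normal(size=(r, k)) for _ in range(n_clients)]
ideal = np.mean([B @ A for B, A in zip(Bs, As)], axis=0)   # average of the true updates
naive = np.mean(Bs, axis=0) @ np.mean(As, axis=0)          # product of averaged factors
gap_both = np.linalg.norm(ideal - naive)

# Case 2: all clients share one frozen A and train only B.
A0 = rng.normal(size=(r, k))
ideal_frozen = np.mean([B @ A0 for B in Bs], axis=0)
naive_frozen = np.mean(Bs, axis=0) @ A0
gap_frozen = np.linalg.norm(ideal_frozen - naive_frozen)
```

With a shared frozen factor, the update is linear in the trained factor, so averaging factors on the server is exact; with both factors trained, cross terms between clients appear and `gap_both` is nonzero.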

6.2 Device Heterogeneity

Device heterogeneity refers to the differences in hardware capabilities and network connectivity among clients participating in federated learning. Traditional federated learning methods often encounter the “buckets effect”, implying that the system’s overall performance is limited by the capability of the least powerful client. Specifically, these methods use the smallest LoRA rank to accommodate all clients, which prevents many resource-rich clients from fully utilizing their potential.

To address this issue, a dynamic parameter allocation strategy can be adopted. FedMS [100] dynamically adjusts the number of activated LoRA matrices based on the real-time computational resources of clients. FlexLoRA [101] uses a dynamic parameter allocation strategy. It adjusts the LoRA rank and redistributes the SVD components of the global LoRA weights based on resource constraints. Similarly, HETLORA [102] assigns different ranks for different clients. However, it performs weighted aggregation according to the sparsity of the updates from different clients, balancing update information better than simple aggregation.
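One way to picture aggregation under heterogeneous ranks is to average the full-size client updates on the server, decompose the result with SVD, and send each client only as many components as its rank budget allows. The sketch below follows this reading of the SVD-redistribution idea; the uniform aggregation weights and the function name are assumptions.

```python
import numpy as np

def aggregate_and_redistribute(client_updates, client_ranks, weights):
    """Average full-size LoRA updates, then truncate the SVD to each client's rank."""
    global_delta = sum(w * (B @ A) for (B, A), w in zip(client_updates, weights))
    U, S, Vt = np.linalg.svd(global_delta, full_matrices=False)
    per_client = [(U[:, :r] * S[:r], Vt[:r, :]) for r in client_ranks]
    return global_delta, per_client

rng = np.random.default_rng(0)
d, k = 8, 6
ranks = [1, 2, 4]                                   # rank budgets of three clients
updates = [(rng.normal(size=(d, r)), rng.normal(size=(r, k))) for r in ranks]
weights = [1 / 3] * 3                               # uniform weights (an assumption)

global_delta, per_client = aggregate_and_redistribute(updates, ranks, weights)
errors = [np.linalg.norm(global_delta - B @ A) for B, A in per_client]
```

Clients with larger rank budgets receive a closer approximation of the global update, so resource-rich clients are no longer held back by the smallest rank.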

6.3 Model Heterogeneity

Model heterogeneity indicates differences in model structures among clients. In traditional federated learning, clients use local models with the same architecture, allowing their parameters to be aggregated into a global model on the server. However, in practice, clients may prefer unique local model architectures due to personal needs and often do not want to disclose model details. Thus, it is necessary to transfer knowledge between heterogeneous models without sharing private data or revealing local model structures [210].

Previous work has used knowledge distillation, model ensembling, and mutual learning to address model heterogeneity. However, these methods have limitations, such as reliance on public datasets, additional communication costs and poor local model performance. To avoid these limitations, pFedLoRA [103] uses LoRA as a carrier of both global and local knowledge. It adopts an iterative training strategy to facilitate knowledge transfer and integration, enabling knowledge sharing among heterogeneous models across different clients.

6.4 Parameter Privacy

In federated learning, protecting client-specific parameters is crucial because ensuring the privacy of these parameters also indirectly safeguards client data privacy. As a modular approach to adjusting personalized parameters, LoRA can be effectively integrated into federated learning systems to achieve parameter privacy protection.

Literature [104] proposes a secure distributed language model training framework based on model slicing. They deploy LoRA in a Trusted Execution Environment (TEE) and use OTP encryption to transmit features between the GPU and TEE, protecting model parameter privacy. PrivateLoRA [105] introduces a distributed system based on LoRA. It adds a square matrix $M$ between low-rank matrices $A$ and $B$. The non-trainable matrices $A$ and $B$, along with most of the pre-trained weights, are deployed on the global server to enhance computation. Meanwhile, the trainable matrix $M$ is stored on the client as personalized parameters, thus ensuring parameter privacy protection.

7 Applications of LoRA

In the rapidly evolving field of deep learning, LoRA has become widely used due to its unique advantages. Researchers utilize LoRA to fine-tune pre-trained models for various downstream tasks, reducing computational resource requirements while enhancing performance. LoRA’s strong adaptability and efficiency have significantly improved various applications. In this section, we will introduce LoRA’s applications in the following scenarios: (1) language tasks; (2) vision tasks; (3) multimodal tasks.

7.1 Language Tasks

Recently, the rapid development of pre-trained language models, especially LLMs, is revolutionizing the approach to language tasks due to their outstanding performance. However, these pre-trained models are trained on a large amount of general data and still require further fine-tuning on task-specific data to adapt to downstream tasks. Therefore, it is natural to use LoRA to fine-tune these pre-trained language models, as it reduces computational resource requirements. We mainly focus on some representative downstream tasks, which include traditional NLP tasks, code tasks, model alignment and vertical domain tasks.

7.1.1 Traditional NLP Tasks

Given the strong instruction-following and contextual understanding abilities of LLMs, some studies apply LoRA to fine-tune these models for traditional NLP tasks. For example, LoRA is widely adopted in LLaMA for various tasks, such as emotion recognition [106], text classification [107] and role recognition [108]. AutoRE [109] applies QLoRA to three document-level relation extraction tasks, achieving great performance on different LLMs. Some studies [110, 111, 112] leverage LoRA from different perspectives to enhance the model’s capability in machine translation tasks. Additionally, LoRA can also improve the performance of models like BERT and T5 for text understanding tasks [113, 114].

7.1.2 Code Tasks

Some studies apply LoRA to improve model performance in various code-related tasks. For example, BERT-style models fine-tuned with LoRA are suitable for code-change-related tasks, specifically Just-In-Time defect prediction (JIT-DP) [115, 116]. Similarly, training CodeT5 and PLBART with LoRA can enhance their adaptability for code summarization and code clone detection [117]. As for decoder-only models, RepairLLaMA [118] uses LoRA to fine-tune Llama for automated program repair (APR), while WizardCoder-15B is fine-tuned with LoRA for the Text-to-SQL task [119]. Additionally, SteloCoder [120], a fine-tuned version of StarCoder, is designed for multi-language to Python code translation.

7.1.3 Model Alignment Tasks

Model alignment tasks focus on adjusting a machine learning model to align with human values and intentions, often using techniques like Reinforcement Learning from Human Feedback (RLHF). To reduce memory requirements of RLHF, some studies use LoRA to fine-tune the reward model and policy model[121, 122, 123]. Furthermore, other works improve reward models by integrating multiple LoRA adapters. For example, DMoERM[124] combines MoE with LoRA, routing model inputs to multiple LoRA experts while another work[125] proposes a LoRA-based ensemble method as well. The integration can also benefit the quantification of uncertainty in reward models[126]. Besides, literature[127] applies Laplace-LoRA[128] to train Bayesian reward models, which mitigates reward overoptimization in best-of-n sampling.

7.1.4 Vertical Domain Tasks

LLMs often perform suboptimally in vertical domains, requiring fine-tuning with domain-specific expertise. Some works apply LoRA to improve the performance of LLMs on domain-specific tasks. For example, some studies fine-tune LLMs on medical datasets with LoRA to adapt them to the medical domain[129, 130, 131]. Additionally, other studies improve medical tasks like clinical dialogue summarization[132], assertion detection[133] and medical QA tasks[134, 135]. Similarly, several studies fine-tune LLMs with LoRA on financial data to solve tasks such as financial news analytics and sentiment classification[136, 137, 138, 139]. Besides, LoRA can also be used to enhance the performance in database tasks like query rewrite and index tuning[140].

7.2 Vision Tasks

In vision tasks, LoRA is primarily applied to image generation and image segmentation, significantly improving training efficiency and optimizing model performance.

7.2.1 Image Generation

Image generation tasks hold significant importance in the field of computer vision. In recent years, diffusion models have demonstrated exceptional performance in image generation tasks. LoRA is widely used in diffusion models to address various image generation tasks while reducing computational resources. Some works use LoRA to fine-tune diffusion models for image style transfer [141, 142, 143, 144, 145], while others apply it to text-to-image generation [146, 147, 148, 149, 150].

Furthermore, researchers have designed several LoRA-based methods to improve image generation quality. For instance, Smooth Diffusion [151] uses LoRA to achieve smoothness in the latent space, leading to better performance in various image generation and editing tasks. ResAdapter [152] employs LoRA to learn resolution priors, adjusting the receptive fields of convolutional layers to dynamic resolutions. Additionally, to specifically enhance text-to-image quality, STAMINA [153] uses LoRA to fine-tune diffusion models for longer concept sequences. DreamSync [154] and StyleAdapter [155] use LoRA to improve text fidelity and image quality. Mix-of-Show [156] captures out-of-domain information with LoRA weights to combine multiple customized concepts with high fidelity, reducing concept conflicts. Other studies combine LoRA with model distillation to accelerate image generation [157, 158]. Moreover, LoRA can also be applied to video generation [159, 160, 161, 162, 163, 164] and 3D generation tasks [165, 166, 167, 168, 169].

7.2.2 Image Segmentation

Image segmentation is a significant challenge in computer vision, aiming to divide an image into multiple meaningful regions or objects. To address this, SAM has been proposed as a foundation model for image segmentation and has demonstrated superior generalization ability. To further enhance its performance in specific vertical domains, many studies utilize LoRA to fine-tune it. For instance, in license plate detection, SamLP [170] utilizes LoRA to adapt SAM for efficient segmentation of license plates. In structural damage detection, literature [171] fine-tunes SAM’s encoder using LoRA for instance segmentation tasks. In the medical domain, many studies also apply LoRA to fine-tune SAM for a variety of tasks, including nuclei segmentation [172], OCTA image segmentation [173], brain tumor segmentation [174], organ segmentation [175], and surgical instrument segmentation [176]. Additionally, some studies use LoRA to fine-tune Vision Transformer (ViT) for visual tracking [177] and face forgery detection [178].

7.3 Multimodal Tasks

Multimodal Large Language Models (MLLMs) aim to integrate text with various modalities such as audio, image and video, which enables cross-modal understanding and reasoning through a unified embedding space. The success of LoRA in both NLP and vision tasks has sparked considerable interest in applying it to MLLMs.

In MLLMs, LoRA can not only improve training efficiency but also facilitate effective modality alignment. In audio-text tasks, SALM [179] comprises LoRA layers, a frozen text-based LLM, an audio encoder and a modality adapter to handle speech inputs and corresponding task instructions. For image-text tasks, InternLM-XComposer2 [180] achieves modality alignment by applying LoRA to image tokens, mPLUG-Owl [181] freezes the visual module while jointly fine-tuning LoRA and the abstractor of the text module, and CoLLaVO [182] employs QLoRA to preserve object-level image understanding. In video-text tasks, VSP-LLM [183] fine-tunes the text module with QLoRA for visual speech processing. For other modalities, MolCA [184] uses LoRA to understand 2D molecular graphs and text, while TPLLM [185] employs LoRA for efficient traffic prediction by integrating sequence and spatial features. These applications demonstrate the versatility and power of LoRA in MLLM tasks.

8 Conclusion and Future Direction

In this survey, the recent progress of LoRA has been systematically reviewed from the perspectives of downstream adaptation improvement, cross-task generalization, efficiency improvement, federated learning and applications. From this review, we can find that LoRA is parameter efficient, pluggable, compatible and easy to use for cross-task generalization, which makes it one of the most important technologies for LLM applications. Recent progress further boosts the generalization and efficiency of LoRA and stimulates its potential to be used in more scenarios. Here, we list three future directions where LoRA will be indispensable.

8.1 LoRA for GaaS

In Generative-as-a-Service (GaaS), cloud-based platforms provide users with generative artificial intelligence (AGI) services. GaaS enables users to enjoy AGI without deploying local computational resources. Because users’ needs are diverse, it is necessary to provide various functions in GaaS. To implement these functions, we can construct a LoRA plugin for each function. The parameter efficiency and pluggability of LoRA can facilitate efficient construction and execution of these functions. Besides, the services on GaaS platforms can change rapidly over time. To follow the changes, we can train new LoRA plugins that are initialized by a combination of previous plugins. The cross-task generalization ability of LoRA can facilitate fast adaptation to service updates.

8.2 LoRA for Continued Pre-training

In continued pre-training, a foundation model is continually trained with unlabeled user data to adapt the model to specific domains. Typically, the self-supervised training objective is the same as that for pre-training, and the learning rate is much smaller than that for pre-training. Continued pre-training is an important stage for constructing vertical domain LLMs. However, it is highly computationally expensive, which impedes the development of vertical domain LLMs, especially for organizations with limited computational resources. Enhancing LoRA for continued pre-training and reducing its computational cost is worth exploring.

8.3 LoRA for Autonomous Agents

In LLM-based autonomous agents, the agents are assigned specific roles. Based on the roles and the environment, agents take actions to respond to users’ or other agents’ requests. The actions can be made based on self-knowledge or tools that are designed for domain-specific tasks. The requests and the actions are stored in memory to support future requests.

In current agents, the roles are typically assigned by prompts; however, a prompt may not give a comprehensive description of the role when the role is complex and the amount of related data is large. Assigning roles with LoRA plugins trained on role-related data can be a better choice. Furthermore, the tools for agents can be LoRA plugins. Besides, the memory usually augments the agents through retrieval-augmented generation (RAG); however, due to the input token limit and the shortcomings of in-context learning, RAG-based support may be less effective. By contrast, we can use LoRA-based continual learning to construct memory plugins, which can address these limitations of RAG. Therefore, LoRA-driven agents are worth exploring.
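To make the idea of role plugins concrete, the minimal Python sketch below keeps one frozen base weight and selects a LoRA plugin per role, computing the adapted weight as W0 + B A. The `Agent` class, the role names `planner` and `reviewer`, and the toy matrices are hypothetical and serve only as an illustration; tool and memory plugins could be registered in the same way.

```python
# Hypothetical sketch: the Agent class, role names, and values are assumptions.
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Plugin = Tuple[Matrix, Matrix]  # (B, A): B is d x r, A is r x k


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (d x r) and y (r x k)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def adapted_weight(w0: Matrix, plugin: Plugin) -> Matrix:
    """Return W0 + B A, leaving the frozen base weight W0 untouched."""
    b, a = plugin
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


class Agent:
    """Holds one shared base weight and one LoRA plugin per role."""

    def __init__(self, base_weight: Matrix, role_plugins: Dict[str, Plugin]) -> None:
        self.base_weight = base_weight
        self.role_plugins = role_plugins

    def weight_for(self, role: str) -> Matrix:
        """Plug in the LoRA module registered for the requested role."""
        if role not in self.role_plugins:
            raise KeyError(f"no LoRA plugin registered for role {role!r}")
        return adapted_weight(self.base_weight, self.role_plugins[role])


base = [[1.0, 0.0], [0.0, 1.0]]
agent = Agent(base, {
    "planner": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "reviewer": ([[0.0], [1.0]], [[3.0, 0.0]]),
})
planner_w = agent.weight_for("planner")
reviewer_w = agent.weight_for("reviewer")
```

Because the base weight is never modified, switching roles only swaps a small pair of low-rank factors, which is the pluggability property the discussion above relies on.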

Acknowledgement

This work was supported in part by the NSFC under Grants No. 62025206, 62302436, and U23A20296, Zhejiang Province’s “Lingyan” R&D Project under Grant No. 2024C01259, and Ningbo Science and Technology Special Projects under Grant No. 2023Z212. Yunjun Gao is the corresponding author of this work.

References

  • [1] Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. 2019
  • [2] Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung H W, Sutton C, Gehrmann S, Schuh P, Shi K, Tsvyashchenko S, Maynez J, Rao A, Barnes P, Tay Y, Shazeer N, Prabhakaran V, Reif E, Du N, Hutchinson B, Pope R, Bradbury J, Austin J, Isard M, Gur-Ari G, Yin P, Duke T, Levskaya A, Ghemawat S, Dev S, Michalewski H, Garcia X, Misra V, Robinson K, Fedus L, Zhou D, Ippolito D, Luan D, Lim H, Zoph B, Spiridonov A, Sepassi R, Dohan D, Agrawal S, Omernick M, Dai A M, Pillai T S, Pellat M, Lewkowycz A, Moreira E, Child R, Polozov O, Lee K, Zhou Z, Wang X, Saeta B, Diaz M, Firat O, Catasta M, Wei J, Meier-Hellstern K, Eck D, Dean J, Petrov S, Fiedel N. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res., 2023, 24: 240:1–240:113
  • [3] Chen Y, Qian S, Tang H, Lai X, Liu Z, Han S, Jia J. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv.2309.12307, 2023
  • [4] Pan R, Liu X, Diao S, Pi R, Zhang J, Han C, Zhang T. LISA: layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv.2403.17919, 2024
  • [5] Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, Hu S, Chen Y, Chan C, Chen W, Yi J, Zhao W, Wang X, Liu Z, Zheng H, Chen J, Liu Y, Tang J, Li J, Sun M. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mac. Intell., 2023, 5(3): 220–235
  • [6] Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, Laroussilhe d Q, Gesmundo A, Attariyan M, Gelly S. Parameter-efficient transfer learning for NLP. In: ICML. 2019
  • [7] Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. In: EMNLP. 2021
  • [8] Zaken E B, Goldberg Y, Ravfogel S. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In: ACL. 2022
  • [9] Hu E J, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. Lora: Low-rank adaptation of large language models. In: ICLR. 2022
  • [10] Malladi S, Wettig A, Yu D, Chen D, Arora S. A kernel-based view of language model fine-tuning. In: ICML. 2023
  • [11] Koubbi H, Boussard M, Hernandez L. The impact of lora on the emergence of clusters in transformers. arXiv preprint arXiv.2402.15415, 2024
  • [12] Jang U, Lee J D, Ryu E K. Lora training in the NTK regime has no spurious local minima. arXiv preprint arXiv.2402.11867, 2024
  • [13] Zhu J, Greenewald K H, Nadjahi K, Ocáriz Borde d H S, Gabrielsson R B, Choshen L, Ghassemi M, Yurochkin M, Solomon J. Asymmetry in low-rank adapters of foundation models. arXiv preprint arXiv.2402.16842, 2024
  • [14] Zeng Y, Lee K. The expressive power of low-rank adaptation. arXiv preprint arXiv.2310.17513, 2023
  • [15] Lialin V, Muckatira S, Shivagunde N, Rumshisky A. Relora: High-rank training through low-rank updates. In: NeurIPS Workshop. 2023
  • [16] Jiang T, Huang S, Luo S, Zhang Z, Huang H, Wei F, Deng W, Sun F, Zhang Q, Wang D, et al. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024
  • [17] Huh M, Cheung B, Bernstein J, Isola P, Agrawal P. Training neural networks from scratch with parallel low-rank adapters. arXiv preprint arXiv.2402.16828, 2024
  • [18] Liang Y, Li W. Inflora: Interference-free low-rank adaptation for continual learning. arXiv preprint arXiv.2404.00228, 2024
  • [19] Zhao H, Ni B, Wang H, Fan J, Zhu F, Wang Y, Chen Y, Meng G, Zhang Z. Continual forgetting for pre-trained vision models. arXiv preprint arXiv.2403.11530, 2024
  • [20] Ren W, Li X, Wang L, Zhao T, Qin W. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. arXiv preprint arXiv.2402.18865, 2024
  • [21] Zhang H. Sinklora: Enhanced efficiency and chat capabilities for long-context large language models. arXiv preprint arXiv.2406.05678, 2024
  • [22] Xia W, Qin C, Hazan E. Chain of lora: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv.2401.04151, 2024
  • [23] Ren P, Shi C, Wu S, Zhang M, Ren Z, Rijke d M, Chen Z, Pei J. Mini-ensemble low-rank adapters for parameter-efficient fine-tuning. arXiv preprint arXiv.2402.17263, 2024
  • [24] Hao Y, Cao Y, Mou L. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv.2402.03293, 2024
  • [25] Zi B, Qi X, Wang L, Wang J, Wong K, Zhang L. Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv.2309.02411, 2023
  • [26] Zhang Q, Chen M, Bukharin A, He P, Cheng Y, Chen W, Zhao T. Adaptive budget allocation for parameter-efficient fine-tuning. In: ICLR. 2023
  • [27] Hu Y, Xie Y, Wang T, Chen M, Pan Z. Structure-aware low-rank adaptation for parameter-efficient fine-tuning. Mathematics, 2023
  • [28] Zhang F, Li L, Chen J, Jiang Z, Wang B, Qian Y. Increlora: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv preprint arXiv.2308.12043, 2023
  • [29] Mao Y, Huang K, Guan C, Bao G, Mo F, Xu J. Dora: Enhancing parameter-efficient fine-tuning with dynamic rank distribution. arXiv preprint arXiv:2405.17357, 2024
  • [30] Zhang R, Qiang R, Somayajula S A, Xie P. Autolora: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv preprint arXiv.2403.09113, 2024
  • [31] Ding N, Lv X, Wang Q, Chen Y, Zhou B, Liu Z, Sun M. Sparse low-rank adaptation of pre-trained language models. In: EMNLP. 2023
  • [32] Liu Z, Lyn J, Zhu W, Tian X, Graham Y. Alora: Allocating low-rank adaptation for fine-tuning large language models. arXiv preprint arXiv.2403.16187, 2024
  • [33] Valipour M, Rezagholizadeh M, Kobyzev I, Ghodsi A. Dylora: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In: EACL. 2023
  • [34] Hayou S, Ghosh N, Yu B. The impact of initialization on lora finetuning dynamics. arXiv preprint arXiv:2406.08447, 2024
  • [35] Meng F, Wang Z, Zhang M. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024
  • [36] Wang H, Xiao Z, Li Y, Wang S, Chen G, Chen Y. Milora: Harnessing minor singular components for parameter-efficient llm finetuning. arXiv preprint arXiv:2406.09044, 2024
  • [37] Zhang F, Pilanci M. Riemannian preconditioned lora for fine-tuning foundation models. arXiv preprint arXiv:2402.02347, 2024
  • [38] Hayou S, Ghosh N, Yu B. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024
  • [39] Shi S, Huang S, Song M, Li Z, Zhang Z, Huang H, Wei F, Deng W, Sun F, Zhang Q. Reslora: Identity residual mapping in low-rank adaption. arXiv preprint arXiv:2402.18039, 2024
  • [40] Wen Z, Zhang J, Fang Y. Sibo: A simple booster for parameter-efficient fine-tuning. arXiv preprint arXiv:2402.11896, 2024
  • [41] Jin F, Liu Y, Tan Y. Derivative-free optimization for low-rank adaptation in large language models. arXiv preprint arXiv:2403.01754, 2024
  • [42] Liu S Y, Wang C Y, Yin H, Molchanov P, Wang Y C F, Cheng K T, Chen M H. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024
  • [43] Qiang R, Zhang R, Xie P. Bilora: A bi-level optimization framework for overfitting-resilient low-rank adaptation of large pre-trained models. arXiv preprint arXiv:2403.13037, 2024
  • [44] Lin Y, Ma X, Chu X, Jin Y, Yang Z, Wang Y, Mei H. Lora dropout as a sparsity regularizer for overfitting control. arXiv preprint arXiv:2404.09610, 2024
  • [45] Wang S, Chen L, Jiang J, Xue B, Kong L, Wu C. Lora meets dropout under a unified framework. arXiv preprint arXiv:2403.00812, 2024
  • [46] Yang A X, Robeyns M, Wang X, Aitchison L. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv.2308.13111, 2023
  • [47] Qi Z, Tan X, Shi S, Qu C, Xu Y, Qi Y. PILLOW: enhancing efficient instruction fine-tuning via prompt matching. In: EMNLP. 2023
  • [48] Zhang L, Wu J, Zhou D, Xu G. STAR: constraint lora with dynamic active learning for data-efficient fine-tuning of large language models. arXiv preprint arXiv.2403.01165, 2024
  • [49] Wang X, Aitchison L, Rudolph M. Lora ensembles for large language model fine-tuning. arXiv preprint arXiv.2310.00035, 2023
  • [50] Zhao Z, Gan L, Wang G, Zhou W, Yang H, Kuang K, Wu F. Loraretriever: Input-aware lora retrieval and composition for mixed tasks in the wild. arXiv preprint arXiv.2402.09997, 2024
  • [51] Smith J S, Cascante-Bonilla P, Arbelle A, Kim D, Panda R, Cox D D, Yang D, Kira Z, Feris R, Karlinsky L. Construct-vl: Data-free continual structured VL concepts learning. In: CVPR. 2023
  • [52] Sun Y, Li M, Cao Y, Wang K, Wang W, Zeng X, Zhao R. To be or not to be? an exploration of continuously controllable prompt engineering. arXiv preprint arXiv:2311.09773, 2023
  • [53] Zhang J, Chen S, Liu J, He J. Composing parameter-efficient modules with arithmetic operations. arXiv preprint arXiv.2306.14870, 2023
  • [54] Chitale R, Vaidya A, Kane A, Ghotkar A. Task arithmetic with lora for continual learning. arXiv preprint arXiv.2311.02428, 2023
  • [55] Belofsky J. Token-level adaptation of lora adapters for downstream task generalization. In: AICCC. 2023
  • [56] Jiang W, Lin B, Shi H, Zhang Y, Li Z, Kwok J T. Effective and parameter-efficient reusing fine-tuned models. arXiv preprint arXiv.2310.01886, 2023
  • [57] Asadi N, Beitollahi M, Khalil Y H, Li Y, Zhang G, Chen X. Does combining parameter-efficient modules improve few-shot transfer accuracy? arXiv preprint arXiv.2402.15414, 2024
  • [58] Huang C, Liu Q, Lin B Y, Pang T, Du C, Lin M. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv.2307.13269, 2023
  • [59] Yadav P, Choshen L, Raffel C, Bansal M. Compeft: Compression for communicating parameter efficient updates via sparsification and quantization. arXiv preprint arXiv.2311.13171, 2023
  • [60] Tang A, Shen L, Luo Y, Zhan Y, Hu H, Du B, Chen Y, Tao D. Parameter efficient multi-task model fusion with partial linearization. arXiv preprint arXiv.2310.04742, 2023
  • [61] Shen Y, Xu Z, Wang Q, Cheng Y, Yin W, Huang L. Multimodal instruction tuning with conditional mixture of lora. arXiv preprint arXiv.2402.15896, 2024
  • [62] Buehler E L, Buehler M J. X-lora: Mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and design. arXiv preprint arXiv.2402.07148, 2024
  • [63] Yang S, Ali M A, Wang C, Hu L, Wang D. Moral: Moe augmented lora for llms’ lifelong learning. arXiv preprint arXiv.2402.11260, 2024
  • [64] Dou S, Zhou E, Liu Y, Gao S, Zhao J, Shen W, Zhou Y, Xi Z, Wang X, Fan X, Pu S, Zhu J, Zheng R, Gui T, Zhang Q, Huang X. Loramoe: Alleviate world knowledge forgetting in large language models via moe-style plugin. arXiv preprint arXiv:2312.09979, 2023
  • [65] Gou Y, Liu Z, Chen K, Hong L, Xu H, Li A, Yeung D, Kwok J T, Zhang Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv.2312.12379, 2023
  • [66] Liu Q, Wu X, Zhao X, Zhu Y, Xu D, Tian F, Zheng Y. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv.2310.18339, 2023
  • [67] Feng W, Hao C, Zhang Y, Han Y, Wang H. Mixture-of-loras: An efficient multitask tuning method for large language models. In: LREC/COLING. 2024
  • [68] Wang Y, Lin Y, Zeng X, Zhang G. Multilora: Democratizing lora for better multi-task learning. arXiv preprint arXiv.2311.11501, 2023
  • [69] Yang Y, Jiang P, Hou Q, Zhang H, Chen J, Li B. Multi-task dense prediction via mixture of low-rank experts. arXiv preprint arXiv.2403.17749, 2024
  • [70] Agiza A, Neseem M, Reda S. Mtlora: Low-rank adaptation approach for efficient multi-task learning. In: CVPR. 2024
  • [71] Gao C, Chen K, Rao J, Sun B, Liu R, Peng D, Zhang Y, Guo X, Yang J, Subrahmanian V S. Higher layers need more lora experts. arXiv preprint arXiv.2402.08562, 2024
  • [72] Chen S, Jie Z, Ma L. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv preprint arXiv.2401.16160, 2024
  • [73] Zhu Y, Wichers N, Lin C, Wang X, Chen T, Shu L, Lu H, Liu C, Luo L, Chen J, Meng L. Sira: Sparse mixture of low rank adaptation. arXiv preprint arXiv.2311.09179, 2023
  • [74] Chen Z, Wang Z, Wang Z, Liu H, Yin Z, Liu S, Sheng L, Ouyang W, Qiao Y, Shao J. Octavius: Mitigating task interference in mllms via moe. arXiv preprint arXiv.2311.02684, 2023
  • [75] Wen Y, Chaudhuri S. Batched low-rank adaptation of foundation models. arXiv preprint arXiv.2312.05677, 2023
  • [76] Ren W, Li X, Wang L, Zhao T, Qin W. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. arXiv preprint arXiv.2402.18865, 2024
  • [77] Wu Y, Xiang Y, Huo S, Gong Y, Liang P. Lora-sp: streamlined partial parameter adaptation for resource efficient fine-tuning of large language models. In: AMNA. 2024
  • [78] Zhang L, Zhang L, Shi S, Chu X, Li B. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023
  • [79] Liu Z, Kundu S, Li A, Wan J, Jiang L, Beerel P A. Aflora: Adaptive freezing of low rank adaptation in parameter efficient fine-tuning of large models. arXiv preprint arXiv:2403.13269, 2024
  • [80] Woo S, Park B, Kim B, Jo M, Kwon S, Jeon D, Lee D. Dropbp: Accelerating fine-tuning of large language models by dropping backward propagation. arXiv preprint arXiv:2402.17812, 2024
  • [81] Bałazy K, Banaei M, Aberer K, Tabor J. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv preprint arXiv:2405.17604, 2024
  • [82] Zhou H, Lu X, Xu W, Zhu C, Zhao T. Lora-drop: Efficient lora parameter pruning based on output evaluation. arXiv preprint arXiv:2402.07721, 2024
  • [83] Zhang M, Chen H, Shen C, Yang Z, Ou L, Yu X, Zhuang B. Loraprune: Pruning meets low-rank parameter-efficient fine-tuning. 2023
  • [84] Chen T, Ding T, Yadav B, Zharkov I, Liang L. Lorashear: Efficient large language model structured pruning and knowledge recovery. arXiv preprint arXiv:2310.18356, 2023
  • [85] Zhu Y, Yang X, Wu Y, Zhang W. Parameter-efficient fine-tuning with layer pruning on free-text sequence-to-sequence modeling. arXiv preprint arXiv:2305.08285, 2023
  • [86] Kopiczko D J, Blankevoort T, Asano Y M. Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023
  • [87] Li Y, Han S, Ji S. Vb-lora: Extreme parameter efficient fine-tuning with vector banks. arXiv preprint arXiv:2405.15179, 2024
  • [88] Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. Qlora: Efficient finetuning of quantized llms. In: NeurIPS. 2024
  • [89] Xu Y, Xie L, Gu X, Chen X, Chang H, Zhang H, Chen Z, Zhang X, Tian Q. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717, 2023
  • [90] Li Y, Yu Y, Liang C, He P, Karampatziakis N, Chen W, Zhao T. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023
  • [91] Liao B, Monz C. Apiq: Finetuning of 2-bit quantized large language model. arXiv preprint arXiv:2402.05147, 2024
  • [92] Jeon H, Kim Y, Kim J j. L4q: Parameter efficient quantization-aware training on large language models via lora-wise lsq. arXiv preprint arXiv:2402.04902, 2024
  • [93] Ye Z, Li D, Tian J, Lan T, Zuo J, Duan L, Lu H, Jiang Y, Sha J, Zhang K, Tang M. Aspen: High-throughput lora fine-tuning of large language models with a single gpu. arXiv preprint arXiv:2312.02515, 2023
  • [94] Chen L, Ye Z, Wu Y, Zhuo D, Ceze L, Krishnamurthy A. Punica: Multi-tenant lora serving. In: MLSys. 2024
  • [95] Sheng Y, Cao S, Li D, Hooper C, Lee N, Yang S, Chou C, Zhu B, Zheng L, Keutzer K, et al. S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023
  • [96] Li S, Lu H, Wu T, Yu M, Weng Q, Chen X, Shan Y, Yuan B, Wang W. Caraserve: Cpu-assisted and rank-aware lora serving for generative llm inference. arXiv preprint arXiv:2401.11240, 2024
  • [97] Babakniya S, Elkordy A R, Ezzeldin Y H, Liu Q, Song K, El-Khamy M, Avestimehr S. SLoRA: Federated parameter efficient fine-tuning of language models. arXiv preprint arXiv:2308.06522, 2023
  • [98] Yan Y, Tang S, Shi Z, Yang Q. FeDeRA: Efficient fine-tuning of language models in federated learning leveraging weight decomposition. arXiv preprint arXiv:2404.18848, 2024
  • [99] Sun Y, Li Z, Li Y, Ding B. Improving LoRA in privacy-preserving federated learning. arXiv preprint arXiv:2403.12313, 2024
  • [100] Wu P, Li K, Wang T, Wang F. FedMS: Federated learning with mixture of sparsely activated foundations models. arXiv preprint arXiv:2312.15926, 2023
  • [101] Bai J, Chen D, Qian B, Yao L, Li Y. Federated fine-tuning of large language models under heterogeneous language tasks and client resources. arXiv preprint arXiv:2402.11505, 2024
  • [102] Cho Y J, Liu L, Xu Z, Fahrezi A, Barnes M, Joshi G. Heterogeneous LoRA for federated fine-tuning of on-device foundation models. In: NeurIPS. 2023
  • [103] Yi L, Yu H, Wang G, Liu X, Li X. pFedLoRA: Model-Heterogeneous Personalized Federated Learning with LoRA Tuning. arXiv preprint arXiv:2310.13283, 2023
  • [104] Huang W, Wang Y, Cheng A, Zhou A, Yu C, Wang L. A fast, performant, secure distributed training framework for large language model. arXiv preprint arXiv:2401.09796, 2024
  • [105] Wang Y, Lin Y, Zeng X, Zhang G. PrivateLoRA for efficient privacy preserving LLM. arXiv preprint arXiv:2311.14030, 2023
  • [106] Zhang Y, Wang M, Wu Y, Tiwari P, Li Q, Wang B, Qin J. Dialoguellm: Context and emotion knowledge-tuned large language models for emotion recognition in conversations. arXiv preprint arXiv:2310.11374, 2024
  • [107] Li Z, Li X, Liu Y, Xie H, Li J, Wang F L, Li Q, Zhong X. Label supervised llama finetuning. arXiv preprint arXiv:2310.01208, 2023
  • [108] Bornheim T, Grieger N, Blaneck P G, Bialonski S. Speaker attribution in german parliamentary debates with qlora-adapted large language models. arXiv preprint arXiv:2309.09902, 2024
  • [109] Xue L, Zhang D, Dong Y, Tang J. Autore: Document-level relation extraction with large language models. arXiv preprint arXiv:2403.14888, 2024
  • [110] Alves D M, Guerreiro N M, Alves J, Pombal J, Rei R, Souza d J G C, Colombo P, Martins A F T. Steering large language models for machine translation with finetuning and in-context learning. In: EMNLP. 2023
  • [111] Zheng J, Hong H, Wang X, Su J, Liang Y, Wu S. Fine-tuning large language models for domain-specific machine translation. arXiv preprint arXiv:2402.15061, 2024
  • [112] Mujadia V, Urlana A, Bhaskar Y, Pavani P A, Shravya K, Krishnamurthy P, Sharma D M. Assessing translation capabilities of large language models involving english and indian languages. arXiv preprint arXiv:2311.09216, 2023
  • [113] Zhang Y, Wang J, Yu L, Xu D, Zhang X. Personalized lora for human-centered text understanding. In: AAAI. 2024
  • [114] Liu Y, An C, Qiu X. Y-tuning: An efficient tuning paradigm for large-scale pre-trained models via label representation learning. Frontiers of Computer Science, 2024, 18(4): 184320
  • [115] Liu S, Keung J, Yang Z, Liu F, Zhou Q, Liao Y. Delving into parameter-efficient fine-tuning in code change learning: An empirical study. arXiv preprint arXiv:2402.06247, 2024
  • [116] Guo Y, Gao X, Jiang B. An empirical study on jit defect prediction based on bert-style model. arXiv preprint arXiv:2403.11158, 2024
  • [117] Ayupov S, Chirkova N. Parameter-efficient finetuning of transformers for source code. arXiv preprint arXiv:2212.05901, 2022
  • [118] Silva A, Fang S, Monperrus M. Repairllama: Efficient representations and fine-tuned adapters for program repair. arXiv preprint arXiv:2312.15698, 2023
  • [119] Roberson R, Kaki G, Trivedi A. Analyzing the effectiveness of large language models on text-to-sql synthesis. arXiv preprint arXiv:2401.12379, 2024
  • [120] Pan J, Sadé A, Kim J, Soriano E, Sole G, Flamant S. Stelocoder: a decoder-only LLM for multi-language to python code translation. arXiv preprint arXiv:2310.15539, 2023
  • [121] Sidahmed H, Phatale S, Hutcheson A, Lin Z, Chen Z, Yu Z, Jin J, Komarytsia R, Ahlheim C, Zhu Y, Chaudhary S, Li B, Ganesh S, Byrne B, Hoffmann J, Mansoor H, Li W, Rastogi A, Dixon L. Perl:parameter efficient reinforcement learning from human feedback. arXiv preprint arXiv:2403.10704, 2024
  • [122] Santacroce M, Lu Y, Yu H, Li Y, Shen Y. Efficient RLHF: reducing the memory usage of PPO. arXiv preprint arXiv:2309.00754, 2023
  • [123] Sun S, Gupta D, Iyyer M. Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF. arXiv preprint arXiv:2309.09055, 2023
  • [124] Quan S. Dmoerm: Recipes of mixture-of-experts for effective reward modeling. arXiv preprint arXiv:2403.01197, 2024
  • [125] Zhang S, Chen Z, Chen S, Shen Y, Sun Z, Gan C. Improving reinforcement learning from human feedback with efficient reward model ensemble. arXiv preprint arXiv:2401.16635, 2024
  • [126] Zhai Y, Zhang H, Lei Y, Yu Y, Xu K, Feng D, Ding B, Wang H. Uncertainty-penalized reinforcement learning from human feedback with diverse reward lora ensembles. arXiv preprint arXiv:2401.00243, 2024
  • [127] Yang A X, Robeyns M, Coste T, Wang J, Bou-Ammar H, Aitchison L. Bayesian reward models for LLM alignment. arXiv preprint arXiv:2402.13210, 2024
  • [128] Yang A X, Robeyns M, Wang X, Aitchison L. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111, 2023
  • [129] Tran H, Yang Z, Yao Z, Yu H. Bioinstruct: Instruction tuning of large language models for biomedical natural language processing. arXiv preprint arXiv:2310.19975, 2023
  • [130] Gema A P, Daines L, Minervini P, Alex B. Parameter-efficient fine-tuning of llama for the clinical domain. arXiv preprint arXiv:2307.03042, 2023
  • [131] Toma A, Lawler P R, Ba J, Krishnan R G, Rubin B B, Wang B. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031, 2023
  • [132] Suri K, Mishra P, Saha S, Singh A. Suryakiran at mediqa-sum 2023: Leveraging lora for clinical dialogue summarization. In: CLEF. 2023
  • [133] Ji Y, Yu Z, Wang Y. Assertion detection large language model in-context learning lora fine-tuning. arXiv preprint arXiv:2401.17602, 2024
  • [134] Wang R, Duan Y, Lam C, Chen J, Xu J, Chen H, Liu X, Pang P C, Tan T. Ivygpt: Interactive chinese pathway language model in medical domain. In: CAAI. 2023
  • [135] Bhatti A, Parmar S, Lee S. SM70: A large language model for medical devices. arXiv preprint arXiv:2312.06974, 2023
  • [136] Konstantinidis T, Iacovides G, Xu M, Constantinides T G, Mandic D P. Finllama: Financial sentiment classification for algorithmic trading applications. arXiv preprint arXiv:2403.12285, 2024
  • [137] Pavlyshenko B M. Financial news analytics using fine-tuned llama 2 GPT model. arXiv preprint arXiv:2308.13032, 2023
  • [138] Liu X, Wang G, Zha D. Fingpt: Democratizing internet-scale data for financial large language models. arXiv preprint arXiv:2307.10485, 2023
  • [139] Li J, Lei Y, Bian Y, Cheng D, Ding Z, Jiang C. Ra-cfgpt: Chinese financial assistant with retrieval-augmented large language model. Frontiers of Computer Science, 2024, 18(5): 185350
  • [140] Zhou X, Sun Z, Li G. Db-gpt: Large language model meets database. Data Science and Engineering, 2024, 9(1): 102–111
  • [141] Li S. Diffstyler: Diffusion-based localized image style transfer. arXiv preprint arXiv:2403.18461, 2024
  • [142] Frenkel Y, Vinker Y, Shamir A, Cohen-Or D. Implicit style-content separation using b-lora. arXiv preprint arXiv:2403.14572, 2024
  • [143] Liu Y, Yu C, Shang L, He Y, Wu Z, Wang X, Xu C, Xie H, Wang W, Zhao Y, Zhu L, Cheng C, Chen W, Yao Y, Zhou W, Xu J, Wang Q, Chen Y, Xie X, Sun B. Facechain: A playground for human-centric artificial intelligence generated content. arXiv preprint arXiv:2308.14256, 2023
  • [144] Liao Q, Xia G, Wang Z. Calliffusion: Chinese calligraphy generation and style transfer with diffusion modeling. arXiv preprint arXiv:2305.19124, 2023
  • [145] Shrestha S, Venkataramanan A, et al. Style transfer to calvin and hobbes comics using stable diffusion. arXiv preprint arXiv:2312.03993, 2023
  • [146] Li L, Zeng H, Yang C, Jia H, Xu D. Block-wise lora: Revisiting fine-grained lora for effective personalization and stylization in text-to-image generation. arXiv preprint arXiv:2403.07500, 2024
  • [147] Kong Z, Zhang Y, Yang T, Wang T, Zhang K, Wu B, Chen G, Liu W, Luo W. OMG: occlusion-friendly personalized multi-concept generation in diffusion models. arXiv preprint arXiv:2403.10983, 2024
  • [148] Shi J, Hua H. Space narrative: Generating images and 3d scenes of chinese garden from text using deep learning. In: xArch–creativity in the age of digital reproduction symposium. 2023, 236–243
  • [149] Jin Z, Song Z. Generating coherent comic with rich story using chatgpt and stable diffusion. arXiv preprint arXiv:2305.11067, 2023
  • [150] Wang H, Xiang X, Fan Y, Xue J. Customizing 360-degree panoramas through text-to-image diffusion models. In: WACV. 2024
  • [151] Guo J, Xu X, Pu Y, Ni Z, Wang C, Vasu M, Song S, Huang G, Shi H. Smooth diffusion: Crafting smooth latent spaces in diffusion models. arXiv preprint arXiv:2312.04410, 2023
  • [152] Cheng J, Xie P, Xia X, Li J, Wu J, Ren Y, Li H, Xiao X, Zheng M, Fu L. Resadapter: Domain consistent resolution adapter for diffusion models. arXiv preprint arXiv:2403.02084, 2024
  • [153] Smith J S, Hsu Y C, Kira Z, Shen Y, Jin H. Continual diffusion with stamina: Stack-and-mask incremental adapters. In: CVPR. 2024
  • [154] Sun J, Fu D, Hu Y, Wang S, Rassin R, Juan D C, Alon D, Herrmann C, Steenkiste v S, Krishna R, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In: Synthetic Data for Computer Vision Workshop@ CVPR 2024. 2023
  • [155] Wang Z, Wang X, Xie L, Qi Z, Shan Y, Wang W, Luo P. Styleadapter: A single-pass lora-free model for stylized image generation. arXiv preprint arXiv:2309.01770, 2023
  • [156] Gu Y, Wang X, Wu J Z, Shi Y, Chen Y, Fan Z, Xiao W, Zhao R, Chang S, Wu W, Ge Y, Shan Y, Shou M Z. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. In: NeurIPS. 2023
  • [157] Luo S, Tan Y, Patil S, Gu D, Platen v P, Passos A, Huang L, Li J, Zhao H. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023
  • [158] Golnari P A. Lora-enhanced distillation on guided diffusion models. arXiv preprint arXiv:2312.06899, 2023
  • [159] Ren Y, Zhou Y, Yang J, Shi J, Liu D, Liu F, Kwon M, Shrivastava A. Customize-a-video: One-shot motion customization of text-to-video diffusion models. arXiv preprint arXiv:2402.14780, 2024
  • [160] Deng Y, Wang R, Zhang Y, Tai Y, Tang C. Dragvideo: Interactive drag-style video editing. arXiv preprint arXiv:2312.02216, 2023
  • [161] Yang S, Zhou Y, Liu Z, Loy C C. Rerender A video: Zero-shot text-guided video-to-video translation. In: SIGGRAPH. 2023
  • [162] Khandelwal A. Infusion: Inject and attention fusion for multi concept zero-shot text-based video editing. In: ICCV. 2023
  • [163] Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
  • [164] Guo Y, Yang C, Rao A, Wang Y, Qiao Y, Lin D, Dai B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
  • [165] Huang T, Zeng Y, Zhang Z, Xu W, Xu H, Xu S, Lau R W H, Zuo W. Dreamcontrol: Control-based text-to-3d generation with 3d self-prior. arXiv preprint arXiv:2312.06439, 2023
  • [166] Ma Y, Fan Y, Ji J, Wang H, Sun X, Jiang G, Shu A, Ji R. X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation. arXiv preprint arXiv:2312.00085, 2023
  • [167] Yu K, Liu J, Feng M, Cui M, Xie X. Boosting3d: High-fidelity image-to-3d by boosting 2d diffusion prior to 3d prior with progressive learning. arXiv preprint arXiv:2311.13617, 2023
  • [168] Yoo S, Kim K, Kim V G, Sung M. As-plausible-as-possible: Plausibility-aware mesh deformation using 2d diffusion priors. In: CVPR. 2024
  • [169] Zhang Y, Xu Q, Zhang L. Dragtex: Generative point-based texture editing on 3d mesh. arXiv preprint arXiv:2403.02217, 2024
  • [170] Ding H, Gao J, Yuan Y, Wang Q. Samlp: A customized segment anything model for license plate detection. arXiv preprint arXiv:2401.06374, 2024
  • [171] Ye Z, Lovell L, Faramarzi A, Ninic J. Sam-based instance segmentation models for the automation of structural damage detection. arXiv preprint arXiv:2401.15266, 2024
  • [172] Na S, Guo Y, Jiang F, Ma H, Huang J. Segment any cell: A sam-based auto-prompting fine-tuning framework for nuclei segmentation. arXiv preprint arXiv:2401.13220, 2024
  • [173] Chen X, Wang C, Ning H, Li S. SAM-OCTA: prompting segment-anything for OCTA image segmentation. arXiv preprint arXiv:2310.07183, 2023
  • [174] Feng W, Zhu L, Yu L. Cheap lunch for medical image segmentation by fine-tuning SAM on few exemplars. arXiv preprint arXiv:2308.14133, 2023
  • [175] Zhang K, Liu D. Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785, 2023
  • [176] Wang A, Islam M, Xu M, Zhang Y, Ren H. SAM meets robotic surgery: An empirical study on generalization, robustness and adaptation. In: MICCAI. 2023
  • [177] Lin L, Fan H, Zhang Z, Wang Y, Xu Y, Ling H. Tracking meets lora: Faster training, larger model, stronger performance. arXiv preprint arXiv:2403.05231, 2024
  • [178] Kong C, Li H, Wang S. Enhancing general face forgery detection via vision transformer with low-rank adaptation. In: MIPR. 2023
  • [179] Chen Z, Huang H, Andrusenko A, Hrinchuk O, Puvvada K C, Li J, Ghosh S, Balam J, Ginsburg B. SALM: speech-augmented language model with in-context learning for speech recognition and translation. arXiv preprint arXiv:2310.09424, 2023
  • [180] Dong X, Zhang P, Zang Y, Cao Y, Wang B, Ouyang L, Wei X, Zhang S, Duan H, Cao M, Zhang W, Li Y, Yan H, Gao Y, Zhang X, Li W, Li J, Chen K, He C, Zhang X, Qiao Y, Lin D, Wang J. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024
  • [181] Ye Q, Xu H, Xu G, Ye J, Yan M, Zhou Y, Wang J, Hu A, Shi P, Shi Y, Li C, Xu Y, Chen H, Tian J, Qi Q, Zhang J, Huang F. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023
  • [182] Lee B, Park B, Kim C W, Ro Y M. Collavo: Crayon large language and vision model. arXiv preprint arXiv:2402.11248, 2024
  • [183] Yeo J H, Han S, Kim M, Ro Y M. Where visual speech meets language: VSP-LLM framework for efficient and context-aware visual speech processing. arXiv preprint arXiv:2402.15151, 2024
  • [184] Liu Z, Li S, Luo Y, Fei H, Cao Y, Kawaguchi K, Wang X, Chua T. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In: EMNLP. 2023
  • [185] Ren Y, Chen Y, Liu S, Wang B, Yu H, Cui Z. TPLLM: A traffic prediction framework based on pretrained large language models. arXiv preprint arXiv:2403.02221, 2024
  • [186] Aghajanyan A, Gupta S, Zettlemoyer L. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In: ACL/IJCNLP. 2021
  • [187] Fomenko V, Yu H, Lee J, Hsieh S, Chen W. A note on lora. arXiv preprint arXiv:2404.05086, 2024
  • [188] Bershatsky D, Cherniuk D, Daulbaev T, Mikhalev A, Oseledets I V. Lotr: Low tensor rank weight adaptation. arXiv preprint arXiv:2402.01376, 2024
  • [189] Edalati A, Tahaei M S, Kobyzev I, Nia V P, Clark J J, Rezagholizadeh M. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650, 2022
  • [190] He X, Li C, Zhang P, Yang J, Wang X E. Parameter-efficient model adaptation for vision transformers. In: AAAI. 2023
  • [191] Sheng Y, Cao S, Li D, Hooper C, Lee N, Yang S, Chou C, Zhu B, Zheng L, Keutzer K, Gonzalez J E, Stoica I. S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023
  • [192] Mahabadi R K, Henderson J, Ruder S. Compacter: Efficient low-rank hypercomplex adapter layers. In: NeurIPS. 2021
  • [193] Liao B, Meng Y, Monz C. Parameter-efficient fine-tuning without introducing new latency. In: ACL. 2023
  • [194] He J, Zhou C, Ma X, Berg-Kirkpatrick T, Neubig G. Towards a unified view of parameter-efficient transfer learning. In: ICLR. 2022
  • [195] Geshkovski B, Letrouit C, Polyanskiy Y, Rigollet P. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023
  • [196] Geshkovski B, Letrouit C, Polyanskiy Y, Rigollet P. The emergence of clusters in self-attention dynamics. In: NeurIPS. 2023
  • [197] Sander M E, Ablin P, Blondel M, Peyré G. Sinkformers: Transformers with doubly stochastic attention. In: AISTATS. 2022
  • [198] Jacot A, Hongler C, Gabriel F. Neural tangent kernel: Convergence and generalization in neural networks. In: NeurIPS. 2018
  • [199] Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D, Blecher L, Canton-Ferrer C, Chen M, Cucurull G, Esiobu D, Fernandes J, Fu J, Fu W, Fuller B, Gao C, Goswami V, Goyal N, Hartshorn A, Hosseini S, Hou R, Inan H, Kardas M, Kerkez V, Khabsa M, Kloumann I, Korenev A, Koura P S, Lachaux M, Lavril T, Lee J, Liskovich D, Lu Y, Mao Y, Martinet X, Mihaylov T, Mishra P, Molybog I, Nie Y, Poulton A, Reizenstein J, Rungta R, Saladi K, Schelten A, Silva R, Smith E M, Subramanian R, Tan X E, Tang B, Taylor R, Williams A, Kuan J X, Xu P, Yan Z, Zarov I, Zhang Y, Fan A, Kambadur M, Narang S, Rodriguez A, Stojnic R, Edunov S, Scialom T. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
  • [200] Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, Hu S, Chen Y, Chan C, Chen W, Yi J, Zhao W, Wang X, Liu Z, Zheng H, Chen J, Liu Y, Tang J, Li J, Sun M. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell., 2023, 5(3): 220–235
  • [201] Zhao J, Zhang Z, Chen B, Wang Z, Anandkumar A, Tian Y. Galore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024
  • [202] Biderman D, Ortiz J J G, Portes J, Paul M, Greengard P, Jennings C, King D, Havens S, Chiley V, Frankle J, Blakeney C, Cunningham J P. Lora learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024
  • [203] Han A, Li J, Huang W, Hong M, Takeda A, Jawanpuria P, Mishra B. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. arXiv preprint arXiv:2406.02214, 2024
  • [204] Sui Y, Yin M, Gong Y, Xiao J, Phan H, Yuan B. ELRT: efficient low-rank training for compact convolutional neural networks. arXiv preprint arXiv:2401.10341, 2024
  • [205] Meng X, Dai D, Luo W, Yang Z, Wu S, Wang X, Wang P, Dong Q, Chen L, Sui Z. Periodiclora: Breaking the low-rank bottleneck in lora optimization. arXiv preprint arXiv:2402.16141, 2024
  • [206] Frank M, Wolfe P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956, 3(1-2): 95–110
  • [207] Valipour M, Rezagholizadeh M, Kobyzev I, Ghodsi A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558, 2022
  • [208] Elsken T, Metzen J H, Hutter F. Neural architecture search: A survey. J. Mach. Learn. Res., 2019, 20: 55:1–55:21
  • [209] Hansen N, Ostermeier A. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In: IEEE ICEC. 1996
  • [210] Ye M, Fang X, Du B, Yuen P C, Tao D. Heterogeneous federated learning: State-of-the-art and research challenges. ACM Comput. Surv., 2024, 56(3): 79:1–79:44
Yuren Mao received his PhD degree in computer science from the University of New South Wales, Australia, in 2022, under the supervision of Prof. Xuemin Lin. He is currently an assistant professor with the School of Software Technology, Zhejiang University, China. His current research interests include Large Language Models and their applications in Data Intelligence.

Yuhang Ge is currently working toward his PhD degree in the School of Software Technology at Zhejiang University, China. His research interests include Large Language Models and Data Management.

Yijiang Fan is currently a master's student in the School of Software Technology at Zhejiang University, China. His research interests include Large Language Models and collaborative inference.

Wenyi Xu is currently a master's student in the School of Software Technology at Zhejiang University, China. His research interests include Multimodal Large Models and retrieval-augmented generation (RAG).

Yu Mi is currently a master's student in the School of Software Technology at Zhejiang University, China. Her research interests include Large Language Models and AI for science.

Zhonghao Hu is currently a master's student in the School of Software Technology at Zhejiang University, China. His research interests include Large Language Models and data discovery.

Yunjun Gao received his PhD degree in computer science from Zhejiang University, China, in 2008. He is currently a professor in the College of Computer Science and Technology, Zhejiang University, China. His research interests include Database, Big Data Management and Analytics, and AI interaction with DB technology.