Panza: A Personalized Text Writing Assistant
via Data Playback and Local Fine-Tuning

Armand Nicolicioiu¹ ², Eugenia Iofinova¹ ², Eldar Kurtic¹ ² ³, Mahdi Nikdan¹ ², Andrei Panferov², Ilia Markov², Nir Shavit³ ⁴, Dan Alistarh² ³ ⁵

¹ Equal contribution. ² Institute of Science and Technology Austria. ³ Neural Magic, Inc. ⁴ Massachusetts Institute of Technology. ⁵ Correspondence to [email protected]

Abstract

The availability of powerful open-source large language models (LLMs) opens exciting use cases, such as automated personal assistants that adapt to the user’s unique data and demands. Two key desiderata for such assistants are personalization, in the sense that the assistant should reflect the user’s own style, and privacy, in the sense that users may prefer to always store their personal data locally, on their own computing device. In this application paper, we present a new design for such an automated assistant, for the specific use case of email generation, which we call Panza. Specifically, Panza can be both trained and run for inference locally on commodity hardware, and is personalized to the user’s writing style. Panza’s personalization features are based on a new technique called data playback, which allows us to fine-tune an LLM to better reflect a user’s writing style using limited data. We show that, by combining efficient fine-tuning and inference methods, Panza can be executed entirely locally using limited resources: specifically, it can be executed within the same resources as a free Google Colab instance. Finally, our key methodological contribution is a careful study of evaluation metrics, and of how different choices of system components (e.g. the use of Retrieval-Augmented Generation or different fine-tuning approaches) impact the system’s performance. Panza is available at https://github.com/IST-DASLab/PanzaMail.

1 Introduction

An automated personal assistant is a software application that can help the user with various repetitive tasks such as email, writing, or summarization. Large Language Models (LLMs) are natural candidates for implementing personal assistants, as they can provide remarkably good results on such generative tasks. At the same time, many highly-capable LLMs reside in the cloud, and can only be accessed via an API. Specifically, this makes it expensive or impossible to support certain natural features for automated personal assistants, namely:

1. Model personalization, that is, customizing the model to the specific individual’s tone, preferences, and personal history;

2. Privacy protection, that is, allowing the model to have access to highly personal information of a caliber that (unlike corporate data) most people would not agree to share even if promised that the data is protected by the provider’s cloud.

A natural approach to addressing these constraints would be to execute these models locally, on the user’s own data and hardware. However, this poses obvious challenges both in terms of data ingestion (that is, getting user data into a format that can be used to successfully train or fine-tune an LLM) and in terms of hardware efficiency (specifically, because fine-tuning or even running inference over a capable LLM on user data may be technically impossible if done naively).

In this application paper, we take up this challenge for the limited, but interesting, case of designing and implementing a fully-local automated email writing assistant we call Panza, whose purpose is to generate relevant messages in the user’s own style, given a user prompt, as well as access to a set of previously-sent user emails. Our main finding is that it is possible to obtain a capable assistant starting from existing pre-trained models such as Llama-3-8B (Meta, 2024), and that this application can be executed entirely locally, on a single machine with a commodity GPU, or even in a CPU-only environment. The overall structure of Panza is illustrated and described in Figure 1. While the focus of our work is applied, we present novel findings from the following perspectives:

1. We introduce a new technique called data playback whose goal is to personalize a generative LLM’s output to match a user’s writing style, given a relatively small number of text samples produced by a user, such as sent emails. Data playback works in two stages: first, given the text samples and a pre-trained (non-personalized) LLM, we use the LLM to summarize each text sample, but in imperative form, i.e. as an instruction from a user to an assistant. For email generation, each text sample becomes associated with an instruction which, if followed, should lead an ideal assistant to produce the original email exactly. We show that pre-trained instruction-tuned open LLMs such as Mistral and Llama have good performance on this instruction-generation task. In the second step, we use the (instruction, sample) pairs as training data for a personalized instruction tuning stage of the pre-trained LLM, with the goal of transferring the user’s style onto the LLM. As such, data playback “reduces” personalization to instruction tuning. The fine-tuned model can then be deployed for inference over unseen prompts, possibly in conjunction with Retrieval-Augmented Generation (RAG).

2. On the methodological side, we begin with an investigation of evaluation metrics for this task, in conjunction with studying the impact of different design options or even hyper-parameters on the final model accuracy. First, we observe that the BLEU / ROUGE / MAUVE metrics usually employed in these scenarios all show similar trends. Then, we show that data playback consistently outperforms both a prompted pre-trained model and a pre-trained model with RAG. We then perform one of the first in-depth studies of the impact of RAG on personalized model performance when applied either at training time or at inference time (or both) for LLMs.

3. We show that this entire pipeline can be executed in a resource-constrained setup. Specifically, we show that both the fine-tuning necessary for data playback and the inference and RAG components can be run efficiently and accurately on a system with a single commodity GPU. We show that the best parameter-efficient fine-tuning (PEFT) results are obtained using the Robust Adaptation (RoSA) method (Nikdan et al., 2024), which we find particularly suitable for style transfer, and which can provide good results even with limited training data. Our main innovation on this point is a new accurate merging mechanism, which is required to accurately combine sparse and low-rank PEFT adapters into quantized weights.

A general finding is that existing metrics are not a good fit to the highly-practical scenario where an LLM has to generate personalized text via RAG or fine-tuning. To address this challenge, we propose a blend of metrics to ensure that the desiderata of personalization and content are met. In addition, we show that, using our techniques, good performance for personalized text generation can be achieved using existing open LLMs.

Related Work.

Broadly speaking, the LLM personalization problem we consider is that of customizing an LLM to yield outputs that match a given user’s tone and writing style. Two additional challenges we consider are 1) the availability of limited user data to be used for personalization, and 2) the fact that we wish to do so under stringent computational constraints. The problem of personalization has seen a lot of interest for Language Models, e.g. (King and Cook, 2020). However, due to their recent nature, there is a lot less literature on personalization of generative LLMs (Chen et al., 2023; Kirk et al., 2024). Existing work starts from the assumption that RAG variants over a large enough context should be sufficient for personalization Salemi et al. (2024). In this work, we show both via individual examples and comparison across metrics that this is not necessarily the case: while LLMs using RAG can indeed provide coherent outputs, including personal information obtained via retrieval, they largely fail to match a given user’s writing style.

Our work builds upon both RAG Lewis et al. (2020) and on Parameter-Efficient Fine-Tuning (PEFT) techniques (Hu et al., 2021; Nikdan et al., 2024). We observe that, for personalization, the recently-proposed RoSA method Nikdan et al. (2024) tends to outperform LoRA Hu et al. (2021), matching full fine-tuning. Further, keeping model weights quantized Dettmers et al. (2024) does not impact accuracy significantly.

2 Method

2.1 Overview

The Panza design, described in Figure 1, requires a pre-trained, possibly instruction-tuned, LLM and a set of emails sent by the user. Both the LLM and the emails have dual use. First, the pre-trained LLM is used to rephrase the “raw” user emails in the form of instructions, which will be used for data playback (described below). Then, the LLM itself is fine-tuned on these instructions, resulting in the Panza personalized model. Independently, the emails are used to generate a RAG database, employed at deployment time.

Figure 1: Panza’s overall design. Given a set of emails produced by the user, we produce both a data playback dataset and a retrieval-augmented generation (RAG) database. The base model is first fine-tuned via data playback, and then served in conjunction with RAG.

2.2 The Data Playback Mechanism

Data playback is the core personalization technique behind Panza. The intuition behind data playback is that we would like to “reduce” the problem of creating emails in the user’s style to a specific instance of instruction-tuning. In a practical usage scenario, the user query would come in the form of a standard instruction, e.g. “Write an email to Joanne to set up a meeting on Wednesday.”, and the “correct” answer would be such an email written in the user’s style. To induce this behavior from the LLM, data playback works in two steps:

1. First, we use a pre-trained LLM to rephrase each email into a targeted instruction, containing just enough information for an assistant to write the original email.

2. Second, we fine-tune the LLM (possibly the same one) on the (instruction, email) pairs obtained in the first step, with a standard training objective which induces the LLM to reconstitute the original email as a response to the instruction. Optionally, we implement a training-time RAG component, which retrieves query-related emails and provides them as context to the LLM at training time. This is similar to retrieval-augmented fine-tuning (RAFT) (Zhang et al., 2024).

At the end of these two steps, we have obtained a personalized LLM which can respond to short user queries by writing targeted emails that should follow the user’s style. Next, we describe how this model is deployed.
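The first step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Panza’s actual code: `summarize_to_instruction` stands in for a call to an instruction-tuned LLM, and both it and the record field names (`instruction`, `email`) are hypothetical.

```python
from typing import Callable, Dict, List


def build_playback_dataset(
    emails: List[str],
    summarize_to_instruction: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each email with an imperative instruction (step 1 of data playback)."""
    dataset = []
    for email in emails:
        text = email.strip()
        if not text:
            continue  # skip empty messages
        # In practice this would be an LLM call; here it is any callable.
        instruction = summarize_to_instruction(text)
        dataset.append({"instruction": instruction, "email": text})
    return dataset


def stub_summarizer(email: str) -> str:
    # Hypothetical stand-in for the LLM: reuse the first line of the email.
    first_line = email.splitlines()[0]
    return f"Write an email that says: {first_line}"


pairs = build_playback_dataset(
    ["Hi Joanne,\nCan we meet on Wednesday?\nBest, D.", "   "],
    stub_summarizer,
)
```

The resulting (instruction, email) records are what the second step would consume as supervised fine-tuning examples, with the instruction as the prompt and the original email as the target.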

2.3 Deployment

In the absence of computational or memory constraints, we perform full fine-tuning (FFT) of the base model, and inference over it, possibly adding a retrieval-augmented generation (RAG) component, which retrieves similar emails sent in the past. However, fine-tuning and deploying a powerful billion-parameter model locally requires a powerful GPU with significant memory. Therefore, we also investigate techniques for reducing these costs, as well as their impact in terms of accuracy metrics.

2.4 Local Fine-Tuning and Inference

Memory efficiency is critical in our setting, due to privacy constraints. For instance, full fine-tuning (FFT) of a Mistral-7B model Jiang et al. (2023) in half-precision using a standard Adam optimizer Kingma and Ba (2015) requires more than 60GB of GPU memory. Such resources are rarely available in a consumer-grade local machine.
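To see where a figure of this magnitude can come from, the following back-of-the-envelope sketch adds up the per-parameter training state. The parameter count (about 7.24 billion) and the assumption that weights, gradients, and both Adam moments are each held at 2 bytes per parameter are ours, not taken from the paper; activations and framework buffers are excluded and come on top.

```python
def training_memory_gb(num_params: float, bytes_per_param: dict) -> float:
    """Sum the per-parameter training state and convert to decimal gigabytes."""
    total_bytes = num_params * sum(bytes_per_param.values())
    return total_bytes / 1e9


# Assumption: approximate parameter count of a Mistral-7B-class model.
ASSUMED_PARAMS = 7.24e9

# Assumption: every tensor is stored in half precision (2 bytes per value).
ASSUMED_BYTES = {
    "weights": 2,
    "gradients": 2,
    "adam_first_moment": 2,
    "adam_second_moment": 2,
}

fft_gb = training_memory_gb(ASSUMED_PARAMS, ASSUMED_BYTES)
print(f"{fft_gb:.1f} GB before activations")
```

Under these assumptions the static state alone is roughly 58GB, so adding activations plausibly pushes the total past 60GB. Keeping the optimizer moments in 32-bit precision would raise the estimate further.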

We tailor Panza to two resource-constrained settings: running on a GPU with under 24GB RAM and under 15GB RAM, each of which is detailed next. Throughout, we use Mistral-7B (Jiang et al., 2023) as a running example for costs, but the techniques apply to other base LLMs of similar size. We always consider a local training micro-batch size of $1$ to minimize memory footprint, accumulating gradients whenever necessary. See Section 4 for more details. Training takes under an hour.

Panza on a single GPU.

We first assume local access to a single 24GB GPU (such as NVIDIA GeForce RTX 3090), which is relevant for users with a small GPU server or a strong gaming laptop. For this, we use Parameter-Efficient Fine-Tuning (PEFT) methods, which tune only a small (possibly extra) set of parameters to enable efficient adaptation of models to downstream tasks. We compare the standard LoRA method Hu et al. (2021), as well as the more recent Robust Adaptation (RoSA) method Nikdan et al. (2024), which we find to be particularly effective for style transfer. Specifically, by training a combination of low-rank and sparse adapters on top of the base weights, RoSA allows effective fine-tuning of a half-precision 7B model on less than 24GB of memory, with competitive accuracy relative to full fine-tuning. For deployment, we merge the RoSA adapters into the base model weights, with the inference requiring around 15GB.

Panza under 15GB GPU memory.

This setting is particularly interesting since it allows training and deploying Panza on a free Colab instance (NVIDIA T4 GPU), or a gaming laptop. In addition to the GPU memory constraint, we also assume access to only 12GB of CPU memory, since that is the case for free Colab instances. We take a similar approach as above, while quantizing multiple components of the pipeline to $4$ bits per parameter, detailed below.

• Quantized base weights. We store the weights of the base model in doubly-quantized $4$-bit precision (Dettmers et al., 2023), reducing the model size in memory by roughly $4\times$.

• Quantized accumulators. As part of the sparse adapter’s mask generation, RoSA accumulates gradients of the base weights on CPU, violating the 12GB CPU memory restriction. To remedy this issue, we change the precision of the accumulators to $4$ bits using a uniform quantization with group size $128$.

• Adapter merging. To merge the half-precision RoSA adapters and the $4$-bit base weights, we present a new accurate merging mechanism: for each layer, we dequantize the base weights, add adapters, and quantize the result back to $4$ bits using the GPTQ method (Frantar et al., 2022). Our key contribution is an adapter-aware implementation of the GPTQ algorithm, where the quantization is interleaved with merging per layer, without materializing the full half-precision model.

The above optimizations allow us to run Panza using the Mistral-7B-Instruct model variants on a single GPU under 15GB RAM.
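The quantization and merging steps listed above can be illustrated with a small, dependency-free sketch. It is a simplified illustration under our own assumptions: it uses plain round-to-nearest uniform quantization in place of GPTQ, works row by row rather than layer by layer, and represents the sparse adapter as a dictionary. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Group = Tuple[List[int], float, float]  # (4-bit codes, scale, offset)


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Uniform quantization: each group gets its own scale and offset."""
    levels = 2 ** bits - 1
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [min(levels, max(0, round((x - lo) / scale))) for x in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_row(groups: List[Group]) -> List[float]:
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


def merge_and_requantize(
    q_weights: List[List[Group]],
    lora_a: List[List[float]],            # shape (d_out, r)
    lora_b: List[List[float]],            # shape (r, d_in)
    sparse: Dict[Tuple[int, int], float],  # sparse adapter entries
    group_size: int = 128,
) -> List[List[Group]]:
    """Dequantize one row, add low-rank and sparse updates, quantize it back.

    Only a single row is ever held in full precision. Round-to-nearest is
    used here; the paper uses an adapter-aware GPTQ at this step instead.
    """
    merged = []
    for i, groups in enumerate(q_weights):
        row = dequantize_row(groups)
        for j in range(len(row)):
            low_rank = sum(lora_a[i][k] * lora_b[k][j] for k in range(len(lora_b)))
            row[j] += low_rank + sparse.get((i, j), 0.0)
        merged.append(quantize_row(row, group_size))
    return merged


weights = [[0.0, 0.5, 1.0, 1.5], [-1.0, 0.0, 1.0, 2.0]]
q = [quantize_row(r, group_size=4) for r in weights]
merged = merge_and_requantize(
    q, [[1.0], [0.0]], [[0.1, 0.1, 0.1, 0.1]], {(1, 3): 1.0}, group_size=4
)
```

With round-to-nearest, the reconstruction error of each value is at most half the group’s scale; GPTQ aims to reduce the resulting error in layer outputs rather than in individual weights, which is why the paper prefers it for the final merge.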

3 Evaluation Protocol

3.1 Datasets

A key challenge of this project is the lack of availability of e-mail datasets, due to the sensitive content of the data. To our knowledge, the only such repository contains the business e-mails of 144 Enron employees. This data was originally released by the Federal Energy Regulatory Commission; the version we use (Cohen, 2015) is licensed for research. We use the emails of four employees (s.shackleton, k.mann, j.dasovich, and t.jones) with over 400 (English) e-mails each; these are identified by their first names: Sara, Kay, Jeff, and Tana. To avoid influencing the model by explicitly invoking Enron, we changed the name of the corporation and its executives.

We use three additional datasets. The first, David, was anonymized manually by its author and donated for research use by an ML researcher with a clear understanding of its release and proposed use; this dataset will be released as part of this project. Two additional datasets, identified as Anon1 and Anon2, were also donated by ML researchers for this project, but will not be released due to the sensitive nature of their contents. Of the seven datasets used, six contain primarily business emails, and one, Anon2, contains primarily personal emails.

Table 1: E-mail datasets used. The David dataset is a new contribution published along with this work.

Name	Source	Public	Number of Emails
Tana	Enron	✓	742
Jeff	Enron	✓	573
Kay	Enron	✓	481
Sara	Enron	✓	464
David	Donated	✓	166
Anon1	Donated		301
Anon2	Donated		669

The train-test split for fine-tuning / PEFT is 80-20%, and only training emails are used to form the RAG database employed at inference time. To assess the quality of the LLM-generated email summarization prompts used in the data playback process, 18 randomly selected emails from David and 20 randomly selected emails from Anon2 were annotated manually with prompts by the authors.

3.2 Metrics

Table 2: Summary of desiderata and metrics to evaluate LLM outputs for summarization and email generation.

Desideratum	Metric(s) used	Task: Prompts	Task: Emails
Paraphrasing quality	BLEU/ROUGE	✓	✓
User-specific knowledge	BLEU/ROUGE		✓
World knowledge	HF eval suite		✓
Style	MAUVE		✓

Panza uses the text generation capabilities of LLMs for two tasks: to summarize a user’s emails to create synthetic prompts for the data playback process, and to generate new emails from test-time prompts. These two use cases both rely on the ability of LLMs to summarize or rephrase content; the email generation task additionally requires the recall of both general and user-specific knowledge, and an imitation of a user’s personal style. (Examples of general knowledge are creating a bash script to perform a given task, or including some general details about the current US president; an example of specific knowledge is the user’s home address.)

Thus, we divide the evaluation benchmarks into four broad categories: paraphrasing quality, user-specific knowledge, general knowledge, and style. Of these, the data playback email summaries only need to perform well on the first.

Paraphrasing quality.

For paraphrasing quality we rely on the BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics, which are standard measures of translation and summarization quality. Both metrics function by counting matching N-grams between the LLM output and one or several “golden” responses (for the email generation task, the golden response is the email actually written by the user). The BLEU score is a weighted measure of uni-, bi-, tri-, and quad-grams that match exactly between the output and golden text strings, normalized by the string length; we use an equal weight of $0.25$ for each N-gram length. ROUGE reports 1-gram and 2-gram precision, recall, and F1-score, as well as precision, recall, and F1-score based on the longest common subsequence (ROUGE-L); in our paper, we use the ROUGE-L F1 score for maximum contrast with the BLEU metric. For both metrics, we use the Torchmetrics package. Both metrics are computed for each prompt/output combination, and we report the average across all prompts as the overall value. We do not compute either metric on a per-sentence basis, but rather compare N-grams in the full email text after dropping punctuation.
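To make the two scores concrete, the sketch below computes a single-reference BLEU with uniform weights of $0.25$ and a ROUGE-L F1 score in plain Python. It is a didactic approximation and not the Torchmetrics implementation used in the paper: the tokenizer (lowercasing, dropping punctuation, splitting on whitespace) is our assumption, and no smoothing is applied.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase, strip punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        total = sum(c.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty n-gram order gives zero
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


golden = "Hi Joanne, can we meet on Wednesday?"
generated = "Hi Joanne, could we meet on Wednesday?"
scores = (bleu(generated, golden), rouge_l_f1(generated, golden))
```

Because BLEU without smoothing drops to zero as soon as one N-gram order has no match, short or loosely paraphrased emails are penalized heavily, which is consistent with the low absolute scores discussed later in this section.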

User-specific knowledge.

As there is no database of user-specific knowledge, we restrict the evaluation of such knowledge to what is contained in the user’s emails. This working assumption enables us to use RAG-assisted email generation; here, it further enables us to evaluate user-specific knowledge on the same test dataset as for the paraphrasing quality. We do not attempt to decouple knowledge content from paraphrasing quality, rather relying on the overall BLEU and ROUGE scores to reflect the correct imputed information. Note that, unlike the general knowledge desideratum, the user-specific information is better specified (a prompt requesting the user’s current address is easier to evaluate than one asking for suggestions for a travel destination), and so the N-gram match is an appropriate measure of quality.

General knowledge.

For the world knowledge measurement, we rely on the standard six tasks that make up the popular Open-LLM leaderboard evaluation suite Beeching et al. (2023), which we describe in more detail in Section 4.5. Together, these tasks test the model’s ability to perform basic reasoning and factual recall in English, and are used as a broad evaluation of a model’s quality.

Style transfer.

To measure the quality of the style transfer, we use the MAUVE score (Pillutla et al., 2021), which was introduced to measure the degree of similarity between machine-generated text and human-written text. Specifically, the MAUVE score relies on an estimate of the symmetric K-L divergence between the distribution of the golden labels and the distribution of model outputs. In order to estimate the necessary K-L divergences, the golden and output strings are tokenized and passed through an external LLM (we use GPT-2, as is common), producing text embeddings in a multidimensional space. A higher MAUVE score is correlated with higher difficulty in distinguishing the generated text from human text.

Discussion.

A major challenge of this project is that the email generation task differs significantly from summarization or translation tasks, in that we generally expect the output email to be longer and more detailed than the prompt used to generate it, thus requiring some improvisation on the part of the model, and making it unlikely for the generated email to be close to the actual one. This is reflected in the BLEU/ROUGE scores, which are substantially lower than what would be considered acceptable for a translation or summarization task. Extensive manual review during the project development phase has shown that, nevertheless, these scores appear highly correlated with output email quality. We provide a few examples in Appendix C. As a rough guideline, human subjects generally agreed that models achieving above 0.2 average BLEU score and above 0.75 MAUVE score produced plausible emails.

4 Experimental Results

4.1 Baselines

As no other e-mail generators exist to our knowledge, we investigate the effect of Data Playback/finetuning by baselining against simply using prompt engineering to elicit personalization. We start with publicly available instruction-finetuned LLMs: Meta-Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2, and Phi-3-mini-4k-instruct. As the first baseline, we prompt the models with the following format:

1. System preamble: Sets the role of the LLM as an email writing assistant.

2. User preamble: General information about the user, such as name, occupation, address.

3. Instruction: The actual email writing instruction, as created by an LLM in the first phase of data playback.

This baseline (denoted Pretrained) provides a reasonable starting point with respect to writing well-structured emails, but without any personalization, since the model does not have information about the user’s style.

Next, we test if presenting a few samples of the user’s previous emails during inference through a RAG component can provide enough information for the model to imitate the user’s style. We select the closest $n_{RAG}$ previous emails, filtered by a relevancy threshold $T_{RAG}$, and add them to the input as an additional preamble. We denote this baseline as Pretrained + RAG.
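The selection rule described above (keep at most $n_{RAG}$ emails whose relevancy clears $T_{RAG}$) can be sketched as follows. The use of cosine similarity over precomputed embedding vectors is our assumption for illustration; the text does not specify the similarity measure, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    email_vecs: List[Sequence[float]],
    n_rag: int,
    t_rag: float,
) -> List[int]:
    """Return indices of up to n_rag emails with similarity >= t_rag, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(email_vecs)]
    relevant = [(score, idx) for score, idx in scored if score >= t_rag]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in relevant[:n_rag]]


# Toy 2-dimensional "embeddings" of three past emails (illustrative only).
email_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
selected = retrieve([1.0, 0.0], email_vecs, n_rag=2, t_rag=0.5)
```

Note that the threshold can leave fewer than $n_{RAG}$ emails, or none at all, in which case the prompt falls back to the plain Pretrained format.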

4.2 Fine-Tuning via Data Playback

Next, we employ the complete data playback technique Panza is based on. After generating pairs of (instruction, email), we fine-tune the pre-trained LLM to reconstruct the email given the instruction. We analyze the following regimes: full fine-tuning (FFT) and two PEFT methods: Robust Adaptation (RoSA) and Low-Rank Adaptation (LoRA).

For both training and testing, the input is formatted with the same system and user preambles described for the baselines. We also test whether RAG can bring additional improvements.

Furthermore, we explore if the model can learn how to better leverage previous e-mails in RAG by presenting the same type of augmented prompt during fine-tuning, parameterized by the number of closest emails $n_{RAG}$ and the relevancy threshold $T_{RAG}$. Additionally, to make the model robust to the absence of similar emails, we have a $p_{RAG}$ chance of not using any similar emails for a particular instruction during fine-tuning, even if there are matches in the database. This approach adapts Zhang et al. (2024) to our setting and is denoted as RAFT (Retrieval-Augmented Fine-Tuning).
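A minimal sketch of how such a training prompt might be assembled is given below. The section labels, separators, and function name are hypothetical and chosen for illustration; only the three prompt components and the $p_{RAG}$ dropout rule come from the text.

```python
import random
from typing import List


def build_training_prompt(
    system_preamble: str,
    user_preamble: str,
    instruction: str,
    retrieved_emails: List[str],
    p_rag: float,
    rng: random.Random,
) -> str:
    """Assemble a RAFT-style prompt, dropping retrieved emails with probability p_rag."""
    # rng.random() is in [0, 1), so p_rag=1.0 always drops and p_rag=0.0 never drops.
    use_context = bool(retrieved_emails) and rng.random() >= p_rag
    parts = [system_preamble, user_preamble]
    if use_context:
        parts.append("Previous emails:\n" + "\n---\n".join(retrieved_emails))
    parts.append("Instruction: " + instruction)
    return "\n\n".join(parts)


rng = random.Random(0)
prompt = build_training_prompt(
    "You are an email writing assistant.",
    "The user is David, a researcher.",
    "Write an email to Joanne to set up a meeting on Wednesday.",
    ["Hi Joanne, are you free on Tuesday?"],
    p_rag=0.0,
    rng=rng,
)
```

Randomly withholding the retrieved emails means the model sees both augmented and plain prompts during fine-tuning, so it cannot rely on context always being present.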

Hyperparameter tuning.

We found fine-tuning, and especially PEFT, to be highly sensitive to the learning rate and the number of training epochs. To find suitable hyperparameters, we first used a greedy grid search approach with learning rates ranging from $10^{-6}$ to $10^{-4}$, epoch counts from $1$ to $9$, and batch sizes of $8$ and $16$, using the BLEU metric as the proxy for overall model quality. We used the realistic, non-anonymized Anon1 and Anon2 datasets for hyperparameter tuning and chose values that worked well for both. Overall, we found a learning rate of $10^{-5}$, a batch size of $8$, and 5 epochs (3 for FFT) to work well across all base models and fine-tuning styles. We then tuned these parameters further for the other users. The final values for all users are presented in Appendix E.

4.3 Results Across Methods and Models

We find that all fine-tuning regimes outperform the Pretrained + RAG baselines by a large margin, adapting to the user’s writing style. The results are illustrated in Figure 2. (We present a qualitative comparison of the generated emails in Appendix C; using leading closed LLMs yielded similarly poor results.) Interestingly, RoSA performs on par with FFT, consistently surpassing LoRA, especially in terms of MAUVE score. The same trend is observed for all the backbones we trained, and for all users (please see Appendix D for full results).

In Figure 3 we study the effect of RAG on models fine-tuned with RoSA. We observe that, although RAG clearly helps for the pretrained baseline, it reduces the average BLEU score for the RoSA fine-tuned model (second group), but may slightly increase the average MAUVE score. This is not desirable, as BLEU score (relative to the ground-truth email) is a closer measure of content accuracy than MAUVE. Based on analyzing individual samples, we hypothesize that this may be caused by the model re-using the RAG context too aggressively at deployment time, leading to emails that are very similar to past emails (so, preserving style), but less accurate in terms of content (leading to a lower average BLEU score). We observe that we can overcome this issue by introducing RAG during fine-tuning itself, i.e. using RAFT: when previous emails are presented during fine-tuning, the model “learns” to ignore them if irrelevant.

In Figure 4 we show that similar performance levels can be obtained by fine-tuning various LLM backbones, when performing FFT or RoSA-RAFT across Mistral-7B-Instruct-v0.2, Llama3-8B-Instruct, and Phi-3-mini-4k-instruct. The achieved BLEU scores are very similar across models; the only significant difference is the higher MAUVE score achieved by the Llama3 model.

4.4 Style Evaluation

Recall that MAUVE measures the gap between machine-generated text and human-written text. Above, we reported the MAUVE scores on the test emails coming from the same user the model was trained for. Next, we do a pairwise comparison, evaluating models trained for different users on the test data of all the other users. This focuses precisely on how well style is reflected in generated emails. In Figure 5, we see that each model produces a high MAUVE score ($0.6$ to $1.0$) only for the test emails of the user it was trained for, while it has close to $0$ MAUVE score on any other user. In terms of BLEU/ROUGE score, all models have essentially the same performance on the test set of any given user, suggesting all models have similar paraphrasing capacity to express the given instruction, but each does it in the style of the user it was trained for. We therefore conclude that Data Playback is highly effective in terms of this metric.

4.5 Maintaining General Knowledge

Despite the primary objective of Panza being personalization, it is desirable for the assistant to retain the general knowledge acquired during pre-training and supervised fine-tuning. To assess the extent to which the model preserves its general knowledge post-personalization, we evaluated it using the few-shot setup of the popular Open-LLM Leaderboard evaluation suite Beeching et al. (2023).

All results are obtained through the corresponding lm-evaluation-harness framework Gao et al. (2021), and are reported in Table 3. Based on the results collected for David, we conclude that personalized Panza models retain, on average, over 98% of the baseline model’s accuracy. This suggests that the Panza personalization process avoids catastrophic forgetting.

Table 3: Performance of personalized Panza models on general knowledge benchmarks.

	Llama-3-8B Instruct	FFT	LoRA	RoSA
Arc-c	60.7	60.3	60.7	58.9
MMLU	67.1	65.5	65.6	65.3
Hellaswag	78.5	78.7	78.6	78.2
Winogrande	74.5	74.3	74.9	73.2
GSM8k	68.7	70.2	69.1	68.4
TruthfulQA	51.6	50.9	51.4	50.7
Average	66.9	66.7	66.7	65.8
Recovery (%)	100	99.7	99.7	98.4

4.6 The Impact of Compression

We now evaluate the memory-efficient version of Panza (described in Section 2.4), which requires less than 15GB of GPU memory, and 12GB CPU RAM. Particularly, we first examine how quantizing each component of the pipeline affects the results. Then, we show that quantizing all the components at the same time can achieve reasonable results while being more memory efficient than the non-quantized version.

Setting and hyper-parameters.

We fine-tune the Mistral-7B-Instruct-v0.2 model using RoSA on the David dataset, and use the same three metrics (BLEU, ROUGE, and MAUVE) for evaluation. For each experiment, we select the best of $5$ and $7$ epochs and learning rates $10^{-4}$ and $10^{-5}$ in terms of BLEU score. We perform both summarization and fine-tuning using the same Mistral model (either half-precision or 4-bit quantized), since we find that the quantization methods perform better in this case compared to Llama-3 summaries.

Quantizing components separately.

As described in Section 2.4, quantization can be alternatively applied to 1) RoSA’s base weights, 2) RoSA’s gradient accumulators, and 3) the final model used for inference. In addition, the email summarization model should also be compressed. Table 4 shows how quantizing each component affects the final accuracy. These results show that (a) quantizing most components individually only marginally affects the results (especially in terms of MAUVE score); (b) quantizing the inference model with bitsandbytes (BNB) (Dettmers et al., 2023) significantly downgrades the accuracy, while (c) GPTQ quantization (Frantar et al., 2022) recovers comparable accuracy to the original. We note that applying GPTQ to this setting is only possible due to our new efficient implementation of GPTQ (described in Section 2.4), which allows for merging $4$-bit base weights with half-precision RoSA adapters without materializing a complete half-precision model.

Fully-quantized Panza.

Finally, we apply quantization to every component at the same time and report the accuracy of the model in Table 4. These results show that it is possible to achieve reasonably accurate results with under 15GB of GPU memory.

Table 4: Effect of applying $4$-bit quantization to different components of Panza on the David dataset without RAG. We use Mistral-7B-Instruct-v0.2 for both summarization and fine-tuning. BNB stands for bitsandbytes (Dettmers et al., 2023) and GPTQ indicates the quantization technique of Frantar et al. (2022). Further, QSum: summarization with quantized model, QRoSA: RoSA with quantized base model, QAcc: quantized gradient accumulation for RoSA, QInf: inference with quantized model, and QAll: all components quantized at the same time.

	BLEU	ROUGE	MAUVE
All BF16	$0.265$	$0.432$	$0.971$
QSum (BNB)	$0.153$	$0.283$	$0.998$
QRoSA (BNB)	$0.268$	$0.417$	$0.971$
QAcc (Uniform)	$0.239$	$0.407$	$0.971$
QInf (BNB)	$0.084$	$0.192$	$0.130$
QInf (GPTQ, ours)	$0.251$	$0.414$	$0.971$
QAll (BNB Inf)	$0.076$	$0.119$	$0.070$
QAll (GPTQ Inf, ours)	$0.207$	$0.390$	$0.996$
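One way to read Table 4 is to express every configuration relative to the all-BF16 baseline. The short sketch below does this for the values copied from the table; the `retention` helper is a hypothetical name introduced only for this illustration.

```python
# Scores copied from Table 4 as (BLEU, ROUGE, MAUVE).
table4 = {
    "All BF16": (0.265, 0.432, 0.971),
    "QSum (BNB)": (0.153, 0.283, 0.998),
    "QRoSA (BNB)": (0.268, 0.417, 0.971),
    "QAcc (Uniform)": (0.239, 0.407, 0.971),
    "QInf (BNB)": (0.084, 0.192, 0.130),
    "QInf (GPTQ, ours)": (0.251, 0.414, 0.971),
    "QAll (BNB Inf)": (0.076, 0.119, 0.070),
    "QAll (GPTQ Inf, ours)": (0.207, 0.390, 0.996),
}

baseline = table4["All BF16"]


def retention(scores):
    """Fraction of the BF16 baseline score retained, per metric."""
    return tuple(s / b for s, b in zip(scores, baseline))


for name, scores in table4.items():
    bleu, rouge, mauve = retention(scores)
    print(f"{name:<24} BLEU {bleu:6.1%}  ROUGE {rouge:6.1%}  MAUVE {mauve:6.1%}")
```

Read this way, the fully quantized configuration with GPTQ inference retains roughly 78% of the baseline BLEU score, while the BNB inference variants retain under a third.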

5 Limitations

Panza’s techniques provide a significant improvement in on-device personalization across a mix of metrics and, we believe, make a compelling case for practically useful LLM personalization. Yet more work is needed to accurately measure an LLM’s performance on open-ended tasks such as e-mail generation, in particular with regard to measuring, and improving, its representation of the user’s personal data. Additionally, as a proof of concept, Panza has so far only been tested on Euro/US-centered English, leaving open the creation of such tools for other languages and cultures. Our techniques should be easily extensible to this case.

6 Ethical Considerations

We foresee two categories of risks from presenting a project like Panza. First, a malicious user who has access to a sufficient number of a third party’s emails (for instance, the ones received from that party) can use a tool like Panza to create a credible imitation of that party. Second, a style-matching tool like Panza can be used to create derivative work that can credibly be misrepresented as original (for instance, for completing school assignments). Panza’s low price and accessibility may aid in such misuse; however, overall, the existence of public LLMs and fine-tuning methods already allows such misuse to occur.

Acknowledgments

The authors would like to thank Michael Goin and Tony Wang for their feedback on this work, and their help in distributing the project. This research was supported by the Scientific Service Units (SSU) of IST Austria through resources provided by Scientific Computing (SciComp). EI was supported in part by the FWF DK VGSCO, grant agreement number W1260-N35.

References

Beeching et al. (2023) Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
Chen et al. (2023) Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. 2023. When large language models meet personalization: Perspectives of challenges and opportunities. arXiv preprint arXiv:2307.16376.
Cohen (2015) William W. Cohen. 2015. Enron email dataset. https://www.cs.cmu.edu/~./enron/.
Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
Frantar et al. (2022) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.5371628.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
King and Cook (2020) Milton King and Paul Cook. 2020. Evaluating approaches to personalizing language models. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2461–2469.
Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR).
Kirk et al. (2024) Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A Hale. 2024. The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, pages 1–10.
Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics.
Meta (2024) Meta. 2024. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
Nikdan et al. (2024) Mahdi Nikdan, Soroush Tabesh, and Dan Alistarh. 2024. Rosa: Accurate parameter-efficient fine-tuning via robust adaptation. arXiv preprint arXiv:2401.04679.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics.
Pillutla et al. (2021) Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaïd Harchaoui. 2021. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In Neural Information Processing Systems.
Salemi et al. (2024) Alireza Salemi, Surya Kallumadi, and Hamed Zamani. 2024. Optimization methods for personalizing large language models through retrieval augmentation. arXiv preprint arXiv:2404.05970.
Zhang et al. (2024) Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. 2024. Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131.

Appendix A Prompt Engineering

In the first phase of the data playback, we generate summaries with the help of the following summarization prompt:

"""Summarize the following email that I wrote, in an imperative form, in one or two or maximum three sentences, and make sure to include relevant information, without copying the email content itself. The summary should look like an instruction directing someone to write the same email, and start with Instruction:
Here is the email text:
{email}"""
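As a minimal sketch of how this template could be used, the snippet below fills in the `{email}` slot and strips the `Instruction:` marker from a model response. The helper names and the post-processing step are assumptions made for illustration and are not taken from the Panza codebase.

```python
SUMMARIZATION_PROMPT = (
    "Summarize the following email that I wrote, in an imperative form, "
    "in one or two or maximum three sentences, and make sure to include "
    "relevant information, without copying the email content itself. "
    "The summary should look like an instruction directing someone to write "
    "the same email, and start with Instruction:\n"
    "Here is the email text:\n"
    "{email}"
)


def build_summarization_prompt(email: str) -> str:
    """Fill the summarization template with one user email."""
    return SUMMARIZATION_PROMPT.format(email=email.strip())


def extract_instruction(model_output: str) -> str:
    """Hypothetical post-processing: keep the text after the 'Instruction:' marker."""
    _, found, tail = model_output.partition("Instruction:")
    return (tail if found else model_output).strip()


email = (
    "Hi Cheryl, The proposal looks good, maybe we can discuss it "
    "in a bit more detail tomorrow. Best regards, David"
)
prompt = build_summarization_prompt(email)
print(prompt)
```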

Then, to generate emails, we give the instructions back to the model using the following format:

"""
{system preamble}

{user preamble}

{rag prompt} # [optional]

Instruction: {instruction}
"""

The system preamble sets the role of the LLM as follows:

"""Your role is that of a helpful automated email assistant. I will provide you with a short instruction, and you have to write a well-formed email in my style following this instruction. Be sure to follow my email writing style! In case you see a nonsensical instruction, you should not reply with an email, but with the expression "Sorry, but I don’t get it." """

The user preamble provides optional information about the user. For the five users in our experiments, we set it to "My name is <First Name> <Last Name>". Generally, it can be filled with any relevant information about the user, for instance:

"""My name is Jane Doe. I work as a manager at Acme Corp. My address is 123 Main Street, Springfield, IL, USA. My boss’s name is Alex Burns. My children’s names are Elsa, Anna, and Olaf. I am deeply committed to my hobby of underwater basket weaving, for which we meet every Thursday at noon."""

Finally, for RAG we retrieve several relevant emails for the current instruction and include the following rag prompt to the input:

"""Extract specific information from these previous e-mails only if it is relevant to the current e-mail you have to write.

Previous e-mails:

EMAIL CONTENT:
<email_1 content>

---

EMAIL CONTENT:
<email_2 content>

---
... """
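The pieces described in this appendix can be assembled along the following lines. This is a sketch under assumptions: the function names are hypothetical, and the exact whitespace between the parts is a guess based on the format shown above.

```python
RAG_HEADER = (
    "Extract specific information from these previous e-mails only if it is "
    "relevant to the current e-mail you have to write.\n\n"
    "Previous e-mails:\n\n"
)


def build_rag_prompt(emails):
    """Format retrieved emails as 'EMAIL CONTENT:' blocks separated by '---'."""
    blocks = [f"EMAIL CONTENT:\n{e.strip()}\n\n---" for e in emails]
    return RAG_HEADER + "\n\n".join(blocks)


def build_generation_prompt(system_preamble, user_preamble, instruction,
                            retrieved_emails=()):
    """Join the preambles, the optional RAG prompt, and the instruction."""
    parts = [system_preamble, user_preamble]
    if retrieved_emails:  # the RAG prompt is optional
        parts.append(build_rag_prompt(retrieved_emails))
    parts.append(f"Instruction: {instruction}")
    return "\n\n".join(p.strip() for p in parts)


plain = build_generation_prompt(
    "Your role is that of a helpful automated email assistant.",
    "My name is Jane Doe.",
    "Write to Cheryl saying that the proposal looks good.",
)
with_rag = build_generation_prompt(
    "Your role is that of a helpful automated email assistant.",
    "My name is Jane Doe.",
    "Write to Cheryl saying that the proposal looks good.",
    retrieved_emails=["First earlier email.", "Second earlier email."],
)
print(with_rag)
```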

Appendix B Summarization

Table 5: Summarization quality relative to “golden” user-generated instructions, across different models (10 seeds).

	David		Anon1
Method	BLEU	ROUGE	BLEU	ROUGE
Mistral-7B-Instruct-v0.2	$0.219\pm 0.009$	$0.353\pm 0.007$	$0.133\pm 0.009$	$0.251\pm 0.012$
Mistral-7B-Instruct-v0.3	$0.229\pm 0.021$	$0.373\pm 0.021$	$0.164\pm 0.014$	$0.307\pm 0.016$
Phi-3-mini-4k-instruct	$0.268\pm 0.020$	$0.422\pm 0.018$	$0.180\pm 0.006$	$0.331\pm 0.011$
Meta-Llama-3-8B-Instruct	$\textbf{0.394}\pm 0.011$	$\textbf{0.548}\pm 0.013$	$\textbf{0.264}\pm 0.011$	$\textbf{0.411}\pm 0.015$

In Table 5 we measure the summarization quality, by comparing against golden summaries from the users David and Anon1. In Table 6 we inspect several generated summaries for each model.
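To illustrate what an overlap-based comparison against a golden summary measures, the sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified stand-in: the tokenization and the exact ROUGE variant behind Table 5 are not specified in this appendix, so the numbers it produces are not expected to match the table.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate, reference):
    """F1 over clipped unigram matches between candidate and reference."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Golden and Llama-3 summaries of the first example in Table 6.
golden = ("Write to Cheryl saying that the proposal looks good "
          "and suggest a meeting the following day.")
generated = ("Write an email to Cheryl, stating that the proposal looks good "
             "and suggesting a more detailed discussion for the next day.")
print(f"unigram F1: {unigram_f1(generated, golden):.3f}")
```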

Table 6: Summarization comparison between “golden” user-generated instructions and summaries generated by different pre-trained LLMs.

Original email	Golden summary	Llama-3-8B-Instruct	Mistral-7B-Instruct	Phi-3-mini-4k-instruct
Hi Cheryl, The proposal looks good, maybe we can discuss it in a bit more detail tomorrow. Best regards, David	Write to Cheryl saying that the proposal looks good and suggest a meeting the following day.	Write an email to Cheryl, stating that the proposal looks good and suggesting a more detailed discussion for the next day.	Write an email to Cheryl suggesting a meeting to discuss the proposal in more detail tomorrow. End with a polite closing.	Write an email to Cheryl, acknowledging the good proposal, suggesting a detailed discussion for the following day, and signing off with your name.
Hi both, I’m fairly available in the afternoons next week, except for Monday and Friday, so please feel free to suggest a slot. Best, David	Write an email to two recipients, stating that I am available in the afternoons next week, except for Monday and Friday, and ask them to suggest a time that works for them.	Write an email to both parties, stating that you are available in the afternoons next week, excluding Monday and Friday, and inviting them to suggest a suitable time slot.	Write an email informing the recipients that you are generally available in the afternoons from Tuesday to Thursday next week, except for Mondays and Fridays. Request them to suggest a suitable time.	Write an email informing the recipients that you are generally available in the afternoons next week, except for Monday and Friday, and invite them to propose a suitable time slot for a meeting.
Dear Ernie, Thank you for the invitation. I would have loved to join, but I’m trying to save some time and energy for my other admin tasks this semester :-) Best regards, David	Write an email to Ernie, thanking him for the invitation, but saying that I have to decline due to a heavy admin burden this semester.	Write an email to Ernie declining an invitation, citing a need to prioritize other administrative tasks this semester.	Inform Ernie that you’re unable to attend due to prior commitments with administrative tasks, expressing gratitude for the invitation.	Write an email to Ernie, expressing gratitude for an invitation, politely declining due to prior commitments related to admin tasks, and sign off with a warm closing.
Dear David, I was wondering if I could ask you for an “emergency-mode” code review (see attached). Specifically, it would be great if you could have a look at it in the next couple of weeks. I’d really appreciate it if you can make it; please let me know. Cheers, David	Write an email to David requesting an “emergency-mode” code review, politely asking them if they can review it within the next couple of weeks.	Write an email to David requesting an “emergency-mode” code review, attaching the relevant document, and asking him to review it within the next couple of weeks.	Ask David for an urgent code review, requesting him to look at the attached document within the next couple of weeks and kindly express appreciation for his assistance.	Write an email to David requesting an emergency-mode code review, emphasizing the need for it to be completed within the next couple of weeks, and express appreciation for his potential assistance, while also asking for his confirmation of availability.

Appendix C Panza Generated Emails

Table 7: Generation comparison for Llama-3-8B-Instruct model fine-tuned for different users.

Instruction: Write an email to two recipients, stating that I am available in the afternoons next week, except for Monday and Friday, and ask them to suggest a time that works for them.

Llama-3-8B-Instruct (Pretrained):
Dear [Recipient 1] and [Recipient 2],
I hope this email finds you well. I wanted to touch base with you regarding my availability for a meeting next week. I am free in the afternoons, from Tuesday to Thursday, and would be happy to schedule a meeting with you during this time.
Please note that I am not available on Monday and Friday, as I have other commitments. If you could let me know a time that works for you, I would greatly appreciate it.
Best regards,
[Your Name]

David:
Hi,
I am available in the afternoons next week, except for Monday and Friday. Could you please suggest a time that works for you?
Best regards,
David

Kay:
I’m available in the afternoons next week, except for Monday and Friday. How about Tuesday or Wednesday? Please let me know what works for you.
Thanks,
Kay

Sara:
I am available in the afternoons next week, except for Monday and Friday. Can you both let me know when you are available and we’ll set something up.

Tana:
I am available in the afternoons next week, except Monday and Friday. What time works for you?

Instruction: Write an email to the group proposing to resume meetings tomorrow at 11:30 AM, and also mention your availability in the office on Thursday afternoon and Friday.

Llama-3-8B-Instruct (Pretrained):
Dear team,
I hope this email finds you well. As we’ve been on a brief hiatus, I’d like to propose that we resume our regular meetings tomorrow at 11:30 AM. I believe it’s essential to catch up and discuss ongoing projects, and I’m confident that this will help us stay on track.
Additionally, I wanted to let you know that I’ll be available in the office on Thursday afternoon and Friday. If any concerns need to meet or discuss any matters, please don’t hesitate to reach out to me.
Looking forward to seeing you all tomorrow at 11:30 AM.
Best regards,
[Your Name]

David:
“Hi all,
We should probably resume the meetings tomorrow at 11:30 AM. Also, I am available in the office on Thursday afternoon and Friday.
Best,
David”

Kay:
I’m thinking we should resume meetings tomorrow at 11:30 AM. I’m in the office tomorrow afternoon, and I’ll be here Thursday afternoon, and Friday.
I hope this helps.
Kay

Sara:
I’m going to propose that we resume our meetings tomorrow at 11:30 AM. I’ll be in the office tomorrow and Thursday afternoon. I’ll be out of the office on Friday. Please let me know if you have any questions.
Thanks,
Sara

Tana:
I think we should try to resume our meetings tomorrow at 11:30 AM. I’ll be in the office tomorrow afternoon, Thursday afternoon and Friday.

Instruction: Write an email to Jeannie, thanking her for the monitor, and then request assistance in gaining access to the conference room after hours, as your current ID does not provide access.

Llama-3-8B-Instruct (Pretrained):
Dear Jeannie,
I wanted to take a moment to express my gratitude for the new monitor you provided for me. It has been a huge help in my work and I appreciate your support in keeping everything up to date.
On a separate note, I was wondering if you could assist me with gaining access to the conference room after hours. As you may know, my current ID does not provide access to the room outside of regular working hours. I often find myself needing to work late […]

David:
Hi Jeannie,
Thanks for the monitor! Unfortunately, my current ID does not provide access to the conference room after hours. Could you please help me with this?
Best regards,
David

Kay:
Thanks very much for the monitor. I really appreciate it. I’m having some trouble getting into the conference room after hours. I’ve got a new ID, but it doesn’t seem to work. Could you help me out?
Thanks a bunch.
Best,
Jeff

Sara:
Jeannie: Thanks for the monitor. I was wondering if you could help me with something. I need to be able to access the conference room after hours and my current ID does not provide access. Can you help me with this?
Thanks,
Sara

Tana:
Thanks for the monitor, Jeannie! It is a great help. Now, I need your help with one more thing. My new ID doesn’t seem to give me access to the conference room after hours. Can you help me out with this?
Thanks!

Table 7 compares emails generated by models fine-tuned for different users, highlighting their style differences.

Appendix D Fine-Tuning Performance

We show complete results for all models and every user in Tables 8 and 12 (Meta-Llama-3-8B-Instruct), Tables 9 and 13 (Mistral-7B-Instruct-v0.2), and Tables 10 and 14 (Phi-3-mini-4k-instruct). To compare different models, we report the results averaged over all users in Table 11. We find that, for all models, data playback successfully incorporates the user’s style. In Figure 6 we perform a pairwise style comparison between models trained on different users and the test emails of all the other users. This shows that the MAUVE score successfully captures style differences, while BLEU/ROUGE scores are limited to measuring paraphrasing capacity and cannot distinguish style.

Table 8: Results for Meta-Llama-3-8B-Instruct across all methods and users.

	David			Jeff			Kay			Sara			Tana
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.083$	$0.181$	$0.009$	$0.108$	$0.182$	$0.004$	$0.113$	$0.186$	$0.005$	$0.144$	$0.23$	$0.004$	$0.132$	$0.21$	$0.006$
Pretrained-RAG	$0.107$	$0.212$	$0.017$	$0.115$	$0.188$	$0.005$	$0.121$	$0.197$	$0.004$	$0.151$	$0.233$	$0.004$	$0.149$	$0.227$	$0.005$
FFT	$0.278$	$0.46$	$0.996$	$0.166$	$0.282$	$0.758$	$0.197$	$\textbf{0.295}$	$0.863$	$0.261$	$0.356$	$0.859$	$0.256$	$0.358$	$0.859$
FFT-RAG	$0.3$	$0.449$	$0.984$	$0.166$	$0.266$	$0.779$	$0.19$	$0.283$	$0.933$	$0.242$	$0.337$	$0.852$	$0.238$	$0.327$	$0.898$
FFT-RAFT	$0.299$	$0.476$	$0.997$	$0.164$	$0.279$	$0.715$	$0.192$	$0.285$	$0.891$	$0.253$	$0.357$	$0.914$	$0.266$	$0.363$	$0.903$
FFT-RAFT-RAG	$0.31$	$0.494$	$0.985$	$0.187$	$0.297$	$0.824$	$0.184$	$0.281$	$0.941$	$0.263$	$0.36$	$0.826$	$\textbf{0.278}$	$\textbf{0.372}$	$0.876$
RoSA	$0.312$	$0.488$	$0.999$	$0.164$	$0.285$	$0.806$	$\textbf{0.202}$	$0.293$	$0.898$	$0.26$	$0.355$	$0.872$	$0.26$	$0.352$	$0.948$
RoSA-RAG	$0.236$	$0.383$	$0.982$	$0.166$	$0.266$	$0.786$	$0.184$	$0.269$	$0.963$	$0.243$	$0.338$	$0.945$	$0.253$	$0.34$	$\textbf{0.97}$
RoSA-RAFT	$0.321$	$0.491$	$0.991$	$0.166$	$0.291$	$0.823$	$0.201$	$0.29$	$0.895$	$\textbf{0.268}$	$0.36$	$0.763$	$0.252$	$0.349$	$0.814$
RoSA-RAFT-RAG	$\textbf{0.346}$	$\textbf{0.509}$	$\textbf{1.0}$	$\textbf{0.192}$	$\textbf{0.305}$	$\textbf{0.867}$	$0.196$	$0.291$	$0.973$	$0.265$	$\textbf{0.362}$	$0.889$	$0.277$	$0.367$	$0.905$
LoRA	$0.26$	$0.401$	$0.68$	$0.167$	$0.27$	$0.224$	$0.184$	$0.269$	$0.979$	$0.247$	$0.351$	$0.921$	$0.239$	$0.329$	$0.897$
LoRA-RAG	$0.257$	$0.406$	$0.922$	$0.159$	$0.253$	$0.481$	$0.181$	$0.267$	$0.968$	$0.23$	$0.327$	$\textbf{0.969}$	$0.233$	$0.312$	$0.942$
LoRA-RAFT	$0.243$	$0.391$	$0.412$	$0.165$	$0.267$	$0.149$	$0.187$	$0.276$	$\textbf{0.987}$	$0.245$	$0.341$	$0.906$	$0.243$	$0.329$	$0.893$
LoRA-RAFT-RAG	$0.246$	$0.404$	$0.899$	$0.167$	$0.262$	$0.547$	$0.185$	$0.27$	$0.942$	$0.252$	$0.349$	$0.909$	$0.262$	$0.353$	$0.854$

Table 9: Results for Mistral-7B-Instruct-v0.2 across all methods and users.

	David			Jeff			Kay			Sara			Tana
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.072$	$0.158$	$0.009$	$0.093$	$0.164$	$0.005$	$0.096$	$0.168$	$0.006$	$0.117$	$0.196$	$0.004$	$0.113$	$0.187$	$0.007$
Pretrained-RAG	$0.083$	$0.181$	$0.058$	$0.102$	$0.17$	$0.006$	$0.103$	$0.176$	$0.017$	$0.122$	$0.2$	$0.008$	$0.126$	$0.201$	$0.005$
FFT	$\textbf{0.335}$	$\textbf{0.475}$	$0.997$	$0.162$	$0.262$	$0.263$	$0.214$	$0.295$	$0.795$	$0.264$	$0.365$	$0.839$	$0.263$	$0.36$	$0.719$
FFT-RAG	$0.276$	$0.425$	$0.934$	$0.162$	$0.251$	$0.376$	$0.2$	$0.286$	$0.896$	$0.236$	$0.335$	$0.876$	$0.255$	$0.346$	$\textbf{0.872}$
FFT-RAFT	$0.306$	$0.461$	$0.963$	$0.163$	$0.262$	$0.346$	$0.211$	$\textbf{0.313}$	$0.696$	$\textbf{0.28}$	$\textbf{0.393}$	$0.792$	$0.254$	$0.35$	$0.759$
FFT-RAFT-RAG	$0.309$	$0.461$	$0.959$	$\textbf{0.174}$	$\textbf{0.277}$	$0.29$	$\textbf{0.216}$	$0.308$	$0.882$	$0.275$	$0.39$	$0.648$	$0.268$	$0.358$	$0.723$
RoSA	$0.306$	$0.459$	$0.988$	$0.169$	$0.258$	$0.482$	$0.209$	$0.302$	$0.797$	$0.253$	$0.358$	$0.955$	$0.24$	$0.33$	$0.74$
RoSA-RAG	$0.263$	$0.416$	$\textbf{1.0}$	$0.154$	$0.228$	$0.517$	$0.182$	$0.259$	$0.894$	$0.2$	$0.287$	$\textbf{0.971}$	$0.238$	$0.331$	$0.826$
RoSA-RAFT	$0.289$	$0.437$	$0.999$	$0.167$	$0.264$	$0.557$	$0.189$	$0.296$	$0.809$	$0.265$	$0.387$	$0.838$	$0.245$	$0.337$	$0.856$
RoSA-RAFT-RAG	$0.289$	$0.43$	$0.998$	$0.171$	$0.262$	$0.408$	$0.204$	$0.293$	$\textbf{0.961}$	$0.255$	$0.37$	$0.827$	$\textbf{0.273}$	$\textbf{0.365}$	$0.757$
LoRA	$0.294$	$0.444$	$0.976$	$0.157$	$0.247$	$0.56$	$0.145$	$0.229$	$0.44$	$0.214$	$0.318$	$0.413$	$0.214$	$0.318$	$0.288$
LoRA-RAG	$0.249$	$0.397$	$0.974$	$0.166$	$0.25$	$0.589$	$0.169$	$0.257$	$0.891$	$0.234$	$0.336$	$0.702$	$0.241$	$0.336$	$0.535$
LoRA-RAFT	$0.298$	$0.451$	$0.941$	$0.158$	$0.248$	$\textbf{0.681}$	$0.136$	$0.217$	$0.426$	$0.204$	$0.3$	$0.465$	$0.206$	$0.303$	$0.3$
LoRA-RAFT-RAG	$0.286$	$0.437$	$0.991$	$0.168$	$0.262$	$0.569$	$0.151$	$0.235$	$0.417$	$0.23$	$0.334$	$0.471$	$0.232$	$0.329$	$0.329$

Table 10: Results for Phi-3-mini-4k-instruct across all methods and users.

	David			Jeff			Kay			Sara			Tana
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.055$	$0.129$	$0.009$	$0.077$	$0.147$	$0.006$	$0.078$	$0.149$	$0.006$	$0.101$	$0.182$	$0.007$	$0.093$	$0.168$	$0.005$
Pretrained-RAG	$0.054$	$0.126$	$0.009$	$0.082$	$0.151$	$0.007$	$0.082$	$0.151$	$0.012$	$0.107$	$0.185$	$0.009$	$0.101$	$0.176$	$0.009$
FFT	$0.33$	$0.488$	$1.0$	$0.169$	$0.281$	$0.86$	$0.199$	$0.299$	$0.866$	$0.27$	$0.379$	$0.884$	$0.265$	$0.364$	$0.869$
FFT-RAG	$0.31$	$0.474$	$0.994$	$0.179$	$0.28$	$0.726$	$0.196$	$0.283$	$0.858$	$0.231$	$0.322$	$0.959$	$0.254$	$0.353$	$0.911$
FFT-RAFT	$\textbf{0.33}$	$\textbf{0.508}$	$0.992$	$0.166$	$0.276$	$\textbf{0.915}$	$0.194$	$0.289$	$0.923$	$0.261$	$0.366$	$0.881$	$0.262$	$0.371$	$0.877$
FFT-RAFT-RAG	$0.327$	$0.495$	$0.998$	$\textbf{0.181}$	$0.289$	$0.824$	$0.201$	$0.294$	$0.862$	$0.263$	$0.365$	$0.865$	$\textbf{0.276}$	$\textbf{0.38}$	$0.887$
RoSA	$0.308$	$0.473$	$0.996$	$0.166$	$0.256$	$0.85$	$0.193$	$0.281$	$\textbf{0.956}$	$0.241$	$0.35$	$0.866$	$0.245$	$0.337$	$0.866$
RoSA-RAG	$0.302$	$0.455$	$1.0$	$0.171$	$0.267$	$0.726$	$0.171$	$0.256$	$0.814$	$0.195$	$0.289$	$\textbf{0.982}$	$0.251$	$0.339$	$0.948$
RoSA-RAFT	$0.315$	$0.473$	$\textbf{1.0}$	$0.157$	$0.257$	$0.875$	$0.18$	$0.278$	$0.916$	$0.234$	$0.343$	$0.939$	$0.245$	$0.331$	$\textbf{0.958}$
RoSA-RAFT-RAG	$0.319$	$0.472$	$0.996$	$0.173$	$0.274$	$0.869$	$0.188$	$0.278$	$0.896$	$0.238$	$0.345$	$0.812$	$0.265$	$0.354$	$0.945$
LoRA	$0.277$	$0.439$	$1.0$	$0.164$	$0.289$	$0.789$	$0.2$	$0.302$	$0.72$	$0.261$	$0.384$	$0.754$	$0.253$	$0.359$	$0.82$
LoRA-RAG	$0.295$	$0.461$	$0.999$	$0.173$	$0.283$	$0.77$	$0.156$	$0.246$	$0.577$	$0.235$	$0.353$	$0.909$	$0.25$	$0.347$	$0.834$
LoRA-RAFT	$0.288$	$0.443$	$1.0$	$0.16$	$0.286$	$0.846$	$\textbf{0.21}$	$\textbf{0.319}$	$0.852$	$\textbf{0.276}$	$\textbf{0.398}$	$0.767$	$0.261$	$0.366$	$0.802$
LoRA-RAFT-RAG	$0.313$	$0.472$	$0.999$	$0.176$	$\textbf{0.3}$	$0.65$	$0.209$	$0.311$	$0.919$	$0.269$	$0.389$	$0.923$	$0.274$	$0.376$	$0.757$

Table 11: Comparison between models (results averaged over all users).

	Mistral			Llama-3			Phi-3
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.098$	$0.174$	$0.006$	$0.116$	$0.198$	$0.006$	$0.081$	$0.155$	$0.007$
Pretrained-RAG	$0.107$	$0.186$	$0.019$	$0.129$	$0.212$	$0.007$	$0.085$	$0.158$	$0.009$
FFT	$0.247$	$0.351$	$0.722$	$0.231$	$0.35$	$0.867$	$0.247$	$0.362$	$0.896$
FFT-RAG	$0.226$	$0.328$	$0.791$	$0.227$	$0.332$	$0.889$	$0.234$	$0.342$	$0.89$
FFT-RAFT	$0.243$	$0.356$	$0.711$	$0.235$	$0.352$	$0.884$	$0.243$	$0.362$	$0.918$
FFT-RAFT-RAG	$\textbf{0.248}$	$\textbf{0.359}$	$0.7$	$0.244$	$0.361$	$0.89$	$\textbf{0.25}$	$0.365$	$0.887$
RoSA	$0.235$	$0.341$	$0.792$	$0.24$	$0.355$	$0.904$	$0.231$	$0.339$	$0.907$
RoSA-RAG	$0.208$	$0.304$	$\textbf{0.841}$	$0.216$	$0.319$	$\textbf{0.929}$	$0.218$	$0.321$	$0.894$
RoSA-RAFT	$0.231$	$0.344$	$0.812$	$0.241$	$0.356$	$0.857$	$0.226$	$0.336$	$\textbf{0.938}$
RoSA-RAFT-RAG	$0.238$	$0.344$	$0.79$	$\textbf{0.255}$	$\textbf{0.367}$	$0.927$	$0.237$	$0.345$	$0.904$
LoRA	$0.205$	$0.311$	$0.535$	$0.219$	$0.324$	$0.74$	$0.231$	$0.355$	$0.817$
LoRA-RAG	$0.212$	$0.315$	$0.738$	$0.212$	$0.313$	$0.856$	$0.222$	$0.338$	$0.818$
LoRA-RAFT	$0.2$	$0.304$	$0.563$	$0.216$	$0.321$	$0.669$	$0.239$	$0.363$	$0.853$
LoRA-RAFT-RAG	$0.213$	$0.319$	$0.556$	$0.222$	$0.328$	$0.83$	$0.248$	$\textbf{0.37}$	$0.849$
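As a small worked check of how the averaged entries relate to the per-user tables, the sketch below averages the RoSA-RAFT-RAG row of Table 8 over its five users. For this row, the result coincides with the corresponding Llama-3 entry of Table 11; the variable names are chosen only for illustration.

```python
from statistics import mean

# RoSA-RAFT-RAG row of Table 8 (Meta-Llama-3-8B-Instruct),
# given as (BLEU, ROUGE, MAUVE) per user.
per_user = {
    "David": (0.346, 0.509, 1.0),
    "Jeff": (0.192, 0.305, 0.867),
    "Kay": (0.196, 0.291, 0.973),
    "Sara": (0.265, 0.362, 0.889),
    "Tana": (0.277, 0.367, 0.905),
}

# Average each metric column over the users and round to three decimals.
avg = tuple(round(mean(column), 3) for column in zip(*per_user.values()))
print(avg)  # (0.255, 0.367, 0.927)
```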

Table 12: Anon users results (Meta-Llama-3-8B-Instruct)

	Anon1			Anon2
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.11$	$0.207$	$0.006$	$0.094$	$0.18$	$0.005$
Pretrained-RAG	$0.123$	$0.226$	$0.006$	$0.108$	$0.197$	$0.007$
FFT	$0.302$	$0.462$	$0.889$	$0.219$	$0.381$	$0.909$
FFT-RAG	$0.232$	$0.398$	$0.918$	$0.167$	$0.301$	$\textbf{0.976}$
FFT-RAFT	$0.293$	$\textbf{0.464}$	$0.959$	$0.214$	$\textbf{0.383}$	$0.898$
FFT-RAFT-RAG	$0.287$	$0.461$	$0.95$	$0.218$	$0.369$	$0.895$
RoSA	$\textbf{0.306}$	$0.459$	$0.958$	$\textbf{0.22}$	$0.358$	$0.927$
RoSA-RAG	$0.278$	$0.435$	$0.975$	$0.198$	$0.327$	$0.934$
RoSA-RAFT	$0.285$	$0.436$	$0.978$	$0.21$	$0.348$	$0.951$
RoSA-RAFT-RAG	$0.29$	$0.438$	$0.961$	$0.217$	$0.353$	$0.939$
LoRA	$0.3$	$0.457$	$0.959$	$0.208$	$0.348$	$0.96$
LoRA-RAG	$0.245$	$0.389$	$0.95$	$0.2$	$0.323$	$0.954$
LoRA-RAFT	$0.289$	$0.45$	$0.965$	$0.206$	$0.349$	$0.956$
LoRA-RAFT-RAG	$0.288$	$0.443$	$\textbf{0.985}$	$0.214$	$0.353$	$0.957$

Table 13: Anon users results (Mistral-7B-Instruct-v0.2)

	Anon1			Anon2
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.085$	$0.173$	$0.011$	$0.081$	$0.16$	$0.005$
Pretrained-RAG	$0.095$	$0.188$	$0.011$	$0.089$	$0.169$	$0.008$
FFT	$0.291$	$\textbf{0.459}$	$0.878$	$0.214$	$0.362$	$0.706$
FFT-RAG	$0.248$	$0.411$	$0.824$	$0.197$	$0.334$	$0.812$
FFT-RAFT	$0.295$	$0.457$	$0.877$	$0.217$	$\textbf{0.367}$	$0.559$
FFT-RAFT-RAG	$0.287$	$0.45$	$0.825$	$\textbf{0.222}$	$0.362$	$0.679$
RoSA	$0.294$	$0.459$	$0.912$	$0.213$	$0.35$	$0.888$
RoSA-RAG	$0.266$	$0.422$	$0.875$	$0.175$	$0.292$	$0.914$
RoSA-RAFT	$\textbf{0.298}$	$0.458$	$0.851$	$0.215$	$0.347$	$0.868$
RoSA-RAFT-RAG	$0.286$	$0.448$	$0.907$	$0.206$	$0.336$	$0.832$
LoRA	$0.275$	$0.415$	$0.935$	$0.203$	$0.335$	$\textbf{0.948}$
LoRA-RAG	$0.259$	$0.401$	$0.943$	$0.189$	$0.314$	$0.93$
LoRA-RAFT	$0.28$	$0.43$	$0.96$	$0.203$	$0.335$	$0.916$
LoRA-RAFT-RAG	$0.278$	$0.429$	$\textbf{0.975}$	$0.206$	$0.335$	$0.901$

Table 14: Anon users results (Phi-3-mini-4k-instruct)

	Anon1			Anon2
Method	BLEU	ROUGE	MAUVE	BLEU	ROUGE	MAUVE
Pretrained	$0.067$	$0.144$	$0.009$	$0.069$	$0.139$	$0.006$
Pretrained-RAG	$0.067$	$0.145$	$0.01$	$0.068$	$0.139$	$0.007$
FFT	$\textbf{0.306}$	$0.468$	$0.914$	$0.23$	$0.383$	$0.942$
FFT-RAG	$0.302$	$0.462$	$0.888$	$0.219$	$0.359$	$0.939$
FFT-RAFT	$0.303$	$\textbf{0.477}$	$0.968$	$\textbf{0.233}$	$\textbf{0.394}$	$0.943$
FFT-RAFT-RAG	$0.3$	$0.473$	$0.923$	$0.229$	$0.374$	$0.943$
RoSA	$0.293$	$0.446$	$0.926$	$0.206$	$0.356$	$0.924$
RoSA-RAG	$0.3$	$0.457$	$0.927$	$0.21$	$0.348$	$0.927$
RoSA-RAFT	$0.288$	$0.449$	$\textbf{0.98}$	$0.213$	$0.358$	$0.941$
RoSA-RAFT-RAG	$0.284$	$0.445$	$0.907$	$0.215$	$0.35$	$0.926$
LoRA	$0.296$	$0.468$	$0.945$	$0.22$	$0.38$	$\textbf{0.95}$
LoRA-RAG	$0.292$	$0.454$	$0.886$	$0.223$	$0.373$	$0.935$
LoRA-RAFT	$0.302$	$0.467$	$0.951$	$0.217$	$0.386$	$0.885$
LoRA-RAFT-RAG	$0.301$	$0.466$	$0.945$	$0.226$	$0.379$	$0.924$

Appendix E Hyperparameter Tuning

E.1 Inference

We perform generation using beam search with a temperature $T=0.7$, number of top-probability tokens to keep $top\_k=50$, and nucleus sampling parameter $top\_p=0.7$.
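To clarify what these three parameters do, the following sketch shows how temperature scaling, top-k truncation, and nucleus (top-p) truncation reshape a next-token distribution. It is a standalone illustration with a hypothetical function name and toy logits; it does not reproduce the decoding loop of any particular library.

```python
import math


def filtered_distribution(logits, temperature=0.7, top_k=50, top_p=0.7):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]  # keep the top_k most likely tokens
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:  # smallest prefix with mass >= top_p
            break
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical four-token vocabulary
dist = filtered_distribution(toy_logits)
wider = filtered_distribution(toy_logits, top_p=0.9)
print(dist)
print(wider)
```

With the toy logits above, the most likely token alone already exceeds a cumulative mass of $0.7$, so only it survives; raising `top_p` to $0.9$ keeps the two most likely tokens.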

E.2 Fine-Tuning

We perform thorough hyperparameter tuning for every backbone and every user over learning rate and number of epochs. For FFT we experiment with {$1$, $3$, $5$} epochs and learning rates between $[10^{-3},10^{-7}]$. For PEFT methods (RoSA, LoRA), we experiment with {$1$, $3$, $5$, $7$, $9$} epochs and learning rates between $[10^{-3},10^{-7}]$. Next, we present the best configuration found for each model, method and user.

Meta-Llama-3-8B-Instruct

For users David and Jeff: FFT for $3$ epochs with a learning rate of $10^{-5}$ ; PEFT for $7$ epochs with a learning rate of $10^{-5}$ . For users Kay, Sara and Tana: FFT for $3$ epochs with a learning rate of $10^{-4}$ and PEFT for $7$ epochs with a learning rate of $10^{-4}$ . For users Anon1, Anon2: FFT for $3$ epochs with a learning rate of $10^{-5}$ and PEFT for $7$ epochs with a learning rate of $10^{-4}$ .

Mistral-7B-Instruct-v0.2

For users David, Jeff, and Anon2: FFT for $3$ epochs with a learning rate of $10^{-5}$; PEFT for $7$ epochs with a learning rate of $10^{-4}$. For users Kay, Sara and Tana: FFT for $3$ epochs with a learning rate of $10^{-5}$ and PEFT for $7$ epochs with a learning rate of $10^{-5}$. For Anon1: FFT for $3$ epochs with a learning rate of $10^{-5}$; RoSA for $7$ epochs with a learning rate of $10^{-5}$; LoRA for $7$ epochs with a learning rate of $10^{-4}$.

Phi-3-mini-4k-instruct

The same hyperparameters are used for every user: FFT for $3$ epochs with a learning rate of $10^{-4}$; PEFT for $7$ epochs with a learning rate of $10^{-4}$.
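The search space described at the start of this subsection can be enumerated as in the sketch below. Two points are assumptions made for illustration: the learning-rate grid is taken to contain one value per power of ten in the stated range, and the scores passed to the selection helper are placeholders rather than measured results.

```python
from itertools import product

EPOCHS = {"FFT": [1, 3, 5], "PEFT": [1, 3, 5, 7, 9]}
# Assumption: one learning rate per power of ten between 1e-3 and 1e-7.
LEARNING_RATES = [10.0 ** -k for k in range(3, 8)]


def search_space(method):
    """All (epochs, learning_rate) pairs for 'FFT' or 'PEFT'."""
    return list(product(EPOCHS[method], LEARNING_RATES))


def select_best(scores):
    """Pick the configuration with the highest score (higher is better)."""
    return max(scores, key=scores.get)


print(len(search_space("FFT")), len(search_space("PEFT")))

# Placeholder scores, not measured results.
example_scores = {(3, 1e-5): 0.20, (7, 1e-4): 0.30, (7, 1e-5): 0.25}
print(select_best(example_scores))
```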

E.3 RAG and RAFT

We fix the number of retrieved emails to $n_{RAG}=2$ during RAFT and $n_{RAG}=3$ at inference with RAG. We pick relatively low values, as every retrieved email increases the input length and thus the memory consumption. We experimented with larger $n_{RAG}$ for the Pretrained baseline, without significantly different results. For RAFT, we use a $p_{RAG}=0.55$ chance to include relevant emails in the prompt. We use the same relevancy threshold $T_{RAG}=0.2$, tuned on the private users Anon1 and Anon2 to encourage recall rather than precision. This can retrieve irrelevant emails, but prevents missing important items from the user’s history. Furthermore, RAFT learns how to better filter irrelevant information in case it is added to the prompt.
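The interplay of these three settings can be sketched as follows. The helper names are hypothetical, and the sketch assumes that a higher relevance score means a more relevant email and that the threshold is applied as a lower bound on that score; the actual retrieval backend and scoring function are not shown.

```python
import random

N_RAG_TRAIN, N_RAG_INFERENCE = 2, 3  # retrieved emails during RAFT / at inference
P_RAG = 0.55  # probability of adding retrieved emails to a RAFT training prompt
T_RAG = 0.2   # relevancy threshold


def retrieve(scored_emails, n_rag, threshold=T_RAG):
    """Keep the n_rag highest-scoring emails whose score reaches the threshold.

    scored_emails is a list of (relevance_score, email_text) pairs; the
    assumption here is that a higher score means a more relevant email.
    """
    relevant = [(s, e) for s, e in scored_emails if s >= threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [email for _, email in relevant[:n_rag]]


def raft_context(scored_emails, rng):
    """With probability P_RAG, return retrieved emails for a training prompt."""
    if rng.random() < P_RAG:
        return retrieve(scored_emails, N_RAG_TRAIN)
    return []


# Hypothetical relevance scores for five stored emails.
candidates = [(0.9, "a"), (0.1, "b"), (0.5, "c"), (0.3, "d"), (0.25, "e")]
print(retrieve(candidates, N_RAG_INFERENCE))
print(raft_context(candidates, random.Random(0)))
```

A low threshold such as $0.2$ lets weakly related emails through, which matches the stated preference for recall over precision.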
