Learning to Refuse: Towards Mitigating Privacy Risks in LLMs

Zhenhua Liu, Tong Zhu, Chuanyuan Tan, Wenliang Chen
Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, China
{zhliu0106, tzhu7, cytan17726}@stu.suda.edu.cn, [email protected]
 Corresponding author
Abstract

Large language models (LLMs) exhibit remarkable capabilities in understanding and generating natural language. However, these models can inadvertently memorize private information, posing significant privacy risks. This study addresses the challenge of enabling LLMs to protect specific individuals’ private data without the need for complete retraining. We propose RETURN, a Real-world pErsonal daTa UnleaRNing dataset, comprising 2,492 individuals from Wikipedia with associated QA pairs, to evaluate machine unlearning (MU) methods for protecting personal data in a realistic scenario. Additionally, we introduce the Name-Aware Unlearning Framework (NAUF) for Privacy Protection, which enables the model to learn which individuals’ information should be protected without affecting its ability to answer questions related to other unrelated individuals. Our extensive experiments demonstrate that NAUF achieves a state-of-the-art average unlearning score, surpassing the best baseline method by 5.65 points, effectively protecting target individuals’ personal data while maintaining the model’s general capabilities. Our code and dataset are available at https://github.com/zhliu0106/learning-to-refuse.



1 Introduction

Large language models (LLMs) demonstrate extraordinary abilities to understand and generate natural language following instructions, owing to their massive numbers of parameters and amounts of training data Brown et al. (2020); Anil et al. (2023). However, these models sometimes memorize private content because the pre-training corpus contains personally identifiable information Carlini et al. (2021); Huang et al. (2022). This presents a significant privacy concern, as an adversary can prompt the model to extract an individual’s name, email address, phone number, or other sensitive information for malicious purposes, as shown in Figure 1. The General Data Protection Regulation European Parliament and Council of the European Union (2016) gives individuals the Right To Be Forgotten (RTBF), which can limit the direct and indirect commercial use of their personal information. This situation leads us to the question: How can we enable LLMs to protect specific individuals’ private data to mitigate privacy risks?

Figure 1: The example of extracting private information from LLMs.

With the costly training process of LLMs, removing all private information from the training data and retraining it from scratch is not a practical solution Lison et al. (2021); Kandpal et al. (2022); Liu et al. (2024a). Therefore, researchers have attempted to adopt machine unlearning (MU) as an alternative, which aims to eliminate the influence of undesirable data and associated model capabilities without retraining Cao and Yang (2015); Bourtoule et al. (2021); Jang et al. (2022); Si et al. (2023); Zhang et al. (2023a); Maini et al. (2024); Liu et al. (2024a). To evaluate the performance of MU methods, some studies have experimented with question-answering datasets Patil et al. (2023), fictitious biographies Maini et al. (2024), and copyrighted contents Eldan and Russinovich (2023). However, there is a lack of evaluation of MU methods for protecting personal privacy data in real-world scenarios, where the target individuals exist in reality and have been memorized by LLMs.

Figure 2: The construction of RETURN and the process for evaluating Machine Unlearning (MU) methods using this dataset.

Considering these problems, we propose RETURN, a Real-world pErsonal daTa UnleaRNing dataset. As illustrated in Figure 2, we collect extensive background information on celebrities from Wikipedia and use GPT-4 Achiam et al. (2023) to generate 20 QA pairs for each individual. After manual and automated validation, we obtain a dataset of 2,492 individuals, each with a (Name, 20 × QA pairs) data instance. Next, we could select a base model to evaluate the MU methods on RETURN. In this work, we take LLaMA-3-8B-Instruct AI@Meta (2024) as an example. We first identify individuals with deep memorization in the model and then divide them into the forget set and the retain set. Our goal is for the model to protect the information of individuals in the forget set, ensuring that questions related to these individuals are not answered correctly, while maintaining the model’s performance on the retain set.

Existing MU methods often suffer from sensitivity to hyperparameter selection or the inability to effectively distinguish between the forget set and the retain set. To mitigate the drawbacks of existing methods, we propose a simple yet novel unlearning method: Name-Aware Unlearning Framework (NAUF) for privacy protection. The framework comprises two key components: Name-Aware Refusal Answer and Contrastive Data Augmentation. The Name-Aware Refusal Answer is designed to help the model learn which individuals’ information should be protected, and the Contrastive Data Augmentation aims to expand the distribution of both the forget set and the retain set to enhance the generalization of our method. We evaluate the effectiveness of our proposed method on RETURN and compare it with the baseline methods, and the results show that our proposed NAUF achieves a state-of-the-art average unlearning score, outperforming the best baseline method by 5.65 points.

Our contributions can be summarized as follows:

  • We propose RETURN, which consists of 2,492 real individual names and 20 QA pairs for each individual. As far as we know, this is the first dataset for evaluating MU methods for protecting personal data in a real-world scenario.

  • We propose a simple yet novel method NAUF for privacy protection. This method could help the model protect the privacy of individuals in the forget set while maintaining the model’s performance on the retain set.

  • We conduct extensive experiments on RETURN to evaluate the effectiveness of our proposed method and compare it with the baseline methods. The results show that our proposed NAUF achieves a state-of-the-art average unlearning score, outperforming the best baseline method by 5.65 points. Through comprehensive experimental analysis, we demonstrate the effectiveness of our proposed method in protecting the privacy of individuals in the forget set while maintaining the model’s performance on the retain set.

2 Related Work

Memorization and Privacy Risks of LLMs.

Previous works show that LLMs can memorize sensitive information from the training data Thakkar et al. (2021); Carlini et al. (2021); Huang et al. (2022). Adversaries can utilize membership inference attacks to infer whether a specific data point was in the LLMs’ training set Shi et al. (2023); Liu et al. (2024b). They can also recover the training data by powerful data extraction attacks Carlini et al. (2021); Nasr et al. (2023). These privacy risks can be mitigated by removing the sensitive information from the LLMs. However, retraining the LLMs from scratch is impractical due to the high cost of training Lison et al. (2021); Kandpal et al. (2022); Liu et al. (2024a). One approach to minimizing the memorization of sensitive information is to apply differential privacy techniques in model training Dwork et al. (2006); Shokri and Shmatikov (2015); McMahan et al. (2017). Unfortunately, these methods often reduce the accuracy and increase the training time, making them less common in practice Jayaraman and Evans (2019).

Machine Unlearning for LLMs.

Machine unlearning (MU) aims to eliminate the influence of undesirable data and remove associated model capabilities while preserving model performance for other data Cao and Yang (2015); Bourtoule et al. (2021); Jang et al. (2022); Si et al. (2023); Zhang et al. (2023a); Maini et al. (2024); Liu et al. (2024a). The study of MU methods encompasses diverse domains, such as image classification Ginart et al. (2019); Golatkar et al. (2020); Sekhari et al. (2021); Fan et al. (2023), text-to-image generation Kumari et al. (2023); Zhang et al. (2023b); Fan et al. (2023), and federated learning Wang et al. (2022); Liu et al. (2023); Che et al. (2023).

Specifically in the era of LLMs, MU has been applied to addressing trustworthiness concerns, such as toxicity Lu et al. (2022), copyright Eldan and Russinovich (2023), and privacy Jang et al. (2022); Patil et al. (2023); Maini et al. (2024). We find that these studies have tested MU methods on question-answering datasets Jang et al. (2022); Patil et al. (2023), fictitious biographies Maini et al. (2024), and copyrighted contents Eldan and Russinovich (2023), but have not yet evaluated the methods for protecting personal privacy data in real-world scenarios. Considering the practical applications, we propose RETURN to evaluate MU methods when an individual exercises their RTBF in a black-box setting, where adversaries can only interact with the model through API queries.

Jang et al. (2022) show that simply performing gradient ascent on target token sequences is effective at forgetting them with little to no degradation of general language modeling performance. Maini et al. (2024) try to unlearn the memorized information in LLMs by relabeling the target data with uninformed answers such as "I don’t know". We believe that these methods have their drawbacks: gradient ascent is sensitive to hyperparameters and could easily cause model training to crash; simply allowing the model to learn to respond with uninformed answers could easily affect the model’s performance on the retain set. Therefore, we propose the Name-Aware Unlearning Framework to mitigate these issues and achieve a better balance between privacy protection and model performance.

3 RETURN: Real-world pErsonal daTa UnleaRNing

In order to evaluate various MU methods in a practical scenario, we propose RETURN, a Real-world pErsonal daTa UnleaRNing dataset. We take LLaMA-3-8B-Instruct AI@Meta (2024) as an example to demonstrate how to use the dataset to evaluate MU methods. It is worth noting that any LLM could replace LLaMA-3-8B-Instruct as the base model for evaluation.

3.1 Data Construction

We begin by leveraging PopQA Mallen et al. (2022) to collect a large set of names of individuals. PopQA is a large-scale open-domain question-answering (QA) dataset constructed by Mallen et al. (2022), consisting of 14k entity-centric QA pairs. Each pair comes with the original [subject entity, relationship type, object entity] annotation, as well as Wikipedia monthly page views. Specifically, for the data in PopQA, we collect “subject entity” if the “relationship type” is within [ occupation, place of birth, father, mother]; and we collect “object entity” if the “relationship type” is within [ producer, director, screenwriter, composer, author].
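
The selection rule above can be sketched as follows. This is only an illustration: it assumes each PopQA record is a plain dictionary with hypothetical keys `subj`, `prop`, and `obj` for the subject entity, relationship type, and object entity, and the toy records are invented.

```python
# Relationship types for which the subject (resp. object) entity is a person.
SUBJECT_RELATIONS = {"occupation", "place of birth", "father", "mother"}
OBJECT_RELATIONS = {"producer", "director", "screenwriter", "composer", "author"}


def collect_names(records):
    """Return the sorted set of person names selected from PopQA-style records."""
    names = set()
    for rec in records:
        if rec["prop"] in SUBJECT_RELATIONS:
            names.add(rec["subj"])
        elif rec["prop"] in OBJECT_RELATIONS:
            names.add(rec["obj"])
    return sorted(names)


# Toy records invented for illustration; the key names are assumptions.
toy_records = [
    {"subj": "Person A", "prop": "occupation", "obj": "actor"},
    {"subj": "Some Film", "prop": "director", "obj": "Person B"},
    {"subj": "Some City", "prop": "capital of", "obj": "Some Country"},
]
selected = collect_names(toy_records)
```

Records whose relationship type is in neither list contribute no name, so the third toy record is ignored.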

After gathering these names, we retrieve their corresponding Wikipedia pages and extract the abstracts from these pages as background information (https://github.com/martin-majlis/Wikipedia-API). We then filter the background information to retain only those whose word count falls between 100 and 500 words. Through this process, we ultimately obtain 2,516 records consisting of (Name, Background Information). Next, given each pair of name and background information, we use a prompt to generate 20 QA pairs with GPT-4 Achiam et al. (2023). The prompt template is shown in Table 3 in the Appendix.

| Item | Value |
| --- | --- |
| #Instances | 2,492 |
| #QA pairs per instance | 20 |
| Avg. background information tokens | 315.0 |
| Avg. question tokens | 15.2 |
| Avg. answer tokens | 18.8 |

Table 1: Data statistics of RETURN. The numbers of tokens are estimated with LLaMA-3-8B-Instruct.

As shown in Table 1, after manually verifying and filtering out data with content or formatting errors, we finally obtain RETURN, consisting of 2,492 (Name, 20 × QA pairs) instances. Next, we will demonstrate how to use the dataset to evaluate MU methods with LLaMA-3-8B-Instruct AI@Meta (2024).

3.2 Identifying Individuals with Deep Memorization

To perform unlearning on LLaMA-3, we first need to identify which individuals the model has deeply memorized. We ask the model to answer the questions for each individual in RETURN, then calculate the average accuracy by comparing the model’s predicted answers with the gold answers using a Natural Language Inference (NLI) model (we choose deberta-v3-base-tasksource-nli Sileo (2023) to evaluate the correctness of the model’s predictions). If the NLI model’s prediction is "entailment" or "neutral," we consider the model’s answer correct; if the NLI model’s prediction is "contradiction," we consider the model’s answer incorrect. We treat "neutral" as correct because the NLI model predicts "neutral" when the model’s answer is partially correct and does not conflict with the gold answer. The accuracy distribution of LLaMA-3 on RETURN is shown in Figure 3. The higher the accuracy, the more deeply the model memorizes the individual’s information. Finally, we take the 466 individuals with accuracy $\geq 0.8$ as individuals with deep memorization for the subsequent unlearning experiments.
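
The selection of deeply memorized individuals can be sketched as below. The sketch assumes the NLI labels have already been computed and are stored per individual as strings; the function names and toy labels are hypothetical, and no real NLI model is called.

```python
def is_correct(nli_label):
    """Treat 'entailment' and 'neutral' as correct, 'contradiction' as incorrect."""
    return nli_label in {"entailment", "neutral"}


def accuracy(nli_labels):
    """Fraction of one individual's questions judged correct."""
    return sum(is_correct(label) for label in nli_labels) / len(nli_labels)


def deeply_memorized(labels_by_name, threshold=0.8):
    """Names whose per-individual accuracy is at least the threshold."""
    return [name for name, labels in labels_by_name.items()
            if accuracy(labels) >= threshold]


# Toy NLI outputs (invented): 20 labels per individual.
toy_labels = {
    "Person A": ["entailment"] * 14 + ["neutral"] * 3 + ["contradiction"] * 3,
    "Person B": ["entailment"] * 10 + ["contradiction"] * 10,
}
memorized = deeply_memorized(toy_labels)
```

In this toy input, Person A is answered correctly on 17 of 20 questions and is kept, while Person B falls below the 0.8 threshold.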

Figure 3: Accuracy distribution of LLaMA-3 on RETURN.

3.3 Evaluation Setup

We split the 466 individuals into 2 sets in a ratio of 1:9: forget set $\mathcal{D}^{F}$ and retain set $\mathcal{D}^{R}$. We mark the original model as $\mathcal{M}_{o}$ and the unlearned model as $\mathcal{M}_{u}$. We want the model to learn to protect the privacy of individuals in the forget set, ensuring that questions related to these individuals are not answered correctly, while not affecting the performance on the retain set and other tasks. Specifically, we aim for the following:

  1. For questions regarding individuals in $\mathcal{D}^{F}$, the model should not answer correctly, or should refuse to respond, to protect their privacy.

  2. For questions regarding individuals in $\mathcal{D}^{R}$, the model should respond normally.

  3. Meanwhile, MU methods should not affect the model’s general capabilities on other tasks.
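
The 1:9 split into forget and retain sets described at the start of this subsection might be implemented along the following lines. The shuffling, the seed, and the rounding rule are assumptions of this sketch, not details given in the text, and the names are placeholders.

```python
import random


def split_forget_retain(names, forget_fraction=0.1, seed=0):
    """Shuffle names and put roughly `forget_fraction` of them in the forget set."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    n_forget = round(len(shuffled) * forget_fraction)
    return shuffled[:n_forget], shuffled[n_forget:]


# Placeholder names standing in for the 466 deeply memorized individuals.
toy_names = [f"person_{i}" for i in range(466)]
forget_names, retain_names = split_forget_retain(toy_names)
```

With this rounding rule the placeholder list yields 47 forget and 419 retain names; the exact set sizes used in the experiments are not stated in this section.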

3.4 Evaluation Metrics

We measure MU methods’ comprehensive performance using the following metrics:

Forget Score.

To quantify the model’s ability to protect the privacy of individuals in the forget set, we propose the Forget Score. It is calculated as the relative decrease in accuracy on $\mathcal{D}^{F}$ after unlearning compared to the original model’s accuracy on $\mathcal{D}^{F}$:

\begin{equation}
\begin{split}
ForgetScore &= \frac{Acc_{\mathcal{M}_{o}}(\mathcal{D}^{F}) - Acc_{\mathcal{M}_{u}}(\mathcal{D}^{F})}{Acc_{\mathcal{M}_{o}}(\mathcal{D}^{F})} \\
&= 1 - \frac{Acc_{\mathcal{M}_{u}}(\mathcal{D}^{F})}{Acc_{\mathcal{M}_{o}}(\mathcal{D}^{F})}
\end{split}
\tag{1}
\end{equation}

Retain Score.

To quantify the model’s ability to retain the performance on the retain set after unlearning, we propose the Retain Score. It is calculated as the ratio of the unlearned model’s accuracy on $\mathcal{D}^{R}$ to the original model’s accuracy on $\mathcal{D}^{R}$:

\begin{equation}
RetainScore = \frac{Acc_{\mathcal{M}_{u}}(\mathcal{D}^{R})}{Acc_{\mathcal{M}_{o}}(\mathcal{D}^{R})}
\tag{2}
\end{equation}
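
Equations (1) and (2) translate directly into two small functions. The accuracies passed in below are invented numbers for illustration; the function and argument names are not from the paper.

```python
def forget_score(acc_original, acc_unlearned):
    """Relative drop in forget-set accuracy, Eq. (1)."""
    return 1.0 - acc_unlearned / acc_original


def retain_score(acc_original, acc_unlearned):
    """Ratio of unlearned to original retain-set accuracy, Eq. (2)."""
    return acc_unlearned / acc_original


# Invented accuracies for illustration only.
fs = forget_score(acc_original=0.90, acc_unlearned=0.09)
rs = retain_score(acc_original=0.90, acc_unlearned=0.81)
```

An unlearned model that behaves exactly like the original one gets a Forget Score of 0 and a Retain Score of 1, which matches the Oracle row of Table 2 when expressed in percent.
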
Downstream Task Accuracy.

To quantify the influence of unlearning on the model’s general capabilities, we evaluate the model on 5 downstream natural language processing tasks: WinoGrande Sakaguchi et al. (2021), PIQA Bisk et al. (2020), LogiQA Liu et al. (2020), LAMBADA Paperno et al. (2016), and ARC-c Clark et al. (2018). We use the accuracy of the downstream tasks as the evaluation metric.

4 Machine Unlearning for LLMs

In this section, we introduce MU baselines. Then we propose a novel method for mitigating privacy risks in LLMs: Name-Aware Unlearning Framework for Privacy Protection.

4.1 Baseline Methods

A typical MU method often has two parts: unlearning on the forget set and regularization on the retain set. These two types of loss can be used in any combination.

4.1.1 Unlearning on Forget Set

Gradient Ascent.

Gradient ascent (GA) is the most straightforward method for unlearning: it simply performs gradient ascent on the loss over the forget set. GA aims to minimize the likelihood of correct predictions on the forget set, denoted as:

\begin{equation}
\begin{split}
\mathcal{L}_{GA}(\mathcal{D}^{F},\mathcal{M}_{u}) &= -\mathbb{E}_{(x,y)\sim\mathcal{D}^{F}}[-\log(\mathcal{M}_{u}(y|x))] \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}^{F}}[\log(\mathcal{M}_{u}(y|x))]
\end{split}
\tag{3}
\end{equation}
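
As a small numeric illustration of Eq. (3), the GA objective is just the mean log-likelihood of the gold answers, which is the usual negative log-likelihood with its sign flipped. The sketch assumes the sequence log-probabilities $\log \mathcal{M}_{u}(y|x)$ are already available as plain floats; the values are invented.

```python
def ga_loss(seq_log_probs):
    """Mean log-likelihood of gold answers on the forget set, Eq. (3).

    Minimizing this value lowers the probability of the gold answers,
    which is equivalent to gradient ascent on the negative log-likelihood.
    """
    return sum(seq_log_probs) / len(seq_log_probs)


# Invented sequence log-probabilities for two forget-set examples.
toy_log_probs = [-0.1, -0.3]
loss_ga = ga_loss(toy_log_probs)
nll = -loss_ga  # the standard training loss has the opposite sign
```

Because the objective is unbounded below, nothing stops the log-probabilities from being pushed down indefinitely.
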
Negative Preference Optimization.

Zhang et al. (2024) proposed Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. The loss function of NPO is defined as:

\begin{equation}
\mathcal{L}_{NPO}(\mathcal{D}^{F},\mathcal{M}_{u},\mathcal{M}_{o}) = \frac{2}{\beta}\mathbb{E}_{(x,y)\sim\mathcal{D}^{F}}\left[\log\left(1+\left(\frac{\mathcal{M}_{u}(y|x)}{\mathcal{M}_{o}(y|x)}\right)^{\beta}\right)\right]
\tag{4}
\end{equation}
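
Eq. (4) can be evaluated from sequence log-probabilities under the unlearned and original models. The following sketch works on plain floats rather than tensors, so it shows the arithmetic only; the function name and inputs are hypothetical.

```python
import math


def npo_loss(log_probs_unlearned, log_probs_original, beta=0.1):
    """NPO loss of Eq. (4) from per-example sequence log-probabilities."""
    total = 0.0
    for lp_u, lp_o in zip(log_probs_unlearned, log_probs_original):
        # (M_u / M_o)^beta, computed in log space for numerical stability.
        ratio_pow = math.exp(beta * (lp_u - lp_o))
        total += math.log1p(ratio_pow)
    return (2.0 / beta) * total / len(log_probs_unlearned)


# When the two models agree, the loss equals (2 / beta) * log(2).
loss_equal = npo_loss([-1.0], [-1.0])
# Lowering the unlearned model's probability of the gold answer lowers the loss.
loss_lower = npo_loss([-5.0], [-1.0])
```

Unlike the GA objective, this loss is bounded below by zero.
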
Relabeled Gradient Descent.

A variant of GA is to transform it into a gradient descent approach, which aims to maximize the likelihood of predictions on the relabeled forget set. Following Maini et al. (2024), we relabel the questions in the forget set with an uninformed answer like "I don’t know." (or any one of 100 versions of this response; we name the uninformed answer set $\mathcal{D}^{idk}$). The loss function of Relabeled Gradient Descent (RGD) is defined as:

\begin{equation}
\mathcal{L}_{RGD}(\mathcal{D}^{F},\mathcal{M}_{u}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}^{F},\,y^{idk}\sim\mathcal{D}^{idk}}[\log(\mathcal{M}_{u}(y^{idk}|x))]
\tag{5}
\end{equation}

Relabeled Direct Preference Optimization.

Direct Preference Optimization (DPO) seeks to fine-tune the model with human preferences Rafailov et al. (2024). We take the uninformed answer from $\mathcal{D}^{idk}$ as the preferred answer and the gold answer as the dispreferred answer. The loss function of Relabeled Direct Preference Optimization (RDPO) is defined as:

\begin{equation}
\begin{split}
\mathcal{L}_{RDPO}(\mathcal{D}^{F},\mathcal{M}_{u},\mathcal{M}_{o}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}^{F},\,y^{idk}\sim\mathcal{D}^{idk}}\Big[\log\sigma\Big(&\beta\log\frac{\mathcal{M}_{u}(y^{idk}|x)}{\mathcal{M}_{o}(y^{idk}|x)} \\
&-\beta\log\frac{\mathcal{M}_{u}(y|x)}{\mathcal{M}_{o}(y|x)}\Big)\Big]
\end{split}
\tag{6}
\end{equation}
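
For a single example, Eq. (6) reduces to a logistic loss on the difference of two scaled log-ratios. The sketch below assumes the four sequence log-probabilities are given as floats; the argument names are hypothetical.

```python
import math


def rdpo_loss(lp_u_idk, lp_o_idk, lp_u_gold, lp_o_gold, beta=0.1):
    """Per-example RDPO loss of Eq. (6) from sequence log-probabilities."""
    margin = beta * (lp_u_idk - lp_o_idk) - beta * (lp_u_gold - lp_o_gold)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# With the unlearned model identical to the original, the margin is zero.
loss_start = rdpo_loss(-2.0, -2.0, -1.0, -1.0)
# Raising the refusal's probability and lowering the gold answer's reduces the loss.
loss_after = rdpo_loss(-1.0, -2.0, -3.0, -1.0)
```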

4.1.2 Regularization on Retain Set

MU methods should not only protect the privacy of individuals in the forget set but also maintain the model’s performance on the retain set. Regularization methods are designed to achieve this goal. If we only fine-tune the model to maximize the likelihood of the uninformed answer on the forget set, the model may also refuse to answer the questions on the retain set. To achieve a balance between the forget set and the retain set, there are two regularization methods:

Gradient Descent Regularization.

This method simply performs gradient descent (GD) on the loss over the retain set. The loss function is defined as:

\begin{equation}
\mathcal{L}_{GD}(\mathcal{D}^{R},\mathcal{M}_{u}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}^{R}}[\log(\mathcal{M}_{u}(y|x))]
\tag{7}
\end{equation}

Kullback-Leibler Divergence Regularization.

This method minimizes the Kullback-Leibler divergence (KLD) between the predictions of the original model and the unlearned model on the retain set. The loss function is defined as:

\begin{equation}
\mathcal{L}_{KL}(\mathcal{D}^{R},\mathcal{M}_{u},\mathcal{M}_{o}) = \mathbb{E}_{(x,y)\sim\mathcal{D}^{R}}[KL(\mathcal{M}_{o}(y|x)\,||\,\mathcal{M}_{u}(y|x))]
\tag{8}
\end{equation}
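
The direction of the divergence in Eq. (8), with the original model as the reference distribution, can be illustrated on a toy next-token distribution. This sketch assumes both distributions are given as lists of probabilities over the same vocabulary and that the unlearned model assigns non-zero probability wherever the original does; how the divergence is aggregated over token positions is not specified here.

```python
import math


def kl_divergence(p_original, p_unlearned):
    """KL(p_original || p_unlearned) for two discrete distributions."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_original, p_unlearned) if p > 0.0)


# Toy two-token vocabulary with invented probabilities.
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_shifted = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

The penalty is zero when the unlearned model reproduces the original distribution and grows as the two drift apart.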

Considering a computing budget that scales with the size of the forget set, we randomly sample an example from $\mathcal{D}^{R}$ every time we see an example from $\mathcal{D}^{F}$ to stay within the constraints, following Maini et al. (2024).
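
One way to realize this sampling scheme is to pair every forget example with a freshly drawn retain example, so that the number of retain examples seen per pass equals the size of the forget set. The generator below is a hypothetical sketch with placeholder data, not the authors' data loader.

```python
import random


def paired_examples(forget_examples, retain_examples, seed=0):
    """Yield (forget, retain) pairs, drawing one random retain example per forget example."""
    rng = random.Random(seed)
    for forget_ex in forget_examples:
        yield forget_ex, rng.choice(retain_examples)


# Placeholder examples; the retain set is larger than the forget set.
toy_forget = ["f1", "f2", "f3"]
toy_retain = ["r1", "r2", "r3", "r4", "r5", "r6"]
pairs = list(paired_examples(toy_forget, toy_retain))
```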

4.2 Name-Aware Unlearning Framework

In our pilot study, we find that RGD achieves the comparatively best performance in protecting the privacy of individuals in the forget set. However, the model’s performance on the retain set is significantly affected. The model tends to refuse to answer the questions on the retain set, which is not desirable. To address this issue, we propose a novel method: Name-Aware Unlearning Framework (NAUF) for privacy protection. The framework comprises two key components: Name-Aware Refusal Answer and Contrastive Data Augmentation.

Name-Aware Refusal Answer.

First, we relabel the questions in the forget set with a name-aware refusal answer, such as "I’m afraid I can’t help with inquiries about NAME." Then we could perform gradient descent on the loss over the relabeled forget set. The name-aware refusal answer is designed to help the model learn which individuals’ information should be protected. We curate 100 name-aware refusal answer templates $\mathcal{D}^{refuse}$ using GPT-4, and some examples are shown in Table 4 in the Appendix.
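
A minimal sketch of the relabeling step follows. Only the first template is taken from the text; the second is invented for illustration, and the `{name}` placeholder syntax and function name are assumptions of this sketch.

```python
import random

# The first template is quoted from the text; the second is invented.
REFUSAL_TEMPLATES = [
    "I'm afraid I can't help with inquiries about {name}.",
    "Sorry, I am unable to share information about {name}.",
]


def relabel_with_refusal(name, questions, rng):
    """Pair each question with a randomly chosen name-aware refusal answer."""
    return [(q, rng.choice(REFUSAL_TEMPLATES).format(name=name))
            for q in questions]


relabeled = relabel_with_refusal(
    "Person A", ["Where was Person A born?"], random.Random(0)
)
```

Because every target answer mentions the individual's name, the refusal is tied to that individual, unlike a generic "I don't know" answer.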

Contrastive Data Augmentation.

In addition, given the limited number of QA pairs for each individual, we propose contrastive data augmentation (CDA) to enhance the generalization of unlearning. Specifically:

  • For each individual in the forget set, we randomly sample questions from other individuals in the forget set and replace the name with the target individual’s name. Then relabel the questions with the name-aware refusal answer.

  • For each individual in the retain set, we also randomly sample questions from other individuals in the forget set and replace the name with the target individual’s name, but we use the original model’s prediction for that question as the relabeled answer.

This contrastive data augmentation strategy expands the distribution of both the forget set and the retain set. For simplicity, we expand the forget set and the retain set by doubling the amount of data.
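
The augmentation step can be sketched generically as below. The pool of donor questions and the relabeling function are passed in by the caller, so the same helper covers both cases in the list above. Replacing the donor's name by plain string substitution is an assumption of this sketch, and all names and questions are invented.

```python
import random


def augment(target_name, donor_qas, answer_fn, n_samples, rng):
    """Borrow questions from other individuals and rewrite them for `target_name`.

    donor_qas: list of (donor_name, question) pairs from other individuals.
    answer_fn: maps (name, question) to the relabeled answer, for example a
               name-aware refusal or the original model's prediction.
    """
    sampled = rng.sample(donor_qas, n_samples)
    augmented = []
    for donor_name, question in sampled:
        new_question = question.replace(donor_name, target_name)
        augmented.append((new_question, answer_fn(target_name, new_question)))
    return augmented


def refuse(name, question):
    """Hypothetical relabeling function for the forget set."""
    return f"I'm afraid I can't help with inquiries about {name}."


toy_donors = [("Person B", "Where was Person B born?"),
              ("Person C", "What is Person C known for?")]
extra = augment("Person A", toy_donors, refuse, n_samples=2, rng=random.Random(0))
```

For the retain set, `answer_fn` would instead return the original model's prediction for the rewritten question.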

5 Experiments

5.1 Implementation Details

Due to the limited training data available for unlearning, we aim to use this limited data to teach the model to protect all private information of the target individuals, which places stricter requirements on the generalization capability of the MU methods. Considering this situation, we divide the QA pairs for each individual in the forget set and the retain set into train and test sets in a ratio of 1:1, yielding $\mathcal{D}_{train}^{F}$, $\mathcal{D}_{test}^{F}$, $\mathcal{D}_{train}^{R}$, and $\mathcal{D}_{test}^{R}$. We use $\mathcal{D}_{train}^{F}$ and $\mathcal{D}_{train}^{R}$ to perform unlearning on the model and then evaluate each MU method on $\mathcal{D}_{test}^{F}$ and $\mathcal{D}_{test}^{R}$.
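A per-individual 1:1 split of this kind could look like the sketch below. The function name `split_per_individual`, the dict-of-lists layout, and the shuffling before the split are assumptions made for illustration; the text only specifies the 1:1 ratio per individual.

```python
import random


def split_per_individual(qa_by_name, seed=0):
    """Split each individual's QA pairs 1:1 into train and test lists.

    qa_by_name: dict mapping a name to a list of (question, answer) pairs.
    Returns two lists of (name, (question, answer)) records.
    """
    rng = random.Random(seed)
    train, test = [], []
    for name, pairs in qa_by_name.items():
        shuffled = list(pairs)      # copy so the caller's data is not reordered
        rng.shuffle(shuffled)       # shuffling is an assumption, not stated in the text
        half = len(shuffled) // 2
        train.extend((name, qa) for qa in shuffled[:half])
        test.extend((name, qa) for qa in shuffled[half:])
    return train, test
```

Splitting inside each individual, rather than across individuals, means every person appears in both the train and test portions, so the test set measures generalization to unseen questions about already-seen names.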

The $\beta$ for NPO and RDPO is set to 0.1. We use the AdamW optimizer with a learning rate of 1e-5 for all experiments. We set the batch size to 32 and train the model for 5 epochs. Considering the computational budget, we constrain the number of retain-set samples used in each epoch to equal the size of the entire forget set. All experiments are conducted on 2 NVIDIA A100 (40GB) GPUs.
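The retain-set budget described above can be sketched as a per-epoch subsampling step. This is an illustrative sketch only: the generator name `iterate_epochs`, the re-sampling of the retain subset at every epoch, and the pairing of one forget batch with one retain batch per step are assumptions that the text does not specify.

```python
import random


def iterate_epochs(forget_train, retain_train, epochs=5, batch_size=32, seed=0):
    """Yield (epoch, forget_batch, retain_batch) triples.

    In each epoch, the number of retain samples used equals the size of the
    whole forget set, matching the budget constraint in the text.
    """
    rng = random.Random(seed)
    for epoch in range(epochs):
        # Draw as many retain samples as there are forget samples (no repeats).
        k = min(len(forget_train), len(retain_train))
        retain_subset = rng.sample(retain_train, k)
        # Visit the forget set once per epoch in a fresh random order.
        forget_order = rng.sample(forget_train, len(forget_train))
        for start in range(0, len(forget_order), batch_size):
            yield (
                epoch,
                forget_order[start:start + batch_size],
                retain_subset[start:start + batch_size],
            )
```

With this scheme the cost of regularization scales with the size of the forget set rather than with the much larger retain set.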

5.2 Main Results

| Method | Forget S. | Retain S. | Unlearning Avg. | WG | PIQA | LQA | LAM | ARC-c | Downstream Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Oracle | 0.00 | 100.00 | 50.00 | 72.14 | 78.40 | 33.18 | 71.92 | 56.83 | 62.49 |
| Without Regularization | | | | | | | | | |
| GA | NS | NS | 0.00 | 48.70 | 47.06 | 22.89 | 0.02 | 25.68 | 28.87 |
| NPO | 15.66 | 84.67 | 50.16 | 56.27 | 59.47 | 26.27 | 37.98 | 29.35 | 41.87 |
| RGD | 96.46 | 3.16 | 49.81 | 70.56 | 75.24 | 28.26 | 46.15 | 36.43 | 51.33 |
| RDPO | 25.25 | 72.47 | 48.86 | 55.33 | 56.42 | 26.57 | 26.86 | 21.93 | 37.42 |
| NAUF(ours) | 100.00 | 0.06 | 50.03 | 69.77 | 75.68 | 29.03 | 62.84 | 35.41 | 54.55 |
| - CDA | 99.75 | 0.25 | 50.00 | 70.01 | 76.17 | 27.19 | 68.64 | 37.54 | 55.91 |
| With GD Regularization | | | | | | | | | |
| GA | NS | 70.79 | 35.40 | 69.61 | 73.29 | 21.66 | 71.67 | 38.31 | 54.91 |
| NPO | 33.33 | 80.81 | 57.07 | 71.74 | 78.40 | 29.19 | 73.24 | 45.90 | 59.69 |
| RGD | 89.65 | 60.58 | 75.11 | 72.85 | 78.13 | 29.03 | 73.12 | 47.01 | 60.03 |
| RDPO | 32.07 | 81.37 | 56.72 | 72.14 | 77.86 | 29.19 | 73.26 | 44.03 | 59.29 |
| NAUF(ours) | 81.06 | 76.25 | 78.65 | 73.01 | 79.60 | 30.11 | 73.16 | 50.94 | 61.36 |
| - CDA | 70.71 | 75.71 | 73.21 | 72.61 | 78.84 | 28.88 | 75.57 | 47.18 | 60.62 |
| With KLD Regularization | | | | | | | | | |
| GA | NS | NS | 0.00 | 50.28 | 43.63 | 21.97 | 0.91 | 22.87 | 27.93 |
| NPO | 30.30 | 87.60 | 58.95 | 68.67 | 77.69 | 29.34 | 73.34 | 48.21 | 59.45 |
| RGD | 96.21 | 52.01 | 74.11 | 71.51 | 79.33 | 26.42 | 72.11 | 50.77 | 60.03 |
| RDPO | 22.47 | 87.44 | 54.96 | 71.43 | 79.22 | 29.65 | 71.86 | 50.09 | 60.45 |
| NAUF(ours) | 93.69 | 67.82 | 80.76 | 72.22 | 79.27 | 29.80 | 72.21 | 50.51 | 60.80 |
| - CDA | 94.44 | 63.82 | 79.13 | 71.11 | 79.60 | 28.88 | 74.46 | 50.51 | 60.91 |
Table 2: The main results of the experiments. Forget S. denotes Forget Score, Retain S. denotes Retain Score, WG denotes WinoGrande, LQA denotes LogiQA, LAM denotes LAMBADA. Oracle refers to using the original model directly to compute the metrics without applying any unlearning. Notably, NS denotes "NonSense", which means the model’s prediction is meaningless, and we take it as 0 for computing the average. We highlight the best results in bold.
Figure 4: Impact of the number of unlearning epochs on the performance of MU methods (best viewed in color).
Figure 5: Average unlearning score vs average downstream task accuracy across different numbers of epochs (best viewed in color).

We present the main results of the experiments in Table 2. We report the average unlearning score and average downstream task accuracy to evaluate the overall performance of the model.

The results show that our proposed NAUF with KLD regularization achieves a state-of-the-art average unlearning score of 80.76, outperforming the best baseline method (RGD with GD regularization, 75.11) by 5.65 points. The GA method performs worst on our dataset: the unlearned model generates meaningless predictions for questions in the forget set, and both the retain score and the performance on downstream tasks degrade substantially. For GA, this decline is mitigated to some extent only when using GD regularization.
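The headline margin can be checked directly from Table 2. The sketch below assumes that the average unlearning score is the arithmetic mean of the forget and retain scores, which the reported rows are consistent with up to rounding; the function name and the selection of rows are illustrative.

```python
def average_unlearning_score(forget, retain):
    """Mean of forget and retain scores; "NS" counts as 0, as in the Table 2 caption."""
    def to_number(score):
        return 0.0 if score == "NS" else float(score)
    return (to_number(forget) + to_number(retain)) / 2


# (forget score, retain score, reported average) copied from Table 2.
rows = {
    "NAUF + KLD": (93.69, 67.82, 80.76),
    "RGD + GD": (89.65, 60.58, 75.11),
    "GA + GD": ("NS", 70.79, 35.40),
}

for name, (forget, retain, reported) in rows.items():
    computed = average_unlearning_score(forget, retain)
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}")

# Margin between the best NAUF setting and the best baseline setting.
margin = rows["NAUF + KLD"][2] - rows["RGD + GD"][2]
print(f"margin: {margin:.2f}")
```

The computed means differ from the reported averages by at most 0.005, which is what rounding of the underlying unrounded scores would produce.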

We find that the RGD method achieves a better forget score than our method under both GD and KLD regularization, but at the cost of a substantially lower retain score. Intuitively, this could be attributed to uninformative answers such as "I don’t know", which do not teach the model to distinguish the individuals whose information should be protected. Our proposed name-aware refusal answer helps the model learn which individuals’ information should be protected, thereby achieving a better balance between the forget score and the retain score.

5.3 Analysis

Importance of Regularization on Retain Set.

Without regularization on the retain set, the average unlearning score of all methods except GA is around 50 points, and the average downstream task accuracy is also affected to varying degrees. With regularization, the unlearned model generally performs well on downstream tasks, with accuracy close to that of the original model; the exception in Table 2 is GA with KLD regularization, whose average downstream accuracy remains at 27.93. This indicates that regularization on the retain set can effectively protect the model’s general capabilities.

The experimental results indicate that our method, when using GD regularization, achieves similar forget and retain scores, with a difference of only 5 points between them. In contrast, when using KLD regularization, the forget score reaches 93.69, but the retain score is only 67.82, resulting in a difference of 26 points. This demonstrates that GD regularization can achieve a better balance between unlearning metrics.

Importance of Contrastive Data Augmentation.

To analyze the importance of CDA, we evaluate the performance of our unlearning framework without this component. The results are presented in Table 2. We find that without regularization, CDA has almost no effect. However, it can improve our method’s forget score by 10 points when using the GD regularization. With the KLD regularization, it can increase the retain score by 4 points while maintaining a similar forget score. Notably, our method without CDA also achieves a competitive (with GD regularization) or better (with KLD regularization) average unlearning score compared to the baseline methods, which demonstrates the effectiveness of the name-aware refusal answer. These findings indicate that CDA can enhance performance on the forget set or retain set depending on the regularization method used, thereby enhancing the generalization of our proposed unlearning framework.

Unlearning Performance across Different Numbers of Epochs.

We investigate the impact of the number of unlearning epochs on the performance of MU methods. Specifically, we evaluate RGD and NAUF with 1, 3, 5, and 10 epochs, and the results are shown in Figure 4. For the Forget Score, our method with KLD regularization demonstrates relatively stable performance across different epochs. With GD regularization, the Forget Score improves as the number of epochs increases. Conversely, for the Retain Score, our method with GD regularization shows little variation across epochs, while KLD regularization leads to a gradual improvement in the Retain Score with increasing epochs. Our method’s average unlearning score improves with an increasing number of epochs, while RGD shows little to no improvement from epoch 5 to epoch 10, which indicates that our method still has room for further optimization.

Average Unlearning Score vs Average Downstream Task Accuracy across Different Numbers of Epochs.

We analyze the relationship between the average unlearning score and the average downstream task accuracy across different numbers of epochs. We choose RGD and NAUF with KLD regularization for this analysis, and the results are shown in Figure 5. We observe that as the number of epochs increases, the average unlearning score and the average downstream task accuracy generally increase together. Our method surpasses RGD on both measures after just 3 epochs. Additionally, from epoch 5 to epoch 10, RGD shows a decline in average downstream task accuracy without any significant improvement in the average unlearning score. In contrast, our method continues to achieve higher average unlearning scores at epoch 10 while maintaining stable average downstream task accuracy.

6 Conclusion

In this work, we introduce RETURN, a novel benchmark designed to evaluate MU methods for protecting personal data in a real-world scenario. We also present the Name-Aware Unlearning Framework (NAUF), which integrates Name-Aware Refusal Answer and Contrastive Data Augmentation to enhance the generalization of unlearning methods. Our experimental results show that NAUF not only effectively protects the privacy of individuals in the forget set but also maintains the performance of the model on the retain set, achieving an average unlearning score that outperforms the best baseline method by 5.65 points. These findings underscore the potential of NAUF to advance privacy protection in large language models.

Limitations

The proposed NAUF method is designed for individual-level privacy protection. Our goal is to protect all information about an individual, ensuring that the model refuses to answer any questions related to that individual. However, this method does not provide fine-grained protection of the target individual’s information. In other words, it cannot distinguish between questions that can be answered and those that are too sensitive to answer. Future work will explore how to align the model with human judgment, enabling it to discern which personal information can be publicly discussed and which information, potentially susceptible to malicious use, should be protected.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159. IEEE.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. 2015. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE.
  • Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
  • Che et al. (2023) Tianshi Che, Yang Zhou, Zijie Zhang, Lingjuan Lyu, Ji Liu, Da Yan, Dejing Dou, and Jun Huan. 2023. Fast federated machine unlearning with nonlinear functional theory. In International conference on machine learning, pages 4241–4268. PMLR.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
  • Eldan and Russinovich (2023) Ronen Eldan and Mark Russinovich. 2023. Who’s harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238.
  • European Parliament and Council of the European Union (2016) European Parliament and Council of the European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council. Accessed: 2024-06-06.
  • Fan et al. (2023) Chongyu Fan, Jiancheng Liu, Yihua Zhang, Dennis Wei, Eric Wong, and Sijia Liu. 2023. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. arXiv preprint arXiv:2310.12508.
  • Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. 2019. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32.
  • Golatkar et al. (2020) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. 2020. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312.
  • Huang et al. (2022) Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628.
  • Jang et al. (2022) Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2022. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504.
  • Jayaraman and Evans (2019) Bargav Jayaraman and David Evans. 2019. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pages 1895–1912.
  • Kandpal et al. (2022) Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR.
  • Kumari et al. (2023) Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. 2023. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22691–22702.
  • Lison et al. (2021) Pierre Lison, Ildikó Pilán, David Sánchez, Montserrat Batet, and Lilja Øvrelid. 2021. Anonymisation models for text data: State of the art, challenges and future directions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4188–4203.
  • Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124.
  • Liu et al. (2024a) Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R Varshney, et al. 2024a. Rethinking machine unlearning for large language models. arXiv preprint arXiv:2402.08787.
  • Liu et al. (2024b) Zhenhua Liu, Tong Zhu, Chuanyuan Tan, Haonan Lu, Bing Liu, and Wenliang Chen. 2024b. Probing language models for pre-training data detection. arXiv preprint arXiv:2406.01333.
  • Liu et al. (2023) Ziyao Liu, Yu Jiang, Jiyuan Shen, Minyi Peng, Kwok-Yan Lam, and Xingliang Yuan. 2023. A survey on federated unlearning: Challenges, methods, and future directions. arXiv preprint arXiv:2310.20448.
  • Lu et al. (2022) Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591–27609.
  • Maini et al. (2024) Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. 2024. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121.
  • Mallen et al. (2022) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.
  • McMahan et al. (2017) H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2017. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963.
  • Nasr et al. (2023) Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. 2023. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.
  • Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
  • Patil et al. (2023) Vaidehi Patil, Peter Hase, and Mohit Bansal. 2023. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  • Sekhari et al. (2021) Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. 2021. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075–18086.
  • Shi et al. (2023) Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789.
  • Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321.
  • Si et al. (2023) Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, and Weiqiang Zhang. 2023. Knowledge unlearning for llms: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766.
  • Sileo (2023) Damien Sileo. 2023. tasksource: Structured dataset preprocessing annotations for frictionless extreme multi-task learning and evaluation. arXiv preprint arXiv:2301.05948.
  • Thakkar et al. (2021) Om Dipakbhai Thakkar, Swaroop Ramaswamy, Rajiv Mathews, and Francoise Beaufays. 2021. Understanding unintended memorization in language models under federated learning. In Proceedings of the Third Workshop on Privacy in Natural Language Processing, pages 1–10.
  • Wang et al. (2022) Junxiao Wang, Song Guo, Xin Xie, and Heng Qi. 2022. Federated unlearning via class-discriminative pruning. In Proceedings of the ACM Web Conference 2022, pages 622–632.
  • Zhang et al. (2023a) Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. 2023a. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941.
  • Zhang et al. (2023b) Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. 2023b. Forget-me-not: Learning to forget in text-to-image diffusion models. arXiv preprint arXiv:2303.17591.
  • Zhang et al. (2024) Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868.

Appendix A QA Pairs Generation Template

The prompt template for generating QA pairs used in subsection 3.1 is shown in Table 3.

Appendix B Examples of Name-Aware Refusal Answer Templates

We provide 10 examples of name-aware refusal answer templates in Table 4. They are designed to help the LLM learn which individuals’ information should be protected.

QA Pairs Generation
[ABSTRACT]
Given the above [NAME]’s background information, please give me 20 simple questions and answers about this person point by point. Return the content STRICTLY in the following manner:
Q1: <content of the question>?
A1: <content of the answer>.
Q2: <content of the question>?
A2: <content of the answer>.
Q20: <content of the question>?
A20: <content of the answer>.
Make sure the person’s name - [NAME] - appears in the content of the question.
Make sure the answer is concise and accurate.
Table 3: QA pairs generation template.
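Because the template fixes a strict `Qn: ... An: ...` layout, the generated text can be turned into QA pairs with a small parser. The sketch below is hypothetical and is not taken from the released code: the regular expression, the function name `parse_qa_pairs`, and the step that drops questions missing the person's name are assumptions made for illustration.

```python
import re

# Match "Qn: <question> An: <answer>", where the answer runs until the next "Qm:" or the end.
QA_PATTERN = re.compile(
    r"Q(\d+):\s*(.*?)\s*A\1:\s*(.*?)(?=\s*Q\d+:|\s*$)",
    re.DOTALL,
)


def parse_qa_pairs(text, name):
    """Extract (question, answer) pairs and keep only questions that mention `name`."""
    pairs = []
    for _, question, answer in QA_PATTERN.findall(text):
        if name in question:  # the template asks for the name to appear in every question
            pairs.append((question.strip(), answer.strip()))
    return pairs
```

A usage example with a made-up individual:

```python
sample = (
    "Q1: Where was Jane Doe born? A1: Paris.\n"
    "Q2: What is her job? A2: Painter.\n"
    "Q3: When did Jane Doe retire? A3: In 2010."
)
print(parse_qa_pairs(sample, "Jane Doe"))
```

The second pair is discarded because its question does not contain the name.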
Name-Aware Refusal Answers

  • Apologies, I cannot discuss [NAME]-related inquiries.
  • Sorry, I’m unable to provide information regarding [NAME].
  • Unfortunately, I’m not able to address questions about [NAME].
  • I’m sorry, but I can’t offer insights on [NAME].
  • Regrettably, [NAME] is not a topic I can discuss.
  • I’m unable to engage in discussions about [NAME].
  • Sorry, I’m not equipped to answer queries about [NAME].
  • I can’t provide information on topics involving [NAME].
  • Unfortunately, [NAME] falls outside the scope of my responses.
  • Apologies, but I’m unable to delve into matters concerning [NAME].
Table 4: Name-aware refusal answer templates.
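Instantiating a template for a given individual amounts to a placeholder substitution. The sketch below uses three of the templates from Table 4; choosing a template uniformly at random and the function name `name_aware_refusal` are assumptions, since the selection strategy is not described here.

```python
import random

# A subset of the Table 4 templates; "[NAME]" is the placeholder to fill in.
TEMPLATES = [
    "Apologies, I cannot discuss [NAME]-related inquiries.",
    "Sorry, I'm unable to provide information regarding [NAME].",
    "Unfortunately, I'm not able to address questions about [NAME].",
]


def name_aware_refusal(name, rng=None):
    """Return a refusal answer that explicitly names the protected individual."""
    rng = rng or random.Random()
    template = rng.choice(TEMPLATES)  # uniform choice is an assumption
    return template.replace("[NAME]", name)
```

Unlike a generic "I don’t know", the resulting answer carries the individual's name, which is the signal the framework relies on to tie the refusal to a specific person.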