LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack
Abstract
Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks adopt model internal information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then utilize complex heuristic algorithms to optimize the adversarial perturbation. These methods require a lot of model queries, and their attack success rate is restricted by the adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and against some defense methods, and the results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
Introduction
Deep Neural Networks (DNNs) are widely applied in the natural language processing field and have achieved great success (Kim 2014; Devlin et al. 2019; Minaee et al. 2021; Hochreiter and Schmidhuber 1997). However, DNNs are vulnerable to adversarial examples, which are correctly classified samples altered by slight perturbations (Jin et al. 2020; Papernot et al. 2017; Kurakin, Goodfellow, and Bengio 2016). These adversarial perturbations are imperceptible to humans but can mislead the model. Adversarial examples seriously threaten the robustness and reliability of DNNs, especially in security-critical applications (e.g., autonomous driving and toxic text detection (Yang et al. 2021; Kurakin, Goodfellow, and Bengio 2018)). Therefore, adversarial attacks and defenses have attracted enormous attention in computer vision, natural language processing and speech (Szegedy et al. 2013; Carlini and Wagner 2018; Yu et al. 2022). Crafting textual adversarial examples is more challenging because language is discrete and subject to lexical, semantic, and fluency constraints.
Depending on the scenario, textual adversarial attacks can be broadly divided into white-box attacks, score-based attacks and hard-label attacks. In a white-box setting, the attacker utilizes the model’s parameters and gradients to generate adversarial examples (Goodman, Zhonghou et al. 2020; Jiang et al. 2020). Score-based attacks only adopt class probabilities or confidence scores to craft adversarial examples (Jin et al. 2020; Li et al. 2020; Ma, Shi, and Guan 2020; Zhu, Zhao, and Wu 2023). However, these attack methods perform poorly in practice because DNNs are deployed through application programming interfaces (APIs), and the attacker has no access to the model’s parameters, gradients or probability distributions over all labels (Ye et al. 2022b). In contrast, under a hard-label scenario, the model’s internal structures, gradients, training data and even confidence scores are unavailable. The attacker can only query the black-box victim model and get a discrete prediction label, which is more challenging and realistic. Additionally, most real-world models (e.g., HuggingFace API, OpenAI API) limit the number of calls. The realistic attack setting is therefore hard-label with a tiny number of model queries.
Some hard-label attack algorithms have been proposed (Yu et al. 2022; Ye et al. 2022b; Maheshwary, Maheshwary, and Pudi 2021; Ye et al. 2022a). They follow a two-stage strategy: i) generate low-quality adversarial examples by randomly replacing several original words with synonyms, and then ii) adopt complex heuristic algorithms (e.g., a genetic algorithm) to optimize the adversarial perturbation. As a result, these attack methods usually require a lot of queries, and both the attack success rate and the quality of the adversarial examples are limited by the adversary initialization. In contrast, score-based attacks calculate word importance from the change in confidence scores after deleting one word. Word importance ranking improves attack efficiency by preferring to attack words that have a significant impact on the model’s predictions (Jin et al. 2020). However, this deletion-based calculation does not work in a hard-label setting, because deleting one token hardly ever changes the discrete prediction label. We therefore investigate the following problem: how can word importance ranking be calculated in a hard-label setting to improve attack efficiency?
In fact, word importance ranking can reveal the decision boundary and thereby indicate a better attack path, but existing hard-label algorithms ignore this useful information because it is hard to obtain. We are inspired by local explainable methods (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Shrikumar et al. 2016) for DNNs, which are often used to explain the outputs of black-box models, and we use them to estimate token sensitivity on the benign sample. A previous study (Chai et al. 2023) simply replaced the deletion-based method with a local explainable method to calculate word importance in score-based attacks. However, our experiments in Appendix B show that a local explainable method has no significant advantage over the deletion-based method in a score-based scenario: because the probability distribution of the model’s output is available, the influence of each word on the output is already well reflected by deletion. We therefore expect local explainable methods to offer a greater advantage in hard-label attacks, where the deletion-based method is useless. We adopt the most fundamental and straightforward local explainable method, namely LIME. LIME is easy to understand and closely aligned with the deletion-based method used in score-based attacks, which suits our goal of bridging the gap between score-based attacks and hard-label attacks through an interpretability method. Moreover, local explainable methods are model-agnostic and thus suitable for estimating word importance in hard-label attacks. However, applying LIME to hard-label attacks raises the following difficulties: 1) how to allocate queries between LIME and the search under a tiny query budget to achieve optimal results; 2) how to establish a mapping between LIME and word importance in adversarial samples without the model’s logits output; 3) how to sample reasonably during perturbation execution to achieve optimal results. In subsequent sections we explain in detail how we address these difficulties.
In this work, we propose a novel hard-label attack algorithm named LimeAttack. The application of LIME to hard-label attacks is inspired by the deletion method of score-based attacks. We verify the effectiveness of the inside-to-outside attack path in hard-label attacks, which suggests that many excellent score-based attacks may offer further insight for hard-label attacks. To evaluate attack performance and efficiency, we compare LimeAttack with other hard-label attacks, taking several score-based attacks as references, on two NLP tasks and seven common datasets. We also evaluate LimeAttack on current state-of-the-art large language models (e.g., ChatGPT). Experiments show that LimeAttack achieves the highest attack success rate among the baselines under a tiny query budget. Our contributions are summarized as follows:
• We summarize the shortcomings of existing hard-label attacks, apply LIME to connect score-based attacks and hard-label attacks, and verify the effectiveness of the inside-to-outside attack path in hard-label attacks.
• Extensive experiments show that LimeAttack achieves a higher attack success rate than existing hard-label attack algorithms under a tiny query budget. Meanwhile, the adversarial examples crafted by LimeAttack are of high quality and difficult for humans to distinguish. Code is available at https://github.com/zhuhai-ustc/limeattack
• In addition, we conduct attacks and evaluations on current state-of-the-art large language models. The results indicate that adversarial examples remain a significant threat to large language models. We also report attack performance against defense methods and the convergence of the attack success rate and perturbation rate.
Related Work
Hard-Label Adversarial Attacks
In a hard-label setting, the attacker can only query the victim model and get a discrete prediction label. Therefore, the hard-label setting is more practical and challenging. Existing hard-label attacks follow a two-stage strategy, i.e., adversary initialization and perturbation optimization. HLBB (Maheshwary, Maheshwary, and Pudi 2021) initializes an adversarial example and adopts a genetic algorithm to optimize the perturbation. TextHoaxer (Ye et al. 2022b) and LeapAttack (Ye et al. 2022a) utilize semantic similarity and perturbation rate as optimization objectives to search for a better perturbation matrix in the continuous word embedding space. TextHacker (Yu et al. 2022) adopts a hybrid local search algorithm and a word importance table learned from the attack history to guide the local search. These attack methods often require a lot of queries to reduce the perturbation rate, and their attack success rate and adversary quality are limited by the initialization. Therefore, in this work, we attempt to craft an adversarial example directly from the benign sample. This approach can generate high-quality adversarial examples with fewer queries.
Local Explainable Methods
To improve DNN interpretability and aid decision-making, various methods for explaining DNNs have been proposed; they are broadly categorized as global or local explainable methods. Global explainable methods focus on the model itself by using overall knowledge about the model’s architecture and parameters. In contrast, local methods fit a simple and interpretable model (e.g., a decision tree) to a single input to measure the contribution of each token. In detail, local explainable methods (Lundberg and Lee 2017; Shrikumar et al. 2016; Štrumbelj and Kononenko 2014) associate all input tokens by defining a linear interpretability model and assume that the contribution of each token in the input is additive. This is also called the additive feature attribution method. In this paper, we apply local interpretable model-agnostic explanations (LIME) (Ribeiro, Singh, and Guestrin 2016), a fundamental and representative local explainable method, to calculate word importance. The intuition of LIME is to generate many neighborhood samples by deleting some original words in the benign example. These samples are then used to train a linear model whose number of features equals the number of words in the benign sample. The parameters of this linear model approximate the importance of each word. As LIME is model-agnostic, it is suitable for hard-label attacks.
Limitation of Existing Hard-Label Attack
To intuitively compare LimeAttack with existing hard-label attack algorithms, we visualize the attack search paths in Figure 1. LimeAttack’s search paths are represented by green lines, and they move from inside to outside. LimeAttack utilizes a local explainable method to learn the word importance ranking and generates adversarial examples iteratively from benign samples. This helps LimeAttack find the direction of the nearest decision boundary and spend fewer model queries by attacking key words first. In contrast, the search paths of previous hard-label attack algorithms are represented by blue lines, and they move from outside to inside. These algorithms typically begin with a randomly initialized adversarial example and optimize the perturbation by maximizing the semantic similarity between the initialized example and the benign sample, which requires a lot of model queries to achieve a low perturbation rate. Furthermore, their attack success rate and adversary quality are also limited by the adversary initialization.
Methodology
Problem Formulation
Given a sentence of $n$ words $X = [x_1, x_2, \ldots, x_n]$ and its ground truth label $Y$, an adversarial example $X' = [x'_1, x'_2, \ldots, x'_n]$ is crafted by replacing one or more original words with synonyms to mislead the victim model $F$, i.e.,
$$F(X') \neq Y \quad \text{s.t.} \quad D(X, X') < \epsilon \tag{1}$$
where $\epsilon$ bounds the allowed perturbation and $D(\cdot, \cdot)$ is an edit distance that measures the modifications between a benign sample $X$ and an adversarial example $X'$:
$$D(X, X') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(x_i, x'_i) \tag{2}$$
$\mathbb{E}(x_i, x'_i)$ is a binary variable that equals 0 if $x_i = x'_i$ and 1 otherwise. A high-quality adversarial example should be similar to the benign sample, and human readers should hardly be able to tell the difference. LimeAttack is a hard-label attack: it does not use the model’s parameters, gradients or confidence scores. The attacker can only query the victim model to obtain a predicted label $F(X)$.
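The two conditions above (a changed prediction and a bounded edit distance) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy victim model, and the threshold argument `epsilon` are assumptions made for the example.

```python
from typing import Callable, List


def perturbation_rate(benign: List[str], adversarial: List[str]) -> float:
    """Edit distance of Eq. 2: the fraction of positions whose word was replaced."""
    if not benign or len(benign) != len(adversarial):
        # Synonym substitution keeps the sentence length fixed.
        raise ValueError("sentences must be non-empty and of equal length")
    changed = sum(1 for x, x_adv in zip(benign, adversarial) if x != x_adv)
    return changed / len(benign)


def is_adversarial(
    model: Callable[[List[str]], str],  # hard-label victim: words -> label
    benign: List[str],
    adversarial: List[str],
    label: str,
    epsilon: float,  # hypothetical name for the perturbation bound
) -> bool:
    """Eq. 1: the label must change while the perturbation stays below epsilon."""
    within_bound = perturbation_rate(benign, adversarial) < epsilon
    return within_bound and model(adversarial) != label
```

Note that `is_adversarial` spends one model query per call (and none when the perturbation bound is already violated), which matters under a small query budget.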
The Proposed LimeAttack Algorithm
The overall flow chart is shown in Figure 2. LimeAttack follows two steps, i.e., word importance ranking and perturbation execution.
Word Importance Ranking.
Given a sentence of $n$ words $X = [x_1, x_2, \ldots, x_n]$, we assume that the contribution of all words is additive and that their sum is positively related to the model’s prediction. As shown in Figure 2, we generate neighborhood samples $Z_1, \ldots, Z_m$ from a benign example by randomly replacing some words with ’[MASK]’. Sentences with more words usually require more neighborhood samples to approximate the word importance. Therefore, we keep the number of neighborhood samples consistent with the number of tokens. We then feed the neighborhood samples to the victim model to obtain discrete prediction labels $y_1, \ldots, y_m$. Subsequently, we fit a linear interpretability model to classify these neighborhood samples:
$$g(Z_j) = \sum_{i=1}^{n} \theta_i \, \mathbb{I}(x_i, Z_j) \tag{3}$$
where $\theta_i$ is a parameter of the linear model and $\mathbb{I}(x_i, Z_j)$ is a binary variable that equals 1 if word $x_i$ is in $Z_j$ and 0 otherwise. Therefore, the parameter $\theta_i$ reflects the change without word $x_i$ and approximates the word importance. In Appendix O, we verify experimentally that the linear model (such as LIME) performs as well as some advanced interpretation methods (such as SHAP) or non-linear models (such as a decision tree) under tiny query budgets. SHAP and non-linear models also have a higher computational complexity. The advantages of advanced interpretation methods or non-linear models only show when a large number of neighborhood samples and queries are available.
In detail, we transform each neighborhood sample $Z_j$ into a binary vector $V_j$. If the original word $x_i$ is removed in $Z_j$, the corresponding dimension of $V_j$ is 0; otherwise it is 1. Therefore, $V_j$ has the same length as $X$, which is the length of the benign example. The benign example $X$ is also transformed into a vector $V$, whose entries are all 1. Because the neighborhood samples are not necessarily linearly separable, LIME adopts a Gaussian kernel to weight the loss of each sample so that the points closest to the original sample dominate, which helps the linear fit. We weight each neighborhood sample according to its distance from the benign sample (Ribeiro, Singh, and Guestrin 2016):
$$\pi(V, V_j) = \exp\left(-\frac{d(V, V_j)^2}{\sigma^2}\right) \tag{4}$$
where $\sigma$ is the kernel width and $d(\cdot, \cdot)$ is a distance function. We base the distance on the cosine similarity between the two binary vectors:
$$d(V, V_j) = 1 - \frac{V \cdot V_j}{\lVert V \rVert \, \lVert V_j \rVert} \tag{5}$$
Finally, we calculate the optimal parameters $\theta^*$:
$$\theta^* = \arg\min_{\theta} \; \mathcal{L}(F, g_\theta, \pi) + \Omega(\theta) \tag{6}$$
where $\mathcal{L}$ is the loss of the linear model $g_\theta$ on the neighborhood samples, weighted by the kernel $\pi$, and $\Omega(\theta)$ is the number of non-zero parameters, which measures the complexity of the linear model. After obtaining $\theta^*$, the importance of each word $x_i$ is equal to $\theta^*_i$. LIME can be seen as an approximation of the model’s decision boundary around the original sample. The parameters can be interpreted as the margin: the larger the margin, the more important the word is in approximating the decision boundary. We first filter out stop words using NLTK (https://www.nltk.org/) and then calculate the importance of each remaining word. To ensure that LimeAttack generates high-quality adversarial examples rather than just negative examples, we only adopt the synonym replacement strategy and construct the synonym candidate set for each word by selecting the top $k$ nearest synonyms in the counter-fitted embedding space (Mrkšić et al. 2016). Additionally, we present the results of human evaluation and more qualitative adversarial examples in Appendix I.
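The word importance procedure described above can be sketched in plain Python as follows. This is a hedged illustration under several assumptions, not the authors' code: the number of removed words per sample is drawn uniformly (as in the reference LIME text explainer), the distance is one minus the cosine similarity, the sparsity term is replaced by a small ridge penalty so that the fit has a closed form, an intercept is added, and all names (`word_importance`, `kernel_width`, `ridge`) are hypothetical. Stop-word filtering is omitted.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

MASK = "[MASK]"


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def word_importance(
    words: List[str],
    model: Callable[[List[str]], str],  # hard-label victim: words -> label
    num_samples: Optional[int] = None,  # defaults to the number of tokens
    kernel_width: float = 0.25,         # assumed value, not taken from the paper
    ridge: float = 1e-3,                # assumed stand-in for the sparsity term
    seed: int = 0,
) -> List[float]:
    n = len(words)
    if n < 2:
        return [0.0] * n  # nothing to rank in a one-word sentence
    rng = random.Random(seed)
    num_samples = n if num_samples is None else num_samples
    benign_label = model(list(words))  # one query for the benign sample
    benign_vec = [1.0] * n
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        # Mask between 1 and n-1 words so the binary vector is never all zero.
        removed = set(rng.sample(range(n), rng.randint(1, n - 1)))
        vec = [0.0 if i in removed else 1.0 for i in range(n)]
        masked = [MASK if i in removed else w for i, w in enumerate(words)]
        # Target is 1 when the hard label is unchanged, 0 when it flips.
        targets.append(1.0 if model(masked) == benign_label else 0.0)
        dist = cosine_distance(benign_vec, vec)
        weights.append(math.exp(-(dist ** 2) / (kernel_width ** 2)))
        rows.append(vec + [1.0])  # trailing 1.0 is the intercept column
    dim = n + 1
    # Weighted ridge regression via the normal equations.
    gram = [
        [
            sum(w * r[i] * r[j] for w, r in zip(weights, rows))
            + (ridge if i == j else 0.0)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    rhs = [
        sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
        for i in range(dim)
    ]
    theta = solve_linear(gram, rhs)
    return theta[:n]  # drop the intercept; theta[i] scores word i
```

With the default `num_samples`, the sketch spends one query per token plus one for the benign sample, so the remaining budget is left for the search stage. A larger coefficient means that removing the word is more likely to flip the label.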
Perturbation Execution.
Adversarial example generation is a combinatorial optimization problem. A score-based attack iterates by selecting, at each step, the token that causes the greatest change in the model’s logits, but no such information is available in a hard-label attack. Therefore, we can only rely on the similarity between the adversarial sample and the original sample to guide the iteration. The problem is that similarity and attack success rate are not completely linearly correlated. As shown in Table 7, greedily selecting the adversarial sample with the lowest similarity each time does not ensure that the final attack success rate is optimal. We therefore want each round of sampling to be spread evenly over the similarity range, so as to balance attack success rate and semantic similarity. For each original word $x_i$, we replace it with a candidate $c$ from its synonym candidate set to generate an adversarial example $X'$, and then calculate the semantic similarity between the benign sample $X$ and the adversarial example $X'$ with the universal sentence encoder (USE; https://tfhub.dev/google/universal-sentence-encoder). We first sort the candidates by similarity and then sample $b$ adversarial examples, where $b$ is the beam size, to enter the next iteration. In detail, we follow three sampling rules: (1) sample the adversarial examples with the highest semantic similarity; (2) sample the adversarial examples with the lowest semantic similarity; (3) sample randomly from the remaining adversarial examples. The analysis of hyper-parameters and the LimeAttack algorithm are summarized in Appendix C and H.
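The three sampling rules can be illustrated with a short Python sketch. The text does not state how many candidates each rule contributes, so the equal split into thirds below is an assumption, as are the function name `sample_beam` and the representation of a candidate as a (similarity, words) pair.

```python
import random
from typing import List, Sequence, Tuple

# (semantic similarity to the benign sample, adversarial word list)
Candidate = Tuple[float, List[str]]


def sample_beam(
    candidates: Sequence[Candidate], beam_size: int, seed: int = 0
) -> List[Candidate]:
    """Keep beam_size candidates: most similar, least similar, and random rest."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    if len(ranked) <= beam_size:
        return ranked  # nothing to prune
    share = beam_size // 3  # assumed equal share for rules (1) and (2)
    top = ranked[:share]                        # rule (1): highest similarity
    bottom = ranked[len(ranked) - share:]       # rule (2): lowest similarity
    middle = ranked[share:len(ranked) - share]  # everything in between
    rng = random.Random(seed)
    rest = rng.sample(middle, beam_size - 2 * share)  # rule (3): random
    return top + bottom + rest
```

Mixing high-similarity and low-similarity candidates keeps some examples that are close to the benign sample and some that are more likely to cross the decision boundary, which is the balance the rules aim for.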
Experiments
Analysis of the transferability and adversarial training of LimeAttack are listed in Appendix D and E.
Tasks, Datasets and Models
We adopt seven common datasets: MR (Pang and Lee 2005), SST-2 (Socher et al. 2013), AG (Zhang, Zhao, and LeCun 2015) and Yahoo (Yoo et al. 2020) for text classification, and SNLI (Bowman et al. 2015) and MNLI (Williams, Nangia, and Bowman 2018) for textual entailment, where MNLI includes a matched version (MNLIm) and a mismatched version (MNLImm). In addition, we train three neural networks as victim models: CNN (Kim 2014), LSTM (Hochreiter and Schmidhuber 1997) and BERT (Devlin et al. 2019). The parameters of the models and detailed information on the datasets are listed in Appendix A.
Baselines
We choose the following existing hard-label attack algorithms as our baselines: HLBB (Maheshwary, Maheshwary, and Pudi 2021), TextHoaxer (Ye et al. 2022b), LeapAttack (Ye et al. 2022a) and TextHacker (Yu et al. 2022). Additionally, we include some classic score-based attack algorithms for reference, namely TextFooler (TF) (Jin et al. 2020), PWWS (Ma, Shi, and Guan 2020) and Bert-Attack (Li et al. 2020), which use additional confidence scores for their attacks and are implemented in the TextAttack framework (Morris et al. 2020).
Automatic Evaluation Metrics
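We use four metrics to evaluate the attack performance: attack success rate (ASR), perturbation rate (Pert), semantic similarity (Sim) and query number (Query). Specifically, given a dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$ consisting of $N$ samples $X_i$ and corresponding ground truth labels $Y_i$, the attack success rate of an adversarial attack method $\mathcal{A}$, which generates an adversarial example $\mathcal{A}(X)$ given an input $X$ to attack a victim model $F$, is defined as (Wang et al. 2021):
$$\text{ASR} = \sum_{(X, Y) \in \mathcal{D}} \frac{\mathbb{1}\left[F(\mathcal{A}(X)) \neq Y\right]}{|\mathcal{D}|} \tag{7}$$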
The perturbation rate is the ratio of the number of substitutions to the number of original tokens, as defined in Eq. 2. The semantic similarity is measured by the Universal Sentence Encoder (USE). Most papers (Maheshwary, Maheshwary, and Pudi 2021; Ye et al. 2022a) have adopted USE, so we also use it for consistency and comparability. Query number is the number of model queries made during the attack. A lower attack success rate indicates a more robust model, while the perturbation rate and semantic similarity together reveal the quality of adversarial examples. Query number reveals the attack efficiency.
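As a concrete reading of the attack success rate in Eq. 7, the following Python sketch counts the samples whose adversarial example is classified differently from the ground truth label. The names `attack_success_rate`, `model` and `attack` are illustrative assumptions; the attack is treated as any function that maps a word list to an adversarial word list.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]


def attack_success_rate(
    dataset: Sequence[Tuple[Sentence, str]],  # (words, ground truth label)
    model: Callable[[Sentence], str],         # hard-label victim model
    attack: Callable[[Sentence], Sentence],   # returns an adversarial sentence
) -> float:
    """Fraction of samples whose adversarial example is misclassified."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = 0
    for words, label in dataset:
        adversarial = attack(words)
        if model(adversarial) != label:
            hits += 1
    return hits / len(dataset)
```

The function returns a fraction in [0, 1]; multiply by 100 to compare with percentage values.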
Implementation Details
We set the kernel width , the number of neighborhood samples equal to the number of the benign sample’s tokens, and the beam size . For a fair comparison, all baselines follow the same settings: synonyms are selected from the counter-fitted embedding space and the number of each candidate set , and the same 1000 texts are sampled for all baselines to attack. The results are averaged over five runs with different seeds (1234, 2234, 3234, 4234 and 5234) to reduce the effect of randomness. To ensure the quality of adversarial examples, an attack counts as successful only if the perturbation rate of the adversarial example is less than 10%. We set a tiny query budget of 100 for hard-label attacks, which corresponds to real-world settings (e.g., the HuggingFace free Inference API typically limits calls to 200 per minute).
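One simple way to enforce such a query budget in an experiment is to wrap the victim model in a counter that refuses further calls once the budget is spent. The sketch below is an assumption about how this could be done, not a description of the authors' code; `BudgetedModel` and `QueryBudgetExceeded` are hypothetical names.

```python
from typing import Callable, List


class QueryBudgetExceeded(RuntimeError):
    """Raised when the attacker has used up all allowed model queries."""


class BudgetedModel:
    """Wraps a hard-label model and counts every query against a budget."""

    def __init__(self, predict: Callable[[List[str]], str], budget: int = 100) -> None:
        self._predict = predict
        self.budget = budget
        self.queries = 0

    def __call__(self, words: List[str]) -> str:
        if self.queries >= self.budget:
            raise QueryBudgetExceeded(f"query budget of {self.budget} exhausted")
        self.queries += 1
        return self._predict(words)
```

An attack loop can catch `QueryBudgetExceeded` and report the sample as a failed attack, and `queries` gives the per-sample query number.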
Experiments Results
Attack Performance.
Tables 1 and 2 show that LimeAttack outperforms existing hard-label attacks on text classification and textual entailment tasks, achieving higher attack success rates and lower perturbation rates on datasets such as SST-2, AG, and MNLI. Unlike existing hard-label attacks, which require many queries to optimize the perturbation, LimeAttack adopts a local explainable method to calculate the word importance ranking and attacks key words first. This approach can generate adversarial examples with a high attack success rate even under tiny query budgets. Appendix G includes a t-test and the mean and variance of LimeAttack’s success rate compared to other methods. Appendix K and L list the semantic similarity and the comparison between LimeAttack and several score-based attacks.
Model | Attack | MR ASR. | MR Pert. | SST-2 ASR. | SST-2 Pert. | AG ASR. | AG Pert. | Yahoo ASR. | Yahoo Pert. |
CNN | HLBB | 44.4 | 5.4 | 33.4 | 5.6 | 17.7 | 3.3 | 41.8 | 3.6 |
CNN | TextHoaxer | 44.2 | 5.2 | 38.1 | 5.6 | 15.7 | 2.9 | 39.9 | 3.3 |
CNN | LeapAttack | 43.1 | 5.3 | 40.0 | 5.7 | 20.2 | 3.2 | 40.4 | 3.4 |
CNN | TextHacker | 49.4 | 6.2 | 38.1 | 6.3 | 20.5 | 6.2 | 38.1 | 5.9 |
CNN | LimeAttack | 49.9 | 5.3 | 42.8 | 5.6 | 20.9 | 2.9 | 43.7 | 3.7 |
LSTM | HLBB | 41.2 | 5.2 | 33.1 | 5.7 | 15.2 | 3.1 | 38.4 | 3.3 |
LSTM | TextHoaxer | 39.3 | 5.4 | 36.4 | 5.6 | 14.7 | 2.7 | 37.1 | 3.3 |
LSTM | LeapAttack | 40.0 | 5.3 | 39.8 | 5.6 | 15.9 | 3.1 | 37.6 | 3.3 |
LSTM | TextHacker | 45.8 | 6.1 | 35.2 | 6.4 | 16.5 | 6.2 | 36.8 | 5.9 |
LSTM | LimeAttack | 47.6 | 5.4 | 40.1 | 5.5 | 17.3 | 2.7 | 40.3 | 3.7 |
BERT | HLBB | 26.6 | 5.6 | 23.0 | 5.8 | 12.7 | 3.2 | 36.3 | 3.6 |
BERT | TextHoaxer | 27.0 | 5.5 | 24.9 | 5.8 | 9.8 | 3.0 | 32.7 | 3.3 |
BERT | LeapAttack | 26.5 | 5.4 | 26.1 | 5.8 | 13.7 | 2.9 | 34.1 | 3.4 |
BERT | TextHacker | 26.5 | 6.5 | 25.4 | 6.3 | 12.9 | 5.5 | 31.3 | 6.3 |
BERT | LimeAttack | 29.2 | 5.9 | 27.8 | 5.7 | 14.6 | 2.9 | 37.4 | 3.8 |
Dataset | HLBB ASR. | HLBB Pert. | TextHoaxer ASR. | TextHoaxer Pert. | LeapAttack ASR. | LeapAttack Pert. | TextHacker ASR. | TextHacker Pert. | LimeAttack ASR. | LimeAttack Pert. |
SNLI | 24.9 | 8.3 | 24.7 | 8.3 | 28.3 | 8.3 | 22.8 | 8.3 | 29.1 | 8.4 |
MNLIm | 41.9 | 7.8 | 40.9 | 7.7 | 49.1 | 7.7 | 38.2 | 7.8 | 49.7 | 7.7 |
MNLImm | 47.8 | 7.5 | 45.6 | 7.6 | 56.0 | 7.6 | 44.3 | 7.7 | 56.3 | 7.6 |
Query Budget.
As illustrated in Figure 3, LimeAttack maintains a stable attack success rate and a smoother attack curve under different query budgets, which means that LimeAttack performs consistently well whether the query budget is high or low. The trend of the perturbation rate is shown in Appendix N. Comparing attack performance under low and high query budgets provides a more comprehensive evaluation. However, attacking without considering the query budget is an idealized situation that shows the upper limit of an attack algorithm. Because a large number of queries is expensive, we believe attack performance under a low query budget is more practical. We also list the attack success rates and perturbation rates of different attacks under a query budget of 2000 in Appendix N.
Adversary Quality.
High-quality adversarial examples should be both fluent and context-aware, while also being similar to benign samples to evade human detection. We utilize LanguageTool (https://www.languagetool.org/) and USE to detect grammatical errors and measure semantic similarity. As shown in Table 3, LimeAttack has the lowest perturbation rate and the lowest grammatical error (tied with LeapAttack), though its semantic similarity is lower than that of HLBB, TextHoaxer, and LeapAttack, because these methods take similarity into account as an optimization objective during the attack. Considering all metrics, LimeAttack still performs best overall. To intuitively contrast the quality of adversarial examples, some qualitative examples are provided in Appendix I.
Attack | ASR. | Pert. | Sim. | Gram. |
HLBB | 23.0 | 5.8 | 99.2 | 1.6 |
TextHoaxer | 24.9 | 5.8 | 99.2 | 1.7 |
LeapAttack | 26.1 | 5.8 | 99.1 | 1.5 |
TextHacker | 25.4 | 6.3 | 96.0 | 1.9 |
LimeAttack | 27.8 | 5.7 | 96.4 | 1.5 |
Evaluation on Large Language Models.
Model(size) | ASR. | Pert. | Sim. | Acc. |
BART-L (407M) | 42.0 | 5.15 | 93.7 | 87.0 |
DeBERTa-L (435M) | 52.0 | 5.82 | 92.9 | 79.0 |
T5-L (780M) | 28.0 | 5.59 | 95.1 | 93.0 |
GPT3(175B) | 61.0 | 4.82 | 95.2 | 82.0 |
ChatGPT (175B) | 25.0 | 5.62 | 95.3 | 92.0 |
Large language models (LLMs), also known as foundation models (Bommasani et al. 2021), have achieved impressive performance on various natural language processing tasks. However, their robustness to adversarial examples remains unclear (Wang et al. 2023). To evaluate the effectiveness of LimeAttack on LLMs, we select some popular models such as DeBERTa-L (Kojima et al. 2022), BART-L (Lewis et al. 2019), Flan-T5 (Raffel et al. 2020), GPT-3 (text-davinci-003) and ChatGPT (gpt-3.5-turbo) (Brown et al. 2020). Due to the limited API calls, we sample 100 texts from the MR dataset and attack the zero-shot classification task of these models. As Table 4 shows, LimeAttack successfully attacks most LLMs under tight query budgets. Although these models have high accuracy on zero-shot tasks, their robustness to adversarial examples still needs to be improved. ChatGPT and T5-L are more robust to adversarial examples. The robustness of the victim model appears to be related to its original accuracy: the higher the original accuracy, the stronger the victim model’s ability to defend against adversarial examples tends to be. Further analysis of other hard-label attacks and experimental details are discussed in Appendix F.
Defense Method | HLBB ASR. | HLBB Pert. | TextHoaxer ASR. | TextHoaxer Pert. | LeapAttack ASR. | LeapAttack Pert. | TextHacker ASR. | TextHacker Pert. | LimeAttack ASR. | LimeAttack Pert. |
None | 24.9 | 8.3 | 24.7 | 8.3 | 28.3 | 8.3 | 22.8 | 8.3 | 29.1 | 8.4 |
A2T | 20.6 | 9.3 | 21.4 | 9.5 | 23.5 | 9.4 | 19.8 | 9.1 | 24.5 | 9.4 |
ASCC | 13.2 | 6.5 | 13.4 | 6.5 | 14.3 | 6.4 | 12.5 | 7.2 | 15.8 | 6.7 |
Attack Performance on Defense Methods.
To evaluate the effectiveness of LimeAttack against defense methods, we use A2T (Yoo and Qi 2021) and ASCC (Dong et al. 2021) to enhance the defense ability of BERT on SNLI and conduct attack experiments on the defended models. As shown in Table 5, LimeAttack remains effective and outperforms the other baselines against these defense methods. Further results on defense methods are listed in Appendix M.
Ablation Study
Effect of Word Importance Ranking.
To validate the effectiveness of word importance ranking, we remove the word importance ranking strategy and instead randomly select the words to perturb. Table 6 shows that without the word importance ranking, the attack success rate drops from 39.3% to 30.1% on MR and from 36.5% to 32.1% on SST-2. Furthermore, adversarial examples generated by random selection have higher perturbation rates and require more queries. This indicates that the word importance ranking guides LimeAttack to focus on crucial words, leading to a more efficient attack with lower perturbation rates.
Effect of Sampling Rules.
To verify the effectiveness of LimeAttack’s sampling rules, we replace this strategy with each of three common sampling rules in turn: (1) selecting the adversarial examples with the highest semantic similarity, (2) selecting the adversarial examples with the lowest semantic similarity, or (3) randomly selecting adversarial examples. The results in Table 7 show that LimeAttack outperforms the other sampling rules with a higher attack success rate and a lower perturbation rate. Additionally, it has the second highest semantic similarity and number of queries.
Metric | MR Random | MR LIME | SST-2 Random | SST-2 LIME |
Pert. | 6.1 | 5.6 | 6.4 | 5.9 |
ASR. | 30.1 | 39.3 | 32.1 | 36.5 |
Sim. | 94.6 | 94.8 | 94.2 | 94.6 |
Query. | 157.2 | 153.3 | 148.1 | 132.5 |
Sample Rule | ASR. | Pert. | Sim. | Query. |
Method 1 | 35.8 | 5.76 | 95.02 | 164.65 |
Method 2 | 31.5 | 6.13 | 93.79 | 87.45 |
Method 3 | 32.1 | 6.09 | 94.50 | 107.05 |
LimeAttack | 39.3 | 5.65 | 94.81 | 153.03 |
Human Evaluation
We selected 200 adversarial examples generated against BERT on MR. Each adversarial example was evaluated by two human judges for semantic similarity, fluency and prediction accuracy. The entire human evaluation protocol is consistent with TextFooler (Jin et al. 2020). In detail, we ask human judges to rate the similarity and fluency of adversarial examples and benign samples on a 5-point Likert scale (1-5 correspond to very not fluent/similar, not fluent/similar, uncertain, fluent/similar, and very fluent/similar, respectively). The results are listed in Table 8. The semantic similarity is 4.5, which means the adversarial samples are similar to the original samples. For prediction accuracy, human judges predict the label of each sentence (e.g., positive or negative for sentiment analysis). The accuracy of 76.7% means that, from a human perspective, the majority of adversarial examples retain the same label as the original samples while misleading the victim model.
Metric | Ori | Adv |
Prediction Accuracy | 81.2% | 76.7% |
Fluency | 4.4 | 4.1 |
Semantic Similarity | 4.5 |
Conclusion
In this work, we summarize previous score-based attacks and hard-label attacks and propose a novel hard-label attack algorithm called LimeAttack. LimeAttack adopts a local explainable method to approximate the word importance ranking and then utilizes beam search to generate high-quality adversarial examples under a tiny query budget. Experiments show that LimeAttack achieves a higher attack success rate than other hard-label attacks. In addition, we evaluate LimeAttack’s attack performance on large language models and against some defense methods. The adversarial examples crafted by LimeAttack are of high quality, highly transferable, and improve the victim model’s robustness in adversarial training. LimeAttack verifies the effectiveness of the inside-to-outside attack path in the hard-label setting, which suggests that many excellent score-based attacks may offer further insight for hard-label attacks.
References
- Bommasani et al. (2021) Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Bowman et al. (2015) Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 632–642.
- Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. In Advances in neural information processing systems, volume 33, 1877–1901.
- Carlini and Wagner (2018) Carlini, N.; and Wagner, D. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops, 1–7.
- Chai et al. (2023) Chai, Y.; Liang, R.; Samtani, S.; Zhu, H.; Wang, M.; Liu, Y.; and Jiang, Y. 2023. Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression. IEEE Transactions on Knowledge and Data Engineering.
- Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
- Dong et al. (2021) Dong, X.; Luu, A. T.; Ji, R.; and Liu, H. 2021. Towards robustness against natural language word substitutions. International Conference on Learning Representations.
- Goodman, Zhonghou et al. (2020) Goodman, D.; Zhonghou, L.; et al. 2020. FastWordBug: A fast method to generate adversarial text against NLP applications. arXiv preprint arXiv:2002.00760.
- Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. In Neural Computation, volume 9, 1735–1780.
- Jiang et al. (2020) Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; and Zhao, T. 2020. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2177–2190.
- Jin et al. (2020) Jin, D.; Jin, Z.; Zhou, J. T.; and Szolovits, P. 2020. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8018–8025.
- Kim (2014) Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1746–1751.
- Kojima et al. (2022) Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
- Kurakin, Goodfellow, and Bengio (2016) Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
- Kurakin, Goodfellow, and Bengio (2018) Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2018. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, 99–112.
- Lewis et al. (2019) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Li et al. (2020) Li, L.; Ma, R.; Guo, Q.; Xue, X.; and Qiu, X. 2020. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6193–6202.
- Lundberg and Lee (2017) Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in neural information processing systems.
- Ma, Shi, and Guan (2020) Ma, G.; Shi, L.; and Guan, Z. 2020. Adversarial Text Generation via Probability Determined Word Saliency. In International Conference on Machine Learning for Cyber Security, 562–571.
- Maheshwary, Maheshwary, and Pudi (2021) Maheshwary, R.; Maheshwary, S.; and Pudi, V. 2021. Generating natural language attacks in a hard label black box setting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 13525–13533.
- Minaee et al. (2021) Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; and Gao, J. 2021. Deep learning–based text classification: a comprehensive review. In ACM Computing Surveys, volume 54, 1–40.
- Morris et al. (2020) Morris, J.; Lifland, E.; Yoo, J. Y.; Grigsby, J.; Jin, D.; and Qi, Y. 2020. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 119–126.
- Mrkšić et al. (2016) Mrkšić, N.; Ó Séaghdha, D.; Thomson, B.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.-H.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Counter-fitting Word Vectors to Linguistic Constraints. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 142–148.
- Pang and Lee (2005) Pang, B.; and Lee, L. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 115–124.
- Papernot et al. (2017) Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 506–519.
- Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. In The Journal of Machine Learning Research, volume 21, 5485–5551.
- Ribeiro, Singh, and Guestrin (2016) Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144.
- Shrikumar et al. (2016) Shrikumar, A.; Greenside, P.; Shcherbina, A.; and Kundaje, A. 2016. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
- Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642.
- Štrumbelj and Kononenko (2014) Štrumbelj, E.; and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. In Knowledge and information systems, volume 41, 647–665.
- Szegedy et al. (2013) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Wang et al. (2021) Wang, B.; Xu, C.; Wang, S.; Gan, Z.; Cheng, Y.; Gao, J.; Awadallah, A. H.; and Li, B. 2021. Adversarial glue: A multi-task benchmark for robustness evaluation of language models. arXiv preprint arXiv:2111.02840.
- Wang et al. (2023) Wang, J.; Hu, X.; Hou, W.; Chen, H.; Zheng, R.; Wang, Y.; Yang, L.; Huang, H.; Ye, W.; Geng, X.; et al. 2023. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095.
- Williams, Nangia, and Bowman (2018) Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1112–1122.
- Yang et al. (2021) Yang, X.; Liu, W.; Tao, D.; and Liu, W. 2021. BESA: BERT-based Simulated Annealing for Adversarial Text Attacks. In International Joint Conference on Artificial Intelligence, 3293–3299.
- Ye et al. (2022a) Ye, M.; Chen, J.; Miao, C.; Wang, T.; and Ma, F. 2022a. LeapAttack: Hard-Label Adversarial Attack on Text via Gradient-Based Optimization. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2307–2315.
- Ye et al. (2022b) Ye, M.; Miao, C.; Wang, T.; and Ma, F. 2022b. TextHoaxer: Budgeted Hard-Label Adversarial Attacks on Text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 3877–3884.
- Yoo et al. (2020) Yoo, J. Y.; Morris, J.; Lifland, E.; and Qi, Y. 2020. Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 323–332.
- Yoo and Qi (2021) Yoo, J. Y.; and Qi, Y. 2021. Towards Improving Adversarial Training of NLP Models. Findings of the Association for Computational Linguistics: EMNLP.
- Yu et al. (2022) Yu, Z.; Wang, X.; Che, W.; and He, K. 2022. Learning-based Hybrid Local Search for the Hard-label Textual Attack. arXiv preprint arXiv:2201.08193.
- Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems.
- Zhu, Zhao, and Wu (2023) Zhu, H.; Zhao, Q.; and Wu, Y. 2023. BeamAttack: Generating High-quality Textual Adversarial Examples through Beam Search and Mixed Semantic Spaces. arXiv preprint arXiv:2303.07199.
Appendix A: Victim Model and Datasets
We carry out all experiments on an NVIDIA Tesla V100 16G GPU. We adopt three neural networks, CNN, LSTM and BERT, from TextFooler. The CNN uses three window sizes of 3, 4, and 5, with 100 filters for each window size. The LSTM consists of a bidirectional LSTM layer with 150 hidden states. Both CNN and LSTM have a dropout rate of 0.3 and use 200-dimensional GloVe word embeddings pre-trained on 6B tokens. The BERT model consists of 12 layers with 768 hidden units and 12 heads. The original accuracy of the victim models is listed in Table 9, and dataset details are listed in Table 10. We select datasets with different text lengths and different numbers of classes.
Dataset | CNN | LSTM | BERT |
MR | 78.0 | 80.7 | 86.0 |
SST-2 | 82.7 | 84.5 | 92.4 |
AG | 91.5 | 91.3 | 94.2 |
Yahoo | 73.7 | 73.7 | 79.1 |
SNLI | - | - | 89.1 |
MNLIm | - | - | 85.1 |
MNLImm | - | - | 82.1 |
Task | Dataset | Train | Test | Classes | Length |
Classification | MR | 9K | 1K | 2 | 18 |
Classification | SST-2 | 70K | 2K | 2 | 8 |
Classification | AG | 120K | 8K | 4 | 43 |
Classification | Yahoo | 12K | 4K | 10 | 151 |
Entailment | SNLI | 570K | 3K | 3 | 20 |
Entailment | MNLI(m/mm) | 433K | 10K | 3 | 11 |
Appendix B: The Effectiveness of LIME in Score-based Attacks
Traditional score-based attacks utilize deletion-based methods to calculate the word importance ranking. They drop a word $w_i$ from the benign sample $x$ and query the victim model $f$ with the new sample $x_{\setminus w_i}$. The difference in the model's confidence score on the true label $y$ before and after deletion reflects the importance of this word:
$$I(w_i) = f_y(x) - f_y(x_{\setminus w_i}) \qquad (8)$$
To verify the effectiveness of the local explainable method, we replace the deletion-based method with the local explainable method in a score-based attack. We test on the MR dataset, and the results are shown in Table 11. The two methods achieve similar attack success rates, but the deletion-based method achieves a lower perturbation rate. Because the probability distribution of the model's output is available in this setting, the deletion-based method reflects the influence of each word on the output well. Therefore, we think local explainable methods offer a greater advantage in hard-label attacks, where the deletion-based method is useless, than in score-based attacks.
Dataset | Victim Model | Deletion-based ASR. | Deletion-based Pert. | LIME ASR. | LIME Pert. |
MR | CNN | 1.0 | 11.9 | 1.0 | 12.4 |
MR | LSTM | 0.6 | 12.3 | 0.6 | 12.8 |
MR | BERT | 8.2 | 16.3 | 8.1 | 17.4 |
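The deletion-based ranking used by score-based attacks can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `deletion_importance` and `toy_predict_proba` are hypothetical names, and the toy victim model is an assumption made only so the example runs without a trained classifier.

```python
from typing import Callable, List, Sequence


def deletion_importance(
    words: Sequence[str],
    true_label: int,
    predict_proba: Callable[[List[str]], Sequence[float]],
) -> List[float]:
    """Score each word by the drop in true-label confidence when it is deleted."""
    base = predict_proba(list(words))[true_label]
    scores = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])
        scores.append(base - predict_proba(reduced)[true_label])
    return scores


def toy_predict_proba(words: List[str]) -> List[float]:
    """Hypothetical victim: positive confidence grows with occurrences of 'good'."""
    pos = min(0.9, 0.1 + 0.4 * sum(w == "good" for w in words))
    return [1.0 - pos, pos]


scores = deletion_importance(["a", "good", "movie"], 1, toy_predict_proba)
# Positions sorted from most to least important.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Note that this ranking needs the confidence scores returned by `predict_proba`; with only a discrete label, most single-word deletions leave the output unchanged, which is why the method does not carry over to the hard-label setting.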
Appendix C: The Effectiveness of Beam Size
Beam size $b$ directly determines the size of the search space. A bigger search space helps find a better solution (e.g., a lower perturbation rate and higher semantic similarity), but it also requires more model queries. The question is therefore how to select a beam size that balances the query cost and the attack success rate. As shown in Figure 4, we test on the MR and SST-2 datasets using BERT with different beam sizes. As the beam size $b$ increases, the search space is effectively expanded, and both the attack success rate and the quality of adversarial examples (a lower perturbation rate) improve. As the beam size $b$ increases further, the number of queries also grows, and under a fixed query budget the attack success rate decreases. Considering the overall effect, we choose the beam size that best balances these two factors.
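The trade-off between beam size and query count can be made concrete with a small beam search over word substitutions. The sketch below is illustrative only: the function name, the `score` callback used to rank surviving candidates (here, word overlap with the original), and the toy victim are all assumptions, not the selection criterion or models used in the paper.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def beam_search_attack(
    words: Sequence[str],
    order: Sequence[int],                 # positions, most important first
    candidates: Dict[str, List[str]],     # hypothetical synonym table
    predict_label: Callable[[List[str]], int],
    true_label: int,
    score: Callable[[List[str]], float],  # assumed ranking criterion for beams
    beam_size: int,
    query_budget: int,
) -> Tuple[Optional[List[str]], int]:
    """Return (adversarial words or None, number of queries used)."""
    beams: List[List[str]] = [list(words)]
    queries = 0
    for pos in order:
        expanded: List[List[str]] = []
        for beam in beams:
            for sub in candidates.get(beam[pos], []):
                if queries >= query_budget:
                    return None, queries          # budget exhausted
                trial = beam[:pos] + [sub] + beam[pos + 1:]
                queries += 1
                if predict_label(trial) != true_label:
                    return trial, queries         # label flipped
                expanded.append(trial)
        if expanded:
            # Keep only the best `beam_size` candidates for the next position.
            expanded.sort(key=score, reverse=True)
            beams = expanded[:beam_size]
    return None, queries


original = ["good", "great", "film"]
subs = {"good": ["fine", "decent"], "great": ["big", "grand"]}
victim = lambda ws: 1 if ("good" in ws or "great" in ws) else 0
overlap = lambda ws: sum(a == b for a, b in zip(ws, original)) / len(original)

adv, used = beam_search_attack(original, [0, 1], subs, victim, 1, overlap,
                               beam_size=2, query_budget=50)
print(adv, used)
```

Each extra beam multiplies the number of substitutions that must be queried at the next position, so a larger `beam_size` reaches the `query_budget` sooner; this is the mechanism behind the drop in success rate for large beams described above.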
Appendix D: Transferability
The transferability of adversarial examples refers to the property that adversarial examples crafted against a particular victim model can also fool another model. In detail, we calculate the prediction accuracy of the CNN and LSTM models on adversarial examples crafted for attacking BERT on the MR dataset. As shown in Figure 5, adversarial examples generated by LimeAttack achieve higher transferability than the baselines. They reduce the prediction accuracy of the CNN and LSTM models from 80.7% and 78.0% to 58.5% and 58.4%, respectively.
Appendix E: Adversarial Training
Adversarial training is a prevalent technique to improve the victim model's robustness by adding adversarial examples to the training data. We randomly selected 1000 adversarial examples from the MR dataset, retrained the CNN model, and then attacked the CNN model again. The results are shown in Table 12. After adversarial training, the CNN model achieves higher test accuracy. In addition, LimeAttack's attack success rate decreases by about 3 percentage points (from 38.18% to 35.09%), and the attack needs more queries and a higher perturbation rate. Adversarial examples generated by LimeAttack thus effectively improve the victim model's robustness and generalization.
Model | Ori Acc. | ASR. | Pert. | Sim. | Query. |
Original | 80.27 | 38.18 | 3.90 | 97.00 | 22.21 |
+Adv.Training | 81.53 | 35.09 | 3.94 | 97.01 | 24.90 |
Appendix F: Large Language Models
Settings
In this section, we provide a brief introduction to the large language models used in our experiments.
- BART-L: BART is a transformer-based model that can handle both generation and understanding tasks. It is trained on a combination of auto-regressive and denoising objectives, which is primarily focused on understanding tasks.
- DeBERTa-L: DeBERTa enhances BERT with a disentangled attention mechanism and an improved decoding scheme. This allows it to capture contextual information between different tokens more effectively and generate higher quality natural language sentences.
- Flan-T5: Flan-T5 uses a text-to-text approach where both input and output are natural language sentences, enabling it to perform a variety of tasks including text generation, summarization, and classification. By taking an input sentence as a prompt, Flan-T5 can accomplish common NLP tasks.
- Text-davinci-003 and ChatGPT: These models are based on GPT-3 and GPT-3.5. They can perform tasks specified by natural language inputs and produce higher quality and more faithful output.
To keep the output of the large language models stable, we use the same prompt for every model in the zero-shot text classification task: Please classify the following sentence into either positive or negative. Answer me with "positive" or "negative", just one word.
Discussion
Generalization Error.
In this subsection, we provide some analysis of the models' generalization error, also known as the out-of-sample error. It measures how accurately an algorithm predicts outcome values for previously unseen data. Let $\mathcal{H}$ be a finite hypothesis set and $m$ the number of training samples. For each $h \in \mathcal{H}$ and any $0 < \delta < 1$, probably approximately correct (PAC) theory reveals that:
$$P\left( \left| E(h) - \hat{E}(h) \right| \le \sqrt{\frac{\ln |\mathcal{H}| + \ln (2/\delta)}{2m}} \right) \ge 1 - \delta \qquad (9)$$
where $E(h)$ and $\hat{E}(h)$ are the ideal and empirical risk of classifier $h$. According to Table 6 in the main text, the robustness of the victim model is related to its original accuracy: the higher the original accuracy, the stronger the victim model's ability to defend against adversarial examples. Generalization error relies on two factors: the training sample size ($m$) and the hypothesis space ($\mathcal{H}$). Large language models, like ChatGPT, excel in performance due to their extensive training data (large $m$). Moreover, although the hypothesis set ($\mathcal{H}$) is finite, a larger $m$ tightens the bound in Eq. (9) and a richer $\mathcal{H}$ can lower the empirical risk, so increasing both can reduce the generalization error. This observation helps elucidate why such models excel in zero-shot classification for certain tasks.
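To give a feel for how the training sample size enters this argument, the snippet below evaluates the gap term of the standard finite-hypothesis-set PAC bound, $\sqrt{(\ln|\mathcal{H}| + \ln(2/\delta)) / (2m)}$. It assumes this textbook form of the bound; the function name and the chosen values of $|\mathcal{H}|$, $m$ and $\delta$ are purely illustrative and are not taken from the experiments.

```python
import math


def pac_gap(num_hypotheses: float, num_samples: int, delta: float) -> float:
    """Upper bound on |ideal risk - empirical risk| that holds with prob. >= 1 - delta."""
    return math.sqrt(
        (math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * num_samples)
    )


# The gap shrinks as the number of training samples m grows ...
for m in (1_000, 100_000, 10_000_000):
    print(f"|H|=1e6, m={m}: gap <= {pac_gap(1e6, m, 0.05):.4f}")

# ... and grows only logarithmically with the size of the hypothesis set.
for h in (1e3, 1e6, 1e9):
    print(f"|H|={h:.0e}, m=100000: gap <= {pac_gap(h, 100_000, 0.05):.4f}")
```

The bound controls only the gap between empirical and ideal risk; a larger hypothesis set widens that gap slowly while potentially lowering the empirical risk itself.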
Attack ChatGPT.
To validate the effectiveness of hard-label attack algorithms in the real world, we evaluate the attack performance of LimeAttack, HLBB, LeapAttack, TextHoaxer and TextHacker on ChatGPT. Due to OpenAI's limit on the number of API calls, we select 20 adversarial examples generated by each hard-label attack algorithm against BERT on the MR dataset, and input them into ChatGPT to observe whether they produce predictions opposite to those for the original samples. As shown in Table 13, under a tight query budget against a real-world API, LimeAttack matches the highest attack success rate (20.0%) while generating higher-quality adversarial examples (lower perturbation rate and higher semantic similarity) than the other methods that reach this rate.
Attack | ASR. | Pert. | Sim. |
HLBB | 10.0 | 3.70 | 96.80 |
LeapAttack | 20.0 | 8.57 | 88.85 |
TextHoaxer | 10.0 | 4.61 | 89.71 |
TextHacker | 20.0 | 7.61 | 90.21 |
LimeAttack | 20.0 | 4.51 | 95.30 |
Appendix G: Significance Test
We conduct a t-test and list the mean and variance of LimeAttack's attack success rate, together with the p-values against other methods, in Table 14. LimeAttack is run with five additional seeds and the results are averaged, which is consistent with the other baselines. As shown in Table 14, LimeAttack achieves better results than the other baselines in most settings under a tight query budget.
Model_dataset | LimeAttack Mean | LimeAttack Variance | HLBB p-value | TextHoaxer p-value | LeapAttack p-value | TextHacker p-value |
CNN_MR | 49.9 | 9.00E-02 | 2.74E-05 | 2.38E-05 | 1.19E-05 | 8.20E-02 |
LSTM_MR | 47.6 | 2.50E-01 | 8.98E-05 | 3.45E-05 | 4.78E-05 | 5.69E-03 |
BERT_MR | 29.2 | 1.42E-01 | 2.85E-02 | 6.70E-02 | 2.37E-02 | 2.37E-02 |
CNN_SST | 42.8 | 2.91E-01 | 1.58E-05 | 1.24E-04 | 4.42E-04 | 1.24E-04 |
LSTM_SST | 40.1 | 8.02E-01 | 1.53E-03 | 5.66E-03 | 6.41E-02 | 3.30E-03 |
BERT_SST | 27.8 | 4.22E-02 | 1.73E-05 | 3.39E-05 | 5.71E-05 | 4.17E-05 |
CNN_AG | 20.9 | 2.28E-01 | 1.38E-02 | 2.27E-03 | 6.30E-01 | 3.09E-01 |
LSTM_AG | 17.3 | 5.18E-02 | 1.50E-03 | 5.07E-04 | 1.38E-02 | 3.17E-01 |
BERT_AG | 14.6 | 1.02E-02 | 4.41E-03 | 5.16E-04 | 3.77E-02 | 5.86E-03 |
CNN_Yahoo | 43.7 | 1.56E-01 | 6.30E-02 | 1.26E-03 | 2.68E-03 | 1.82E-04 |
LSTM_Yahoo | 40.3 | 4.22E-02 | 1.39E-03 | 3.45E-04 | 5.49E-04 | 2.69E-04 |
BERT_Yahoo | 37.4 | 2.25E-02 | 4.22E-01 | 1.59E-03 | 4.04E-03 | 8.47E-04 |
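The source does not say which form of the t-test was used. As one possible reading, the sketch below computes a one-sample t statistic that compares per-seed success rates with a single reference value for a baseline. The function name and the per-seed numbers are made up for illustration, and converting the statistic to a p-value would additionally require the Student t distribution with $n-1$ degrees of freedom (for example from a statistics library).

```python
import math
import statistics
from typing import Sequence


def one_sample_t(samples: Sequence[float], reference: float) -> float:
    """t = (sample mean - reference) / (sample std / sqrt(n))."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_err = math.sqrt(statistics.variance(samples) / n)  # unbiased variance
    return (mean - reference) / std_err


# Hypothetical per-seed attack success rates (not values from the paper).
seed_rates = [49.6, 50.2, 49.9, 50.3, 49.5]
t_stat = one_sample_t(seed_rates, reference=48.0)
print(f"mean={statistics.mean(seed_rates):.2f}, "
      f"variance={statistics.variance(seed_rates):.3f}, t={t_stat:.2f}")
```

A small variance across seeds, as reported in the table above, makes even a modest difference in mean success rate yield a large t statistic.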
Appendix H: LimeAttack Algorithm
The whole process of LimeAttack is summarized in Algorithm 1.
Input: Original text ,target model
Output: Adversarial example
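The first stage, approximating word importance with a local explainable method when only hard labels are available, can be sketched as follows. This is a simplified LIME-style surrogate written under stated assumptions, not the authors' code: neighbours are sampled by randomly dropping words, the regression target is whether the hard label stays unchanged, the proximity kernel and all hyperparameters (`num_samples`, `kernel_width`, `ridge`) are illustrative choices, and the toy victim is hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_word_importance(
    words: Sequence[str],
    predict_label: Callable[[List[str]], int],
    num_samples: int = 200,
    kernel_width: float = 0.75,
    ridge: float = 1e-3,
    seed: int = 0,
) -> List[float]:
    """Fit a weighted ridge-regularised linear surrogate on hard-label queries."""
    rng = random.Random(seed)
    d = len(words)
    original_label = predict_label(list(words))
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        mask = [rng.random() < 0.5 for _ in range(d)]       # True = word kept
        kept = [w for w, keep in zip(words, mask) if keep]
        same = predict_label(kept) == original_label          # one hard-label query
        removed = 1.0 - sum(mask) / d                          # distance to original
        rows.append([1.0 if keep else 0.0 for keep in mask] + [1.0])  # + bias
        targets.append(1.0 if same else 0.0)
        weights.append(math.exp(-(removed ** 2) / kernel_width ** 2))
    k = d + 1
    a = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
          + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    b = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
         for i in range(k)]
    return solve_linear(a, b)[:d]          # drop the bias coefficient


text = ["an", "awful", "boring", "film"]
toy_victim = lambda ws: 0 if "awful" in ws else 1   # hypothetical hard-label model
importance = lime_word_importance(text, toy_victim)
print(sorted(zip(text, importance), key=lambda p: p[1], reverse=True))
```

The coefficients of the surrogate serve as the word importance ranking, and words are then attacked in that order. Every sampled neighbour costs one query, so `num_samples` has to be kept small when the query budget is tight.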
Appendix I: Qualitative Examples
Appendix J: Limitation
- Exploring more LLMs. Due to limited resources, this paper only tests some popular large language models. However, there are victim models based on other LLMs, e.g., LLaMA. Hence, victim models based on more LLMs could be studied.
- More NLP tasks. In this paper, we only attack some classification tasks (e.g., text classification, textual entailment and zero-shot classification). It would be interesting to attack other NLP applications, such as dialogue, text summarization, and machine translation.
Appendix K: Semantic Similarity of Different Attack Algorithms
We report semantic similarity in Table 15. Some baselines take similarity into account during the attack; thus LimeAttack exhibits lower similarity than those methods. Considering all metrics, LimeAttack is still dominant.
Dataset | Model | HLBB | TextHoaxer | LeapAttack | TextHacker | LimeAttack |
MR | CNN | 97.20 | 97.11 | 97.17 | 94.56 | 95.21 |
MR | LSTM | 97.27 | 97.27 | 97.22 | 95.01 | 95.31 |
MR | BERT | 97.13 | 97.16 | 97.09 | 94.16 | 94.77 |
SST | CNN | 97.18 | 97.22 | 97.14 | 94.02 | 94.41 |
SST | LSTM | 97.22 | 97.21 | 97.18 | 94.58 | 94.69 |
SST | BERT | 97.22 | 97.07 | 97.13 | 93.77 | 94.56 |
AG | CNN | 97.64 | 97.62 | 97.62 | 95.71 | 96.27 |
AG | LSTM | 97.64 | 97.58 | 97.62 | 95.46 | 96.11 |
AG | BERT | 97.57 | 97.61 | 97.56 | 95.14 | 96.53 |
Yahoo | CNN | 97.75 | 97.72 | 97.71 | 95.33 | 96.21 |
Yahoo | LSTM | 97.71 | 97.66 | 97.67 | 95.41 | 96.41 |
Yahoo | BERT | 97.73 | 97.68 | 97.63 | 95.12 | 96.55 |
Appendix L: Comparison with Score-based Attacks
Since LimeAttack follows the same two-stage strategy as score-based attacks, we also take some classic score-based attacks for reference. LimeAttack and these score-based attacks share the same settings, except that score-based attacks can obtain the probability distribution of the output, while LimeAttack cannot. Therefore, we do not limit the query budget for LimeAttack or the score-based attacks. As shown in Table 16, LimeAttack still achieves a higher attack success rate and semantic similarity in most cases. LimeAttack's superiority can be attributed to its focus on crucial words through the learned word importance ranking and to the expanded search space introduced by beam search. However, LimeAttack requires more queries to compute word importance rankings because it lacks the probability distribution of the output. This is more pronounced for long texts.
Dataset | Model | Attack | ASR. | Pert. | Sim. | Query. |
MR | CNN | TF | 60.9 | 5.88 | 94.21 | 51.84 |
MR | CNN | PWWS | 62.4 | 5.88 | 92.34 | 144.37 |
MR | CNN | Bert-Attack | 46.3 | 5.75 | 94.51 | 28.25 |
MR | CNN | LimeAttack | 62.5 | 5.60 | 95.33 | 268.94 |
MR | LSTM | TF | 65.8 | 5.63 | 94.56 | 49.96 |
MR | LSTM | Bert-Attack | 50.2 | 5.77 | 94.4 | 28.53 |
MR | LSTM | LimeAttack | 61.2 | 5.51 | 95.44 | 253.07 |
MR | BERT | TF | 46.5 | 5.68 | 94.43 | 51.48 |
MR | BERT | Bert-Attack | 35.0 | 5.82 | 94.64 | 28.59 |
MR | BERT | LimeAttack | 47.6 | 5.59 | 94.99 | 821.28 |
AG | CNN | TF | 32.1 | 5.96 | 94.65 | 43.67 |
AG | CNN | PWWS | 32.1 | 5.94 | 94.85 | 47.68 |
AG | CNN | LimeAttack | 38.1 | 4.55 | 96.53 | 879.23 |
AG | LSTM | TF | 30.5 | 5.51 | 95.40 | 46.93 |
AG | LSTM | PWWS | 32.1 | 5.94 | 94.85 | 47.68 |
AG | LSTM | LimeAttack | 35.4 | 4.55 | 96.13 | 975.35 |
SST-2 | CNN | TF | 51.0 | 5.96 | 93.83 | 51.67 |
SST-2 | CNN | LimeAttack | 51.0 | 5.99 | 94.90 | 150.08 |
SST-2 | LSTM | TF | 52.1 | 5.93 | 93.54 | 50.7 |
SST-2 | LSTM | LimeAttack | 50.5 | 6.13 | 94.70 | 320.45 |
Appendix M: Evaluation on Defense Methods
We use A2T (whose core is a new and cheaper word substitution attack optimized for adversarial training) and ASCC to enhance the defense ability of BERT on the MR and SST datasets, and conduct attack experiments on the defended models. As shown in Table 17, even after this defensive training, our algorithm still attacks the defended models with some success. Compared with A2T, ASCC provides a stronger defense and improves model robustness to a greater degree.
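Attack | BERT-MR (original) ASR | BERT-MR (original) PERT | BERT-MR + A2T ASR | BERT-MR + A2T PERT | BERT-MR + ASCC ASR | BERT-MR + ASCC PERT | BERT-SST (original) ASR | BERT-SST (original) PERT | BERT-SST + A2T ASR | BERT-SST + A2T PERT | BERT-SST + ASCC ASR | BERT-SST + ASCC PERT |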
HLBB | 26.6 | 5.6 | 23.5 | 5.6 | 20.1 | 5.6 | 23.0 | 5.8 | 21.3 | 6.0 | 19.3 | 6.1 |
TextHoaxer | 27.0 | 5.5 | 24.3 | 5.6 | 21.2 | 5.7 | 24.9 | 5.8 | 21.8 | 5.9 | 20.1 | 5.9 |
LeapAttack | 26.5 | 5.4 | 24.0 | 5.6 | 22.3 | 5.6 | 26.1 | 5.8 | 21.7 | 5.9 | 19.6 | 6.1 |
TextHacker | 26.5 | 6.5 | 24.1 | 6.6 | 22.5 | 6.6 | 25.4 | 6.3 | 22.1 | 6.3 | 19.1 | 6.6 |
LimeAttack | 29.2 | 5.9 | 25.7 | 5.8 | 23.4 | 5.8 | 27.8 | 5.7 | 22.7 | 5.9 | 20.3 | 6.1 |
Appendix N: Convergence of Attack Performance
Convergence of attack success rate
We conduct further evaluations of the attack methods when the query budget is relaxed. As shown in Table 18, LimeAttack achieves a better attack success rate than other attacks in most cases. The attack success rate without considering the query budget is more of an ideal situation: it shows the upper limit of an attack algorithm. A high query budget is equivalent to traversing the solution space and approaches the upper limits of ASR and Pert. for the victim model. However, ASR and Pert. interact with each other, so their optima are not reached simultaneously. Therefore, for some victim models (LSTM-AG and BERT-Yahoo), LimeAttack achieves the lowest Pert. but not the best ASR (although it is very close).
Attack | CNN_MR ASR | CNN_MR PERT | CNN_SST ASR | CNN_SST PERT | LSTM_MR ASR | LSTM_MR PERT | LSTM_SST ASR | LSTM_SST PERT | LSTM_AG ASR | LSTM_AG PERT | BERT_SST ASR | BERT_SST PERT | BERT_Yahoo ASR | BERT_Yahoo PERT |
HLBB | 55.6 | 5.6 | 43.4 | 6.4 | 54.5 | 5.6 | 43.3 | 6.4 | 30.4 | 5.5 | 30.3 | 6.7 | 62.2 | 6.7 |
TextHoaxer | 55.6 | 5.4 | 43.9 | 6.4 | 52.9 | 5.4 | 45.5 | 6.3 | 31.1 | 5.8 | 35.9 | 6.6 | 63.2 | 6.6 |
LeapAttack | 56.4 | 5.5 | 44.3 | 6.5 | 54.6 | 5.5 | 44.3 | 6.2 | 31.3 | 5.3 | 37.5 | 6.2 | 63.1 | 6.4 |
TextHacker | 59.2 | 5.6 | 38.0 | 6.7 | 56.0 | 5.6 | 44.0 | 6.5 | 32.0 | 5.8 | 38.0 | 6.0 | 67.2 | 6.4 |
LimeAttack | 59.4 | 5.7 | 48.6 | 6.0 | 59.3 | 5.5 | 45.5 | 5.9 | 31.2 | 5.3 | 42.5 | 6.1 | 66.0 | 6.2 |
Convergence of perturbation rate
We show the convergence behavior of different attacks in Figure 6. Because previous algorithms rely on complex optimization procedures, they require a large number of queries to complete the optimization; in return, they often reach good perturbation rates.
Appendix O: Comparison with SHAP and Non-linear Models
In the hard-label setting, the model's logits are unavailable and the query budget is tiny. We list the attack success rate and perturbation rate of different word importance ranking methods under different query budgets. As shown in Table 19, compared with LIME, SHAP and non-linear models do not show significant advantages in attack success rate or perturbation rate under tiny query budgets. Considering the time complexity, we adopt LIME to calculate the word importance ranking in the main text.
Method | CNN-MR ASR (budget 100) | CNN-MR PERT (budget 100) | BERT-SST ASR (budget 100) | BERT-SST PERT (budget 100) | CNN-MR ASR (budget 2000) | CNN-MR PERT (budget 2000) | BERT-SST ASR (budget 2000) | BERT-SST PERT (budget 2000) |
LIME | 49.9 | 5.3 | 27.8 | 5.7 | 59.4 | 5.7 | 42.5 | 6.1 |
SHAP | 49.7 | 5.2 | 27.7 | 5.7 | 61.2 | 5.8 | 44.3 | 6.3 |
Decision Tree | 50.1 | 5.3 | 27.9 | 5.8 | 61.6 | 5.8 | 44.1 | 6.4 |
Attack | Texts | Query. |
No Attack | It allows us hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker. | 0 |
HLBB | It allows us hope that nolan is poised to incur a major career as a commercial yet ingenuity filmmaker. | 2062 |
TextHoaxer | It allows us hope that nolan is poised to start a major career as a commercial yet contrivance filmmaker. | 48 |
LeapAttack | It allows us hope that nolan is poised to embark a major career as a commercial yet contrivance filmmaker. | 30 |
TextHacker | It allows us hope that nolan is readies to embark a major career as a commercial yet creative filmmaker. | 101 |
LimeAttack | It allows us hope that nolan is poised to embark a major career as a commercial yet contrivance filmmaker. | 43 |
Attack | Texts | Query. |
No Attack | The acting,costumes,music,cinematogrtaphy and sound are all astounding given the production’s austere locales. | 0 |
HLBB | The acting,costumes,music,cinematogrtaphy and sound are all stupendous given the production’s austere locales. | 35 |
TextHoaxer | The acting,costumes,music,cinematogrtaphy and sound are all staggering given the production’s austere locales. | 45 |
LeapAttack | the acting,costumes,music,cinematogrtaphy and sound are all astounding dispensed the production’s austere locales. | 35 |
TextHacker | the provisonal,costumes,music,cinematogrtaphy and sound sunt all startling given the production’s stoic locales. | 101 |
LimeAttack | the acting,costumes,music,cinematogrtaphy and sound are all staggering given the production’s austere locales. | 25 |
Attack | Texts | Query. |
No Attack | In basketball whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court. | 0 |
HLBB | In basket whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court. | 6 |
TextHoaxer | In wildcats whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court | 6 |
LeapAttack | In wildcats whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court. | 6 |
TextHacker | In basketball whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until havent completed the exercise on both sides of the court. | 101 |
LimeAttack | In basketballs whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court. | 39 |
Attack | Texts | Query. |
No Attack | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india. | 0 |
HLBB | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent groundbreaking of indian nationalistic freedom hunter and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india. | 116 |
TextHoaxer | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british parliament 1862 congress president thrice the immense old man of indian. | 440 |
LeapAttack | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism liberty hunters and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india. | 1411 |
TextHacker | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneers of indian nationalism freedom fighter and educationist the first indian to become member of british chambre 1862 congress president thrice the grand old man of india. | 101 |
LimeAttack | Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british legislature 1862 congress president thrice the grand old man of india. | 45 |
Attack | Texts | Query. |
No Attack | Those outside show business will enjoy a close look at people they do n’t really want to know. | 0 |
HLBB | Those outside show business will enjoy a nearby look at people they do n’t really want to know. | 2241 |
TextHoaxer | Those outside show business will recieve a close look at people they do n’t really want to know. | 202 |
LeapAttack | Those outside show business will like a close glanced at people they do n’t really want to know. | 1431 |
TextHacker | Those outside show companies will experience a close glance at volk they do n’t really want to know. | 103 |
LimeAttack | Those outside show business will recieve a close glanced at people they do n’t really want to know | 53 |
Attack | Texts | Query. |
No Attack | I’m convinced i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking. | 0 |
HLBB | I’m convinced i could keep a family of five blind , invalids , amish people alive in this situation better than these british soldiers do at keeping themselves kicking. | 2110 |
TextHoaxer | I’m gratified i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking. | 219 |
LeapAttack | I’m contented i could keep a family of five blind , paralytic, amish people alive in this plight better than these british soldiers do at keeping themselves kicking. | 2162 |
TextHacker | I’m convinced i could keep a family of five blind , handicapped , amish people lively in this situation better than these british soldiers do at keeping themselves kicking. | 101 |
LimeAttack | I’m gratified i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking. | 50 |
Attack | Texts | Query.
No Attack | Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure. | 0
HLBB | spaniards to executes luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aeroplanes group tbi by a barcelona based abertis infrastructure. | 969
TextHoaxer | Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the manaus of a spanish toll motorways exploiter through a 551 m coup of the aviation group tbi by a barcelona based abertis infrastructure. | 727
LeapAttack | Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aeroplanes group tbi by a barcelona based abertis infrastructure. | 2148
TextHacker | Spaniards to implementing luton airport after 551 m deal luton , cardiff and belfast international airports represent to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure. | 101
LimeAttack | Spaniards to run luton luton after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure. | 92
Attack | Texts | Query.
No Attack | Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz. | 0
HLBB | Eisner says ovitz required oversight daily michael d eisner appeared for a second weekly of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz. | 31
TextHoaxer | Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the interesting severance package granted to michael s ovitz. | 48
LeapAttack | Eisner says ovitz needing oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz. | 14
TextHacker | Eisner says ovitz required surveillance everyday michael d eisner appeared for a second day of testimonies in the shareholder lawsuit over the rewarding severance package granted to michael s ovitz. | 101
LimeAttack | Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the proprietors lawsuit over the lucrative severance package granted to michael s ovitz. | 34
Attack | Texts | Query.
No Attack | Cray promotes two execs ly huong pham becomes the supercomputer maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services. | 0
HLBB | Cray promotes two execs ly huong pham buys the supercomputer maker’s senior vice presdent of operations,and peter ungaro is made senior obscene chairperson for sales,marketing and services. | 3811
TextHoaxer | Hucknall promotes two execs ly huong pham becomes the supercomputer maker’s senior vice president of surgical, and peter ungaro is made senior vice president for sales, marketing and services. | 94
LeapAttack | Cray promotes two execs ly huong pham becomes the quadrillion maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services. | 42
TextHacker | Cray promotes two ceos ly huong pham becomes the supercomputer maker’s senior prostitution presdent of operations,and peter ungaro is made senior vice president for selling,marketing and services. | 101
LimeAttack | Cray promotes two execs ly huong pham becomes the thermonuclear maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services. | 39