LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack

Hai Zhu^{1 3}, Qingyang Zhao², Weiwei Shang¹, Yuren Wu³, Kai Liu⁴ Corresponding author.

Abstract

Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks rely on model internal information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, the hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then use complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and their attack success rate is restricted by the adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate the word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and against some defense methods, and the results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.

Introduction

Deep Neural Networks (DNNs) are widely applied in the natural language processing field and have achieved great success (Kim 2014; Devlin et al. 2019; Minaee et al. 2021; Hochreiter and Schmidhuber 1997). However, DNNs are vulnerable to adversarial examples, which are correctly classified samples altered by slight perturbations (Jin et al. 2020; Papernot et al. 2017; Kurakin, Goodfellow, and Bengio 2016). These adversarial perturbations are imperceptible to humans but can mislead the model. Adversarial examples seriously threaten the robustness and reliability of DNNs, especially in security-critical applications (e.g., autonomous driving and toxic text detection (Yang et al. 2021; Kurakin, Goodfellow, and Bengio 2018)). Adversarial attacks and defenses have therefore attracted enormous attention in computer vision, natural language processing and speech (Szegedy et al. 2013; Carlini and Wagner 2018; Yu et al. 2022). Crafting textual adversarial examples is more challenging because language is discrete and must satisfy lexical, semantic, and fluency constraints.

Depending on the scenario, textual adversarial attacks can be roughly divided into white-box attacks, score-based attacks and hard-label attacks. In a white-box setting, the attacker uses the model’s parameters and gradients to generate adversarial examples (Goodman, Zhonghou et al. 2020; Jiang et al. 2020). Score-based attacks only use class probabilities or confidence scores to craft adversarial examples (Jin et al. 2020; Li et al. 2020; Ma, Shi, and Guan 2020; Zhu, Zhao, and Wu 2023). However, these attack methods perform poorly in practice because DNNs are deployed through application programming interfaces (APIs), and the attacker has no access to the model’s parameters, gradients or probability distributions over labels (Ye et al. 2022b). In contrast, under a hard-label scenario, the model’s internal structures, gradients, training data and even confidence scores are unavailable. The attacker can only query the black-box victim model and get a discrete prediction label, which is more challenging and realistic. Additionally, most real-world model services (e.g., HuggingFace API, OpenAI API) limit the number of calls. In practice, therefore, the realistic attack setting is hard-label with a tiny query budget.

Some hard-label attack algorithms have been proposed (Yu et al. 2022; Ye et al. 2022b; Maheshwary, Maheshwary, and Pudi 2021; Ye et al. 2022a). They follow a two-stage strategy: i) generate low-quality adversarial examples by randomly replacing several original words with synonyms, and then ii) adopt complex heuristic algorithms (e.g., a genetic algorithm) to optimize the adversarial perturbation. As a result, these attack methods usually require many queries, and both the attack success rate and the quality of the adversarial examples are limited by the adversary initialization. In contrast, score-based attacks calculate word importance from the change in confidence scores after deleting one word. Word importance ranking improves attack efficiency by prioritizing words that have a significant impact on the model’s predictions (Jin et al. 2020). However, score-based attacks cannot calculate word importance in a hard-label setting because deleting one token hardly ever changes the discrete prediction label. Therefore, we investigate the following problem: how can the word importance ranking be calculated in a hard-label setting to improve attack efficiency?

Word importance ranking can reveal the decision boundary and thus indicate a better attack path, but existing hard-label algorithms ignore this useful information because it is hard to obtain. Inspired by local explainable methods (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Shrikumar et al. 2016) for DNNs, which are often used to explain the outputs of black-box models, we aim to estimate the token sensitivity on the benign sample. A previous study (Chai et al. 2023) simply replaced the deletion-based method with a local explainable method to calculate word importance in a score-based attack. However, in Appendix B we verify experimentally that the local explainable method has no significant advantage over the deletion-based method in a score-based scenario: because the probability distribution of the model’s output is available, the influence of each word on the output is already well reflected by the deletion-based method. We therefore expect local explainable methods to offer a greater advantage in hard-label attacks, where the deletion-based method is useless. We adopt the most fundamental and straightforward local explainable method, namely LIME. LIME is easy to understand and closer to the deletion-based method used in score-based attacks, which matters because our goal is to bridge the gap between score-based attacks and hard-label attacks by introducing an interpretability method. Local explainable methods are also model-agnostic and therefore suitable for word importance estimation in hard-label attacks. However, applying LIME to hard-label attacks raises the following difficulties: 1) how to allocate queries between LIME and the search under a tiny query budget to achieve optimal results; 2) how to establish a mapping between LIME and word importance in adversarial samples without the model’s logits; 3) how to sample reasonably during perturbation execution to achieve optimal results. In subsequent sections we explain in detail how we address these difficulties.

In this work, we propose a novel hard-label attack algorithm named LimeAttack. The application of LIME in hard-label attacks was inspired by the deletion method of score-based attacks. We verify the effectiveness of the inside-to-outside attack path in hard-label attacks, which suggests that many excellent score-based attacks may offer further insight for hard-label attacks. To evaluate attack performance and efficiency, we compare LimeAttack with other hard-label attacks, and take several score-based attacks as references, for two NLP tasks on seven common datasets. We also evaluate LimeAttack on current state-of-the-art large language models (e.g., ChatGPT). Experiments show that LimeAttack achieves the highest attack success rate among the baselines under a tiny query budget. Our contributions are summarized as follows:

• We summarize the shortcomings of existing hard-label attacks, apply LIME to connect score-based attacks and hard-label attacks, and verify the effectiveness of the inside-to-outside attack path in hard-label attacks.

• Extensive experiments show that LimeAttack achieves a higher attack success rate than existing hard-label attack algorithms under a tiny query budget. Meanwhile, adversarial examples crafted by LimeAttack are of high quality and difficult for humans to distinguish. (Code is available at https://github.com/zhuhai-ustc/limeattack.)

• In addition, we conduct attacks and evaluations on current state-of-the-art large language models. Results indicate that adversarial examples remain a significant threat to large language models. We also report attack performance against defense methods and convergence results for attack success rate and perturbation rate.

Related Work

Hard-Label Adversarial Attacks

In a hard-label setting, the attacker can only query the victim model and get a discrete prediction label. Therefore, the hard-label setting is more practical and challenging. Existing hard-label attacks follow a two-stage strategy, i.e., adversary initialization and perturbation optimization. HLBB (Maheshwary, Maheshwary, and Pudi 2021) initializes an adversarial example and adopts a genetic algorithm to optimize the perturbation. TextHoaxer (Ye et al. 2022b) and LeapAttack (Ye et al. 2022a) use semantic similarity and perturbation rate as the optimization objective to search for a better perturbation matrix in the continuous word embedding space. TextHacker (Yu et al. 2022) adopts a hybrid local search algorithm and a word importance table learned from the attack history to guide the local search. These attack methods often require many queries to reduce the perturbation rate, and the attack success rate and adversary quality are limited by the initialization. Therefore, in this work, we attempt to craft an adversarial example directly from the benign sample. This approach can generate high-quality adversarial examples with fewer queries.

Local Explainable Methods

To improve DNN interpretability and aid decision-making, various methods for explaining DNNs have been proposed; they are broadly categorized as global or local explainable methods. Global explainable methods focus on the model itself by using overall knowledge about the model’s architecture and parameters. In contrast, local methods fit a simple and interpretable model (e.g., a decision tree) to a single input to measure the contribution of each token. In detail, local explainable methods (Lundberg and Lee 2017; Shrikumar et al. 2016; Štrumbelj and Kononenko 2014) associate all input tokens by defining a linear interpretability model and assume that the contribution of each token in the input is additive. This is also called the additive feature attribution method. In this paper, local interpretable model-agnostic explanation (LIME) (Ribeiro, Singh, and Guestrin 2016), a fundamental and representative local explainable method, is applied to calculate word importance. The intuition of LIME is to generate many neighborhood samples by deleting some original words in the benign example. These samples are then used to train a linear model whose number of features equals the number of words in the benign sample. The parameters of this linear model approximate the importance of each word. As LIME is model-agnostic, it is suitable for hard-label attacks.

Limitation of Existing Hard-Label Attack

To compare LimeAttack intuitively with existing hard-label attack algorithms, we visualize the attack search paths in Figure 1. LimeAttack’s search paths are represented by green lines, and they move from inside to outside. LimeAttack uses a local explainable method to learn the word importance ranking and generates adversarial examples iteratively from benign samples. This helps LimeAttack find the direction of the nearest decision boundary and spend fewer model queries by attacking key words first. In contrast, the search paths of previous hard-label attack algorithms are represented by blue lines, and they move from outside to inside. These algorithms typically begin with a randomly initialized adversarial example and optimize the perturbation by maximizing the semantic similarity between the initialized example and the benign sample, which requires many model queries to achieve a low perturbation rate. Furthermore, their attack success rate and adversary quality are limited by the adversary initialization.

Figure 1: Search paths of existing hard-label attacks and LimeAttack.

Methodology

Problem Formulation

Given a sentence of $n$ words $\bm{X}=[x_{1},x_{2},\cdots,x_{n}]$ and its ground truth label $Y$, an adversarial example $\bm{X^{\prime}}=[x_{1}^{\prime},x_{2}^{\prime},\cdots,x_{n}^{\prime}]$ is crafted by replacing one or more original words with synonyms to mislead the victim model $\mathcal{F}$, i.e.,

$$\mathcal{F}(\bm{X^{\prime}})\neq\mathcal{F}(\bm{X}),\quad\mathrm{s.t.}\quad D(\bm{X},\bm{X^{\prime}})<\epsilon \tag{1}$$

$D(\cdot,\cdot)$ is an edit distance that measures the modifications between a benign sample $\bm{X}=[x_{1},x_{2},\cdots,x_{n}]$ and an adversarial example $\bm{X^{\prime}}=[x_{1}^{\prime},x_{2}^{\prime},\cdots,x_{n}^{\prime}]$:

$$D(\bm{X},\bm{X^{\prime}})=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}(x_{i},x^{\prime}_{i}) \tag{2}$$

$\mathbb{E}(\cdot,\cdot)$ is a binary variable that equals 0 if $x_{i}=x^{\prime}_{i}$ and 1 otherwise. A high-quality adversarial example should be similar to the benign sample, and human readers should hardly be able to tell the difference. LimeAttack is a hard-label attack: it does not use the model’s parameters, gradients or confidence scores. The attacker can only query the victim model to obtain a predicted label $\hat{\bm{Y}}=\mathcal{F}(\hat{\bm{X}})$.
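As a small illustration of Eqs. (1) and (2), the following sketch computes the perturbation rate between a benign word list and a same-length adversarial one, and checks the success condition from hard labels only. The function names are hypothetical, and the default $\epsilon=0.10$ is taken from the 10% threshold stated under Implementation Details; this is a sketch under those assumptions rather than the authors' code.

```python
from typing import Hashable, Sequence


def perturbation_rate(original: Sequence[str], adversarial: Sequence[str]) -> float:
    """Eq. (2): fraction of positions whose word differs from the original."""
    if len(original) != len(adversarial):
        raise ValueError("synonym substitution keeps the sentence length fixed")
    if not original:
        raise ValueError("empty sentence")
    changed = sum(1 for x, x_adv in zip(original, adversarial) if x != x_adv)
    return changed / len(original)


def is_adversarial(label_orig: Hashable, label_adv: Hashable,
                   original: Sequence[str], adversarial: Sequence[str],
                   epsilon: float = 0.10) -> bool:
    """Eq. (1): the hard label flips and the perturbation rate stays below epsilon."""
    return label_adv != label_orig and perturbation_rate(original, adversarial) < epsilon
```

Only the two predicted labels are needed here, which matches the hard-label setting: no probabilities or gradients enter the check.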

The Proposed LimeAttack Algorithm

The overall flow chart is shown in Figure 2. LimeAttack follows two steps, i.e., word importance ranking and perturbation execution.

Word Importance Ranking.

Given a sentence of $n$ words $\bm{X}$, we assume that the contribution of all words is additive and that their sum is positively related to the model’s prediction. As shown in Figure 2, we generate neighborhood samples $\mathcal{X}=[\bm{X}^{\prime}_{1},\bm{X}^{\prime}_{2},\cdots,\bm{X}^{\prime}_{n}]$ from a benign example $\bm{X}$ by randomly replacing some words with ’[MASK]’. Sentences with more words usually require more neighborhood samples to approximate the word importance. Therefore, we keep the number of neighborhood samples equal to the number of tokens. We then feed $\mathcal{X}$ to the victim model $\mathcal{F}$ to obtain discrete prediction labels $\mathcal{\hat{Y}}=[\hat{\bm{Y}}^{\prime}_{1},\hat{\bm{Y}}^{\prime}_{2},\cdots,\hat{\bm{Y}}^{\prime}_{n}]$. Subsequently, we fit a linear interpretability model to classify these neighborhood samples:

$$g(\bm{X},\bm{\theta})=\theta_{0}+\sum_{i=1}^{n}\theta_{i}\mathbb{I}(x_{i},\bm{X}) \tag{3}$$

where $\bm{\theta}$ is the parameter vector of the linear model and $\mathbb{I}(\cdot,\cdot)$ is a binary variable that equals 1 if word $x_{i}$ is in $\bm{X}$ and 0 otherwise. Therefore, the parameter $\theta_{i},i\in[1,n]$ reflects the change in the prediction when word $x_{i}$ is removed and serves as an approximation of the word importance. In Appendix O, we verify experimentally that the linear model (such as LIME) performs as well as more advanced interpretation methods (such as SHAP) or non-linear models (such as a decision tree) under tiny query budgets. SHAP and non-linear models also have a higher computational complexity. Their advantages only show when a large number of neighborhood samples and queries is available.

In detail, we transform each neighborhood sample $\bm{X}^{\prime}_{i}$ into a binary vector $\bm{V}^{\prime}_{i}$. If the original word is removed in $\bm{X}^{\prime}_{i}$, the corresponding dimension of $\bm{V}^{\prime}_{i}$ is 0; otherwise it is 1. Therefore, $\bm{V}^{\prime}_{i}$ has the same length as $\bm{X}^{\prime}_{i}$, which is the length of the benign example. The benign example $\bm{X}$ is likewise transformed into $\bm{V}$. Because neighborhood samples are not necessarily linearly separable, LIME adopts a Gaussian kernel to weight the loss of each sample so that points closest to the original sample count most, which helps the linear fit. We assign a weight $\pi(\bm{V}^{\prime}_{i},\bm{V})$ to each neighborhood sample according to its distance from the benign sample (Ribeiro, Singh, and Guestrin 2016):

$$\pi(\bm{V}^{\prime}_{i},\bm{V})=\exp{(-d(\bm{V}^{\prime}_{i},\bm{V})^{2}/\sigma^{2})} \tag{4}$$

where $d(\cdot,\cdot)$ is a distance function. We adopt the cosine similarity as the distance metric:

$$d(\bm{V}^{\prime}_{i},\bm{V})=\frac{\bm{V}^{\prime}_{i}\cdot\bm{V}}{\sqrt{\lvert\bm{V}^{\prime}_{i}\rvert\lvert\bm{V}\rvert}} \tag{5}$$

Finally, we calculate the optimal parameters $\bm{\theta^{*}}$:

$$\bm{\theta^{*}}=\underset{\bm{\theta}}{\arg\min}\sum_{i=1}^{n}\pi(\bm{V}^{\prime}_{i},\bm{V})\{\hat{\bm{Y}}^{\prime}_{i}-g(\bm{X}^{\prime}_{i},\bm{\theta})\}^{2}+\Omega(\bm{\theta}) \tag{6}$$

where $\Omega(\bm{\theta})$ is the number of non-zero parameters, which measures the complexity of the linear model. After optimizing $\bm{\theta}$, the importance of each word $x_{i}$ is equal to $\theta_{i}$. LIME can be seen as an approximation of the model’s decision boundary around the original sample. The parameters can be interpreted as a margin: the larger the margin, the more important the word is in approximating the decision boundary. We first filter out stop words using NLTK (https://www.nltk.org/) and then calculate the importance of each word. To ensure that LimeAttack generates high-quality adversarial examples rather than just negative examples, we only adopt a synonym replacement strategy and construct the synonym candidate set $\mathcal{C}(x_{i})$ for each word $x_{i}$ by selecting the top $k$ nearest synonyms in the counter-fitted embedding space (Mrkšić et al. 2016). Additionally, we present the results of human evaluation and more qualitative adversarial examples in Appendix I.
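The word importance step can be sketched in a few lines of NumPy. The sketch below is illustrative only and makes several assumptions that the text does not specify: the function name and signature are hypothetical, the regression target is 1 when a neighborhood sample keeps the benign label and 0 otherwise, each sample masks between 1 and $n-1$ randomly chosen words, and an L2 (ridge) penalty stands in for the complexity term $\Omega(\bm{\theta})$. The kernel follows Eqs. (4) and (5) as printed, with $\lvert\cdot\rvert$ read as the number of ones in a binary vector.

```python
import numpy as np


def lime_word_importance(words, predict_label, num_samples=None,
                         sigma=25.0, ridge=1.0, mask_token="[MASK]", seed=0):
    """Estimate per-word importance from hard labels only (hypothetical helper)."""
    n = len(words)
    if n < 2:
        raise ValueError("need at least two words to build neighborhood samples")
    num_samples = n if num_samples is None else num_samples
    rng = np.random.default_rng(seed)
    orig_label = predict_label(list(words))  # one query for the benign sample

    # Binary vectors V'_i: 1 = word kept, 0 = word replaced by the mask token.
    V = np.ones((num_samples, n))
    y = np.zeros(num_samples)
    for i in range(num_samples):
        k = int(rng.integers(1, n))  # mask between 1 and n-1 words
        masked_idx = rng.choice(n, size=k, replace=False)
        V[i, masked_idx] = 0.0
        neighbor = [mask_token if V[i, j] == 0.0 else w for j, w in enumerate(words)]
        # Assumption: 0/1 target, 1 if the hard label is unchanged.
        y[i] = 1.0 if predict_label(neighbor) == orig_label else 0.0

    # Eq. (5) with V all ones: d = (V'.V) / sqrt(|V'| |V|).
    ones = V.sum(axis=1)
    d = ones / np.sqrt(ones * n)
    weights = np.exp(-(d ** 2) / sigma ** 2)  # Eq. (4)

    # Weighted ridge regression in closed form (L2 penalty stands in for Omega).
    A = np.hstack([np.ones((num_samples, 1)), V])
    AtW = A.T * weights  # same as A.T @ diag(weights)
    penalty = ridge * np.eye(n + 1)
    penalty[0, 0] = 0.0  # do not shrink the intercept theta_0
    theta = np.linalg.solve(AtW @ A + penalty, AtW @ y)
    return theta[1:]  # theta_1..theta_n, one score per word
```

With this target encoding, a large positive $\theta_i$ means that keeping word $x_i$ is strongly associated with the original label, so such words are attacked first. The sketch spends `num_samples + 1` model queries, which is why the number of neighborhood samples has to be traded off against the search budget.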

Perturbation Execution.

Adversarial example generation is a combinatorial optimization problem. A score-based attack iterates by selecting, at each step, the token that causes the greatest change in the model’s logits. No such information is available in a hard-label attack, so the iteration can only rely on the similarity between the adversarial sample and the original sample. The problem is that similarity and attack success rate are not perfectly linearly correlated. As shown in Table 7, greedily selecting the adversarial sample with the lowest similarity at each step does not guarantee the best final attack success rate. We therefore want each sampling step to be spread across the similarity range, balancing attack success rate and semantic similarity. For each original word $x_{i}$, we replace it with $c\in\mathcal{C}(x_{i})$ to generate an adversarial example $\bm{X}^{\prime}=[x_{1},\cdots,x_{i-1},c,x_{i+1},\cdots,x_{n}]$, and then calculate the semantic similarity between the benign sample $\bm{X}$ and the adversarial example $\bm{X}^{\prime}$ with the Universal Sentence Encoder (USE; https://tfhub.dev/google/universal-sentence-encoder). We first sort candidates by similarity and then sample $b$ adversarial examples to enter the next iteration. In detail, we use the following sampling rules: (1) sample the $\lfloor b/3\rfloor$ adversarial examples with the highest semantic similarity; (2) sample the $\lfloor b/3\rfloor$ adversarial examples with the lowest semantic similarity; (3) sample $\lfloor b/3\rfloor$ of the remaining adversarial examples at random. The analysis of the hyper-parameter $b$ and the full LimeAttack algorithm are given in Appendices C and H.
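The three sampling rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the similarity scores are assumed to be precomputed (for example with USE), and because the text does not say how the remainder of $b$ is handled, the sketch keeps exactly $\lfloor b/3\rfloor$ candidates per rule (9 candidates for $b=10$).

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_beam(candidates: Sequence[T], similarities: Sequence[float],
                b: int = 10, seed: int = 0) -> List[T]:
    """Keep floor(b/3) most similar, floor(b/3) least similar, floor(b/3) random."""
    if len(candidates) != len(similarities):
        raise ValueError("one similarity score per candidate is required")
    if b < 3:
        raise ValueError("b must be at least 3 so that every rule keeps a candidate")

    # Candidate indices sorted from most to least similar.
    order = sorted(range(len(candidates)), key=lambda i: similarities[i], reverse=True)
    if len(order) <= b:
        return [candidates[i] for i in order]  # nothing to prune

    k = b // 3
    top = order[:k]                      # rule (1): highest similarity
    bottom = order[len(order) - k:]      # rule (2): lowest similarity
    middle = order[k:len(order) - k]     # everything in between
    rng = random.Random(seed)
    rest = rng.sample(middle, k)         # rule (3): random from the remainder
    return [candidates[i] for i in top + bottom + rest]
```

Mixing the two extremes with random picks keeps some candidates that stay close to the benign sample and some that have moved furthest from it, which is the balance between similarity and attack success that the paragraph above argues for.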

Experiments

Analyses of the transferability and adversarial training of LimeAttack are provided in Appendices D and E.

Tasks, Datasets and Models

We adopt seven common datasets: MR (Pang and Lee 2005), SST-2 (Socher et al. 2013), AG (Zhang, Zhao, and LeCun 2015) and Yahoo (Yoo et al. 2020) for text classification, and SNLI (Bowman et al. 2015) and MNLI (Williams, Nangia, and Bowman 2018) for textual entailment, where MNLI includes a matched version (MNLIm) and a mismatched version (MNLImm). In addition, we train three neural networks as victim models: CNN (Kim 2014), LSTM (Hochreiter and Schmidhuber 1997) and BERT (Devlin et al. 2019). The model parameters and detailed dataset information are listed in Appendix A.

Baselines

We choose the following existing hard-label attack algorithms as our baselines: HLBB (Maheshwary, Maheshwary, and Pudi 2021), TextHoaxer (Ye et al. 2022b), LeapAttack (Ye et al. 2022a) and TextHacker (Yu et al. 2022). Additionally, we include some classic score-based attack algorithms, namely TextFooler (TF) (Jin et al. 2020), PWWS (Ma, Shi, and Guan 2020) and Bert-Attack (Li et al. 2020), as references; these use additional confidence scores for their attacks and are implemented on the TextAttack framework (Morris et al. 2020).

Automatic Evaluation Metrics

We use four metrics to evaluate the attack performance: attack success rate (ASR), perturbation rate (Pert), semantic similarity (Sim) and query number (Query). Specifically, given a dataset $\mathcal{D}=\{(\bm{X}_{i},\bm{Y}_{i})\}_{i=1}^{N}$ consisting of $N$ samples $\bm{X}_{i}$ and corresponding ground truth labels $\bm{Y}_{i}$ , attack success rate of an adversarial attack method $\mathcal{A}$ , which generates adversarial examples $\mathcal{A}(\bm{X})$ given an input $\bm{X}$ to attack a victim model $\mathcal{F}$ , is defined as (Wang et al. 2021):

$$ASR=\sum_{(\bm{X},\bm{Y})\in\mathcal{D}}\frac{\mathbb{I}[\mathcal{F}(\mathcal{A}(\bm{X}))\neq\bm{Y}]}{|\mathcal{D}|} \tag{7}$$

The perturbation rate is the ratio of the number of substitutions to the number of original tokens, as defined in Eq. (2). The semantic similarity is measured by the Universal Sentence Encoder (USE). Since most papers (Maheshwary, Maheshwary, and Pudi 2021; Ye et al. 2022a) adopt USE, we use it as well to maintain consistency and comparability. The query number is the number of model queries issued during the attack. The robustness of a model is inversely related to the attack success rate, while the perturbation rate and semantic similarity together reveal the quality of the adversarial examples. The query number reveals the attack efficiency.
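For concreteness, Eq. (7) can be computed as in the sketch below. The function and argument names are hypothetical; `attack` is any callable that returns the final (possibly unchanged) text for an input, and `model` returns a hard label. As in Eq. (7) as written, a sample counts as a success whenever the prediction on the attacked text differs from the ground truth label.

```python
from typing import Callable, Hashable, Iterable, Tuple


def attack_success_rate(dataset: Iterable[Tuple[str, Hashable]],
                        model: Callable[[str], Hashable],
                        attack: Callable[[str], str]) -> float:
    """Eq. (7): share of samples whose attacked version is misclassified."""
    total = 0
    fooled = 0
    for text, label in dataset:
        total += 1
        if model(attack(text)) != label:
            fooled += 1
    if total == 0:
        raise ValueError("empty dataset")
    return fooled / total
```

In the experiments an attack additionally has to respect the perturbation threshold and the query budget, so a practical `attack` callable would return the original text when no valid adversarial example is found within those limits.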

Implementation Details

We set the kernel width $\sigma=25$, the number of neighborhood samples equal to the number of tokens in the benign sample, and the beam size $b=10$. For a fair comparison, all baselines follow the same settings: synonyms are selected from the counter-fitted embedding space, the size of each candidate set is $k=50$, and the same 1000 texts are sampled for all baselines to attack. The results are averaged over five runs with different seeds (1234, 2234, 3234, 4234 and 5234) to reduce the effect of randomness. To ensure the quality of adversarial examples, an attack counts as successful only if the perturbation rate of the adversarial example is less than 10%. We set a tiny query budget of 100 for hard-label attacks, which corresponds to real-world settings (e.g., the HuggingFace free Inference API typically limits calls to 200 per minute).

Experiments Results

Attack Performance.

Tables 1 and 2 show that LimeAttack outperforms existing hard-label attacks on text classification and textual entailment tasks, achieving higher attack success rates and lower perturbation rates on datasets such as SST-2, AG, and MNLI. Unlike existing hard-label attacks that require many queries to optimize the perturbation, LimeAttack adopts a local explainable method to calculate the word importance ranking and attacks key words first. This approach can generate adversarial examples with a high attack success rate, even under tiny query budgets. Appendix G includes a t-test and the mean and variance of LimeAttack’s success rate compared to other methods. In Appendices K and L, we list the semantic similarity and the comparison between LimeAttack and several score-based attacks.

Model	Attack	MR		SST-2		AG		Yahoo
Model	Attack	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$
CNN	HLBB	44.4	5.4	33.4	5.6	17.7	3.3	41.8	3.6
	TextHoaxer	44.2	5.2	38.1	5.6	15.7	2.9	39.9	3.3
	LeapAttack	43.1	5.3	40.0	5.7	20.2	3.2	40.4	3.4
	TextHacker	49.4	6.2	38.1	6.3	20.5	6.2	38.1	5.9
	LimeAttack	49.9	5.3	42.8	5.6	20.9	2.9	43.7	3.7
LSTM	HLBB	41.2	5.2	33.1	5.7	15.2	3.1	38.4	3.3
	TextHoaxer	39.3	5.4	36.4	5.6	14.7	2.7	37.1	3.3
	LeapAttack	40.0	5.3	39.8	5.6	15.9	3.1	37.6	3.3
	TextHacker	45.8	6.1	35.2	6.4	16.5	6.2	36.8	5.9
	LimeAttack	47.6	5.4	40.1	5.5	17.3	2.7	40.3	3.7
BERT	HLBB	26.6	5.6	23.0	5.8	12.7	3.2	36.3	3.6
	TextHoaxer	27.0	5.5	24.9	5.8	9.8	3.0	32.7	3.3
	LeapAttack	26.5	5.4	26.1	5.8	13.7	2.9	34.1	3.4
	TextHacker	26.5	6.5	25.4	6.3	12.9	5.5	31.3	6.3
	LimeAttack	29.2	5.9	27.8	5.7	14.6	2.9	37.4	3.8

Table 1: The attack success rate (ASR, %, $\uparrow$) and perturbation rate (Pert., %, $\downarrow$) of different hard-label attack algorithms on three models for text classification under a query budget of 100.

Dataset	HLBB		TextHoaxer		LeapAttack		TextHacker		LimeAttack
Dataset	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$
SNLI	24.9	8.3	24.7	8.3	28.3	8.3	22.8	8.3	29.1	8.4
MNLIm	41.9	7.8	40.9	7.7	49.1	7.7	38.2	7.8	49.7	7.7
MNLImm	47.8	7.5	45.6	7.6	56.0	7.6	44.3	7.7	56.3	7.6

Table 2: The ASR (%, $\uparrow$) and Pert. (%, $\downarrow$) of LimeAttack and other baselines on BERT for textual entailment under a query budget of 100.

Query Budget.

As illustrated in Figure 3, LimeAttack maintains a stable attack success rate and a smoother attack curve under different query budgets, which means that LimeAttack delivers stable and strong attack performance regardless of whether the query budget is high or low. The trend of the perturbation rate is shown in Appendix N. Comparing attack performance under low and high query budgets provides a more comprehensive evaluation. However, attacking without considering the query budget is an idealized situation that only shows the upper limit of an attack algorithm. Because a large number of queries is expensive, we believe attack performance under a low query budget is more practical. We also list the attack success rates and perturbation rates of different attacks under a query budget of 2000 in Appendix N.

Adversary Quality.

High-quality adversarial examples should be both fluent and context-aware, while also being similar to benign samples to evade human detection. We use LanguageTool (https://www.languagetool.org/) and USE to detect grammatical errors and measure semantic similarity. As shown in Table 3, LimeAttack has the lowest perturbation rate and the lowest grammatical error (tied with LeapAttack), although its semantic similarity is lower than that of HLBB, TextHoaxer, and LeapAttack, because these methods take similarity into account directly during the attack. Considering all metrics, LimeAttack still performs best overall. To contrast the quality of adversarial examples intuitively, some qualitative examples are provided in Appendix I.

Attack	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Gram. $\downarrow$
HLBB	23.0	5.8	99.2	1.6
TextHoaxer	24.9	5.8	99.2	1.7
LeapAttack	26.1	5.8	99.1	1.5
TextHacker	25.4	6.3	96.0	1.9
LimeAttack	27.8	5.7	96.4	1.5

Table 3: ASR (%, $\uparrow$), Pert. (%, $\downarrow$), Sim. (%, $\uparrow$) and Gram. ($\downarrow$) of different hard-label attack algorithms on the SST-2 dataset for BERT under a query budget of 100.

Evaluation on Large Language Models.

Model (size)	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Acc. $\uparrow$
BART-L (407M)	42.0	5.15	93.7	87.0
DeBERTa-L (435M)	52.0	5.82	92.9	79.0
T5-L (780M)	28.0	5.59	95.1	93.0
GPT-3 (175B)	61.0	4.82	95.2	82.0
ChatGPT (175B)	25.0	5.62	95.3	92.0

Table 4: The evaluation of LimeAttack on large language models. We attack these large language models on MR dataset under query budget of 100.

Large language models (LLMs), also known as foundation models (Bommasani et al. 2021), have achieved impressive performance on various natural language processing tasks. However, their robustness to adversarial examples remains unclear (Wang et al. 2023). To evaluate the effectiveness of LimeAttack on LLMs, we select some popular models such as DeBERTa-L (Kojima et al. 2022), BART-L (Lewis et al. 2019), Flan-T5 (Raffel et al. 2020), GPT-3 (text-davinci-003) and ChatGPT (gpt-3.5-turbo) (Brown et al. 2020). Due to the limited number of API calls, we sample 100 texts from the MR dataset and attack these models on the zero-shot classification task. As Table 4 shows, LimeAttack successfully attacks most LLMs under a tight query budget. Although these models have high accuracy on zero-shot tasks, their robustness to adversarial examples still needs to be improved. ChatGPT and T5-L are more robust to adversarial examples. The robustness of the victim model appears related to its original accuracy: in general, the higher the original accuracy, the stronger the victim model’s ability to defend against adversarial examples. Further analysis of other hard-label attacks and experimental details are discussed in Appendix F.

Defense Method	HLBB		TextHoaxer		LeapAttack		TextHacker		LimeAttack
Defense Method	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$
None	24.9	8.3	24.7	8.3	28.3	8.3	22.8	8.3	29.1	8.4
A2T	20.6	9.3	21.4	9.5	23.5	9.4	19.8	9.1	24.5	9.4
ASCC	13.2	6.5	13.4	6.5	14.3	6.4	12.5	7.2	15.8	6.7

Table 5: The evaluation of hard-label attacks on defense methods based on BERT-SNLI under query budget of 100.

Attack Performance on Defense Methods.

To evaluate the effectiveness of LimeAttack against defense methods, we use A2T (Yoo and Qi 2021) and ASCC (Dong et al. 2021) to strengthen BERT on SNLI and conduct attack experiments on the defended models. As shown in Table 5, LimeAttack remains effective and outperforms the other baselines against these defense methods. More results on defense methods are listed in Appendix M.

Ablation Study

Effect of Word Importance Ranking.

To validate the effectiveness of the word importance ranking, we remove the ranking strategy and instead randomly select the words to perturb. Table 6 shows that without the word importance ranking, the attack success rate decreases by about 9 and 4 percentage points on the MR and SST-2 datasets, respectively (from 39.3% to 30.1% and from 36.5% to 32.1%). Furthermore, adversarial examples generated by random selection have higher perturbation rates and require more queries. This indicates that the word importance ranking guides LimeAttack to focus on crucial words, leading to a more efficient attack with lower perturbation rates.

Effect of Sampling Rules.

To verify the effectiveness of LimeAttack’s sampling rules, we replace this strategy with each of three common sampling rules: (1) selecting the $b$ adversarial examples with the highest semantic similarity, (2) selecting the $b$ adversarial examples with the lowest semantic similarity, or (3) randomly selecting $b$ adversarial examples. The results in Table 7 show that LimeAttack outperforms the other sampling rules with a higher attack success rate and a lower perturbation rate. Additionally, its semantic similarity and number of queries are both the second highest among the four rules.

	MR		SST-2
	Random	LIME	Random	LIME
Pert. $\downarrow$	6.1	5.6	6.4	5.9
ASR. $\uparrow$	30.1	39.3	32.1	36.5
Sim. $\uparrow$	94.6	94.8	94.2	94.6
Query. $\downarrow$	157.2	153.3	148.1	132.5

Table 6: Comparison between word importance ranking learned by LIME and random selection for BERT under a query budget of 1000.

Sample Rule	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Query. $\downarrow$
Method 1	35.8	5.76	95.02	164.65
Method 2	31.5	6.13	93.79	87.45
Method 3	32.1	6.09	94.50	107.05
LimeAttack	39.3	5.65	94.81	153.03

Table 7: Comparison between different sampling rules on the MR dataset for BERT under a query budget of 1000.

Human Evaluation

We selected 200 adversarial examples generated against BERT on MR. Each adversarial example was evaluated by two human judges for semantic similarity, fluency and prediction accuracy. The human evaluation protocol follows TextFooler (Jin et al. 2020). Specifically, we ask human judges to rate the similarity and fluency of adversarial examples and benign samples on a 5-point Likert scale (1 to 5 correspond to very not fluent/similar, not fluent/similar, uncertain, fluent/similar, and very fluent/similar, respectively). The results are listed in Table 8. The semantic similarity is 4.5, which means the adversarial samples are similar to the original samples. For prediction accuracy, judges predict the label of each sentence (e.g., positive or negative for sentiment analysis). The accuracy of 76.7% on adversarial examples means that, from a human perspective, the majority of adversarial examples keep the same label as the original samples while misleading the victim model.

	Ori		Adv
Prediction Accuracy	81.2%		76.7%
Fluency	4.4		4.1
Semantic Similarity		4.5

Table 8: The semantic similarity, fluency and prediction accuracy of original texts and adversarial examples evaluated by human judges for BERT-MR.

Conclusion

In this work, we summarize previous score-based attacks and hard-label attacks and propose a novel hard-label attack algorithm called LimeAttack. LimeAttack adopts a local explainable method to approximate the word importance ranking, and then utilizes beam search to generate high-quality adversarial examples with a tiny query budget. Experiments show that LimeAttack achieves a higher attack success rate than other hard-label attacks. In addition, we evaluate LimeAttack’s attack performance on large language models and some defense methods. The adversarial examples crafted by LimeAttack are high-quality and highly transferable, and they improve the victim model’s robustness in adversarial training. LimeAttack verifies the effectiveness of the inside-to-outside attack path in the hard-label setting, which suggests that many strong score-based attacks may offer further insight for hard-label attacks.

References

Bommasani et al. (2021) Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Bowman et al. (2015) Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 632–642.
Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. In Advances in neural information processing systems, volume 33, 1877–1901.
Carlini and Wagner (2018) Carlini, N.; and Wagner, D. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops, 1–7.
Chai et al. (2023) Chai, Y.; Liang, R.; Samtani, S.; Zhu, H.; Wang, M.; Liu, Y.; and Jiang, Y. 2023. Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression. IEEE Transactions on Knowledge and Data Engineering.
Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
Dong et al. (2021) Dong, X.; Luu, A. T.; Ji, R.; and Liu, H. 2021. Towards robustness against natural language word substitutions. International Conference on Learning Representations.
Goodman, Zhonghou et al. (2020) Goodman, D.; Zhonghou, L.; et al. 2020. FastWordBug: A fast method to generate adversarial text against NLP applications. arXiv preprint arXiv:2002.00760.
Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. In Neural Computation, volume 9, 1735–1780.
Jiang et al. (2020) Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; and Zhao, T. 2020. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2177–2190.
Jin et al. (2020) Jin, D.; Jin, Z.; Zhou, J. T.; and Szolovits, P. 2020. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8018–8025.
Kim (2014) Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1746–1751.
Kojima et al. (2022) Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
Kurakin, Goodfellow, and Bengio (2016) Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
Kurakin, Goodfellow, and Bengio (2018) Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2018. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, 99–112.
Lewis et al. (2019) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Li et al. (2020) Li, L.; Ma, R.; Guo, Q.; Xue, X.; and Qiu, X. 2020. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6193–6202.
Lundberg and Lee (2017) Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in neural information processing systems.
Ma, Shi, and Guan (2020) Ma, G.; Shi, L.; and Guan, Z. 2020. Adversarial Text Generation via Probability Determined Word Saliency. In International Conference on Machine Learning for Cyber Security, 562–571.
Maheshwary, Maheshwary, and Pudi (2021) Maheshwary, R.; Maheshwary, S.; and Pudi, V. 2021. Generating natural language attacks in a hard label black box setting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 13525–13533.
Minaee et al. (2021) Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; and Gao, J. 2021. Deep learning–based text classification: a comprehensive review. In ACM Computing Surveys, volume 54, 1–40.
Morris et al. (2020) Morris, J.; Lifland, E.; Yoo, J. Y.; Grigsby, J.; Jin, D.; and Qi, Y. 2020. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 119–126.
Mrkšić et al. (2016) Mrkšić, N.; Ó Séaghdha, D.; Thomson, B.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.-H.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Counter-fitting Word Vectors to Linguistic Constraints. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 142–148.
Pang and Lee (2005) Pang, B.; and Lee, L. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 115–124.
Papernot et al. (2017) Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 506–519.
Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. In The Journal of Machine Learning Research, volume 21, 5485–5551.
Ribeiro, Singh, and Guestrin (2016) Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144.
Shrikumar et al. (2016) Shrikumar, A.; Greenside, P.; Shcherbina, A.; and Kundaje, A. 2016. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642.
Štrumbelj and Kononenko (2014) Štrumbelj, E.; and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. In Knowledge and information systems, volume 41, 647–665.
Szegedy et al. (2013) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Wang et al. (2021) Wang, B.; Xu, C.; Wang, S.; Gan, Z.; Cheng, Y.; Gao, J.; Awadallah, A. H.; and Li, B. 2021. Adversarial glue: A multi-task benchmark for robustness evaluation of language models. arXiv preprint arXiv:2111.02840.
Wang et al. (2023) Wang, J.; Hu, X.; Hou, W.; Chen, H.; Zheng, R.; Wang, Y.; Yang, L.; Huang, H.; Ye, W.; Geng, X.; et al. 2023. On the robustness of chatgpt: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095.
Williams, Nangia, and Bowman (2018) Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1112–1122.
Yang et al. (2021) Yang, X.; Liu, W.; Tao, D.; and Liu, W. 2021. BESA: BERT-based Simulated Annealing for Adversarial Text Attacks. In International Joint Conference on Artificial Intelligence, 3293–3299.
Ye et al. (2022a) Ye, M.; Chen, J.; Miao, C.; Wang, T.; and Ma, F. 2022a. LeapAttack: Hard-Label Adversarial Attack on Text via Gradient-Based Optimization. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2307–2315.
Ye et al. (2022b) Ye, M.; Miao, C.; Wang, T.; and Ma, F. 2022b. TextHoaxer: Budgeted Hard-Label Adversarial Attacks on Text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 3877–3884.
Yoo et al. (2020) Yoo, J. Y.; Morris, J.; Lifland, E.; and Qi, Y. 2020. Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 323–332.
Yoo and Qi (2021) Yoo, J. Y.; and Qi, Y. 2021. Towards Improving Adversarial Training of NLP Models. Findings of the Association for Computational Linguistics: EMNLP.
Yu et al. (2022) Yu, Z.; Wang, X.; Che, W.; and He, K. 2022. Learning-based Hybrid Local Search for the Hard-label Textual Attack. arXiv preprint arXiv:2201.08193.
Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems.
Zhu, Zhao, and Wu (2023) Zhu, H.; Zhao, Q.; and Wu, Y. 2023. BeamAttack: Generating High-quality Textual Adversarial Examples through Beam Search and Mixed Semantic Spaces. arXiv preprint arXiv:2303.07199.

Appendix A: Victim Model and Datasets

We carry out all experiments on an NVIDIA Tesla V100 16G GPU. We adopt three neural networks, CNN, LSTM and BERT, from TextFooler. The CNN uses three window sizes of 3, 4, and 5, with 100 filters for each window size. The LSTM consists of a bidirectional LSTM layer with 150 hidden states. Both CNN and LSTM have a dropout rate of 0.3 and use 200-dimensional GloVe word embeddings pre-trained on 6B tokens. BERT ${}_{base}$ consists of 12 layers with 768 units and 12 heads. The original accuracy of the victim models is listed in Table 9, and details of the datasets are listed in Table 10. We select datasets with different text lengths and numbers of classes.

Table 9: The original accuracy of victim model on various data sets.

Dataset	CNN	LSTM	BERT
MR	78.0	80.7	86.0
SST-2	82.7	84.5	92.4
AG	91.5	91.3	94.2
Yahoo	73.7	73.7	79.1
SNLI	-	-	89.1
MNLIm	-	-	85.1
MNLImm	-	-	82.1

Task	Dataset	Train	Test	Classes	Length
Classification	MR	9K	1K	2	18
	SST-2	70K	2K	2	8
	AG	120K	8K	4	43
	Yahoo	12K	4K	10	151
Entailment	SNLI	570K	3K	3	20
Entailment	MNLI(m/mm)	433K	10K	3	11

Table 10: Overview of datasets and NLP tasks.

Appendix B: The Effectiveness of LIME in Score-based Attacks

Traditional score-based attacks utilize deletion-based methods to calculate word importance ranking. They drop a word $x_{i}$ from the benign sample $X$ and query the victim model $\mathcal{F}$ with the new sample $X/x_{i}=[x_{1},x_{2},\cdots,x_{i-1},x_{i+1},\cdots,x_{n}]$ . The difference in the model’s confidence score before and after deletion reflects the importance of this word:

$$I(x_{i})=\mathcal{F}(X)-\mathcal{F}(X/x_{i}) \tag{8}$$
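The deletion-based score in Eq. (8) can be illustrated with a short sketch. It assumes a hypothetical callable `score` that returns the victim model’s confidence for the true label given a list of words; that callable, the function name, and the word-list representation are assumptions for this example rather than part of any specific library.

```python
from typing import Callable, List, Sequence


def deletion_importance(
    words: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[float]:
    """Compute I(x_i) = F(X) - F(X / x_i) for every position i."""
    base = score(list(words))  # confidence on the unmodified text, F(X)
    importances = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])  # X with x_i removed
        importances.append(base - score(reduced))
    return importances
```

This costs $n+1$ queries for a text of $n$ words and depends on a continuous confidence score. With only a hard label, the difference is zero unless the deletion flips the prediction, which is why this ranking does not carry over to the hard-label setting.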

To verify the effectiveness of the local explainable method, we replace the deletion-based method with the local explainable method in a score-based attack. We test on the MR dataset, and the results are shown in Table 11. The two methods achieve similar attack success rates, but the deletion-based method achieves a lower perturbation rate. Because the probability distribution of the model’s output is available in this setting, the deletion-based method reflects the influence of each word on the output well. We therefore expect local explainable methods to offer a greater advantage in hard-label attacks, where the deletion-based method cannot be used.

Table 11: The comparison with deletion-based method. ASR. (%) $\uparrow$ is attack success rate and Pert. (%) $\downarrow$ is perturbation rate.

Dataset	Victim Models	Deletion-based		LIME
Dataset	Victim Models	ASR. $\uparrow$	Pert. $\downarrow$	ASR. $\uparrow$	Pert. $\downarrow$
MR	CNN	1.0	11.9	1.0	12.4
	LSTM	0.6	12.3	0.6	12.8
	BERT	8.2	16.3	8.1	17.4

Appendix C: The Effectiveness of Beam Size $b$

Beam size $b$ directly determines the size of the search space. A bigger search space helps find a better solution (e.g., lower perturbation rate and higher semantic similarity), but it also requires many more model queries. The question is therefore how to select a beam size that balances the query cost and the attack success rate. As shown in Figure 4, we test on the MR and SST-2 datasets using BERT with different beam sizes. As $b$ increases, the search space expands, and both the attack success rate and the quality of adversarial examples improve (the perturbation rate is reduced). As $b$ increases further, the number of queries also grows, and under a limited query budget the attack success rate decreases. Considering the overall effect, we set the beam size to $b=10$.

Appendix D: Transferability

Transferability is the property that adversarial examples crafted against one victim model can also fool another. Specifically, we calculate the prediction accuracy of the CNN and LSTM models on adversarial examples crafted to attack BERT on the MR dataset. As shown in Figure 5, adversarial examples generated by LimeAttack achieve higher transferability than the baselines. They reduce the prediction accuracy of the CNN and LSTM models from 80.7% and 78.0% to 58.5% and 58.4%, respectively.
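The transferability measurement described above amounts to re-scoring a fixed set of adversarial texts with models that were not attacked. The sketch below assumes each model is exposed as a hypothetical callable mapping a text to a predicted label; the function name and data layout are illustrative only.

```python
from typing import Callable, Dict, List, Tuple


def transfer_accuracy(
    adv_examples: List[Tuple[str, int]],
    models: Dict[str, Callable[[str], int]],
) -> Dict[str, float]:
    """Accuracy (%) of each model on adversarial texts paired with their true labels."""
    if not adv_examples:
        raise ValueError("adv_examples must not be empty")
    total = len(adv_examples)
    return {
        name: 100.0 * sum(predict(text) == label for text, label in adv_examples) / total
        for name, predict in models.items()
    }
```

A lower accuracy on the transferred examples indicates higher transferability of the attack.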

Appendix E: Adversarial Training

Adversarial training is a prevalent technique to improve the victim model’s robustness by adding adversarial examples to the training data. We randomly selected 1000 adversarial examples from the MR dataset, retrained the CNN model, and then attacked the CNN model again. The results are shown in Table 12. After adversarial training, the CNN model achieves higher test accuracy. In addition, LimeAttack’s attack success rate decreases by about 3 percentage points, and the attack requires more queries and a higher perturbation rate. Adversarial examples generated by LimeAttack thus improve the victim model’s robustness and generalization.

Table 12: The performance of CNN model with(out) adversarial training on the MR dataset.

	Ori Acc. $\uparrow$	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Query. $\downarrow$
Original	80.27	38.18	3.90	97.00	22.21
+Adv.Training	81.53	35.09	3.94	97.01	24.90

Appendix F: Large Language Models

Settings

In this section, we provide a brief introduction to the large language models used in our experiments.

• BART-L: BART is a transformer-based model that can handle both generation and understanding tasks. It is trained on a combination of auto-regressive and denoising objectives, which is primarily focused on understanding tasks.
• DeBERTa-L: DeBERTa enhances BERT with a disentangled attention mechanism and an improved decoding scheme. This allows it to capture contextual information between different tokens more effectively and generate higher quality natural language sentences.
• Flan-T5: Flan-T5 uses a text-to-text approach where both input and output are natural language sentences, enabling it to perform a variety of tasks including text generation, summarization, and classification. By taking an input sentence as a prompt, Flan-T5 can accomplish common NLP tasks.
• Text-davinci-003 and ChatGPT: These models are based on GPT-3 and GPT-3.5. They can perform tasks specified by natural language inputs and produce higher quality and more faithful output.

To keep the output of the large language models stable, we use the same prompt for every model in the zero-shot text classification task: Please classify the following sentence into either positive or negative. Answer me with "positive" or "negative", just one word.
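A minimal sketch of how such a prompt yields a hard label is given below. The prompt text is the one quoted above; everything else is an assumption for illustration: `generate` stands for any hypothetical text-in, text-out model interface, and appending the sentence on a new line after the prompt is our guess at the input layout.

```python
from typing import Callable, Optional

PROMPT = (
    "Please classify the following sentence into either positive or negative. "
    'Answer me with "positive" or "negative", just one word.'
)


def hard_label(sentence: str, generate: Callable[[str], str]) -> Optional[int]:
    """Return 1 for positive, 0 for negative, or None if the reply cannot be parsed."""
    # Assumed layout: instruction first, then the sentence on its own line.
    reply = generate(f"{PROMPT}\n{sentence}").strip().strip('"').lower()
    if reply.startswith("positive"):
        return 1
    if reply.startswith("negative"):
        return 0
    return None  # unparseable replies are left to the caller to handle
```

Only this discrete label is visible to the attacker, which matches the hard-label setting.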

Discussion

Generalization Error.

In this subsection, we analyze the models’ generalization error, also known as the out-of-sample error, which measures how accurately an algorithm predicts outcomes for previously unseen data. Let $\mathcal{F}$ be a finite hypothesis set and $m$ the number of training samples. For each $f\in\mathcal{F}$, probably approximately correct (PAC) theory gives:

$$P\left(|\mathbb{E}(f)-\hat{\mathbb{E}}(f)|\leq\sqrt{\frac{\ln|\mathcal{F}|+\varsigma}{2m}}\right)\geq 1-\delta \tag{9}$$

where $\mathbb{E}(f)$ and $\hat{\mathbb{E}}(f)$ are the ideal and empirical risk of classifier $f$. According to Table 6 in the main text, the robustness of the victim model is related to its original accuracy: the higher the original accuracy, the stronger the victim model’s ability to defend against adversarial examples. Generalization error relies on two factors: the training sample size ($m$) and the hypothesis space ($\mathcal{F}$). Large language models, like ChatGPT, excel in performance due to their extensive training data (large $m$). Moreover, although the hypothesis set ($\mathcal{F}$) is finite, increasing $m$ and $|\mathcal{F}|$ can lead to reduced generalization errors. This observation helps explain why such models excel in zero-shot classification for certain tasks.
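To make the dependence on $m$ and $|\mathcal{F}|$ concrete, the bound in Eq. (9) can be evaluated numerically. The sketch below assumes $\varsigma=\ln(2/\delta)$, which is the value given by the standard Hoeffding-plus-union-bound argument for a finite hypothesis set; the paper does not define $\varsigma$, so this choice and the function name are assumptions.

```python
import math


def pac_gap_bound(hypothesis_count: int, m: int, delta: float) -> float:
    """Upper bound on |E(f) - E_hat(f)| holding with probability at least 1 - delta.

    Assumes varsigma = ln(2 / delta), as in the standard finite-class bound.
    """
    if hypothesis_count < 1 or m < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need hypothesis_count >= 1, m >= 1 and 0 < delta < 1")
    return math.sqrt((math.log(hypothesis_count) + math.log(2.0 / delta)) / (2.0 * m))
```

Under this assumption the bound on the gap shrinks at rate $1/\sqrt{m}$ as the training set grows and increases only logarithmically with $|\mathcal{F}|$.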

Attack ChatGPT.

To validate the effectiveness of hard-label attack algorithms in the real world, we evaluate the attack performance of LimeAttack, HLBB, LeapAttack, TextHoaxer and TextHacker on ChatGPT. Due to OpenAI’s limit on the number of API calls, we select 20 adversarial examples generated by the different hard-label attack algorithms against BERT on the MR dataset, and input them into ChatGPT to observe whether they produce opposite results compared to the original samples. As shown in Table 13, LimeAttack achieves a higher attack success rate and generates higher quality adversarial examples than the other methods when facing real-world APIs under a tight query budget.

Table 13: Attack success rate (ASR., %), perturbation rate (Pert., %), semantic similarity (Sim., %) of various hard-label attacks on ChatGPT under the query budget of 100.

Attack	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$
HLBB	10.0	3.70	96.80
LeapAttack	20.0	8.57	88.85
TextHoaxer	10.0	4.61	89.71
TextHacker	20.0	7.61	90.21
LimeAttack	20.0	4.51	95.30

Appendix G: Significance Test

We perform a t-test and list the mean, variance, and p-value of LimeAttack’s success rate against the other methods in Table 14. LimeAttack is run with five additional seeds and the results are averaged, consistent with the other baselines. As shown in Table 14, LimeAttack achieves better results than the other baselines under a tight query budget.
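As a rough illustration of the statistics involved, the sketch below computes a two-sample t statistic from per-seed success rates. The paper does not state which t-test variant was used, so Welch’s unequal-variance form, the function name, and the input layout are assumptions. Converting the statistic to a p-value requires the Student t distribution, which the Python standard library does not provide, so that step is omitted.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom.

    Each input needs at least two runs, and the two samples must not both
    have zero variance (the statistic is undefined in that case).
    """
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

With five runs per method, as in Table 14, each sample contributes four degrees of freedom to the Welch-Satterthwaite estimate.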

Table 14: The mean, variance, and p-value of LimeAttack against other methods on the success rate in 5 runs.

Model_dataset	LimeAttack		HLBB	TextHoaxer	LeapAttack	TextHacker
Model_dataset	Mean	Variance	p-value	p-value	p-value	p-value
CNN_MR	49.9	9.00E-02	2.74E-05	2.38E-05	1.19E-05	8.20E-02
LSTM_MR	47.6	2.50E-01	8.98E-05	3.45E-05	4.78E-05	5.69E-03
BERT_MR	29.2	1.42E-01	2.85E-02	6.70E-02	2.37E-02	2.37E-02
CNN_SST	42.8	2.91E-01	1.58E-05	1.24E-04	4.42E-04	1.24E-04
LSTM_SST	40.1	8.02E-01	1.53E-03	5.66E-03	6.41E-02	3.30E-03
BERT_SST	27.8	4.22E-02	1.73E-05	3.39E-05	5.71E-05	4.17E-05
CNN_AG	20.9	2.28E-01	1.38E-02	2.27E-03	6.30E-01	3.09E-01
LSTM_AG	17.3	5.18E-02	1.50E-03	5.07E-04	1.38E-02	3.17E-01
BERT_AG	14.6	1.02E-02	4.41E-03	5.16E-04	3.77E-02	5.86E-03
CNN_Yahoo	43.7	1.56E-01	6.30E-02	1.26E-03	2.68E-03	1.82E-04
LSTM_Yahoo	40.3	4.22E-02	1.39E-03	3.45E-04	5.49E-04	2.69E-04
BERT_Yahoo	37.4	2.25E-02	4.22E-01	1.59E-03	4.04E-03	8.47E-04

Appendix H: LimeAttack Algorithm

The full procedure of LimeAttack is summarized in Algorithm 1.

Algorithm 1 The LimeAttack algorithm

Input: Original text $X$, target model $\mathcal{F}$
Output: Adversarial example $X_{\text{adv}}$

1: $X_{\text{adv}}\leftarrow X$
2: $set(X_{\text{adv}})\leftarrow X_{\text{adv}}$
3: Compute the importance score $I(x_{i})$ by LIME
4: Sort the words with importance score $I(x_{i})$
5: for $i=1$ to $n$ do
6:     Generate the candidate set $\mathcal{C}(x_{i})$
7: end for
8: for $X_{\text{adv}}$ in $set(X_{\text{adv}})$ do
9:     $i\leftarrow$ index of the original word
10:     for $c_{k}$ in $\mathcal{C}(x_{i})$ do
11:         $X^{\prime}_{\text{adv}}\leftarrow$ Replace $x_{i}$ with $c_{k}$ in $X_{\text{adv}}$
12:         Add $X^{\prime}_{\text{adv}}$ to the $set(X_{\text{adv}})$
13:     end for
14:     for $X^{\prime}_{\text{adv}}$ in $set(X_{\text{adv}})$ do
15:         if $\mathcal{F}(X^{\prime}_{\text{adv}})\neq y_{true}$ then
16:             return $X^{\prime}_{\text{adv}}$ with highest semantic similarity
17:         end if
18:     end for
19:     $set(X_{\text{adv}})\leftarrow$ Sample $b$ adversarial examples in $set(X_{\text{adv}})$ by rules
20: end for
21: return adversarial examples $X_{\text{adv}}$
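The search loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode under stated assumptions, not the authors’ implementation: the importance ranking (`ranked_positions`) and candidate sets (`candidates`) are taken as precomputed inputs, `predict` and `similarity` are hypothetical callables, and step 19 is replaced by a placeholder rule that keeps the `beam_size` texts most similar to the original.

```python
from typing import Callable, Dict, List, Optional, Sequence


def beam_search_attack(
    words: Sequence[str],
    true_label: int,
    predict: Callable[[List[str]], int],          # hard-label victim model (assumed interface)
    ranked_positions: Sequence[int],              # word indices, most important first
    candidates: Dict[int, List[str]],             # candidate substitutes per position
    similarity: Callable[[Sequence[str], Sequence[str]], float],
    beam_size: int = 10,
) -> Optional[List[str]]:
    """Return an adversarial word list, or None if no substitution flips the label."""
    beam: List[List[str]] = [list(words)]                      # steps 1-2
    for pos in ranked_positions:                               # one word position per round
        expanded: List[List[str]] = []
        for text in beam:                                      # step 8
            for cand in candidates.get(pos, []):               # steps 10-13
                new_text = list(text)
                new_text[pos] = cand
                expanded.append(new_text)
        # Steps 14-18: query only the newly created texts.
        successes = [t for t in expanded if predict(t) != true_label]
        if successes:
            return max(successes, key=lambda t: similarity(words, t))
        # Step 19 (placeholder rule): keep the texts closest to the original.
        pool = beam + expanded
        pool.sort(key=lambda t: similarity(words, t), reverse=True)
        beam = pool[:beam_size]
    return None
```

Each round issues at most `len(beam) * len(candidates[pos])` queries, so the beam size directly controls the query cost per perturbed position.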

Appendix I: Qualitative Examples

More qualitative adversarial examples are listed in Tables 20-28.

Appendix J: Limitation

• Exploring more LLMs. Due to limited resources, this paper only tests some popular large language models. However, there are victim models based on other LLMs, e.g., LLaMA. Hence, victim models based on more LLMs remain to be studied.
• More NLP tasks. In this paper, we only attack some classification tasks (e.g., text classification, textual entailment and zero-shot classification). It would be interesting to attack other NLP applications, such as dialogue, text summarization, and machine translation.

Appendix K: Semantic Similarity of Different Attack Algorithms

We report semantic similarity in Table 15. Some baselines take similarity into account during the attack, so LimeAttack exhibits lower similarity than those methods. Considering all metrics, LimeAttack still dominates.

Table 15: The semantic similarity of different attack algorithms.

		HLBB	TextHoaxer	LeapAttack	TextHacker	LimeAttack
MR	CNN	97.20	97.11	97.17	94.56	95.21
	LSTM	97.27	97.27	97.22	95.01	95.31
	BERT	97.13	97.16	97.09	94.16	94.77
SST	CNN	97.18	97.22	97.14	94.02	94.41
	LSTM	97.22	97.21	97.18	94.58	94.69
	BERT	97.22	97.07	97.13	93.77	94.56
AG	CNN	97.64	97.62	97.62	95.71	96.27
	LSTM	97.64	97.58	97.62	95.46	96.11
	BERT	97.57	97.61	97.56	95.14	96.53
Yahoo	CNN	97.75	97.72	97.71	95.33	96.21
	LSTM	97.71	97.66	97.67	95.41	96.41
	BERT	97.73	97.68	97.63	95.12	96.55

Appendix L: Comparison with Score-based Attacks

Since LimeAttack follows the same two-stage strategy as score-based attacks, we also include some classic score-based attacks for reference. LimeAttack and these score-based attacks use exactly the same settings. However, score-based attacks can obtain the probability distribution of the output, while LimeAttack cannot. Therefore, we do not limit the query budget for either LimeAttack or the score-based attacks. As shown in Table 16, LimeAttack still achieves a higher attack success rate and semantic similarity in most cases. LimeAttack’s superiority can be attributed to its focus on crucial words through the learned word importance ranking and to the search space expanded by beam search. However, LimeAttack requires more queries to compute word importance rankings because it lacks the output probability distribution. This cost is more pronounced for long texts.

Table 16: Comparison with other score-based attacks. ASR. (%) $\uparrow$ is attack success rate, Pert. (%) $\downarrow$ is perturbation rate, Sim. (%) $\uparrow$ is semantic similarity and Query. $\downarrow$ is the number of model queries.

Dataset	Model	Attack	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Query. $\downarrow$	Dataset	Model	Attack	ASR. $\uparrow$	Pert. $\downarrow$	Sim. $\uparrow$	Query. $\downarrow$
MR	CNN	TF	60.9	5.88	94.21	51.84	AG	CNN	TF	32.1	5.96	94.65	43.67
		PWWS	62.4	5.88	92.34	144.37			PWWS	32.1	5.94	94.85	47.68
		Bert-Attack	46.3	5.75	94.51	28.25			LimeAttack	38.1	4.55	96.53	879.23
		LimeAttack	62.5	5.60	95.33	268.94		LSTM	TF	30.5	5.51	95.40	46.93
	LSTM	TF	65.8	5.63	94.56	49.96			PWWS	32.1	5.94	94.85	47.68
		Bert-Attack	50.2	5.77	94.4	28.53			LimeAttack	35.4	4.55	96.13	975.35
		LimeAttack	61.2	5.51	95.44	253.07	SST-2	CNN	TF	51.0	5.96	93.83	51.67
	BERT	TF	46.5	5.68	94.43	51.48		CNN	LimeAttack	51.0	5.99	94.90	150.08
		Bert-Attack	35.0	5.82	94.64	28.59		LSTM	TF	52.1	5.93	93.54	50.7
		LimeAttack	47.6	5.59	94.99	821.28		LSTM	LimeAttack	50.5	6.13	94.70	320.45

Appendix M: Evaluation on Defense Methods

We use A2T (whose core is a new and cheaper word substitution attack optimized for adversarial training) and ASCC to strengthen BERT on the MR and SST datasets and then attack the defended models. As shown in Table 17, even after adversarial training and enhancement, LimeAttack remains effective against these defense methods. Compared with A2T, ASCC provides a stronger defense and improves model robustness to a greater degree.

Table 17: The attack performance of different attack algorithms on A2T and ASCC defense methods and original target models in BERT-MR and BERT-SST.

	origin BERT-MR		A2T		ASCC		origin BERT-SST		A2T		ASCC
	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT
HLBB	26.6	5.6	23.5	5.6	20.1	5.6	23.0	5.8	21.3	6.0	19.3	6.1
TextHoaxer	27.0	5.5	24.3	5.6	21.2	5.7	24.9	5.8	21.8	5.9	20.1	5.9
LeapAttack	26.5	5.4	24.0	5.6	22.3	5.6	26.1	5.8	21.7	5.9	19.6	6.1
TextHacker	26.5	6.5	24.1	6.6	22.5	6.6	25.4	6.3	22.1	6.3	19.1	6.6
LimeAttack	29.2	5.9	25.7	5.8	23.4	5.8	27.8	5.7	22.7	5.9	20.3	6.1

Appendix N: Convergence of Attack Performance

Convergence of Attack Success Rate

We conduct further evaluations under a query budget of 1000 to examine how attack performance converges. As shown in Table 18, LimeAttack achieves a better attack success rate than the other attacks in most cases. The attack success rate without a tight query budget is closer to an ideal situation and shows the upper limit of an attack algorithm. A high query budget is equivalent to traversing the solution space, so the results approximate the limits of ASR and perturbation rate for the victim model. However, ASR and perturbation rate interact with each other, so their best values are not reached in the same direction. Therefore, for some victim models (LSTM-AG and BERT-Yahoo), LimeAttack’s perturbation rate is the lowest, while its ASR is very close to but not the best.

Table 18: Different attack algorithms on different models and datasets under a query budget of 1000.

	CNN_MR		CNN_SST		LSTM_MR		LSTM_SST		LSTM_AG		BERT_SST		BERT_Yahoo
	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT
HLBB	55.6	5.6	43.4	6.4	54.5	5.6	43.3	6.4	30.4	5.5	30.3	6.7	62.2	6.7
TextHoaxer	55.6	5.4	43.9	6.4	52.9	5.4	45.5	6.3	31.1	5.8	35.9	6.6	63.2	6.6
LeapAttack	56.4	5.5	44.3	6.5	54.6	5.5	44.3	6.2	31.3	5.3	37.5	6.2	63.1	6.4
TextHacker	59.2	5.6	38.0	6.7	56.0	5.6	44.0	6.5	32.0	5.8	38.0	6.0	67.2	6.4
LimeAttack	59.4	5.7	48.6	6.0	59.3	5.5	45.5	5.9	31.2	5.3	42.5	6.1	66.0	6.2

Convergence of Perturbation Rate

We show the convergence behavior of different attacks in Figure 6. Previous algorithms rely on complex optimization procedures that require a large number of queries to complete. Given such a budget, they often achieve good perturbation rates.

Appendix O: Comparison with SHAP and Non-linear Models

In a hard-label setting, the model’s logits are unavailable and the query budget is tiny. We list the attack success rates of different word importance ranking methods under different query budgets. As shown in Table 19, compared to LIME, SHAP and non-linear models offer no significant advantage in attack success rate or perturbation rate under tiny query budgets. Considering the time complexity, we adopt LIME to calculate the word importance ranking in the main text.

Table 19: Evaluation of different word importance ranking calculation on CNN-MR and BERT-SST under different query budgets.

	query budgets 100				query budgets 2000
	CNN-MR		BERT-SST		CNN-MR		BERT-SST
	ASR	PERT	ASR	PERT	ASR	PERT	ASR	PERT
LIME	49.9	5.3	27.8	5.7	59.4	5.7	42.5	6.1
SHAP	49.7	5.2	27.7	5.7	61.2	5.8	44.3	6.3
Decision Tree	50.1	5.3	27.9	5.8	61.6	5.8	44.1	6.4

Table 20: The adversarial examples crafted by different attack algorithms on CNN using the SST-2 dataset. Replacement words are represented in red. Query. $\downarrow$ is the number of model queries.

Attack	Texts	Query.
No Attack	It allows us hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker.	0
HLBB	It allows us hope that nolan is poised to incur a major career as a commercial yet ingenuity filmmaker.	2062
TextHoaxer	It allows us hope that nolan is poised to start a major career as a commercial yet contrivance filmmaker.	48
LeapAttack	It allows us hope that nolan is poised to embark a major career as a commercial yet contrivance filmmaker.	30
TextHacker	It allows us hope that nolan is readies to embark a major career as a commercial yet creative filmmaker.	101
LimeAttack	It allows us hope that nolan is poised to embark a major career as a commercial yet contrivance filmmaker.	43
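
The perturbation rate behind examples such as those in Table 20 can be sketched as follows, assuming the usual definition (the percentage of words in the original text that were substituted) and simple whitespace tokenization. The helper name `perturbation_rate` is hypothetical, and the comparison is case-insensitive because some baselines lowercase their output.

```python
def perturbation_rate(original: str, adversarial: str) -> float:
    """Percentage of word positions that differ between two equal-length texts."""
    orig_tokens = original.lower().split()
    adv_tokens = adversarial.lower().split()
    if len(orig_tokens) != len(adv_tokens):
        # Word substitution keeps the length fixed; anything else is unsupported here.
        raise ValueError("texts must contain the same number of words")
    changed = sum(o != a for o, a in zip(orig_tokens, adv_tokens))
    return 100.0 * changed / len(orig_tokens)


# Sentences copied from Table 20 (CNN, SST-2).
original = ("It allows us hope that nolan is poised to embark a major career "
            "as a commercial yet inventive filmmaker.")
lime_adv = ("It allows us hope that nolan is poised to embark a major career "
            "as a commercial yet contrivance filmmaker.")
hlbb_adv = ("It allows us hope that nolan is poised to incur a major career "
            "as a commercial yet ingenuity filmmaker.")

lime_rate = perturbation_rate(original, lime_adv)  # 1 of 19 words replaced
hlbb_rate = perturbation_rate(original, hlbb_adv)  # 2 of 19 words replaced
```

On this single sentence the LimeAttack example changes one word and the HLBB example changes two; the rates reported in the tables are averages over many such sentences.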

Table 21: Adversarial examples crafted by different attack algorithms on BERT using the SST-2 dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	The acting,costumes,music,cinematogrtaphy and sound are all astounding given the production’s austere locales.	0
HLBB	The acting,costumes,music,cinematogrtaphy and sound are all stupendous given the production’s austere locales.	35
TextHoaxer	The acting,costumes,music,cinematogrtaphy and sound are all staggering given the production’s austere locales.	45
LeapAttack	the acting,costumes,music,cinematogrtaphy and sound are all astounding dispensed the production’s austere locales.	35
TextHacker	the provisonal,costumes,music,cinematogrtaphy and sound sunt all startling given the production’s stoic locales.	101
LimeAttack	the acting,costumes,music,cinematogrtaphy and sound are all staggering given the production’s austere locales.	25

Table 22: Adversarial examples crafted by different attack algorithms on LSTM using the Yahoo dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	In basketball whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court.	0
HLBB	In basket whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court.	6
TextHoaxer	In wildcats whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court	6
LeapAttack	In wildcats whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court.	6
TextHacker	In basketball whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until havent completed the exercise on both sides of the court.	101
LimeAttack	In basketballs whats a suicide? is it like running back and forth? its an exercise where you run the entire court touching down in intnervals until youve completed the exercise on both sides of the court.	39

Table 23: Adversarial examples crafted by different attack algorithms on CNN using the Yahoo dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india.	0
HLBB	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent groundbreaking of indian nationalistic freedom hunter and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india.	116
TextHoaxer	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british parliament 1862 congress president thrice the immense old man of indian.	440
LeapAttack	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism liberty hunters and educationist the first indian to become member of british parliament 1862 congress president thrice the grand old man of india.	1411
TextHacker	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneers of indian nationalism freedom fighter and educationist the first indian to become member of british chambre 1862 congress president thrice the grand old man of india.	101
LimeAttack	Who was the first indian who became the member of english parliament? dadabhai naoroji preeminent pioneer of indian nationalism freedom fighter and educationist the first indian to become member of british legislature 1862 congress president thrice the grand old man of india.	45

Table 24: Adversarial examples crafted by different attack algorithms on CNN using the MR dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	Those outside show business will enjoy a close look at people they do n’t really want to know.	0
HLBB	Those outside show business will enjoy a nearby look at people they do n’t really want to know.	2241
TextHoaxer	Those outside show business will recieve a close look at people they do n’t really want to know.	202
LeapAttack	Those outside show business will like a close glanced at people they do n’t really want to know.	1431
TextHacker	Those outside show companies will experience a close glance at volk they do n’t really want to know.	103
LimeAttack	Those outside show business will recieve a close glanced at people they do n’t really want to know	53

Table 25: Adversarial examples crafted by different attack algorithms on LSTM using the MR dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	I’m convinced i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking.	0
HLBB	I’m convinced i could keep a family of five blind , invalids , amish people alive in this situation better than these british soldiers do at keeping themselves kicking.	2110
TextHoaxer	I’m gratified i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking.	219
LeapAttack	I’m contented i could keep a family of five blind , paralytic, amish people alive in this plight better than these british soldiers do at keeping themselves kicking.	2162
TextHacker	I’m convinced i could keep a family of five blind , handicapped , amish people lively in this situation better than these british soldiers do at keeping themselves kicking.	101
LimeAttack	I’m gratified i could keep a family of five blind , crippled , amish people alive in this situation better than these british soldiers do at keeping themselves kicking.	50

Table 26: Adversarial examples crafted by different attack algorithms on LSTM using the AG dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure.	0
HLBB	spaniards to executes luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aeroplanes group tbi by a barcelona based abertis infrastructure.	969
TextHoaxer	Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the manaus of a spanish toll motorways exploiter through a 551 m coup of the aviation group tbi by a barcelona based abertis infrastructure.	727
LeapAttack	Spaniards to run luton airport after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aeroplanes group tbi by a barcelona based abertis infrastructure.	2148
TextHacker	Spaniards to implementing luton airport after 551 m deal luton , cardiff and belfast international airports represent to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure.	101
LimeAttack	Spaniards to run luton luton after 551 m deal luton , cardiff and belfast international airports are to fall into the hands of a spanish toll motorways operator through a 551 m takeover of the aviation group tbi by a barcelona based abertis infrastructure.	92

Table 27: Adversarial examples crafted by different attack algorithms on CNN using the AG dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz.	0
HLBB	Eisner says ovitz required oversight daily michael d eisner appeared for a second weekly of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz.	31
TextHoaxer	Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the interesting severance package granted to michael s ovitz.	48
LeapAttack	Eisner says ovitz needing oversight daily michael d eisner appeared for a second day of testimony in the shareholder lawsuit over the lucrative severance package granted to michael s ovitz.	14
TextHacker	Eisner says ovitz required surveillance everyday michael d eisner appeared for a second day of testimonies in the shareholder lawsuit over the rewarding severance package granted to michael s ovitz.	101
LimeAttack	Eisner says ovitz required oversight daily michael d eisner appeared for a second day of testimony in the proprietors lawsuit over the lucrative severance package granted to michael s ovitz.	34

Table 28: Adversarial examples crafted by different attack algorithms on BERT using the AG dataset. Replacement words are shown in red. Query. ($\downarrow$) is the number of model queries.

Attack	Texts	Query.
No Attack	Cray promotes two execs ly huong pham becomes the supercomputer maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services.	0
HLBB	Cray promotes two execs ly huong pham buys the supercomputer maker’s senior vice presdent of operations,and peter ungaro is made senior obscene chairperson for sales,marketing and services.	3811
TextHoaxer	Hucknall promotes two execs ly huong pham becomes the supercomputer maker’s senior vice president of surgical, and peter ungaro is made senior vice president for sales, marketing and services.	94
LeapAttack	Cray promotes two execs ly huong pham becomes the quadrillion maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services.	42
TextHacker	Cray promotes two ceos ly huong pham becomes the supercomputer maker’s senior prostitution presdent of operations,and peter ungaro is made senior vice president for selling,marketing and services.	101
LimeAttack	Cray promotes two execs ly huong pham becomes the thermonuclear maker’s senior vice presdent of operations,and peter ungaro is made senior vice president for sales,marketing and services.	39
