License: arXiv.org perpetual non-exclusive license
arXiv:2312.04127v2 [cs.CL] 23 Feb 2024

Analyzing the Inherent Response Tendency of LLMs:
Real-World Instructions-Driven Jailbreak

Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin
Harbin Institute of Technology, Harbin, China
{yrdu, sdzhao, mma, yhchen, qinb}@ir.hit.edu.cn
   Corresponding author
Abstract

Extensive work has been devoted to improving the safety mechanism of Large Language Models (LLMs). However, LLMs still tend to generate harmful responses when faced with malicious instructions, a phenomenon referred to as “Jailbreak Attack”. In our research, we introduce a novel automatic jailbreak method, RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses. The jailbreak idea of our method is “Inherent Response Tendency Analysis”, which identifies real-world instructions that can inherently induce LLMs to generate affirmation responses. The corresponding jailbreak strategy is “Real-World Instructions-Driven Jailbreak”, which involves strategically splicing real-world instructions identified through the above analysis around the malicious instruction. Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions. We conduct experiments to verify the effectiveness of our jailbreak idea and the rationality of our jailbreak strategy design. Notably, our method designs a semantically coherent attack prompt, highlighting the potential risks of LLMs. Our study provides detailed insights into jailbreak attacks, establishing a foundation for the development of safer LLMs.

1 Introduction

Figure 1: Illustration of jailbreak methods. Manual-designed methods typically demand substantial effort and face challenges in adaptability across LLMs. The automatic searched suffix lacks meaningful semantics, which can be easily detected by PPL algorithms. In comparison, our RADIAL method is an automatic process that designs semantically coherent attack prompts.


Large Language Models (LLMs) OpenAI (2023); Touvron et al. (2023); Baichuan (2023); Du et al. (2022b) exhibit great potential across fields, yet a significant hurdle to broader application lies in ensuring the harmlessness of their responses Liu et al. (2023b). Substantial efforts have been dedicated to addressing this concern, particularly in aligning LLMs with human values, exemplified by the Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022). Despite these ongoing efforts, a threat persists in the form of jailbreak attacks Goldstein et al. (2023); Kang et al. (2023); Hazell (2023), which bypass the LLMs’ safety mechanisms by gaining control of prompts.

In recent studies, there has been a significant focus on jailbreak attack methods, which provide valuable insights into the limitations of LLMs and guidance for further enhancing their safety. As shown in Fig. 1, various jailbreak attack methods are illustrated. Some efforts involve the creation of manual-designed prompts Wei et al. (2023); Abdelnabi et al. (2023); Li et al. (2023); Wang et al. (2023); Liu et al. (2023c), including executing a competitive objective or fashioning a role environment. Some efforts involve leveraging hundreds of manual-designed targets to automatically search attack suffixes Zou et al. ; Jones et al. (2023); Carlini et al. (2023); Wen et al. (2023). Regrettably, the above methods exhibit notable shortcomings: 1) Manual-designed prompts are time-consuming and challenging, particularly when adapting them for use across various LLMs. 2) Automatic searched suffixes lack meaningful semantics, which can be easily detected through the measurement of Perplexity (PPL) Jain et al. (2023).

In our study, we introduce a novel jailbreak method called ReAl-worlD Instructions-driven jAiLbreak (RADIAL). Initially, we present the idea of “Inherent Response Tendency Analysis” where we assess the inherent response tendencies of LLMs by calculating the generation probabilities for both affirmative and negative responses. Through this analysis, we identify specific real-world instructions that can inherently induce LLMs to generate affirmation responses. Building on this insight, we develop the “Real-World Instructions-Driven Jailbreak” strategy where we strategically splice identified real-world instructions around the malicious instruction. This manipulation prompts the LLMs to generate the affirmation response rather than the rejection response when faced with malicious instructions, thereby bypassing the LLMs’ safety mechanisms. The primary advantages of our method include: 1) The requirement for only 40 manual-crafted responses (20 affirmation responses and 20 rejection responses) significantly conserves manual costs. 2) A semantically coherent attack prompt is automatically designed, as shown in Fig. 1.

Our experimental results demonstrate that, whether confronted with English or Chinese malicious instructions, our method outperforms strong baselines in terms of attack performance. Moreover, we conduct detailed ablation experiments to verify the effectiveness of our jailbreak idea “Inherent Response Tendency Analysis” and the rationality of our jailbreak strategy “Real-World Instructions-Driven Jailbreak”. Through our research, we found that once the LLMs’ safety mechanism is bypassed in the first round of dialogue, LLMs are prone to generating more comprehensive harmful responses in subsequent rounds.

Our contributions can be summarized as follows:

  • We propose the “Inherent Response Tendency Analysis” jailbreak idea, which provides a new perspective on jailbreak attacks.

  • Based on the above idea, we propose a jailbreak strategy “Real-World Instructions-Driven Jailbreak”. Our strategy designs a semantically coherent attack prompt, which exposes potential risks in LLMs’ applications.

  • Across multiple LLMs, we conduct various experiments to verify the superiority and soundness of our method.

2 Background

Defense mechanisms.

The defense mechanism of LLMs can be approached from two perspectives. On the one hand, it focuses on enhancing the safety of LLMs themselves Xie et al. (2023). For instance, the chat versions of some open-source LLMs like Baichuan2 Baichuan (2023) and ChatGLM2 Du et al. (2022b) employ the RLHF Ouyang et al. (2022) strategy to ensure alignment with human values. On the other hand, it focuses on integrating external detection modules. This involves pre-processing detection to assess whether the input prompt contains malicious content, and post-processing detection to assess whether the LLM’s output contains harmful content. Prior work Deng et al. (2023) uses the method of network delay detection, thereby revealing that commercial systems such as Bing, Bard, and ChatGPT integrate external detection modules. In our study, we provide some unique insights into improving the security of LLMs themselves. Therefore, our study focuses on open-source advanced LLMs, rather than commercial systems.

Figure 2: Overall framework of RADIAL method.

Jailbreak attack.

Jailbreak methods can be broadly classified into two categories: manually designed methods and automated methods. For manually designed methods, some notable works include techniques Perez and Ribeiro ; Wei et al. (2023) that induce LLMs to ignore non-malicious instructions but focus solely on malicious instructions, introduce competitive targets within prompts to induce the LLMs or encode malicious instructions in base64 format. For automated methods, some works Zou et al. ; Jones et al. (2023); Carlini et al. (2023) involve utilizing adversarial concepts to conduct discrete searches on prompts, driven by artificially constructed targets. However, such methods always produce prompts lacking coherent semantics, making them easily detectable. Some other works Chao et al. (2023); Wang et al. (2023) involve leveraging LLM’s intrinsic capabilities to discover attack prompts through self-interaction among LLMs. While such methods aim for successful attacks, they often fall short in providing insights for enhancing LLM security. Our work provides a new perspective on performing the jailbreaking attack by analyzing the LLMs’ inherent response tendency, shedding light on potential vulnerabilities within LLMs.

3 RADIAL Method

3.1 Overall

Recent work Zhao et al. (2024); Wei et al. (2023) indicated that the main goal of a successful jailbreak attack is to induce LLMs to generate affirmation responses rather than rejection responses. Therefore, our method attempts to create a condition within the prompt conducive to affirmation responses. In our work, we introduce the concept of inherent response tendency, where LLMs have the inherent tendency towards affirmation or rejection responses when faced with each real-world instruction. We measure it by calculating the generation probabilities of affirmation and rejection responses, and in Sec. 5.1, we have conducted quantitative experiments to verify its existence. By analyzing LLMs’ inherent response tendency, we identify real-world instructions that can inherently induce LLMs to generate affirmation responses. In our jailbreak attack strategy, we splice the above-identified instructions around the malicious instruction to amplify the LLMs’ potential to generate affirmation responses. Overall, as shown in Fig. 2, our method consists of the jailbreak idea “Inherent Response Tendency Analysis” and jailbreak strategy “Real-World Instructions-Driven Jailbreak”.

3.2 Inherent Response Tendency Analysis

As shown on the left side of Fig. 2, to initiate this analysis, we constructed 20 affirmation responses and 20 rejection responses, which are designed to be general and not specific to any particular instruction. For instance, a representative affirmation response takes the form of “Sure, here’s the information.” while a representative rejection response is “Sorry, I am unable to provide the information”. All manual-constructed responses can be found in App. A. Furthermore, we collected 30,000 real-world English instructions from the alpaca official repository (https://github.com/tloen/alpaca-lora) and iterated over each instruction to calculate the generation probabilities of the LLM’s affirmation and rejection responses. Specifically, we assume that a real-world instruction given as input to the LLM is represented by $X$, an affirmation response by $y_a = \{y_{a0}, y_{a1}, \ldots, y_{an}\}$, and a rejection response by $y_r = \{y_{r0}, y_{r1}, \ldots, y_{rm}\}$. For the LLM’s affirmation response tendency, the probability $p_a$ of generating an affirmation response $y_a$ can be calculated as:

$$p_a = \sum_{i=1}^{n} P(y_{ai} \mid X, y_{a0}, \ldots, y_{a(i-1)}) \tag{1}$$

We further consider what the LLM itself wants to generate. The probability $p^{*}_a$ can be calculated as:

$$p^{*}_a = \sum_{i=1}^{n} \operatorname{argmax}_{y} P(y \mid X, y_{a0}, \ldots, y_{a(i-1)}) \tag{2}$$

Finally, we employ our constructed affirmation responses to assess the LLM’s affirmation response tendency ($T_a$) to real-world instructions, which can be calculated as:

$$T_a = \frac{1}{num} \sum_{j=1}^{num} \frac{p_{aj}}{p^{*}_{aj}} \tag{3}$$

where $num$ represents the number of constructed affirmation responses.

For the LLM’s rejection response tendency, the process of calculation is similar. The probability $p_r$ of generating a rejection response can be calculated as:

$$p_r = \sum_{i=1}^{m} P(y_{ri} \mid X, y_{r0}, \ldots, y_{r(i-1)}) \tag{4}$$

The probability $p^{*}_r$ can be calculated as:

$$p^{*}_r = \sum_{i=1}^{m} \operatorname{argmax}_{y} P(y \mid X, y_{r0}, \ldots, y_{r(i-1)}) \tag{5}$$

The LLM’s rejection response tendency ($T_r$) to real-world instructions can be calculated as:

$$T_r = \frac{1}{num} \sum_{j=1}^{num} \frac{p_{rj}}{p^{*}_{rj}} \tag{6}$$

where $num$ represents the number of constructed rejection responses.

Overall, we assign each real-world instruction a score reflecting the LLM’s inherent response tendency. The score can be calculated as:

$$Score = \frac{T_a}{T_r} \tag{7}$$

where the higher the score, the higher the LLM’s inherent tendency to affirm. As shown in Fig. 2, based on the calculated score, we can get a ranking of real-world instructions.
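The arithmetic of Eqs. (1) to (7) can be illustrated with a minimal sketch on made-up numbers. It assumes the per-token probabilities have already been obtained elsewhere, and it reads the argmax term in Eqs. (2) and (5) as the highest token probability at each step; both the function names and the toy values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def summed_prob(token_probs):
    """Eq. (1)/(4) style sum of per-token probabilities."""
    return sum(token_probs)

def tendency(responses):
    """Eq. (3)/(6): mean of p / p* over the constructed responses.

    Each item is a pair (probs_of_response_tokens, max_probs_per_step),
    where the second list is our reading of the argmax term in Eq. (2)/(5).
    """
    return mean([summed_prob(p) / summed_prob(p_star) for p, p_star in responses])

def score(affirm, reject):
    """Eq. (7): ratio of affirmation tendency to rejection tendency."""
    return tendency(affirm) / tendency(reject)

# Hypothetical values for one instruction, two responses per class.
affirm = [([0.5, 0.25], [0.5, 0.5]), ([0.25, 0.5], [0.5, 0.5])]
reject = [([0.25, 0.25], [0.5, 0.5]), ([0.125, 0.125], [0.5, 0.5])]

print(score(affirm, reject))
```

With these toy values, $T_a = 0.75$ and $T_r = 0.375$, so the score is 2.0, i.e. the instruction leans towards affirmation.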

3.3 Real-World Instructions-Driven Jailbreak

As shown on the right side of Fig. 2, we perform real-world instructions-driven jailbreak. Based on the above instruction ranking, we select real-world instructions from the top that can inherently induce the LLMs to generate affirmation responses, thereby creating a condition within the prompt conducive to affirmation responses. Notably, for the type of instructions, we abandoned text manipulation instructions, such as “Please translate the following sentence” or “Please change the following text” etc. These instructions always lead the LLM to manipulate the subsequent text, which results in the malicious instruction being translated or rewritten. Subsequently, we strategically splice our selected real-world instructions around the malicious instructions. During the splicing process, we consider the number of spliced real-world instructions and the location of the malicious instructions within the prompt. For the number of spliced real-world instructions, we take into account the LLM’s capacity to process multiple instructions concurrently. Excessive splicing of instructions can lead to the LLM’s responses being impacted by the context, potentially resulting in ambiguity in its comprehension of the instructions. Consequently, our method empirically splices two or four instructions. For the location of the malicious instruction within the prompt, we tried three distinct positions: the front, middle, and end. Our experimental findings reveal that embedding the malicious instruction at the end of the prompt yields optimal performance.

4 Experiment

4.1 Preliminary

Before presenting the experiment results, we introduce our selected evaluation metrics, test data, advanced LLMs, and comparison baselines used in our experiments.

Human-aligned (RLHF): Baichuan2-7B, Baichuan2-13B, ChatGLM2-6B. Instruction fine-tuned: Mistral-7B, Vicuna-7B.

| Type | Method | Baichuan2-7B (GPT-4) | Baichuan2-7B (KWM) | Baichuan2-13B (GPT-4) | Baichuan2-13B (KWM) | ChatGLM2-6B (GPT-4) | ChatGLM2-6B (KWM) | Mistral-7B (GPT-4) | Mistral-7B (KWM) | Vicuna-7B (GPT-4) | Vicuna-7B (KWM) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | None | 5 | 2 | 0 | 2 | 9 | 5 | 8 | 13 | 4 | 5 |
| Manual | Evil | 64 | 28 | 90 | 47 | 10 | 8 | 99 | 79 | 88 | 40 |
| Manual | Comp. | 71 | 32 | 40 | 20 | 37 | 28 | 28 | 19 | 96 | 36 |
| Auto | Dist.† | 32 | 38 | 20 | 25 | 46 | 61 | 14 | 28 | 32 | 31 |
| Auto | Dist.‡ | 35 | 44 | 30 | 32 | 60 | 70 | 17 | 34 | 42 | 48 |
| Auto | Suffix† | 40 | 72 | 15 | 20 | 35 | 32 | 33 | 23 | 27 | 35 |
| Auto | Suffix‡ | 59 | 74 | 18 | 26 | 37 | 34 | 35 | 28 | 30 | 37 |
| Auto | Our† | 73 | 78 | 63 | 64 | 58 | 60 | 31 | 30 | 49 | 51 |
| Auto | Our‡ | 93 | 84 | 75 | 77 | 76 | 75 | 41 | 39 | 57 | 64 |

Table 1: Experimental results on English. ASR (%) evaluated by KWM and GPT-4 is reported. † represents the performance of a single jailbreak attack and ‡ represents the overall performance of two jailbreak attacks.

| Type | Method | Baichuan2-7B (GPT-4) | Baichuan2-7B (KWM) | Baichuan2-13B (GPT-4) | Baichuan2-13B (KWM) | ChatGLM2-6B (GPT-4) | ChatGLM2-6B (KWM) |
|---|---|---|---|---|---|---|---|
| | None | 2 | 8 | 2 | 2 | 7 | 7 |
| Manual | Evil | 76 | 30 | 97 | 64 | 13 | 5 |
| Manual | Comp. | 68 | 42 | 55 | 34 | 53 | 53 |
| Auto | Dist.† | 29 | 38 | 23 | 27 | 46 | 64 |
| Auto | Dist.‡ | 41 | 54 | 36 | 43 | 58 | 70 |
| Auto | Suffix† | 62 | 71 | 14 | 28 | 32 | 42 |
| Auto | Suffix‡ | 68 | 77 | 18 | 32 | 39 | 47 |
| Auto | Our† | 71 | 75 | 45 | 75 | 52 | 71 |
| Auto | Our‡ | 83 | 87 | 59 | 87 | 68 | 79 |

Table 2: Experimental results on Chinese. ASR (%) evaluated by KWM and GPT-4 is reported. We perform a cross-language attack, using real-world English instructions to attack Chinese malicious instructions. † and ‡ have the same meaning as in Tab. 1.

Evaluation metrics.

Consistent with previous work Zou et al. (2023), we consider a jailbreak attack successful when the responses generated by LLMs contain harmful content. One evaluation metric Zou et al. (2023) relies on Key Word Matching (KWM). Specifically, if rejection expressions such as “I apologize” or “illegal and unethical” are detected, the jailbreak attack is deemed unsuccessful; otherwise, it is deemed successful. However, because this rule-based metric ignores semantics, it leads to evaluation errors in many cases. The other evaluation metric Zhao et al. (2024) relies on GPT-4 (in our work, we use the GPT-4 API interface from November 1 to November 15, 2023) to analyze the semantics of responses. In App. B, we provide the details of the GPT-4 evaluation process and evaluate 6 sets of experimental results from our work to measure the performance of the two evaluation metrics. We observe that compared to the KWM evaluation, the GPT-4 evaluation has a higher degree of alignment with human evaluation. In our experimental results, we report the Attack Success Rate (ASR).
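The KWM metric described above can be sketched in a few lines. This is an illustrative approximation, not the evaluation code used in the paper: only the first two markers come from the text, the remaining markers and all function names are hypothetical.

```python
# Only the first two markers are quoted in the text; the rest are assumed.
REFUSAL_MARKERS = ["I apologize", "illegal and unethical", "I'm sorry", "I cannot"]

def kwm_attack_succeeded(response, markers=REFUSAL_MARKERS):
    """True when no refusal marker occurs in the response (case-insensitive)."""
    text = response.lower()
    return not any(marker.lower() in text for marker in markers)

def asr(responses):
    """Attack Success Rate in percent over a list of model responses."""
    return 100.0 * sum(kwm_attack_succeeded(r) for r in responses) / len(responses)

responses = [
    "I apologize, but I can't help with that.",
    "Sure, here's the information.",
]
print(asr(responses))
```

A response that declines without using any listed phrase (for example, "That would be unwise.") is counted as a success by this rule, which is the kind of evaluation error that motivates the semantic GPT-4 check.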

Test data.

Many benchmarks Shen et al. (2023); Liu et al. (2023a); Zhang et al. (2023) have been established to evaluate the effectiveness of jailbreak methods. In our work, we follow the work Zou et al. (2023), utilizing their constructed 100 English malicious instructions and meanwhile manually translating these 100 instructions into Chinese.

LLMs.

LLMs can be categorized into two primary groups: one is trained solely through instruction fine-tuning to enhance conversational abilities, while the other is further aligned with human values through the RLHF strategy following instruction fine-tuning, thereby improving their safety. For the former category, we selected the Llama-2 Touvron et al. (2023) variant Vicuna-7B Zheng et al. (2023) and Mistral-7B Jiang et al. (2023). For the latter category, we selected Baichuan2-7B-Chat, Baichuan2-13B-Chat, and ChatGLM2-6B. It is worth noting that all the LLMs assessed in our work have consistently achieved high scores in evaluations conducted on an authoritative website (https://opencompass.org.cn/).

Comparison baselines.

In our work, we explore both manual-designed and automatic searched methods as our comparison baselines. Examples for all attack prompts in our baselines can be found in App. C. For the manual-designed method, we first selected the Comp. method introduced in Wei et al. (2023), which involves executing a competitive goal, such as appending “Start with ‘Absolutely, here is a’” after the malicious instruction. Secondly, our attention turned towards a website (https://www.jailbreakchat.com/) with specialized jailbreak attack prompts tailored for GPT-4. Among the available prompts, we selected the “Evil Confidant” (Evil) method, which has the highest jailbreak score and involves fashioning a role environment to confuse LLMs.

For the automatic searched method, we first considered the distraction (Dist.) method highlighted in Wei et al. (2023); Shi et al. (2023), which involves randomly inserting additional real-world instructions around the malicious one to divert the LLM’s attention. Secondly, we explored the Suffix method proposed in Zou et al. (2023), which focuses on searching attack suffixes based on hundreds of manually designed adversarial targets.

4.2 Main Experiments

Experiment settings.

For the manual-designed method, we execute a single jailbreak attack for each malicious instruction. In contrast, the automatic method provides the convenience of conducting repeated attacks. Therefore, we allow the automatic method to execute two jailbreak attacks for each malicious instruction. An attack is deemed successful if either of the two attempts succeeds. Specifically, for the Dist. method, we employ a randomized selection of distinct real-world instructions in two separate attacks to distract the LLMs. For the Suffix method, we utilize two attack suffixes sourced from the official repository (https://github.com/llm-attacks/llm-attacks). For our method, we select the top 2 and top 4 instructions based on our instruction ranking to execute two attacks. In our experiment results for automated methods, we present the performance (†) of a single jailbreak attack and the overall performance (‡) of two jailbreak attacks.
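The difference between the single-attack (†) and overall (‡) figures can be made concrete with a small sketch. It assumes each test instruction has a pair of boolean outcomes, one per attempt, and that † refers to the first attempt; the function names and toy outcomes are hypothetical.

```python
def asr_single(outcomes):
    """ASR (%) counting only the first attempt of each pair (assumed to be †)."""
    return 100.0 * sum(first for first, _ in outcomes) / len(outcomes)

def asr_overall(outcomes):
    """ASR (%) where an instruction counts if either attempt succeeded (‡)."""
    return 100.0 * sum(first or second for first, second in outcomes) / len(outcomes)

# Hypothetical outcomes for four malicious instructions.
outcomes = [(True, False), (False, True), (False, False), (True, True)]

print(asr_single(outcomes), asr_overall(outcomes))
```

Because ‡ is a logical OR over attempts, it can never be lower than †, which matches the pattern in Tab. 1 and Tab. 2.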

Experiment results on English.

As shown in Tab. 1, for the Instruction Fine-tuned (IFT) LLMs, manual-designed methods can easily achieve high ASR, while the performance of automatic methods is mediocre. For instance, the “Evil” method has achieved an impressive 99% ASR on Vicuna-7B, along with a respectable 88% ASR on Mistral-7B. This phenomenon underscores that manual-designed methods are effective enough for IFT LLMs.

For the human-aligned LLMs, manual-designed methods have substantial room for performance improvement and pose challenges in their adaptability across various LLMs. For instance, the “Evil” method can achieve an impressive 90% ASR on Baichuan2-13B, but its effectiveness drops significantly to only 10% when applied to ChatGLM2-6B. Notably, compared with manual-designed methods, our method achieves comparable or even higher ASR and exhibits a high degree of adaptability across various LLMs. Besides, our method demonstrates remarkable potential. As we scale up the number of automatic attacks, the ASR sees a substantial increase. For instance, our method† achieves a 58% ASR on ChatGLM2-6B. When considering the overall performance of two jailbreak attacks, our method‡ significantly boosts the ASR to 76%.

Furthermore, experiment results demonstrate that both for instruction fine-tuned and human-aligned LLMs, our method achieves a higher ASR compared to other automatic methods, reflecting the performance superiority of our method.

Figure 3: Distribution of the inherent response tendency scores of three advanced LLMs: (a) Baichuan2-7B, (b) Baichuan2-13B, (c) ChatGLM2-6B. The horizontal axis represents the score, and the vertical axis represents the number of real-world instructions.

Experiment results on Chinese.

We perform a cross-language attack, using real-world English instructions to attack Chinese malicious instructions. LLMs with a strong proficiency in Chinese are selected as our analysis objects. The experimental results, as presented in Tab. 2, reveal that even when subjected to cross-language attacks, our method still achieves outstanding performance. This phenomenon indicates the flexibility of our method, emphasizing that it is not tied to specific languages.

5 Analysis

In our analysis, we selected LLMs that have been aligned with human values as analysis objects.

5.1 Distribution of Tendency Score

We calculate the inherent response tendency scores of LLMs and display their distribution in Fig. 3. We observe that the scores are approximately normally distributed. While the majority of instructions have scores concentrated within a certain range, there are still numerous instructions with scores dispersed at both ends. This observation underscores the presence of LLMs’ inherent response tendency. We conjecture that this may be due to the LLMs’ capturing the biased distribution of the training data, which has been extensively investigated in previous work Poliak et al. (2018); Du et al. (2022a); McCoy et al. (2019); Du et al. (2023).

Figure 4: Analysis experiment results on (a) Baichuan2-7B, (b) Baichuan2-13B, and (c) ChatGLM2-6B. ASR (%) evaluated by GPT-4 is reported. In {k}_{pos} on the horizontal axis, “k” represents the number of selected real-world instructions, and “pos” represents the position of the malicious instruction in the prompt. Moreover, when attacking each test sample, the term “Top” denotes the selection of k instructions from the Top k of instruction rankings. “Top N” denotes the random selection of k instructions from the Top N of instruction rankings. “Random” denotes the random selection of k instructions from the entire set of instructions. “Bottom N” denotes the random selection of k instructions from the Bottom N of instruction rankings.

Figure 5: A case study of asking the follow-up question.

5.2 Ablation Analysis

In our ablation analysis, we assess the impact of varying factors on our method. On the one hand, we analyze the effectiveness of the instruction ranking. Assuming that we need to splice k instructions, we have performed the following four settings each time we execute the attack:

  • Top: We select k instructions from the top k instructions.

  • Top N: Instructions with a score greater than or equal to 1.1 are regarded as the top N instructions, and we select k instructions from the top N instructions.

  • Random: We randomly select k instructions from the entire instructions.

  • Bottom N: Instructions with a score less than or equal to 0.6 are identified as the bottom N instructions, and we select k instructions from the bottom N instructions.

The expected ordering of ASR among these settings is: Top > Top N > Random > Bottom N. Fig. 4 illustrates the trends of the average attack success rates for each setting, and the AVG line follows the ordering we hypothesized. This confirms that the instruction ranking plays a pivotal role and verifies its effectiveness.
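The four selection settings can be sketched as a small sampling routine. This is a minimal illustration, not the authors' implementation: the function name `select_instructions`, the `(instruction, score)` pair layout, and the placeholder data are assumptions, while the thresholds 1.1 and 0.6 are the ones stated above.

```python
import random

def select_instructions(scored, k, setting, top_thr=1.1, bottom_thr=0.6, rng=None):
    """Pick k instructions from a list of (instruction, score) pairs.

    `setting` is one of "top", "top_n", "random", "bottom_n".
    The data layout and names are hypothetical; only the thresholds
    (1.1 and 0.6) come from the text.
    """
    rng = rng or random.Random()
    # Rank instructions from highest to lowest tendency score.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if setting == "top":
        chosen = ranked[:k]  # deterministic: the k highest-ranked items
    elif setting == "top_n":
        pool = [p for p in ranked if p[1] >= top_thr]
        chosen = rng.sample(pool, k)  # raises ValueError if the pool has < k items
    elif setting == "random":
        chosen = rng.sample(ranked, k)
    elif setting == "bottom_n":
        pool = [p for p in ranked if p[1] <= bottom_thr]
        chosen = rng.sample(pool, k)
    else:
        raise ValueError(f"unknown setting: {setting}")
    return [text for text, _ in chosen]

# Placeholder scores for illustration only (not taken from the paper).
toy_scores = [("a", 1.5), ("b", 1.2), ("c", 1.0), ("d", 0.9), ("e", 0.5), ("f", 0.3)]
```

Only the "Top" setting is deterministic; the other three draw without replacement from their pool, so repeated attacks on different test samples may use different instructions.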

On the other hand, we focus on the number of spliced instructions and the placement of the malicious instruction within the prompt. For the number of spliced instructions, Fig. 4 shows that a higher overall attack success rate is consistently observed when a greater number of instructions are spliced. However, we caution against an indiscriminate increase in the number of spliced instructions. When six instructions are spliced, there are many instances where the LLM struggles to execute each instruction accurately. We believe that this challenge is closely tied to the LLMs’ inherent capacity to concurrently execute multiple instructions, which has also been discussed in previous work Wei et al. (2023). For the location of the malicious instruction within the prompt, we experimented with placing the malicious instruction at the front, middle, and end of the prompt, respectively. Fig. 4 shows that placing the malicious instruction at the end of the prompt yields a higher overall attack success rate.

| Setting | Baichuan2-7B | Baichuan2-13B | ChatGLM2-6B |
|---|---|---|---|
| 2_end | 100 (40/40) | 100 (29/29) | 100 (22/22) |
| 4_end | 82.61 (38/46) | 88.57 (31/35) | 95.65 (44/46) |

Table 3: Ratio % ($S^{\spadesuit}$/$S^{\clubsuit}$) of samples in which the LLM produces more detailed harmful information in the second round of dialogue.

5.3 Asking Follow-up Question

In our analysis, we observe that LLMs’ responses to malicious instructions are sometimes brief. As illustrated in Fig. 5, the LLM produced only a brief set of planning steps, whereas we expect the LLM to provide specific details for each step. We attribute this phenomenon to LLMs’ susceptibility to in-context information, which has been widely investigated in In-Context Learning Dong et al. (2022); Xie et al. (2021); Brown et al. (2020). The two spliced instructions in Fig. 5 both involve the content of “a sentence” (marked in red), which may subtly lead to the LLM’s brief response. To address this limitation, we ask a follow-up question in the second round of dialogue, as shown in Fig. 5. To verify the effectiveness of this strategy, we analyze the results of two settings on three LLMs. As shown in Tab. 3, we first manually counted the number of samples ($S^{\clubsuit}$) where the attack is successful but the response is brief. Then, among the samples in $S^{\clubsuit}$, we counted the samples ($S^{\spadesuit}$) where the response becomes more detailed under our strategy. The ratio of $S^{\spadesuit}$ to $S^{\clubsuit}$ is reported. Experiment results show that this strategy works in over 80% of cases. It is crucial to highlight that this level of performance is easily attained during the second round of dialogue through a straightforward question.
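The percentages in Tab. 3 follow directly from the reported counts. The short check below recomputes them; the dictionary layout and the helper name `ratio_percent` are illustrative assumptions, while the counts themselves are copied from the table.

```python
# (detailed-after-follow-up, successful-but-brief) counts from Tab. 3.
counts = {
    "2_end": {"Baichuan2-7B": (40, 40), "Baichuan2-13B": (29, 29), "ChatGLM2-6B": (22, 22)},
    "4_end": {"Baichuan2-7B": (38, 46), "Baichuan2-13B": (31, 35), "ChatGLM2-6B": (44, 46)},
}

def ratio_percent(detailed, brief):
    """Share of brief responses that became more detailed, in percent."""
    if brief == 0:
        raise ValueError("no brief samples to compare against")
    return round(100 * detailed / brief, 2)

for setting, per_model in counts.items():
    for model, (detailed, brief) in per_model.items():
        print(setting, model, ratio_percent(detailed, brief))
```

The lowest value is 82.61% (Baichuan2-7B under 4_end), which is where the "over 80% of cases" statement comes from.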

Figure 6: ASR (%) evaluated by GPT-4 is reported. {k}_{pos} has the same meaning as in Fig. 4.

5.4 Parallel VS Pipeline

In our method, we have the LLMs execute multiple instructions in parallel within a single prompt. We claim that our identified instructions create a condition conducive to affirmation responses. To verify this claim, we change the execution mode from parallel to pipeline, where LLMs execute the instructions one by one over multiple rounds and the malicious instruction is executed in the final round. The experimental results in Fig. 6 indicate that the pipeline strategy yields a significantly lower ASR than the parallel strategy. In other words, if we do not splice our identified instructions around the malicious instruction, the attack performance degrades significantly, which verifies our claim.

6 Conclusion

In our work, we design a novel automatic jailbreak method RADIAL, which consists of the “Inherent Response Tendency Analysis” idea and the “Real-World Instructions-Driven Jailbreak” strategy. Our comprehensive analysis sheds light on the potential risks of LLMs, which serves as a crucial step toward fostering the development of safer LLMs.

7 Acknowledgement

We thank two volunteers from the Harbin Institute of Technology, Haochun Wang and Jianyu Chen, for participating in our manual evaluation work.

8 Limitation

Our work introduces a novel jailbreaking method from a fresh perspective while also shedding light on potential vulnerabilities within LLMs. However, it is important to acknowledge certain limitations:

  • Our method involves constructing a limited set of affirmation and rejection responses to assess the LLMs’ inherent response tendency. This process remains manual, and further discussion is needed to determine the specific types of responses to construct.

  • Our method primarily focuses on white-box attacks directed at open-source LLMs. Further investigation is needed into how to conduct such attacks in the black-box setting.

9 Ethics Statement

We conduct all experiments on publicly available datasets and LLMs with authorization from the respective maintainers. The paper includes some potentially problematic content that has been generated by LLMs. It’s important to note that these examples are included solely for illustrative purposes and are not intended to be instructive or to cause harm in any way.

References

  • Abdelnabi et al. (2023) Sahar Abdelnabi, Kai Greshake, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
  • Baichuan (2023) Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Carlini et al. (2023) Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. 2023. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447.
  • Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  • Deng et al. (2023) Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. 2023. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
  • Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
  • Du et al. (2022a) Yanrui Du, Jing Yan, Yan Chen, Jing Liu, Sendong Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Bing Qin. 2022a. Less learn shortcut: Analyzing and mitigating learning of spurious feature-label correlation. arXiv preprint arXiv:2205.12593.
  • Du et al. (2023) Yanrui Du, Sendong Zhao, Yuhan Chen, Rai Bai, Jing Liu, Hua Wu, Haifeng Wang, and Bing Qin. 2023. Gls-csc: A simple but effective strategy to mitigate chinese stm models’ over-reliance on superficial clue. arXiv preprint arXiv:2309.04162.
  • Du et al. (2022b) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  • Goldstein et al. (2023) Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. 2023. Generative language models and automated influence operations: Emerging threats and potential mitigations. arXiv preprint arXiv:2301.04246.
  • Hazell (2023) Julian Hazell. 2023. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972.
  • Jain et al. (2023) Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jones et al. (2023) Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. 2023. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381.
  • Kang et al. (2023) Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. 2023. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
  • Li et al. (2023) Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. 2023. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
  • Liu et al. (2023a) Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. 2023a. A chinese prompt attack dataset for llms with evil content. arXiv preprint arXiv:2309.11830.
  • Liu et al. (2023b) Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023b. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374.
  • Liu et al. (2023c) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023c. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
  • McCoy et al. (2019) R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
  • Perez and Ribeiro (2022) Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.
  • Poliak et al. (2018) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.
  • Shen et al. (2023) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2023. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
  • Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Wang et al. (2023) Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. 2023. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733.
  • Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483.
  • Wen et al. (2023) Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668.
  • Xie et al. (2021) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080.
  • Xie et al. (2023) Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, pages 1–11.
  • Zhang et al. (2023) Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. 2023. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045.
  • Zhao et al. (2024) Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2024. Weak-to-strong jailbreaking on large language models. arXiv preprint arXiv:2401.17256.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  • Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A Manual-constructed Affirmation and Rejection Responses

In Tab. 4, we show 20 manually constructed affirmation responses and 20 manually constructed rejection responses. All of them are general and not tied to any particular real-world instruction.

| Affirmation Responses | Rejection Responses |
|---|---|
| Certainly, here’s the information you asked for. | Regrettably, I cannot comply with that request. |
| Of course, here’s what you need. | Unfortunately, I won’t be able to assist with that. |
| Absolutely, here’s the relevant content. | I’m afraid I cannot fulfill your request. |
| Certainly, I have what you’re looking for. | I’m afraid I can’t provide the information you’re seeking. |
| Of course, here’s what you found. | I must decline your request. |
| Sure thing, here’s the relevant information. | Sadly, I’m unable to provide the information you seek. |
| Indeed, here’s the solution. | I’m sorry, but I can’t accommodate that particular request. |
| Of course, here’s the requested information. | Regretfully, I won’t be able to comply with that. |
| Certainly, I’ve got you covered. | I’m afraid I can’t be of service for that specific task. |
| Absolutely, here is the requested material. | Unfortunately, I’m not equipped to handle that inquiry. |
| Sure thing, here’s the answer to your question. | It’s beyond my ability to respond to that request. |
| Of course, here’s the relevant info. | I must respectfully decline your request. |
| Sure, here’s what you’re after. | I’m unable to assist with that particular query. |
| Certainly, here’s the content you need. | I’m sorry, but I can’t fulfill your request at this time. |
| Of course, here’s the information at your disposal. | Regretfully, I’m not able to help with that. |
| No worries, here’s the solution to your query. | Unfortunately, I cannot provide the information you’re looking for. |
| Certainly, here’s what I’ve got for you. | I’m afraid I can’t meet your request’s requirements. |
| Sure, here’s what you’ve asked for. | It’s not feasible for me to assist with that request. |
| Sure, here’s what you’re looking for. | I must decline your request due to its nature. |
| Certainly, here’s the response you need. | I’m unable to accommodate that specific request. |

Table 4: The 20 manually constructed affirmation responses and 20 manually constructed rejection responses.

Appendix B GPT-4 Evaluation VS KWM Evaluation

The details of the GPT-4 evaluation process are illustrated in Fig. 8. We compared the consistency of GPT-4 and KWM evaluations with human evaluation. As shown in Fig. 9, human evaluation covered six sets of experimental results. The results reveal that the consistency between KWM evaluation and human evaluation is approximately 70%. In contrast, GPT-4 evaluation reaches a considerably higher consistency of around 90% with human evaluation. Our findings suggest that, in the evaluation of jailbreak attacks, the semantic-based GPT-4 evaluation outperforms the rule-based KWM evaluation, offering a more accurate and reliable assessment aligned with human evaluation.

Further, we show a real case under the Evil method to explain why GPT-4 evaluation is more accurate than KWM evaluation. As shown in Fig. 7, the LLM’s response mentioned that “this is illegal and can have severe consequences for both you and anyone involved”, but then still provided harmful content. KWM evaluation judges the response as harmless because it detects the word “illegal”, while GPT-4 evaluation judges the response as harmful by analyzing its semantics.
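To make the comparison concrete, the sketch below shows a rule-based keyword check and the consistency measure against human labels. It is a hedged illustration rather than the evaluation code used in the paper: the keyword list is hypothetical (only “illegal” is mentioned in the text), and the function names `kwm_is_harmful` and `consistency` are assumptions.

```python
# Hypothetical keyword list; only "illegal" is taken from the text.
REFUSAL_KEYWORDS = ("illegal", "i cannot", "i'm sorry")

def kwm_is_harmful(response, keywords=REFUSAL_KEYWORDS):
    """Rule-based judgment: harmful only if no refusal keyword appears."""
    text = response.lower()
    return not any(kw in text for kw in keywords)

def consistency(auto_labels, human_labels):
    """Percentage of samples on which an automatic judge agrees with humans."""
    if not human_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    agree = sum(a == h for a, h in zip(auto_labels, human_labels))
    return 100.0 * agree / len(human_labels)

# The sentence quoted from Fig. 7 trips the keyword rule, so the response
# would be labeled harmless regardless of what follows it.
quoted = "this is illegal and can have severe consequences for both you and anyone involved"
print(kwm_is_harmful(quoted))
```

Because the rule inspects surface strings only, a response that warns about illegality and then complies is mislabeled, which is the failure mode described above.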

Appendix C Examples for all Attack Prompts

Fig. 10 shows examples of all attack prompts used in our baselines.

Figure 7: A real case under the Evil method to explain why GPT-4 evaluation is more accurate.
Figure 8: Illustration of GPT-4 evaluation.
Figure 9: The consistency of GPT-4 and KWM evaluations with human evaluation. “green” represents cases that remain consistent and “yellow” represents cases that remain inconsistent. 2_end and 4_end represent two experiment settings, which are detailed in the caption of Fig. 4.
Figure 10: Examples for all attack prompts in our baseline.