Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Abstract

Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.

Index Terms—  ASR, Adversarial Attacks, Control

1 Introduction

Fig. 1: A short universal acoustic adversarial segment can be prepended to any input speech signal to control the behavior of a multi-task Automatic Speech Recognition (ASR) model. For example, Whisper’s transcribe setting can be overridden such that it operates in its translate setting.

Speech-enabled foundation models, whether in the form of flexible automatic speech recognition (ASR) systems [1] or audio-prompted Large Language Models (LLMs) [2], are increasingly able to perform speech processing tasks beyond speech transcription; we refer to such systems as multi-task ASR models. These models are trained to execute a diverse set of speech processing tasks within a single framework, enabling their deployment across a wide range of applications. A notable example of such a model is OpenAI’s Whisper [1], which employs an encoder-decoder Transformer architecture to perform both speech transcription and speech translation. The task to be performed by Whisper is determined by a ‘task tag’ included in the textual prompt to the decoder. It is anticipated that future advancements will lead to increasingly flexible speech-enabled foundation models capable of performing a broader range of speech processing tasks within a single framework.

However, in this work we are the first to demonstrate that the capability of ASR models to perform multiple tasks introduces a new class of vulnerabilities: model-control adversarial attacks. These attacks aim to manipulate a model’s behavior such that, despite the model being configured for a specific task in deployment, the adversary can override the task setting and induce the model to perform an alternative target task (the target task must be one of the tasks the model has been trained to perform). Historically, adversarial attacks on ASR models have primarily focused on disrupting the model’s performance or causing it to transcribe a predetermined phrase [3, 4]. In contrast, our proposed model-control adversarial attacks have a distinctly different objective: they seek to override the operational setting of a multi-task ASR model, forcing it to function in an unintended mode.

To highlight the susceptibility of multi-task ASR models to such control-attacks, this work proposes a practical method to subvert the ‘transcription’ setting of the Whisper model and instead make Whisper execute speech ‘translation’. Given that an adversary typically does not have access to the textual decoder prompt, we adopt a threat model where the attack is restricted to the acoustic space. Our findings reveal that a short (less than 3 seconds) universal adversarial acoustic segment can be prepended to any speech signal to override Whisper’s ‘transcribe’ setting and execute the ‘translate’ function instead (depicted in Figure 1). Experimental evaluations across four languages indicate that this universal acoustic attack segment can consistently manipulate Whisper’s behavior for nearly all test samples. Further examination of more constrained attack scenarios reveals an interesting bi-modal pattern: the attacks are either highly successful in overriding the command or entirely ineffective for specific samples.

Overall, this work illustrates the vulnerability of flexible multi-task speech foundation systems to a novel form of adversarial attack. Model-control attacks exploit the model’s capability to perform various tasks, thereby enabling the adversary to override the intended operational setting. As ASR models become increasingly flexible and capable of handling more tasks, it is crucial to consider the potential risks posed by model-control adversarial attacks when deploying these models in real-world applications.

2 Related Work

2.1 Multi-task Speech Foundation Models

Speech-enabled foundation models are trained on a large quantity of data to process input speech signals and perform various speech processing tasks. The development of these models has taken two main forms. First, several approaches have attempted to extend powerful textual decoder-only Large Language Models (LLMs) to support direct speech inputs, with a trained connection module providing the embedded audio as a soft prompt [2, 5, 6, 7, 8, 9]. Alternatively, generalist speech-enabled foundation models have emerged in the form of flexible Automatic Speech Recognition (ASR) models [1] using an encoder-decoder architecture [10]. ASR models are traditionally designed to perform a single task: transcribe the input speech signal. However, recent powerful ASR models have been augmented with further speech processing abilities. Some of the most powerful flexible ASR models include OpenAI’s Whisper model [1] and NVIDIA’s Canary model [11], both of which can perform speech transcription and speech translation. These models adopt an encoder-decoder architecture, and a textual task tag is input to the decoder to indicate the speech task to be performed (transcribe or translate). Given its popularity, this work uses Whisper to illustrate the risk of control-attacks on such multi-task ASR models.

2.2 Audio Adversarial Attacks

Traditional audio adversarial attacks on ASR models have one of two objectives: corrupt the output transcription or deceive the ASR model into transcribing a desired target phrase. Early research on audio attacks explored gradient-based methods to perturb input audio for end-to-end ASR systems like WaveCNN and HMM-DNN, aiming to increase word error rates (WER) in transcriptions [3, 4]. Later studies introduced targeted attacks on ASR systems such as DeepSpeech, HMM-DNN, and LSTM-based neural networks to generate specific transcriptions [12, 13, 14, 15], while other research focused on making audio adversarial attacks more imperceptible [16, 17]. Practical advancements include generating universal adversarial perturbations for various speech signals, although initial methods required synchronizing with the entire speech signal [18]. This was addressed by creating perturbations that do not need to be synchronized with the source speech signal [19], and extending these attacks to newer end-to-end ASR systems like LAS, CTC, and RNN-T [20]. Additional techniques involve transferability from substitute models [21, 22, 23], evolutionary attacks [24, 25, 26, 27, 28], utterance-based attacks [29], and featurization attacks [30, 31, 32]. With the establishment of the Whisper model, new vulnerabilities to audio adversarial attacks have been identified, showing for example that adversarial attacks can lead to incorrect transcriptions [33] or even force entirely muted outputs for any speech input [34].

However, with the emergence of multi-task speech-enabled foundation models, we demonstrate in this work that there is the threat of a new kind of adversarial attack: the model-control attack. Because these models have the flexibility to perform multiple tasks, an adversary can override the operational setting of a deployed ASR model and force it to perform a different target task, i.e., take control of the model. With Whisper, we illustrate that despite being deployed in ‘transcribe’ mode, a simple universal adversarial attack can force Whisper to operate in ‘translate’ mode for nearly all inputs.

3 Multi-task Automatic Speech Recognition Models

This section describes how encoder-decoder Automatic Speech Recognition (ASR) systems, such as Whisper, can perform multiple speech processing tasks. Continuous-time speech is sampled to create a sequence of samples, $\mathbf{x}=x_{1:N}$, of $N$ frames. An ASR system maps this sampled speech/audio signal, $\mathbf{x}$, to text, $\mathbf{y}=y_{1:M}$, traditionally for speech transcription. However, this generated text can also serve other speech processing tasks, such as speech translation. Whisper, a flexible multi-task ASR system, uses an encoder-decoder architecture with parameters $\theta$ to auto-regressively generate the output text tokens. The likelihood of an output sequence $\mathbf{y}$ is modeled as:

$$P(\mathbf{y}|\mathbf{x},\mathcal{T})=\prod_{m}P(y_{m}|y_{<m},\mathbf{x},\mathcal{T};\theta), \tag{1}$$

where $\mathcal{T}$ defines the speech processing task, and

$$P(y_{m}|y_{<m},\mathbf{x},\mathcal{T};\theta)$$

is the probability predicted by the decoder for output token $y_{m}$, given $\mathbf{x}$ at the encoder input and $y_{<m}$ as the previously decoded tokens passed to the decoder input. To enable Whisper to perform multiple speech processing tasks, the task $\mathcal{T}$ is set via special text tokens at the decoder input. The first token input to the decoder is <|startoftranscript|>, followed by a token indicating the source audio language, for example <|fr|> for French. The next special token sets the task: <|transcribe|> or <|translate|>. With these special tokens in the decoder history, Whisper can flexibly perform either speech transcription ($\mathcal{T}=\texttt{tc}$) or speech translation ($\mathcal{T}=\texttt{tl}$) based on the specified task token.
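As a rough illustration of Equation 1 and of the special-token prompt, the sketch below builds the decoder prefix for a given language and task and sums per-token log-probabilities into a sequence log-likelihood. The helper names `build_decoder_prompt` and `sequence_log_likelihood` are hypothetical, the per-token probabilities are made up, and the token strings assume the `<|...|>` format used by the open-source Whisper tokenizer; no real model is called.

```python
import math

def build_decoder_prompt(language: str, task: str) -> list[str]:
    """Return the special tokens that condition the decoder on language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]

def sequence_log_likelihood(token_probs: list[float]) -> float:
    """Chain rule of Equation 1 in log space: log P(y) = sum_m log P(y_m | y_<m, x, T)."""
    return sum(math.log(p) for p in token_probs)

# The two task settings differ only in the final special token of the prefix.
prompt_tc = build_decoder_prompt("fr", "transcribe")
prompt_tl = build_decoder_prompt("fr", "translate")

# Made-up per-token probabilities for a three-token output sequence.
log_p = sequence_log_likelihood([0.9, 0.8, 0.5])
```

The point of the sketch is that the task is nothing more than one token in the decoder history, which is why an input that shifts the decoder's behaviour despite that token amounts to controlling the model.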

4 Model-control Adversarial Attacks

The objective of a model-control adversarial attack is to override the operational task setting of a multi-task ASR model, such that the model performs an alternate task as desired by the adversary. To illustrate this form of attack, in this section we propose a practical method to override Whisper’s transcription setting and force Whisper instead to perform speech translation - this is depicted in Figure 1.

4.1 Threat Model

When Whisper is deployed for a specific speech processing task, an adversary does not have access to the internal structure of the model and cannot simply change the special tokens input to the decoder to alter the task setting. Realistically, an adversary can only modify the source audio to achieve their goal of overriding Whisper’s task setting. Thus, we assume the adversary can only make changes in the acoustic space. Moreover, our threat model requires the adversarial attack to be easy to apply, as complex manipulations of the audio are impractical for live speech processing. To address this, we use a prepend adversarial attack form, where the attack requires only a short acoustic adversarial segment to be prepended to the input speech. Learning a different adversarial manipulation for each new speech signal is impractical, especially for live streamed audio. Therefore, our threat model aims to learn a universal audio manipulation—a single short acoustic adversarial segment that can be prepended to any speech signal to achieve the desired model control. While we allow whitebox access (gradient access to the model) for learning the universal attack segment, it must be universally applicable to unseen speech signals.

4.2 Attack Method

To perturb a speech signal $\mathbf{x}=x_{1:N}$, we prepend a short adversarial audio segment of $T$ frames, $\tilde{\mathbf{x}}=\tilde{x}_{1:T}$. The perturbed speech signal becomes $\tilde{\mathbf{x}}\oplus\mathbf{x}$, where $\oplus$ represents concatenation in the raw audio space. To override Whisper’s ‘transcribe’ setting with ‘translate’, we need to find the optimal adversarial audio segment, $\hat{\tilde{\mathbf{x}}}$, that maximizes the likelihood of generating the translated sequence, despite Whisper running in transcribe mode:

$$\hat{\tilde{\mathbf{x}}}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}}P(\mathbf{y}^{(\texttt{tl})}|\tilde{\mathbf{x}}\oplus\mathbf{x},\mathcal{T}=\texttt{tc}), \tag{2}$$

where

$$\mathbf{y}^{(\texttt{tl})}=\operatorname*{arg\,max}_{\mathbf{y}}P(\mathbf{y}|\mathbf{x},\mathcal{T}=\texttt{tl}). \tag{3}$$

To learn a universal acoustic attack, we adapt Equation 2 to maximize the likelihood of generating the translated sequence over a training set of samples, $\{\mathbf{x}_{j}\}_{j=1}^{J}$,

$$\hat{\tilde{\mathbf{x}}}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}}\prod_{j}P(\mathbf{y}_{j}^{(\texttt{tl})}|\tilde{\mathbf{x}}\oplus\mathbf{x}_{j},\mathcal{T}=\texttt{tc}). \tag{4}$$

To maximize the likelihood of Equation 4, standard gradient-based approaches can be used to update $\tilde{\mathbf{x}}$. It is important for the adversarial audio segment generated to be somewhat imperceptible such that it is not flagged as suspicious when prepended to natural speech signals. We achieve this imperceptibility in two ways: by ensuring the adversarial audio segment is short in duration and by limiting its ‘power’ or amplitude relative to natural speech. To limit the power, we introduce a constraint in the optimization objective of Equation 4 that limits the amplitude of the adversarial audio,

$$\|\hat{\tilde{x}}_{1:T}\|_{\infty}\leq\epsilon, \tag{5}$$

where $\|\cdot\|_{\infty}$ represents the l-infinity norm. During the gradient-based learning of the adversarial audio segment $\hat{\tilde{\mathbf{x}}}$, this l-infinity norm constraint is incorporated by clamping the values at $\epsilon$ [35]. In our experiments (Section 5) we explore the impact of varying these two imperceptibility constraints on the efficacy of the control-attack.
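The optimization in Equations 4 and 5 can be sketched as projected gradient ascent on the summed log-likelihood, with the segment clamped to the range from -epsilon to +epsilon after every step. The example below is only a toy stand-in under stated assumptions: instead of Whisper, a hypothetical logistic scorer `toy_log_likelihood` plays the role of the log-likelihood of the translated sequence under the transcribe setting, with a fixed weight vector and one bias per training sample. The weights, biases, step size and number of steps are invented for illustration and are not the settings used in our experiments.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def toy_log_likelihood(segment, weights, bias):
    """Hypothetical stand-in for log P(y_tl | segment prepended to x_j, T=tc)."""
    z = sum(w * s for w, s in zip(weights, segment)) + bias
    return math.log(sigmoid(z))

def toy_gradient(segment, weights, bias):
    """Gradient of log(sigmoid(z)) with respect to the segment: (1 - sigmoid(z)) * w."""
    z = sum(w * s for w, s in zip(weights, segment)) + bias
    return [(1.0 - sigmoid(z)) * w for w in weights]

def learn_universal_segment(weights, sample_biases, epsilon, steps=200, lr=0.05):
    """Projected gradient ascent on the sum of per-sample log-likelihoods."""
    segment = [0.0] * len(weights)
    for _ in range(steps):
        # Sum gradients over the training set (the log of the product in Equation 4).
        grad = [0.0] * len(weights)
        for bias in sample_biases:
            for i, g in enumerate(toy_gradient(segment, weights, bias)):
                grad[i] += g
        # Ascent step, then clamp every value to the l-infinity ball of Equation 5.
        segment = [max(-epsilon, min(epsilon, s + lr * g)) for s, g in zip(segment, grad)]
    return segment

weights = [1.0, -2.0, 0.5, 3.0]        # made-up scorer weights, one per segment frame
sample_biases = [-4.0, -3.0, -5.0]     # each toy sample starts far from the target task
epsilon = 0.2

segment = learn_universal_segment(weights, sample_biases, epsilon)
before = sum(toy_log_likelihood([0.0] * 4, weights, b) for b in sample_biases)
after = sum(toy_log_likelihood(segment, weights, b) for b in sample_biases)
```

In this toy run every coordinate of `segment` ends at +epsilon or -epsilon, because the gradient sign never changes, and `after` exceeds `before`. A single segment is shared across all samples, which is what makes the attack universal.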

5 Experiments

5.1 Experimental Setup

Attack Configuration. As described in Section 4, we learn a universal adversarial acoustic segment that can be prepended to any input speech signal to control Whisper to perform speech translation, despite Whisper being deployed in the speech transcription setting. We present experimental results for Whisper medium (769M parameters). Smaller models are not considered, as we found that their lower speech translation performance limits the potential of the model-control attack (the attack assumes the model is able to perform the target task). Following the imperceptibility definitions given in Section 4.2, we learn universal adversarial acoustic segments of three different strengths: 1) a weak attack (attack-w) with the harshest constraints of a maximum amplitude $\epsilon=0.02$ and an audio length of 10,240 frames, equivalent to 0.64 seconds for audio sampled at 16 kHz; 2) a mid-strength attack (attack-m) with $\epsilon=0.2$ and length 0.64 seconds; and 3) a strong attack (attack-s) with $\epsilon=2.0$ and length 2.56 seconds.
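The relation between segment length in frames and in seconds is simple arithmetic at the stated 16 kHz sampling rate; the short sketch below makes it explicit. The names `ATTACKS`, `frames_to_seconds` and `seconds_to_frames` are hypothetical, and the frame count for the strong attack is derived here from the stated 2.56 seconds rather than quoted from the setup.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the attack configuration

def frames_to_seconds(num_frames: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a segment of num_frames samples."""
    return num_frames / sample_rate_hz

def seconds_to_frames(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a segment of the given duration."""
    return round(seconds * sample_rate_hz)

# (maximum amplitude epsilon, segment length in seconds) for the three attack strengths.
ATTACKS = {
    "attack-w": (0.02, 0.64),
    "attack-m": (0.2, 0.64),
    "attack-s": (2.0, 2.56),
}

weak_seconds = frames_to_seconds(10_240)                 # weak and mid attacks: 0.64 s
strong_frames = seconds_to_frames(ATTACKS["attack-s"][1])  # derived: 40,960 frames
```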

Data. To illustrate the impact of the universal model-control attack to force Whisper to translate instead of transcribe, we use the popular Few-shot Learning Evaluation of Universal Representation of Speech (FLEURS) dataset [36]. It is an n-way parallel speech dataset in 102 languages, with 12 hours of speech per language. We select the French-English language pair as the primary development set, for which we present the results and analysis of the weak, mid and strong attacks in Section 5.2. To assess the generalizability of the universal attack method to other language pairs, we carry out an ablation study for the following other aligned language pairs: German-English (de-en), Russian-English (ru-en) and Korean-English (ko-en). The train splits of each dataset are used to learn the universal attack, and experimental results are reported on the unseen test splits. As the Whisper model is only trained to perform speech translation from a language X to English, in our experiments audio in language X (where X is not en) is input to Whisper and the aim of the attack is to make Whisper translate it to English.

Metrics. We use standard ASR and speech translation metrics to measure the impact of the attack: Word Error Rate (WER), the BLEU score [37], and the COMET score [38]. Finally, as the aim of the attack is to cause Whisper to translate the input audio to English, we also report the average (across the dataset) probability of English, P(en), as given by Google’s LangDetect model [39].

5.2 Attack Results and Analysis

The model-control universal acoustic segments are trained such that they can be prepended to any input speech signal and cause Whisper to perform speech translation despite being set to perform speech transcription. Using French-English (fr-en) as the primary language pair for evaluating the model-control attack, Table 1 presents the impact of the model-control attacks of each strength. For reference, performance with respect to the English reference translations is also given when Whisper is run with no attack in transcription mode (expected to be low, as Whisper transcribes in the source audio language, French) and when Whisper is run with no attack in translation mode (the upper bound on performance for the attack). As the attack strength is increased from weak to strong, the attack performance approaches the translation mode upper bound. This demonstrates that the model-control attack can be successful in overriding the transcription mode to cause Whisper to perform speech translation instead.

| Attack | Mode | WER↓ | BLEU↑ | COMET↑ | P(en)↑ |
|---|---|---|---|---|---|
| No Attack | tc | 115 | 0.42 | 14.9 | 0.0 |
| No Attack | tl | 55.3 | 17.4 | 51.0 | 100 |
| Attack-w | tc | 91.5 | 8.63 | 16.5 | 42.9 |
| Attack-m | tc | 79.4 | 11.0 | 32.9 | 62.4 |
| Attack-s | tc | 57.6 | 17.5 | 47.4 | 98.2 |

Table 1: Model-control attack on fr-en: universal prepend attack to override Whisper’s transcribe (tc) mode and perform speech translation (tl). Input audio is French and metrics are evaluated with respect to reference English translations.

Although the model-control attack is highly effective in forcing Whisper to perform speech translation, it does not perfectly match the upper-bound performance of Whisper running freely in translation mode (Table 1). This could be because the attack fails to make Whisper enter translation mode for specific input speech signals, or because the attack leads Whisper to generate lower-quality translations than when it is run directly in translation mode. We next investigate the cause of the discrepancy. For each attack strength, we compute BLEU-recall curves for samples on which the universal model-control attack is ‘successful’ and ‘unsuccessful’ (Figure 2; identical trends were observed with COMET-recall curves). A sample is classified as successfully attacked if $P(\text{en})>\tau$. Conversely, a sample is classified as a ‘failed’ attack if $P(\text{en})<\tau$. We sweep the threshold $\tau$ from 0 to 100% and compute BLEU scores for both the successfully and unsuccessfully attacked samples recalled at each value of $\tau$. These curves help explain the quality of the translations generated by the model when attacked, separately for when the attack succeeds or fails to make Whisper generate English tokens (measured by P(en)).
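The success/fail split behind the recall curves can be sketched as follows. This is a simplified illustration under stated assumptions: the per-sample P(en) values and quality scores are invented, the function names are hypothetical, and the mean of a per-sample score stands in for corpus-level BLEU purely to keep the example self-contained.

```python
def split_by_threshold(samples, tau):
    """Partition (p_en, score) pairs into successful (p_en > tau) and failed (p_en < tau)."""
    success = [s for s in samples if s[0] > tau]
    fail = [s for s in samples if s[0] < tau]
    return success, fail

def recall_curve(samples, thresholds):
    """For each tau: (tau, fraction recalled as successful, mean score of those samples)."""
    curve = []
    for tau in thresholds:
        success, _ = split_by_threshold(samples, tau)
        recall = len(success) / len(samples)
        mean_score = sum(score for _, score in success) / len(success) if success else None
        curve.append((tau, recall, mean_score))
    return curve

# Invented (P(en) in %, quality score) pairs with a bi-modal P(en) distribution.
samples = [(99.0, 18.0), (98.5, 17.0), (99.5, 19.0), (0.5, 0.4), (1.0, 0.2)]
curve = recall_curve(samples, thresholds=[0, 25, 50, 75])
```

With a bi-modal P(en) distribution like the invented one above, the recalled fraction and the mean score jump once as the threshold leaves zero and then stay flat over the remaining thresholds.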

(a) BLEU-recall success
(b) BLEU-recall fail
Fig. 2: BLEU performance for recalled samples, where samples are recalled according to whether the model-control attack is successful or fails, as per P(en). Curves are for fr-en data samples.

From Figure 2, in general the greater the probability of English in the generated translation, the higher the quality of the translation as measured by BLEU. However, the curves also reveal a more interesting behaviour of the attacks: the success recall curves (Figure 2a) have an observable discontinuity in the BLEU scores when 43%, 62% and 98% of the samples are ‘successfully’ attacked for the weak, mid and strong attacks respectively, and conversely the fail recall curves (Figure 2b) have a discontinuity when approximately 57%, 38% and 2% of samples ‘fail’ to be attacked. These percentages almost perfectly align with the average probability of English given by each attack in Table 1. This suggests that the attacks result in a highly bi-modal response, where the attack on a specific input audio sample is either perfectly successful in causing Whisper to generate English text or perfectly unsuccessful, with near 0% probability of English. This bi-modal split is further verified by the distribution of P(en) for each attack in Figure 3. When Whisper is running in transcribe (tc) mode, its output is entirely non-English (left-most bar), whilst when Whisper is running in translate (tl) mode, it generates entirely English text (right-most bar). For all of the attacks in Figure 3, however, we observe the above-mentioned bi-modal pattern: the model-control attacks do not result in a gradual distributional shift from French transcriptions to English translations, but instead display binary success. The stronger the attack, the greater the proportion of samples that are ‘successfully’ attacked (a greater fraction of samples on the right side), resulting in almost complete English text. On the whole, the universal model-control attack, when successful, causes Whisper to behave as though it is in translation mode, but when the attack is unsuccessful Whisper continues to behave as set in its default transcription mode; there are no partial modes. It is the strictness of the imperceptibility constraints (the strength of the attack) that dictates the fraction of samples on which the universal model-control attack is successful.

Fig. 3: P(en) (%) distribution

5.3 Language Ablation

In this section we assess the portability of the model-control attack method to language pairs beyond French-English, where the aim as before is to learn a universal acoustic segment that can be prepended to any input speech signal and cause Whisper to carry out speech translation (to English), despite being deployed in speech transcription mode. We consider the strong attack setting in this section.

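| Language pair | Attack | Mode | WER↓ | BLEU↑ | COMET↑ | P(en)↑ |
|---|---|---|---|---|---|---|
| French-English (fr-en) | No Attack | tc | 115 | 0.42 | 14.9 | 0.0 |
| French-English (fr-en) | No Attack | tl | 55.3 | 17.4 | 51.0 | 100 |
| French-English (fr-en) | Attack | tc | 57.6 | 17.5 | 47.4 | 98.2 |
| German-English (de-en) | No Attack | tc | 101 | 0.29 | 3.24 | 0.0 |
| German-English (de-en) | No Attack | tl | 53.7 | 18.9 | 45.0 | 99.9 |
| German-English (de-en) | Attack | tc | 78.6 | 13.0 | 23.8 | 95.5 |
| Russian-English (ru-en) | No Attack | tc | 101 | 0.10 | 0.16 | 0.0 |
| Russian-English (ru-en) | No Attack | tl | 61.6 | 14.6 | 40.6 | 99.9 |
| Russian-English (ru-en) | Attack | tc | 96.8 | 10.4 | 16.8 | 95.4 |
| Korean-English (ko-en) | No Attack | tc | 99.8 | 0.00 | 0.26 | 0.3 |
| Korean-English (ko-en) | No Attack | tl | 72.8 | 9.68 | 34.9 | 98.7 |
| Korean-English (ko-en) | Attack | tc | 93.6 | 8.46 | 18.4 | 98.1 |

Table 2: Model-control attack (strong) on fr/de/ru/ko-en: universal prepend attack to override Whisper’s transcribe (tc) mode and perform speech translation (tl). Input audio is in the source language and metrics are evaluated with respect to reference English translations.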

Table 2 presents the impact of the universal model-control attack for the new language pairs: German-English (de-en), Russian-English (ru-en) and Korean-English (ko-en). On the whole, the attack appears to be fairly successful for all language pairs in forcing Whisper to generate English translations, as indicated by the average probability of English, P(en), rising from near 0% to above 95% for all language pairs. However, the BLEU and COMET scores suggest that the quality of translations with the attack (relative to the upper-bound performance given by no attack in tl mode) for de/ru/ko is not as high as for fr, where the strong attack closely approached the upper-bound performance (Table 1). The gap between the strong attack performance and the upper-bound performance for de/ru/ko can be explained by decomposing the WER between the generated translations. Table 3 presents the WER decomposition between translations generated under the attack (with Whisper in transcribe mode) and the no-attack translations. A significant contributor to the increase in WER for de/ru/ko relative to fr is a large increase in the insertion rate. This suggests that the attack causes Whisper to hallucinate. As an example, for Russian the attack causes 167 samples to have the phrase ‘however, it is clear that’ inserted at the beginning of the translation, whereas only 1 sample has this phrase in its translation when there is no attack. It seems that the learned acoustic segment can sometimes act as an acoustic realization of a text sequence, i.e., for some languages the translations generated at inference (free-running) can contain a prepended text sequence before the actual translation, resulting in lower translation quality.

| Lang | ins | del | sub | WER |
|---|---|---|---|---|
| fr-en | 6.7 | 4.2 | 10.4 | 21.3 |
| de-en | 18.9 | 14.3 | 18.3 | 51.3 |
| ru-en | 32.7 | 10.7 | 12.5 | 55.9 |
| ko-en | 31.7 | 10.9 | 21.5 | 64.1 |

Table 3: Word Error Rate (insertions, deletions and substitutions) between attack-strong on Whisper running in transcribe mode and no attack with Whisper running in translate mode.
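To make the decomposition concrete, the following minimal sketch counts insertions, deletions and substitutions with a word-level Levenshtein alignment and reports them as percentages of the reference length. It is a generic implementation written for illustration, not the scoring tool used for Table 3; the function name `wer_decomposition` and the example sentences are hypothetical, with the no-attack translation playing the role of the reference.

```python
def wer_decomposition(reference: str, hypothesis: str) -> dict:
    """Word-level Levenshtein alignment returning ins/del/sub/WER as % of reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # cost[i][j] = (total edits, insertions, deletions, substitutions) for ref[:i] -> hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, j, 0, 0)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # match, no edit
                continue
            t, a, d, s = cost[i - 1][j - 1]
            sub = (t + 1, a, d, s + 1)
            t, a, d, s = cost[i][j - 1]
            ins = (t + 1, a + 1, d, s)
            t, a, d, s = cost[i - 1][j]
            dele = (t + 1, a, d + 1, s)
            cost[i][j] = min(sub, ins, dele)      # fewest total edits wins
    total, ins, dele, sub = cost[len(ref)][len(hyp)]
    n = len(ref)
    return {"ins": 100 * ins / n, "del": 100 * dele / n,
            "sub": 100 * sub / n, "wer": 100 * total / n}

# Invented example: the attacked output prepends a hallucinated phrase to the translation.
no_attack = "the cat sat on the mat"
attacked = "however it is clear that the cat sat on the mat"
result = wer_decomposition(no_attack, attacked)
```

In this invented example all five errors are insertions, so the insertion rate alone accounts for the WER, mirroring the effect of a prepended hallucinated phrase described above.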

Next, to verify that the attacks on de/ru/ko display the same bi-modal behaviour as the attack on fr-en, we consider the BLEU-recall curves for each language pair in Figure 4. As is most easily observed in Figure 4b, the discontinuity in BLEU performance occurs, as before with fr-en, at the percentage of samples corresponding approximately to the average probability of not English (from Table 2), which lies between 2% and 5% for the different languages. This verifies that the same trend holds for all languages: the model-control attack is highly binary, either successfully overriding Whisper to translate the source sample completely into English, or failing completely, such that the source audio is not translated at all.

(a) BLEU-recall success
(b) BLEU-recall fail
Fig. 4: BLEU performance for recalled samples, where samples are recalled according to whether the model-control attack (strong) is successful or fails, as per P(en).

6 Conclusion

This work reveals a significant vulnerability in multi-tasking speech-enabled foundation models, specifically through model-control adversarial attacks. Our study demonstrates that it is possible to manipulate the task setting of such models, as evidenced by our experiments with OpenAI’s Whisper model. By prepending a short universal adversarial acoustic segment to any input speech signal, we were able to override the model’s prompt setting and force it to perform speech translation instead of its default speech transcription. An intriguing aspect is the binary nature of the attack’s success. The universal adversarial acoustic segment either successfully manipulates Whisper to operate in its translation mode, or it fails entirely, resulting in Whisper behaving in its default transcription mode. This bi-modal pattern indicates that the attack does not create intermediate operational states, but rather a clear switch between transcription and translation. The strictness of the imperceptibility constraints dictates the fraction of samples that the universal model-control attack switches. The risk of model-control attacks highlights the need for increased security measures in the deployment of flexible ASR systems. As these models are designed to perform an increasing number of tasks within a single framework, they become more susceptible to such adversarial manipulations. We encourage future work on developing robust defenses against model-control adversarial attacks.

References

  • [1] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022.
  • [2] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2024.
  • [3] Yuan Gong and Christian Poellabauer, “Crafting adversarial examples for speech paralinguistics applications,” CoRR, vol. abs/1711.03280, 2017.
  • [4] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet, “Houdini: Fooling deep structured prediction models,” 2017.
  • [5] Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, and Hung-yi Lee, “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” 2024.
  • [6] Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu, “X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages,” 2023.
  • [7] Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, and Yu Wu, “On decoder-only architecture for speech-to-text and large language model integration,” 2023.
  • [8] Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, “Connecting speech encoder and large language model for asr,” 2023.
  • [9] Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, and Mike Seltzer, “Prompting large language models with speech recognition abilities,” 2023.
  • [10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. 2017, vol. 30, Curran Associates, Inc.
  • [11] Elena Rastorgueva and Nithin Koluguri, “New standard for speech recognition and translation from the nvidia nemo canary model,” Apr 2024.
  • [12] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter, “Commandersong: A systematic approach for practical adversarial voice recognition,” 2018.
  • [13] Nicholas Carlini and David A. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” CoRR, vol. abs/1801.01944, 2018.
  • [14] Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Li Chen, Michael E. Kounavis, and Duen Horng Chau, “ADAGIO: interactive experimentation with adversarial attack and defense for audio,” CoRR, vol. abs/1805.11852, 2018.
  • [15] Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” 2019.
  • [16] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa, “Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding,” 2018.
  • [17] Lea Schönherr, Katharina Siobhan Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa, “Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding,” ArXiv, vol. abs/1808.05665, 2018.
  • [18] Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian J. McAuley, and Farinaz Koushanfar, “Universal adversarial perturbations for speech recognition systems,” CoRR, vol. abs/1905.03828, 2019.
  • [19] Zhuohang Li, Yi Wu, Jian Liu, Yingying Chen, and Bo Yuan, “Advpulse: Universal, synchronization-free, and targeted audio adversarial attacks via subsecond perturbations,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 2020, CCS ’20, pp. 1121–1134, Association for Computing Machinery.
  • [20] Zhiyun Lu, Wei Han, Yu Zhang, and Liangliang Cao, “Exploring targeted universal adversarial perturbations to end-to-end asr models,” 2021.
  • [21] Yuxuan Chen, Xuejing Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and XiaoFeng Wang, “Devil’s whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices,” in 29th USENIX Security Symposium (USENIX Security 20). Aug. 2020, pp. 2667–2684, USENIX Association.
  • [22] Wenshu Fan, Hongwei Li, Wenbo Jiang, Guowen Xu, and Rongxing Lu, “A practical black-box attack against autonomous speech recognition model,” in GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020, pp. 1–6.
  • [23] Chen Ma, Li Chen, and Jun-Hai Yong, “Simulating unknown target models for query-efficient black-box attacks,” 2021.
  • [24] Moustafa Alzantot, Bharathan Balaji, and Mani B. Srivastava, “Did you hear that? adversarial examples against automatic speech recognition,” CoRR, vol. abs/1801.00554, 2018.
  • [25] Shreya Khare, Rahul Aralikatte, and Senthil Mani, “Adversarial black-box attacks on automatic speech recognition systems using multi-objective evolutionary optimization,” 2019.
  • [26] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri, “Targeted adversarial examples for black box audio systems,” 2019.
  • [27] Tianyu Du, Shouling Ji, Jinfeng Li, Qinchen Gu, Ting Wang, and Raheem Beyah, “Sirenattack: Generating adversarial audio for end-to-end acoustic systems,” 2019.
  • [28] Baolin Zheng, Peipei Jiang, Qian Wang, Qi Li, Chao Shen, Cong Wang, Yunjie Ge, Qingyang Teng, and Shenyi Zhang, “Black-box adversarial attacks on commercial speech platforms with minimal information,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. Nov. 2021, CCS ’21, ACM.
  • [29] V Raina, MJF Gales, and K Knill, “Universal adversarial attacks on spoken language assessment systems,” Interspeech, 2020.
  • [30] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou, “Hidden voice commands,” in 25th USENIX Security Symposium (USENIX Security 16), Austin, TX, Aug. 2016, pp. 513–530, USENIX Association.
  • [31] Guoming Zhang, Chen Yan, Xiaoyu Ji, Taimin Zhang, Tianchen Zhang, and Wenyuan Xu, “Dolphinattack: Inaudible voice commands,” CoRR, vol. abs/1708.09537, 2017.
  • [32] Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin R. B. Butler, and Joseph Wilson, “Practical hidden voice attacks against speech and speaker recognition systems,” 2019.
  • [33] Raphael Olivier and Bhiksha Raj, “There is more than one kind of robustness: Fooling whisper with adversarial examples,” 2023.
  • [34] Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, and Mark Gales, “Muting whisper: A universal acoustic adversarial attack on speech foundation models,” 2024.
  • [35] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” 2019.
  • [36] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” arXiv preprint arXiv:2205.12446, 2022.
  • [37] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Annual Meeting of the Association for Computational Linguistics, 2002.
  • [38] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie, “Comet: A neural framework for mt evaluation,” 2020.
  • [39] Nakatani Shuyo, “Language detection library for java,” 2010.