MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models

Sensors (Basel). 2025 Jan 5;25(1):258. doi: 10.3390/s25010258.

Abstract

Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and adversarial robustness of visual language models. Extensive experiments on three datasets (ϵ = 4/255) show significant improvements: compared with traditional manually designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. Moreover, our method retains strong performance gains under different attack methods. Under our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, across three different attacks, compared with traditional manual prompts.
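
To make the core idea concrete, the following is a minimal sketch of text-guided deep visual prompting as described above: learnable text prompt tokens are projected into the visual token space and prepended to the patch tokens at several transformer depths, while the CLIP backbone stays frozen. All names and hyperparameters here (TextGuidedVisualPrompts, prompt_len, the per-layer linear projections) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of text-guided deep visual prompt tuning (assumed design, not the
# authors' exact MDAPT code). Text prompts "guide" visual prompts via a
# learned per-depth projection.
import torch
import torch.nn as nn

class TextGuidedVisualPrompts(nn.Module):
    """Generates per-layer visual prompt tokens from learnable text prompts."""

    def __init__(self, prompt_len=4, text_dim=512, vis_dim=768, depth=12):
        super().__init__()
        # Learnable text prompt tokens, one set per transformer depth.
        self.text_prompts = nn.Parameter(
            torch.randn(depth, prompt_len, text_dim) * 0.02
        )
        # One linear map per depth projects text prompts into the visual
        # token space, so the text side guides the visual prompts.
        self.to_visual = nn.ModuleList(
            nn.Linear(text_dim, vis_dim) for _ in range(depth)
        )

    def forward(self, layer_idx, batch_size):
        # (prompt_len, vis_dim) -> (batch, prompt_len, vis_dim)
        vis = self.to_visual[layer_idx](self.text_prompts[layer_idx])
        return vis.unsqueeze(0).expand(batch_size, -1, -1)

# Usage inside a frozen ViT encoder: before each of the first `depth`
# blocks, concatenate the generated visual prompts to the patch tokens.
prompts = TextGuidedVisualPrompts()
x = torch.randn(8, 197, 768)             # a batch of ViT token sequences
vp = prompts(layer_idx=0, batch_size=8)  # (8, 4, 768) visual prompt tokens
x = torch.cat([vp, x], dim=1)            # tokens fed into transformer block 0
```

In an adversarial prompt-tuning setup of this kind, only the prompt parameters would be optimized on adversarially perturbed inputs (e.g., generated by PGD within the stated ϵ = 4/255 budget), leaving the pre-trained CLIP weights untouched.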

Keywords: adversarial robustness; multi-modal; prompt tuning; visual language models.