Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection

Ye Jiang Yimin Wang

Abstract

Large visual-language models (LVLMs) exhibit exceptional performance in visual-language reasoning across diverse cross-modal benchmarks. Despite these advances, recent research indicates that Large Language Models (LLMs), like GPT-3.5-turbo, underachieve compared to well-trained smaller models, such as BERT, in Fake News Detection (FND), prompting inquiries into LVLMs’ efficacy in FND tasks. Although performance could improve through fine-tuning LVLMs, the substantial parameters and requisite pre-trained weights render it a resource-heavy endeavor for FND applications. This paper initially assesses the FND capabilities of two notable LVLMs, CogVLM and GPT4V, in comparison to a smaller yet adeptly trained CLIP model in a zero-shot context. The findings demonstrate that LVLMs can attain performance competitive with that of the smaller model. Next, we integrate standard in-context learning (ICL) with LVLMs, noting improvements in FND performance, though limited in scope and consistency. To address this, we introduce the In-context Multimodal Fake News Detection (IMFND) framework, enriching in-context examples and test inputs with predictions and corresponding probabilities from a well-trained smaller model. This strategic integration directs the LVLMs’ focus towards news segments associated with higher probabilities, thereby improving their analytical accuracy. The experimental results suggest that the IMFND framework significantly boosts the FND efficiency of LVLMs, achieving enhanced accuracy over the standard ICL approach across three publicly available FND datasets.

keywords:

large visual-language model , multimodal fusion , fake news detection , in-context learning

\affiliation

[label1]organization=College of Information Science and Technology,addressline= Qingdao University of Science and Technology, country=China

\affiliation

[label2]organization=College of Data Science,addressline= Qingdao University of Science and Technology, country=China

1 Introduction

The rapid dissemination of fake news across online platforms has presented tangible challenges [1]. Among the strategies to address these challenges, automated fake news detection (FND), designed to autonomously identify inaccurate or deceptive news articles from authentic ones, has proven to be a viable solution in previous studies [2, 3, 4].

Traditional methods predominantly depend on modeling the textual semantics of news articles, and have revealed inadequacies in addressing the multi-faced nature of fake news [5]. Therefore, multimodal approaches have incorporated pre-trained small¹¹1Here ‘small’ refers to models with fewer parameters compared to larger language models, like GPT-3 family. language models (SLMs, e.g., BERT [6]) with visual models (VMs, e.g., ResNet [7]) for establishing cross-modal representation, resulting in a more thorough performance for FND [8, 9]. While SLMs are normally pre-trained on open-domain datasets, for example, BERT was originally trained on Wikipedia, they are inherently limited in processing fake news that demands expertise beyond their training scope, and so as that of in the VMs [10].

Recent developments in large visual-language models (LVLMs), trained on extensive collections of text-image pairs covering a broad spectrum of global knowledge [11, 12], have shown a profound comprehension of cross-modal reasoning and demonstrated proficiency in identifying commonplace occurrences [13]. However, LVLMs are deficient in forgery-specific knowledge [14, 15], which undermines their efficacy in FND. This limitation could be ascribed to (1) media manipulation disrupting the coherence between various modalities, thereby leading to semantic inconsistencies [16]; and (2) the tendency of altered images to exhibit noticeable artifact indications [17]. Meanwhile, the pre-trained weights of LVLMs, like GPT-4, are normally inaccessible. The immense size of LVLMs also makes it challenging to fine-tune for specific tasks even if the model checkpoints are released.

Refer to caption — Figure 1: Illustration of multimodal fake news detection from LVLMs. (a) Zero-shot prompting LVLMs typically struggle to verify the truthfulness of news. (b) Traditional in-context approach could stimulate LVLMs to focus on verifying fake news by incorporating a few text-image pairs with their labels. (c) The workflow of IMFND integrates the predictions and their probabilities from a smaller model (i.e., CLIP) into the test inputs, indicating more robust performance in multimodal FND.

Alternatively, in-context learning (ICL) empowers LVLMs to acquire the capability to perform downstream tasks through conditioning on prompts composed of input-output samples, circumventing the need for explicit gradients updating [18]. Although previous studies [19, 14] have suggested that large language models (LLMs) might underperform the traditional SLMs in textual-based FND, the potential of ICL in aiding LVLMs for FND remains under-explored. To address the above limitations, this paper presents a framework In-context Multimodal Fake News Detection (IMFND), aimed at exploring three pivotal research questions:

1.

What is the effectiveness of LVLMs in identifying multimodal fake news?
2.

Can ICL enhance the FND performance by utilizing the intrinsic knowledge and capabilities of LVLMs ?
3.

What approaches should ICL adopt to improve LVLMs’ performance in multimodal FND?

We conduct a comprehensive evaluation of LVLMs within the scope of few-shot multimodal FND to answer the above questions. The overall workflow of IMFND is shown in Figure 1. The IMFND framework facilitates collaboration between LVLMs (i.e., GPT-4 [20] and CogVLM [21]) and a locally well-trained, smaller multimodal model (i.g., CLIP [22]).

Specifically, we first evaluate the zero-shot performances of LVLMs against that of the smaller model which is well-trained on FND data. Through the subsequent induction of LVLMs with zero-shot prompting, as shown in Figure 1(a), we find that LVLMs exhibit inherent unconfidence in responding to inquiries concerning authenticity verification, but they still maintain competitive effectiveness in comparison to smaller yet well-trained model. Subsequently, we implement an in-context strategy, as shown in Figure 1(b), by concatenating the input text with ground-truth labels in the few-shot samples, thereby enriching the contextual framework for LVLMs. We find the in-context approach can stimulate the rationales reasoning abilities of LVLMs, and thereby enhance the FND performance, but is limited in scope and consistency. Finally, we propose the IMFND framework to robust the FND performance by integrating prediction with its probability from the smaller model into the input text, as shown in Figure 1(c).

We extensively conduct experiments by randomly sampling five seeds of n-shot examples, and evaluate the effectiveness of IMFND on three benchmark FND datasets. The experimental results show that IMFND significantly outperforms smaller, yet well-trained models, as well as remain robust performances in few-shot settings.

The main contributions of this paper can be summarized as:

1.

Introduction of an In-context Multimodal Fake News Detection (IMFND) framework that can be easily adapted to the few-shot multimodal FND by bridging the common knowledge from LVLMs to the FND-specific data.
2.

IMFND augments the robustness of the in-context approach by integrating the predictions and their probabilities from the smaller model into the context, allowing the LVLMs to concentrate more effectively on the critical segments of a news article.
3.

Experimental results indicate that the in-context approach can significantly stimulate the FND ability of LVLMs. Meanwhile, IMFND enables LVLMs to achieve competitive performances over three FND datasets, and also provides reasonable and informative rationales for its decision-making.

2 Related work

2.1 Automatic fake news detection

Previous studies on FND can be classified into three primary categories: image-based, text-based and multimodal approaches.

Image-based approach aims to utilize traces of edits to ascertain the authenticity of visual content. Certain research focuses on the spatial domain, identifying artifact traces using techniques such as blending [17], achieving patch consistency [23], and implementing reconstruction methods [24].

Text-based methods have concentrated on the analysis of textual statistical features, such as length, punctuation, and exclamation marks [25], along with metadata attributes like likes and shares [26, 27], to manually detect fake news. Recently, deep learning strategies have predominantly employed BiLSTM [28, 29], GNNs [30], and pre-trained models [31, 32, 33, 34] for text feature analysis.

Multimodal approaches amalgamate cross-modal features to construct semantic representations. For example, MCAN [8] utilizes several co-attention layers to enhance the integration of textual and visual features for FND. CAFE [9] measures cross-modal ambiguity by evaluating the Kullback-Leibler (KL) divergence between distributions of unimodal features. LIIMR [35] identifies the modality with higher confidence for FND. COOLANT [36] aims to refine the alignment between image and textual representations, applying contrastive learning to achieve precise semantic alignment and cross-modal fusion for understanding inter-modality correlations. However, the efficacy of such methodologies is constrained due to their reliance on substantial volumes of annotated data and their typical training within isolated systems. This overlooks the potential benefits of assimilating and leveraging the expansive knowledge encompassed within LVLMs.

2.2 In-context learning for large visual-language models

LLMs [37, 20] have demonstrated exceptional capabilities in a range of downstream tasks. Recently, efforts have been made to expand LLMs’ abilities to interpret visual signals. For example, LLaVA [12] and Mini-GPT4 [38] initially enable the alignment of image-text features, subsequently optimizing for visual instruction tuning. PandaGPT [39] utilizes a straightforward linear layer to connect ImageBind [40] with the Vicuna model [41], thereby facilitating multimodal inputs. CogVLM [21] narrows the divide between the frozen pre-trained language model and the image encoder through a trainable visual expert module integrated within the attention and Feedforward Neural Network (FFN) layers.

Although LVLMs have achieved significant advances on diverse cross-modal benchmarks, recent studies [19, 14] suggest that LLM, such as GPT-3.5, still underperforms well-trained smaller models, such as BERT, in FND task, raising the question of whether LVLMs similarly fall short in FND performance. However, while fine-tuning LVLMs could enhance performance, the extensive parameters necessary for LVLMs make it a resource-intensive process for FND.

ICL, as an alternative, utilizes LLMs for new tasks without modifying the model’s parameters and was first introduced in the GPT-3 model [18]. It has been successfully applied to downstream tasks, including machine translation [42] and data generation [43]. Recent LVLMs, developed on the foundation of LLMs, have also demonstrated this ICL capability [44, 45, 46]. For example, the demonstration of precise labels can significantly affect the performance of ICL in specific contexts [47].

However, ICL on LVLMs varies due to the supplementary visual information in the demonstrations and the variation in model components. In this paper, we focus on ICL applied to LVLMs in the FND task, seeking to ascertain which component of information within the multimodal fake news demonstrations is of paramount importance.

3 In-context multimodal fake news detection

In-context multimodal fake news detection (IMFND) framework includes three phases sequentially: (1) Conducting few-shot training on in-context examples using a smaller multimodal model; (2) Integrating predictions and their probabilities from the pre-trained smaller model into the in-context examples and test inputs; (3) Employing the constructed in-context examples and test inputs with LVLMs for final prediction. The overall workflow is shown in Figure 1(c).

3.1 Few-shot learning with a smaller model

The standard in-context methodology integrates the ground-truth label with the text input directly, encouraging LVLMs to concentrate on data specific to the task. However, the outputs generated by LVLMs remain characterized by a lack of confidence and are frequently accompanied by uncertainty (e.g., ”I’m sorry, I can’t assist with verifying the authenticity of news articles”).

To address this, the IMFND first trains a smaller model to be able to generate the predictions and their probabilities for the in-context examples and test inputs. Specifically, a smaller model learns from $n$ samples of a labeled dataset $(t_{i},m_{i},y_{i})$ , where $t_{i},m_{i}$ are the text and the image inputs respectively and $i\in[1,n]$ (i.e., $n$ samples per class), passing to a pre-trained frozen multimodal encoder. Inspired by [48], which posits that few-shot samples from a singular modality may inadequately encapsulate the entirety of a concept class, each training sample is augmented to encompass five distinct feature representations: 1) a solely text feature, denoted as $f_{t}$ ; 2) a solely image feature, denoted as $f_{m}$ ; 3) the aggregation of L2-normalized features $f_{c}=[f_{t}\oplus f_{m}]$ , where $\oplus$ denotes the operation of concatenation; 4) an image-to-text cross-attention feature, denoted as $f_{mt}$ ; 5) a text-to-image cross-attention feature, denoted by $f_{tm}$ .

The cross-attention $CrossAtt()$ involves interchanging the textual query $f_{t}$ with the imagery query $f_{m}$ to derive the cross-attended feature $f_{mt}$ is, and is defined as:

f_{mt}=CrossAtt_{m\rightarrow t}(f_{m},f_{t},f_{t})=softmax(\frac{f_{m}f_{t}^{% T}}{\sqrt{d}})f_{t}

(1)

Conversely, interchanging the imagery query $f_{m}$ with the textual query $f_{t}$ facilitates the acquisition of the cross-attended feature $f_{tm}$ as outlined:

f_{tm}=CrossAtt_{t\rightarrow m}(f_{t},f_{m},f_{m})=softmax(\frac{f_{t}f_{m}^{% T}}{\sqrt{d}})f_{m}

(2)

where $d$ denotes the dimensionality of the smaller model. Subsequently, the multimodal features are comprised of the aforementioned five features, each feature is then passed through a linear classifier MLP to deduce five inferred probabilities. Finally, we concatenate the five deduced probabilities into a singular input for a meta-linear MLP classifier, thereby facilitating the formulation of the final prediction:

y^{\prime}_{i}=softmax(MLP(f_{t}\oplus f_{m}\oplus f_{c}\oplus f_{mt}\oplus f_% {tm})

(3)

The ultimate goal is to allocate a binary classification label of $y_{i}\in\{0,1\}$ , in which 0 denotes real news and 1 denotes fake news. Therefore, the few-shot smaller model aims to minimize the Cross-entropy loss $L$ :

L=-(y_{i}log(y^{\prime}_{i})+(1-y_{i})log(1-y^{\prime}_{i}))

(4)

where $y^{\prime}_{i}$ is the model inference from the meta-linear classifier $MLP$ after Softmax.

3.2 In-context examples reconstruction

Following the training of the smaller model with few-shot data, in-context examples are then reconstructed, aiding LVLMs in acquiring the FND-specific knowledge imparted by the smaller model. Specifically, the well-trained smaller model is initially deployed to generate predictions and their corresponding probabilities for in-context examples, which are comprised of instances randomly selected from the training set. Notably, the predictions and their probabilities, derived from the linear probing of the $f_{t}$ and $f_{m}$ features, are also utilized in formulating in-context examples. Table 1(c) showcases the final prompt example of IMFND, in comparison to the standard ICL prompt example presented in Table 1(a).

Table 1: Example in-context prompts and test prompts of the standard ICL and IMFND.

Prompt types	Examples
(a) ICL	Read this news and its image, do you think this is real or fake news? Just answer if it’s real or fake. This is a ¡label¿ news. ¡image¿ News: ¡text¿.
(b) ICL-test	Read this news and its image, do you think this is real or fake news? Just answer if it’s real or fake. ¡image¿ News: ¡text¿.
(c) IMFND	Read this news and its image, do you think this is real or fake news? Just answer if it’s real or fake. This is a ¡label¿ news. Text classifier prediction: ¡prediction¿ with ¡probability¿ confidence. Image classifier prediction: ¡prediction¿ with ¡probability¿ confidence. Multimodal classifier prediction: ¡prediction¿ with ¡probability¿ confidence. ¡image¿ News: ¡text¿.
(d) IMFND-test	Read this news and its image, do you think this is real or fake news? Just answer if it’s real or fake. Text classifier prediction: ¡prediction¿ with ¡probability¿ confidence. Image classifier prediction: ¡prediction¿ with ¡probability¿ confidence. Multimodal classifier prediction: ¡prediction¿ with ¡probability¿ confidence. ¡image¿ News: ¡text¿.

The constructed in-context examples serve to augment the LVLMs’ comprehension of the interplay between the inputs, ground-truth labels, and the expert knowledge of the smaller model, allowing the LVLMs to confidently produce the final predictions. Meanwhile, probabilities also act as an indicator of the smaller model’s uncertainty regarding its predictions. Integrating probabilities into the context enables the LVLMs to rely on predictions when the smaller model exhibits high confidence and to exercise caution in instances of uncertainty. Additionally, probabilities can direct the LVLM’s focus to more challenging in-context examples, facilitating learning from these complex examples and potentially enhancing FND performance overall.

3.3 Test inputs reconstruction

Following the assembly of the in-context examples, the constructed test input which encompasses both the predicted label and its probability, is concatenated with the test input and forms a comprehensive input for the LVLMs, as shown in Table 1(d). Finally, the complete IMFND algorithm can be summarized in Algorithm 1.

Algorithm 1 IMFND Algorithm

1: Input: Randomly sample

n

training examples

D={(t_{1},m_{1},y_{1}),...,(t_{n},m_{n},y_{n})}

, pre-trained smaller model

C

, LVLM

L

, number of seeds

S

2: for seed in

\{1,2,\dots,S\}

3: for each training example in

D

4: Derive image feature

f_{m}

, text feature

f_{t}

, apply L2 normalization to the concatenated feature

f_{c}

, and obtain cross-attended features

f_{mt}

and

f_{tm}

through linear probing of the smaller model

C

;

5: Combine inferences from the aforementioned features and input them into a meta-linear classifier to generate the FND-adapted smaller model

C^{\prime}

;

6: Utilize

C^{\prime}

to generate

y{{}_{i}}^{\prime}

and

p^{\prime}_{i}

, where

y{{}_{i}}^{\prime}

denotes the predicted label and

p^{\prime}_{i}

represents the associated probabilities;

7: end for

8: Formulate the in-context examples by concatenating

D^{\prime}=((t_{1},m_{1},y{{}_{1}}^{\prime},p{{}_{1}}^{\prime},y_{1}),...,(t_{% n},m_{n},y{{}_{n}}^{\prime},p{{}_{n}}^{\prime},y_{n}))

;

9: for each test example do

10: Predict

y{{}_{v}}^{\prime}

and

p^{\prime}_{v}

by using

C^{\prime}

;

11: Merge the prepared test input

D^{\prime}\oplus(t_{v},m_{v},y{{}_{v}}^{\prime},p^{\prime}_{v})

, where

\oplus

signifies concatenation;

12: Employ the LVLM

L

to predict

y{{}_{v}}

;

13: end for

14: end for

4 Experiments

4.1 Data

To evaluate the FND capabilities of the LVLMs, three publicly accessible datasets have been selected, encompassing a broad spectrum of fake news content and specifically including two languages: English and Chinese.

PolitiFact [2] contains a collection of political news items, each classified as fake or real by specialized assessors, forming a component of the benchmarking initiative FakeNewsNet. By employing the provided data crawling scripts to exclude news items lacking images or possessing invalid image URLs, a total of 198 multimodal news articles are curated.

GossipCop [2] contains entertainment narratives evaluated on a 0 to 10 scale, where narratives with scores below five are designated as fake news according to the creator of FakeNewsNet. Applying the same retrieval techniques to those used for PolitiFact, a compilation of 6,805 news articles is collected.

Weibo [49] represents a dataset derived from Chinese social media platforms, encapsulating a multimodal assortment of fake news that includes both textual and visual elements. Real news articles were collected from a credible outlet (Xinhua News), while fake articles were sourced from Weibopiyao, Weibo’s sanctioned platform for countering rumors, which compiles information via either public contribution or formal discrediting endeavors. Adhering to the pre-processing techniques employed in previous research [36], this process yielded a total of 7,853 Chinese news articles.

Notably, a news article may contain several images. To identify the most relevant image, the cosine similarity between each image and the associated text is computed. The image-text pair exhibiting the greatest similarity is retained, as ascertained by the pre-trained CLIP model. The statistics of the pre-processed dataset are shown in Table 2.

Table 2: The statistics of the pre-processed multimodal fake news datasets. Avg denotes the mean number of tokens per article.

Statistics	PolitiFact	GossipCop	Weibo
Total	198	6,805	7,853
Fake	96	1,877	4,211
Real	102	4,928	3,642
Avg	2,148	728	67

For each dataset, we first randomly stratified split 20% of data as the test set. Next, the remaining 80% of data are further sampled within five random seeds, to evaluate the robustness of the few-shot performances. For each seed, we conduct $n$ -shot classification, where $n\in\{1,3,5\}$ and $n$ is the number of samples per class. Consequently, the efficacy of LVLMs is assessed on the designated 20% test set.

4.2 Large visual-language models

Two publicly accessible LVLMs are utilized in the experiments:

CogVLM [21] is an open-source visual-language model that combines 10 billion visual parameters and 7 billion language parameters. It is designed for image understanding and multi-turn dialogue. We utilize the CogVLM-17B version (i.e., cogvlm-chat-hf)²²2https://huggingface.co/THUDM/cogvlm-chat-hf to conduct the experiments. To facilitate local execution of the experiments within the bounds of hardware limitations, 4-bit quantization was implemented at the time of loading CogVLM, aimed at reducing the memory demands during inference.

GPT-4V [20] enables the GPT to process images and respond to inquiries pertaining to them. The experiments use the GPT-4V model through the official OpenAI API³³3https://platform.openai.com/docs/api-reference.

In preliminary experiments, it is noted that employing higher temperature settings (e.g., 0.8 for LVLMs) led to inadequate classification performance due to the infusion of increased randomness into the LVLM’s responses. Following the hyperparameter settings described by [19], a reduced temperature setting (i.e., 0.2) is utilized for the LVLMs to enhance response stability, thereby augmenting reproducibility.

4.3 Baselines

The few-shot FND performances of IMFND with the above two LVLMs are compared against an unimodal BERT and a multimodal CLIP with linear probing as discussed in Section 3.1:

CLIP-LP is the variant of the CLIP model, which employs pre-trained models from OpenAI CLIP (ViT-B-32) [22] and Chinese CLIP (ViT-B-16) [50] for the extraction of text and image features respectively, and then passing to a linear probing (LP) layer for classification. The dimensionality of the linear probing is set to 512, aligning with the output dimension of the CLIP encoders.

BERT is fine-tuned directly on few-shot examples using bert-base-uncased ⁴⁴4https://huggingface.co/google-bert/bert-base-uncased and bert-base-chinese ⁵⁵5https://huggingface.co/google-bert/bert-base-chinese. The Huggingface Trainer⁶⁶6https://huggingface.co/docs/transformers/main_classes/trainer is utilized to fine-tune the BERT models.

For all baselines, the AdamW optimization is utilized, configured with a learning rate of $1\mathrm{e}{-3}$ and a decay rate of $1\mathrm{e}{-2}$ . Training spans 20 epochs, with the selection of the checkpoint being based on the best test accuracy. An early stopping is applied, utilizing a patience of three epochs. The performance results represent the mean across five seeds, disclosed through both accuracy and macro-f1 metrics.

5 Results

5.1 Performance comparison and ablation study

Table 3: Performance of the IMFND and its variants with different LVLMs in accuracy and Macro-F1 (in bracket). All results are averaged of five seeds. w/o proba denotes the IMFND removes the probabilities from the in-context examples and test inputs, and only uses the predictions from the smaller model. w/o proba

\&

pred is the standard ICL as the examples shown in Table 1(a) and (b). Kühn scores indicate the best performance. Underline scores are the second-best performance.

Method	PolitiFact			GossipCop			Weibo
Method	1	3	5	1	3	5	1	3	5	Avg
BERT	61.9(56.8)	68.8(68.0)	66.8(51.2)	62.0(49.1)	57.8(49.2)	56.9(49.8)	62.1(59.7)	62.0(59.9)	65.4(64.7)	58.1(56.3)
CLIP-LP	58.4(53.9)	62.9(58.6)	74.6(72.5)	51.2(38.7)	65.9(56.0)	63.4(47.9)	54.6(51.6)	55.1(48.1)	58.4(57.5)	60.5(53.9)
CogVLM-IMFND	75.3(75.0)	74.8(74.4)	76.3(75.9)	59.3(59.0)	62.1(62.1)	66.1(65.9)	60.6(45.6)	61.0(51.3)	66.3(56.1)	66.9(62.8)
w/o proba	66.8(66.6)	77.1(77.0)	76.3(75.5)	59.1(58.7)	61.9(60.7)	64.6(64.1)	59.3(38.4)	57.3(36.9)	62.1(55.5)	64.9(59.3)
w/o proba & pred	56.9(55.4)	58.0(55.9)	62.3(60.7)	56.3(54.5)	57.3(55.4)	63.2(62.2)	58.1(42.8)	57.8(42.3)	59.9(49.6)	59.0(53.2)
GPT4V-IMFND	79.1(76.3)	80.1(79.9)	81.3(81.3)	70.1(69.5)	71.3(70.5)	79.7(78.1)	63.9(59.3)	69.6(61.2)	83.2(79.5)	75.5(72.8)
w/o proba	73.2(69.3)	77.3(74.3)	80.9(79.1)	69.2(60.1)	69.5(63.8)	78.2(78.2)	61.3(57.2)	63.8(59.6)	72.8(65.9)	70.9(70.1)
w/o proba & pred	66.1(62.5)	74.5(72.6)	79.6(78.4)	64.9(58.5)	65.4(65.3)	73.2(69.1)	54.9(48.6)	57.3(50.3)	65.5(63.2)	66.8(63.2)

Table 3 presents the accuracy and macro-f1 scores achieved by integrating IMFND with both CogVLM and GPT-4V, compared with the performance of unimodal BERT and multimodal CLIP-LP in few-shot learning settings.

Comparing with the smaller models. The IMFND significantly enhances the FND efficacy of LVLMs over smaller models across various few-shot settings, with notable improvements (i.e., with 15% increase in average accuracy and 16.5% in average macro-f1) observed in GPT4V. Furthermore, it is observed that augmenting the number of in-context examples further amplifies this enhancement, indicating the FND abilities of LVLMs can be stimulated by integrating the additional information into the context.

We also investigate the performances between the smaller models (i.e., BERT vs. CLIP-LP) in the few-shot settings. Interestingly, although CLIP-LP demonstrates superior overall accuracy compared to BERT, BERT surpasses CLIP-LP in all 1-shot settings, particularly evident in the Weibo dataset. This observation suggests a potential detriment to the FND performance induced by the quality of image modality.

Comparing with the LVLMs. To investigate if the IMFND can be adapted to different LVLMs, the performances of the IMFND are also compared among two LVLMs. Generally, the GPT4V-IMFND outperforms the CogVLM-IMFND in all few-shot settings. This might be attributed to 1) the utilization of 4-bit quantization in CogVLM-IMFND for local deployment, which could potentially result in lower precision during prediction; 2) the size of CogVLM (i.e., 10B of vision model plus 7B of language model) is relatively much smaller than that of the GPT4V. Overall, the empirical evidence suggests that the IMFND framework can be effortlessly integrated with various LVLMs, consistently enhancing performance on the FND task.

Ablation study. We conduct several ablation experiments aimed at investigating the impact of key components in IMFND. This assessment involves evaluating the framework’s performance across various complete and partial configurations. In each experiment, IMFND is selectively utilized by systematically removing different components. The results illustrate the performance decay of IMFND in the absence of each component across most configurations, highlighting the significance of each key module within IMFND.

Initially, a minor reduction in both accuracy and macro-F1 score is observed when the prediction probabilities from the smaller model’s in-context examples are eliminated (i.e., w/o proba). This suggests that the prediction probabilities from the smaller model enhance the confidence of LVLMs for decision-making. Further elimination of the smaller model’s predictions from the context (i.e., w/o proba & pred) converts the framework into a standard ICL scenario. This results in a marked performance decline, with accuracy falling by 7.9% and 8.7% in CogVLM and GPT4V respectively, underscoring that the smaller model’s predictions and their probabilities serve as vital supplementary information for enhancing in-context examples.

Additionally, contrary to the findings from [19, 14], the standard ICL setup (i.e., w/o proba & pred) demonstrates commendable effectiveness in the FND task, notably for GPT4V, showcasing improvements of 6.3% and 6.9% in accuracy and macro-F1 score respectively over the highest-performing smaller models. This highlights that LVLMs, equipped with precisely crafted in-context information, can still excel as effective detectors of fake news.

5.2 Zero-shot performances

Table 4: Zero-shot performances comparison. Avg denotes the average scores in accuracy and macro-f1 (in bracket).

Models	PolitiFact		GossipCop		Weibo
Models	Acc	F1	Acc	F1	Acc	F1	Avg
CLIP-LP	56.6	51.9	53.9	48.5	55.6	40.3	55.4(46.9)
CogVLM	62.7	60.7	53.0	48.0	56.7	39.8	57.5(49.5)
GPT4V	66.5	61.7	52.7	39.1	51.5	34.2	56.9(45.0)

Table 4 demonstrates the zero-shot performances in the LVLMs and the CLIP-LP. Notably, the zero-shot application of CLIP-LP involves immobilizing the pre-trained CLIP model and appending a linear projection layer atop the merged text and image features derived from their respective encoders. The zero-shot configuration for LVLMs directly prompts the model with the inquiry: “Read this news and its image, do you think this is real or fake news? Just answer if it’s real or fake. ¡image¿ News: ¡text¿”.

Observations indicate that the LVLMs can achieve comparable performances to those of the zero-shot CLIP-LP. Specifically, the LVLMs exhibit superior performance on Politifact compared to GossipCop and Weibo. This disparity may be partly due to the sequence length constraints of LVLMs (e.g., 2096 tokens for CogVLM) which are typically longer than those of the CLIP model (defaulting to 77 tokens). The average of 2,148 tokens per Politifact article, as shown in Table 2, significantly exceeding that of GossipCop and Weibo, permits LVLMs to analyze more textual data compared to the latter datasets.

Furthermore, we also observe that LVLMs encounter difficulties in tasks such as the verification of news article authenticity in zero-shot settings, where the LVLMs frequently respond with: “I’m sorry, I can’t assist with verifying the authenticity of news articles”. This could be attributed to the alignment of LLMs with human preferences, whereby the models are designed to abstain from answering queries typically associated with authentic verification.

5.3 Stability test

Given that the selection of in-context examples can significantly influence model performance in the few-shot settings, all experiments presented in Table 3 are executed using five randomly selected seeds. The stability of each model is evaluated by calculating the standard deviation of accuracies across diverse few-shot configurations, as shown in Figure 2.

In general, observations indicate a decrease in the standard deviation for each model concurrent with an increase in the number of in-context examples, suggesting enhanced stability with the increment of training samples. Additionally, the overall standard deviations for the IMFND consistently fall below those of the smaller models across all datasets, illustrating that the IMFND facilitates more stable performance outcomes compared to the smaller models in few-shot settings.

We argue that the superior stability observed in the IMFND model derives from incorporating reference predictions of the CLIP-LP model. Such a strategy enables LVLMs to focus on enhancing and, as required, consulting CLIP’s predictions, thus diminishing the variances induced by diverse in-context examples.

5.4 Case study

Figure 3 showcases responses generated from CogVLM and GPT4V, employing either IMFND or standard ICL prompting methods.

Observations indicate that the predictions and their probabilities from the smaller model guide the LVLM to concentrate on specific segments of the news content, thereby aiding in the accuracy of predictions. For example, a 98% confidence level by the image classifier in affirming the news image as fake prompts the LVLM to intensify its focus on evaluating the news images, as shown in the first example of Figure 3(a).

Furthermore, IMFND prompting not only augments the responses generated from the LVLM, leading to a more comprehensive analysis of the text and image from news article, but also can override inaccurately classified predictions arising from standard ICL prompting, as demonstrated in the second example in Figure 3(a).

To enhance our comprehension of the limitations inherent to the LVLM, we also conduct an error analysis focusing on prevalent errors within both IMFND and ICL prompting. Manual examination of these inaccuracies revealed that error classifications by the smaller model could also lead the LVLM to generate incorrect predictions. For example, an image classifier’s prediction that news is fake with a 71% confidence level prompted CogVLM to deem it fake news, in contrast, the standard ICL arrived at the correct prediction uninfluenced by the smaller model’s misleading information, as illustrated in the first example of Figure 3(b).

Additionally, it is observed that the presence of credible sources, (e.g., the ‘Demoncratic Underground’, ‘Politico’, ‘CNN’) within the image or text cues LVLMs to classify the content as genuine news articles, as demonstrated in the second example of Figure 3(a) as well as in that of Figure 3(b).

6 Discussion

The experimental results highlight the proficiency of LVLMs in applications that demand an understanding of complex real-world contexts. Subsequently, we will contextualize these findings within the framework of our three research questions:

RQ(1) What is the effectiveness of LVLMs in identifying multimodal fake news?

Contrary to the conclusions drawn by [19, 14], which suggested that LLMs might not effectively replace SLMs in FND, our experimental findings indicate that LVLMs also possess substantial capability as detectors for complex real-world task, such as FND. This efficacy is attributed to the multimodal aspects of fake news, which likely offer more informative features to LVLMs compared to the unimodal approach of traditional LLMs. Additionally, LVLMs demonstrate the ability to process inputs with longer sequence lengths than SLMs, such as BERT, which is limited to 512 tokens. This capability, derived from leveraging the strengths of LLMs, results in superior performance in scenarios that involve datasets with lengthy documents.

RQ(2) Can ICL enhance the FND performance by utilizing the intrinsic knowledge and capabilities of LVLMs?

Our experimental findings reveal that standard ICL prompting does enhances the FND proficiency of LVLMs relative to the zero-shot setups in CogVLM and GPT4V, with respective accuracy increases of 1.5% and 9.9%. This suggests that: 1) incorporating the ground-truth label into the contextual prompt aids LVLMs in leveraging their inherent global knowledge, thereby improving task-specific performance; 2) Textual information holds precedence over visual cues, which are also suggested in [45], as the inclusion of the ground-truth label in the prompt predominantly dictates the performance outcome.

RQ(3) What approaches should ICL adopt to improve LVLMs’ performance in multimodal FND?

While standard ICL enhances the FND capabilities of LVLMs, as discussed in RQ2, the extent of this improvement varies depending on the architectural designs and sizes of the LVLMs. For example, while the ICL adaptation of GPT4V achieves an average accuracy rate of 66.8%, surpassing the most effective smaller model (CLIP-LP at 60.5%) by 6.3%, integrating ICL with CogVLM yields an average accuracy of 59.0%, underperforming in comparison to CLIP-LP. However, our result suggests that the proposed IMFND can yield significant enhancements and more robust results for LVLMs compared to the outcomes attained by utilizing standard ICL. This improvement is attributed to the smaller model’s predictions, which offer overarching guidance for LVLMs in assessing news veracity. Concurrently, the predictive probabilities direct LVLMs’ focus towards critical segments of content, particularly when these probabilities are high.

7 Conclusion

This study introduces the IMFND framework, designed to enhance FND across three datasets, by leveraging LVLMs. Initially, a zero-shot evaluation of LVLMs and the CLIP model for FND reveals that LVLMs are capable of attaining competitive results, distinct from those observed with traditional LLMs. Subsequent findings demonstrate that while standard ICL can enhance LVLM performance, the degree of improvement remains limited and variable. Finally, it is shown that incorporating predictions and their probabilities from a well-trained smaller model into LVLMs can significantly augment their FND capabilities, ensuring robust accuracy across diverse LVLMs. Moreover, the proposed IMFND, leveraging standard ICL without necessitating fine-tuning or gradient updates, exhibits flexible adaptability to various LVLMs, with the choice of the smaller model also being versatile.

8 Limitations

This study acknowledges certain constraints, including: 1) The comparative analysis of LVLMs is confined to two models (i.e., CogVLM and GPT4V), a limitation imposed by budgetary and hardware constraints; 2) Due to the time-intensive of prompting LVLMs with five random seeds, only a single prompting strategy has been evaluated. The exploration of prompt engineering design remains a subject for future research; 3) CogVLM’s performance may be limited by the hardware restriction of employing 4-bit quantization, suggesting potential for improved results with a full-precision configuration.

Acknowledgements

This work is funded by the Natural Science Foundation of Shandong Province under grant ZR2023QF151 and the Natural Science Foundation of China under grant 12303103.

CRediT authorship contribution statement

Ye Jiang: Conceptualization, Methodology, Writing–original draft, Writing–review & editing. Yimin Wang: Funding acquisition, Methodology, Writing–review & editing.

References

[1] N. K. Conroy, V. L. Rubin, Y. Chen, Automatic deception detection: Methods for finding fake news, Proceedings of the association for information science and technology 52 (1) (2015) 1–4.
[2] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media, Big data 8 (3) (2020) 171–188.
[3] Y. Long, Q. Lu, R. Xiang, M. Li, C.-R. Huang, Fake news detection through multi-perspective speaker profiles, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.
[4] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, in: Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, 2018, pp. 849–857.
[5] A. Lao, C. Shi, Y. Yang, Rumor detection with field of linear and non-linear propagation, in: Proceedings of the Web Conference 2021, 2021, pp. 3178–3187.
[6] J. Devlin, M.-W. Chang, K. Lee, K. N. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018.
URL https://arxiv.org/abs/1810.04805
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] Y. Wu, P. Zhan, Y. Zhang, L. Wang, Z. Xu, Multimodal fusion with co-attention networks for fake news detection, in: Findings of the association for computational linguistics: ACL-IJCNLP 2021, 2021, pp. 2560–2569.
[9] Y. Chen, D. Li, P. Zhang, J. Sui, Q. Lv, L. Tun, L. Shang, Cross-modal ambiguity learning for multimodal fake news detection, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 2897–2905.
[10] Q. Sheng, X. Zhang, J. Cao, L. Zhong, Integrating pattern-and fact-based fake news detection via model preference learning, in: Proceedings of the 30th ACM international conference on information & knowledge management, 2021, pp. 1640–1650.
[11] B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, Z. Liu, Otter: A multi-modal model with in-context instruction tuning, arXiv preprint arXiv:2305.03726 (2023).
[12] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning (2023).
[13] Y. Ma, Y. Cao, Y. Hong, A. Sun, Large language model is not a good few-shot information extractor, but a good reranker for hard samples!, arXiv preprint arXiv:2303.08559 (2023).
[14] B. Hu, Q. Sheng, J. Cao, Y. Shi, Y. Li, D. Wang, P. Qi, Bad actor, good advisor: Exploring the role of large language models in fake news detection, arXiv preprint arXiv:2309.12247 (2023).
[15] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023).
[16] R. Shao, T. Wu, Z. Liu, Detecting and grounding multi-modal media manipulation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6904–6913.
[17] K. Shiohara, T. Yamasaki, Detecting deepfakes with self-blended images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18720–18729.
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[19] Y. Mu, B. P. Wu, W. Thorne, A. Robinson, N. Aletras, C. Scarton, K. Bontcheva, X. Song, Navigating prompt complexity for zero-shot classification: A study of large language models in computational social science, arXiv preprint arXiv:2305.14310 (2023).
[20] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[21] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al., Cogvlm: Visual expert for pretrained language models, arXiv preprint arXiv:2311.03079 (2023).
[22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[23] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, in: International Conference on Machine Learning, PMLR, 2021, pp. 12697–12706.
[24] J. Liang, H. Shi, W. Deng, Exploring disentangled content information for face forgery detection, in: European Conference on Computer Vision, Springer, 2022, pp. 128–145.
[25] C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: Proceedings of the 20th international conference on World wide web, 2011, pp. 675–684.
[26] B. Tabibian, I. Valera, M. Farajtabar, L. Song, B. Schölkopf, M. Gomez-Rodriguez, Distilling information reliability and source trustworthiness from digital traces, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 847–855.
[27] C. Geeng, S. Yee, F. Roesner, Fake news on facebook and twitter: Investigating how people (don’t) investigate, in: Proceedings of the 2020 CHI conference on human factors in computing systems, 2020, pp. 1–14.
[28] P. Bahad, P. Saxena, R. Kamal, Fake news detection using bi-directional lstm-recurrent neural network, Procedia Computer Science 165 (2019) 74–82.
[29] S. Sridhar, S. Sanagavarapu, Fake news detection and analysis using multitask learning with bilstm capsnet model, in: 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2021, pp. 905–911.
[30] H. T. Phan, N. T. Nguyen, D. Hwang, Fake news detection: A survey of graph neural network methods, Applied Soft Computing (2023) 110235.
[31] X. Song, J. Petrak, Y. Jiang, I. Singh, D. Maynard, K. Bontcheva, Classification aware neural topic model for covid-19 disinformation categorisation, PloS one 16 (2) (2021) e0247086.
[32] Y. Jiang, X. Song, C. Scarton, A. Aker, K. Bontcheva, Categorising fine-to-coarse grained misinformation: An empirical study of covid-19 infodemic, arXiv preprint arXiv:2106.11702 (2021).
[33] Y. Jiang, Y. Wang, X. Song, D. Maynard, Comparing topic-aware neural networks for bias detection of news, in: ECAI 2020, IOS Press, 2020, pp. 2054–2061.
[34] Y. Jiang, Team QUST at SemEval-2023 task 3: A comprehensive study of monolingual and multilingual approaches for detecting online news genre, framing and persuasion techniques, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 300–306.
[35] S. Singhal, T. Pandey, S. Mrig, R. R. Shah, P. Kumaraguru, Leveraging intra and inter modality relationship for multimodal fake news detection, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 726–734.
[36] L. Wang, C. Zhang, H. Xu, Y. Xu, X. Xu, S. Wang, Cross-modal contrastive learning for multimodal fake news detection, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5696–5704.
[37] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[38] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592 (2023).
[39] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, D. Cai, Pandagpt: One model to instruction-follow them all, arXiv preprint arXiv:2305.16355 (2023).
[40] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, I. Misra, Imagebind: One embedding space to bind them all, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15180–15190.
[41] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al., Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, See https://vicuna. lmsys. org (accessed 14 April 2023) (2023).
[42] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, et al., Few-shot learning with multilingual language models, arXiv preprint arXiv:2112.10668 (2021).
[43] J. Ye, J. Gao, Z. Wu, J. Feng, T. Yu, L. Kong, Progen: Progressive zero-shot dataset generation via in-context feedback, in: Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 3671–3683.
[44] M. Monajatipoor, L. H. Li, M. Rouhsedaghat, L. Yang, K.-W. Chang, MetaVL: Transferring in-context learning ability from language models to vision-language models, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 495–508. doi:10.18653/v1/2023.acl-short.43.
URL https://aclanthology.org/2023.acl-short.43
[45] S. Chen, Z. Han, B. He, M. Buckley, P. Torr, V. Tresp, J. Gu, Understanding and improving in-context learning on vision-language models, arXiv preprint arXiv:2311.18021 (2023).
[46] Y. Zhou, X. Li, Q. Wang, J. Shen, Visual in-context learning for large vision-language models, arXiv preprint arXiv:2402.11574 (2024).
[47] K. M. Yoo, J. Kim, H. J. Kim, H. Cho, H. Jo, S.-W. Lee, S.-g. Lee, T. Kim, Ground-truth labels matter: A deeper look into input-label demonstrations, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2422–2437.
[48] Z. Lin, S. Yu, Z. Kuang, D. Pathak, D. Ramanan, Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19325–19337.
[49] Z. Jin, J. Cao, H. Guo, Y. Zhang, J. Luo, Multimodal fusion with recurrent neural networks for rumor detection on microblogs, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 795–816.
[50] A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, C. Zhou, Chinese clip: Contrastive vision-language pretraining in chinese, arXiv preprint arXiv:2211.01335 (2022).