Cross-Modal Augmentation for Few-Shot Multimodal Fake News Detection

Ye Jiang Taihang Wang Xiaoman Xu Yimin Wang Xingyi Song Diana Maynard

Abstract

The nascent topic of fake news requires automatic detection methods to quickly learn from limited annotated samples. Therefore, the capacity to rapidly acquire proficiency in a new task with limited guidance, also known as few-shot learning, is critical for detecting fake news in its early stages. Existing approaches either involve fine-tuning pre-trained language models which come with a large number of parameters, or training a complex neural network from scratch with large-scale annotated datasets. This paper presents a multimodal fake news detection model which augments multimodal features using unimodal features. For this purpose, we introduce Cross-Modal Augmentation (CMA), a simple approach for enhancing few-shot multimodal fake news detection by transforming n-shot classification into a more robust (n $\times$ z)-shot problem, where z represents the number of supplementary features. The proposed CMA achieves SOTA results over three benchmark datasets, utilizing a surprisingly simple linear probing method to classify multimodal fake news with only a few training samples. Furthermore, our method is significantly more lightweight than prior approaches, particularly in terms of the number of trainable parameters and epoch times. The code is available here: https://github.com/zgjiangtoby/FND_fewshot

Keywords: Fake news detection, Multimodal fusion, Few-shot learning, Natural language processing

Affiliations:

[1] College of Information Science and Technology, Qingdao University of Science and Technology, China

[2] College of Data Science, Qingdao University of Science and Technology, China

[3] Department of Computer Science, University of Sheffield, UK

1 Introduction

The recent proliferation of social media has not only transformed the landscape of information exchange, but also led to the pernicious spread of fake news. The detection and mitigation of fake news have consequently become pivotal areas of research [1, 2]. Traditional approaches, primarily relying on textual analysis, have shown limitations due to the sophisticated and multi-faceted nature of fake news [3, 4]. In response, many studies have incorporated multimodal methods that consider both text and accompanying images, yielding a more comprehensive and effective framework for identifying and debunking fake news [5, 6].

To explore the inconsistent semantics between text and image in fake news, many studies have either incorporated contrastive learning to achieve better alignment between image-text pairs [7], or designed complex neural networks to strengthen the deep-level fusion of multimodal features [8, 9]. The former relies on contrastive loss to align image-text pairs, but most image-text pairs in fake news are inherently not matched [10], and different image-text pairs may also have potential correlations [11], which can consequently confuse the model. The latter typically needs to be trained from scratch, which is fundamentally bounded by the availability of large-scale annotated data [12, 13].

In contrast to machines, the process of concept learning in humans involves integrating multimodal signals and representations [14, 15]. When processing uncertain information, people inherently seek help from other modalities. This capability enables humans to learn from a limited number of samples by incorporating cross-modal information, as shown in Figure 1. Meanwhile, the efficacy of fake news detection (FND) in the context of nascent topics, such as COVID-19, remains a significant challenge for prevailing strategies. This difficulty is compounded by the lack of extensive data and annotations in the target domain, underscoring the critical role of few-shot learning in mitigating the spread of early-stage fake news [16].

Figure 1: Information from different modalities assists humans in decision-making, especially when faced with uncertainty.

In the context of emerging topics with limited training samples, prompt learning, through its few-shot learning capacity, encapsulates news articles in task-specific textual prompts for direct knowledge extraction from pre-trained language models (PLMs), achieving comparable performance across different tasks [17, 18]. However, most prompt-based methods primarily tune the PLM with unimodal textual information from fake news [19, 16], thus once again ignoring the multimodal nature of fake news. Although a previous method [20] attempts to integrate different prompt templates with image features extracted from a pre-trained vision model, its fusion strategy still utilizes only the multimodal features, and may therefore struggle to address spatial discrepancies between visual and textual semantics [8, 21].

In this paper, we propose a Cross-Modal Augmentation (CMA) method to explore how unimodal features could assist in multimodal fusion for FND in few-shot scenarios. Specifically, we leverage the foundational multimodal model CLIP [22] to extract textual and visual features from fake news simultaneously. Utilizing class labels as supplementary one-shot training instances, the n-shot classification can then be converted to an $(n\times z)$ -shot problem, where $z$ represents the number of supplementary features (e.g., the fused feature from text and image). Meanwhile, we also fuse unimodal features with the cross-attention mechanism [7] to obtain another supplementary feature. Finally, we employ simple linear probing for each modality as well as for the fused multimodal features. The experimental results indicate that CMA achieves SOTA results across three datasets.

The main contributions of this paper are:

1. Introduction of a Cross-Modal Augmentation (CMA) method for few-shot multimodal fake news detection, utilizing unimodal features to enhance multimodal fusion.

2. Leveraging a pre-trained multimodal model to extract unimodal features, and repurposing class labels as additional one-shot training samples, transforming the n-shot classification into a more robust $(n\times z)$ -shot problem.

3. By freezing the pre-trained multimodal model and training only a simple linear classifier, the proposed CMA achieves SOTA results over three datasets, outperforming 11 baseline models and surpassing previous methods in efficiency.

2 Related work

2.1 Unimodal fake news detection

Unimodal fake news detection aims to extract significant semantics from either news texts or images. Given the precision of semantics in text, previous approaches have concentrated on the task of text-based unimodal fake news detection. Early works focused on analyzing statistical characteristics of text (e.g., length, punctuation, exclamation marks) [23] and metadata (e.g., likes, shares) [24, 25] for manual fake news detection. However, these manual feature engineering approaches are time-consuming and struggle with processing large-scale, real-time data [26, 27].

The advent of deep learning has significantly advanced automated fake news detection. These methods primarily utilize deep learning models like BiLSTM [28, 29], GNNs [30], and pre-trained models (e.g., BERT, GPT) [31, 32, 33] to analyze text features, extracting various attributes such as emotional [34], stance-based [35], and stylistic elements [36]. However, the recent proliferation of multimodal information (text, images, videos) in social networks has shifted the propagation of fake news from solely text-based to multimodal formats.

2.2 Multimodal fake news detection

Multimodal methods employing cross-modal discriminative patterns have been introduced, aiming to enhance performance in fake news detection. For example, MCAN [36] employs multiple co-attention layers to more effectively integrate textual and visual features in detecting fake news. CAFE [5] quantifies cross-modal ambiguity through the assessment of the Kullback-Leibler (KL) divergence among the distributions of unimodal features. LIIMR [37] determines the modality that exhibits greater confidence in the context of fake news detection. COOLANT [7] focuses on improving the alignment between image and text representations, utilizing contrastive learning for finer semantic alignment and cross-modal fusion to learn inter-modality correlations. However, these approaches are limited by the need for extensive annotated data in the context of emerging topics.

2.3 Cross-modal few-shot fake news detection

Few-shot learning is designed to master new tasks using a limited number of labeled examples [38]. Current few-shot learning methodologies, such as prototypical networks, acquire class-specific features in metric spaces for swift adaptation to novel tasks [39, 40]. Within computer vision, the concept of few-shot domain adaptation is explored in image classification for transferring knowledge to novel target domains [41, 42]. In natural language processing, meta-learning is suggested as a means to enhance few-shot learning performance in tasks like language modeling [43, 44] and misinformation detection [45, 46]. To our knowledge, the application of few-shot multimodal fake news detection through cross-modal augmentation remains unexplored in existing literature.

Meanwhile, previous multimodal learning approaches have sought to enhance unimodal tasks by leveraging data from various modalities [47, 48]. With multimodal pre-trained models achieving notable success in classic vision tasks [22, 49], there is a growing interest in formulating more efficient cross-modal augmentation techniques.

However, the prevailing techniques are based on successful strategies originally designed for multimodal foundational models. For example, CLIP utilizes linear probing [50, 51] and comprehensive fine-tuning [52] in its application to downstream tasks. CLIP-Adapter [53] and Tip-Adapter [54] draw inspiration from parameter-efficient finetuning approaches [55] that focus on optimizing lightweight MLPs while maintaining a fixed encoder. However, all the aforementioned methods, including WiSE-FT [56], employ an alternative modality, such as textual labels, as classifier weights, and continue to compute a unimodal Softmax loss on few-shot tasks. In contrast, this paper demonstrates the enhanced effectiveness of incorporating additional modalities as training samples.

3 Methodology

The proposed CMA enhances few-shot fake news detection by integrating samples from different modalities, and extends traditional unimodal few-shot classification to leverage the richness of cross-modal data, as shown in Figure 2.

This section starts with a standard unimodal few-shot FND framework, and the loss function is discussed. Then, it extends this to multiple modalities, assuming each training example is a combination of five different modalities. The modality-specific features are passed through MLP linear classifiers to obtain their inferences. Finally, we combine the inferences and train a meta-linear classifier to compute the final prediction.

3.1 Unimodal few-shot FND

Initially, unimodal few-shot FND learns from a labeled dataset of $(x,y)\in X$ , where $x$ is either the text or image passing to a pre-trained feature encoder $\phi(\cdot)$ . The ultimate goal is to allocate a binary classification label of $y\in\{0,1\}$ , in which 0 denotes real news and 1 denotes fake news. We assume only an n-shot subset $(x_{i},y_{i})$ from $X$ is provided for training, where $i\in[1,n]$ (i.e., $n$ samples per class); the rest of $X$ is used as the test set.

Therefore, the standard unimodal FND can be denoted as minimizing the cross-entropy loss $L$ :

$$L=-\left(y_{i}\log(y_{i}^{\prime})+(1-y_{i})\log(1-y_{i}^{\prime})\right)\qquad(1)$$

where $y_{i}^{\prime}$ is the model inference from the linear classifier $MLP$ after softmax.

$$y_{i}^{\prime}=\mathrm{softmax}(MLP(f(x_{i})))=\frac{e^{w_{y}\cdot f}}{\sum_{y^{\prime}}e^{w_{y^{\prime}}\cdot f}}\qquad(2)$$

where $f$ is the feature representation from an MLP layer after the unimodal feature encoder, $w_{y}$ is the classifier weight of the ground-truth label, and $w_{y^{\prime}}$ is the classifier weight of each candidate label $y^{\prime}$ in the normalizing sum.
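As a concrete illustration of Equations 1 and 2, the following minimal Python sketch computes the softmax probability of the fake class from a frozen feature vector and evaluates the binary cross-entropy loss for one sample. The feature values, the weight matrix, and the helper names (`softmax`, `linear_probe`, `cross_entropy`) are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_probe(feature, weights):
    """Return class probabilities softmax(W f) for one frozen feature vector.

    weights[k] is the weight vector w_k of class k (0 = real, 1 = fake).
    """
    logits = [sum(w * f for w, f in zip(w_k, feature)) for w_k in weights]
    return softmax(logits)

def cross_entropy(y_true, p_fake, eps=1e-12):
    """Binary cross-entropy of Equation 1 for a single sample."""
    p_fake = min(max(p_fake, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p_fake) + (1 - y_true) * math.log(1.0 - p_fake))

# Hypothetical 4-dimensional feature and 2x4 weight matrix (illustrative values).
feature = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, -0.3, 0.0],   # w_0: real
           [0.4, -0.1, 0.2, 0.3]]   # w_1: fake

probs = linear_probe(feature, weights)
loss = cross_entropy(1, probs[1])   # loss for a sample labeled fake
```

Because the encoder is frozen, only the entries of `weights` would be updated during training in such a setup.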

3.2 Multimodal few-shot FND

To extend to multimodal FND, we assume that for each training sample, $f$ is a combination of five feature representations: 1) a text-only feature $f_{t}$ ; 2) an image-only feature $f_{m}$ ; 3) an L2-normalized concatenation $f_{c}=[f_{t}\oplus f_{m}]$ , where $\oplus$ is the concatenation operation; 4) an image-text cross-attended feature $f_{mt}$ ; 5) a text-image cross-attended feature $f_{tm}$ . The cross-attention mechanism that uses the image query $Q_{m}$ in place of the text query $Q_{t}$ to obtain the cross-attended feature $f_{mt}$ is denoted as follows:

$$f_{mt}=CrossAtt_{m\rightarrow t}(Q_{m},K_{t},V_{t})=\mathrm{softmax}\left(\frac{Q_{m}K_{t}^{T}}{\sqrt{d}}\right)V_{t}\qquad(3)$$

In contrast, by swapping the image query $Q_{m}$ with the text query $Q_{t}$ , the cross-attended feature $f_{tm}$ can be obtained:

$$f_{tm}=CrossAtt_{t\rightarrow m}(Q_{t},K_{m},V_{m})=\mathrm{softmax}\left(\frac{Q_{t}K_{m}^{T}}{\sqrt{d}}\right)V_{m}\qquad(4)$$

where $K_{t}$ and $K_{m}$ represent the key vectors for text and image features respectively, $V_{t}$ and $V_{m}$ denote the corresponding value vectors, and $d$ refers to the dimensionality of the model.
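The two cross-attention directions in Equations 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the query, key, and value vectors are hypothetical toy values, the learned projection layers are omitted, and $d$ is taken to be the length of a key vector.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    queries: query vectors from one modality (each of length d)
    keys, values: key/value vectors from the other modality
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        out.append([sum(a * v[j] for a, v in zip(attn, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical 2-dimensional vectors (illustrative values only).
Q_m = [[1.0, 0.0]]                  # image query
K_t = [[1.0, 0.0], [0.0, 1.0]]      # text keys
V_t = [[1.0, 2.0], [3.0, 4.0]]      # text values
Q_t = [[0.0, 1.0]]                  # text query
K_m = [[1.0, 0.0], [0.0, 1.0]]      # image keys
V_m = [[5.0, 6.0], [7.0, 8.0]]      # image values

f_mt = cross_attention(Q_m, K_t, V_t)   # Equation 3: image query attends to text
f_tm = cross_attention(Q_t, K_m, V_m)   # Equation 4: text query attends to image
```

Each output row is a convex combination of the value vectors of the other modality, weighted by how strongly the query matches each key.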

For simplicity, we treat the $z$ different types of features as distinct modalities. Each of them can therefore be processed through a linear classifier MLP, following the unimodal learning approach discussed above, to obtain five inferred probabilities.

Algorithm 1 Cross-modal Augmentation Algorithm

1: Input: source data $X$ , number of seeds $S$ , number of shots $n$

2: Initialize pre-trained multimodal model;

3: for seed $\in\{1,2,\dots,S\}$

4:     for $x_{i}\in\{x_{1},\dots,x_{n}\}$

5:         Extract image feature $f_{m}$ from the pre-trained vision model;

6:         Extract text feature $f_{t}$ from the pre-trained language model;

7:         Concatenate $f_{m}$ with $f_{t}$ and L2 normalize to obtain $f_{c}$ ;

8:         Compute cross-attended features $f_{mt}$ and $f_{tm}$ with Equations 3 and 4;

9:         Obtain inferences of each of the above features with linear classifiers from Equation 2;

10:         Concatenate inferences and compute the final prediction with Equation 5;

11:         Compute cross-entropy loss with Equation 1;

12:     end for

13: end for

Inspired by the Representer Theorem [57], which indicates that optimally trained classifiers can be depicted as linear combinations of their training samples, we concatenate the five inferred probabilities as a new input to a meta-linear MLP classifier for making the final prediction:

$$\hat{y}=\mathrm{softmax}(MLP(f_{t}\oplus f_{m}\oplus f_{c}\oplus f_{mt}\oplus f_{tm}))\qquad(5)$$

Instead of optimizing modality-specific weights independently, linear classification through the proposed CMA simultaneously determines all weights to minimize the training loss. Consequently, we convert the standard n-shot classification to an $(n\times z)$ -shot problem. The training details for CMA are presented in Algorithm 1.
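The meta-linear step can be illustrated with a small Python sketch. It assumes, following the description above, that the input to the meta-classifier is the concatenation of the five per-feature class probabilities; the probability values, the meta-layer weights, and the function name `meta_predict` are hypothetical, and setting $z$ to the number of feature types is an illustrative assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def meta_predict(per_feature_probs, meta_weights, meta_bias):
    """Concatenate per-feature class probabilities and apply one linear layer."""
    meta_input = [p for probs in per_feature_probs for p in probs]
    logits = [sum(w * x for w, x in zip(row, meta_input)) + b
              for row, b in zip(meta_weights, meta_bias)]
    return softmax(logits)

# Hypothetical [P(real), P(fake)] outputs of the five linear probes.
per_feature_probs = [
    [0.30, 0.70],  # f_t  (text)
    [0.55, 0.45],  # f_m  (image)
    [0.35, 0.65],  # f_c  (concatenation)
    [0.40, 0.60],  # f_mt (image-to-text cross-attention)
    [0.25, 0.75],  # f_tm (text-to-image cross-attention)
]

# Hypothetical meta-layer that sums the evidence for each class;
# a trained layer would learn these weights jointly from the loss.
meta_weights = [[1.0, 0.0] * 5, [0.0, 1.0] * 5]
meta_bias = [0.0, 0.0]

y_hat = meta_predict(per_feature_probs, meta_weights, meta_bias)
prediction = max(range(2), key=lambda k: y_hat[k])  # 0 = real, 1 = fake

# With n shots per class and z feature types, each shot contributes z training signals.
n_shots, z = 8, len(per_feature_probs)
effective_shots = n_shots * z
```

In this toy case the image probe alone would favor the real class, but the combined evidence from the other four feature types tips the final prediction to fake.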

4 Experiment

This section details experiments conducted to validate the effectiveness of the proposed approach. Initially, benchmark datasets are introduced, followed by the implementation details for experiments. The experimental results are analyzed in comparison to unimodal, multimodal, and few-shot FND methods. Finally, detailed analyses are provided to enhance the understanding of the proposed methods.

4.1 Data Setup

Three publicly available datasets are utilized for evaluation.

PolitiFact [13] comprises a dataset of political news categorized as either fake or real by expert evaluators and is part of the benchmark FakeNewsNet project. Using the provided data crawling scripts, news with no images or invalid image URLs are removed, resulting in 198 multimodal news articles.

GossipCop [13] features entertainment stories rated on a scale from 0 to 10, with stories scoring less than five classified as fake news by the author of FakeNewsNet. Using the same retrieval strategies as PolitiFact, 6,805 multimodal news articles are collected.

Weibo [58], a dataset sourced from Chinese social media platforms, comprises a multimodal fake news collection featuring both text and images. Authentic news items were crawled from a reputable source (Xinhua News), and fake news was obtained from Weibopiyao, an official rumor refutation platform of Weibo, that aggregates content either through crowdsourcing or official rumor refutation efforts. The same pre-processing methods as in previous work [7] are followed, resulting in 7,853 Chinese news articles.

Notably, a news article might be accompanied by multiple images. To find the most relevant image, the cosine similarity between each image and its corresponding text is calculated, and the image-text pair with the highest similarity, as determined by the pre-trained CLIP, is retained. The resulting dataset statistics are presented in Table 1.
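The image selection step can be sketched as follows. The embeddings below are hypothetical toy vectors standing in for CLIP text and image features, and the helper names are illustrative; the sketch also assumes that no embedding is the zero vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_best_image(text_embedding, image_embeddings):
    """Return (index, similarity) of the image most similar to the text."""
    sims = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims[best]

# Hypothetical embeddings for one article with three candidate images.
text_embedding = [1.0, 0.0, 1.0]
image_embeddings = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

best_index, best_sim = select_best_image(text_embedding, image_embeddings)
```

Only the image at `best_index` would be kept for that article, so every sample ends up as a single image-text pair.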

Table 1: The statistics of the pre-processed multimodal fake news datasets. Avg tokens denote the average number of tokens per article.

Statistics	PolitiFact	GossipCop	Weibo
Total news	198	6,805	7,853
Fake news	96	1,877	4,211
Real news	102	4,928	3,642
Avg tokens	2,148	728	67

4.2 Implementation details

The pre-trained OpenAI CLIP (ViT-B-32) [22] and Chinese CLIP (ViT-B-16) [59] models are utilized to respectively extract text and image features for different languages. The hidden size for the cross-attention projection layer is 512, which is the same as the output dimension of CLIP encoders. The AdamW optimizer is employed with a learning rate of $1\mathrm{e}{-3}$ and a decay parameter of $1\mathrm{e}{-2}$ . The model is trained for 20 epochs, with the optimal checkpoint being determined by peak validation performance. Early stopping is utilized with a patience of three epochs.
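The checkpoint selection and early stopping logic described above can be sketched independently of any deep learning framework. In this sketch the list `val_scores` stands in for training one epoch and evaluating it; the function name, the toy scores, and the choice to count an epoch as an improvement only when the score strictly increases are assumptions made for illustration.

```python
def train_with_early_stopping(val_scores, max_epochs=20, patience=3):
    """Return (best_epoch, best_score, stopped_epoch) for a run.

    val_scores[e] is a stand-in for the validation score after epoch e.
    Training stops once the score has not improved for `patience` epochs.
    """
    best_score, best_epoch, waited = float("-inf"), -1, 0
    stopped_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        stopped_epoch = epoch
        if score > best_score:
            # New peak: this epoch's checkpoint would be kept.
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score, stopped_epoch

# Hypothetical validation scores (not results from the paper).
val_scores = [0.61, 0.66, 0.70, 0.69, 0.68, 0.70, 0.72]
best_epoch, best_score, stopped_epoch = train_with_early_stopping(val_scores)
```

Here training halts after the sixth epoch (index 5) and the checkpoint from the third epoch (index 2) is retained, so the later score of 0.72 is never reached.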

In the few-shot context, the model is trained using a restricted set of samples, selected from the dataset to form an n-shot scenario. Here, $n\in\{2,8,16,32\}$ represents the number of samples for each class, while the remainder of the samples are reserved for testing purposes. Given that the data quality of the sampled training set might significantly impact the model’s performance, data sampling is repeated 10 times with random seeds, and the average score is reported after excluding the highest and lowest scores.
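A minimal sketch of this sampling and reporting protocol is given below. The toy labels, the per-run accuracies, and the helper names `sample_n_shot` and `trimmed_mean` are hypothetical and serve only to illustrate the procedure.

```python
import random

def sample_n_shot(labels, n, seed):
    """Pick n training indices per class; all remaining indices form the test set."""
    rng = random.Random(seed)
    train = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        train.extend(rng.sample(idx, n))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

def trimmed_mean(scores):
    """Average after dropping one highest and one lowest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

# Toy dataset: 10 real (0) and 10 fake (1) articles.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = sample_n_shot(labels, n=2, seed=0)

# Hypothetical accuracies from 10 seeded runs (not results from the paper).
run_scores = [70.0, 72.0, 68.0, 75.0, 71.0, 69.0, 73.0, 74.0, 60.0, 90.0]
reported = trimmed_mean(run_scores)
```

Dropping the extremes (60.0 and 90.0 in this toy example) keeps a single unusually easy or hard draw of few-shot samples from dominating the reported average.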

4.3 Benchmarked Models

The proposed CMA is benchmarked against 11 representative models. Specifically, we extensively compare the proposed method with unimodal approaches (1)-(3), multimodal approaches (4)-(6), and the few-shot approaches (7)-(11).

(1) dEFEND [60] utilizes the hierarchical attention network for FND. In this study, we remove the user comments from the original model.

(2) LDA-HAN [33] integrates pre-calculated topic distributions from Latent Dirichlet Allocation into a hierarchical attention network for text classification.

(3) FT-RoBERTa is a standard, fine-tuned version of the pre-trained language model RoBERTa; we use Huggingface Trainer to conduct the fine-tuning experiment.

(4) SpotFake [61] employs the pre-trained VGG and BERT for extracting image and text features, respectively, and then concatenating them for final classification.

(5) SAFE [62] transforms images into textual descriptions and utilizes the correlation between text and visual information for FND.

(6) CAFE [5] employs an ambiguity-aware multimodal strategy to adaptively aggregate unimodal features and their correlations.

(7) KPL [19] employs prompt learning in RoBERTa by enhancing it with external knowledge representations.

(8) M-SAMPLE [20] incorporates prompt learning with multimodal FND. It also applies a similarity-aware fusing to adaptively combine the intensity of multimodal representation for FND.

(9) PET [63] employs PLMs with task descriptions for supervised training, employing task-related cloze questions and verbalizers.

(10) KPT [64] enhances the label word space by incorporating class-related tokens that exhibit diverse granularities and perspectives.

(11) P&A [16] combines prompt-based learning with social alignment techniques and addresses label scarcity by using task-specific prompts in PLMs to elicit relevant knowledge.

4.4 Results

Table 2 demonstrates the FND accuracy comparison between the proposed CMA and all the baselines at various few-shot settings over the three datasets.

Table 2: Performance comparison between the CMA and the baseline models in accuracy (%). Bold indicates the best performance. Underline is the second-best performance. AVG denotes the average accuracy per model across all n-shot settings and datasets. Notably, the experimental results of P&A in Weibo are not accessible since it would require constructing the news proximity graph from the raw social context, which is not provided in the Weibo dataset.

Method	PolitiFact				GossipCop				Weibo				AVG
Shots	2	8	16	32	2	8	16	32	2	8	16	32	
dEFEND	21.3	39.7	37.5	54.1	25.6	26.0	44.1	47.8	31.9	33.0	40.1	44.5	37.1
LDA-HAN	39.4	47.3	52.2	54.9	21.2	30.4	39.5	41.3	40.3	41.8	44.4	50.9	42.0
FT-RoBERTa	52.0	63.1	70.0	72.5	41.3	60.4	62.6	65.9	39.7	58.1	64.3	66.3	59.7
SAFE	19.0	27.3	48.7	52.1	31.3	45.2	45.4	47.1	21.1	19.3	39.4	41.1	36.4
SpotFake	49.3	53.7	58.5	63.4	28.3	28.4	34.4	36.1	36.9	41.3	40.4	53.7	43.7
CAFE	38.6	46.4	48.9	51.0	42.3	48.1	55.9	59.3	44.4	40.6	47.5	51.3	47.9
KPL	55.1	60.7	65.5	66.3	53.3	54.8	58.6	61.3	45.4	49.3	50.2	59.9	56.7
M-SAMPLE	56.2	66.1	69.5	73.4	53.4	54.1	59.7	66.0	49.7	52.1	59.8	65.7	60.5
KPT	68.1	74.8	80.0	83.2	52.5	56.5	58.1	67.0	56.9	69.4	69.9	71.2	67.3
PET	73.2	68.4	68.3	70.1	65.7	66.9	68.3	71.1	65.4	66.6	70.3	71.5	68.8
P&A	71.9	80.7	81.7	83.5	54.9	58.4	75.6	69.3	-	-	-	-	72.0
CMA(Ours)	73.5	75.8	82.5	87.3	71.9	69.0	71.7	77.0	74.5	69.9	73.8	76.5	75.3

Comparing with unimodal baselines. First, we assess the accuracy of both unimodal approaches and the proposed CMA to evaluate their performances. Overall, CMA outperforms the best unimodal approach, FT-RoBERTa, by 15.6 percentage points in average accuracy across all datasets, demonstrating its superiority in few-shot scenarios.

Surprisingly, FT-RoBERTa emerges as the most accurate model among both unimodal and multimodal approaches, suggesting that conventional fine-tuning methods can reach competitive levels of performance solely through the analysis of textual information from fake news. However, this method necessitates increased epoch time due to the adjustment of numerous parameters in the pre-trained language model (as shown in Table 4), making it impractical for real-world few-shot FND applications.

LDA-HAN yields the second best in accuracy among unimodal models, with dEFEND coming in next. This could be attributed to two factors: firstly, the vanilla LDA model struggles to effectively generate topics from short texts, a characteristic of the datasets from GossipCop and Weibo (as detailed in Table 1) used in LDA-HAN; secondly, the employment of GloVe embeddings for initializing LDA-HAN and dEFEND may not perform as effectively as the contextualized embeddings generated by the BERT family.

Comparing with multimodal baselines. We evaluate the performance of CMA in comparison with multimodal approaches. CMA outperforms the best multimodal baseline, CAFE, by 27.4 percentage points in average accuracy across all datasets. The reason might be that the complex architecture of multimodal approaches inherently comes with a large number of trainable parameters, which might easily lead to overfitting in few-shot scenarios.

Excluding FT-RoBERTa, the multimodal baselines SpotFake and CAFE outperform the unimodal models on average, whereas SAFE does not (Table 2), showing that the inclusion of the image modality can significantly affect model accuracy. While these multimodal approaches excel in scenarios with abundant data, their effectiveness heavily relies on the availability of high-quality annotated training samples, which may not be readily accessible during the initial stages of FND. Moreover, all multimodal approaches utilize pre-trained unimodal models, such as VGG, ResNet, and BERT, to independently extract features from images and text. Yet, since these unimodal models are trained separately, merging their extracted features during the multimodal fusion process could potentially introduce noise [20].

Comparing with few-shot baselines. The effectiveness of the proposed CMA is evaluated in comparison with the latest prompt-based few-shot models. CMA outperforms the best few-shot baseline, P&A, by 3.3 percentage points in average accuracy, showing that using unimodal features to assist multimodal probing without prompting the pre-trained language model could also benefit the FND task.

While P&A demonstrates performance on par with CMA, it requires the pre-calculation of a news proximity graph. Such social context data may not always be accessible, particularly in datasets not sourced from Twitter, such as Weibo. PET and KPT yield comparable outcomes, and the remaining differences are likely due to variations introduced by the manually crafted verbalizers used in prompting. This underscores the significance of hand-designed discrete templates in prompt-based learning. Meanwhile, M-SAMPLE, a multimodal adaptation of KPL, outperforms KPL, suggesting that incorporating the image modality can enhance FND effectiveness.
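The average accuracies and the gaps quoted in this section can be re-derived from the rows of Table 2. The short Python check below copies four rows from the table; note that the P&A average covers only the eight PolitiFact and GossipCop settings, since no Weibo results are available for it.

```python
# Accuracy rows copied from Table 2
# (PolitiFact, GossipCop, Weibo; 2/8/16/32 shots within each dataset).
table2 = {
    "FT-RoBERTa": [52.0, 63.1, 70.0, 72.5, 41.3, 60.4, 62.6, 65.9,
                   39.7, 58.1, 64.3, 66.3],
    "CAFE": [38.6, 46.4, 48.9, 51.0, 42.3, 48.1, 55.9, 59.3,
             44.4, 40.6, 47.5, 51.3],
    "P&A": [71.9, 80.7, 81.7, 83.5, 54.9, 58.4, 75.6, 69.3],  # no Weibo results
    "CMA": [73.5, 75.8, 82.5, 87.3, 71.9, 69.0, 71.7, 77.0,
            74.5, 69.9, 73.8, 76.5],
}

# AVG column: mean over the available settings, rounded to one decimal.
avg = {name: round(sum(v) / len(v), 1) for name, v in table2.items()}

# Gap between CMA and each baseline, in percentage points.
gaps = {name: round(avg["CMA"] - avg[name], 1) for name in avg if name != "CMA"}
```

The resulting gaps are differences between average accuracies, so they are absolute percentage points rather than relative improvements.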

5 Analysis

5.1 Ablation study

We investigate the impact of key components in CMA by assessing the framework’s performance in a range of complete and partial configurations. In each experiment, CMA is selectively utilized by removing different components, followed by training the framework from scratch. The results are averaged over five random seeds in each shot, and indicate the performance decay of CMA in the absence of each component in most configurations, underscoring the significance of each key module within CMA, as shown in Table 3.

Table 3: Ablation experiments of the CMA. -cross denotes the cross-attention is removed from the CMA. -meta means the meta-linear MLP layer is removed. -img means the image features are removed and only text features are used. -txt denotes the text features are removed and only image features are used.

Method	PolitiFact				GossipCop				Weibo				AVG
Shots	2	8	16	32	2	8	16	32	2	8	16	32	
CMA	73.5	75.8	82.5	87.3	71.9	69.0	71.7	77.0	74.5	69.9	73.8	76.5	75.3
-cross	67.6	76.7	81.2	84.0	71.8	71.8	71.6	71.1	58.4	65.2	68.4	75.2	71.6
-meta	72.2	74.1	74.7	78.4	49.0	53.2	56.8	56.1	50.0	50.9	57.4	61.7	61.2
-img	59.6	61.7	68.5	71.4	48.3	48.4	54.3	56.1	46.9	47.3	50.4	52.1	55.4
-txt	39.0	37.4	45.6	52.1	41.3	43.3	45.1	47.6	39.1	39.3	39.4	45.1	42.9

Specifically, removing the cross-attention from the CMA (i.e., -cross) results in a slight decrease in accuracy, showing that the cross-attended features from text and image capture semantic correlations and contribute to improved performance. Further removal of the meta-linear layer from the CMA (i.e., -meta) transforms the model into a standard n-shot classification, where it simply classifies concatenated multimodal features. This leads to a significant decrease in accuracy, emphasizing the importance of jointly updating all modality-specific weights in a meta-linear classifier for cross-modal adaptation and accuracy improvement. The meta-linear layer integrates modality-specific features, resembling an ensemble that transforms n-shot classification into a more robust $(n\times z)$ -shot problem, enhancing cross-modal adaptation in few-shot classification.

Additionally, experiments are performed by excluding either the image features (-img) or the text features (-txt), relying solely on the remaining modality for classification. Such setups led to additional reductions in accuracy, underscoring the comparative importance of text over image features in FND. This highlights the complexities in multimodal FND tasks, where the spatial discrepancies between visual and textual semantics tend to be more subtle than in broader multimodal datasets.

5.2 Stability test

Given that the selection of few-shot examples can significantly affect model performance, we assess the stability of CMA and other prompt-based baselines by measuring the standard deviation of accuracies in the few-shot settings, as shown in Figure 3.

Overall, the standard deviation for all models decreases as the number of shots increases, underscoring the importance of augmenting training examples in few-shot scenarios. Moreover, CMA tends to be the most stable among the few-shot approaches, indicating that the ensemble of unimodal features in the meta-linear layer can enhance the robustness of multimodal fusion in classification. Additionally, the GossipCop dataset exhibits greater instability compared to the PolitiFact dataset. This instability may be attributed to the semantic complexity in GossipCop, which is responsible for the lower accuracy across all models.
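The stability measure can be computed per n-shot setting as sketched below. The per-seed accuracies are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population standard deviation is an assumption, since the text does not specify which one is used.

```python
import statistics

def stability(accuracies_by_shot):
    """Map each n-shot setting to the sample standard deviation across seeds."""
    return {n: statistics.stdev(scores) for n, scores in accuracies_by_shot.items()}

# Hypothetical per-seed accuracies (%), for illustration only.
accuracies_by_shot = {
    2: [60.0, 70.0, 80.0],
    32: [74.0, 75.0, 76.0],
}

spread = stability(accuracies_by_shot)
```

A smaller value in `spread` means the model is less sensitive to which few-shot samples happen to be drawn for that setting.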

5.3 Model efficiency

Given that CMA achieves the best performance with a surprisingly simple augmentation, we further explore its efficiency in comparison to other baseline models. Table 4 showcases a comparison of the accuracies and epoch times between baselines and CMA. The average accuracy of each model is determined in a 16-shot setting as shown in Table 2, along with the recording of average epoch times for each model. All experiments are run with batch size 32 on a single RTX 4090 GPU on the GossipCop dataset for a fair comparison.

Among unimodal models, dEFEND and LDA-HAN exhibit comparable accuracy and epoch times, attributed to their analogous hierarchical architectural design. While FT-RoBERTa exceeds the performance of various unimodal (e.g., 18 percentage points higher than dEFEND) and multimodal methods (e.g., 6.9 percentage points higher than CAFE), it requires modifying a significant number of trainable parameters, thus extending epoch durations (on average, four minutes per epoch) relative to other unimodal baselines.

Among the multimodal models, SAFE yields the lengthiest epoch durations owing to its prerequisite for independently pre-generating image descriptions. Although SpotFake achieves the fastest epoch duration due to its simple concatenation of the text and image features from BERT and VGG, respectively, it achieves the worst performance compared with other models. CAFE achieves the best multimodal FND outcomes by integrating a degree of ambiguity in the similarity across text and image features, albeit at the cost of marginally increased model complexity and, consequently, slightly extended epoch durations.

Table 4: Comparisons of model efficiency. Both Accuracy (%) and Time represent averages derived from five random seeds. Times displayed in green signify an average duration of less than 3 minutes, whereas those in red indicate an average exceeding 3 minutes. Gain denotes notable improvements in accuracy relative to the dEFEND model.

Model	Accuracy	Time	Gain
dEFEND	40.9	2min	0
LDA-HAN	38.7	2min	-2.2
FT-RoBERTa	58.9	4min	+18.0
SAFE	41.1	7min	+0.2
Spotfake	33.9	2min	-7.0
CAFE	52.0	3min	+11.1
KPL	57.5	3min	+16.6
M-SAMPLE	58.1	5min	+17.2
KPT	54.3	3min	+13.4
PET	69.9	6min	+29.0
P&A	71.5	2min	+30.6
CMA	74.1	$<$ 1min	+33.2

The few-shot baselines generally demonstrate improvements over both unimodal and multimodal counterparts (with FT-RoBERTa as a notable exception in Table 4), indicating the suboptimality of traditional methods in contexts with limited annotated data. Specifically, the integration of external knowledge into the prompt-tuning phase by both KPL and KPT results in comparable epoch durations. However, KPL’s design of an FND-specific prompt may underlie its superior performance over KPT. PET records the lengthiest epoch duration among the few-shot baselines, potentially due to the repeated fine-tuning of the PLM for reconfiguring input examples with the task description. P&A not only achieves the second-best performance but also the second-shortest epoch durations, benefiting from the integration of user engagements. However, it incorporates an external alignment module to correlate user engagement with the PLM’s predictions, consequently increasing epoch times relative to CMA. Finally, CMA is more efficient and precise as it avoids the need for extensive parameter fine-tuning and does not depend on intensive image augmentation processes. Additionally, the inclusion of linear probing layers atop the image and text features presents a more streamlined approach than extensive fine-tuning and precisely crafted complex model designs.

5.4 Domain shift analysis

Real-world fake news exhibits significant distribution discrepancies across topics, a phenomenon also referred to as domain shift [zhu2022generalizing, zhu2022memory]. Consequently, automatic FND methods are required to rapidly adapt to emerging topics using limited resources.

Table 5: Domain shift performance comparison. Poli $\rightarrow$ Goss refers to using few-shot samples from Politifact for training and Gossipcop for testing. Goss $\rightarrow$ Poli denotes that Gossipcop is used as the training set and Politifact as the test set. Bold and Underline denote the best and the second-best accuracy (%) in each n-shot setting. AVG is the mean accuracy across all n-shot settings.

Method	Poli $\rightarrow$ Goss				Goss $\rightarrow$ Poli
Method	2	8	16	32	2	8	16	32	AVG
KPT	40.1	31.7	31.4	31.1	56.3	55.3	54.1	55.8	44.5
PET	51.0	51.3	51.5	51.6	53.1	54.1	54.5	54.1	52.6
P&A	53.2	53.4	53.2	54.5	50.1	50.4	50.3	50.5	51.9
CMA	48.7	53.5	56.1	58.6	51.4	55.3	53.0	55.9	54.1

To address this, we investigate the cross-domain capability of the proposed CMA against three strong few-shot FND baselines (i.e., P&A, PET and KPT). Considering Politifact’s focus on political news using formal language and Gossipcop’s emphasis on entertainment and celebrity narratives in a more casual tone, we first utilize Politifact for training and Gossipcop for testing, later inverting this arrangement.

The outcomes following domain shift are presented in Table 5. Notably, while the CMA model records the highest average accuracy among the few-shot methods, the performance of each model markedly differs from that observed in the comparison experiments (as shown in Table 2). For example, KPT exhibits the strongest performance in both the 2- and 8-shot scenarios in Goss $\rightarrow$ Poli (tying with CMA at 8-shot). PET and P&A also achieve the highest accuracy in the 16-shot Goss $\rightarrow$ Poli and 2-shot Poli $\rightarrow$ Goss settings, respectively, highlighting the gap between present few-shot FND methodologies and robust cross-domain adaptation.

5.5 Feature visualization

Finally, we present a visual comparison of the features extracted by M-SAMPLE and CMA, both of which are multimodal few-shot approaches. This involves visualizing the multimodal features and assessing their semantic correlations. For each dataset, a specific sample is chosen, with the corresponding multimodal features depicted in Figure 4.

Observations indicate that: 1) CMA can capture more consistent features from the image-text pair of fake news than those of M-SAMPLE. For example, although both M-SAMPLE and CMA successfully correlate the flag in the image with the word “Chinese” in the text, CMA can also identify the semantic meaning of “moon landing” between the text and image in the PolitiFact example; 2) The proposed CMA is more accurate in capturing important features from the image than M-SAMPLE. For example, although both models can identify the person “Nicole Kidman” and “black tarantula” in both the text and the image in the GossipCop example, the image region of the tarantula slightly overlaps with that of Nicole Kidman provided by M-SAMPLE. This is even more obvious in the Weibo example, as CMA successfully captures the “blue” color bar in the toothpaste, but M-SAMPLE fails to do so.

6 Conclusion

This paper introduced Cross-Modal Augmentation (CMA) for enhancing few-shot multimodal fake news detection by utilizing unimodal features to augment multimodal fusion. The proposed CMA leverages a pre-trained multimodal model for unimodal feature extraction and transforms n-shot classification into a robust (n $\times$ z)-shot problem using class labels as additional one-shot training samples. The CMA, employing a simple linear classifier, achieves SOTA performance on three datasets in few-shot settings, and demonstrates greater efficiency than current approaches.

7 Limitations

We acknowledge the following limitations of this study: 1) The evaluation of CMA’s few-shot proficiency relies solely on CLIP. Future investigations will examine how different multimodal models influence the proposed CMA. 2) Given the lack of multimodal information in certain datasets, this research adopted cosine similarity to select an image from multiple options, so performance may vary with the text-image pairing technique employed. 3) CMA exhibits suboptimal domain shift performance. Enhancing the architecture through knowledge distillation or domain adaptation techniques remains a prospect for future research.
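
The cosine-similarity image selection mentioned in the second limitation can be sketched as follows. This is an illustrative example only: the embeddings are toy vectors rather than encoder outputs, and the function name `select_image_by_cosine` is hypothetical.

```python
import numpy as np


def select_image_by_cosine(text_embedding, image_embeddings):
    """Return the index of the candidate image most similar to the text, plus all similarities."""
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarities = images @ text  # cosine similarity of each candidate with the text
    return int(np.argmax(similarities)), similarities


# Toy embeddings (assumption): one text vector and three candidate image vectors.
text_embedding = np.array([1.0, 0.0, 1.0])
image_embeddings = np.array([
    [0.0, 1.0, 0.0],    # orthogonal to the text
    [2.0, 0.1, 2.0],    # nearly parallel to the text
    [-1.0, 0.0, -1.0],  # opposite to the text
])

best_index, similarities = select_image_by_cosine(text_embedding, image_embeddings)
print(best_index, np.round(similarities, 3))
```

Because the selection depends entirely on the embedding space, a different encoder or pairing rule could pick a different image for the same article, which is the source of the variability noted above.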

Acknowledgements

This work is funded by the Natural Science Foundation of Shandong Province under grant ZR2023QF151 and the Natural Science Foundation of China under grant 12303103.

CRediT authorship contribution statement

Ye Jiang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Taihang Wang: Methodology, Writing – review & editing. Xiaoman Xu: Data curation, Writing – review & editing. Yimin Wang: Funding acquisition, Methodology, Writing – review & editing. Xingyi Song: Supervision, Writing – review & editing. Diana Maynard: Investigation, Supervision, Writing – review & editing.

References

[1] N. K. Conroy, V. L. Rubin, Y. Chen, Automatic deception detection: Methods for finding fake news, Proceedings of the association for information science and technology 52 (1) (2015) 1–4.
[2] Y. Long, Q. Lu, R. Xiang, M. Li, C.-R. Huang, Fake news detection through multi-perspective speaker profiles, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.
[3] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, in: Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, 2018, pp. 849–857.
[4] A. Lao, C. Shi, Y. Yang, Rumor detection with field of linear and non-linear propagation, in: Proceedings of the Web Conference 2021, 2021, pp. 3178–3187.
[5] Y. Chen, D. Li, P. Zhang, J. Sui, Q. Lv, L. Tun, L. Shang, Cross-modal ambiguity learning for multimodal fake news detection, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 2897–2905.
[6] Y. Zhou, Y. Yang, Q. Ying, Z. Qian, X. Zhang, Multimodal fake news detection via clip-guided learning, in: 2023 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2023, pp. 2825–2830.
[7] L. Wang, C. Zhang, H. Xu, Y. Xu, X. Xu, S. Wang, Cross-modal contrastive learning for multimodal fake news detection, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5696–5704.
[8] L. Wu, Y. Long, C. Gao, Z. Wang, Y. Zhang, Mfir: Multimodal fusion and inconsistency reasoning for explainable fake news detection, Information Fusion 100 (2023) 101944.
[9] Z. Qu, Y. Meng, G. Muhammad, P. Tiwari, Qmfnd: A quantum multimodal fusion-based fake news detection model for social media, Information Fusion 104 (2024) 102172.
[10] Y. Gao, J. Liu, Z. Xu, J. Zhang, K. Li, R. Ji, C. Shen, Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, Advances in neural information processing systems 35 (2022) 35959–35970.
[11] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, S. C. H. Hoi, Align before fuse: Vision and language representation learning with momentum distillation, Advances in neural information processing systems 34 (2021) 9694–9705.
[12] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 2931–2937.
[13] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media, Big data 8 (3) (2020) 171–188.
[14] A. N. Meltzoff, R. W. Borton, Intermodal matching by human neonates, Nature 282 (5737) (1979) 403–404.
[15] B. Nanay, Multimodal mental imagery, Cortex 105 (2018) 125–134.
[16] J. Wu, S. Li, A. Deng, M. Xiong, B. Hooi, Prompt-and-align: Prompt-based social alignment for few-shot fake news detection, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 2726–2736.
[17] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021, Association for Computational Linguistics (ACL), 2021, pp. 3816–3830.
[18] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H.-T. Zheng, M. Sun, Openprompt: An open-source framework for prompt-learning, arXiv preprint arXiv:2111.01998 (2021).
[19] G. Jiang, S. Liu, Y. Zhao, Y. Sun, M. Zhang, Fake news detection via knowledgeable prompt learning, Information Processing & Management 59 (5) (2022) 103029.
[20] Y. Jiang, X. Yu, Y. Wang, X. Xu, X. Song, D. Maynard, Similarity-aware multimodal prompt learning for fake news detection, Information Sciences 647 (2023) 119446.
[21] Q. Guo, Z. Kang, L. Tian, Z. Chen, Tiefake: Title-text similarity and emotion-aware fake news detection, arXiv preprint arXiv:2304.09421 (2023).
[22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[23] C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: Proceedings of the 20th international conference on World wide web, 2011, pp. 675–684.
[24] B. Tabibian, I. Valera, M. Farajtabar, L. Song, B. Schölkopf, M. Gomez-Rodriguez, Distilling information reliability and source trustworthiness from digital traces, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 847–855.
[25] C. Geeng, S. Yee, F. Roesner, Fake news on facebook and twitter: Investigating how people (don’t) investigate, in: Proceedings of the 2020 CHI conference on human factors in computing systems, 2020, pp. 1–14.
[26] X. Liu, Q. Li, A. Nourbakhsh, R. Fang, M. Thomas, K. Anderson, R. Kociuba, M. Vedder, S. Pomerville, R. Wudali, et al., Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 207–216.
[27] M. Fedoryszak, B. Frederick, V. Rajaram, C. Zhong, Real-time event detection on social data streams, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2774–2782.
[28] P. Bahad, P. Saxena, R. Kamal, Fake news detection using bi-directional lstm-recurrent neural network, Procedia Computer Science 165 (2019) 74–82.
[29] S. Sridhar, S. Sanagavarapu, Fake news detection and analysis using multitask learning with bilstm capsnet model, in: 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2021, pp. 905–911.
[30] H. T. Phan, N. T. Nguyen, D. Hwang, Fake news detection: A survey of graph neural network methods, Applied Soft Computing (2023) 110235.
[31] X. Song, J. Petrak, Y. Jiang, I. Singh, D. Maynard, K. Bontcheva, Classification aware neural topic model for covid-19 disinformation categorisation, PloS one 16 (2) (2021) e0247086.
[32] Y. Jiang, X. Song, C. Scarton, A. Aker, K. Bontcheva, Categorising fine-to-coarse grained misinformation: An empirical study of covid-19 infodemic, arXiv preprint arXiv:2106.11702 (2021).
[33] Y. Jiang, Y. Wang, X. Song, D. Maynard, Comparing topic-aware neural networks for bias detection of news, in: ECAI 2020, IOS Press, 2020, pp. 2054–2061.
[34] B. Ghanem, P. Rosso, F. Rangel, An emotional analysis of false information in social media and news articles, ACM Transactions on Internet Technology (TOIT) 20 (2) (2020) 1–18.
[35] Y. Jiang, Team QUST at SemEval-2023 task 3: A comprehensive study of monolingual and multilingual approaches for detecting online news genre, framing and persuasion techniques, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 300–306.
[36] Y. Wu, P. Zhan, Y. Zhang, L. Wang, Z. Xu, Multimodal fusion with co-attention networks for fake news detection, in: Findings of the association for computational linguistics: ACL-IJCNLP 2021, 2021, pp. 2560–2569.
[37] S. Singhal, T. Pandey, S. Mrig, R. R. Shah, P. Kumaraguru, Leveraging intra and inter modality relationship for multimodal fake news detection, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 726–734.
[38] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM computing surveys (csur) 53 (3) (2020) 1–34.
[39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, Advances in neural information processing systems 29 (2016).
[40] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, Advances in neural information processing systems 30 (2017).
[41] S. Motiian, Q. Jones, S. Iranmanesh, G. Doretto, Few-shot adversarial domain adaptation, Advances in neural information processing systems 30 (2017).
[42] A. Zhao, M. Ding, Z. Lu, T. Xiang, Y. Niu, J. Guan, J.-R. Wen, Domain-adaptive few-shot learning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1390–1399.
[43] A. Sharaf, H. H. Awadalla, H. Daumé III, Meta-learning for few-shot nmt adaptation, in: Proceedings of the Fourth Workshop on Neural Generation and Translation, 2020, pp. 43–53.
[44] C. Han, Z. Fan, D. Zhang, M. Qiu, M. Gao, A. Zhou, Meta-learning adversarial domain adaptation network for few-shot text classification, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1664–1673.
[45] Z. Yue, H. Zeng, Y. Zhang, L. Shang, D. Wang, Metaadapt: Domain adaptive few-shot misinformation detection via meta learning, Association for Computational Linguistics (2023) 5223–5239.
[46] Q. Zhang, H. Huang, S. Liang, Z. Meng, E. Yilmaz, Learning to detect few-shot-few-clue misinformation, arXiv preprint arXiv:2108.03805 (2021).
[47] E. Schwartz, L. Karlinsky, R. Feris, R. Giryes, A. Bronstein, Baby steps towards few-shot learning with multiple semantics, Pattern Recognition Letters 160 (2022) 142–147.
[48] H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, Y. Yang, Cross-modal contrastive learning for text-to-image generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 833–842.
[49] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, J. Gao, Glipv2: Unifying localization and vision-language understanding, Advances in Neural Information Processing Systems 35 (2022) 36067–36080.
[50] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
[51] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
[52] R. Girdhar, D. Ramanan, Attentional pooling for action recognition, Advances in neural information processing systems 30 (2017).
[53] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, Y. Qiao, Clip-adapter: Better vision-language models with feature adapters, International Journal of Computer Vision (2023) 1–15.
[54] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, H. Li, Tip-adapter: Training-free clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930 (2021).
[55] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.
[56] M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong, et al., Robust fine-tuning of zero-shot models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7959–7971.
[57] B. Schölkopf, R. Herbrich, A. J. Smola, A generalized representer theorem, in: International conference on computational learning theory, Springer, 2001, pp. 416–426.
[58] Z. Jin, J. Cao, H. Guo, Y. Zhang, J. Luo, Multimodal fusion with recurrent neural networks for rumor detection on microblogs, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 795–816.
[59] A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, C. Zhou, Chinese clip: Contrastive vision-language pretraining in chinese, arXiv preprint arXiv:2211.01335 (2022).
[60] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, defend: Explainable fake news detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 395–405. doi:10.1145/3292500.3330935.
[61] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, S. Satoh, Spotfake: A multi-modal framework for fake news detection, in: 2019 IEEE fifth international conference on multimedia big data (BigMM), IEEE, 2019, pp. 39–47.
[62] X. Zhou, J. Wu, R. Zafarani, Safe: Similarity-aware multi-modal fake news detection, in: Pacific-Asia Conference on knowledge discovery and data mining, Springer, 2020, pp. 354–367.
[63] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269.
[64] S. Hu, N. Ding, H. Wang, Z. Liu, J. Wang, J. Li, W. Wu, M. Sun, Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2225–2240.