Cross-Modal Augmentation for Few-Shot Multimodal Fake News Detection

Ye Jiang Taihang Wang Xiaoman Xu Yimin Wang Xingyi Song Diana Maynard
Abstract

The nascent topic of fake news requires automatic detection methods to quickly learn from limited annotated samples. Therefore, the capacity to rapidly acquire proficiency in a new task with limited guidance, also known as few-shot learning, is critical for detecting fake news in its early stages. Existing approaches either involve fine-tuning pre-trained language models which come with a large number of parameters, or training a complex neural network from scratch with large-scale annotated datasets. This paper presents a multimodal fake news detection model which augments multimodal features using unimodal features. For this purpose, we introduce Cross-Modal Augmentation (CMA), a simple approach for enhancing few-shot multimodal fake news detection by transforming n-shot classification into a more robust $(n \times z)$-shot problem, where $z$ represents the number of supplementary features. The proposed CMA achieves SOTA results over three benchmark datasets, utilizing a surprisingly simple linear probing method to classify multimodal fake news with only a few training samples. Furthermore, our method is significantly more lightweight than prior approaches, particularly in terms of the number of trainable parameters and epoch times. The code is available here: https://github.com/zgjiangtoby/FND_fewshot

Keywords: Fake news detection, Multimodal fusion, Few-shot learning, Natural language processing

Affiliations:
[1] College of Information Science and Technology, Qingdao University of Science and Technology, China
[2] College of Data Science, Qingdao University of Science and Technology, China
[3] Department of Computer Science, University of Sheffield, UK

1 Introduction

The recent proliferation of social media has not only transformed the landscape of information exchange, but also led to the pernicious spread of fake news. The detection and mitigation of fake news have consequently become pivotal areas of research [1, 2]. Traditional approaches, primarily relying on textual analysis, have shown limitations due to the sophisticated and multi-faceted nature of fake news [3, 4]. In response, many studies have incorporated multimodal methods that consider both text and accompanying images, yielding a more comprehensive and effective framework for identifying and debunking fake news [5, 6].

To explore the inconsistent semantics between text and image in fake news, many studies have either incorporated contrastive learning to achieve better alignment between image-text pairs [7], or designed complex neural networks to strengthen the deep-level fusion of multimodal features [8, 9]. The former relies on contrastive loss to align image-text pairs, but most image-text pairs in fake news are inherently not matched [10], and different image-text pairs may also have potential correlations [11], which can consequently confuse the model. The latter typically needs to be trained from scratch, which is fundamentally bounded by the availability of large-scale annotated data [12, 13].

In contrast to machines, the process of concept learning in humans involves integrating multimodal signals and representations [14, 15]. When processing uncertain information, people inherently seek help from other modalities. This capability enables humans to learn from a limited number of samples by incorporating cross-modal information, as shown in Figure 1. Meanwhile, the efficacy of fake news detection (FND) in the context of nascent topics, such as COVID-19, remains a significant challenge for prevailing strategies. This difficulty is compounded by the lack of extensive data and annotations in the target domain, underscoring the critical role of few-shot learning in mitigating the spread of early-stage fake news [16].

Figure 1: Information from different modalities assists humans in decision-making, especially when faced with uncertainty.

In the context of emerging topics with limited training samples, prompt learning, through its few-shot learning capacity, encapsulates news articles in task-specific textual prompts for direct knowledge extraction from pre-trained language models (PLMs), achieving comparable performance across different tasks [17, 18]. However, most prompt-based methods primarily tune the PLM with unimodal textual information from fake news [19, 16], thus once again ignoring the multimodal nature of fake news. Even though the previous method [20] attempts to integrate the different prompt templates with image features extracted from the pre-trained vision model, the fusion strategy still utilized the multimodal features only, potentially struggling to address spatial discrepancies between visual and textual semantics [8, 21].

In this paper, we propose a Cross-Modal Augmentation (CMA) method to explore how unimodal features could assist in multimodal fusion for FND in few-shot scenarios. Specifically, we leverage the foundational multimodal model CLIP [22] to extract textual and visual features from fake news simultaneously. Utilizing class labels as supplementary one-shot training instances, the n-shot classification can then be converted to an $(n \times z)$-shot problem, where $z$ represents the number of supplementary features (e.g., the fused feature from text and image). Meanwhile, we also fuse unimodal features with the cross-attention mechanism [7] to obtain another supplementary feature. Finally, we employ simple linear probing for each modality as well as for the fused multimodal features. The experimental results indicate that CMA achieves SOTA results across three datasets.

The main contributions of this paper are:

  1. Introduction of a Cross-Modal Augmentation (CMA) method for few-shot multimodal fake news detection, utilizing unimodal features to enhance multimodal fusion.

  2. Leveraging a pre-trained multimodal model to extract unimodal features, and repurposing class labels as additional one-shot training samples, transforming the n-shot classification into a more robust $(n \times z)$-shot problem.

  3. By freezing the pre-trained multimodal model and training only with a simple linear classifier, the proposed CMA achieves SOTA results over three datasets, outperforming 11 baseline models and surpassing previous methods in efficiency.

2 Related work

2.1 Unimodal fake news detection

Unimodal fake news detection aims to extract significant semantics from either news texts or images. Given the precision of semantics in text, previous approaches have concentrated on the task of text-based unimodal fake news detection. Early works focused on analyzing statistical characteristics of text (e.g., length, punctuation, exclamation marks) [23] and metadata (e.g., likes, shares) [24, 25] for manual fake news detection. However, these manual feature engineering approaches are time-consuming and struggle with processing large-scale, real-time data [26, 27].

The advent of deep learning has significantly advanced automated fake news detection. These methods primarily utilize deep learning models like BiLSTM [28, 29], GNNs [30], and pre-trained models (e.g., BERT, GPT) [31, 32, 33] to analyze text features, extracting various attributes such as emotional [34], stance-based [35], and stylistic elements [36]. However, the recent proliferation of multimodal information (text, images, videos) in social networks has shifted the propagation of fake news from solely text-based to multimodal formats.

Figure 2: The overall architecture of the CMA model.

2.2 Multimodal fake news detection

Multimodal methods employing cross-modal discriminative patterns have been introduced, aiming to enhance performance in fake news detection. For example, MCAN [36] employs multiple co-attention layers to more effectively integrate textual and visual features in detecting fake news. CAFE [5] quantifies cross-modal ambiguity through the assessment of the Kullback-Leibler (KL) divergence among the distributions of unimodal features. LIIMR [37] determines the modality that exhibits greater confidence in the context of fake news detection. COOLANT [7] focuses on improving the alignment between image and text representations, utilizing contrastive learning for finer semantic alignment and cross-modal fusion to learn inter-modality correlations. However, these approaches are limited by the need for extensive annotated data in the context of emerging topics.

2.3 Cross-modal few-shot fake news detection

Few-shot learning is designed to master new tasks using a limited number of labeled examples [38]. Current few-shot learning methodologies, such as prototypical networks, acquire class-specific features in metric spaces for swift adaptation to novel tasks [39, 40]. Within computer vision, the concept of few-shot domain adaptation is explored in image classification for transferring knowledge to novel target domains [41, 42]. In natural language processing, meta-learning is suggested as a means to enhance few-shot learning performance in tasks like language modeling [43, 44] and misinformation detection [45, 46]. To our knowledge, the application of few-shot multimodal fake news detection through cross-modal augmentation remains unexplored in existing literature.

Meanwhile, previous multimodal learning approaches have sought to enhance unimodal tasks by leveraging data from various modalities [47, 48]. With multimodal pre-trained models achieving notable success in classic vision tasks [22, 49], there is a growing interest in formulating more efficient cross-modal augmentation techniques.

However, the prevailing techniques are based on successful strategies originally designed for multimodal foundational models. For example, CLIP utilizes linear probing [50, 51] and comprehensive fine-tuning [52] in its application to downstream tasks. CLIP-Adapter [53] and Tip-Adapter [54] draw inspiration from parameter-efficient finetuning approaches [55] that focus on optimizing lightweight MLPs while maintaining a fixed encoder. However, all the aforementioned methods, including WiSE-FT [56], employ an alternative modality, such as textual labels, as classifier weights, and continue to compute a unimodal Softmax loss on few-shot tasks. In contrast, this paper demonstrates the enhanced effectiveness of incorporating additional modalities as training samples.

3 Methodology

The proposed CMA enhances few-shot fake news detection by integrating samples from different modalities, and extends traditional unimodal few-shot classification to leverage the richness of cross-modal data, as shown in Figure 2.

This section starts with a standard unimodal few-shot FND framework, and the loss function is discussed. Then, it extends this to multiple modalities, assuming each training example is a combination of five different modalities. The modality-specific features are passed through MLP linear classifiers to obtain their inferences. Finally, we combine the inferences and train a meta-linear classifier to compute the final prediction.

3.1 Unimodal few-shot FND

Initially, unimodal few-shot FND learns from a labeled dataset of $(x, y) \in X$, where $x$ is either the text or the image passed to a pre-trained feature encoder $\phi(\cdot)$. The ultimate goal is to assign a binary classification label $y \in \{0, 1\}$, in which 0 denotes real news and 1 denotes fake news. We assume only an n-shot subset $(x_i, y_i)$ from $X$ is provided for training, where $i \in [1, n]$ (i.e., $n$ samples per class); the rest of $X$ is used as the test set.

Therefore, the standard unimodal FND can be denoted as minimizing the cross-entropy loss $L$:

$$L = -\left(y_i \log(y_i') + (1 - y_i) \log(1 - y_i')\right) \tag{1}$$

where $y_i'$ is the model inference from the linear classifier $MLP$ after softmax.

$$y_i' = \mathrm{softmax}(MLP(f(x_i))) = -\log\left(\frac{e^{w_y * f}}{\sum_{y'} e^{w_{y'} * f}}\right) \tag{2}$$

where $f$ is the feature representation from an MLP layer after the unimodal feature encoder, and $w_y$ and $w_{y'}$ are the weights of the ground truth label and the predicted label, respectively.
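The linear probing step and the loss in Equation 1 can be illustrated with a minimal sketch. It assumes a frozen encoder has already produced a feature vector; the three-dimensional feature, the weight rows, and the helper names `linear_probe` and `binary_cross_entropy` are hypothetical and chosen only for illustration, not taken from the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_probe(feature, weights, biases):
    """One linear layer followed by softmax: returns [P(real), P(fake)]."""
    logits = [sum(w * x for w, x in zip(row, feature)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def binary_cross_entropy(y_true, p_fake, eps=1e-12):
    """Equation 1 for one sample, with y_true in {0, 1} and p_fake = y_i'."""
    p = min(max(p_fake, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Hypothetical frozen-encoder output and probe parameters (toy values).
feature = [0.2, -0.1, 0.4]
weights = [[0.5, 0.1, -0.3],   # row for class 0 (real)
           [-0.2, 0.4, 0.6]]   # row for class 1 (fake)
biases = [0.0, 0.0]

probs = linear_probe(feature, weights, biases)
loss = binary_cross_entropy(1, probs[1])  # the sample is labeled fake
```

Only `weights` and `biases` would be updated during training in this sketch; the encoder that produced `feature` stays frozen.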

3.2 Multimodal few-shot FND

To extend to multimodal FND, we assume that for each training sample, $f$ is a combination of five feature representations: 1) a text-only feature $f_t$; 2) an image-only feature $f_m$; 3) an L2-normalized concatenation $f_c = [f_t \oplus f_m]$, where $\oplus$ is the concatenation operation; 4) an image-text cross-attended feature $f_{mt}$; 5) a text-image cross-attended feature $f_{tm}$. The cross-attention mechanism, which swaps the text query $Q_t$ with the image query $Q_m$ to obtain the cross-attended feature $f_{mt}$, is denoted as follows:

$$f_{mt} = \mathrm{CrossAtt}_{m \rightarrow t}(Q_m, K_t, V_t) = \mathrm{softmax}\left(\frac{Q_m K_t^{T}}{\sqrt{d}}\right) V_t \tag{3}$$

In contrast, by swapping the image query $Q_m$ with the text query $Q_t$, the cross-attended feature $f_{tm}$ can be obtained:

$$f_{tm} = \mathrm{CrossAtt}_{t \rightarrow m}(Q_t, K_m, V_m) = \mathrm{softmax}\left(\frac{Q_t K_m^{T}}{\sqrt{d}}\right) V_m \tag{4}$$

where $K_t$ and $K_m$ represent the key vectors for text and image features respectively, $V_t$ and $V_m$ denote the corresponding value vectors, and $d$ refers to the dimensionality of the model.
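Equations 3 and 4 are both instances of scaled dot-product attention with the query taken from one modality and the keys and values from the other. The sketch below shows this on toy matrices. It assumes the queries, keys, and values are given directly as lists of row vectors; how they are projected from the CLIP features is not specified in this section, so no projection layers are modeled, and the function name `cross_attention` is hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(q[0])
    out = []
    for q_row in q:
        # Scaled similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Toy inputs (hypothetical): one image query attending over two text tokens.
Q_m = [[1.0, 1.0]]
K_t = [[1.0, 0.0], [1.0, 0.0]]
V_t = [[2.0, 0.0], [0.0, 2.0]]
f_mt = cross_attention(Q_m, K_t, V_t)   # Equation 3

# Two text queries attending over a single image token.
Q_t = [[1.0, 0.0], [0.0, 1.0]]
K_m = [[1.0, 0.0]]
V_m = [[0.3, 0.7]]
f_tm = cross_attention(Q_t, K_m, V_m)   # Equation 4
```

Because both text keys are identical in the first example, the attention weights are uniform and `f_mt` is the mean of the two value rows.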

For simplicity, we treat the $z$ different types of features as distinct modalities. Each modality can therefore be processed through the linear classifier MLP in the unimodal learning approach discussed above, yielding five inferred probabilities.
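As a rough illustration of how the five feature types might be assembled for one sample, the sketch below concatenates the text and image features and then L2-normalizes the result, following the order given in Algorithm 1. The helper names and toy vectors are hypothetical, and treating $z = 5$ as the number of feature types is an assumption based on the five features listed above.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def build_feature_views(f_t, f_m, f_mt, f_tm):
    """Collect the five feature types for one news sample."""
    f_c = l2_normalize(f_t + f_m)  # list concatenation, then normalization
    return {"f_t": f_t, "f_m": f_m, "f_c": f_c, "f_mt": f_mt, "f_tm": f_tm}

# Toy 2-dimensional features (hypothetical values).
views = build_feature_views(
    f_t=[0.6, 0.8], f_m=[1.0, 0.0], f_mt=[0.5, 0.5], f_tm=[0.3, 0.7]
)

n_shots = 8                            # labeled samples per class
z = len(views)                         # assumed: one "modality" per feature type
effective_shots = n_shots * z          # the (n x z)-shot view of the same data
```

With 8 shots per class and five feature types, the classifier sees 40 training signals per class instead of 8 under this reading.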

Algorithm 1 Cross-modal Augmentation Algorithm
Input: source data $X$, number of seeds $S$, number of shots $n$
Initialize pre-trained multimodal model;
for seed $\in \{1, 2, \dots, S\}$ do
    for $x_i$ in $\{x_1, \dots, x_n\}$ do
        Extract image feature $f_m$ from the pre-trained vision model;
        Extract text feature $f_t$ from the pre-trained language model;
        Concatenate $f_m$ with $f_t$ and L2 normalize to obtain $f_c$;
        Compute cross-attended features $f_{mt}$ and $f_{tm}$ with Equations 3 and 4;
        Obtain inferences of each of the above features with linear classifiers from Equation 2;
        Concatenate inferences and compute the final prediction with Equation 5;
        Compute cross-entropy loss with Equation 1;
    end for
end for

Inspired by the Representer Theorem [57], which indicates that optimally trained classifiers can be depicted as linear combinations of their training samples, we concatenate the five inferred probabilities as a new input to a meta-linear MLP classifier for making the final prediction:

$$\hat{y} = \mathrm{softmax}(MLP(f_t \oplus f_m \oplus f_c \oplus f_{mt} \oplus f_{tm})) \tag{5}$$

Instead of optimizing modality-specific weights independently, linear classification through the proposed CMA simultaneously determines all weights to minimize the training loss. Consequently, we convert the standard n-shot classification to an $(n \times z)$-shot problem. The training details for CMA are presented in Algorithm 1.
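The meta-linear step can be sketched as a second linear layer applied to the concatenated per-feature class probabilities. The probabilities, the meta weights, and the function name `meta_predict` below are hypothetical toy values; the weights are hand-set so the layer behaves like a plain sum over feature types, whereas in training they would be learned jointly with the per-feature classifiers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def meta_predict(per_view_probs, meta_weights, meta_biases):
    """Concatenate per-feature probabilities and apply one linear layer + softmax."""
    stacked = [p for probs in per_view_probs for p in probs]
    logits = [sum(w * x for w, x in zip(row, stacked)) + b
              for row, b in zip(meta_weights, meta_biases)]
    return stacked, softmax(logits)

# [P(real), P(fake)] from the five per-feature classifiers (toy values).
per_view_probs = [
    [0.30, 0.70],  # f_t
    [0.40, 0.60],  # f_m
    [0.20, 0.80],  # f_c
    [0.60, 0.40],  # f_mt
    [0.35, 0.65],  # f_tm
]

# Hand-set meta weights: class 0 sums the "real" entries, class 1 the "fake" ones.
meta_weights = [[1.0, 0.0] * 5,
                [0.0, 1.0] * 5]
meta_biases = [0.0, 0.0]

stacked, y_hat = meta_predict(per_view_probs, meta_weights, meta_biases)
predicted_label = max(range(2), key=lambda c: y_hat[c])
```

Four of the five feature types favor the fake class here, so the summed evidence yields a fake prediction even though the image-text cross-attended feature disagrees.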

4 Experiment

This section details experiments conducted to validate the effectiveness of the proposed approach. Initially, benchmark datasets are introduced, followed by the implementation details for experiments. The experimental results are analyzed in comparison to unimodal, multimodal, and few-shot FND methods. Finally, detailed analyses are provided to enhance the understanding of the proposed methods.

4.1 Data Setup

Three publicly available datasets are utilized for evaluation.

PolitiFact [13] comprises a dataset of political news categorized as either fake or real by expert evaluators and is part of the benchmark FakeNewsNet project. Using the provided data crawling scripts, news with no images or invalid image URLs are removed, resulting in 198 multimodal news articles.

GossipCop [13] features entertainment stories rated on a scale from 0 to 10, with stories scoring less than five classified as fake news by the author of FakeNewsNet. Using the same retrieval strategies as PolitiFact, 6,805 multimodal news articles are collected.

Weibo [58], a dataset sourced from Chinese social media platforms, comprises a multimodal fake news collection featuring both text and images. Authentic news items were crawled from a reputable source (Xinhua News), and fake news was obtained from Weibopiyao, an official rumor refutation platform of Weibo, that aggregates content either through crowdsourcing or official rumor refutation efforts. The same pre-processing methods as in previous work [7] are followed, resulting in 7,853 Chinese news articles.

Notably, a news article might be accompanied by multiple images. To find the most relevant image, the cosine similarity between each image and its corresponding text is calculated, and the image-text pair with the highest similarity, as determined by the pre-trained CLIP, is retained. The resulting dataset statistics are presented in Table 1.
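The image selection step can be sketched as an argmax over cosine similarities between one text embedding and several candidate image embeddings. The embeddings below are toy two-dimensional vectors standing in for CLIP outputs, and `select_best_image` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_best_image(text_emb, image_embs):
    """Return (index, similarity) of the image most similar to the text."""
    sims = [cosine_similarity(text_emb, img) for img in image_embs]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims[best]

# Toy embeddings (hypothetical): one article text with three attached images.
text_emb = [1.0, 0.0]
image_embs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

best_index, best_sim = select_best_image(text_emb, image_embs)
```

Only the pair at `best_index` would be kept for the article; the other images are discarded.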

Table 1: The statistics of the pre-processed multimodal fake news datasets. Avg tokens denote the average number of tokens per article.

| Statistics | PolitiFact | GossipCop | Weibo |
|---|---|---|---|
| Total news | 198 | 6,805 | 7,853 |
| Fake news | 96 | 1,877 | 4,211 |
| Real news | 102 | 4,928 | 3,642 |
| Avg tokens | 2,148 | 728 | 67 |

4.2 Implementation details

The pre-trained OpenAI CLIP (ViT-B-32) [22] and Chinese CLIP (ViT-B-16) [59] models are used to extract text and image features for the English and Chinese datasets, respectively. The hidden size for the cross-attention projection layer is 512, which is the same as the output dimension of the CLIP encoders. The AdamW optimizer is employed with a learning rate of 1e-3 and a decay parameter of 1e-2. The model is trained for 20 epochs, with the optimal checkpoint being determined by peak validation performance. Early stopping is utilized with a patience of three epochs.
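The checkpoint selection and early stopping rule can be illustrated on a list of per-epoch validation scores. This is a sketch under stated assumptions: the validation curve is made up, how the validation split is formed is not described in this section, and `early_stopping_run` is a hypothetical helper that only tracks the bookkeeping rather than training a model.

```python
def early_stopping_run(val_scores, max_epochs=20, patience=3):
    """Return (best_epoch, best_score, last_epoch_run) for a validation curve."""
    best_epoch, best_score = -1, float("-inf")
    stale, last_epoch = 0, -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            # New peak: this epoch's checkpoint becomes the one to keep.
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, last_epoch

# Made-up validation accuracies per epoch (0-indexed).
val_curve = [0.60, 0.65, 0.70, 0.69, 0.68, 0.66, 0.90]
best_epoch, best_score, last_epoch = early_stopping_run(val_curve)
```

In this toy curve, training stops after epoch 5 and the checkpoint from epoch 2 is kept, so the later value 0.90 is never reached.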

In the few-shot context, the model is trained using a restricted set of samples, selected from the dataset to form an n-shot scenario. Here, $n \in [2, 8, 16, 32]$ represents the number of samples for each class, while the remainder of the samples are reserved for testing purposes. Given that the data quality of the sampled training set might significantly impact the model’s performance, data sampling is repeated 10 times with random seeds, and the average score is reported after excluding the highest and lowest scores.
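The sampling and reporting protocol can be sketched as follows: draw $n$ samples per class with a seeded generator, keep everything else for testing, and average the per-seed scores after dropping the highest and lowest. The label list and the ten accuracy values are made up for illustration, and the helper names are hypothetical.

```python
import random

def sample_n_shot(labels, n, seed):
    """Pick n training indices per class; all remaining indices form the test set."""
    rng = random.Random(seed)
    train = []
    for cls in sorted(set(labels)):
        candidates = [i for i, y in enumerate(labels) if y == cls]
        train.extend(rng.sample(candidates, n))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

def trimmed_mean(scores):
    """Average after excluding one highest and one lowest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

# Toy dataset: 20 real (0) and 20 fake (1) articles.
labels = [0] * 20 + [1] * 20
train_idx, test_idx = sample_n_shot(labels, n=2, seed=0)

# Made-up accuracies from 10 seeds.
seed_scores = [70.0, 72.0, 74.0, 60.0, 90.0, 71.0, 73.0, 75.0, 69.0, 76.0]
reported = trimmed_mean(seed_scores)
```

Dropping the extremes (60.0 and 90.0 here) keeps a single unusually easy or hard draw from dominating the reported average.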

4.3 Benchmarked Models

The proposed CMA is benchmarked against 11 representative models. Specifically, we extensively compare the proposed method with unimodal approaches (1)-(3), multimodal approaches (4)-(6), and the few-shot approaches (7)-(11).

(1) dEFEND [60] utilizes the hierarchical attention network for FND. In this study, we remove the user comments from the original model.

(2) LDA-HAN [33] integrates pre-calculated topic distributions from Latent Dirichlet Allocation into a hierarchical attention network for text classification.

(3) FT-RoBERTa is a standard, fine-tuned version of the pre-trained language model RoBERTa; we use Huggingface Trainer to conduct the fine-tuning experiment.

(4) SpotFake [61] employs the pre-trained VGG and BERT for extracting image and text features, respectively, and then concatenating them for final classification.

(5) SAFE [62] transforms images into textual descriptions and utilizes the correlation between text and visual information for FND.

(6) CAFE [5] employs an ambiguity-aware multimodal strategy to adaptively aggregate unimodal features and their correlations.

(7) KPL [19] employs prompt learning in RoBERTa by enhancing it with external knowledge representations.

(8) M-SAMPLE [20] incorporates prompt learning with multimodal FND. It also applies a similarity-aware fusing to adaptively combine the intensity of multimodal representation for FND.

(9) PET [63] employs PLMs with task descriptions for supervised training, employing task-related cloze questions and verbalizers.

(10) KPT [64] enhances the label word space by incorporating class-related tokens that exhibit diverse granularities and perspectives.

(11) P&A [16] combines prompt-based learning with social alignment techniques and addresses label scarcity by using task-specific prompts in PLMs to elicit relevant knowledge.

4.4 Results

Table 2 demonstrates the FND accuracy comparison between the proposed CMA and all the baselines at various few-shot settings over the three datasets.

Table 2: Performance comparison between the CMA and the baseline models in accuracy (%). Bold indicates the best performance. Underline is the second-best performance. AVG denotes the average accuracy per model across all n-shot settings and datasets. Notably, the experimental results of P&A in Weibo are not accessible since it would require constructing the news proximity graph from the raw social context, which is not provided in the Weibo dataset.

| Method | PolitiFact n=2 | PolitiFact n=8 | PolitiFact n=16 | PolitiFact n=32 | GossipCop n=2 | GossipCop n=8 | GossipCop n=16 | GossipCop n=32 | Weibo n=2 | Weibo n=8 | Weibo n=16 | Weibo n=32 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dEFEND | 21.3 | 39.7 | 37.5 | 54.1 | 25.6 | 26.0 | 44.1 | 47.8 | 31.9 | 33.0 | 40.1 | 44.5 | 37.1 |
| LDA-HAN | 39.4 | 47.3 | 52.2 | 54.9 | 21.2 | 30.4 | 39.5 | 41.3 | 40.3 | 41.8 | 44.4 | 50.9 | 42.0 |
| FT-RoBERTa | 52.0 | 63.1 | 70.0 | 72.5 | 41.3 | 60.4 | 62.6 | 65.9 | 39.7 | 58.1 | 64.3 | 66.3 | 59.7 |
| SAFE | 19.0 | 27.3 | 48.7 | 52.1 | 31.3 | 45.2 | 45.4 | 47.1 | 21.1 | 19.3 | 39.4 | 41.1 | 36.4 |
| SpotFake | 49.3 | 53.7 | 58.5 | 63.4 | 28.3 | 28.4 | 34.4 | 36.1 | 36.9 | 41.3 | 40.4 | 53.7 | 43.7 |
| CAFE | 38.6 | 46.4 | 48.9 | 51.0 | 42.3 | 48.1 | 55.9 | 59.3 | 44.4 | 40.6 | 47.5 | 51.3 | 47.9 |
| KPL | 55.1 | 60.7 | 65.5 | 66.3 | 53.3 | 54.8 | 58.6 | 61.3 | 45.4 | 49.3 | 50.2 | 59.9 | 56.7 |
| M-SAMPLE | 56.2 | 66.1 | 69.5 | 73.4 | 53.4 | 54.1 | 59.7 | 66.0 | 49.7 | 52.1 | 59.8 | 65.7 | 60.5 |
| KPT | 68.1 | 74.8 | 80.0 | 83.2 | 52.5 | 56.5 | 58.1 | 67.0 | 56.9 | 69.4 | 69.9 | 71.2 | 67.3 |
| PET | 73.2 | 68.4 | 68.3 | 70.1 | 65.7 | 66.9 | 68.3 | 71.1 | 65.4 | 66.6 | 70.3 | 71.5 | 68.8 |
| P&A | 71.9 | 80.7 | 81.7 | 83.5 | 54.9 | 58.4 | 75.6 | 69.3 | - | - | - | - | 72.0 |
| CMA (Ours) | 73.5 | 75.8 | 82.5 | 87.3 | 71.9 | 69.0 | 71.7 | 77.0 | 74.5 | 69.9 | 73.8 | 76.5 | 75.3 |

Comparing with unimodal baselines. First, we assess the accuracy of both unimodal approaches and the proposed CMA to evaluate their performances. Overall, CMA outperforms the best unimodal approach, FT-RoBERTa, by 15.6 percentage points in average accuracy across all datasets (75.3% vs. 59.7%), demonstrating its superiority in few-shot scenarios.

Surprisingly, FT-RoBERTa emerges as the most accurate model among both unimodal and multimodal approaches, suggesting that conventional fine-tuning methods can reach competitive levels of performance solely through the analysis of textual information from fake news. However, this method necessitates increased epoch time due to the adjustment of numerous parameters in the pre-trained language model (as shown in Table LABEL:tab:params), making it impractical for real-world few-shot FND applications.

LDA-HAN yields the second best in accuracy among unimodal models, with dEFEND coming in next. This could be attributed to two factors: firstly, the vanilla LDA model struggles to effectively generate topics from short texts, a characteristic of the datasets from GossipCop and Weibo (as detailed in Table 1) used in LDA-HAN; secondly, the employment of GloVe embeddings for initializing LDA-HAN and dEFEND may not perform as effectively as the contextualized embeddings generated by the BERT family.

Comparing with multimodal baselines. We evaluate the performance of CMA in comparison with multimodal approaches. CMA outperforms the best multimodal baseline, CAFE, by 27.4 percentage points in average accuracy across all datasets (75.3% vs. 47.9%). The reason might be that the complex architecture of multimodal approaches inherently comes with a large number of trainable parameters, which might easily lead to overfitting in few-shot scenarios.

Excluding FT-RoBERTa, the multimodal baselines as a group outperform the unimodal models on average, although SAFE alone trails both dEFEND and LDA-HAN, showing that the inclusion of the image modality can significantly affect model accuracy. While these multimodal approaches excel in scenarios with abundant data, their effectiveness heavily relies on the availability of high-quality annotated training samples, which may not be readily accessible during the initial stages of FND. Moreover, all multimodal approaches utilize pre-trained unimodal models, such as VGG, ResNet, and BERT, to independently extract features from images and text. Yet, since these unimodal models are trained separately, merging their extracted features during the multimodal fusion process could potentially introduce noise [20].

Comparing with few-shot baselines. The effectiveness of the proposed CMA is evaluated in comparison with the latest prompt-based few-shot models. CMA outperforms the best few-shot baseline, P&A, by 3.3 percentage points in average accuracy (75.3% vs. 72.0%), showing that using unimodal features to assist multimodal probing without prompting the pre-trained language model could also benefit the FND task.

While P&A demonstrates performance on par with CMA, it requires the pre-calculation of a news proximity graph. However, such social context data may not always be accessible, particularly in datasets not sourced from Twitter, like Weibo. PET and KPT yield comparable outcomes, and the differences between them are likely due to variations introduced by the manually crafted verbalizers used in prompting. This underscores the significance of hand-designed discrete templates in prompt-based learning. Meanwhile, M-SAMPLE, a multimodal adaptation of KPL, outperforms KPL, suggesting that incorporating the image modality can significantly enhance FND effectiveness.
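The margins quoted in the three comparisons can be checked directly from the AVG column of Table 2. The sketch below recomputes the CMA average from its twelve per-setting accuracies and then takes the difference to the best baseline in each group; the variable names are arbitrary and the numbers are copied from Table 2.

```python
# CMA accuracies from Table 2 (PolitiFact, GossipCop, Weibo at n = 2, 8, 16, 32).
cma_row = [73.5, 75.8, 82.5, 87.3,
           71.9, 69.0, 71.7, 77.0,
           74.5, 69.9, 73.8, 76.5]
cma_avg = round(sum(cma_row) / len(cma_row), 1)

# AVG values of the best baseline in each group, as reported in Table 2.
best_baselines = {
    "FT-RoBERTa (unimodal)": 59.7,
    "CAFE (multimodal)": 47.9,
    "P&A (few-shot)": 72.0,
}

# Differences are in percentage points of average accuracy.
margins = {name: round(cma_avg - acc, 1) for name, acc in best_baselines.items()}

for name, gap in margins.items():
    print(f"CMA vs {name}: +{gap} points")
```

Note that the P&A average covers only PolitiFact and GossipCop, since no Weibo results are available for it, so that margin compares averages over different sets of settings.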

5 Analysis

5.1 Ablation study

We investigate the impact of the key components of CMA by evaluating the framework in complete and partial configurations. In each experiment, selected components are removed and the framework is trained from scratch. The results, averaged over five random seeds for each shot setting, are shown in Table 3. In most configurations, performance drops when a component is absent, underscoring the significance of each key module within CMA.

Table 3: Ablation experiments of the CMA. -cross denotes the cross-attention is removed from the CMA. -meta means the meta-linear MLP layer is removed. -img means the image features are removed and only text features are used. -txt denotes the text features are removed and only image features are used.
Method PolitiFact GossipCop Weibo
2 8 16 32 2 8 16 32 2 8 16 32 AVG
CMA 73.5 75.8 82.5 87.3 71.9 69.0 71.7 77.0 74.5 69.9 73.8 76.5 75.3
-cross 67.6 76.7 81.2 84.0 71.8 71.8 71.6 71.1 58.4 65.2 68.4 75.2 71.6
-meta 72.2 74.1 74.7 78.4 49.0 53.2 56.8 56.1 50.0 50.9 57.4 61.7 61.2
-img 59.6 61.7 68.5 71.4 48.3 48.4 54.3 56.1 46.9 47.3 50.4 52.1 55.4
-txt 39.0 37.4 45.6 52.1 41.3 43.3 45.1 47.6 39.1 39.3 39.4 45.1 42.9

Specifically, removing the cross-attention from the CMA (i.e., -cross) results in a slight decrease in average accuracy, showing that the cross-attended features from text and image capture semantic correlations and contribute to improved performance. Further removal of the meta-linear layer from the CMA (i.e., -meta) transforms the model into a standard n-shot classification, where it simply classifies concatenated multimodal features. This leads to a significant decrease in accuracy, emphasizing the importance of jointly updating all modality-specific weights in a meta-linear classifier for cross-modal adaptation and accuracy improvement. The meta-linear layer integrates modality-specific features, resembling an ensemble that transforms n-shot classification into a more robust (n × z)-shot problem, enhancing cross-modal adaptation in few-shot classification.
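The (n × z)-shot idea can be illustrated with a minimal sketch. It assumes that each labelled example provides z feature views of equal dimensionality and that each view is treated as a separate training sample for a shared linear classifier. The view names ("text", "image", "cross"), the function augment_shots, and the toy numbers are hypothetical and only serve the illustration; the sketch does not reproduce the meta-linear layer itself.

```python
from typing import Dict, List, Sequence, Tuple

Feature = Sequence[float]


def augment_shots(
    examples: List[Tuple[Dict[str, Feature], int]],
    views: Sequence[str] = ("text", "image", "cross"),
) -> List[Tuple[Feature, int]]:
    """Turn n labelled examples into n * z training samples.

    Each example carries z feature views (hypothetical names). Every view
    becomes its own training sample and inherits the label of its example.
    """
    augmented = []
    for features, label in examples:
        for view in views:
            augmented.append((features[view], label))
    return augmented


# Toy 2-shot support set with 3-dimensional features (made-up numbers).
support = [
    ({"text": [0.1, 0.9, 0.0], "image": [0.2, 0.8, 0.1], "cross": [0.15, 0.85, 0.05]}, 1),
    ({"text": [0.9, 0.1, 0.3], "image": [0.7, 0.2, 0.4], "cross": [0.8, 0.15, 0.35]}, 0),
]
train_set = augment_shots(support)
print(len(support), "shots ->", len(train_set), "training samples")
```

With n = 2 and z = 3, the classifier is fitted on six samples instead of two, which is the sense in which the few-shot problem becomes more robust.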

Additionally, experiments are performed by excluding either the image features (-img) or the text features (-txt), relying solely on the remaining modality for classification. Such setups led to additional reductions in accuracy, underscoring the comparative importance of text over image features in FND. This highlights the complexities in multimodal FND tasks, where the spatial discrepancies between visual and textual semantics tend to be more subtle than in broader multimodal datasets.

5.2 Stability test

Given that the selection of few-shot examples can significantly affect model performance, we assess the stability of CMA and the prompt-based baselines by measuring the standard deviation of accuracies in the few-shot settings, as shown in Figure 3.

Overall, the standard deviation of every model decreases as the number of shots increases, underscoring the importance of augmenting training examples in few-shot scenarios. The accuracy of CMA also tends to be the most stable among the few-shot approaches, indicating that the ensemble of unimodal features in the meta-linear layer can enhance the robustness of multimodal fusion in classification. Additionally, the GossipCop dataset exhibits greater instability than the PolitiFact dataset. This instability may be attributed to the semantic complexity of GossipCop, which may also explain the lower accuracy across all models.

(a) Standard deviation comparisons on PolitiFact.
(b) Standard deviation comparisons on GossipCop.
Figure 3: The standard deviations of accuracies on the PolitiFact and GossipCop datasets for the few-shot baselines and the proposed CMA.
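The stability measure can be sketched as follows. The per-seed accuracies below are made-up placeholders rather than results from the experiments, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def stability(accuracies_by_shot):
    """Return {n_shot: (mean, std)} computed over per-seed accuracies."""
    return {n: (mean(accs), stdev(accs)) for n, accs in accuracies_by_shot.items()}


# Hypothetical per-seed accuracies (%), five random seeds per n-shot setting.
runs = {
    2: [60.0, 70.0, 65.0, 75.0, 55.0],
    32: [80.0, 82.0, 81.0, 79.0, 83.0],
}
summary = stability(runs)
for n, (avg, sd) in summary.items():
    print(f"{n}-shot: mean={avg:.1f} std={sd:.2f}")
```

A lower standard deviation at a given shot count indicates that the model is less sensitive to which few-shot examples were sampled.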

5.3 Model efficiency

Given that CMA achieves the best performance with a surprisingly simple augmentation, we further explore its efficiency in comparison to the baseline models. Table 4 compares the accuracies and epoch times of the baselines and CMA. The average accuracy of each model is determined in the 16-shot setting as shown in Table 2, and the average epoch time of each model is recorded alongside it. For a fair comparison, all experiments are run with batch size 32 on a single RTX 4090 GPU on the GossipCop dataset.

Among unimodal models, dEFEND and LDA-HAN exhibit comparable accuracy and epoch times, which can be attributed to their analogous hierarchical architectural design. While FT-RoBERTa exceeds the performance of various unimodal (e.g., 18.0 percentage points higher than dEFEND) and multimodal methods (e.g., 6.9 percentage points higher than CAFE), it requires updating a large number of trainable parameters, which extends its epoch duration (on average, four minutes per epoch) relative to the other unimodal baselines.

In the multimodal models, SAFE yields the lengthiest epoch durations owing to its prerequisite for independently pre-generating image descriptions. Although Spotfake achieves the fastest epoch duration due to its simple concatenation of the image and text features from the BERT and VGG respectively, it achieves the worst performance compared with other models. CAFE achieves the best multimodal FND outcomes by integrating a degree of ambiguity in the similarity across text and image features, albeit at the cost of marginally increased model complexity and consequently, slightly extended epoch durations.

Table 4: Comparisons of model efficiency. Both Accuracy (%) and Time represent averages derived from five random seeds. Times displayed in green signify an average duration of less than 3 minutes, whereas those in red indicate an average exceeding 3 minutes. Gain denotes notable improvements in accuracy relative to the dEFEND model.
Model Accuracy Time Gain
dEFEND 40.9 2min 0
LDA-HAN 38.7 2min -2.2
FT-RoBERTa 58.9 4min +18.0
SAFE 41.1 7min +0.2
Spotfake 33.9 2min -7.0
CAFE 52.0 3min +11.1
KPL 57.5 3min +16.6
M-SAMPLE 58.1 5min +17.2
KPT 54.3 3min +13.4
PET 69.9 6min +29.0
P&A 71.5 2min +30.6
CMA 74.1 <1min +33.2

All few-shot baselines demonstrate significant improvements over both unimodal and multimodal counterparts, indicating the suboptimality of traditional methods in contexts with limited annotated data. Specifically, the integration of external knowledge into the prompt-tuning phase by both KPL and KPT results in comparable epoch durations. However, KPL's design of an FND-specific prompt may underlie its superior performance over KPT. PET records the lengthiest epoch duration among the few-shot baselines, potentially due to the repeated fine-tuning of the PLM for reconfiguring input examples with the task description. P&A achieves not only the second-best performance but also the second-shortest epoch duration, benefiting from the integration of user engagements. However, it incorporates an external alignment module to correlate user engagement with the PLM's predictions, which increases its epoch time relative to CMA. Finally, CMA is more efficient and more accurate, as it avoids extensive parameter fine-tuning and does not depend on intensive image augmentation. Additionally, placing linear probing layers on top of the image and text features is a more streamlined approach than extensive fine-tuning or precisely crafted, complex model designs.
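For reference, the Gain column of Table 4 can be recomputed from the Accuracy column as the difference to dEFEND in percentage points. The sketch below copies the accuracies from Table 4; the function name gain_over_baseline is illustrative.

```python
BASELINE = "dEFEND"

# 16-shot accuracies (%) copied from Table 4.
accuracy = {
    "dEFEND": 40.9, "LDA-HAN": 38.7, "FT-RoBERTa": 58.9, "SAFE": 41.1,
    "Spotfake": 33.9, "CAFE": 52.0, "KPL": 57.5, "M-SAMPLE": 58.1,
    "KPT": 54.3, "PET": 69.9, "P&A": 71.5, "CMA": 74.1,
}


def gain_over_baseline(acc, baseline=BASELINE):
    """Accuracy difference to the baseline model, in percentage points."""
    ref = acc[baseline]
    return {model: round(value - ref, 1) for model, value in acc.items()}


gains = gain_over_baseline(accuracy)
for model, gain in gains.items():
    print(f"{model:<11} {accuracy[model]:>5.1f} {gain:+.1f}")
```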

5.4 Domain shift analysis

Real-world fake news demonstrates significant distribution discrepancies, which is also referred to as domain shift [zhu2022generalizing, zhu2022memory]. Consequently, automatic FND methods are required to rapidly adapt to emerging topics by using limited resources.

Table 5: Domain shift performance comparison. Poli→Goss uses few-shot samples from PolitiFact for training and GossipCop for testing. Goss→Poli uses GossipCop as the training set and PolitiFact as the test set. Bold and underline denote the best and the second-best accuracy (%) in each n-shot setting. AVG is the mean accuracy across all n-shot settings.
Method Poli→Goss Goss→Poli
2 8 16 32 2 8 16 32 AVG
KPT 40.1 31.7 31.4 31.1 56.3 55.3 54.1 55.8 44.5
PET 51.0 51.3 51.5 51.6 53.1 54.1 54.5 54.1 52.6
P&A 53.2 53.4 53.2 54.5 50.1 50.4 50.3 50.5 51.9
CMA 48.7 53.5 56.1 58.6 51.4 55.3 53.0 55.9 54.1

To address this, we investigate the cross-domain capability of the proposed CMA against three strong few-shot FND baselines (i.e., P&A, PET and KPT). Considering Politifact’s focus on political news using formal language and Gossipcop’s emphasis on entertainment and celebrity narratives in a more casual tone, we first utilize Politifact for training and Gossipcop for testing, later inverting this arrangement.

The outcomes following domain shift are presented in Table 5. Notably, while CMA records the highest average accuracy among the few-shot methods, the performance of each model differs markedly from that observed in the comparison experiments (as shown in Table 2). For example, KPT exhibits the strongest performance in both the 2- and 8-shot scenarios in Goss→Poli, tying with CMA in the 8-shot case. PET and P&A also achieve the highest accuracy in individual settings (16-shot Goss→Poli and 2-shot Poli→Goss, respectively), highlighting the gap between present few-shot FND methods and the ability to adapt across domains.
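The per-setting comparison can be checked directly against Table 5. The sketch below copies the accuracies from the table, computes each method's mean over all eight settings, and lists the best method per setting while keeping ties; the helper names average and winners are illustrative.

```python
SHOTS = [2, 8, 16, 32]

# Accuracies (%) copied from Table 5; list positions follow SHOTS.
table5 = {
    "KPT": {"Poli->Goss": [40.1, 31.7, 31.4, 31.1], "Goss->Poli": [56.3, 55.3, 54.1, 55.8]},
    "PET": {"Poli->Goss": [51.0, 51.3, 51.5, 51.6], "Goss->Poli": [53.1, 54.1, 54.5, 54.1]},
    "P&A": {"Poli->Goss": [53.2, 53.4, 53.2, 54.5], "Goss->Poli": [50.1, 50.4, 50.3, 50.5]},
    "CMA": {"Poli->Goss": [48.7, 53.5, 56.1, 58.6], "Goss->Poli": [51.4, 55.3, 53.0, 55.9]},
}


def average(method):
    """Mean accuracy of one method over all directions and shot settings."""
    values = [v for scores in table5[method].values() for v in scores]
    return sum(values) / len(values)


def winners(direction, shot):
    """Methods with the highest accuracy in one setting (ties are kept)."""
    col = SHOTS.index(shot)
    best = max(scores[direction][col] for scores in table5.values())
    return [m for m, scores in table5.items() if scores[direction][col] == best]


for direction in ("Poli->Goss", "Goss->Poli"):
    for shot in SHOTS:
        print(direction, f"{shot}-shot:", ", ".join(winners(direction, shot)))
print("best average:", max(table5, key=average))
```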

5.5 Feature visualization

(a) Feature visualization from the M-SAMPLE.
(b) Feature visualization from the proposed CMA.
Figure 4: Feature visualization comparisons between M-SAMPLE and CMA. English translation of the Weibo example: “When you buy toothpaste, pay attention to the color bar on the bottom of the toothpaste tube, the color bar has meaning! Try to choose greens and blues. Green: natural, blue: natural + medicine, Red: natural + chemical composition, Black: pure chemical. Surprisingly, most children’s toothpaste brands on the domestic market contain chemical ingredients.”

Finally, we present a visual comparison of the features extracted by M-SAMPLE and CMA, both of which are multimodal few-shot approaches. This involves the visualization of multimodal features alongside an assessment of their semantic correlations. For each dataset, a specific sample is chosen, with the corresponding multimodal features depicted in Figure 4.

Observations indicate that: 1) CMA can capture more consistent features from the image-text pair of fake news than those of M-SAMPLE. For example, although both M-SAMPLE and CMA successfully correlate the flag in the image with the word “Chinese” in the text, CMA can also identify the semantic meaning of “moon landing” between the text and image in the PolitiFact example; 2) The proposed CMA is more accurate in capturing important features from the image than M-SAMPLE. For example, although both models can identify the person “Nicole Kidman” and “black tarantula” in both the text and the image in the GossipCop example, the image region of the tarantula slightly overlaps with that of Nicole Kidman provided by M-SAMPLE. This is even more obvious in the Weibo example, as CMA successfully captures the “blue” color bar in the toothpaste, but M-SAMPLE fails to do so.

6 Conclusion

This paper introduced Cross-Modal Augmentation (CMA) for enhancing few-shot multimodal fake news detection by utilizing unimodal features to augment multimodal fusion. The proposed CMA leverages a pre-trained multimodal model for unimodal feature extraction and transforms n-shot classification into a robust (n × z)-shot problem using class labels as additional one-shot training samples. The CMA, employing a simple linear classifier, achieves SOTA performance on three datasets in few-shot settings, and demonstrates greater efficiency than current approaches.

7 Limitation

We acknowledge the following limitations of this study: 1) The evaluation of CMA's few-shot proficiency relies solely on CLIP; future investigations will examine how different multimodal models influence the proposed CMA. 2) Given the lack of multimodal information in certain datasets, this research adopted cosine similarity to select an image from multiple options, so performance may vary with the text-image pairing technique employed. 3) CMA exhibits suboptimal domain shift performance; enhancing the architecture through knowledge distillation or domain adaptation techniques remains a prospect for future research.
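The image selection step mentioned in the second limitation can be sketched as follows, assuming that text and candidate images have already been embedded into a shared space (for instance by CLIP). The function names and the toy vectors are hypothetical; the sketch only illustrates picking the candidate with the highest cosine similarity to the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_image(text_emb, image_embs):
    """Index of the candidate image most similar to the text embedding."""
    return max(range(len(image_embs)), key=lambda i: cosine(text_emb, image_embs[i]))


# Toy embeddings (made-up numbers): one text vector, three candidate images.
text = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 1.1], [-1.0, 0.0, -1.0]]
best = select_image(text, candidates)
print("selected candidate:", best)
```

A different pairing rule, such as keeping the first image or averaging all candidates, would change which features reach the classifier, which is why the outcome may depend on this choice.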

Acknowledgements

This work is funded by the Natural Science Foundation of Shandong Province under grant ZR2023QF151 and the Natural Science Foundation of China under grant 12303103.

CRediT authorship contribution statement

Ye Jiang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Taihang Wang: Methodology, Writing – review & editing. Xiaoman Xu: Data curation, Writing – review & editing. Yimin Wang: Funding acquisition, Methodology, Writing – review & editing. Xingyi Song: Supervision, Writing – review & editing. Diana Maynard: Investigation, Supervision, Writing – review & editing.

References

  • [1] N. K. Conroy, V. L. Rubin, Y. Chen, Automatic deception detection: Methods for finding fake news, Proceedings of the association for information science and technology 52 (1) (2015) 1–4.
  • [2] Y. Long, Q. Lu, R. Xiang, M. Li, C.-R. Huang, Fake news detection through multi-perspective speaker profiles, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.
  • [3] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, in: Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, 2018, pp. 849–857.
  • [4] A. Lao, C. Shi, Y. Yang, Rumor detection with field of linear and non-linear propagation, in: Proceedings of the Web Conference 2021, 2021, pp. 3178–3187.
  • [5] Y. Chen, D. Li, P. Zhang, J. Sui, Q. Lv, L. Tun, L. Shang, Cross-modal ambiguity learning for multimodal fake news detection, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 2897–2905.
  • [6] Y. Zhou, Y. Yang, Q. Ying, Z. Qian, X. Zhang, Multimodal fake news detection via clip-guided learning, in: 2023 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2023, pp. 2825–2830.
  • [7] L. Wang, C. Zhang, H. Xu, Y. Xu, X. Xu, S. Wang, Cross-modal contrastive learning for multimodal fake news detection, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5696–5704.
  • [8] L. Wu, Y. Long, C. Gao, Z. Wang, Y. Zhang, Mfir: Multimodal fusion and inconsistency reasoning for explainable fake news detection, Information Fusion 100 (2023) 101944.
  • [9] Z. Qu, Y. Meng, G. Muhammad, P. Tiwari, Qmfnd: A quantum multimodal fusion-based fake news detection model for social media, Information Fusion 104 (2024) 102172.
  • [10] Y. Gao, J. Liu, Z. Xu, J. Zhang, K. Li, R. Ji, C. Shen, Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, Advances in neural information processing systems 35 (2022) 35959–35970.
  • [11] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, S. C. H. Hoi, Align before fuse: Vision and language representation learning with momentum distillation, Advances in neural information processing systems 34 (2021) 9694–9705.
  • [12] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 2931–2937.
  • [13] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media, Big data 8 (3) (2020) 171–188.
  • [14] A. N. Meltzoff, R. W. Borton, Intermodal matching by human neonates, Nature 282 (5737) (1979) 403–404.
  • [15] B. Nanay, Multimodal mental imagery, Cortex 105 (2018) 125–134.
  • [16] J. Wu, S. Li, A. Deng, M. Xiong, B. Hooi, Prompt-and-align: Prompt-based social alignment for few-shot fake news detection, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 2726–2736.
  • [17] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021, Association for Computational Linguistics (ACL), 2021, pp. 3816–3830.
  • [18] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H.-T. Zheng, M. Sun, Openprompt: An open-source framework for prompt-learning, arXiv preprint arXiv:2111.01998 (2021).
  • [19] G. Jiang, S. Liu, Y. Zhao, Y. Sun, M. Zhang, Fake news detection via knowledgeable prompt learning, Information Processing & Management 59 (5) (2022) 103029.
  • [20] Y. Jiang, X. Yu, Y. Wang, X. Xu, X. Song, D. Maynard, Similarity-aware multimodal prompt learning for fake news detection, Information Sciences 647 (2023) 119446.
  • [21] Q. Guo, Z. Kang, L. Tian, Z. Chen, Tiefake: Title-text similarity and emotion-aware fake news detection, arXiv preprint arXiv:2304.09421 (2023).
  • [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
  • [23] C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: Proceedings of the 20th international conference on World wide web, 2011, pp. 675–684.
  • [24] B. Tabibian, I. Valera, M. Farajtabar, L. Song, B. Schölkopf, M. Gomez-Rodriguez, Distilling information reliability and source trustworthiness from digital traces, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 847–855.
  • [25] C. Geeng, S. Yee, F. Roesner, Fake news on facebook and twitter: Investigating how people (don’t) investigate, in: Proceedings of the 2020 CHI conference on human factors in computing systems, 2020, pp. 1–14.
  • [26] X. Liu, Q. Li, A. Nourbakhsh, R. Fang, M. Thomas, K. Anderson, R. Kociuba, M. Vedder, S. Pomerville, R. Wudali, et al., Reuters tracer: A large scale system of detecting & verifying real-time news events from twitter, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 207–216.
  • [27] M. Fedoryszak, B. Frederick, V. Rajaram, C. Zhong, Real-time event detection on social data streams, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2774–2782.
  • [28] P. Bahad, P. Saxena, R. Kamal, Fake news detection using bi-directional lstm-recurrent neural network, Procedia Computer Science 165 (2019) 74–82.
  • [29] S. Sridhar, S. Sanagavarapu, Fake news detection and analysis using multitask learning with bilstm capsnet model, in: 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2021, pp. 905–911.
  • [30] H. T. Phan, N. T. Nguyen, D. Hwang, Fake news detection: A survey of graph neural network methods, Applied Soft Computing (2023) 110235.
  • [31] X. Song, J. Petrak, Y. Jiang, I. Singh, D. Maynard, K. Bontcheva, Classification aware neural topic model for covid-19 disinformation categorisation, PloS one 16 (2) (2021) e0247086.
  • [32] Y. Jiang, X. Song, C. Scarton, A. Aker, K. Bontcheva, Categorising fine-to-coarse grained misinformation: An empirical study of covid-19 infodemic, arXiv preprint arXiv:2106.11702 (2021).
  • [33] Y. Jiang, Y. Wang, X. Song, D. Maynard, Comparing topic-aware neural networks for bias detection of news, in: ECAI 2020, IOS Press, 2020, pp. 2054–2061.
  • [34] B. Ghanem, P. Rosso, F. Rangel, An emotional analysis of false information in social media and news articles, ACM Transactions on Internet Technology (TOIT) 20 (2) (2020) 1–18.
  • [35] Y. Jiang, Team QUST at SemEval-2023 task 3: A comprehensive study of monolingual and multilingual approaches for detecting online news genre, framing and persuasion techniques, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 300–306.
  • [36] Y. Wu, P. Zhan, Y. Zhang, L. Wang, Z. Xu, Multimodal fusion with co-attention networks for fake news detection, in: Findings of the association for computational linguistics: ACL-IJCNLP 2021, 2021, pp. 2560–2569.
  • [37] S. Singhal, T. Pandey, S. Mrig, R. R. Shah, P. Kumaraguru, Leveraging intra and inter modality relationship for multimodal fake news detection, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 726–734.
  • [38] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM computing surveys (csur) 53 (3) (2020) 1–34.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, Advances in neural information processing systems 29 (2016).
  • [40] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, Advances in neural information processing systems 30 (2017).
  • [41] S. Motiian, Q. Jones, S. Iranmanesh, G. Doretto, Few-shot adversarial domain adaptation, Advances in neural information processing systems 30 (2017).
  • [42] A. Zhao, M. Ding, Z. Lu, T. Xiang, Y. Niu, J. Guan, J.-R. Wen, Domain-adaptive few-shot learning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1390–1399.
  • [43] A. Sharaf, H. H. Awadalla, H. Daumé III, Meta-learning for few-shot nmt adaptation, in: Proceedings of the Fourth Workshop on Neural Generation and Translation, 2020, pp. 43–53.
  • [44] C. Han, Z. Fan, D. Zhang, M. Qiu, M. Gao, A. Zhou, Meta-learning adversarial domain adaptation network for few-shot text classification, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1664–1673.
  • [45] Z. Yue, H. Zeng, Y. Zhang, L. Shang, D. Wang, Metaadapt: Domain adaptive few-shot misinformation detection via meta learning, Association for Computational Linguistics (2023) 5223–5239.
  • [46] Q. Zhang, H. Huang, S. Liang, Z. Meng, E. Yilmaz, Learning to detect few-shot-few-clue misinformation, arXiv preprint arXiv:2108.03805 (2021).
  • [47] E. Schwartz, L. Karlinsky, R. Feris, R. Giryes, A. Bronstein, Baby steps towards few-shot learning with multiple semantics, Pattern Recognition Letters 160 (2022) 142–147.
  • [48] H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, Y. Yang, Cross-modal contrastive learning for text-to-image generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 833–842.
  • [49] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, J. Gao, Glipv2: Unifying localization and vision-language understanding, Advances in Neural Information Processing Systems 35 (2022) 36067–36080.
  • [50] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
  • [51] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [52] R. Girdhar, D. Ramanan, Attentional pooling for action recognition, Advances in neural information processing systems 30 (2017).
  • [53] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, Y. Qiao, Clip-adapter: Better vision-language models with feature adapters, International Journal of Computer Vision (2023) 1–15.
  • [54] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, H. Li, Tip-adapter: Training-free clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930 (2021).
  • [55] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.
  • [56] M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong, et al., Robust fine-tuning of zero-shot models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7959–7971.
  • [57] B. Schölkopf, R. Herbrich, A. J. Smola, A generalized representer theorem, in: International conference on computational learning theory, Springer, 2001, pp. 416–426.
  • [58] Z. Jin, J. Cao, H. Guo, Y. Zhang, J. Luo, Multimodal fusion with recurrent neural networks for rumor detection on microblogs, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 795–816.
  • [59] A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, C. Zhou, Chinese clip: Contrastive vision-language pretraining in chinese, arXiv preprint arXiv:2211.01335 (2022).
  • [60] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, defend: Explainable fake news detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 395–405. doi:10.1145/3292500.3330935.
  • [61] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, S. Satoh, Spotfake: A multi-modal framework for fake news detection, in: 2019 IEEE fifth international conference on multimedia big data (BigMM), IEEE, 2019, pp. 39–47.
  • [62] X. Zhou, J. Wu, R. Zafarani, SAFE: Similarity-aware multi-modal fake news detection, in: Pacific-Asia Conference on knowledge discovery and data mining, Springer, 2020, pp. 354–367.
  • [63] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269.
  • [64] S. Hu, N. Ding, H. Wang, Z. Liu, J. Wang, J. Li, W. Wu, M. Sun, Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2225–2240.