VERA-ARAB: unveiling the Arabic tweets credibility by constructing balanced news dataset for veracity analysis

Mohamed A Mostafa; Ahmad Almogren

doi:10.7717/peerj-cs.2432

PeerJ Comput Sci. 2024 Oct 30;10:e2432. doi: 10.7717/peerj-cs.2432. eCollection 2024.

Authors

Mohamed A Mostafa¹, Ahmad Almogren²

Affiliations

¹ Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
² Chair of Cyber Security, Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

Abstract

The proliferation of fake news on social media platforms necessitates the development of reliable datasets for effective fake news detection and veracity analysis. In this article, we introduce a veracity dataset of Arabic tweets called "VERA-ARAB", a pioneering large-scale dataset designed to enhance fake news detection in Arabic tweets. VERA-ARAB is a balanced, multi-domain, and multi-dialectal dataset, containing both fake and true news, meticulously verified by fact-checking experts from Misbar. Comprising approximately 20,000 tweets from 13,000 distinct users and covering 884 different claims, the dataset includes detailed information such as news text, user details, and spatiotemporal data, spanning diverse domains like sports and politics. We leveraged the X API to retrieve and structure the dataset, providing a comprehensive data dictionary to describe the raw data and conducting a thorough statistical descriptive analysis. This analysis reveals insightful patterns and distributions, visualized according to data type and nature. We also evaluated the dataset using multiple machine learning classification models, exploring various social and textual features. Our findings indicate promising results, particularly with textual features, underscoring the dataset's potential for enhancing fake news detection. Furthermore, we outline future work aimed at expanding VERA-ARAB to establish it as a benchmark for Arabic tweets in fake news detection. We also discuss other potential applications that could leverage the VERA-ARAB dataset, emphasizing its value and versatility for advancing the field of fake news detection in Arabic social media. Potential applications include user veracity assessment, topic modeling, and named entity recognition, demonstrating the dataset's wide-ranging utility for broader research in information quality management on social media.

Keywords: Arabic dataset; Fake news; Named entity recognition; Social computing; Social media; Topic classification.

Publication types

News

Grants and funding

This work was supported by the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia through the Vice Deanship of Scientific Research Chairs: Chair of Cyber Security. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.