
A Multi-Label Dataset of French Fake News: Human and Machine Insights

Authors: Benjamin Icard, François Maine, Morgane Casanova, Géraud Faye, Julien Chanson, Guillaume Gadek, Ghislain Atemezing, François Bancilhon, Paul Égré

Abstract: We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following url: https://github.com/obs-info/obsinfox

Keywords: Fake News, Multi-Labels, Subjectivity, Vagueness, Detail, Opinion, Exaggeration, French Press

Submitted 11 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

Comments: Paper to appear in the Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2309.06132 [pdf, other]

Measuring vagueness and subjectivity in texts: from symbolic to neural VAGO

Authors: Benjamin Icard, Vincent Claveau, Ghislain Atemezing, Paul Égré

Abstract: We present a hybrid approach to the automated measurement of vagueness and subjectivity in texts. We first introduce the expert system VAGO, illustrate it on a small benchmark of fact vs. opinion sentences, and then test it on the larger French press corpus FreSaDa to confirm the higher prevalence of subjective markers in satirical vs. regular texts. We then build a neural clone of VAGO, based on a BERT-like architecture, trained on the symbolic VAGO scores obtained on FreSaDa. Using explainability tools (LIME), we show the interest of this neural version for the enrichment of the lexicons of the symbolic version, and for the production of versions in other languages.

Submitted 23 October, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: Paper to appear in the Proceedings of the 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)

MSC Class: 68T07; 68T50

arXiv:2202.00609 [pdf, other]

Semantic of Cloud Computing services for Time Series workflows

Authors: Manuel Parra-Royón, Francisco Baldan, Ghislain Atemezing, J. M. Benitez

Abstract: Time series (TS) are present in many fields of knowledge, research, and engineering. The processing and analysis of TS are essential in order to extract knowledge from the data and to tackle forecasting or predictive maintenance tasks, among others. The modeling of TS is a challenging task, requiring high statistical expertise as well as outstanding knowledge about the application of Data Mining (DM) and Machine Learning (ML) methods. The overall work with TS is not limited to the linear application of several techniques, but is composed of an open workflow of methods and tests. These workflows, developed mainly in programming languages, are complicated to execute and run effectively on different systems, including Cloud Computing (CC) environments. The adoption of CC can facilitate the integration and portability of services allowing to adopt solutions towards services Internet Technologies (IT) industrialization. The definition and description of workflow services for TS open up a new set of possibilities regarding the reduction of complexity in the deployment of this type of issues in CC environments. In this sense, we have designed an effective proposal based on semantic modeling (or vocabulary) that provides the full description of workflow for Time Series modeling as a CC service. Our proposal includes a broad spectrum of the most extended operations, accommodating any workflow applied to classification, regression, or clustering problems for Time Series, as well as including evaluation measures, information, tests, or machine learning algorithms, among others.

Submitted 1 February, 2022; originally announced February 2022.

Comments: 11 pages, 12 figures

arXiv:2110.14780 [pdf, other]

Combining Vagueness Detection with Deep Learning to Identify Fake News

Authors: Paul Guélorget, Benjamin Icard, Guillaume Gadek, Souhir Gahbiche, Sylvain Gatepaille, Ghislain Atemezing, Paul Égré

Abstract: In this paper, we combine two independent detection methods for identifying fake news: the algorithm VAGO uses semantic rules combined with NLP techniques to measure vagueness and subjectivity in texts, while the classifier FAKE-CLF relies on Convolutional Neural Network classification and supervised deep learning to classify texts as biased or legitimate. We compare the results of the two methods on four corpora. We find a positive correlation between the vagueness and subjectivity measures obtained by VAGO, and the classification of text as biased by FAKE-CLF. The comparison yields mutual benefits: VAGO helps explain the results of FAKE-CLF. Conversely, FAKE-CLF helps us corroborate and expand VAGO's database. The use of two complementary techniques (rule-based vs data-driven) proves a fruitful approach for the challenging problem of identifying fake news.

Submitted 31 October, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: Paper to appear in the Proceedings of the 24th International Conference on Information Fusion. Johannesburg. (2nd version: Typo corrected in metadata in one of the authors' names)

MSC Class: 68T07; 68T50

arXiv:1806.06826 [pdf, other]

Semantics of Data Mining Services in Cloud Computing

Authors: Manuel Parra-Royon, Ghislain Atemezing, J. M. Benítez

Abstract: In recent years, with the rise of Cloud Computing, many companies providing services in the cloud are adding a new series of services to their catalogue, such as data mining and data processing, taking advantage of the vast computing resources available to them. Different service definition proposals have been put forward to address the problem of describing services in Cloud Computing in a comprehensive way. Bearing in mind that each provider has its own definition of the logic of its services, and specifically of data mining services, it should be pointed out that the possibility of describing services in a flexible way between providers is fundamental in order to maintain the usability and portability of this type of Cloud Computing services. The use of semantic technologies based on the proposal offered by Linked Data for the definition of services allows the design and modelling of data mining services, achieving a high degree of interoperability. In this article, a schema for the definition of data mining services on cloud computing is presented, considering all key aspects of service, such as prices, interfaces, Software Level Agreement, instances or data mining workflow, among others. The new schema is based on Linked Data, and it reuses other schemata, obtaining a better and more complete definition of the services. In order to validate the completeness of the schema, a series of data mining services have been created where a set of algorithms such as Random Forest or K-Means are modeled as services. In addition, a dataset has been generated including the definition of the services of several actual Cloud Computing data mining providers, confirming the effectiveness of the schema.

Submitted 14 January, 2019; v1 submitted 18 June, 2018; originally announced June 2018.

Comments: In-depth review. Fixed mistakes

Showing 1–5 of 5 results for author: Atemezing, G