doi 10.1111/exsy.13671

What distinguishes conspiracy from critical narratives? A computational analysis of oppositional discourse

Authors: Damir Korenčić, Berta Chulvi, Xavier Bonet Casals, Alejandro Toselli, Mariona Taulé, Paolo Rosso

Abstract: The current prevalence of conspiracy theories on the internet is a significant issue, tackled by many computational approaches. However, these approaches fail to recognize the relevance of distinguishing between texts which contain a conspiracy theory and texts which are simply critical and oppose mainstream narratives. Furthermore, little attention is usually paid to the role of inter-group conflict in oppositional narratives. We contribute by proposing a novel topic-agnostic annotation scheme that differentiates between conspiracies and critical texts, and that defines span-level categories of inter-group conflict. We also contribute with the multilingual XAI-DisInfodemics corpus (English and Spanish), which contains a high-quality annotation of Telegram messages related to COVID-19 (5,000 messages per language). We also demonstrate the feasibility of an NLP-based automatization by performing a range of experiments that yield strong baseline solutions. Finally, we perform an analysis which demonstrates that the promotion of intergroup conflict and the presence of violence and anger are key aspects to distinguish between the two types of oppositional narratives, i.e., conspiracy vs. critical.

Submitted 15 July, 2024; originally announced July 2024.

Comments: submitted to the Expert Systems journal

ACM Class: I.2.7; J.4

arXiv:2301.05935 [pdf, other]

doi 10.1016/j.patcog.2023.109695

End-to-End Page-Level Assessment of Handwritten Text Recognition

Authors: Enrique Vidal, Alejandro H. Toselli, Antonio Ríos-Vila, Jorge Calvo-Zaragoza

Abstract: The evaluation of Handwritten Text Recognition (HTR) systems has traditionally used metrics based on the edit distance between HTR and ground truth (GT) transcripts, at both the character and word levels. This is adequate when the experimental protocol assumes that both GT and HTR text lines are the same, which allows edit distances to be computed independently for each given line. Driven by recent advances in pattern recognition, HTR systems increasingly face the end-to-end page-level transcription of a document, where the precision of locating the different text lines and their corresponding reading order (RO) play a key role. In such a case, the standard metrics do not take into account the inconsistencies that might appear. In this paper, the problem of evaluating HTR systems at the page level is introduced in detail. We analyse the convenience of using a two-fold evaluation, where the transcription accuracy and the RO goodness are considered separately. Different alternatives are proposed, analysed and empirically compared both through partially simulated and through real, full end-to-end experiments. Results support the validity of the proposed two-fold evaluation approach. An important conclusion is that such an evaluation can be adequately achieved by just two simple and well-known metrics: the Word Error Rate (WER), which takes transcription sequentiality into account, and the here re-formulated Bag of Words Word Error Rate (bWER), which ignores order. While the latter directly and very accurately assesses intrinsic word recognition errors, the difference between both metrics gracefully correlates with the Normalised Spearman's Foot Rule Distance (NSFD), a metric which explicitly measures RO errors associated with layout analysis flaws.
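
The contrast between an order-sensitive and an order-insensitive word error rate can be illustrated with a minimal Python sketch. It is not the paper's implementation: the bag-of-words variant below uses one common formulation (unmatched word counts, with a substitution pairing one missing and one extra word), which may differ from the re-formulated bWER, and the footrule function assumes both transcripts hold the same items and normalises by the maximum footrule value floor(n^2/2). All function names are illustrative.

```python
from collections import Counter


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Order-sensitive word error rate, normalised by reference length."""
    return word_edit_distance(ref, hyp) / len(ref)


def bag_of_words_wer(ref, hyp):
    """Order-insensitive error rate (assumed formulation, see lead-in)."""
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    missing = sum((ref_counts - hyp_counts).values())  # reference words not found
    extra = sum((hyp_counts - ref_counts).values())    # spurious hypothesis words
    return max(missing, extra) / len(ref)


def normalised_footrule(ref_order, hyp_order):
    """Spearman's footrule between two orderings of the same items, in [0, 1]."""
    n = len(ref_order)
    if n < 2:
        return 0.0
    position = {item: k for k, item in enumerate(hyp_order)}
    total = sum(abs(k - position[item]) for k, item in enumerate(ref_order))
    return total / (n * n // 2)


reference = "the quick brown fox".split()
reordered = "quick the brown fox".split()   # same words, wrong order
misread = "the quick brown dog".split()     # right order, one wrong word

print(wer(reference, reordered), bag_of_words_wer(reference, reordered))
print(wer(reference, misread), bag_of_words_wer(reference, misread))
```

In this toy case a pure reordering raises the order-sensitive rate while leaving the bag-of-words rate at zero, whereas a misread word raises both equally, which is the behaviour that motivates reporting the two metrics side by side.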

Submitted 21 May, 2023; v1 submitted 14 January, 2023; originally announced January 2023.

Comments: Published in Pattern Recognition

ACM Class: I.5.4

arXiv:2212.02352 [pdf, ps, other]

Fake News and Hate Speech: Language in Common

Authors: Berta Chulvi, Alejandro Toselli, Paolo Rosso

Abstract: In this paper we raise the research question of whether fake news and hate speech spreaders share common patterns in language. We compute a novel index, the ingroup vs outgroup index, in three different datasets and we show that both phenomena share an "us vs them" narrative.

Submitted 5 December, 2022; originally announced December 2022.

Comments: 2 pages

arXiv:2206.13342 [pdf, other]

Open Set Classification of Untranscribed Handwritten Documents

Authors: José Ramón Prieto, Juan José Flores, Enrique Vidal, Alejandro H. Toselli, David Garrido, Carlos Alonso

Abstract: Huge amounts of digital page images of important manuscripts are preserved in archives worldwide. The amounts are so large that it is generally unfeasible for archivists to adequately tag most of the documents with the required metadata so as to allow proper organization of the archives and effective exploration by scholars and the general public. The class or "typology" of a document is perhaps the most important tag to be included in the metadata. The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images, by the textual contents of the images. The approach considered is based on "probabilistic indexing", a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with promising results.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2112.12703 [pdf, other]

doi 10.1007/978-3-030-86331-9_30

Digital Editions as Distant Supervision for Layout Analysis of Printed Books

Authors: Alejandro H. Toselli, Si Wu, David A. Smith

Abstract: Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.

Submitted 23 December, 2021; originally announced December 2021.

Comments: 15 pages, 2 figures. International Conference on Document Analysis and Recognition. Springer, Cham, 2021

arXiv:2106.08499 [pdf, other]

ICDAR 2021 Competition on Components Segmentation Task of Document Photos

Authors: Celso A. M. Lopes Junior, Ricardo B. das Neves Junior, Byron L. D. Bezerra, Alejandro H. Toselli, Donato Impedovo

Abstract: This paper describes the short-term competition on the Components Segmentation Task of Document Photos that was prepared in the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021). This competition aims to bring together researchers working in the field of identification document image processing and provides them with a suitable benchmark to compare their techniques on the component segmentation task of document images. Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. There were 16 participants whose results obtained for some or all of the three tasks show different rates for the adopted metrics, like Dice Similarity Coefficient ranging from 0.06 to 0.99. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks. Obtained results show that the currently applied methods for solving one of the proposed tasks (document boundary detection) are already well established. However, for the other two challenge tasks (text zone and handwritten sign detection) research and development of more robust approaches are still required to achieve acceptable results.
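
For readers unfamiliar with the Dice Similarity Coefficient mentioned above, the following Python sketch computes it for two binary segmentation masks as 2|A ∩ B| / (|A| + |B|). It is an illustration only, not the competition's evaluation code; the mask representation (nested lists of 0/1) and the convention of returning 1.0 when both masks are empty are assumptions made here.

```python
def dice_coefficient(predicted, truth):
    """Dice similarity between two binary masks given as nested lists of 0/1."""
    pred_flat = [bool(p) for row in predicted for p in row]
    true_flat = [bool(t) for row in truth for t in row]
    if len(pred_flat) != len(true_flat):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(p and t for p, t in zip(pred_flat, true_flat))
    total = sum(pred_flat) + sum(true_flat)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


predicted_mask = [[1, 1],
                  [0, 0]]
ground_truth = [[1, 0],
                [0, 0]]
print(dice_coefficient(predicted_mask, ground_truth))
```

A score of 1 means the predicted and reference regions coincide exactly and 0 means they do not overlap at all, so the reported range of 0.06 to 0.99 spans nearly the whole scale.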

Submitted 8 July, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: 15 pages; 5 figures; Accepted at ICDAR 2021: 16th International Conference on Document Analysis and Recognition

arXiv:2104.04556 [pdf, other]

A Probabilistic Framework for Lexicon-based Keyword Spotting in Handwritten Text Images

Authors: E. Vidal, A. H. Toselli, J. Puigcerver

Abstract: Query by String Keyword Spotting (KWS) is here considered as a key technology for indexing large collections of handwritten text images to allow fast textual access to the contents of these collections. Under this perspective, a probabilistic framework for lexicon-based KWS in text images is presented. The presentation aims at providing a tutorial view that helps to understand the relations between classical statements of KWS and the relative challenges entailed by these statements. More specifically, the development of the proposed framework makes it self-evident that word recognition or classification implicitly or explicitly underlies any formulation of KWS. Moreover, it clearly suggests that the same statistical models and training methods successfully used for handwriting text recognition can advantageously be used also for KWS, even though KWS does not generally require or rely on any kind of previously produced image transcripts. These ideas are developed into a specific, probabilistically sound approach for segmentation-free, lexicon-based, query-by-string KWS. Experiments carried out using this approach are presented, which support the consistency and general interest of the proposed framework. Several datasets, traditionally used for KWS benchmarking are considered, with results significantly better than those previously published for these datasets. In addition, results on two new, larger handwritten text image datasets are reported, showing the great potential of the methods proposed in this paper for indexing and textual search in large collections of handwritten documents.
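
The idea of query-by-string search over an index of word probabilities can be sketched in a few lines of Python. This is a toy illustration, not the framework from the paper: the index contents, the image identifiers, the `spot` function, and the use of a fixed confidence threshold are all hypothetical, and the probabilities are made up rather than produced by any recognition model.

```python
# Hypothetical probabilistic index: for each image, the estimated probability
# that a given lexicon word is written somewhere in that image.
toy_index = {
    "page_1": {"notario": 0.62, "testamento": 0.08},
    "page_2": {"notario": 0.91, "venta": 0.40},
    "page_3": {"venta": 0.77},
}


def spot(index, query, threshold=0.5):
    """Return (image_id, probability) pairs for the query word, best first."""
    hits = [
        (image_id, word_probs[query])
        for image_id, word_probs in index.items()
        if word_probs.get(query, 0.0) >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(spot(toy_index, "notario"))
print(spot(toy_index, "venta", threshold=0.3))
```

Lowering the threshold returns more candidate images at the cost of more false alarms, so the threshold acts as a precision and recall trade-off; no transcript of the images is needed at query time.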

Submitted 9 April, 2021; originally announced April 2021.

Comments: 42 pages, 35 headers, 16 figures/tables

Report number: Tech. rep., UPV (2017)

Showing 1–7 of 7 results for author: Toselli, A