-
What distinguishes conspiracy from critical narratives? A computational analysis of oppositional discourse
Authors:
Damir Korenčić,
Berta Chulvi,
Xavier Bonet Casals,
Alejandro Toselli,
Mariona Taulé,
Paolo Rosso
Abstract:
The current prevalence of conspiracy theories on the internet is a significant issue, tackled by many computational approaches. However, these approaches fail to recognize the relevance of distinguishing between texts which contain a conspiracy theory and texts which are simply critical and oppose mainstream narratives. Furthermore, little attention is usually paid to the role of inter-group conflict in oppositional narratives. We contribute by proposing a novel topic-agnostic annotation scheme that differentiates between conspiracies and critical texts, and that defines span-level categories of inter-group conflict. We also contribute with the multilingual XAI-DisInfodemics corpus (English and Spanish), which contains a high-quality annotation of Telegram messages related to COVID-19 (5,000 messages per language). We also demonstrate the feasibility of an NLP-based automatization by performing a range of experiments that yield strong baseline solutions. Finally, we perform an analysis which demonstrates that the promotion of intergroup conflict and the presence of violence and anger are key aspects to distinguish between the two types of oppositional narratives, i.e., conspiracy vs. critical.
Submitted 15 July, 2024;
originally announced July 2024.
-
RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian
Authors:
Adrian Cosma,
Bogdan Iordache,
Paolo Rosso
Abstract:
Recently, large language models (LLMs) have become increasingly powerful and have become capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving still remain a difficult task, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power for code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.
Submitted 20 February, 2024;
originally announced February 2024.
-
QSpeckleFilter: a Quantum Machine Learning approach for SAR speckle filtering
Authors:
Francesco Mauro,
Alessandro Sebastianelli,
Maria Pia Del Rosso,
Paolo Gamba,
Silvia Liberata Ullo
Abstract:
The use of Synthetic Aperture Radar (SAR) has greatly advanced our capacity for comprehensive Earth monitoring, providing detailed insights into terrestrial surface use and cover regardless of weather conditions, and at any time of day or night. However, SAR imagery quality is often compromised by speckle, a granular disturbance that poses challenges in producing accurate results without suitable data processing. In this context, the present paper explores the application of Quantum Machine Learning (QML) to speckle filtering, harnessing quantum algorithms to address computational complexities. We introduce QSpeckleFilter, a novel QML model for SAR speckle filtering. Compared with previous work by the same authors, the proposed method shows superior performance in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) on a testing dataset, and it opens new avenues for Earth Observation (EO) applications.
Submitted 2 February, 2024;
originally announced February 2024.
-
Reading Between the Frames: Multi-Modal Depression Detection in Videos from Non-Verbal Cues
Authors:
David Gimeno-Gómez,
Ana-Maria Bucur,
Adrian Cosma,
Carlos-David Martínez-Hinarejos,
Paolo Rosso
Abstract:
Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub.
Submitted 5 January, 2024;
originally announced January 2024.
-
Toxic language detection: a systematic review of Arabic datasets
Authors:
Imene Bensalem,
Paolo Rosso,
Hanane Zitouni
Abstract:
The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository (https://github.com/Imene1/Arabic-toxic-language).
Submitted 29 January, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection
Authors:
Gretel Liz De la Peña Sarracén,
Paolo Rosso,
Robert Litschko,
Goran Glavaš,
Simone Paolo Ponzetto
Abstract:
Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that MIXAG improves significantly and consistently across all target languages in multidomain and multilingual environments. Finally, we show through an error analysis how the domain adaptation can favour the class of abusive texts (reducing false negatives) but, at the same time, reduces the precision of the abusive language detection model.
Submitted 3 November, 2023;
originally announced November 2023.
-
Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains
Authors:
Areg Mikael Sarvazyan,
José Ángel González,
Marc Franco-Salvador,
Francisco Rangel,
Berta Chulvi,
Paolo Rosso
Abstract:
This paper presents the overview of the AuTexTification shared task as part of the IberLEF 2023 Workshop in Iberian Languages Evaluation Forum, within the framework of the SEPLN 2023 conference. AuTexTification consists of two subtasks: for Subtask 1, participants had to determine whether a text is human-authored or has been generated by a large language model. For Subtask 2, participants had to attribute a machine-generated text to one of six different text generation models. Our AuTexTification 2023 dataset contains more than 160,000 texts across two languages (English and Spanish) and five domains (tweets, reviews, news, legal, and how-to articles). A total of 114 teams signed up to participate, of which 36 sent 175 runs, and 20 of them sent their working notes. In this overview, we present the AuTexTification dataset and task, the submitted participating systems, and the results.
Submitted 20 September, 2023;
originally announced September 2023.
-
Mitigating Negative Transfer with Task Awareness for Sexism, Hate Speech, and Toxic Language Detection
Authors:
Angel Felipe Magnossão de Paula,
Paolo Rosso,
Damiano Spina
Abstract:
This paper proposes a novel approach to mitigate the negative transfer problem. In the field of machine learning, the common strategy is to apply the Single-Task Learning approach in order to train a supervised model to solve a specific task. Training a robust model requires a lot of data and a significant amount of computational resources, making this solution unfeasible in cases where data are unavailable or expensive to gather. Therefore, another solution, based on the sharing of information between tasks, has been developed: Multi-Task Learning (MTL). Despite the recent developments regarding MTL, the problem of negative transfer has yet to be solved. Negative transfer is a phenomenon that occurs when noisy information is shared between tasks, resulting in a drop in performance. This paper proposes a new approach to mitigate the negative transfer problem based on the task awareness concept. The proposed approach diminishes negative transfer and improves performance over the classic MTL solution. Moreover, the proposed approach has been implemented in two unified architectures to detect Sexism, Hate Speech, and Toxic Language in text comments. The proposed architectures set a new state of the art on both the EXIST-2021 and HatEval-2019 benchmarks.
Submitted 7 July, 2023;
originally announced July 2023.
-
Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages
Authors:
Angel Felipe Magnossão de Paula,
Imene Bensalem,
Paolo Rosso,
Wajdi Zaghouani
Abstract:
This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using 2 ensemble approaches. The best results on the training set, in a five-fold cross validation scenario, were obtained by using the ensemble approach based on the majority vote. The evaluation of this approach on the test set resulted in an F1-score of 0.60 and an Accuracy of 0.86.
Submitted 17 March, 2023;
originally announced March 2023.
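
The abstract above reports that a majority-vote ensemble of transformer models gave the best cross-validation results. As a rough illustration of majority voting only, and not of the authors' system, the following minimal sketch combines per-model label predictions. The function name `majority_vote`, the input layout, and the tie-breaking rule (prefer the label from the earliest model) are all assumptions introduced here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several models by majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length (one label per example).
    Ties go to the label predicted by the earliest model in the list
    (an assumption of this sketch, not taken from the paper).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    combined = []
    # zip(*predictions) yields, for each example, the votes of all models
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # pick the first vote (in model order) that reaches the top count
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


if __name__ == "__main__":
    # Three hypothetical models, three examples, binary labels
    model_outputs = [
        [1, 0, 1],
        [1, 1, 0],
        [0, 1, 1],
    ]
    print(majority_vote(model_outputs))
```

With an odd number of binary classifiers no ties occur, so the tie-breaking rule only matters for even-sized ensembles or multi-class labels.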
-
Multilingual Detection of Check-Worthy Claims using World Languages and Adapter Fusion
Authors:
Ipek Baris Schlicht,
Lucie Flek,
Paolo Rosso
Abstract:
Check-worthiness detection is the task of identifying claims worth investigating by fact-checkers. Resource scarcity for non-world languages and model learning costs remain major challenges for the creation of models supporting multilingual check-worthiness detection. This paper proposes cross-training adapters on a subset of world languages, combined by adapter fusion, to detect claims emerging globally in multiple languages. (1) With a vast number of annotators available for world languages and the storage-efficient adapter models, this approach is more cost efficient. Models can be updated more frequently and thus stay up-to-date. (2) Adapter fusion provides insights and allows for interpretation regarding the influence of each adapter model on a particular language. The proposed solution often outperformed the top multilingual approaches in our benchmark tasks.
Submitted 13 January, 2023;
originally announced January 2023.
-
It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers
Authors:
Ana-Maria Bucur,
Adrian Cosma,
Paolo Rosso,
Liviu P. Dinu
Abstract:
Depression detection from user-generated content on the internet has been a long-lasting topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquitous use of social media platforms lays out the perfect avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models for extracting image and text embeddings. Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant, which can operate on randomly sampled and unordered sets of posts to be more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset, and 0.902 F1 score on the only multimodal Reddit dataset.
Submitted 6 February, 2023; v1 submitted 13 January, 2023;
originally announced January 2023.
-
Fake News and Hate Speech: Language in Common
Authors:
Berta Chulvi,
Alejandro Toselli,
Paolo Rosso
Abstract:
In this paper we raise the research question of whether fake news and hate speech spreaders share common patterns in language. We compute a novel index, the ingroup vs outgroup index, in three different datasets and we show that both phenomena share an "us vs them" narrative.
Submitted 5 December, 2022;
originally announced December 2022.
-
UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu
Authors:
Maaz Amjad,
Grigori Sidorov,
Alisa Zhila,
Alexander Gelbukh,
Paolo Rosso
Abstract:
This paper gives the overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language. This is a binary classification task in which the goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing. The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. 42 teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task. 9 teams submitted their experimental results. The participants used various machine learning methods ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score value of 0.90, showing that the BERT-based approach outperforms other machine learning classifiers.
Submitted 24 July, 2022;
originally announced July 2022.
-
Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2020
Authors:
Maaz Amjad,
Grigori Sidorov,
Alisa Zhila,
Alexander Gelbukh,
Paolo Rosso
Abstract:
This overview paper describes the first shared task on fake news detection in the Urdu language. The task was posed as a binary classification task, in which the goal is to differentiate between real and fake news. We provided a dataset divided into 900 annotated news articles for training and 400 news articles for testing. The dataset contained news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. 42 teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task. 9 teams submitted their experimental results. The participants used various machine learning methods ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score value of 0.90, showing that the BERT-based approach outperforms other machine learning techniques.
Submitted 24 July, 2022;
originally announced July 2022.
-
The OpenMP Cluster Programming Model
Authors:
Hervé Yviquel,
Marcio Pereira,
Emílio Francesquini,
Guilherme Valarini,
Gustavo Leite,
Pedro Rosso,
Rodrigo Ceccato,
Carla Cusihualpa,
Vitoria Dias,
Sandro Rigo,
Alan Souza,
Guido Araujo
Abstract:
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
Submitted 13 August, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder
Authors:
Ana-Maria Bucur,
Adrian Cosma,
Liviu P. Dinu,
Paolo Rosso
Abstract:
This work proposes a transformer architecture for user-level classification of gambling addiction and depression that is trainable end-to-end. As opposed to other methods that operate at the post level, we process a set of social media posts from a particular individual, to make use of the interactions between posts and eliminate label noise at the post level. We exploit the fact that, by not injecting positional encodings, multi-head attention is permutation invariant and we process randomly sampled sets of texts from a user after being encoded with a modern pretrained sentence encoder (RoBERTa / MiniLM). Moreover, our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation by identifying discriminating posts in a user's text-set. We perform ablation studies on hyper-parameters and evaluate our method for the eRisk 2022 Lab on early detection of signs of pathological gambling and early risk detection of depression. The method proposed by our team BLUE obtained the best ERDE5 score of 0.015, and the second-best ERDE50 score of 0.009 for pathological gambling detection. For the early detection of depression, we obtained the second-best ERDE50 of 0.027.
Submitted 2 July, 2022;
originally announced July 2022.
-
Cryptocurrency Bubble Detection: A New Stock Market Dataset, Financial Task & Hyperbolic Models
Authors:
Ramit Sawhney,
Shivam Agarwal,
Vivek Mittal,
Paolo Rosso,
Vikram Nanda,
Sudheer Chava
Abstract:
The rapid spread of information over social media influences quantitative trading and investments. The growing popularity of speculative trading of highly volatile assets such as cryptocurrencies and meme stocks presents a fresh challenge in the financial realm. Investigating such "bubbles", periods of sudden anomalous market behavior, is critical for better understanding investor behavior and market dynamics. However, high volatility coupled with massive volumes of chaotic social media texts, especially for underexplored assets like cryptocoins, poses a challenge to existing methods. Taking the first step towards NLP for cryptocoins, we present and publicly release CryptoBubbles, a novel multi-span identification task for bubble detection, and a dataset of more than 400 cryptocoins from 9 exchanges over five years spanning over two million tweets. Further, we develop a set of sequence-to-sequence hyperbolic models suited to this multi-span identification task based on the power-law dynamics of cryptocurrencies and user behavior on social media. We further test the effectiveness of our models under zero-shot settings on a test set of Reddit posts pertaining to 29 "meme stocks", which see an increase in trade volume due to social media hype. Through quantitative, qualitative, and zero-shot analyses on Reddit and Twitter spanning cryptocoins and meme-stocks, we show the practical applicability of CryptoBubbles and hyperbolic models.
Submitted 11 May, 2022;
originally announced June 2022.
-
FACTOID: A New Dataset for Identifying Misinformation Spreaders and Political Bias
Authors:
Flora Sakketou,
Joan Plepi,
Riccardo Cervero,
Henri-Jacques Geiss,
Paolo Rosso,
Lucie Flek
Abstract:
Proactively identifying misinformation spreaders is an important step towards mitigating the impact of fake news on our society. In this paper, we introduce a new contemporary Reddit dataset for fake news spreader analysis, called FACTOID, monitoring political discussions on Reddit since the beginning of 2020. The dataset contains over 4K users with 3.4M Reddit posts, and includes, beyond the users' binary labels, also their fine-grained credibility level (very low to very high) and their political bias strength (extreme right to extreme left). As far as we are aware, this is the first fake news spreader dataset that simultaneously captures both the long-term context of users' historical posts and the interactions between them. To create the first benchmark on our data, we provide methods for identifying misinformation spreaders by utilizing the social connections between the users along with their psycho-linguistic features. We show that the users' social interactions can, on their own, indicate misinformation spreading, while the psycho-linguistic features are mostly informative in non-neural classification settings. In a qualitative analysis, we observe that detecting affective mental processes correlates negatively with right-biased users, and that the openness to experience factor is lower for those who spread fake news.
Submitted 11 May, 2022;
originally announced May 2022.
-
Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios
Authors:
Petr Lorenc,
Ana-Sabina Uban,
Paolo Rosso,
Jan Šedivý
Abstract:
The high prevalence of depression in society has given rise to the need for new digital tools to assist in its early detection. To this end, existing research has mainly focused on detecting depression in the domain of social media, where there is a sufficient amount of data. However, with the rise of conversational agents like Siri or Alexa, the conversational domain is becoming more critical. Unfortunately, there is a lack of data in the conversational domain. We perform a study focusing on domain adaptation from social media to the conversational domain. Our approach mainly exploits the linguistic information preserved in the vector representation of text. We describe transfer learning techniques to classify users who suffer from early signs of depression with high recall. We achieve state-of-the-art results on a commonly used conversational dataset, and we highlight how the method can easily be used in conversational agents. We publicly release all source code.
Submitted 22 April, 2022;
originally announced April 2022.
-
Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers
Authors:
Angelo Basile,
Marc Franco-Salvador,
Paolo Rosso
Abstract:
Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.
Submitted 24 May, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
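The label-selection step described in the abstract above (embed the text and each label description in one space, then pick the most similar description) can be illustrated with a minimal sketch. The vectors below are made-up toy embeddings, and the function names `cosine` and `predict_label` are hypothetical; nothing here reproduces the authors' ranking or aggregation method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_label(text_vec, label_vecs):
    """Return the label whose description embedding is closest to the text.

    label_vecs maps a label name to the embedding of its description.
    """
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))

# Toy 2-d embeddings (assumed for illustration only).
labels = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
prediction = predict_label([0.9, 0.2], labels)
```

In practice the vectors would come from a sentence encoder; the selection rule stays the same.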
-
UPV at TREC Health Misinformation Track 2021 Ranking with SBERT and Quality Estimators
Authors:
Ipek Baris Schlicht,
Angel Felipe Magnossão de Paula,
Paolo Rosso
Abstract:
Health misinformation on search engines is a significant problem that could negatively affect individuals or public health. To mitigate the problem, TREC organizes a health misinformation track. This paper presents our submissions to this track. We use BM25 and a domain-specific semantic search engine for retrieving initial documents. Later, we examine a health news schema for quality assessment and apply it to re-rank documents. We merge the scores from the different components by using reciprocal rank fusion. Finally, we discuss the results and conclude with future work.
Submitted 11 December, 2021;
originally announced December 2021.
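Reciprocal rank fusion, which the abstract above uses to merge component scores, gives each document the sum of 1 / (k + rank) over all input rankings. The sketch below is a generic illustration, not the submission's code: the function name is hypothetical, and k = 60 is the commonly used constant rather than a value stated in the abstract.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant (60 is a common choice, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from three retrieval components.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["c", "a", "b"],
])
```

A document missing from one ranking simply receives no contribution from it, which is why the method works without score normalization across components.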
-
UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims
Authors:
Ipek Baris Schlicht,
Angel Felipe Magnossão de Paula,
Paolo Rosso
Abstract:
Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities in determining what is check-worthy. In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias. To this end, we experiment with joint training using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.
Submitted 19 September, 2021;
originally announced September 2021.
-
Studying Fake News Spreading, Polarisation Dynamics, and Manipulation by Bots: a Tale of Networks and Language
Authors:
Giancarlo Ruffo,
Alfonso Semeraro,
Anastasia Giachanou,
Paolo Rosso
Abstract:
With the explosive growth of online social media, the ancient problem of information disorders interfering with news diffusion has surfaced with a renewed intensity, threatening our democracies, public health, and news outlets' credibility. As a result, thousands of scientific papers have been published in a relatively short period, making researchers of different disciplines struggle with an information overload problem. The aim of this survey is threefold: (1) we present the results of a network-based analysis of the existing multidisciplinary literature to support the search for relevant trends and central publications; (2) we describe the main results and necessary background to attack the problem from a computational perspective; (3) we review selected contributions using network science as a unifying framework and computational linguistics as the tool to make sense of the shared content. Although scholars working on computational linguistics and networks traditionally belong to different scientific communities, we expect that those interested in the area of fake news should be aware of crucial aspects of both disciplines.
Submitted 14 January, 2023; v1 submitted 13 September, 2021;
originally announced September 2021.
-
On Board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery
Authors:
Maria Pia Del Rosso,
Alessandro Sebastianelli,
Dario Spiller,
Pierre Philippe Mathieu,
Silvia Liberata Ullo
Abstract:
In recent years, the growth of Machine Learning (ML) algorithms has increased the number of studies on their applicability in a variety of scenarios. Among these, aerospace is one of the hardest, due to its peculiar physical requirements. In this context, a feasibility study and a first prototype for an Artificial Intelligence (AI) model to be deployed on board satellites are presented in this work. As a case study, the detection of volcanic eruptions has been investigated as a method to swiftly produce alerts and allow immediate interventions. Two Convolutional Neural Networks (CNNs) have been proposed and designed, showing how to efficiently implement them for identifying the eruptions while adapting their complexity to fit on-board requirements.
Submitted 28 July, 2021; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Spatio-Temporal SAR-Optical Data Fusion for Cloud Removal via a Deep Hierarchical Model
Authors:
Alessandro Sebastianelli,
Artur Nowakowski,
Erika Puglisi,
Maria Pia Del Rosso,
Jamila Mifdal,
Fiora Pirri,
Pierre Philippe Mathieu,
Silvia Liberata Ullo
Abstract:
Cloud removal is a relevant topic in Remote Sensing as it fosters the usability of high-resolution optical images for Earth monitoring and study. Related techniques have been analyzed for years with a progressively clearer view of the appropriate methods to adopt, from multi-spectral to inpainting methods. Recent applications of deep generative models and sequence-to-sequence-based models have proved their capability to advance the field significantly. Nevertheless, there are still some gaps, mostly related to the amount of cloud coverage, the density and thickness of clouds, and the occurred temporal landscape changes. In this work, we fill some of these gaps by introducing a novel multi-modal method that uses different sources of information, both spatial and temporal, to restore the whole optical scene of interest. The proposed method introduces an innovative deep model, using the outcomes of both temporal-sequence blending and direct translation from Synthetic Aperture Radar (SAR) to optical images to obtain a pixel-wise restoration of the whole scene. The advantage of our approach is demonstrated across a variety of atmospheric conditions tested on a dataset we have generated and made available. Quantitative and qualitative results prove that the proposed method obtains cloud-free images, preserving scene details without resorting to a huge portion of a clean image and coping with landscape changes.
Submitted 28 March, 2022; v1 submitted 23 June, 2021;
originally announced June 2021.
-
Paradigm selection for Data Fusion of SAR and Multispectral Sentinel data applied to Land-Cover Classification
Authors:
Alessandro Sebastianelli,
Maria Pia Del Rosso,
Pierre Philippe Mathieu,
Silvia Liberata Ullo
Abstract:
Data fusion is a well-known technique that is becoming more and more popular in the Artificial Intelligence for Earth Observation (AI4EO) domain, mainly due to its ability to reinforce AI4EO applications by combining multiple data sources and thus bringing better results. On the other hand, like other methods for satellite data analysis, data fusion itself is also benefiting and evolving thanks to the integration of Artificial Intelligence (AI). In this letter, four data fusion paradigms, based on Convolutional Neural Networks (CNNs), are analyzed and implemented. The goals are to provide a systematic procedure for choosing the data fusion framework that yields the best classification results, once the basic structure for the CNN has been defined, and to help interested researchers in their work when data fusion applied to remote sensing is involved. The procedure has been validated for land-cover classification but it can be transferred to other cases.
Submitted 18 June, 2021;
originally announced June 2021.
-
A speckle filter for Sentinel-1 SAR Ground Range Detected data based on Residual Convolutional Neural Networks
Authors:
Alessandro Sebastianelli,
Maria Pia Del Rosso,
Silvia Liberata Ullo,
Paolo Gamba
Abstract:
In recent years, machine learning (ML) algorithms have become widespread in all the fields of remote sensing (RS) and earth observation (EO). This has allowed the rapid development of new procedures to solve problems affecting these sectors. In this context, this work aims at presenting a novel method for filtering speckle noise from Sentinel-1 ground range detected (GRD) data by applying deep learning (DL) algorithms, based on convolutional neural networks (CNNs). The paper provides an easy yet very effective approach to extract the large amount of training data needed for DL approaches in this challenging case. The experimental results on simulated speckled images and an actual SAR dataset show a clear improvement with respect to the state of the art in terms of peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), equivalent number of looks (ENL), proving the effectiveness of the proposed architecture.
Submitted 17 May, 2022; v1 submitted 19 April, 2021;
originally announced April 2021.
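Of the metrics named in the abstract above, peak signal-to-noise ratio has a compact standard definition: PSNR = 10 log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the filtered image. The sketch below is a generic illustration on flat lists of pixel values; the function name and the 8-bit peak value of 255 are assumptions, not details taken from the paper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (255 assumed, i.e. 8-bit data).
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel values: a clean patch and a slightly noisy version of it.
clean = [10, 20, 30, 40]
noisy = [12, 18, 33, 39]
value_db = psnr(clean, noisy)
```

Higher values indicate that the filtered image is closer to the reference.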
-
FakeFlow: Fake News Detection by Modeling the Flow of Affective Information
Authors:
Bilal Ghanem,
Simone Paolo Ponzetto,
Paolo Rosso,
Francisco Rangel
Abstract:
Fake news articles often stir the readers' attention by means of emotional appeals that arouse their feelings. Unlike in short news texts, authors of longer articles can exploit such affective factors to manipulate readers by adding exaggerations or fabricating events, in order to affect the readers' emotions. To capture this, we propose in this paper to model the flow of affective information in fake news articles using a neural architecture. The proposed model, FakeFlow, learns this flow by combining topic and affective information extracted from text. We evaluate the model's performance with several experiments on four real-world datasets. The results show that FakeFlow achieves superior results when compared against state-of-the-art methods, thus confirming the importance of capturing the flow of the affective information in news articles.
Submitted 24 January, 2021;
originally announced January 2021.
-
Analysis and tuning of hierarchical topic models based on Renyi entropy approach
Authors:
Sergei Koltcov,
Vera Ignatenko,
Maxim Terpilovskii,
Paolo Rosso
Abstract:
Hierarchical topic modeling is a potentially powerful instrument for determining the topical structure of text collections that allows constructing a topical hierarchy representing levels of topical abstraction. However, tuning of parameters of hierarchical models, including the number of topics on each hierarchical level, remains a challenging task and an open issue. In this paper, we propose a Renyi entropy-based approach for a partial solution to the above problem. First, we propose a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical concept of hierarchical topic model tuning tested on datasets with human mark-up. In the numerical experiments, we consider three different hierarchical models, namely, hierarchical latent Dirichlet allocation (hLDA) model, hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far away from the true numbers for labeled datasets. For hPAM model, the Renyi entropy approach allows us to determine only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two hierarchical levels.
Submitted 19 January, 2021;
originally announced January 2021.
-
Multilingual Irony Detection with Dependency Syntax and Neural Models
Authors:
Alessandra Teresa Cignarella,
Valerio Basile,
Manuela Sanguinetti,
Cristina Bosco,
Paolo Rosso,
Farah Benamara
Abstract:
This paper presents an in-depth investigation of the effectiveness of dependency-based syntactic features on the irony detection task in a multilingual perspective (English, Spanish, French and Italian). It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme. Three distinct experimental settings are provided. In the first, a variety of syntactic dependency-based features combined with classical machine learning classifiers are explored. In the second scenario, two well-known types of word embeddings are trained on parsed data and tested against gold standard datasets. In the third setting, dependency-based syntactic features are combined into the Multilingual BERT architecture. The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
Submitted 11 November, 2020;
originally announced November 2020.
-
Classifier Combination Approach for Question Classification for Bengali Question Answering System
Authors:
Somnath Banerjee,
Sudip Kumar Naskar,
Paolo Rosso,
Sivaji Bandyopadhyay
Abstract:
Question classification (QC) is a prime constituent of an automated question answering system. The work presented here demonstrates that a combination of multiple models achieves better classification performance than existing individual models for the question classification task in Bengali. We have exploited state-of-the-art multiple model combination techniques, i.e., ensemble, stacking and voting, to increase QC accuracy. Lexical, syntactic and semantic features of Bengali questions are used for four well-known classifiers, namely Naïve Bayes, kernel Naïve Bayes, Rule Induction, and Decision Tree, which serve as our base learners. A single-layer question-class taxonomy with 8 coarse-grained classes is extended to a two-layer taxonomy by adding 69 fine-grained classes. We carried out the experiments on both the single-layer and the two-layer taxonomies. Experimental results confirmed that classifier combination approaches outperform single-classifier approaches by 4.02% for coarse-grained question classes. Overall, the stacking approach produces the best results for fine-grained classification and achieves an accuracy of 87.79%. The approach presented here could be used in other Indo-Aryan or Indic languages to develop a question answering system.
Submitted 6 September, 2020; v1 submitted 31 August, 2020;
originally announced August 2020.
-
LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis
Authors:
Somnath Banerjee,
Sahar Ghannay,
Sophie Rosset,
Anne Vilnat,
Paolo Rosso
Abstract:
This paper describes the participation of the LIMSI UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in the SentiMix Hindi-English subtask, which addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. For code-mixed sentiment analysis, we propose a Recurrent Convolutional Neural Network that combines a recurrent neural network and a convolutional network to better capture the semantics of the text. The proposed system obtained an F1 score of 0.69 (best run) on the given test data and achieved 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask.
Submitted 30 August, 2020;
originally announced August 2020.
-
Automatic Dataset Builder for Machine Learning Applications to Satellite Imagery
Authors:
Alessandro Sebastianelli,
Maria Pia Del Rosso,
Silvia Liberata Ullo
Abstract:
Nowadays the use of Machine Learning (ML) algorithms is spreading in the field of Remote Sensing, with applications ranging from detection and classification of land use and monitoring to the prediction of many natural or anthropic phenomena of interest. One main limit of their employment is related to the need for a huge amount of data for training the neural network, chosen for the specific application, and the resulting computational weight and time required to collect the necessary data. In this letter, the architecture of an innovative tool that enables researchers to automatically create suitable datasets for AI (Artificial Intelligence) applications in the EO (Earth Observation) context is presented. Two versions of the architecture have been implemented and made available on GitHub, with a specific Graphical User Interface (GUI) for non-expert users.
Submitted 4 August, 2020;
originally announced August 2020.
-
#Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection
Authors:
Mirko Lai,
Viviana Patti,
Giancarlo Ruffo,
Paolo Rosso
Abstract:
In recent years, interest has grown around the classification of the stance that users assume within online debates. Stance has usually been addressed by considering users' posts in isolation, while social studies highlight that social communities may contribute to influencing users' opinions. Furthermore, stance should be studied from a diachronic perspective, since this could help to shed light on the dynamics of users' opinion shifts that can be recorded during the debate. We analyzed the political discussion in the UK about the BREXIT referendum on Twitter, proposing a novel approach and annotation schema for stance detection, with the main aim of investigating the role of features related to social network community and diachronic stance evolution. Classification experiments show that such features provide very useful clues for detecting stance.
Submitted 29 July, 2020;
originally announced July 2020.
-
Application of DInSAR Technique to High Coherence Satellite Images for Strategic Infrastructure Monitoring
Authors:
Tony De Corso,
Luca Mignone,
Alessandro Sebastianelli,
Maria Pia Del Rosso,
Claire Yost,
Elena Ciampa,
Marisa Pecce,
Stefania Sica,
Silvia Ullo
Abstract:
In this paper the authors present and validate a procedure that combines the latest state-of-the-art models in bridge monitoring with freely available satellite data. Through the Differential SAR interferometry (DInSAR) technique, a dataset of displacements for the Morandi bridge in Genoa (Italy), before its collapse, has been created by using images downloaded from the Copernicus Open-Access Hub and the ASFVertex Hub. The data have been processed through the ESA SNAP software to identify the rate of displacements in the parts of the bridge where collapse occurred. Results demonstrate that the adopted procedure has great potential in the application field, as it represents a simple and inexpensive method to monitor large structures continuously, helping to better quantify risks and guide effective mitigation countermeasures. Moreover, the same procedure, once properly validated, could be effectively extended to the current and future performance estimation of civil infrastructures.
Submitted 19 April, 2020;
originally announced April 2020.
-
Irony Detection in a Multilingual Context
Authors:
Bilal Ghanem,
Jihen Karoui,
Farah Benamara,
Paolo Rosso,
Véronique Moriceau
Abstract:
This paper proposes the first multilingual (French, English and Arabic) and multicultural (Indo-European languages vs. less culturally close languages) irony detection system. We employ both feature-based models and neural architectures using monolingual word representation. We compare the performance of these systems with state-of-the-art systems to identify their capabilities. We show that these monolingual models, trained separately on different languages using multilingual word representation or text-based features, can open the door to irony detection in languages that lack annotated data for irony.
Submitted 6 February, 2020;
originally announced February 2020.
-
Stryker: Scaling Specification-Based Program Repair by Pruning Infeasible Mutants with SAT
Authors:
Luciano Zemín,
Simón Gutiérrez Brida,
Santiago Bermúdez,
Santiago Perez De Rosso,
Nazareno Aguirre,
Ali Mili,
Ali Jaoua,
Marcelo F. Frias
Abstract:
Many techniques for automated program repair involve syntactic program transformations. Applying combinations of such transformations on faulty code yields fix candidates whose correctness must be determined. Exploring these combinations leads to an explosion in the number of generated fix candidates that severely limits the applicability of such fault repair techniques. This explosion is usually tamed by not considering fix candidates exhaustively, and by disabling intra-statement modifications. In this article we present a technique for program repair that considers an ample set of intra-statement syntactic operations, and explores fix candidates exhaustively up to a provided bound. The suitability of the technique, implemented in our tool Stryker, is supported by a novel mechanism to detect and prune infeasible fix candidates. This allows Stryker to repair programs with several bugs, whose fixes require multiple modifications. We evaluate our technique on a benchmark of faulty Java container classes, which Stryker is able to repair, pruning significant parts of the space of generated candidates when more than one bug is present in the code.
Submitted 30 October, 2019;
originally announced October 2019.
-
FacTweet: Profiling Fake News Twitter Accounts
Authors:
Bilal Ghanem,
Simone Paolo Ponzetto,
Paolo Rosso
Abstract:
We present an approach to detect fake news in Twitter at the account level using a neural recurrent model and a variety of different semantic and stylistic features. Our method extracts a set of features from the timelines of news Twitter accounts by reading their posts as chunks, rather than dealing with each tweet independently. We show the experimental benefits of modeling latent stylistic signatures of mixed fake and real news with a sequential model over a wide range of strong baselines.
Submitted 15 October, 2019;
originally announced October 2019.
-
TexTrolls: Identifying Russian Trolls on Twitter from a Textual Perspective
Authors:
Bilal Ghanem,
Davide Buscaldi,
Paolo Rosso
Abstract:
Newly emerging suspicious online users, usually called trolls, are one of the main sources of hateful, fake, and deceptive online messages. Some agendas use these harmful users to spread inciting tweets, and as a consequence the audience is deceived. The challenge in detecting such accounts is that they conceal their identities, which disguises them in social media and makes them harder to identify using their social network information alone. Therefore, in this paper, we propose a text-based approach to detect online trolls such as those discovered during the 2016 US presidential elections. Our approach is mainly based on textual features that utilize thematic information, and on profiling features that identify the accounts from their way of writing tweets. We deduced the thematic information in an unsupervised way, and we show that coupling it with the textual features enhanced the performance of the proposed model. In addition, we find that the proposed profiling features perform best compared to the textual features.
Submitted 3 October, 2019;
originally announced October 2019.
-
An Emotional Analysis of False Information in Social Media and News Articles
Authors:
Bilal Ghanem,
Paolo Rosso,
Francisco Rangel
Abstract:
Fake news is risky since it has been created to manipulate the readers' opinions and beliefs. In this work, we compared the language of false news to that of real news from an emotional perspective, considering a set of false information types (propaganda, hoax, clickbait, and satire) from social media and online news article sources. Our experiments showed that false information has different emotional patterns in each of its types, and emotions play a key role in deceiving the reader. Based on that, we proposed an emotionally-infused LSTM neural network model to detect false news.
Submitted 26 August, 2019;
originally announced August 2019.
-
Landslide Geohazard Assessment With Convolutional Neural Networks Using Sentinel-2 Imagery Data
Authors:
Silvia L. Ullo,
Maximillian S. Langenkamp,
Tuomas P. Oikarinen,
Maria P. Del Rosso,
Alessandro Sebastianelli,
Federica Piccirillo,
Stefania Sica
Abstract:
In this paper, the authors aim to combine the latest state-of-the-art models in image recognition with the best publicly available satellite images to create a system for landslide risk mitigation. We focus first on landslide detection and further propose a similar system to be used for prediction. Such models are valuable as they could easily be scaled up to provide data for hazard evaluation, as satellite imagery becomes increasingly available. The goal is to use satellite images and correlated data to enrich the public repository of data and guide disaster relief efforts for locating precise areas where landslides have occurred. Different image augmentation methods are used to increase diversity in the chosen dataset and create more robust classification. The resulting outputs are then fed into variants of 3-D convolutional neural networks. A review of the current literature indicates there is no research using CNNs (Convolutional Neural Networks) and freely available satellite imagery for classifying landslide risk. The model has ultimately been shown to achieve significantly better-than-baseline accuracy.
Submitted 10 June, 2019;
originally announced June 2019.
-
Unmasking Bias in News
Authors:
Javier Sánchez-Junquera,
Paolo Rosso,
Manuel Montes-y-Gómez,
Simone Paolo Ponzetto
Abstract:
We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand. Our results corroborate previous research on this task in that topic-related features yield better results than stylistic ones. We additionally show that competitive results can be achieved by simply including higher-length n-grams, which suggests the need to develop more challenging datasets and tasks that address implicit and more subtle forms of bias.
Submitted 11 June, 2019;
originally announced June 2019.
-
Low-dispersion low-loss dielectric gratings for efficient ultrafast laser pulse compression at high average powers
Authors:
David A. Alessi,
Hoang T. Nguyen,
Jerald A. Britten,
Paul A. Rosso,
Constantin Haefner
Abstract:
We have developed low-dispersion (1480 l/mm), resonance-free, diffraction gratings made of dielectric materials resistant to femtosecond laser damage $(SiO_{2}/HfO_{2})$. A 14 cm diameter sample was fabricated resulting in a mean diffraction efficiency of 99.1% at λ = 810 nm with 0.4% uniformity using equipment which can fabricate gratings up to 1m diagonal. The implementation of these gratings in the compression of 30 fs pulses in an out-of-plane geometry can result in compressor efficiencies of ~95%. The measured laser absorption is 500x lower than current ultrafast petawatt-class compressor gratings which will enable a substantial increase in average power handling capabilities of these laser systems.
Submitted 6 November, 2018;
originally announced November 2018.
-
UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering
Authors:
Marc Franco-Salvador,
Sudipta Kar,
Thamar Solorio,
Paolo Rosso
Abstract:
In this work we describe the system built for the three English subtasks of the SemEval 2016 Task 3 by the Department of Computer Science of the University of Houston (UH) and the Pattern Recognition and Human Language Technology (PRHLT) research center - Universitat Politècnica de València: UH-PRHLT. Our system represents instances by using both lexical and semantic-based similarity measures between text pairs. Our semantic features include the use of distributed representations of words, knowledge graphs generated with the BabelNet multilingual semantic network, and the FrameNet lexical database. Experimental results outperform the random and Google search engine baselines in the three English subtasks. Our approach obtained the highest results of subtask B compared to the other task participants.
Submitted 30 July, 2018;
originally announced July 2018.
-
Semantically-informed distance and similarity measures for paraphrase plagiarism identification
Authors:
Miguel A. Álvarez-Carmona,
Marc Franco-Salvador,
Esaú Villatoro-Tello,
Manuel Montes-y-Gómez,
Paolo Rosso,
Luis Villaseñor-Pineda
Abstract:
Paraphrase plagiarism identification represents a very complex task given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures are able to extract semantic information from either an external resource or a distributed representation of words, resulting in informative features for training a supervised classifier for detecting paraphrase plagiarism. Obtained results indicate that the proposed metrics are consistently good at detecting different types of paraphrase plagiarism. In addition, results are very competitive against state-of-the-art methods, with the advantage of representing a much simpler but equally effective solution.
Submitted 29 May, 2018;
originally announced May 2018.
-
A Resource-Light Method for Cross-Lingual Semantic Textual Similarity
Authors:
Goran Glavaš,
Marc Franco-Salvador,
Simone Paolo Ponzetto,
Paolo Rosso
Abstract:
Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
Submitted 19 January, 2018;
originally announced January 2018.
-
A Low Dimensionality Representation for Language Variety Identification
Authors:
Francisco Rangel,
Marc Franco-Salvador,
Paolo Rosso
Abstract:
Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with their specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality (and increasing the big data suitability) to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
Submitted 30 May, 2017;
originally announced May 2017.
-
Friends and Enemies of Clinton and Trump: Using Context for Detecting Stance in Political Tweets
Authors:
Mirko Lai,
Delia Irazú Hernández Farías,
Viviana Patti,
Paolo Rosso
Abstract:
Stance detection, the task of identifying the speaker's opinion towards a particular target, has attracted the attention of researchers. This paper describes a novel approach for detecting stance on Twitter. We define a set of features in order to consider the context surrounding a target of interest, with the final aim of training a model for predicting the stance towards the mentioned targets. In particular, we are interested in investigating political debates in social media. For this reason we evaluated our approach focusing on two targets of the SemEval-2016 Task 6 on detecting stance in tweets, which are related to the political campaign for the 2016 U.S. presidential elections: Hillary Clinton vs. Donald Trump. For the sake of comparison with the state of the art, we evaluated our model against the dataset released in the SemEval-2016 Task 6 shared task competition. Our results outperform the best ones obtained by participating teams, and show that information about enemies and friends of politicians helps in detecting stance towards them.
Submitted 26 February, 2017;
originally announced February 2017.
-
Squeezing bottlenecks: exploring the limits of autoencoder semantic representation capabilities
Authors:
Parth Gupta,
Rafael E. Banchs,
Paolo Rosso
Abstract:
We present a comprehensive study on the use of autoencoders for modelling text data, in which (differently from previous studies) we focus our attention on the following issues: i) we explore the suitability of two different models, bDA and rsDA, for constructing deep autoencoders for text data at the sentence level; ii) we propose and evaluate two novel metrics for better assessing the text-reconstruction capabilities of autoencoders; and iii) we propose an automatic method to find the critical bottleneck dimensionality for text language representations (below which structural information is lost).
Submitted 13 February, 2014;
originally announced February 2014.