-
VaccinEU: COVID-19 vaccine conversations on Twitter in French, German and Italian
Authors:
Marco Di Giovanni,
Francesco Pierri,
Christopher Torres-Lugo,
Marco Brambilla
Abstract:
Despite the increasing limitations for unvaccinated people, in many European countries there is still a non-negligible fraction of individuals who refuse to get vaccinated against SARS-CoV-2, undermining governmental efforts to eradicate the virus. We study the role of online social media in influencing individuals' opinion towards getting vaccinated by designing a large-scale collection of Twitter messages in three different languages -- French, German and Italian -- and providing public access to the data collected. Focusing on the European context, our VaccinEU dataset aims to help researchers to better understand the impact of online (mis)information about vaccines and design more accurate communication strategies to maximize vaccination coverage.
Submitted 4 April, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Authors:
Kaustubh D. Dhole,
Varun Gangal,
Sebastian Gehrmann,
Aadesh Gupta,
Zhenhao Li,
Saad Mahamood,
Abinaya Mahendiran,
Simon Mille,
Ashish Shrivastava,
Samson Tan,
Tongshuang Wu,
Jascha Sohl-Dickstein,
Jinho D. Choi,
Eduard Hovy,
Ondrej Dusek,
Sebastian Ruder,
Sajant Anand,
Nagender Aneja,
Rabin Banjade,
Lisa Barthe,
Hanna Behnke,
Ian Berlot-Attwell,
Connor Boyle,
Caroline Brun,
Marco Antonio Sobrevilla Cabezudo
, et al. (101 additional authors not shown)
Abstract:
Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
Submitted 11 October, 2022; v1 submitted 5 December, 2021;
originally announced December 2021.
-
Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings
Authors:
Marco Di Giovanni,
Marco Brambilla
Abstract:
Semantic sentence embeddings are usually built in a supervised fashion, by minimizing distances between pairs of embeddings of sentences labelled as semantically similar by annotators. Since large labelled datasets are rare, in particular for non-English languages, and expensive, recent studies focus on unsupervised approaches that require unpaired input sentences. We instead propose a language-independent approach to build large datasets of pairs of weakly similar informal texts, without manual human effort, exploiting Twitter's intrinsic powerful signals of relatedness: replies and quotes of tweets. We use the collected pairs to train a Transformer model with triplet-like structures, and we test the generated embeddings on Twitter NLP similarity tasks (PIT and TURL) and STSb. We also introduce four new sentence ranking evaluation benchmarks of informal texts, carefully extracted from the initial collections of tweets, showing that our best model not only learns classical Semantic Textual Similarity, but also excels on tasks where pairs of sentences are not exact paraphrases. Ablation studies reveal that increasing the corpus size positively influences the results, even at 2M samples, suggesting that bigger collections of tweets still do not contain redundant information about semantic similarities.
Submitted 5 October, 2021;
originally announced October 2021.
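The abstract above mentions training with "triplet-like structures" but does not state the exact loss. As a rough illustration only, the following sketch shows the standard triplet margin loss; treating a tweet as the anchor, its reply or quote as the positive, and an unrelated tweet as the negative is an assumption made here, and the function names and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed here, not taken from the paper).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (hypothetical): a tweet, its reply, and an unrelated tweet.
tweet = [0.0, 0.0]
reply = [0.0, 1.0]
unrelated = [3.0, 4.0]

loss_good = triplet_loss(tweet, reply, unrelated)  # reply is already much closer
loss_bad = triplet_loss(tweet, unrelated, reply)   # roles swapped, so the loss is large
```

In practice the embeddings would come from the Transformer encoder and the loss would be averaged over a batch of such triplets.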
-
VaccinItaly: monitoring Italian conversations around vaccines on Twitter and Facebook
Authors:
Francesco Pierri,
Andrea Tocchetti,
Lorenzo Corti,
Marco Di Giovanni,
Silvio Pavanetto,
Marco Brambilla,
Stefano Ceri
Abstract:
We present VaccinItaly, a project which monitors Italian online conversations around vaccines on Twitter and Facebook. We describe the ongoing data collection, which follows the SARS-CoV-2 vaccination campaign roll-out in Italy, and we provide public access to the data collected. We show results from a preliminary analysis of the spread of low- and high-credibility news shared alongside vaccine-related conversations on both social media platforms. We also investigate the content of the most popular YouTube videos and encounter several cases of harmful and misleading content about vaccines. Finally, we geolocate Twitter users who discuss vaccines and correlate their activity with open data statistics on vaccine uptake. We make up-to-date results available to the public through an interactive online dashboard associated with the project. The goal of our project is to gain further understanding of the interplay between the public discourse on online social media and the dynamics of vaccine uptake in the real world.
Submitted 4 May, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
-
EFSG: Evolutionary Fooling Sentences Generator
Authors:
Marco Di Giovanni,
Marco Brambilla
Abstract:
Large pre-trained language representation models (LMs) have recently achieved a large number of successes in many NLP tasks.
In 2018 BERT, and later its successors (e.g. RoBERTa), obtained state-of-the-art results in classical benchmark tasks, such as the GLUE benchmark.
After that, works about adversarial attacks have been published to test their generalization properties and robustness.
In this work, we design Evolutionary Fooling Sentences Generator (EFSG), a model- and task-agnostic adversarial attack algorithm built using an evolutionary approach to generate false-positive sentences for binary classification tasks.
We successfully apply EFSG to CoLA and MRPC tasks, on BERT and RoBERTa, comparing performances. Results prove the presence of weak spots in state-of-the-art LMs.
We finally test adversarial training as a data augmentation defence approach against EFSG, obtaining stronger models with no loss of accuracy when tested on the original datasets.
Submitted 12 October, 2020;
originally announced October 2020.
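The abstract above describes an evolutionary approach to generating sentences that a binary classifier wrongly accepts, without giving the algorithm's details. The sketch below is a generic evolutionary loop (selection, crossover, mutation) and should not be read as EFSG itself. The `toy_score` function is a hypothetical stand-in for a classifier's positive-class score, and all parameter names and defaults are assumptions.

```python
import random


def evolve(fitness, vocab, length=5, pop_size=20, generations=30,
           mutation_rate=0.2, seed=0):
    """Generic evolutionary search over token sequences (illustrative only)."""
    rng = random.Random(seed)
    # Start from random token sequences.
    population = [[rng.choice(vocab) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better-scoring half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: replace each token with a random one at a small rate.
            child = [rng.choice(vocab) if rng.random() < mutation_rate else tok
                     for tok in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_score(sentence):
    """Hypothetical classifier score: fraction of tokens equal to 'the'."""
    return sum(tok == "the" for tok in sentence) / len(sentence)


vocab = ["the", "cat", "sat", "on", "mat"]
best = evolve(toy_score, vocab)
```

In an actual attack, the fitness would be the target model's confidence that a candidate sentence belongs to the positive class, so high-scoring nonsensical sentences expose false positives.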
-
Information disorders on Italian Facebook during COVID-19 infodemic
Authors:
Alessandro Celestini,
Marco Di Giovanni,
Stefano Guarino,
Francesco Pierri
Abstract:
In this work we carry out an exploratory analysis of online conversations on the Italian Facebook during the recent COVID-19 pandemic. We analyze the circulation of controversial topics associated with the origin of the virus, which involve popular targets of misinformation, such as migrants and 5G technology. We collected over 1.5M posts in Italian related to COVID-19, shared by nearly 80k public pages and groups over a period of four months starting in January 2020. Overall, we find that potentially harmful content shared by unreliable sources is substantially negligible compared to traditional news websites, and that discussions of controversial topics have limited engagement compared with the pandemic in general. Besides, we highlight a "small-worldness" effect in the URL sharing diffusion network, indicating that users navigating through a limited set of pages could reach almost the entire pool of shared content related to the pandemic, thus being easily exposed to harmful propaganda as well as to verified information on the virus.
Submitted 22 July, 2020;
originally announced July 2020.
-
Physical Symmetries Embedded in Neural Networks
Authors:
M. Mattheakis,
P. Protopapas,
D. Sondak,
M. Di Giovanni,
E. Kaxiras
Abstract:
Neural networks are a central technique in machine learning. Recent years have seen a wave of interest in applying neural networks to physical systems for which the governing dynamics are known and expressed through differential equations. Two fundamental challenges facing the development of neural networks in physics applications are their lack of interpretability and their physics-agnostic design. The focus of the present work is to embed physical constraints into the structure of the neural network to address the second fundamental challenge. By constraining tunable parameters (such as weights and biases) and adding special layers to the network, the desired constraints are guaranteed to be satisfied without the need for explicit regularization terms. This is demonstrated on supervised and unsupervised networks for two basic symmetries: even/odd symmetry of a function and energy conservation. In the supervised case, the network with embedded constraints is shown to perform well on regression problems while simultaneously obeying the desired constraints, whereas a traditional network fits the data but violates the underlying constraints. Finally, a new unsupervised neural network is proposed that guarantees energy conservation through an embedded symplectic structure. The symplectic neural network is used to solve a system of energy-conserving differential equations and outperforms an unsupervised, non-symplectic neural network.
Submitted 29 January, 2020; v1 submitted 18 April, 2019;
originally announced April 2019.
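The abstract above states that even/odd symmetry can be guaranteed by the network's structure rather than by a regularization term. The sketch below shows one simple way to get such a guarantee, by symmetrizing the output of an arbitrary function. Whether this matches the authors' construction is not confirmed by the abstract, and `tiny_net` with its fixed weights is a hypothetical stand-in for a trained network.

```python
import math


def make_even(g):
    """Wrap g so that the result satisfies f(x) == f(-x) for every x."""
    return lambda x: 0.5 * (g(x) + g(-x))


def make_odd(g):
    """Wrap g so that the result satisfies f(x) == -f(-x) for every x."""
    return lambda x: 0.5 * (g(x) - g(-x))


def tiny_net(x, w1=0.7, b1=0.3, w2=1.5, b2=-0.2):
    """Hypothetical one-neuron network; on its own it is neither even nor odd."""
    return w2 * math.tanh(w1 * x + b1) + b2


even_net = make_even(tiny_net)
odd_net = make_odd(tiny_net)
```

Because the symmetry holds for any choice of weights, training can change how well the wrapped network fits the data but cannot break the constraint.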