
Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil

Authors: Marcelo Sartori Locatelli, Matheus Prado Miranda, Igor Joaquim da Silva Costa, Matheus Torres Prates, Victor Thomé, Mateus Zaparoli Monteiro, Tomas Lacerda, Adriana Pagano, Eduardo Rios Neto, Wagner Meira Jr., Virgilio Almeida

Abstract: The Exame Nacional do Ensino Médio (ENEM) is a pivotal test for Brazilian students, required for admission to a significant number of universities in Brazil. The test consists of four objective high-school level tests on Math, Humanities, Natural Sciences and Languages, and one written essay. Students' answers to the test and to the accompanying socioeconomic status questionnaire are made public every year (albeit anonymized) due to transparency policies from the Brazilian Government. In the context of large language models (LLMs), these data lend themselves nicely to comparing different groups of humans with AI, as we can have access to human and machine answer distributions. We leverage these characteristics of the ENEM dataset and compare GPT-3.5 and 4, and MariTalk, a model trained using Portuguese data, to humans, aiming to ascertain how their answers relate to real societal groups and what that may reveal about the model biases. We divide the human groups by using socioeconomic status (SES), and compare their answer distribution with LLMs for each question and for the essay. We find no significant biases when comparing LLM performance to humans on the multiple-choice Brazilian Portuguese tests, as the distance between model and human answers is mostly determined by the human accuracy. A similar conclusion is found by looking at the generated text as, when analyzing the essays, we observe that human and LLM essays differ in a few key factors, one being the choice of words where model essays were easily separable from human ones. The texts also differ syntactically, with LLM-generated essays exhibiting, on average, shorter sentences and fewer thought units, among other differences. These results suggest that, for Brazilian Portuguese in the ENEM context, LLM outputs represent no group of humans, being significantly different from the answers from Brazilian students across all tests.

Submitted 9 August, 2024; originally announced August 2024.

Comments: Accepted at the Seventh AAAI/ACM Conference on AI, Ethics and Society (AIES 2024). 14 pages, 4 figures

arXiv:2312.11326 [pdf, other]

doi 10.1609/icwsm.v18i1.31366

Topic Shifts as a Proxy for Assessing Politicization in Social Media

Authors: Marcelo Sartori Locatelli, Pedro Calais, Matheus Prado Miranda, João Pedro Junho, Tomas Lacerda Muniz, Wagner Meira Jr., Virgilio Almeida

Abstract: Politicization is a social phenomenon studied by political science characterized by the extent to which ideas and facts are given a political tone. A range of topics, such as climate change, religion and vaccines, has been subject to increasing politicization in the media and social media platforms. In this work, we propose a computational method for assessing politicization in online conversations based on topic shifts, i.e., the degree to which people switch topics in online conversations. The intuition is that topic shifts from a non-political topic to politics are a direct measure of politicization (making something political), and that the more people switch conversations to politics, the more they perceive politics as playing a vital role in their daily lives. A fundamental challenge that must be addressed when one studies politicization in social media is that, a priori, any topic may be politicized. Hence, any keyword-based method or even machine learning approaches that rely on topic labels to classify topics are expensive to run and potentially ineffective. Instead, we learn from a seed of political keywords and use Positive-Unlabeled (PU) Learning to detect political comments in reaction to non-political news articles posted on Twitter, YouTube, and TikTok during the 2022 Brazilian presidential elections. Our findings indicate that all platforms show evidence of politicization, as discussion around topics adjacent to politics such as economy, crime and drugs tends to shift to politics. Even for the least politicized topics, the rate at which discussion shifts to politics increased in the lead-up to the elections and after other political events in Brazil, which is evidence of politicization.

Submitted 13 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: 12 pages, 6 figures, accepted for the 18th AAAI International Conference on Web and Social Media (ICWSM-2024)

Journal ref: Topic Shifts as a Proxy for Assessing Politicization in Social Media. In: Proceedings of the International AAAI Conference on Web and Social Media. 2024. p. 972-984

arXiv:2208.01509 [pdf, other]

doi 10.1145/3511095.3531283

Characterizing Vaccination Movements on YouTube in the United States and Brazil

Authors: Marcelo Sartori Locatelli, Josemar Caetano, Wagner Meira Jr., Virgilio Almeida

Abstract: In the context of the COVID-19 pandemic, social networks such as Twitter and YouTube stand out as important sources of information. YouTube, as the largest and most engaging online media consumption platform, has a large influence in the spread of information and misinformation, which makes it important to study how it deals with the problems that arise from disinformation, as well as how its users interact with different types of content. Considering that the United States (USA) and Brazil (BR) are two countries with the highest COVID-19 death tolls, we asked the following question: What are the nuances of vaccination campaigns in the two countries? With that in mind, we engage in a comparative analysis of pro and anti-vaccine movements on YouTube. We also investigate the role of YouTube in countering online vaccine misinformation in USA and BR. To this end, we monitored the removal of vaccine related content on the platform and also applied various techniques to analyze the differences in discourse and engagement in pro and anti-vaccine "comment sections". We found that American anti-vaccine content tends to lead to considerably more toxic and negative discussion than its pro-vaccine counterpart while also leading to 18% higher user-user engagement, while Brazilian anti-vaccine content was significantly less engaging. We also found that anti-vaccine and pro-vaccine discourses are considerably different, as the former is associated with conspiracy theories (e.g. ccp), misinformation and alternative medicine (e.g. hydroxychloroquine), while the latter is associated with protective measures. Finally, it was observed that YouTube content removals are still insufficient, with only approximately 16% of the anti-vaccine content being removed by the end of the studied period, with the USA registering the highest percentage of removed anti-vaccine content (34%) and BR registering the lowest (9.8%).

Submitted 2 August, 2022; originally announced August 2022.

Comments: Accepted at ACM HT 2022, 15 pages, 7 figures

Journal ref: Proceedings of the 33rd ACM Conference on Hypertext and Social Media. 2022. p. 80-90

arXiv:2105.07523 [pdf, other]

Analyzing the "Sleeping Giants" Activism Model in Brazil

Authors: Bárbara Gomes Ribeiro, Manoel Horta Ribeiro, Virgílio Almeida, Wagner Meira Jr

Abstract: In 2020, amidst the COVID pandemic and a polarized political climate, the Sleeping Giants online activist movement gained traction in Brazil. Its rationale was simple: to curb the spread of misinformation by harming the advertising revenue of sources that produce this type of content. Like its international counterparts, Sleeping Giants Brasil (SGB) campaigned against media outlets using Twitter to ask companies to remove ads from the targeted outlets. This work presents a thorough quantitative characterization of this activism model, analyzing the three campaigns carried out by SGB between May and September 2020. To do so, we use digital traces from both Twitter and Google Trends, toxicity and sentiment classifiers trained for the Portuguese language, and an annotated corpus of SGB's tweets. Our key findings were threefold. First, we found that SGB's requests to companies were largely successful (with 83.85% of all 192 targeted companies responding positively) and that user pressure was correlated to the speed of companies' responses. Second, there were no significant changes in the online attention and the user engagement going towards the targeted media outlets in the six months that followed SGB's campaign (as measured by Google Trends and Twitter engagement). Third, we observed that user interactions with companies changed only transiently, even if the companies did not respond to SGB's request. Overall, our results paint a nuanced portrait of internet activism. On the one hand, they suggest that SGB was successful in getting companies to boycott specific media outlets, which may have harmed their advertisement revenue stream. On the other hand, they also suggest that the activist movement did not impact the online attention these media outlets received nor the online image of companies that did not respond positively to their requests.

Submitted 25 February, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

arXiv:2104.04571 [pdf, other]

doi 10.1007/s00158-021-03066-z

Finite Variation Sensitivity Analysis for Discrete Topology Optimization of Continuum Structures

Authors: Daniel Candeloro Cunha, Breno Vincenzo de Almeida, Heitor Nigro Lopes, Renato Pavanello

Abstract: This paper proposes two novel approaches to perform more suitable sensitivity analyses for discrete topology optimization methods. To properly support them, we introduce a more formal description of the Bi-directional Evolutionary Structural Optimization (BESO) method, in which the sensitivity analysis is based on finite variations of the objective function. The proposed approaches are compared to a naive strategy; to the conventional strategy, referred to as First-Order Continuous Interpolation (FOCI) approach; and to a strategy previously developed by other researchers, referred to as High-Order Continuous Interpolation (HOCI) approach. The novel Woodbury approach provides exact sensitivity values and is a better alternative to HOCI. Although HOCI and Woodbury approaches may be computationally prohibitive, they provide useful expressions for a better understanding of the problem. The novel Conjugate Gradient Method (CGM) approach provides sensitivity values with arbitrary precision and is computationally viable for a small number of steps. The CGM approach is a better alternative to FOCI since, for appropriate initial conditions, it is always more accurate than the conventional strategy. The standard compliance minimization problem with volume constraint is considered to illustrate the methodology. Numerical examples are presented together with a broad discussion about BESO-type methods.

Submitted 17 May, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: 31 pages, 25 figures, submitted to Structural and Multidisciplinary Optimization

arXiv:2007.04361 [pdf, other]

Understanding the impact of the alphabetical ordering of names in user interfaces: a gender bias analysis

Authors: Daniel Sullivan, Carlos Caminha, Victor Dantas, Elizabeth Furtado, Vasco Furtado, Virgílio Almeida

Abstract: Listing people alphabetically on an electronic output device is a traditional technique, since alphabetical order is easily perceived by users and facilitates access to information. However, this apparently harmless technique, especially when the list is ordered by first name, needs to be used with caution by designers and programmers. We show, via empirical data analysis, that when an interface displays people's first names in alphabetical order across several pages/screens, each page/screen may have imbalances with respect to the gender of its Top-k individuals. Here, k represents the size of the list of names visualized first, which may be the number of names that fit on a screen page of a certain device. The research work was carried out with the analysis of actual datasets of names from five different countries. Each dataset has a person name and the frequency of adoption of the name in the country. Our analysis shows that, even though all countries exhibit imbalance problems, the samples of individuals with Brazilian and Spanish first names are more prone to gender imbalance among their Top-k individuals. These results can be useful for designers and engineers to construct information systems that avoid gender bias induction.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2002.05869 [pdf]

DSCEP: An Infrastructure for Distributed Semantic Complex Event Processing

Authors: Vitor Pinheiro de Almeida, Sukanya Bhowmik, Markus Endler, Kurt Rothermel

Abstract: Today most applications continuously produce information in the form of streams, due to the advent of the means of collecting data. Sensors and social networks collect an immense variety and volume of data, from different real-life situations and at a considerable velocity. Increasingly, applications require processing of heterogeneous data streams from different sources together with large background knowledge. Using only the information on the data stream is not enough for many use cases. Semantic Complex Event Processing (CEP) systems have evolved from the classical rule-based CEP systems, by integrating high-level knowledge representation and RDF stream processing using both the data stream and background static knowledge. Additionally, CEP approaches lack the capability to semantically interpret and analyze data, which Semantic CEP (SCEP) attempts to address. SCEP has several limitations; one of them is related to its high processing time. This paper provides a conceptual model and an implementation of an infrastructure for distributed SCEP, where each SCEP operator can process part of the data and send it to other SCEP operators in order to achieve some answer. We show that by splitting the RDF stream processing and the background knowledge using the concept of SCEP operators, it is possible to considerably reduce processing time.

Submitted 13 February, 2020; originally announced February 2020.

Comments: 9 pages

arXiv:1908.08313 [pdf, other]

Auditing Radicalization Pathways on YouTube

Authors: Manoel Horta Ribeiro, Raphael Ottoni, Robert West, Virgílio A. F. Almeida, Wagner Meira

Abstract: Non-profits, as well as the media, have hypothesized the existence of a radicalization pipeline on YouTube, claiming that users systematically progress towards more extreme content on the platform. Yet, there is to date no substantial quantitative evidence of this alleged pipeline. To close this gap, we conduct a large-scale audit of user radicalization on YouTube. We analyze 330,925 videos posted on 349 channels, which we broadly classified into four types: Media, the Alt-lite, the Intellectual Dark Web (I.D.W.), and the Alt-right. According to the aforementioned radicalization hypothesis, channels in the I.D.W. and the Alt-lite serve as gateways to fringe far-right ideology, here represented by Alt-right channels. Processing 72M+ comments, we show that the three channel types indeed increasingly share the same user base; that users consistently migrate from milder to more extreme content; and that a large percentage of users who consume Alt-right content now consumed Alt-lite and I.D.W. content in the past. We also probe YouTube's recommendation algorithm, looking at more than 2M video and channel recommendations between May and July 2019. We find that Alt-lite content is easily reachable from I.D.W. channels, while Alt-right videos are reachable only through channel recommendations. Overall, we paint a comprehensive picture of user radicalization on YouTube.

Submitted 21 October, 2021; v1 submitted 22 August, 2019; originally announced August 2019.

Comments: 10 pages plus appendices

arXiv:1905.00825 [pdf, other]

doi 10.1145/3292522.3326018

Characterizing Attention Cascades in WhatsApp Groups

Authors: Josemar Alves Caetano, Gabriel Magno, Marcos Gonçalves, Jussara Almeida, Humberto T. Marques-Neto, Virgílio Almeida

Abstract: An important political and social phenomenon discussed in several countries, like India and Brazil, is the use of WhatsApp to spread false or misleading content. However, little is known about the information dissemination process in WhatsApp groups. Attention affects the dissemination of information in WhatsApp groups, determining what topics or subjects are more attractive to participants of a group. In this paper, we characterize and analyze how attention propagates among the participants of a WhatsApp group. An attention cascade begins when a user asserts a topic in a message to the group, which could include written text, photos, or links to articles online. Others then propagate the information by responding to it. We analyzed attention cascades in more than 1.7 million messages posted in 120 groups over one year. Our analysis focused on the structural and temporal evolution of attention cascades as well as on the behavior of users that participate in them. We found specific characteristics in cascades associated with groups that discuss political subjects and false information. For instance, we observe that cascades with false information tend to be deeper, reach more users, and last longer in political groups than in non-political groups.

Submitted 3 May, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: Accepted as a full paper at the 11th International ACM Web Science Conference (WebSci 2019). Please cite the WebSci version

arXiv:1808.05927 [pdf, other]

Characterizing the public perception of WhatsApp through the lens of media

Authors: Josemar Alves Caetano, Gabriel Magno, Evandro Cunha, Wagner Meira Jr., Humberto T. Marques-Neto, Virgilio Almeida

Abstract: WhatsApp is, as of 2018, a significant component of the global information and communication infrastructure, especially in developing countries. However, probably due to its strong end-to-end encryption, WhatsApp became an attractive place for the dissemination of misinformation, extremism and other forms of undesirable behavior. In this paper, we investigate the public perception of WhatsApp through the lens of media. We analyze two large datasets of news and show the kind of content that is being associated with WhatsApp in different regions of the world and over time. Our analyses include the examination of named entities, general vocabulary, and topics addressed in news articles that mention WhatsApp, as well as the polarity of these texts. Among other results, we demonstrate that the vocabulary and topics around the term "whatsapp" in the media have been changing over the years and in 2018 concentrate on matters related to misinformation, politics and criminal scams. More generally, our findings are useful to understand the impact that tools like WhatsApp have on contemporary society and how they are seen by the communities themselves.

Submitted 17 August, 2018; originally announced August 2018.

Comments: Accepted as a full paper at the 2nd International Workshop on Rumours and Deception in Social Media (RDSM 2018), co-located with CIKM 2018 in Turin. Please cite the RDSM version

arXiv:1807.06926 [pdf, other]

Fake news as we feel it: perception and conceptualization of the term "fake news" in the media

Authors: Evandro Cunha, Gabriel Magno, Josemar Caetano, Douglas Teixeira, Virgilio Almeida

Abstract: In this article, we quantitatively analyze how the term "fake news" has been shaped in news media in recent years. We study the perception and the conceptualization of this term in the traditional media using eight years of data collected from news outlets based in 20 countries. Our results not only corroborate previous indications of a sharp increase in the usage of the expression "fake news", but also show contextual changes around this expression after the United States presidential election of 2016. Among other results, we found changes in the related vocabulary, in the mentioned entities, in the surrounding topics and in the contextual polarity around the term "fake news", suggesting that this expression underwent a change in perception and conceptualization after 2016. These outcomes expand the understanding of the usage of the term "fake news", helping to comprehend and more accurately characterize this relevant social phenomenon linked to misinformation and manipulation.

Submitted 18 July, 2018; originally announced July 2018.

Comments: Accepted as a full paper at the 10th International Conference on Social Informatics (SocInfo 2018). Please cite the SocInfo version

arXiv:1804.04096 [pdf, other]

doi 10.1145/3201064.3201081

Analyzing Right-wing YouTube Channels: Hate, Violence and Discrimination

Authors: Raphael Ottoni, Evandro Cunha, Gabriel Magno, Pedro Bernadina, Wagner Meira Jr, Virgilio Almeida

Abstract: As of 2018, YouTube, the major online video sharing website, hosts multiple channels promoting right-wing content. In this paper, we observe issues related to hate, violence and discriminatory bias in a dataset containing more than 7,000 videos and 17 million comments. We investigate similarities and differences between users' comments and video content in a selection of right-wing channels and compare them to a baseline set using a three-layered approach, in which we analyze (a) lexicon, (b) topics and (c) implicit biases present in the texts. Among other results, our analyses show that right-wing channels tend to (a) contain a higher degree of words from "negative" semantic fields, (b) raise more topics related to war and terrorism, and (c) demonstrate more discriminatory bias against Muslims (in videos) and towards LGBT people (in comments). Our findings shed light not only on the collective conduct of the YouTube community promoting and consuming right-wing content, but also on the general behavior of YouTube users.

Submitted 11 April, 2018; originally announced April 2018.

Comments: In Proceedings of the 10th ACM Conference on Web Science

arXiv:1804.00397 [pdf, other]

Analyzing and characterizing political discussions in WhatsApp public groups

Authors: Josemar Alves Caetano, Jaqueline Faria de Oliveira, Helder Seixas Lima, Humberto T. Marques-Neto, Gabriel Magno, Wagner Meira Jr, Virgílio A. F. Almeida

Abstract: We present a thorough characterization of what we believe to be the first significant analysis of the behavior of groups in WhatsApp in the scientific literature. Our characterization of over 270,000 messages and about 7,000 users spanning a 28-day period is done at three different layers. The message layer focuses on individual messages, each of which is the result of specific posts performed by a user. The user layer characterizes the user actions while interacting with a group. The group layer characterizes the aggregate message patterns of all users that participate in a group. We analyze 81 public groups in WhatsApp and classify them into two categories, political and non-political groups, according to keywords associated with each group. Our contributions are two-fold. First, we introduce a framework and a number of metrics to characterize the behavior of communication groups in mobile messaging systems such as WhatsApp. Second, our analysis underscores a Zipf-like profile for user messages in political groups. Also, our analysis reveals that WhatsApp messages are multimedia, with a combination of different forms of content. Multimedia content (i.e., audio, image, and video) and emojis are present in 20% and 11.2% of all messages respectively. Political groups use more text messages than non-political groups. In addition, we characterize novel features that represent the behavior of a public group, with multiple conversational turns between key members, with the participation of other members of the group.

Submitted 2 April, 2018; originally announced April 2018.

Comments: 10 pages, 12 figures

arXiv:1803.08977 [pdf, other]

Characterizing and Detecting Hateful Users on Twitter

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: Most current approaches to characterize and detect hate speech focus on content posted in Online Social Networks (OSNs). They face shortcomings in collecting and annotating hateful speech due to the incompleteness and noisiness of OSN text and the subjectivity of hate speech. These limitations are often handled with constraints that oversimplify the problem, such as considering only tweets containing hate-related words. In this work we partially address these issues by shifting the focus towards users. We develop and employ a robust methodology to collect and annotate hateful users which does not depend directly on a lexicon and where the users are annotated given their entire profile. This results in a sample of Twitter's retweet graph containing 100,386 users, out of which 4,972 were annotated. We also collect the users who were banned in the three months that followed the data collection. We show that hateful users differ from normal ones in terms of their activity patterns, word usage, and network structure. We obtain similar results comparing the neighbors of hateful vs. neighbors of normal users and also suspended users vs. active users, increasing the robustness of our analysis. We observe that hateful users are densely connected, and thus formulate the hate speech detection problem as a task of semi-supervised learning over a graph, exploiting the network of connections on Twitter. We find that a node embedding algorithm, which exploits the graph structure, outperforms content-based approaches for the detection of both hateful (95% AUC vs 88% AUC) and suspended users (93% AUC vs 88% AUC). Altogether, we present a user-centric view of hate speech, paving the way for better detection and understanding of this relevant and challenging issue.

Submitted 23 March, 2018; originally announced March 2018.

Comments: This is an extended version of the homonymous short paper to be presented at ICWSM-18. arXiv admin note: text overlap with arXiv:1801.00317

arXiv:1801.00317 [pdf, other]

"Like Sheep Among Wolves": Characterizing Hateful Users on Twitter

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: Hateful speech in Online Social Networks (OSNs) is a key challenge for companies and governments, as it impacts users and advertisers, and as several countries have strict legislation against the practice. This has motivated work on detecting and characterizing the phenomenon in tweets, social media posts and comments. However, these approaches face several shortcomings due to the noisiness of OSN data, the sparsity of the phenomenon, and the subjectivity of the definition of hate speech. This work presents a user-centric view of hate speech, paving the way for better detection methods and understanding. We collect a Twitter dataset of $100,386$ users along with up to $200$ tweets from their timelines with a random-walk-based crawler on the retweet graph, and select a subsample of $4,972$ to be manually annotated as hateful or not through crowdsourcing. We examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph. Our results show that hateful users have more recent account creation dates, and more statuses and followees per day. Additionally, they favorite more tweets, tweet in shorter intervals and are more central in the retweet network, contradicting the "lone wolf" stereotype often associated with such behavior. Hateful users are more negative, more profane, and use fewer words associated with topics such as hate, terrorism, violence and anger. We also identify similarities between hateful/normal users and their 1-neighborhood, suggesting strong homophily.

Submitted 14 January, 2018; v1 submitted 31 December, 2017; originally announced January 2018.

Comments: 8 pages, 11 figures, to be presented at MIS2 Workshop @ WSDM'18

arXiv:1707.00971 [pdf, other]

Characterizing videos, audience and advertising in Youtube channels for kids

Authors: Camila Souza Araujo, Gabriel Magno, Wagner Meira Jr, Virgilio Almeida, Pedro Hartung, Danilo Doneda

Abstract: Online video services, messaging systems, games and social media services are tremendously popular among young people and children in many countries. Most of the digital services offered on the internet are advertising funded, which makes advertising ubiquitous in children's everyday life. To understand the impact of advertising-based digital services on children, we study the collective behavior of users of YouTube for kids channels and present the demographics of a large number of users. We collected data from 12,848 videos from 17 channels in US and UK and 24 channels in Brazil. The channels in English have been viewed more than 37 billion times. We also collected more than 14 million comments made by users. Based on a combination of text-analysis and face recognition tools, we show the presence of racial and gender biases in our large sample of users. We also identify children actively using YouTube, although the minimum age for using the service is 13 years in most countries. We provide comparisons of user behavior among the three countries, which represent large user populations in the global North and the global South.

Submitted 4 July, 2017; originally announced July 2017.

arXiv:1706.05924 [pdf, other]

"Everything I Disagree With is #FakeNews": Correlating Political Polarization and Spread of Misinformation

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: An important challenge in the process of tracking and detecting the dissemination of misinformation is to understand the political gap between people that engage with the so-called "fake news". A possible factor responsible for this gap is opinion polarization, which may prompt the general public to classify content that they disagree with or want to discredit as fake. In this work, we study the relationship between political polarization and content reported by Twitter users as related to "fake news". We investigate how polarization may create distinct narratives on what misinformation actually is. We perform our study based on two datasets collected from Twitter. The first dataset contains tweets about US politics in general, from which we compute the degree of polarization of each user towards the Republican and Democratic Party. In the second dataset, we collect tweets and URLs that co-occurred with "fake news" related keywords and hashtags, such as #FakeNews and #AlternativeFact, as well as reactions towards such tweets and URLs. We then analyze the relationship between polarization and what is perceived as misinformation, and whether users are designating information that they disagree with as fake. Our results show an increase in the polarization of users and URLs associated with fake-news keywords and hashtags, when compared to information not labeled as "fake news". We discuss the impact of our findings on the challenges of tracking "fake news" in the ongoing battle against misinformation.

Submitted 17 July, 2017; v1 submitted 19 June, 2017; originally announced June 2017.

Comments: 8 pages, 10 figures, to be presented at DS+J Workshop @ KDD'17

arXiv:1612.05218 [pdf, other]

How Do App Stores Challenge the Global Internet Governance Ecosystem?

Authors: Virgilio A. F. Almeida, Danilo Doneda, Carolina Rossini

Abstract: App stores challenge the culture of openness and resistance to central authorities cultivated by the pioneers of the Internet. Could multistakeholder governance bodies bring more inclusivity into the global cyberspace governance ecosystem?

Submitted 15 December, 2016; originally announced December 2016.

arXiv:1609.05413 [pdf, other]

Stereotypes in Search Engine Results: Understanding The Role of Local and Global Factors

Authors: Gabriel Magno, Camila Souza Araújo, Wagner Meira Jr., Virgilio Almeida

Abstract: The internet has been blurring the lines between local and global cultures, affecting in different ways the perception of people about themselves and others. In the global context of the internet, search engine platforms are a key mediator between individuals and information. In this paper, we examine the local and global impact of the internet on the formation of female physical attractiveness stereotypes in search engine results. By investigating datasets of images collected from two major search engines in 42 countries, we identify a significant fraction of replicated images. We find that common images are clustered around countries with the same language. We also show that existence of common images among countries is practically eliminated when the queries are limited to local sites. In summary, we show evidence that results from search engines are biased towards the language used to query the system, which leads to certain attractiveness stereotypes that are often quite different from the majority of the female population of the country.

Submitted 7 November, 2016; v1 submitted 17 September, 2016; originally announced September 2016.

arXiv:1608.02499 [pdf, other]

Identifying Stereotypes in the Online Perception of Physical Attractiveness

Authors: Camila Souza Araújo, Wagner Meira Jr., Virgilio Almeida

Abstract: Stereotyping can be viewed as oversimplified ideas about social groups. They can be positive, neutral or negative. The main goal of this paper is to identify stereotypes for female physical attractiveness in images available in the Web. We look at the search engines as possible sources of stereotypes. We conducted experiments on Google and Bing by querying the search engines for beautiful and ugly women. We then collect images and extract information of faces. We propose a methodology and apply it to analyze photos gathered from search engines to understand how race and age manifest in the observed stereotypes and how they vary according to countries and regions. Our findings demonstrate the existence of stereotypes for female physical attractiveness, in particular negative stereotypes about black women and positive stereotypes about white women in terms of beauty. We also found negative stereotypes associated with older women in terms of physical attractiveness. Finally, we have identified patterns of stereotypes that are common to groups of countries.

Submitted 8 August, 2016; originally announced August 2016.

arXiv:1510.05700 [pdf, other]

Dawn of the Selfie Era: The Whos, Wheres, and Hows of Selfies on Instagram

Authors: Flávio Souza, Diego de Las Casas, Vinícius Flores, SunBum Youn, Meeyoung Cha, Daniele Quercia, Virgílio Almeida

Abstract: Online interactions are increasingly involving images, especially those containing human faces, which are naturally attention grabbing and more effective at conveying feelings than text. To understand this new convention of digital culture, we study the collective behavior of sharing selfies on Instagram and present how people appear in selfies and which patterns emerge from such interactions. Analysis of millions of photos shows that the amount of selfies has increased by 900 times from 2012 to 2014. Selfies are an effective medium to grab attention; they generate on average 1.1--3.2 times more likes and comments than other types of content on Instagram. Compared to other content, interactions involving selfies exhibit variations in homophily scores (in terms of age and gender) that suggest they are becoming more widespread. Their style also varies by cultural boundaries in that the average age and majority gender seen in selfies differ from one country to another. We provide explanations of such country-wise variations based on cultural and socioeconomic contexts.

Submitted 19 October, 2015; originally announced October 2015.

Comments: ACM Conference on Online Social Networks 2015, Stanford University, California, USA

ACM Class: J.4; H.3.5

arXiv:1301.6932 [pdf, other]

Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks

Authors: Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, Virgilio Almeida

Abstract: Owing to the popularity of Online Social Media (OSM), Internet users share a lot of information (including personal) on and across OSM services every day. For example, it is common to find a YouTube video embedded in a blog post with an option to share the link on Facebook. Users recommend, comment, and forward information they receive from friends, contributing in spreading the information in and across OSM services. We term this information diffusion process from one OSM service to another as Cross-Pollination, and the network formed by users who participate in Cross-Pollination and content produced in the network as \emph{Cross-Pollinated network}. Research has been done about information diffusion within one OSM service, but little is known about Cross-Pollination. In this paper, we aim at filling this gap by studying how information (video, photo, location) from three popular OSM services (YouTube, Flickr and Foursquare) diffuses on Twitter, the most popular microblogging service. Our results show that Cross-Pollinated networks follow temporal and topological characteristics of the diffusion OSM (Twitter in our study). Furthermore, popularity of information on source OSM (YouTube, Flickr and Foursquare) does not imply its popularity on Twitter. Our results also show that Cross-Pollination helps Twitter in terms of traffic generation and user involvement, but only a small fraction of videos and photos gain a significant number of views from Twitter. We believe this is the first research work which explicitly characterizes the diffusion of information across different OSM services.

Submitted 29 January, 2013; originally announced January 2013.

Comments: This report has been published in SocialCom PASSAT 2011 as a six page short paper

arXiv:1301.6870 [pdf, other]

Studying User Footprints in Different Online Social Networks

Authors: Anshu Malhotra, Luam Totti, Wagner Meira Jr., Ponnurangam Kumaraguru, Virgilio Almeida

Abstract: With the growing popularity and usage of online social media services, people now have accounts (sometimes several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user using these social media services. Generating such digital footprints can be very useful for personalization, profile management, and detecting malicious behavior of users. A very important application of analyzing users' online digital footprints is to protect users from potential privacy and security risks arising from the huge publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context specific techniques (e.g. Jaro Winkler similarity, Wordnet based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for disambiguating profiles belonging to the same user from different social networks. UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.

Submitted 29 January, 2013; originally announced January 2013.

Comments: The paper is already published in ASONAM 2012

arXiv:1006.5059 [pdf, ps, other]

Capacity Planning for Vertical Search Engines

Authors: Claudine Badue, Jussara Almeida, Virgilio Almeida, Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Artur Ziviani, Nivio Ziviani

Abstract: Vertical search engines focus on specific slices of content, such as the Web of a single country or the document collection of a large corporation. Despite this, like general open web search engines, they are expensive to maintain, expensive to operate, and hard to design. Because of this, predicting the response time of a vertical search engine is usually done empirically through experimentation, requiring a costly setup. An alternative is to develop a model of the search engine for predicting performance. However, this alternative is of interest only if its predictions are accurate. In this paper we propose a methodology for analyzing the performance of vertical search engines. Applying the proposed methodology, we present a capacity planning model based on a queueing network for search engines with a scale typically suitable for the needs of large corporations. The model is simple and yet reasonably accurate and, in contrast to previous work, considers the imbalance in query service times among homogeneous index servers. We discuss how we tune up the model and how we apply it to predict the impact on the query response time when parameters such as CPU and disk capacities are changed. This allows a manager of a vertical search engine to determine a priori whether a new configuration of the system might keep the query response under specified performance constraints.

Submitted 25 June, 2010; originally announced June 2010.

arXiv:0804.4865 [pdf, ps, other]

Characterizing Video Responses in Social Networks

Authors: Fabricio Benevenuto, Fernando Duarte, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Keith Ross

Abstract: Video sharing sites, such as YouTube, use video responses to enhance the social interactions among their users. The video response feature allows users to interact and converse through video, by creating a video sequence that begins with an opening video and is followed by video responses from other users. Our characterization is over 3.4 million videos and 400,000 video responses collected from YouTube during a 7-day period. We first analyze the characteristics of the video responses, such as popularity, duration, and geography. We then examine the social networks that emerge from the video response interactions.

Submitted 30 April, 2008; originally announced April 2008.

ACM Class: J.4; H.3.5

arXiv:cs/0504012 [pdf, ps, other]

Improving Spam Detection Based on Structural Similarity

Authors: Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida, Luis M. A. Bettencourt, Virgilio A. F. Almeida, Jussara M. Almeida

Abstract: We propose a new detection algorithm that uses structural relationships between senders and recipients of email as the basis for the identification of spam messages. Users and receivers are represented as vectors in their reciprocal spaces. A measure of similarity between vectors is constructed and used to group users into clusters. Knowledge of their classification as past senders/receivers of spam or legitimate mail, coming from an auxiliary detection algorithm, is then used to label these clusters probabilistically. This knowledge comes from an auxiliary algorithm. The measure of similarity between the sender and receiver sets of a new message to the center vector of clusters is then used to assess the possibility of that message being legitimate or spam. We show that the proposed algorithm is able to correct part of the false positives (legitimate messages classified as spam) using a testbed of one week SMTP log.

Submitted 5 April, 2005; originally announced April 2005.

arXiv:cs/0212045 [pdf, ps, other]

Local Community Identification through User Access Patterns

Authors: Rodrigo B. Almeida, Virgilio A. F. Almeida

Abstract: Community identification algorithms have been used to enhance the quality of the services perceived by its users. Although algorithms for community have a widespread use in the Web, their application to portals or specific subsets of the Web has not been much studied. In this paper, we propose a technique for local community identification that takes into account user access behavior derived from access logs of servers in the Web. The technique takes a departure from the existing community algorithms since it changes the focus of interest, moving from authors to users. Our approach does not use relations imposed by authors (e.g. hyperlinks in the case of Web pages). It uses information derived from user accesses to a service in order to infer relationships. The communities identified are of great interest to content providers since they can be used to improve quality of their services. We also propose an evaluation methodology for analyzing the results obtained by the algorithm. We present two case studies based on actual data from two services: an online bookstore and an online radio. The case of the online radio is particularly relevant, because it emphasizes the contribution of the proposed algorithm to find out communities in an environment (i.e., streaming media service) without links, that represent the relations imposed by authors (e.g. hyperlinks in the case of Web pages).

Submitted 16 December, 2002; originally announced December 2002.

Comments: 11 pages, 2 figures, 2 tables, submitted to WWW2003 for evaluation

ACM Class: I.5.3; H.1.2; J.4

Showing 1–27 of 27 results for author: Almeida, V