Showing 1–2 of 2 results for author: Nunes, M d G V

Search v0.5.6 released 2020-02-24

arXiv:1712.08917 [pdf, ps, other]

cs.CL

Building a Sentiment Corpus of Tweets in Brazilian Portuguese

Authors: Henrico Bertini Brum, Maria das Graças Volpe Nunes

Abstract: The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area due to its subjective and semantic characteristics motivates research on novel methods and approaches for classification. Hence, there is a high demand for datasets on different domains and different languages. This paper introduces TweetSentBR, a sentiment corpus for Brazilian Portuguese manually annotated with 15,000 sentences on the TV show domain. The sentences were labeled in three classes (positive, neutral and negative) by seven annotators, following literature guidelines for ensuring reliability of the annotation. We also ran baseline experiments on polarity classification using three machine learning methods, reaching 80.99% F-Measure and 82.06% accuracy in binary classification, and 59.85% F-Measure and 64.62% accuracy in three-point classification.

Submitted 24 December, 2017; originally announced December 2017.

Comments: Accepted for publication in 11th International Conference on Language Resources and Evaluation (LREC 2018)

arXiv:1704.02963 [pdf, other]

cs.CL cs.AI

Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Authors: Thales Felipe Costa Bertaglia, Maria das Graças Volpe Nunes

Abstract: Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embeddings). It generates continuous numeric vectors of high dimensionality to represent words. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships. Words that share semantic similarity are represented by similar vectors. Based on these features, we present a totally unsupervised, expandable, and language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach obtains a high correction rate of orthographic errors and internet slang in product reviews, outperforming the currently available tools for Brazilian Portuguese.

Submitted 10 April, 2017; originally announced April 2017.

Comments: Published in Proceedings of the 2nd Workshop on Noisy User-generated Text, 9 pages
