-
Systematic Task Exploration with LLMs: A Study in Citation Text Generation
Authors:
Furkan Şahinuç,
Ilia Kuznetsov,
Yufang Hou,
Iryna Gurevych
Abstract:
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component re…
▽ More
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection
Authors:
Cagri Toraman,
Oguzhan Ozcelik,
Furkan Şahinuç,
Fazli Can
Abstract:
The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation…
▽ More
The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.
△ Less
Submitted 11 July, 2024; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers
Authors:
Nurullah Sevim,
Ege Ozan Özyedek,
Furkan Şahinuç,
Aykut Koç
Abstract:
Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other areas. Although the attention mechanism enhances the model performances significantly, its quadratic complexity prevents efficient processing of long sequences. Re…
▽ More
Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other areas. Although the attention mechanism enhances the model performances significantly, its quadratic complexity prevents efficient processing of long sequences. Recent works focused on eliminating the disadvantages of computational inefficiency and showed that transformer-based models can still reach competitive results without the attention layer. A pioneering study proposed the FNet, which replaces the attention layer with the Fourier Transform (FT) in the transformer encoder architecture. FNet achieves competitive performances concerning the original transformer encoder model while accelerating training process by removing the computational burden of the attention mechanism. However, the FNet model ignores essential properties of the FT from the classical signal processing that can be leveraged to increase model efficiency further. We propose different methods to deploy FT efficiently in transformer encoder models. Our proposed architectures have smaller number of model parameters, shorter training times, less memory usage, and some additional performance improvements. We demonstrate these improvements through extensive experiments on common benchmarks.
△ Less
Submitted 16 May, 2023; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Impact of Tokenization on Language Models: An Analysis for Turkish
Authors:
Cagri Toraman,
Eyup Halit Yilmaz,
Furkan Şahinuç,
Oguzhan Ozcelik
Abstract:
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenize…
▽ More
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
Large-Scale Hate Speech Detection with Cross-Domain Transfer
Authors:
Cagri Toraman,
Furkan Şahinuç,
Eyup Halit Yilmaz
Abstract:
The performance of hate speech detection models relies on the datasets on which the models are trained. Existing datasets are mostly prepared with a limited number of instances or hate domains that define hate topics. This hinders large-scale analysis and transfer learning with respect to hate domains. In this study, we construct large-scale tweet datasets for hate speech detection in English and…
▽ More
The performance of hate speech detection models relies on the datasets on which the models are trained. Existing datasets are mostly prepared with a limited number of instances or hate domains that define hate topics. This hinders large-scale analysis and transfer learning with respect to hate domains. In this study, we construct large-scale tweet datasets for hate speech detection in English and a low-resource language, Turkish, consisting of human-labeled 100k tweets per each. Our datasets are designed to have equal number of tweets distributed over five domains. The experimental results supported by statistical tests show that Transformer-based language models outperform conventional bag-of-words and neural models by at least 5% in English and 10% in Turkish for large-scale hate speech detection. The performance is also scalable to different training sizes, such that 98% of performance in English, and 97% in Turkish, are recovered when 20% of training instances are used. We further examine the generalization ability of cross-domain transfer among hate domains. We show that 96% of the performance of a target domain in average is recovered by other domains for English, and 92% for Turkish. Gender and religion are more successful to generalize to other domains, while sports fail most.
△ Less
Submitted 5 July, 2022; v1 submitted 2 March, 2022;
originally announced March 2022.
-
BlackLivesMatter 2020: An Analysis of Deleted and Suspended Users in Twitter
Authors:
Cagri Toraman,
Furkan Şahinuç,
Eyup Halit Yilmaz
Abstract:
After George Floyd's death in May 2020, the volume of discussion in social media increased dramatically. A series of protests followed this tragic event, called as the 2020 BlackLivesMatter movement. Eventually, many user accounts are deleted by their owners or suspended due to violating the rules of social media platforms. In this study, we analyze what happened in Twitter before and after the ev…
▽ More
After George Floyd's death in May 2020, the volume of discussion in social media increased dramatically. A series of protests followed this tragic event, called as the 2020 BlackLivesMatter movement. Eventually, many user accounts are deleted by their owners or suspended due to violating the rules of social media platforms. In this study, we analyze what happened in Twitter before and after the event triggers with respect to deleted and suspended users. We create a novel dataset that includes approximately 500k users sharing 20m tweets, half of whom actively participated in the 2020 BlackLivesMatter discussion, but some of them were deleted or suspended later. We particularly examine the factors for undesirable behavior in terms of spamming, negative language, hate speech, and misinformation spread. We find that the users who participated to the 2020 BlackLivesMatter discussion have more negative and undesirable tweets, compared to the users who did not. Furthermore, the number of new accounts in Twitter increased significantly after the trigger event occurred, yet new users are more oriented to have undesirable tweets, compared to old ones.
△ Less
Submitted 6 July, 2022; v1 submitted 30 September, 2021;
originally announced October 2021.
-
Imparting Interpretability to Word Embeddings while Preserving Semantic Structure
Authors:
Lutfi Kerem Senel,
Ihsan Utlu,
Furkan Şahinuç,
Haldun M. Ozaktas,
Aykut Koç
Abstract:
As an ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We int…
▽ More
As an ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce an additive modification to the objective function of the embedding learning algorithm that encourages the embedding vectors of words that are semantically related to a predefined concept to take larger values along a specified dimension, while leaving the original semantic learning mechanism mostly unaffected. In other words, we align words that are already determined to be related, along predefined concepts. Therefore, we impart interpretability to the word embedding by assigning meaning to its vector dimensions. The predefined concepts are derived from an external lexical resource, which in this paper is chosen as Roget's Thesaurus. We observe that alignment along the chosen concepts is not limited to words in the Thesaurus and extends to other related words as well. We quantify the extent of interpretability and assignment of meaning from our experimental results. Manual human evaluation results have also been presented to further verify that the proposed method increases interpretability. We also demonstrate the preservation of semantic coherence of the resulting vector space by using word-analogy and word-similarity tests. These tests show that the interpretability-imparted word embeddings that are obtained by the proposed framework do not sacrifice performances in common benchmark tests.
△ Less
Submitted 2 July, 2020; v1 submitted 19 July, 2018;
originally announced July 2018.