Department of Computer Science
University of Bucharest
14 Academiei, Bucharest, Romania

PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

Ana-Cristina Rogoz    Maria Ilinca Nechita    Radu Tudor Ionescu Corresponding author: [email protected]
Abstract

We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro $F_1$ score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.

Keywords:
natural language processing reddit popularity popularity detection virality detection Romanian LLM prompting.

1 Introduction

Understanding the factors influencing the popularity of social media posts represents a critical and multifaceted challenge for NLP research. Social media platforms generate vast amounts of user-created content, offering a unique window into real-time public discourse and collective attention. Analyzing what resonates with audiences goes beyond just sentiment analysis, demanding nuanced NLP techniques to capture humor, sarcasm, and the subtle cues that drive engagement. This pursuit fosters not only theoretical advancements but also practical applications across diverse fields, from marketing and public health to combating misinformation and predicting cultural trends. Studying social media popularity, therefore, is not just an interesting NLP problem, but a key to unlocking the true potential of language in the digital age.

So far, the phenomenon has been studied both on individual social media platforms, such as Instagram [4, 26, 21, 5], Reddit [2, 13], and Twitter [17, 27, 16], and as a general phenomenon, either for detecting popularity [20, 25] or for generating engaging content [8].

Reddit, in particular, has been one of the most studied platforms in the ever-evolving landscape of online content. From gauging public opinion and identifying emerging trends to optimizing content recommendation systems and combating misinformation, accurate popularity detection offers a multitude of applications across various domains. There are existing datasets generated from Reddit content, studying several topics, from political conflicts [28], to personality traits [10], language biases [9, 11], and mental health related topics, such as stress analysis [24], depression [23] and anxiety [22].

While existing Reddit datasets have played a crucial role in advancing NLP research, they predominantly focus on high-resource languages, such as English. This creates a bias towards high-resource languages in NLP models, neglecting the necessity of exploring NLP capabilities on less studied languages, such as Romanian.

We emphasize that what constitutes a popular (viral) post can vary across countries and regions, since the topics of interest can naturally change from one local community to another. This is because people are usually more influenced by major local events, e.g. the war in Ukraine is still a major subject of discussion in Romania, a neighboring country of Ukraine, while the subject may have faded out in countries from other continents. This justifies the need to study the popularity prediction task across multiple countries, and consequently, in various languages. To this end, we introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. We leverage this novel resource to explore popularity detection in a low-resource language, Romanian, establishing six diverse baselines for future comparative analysis.

2 Dataset

2.1 Data Collection

PoPreRo gathers Reddit posts from five Romanian subreddits, each representing either one of the biggest cities in Romania or the country as a whole. The subreddits are: Romania, Bucureşti, Cluj, Iaşi and Timişoara. We initially collected the posts through the Reddit API, storing them in JSON files from which we extract the information needed to analyze the popularity of each Reddit post, such as the title, the content, the number of comments, and the number of upvotes and downvotes. However, the Reddit API limits data extraction to 1000 requests. Due to the large number of samples that we target for the dataset, the API could not provide all the necessary data. Therefore, we collect the samples from an open-source archive instead. As mentioned above, the data is stored in a separate JSON file for each subreddit, containing the information relevant for determining the popularity of posts.

2.2 Dataset Statistics

Table 1: Number of samples (#posts) and number of tokens (#tokens) for each subset in PoPreRo.
Set | Unpopular #posts | Unpopular #tokens | Popular #posts | Popular #tokens | Total #posts | Total #tokens
Training | 12,053 | 398,219 | 11,592 | 560,580 | 23,645 | 958,799
Validation | 1,059 | 75,742 | 1,054 | 80,297 | 2,113 | 156,039
Test | 1,177 | 72,819 | 1,172 | 93,268 | 2,349 | 168,867
Total | 14,289 | 546,780 | 13,818 | 734,145 | 28,107 | 1,283,705

The dataset comprises 28,107 samples (14,289 unpopular and 13,818 popular) containing over 1 million tokens in total (see detailed statistics in Table 1). Each sample consists of a title, a content, and a binary label, where the title and content are concatenated into a single text. We divide the posts into “popular” or “unpopular” based on the sum of upvotes and downvotes for each post, where the threshold between the two categories is given by the median number of votes (15). To enable consistent evaluation and comparison with future studies, we provide an official split with distinct training, validation, and test sets. Inspired by McHardy et al. [18], we utilize disjoint subreddits for each set, ensuring models cannot capitalize on knowledge of specific topics. To further mitigate potential biases arising from uneven topic or time distributions, we select posts from the same time frame across all subreddits.
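As an illustration of how a sample could be assembled from a raw archive record, the sketch below concatenates the title and the content into a single text and sums the upvotes and downvotes. The field names (`title`, `selftext`, `ups`, `downs`) and the single-space separator are assumptions made for this example; the actual archive schema may differ.

```python
import json

# Hypothetical raw record; the field names are assumptions, not the archive's schema.
RAW_RECORD = (
    '{"title": "Am prins niste fulgere faine zilele trecute", '
    '"selftext": "", "ups": 20, "downs": 3}'
)


def build_sample(record: dict) -> dict:
    """Concatenate title and content, and sum upvotes and downvotes."""
    title = record.get("title", "").strip()
    content = record.get("selftext", "").strip()
    # Assumption: title and content are joined by a single space.
    text = f"{title} {content}".strip()
    votes = record.get("ups", 0) + record.get("downs", 0)
    return {"text": text, "votes": votes}


sample = build_sample(json.loads(RAW_RECORD))
print(sample)
```

The resulting vote count is the quantity that is later compared against the median to obtain the binary label.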

Additionally, to control for a potential bias related to the time of day when posts were submitted, we performed an analysis of post popularity by hour. We divided each day into four-hour intervals and counted the number of popular and unpopular posts within each interval. The detailed results are presented in Table 2 and Figure 1. Notably, we observe a consistent trend across all time intervals for both popular and unpopular posts. This finding suggests that the hour of submission does not exert a significant influence on post popularity within our dataset.
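The per-interval counting described above can be sketched as follows. This is a minimal example under stated assumptions: posts are represented as hypothetical `(unix_timestamp, label)` pairs, and hours are read in UTC, since the time zone used for the analysis is not specified.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_HOURS = 4  # four-hour intervals, i.e. six windows per day


def time_window(hour: int) -> str:
    """Map an hour in 0..23 to its half-open four-hour window."""
    start = (hour // WINDOW_HOURS) * WINDOW_HOURS
    return f"[{start}-{start + WINDOW_HOURS})"


def count_by_window(posts):
    """Count posts per (time window, label) pair."""
    counts = Counter()
    for timestamp, label in posts:
        # Assumption: timestamps are interpreted in UTC.
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        counts[(time_window(hour), label)] += 1
    return counts


# Toy data, not taken from PoPreRo.
toy_posts = [
    (0, "popular"),                     # 00:00 UTC
    (5 * 3600, "unpopular"),            # 05:00 UTC
    (13 * 3600, "popular"),             # 13:00 UTC
    (15 * 3600 + 59 * 60, "popular"),   # 15:59 UTC
    (23 * 3600, "unpopular"),           # 23:00 UTC
]
counts = count_by_window(toy_posts)
for key in sorted(counts):
    print(key, counts[key])
```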

Table 2: Number of samples (#posts) for each label (popular/unpopular), distributed by the time of posting for each subset in PoPreRo.
Set | Label | [0-4) h | [4-8) h | [8-12) h | [12-16) h | [16-20) h | [20-24) h
Training | popular | 816 | 260 | 2,200 | 3,451 | 2,797 | 2,272
Training | unpopular | 1,050 | 254 | 1,779 | 3,280 | 3,014 | 2,472
Validation | popular | 78 | 38 | 255 | 284 | 228 | 172
Validation | unpopular | 87 | 32 | 174 | 273 | 232 | 260
Test | popular | 57 | 24 | 241 | 319 | 287 | 244
Test | unpopular | 67 | 32 | 259 | 325 | 274 | 220
[Bar chart. x-axis: time intervals; y-axis: #posts. For the intervals [0-4), [4-8), [8-12), [12-16), [16-20), [20-24): Popular: 951, 322, 2,696, 4,054, 3,312, 2,688; Unpopular: 1,204, 318, 2,212, 3,878, 3,520, 2,952.]
Figure 1: Number of samples (#posts) for each label (popular/unpopular), distributed by the time of posting. The 24 hours in a day are divided into six four-hour intervals. Best viewed in color.

2.3 Preprocessing

After gathering the data from Reddit, we implement a two-step preprocessing pipeline to ensure data quality and consistency. First, we perform language identification on post titles using FastText [12] to filter out non-Romanian posts (filtered posts are not counted in Table 1). This step ensures the linguistic homogeneity of the dataset. Second, upvote/downvote scores are normalized to the $[0,1]$ interval. Finally, a binary popularity label is assigned with respect to the median value of the normalized scores, which corresponds to 15 votes. This approach provides a clear threshold for distinguishing popular and unpopular posts. Notably, our data collection and labeling procedure is directly transferable to other languages.
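The normalization and labeling steps can be sketched as below. Two assumptions are made for illustration: the normalization is a min-max scaling (the text only states that scores are mapped to the unit interval), and posts strictly above the median are labeled as popular (the handling of ties is not specified). The vote counts are toy values.

```python
from statistics import median


def min_max_normalize(scores):
    """Scale scores linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def label_by_median(scores):
    """Return binary labels (1 = popular) and the normalized median threshold."""
    normalized = min_max_normalize(scores)
    threshold = median(normalized)
    # Assumption: a post is popular only if strictly above the median.
    labels = [1 if value > threshold else 0 for value in normalized]
    return labels, threshold


# Toy vote counts whose raw median is 15.
votes = [2, 7, 15, 40, 310]
labels, threshold = label_by_median(votes)
print(labels, round(threshold, 4))
```

Because min-max scaling is monotonic, thresholding the normalized scores at their median is equivalent to thresholding the raw vote counts at the raw median.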

3 Methods

To comprehensively evaluate the performance for the popularity prediction task on the newly introduced dataset, we establish six baseline approaches. Two of these baselines leverage state-of-the-art deep learning models for language processing. Another three baselines utilize various classifiers based on shallow or deep (frozen) features. Our final baseline uses a Large Language Model (LLM) based on in-context learning, also known as few-shot prompting. For all models, we use the concatenated title and content of each post as the input data.

3.1 Fine-Tuned Ro-GPT2

Our first baseline relies on fine-tuning a Ro-GPT2 model [19], a large language model specifically trained on Romanian text. It is based on the original GPT2 architecture, but trained on a Romanian dataset consisting of over 1 million tokens. This allows it to capture the nuances and specificities of the Romanian language, making it more suitable for tasks involving Romanian than the general-purpose GPT2. The Ro-GPT2 encoder is utilized to encode each text sequence into a list of token IDs. Subsequently, the model processes these tokens, generating corresponding 768-dimensional embeddings. We then incorporate a global average pooling layer to capture a Continuous Bag-of-Words (CBOW) representation for each text sequence. This representation is fed into a Softmax output layer comprising two neurons, each predicting the probability of belonging to either the unpopular or popular category. To assign the final class label, we apply the argmax function on the two predicted probabilities. The entire model is fine-tuned for 5 epochs on mini-batches of 32 samples. We employ the Adam optimizer with decoupled weight decay (AdamW) [15] with a learning rate of $5\cdot 10^{-7}$ and $\epsilon=5\cdot 10^{-7}$.
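The classification head described above (global average pooling, a two-neuron Softmax layer, and argmax) can be illustrated with a small dependency-free sketch. It uses toy 3-dimensional token vectors instead of 768-dimensional ones, and it assumes that padded positions are excluded from the average and that class index 0 stands for "unpopular"; neither detail is confirmed by the text.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_embeddings, attention_mask, weights, bias):
    """Pool, apply a linear layer with softmax, and take the argmax."""
    pooled = mean_pool(token_embeddings, attention_mask)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return pooled, probs, label


# Toy inputs: two real tokens and one padded position.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # one row per class
bias = [0.0, 0.0]

pooled, probs, label = classify(tokens, mask, weights, bias)
print(pooled, probs, label)
```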

3.2 Fine-Tuned Ro-BERT

As our second baseline, we employ a fine-tuned Romanian Bidirectional Encoder Representations from Transformers (Ro-BERT) model [7]. Sharing the same transformer-based architecture as the original BERT [6], Ro-BERT has been demonstrated to outperform multilingual BERT on various tasks, as reported by Dumitrescu et al. [7]. Consequently, we anticipate Ro-BERT to be a strong baseline for our Romanian corpus.

Similarly to the previous baseline, we use the Ro-BERT encoder to encode each text into a list of token IDs. We keep the same design as before, where the model generates 768-dimensional embeddings, followed by a global average pooling layer which is fed into a Softmax output layer with two neurons. To assign the final class label, we apply the argmax function on the two predicted probabilities. The entire model is fine-tuned for 10 epochs on mini-batches of 32 samples. We employ the AdamW optimizer [15] with a learning rate of $2\cdot 10^{-7}$ and the default value for $\epsilon$.

3.3 Ro-BERT Embeddings + Logistic Regression

For our third classification approach, we leverage pre-trained Ro-BERT embeddings in conjunction with a Logistic Regression (LR) classifier. Consistent with the fine-tuned Ro-BERT baseline, we first tokenize all input samples from the three subsets (training, validation, and test). Subsequently, we utilize the Ro-BERT model to extract 768-dimensional vector representations for each sample. These representations, corresponding to the final hidden layer of Ro-BERT, are then fed into the LR model for classification.

3.4 FastText + SVM

The first shallow classification approach is based on FastText embeddings [3] and a Support Vector Machine (SVM) classifier. After textual cleaning and tokenization using NLTK’s word tokenizer, we fine-tune a FastText model on the training corpus. This model provides word embeddings for the training, validation, and test sets. For each text sample, the word embeddings are averaged to produce a 300-dimensional feature vector, which is subsequently passed to the SVM. Finally, we train the SVM classifier using the linear kernel and the regularization hyperparameter $C$ set to $10$.

3.5 TF-IDF + Random Forest

Our second shallow classification approach is based on the Term Frequency-Inverse Document Frequency (TF-IDF) representation and a Random Forest (RF) classifier. As for the previous method, we begin by cleaning and tokenizing the text using NLTK’s word tokenizer. Subsequently, we employ a TF-IDF vectorizer to quantify the importance of words within the corpus, generating numerical features for each document. These features are then used to train a Random Forest classifier.
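To make the TF-IDF representation concrete, the following sketch computes TF-IDF vectors with minimum and maximum document-frequency filtering. It assumes one common variant of the weighting (smoothed inverse document frequency followed by L2 normalization); the exact variant used by the vectorizer in our experiments may differ, and the tokenized documents are toy examples.

```python
import math
from collections import Counter


def tfidf(documents, min_df=1, max_df=1.0):
    """Return the filtered vocabulary and L2-normalized TF-IDF vectors.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vocab = sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
    # Assumed variant: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Toy tokenized documents; "nou" occurs in every document.
docs = [
    ["metrou", "nou", "metrou"],
    ["protest", "nou"],
    ["tren", "nou", "aeroport"],
]
vocab, vectors = tfidf(docs, min_df=1, max_df=0.7)
print(vocab)
print(vectors)
```

In this toy run, the term that appears in all documents exceeds the maximum document frequency and is dropped from the vocabulary, which is the intended effect of the upper bound.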

3.6 Few-Shot LLM Prompting

To explore the feasibility of large language models (LLMs) for post popularity prediction in PoPreRo, we employ a prompt-based approach utilizing the 7-billion parameter Falcon LLM [1] (Falcon-7B). Due to computational limitations, we prompt the LLM with contexts comprising two unpopular and two popular examples. Subsequently, we attach an individual test sample to each prompt and ask the LLM to predict the corresponding label. Below, we illustrate the structure of our prompt via a concrete example:

PROMPT (Original): Text: ’Nu vreau sa mai traiesc pe aceasta planeta !’ Label: ’Popular’. Text: ’Unde pot verifica compoziția unui produs?. Să testez de exemplu dacă ingredientele unui produs sunt într-adevăr acelea. Sau dacă niște tablete de vitamine chiar conțin vitamine. În ce proporții? Sau câtă vitamina A conține un morcov - unde pot verifica asta? Ceva laboratoare?’ Label: ’Unpopular’. Text: ’Azi a venit mitropolitul ardealului la noi la liceu să ne convingă să facem religie. Primul lucru care mi-a venit în cap când am văzut ce mașină și-a parcat în curtea instituții..’ Label: ’Popular’. Text: ’Daca intereseaza pe cineva, sa stiti ca e reddit si in romana’ Label: ’Unpopular’. Text: ’Am prins niste fulgere faine zilele trecute’ Label:

PROMPT (Translated): Text: ’I don’t want to live on this planet anymore!’ Label: ’Popular’. Text: ’Where can I check the composition of a product?. To test for example whether the ingredients of a product are indeed those. Or if some vitamin tablets actually contain vitamins. In what proportions? Or how much vitamin A contains a carrot - where can I check this? Some laboratories?’ Label: ’Unpopular’. Text: ’Today the metropolitan of Transylvania came to us at high school to convince us to do religion. First thing that came to mind when I saw what car he has parked in the courtyard of institutions..’ Label: ’Popular’. Text: ’If anyone is interested, there’s reddit in Romanian’ Label: ’Unpopular’. Text: ’I caught some fine lightning the other day’ Label:
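The prompt layout shown above can be assembled programmatically, as in the following sketch. The helper names `build_prompt` and `parse_label` are hypothetical, and the rule for mapping the generated continuation to a label is an assumption, since the text does not describe how the LLM output is parsed. For brevity, the example context contains only two of the in-context examples.

```python
def build_prompt(examples, query):
    """Format labeled examples followed by an unlabeled query."""
    parts = [f"Text: ’{text}’ Label: ’{label}’." for text, label in examples]
    parts.append(f"Text: ’{query}’ Label:")
    return " ".join(parts)


def parse_label(completion):
    """Map a generated continuation to a label (assumed parsing rule)."""
    answer = completion.strip().lower()
    if answer.startswith(("’unpopular", "unpopular")):
        return "Unpopular"
    if answer.startswith(("’popular", "popular")):
        return "Popular"
    return None  # the continuation does not start with a known label


examples = [
    ("Nu vreau sa mai traiesc pe aceasta planeta !", "Popular"),
    ("Daca intereseaza pe cineva, sa stiti ca e reddit si in romana", "Unpopular"),
]
query = "Am prins niste fulgere faine zilele trecute"

prompt = build_prompt(examples, query)
print(prompt)
print(parse_label(" ’Popular’."))
```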

4 Experiments

4.1 Evaluation

Our binary classification experiments focus on predicting the popularity of text within the PoPreRo dataset. Each text sample is categorized as either popular or unpopular. To evaluate the performance of our models, we employ several metrics. For each class, we calculate precision (proportion of true positives among the identified positives) and recall (proportion of true positives with respect to all positives). Additionally, we aggregate these scores using macro $F_1$ and micro $F_1$ (accuracy) measures.
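The metrics above can be computed as in the following minimal sketch, which derives per-class precision and recall, the macro $F_1$ score (the unweighted mean of the per-class $F_1$ scores), and accuracy. The labels are toy values (0 = unpopular, 1 = popular), not PoPreRo predictions.

```python
def per_class_precision_recall(y_true, y_pred, cls):
    """Precision and recall when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    actual = sum(1 for t in y_true if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(y_true, y_pred, classes=(0, 1)):
    """Return the per-class report, macro F1, and accuracy."""
    report = {}
    for cls in classes:
        p, r = per_class_precision_recall(y_true, y_pred, cls)
        report[cls] = {"precision": p, "recall": r, "f1": f1(p, r)}
    macro_f1 = sum(report[c]["f1"] for c in classes) / len(classes)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return report, macro_f1, accuracy


# Toy labels: 0 = unpopular, 1 = popular.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]
report, macro_f1, accuracy = evaluate(y_true, y_pred)
print(report)
print(macro_f1, accuracy)
```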

4.2 Hyperparameter Tuning

The hyperparameters of all models are determined via grid search. For the transformer-based methods (Ro-BERT, Ro-GPT2), we employ a grid search over the maximum number of input tokens in the set $\{50, 70, 100, 120, 150, 200\}$, as well as the learning rate in the set $\{10^{-5}, 5\cdot 10^{-5}, 10^{-6}, 5\cdot 10^{-6}, 10^{-7}, 2\cdot 10^{-7}, 5\cdot 10^{-7}, 10^{-8}, 5\cdot 10^{-8}\}$ and the value of $\epsilon$ for AdamW in the set $\{10^{-6}, 10^{-7}, 10^{-8}\}$.
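As a rough illustration of the size of this search space, the sketch below enumerates the transformer grid with `itertools.product` and selects the configuration with the highest validation score. The scoring function is a stand-in with an arbitrary optimum, used only to make the example runnable; in practice it would fine-tune a model and return its validation performance.

```python
from itertools import product

MAX_TOKENS = [50, 70, 100, 120, 150, 200]
LEARNING_RATES = [1e-5, 5e-5, 1e-6, 5e-6, 1e-7, 2e-7, 5e-7, 1e-8, 5e-8]
ADAMW_EPS = [1e-6, 1e-7, 1e-8]


def grid_search(score_fn):
    """Evaluate every configuration and keep the best-scoring one."""
    best_score, best_config = float("-inf"), None
    n_configs = 0
    for max_tokens, lr, eps in product(MAX_TOKENS, LEARNING_RATES, ADAMW_EPS):
        config = {"max_tokens": max_tokens, "lr": lr, "eps": eps}
        score = score_fn(config)
        n_configs += 1
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score, n_configs


def toy_validation_score(config):
    """Stand-in scorer with an arbitrary optimum (not a real result)."""
    return -(
        abs(config["max_tokens"] - 100)
        + abs(LEARNING_RATES.index(config["lr"]) - 4)
        + abs(ADAMW_EPS.index(config["eps"]) - 2)
    )


best_config, best_score, n_configs = grid_search(toy_validation_score)
print(n_configs, best_config)
```

The grid contains 6 x 9 x 3 = 162 configurations per transformer model.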

For the FastText + SVM approach, we vary the FastText word-embeddings dimension ($\{150, 200, 300, 350\}$), the window size for the input ($\{2, 3, 4\}$), as well as the kernel (linear or RBF) and the parameter $C$ ($\{0.1, 1, 10, 100, 1000\}$) of the SVM classifier. Similarly, for the Ro-BERT + Logistic Regression approach, we run a search over the maximum number of Ro-BERT input tokens in the same set as before ($\{50, 70, 100, 120, 150, 200\}$) and test different penalty term values (‘l1’, ‘l2’, ‘elastic net’ or ‘None’) for the classifier.

Lastly, for the TF-IDF + Random Forest method, we vary the minimum ($\{4, 5, 6\}$) and maximum ($\{0.6, 0.7, 0.8\}$, in percentages) document frequency of the TF-IDF Vectorizer, together with the number of decision trees in the set $\{50, 100, 150, 200\}$ for the Random Forest classifier.

All other hyperparameters are set to their default values. Please note that we release the code to reproduce all baselines, along with the PoPreRo dataset (https://github.com/ana-rogoz/PoPreRo).

4.3 Results

We present the results of our five baselines on the PoPreRo validation and test sets in Table 3. We find that Ro-GPT2 exhibits the best performance, with an accuracy (micro $F_1$) and a macro $F_1$ score above $0.6$ on both validation and test sets, in contrast to the other baselines, which seem to perform similarly well on the validation set, but reach worse performance on the test set.

Table 3: Validation and test results of the six baselines. The random chance baseline is added as reference. There is no hyperparameter tuning for Falcon-7B LLM, so the model is directly applied on the test set (using in-context learning). The best score on each subset and for each metric is highlighted in bold.
Set | Method | Acc. | Macro $F_1$ | Unpopular Prec. | Unpopular Rec. | Popular Prec. | Popular Rec.
Validation | Random chance | 0.4998 | 0.4999 | 0.4988 | 0.5011 | 0.5011 | 0.4988
Validation | Fine-tuned Ro-GPT2 | 0.6525 | 0.6397 | 0.6157 | 0.8097 | 0.7351 | 0.4986
Validation | Fine-tuned Ro-BERT | 0.6343 | 0.6278 | 0.6189 | 0.6995 | 0.6411 | 0.5562
Validation | FastText + SVM | 0.6677 | 0.6624 | 0.6348 | 0.7920 | 0.7225 | 0.5431
Validation | TF-IDF + RF | 0.6535 | 0.6395 | 0.6107 | 0.8497 | 0.7519 | 0.4568
Validation | Ro-BERT + LR | 0.6824 | 0.6721 | 0.6354 | 0.8582 | 0.7807 | 0.5061
Test | Random chance | 0.4998 | 0.4999 | 0.5010 | 0.4989 | 0.4989 | 0.5010
Test | Fine-tuned Ro-GPT2 | 0.6135 | 0.6060 | 0.6146 | 0.6331 | 0.6145 | 0.5933
Test | Fine-tuned Ro-BERT | 0.5605 | 0.5489 | 0.5505 | 0.6611 | 0.5767 | 0.4565
Test | FastText + SVM | 0.5644 | 0.5637 | 0.5718 | 0.5208 | 0.5583 | 0.6083
Test | TF-IDF + RF | 0.5759 | 0.5729 | 0.5661 | 0.6584 | 0.5897 | 0.4931
Test | Ro-BERT + LR | 0.5998 | 0.5973 | 0.5873 | 0.6771 | 0.6169 | 0.5221
Test | Few-shot prompted Falcon-7B | 0.4143 | 0.4126 | 0.4143 | 0.7904 | 0.5537 | 0.1887

Evaluating the two state-of-the-art transformer models, Ro-GPT2 and Ro-BERT, reveals some interesting findings. While both achieve comparable accuracy on the validation set ($0.6525$ for Ro-GPT2 and $0.6343$ for Ro-BERT), Ro-GPT2 clearly outperforms Ro-BERT on the test set, indicating the superior ability of the former model to generalize to unseen data. Analyzing the precision-recall trade-off in Table 3, we observe a shared propensity for both models to exhibit higher recall for the “unpopular” category, accompanied by higher (or, for Ro-GPT2 on the test set, comparable) precision for the “popular” class.

Table 4: Examples of relevant terms for popular posts, learned by the fine-tuned Ro-BERT and SVM models.
Model | Topic | Example | Translation
Ro-BERT | Call to action | “pentru cei care vor să se implice activ” | “for those who want to be actively involved”
Ro-BERT | Call to action | “ar fi interesati de un voluntariat” | “would be interested in volunteering”
Ro-BERT | News | “încep săpăturile la metrou” | “excavations begin at the subway”
Ro-BERT | News | “un nou residence la “doar 20 de minute” de Centru” | “a new residence building “only 20 minutes” from the center”
Ro-BERT | Events | “Seara de film la Casa Tineretului” | “Movie night at the Youth House”
SVM | News | “mic protest la primaria capitalei” | “small protest at Bucharest City Hall”
SVM | Local transport | “am vazut ca este tren de la gara de nord la aeroport aproape la fiecare ora” | “I saw that there is a train from Gara de Nord to the airport almost every hour”

The FastText + SVM, TF-IDF + RF and Ro-BERT + LR models achieve comparable performance. All three models obtain accuracy rates higher than 65% on the validation set, which drop below 60% on the test set. In terms of precision and recall, almost all of them achieve higher precision for the “popular” category on both validation and test sets, the one exception being the FastText + SVM method on the test set, where the precision on the two classes is comparable. According to Table 3, a distinctive behavior among the three models is that FastText + SVM obtains a higher recall for the “popular” category on the test set, while TF-IDF + RF and Ro-BERT + LR attain a higher recall for the “unpopular” category on both sets.

Table 3 also shows the results on the test set of our few-shot prompted LLM. While this approach exhibits a bias similar to our other baselines, favoring recall for unpopular predictions and precision for popular ones, its overall performance falls below that of a random chance classifier. This suggests a limitation in the generalization capacity of LLMs to the popularity prediction task, particularly for languages with limited online resources, such as Romanian.

4.4 Discriminative Feature Analysis

We analyze the discriminative features learned by the fine-tuned Ro-BERT and by the FastText + SVM. The motivation behind this analysis is to validate that the decisions of these models are not based on some biases that escaped our data collection, but on actual data understanding.

For the Ro-BERT model, we use the Captum [14] library via its Layer Integrated Gradients method to infer valuable insights from the fine-tuned model. This technique delves into the BERT embeddings layer, attributing importance scores to individual input words which led to the final label prediction.
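For reference, the integrated gradients attribution underlying this method can be written as follows. Here $h$ denotes the output of the embedding layer for the input text, $h'$ the corresponding output for a baseline input, $F$ the model score for the predicted class, and $m$ the number of steps in the Riemann approximation of the integral; this notation is introduced only for this illustration. Obtaining one score per input word by summing the attributions over the embedding dimensions is a common convention and is stated here as an assumption.

```latex
\mathrm{IG}_i(h) = (h_i - h'_i) \int_{0}^{1}
  \frac{\partial F\big(h' + \alpha\,(h - h')\big)}{\partial h_i}\, d\alpha
\;\approx\; (h_i - h'_i)\,\frac{1}{m} \sum_{k=1}^{m}
  \frac{\partial F\big(h' + \tfrac{k}{m}\,(h - h')\big)}{\partial h_i}
```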

To find the words with higher influence on the decisions given by the SVM, we consider the cosine similarities between the primal weights of the SVM and the FastText embedding of each word. We sort the words based on the similarity values, and keep the first 10 and last 10 words from the sorted list as features for the positive (“popular”) and negative (“unpopular”) classes, respectively.
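The ranking procedure can be sketched as below. The 2-dimensional embeddings and the weight vector are toy values chosen for illustration (real FastText vectors have 300 dimensions), and the function name `rank_words` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_words(svm_weights, embeddings, k):
    """Sort words by similarity to the SVM weights; return top-k and bottom-k."""
    ranked = sorted(
        embeddings,
        key=lambda word: cosine(svm_weights, embeddings[word]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]


# Toy primal weights and toy word embeddings (not learned values).
svm_weights = [1.0, -1.0]
embeddings = {
    "online": [2.0, -1.0],
    "youtube": [1.0, 0.0],
    "dupa": [1.0, 1.0],
    "google": [0.0, 1.0],
    "eu": [-1.0, 2.0],
}

popular_words, unpopular_words = rank_words(svm_weights, embeddings, k=2)
print(popular_words, unpopular_words)
```

Words whose embeddings point in the direction of the weight vector are associated with the positive class, while those pointing in the opposite direction are associated with the negative class.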

Table 5: Examples of relevant terms for unpopular posts, learned by the fine-tuned Ro-BERT and SVM models.
Model | Topic | Example | Translation
Ro-BERT | Proper names | “Palatul Roznovanu” | “Roznovanu palace”
Ro-BERT | Proper names | “Ceauşescu” | “Ceauşescu”
Ro-BERT | Proper names | “în Timişoara” | “in Timişoara”
Ro-BERT | Seeking advice | “terenuri ok de baschet în…” | “ok basketball courts in…”
Ro-BERT | Seeking advice | “print shop pentru poze mari în …” | “print shop for big pictures in …”
Ro-BERT | Mundane problems | “Se închide circulaţia” | “traffic is closed”
Ro-BERT | Mundane problems | “construim blocuri între case” | “building apartment building between houses”
SVM | City names | “bucuresti” | “bucharest”
SVM | Seeking advice | “cunoasteti un loc de facut tatuaj temporar personalizat” | “do you know a place to do custom temporary tattoo”
SVM | Opinion sharing | “lumea ca se plange de targul de craciun de anul acesta” | “people complain about this year’s Christmas market”

In Tables 4 and 5, we present a few examples of interesting patterns that were picked up by the models. In predicting post popularity, the Ro-BERT model demonstrates a bias toward content reflecting current trends, including news and events, and posts encouraging community engagement through calls to action. Conversely, references to proper nouns like city names or historical landmarks appear to hinder popularity, as do posts seeking community advice or expressing dissatisfaction with platitudes. Similar to Ro-BERT, we find that the SVM labels posts that share news as popular, and posts by people seeking advice as unpopular.

Table 6: Examples of the most discriminative words for the popular and unpopular classes, selected according to the weights learned by the SVM model based on FastText features.
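Label | Token | Weight
popular | online | 5.974352
popular | dupa | 4.821379
popular | youtube | 4.121604
popular | asa | 4.08882
popular | cazul | 3.839789
unpopular | toate | -4.089375
unpopular | un | -4.190036
unpopular | google | -4.31336
unpopular | nia | -4.339616
unpopular | eu | -4.72841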

Furthermore, we extend the feature analysis for the SVM in order to determine the most discriminative words for the popular and unpopular classes. To achieve this, we determine the discriminative weight of each word based on the cosine similarity between the respective word embedding and the SVM weights. We sort the words according to their weights, and select the ones with the highest and lowest weights. In Table 6, we provide the five most discriminative words for the popular and unpopular classes, according to the SVM based on FastText features. We observe that posts mentioning “online” or “youtube” are more popular, likely because readers appreciate posts that provide links to YouTube videos. We also note the preference for posts that discuss particular cases/experiences, which are usually introduced by the word “cazul” (translated to “case” in English). On the other hand, posts that recommend searching on “google” are unpopular, as the readers consider such suggestions unhelpful. Moreover, discussing subjective perspectives, using the singular first person pronoun “eu”, is again unpopular, likely because the readers appreciate more objective posts.

5 Conclusion

In this paper, we introduced PoPreRo, the first publicly available dataset of Romanian Reddit posts dedicated to the task of popularity prediction. We collected 28,107 posts from five diverse Romanian subreddits, amounting to over 1 million tokens. Aiming to predict binary labels resulting from the sum of upvotes and downvotes for each post, we explored six distinct popularity detection methods and presented comparative results. We found that the fine-tuned Ro-GPT2 achieves the best results on the test set, with an accuracy of 61.35% and a macro $F_1$ score of 60.60%.

Building upon our foundation, future research can further study popularity detection algorithms and delve deeper into the factors driving engagement on Romanian Reddit.

6 Limitations

It is crucial to acknowledge that Reddit users in Romania might not be representative of the wider population. While Reddit offers a valuable platform for research due to its diverse communities and open discussions, its user base in Romania is comparatively smaller than that of other social media platforms, such as Facebook, Instagram, or YouTube. Furthermore, Reddit’s API restricts data access, limiting historical data collection and imposing retrieval caps.

7 Ethics Statement

The data was collected from a publicly available Reddit archive, selecting five Romanian subreddits. The social media posts are freely accessible to the public without any type of subscription. As the data was collected from an archived public website (Reddit), we adhere to the European regulations (https://eur-lex.europa.eu/eli/dir/2019/790/oj) that allow researchers to use data in the public web domain for non-commercial research purposes. We thus release our corpus as open-source under a non-commercial share-alike license agreement, namely CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/).

We acknowledge that some posts could refer to certain people, e.g. public figures in Romania. Following GDPR regulations, we will remove all references to a person, upon receiving removal requests via an email to any of the authors.

References

  • [1] Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, É., Hesslow, D., Launay, J., Malartic, Q., et al.: The Falcon Series of Open Language Models. arXiv preprint arXiv:2311.16867 (2023)
  • [2] Barnes, K., Riesenmy, T., Trinh, M.D., Lleshi, E., Balogh, N., Molontay, R.: Dank or not? Analyzing and predicting the popularity of memes on Reddit. Applied Network Science 6(1),  21 (Mar 2021)
  • [3] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
  • [4] Carta, S., Podda, A.S., Recupero, D.R., Saia, R., Usai, G.: Popularity Prediction of Instagram Posts. Information 11(9),  453 (2020)
  • [5] De, S., Maity, A., Goel, V., Shitole, S., Bhattacharya, A.: Predicting the popularity of Instagram posts for a lifestyle magazine using deep learning. In: Proceedings of CSCITA. pp. 174–177 (2017)
  • [6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL. pp. 4171–4186 (2019)
  • [7] Dumitrescu, S.D., Avram, A.M., Pyysalo, S.: The birth of Romanian BERT. In: Findings of EMNLP. pp. 4324–4328 (2020)
  • [8] Fang, Z., Yu, M., Fu, Z., Zhang, B., Huang, X., Tang, X., Yang, Y.: How to generate popular post headlines on social media? AI Open 5,  1–9 (2024)
  • [9] Ferrer, X., van Nuenen, T., Such, J.M., Criado, N.: Discovering and Categorising Language Biases in Reddit. In: Proceedings of ICWSM. pp. 140–151 (2021)
  • [10] Gjurković, M., Šnajder, J.: Reddit: A gold mine for personality prediction. In: Proceedings of PEOPLES. pp. 87–97 (2018)
  • [11] Hada, R., Sudhir, S., Mishra, P., Yannakoudakis, H., Mohammad, S.M., Shutova, E.: Ruddit: Norms of Offensiveness for English Reddit Comments. In: Proceedings of ACL. pp. 2700–2717 (2022)
  • [12] Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
  • [13] Kim, J.: Predicting the Popularity of Reddit Posts with AI. arXiv preprint arXiv:2106.07380 (2021)
  • [14] Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, A., Kliushkina, N., Araya, C., Yan, S., Reblitz-Richardson, O.: Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896 (2020)
  • [15] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of ICLR (2019)
  • [16] Ma, Z., Sun, A., Cong, G.: On predicting the popularity of newly emerging hashtags in twitter. Journal of the American Society for Information Science and Technology 64, 1399–1410 (2013)
  • [17] Mahdavi, M., Asadpour, M., Ghavami, S.: A comprehensive analysis of tweet content and its impact on popularity. In: Proceedings of IST. pp. 559–564 (2016)
  • [18] McHardy, R., Adel, H., Klinger, R.: Adversarial Training for Satire Detection: Controlling for Confounding Variables. In: Proceedings of NAACL. pp. 660–665 (2019)
  • [19] Niculescu, M.A., Ruseti, S., Dascalu, M.: RoGPT2: Romanian GPT2 for Text Generation. In: Proceedings of ICTAI. pp. 1154–1161 (2021)
  • [20] Poecze, F., Ebster, C., Strauss, C.: Social media metrics and sentiment analysis to evaluate the effectiveness of social media posts. In: Proceedings of ANT-SEIT. pp. 660–666 (2018)
  • [21] Purba, K.R., Asirvatham, D., Murugesan, R.K.: Instagram post popularity trend analysis and prediction using hashtag, image assessment, and user history features. The International Arab Journal of Information Technology 18(1), 85–94 (2021)
  • [22] Shen, J.H., Rudzicz, F.: Detecting anxiety through Reddit. In: Proceedings of CLPsych. pp. 58–65 (Aug 2017)
  • [23] Tadesse, M.M., Lin, H., Xu, B., Yang, L.: Detection of Depression-Related Posts in Reddit Social Media Forum. IEEE Access 7, 44883–44893 (2019)
  • [24] Turcan, E., McKeown, K.: Dreaddit: A Reddit Dataset for Stress Analysis in Social Media. In: Proceedings of LOUHI. pp. 97–107 (2019)
  • [25] Wang, C., Xiao, Z., Liu, Y., Xu, Y., Zhou, A., Zhang, K.: SentiView: Sentiment Analysis and Visualization for Internet Popular Topics. IEEE Transactions on Human-Machine Systems 43(6), 620–630 (2013)
  • [26] Zhang, Z., Chen, T., Zhou, Z., Li, J., Luo, J.: How to become instagram famous: Post popularity prediction with dual-attention. arXiv preprint arXiv:1809.09314 (2019)
  • [27] Zhao, Q., Erdogdu, M.A., He, H.Y., Rajaraman, A., Leskovec, J.: SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity. In: Proceedings of KDD. pp. 1513–1522 (2015)
  • [28] Zhu, Y., ul Haq, E., Lee, L.H., Tyson, G., Hui, P.: A Reddit Dataset for the Russo-Ukrainian Conflict in 2022. arXiv preprint arXiv:2206.05107 (2022)