Department of Computer Science
University of Bucharest
14 Academiei, Bucharest, Romania

PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

Ana-Cristina Rogoz, Maria Ilinca Nechita, Radu Tudor Ionescu (corresponding author: [email protected])

Abstract

We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of $61.35\%$ and a macro $F_{1}$ score of $60.60\%$ on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.

Keywords:

natural language processing, Reddit popularity, popularity detection, virality detection, Romanian, LLM prompting.

1 Introduction

Understanding the factors influencing the popularity of social media posts represents a critical and multifaceted challenge for NLP research. Social media platforms generate vast amounts of user-created content, offering a unique window into real-time public discourse and collective attention. Analyzing what resonates with audiences goes beyond just sentiment analysis, demanding nuanced NLP techniques to capture humor, sarcasm, and the subtle cues that drive engagement. This pursuit fosters not only theoretical advancements but also practical applications across diverse fields, from marketing and public health to combating misinformation and predicting cultural trends. Studying social media popularity, therefore, is not just an interesting NLP problem, but a key to unlocking the true potential of language in the digital age.

So far, the phenomenon has been studied both for individual social media platforms, such as Instagram [4, 26, 21, 5], Reddit [2, 13] and Twitter [17, 27, 16], and as a general phenomenon, either for detecting popularity [20, 25] or for generating engaging content [8].

Reddit, in particular, has been one of the most studied platforms in the ever-evolving landscape of online content. From gauging public opinion and identifying emerging trends to optimizing content recommendation systems and combating misinformation, accurate popularity detection offers a multitude of applications across various domains. There are existing datasets generated from Reddit content, studying several topics, from political conflicts [28], to personality traits [10], language biases [9, 11], and mental health related topics, such as stress analysis [24], depression [23] and anxiety [22].

While existing Reddit datasets have played a crucial role in advancing NLP research, they predominantly focus on high-resource languages, such as English. This creates a bias towards high-resource languages in NLP models, neglecting the necessity of exploring NLP capabilities on less studied languages, such as Romanian.

We emphasize that what constitutes a popular (viral) post can vary across countries and regions, since the topics of interest can naturally change from one local community to another. This is because people are usually more influenced by major local events, e.g. the war in Ukraine is still a major subject of discussion in Romania, a neighboring country of Ukraine, while the subject may have faded out in countries from other continents. This justifies the need to study the popularity prediction task across multiple countries, and consequently, in various languages. To this end, we introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. We leverage this novel resource to explore popularity detection in a low-resource language, Romanian, establishing six diverse baselines for future comparative analysis.

2 Dataset

2.1 Data Collection

PoPreRo gathers Reddit posts from five Romanian subreddits, each representing either one of the biggest cities in Romania or the whole country. The subreddits are: Romania, Bucureşti, Cluj, Iaşi and Timişoara. We initially collected the posts using the Reddit API, storing them in JSON files that hold the information needed for analyzing the popularity of each Reddit post, such as the title, the content, the number of comments, and the number of upvotes and downvotes. However, the Reddit API imposes a limit of 1000 requests for data extraction. Due to the large number of samples that we target for the dataset, the API could not provide all the necessary data. Therefore, we collect the samples from an open-source archive instead. As before, the data is stored in a separate JSON file for each subreddit, containing the information relevant for determining the popularity of posts.

2.2 Dataset Statistics

Table 1: Number of samples (#posts) and number of tokens (#tokens) for each subset in PoPreRo.

Set	Unpopular #posts	Unpopular #tokens	Popular #posts	Popular #tokens	Total #posts	Total #tokens
Training	12,053	398,219	11,592	560,580	23,645	958,799
Validation	1,059	75,742	1,054	80,297	2,113	156,039
Test	1,177	72,819	1,172	93,268	2,349	168,867
Total	14,289	546,780	13,818	734,145	28,107	1,283,705

The dataset comprises 28,107 samples (14,289 unpopular and 13,818 popular) containing over 1 million tokens in total (see detailed statistics in Table 1). Each sample consists of a title, its content, and a binary label, where the title and content are concatenated into a single text. We divide the posts into “popular” and “unpopular” based on the sum of upvotes and downvotes for each post, where the threshold between the two categories is given by the median number of votes (15). To enable consistent evaluation and comparison with future studies, we provide an official split with distinct training, validation, and test sets. Inspired by McHardy et al. [18], we utilize disjoint subreddits for each set, ensuring models cannot capitalize on knowledge of specific topics. To further mitigate potential biases arising from uneven topic or time distributions, we select posts from the same time frame across all subreddits.

Additionally, to check for a potential bias related to the time of day when posts were submitted, we perform an analysis of post popularity by hour. We divide each day into four-hour intervals and count the number of popular and unpopular posts within each interval. The detailed results are presented in Table 2 and Figure 1. Notably, we observe a consistent trend across all time intervals for both popular and unpopular posts. This finding suggests that the hour of submission does not exert a significant influence on post popularity within our dataset.
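The hour-based analysis can be sketched as follows. This is a minimal illustration, assuming each post is available as an (hour, label) pair; the function names and the toy sample are hypothetical and are not taken from the released code or data.

```python
from collections import Counter

WINDOW_HOURS = 4  # six four-hour windows per day, as described in the text


def window_label(hour: int) -> str:
    """Map an hour of day (0-23) to its half-open four-hour window."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in [0, 23], got {hour}")
    start = (hour // WINDOW_HOURS) * WINDOW_HOURS
    return f"[{start}-{start + WINDOW_HOURS})"


def count_by_window(posts):
    """Count posts per (label, window); `posts` yields (hour, label) pairs."""
    counts = Counter()
    for hour, label in posts:
        counts[(label, window_label(hour))] += 1
    return counts


# Hypothetical toy sample, not taken from PoPreRo.
toy_posts = [
    (0, "popular"),
    (3, "unpopular"),
    (13, "popular"),
    (15, "popular"),
    (23, "unpopular"),
]
toy_counts = count_by_window(toy_posts)
```

Comparing the per-window counts of the two labels then shows whether one class dominates at certain hours.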

Table 2: Number of samples (#posts) for each label (popular/unpopular), distributed by the time of posting for each subset in PoPreRo.

Set	Label	#posts in time window (h)
Set	Label	[0-4)	[4-8)	[8-12)	[12-16)	[16-20)	[20-24)
Training	popular	816	260	2,200	3,451	2,797	2,272
Training	unpopular	1,050	254	1,779	3,280	3,014	2,472
Validation	popular	78	38	255	284	228	172
Validation	unpopular	87	32	174	273	232	260
Test	popular	57	24	241	319	287	244
Test	unpopular	67	32	259	325	274	220

Figure 1: Number of samples (#posts) for each label (popular/unpopular), distributed by the time of posting. The 24 hours in a day are divided into six four-hour intervals. Best viewed in color.

2.3 Preprocessing

After gathering the data from Reddit, we implement a two-step preprocessing pipeline to ensure data quality and consistency. First, we perform language identification on post titles using FastText [12] to filter out non-Romanian posts (filtered posts are not counted in Table 1). This step ensures the linguistic homogeneity of the dataset. Second, we normalize the upvote/downvote scores to the $[0,1]$ interval. Once preprocessing is complete, a binary popularity label is assigned with respect to the median value of the normalized scores, which corresponds to 15 votes. This approach provides a clear threshold for distinguishing popular and unpopular posts. Notably, our data collection and labeling procedure is directly transferable to other languages.
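The normalization and labeling steps can be illustrated with a short sketch. It rests on two assumptions that the text does not specify: the scores are normalized with min-max scaling, and posts strictly above the median are labeled popular. The function names and toy scores are hypothetical.

```python
from statistics import median


def min_max_normalize(scores):
    """Scale raw vote scores to [0, 1] (assumed min-max normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def label_by_median(scores):
    """Return binary labels (1 = popular) and the median used as threshold."""
    normalized = min_max_normalize(scores)
    threshold = median(normalized)
    # Assumption: ties with the median are labeled unpopular.
    labels = [1 if s > threshold else 0 for s in normalized]
    return labels, threshold


# Hypothetical raw scores, chosen so that the median is 15 votes.
toy_scores = [2, 7, 15, 40, 120]
toy_labels, toy_threshold = label_by_median(toy_scores)
```

Since min-max scaling preserves the ordering of the scores, thresholding the normalized scores at their median yields the same labels as thresholding the raw scores at the raw median.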

3 Methods

To comprehensively evaluate the performance for the popularity prediction task on the newly introduced dataset, we establish six baseline approaches. Two of these baselines leverage state-of-the-art deep learning models for language processing. Another three baselines utilize various classifiers based on shallow or deep (frozen) features. Our final baseline uses a Large Language Model (LLM) based on in-context learning, also known as few-shot prompting. For all models, we use the concatenated title and content of each post as the input data.

3.1 Fine-Tuned Ro-GPT2

Our first baseline relies on fine-tuning a Ro-GPT2 model [19], a large language model specifically trained on Romanian text. It is based on the original GPT2 architecture, but trained on a Romanian dataset consisting of over 1 million tokens. This allows it to capture the nuances and specificities of the Romanian language, making it more suitable for tasks involving Romanian than the general-purpose GPT2. The Ro-GPT2 encoder is utilized to encode each text sequence into a list of token IDs. Subsequently, the model processes these tokens, generating corresponding 768-dimensional embeddings. We then incorporate a global average pooling layer to capture a Continuous Bag-of-Words (CBOW) representation for each text sequence. This representation is fed into a Softmax output layer comprising two neurons, each predicting the probability of belonging to either the unpopular or popular category. To assign the final class label, we apply the argmax function on the two predicted probabilities. The entire model is fine-tuned for 5 epochs on mini-batches of 32 samples. We employ the Adam optimizer with decoupled weight decay (AdamW) [15] with a learning rate of $5\cdot 10^{-7}$ and $\epsilon=5\cdot 10^{-7}$ .
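The classification head described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy embeddings are 4-dimensional instead of 768-dimensional, the weights are hypothetical, and index 0 is assumed to denote the unpopular class. It does not reproduce the released implementation.

```python
import math


def mean_pool(token_embeddings):
    """Global average pooling: average equally sized token vectors."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_embeddings, weights, biases):
    """Pool token embeddings, apply a two-neuron Softmax layer, take argmax."""
    pooled = mean_pool(token_embeddings)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Toy input: 3 tokens with 4-dimensional embeddings (768 in the real model).
toy_tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [2.0, 3.0, 1.0, 0.0]]
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],  # neuron 0: unpopular (assumed order)
    [0.0, 1.0, 0.0, 0.0],  # neuron 1: popular (assumed order)
]
toy_biases = [0.0, 0.0]
toy_label, toy_probs = classify(toy_tokens, toy_weights, toy_biases)
```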

3.2 Fine-Tuned Ro-BERT

As our second baseline, we employ a fine-tuned Romanian Bidirectional Encoder Representations from Transformers (Ro-BERT) model [7]. Sharing the same transformer-based architecture as the original BERT [6], Ro-BERT has been demonstrated to outperform multilingual BERT on various tasks, as reported by Dumitrescu et al. [7]. Consequently, we anticipate Ro-BERT to be a strong baseline for our Romanian corpus.

Similarly to the previous baseline, we use the Ro-BERT encoder to encode each text into a list of token IDs. We keep the same design as before, where the model generates 768-dimensional embeddings, followed by a global average pooling layer which is fed into a Softmax output layer with two neurons. To assign the final class label, we apply the argmax function on the two predicted probabilities. The entire model is fine-tuned for 10 epochs on mini-batches of 32 samples. We employ the AdamW optimizer [15] with a learning rate of $2\cdot 10^{-7}$ and the default value for $\epsilon$ .

3.3 Ro-BERT Embeddings + Logistic Regression

For our third classification approach, we leverage pre-trained Ro-BERT embeddings in conjunction with a Logistic Regression (LR) classifier. Consistent with the fine-tuned Ro-BERT baseline, we first tokenize all input samples from the three subsets (training, validation and test). Subsequently, we utilize the Ro-BERT model to extract 768-dimensional vector representations for each sample. These representations, corresponding to the final hidden layer of Ro-BERT, are then fed into the LR model for classification.

3.4 FastText + SVM

The first shallow classification approach is based on FastText embeddings [3] and a Support Vector Machine (SVM) classifier. After textual cleaning and tokenization using NLTK’s word tokenizer, we fine-tune a FastText model on the training corpus. This model provides word embeddings for the training, validation, and test sets. For each text sample, the word embeddings are averaged to produce a 300-dimensional feature vector, which is subsequently passed to the SVM. Finally, we train the SVM classifier using the linear kernel and the regularization hyperparameter $C$ set to $10$ .
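The averaging step that turns word embeddings into a fixed-size feature vector can be sketched as follows. The sketch assumes a plain dictionary of word vectors and skips unknown words, which is a simplification, since FastText can compose vectors for unseen words from subword n-grams. The toy vectors are hypothetical and 2-dimensional rather than 300-dimensional.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical 2-dimensional word vectors (300 dimensions in the paper).
toy_embeddings = {"tren": [1.0, 3.0], "gara": [3.0, 1.0]}
toy_features = average_embedding(["tren", "la", "gara"], toy_embeddings, dim=2)
```

The resulting vector is what the SVM receives as input for one post.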

3.5 TF-IDF + Random Forest

Our second shallow classification approach is based on the Term Frequency-Inverse Document Frequency (TF-IDF) representation and a Random Forest (RF) classifier. As for the previous method, we start by cleaning and tokenizing the text using NLTK’s word tokenizer. Subsequently, we employ a TF-IDF vectorizer to quantify the importance of words within the corpus, generating numerical features for each document. These features are then used to train a Random Forest classifier.
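To make the TF-IDF representation concrete, the sketch below computes it from scratch for a toy corpus. It assumes a common smoothed IDF variant, $\ln((1+n)/(1+df))+1$, omits vector normalization, and filters the vocabulary by a minimum document count and a maximum document fraction. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency is >= min_df (count)
    and <= max_df (fraction of documents)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: c for t, c in df.items() if c >= min_df and c / n_docs <= max_df}


def tfidf_vector(doc, vocabulary, n_docs):
    """Sparse TF-IDF weights for one tokenized document."""
    tf = Counter(t for t in doc if t in vocabulary)
    return {
        t: n * (math.log((1 + n_docs) / (1 + vocabulary[t])) + 1.0)
        for t, n in tf.items()
    }


# Hypothetical tokenized corpus, not taken from PoPreRo.
toy_docs = [
    ["tren", "gara", "tren"],
    ["gara", "aeroport"],
    ["gara", "protest"],
    ["tren", "protest"],
]
toy_vocab = select_vocabulary(toy_docs, min_df=2, max_df=0.7)
toy_vector = tfidf_vector(toy_docs[0], toy_vocab, len(toy_docs))
```

In this toy corpus, "gara" is dropped for being too frequent and "aeroport" for being too rare, which mirrors the role of the minimum and maximum document frequency thresholds.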

3.6 Few-Shot LLM Prompting

To explore the feasibility of large language models (LLMs) for post popularity prediction in PoPreRo, we employ a prompt-based approach utilizing the 7-billion parameter Falcon LLM [1] (Falcon-7B). Due to computational limitations, we prompt the LLM with contexts comprising two unpopular and two popular examples. Subsequently, we attach an individual test sample to each prompt and ask the LLM to predict the corresponding label. Below, we illustrate the structure of our prompt via a concrete example:

PROMPT (Original): Text: ’Nu vreau sa mai traiesc pe aceasta planeta !’ Label: ’Popular’. Text: ’Unde pot verifica compoziția unui produs?. Să testez de exemplu dacă ingredientele unui produs sunt într-adevăr acelea. Sau dacă niște tablete de vitamine chiar conțin vitamine. În ce proporții? Sau câtă vitamina A conține un morcov - unde pot verifica asta? Ceva laboratoare?’ Label: ’Unpopular’. Text: ’Azi a venit mitropolitul ardealului la noi la liceu să ne convingă să facem religie. Primul lucru care mi-a venit în cap când am văzut ce mașină și-a parcat în curtea instituții..’ Label: ’Popular’. Text: ’Daca intereseaza pe cineva, sa stiti ca e reddit si in romana’ Label: ’Unpopular’. Text: ’Am prins niste fulgere faine zilele trecute’ Label:

PROMPT (Translated): Text: ’I don’t want to live on this planet anymore!’ Label: ’Popular’. Text: ’Where can I check the composition of a product?. To test for example whether the ingredients of a product are indeed those. Or if some vitamin tablets actually contain vitamins. In what proportions? Or how much vitamin A contains a carrot - where can I check this? Some laboratories?’ Label: ’Unpopular’. Text: ’Today the metropolitan of Transylvania came to us at high school to convince us to do religion. First thing that came to mind when I saw what car he has parked in the courtyard of institutions..’ Label: ’Popular’. Text: ’If anyone is interested, there’s reddit in Romanian’ Label: ’Unpopular’. Text: ’I caught some fine lightning the other day’ Label:
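The prompt structure illustrated above can be assembled programmatically. The sketch below is a hypothetical helper that follows the "Text: … Label: …" pattern of the example; for brevity, it uses only two of the four in-context examples, and it does not cover how the prompt is passed to Falcon-7B or how the completion is parsed.

```python
def build_prompt(examples, query):
    """Concatenate labeled examples and leave the label of `query` open."""
    parts = [f"Text: ’{text}’ Label: ’{label}’." for text, label in examples]
    parts.append(f"Text: ’{query}’ Label:")
    return " ".join(parts)


# Two of the in-context examples from the prompt above.
toy_examples = [
    ("Nu vreau sa mai traiesc pe aceasta planeta !", "Popular"),
    ("Daca intereseaza pe cineva, sa stiti ca e reddit si in romana", "Unpopular"),
]
toy_prompt = build_prompt(toy_examples, "Am prins niste fulgere faine zilele trecute")
```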

4 Experiments

4.1 Evaluation

Our binary classification experiments focus on predicting the popularity of text within the PoPreRo dataset. Each text sample is categorized as either popular or unpopular. To evaluate the performance of our models, we employ several metrics. For each class, we calculate precision (proportion of true positives among the identified positives) and recall (proportion of true positives with respect to all positives). Additionally, we aggregate these scores using macro $F_{1}$ and micro $F_{1}$ (accuracy) measures.
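For reference, the metrics can be computed as in the sketch below. It is a minimal plain-Python illustration with hypothetical function names and toy predictions, where label 0 is assumed to denote the unpopular class and label 1 the popular class.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred, classes=(0, 1)):
    """Per-class scores, macro F1 (unweighted class mean) and accuracy."""
    scores = {c: per_class_scores(y_true, y_pred, c) for c in classes}
    macro_f1 = sum(s[2] for s in scores.values()) / len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return scores, macro_f1, accuracy


# Hypothetical predictions from a classifier biased towards class 0.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0]
toy_scores, toy_macro_f1, toy_accuracy = evaluate(toy_true, toy_pred)
```

In this toy case, the macro $F_{1}$ score (0.625) is lower than the accuracy (about 0.667), because the macro average weighs both classes equally and penalizes the low recall on class 1.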

4.2 Hyperparameter Tuning

The hyperparameters of all models are determined via grid search. For the transformer-based methods (Ro-BERT, Ro-GPT2), we employ a grid search over the maximum number of input tokens in the set $\{50,70,100,120,150,200\}$ , as well as the learning rate in the set $\{10^{-5},5\cdot 10^{-5},10^{-6},5\cdot 10^{-6},10^{-7},2\cdot 10^{-7},5\cdot 10^{-7},10^{-8},5\cdot 10^{-8}\}$ and the value of $\epsilon$ for AdamW in the set $\{10^{-6},10^{-7},10^{-8}\}$ .
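The grid for the transformer-based methods can be enumerated as in the sketch below. The grid values are those listed above, while the scoring function is a hypothetical placeholder: in practice, it would fine-tune the model with the given configuration and return its validation score.

```python
from itertools import product

MAX_TOKENS = [50, 70, 100, 120, 150, 200]
LEARNING_RATES = [1e-5, 5e-5, 1e-6, 5e-6, 1e-7, 2e-7, 5e-7, 1e-8, 5e-8]
EPSILONS = [1e-6, 1e-7, 1e-8]


def grid():
    """Yield every combination of the three hyperparameters."""
    for max_tokens, lr, eps in product(MAX_TOKENS, LEARNING_RATES, EPSILONS):
        yield {"max_tokens": max_tokens, "lr": lr, "eps": eps}


def grid_search(validation_score):
    """Return the configuration maximizing a caller-supplied scoring function."""
    return max(grid(), key=validation_score)


def toy_score(config):
    """Hypothetical stand-in for 'fine-tune and evaluate on validation'."""
    return -abs(config["max_tokens"] - 120) - abs(config["lr"] - 5e-7)


configs = list(grid())
best = grid_search(toy_score)
```

The full grid contains $6\cdot 9\cdot 3=162$ configurations per model.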

For the FastText + SVM approach, we vary the FastText word-embeddings dimension ( $\{150,200,300,350\}$ ), the window size for the input ( $\{2,3,4\}$ ), as well as the kernel (linear or RBF) and the parameter $C$ ( $\{0.1,1,10,100,1000\}$ ) of the SVM classifier. Similarly, for the Ro-BERT + Logistic Regression approach, we run a search over the maximum numbers of Ro-BERT input tokens in the same set as before ( $\{50,70,100,120,150,200\}$ ) and test different penalty term values (‘l1’, ‘l2’, ‘elastic net’ or ‘None’) for the classifier.

Lastly, for the TF-IDF + Random Forest method, we vary the minimum ( $\{4,5,6\}$ ) and maximum ( $\{0.6,0.7,0.8\}$ , in percentages) document frequency of the TF-IDF Vectorizer, together with the number of decision trees in the set $\{50,100,150,200\}$ for the Random Forest classifier.

All other hyperparameters are set to their default values. Please note that we release the code to reproduce all baselines, along with the PoPreRo dataset, at https://github.com/ana-rogoz/PoPreRo.

4.3 Results

We present the results of our baselines on the PoPreRo validation and test sets in Table 3. We find that the fine-tuned Ro-GPT2 exhibits the best test performance, being the only model with an accuracy (micro $F_{1}$ ) and a macro $F_{1}$ score above $0.6$ on both validation and test sets. In contrast, the other trained baselines perform similarly well on the validation set, but reach worse performance on the test set.

Table 3: Validation and test results of the six baselines. The random chance baseline is added as reference. There is no hyperparameter tuning for Falcon-7B LLM, so the model is directly applied on the test set (using in-context learning). The best score on each subset and for each metric is highlighted in bold.

Set	Method	Acc.	Macro $F_{1}$	Unpopular Prec.	Unpopular Rec.	Popular Prec.	Popular Rec.
Validation	Random chance	0.4998	0.4999	0.4988	0.5011	0.5011	0.4988
	Fine-tuned Ro-GPT2	0.6525	0.6397	0.6157	0.8097	0.7351	0.4986
	Fine-tuned Ro-BERT	0.6343	0.6278	0.6189	0.6995	0.6411	0.5562
	FastText + SVM	0.6677	0.6624	0.6348	0.7920	0.7225	0.5431
	TF-IDF + RF	0.6535	0.6395	0.6107	0.8497	0.7519	0.4568
	Ro-BERT + LR	0.6824	0.6721	0.6354	0.8582	0.7807	0.5061
Test	Random chance	0.4998	0.4999	0.5010	0.4989	0.4989	0.5010
	Fine-tuned Ro-GPT2	0.6135	0.6060	0.6146	0.6331	0.6145	0.5933
	Fine-tuned Ro-BERT	0.5605	0.5489	0.5505	0.6611	0.5767	0.4565
	FastText + SVM	0.5644	0.5637	0.5718	0.5208	0.5583	0.6083
	TF-IDF + RF	0.5759	0.5729	0.5661	0.6584	0.5897	0.4931
	Ro-BERT + LR	0.5998	0.5973	0.5873	0.6771	0.6169	0.5221
	Few-shot prompted Falcon-7B	0.4143	0.4126	0.4143	0.7904	0.5537	0.1887

Evaluating the two state-of-the-art transformer models, Ro-GPT2 and Ro-BERT, reveals some interesting findings. While both achieve comparable accuracy on the validation set ( $0.6525$ for Ro-GPT2 and $0.6343$ for Ro-BERT), Ro-GPT2 clearly outperforms Ro-BERT on the test set, indicating the superior ability of the former model to generalize to unseen data. Analyzing the precision-recall trade-off in Table 3, we observe a shared propensity for both models to exhibit higher recall for the “unpopular” category, together with a precision that is comparable or higher for the “popular” category.

Table 4: Examples of relevant terms for popular posts, learned by the fine-tuned Ro-BERT and SVM models.

Model	Topic	Example	Translation
Ro-BERT	Call to action	“pentru cei care vor să se implice activ”	“for those who want to be actively involved”
		“ar fi interesati de un voluntariat”	“would be interested in volunteering”
	News	“încep săpăturile la metrou”	“excavations begin at the subway”
		“un nou residence la “doar 20 de minute” de Centru”	“a new residence building “only 20 minutes” from the center”
	Events	“Seara de film la Casa Tineretului”	“Movie night at the Youth House”
SVM	News	“mic protest la primaria capitalei”	“small protest at Bucharest City Hall”
	Local transport	“am vazut ca este tren de la gara de nord la aeroport aproape la fiecare ora”	“I saw that there is a train from Gara de Nord to the airport almost every hour”

The FastText + SVM, TF-IDF + RF and Ro-BERT + LR models achieve comparable performance. All three models obtain accuracy rates higher than $65\%$ on the validation set, which drop below $60\%$ on the test set. In terms of precision and recall, almost all of them achieve higher precision for the “popular” category on both validation and test sets, with one exception being the FastText + SVM method on the test set, where the precision on the two classes is comparable. According to Table 3, the same method is also distinctive in terms of recall: FastText + SVM obtains a higher recall for the “popular” category on the test set, while TF-IDF + RF and Ro-BERT + LR attain a higher recall for the “unpopular” category on both sets.

Table 3 also shows the results on the test set of our few-shot prompted LLM. While this approach exhibits a bias similar to our other baselines, favoring recall for unpopular predictions and precision for popular ones, its overall performance falls below that of a random chance classifier. This suggests a limitation in the generalization capacity of LLMs to the popularity prediction task, particularly for languages with limited online resources, such as Romanian.

4.4 Discriminative Feature Analysis

We analyze the discriminative features learned by the fine-tuned Ro-BERT and by the FastText + SVM. The motivation behind this analysis is to validate that the decisions of these models are not based on some biases that escaped our data collection, but on actual data understanding.

For the Ro-BERT model, we use the Captum [14] library via its Layer Integrated Gradients method to infer valuable insights from the fine-tuned model. This technique delves into the BERT embeddings layer, attributing importance scores to individual input words which led to the final label prediction.

To find the words with higher influence on the decisions given by the SVM, we consider the cosine similarities between the primal weights of the SVM and the FastText embedding of each word. We sort the words based on the similarity values, and keep the first 10 and last 10 words from the sorted list as features for the positive (“popular”) and negative (“unpopular”) classes, respectively.
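This ranking procedure can be sketched as follows. The sketch assumes that the primal SVM weights and the word embeddings are available as plain lists of floats; the function names and the 2-dimensional toy vectors are hypothetical and do not correspond to the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_words(svm_weights, embeddings, k):
    """Sort words by similarity to the SVM weights; return top-k and bottom-k."""
    ranked = sorted(
        embeddings,
        key=lambda word: cosine(svm_weights, embeddings[word]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]


# Hypothetical weights and word vectors, for illustration only.
toy_weights = [1.0, -1.0]
toy_embeddings = {
    "online": [2.0, -2.0],
    "cazul": [1.0, 0.0],
    "un": [0.0, 1.0],
    "eu": [-1.0, 1.0],
}
toy_popular, toy_unpopular = rank_words(toy_weights, toy_embeddings, k=1)
```

Words at the top of the ranking point in the direction of the positive class, while words at the bottom point towards the negative class.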

Table 5: Examples of relevant terms for unpopular posts, learned by the fine-tuned Ro-BERT and SVM models.

Model	Topic	Example	Translation
Ro-BERT	Proper names	“Palatul Roznovanu”	“Roznovanu palace”
		“Ceauşescu”	“Ceauşescu”
		“în Timişoara”	“in Timişoara”
	Seeking advice	“terenuri ok de baschet în…”	“ok basketball courts in…”
		“print shop pentru poze mari în …”	“print shop for big pictures in …”
	Mundane problems	“Se închide circulaţia”	“traffic is closed”
		“construim blocuri între case”	“we are building apartment blocks between houses”
SVM	City names	“bucuresti”	“bucharest”
	Seeking advice	“cunoasteti un loc de facut tatuaj temporar personalizat”	“do you know a place to do custom temporary tattoo”
	Opinion sharing	“lumea ca se plange de targul de craciun de anul acesta”	“people complain about this year’s Christmas market”

In Tables 4 and 5, we present a few examples of interesting patterns that were picked up by the models. In predicting post popularity, the Ro-BERT model demonstrates a bias toward content reflecting current trends, including news and events, and posts encouraging community engagement through calls to action. Conversely, references to proper nouns, such as city names or historical landmarks, appear to hinder popularity, as do posts seeking community advice or expressing dissatisfaction with mundane problems. Similar to Ro-BERT, we find that the SVM labels posts that share news as popular, and posts by people seeking advice as unpopular.

Table 6: Examples of the most discriminative words for the popular and unpopular classes, selected according to the weights learned by the SVM model based on FastText features.

Label	Token	Weight
popular	online	5.974352
	dupa	4.821379
	youtube	4.121604
	asa	4.08882
	cazul	3.839789
unpopular	toate	-4.089375
	un	-4.190036
	google	-4.31336
	nia	-4.339616
	eu	-4.72841

Furthermore, we extend the feature analysis for the SVM in order to determine the most discriminative words for the popular and unpopular classes. To achieve this, we determine the discriminative weight of each word based on the cosine similarity between the respective word embedding and the SVM weights. We sort the words according to their weights, and select the ones with the highest and lowest weights. In Table 6, we provide the five most discriminative words for the popular and unpopular classes, according to the SVM based on FastText features. We observe that posts mentioning “online” or “youtube” are more popular, likely because readers appreciate posts that provide links to YouTube videos. We also note the preference for posts that discuss particular cases or experiences, which are usually introduced by the word “cazul” (translated to “the case” in English). On the other hand, posts that recommend searching on “google” are unpopular, likely because readers consider such suggestions unhelpful. Moreover, discussing subjective perspectives, using the first-person singular pronoun “eu” (“I”), is also unpopular, likely because readers appreciate more objective posts.

5 Conclusion

In this paper, we introduced PoPreRo, the first publicly available dataset of Romanian Reddit posts dedicated to the task of popularity prediction. We collected 28,107 posts from five diverse Romanian subreddits, amounting to over 1 million tokens. Aiming to predict binary labels derived from the sum of upvotes and downvotes for each post, we explored five trained popularity detection methods, as well as a few-shot prompted LLM, and presented comparative results. We found that the fine-tuned Ro-GPT2 generalizes best, reaching an accuracy of $61.35\%$ and a macro $F_{1}$ score of $60.60\%$ on the test set.

Building upon our foundation, future research can further study popularity detection algorithms and delve deeper into the factors driving engagement on Romanian Reddit.

6 Limitations

It is crucial to acknowledge that Reddit’s user base in Romania might not be representative of the wider population. While Reddit offers a valuable platform for research due to its diverse communities and open discussions, its user base in Romania is comparatively smaller than that of other social media platforms, such as Facebook, Instagram, or YouTube. Furthermore, Reddit’s API restricts data access, limiting historical data collection and imposing retrieval caps.

7 Ethics Statement

The data was collected from a publicly available Reddit archive, selecting five Romanian subreddits. The social media posts are freely accessible to the public without any type of subscription. As the data was collected from an archived public website (Reddit), we adhere to the European regulations (https://eur-lex.europa.eu/eli/dir/2019/790/oj) that allow researchers to use data in the public web domain for non-commercial research purposes. We thus release our corpus as open-source under a non-commercial share-alike license agreement, namely CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/).

We acknowledge that some posts could refer to certain people, e.g. public figures in Romania. Following GDPR regulations, we will remove all references to a person, upon receiving removal requests via an email to any of the authors.

References

[1] Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, É., Hesslow, D., Launay, J., Malartic, Q., et al.: The Falcon Series of Open Language Models. arXiv preprint arXiv:2311.16867 (2023)
[2] Barnes, K., Riesenmy, T., Trinh, M.D., Lleshi, E., Balogh, N., Molontay, R.: Dank or not? Analyzing and predicting the popularity of memes on Reddit. Applied Network Science 6(1), 21 (Mar 2021)
[3] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
[4] Carta, S., Podda, A.S., Recupero, D.R., Saia, R., Usai, G.: Popularity Prediction of Instagram Posts. Information 11(9), 453 (2020)
[5] De, S., Maity, A., Goel, V., Shitole, S., Bhattacharya, A.: Predicting the popularity of Instagram posts for a lifestyle magazine using deep learning. In: Proceedings of CSCITA. pp. 174–177 (2017)
[6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL. pp. 4171–4186 (2019)
[7] Dumitrescu, S.D., Avram, A.M., Pyysalo, S.: The birth of Romanian BERT. In: Findings of EMNLP. pp. 4324–4328 (2020)
[8] Fang, Z., Yu, M., Fu, Z., Zhang, B., Huang, X., Tang, X., Yang, Y.: How to generate popular post headlines on social media? AI Open 5, 1–9 (2024)
[9] Ferrer, X., van Nuenen, T., Such, J.M., Criado, N.: Discovering and Categorising Language Biases in Reddit. In: Proceedings of ICWSM. pp. 140–151 (2021)
[10] Gjurković, M., Šnajder, J.: Reddit: A gold mine for personality prediction. In: Proceedings of PEOPLES. pp. 87–97 (2018)
[11] Hada, R., Sudhir, S., Mishra, P., Yannakoudakis, H., Mohammad, S.M., Shutova, E.: Ruddit: Norms of Offensiveness for English Reddit Comments. In: Proceedings of ACL. pp. 2700–2717 (2022)
[12] Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
[13] Kim, J.: Predicting the Popularity of Reddit Posts with AI. arXiv preprint arXiv:2106.07380 (2021)
[14] Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, A., Kliushkina, N., Araya, C., Yan, S., Reblitz-Richardson, O.: Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896 (2020)
[15] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of ICLR (2019)
[16] Ma, Z., Sun, A., Cong, G.: On predicting the popularity of newly emerging hashtags in twitter. Journal of the American Society for Information Science and Technology 64, 1399–1410 (2013)
[17] Mahdavi, M., Asadpour, M., Ghavami, S.: A comprehensive analysis of tweet content and its impact on popularity. In: Proceedings of IST. pp. 559–564 (2016)
[18] McHardy, R., Adel, H., Klinger, R.: Adversarial Training for Satire Detection: Controlling for Confounding Variables. In: Proceedings of NAACL. pp. 660–665 (2019)
[19] Niculescu, M.A., Ruseti, S., Dascalu, M.: RoGPT2: Romanian GPT2 for Text Generation. In: Proceedings of ICTAI. pp. 1154–1161 (2021)
[20] Poecze, F., Ebster, C., Strauss, C.: Social media metrics and sentiment analysis to evaluate the effectiveness of social media posts. In: Proceedings of ANT-SEIT. pp. 660–666 (2018)
[21] Purba, K.R., Asirvatham, D., Murugesan, R.K.: Instagram post popularity trend analysis and prediction using hashtag, image assessment, and user history features. The International Arab Journal of Information Technology 18(1), 85–94 (2021)
[22] Shen, J.H., Rudzicz, F.: Detecting anxiety through Reddit. In: Proceedings of CLPsych. pp. 58–65 (Aug 2017)
[23] Tadesse, M.M., Lin, H., Xu, B., Yang, L.: Detection of Depression-Related Posts in Reddit Social Media Forum. IEEE Access 7, 44883–44893 (2019)
[24] Turcan, E., McKeown, K.: Dreaddit: A Reddit Dataset for Stress Analysis in Social Media. In: Proceedings of LOUHI. pp. 97–107 (2019)
[25] Wang, C., Xiao, Z., Liu, Y., Xu, Y., Zhou, A., Zhang, K.: SentiView: Sentiment Analysis and Visualization for Internet Popular Topics. IEEE Transactions on Human-Machine Systems 43(6), 620–630 (2013)
[26] Zhang, Z., Chen, T., Zhou, Z., Li, J., Luo, J.: How to become instagram famous: Post popularity prediction with dual-attention. arXiv preprint arXiv:1809.09314 (2019)
[27] Zhao, Q., Erdogdu, M.A., He, H.Y., Rajaraman, A., Leskovec, J.: SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity. In: Proceedings of KDD. pp. 1513–1522 (2015)
[28] Zhu, Y., ul Haq, E., Lee, L.H., Tyson, G., Hui, P.: A Reddit Dataset for the Russo-Ukrainian Conflict in 2022. arXiv preprint arXiv:2206.05107 (2022)