PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

AC Rogoz, MI Nechita, RT Ionescu. arXiv preprint arXiv:2407.04541, 2024.
We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct Romanian subreddits, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. The top-scoring model achieves an accuracy of only 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting of the Falcon-7B Large Language Model point in the same direction. We thus believe that PoPreRo is a valuable resource for evaluating models that predict the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.
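
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and macro F1 are typically computed. It is not the authors' evaluation code: the function names and the toy labels `y_true` and `y_pred` are hypothetical and are not drawn from PoPreRo.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (illustration only, not PoPreRo data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro F1 averages the per-class scores without weighting by class frequency, it penalizes a model that performs well only on the majority class, which is why it is often reported alongside accuracy.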