Our study collects data to characterize ideological and extreme bias in text articles shared across online communities, focusing in particular on the language used in subreddits associated with extremism and targeted violence. We first gathered data from two ideologically aligned communities, r/Liberal and r/Conservative, using the Reddit Pushshift API to collect the URLs shared within these subreddits. Targeting news, opinion, and feature articles, this yielded a corpus of 226,010 articles, from which we curated a balanced subset of 45,108 articles and annotated 4,000 articles to validate their relevance, supporting analysis of language use within ideological Reddit communities and of ideological bias in media content. Moving beyond binary ideologies, we introduced a third category, "Restricted," covering articles shared in restricted, privatized, quarantined, or banned subreddits characterized by radicalized and extremist ideologies; this expansion yielded a dataset of 377,144 articles. We additionally included articles from subreddits with unspecified ideologies, forming a holdout set of 922,522 articles. The resulting combined dataset of 1.3 million articles, collected from 55 subreddits, supports the examination of radicalized communities and discourse analysis of the associated subreddits, improving understanding of the language used in articles shared within radicalized Reddit communities and offering insight into extreme bias in media content. In total, we collected 1.52 million articles, providing a comprehensive dataset for studying language use in text articles posted to ideological and extreme Reddit communities.
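For readers interested in the collection step, the sketch below illustrates how outbound URLs shared in a subreddit could be paged out of the historical Pushshift submission-search endpoint. It is a minimal illustration under stated assumptions, not the authors' pipeline: the endpoint, parameters, and the helper name `collect_urls` reflect Pushshift's commonly documented interface rather than anything specified in this abstract, and public access to the service has changed since these data were gathered.

```python
import time
import requests

# Historical public Pushshift endpoint (availability has since changed).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def collect_urls(subreddit, before=None, max_batches=10, pause=1.0):
    """Page backwards through a subreddit's submissions and keep external article URLs."""
    urls = []
    for _ in range(max_batches):
        params = {"subreddit": subreddit, "size": 100, "sort": "desc"}
        if before is not None:
            params["before"] = before          # paginate by timestamp
        resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        for post in batch:
            # Skip self-posts and internal Reddit links; keep outbound article URLs.
            url = post.get("url", "")
            if not post.get("is_self") and url and "reddit.com" not in url:
                urls.append(url)
        before = batch[-1]["created_utc"]      # oldest timestamp in this batch
        time.sleep(pause)                      # stay polite to the API
    return urls

if __name__ == "__main__":
    # Example: gather external links shared in r/Liberal and r/Conservative.
    for sub in ("Liberal", "Conservative"):
        links = collect_urls(sub, max_batches=2)
        print(sub, len(links))
```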
Keywords: Context modeling; Machine learning; Natural language processing; News media consumption; Reddit; Social networking (online); Text classification; Topic modeling.
© 2024 The Author(s).